class: center, middle, inverse, title-slide .title[ # Tools for easy data visualization and exploration ] .author[ ### Julia Romanowska ] .institute[ ### BIOS, IGS, UiB ] .date[ ### October 25, 2022 ] --- class: inverse, middle, left # Outline ## Data exploration ## Visualization basics ## Advanced plotting with {ggplot2} ??? Visualization of data - whether it's results of statistical analyses or just some exemplary data points - is so much more than just the actual figure. The tools for creating a visualization are the same, but the purpose might be very different: - we should create lots of visualizations when exploring the data - we will often plot analyses results - these might be easier to interact with than tables - we should _carefully design_ final plots, whether for a presentation or paper. During the workshop, we will talk about these different aspects of data visualization. I will also show examples of tools that one can use, both very simple and advanced. --- class: inverse, middle, center # <img src="figs/noun-telescope-white.png" tag="Telescope by anggun from NounProject.com" alt="Telescope icon by anggun from NounProject.com" style="height: 60px"> Data exploration ??? When we get new data in our hands, first thing we do is to explore. For that, summarizing data is very useful, but sometimes we need also to visualize data to be able to understand it or look at outliers. But most importantly, we need to check whether the data is in correct format and contains logical variables. We call this usable data format... --- ## Tidy data ??? ... tidy data. At first, you might think that tidy data is the one where all the missing or unreliable datapoints are removed. But no! -- **is not:** - data with missing values removed - data with unreliable measurements removed - data that looks nice in Excel ??? So what is it? -- **is:** OBSERVATIONS = ROWS VARIABLES = COLUMNS ??? This is a format of data where each observation is in a row and each variable is in a column. Let's look at some examples. --- ## Tidy data **Examples** | | treatmenta | treatmentb | |:------------|:-----------:|:-----------:| | John Smith | --- | 2 | | Jane Doe | 16 | 11 | | Mary Johnson | 3 | 1 | <div style="text-align: center; margin-bottom: 20px; margin-top: 20px;"> <span style="font-style: italic;">Table 1:</span> Typical presentation dataset. </div> | | John Smith | Jane Doe | Mary Johnson | |:------------|:-----------:|:-----------:|:-------------:| | treatmenta | - | 16 | 3 | | treatmentb | 2 | 11 | 1 | <div style="text-align: center; margin-bottom: 20px; margin-top: 20px;"> <span style="font-style: italic;">Table 2:</span> The same data as in Table 1 but structured differently. </div> <p style="font-size: small; font-style: italic; position: absolute; bottom: 40px;"> src: Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10). </p> ??? Why these datasets are not tidy? The values under `treatmenta` and `treatmentb` are measuring the same thing, but perhaps at a different time, so these are not two variables. What would happen if we started analysing this and suddenly, we will get more data where there are now treatments a, b, and c? We would have to re-write the analysis scripts because we have had only a and b until now. In Table 2 - how do we approach analysis here? We have (potentially) very many columns with persons - should we analyse each separately by hand? --- ## Tidy data **Examples - ctd** | person | treatment | result | |:--------|:---------:|:------:| | John Smith | a | — | | Jane Doe | a | 16 | | Mary Johnson | a | 3 | | John Smith | b | 2 | | Jane Doe | b | 11 | | Mary Johnson | b | 1 | <div style="text-align: center; margin-bottom: 20px; margin-top: 20px;"> <span style="font-style: italic;">Table 3:</span> The same data as in Table 1 but with variables in columns and observations in rows. </div> ??? With tidy data, we can create a script that gathers all the treatment per person and e.g., compares the results. This script will be unchanged if there come more persons to our study or if the treatment changes. --- ## Tidy data <img src="figs/tidydata_2.jpg" style="width: 80%;"> --- ## Tidy data <img src="figs/tidydata_5.jpg" style="width: 80%;"> ??? If data is tidy, it's easier to automate the exploration and analysis. --- ## Automatic summary - depends on the data type _(character, factor, numeric, date/time)_ - duplicated measurements - missing data - correlated variables --- ## Automatic summary - tools: - `{skimr}` *basic summary of each variable,* - `{DataExplorer}` *automatic data exploration, with plots, correlations between variables, etc.* - `{dataReporter}` *automatic report, detailed, focus on potential problems in data* --- class: inverse, center, middle #
DEMO --- class: inverse, middle, center #
DataViz design --- # Warmup
−
+
03
:
00
> Grab a piece of paper and a pencil and draw this data! |DATE | CAPACITY| DEMAND| |:-------|--------:|------:| |2019-04 | 29263| 46193| |2019-05 | 28037| 49131| |2019-06 | 21596| 50124| |2019-07 | 25895| 48850| |2019-08 | 25813| 47602| |2019-09 | 22427| 43697| |2019-10 | 23605| 41058| |2019-11 | 24263| 37364| |2019-12 | 24243| 34364| ??? Questions - was it easy to imagine the data? - how did you start drawing? - did you change your point of view of the data _after_ you've sketched it? - what questions arised when you saw the data plotted? --- ## Data viz _design_ ??? It's not enough to _show_ your data - you need to carefuly pick the datapoints you want to focus on, think about the fonts, the axes, the orientation of the plot, the colors(!), and title/captions. All this process is called _designing_ your visual. And as with any other design or planning, it all starts with... -- -
pen &
paper ??? When you grab a blank sheet of paper, you're getting free of all the limitations that a tool of your choice can give you. And of the frustrations as well, if you're only beginning to learn the tool! With a pen and paper, you may focus on what's important for the presentation of the data... -- -
what the data tells me? ??? Normally, we would plot data when we want to explore some variables or when we have results of an analysis. One thing is known: we don't know how the data look like. We may know how it _should_ look like, though. That's why we may start designing with one type of plot in mind, but in the end the plot might look completely different! That's why it's important to draw by hand some sketches and think about the question here. When we finally have plotted something and understood the message behind the data, we need to improve the visual. -- name: declutter - **declutter** + **explain** - _([example](#plot_examples))_ ??? - do we really need all the points? - what about the axis ticks? and axis labels? - can we include a take-home message in title/caption? - is the legend in right place? - have we used **color** wisely? (I will talk about the colors next) -- <br> > Who's my target audience? ??? This entire process needs to be repeated when we present our plot for different groups of people! This process is very important and one could talk about this for hours. I've just mentioned important points, and gathered some references for you to read if you're interested. --- ## Plot type Plethora of plot types! - good short info about most used types: https://www.storytellingwithdata.com/chart-guide - all, including more exotic types: https://datavizproject.com/ -- Different type, based on: - continuous vs. categorical data - all points vs. averaged/smoothed - categories, time series, etc. -- name: plot_types Some tips: - [lasagne plot](#spaghetti) _if you really need to plot **all** the data points_ - [circles and bars](#area) _size matters!_ ??? But the most important tip you get from me on choosing the plot type is... --- class: inverse, right, middle ##
DON'T USE PIE CHART!
### too high ink:information ratio ### area is difficult to grasp ### bars or points are _much better!_ ??? Pie chart is OK when showing two-three categories - this requires then much "ink" and gives relatively little information. Although nowadays, nobody cares much about "ink", the low ink:information ratio is still considered as one of the markers of good quality in visual design. Human eye is not able to assess correctly small changes in area! Bars or points are much easier to create and much easier to read! --- class: inverse, right, middle #
DON'T EVER USE 3D!
--- ## Choosing medium and audience - popular science *vs.* scientific article - manuscript *vs.* presentation - font size, level of details <img src="figs/selcuk_corruption.png" style="width:30%;"> <img src="figs/economist_graph.png" style="width:45%;"> src: *Akcay, S. (2006). Corruption and human development. Cato Journal 26(1), 29-48*; https://www.economist.com/graphic-detail/2011/12/02/corrosive-corruption --- ## Choosing medium and audience - popular science *vs.* scientific article - manuscript *vs.* presentation - font size, level of details <img src="figs/Rougier_PLoSCompBiol2014_fig3.png" style="width: 70%;"> src: Rougier et al. PLoS Comput. Biol. 10, e1003833 (2014) --- ## Choosing color - **accessibility!** - good tool for choosing palette: [ColorBrewer](https://colorbrewer2.org/) by Cynthia Brewer - recommended paper: [Crameri, F., Shephard, G. E., & Heron, P. J. *(2020)*. The misuse of colour in science communication. *Nature Communications*, 11(1), 1–10.](https://www.nature.com/articles/s41467-020-19160-7) >
wrong color palette can _distort_ the data and > _change_ the meaning of your visual ??? Colors are important - they create meaning, grab viewer's attention, show differences or similarities, and when used wrongly can distort the information. We want to choose the colors _wisely_, so that everyone can get the information we wanted to show with ease. **Everyone** = we need to consider those who have problems with seeing all colors or problems with seeing at all. **With ease** = watch out for colors that are associated with a specific meaning. What to consider when choosing colors: - are there colors that _naturally_ come to mind when thinking about the information we want to display? (e.g., hot-cold areas, bad-good treatments, seasonal change in measurements) - check here: https://informationisbeautiful.net/visualizations/colours-in-cultures/ - blog post: https://www.storytellingwithdata.com/blog/2021/6/8/colors-and-emotions-in-data-visualization - do I need colors or is it enough with gray-shades? - is the color scale continuous or categorical? - _in case of continuous color scale:_ is it the best way to present a change or would it be enough to show, e.g., size or bars along a timeline? - _when dealing with very many categories:_ can a handfull of them be colored, leaving the rest in gray, for contrast? --- class: inverse, left, bottom ## SUMMARY ### know your data ??? So what we've learned til now: - spend time on exploring the data and check what the data tells you -- ### know your audience ??? - when critiquing the plot, have in mind who will look at it and under what circumstances! - you can't show the same plot in a scientific journal article, and during a presentation for students! -- ### _design_ your visual ??? - use time to design the visual - don't just show all the data you have, but choose carefully what to show and how so that the message is clear --- ##
Automatic plotting - [{esquisse} package](https://dreamrs.github.io/esquisse/index.html) - [{finalfit} package](https://finalfit.org/articles/finalfit.html) - _many others, depending on analysis type_ --- class: inverse, center, middle #
DEMO --- class: inverse, right, middle ##
Not automatic plotting with {ggplot2} --- ### A plot in {ggplot2} consists of several layers: -- 1. data 2. aesthetics (`aes`) ([original documentation](https://ggplot2.tidyverse.org/reference/index.html#section-aesthetics)) 3. `geom`s ([original documentation](https://ggplot2.tidyverse.org/reference/index.html#section-geoms)) 4. `scale`s ([original documentation](https://ggplot2.tidyverse.org/reference/index.html#section-scales)) 5. `theme` ([original documentation](https://ggplot2.tidyverse.org/reference/index.html#section-themes)) ??? 1. data - strict format: tidy data! - one row per datapoint - all grouping must be included in the data 2. aesthetics (`aes`) ([original documentation](https://ggplot2.tidyverse.org/reference/index.html#section-aesthetics)) - how to map the data to the graph? - which column is *x*, *y*, ... - which column provides grouping (based on the grouping, one can either connect points into lines, split a graph into several facets, or color differently each group) 3. `geom`s ([original documentation](https://ggplot2.tidyverse.org/reference/index.html#section-geoms)) - how to visualize the data? - points (`geom_point`), lines (`geom_line`), bars (`geom_bar`), etc. - `geom`s are connected to `stat`s that conduct any necessary pre-processing of data (e.g., `geom_histogram` would first calculate the number of observations in each bin through `stat_bin`) 4. `scale`s ([original documentation](https://ggplot2.tidyverse.org/reference/index.html#section-scales)) - any type of data representation on the plot - coordinates: `scale_x_continuous`, `scale_x_discrete`, `scale_x_date`, etc. - colors: `scale_color_manual`, `scale_colour_brewer`, `scale_fill_continuous`, etc. 5. `theme` ([original documentation](https://ggplot2.tidyverse.org/reference/index.html#section-themes)) - visual aspects - size and types of fonts - positioning of axes, legends, etc. Each layer is added to the previous one with a `+` sign. The result can be saved to an object and added upon later. --- class: inverse, right, middle ## Example plot with {ggplot2} --- ## Data <div style="float: right; position: fixed; right: 10px; top: 10px;"> <img style="float: right; position: relative; width: 120px" src="figs/palmer_penguins_logo.png" alt="palmer penguins R package logo"/><br> </div> [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/) <br><br><br> ``` # A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 2 more variables: sex <fct>, year <int> ``` ??? I will be using this simple dataset with morphological data of penguins living on a certain island. -- > **IMPORTANT** start with tidy data! **each row is an observation, each column is a variable** --- count: false .panel1-penguins_species_simple-auto[ ```r *ggplot(data = penguins) ``` ] .panel2-penguins_species_simple-auto[ ![](presentation_JRom_files/figure-html/penguins_species_simple_auto_01_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species_simple-auto[ ```r ggplot(data = penguins) + * aes( * x = bill_length_mm, * y = bill_depth_mm * ) ``` ] .panel2-penguins_species_simple-auto[ ![](presentation_JRom_files/figure-html/penguins_species_simple_auto_02_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species_simple-auto[ ```r ggplot(data = penguins) + aes( x = bill_length_mm, y = bill_depth_mm ) + * geom_point() ``` ] .panel2-penguins_species_simple-auto[ ![](presentation_JRom_files/figure-html/penguins_species_simple_auto_03_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species_simple-auto[ ```r ggplot(data = penguins) + aes( x = bill_length_mm, y = bill_depth_mm ) + geom_point() + * geom_smooth(method = "lm") ``` ] .panel2-penguins_species_simple-auto[ ![](presentation_JRom_files/figure-html/penguins_species_simple_auto_04_output-1.png)<!-- --> ] <style> .panel1-penguins_species_simple-auto { color: black; width: 51.5789473684211%; hight: 32%; float: left; padding-left: 1%; font-size: 50% } .panel2-penguins_species_simple-auto { color: black; width: 46.421052631579%; hight: 32%; float: left; padding-left: 1%; font-size: 50% } .panel3-penguins_species_simple-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 50% } </style> --- count: false .panel1-penguins_species-auto[ ```r *ggplot(data = penguins) ``` ] .panel2-penguins_species-auto[ ![](presentation_JRom_files/figure-html/penguins_species_auto_01_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species-auto[ ```r ggplot(data = penguins) + * aes( * x = bill_length_mm, * y = bill_depth_mm * ) ``` ] .panel2-penguins_species-auto[ ![](presentation_JRom_files/figure-html/penguins_species_auto_02_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species-auto[ ```r ggplot(data = penguins) + aes( x = bill_length_mm, y = bill_depth_mm ) + * geom_point( * aes(color = species, * shape = species), * size = 2 * ) ``` ] .panel2-penguins_species-auto[ ![](presentation_JRom_files/figure-html/penguins_species_auto_03_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species-auto[ ```r ggplot(data = penguins) + aes( x = bill_length_mm, y = bill_depth_mm ) + geom_point( aes(color = species, shape = species), size = 2 ) + * geom_smooth( * method = "lm", * se = FALSE, * aes(color = species) * ) ``` ] .panel2-penguins_species-auto[ ![](presentation_JRom_files/figure-html/penguins_species_auto_04_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species-auto[ ```r ggplot(data = penguins) + aes( x = bill_length_mm, y = bill_depth_mm ) + geom_point( aes(color = species, shape = species), size = 2 ) + geom_smooth( method = "lm", se = FALSE, aes(color = species) ) *penguins_default <- last_plot() ``` ] .panel2-penguins_species-auto[ ![](presentation_JRom_files/figure-html/penguins_species_auto_05_output-1.png)<!-- --> ] <style> .panel1-penguins_species-auto { color: black; width: 51.5789473684211%; hight: 32%; float: left; padding-left: 1%; font-size: 50% } .panel2-penguins_species-auto { color: black; width: 46.421052631579%; hight: 32%; float: left; padding-left: 1%; font-size: 50% } .panel3-penguins_species-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 50% } </style> ??? Basics of ggplot2: - layers - 'aesthetics' + data + geom --- count: false .panel1-penguins_species2-auto[ ```r *penguins_default ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_01_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + * scale_color_manual( * values = c("darkorange", * "darkorchid", * "cyan4") * ) ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_02_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + scale_color_manual( values = c("darkorange", "darkorchid", "cyan4") ) + * facet_grid(rows = vars(species)) ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_03_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + scale_color_manual( values = c("darkorange", "darkorchid", "cyan4") ) + facet_grid(rows = vars(species)) + * labs(title = * "Penguin species differentation") ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_04_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + scale_color_manual( values = c("darkorange", "darkorchid", "cyan4") ) + facet_grid(rows = vars(species)) + labs(title = "Penguin species differentation") + * labs(subtitle = * "based on bill depth and bill length") ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_05_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + scale_color_manual( values = c("darkorange", "darkorchid", "cyan4") ) + facet_grid(rows = vars(species)) + labs(title = "Penguin species differentation") + labs(subtitle = "based on bill depth and bill length") + * xlab("bill length (mm)") ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_06_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + scale_color_manual( values = c("darkorange", "darkorchid", "cyan4") ) + facet_grid(rows = vars(species)) + labs(title = "Penguin species differentation") + labs(subtitle = "based on bill depth and bill length") + xlab("bill length (mm)") + * ylab("bill depth (mm)") ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_07_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + scale_color_manual( values = c("darkorange", "darkorchid", "cyan4") ) + facet_grid(rows = vars(species)) + labs(title = "Penguin species differentation") + labs(subtitle = "based on bill depth and bill length") + xlab("bill length (mm)") + ylab("bill depth (mm)") + * theme_minimal() ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_08_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + scale_color_manual( values = c("darkorange", "darkorchid", "cyan4") ) + facet_grid(rows = vars(species)) + labs(title = "Penguin species differentation") + labs(subtitle = "based on bill depth and bill length") + xlab("bill length (mm)") + ylab("bill depth (mm)") + theme_minimal() + * theme( * plot.title = * element_text(face = "bold"), * plot.subtitle = * element_text(face = "italic")) ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_09_output-1.png)<!-- --> ] --- count: false .panel1-penguins_species2-auto[ ```r penguins_default + scale_color_manual( values = c("darkorange", "darkorchid", "cyan4") ) + facet_grid(rows = vars(species)) + labs(title = "Penguin species differentation") + labs(subtitle = "based on bill depth and bill length") + xlab("bill length (mm)") + ylab("bill depth (mm)") + theme_minimal() + theme( plot.title = element_text(face = "bold"), plot.subtitle = element_text(face = "italic")) + * theme(legend.position = "none") ``` ] .panel2-penguins_species2-auto[ ![](presentation_JRom_files/figure-html/penguins_species2_auto_10_output-1.png)<!-- --> ] <style> .panel1-penguins_species2-auto { color: black; width: 51.5789473684211%; hight: 32%; float: left; padding-left: 1%; font-size: 50% } .panel2-penguins_species2-auto { color: black; width: 46.421052631579%; hight: 32%; float: left; padding-left: 1%; font-size: 50% } .panel3-penguins_species2-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 50% } </style> ??? OK, that was the basic graph, but we can modify everything: - change of color, - split into sub-graphs, - adding graph title (and subtitle, caption) - adding axes titles - styling --- class: inverse, center, middle #
DEMO --- name: plot_examples **Make a point!** <img src="figs/dataviz5.jpg" style="width: 45%;"> .footnote[src: Natalia Kiseleva, http://eolay.tilda.ws/en] --- **Make a point!** <img src="figs/dataviz4.jpg" style="width: 50%;"> .footnote[src: Natalia Kiseleva, http://eolay.tilda.ws/en] [back](#declutter) --- name: spaghetti ## Not spaghetti, but lasagne... <img src="figs/Duke_StatMed2014_fig2.png" style="width: 48%;"><img src="figs/Duke_StatMed2014_fig3.png" style="width: 45%;"> .footnote[src: Duke, S. P. et al. Stat.Med. 34 (2015) [back](#plot_types)] --- name: area ## Use disc area and bars starting from 0 <img src="figs/Rougier_PLoSCompBiol2014_fig6.png" style="width: 98%;"> .footnote[src: Rougier et al. PLoS Comput. Biol. 10, e1003833 (2014)] --- ## Don't clip the bars <img src="figs/dataviz7.png" style="width: 70%;"> src: Natalia Kiseleva, http://eolay.tilda.ws/en [back](#plot_types)