class: center, middle, inverse, title-slide # Training Workshop on Data Visualization using ggplot2 in R ## Session 1: Introduction to data wrangling --- layout: true --- ## Topics + R objects and packages + Reading data into R + Basic data wrangling with `dplyr` + Reshaping data + Basic management of data types + text data (string) + categorical data (factor) + date data --- ### It's normal to struggle, but it gets better and more exciting! <img src="img/r_first_then_new.png" width="80%" style="display: block; margin: auto;" /> .fifty[Illustration adapted from [Allison Horst](https://twitter.com/allison_horst)] --- class: middle center # R Objects and packages ---- --- ## R Objects .leftcol60[ + You can think of R objects as a way of "*saving information*" + e.g., text, number, matrix, vector, dataframe. + In other words, everything in R is an object. ] .rightcol40[ <img src="img/r_objects.gif" width="60%" /> ] --- ## R Objects + Objects are assigned a value using **`<-`** .leftcol[ .details[ ```r a1 <- 10 print(a1) ``` ``` [1] 10 ``` ] .details[ ```r a2 <- 20 a2 ``` ``` [1] 20 ``` ] .details[ ```r a3 <- c(10, 20, 30) a3 ``` ``` [1] 10 20 30 ``` ] ] .rightcol[ .details[ ```r a1 * a2 ``` ``` [1] 200 ``` ] .details[ ```r st_name <- "christopher" st_age <- 23 st_sex <- "male" student <- c(st_name, st_age, st_sex) student ``` ``` [1] "christopher" "23" "male" ``` ] ] --- ## R packages .leftcol60[ + A package is a collection of functions that you load into your working environment. + Packages contain code that other R users have prepared for the community. 
+ Installing packages ```r install.packages("tidyverse") ``` + Loading packages ```r library(tidyverse) ``` ] .rightcol40[ <img src="img/package.gif" width="70%" style="display: block; margin: auto 0 auto auto;" /> ] --- class: middle center # Reading data into R ---- --- ## Importing data + SPSS, Stata, SAS files: [haven package](https://haven.tidyverse.org/) + Excel files: [readxl package](https://readxl.tidyverse.org/) + CSV files: [readr package](https://readr.tidyverse.org/) --- ## Reading data into R #### SPSS, Stata & SAS using [haven package](https://haven.tidyverse.org/) .leftcol60[ ```r library(haven) ``` ```r # SPSS read_sav("path/data.sav") ``` ```r # Stata read_dta("path/data.dta") ``` ```r # SAS read_sas("path/data.sas7bdat") ``` ] .rightcol40[ <img src="img/haven.png" width="40%" style="display: block; margin: auto;" /> ] --- ## Reading data into R #### Excel files using [readxl package](https://readxl.tidyverse.org/) .leftcol60[ ```r library(readxl) read_excel("path/dataset.xls") ``` ``` # A tibble: 150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <chr> 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa # ... 
with 140 more rows ``` ] .rightcol40[ <img src="img/readxl.png" width="40%" style="display: block; margin: auto;" /> ] --- ## Reading data into R #### CSV files using [readr package](https://readr.tidyverse.org/) .leftcol60[ ```r install.packages("readr") library(readr) ``` ```r # comma separated (CSV) files read_csv("path/data.csv") ``` ```r # tab separated files read_tsv("path/data.tsv") ``` ```r # general delimited files read_delim("path/data.delim") ``` ] .rightcol40[ <img src="img/readr.png" width="40%" style="display: block; margin: auto;" /> ] --- class: middle center # Basic data wrangling with `dplyr` ---- --- class: middle center <img src="img/tidyverse.png" width="30%" /> # Tidyverse --- ## What is the tidyverse? A collection of R packages designed for data science. All packages share an underlying philosophy, grammar, and data structure. <center> <img src="https://rstudio-education.github.io/tidyverse-cookbook/images/data-science-workflow.png" style="width: 60%" /> </center> --- ## Tidyverse :: tidy data <center> <img src="https://www.openscapes.org/img/blog/tidydata/tidydata_1.jpg" style="width: 70%" /> </center> .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] --- ## Tidyverse :: tidy data <center> <img src="https://www.openscapes.org/img/blog/tidydata/tidydata_2.jpg" style="width: 60%" /> </center> .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] --- class: middle center ## Data wrangling using `dplyr` <center> <img src="https://github.com/allisonhorst/stats-illustrations/blob/master/rstats-artwork/dplyr_wrangling.png?raw=true" style="width: 40%" /> </center> .fifty[Artist: [Allison Horst](https://github.com/allisonhorst)] --- ## `dplyr` .leftcol[ **Overview** + `select()` picks variables based on their names + `mutate()` adds new variables + `filter()` picks cases based on their values + `summarise()` reduces multiple values down to a single summary + `arrange()` changes the ordering of the rows See the `dplyr` 
[cheatsheets](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf) ] .rightcol[ <img src="img/dplyr.png" width="80%" style="display: block; margin: auto;" /> ] --- ## `select()` ![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/select.png) .leftcol[ ```r data ``` ``` # A tibble: 1,704 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # ... with 1,694 more rows ``` ] .rightcol[ ```r select(data, continent, country, pop) ``` ``` # A tibble: 1,704 x 3 continent country pop <fct> <fct> <int> 1 Asia Afghanistan 8425333 2 Asia Afghanistan 9240934 3 Asia Afghanistan 10267083 4 Asia Afghanistan 11537966 5 Asia Afghanistan 13079460 6 Asia Afghanistan 14880372 7 Asia Afghanistan 12881816 8 Asia Afghanistan 13867957 9 Asia Afghanistan 16317921 10 Asia Afghanistan 22227415 # ... with 1,694 more rows ``` ] --- ## `select()` We can also **remove** variables with a **`-`** (minus) .leftcol[ ```r data ``` ``` # A tibble: 1,704 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # ... 
with 1,694 more rows ``` ] .rightcol[ ```r select(data, -year, -pop) ``` ``` # A tibble: 1,704 x 4 country continent lifeExp gdpPercap <fct> <fct> <dbl> <dbl> 1 Afghanistan Asia 28.8 779. 2 Afghanistan Asia 30.3 821. 3 Afghanistan Asia 32.0 853. 4 Afghanistan Asia 34.0 836. 5 Afghanistan Asia 36.1 740. 6 Afghanistan Asia 38.4 786. 7 Afghanistan Asia 39.9 978. 8 Afghanistan Asia 40.8 852. 9 Afghanistan Asia 41.7 649. 10 Afghanistan Asia 41.8 635. # ... with 1,694 more rows ``` ] --- ## `select()` **Selection helpers** These *selection helpers* match variables according to a given pattern. + `starts_with()` starts with a prefix + `ends_with()` ends with a suffix + `contains()` contains a literal string + `matches()` matches regular expression --- ## `filter()` ![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/filter.png) .leftcol[ ```r data ``` ``` # A tibble: 10 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. ``` ] .rightcol[ ```r filter(data, country == "Philippines") ``` ``` # A tibble: 10 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Philippines Asia 1952 47.8 22438691 1273. 2 Philippines Asia 1957 51.3 26072194 1548. 3 Philippines Asia 1962 54.8 30325264 1650. 4 Philippines Asia 1967 56.4 35356600 1814. 5 Philippines Asia 1972 58.1 40850141 1989. 6 Philippines Asia 1977 60.1 46850962 2373. 7 Philippines Asia 1982 62.1 53456774 2603. 8 Philippines Asia 1987 64.2 60017788 2190. 9 Philippines Asia 1992 66.5 67185766 2279. 
10 Philippines Asia 1997 68.6 75012988 2537. ``` ] --- ## `mutate()` ![](https://favstats.shinyapps.io/r_intro/_w_dfe6b732/images/mutate.png) The `mutate` function will take a statement similar to this: + `variable_name` = `do_some_calculation` + `variable_name` will be attached at the end of the dataset. --- ## `mutate()` Let's calculate the `gdp` .leftcol[ ```r data ``` ``` # A tibble: 1,704 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # ... with 1,694 more rows ``` ] .rightcol[ ```r mutate(data, GDP = gdpPercap * pop) ``` ``` # A tibble: 1,704 x 7 country continent year lifeExp pop gdpPercap GDP <fct> <fct> <int> <dbl> <int> <dbl> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330. 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670. 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797. 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150. 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274. 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231. 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401. 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309. 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589. 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875. # ... with 1,694 more rows ``` ] --- ## `group_by` and `summarise()` + Use when you want to aggregate your data (by groups). + Sometimes we want to calculate group statistics. 
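As a minimal sketch of this idea, the chunk below uses a small made-up table (`scores`, `team`, and `points` are hypothetical names chosen for illustration, not part of the workshop data). `group_by()` marks the groups, and `summarise()` then collapses each group to a single row.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
scores <- tibble(
  team   = c("a", "a", "b", "b", "b"),
  points = c(10, 20, 30, 40, 50)
)

# Mark the groups, then reduce each group to one summary row
grouped_scores <- group_by(scores, team)
team_summary <- summarise(grouped_scores,
                          avg_points = mean(points),
                          n_rows = n())
team_summary
```

The result has one row per team; the following slides apply the same pattern to `data`, grouping by `continent`.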
<br> .center[ <img src="https://learn.r-journalism.com/wrangling/dplyr/images/groupby.png" style="width: 60%" /> ] --- ## `group_by` and `summarise()` Suppose we want to know the average population by continent. .leftcol40[ ```r data ``` ``` # A tibble: 1,704 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # ... with 1,694 more rows ``` ] .rightcol60[ ```r grouped_by_continent <- group_by(data, continent) summarise(grouped_by_continent, avg_pop = mean(pop)) ``` ``` # A tibble: 5 x 2 continent avg_pop <fct> <dbl> 1 Africa 9916003. 2 Americas 24504795. 3 Asia 77038722. 4 Europe 17169765. 5 Oceania 8874672. ``` ] --- ## `group_by` and `summarise()` Suppose we want to know the average population by continent. .leftcol40[ ```r data ``` ``` # A tibble: 1,704 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # ... 
with 1,694 more rows ``` ] .rightcol60[ ```r grouped_by_continent <- group_by(data, continent) summarised_data <- summarise(grouped_by_continent, avg_pop = mean(pop)) summarised_data ``` ``` # A tibble: 5 x 2 continent avg_pop <fct> <dbl> 1 Africa 9916003. 2 Americas 24504795. 3 Asia 77038722. 4 Europe 17169765. 5 Oceania 8874672. ``` ] --- .leftcol[ <img src="img/code_convo.jpg" width="80%" style="display: block; margin: auto;" /> ] .rightcol[ <img src="img/teary.gif" width="90%" style="display: block; margin: auto;" /> ] --- class: middle center ## `%>%` pipe operator <center> <img src="https://rpodcast.github.io/officer-advrmarkdown/img/magrittr.png" style="width: 40%" /> </center> --- ## The %>% operator The **`%>%`** operator helps you write code in a way that is easier to read and understand. .leftcol[ Calculating average population by continent **without %>%** ```r grouped_by_continent <- group_by(data, continent) summarised_data <- summarise(grouped_by_continent, avg_pop = mean(pop)) summarised_data ``` ``` # A tibble: 5 x 2 continent avg_pop <fct> <dbl> 1 Africa 9916003. 2 Americas 24504795. 3 Asia 77038722. 4 Europe 17169765. 5 Oceania 8874672. ``` ] .rightcol[ Calculating average population by continent **with %>%** ```r data %>% group_by(continent) %>% summarise(avg_pop = mean(pop)) ``` ``` # A tibble: 5 x 2 continent avg_pop <fct> <dbl> 1 Africa 9916003. 2 Americas 24504795. 3 Asia 77038722. 4 Europe 17169765. 5 Oceania 8874672. 
``` ] --- ## The %>% operator .leftcol[ Calculating average life expectancy of Asian countries **without %>%** ```r filtered_by_asia <- filter(data, continent == "Asia") grouped_by_country <- group_by(filtered_by_asia, country) summarised_by_country <- summarise(grouped_by_country, avg_lifeExp = mean(lifeExp)) summarised_by_country ``` .code-output-scroll[ ``` # A tibble: 7 x 2 country avg_lifeExp <fct> <dbl> 1 Afghanistan 37.5 2 Bahrain 65.6 3 Bangladesh 49.8 4 Cambodia 47.9 5 China 61.8 6 Hong Kong, China 73.5 7 India 53.2 ``` ] ] .rightcol[ Calculating average life expectancy of Asian countries **with %>%** ```r data %>% filter(continent == "Asia") %>% group_by(country) %>% summarise(avg_lifeExp = mean(lifeExp)) ``` .code-output-scroll[ ``` # A tibble: 7 x 2 country avg_lifeExp <fct> <dbl> 1 Afghanistan 37.5 2 Bahrain 65.6 3 Bangladesh 49.8 4 Cambodia 47.9 5 China 61.8 6 Hong Kong, China 73.5 7 India 53.2 ``` ] ] --- ## The %>% operator Suppose we want to know the average life expectancy of Asian countries per year. .leftcol[ Calculating average life expectancy by country and year **without %>%** ```r filtered_by_asia <- filter(data, continent == "Asia") grouped_by_country_year <- group_by(filtered_by_asia, country, year) summarise(grouped_by_country_year, avg_lifeExp = mean(lifeExp)) ``` .code-output-scroll[ ``` # A tibble: 5 x 3 # Groups: country [1] country year avg_lifeExp <fct> <int> <dbl> 1 Afghanistan 1952 28.8 2 Afghanistan 1957 30.3 3 Afghanistan 1962 32.0 4 Afghanistan 1967 34.0 5 Afghanistan 1972 36.1 ``` ] ] .rightcol[ Calculating average life expectancy by country and year **with %>%** ```r data %>% filter(continent == "Asia") %>% group_by(country, year) %>% summarise(avg_lifeExp = mean(lifeExp)) ``` .code-output-scroll[ ``` # A tibble: 5 x 3 # Groups: country [1] country year avg_lifeExp <fct> <int> <dbl> 1 Afghanistan 1952 28.8 2 Afghanistan 1957 30.3 3 Afghanistan 1962 32.0 4 Afghanistan 1967 34.0 5 Afghanistan 1972 36.1 ``` ] ] --- class: middle center # Reshaping data ---- --- ## Wide vs Long data format <img src="img/pivoting.jpg" 
width="80%" style="display: block; margin: auto;" /> --- ## Wide vs Long data format .leftcol[ **`pivot_longer`** ``` pivot_longer(data, cols = ..., names_to = ..., values_to = ...) ``` + Transforms data from wide format to long format <img src="img/wide_to_long.jpg" width="80%" style="display: block; margin: auto;" /> ] --- ## Wide vs Long data format **`pivot_longer`** .leftcol[ ```r library(readxl) urbanpop <- read_excel("data/urbanpop.xlsx") print(urbanpop) ``` ``` # A tibble: 10 x 8 country `1960` `1961` `1962` `1963` `1964` `1965` `1966` <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Afghanistan 769308 814923. 858522. 903914. 951226. 1000582. 1058743. 2 Albania 494443 511803. 529439. 547377. 565572. 583983. 602512. 3 Algeria 3293999 3515148. 3739963. 3973289. 4220987. 4488176. 4649105. 4 American Samoa NA 13660. 14166. 14759. 15396. 16045. 16693. 5 Andorra NA 8724. 9700. 10748. 11866. 13053. 14217. 6 Angola 521205 548265. 579695. 612087. 645262. 679109. 717833. 7 Antigua and Barbuda 21699 21635. 21664. 21741. 21830. 21909. 22003. 8 Argentina 15224096 15545223. 15912120. 16282345. 16654412. 17027712. 17389812. 9 Armenia 957974 1008597. 1061426. 1115612. 1170683. 1226270. 1281582. 10 Aruba 24996 28140. 28533. 28763. 28923. 29083. 29252. ``` ] .rightcol[ ```r pivot_longer(data = urbanpop, cols = "1960":"1966", names_to = "year", values_to = "pop") ``` ``` # A tibble: 10 x 3 country year pop <chr> <chr> <dbl> 1 Afghanistan 1960 769308 2 Afghanistan 1961 814923. 3 Afghanistan 1962 858522. 4 Afghanistan 1963 903914. 5 Afghanistan 1964 951226. 6 Afghanistan 1965 1000582. 7 Afghanistan 1966 1058743. 8 Albania 1960 494443 9 Albania 1961 511803. 10 Albania 1962 529439. 
``` ] --- ## Wide vs Long data format .leftcol[ **`pivot_wider`** ``` pivot_wider(data, names_from = ..., values_from = ...) ``` + Transforms data from long format to wide format <img src="img/long_to_wide.jpg" width="80%" style="display: block; margin: auto;" /> ] --- ## Wide vs Long data format **`pivot_wider`** .leftcol[ ```r potato_data ``` ``` # A tibble: 1,280 x 3 id measure value <int> <chr> <dbl> 1 1 area 1 2 1 temp 1 3 1 size 1 4 1 storage 1 5 1 method 1 6 1 texture 2.9 7 1 flavor 3.2 8 1 moistness 3 9 2 area 1 10 2 temp 1 # ... with 1,270 more rows ``` ] .rightcol[ ```r pivot_wider(data = potato_data, names_from = "measure", values_from = "value") ``` ``` # A tibble: 160 x 9 id area temp size storage method texture flavor moistness <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 1 1 1 1 1 2.9 3.2 3 2 2 1 1 1 1 2 2.3 2.5 2.6 3 3 1 1 1 1 3 2.5 2.8 2.8 4 4 1 1 1 1 4 2.1 2.9 2.4 5 5 1 1 1 1 5 1.9 2.8 2.2 6 6 1 1 1 2 1 1.8 3 1.7 7 7 1 1 1 2 2 2.6 3.1 2.4 8 8 1 1 1 2 3 3 3 2.9 9 9 1 1 1 2 4 2.2 3.2 2.5 10 10 1 1 1 2 5 2 2.8 1.9 # ... with 150 more rows ``` ] --- class: middle center # Basic management of data types ---- --- ## Strings .leftcol[ + A string is a piece of text made up of a sequence of characters. 
+ Strings are enclosed in single or double quotes + The `tidyverse` **`stringr`** package provides tools for manipulating strings ] --- ## Strings `stringr` [cheatsheets](https://github.com/rstudio/cheatsheets/blob/main/strings.pdf) <img src="https://raw.githubusercontent.com/rstudio/cheatsheets/main/pngs/thumbnails/strings-cheatsheet-thumbs.png" width="90%" style="display: block; margin: auto;" /> --- ## Strings **Converting strings to lower/upper/title case** + `str_to_lower()` + `str_to_upper()` + `str_to_title()` --- ## Strings **Converting strings to lower/upper/title case** .leftcol[ + `str_to_lower()` + `str_to_upper()` + `str_to_title()` ] .rightcol[ ```r fruits <- stringr::fruit fruits ``` ``` [1] "APPLE" "APRICOT" "AVOCADO" "BANANA" [5] "BELL PEPPER" "BILBERRY" "BLACKBERRY" "BLACKCURRANT" [9] "BLOOD ORANGE" "BLUEBERRY" "BOYSENBERRY" "BREADFRUIT" [13] "CANARY MELON" "CANTALOUPE" "CHERIMOYA" "CHERRY" [17] "CHILI PEPPER" "CLEMENTINE" "CLOUDBERRY" "COCONUT" [21] "CRANBERRY" "CUCUMBER" "CURRANT" "DAMSON" [25] "DATE" "DRAGONFRUIT" "DURIAN" "EGGPLANT" [29] "ELDERBERRY" "FEIJOA" "FIG" "GOJI BERRY" [33] "GOOSEBERRY" "GRAPE" "GRAPEFRUIT" "GUAVA" [37] "HONEYDEW" "HUCKLEBERRY" "JACKFRUIT" "JAMBUL" [41] "JUJUBE" "KIWI FRUIT" "KUMQUAT" "LEMON" [45] "LIME" "LOQUAT" "LYCHEE" "MANDARINE" [49] "MANGO" "MULBERRY" "NECTARINE" "NUT" [53] "OLIVE" "ORANGE" "PAMELO" "PAPAYA" [57] "PASSIONFRUIT" "PEACH" "PEAR" "PERSIMMON" [61] "PHYSALIS" "PINEAPPLE" "PLUM" "POMEGRANATE" [65] "POMELO" "PURPLE MANGOSTEEN" "QUINCE" "RAISIN" [69] "RAMBUTAN" "RASPBERRY" "REDCURRANT" "ROCK MELON" [73] "SALAL BERRY" "SATSUMA" "STAR FRUIT" "STRAWBERRY" [77] "TAMARILLO" "TANGERINE" "UGLI FRUIT" "WATERMELON" ``` ] --- ## Strings **Converting strings to lower/upper/title case** + `str_to_lower()` .leftcol[ ```r fruits ``` ``` [1] "APPLE" "APRICOT" "AVOCADO" "BANANA" "BELL PEPPER" "BILBERRY" [7] "BLACKBERRY" "BLACKCURRANT" "BLOOD ORANGE" "BLUEBERRY" "BOYSENBERRY" "BREADFRUIT" [13] "CANARY MELON" 
"CANTALOUPE" "CHERIMOYA" "CHERRY" "CHILI PEPPER" "CLEMENTINE" [19] "CLOUDBERRY" "COCONUT" "CRANBERRY" "CUCUMBER" "CURRANT" "DAMSON" [25] "DATE" "DRAGONFRUIT" "DURIAN" "EGGPLANT" "ELDERBERRY" "FEIJOA" ``` ] .rightcol[ ```r *fruits %>% str_to_lower() ``` ``` [1] "apple" "apricot" "avocado" "banana" "bell pepper" "bilberry" [7] "blackberry" "blackcurrant" "blood orange" "blueberry" "boysenberry" "breadfruit" [13] "canary melon" "cantaloupe" "cherimoya" "cherry" "chili pepper" "clementine" [19] "cloudberry" "coconut" "cranberry" "cucumber" "currant" "damson" [25] "date" "dragonfruit" "durian" "eggplant" "elderberry" "feijoa" ``` ] --- # Strings **Converting strings to lower/ upper/ title case** + `str_to_upper()` .leftcol[ ```r fruits ``` ``` [1] "apple" "apricot" "avocado" "banana" "bell pepper" "bilberry" [7] "blackberry" "blackcurrant" "blood orange" "blueberry" "boysenberry" "breadfruit" [13] "canary melon" "cantaloupe" "cherimoya" "cherry" "chili pepper" "clementine" [19] "cloudberry" "coconut" "cranberry" "cucumber" "currant" "damson" [25] "date" "dragonfruit" "durian" "eggplant" "elderberry" "feijoa" ``` ] .rightcol[ ```r *fruits %>% str_to_lower ``` ``` [1] "APPLE" "APRICOT" "AVOCADO" "BANANA" "BELL PEPPER" "BILBERRY" [7] "BLACKBERRY" "BLACKCURRANT" "BLOOD ORANGE" "BLUEBERRY" "BOYSENBERRY" "BREADFRUIT" [13] "CANARY MELON" "CANTALOUPE" "CHERIMOYA" "CHERRY" "CHILI PEPPER" "CLEMENTINE" [19] "CLOUDBERRY" "COCONUT" "CRANBERRY" "CUCUMBER" "CURRANT" "DAMSON" [25] "DATE" "DRAGONFRUIT" "DURIAN" "EGGPLANT" "ELDERBERRY" "FEIJOA" ``` ] --- # Strings **Converting strings to lower/ upper/ title case** + `str_to_title()` .leftcol[ ```r fruits ``` ``` [1] "apple" "apricot" "avocado" "banana" "bell pepper" "bilberry" [7] "blackberry" "blackcurrant" "blood orange" "blueberry" "boysenberry" "breadfruit" [13] "canary melon" "cantaloupe" "cherimoya" "cherry" "chili pepper" "clementine" [19] "cloudberry" "coconut" "cranberry" "cucumber" "currant" "damson" [25] "date" "dragonfruit" 
"durian" "eggplant" "elderberry" "feijoa" ``` ] .rightcol[ ```r *fruits %>% str_to_title() ``` ``` [1] "Apple" "Apricot" "Avocado" "Banana" "Bell Pepper" "Bilberry" [7] "Blackberry" "Blackcurrant" "Blood Orange" "Blueberry" "Boysenberry" "Breadfruit" [13] "Canary Melon" "Cantaloupe" "Cherimoya" "Cherry" "Chili Pepper" "Clementine" [19] "Cloudberry" "Coconut" "Cranberry" "Cucumber" "Currant" "Damson" [25] "Date" "Dragonfruit" "Durian" "Eggplant" "Elderberry" "Feijoa" ``` ] --- ## Strings **Joining of multiple strings** + `str_c()` .leftcol[ ```r pop_data <- gapminder::gapminder pop_data ``` ``` # A tibble: 10 x 3 continent country pop <fct> <fct> <dbl> 1 Africa Algeria 19875406. 2 Africa Angola 7309390. 3 Africa Benin 4017497. 4 Africa Botswana 971186. 5 Africa Burkina Faso 7548677. 6 Africa Burundi 4651608. 7 Africa Cameroon 9816648. 8 Africa Central African Republic 2560963 9 Africa Chad 5329256. 10 Africa Comoros 361684. ``` ] .rightcol[ ```r pop_data %>% * mutate(label = str_c(continent, country, sep = ": ")) ``` ``` # A tibble: 9 x 4 continent country pop label <fct> <fct> <dbl> <chr> 1 Africa Algeria 19875406. Africa: Algeria 2 Africa Angola 7309390. Africa: Angola 3 Africa Benin 4017497. Africa: Benin 4 Africa Botswana 971186. Africa: Botswana 5 Africa Burkina Faso 7548677. Africa: Burkina Faso 6 Africa Burundi 4651608. Africa: Burundi 7 Africa Cameroon 9816648. Africa: Cameroon 8 Africa Central African Republic 2560963 Africa: Central African Republic 9 Africa Chad 5329256. 
Africa: Chad ``` ] --- ## Strings **Replacing matched patterns in strings** + `str_replace()` + `str_replace_all()` --- ## Strings **Replacing matched patterns in strings** + `str_replace()` + `str_replace(string, pattern, replacement)` .leftcol[ ```r pop_data %>% count(continent) ``` ``` # A tibble: 5 x 2 continent n <fct> <int> 1 Africa 52 2 Americas 25 3 Asia 33 4 Europe 30 5 Oceania 2 ``` ] .rightcol[ ```r pop_data %>% count(continent) %>% * mutate(continent_2 = str_replace(continent, "Americas", "USA")) ``` ``` # A tibble: 5 x 3 continent n continent_2 <fct> <int> <chr> 1 Africa 52 Africa 2 Americas 25 USA 3 Asia 33 Asia 4 Europe 30 Europe 5 Oceania 2 Oceania ``` ] --- ## Factors .leftcol[ + R represents categorical data with factors. + A factor is an integer vector with a levels attribute that stores a set of mappings between integers and categorical values. + The `tidyverse` **`forcats`** package provides tools for working with factors ] --- ## Factors `forcats` [cheatsheets](https://github.com/rstudio/cheatsheets/blob/main/factors.pdf) <img src="https://raw.githubusercontent.com/rstudio/cheatsheets/master/pngs/thumbnails/forcats-cheatsheet-thumbs.png" width="50%" style="display: block; margin: auto;" /> --- ## Factors + `factor()` + `fct_reorder()` + `fct_lump()` --- ## Factors .leftcol40[ + `factor()` + encode a vector as a factor + e.g., category & enumerated type ```r factor(x = character(), levels, labels = levels) ``` ] .rightcol60[ **Example** ```r rank <- c("third", "second", "fourth", "first") sort(rank) ``` ``` [1] "first" "fourth" "second" "third" ``` ] --- ## Factors .leftcol40[ + `factor()` + encode a vector as a factor + e.g., category & enumerated type ```r factor(x = character(), levels, labels = levels) ``` ] .rightcol60[ **Example** ```r rank_levels <- c("first", "second", "third", "fourth") ``` ```r *rank_factor <- factor(x = rank, levels = rank_levels) rank_factor ``` ``` [1] third second fourth first Levels: first second third 
fourth ``` ```r sort(rank_factor) ``` ``` [1] first second third fourth Levels: first second third fourth ``` ] --- ## Factors .leftcol40[ ```r mpg %>% select(model, class, displ) %>% head(4) ``` ``` # A tibble: 4 x 3 model class displ <chr> <chr> <dbl> 1 a4 compact 1.8 2 a4 compact 1.8 3 a4 compact 2 4 a4 compact 2 ``` ] .rightcol60[ ```r mpg %>% * mutate(class = factor(class)) %>% select(model, class, displ) ``` ``` # A tibble: 234 x 3 model class displ <chr> <fct> <dbl> 1 a4 compact 1.8 2 a4 compact 1.8 3 a4 compact 2 4 a4 compact 2 5 a4 compact 2.8 6 a4 compact 2.8 7 a4 compact 3.1 8 a4 quattro compact 1.8 9 a4 quattro compact 1.8 10 a4 quattro compact 2 # ... with 224 more rows ``` ] --- ## Factors .leftcol40[ ```r mpg %>% count(class) ``` ``` # A tibble: 7 x 2 class n <chr> <int> 1 2seater 5 2 compact 47 3 midsize 41 4 minivan 11 5 pickup 33 6 subcompact 35 7 suv 62 ``` ] .rightcol60[ ```r mpg %>% * mutate(class = factor(class)) %>% select(model, class, displ) ``` ``` # A tibble: 234 x 3 model class displ <chr> <fct> <dbl> 1 a4 compact 1.8 2 a4 compact 1.8 3 a4 compact 2 4 a4 compact 2 5 a4 compact 2.8 6 a4 compact 2.8 7 a4 compact 3.1 8 a4 quattro compact 1.8 9 a4 quattro compact 1.8 10 a4 quattro compact 2 # ... with 224 more rows ``` ] --- ## Factors .leftcol[ + `fct_reorder()` + reorder factor levels by sorting along another variable ```r fct_reorder(.f, .x, .desc, ...) 
``` **Sample data** ```r relig_summary <- gss_cat %>% group_by(relig) %>% summarise(age = mean(age, na.rm = TRUE), tvhours = mean(tvhours, na.rm = TRUE), n = n()) ``` ] .rightcol[ ```r relig_summary ``` ``` # A tibble: 15 x 4 relig age tvhours n <fct> <dbl> <dbl> <int> 1 No answer 49.5 2.72 93 2 Don't know 35.9 4.62 15 3 Inter-nondenominational 40.0 2.87 109 4 Native american 38.9 3.46 23 5 Christian 40.1 2.79 689 6 Orthodox-christian 50.4 2.42 95 7 Moslem/islam 37.6 2.44 104 8 Other eastern 45.9 1.67 32 9 Hinduism 37.7 1.89 71 10 Buddhism 44.7 2.38 147 11 Other 41.0 2.73 224 12 None 41.2 2.71 3523 13 Jewish 52.4 2.52 388 14 Catholic 46.9 2.96 5124 15 Protestant 49.9 3.15 10846 ``` ] --- ## Factors .leftcol[ + `fct_reorder()` + reorder factor levels by sorting along another variable ```r fct_reorder(.f, .x, .desc, ...) ``` **Sample data** ```r relig_summary <- gss_cat %>% group_by(relig) %>% summarise(age = mean(age, na.rm = TRUE), tvhours = mean(tvhours, na.rm = TRUE), n = n()) %>% ungroup() ``` ] .rightcol[ ```r relig_summary %>% ggplot(aes(x = tvhours, y = relig)) + geom_point(size = 4) ``` <img src="01_intro_datawrangling_files/figure-html/unnamed-chunk-101-1.png" width="100%" /> ] --- ## Factors .leftcol[ + `fct_reorder()` + reorder factor levels by sorting along another variable ```r fct_reorder(.f, .x, .desc, ...) ``` **Reordered data** ```r relig_summary_reordered <- relig_summary %>% * mutate(relig = fct_reorder(relig, tvhours)) ``` ] .rightcol[ ```r relig_summary_reordered %>% ggplot(aes(x = tvhours, y = relig)) + geom_point(size = 4) ``` <img src="01_intro_datawrangling_files/figure-html/unnamed-chunk-104-1.png" width="100%" /> ] --- ## Factors .leftcol[ + `fct_lump()` + lump together factor levels into "other" ```r fct_lump(f, n, other_level, ...) 
``` **Lumped factors** ```r relig_summary_lumped <- gss_cat %>% filter(!relig %in% c("Other", "None")) %>% * mutate(relig = fct_lump(f = relig, n = 7, other_level = "Other religion")) %>% group_by(relig) %>% summarise(age = mean(age, na.rm = TRUE), tvhours = mean(tvhours, na.rm = TRUE), n = n()) %>% ungroup() %>% mutate(relig = fct_reorder(relig, tvhours)) ``` ] .rightcol[ ```r relig_summary_lumped ``` ``` # A tibble: 8 x 4 relig age tvhours n <fct> <dbl> <dbl> <int> 1 Inter-nondenominational 40.0 2.87 109 2 Christian 40.1 2.79 689 3 Moslem/islam 37.6 2.44 104 4 Buddhism 44.7 2.38 147 5 Jewish 52.4 2.52 388 6 Catholic 46.9 2.96 5124 7 Protestant 49.9 3.15 10846 8 Other religion 45.4 2.53 329 ``` ] --- ## Factors .leftcol[ + `fct_lump()` + lump together factor levels into "other" ```r fct_lump(f, n, other_level, ...) ``` ] .rightcol[ ```r relig_summary_lumped %>% ggplot(aes(x = tvhours, y = relig)) + geom_point(size = 4) ``` <img src="01_intro_datawrangling_files/figure-html/unnamed-chunk-109-1.png" width="100%" /> ] --- class: middle center # What a start! <img src="img/welldone.gif" width="35%" /> --- class: middle center # Thank you! #### Slides created via the R packages: .leftcol[ <img src="img/xaringan.png" style="display:inline-block; margin: 0" width=20%/> ### xaringan by Yihui ] .rightcol[ <img src="img/xaringanthemer.png" style="display:inline-block; margin: 0" width=25%/> ### xaringanthemer and xaringanExtra by Garrick ]