# Training Workshop on Data Visualization using ggplot2 in R
## Session 1: Introduction to data wrangling

---

## Topics

+ R objects and packages

+ Reading data into R

+ Basic data wrangling with `dplyr`

+ Reshaping data
  
+ Basic management of data types

  + text data (string)
  
  + categorical data (factor)
  
  + date data
  
---

### It's normal to struggle, but it gets better and more exciting!

---

# R Objects and packages
----

---

## R Objects

+ Examples: text, numbers, vectors, matrices, and data frames.

+ In other words, everything in R is an object.

---

## R Objects

+ Objects are assigned a value using **`<-`**

```r
a1 <- 10
print(a1)
```

```
[1] 10
```

```r
a2 <- 20
a2
```

```
[1] 20
```

```r
a3 <- c(10, 20, 30)
a3
```

```
[1] 10 20 30
```

```r
a1 * a2
```

```
[1] 200
```

```r
st_name <- "christopher"
st_age <- 23
st_sex <- "male"

student <- c(st_name, st_age, st_sex)
student
```

```
[1] "christopher" "23"          "male"       
```
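
Because a vector holds a single type, `c()` converted the number 23 to text in the output above. A minimal sketch for checking what kind of object you have, using `class()`; the object names and values below are illustrative only.

```r
# Illustrative objects (names and values are made up for this sketch)
txt <- "christopher"                 # text
num <- 23                            # number
mixed <- c(txt, num)                 # a vector holds one type, so 23 becomes "23"
mat <- matrix(1:6, nrow = 2)         # matrix with 2 rows and 3 columns
df <- data.frame(name = txt, age = num)  # a data frame keeps each column's type

class(txt)      # "character"
class(num)      # "numeric"
class(mixed)    # "character"
is.matrix(mat)  # TRUE
class(df$age)   # "numeric"
```

If you need to keep text and numbers together without coercion, a data frame (one column per variable) is usually the better container than a vector.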

---

## R packages

+ A package is a collection of functions that you load into your working environment.

+ Packages contain code that other R users have prepared for the community.

+ Installing packages

```r
install.packages("tidyverse")
```

+ Loading packages

```r
library(tidyverse)
```

---

# Reading data into R
----

---

## Importing data

+ SPSS, Stata, SAS files: [haven package](https://haven.tidyverse.org/)

+ Excel files: [readxl package](https://readxl.tidyverse.org/)

+ CSV files: [readr package](https://readr.tidyverse.org/)

---

## Reading data into R

#### SPSS, Stata & SAS using [haven package](https://haven.tidyverse.org/)

```r
library(haven)
```

```r
# SPSS
read_sav("path/data.sav")
```

```r
# Stata
read_dta("path/data.dta")
```

```r
# SAS
read_sas("path/data.sas7bdat")
```

---

## Reading data into R

#### Excel files using [readxl package](https://readxl.tidyverse.org/)

```r
library(readxl)
read_excel("path/dataset.xls")
```

```
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ... with 140 more rows
```

---

## Reading data into R

#### CSV files using [readr package](https://readr.tidyverse.org/)

```r
install.packages("readr")
library(readr)
```

```r
# comma separated (CSV) files
read_csv("path/data.csv")
```

```r
# tab separated files
read_tsv("path/data.tsv")
```

```r
# general delimited files
read_delim("path/data.delim")
```
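
To try these readers without a data file at hand, one option is to write a small CSV to a temporary file and read it back. This is only a sketch: the `scores` data frame and the temporary path are made up for illustration.

```r
library(readr)

# Made-up example data, written to a temporary file
scores <- data.frame(name = c("ana", "ben"), score = c(90, 85))
csv_path <- tempfile(fileext = ".csv")
write_csv(scores, csv_path)

# read_csv() returns a tibble and guesses the column types
scores_in <- read_csv(csv_path, show_col_types = FALSE)
scores_in

# read_delim() reads the same file when the separator is named explicitly
scores_delim <- read_delim(csv_path, delim = ",", show_col_types = FALSE)
scores_delim
```

Naming `delim` explicitly avoids relying on the reader to guess the separator.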

---

# Basic data wrangling with `dplyr`
----

---
class: middle center

# Tidyverse

---

## What is the tidyverse?

A collection of R packages designed for data science.

All packages share an underlying philosophy, grammar, and data structure.

---

## Tidyverse :: tidy data

---

## Data wrangling using `dplyr`

---

## `dplyr`
.leftcol[
**Overview**

+ `select()` picks variables based on their names

+ `mutate()` adds new variables

+ `filter()` picks cases based on their values

+ `summarise()` reduces multiple values down to a single summary

+ `arrange()` changes the ordering of the rows

See the `dplyr` [cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)
]
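
`arrange()` does not get its own example in the slides that follow, so here is a minimal sketch. The `toy` tibble and its columns are made up for illustration and are not the workshop dataset.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  country = c("A", "B", "C"),
  pop     = c(30, 10, 20)
)

# sort rows by pop, smallest first
sorted_asc <- arrange(toy, pop)
sorted_asc

# desc() reverses the order: largest first
sorted_desc <- arrange(toy, desc(pop))
sorted_desc
```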

---

## `select()`

![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/select.png)

```r
data
```

```
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ... with 1,694 more rows
```

```r
select(data, continent, country, pop)
```

```
# A tibble: 1,704 x 3
   continent country          pop
   <fct>     <fct>          <int>
 1 Asia      Afghanistan  8425333
 2 Asia      Afghanistan  9240934
 3 Asia      Afghanistan 10267083
 4 Asia      Afghanistan 11537966
 5 Asia      Afghanistan 13079460
 6 Asia      Afghanistan 14880372
 7 Asia      Afghanistan 12881816
 8 Asia      Afghanistan 13867957
 9 Asia      Afghanistan 16317921
10 Asia      Afghanistan 22227415
# ... with 1,694 more rows
```

---

## `select()`

We can also **remove** variables with a **`-`** (minus)

```r
data
```

```r
select(data, -year, -pop)
```

```
# A tibble: 1,704 x 4
   country     continent lifeExp gdpPercap
   <fct>       <fct>       <dbl>     <dbl>
 1 Afghanistan Asia         28.8      779.
 2 Afghanistan Asia         30.3      821.
 3 Afghanistan Asia         32.0      853.
 4 Afghanistan Asia         34.0      836.
 5 Afghanistan Asia         36.1      740.
 6 Afghanistan Asia         38.4      786.
 7 Afghanistan Asia         39.9      978.
 8 Afghanistan Asia         40.8      852.
 9 Afghanistan Asia         41.7      649.
10 Afghanistan Asia         41.8      635.
# ... with 1,694 more rows
```

---

## `select()`

**Selection helpers**

These *selection helpers* match variables according to a given pattern.

+ `starts_with()` starts with a prefix

+ `ends_with()` ends with a suffix

+ `contains()` contains a literal string

+ `matches()` matches a regular expression
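
A minimal sketch of the four helpers inside `select()`. The `toy` tibble and its column names are made up so that each helper picks a different set of columns.

```r
library(dplyr)

# Made-up data; column names chosen to show each helper
toy <- tibble(
  pop_1990 = 1, pop_2000 = 2,
  gdp_1990 = 3, gdp_2000 = 4,
  country  = "A"
)

select(toy, starts_with("pop"))           # pop_1990, pop_2000
select(toy, ends_with("2000"))            # pop_2000, gdp_2000
select(toy, contains("_19"))              # pop_1990, gdp_1990
select(toy, matches("^[a-z]+_[0-9]+$"))   # the four name_year columns
```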

---

## `filter()`

![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/filter.png)

```r
data
```

```
# A tibble: 10 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
```

```r
filter(data, country == "Philippines")
```

```
# A tibble: 10 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Philippines Asia       1952    47.8 22438691     1273.
 2 Philippines Asia       1957    51.3 26072194     1548.
 3 Philippines Asia       1962    54.8 30325264     1650.
 4 Philippines Asia       1967    56.4 35356600     1814.
 5 Philippines Asia       1972    58.1 40850141     1989.
 6 Philippines Asia       1977    60.1 46850962     2373.
 7 Philippines Asia       1982    62.1 53456774     2603.
 8 Philippines Asia       1987    64.2 60017788     2190.
 9 Philippines Asia       1992    66.5 67185766     2279.
10 Philippines Asia       1997    68.6 75012988     2537.
```

---

## `mutate()`

![](https://favstats.shinyapps.io/r_intro/_w_dfe6b732/images/mutate.png)

The `mutate` function will take a statement similar to this:

+ `variable_name` = `do_some_calculation`

+ `variable_name` will be attached at the end of the dataset.

---

## `mutate()`

Let's calculate total GDP (`GDP`) as GDP per capita times population.

```r
data
```

```r
mutate(data, GDP = gdpPercap * pop)
```

```
# A tibble: 1,704 x 7
   country     continent  year lifeExp      pop gdpPercap          GDP
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
 2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
 3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
 4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
 5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
 6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
 7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
 8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
 9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
# ... with 1,694 more rows
```

---

## `group_by` and `summarise()`

+ Use when you want to aggregate your data (by groups).

+ Sometimes we want to calculate group statistics.

---

## `group_by` and `summarise()`

Suppose we want to know the average population by continent.
.leftcol40[

```r
data
```

]

```r
grouped_by_continent <- group_by(data, continent)
summarise(grouped_by_continent, avg_pop = mean(pop))
```

```
# A tibble: 5 x 2
  continent   avg_pop
  <fct>         <dbl>
1 Africa     9916003.
2 Americas  24504795.
3 Asia      77038722.
4 Europe    17169765.
5 Oceania    8874672.
```

---

## `group_by` and `summarise()`

Suppose we want to know the average population by continent.
.leftcol40[

```r
data
```

]

```r
grouped_by_continent <- group_by(data, continent)
summarised_data <- summarise(grouped_by_continent, avg_pop = mean(pop))
summarised_data
```

```
# A tibble: 5 x 2
  continent   avg_pop
  <fct>         <dbl>
1 Africa     9916003.
2 Americas  24504795.
3 Asia      77038722.
4 Europe    17169765.
5 Oceania    8874672.
```

---

## `%>%` pipe operator

---

## The %>% operator

The **`%>%`** operator helps you write code that is easier to read and understand.

Calculating average population by continent **without %>%**

```r
grouped_by_continent <- group_by(data, continent)
summarised_data <- summarise(grouped_by_continent, avg_pop = mean(pop))
summarised_data
```

```
# A tibble: 5 x 2
  continent   avg_pop
  <fct>         <dbl>
1 Africa     9916003.
2 Americas  24504795.
3 Asia      77038722.
4 Europe    17169765.
5 Oceania    8874672.
```

Calculating average population by continent **with %>%**

```r
data %>% 
  group_by(continent) %>% 
  summarise(avg_pop = mean(pop))
```

```
# A tibble: 5 x 2
  continent   avg_pop
  <fct>         <dbl>
1 Africa     9916003.
2 Americas  24504795.
3 Asia      77038722.
4 Europe    17169765.
5 Oceania    8874672.
```

---

## The %>% operator

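Calculating the average life expectancy of each Asian country **without %>%**

```r
filtered_by_asia <- filter(data, continent == "Asia")
grouped_by_country <- group_by(filtered_by_asia, country)
summarised_by_country <- summarise(grouped_by_country, avg_lifeExp = mean(lifeExp))
summarised_by_country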
```

```
# A tibble: 7 x 2
  country          avg_lifeExp
  <fct>                  <dbl>
1 Afghanistan             37.5
2 Bahrain                 65.6
3 Bangladesh              49.8
4 Cambodia                47.9
5 China                   61.8
6 Hong Kong, China        73.5
7 India                   53.2
```

Calculating the average life expectancy of each Asian country **with %>%**

```r
data %>% 
  filter(continent == "Asia") %>% 
  group_by(country) %>% 
  summarise(avg_lifeExp = mean(lifeExp))
```

---

## The %>% operator

Suppose we want to know the average life expectancy of each Asian country per year.

Calculating average life expectancy by country and year **without %>%**

```r
filtered_by_asia <- filter(data, continent == "Asia")
grouped_by_country_year <- group_by(filtered_by_asia, country, year)
summarise(grouped_by_country_year, avg_lifeExp = mean(lifeExp))
```

```
# A tibble: 5 x 3
# Groups:   country [1]
  country      year avg_lifeExp
  <fct>       <int>       <dbl>
1 Afghanistan  1952        28.8
2 Afghanistan  1957        30.3
3 Afghanistan  1962        32.0
4 Afghanistan  1967        34.0
5 Afghanistan  1972        36.1
```

Calculating average life expectancy by country and year **with %>%**

```r
data %>% 
  filter(continent == "Asia") %>% 
  group_by(country, year) %>% 
  summarise(avg_lifeExp = mean(lifeExp))
```

---

# Reshaping data
----

---

## Wide vs Long data format

---

## Wide vs Long data format

**`pivot_longer`**

```
pivot_longer(data, cols = ..., names_to = ..., values_to = ...)
```

+ Transforms wide-format data to long format

---

## Wide vs Long data format

**`pivot_longer`**

```r
library(readxl)
urbanpop <- read_excel("data/urbanpop.xlsx")
print(urbanpop)
```

```
# A tibble: 10 x 8
   country               `1960`    `1961`    `1962`    `1963`    `1964`    `1965`    `1966`
   <chr>                  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Afghanistan           769308   814923.   858522.   903914.   951226.  1000582.  1058743.
 2 Albania               494443   511803.   529439.   547377.   565572.   583983.   602512.
 3 Algeria              3293999  3515148.  3739963.  3973289.  4220987.  4488176.  4649105.
 4 American Samoa            NA    13660.    14166.    14759.    15396.    16045.    16693.
 5 Andorra                   NA     8724.     9700.    10748.    11866.    13053.    14217.
 6 Angola                521205   548265.   579695.   612087.   645262.   679109.   717833.
 7 Antigua and Barbuda    21699    21635.    21664.    21741.    21830.    21909.    22003.
 8 Argentina           15224096 15545223. 15912120. 16282345. 16654412. 17027712. 17389812.
 9 Armenia               957974  1008597.  1061426.  1115612.  1170683.  1226270.  1281582.
10 Aruba                  24996    28140.    28533.    28763.    28923.    29083.    29252.
```

```r
pivot_longer(data = urbanpop, cols = "1960":"1966",
             names_to = "year",
             values_to = "pop")
```

```
# A tibble: 10 x 3
   country     year       pop
   <chr>       <chr>    <dbl>
 1 Afghanistan 1960   769308 
 2 Afghanistan 1961   814923.
 3 Afghanistan 1962   858522.
 4 Afghanistan 1963   903914.
 5 Afghanistan 1964   951226.
 6 Afghanistan 1965  1000582.
 7 Afghanistan 1966  1058743.
 8 Albania     1960   494443 
 9 Albania     1961   511803.
10 Albania     1962   529439.
```

---

## Wide vs Long data format

**`pivot_wider`**

```
pivot_wider(data, names_from = ..., values_from = ...)
```

+ Transforms long-format data to wide format

---

## Wide vs Long data format

**`pivot_wider`**

```r
potato_data
```

```
# A tibble: 1,280 x 3
      id measure   value
   <int> <chr>     <dbl>
 1     1 area        1  
 2     1 temp        1  
 3     1 size        1  
 4     1 storage     1  
 5     1 method      1  
 6     1 texture     2.9
 7     1 flavor      3.2
 8     1 moistness   3  
 9     2 area        1  
10     2 temp        1  
# ... with 1,270 more rows
```

```r
pivot_wider(data = potato_data, names_from = "measure", values_from = "value")
```

```
# A tibble: 160 x 9
      id  area  temp  size storage method texture flavor moistness
   <int> <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
 1     1     1     1     1       1      1     2.9    3.2       3  
 2     2     1     1     1       1      2     2.3    2.5       2.6
 3     3     1     1     1       1      3     2.5    2.8       2.8
 4     4     1     1     1       1      4     2.1    2.9       2.4
 5     5     1     1     1       1      5     1.9    2.8       2.2
 6     6     1     1     1       2      1     1.8    3         1.7
 7     7     1     1     1       2      2     2.6    3.1       2.4
 8     8     1     1     1       2      3     3      3         2.9
 9     9     1     1     1       2      4     2.2    3.2       2.5
10    10     1     1     1       2      5     2      2.8       1.9
# ... with 150 more rows
```
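
The two functions are inverses of each other, which a small round trip makes visible. This is a sketch with a made-up `wide` tibble, not the workshop's `urbanpop` or `potato_data`.

```r
library(tidyr)

# Made-up wide data: one column per year
wide <- tibble(
  country = c("A", "B"),
  `2000`  = c(10, 20),
  `2001`  = c(11, 21)
)

# wide -> long: the year columns become rows
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "pop")
long

# long -> wide: back to one column per year
wide_again <- pivot_wider(long, names_from = "year", values_from = "pop")
wide_again
```

Here `cols = -country` means "every column except `country`", which is often shorter than listing the year columns.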

---

# Basic management of data types
----

---

## Strings

+ A string is text made up of one or more characters.

+ Strings are enclosed in single or double quotes.

+ The tidyverse **`stringr`** package provides tools for manipulating strings.

---

## Strings

`stringr` [cheatsheets](https://github.com/rstudio/cheatsheets/blob/main/strings.pdf)

---

## Strings

**Converting strings to lower/ upper/ title case**

+ `str_to_lower()`

+ `str_to_upper()`

+ `str_to_title()`

```r
# stringr::fruit is stored in lower case; convert it to match the output below
fruits <- str_to_upper(stringr::fruit)
fruits
```

```
 [1] "APPLE"             "APRICOT"           "AVOCADO"           "BANANA"           
 [5] "BELL PEPPER"       "BILBERRY"          "BLACKBERRY"        "BLACKCURRANT"     
 [9] "BLOOD ORANGE"      "BLUEBERRY"         "BOYSENBERRY"       "BREADFRUIT"       
[13] "CANARY MELON"      "CANTALOUPE"        "CHERIMOYA"         "CHERRY"           
[17] "CHILI PEPPER"      "CLEMENTINE"        "CLOUDBERRY"        "COCONUT"          
[21] "CRANBERRY"         "CUCUMBER"          "CURRANT"           "DAMSON"           
[25] "DATE"              "DRAGONFRUIT"       "DURIAN"            "EGGPLANT"         
[29] "ELDERBERRY"        "FEIJOA"            "FIG"               "GOJI BERRY"       
[33] "GOOSEBERRY"        "GRAPE"             "GRAPEFRUIT"        "GUAVA"            
[37] "HONEYDEW"          "HUCKLEBERRY"       "JACKFRUIT"         "JAMBUL"           
[41] "JUJUBE"            "KIWI FRUIT"        "KUMQUAT"           "LEMON"            
[45] "LIME"              "LOQUAT"            "LYCHEE"            "MANDARINE"        
[49] "MANGO"             "MULBERRY"          "NECTARINE"         "NUT"              
[53] "OLIVE"             "ORANGE"            "PAMELO"            "PAPAYA"           
[57] "PASSIONFRUIT"      "PEACH"             "PEAR"              "PERSIMMON"        
[61] "PHYSALIS"          "PINEAPPLE"         "PLUM"              "POMEGRANATE"      
[65] "POMELO"            "PURPLE MANGOSTEEN" "QUINCE"            "RAISIN"           
[69] "RAMBUTAN"          "RASPBERRY"         "REDCURRANT"        "ROCK MELON"       
[73] "SALAL BERRY"       "SATSUMA"           "STAR FRUIT"        "STRAWBERRY"       
[77] "TAMARILLO"         "TANGERINE"         "UGLI FRUIT"        "WATERMELON"       
```

---

## Strings

**Converting strings to lower/ upper/ title case**

+ `str_to_lower()`

```r
fruits
```

```
 [1] "APPLE"        "APRICOT"      "AVOCADO"      "BANANA"       "BELL PEPPER"  "BILBERRY"    
 [7] "BLACKBERRY"   "BLACKCURRANT" "BLOOD ORANGE" "BLUEBERRY"    "BOYSENBERRY"  "BREADFRUIT"  
[13] "CANARY MELON" "CANTALOUPE"   "CHERIMOYA"    "CHERRY"       "CHILI PEPPER" "CLEMENTINE"  
[19] "CLOUDBERRY"   "COCONUT"      "CRANBERRY"    "CUCUMBER"     "CURRANT"      "DAMSON"      
[25] "DATE"         "DRAGONFRUIT"  "DURIAN"       "EGGPLANT"     "ELDERBERRY"   "FEIJOA"      
```

```r
fruits %>% str_to_lower()
```

```
 [1] "apple"        "apricot"      "avocado"      "banana"       "bell pepper"  "bilberry"    
 [7] "blackberry"   "blackcurrant" "blood orange" "blueberry"    "boysenberry"  "breadfruit"  
[13] "canary melon" "cantaloupe"   "cherimoya"    "cherry"       "chili pepper" "clementine"  
[19] "cloudberry"   "coconut"      "cranberry"    "cucumber"     "currant"      "damson"      
[25] "date"         "dragonfruit"  "durian"       "eggplant"     "elderberry"   "feijoa"      
```

---

## Strings

**Converting strings to lower/ upper/ title case**

+ `str_to_upper()`

```r
fruits 
```

```r
fruits %>% str_to_upper()
```

---

## Strings

**Converting strings to lower/ upper/ title case**

+ `str_to_title()`

```r
fruits
```

```r
fruits %>% str_to_title()
```

```
 [1] "Apple"        "Apricot"      "Avocado"      "Banana"       "Bell Pepper"  "Bilberry"    
 [7] "Blackberry"   "Blackcurrant" "Blood Orange" "Blueberry"    "Boysenberry"  "Breadfruit"  
[13] "Canary Melon" "Cantaloupe"   "Cherimoya"    "Cherry"       "Chili Pepper" "Clementine"  
[19] "Cloudberry"   "Coconut"      "Cranberry"    "Cucumber"     "Currant"      "Damson"      
[25] "Date"         "Dragonfruit"  "Durian"       "Eggplant"     "Elderberry"   "Feijoa"      
```

---

## Strings

**Joining of multiple strings**

+ `str_c()`

```r
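# one row per country: mean population over all years (inferred from the output below)
pop_data <- gapminder::gapminder %>% 
  group_by(continent, country) %>% summarise(pop = mean(pop), .groups = "drop")
pop_data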
```

```
# A tibble: 10 x 3
   continent country                        pop
   <fct>     <fct>                        <dbl>
 1 Africa    Algeria                  19875406.
 2 Africa    Angola                    7309390.
 3 Africa    Benin                     4017497.
 4 Africa    Botswana                   971186.
 5 Africa    Burkina Faso              7548677.
 6 Africa    Burundi                   4651608.
 7 Africa    Cameroon                  9816648.
 8 Africa    Central African Republic  2560963 
 9 Africa    Chad                      5329256.
10 Africa    Comoros                    361684.
```

```r
pop_data %>% 
  mutate(label = str_c(continent, country, sep = ": "))
```

```
# A tibble: 9 x 4
  continent country                        pop label                           
  <fct>     <fct>                        <dbl> <chr>                           
1 Africa    Algeria                  19875406. Africa: Algeria                 
2 Africa    Angola                    7309390. Africa: Angola                  
3 Africa    Benin                     4017497. Africa: Benin                   
4 Africa    Botswana                   971186. Africa: Botswana                
5 Africa    Burkina Faso              7548677. Africa: Burkina Faso            
6 Africa    Burundi                   4651608. Africa: Burundi                 
7 Africa    Cameroon                  9816648. Africa: Cameroon                
8 Africa    Central African Republic  2560963  Africa: Central African Republic
9 Africa    Chad                      5329256. Africa: Chad                    
```

---

## Strings

**Replace matched patterns in the strings**

+ `str_replace()`

+ `str_replace_all()`
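
The difference between the two is how many matches they change. A minimal sketch with made-up strings:

```r
library(stringr)

x <- c("banana", "papaya")   # made-up example strings

# str_replace() changes only the first match in each string
first_only <- str_replace(x, "a", "A")
first_only    # "bAnana" "pApaya"

# str_replace_all() changes every match
every_match <- str_replace_all(x, "a", "A")
every_match   # "bAnAnA" "pApAyA"
```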

---

## Strings

**Replace matched patterns in the strings**

+ `str_replace()`

+ `str_replace(string, pattern, replacement)`

```r
pop_data %>% 
  count(continent)
```

```
# A tibble: 5 x 2
  continent     n
  <fct>     <int>
1 Africa       52
2 Americas     25
3 Asia         33
4 Europe       30
5 Oceania       2
```

```r
pop_data %>% count(continent) %>% 
  mutate(continent_2 = str_replace(continent, "Americas", "USA"))
```

```
# A tibble: 5 x 3
  continent     n continent_2
  <fct>     <int> <chr>      
1 Africa       52 Africa     
2 Americas     25 USA        
3 Asia         33 Asia       
4 Europe       30 Europe     
5 Oceania       2 Oceania    
```

---

## Factors

+ R represents categorical data with factors.

+ A factor is an integer vector with a `levels` attribute that maps each integer to a category label.

+ The tidyverse **`forcats`** package provides tools for working with factors.
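
The integer-plus-levels structure can be inspected directly in base R. A minimal sketch with a made-up `sizes` vector:

```r
# Made-up example vector with an explicit level order
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)      # "small" "medium" "large"
as.integer(sizes)  # 1 3 2 1  (position of each value in levels)
typeof(sizes)      # "integer"
```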

---

## Factors

`forcats` [cheatsheets](https://github.com/rstudio/cheatsheets/blob/main/factors.pdf)

---

## Factors

+ `factor()`

+ `fct_reorder()`

+ `fct_lump()`

---

## Factors

+ `factor()`

+ encode a vector as a factor
  
  + e.g., category & enumerated type

```r
factor(x = character(), levels, labels = levels)
```

**Example**

```r
rank <- c("third", "second", "fourth", "first")
sort(rank)
```

```
[1] "first"  "fourth" "second" "third" 
```

---

## Factors

+ `factor()`

+ encode a vector as a factor
  
  + e.g., category & enumerated type

```r
factor(x = character(), levels, labels = levels)
```

**Example**

```r
rank_levels <- c("first", "second", "third", "fourth")
```

```r
rank_factor <- factor(x = rank, levels = rank_levels)
rank_factor
```

```
[1] third  second fourth first 
Levels: first second third fourth
```

```r
sort(rank_factor)
```

```
[1] first  second third  fourth
Levels: first second third fourth
```

---

## Factors

```r
mpg %>% select(model, class, displ) %>% head(4)
```

```
# A tibble: 4 x 3
  model class   displ
  <chr> <chr>   <dbl>
1 a4    compact   1.8
2 a4    compact   1.8
3 a4    compact   2  
4 a4    compact   2  
```

```r
mpg %>% 
  mutate(class = factor(class)) %>%
  select(model, class, displ)
```

```
# A tibble: 234 x 3
   model      class   displ
   <chr>      <fct>   <dbl>
 1 a4         compact   1.8
 2 a4         compact   1.8
 3 a4         compact   2  
 4 a4         compact   2  
 5 a4         compact   2.8
 6 a4         compact   2.8
 7 a4         compact   3.1
 8 a4 quattro compact   1.8
 9 a4 quattro compact   1.8
10 a4 quattro compact   2  
# ... with 224 more rows
```

---

## Factors

```r
mpg %>% count(class)
```

```
# A tibble: 7 x 2
  class          n
  <chr>      <int>
1 2seater        5
2 compact       47
3 midsize       41
4 minivan       11
5 pickup        33
6 subcompact    35
7 suv           62
```

```r
mpg %>% 
  mutate(class = factor(class)) %>%
  select(model, class, displ)
```

---

## Factors

+ `fct_reorder()`

+ reorder factor levels by sorting along another variable

```r
fct_reorder(.f, .x, .desc, ...)
```

**Sample data**

```r
relig_summary <- gss_cat %>% 
  group_by(relig) %>% 
  summarise(age = mean(age, na.rm = TRUE),
            tvhours = mean(tvhours, na.rm = TRUE),
            n = n())
```

```r
relig_summary
```

```
# A tibble: 15 x 4
   relig                     age tvhours     n
   <fct>                   <dbl>   <dbl> <int>
 1 No answer                49.5    2.72    93
 2 Don't know               35.9    4.62    15
 3 Inter-nondenominational  40.0    2.87   109
 4 Native american          38.9    3.46    23
 5 Christian                40.1    2.79   689
 6 Orthodox-christian       50.4    2.42    95
 7 Moslem/islam             37.6    2.44   104
 8 Other eastern            45.9    1.67    32
 9 Hinduism                 37.7    1.89    71
10 Buddhism                 44.7    2.38   147
11 Other                    41.0    2.73   224
12 None                     41.2    2.71  3523
13 Jewish                   52.4    2.52   388
14 Catholic                 46.9    2.96  5124
15 Protestant               49.9    3.15 10846
```

---

## Factors

+ `fct_reorder()`

+ reorder factor levels by sorting along another variable

```r
fct_reorder(.f, .x, .desc, ...)
```

**Sample data**

```r
relig_summary <- gss_cat %>% 
  group_by(relig) %>% 
  summarise(age = mean(age, na.rm = TRUE),
            tvhours = mean(tvhours, na.rm = TRUE),
            n = n()) %>% 
  ungroup()
```

```r
relig_summary %>% ggplot(aes(x = tvhours, y = relig)) + geom_point(size = 4)
```

---

## Factors

+ `fct_reorder()`

+ reorder factor levels by sorting along another variable

```r
fct_reorder(.f, .x, .desc, ...)
```

**Reordered data**

```r
relig_summary_reordered <- relig_summary %>% 
  mutate(relig = fct_reorder(relig, tvhours))
```

```r
relig_summary_reordered %>% ggplot(aes(x = tvhours, y = relig)) + geom_point(size = 4)
```
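
To see what `fct_reorder()` does to the levels themselves, including the `.desc` argument from the signature above, here is a minimal sketch with made-up vectors `grp` and `val`:

```r
library(forcats)

# Made-up data: three groups and one value for each
grp <- factor(c("a", "b", "c"))
val <- c(2, 3, 1)

levels(grp)                                  # "a" "b" "c"  (alphabetical)
levels(fct_reorder(grp, val))                # "c" "a" "b"  (ascending val)
levels(fct_reorder(grp, val, .desc = TRUE))  # "b" "a" "c"  (descending val)
```

Only the level order changes; the values in the vector stay where they are, which is why the effect shows up in plots rather than in printed tables.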

---

## Factors

+ `fct_lump()`

+ lump together factor levels into "other"

```r
fct_lump(f, n, other_level, ...)
```

**lumped factors**

```r
relig_summary_lumped <- gss_cat %>% 
  filter(!relig %in% c("Other", "None")) %>% 
  mutate(relig = fct_lump(f = relig, n = 7, other_level = "Other religion")) %>%
  group_by(relig) %>% 
  summarise(age = mean(age, na.rm = TRUE),
            tvhours = mean(tvhours, na.rm = TRUE),
            n = n()) %>% 
  ungroup() %>% 
  mutate(relig = fct_reorder(relig, tvhours))
```

```r
relig_summary_lumped
```

```
# A tibble: 8 x 4
  relig                     age tvhours     n
  <fct>                   <dbl>   <dbl> <int>
1 Inter-nondenominational  40.0    2.87   109
2 Christian                40.1    2.79   689
3 Moslem/islam             37.6    2.44   104
4 Buddhism                 44.7    2.38   147
5 Jewish                   52.4    2.52   388
6 Catholic                 46.9    2.96  5124
7 Protestant               49.9    3.15 10846
8 Other religion           45.4    2.53   329
```

---

## Factors

+ `fct_lump()`

+ lump together factor levels into "other"

```r
fct_lump(f, n, other_level, ...)
```

```r
relig_summary_lumped %>% ggplot(aes(x = tvhours, y = relig)) + geom_point(size = 4)
```
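
The same lumping can be checked on a small vector without any plotting. A minimal sketch, where the factor `x` and the label "Other group" are made up for illustration:

```r
library(forcats)

# Made-up data: "a" and "b" are common, "c" and "d" are rare
x <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# keep the 2 most frequent levels and lump the rest together
lumped <- fct_lump(x, n = 2, other_level = "Other group")

levels(lumped)   # "a" "b" "Other group"
table(lumped)
```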

---

# What a start!

---

# Thank you!

#### Slides created via the R packages:

### xaringan by Yihui

### xaringanthemer and xaringanExtra by Garrick
