CDE-R: Import, Clean, and Analyze

April 24, 2019

Outline

What we'll cover today

Intro & general tips
S1: R-core: un-wrapping a table of data
S2: tidyverse: reading census xlsx
S3: tidyverse: sex and violins

General Tips 1

Be practical: use the best tool you are comfortable with (where possible)
Learn to navigate with and run basic command in Terminal.app or WSL
Be humble: Look for answers and ask for help
Use version control (git) or at least a cloud-sync backup (Dropbox, GDrive)
Minimize the amount of manual processing of data / results

General Tips 2

Use consistent style
Use comments to document your code
Use variable names that are unique and identifying
- e.g. temps or temperatures instead of x or v1
Avoid hard-coded, repetitive code
- Write functions to re-use code

What R is

Interpreted, object-oriented language
Optimized matrix/vector functions
Large and diverse package community
- CRAN, Bioconductor, Github

What R isn't

Beginner friendly
General purpose
Memory efficient

`R` Tips

Use RStudio where available (when performance isn't important)
Avoid but do not fear loops (they've gotten better)
Use and request tidy dataWickham, H. "Tidy Data" J. Stat. Soft., 2014
Each variable is a column
Each observation or case is a row
pivot / hierarchical tables are for presentation, not analysis

Reminder: R uses NA for character and numeric data (Unlike SAS).

S1: Publication formatted table

Scenario 1: You're given the pasted text from a manuscript table to analyze

Toxoplasmosis Table

S1: Raw Text

data/toxoplasmosis.txt

  2  4   1735  3   10   1936 

  1  5   2000  3   10   1973 

  2  2   1750  3    5   1800 

  2  8   1750  7   19   2077 
...

S1: Read the table

Reminder: read.table skips empty lines

colnames <- c("Cases", "tested", "Rainfall")
toxo <- read.table(file="data/toxoplasmosis.txt")
names(toxo) <- rep(colnames, 2)
toxo

S1: Stitch back together

toxo <- rbind(toxo[,1:3], toxo[,4:6])
summary(toxo)

     Cases           tested         Rainfall   
 Min.   : 0.00   Min.   : 1.00   Min.   :1620  
 1st Qu.: 2.00   1st Qu.: 6.00   1st Qu.:1772  
 Median : 4.50   Median :10.50   Median :1895  
 Mean   :10.47   Mean   :20.21   Mean   :1908  
 3rd Qu.: 8.75   3rd Qu.:23.50   3rd Qu.:2000  
 Max.   :53.00   Max.   :82.00   Max.   :2292

S2: Census data from Excel

Scenario 2: you are given an Excel file with census data from census.gov, retrieved Apr 22, 2019 to analyze.

Let's try using the readxl package

S2: Read in the data

#install.packages(c("tidyverse","readxl"))
library(readxl)
census_xlsx <- read_excel("data/nst-est2018-01.xlsx")

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* … and 6 more problems

S2: what went wrong?

census_xlsx

S2: Look at the columns

Find top-left cell coordinate to specify for import

S2: Look at the end of the rows

S2: Look at the end of the rows

Find bottom-right cell coordinate to specify for import

S2: Read the Excel file again

census_xlsx <- read_excel("data/nst-est2018-01.xlsx", range = "A4:L62")

New names:
* `` -> ...1

S2: Cleaned but not tidy

need to rename first column
columns as years might be ok for some analysis
"year" column would make date easier to analyze, join to other datasets

`tidyverse`

Over past ~5 years tidyversenice cheatsheets here if you're into that has become very popular in data science

Read in data

Then manipulate & plot

Load `tidyverse`

library("tidyverse")

── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0

── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

`tidyr` & `dplyr`

`tidyr`: reshaping data

gather() multiple columns to key-value pairs; makes “wide” data longer
drop_na() drops rows with NA values

`dplyr`: a grammar of data manipulation

select() picks columns based on their names
rename() renames columns
mutate() create or modify data

`dplyr` concept: pipes

tidyverse examples often use pipes

Borrowed from functional programming / shell languages
Create effecient, readable chains of functions

a %>% b(x) %>% c(y)

becomes

c(b(a, x), y)

S2: Through the pipes

census_tibb <- census_xlsx %>%
  rename(Region = ...1) %>%
  drop_na() %>%
  select(-one_of(c("Census", "Estimates Base"))) %>%
  gather(`2010`:`2018`, key = year, value = pop_estimate)

S2: Tidy census data

census_tibb

S2: Tidy census data

census_tibb$year <- as.numeric(census_tibb$year)
census_tibb

S2: All together

census_xlsx %>%
  rename(Region = ...1) %>%
  drop_na() %>%
  select(-one_of(c("Census", "Estimates Base"))) %>%
  gather(`2010`:`2018`, key = year, value = pop_estimate) %>%
  mutate(year = as.integer(year))

S3: Sex and Violins

Scenario 3: You're given a summary table to run some linear models on

S3: plain text

The raw text looks like this:

Chicago_Symphony  68 14 20 10  4 10  1  9  0

LA_Philharmonic   66 15 17  8  5 11  1  9  0

Cleveland_Orch    65 16 17  7  5  9  2  9  0
...

S3: Column name prep work

inst_types <- c("violin", "viola", "cello", "bass")
sexes <- c("male", "female")
colnames_tmp <- expand.grid(sexes, inst_types)
colnames_tmp

S3: Generate columns

colnames <- as.vector(mapply(paste, sep="_", 
                     colnames_tmp[,1], colnames_tmp[,2]))
colnames <- c("name", "total", colnames)
colnames

 [1] "name"          "total"         "male_violin"   "female_violin"
 [5] "male_viola"    "female_viola"  "male_cello"    "female_cello" 
 [9] "male_bass"     "female_bass"

S3: Read in table

orchest_raw <- read_table("data/symphony.txt", col_names=colnames)

Parsed with column specification:
cols(
  name = col_character(),
  total = col_double(),
  male_violin = col_double(),
  female_violin = col_double(),
  male_viola = col_double(),
  female_viola = col_double(),
  male_cello = col_double(),
  female_cello = col_double(),
  male_bass = col_double(),
  female_bass = col_double()
)

S3: Examine table

orchest_raw

S3: Gather and separate

orchestra <- gather(orchest_raw, male_violin:female_bass, key='sex_instrument', value="count")
orchestra <- separate(orchestra, 'sex_instrument', into=c('sex', 'instrument'))
orchestra

S3: Dummy variables

In R, called design matrices
Make sure desired variables (columns) are factors
Use model.matrix

single_mat <- model.matrix(~instrument, data = orchestra)
head(single_mat)

  (Intercept) instrumentcello instrumentviola instrumentviolin
1           1               0               0                1
2           1               0               0                1
3           1               0               0                1
4           1               0               0                1
5           1               0               0                1
6           1               0               0                1

S3: Dummy variables

In R, called design matrices
Make sure desired variables (columns) are factors
Use model.matrix

S3: More dummy variables

double_mat <- model.matrix(~instrument+sex, data = orchestra)
head(double_mat)

  (Intercept) instrumentcello instrumentviola instrumentviolin sexmale
1           1               0               0                1       1
2           1               0               0                1       1
3           1               0               0                1       1
4           1               0               0                1       1
5           1               0               0                1       1
6           1               0               0                1       1

End!

Thanks, questions?

Outline

What we'll cover today

General Tips 1

General Tips 2

What R is

What R isn't

R Tips

S1: Publication formatted table

S1: Raw Text

S1: Read the table

S1: Stitch back together

S2: Census data from Excel

S2: Read in the data

S2: what went wrong?

S2: Look at the columns

S2: Look at the end of the rows

S2: Look at the end of the rows

S2: Read the Excel file again

S2: Cleaned but not tidy

tidyverse

Load tidyverse

tidyr & dplyr

tidyr: reshaping data

dplyr: a grammar of data manipulation

dplyr concept: pipes

S2: Through the pipes

S2: Tidy census data

S2: Tidy census data

S2: All together

S3: Sex and Violins

S3: plain text

S3: Column name prep work

S3: Generate columns

S3: Read in table

S3: Examine table

S3: Gather and separate

S3: Dummy variables

S3: Dummy variables

S3: More dummy variables

End!

`R` Tips

`tidyverse`

Load `tidyverse`

`tidyr` & `dplyr`

`tidyr`: reshaping data

`dplyr`: a grammar of data manipulation

`dplyr` concept: pipes