April 24, 2019

Outline

What we'll cover today

  • Intro & general tips
  • S1: R-core: un-wrapping a table of data
  • S2: tidyverse: reading census xlsx
  • S3: tidyverse: sex and violins

General Tips 1

  • Be practical: use the best tool you are comfortable with (where possible)
  • Learn to navigate with and run basic command in Terminal.app or WSL
  • Be humble: Look for answers and ask for help
  • Use version control (git) or at least a cloud-sync backup (Dropbox, GDrive)
  • Minimize the amount of manual processing of data / results

General Tips 2

  • Use consistent style
  • Use comments to document your code
  • Use variable names that are unique and identifying
    • e.g. temps or temperatures instead of x or v1
  • Avoid hard-coded, repetitive code
    • Write functions to re-use code

What R is

  • Interpreted, object-oriented language
  • Optimized matrix/vector functions
  • Large and diverse package community

What R isn't

  • Beginner friendly
  • General purpose
  • Memory efficient

R Tips

  • Use RStudio where available (when performance isn't important)
  • Avoid but do not fear loops (they've gotten better)
  • Use and request tidy dataWickham, H. "Tidy Data" J. Stat. Soft., 2014
  • Each variable is a column
  • Each observation or case is a row
  • pivot / hierarchical tables are for presentation, not analysis

Reminder: R uses NA for character and numeric data (Unlike SAS).

S1: Publication formatted table

Scenario 1: You're given the pasted text from a manuscript table to analyze

Toxoplasmosis Table

S1: Raw Text

S1: Read the table

Reminder: read.table skips empty lines

colnames <- c("Cases", "tested", "Rainfall")
toxo <- read.table(file="data/toxoplasmosis.txt")
names(toxo) <- rep(colnames, 2)
toxo

S1: Stitch back together

toxo <- rbind(toxo[,1:3], toxo[,4:6])
summary(toxo)
     Cases           tested         Rainfall   
 Min.   : 0.00   Min.   : 1.00   Min.   :1620  
 1st Qu.: 2.00   1st Qu.: 6.00   1st Qu.:1772  
 Median : 4.50   Median :10.50   Median :1895  
 Mean   :10.47   Mean   :20.21   Mean   :1908  
 3rd Qu.: 8.75   3rd Qu.:23.50   3rd Qu.:2000  
 Max.   :53.00   Max.   :82.00   Max.   :2292  

S2: Census data from Excel

Scenario 2: you are given an Excel file with census data from census.gov, retrieved Apr 22, 2019 to analyze.

Let's try using the readxl package

S2: Read in the data

#install.packages(c("tidyverse","readxl"))
library(readxl)
census_xlsx <- read_excel("data/nst-est2018-01.xlsx")
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* … and 6 more problems

S2: what went wrong?

census_xlsx

S2: Look at the columns

Find top-left cell coordinate to specify for import

S2: Look at the end of the rows

S2: Look at the end of the rows

Find bottom-right cell coordinate to specify for import

S2: Read the Excel file again

census_xlsx <- read_excel("data/nst-est2018-01.xlsx", range = "A4:L62")
New names:
* `` -> ...1

S2: Cleaned but not tidy

  • need to rename first column
  • columns as years might be ok for some analysis
  • "year" column would make date easier to analyze, join to other datasets

tidyverse

Over past ~5 years tidyversenice cheatsheets here if you're into that has become very popular in data science

Read in data

Then manipulate & plot

Load tidyverse

library("tidyverse")
── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

tidyr & dplyr

tidyr: reshaping data

  • gather() multiple columns to key-value pairs; makes “wide” data longer
  • drop_na() drops rows with NA values

dplyr: a grammar of data manipulation

dplyr concept: pipes

tidyverse examples often use pipes

  • Borrowed from functional programming / shell languages
  • Create effecient, readable chains of functions
a %>% b(x) %>% c(y)

becomes

c(b(a, x), y)

S2: Through the pipes

census_tibb <- census_xlsx %>%
  rename(Region = ...1) %>%
  drop_na() %>%
  select(-one_of(c("Census", "Estimates Base"))) %>%
  gather(`2010`:`2018`, key = year, value = pop_estimate)

S2: Tidy census data

census_tibb

S2: Tidy census data

census_tibb$year <- as.numeric(census_tibb$year)
census_tibb

S2: All together

census_xlsx %>%
  rename(Region = ...1) %>%
  drop_na() %>%
  select(-one_of(c("Census", "Estimates Base"))) %>%
  gather(`2010`:`2018`, key = year, value = pop_estimate) %>%
  mutate(year = as.integer(year))

S3: Sex and Violins

Scenario 3: You're given a summary table to run some linear models on

S3: plain text

The raw text looks like this:

Chicago_Symphony  68 14 20 10  4 10  1  9  0

LA_Philharmonic   66 15 17  8  5 11  1  9  0

Cleveland_Orch    65 16 17  7  5  9  2  9  0
...

S3: Column name prep work

inst_types <- c("violin", "viola", "cello", "bass")
sexes <- c("male", "female")
colnames_tmp <- expand.grid(sexes, inst_types)
colnames_tmp

S3: Generate columns

colnames <- as.vector(mapply(paste, sep="_", 
                     colnames_tmp[,1], colnames_tmp[,2]))
colnames <- c("name", "total", colnames)
colnames
 [1] "name"          "total"         "male_violin"   "female_violin"
 [5] "male_viola"    "female_viola"  "male_cello"    "female_cello" 
 [9] "male_bass"     "female_bass"  

S3: Read in table

orchest_raw <- read_table("data/symphony.txt", col_names=colnames)
Parsed with column specification:
cols(
  name = col_character(),
  total = col_double(),
  male_violin = col_double(),
  female_violin = col_double(),
  male_viola = col_double(),
  female_viola = col_double(),
  male_cello = col_double(),
  female_cello = col_double(),
  male_bass = col_double(),
  female_bass = col_double()
)

S3: Examine table

orchest_raw

S3: Gather and separate

orchestra <- gather(orchest_raw, male_violin:female_bass, key='sex_instrument', value="count")
orchestra <- separate(orchestra, 'sex_instrument', into=c('sex', 'instrument'))
orchestra

S3: Dummy variables

  • In R, called design matrices
  • Make sure desired variables (columns) are factors
  • Use model.matrix
single_mat <- model.matrix(~instrument, data = orchestra)
head(single_mat)
  (Intercept) instrumentcello instrumentviola instrumentviolin
1           1               0               0                1
2           1               0               0                1
3           1               0               0                1
4           1               0               0                1
5           1               0               0                1
6           1               0               0                1

S3: Dummy variables

  • In R, called design matrices
  • Make sure desired variables (columns) are factors
  • Use model.matrix

S3: More dummy variables

double_mat <- model.matrix(~instrument+sex, data = orchestra)
head(double_mat)
  (Intercept) instrumentcello instrumentviola instrumentviolin sexmale
1           1               0               0                1       1
2           1               0               0                1       1
3           1               0               0                1       1
4           1               0               0                1       1
5           1               0               0                1       1
6           1               0               0                1       1

End!

Thanks, questions?