- Intro & general tips
- S1:
R-core
: un-wrapping a table of data - S2:
tidyverse
: reading census xlsx - S3:
tidyverse
: sex and violins
April 24, 2019
R-core
: un-wrapping a table of datatidyverse
: reading census xlsxtidyverse
: sex and violinsTerminal.app
or WSLgit
) or at least a cloud-sync backup (Dropbox, GDrive)temps
or temperatures
instead of x
or v1
R
TipsReminder: R uses NA
for character and numeric data (Unlike SAS).
Scenario 1: You're given the pasted text from a manuscript table to analyze
Toxoplasmosis Table
2 4 1735 3 10 1936 1 5 2000 3 10 1973 2 2 1750 3 5 1800 2 8 1750 7 19 2077 ...
Reminder: read.table
skips empty lines
colnames <- c("Cases", "tested", "Rainfall") toxo <- read.table(file="data/toxoplasmosis.txt") names(toxo) <- rep(colnames, 2) toxo
toxo <- rbind(toxo[,1:3], toxo[,4:6]) summary(toxo)
Cases tested Rainfall Min. : 0.00 Min. : 1.00 Min. :1620 1st Qu.: 2.00 1st Qu.: 6.00 1st Qu.:1772 Median : 4.50 Median :10.50 Median :1895 Mean :10.47 Mean :20.21 Mean :1908 3rd Qu.: 8.75 3rd Qu.:23.50 3rd Qu.:2000 Max. :53.00 Max. :82.00 Max. :2292
Scenario 2: you are given an Excel file with census data
Let's try using the readxl
package
#install.packages(c("tidyverse","readxl")) library(readxl) census_xlsx <- read_excel("data/nst-est2018-01.xlsx")
New names: * `` -> ...2 * `` -> ...3 * `` -> ...4 * `` -> ...5 * `` -> ...6 * … and 6 more problems
census_xlsx
Find top-left cell coordinate to specify for import
Find bottom-right cell coordinate to specify for import
census_xlsx <- read_excel("data/nst-est2018-01.xlsx", range = "A4:L62")
New names: * `` -> ...1
tidyverse
Over past ~5 years tidyverse
Read in data
Then manipulate & plot
tidyverse
library("tidyverse")
── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1 ✔ purrr 0.3.2 ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1 ✔ tidyr 0.8.3 ✔ stringr 1.4.0 ✔ readr 1.3.1 ✔ forcats 0.4.0
── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
tidyr
& dplyr
tidyr
: reshaping datagather()
multiple columns to key-value pairs; makes “wide” data longerdrop_na()
drops rows with NA
valuesdplyr
: a grammar of data manipulationdplyr
concept: pipestidyverse
examples often use pipes
a %>% b(x) %>% c(y)
becomes
c(b(a, x), y)
census_tibb <- census_xlsx %>% rename(Region = ...1) %>% drop_na() %>% select(-one_of(c("Census", "Estimates Base"))) %>% gather(`2010`:`2018`, key = year, value = pop_estimate)
census_tibb
census_tibb$year <- as.numeric(census_tibb$year) census_tibb
census_xlsx %>% rename(Region = ...1) %>% drop_na() %>% select(-one_of(c("Census", "Estimates Base"))) %>% gather(`2010`:`2018`, key = year, value = pop_estimate) %>% mutate(year = as.integer(year))
Scenario 3: You're given a summary table to run some linear models on
The raw text looks like this:
Chicago_Symphony 68 14 20 10 4 10 1 9 0 LA_Philharmonic 66 15 17 8 5 11 1 9 0 Cleveland_Orch 65 16 17 7 5 9 2 9 0 ...
inst_types <- c("violin", "viola", "cello", "bass") sexes <- c("male", "female") colnames_tmp <- expand.grid(sexes, inst_types) colnames_tmp
colnames <- as.vector(mapply(paste, sep="_", colnames_tmp[,1], colnames_tmp[,2])) colnames <- c("name", "total", colnames) colnames
[1] "name" "total" "male_violin" "female_violin" [5] "male_viola" "female_viola" "male_cello" "female_cello" [9] "male_bass" "female_bass"
orchest_raw <- read_table("data/symphony.txt", col_names=colnames)
Parsed with column specification: cols( name = col_character(), total = col_double(), male_violin = col_double(), female_violin = col_double(), male_viola = col_double(), female_viola = col_double(), male_cello = col_double(), female_cello = col_double(), male_bass = col_double(), female_bass = col_double() )
orchest_raw
orchestra <- gather(orchest_raw, male_violin:female_bass, key='sex_instrument', value="count") orchestra <- separate(orchestra, 'sex_instrument', into=c('sex', 'instrument')) orchestra
model.matrix
single_mat <- model.matrix(~instrument, data = orchestra) head(single_mat)
(Intercept) instrumentcello instrumentviola instrumentviolin 1 1 0 0 1 2 1 0 0 1 3 1 0 0 1 4 1 0 0 1 5 1 0 0 1 6 1 0 0 1
model.matrix
double_mat <- model.matrix(~instrument+sex, data = orchestra) head(double_mat)
(Intercept) instrumentcello instrumentviola instrumentviolin sexmale 1 1 0 0 1 1 2 1 0 0 1 1 3 1 0 0 1 1 4 1 0 0 1 1 5 1 0 0 1 1 6 1 0 0 1 1
Thanks, questions?