# Dahlquist:GCAT-SEEK Workshop 2016

This page contains the electronic lab notebook generated by Kam D. Dahlquist in preparation for and at the GCAT-SEEK Workshop held on June 28-July 2, 2016 at Cal-State LA.

## 2016-06-28 Notes

Chapter 2: Sequence Processing and Quality Control

• go to workshop directory
```cd /N/dc2/projects/GCAT/workshop
```
• FastQ results file.

## Pre-Workshop Tutorials

### R Tutorials

This is based on An introduction to statistics in R, a series of tutorials by Mark Peterson for working in R.

Scripts and notebooks are stored in my GitHub repository.

#### Basics of Data in R (Chapter 1)

##### 2016-06-03
• Launch RStudio (by default it opens with R v3.3.0, which I just installed; note that I have other versions of R installed on my laptop).
• Create a script file
• File > New > R Script
• Save file using the menu item: File > Save, `Dahlquist_learningR.R` in folder My Documents > Travel-Conferences > GCAT-SEEK_2016 > scripts
```# Script written by Kam Dahlquist
# Learning R First Script
# 2016-06-03
```
• Did some simple arithmetic
```# Try some simple arithmetic
3+2
3-2
3*2
3/2
```
• Highlighting the lines and pressing `Ctrl+Enter` makes R compute them and show the results in the console window.
• Variables in R must begin with a letter, but can be any length and can include numerals. An "arrow" `<-` (a less-than symbol followed by a dash) is used to assign a value to a variable.
```# Store a simple variable, then display it
testVariable <- 3
testVariable
```
• Note that the first command assigns the variable, the second one displays it.
```# Use the variable
testVariable + 2
testVariable * 2
```
• Highlight and use `Ctrl+Enter` to compute answer.
• Change testVariable and run again.
```# Change test variable
testVariable <- 4
testVariable
```
• Note that in the console window, the `[1]` next to the answer is the index telling us that the value is the first element of the variable (also note that the index starts at 1, not zero).
```# Make a test vector, then display it
testVector <- c(1,3,4,6,8,10,15)
testVector
```
• Note that the function `c` stands for "concatenate". This function takes a series of numbers and returns them as a single vector.
• Note that a function takes arguments separated by commas inside of parentheses.
```# See how vectors respond to arithmetic
testVector + 3
testVector * 2
testVector + testVector
```
• In the output, when one argument is a single number, the same operation is applied to each item of `testVector`. When the two arguments have the same length, the operation is applied element by element.
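A small sketch of the two cases just described, using the same `testVector` values. The names `shifted` and `doubled` are mine and do not come from the tutorial.

```r
# Same values as testVector in this notebook
testVector <- c(1, 3, 4, 6, 8, 10, 15)

# A length-one argument is applied to every element
shifted <- testVector + 3
shifted

# Two vectors of equal length are combined element by element
doubled <- testVector + testVector
doubled

# Adding a vector to itself matches multiplying each element by 2
identical(doubled, testVector * 2)
```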
```# Create a second vector
testVectorB <- testVector + 3
testVectorB
```
```# Display only the 3rd element of a vector
testVectorB[3]
```
```# Display only elements greater than 8
testVectorB[testVectorB > 8]
```
• Answer is 9 11 13 18
```# Show how it is done:
testVectorB > 8
```
• Answer is FALSE FALSE FALSE TRUE TRUE TRUE TRUE
• In this case it is testing each element of the vector and returning the logical answer.
• At this point I think I need a little more explanation of the syntax. The tutorial is telling me what code to enter without explaining the syntax, so it will be difficult for me to extract the general lesson from this.
```# Display the value of testVectorB when testVector equals 3
# Note that two '=' are used for logical tests
testVectorB[testVector == 3]
```
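Since the tutorial does not spell out the syntax, here is my attempt at breaking the bracket notation into steps. This is a sketch of how I understand logical indexing; the name `keep` is one I made up.

```r
testVector  <- c(1, 3, 4, 6, 8, 10, 15)
testVectorB <- testVector + 3

# Step 1: the comparison returns one TRUE/FALSE per element
keep <- testVectorB > 8
keep

# Step 2: square brackets keep the elements where the mask is TRUE
testVectorB[keep]

# which() gives the positions of the TRUE values instead
which(keep)

# A mask built from one vector can index another of the same length
testVectorB[testVector == 3]
```

The general pattern seems to be `vector[logicalVector]`, where the logical vector has one entry per element.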
```# Make a plot of two vectors
plot(testVector, testVectorB)
```
• This generates a plot of the two vectors, the x-coordinates are given in the argument first, then the y-coordinates.

I completed section 1.7 Playing with Data and will continue with section 1.8 Spreadsheet style data next time.

##### 2016-06-06
• Picking up where I left off with Section 1.8 of An introduction to statistics in R tutorial.
• Launch RStudio. It launched, automatically opening my previous script, likely because I saved my workspace image.
• Practice making a type of variable called a data frame (`data.frame`) that will match vectors like a spreadsheet in Excel.
```# Practice making a data frame
testDataFrame <- data.frame(testVector,testVectorB)
testDataFrame
```
• Output looks like this:
```           testVector  testVectorB
1          1           4
2          3           6
3          4           7
4          6           9
5          8          11
6         10          13
7         15          18
```
• Create new folder called "data" and download the .csv file into it: hotdogs.csv.
• Use the menus to import data: Tools > Import Dataset > From local File.
• The data displayed. Copy the following command into the script so that the data will load the next time I run the script:
```# What RStudio added to my Console
View(hotdogs)
```
• The issue with this command is that it records the absolute path to the directory, which may change. To get around this, we need to set the relative path, aka, "set the working directory".
• In RStudio click Session > Set Working Directory > To Source File Location.
```# Manually set working directory
# Made sure that this is set for my computer, not the tutorial
setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")
```
• Put the above code near the top of the script so it is easy to see and change if working on a different computer or sharing a script.
• Now that the directory is set, going to type commands to read data
```# What RStudio added to my Console after using menu to read csv data file
View(hotdogs)
```
```# Read hotdogs.csv data file directly instead of from menu
```
```# previous data was read in, but not saved, so we are going to assign it to the variable hotdogs
```
```# Look at the top of the hotdogs data
head(hotdogs)
```
• displays the first six rows of the data (by default, can change this)
• this is what it showed:
```  Type    Calories Sodium
1 Beef      186    495
2 Beef      181    477
3 Beef      176    425
4 Beef      149    322
5 Beef      184    482
6 Beef      190    587
```
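The read-and-assign step can be sketched without depending on my folder layout. This example is an assumption-laden stand-in: it writes three rows (copied from the output above) to a temporary folder under the made-up file name `hotdogs_example.csv`, then reads them back with `read.csv` and a path built by `file.path`.

```r
# Stand-in data: three rows copied from the head() output in this notebook
example <- data.frame(Type = c("Beef", "Beef", "Beef"),
                      Calories = c(186, 181, 176),
                      Sodium = c(495, 477, 425))

# Write it to a temporary folder so the example runs on any computer
dataDir <- tempdir()
write.csv(example, file.path(dataDir, "hotdogs_example.csv"), row.names = FALSE)

# Reading without assignment only prints the data; assigning keeps it
hotdogsExample <- read.csv(file.path(dataDir, "hotdogs_example.csv"))
head(hotdogsExample)
```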
• Use `summary` function to get some descriptive statistics of the data
```# Look at the data a little more completely
summary(hotdogs)
```
• Results shown
```      Type       Calories         Sodium
 Beef   :20   Min.   : 86.0   Min.   :144.0
 Meat   :17   1st Qu.:132.0   1st Qu.:362.5
 Poultry:17   Median :145.0   Median :405.0
              Mean   :145.4   Mean   :424.8
              3rd Qu.:172.8   3rd Qu.:503.5
              Max.   :195.0   Max.   :645.0
```
• Install `knitr` package to create an HTML output from this script.
```# Install knitr
# commented out the line after running it so that it doesn't run again each time I run this script
# install.packages('knitr')
```
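An alternative to commenting the line out would be to install only when the package is missing. This is a sketch using a hypothetical helper, `installIfMissing`, which I wrote for illustration; it relies on the base functions `requireNamespace` and `install.packages`.

```r
# Hypothetical helper: install a package only when it is not already available
installIfMissing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  install.packages(pkg)
  invisible(TRUE)
}

# "stats" ships with R, so this call never triggers an install
installIfMissing("stats")
# For this notebook the call would be installIfMissing("knitr")
```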
• ran install, clicked on notebook icon in menu bar in script tab and it installed further dependencies.
• it found an error in line 19 (it looks like I pasted the plot command there spuriously), so fixed it.
• ran it again and it worked, creating file called "Dahlquist_learningR.html" in the "scripts" directory. I will commit to github repository since OWW doesn't take HTML files.
##### Summary of commands learned so far
• Highlight text and type `Ctrl+Enter` to run a command
• simple arithmetic
• setting a variable `variableName <- value`
• displaying a variable `variableName`
• setting a vector `vectorName <- c(list of elements separated by commas)`
• perform arithmetic on vectors
• displaying a vector `vectorName`
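• displaying a particular element `vectorName[index]`
• displaying elements based on a criterion `vectorName[vectorName > value]`
• producing a logical from each element `vectorName > value`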
• simple x-y plot `plot(vector1,vector2)`
• make a data frame from two vectors `dataframeName <- data.frame(vector1,vector2)`
• import csv data from menu and command line `read.csv(path)`
• set working directory from menu and command line, including relative directory `setwd`
• look at first six lines of data frame `head(dataframeName)`
• look at summary statistics of data frame `summary(dataframeName)`
• create an HTML notebook

#### Plotting and evaluating one categorical variable (Chapter 2)

##### 2016-06-07
• Launch RStudio; note that it opens with my previous script because I saved the workspace image.
• Create a new script called "Dahlquist_ch2_ploting-one-categorical-variable_20160607.R"
```# Script written by Kam Dahlquist
# 2016-06-07
# From http://petersonbiology.com/math230Notes/ Chapter 2
# Electronic lab notebook found at: http://www.openwetware.org/wiki/Dahlquist:GCAT-SEEK_Workshop_2016
```
```# Manually set working directory
# Made sure that this is set for my computer, not the tutorial
setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")
```
```# Read in the data directly
```
```# Look at the hotdogs data
head(hotdogs)
```
• Result is:
``` Type Calories Sodium
1 Beef      186    495
2 Beef      181    477
3 Beef      176    425
4 Beef      149    322
5 Beef      184    482
6 Beef      190    587
```
• Displaying just the `Type` column with `hotdogs$Type` gives:
```  Beef    Beef    Beef    Beef    Beef    Beef
 Beef    Beef    Beef    Beef    Beef    Beef
 Beef    Beef    Beef    Beef    Beef    Beef
 Beef    Beef    Meat    Meat    Meat    Meat
 Meat    Meat    Meat    Meat    Meat    Meat
 Meat    Meat    Meat    Meat    Meat    Meat
 Meat    Poultry Poultry Poultry Poultry Poultry
 Poultry Poultry Poultry Poultry Poultry Poultry
 Poultry Poultry Poultry Poultry Poultry Poultry
Levels: Beef Meat Poultry
```
• All text columns by default are a special kind of variable called a factor. It has "Levels", which in this case are Beef, Meat, Poultry, which are the options for that variable.
```# Check just the levels directly
levels(hotdogs$Type)
```
• Result is:
```[1] "Beef"    "Meat"    "Poultry"
```
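To see what a factor is on its own, here is a sketch with a made-up vector standing in for the `Type` column. One caveat I am aware of: the automatic conversion of text columns to factors applies to the R version used here; from R 4.0.0 onward `read.csv` leaves text as character unless `stringsAsFactors = TRUE` is given.

```r
# Hypothetical text values standing in for the Type column
types <- c("Beef", "Meat", "Poultry", "Beef", "Meat", "Beef")

# factor() stores the distinct values as levels (sorted alphabetically by default)
typeFactor <- factor(types)
levels(typeFactor)

# Internally each entry is an integer code pointing at a level
as.integer(typeFactor)

# nlevels() counts the distinct options
nlevels(typeFactor)
```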
• To count how many of each type, use the table function
```# Make a table of hotdog types
table(hotdogs$Type)
```
• Result is:
```   Beef    Meat Poultry
     20      17      17
```
• Instead, store this result in a variable
```# Make a table of hotdog types, this time assigning it to a variable
typeCounts <- table(hotdogs$Type)
```
• Note that when this line is run, the Values window of the console shows that the typeCounts variable has the value of
```'table' int [1:3(1d)] 20 17 17
```
• Question: what is the difference between a data.frame and a table?
• `table` is a function to create tabular results of categorical variables whereas a data.frame is a datatype.
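A quick way to see the difference is to check the class of each object. This sketch uses a made-up `types` vector rather than the real hotdog data; `as.data.frame` converts the counts into a data frame with a `Freq` column.

```r
# Hypothetical values standing in for hotdogs$Type
types <- c("Beef", "Meat", "Poultry", "Beef", "Meat", "Beef")

# table() returns an object of class "table": counts indexed by category
counts <- table(types)
class(counts)

# as.data.frame() converts the counts into a spreadsheet-like data frame
countsDF <- as.data.frame(counts)
class(countsDF)
countsDF
```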
```# See how many total hotdogs are in the data
sum(typeCounts)
```
• Result is
```[1] 54
```
• Is this the only way to get a sum of how many hotdogs are in this dataset? It seems a little indirect to first have to count how many there are of each individual type and then sum that together. Why not just count records? It seems that it is tied to the next task, which is to calculate the proportions for each type.
```# Calculate proportions for each type of hotdog
typeCounts / sum( typeCounts )
```
• Result is:
```     Beef      Meat   Poultry
0.3703704 0.3148148 0.3148148
```
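On my earlier question about counting records directly: base R does offer `nrow()` for data frames and `length()` for a single column. A sketch with a small made-up data frame (the values are illustrative, not the real dataset):

```r
# Hypothetical stand-in for the hotdogs data frame
hotdogsExample <- data.frame(Type = c("Beef", "Beef", "Meat", "Poultry"),
                             Calories = c(186, 181, 173, 129))

# Count records directly: number of rows in the data frame
nrow(hotdogsExample)

# Or the length of a single column
length(hotdogsExample$Type)

# Both agree with summing the per-type counts
sum(table(hotdogsExample$Type))
```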
• There is a function called `prop.table` that directly will calculate proportions from a table
```# Use prop.table to directly compute proportions of each type in a table
prop.table(typeCounts)
```
• Already have count data saved as a variable, so can just pass it to the built-in function `pie()`.
```# Make a pie chart of count data stored in typeCounts variable
pie(typeCounts)
```
```# Make a barchart of count data stored in typeCounts variable
barplot(typeCounts)
```
• We are going to fix up this basic plot by adding a title and labels for the x and y axes, by passing arguments to the function. So far, all of the arguments we have passed to functions were implicit, i.e., not named explicitly, but assumed to be certain things because of the order they were given. To give the arguments out of order, use the format `argument = value`.
```# Beautify the plot by adding title and axis labels
barplot( typeCounts,
main = "Distribution of Hotdog Types",
xlab = "Hotdog Type",
ylab = "Count")
```
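The difference between positional and named arguments is easier to check with a function that returns numbers instead of drawing a plot. This sketch uses base R's `seq()`, which is my choice of example and not part of the tutorial.

```r
# Positional: arguments are matched by order (from, to, by)
positional <- seq(2, 10, 2)
positional

# Named: the same call, but the arguments can come in any order
named <- seq(by = 2, to = 10, from = 2)
named

# Both calls produce the same sequence
identical(positional, named)
```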
```# Load Titanic data by reading it into a variable called titanicData
```
```# Check the data
```
• Result is:
```  class   age gender survival
```
```# Save and display the counts of living and dead in a variable called titanicSurvival
titanicSurvival <- table(titanicData$survival)
titanicSurvival
```
• Result is:
```Alive  Dead
  711  1490
```
```# How many people were on the Titanic?
sum(titanicSurvival)
```
• Result is:
```[1] 2201
```
```# What proportion lived and died?
prop.table(titanicSurvival)
```
• Result is:
```   Alive     Dead
0.323035 0.676965
```
```# Make a plot of survival
barplot(titanicSurvival,
main = "Titanic Survival",
xlab = "Status",
ylab = "Number of people")
```
• Do this again, for people in each class. I copied and pasted the previous four blocks of code and changed it to reflect the class category:
```# Save and display the counts of people in different classes in a  variable called titanicClass
titanicClass <- table(titanicData$class)
titanicClass
```
```# How many people were on the Titanic?
sum(titanicClass)
```
```# What proportion were in each class?
prop.table(titanicClass)
```
```# Make a plot of people in each class
barplot(titanicClass,
main = "Where were people on the Titanic?",
xlab = "Class",
ylab = "Number of people")
```
##### Commands learned in this module
• use `$` to select a single column of a data frame, as in `hotdogs$Type`
• levels()
• table()
• sum()
• prop.table()
• pie()
• barplot()

#### Plotting and evaluating two categorical variables (Chapter 3)

##### 2016-06-08
• Launch RStudio.
• Start new script called: Dahlquist_ch3_plotting-two-categorical-variables_20160608.R
```# Script written by Kam Dahlquist
# 2016-06-08
# From http://petersonbiology.com/math230Notes/ Chapter 3
# Electronic lab notebook found at: http://www.openwetware.org/wiki/Dahlquist:GCAT-SEEK_Workshop_2016
```
```# Manually set working directory
# Made sure that this is set for my computer, not the tutorial
setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")
```
```# Load data into variables by reading from csv files
```

Going to stop here and continue later because I've stared at the computer too long today and I'm glazed over. Kam D. Dahlquist 19:29, 8 June 2016 (EDT)

##### 2016-06-09
• Continuing from yesterday:
```# Make table of Gender
table(popularKids$Gender)
```
```# Make barplot of Gender
barplot( table(popularKids$Gender),
main = "Proportion of Boys and Girls in Sample",
xlab = "Gender",
ylab = "Count")
```
```# Make table of Goals
table(popularKids$Goals)
```
```# Make barplot of Goals
barplot( table(popularKids$Goals),
main = "Students' choice in personal goals",
xlab = "Goals",
ylab = "Count")
```
• We want to know how many boys want to be good in sports.
```# Save a table of the two-way interaction
genderGoals <- table(popularKids$Gender, popularKids$Goals)
```
```# Enter the variable name to actually see it
genderGoals
```
• results
```      Grades Popular Sports
boy     117      50     60
girl    130      91     30
```
```# Make a simple (default) barplot
barplot(genderGoals,
main = "Goals divided by gender",
xlab = "Goals",
ylab = "Count")
```
```# Make a barplot with a legend, with bars side-by-side
barplot(genderGoals,
main = "Goals divided by gender",
xlab = "Goals",
ylab = "Count",
legend.text = TRUE,
beside = TRUE)
```
```# Save a table of location and goals
locGoals <- table(popularKids$Urban.Rural,
popularKids$Goals)
# View the table
locGoals
```
• Results
```          Grades Popular Sports
Rural        57      50     42
Suburban     87      42     22
Urban       103      49     26
```
```# Plot the result of location vs. goals
barplot(locGoals,
legend.text = TRUE,
beside = TRUE,
main = "Does location matter?",
xlab = "Personal goal",
ylab = "Number of students")
```
```# Save a second table of location and goals so data is transposed
locGoalsB <- table(popularKids$Goals,
popularKids$Urban.Rural)
```
```# Plot the result of transposed location vs. goals
barplot(locGoalsB,
legend.text = TRUE,
beside = TRUE,
main = "Does location matter?",
xlab = "Location",
ylab = "Number of students")
```
```# Display the original table transposed
t(locGoals)
```
```# Plot the original table transposed
barplot(t(locGoals),
legend.text = TRUE,
beside = TRUE,
main = "Does location matter?",
xlab = "Location",
ylab = "Number of students")
```
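The two routes to the transposed table (swapping the arguments to `table()` versus calling `t()` on the original) should give the same counts. A sketch that checks this on made-up vectors; `location` and `goal` are hypothetical stand-ins for the `popularKids` columns.

```r
# Hypothetical stand-ins for popularKids$Urban.Rural and popularKids$Goals
location <- c("Rural", "Urban", "Urban", "Rural", "Suburban")
goal     <- c("Grades", "Sports", "Grades", "Grades", "Popular")

# Location in rows, goals in columns
byLocation <- table(location, goal)

# Goals in rows, location in columns (arguments swapped)
byGoal <- table(goal, location)

# Transposing the first table reproduces the second
all(t(byLocation) == byGoal)
```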
##### 2016-06-10
```# Make a plot of the difference in goals by gender with the bars reversed from before.
barplot(t(genderGoals),
main = "Goals divided by gender",
xlab = "Goals",
ylab = "Count",
legend.text = TRUE,
beside = TRUE)
```
• Interpretation: boys and girls are similar with respect to desire for good grades, but girls rank being popular higher than sports, whereas boys rank the two about equally.
• Now using the titanic data...
```# Save a two-way table of survival for the Titanic data
survivalClass <- table(titanic$survival, titanic$class)
```
```# Make a plot of survival for each class of passenger
barplot(survivalClass,
main = "Survival by Class",
xlab = "Class",
ylab = "Number of People",
legend.text = TRUE,
beside = TRUE)
```
• It is difficult to interpret the results because there are different numbers in each class. We can use the function `addmargins()` to see group totals.
```# Look at the margins
addmargins(survivalClass)
```
• Result
```       Crew First Second Third  Sum
Alive  212   203    118   178  711
Dead   673   122    167   528 1490
Sum    885   325    285   706 2201
```
• This adds a "Sum" row and column. These are called "marginal" totals because they were literally written in the margins of the table when calculated by hand.
• What we want to know is the conditional distribution of survival, i.e., what proportion of people survived given that they were in each class.
```# Simple example of the conditional distribution of survival
prop.table(survivalClass)
```
• Results
```             Crew      First     Second      Third
Alive 0.09631985 0.09223080 0.05361199 0.08087233
```
• But this is each cell's proportion of the grand total. We want the proportion within each class, so use the `margin` option, where 1 is rows and 2 is columns.
```# compute the conditional distribution by column, which is class
prop.table(survivalClass,
margin = 2)
```
• results in
```            Crew     First    Second     Third
Alive 0.2395480 0.6246154 0.4140351 0.2521246
```
```# check answer to see if columns sum to 1 as we would expect
addmargins(prop.table(survivalClass,
margin = 2))
```
```            Crew     First    Second     Third       Sum
Alive 0.2395480 0.6246154 0.4140351 0.2521246 1.5303231
Dead  0.7604520 0.3753846 0.5859649 0.7478754 2.4696769
Sum   1.0000000 1.0000000 1.0000000 1.0000000 4.0000000
```
• result verifies this
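To compare the three settings of `margin` side by side, this sketch rebuilds the counts as a plain matrix, copying the numbers from the `addmargins()` output earlier in this entry. Building it with `matrix()` is my shortcut so the example runs without the Titanic csv file.

```r
# Counts copied from the addmargins() output in this notebook
survivalClass <- matrix(c(212, 673,
                          203, 122,
                          118, 167,
                          178, 528),
                        nrow = 2,
                        dimnames = list(c("Alive", "Dead"),
                                        c("Crew", "First", "Second", "Third")))

# No margin: each cell divided by the grand total
overall <- prop.table(survivalClass)
sum(overall)

# margin = 2: each cell divided by its column (class) total
byClass <- prop.table(survivalClass, margin = 2)
colSums(byClass)

# margin = 1: each cell divided by its row (survival status) total
byStatus <- prop.table(survivalClass, margin = 1)
rowSums(byStatus)
```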
```# Plot the conditional distribution of survival by class, embedding computation in plot call
barplot( prop.table(survivalClass,
margin = 2),
main = "Survival by Class",
legend.text = TRUE,
ylab = "Proportion surviving",
xlab = "Class")
```
• Now look at whether women were more likely to survive than men.
```# Make a table of gender by survival
survivalGender <- table(titanic$survival, titanic$gender)
```
```# Plot the conditional distributions of gender by survival
barplot( prop.table(survivalGender, 2),
main = "Survival by Gender",
legend.text = TRUE,
ylab = "Proportion surviving",
xlab = "Gender")
```
• Large difference in survival between women and men
```# Look at the raw counts of survival by gender
survivalGender
```
• results
```       Female Male
Alive    344  367
```
```# Save a two-way table of gender by class for Titanic data
genderClass <- table(titanic$gender, titanic$class)
genderClass
```
```# Plot the conditional distributions of gender by class
barplot( prop.table(genderClass, 2),
main = "Where were the Women?",
legend.text = TRUE,
ylab = "Proportion within class",
xlab = "Class")
```
```# define a new variable which is the transpose of gender by class and display
genderClass2 <- t(genderClass)
genderClass2
```
```# Plot the conditional distributions of class by gender
barplot( prop.table(genderClass2, 2),
main = "Difference in class by gender",
legend.text = TRUE,
ylab = "Proportion within gender",
xlab = "gender")
```

Completed Chapter 3 and committed script and notebook to GitHub repository. — Kam D. Dahlquist 19:52, 10 June 2016 (EDT)

### Linux Tutorial

• Launched PuTTY
• selected mason.indiana.edu host
• got warning about privacy key, selected to connect just once, not to cache