# Dahlquist:GCAT-SEEK Workshop 2016

This page contains the electronic lab notebook generated by Kam D. Dahlquist in preparation for and at the GCAT-SEEK Workshop held on June 28-July 2, 2016 at Cal-State LA.

## 2016-06-28 Notes

Chapter 2: Sequence Processing and Quality Control

• go to workshop directory
cd /N/dc2/projects/GCAT/workshop

• FastQ results file.

## Pre-Workshop Tutorials

### R Tutorials

This is based on An introduction to statistics in R, a series of tutorials by Mark Peterson for working in R.

Scripts and notebooks are stored in my GitHub repository.

#### Basics of Data in R (Chapter 1)

##### 2016-06-03
• Launch RStudio (it default opens with the v3.3.0 version of R I just installed, noting that I have other versions of R installed on my laptop.)
• Create a script file
• File > New > R Script
• Save file using the menu item: File > Save, Dahlquist_learningR.R in folder My Documents > Travel-Conferences > GCAT-SEEK_2016 > scripts
# Script written by Kam Dahlquist
# Learning R First Script
# 2016-06-03

• Did some simple arithmetic
# Try some simple arithmetic
3+2
3-2
3*2
3/2

• By highlighting and typing Ctrl+Enter, R will compute the result and show results in the console window.
• Variables in R must begin with a letter, but can be any length and can include numerals. An "arrow", a less than symbol followed by a dash <- is used to assign variables.
# Store a simple variable, then display it
testVariable <- 3
testVariable

• Note that the first command assigns the variable, the second one displays it.
# Use the variable
testVariable + 2
testVariable * 2

• Highlight and use Ctrl+Enter to compute answer.
• Change testVariable and run again.
# Change test variable
testVariable <- 4
testVariable

• Note that in the console window, the  next to the answer is the index telling us that the vaaule is the first element of the variable (also note that the index starts at 1, not zero).
# Make a test vector, then display it
testVector <- c(1,3,4,6,8,10,15)
testVector

• Note that the function c stands for "concatenate". This function takes a series of numbers and returns them as a single vector.
• Note that a function takes arguments separated by commas inside of parentheses.
# See how vectors respond to arithmetic
testVector + 3
testVector * 2
testVector + testVector

• In the output, for arguments that are only one number long, the same procedure is applied to each item of testVector. When the arguments are the same length, the argument is applied element-for-element.
# Create a second vector
testVectorB <- testVector + 3
testVectorB

# Display only the 3rd element of a vector
testVectorB

# Display only elements greater than 8
testVectorB[testVectorB > 8]

• Answer is 9 11 13 18
# Show how it is done:
testVectorB > 8

• Answer is FALSE FALSE FALSE TRUE TRUE TRUE TRUE
• In this case it is testing each element of the vector and returning the logical answer.
• At this point I think I need a little more explanation of the syntax. The tutorial is telling me what code to enter without explaining the syntax, so it will be difficult for me to extract the general lesson from this.
# Display the value of testVectorB when testVector equals 3
# Note that two '=' are used for logical tests
testVectorB[testVector == 3]

# Make a plot of two vectors
plot(testVector, testVectorB)

• This generates a plot of the two vectors, the x-coordinates are given in the argument first, then the y-coordinates.

I completed the section 1.7 Playing with Data and will pick up with section 1.8 Spreadsheet style data when I pick this up again.

##### 2016-06-06
• Picking up where I left off with Section 1.8 of An introduction to statistics in R tutorial.
• Launch RStudio. It launched, automatically opening my previous script, likely because I saved my workspace image.
• Practice making a type of variable called a data frame (data.frame) that will match vectors like a spreadsheet in Excel.
# Practice making a data frame
testDataFrame <- data.frame(testVector,testVectorB)
testDataFrame

• Output looks like this:
           testVector  testVectorB
1          1           4
2          3           6
3          4           7
4          6           9
5          8          11
6         10          13
7         15          18

• Create new folder called "data" and download the .csv file into it: hotdogs.csv.
• Use the menus to import data: Tools > Import Dataset > From local File.
• The data displayed. Copy the following command into the script so that the data will load the next time I run the script:
# What RStudio added to my Console
View(hotdogs)

• The issue with this command is that it records the absolute path to the directory, which may change. To get around this, we need to set the relative path, aka, "set the working directory".
• In RStudio click Session > Set Working Directory > To Source File Location.
# Manually set working directory
# Made sure that this is set for my computer, not the tutorial
setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")

• Put the above code near the top of the script so it is easy to see and change if working on a different computer or sharing a script.
• Now that the directory is set, going to type commands to read data
# What RStudio added to my Console after using menu to read csv data file
View(hotdogs)

# Read hotdogs.csv data file directly instead of from menu

# previous data was read in, but not saved, so we are going to assign it to the variable hotdogs

# Look at the top of the hotdogs data

• displays the first six rows of the data (by default, can change this)
• this is what it showed:
  Type    Calories Sodium
1 Beef      186    495
2 Beef      181    477
3 Beef      176    425
4 Beef      149    322
5 Beef      184    482
6 Beef      190    587

• Use summary function to get some descriptive statistics of the data
# Look at the data a little more completely
summary(hotdogs)

• Results shown
     Type       Calories         Sodium
Beef   :20   Min.   : 86.0   Min.   :144.0
Meat   :17   1st Qu.:132.0   1st Qu.:362.5
Poultry:17   Median :145.0   Median :405.0
Mean   :145.4   Mean   :424.8
3rd Qu.:172.8   3rd Qu.:503.5
Max.   :195.0   Max.   :645.0

• Install knitr package to create an HTML output from this script.
# Install knitr
# commented out line after running it so that it doesn't run again each time I run this script
install.packages('knitr')

• ran install, clicked on notebook icon in menu bar in script tab and it installed further dependencies.
• it found an error in line 19 (it looks like I pasted the plot command there spuriously), so fixed it.
• ran it again and it worked, creating file called "Dahlquist_learningR.html" in the "scripts" directory. I will commit to github repository since OWW doesn't take HTML files.
##### Summary of commands learned so far
• Highlight text and type Ctrl+Enter to run a command
• simple arithmetic
• setting a variable variableName <- value
• displaying a variable variableName>
• setting a vector vectorName <- c(list of elements separated by commas)
• perform arithmetic on vectors
• displaying a vector vectorName
• a particular element
• based on a criteria
• producing a logical from each element
• simple x-y plot plot(vector1,vector2)
• make a data frame from two vectors dataframeName <- data.frame(vector1,vector2)
• import csv data from menu and command line read.csv(path)
• set working directory from menu and command line, including relative directory setwd
• look at first six lines of data frame head(dataframeName)
• look at summary statistics of data frame summary(dataframeName)
• create an HTML notebook

#### Plotting and evaluating one categorical variable (Chapter 2)

##### 2016-06-07
• Launch RStudio; note that it opens with my previous script because I saved the workspace image.
• Create a new script called "Dahlquist_ch2_ploting-one-categorical-variable_20160607.R"
# Script written by Kam Dahlquist
# 2016-06-07
# From http://petersonbiology.com/math230Notes/ Chapter 2
# Electronic lab notebook found at: http://www.openwetware.org/wiki/Dahlquist:GCAT-SEEK_Workshop_2016

# Manually set working directory
# Made sure that this is set for my computer, not the tutorial
setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")

# Read in the data directly

# Look at the hotdogs data

• Result is:
 Type Calories Sodium
1 Beef      186    495
2 Beef      181    477
3 Beef      176    425
4 Beef      149    322
5 Beef      184    482
6 Beef      190    587

• Result is:
  Beef    Beef    Beef    Beef    Beef    Beef
 Beef    Beef    Beef    Beef    Beef    Beef
 Beef    Beef    Beef    Beef    Beef    Beef
 Beef    Beef    Meat    Meat    Meat    Meat
 Meat    Meat    Meat    Meat    Meat    Meat
 Meat    Meat    Meat    Meat    Meat    Meat
 Meat    Poultry Poultry Poultry Poultry Poultry
 Poultry Poultry Poultry Poultry Poultry Poultry
 Poultry Poultry Poultry Poultry Poultry Poultry
Levels: Beef Meat Poultry

• All text columns by default are a special kind of variable called a <factor>factor</factor>. It has "Levels", which in this case are Beef, Meat, Poultry, which are the options for that variable.
# Check just the levels directly
levels(hotdogs$Type)  • Result is:  "Beef" "Meat" "Poultry"  • To count how many of each type, use the table function # Make a table of hotdog types table(hotdogs$Type)

• Result is:
 Beef    Meat Poultry
20      17      17

• Instead, store this result in a variable
# Make a table of hotdog types, this time assigning it to a variable
typeCounts <- table(hotdogs$Type)  • Note that when this line is run, the Values window of the console shows that the typeCounts variable has the value of 'table' int [1:3(1d)] 20 17 17  • Question: what is the difference between a data.frame and a table? • table is a function to create tabular results of categorical variables whereas a data.frame is a datatype. # See how many total hotdogs are in the data sum(typeCounts)  • Result is  54  • Is this the only way to get a sum of how many hotdogs are in this dataset? It seems a little indirect to first have to count how many there are of each individual type and then sum that together. Why not just count records? It seems that it is tied to the next task, which is to calculate the proportions for each type. # Calculate proportions for each type of hotdog typeCounts / sum( typeCounts )  • Result is:  Beef Meat Poultry 0.3703704 0.3148148 0.3148148  # Calculate proportions for each typeCounts / sum( typeCounts )  • There is a function called prop.table that directly will calculate proportions from a table # Use prop.table to directly compute proportions of each type in a table prop.table(typeCounts)  • Already have count data saved as a variable, so can just pass it to the built-in function pie(). # Make a pie chart of count data stored in typeCounts variable pie(typeCounts)  # Make a barchart of count data stored in typeCounts variable barplot(typeCounts)  • We are going to fix up this basic plot by adding a title and labels for the x and y axes, by passing arguments to the function. So far, all of the arguments we have passed to functions were implicit, i.e., not named explicitly, but assumed to be certain things because of the order they were given. To give the arguments out of order, use the format argument = value. # Beautify the plot by adding title and axis labels barplot( typeCounts, main = "Distribution of Hotdog Types", xlab = "Hotdog Type", ylab = "Count")  # Load Titanic data by reading it into a variable called titanicData titanicData <- read.csv("../data/titanicData.csv")  # Check the data head(titanicData)  • Result is:  class age gender survival 1 First Adult Male Alive 2 First Adult Male Alive 3 First Adult Male Alive 4 First Adult Male Alive 5 First Adult Male Alive 6 First Adult Male Alive  # Save and display the counts of living and dead in a variable called titanicSurvival titanicSurvival <- table(titanicData$survival)
titanicSurvival

• Result is:
Alive  Dead
711  1490

# How many people were on the Titanic?
sum(titanicSurvival)

• Result is:
 2201

# What proportion lived and died?
prop.table(titanicSurvival)

• Result is:
   Alive     Dead
0.323035 0.676965

# Make a plot of survival
barplot(titanicSurvival,
main = "Titanic Survival",
xlab = "Status",
ylab = "Number of people")

• Do this again, for people in each class. I copied and pasted the previous four blocks of code and changed it to reflect the class category:
# Save and display the counts of people in different classes in a  variable called titanicClass
titanicClass <- table(titanicData$class) titanicClass  # How many people were on the Titanic? sum(titanicClass)  # What proportion were in each class? prop.table(titanicClass)  # Make a plot of peole in each class barplot(titanicClass, main = "Where were people on the Titanic?", xlab = "Class", ylab = "Number of people")  ##### Commands learned in this module • use the$ to display the types of data stored in a variable as in hotdogs$Type • levels() • table() • sum() • prop.table() • pie() • barplot() #### Plotting and evaluating two categorical variables (Chapter 3) ##### 2016-06-08 • Launch RStudio. • Download popularKids.csv data and save it in my data directory. • Start new script called: Dahlquist_ch3_plotting-two-categorical-variables_20160608.R # Script written by Kam Dahlquist # 2016-06-08 # From http://petersonbiology.com/math230Notes/ Chapter 3 # Electronic lab notebook found at: http://www.openwetware.org/wiki/Dahlquist:GCAT-SEEK_Workshop_2016  # Manually set working directory # Made sure that this is set for my computer, not the tutorial setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")  # Load data into variables by reading from csv files titanic <- read.csv("../data/titanicData.csv") popularKids <- read.csv("../data/popularKids.csv")  Going to stop here and continue later because I've stared at the computer too long today and I'm glazed over. Kam D. Dahlquist 19:29, 8 June 2016 (EDT) ##### 2016-06-09 • Continuing from yesterday: # Make table of Gender table(popularKids$Gender)

# Make barplot of Gender
barplot( table(popularKids$Gender), main = "Proportion of Boys and Girls in Sample", xlab = "Gender", ylab = "Count")  # Make table of Goals table(popularKids$Goals)

# Make barplot of Gender
barplot( table(popularKids$Goals), main = "Students choice in personal goals", xlab = "Goals", ylab = "Count")  • We want to know how many boys want to be good in sports. # Save a table of the two-way interaction genderGoals <- table(popularKids$Gender, popularKids$Goals)  # Enter the variable name to actually see it genderGoals  • results  Grades Popular Sports boy 117 50 60 girl 130 91 30  # Make a simple (default) barplot barplot(genderGoals, main = "Goals divided by gender", xlab = "Goals", ylab = "Count")  # Make a barplot with a legend, with bars side-by-side barplot(genderGoals, main = "Goals divided by gender", xlab = "Goals", ylab = "Count", legend.text = TRUE, beside = TRUE)  # Save a table of location and goals locGoals <- table(popularKids$Urban.Rural,
popularKids$Goals) # View the table locGoals  • Results  Grades Popular Sports Rural 57 50 42 Suburban 87 42 22 Urban 103 49 26  # Plot the result of location vs. goals barplot(locGoals, legend.text = TRUE, beside = TRUE, main = "Does location matter?", xlab = "Personal goal", ylab = "Number of students")  # Save a second table of location and goals so data is transposed locGoalsB <- table(popularKids$Goals,
popularKids$Urban.Rural)  # Plot the result of transposed location vs. goals barplot(locGoalsB, legend.text = TRUE, beside = TRUE, main = "Does location matter?", xlab = "Location", ylab = "Number of students")  # Display the original table transposed t(locGoals)  # Plot the original table transposed barplot(t(locGoals), legend.text = TRUE, beside = TRUE, main = "Does location matter?", xlab = "Location", ylab = "Number of students")  ##### 2016-06-10 # Make a plot of the difference in goals by gender with the bars reversed from before. barplot(t(genderGoals), main = "Goals divided by gender", xlab = "Goals", ylab = "Count", legend.text = TRUE, beside = TRUE)  • Interpretation: boys and girls are similar with respect to desire for good grades, but girls rate being popular higher than sports and boys rank them about equal. • Now using the titanic data... # Save a two-way table of survival for the Titanic data survivalClass <- table(titanic$survival,titanic$class)  # Make a plot of survival for each class of passenger barplot(survivalClass, main = "Survival by Class", xlab = "Class", ylab = "Number of People", legend.text = TRUE, beside = TRUE)  • It is difficult to interpret the results because there are different numbers in each class. We can use the function addmargins() to see group totals. # Look at the margins addmargins(survivalClass)  • Result  Crew First Second Third Sum Alive 212 203 118 178 711 Dead 673 122 167 528 1490 Sum 885 325 285 706 2201  • This adds a sum column that gives a "marginal" total because it was literally written in the margins of the table when calculated by hand. • What we want to know is the conditional distribution of survival, i.e., what proportion of people survived given that they were in each class. # Simple example of the conditional distribution of survival prop.table(survivalClass)  • Results  Crew First Second Third Alive 0.09631985 0.09223080 0.05361199 0.08087233 Dead 0.30577010 0.05542935 0.07587460 0.23989096  • but this is the proportion of the grand sum, we want the proportion of each class, so use the option "margin" where 1 is rows and 2 is columns. # compute the conditional distribution by column, which is class prop.table(survivalClass, margin = 2)  • results in  Crew First Second Third Alive 0.2395480 0.6246154 0.4140351 0.2521246 Dead 0.7604520 0.3753846 0.5859649 0.7478754  # check answer to see if columns sum to 1 as we would expect addmargins(prop.table(survivalClass, margin = 2))   Crew First Second Third Sum Alive 0.2395480 0.6246154 0.4140351 0.2521246 1.5303231 Dead 0.7604520 0.3753846 0.5859649 0.7478754 2.4696769 Sum 1.0000000 1.0000000 1.0000000 1.0000000 4.0000000  • result verifies this # Plot the conditional distribution of survival by class, embedding computation in plot call barplot( prop.table(survivalClass, margin = 2), main = "Survival by Class", legend.text = TRUE, ylab = "Proportion surviving", xlab = "Class")  • Now look at whether women were more likely to survive than men. # Make a table of gender by survival survivalGender <- table(titanic$survival,titanic$gender)  # Plot the conditional distributions of gender by survival barplot( prop.table(survivalGender, 2), main = "Survival by Gender", legend.text = TRUE, ylab = "Proportion surviving", xlab = "Gender")  • Large difference in survival between women and men # Look at the raw counts of survival by gender survivalGender  • results  Female Male Alive 344 367 Dead 126 1364  # Save a two-way table of gender by class for Titanic data genderClass <- table(titanic$gender,titanic\$class)
genderClass

# Plot the conditional distributions of gender by class
barplot( prop.table(genderClass, 2),
main = "Where were the Women?",
legend.text = TRUE,
ylab = "Proportion within class",
xlab = "Class")

# define a new variable which is the transpose of gender by class and display
genderClass2 <- t(genderClass)
genderClass2

# Plot the conditional distributions of class by gender
barplot( prop.table(genderClass2, 2),
main = "Difference in class by gender",
legend.text = TRUE,
ylab = "Proportion within gender",
xlab = "gender")


Completed Chapter 3 and committed script and notebook to GitHub repository. — Kam D. Dahlquist 19:52, 10 June 2016 (EDT)

### Linux Tutorial

• Launched PuTTY
• selected mason.indiana.edu host
• got warning about privacy key, selected to connect just once, not to cache