Dahlquist:GCAT-SEEK Workshop 2016

From OpenWetWare
Jump to navigationJump to search

Home        Research        Protocols        Notebook        People        Publications        Courses        Contact       



This page contains the electronic lab notebook generated by Kam D. Dahlquist in preparation for and at the GCAT-SEEK Workshop held on June 28-July 2, 2016 at Cal-State LA.

2016-06-28 Notes

Chapter 2: Sequence Processing and Quality Control

  • go to workshop directory
cd /N/dc2/projects/GCAT/workshop
  • FastQ results file.

Pre-Workshop Tutorials

R Tutorials

This is based on An introduction to statistics in R, a series of tutorials by Mark Peterson for working in R.

Scripts and notebooks are stored in my GitHub repository.

Basics of Data in R (Chapter 1)

2016-06-03
  • Downloaded and installed the current version of R for Windows, v3.3.0 from here; this is the direct download link.
  • Downloaded and installed the current version of RStudio, v0.99.902 from here; this is the direct download link.
  • Launch RStudio (it default opens with the v3.3.0 version of R I just installed, noting that I have other versions of R installed on my laptop.)
  • Create a script file
    • File > New > R Script
    • Save file using the menu item: File > Save, Dahlquist_learningR.R in folder My Documents > Travel-Conferences > GCAT-SEEK_2016 > scripts
    • Made the following comments:
# Script written by Kam Dahlquist
# Learning R First Script
# 2016-06-03
  • Did some simple arithmetic
# Try some simple arithmetic
3+2
3-2
3*2
3/2
  • By highlighting and typing Ctrl+Enter, R will compute the result and show results in the console window.
  • Variables in R must begin with a letter, but can be any length and can include numerals. An "arrow", a less than symbol followed by a dash <- is used to assign variables.
# Store a simple variable, then display it
testVariable <- 3
testVariable
  • Note that the first command assigns the variable, the second one displays it.
# Use the variable
testVariable + 2
testVariable * 2
  • Highlight and use Ctrl+Enter to compute answer.
    • Change testVariable and run again.
# Change test variable
testVariable <- 4
testVariable
  • Note that in the console window, the [1] next to the answer is the index telling us that the vaaule is the first element of the variable (also note that the index starts at 1, not zero).
# Make a test vector, then display it
testVector <- c(1,3,4,6,8,10,15)
testVector
  • Note that the function c stands for "concatenate". This function takes a series of numbers and returns them as a single vector.
  • Note that a function takes arguments separated by commas inside of parentheses.
# See how vectors respond to arithmetic
testVector + 3
testVector * 2
testVector + testVector
  • In the output, for arguments that are only one number long, the same procedure is applied to each item of testVector. When the arguments are the same length, the argument is applied element-for-element.
# Create a second vector
testVectorB <- testVector + 3
testVectorB
# Display only the 3rd element of a vector
testVectorB[3]
  • Answer is 7
# Display only elements greater than 8
testVectorB[testVectorB > 8]
  • Answer is 9 11 13 18
# Show how it is done:
testVectorB > 8
  • Answer is FALSE FALSE FALSE TRUE TRUE TRUE TRUE
    • In this case it is testing each element of the vector and returning the logical answer.
  • At this point I think I need a little more explanation of the syntax. The tutorial is telling me what code to enter without explaining the syntax, so it will be difficult for me to extract the general lesson from this.
# Display the value of testVectorB when testVector equals 3
# Note that two '=' are used for logical tests
testVectorB[testVector == 3]
  • Answer is 6
# Make a plot of two vectors
plot(testVector, testVectorB)
  • This generates a plot of the two vectors, the x-coordinates are given in the argument first, then the y-coordinates.

I completed the section 1.7 Playing with Data and will pick up with section 1.8 Spreadsheet style data when I pick this up again.

2016-06-06
  • Picking up where I left off with Section 1.8 of An introduction to statistics in R tutorial.
  • Launch RStudio. It launched, automatically opening my previous script, likely because I saved my workspace image.
  • Practice making a type of variable called a data frame (data.frame) that will match vectors like a spreadsheet in Excel.
# Practice making a data frame
testDataFrame <- data.frame(testVector,testVectorB)
testDataFrame
  • Output looks like this:
           testVector  testVectorB
1          1           4
2          3           6
3          4           7
4          6           9
5          8          11
6         10          13
7         15          18
  • Create new folder called "data" and download the .csv file into it: hotdogs.csv.
  • Use the menus to import data: Tools > Import Dataset > From local File.

  • The data displayed. Copy the following command into the script so that the data will load the next time I run the script:
# What RStudio added to my Console
hotdogs <- read.csv("~/Travel-Conferences/GCAT-SEEK_201/data/hotdogs.csv")
View(hotdogs)
  • The issue with this command is that it records the absolute path to the directory, which may change. To get around this, we need to set the relative path, aka, "set the working directory".
    • In RStudio click Session > Set Working Directory > To Source File Location.
# Manually set working directory
# Made sure that this is set for my computer, not the tutorial
setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")
  • Put the above code near the top of the script so it is easy to see and change if working on a different computer or sharing a script.
  • Now that the directory is set, going to type commands to read data
# What RStudio added to my Console after using menu to read csv data file
hotdogs <- read.csv("~/Travel-Conferences/GCAT-SEEK_2016/data/hotdogs.csv")
View(hotdogs)
# Read hotdogs.csv data file directly instead of from menu
read.csv("../data/hotdogs.csv")
# previous data was read in, but not saved, so we are going to assign it to the variable hotdogs
hotdogs <- read.csv("../data/hotdogs.csv")
# Look at the top of the hotdogs data
head(hotdogs)
  • displays the first six rows of the data (by default, can change this)
  • this is what it showed:
  Type    Calories Sodium
1 Beef      186    495
2 Beef      181    477
3 Beef      176    425
4 Beef      149    322
5 Beef      184    482
6 Beef      190    587
  • Use summary function to get some descriptive statistics of the data
# Look at the data a little more completely
summary(hotdogs)
  • Results shown
     Type       Calories         Sodium     
Beef   :20   Min.   : 86.0   Min.   :144.0  
Meat   :17   1st Qu.:132.0   1st Qu.:362.5  
Poultry:17   Median :145.0   Median :405.0  
             Mean   :145.4   Mean   :424.8  
             3rd Qu.:172.8   3rd Qu.:503.5  
             Max.   :195.0   Max.   :645.0
  • Install knitr package to create an HTML output from this script.
# Install knitr
# commented out line after running it so that it doesn't run again each time I run this script
install.packages('knitr')
  • ran install, clicked on notebook icon in menu bar in script tab and it installed further dependencies.
  • it found an error in line 19 (it looks like I pasted the plot command there spuriously), so fixed it.
  • ran it again and it worked, creating file called "Dahlquist_learningR.html" in the "scripts" directory. I will commit to github repository since OWW doesn't take HTML files.
Summary of commands learned so far
  • Highlight text and type Ctrl+Enter to run a command
  • simple arithmetic
  • setting a variable variableName <- value
  • displaying a variable variableName>
  • setting a vector vectorName <- c(list of elements separated by commas)
  • perform arithmetic on vectors
  • displaying a vector vectorName
    • a particular element
    • based on a criteria
    • producing a logical from each element
  • simple x-y plot plot(vector1,vector2)
  • make a data frame from two vectors dataframeName <- data.frame(vector1,vector2)
  • import csv data from menu and command line read.csv(path)
  • set working directory from menu and command line, including relative directory setwd
  • look at first six lines of data frame head(dataframeName)
  • look at summary statistics of data frame summary(dataframeName)
  • create an HTML notebook

Plotting and evaluating one categorical variable (Chapter 2)

2016-06-07
  • Launch RStudio; note that it opens with my previous script because I saved the workspace image.
  • Create a new script called "Dahlquist_ch2_ploting-one-categorical-variable_20160607.R"
# Script written by Kam Dahlquist
# 2016-06-07
# From http://petersonbiology.com/math230Notes/ Chapter 2
# Electronic lab notebook found at: http://www.openwetware.org/wiki/Dahlquist:GCAT-SEEK_Workshop_2016
# Manually set working directory
# Made sure that this is set for my computer, not the tutorial
setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")
# Read in the data directly
hotdogs <- read.csv("../data/hotdogs.csv")
# Look at the hotdogs data
head(hotdogs)
  • Result is:
 Type Calories Sodium
1 Beef      186    495
2 Beef      181    477
3 Beef      176    425
4 Beef      149    322
5 Beef      184    482
6 Beef      190    587
  • Result is:
 [1] Beef    Beef    Beef    Beef    Beef    Beef   
 [7] Beef    Beef    Beef    Beef    Beef    Beef   
[13] Beef    Beef    Beef    Beef    Beef    Beef   
[19] Beef    Beef    Meat    Meat    Meat    Meat   
[25] Meat    Meat    Meat    Meat    Meat    Meat   
[31] Meat    Meat    Meat    Meat    Meat    Meat   
[37] Meat    Poultry Poultry Poultry Poultry Poultry
[43] Poultry Poultry Poultry Poultry Poultry Poultry
[49] Poultry Poultry Poultry Poultry Poultry Poultry
Levels: Beef Meat Poultry
  • All text columns by default are a special kind of variable called a <factor>factor</factor>. It has "Levels", which in this case are Beef, Meat, Poultry, which are the options for that variable.
# Check just the levels directly
levels(hotdogs$Type)
  • Result is:
[1] "Beef"    "Meat"    "Poultry"
  • To count how many of each type, use the table function
# Make a table of hotdog types
table(hotdogs$Type)
  • Result is:
 Beef    Meat Poultry 
   20      17      17
  • Instead, store this result in a variable
# Make a table of hotdog types, this time assigning it to a variable
typeCounts <- table(hotdogs$Type)
  • Note that when this line is run, the Values window of the console shows that the typeCounts variable has the value of
'table' int [1:3(1d)] 20 17 17
  • Question: what is the difference between a data.frame and a table?
    • table is a function to create tabular results of categorical variables whereas a data.frame is a datatype.
# See how many total hotdogs are in the data
sum(typeCounts)
  • Result is
[1] 54
  • Is this the only way to get a sum of how many hotdogs are in this dataset? It seems a little indirect to first have to count how many there are of each individual type and then sum that together. Why not just count records? It seems that it is tied to the next task, which is to calculate the proportions for each type.
# Calculate proportions for each type of hotdog
typeCounts / sum( typeCounts )
  • Result is:
     Beef      Meat   Poultry 
0.3703704 0.3148148 0.3148148 
# Calculate proportions for each
typeCounts / sum( typeCounts )
  • There is a function called prop.table that directly will calculate proportions from a table
# Use prop.table to directly compute proportions of each type in a table
prop.table(typeCounts)
  • Already have count data saved as a variable, so can just pass it to the built-in function pie().
# Make a pie chart of count data stored in typeCounts variable
pie(typeCounts)

# Make a barchart of count data stored in typeCounts variable
barplot(typeCounts)

  • We are going to fix up this basic plot by adding a title and labels for the x and y axes, by passing arguments to the function. So far, all of the arguments we have passed to functions were implicit, i.e., not named explicitly, but assumed to be certain things because of the order they were given. To give the arguments out of order, use the format argument = value.
# Beautify the plot by adding title and axis labels
barplot( typeCounts,
         main = "Distribution of Hotdog Types",
         xlab = "Hotdog Type",
         ylab = "Count")

# Load Titanic data by reading it into a variable called titanicData
titanicData <- read.csv("../data/titanicData.csv")
# Check the data
head(titanicData)
  • Result is:
  class   age gender survival
1 First Adult   Male    Alive
2 First Adult   Male    Alive
3 First Adult   Male    Alive
4 First Adult   Male    Alive
5 First Adult   Male    Alive
6 First Adult   Male    Alive
# Save and display the counts of living and dead in a variable called titanicSurvival
titanicSurvival <- table(titanicData$survival)
titanicSurvival
  • Result is:
Alive  Dead 
  711  1490
# How many people were on the Titanic?
sum(titanicSurvival)
  • Result is:
[1] 2201
# What proportion lived and died?
prop.table(titanicSurvival)
  • Result is:
   Alive     Dead 
0.323035 0.676965
# Make a plot of survival
barplot(titanicSurvival,
        main = "Titanic Survival",
        xlab = "Status",
        ylab = "Number of people")

  • Do this again, for people in each class. I copied and pasted the previous four blocks of code and changed it to reflect the class category:
# Save and display the counts of people in different classes in a  variable called titanicClass
titanicClass <- table(titanicData$class)
titanicClass
# How many people were on the Titanic?
sum(titanicClass)
# What proportion were in each class?
prop.table(titanicClass)
# Make a plot of peole in each class
barplot(titanicClass,
        main = "Where were people on the Titanic?",
        xlab = "Class",
        ylab = "Number of people")

Commands learned in this module
  • use the $ to display the types of data stored in a variable as in hotdogs$Type
  • levels()
  • table()
  • sum()
  • prop.table()
  • pie()
  • barplot()

Plotting and evaluating two categorical variables (Chapter 3)

2016-06-08
  • Launch RStudio.
  • Download popularKids.csv data and save it in my data directory.
  • Start new script called: Dahlquist_ch3_plotting-two-categorical-variables_20160608.R
# Script written by Kam Dahlquist
# 2016-06-08
# From http://petersonbiology.com/math230Notes/ Chapter 3
# Electronic lab notebook found at: http://www.openwetware.org/wiki/Dahlquist:GCAT-SEEK_Workshop_2016
# Manually set working directory
# Made sure that this is set for my computer, not the tutorial
setwd("~/Travel-Conferences/GCAT-SEEK_2016/scripts")
# Load data into variables by reading from csv files
titanic <- read.csv("../data/titanicData.csv")
popularKids <- read.csv("../data/popularKids.csv")

Going to stop here and continue later because I've stared at the computer too long today and I'm glazed over. Kam D. Dahlquist 19:29, 8 June 2016 (EDT)

2016-06-09
  • Continuing from yesterday:
# Make table of Gender
table(popularKids$Gender)
# Make barplot of Gender
barplot( table(popularKids$Gender),
         main = "Proportion of Boys and Girls in Sample",
         xlab = "Gender",
         ylab = "Count")

# Make table of Goals
table(popularKids$Goals)
# Make barplot of Gender
barplot( table(popularKids$Goals),
         main = "Students choice in personal goals",
         xlab = "Goals",
         ylab = "Count")

[[Image:

  • We want to know how many boys want to be good in sports.
# Save a table of the two-way interaction
genderGoals <- table(popularKids$Gender, popularKids$Goals)
# Enter the variable name to actually see it
genderGoals
  • results
      Grades Popular Sports
 boy     117      50     60
 girl    130      91     30
# Make a simple (default) barplot
barplot(genderGoals,
        main = "Goals divided by gender",
        xlab = "Goals",
        ylab = "Count")


# Make a barplot with a legend, with bars side-by-side
barplot(genderGoals,
        main = "Goals divided by gender",
        xlab = "Goals",
        ylab = "Count",
        legend.text = TRUE,
        beside = TRUE)


# Save a table of location and goals
locGoals <- table(popularKids$Urban.Rural,
                 popularKids$Goals)
# View the table
locGoals
  • Results
          Grades Popular Sports
 Rural        57      50     42
 Suburban     87      42     22
 Urban       103      49     26
# Plot the result of location vs. goals
barplot(locGoals,
        legend.text = TRUE,
        beside = TRUE,
        main = "Does location matter?",
        xlab = "Personal goal",
        ylab = "Number of students")

# Save a second table of location and goals so data is transposed
locGoalsB <- table(popularKids$Goals,
                    popularKids$Urban.Rural)
# Plot the result of transposed location vs. goals
barplot(locGoalsB,
        legend.text = TRUE,
        beside = TRUE,
        main = "Does location matter?",
        xlab = "Location",
        ylab = "Number of students")
# Display the original table transposed
t(locGoals)
# Plot the original table transposed
barplot(t(locGoals),
        legend.text = TRUE,
        beside = TRUE,
        main = "Does location matter?",
        xlab = "Location",
        ylab = "Number of students")
2016-06-10
# Make a plot of the difference in goals by gender with the bars reversed from before. 
barplot(t(genderGoals),
        main = "Goals divided by gender",
        xlab = "Goals",
        ylab = "Count",
        legend.text = TRUE,
        beside = TRUE)

  • Interpretation: boys and girls are similar with respect to desire for good grades, but girls rate being popular higher than sports and boys rank them about equal.
  • Now using the titanic data...
# Save a two-way table of survival for the Titanic data
survivalClass <-  table(titanic$survival,titanic$class)
# Make a plot of survival for each class of passenger
barplot(survivalClass,
        main = "Survival by Class",
        xlab = "Class",
        ylab = "Number of People",
        legend.text = TRUE,
        beside = TRUE)

  • It is difficult to interpret the results because there are different numbers in each class. We can use the function addmargins() to see group totals.
# Look at the margins
addmargins(survivalClass)
  • Result
       Crew First Second Third  Sum
 Alive  212   203    118   178  711
 Dead   673   122    167   528 1490
 Sum    885   325    285   706 2201
  • This adds a sum column that gives a "marginal" total because it was literally written in the margins of the table when calculated by hand.
  • What we want to know is the conditional distribution of survival, i.e., what proportion of people survived given that they were in each class.
# Simple example of the conditional distribution of survival
prop.table(survivalClass)
  • Results
             Crew      First     Second      Third
 Alive 0.09631985 0.09223080 0.05361199 0.08087233
 Dead  0.30577010 0.05542935 0.07587460 0.23989096
  • but this is the proportion of the grand sum, we want the proportion of each class, so use the option "margin" where 1 is rows and 2 is columns.
# compute the conditional distribution by column, which is class
prop.table(survivalClass,
           margin = 2)
  • results in
            Crew     First    Second     Third
 Alive 0.2395480 0.6246154 0.4140351 0.2521246
 Dead  0.7604520 0.3753846 0.5859649 0.7478754
# check answer to see if columns sum to 1 as we would expect
addmargins(prop.table(survivalClass,
                      margin = 2))
            Crew     First    Second     Third       Sum
 Alive 0.2395480 0.6246154 0.4140351 0.2521246 1.5303231
 Dead  0.7604520 0.3753846 0.5859649 0.7478754 2.4696769
 Sum   1.0000000 1.0000000 1.0000000 1.0000000 4.0000000
  • result verifies this
# Plot the conditional distribution of survival by class, embedding computation in plot call
barplot( prop.table(survivalClass,
                    margin = 2),
         main = "Survival by Class",
         legend.text = TRUE,
         ylab = "Proportion surviving",
         xlab = "Class")

  • Now look at whether women were more likely to survive than men.
# Make a table of gender by survival
survivalGender <- table(titanic$survival,titanic$gender)
# Plot the conditional distributions of gender by survival
barplot( prop.table(survivalGender, 2),
         main = "Survival by Gender",
         legend.text = TRUE,
         ylab = "Proportion surviving",
         xlab = "Gender")

  • Large difference in survival between women and men
# Look at the raw counts of survival by gender
survivalGender
  • results
       Female Male
 Alive    344  367
 Dead     126 1364
# Save a two-way table of gender by class for Titanic data
genderClass <-  table(titanic$gender,titanic$class)
genderClass
# Plot the conditional distributions of gender by class
barplot( prop.table(genderClass, 2),
         main = "Where were the Women?",
         legend.text = TRUE,
         ylab = "Proportion within class",
         xlab = "Class")

# define a new variable which is the transpose of gender by class and display
genderClass2 <- t(genderClass)
genderClass2
# Plot the conditional distributions of class by gender
barplot( prop.table(genderClass2, 2),
         main = "Difference in class by gender",
         legend.text = TRUE,
         ylab = "Proportion within gender",
         xlab = "gender")

Completed Chapter 3 and committed script and notebook to GitHub repository. — Kam D. Dahlquist 19:52, 10 June 2016 (EDT)

Linux Tutorial

  • Launched PuTTY
  • selected mason.indiana.edu host
    • got warning about privacy key, selected to connect just once, not to cache
  • logged in with username and password
  • selected bash
  • Now am working through tutorial here: http://www.ee.surrey.ac.uk/Teaching/Unix/.
    • Did 1, 2, and almost all of 3.