User:Kam D. Dahlquist/Notebook/Personal/2011/07/06

From OpenWetWare
Dahlquist Personal Lab Notebook Main project page

Today

I am following the Microarray Data Processing in R protocol created by Nick and Katrina.

Installing R

  • First I have to install R on my laptop.
  • From the CRAN mirror at UCLA, I am downloading R version 2.13.0 for Windows (37 megabytes, 32/64 bit) and storing the .exe in my Installers folder on the Desktop.
    • Double-click to run installer.
    • Note: to find installers for previous versions follow the links:
mirrors > UCLA > Windows > base > previous releases
  • Note that the students are seeing issues using different versions of R.
    • Does not work in 2.7.0
    • Works in 2.7.1 and 2.7.2
    • Does not work in 2.13.0 (a single line of code fails: the normalizeBetweenArrays line)
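Since the results depend on the R version, it may help to record which version is running before starting. The following is a minimal sketch; the `worked` and `failed` vectors simply restate the versions listed above, and the variable names are my own.

```r
# Report the running R version and compare it with the versions noted above.
cat(R.version.string, "\n")
running <- as.character(getRversion())
worked <- c("2.7.1", "2.7.2")   # versions the students reported as working
failed <- c("2.7.0", "2.13.0")  # versions reported as not working
status <- if (running %in% worked) {
  "reported to work"
} else if (running %in% failed) {
  "reported to fail"
} else {
  "not covered by these notes"
}
message("R ", running, ": ", status)
```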

Installing Packages

The instructions for downloading and installing Bioconductor packages can be found on the Bioconductor.org Installs page.

  • Launch R.
  • At the command prompt, type the following:
source("http://bioconductor.org/biocLite.R")
biocLite()
  • The program will automatically download and install the packages.
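To confirm that the install actually provided limma (the package loaded later in this protocol), a quick check along these lines could be run. This is a sketch only; it reports the result and does not try to install anything.

```r
# Check whether limma can be loaded, without attaching it.
has_limma <- requireNamespace("limma", quietly = TRUE)
if (has_limma) {
  message("limma version: ", as.character(packageVersion("limma")))
} else {
  message("limma is not installed yet")
}
```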

Within-chip Normalization

Getting Files Ready

  • First, target files must be created to input all of the GPR files into R.
  • The target file is a .txt file that contains all of the filenames of the .gpr files from GenePix Pro. The file should be formatted as follows:
    • The first line should say FileName.
    • Starting from the second line and using as many lines as necessary, copy and paste the names of all of the .gpr files ordering them first by strain (wild type and then the deletion strains in alphabetical order), then by time point, then by flask.
      • Use the file MicroarrayLog.xls to find out which chips belong to which strain/flask/timepoint.
      • I had to modify the Bottom_GCAT.txt and Top_GCAT.txt files given to me by Katrina because the files were in the wrong order. The targets_all_ont.txt file checked out correctly.
    • Make sure that each line has only one .gpr file name.
    • Three of these target files must be created, one for the Ontario chip .gpr files and two for the GCAT chip .gpr files.
      • One of the target files for the GCAT chips must contain the .gpr file names corresponding to the tops of the GCAT chips and the other target file must contain the .gpr file names corresponding to the bottom of the GCAT chips.
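The ordering rule for the target file (wild type first, then deletion strains alphabetically, then time point, then flask) could also be applied in R instead of by hand. The sketch below uses a made-up chip log; the file names, the strain label "dGLN3", and the column names are hypothetical stand-ins for what MicroarrayLog.xls actually contains.

```r
# Hypothetical chip log; the real values would come from MicroarrayLog.xls.
chip_log <- data.frame(
  FileName  = c("dGLN3_t15_f1.gpr", "wt_t30_f1.gpr", "wt_t15_f2.gpr", "wt_t15_f1.gpr"),
  strain    = c("dGLN3", "wt", "wt", "wt"),
  timepoint = c(15, 30, 15, 15),
  flask     = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Wild type first (FALSE sorts before TRUE), then strain name, time point, flask.
is_deletion <- chip_log$strain != "wt"
ord <- order(is_deletion, chip_log$strain, chip_log$timepoint, chip_log$flask)
targets_df <- chip_log[ord, "FileName", drop = FALSE]

# One file name per line under a "FileName" header.
target_path <- file.path(tempdir(), "targets_example.txt")
write.table(targets_df, target_path, sep = "\t", quote = FALSE, row.names = FALSE)
print(readLines(target_path))
```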

I am creating a folder in LionShare that contains all of the files created/needed to do this analysis. The files in this folder are:

MicroarrayLog.xls
Bottom_GCAT.txt
Top_GCAT.txt
targets_all_ont.txt
  • Make sure that the target file and the corresponding .gpr files for the Ontario chips are in one folder and the target files and the corresponding .gpr files for the GCAT chips are in another folder.
    • Moved copies of all .gpr files to the folders GCAT_nonnormalized_gpr_files and Ontario_nonnormalized_gpr_files.
    • Double-checked that all of the .gpr files were from the correct scans of the chips.
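A mismatch between the target file and the folder contents can be caught before importing anything. The sketch below builds a throwaway folder with one of two listed files present; the folder and file names are invented for the example.

```r
# Set up a throwaway folder: chip_A.gpr exists, chip_B.gpr does not.
gpr_dir <- file.path(tempdir(), "gpr_check_example")
dir.create(gpr_dir, showWarnings = FALSE)
file.create(file.path(gpr_dir, "chip_A.gpr"))
target_path <- file.path(gpr_dir, "targets_example.txt")
writeLines(c("FileName", "chip_A.gpr", "chip_B.gpr"), target_path)

# List every file named in the target file that is not in the folder.
targets <- read.delim(target_path, stringsAsFactors = FALSE)
found <- file.exists(file.path(gpr_dir, targets$FileName))
missing_files <- targets$FileName[!found]
print(missing_files)  # chip_B.gpr was never created, so it is reported
```

An empty result would mean every listed file is present under exactly the name given in the target file.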

Within-chip Normalization for Ontario Chips

  • Launch R.
  • Change the directory using the menu File > Change dir... to the folder containing the target file and the .gpr files for the Ontario chips.
  • Into the R Console window, type:
library(limma)
  • Import the target .txt file containing all of the .gpr file names for the Ontario chips.
  • Type the following:
targets<-readTargets(file.choose())
This opens a file dialog to select the target .txt file.
f<-function(x) as.numeric(x$Flags > -99)
RGO<-read.maimages(targets, source="genepix.median", wt.fun=f)
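The weight function f above turns the Flags column into 0/1 spot weights. The small sketch below applies the same function to a made-up data frame; the flag values are illustrative, on the assumption that GenePix marks spots flagged as bad with -100.

```r
# Same weight function as in the protocol.
f <- function(x) as.numeric(x$Flags > -99)

# Hypothetical spots with a range of flag values.
spots <- data.frame(
  Name  = c("spot1", "spot2", "spot3", "spot4"),
  Flags = c(0, -50, -100, 100)
)

# Only flags at or below -99 receive weight 0; everything else keeps weight 1.
weights <- f(spots)
print(data.frame(spots, weight = weights))
```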
  • Perform Within Array Normalization by typing the following:
MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")
This returns messages saying "Array # corrected"
  • The next bit of code does the following:
    • Create a blank matrix which has as many rows as there are unique genes and as many columns as there are .gpr files.
    • Average all of the replicate spots by using the tapply function.
    • Write all of the log fold changes after averaging replicate spots into the blank matrix previously created.
M1<-tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean)
n0<-length(MAO$M[1,])
n1<-length(M1)
MM<-matrix(nrow=n1,ncol=n0)
MM[,1]=M1
for(i in 1:94) {MM[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes[,5]),mean)}

This is as far as I got today. I may have mistyped the previous command because I got an error of unexpected }. I'll look into it tomorrow. — Kam D. Dahlquist 21:33, 6 July 2011 (EDT). It processed today, yesterday was a typo. — Kam D. Dahlquist 12:56, 7 July 2011 (EDT)
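The replicate-averaging loop can be tried on a tiny example to see what it produces. The sketch below uses a made-up matrix and ID vector in place of MAO$M and MAO$genes[,5], takes the number of chips from the data instead of typing 94, and keeps the gene IDs and chip names as row and column names. All names and values here are hypothetical.

```r
# Hypothetical log ratios: 4 spots (2 genes spotted twice) on 2 chips.
M <- matrix(c(1, 3, 5, 7,
              2, 4, 6, 8),
            nrow = 4,
            dimnames = list(NULL, c("chip1.gpr", "chip2.gpr")))
ids <- c("YAL001C", "YAL001C", "YAL002W", "YAL002W")

# Average the first chip to learn how many unique genes there are.
first <- tapply(M[, 1], as.factor(ids), mean)
n_chips <- ncol(M)

# Blank matrix: one row per unique gene, one column per chip.
MM <- matrix(nrow = length(first), ncol = n_chips,
             dimnames = list(names(first), colnames(M)))

# Fill each column with the per-gene mean of the replicate spots.
for (i in seq_len(n_chips)) {
  MM[, i] <- tapply(M[, i], as.factor(ids), mean)
}
print(MM)
```

Looping over `seq_len(n_chips)` avoids editing the loop when the number of .gpr files changes.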

  • A .csv file must be created to denote which chips were dyeswapped.
    • To do so, launch Microsoft Excel.
    • Label the first column in row 1 with "Multiplier" or another label of your choosing.
    • Each cell in the column corresponds to each .gpr file in the targets file in the order that they appear in the targets file.
    • If a chip has not been dyeswapped (ratio is red-experimental/green-control), type a "1" into the appropriate cell within the column in the Excel file.
    • If a chip has been dyeswapped (red-control/green-experimental), type a "-1" into the appropriate cell within the column in the Excel file. Save this excel file as a .csv file.
      • I obtained the file dyeswap_matrix_ont.csv from Katrina and double-checked that all of the dyeswaps were correct against the MicroarrayLog.xls and targets_all_ont.txt files. I uploaded a copy of the file to LionShare.
  • Import the .csv file denoting which chips are dyeswapped into R using the following command:
ds<-read.csv("dyeswap_matrix_ont.csv",header=TRUE,sep=",")
  • Create a new matrix so that the number of rows corresponds to the number of unique genes and the number of columns corresponds to the number of GPR files using the following command:
MN<-matrix(nrow=6191,ncol=94)
  • Multiply the values in the matrix containing the log fold changes after the replicates have been averaged by the dyeswap list created using the following command:
for (i in 1:94) {MN[,i]<-ds[i,]*MM[,i]}
  • Write the data to a table using the following command:
write.table(MN,"ont_WANorm_dyeswapped_Matrix.csv",sep=",")
This creates a file called ont_WANorm_dyeswapped_Matrix.csv. When I look at it in Excel, I see that the first column is an index that counts the data rows (6191). The columns are given headers of V1, V2, etc. However, the header "V1" appears in column A (the index column), and the last column of data, column CQ, has no header. The headers need to be shifted right by one column. It would also be nice if the column headers were more meaningful, like the filenames of the chips they correspond to.
The results are different depending on which version of R I am using. R version 2.7.2 gives a different result than R version 2.13.0.
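Both the dye-swap multiplication and the header offset can be illustrated on a small example. The sketch below uses made-up data; it multiplies each column by its multiplier with `sweep`, and writes the file with `col.names = NA`, which makes `write.table` emit a blank header cell above the row names so the headers line up. The gene and chip names are hypothetical.

```r
# Hypothetical averaged log ratios with gene IDs and chip names attached.
MM <- matrix(c(1, -2, 0.5,
               3,  4, -1),
             nrow = 3,
             dimnames = list(c("geneA", "geneB", "geneC"),
                             c("chip1.gpr", "chip2.gpr")))

# One multiplier per chip, in the same order as the columns (-1 = dye-swapped).
ds <- data.frame(Multiplier = c(1, -1))
stopifnot(nrow(ds) == ncol(MM))

# Multiply every column by its multiplier.
MN <- sweep(MM, 2, ds$Multiplier, "*")

# col.names = NA writes an empty header cell for the row-name column.
out_path <- file.path(tempdir(), "dyeswapped_example.csv")
write.table(MN, out_path, sep = ",", col.names = NA)

# Read it back to confirm that headers and IDs line up.
check <- read.csv(out_path, row.names = 1, check.names = FALSE)
print(check)
```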
  • The .csv file generated does not have the gene IDs listed. In order to obtain the list of gene IDs from R, run the tapply function with only one .gpr file and write it to a table.
write.table(tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean),"Ontario_ID.csv",sep=",")
  • Open these two files in Excel. Shift the column headings in the ont_WANorm_dyeswapped_Matrix.csv to the right by one column.
  • Copy the IDs (column A) from the Ontario_ID.csv file and paste them into column A of the ont_WANorm_dyeswapped_Matrix.csv file.
  • Delete the first two rows of values of the ont_WANorm_dyeswapped_Matrix.csv file, corresponding to the control spots (in the .gpr file, these are labeled "3XSSC" and "Arabidopsis"). There should be 6189 records remaining.
  • Save this new file as ont_WANorm_dyeswapped_Matrix_edited.csv to denote that I made these changes.
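Removing the control rows by name, not by position, avoids relying on them sorting to the top. The sketch below assumes the matrix carries the IDs as row names, which the commands above do not set, so it is a hypothetical variant and not a record of what was done. The gene IDs and values are made up.

```r
# Hypothetical matrix with two control rows and two gene rows.
MN <- matrix(c(0.1, 0.2, 1.5, -0.7,
               0.0, 0.3, 2.1, -1.2),
             nrow = 4,
             dimnames = list(c("3XSSC", "Arabidopsis", "YAL001C", "YAL002W"),
                             c("chip1", "chip2")))

# Drop the control spots by name; drop = FALSE keeps the result a matrix.
controls <- c("3XSSC", "Arabidopsis")
MN_edited <- MN[!rownames(MN) %in% controls, , drop = FALSE]
print(MN_edited)
```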

Within-chip Normalization for GCAT Chips

  • Change the directory using the menu File > Change dir... to the folder containing the target file and the .gpr files for the GCAT chips.
  • Import the target file containing all of the .gpr file names for the top blocks of the GCAT chips.
targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RT<-read.maimages(targets, source="genepix.median", wt.fun=f)
Note: when I first tried this, I got an error because a typo in one of the filenames meant that it did not match the name listed in the targets file. I fixed the filename and tried the commands again.
  • Then, import the target file containing all of the .gpr file names for the bottom blocks of the GCAT chips.
targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RB<-read.maimages(targets, source="genepix.median", wt.fun=f)
  • Merge the top and bottom of each of the chips into one column using the rbind function:
RGG<-rbind(RT,RB)
Received the following error message: "Error in rbind(RT, RB) : object 'RB' not found". This was my mistake: I had not noticed that the commands above assign to two different objects, RT and RB. I will redo.
  • Perform Within-array normalization using the following command:
MAG<-normalizeWithinArrays(RGG,method="loess", bc.method="normexp")
  • Create a new matrix so that the number of rows corresponds to the number of unique genes and the number of columns corresponds to the number of .gpr files using the following command:
M<-matrix(nrow=6403,ncol=9)
  • Use the tapply function to average duplicate spots so that only unique spots remain in the data with the following command:
for(i in 1:9) {M[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}
  • Write the table with the following command:
write.table(M,"GCAT_WANorm.csv",sep=",")
This creates a file called GCAT_WANorm.csv. Note that we don't have to dye-swap any of the GCAT chips because they were all red-experimental/green-control. The same issue regarding the offset of the column headings in the results file applies here. There are nine data columns (up to column J) and 6403 data records in this file. There are three "control" records, "empty", "Empty", and "EMPTY". These will be removed when the data files are merged in Access.
  • The .csv file generated does not have the gene IDs listed. In order to obtain the list of gene IDs from R, run the tapply function with only one .gpr file and write it to a table using the following command.
write.table(tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean),"GCAT_ID.csv",sep=",")
  • Open these two files in Excel. Shift the column headings in the GCAT_WANorm.csv to the right by one column.
  • Copy the IDs (column A) from the GCAT_ID.csv file and paste them into column A of the GCAT_WANorm.csv file.
  • Save this new file as GCAT_WANorm_edited.csv to denote that I made these changes.
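The three "control" records appear because the grouping is case-sensitive, so "empty", "Empty", and "EMPTY" each form their own group. The sketch below reproduces this on made-up values and shows one possible way to drop those records in R; in this protocol they are instead removed during the Access merge. The IDs and numbers are hypothetical.

```r
# Hypothetical spot IDs and log ratios, including three spellings of "empty".
ids <- c("empty", "Empty", "EMPTY", "YAL001C", "YAL001C")
m   <- c(0.1, 0.2, 0.3, 1.0, 2.0)

# Grouping is case-sensitive, so each spelling becomes its own record.
by_id <- tapply(m, as.factor(ids), mean)
print(names(by_id))

# Identify the control records regardless of case and drop them.
is_control <- toupper(names(by_id)) == "EMPTY"
by_id_genes <- by_id[!is_control]
print(by_id_genes)
```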

Merging Ontario and GCAT data using Microsoft Access

The microarrays obtained from GCAT and from Ontario have different gene sets printed on them. In order to proceed with the analysis, we need to merge the two datasets, making sure to match data from the same genes with each other. We will use a query in Microsoft Access to accomplish this. In the process, we will throw away the genes/spots from the GCAT chips that are not present on the Ontario chips, but keep all of the genes/spots from the Ontario chips. (This is because there are far fewer GCAT chips in the study and we would not be able to do any robust statistical analysis on the orphaned GCAT genes anyway.)

  1. Launch Microsoft Access.
  2. Create a new database in Access. Save it as Merging_GCAT_Ontario_chips.mdb.
  3. Import the data (File->Get External Data->Import)
    • Note that the arrangement of these menu items varies in different versions of Access.
  4. Go through the import Wizard:
    • Specify the data as delimited
    • Keep the delimiter as "comma" and the text qualifier as "none". Indicate that the first row contains field names.
    • Choose "no primary key".
    • Repeat this process once with the GCAT data and once with the Ontario data.
  5. In the window for the current database, go to queries and select "Create query in Design View".
  6. Add both imported tables (the GCAT and Ontario).
  7. In the "Select Query" window, join GCAT ID and Ontario ID with a line.
    • Right click on the line and select the "Join Properties" option.
    • Select the option "Include ALL records from the Ontario data and only those records from the GCAT data where the joined fields are equal".
  8. Select all of the fields in the GCAT query window and drag into the first box in the "Field" row in the table below.
  9. Select all of the fields in the Ontario query window and drag into the next free box in the "Field" row in the table below.
  10. Create a new table for this joined data by choosing the menu item Query > Make-Table Query. Name the new table "GCAT_Ontario_merged_data".
  11. Run the query. There should be 6189 rows in the new table.
  12. Copy and paste the new table into a new Excel spreadsheet and save the file as GCAT_Ontario_merged_within-chip-normalized_data.xls.
    • Note that the data was previously sorted in alphabetical order by ID. After doing a database query, we can't assume that this order is preserved. In Excel, re-sort A-Z on the Ontario ID column before proceeding.
    • Note that R will not import the data with a non-numeric ID column. After sorting, the ID columns need to be deleted.
    • Make sure that the columns are still in the order of strain, timepoint, flask.
    • Once these changes are made, save the file as GCAT_Ontario_merged_within-chip-normalized_data_edited.csv to denote that these changes were made and to put it into .csv format.
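The join described above (keep every Ontario record, attach GCAT data only where the IDs match) corresponds to a left join, which base R's `merge` can also perform with `all.x = TRUE`. The sketch below is an illustration on made-up tables, not the method used in this notebook; the IDs and column names are hypothetical.

```r
# Hypothetical within-chip normalized tables, one per chip type.
ontario <- data.frame(
  ID = c("YAL001C", "YAL002W", "YAL003W"),
  ont_chip1 = c(0.1, -0.4, 1.2),
  stringsAsFactors = FALSE
)
gcat <- data.frame(
  ID = c("YAL002W", "YAL003W", "YZZ999W"),  # last ID is absent from Ontario
  gcat_chip1 = c(0.3, 0.9, -0.7),
  stringsAsFactors = FALSE
)

# Left join: all Ontario rows, GCAT values only where the IDs match.
merged <- merge(ontario, gcat, by = "ID", all.x = TRUE)

# Re-sort by ID explicitly so later steps do not depend on join order.
merged <- merged[order(merged$ID), ]
print(merged)
```

Genes missing from the GCAT chips come through as NA, and GCAT-only IDs are dropped, matching the intent stated above.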

Between-chip Normalization

  • Change the directory in R using the menu File > Change dir... to the folder containing the GCAT_Ontario_merged_within-chip-normalized_data_edited.csv file.
  • Read the .csv file containing the within array normalized data for all of the chips into R using the following command:
MA2<-read.csv("GCAT_Ontario_merged_within-chip-normalized_data_edited.csv",header=TRUE,sep=",")
  • Force the .csv file into a matrix using the following command:
MA3<-as.matrix(MA2)
  • Perform between-array normalization using the following command:
MAB<-normalizeBetweenArrays(MA3,method="scale",targets=NULL)
Received the following "Warning message: In log(apply(x, 2, median, na.rm = TRUE)) : NaNs produced". According to Nick and Katrina, this is what happens in R version 2.13.0, and I need to run this step in R version 2.7.2 instead. So I will install that and try again.
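The warning text shows that the code takes the log of each column's median. Log ratios can have a negative median, and `log()` of a negative number returns NaN, which would produce this warning. The sketch below reproduces that on made-up values and then shows scaling by the median absolute value as one alternative that stays positive. That the older version behaved this way is my assumption, not something confirmed here; limma's `normalizeMedianAbsValues` appears to do this kind of scaling, but I have not checked it against these data.

```r
# Hypothetical log ratios for two chips; chip1 has a negative median.
x <- cbind(chip1 = c(-2, -1, 0.5, NA),
           chip2 = c(-4,  2, 6,   1))

# Column medians: -1 for chip1, 1.5 for chip2.
col_medians <- apply(x, 2, median, na.rm = TRUE)

# log() of the negative median is NaN (warning suppressed for the demo).
log_medians <- suppressWarnings(log(col_medians))
print(log_medians)

# Alternative: scale so every chip has the same median absolute value.
abs_medians <- apply(abs(x), 2, median, na.rm = TRUE)
scale_factor <- exp(log(abs_medians) - mean(log(abs_medians)))
x_scaled <- sweep(x, 2, scale_factor, "/")
print(apply(abs(x_scaled), 2, median, na.rm = TRUE))
```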
  • Write the data to a table using the following command (only works in R version 2.7.2):
write.table(MAB,"ALL_BANorm_dyeswapped_Matrix.csv",sep=",")
Note that the column headings need to be offset 1 column to the right. I had to paste the IDs back in from the GCAT_Ontario_merged_within-chip-normalized_data_edited.xls file. I then saved it as ALL_BANorm_dyeswapped_Matrix.xls.
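Deleting the ID column before import and pasting it back afterwards might be avoidable. `read.csv` accepts `row.names = 1`, which stores the first column as row names and leaves the remaining columns numeric. The sketch below demonstrates this on a small made-up file; the file name and values are hypothetical, and I have not tested it with the versions of R discussed above.

```r
# Write a small hypothetical merged file that still has its ID column.
path <- file.path(tempdir(), "merged_example.csv")
writeLines(c("ID,chip1,chip2",
             "YAL001C,0.1,0.3",
             "YAL002W,-0.4,0.9"), path)

# Use the first column as row names instead of deleting it.
MA2 <- read.csv(path, header = TRUE, row.names = 1)
MA3 <- as.matrix(MA2)

# The matrix is numeric and the IDs travel with it as row names.
print(is.numeric(MA3))
print(rownames(MA3))
```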
I passed the file along to Nick and Katrina and they found discrepancies between what I did and what they did. I am going to repeat the whole process in R version 2.7.2 to make sure that I did everything correctly. — Kam D. Dahlquist 19:59, 7 July 2011 (EDT)