User:Kam D. Dahlquist/Notebook/Personal/2011/07/06

{| width="800"
 * style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]] Dahlquist Personal Lab Notebook
 * style="background-color: #F2F2F2" align="center"|  |Main project page


 * colspan="2"|

Today
I am following the Microarray Data Processing in R protocol created by Nick and Katrina.

Installing R
 * First I have to install R on my laptop.
 * From the CRAN mirror at UCLA, I am downloading R version 2.13.0 for Windows (37 megabytes, 32/64 bit) and storing the .exe in my Installers folder on the Desktop.
 * Double-click to run the installer.
 * Note: to find installers for previous versions, follow the links: mirrors > UCLA > Windows > base > previous releases
 * Note that the students are seeing issues using different versions of R:
 * Does not work in 2.7.0
 * Works in 2.7.1 and 2.7.2
 * Does not work in 2.13.0 (one line of code makes it fail: the normalize between arrays line)
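Because the protocol behaves differently across R versions, it may help to record the running version at the start of each session. A minimal sketch using only base R; the `reported_ok` vector simply restates the versions listed above and is not an official compatibility list:

```r
# Versions the notes above report as working (taken from this page only).
reported_ok <- c("2.7.1", "2.7.2")

# getRversion() returns the version of the running R session.
running <- as.character(getRversion())

version_note <- if (running %in% reported_ok) {
  paste("R", running, "is one of the versions reported to work")
} else {
  paste("R", running, "is not among the versions reported to work; results may differ")
}
print(version_note)
```

Pasting the printed line into the notebook next to each result makes later version comparisons easier.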

Installing Packages
The instructions for downloading and installing Bioconductor packages can be found on the Bioconductor.org Installs page.
 * Launch R.
 * At the command prompt, type the following:

source("http://bioconductor.org/biocLite.R")
biocLite()

 * The program will automatically download and install the packages.
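The `biocLite.R` script has since been retired by Bioconductor; current releases install packages through the BiocManager package from CRAN. The sketch below is for newer R installations only and does not apply to the R 2.7.x and 2.13.0 versions used in this entry. The function name `install_limma` is made up for this example, and wrapping the calls in a function means nothing is downloaded until it is called:

```r
# Hypothetical helper: installs limma through BiocManager on a recent R.
install_limma <- function() {
  # Install BiocManager from CRAN only if it is not already available.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("limma")
}
```

Calling `install_limma()` once at the console would then take the place of the two commands above.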

Getting Files Ready

 * First, target files must be created to input all of the GPR files into R.
 * The target file is a .txt file that contains all of the filenames of the .gpr files from GenePix Pro.  The file should be formatted as follows:
 * The first line should say FileName.
 * Starting from the second line and using as many lines as necessary, copy and paste the names of all of the .gpr files, ordering them first by strain (wild type and then the deletion strains in alphabetical order), then by time point, then by flask.
 * Use the file MicroarrayLog.xls to find out which chips belong to which strain/flask/timepoint.
 * I had to modify two of the target files given to me by Katrina because the files were listed in the wrong order.  The third target file checked out correctly.
 * Make sure that each line has only one .gpr file name.
 * Three of these target files must be created, one for the Ontario chip .gpr files and two for the GCAT chip .gpr files.
 * One of the target files for the GCAT chips must contain the .gpr file names corresponding to the tops of the GCAT chips and the other target file must contain the .gpr file names corresponding to the bottoms of the GCAT chips.
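The ordering rule for a target file (wild type first, then deletion strains alphabetically, then time point, then flask) can also be applied in R rather than by hand. This is only a sketch: the file names, strain labels, and the `chips` table are invented, and the real annotations would come from the microarray log.

```r
# Hypothetical chip annotations; real values would come from the microarray log.
chips <- data.frame(
  FileName  = c("dGLN3_t30_f1.gpr", "wt_t15_f2.gpr", "wt_t15_f1.gpr", "dCIN5_t15_f1.gpr"),
  strain    = c("dGLN3", "wt", "wt", "dCIN5"),
  timepoint = c(30, 15, 15, 15),
  flask     = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Wild type first (FALSE sorts before TRUE), then deletion strains
# alphabetically, then time point, then flask.
ord <- order(chips$strain != "wt", chips$strain, chips$timepoint, chips$flask)
targets_df <- data.frame(FileName = chips$FileName[ord], stringsAsFactors = FALSE)

# One file name per line under a FileName header, written to a temp folder.
out <- file.path(tempdir(), "targets_example.txt")
write.table(targets_df, out, sep = "\t", quote = FALSE, row.names = FALSE)
```

Opening the written file in a text editor is a quick way to confirm that each line holds exactly one file name.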

''I am creating a folder in LionShare that contains all of the files created/needed to do this analysis. The files in this folder are:''
 * MicroarrayLog.xls
 * Bottom_GCAT.txt
 * Top_GCAT.txt
 * targets_all_ont.txt


 * Make sure that the target file and the corresponding .gpr files for the Ontario chips are in one folder and the target files and the corresponding .gpr files for the GCAT chips are in another folder.
 * ''Moved copies of all .gpr files to the two folders described above, one for the Ontario chips and one for the GCAT chips.''
 * Double-checked that all of the .gpr files were from the correct scans of the chips.

Within-chip Normalization for Ontario Chips
 * Launch R.
 * Change the directory using the menu File > Change dir... to the folder containing the target file and the .gpr files for the Ontario chips.
 * Into the R Console window, type:

library(limma)

 * Import the target .txt file containing all of the .gpr file names for the Ontario chips.
 * Type the following:

targets<-readTargets(file.choose())

 * This opens a file dialog to select the target .txt file.

 * Perform Within Array Normalization by typing the following:

f<-function(x) as.numeric(x$Flags > -99)
RGO<-read.maimages(targets, source="genepix.median", wt.fun=f)
MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")

 * This returns messages saying "Array # corrected"
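The weight function `f` passed as `wt.fun` receives the spot table read from each .gpr file and returns one weight per spot. A small base-R illustration with made-up flag values, assuming the usual GenePix coding in which -100 marks a spot flagged as bad:

```r
# Same weight function as in the protocol.
f <- function(x) as.numeric(x$Flags > -99)

# Toy spot table; the Flags values are invented for illustration.
spots <- data.frame(Flags = c(0, -50, -75, -100))

# Spots with a flag above -99 get weight 1, all others weight 0.
weights <- f(spots)
print(weights)
```

With this cutoff only the -100 spot receives weight 0, so it is ignored when the loess curve is fitted.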

 * The next bit of code does the following:
 * Creates a blank matrix which has as many rows as there are unique genes and as many columns as there are .gpr files.
 * Averages all of the replicate spots by using the tapply function.
 * Writes all of the log fold changes after averaging replicate spots into the blank matrix previously created.

M1<-tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean)
n0<-length(MAO$M[1,])
n1<-length(M1)
MM<-matrix(nrow=n1,ncol=n0)
MM[,1]=M1
for(i in 1:94) {MM[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes[,5]),mean)}

''This is as far as I got today. I may have mistyped the previous command because I got an error of unexpected }. I'll look into it tomorrow. -- Kam D. Dahlquist 21:33, 6 July 2011 (EDT). It processed today; yesterday's error was a typo. -- Kam D. Dahlquist 12:56, 7 July 2011 (EDT)''
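The replicate-averaging step can be tried on a toy matrix before running it on the real data. The sketch below uses invented gene IDs and values (`M_toy`, `ids`) and base R only; it mirrors the loop above but keeps the gene IDs as row names.

```r
# Toy log fold changes: 4 spots (rows) on 2 chips (columns).
M_toy <- matrix(c(1, 3, 5, 7,
                  2, 4, 6, 8), nrow = 4)

# Each gene is spotted twice; the IDs are invented.
ids <- c("YAL001C", "YBR002W", "YAL001C", "YBR002W")

# Blank matrix: one row per unique gene, one column per chip.
gene_levels <- levels(as.factor(ids))
avg <- matrix(nrow = length(gene_levels), ncol = ncol(M_toy),
              dimnames = list(gene_levels, NULL))

# tapply averages the replicate spots of each gene, chip by chip.
for (i in seq_len(ncol(M_toy))) {
  avg[, i] <- tapply(M_toy[, i], as.factor(ids), mean)
}
print(avg)
```

Looping over `seq_len(ncol(...))` instead of a fixed `1:94` lets the same loop work for any number of chips.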

 * A .csv file must be created to denote which chips were dyeswapped.
 * To do so, launch Microsoft Excel.
 * Label the first column in row 1 with "Multiplier" or another label of your choosing.
 * Each cell in the column corresponds to one .gpr file in the targets file, in the order that they appear in the targets file.
 * If a chip has not been dyeswapped (ratio is red-experimental/green-control), type a "1" into the appropriate cell within the column in the Excel file.
 * If a chip has been dyeswapped (red-control/green-experimental), type a "-1" into the appropriate cell within the column in the Excel file. Save this Excel file as a .csv file.
 * I obtained the file dyeswap_matrix_ont.csv from Katrina and double-checked that all of the dyeswaps were correct against the   and   files.  I uploaded a copy of the file to LionShare.
 * Import the .csv file denoting which chips are dyeswapped into R using the following command:

ds<-read.csv("dyeswap_matrix_ont.csv",header=TRUE,sep=",")

 * Create a new matrix so that the number of rows corresponds to the number of unique genes and the number of columns corresponds to the number of GPR files using the following command:

MN<-matrix(nrow=6191,ncol=94)


 * Multiply the values in the matrix containing the log fold changes after the replicates have been averaged by the dyeswap list created using the following command:

for (i in 1:94) {MN[,i]<-ds[i,]*MM[,i]}
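The dye-swap correction is a column-wise multiplication by 1 or -1. A toy sketch with invented values, using base R's `sweep` as an alternative to the explicit loop (the names `MM_toy`, `ds_toy`, and `MN_toy` are placeholders for this example):

```r
# Toy averaged log fold changes: 2 genes (rows) on 2 chips (columns).
MM_toy <- matrix(c(1, -2, 0.5, 3), nrow = 2)

# Chip 1 is not dye-swapped (1); chip 2 is dye-swapped (-1).
ds_toy <- data.frame(Multiplier = c(1, -1))

# Multiply every column by its chip's multiplier.
MN_toy <- sweep(MM_toy, 2, ds_toy$Multiplier, FUN = "*")
print(MN_toy)
```

The second column changes sign while the first is left untouched, which is the behaviour expected from the loop above.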


 * Write the data to a table using the following command:

write.table(MN,"ont_WANorm_dyeswapped_Matrix.csv",sep=",")


 * This creates a file called ont_WANorm_dyeswapped_Matrix.csv.  When I look at it in Excel, I see that the first column is an index that counts the data rows (6191).  The columns are given headers of V1, V2, etc.  However, the header "V1" appears in column A, above the index, and the last column of data, column CQ, has no header.  The headers need to be shifted right by one column, and it would be nice if the column headers were more meaningful, like the filenames of the chips they correspond to.
 * ''The results are different depending on which version of R I am using. R version 2.7.2 gives a different result than R version 2.13.0.''


 * The .csv file generated does not have the gene IDs listed. In order to obtain the list of gene IDs from R, run the tapply function with only one .gpr file and write it to a table.

write.table(tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean),"Ontario_ID.csv",sep=",")


 * Open these two files in Excel. Shift the column headings in ont_WANorm_dyeswapped_Matrix.csv to the right by one column.
 * Copy the IDs (column A) from the Ontario_ID.csv file and paste them into column A of the ont_WANorm_dyeswapped_Matrix.csv file.
 * Delete the first two rows of values of the ont_WANorm_dyeswapped_Matrix.csv file, corresponding to the control spots (these IDs are labeled "3XSSC" and "Arabidopsis").  There should be 6189 records remaining.
 * Save this new file under a new name to denote that I made these changes.
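The header offset and the missing gene IDs described above might both be avoided at write time. In base R, `write.table` with `col.names = NA` writes an empty header cell above the row names, and row and column names set on the matrix are written out as the IDs and chip labels. The labels below are invented; in a real session they would presumably come from the names of the tapply result and from the targets file.

```r
# Toy stand-in for the averaged, dye-swapped matrix (2 genes x 2 chips).
MN_toy <- matrix(c(0.5, -1.2, 0.1, 0.9), nrow = 2)

# Hypothetical labels for genes (rows) and chips (columns).
rownames(MN_toy) <- c("YAL001C", "YBR002W")
colnames(MN_toy) <- c("chipA.gpr", "chipB.gpr")

out <- file.path(tempdir(), "toy_WANorm.csv")
# col.names = NA adds a blank header cell so headers line up with the data.
write.table(MN_toy, out, sep = ",", col.names = NA)

# Read the file back to confirm that IDs and headers are aligned.
check <- read.csv(out, row.names = 1, check.names = FALSE)
print(check)
```

A file written this way opens in Excel with the IDs in column A and each header above its own data column.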

Within-chip Normalization for GCAT Chips

 * Change the directory using the menu File > Change dir... to the folder containing the target file and the .gpr files for the GCAT chips.
 * Import the target file containing all of the .gpr file names for the top blocks of the GCAT chips.

targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RT<-read.maimages(targets, source="genepix.median", wt.fun=f)

 * Note: when I first tried this, I got an error because one of the filenames did not match what was listed in the targets file, due to a typo in the filename.  I fixed the filename and ran the commands again.
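A mismatch between the targets file and the files on disk can be caught before calling `read.maimages`. A minimal sketch in base R, using a temporary folder and invented file names (one of them deliberately misspelled):

```r
# Temporary folder standing in for the GCAT data folder.
gpr_dir <- tempfile("gpr_check")
dir.create(gpr_dir)
file.create(file.path(gpr_dir, c("chip1_top.gpr", "chip2_top.gpr")))

# Hypothetical targets table; the third name contains a typo on purpose.
targets_toy <- data.frame(
  FileName = c("chip1_top.gpr", "chip2_top.gpr", "chip3_tpo.gpr"),
  stringsAsFactors = FALSE
)

# List every file named in the targets table that is not in the folder.
missing_files <- targets_toy$FileName[
  !file.exists(file.path(gpr_dir, targets_toy$FileName))
]
print(missing_files)
```

An empty result means every listed file was found; anything printed points at a typo or a misplaced file.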


 * Then, import the target file containing all of the .gpr file names for the bottom blocks of the GCAT chips.

targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RB<-read.maimages(targets, source="genepix.median", wt.fun=f)


 * Merge the top and bottom of each of the chips into one column using the rbind function:

RGG<-rbind(RT,RB)
 * Received the following error message: "Error in rbind(RT, RB) : object 'RB' not found".  This was my mistake; I didn't realize that the two sets of commands above assign to different objects (RT and RB).  I will redo.
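The effect of `rbind` here is to stack the spots from the top and bottom blocks so that each chip remains one column. The toy below uses plain matrices with invented names rather than limma objects, and checks that both halves exist and share the same chips before merging:

```r
# Toy top and bottom blocks: 2 spots each on the same 2 chips.
top <- matrix(1:4, nrow = 2,
              dimnames = list(c("spotT1", "spotT2"), c("chip1", "chip2")))
bottom <- matrix(5:8, nrow = 2,
                 dimnames = list(c("spotB1", "spotB2"), c("chip1", "chip2")))

# Guard against a missing object or mismatched chip order before merging.
stopifnot(exists("top"), exists("bottom"),
          identical(colnames(top), colnames(bottom)))

# Rows are stacked; the number of columns (chips) is unchanged.
whole <- rbind(top, bottom)
print(dim(whole))
```

A similar `exists("RB")` check before `rbind(RT,RB)` would have flagged the missing object directly.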


 * Perform Within-array normalization using the following command:

MAG<-normalizeWithinArrays(RGG,method="loess", bc.method="normexp")


 * Create a new matrix so that the number of rows corresponds to the number of unique genes and the number of columns corresponds to the number of .gpr files using the following command:

M<-matrix(nrow=6403,ncol=9)

 * Use the tapply function to average duplicate spots so that only unique spots remain in the data with the following command:

for(i in 1:9) {M[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}


 * Write the table with the following command:

write.table(M,"GCAT_WANorm.csv",sep=",")
 * This creates a file called GCAT_WANorm.csv.  Note that we don't have to dye-swap any of the GCAT chips because they were all red-experimental/green-control.  The same issue regarding the offset of the column headings in the results file applies here.  There are nine data columns (up to column J) and 6403 data records in this file.  There are three "control" records, "empty", "Empty", and "EMPTY".  These will be removed when the data files are merged in Access.


 * The .csv file generated does not have the gene IDs listed. In order to obtain the list of gene IDs from R, run the tapply function with only one .gpr file and write it to a table using the following command.

write.table(tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean),"GCAT_ID.csv",sep=",")


 * Open these two files in Excel. Shift the column headings in GCAT_WANorm.csv to the right by one column.
 * Copy the IDs (column A) from the GCAT_ID.csv file and paste them into column A of the GCAT_WANorm.csv file.
 * Save this new file under a new name to denote that I made these changes.
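The three control records noted above differ only in capitalization, so they could also be dropped in R with a case-insensitive comparison instead of waiting for the Access merge. A sketch on a toy matrix with invented IDs:

```r
# Toy GCAT matrix: IDs as row names, one chip column.
ids <- c("EMPTY", "Empty", "YAL001C", "YBR002W", "empty")
M_toy <- matrix(seq_along(ids), ncol = 1, dimnames = list(ids, "chip1"))

# Compare in lower case so all three spellings are caught.
keep <- tolower(rownames(M_toy)) != "empty"

# drop = FALSE keeps the result a matrix even with a single column.
M_clean <- M_toy[keep, , drop = FALSE]
print(M_clean)
```

Only the rows with real gene IDs remain in `M_clean`.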

Merging Ontario and GCAT data using Microsoft Access
The microarrays obtained from GCAT and from Ontario have different gene sets printed on them. In order to proceed with the analysis, we need to merge the two datasets, making sure to match data from the same genes with each other. We will use a query in Microsoft Access to accomplish this. In the process, we will throw away the genes/spots from the GCAT chips that are not present on the Ontario chips, but keep all of the genes/spots from the Ontario chips. (This is because there are far fewer GCAT chips in the study and we would not be able to do any robust statistical analysis on the orphaned GCAT genes anyway.)


 * 1) Launch Microsoft Access.
 * 2) Create a new database in Access and save it.
 * 3) Import the data (File->Get External Data->Import)
 * 4) * Note that the arrangement of these menu items varies in different versions of Access.
 * 5) Go through the import Wizard:
 * 6) * Specify the data as delimited
 * 7) * Keep the delimiter as "comma" and the text qualifier as "none". Indicate that the first row contains field names.
 * 8) * Choose "no primary key".
 * 9) * Repeat this process once with the GCAT data and once with the Ontario data.
 * 10) In the window for the current database, go to queries and select "Create query in Design View".
 * 11) Add both imported tables (the GCAT and Ontario).
 * 12) In the "Select Query" window, join GCAT ID and Ontario ID with a line.
 * 13) * Right click on the line and select the "Join Properties" option.
 * 14) * Select the option "Include ALL records from the Ontario data and only those records from the GCAT data where the joined fields are equal".
 * 15) Select all of the fields in the GCAT query window and drag into the first box in the "Field" row in the table below.
 * 16) Select all of the fields in the Ontario query window and drag into the next free box in the "Field" row in the table below.
 * 17) Create a new table for this joined data by choosing the menu item Query > Make-Table Query. Name the new table "GCAT_Ontario_merged_data".
 * 18) Run the query.  There should be 6189 rows in the new table.
 * 19) Copy and paste the new table into a new Excel spreadsheet and save the file.
 * 20) * Note that the data was previously sorted in alphabetical order by ID. After doing a database query, we can't assume that this order is preserved. In Excel, re-sort A-Z on the Ontario ID column before proceeding.
 * 21) * Note that R will not import the data with a non-numeric ID column. After sorting, the ID columns need to be deleted.
 * 22) * Make sure that the columns are still in the order of strain, timepoint, flask.
 * 23) * Once these changes are made, save the file as GCAT_Ontario_merged_within-chip-normalized_data_edited.csv to note that these changes were made and to put it into .csv format.
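The join described in the steps above (all Ontario records, plus GCAT records only where the IDs match) corresponds to a left join, which base R's `merge` can express with `all.x = TRUE`. This is an alternative sketch rather than the procedure used in this entry, and the tables, column names, and values are invented:

```r
# Toy Ontario table: all of these genes must be kept.
ont <- data.frame(Ontario_ID = c("YBR002W", "YAL001C", "YCR003W"),
                  ont_chip1 = c(0.2, -0.4, 1.1),
                  stringsAsFactors = FALSE)

# Toy GCAT table: YDR004W is not on the Ontario chips and should be dropped.
gcat <- data.frame(GCAT_ID = c("YAL001C", "YCR003W", "YDR004W"),
                   gcat_chip1 = c(0.3, 0.9, -0.7),
                   stringsAsFactors = FALSE)

# all.x = TRUE keeps every Ontario row; unmatched GCAT values become NA.
merged <- merge(ont, gcat, by.x = "Ontario_ID", by.y = "GCAT_ID", all.x = TRUE)

# Re-sort by ID explicitly rather than relying on the join's output order.
merged <- merged[order(merged$Ontario_ID), ]
print(merged)
```

The row count of `merged` equals the row count of the Ontario table, which matches the 6189-row check in the Access steps.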

Between-chip Normalization

 * Change the directory in R using the menu File > Change dir... to the folder containing the GCAT_Ontario_merged_within-chip-normalized_data_edited.csv file.
 * Read the .csv file containing the within array normalized data for all of the chips into R using the following command:

MA2<-read.csv("GCAT_Ontario_merged_within-chip-normalized_data_edited.csv",header=TRUE,sep=",")


 * Force the imported .csv data into a matrix using the following command:

MA3<-as.matrix(MA2)


 * Perform between-array normalization using the following command:

MAB<-normalizeBetweenArrays(MA3,method="scale",targets=NULL)
 * Received the following "Warning message: In log(apply(x, 2, median, na.rm = TRUE)) : NaNs produced". According to Nick and Katrina, this is what happens in R version 2.13.0, and I need to run this step in R version 2.7.2 instead.  So I will install that and try again.
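The warning text shows the natural log being taken of each column's median. A column of log fold changes whose median is negative therefore yields NaN, which the first half of this toy example reproduces with invented values. The second half sketches one way to scale columns that tolerates negative values, dividing by the median of absolute values; this is an assumption made for illustration, not a statement of what either limma version computes.

```r
# Toy log fold changes: column 1 has a negative median, column 2 a positive one.
x <- matrix(c(-1.0, -0.5, 0.2,
               0.4,  0.8, 1.6), nrow = 3)

# Same expression as in the warning; log of a negative median gives NaN.
col_medians <- apply(x, 2, median, na.rm = TRUE)
bad <- suppressWarnings(log(col_medians))

# Alternative sketch: scale on the median of absolute values instead.
cmed <- apply(abs(x), 2, median, na.rm = TRUE)
scale_factor <- cmed / exp(mean(log(cmed)))   # relative to the geometric mean
x_scaled <- sweep(x, 2, scale_factor, FUN = "/")

# After scaling, every column has the same median absolute value.
med_after <- apply(abs(x_scaled), 2, median)
print(med_after)
```

Checking the column medians of the merged matrix for negative values before normalizing would show which chips trigger the warning.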


 * Write the data to a table using the following command (only works in R version 2.7.2):

write.table(MAB,"ALL_BANorm_dyeswapped_Matrix.csv",sep=",")


 * Note that the column headings need to be offset 1 column to the right. I had to paste the IDs back in from an earlier file, then saved the result under a new name.


 * I passed the file along to Nick and Katrina and they found discrepancies between what I did and what they did. I am going to repeat the whole process in R version 2.7.2 to make sure that I did everything correctly.  -- Kam D. Dahlquist 19:59, 7 July 2011 (EDT)


|}