My Computational Journal Summer 2012
Week 1
May 14, 2012
Today code for a simple ANOVA and a comparison of two strains was tested to see if the results produced by our adviser could be reproduced. The output without modifying the code for the comparison of two strains using wildtype and dGLN3 for the comparison could be reproduced. A few difficulties have been faced when trying to modify the simple ANOVA code. It was attempted to match the results produced for dGLN3 using the two strain comparison code with the results of the simple ANOVA code with the input specified as the log fold concentrations for the dGLN3 data. After being initially unsuccessful, the code was revisited and looked at more closely. It was found that two matrices in the matrix division had an unequal number of rows (matrix X and matrix Y). The indices (which specify columns of the input file) were modified after which matrix X had the same number of rows as matrix Y. The following code was used:
ind15 = ind15-indx(1)+1; ind30 = ind30-indx(1)+1; ind60 = ind60-indx(1)+1; ind90 = ind90-indx(1)+1; ind120 = ind120-indx(1)+1;
However, the results from the simple ANOVA for dGLN3 still did not match the results for dGLN3 from the comparison of two strains code.
Katrina Sherbina 21:31, 14 May 2012 (EDT)
May 15, 2012
A separate set of indices were made to designate the rows of the X matrices for the reduced and the full model. This separate set of indices (one for each timpeoint) removed rows of zeros that were previously added when substituting the select 0's for 1's to create the two X matrices. As a result, it was possible to call upon the original indices to extract the log fold changes corresponding to a specific deletion to create the Y matrix. After these corrections, the two strain comparisons were performed between wildtype and each one of the deletion strains.
Further modifications were made to the two strain comparison code to be able to make a comparison of all of the strains simulatenously. To do so, indices were added to designate the columns in the input corresponding to each of the deletion strains. A separate set of indices (one index per timepoint) for each strain as in the case of the two strain comparison. The parameters p were increased to 25 (five timepoints for each deletion strain) and the constraints q were increased to 20 to yield an X and Xh matrices with the appropriate numbers of columns. Changes were made to the out_data(ii,[number]) lines to account for the increase in the number of strains being compared. In the output, the data for each of the timepoints for each of the deletion strains matched the corresponding data in the two strain comparisons.
The next task is figure out how to handle missing data. Since both GCAT and Ontario chips were used, log fold change concentrations are missing for some genes for some timepoints for the wildtype (these cells show up as NaN). One possibility is to modify the X matrices so that they take into account the NaN's for each gene. On this note, a for loop was begun to try to exclude timepoints for a gene for which there is an NaN.
Katrina Sherbina 20:38, 15 May 2012 (EDT)