My Computational Journal Summer 2012: Difference between revisions

From OpenWetWare
Revision as of 18:33, 16 May 2012

Week 1

May 14, 2012

Today, code for a simple ANOVA and code for a comparison of two strains were tested to see whether the results produced by our adviser could be reproduced. With the two-strain comparison code left unmodified, the output for the comparison of wildtype and dGLN3 was reproduced. Modifying the simple ANOVA code proved more difficult. The goal was to match the dGLN3 results from the two-strain comparison code with the results of the simple ANOVA code when the log fold concentrations for the dGLN3 data were given as input. After an initial unsuccessful attempt, a closer look at the code showed that the two matrices in the matrix division (matrix X and matrix Y) had unequal numbers of rows. The indices (which specify columns of the input file) were modified, after which matrix X had the same number of rows as matrix Y. The following code was used:

ind15 = ind15-indx(1)+1;
ind30 = ind30-indx(1)+1;
ind60 = ind60-indx(1)+1;
ind90 = ind90-indx(1)+1;
ind120 = ind120-indx(1)+1;

However, the results from the simple ANOVA for dGLN3 still did not match the results for dGLN3 from the comparison of two strains code.
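
The effect of the index shift above can be illustrated with a small sketch. It assumes that indx holds the input-file columns belonging to one strain and that each timepoint index holds a subset of those columns; the column numbers, the two-timepoint layout, and the random Y are made up for illustration and are not taken from the actual input file.

```matlab
% Hypothetical layout: the strain occupies columns 25-28 of the input file.
indx  = 25:28;          % absolute columns for the strain (assumed)
ind15 = [25 26];        % absolute columns for the 15-minute timepoint (assumed)
ind30 = [27 28];        % absolute columns for the 30-minute timepoint (assumed)

% Shift the indices so that the first strain column becomes position 1.
ind15 = ind15 - indx(1) + 1;   % now [1 2]
ind30 = ind30 - indx(1) + 1;   % now [3 4]

% The shifted indices address rows of a design matrix X that has one row
% per strain column, so X and Y end up with the same number of rows.
n = numel(indx);
X = zeros(n, 2);
X(ind15, 1) = 1;        % indicator column for the 15-minute timepoint
X(ind30, 2) = 1;        % indicator column for the 30-minute timepoint
Y = rand(n, 1);         % placeholder for one gene's data
assert(size(X, 1) == size(Y, 1));
```

Without the shift, assigning into X(ind15, 1) would grow X to 26 rows, which is one way a row mismatch between X and Y can arise.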

Katrina Sherbina 21:31, 14 May 2012 (EDT)

May 15, 2012

A separate set of indices was made to designate the rows of the X matrices for the reduced and the full model. This separate set of indices (one for each timepoint) removed the rows of zeros that had previously been added when replacing select 0's with 1's to create the two X matrices. As a result, it was possible to call upon the original indices to extract the log fold changes corresponding to a specific deletion to create the Y matrix. After these corrections, the two-strain comparisons were performed between wildtype and each one of the deletion strains.

Further modifications were made to the two-strain comparison code to allow a comparison of all of the strains simultaneously. To do so, indices were added to designate the columns in the input corresponding to each of the deletion strains. A separate set of indices (one index per timepoint) was created for each strain, as in the case of the two-strain comparison. The number of parameters p was increased to 25 (five timepoints for each deletion strain) and the number of constraints q was increased to 20 to yield X and Xh matrices with the appropriate numbers of columns. Changes were made to the out_data(ii,[number]) lines to account for the increase in the number of strains being compared. In the output, the data for each of the timepoints for each of the deletion strains matched the corresponding data in the two-strain comparisons.
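
A rough sketch of how a full model and a reduced model with these dimensions could be set up and compared for one gene is given below. It rests on several assumptions that the journal does not confirm: that the full matrix X has p = 25 indicator columns (one per strain and timepoint pair), that the reduced matrix Xh has p - q = 5 columns (one per timepoint, shared by all strains), that there are two replicates per pair, and that the comparison uses the usual F statistic for nested linear models. The data are random placeholders.

```matlab
% Toy dimensions (assumed): 5 strains x 5 timepoints, 2 replicates each.
nStrain = 5; nTime = 5; nRep = 2;
N = nStrain * nTime * nRep;      % 50 observations for one gene
p = nStrain * nTime;             % 25 parameters in the full model
q = p - nTime;                   % 20 constraints

% Label each observation with its timepoint and strain.
[t, s, ~] = ndgrid(1:nTime, 1:nStrain, 1:nRep);
t = t(:); s = s(:);

% Full model: one indicator column per strain/timepoint pair.
X = zeros(N, p);
col = (s - 1) * nTime + t;
X(sub2ind([N p], (1:N)', col)) = 1;

% Reduced model: one indicator column per timepoint, shared by all strains.
Xh = zeros(N, nTime);
Xh(sub2ind([N nTime], (1:N)', t)) = 1;

% Synthetic data for one gene (random placeholder values).
Y = randn(N, 1);

% Least-squares fits and residual sums of squares.
beta  = X  \ Y;
betah = Xh \ Y;
SSF = sum((Y - X  * beta ).^2);  % full model
SSH = sum((Y - Xh * betah).^2);  % reduced model

% F statistic for q constraints with N - p residual degrees of freedom.
F = ((SSH - SSF) / q) / (SSF / (N - p));
```

Because every column of Xh is a sum of columns of X, the reduced model is nested in the full one, so SSH is never smaller than SSF and F is nonnegative.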

The next task is to figure out how to handle missing data. Since both GCAT and Ontario chips were used, log fold change concentrations are missing for some genes at some timepoints for the wildtype (these cells show up as NaN). One possibility is to modify the X matrices so that they take into account the NaN's for each gene. To this end, a for loop was begun to exclude the timepoints for which a gene has a NaN.

Katrina Sherbina 20:38, 15 May 2012 (EDT)

May 16, 2012

NaN's were removed from the Y matrix for each gene using the command YY = Y(~isnan(Y(:,1)),1). This truncated the Y matrix from 43 rows to 34 rows. However, after doing so, it was not possible to solve for beta (the solution to the linear system XB=Y). Another way to deal with the NaN's was to replace them with a very large number using the command Y(isnan(Y(:,1)),1)=1e6, so that the timepoints and flasks with NaN's could be recognized in the output. However, this disrupted the B&H corrections.
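
Both approaches can be reproduced on a toy example. The sketch below uses a made-up six-row Y and a hypothetical two-column design matrix X rather than the real 43-row data. It shows that dropping rows from Y alone leaves X and YY with different row counts, and that a 1e6 placeholder dominates the least-squares estimates, which would plausibly distort any statistics and corrected p-values computed from them.

```matlab
% Toy data (assumed): 6 observations, 2 of them missing.
Y = [0.4; NaN; -1.2; 0.9; NaN; 0.3];
X = [1 0; 1 0; 1 0; 0 1; 0 1; 0 1];   % hypothetical design matrix

% Approach 1: drop the NaN rows from Y only.
YY = Y(~isnan(Y(:,1)), 1);                  % 4 rows remain
rowsMatch = (size(X, 1) == size(YY, 1));    % false: 6 rows vs 4 rows
% X \ YY would fail here because the row counts differ.

% Approach 2: replace the NaN's with a large placeholder value.
Yp = Y;
Yp(isnan(Yp(:,1)), 1) = 1e6;
betaP = X \ Yp;     % both estimates are dominated by the placeholder
```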

The next step will build on truncating the Y matrix by getting rid of the rows with NaN's altogether. To keep this step, the X and Xh matrix pairs must be formulated so that they are unique to each gene. One possibility is to use a for loop that defines the indices for each gene from the columns that do not contain a NaN for that gene. The Y matrix calculations would have to be within this loop.
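
One way such a loop might look is sketched below. This is only an illustration under assumptions: the 3-gene input, the two-column design matrix X, and the variable names data, keep, Xi, Yi, and nUsed are all hypothetical, and only the full model is fitted. The same logical mask would also have to be applied to Xh.

```matlab
% Toy input (assumed): 3 genes x 6 observations, NaN marks missing data.
data = [ 0.4  0.1 -1.2  0.9  0.5  0.3;
         NaN  0.2  0.7  NaN -0.4  0.6;
         1.1  NaN  0.8 -0.2  0.0  NaN];
X = [1 0; 1 0; 1 0; 0 1; 0 1; 0 1];   % hypothetical full design matrix

nGenes = size(data, 1);
beta  = zeros(size(X, 2), nGenes);    % one column of estimates per gene
nUsed = zeros(nGenes, 1);             % number of observations kept per gene

for ii = 1:nGenes
    Y = data(ii, :)';          % observations for gene ii as a column
    keep = ~isnan(Y);          % input columns that have data for this gene
    Xi = X(keep, :);           % gene-specific design matrix
    Yi = Y(keep);              % gene-specific Y without NaN's
    nUsed(ii) = numel(Yi);
    beta(:, ii) = Xi \ Yi;     % least-squares fit on the available rows
end
```

If a gene is missing every observation for some column of X, the gene-specific matrix Xi becomes rank deficient, so that case would need separate handling.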

Katrina Sherbina 21:33, 16 May 2012 (EDT)