# Kristen M. Horstmann Lab Notebook

## October 11, 2015

Finally made an electronic notebook; will be posting work from previous weeks.

- Began work on updated Microarray data analysis workflow: Dahlquist:Microarray Data Analysis Workflow
- Completed steps 4-5, began step 6, Statistical Analysis, by creating the Excel sheet. Workflow was followed up until heading "Within-Strain ANOVA." Excel sheet was saved on Google Drive
- There were 41,998 deletions of #VALUE! in the Excel sheet.

- Sanity check for GLN3

Sanity Check

p < 0.05......2255

p < 0.01......1325

p < 0.001......616

p < 0.0001......259

Bonferroni p < 0.05......106

B-H p < 0.05......1356

For NSR1

unadjusted p-value......0.00050676

Bonferroni p-value......1

B-H p-value......0.00660287

AvgLogFC_t15......3.506225

AvgLogFC_t30......4.5319

AvgLogFC_t60......2.7592

AvgLogFC_t90......-1.85025

AvgLogFC_t120......-1.867425

- Sanity check for ZAP1

p < 0.05......2559

p < 0.01......1683

p < 0.001......953

p < 0.0001......521

Bonferroni p < 0.05......251

B-H p < 0.05......1859

For NSR1

unadjusted p-value.....6.06E-08

Bonferroni p-value......0.000374851

B-H p-value......6.15E-06

AvgLogFC_t15......3.8996

AvgLogFC_t30......3.7238

AvgLogFC_t60......3.962775

AvgLogFC_t90......-2.156

AvgLogFC_t120......0.0542

- Sanity check for SWI4
- Encountered some issues with SWI4, as its data was slightly different from the others, so the equations typed in may have been slightly altered
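
The Bonferroni and B-H columns in the sanity checks above can be reproduced outside of Excel. The following is a minimal sketch, assuming the standard definitions of both corrections (Bonferroni multiplies each p-value by the number of tests and caps it at 1; Benjamini-Hochberg scales each p-value by the number of tests divided by its rank). The function names and the toy p-values are illustrative and are not taken from the workflow spreadsheet.

```python
def bonferroni(pvalues):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is
    # never larger than the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_below(pvalues, threshold):
    """Number of p-values strictly below the threshold (the sanity-check rows)."""
    return sum(1 for p in pvalues if p < threshold)


# Toy p-values for illustration only.
toy = [0.001, 0.008, 0.039, 0.041, 0.27]
bonf = bonferroni(toy)
bh = benjamini_hochberg(toy)
print(count_below(toy, 0.05), count_below(bonf, 0.05), count_below(bh, 0.05))
```

The cap at 1 is consistent with the Bonferroni p-value of 1 recorded for NSR1 in the GLN3 sheet, where the unadjusted p-value times the number of genes exceeds 1.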

## October 12, 2015

### dZAP1 Results

- Number of significant transcription factors: 99
- List of significant transcription factors with % in user set, % in YEASTRACT, p-value:
- Too many to list on site, can be found here: Horstmann ZAP1 Results

- Are CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 on the list?
- CIN5p, SWI4p, and ZAP1p were on this list

- Opted not to continue with the next transcriptional map task, as there were too many genes to map and it was unclear which genes to continue with.
- Accessed evening of 10/12

### dGLN3 Results

- Number of significant transcription factors: none
- List of significant transcription factors with % in user set, % in YEASTRACT, p-value:
- accessed morning of 10/14

### dSWI4 Results

Opted not to continue with SWI4, as the numbers did not match Tessa's check on my work. Waiting for confirmation from a third party on whose data to use.

## October 14, 2015

Notes from team meeting:

- Removing Spar from further consideration; not fair to assume the yeast data applies to it
- all-all green Excel matrix but cut out the genes that are not connected.
- Delete the least significant transcription factor
- Then redo with the deletion strains and ensure that wt, dCIN5, dGLN3, dHMO1, dHAP4, dSWI4, and dZAP1 stay in; do not delete a TF if it is attached to these genes
- Create "family" of about 35 transcription factors, no smaller than 15
- Not making networks off of SPAR, HMO1, or SWI4.
- Each person assigned to different gene
- Kayla: dCIN5
- Kristen: dZAP1
- Natalie: wt
- Tessa: dGLN3
- Grace: dHAP4

- Paring down networks:
- First, eliminate unconnected genes
- Then, systematically pare down by p-value, one by one. For each elimination, check and get rid of unconnected genes. We want 15 - 35 range for our family of networks.
- For now, we will focus on wild type and deletion strains (not Spar)

- Two types of pare downs:
- First, just the genes from YEASTRACT
- Later, adding in our deletion strain genes. Careful with elimination of TF's - deletion strain genes must stay in
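
The pare-down procedure from the meeting can be sketched as a loop: drop unconnected genes, remove the least significant remaining transcription factor, and repeat until the network reaches the target size. This is only an illustration under assumptions: the network is a list of (regulator, target) pairs, the p-values are stored in a dictionary, and all function names and toy values are hypothetical rather than part of the lab protocol.

```python
def drop_unconnected(genes, edges, protected=frozenset()):
    """Keep genes that still have an edge to another kept gene, plus protected genes."""
    live = [(reg, tgt) for reg, tgt in edges if reg in genes and tgt in genes]
    connected = {gene for edge in live for gene in edge}
    return connected | (set(protected) & set(genes))


def pare_down(pvalues, edges, target_size, protected=frozenset()):
    """Remove the least significant gene one at a time until target_size is reached.

    pvalues: dict mapping gene name to p-value (larger means less significant).
    edges: list of (regulator, target) pairs.
    protected: genes that must stay in, e.g. the deletion-strain genes.
    """
    genes = drop_unconnected(set(pvalues), edges, protected)
    while len(genes) > target_size:
        candidates = sorted(g for g in genes if g not in protected)
        if not candidates:
            break  # only protected genes remain
        worst = max(candidates, key=lambda g: pvalues[g])
        # After each elimination, check again for unconnected genes.
        genes = drop_unconnected(genes - {worst}, edges, protected)
    return genes


# Toy network with made-up names and p-values.
toy_pvalues = {"A": 0.001, "B": 0.01, "C": 0.2, "D": 0.03, "E": 0.5}
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
print(sorted(pare_down(toy_pvalues, toy_edges, 3)))
print(sorted(pare_down(toy_pvalues, toy_edges, 3, protected={"C"})))
```

Running it twice, once with an empty `protected` set and once with the deletion-strain genes protected, mirrors the two types of pare-downs listed above.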

## October 19, 2015

- For Binding AND Expression
- Was able to delete 21 transcription factors, but that still only pared them down to 83.
- Pared down based on YEASTRACT p-values to a network of 31

- For Binding PLUS Expression
- Could not delete any transcription factors, all of them had at least one 1 in the row or column
- Turned to paring down based on the least significant p-values according to the YEASTRACT output of p-values
- Pruned out 72 rows of the least significant p-values, down to 31 transcription factors

- For ONLY Binding
- Had a few rows without a connection but could not delete any transcription factors
- Turned to paring down based on the least significant p-values, down to 31 transcription factors

## October 21, 2015

- Notes from Meeting
- Told only need to be creating networks for ONLY Binding, and confirmed that p-values were used from YEASTRACT.
- Will make a handful of networks with transcription factors ranging from 15-35 based only off of ONLY binding.
- Need to redo as may have deleted too many lines of data before checking back on the 0s in the edges

## October 27, 2015

- Made 17 sheets ranging from 35 genes to 15 genes by deleting based off 0s in edges and off significance of p-values INCLUDING the six deletion strains (CIN5, HAP4, GLN3, ZAP1, SWI4, HMO1)
- Made 16 by deleting based off of 0s in edges and off significance in p-values. Deletions were done regardless of if the deleted gene was the 6 or not
- Ran some of them through GRNsight to check their networks and connections
- First had to reformat Excel
- For this adjacency matrix to be usable in GRNmap (the modeling software) and GRNsight (the visualization software), must transpose the matrix. Insert a new worksheet into your Excel file and name it "network". Select the entire matrix and copy it. Go to your new worksheet and click on the A1 cell in the upper left. Select "Paste special" and "Transpose". This will paste your data with the columns transposed to rows and vice versa. This is necessary because we want the transcription factors that are the "regulatORS" across the top and the "regulatEES" along the side.
- Delete the "p" from each of the gene names in the columns. Adjust the case of the labels to make them all upper case.
- In cell A1, copy and paste the text "rows genes affected/cols genes controlling"
- On the GRNsight website, just clicked on "File" and "Open" to upload the file and create the network
- This helped to see the floating transcription factors and if any genes were connected together and not to the main network
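
The Excel reformatting steps above (transpose, strip the "p" from the gene names, upper-case the labels, fill in cell A1) can also be sketched in code. This is a hedged illustration rather than the lab's procedure: it assumes the original matrix has regulators as rows and targets as columns, and that the protein-style names end in a lower-case "p" (for example "Zap1p"). The function names are hypothetical.

```python
A1_TEXT = "rows genes affected/cols genes controlling"


def clean_label(label):
    """Drop a trailing lower-case 'p' (e.g. 'Zap1p') and upper-case the name."""
    label = label.strip()
    if label.endswith("p"):
        label = label[:-1]
    return label.upper()


def to_network_sheet(regulators, targets, matrix):
    """Build the transposed "network" sheet as a list of rows.

    matrix[i][j] is 1 when regulators[i] regulates targets[j].
    In the result, regulators run across the top and regulatees down the side.
    """
    transposed = [list(column) for column in zip(*matrix)]
    sheet = [[A1_TEXT] + [clean_label(r) for r in regulators]]
    for target, row in zip(targets, transposed):
        sheet.append([clean_label(target)] + row)
    return sheet


# Toy example: Cin5p regulates Zap1p, and Zap1p regulates itself.
sheet = to_network_sheet(["Cin5p", "Zap1p"], ["Cin5p", "Zap1p"], [[0, 1], [0, 1]])
for row in sheet:
    print(row)
```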

- Number of networks in the subfamilies for all the strains
- dHAP4 (Grace)
- Regardless of deletion strains: 13
- Preserving deletion strains: 19

- wt (Natalie)
- Regardless of deletion strains: 14
- Preserving deletion strains: 19

- dGLN3 (Tessa)
- Regardless of deletion strains: 16
- Preserving deletion strains: TBD

- dZAP1 (Kristen)
- Regardless of deletion strains: 16
- Preserving deletion strains: 17

- dCIN5 (Kayla)
- Regardless of deletion strains: TBD
- Preserving deletion strains: TBD

- Abstract for April Conference

Tessa, Kayla, and I will be presenting based on our findings from spring semester. Introduce bio problem, introduce modeling, discuss results from Tessa's and my work on Zap1 and Gln3. Introduce future work (what we're doing right now).

## October 28, 2015

- Notes from the meeting
- Attending the ASMB Conference and need to submit abstracts for each team
- Keep working on the Excel sheets, keep making deletions from the sheet without the deletion strains
- Start making input sheets as well

- Abstracts
- Talk about class experience and which way we're extending the model
- Instead of individual instances, discuss how we're making families of networks
- Introduce biological problem, introduce modeling process
- Put on github wiki and work on it

## November 4, 2015

- Notes from the meeting
- Start/continue making input sheets (protocol should be up to date online)
- Next make expression sheets
- Use truncated, normalized data as the input
- ID in left column -> 15, 15, 15 -> 30,30,30 etc.
- Start to populate by doing the biggest one first and doing the rest by deletion

- Abstract
- Write results and conclusions (from paper)
- Future directions of similar things with different strains
- Write names with middle initial and how we want it to be established for scientific career

- Input sheets
- Use truncated normalized data

- Most protocol should be up for everything except degradation and production rates sheets
- Using Microsoft Access:
- Create an Excel spreadsheet with tabs containing expression for each strain (i.e. have one tab named "wt_expression", etc.).
- Use rounded normalized data for each strain
- Use only data from the 15, 30, and 60 timepoints --> use the LFC and not the average
- Find either a space or the #VALUE!, and replace with nothing
- Open Access, select External Data tab, then import Excel spreadsheet
- One tab is already available upon opening the program. Importing the sheet of interest creates a new tab instead of renaming the original tab (named Table1).
- To create a new table, go to the Create tab and select the Table button
- Select the file you wish to upload. Then select the tab that you want to build a database for.
- Choose primary key and select ID --> Systematic name of TFs; the systematic name is the primary key because it is unique to a specific transcription factor
- To rename a table, right click on the tab. Select Design View
- Name the new sheet after the network you are creating it for (i.e. dHAP_35_network)
- Populate the ID column with genes from the desired network (i.e. CIN5, GLN3).
- Select Query Design and select the tables of interest (Expression table and Network table). Both windows will show up
- Select ID from Network and Click, Drag, and Drop onto the ID of Expression Table
- Right click on connection formed and select Join Properties
- Make selection of Include ALL records from 'network'
- Drag Network down to the first Field and drag all records of expression into the fields right of Network record (first field)
- Select make table and then name your table. Hit RUN.
- Wang mRNA degradation rate extraction from halflives
- Use degradation and production rates from site provided
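
The Access query described above, with the join set to include ALL records from the network table, behaves like a left join: every gene in the network is kept, and genes without expression data come back with empty cells. A minimal plain-Python sketch of that behavior follows. It is a stand-in for the Access steps, not a replacement for them, and the function name, IDs, and values are all hypothetical.

```python
def join_network_to_expression(network_ids, expression, n_columns):
    """Left join: one output row per network ID, in network order.

    network_ids: list of IDs from the network table.
    expression: dict mapping ID to a list of expression values.
    n_columns: number of expression columns, used to pad missing IDs with None.
    """
    table = []
    for gene_id in network_ids:
        values = expression.get(gene_id, [None] * n_columns)
        table.append([gene_id] + list(values))
    return table


# Placeholder IDs and values for illustration only.
network = ["ID_A", "ID_B", "ID_C"]
expression_table = {
    "ID_A": [0.5, 1.2, -0.3],
    "ID_C": [2.0, 1.1, 0.4],
    "ID_Z": [9.9, 9.9, 9.9],  # not in the network, so it is left out
}
for row in join_network_to_expression(network, expression_table, 3):
    print(row)
```

Rows padded with `None` correspond to the blank cells that later need attention in the input sheets.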

Begin generation of input sheets for network family testing
Microsoft Access used to extract logFC data for t15, 30, and 60 for each strain (wt, dCIN5, dGLN3, dHAP4, dHMO1, dSWI4, and dZAP1) for:
The largest network for the dHAP4, deletion strains added family

- Took a while to learn how to use Access, sheets were created November 30
- Optimization parameters set according to standard used over the summer (copied from Tessa's sheets)
- Network_weights contains network with initial weight guesses of 0 (from summer)

## November 30, 2015

Created Access sheets following the directions above

## December 9, 2015

- Formatted sheets identical to Tessa's and Kayla's and all three received same errors of:

Index exceeds matrix dimensions.

Error in readInputSheet (line 166)

log2FC(i).deletion = Deletion(i);

Error in GRNmodel (line 30)

GRNstruct = readInputSheet(GRNstruct);

- Tessa and Kayla were able to finish up and get rid of their errors from formatting issues, like spelling, spaces, and addition of different parameters.
- I copied their same format and checked the sheets vs theirs, and still got errors, although new ones:

"Error using barrier (line 22)
Objective function is undefined at initial point. Fmincon cannot continue.

Error in fmincon (line 799) [X,FVAL,EXITFLAG,OUTPUT,LAMBDA,GRAD,HESSIAN] = barrier(funfcn,X,A,B,Aeq,Beq,l,u,confcn,options.HessFcn, ...

Error in lse (line 85) estimated_guesses = fmincon(@general_least_squares_error,estimated_guesses,[],[],[],[],lb,ub,[],options);

Error in GRNmodel (line 32) GRNstruct = lse(GRNstruct);"

- Could not solve errors, cross-checked against others with no results. Sheet uploaded to github. Next semester I will be doing one of two things:

A.) Rerunning with the averages of each time step, which may cause an issue with creation of standard deviations

B.) Copying into a new Excel spreadsheet entirely to ensure the first one wasn't corrupt.

## January 15, 2016

- Wants coders to get ahead of us in order to check and run for bugs
- Start back from where we were and use information from over the summer in order to avoid same/similar bugs from trying to use different codes in different versions
- Coders need to turn on/off L-curve added
- Then fix the bugs where the code can't handle missing data and only handles 5 strains
- May need to change to replace current data with averages of all data for expediency's sake
- No replicate, only one value for 15,30,etc

- No more overwhelming of many open threads
- Errors need to include: Branch (i.e master, beta, etc.). Date/time download. Namefile link to download. Bug, functionality, priority .5
- Only tag data analysis if highly prioritized

- Give us updates as we're working between meetings
- From last semester...
- Disc with two families (w/ and w/o deletion)
- Alphabetize the genes in the excel sheet to make easier to read
- Use sorting for all the sheets but the network (alphabetize one way, transpose, alphabetize back)

- Not all families were complete in the parameters. Go with Bell's published data for production and degradation rates for doing it quickly.
- Put in average for missing values
- Highlight cells that were empty since MATLAB doesn't care about color. Don't leave equation there, paste values, to ensure that the equation isn't being calculated multiple different ways. Do largest network and work down.
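
Filling in missing values with an average can be sketched as below. This assumes the average is taken over the values that are present in the same group (for example, the replicates at one timepoint); the notes do not spell out which values are averaged, so treat that as an assumption. The function name is hypothetical, and the returned positions play the role of the highlighted cells.

```python
def fill_with_average(values):
    """Replace None entries with the mean of the present values.

    Returns the filled list and the positions that were filled in.
    If nothing is present, the list is returned unchanged.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values), []
    mean = sum(present) / len(present)
    filled_positions = [i for i, v in enumerate(values) if v is None]
    filled = [mean if v is None else v for v in values]
    return filled, filled_positions


# Toy replicate values with one missing entry.
filled, positions = fill_with_average([1.0, None, 2.0, 3.0])
print(filled, positions)
```

Storing the computed number rather than a formula matches the note about pasting values so the average is not recalculated in different ways.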

- For now, restricted to only 5 strains. For largest network, keep all there, but we will decide after which strains to use

## January 20, 2016

- Edited the largest input sheet with the deletion strains purposefully added, will be paring down row by row next and then with the largest sheet with deletion strains deleted out.
- Showed Brandon the ropes and helped answer his questions. He showed us a way to highlight all the blank cells without doing it individually. Yay.
- Also added the average values to blank cells, added the production and degradation rates, and edited the optimization sheet

## January 29, 2016

- Edited a few more input sheets but paused to wait for the proper formatting, as it is changing with the code
- LMU Symposium abstracts due the 12th
- Update abstracts from the SD conference by the 5th in order to submit to LMU. Tessa will go "solo" on a verbal presentation

- Three more changes (taken from GitHub issue #166)
- We can now get rid of the row that says "Deletion". The code now can figure out the gene that is deleted from the strain information.

- We need to change the word "Model" to "production_function" (cell A8)

- We need to add a row beneath "production_function" called "L_curve". A zero value for this parameter means no L-curve analysis is done and a 1 value for this parameter means that an L-curve analysis will be run.

- Start running the L-curve analysis
- Do 4 runs this week: largest and smallest networks (+/- deletion strains)
- Make 4 L-curves for each of these runs
- L Curve is LSE on y axis and penalty on x axis. We get these values from the output sheets.
- Get these values and plot them against each other.
- Tells us which alpha to choose and compares the largest alpha against the smallest.
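
One way to pick the "elbow" of an L-curve numerically is to take the point farthest from the straight line joining the first and last points of the curve. This is a common heuristic chosen here for illustration; it is not necessarily how GRNmap or the lab selects alpha, and it works on the raw axes (a log-log version may be more appropriate when the values span orders of magnitude). All numbers below are made up.

```python
def elbow_index(penalty, lse):
    """Index of the point farthest from the chord between the curve's endpoints.

    penalty: x-axis values, one per alpha.
    lse: y-axis values, one per alpha, in the same order.
    """
    x0, y0 = penalty[0], lse[0]
    x1, y1 = penalty[-1], lse[-1]
    dx, dy = x1 - x0, y1 - y0
    length = (dx * dx + dy * dy) ** 0.5
    if length == 0:
        return 0  # degenerate curve: endpoints coincide
    best_index, best_distance = 0, -1.0
    for i, (x, y) in enumerate(zip(penalty, lse)):
        # Perpendicular distance from (x, y) to the chord.
        distance = abs(dy * (x - x0) - dx * (y - y0)) / length
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index


# Toy values: LSE falls quickly and then flattens as the penalty grows.
alphas = [0.5, 0.1, 0.01, 0.001]
toy_penalty = [1.0, 2.0, 3.0, 10.0]
toy_lse = [10.0, 4.0, 3.0, 2.5]
print(alphas[elbow_index(toy_penalty, toy_lse)])
```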

- Experienced crashes when making graphs (too many alphas) so changed optimization parameter sheet to make_graphs=0, but hopefully should not be an issue this time around
- By now, coder progress should not be interfering with what we are doing. They will soon merge code into master branch and have a new release. We need to still use beta branch, but soon they will make all changes to beta and we will be using Master.

## February 2, 2016

- Went and checked runs after running for about 21 hours. The two smallest sheets had finished earlier, as collected by Tessa at about 10 AM (roughly 14 hours after start).
- Largest network w/o deletion strains (ONLY_DNA_binding_dZAP1_28_genes_2_1_16) was still running
- One crashed, but realized had accidentally run one that I had already run (not sure why it ran to completion on one computer, and crashed on the other).
- Started largest network with deletion strains (with_deletions_ONLY_DNA_binding_dZAP1_34_genes_2_1_16), will check back tomorrow afternoon.
- Wednesday, 2/3, biostats class will be in here. Hopefully all models are collected and completed by then in case people mess up the program.

## February 3, 2016

- Checked in at 1:00 and the large networks for everyone are still running
- Powerpoint of L-curves and output data to make curves found at: dZAP1 L-curve analysis KH
- wrote abstract for LMU symposium which can be found here: Horstmann Klein Morris Abstract

## Feb 5, 2016

- Wait out the rest of the runs to see how they finish
- Rerun the l-curves with fewer alphas but more condensed into the area of interest (i.e. more specific alphas in the "elbow" of the graph)
- Maybe plot the W's and B's and everything (like in class) but wait until after we finish the reruns to examine those again
- Brought up that the team should have a universal naming convention of the files.
- Apparently one may have been started like: 21-genes_50-edges_Dahlquist-data_MM_estimation but maybe substitute Dahlquist-data with the user's initials to tell apart whose worksheet is whose. Also include the gene family (i.e. zap1)
- Everyone may have cut out different strains/formatted slightly differently, so maybe everyone include the deletion strains in the title as well

## Feb 11, 2016

- Going to create bar charts of b and p values similar to those done in biomath modeling class
- Label the graphs since better to do it now
- Do in alphabetical order of regulatOR (i.e CIN->ZAP1)
- From now on, use only released code and the latest released code.
- Latest code, can use all 6 strains on same code, but need to turn off creation of graphs/L-curves in order to not crash Matlab
- Make graphs=1 in order to overwrite the images

- Large networks likely too large... Not the number of genes but the number of edges
- With MSE and LSE, can compare to ANOVA to show that some genes are better than other genes

- If there's time, would like to pick particular network, generate random networks, and run
- Large networks were recreated to try to fix wonky L-curves
- Even though small network l-curves were decent, still pare down to them and rerun as it is confusing how the small network l-curves were acceptable and the large ones were not.
- Pare down and then after the l-curves were created, then do step-by-step pare downs

## Feb 17, 2016

- Worked on the l-curves of the "larger" input curves
- Dahlquist sent me updated inputs for the 33 and 25 input sheets.
- Ran 33 genes with multiple errors, Dahlquist lab computers got to 10 alphas and crashed, and finally able to get one going in SEA 120.
- Plotted 10 alphas into l-curve found in output page

## Feb 26, 2016

- L-curves turning into s-curves as the threshold b sheet requires a title in order to work
- new version of beta automatically does L-curve if makegraphs is set to 1
- should have no need to continue individually plotting the l-curves later but Tessa's work on it has shown otherwise.
- Tessa's plots are still showing S-curves
- Have not continued making L-curves yet as most recent L-curves formatted had final few alphas stacked on each other, so were more like downward slopes than "L's"
- See last two links in DZAP1 L-curve analysis KH

- Grace will be trying to repeat alpha = .002 to try to replicate the issue in the l-curve code
- Think we should just reuse alpha .002 to try as it is generating the best information in the overall codes
- Would it be fair to repeat different output runs for different alphas? no, natural magnitude difference. Need to find consistent alpha term, so we will continue using .002
- May have indexing problem with threshold and production rates due to Grace's bar charts
- Running models were killed as bug is evident. Fitzpatrick wrote code last semester and we just stuck it in... bugs to be expected. Indexing errors between scripts

- For the future: GitHub issue 187
- Only do families with deletion strains included
- Small large small large networks in case we run out of time
- Need to do beta code since has bug fix for allowing graphs for six strains
- Collect output sheets, theoretical set of graphs, parameter comparison, MSE. Single out individual genes to see which are being modeled better or worse based on connections, ANOVA, etc.

## March 11, 2016

- have been working on poster for LMU's Undergraduate Research Symposium
- Ran production runs with graphs produced for each gene in the 33, 23, and 15 gene networks. Decided to do three network sizes, as I initially thought I would be comparing with Tessa's; then, when I realized the poster was only Zap1, did not have time to run all 5 networks as originally thought (33, 29, 24, 19, 15)
- On the 24 Gene network, noticed HSF1 floating around with no connections. Tried to delete it from network and network weights, but was unable to. In 33 and 29, it only regulates RLM1, is floating freely in 24, and is deleted in 19 and 15. Will be returning to later in order to permanently delete, but for now will be leaving in network due to time constraints.

## September 12, 2016

- Discussed weekly updates on github and commenting the completed tasks of that week on GitHub
- Reviewed Code of Conduct in order to place it on GRNsight page
- This week our goals are further research into graph theory, perhaps using MATLAB to implement these, and work on TRACE documentation
- GRNsight is working on betweenness centrality and shortest path
- Look into systems bio package in MATLAB to see if there are any shortcuts for those
- Start testing those, will be used as independent check against GRNsight
- Play with if we substitute weights what would we get

- Do writeup on the literature we've been looking into and then move on to MATLAB coding
- In the future, get some graphs from degree distributions from random networks
- Start on TRACE Documentation so we have easier documentation when GRNmap is published

## September 13, 2016

- Made a powerpoint with quick summations of the articles we had read. Sent slides to Maggie to consolidate, will update with slides when completed
- Explored possibility of systems biology toolbox for MATLAB online and in Dahlquist Lab computers
- The Bioinformatics Toolbox was available, as well as the Computational Biology Apps of Molecule Viewer, NGS Browser, Phylogenetic Tree, Sequence Alignment, and Sequence Viewer
- Googled what makes up the bioinformatics toolbox and seems like we will be able to analyze shortest path, betweenness centrality, and degree distribution. Next week we will explore how.
- Found article that created Systems Biology and Evolution toolbox: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3767578/
- Their GitHub page is here: https://github.com/biocoder/SBEToolbox/releases

## September 19, 2016

- Notes from Meeting:
- Narrow search space for pre-built function of betweenness centrality over others. If betweenness centrality is found, others will likely be found.

- First four topics of TRACE documentation from: http://www.openwetware.org/wiki/Dahlquist:TRACE_Documentation
- Double check the small networks from before, run the 5 real networks, then start generating random networks and collect data from this
- Do control experiments to ensure there aren't any major bugs before investing time into creating the data

- Go to Nicole's talk next week on Gephi, the graph layout software.
- Make sure to comment on the GitHub issues to show our progress

## September 20, 2016

- Explored the possibilities of the Bioinformatics toolbox. Split into high-throughput Sequencing, Microarray Analysis, Sequence Analysis, Structural Analysis, Mass Spectrometry and Bioanalytics.
- Within Network Analysis and Visualization, some pieces of interest below:
- "graphallshortestpaths(G)", "graphallshortestpaths(G, ...'Directed', DirectedValue, ...)", and "graphallshortestpaths(G, ...'Weights', WeightsValue, ...)"
- For more details and a worked example, see: http://www.mathworks.com/help/bioinfo/ref/graphallshortestpaths.html
- Using only G: G is an N-by-N sparse matrix that represents a graph. Nonzero entries in matrix G represent the weights of the edges
- Using DirectedValue: property that indicates whether the graph is directed or undirected. Can enter "false" for an undirected graph
- Using WeightsValue lets you specify custom weights for the edges. WeightsValue is a column vector that specifies custom weights for the edges in matrix G.

- "graphshortestpath(..., 'Directed', DirectedValue, ...)" (the 'Directed', DirectedValue pair can be replaced with 'Method', MethodValue or 'Weights', WeightsValue). See full info and examples here: http://www.mathworks.com/help/bioinfo/ref/graphshortestpath.html
- [dist, path, pred] = graphshortestpath(G, S) determines the single-source shortest paths from node S to all other nodes in the graph represented by matrix G. Input G is an N-by-N sparse matrix that represents a graph. dist are the N distances from the source to every node.
- DirectedValue indicates whether the graph is directed or undirected
- MethodValue is a character vector that specifies the algorithm used to find the shortest path. Choices are:
- Bellman-Ford: Assumes weights of the edges to be nonzero entries in sparse matrix G. Time complexity is O(N*E), where N and E are the number of nodes and edges
- BFS: Breadth-first search. Assumes all weights to be equal, and nonzero entries in sparse matrix G to represent edges. Time complexity is O(N+E), where N and E are the number of nodes and edges respectively
- Acyclic: Assumes G to be a directed acyclic graph and that weights of the edges are nonzero entries in sparse matrix G. Time complexity is O(N+E) where N and E are the number of nodes and edges respectively
- Dijkstra: Default algorithm. Assumes weights of the edges to be positive values in sparse matrix G. Time complexity is O(log(N)*E), where N and E are the number of nodes and edges respectively.
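
To make the difference between the BFS and Dijkstra options concrete, here is a small Python sketch of both on a matrix where entry G[i][j] is the weight of the edge from node i to node j and 0 means no edge. It is an independent illustration, not the MATLAB toolbox code. Because it scans a dense matrix row by row, it does not achieve the sparse-matrix time complexities quoted above. The toy matrix is made up.

```python
import heapq
from collections import deque

INF = float("inf")


def bfs_distances(G, source):
    """Hop counts from source, treating every nonzero entry as an edge of equal weight."""
    n = len(G)
    dist = [INF] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if G[u][v] != 0 and dist[v] == INF:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def dijkstra_distances(G, source):
    """Weighted distances from source, assuming all edge weights are positive."""
    n = len(G)
    dist = [INF] * n
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in range(n):
            w = G[u][v]
            if w > 0 and d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist


# Toy directed graph: 0->1 (1), 0->2 (4), 1->2 (2), 2->3 (1).
toy_G = [
    [0, 1, 4, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(bfs_distances(toy_G, 0))
print(dijkstra_distances(toy_G, 0))
```

On the toy graph the two methods disagree for node 2: BFS counts one hop along the direct edge, while Dijkstra finds that the two-step route has the lower total weight.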

- graphconcomp and graphmaxflow may also be worth exploring in the bioinformatics toolbox
- GRNmap is already calculating shortest path so we may be able to use this as a control to check against any errors. Also may be able to calculate betweenness centrality after finding the shortest path. Look into the math on this
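
On the math linking shortest paths to betweenness centrality: a node's betweenness is the sum, over all source and target pairs, of the fraction of shortest paths between them that pass through that node. Brandes' algorithm computes this with one breadth-first search per source. The sketch below is an unweighted, directed, unnormalized version written for illustration; it is not taken from GRNsight, GRNmap, or any toolbox, and it assumes every node appears as a key in the adjacency dictionary.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness for a directed, unweighted graph.

    adjacency: dict mapping each node to the list of nodes it points to.
    """
    nodes = list(adjacency)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Toy graphs: a chain A -> B -> C and a diamond A -> {B, C} -> D.
chain = {"A": ["B"], "B": ["C"], "C": []}
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(betweenness(chain))
print(betweenness(diamond))
```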

Searched through the article above to explore if Systems Biology and Evolution toolbox could be right for the GRNsight team to download and use for further data analysis. Main functions listed in the article:

- Betweenness centrality, clustering coefficient, and closeness centrality, bridging centrality.
- Statistics include local average connectivity, core number, graph mean distance, graph efficiency, etc.
- Random networks can be generated using Erdos-Renyi, small world, and ring lattice algorithms
- Can also simulate evolution of a network via node duplication, node loss, and edge rewiring
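
For reference, an Erdos-Renyi style random network with a fixed number of nodes and edges can be generated in a few lines. The sketch below uses the fixed-edge-count variant, is directed, and allows self-loops by default; those are assumptions made for this illustration and may differ from what SBEToolbox does. The function name and parameters are hypothetical.

```python
import random


def random_directed_network(n_nodes, n_edges, seed=None, allow_self_loops=True):
    """Choose n_edges distinct directed edges uniformly at random.

    Returns a sorted list of (source, target) index pairs.
    """
    rng = random.Random(seed)
    possible = [
        (i, j)
        for i in range(n_nodes)
        for j in range(n_nodes)
        if allow_self_loops or i != j
    ]
    if n_edges > len(possible):
        raise ValueError("more edges requested than node pairs available")
    return sorted(rng.sample(possible, n_edges))


# Example size only: a random network with 21 nodes and 32 edges.
network_edges = random_directed_network(21, 32, seed=1)
print(len(network_edges))
```

Fixing the seed makes a random network reproducible, which helps when the same network has to be rerun or shared.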

The paper notes that it is similar to toolboxes such as "Functional Genomics Assistant" and "Mathworks Bioinformatics Toolbox." It notes that MBT has basic graph theory algorithms, but no functions for statistical analysis. The paper goes into detail about computer memory and graphs with nodes of 10,000-80,0000. They were able to create a random network with 10,000 nodes and 450,000 edges using 2 GB of memory, and all of the core functions finished within 10 minutes.

I would give a light recommendation to downloading this toolbox and testing it out. It seems like the people who were developing it were anticipating that systems biology needs both statistical and network analysis. The only issue foreseen is that, when exploring the GitHub page, there's a link to a manual on how to use the program, and it seems like a Texas A&M ID is necessary for login. If we decide to proceed with using this program, I can try to contact the authors to get this user manual.

## October 3, 2016

Notes from meeting:

- Continue working on TRACE
- Download SBEToolbox onto the Lab computers
- Try to figure out format, and play around with it using different networks, see if the interface
- Email SBE people and CC Dahlquist to make sure they respond
- Write Corresponding author and the first author in them, and send different emails to the two people

## October 17, 2016

- Heard back from SBEToolbox authors, Dr. Konganti with the user manual
- Continue exploring SBE Toolbox and its application to betweenness centrality and everything.
- Write small networks to input first.
- Repository-. 16-test (4 nodes, 6 edges?)

- Update Github more often, even if little progress is made
- Even commands and screenshots

- Natalie and Brandon are going to start running models, and stop making random networks
- TRACE documentation on the TRACE wiki:

## October 18, 2016

### Notes on SBEToolbox

- Have to go through specific saving to add to path
- Run SBEGUI to get graphical interface
- Gives >25 different organisms to select incl. E. coli, S. cerevisiae, Candida albicans, Drosophila, etc.
- Note: When MATLAB was restarted, this was not offered again.

- Four sections within GUI:
- Top Bar is menu area, gives all the functions here
- Function Output area (main white area), output is shown here
- Current Network Information Pane. Gray box below function output area. "Brief information about the current loaded network such as number of nodes, number of edges and updates if the current network is edited"
- Will be interesting to see what is shown, but having a reminder of edges/nodes/etc. will be handy when working on the networks

- Plugin Menu

- SBEToolbox Status Pane: Gray strip on bottom. Says "Ready" or "Busy"
- "Statistics" Menu Button has lots of the statistics wanted to run
- Able to create random networks of "Small World", "Erdos-Renyi", and "Ring Lattice"
- Can upload own network but only in .txt (Tab Delimited)
- Need to figure out how to correctly format the input sheets. Example given in .txt is over 4,000 different nodes, but changes format when opened in Excel. Need to figure out how formatted, ie if rows are regulators or regulatees, etc.
- Cool addition: Has a node lookup factor where you can select the node and the program will tell you what it does (like ZAP1 and zinc activation)

## October 24, 2016

- Natalie & Brandon will be doing test runs on small gene networks
- SIF instructions from GRNsight website. Not really a standard so will need to take some guesses
- Typically Source Column A, Target Column B (2 column format)
- Three column format, Column A: Source, Column B: relationship, Column C: Target
- No column headers

- GRNsight has adjacency converter from Excel to sif
- Perhaps write own documentation and make it detailed enough for anyone to follow
- Make note of what works and equally important what doesn't work
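
The SIF layouts noted above can be produced from an adjacency matrix with a short script. This sketch assumes the "network" sheet orientation used earlier in this notebook (rows are the genes affected, columns are the genes controlling) and tab-delimited output. The relationship label in the three-column form is left as a free parameter because the notes do not specify one; the value used below is a placeholder. This is not GRNsight's converter.

```python
def adjacency_to_sif(regulators, targets, matrix, relationship=None):
    """Return SIF lines, one per nonzero entry, with no header row.

    matrix[i][j] is nonzero when regulators[j] controls targets[i].
    With relationship=None the output is two columns (source, target);
    otherwise it is three columns (source, relationship, target).
    """
    lines = []
    for i, target in enumerate(targets):
        for j, source in enumerate(regulators):
            if matrix[i][j] != 0:
                if relationship is None:
                    fields = [source, target]
                else:
                    fields = [source, relationship, target]
                lines.append("\t".join(fields))
    return lines


# Toy network: CIN5 controls ZAP1, and ZAP1 controls itself.
names = ["CIN5", "ZAP1"]
toy_matrix = [[0, 0], [1, 1]]
for line in adjacency_to_sif(names, names, toy_matrix):
    print(line)
```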

For tomorrow: we will be running and screenshotting basically everything on a 4-gene network, and also on the 21-node, 32-edge test, to see how it all formats together

## October 25, 2016

- Maggie and I spent our time running the 4-node network through basically every program on SBEToolbox
- Did not have time to get to the 21-gene network, will run those next week maybe on more specific statistics or programs
- Also will add informative installation and set-up slides to the beginning next week. Hopefully this powerpoint can be used as a reference document for people later on
- Powerpoint can be found here: Media:HorstmannOneil_SBEToolbox_Tests.pptx

## October 31, 2016

- Tomorrow, will run the stats on HAP4, GLN3, and ZAP1 and the visualization to see how the toolbox works on larger, more applicable networks

## November 1, 2016

- Media: SBEToolbox_TEST.ppt
- Realized via the shortest path and betweenness coefficient that their automatic statistics are being run assuming undirected pathways
- Trying to figure out if there is a way to change this to directed
- Could not find Brandon's and Natalie's weighted networks, so will run test on those when received/if deemed necessary

## November 7, 2016

- Different motifs for Brandon and Natalie to try (regulatory, feed-forward, etc.)
- Maybe need to drop out the grey-scaled connections. Maybe make connection 10% of 0 instead of 5% of 0. Also might see some of the stronger connections become less important
- In the future: perhaps look at the MSEs to see how the genes overall are modeled
- SBEToolbox Dijkstra: uses the assumption of undirected networks
- Default implementation for shortest path and betweenness doesn't consider directionality.... Does anyone consider directionality?
- Perhaps we are exploring math packages more than computer science
- Protein-protein and genetic networks don't have directionality because they either bind or they don't. Perhaps people don't consider the directionality overall

- May be best to just search shortest path/betweenness of a directed network
- Send Dahlquist email about other programs to look for (Gephi, yEd, etc.)
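
To see why the undirected assumption matters, here is a minimal sketch that computes shortest path lengths by breadth-first search on the same edge list, once treating it as directed and once as undirected. This is not the SBEToolbox code. The function name `shortest_path_lengths` and the 4-node chain are made up for illustration.

```python
from collections import deque


def shortest_path_lengths(edges, source, directed=True):
    """Breadth-first search from source; returns {node: number of edges on the shortest path}.

    edges: iterable of (regulator, target) pairs.
    directed=False adds the reverse of every edge, which is the undirected assumption.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set())
        if not directed:
            neighbors[v].add(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:  # first visit is the shortest in an unweighted graph
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical chain A -> B -> C -> D
chain = [("A", "B"), ("B", "C"), ("C", "D")]
print(shortest_path_lengths(chain, "D", directed=True))   # D reaches nothing downstream
print(shortest_path_lengths(chain, "D", directed=False))  # D reaches every node
```

Nodes missing from the returned dictionary are unreachable. Since betweenness is built from these shortest paths, any path that only exists when arrows are ignored will inflate the undirected betweenness values.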

## November 28, 2016

- Assigned new task on GitHub (Issue 290) on inputting the 6 graphs into Gephi for statistics
- Run weighted and unweighted in order to directly compare
- Get the definitions and simple calculations for each statistic
- Add in the GRNsight visualization for each one

- Start thinking of biological analysis and conclusions for abstracts next semester
- UCI Systems bio conference

- Make sure lab notebook is up to date and detailed, begin making table of contents, organizing everything onto a CD, etc.
- Issue 291 on GitHub

## November 29, 2016

- Ran Gephi on Weighted and Unweighted
- Saved Gephi Outputs as Excel sheets
- Maggie consolidated the definitions and equations

## December 5, 2016

- Ran through our Gephi powerpoints; the most interesting statistics to possibly run may be strong component, closeness centrality, betweenness centrality, eigenvalue
- Are there relationships/graphical fits?
- Plot closeness and harmonic against each other to see if there's a major relationship between the two
- Find strong component and why some numbers are so much greater than others
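
As a reference point for the closeness vs harmonic comparison, here is a minimal sketch using textbook-style definitions on a directed, unweighted network. It assumes closeness = (number of reachable nodes) / (sum of distances to them) and harmonic = sum of 1/distance divided by (n - 1). Gephi's exact normalization may differ, so treat the numbers as a cross-check rather than a reproduction of Gephi's output. The function names and the 3-node example are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def closeness_and_harmonic(adj):
    """Return {node: (closeness, harmonic)} for adj = {node: [targets]}.

    Every node must appear as a key, even if it has no outgoing edges.
    """
    n = len(adj)
    result = {}
    for node in adj:
        dists = [d for other, d in bfs_distances(adj, node).items() if other != node]
        # Closeness only averages over nodes that can actually be reached.
        closeness = len(dists) / sum(dists) if dists else 0.0
        # Harmonic divides by all other nodes, so unreachable ones count as 0.
        harmonic = sum(1.0 / d for d in dists) / (n - 1) if n > 1 else 0.0
        result[node] = (closeness, harmonic)
    return result


# Hypothetical chain A -> B -> C
for node, (c, h) in closeness_and_harmonic({"A": ["B"], "B": ["C"], "C": []}).items():
    print(node, round(c, 3), round(h, 3))
```

Under these definitions the two measures agree when every node is reachable at distance 1 and drift apart when some nodes are unreachable or far away, which is one thing to look for in the closeness vs harmonic plots.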

- For next semester, choose a family and pump out ~100 random networks
- For this week, also do the table of contents and file upload to the CDs in the lab

## December 6, 2016

- Made final CD with table of contents and turned it in to Dr. Dahlquist
- Made the comparative charts of closeness and harmonic centralities and added them to the second sheet of the Gephi outputs, which were included on the CD

## January 12, 2017

- Decided we will be attending UCI Systems Bio conference on Feb 28. Abstracts are due Feb 20.
- One computer will be roped off for only running models in order to not change that variable....

## January 24, 2017

- Uploaded the HAP4 outputs from Gephi in an Excel sheet: Media: HAP4GephiOutputs.xls
- Will be making poster and presentation of Maggie's HAP4 from last year combined with the HAP4 Gephi outputs from the fall.

## February 2, 2017

- Could not attend the systems bio conference in Irvine at the last minute, but abstract can be found on GRNmap GitHub page
- Maggie presented without me, but I will be giving a solo talk on the topic at the undergraduate research symposium at LMU in March. Abstract in progress.
- Media: HorstmannSymposium_2017.doc

- Want to start looking at MSE relationship with ANOVA
- Do p<0.05 genes have better fits, do they not, or is there no relationship?

- Are genes with no inputs modeled worse?
- Compare list of genes with no input vs list of genes with inputs. Which are modeled better? What are the drivers for good fit or bad fit?
- If the b value is strange, is that good fit or bad fit?
- Comparing to random networks may just be building a false argument and knocking it down; we want to see the reasons why real networks tend to fit the model better.
- But if we are doing regression analysis, needs to be from same computer because otherwise it will be different.
- One thing we thought about doing is making a distance matrix. Once we have this, compare it to the weight values
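
A first pass at the MSE vs ANOVA question could be as simple as splitting genes by p-value and comparing the average MSE in each group. The sketch below assumes each gene is a record with a `p_value` and an `mse` field. Those field names, the function name, and the numbers are all hypothetical, not our actual outputs.

```python
from statistics import mean


def compare_mse_by_significance(genes, alpha=0.05):
    """Average MSE for genes with p < alpha versus the rest.

    genes: list of dicts with "p_value" and "mse" keys (hypothetical layout).
    Returns None for a group that has no genes in it.
    """
    significant = [g["mse"] for g in genes if g["p_value"] < alpha]
    not_significant = [g["mse"] for g in genes if g["p_value"] >= alpha]
    return {
        "significant": mean(significant) if significant else None,
        "not_significant": mean(not_significant) if not_significant else None,
    }


# Made-up values purely to show the shape of the input.
made_up = [
    {"gene": "g1", "p_value": 0.001, "mse": 0.25},
    {"gene": "g2", "p_value": 0.03, "mse": 0.75},
    {"gene": "g3", "p_value": 0.40, "mse": 1.0},
    {"gene": "g4", "p_value": 0.90, "mse": 2.0},
]
print(compare_mse_by_significance(made_up))
```

A difference in group means is only a starting point. With 15 genes per network, a scatter plot of MSE against p-value would probably be more informative than the two averages alone.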

## February 9, 2017

- Worked on editing LMU Research Symposium abstract, added comments from Dr. Dahlquist made in GitHub
- Changed project title back to SoCal Systems Bio conference title, as I can mention random networks in talk without mentioning it in the title
- Once this is submitted, can work on Gephi analysis of the random networks once Natalie and Brandon post them, then can work on Powerpoint slide
- Likely final abstract can be found here: Media:HorstmannFinalSmposium_2017.doc

## Feb 23, 2017

- Media: RandomNetworks10_Horstmann_Spring17.ppt
- Kept running Gephi on the random networks
- Had to redo the first few in order to put together the graphs in the order of the grid Brandon/Natalie made
- Need to start computing the weighted values in Excel
- Gephi outputs can be found in the Dahlquist repository https://github.com/kdahlquist/DahlquistLab/tree/master/data/15-gene_networks_analysis

## March 2, 2017

- In comparing the closeness centralities between nodes, I noticed Networks 4 and 5 had the same values. Double checked with the original data, was the same. Ran the same programs on Random 11 to have a full 10 networks, and added the new network to the repository.
- Updated random network powerpoint slide here: Media:Randomnetworks11_horstmann_Spring17.ppt
- Consolidated the harmonic, betweenness, eccentricity, and closeness centralities for all 15 genes and 12 networks (hap4-derived and 11 randoms). Sheet can be found on GitHub repository with the other data files
- Powerpoint of the four factors and outputs found here: Media: Randomnetworks_Gephi_stats_KMH.ppt
- Upcoming:
- Run Gephi on unweighted to see how they compare
- Make negative weights positive to see if those are the same
- Perhaps also examine degree distribution for the talk? Compare these quickly to see
- Make Powerpoint for talk
- Write up a more detailed Gephi protocol
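
For the unweighted and sign-flipped comparisons above, the degree numbers can be cross-checked outside Gephi with a few lines. This is a minimal sketch assuming the network is a list of (regulator, target, weight) edges. The function name `degree_table` and the example weights are hypothetical, and a self-regulation edge is counted once as in-degree and once as out-degree, which may not match how Gephi treats self-loops.

```python
def degree_table(edges, use_weights=True, absolute=False):
    """In-, out-, and total degree per node.

    edges: list of (source, target, weight).
    use_weights=False counts every edge as 1 (unweighted degree).
    absolute=True makes negative weights positive before summing.
    """
    table = {}
    for source, target, weight in edges:
        w = weight if use_weights else 1
        if absolute:
            w = abs(w)
        for node in (source, target):
            table.setdefault(node, {"in": 0, "out": 0})
        table[source]["out"] += w
        table[target]["in"] += w
    for row in table.values():
        row["total"] = row["in"] + row["out"]
    return table


# Hypothetical weights: one activation, one repression, one self-regulation.
edges = [("HAP4", "CIN5", 0.5), ("CIN5", "HAP4", -1.5), ("HAP4", "HAP4", 1.0)]
print(degree_table(edges))                     # signed weights
print(degree_table(edges, use_weights=False))  # unweighted
print(degree_table(edges, absolute=True))      # negative weights made positive
```

With signed weights, an activation and a repression of similar size nearly cancel, so a heavily regulated gene can end up with a weighted in-degree near zero. Taking absolute values measures total regulatory input regardless of sign.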

## March 16, 2017

- Began working on powerpoint presentation for the LMU URS. Draft can be found here. So far, only beginning stages of analysis, hence the last few slides being very bare. Decided to concentrate on betweenness, eccentricity, and weighted degrees. Likely to turn analysis into discussing why betweenness centrality is a better measurement of centrality than eccentricity and also why the randomized networks are not as well modeled as the natural one (aka why the betweenness numbers are so much larger)
- While comparing betweenness, found it easiest to consolidate all 10 random networks. The color code for each of the random networks is on the right: note that although it says there are 11 networks, there are actually 10, as networks 4 and 5 are duplicates but the numbering was kept.
- Media: Horstmann_11RandomNetwork_DegreeEccentricityBetweenness.xlsx

## March 30, 2017

- Presented my talk at the Undergrad Research Symposium on March 25
- Final talk uploaded here: Media: Horstmann_URS2017_Final.ppt

- Today, began running Gephi on the other random networks
- Completed Random networks 12-20
- Uploaded these output CSV sheets to GitHub repository at: https://github.com/kdahlquist/DahlquistLab/tree/master/data/15-gene_networks_analysis
- Will complete 21-30 next week

- In meeting today, discussed what we've completed this semester and future directions to take:
- 6 db networks
- 30 random networks
- LSE/min LSE ratios- barchart
- Degree Distributions
- unweighted: Brandon's R script done DB 1-6, still need to do random
- weighted: bar chart, cumulative plot done for db 1-6. SPSS?

- Gephi: tables of in-degree, out-degree, and total degree for both weighted and unweighted all db-derived and 20 of the randoms
- MSE/minMSE for db5; Natalie will set up excel spreadsheet to facilitate "plug and play"
- Gephi stats:
- eventually, make excel sheets like Maggie that compares the in-degree, out, etc for each of the 30 networks with different stat on each sheet
- For one of these, compute the stat by hand to make sure Gephi is doing it right, or find Gephi code, one unweighted and weighted
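
One way to do the by-hand check is to compute betweenness straight from its textbook definition on a small network and compare against Gephi's column. The sketch below assumes a directed, unweighted network and no normalization: for every ordered pair (s, t), each intermediate node v gets the fraction of shortest s-to-t paths that pass through it. This may not be the formula Gephi implements, and the function names and example network are hypothetical.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (dist, sigma): shortest path length and number of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                sigma[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                sigma[nxt] += sigma[node]  # node is one step before nxt on a shortest path
    return dist, sigma


def betweenness(adj):
    """Unnormalized directed betweenness for adj = {node: [targets]} (no duplicate edges)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    scores = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in adj:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path when the two legs add up to d(s, t).
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    scores[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return scores


# Hypothetical network: A regulates B and C, which both regulate D.
print(betweenness({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}))
```

In the example, A reaches D by two equally short routes, so B and C each get half a path. Entering the same edges as undirected (every edge in both directions) gives every node a nonzero score, which is the kind of difference to look for when comparing against Gephi's directed and undirected settings.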

To sum up the semester, maybe type up a group report with a compilation of charts, tables, etc. on GitHub (no CDs)

- Powerpoint or Word Doc

## April 6, 2017

- Finished running the remainder of the random networks (Networks 21-31)
- Folders in the Dahlquist repository have been altered and shifted around, and I was no longer able to find Networks 1-20, so reuploaded all of them here: https://github.com/kdahlquist/DahlquistLab/tree/master/data/Spring2017/Gephi_output/Gephi_db5-derived-random-network-1-through-31_output
- Dahlquist Lab Repository -> data -> Spring 2017 -> Gephi_output -> Gephi_db5-derived-random-network-1-through-31_output

- Updated Random Networks Powerpoint
- Maggie and I wrote Gephi Protocol for future students which can be found here: media: DahlquistLab_Gephi_Protocol.pdf

## April 20, 2017

- Tried combing through Gephi's GitHub page yet again to find the exact documentation/code to calculate the stats by hand, instead of assuming that the equations we found were correct, but could not find anything. Very difficult to maneuver their many files.
- Was chatting with Anu and she mentioned that Cytoscape runs similar stats and that we could cross-reference with a better documented program, but I ran into issues installing it
- Said I had to download Java 8 first, which I did, but then neither Java nor Cytoscape would open

- Next week will make sure everything is in order before graduation and will attempt the hand calculations again.

## April 27, 2017

- Had lab checkout and made sure all the files were uploaded and in order on GitHub
- Unmarked myself from all the GitHub comments and issues.
- Goodbye!!!