Claire A. Kosewic Lab Notebook

From OpenWetWare

Relevant Links

Dahlquist Lab

Lactase Persistance

Personal User Page

Fall 2020 Week 6 | October 4, 2020

Database Exploration

Given the Covid-19 pandemic and the halting of in-person research, we cannot collect participant samples as originally planned. In search of information about allele frequencies of the "European SNP," we have turned to existing nucleotide databases. Hopefully, this will allow us to begin compiling data about the relative frequency of lactase persistence among post-weaning individuals in the United States.

  • 1000 Genomes Project Database: calls itself "a database of signatures of selection in the human genome." When "LCT" is searched in the gene browser, all of the common SNPs present in the LCT gene are returned. However, none of these SNPs are in the enhancer region (one of the SNPs most relevant to this work is -13910 C/T, which is not returned in this database). Additionally, this database was created and published in 2014, and has not been maintained or updated since.
  • Database of Genomic Variants: ten-year compilation of structural variation in the genomes of control individuals, originally published in 2013. Many variants are noted, though there is little frequency data noted for any of them. This database will not be helpful for SNP frequency searches, because the database architects designed "genomic variants" to be at least 50 bp long.
  • dbPSHP: database of positive selection across human populations. Offers variant plots, frequency tables, and genome browsers for several defined ethnic groups (European American, African ancestry in Southwest USA, Han Chinese in Beijing, Chinese population in Denver (Colorado), Gujarati Indians in Houston (Texas)). Has not been updated since 2014, but likely a good source for frequencies.
  • Lynx: Database/Knowledge Extraction Engine for Integrative Medicine: helps to identify genetic factors of phenotypes/disorders of research interest. Has not been updated since 2018. Likely not a good candidate for allele frequency work because its chief role is to connect observed disorders with their genetic basis. Many SNPs have already been identified with regard to lactase persistence, and the LCT gene is relatively well-understood in connection with lactase persistence, lactase non-persistence, and congenital lactase deficiency.
  • GWAS Catalog: catalog of genome-wide association studies, gives more general information about the LCT gene and the SNPs that confer lactase persistence, but no easily available frequency data.
  • rVarBase: an updated version of rSNP base, annotates a variant's regulatory features in three ways: chromatin state of the region surrounding the variant, regulatory elements overlapping with query, and traits associated with the variant. Given that the enhancer region is one of the most important pieces to the puzzle of lactase persistence, focusing on regulatory regions is a wise choice. However, limited information about allele frequency.

Week 6 Research Meeting Notes

Research goal is simple, well-defined: we're looking for frequency data of SNPs relevant to lactase persistence/non-persistence (e.g. How many people have the C versus T allele at rs4988235, and what is their noted ethnic heritage?) For next week, going to start mining available data for frequencies. Anticipated queries are as follows:

  • How to derive allele frequencies from GWAS data (has a workflow for doing this already been set up?)
  • Figuring out what exactly is in the available data and how we can use it (lists of SNPs and frequencies? raw chromosomal associations?)
  • Use of George Church/personal genomes project data to start aggregating public data regarding -13910 SNP and lactase persistence
  • Big question: how are ethnic groups broken down? Does European mean collected from people living in Western Europe? Or does it refer to people of European descent?

We know what we're looking for; now the goal will be finding data in the format we're looking for OR finding data in other formats that can be analyzed and put into the format we're looking for.
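As a minimal sketch of the target format described above, the following shows one way genotype records could be turned into per-group allele frequencies. The group labels and genotypes are made-up placeholders, and the function name `allele_frequencies_by_group` is an assumption for illustration, not an existing workflow.

```python
from collections import Counter, defaultdict


def allele_frequencies_by_group(records):
    """Return {group: {allele: frequency}} from (group, genotype) pairs.

    Each genotype is a two-letter string such as "CT", so every
    individual contributes two alleles to the count for their group.
    """
    counts = defaultdict(Counter)
    for group, genotype in records:
        counts[group].update(genotype)  # counts each letter as one allele

    frequencies = {}
    for group, allele_counts in counts.items():
        total = sum(allele_counts.values())
        frequencies[group] = {
            allele: n / total for allele, n in allele_counts.items()
        }
    return frequencies


# Hypothetical records at rs4988235 (illustrative only, not real data).
records = [
    ("group_a", "CC"),
    ("group_a", "CT"),
    ("group_b", "TT"),
    ("group_b", "CT"),
]
freqs = allele_frequencies_by_group(records)
print(freqs)
```

Any source that can be reduced to (group, genotype) pairs could feed this, which is why the question of how each source defines its groups matters so much.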

Fall 2020 Week 7 | October 11, 2020

Frequency Data from OpenSNP and Personal Genomes Project

A chief goal going forward is mining data from other research papers correlating race/ethnicity to allele frequencies at -13910 (rs4988235) because new sample collection is quite difficult given the Covid-19 pandemic and limited research resources (as well as safety concerns and transmission risk).

From OpenSNP: For SNP -13910 (rs4988235), OpenSNP has great frequency data compiled from dozens of papers, all linked and organized by SNP. However, to determine the race/ethnic correlation with the allele frequencies, I am fairly certain that each paper will have to be examined individually. Some papers are studying specific ethnic groups (e.g. the Maasai or the Fulani nomads), and others happen to have the data based on other studied characteristics (e.g. Type 2 Diabetes risk or obesity correlation). We will likely need to establish what racial/ethnic groups we would like to track, and begin looking through the data to fit it to the model we propose. There does not seem to be a consensus on the best way to do this, nor does any one racial breakdown dominate across the papers. Additionally, of the papers that do give a racial/ethnic classification, there is no data on how this information was collected (e.g. self-reported?).

From Personal Genomes Project: Unlike OpenSNP, you cannot search by individual SNP. There are a few options for mining data:

  • Surveyed Traits: self-reported traits of family disease history, phenotypic/anecdotal evidence of certain conditions (there is a specific gastrointestinal traits page, where one can download a CSV file of all of the individuals' responses, search for "lactose intolerance" or "lactase non-persistence") but there does not seem to be any easy way to link the respondents to their racial/ethnic history.
  • Participant Profiles: can look at each participant individually, but there is no summary data. Could possibly mine this data for information about phenotypic lactase persistence, but would really need a computer program to identify the race on file in each participant's profile. Going through 6000+ genomic profiles one by one would be incredibly time-consuming, though it remains an option if we decide to explore it.
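The CSV search described under Surveyed Traits could be automated roughly as follows. This is only a sketch: the column names `participant` and `traits` are assumptions (the real export may be laid out differently), and the demo rows are invented so the example runs without a download.

```python
import csv
import io


def participants_reporting(csv_file, terms, id_column="participant"):
    """Return IDs of rows whose text mentions any of the search terms.

    The search is case-insensitive and looks across every column,
    so it does not depend on which column holds the trait text.
    """
    terms = [term.lower() for term in terms]
    matches = []
    for row in csv.DictReader(csv_file):
        text = " ".join(str(value) for value in row.values() if value).lower()
        if any(term in text for term in terms):
            matches.append(row[id_column])
    return matches


# Invented stand-in for a downloaded survey file (hypothetical columns).
demo = io.StringIO(
    "participant,traits\n"
    "id001,Lactose intolerance; Acid reflux\n"
    "id002,Acid reflux\n"
    "id003,lactase non-persistence\n"
)
hits = participants_reporting(
    demo, ["lactose intolerance", "lactase non-persistence"]
)
print(hits)
```

With a real download, `demo` would be replaced by an opened file, e.g. `open(path, newline="")`.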

Week 7 Research Meeting Notes

GitHub repository

  • collect profile data (race and gender)
  • collect sequence data (for LCT-related SNPs)
  • relate the profile data and sequence data by user ID
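The three steps above amount to a join on user ID. A minimal sketch, assuming both data sets have already been collected into dictionaries keyed by ID (the IDs, field names, and values here are placeholders, not the repository's actual structure):

```python
def join_by_user_id(profiles, genotypes, snp="rs4988235"):
    """Combine profile and genotype records for IDs present in both."""
    shared_ids = sorted(profiles.keys() & genotypes.keys())
    return [
        {"id": user_id, **profiles[user_id], snp: genotypes[user_id]}
        for user_id in shared_ids
    ]


# Hypothetical inputs (illustrative only).
profiles = {
    "u1": {"race": "group_a", "gender": "F"},
    "u2": {"race": "group_b", "gender": "M"},
    "u3": {"race": "group_a", "gender": "M"},  # no genotype on file
}
genotypes = {"u1": "CT", "u2": "CC", "u4": "TT"}  # u4 has no profile

joined = join_by_user_id(profiles, genotypes)
print(joined)
```

Only users with both a profile and a genotype survive the join, so it is worth recording how many are dropped at this step.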

Fall 2020 Week 8 | October 18, 2020

Week 8 Research Meeting Notes

Excellent information sharing with the Covid-19 project lab. They shared information that they'd gleaned from the gnomAD database and Ensembl genome browser, which looked like it could definitely be applicable to the lactase persistence project. Short summaries of the relevant information from cursory searches of SNP rs4988235 are shown below:

  • Ensembl: Ensembl collects information from other available databases and synthesizes it into a graphical interface for ease of reference. They divide the population data into five ethnic groups: African, American, East Asian, European, and South Asian. Like we've previously seen, the European group seems to have the highest frequency of the T allele (note: reported here as A; don't get confused!), showing approx. 51% T and 49% C alleles in the surveyed population. However, it is important to note that these ethnic divisions are not based on individuals' current country of residence, but rather on some ancestral connection. For example, the "European" subgroup contains a study of people living in Utah with reported European ancestry, and the "African" subgroup contains individuals of African ancestry living in the Southwest United States. So, if we are going to use this information in coming up with some frequency data for the United States, we need to be careful about the ethnic boundaries that are being reported. Also, the population from "East Asia" shows 100% frequency of the C allele (consistent with lactase non-persistence), which needs to be studied further. Colloquially, there are anecdotes about people of Asian ancestry having a higher chance of being LNP than other ethnic groups, but dispelling these kinds of stereotypes is one of the reasons for this project, so I'd like to propose further review of that data.
  • gnomAD: gnomAD has some of the same kinds of data aggregation as Ensembl, collecting allelic frequencies from many different research papers and studies (takes advantage of publicly available data, which is a huge bonus, and largely what we are depending on). It does not have the same kind of population data that Ensembl does, which is cool because it gives us a different approach to the allele frequencies. It contains detailed information about the specific mutations that different samples showed, and will classify them as such. I want more information about how to interpret the chromosomal data, because I don't feel like I have a strong understanding of that to date.
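To avoid the A-versus-T confusion noted under Ensembl, allele labels reported on the opposite strand can be converted by base complement (A pairs with T, C pairs with G). The sketch below assumes the frequencies are given per allele; the numbers only mirror the approximate European figures from the notes and are not an authoritative data pull.

```python
# Watson-Crick base pairing: each base maps to its complement.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_opposite_strand(frequencies):
    """Relabel {allele: frequency} using the complementary base."""
    return {COMPLEMENT[allele]: freq for allele, freq in frequencies.items()}


# Illustrative values as a browser might report them on the other strand.
reported = {"A": 0.51, "G": 0.49}
converted = to_opposite_strand(reported)
print(converted)
```

After conversion the frequencies are expressed in the -13910 C/T terms used elsewhere in these notes.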

Fall 2020 Week 9 | October 25, 2020

Summary of Work/Ideas To Date

  • Covid-19 changed everything: we had originally planned to spend the fall collecting samples and genotyping them to begin building out a data set of allele frequencies related to LP/LNP in the genomes of the people that we surveyed. Obviously, the halting of in-person research plus infection risk concerns due to Covid-19 (even if some research were still occurring in person) made this research highly inadvisable.
  • New goals set at the beginning of the semester: to look through publicly-available data and see if we could begin to create some frequency data collection of our own, even though we weren't running samples. This largely consisted of searching through the various databases to find pertinent allele frequency data. This was met with varying degrees of success based on the database.
    • gnomAD: possibly helpful? struggled interpreting data and explanations were a little over my head
    • Ensembl: very helpful, aggregated a lot of the genetic data I'd found in places like the 1000 Genomes Project
    • GWAS Catalog (genome-wide association studies): cool information, but mostly focused on the association of different SNPs to diseases; in this case we pretty much know the SNPs that are associated with lactase persistence (this could possibly be helpful for the Covid group though? note to self: ask Jess!) Also has excellent graphical interface for locating different SNPs on different chromosomes, and well-reviewed/frequently-updated
    • dbGaP: archive of the same kinds of information that the GWAS catalog provides, though lots of it behind privacy walls for subject data protection. Not sure if I have the right tools/experience to mine the right data out of here/even figure out what I might like to look at. If they're phenotype-genotype correlational reports, is that necessarily directly beneficial to what we are trying to do here in assembling a frequency map?
  • Questions that I still have/guide further work
    • I've done all of this playing around in databases. What does this assemblage of allele frequencies look like? What is the vision?
    • Am I just going to compile all of the data that is traceable back to the United States? Do we need to be more discriminating in what we choose to include in the data set?
    • What deliverables am I working towards? Talked about updating the educational slide deck early this semester; more on that front?
    • Do I need to do some more historical research about the dairy lobby and the influence on the myPlate/national dietary recommendations for the United States?
    • Milk racism? Food oppression? The Unbearable Whiteness of Milk

Week 9 Research Meeting Notes

  • geneticists who are looking at ancestral data/information and geneticists who are looking for linkage between genetic variants and disease (homogeneous populations, isolated, without in/out migration: get a strong signal and figure out what gene is responsible, really good signal:noise ratio)
  • this is a public health approach: we have a heterogeneous population in the United States, and in this population, what are the allele frequencies? due to ancestry, we know that things are not "evenly mixed"
  • race is a social/cultural construction
  • white males are vastly overrepresented in biomedical research, and so is the information gathered on those individuals
  • more diversity and inclusion in our data
  • don't want to use someone's self-identified race as the end all be all
  • P-hacking (gross)
  • Ensembl: deep dive -- understanding the data that is being presented? write lots of questions? go into documentation (where the data is coming from, how they are presenting the data)
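For the public health question raised in these notes (what is the allele frequency in a population that is not "evenly mixed"?), one simple way to think about it is a weighted average of subgroup frequencies, weighted by each subgroup's share of the population. The sketch below is only an illustration of that arithmetic; the group names, shares, and frequencies are invented, and the function name is hypothetical.

```python
def population_weighted_frequency(groups):
    """Weighted average allele frequency across subgroups.

    `groups` maps a label to (population_share, allele_frequency).
    Shares are normalized, so they do not need to sum to exactly 1.
    """
    total_share = sum(share for share, _ in groups.values())
    weighted_sum = sum(share * freq for share, freq in groups.values())
    return weighted_sum / total_share


# Invented subgroups (illustrative only, not real population data).
groups = {
    "group_a": (0.6, 0.5),  # 60% of the population, T allele at 50%
    "group_b": (0.4, 0.0),  # 40% of the population, T allele absent
}
overall = population_weighted_frequency(groups)
print(overall)
```

The overall number hides the difference between the subgroups, which is exactly why the per-group breakdown (and how groups are defined) matters for this project.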

Fall 2020 Week 11 | November 12, 2020

What does Ensembl have to offer us?

  • info comes from dbSNP, release 153; visible and linked on the home page of the SNP information
  • good dashboard summary of the SNP; when first searched, you can immediately find out critical information (e.g. intron variant, its ancestral allele, its location, and info about the fact that it overlaps 3 transcripts/has 3009 sample genotypes/associated with 8 phenotypes/177 citations)
    • the 8 phenotypes would be interesting to look more into; what else is this SNP associated with besides lactase persistence?
  • can be viewed in genomic context, can look at population genetics, phenotype data, etc
    • interestingly, in the phenotype data, you can explore the different phenotypic associations for rs4988235. There are two notations about blood protein levels, four about body mass index, one about hip circumference, and one about lactase persistence
      • if you click on the "lactase persistence" note, it will give you other SNPs also associated with lactase persistence (not incredibly helpful for us because we already have a decent list) and the "evidence" for these associations with PMID links
    • population genetics is probably the most applicable to what we are looking for: lots of summary data about different populations worldwide. Most important for us is the issue of diversity...the US specifically has one of the most heterogeneous populations in the world, from vastly different ancestral backgrounds. Population genetics normally likes homogeneous populations, which is probably why we've gotten ourselves into this mess of forcing people to consume dairy products even if they are not nutritionally optimal...someone, somewhere, chose to base dietary recommendations on a small group of people with European ancestry, dairy suppliers and farmers got hold of it, and here we are
      • it would be super interesting to go back in time and look for when dietary guidelines began being published, up through today and the "got milk?" ads and the battle over plant-based milks (who were those guidelines written for originally? who decided that they would be codified for everyone in the US? when did that happen?)
  • all the citations are gathered together for study and analysis
  • linkage disequilibrium plots? (they are published for the individual cohorts with the population genetics, what info do they have to offer us?)
  • flanking sequences: do we need information about environment around the SNP? How much does the immediate flanking environment mediate transcription? Or are we focusing singularly on rs4988235 because it is so far away from the promoter sequence?
    • Lewinsky paper about the size of the enhancer; verification?
  • make a better map of lactase persistence frequencies; heat map is not the right way to go about it, because that manufactures data that we don't have
    • one where the size of the circles measures population, one where the size of the circle is the "n" that is sampled

GIS software: what is the input format? do we need longitude/latitude, or are political boundaries enough?
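One possible answer to the input-format question is a plain CSV of points with longitude/latitude columns, which QGIS can load as a delimited text layer. The sketch below writes such a file with a sample-size column that could drive circle size, as proposed for the map. The column names, site labels, coordinates, and values are all placeholders chosen for illustration.

```python
import csv
import io

FIELDNAMES = ["label", "longitude", "latitude", "n_sampled", "t_allele_freq"]


def write_points_csv(rows, out):
    """Write one row per sampled population, with a header line."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder points (illustrative only, not real sampling sites).
rows = [
    {"label": "site_a", "longitude": -111.9, "latitude": 40.8,
     "n_sampled": 99, "t_allele_freq": 0.5},
    {"label": "site_b", "longitude": -95.4, "latitude": 29.8,
     "n_sampled": 60, "t_allele_freq": 0.25},
]

buffer = io.StringIO()  # swap for open("points.csv", "w", newline="") to save
write_points_csv(rows, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```

Plotting points rather than filled regions keeps the map honest about where data exists, in line with the concern that a heat map manufactures data we don't have.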

Fall 2020 Week 12 | November 19, 2020

Notes About QGIS

  • can potentially be a very powerful tool to use
    • I was able to download it from the internet (easy) and begin to get it going (hard — it took me about 30 minutes and a lot of Google searching to figure out input formats alone); found a lot of great tutorials, but they are long and complicated (it is a long and complicated software to learn to use, as most powerful software applications are)
    • best tutorial collection can be found here but the important question is, is this what we want to be doing? if it is, I will throw myself into it and become an expert as best I can, but I don't want to do that if it isn't the best or most productive use of my time
  • lots of groups and research studies choose to use ArcGIS because of professional support, fewer bugs (supposedly), cleaner interface, and generally more sophisticated functioning; huge drawback is obviously that it is expensive while QGIS is free

Notes from "The Unbearable Whiteness of Milk"

  • "The USDA’s efforts to reduce the high-fat milk surplus by selling it to fast food consumers impose health costs on Americans generally, but disproportionately harm low-income African Americans and Latina/os who live in urban centers dominated by fast food restaurants."
    • prime example of food oppression: an institutional, systemic, food-related action or policy that physically debilitates a socially subordinated group
  • food oppression in the United States is hard to grasp, because the concept of individual choice is so much easier to promote (systemic change = way harder than the personal change that is theoretically possible...anecdotally); influence of neoliberalism and capitalism oozing from every ounce of surplus cheese that gets piled on a Domino's pizza
    • government subsidies to fast food restaurants, etc, make this even more complex!
  • sketchy stats from this random guy's blog