Claire A. Kosewic Lab Notebook
Personal User Page
Fall 2020 Week 6 | October 4, 2020
Given the Covid-19 pandemic and the halting of in-person research, we cannot collect participant samples as originally planned. In search of information about allele frequencies of the "European SNP" we have turned to previously-available nucleotide databases. Hopefully, this will allow us to begin formulating data about the relative expression of lactase persistence among post-weaning individuals in the United States.
- 1000 Genomes Project Database: calls itself "a database of signatures of selection in the human genome." When "LCT" is searched in the gene browser, all of the common SNPs present in the LCT gene are returned. However, none of these SNPs are in the enhancer region (one of the most commonly relevant SNPs to this work is -13910 C/T, which is not returned in this database). Additionally, this database was created and published in 2014, and has not been maintained or updated since.
- Database of Genomic Variants: ten-year compilation of structural variation in the genomes of control individuals, originally published in 2013. Many variants are noted, though there is little frequency data noted for any of them. This database will not be helpful for SNP frequency searches, because the database architects designed "genomic variants" to be at least 50 bp long.
- dbPSHP: database of positive selection across human populations. Offers variant plots, frequency tables, and genome browsers for several defined ethnic groups (European American, African ancestry in Southwest USA, Han Chinese in Beijing, Chinese population in Denver (Colorado), Gujarati Indians in Houston (Texas)). Has not been updated since 2014, but likely a good source for frequencies.
- Lynx: Database/Knowledge Extraction Engine for Integrative Medicine: helps to identify genetic factors of phenotypes/disorders of research interest. Has not been updated since 2018. Likely not a good candidate for allele frequency work because its chief role is to connect observed disorders with their genetic basis. Many SNPs have already been identified with regard to lactase persistence, and the LCT gene is relatively well-understood in connection with lactase persistence, lactase non-persistence, and congenital lactase deficiency.
- GWAS Catalog: catalog of genome-wide association studies, gives more general information about the LCT gene and the SNPs that confer lactase persistence, but no easily available frequency data.
- rVarBase: an updated version of rSNP base, annotates a variant's regulatory features in three ways: chromatin state of the region surrounding the variant, regulatory elements overlapping with query, and traits associated with the variant. Given that the enhancer region is one of the most important pieces to the puzzle of lactase persistence, focusing on regulatory regions is a wise choice. However, limited information about allele frequency.
Week 6 Research Meeting Notes
Research goal is simple, well-defined: we're looking for frequency data of SNPs relevant to lactase persistence/non-persistence (e.g. How many people have the C versus T allele at rs4988235, and what is their noted ethnic heritage?) For next week, going to start mining available data for frequencies. Anticipated queries are as follows:
- How to derive allele frequencies from GWAS data (has a workflow for doing this already been set up?)
- Figuring out what exactly is in the available data and how we can use it (lists of SNPs and frequencies? raw chromosomal associations?)
- Use of George Church/personal genomes project data to start aggregating public data regarding -13910 SNP and lactase persistence
- Big question: how are ethnic groups broken down? Does European mean collected from people living in Western Europe? Or does it refer to people of European descent?
We know what we're looking for; now the goal will be finding data in the format we're looking for OR finding data in other formats that can be analyzed and put into the format we're looking for.
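A minimal sketch of the "C versus T" question above, assuming the data eventually arrives as one diploid genotype call per person (e.g. "CT"). The genotype list here is made up for illustration and is not real data; the function name is my own.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from a list of diploid genotypes such as 'CT'."""
    counts = Counter()
    for genotype in genotypes:
        # Each genotype string contributes two alleles to the tally.
        counts.update(genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotype calls at rs4988235 (placeholders, not real data).
sample = ["CC", "CT", "TT", "CT", "CC"]
freqs = allele_frequencies(sample)
print(freqs)
```

If a source reports genotype counts instead of individual calls, the same arithmetic applies: T frequency = (2 x TT + CT) / (2 x number of people).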
Fall 2020 Week 7 | October 11, 2020
Frequency Data from OpenSNP and Personal Genomes Project
A chief goal going forward is mining data from other research papers correlating race/ethnicity to allele frequencies at -13910 (rs4988235) because new sample collection is quite difficult given the Covid-19 pandemic and limited research resources (as well as safety concerns and transmission risk).
From OpenSNP, for SNP -13910 (rs4988235): OpenSNP has great frequency data that they've compiled from dozens of papers, all linked and organized by SNP. However, to determine the race/ethnic correlation with the allele frequencies, I am fairly certain that each paper will have to be examined individually. Some papers are studying specific ethnic groups (e.g. the Maasai or the Fulani nomads), and others happen to have the data based on other studied characteristics (e.g. Type 2 Diabetes Risk or obesity correlation). We will likely need to establish what racial/ethnic groups we would like to track, and begin looking through the data to fit it to the model we propose. There does not seem to be a consensus on the best way to do this, nor does there seem to be a supermajority of any one racial breakdown. Additionally, of the papers that do give a racial/ethnic classification, there is no data on how this information was collected (e.g. self-reported?).
From Personal Genomes Project: Unlike OpenSNP, you cannot search by individual SNP. There are a few options for mining data:
- Surveyed Traits: self-reported traits of family disease history, phenotypic/anecdotal evidence of certain conditions (there is a specific gastrointestinal traits page, where one can download a CSV file of all of the individuals' responses, search for "lactose intolerance" or "lactase non-persistence") but there does not seem to be any easy way to link the respondents to their racial/ethnic history.
- Participant Profiles: can look at each participant individually, but there is no summary data. Could possibly mine this data for information about phenotypic lactase persistence, but would really need a computer program to identify the race on file in each participant's profile. Going through 6000+ genomic profiles one by one would be incredibly time-consuming, though it remains an option if we decide to explore it.
Week 7 Research Meeting Notes
- collect profile data (race and gender)
- collect sequence data (for LCT-related SNPs)
- relate the profile data and sequence data by user ID
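A rough sketch of the three steps above, assuming the profile data and the genotype data can each be exported as a list of records that share a user ID. The field names (`user_id`, `race`, `gender`), the IDs, and the group labels are all placeholders I made up; they are not the actual Personal Genomes Project export format.

```python
from collections import defaultdict

# Hypothetical profile records (placeholders, not real participants).
profiles = [
    {"user_id": "u1", "race": "group A", "gender": "F"},
    {"user_id": "u2", "race": "group B", "gender": "M"},
    {"user_id": "u3", "race": "group A", "gender": "M"},
]

# Hypothetical genotype records for one LCT-related SNP.
genotypes = [
    {"user_id": "u1", "rs4988235": "CT"},
    {"user_id": "u3", "rs4988235": "CC"},
    {"user_id": "u4", "rs4988235": "TT"},  # no matching profile, so it is dropped
]


def join_by_user(profiles, genotypes):
    """Keep only users that appear in both tables, merging their fields."""
    profile_by_id = {p["user_id"]: p for p in profiles}
    joined = []
    for g in genotypes:
        profile = profile_by_id.get(g["user_id"])
        if profile is not None:
            joined.append({**profile, **g})
    return joined


def genotype_counts_by_group(joined, group_field="race", snp="rs4988235"):
    """Tally genotypes within each value of the grouping field."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in joined:
        counts[row[group_field]][row[snp]] += 1
    return {group: dict(c) for group, c in counts.items()}


joined = join_by_user(profiles, genotypes)
summary = genotype_counts_by_group(joined)
print(summary)
```

Users missing from either table are silently dropped here; for the real data it would probably be worth counting how many are lost that way.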
Fall 2020 Week 8 | October 18, 2020
Week 8 Research Meeting Notes
Excellent information sharing with the Covid-19 project lab. They shared information that they'd gleaned from the gnomAD database and Ensembl genome browser, which looked like it could definitely be applicable to the lactase persistence project. Short summaries of the relevant information from cursory searches of SNP rs4988235 are shown below:
- Ensembl: Ensembl collects information from other available databases and synthesizes it into a graphical interface for ease of reference. They divide the population data into five ethnic groups: African, American, East Asian, European, and South Asian. Like we've previously seen, the European group seems to have the highest frequency of the T allele (note: reported here as A; don't get confused!), showing approx. 51% T and 49% C alleles in the surveyed population. However, it is important to note that these ethnic divisions are not based on individuals' current country of residence, but rather on some ancestral connection. For example, the "European" subgroup contains a study of people living in Utah with reported European ancestry; and the "African" subgroup contains individuals of African ancestry living in the Southwest United States. So, if we are going to use this information in coming up with some frequency data for the United States, we need to be careful about the ethnic boundaries that are being reported. Also, the population from "East Asia" shows 100% frequency of the C allele (consistent with lactase non-persistence), which needs to be studied further. Colloquially, there are anecdotes about people of Asian ancestry having a higher chance of being LNP than other ethnic groups, but dispelling these kinds of stereotypes is one of the reasons for this project, so I'd like to propose further review of that data.
- gnomAD: gnomAD has some of the same kinds of data aggregation as Ensembl, collecting allelic frequencies from many different research papers and studies (takes advantage of publicly available data, which is a huge bonus, and largely what we are depending on). It does not have the same kind of population data that Ensembl does, which is cool because it gives us a different approach to the allele frequencies. It contains detailed information about the specific mutations that different samples showed, and will classify them as such. I want more information about how to interpret the chromosomal data, because I don't feel like I have a strong understanding of that to date.
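Note to self on the "T reported as A" confusion in Ensembl: a small sketch of relabeling alleles to the opposite strand. This assumes the browser is reporting alleles on the forward strand, where the -13910 C/T variant would read as G/A; that assumption should be checked against the Ensembl documentation. The frequencies are the approximate European values noted above.

```python
# Watson-Crick complements, used to translate between strands.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_opposite_strand(allele):
    """Return the complementary base for a single-nucleotide allele."""
    return COMPLEMENT[allele.upper()]


def relabel_frequencies(freqs):
    """Relabel a {allele: frequency} table onto the opposite strand."""
    return {to_opposite_strand(allele): f for allele, f in freqs.items()}


# Approximate European frequencies as displayed (assumed forward-strand labels).
as_displayed = {"G": 0.49, "A": 0.51}
print(relabel_frequencies(as_displayed))
```

Only the labels change; the frequencies themselves stay attached to the same underlying allele.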
Fall 2020 Week 9 | October 25, 2020
Summary of Work/Ideas To Date
- Covid-19 changed everything: we had originally planned to spend the fall collecting samples and genotyping them to begin building out a data set of allele frequencies related to LP/LNP in the genomes of the people that we surveyed. Obviously, the halting of in-person research plus infection risk concerns due to Covid-19 (even if some research were still occurring in person) made this research highly inadvisable.
- New goals set at the beginning of the semester: to look through publicly-available data and see if we could begin to create some frequency data collection of our own, even though we weren't running samples. This largely consisted of searching through the various databases to find pertinent allele frequency data. This was met with varying degrees of success based on the database.
- gnomAD: possibly helpful? struggled interpreting data and explanations were a little over my head
- Ensembl: very helpful, aggregated a lot of the genetic data I'd found in places like the 1000 Genomes Project
- GWAS Catalog (genome-wide association studies): cool information, but mostly focused on the association of different SNPs to diseases; in this case we pretty much know the SNPs that are associated with lactase persistence (this could possibly be helpful for the Covid group though? note to self: ask Jess!) Also has excellent graphical interface for locating different SNPs on different chromosomes, and well-reviewed/frequently-updated
- dbGaP: archive of the same kinds of information that the GWAS catalog provides, though lots of it behind privacy walls for subject data protection. Not sure if I have the right tools/experience to mine the right data out of here/even figure out what I might like to look at. If they're phenotype-genotype correlational reports, is that necessarily directly beneficial to what we are trying to do here in assembling a frequency map?
- Questions that I still have/guide further work
- I've done all of this playing around in databases. What does this assemblage of allele frequencies look like? What is the vision?
- Am I just going to compile all of the data that is traceable back to the United States? Do we need to be more discriminating in what we choose to include in the data set?
- What deliverables am I working towards? Talked about updating the educational slide deck early this semester; more on that front?
- Do I need to do some more historical research about the dairy lobby and the influence on the myPlate/national dietary recommendations for the United States?
- Milk racism? Food oppression? The Unbearable Whiteness of Milk
Week 9 Research Meeting Notes
- geneticists who are looking at ancestral data/information and geneticists who are looking for linkage between genetic variants and disease (homogenous populations, isolated, without in/out migration: get a strong signal and figure out what gene is responsible, really good signal:noise ratio)
- this is a public health approach: we have a heterogenous population in the United States, and in this population, what are the allele frequencies? due to ancestry, we know that things are not "evenly mixed"
- race is a social/cultural construction
- white males are vastly overrepresented in biomedical research, and the information gathered on those individuals
- more diversity and inclusion in our data
- don't want to use someone's self-identified race as the end all be all
- P-hacking (gross)
- Ensembl: deep dive -- understanding the data that is being presented? write lots of questions? go into documentation (where the data is coming from, how they are presenting the data)
Fall 2020 Week 11 | November 12, 2020
What does Ensembl have to offer us?
- info comes from dbSNP, release 153; visible and linked on the home page of the SNP information
- good dashboard summary of the SNP; when first searched, you can immediately find out critical information (e.g. intron variant, its ancestral allele, its location, and info about the fact that it overlaps 3 transcripts/has 3009 sample genotypes/associated with 8 phenotypes/177 citations)
- the 8 phenotypes would be interesting to look more into; what else is this SNP associated with besides lactase persistence?
- can be viewed in genomic context, can look at population genetics, phenotype data, etc
- interestingly, in the phenotype data, you can explore the different phenotypic associations for rs4988235. There are two notations about blood protein levels, four about body mass index, one about hip circumference, and one about lactase persistence
- if you click on the "lactase persistence" note, it will give you other SNPs also associated with lactase persistence (not incredibly helpful for us because we already have a decent list) and the "evidence" for these associations with PMID links
- population genetics is probably the most applicable to what we are looking for and about: lots of summary data about different populations worldwide. Most importantly for us is the issue of diversity...the US specifically has one of the most heterogenous populations in the world, from vastly different ancestral backgrounds. Population genetics normally likes homogenous populations, which is probably why we've gotten ourselves into this mess of forcing people to consume dairy products even if they are not nutritionally optimal...someone, somewhere, chose to base dietary recommendations on a small group of people with European ancestry, dairy suppliers and farmers got hold of it, and here we are
- it would be super interesting to go back in time and look for when dietary guidelines began being published, up through today and the "got milk?" ads and the battle over plant-based milks (who were those guidelines written for originally? who decided that they would be codified for everyone in the US? when did that happen?)
- all the citations are gathered together for study and analysis
- linkage disequilibrium plots? (they are published for the individual cohorts with the population genetics, what info do they have to offer us?)
- flanking sequences: do we need information about environment around the SNP? How much does the immediate flanking environment mediate transcription? Or are we focusing singularly on rs4988235 because it is so far away from the promoter sequence?
- Lewinsky paper about the size of the enhancer; verification?
- make a better map of lactase persistence frequencies; heat map is not the right way to go about it, because that manufactures data that we don't have
- one where the size of the circles measures population, one where the size of the circle is the "n" that is sampled
GIS software: what is the input format? do we need longitude/latitude or are political boundaries enough?
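One possible answer to the input-format question, as a sketch: a plain CSV with longitude/latitude columns, which GIS tools can generally load as a point layer. The rows below are invented placeholders (site names, coordinates, frequencies, and sample sizes are not real data), and the column names are my own choice. The radius is scaled by the square root of n so that circle area, not radius, tracks the number sampled.

```python
import csv
import io
import math


def symbol_radius(n, scale=1.0):
    """Radius chosen so that circle area is proportional to n."""
    return scale * math.sqrt(n)


# Placeholder rows: labels, coordinates, frequencies, and n are invented.
rows = [
    {"label": "site_1", "lon": -111.9, "lat": 40.8, "t_freq": 0.5, "n": 100},
    {"label": "site_2", "lon": -95.4, "lat": 29.8, "t_freq": 0.1, "n": 25},
]


def to_csv(rows):
    """Serialize rows, plus a computed radius column, to CSV text."""
    buffer = io.StringIO()
    fields = ["label", "lon", "lat", "t_freq", "n", "radius"]
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "radius": round(symbol_radius(row["n"]), 2)})
    return buffer.getvalue()


print(to_csv(rows))
```

Swapping `n` for a population figure in `symbol_radius` would give the other map variant (circle size measuring population rather than sample size).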
Fall 2020 Week 12 | November 19, 2020
Notes About QGIS
- can potentially be a very powerful tool to use
- I was able to download it from the internet (easy) and begin to get it going (hard — it took me about 30 minutes and a lot of Google searching to figure out input formats alone); found a lot of great tutorials, but they are long and complicated (it is a long and complicated software to learn to use, as most powerful software applications are)
- best tutorial collection can be found here but the important question is, is this what we want to be doing? if it is, I will throw myself into it and become an expert as best I can, but I don't want to do that if it isn't the best or most productive use of my time
- lots of groups and research studies choose to use ArcGIS because of professional support, fewer bugs (supposedly), cleaner interface, and generally more sophisticated functioning; huge drawback is obviously that it is expensive while QGIS is free
Notes from "The Unbearable Whiteness of Milk"
- "The USDA’s efforts to reduce the high-fat milk surplus by selling it to fast food consumers impose health costs on Americans generally, but disproportionately harm low-income African Americans and Latina/os who live in urban centers dominated by fast food restaurants."
- prime example of food oppression: an institutional, systemic, food-related action or policy that physically debilitates a socially subordinated group
- food oppression in the United States is hard to grasp, because the concept of individual choice is so much easier to promote (systemic change = way harder than the personal change that is theoretically possible...anecdotally); influence of neoliberalism and capitalism oozing from every ounce of surplus cheese that gets piled on a Domino's pizza
- government subsidies to fast food restaurants, etc, make this even more complex!
- sketchy stats from this random guy's blog
Fall 2020 Week 15 | December 10, 2020
add say the SNP and the gene, maybe say that there are a lot of SNPs in this region ability to digest lactose into adulthood World-wide estimates suggest that lactase persistence occurs in 35% of the world's population, but these estimates are based on limited populations. Because milk consumption is promoted in dietary guidelines, we wondered what the actual genotypic frequencies were in the US. Finding reliable data for USA is difficult with just one published study with a small sample size. We undertook a review of public SNP data in databases to see what we could find. [results] Examined, x, y, z databases, limitations were.
While single nucleotide polymorphisms conferring lactase persistence are widely known, the frequencies of these SNPs are not, posing important questions for public health and health policy. Understanding the frequency of these SNPs will allow us to quantify the extent of lactase persistence in the United States, a number we believe to be lower than what the Department of Agriculture and associated entities suggest. Preliminary research suggests that individuals with a SNP conferring lactase persistence most commonly trace their ancestry to Northwestern Europe, and that other large populations worldwide do not express this genetic mutation. Given that the United States is a settler nation, encouraging milk consumption (through “Got Milk?” public health campaigns and official inclusion on the USDA MyPlate recommendations) and even mandating it (through free and reduced lunch programs across the country) can have serious public health implications if the majority of the population is unable to digest lactose.
Spring 2021 Week 1 | January 15, 2021
Abstract Progress January 2021
Lactase persistence (LP), the ability to digest lactose into adulthood, is a trait present in approximately 16 percent of the world’s population. While large-scale population studies are few, available data suggests that anywhere from 16-35% of the world’s population maintains the ability to digest lactose beyond childhood. Relatedly, it appears that the vast majority of the population is not able to metabolize lactose effectively, classifying them as effectively lactase non-persistent (LNP). LNP individuals often report gastrointestinal distress upon dairy consumption, causing them to make dietary choices that, logically, limit their discomfort. However, the vast majority of current research on the subject is devoted to the ways in which LNP individuals can consume dairy products anyway, despite self-reporting noteworthy, uncomfortable physical symptoms. Literature review of this topic suggests a possible cause: the number of individuals, both in the United States and worldwide, who are LP, is both unknown and vastly overestimated.
LP has been associated with a few possible causes, but seems to be most clearly connected to a variant on chromosome 2. Approximately twenty single nucleotide polymorphisms have been linked to LP in this region. However, the one with the most molecular experimentation and evidence behind it lies 13,910 bases upstream from the LCT gene (SNP ID rs4988235). Here, a human C->T intron variant has been correlated with the ability to digest lactose into adulthood, with the T allele conferring LP. (Genotypes CT and TT appear to confer LP at relatively similar rates, while genotype CC remains LNP.) However, the crucial limiting factor in all of the current research is simply the lack of data. Public access data for rs4988235 compiled in Ensembl suggests the following frequencies of the T allele: 3 percent in African populations, 22 percent in American populations, 0 percent in East Asian populations, 51 percent in European populations, and 11 percent in South Asian populations. This data is undoubtedly a vast oversimplification.
Our work seeks to address the limited knowledge of allele frequencies correlated with lactase persistence
Spring 2021 Week 2 | January 22, 2021
Abstract Draft January 21, 2021
Lactase persistence (LP), the ability to digest lactose into adulthood, is a trait present in at least 16 percent of the world’s population. While large-scale population studies are few, available data suggests that anywhere from 16-35% of the world’s population maintains the ability to digest lactose beyond childhood. Relatedly, it appears that the vast majority of the population is not able to metabolize lactose effectively, classifying them as lactase non-persistent (LNP). LNP individuals often report gastrointestinal distress upon dairy consumption, causing them to make dietary choices that, logically, limit their discomfort. However, the vast majority of current research on the subject is devoted to the ways in which LNP individuals can consume dairy products anyway, despite self-reporting noteworthy, uncomfortable physical symptoms. Literature review of this topic suggests a possible cause: the number of individuals, both in the United States and worldwide, who are LP, is both unknown and vastly overestimated.
Lactase, the enzyme required to digest lactose (the primary carbohydrate in cow’s milk), is encoded by the LCT gene and dominantly expressed in the small intestine. In most individuals, lactase levels are high in infancy and childhood, but drop significantly after weaning due to decreased dietary lactose (beginning the transition into LNP). But lactase levels remain high in some individuals, allowing them to metabolize milk sugars beyond weaning (maintaining LP). LP has been associated with a few possible causes, but is most clearly connected to a variant on chromosome 2. Approximately twenty single nucleotide polymorphisms (SNPs) have been linked to LP in the enhancer region of the LCT gene. However, the one with the most molecular evidence behind it lies 13,910 bases upstream from the LCT gene. Here, a human C->T intron variant (SNP ID rs4988235) has been correlated with the ability to digest lactose into adulthood, with the T allele conferring LP. (It is thought that genotypes CT and TT confer LP at relatively similar rates, while genotype CC confers LNP.) However, the crucial limiting factor in all of the current research is simply the lack of data. Public access data for rs4988235 compiled in Ensembl suggests the following frequencies of the T allele: 3 percent in African populations, 22 percent in American populations, 0 percent in East Asian populations, 51 percent in European populations, and 11 percent in South Asian populations. This data is undoubtedly a vast oversimplification.
Our work seeks to address the limited knowledge of allele frequencies correlated with LP, specifically in the United States. In a cohort of 227 subjects, Chin et al., 2019 attempted to correlate rs4988235 genotype and ethnicity with reported dairy consumption. They determined that the LP genotype together with ethnicity influenced dairy intake, but that the LP genotype alone was not significant enough to predict dairy intake. Most genetic research focuses on the link between a specific genotype and an expressed trait; in this exploration, the link has been established between rs4988235 and LP. Our research addresses a question that GWAS data cannot answer: correlation of frequency and individual behavior. An individual may have the LP genotype yet avoid dairy products anyway. Similarly, an individual may have the LNP genotype yet continue to consume dairy products. Data from the Personal Genomes Project was initially explored as a method for linking LP to rs4988235 genotypes, but the crowd-sourced nature of that data makes finding genotype data, dairy intake/diet information, and ethnicity for the same individual difficult. So, analysis of available GWAS data is the first step. While an individual’s genotype alone cannot predict their dairy intake, it will allow a baseline understanding of likely consumption.
Understanding the frequency of these SNPs will allow us to quantify the extent of lactase persistence in the United States, a number we believe to be lower than what the Department of Agriculture and associated entities suggest. Preliminary research suggests that individuals with a SNP conferring lactase persistence most commonly trace their ancestry to Northwestern Europe, and that other large populations worldwide do not express this genetic mutation. Given that the United States is a settler nation, encouraging milk consumption (through “Got Milk?” public health campaigns and official inclusion on the USDA MyPlate recommendations) and even mandating it (through free and reduced lunch programs across the country) can have serious public health implications if the majority of the population is unable to digest lactose.
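Working note on the Ensembl T allele frequencies quoted in the draft: an allele frequency is not the same as the fraction of people who are LP. If CT and TT both confer LP, a rough conversion is possible, but only under the assumption of Hardy-Weinberg proportions, which is my assumption here and may well not hold for these pooled, mixed populations. Under that assumption the expected LP fraction is 1 - (1 - p)^2, where p is the T allele frequency.

```python
def expected_lp_fraction(t_freq):
    """Expected fraction with at least one T allele (CT or TT).

    Assumes Hardy-Weinberg proportions, so the CC fraction is (1 - p)^2.
    """
    if not 0.0 <= t_freq <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    c_freq = 1.0 - t_freq
    return 1.0 - c_freq ** 2


# T allele frequencies as quoted from Ensembl in the abstract draft.
ensembl_t_freq = {
    "African": 0.03,
    "American": 0.22,
    "East Asian": 0.00,
    "European": 0.51,
    "South Asian": 0.11,
}

for population, p in ensembl_t_freq.items():
    print(f"{population}: T = {p:.2f}, expected LP fraction = {expected_lp_fraction(p):.2f}")
```

This is only a back-of-the-envelope check on how the quoted frequencies relate to the 16-35% LP estimates, not a result.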
next two weeks
- review SNP table from Susanne & Alyssa (check over the papers that mention the SNPs, see if there are any updates, start putting that into a more digestible format that could possibly be used for a poster)
- pull together all of the database exploration: okay we looked at all of these databases, here's why we were interested, this is the kind of data that they present, this is what we found, this is how we might use this data (possible figure for a poster in this information too)
- actually choose a database and run with it!! what info are we going to take from it, what are we going to ignore, etc
things for later
- talk a little bit about the way that the frequency data is presented and how confusing that is given the question we are trying to answer (eg the thing from Ensembl that characterized those Utahns as being European and didn't put them into the "Americas" section, talking about why that is done, why that's a problem for the question, etc)
- editing the PPT, making better maps of the frequency data once we actually have it