Jasieniuk:Implementation of a method to calculate microsatellite genotype distance in polyploids

Update
Note, June 2010: All of this functionality is now available in the R package polysat.

Note, April 2010: I'm working on developing an R package with these and other tools for polyploid microsatellite data. I have restructured the functions somewhat so that they will be more versatile (allow for other measures of distance, and import the genotypes as an R object so that other things can be done with them). I've written a function that will read the data more directly from the format produced by ABI GeneMapper and one that can process multiple loci at once and calculate the mean distances. Missing data is also allowed. There is also a function to export data to be read by Structure 2.3.3.

The functions are in a text file: [[Media:Polysatfun100409.R.txt]]

Documentation of the functions: [[Media:Polysatfun100409.Rd.txt]]

Overview
This is an implementation of the method of Bruvo et al. (2004). It calculates a pairwise measure of genetic distance between individuals at a single microsatellite locus. The allele copy number does not need to be known and the individuals do not even need to be the same ploidy level. Its primary advantage over band-sharing measures of genetic distance is that it allows for mutation, such that alleles aren't considered to be qualitatively different, but instead the difference in repeat number between two alleles is taken into account. The paper offers two options for dealing with virtual alleles (when one individual has more alleles than the other). My implementation uses the simpler option and sets virtual alleles as equal to infinity.

This is my first adventure into the R programming environment, so I apologize for any programming faux pas.

Arguments and Output
The function  takes as arguments two single locus microsatellite genotypes. In the code these are  and. Each genotype should be a vector containing all of the alleles for a given individual at a given locus. The vectors do not need to be the same length, and should only be as long as needed to contain the alleles. The alleles should be expressed in repeat number rather than nucleotides (DNA fragment length). Because it is the difference in allele size that is measured, you can simply divide the nucleotide length by the repeat size (e.g. divide by two if you have dinucleotide repeats).

The output is a single number that is the distance between genotypes. In the paper, this is dl.

Explanation, arguments, and output
The function defined below utilizes the  function and produces a symmetrical matrix of genotype distances, given a file of individual genotypes at a given locus. It takes three arguments:


 * is the filepath of a tab-delimited text file containing the genotypes. The file should contain a header row.  The first column should be labels for the individuals/samples, the second column should contain the number of alleles that the individual has at this locus, and columns 3-22 should be the alleles, expressed as the DNA fragment length.  If, as is likely, a genotype contains less than 20 alleles, the allele fields on the right can be left blank or filled with zeros.
 * is the filepath where you want your output to be written. This will be a tab-delimited text file containing a labeled matrix of genotype distances.  (This matrix is also returned by the function.)
 * is the length of the microsatellite repeat at this locus, for example  if this is a dinucleotide repeat.

can be run individually on several different loci, then the matrices can be averaged.

I work with a dodecaploid organism, so the calculations can take awhile. For this reason, the function prints each pair of individuals as it works so that the user can monitor the progress.

Processing time
The amount of time that it takes  to work is a function of the factorial of the number of alleles that the two genotypes have. If two individuals with four alleles each are being compared, there are only 24 sets of compatible allele comparisons. If the two individuals each have eight alleles, there are 8! or 40,320 sets of comparisons, and if the individuals each have twelve alleles there 479,001,600 sets of comparisons. I wrote the function to optimize for clonal reproduction and large genotypes, simply returning  without doing any calculation if the two genotypes are identical, and skipping all of the different ways that alleles can be matched to virtual alleles if one genotype has two or more virtual alleles (since the distance between any allele and a virtual allele is 1).

On my personal laptop (Windows Vista, 2 GHz processor, 4 Mb RAM), processing time became an issue when the allele distance grid was 10x10 (two genotypes of 10 alleles each) or larger. 9x9 grids took a few minutes to process, but anything larger could take over an hour. For my own use (not included in the above code), I added a clause in  to return   if both genotypes had more than nine alleles. I then calculated these allele distance grids in Excel and manually chose sets of compatible allele comparisons, since in these cases it was faster to visually pick out the short distances and work from there than it was to have the computer calculate every possible combination.

If your organism is tetraploid, processing time should not be an issue.

Contact
User:Lindsay_V._Clark