Jasieniuk:Implementation of a method to calculate microsatellite genotype distance in polyploids
Update
Note, June 2010: All of this functionality is now available in the R package polysat.
Note, April 2010: I'm working on developing an R package with these and other tools for polyploid microsatellite data. I have restructured the functions somewhat so that they will be more versatile (allow for other measures of distance, and import the genotypes as an R object so that other things can be done with them). I've written a function that will read the data more directly from the format produced by ABI GeneMapper and one that can process multiple loci at once and calculate the mean distances. Missing data is also allowed. There is also a function to export data to be read by Structure 2.3.3.
The functions are in a text file: Media:Polysatfun100409.R.txt
Documentation of the functions: Media:Polysatfun100409.Rd.txt
Overview
This is an implementation of the method of Bruvo et al. (2004). It calculates a pairwise measure of genetic distance between individuals at a single microsatellite locus. The allele copy number does not need to be known and the individuals do not even need to be the same ploidy level. Its primary advantage over band-sharing measures of genetic distance is that it allows for mutation, such that alleles aren't considered to be qualitatively different, but instead the difference in repeat number between two alleles is taken into account. The paper offers two options for dealing with virtual alleles (when one individual has more alleles than the other). My implementation uses the simpler option and sets virtual alleles as equal to infinity.
This is my first adventure into the R programming environment, so I apologize for any programming faux pas.
Arguments and Output
The function genotype.distance
takes as arguments two single locus microsatellite genotypes. In the code these are genotype1
and genotype2
. Each genotype should be a vector containing all of the alleles for a given individual at a given locus. The vectors do not need to be the same length, and should only be as long as needed to contain the alleles. The alleles should be expressed in repeat number rather than nucleotides (DNA fragment length). Because it is the difference in allele size that is measured, you can simply divide the nucleotide length by the repeat size (e.g. divide by two if you have dinucleotide repeats).
The output is a single number that is the distance between genotypes. In the paper, this is dl.
Code for genotype.distance
genotype.distance <- function(genotype1, genotype2) {
require(combinat) # for permn function
if(identical(genotype1, genotype2)) {dist <- 0} else { if(length(genotype1) >= length(genotype2)) {genotypeL <- genotype1; genotypeS <- genotype2} else {genotypeL <- genotype2; genotypeS <- genotype1}
# if genotypes are identical, just return a distance of zero without doing the rest of the calculation
# whichever genotype has more alleles, make this genotypeL (long) and the other genotypeS (short)
kl <- length(genotypeL) # sets the ploidy level for this genotype comparison
ks <- length(genotypeS) # number of alleles in the shorter genotype
allele.distances <- array(0 , c(kl,ks)) # Create an empty matrix to contain the distances between alleles
for(n in 1:kl) { for(m in 1:ks) {allele.distances[n,m] <- genotypeL[n] - genotypeS[m]}} # fills the array with the differences in allele length
geometric.distances <- array(1 - 2^-abs(allele.distances) , c(kl,ks)) # geometric transformation based on mutation probabilities
#Next, find the minimum distance sum among all permutations
column <- 1:ks # an index of all columns (genotypeS alleles)
row <- 1:kl # an index of all rows (genotypeL alleles)
combinations <- combn(row, ks, FUN = NULL, simplify=FALSE) # all combinations of alleles in genotypeL that can be matched to non-virtual alleles in genotypeS
permutations <- permn(ks) # all possible orders that alleles within these combinations can go in
mindist <- Inf # this variable will store the minimum sum encountered so far.
for(i in 1:length(combinations)) { # the loop to go through every possible sum of compatible allele comparisons
rowcomb <- combinations[[i]] # choose one combination of rows for this round
for(l in 1:length(permutations)){ # go through all orders of this combinations of rows
sum <- 0 # this is si, the sum of allele comparisons
for(j in 1:ks) {sum <- sum + geometric.distances[rowcomb[permutations[[l]][j]],column[j]]} # the loop to calculate the sum for this permutation
if(sum < mindist) {mindist <- sum} # is this the minimum sum found so far?
}}
dist
}
Processing a file of genotypes
Explanation, arguments, and output
The function defined below utilizes the genotype.distance
function and produces a symmetrical matrix of genotype distances, given a file of individual genotypes at a given locus. It takes three arguments:
infile
is the filepath of a tab-delimited text file containing the genotypes. The file should contain a header row. The first column should be labels for the individuals/samples, the second column should contain the number of alleles that the individual has at this locus, and columns 3-22 should be the alleles, expressed as the DNA fragment length. If, as is likely, a genotype contains less than 20 alleles, the allele fields on the right can be left blank or filled with zeros.outfile
is the filepath where you want your output to be written. This will be a tab-delimited text file containing a labeled matrix of genotype distances. (This matrix is also returned by the function.)repeat.length
is the length of the microsatellite repeat at this locus, for example2
if this is a dinucleotide repeat.
entire.locus
can be run individually on several different loci, then the matrices can be averaged.
I work with a dodecaploid organism, so the calculations can take awhile. For this reason, the function prints each pair of individuals as it works so that the user can monitor the progress.
Code for entire.locus
entire.locus <- function(infile,outfile,repeat.length){
usatdata<-read.table(infile, header=TRUE, sep = "\t", fill=TRUE)
labels <- row.names(usatdata)
genotypes <- list(0) #this will contain the genotype vectors
count <- dim(usatdata)[1] #the number of genotypes being read
for(i in 1:count){ #extract genotype vectors from the data table
genotypes[[i]] <- c(usatdata[i,2],usatdata[i,3],usatdata[i,4],usatdata[i,5],usatdata[i,6],usatdata[i,7],usatdata[i,8],usatdata[i,9],usatdata[i,10],usatdata[i,11],usatdata[i,12],usatdata[i,13],usatdata[i,14],usatdata[i,15],usatdata[i,16],usatdata[i,17],usatdata[i,18],usatdata[i,19],usatdata[i,20],usatdata[i,21])/repeat.length
length(genotypes[[i]]) <- usatdata[i,1]
}
distances<- array(0,c(count,count)) #array for containing genotype distances
for(m in 1:count){for(n in m:count){thisdistance <-genotype.distance(genotypes[[m]],genotypes[[n]]) #fill array with genotype distances
distances[m,n] <- thisdistance
distances[n,m] <- thisdistance
print(c(labels[m],labels[n]))}} #so the user can monitor the progress of the computation
rownames(distances) <- labels #add labels to the matrix of distances that was just produced
colnames(distances) <- labels
write.table(distances,file=outfile,sep="\t")
distances
}
Processing time
The amount of time that it takes genotype.distance
to work is a function of the factorial of the number of alleles that the two genotypes have. If two individuals with four alleles each are being compared, there are only 24 sets of compatible allele comparisons. If the two individuals each have eight alleles, there are 8! or 40,320 sets of comparisons, and if the individuals each have twelve alleles there 479,001,600 sets of comparisons. I wrote the function to optimize for clonal reproduction and large genotypes, simply returning 0
without doing any calculation if the two genotypes are identical, and skipping all of the different ways that alleles can be matched to virtual alleles if one genotype has two or more virtual alleles (since the distance between any allele and a virtual allele is 1).
On my personal laptop (Windows Vista, 2 GHz processor, 4 Mb RAM), processing time became an issue when the allele distance grid was 10x10 (two genotypes of 10 alleles each) or larger. 9x9 grids took a few minutes to process, but anything larger could take over an hour. For my own use (not included in the above code), I added a clause in genotype.distance
to return NA
if both genotypes had more than nine alleles. I then calculated these allele distance grids in Excel and manually chose sets of compatible allele comparisons, since in these cases it was faster to visually pick out the short distances and work from there than it was to have the computer calculate every possible combination.
If your organism is tetraploid, processing time should not be an issue.
References
Bruvo R, Michiels NK, D'Souza TG, Schulenberg H (2004). "A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level." Molecular Ecology 13:2101-2106.