Jasieniuk:Implementation of a method to calculate microsatellite genotype distance in polyploids

From OpenWetWare
Jump to navigationJump to search


Note, June 2010: All of this functionality is now available in the R package polysat.

Note, April 2010: I'm working on developing an R package with these and other tools for polyploid microsatellite data. I have restructured the functions somewhat so that they will be more versatile (allow for other measures of distance, and import the genotypes as an R object so that other things can be done with them). I've written a function that will read the data more directly from the format produced by ABI GeneMapper and one that can process multiple loci at once and calculate the mean distances. Missing data is also allowed. There is also a function to export data to be read by Structure 2.3.3.

The functions are in a text file: Media:Polysatfun100409.R.txt

Documentation of the functions: Media:Polysatfun100409.Rd.txt


This is an implementation of the method of Bruvo et al. (2004). It calculates a pairwise measure of genetic distance between individuals at a single microsatellite locus. The allele copy number does not need to be known and the individuals do not even need to be the same ploidy level. Its primary advantage over band-sharing measures of genetic distance is that it allows for mutation, such that alleles aren't considered to be qualitatively different, but instead the difference in repeat number between two alleles is taken into account. The paper offers two options for dealing with virtual alleles (when one individual has more alleles than the other). My implementation uses the simpler option and sets virtual alleles as equal to infinity.

This is my first adventure into the R programming environment, so I apologize for any programming faux pas.

Arguments and Output

The function genotype.distance takes as arguments two single locus microsatellite genotypes. In the code these are genotype1 and genotype2. Each genotype should be a vector containing all of the alleles for a given individual at a given locus. The vectors do not need to be the same length, and should only be as long as needed to contain the alleles. The alleles should be expressed in repeat number rather than nucleotides (DNA fragment length). Because it is the difference in allele size that is measured, you can simply divide the nucleotide length by the repeat size (e.g. divide by two if you have dinucleotide repeats).

The output is a single number that is the distance between genotypes. In the paper, this is dl.

Code for genotype.distance

genotype.distance <- function(genotype1, genotype2) {

require(combinat) # for permn function

if(identical(genotype1, genotype2)) {dist <- 0} else { if(length(genotype1) >= length(genotype2)) {genotypeL <- genotype1; genotypeS <- genotype2} else {genotypeL <- genotype2; genotypeS <- genotype1}

# if genotypes are identical, just return a distance of zero without doing the rest of the calculation

# whichever genotype has more alleles, make this genotypeL (long) and the other genotypeS (short)

kl <- length(genotypeL) # sets the ploidy level for this genotype comparison

ks <- length(genotypeS) # number of alleles in the shorter genotype

allele.distances <- array(0 , c(kl,ks)) # Create an empty matrix to contain the distances between alleles

for(n in 1:kl) { for(m in 1:ks) {allele.distances[n,m] <- genotypeL[n] - genotypeS[m]}} # fills the array with the differences in allele length

geometric.distances <- array(1 - 2^-abs(allele.distances) , c(kl,ks)) # geometric transformation based on mutation probabilities

#Next, find the minimum distance sum among all permutations

column <- 1:ks # an index of all columns (genotypeS alleles)

row <- 1:kl # an index of all rows (genotypeL alleles)

combinations <- combn(row, ks, FUN = NULL, simplify=FALSE) # all combinations of alleles in genotypeL that can be matched to non-virtual alleles in genotypeS

permutations <- permn(ks) # all possible orders that alleles within these combinations can go in

mindist <- Inf # this variable will store the minimum sum encountered so far.

for(i in 1:length(combinations)) { # the loop to go through every possible sum of compatible allele comparisons

rowcomb <- combinations[[i]] # choose one combination of rows for this round

for(l in 1:length(permutations)){ # go through all orders of this combinations of rows

sum <- 0 # this is si, the sum of allele comparisons

for(j in 1:ks) {sum <- sum + geometric.distances[rowcomb[permutations[[l]][j]],column[j]]} # the loop to calculate the sum for this permutation

if(sum < mindist) {mindist <- sum} # is this the minimum sum found so far?




Processing a file of genotypes

Explanation, arguments, and output

The function defined below utilizes the genotype.distance function and produces a symmetrical matrix of genotype distances, given a file of individual genotypes at a given locus. It takes three arguments:

  • infile is the filepath of a tab-delimited text file containing the genotypes. The file should contain a header row. The first column should be labels for the individuals/samples, the second column should contain the number of alleles that the individual has at this locus, and columns 3-22 should be the alleles, expressed as the DNA fragment length. If, as is likely, a genotype contains less than 20 alleles, the allele fields on the right can be left blank or filled with zeros.
  • outfile is the filepath where you want your output to be written. This will be a tab-delimited text file containing a labeled matrix of genotype distances. (This matrix is also returned by the function.)
  • repeat.length is the length of the microsatellite repeat at this locus, for example 2 if this is a dinucleotide repeat.

entire.locus can be run individually on several different loci, then the matrices can be averaged.

I work with a dodecaploid organism, so the calculations can take awhile. For this reason, the function prints each pair of individuals as it works so that the user can monitor the progress.

Code for entire.locus

entire.locus <- function(infile,outfile,repeat.length){

usatdata<-read.table(infile, header=TRUE, sep = "\t", fill=TRUE)

labels <- row.names(usatdata)

genotypes <- list(0) #this will contain the genotype vectors

count <- dim(usatdata)[1] #the number of genotypes being read

for(i in 1:count){ #extract genotype vectors from the data table

genotypes[[i]] <- c(usatdata[i,2],usatdata[i,3],usatdata[i,4],usatdata[i,5],usatdata[i,6],usatdata[i,7],usatdata[i,8],usatdata[i,9],usatdata[i,10],usatdata[i,11],usatdata[i,12],usatdata[i,13],usatdata[i,14],usatdata[i,15],usatdata[i,16],usatdata[i,17],usatdata[i,18],usatdata[i,19],usatdata[i,20],usatdata[i,21])/repeat.length

length(genotypes[[i]]) <- usatdata[i,1]


distances<- array(0,c(count,count)) #array for containing genotype distances

for(m in 1:count){for(n in m:count){thisdistance <-genotype.distance(genotypes[[m]],genotypes[[n]]) #fill array with genotype distances

distances[m,n] <- thisdistance

distances[n,m] <- thisdistance

print(c(labels[m],labels[n]))}} #so the user can monitor the progress of the computation

rownames(distances) <- labels #add labels to the matrix of distances that was just produced

colnames(distances) <- labels




Processing time

The amount of time that it takes genotype.distance to work is a function of the factorial of the number of alleles that the two genotypes have. If two individuals with four alleles each are being compared, there are only 24 sets of compatible allele comparisons. If the two individuals each have eight alleles, there are 8! or 40,320 sets of comparisons, and if the individuals each have twelve alleles there 479,001,600 sets of comparisons. I wrote the function to optimize for clonal reproduction and large genotypes, simply returning 0 without doing any calculation if the two genotypes are identical, and skipping all of the different ways that alleles can be matched to virtual alleles if one genotype has two or more virtual alleles (since the distance between any allele and a virtual allele is 1).

On my personal laptop (Windows Vista, 2 GHz processor, 4 Mb RAM), processing time became an issue when the allele distance grid was 10x10 (two genotypes of 10 alleles each) or larger. 9x9 grids took a few minutes to process, but anything larger could take over an hour. For my own use (not included in the above code), I added a clause in genotype.distance to return NA if both genotypes had more than nine alleles. I then calculated these allele distance grids in Excel and manually chose sets of compatible allele comparisons, since in these cases it was faster to visually pick out the short distances and work from there than it was to have the computer calculate every possible combination.

If your organism is tetraploid, processing time should not be an issue.


Bruvo R, Michiels NK, D'Souza TG, Schulenberg H (2004). "A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level." Molecular Ecology 13:2101-2106.