# Jasieniuk:Implementation of a method to calculate microsatellite genotype distance in polyploids

## Contents

## Update

**Note, June 2010:** All of this functionality is now available in the R package polysat.

**Note, April 2010:** I'm working on developing an R package with these and other tools for polyploid microsatellite data. I have restructured the functions somewhat so that they will be more versatile (allow for other measures of distance, and import the genotypes as an R object so that other things can be done with them). I've written a function that will read the data more directly from the format produced by ABI GeneMapper and one that can process multiple loci at once and calculate the mean distances. Missing data is also allowed. There is also a function to export data to be read by Structure 2.3.3.

The functions are in a text file: Media:Polysatfun100409.R.txt

Documentation of the functions: Media:Polysatfun100409.Rd.txt

## Overview

This is an implementation of the method of Bruvo *et al.* (2004). It calculates a pairwise measure of genetic distance between individuals at a single microsatellite locus. The allele copy number does not need to be known and the individuals do not even need to be the same ploidy level. Its primary advantage over band-sharing measures of genetic distance is that it allows for mutation, such that alleles aren't considered to be qualitatively different, but instead the difference in repeat number between two alleles is taken into account. The paper offers two options for dealing with virtual alleles (when one individual has more alleles than the other). My implementation uses the simpler option and sets virtual alleles as equal to infinity.

This is my first adventure into the R programming environment, so I apologize for any programming faux pas.

## Arguments and Output

The function `genotype.distance`

takes as arguments two single locus microsatellite genotypes. In the code these are `genotype1`

and `genotype2`

. Each genotype should be a vector containing all of the alleles for a given individual at a given locus. The vectors do not need to be the same length, and should only be as long as needed to contain the alleles. The alleles should be expressed in repeat number rather than nucleotides (DNA fragment length). Because it is the difference in allele size that is measured, you can simply divide the nucleotide length by the repeat size (e.g. divide by two if you have dinucleotide repeats).

The output is a single number that is the distance between genotypes. In the paper, this is *d _{l}*.

## Code for genotype.distance

`genotype.distance <- function(genotype1, genotype2) {`

`require(combinat) # for permn function`

`if(identical(genotype1, genotype2)) {dist <- 0} else { if(length(genotype1) >= length(genotype2)) {genotypeL <- genotype1; genotypeS <- genotype2} else {genotypeL <- genotype2; genotypeS <- genotype1}`

`# if genotypes are identical, just return a distance of zero without doing the rest of the calculation`

`# whichever genotype has more alleles, make this genotypeL (long) and the other genotypeS (short)`

`kl <- length(genotypeL) # sets the ploidy level for this genotype comparison`

`ks <- length(genotypeS) # number of alleles in the shorter genotype`

`allele.distances <- array(0 , c(kl,ks)) # Create an empty matrix to contain the distances between alleles`

`for(n in 1:kl) { for(m in 1:ks) {allele.distances[n,m] <- genotypeL[n] - genotypeS[m]}} # fills the array with the differences in allele length`

`geometric.distances <- array(1 - 2^-abs(allele.distances) , c(kl,ks)) # geometric transformation based on mutation probabilities`

`#Next, find the minimum distance sum among all permutations`

`column <- 1:ks # an index of all columns (genotypeS alleles)`

`row <- 1:kl # an index of all rows (genotypeL alleles)`

`combinations <- combn(row, ks, FUN = NULL, simplify=FALSE) # all combinations of alleles in genotypeL that can be matched to non-virtual alleles in genotypeS`

`permutations <- permn(ks) # all possible orders that alleles within these combinations can go in`

`mindist <- Inf # this variable will store the minimum sum encountered so far.`

`for(i in 1:length(combinations)) { # the loop to go through every possible sum of compatible allele comparisons`

`rowcomb <- combinations[[i]] # choose one combination of rows for this round`

`for(l in 1:length(permutations)){ # go through all orders of this combinations of rows`

`sum <- 0 # this is si, the sum of allele comparisons`

`for(j in 1:ks) {sum <- sum + geometric.distances[rowcomb[permutations[[l]][j]],column[j]]} # the loop to calculate the sum for this permutation`

`if(sum < mindist) {mindist <- sum} # is this the minimum sum found so far?`

`}}`

`dist`

`}`

## Processing a file of genotypes

### Explanation, arguments, and output

The function defined below utilizes the `genotype.distance`

function and produces a symmetrical matrix of genotype distances, given a file of individual genotypes at a given locus. It takes three arguments:

`infile`

is the filepath of a tab-delimited text file containing the genotypes. The file should contain a header row. The first column should be labels for the individuals/samples, the second column should contain the number of alleles that the individual has at this locus, and columns 3-22 should be the alleles, expressed as the DNA fragment length. If, as is likely, a genotype contains less than 20 alleles, the allele fields on the right can be left blank or filled with zeros.`outfile`

is the filepath where you want your output to be written. This will be a tab-delimited text file containing a labeled matrix of genotype distances. (This matrix is also returned by the function.)`repeat.length`

is the length of the microsatellite repeat at this locus, for example`2`

if this is a dinucleotide repeat.

`entire.locus`

can be run individually on several different loci, then the matrices can be averaged.

I work with a dodecaploid organism, so the calculations can take awhile. For this reason, the function prints each pair of individuals as it works so that the user can monitor the progress.

### Code for entire.locus

`entire.locus <- function(infile,outfile,repeat.length){`

`usatdata<-read.table(infile, header=TRUE, sep = "\t", fill=TRUE)`

`labels <- row.names(usatdata)`

`genotypes <- list(0) #this will contain the genotype vectors`

`count <- dim(usatdata)[1] #the number of genotypes being read`

`for(i in 1:count){ #extract genotype vectors from the data table`

`genotypes[[i]] <- c(usatdata[i,2],usatdata[i,3],usatdata[i,4],usatdata[i,5],usatdata[i,6],usatdata[i,7],usatdata[i,8],usatdata[i,9],usatdata[i,10],usatdata[i,11],usatdata[i,12],usatdata[i,13],usatdata[i,14],usatdata[i,15],usatdata[i,16],usatdata[i,17],usatdata[i,18],usatdata[i,19],usatdata[i,20],usatdata[i,21])/repeat.length`

`length(genotypes[[i]]) <- usatdata[i,1]`

`}`

`distances<- array(0,c(count,count)) #array for containing genotype distances`

`for(m in 1:count){for(n in m:count){thisdistance <-genotype.distance(genotypes[[m]],genotypes[[n]]) #fill array with genotype distances`

`distances[m,n] <- thisdistance`

`distances[n,m] <- thisdistance`

`print(c(labels[m],labels[n]))}} #so the user can monitor the progress of the computation`

`rownames(distances) <- labels #add labels to the matrix of distances that was just produced`

`colnames(distances) <- labels`

`write.table(distances,file=outfile,sep="\t")`

`distances`

`}`

## Processing time

The amount of time that it takes `genotype.distance`

to work is a function of the factorial of the number of alleles that the two genotypes have. If two individuals with four alleles each are being compared, there are only 24 sets of compatible allele comparisons. If the two individuals each have eight alleles, there are 8! or 40,320 sets of comparisons, and if the individuals each have twelve alleles there 479,001,600 sets of comparisons. I wrote the function to optimize for clonal reproduction and large genotypes, simply returning `0`

without doing any calculation if the two genotypes are identical, and skipping all of the different ways that alleles can be matched to virtual alleles if one genotype has two or more virtual alleles (since the distance between any allele and a virtual allele is 1).

On my personal laptop (Windows Vista, 2 GHz processor, 4 Mb RAM), processing time became an issue when the allele distance grid was 10x10 (two genotypes of 10 alleles each) or larger. 9x9 grids took a few minutes to process, but anything larger could take over an hour. For my own use (not included in the above code), I added a clause in `genotype.distance`

to return `NA`

if both genotypes had more than nine alleles. I then calculated these allele distance grids in Excel and manually chose sets of compatible allele comparisons, since in these cases it was faster to visually pick out the short distances and work from there than it was to have the computer calculate every possible combination.

If your organism is tetraploid, processing time should not be an issue.

## References

Bruvo R, Michiels NK, D'Souza TG, Schulenberg H (2004). "A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level." Molecular Ecology 13:2101-2106.