- large deletions 1bp - 10kb
- medium size insertions 1-20pb
(citing the paper) Suppose we define a database S of genome sequence as
and the pattern
The output of the algorithm consists of all substrings (and their locations) starting from the leftmost base of P that appear exactly once in S.
In the first step, we scan the whole database for ‘A’, the first base of pattern P.
The locations of ‘A’
are stored in a projected database of ‘A’. In the second step, we look for ‘T’ as it is the second base in pattern P at the right side of ‘A's identified previously. The projected database for ‘AT’ then only contains two locations (‘ATCAAGTATGCTTAGC’). When we search for the third base ‘G’ of pattern P at the right sides of ‘AT’, we found that ‘ATG’ appears exactly once in the database S (‘ATCAAGTATGCTTAGC’). Thus we know that ‘ATG’ is the minimum unique substring of pattern P in the database S. After we examine the fourth and fifth base of pattern P, we notice that ‘ATGC’ is also unique in the database S but ‘ATGCA’ isn't. In this case we know that ‘ATGC’ is the maximum unique substring of pattern P in this particular database S.