REPETITIVE STRUCTURES, MULTIPLETS, PERIODICITY ANALYSIS, SPACING ANALYSIS
REPETITIVE STRUCTURES.
Repeats are indicated for two alphabets: the 20-letter amino acid alphabet, and a reduced 11-letter alphabet in which the major hydrophobics LVIF, the charged residues KR and ED, the small residues AG, the hydroxyl group residues ST, the amide group residues NQ, and the aromatics YW are treated as combined letters. For each alphabet, three classes of repeats are distinguished: separated repeats, simple tandem repeats, and periodic repeats. The separated repeats are largely non-overlapping. They are displayed in groups of matching blocks (exceeding a given core block length of contiguous exact matches) and intervening spacer distances (which may be negative, signifying a partial overlap). The core block length for the amino acid alphabet is set to 4 for sequences up to 500 residues, to 5 for sequences between 500 and 2000 residues, and to 6 for longer sequences (the same values increased by 4 for the reduced alphabet). Simple tandem repeats are displayed in a similar layout, but separately. Sequence segments that are highly repetitive with relatively short repeats are displayed as periodic repeats.
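The alphabet reduction and the choice of core block length described above can be sketched as follows. This is only an illustration, not the program's own code: the function names are hypothetical, the treatment of the boundary lengths 500 and 2000 is an assumption (the text does not say on which side they fall), and the exact-match search simply lists repeated words of the core length without merging them into the maximal blocks shown in the output.

```python
from collections import defaultdict

# Reduced 11-letter alphabet; symbols follow the legend printed in part B.
REDUCED = {}
for members, symbol in [("LVIF", "i"), ("KR", "+"), ("ED", "-"), ("AG", "s"),
                        ("ST", "o"), ("NQ", "n"), ("YW", "a"),
                        ("P", "p"), ("H", "h"), ("M", "m"), ("C", "c")]:
    for aa in members:
        REDUCED[aa] = symbol


def reduce_sequence(seq):
    """Translate a 20-letter sequence into the reduced alphabet."""
    return "".join(REDUCED[aa] for aa in seq)


def core_block_length(n_residues, reduced=False):
    """Core block length by sequence length (boundary handling is assumed)."""
    if n_residues <= 500:
        core = 4
    elif n_residues <= 2000:
        core = 5
    else:
        core = 6
    return core + 4 if reduced else core


def repeated_cores(seq, k):
    """Map each word of length k that occurs more than once to its 1-based starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)
    return {word: pos for word, pos in positions.items() if len(pos) > 1}


print(reduce_sequence("VANGIF"))
print(core_block_length(1275), core_block_length(1275, reduced=True))
print(repeated_cores("VANGIFKKVANGIF", 5))
```

For a sequence of roughly 1275 residues, as in this report, the sketch gives core lengths of 5 and 9, matching the values printed in parts A and B.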
A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.
Repeat core block length: 5
Aligned matching blocks:
[ 110- 115] VANGIF
[ 171- 176] VANGIF
______________________________
[ 621- 625] NPGTS
[1107-1111] NPGTS
______________________________
[ 785- 789] TIETA
[1224-1228] TIETA
______________________________
[1073-1081] AATLTGTGL
[1166-1172] AAT__GTGL
B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.
(i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)
Repeat core block length: 9
MULTIPLETS.
Multiplets refer to homooligopeptides of any length (e.g., A2, Q7, etc.); altplets refer to reiterations of two different residues (e.g., RG, EAEAEA, etc.). The multiplet composition of the protein sequence is evaluated for both the amino acid and the charge alphabet. (High) aggregate altplet counts are evaluated only for the charge alphabet. The multiplet sequence is displayed whenever the total multiplet count of the sequence falls outside the expected range (i.e., beyond 3 standard deviations of the mean). Also printed are the histogram of the spacings between consecutive multiplets (differences between starting positions) and the clusters of multiplets. Multiplet clusters are determined in the same way as charge clusters: the binomial test is applied to a compressed sequence over the alphabet {M,S}, where M signifies a multiplet and S signifies a singlet; i.e., the amino acid sequence AADFFFGHRRT... is translated as MSMSSMS..., and the binomial cluster test is applied to the latter sequence. Multiplets and altplets of specific residue content that individually show an unusually high count are indicated, and the positions of all multiplets exceeding a minimum length of 5 residues are shown.
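The {M,S} compression and the spacing histogram described above can be sketched as follows. The function names are hypothetical and the binomial cluster test itself is not reproduced. The sketch counts only the differences between consecutive multiplet starts; the histograms printed in this report total one more than the multiplet count (94 for 93 multiplets, 20 for 19), which suggests that the program also counts the terminal gaps, so the totals here are not expected to match exactly.

```python
from itertools import groupby


def compress_multiplets(seq):
    """Write M for each run of two or more identical residues, S for each singlet."""
    return "".join("M" if len(list(run)) > 1 else "S" for _, run in groupby(seq))


def multiplet_starts(seq):
    """Return the 1-based starting position of every multiplet."""
    starts, pos = [], 1
    for _, run in groupby(seq):
        length = len(list(run))
        if length > 1:
            starts.append(pos)
        pos += length
    return starts


def spacing_histogram(starts):
    """Bin differences between consecutive starts as in the printed histogram."""
    bins = {"1-5": 0, "6-10": 0, "11-20": 0, ">=21": 0}
    for a, b in zip(starts, starts[1:]):
        d = b - a
        if d <= 5:
            bins["1-5"] += 1
        elif d <= 10:
            bins["6-10"] += 1
        elif d <= 20:
            bins["11-20"] += 1
        else:
            bins[">=21"] += 1
    return bins


example = "AADFFFGHRRT"
print(compress_multiplets(example))
print(multiplet_starts(example))
print(spacing_histogram(multiplet_starts(example)))
```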
A. AMINO ACID ALPHABET.
1. Total number of amino acid multiplets: 93 (Expected range: 66--123)
2. Histogram of spacings between consecutive amino acid multiplets:
(1-5) 27 (6-10) 29 (11-20) 24 (>=21) 14
3. Clusters of amino acid multiplets (cmin = 12/30 or 16/45 or 19/60): none
B. CHARGE ALPHABET.
1. Total number of charge multiplets: 19 (Expected range: 4-- 29)
7 +plets (f+: 7.2%), 12 -plets (f-: 9.6%)
Total number of charge altplets: 15 (Critical number: 32)
2. Histogram of spacings between consecutive charge multiplets:
(1-5) 0 (6-10) 1 (11-20) 2 (>=21) 17
PERIODICITY ANALYSIS.
The program identifies periodic elements of periods between 1 and 10 for the amino acid alphabet, for the charge alphabet, and for a hydrophobicity alphabet. Each periodic element consists of an error-free core pattern (of length at least 4 for the amino acid alphabet, 5 for the charge alphabet, and 6 for the hydrophobicity alphabet) which is extended allowing for errors. The numbers of errors are given for each position in the consensus of a periodic pattern involving more than one letter. The displayed periodic patterns would generally not be statistically significant but are listed for the sake of a general interactive appraisal of the sequence. Periodicities of exceptionally high copy number are indicated with a !-mark.
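A minimal sketch of the error-free core search, under stated assumptions: the core length is read here as the number of consecutive error-free copies of a single-letter element (e.g., A.. repeated 4 times), the function name is hypothetical, and neither the extension with errors nor multi-letter consensus patterns are modelled.

```python
def periodic_cores(seq, min_copies=4, max_period=10):
    """List (start, end, period, residue, copies) for error-free periodic cores.

    Positions are 1-based. A core is a residue recurring at a fixed period
    for at least min_copies consecutive copies (an assumed reading of "core").
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        for i in range(n):
            # Only start counting at the first copy of a periodic run.
            if i >= period and seq[i - period] == seq[i]:
                continue
            copies = 1
            while i + copies * period < n and seq[i + copies * period] == seq[i]:
                copies += 1
            if copies >= min_copies:
                end = min(i + copies * period, n)
                hits.append((i + 1, end, period, seq[i], copies))
    return hits


# A recurs every third residue, four times in a row: element "A..", period 3.
print(periodic_cores("AKLAGTAPQAMN"))
```

With period 1 the same routine reports plain homopolymer runs, which is why the program's listing can overlap with the multiplet analysis.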
A. AMINO ACID ALPHABET (core: 4; !-core: 5)
Location    Period  Element     Copies  Core   Errors
  12-  23   3       A..         4       4      0
 492- 521   6       V.....      5       5 !    0
 509- 524   4       T...        4       4      0
 996-1015   2       T.          8       4      2
1042-1049   2       T.          4       4      0
1069-1104   9       G........   4       4      0
1120-1144   5       A....       5       5 !    0
1175-1190   4       A...        4       4      0
1195-1275   9       T........   8       5 !    1
1223-1238   4       T...        4       4      0
B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core: 5; !-core: 6)
and HYDROPHOBICITY ALPHABET ({*= KRED; i= LVIF; 0}; core: 6; !-core: 9)
Location    Period  Element     Copies  Core   Errors
 424- 453   5       i0.00       6       6      /0/1/./1/2/
SPACING ANALYSIS.
The spacings between consecutive residues of the same type (all 20 amino acids, + and - charge, and combined charge *) are evaluated for significantly large or small maximal and minimal spacings. The output is ordered by the beginning point of the significant spacing. Entries are identified by the residue type, spacing (number of amino acids between the identified positions), rank of the displayed spacing (e.g., 50 alanines in the sequence induce 51 spacings, ranked by decreasing length from 1 to 51), and p-value (probability of exceeding the displayed spacing). A maximal spacing with p-value 0.01 or less is considered significantly large; a maximal spacing with p-value 0.99 or larger is considered significantly small. Similarly, a minimal spacing with p-value 0.99 or larger is considered significantly small, and a minimal spacing with p-value 0.01 or less is considered significantly large (excluding doublets). If the first maximal spacing (rank 1) of a residue is significantly large or small, then the second maximal spacing (rank 2) is also evaluated. Large maximal and small minimal spacings indicate clustering effects, whereas small maximal and large minimal spacings indicate excessive evenness in the distribution of the residues.
Location (Quartile)   Spacing   Rank       P-value  Interpretation
  83-  85 (1.)        L( 2)L    67 of 67   0.0006   large minimal spacing
 185- 187 (1.)        L( 2)L    64 of 67   0.0006   matching minimum
 206- 208 (1.)        L( 2)L    65 of 67   0.0006   matching minimum
 618- 620 (2.)        L( 2)L    66 of 67   0.0006   matching minimum
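The ranking of spacings described in this section can be sketched as follows. The function name is hypothetical. The sketch takes a spacing to be the number of residues strictly between two occurrences, as the text states, and counts the gaps before the first and after the last occurrence so that n residues yield n + 1 spacings; whether the printed table uses exactly this convention is not verified here, and the p-value computation is not reproduced.

```python
def residue_spacings(seq, residue):
    """Rank the spacings induced by one residue type, largest first.

    Returns (rank, spacing, left, right) tuples with 1-based positions;
    left = 0 and right = len(seq) + 1 mark the sequence ends (an assumed
    convention so that n occurrences give n + 1 spacings).
    """
    positions = [i for i, aa in enumerate(seq, start=1) if aa == residue]
    if not positions:
        return []
    bounds = [0] + positions + [len(seq) + 1]
    gaps = [(right - left - 1, left, right)
            for left, right in zip(bounds, bounds[1:])]
    gaps.sort(key=lambda g: -g[0])  # stable sort keeps positional order on ties
    return [(rank, gap, left, right)
            for rank, (gap, left, right) in enumerate(gaps, start=1)]


for row in residue_spacings("LAALLAL", "L"):
    print(row)
```

Rank 1 is the maximal spacing and the last rank is the minimal spacing, which are the two ends of the list that the significance criteria above are applied to.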