REPETITIVE STRUCTURES, MULTIPLETS, PERIODICITY ANALYSIS, SPACING ANALYSIS

REPETITIVE STRUCTURES.

Repeats are indicated for two alphabets: the 20-letter amino acid alphabet, and a reduced 11-letter alphabet in which the major hydrophobics LVIF, the charged residues KR and ED, the small residues AG, the hydroxyl group residues ST, the amide group residues NQ, and the aromatics YW are treated as combined letters. For each alphabet, three classes of repeats are distinguished: separated repeats, simple tandem repeats, and periodic repeats. The separated repeats are largely non-overlapping. They are displayed in groups of matching blocks (meeting or exceeding a given core block length of contiguous exact matches) and intervening spacer distances (which may be negative, signifying a partial overlap). For the amino acid alphabet, the core block length is set to 4 for sequences up to 500 residues, to 5 for sequences between 500 and 2000 residues, and to 6 for longer sequences (the same values increased by 4 for the reduced alphabet). Simple tandem repeats are displayed in a similar layout, but separately. Sequence segments that are highly repetitive with relatively short repeats are displayed as periodic repeats.
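The alphabet reduction and the choice of core block length described above can be sketched as follows. This is an illustration only, not the program's own code: the function names are hypothetical, the one-letter codes are taken from the legend printed in section B, and the handling of sequences of exactly 500 or 2000 residues is an assumption.

```python
# Illustrative sketch; names are hypothetical, not taken from the program.
# Letter codes follow the legend printed in section B of this report.
GROUPS = [
    ("LVIF", "i"), ("KR", "+"), ("ED", "-"), ("AG", "s"),
    ("ST", "o"), ("NQ", "n"), ("YW", "a"),
    ("P", "p"), ("H", "h"), ("M", "m"), ("C", "c"),
]
REDUCED = {aa: code for letters, code in GROUPS for aa in letters}


def to_reduced(seq):
    """Translate an amino acid sequence into the 11-letter reduced alphabet."""
    return "".join(REDUCED[aa] for aa in seq)


def core_block_length(seq_len, reduced=False):
    """Core block length as a function of sequence length.

    Assumption: lengths of exactly 500 and 2000 fall into the lower class.
    """
    if seq_len <= 500:
        base = 4
    elif seq_len <= 2000:
        base = 5
    else:
        base = 6
    return base + 4 if reduced else base
```

For a sequence of roughly 1300 residues, this gives 5 for the amino acid alphabet and 9 for the reduced alphabet, in agreement with the core block lengths reported in sections A and B.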

A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.
Repeat core block length: 5

Aligned matching blocks:

[ 110- 115]  VANGIF
[ 171- 176]  VANGIF

______________________________

[ 621- 625]  NPGTS
[1107-1111]  NPGTS

______________________________

[ 785- 789]  TIETA
[1224-1228]  TIETA

______________________________

[1073-1081]  AATLTGTGL
[1166-1172]  AAT__GTGL

B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.
(i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)
Repeat core block length: 9

MULTIPLETS.

Multiplets refer to homooligopeptides of any length (e.g., A2, Q7, etc.); altplets refer to reiterations of two different residues (e.g., RG, EAEAEA, etc.). The multiplet composition of the protein sequence is evaluated for both the amino acid and the charge alphabet. (High) aggregate altplet counts are evaluated only for the charge alphabet. The multiplet sequence is displayed whenever the total multiplet count of the sequence falls outside the expected range (i.e., more than 3 standard deviations from the mean). Also printed are the histogram of the spacings between consecutive multiplets (differences between starting positions) and the clusters of multiplets. Multiplet clusters are determined in the same way as charge clusters: the binomial test is applied to a compressed sequence over the alphabet {M,S}, where M signifies a multiplet and S signifies a singlet. For example, the amino acid sequence AADFFFGHRRT... is translated as MSMSSMS..., and the binomial cluster test is applied to the latter sequence. Multiplets and altplets of specific residue content that individually show an unusually high count are indicated, and the positions of all multiplets exceeding a minimum length of 5 residues are shown.
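The translation into the {M,S} alphabet and the spacings between consecutive multiplets can be illustrated with a short sketch. The function names are hypothetical, and the binomial cluster test itself is not reproduced here; the sketch assumes a multiplet is any run of two or more identical residues, as in the example AADFFFGHRRT.

```python
from itertools import groupby


def residue_runs(seq):
    """Return (start, residue, length) for each maximal run; start is 1-based."""
    runs = []
    pos = 1
    for residue, group in groupby(seq):
        length = len(list(group))
        runs.append((pos, residue, length))
        pos += length
    return runs


def compress_ms(seq):
    """Translate a sequence into M (multiplet, run >= 2) and S (singlet)."""
    return "".join("M" if length >= 2 else "S"
                   for _, _, length in residue_runs(seq))


def multiplet_spacings(seq):
    """Differences between starting positions of consecutive multiplets."""
    starts = [start for start, _, length in residue_runs(seq) if length >= 2]
    return [b - a for a, b in zip(starts, starts[1:])]
```

Applied to AADFFFGHRRT, `compress_ms` returns MSMSSMS, and the multiplets AA, FFF, and RR start at positions 1, 4, and 9, giving the spacings 3 and 5.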

A. AMINO ACID ALPHABET.

1. Total number of amino acid multiplets: 93  (Expected range:  66--123)

2. Histogram of spacings between consecutive amino acid multiplets: (1-5) 27  (6-10) 29   (11-20) 24   (>=21) 14

3. Clusters of amino acid multiplets (cmin = 12/30 or 16/45 or 19/60): none

B. CHARGE ALPHABET.

1. Total number of charge multiplets: 19  (Expected range:   4-- 29)
   7 +plets (f+: 7.2%), 12 -plets (f-: 9.6%)
   Total number of charge altplets: 15 (Critical number: 32)

2. Histogram of spacings between consecutive charge multiplets: (1-5) 0  (6-10) 1   (11-20) 2   (>=21) 17

PERIODICITY ANALYSIS.

The program identifies periodic elements of periods between 1 and 10 for the amino acid alphabet, for the charge alphabet, and for a hydrophobicity alphabet. Each periodic element consists of an error-free core pattern (of length at least 4 for the amino acid alphabet, 5 for the charge alphabet, and 6 for the hydrophobicity alphabet), which is extended allowing for errors. The number of errors is given for each position in the consensus of a periodic pattern involving more than one letter. The displayed periodic patterns would generally not be statistically significant but are listed for the sake of a general interactive appraisal of the sequence. Periodicities of exceptionally high copy number are indicated with a ! mark.
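The notion of an error-free core can be illustrated for single-letter elements such as A.. (period 3). The following is a minimal sketch under stated assumptions, not the program's algorithm: the function name is hypothetical, only one-letter elements are handled, and the extension step that tolerates errors is omitted because its rules are not specified in this report. The reported end position assumes that a location spans copies times period residues, as in the row 12-23 for 4 copies of period 3.

```python
def periodic_cores(seq, period, core):
    """Find maximal error-free runs of one residue recurring every `period`.

    Returns (start, end, residue, copies) with 1-based positions.
    Assumption: the location spans copies * period residues, clipped to
    the sequence length.
    """
    n = len(seq)
    hits = []
    for i in range(n):
        # Skip positions that continue a run started earlier.
        if i >= period and seq[i - period] == seq[i]:
            continue
        copies = 1
        j = i + period
        while j < n and seq[j] == seq[i]:
            copies += 1
            j += period
        if copies >= core:
            end = min(n, i + copies * period)
            hits.append((i + 1, end, seq[i], copies))
    return hits
```

For example, `periodic_cores("AKLAGGAVVAEE", period=3, core=4)` finds the single element A.. with 4 copies spanning positions 1 to 12.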

A. AMINO ACID ALPHABET (core: 4; !-core: 5)

Location    Period  Element     Copies  Core  Errors
12- 23      3       A..         4       4     0
492- 521    6       V.....      5       5 !   0
509- 524    4       T...        4       4     0
996-1015    2       T.          8       4     2
1042-1049   2       T.          4       4     0
1069-1104   9       G........   4       4     0
1120-1144   5       A....       5       5 !   0
1175-1190   4       A...        4       4     0
1195-1275   9       T........   8       5 !   1
1223-1238   4       T...        4       4     0

B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core: 5; !-core: 6) and HYDROPHOBICITY ALPHABET ({*= KRED; i= LVIF; 0}; core: 6; !-core: 9)

Location    Period  Element     Copies  Core  Errors
424- 453    5       i0.00       6       6     /0/1/./1/2/

SPACING ANALYSIS.

The spacings between consecutive residues of the same type (all 20 amino acids, + and - charge, and combined charge *) are evaluated for significantly large or small maximal and minimal spacings. The output is ordered by the beginning point of the significant spacing. Entries are identified by the residue type, spacing (number of amino acids between the identified positions), rank of the displayed spacing (e.g., 50 alanines in the sequence induce 51 spacings, ranked by decreasing length from 1 to 51), and p-value (probability of exceeding the displayed spacing). A maximal spacing with p-value 0.01 or less is considered significantly large; a maximal spacing with p-value 0.99 or larger is considered significantly small. Similarly, a minimal spacing with p-value 0.99 or larger is considered significantly small, and a minimal spacing with p-value 0.01 or less is considered significantly large (excluding doublets). If the first maximal spacing (rank 1) of a residue is significantly large or small, then the second maximal spacing (rank 2) is also evaluated. Large maximal and small minimal spacings indicate clustering effects, whereas small maximal and large minimal spacings indicate excessive evenness in the distribution of the residues.
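The ranking of spacings can be sketched as follows. This is an illustration with a hypothetical function name, and it rests on two assumptions: a spacing is taken as the difference between consecutive positions (consistent with table rows such as 83-85 with spacing 2), and the two end spacings are measured from position 0 and from position n+1, so that k residues induce k+1 spacings. The p-value computation is not reproduced.

```python
def ranked_spacings(seq, residue):
    """Rank the spacings induced by one residue type, largest first.

    Returns a list of (rank, spacing, start, end). Positions are 1-based;
    0 and len(seq) + 1 act as virtual end points (an assumption).
    """
    positions = [i for i, aa in enumerate(seq, start=1) if aa == residue]
    if not positions:
        return []
    bounds = [0] + positions + [len(seq) + 1]
    spacings = [(b - a, a, b) for a, b in zip(bounds, bounds[1:])]
    # Python's sort is stable, so ties keep their order along the sequence.
    spacings.sort(key=lambda item: -item[0])
    return [(rank, size, a, b)
            for rank, (size, a, b) in enumerate(spacings, start=1)]
```

With three L residues in ALAAALAL, the sketch yields four spacings; the rank 1 entry is the maximal spacing and the last entry is the minimal spacing.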

Location  (Quartile)  Spacing    Rank        P-value  Interpretation

83- 85    (1.)        L(   2)L   67 of  67   0.0006   large minimal spacing
185- 187  (1.)        L(   2)L   64 of  67   0.0006   matching minimum
206- 208  (1.)        L(   2)L   65 of  67   0.0006   matching minimum
618- 620  (2.)        L(   2)L   66 of  67   0.0006   matching minimum