REPETITIVE STRUCTURES, MULTIPLETS, PERIODICITY ANALYSIS, SPACING ANALYSIS

From OpenWetWare

REPETITIVE STRUCTURES.

Repeats are indicated for two alphabets: the 20-letter amino  acid  alpha-
bet,  and  a  reduced  11-letter  alphabet in which the major hydrophobics
LVIF, the charged residues KR and ED, the small residues AG, the  hydroxyl
group  residues  ST, the amide group residues NQ, and the aromatics YW are
treated as combined letters.  For each alphabet, three classes of  repeats
are  distinguished: separated repeats, simple tandem repeats, and periodic
repeats. The separated  repeats  are  largely  non-overlapping.  They  are
displayed  in  groups  of  matching  blocks  (exceeding a given core block
length of contiguous  exact  matches)  and  intervening  spacer  distances
(which  may  be  negative,  signifying  a partial overlap). The core block
length in case of the amino acid alphabet is set to 4 for sequences up  to
500  residues,  to 5 for sequences between 500 and 2000 residues, and to 6
for longer sequences (same values increased by 4 for  the  reduced  alpha-
bet).   Simple  tandem  repeats  are  displayed  in  similar  layout,  but
separately. Sequence segments that are highly repetitive  with  relatively
short repeats are displayed as periodic repeats.
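
The alphabet reduction and the choice of core block length described above can be sketched in a few lines of Python. This is only an illustration, not the program's own code: the function names are hypothetical, the one-letter symbols are taken from the legend printed in section B, and the handling of sequences of exactly 500 or 2000 residues is an assumption.

```python
# Reduced 11-letter alphabet as listed in the text. The symbols follow the
# legend printed in the output; P, H, M and C remain single letters.
GROUPS = {"i": "LVIF", "+": "KR", "-": "ED", "s": "AG",
          "o": "ST", "n": "NQ", "a": "YW",
          "p": "P", "h": "H", "m": "M", "c": "C"}
REDUCED = {aa: sym for sym, members in GROUPS.items() for aa in members}


def to_reduced(seq):
    """Translate an amino acid sequence into the reduced alphabet."""
    return "".join(REDUCED[aa] for aa in seq.upper())


def core_block_length(n_residues, reduced=False):
    """Core block length by sequence length, using the thresholds in the text.

    Assumption: lengths of exactly 500 and 2000 fall in the lower class.
    """
    if n_residues <= 500:
        core = 4
    elif n_residues <= 2000:
        core = 5
    else:
        core = 6
    # The reduced alphabet uses the same values increased by 4.
    return core + 4 if reduced else core
```

For a sequence of about 1275 residues this gives core block lengths of 5 and 9, which agrees with the values printed in sections A and B.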


A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.

Repeat core block length: 5

Aligned matching blocks:


[ 110- 115]   VANGIF
[ 171- 176]   VANGIF

______________________________

[ 621- 625]   NPGTS
[1107-1111]   NPGTS

______________________________

[ 785- 789]   TIETA
[1224-1228]   TIETA

______________________________

[1073-1081]   AATLTGTGL
[1166-1172]   AAT__GTGL


B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.

  (i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)

Repeat core block length: 9


MULTIPLETS.

Multiplets refer to homooligopeptides of any length (e.g., A2, Q7,  etc.);
altplets  refer  to  reiterations  of  two  different  residues (e.g., RG,
EAEAEA, etc.). The  multiplet  composition  of  the  protein  sequence  is
evaluated for both the amino acid and the charge alphabet. (High) aggregate
altplet counts are evaluated only for the charge alphabet. The multiplet
sequence is displayed whenever the total multiplet count of the sequence
falls outside the expected range (i.e., more than 3 standard deviations from
the mean). Also printed are the histogram of the spacings between
consecutive multiplets (differences between starting positions) as well as
clusters  of multiplets (multiplet clusters are determined in the same way
as charge clusters are determined; the  binomial  test  is  applied  to  a
compressed sequence over the alphabet {M,S}, where M signifies a multiplet
and S signifies a singlet; i.e., the amino acid sequence AADFFFGHRRT... is
translated  as MSMSSMS..., and the binomial cluster test is applied to the
latter sequence). Multiplets and altplets of specific residue content that
individually show an unusually high count are indicated, and the positions
of all multiplets exceeding a minimum length of 5 residues are shown.
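
The compression to the {M,S} alphabet and the spacing histogram can be illustrated with a short Python sketch. The function names are hypothetical and the binomial cluster test itself is not reproduced. The sketch bins only the differences between starting positions of consecutive multiplets; the sample output below lists one more spacing than multiplets, so the program apparently counts an additional end spacing that is not modelled here.

```python
from itertools import groupby


def runs(seq):
    """Yield (start, residue, length) for each maximal run; start is 1-based."""
    pos = 1
    for residue, group in groupby(seq):
        length = len(list(group))
        yield pos, residue, length
        pos += length


def compress(seq):
    """Map each run to M (multiplet, length >= 2) or S (singlet)."""
    return "".join("M" if length >= 2 else "S" for _, _, length in runs(seq))


def spacing_histogram(seq):
    """Bin differences between starting positions of consecutive multiplets."""
    starts = [start for start, _, length in runs(seq) if length >= 2]
    bins = {"1-5": 0, "6-10": 0, "11-20": 0, ">=21": 0}
    for a, b in zip(starts, starts[1:]):
        d = b - a
        if d <= 5:
            bins["1-5"] += 1
        elif d <= 10:
            bins["6-10"] += 1
        elif d <= 20:
            bins["11-20"] += 1
        else:
            bins[">=21"] += 1
    return bins
```

Applied to the example from the text, compress("AADFFFGHRRT") returns "MSMSSMS".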


A. AMINO ACID ALPHABET.

1. Total number of amino acid multiplets: 93 (Expected range: 66--123)

2. Histogram of spacings between consecutive amino acid multiplets:

  (1-5) 27   (6-10) 29   (11-20) 24   (>=21) 14

3. Clusters of amino acid multiplets (cmin = 12/30 or 16/45 or 19/60): none


B. CHARGE ALPHABET.

1. Total number of charge multiplets: 19 (Expected range: 4-- 29)

  7 +plets (f+: 7.2%), 12 -plets (f-: 9.6%)
  Total number of charge altplets: 15 (Critical number: 32)

2. Histogram of spacings between consecutive charge multiplets:

  (1-5) 0   (6-10) 1   (11-20) 2   (>=21) 17

PERIODICITY ANALYSIS.

The program identifies periodic elements of periods between 1 and  10  for
the amino acid alphabet, for the charge alphabet, and for a hydrophobicity
alphabet. Each periodic element consists of an error-free core pattern (of
length  at least 4 for the amino acid alphabet, 5 for the charge alphabet,
and 6 for the hydrophobicity alphabet)  which  is  extended  allowing  for
errors.   The  numbers  of  errors are given for each position in the con-
sensus of a periodic pattern involving more than one letter. The displayed
periodic patterns would generally not be statistically significant but are
listed for the sake of a general interactive appraisal  of  the  sequence.
Periodicities  of  exceptionally  high copy number are indicated with a !-
mark.
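
As a rough illustration of the error-free core search, the following Python sketch finds single-letter periodic cores such as A.. (the same residue recurring every 3 positions). It is a simplified reading of the description above, not the program's algorithm: the function name is hypothetical, the core length is taken to mean the number of consecutive error-free copies, only single-letter elements are handled, and the extension step that allows errors is omitted.

```python
def periodic_cores(seq, min_copies=4, max_period=10):
    """Return (start, end, period, residue, copies) for error-free cores.

    Positions are 1-based. The end covers the full last period, which
    matches spans such as 12-23 for 4 copies of A.. in the sample output;
    it is capped at the sequence length.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        for i in range(n):
            # Skip positions that only continue a pattern started earlier.
            if i >= period and seq[i - period] == seq[i]:
                continue
            copies = 1
            while i + copies * period < n and seq[i + copies * period] == seq[i]:
                copies += 1
            if copies >= min_copies:
                end = min(i + copies * period, n)
                hits.append((i + 1, end, period, seq[i], copies))
    return hits
```

With min_copies=4 this corresponds to the amino acid alphabet setting; the charge and hydrophobicity alphabets would use 5 and 6 after translating the sequence into those alphabets.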


A. AMINO ACID ALPHABET (core: 4; !-core: 5)

Location Period Element Copies Core Errors

 12-  23	 3	A..       	 4	 4  	 0
492- 521	 6	V.....    	 5	 5 !	 0
509- 524	 4	T...      	 4	 4  	 0
996-1015	 2	T.        	 8	 4  	 2
1042-1049	 2	T.        	 4	 4  	 0
1069-1104	 9	G........ 	 4	 4  	 0
1120-1144	 5	A....     	 5	 5 !	 0
1175-1190	 4	A...      	 4	 4  	 0
1195-1275	 9	T........ 	 8	 5 !	 1
1223-1238	 4	T...      	 4	 4  	 0


B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core: 5; !-core: 6)

  and HYDROPHOBICITY ALPHABET ({*= KRED; i= LVIF; 0}; core:  6; !-core: 9)

Location Period Element Copies Core Errors

424- 453	 5	i0.00     	 6	 6  	/0/1/./1/2/



SPACING ANALYSIS.

The spacings between consecutive residues of the same type (all  20  amino
acids,  +  and - charge, and combined charge *) are evaluated for signifi-
cantly large or small maximal and minimal spacings. The output is  ordered
by  the beginning point of the significant spacing. Entries are identified
by the residue type, spacing (number of amino acids between the identified
positions),  rank  of  the  displayed  spacing  (e.g.,  50 alanines in the
sequence induce 51 spacings, ranked by decreasing length from  1  to  51),
and  p-value  (probability  of exceeding the displayed spacing). A maximal
spacing with p-value 0.01 or less is  considered  significantly  large;  a
maximal  spacing  with  p-value 0.99 or larger is considered significantly
small. Similarly, a minimal spacing with p-value 0.99 or  larger  is  con-
sidered  significantly  small,  and a minimal spacing with p-value 0.01 or
less is considered significantly large (excluding doublets). If the  first
maximal  spacing  (rank  1)  of a residue is significantly large or small,
then also the second maximal spacing (rank 2) is evaluated. Large  maximal
and small minimal spacings indicate clustering effects, whereas small max-
imal and large minimal spacings indicate excessive evenness in the distri-
bution of the residues.
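
The ranking of spacings and the threshold rules above can be sketched in Python as follows. The function names are hypothetical and the p-value computation is not reproduced, since the text does not specify it. Two assumptions are made: a spacing is measured as the difference between consecutive positions (as the L(2)L rows at positions 83-85 in the sample output suggest), and the two sequence ends are treated as virtual occurrences so that k residues induce k+1 spacings.

```python
def ranked_spacings(seq, residue):
    """Return (rank, start, end, spacing) tuples, longest spacing first.

    Assumption: positions 0 and len(seq)+1 act as virtual occurrences,
    so that k residues induce k+1 spacings as stated in the text.
    """
    positions = [i for i, aa in enumerate(seq, start=1) if aa == residue]
    marks = [0] + positions + [len(seq) + 1]
    gaps = [(b - a, a, b) for a, b in zip(marks, marks[1:])]
    gaps.sort(key=lambda g: -g[0])  # stable sort: ties keep sequence order
    return [(rank, a, b, d) for rank, (d, a, b) in enumerate(gaps, start=1)]


def classify(p_value, kind):
    """Apply the 0.01 / 0.99 thresholds from the text.

    kind is "maximal" or "minimal"; returns None if neither threshold is met.
    """
    if p_value <= 0.01:
        return f"large {kind} spacing"
    if p_value >= 0.99:
        return f"small {kind} spacing"
    return None
```

For example, classify(0.0006, "minimal") yields "large minimal spacing", the interpretation shown in the first row of the table below. The order in which the program ranks tied spacings is not stated, so the tie order used here is arbitrary.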


Location (Quartile) Spacing Rank P-value Interpretation

 83-  85  (1.)     L(   2)L    67 of  67   0.0006   large minimal spacing
185- 187  (1.)     L(   2)L    64 of  67   0.0006     matching minimum
206- 208  (1.)     L(   2)L    65 of  67   0.0006     matching minimum
618- 620  (2.)     L(   2)L    66 of  67   0.0006     matching minimum