E. histolytica

Abstract
Short interspersed repetitive elements (SINEs) are nonautonomous mobile genetic elements related to non-long terminal repeat (non-LTR) retrotransposons. Because they are mobile and abundantly transcribed, they have a significant impact on genome evolution. In this study, computational analysis was used to look for functional genes that have decayed following the insertion of a SINE in the Entamoeba histolytica genome.

One family of SINEs in E. histolytica, EhSINE1, was examined for the possible effects of insertions that split and inactivate (or modify) a gene. However, the genome sequences flanking the EhSINE1 insertion sites are often technically difficult to recognize, because the sequences have decayed over time. Analysis of the regions flanking EhSINE1 insertions provides new insight into genome evolution in a human pathogen.

We aimed to find evidence of decayed genes inactivated by the insertion of an EhSINE1. The function of any disrupted gene may be related to the ability of E. histolytica to cause amoebiasis. This may enable a greater understanding of the possible mechanism of the disease, and such genes might become new drug targets for treatment.

Abbreviations

 * bp	base pair
 * BLAST	Basic Local Alignment Search Tool
 * CDS	coding sequence
 * GLIMMER3	Gene Locator and Interpolated Markov ModelER
 * IMM	interpolated Markov model
 * LTR	long terminal repeat
 * ORF	open reading frame
 * SINE	short interspersed repetitive element
 * TIGR	The Institute for Genomic Research
 * UniProtKB	UniProt Knowledgebase

Introduction
Amoebiasis is an infectious disease whose symptoms include diarrhoea and amoebic dysentery, and it is most prevalent in developing countries. It is currently defined as infection with the protozoan parasite E. histolytica. The factors that cause clinical disease remain largely unknown. In 1925, however, Emile Brumpt proposed that two species of parasite are involved: a pathogenic E. histolytica and a nonpathogenic form, which he termed E. dispar and which never causes disease (Brumpt, 1925). E. histolytica is a pathogen that can cause invasive amoebiasis, whereas E. dispar is not capable of invading tissue, although both species are able to colonize the human gut. People in the developing world suffer greatly from E. histolytica, the causal agent of amoebic dysentery and amoebic liver abscess (Ryan et al. 2004), with the most serious effects seen in the young and old, in malnourished individuals and in pregnant women. E. histolytica causes approximately 50 million infections worldwide each year, resulting in 100,000 deaths (WHO 1997). Developing strategies for the prevention and eradication of amoebiasis is critical, because infection through faecal contamination of food and water supplies is highly prevalent and is hard to control within a few years in these countries.

In E. histolytica, protein-coding genes comprise 49% of the genome, and the remaining nucleotides consist of a variety of repetitive sequences (Loftus et al. 2005). These repetitive sequences increase genome size through duplication and repetition, or induce mutations by inserting into the vicinity or interior of genes. In some organisms, particularly mammals, changes to the genome caused by the insertion of repetitive sequences can lead to genetic disease. SINEs are one type of retrotransposon that can amplify themselves and insert into genes or into the remainder of the genome (Clayton 2010). They are nonautonomous mobile genetic elements related to non-long terminal repeat (non-LTR) retrotransposons, a subgroup of repetitive sequences in eukaryotic genomes. The SINEs in E. histolytica are divided into three families on the basis of phylogenetic analysis: EhSINE1, EhSINE2 and EhSINE3 (Van Dellen et al. 2002). This project focuses on identifying genes in E. histolytica that have decayed because of SINE insertion, using a combination of genetic and genomics-based approaches, and examines only EhSINE1 in depth among the three families. The analysis of EhSINE1 concentrates on evidence of recent transposition, regardless of the repeats within each element.

The EhSINE1 analysis was carried out on 393 elements distributed across the ~24 Mb of E. histolytica genome sequence, as identified by Huntley et al. (2010). However, the impact of EhSINE1 on E. histolytica is still unknown. A SINE persists only if it becomes fixed in a population through genetic drift, an important force in the evolution of eukaryotes (Okada 1991). The EhSINE1 data can be separated into two parts for analysis. One part consists of elements located inside an annotated gene; the other consists of elements located outside annotated genes, in non-coding regions, which may nevertheless have inserted into a gene and disrupted it. The latter group contains most EhSINE1s, which helps to explain why SINEs have been viewed as junk DNA (Santangelo et al. 2007). Three of the 393 SINEs lie inside an annotated gene, whilst the remainder are located in non-coding regions.
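
The split into elements inside and outside annotated genes can be illustrated with a simple interval comparison. The following is a minimal sketch only: the function name, the labels and all coordinates are hypothetical, and it assumes 1-based inclusive positions on a single scaffold rather than the annotation format actually used in this project.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive (assumed convention)


def classify_element(element: Interval, genes: List[Interval]) -> str:
    """Label an element by its position relative to annotated genes.

    Returns 'inside_gene' if the element lies wholly within a gene,
    'overlaps_gene' if it only partly overlaps one, else 'non_coding'.
    """
    e_start, e_end = element
    label = "non_coding"
    for g_start, g_end in genes:
        if g_start <= e_start and e_end <= g_end:
            return "inside_gene"
        if e_start <= g_end and g_start <= e_end:
            label = "overlaps_gene"
    return label


# Hypothetical coordinates, for illustration only.
genes = [(100, 900), (1500, 2400)]
elements = [(300, 850), (1000, 1400), (2300, 2800)]
labels = [classify_element(e, genes) for e in elements]
```

An element labelled non_coding by such a test may still sit in the remains of a decayed gene, which is the case the flanking-sequence analysis is meant to detect.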

In recent years, the identification of all genes in a genome by computational approaches has drawn significant attention in the bioinformatics community. Finding genes in eukaryotic sequences raises many more complex problems than finding them in prokaryotic sequences. It is therefore useful to define some essential terms with specialized meanings. Sensitivity and specificity are the criteria used to assess gene finding. Sensitivity is defined as the proportion of true sites that are correctly predicted, and specificity as the proportion of predicted sites that are correct (Burge & Karlin 1998).
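
The two measures can be written out as a short calculation. This is a sketch under the gene-finding definitions given above, where specificity is the fraction of predictions that are correct; the counts used are hypothetical and do not come from this study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of true sites that were predicted: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_pos: int, false_pos: int) -> float:
    """Proportion of predicted sites that are true: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts: 80 true sites found, 20 missed, 10 false predictions.
sn = sensitivity(80, 20)
sp = specificity(80, 10)
```

Note that specificity in this sense is what other fields call precision or positive predictive value, so the two numbers together describe how many real sites are missed and how many predictions are spurious.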

One approach to finding potentially disrupted genes in the E. histolytica genome is to investigate genes that are present in E. dispar but have no syntenic counterpart in E. histolytica. To confirm whether the absent gene has been disrupted in E. histolytica, the regions flanking the E. dispar gene are searched against the E. histolytica genome with the Basic Local Alignment Search Tool (BLAST) to identify potential homologues (Altschul et al. 1990). The sequences upstream and downstream of the non-syntenic E. dispar gene are extracted and aligned separately with the E. histolytica genome. If the two homologues can be placed together, with an EhSINE1 nearby or within them, this indicates that the EhSINE1 insertion has disrupted a gene, either by disrupting the promoter region or by inserting directly within the gene sequence.
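
The extraction of upstream and downstream sequences can be sketched as follows. The function name, the toy scaffold and the flank length are assumptions made for illustration; coordinates are taken to be 0-based and half-open, and the BLAST search of each flank is not shown.

```python
from typing import Tuple


def extract_flanks(sequence: str, start: int, end: int,
                   flank_len: int) -> Tuple[str, str]:
    """Return (upstream, downstream) flanks of the feature sequence[start:end].

    Flanks are truncated at the scaffold ends rather than padded.
    """
    upstream = sequence[max(0, start - flank_len):start]
    downstream = sequence[end:end + flank_len]
    return upstream, downstream


# Toy scaffold with a hypothetical gene at positions 4..12.
scaffold = "AAAACCCCGGGGTTTT"
up, down = extract_flanks(scaffold, 4, 12, 3)
```

Each flank would then be used as a separate query, so that hits to the two flanks can be compared for position and orientation on the target genome.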

Another approach examines the sequences upstream and downstream of each EhSINE1 in E. histolytica to see whether the flanking sequences once belonged to a functional gene. Three strategies are used to analyse these sequences. The first strategy looks for evidence that the flanking sequences are decayed ORFs. If an EhSINE1 has only recently disrupted the gene, it is quite possible that the flanking sequences still contain recognizable, or at least fragmented, ORFs; this applies especially to the two-repeat EhSINE1s. EhSINE1s have been divided into classes according to the number of identifiable internal repeats, between one and four. The number of repeats suggests that some elements are older than others, and yields a short list of EhSINE1s that appear to have transposed recently (Huntley et al. 2010). Two-repeat EhSINE1s show the properties expected of relatively recently transposed elements (Huntley et al. 2010). The second strategy searches the flanking sequences with gene finder programs such as GLIMMER (Delcher et al. 1999) or GENSCAN (Burge et al., 1997) to see whether they yield any gene predictions. The last strategy predicts the function of these sequences by aligning them to proteins from the UniProt protein databases (The UniProt Consortium 2010).
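
The first strategy can be illustrated with a naive ORF scan of a flanking sequence. This sketch is not GLIMMER or GENSCAN and is not the procedure used in the study: it assumes the standard stop codons, requires an ATG start, and uses a hypothetical minimum length. A decayed gene may have lost its start codon, so a real search for fragmented ORFs would need looser criteria.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, for scanning the opposite strand."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for each ATG..stop ORF on the given strand.

    Coordinates are 0-based and half-open; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


# Toy flanking sequence, for illustration only.
flank = "CCATGAAATTTTAGCC"
forward_orfs = find_orfs(flank)
reverse_orfs = find_orfs(reverse_complement(flank))
```

Scanning both the sequence and its reverse complement covers all six reading frames, which matters because the orientation of a disrupted gene relative to the EhSINE1 is not known in advance.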

The UniProt Knowledgebase (UniProtKB), a protein database with accurate, consistent and rich annotation, is aligned against the query DNA sequences for further gene prediction. UniProtKB has two sections: Swiss-Prot and TrEMBL (Boeckmann et al. 2003). Swiss-Prot contains manually annotated records with information extracted from the published literature and from curator-evaluated computational analysis, whereas TrEMBL is a computer-annotated section whose records await manual annotation (Boeckmann et al. 2003). Overall, both databases are well annotated, with evidence attributed to experimental and computational data that supports the reliability of the protein records.

However, defining decayed genes with the approaches above is complicated. First, the E. histolytica genome is T-rich (Mandal et al. 2004). Second, an EhSINE1 decays over time after transposition. Finally, many EhSINE1s are located at a scaffold terminus, at either the 5’ or the 3’ end. Between one and four recognizable repeats can be identified per EhSINE1 element unit 1.

This study attempted to find decayed genes in E. histolytica that have been disrupted by the insertion of an EhSINE1. As mentioned above, the molecular function of E. histolytica is characterized by understanding the proportion of genes that are distinct from those of E. dispar. Several approaches to gene prediction were used, combining different gene finders and BLAST in comparisons with the E. dispar genome and with protein databases.

Missing gene in E. histolytica
In mouse, EAAT2 was found only in the mutant SOD1G37R network of the 50 most significant edges, and was absent from the wild type SOD1G37R network drawn with the same legend (Figure 1). EAAT2 was not present in the networks of the 50 most significant edges for either the mutant or the wild type SOD1G85R data sets (Figure 2). The SLC gene family was strongly represented in the networks of both the SOD1G37R and the SOD1G85R data sets.

EhSINE1 located in gene
Further evidence showed that candidate genes were present in the networks of significant genes in rat. Three candidate genes (Cd11b, Isl1 and p75) were found in the networks of the 20 or 50 most significant edges. Cd11b was present in the networks for the processes integrin-mediated signaling pathway and leukocyte cell-cell adhesion. Isl1 was present in the networks for axon regeneration and retinal ganglion cell axon guidance. p75 was present only in the inflammatory response network of the 50 most significant edges. EAAT3 was present in the transport network, but only beyond the 1000 most significant edges. Other biological processes either had too few probe IDs attached or did not appear within the 2000 most significant edges. Figure 3 shows where Isl1 was located and its correlation with other genes. Nevertheless, no gene was connected to Isl1 by the same edge in both networks, although some genes were returned.

ORFs matching the E. dispar genome
Not all of the candidate genes can be found in the distribution network with a 0.05 cut-off. Some of them have high correlations under particular GO terms that would not be found in the network of significant edges. Compared with the correlation distribution filtered by GO term, the density at which a candidate gene is located in the subset network may be higher or lower, depending on the gene. Figures 4 to 6 show the distribution of the candidate genes in the subset networks. Unknown genes were therefore identified with a specific threshold for each subset, and a summary of the unknown genes linked to each candidate gene is given in Table 1.

Conclusion
The raw Affymetrix microarray data, in CEL files, were obtained from the NIH Neuroscience Microarray Consortium, and the Affymetrix NetAffx annotation CSV file was downloaded from the Affymetrix NetAffx database through the AffyCompatible R package. The microarray data sets have to be separated into mutant and wild type groups, each with the corresponding annotation file. Figure 7 shows an example of how individual CEL files are divided into groups to create the expression CSV file.

Annotation files
NetAffx probe-to-gene annotations can change as the sequence data are updated, by approximately 5% over a two-year span (Perez-Iratxeta & Andrade 2005). Affymetrix probe sets are annotated on the basis of the current related records in UniGene, a database of clusters of GenBank sequences. An inconsistency between versions of the NetAffx annotation files can be detected when two probe sets are attached to the same gene in one version of the annotations but to different gene names in another version. Probe IDs 1433436_s_at and 1419113_at in eight versions of the NetAffx MOE430A/B annotation files are an example (Table 2). NetAffx has released 30 versions of the MOE430A/B annotation since 17th May 2003. Using different versions of the NetAffx annotation file could therefore change the biological interpretation of a microarray experiment. In addition, the experiment does not represent the entire mouse genomic sequence, because around 17 percent of the probe sets in the mouse MOE430 annotation files are still unassigned EST-only probe sets.
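
The inconsistency test described above can be sketched as a comparison of two probe-to-gene mappings. This is an illustration only: the function name is hypothetical, each annotation version is assumed to have been reduced to a dictionary from probe ID to gene name, and the gene names and the third probe ID below are invented placeholders, not values from Table 2.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def inconsistent_pairs(version_a: Dict[str, str],
                       version_b: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return probe pairs that share a gene in one version but not the other."""
    shared = sorted(set(version_a) & set(version_b))
    flagged = []
    for p, q in combinations(shared, 2):
        same_in_a = version_a[p] == version_a[q]
        same_in_b = version_b[p] == version_b[q]
        if same_in_a != same_in_b:
            flagged.append((p, q))
    return flagged


# Gene names and the third probe ID are hypothetical placeholders.
older = {"1433436_s_at": "GeneX", "1419113_at": "GeneX",
         "hypothetical_probe_at": "GeneZ"}
newer = {"1433436_s_at": "GeneY", "1419113_at": "GeneX",
         "hypothetical_probe_at": "GeneZ"}
flagged = inconsistent_pairs(older, newer)
```

The pairwise comparison grows quadratically with the number of probe sets, so for a full array it would be more practical to group probes by gene within each version first and compare the groups.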

Software

 * R project	version 2.11.0
 * affy R package	version 1.26.1
 * AffyCompatible R package	version 1.8.0
 * plotrix R package	version 2.9.3
 * GeneNet R package	version 1.2.4
 * gplots R package	version 2.7.4

Abbreviations

 * ALS	amyotrophic lateral sclerosis
 * SOD1	copper-zinc superoxide dismutase
 * GO	gene ontology
 * VAR	vector autoregressive
 * FALS	familial amyotrophic lateral sclerosis
 * BP	biological process
 * CC	cellular component
 * MF	molecular function
 * EAAT	excitatory amino acid transporter

Supervisor

 * Dr. Derek Huntley, Prof John Ackers