Viruses differ markedly in their specificity toward host organisms. Here, we test the level of general sequence adaptation that viruses display toward their hosts. We compiled a representative data set of viruses that infect hosts ranging from bacteria to humans. We consider their respective amino acid and codon usages and compare them among the viruses and their hosts. We show that bacteria‐infecting viruses are strongly adapted to their specific hosts, but that they differ from other unrelated bacterial hosts. Viruses that infect humans, but not those that infect other mammals or aves, show a strong resemblance to most mammalian and avian hosts, in terms of both amino acid and codon preferences. In groups of viruses that infect humans or other mammals, the highest observed level of adaptation of viral proteins to host codon usages is for those proteins that appear abundantly in the virion. In contrast, proteins that are known to participate in host‐specific recognition do not necessarily adapt to their respective hosts. The implication for the potential of viral infectivity is discussed.
Viruses are autonomous entities with an extremely fast evolution rate. They invade their host and replicate to produce new viral particles. These processes take place only inside their hosts’ cellular environment. To activate their reproductive cycle, viruses typically have to override their hosts’ translational machinery and, in addition, they must evade the hosts’ immune system and additional defense mechanisms. These basic observations make it very interesting to investigate the evolutionary interactions among hosts and their infecting viruses. There are several critical parameters that determine the selectivity with which viruses infect their hosts. These include the number of viruses that are produced in each infected cell, the host's population size, and its generation time. In addition, there is the degree of the virus stability in the hostile environment outside the cell and, most importantly, the molecular specificity of recognition that underlies the virus entry into the host. Studies of the evolutionary history of viral adaptation suggest the existence of a rich web of interactions that involve both the host and virus codon usage, the virus replication mode, genome size, and the variety of its potential hosts. It was also proposed that the extremely high mutation rates in viruses (especially RNA viruses) outpace the evolutionary processes of selection that drive codon preference optimization of viruses and their cognate hosts. For certain viruses, genome‐wide mutational pressures override the selection for specific codons.
In this study, we took advantage of the fast growth in sequencing data for many model organisms as well as for thousands of viral genomes. Such advances have made it possible for us to compile a balanced data set for further analysis. This set includes ∼300 representative viruses whose hosts range from humans to bacteria, and whose genomes have been completely sequenced. We had to overcome the difficulty that arises from the fact that although certain viruses infect a broad range of species, others infect only a single host. We solved this problem by developing a consistent virus‐to‐host mapping. Our main objective was to answer the following question: notwithstanding the enormous diversity among viruses, is there an overall well‐defined and measurable molecular similarity between viruses and their hosts? Such similarity, should one exist, can presumably be considered as a manifestation of some molecular adaptation mechanisms. We develop a statistical framework for the purpose of providing an unbiased assessment of the mutual distances between all viruses and all recognized hosts. To test the hypothesis of a molecular adaptation of viruses toward their hosts, we focus on the codon usage and on the amino acid preferences within groups of viruses that are grouped at varying taxonomical granularities.
We observe that all bacteriophages are strongly tuned to match their unique bacterial hosts and this correspondence is also evident in their genomic GC contents. However, somewhat surprisingly, viruses that infect humans resemble not only the human codon preferences and amino acid frequencies but also, to an equal degree, those of an additional 10 mammalian hosts. This similarity even extends to aves and several insects. This observation does not hold for viruses that infect other mammals, despite a strong similarity among the codon usages of most mammals.
Finally, we show that viral selection of codon usage toward that of the host has not occurred uniformly for all proteins of the virus, but it is mainly dominated by the set of proteins expressed in high abundance. The implications of these observations for viral evolution and on the potential for zoonotic epidemics are evident. It is likely that the domestication of, and the close interaction between, humans, rats, and farm animals for thousands of years have led to the evolution of viruses that infect humans and are adapted toward a broad range of hosts. During the last century of human evolution, with the growth in human population and global traffic, we have witnessed instances of viruses that crossed the host barrier and were introduced into the human population. Known examples are HIV in the early 1980s, SARS in 2003, and the most recent epidemic of the H1N1 swine flu in 2009. The similarities in codon usage and amino acid composition that we have observed in this work may relate, to some extent, to the potential for zoonosis. Although these molecular properties are neither necessary nor sufficient conditions for host shifts, our analysis can nevertheless contribute to a framework that would, on the one hand, permit analysis of the potential of certain viruses to adapt to new host species and, on the other, allow the development of attenuated viruses for vaccination.
A representative set of ∼300 viruses was compiled and mapped to their cognate hosts, ranging from bacteria to humans.
The amino acid distribution and codon usage of bacteriophages resemble their specific bacterial hosts.
Viruses that infect humans, but not those that infect other mammals or aves, show a strong resemblance in codon preference and in amino acid frequencies to most mammalian and avian hosts.
The highest level of molecular adaptation is for proteins that appear abundantly in the virion of viruses that infect humans and mammals.
Viruses show appreciable variation in the selectivity with which they infect host organisms. Some viruses infect a broad range of species, whereas others infect only a single host. A successful viral infection requires that the virus possess the capability to enter the host cell and take over cellular functions and direct them toward the efficient production of new viruses. Most viruses recognize their respective hosts through membrane receptors that have a role in host physiology. Examples of such receptors are gangliosides, heparan sulfate moieties, and integrins (Garrigues et al, 2008), which act as the cell receptors for simian virus 40 (SV40), human cytomegalovirus (HHV‐5), and human herpesvirus 8 (HHV8), respectively. In stark contrast, for some viruses, host range is not limited to the recognition stage (McFadden, 2005). For example, poxviruses bind to and enter a wide range of mammalian cells, but a fruitful replication cycle occurs only in a restricted set of hosts. Replication of poxviruses involves the host cell cycle, signal transduction, transcription factors, phosphatases, and interferon‐induced mediators. Therefore, the features that govern the host range for poxvirus seem to involve a rich collection of host genes (McFadden, 2005).
All viruses are characterized by very high natural mutation rates, with the RNA viruses displaying an exceptionally high rate (Drake, 1993). Co‐evolution and adaptation of viruses to their hosts were mostly studied by comparing mutations at synonymous and non‐synonymous coding sites in specific genes. The fast adaptation of human immunodeficiency virus‐1 (HIV‐1) to specific HLA‐1 epitopes validates the importance of viral evolution at a population level (Kawashima et al, 2009). As of yet, the study of adaptation of viruses toward their hosts has been undertaken for specific viral families, including retroviruses (Bronson and Anderson, 1994), astroviridae (van Hemert et al, 2007), mimivirus (Sau et al, 2006), and bacteriophages (Lucks et al, 2008), but this has not been systematically investigated for all known viral proteomes.
The degeneracy of the genetic code implies that multiple triplets code for the same amino acid. The frequencies with which different codons are used vary significantly between organisms and between proteins within the same organism (Akashi, 2001). Many studies have focused on the bias in codon usage among species. In single cell organisms (prokaryotes, archaea, and some fungi), the codon usage is strongly tuned for highly expressed genes and was thus concluded to be optimized for translational efficiency (Sharp et al, 1988). However, the main trends in multicellular organism codon usage were attributed to the isochore‐dependent genome composition (GC) content, gene architecture, and chromosomal locations (see discussion in Costantini et al, 2009). Still, evidence for codon usage bias toward highly expressed genes and its correlation to tRNA abundance argues that translational efficiency does have a role for some plant, fly, and worm proteomes (Duret, 2000 and references within). Evolutionary forces and multiple molecular processes (e.g., unbiased gene conversion, mutation rates, and genetic drift) have also participated in shaping codon usage in higher eukaryotes (Bernardi, 1986; Duret, 2002). The molecular determinants that have globally influenced the translational efficiency in Escherichia coli (Kudla et al, 2009) and the evolution of polymerase genes in the influenza A virus (Brower‐Sinning et al, 2009) indicate that, in addition to GC content, RNA folding processes also affect the adaptability and translational capacity of viral sequences.
Viruses do not have tRNAs, and consequently the translation of viral proteins relies entirely on the pool of host tRNAs. An exception is the Paramecium bursaria chlorella virus, which contains a partial set of tRNAs and other host‐like properties (Van Etten and Meints, 1999). In a recent study that tested the codon usage adaptation for over 100 bacteriophages infecting 10 different bacterial hosts, it was shown that the bacteriophage genomes are under codon‐selective pressure imposed by the translational biases of their respective hosts (Carbone, 2008). The reasoning underlying this codon selection hypothesis argues that it provides an advantage for viral protein synthesis at the level of translational efficiency.
In viruses infecting multicellular animals, such translational biases may lead to increased virion production rates within the infected cell and reduce the accessibility of viruses to the immune response of the host (Bonhoeffer and Nowak, 1994). However, to the best of our knowledge, the analysis of codon biases of eukaryotic (alongside prokaryotic) viruses compared with their hosts has yet to be undertaken on a large scale. Nevertheless, related phenomena have been described. Specifically, the codon usage bias in the poxviridae family (dsDNA viruses) was determined by measuring the effective number of codons in the viral proteome. Neither the expression level nor the gene size was shown to be a determinant of the measured codon usage biases. Nonetheless, for most poxviruses, the codon usage was close to the value predicted based on the GC content (Barrett et al, 2006). Similar results were shown for coronavirus (Gu et al, 2004) and other vertebrate‐infecting DNA viruses (Shackelton et al, 2006). In papillomavirus, the codon bias was attributed to the AT content rather than to host specificity (Zhao et al, 2003). In the case of retroviruses, it was shown that strong discrimination against CpG sequences directly shapes the codon usage and, as a result, even indirectly restricts the choice of amino acids (Berkhout et al, 2002). Thus, in general, base composition and, specifically, the GC content have thus far been found to be the major determinants of codon usage in vertebrate DNA viruses (Shackelton et al, 2006).
It has been found that for many viruses, genome‐wide mutational pressures override the selection for specific codons (Jenkins and Holmes, 2003). Studies of the evolutionary history of viral adaptation propose a cross talk between codon usage, replication mode, genome size, and host range (Koonin et al, 2006). Furthermore, the observation that there exist both eukaryotic viruses that have adapted their codon usage toward their hosts and those that show little evidence for such adaptation recently prompted the hypothesis that this simply reflects the limited time of the latter for optimization toward their hosts (Barrai et al, 2008). A contrary view would suggest that the extremely high mutation rates in viruses (especially in RNA viruses) outpace the evolutionary processes of selection that drive such optimization of the virus to the host.
In this paper, we set out to determine whether, despite the enormous diversity among viruses, a high‐level, generalized trend of adaptation of viruses toward their hosts can be observed. To this end, we provide a strict virus‐to‐host mapping using a non‐redundant set of representative viruses and hosts, ranging from human to bacteria. We develop a statistical framework for the unbiased assessment of the mutual pairwise distances between all viruses and all recognized hosts. To test the hypothesis of general molecular adaptation of a virus toward its hosts, we focus on codon usage and amino acid preferences within groups of viruses that are unified at varying taxonomical granularities. We observe that all bacteriophages are strongly tuned to match their unique hosts and this correspondence is also evident in their GC contents. However, somewhat surprisingly, viruses that infect humans resemble all mammalian hosts equally, and this similarity even extends to aves and several insects. This observation does not hold for viruses of other mammals, despite a strong similarity among the codon usages of most mammals. Finally, we show that viral selection of codon usage toward that of the host has not occurred uniformly for all proteins of the virus, but it is mainly dominated by the set of proteins expressed in high abundance. The implications of these observations for viral evolution and on the potential for zoonotic epidemics are discussed.
Viral proteomes are biased and poorly annotated
Viruses comprise the largest group of parasitic organisms for which cross talk between the proteomes and their cognate hosts can be studied.
The huge diversity among viruses encompasses their mode of replication, shape, stability, proteome size, and infectivity. These factors impose an inherent difficulty in the classification of viruses into taxonomical groupings. Currently, ∼10% of all sequences in the UniProtKB database (Boutet et al, 2007) (release 14.6) are viral proteins (718 000 proteins). However, full‐length proteins account for only a third of these, and, following the elimination of sequence redundancy (at the level of 90% identity), the number of proteins is reduced to only ∼10% of the original number (72 992 proteins) (Figure 1). In addition, the low fraction of these proteins that are manually reviewed (based on the SwissProt database) results in only 1% of the initial collection (7416 proteins). Furthermore, the relevance of specific virus families to human health has led to a strong bias in the quality and reliability of genome annotation. The majority of viral sequences in the public databases are derived from only a few viral families, whereas most families remain poorly represented. This point is illustrated by HIV, which makes up 36% of all viral protein entries (Figure 1). Half of all viral proteins are from either HIV or hepatitis (Hepadnaviridae) viruses, two families with an indisputable impact on human health. An additional source of bias in analyzing the viral world stems from data that originate from incomplete genomes. The UniProtKB annotation of ‘complete proteome’ covers only 0.5% of all viral sequences.
The collection of proteins from ViralZone, a manually reviewed virus–host web portal that provides information on all known virus genera, overcomes some of these biases. ViralZone lists ∼300 genera of viruses belonging to 80 major families. Associated with each genus is information on the host range and tissue tropism. All viruses are classified by their taxonomical order as well as by the accepted index that divides them into seven classes (Baltimore index I–VII), based on their genetic material and mode of replication. One hundred twenty‐one human‐infecting viruses that belong to 50 genera are currently known (Supplementary Table S1). The uneven partition for human‐infecting viruses among the seven classes is shown (Figure 1B). Class I (dsDNA) and class V (ssRNA(−)) account for 70% of the proteins, but all other classes are also represented among human viruses. By considering all proteins that are known from UniProtKB (a unification of SwissProt and TrEMBL), only 25% of the relevant proteomes are included in classes I and V, whereas the dominating class in terms of the quantity of protein sequences is class VI (ssRNA (RT), including HIV). Proteins belonging to class IV account for ∼50% of the proteins of human‐infecting viruses (total ∼568 000). We used the manually compiled set from SwissProt for analyzing the human viruses throughout this study. Thus, in summary, we chose to focus only on complete proteomes of the representative species to ensure an unbiased and unabridged data set for subsequent analysis, as an uneven representation of viral protein sequences will affect most statistical properties (e.g., codon usage, GC content, and amino acid composition).
Ambiguity in mapping of viruses to their respective hosts
Ambiguity in virus‐to‐host mappings in publicly available databases often reflects missing information regarding a specific host. For example, a virus may be assigned to several hosts described at various levels of the species taxonomical tree (e.g., rodents, primates, and insects). However, only rarely do members of the same virus genus infect hosts differing above the level of class (e.g., mammals), phylum (e.g., chordata), or regnum (e.g., animals). An example of such an uncommon case is the Iridoviridae family (dsDNA viruses), which infects frogs, snakes, insects, and fish. To overcome the ambiguities resulting from virus–host assignments, we adopt a mapping that focuses on the host taxonomical level of interest, which then groups together viruses that infect a unique group of hosts at that particular level.
As an illustrative example (Figure 2), we depict the viruses that infect mammals (excluding humans and other primates). Critically, these mappings account both for the virus under study and its hosts, with respect to the underlying host taxonomical tree. There are 10 host organisms that are infected by 17 viruses. These 17 viruses are represented by 7 types of viruses (Figure 2, V1–V7) that are identical in terms of their defined host range. We show that for the case in which the host‐species level is considered (level A), only a restricted virus‐to‐host mapping can be applied. However, higher taxonomical views (levels B, C, or D) are consistent with a mapping of additional viruses. All further analyses herein will follow such a mapping (see Materials and methods). Note that resolving the ambiguity of assignment of viruses to their hosts is a fundamental precondition for studying virus–host evolution on a large scale.
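The level-dependent mapping can be illustrated with a minimal sketch. The host lineages, virus labels, and the function name `map_viruses` below are hypothetical and are not taken from the data set; the sketch assumes that a virus is retained at a given taxonomical level only when all of its annotated hosts fall into a single group at that level, which is one plausible reading of the mapping described above.

```python
from collections import defaultdict

# Hypothetical host lineages, for illustration only.
LINEAGE = {
    "cow":     {"species": "cow",     "order": "Artiodactyla", "class": "Mammalia"},
    "pig":     {"species": "pig",     "order": "Artiodactyla", "class": "Mammalia"},
    "mouse":   {"species": "mouse",   "order": "Rodentia",     "class": "Mammalia"},
    "chicken": {"species": "chicken", "order": "Galliformes",  "class": "Aves"},
}

def map_viruses(virus_hosts, level):
    """Group viruses by host taxon at `level`, keeping only viruses whose
    hosts all belong to one taxon at that level (assumed rule)."""
    groups = defaultdict(list)
    for virus, hosts in virus_hosts.items():
        taxa = {LINEAGE[h][level] for h in hosts}
        if len(taxa) == 1:  # unambiguous at this level
            groups[taxa.pop()].append(virus)
    return dict(groups)

# Hypothetical virus-to-host annotations.
virus_hosts = {
    "V1": ["cow"],
    "V2": ["cow", "pig"],
    "V3": ["cow", "mouse"],
    "V4": ["mouse", "chicken"],
}
by_species = map_viruses(virus_hosts, "species")
by_order = map_viruses(virus_hosts, "order")
by_class = map_viruses(virus_hosts, "class")
```

In this toy example only one virus can be mapped at the species level, whereas the order and class levels admit two and three viruses, respectively, mirroring the observation that higher taxonomical views are consistent with a mapping of additional viruses.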
Amino acid distribution and codon usage signature
We set out to test the preference of amino acids in viral proteomes vis‐a‐vis their hosts. To this end, we compiled an exhaustive representative set (see Materials and methods) and applied the virus‐to‐host mapping at a high taxonomical level (Figure 2, level C). To start with, we focused on two taxonomical groups: mammals (subdivided into human and nonhuman hosts) and bacteria. This analysis is based on 481 779 and 312 201 amino acids from the respective virus groups. The proteomes of virus representatives that infect humans and those that infect bacteria (bacteriophages) are compared (Figure 3A). It is evident that some amino acids strongly deviate between these two groups. For example, arginine (R) is more prevalent in the viruses of humans (P<10⁻⁶, t‐test with Bonferroni correction), whereas lysine (K) appears more in bacterial proteomes (P<10⁻⁶). A similar trend is seen for isoleucine (I, P<10⁻⁶) and leucine (L, P<10⁻⁶). The source and biological significance of such differences are under investigation and are beyond the scope of this study.
Similarly, we measured the codon usage for each of the 59 codons that code for 1 of the 18 degenerately encoded amino acids (tryptophan and methionine are encoded by only a single codon). As an illustration, we show the codon preferences for arginine (R, 6 codons) and leucine (L, 6 codons), as measured for human‐infecting and mammalian (excluding human) virus groups (Figure 3B). The different usage of each of the amino acids’ codon triplets is evident (χ² test, P<10⁻⁶).
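As a rough illustration of how a 59‐element codon usage vector can be obtained from a coding sequence, consider the following sketch. It assumes the standard genetic code and normalizes counts over all 59 degenerate codons; the normalization actually used in this study is given in Materials and methods and may differ (e.g., per amino acid). The names `DEGENERATE` and `codon_usage` are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The 59 codons of the 18 degenerately encoded amino acids
# (stop codons, Met, and Trp are excluded).
DEGENERATE = sorted(c for c, aa in CODON_TABLE.items() if aa not in "MW*")

def codon_usage(cds):
    """Relative frequencies of the 59 degenerate codons in an in-frame
    coding sequence (assumption: normalized over all 59 codons)."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in DEGENERATE)
    return [counts[c] / total if total else 0.0 for c in DEGENERATE]

# Toy sequence: ATG CTG CTG CGT TGA
usage = codon_usage("ATGCTGCTGCGTTGA")
```

For a group of viruses, the same counting would be applied to the concatenation of all coding sequences in the group before normalization.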
Variability among viral proteomes is greater than for their hosts
To test the range of variability in amino acids and in codon usage within the space of the viruses studied, all representative viruses were divided on the basis of their infectivity toward a taxonomical partition of six high‐level host groups: humans, mammals (excluding humans), vertebrates (excluding mammals—mainly fish and aves), insects, plants, and bacteria. This partition permitted maximal coverage of the virus representatives.
For a particular group of viruses and a given set of hosts, frequency vectors (20 element vectors for amino acids, and 59 element vectors for codons) were calculated. To compare these vectors, we measure the pairwise distance between codon usage (or amino acid distribution) using a distance metric, where lower values indicate greater similarity. We applied multiple measures to determine the distance between any pair of vectors for virus and host. We will present the results obtained using the L2 norm measure. Additional measures that were applied include the L1 norm and DKL (see Materials and methods), but their use has only a negligible impact on the results, supporting the robustness of the analysis performed hereafter.
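The three distance measures named above can be sketched as follows for two frequency vectors of equal length. This is a minimal illustration rather than the implementation used in this study; in particular, the small pseudocount `eps` in the DKL function and the choice of the non‐symmetrized form D(p‖q) are assumptions made here to keep the example well defined when a frequency is zero.

```python
import math

def l2_distance(p, q):
    """Euclidean (L2 norm) distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def l1_distance(p, q):
    """Sum of absolute differences (L1 norm) between two vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def dkl(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(p || q) in nats.
    Assumption: a pseudocount `eps` is added and both vectors are
    renormalized, so that zero frequencies do not cause a division by zero."""
    ps = [a + eps for a in p]
    qs = [b + eps for b in q]
    sp, sq = sum(ps), sum(qs)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(ps, qs))

# Toy two-element frequency vectors.
p = [0.5, 0.5]
q = [0.9, 0.1]
d_l2 = l2_distance(p, q)
d_l1 = l1_distance(p, q)
d_kl = dkl(p, q)
```

For all three measures, a value of zero corresponds to identical vectors and larger values to greater dissimilarity.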
Under the scheme outlined above, we consider the ranked distance of each pair of viruses and host groups relative to the L2 variability among the entire set of tested pairs (36 pairs, covering all 6 major taxonomical groups). Figure 4 shows the subset of results for humans (H), non‐human mammals (M), and non‐mammal vertebrates (V), where we compare pairs of host groups (Ho × Ho), pairs of virus groups (Vir × Vir), and pairs of virus group–host group (Vir × Ho). The results for the amino acid distributions (Figure 4A, left) suggest that the taxonomical host groups are less variable (dominated by blue) than are the respective viral groups (predominantly red). The variability among viruses that infect plants, insects, and bacteria is substantially higher, and so is the variability among the respective host genomes (the full results are found in Supplementary information S1). The resemblance in amino acid preference between viruses and their grouped taxonomical hosts is rather weak except for humans and somewhat for non‐mammal vertebrates (Figure 4B, left; Supplementary information S1).
Source data for Figure 4 [msb200971-sup-0001-SourceData-S1.zip]
The results for a host–host and virus–virus comparison of codon usages (Figure 4A, right) show a trend similar to that for amino acid distributions, namely, substantial similarity among the host groups and enhanced diversity among the corresponding virus groups. Nevertheless, viruses infecting non‐human mammals and viruses infecting non‐mammal vertebrates show an intermediate level of resemblance to each other (green squares), whereas human viruses differ from these two groups.
Next, we tested the similarity between virus and host amino acid and codon preferences, as a measure for coherent adaptation of viruses with respect to their host taxonomical groups (Figure 4B, right). Interestingly, viruses that infect humans are not only adapted to the human host, but are also similar in codon preference to host groups comprising mammals excluding humans (M), and vertebrates excluding mammals (V, mostly viruses of fish and birds). On the other hand, the codon usage of viruses infecting vertebrates is highly dissimilar from that of all host groups shown (the opposite of that for human viruses, Supplementary information S1).
Comparison of codon usages between hosts and between viruses
The similarity between the codon and amino acid preferences of human‐infecting viruses and a wide variety of host organisms (Figure 4) may reflect the non‐unique definition for virus strains that are associated with broad taxonomical host groups. We thus compiled a set of representative viruses derived from an organism‐level view of the hosts (Figure 2, level A), where, in this setting, only viruses that uniquely infect a defined host species are included. The 30 hosts infected by virus representatives unique to their respective hosts are listed in Table I. Most viruses are represented with >1000 codons for each host and 10 of the viruses are supported by >20 000 codons (see Supplementary Table S2). A comparison of the codon usage among the viruses themselves is shown (Figure 5A), indicating enormous variability between viral genomes. Note that the colors in the various matrices range from blue (high similarity) to red (maximal distance); also, as data normalization is performed to obtain ranks for the 900 values (30 × 30 pairs) in each matrix, the matrices can be easily compared. Unlike the intra‐virus comparisons, when the 30 hosts were compared among themselves (Figure 5B), the internal variability in the groups of mammals, plants, and insects was relatively low (especially among the mammal hosts). Nonetheless, among the 10 bacterial hosts tested, the variability is very high (dominated by red color).
Source data for Figure 5A [msb200971-sup-0002-SourceData-S2.xls]
Source data for Figure 5B [msb200971-sup-0003-SourceData-S3.xls]
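The rank normalization that makes the matrices comparable can be sketched as follows. The function name `rank_normalize`, the scaling of ranks to the interval [0, 1], and the averaging of tied ranks are assumptions made for this illustration; the study states only that ranks are obtained for the 900 values in each matrix.

```python
def rank_normalize(matrix):
    """Replace every entry of a distance matrix by its rank among all
    entries, scaled to [0, 1]; tied values share their average rank."""
    flat = sorted(v for row in matrix for v in row)
    n = len(flat)
    first, last = {}, {}
    for i, v in enumerate(flat):
        first.setdefault(v, i)  # lowest sorted position of this value
        last[v] = i             # highest sorted position of this value

    def scaled(v):
        if n == 1:
            return 0.0
        return (first[v] + last[v]) / 2 / (n - 1)

    return [[scaled(v) for v in row] for row in matrix]

# Toy 2 x 2 distance matrix (a 30 x 30 matrix is handled the same way).
ranks = rank_normalize([[0.0, 0.3], [0.3, 0.9]])
```

Under this scheme, a value of 0 would correspond to the smallest distance in the matrix (blue) and a value of 1 to the largest (red), independently of the absolute scale of the underlying distances.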
Adaptation of viruses toward their hosts is shown by GC content and codon usage
It is known that the GC content is a strong determinant in shaping codon usage, specifically in the higher multicellular eukaryotes. As a control experiment, a comparison of the GC content between viruses and their cognate hosts shows that viruses have an overall weak, but significant (R²=0.575, P<10⁻⁵), correlation with their host GC content (Figure 6A). In fact, for bacteria, the partition by host GC content provides a very strong linear association (Figure 6A, blue points, R²=0.927, P<10⁻⁵). However, no significant associations are found between the GC contents of viruses and their hosts for other taxonomic groups. For example, for the 11 mammals analyzed in this study, the correlation was extremely poor (R²=0.065). This can be explained by the fact that although the GC content in mammal‐infecting viruses ranges between 35 and 56%, the range of GC content of the proteomes of the mammal hosts studied (Supplementary Table S3) is rather narrow (50–53%). Thus, we conclude that the correlation between the GC contents of the viruses and their hosts (Figure 6A) is dominated by the bacteriophages matching their unique bacteria.
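For illustration, the GC content of a sequence and the R² of a simple linear association between paired virus and host values can be computed as in the sketch below. The function names and the toy values are hypothetical; the sketch assumes that R² is the squared Pearson correlation coefficient, and it does not reproduce the significance test reported above.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGTU")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0

def r_squared(x, y):
    """Squared Pearson correlation between paired values x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one variable is constant")
    return sxy ** 2 / (sxx * syy)

# Toy paired values (hypothetical host and virus GC fractions).
host_gc = [0.30, 0.40, 0.50, 0.60]
virus_gc = [0.30, 0.50, 0.40, 0.60]
r2 = r_squared(host_gc, virus_gc)
```

The guard against constant input is relevant to the mammalian case discussed above: when the host values span a very narrow range, the denominator becomes small and the estimated R² carries little information.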
As we did not find virus‐to‐host adaptation of GC content with respect to the entire taxonomical spectrum, we proceeded to test the codon usage distances for all pairs of virus and host (Figure 6B); the similarity of the viruses toward their specific hosts (the diagonal of the matrix in Figure 6B) is also summarized in Table II. The adaptation among the bacterial set is very prominent, especially in light of the extreme differences among the different bacterial hosts themselves (Figure 5B; Supplementary information S1). In fact, each bacterial virus shows a very different pattern relative to all other bacterial viruses. In addition, significant levels of resemblance are evident among the different plant viruses and their hosts.
However, the strongest signal observed is the resemblance of human viruses to all mammalian hosts; at the same time, these viruses remain rather different from any of the other mammalian viruses (Figure 5A). Furthermore, the strong similarity of the codon usage of human viruses to all 11 mammalian hosts reaches substantially farther into the taxonomic realm, approaching the insect and bird host species as well (Supplementary Table S3). Interestingly, the viruses that actually infect birds do not show strong adaptation to their hosts (based on viruses that infect chickens). We have shown that human viruses show an unexpected similarity to a broad range of host taxonomical groups, including mammals, avians, most insects, and some plants. Among all tested mammals, only human and rat viruses share strong resemblance in their codon usage profiles. However, owing to the relatively weak support for rat‐infecting viruses (i.e., few proteins, narrower virus representatives), we will focus only on the adaptation of human viruses.
We tested whether the above phenomenon is perhaps dominated by the virus classification scheme. Human‐infecting viruses are found in each of the seven classes (see Materials and methods). However, only for four of the seven classes do there exist three or more proteins derived from viruses that exclusively infect humans. Overall, all four of these human virus classes provide an almost identical codon usage profile when compared with mammals, insects, and plants (not shown), thus precluding such reasoning.
Codon usage resemblance is stronger for structural viral proteins
Most virus proteomes are rather simple and include <10 proteins. A minimal set of proteins comprises the virion structure by building the atomic units of the capsomers. Similarly, most viruses have a replication enzyme such as reverse transcriptase and RNA or DNA polymerase, according to their mode of replication, transcription, and regulation. In some instances (e.g., small DNA viruses and hepadnaviruses), the involvement of host polymerases is essential for the initial phase of viral replication. The rest of the proteome encodes diverse functions that are mostly uncharacterized and are often specialized to the life cycle of the particular virus. We tested the hypothesis that the evolutionary forces underlying codon usage adaptation of the virus may not be determined at the overall genomic level but may instead reflect some functional properties of its proteins.
The variability in viral structure, size, complexity, and shape is enormous. Despite such diversity, we assigned all viral proteins to four mutually exclusive functional sets (Figure 7; see Materials and methods). Figure 7C shows that structural proteins (‘H’) that do not function as host recognition elements are characterized by the highest levels of codon usage similarity with their respective host (i.e., lower L2 distance measure). This diverse group includes proteins that participate in packing and covering the DNA, as well as the structural proteins that build the core of the virions. On the other hand, proteins that are expressed on the surface (‘R’), which are molecules that participate in recognition of the host receptors, show the largest deviation from the host relative to the other defined groups. The polymerases and additional nucleic acid‐related enzymes (‘EC’) show an intermediate level of resemblance to host codon usage.
Source data for Figure 7C [msb200971-sup-0004-SourceData-S4.zip]
As early as 20 years ago, a correlation was detected between the prevalence of dinucleotides in viruses and their hosts (Barrai et al, 1990). Although these data were based on a very limited set of sequences, the main conclusion remains accurate in view of the current scale of sequenced data, which suggests an active adaptation process of viruses toward their hosts. We found that the huge amounts of data regarding viral genomes and the genomes of their respective hosts have enabled the compilation of a balanced data set for further analysis (Figure 1).
The viral space displays enormous diversity but high redundancy
In this study, we set out to analyze the overall potential adaptability of known virus families by using the complete set of virus representatives that reflect the current knowledge of all major viruses. Clearly, our analyses are strongly dependent on the correct mapping of viruses to their hosts. We analyzed 122 viruses at the higher taxonomical levels and 64 viruses at the host‐species level, where each such virus represents a different viral genus (Table I; Supplementary Table S2). By the strict mapping of a set of 30 virus genera that exclusively infect 30 different hosts, we limited the set of viruses and were often left with only one or very few representatives for a unique host. It is possible that the set of viruses that have a restricted range of hosts is skewed. They may reflect (i) poorly studied cases, which lead to partial information regarding the virus and its hosts; and (ii) cases where the dependency on the virus–host pair is stronger because of a specific molecular barrier that restricts the host range. We cannot separate these two instances, but for many cases the restricted host assignment is supported by a large body of literature. For example, the observed overwhelming similarity in amino acid distributions and in the codon usage among viruses infecting the tomato, lettuce, rice, and arabidopsis plants (Figure 4) seems more likely to be a result of incomplete annotation in the viral database, where these viruses overlap in host range and, in reality, each infects the other plants but is simply not yet annotated as such. This is in accord with the current view of plant‐infecting viruses (Roossinck, 1997). Although the statistical power for some analyses may be affected by this ‘reduction to representatives,’ we argue that the trends observed in this study hold and will be further substantiated when additional viruses with accurate annotations become available.
We analyzed a representative set of viruses irrespective of their mode of replication according to a partition to the 7 standard classes (Melnick, 1972) that are indicated by I–VII. Note that host specificity does not determine the class. For example, although most viruses that infect humans (Supplementary Table S1) belong to class I, human‐infecting viruses are represented in all other classes, including the well‐known health‐threatening viruses, such as the Coronavirus family (SARS, class IV), Lentivirus (AIDS virus, class VI), and Ebola virus (class V). The genomic structures, nucleotide composition, replicative mode, replication time, and rates of mutation in the different classes are estimated to show great differences, where, for example, RNA viruses mutate much faster than other groups. Despite these differences, the analysis of human‐infecting viruses based on this classification showed that the observed adaptation of human viruses characterizes viruses from a broad range of life cycles and replication modes.
Testing codon and amino acid usage (rather than more direct measures of substitution and mutation rate) has the advantage that it provides a view of the variability of the viral proteome relative to its potential hosts (Figures 4, 5 and 6). Nonetheless, our observations cannot provide insights into the dynamics or rates of viral evolution. Studies that estimate the diversity among viruses and their hosts often focus on those having high enough mutation rates or short generation times, resulting in increased genetic diversity (van Hemert et al, 2007). Our analysis is thus complementary to such studies.
Adaptation of viruses toward their hosts
In this paper, we observed that all mammalian genomes have similar codon usage. Furthermore, we found that human viruses share this common codon usage with their human host; on the other hand, other mammalian viruses do not. Theoretically, this could derive from a situation where, for some reason, only human viruses are required to adapt their codon usage to successfully infect their host, whereas this adaptation does not seem critical for the viruses of other mammals. More likely explanations may be related to the recent expansion of humans and the co‐evolution of their viruses, or to the hypothesis that large portions of the human genome are actually of viral origin (Kazazian, 2004).
A high similarity was reported earlier between the codon usage of bacteriophages and their hosts (Lucks et al, 2008). In that study, the authors analyzed a large set of bacteriophages and isolated the effect of the GC (i.e., GC content) and the adaptation of specific viral codons toward the primary bacterial host. Interestingly, for about 40% of the viruses, host‐preferred codons were selected, which suggests that adaptation toward the host has a strong role in viral evolution. In addition, they found that structural proteins show maximal similarity toward the host‐preferred codon, in accordance with our finding regarding the high degree of adaptation for highly abundant proteins (Figure 7C).
Here, we found similar codon usages among viruses, among hosts, and for virus–host pairs. Similarity in codon usage in different viruses can be partly explained by the occurrence of lateral gene transfer (LGT) and other modes of genetic material exchange. Accordingly, recent recombination events between the host and the virus may leave behind similar codon frequencies. Yet we do not believe this phenomenon to be a major determinant in codon usage adaptation as (i) it is unlikely that the codon usage of some functional groups but not of the entire proteome will show differences in the patterns observed (Figure 7C); (ii) there is no evidence that among the mammals we tested here some are more likely to be affected by LGT than others, yet human viruses show a significantly different pattern than viruses of other mammals; (iii) different classes of viruses (class I–VII) have similar adaptation trends, despite substantial differences in the potential for the exchange of genetic material with the host in RNA and DNA viruses. Thus, although it is unlikely that LGT dominates the observed resemblance of codon usage between eukaryotic viruses and their hosts, this does not hold for bacteria and archaea, which are exposed to high frequencies of LGT events.
An interesting case of co‐evolution with expected restrictions on infectivity is that of viruses that infect hosts that use alternative genetic code assignments. Indeed, studies on mitoviruses that infect fungal mitochondria led to insights on host limitation that are imposed by the use of a specialized genetic code (Shackelton and Holmes, 2008).
Possible selection for translational efficiency in mammalian viruses
In our study, the similarity between the codon usage of human viruses and that of mammals, birds, and some insects is not duplicated for other mammalian viruses (Figure 6). Furthermore, the signal observed for codon usage exceeds that detected for amino acid distributions, potentially indicating selection for translational efficiency.
The number of protein products in the viral capsid can reach thousands; for example, the mature HIV‐1 contains 1572 capsid proteins. The African swine fever virus (family Asfarviridae) consists of ∼1900–2200 capsomers. On the other hand, recognition proteins on the viral surface are not necessarily expressed in such large amounts. A partition of structural proteins and enzymes is based on ‘virion properties’ from the ICTV database (http://www.ncbi.nlm.nih.gov/ICTVdb). Currently, on the basis of 3D structure, sparse data on the stoichiometry of virion composition are available. For example, the Adenoviridae virus genome encodes 10 structural proteins and ∼30 non‐structural proteins. The capsid is composed of 720 copies of the major hexon protein (protein II, 988 aa), 64 and 60 copies that build the penton (proteins III and IIIa, respectively), 180 copies of the minor core (protein V), but only 12 copies of the recognition fiber (protein IV, 582 aa).
We found that for mammalian viruses, the proteins that appear in the virion in high numbers (Figure 7, marked ‘H’) are the ones with codon usage most similar to that of their hosts. In the case of human viruses, we can see that highly expressed genes in different viruses that infect the same host preferentially use codons similar to those of humans and of each other (Figure 7C). On the other hand, the surface proteins that participate in recognition are often expressed in lower quantities and display a rather low adaptation level toward their hosts (marked ‘R’). A complementary explanation may rely on the positive selection paradigm that was proposed in virus–host recognition (Sawyer et al, 2005). The enzymes (marked ‘EC’), which are generally expressed in minute amounts, show only an intermediate codon usage similarity. Thus, overall, these results further strengthen the case for translational selection. Note that earlier studies did not find evidence for translational selection operating on mammalian genes (see discussion in dos Reis and Wernisch, 2009; Semon et al, 2006, and references within). It may be possible that such selection does exist, but that its effects are weak because of the low effective mammalian population sizes. On the other hand, viruses infecting mammals have larger effective population sizes and a shorter generation time (dos Reis and Wernisch, 2009). Thus, an analysis similar to that performed here may be able to identify translational selection in genomes in which it was impossible to do so earlier.
In the case of bacterial viruses (Lucks et al, 2008), we were unable to consistently and reliably partition the proteins that are involved in recognition from those that are abundant, because of the enormous variability in shape and recognition mode among bacteriophages. Our results agree with a role of translational selection and extend it toward mammalian viruses, where it may have a role in their evolutionary fitness. However, this adaptation may be of lesser importance, as a critical obstacle for viruses that infect mammals is the need to invade their host cells, while bypassing an active immune system (whereas no such extensive system exists in bacterial hosts). For example, the HIV virus has adopted recognition strategies that overcome the immune barrier (Holmes et al, 1992).
Host range, tissue specificity, and codon usage similarity
It is known that a change in only a few amino acids of viral proteins can lead to a shift in the host infectivity range. Such a shift occurs through a genetic adaptation process that overcomes the hurdles of viral entry and replication in a new cellular environment. ΦX174 bacteriophage, which normally grows on E. coli, was switched to infect Salmonella, where this shift was attributed to only a very few mutations (2–3) in the major capsid gene (Crill et al, 2000). This phenomenon is not unique to bacterial viruses, as this has occurred in canine parvovirus, which appeared in the late 1970s as a variant of a feline parvovirus. The host shift was attributed to only two to three substitutions (Truyen et al, 1995). A shift in host recognition was also shown in the case of HIV‐1, where a single mutation in the envelope gene was sufficient to alter cell specificity (Rambaut et al, 2004). In all these strategies, virus–host shift is based on modifications in the virus receptor recognition step. However, it has been shown that host range is not entirely dependent on the initial recognition stage (McFadden, 2005).
Our results on the high adaptation in codon usage, especially for human viruses, suggest that viral envelope/capsid proteins have the potential to be a factor in infectivity and efficiency. Furthermore, our observation that some viruses are adapted toward multiple hosts, in terms of their codon usage, suggests that such adaptation may facilitate the expansion of host infectivity.
In multicellular organisms, viruses do not infect the organism but rather are restricted to a specific organ, tissue, or cell type (Gallagher and Buchmeier, 2001). Throughout this study, we presented data that use the average codon usage of the organism as a reference measure to study adaptation. With the fast growth of high‐quality mass spectrometry proteomics data from different tissues and cell types, the notion of resemblance between viruses and their hosts under the assumption of translational (and not transcriptional) efficiency at the tissue and cell‐type levels will be of great interest.
Adaptation and human health
Studying the evolution of viral codon usage and amino acid preferences in view of their hosts is fundamental in developing strategies for managing viral infections in the scope of human health, agriculture, and the environment. Insight into such phenomena was used in the laboratory, for example, when unfavorable codon pairs of capsid poxvirus proteins were injected into infected mice, resulting in virus attenuation (Coleman et al, 2008). Similarly, neuroattenuated phenotype was associated with codon preference deoptimization in polioviruses (Mueller et al, 2006). In a common vaccination practice, a live, attenuated virus is produced by adaptation to a new host, thereby eliminating its virulence to humans. As we have found that human‐infecting viruses have conserved and unique codon usages, we propose that a fine‐tuning of codon deoptimization may allow the alteration of tissue tropism and virulence attenuation.
In addition, shifts in hosts have huge implications on human health and on the world economy, for example, zoonotic epidemics. Known examples of naturally occurring host–virus shifts are the introduction of HIV‐1 to humans in the early 1950s and the shift in the SARS (CoV) virus that crossed over to infect humans only very recently. The worldwide threat of influenza‐based epidemics, such as the transmission of avian flu (Influenza A virus, H5N1) to humans and the latest outbreak of swine influenza (H1N1, April 2009) in Mexico, is heightened by the rapid evolution of the Influenza virus witnessed during the last decade; recently, H3N2 and H3N8 were introduced from humans to pigs and from horses to dogs, respectively (Campitelli et al, 1997). It is likely that the domestication and close interaction between humans, rats, and farm animals for thousands of years has led to the evolution of viruses that infect humans and are adapted toward a broad range of hosts. The similarities in codon usage and amino acid composition that we have observed in this work can somewhat relate to the potential for zoonosis. Although, as discussed above, these molecular properties are neither necessary nor sufficient conditions for host shifts, our analysis can nevertheless contribute to a framework that would permit analysis of the potential of certain viruses to adapt to new host species.
Materials and methods
Proteins for all organisms were collected from UniProt (Apweiler et al, 2004). Virus proteins were collected from ViralZone (http://www.expasy.ch/viralzone, coordinated by UniProt/SwissProt), which holds 314 reference strain viruses that belong to 80 families and 291 genera. ViralZone provides reviewed data that cover molecular information (shape, genome and replication mode, and capsomer composition), epidemiological data, cell tropism, and host range. Each genus is specified by a manually selected representative (in some cases, >1). All viruses are classified into seven classes: (I) double‐stranded DNA viruses, (II) single‐stranded DNA viruses, (III) double‐stranded RNA viruses, (IV) positive‐sense single‐stranded RNA viruses, (V) negative‐sense single‐stranded RNA viruses, (VI) positive‐sense single‐stranded RNA viruses that replicate through a DNA intermediate, and (VII) double‐stranded DNA viruses that replicate through a single‐stranded RNA intermediate. Fragmented proteins and polyproteins were filtered out. Coding sequences were collected from EMBL through an SRS querying system that links UniProt proteins to their respective EMBL coding sequences. As one protein is often associated with multiple sequences, we extracted all data as mapped by EMBL to UniProt ID. This collection of virus proteins in UniProt covers ∼13 000 proteins that are reviewed (SwissProt) and additionally ∼730 000 from a non‐reviewed TrEMBL resource.
We selected 30 organisms and 30 matched viruses (Supplementary Table S1) that are unique (i.e., assigned to a specific organism, Figure 2). Taxonomical views that have very little support (<2 proteins, <500 amino acids, or <700 codons) were eliminated. Note that the representative virus (reference strain) corresponds to tens of other viruses that are poorly annotated and thus are not selected as representatives. The mapping of a representative to other viruses is based on the ViralZone mapping.
For each group of (virus or host) genes, codon usage frequencies were calculated independently for each of the amino acids. For each of the 18 degenerately encoded amino acids, the empirical frequencies of its corresponding codons were counted and normalized to sum to 1. The other two amino acids, tryptophan (W) and methionine (M), each have a single codon and were not included in the analysis. Thus, each of the 59 redundant codons that account for these 18 amino acids was assigned a number between 0 and 1. The GC content of each virus–host pair was also calculated independently and was assigned a number between 0 and 1.
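The per‐amino‐acid normalization described here can be illustrated with a minimal Python sketch. It assumes the standard genetic code and in‐frame coding sequences; the function names `codon_usage` and `gc_content` are hypothetical and are not taken from the original analysis pipeline.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# no alternative genetic code is in use for the sequences analyzed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# The 59 redundant codons: everything except stops, Met (M) and Trp (W).
REDUNDANT_CODONS = sorted(c for c, a in CODON_TABLE.items() if a not in "*MW")


def codon_usage(cds_list):
    """Return a 59-coordinate vector; codons of one amino acid sum to 1."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            # Codons with ambiguous bases are not in the table and are skipped.
            if CODON_TABLE.get(codon, "*") not in "*MW":
                counts[codon] += 1
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    usage = []
    for codon in REDUNDANT_CODONS:
        total = per_amino_acid[CODON_TABLE[codon]]
        # Amino acids never observed get 0 for all their codons (a choice
        # made for this sketch only).
        usage.append(counts[codon] / total if total else 0.0)
    return usage


def gc_content(seqs):
    """Fraction of G and C among all nucleotides, between 0 and 1."""
    joined = "".join(seqs).upper()
    return sum(joined.count(b) for b in "GC") / len(joined) if joined else 0.0
```

In this sketch the vector is ordered alphabetically by codon, so two groups of genes can be compared coordinate by coordinate.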
Divergence between the codon usage of two viruses, two hosts, or virus and host was estimated according to the distances between their usage vectors. Specifically, for each group, a usage vector of 59 coordinates, denoted as F=(f1,…,f59), was calculated as described above. The distance between two such vectors, F=(f1,…,f59) and G=(g1,…,g59), was measured in two different ways: once as the L1 distance, $d_{L1}(F,G)=\sum_{i=1}^{59}|f_i-g_i|$, and the second time as the Euclidean (L2) distance, $d_{L2}(F,G)=\sqrt{\sum_{i=1}^{59}(f_i-g_i)^2}$. For differences in the amino acid frequencies between two species, the same method was used, with the corresponding 20‐coordinated vectors.
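As an illustration, the L1 and Euclidean (L2) distances between two usage vectors can be computed as in the following minimal sketch. The function names are hypothetical; the same code applies to the 59‐coordinate codon vectors and to the 20‐coordinate amino acid vectors.

```python
import math


def l1_distance(f, g):
    """Sum of absolute coordinate-wise differences."""
    if len(f) != len(g):
        raise ValueError("usage vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(f, g))


def l2_distance(f, g):
    """Euclidean distance: square root of the summed squared differences."""
    if len(f) != len(g):
        raise ValueError("usage vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))
```

A lower distance indicates a closer resemblance between the two codon usages being compared.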
The codon usage differences were also measured in a manner that integrates the amino acid frequencies, where the 59 codons were assigned their empirical frequencies in the data, regardless of their corresponding amino acid frequency. This quantification results in a probability vector P=(p1,…,p59), where $\sum_{i=1}^{59} p_i = 1$.
For this representation, the differences between two codon usages (P=(p1,…,p59), Q=(q1,…,q59)) were measured using their KL divergences (DKL, Kullback–Leibler divergence), where $D_{KL}(P\|Q)=\sum_{i=1}^{59} p_i \log(p_i/q_i)$ and $D_{KL}(Q\|P)=\sum_{i=1}^{59} q_i \log(q_i/p_i)$.
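The Kullback–Leibler divergence between two codon probability vectors, in its standard form D_KL(P||Q) = Σ p_i log(p_i/q_i), can be sketched as follows. The natural logarithm and the treatment of zero frequencies are assumptions of this sketch, as the text does not specify them; the function names are hypothetical.

```python
import math


def to_probability_vector(counts):
    """Normalize raw codon counts so that the coordinates sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no codons counted")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """D_KL(P||Q) with natural log (assumed); asymmetric in P and Q."""
    if len(p) != len(q):
        raise ValueError("probability vectors must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a zero-probability term contributes nothing
        if qi == 0:
            return math.inf  # P uses a codon that Q never uses
        total += pi * math.log(pi / qi)
    return total
```

Because the measure is asymmetric, both `kl_divergence(p, q)` and `kl_divergence(q, p)` are needed to describe a pair of codon usages.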
For each partition of the host taxonomy that we considered, we included a virus in the calculations only if there was not more than one taxonomic class that it is capable of infecting. Formally, for each virus v, define h(v) to be the set of host species that it can infect. And, let C1, …, Ck be a disjoint partition of the host organisms under study. Now, for a particular virus v, consider the least common ancestor (LCA) of the host species of v in the host taxonomic tree: LCA(h(v)). If there exists a single cluster Ci (1⩽i⩽k) such that LCA(h(v)) is a descendant of Ci (possibly Ci itself), then we uniquely map virus v to be among the viruses that infect the taxonomic sub‐tree rooted at Ci.
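The mapping rule above can be sketched in Python as follows, assuming that the host taxonomy is stored as a child‐to‐parent dictionary. The function names and the toy taxonomy in the usage note are hypothetical and serve only as an illustration.

```python
def path_to_root(node, parent):
    """List the node followed by its ancestors, up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def least_common_ancestor(nodes, parent):
    """Deepest taxon shared by the lineages of all given nodes."""
    if not nodes:
        return None
    paths = [path_to_root(n, parent) for n in nodes]
    shared = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    # Walking upward from the first node, the first shared taxon is the LCA.
    for taxon in paths[0]:
        if taxon in shared:
            return taxon
    return None


def map_virus_to_cluster(hosts, clusters, parent):
    """Return the unique cluster C_i containing LCA(h(v)), else None."""
    lca = least_common_ancestor(hosts, parent)
    if lca is None:
        return None
    lineage = path_to_root(lca, parent)
    hits = [c for c in clusters if c in lineage]
    return hits[0] if len(hits) == 1 else None
```

For example, with `parent = {"Homo sapiens": "Mammalia", "Mus musculus": "Mammalia", "Gallus gallus": "Aves", "Mammalia": "Vertebrata", "Aves": "Vertebrata"}` and clusters `["Mammalia", "Aves"]`, a virus whose hosts are human and mouse maps to `"Mammalia"`, whereas a virus that infects both human and chicken has its LCA above both clusters and is excluded.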
Division of viral proteins into functional categories
We divided all mammalian virus proteins into one of four classes: (i) recognition receptors on the surface, for example, coat, spike, glycoprotein, or envelope (Figure 7, orange frames); (ii) enzymes (as annotated by the EC classification according to UniProt—mostly polymerases, purple frames); (iii) capsomers and structural units, including tegument, nucleoproteins, and capsids in enveloped viruses (Figure 7, blue frame); and (iv) proteins that are either unknown or cannot be uniquely assigned to the other three functional sets (see Supplementary Table S4). This assignment was performed manually, addressing the proteins with multiple functions or non‐exclusive functional assignments (mainly in filamentous phage and other bacteriophages).
We thank Nati Linial for valuable suggestions and a critical view throughout the research. We thank the ProtoNet research group for comments and discussions. YP and MF are fellows of the SCCB, the Sudarsky Center for Computational Biology. This research is partially supported by EU Prospects FR7 and a grant from the Israel Science Foundation (ISF).
Conflict of Interest
The authors declare that they have no conflict of interest.
Supplementary tables S2‐3, Supplementary data S1‐2 [msb200971-sup-0001.pdf]
Supplementary Table S1
Human infecting viruses. [msb200971-sup-0002.xls]
Supplementary Table S4
Proteins from human infecting viruses based on ‘complete proteome’ are listed according to their unique UniProt accession. [msb200971-sup-0003.xls]
This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.
Copyright © 2009 EMBO and Nature Publishing Group