Large‐scale sequencing of environmental microorganisms (including genomes, metagenomes, and metatranscriptomes) now allows researchers to apply sequence‐based evolutionary genetics approaches to microbial ecology questions, including studying how microbes adapt to a diversity of natural environments. There is, however, a danger for misapplication and misinterpretation if the caveats and limitations of these evolutionary methods are not fully appreciated. Here, we highlight an example of such problems in an evolutionary analysis of potential positive selection within strains of SAR11 marine bacteria (Brown et al, 2012). We feel that these issues deserve wider recognition, particularly as evolutionary approaches become more widespread among environmental microbiologists.
In the vast majority of genes, the ratio (ω) of non‐synonymous substitutions per non‐synonymous site (dN) to synonymous substitutions per synonymous site (dS) is <1, which is evidence of purifying selection, acting to remove deleterious mutations (most of which are non‐synonymous). If ω is >1, then this may be indicative of positive selection favoring amino‐acid replacements (Hughes and Nei, 1988, Hughes, 1999). If ω is not statistically different from 1, then there is evidence for neither purifying nor positive selection. Positive selection is not the only factor that can cause ω to be >1; since ω is a ratio, an unusually high ω value may indicate either an unusually high dN or an unusually low dS. For example, purifying selection on synonymous mutations can lead to a low dS and thus high ω (Parmley and Hurst, 2007). Approaches for calculating dS and dN can be classified into two categories, based on whether the underlying nucleotide evolutionary model takes into account the base/codon frequency bias (Supplementary Table S1; Zhang et al, 2006). For instance, the Yang and Nielsen method (YN00) considers the base/codon frequency bias (Yang and Nielsen, 2000), whereas the Nei and Gojobori method (NG86) does not (Nei and Gojobori, 1986). Taking base/codon frequency into account can be important when nucleotide content biases are sufficiently strong to reduce the observed number of synonymous substitutions.
In a recent publication, Brown et al (2012) calculated ω using the NG86 method over full gene alignments as well as over 60‐codon sliding windows for orthologs shared by two SAR11 genomes, IMCC9063 from polar waters and HIMB114 from tropical waters. They identified 116 genes that had at least one window showing ω>1, and claimed that these genes are under positive selection (hereafter, 116 PS genes). They conducted a similar analysis with two other polar‐tropical pairs of genomes, ACE_P3.2 (polar) versus HIMB114 (96 PS genes), and ACE_P1a.1 (polar) versus HTCC7211 (tropical) (50 PS genes). However, there are problems with the interpretation of ω in these analyses. Here, we present in‐depth analyses of ω calculation for orthologous genes shared by IMCC9063 and HIMB114. We did not include the other two genome pairs of Brown et al (2012) because the ACE_P3.2 and ACE_P1a.1 sequences were derived from metagenomes, which could be subject to assembly errors. However, similar average nucleotide identity (ANI) between all pairs (74%, 70%, and 76%) suggests that conclusions from the IMCC9063/HIMB114 analysis will also be relevant to these pairs.
The 1083 orthologous genes shared by IMCC9063 and HIMB114 (Supplementary Table S2) were aligned using the same approach as in Brown et al (2012). The median of G+C content at the third codon position (%GC3) is 0.20 among all orthologous genes and 0.19 among the 116 PS genes, suggesting a strong bias in base/codon usage. Using PAML software (Yang, 1997), the NG86 method (which was used by Brown et al and does not consider base/codon frequency bias) and YN00 method (which considers base/codon frequency bias) gave similar estimates of dN (median=0.29 and 0.27, respectively), but very different estimates of dS (median=1.47 and 3.64, respectively) (Supplementary Table S3). We hypothesized that the failure to account for base/codon frequency bias in genomes with evidence for strong bias causes the NG86 approach to underestimate dS in these data sets.
To test this hypothesis, we employed several other methods in common use for calculating ω (implemented in the KaKs_Calculator software of Zhang et al, 2006). Methods that assume equal base frequency (hereafter, ‘equal frequency methods’; these include NG86, LWL85, LPB93, MLWL04, and MLPB04) consistently produced greater estimates of ω and lower estimates of dS than those which take into account the base/codon frequency bias (hereafter, ‘frequency bias methods’; these include YN00, GY94, MYN06, MS06, and MA06; these 10 methods are all referred in Zhang et al, 2006), but comparable results for dN. For simplicity, only comparison of MS06 with NG86 was shown (Figure 1), but the results of other frequency bias methods were similar. The best‐fit model for ω calculation among the 14 candidate DNA sequence evolutionary models (Supplementary Table S1; Zhang et al, 2006) was always one of the latter models (Supplementary Table S4) using corrected Akaike information criterion (AICC).
Thus, equal frequency methods do not provide reliable estimates of dS and ω for base/codon frequency biased genomes, such as IMCC9063 and HIMB114. Further, because synonymous sites are saturated with substitutions when dS » 1.0 (Ochman et al, 1999), the finding that the dS of most orthologous genes was ≥3 using the frequency bias methods (Figure 1B) suggests that even these methods cannot accurately estimate ω at the level of genetic divergence of these genome pairs.
Because stochastic errors can be substantial in a short sequence, the calculation of ω over 60‐codon sliding windows by Brown et al (2012) was particularly problematic. Using this approach, dS is significantly correlated with %GC3 in 76% of all genes and 70% of the 116 PS genes (Supplementary Table S3). Among the significantly correlated genes, 92% are positively correlated (Supplementary Table S3). We identified 107 genes, which contain at least one window of 60 codons showing ω>1 using the NG86 method employed by Brown et al (2012), including 92 that were found in the 116 PS genes. For all 2951 windows with ω>1 (Supplementary Table S5), the Spearman rank correlation between dS and %GC3 was 0.506 (P<0.001). The phenomenon of the correlation between dS and %GC3 suggests that the small sliding window analysis is not appropriate; G+C content is a significant determinant of the calculated dS and thus affects ω for these windows. This is true whatever factors (including stochastic error, mutational bias, or natural selection) cause variation in G+C content.
A final concern in the approach of Brown et al (2012) is that the statistical significance of cases in which ω>1 was not determined. A Fisher's exact test (implemented in the KaKs_Calculator) showed that only one ortholog pair (HIMB114_0556 and SAR11G3_00373) contained at least one 60‐codon window showing ω>1 with statistical significance (Supplementary Table S6). Here, ω was estimated by NG86 using KaKs_Calculator, which is slightly different from that estimated by NG86 using PAML. This ortholog pair contains five such windows, all of which have extremely low %GC3, ranging from 0.08 to 0.11, and low dS compared with other windows in the same gene, yet dN is only slightly greater than the median, suggesting that the low G+C content reduced the number of observed synonymous substitutions in these windows, resulting in ω>1. In addition, this orthologous gene encodes a putative porphobilinogen deaminase; there is no evidence showing that it is involved in cold adaptation.
In conclusion, the finding of Brown et al (2012) that positive selection drives biogeographic separation of polar and tropical phylotypes of SAR11 populations is based on an evolutionary model that does not account for the low G+C/high codon bias of the genomes being analyzed. Our analysis of evolutionary statistics using optimized evolutionary models does not support the contention that 116 genes shared by polar and tropical SAR11 populations have diverged as a result of positive selection.
We thank MA Moran for discussion, K Konstantinidis for providing the script for computing ANI, Z Zhang for advice on using KaKs_Calculator, and C English with graphics. This research was funded by grants from the Gordon and Betty Moore Foundation and the National Science Foundation (MCB‐0702125).
Conflict of Interest
The authors declare that they have no conflict of interest.
Supplementary Table S1 [msb201258-sup-0001.doc]
Supplementary Table S2 [msb201258-sup-0002.xls]
Supplementary Table S3 [msb201258-sup-0003.xls]
Supplementary Table S4 [msb201258-sup-0004.xls]
Supplementary Table S5 [msb201258-sup-0005.xls]
Supplementary Table S6 [msb201258-sup-0006.xls]
This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.
- Copyright © 2012 EMBO and Macmillan Publishers Limited