“Where there is life there is wishful thinking” Gerald F. Lieberman
Finding genes that are under positive selection is an important part of any molecular evolutionary biologist’s work, as these genes can be responsible for adaptations in the species under study. To find such genes, researchers conduct genomic scans, study further the regions of the genome that show specific patterns such as selective sweeps, and propose sensible biological interpretations. In this paper, Pavlidis et al. show that one has to be careful with such interpretations: the patterns associated with positive selection can appear in a genome known a priori to be evolving neutrally, and it might not be that difficult to come up with a satisfying story about such false positives.
Figure 1 | Flowchart representing the steps in the simulation. These steps were repeated for all of the 100 simulations.
To show the existence of false positives in the detection of positive selection patterns, Pavlidis et al. simulated 100 data sets of 40 D. melanogaster X chromosomes evolving under a neutral Wright-Fisher model. The D. melanogaster population in question, which was sampled in the Netherlands, is believed to have gone through a recent and deep bottleneck. A demographic scenario for that population was inferred using the Markovian Coalescent Simulator (MaCS) software. The group then used the SweepFinder program to find the regions characteristic of selective sweeps in these artificial, neutrally evolving genomes and mapped them to the actual X chromosome using FlyBase, which allowed the identified genes to be named. Interesting genes were detected and biological meaning was assigned using the Gene Ontology Statistics (g:GOSt) module of g:Profiler. A “convincing” narrative was then given (Figure 1).
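To get a feel for what "neutral Wright-Fisher evolution through a bottleneck" means, here is a minimal forward-time sketch for a single biallelic site. It is only a toy: the authors used a coalescent simulator on whole chromosomes, and the function name, population sizes and seed below are assumptions made up for illustration.

```python
import random

def wright_fisher_trajectory(pop_sizes, p0, seed=None):
    """Track a neutral allele's frequency across generations.

    pop_sizes: number of gene copies in each generation
               (a bottleneck is simply a run of small values).
    p0: starting allele frequency.
    """
    rng = random.Random(seed)
    freq = p0
    trajectory = [freq]
    for n in pop_sizes:
        # Every copy in the next generation picks its parent at random,
        # so the new allele count is a binomial draw. No selection involved.
        count = sum(1 for _ in range(n) if rng.random() < freq)
        freq = count / n
        trajectory.append(freq)
    return trajectory

# Hypothetical demography: large population, sharp bottleneck, recovery.
demography = [1000] * 20 + [20] * 10 + [1000] * 20
traj = wright_fisher_trajectory(demography, p0=0.5, seed=1)
print(f"start={traj[0]:.2f} end={traj[-1]:.2f}")
```

During the bottleneck generations the frequency jumps around much more than before or after, which is how drift alone can push alleles towards very low or very high frequencies.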
The results showed that, on average, 43 regions per simulation (min. 27, max. 60) were found where the site frequency spectrum (SFS) was shifted towards low- and high-frequency derived alleles. Such a lack of intermediate allele frequencies is characteristic of a recent selective sweep, and these patterns are indistinguishable from those left by selective sweeps occurring in nature under selective pressures (Figure 2). The detected regions were then mapped to the real X chromosome using FlyBase, as mentioned earlier.
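For readers unfamiliar with the SFS, the sketch below shows how it could be tabulated from a small sample of chromosomes. It assumes the data are already coded as 0 (ancestral) and 1 (derived) per site; the function name and the toy sample are hypothetical, and this is not how SweepFinder itself works internally.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry k-1 is the number of sites at which exactly
    k chromosomes carry the derived allele (k = 1 .. n-1).

    haplotypes: list of equal-length sequences of 0 (ancestral) / 1 (derived).
    """
    n = len(haplotypes)
    # zip(*haplotypes) walks the sample site by site (column by column).
    derived_counts = [sum(column) for column in zip(*haplotypes)]
    counts = Counter(derived_counts)
    # Sites with 0 or n derived copies are not segregating and are left out.
    return [counts.get(k, 0) for k in range(1, n)]

# Toy sample: 4 chromosomes, 5 sites.
sample = [
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(site_frequency_spectrum(sample))  # [2, 1, 1]
```

A sweep-like region would show an excess in the first and last entries of this list and a deficit in the middle, which is the shift described above.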
Across the 100 simulated data sets, the g:GOSt enrichment analysis of the detected regions yielded on average 5.19 statistically significant categories per data set, with 77 sets yielding at least one significant category and 16 giving rise to more than 10 significant categories. To compare these results quantitatively with real data, an enrichment analysis was done on 37 inbred lines of D. melanogaster sampled in North Carolina, a population that is also accepted to have gone through a very recent and deep bottleneck. This real-data enrichment analysis returned 9 statistically significant terms, related to transcription factor binding sites. This important result shows that the number of biological terms obtained with a g:GOSt enrichment analysis is not higher in the real data than in the simulated data sets. A few issues in the model were also addressed.
1. It is known that bottlenecks increase the proportion of false positives in neutrality tests, so the group ran another set of simulations with a milder bottleneck model. The g:GOSt analysis still yielded significant categories in 85% of the simulated data sets.
2. It is known that large recombination rates result in different coalescent genealogies every few base pairs, thus hiding any selective sweep, and that small recombination rates reduce the independence of genes from the rest of the genome, thus preventing distinct selective sweeps from showing up. To address this issue, the group ran more simulations with 5 different combinations of recombination rates and bottleneck models. The g:GOSt analysis did not show substantial differences between these simulations.
3. SweepFinder detects SFS outliers as signatures of recent selective sweeps, but there exist other statistics, such as the omega statistic, that detect other signatures of recent selective sweeps, such as linkage disequilibrium (LD). Two more sets of simulations were analysed, first with an LD detection method (OmegaPlus software) and second with a joint method combining SFS and LD detection. The g:GOSt enrichment for both yielded similar numbers of significant categories, even though the distributions of the detected regions along the genome differ (the distribution is more uniform with the omega statistic than with SFS detection).
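The enrichment step that comes back in all of these checks is, at its core, an over-representation test: given a list of candidate genes, is a category hit more often than chance would predict? The sketch below uses a plain hypergeometric tail probability to illustrate the idea. It does not attempt to reproduce g:GOSt's own significance correction, and the function name and all numbers are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from a genome
    of N genes, K of which belong to the category of interest.

    k: number of candidate genes that fall in the category.
    """
    upper = min(K, n)
    # Sum the probabilities of seeing k, k+1, ..., upper category genes.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical numbers: 20 genes in total, 5 annotated to a category,
# 5 candidate genes from a scan, 3 of which carry the annotation.
p = hypergeom_pvalue(k=3, N=20, K=5, n=5)
print(f"p = {p:.4f}")
```

Because a scan is tested against a great many categories at once, a few small p-values are expected even when the candidate list is effectively random, which is the situation in the simulated data sets.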
The group then tried to make up convincing narratives about the three genes with the highest SweepFinder scores (CG15211, CG8188 & CG6788) in the first simulation. In my opinion, these narratives were not the most convincing from a biological point of view, but that is not the point of the article.
Selective pressures experienced by organisms are complex, varied and changing with time. Even if we knew all the selective pressures imposed on a population at one point, the ways in which its organisms could respond are vast! Every gene, as obscure as it might be, is linked one way or another to an important biological process, so “meaningful” narratives, even about false positives, can be constructed relatively easily. The extensive use of Gene Ontology and the ever-increasing precision of databases put researchers at greater risk of seeing patterns of positive selection where there are none.
What the authors of this article have shown is not that computational or statistical approaches for detecting positive selection are wrong, but that one should be careful not to over-interpret genomic scans or trust statistics blindly, because no null hypothesis of what “makes sense” exists.
Pavlidis P, Jensen JD, Stephan W, & Stamatakis A (2012). A critical assessment of storytelling: gene ontology categories and the importance of validating genomic scans. Molecular biology and evolution, 29 (10), 3237-48 PMID: 22617950