A journey through The Simons Genome Diversity Project: more genomes sequenced, more diverse populations


Between 1976, when the genome of bacteriophage MS2 became the first to be completely sequenced [1], and 2001, when the first draft of the human genome [2] was released, a great deal of work went into improving sequencing methods and making them accessible for studying the genetics of various organisms. The draft was a very important step for the human genome, and the Human Genome Project was declared complete in 2003 [3]. In recent years, more and more projects have set out to decipher human wanderlust. To these earlier studies we can add The Simons Genome Diversity Project (SGDP), which brought new information by sequencing 300 genomes from 142 diverse populations. One of the aims was to choose populations that differ in genetics, language and culture. The study shows that some of the populations separated 100,000 years ago and reveals more about the ancestors of Australian, New Guinean and Andamanese people.


One of the most important requirements for reconstructing the true peopling of the Earth is to sequence as many genomes as possible, from individuals belonging to diverse populations that differ in many respects. In this study, the 300 samples were prepared with a PCR-free library protocol and sequenced at Illumina Ltd., with a median coverage of 42-fold (Figure S1.1; Supplementary Data Table 1). The high coverage makes it possible to identify the greatest number of variants, some of them previously reported. Single-sample genotypes were called with a reference-bias-free modification of GATK, after some preprocessing to remove adapter sequences. To increase the accuracy of the data, a filtering system highly specific to the SGDP dataset was used. Filter levels run from 0 to 9, encoded as a single character for each sample: level 1 offers a good balance between sensitivity and low error rate, while level 9 is suited to analyses that need lower error rates (Figure S2.1).

The first part of the study tells us how long ago the worldwide populations under study separated. The pairwise sequentially Markovian coalescent (PSMC) and the multiple sequentially Markovian coalescent (MSMC) were used to infer changes in population size and split times; the phased haplotypes needed for split-time estimation were generated with SHAPEIT and IMPUTE2. Filter level 1 was used. Figure 2a provides evidence that the ancestors of some present-day populations were already isolated at least 100 kya, which could have been an obstacle to the spread of certain mutations to the ancestors of all populations. Gene flow continued until around 50 kya among the great majority of ancestral populations. The graphs show when substructure between different populations began: in Figure 2a, substructure between French and Africans starts around 200 kya. The next panels compare only Africans (the Yoruba separated from the KhoeSan 87 kya, from the Mbuti 56 kya and from the Dinka 19 kya) or only non-Africans (the oldest substructure dates to 50 kya, during or shortly after the deepest part of the shared non-African bottleneck, 40-60 kya). For Figure 2d-f, PSMC and PS1 were used to show the inferred effective population sizes and the inferred cross-coalescence rates.

Using a neighbour-joining tree (based on pairwise divergence per nucleotide) and FST, Mallick et al. reconfirmed previous findings that the deepest splits occurred among Africans. Previous studies showed that all non-Africans today carry Neanderthal ancestry, and Figure 1c shows that the highest proportion of Neanderthal ancestry is found in East Asians. Among Eurasians, South Asians have the highest Denisovan ancestry (heatmap in Figure 1d). Another result is that there is more Denisovan ancestry in eastern than in western Eurasians. For Australia, New Guinea and Oceania, the results of other studies are confirmed: these populations have more Denisovan ancestry than mainland Eurasians. In Figure 3, the deeper the split, the more divergent the early dispersal ancestry. Based on cross-population coalescence patterns and allele frequency correlations, the best-fitting model is one in which Australian, New Guinean and Andamanese history does not involve ancestry from an early-diverging source. No archaeological data for southern Asia or Australia were taken into consideration in this study. Using only the data from this study, the conclusion is that Australians, New Guineans and Andamanese lack an analogous deep ancestry component. All the data on Australians seem consistent with descent from a common homogeneous population since separation from New Guineans. New Guineans, Australians and Andamanese also appear as part of an eastern clade together with mainland East Asians.
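
The pairwise divergence per nucleotide that underlies the neighbour-joining tree can be illustrated with a minimal Python sketch. It assumes that the genomes are already aligned as strings of equal length and that missing calls are marked with "N"; the function names and the toy sequences are hypothetical and are not part of the SGDP pipeline.

```python
from itertools import combinations


def pairwise_divergence(seq_a, seq_b):
    """Fraction of comparable sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "N" or b == "N":  # assumption: "N" marks a missing call
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def divergence_matrix(samples):
    """Divergence for every unordered pair of samples, keyed by sorted name pair."""
    return {
        (x, y): pairwise_divergence(samples[x], samples[y])
        for x, y in combinations(sorted(samples), 2)
    }


# Toy data, invented purely for illustration.
toy_samples = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGTTT",
}
matrix = divergence_matrix(toy_samples)
for pair, value in matrix.items():
    print(pair, round(value, 3))
```

A matrix of this kind is the usual input to a neighbour-joining algorithm, which repeatedly joins the pair of samples that minimizes the total branch length of the tree.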

3P-CLR was used to scan the genome for positive selection. In the end, 38 of the largest peaks emerged as signals of selection in the common ancestors of all modern humans. These peaks represent sweeps dating to the time when the archaeological record shows accelerating evidence of behavioral modernity. The scan did not search for sweeps on chromosome X or in repetitive or difficult-to-analyze sections of the genome.

The rate of mutation accumulation was expected to be roughly equal between non-Africans (grouped as America, CentralAsiaSiberia, EastAsia, WestEurasia and Oceania) and sub-Saharan Africans (grouped as Pygmy, Khoesan and Africa), but this study revealed a significant average difference of 0.5%. For this analysis, the authors restricted the samples heavily, choosing only samples processed in the same way, applying the highest filter level, and pooling samples from the same region. One strength of the experiment is that it avoids the bias due to different heterozygosity levels in different populations (heterozygosity is higher in Africans) by using only chromosome X from males. In addition, everything is mapped to chimpanzee, which is equally distant from all present-day populations. Some observations differ from other studies: the rate of the CCT>CTT mutation in Europeans is close to that in Africans, whereas in East Asians it is not. This could be explained by a decrease in generation interval in non-Africans since separation. Previous studies [4] showed a higher X-to-autosome heterozygosity ratio in sub-Saharan Africans than in non-Africans. Mallick et al. confirmed this result by analyzing more populations: Khoesan for sub-Saharan Africans, and New Guineans, Australians, Native Americans, Near Easterners and indigenous Siberians for non-Africans. The only exception is the Pygmies (eastern Mbuti and western Biaka), who show a lower X-to-autosome heterozygosity ratio than non-Africans. The scatterplot in Figure 1b shows two primary clusters, sub-Saharan Africans and all other populations, without large differences within the groups, except for the Pygmies, who have high autosomal heterozygosity.
Comparing the two Pygmy populations with the lower X-to-autosome ratio, the Mbuti are closer to non-Africans than to Africans, even though they group with Africans in the neighbour-joining tree based on pairwise divergence. The reduced X-to-autosome ratio in non-African compared to African populations could be explained by repeated waves of male admixture into an already mixed population, but for the Pygmy populations the strongest argument is sex-biased gene flow, which is supported by anthropological data.
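
The X-to-autosome heterozygosity ratio discussed above can be illustrated with a minimal Python sketch. It assumes that heterozygous sites and callable sites have already been counted separately for chromosome X and for the autosomes; the function names and the toy counts are hypothetical and are not taken from the SGDP pipeline. For reference, in a randomly mating population with equal numbers of males and females and no selection, the expected ratio is 3/4, because there are three X chromosomes for every four copies of each autosome.

```python
def heterozygosity(het_sites, callable_sites):
    """Heterozygous sites per callable site."""
    if callable_sites <= 0:
        raise ValueError("callable_sites must be positive")
    return het_sites / callable_sites


def x_to_autosome_ratio(x_het, x_callable, auto_het, auto_callable):
    """Ratio of chromosome X heterozygosity to autosomal heterozygosity."""
    autosomal = heterozygosity(auto_het, auto_callable)
    if autosomal == 0:
        raise ValueError("autosomal heterozygosity is zero; ratio undefined")
    return heterozygosity(x_het, x_callable) / autosomal


# Toy counts, invented purely for illustration.
ratio = x_to_autosome_ratio(
    x_het=600, x_callable=1_000_000,
    auto_het=1_000, auto_callable=1_000_000,
)
print(round(ratio, 3))
```

Because males carry a single X chromosome, the X counts in such a calculation would have to come from female samples or from pairs of male X chromosomes; that choice is left open here.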

In the last part, Mallick et al. show that non-Africans have accumulated more mutations. This can be explained in two ways: by an acceleration of the mutation rate in non-Africans or by a deceleration in Africans. Extended Data Table 1 shows that in none of the populations can the strong non-African signal be attributed to a deceleration in Africans. The acceleration in non-Africans could have several causes: life history traits (e.g. generation interval) may have changed after modern humans dispersed out of Africa, with the higher latitudes they colonized or the colder climates; gene conversion (GC to A or T alleles) may have been more effective in Africans; or Neanderthal admixture into the ancestors of non-Africans may have contributed lineages that accumulated more mutations than modern humans after separation (although there is no clear evidence for this).
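
A comparison of mutation accumulation between two lineages can be sketched in Python as follows. This is only an illustration of the general idea, not the authors' method: it assumes two haploid sequences (for example male X chromosomes) aligned to an outgroup such as chimpanzee, treats the outgroup base as ancestral, and marks missing calls with "N". All names and toy sequences are hypothetical.

```python
def lineage_specific_counts(genome_a, genome_b, outgroup):
    """Count sites where exactly one genome differs from the outgroup base."""
    if not (len(genome_a) == len(genome_b) == len(outgroup)):
        raise ValueError("all sequences must be aligned to the same length")
    only_a = 0
    only_b = 0
    for a, b, o in zip(genome_a, genome_b, outgroup):
        if "N" in (a, b, o):  # assumption: "N" marks a missing call
            continue
        if a != o and b == o:
            only_a += 1  # change private to lineage A
        elif b != o and a == o:
            only_b += 1  # change private to lineage B
        # sites where both differ from the outgroup are not informative here
    return only_a, only_b


def relative_excess(only_a, only_b):
    """Relative excess of private changes in A over B (0.0 means equal)."""
    if only_b == 0:
        raise ValueError("lineage B has no private changes; ratio undefined")
    return only_a / only_b - 1.0


# Toy sequences, invented purely for illustration.
outgroup = "AAAAAAAAAA"
genome_a = "CAAGAATAAG"
genome_b = "AACAAAATAG"
counts = lineage_specific_counts(genome_a, genome_b, outgroup)
excess = relative_excess(*counts)
print(counts, excess)
```

If both lineages had accumulated mutations at the same rate since they separated, the two counts would be expected to be equal on average, so a consistent excess on one side points to a difference in rate.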


The Simons Genome Diversity Project contributes new information by studying 300 genomes from 142 diverse populations, showing an accelerated accumulation of mutations in non-Africans compared to Africans. The Pygmies also seem to be the only African group with a low X-to-autosome diversity ratio. Regarding archaic ancestry, the highest proportion of Neanderthal ancestry was found in East Asians, and an excess of Denisovan ancestry in some South Asians compared to other Eurasians.


  1. Min Jou, W., Haegeman, G., Ysebaert, M. & Fiers, W., Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein, Nature 237(5350), 82–88, 1972.
  2. http://web.ornl.gov/sci/techresources/Human_Genome/project/clinton1.shtml
  3. International HapMap Consortium, The International HapMap Project, Nature 426, 789–796, 2003.
  4. Keinan, A., Mullikin, J. C., Patterson, N. & Reich, D., Accelerated genetic drift on chromosome X during the human dispersal out of Africa, Nat. Genet. 41, 66–70, doi:10.1038/ng.303, 2009.
  5. Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., Zhao, M., Chennagiri, N., Nordenfelt, S., Tandon, A. et al., The Simons Genome Diversity Project: 300 genomes from 142 diverse populations, Nature 538, 201–206, 2016.