Mouse genomic variation and its effect on phenotypes and gene regulation (Keane, Goodstadt, and Danecek et al., Nature 2011)

Motivation: Documenting the genomic variation of 17 inbred strains of mice. Describing the distribution of variants between strains and its relation to phenotypes and gene regulation. Exploring the evolutionary origins of the subspecies that gave rise to the laboratory mouse.

       Structure: The article is divided into three main parts: i) description of genomic variants, ii) examination of the functional consequences of allele-specific variation on transcript abundance, and iii) investigation of the molecular nature of functional variants and their position relative to genes.
       Experimental design: The 17 most widely used mouse strains (liver tissue) were selected for whole-genome sequencing on the Illumina GAIIx platform. To estimate error rates and evaluate the method, a NOD/ShiLtJ BAC clone library was constructed; 107 BACs from seven loci on chromosomes 1, 6, 11 and 17 of this library were shotgun cloned and capillary sequenced. SNPs, structural variants (inversions, balanced translocations, CNVs) and transposable elements were identified relative to the previously sequenced C57BL/6J reference genome. Bayesian concordance analysis was used to construct gene trees across the genomes of M. m. musculus, M. m. domesticus and M. m. castaneus, with M. spretus as the outgroup. Allele-specific expression was analyzed in liver, thymus, spleen, lung, hippocampus and heart using RNA sequencing of an F1 hybrid of two sequenced strains. Each lane of transcriptome sequence was re-genotyped prior to downstream analysis. To identify sequence variants that underlie quantitative traits, and to investigate their common molecular features and their position relative to coding genes, the complete genome sequences of eight inbred strains (founder haplotypes of lab strains) were used. The QTLs were chosen based on previous literature (mainly [1]). For more details on the methods, consult the supplementary information of the article.
       Main results: The whole-genome sequences of 17 inbred laboratory mouse strains are reported. Ten times more variants than previously known were found. The phylogenetic history of laboratory mouse strains could not be completely resolved. 12% of allele-specific transcripts showed a significant tissue-specific expression pattern. The molecular nature of functional variants, as well as their position relative to coding genes, varies according to the effect size of the quantitative trait locus (QTL) and seems to have a significant effect on the function.
 Oddities of the article
       41 authors
       3 really big names at the end
       3 people sharing first authorship
       very condensed
       article represents the integrative nature of current science
Discussion among tutorial participants
       General discussion about hiring process
a)     Generally it is good to be a middle author on several articles if the PI as well as the journal have a good reputation.
b)    For first authors mostly the reputation of the journal counts.
–> The hiring process is different for different positions. For a technician it is good if a) applies and for a PhD position or a post-doc position it is good if b) applies. For further steps in a career the criteria are more stringent.
       Experimental design
Would you repeat the experimental design of this study?
Yes) The results are influential for all kinds of studies on inbred mammals, and even for human studies. A lot of new information is produced and the study has a high impact.
No) It might be considered a waste of money to sequence 17 lab strains if only a subset is used in most analyses. Most of the lab strains look more or less the same. Fewer lab strains and more wild types could have been chosen.
–> Afterwards we always know more than beforehand! In conclusion, the behavior, morphology and physiology of the lab strains are better studied than those of any other species. This information can be used to explain small genetic differences. There is also a social constraint: you want to include as many people as possible to make the study more interesting for the whole mouse community. This contributes to the Collaborative Cross, a community resource for the genetic analysis of complex traits. The Complex Trait Consortium aims to promote the development of resources that can be used to understand, treat and ultimately prevent pervasive human diseases [2].
       Figure 1…
…was hard to understand. What is the reference genome? (C57BL/6J.) What does “inaccessible” mean? (Mostly LINEs, chr 17, chr X.) There is more variation in outbred strains (more color, longer distance). A lot of people had problems with this figure, most probably because figure 1a contains a lot of information at different levels, and it takes the reader a long time and a good color printer to understand what the authors want to show. Figure 1b, on the other hand, is rather simple. From left to right the blue circle grows relative to the red one. Does that represent how variation evolves in the genome? The SNPs show a small blue circle. This could be explained by a bottleneck or by selection acting on SNPs. Transposable elements show a large blue circle. Are lab strains evolving faster in this class? Unfortunately, this part of the figure is not addressed in the discussion section of the article.
       The generation and sequencing of NOD/ShiLtJ bacterial artificial chromosomes was appreciated by most of the students. It is a nice way to estimate the error rates of the new sequencing techniques, and it evaluates and confirms the method used. Public databases currently contain many false negatives because not many individuals, strains or species have been fully sequenced. It is essential to demonstrate the consistency of a new method.
       The estimate of the amount of structural variation in the laboratory mouse strains raised some doubts among the students. Apparently, 48.4 Mb of sequence in each strain falls into structurally variant regions of the genome. That means that 1.6% of the mouse genome consists of structural variants. These structural variants cluster with SNPs in each strain. Is this amount common? We do not know. What we do know is that SNPs and structural variants are of relatively old origin in these genomes and seem to occur together. The authors report that many structural variants could not be mapped, so their estimate must be biased.
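To make the arithmetic behind that percentage explicit, here is a minimal Python sketch. The 48.4 Mb and 1.6% values come from the text above; the function names are made up for illustration, and the genome size of roughly 2,700 Mb used in the last line is an assumption of this sketch, not a number taken from the article.

```python
def sv_fraction(sv_mb: float, genome_mb: float) -> float:
    """Fraction of the genome lying in structurally variant regions."""
    return sv_mb / genome_mb


def implied_genome_size_mb(sv_mb: float, fraction: float) -> float:
    """Genome size (in Mb) that makes sv_mb correspond to the given fraction."""
    return sv_mb / fraction


SV_MB = 48.4               # from the text: sequence in structurally variant regions
REPORTED_FRACTION = 0.016  # from the text: 1.6% of the genome

# Genome size implied by the two numbers quoted above.
print(f"implied genome size: {implied_genome_size_mb(SV_MB, REPORTED_FRACTION):.0f} Mb")

# ASSUMPTION: a genome size of roughly 2,700 Mb, used only for comparison.
ASSUMED_GENOME_MB = 2700.0
print(f"fraction at assumed size: {sv_fraction(SV_MB, ASSUMED_GENOME_MB):.2%}")
```

The percentage depends directly on which genome size is used as the denominator, which is one more reason to treat the estimate with some caution.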
       Some students had trouble understanding figure 3. A simpler way to represent the same data would be one histogram per tissue. I assume that the authors like the “ggplot” package and Hadley Wickham’s way of presenting data in one plot. To understand figure 3 it is necessary to know that allelic bias is defined as the proportion of expression attributable to a particular parental strain, ranging from 0 to 1, with a null hypothesis of 0.5 in the absence of any bias.
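The definition of allelic bias can be sketched in a few lines of Python. The read counts below are invented for illustration, and the exact two-sided binomial test against 0.5 is an assumption of this sketch: the text above only states the null hypothesis, not which test the authors applied.

```python
from math import comb


def allelic_bias(reads_a: int, reads_b: int) -> float:
    """Proportion of allele-specific reads that come from parental strain A."""
    total = reads_a + reads_b
    if total == 0:
        raise ValueError("no allele-specific reads for this transcript")
    return reads_a / total


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: sum the probabilities of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


# Hypothetical counts for one transcript in one tissue of the F1 hybrid.
reads_a, reads_b = 30, 10
bias = allelic_bias(reads_a, reads_b)
p_value = binomial_two_sided_p(reads_a, reads_a + reads_b)
print(f"allelic bias = {bias:.2f}, p = {p_value:.4f}")
```

A bias near 0.5 with a large p-value means both parental alleles are expressed about equally; values near 0 or 1 indicate that one parental allele dominates expression.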
       The phylogenetic analysis revealed that all trees have similar probabilities.
–> Compare with the human-chimp-gorilla relationship [3]: for some genes the human-gorilla relationship can appear closer than the human-chimp relationship because of incomplete lineage sorting, so the gene tree does not always equal the species tree. There, the percentage of shared autosomes is about 10%; here, among mouse lab strains, it is around 5%. Since we are comparing strains and not species, the differences are smaller and incomplete lineage sorting might be more common due to recent shared ancestry [4].
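Why a short time since the split makes incomplete lineage sorting more common can be illustrated with the standard three-taxon coalescent result: if the internal branch of the species tree has length T in coalescent units (generations divided by 2N for diploids), each of the two discordant gene trees has probability exp(-T)/3. The sketch below is only an illustration of that formula; the branch lengths and population size are invented and are not estimates from the article.

```python
from math import exp


def gene_tree_probabilities(internal_branch_generations: float,
                            effective_pop_size: float) -> tuple[float, float]:
    """Return (P(gene tree matches species tree), P(each discordant tree))
    for three taxa under the neutral coalescent."""
    # Internal branch length in coalescent units (diploid population).
    t = internal_branch_generations / (2.0 * effective_pop_size)
    discordant_each = exp(-t) / 3.0
    concordant = 1.0 - 2.0 * discordant_each
    return concordant, discordant_each


# Hypothetical values: the same population size, short versus long internal branch.
for generations in (1_000, 100_000, 1_000_000):
    concordant, discordant = gene_tree_probabilities(generations, 50_000)
    print(f"{generations:>9} generations: concordant {concordant:.3f}, "
          f"each discordant tree {discordant:.3f}")
```

With a very short internal branch all three topologies approach a probability of 1/3, which fits the observation that all trees have similar probabilities.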
       QTL:
I find it interesting that wherever you go, students always find QTL analyses difficult to understand. The same was true here. The idea was to use the whole genome in an attempt to identify sequence variants that underlie quantitative traits. The authors asked whether functional variants have common molecular features, whether they are more common within genes or outside them, and whether they consist of structural variants, indels or SNPs. As candidate loci, 843 QTLs identified in the literature [1,5] were selected. Two competing models were used to answer these questions: the haplotype model, in which eight haplotypes are used (the eight sequenced strains that are the founder haplotypes of all lab strains), and the SNP allele model, in which two alleles are imputed at every locus. In 85% of the cases there was at least one variant for which the allelic model fit better than the haplotype model. It was concluded that at these QTLs there is either a single functional variant or a series of functional variants on the same haplotype. We questioned whether using the two competing models is sufficient for a thorough analysis to find functional variants. (Basic help for QTL analysis with R can be found here: [6].)
Table 2 and figure 4 show the physical part of the QTL analysis. Interestingly, the table and the figure are almost redundant. The top of figure 4 shows the importance of the position of a significant functional variant; the bottom shows the molecular nature of the quantitative trait variants that influence the effect size of the QTL. Position and molecular type are important. At this moment it became clear to us that we still do not really know how genes work. Five years ago it was the common belief that mostly the flanking regions of genes are important for their regulation. It seems like we are just advancing in the dark and feeling the tail of an elephant. A small change in a regular sequence can lead to a big change in a gene. It seems like trans regions are much less important than cis-regulatory elements. We missed some kind of categorization of the QTLs in the collection of mouse QTLs. It might be that there are bigger classes of QTLs that fall into different positions or molecular types. Finally, we also missed a combined analysis in which position and molecular type are analyzed together.
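Since the comparison of the two models caused most of the confusion, here is a toy Python sketch of the idea. It fits the phenotype once with one mean per founder haplotype (eight groups) and once with one mean per SNP allele (two groups) and compares the fits. Using BIC as the criterion, the function names and the data are all assumptions of this sketch; the text above does not say how the authors compared the fits.

```python
from collections import defaultdict
from math import log


def residual_sum_of_squares(phenotypes, groups):
    """RSS after fitting one mean per group; also returns the number of groups."""
    by_group = defaultdict(list)
    for value, group in zip(phenotypes, groups):
        by_group[group].append(value)
    rss = 0.0
    for values in by_group.values():
        mean = sum(values) / len(values)
        rss += sum((v - mean) ** 2 for v in values)
    return rss, len(by_group)


def bic(phenotypes, groups):
    """Bayesian information criterion of a group-means model (lower is better)."""
    n = len(phenotypes)
    rss, n_params = residual_sum_of_squares(phenotypes, groups)
    return n * log(max(rss, 1e-12) / n) + n_params * log(n)


def allele_model_fits_better(phenotypes, founder_haplotypes, allele_of_founder):
    """True if the two-allele model has a lower BIC than the haplotype model."""
    alleles = [allele_of_founder[h] for h in founder_haplotypes]
    return bic(phenotypes, alleles) < bic(phenotypes, founder_haplotypes)


# Hypothetical data: 16 animals, two per founder haplotype A..H.
haplotypes = [h for h in "ABCDEFGH" for _ in range(2)]
# Hypothetical SNP: founders A-D carry allele 0, founders E-H carry allele 1.
snp = {h: (0 if h in "ABCD" else 1) for h in "ABCDEFGH"}
# Phenotype driven only by the SNP allele (means 10 and 12, spread +/- 0.5).
phenotypes = [10 + 2 * snp[h] + d for h in "ABCDEFGH" for d in (-0.5, 0.5)]

print(allele_model_fits_better(phenotypes, haplotypes, snp))
```

In this toy data set the phenotype depends only on the SNP allele, so the simpler two-allele model explains the data as well as the eight-haplotype model while using fewer parameters.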
1. Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman P, et al. (2006) Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet 38: 879-887.
2. Churchill GA, Airey DC, Allayee H, Angel JM, Attie AD, et al. (2004) The Collaborative Cross, a community resource for the genetic analysis of complex traits. Nat Genet 36: 1133-1137.
3. Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, et al. (2012) Insights into hominid evolution from the gorilla genome sequence. Nature 483: 169-175.
4. Ane C, Larget B, Baum DA, Smith SD, Rokas A (2007) Bayesian estimation of concordance among gene trees. Mol Biol Evol 24: 412-426.
5. Yalcin B, Flint J, Mott R (2005) Using progenitor strain information to identify quantitative trait nucleotides in outbred mice. Genetics 171: 673-681.
6. Zhou Q (2010) A Guide to QTL Mapping with R/qtl. Journal of Statistical Software 32: 396.

Keane, T., Goodstadt, L., Danecek, P., White, M., Wong, K., Yalcin, B., Heger, A., Agam, A., Slater, G., Goodson, M., Furlotte, N., Eskin, E., Nellåker, C., Whitley, H., Cleak, J., Janowitz, D., Hernandez-Pliego, P., Edwards, A., Belgard, T., Oliver, P., McIntyre, R., Bhomra, A., Nicod, J., Gan, X., Yuan, W., van der Weyden, L., Steward, C., Bala, S., Stalker, J., Mott, R., Durbin, R., Jackson, I., Czechanski, A., Guerra-Assunção, J., Donahue, L., Reinholdt, L., Payseur, B., Ponting, C., Birney, E., Flint, J., & Adams, D. (2011). Mouse genomic variation and its effect on phenotypes and gene regulation. Nature, 477 (7364), 289-294. DOI: 10.1038/nature10413