QUICKTEST - Quick association testing, for quantitative traits, allowing genotype uncertainty
quicktest --pheno file --geno file --out file --method-mean --method-ML
quicktest --pheno file --npheno name --ooops-line --missing-code code --geno file ...
./quicktest --pheno ex.sample --npheno hta --geno ex.geno --method-ML --out ex.out
QUICKTEST estimates genetic effect sizes and tests for significance, testing for association between a quantitative trait, and SNP genotypes that are uncertain. It allows (in principle) arbitrary quantitative trait distributions, by modelling the trait distribution as a mixture of normal distributions. It can compute a likelihood ratio test, using an iterative algorithm to find the exact MLE of the missing data likelihood [KJB+08]. QUICKTEST can also compute several other tests, including a score test, and simple regression onto mean genotypes.
The complete list of available options is as follows.
Sets the probability threshold used for calling genotypes to value. Default value is 0.9. Only used if the --copy-calls, --method-call or --test-HW options are used.
Compute a simple method-of-moments estimator for alpha. This assumes the probability vectors are i.i.d. Dirichlet over individuals and defines alpha to be the sum of the parameters of that Dirichlet distribution. This option is of very obscure interest.
Compute the (expected) minor allele frequency for each SNP/locus. This option is provided simply for convenience, since the (expected) MAF can be computed easily from the (expected) genotype counts, this is technically redundant. Not using this option will slightly reduce output file size.
Compute r-squared-hat for each SNP, which is the (estimated) fraction of variance in unobserved 0/1/2 genotype explained by the the individual mean genotypes. (We assume this is the same definition used by Abecasis et al.)
For SNPs/loci and individuals analysed, write a copy of the genotype probabilities to file. This can be used to gunzip and extract subsets from the genotype probability file, column-wise by using the --only option, and/or row-wise by using a sample file with missing phenotypes,
This option alters the behaviour of the --copy option. Instead of outputting genotype probabilities, QUICKTEST will output called (best guess) genotypes, using the probability threshold set using the --call-thresh option. The output is identical to the .tped format used by PLINK. You should be able to reformat the .sample/.pheno file to make a matching .tfam file.
An exclusion list is constructed by reading all strings in file. Strings are whitespace delimited, so multiple strings on a single line are treated identically to if they were on separate lines. For each individual, if either of the first two fields in the phenotype file match any of the strings in file, the individual is excluded. Matching must be stringwise exact. Note that this is very different to the linewise matching using by PLINK.
Specifies a name of an input genotype file. If the file has been compressed using gzip, QUICKTEST will detect this and uncompress it on the fly. If this option is used multiple times, the files will be read in the order they are specified, and the results concatenated into a single output and/or copy file.
By default, QUICKTEST will print a warning message if multiple
individuals have identical (tied) phenotypic values. The intended
behaviour is to detect when a special value (like -9
) is being used to
code missing data, but that the --missing-code option has not been
used to specify the special value. Use this option only if you have
genuinely tied values and you want to disable the warning message.
Carries out a logistic regression, by dichotomising the phenotype at split point value. Mean genotypes are used, and all covariates specified with --ncovar are included in the analysis.
Will carry out an analysis by calling the genotypes, with a threshold that can be set using the --call-thresh option.
Will carry out an analysis for SNPxCovariate interaction, using the first covariate specified by the first use of the --ncovar option. Other covariates are included in the analysis, but not as interaction terms. The statistical test is analogous to the mean genotype method.
Will carry out an analysis by computing the exact MLE, using the iterative algorithm, and then a LR test, as described by [KJB+08].
Will carry out an analysis using mean genotypes, i.e. the so-called expected dosage method.
Adjusts the behaviour of --method-interaction. The beta coefficients and standard errors for both the SNP and SNPxCovariate regression terms will be output, along with the error covariance between the two. The Huber sandwich estimate is used to provide a robust estimate of the error variance-covariance matrix. Code implementing this method was written by Alisa Manning and Han Chen (see AUTHORS).
Will carry out an analysis using the missing data likelihood method of [MHM+07], equivalent to the -proper option in the SNPTEST software.
Will carry out an analysis using Markov chain Monte Carlo simulations to average over the missing data distribution. Do not confuse this option with the options for Monte Carlo permutation tests!
This option is currently disabled. It will use mean genotypes, but will carry out an analysis using a linear model with an (approximate) marginal likelihood based on the ranks of the phenotypes, as described by [Pet82]. At least for large sample sizes, we find that --method-mean --quantile-norm gives very similar results and is faster.
Phenotype values exactly matching value will be treated as missing.
The matching is done with strings and must be exact, so -9
and
-9.
are different. The default is that this option is not set.
Regardless of this option, QUICKTEST always assumes that NA codes for
missing data.
Attempt to model errors using a mixture of number Gaussians. This allows heavy tailed (or light tailed) error distributions, and makes analyses of imputed genotypes more reliable. Currently only implemented for --method-score, --method-mean and --method-ML. This requires a pre-analysis to fit the mixture model to the phenotype distribution. This can be slow, because we find it necessary to restart the EM algorithm from multiple random initial points. If repeatedly analysing the same phenotype, it is possible to save and then reuse an already fitted model, see --mix-hint.
Constrains the mixture model such that all components have zero mean.
Parse file to determine initial values of mixture parameters, and
then run EM algorithm just once, using those initial values, to
estimate the mixture distribution. The initial value specifications
in file must take forms like pi1 = 0.333
mu2 = -0.12
sigma3
= 1e-3
etc. WITH SPACES AROUND THE EQUALS SIGNS. Any content in
file that cannot be parsed is ignored. The simplest way to create
a suitable file is to cut the relevant lines from the screen output of
a previous run.
When either quantile-quantile normalisation or mixture model methods are used, do not run analyses for the simple normal model. This option saves computation time.
Do not print a refreshing line reporting progress.
Specifies that a column in the phenotype file is to be used as a covariate in the analysis. The argument value is handled in the same way as for --npheno. The --ncovar option can be used multiple times to specify multiple covariates, but at present only the first covariate is used for interaction analysis.
Specifies which column in the phenotype file to use for analysis. If value matches any token in the header, the corresponding column is used. Otherwise, value is interpreted as a number, and the (value+3)-th column will be used. The default value is 1, meaning to analyse the phenotype in the 4-th column of the phenotype file.
Perform all operations only for SNPs/loci corresponding to those in file. Entries in file should specify the identifier in the second column of the genotype probability file, which is usually used for rs number.
Assume that the second line of the phenotype file is a 0 0 0 P P
...
line and should be ignored. Using this option means that SNPTEST
phenotype files can be read.
The name of the output file.
Perform a Monte Carlo permutation test for each SNP, using the mean genotype method. The simple closed sequential scheme of [BC91] is used. Permutations are performed until either (i) there have been value1 occurrences of a statistic more extreme than the observed value, or (ii) value2 permutations have been performed. This permutation approach is recommended if you wish to control the false positive rate, when the assumptions of the normal linear model do not hold.
Under the null, about value1(1+log(value2/value1)) permutations per SNP are needed on average.
Perform value1 Monte Carlo permutations for each SNP, using the same set of permutations over all SNPs. The p-value, computed using the mean genotype method, is written to the output file for each permutation. The output may be subsequently analysed to estimate an empirical p-value, or using various methods for controlling or estimating the genome-wide false discovery rate (FDR).
Warning: The output file may be large.
The name of the input phenotype file.
Quantile normalise data before analysis.
The number of genotype simulations to perform for each SNP, if the MCMC method is to be used.
The seeds for the random number generator. QUICKTEST links against
libRmath, and calls set_seed(int1,int2)
to seed the
generator. According to my documentation, the generator is the
Marsaglia-multicarry.
Equivalent to --ooops-line --missing-code -9.
Allows evaluation of the performance of different methods, by simulating an effect of size value. Technically, a different synthetic phenotype is computed for each SNP, by simulate genotypes according to imputation probabilites and then multiplying by value. The errors are taken from a random permutation of the input phenotype. Note that a different permutation is used for each SNP. The synthetic phenotype is then analysed using all --method-XXX options selected, as if the simulated genotypes had not been observed. If the --mixture option is used, the mixture model is refitted to the synthetic phenotype before analysis. The magic estimates in the output file are the estimates that would have been obtained if the the simulated genotypes had been observed.
Will perform the exact test of [WCA05], using the best guess genotypes determined using the probability threshold set by the --call-thresh option.
The program returns 0 upon successful completion. Other return values indicate a problem with the command line arguments or input files, and are accompanied by a diagnostic message.
Each line of the phenotype file corresponds to one individual. There
can either be a single header line (the default), or two header lines
are allowed (as used by SNPTEST) if the --ooops-line option is
used. Missing values should be specified by writing NA
, but an
additional missing code can be specified using the --missing-code
option.
In summary, to read SNPTEST phenotype files, use:
--ooops-line --missing-code -9 or equivalently --SNPTEST
It is absolutely essential that the order of individuals in this file is the same as the order of individuals in the genotype file (see below).
Each line of the genotype file corresponds to one SNP. The first five fields contain information about the SNP. These are not used by QUICKTEST, but are copied verbatim into the output file. Subsequent fields contains the probabilities of each genotype for n individuals. For each individual, three consecutive fields give the probabilities of genotype 0, 1 and 2.
This file format is identical to the file format used by the CHIAMO, IMPUTE and SNPTEST programs that implement the methods of [MHM+07].
Genotypes of individuals with phenotypes written as missing (using
NA
or -9
) must be included in the genotype file, and are ignored
by the program.
For uncertain genotypes and a mixture error distribution, we declare convergence of the EM algorithm if (i) the sum log ratio of parameters being estimated is less than 1e-6, or if (ii) the increase in log likelihood is less than 1e-4 per iteration, averaged over 100 iterations. If more than 500 iterations occur, we declare non-convergence.
Email us if you do not understand the messages output by the program.
See the files ex.sample, ex.gen.gz, and run.sh.
[BC91] Besag and Clifford (1991) Sequential Monte Carlo p-values. Biometrika 78(2):301-4.
[KJB+09] Kutalik, Johnson, Bochud, Mooser, Vollenweider, Waeber, Waterworth, Beckmann and Bergmann (2009?) A comparison of methods for testing association between uncertain genotypes and quantitative triats. Manuscript submitted.
[JK09] QUICKTEST: Quick testing of genetic assocation for quantitative traits. In preparation.
[MHM+07] Marchini, Howie, Myers, McVean and Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39:906-913.
[OF04] O'Hagan and Forster (2004) Bayesian Inference, volume 2B of Kendall's Advanced Theory of Statistics. Arnold, London, 2nd edition.
[Pet82] Pettitt (1982) Inference for the Linear Model using a Likelihood based on Ranks. J. R. Statist. Soc. B 44(2):234-243.
[SS07] Servin and Stephens (2007) Imputation-based analysis of association studies: Candidate regions and quantitative traits. PLoS Genetics 3(7):e114. doi:10.1371/journal.pgen.0030114
[WCA05] Wigginton, Cutler and Abecasis (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am. J. Hum. Genet. 76:887-93.
QUICKTEST can be used to extract subsets from genotype probability files. To choose a subset of SNPs or loci, use the --only option. To choose a subset of individuals, make a dummy phenotype file where the phenotypes are non-missing for the individuals you want, and missing otherwise. Combine the two to extract a subset of loci and individuals. Use the --copy option to write the subset of genotype probabilities to a new file.
Use multiple instances of the --geno option to merge multiple input files, assuming they describe the same individuals in the same order.
The --only option builds a hash table of identifiers, and thus is reasonably fast even with very large lists.
There are probably bugs. Although all functions have been tested, we have reorganised the internal data structures and the way that many functions are called, so some functions may have been broken.
We continue to develop the software. Please check that you are using the most up-to-date version.
It would be nice to implement more general coding schemes, 2 degree of freedom tests, Bayes factors, missing data likelihood for interactions, generalised error variance (estimating equation) methods, etc.
Toby Johnson Toby.Johnson@unil.ch
Zoltan Kutalik Zoltan.Kutalik@unil.ch
Alisa Manning amanning@bu.edu
Han Chen hanchen@bu.edu
QUICKTEST uses gzstream, by Deepak Bandyopadhyay and Lutz Kettner.
QUICKTEST uses SNP-HWE, by Jan Wiggington.
The QUICKTEST program, the source code, and this documentation are all Copyright (C) 2007, 2008 Toby Johnson, Zoltan Kutalik.
QUICKTEST uses gzstream, which is Copyright (C) 2001 Deepak Bandyopadhyay, Lutz Kettner.
The programs, source code, and documentation, can be copied only under the terms of the licences below.
QUICKTEST is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
gzstream is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
With appropriate citation, the SNP-HWE routines are freely available for your use and can be incorporated into other programs.
Permission is granted to copy, distribute and/or modify the QUICKTEST documentation under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
This software distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the licenses described above for more details.
The GNU General Public Licence can be viewed at http://www.gnu.org/licenses/gpl.html.
The GNU Lesser General Public Licence can be viewed at http://www.gnu.org/licenses/lgpl.html.