GMM

From Computational Biology Group

Latest revision as of 18:07, 8 January 2010


Deletion, insertion and duplication events giving rise to copy number variations (CNVs) have been found genome-wide in humans and other species. Such genomic aberrations were first identified more than a decade ago using array-based comparative genomic hybridization. They can also be detected using data from SNP genotyping arrays, typically by combining the intensities of the two probes for a given SNP and comparing the result to the same SNP on other arrays (thus deriving a copy number ratio). A significant shift from the baseline (unit ratio, or zero log ratio) reflects a copy number change. Such changes can be identified in many ways. For example, segmentation algorithms can partition the signal and then classify the resulting segments as gain, copy-neutral or loss. For large datasets, however, one can take advantage of the signal distribution at each SNP and assign each individual to the component of that distribution that reflects a given copy number state.

We developed a Gaussian Mixture Model (GMM) that detects copy number variation from the distribution of copy number ratios. From the data, it fits one component for each of the following copy number states: deletion, copy-neutral, one additional copy and two additional copies, with a constraint on the difference between the mixture means. Then, for a given individual, it determines the probability of each copy number state and computes the expected copy number (dosage).
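The last step described above (posterior probabilities per state, then an expected copy number) can be illustrated with a minimal Python sketch. It is not the callCNVs.m implementation: the fitted weights, means and standard deviations below are hypothetical values on a log2 scale, the mapping of the four states to copy numbers 1 to 4 is an assumption, and the constrained fitting itself is not shown.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_and_dosage(ratio, weights, means, sds, copy_numbers):
    """Return (posterior probabilities, expected copy number) for one ratio."""
    # Joint density of the observed ratio with each mixture component.
    joint = [w * normal_pdf(ratio, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    posteriors = [j / total for j in joint]
    # Dosage = sum over states of P(state | ratio) * copy number of that state.
    dosage = sum(p * c for p, c in zip(posteriors, copy_numbers))
    return posteriors, dosage


# Hypothetical, already-fitted parameters for a single SNP (assumed values).
copy_numbers = [1, 2, 3, 4]          # assumed mapping of the four states
weights = [0.05, 0.85, 0.08, 0.02]   # mixture proportions
means = [-1.0, 0.0, 0.585, 1.0]      # roughly log2(copies / 2)
sds = [0.15, 0.15, 0.15, 0.15]

posteriors, dosage = posterior_and_dosage(0.02, weights, means, sds, copy_numbers)
print([round(p, 3) for p in posteriors], round(dosage, 3))
```

A ratio close to the copy-neutral mean yields a dosage close to 2, while a ratio close to the deletion mean yields a dosage close to 1. Intermediate ratios give fractional dosages that reflect the uncertainty of the call.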

=== License ===

The GMM algorithm is licensed under the GNU General Public License, version 2 or later. For details, see http://www.gnu.org/licenses/old-licenses/gpl-2.0.html.

=== Usage ===

The GMM can be applied to identify CNVs from any rectangular matrix of copy number ratios.

The input has one row per SNP, with the following columns: chr pos sample1 sample2 ...

Fields must be tab-delimited, and rows are assumed to be sorted by position within each chromosome.

An example input file is included in GMM_CNV.zip (see Download section).
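Before running the model, it can help to check that an input file follows the layout described above. The following Python sketch is only an illustration and is not part of the GMM distribution. It assumes that the first line is a header naming the columns and that positions are integers; the function name read_ratio_matrix is hypothetical.

```python
import csv
import io


def read_ratio_matrix(handle):
    """Parse a tab-delimited matrix: chr, pos, then one column per sample."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)            # assumed header line: chr pos sample1 ...
    samples = header[2:]
    rows = []
    last_pos = {}                    # last position seen on each chromosome
    for fields in reader:
        if not fields:
            continue                 # skip empty lines
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}")
        chrom, pos = fields[0], int(fields[1])
        # Positions must be non-decreasing within a chromosome.
        if chrom in last_pos and pos < last_pos[chrom]:
            raise ValueError(f"positions not sorted on chromosome {chrom}")
        last_pos[chrom] = pos
        rows.append((chrom, pos, [float(v) for v in fields[2:]]))
    return samples, rows


example = (
    "chr\tpos\tsample1\tsample2\n"
    "1\t1000\t0.02\t-0.95\n"
    "1\t2500\t-0.01\t0.03\n"
    "2\t500\t0.60\t0.01\n"
)
samples, rows = read_ratio_matrix(io.StringIO(example))
print(samples, len(rows))
```

For a real file, pass an open file handle instead of the in-memory example, for instance `open("test.dat", newline="")`.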


For Matlab users, download the source code and use the callCNVs.m script.

Users without Matlab can use the compiled version together with the Matlab Component Runtime (MCR). (Please note that we currently provide a compiled version for Linux x86_64 only.)

You can then use the shell script "run_CallCNVs.sh", editing it as needed:

 sh run_CallCNVs.sh path_to_the_MCR/v79/

=== Requirements ===

If you have MATLAB, you can use the source code directly.

Otherwise, you will need to download the Matlab Component Runtime to use the executables (see Download section).

=== Download ===

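{| class="wikitable" border="1"
! Description
! File Name
! Size
! md5sum
|-
| MCR for 64-bit Linux
| [http://www.unil.ch/cbg/homepage/downloads/MCR2007_x86_64.zip MCR2007_x86_64.zip]
| 224M
| 451c54a811b3e01402b6a46a1b814c4d
|-
| Linux Executables (+ example input file)
| [http://www.unil.ch/cbg/homepage/downloads/GMM_CNV.zip GMM_CNV.zip]
| 556k
| bd579f39c340a50de2bb80a649643be3
|-
| Source code
| [http://www.unil.ch/cbg/homepage/downloads/GMM_CNV_SOURCE.zip GMM_CNV_SOURCE.zip]
| 16k
| 3cb7799bf3e180b33a6742ef382b105e
|-
| Example output files
| [http://www.unil.ch/cbg/homepage/downloads/GMM_CNV_outputs.zip GMM_CNV_outputs.zip]
| 460k
| 6b621a6a8e279697f610db35810777ce
|}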
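The md5sum values listed above can be used to check that a download is intact. The Python sketch below shows one way to do this with the standard hashlib module; the helper names md5sum and verify_download are hypothetical and are not part of the GMM distribution.

```python
import hashlib
import os

# Checksums copied from the Download table.
EXPECTED_MD5 = {
    "MCR2007_x86_64.zip": "451c54a811b3e01402b6a46a1b814c4d",
    "GMM_CNV.zip": "bd579f39c340a50de2bb80a649643be3",
    "GMM_CNV_SOURCE.zip": "3cb7799bf3e180b33a6742ef382b105e",
    "GMM_CNV_outputs.zip": "6b621a6a8e279697f610db35810777ce",
}


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    name = os.path.basename(path)
    if name not in EXPECTED_MD5:
        raise KeyError(f"no checksum listed for {name}")
    return md5sum(path) == EXPECTED_MD5[name]
```

For example, `verify_download("GMM_CNV.zip")` should return True for an intact copy of the executables archive.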

=== Frequently Asked Questions ===

* What are the default components the model will try to fit?

The current implementation models deletion, copy-neutral, 3 copies and more than 3 copies.


* What happens if the model fails to fit the data?

The model will output this warning:

 Exiting: Maximum number of iterations has been exceeded - increase MaxIter option.

Missing data will be set to 0. The model will then analyse the next SNP (if any).


* I am getting:

 Exiting: Maximum number of iterations has been exceeded - increase MaxIter option.

What does this mean?

The model could not find the component separation before reaching its maximum number of iterations. This can be due to noisy data, or to a distribution in which no such separation exists. Try increasing:

 MAX_FUN_CALL=10000; # number of optimization function calls
 MAX_FUN_ITER=5000; # number of iterations for each optimization function call

Note, however, that this can significantly increase the runtime.


* Can I apply some extra normalization before fitting the Gaussian Mixture Model?

Yes. By default, a Loess smoothing is applied. (This step can be skipped by setting DO_LOESS_SMOOTH=0 in the shell script or DO_LOESS=0; in callCNVs.m.)

Since a Gaussian Mixture Model can be sensitive to batch effects, it is strongly recommended that adequate normalization be applied before using the model.

Note: the Loess smoothing will not correct batch effects, but it will improve the signal-to-noise ratio within each individual profile. By default, the Loess window size is 41 SNPs. For higher-density arrays (Affymetrix 6.0 or Illumina 1M), this window could be increased.
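To give an idea of what smoothing over a 41-SNP window does to one individual's profile, here is a generic Loess-style sketch in Python: a local linear regression with tricube weights, evaluated at each SNP. It is an assumption-laden illustration and not the code used in callCNVs.m. In particular, it treats SNPs as equally spaced (it uses the SNP index rather than the genomic position), and the function name loess_smooth is hypothetical.

```python
def loess_smooth(values, window=41):
    """Smooth a profile by local linear regression with tricube weights."""
    n = len(values)
    half = window // 2
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        # Scale distances so that every point in the window gets a positive weight.
        max_dist = max(i - lo, hi - 1 - i) + 1
        sw = swx = swy = swxx = swxy = 0.0
        for j in range(lo, hi):
            x = j - i                          # offset from the SNP being smoothed
            d = abs(x) / max_dist
            w = (1.0 - d ** 3) ** 3            # tricube weight
            sw += w
            swx += w * x
            swy += w * values[j]
            swxx += w * x * x
            swxy += w * x * values[j]
        denom = sw * swxx - swx * swx
        if denom == 0.0:
            # Only one point in the window: fall back to the weighted mean.
            smoothed.append(swy / sw)
        else:
            # Intercept of the weighted least-squares line at x = 0 (SNP i).
            smoothed.append((swxx * swy - swx * swxy) / denom)
    return smoothed
```

A larger `window` averages over more neighbouring SNPs, which reduces noise further but also blurs short CNVs. This is the trade-off to keep in mind when increasing the window for higher-density arrays.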


* I am getting this error:

 error while loading shared libraries: libmwmclmcrrt.so: cannot open shared object file: No such file or directory

What does it mean?

Most likely your LD_LIBRARY_PATH is not pointing correctly to the MCR. The run_callCNVs.sh script should set it for you:

 sh run_callCNVs.sh /path-to-my-MCR/v79 test.dat
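If the error persists, one way to diagnose it is to check whether any directory listed in LD_LIBRARY_PATH actually contains the runtime library named in the message. The Python sketch below does this. It is a hypothetical helper, not part of the GMM distribution, and it assumes that the library file name starts with "libmwmclmcrrt.so" (possibly followed by a version suffix).

```python
import os


def find_mcr_library(ld_library_path, prefix="libmwmclmcrrt.so"):
    """Return the directories on the given path that contain the MCR library."""
    hits = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue                     # skip empty or missing entries
        if any(name.startswith(prefix) for name in os.listdir(directory)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    found = find_mcr_library(os.environ.get("LD_LIBRARY_PATH", ""))
    if found:
        print("MCR library found in:", ", ".join(found))
    else:
        print("MCR library not found on LD_LIBRARY_PATH")
```

An empty result suggests that the MCR path given to the run script is wrong, or that the MCR is not installed where expected.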