Difference between revisions of "HMM based classification for conserved SNARE protein domains"


Revision as of 13:18, 3 June 2024

by Marius Audenis, Leana Ortolani and Gabriel Chiche

Supervisor: Carlos Pulido Quetglas

Link to the presentation: https://unils-my.sharepoint.com/:p:/r/personal/marius_audenis_unil_ch/_layouts/15/Doc.aspx?sourcedoc=%7B3a03a6f1-e7b0-4099-977c-fb2a72924716%7D&action=edit&wdPreviousSession=05951c8f-6f64-d402-8f8d-a9743eda88ca

Introduction

The SNARE family proteins play a role in membrane fusion and can be involved in several processes such as neurotransmission. All of these proteins share a common motif, which we call SNARE. This motif is essential to form the SNARE complex, which mediates the fusion between a vesicle and a target membrane.

This protein family has already been classified into four main groups (Qa, Qb, Qc, R), which are further divided into subgroups. This classification is informative about the function and subcellular localization of the protein.

In this project, we aim to build an algorithm capable of automatically classifying a SNARE sequence, thus giving rapid and accurate insights into its function. The classification method will be based on HMM profiles. We will build one HMM per class, and each new sequence will be aligned to each of the HMMs. Each HMM will output a score, quantifying the likelihood of the sequence being generated by that HMM, and an e-value, representing the statistical significance of the score. The new sequence could already be classified based on those values, but in order to reduce the error rate of the HMM profiles, the e-values and scores from all the HMMs will be stored in a vector that is given as input to a random forest model. Based on this vector, the random forest will redo the classification, yielding better accuracy (an improvement of more than 7% in some cases). The final product will be a pipeline that takes a sequence as input, determines its main group, and then determines the subgroup based on this result.
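The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function names, the input layout (a dictionary mapping each class to a `(score, e_value)` pair), the neutral "no hit" default and the placeholder subgroup names are all hypothetical. The random forest step itself is not shown; the vector built here is simply what would be passed to it.

```python
MAIN_GROUPS = ["Qa", "Qb", "Qc", "R"]

# Assumed placeholder for a class whose HMM reported no hit at all.
NO_HIT = (0.0, 10.0)


def build_feature_vector(hits, classes=MAIN_GROUPS):
    """Flatten per-HMM (score, e_value) pairs into one vector.

    The class order is fixed so that every sequence yields features
    in the same positions, which a downstream classifier requires.
    """
    features = []
    for name in classes:
        score, e_value = hits.get(name, NO_HIT)
        features.extend([score, e_value])
    return features


def classify_by_best_score(hits):
    """Baseline: pick the class whose HMM gave the highest score."""
    return max(hits, key=lambda name: hits[name][0])


def classify_two_stage(main_hits, subgroup_hits_by_group):
    """Determine the main group first, then a subgroup within it."""
    main = classify_by_best_score(main_hits)
    sub_hits = subgroup_hits_by_group.get(main, {})
    sub = classify_by_best_score(sub_hits) if sub_hits else None
    return main, sub


# Made-up numbers for one query sequence.
example_hits = {
    "Qa": (85.2, 1e-25),
    "Qb": (12.1, 0.003),
    "Qc": (5.0, 0.8),
    "R": (3.3, 2.1),
}
# Subgroup names below are placeholders, not real subgroup labels.
example_subgroup_hits = {"Qa": {"Qa.sub1": (40.0, 1e-10), "Qa.sub2": (70.5, 1e-20)}}

vector = build_feature_vector(example_hits)
prediction = classify_two_stage(example_hits, example_subgroup_hits)
```

In this sketch the baseline simply takes the highest score, whereas the pipeline described above would replace `classify_by_best_score` with the prediction of a random forest trained on vectors like `vector`.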


Building and using the HMM profiles

An HMM profile is a probabilistic representation of a group of sequences that captures both the conserved and the variable regions of the group.
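To make the idea of conserved and variable regions concrete, the snippet below computes the Shannon entropy of each column of a toy alignment. It is a deliberately simplified illustration and not an HMM profile: a real profile also models insertions and deletions and uses background frequencies. The function name and the toy alignment are assumptions made for this example.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Return the Shannon entropy (in bits) of every alignment column.

    An entropy of 0 means the column is perfectly conserved; higher
    values mean the column is more variable.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    entropies = []
    for i in range(length):
        counts = Counter(row[i] for row in alignment)
        total = sum(counts.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        entropies.append(entropy)
    return entropies


# Toy alignment: the first two columns are conserved, the third is not.
toy_alignment = ["ARK", "ARQ", "ARE", "ARD"]
toy_entropies = column_entropies(toy_alignment)
```

Columns with low entropy correspond to positions where a profile would assign a high probability to a single residue, while high-entropy columns correspond to positions where it tolerates many residues.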

1: Building the multiple sequence alignments

As we wanted to classify proteins into groups and subgroups, we had to build an HMM profile for each group and each subgroup. HMM profiles are built from multiple sequence alignments so that they can differentiate variable and conserved regions. Thus, we had to build one alignment per group and subgroup. The sequences we used were all found in Tracey, the SNARE protein database of the Dirk Fasshauer group at the University of Lausanne. For the main groups, we used the classification that was already present in the Tracey database.
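Preparing one alignment per group starts with splitting the labelled sequences into one set per class. The sketch below shows one possible way to do this in Python. The record layout `(sequence_id, label, sequence)` and the function names are assumptions, since the export format of the Tracey database is not described here, and the example sequences are invented.

```python
from collections import defaultdict


def group_sequences(records):
    """Group (sequence_id, label, sequence) records by their label."""
    groups = defaultdict(list)
    for seq_id, label, sequence in records:
        groups[label].append((seq_id, sequence))
    return dict(groups)


def to_fasta(entries):
    """Render (sequence_id, sequence) pairs as FASTA text."""
    return "".join(f">{seq_id}\n{sequence}\n" for seq_id, sequence in entries)


# Invented records standing in for sequences exported from the database.
example_records = [
    ("seq1", "Qa", "MKD"),
    ("seq2", "R", "MSA"),
    ("seq3", "Qa", "MRD"),
]

example_groups = group_sequences(example_records)
qa_fasta = to_fasta(example_groups["Qa"])
```

Each FASTA text produced this way would then be written to its own file and given to an alignment tool, yielding the one alignment per group and subgroup from which the corresponding HMM profile is built.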