HMM-based classification for conserved SNARE protein domains

by Marius Audenis, Leana Ortolani and Gabriel Chiche

Supervisor: Carlos Pulido Quetglas

Link to the presentation: https://unils-my.sharepoint.com/:p:/r/personal/marius_audenis_unil_ch/_layouts/15/Doc.aspx?sourcedoc=%7B3a03a6f1-e7b0-4099-977c-fb2a72924716%7D&action=edit&wdPreviousSession=05951c8f-6f64-d402-8f8d-a9743eda88ca

Introduction

The SNARE family proteins play a role in membrane fusion and can be involved in several processes such as neurotransmission. All of these proteins share a common motif, which we call SNARE. This motif is essential to form the SNARE complex, which mediates the fusion between a vesicle and a target membrane.

This protein family has already been classified into four main groups (Qa, Qb, Qc, R), which are further divided into subgroups. This classification provides information about the function and sub-cellular localization of the protein.

In this project, we aim to build an algorithm capable of automatically classifying a SNARE sequence, thus giving rapid and accurate insights into its function. The classification method will be based on HMM profiles. We will build one HMM per class, and new sequences will be aligned to each of the HMMs. Each HMM will output a score, quantifying the likelihood of the sequence being generated by the HMM, and an e-value, representing the statistical significance of the score. The new sequence could already be classified based on those values, but in order to reduce the error rate of the HMM profiles, the e-values and scores of all the HMMs will be stored in a vector which will be the input to a random forest model. Based on those values, the random forest will redo the classification, yielding a better accuracy (an improvement of more than 7% in some cases). The final product will be a pipeline that takes a sequence as input, first determines its main group, and then determines its subgroup based on this result.
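As an illustration of the vector described above, here is a minimal sketch of how the per-HMM scores and e-values could be assembled for one sequence. The function names, the ordering of the values inside the vector and the example numbers are assumptions made for this sketch; they are not taken from the project code or from real HMM output.

```python
# Main groups in a fixed order, so that every vector has the same layout.
GROUPS = ["Qa", "Qb", "Qc", "R"]


def build_feature_vector(hits, groups=GROUPS):
    """Flatten per-HMM results into [score_Qa, evalue_Qa, score_Qb, ...].

    `hits` maps a group name to a (score, e-value) pair for one sequence.
    """
    vector = []
    for group in groups:
        score, evalue = hits[group]
        vector.extend([score, evalue])
    return vector


def best_scoring_group(hits):
    """Baseline classification: pick the group whose HMM gives the highest score."""
    return max(hits, key=lambda group: hits[group][0])


# Made-up values for a single query sequence (illustration only).
example_hits = {
    "Qa": (85.2, 1e-25),
    "Qb": (12.1, 0.003),
    "Qc": (9.4, 0.02),
    "R": (3.0, 1.5),
}

features = build_feature_vector(example_hits)
baseline = best_scoring_group(example_hits)
```

Here `baseline` corresponds to the direct classification from the HMM values, while `features` is the kind of vector that would be given to the random forest, for instance one row of the training matrix of a classifier such as scikit-learn's `RandomForestClassifier` (the choice of library is also an assumption).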


Building and using the HMM profiles

An HMM profile is a probabilistic representation of a group of sequences that characterizes both its conserved and its variable regions.

1: Building the multiple sequence alignments

As we wanted to classify proteins into groups and subgroups, we had to build an HMM profile for each group and each subgroup. HMM profiles are built from multiple sequence alignments (MSAs) so that they can differentiate variable and conserved regions. Thus, we had to build one alignment per group and subgroup. The sequences we used were all found in Tracey, the SNARE protein database of the Dirk Fasshauer group at the University of Lausanne.

For the main groups, we used the classification that was already present in the Tracey database. For each group, we downloaded a FASTA file containing all of its sequences and used MAFFT to build the MSA. However, before building the HMMs, we cleaned the MSAs by applying a threshold that deleted every position where more than 80% of the sequences of the group had a gap. After that, we checked that the domains had not been deleted by the threshold by verifying that the SNARE and Habc domains were still complete in the human sequences present in the alignments. We used human sequences for this quality check because they are the best studied, and the information regarding their SNARE and Habc domains was found in the Tracey database. This filtering step reduces the noise that the large number of gaps in the MSAs would otherwise introduce into the HMM profiles, thus increasing the accuracy of the HMMs.
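The gap filter and the domain check described above can be sketched as follows. This is only an illustration: the alignment is represented as a plain dictionary of equal-length aligned strings, and the function names, the toy alignment and the domain coordinates are hypothetical, not taken from the project code or from Tracey.

```python
def gap_fraction_per_column(msa):
    """Return, for each alignment column, the fraction of sequences with a gap."""
    sequences = list(msa.values())
    length = len(sequences[0])
    return [
        sum(seq[col] == "-" for seq in sequences) / len(sequences)
        for col in range(length)
    ]


def filter_gappy_columns(msa, max_gap_fraction=0.8):
    """Delete every column where more than `max_gap_fraction` of sequences have a gap.

    Returns the filtered alignment and the indices of the columns that were kept.
    """
    fractions = gap_fraction_per_column(msa)
    kept = [col for col, frac in enumerate(fractions) if frac <= max_gap_fraction]
    filtered = {name: "".join(seq[col] for col in kept) for name, seq in msa.items()}
    return filtered, kept


def domain_is_intact(aligned_seq, kept_columns, start, end):
    """Check that residues start..end (1-based, ungapped) all sit in kept columns."""
    kept = set(kept_columns)
    residue = 0
    for col, char in enumerate(aligned_seq):
        if char == "-":
            continue
        residue += 1
        if start <= residue <= end and col not in kept:
            return False
    return True


# Toy alignment (made up): column 2 is all gaps, column 3 has 5/6 gaps.
toy_msa = {
    "human": "MK-AL-E",
    "seq2": "MK--L-E",
    "seq3": "MR--L-D",
    "seq4": "MK--LQE",
    "seq5": "MK--L-E",
    "seq6": "MK--LQE",
}

filtered_msa, kept_columns = filter_gappy_columns(toy_msa)
tail_ok = domain_is_intact(toy_msa["human"], kept_columns, 4, 5)
head_ok = domain_is_intact(toy_msa["human"], kept_columns, 1, 3)
```

In this toy case the human residue `A` sits in a column that is removed, so a hypothetical domain covering residues 1 to 3 is reported as incomplete, while residues 4 to 5 survive the filter.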

For the subgroups, we based the classification on the phylogeny. To do this, we built a tree for each main group, based on MSAs to which the threshold was not applied, in order to keep the maximum amount of information. Every clearly separated branch was defined as a subgroup. After that, we checked whether the composition of the subgroups corresponded to the existing classification in the article "An Elaborate Classification of SNARE Proteins Sheds Light on the Conservation of the Eukaryotic Endomembrane System" by Dirk Fasshauer, Tobias H. Kloepper and C. Nickias Kienle.