HMM-based classification for conserved SNARE protein domains

by Marius Audenis, Leana Ortolani and Gabriel Chiche

Supervisor: Carlos Pulido Quetglas

Link to the presentation: https://unils-my.sharepoint.com/:p:/r/personal/marius_audenis_unil_ch/_layouts/15/Doc.aspx?sourcedoc=%7B3a03a6f1-e7b0-4099-977c-fb2a72924716%7D&action=edit&wdPreviousSession=05951c8f-6f64-d402-8f8d-a9743eda88ca

Introduction

The SNARE family proteins play a role in membrane fusion and can be involved in several processes such as neurotransmission. All of these proteins share a common motif, which we call SNARE. This motif is essential to form the SNARE complex, which mediates the fusion between a vesicle and a target membrane.

This protein family has already been classified into four main groups (Qa, Qb, Qc, R), which are further divided into subgroups. This classification informs about the function and sub-cellular localization of the protein.

In this project, we aim to build an algorithm capable of automatically classifying a SNARE sequence, thus giving rapid and accurate insights into its function. The classification method will be based on HMM profiles. We will build one HMM per class, and each new sequence will be aligned to every HMM. Each HMM will output a score, quantifying the likelihood of the sequence being generated by that HMM, and an e-value, representing the statistical significance of the score. The new sequence could already be classified based on those values alone, but in order to reduce the error rate of the HMM profiles, the e-values and scores of all the HMMs will be stored in a vector, which will be the input to a random forest model. Based on those values, the random forest will redo the classification, yielding a better accuracy (an improvement of more than 7% in some cases).
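As a minimal sketch of the vector described above, the Python snippet below collects the score and e-value of each HMM into one flat list and also shows the direct classification by best score. The function names, the example values and the defaults used when an HMM reports no hit are assumptions made for illustration, not the project's actual code, and only the four main groups are listed.

```python
# The four main SNARE groups; subgroups would be handled the same way.
CLASSES = ["Qa", "Qb", "Qc", "R"]

# Assumed placeholder values for an HMM that reports no hit (hypothetical choice).
NO_HIT = (0.0, 10.0)


def build_feature_vector(hits, classes=CLASSES):
    """Flatten per-class (score, e-value) pairs into a single vector.

    `hits` maps a class name to a (score, e_value) tuple.
    The order of `classes` fixes the layout of the vector.
    """
    vector = []
    for name in classes:
        score, e_value = hits.get(name, NO_HIT)
        vector.extend([score, e_value])
    return vector


def best_scoring_class(hits):
    """Direct classification: pick the class whose HMM gives the highest score."""
    return max(hits, key=lambda name: hits[name][0])


# Made-up values for one sequence, for illustration only.
example_hits = {
    "Qa": (85.2, 1e-25),
    "Qb": (12.1, 3e-3),
    "Qc": (8.4, 0.1),
    "R": (3.0, 2.0),
}
features = build_feature_vector(example_hits)
```

A list like `features`, computed for every training sequence and paired with its known class, is the kind of input a random forest implementation (for example scikit-learn's `RandomForestClassifier`) could be trained on.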


Building and using the HMM profiles

HMM profiles are probabilistic representations of a group of sequences that capture both the conserved and the variable regions. They are built from multiple sequence alignments. We used the HMMER software to build the profiles.
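As a rough sketch of how HMMER can be driven for one profile per class, the Python snippet below assembles the command lines for `hmmbuild` (alignment to profile) and `hmmsearch` (profile against a sequence file, with a tabular output). The directory names, file extensions and helper names are hypothetical, and the use of `hmmsearch` for scoring is an assumption; the snippet only builds the argument lists and does not run HMMER.

```python
def hmmbuild_command(class_name, alignment_dir="alignments", profile_dir="profiles"):
    """Return the argv for: hmmbuild <hmmfile_out> <msafile>.

    Assumes one alignment file per class, e.g. alignments/Qa.sto (hypothetical layout).
    """
    msa_file = f"{alignment_dir}/{class_name}.sto"
    hmm_file = f"{profile_dir}/{class_name}.hmm"
    return ["hmmbuild", hmm_file, msa_file]


def hmmsearch_command(class_name, fasta_file, profile_dir="profiles", out_dir="hits"):
    """Return the argv for: hmmsearch --tblout <table> <hmmfile> <seqdb>.

    The --tblout table lists, per target sequence, the score and e-value.
    """
    hmm_file = f"{profile_dir}/{class_name}.hmm"
    table_file = f"{out_dir}/{class_name}.tbl"
    return ["hmmsearch", "--tblout", table_file, hmm_file, fasta_file]


# One build command per main group.
build_commands = [hmmbuild_command(name) for name in ["Qa", "Qb", "Qc", "R"]]
```

Each list can be passed to `subprocess.run(command, check=True)` on a machine where HMMER is installed.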

'1: Building the multiple sequence alignments'