RA2Vec: Distributed representation of protein sequences with reduced alphabet embeddings

Published in Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2020

This study introduces RA2Vec, a method for protein function identification based on reduced amino acid alphabets. Swiss-Prot sequences are mapped to reduced alphabets defined by hydropathy and conformational similarity, and a skip-gram model is trained to produce embedding vectors for each reduced alphabet. These vectors serve as input to Support Vector Machine classifiers, and Recursive Feature Elimination is then applied to select features and maximize accuracy. The results show that certain combinations of these representations can significantly improve classification performance.
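
As a rough illustration of the preprocessing described above, the sketch below maps a protein sequence onto a reduced alphabet and splits it into k-mer "words" suitable for a skip-gram model. The three-class hydropathy grouping, the lowercase group labels, the choice of k = 3, and the function names `reduce_sequence` and `kmer_sentences` are all assumptions made for this example; they are not taken from the paper, which may use different groupings and tokenization.

```python
# Illustrative three-class hydropathy grouping (an assumption for this sketch,
# not necessarily the reduced alphabet used in the paper).
HYDROPATHY_GROUPS = {
    "p": "RKEDQN",   # polar
    "n": "GASTPHY",  # neutral
    "h": "CLVIMFW",  # hydrophobic
}

# Invert the grouping into a residue -> group-label lookup table.
RESIDUE_TO_GROUP = {
    residue: label
    for label, residues in HYDROPATHY_GROUPS.items()
    for residue in residues
}


def reduce_sequence(sequence, unknown="x"):
    """Rewrite an amino acid sequence in the reduced alphabet.

    Residues outside the 20 standard amino acids map to `unknown`.
    """
    return "".join(RESIDUE_TO_GROUP.get(r, unknown) for r in sequence.upper())


def kmer_sentences(reduced, k=3):
    """Split a reduced sequence into k lists of non-overlapping k-mers,
    one list per starting offset (0 .. k-1)."""
    return [
        [reduced[i:i + k] for i in range(offset, len(reduced) - k + 1, k)]
        for offset in range(k)
    ]


if __name__ == "__main__":
    reduced = reduce_sequence("MKVLAAGIW")
    print(reduced)
    print(kmer_sentences(reduced))
```

Each list of k-mers can then be treated as a sentence and passed to any skip-gram implementation to learn one embedding per reduced k-mer; the per-sequence vectors built from those embeddings would be the inputs to the classifier.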

Recommended citation: Wijesekara, Rajitha Yasas, et al. "RA2Vec: Distributed representation of protein sequences with reduced alphabet embeddings." Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. 2020. https://dl.acm.org/doi/abs/10.1145/3388440.3414925