Hybrid deep learning models for text-based identification of gene-disease associations

Noor Fadhil Jumaa; Jafar Razmara; Sepideh Parvizpour; Jaber Karimpour

doi:10.34172/bi.31226

Bioimpacts. 2025;15: 31226.

PMID: 40761527
PMCID: PMC12319213
Scopus ID: 105010098580

Original Article

Noor Fadhil Jumaa ¹, Jafar Razmara ¹*, Sepideh Parvizpour ², Jaber Karimpour ¹

¹ Department of Computer Science, Faculty of Mathematics, Statistics, and Computer Science, University of Tabriz, Tabriz, Iran
² Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran

*Corresponding Author: Jafar Razmara, Email: razmara@tabrizu.ac.ir

Abstract

Introduction: Identifying gene-disease associations is crucial for advancing medical research and improving clinical outcomes. Nevertheless, the rapid expansion of biomedical literature poses significant obstacles to extracting meaningful relationships from extensive text collections.
Methods: This study applies deep learning to automate the extraction of gene-disease associations from text, using three publicly available datasets (EU-ADR, GAD, and SNPPhenA). Each dataset was pre-processed through entity identification and preparation, word embedding with pre-trained Word2Vec and fastText models, and position embedding, so that semantic and contextual relationships within the text are captured. Three hybrid deep learning models were implemented and compared: CNN-LSTM, CNN-GRU, and CNN-GRU-LSTM. Each model was equipped with an attention mechanism to enhance its performance.
Results: The CNN-GRU model achieved the highest accuracy, 91.23%, on the SNPPhenA dataset, while the CNN-GRU-LSTM model attained an accuracy of 90.14% on the EU-ADR dataset. The CNN-LSTM model demonstrated superior performance on the GAD dataset, with an accuracy of 84.90%. Compared with previous state-of-the-art methods, such as BioBERT-based models, our hybrid approach demonstrates superior classification performance by effectively capturing local and sequential features without relying on heavy pre-training.
Conclusion: The developed models and their evaluation data are available at https://github.com/NoorFadhil/Deep-GDAE.

Keywords: Gene-disease association extraction, Deep learning, Attention mechanism, Feature extraction
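
The abstract names position embedding and attention mechanisms without giving their details, so the following is only an illustrative sketch of how such components are commonly built in relation extraction, not the authors' implementation. It assumes that the gene and disease mentions have been replaced by the hypothetical placeholder tokens GENE and DISEASE, that position features are clipped relative distances to each entity, and that attention is a simple dot-product pooling over per-token feature vectors. The function names, the clipping distance, and the toy vectors are all assumptions made for illustration.

```python
import math

def relative_positions(tokens, entity_index, max_dist=50):
    """Return, for each token, its clipped signed distance to the entity,
    shifted by max_dist so it can index a position-embedding table."""
    positions = []
    for i in range(len(tokens)):
        distance = max(-max_dist, min(max_dist, i - entity_index))
        positions.append(distance + max_dist)
    return positions

def attention_pool(states, query):
    """Dot-product attention pooling (an assumed variant): score each
    state against the query, softmax the scores, and return the weights
    together with the weighted sum of the states."""
    scores = [sum(s * q for s, q in zip(state, query)) for state in states]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    pooled = [sum(w * state[k] for w, state in zip(weights, states))
              for k in range(dim)]
    return weights, pooled

# Hypothetical sentence with entity mentions already masked.
tokens = "mutations in GENE are associated with DISEASE".split()
gene_pos = relative_positions(tokens, tokens.index("GENE"), max_dist=3)
disease_pos = relative_positions(tokens, tokens.index("DISEASE"), max_dist=3)

# Toy per-token feature vectors standing in for CNN/GRU outputs.
toy_states = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = attention_pool(toy_states, query=[0.0, 0.0])
```

In a full model, the two position index sequences would each be mapped to a learned embedding and concatenated with the word embedding of every token before the convolutional and recurrent layers. The pooled vector would then feed the final classifier.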