Preprocessing of amino acid chains of antibody structure for machine learning analysis
DOI: https://doi.org/10.61467/2007.1558.2026.v17i1.1030
Keywords: Antibody classification, amino acid sequence representation
Abstract
Antibody classification is a task of growing importance in bioinformatics. In recent years, identifying antibodies capable of recognising and neutralising SARS-CoV-2 has become a central focus in immunological research and bioinformatics. Representing antibodies is challenging because their structure and function are highly variable, which complicates the development of a universal classification framework. Antibodies are composed of heavy and light chains that contain hypervariable complementarity-determining regions, which define their specificity. This structural variation creates substantial challenges for sequence alignment, feature extraction, and classification. In this research, three methods for representing amino acid sequences were compared: TF–IDF, Atchley Factors, and ProtVec. A separate dataset was generated for each representation, and each was evaluated using decision trees, logistic regression, and support vector machines. The results suggest that the representation based on Atchley Factors achieved comparatively stronger performance in the task of antibody classification.
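To make the idea of a TF–IDF sequence representation concrete, the following is a minimal sketch and not the authors' pipeline. The abstract does not say how sequences were tokenised or weighted, so the choice of overlapping 3-mers as "words", the smoothed IDF formula, the function names `kmers` and `tfidf_vectors`, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split an amino acid sequence into overlapping k-mers (the "words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf_vectors(sequences, k=3):
    """Return (vocabulary, vectors): one TF-IDF vector per input sequence."""
    # Each sequence becomes a bag of k-mer counts (a "document").
    docs = [Counter(kmers(seq, k)) for seq in sequences]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)

    # Document frequency: in how many sequences each k-mer occurs.
    df = Counter(term for doc in docs for term in doc)

    # Smoothed inverse document frequency (an assumed variant;
    # other TF-IDF formulations exist).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}

    vectors = []
    for doc in docs:
        total = sum(doc.values())
        # Term frequency is the k-mer count normalised by sequence length in k-mers.
        vectors.append([
            (doc[t] / total) * idf[t] if total else 0.0
            for t in vocabulary
        ])
    return vocabulary, vectors


# Toy fragments for illustration only; not taken from the study's dataset.
sequences = ["QVQLVQSG", "EVQLVESG", "DIQMTQSP"]
vocabulary, vectors = tfidf_vectors(sequences, k=3)
```

Each row of `vectors` is a fixed-length numeric feature vector, which is the form that classifiers such as decision trees, logistic regression, or support vector machines expect as input. K-mers shared by several sequences receive lower weights than k-mers specific to one sequence.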
License
Copyright (c) 2025 International Journal of Combinatorial Optimization Problems and Informatics

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.