Multi-Label Topic Classification in a Twitter Corpus of Public Communication of Science in Mexican Spanish

Authors

  • Alec Sánchez-Montero Universidad Nacional Autónoma de México https://orcid.org/0009-0007-1181-3620
  • Gemma Bel-Enguix Universidad Nacional Autónoma de México
  • Sergio-Luis Ojeda-Trueba Universidad Nacional Autónoma de México

DOI:

https://doi.org/10.61467/2007.1558.2024.v15i4.510

Keywords:

natural language processing, multi-label text classification, public communication of science, corpus, machine learning, transformers, deep learning, large language models, scientific communication

Abstract

In Mexico, the public communication of science (PCS) through social networks has not yet been studied comprehensively. This work addresses that gap from the perspective of natural language processing (NLP). Its objective is to develop and evaluate an automatic multi-label topic classification system for PCS tweets published in Mexico. To this end, several machine learning models are trained, including traditional algorithms and transformer-based models. Using a manually labeled corpus that distinguishes eighteen areas or themes of science, the study evaluates and compares these approaches for automatically identifying and classifying the thematic areas of PCS tweets. The findings indicate that transformer-based models, such as XLM-RoBERTa, outperform classic algorithms, while emerging large language models (LLMs), such as BLOOM, are a promising alternative for a range of NLP tasks.
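To make the multi-label setting concrete, the following is a minimal sketch of how a tweet's set of topic labels might be encoded and how predictions might be scored. It is not the authors' pipeline: the four topic names are hypothetical placeholders (the abstract does not list the eighteen areas), and micro-averaged F1 is assumed here only as a common multi-label metric, since the abstract does not name the metric used.

```python
from typing import Iterable, List, Sequence

# Hypothetical topic inventory; the paper's eighteen areas are not listed in the abstract.
TOPICS = ["astronomy", "biology", "health", "technology"]


def to_multi_hot(labels: Iterable[str], topics: Sequence[str] = TOPICS) -> List[int]:
    """Encode a set of topic labels as a 0/1 vector with one position per topic."""
    label_set = set(labels)
    unknown = label_set - set(topics)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1 if topic in label_set else 0 for topic in topics]


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# A tweet may carry several topics at once, which is what makes the task multi-label.
gold = [to_multi_hot(["biology", "health"]), to_multi_hot(["astronomy"])]
pred = [to_multi_hot(["biology"]), to_multi_hot(["astronomy", "technology"])]
score = micro_f1(gold, pred)
```

In this toy case the system finds two of the three gold labels and adds one spurious label, so precision and recall are both 2/3. A classifier for this task, whether a traditional algorithm or a transformer, would produce one such 0/1 vector per tweet.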

Published

2024-11-04

How to Cite

Sánchez-Montero, A., Bel-Enguix, G., & Ojeda-Trueba, S.-L. (2024). Multi-Label Topic Classification in a Twitter Corpus of Public Communication of Science in Mexican Spanish. International Journal of Combinatorial Optimization Problems and Informatics, 15(4), 199–210. https://doi.org/10.61467/2007.1558.2024.v15i4.510

Section

Articles