Multi-Label Topic Classification in a Twitter Corpus of Public Communication of Science in Mexican Spanish

Alec Sánchez-Montero; Gemma Bel-Enguix; Sergio-Luis Ojeda-Trueba

doi:10.61467/2007.1558.2024.v15i4.510

Authors

Alec Sánchez-Montero Universidad Nacional Autónoma de México https://orcid.org/0009-0007-1181-3620
Gemma Bel-Enguix Universidad Nacional Autónoma de México
Sergio-Luis Ojeda-Trueba Universidad Nacional Autónoma de México

Keywords:

natural language processing, multi-label text classification, public communication of science, corpus, machine learning, transformers, deep learning, large language models, scientific communication

Abstract

In Mexico, the public communication of science (PCS) through social networks has not yet been studied comprehensively. This work addresses that gap from the perspective of natural language processing (NLP). Its objective is to develop and evaluate an automatic multi-label topic classification system for PCS tweets published in Mexico. To this end, several machine learning models are trained, including traditional algorithms and transformer-based models. Using a manually labeled corpus annotated with eighteen distinct areas or themes of science, the study evaluates and compares these approaches for automatically identifying and classifying the thematic areas of PCS tweets. The findings indicate that transformer-based models, such as XLM-RoBERTa, outperform classic algorithms, while emerging large language models (LLMs), such as BLOOM, offer a promising alternative for a range of NLP tasks.
