Toward Robust Offensive Language Detection in Urdu YouTube Discourse

Authors

  • Fiaz Ahmad, The University of Central Punjab (UCP), Pakistan.
  • Nisar Hussain, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), México.
  • Amna Qasim, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), México. https://orcid.org/0000-0002-7536-6969
  • Muhammad Usman, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), México.
  • Muhammad Zain, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), México.
  • Momina Hafeez, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), México.
  • Fatima Hafeez, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), México.
  • Grigori Sidorov, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), México. https://orcid.org/0000-0003-3901-3522

DOI:

https://doi.org/10.61467/2007.1558.2026.v17i1.1240

Keywords:

Machine Learning, Deep Learning, Logistic Regression

Abstract

This study addresses the detection of offensive language in Urdu social media text, where harmful comments can cause emotional distress. Using a dataset of Urdu comments collected from YouTube news channels, both machine learning (ML) and deep learning (DL) models were evaluated. Features were extracted with Term Frequency–Inverse Document Frequency (TF-IDF) and Count Vectorizer representations over unigrams, bigrams, and trigrams. Four ML models were evaluated: Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Gaussian Naive Bayes (NB). Of these, SVM achieved the strongest performance among the ML approaches.
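As a rough illustration of the feature extraction step described above, the following dependency-free Python sketch builds word unigrams, bigrams, and trigrams and weights them with TF-IDF. It is not the authors' implementation: the function names `word_ngrams` and `tfidf`, the whitespace tokenizer, the smoothed IDF variant, and the transliterated sample comments are all assumptions made for the example.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(corpus):
    """Compute one {ngram: weight} dict per document.

    Uses raw term frequency and a smoothed IDF, log((1 + N) / (1 + df)) + 1.
    This is one common variant; the abstract does not say which was used.
    """
    docs = [Counter(word_ngrams(doc)) for doc in corpus]
    n_docs = len(docs)
    df = Counter()  # document frequency: number of documents containing each n-gram
    for counts in docs:
        df.update(counts.keys())
    idf = {g: math.log((1 + n_docs) / (1 + d)) + 1 for g, d in df.items()}
    return [{g: tf * idf[g] for g, tf in counts.items()} for counts in docs]


# Hypothetical placeholder comments (transliterated), not taken from the study's dataset.
corpus = ["yeh khabar achi hai", "yeh khabar buri hai", "bohat achi video"]
weights = tfidf(corpus)
```

Each dictionary in `weights` could then be converted to a sparse feature vector and passed to a classifier such as an SVM. N-grams that occur in only one comment receive a higher weight than those shared across comments.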

For deep learning, Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) architectures were implemented with custom Word2Vec and FastText embeddings. In the experiments, the CNN model outperformed the other evaluated approaches, suggesting that it is effective for detecting offensive language in Urdu text.
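To make the CNN idea more concrete, the sketch below shows the core operation a text CNN applies to a sequence of word embeddings: one filter slides over consecutive word vectors, followed by ReLU and max-over-time pooling. The abstract does not describe the actual architecture, so the function name `conv1d_max_pool`, the filter width, and all numeric values are illustrative assumptions rather than the authors' configuration.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors, apply ReLU,
    and return the maximum activation (max-over-time pooling)."""
    width = len(kernel)  # number of consecutive tokens the filter spans
    if len(embeddings) < width:
        raise ValueError("sequence shorter than filter width")
    activations = []
    for start in range(len(embeddings) - width + 1):
        total = bias
        for offset in range(width):
            vec = embeddings[start + offset]
            w = kernel[offset]
            total += sum(x * y for x, y in zip(vec, w))  # dot product
        activations.append(max(0.0, total))  # ReLU
    return max(activations)


# Toy 2-dimensional "embeddings" for a 4-token comment (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning 2 tokens (made-up weights).
kernel = [[1.0, -1.0], [0.5, 0.5]]
feature = conv1d_max_pool(sentence, kernel)
```

In a full model, learned Word2Vec or FastText vectors would replace the toy embeddings, and many such filters, typically of several widths, would each contribute one pooled feature to a final classification layer.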

 



Published

2026-01-02

How to Cite

Ahmad, F., Hussain, N., Qasim, A., Usman, M., Zain, M., Hafeez, M., Hafeez, F., & Sidorov, G. (2026). Toward Robust Offensive Language Detection in Urdu YouTube Discourse. International Journal of Combinatorial Optimization Problems and Informatics, 17(1), 76–84. https://doi.org/10.61467/2007.1558.2026.v17i1.1240

Section

Articles
