Toward Robust Offensive Language Detection in Urdu YouTube Discourse
DOI:
https://doi.org/10.61467/2007.1558.2026.v17i1.1240
Keywords:
Machine Learning, Deep Learning, Logistic Regression
Abstract
This study addresses the task of detecting offensive language in Urdu text on social media, where harmful and offensive comments can lead to emotional distress. Using a dataset of Urdu comments collected from YouTube news channels, both deep learning (DL) and machine learning (ML) models were employed for offensive language detection. Feature extraction was performed using Term Frequency–Inverse Document Frequency (TF-IDF) and the Count Vectorizer approach to capture unigrams, bigrams, and trigrams. Four ML models—Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Gaussian Naive Bayes (NB)—were evaluated, with SVM achieving comparatively stronger performance among the ML approaches.
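As a rough illustration of the n-gram feature extraction described above, the following minimal sketch computes TF-IDF weights over unigrams, bigrams, and trigrams using only the Python standard library. It is not the authors' pipeline: the function names `word_ngrams` and `tfidf_vectors`, the placeholder tokens, and the unsmoothed textbook formula tf × log(N/df) are all assumptions made for this example, and library implementations typically apply smoothing and normalization that are omitted here.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs):
    """Map each tokenized document to a {ngram: tf-idf weight} dict.

    Uses the unsmoothed textbook weighting (an assumption of this sketch):
    tf = count / total n-grams in the document, idf = log(N / df).
    """
    counts = [Counter(word_ngrams(doc)) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            gram: (k / total) * math.log(n_docs / doc_freq[gram])
            for gram, k in c.items()
        })
    return vectors


# Placeholder tokens stand in for whitespace-tokenized Urdu comments.
docs = [
    ["w1", "w2", "w3"],
    ["w1", "w4"],
]
vectors = tfidf_vectors(docs)
```

In this toy input, an n-gram that occurs in every document (here `w1`) receives a weight of zero, while n-grams unique to one comment receive positive weights. Sparse vectors of this kind would then be passed to a classifier such as an SVM.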
For the deep learning framework, Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) architectures were implemented using custom Word2Vec and FastText embeddings. The experimental results indicated that the CNN model outperformed the other evaluated approaches, suggesting its effectiveness for detecting offensive language in Urdu text.
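The core operation of a text CNN can be sketched in a few lines: a filter slides over consecutive word embeddings, and max-over-time pooling keeps its strongest response. The example below is a generic illustration only. The abstract does not state filter sizes, layer counts, or embedding dimensions, so the function `conv1d_max_pool`, the 2-dimensional embeddings, and the filter weights are hypothetical values chosen for readability; in practice the input vectors would come from trained Word2Vec or FastText embeddings.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Apply one convolutional filter to a sequence of word vectors.

    The filter covers len(kernel) consecutive tokens. Each window's
    response passes through ReLU, and the maximum over all windows is
    returned (max-over-time pooling). Returns 0.0 if the sequence is
    shorter than the filter.
    """
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = bias
        for vec, weights in zip(window, kernel):
            total += sum(v * w for v, w in zip(vec, weights))
        activations.append(max(0.0, total))  # ReLU
    return max(activations) if activations else 0.0


# Hypothetical 2-dimensional embeddings for a four-token comment.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# One hypothetical filter spanning two consecutive tokens.
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature = conv1d_max_pool(sentence, kernel)
print(feature)  # 1.5
```

A full model would learn many such filters, concatenate their pooled outputs, and feed the result to a classification layer.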
License
Copyright (c) 2025 International Journal of Combinatorial Optimization Problems and Informatics

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.