Latent Dirichlet Allocation complement in the vector space model for Multi-Label Text Classification

Víctor Carrera-Trejo; Grigori Sidorov; Sabino Miranda-Jiménez; Marco Moreno Ibarra; Rodrigo Cadena Martínez

Authors

Víctor Carrera-Trejo, Centro de Investigación en Computación
Grigori Sidorov, Centro de Investigación en Computación
Sabino Miranda-Jiménez, Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación (INFOTEC)
Marco Moreno Ibarra, Centro de Investigación en Computación
Rodrigo Cadena Martínez, UNITEC

Keywords:

Multi-label text classification, Reuters-21578, Latent Dirichlet Allocation, Vector Space Model

Abstract

In text classification, one of the main problems is choosing which features give the best results. Various features can be used, such as words, n-grams, and syntactic n-grams of various types (POS tags, dependency relations, mixed, etc.), or combinations of these features. Dimensionality reduction algorithms, such as Latent Dirichlet Allocation (LDA), can also be applied to these feature sets. In this paper, we consider the multi-label text classification task and apply various feature sets to a subset of multi-labeled files from the Reuters-21578 corpus. We use traditional tf-idf values of the features and try both keeping and ignoring stop words. We also try several combinations of features, such as unigrams and bigrams, and experiment with adding LDA results to the Vector Space Model as new features. These last experiments obtained the best results.
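
The idea of complementing a vector space model with LDA output can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "adding LDA results as new features" means appending each document's topic proportions to its tf-idf vector. The function names (`tokens`, `tfidf_matrix`, `lda_gibbs`), the toy documents, the hyperparameters, the use of a small collapsed Gibbs sampler, and the choice to fit LDA on unigrams only are all assumptions made for illustration. Stop-word handling and the multi-label classifier itself are left out.

```python
import math
import random
from collections import Counter

def tokens(text, use_bigrams=True):
    """Lowercased unigrams, optionally followed by adjacent-word bigrams."""
    words = text.lower().split()
    grams = list(words)
    if use_bigrams:
        grams += [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return grams

def tfidf_matrix(docs_tokens):
    """Return (vocabulary, rows), weighting each term by tf * log(N / df)."""
    vocab = sorted({t for doc in docs_tokens for t in doc})
    n_docs = len(docs_tokens)
    df = Counter(t for doc in docs_tokens for t in set(doc))
    rows = []
    for doc in docs_tokens:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows

def lda_gibbs(docs_tokens, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab_size = len({t for doc in docs_tokens for t in doc})
    doc_topic = [[0] * n_topics for _ in docs_tokens]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs_tokens):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    # Resample each token's topic from its conditional distribution.
    for _ in range(n_iter):
        for d, doc in enumerate(docs_tokens):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs_tokens, doc_topic)
    ]

# Hypothetical toy documents standing in for Reuters-21578 articles.
docs = [
    "grain wheat exports rise",
    "crude oil prices rise",
    "wheat and grain shipments",
    "oil output and crude prices",
]
vocab, tfidf = tfidf_matrix([tokens(d) for d in docs])
theta = lda_gibbs([tokens(d, use_bigrams=False) for d in docs], n_topics=2)

# Complemented representation: tf-idf features followed by topic proportions.
features = [row + topics for row, topics in zip(tfidf, theta)]
```

Each row of `features` has one column per unigram or bigram plus one column per topic, so the topic proportions act as a few dense features alongside the sparse term weights. In practice a library implementation of LDA and a multi-label learner would replace the toy sampler.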
