Feature extraction method in the classification of texts for authorship attribution
Keywords:
Authorship attribution, feature extraction, text classification, supervised learning
Abstract
Authorship attribution has mainly been studied as a text classification problem. Feature extraction has followed two approaches, profile-based and instance-based, both of which analyze textual or linguistic features that capture an author's writing style. In both approaches, extracting features from the authors' full set of documents yields a high-dimensional feature space that can impair classification performance.
We therefore propose an approach in which feature extraction depends neither on the full set of documents nor on feature selection; texts were classified with several supervised learning methods. This investigation examines whether a single document contains all the features that describe an author's writing style. The experiments used three corpora (C10, C50 and PAN12), selected on the basis of a literature review. The results show that the approach outperforms the state of the art on unbalanced samples, gives consistent results across different contexts, and remains robust when analyzing 10 or 50 authors.
The approach indicates that an author's writing style is contained in 500 non-repeated words, with a classification accuracy of 79.68%.