Statistical Evaluation of Categorical Encoders for Pattern Preservation in Machine Learning Tasks
DOI:
https://doi.org/10.61467/2007.1558.2024.v15i2.456
Keywords:
Categorical encoding, synthetic data, machine learning, data preprocessing
Abstract
Categorical attributes are prevalent in many datasets used for training Machine Learning models. However, most ML models are designed to handle only numerical inputs. Therefore, converting these categorical attributes into numerical values is necessary to utilize them effectively. During this conversion process, it is essential to preserve the underlying patterns. A loss of such information could adversely affect the performance of ML algorithms. Several encoding techniques have been developed to map categorical instances to numbers. This study evaluates commonly used encoders alongside CESAMO, a novel encoder designed to capture relationships between categorical attributes and other variables using what is referred to as Pattern Preserving Codes. We conducted a statistically supported assessment of these categorical encoders using synthetic data and compared the encoders’ performance. The results show that CESAMO outperforms all other evaluated encoding techniques, confirming its ability to identify patterns in categorical data effectively.
License
Copyright (c) 2024 International Journal of Combinatorial Optimization Problems and Informatics
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.