Statistical Evaluation of Categorical Encoders for Pattern Preservation in Machine Learning Tasks

Authors

  • Eric Valdez-Valenzuela, Universidad Nacional Autónoma de México
  • Angel Kuri-Morales, Instituto Tecnológico Autónomo de México
  • Helena Gomez-Adorno, Universidad Nacional Autónoma de México

DOI:

https://doi.org/10.61467/2007.1558.2024.v15i2.456

Keywords:

Categorical encoding, synthetic data, machine learning, data preprocessing

Abstract

Categorical attributes are prevalent in many datasets used to train Machine Learning (ML) models. However, most ML models accept only numerical inputs, so categorical attributes must be converted into numerical values before they can be used. This conversion should preserve the underlying patterns in the data, since losing that information can degrade the performance of ML algorithms. Several encoding techniques have been developed to map categorical values to numbers. This study evaluates commonly used encoders alongside CESAMO, a novel encoder designed to capture relationships between categorical attributes and other variables through what are referred to as Pattern Preserving Codes. We conducted a statistically supported assessment of these categorical encoders on synthetic data and compared their performance. The results show that CESAMO outperforms all other evaluated encoding techniques, confirming its ability to identify patterns in categorical data effectively.
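To make the idea of mapping categories to numbers concrete, the following is a minimal sketch of two widely known encoders, ordinal and one-hot encoding. It is an illustration only: the abstract does not name the baseline encoders that were evaluated, so choosing these two is an assumption, and the sketch does not attempt to reproduce CESAMO. The function names and the sample data are hypothetical.

```python
from typing import Dict, List


def ordinal_encode(values: List[str]) -> List[int]:
    """Assign each category an integer code in order of first appearance."""
    codes: Dict[str, int] = {}
    for value in values:
        # A new category receives the next unused integer.
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values]


def one_hot_encode(values: List[str]) -> List[List[int]]:
    """Represent each category as a binary indicator vector.

    Columns follow the alphabetical order of the distinct categories.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


# Hypothetical sample attribute, not taken from the paper's synthetic data.
colors = ["red", "green", "blue", "green"]
ordinal = ordinal_encode(colors)   # [0, 1, 2, 1]
one_hot = one_hot_encode(colors)   # columns: blue, green, red
```

Note that the ordinal codes impose an arbitrary order on the categories (here, red < green < blue), which is one way an encoding can introduce or obscure patterns, while one-hot encoding avoids the ordering at the cost of one column per category.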

Published

2024-06-12

How to Cite

Valdez-Valenzuela, E., Kuri-Morales, A., & Gomez-Adorno, H. (2024). Statistical Evaluation of Categorical Encoders for Pattern Preservation in Machine Learning Tasks. International Journal of Combinatorial Optimization Problems and Informatics, 15(2), 160–172. https://doi.org/10.61467/2007.1558.2024.v15i2.456

Section

Articles