The Impact Of Imputation Timing On Model Performance Estimation

Document Type

Conference Proceeding

Publication Date

8-19-2025

Published In

2025 IEEE International Conference On Artificial Intelligence Testing (AITest)

Abstract

Handling missing data is a critical challenge in applied machine learning, because most algorithms assume complete data. Imputation, the process of replacing missing values with estimates derived from the available data, is a common solution. This study investigates how imputation timing (before vs. after the train-test split) affects performance estimates for machine learning classifiers, focusing on the biases introduced by different imputation strategies. We evaluate imputation before train-test split (IBS) and imputation after train-test split (IAS) across multiple datasets and imputation methods, including Random Forest (RF), k-nearest neighbors (KNN), and mean imputation. Our findings reveal that IBS consistently overestimates generalization performance, whereas IAS underestimates it; in both cases the error grows as the proportion of missing data increases. These discrepancies highlight the potential for bias and instability in performance estimates and emphasize the need to handle imputation carefully to avoid misleading conclusions about model robustness. Our results further underscore the influence of missing data rates and dataset characteristics on classifier performance, suggesting that no single imputation method is universally appropriate.
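The two timings compared in the abstract can be illustrated with a minimal sketch using mean imputation only. This is not the authors' experimental code: the helper names (column_means, fill, split, impute_before_split, impute_after_split) are hypothetical, missing values are represented as None, and IAS is assumed here to mean that the imputation statistics are computed from the training rows alone and then applied to both partitions. The abstract does not specify that detail, so other readings of IAS are possible.

```python
import random
from statistics import fmean


def column_means(rows):
    """Per-column mean over observed (non-None) values."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: fall back to 0.0 if a column has no observed values.
        means.append(fmean(observed) if observed else 0.0)
    return means


def fill(rows, means):
    """Replace each None with the supplied mean for its column."""
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def split(rows, test_fraction, seed):
    """Shuffle row indices with a fixed seed and hold out a test set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    test_idx = set(idx[:int(len(rows) * test_fraction)])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


def impute_before_split(rows, test_fraction=0.25, seed=0):
    """IBS: means come from ALL rows, so test rows influence the fill values."""
    filled = fill(rows, column_means(rows))
    return split(filled, test_fraction, seed)


def impute_after_split(rows, test_fraction=0.25, seed=0):
    """IAS (as assumed here): means come from the training rows only."""
    train, test = split(rows, test_fraction, seed)
    means = column_means(train)
    return fill(train, means), fill(test, means)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    data = [[1.0, None], [2.0, 4.0], [None, 6.0], [4.0, 8.0]]
    print("IBS:", impute_before_split(data))
    print("IAS:", impute_after_split(data))
```

With the same seed, both functions hold out the same rows, so any difference in the filled values comes only from which rows contributed to the column means. The study's RF and KNN imputers would follow the same pattern, with the fitted imputer taking the place of the column means.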

Keywords

Missing Data, Imputation, Performance Estimation, Data Preprocessing, Data Quality

Published By

IEEE

Conference

IEEE AITest 2025

Conference Dates

July 21-24, 2025

Conference Location

Tucson, AZ
