I have noticed a recurring pattern on data science websites such as Kaggle: someone shares a machine learning model that claims 100% accuracy (or a perfect score on every metric). Near-perfect scores are possible, but they are uncommon in most real-world scenarios.
As I looked into these models further, I noticed that in almost every case the author had either overlooked, or was unaware of, the fact that these scores were caused by some form of data leakage.
What is Data Leakage?
In layman's terms, data leakage is when your model is fit on data that it shouldn't have access to. It happens when your training and test datasets share information with each other when they should be independent.
In the real world, when a model is put into production, it predicts on data that it has never seen before. Since the test dataset's purpose is to mimic that unseen data, you should always take the necessary precautions to prevent any leakage between the two.
The simplest and silliest form of data leakage, in a supervised learning problem, is to include the label (or a direct copy of it) among your training features. In this situation, regardless of what other features you choose to include, the leaked label will give you a model with perfect accuracy.
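To make this concrete, here is a minimal sketch on synthetic data. The dataset, the decision tree, and the helper name `holdout_accuracy` are all my own assumptions for illustration; the only point is to compare a model trained on pure noise with one that also sees a copy of the label.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: 500 rows, 3 pure-noise features, and a random binary label.
X_noise = rng.normal(size=(500, 3))
y = rng.integers(0, 2, size=500)

# Leaky feature matrix: the label itself is appended as a fourth column.
X_leaky = np.column_stack([X_noise, y])


def holdout_accuracy(X, y):
    """Fit a decision tree on a training split and score it on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return model.score(X_test, y_test)


honest_acc = holdout_accuracy(X_noise, y)
leaky_acc = holdout_accuracy(X_leaky, y)
print(f"noise only: {honest_acc:.2f}, with leaked label: {leaky_acc:.2f}")
```

The noise-only model should land close to a coin flip, while the leaky one simply reads the answer off the extra column.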
Common Causes of Data Leakage
1. Mishandling missing values
When handling missing values, depending on your use case and the size of your data, it is common to impute them with a summary statistic, such as the mean for numeric features and the mode for categorical features. The issue arises when those summary statistics are computed over the entire dataset, before it is split into training and test sets. The test rows then influence the values filled into the training rows, which is a form of data leakage.
To prevent data leakage in this case, split the data first, compute the imputation statistics from the training set only, and then apply those same statistics to the test set. A pipeline with function transformers can be a neat way to do this.
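Here is one way this might look in scikit-learn. Note that I am swapping in `SimpleImputer` rather than a custom function transformer, and the synthetic data and the logistic regression model are assumptions made purely for the sake of the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Synthetic numeric data with roughly 10% of the values knocked out.
X = rng.normal(loc=5.0, size=(200, 3))
y = (X[:, 0] > 5.0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Split first, so the test rows cannot influence the imputation statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# fit() learns the column means from X_train only;
# scoring on X_test reuses those same training means.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
model.fit(X_train, y_train)

# Check: the imputer's stored statistics match the training-set column means.
train_means = np.nanmean(X_train, axis=0)
learned_means = model.named_steps["simpleimputer"].statistics_
print(np.allclose(train_means, learned_means))
print(model.score(X_test, y_test))
```

Because the imputer lives inside the pipeline, there is no separate step where you could accidentally fit it on the full dataset.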
2. Shuffling Time Series Data
In common machine learning packages, such as scikit-learn, the train-test split shuffles your data before splitting by default. Since order matters in time series data (a point in time always depends on what happened before it), it is crucial to split your training and test sets with that order in mind. Otherwise, the model is trained on observations from the future relative to the ones it is tested on.
To avoid this issue, you can choose a cut-off point in time, use everything before it for training, and use everything after it for testing. This way, you eliminate the risk of mixing your time series data.
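As a rough sketch, a cut-off split could look like the following. The daily series and the cut-off date are hypothetical; the second option uses the `shuffle=False` argument of scikit-learn's `train_test_split` to get the same result without a date.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical daily series: 100 observations in chronological order.
timestamps = np.arange("2023-01-01", "2023-04-11", dtype="datetime64[D]")
values = np.arange(100, dtype=float)

# Option 1: an explicit cut-off date. Everything before it is training data.
cutoff = np.datetime64("2023-03-22")
train_mask = timestamps < cutoff
X_train, X_test = values[train_mask], values[~train_mask]

# Option 2: disable shuffling so the last 20% of rows become the test set.
X_train2, X_test2 = train_test_split(values, test_size=0.2, shuffle=False)

# Sanity check: every training timestamp precedes every test timestamp.
print(timestamps[train_mask].max() < timestamps[~train_mask].min())
print(len(X_train), len(X_test))
```

Both options only work if the rows are already sorted by time, so it is worth sorting explicitly before splitting.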
3. Over/Under Sampling Imbalanced Datasets
When working with an imbalanced dataset, such as in fraud detection, where fraudulent transactions are far rarer than safe ones, it is very common to oversample the minority class to prevent or minimize your model's bias towards the majority class.
When oversampling, it is again important to split the data into training and test sets first and to oversample only the training set. If you oversample before splitting, duplicated records can end up in both sets, leaking data from the test set into the training set.
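A minimal sketch of split-then-oversample, assuming simple random oversampling with `sklearn.utils.resample` and a made-up dataset with a 95/5 class ratio (other oversampling techniques would follow the same order of operations):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 950 "safe" rows (0) and 50 "fraud" rows (1).
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)

# Split first; stratify keeps the class ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Oversample the minority class using training rows only.
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

# Rebuild a balanced training set; the test set is left untouched.
X_train_bal = np.vstack([majority, minority_upsampled])
y_train_bal = np.concatenate(
    [np.zeros(len(majority), dtype=int), np.ones(len(minority_upsampled), dtype=int)]
)

print(np.bincount(y_train_bal), np.bincount(y_test))
```

The test set keeps its original imbalance, which is what you want, since that is what the model will face in production.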
Good Habits to Prevent Indirect Data Leakage
- If in doubt, always split your data first.
- Incorporate a pipeline, so that every preprocessing step is fit on the training data only.
- Validate using K-fold cross-validation to detect inconsistencies across the validation folds.
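The last two habits combine nicely. In the sketch below (the synthetic data, the preprocessing steps, and the model are my own choices, not a prescription), the whole pipeline is passed to `cross_val_score`, so the imputer and scaler are refit on the training portion of each fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data with about 5% missing values.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan

# All preprocessing lives inside the pipeline, so each fold fits it
# on that fold's training rows only.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, cv=cv)
print(scores.mean(), scores.std())
```

If the fold scores vary wildly, or sit far from what you later see on the held-out test set, that is a hint worth investigating. For time series data you would not shuffle the folds like this.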
Data leakage is when information from outside the training dataset is used to create the model. To prevent it, a good habit is to ask yourself, before every transformation step, whether that step could mix information between your training and test datasets. If the answer is yes, you now know what to do 😉