A dataset is imbalanced when one class has far more instances than another. A model trained on such data tends to favor the majority class, so it can report high accuracy while missing most minority-class cases. Imbalanced datasets are common in real-world applications such as fraud detection, medical diagnosis, and customer churn prediction. In this article, we explore the 10 best imbalanced datasets in 2023.
Dataset Name | Domain | Size | Download Link | Description |
---|---|---|---|---|
Credit Card Fraud Detection | Finance | 284,807 records | https://www.kaggle.com/mlg-ulb/creditcardfraud | This dataset includes information on credit card transactions, of which only a very small fraction (492, about 0.17%) are fraudulent. |
Santander Customer Transaction Prediction | Finance | 200,000 records | https://www.kaggle.com/c/santander-customer-transaction-prediction | This dataset includes information on customer transactions, with a minority of positive classifications (i.e. customer will make a transaction). |
Steel Plate Faults | Manufacturing | 1,941 records | https://www.kaggle.com/uciml/steel-plates-faults | This dataset includes information on faulty steel plates labeled with one of seven fault types, with a high imbalance between the common and rare fault types. |
Titanic Survival | Social Science | 891 records (training set) | https://www.kaggle.com/c/titanic | This dataset includes information on passengers on the Titanic, with a moderate imbalance between those who survived (about 38%) and those who did not. |
Pima Indians Diabetes | Healthcare | 768 records | https://www.kaggle.com/uciml/pima-indians-diabetes-database | This dataset includes information on patient medical records, with a moderate imbalance between positive and negative classifications (about 35% of patients have diabetes). |
Employee Turnover | HR | 14,999 records | https://www.kaggle.com/ludobenistant/hr-dataset | This dataset includes information on employee turnover, with an imbalance between employees who left (roughly a quarter) and those who stayed. |
Synthetic Financial Datasets For Fraud Detection (PaySim) | Finance | 6,362,620 records | https://www.kaggle.com/ntnu-testimon/paysim1 | This dataset includes synthetic mobile money transactions generated by the PaySim simulator, with a very high imbalance between fraudulent (about 0.13%) and non-fraudulent transactions. |
Telco Customer Churn | Telecommunications | 7,043 records | https://www.kaggle.com/blastchar/telco-customer-churn | This dataset includes information on customer churn in a telecommunications company, with an imbalance between customers who stayed and those who left (roughly a quarter). |
Heart Disease UCI | Healthcare | 303 records | https://www.kaggle.com/ronitf/heart-disease-uci | This dataset includes information on patient medical records, with only a mild imbalance between positive and negative classifications (i.e. patients with and without heart disease). |
Medical Cost Personal Datasets | Healthcare | 1,338 records | https://www.kaggle.com/mirichoi0218/insurance | This dataset includes information on medical insurance costs. The target is a continuous charge rather than a class label, and it is skewed, with many low-cost records and relatively few high-cost ones. |
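
Before working with any dataset in the table, it helps to measure how imbalanced its labels actually are. The following is a minimal sketch using only the Python standard library. The function names and the toy label list are hypothetical and for illustration only; they are not taken from any of the datasets above. It reports the share of each class, the majority-to-minority ratio, and the accuracy a model would get by always predicting the majority class, which is the bias described in the introduction.

```python
from collections import Counter


def class_distribution(labels):
    """Return a dict mapping each class label to its share of the data."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(labels):
    """Return the size of the largest class divided by the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def majority_baseline_accuracy(labels):
    """Accuracy of a trivial model that always predicts the majority class."""
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())


if __name__ == "__main__":
    # Hypothetical toy labels (an assumption): 95 negatives, 5 positives.
    toy_labels = [0] * 95 + [1] * 5
    print("distribution:", class_distribution(toy_labels))
    print("imbalance ratio:", imbalance_ratio(toy_labels))
    print("majority baseline accuracy:", majority_baseline_accuracy(toy_labels))
```

With real data, the label list would come from the target column of the downloaded file. A high majority baseline accuracy is a sign that plain accuracy is a poor metric for that dataset, and that metrics such as precision, recall, or F1 on the minority class are more informative.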