10 Best Imbalanced Datasets [2023]

A dataset is imbalanced when one class has far more instances than another. A model trained on such data tends to favor the majority class, so its predictions are biased against the minority class, which is often the class of real interest. Imbalance is common in real-world applications such as fraud detection, medical diagnosis, and customer churn prediction. This article explores the 10 best imbalanced datasets of 2023.
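
To make the bias toward the majority class concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (990 negatives and 10 positives), and `majority_baseline_report` is a hypothetical helper name, not part of any library. It scores a "model" that always predicts the majority class, which looks accurate while finding none of the minority cases.

```python
from collections import Counter


def majority_baseline_report(labels, positive=1):
    """Score a baseline that always predicts the most frequent label."""
    counts = Counter(labels)
    majority_label, _ = counts.most_common(1)[0]
    predictions = [majority_label] * len(labels)

    # Accuracy: share of predictions that match the true label.
    correct = sum(p == y for p, y in zip(predictions, labels))
    accuracy = correct / len(labels)

    # Recall on the positive (minority) class: share of positives found.
    positives = sum(1 for y in labels if y == positive)
    true_positives = sum(
        1 for p, y in zip(predictions, labels) if p == positive and y == positive
    )
    recall = true_positives / positives if positives else 0.0

    return {"majority_label": majority_label, "accuracy": accuracy, "recall": recall}


# Toy labels, invented for this example: 990 negatives, 10 positives.
labels = [0] * 990 + [1] * 10
report = majority_baseline_report(labels)
print(report)
```

On these toy labels the baseline is right 99% of the time yet detects no positives, which is why metrics such as recall or precision are usually reported alongside accuracy on imbalanced data.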

| Dataset Name | Domain | Size | Download Link | Description |
| --- | --- | --- | --- | --- |
| Credit Card Fraud Detection | Finance | 284,807 records | https://www.kaggle.com/mlg-ulb/creditcardfraud | Credit card transactions, of which only a very small fraction (492, about 0.17%) are fraudulent. |
| Santander Customer Transaction Prediction | Finance | 200,000 records | https://www.kaggle.com/c/santander-customer-transaction-prediction | Customer transactions, with a minority of positive classifications (i.e. the customer will make a transaction). |
| Steel Plate Faults | Manufacturing | 1,941 records | https://www.kaggle.com/uciml/steel-plates-faults | Steel plates labeled by fault type, with an uneven distribution across the fault classes. |
| Titanic Survival | Social Science | 887 records | https://www.kaggle.com/c/titanic | Passengers on the Titanic, with a moderate imbalance between those who survived (the minority) and those who did not. |
| Pima Indians Diabetes | Healthcare | 768 records | https://www.kaggle.com/uciml/pima-indians-diabetes-database | Patient medical records, with a moderate imbalance between positive and negative classifications (i.e. patients with and without diabetes). |
| Employee Turnover | HR | 14,999 records | https://www.kaggle.com/ludobenistant/hr-dataset | Employee records, with an imbalance between employees who left (the minority) and those who stayed. |
| Fraud Detection in E-Commerce | E-Commerce | 6,362,620 records | https://www.kaggle.com/ntnu-testimon/paysim1 | Simulated mobile money transactions (PaySim), with a high imbalance between fraudulent and non-fraudulent transactions. |
| Telco Customer Churn | Telecommunications | 7,043 records | https://www.kaggle.com/blastchar/telco-customer-churn | Customer churn in a telecommunications company, with an imbalance between customers who stayed and those who left (the minority). |
| The Heart Disease UCI | Healthcare | 303 records | https://www.kaggle.com/ronitf/heart-disease-uci | Patient medical records, with a mild imbalance between positive and negative classifications (i.e. patients with and without heart disease). |
| Medical Cost Personal Datasets | Healthcare | 1,338 records | https://www.kaggle.com/mirichoi0218/insurance | Medical insurance costs, with a skewed distribution between low and high costs. |
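
Before working with any of the datasets above, it helps to measure how imbalanced the labels actually are. The following is a rough sketch using only the Python standard library. The helper names `class_distribution` and `imbalance_ratio` are hypothetical, and the in-memory CSV is a made-up stand-in for a downloaded file. Its label column is called `Class` as an assumption, so substitute the real column name of whichever dataset you use.

```python
import csv
import io
from collections import Counter


def class_distribution(csv_file, label_column):
    """Return {label: (count, share)} for the given label column."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


def imbalance_ratio(distribution):
    """Size of the largest class divided by the size of the smallest."""
    sizes = [n for n, _ in distribution.values()]
    return max(sizes) / min(sizes)


# Made-up sample standing in for a downloaded CSV file (assumption:
# the label column is named "Class"; labels are read as strings).
sample = io.StringIO("Amount,Class\n10.0,0\n25.5,0\n3.2,0\n99.9,1\n7.1,0\n")

dist = class_distribution(sample, "Class")
ratio = imbalance_ratio(dist)
print(dist, ratio)
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline="")`, where `path` points to the CSV you downloaded.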
