Logistic regression is a widely used machine learning algorithm for classification, and training and evaluating it well depends on high-quality data. The 10 datasets listed below give researchers and practitioners a starting point for applying logistic regression to real-world problems. They cover a range of topics, from predicting the survival of Titanic passengers to rating the quality of wine based on chemical analysis. Each entry includes a brief description, a reference paper where one is listed, and a link to the source where the data can be downloaded.
Dataset Name | Description | Reference Paper | Download Link |
---|---|---|---|
Titanic Dataset | This dataset contains information about passengers on the Titanic, including demographics, fare information, and survival status. | N/A | https://www.kaggle.com/c/titanic/data |
Pima Indians Diabetes Dataset | This dataset contains medical records for Pima Indians, including various diagnostic measures, and the onset of diabetes. | N/A | https://www.kaggle.com/uciml/pima-indians-diabetes-database |
Credit Card Default Dataset | This dataset contains information on default payments, demographic information, and credit card usage for credit card customers in Taiwan. | I-C. Yeh and C.-H. Lien, “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert Systems with Applications, 2009. | https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset |
Bank Marketing Dataset | This dataset contains information about a Portuguese bank’s direct marketing (phone call) campaigns, including client attributes, contact details, and whether the customer subscribed to the bank’s term deposit. | S. Moro, P. Cortez, and P. Rita, “A Data-Driven Approach to Predict the Success of Bank Telemarketing,” Decision Support Systems, 2014. | https://archive.ics.uci.edu/ml/datasets/Bank+Marketing |
Iris Flower Dataset | This dataset contains measurements of iris flowers, including sepal length, sepal width, petal length, petal width, and species information. | R.A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, 1936. | https://archive.ics.uci.edu/ml/datasets/Iris |
Digits Recognition Dataset | This dataset contains images of handwritten digits, with the goal of training a classifier to recognize the digits. | N/A | https://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset |
SMS Spam Collection Dataset | This dataset contains a collection of SMS text messages labeled as spam or non-spam, with the goal of training a classifier to distinguish between the two. | N/A | https://www.kaggle.com/uciml/sms-spam-collection-dataset |
Wine Quality Dataset | This dataset contains wine chemical analysis information, including pH, alcohol content, and other features, with the goal of predicting wine quality. | P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, “Modeling wine preferences by data mining from physicochemical properties,” Decision Support Systems, 2009. | https://archive.ics.uci.edu/ml/datasets/Wine+Quality |
Adult Income Dataset | This dataset contains information about individuals, including age, education, occupation, and income, with the goal of predicting whether an individual earns over $50,000 per year. | N/A | https://www.kaggle.com/uciml/adult-census-income |
Student Performance Dataset | This dataset contains information about students in secondary education, including demographic information, study habits, and grades, with the goal of predicting academic performance. | N/A | https://archive.ics.uci.edu/ml/datasets/Student+Performance |
The datasets range from small and straightforward (Iris has only 150 samples) to larger and more complex, so a range of logistic regression use cases can be tested and evaluated. Not every target is binary: Iris and Digits are multiclass problems, and Wine Quality is a numeric score, so these call for multinomial logistic regression or for recoding the target into two classes first.
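
As a minimal sketch of how one of these datasets might be used, the example below fits a logistic regression classifier to the handwritten digits data that ships with scikit-learn, so nothing needs to be downloaded. The function name `train_digits_classifier`, the 75/25 split, the feature scaling step, and the `max_iter` value are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_digits_classifier(test_size=0.25, seed=0):
    """Fit logistic regression on the bundled digits data (hypothetical helper)."""
    # 8x8 digit images flattened to 64 features; labels are the digits 0-9.
    X, y = load_digits(return_X_y=True)

    # Hold out part of the data, keeping class proportions equal in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # Standardizing features helps the solver converge; max_iter is an
    # assumed value chosen to avoid convergence warnings.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Accuracy on the held-out split, as a float between 0 and 1.
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


if __name__ == "__main__":
    model, accuracy = train_digits_classifier()
    print(f"Held-out accuracy: {accuracy:.3f}")
```

The same pattern should carry over to the tabular datasets in the list, with the loading step replaced by reading the downloaded file and encoding any categorical columns before fitting.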