Data Sets with a Binary Response

A. SBA Endorsed Bank Loan Data

SIZE: 899,164 observations, 27 variables

SOURCE: United States Small Business Administration

BRIEF DESCRIPTION: This data set is from the U.S. Small Business Administration (SBA) and provides historical data from 1987 through 2014. This large data set contains 27 variables and 899,164 observations. Each observation represents a loan that was guaranteed to some degree by the SBA. Included is a variable [MIS_Status] which indicates if the loan was paid in full or defaulted/charged off.

DOWNLOAD: The data set and its description can be downloaded using the following links:

1. Data set (de-identified data) was split into 9 subsets:

2. Data description: LoanData-description.pdf

3. NAICS Code description: NAICS Code Description

APPROPRIATE USE: demonstrating sampling techniques; regression modeling.

B. Synthetic Breast Cancer Data

SIZE: 600 observations, 10 variables

SOURCE: Book - Applied Analytics through Case Studies Using SAS and R, by Deepti Gupta, Apress, ISBN 978-1-4842-3525-6

BRIEF DESCRIPTION: This is a synthetic data set. The structure of the data set is simple. It can be used for logistic regression and other binary classification / predictive models and algorithms.

DOWNLOAD: The data set and its description can be downloaded using the following links:

1. Data set: |Breast Cancer Dataset|
2. Variable Description: Breast Cancer Description

C. Loan Default Data

SIZE: 1000 observations, 16 variables

SOURCE: Book - Applied Analytics through Case Studies Using SAS and R, by Deepti Gupta, Apress, ISBN 978-1-4842-3525-6

BRIEF DESCRIPTION: This is a subset of a larger data set. The structure of the data set is simple. It can be used for logistic regression and other binary classification / predictive models and algorithms.

DOWNLOAD: The data set and its description can be downloaded using the following links:

1. Data set: |Loan Default Data|
2. Variable Description: Loan Default Data Description

D. Pima Indians Diabetes Data

SIZE: 768 records, 8 feature variables and one binary outcome variable.

SOURCE: Kaggle https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

BRIEF DESCRIPTION: This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It has been used in numerous studies in the computational and clinical sciences.

DOWNLOAD: The data set can be downloaded from the Kaggle site or from the following links:

1. Data set: Pima-Indian-diabetes-data.csv
2. Data set (replaced 0's with NAs): PimaIndiansDiabetes2.csv
3. Data Dictionary: PimaIndiansDiabetesDataDescription.pdf
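
The second file above replaces zeros with NAs. A minimal sketch of that kind of recoding in Python with pandas is shown here. The column names, and the choice of which columns treat 0 as "not recorded", are assumptions for illustration and should be checked against the data dictionary. The rows are made up and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real file (values are made up).
# Column names are assumed; verify them against the data dictionary.
pima = pd.DataFrame({
    "Pregnancies": [6, 1, 0],
    "Glucose": [148, 0, 137],
    "BloodPressure": [72, 66, 0],
    "BMI": [33.6, 0.0, 43.1],
    "Outcome": [1, 0, 1],
})

# Assumed: a zero in these columns means "not recorded".
# A zero in Pregnancies or Outcome is a legitimate value and is left alone.
zero_means_missing = ["Glucose", "BloodPressure", "BMI"]

pima_na = pima.copy()
pima_na[zero_means_missing] = pima_na[zero_means_missing].replace(0, np.nan)

# Count missing values per column after recoding.
missing_counts = pima_na.isna().sum()
print(missing_counts)
```

Recoding only a chosen list of columns keeps genuine zeros (for example, zero pregnancies or a negative outcome) from being treated as missing.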

E. Customer Churn Data

SIZE: 1000 records, 14 variables (3 numerical, 11 categorical).

SOURCE: Book - Applied Analytics through Case Studies Using SAS and R, by Deepti Gupta, Apress, ISBN 978-1-4842-3525-6

BRIEF DESCRIPTION: This data set is a small portion of the customer data of a telecommunication company. It can be used to investigate the company's service quality and improve customer retention.

DOWNLOAD: The data set can be downloaded using the following links:

1. Data set: Churn Data [TXT]
2. Data Dictionary: Churn Data Description.pdf

F. Credit Card Fraud Data

SIZE: 25575 authentic credit cards and 7762 compromised credit cards. Each card has the dollar amounts and transaction times of 41 historical transactions. For the compromised cards, the most recent transactions are fraudulent. The total number of observations is about 1.4 million.

SOURCE: A financial firm.

BRIEF DESCRIPTION: The data sets were taken from a financial firm to develop new and scalable machine learning algorithms and statistical models to detect fraudulent credit card transactions. Stratified sampling was used to collect the data. The links below point to three subsets from authentic cards and one subset from compromised cards. You need to create a binary outcome variable.

DOWNLOAD: The two sets of sequence data can be downloaded using the following links:

1. Authentic cards: | IDXDataset_gdc01.csv | IDXDataset_gdc02.csv | IDXDataset_gdc03.csv |
2. Compromised cards: IDXDataset_wpc.csv
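
Since the outcome variable has to be created, one simple option is to label each record by the file it came from and then stack the files. The sketch below assumes Python with pandas. The column names `card_id`, `amount`, and `compromised` are hypothetical because the file layout is not described here, and the toy frames stand in for what `pd.read_csv` would return for the files listed above.

```python
import pandas as pd

# Toy stand-ins for the downloaded files (values and column names are made up).
# In practice each frame would come from pd.read_csv on the files listed above.
authentic = pd.DataFrame({"card_id": ["a1", "a1", "a2"], "amount": [12.5, 40.0, 7.2]})
compromised = pd.DataFrame({"card_id": ["c1", "c1"], "amount": [99.0, 310.0]})

# Card-level binary outcome: 0 = authentic card, 1 = compromised card.
authentic["compromised"] = 0
compromised["compromised"] = 1

# Stack the two sources into one modeling table.
cards = pd.concat([authentic, compromised], ignore_index=True)

print(cards["compromised"].value_counts().to_dict())
```

This produces a card-level label. Because only the most recent transactions on compromised cards are fraudulent, a transaction-level label would need an additional rule for which transactions to flag.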

The following is a two-column fraud index data set created with a machine learning algorithm.

Fraud Index Data Set: fraudidx.csv

G. Gas Station - Point of Compromise (POC)

SIZE: 72798 records, 30 feature variables and one binary outcome variable.

SOURCE: De-identified industry data with some simulated features.

BRIEF DESCRIPTION: This data set can be used for binary predictive modeling. The binary response variable is POC.

DOWNLOAD: The data set can be downloaded from the following links:

1. Data set: Gas Station Data Set.csv
2. Data Dictionary: Gas Station Data Description

H. The Health Facts - Diabetes Data

SIZE: 101,766 records, 49 feature variables.

SOURCE: The Health Facts database (Cerner Corporation, Kansas City, MO), a national data warehouse that collects comprehensive clinical records across hospitals throughout the United States.

BRIEF DESCRIPTION: This study used the Health Facts database (Cerner Corporation, Kansas City, MO), a national data warehouse that collects comprehensive clinical records across hospitals throughout the United States. Health Facts is a voluntary program offered to organizations which use the Cerner Electronic Health Record System. The database contains data systematically collected from participating institutions' electronic medical records and includes encounter data (emergency, outpatient, and inpatient), provider specialty, demographics (age, sex, and race), diagnoses and in-hospital procedures documented by ICD-9-CM codes, laboratory data, pharmacy data, in-hospital mortality, and hospital characteristics. All data were deidentified in compliance with the Health Insurance Portability and Accountability Act of 1996 before being provided to the investigators. Continuity of patient encounters within the same health system (EHR system) is preserved.

DOWNLOAD: The data set can be downloaded from the following links:

1. Data set: Health Facts - Diabetes Data
2. Some definitions of the variables: Variable mapping
3. More information about the background of this data set can be found in this Research Article

I. Framingham Heart Study

SIZE: 4240 subjects were included in the data. This is a small subset of the data used in the textbook of Hosmer and Lemeshow.

SOURCE: This is a subset extracted from the study. It can be found in various public-domain sources. The current data can be found at framingham.csv

BRIEF DESCRIPTION: The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study of residents of the city of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects from Framingham, and is now on its third generation of participants. More information about this long-term observational study can be found at https://nfb.org//sites/default/files/images/nfb/publications/vodold/vspr9804.htm

DOWNLOAD: The data set and description can be downloaded using the following links:

1. Data Description: FraminghamHeartStudy-description.pdf
2. Data Set: FraminghamHeartStudy.csv

J. National Health and Nutrition Examination Survey (NHANES) Data

SIZE: 7926 subjects were included in the data. This is a small subset of the data used in the textbook of Hosmer and Lemeshow.

SOURCE: Data built from various data sources from CDC's National Health and Nutrition Examination Survey (NHANES). https://wwwn.cdc.gov/nchs/nhanes/Default.aspx

BRIEF DESCRIPTION: The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the nation. In 1999, the survey became a continuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs.

DOWNLOAD: The data set and description can be downloaded using the following links:

1. Data Description: NationalHealthNutritionSurvey-description.pdf
2. Data Set: nhanes.csv

K. Bankruptcy Data

SIZE: Five years of data. Each data set has about 10000 observations and 64 variables.

SOURCE: The data sets are available in the UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data

BRIEF DESCRIPTION: The dataset is about bankruptcy prediction of Polish companies. The data was collected from Emerging Markets Information Service (EMIS), which is a database containing information on emerging markets around the world. The bankrupt companies were analyzed in the period 2000-2012, while the still operating companies were evaluated from 2007 to 2013. Based on the collected data, five classification cases were distinguished, depending on the forecasting period:

1stYear - the data contains financial rates from the 1st year of the forecasting period and the corresponding class label that indicates bankruptcy status after 5 years. The data contains 7027 instances (financial statements): 271 represent bankrupted companies and 6756 represent firms that did not go bankrupt in the forecasting period.
2ndYear - the data contains financial rates from the 2nd year of the forecasting period and the corresponding class label that indicates bankruptcy status after 4 years. The data contains 10173 instances (financial statements): 400 represent bankrupted companies and 9773 represent firms that did not go bankrupt in the forecasting period.
3rdYear - the data contains financial rates from the 3rd year of the forecasting period and the corresponding class label that indicates bankruptcy status after 3 years. The data contains 10503 instances (financial statements): 495 represent bankrupted companies and 10008 represent firms that did not go bankrupt in the forecasting period.
4thYear - the data contains financial rates from the 4th year of the forecasting period and the corresponding class label that indicates bankruptcy status after 2 years. The data contains 9792 instances (financial statements): 515 represent bankrupted companies and 9277 represent firms that did not go bankrupt in the forecasting period.
5thYear - the data contains financial rates from the 5th year of the forecasting period and the corresponding class label that indicates bankruptcy status after 1 year. The data contains 5910 instances (financial statements): 410 represent bankrupted companies and 5500 represent firms that did not go bankrupt in the forecasting period.
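
The counts above imply a strongly imbalanced response in every file. As a quick illustration, the short Python sketch below computes the bankrupt share per year directly from the stated numbers. It uses no external data, and the variable names are only illustrative.

```python
# (total instances, bankrupt instances) as stated in the description above.
counts = {
    "1stYear": (7027, 271),
    "2ndYear": (10173, 400),
    "3rdYear": (10503, 495),
    "4thYear": (9792, 515),
    "5thYear": (5910, 410),
}

bankrupt_share = {}
for year, (total, bankrupt) in counts.items():
    bankrupt_share[year] = bankrupt / total
    print(f"{year}: {bankrupt} of {total} bankrupt ({bankrupt_share[year]:.1%})")
```

In each file the bankrupt class is roughly 4% to 7% of the instances, which is worth keeping in mind when choosing sampling schemes and performance metrics.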

DOWNLOAD: The data sets and description can be downloaded using the following links:

1. Data Description: NationalHealthNutritionSurvey-description.pdf
2. Data Sets: | firstyear.csv | secondyear.csv | thirdyear.csv | fourthyear.csv | fifthyear.csv |

L. Bank Direct Marketing Data

SIZE: 45223 records and 17 variables (including the outcome variable)

SOURCE: UCI-Machine Learning Repository. https://archive.ics.uci.edu/dataset/222/bank+marketing

BRIEF DESCRIPTION: The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed or not. The data set is ordered by date (from May 2008 to November 2010). The data set can be used for classification, that is, predicting whether the client will subscribe to a term deposit (variable y) after the direct marketing campaign.

DOWNLOAD: The data set and description can be downloaded using the following links:

1. Data Description: Background and Variable Definitions
2. Data Set: Bank Marketing Data

M. Employee Turnover Data: People Analytics Data

SIZE: 1129 records and 16 variables

SOURCE: https://www.aihr.com/wp-content/uploads/2019/10/turnover-data-set.csv

BRIEF DESCRIPTION: The data set contains information on gender, age, wage type, way of travel, traffic (source of hire), and Big Five personality traits. The data set is real and fairly straightforward. The only thing to keep an eye on is that some terms were lost in translation from Russian to English. For example, ‘independ’ translates to a reversed scale of agreeableness, ‘selfcontrol’ is conscientiousness, ‘anxiety’ is neuroticism, and ‘novator’ stands for openness.

DOWNLOAD: The data set and description can be downloaded using the following links:

1. Data Description: Variable Definitions
2. Data Set: Employee Turnover Data

N. Employee Attrition Data: People Analytics Data (Synthetic)

SIZE: 1470 records and 35 variables

SOURCE: https://www.aihr.com/wp-content/uploads/2019/10/ibm-hr-analytics-attrition-data-set-1.zip

BRIEF DESCRIPTION: This data set is well-known in the People Analytics world. When IBM creates a data set that enables you to practice attrition modeling, you pay attention. The data set has 1470 rows and 35 columns. The data set contains data like age, gender, job satisfaction, environment satisfaction, education field, job role, income, overtime, percentage salary hike, tenure, training time, years in current role, relationship status, and more. With these variables, IBM has created a fairly complete overview that contains the data of the average HRIS combined with a full engagement survey. The data set is therefore great to predict turnover, or to simply find differences between the group that stayed or that left.

DOWNLOAD: The data set and description can be downloaded using the following links:

1. Data Description: Variable Definitions
2. Data Set: Employee Attrition Data

O. Diabetes Prediction

SIZE: 100001 records and 9 variables

SOURCE: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

BRIEF DESCRIPTION: The Diabetes prediction dataset is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. This dataset can be used to build machine learning models to predict diabetes in patients based on their medical history and demographic information. This can be useful for healthcare professionals in identifying patients who may be at risk of developing diabetes and in developing personalized treatment plans. Additionally, the dataset can be used by researchers to explore the relationships between various medical and demographic factors and the likelihood of developing diabetes.

DOWNLOAD: The data set and description can be downloaded using the following links:

1. Data Description: Variable Definitions
2. Data Set: Diabetes Data

P. Stroke Prediction

SIZE: 5110 records and 12 variables

SOURCE: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

BRIEF DESCRIPTION: According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

DOWNLOAD: The data set and description can be downloaded using the following links:

1. Data Description: Variable Definitions
2. Data Set: Strokes Data

Continuous Response

A. California Housing Price Data

SIZE: 20640 records, 10 feature variables.

SOURCE: Data from Kaggle.

BRIEF DESCRIPTION: This data set can be used for regression modeling.

DOWNLOAD: The data set can be downloaded from the following links:

1. Data set: California Housing Price Data.csv. This data set can also be found at https://www.kaggle.com/harrywang/housing-price-prediction/data
2. Data Dictionary: California Housing Price Data Description

B. Taipei Real Estate Data

SIZE: 414 records, 8 feature variables.

SOURCE: Data from Kaggle.

BRIEF DESCRIPTION: This data set can be used for regression modeling.

DOWNLOAD: The data set can be downloaded from the following links:

1. Data set: Taipei Real Estate Data.csv
2. Data Dictionary: Taipei Real Estate Data Description

C. World Life Expectancy Data

SIZE: There are four related data files that can be used to create data visualizations.

SOURCE: Data from the public domain.

BRIEF DESCRIPTION: The four subsets contain information that potentially impacts life expectancy around the world. The data cover the period between 1800 and 2018. An animated bubble chart based on these four data sets was created using Tableau Public.

DOWNLOAD: The data sets can be downloaded from the following links:

1. Income per Person.csv
2. Life Expectancy in Years.csv
3. Population Size.csv
4. Country Regions.csv

D. Flight Delay Data

SIZE: 3593 records, 11 feature variables.

SOURCE: Book - Applied Analytics through Case Studies Using SAS and R, by Deepti Gupta, Apress, ISBN 978-1-4842-3525-6

BRIEF DESCRIPTION: This data set can be used for regression modeling. The arrival delay time is a positive (random) response variable.

DOWNLOAD: The data set can be downloaded from the following links:

1. Data set: Flight Delay Data
2. Data Dictionary: Flight Delay Data Description

E. Melbourne Housing Market

SIZE: 34857 records, 21 variables.

SOURCE: Kaggle - https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market

BRIEF DESCRIPTION: This data was scraped from publicly available results posted every week on Domain.com.au. The dataset includes Address, Type of Real Estate, Suburb, Method of Selling, Rooms, Price, Real Estate Agent, Date of Sale, and Distance from C.B.D.

DOWNLOAD: The data set can be downloaded from the following links:

1. Data set: House Price Data
2. Data Dictionary: Variable Description

F. HR Data (People Analytics)

SIZE: 311 records, 36 variables.

SOURCE: Kaggle - https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

BRIEF DESCRIPTION: Please see the description on Kaggle or the author's RPubs note at https://rpubs.com/rhuebner/hrd_cb_v14

DOWNLOAD: The data set can be downloaded from the following links:

1. Data set: Human Resource Data
2. Data Dictionary: Variable Description

Multinomial Response (7)

A. Amazon Customer Review Data

SIZE: 13194 observations, 4 variables.

SOURCE: https://www.kaggle.com/datasets/danielihenacho/amazon-reviews-dataset

BRIEF DESCRIPTION: This dataset was created from scraped reviews of products on Amazon for the purpose of text classification. There are three classes:

1. Negative Reviews
2. Neutral Reviews
3. Positive Reviews

DOWNLOAD: Data set and related files can be downloaded in the following links:

1. Data set: Amazon Review.csv
2. Data Dictionary: Amazon review data dictionary.pdf

B. Body Performance Data

SIZE: 13393 observations and 12 variables.

SOURCE: https://www.kaggle.com/datasets/kukuroo3/body-performance-data

BRIEF DESCRIPTION: This data set records a performance grade together with age and several exercise performance measurements.

DOWNLOAD: The following links provide data downloaded on August 12, 2021.

1. Data: MclsBodyPerformance.csv
2. Description: Description of Body Performance data

C. Predicting Churn Risk Rate

SIZE: Training set: 36992 observations and 25 variables.
Testing set: 19919 observations and 24 variables.

SOURCE: https://www.hackerearth.com/problem/machine-learning/predict-the-churn-risk-rate-11-fb7a760d/

BRIEF DESCRIPTION: Churn rate is a marketing metric that describes the number of customers who leave a business over a specific time period. Every user is assigned a prediction value that estimates their state of churn at any given time. This value is based on:
1. User demographic information
2. Browsing behavior
3. Historical purchase data among other information
It factors in our unique and proprietary predictions of how long a user will remain a customer. This score is updated every day for all users who have a minimum of one conversion. The values assigned are between 1 and 5.

DOWNLOAD: The data sets and variable description can be downloaded using the following links.

1. Datasets: | Training Data.csv | Testing Data |
2. Data Dictionary: Customer Churn Data Description.pdf

D. Customer Segmentation Data

SIZE: Training data: 8069 observations. 10 variables.
Testing data: 2627 observations. 9 variables.

SOURCE: https://www.superdatascience.com/pages/sql

BRIEF DESCRIPTION: Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests, and spending habits. Companies employing customer segmentation operate under the fact that every customer is different and that their marketing efforts would be better served if they target specific, smaller groups with messages that those consumers would find relevant and lead them to buy something. Companies also hope to gain a deeper understanding of their customers' preferences and needs with the idea of discovering what each segment finds most valuable to more accurately tailor marketing materials toward that segment.

DOWNLOAD: The data sets (in CSV format) can be downloaded using the following links.

1. Data Sets: | Training Data | Testing Data |
2. Data Description

E. Dermatology Data Set

SIZE: 366 observations. 35 variables.

SOURCE: https://www.kaggle.com/datasets/olcaybolat1/dermatology-dataset-classification

BRIEF DESCRIPTION: The differential diagnosis of "erythemato-squamous" diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with minimal differences. The disorders in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Usually, a biopsy is necessary for the diagnosis, but unfortunately, these diseases share many histopathological features as well.
Patients were first evaluated clinically with 12 features. Afterward, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the samples under a microscope.
In the dataset constructed for this domain, the family history feature has the value 1 if any of these diseases has been observed in the family, and 0 otherwise. The age feature simply represents the age of the patient.
Every other feature (clinical and histopathological) was given a degree in the range of 0 to 3. Here, 0 indicates that the feature was not present, 3 indicates the largest amount possible, and 1 and 2 indicate the relative intermediate values.

DOWNLOAD: The data sets (in CSV format) can be downloaded using the following links.

1. Data Sets: | Dermatology Data |
2. Data Description

F. Healthcare Service Demand Data Set

SIZE: Training Data: 310000 observations. 18 variables.
Testing Data: 137058 observations. 17 variables.

SOURCE: https://www.kaggle.com/datasets/olcaybolat1/dermatology-dataset-classification

BRIEF DESCRIPTION: The differential diagnosis of "erythemato-squamous" diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with minimal differences. The disorders in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Usually, a biopsy is necessary for the diagnosis, but unfortunately, these diseases share many histopathological features as well.
Patients were first evaluated clinically with 12 features. Afterward, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the samples under a microscope.
In the dataset constructed for this domain, the family history feature has the value 1 if any of these diseases has been observed in the family, and 0 otherwise. The age feature simply represents the age of the patient.
Every other feature (clinical and histopathological) was given a degree in the range of 0 to 3. Here, 0 indicates that the feature was not present, 3 indicates the largest amount possible, and 1 and 2 indicate the relative intermediate values.

DOWNLOAD: The data sets (in CSV format) can be downloaded using the following links.

1. Data Sets: | Dermatology Data |
2. Data Description

Data Preparation

A. Self-compassion and Gratitude Survey Data

SIZE: 119 observations. 12 demographic variables, 12 variables in the self-compassion instrument, and 6 in the gratitude instrument.

SOURCE: One of my recent research projects.

BRIEF DESCRIPTION: This is a survey taken from students in the social work program at a regional university. Students are invited to answer the survey questions voluntarily. The survey contains two different survey instruments and some demographic questions as well.

1. The Self-Compassion Scale by Neff
2. The Gratitude Questionnaire 6

The purpose is to study the perception of self-compassion in students and how it can link to self-care as well as success in their future professional life.

DOWNLOAD: Data set and related files can be downloaded in the following links:

1. Data set: Self-compasion-gratitude-SurveyDataCsv.csv
2. Survey instruments: Selfcompassion-Survey-Instrument.pdf
3. Data Dictionary: Selfcompassion-data-dictionary.pdf

B. COVID-19 Related Data Sets

SIZE: Different data sets have different sizes. The range is from several thousand to several million records. Some data sets are still being updated.

SOURCE: Multiple data sources: NYT COVID-19 data repository, CDC data, and USDA Economic Research Service data.

BRIEF DESCRIPTION: These data sets can be used to study the potential association between COVID-19 infection/death rates and related variables such as vaccination rates, as well as demographic variables such as poverty, unemployment, and education.

DOWNLOAD: The following links provide data downloaded on August 12, 2021.

1. Unemployment data: Unemployment.csv
2. Poverty data: PovertyEstimates.csv
3. 10-year mortality data: Mortality-Estimates.csv
4. Education level data: Education.csv
5. Population density data: census-population-landarea.csv
6. Income data: est-income-19all.xls
7. Presidential election data: countypresidential_election_2000-2020.csv
8. FIPS to Latitude and Longitude Conversion Table: fips2latlon.csv
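
Combining these county-level files usually means joining them on a county identifier. The Python sketch below assumes that each file carries a county FIPS code, which is suggested by the FIPS conversion table in the list but not confirmed here. The column names `fips`, `unemp_rate`, and `poverty_pct` are hypothetical and the values are made up. The point it illustrates is that FIPS codes read as integers lose their leading zeros, so the keys should be normalized to 5-character strings before merging.

```python
import pandas as pd

# Toy county tables; column names and values are hypothetical.
# One file stores the code as an integer, the other as a zero-padded string.
unemployment = pd.DataFrame({"fips": [1001, 6037], "unemp_rate": [3.1, 5.2]})
poverty = pd.DataFrame({"fips": ["01001", "06037"], "poverty_pct": [12.0, 14.5]})


def to_fips(series):
    """Return 5-character county codes, restoring dropped leading zeros."""
    return series.astype(str).str.zfill(5)


unemployment["fips"] = to_fips(unemployment["fips"])
poverty["fips"] = to_fips(poverty["fips"])

# Inner join keeps only counties present in both tables.
county = unemployment.merge(poverty, on="fips", how="inner")
print(county)
```

Without the normalization step the two key columns would have different types and values, and the merge would either fail or match no rows.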

SOME DIRECT LINKS: You can download the most recently updated data using the following direct links:

1. NYT COVID-19 data repository: us-counties (live updating)
2. USDA county-level demographic data: County-level Data Sets
3. CDC vaccination data: County-level vaccination data (live updating)
4. CDC COVID-19 surveillance: COVID-19 Case Surveillance Public Use Data (live updating) (nearly 28 million records at individual level, 3.4GB)
5. MIT Election Lab Data Sets: Presidential Election Data

C. Customer Segmentation Data

SIZE: 330379 observations. 8 variables.

SOURCE: Book - Applied Analytics through Case Studies Using SAS and R, by Deepti Gupta, Apress, ISBN 978-1-4842-3525-6

BRIEF DESCRIPTION: This is a data set extracted from an organization. It can be used to divide customers into groups of individuals that have similar characteristics (factors) for marketing, improving customer satisfaction, etc.

DOWNLOAD: The data set and variable description can be downloaded using the following links.

1. Data set: Segmentation Data.csv
2. Data Dictionary: Segmentation Data Description.pdf

D. Pet Care Data

SIZE: 330379 observations. 8 variables.

SOURCE: https://www.superdatascience.com/pages/sql

BRIEF DESCRIPTION: Four relational data tables that contain pet information, owner information, procedure history, and procedure detail.

DOWNLOAD: The data sets (in CSV format) can be downloaded using the following links.

1. Pet Information
2. Owner Information
3. Procedure Detail
4. Procedure History
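
Relational tables like these are typically combined with joins on shared keys. The following is a minimal Python sketch with pandas, not a description of the actual files: every column name (`owner_id`, `pet_id`, `procedure_code`, `price`, and so on) and every value is hypothetical, chosen only to show how a procedure history can be linked to procedure details, pets, and owners.

```python
import pandas as pd

# Toy tables; all column names and values are hypothetical.
owners = pd.DataFrame({"owner_id": [1, 2], "owner_name": ["Ana", "Ben"]})
pets = pd.DataFrame({
    "pet_id": ["p1", "p2", "p3"],
    "pet_name": ["Rex", "Milo", "Luna"],
    "owner_id": [1, 1, 2],
})
history = pd.DataFrame({
    "pet_id": ["p1", "p3", "p3"],
    "procedure_code": [10, 10, 20],
})
detail = pd.DataFrame({"procedure_code": [10, 20], "price": [25.0, 80.0]})

# Chain the joins: history -> detail (price) -> pets (owner key) -> owners (name).
visits = (
    history
    .merge(detail, on="procedure_code", how="left")
    .merge(pets, on="pet_id", how="left")
    .merge(owners, on="owner_id", how="left")
)

# Total spend per owner.
spend = visits.groupby("owner_name")["price"].sum()
print(spend)
```

Starting from the history table with left joins keeps one row per recorded procedure, so pets with no procedures (here `p2`) simply do not appear in the result.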

E. Healthcare Data Sets

SIZE: 8 data sets with various sizes and different numbers of variables.

SOURCE: https://www.kaggle.com/datasets/kanikakhera/healthcare-dataset

BRIEF DESCRIPTION: These eight relational data tables can be used for sharpening data wrangling skills.

DOWNLOAD: The data sets (in CSV format) can be downloaded in the following links.

1. Inpatient data - patients
2. Inpatient data - providers
3. Outpatient data - patients
4. Outpatient data - providers
5. Patient history sample data
6. Review patient history sample data
7. Review transaction counts
8. Transaction counts
