Most of the data sets are clean, so no advanced data wrangling is needed. The collection is varied, and some of the data sets are big and complex. It is a very popular data repository in the machine learning community.
This is a list of high-quality, topic-centric public data sources. They are collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in sindresorhus's awesome list.
Some small data sets for teaching and learning statistics. This work is licensed under a Creative Commons License. Basically, you are free to copy, distribute, and display this work, to make derivative works, and to make commercial use of the work. However, you must give proper attribution and provide a link to the home site: http://www.randomservices.org/random/. Click on the link above for more information about permissions.
DASL provides data from a wide variety of topics so that statistics teachers can find interesting, real-world examples for their students. We know a good example can make a lesson on a particular statistics method vivid and relevant. This website is designed to help teachers locate and identify datafiles for teaching as well as serve as an archive for datasets from statistics literature.
Here is a list of suggested project ideas for the mini-project for IRDS. If you wish, you may instead propose a project that is not on this list. At the bottom of this page, you will find some examples of datasets which we judged as inappropriate for the projects; this may help you to avoid some pitfalls. The web site is maintained by the School of Information at the University of Edinburgh in the UK.
R has a lot of built-in data sets that come with different libraries. The data sets are structured and clean, and they are used in various illustrative examples. To use an R built-in data set, you need to load the library in which the data resides before you use it.
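As a minimal sketch of the loading step described above, the following R snippet lists and loads built-in data sets. The choice of the MASS package and its Cars93 data set is an assumption made only for illustration; any installed package that ships data works the same way.

```r
# List the data sets that ship with a package (MASS is just an example here).
data(package = "MASS")

# Data sets in the base "datasets" package are available without loading anything.
head(mtcars)

# Data sets in other packages become visible once the package is loaded.
library(MASS)
head(Cars93)

# Alternatively, load a single data set without attaching the whole package.
data("Cars93", package = "MASS")
str(Cars93)
```

Calling data() with no arguments lists the data sets of every package that is currently loaded.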
Most of the datasets on this page are in the S dumpdata and R compressed save() file formats. Some are also available in Excel, ASCII (.csv), and Stata (.dta) formats. If you need one of the datasets we maintain converted to a non-S format, please e-mail Charles Dupont to make a request.
If you install the R Hmisc package, you can retrieve most of the datasets stored here using, for example, getHdata(titanic3).
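A short sketch of that retrieval in an R session might look as follows. It assumes an internet connection and that Hmisc can be installed from CRAN; the inspection calls after the download are only illustrative.

```r
# Install Hmisc once if it is not already available (assumes access to CRAN).
if (!requireNamespace("Hmisc", quietly = TRUE)) {
  install.packages("Hmisc")
}
library(Hmisc)

# Download the data set from the repository; it is created as an object
# named titanic3 in the current workspace.
getHdata(titanic3)

# Inspect the structure and the first few rows.
str(titanic3)
head(titanic3)
```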
Permission is granted to anyone wishing to use the data sets provided here. Please reference the original paper which, for most data sets, is given in our notes linked below, and note – Data obtained from http://hbiostat.org/data courtesy of the Vanderbilt University Department of Biostatistics.
This site contains a lot of nice data sets and detailed information about each data set. If you use an algorithm, dataset, or other information from StatLib, please acknowledge both StatLib and the original contributor of the material.
This webpage contains data sets that can be used for teaching statistics or in place of student data when supporting students. There is a description of each data set, suggested research questions and types of analysis which can be demonstrated using the data.
TCIA data are organized as "collections"; typically these are patient cohorts related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.) or research focus. Supporting data related to the images, such as patient outcomes, treatment details, genomics and image analyses, are also provided when available. Try using the filter box above the table to quickly find collections of interest using keywords. Column headers can also be clicked to change the sorting method.