Step 1.7 - What datasets will you use?
Write a list of datasets that you will use as a basis for exercises in the course, including references to their sources.
From a Google Sheets course:
- Wikipedia's list of 100 m sprint world records since 1977. It includes several different data types to practice converting.
- Wikipedia's list of predicted asteroid close encounters. It contains numeric fields that students can round and log-transform.
- A subset of Ecdat::Benefits, which is good for logical manipulation. A few values will be deleted in order to teach missing data and errors.
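Deliberately deleting values, as described for the Benefits subset above, can be scripted so the damage is reproducible. A minimal sketch in Python with pandas, using an invented stand-in dataframe (the column names here are illustrative, not the real Ecdat::Benefits columns):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a subset of Ecdat::Benefits;
# column names are illustrative only.
benefits = pd.DataFrame({
    "state": ["AL", "AK", "AZ", "AR", "CA", "CO"],
    "age": [26, 31, 44, 52, 38, 29],
    "tenure": [3.0, 7.5, 1.2, 12.0, 4.4, 0.8],
})

# Blank out a few values on a copy so exercises can cover
# missing-data handling; a fixed seed keeps it reproducible.
rng = np.random.default_rng(42)
damaged = benefits.copy()
rows = rng.choice(damaged.index, size=2, replace=False)
damaged.loc[rows, "tenure"] = np.nan

print(damaged["tenure"].isna().sum())  # 2 missing values
```

Working on a copy keeps the clean original available for the exercise solutions.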
From a PostgreSQL case study course:
- Poker hand data set from UCI
- Internet usage data from UCI
- Census income data from UCI
- Bakery and wine datasets from Cal Poly
How many datasets do I need?
One dataset per chapter is typical. Some courses use a single dataset throughout; in that case, it has to be a fairly rich and interesting dataset to keep students' attention over the whole course. A few courses use multiple datasets in each chapter, but be careful not to spend all your time introducing datasets instead of teaching other concepts.
Where can I find datasets?
Here are a few data sources to get you started.
For small, reasonably clean datasets, Wikipedia tables are excellent.
The Datasets subreddit (r/datasets) is a potluck of datasets.
There are several free data sharing platforms available.
- CKAN is a data sharing platform. Some popular instances include datahub.io, catalog.data.gov, and the European Data Portal.
- Dataverse is a good source of datasets from academic papers. (Click the map on the home page to see specific Dataverse installations.)
- data.world hosts datasets directly, and contains many (typically small) datasets.
Many governmental organizations and NGOs provide datasets.
- UK government data
- World Health Organization Global Health Observatory Data Repository and Mortality Database
- Humanitarian Data Exchange
- World Bank Open Data
Can I create my own synthetic datasets?
Yes, but only if you can't find a real or semi-synthetic dataset. Students really enjoy working with real datasets, so existing publicly available datasets are preferred over data generated from code.
What's a semi-synthetic dataset?
Semi-synthetic datasets are the data equivalent of a "based on a true story" movie. If you have a dataset that is not suitable for use on DataCamp due to licensing issues (for example, if it is commercially sensitive), it is sometimes possible to anonymize it and change enough numbers that it retains the spirit of the original data without revealing anything commercially sensitive.
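The anonymize-and-perturb step can be sketched as follows. This is a minimal illustration, not a privacy guarantee; the dataframe and the ±10% noise level are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical commercially sensitive data (names and figures invented).
sales = pd.DataFrame({
    "client": ["Acme Ltd", "Bright Co", "Corex Inc"],
    "revenue": [125_000.0, 98_500.0, 342_000.0],
})

rng = np.random.default_rng(1)
semi_synthetic = sales.copy()

# Replace identifying names with anonymous labels.
semi_synthetic["client"] = [f"Client {i}" for i in range(1, len(sales) + 1)]

# Perturb each figure by up to +/-10% and round to the nearest 100,
# keeping the spirit of the data without revealing real numbers.
noise = rng.uniform(0.9, 1.1, size=len(sales))
semi_synthetic["revenue"] = (sales["revenue"] * noise).round(-2)
```

For genuinely sensitive data you would also want a review of whether the perturbed values can be reverse-engineered; this sketch only shows the mechanics.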
Use the simplest dataset that will get your point across
Bigger datasets are not always better. In many cases it is beneficial for a dataset to be small enough that students can easily understand it in its entirety. Be ruthless about shrinking your dataset when you prepare it.
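One way to be ruthless is to drop every column the exercises never reference and sample a handful of rows per group. A sketch with pandas, using an invented "large" dataframe:

```python
import pandas as pd

# Hypothetical large dataset; "notes" is never used in any exercise.
full = pd.DataFrame({
    "id": range(1000),
    "group": ["a", "b", "c", "d"] * 250,
    "value": [i * 0.5 for i in range(1000)],
    "notes": ["unused text"] * 1000,
})

# Keep only the columns the exercises use, and a few rows per group
# so every category is still represented.
small = (
    full[["group", "value"]]
    .groupby("group")
    .sample(5, random_state=0)
    .reset_index(drop=True)
)

print(small.shape)  # (20, 2)
```

Sampling per group rather than overall avoids accidentally dropping a rare category that an exercise depends on.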
Common problems and their solutions
Using the iris dataset
This dataset is very overused. Please choose a different one.
The dataset doesn't give the answer I wanted
Sometimes datasets give unexpected answers, which can make the narrative in an exercise tricky or confusing. It is best to develop a reasonable familiarity with your datasets during the course speccing phase, when it is easiest to swap a dataset for a different one.
How will this be reviewed?
Your Curriculum Lead will discuss your responses to the brainstorming questions with you. The responses will not be formally reviewed, though they provide important context for reviewers.