Step 1.7 - What datasets will you use?

Write a list of datasets that you will use as a basis for exercises in the course, including references to their sources.

Examples

From a Google Sheets course:

From a PostgreSQL case study course:

FAQs

How many datasets do I need?

One dataset per chapter is typical. Some courses use a single dataset throughout; in that case it has to be a fairly rich and interesting dataset to hold students' attention over the whole course. A few courses use multiple datasets in each chapter, but be careful in this case that you don't spend all your time introducing datasets instead of teaching other concepts.

Where can I find datasets?

Here are a few data sources to get you started.

  • DataCamp can provide access to the datasets curated by ThinkNum. Ask your Curriculum Lead for details.

  • For small, reasonably clean datasets, Wikipedia tables are excellent.

  • There are also many R packages that contain datasets. You can generate a list of CRAN packages that contain datasets using finddatasetpkgs::get_dataset_pkgs(). Bioconductor has its own list.

The UCI Machine Learning Repository and Kaggle Datasets contain many (mostly larger) datasets on a wide variety of topics.

The Datasets subreddit is a potluck of datasets.

There are several free data sharing platforms available.

  • CKAN is a data sharing platform. Some popular instances include datahub.io, catalog.data.gov, and the European Data Portal.
  • Dataverse is a good source of datasets from academic papers. (Click the map on the home page to see specific Dataverse installations.)
  • data.world hosts datasets directly, and contains many (typically small) datasets.

Many governmental organizations and NGOs provide datasets.

Can I create my own synthetic datasets?

Yes, but only if you can't find a real or semi-synthetic dataset. Students really enjoy using "real" datasets, so existing publicly available datasets are preferred over those generated from code.

What's a semi-synthetic dataset?

Semi-synthetic datasets are the data equivalent of a "based on a true story" movie. If you have a dataset that is not suitable for use on DataCamp due to licensing issues (for example, if it is commercially sensitive), it is sometimes possible to anonymize it and change enough numbers that it retains the spirit of the original data but does not reveal anything commercially sensitive.
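The anonymize-and-perturb idea can be sketched in a few lines. This is a minimal illustration, not a DataCamp tool: the data, the `semi_synthesize` function, and the jitter amount are all hypothetical, and real anonymization may need more care (e.g. removing rare values that are identifying on their own).

```python
import random

# Hypothetical commercially sensitive records: (customer name, monthly spend).
original = [("Acme Corp", 1200.0), ("Globex", 870.0), ("Initech", 430.0)]

def semi_synthesize(rows, jitter=0.2, seed=42):
    """Replace identifying names and perturb the numbers, keeping the
    rough shape of the data (ordering, magnitudes) intact."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    out = []
    for i, (_, spend) in enumerate(rows):
        # Anonymize: swap the real name for a generated placeholder.
        name = f"Customer {i + 1}"
        # Perturb: scale each value by a random factor within +/- jitter.
        factor = 1 + rng.uniform(-jitter, jitter)
        out.append((name, round(spend * factor, 2)))
    return out

print(semi_synthesize(original))
```

The fixed seed means the "changed" numbers are the same every time you build the course, which matters when exercise solutions depend on the values.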

Good ideas

Use the simplest dataset that will get your point across

Bigger datasets are not always better. In many cases it is beneficial to have a dataset small enough that students can easily understand it in its entirety. Be ruthless about shrinking your dataset when you prepare it.
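Shrinking usually means two cuts: drop the columns your exercises never touch, then sample the rows down to a size students can grasp at a glance. A small sketch, with hypothetical data and a made-up `shrink` helper:

```python
import random

# Hypothetical raw records with more columns than the exercises need.
raw = [
    {"id": i, "country": "NL", "population": 1000 + i, "notes": "..."}
    for i in range(500)
]

def shrink(rows, keep_cols, n_rows, seed=0):
    """Keep only the columns the exercises actually use, then sample
    down to a row count students can read in its entirety."""
    sampled = random.Random(seed).sample(rows, n_rows)  # reproducible sample
    return [{c: r[c] for c in keep_cols} for c_r in [None] for r in sampled]

small = shrink(raw, keep_cols=["country", "population"], n_rows=20)
print(len(small), sorted(small[0]))
```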

Common problems and their solutions

Using the iris dataset

This dataset is very overused. Please choose a different one.

The dataset doesn't give the answer I wanted

Sometimes datasets give unexpected answers, which can make the narrative in an exercise tricky or confusing. It is best to develop a reasonable familiarity with your datasets during the course speccing phase, when it is easiest to swap a dataset for a different one.

How will this be reviewed?

Your Curriculum Lead will discuss your responses to the brainstorming questions. They will not be formally reviewed (though they provide important context for reviewers).
