Our Running Example

Streaming Dataset

Over the past decade, streaming services have rapidly transformed the way people consume entertainment, rising to dominate the media landscape. As more people turn to digital platforms for personalized viewing experiences, streaming services continue to redefine modern entertainment consumption. Are you into streaming? What was the most recent movie or TV show you watched?

This workshop utilizes a comma-separated value (CSV) file which is derived from the movies and TV shows featured by major streaming services and distributed in Kaggle Project under a CC0 Public License1.

For this project, we have merged the six titles.csv files included in the project—each representing one of the streaming services featured in this project (Amazon Prime Video, Apple TV+, Disney+, HBO Max, Netflix, and Paramount)—into a single master spreadsheet.

The dataset we will be working with contains 25,223 rows with movies and TV series titles along with the following variables:

  • streaming: streaming service/provider
  • id: unique movie or TV series ID
  • title: movie or TV series title
  • type: if movie or tv series
  • description: the synopsis/plot summary
  • release_year: YYYY
  • classification: age group movie age categories including G (General Audiences), PG (Parental Guidance), PG-13 (Parents Strongly Cautioned), R (Restricted), and NC-17 (No One 17 and Under Admitted).
  • runtime: total time in minutes
  • genre(s): the thematic classification that define the type of story, themes, and style of the production, by order of relevance.
  • country: code for country or countries responsible for the production, ordered by importance.
  • imdb_id: unique ID attributed to each production by IMDB.
  • imdb_score: the rating on IMDB.
  • seasons: Number of seasons if it’s a show.
  • imdb_id: The title ID on IMDB.
  • imdb_score: Score on the Internet Movie Database (IMDB).
  • imdb_votes: Votes on the Internet Movie Database (IMDB).
  • tmdb_popularity: Votes on The Movie Database (TMDB).
  • tmdb_score: Score on on The Movie Database (TMDB).

Downloading the Dataset

Now that we have a clearer understanding of the data we’ll be working with, please take a moment to download and save the file master-streaming-messy.csv to your working environment.

Download the csv File

Let’s open the file and check how the data looks like. Chances are you favorite movies or TV shows are listed on it.

Please note that, for the purposes of this lesson, the data has been intentionally modified to support the associated exercises. Therefore, we do not vouch for the use of this dataset for actual research. The data has been specifically edited and curated for instructional purposes and may not represent a fully accurate or comprehensive source of data for formal analysis.

Our Challenge

In this workshop, we will explore how OpenRefine can support data organization and preparation for analysis. For instance, you might want to compare scores across genres, plot the most common age classifications over the years, or investigate whether the country of origin affects popularity. These are just a few examples of the kinds of insights you could uncover once your data is properly cleaned and organized. But before that the data has to be cleaned and prepared accordingly. Ready?

Footnotes

  1. Henrique, D. (2020). A simple movie & TV show recommendation system. Kaggle. https://www.kaggle.com/code/dgoenrique/a-simple-movie-tv-show-recommendation-system?select=credits.csv↩︎