EDS 213: Databases and Data Management

Course description

The goals of the course were to give MEDS students the skills they need to practically, successfully, and ethically manage their data, and to create, manage, and use relational databases where appropriate. Relational database topics went farther than just SQL queries and included a significant unit on data modeling and database constraints and integrity, in addition to advanced database topics such as indexes and accessing databases from programming environments. The data management portion includes analyzing data from an ethical perspective, creating standards-compliant metadata, and employing data de-identification techniques. The course also included a unit on the Unix command line, with an emphasis on creating reusable Bash scripts, given in the spirit that Bash is a generally useful tool that all data scientists should have at least some familiarity with.

Data

For the database portion of the course, the Arctic Shorebird Demographics Network dataset, obtained from the Arctic Data Center, was used as a running example. While this dataset is not distributed as a relational database (it is packaged as a set of related CSV files), its structure is highly amenable to a relational approach and provides a realistic example of where and why one would want to use a relational database in the Earth and environmental sciences. It also provides just enough complexity to support realistic and complex queries and views. Note that the dataset used in the course, and included in this archive, is a cleaned-up subset of the full dataset. It is necessarily a subset of the full dataset to keep the size and complexity manageable for pedagogical purposes, and it had to be cleaned up because, unfortunately, the full dataset has many errors that would have precluded creating foreign keys.

A class data GitHub repository, linked below, is used to distribute data files to students. Each week a new directory of files will be added to the repository and the students will be asked to pull the repository to their local environment to get the updates.

Database

DuckDB is used as the database platform due to its strict implementation of data types, which turned out to be a weakness when teaching with SQLlite last year. DuckDB is a fast in-process analytical personal database.

Instructors

  • Julien Brun (jb160@ucsb.edu)
  • Greg Janée (gjanee@ucsb.edu)
  • Renata Curty (rcurty@ucsb.edu)

Teaching Assistant

  • Jamie Miller (jkmiller@ucsb.edu)

Schedule

  • Class: Monday & Wednesday 9:30-10:45 am (NCEAS)
  • Discussion - session 1: Thur 1-1:50PM, Bren Hall 1510
  • Discussion - session 2: Thur 2-2:50PM, Bren Hall 1510
  • Office hours: Monday 11-12 pm (NCEAS)

Course content

Syllabus

Installing DuckDB

Resources

Class data GitHub repository

Modules

Week Topic/Content
1 Relational databases and data modeling
2 Analyzing & cleaning the bird dataset (from csv)
3 Introduction to SQL (part 1) & DuckDB
4 SQL part 2 + Analyzing the bird database using SQL
5 I/O & data management + Advanced database topics (indexing, triggers, …)
6 Using R & Python to query databases + bash programming
7 Documenting your work: metadata & computing environment
8 Sensitive data
9 Ethical & responsible data mgnt
10 Data licensing and publication


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License

UCSB logo