Lab 5

EDS 213 | Documenting Your Work

Overview

You’ve found some datasets, cleaned your data, built a database, written queries, and started analyzing your data. This week, you’ll make the work you’ve done so far reproducible by documenting it thoroughly.



What Documentation Covers

Documentation for this project has two parts:

  1. Your code — the .sql script and .ipynb / .qmd analysis file should be well commented
  2. Your repository — the README should serve as a guide for anyone visiting your project

Both matter for reproducibility and for helping readers understand the steps you took and why you took them.


Commenting Your Code

Comments in your .sql and .ipynb / .qmd files should explain why decisions were made, not just describe what the code does. Variable names and SQL keywords already communicate what — comments are for context a reader couldn’t otherwise infer.

Good:

# Drop rows where species_id is NULL —
# these represent failed sensor readings, not valid observations
df = df.dropna(subset=["species_id"])

Not useful:

# Drop nulls
df = df.dropna(subset=["species_id"])

The same principle applies to SQL:

-- Aggregate by habitat type rather than individual sites
-- so results are comparable across the study area
SELECT habitat_type, SUM(count) AS total_observations
FROM observations o
JOIN sites s ON o.site_id = s.site_id
GROUP BY habitat_type;
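
If you run the query from your analysis file instead of the .sql script, the comment can stay inside the query string so the reasoning travels with the SQL. The sketch below is only an illustration: it assumes Python's built-in sqlite3 module (your project may use a different database), and the two tables and their rows are made up.

```python
import sqlite3

# Toy in-memory database. The schema and rows are invented for illustration
# and are NOT the course dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (site_id INTEGER PRIMARY KEY, habitat_type TEXT);
    CREATE TABLE observations (site_id INTEGER, count INTEGER);
    INSERT INTO sites VALUES (1, 'wetland'), (2, 'wetland'), (3, 'forest');
    INSERT INTO observations VALUES (1, 4), (2, 6), (3, 5);
""")

# The "why" comment lives inside the SQL text, next to the code it explains.
HABITAT_TOTALS_SQL = """
    -- Aggregate by habitat type rather than individual sites
    -- so results are comparable across the study area
    SELECT s.habitat_type, SUM(o.count) AS total_observations
    FROM observations o
    JOIN sites s ON o.site_id = s.site_id
    GROUP BY s.habitat_type
    ORDER BY s.habitat_type;
"""

habitat_totals = conn.execute(HABITAT_TOTALS_SQL).fetchall()
print(habitat_totals)  # [('forest', 5), ('wetland', 10)]
```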

Making Your Project Reproducible

A reproducible project is one that someone else can clone and run without asking you anything. The one exception is downloading data, and the instructions for that should be extremely clear. To get there:

  • All necessary files are included or clearly referenced
  • Both your .sql file and your .ipynb / .qmd file are in the repo
  • Dependencies are documented so the environment can be recreated
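
One way to check the list above before you push is a short script that reports which expected files are absent. This is a minimal sketch, not a course requirement, and the file names in REQUIRED_FILES are hypothetical placeholders for your own.

```python
from pathlib import Path

# Hypothetical file names; replace them with the ones in your repository.
REQUIRED_FILES = ["README.md", "queries.sql", "analysis.ipynb", "requirements.txt"]


def missing_files(repo_root, required=REQUIRED_FILES):
    """Return the names in `required` that are not files under repo_root."""
    root = Path(repo_root)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing from repo:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Run it from the root of the repository. It only checks that the files exist, not that their contents are complete.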

The Dependencies File

Include a plain-text file listing the packages and versions required to run your analysis.

For Python, export from conda or pip:

# conda
conda env export > environment.yml

# pip
pip freeze > requirements.txt

For R, capture your session info and write it to a file:

writeLines(capture.output(sessionInfo()), "requirements.txt")

Name the file requirements.txt or environment.yml and place it in the root of your repository.
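
The export commands above record every package in the environment. If you would rather list only the packages your analysis imports directly, something like the following sketch could work. It uses importlib.metadata from the Python standard library, and the package names passed in at the bottom are hypothetical examples, not a statement of what your project needs.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for each installed package in `packages`."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Flag the gap in a comment line instead of silently skipping it.
            lines.append(f"# {name}: not installed in this environment")
    return lines


# Hypothetical package list; use the packages your analysis actually imports.
for line in pinned_requirements(["pandas", "duckdb"]):
    print(line)
```

Copy the printed lines into requirements.txt. A pinned version only helps if it is the version you ran the analysis with, so generate the list from the same environment.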


README Requirements

Your README is the main place where visitors will look for documentation of your repo. Include the following:

  • Title: Short and descriptive. It should tell someone what the project is without needing to read further.

  • Purpose: A brief explanation of what the repository is for. Paragraphs or a bulleted list are both fine. You may include an image or logo if relevant.

  • Repository structure: Describe what files and folders exist and what each one contains. A reader should know where to look for any piece of the project.

  • Data access: Explain where the data lives (in the repo, on a server, in a package, etc.) and how to access it in order to run the code.

  • References & acknowledgements: In a consistent format with links, provide a reference to the course and any other sources that supported the project. Include references for datasets too.

  • Any other essential info that is required to reproduce your project
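
To confirm that a README covers the sections listed above, a small check like this sketch can help. It assumes the README is written in Markdown with `#`-style headings, and the section names in REQUIRED_SECTIONS are one possible wording, not mandated titles. The sample README text is made up.

```python
import re

# Assumed section names, taken from the list above; adjust to your README.
REQUIRED_SECTIONS = ["Purpose", "Repository structure", "Data access", "References"]


def missing_sections(readme_text, required=REQUIRED_SECTIONS):
    """Return required section names that match no Markdown heading."""
    headings = re.findall(r"^#{1,6}[ \t]+(.+?)[ \t]*$", readme_text, flags=re.MULTILINE)
    lowered = [h.lower() for h in headings]
    # Case-insensitive substring match, so "References & acknowledgements" counts.
    return [s for s in required if not any(s.lower() in h for h in lowered)]


# Made-up README used only to demonstrate the check.
sample = """# Bird Survey Database
## Purpose
## Data Access
"""
print(missing_sections(sample))  # ['Repository structure', 'References']
```

A passing check shows only that the headings exist. Whether each section is clear enough for a new reader still needs a human read-through.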


README Quality Checklist

Before submitting, review your README:

Note: README Checklist

This Week’s Task

Note: Documentation Checklist


This work is licensed under CC BY 4.0
