EDS 213 Lab 2


Data Cleaning

Today’s Lab

Real-world datasets are messy. We’ll walk through three example databases and the cleaning steps each one needs before ingesting into a database.

Database #1

Air Quality
Date Location PM2_5 NO2 O3
2020-01-01 Site A 12.3 20 15
2020/02/01 Site B fifteen 30 20
March 3, 2020 Site C 7.8 25 twenty
NA Site A 9 NA 30
2020-04-01 Site D 10.2 40 35
Water Quality
Sample_Date Site_Name pH Turbidity Lead_Concentration
01-01-2020 River X 6.5 3.2 0.05
2020-02-01 Lake Y 7 2.8 0.07
03-03-2020 River X 7.5 NA 0.1
2020-Apr-04 Lake Z 7.2 3.5 0.04
2020-05-01 River X 8.1 300.0 0.030 ppm

Database #1: What’s Wrong?

Air Quality table:

  • Date — mixed formats: 2020-01-01, 2020/02/01, March 3, 2020
  • PM2_5 — contains "fifteen" instead of a number
  • NO2 — contains "NA" as a string, not a true missing value

Water Quality table:

  • Sample_Date — multiple inconsistent date formats
  • Lead_Concentration — mixes plain numbers with "0.030 ppm"

Database #2

Air Quality
Date Time Location PM10 PM2_5 NO2
2023-04-01 08:00 City Park 50 25 µg/m³ 40
2023-04-01 14:00 City Park 55 30 42
2023-04-02 08:00 Industrial Area 150 75 60
2023-04-02 14:00 Industrial Area 155 mg/m³ 80 NA
2023-04-03 08:00 Residential Zone 40 NA 35
Weather (wide format)
Location X2023.04.01.Temp.06.00 X2023.04.01.Humidity.06.00 X2023.04.01.Temp.12.00 X2023.04.01.Humidity.12.00 X2023.04.01.Temp.18.00 X2023.04.01.Humidity.18.00 X2023.04.01.Temp.24.00 X2023.04.01.Humidity.24.00
Park 14 80 22 60 20 70 16 85
Beach 16 85 24 55 22 75 18 90
Forest 13 90 21 65 19 80 15 95
Hill 15 75 23 50 21 65 17 80

Database #2: What’s Wrong?

Air Quality table:

  • PM10 has one entry with "155 mg/m³" — unit embedded in value
  • PM2_5 has one entry with "25 µg/m³" — same problem

Weather table:

  • Data is in wide format — each column encodes date, variable, and time
  • Relational databases need long/tidy format
  • Need pivot_longer() to reshape into separate Date, Time, Temp, and Humidity columns

Database #3

Daily Air Quality
Date Location PM2_5 NO2 Ozone
2020-01-01 Location1 46.69874 87.96601 98.12719
2020-01-01 Location2 119.30424 33.44394 89.80869
2020-01-01 Location3 64.30165 35.56292 63.59488
2020-01-02 Location1 133.03752 24.92722 88.71676
2020-01-02 Location2 141.36776 12.69610 34.84138
2020-01-02 Location3 11.60569 18.83068 22.21545
Yearly Vegetation Health
Year Location VHI
2020 Location1 63.01918
2021 Location1 83.86073
2022 Location1 80.71608
2020 Location2 92.40360
2021 Location2 42.04080
2022 Location2 51.59250
2020 Location3 75.97901
2021 Location3 77.41948
2022 Location3 88.97631

Database #3: What’s Wrong?

Daily Air Quality table:

  • Recorded at daily resolution

Yearly Vegetation Health table:

  • Recorded at yearly resolution
  • These tables cannot be joined directly — they have different temporal resolutions

Fix: Aggregate daily air quality to yearly averages so both tables share a Year + Location key.

This Week’s Task

Apply these principles to your own dataset. Before next week’s ingestion step, make sure your data meets the following criteria:

  • All date columns use a consistent format
  • Numeric columns contain only numbers (no units or text mixed in)
  • Missing values are represented consistently as NA
  • Data is in long/tidy format (each variable in its own column, each observation in its own row)
  • Tables that will be joined share a common key
  • Any other cleaning steps specific to your dataset (e.g., removing duplicates, handling outliers, standardizing category names)