| Date | Location | PM2_5 | NO2 | O3 |
|---|---|---|---|---|
| 2020-01-01 | Site A | 12.3 | 20 | 15 |
| 2020/02/01 | Site B | fifteen | 30 | 20 |
| March 3, 2020 | Site C | 7.8 | 25 | twenty |
| NA | Site A | 9 | NA | 30 |
| 2020-04-01 | Site D | 10.2 | 40 | 35 |
EDS 213 Lab 2
Data Cleaning
Today’s Lab
Real-world datasets are messy. We’ll walk through three example databases and the cleaning steps each one needs before ingesting into a database.
Database #1
| Date | Location | PM2_5 | NO2 | O3 |
|---|---|---|---|---|
| 2020-01-01 | Site A | 12.3 | 20 | 15 |
| 2020/02/01 | Site B | fifteen | 30 | 20 |
| March 3, 2020 | Site C | 7.8 | 25 | twenty |
| NA | Site A | 9 | NA | 30 |
| 2020-04-01 | Site D | 10.2 | 40 | 35 |
| Sample_Date | Site_Name | pH | Turbidity | Lead_Concentration |
|---|---|---|---|---|
| 01-01-2020 | River X | 6.5 | 3.2 | 0.05 |
| 2020-02-01 | Lake Y | 7 | 2.8 | 0.07 |
| 03-03-2020 | River X | 7.5 | NA | 0.1 |
| 2020-Apr-04 | Lake Z | 7.2 | 3.5 | 0.04 |
| 2020-05-01 | River X | 8.1 | 300.0 | 0.030 ppm |
Database #1: What’s Wrong?
Air Quality table:
Date — mixed formats: 2020-01-01, 2020/02/01, March 3, 2020PM2_5 — contains "fifteen" instead of a numberNO2 — contains "NA" as a string, not a true missing valueWater Quality table:
Sample_Date — multiple inconsistent date formatsLead_Concentration — mixes plain numbers with "0.030 ppm"Database #2
| Date | Time | Location | PM10 | PM2_5 | NO2 |
|---|---|---|---|---|---|
| 2023-04-01 | 08:00 | City Park | 50 | 25 µg/m³ | 40 |
| 2023-04-01 | 14:00 | City Park | 55 | 30 | 42 |
| 2023-04-02 | 08:00 | Industrial Area | 150 | 75 | 60 |
| 2023-04-02 | 14:00 | Industrial Area | 155 mg/m³ | 80 | NA |
| 2023-04-03 | 08:00 | Residential Zone | 40 | NA | 35 |
| Location | X2023.04.01.Temp.06.00 | X2023.04.01.Humidity.06.00 | X2023.04.01.Temp.12.00 | X2023.04.01.Humidity.12.00 | X2023.04.01.Temp.18.00 | X2023.04.01.Humidity.18.00 | X2023.04.01.Temp.24.00 | X2023.04.01.Humidity.24.00 |
|---|---|---|---|---|---|---|---|---|
| Park | 14 | 80 | 22 | 60 | 20 | 70 | 16 | 85 |
| Beach | 16 | 85 | 24 | 55 | 22 | 75 | 18 | 90 |
| Forest | 13 | 90 | 21 | 65 | 19 | 80 | 15 | 95 |
| Hill | 15 | 75 | 23 | 50 | 21 | 65 | 17 | 80 |
Database #2: What’s Wrong?
Air Quality table:
PM10 has one entry with "155 mg/m³" — unit embedded in valuePM2_5 has one entry with "25 µg/m³" — same problemWeather table:
pivot_longer() to reshape into separate Date, Time, Temp, and Humidity columnsDatabase #3
| Date | Location | PM2_5 | NO2 | Ozone |
|---|---|---|---|---|
| 2020-01-01 | Location1 | 46.69874 | 87.96601 | 98.12719 |
| 2020-01-01 | Location2 | 119.30424 | 33.44394 | 89.80869 |
| 2020-01-01 | Location3 | 64.30165 | 35.56292 | 63.59488 |
| 2020-01-02 | Location1 | 133.03752 | 24.92722 | 88.71676 |
| 2020-01-02 | Location2 | 141.36776 | 12.69610 | 34.84138 |
| 2020-01-02 | Location3 | 11.60569 | 18.83068 | 22.21545 |
| Year | Location | VHI |
|---|---|---|
| 2020 | Location1 | 63.01918 |
| 2021 | Location1 | 83.86073 |
| 2022 | Location1 | 80.71608 |
| 2020 | Location2 | 92.40360 |
| 2021 | Location2 | 42.04080 |
| 2022 | Location2 | 51.59250 |
| 2020 | Location3 | 75.97901 |
| 2021 | Location3 | 77.41948 |
| 2022 | Location3 | 88.97631 |
Database #3: What’s Wrong?
Daily Air Quality table:
Yearly Vegetation Health table:
Fix: Aggregate daily air quality to yearly averages so both tables share a Year + Location key.
This Week’s Task
Apply these principles to your own dataset. Before next week’s ingestion step, make sure your data meets the following criteria:
NA