EDS 213: Lab 6


From Querying to Visualizing

This Week’s Goal

This week, you’ll run your SQL query in Python or R, pulling results directly into a data frame and turning them into a visualization.

  • In Python, use duckdb + a cursor to query your database, convert to a pandas DataFrame, and plot
  • In R, use dbplyr to write dplyr code that gets translated to SQL automatically
  • Both paths end the same way: a query result becomes a plot

Python: Connect to Your Database

import duckdb
import pandas as pd

conn = duckdb.connect("database.duckdb")
cur = conn.cursor()

Python: Run a Query

cur.execute("""
    SELECT Species, COUNT(*) AS num_nests
    FROM Bird_nests
    GROUP BY Species
    ORDER BY num_nests DESC
""")

At this point the data is not yet in Python memory. It’s a symbolic representation of the query,but you need to materialize it.

Python: Getting Results Out

Several options for materializing results:

cur.fetchall()   # list of tuples — all rows at once
cur.fetchone()   # one row as a tuple
cur.df()         # pandas DataFrame  ← best for analysis

To get a DataFrame you can plot:

cur.execute("""
    SELECT Species, COUNT(*) AS num_nests
    FROM Bird_nests
    GROUP BY Species
    ORDER BY num_nests DESC
""")
df = cur.df()
df.head()

Python: Turn the Result into a Plot

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(df["Species"], df["num_nests"])
ax.set_xlabel("Number of Nests")
ax.set_title("Nest Count by Species")
plt.tight_layout()
plt.show()

Because cur.df() returned a standard pandas DataFrame, any plotting library works — matplotlib, seaborn, plotly, etc.

Python: Close the Connection

conn.close()

Cursors don’t store anything, they just transfer queries and return results. Once you’ve called .df() or .fetchall(), the data lives in your variable, not the cursor.

R: Connect to Your Database

library(tidyverse)
library(DBI)
library(dbplyr)
library(duckdb)

conn <- DBI::dbConnect(duckdb::duckdb(),
                       dbdir = "database.duckdb")

DBI::dbListTables(conn)

dbListTables() shows you all tables available.

R: Reference a Table

nests_db <- tbl(conn, "Bird_nests")
nests_db

This looks like a data frame — but it isn’t. It’s a lazy reference to the database table. The data hasn’t been pulled into R yet.

dim(nests_db)   # row count is unknown (NA)

R: Writing SQL with dbplyr

Use dplyr verbs on the table reference. dbplyr translates them to SQL behind the scenes:

nests_db |>
  group_by(Species) |>
  summarise(num_nests = n()) |>
  arrange(desc(num_nests)) |>
  head(3)
# See what SQL dbplyr generated:
nests_db |>
  group_by(Species) |>
  summarise(num_nests = n()) |>
  show_query()

R: Pulling Data into R with collect()

result <- nests_db |>
  group_by(Species) |>
  summarise(num_nests = n()) |>
  arrange(desc(num_nests)) |>
  collect()   # ← runs the query, returns a true data frame

Nothing runs on the database until collect() is called. Everything before it is building up the SQL query.

R: Turn the Result into a Plot

result |>
  ggplot(aes(x = num_nests,
             y = reorder(Species, num_nests))) +
  geom_col(fill = "#52796f") +
  labs(
    x = "Number of Nests",
    y = "Species",
    title = "Nest Count by Species"
  ) +
  theme_minimal()

R: Close the Connection

DBI::dbDisconnect(conn, shutdown = TRUE)

The shutdown = TRUE argument is specific to DuckDB. It cleanly shuts down the DuckDB engine when you disconnect.

This Week’s Task

Using Python or R, connect to your database, run your analytical query, and create a visualization from the results.

Your submission must include: