Lab 6

EDS 213 |Query to Visualization in R/ Python

Overview

This week, you’ll run your SQL query in Python or R, pulling results directly into a data frame and turning them into a visualization.

View Slides


Python: Querying with DuckDB

Connect and create a cursor

import duckdb
import pandas as pd

conn = duckdb.connect("database.duckdb")
cur = conn.cursor()

Execute a query

cur.execute("""
    SELECT Species, COUNT(*) AS num_nests
    FROM Bird_nests
    GROUP BY Species
    ORDER BY num_nests DESC
""")

At this point the data is not yet in Python memory. It’s a symbolic representation of the query. You still need to materialize it.

Materialize the results

There are several ways to get data out of the cursor:

cur.fetchall()   # list of tuples — all rows at once
cur.fetchone()   # one row as a tuple
cur.df()         # pandas DataFrame ← best for analysis

For plotting, use .df() to get a pandas DataFrame:

cur.execute("""
    SELECT Species, COUNT(*) AS num_nests
    FROM Bird_nests
    GROUP BY Species
    ORDER BY num_nests DESC
""")
df = cur.df()
df.head()

Visualize the result

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(df["Species"], df["num_nests"])
ax.set_xlabel("Number of Nests")
ax.set_title("Nest Count by Species")
plt.tight_layout()
plt.show()

Close the connection

conn.close()

R: Querying with dbplyr

dbplyr lets you write dplyr code against a database. It translates your verbs into SQL and runs the query.

Connect to your database

library(tidyverse)
library(DBI)
library(dbplyr)
library(duckdb)

conn <- DBI::dbConnect(duckdb::duckdb(),
                       dbdir = "database.duckdb")

DBI::dbListTables(conn)

Reference a table

nests_db <- tbl(conn, "Bird_nests")
nests_db

This looks like a data frame, but it isn’t. It’s a lazy reference to the database table. The data hasn’t been pulled into R yet (dim() will show NA for the row count).

Query with dplyr verbs

Use familiar dplyr verbs on the table reference:

nests_db |>
  group_by(Species) |>
  summarise(num_nests = n()) |>
  arrange(desc(num_nests)) |>
  head(3)

See what SQL dbplyr generates

nests_db |>
  group_by(Species) |>
  summarise(num_nests = n()) |>
  show_query()

Pull data into R with collect()

Nothing runs on the database until collect() is called. Everything before it is building up the SQL query:

result <- nests_db |>
  group_by(Species) |>
  summarise(num_nests = n()) |>
  arrange(desc(num_nests)) |>
  collect()

collect() executes the query and returns a true R data frame.

Visualize the result

Because collect() returns a regular data frame, you can pipe directly into ggplot2:

result |>
  ggplot(aes(x = num_nests,
             y = reorder(Species, num_nests))) +
  geom_col(fill = "#52796f") +
  labs(
    x = "Number of Nests",
    y = "Species",
    title = "Nest Count by Species"
  ) +
  theme_minimal()

Close the connection

DBI::dbDisconnect(conn, shutdown = TRUE)

The shutdown = TRUE argument is specific to DuckDB. It cleanly shuts down the DuckDB engine when you disconnect.


This Week’s Task

Using Python or R, connect to your database, run your analytical query from last week, and create a visualization from the results. Everything should live in a single .ipynb or .qmd file.

NoteChecklist


This work is licensed under CC BY 4.0

UCSB logo