%pwdExample of Jupyter “magic command”:
To install DuckDB Python module:
# %pip install duckdb
# %pip install pandasRequirement already satisfied: duckdb in /Users/jb160-local/GitHub/gitRDS/EDS213/bren-eds213/.venv/lib/python3.14/site-packages (1.5.2)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: pandas in /Users/jb160-local/GitHub/gitRDS/EDS213/bren-eds213/.venv/lib/python3.14/site-packages (3.0.2)
Requirement already satisfied: numpy>=2.3.3 in /Users/jb160-local/GitHub/gitRDS/EDS213/bren-eds213/.venv/lib/python3.14/site-packages (from pandas) (2.4.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/jb160-local/GitHub/gitRDS/EDS213/bren-eds213/.venv/lib/python3.14/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in /Users/jb160-local/GitHub/gitRDS/EDS213/bren-eds213/.venv/lib/python3.14/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Note: you may need to restart the kernel to use updated packages.
import duckdb
import pandas as pd1. Create a connection to the database
conn = duckdb.connect("~/GitHub/gitRDS/EDS213/bren-eds213-data/database/database.duckdb")conn2. Now let’s query the databse using:
.sql()method – specific to DuckDB and analysis focused
conn.sql("SELECT * FROM Site LIMIT 5")Note: This is also a lazy evaluation like we were doing with dbplyr. At this point, the data has not been fully processed or brought into Python memory yet. It’s a symbolic representation of a query. To see the actual data, you need to materialize the relation. You can do this in several ways:
.show(): Prints a nice tabular representation (great for interactive exploration)..fetchall(): Returns all results as a list of tuples (the traditional DB-API way)..fetchone(): Returns the next single row as a tuple..df(): Converts the result into a Pandas DataFrame.Now we want results…
# You get a list of tuples (one per row)
conn.sql("SELECT * FROM Site LIMIT 5").fetchall()# You get a pandas dataframe
site_df = conn.sql("SELECT * FROM Site").df()
site_df.head()site_table = conn.execute("SELECT * FROM Site")
site_table.execute()method – more ubiquitous to other database workflows where you create a cursor object that you use to iterate on the rows of a table
cur = conn.cursor()Cursors don’t store anything, they just transfer queries to the database and get results back.
cur.execute("SELECT * FROM Site LIMIT 5")We still need to fectch the results as before:
cur.fetchall()Always get tuples, even if you only request one column
cur.execute("SELECT Nest_ID FROM Bird_nests LIMIT 10")cur.fetchall()cur.execute("SELECT Nest_ID FROM Bird_nests LIMIT 10")
[t[0] for t in cur.fetchall()]- Get the one result
cur.execute("SELECT COUNT(*) FROM Bird_nests")
cur.fetchall()cur.execute("SELECT COUNT(*) FROM Bird_nests")
cur.fetchone()cur.execute("SELECT COUNT(*) FROM Bird_nests")
cur.fetchone()[0]- Using an iterator - but DuckDB doesn’t support iterators :(
cur.execute("SELECT Nest_ID FROM Bird_nests LIMIT 10")
for row in cur:
print(f"got {row[0]}")A workaround is to use the .fectchone() method:
cur.execute("SELECT Nest_ID FROM Bird_nests LIMIT 10")
while True:
row = cur.fetchone()
if row == None:
break
# do something with row
print(f"got nest ID {row[0]}")Can do things other than SELECT!
cur.execute("CREATE TEMP TABLE temp_table AS
SELECT * FROM Bird_nests LIMIT 10")Watch out for line breaks!!
cur.execute("""
CREATE TEMP TABLE temp_table AS
SELECT * FROM Bird_nests LIMIT 10
""")cur.execute("SELECT * FROM temp_table")cur.fetchone()A note on fragility
For example:
INSERT INTO Site VALUES ("abcd", "Foo", 35.7, 42.3, "?")A less fragile way of expressing the same thing:
INSERT INTO Site (Code, Site_name, Latitude, Longitude, Something_else)
VALUES ("abcd", "Foo", 35.7, 42.3, "?")In the same vein: SELECT * is fragile
cur.execute("SELECT * FROM Site LIMIT 3")
cur.fetchall()A better, more robust way of coding the same thing:
cur.execute("SELECT Site_name, Code, Latitude, Longitude FROM Site LIMIT 3")
cur.fetchall()An extended example: Question we’re trying to answer: How many nests do we have for each species?
Approach: first get all species. Then execute a count query for each species.
A digression: string interpolation in Python
# The % method
s = "My name is %s"
print(s % "Greg")
s = "My name is %s and the other teacher's name is %s"
print(s % ("Greg", "Julien"))
# The new f-string method
name = "Greg"
print(f"My name is {name}")
# Third way
print("My name is {}".format("Greg"))query = """
SELECT COUNT(*) FROM Bird_nests
WHERE Species = '%s'
"""
cur.execute("SELECT Code FROM Species LIMIT 3")
for row in cur.fetchall(): # DuckDB workaround
code = row[0]
prepared_query = query % code
#print(prepared_query)
cur2 = conn.cursor()
cur2.execute(prepared_query)
print(f"Species {code} has {cur2.fetchone()[0]} nests")
cur2.close()The above Python interpolation is dangerous and has caused many database hacks! There’s a better way
query = """
SELECT COUNT(*) FROM Bird_nests
WHERE Species = ?
"""
cur.execute("SELECT Code FROM Species LIMIT 3")
for row in cur.fetchall(): # DuckDB workaround
code = row[0]
# NOT NEEDED! prepared_query = query % code
#print(prepared_query)
cur2 = conn.cursor()
cur2.execute(query, [code]) # <-- added argument here
print(f"Species {code} has {cur2.fetchone()[0]} nests")
cur2.close()Let’s illustrate the danger with a different example
abbrev = "TS"
name = "Taylor Swift"
cur.execute("""
INSERT INTO Personnel (Abbreviation, Name)
VALUES ('%s', '%s')
""" % (abbrev, name)
)cur.execute("SELECT * FROM Personnel")
cur.fetchall()[-3:]abbrev = "CO"
name = "Conan O'Brien"
cur.execute("""
INSERT INTO Personnel (Abbreviation, Name)
VALUES ('%s', '%s')
""" % (abbrev, name)
)"""
INSERT INTO Personnel (Abbreviation, Name)
VALUES ('%s', '%s')
""" % (abbrev, name)abbrev = "CO"
name = "Conan O'Brien"
cur.execute("""
INSERT INTO Personnel (Abbreviation, Name)
VALUES (?, ?)
""",
[abbrev, name])cur.execute("SELECT * FROM Personnel")
cur.fetchall()[-3:]Don’t forget to close your connection when you are done!
conn.close()