Lab 3

EDS 213 | Ingesting Your Data into a Database

Overview

Now that your data is clean, it’s time to load it into a database. This week we cover two different methods for ingesting CSV files into a database, and walk through the key differences between them.

View Slides


Two Ways to Ingest Data

There are many ways to get data into a database. We’ll focus on two common approaches, both using a CSV file as the starting point.


Option 1: COPY

The COPY command is a PostgreSQL-native way to bulk-load data from a file directly into an existing table. You first create the table manually with the correct schema, then use COPY to populate it.

COPY table_name (col1, col2, col3)
FROM '/path/to/file.csv'
DELIMITER ','
CSV HEADER;
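Before running COPY, the target table must already exist. A minimal sketch of that first step, using the same column names as the COPY example above (the types shown are only illustrative — yours should match your actual data):

```sql
-- Define the schema explicitly before loading.
-- Column names must match those referenced in the COPY command.
CREATE TABLE table_name (
    col1 INTEGER,      -- e.g., an integer ID
    col2 TEXT,         -- e.g., a free-text field
    col3 DATE          -- e.g., an observation date
);
```

Because you write the schema yourself, you can also add constraints (PRIMARY KEY, NOT NULL, etc.) at this point — something automatic inference won’t do for you.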

Key characteristics:

- Fast and efficient for large files
- Requires the table to already exist with matching column names and types
- You control the schema explicitly: data types, constraints, etc.
- The file path must be accessible to the PostgreSQL server (not just your local machine)


Option 2: read_csv (DuckDB / R / Python)

The read_csv approach (available in DuckDB, R’s duckdb package, or Python) lets you create and populate a table in one step directly from a CSV file, with automatic type inference.

CREATE TABLE table_name AS
SELECT * FROM read_csv_auto('/path/to/file.csv');

Key characteristics:

- Convenient: no need to define the schema upfront
- Automatically infers column types from the data
- Great for quick exploration and prototyping
- Type inference can sometimes get things wrong, so always verify!
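One quick way to verify the inferred schema in DuckDB is to inspect the table right after creating it, for example:

```sql
-- Show the column names and inferred types for the new table
DESCRIBE table_name;
```

If a column came through as VARCHAR when you expected a date or a number, that’s a sign the inference needs correcting before you rely on the table.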


The Difference

The two methods produce tables that look similar on the surface, but there are important practical differences:

                    COPY                       read_csv
Schema definition   Manual (you write it)      Automatic (inferred)
Type control        Full control               Inferred; may need correction
Speed               Very fast for large data   Fast, slightly more overhead
File location       Server-side path           Local or relative path
Best for            Production ingestion       Exploration & prototyping
Tip: Which should you use?

For your project, either method works. If you want full control over your schema (recommended for a well-designed database), use COPY. If you want to get up and running quickly and explore first, read_csv is a great starting point — just double-check that the inferred types match what you expect.
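If inference does get a type wrong, you don’t have to abandon the read_csv approach: DuckDB’s read_csv accepts a types argument that overrides inference for specific columns. A hedged sketch (col3 is a hypothetical column name):

```sql
-- Recreate the table, forcing col3 to be read as a DATE
-- instead of whatever type was inferred automatically.
CREATE TABLE table_name AS
SELECT * FROM read_csv('/path/to/file.csv',
                       types = {'col3': 'DATE'});
```

This gives you a middle ground: automatic inference for most columns, explicit control where it matters.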


This Week’s Task

Ingest your cleaned data into a database and start exploring it.

Note: Ingestion Checklist


This work is licensed under CC BY 4.0
