# Work on a copy so the raw 'whale' object stays intact for comparison
file <- whale
# TODO: Convert categorical variables to factor using varToFactor()
# Hint: include type, county, state, condition, origin, gear, fine, infraction_type
file <- varToFactor(
  obj = file,
  var = c(___) # <-- fill in the variable names
)
# TODO: Convert continuous variables to numeric using varToNumeric()
# Hint: year and month should be numeric
file <- varToNumeric(
  obj = file,
  var = c(___) # <-- fill in the variable names
)
K-Anonymity with sdcMicro: Whale Entanglement Data
Background
This dataset (whale-entanglement.csv) contains documented whale entanglement incidents on the U.S. West Coast. Each row is one event, with variables for whale species, location, gear type, and fishery involved. (Adapted for instructional use only.)
Goals:
- Assess re-identification risk in the raw data
- Apply Statistical Disclosure Control (SDC) to reduce that risk
- Quantify the resulting information loss
Step 1: Load Package & Data
Inspect Data
Each row is one entanglement event.
Step 2: Identify Disclosure Risks
Direct Identifiers
Direct identifiers uniquely identify a record on their own.
Q1 — How many direct identifiers are present, and what are they?
Answer: Your answer here.
| Variable | Why it directly identifies |
|---|---|
| ___ | ___ |
| ___ | ___ |
| ___ | ___ |
| ___ | ___ |
| ___ | ___ |
| ___ | ___ |
These must be excluded from any analytical release.
Quasi-Identifiers
Quasi-identifiers (QIDs) don’t identify records alone, but can be combined with each other — or with external data — to re-identify individuals.
Q2 — Which variables are quasi-identifiers?
Answer: Your answer here. List the variables and explain why each is a QID.
Step 3: Prepare Variable Types
Q3 — What types do the variables need to be?
sdcMicro requires categorical QIDs to be factor type and continuous variables to be numeric. Misspecified types produce incorrect risk estimates.
💡 Tip (varToFactor() and varToNumeric()): These sdcMicro helper functions convert columns in a data frame to factor or numeric. Both take obj (the data frame) and var (a character vector of column names). Work on a copy of your data frame so the original stays intact for later comparison.
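For intuition, the same kind of type conversion can be approximated in base R. The sketch below is only an illustration: `toy`, `toy_copy`, `cat_vars`, and `num_vars` are hypothetical names, the three-row data frame is made up rather than taken from the whale data, and `as.factor()`/`as.numeric()` are base-R stand-ins, not the sdcMicro helpers themselves.

```r
# Hypothetical toy data: every column starts out as character
toy <- data.frame(
  state = c("CA", "OR", "CA"),
  gear  = c("pot", "net", "pot"),
  year  = c("2015", "2016", "2016"),
  stringsAsFactors = FALSE
)

# Work on a copy so 'toy' stays untouched for later comparison
toy_copy <- toy

cat_vars <- c("state", "gear")  # categorical columns -> factor
num_vars <- c("year")           # continuous columns  -> numeric

# Base-R stand-ins for the conversion (not varToFactor()/varToNumeric())
toy_copy[cat_vars] <- lapply(toy_copy[cat_vars], as.factor)
toy_copy[num_vars] <- lapply(toy_copy[num_vars], as.numeric)

# Confirm the new column types
str(toy_copy)
```

Checking `str()` on both the copy and the original is a quick way to confirm that only the copy changed.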
Step 4: Measure Initial Risk
4.1 Create the SDC Object
Q4.1 — What is the re-identification risk for this dataset?
We create an sdcMicroObj that encodes our design decisions: which variables are QIDs, which direct identifiers to exclude, and any weights or strata.
💡 Tip (createSdcObj()): The key arguments are:
- dat: your data frame (use the original whale, not file)
- keyVars: character vector of quasi-identifier column names
- excludeVars: character vector of direct identifier column names to drop
All other arguments (weightVar, hhId, etc.) can be NULL for this exercise.
# TODO: Create the sdcMicro object. Choose your QIDs and direct identifiers.
sdcInitial <- createSdcObj(
  dat = ___, # <-- which data frame?
  keyVars = c(___), # <-- your QIDs
  excludeVars = c(___), # <-- direct identifiers to exclude
  weightVar = NULL,
  hhId = NULL,
  strataVar = NULL,
  pramVars = NULL,
  seed = 0,
  randomizeRecords = FALSE,
  alpha = c(1)
)
Then compute the current disclosure risk:
💡 Tip: The global re-identification risk is stored in sdcInitial@risk$global$risk. Values closer to 1 mean higher risk.
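To build intuition for what a global risk figure summarizes, here is a rough base-R illustration. It rests on a stated assumption: with no sampling weights, each record's risk is taken to be 1 / f_k, where f_k is the number of records sharing its QID combination, and the global risk is the mean of those values. The estimator sdcMicro uses may differ in detail, and `toy`, `key`, `fk`, `record_risk`, and `global_risk` are hypothetical names on made-up data.

```r
# Hypothetical QID values for six records (not the whale data)
toy <- data.frame(
  state  = c("CA", "CA", "CA", "OR", "OR", "WA"),
  origin = c("commercial", "commercial", "commercial",
             "tribal", "tribal", "tribal"),
  stringsAsFactors = FALSE
)

# f_k: how many records share each record's QID combination
key <- paste(toy$state, toy$origin, sep = "|")
fk  <- as.vector(table(key)[key])

# Assumed simplification: per-record risk is 1 / f_k,
# and the global risk is the average over all records
record_risk <- 1 / fk
global_risk <- mean(record_risk)

print(fk)
print(global_risk)
```

Under this simplification the global risk equals the number of distinct QID combinations divided by the number of records, so a dataset full of unique combinations approaches 1.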
# TODO: Print the global re-identification risk
print(___)
4.2 Assess k-Anonymity Violations
Q4.2 — To what extent does this dataset violate k-anonymity?
A dataset satisfies k-anonymity if every QID combination appears in at least k records. Records in smaller groups are “at risk.”
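The definition can be checked by hand on a small example. The sketch below counts, for a made-up data frame, how many records sit in a QID group smaller than k. The names `toy`, `qids`, `key`, `fk`, and `violations` are hypothetical, and this is a base-R illustration of the definition rather than the code sdcMicro runs.

```r
# Hypothetical QID values for eight records (not the whale data)
toy <- data.frame(
  state = c("CA", "CA", "CA", "CA", "CA", "OR", "OR", "WA"),
  gear  = c("pot", "pot", "pot", "pot", "pot", "net", "net", "net"),
  stringsAsFactors = FALSE
)
qids <- c("state", "gear")

# One grouping factor per observed QID combination
key <- interaction(toy[qids], drop = TRUE)

# Size of the group each record belongs to
fk <- ave(rep(1, nrow(toy)), key, FUN = length)

# A record violates k-anonymity when its group has fewer than k members
ks <- c(2, 3, 5)
violations <- vapply(ks, function(k) sum(fk < k), integer(1))
names(violations) <- paste0("k=", ks)

print(violations)
```

Here the single WA record violates 2-anonymity, and the two OR records join it once k rises to 3.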
💡 Tip (print() on an sdcMicroObj): Calling print(sdcInitial) produces a full risk report, including the number of records violating 2-, 3-, and 5-anonymity. Look for the “Frequency of key” and “Number of observations violating k-anonymity” sections.
# TODO: Print the full risk report for sdcInitial
___
Our target: achieve at least 3-anonymity (k = 3).
Step 5: Apply Anonymization
5.1 Non-Perturbative Method Testing — Recode origin
Q5.1 — Apply one non-perturbative method. How effective was it?
Reminder:
- Non-perturbative methods generalize or suppress values without adding noise.
- Perturbative methods (e.g., PRAM, microaggregation) alter the values themselves, for example by adding controlled noise or by randomly reassigning categories.
We can recode origin by merging minority categories into a single broader category. This reduces the distinctiveness of those records.
💡 Tip (groupAndRename()): Collapses one or more levels of a factor QID into a new label. Arguments:
- obj: your sdcMicroObj
- var: the variable name (as a string)
- before: character vector of original levels to merge
- after: character vector with the new label (single value repeated for all merged levels)
Always check table(sdcInitial@manipKeyVars$origin) before and after to confirm the recode worked.
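The effect of collapsing levels can be previewed on a plain factor before touching the SDC object. This is a base-R sketch on a made-up vector: `origin` and `merged` are hypothetical names, and the `levels<-` assignment stands in for what the recode achieves without calling groupAndRename() itself.

```r
# Hypothetical factor with three levels (not the whale data)
origin <- factor(c("commercial", "recreational", "tribal",
                   "commercial", "tribal"))

# Before: three separate levels
print(table(origin))

# Collapse the two minority levels into one broader label.
# Assigning the same name to several levels merges them.
merged <- origin
levels(merged)[levels(merged) %in% c("recreational", "tribal")] <- "non-commercial"

# After: two levels, with the minority counts pooled
print(table(merged))
```

Comparing the two tables shows the trade: fewer distinct values to single out a record, at the cost of no longer telling recreational and tribal apart.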
# Before: check current distribution
cat("Distribution of 'origin' before recoding:\n")
print(table(sdcInitial@manipKeyVars$origin))
# TODO: Recode origin — merge "recreational" and "tribal" into "non-commercial"
sdcInitial <- groupAndRename(
  obj = sdcInitial,
  var = "___", # <-- variable name
  before = c(___, ___), # <-- levels to merge
  after = c("___") # <-- new label
)
# After: check new distribution
cat("\nDistribution of 'origin' after recoding:\n")
print(table(sdcInitial@manipKeyVars$origin))
# Recompute risk
cat("\nRisk summary after recoding 'origin':\n")
print(sdcInitial)
Q5.1 answer: Did the recoding reduce k-anonymity violations? Why or why not?
Your answer here.
5.2 Automated k-Anonymization — kAnon()
Q5.2 — Apply k = 3 anonymization.
kAnon() uses local suppression: it sets individual QID values to NA, aiming to suppress as few values as possible, until every QID combination appears at least k times. Keeping the number of suppressions small limits information loss.
💡 Tip (kAnon()): Takes your sdcMicroObj and k (an integer vector, e.g., c(3)). It returns a new sdcMicroObj with suppressed values applied. After running it, call print() again to see the updated violation counts; they should drop to zero for your chosen k.
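To make the idea of local suppression concrete, here is a deliberately naive base-R sketch on made-up data. It is not the algorithm sdcMicro implements: `match_counts()` and `suppress_to_k()` are hypothetical helpers, the greedy rule (blank the one value that lets the record match the most others) is chosen for illustration, and treating NA as a wildcard that matches any value is an assumption of this sketch.

```r
# Hypothetical QID values for four records (not the whale data)
toy <- data.frame(
  origin = c("commercial", "commercial", "commercial", "tribal"),
  state  = c("CA", "CA", "CA", "CA"),
  stringsAsFactors = FALSE
)

# For each record, count the records that agree with it on every QID.
# Assumption: NA acts as a wildcard that matches any value.
match_counts <- function(df) {
  n <- nrow(df)
  vapply(seq_len(n), function(i) {
    same <- rep(TRUE, n)
    for (v in names(df)) {
      xi <- df[[v]][i]
      if (!is.na(xi)) {
        same <- same & (is.na(df[[v]]) | df[[v]] == xi)
      }
    }
    sum(same)
  }, integer(1))
}

# Greedy toy suppression: while some record matches fewer than k records,
# blank the single value that raises that record's match count the most.
suppress_to_k <- function(df, k) {
  stopifnot(nrow(df) >= k)  # with fewer than k rows the target is unreachable
  repeat {
    bad <- which(match_counts(df) < k)
    if (length(bad) == 0) break
    i <- bad[1]
    candidates <- names(df)[!is.na(unlist(df[i, ]))]
    reach <- vapply(candidates, function(v) {
      trial <- df
      trial[[v]][i] <- NA
      match_counts(trial)[i]
    }, integer(1))
    best <- candidates[which.max(reach)]
    df[[best]][i] <- NA
  }
  df
}

toy_k3 <- suppress_to_k(toy, k = 3)
print(toy_k3)
print(match_counts(toy_k3))
```

In this toy case a single suppression (the lone "tribal" value) is enough for every record to match at least three records, which is the kind of outcome kAnon() is designed to reach on real data.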
# TODO: Apply k = 3 anonymization and store the result in sdcAnon
sdcAnon <- kAnon(___, k = c(___))
cat("Risk summary after k = 3 anonymization:\n")
print(sdcAnon)
Q5.2 answer: How many records still violate 3-anonymity after running kAnon()?
Your answer here.
Step 6: Measure Information Loss
Q6 — How much information did we lose?
Anonymization is a trade-off: privacy protection at the cost of analytical utility. Quantifying this trade-off is essential for responsible data sharing.
💡 Tip (print(sdcAnon, "ls")): Passing "ls" as a second argument to print() displays the local suppression summary: how many values were suppressed in each QID column and as an overall percentage. Higher suppression = more information loss.
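The same kind of summary can be computed by hand, which helps when interpreting the report. The sketch below compares a made-up `before` data frame with a hand-edited `after` copy; the NA values are placed manually for illustration and are not output from kAnon(). All object names are hypothetical.

```r
# Hypothetical QIDs before anonymization (not the whale data)
before <- data.frame(
  state  = c("CA", "CA", "OR", "WA"),
  origin = c("commercial", "commercial", "tribal", "tribal"),
  stringsAsFactors = FALSE
)

# Hand-made "after" version with three values blanked out
after <- before
after$state[3:4] <- NA
after$origin[4]  <- NA

# Newly suppressed values per QID column
suppressed <- colSums(is.na(after)) - colSums(is.na(before))

# Per-column and overall suppression rates
loss <- data.frame(
  variable   = names(suppressed),
  suppressed = suppressed,
  percent    = round(100 * suppressed / nrow(before), 1),
  row.names  = NULL
)
overall_pct <- 100 * sum(suppressed) / (nrow(before) * ncol(before))

print(loss)
print(overall_pct)
```

Subtracting the NA counts in `before` matters when the raw data already contains missing values, since those are not losses caused by anonymization.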
# TODO: Print the information-loss summary
print(___, "ls")
Q6 answer: Which variable had the most suppression, and what does that tell you about the data?
Your answer here.
Step 7: Export the Anonymized Dataset
Before sharing, we must:
- Remove direct identifiers (already excluded from the SDC object)
- Generate a new random ID to replace case_id. Never reuse the original, as it may link to external records.
# Anonymized quasi-identifier columns
anon_key_vars <- sdcAnon@manipKeyVars
# Non-key, non-excluded variables (passed through unchanged)
safe_vars <- whale[, !(names(whale) %in%
c("case_id", "lat", "long", "fishery_license",
"state", "origin", "county"))]
# Combine
anon_data <- cbind(anon_key_vars, safe_vars)
# Generate synthetic random IDs
set.seed(42)
anon_data$anon_id <- paste0("ID_", sample(100000:999999, nrow(anon_data), replace = FALSE))
# Put the new ID first
anon_data <- anon_data[, c("anon_id", setdiff(names(anon_data), "anon_id"))]
# Preview and save
head(anon_data)
write.csv(anon_data, "whale-entanglement-anon.csv", row.names = FALSE)
Key takeaway: k-anonymity via local suppression is a principled, auditable approach to protecting microdata. The sdcMicro package makes every suppression traceable, and the information-loss metrics help justify trade-offs to data stakeholders.
