K-Anonymity with sdcMicro: Whale Entanglement Data

Background

This dataset (whale-entanglement.csv) contains documented whale entanglement incidents on the U.S. West Coast. Each row is one event, with variables for whale species, location, gear type, and fishery involved. (Adapted for instructional use only.)

Goals:

Assess re-identification risk in the raw data
Apply Statistical Disclosure Control (SDC) to reduce that risk
Quantify the resulting information loss

Step 1: Load Package & Data

Inspect Data

Each row is one entanglement event.
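
A minimal sketch of this step, assuming the CSV sits in the working directory and the data frame is named whale (the name the later steps use). The fallback rows and their values are hypothetical and exist only so the snippet runs without the real file.

```r
# Load sdcMicro if it is installed (otherwise: install.packages("sdcMicro"))
if (requireNamespace("sdcMicro", quietly = TRUE)) library(sdcMicro)

# Assumption: the CSV is in the working directory; adjust the path if not
csv_path <- "whale-entanglement.csv"

if (file.exists(csv_path)) {
  whale <- read.csv(csv_path, stringsAsFactors = FALSE)
} else {
  # Hypothetical stand-in rows so the sketch runs without the real file
  whale <- data.frame(
    case_id = c("A1", "A2", "A3"),
    year    = c(2015, 2016, 2016),
    state   = c("CA", "OR", "CA"),
    stringsAsFactors = FALSE
  )
}

# Inspect: dimensions, column types, first rows
dim(whale)
str(whale)
head(whale)
```

str() is the quickest way to spot columns that were read as character and will need converting in Step 3.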

Step 2: Identify Disclosure Risks

Direct Identifiers

Direct identifiers uniquely identify a record on their own.

Q1 — How many direct identifiers are present, and what are they?

Answer: Your answer here.

Variable	Why it directly identifies
`___`	___
`___`	___
`___`	___

These must be excluded from any analytical release.

Quasi-Identifiers

Quasi-identifiers (QIDs) don’t identify records alone, but can be combined with each other — or with external data — to re-identify individuals.
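
To make the linkage idea concrete, here is a small base R sketch. Both tables, their column names, and the vessel labels are hypothetical; they only illustrate how a join on shared attributes can single out a record.

```r
# Hypothetical released records: no names, only quasi-identifiers
released <- data.frame(
  county = c("Monterey", "Monterey", "Humboldt", "Humboldt"),
  gear   = c("pot", "pot", "net", "pot"),
  year   = c(2016, 2016, 2016, 2017),
  stringsAsFactors = FALSE
)

# Hypothetical external source that carries a name plus the same attributes
external <- data.frame(
  vessel = c("Vessel A", "Vessel B"),
  county = c("Humboldt", "Monterey"),
  gear   = c("net", "pot"),
  year   = c(2016, 2016),
  stringsAsFactors = FALSE
)

# Join on the shared quasi-identifiers
linked <- merge(released, external, by = c("county", "gear", "year"))

# Count how many released rows each vessel matched
matches <- table(linked$vessel)
print(matches)

# A vessel that matches exactly one released row is re-identified
reidentified <- names(matches)[matches == 1]
print(reidentified)
```

In this toy case "Vessel B" matches two released rows and stays ambiguous, while "Vessel A" matches exactly one. No single column gives that away; the combination does.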

Q2 — Which variables are quasi-identifiers?

Answer: Your answer here. List the variables and explain why each is a QID.

Step 3: Prepare Variable Types

Q3 — What types do the variables need to be?

sdcMicro requires categorical QIDs to be factor type and continuous variables to be numeric. Misspecified types produce incorrect risk estimates.

💡 Tip — varToFactor() and varToNumeric(): These sdcMicro helper functions convert columns in a data frame to factor or numeric. Both take obj (the data frame) and var (a character vector of column names). Work on a copy of your data frame so the original stays intact for later comparison.

# Work on a copy so the raw 'whale' object stays intact for comparison
file <- whale

# TODO: Convert categorical variables to factor using varToFactor()
# Hint: include type, county, state, condition, origin, gear, fine, infraction_type
file <- varToFactor(
  obj = file,
  var = c(___)   # <-- fill in the variable names
)

# TODO: Convert continuous variables to numeric using varToNumeric()
# Hint: year and month should be numeric
file <- varToNumeric(
  obj = file,
  var = c(___)   # <-- fill in the variable names
)
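
Before moving on, it can help to confirm that every column ended up with the intended type. The sketch below uses base R on a hypothetical three-column data frame; as.factor() and as.numeric() stand in for the sdcMicro helpers, and the column names are illustrative only.

```r
# Hypothetical stand-in for the lab data
toy <- data.frame(
  state = c("CA", "OR", "CA"),
  gear  = c("pot", "net", "pot"),
  year  = c("2015", "2016", "2016"),   # numbers that were read in as text
  stringsAsFactors = FALSE
)

cat_vars <- c("state", "gear")
num_vars <- c("year")

# Convert each group of columns
toy[cat_vars] <- lapply(toy[cat_vars], as.factor)
toy[num_vars] <- lapply(toy[num_vars], as.numeric)

# One class per column; anything still "character" was missed
col_classes <- sapply(toy, class)
print(col_classes)
```

The same sapply(file, class) check works on the real data frame once the TODOs above are filled in.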

Step 4: Measure Initial Risk

4.1 Create the SDC Object

Q4.1 — What is the re-identification risk for this dataset?

We create an sdcMicroObj that encodes our design decisions: which variables are QIDs, which direct identifiers to exclude, and any weights or strata.

💡 Tip — createSdcObj(): The key arguments are:

dat — your data frame (use file, the type-converted copy from Step 3, so the factor and numeric conversions carry into the risk estimates)

keyVars — character vector of quasi-identifier column names

excludeVars — character vector of direct identifier column names to drop

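The remaining optional arguments (weightVar, hhId, strataVar, pramVars) can be NULL for this exercise; leave seed, randomizeRecords, and alpha at the values shown in the code below.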

# TODO: Create the sdcMicro object. Choose your QIDs and direct identifiers.
sdcInitial <- createSdcObj(
  dat         = ___,                        # <-- which data frame?
  keyVars     = c(___),                     # <-- your QIDs
  excludeVars = c(___),                     # <-- direct identifiers to exclude
  weightVar   = NULL,
  hhId        = NULL,
  strataVar   = NULL,
  pramVars    = NULL,
  seed        = 0,
  randomizeRecords = FALSE,
  alpha       = c(1)
)

Then, compute the current disclosure risk:

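💡 Tip: The global re-identification risk is stored in sdcInitial@risk$global$risk. It ranges from 0 to 1 and can be read as the expected share of records that could be re-identified, so values closer to 1 mean higher risk.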

# TODO: Print the global re-identification risk
print(___)
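
One way to build intuition for this number, assuming unweighted data: give each record a risk of 1/f, where f is the number of records sharing its QID combination, then average over all records. This is a simplification for illustration on hypothetical data; sdcMicro's own estimator may differ, especially when sampling weights are involved.

```r
# Hypothetical records with two quasi-identifiers
toy <- data.frame(
  state  = c("CA", "CA", "CA", "OR", "OR", "WA"),
  origin = c("commercial", "commercial", "commercial",
             "commercial", "recreational", "tribal"),
  stringsAsFactors = FALSE
)

# f = number of records that share each record's QID combination
f <- ave(seq_len(nrow(toy)), toy$state, toy$origin, FUN = length)

# Simplified individual risk: 1 / f (a unique record has risk 1)
indiv_risk <- 1 / f

# Simplified global risk: average individual risk
global_risk <- mean(indiv_risk)

print(data.frame(toy, f = f, risk = round(indiv_risk, 2)))
print(global_risk)
```

Three of the six toy records are unique on (state, origin), which is what pushes the average up.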

4.2 Assess k-Anonymity Violations

Q4.2 — To what extent does this dataset violate k-anonymity?

A dataset satisfies k-anonymity if every QID combination appears in at least k records. Records in smaller groups are “at risk.”

💡 Tip — print() on an sdcMicroObj: Calling print(sdcInitial) produces a full risk report, including the number of records violating 2-, 3-, and 5-anonymity. Look for the “Frequency of key” and “Number of observations violating k-anonymity” sections.

# TODO: Print the full risk report for sdcInitial
___
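
The definition of k-anonymity can also be checked by hand. The helper below, count_violations(), is a hypothetical name introduced for this sketch and not part of sdcMicro; it counts records whose QID combination occurs fewer than k times in a toy data frame.

```r
# Hypothetical helper: number of records in QID groups smaller than k
count_violations <- function(df, qids, k) {
  # Build one key string per record from its QID values
  key <- do.call(paste, c(df[qids], sep = "|"))
  # Look up how often each record's key occurs
  f <- as.vector(table(key)[key])
  sum(f < k)
}

# Hypothetical data: group sizes are 3, 2, 1, and 1
toy <- data.frame(
  county = c("A", "A", "A", "B", "B", "C", "C"),
  gear   = c("pot", "pot", "pot", "net", "net", "pot", "net"),
  stringsAsFactors = FALSE
)
qids <- c("county", "gear")

violations <- sapply(c(2, 3, 5), function(k) count_violations(toy, qids, k))
names(violations) <- c("k=2", "k=3", "k=5")
print(violations)
```

The count can only grow as k increases, because a record that violates 2-anonymity also violates 3- and 5-anonymity.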

Our target: achieve at least 3-anonymity (k = 3).

Step 5: Apply Anonymization

5.1 Non-Perturbative Method Testing — Recode `origin`

Q5.1 — Apply one non-perturbative method. How effective was it?

Reminder:

Non-perturbative methods generalize or suppress values without adding noise.

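Perturbative methods alter the values themselves: PRAM randomly swaps categories according to set probabilities, microaggregation replaces values with group averages, and noise addition shifts numeric values by a random amount.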

We can recode origin by merging minority categories into a single broader category. This reduces the distinctiveness of those records.

💡 Tip — groupAndRename(): Collapses one or more levels of a factor QID into a new label. Arguments:

obj — your sdcMicroObj

var — the variable name (as a string)

before — character vector of original levels to merge

after — the new label, given once as a length-one character vector; it is applied to every level listed in before

Always check table(sdcInitial@manipKeyVars$origin) before and after to confirm the recode worked.

# Before: check current distribution
cat("Distribution of 'origin' before recoding:\n")
print(table(sdcInitial@manipKeyVars$origin))

# TODO: Recode origin — merge "recreational" and "tribal" into "non-commercial"
sdcInitial <- groupAndRename(
  obj    = sdcInitial,
  var    = "___",          # <-- variable name
  before = c(___, ___),    # <-- levels to merge
  after  = c("___")        # <-- new label
)

# After: check new distribution
cat("\nDistribution of 'origin' after recoding:\n")
print(table(sdcInitial@manipKeyVars$origin))

# Recompute risk
cat("\nRisk summary after recoding 'origin':\n")
print(sdcInitial)
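
What the recode does to the factor can be seen in base R alone. This sketch works on a hypothetical origin vector and uses plain level reassignment rather than groupAndRename(); it only illustrates the effect on category sizes.

```r
# Hypothetical 'origin' values with two sparse categories
origin <- factor(c("commercial", "commercial", "commercial",
                   "recreational", "tribal", "commercial"))
print(table(origin))

# Merge the two sparse levels into one broader label
merged <- origin
levels(merged)[levels(merged) %in% c("recreational", "tribal")] <- "non-commercial"
print(table(merged))

# Smallest category size before and after the merge
print(c(before = min(table(origin)), after = min(table(merged))))
```

The smallest category grows from one record to two. Whether that removes k-anonymity violations in practice depends on how the merged records differ on the other QIDs.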

Q5.1 answer — Did the recoding reduce k-anonymity violations? Why or why not?

Your answer here.

5.2 Automated k-Anonymization — `kAnon()`

Q5.2 — Apply k = 3 anonymization.

kAnon() uses local suppression: it sets selected QID values to NA until every combination appears at least k times. The search is heuristic: it aims to suppress as few values as possible but does not guarantee the true minimum.

💡 Tip — kAnon(): Takes your sdcMicroObj and k (an integer vector, e.g., c(3)). It returns a new sdcMicroObj with suppressed values applied. After running it, call print() again to see the updated violation counts — they should drop to zero for your chosen k.

# TODO: Apply k = 3 anonymization and store the result in sdcAnon
sdcAnon <- kAnon(___, k = c(___))

cat("Risk summary after k = 3 anonymization:\n")
print(sdcAnon)
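
To see the idea behind local suppression, here is a deliberately naive greedy pass in base R. It is not sdcMicro's algorithm. The functions group_size() and suppress_naive() are hypothetical names, the data are made up, and the sketch assumes that a suppressed value (NA) counts as matching any category when group sizes are tallied.

```r
# Group size per record, treating NA as a wildcard that matches any value
group_size <- function(df, qids) {
  sapply(seq_len(nrow(df)), function(i) {
    match_all <- rep(TRUE, nrow(df))
    for (q in qids) {
      same <- is.na(df[[q]]) | is.na(df[[q]][i]) | df[[q]] == df[[q]][i]
      match_all <- match_all & same
    }
    sum(match_all)
  })
}

# For each violating record, suppress the single QID that helps most
suppress_naive <- function(df, qids, k) {
  for (i in which(group_size(df, qids) < k)) {
    if (group_size(df, qids)[i] >= k) next  # already fixed by an earlier step
    best_q <- NULL
    best_f <- 0
    for (q in qids) {
      trial <- df
      trial[i, q] <- NA
      f_i <- group_size(trial, qids)[i]
      if (f_i > best_f) {
        best_f <- f_i
        best_q <- q
      }
    }
    df[i, best_q] <- NA
  }
  df
}

# Hypothetical data: rows 4 and 8 are unique on (state, gear)
toy <- data.frame(
  state = c("CA", "CA", "CA", "CA", "OR", "OR", "OR", "WA"),
  gear  = c("pot", "pot", "pot", "net", "pot", "pot", "pot", "pot"),
  stringsAsFactors = FALSE
)
qids <- c("state", "gear")

anon <- suppress_naive(toy, qids, k = 3)
print(anon)
print(group_size(anon, qids))
```

On this toy input the pass suppresses gear in row 4 and state in row 8, two cells in total, after which every record sits in a group of at least three. A single greedy pass like this can fail on harder inputs, where suppressing one value per record is not enough.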

Q5.2 answer — How many records still violate 3-anonymity after running `kAnon()`? Your answer here.

Step 6: Measure Information Loss

Q6 — How much information did we lose?

Anonymization is a trade-off: privacy protection at the cost of analytical utility. Quantifying this trade-off is essential for responsible data sharing.

💡 Tip — print(sdcAnon, "ls"): Passing "ls" as a second argument to print() displays the local suppression summary: how many values were suppressed in each QID column and as an overall percentage. Higher suppression = more information loss.

# TODO: Print the information-loss summary
print(___, "ls")
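
The same kind of summary can be computed by hand, which is useful as a cross-check. This base R sketch compares a hypothetical original table with a hypothetical suppressed copy and counts the cells that became NA; it does not reproduce sdcMicro's exact output format.

```r
# Hypothetical original QID columns
orig <- data.frame(
  state  = c("CA", "CA", "OR", "WA", "CA"),
  gear   = c("pot", "net", "pot", "pot", "line"),
  county = c("X", "X", "Y", "Z", "X"),
  stringsAsFactors = FALSE
)
qids <- c("state", "gear", "county")

# Hypothetical anonymized copy with three suppressed cells
anon <- orig
anon$gear[c(2, 5)] <- NA
anon$state[4] <- NA

# Cells that are NA now but were not NA before
suppressed <- colSums(is.na(anon[qids]) & !is.na(orig[qids]))

# Per-column and overall suppression rates, in percent
pct_by_var  <- 100 * suppressed / nrow(orig)
overall_pct <- 100 * sum(suppressed) / (nrow(orig) * length(qids))

print(rbind(count = suppressed, percent = pct_by_var))
print(overall_pct)
```

A column that absorbs most of the suppression usually has many rare categories, which makes it a candidate for recoding before running local suppression.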

Q6 answer — Which variable had the most suppression, and what does that tell you about the data? Your answer here.

Step 7: Export the Anonymized Dataset

Before sharing, we must:

Remove direct identifiers (already excluded from the SDC object)
Generate a new random ID to replace case_id — never reuse the original, as it may link to external records

# Anonymized quasi-identifier columns
anon_key_vars <- sdcAnon@manipKeyVars

# Non-key, non-excluded variables (passed through unchanged)
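# Drop direct identifiers and the raw QIDs; the QIDs return in anonymized
# form via anon_key_vars, so take their names from there rather than retyping
direct_ids <- c("case_id", "lat", "long", "fishery_license")
safe_vars <- whale[, !(names(whale) %in% c(direct_ids, names(anon_key_vars))),
                   drop = FALSE]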

# Combine
anon_data <- cbind(anon_key_vars, safe_vars)

# Generate synthetic random IDs
set.seed(42)
anon_data$anon_id <- paste0("ID_", sample(100000:999999, nrow(anon_data), replace = FALSE))

# Put the new ID first
anon_data <- anon_data[, c("anon_id", setdiff(names(anon_data), "anon_id"))]

# Preview and save
head(anon_data)
write.csv(anon_data, "whale-entanglement-anon.csv", row.names = FALSE)
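
A short pre-release check can catch the most common export mistakes. The sketch below runs on a hypothetical three-row export table. The list of direct identifier names is an assumption read off the exclusion list in the export code, so adjust it to match your own answer to Q1.

```r
# Hypothetical export table standing in for anon_data
anon_data <- data.frame(
  anon_id = c("ID_483920", "ID_120553", "ID_907114"),
  state   = c("CA", NA, "OR"),
  year    = c(2015, 2016, 2016),
  stringsAsFactors = FALSE
)

# Assumed direct identifier names; adjust to your own Q1 answer
direct_ids <- c("case_id", "lat", "long", "fishery_license")

# Check 1: no direct identifier column survived the export
leaked <- intersect(direct_ids, names(anon_data))

# Check 2: every replacement ID is unique
ids_unique <- !anyDuplicated(anon_data$anon_id)

# Stop with an error if either check fails
stopifnot(length(leaked) == 0, ids_unique)
```

Running the two checks against the real anon_data just before write.csv() makes the release step fail loudly instead of silently shipping an identifier.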

Key takeaway: k-anonymity via local suppression is a principled, auditable approach to protecting microdata. The sdcMicro package makes every suppression traceable, and the information-loss metrics help justify trade-offs to data stakeholders.