Data De-identification

Disclosing sensitive information without proper safeguards can lead to significant harm and jeopardize not only the well-being of participants but also the integrity and trustworthiness of the research process.

The richness of the data obtained through interviews presents additional ethical and analytical challenges, particularly regarding the need for more nuanced and intricate analysis, concerning the protection of participant identity, and the mitigation of the risk of re-identification.

Qualitative data can be more difficult to de-identify compared to quantitative data due to its typically unstructured nature. Another challenge is that qualitative data includes unique personal narratives and contextual details that can inadvertently reveal identities, requiring careful and nuanced de-identification strategies to protect participants’ privacy effectively. The process of de-identification or anonymization can also alter or diminish data value, especially when the significance lies in capturing personal experiences or narratives. Researchers should consider using de-identification alongside other methods, such as limiting access to the data or obtaining explicit consent from participants to share some or all of their personal information. Because extensive de-identification might sometimes obscure important details needed for deep analysis, striking a balance between removing identifiers and preserving the essential context of the qualitative data is key.

In a nutshell, data de-identification entails the process of removing direct and indirect identifiers from a dataset, while maintaining enough information to preserve its value and usability to future research. In this episode, we cover common strategies for de-identifying interview data and recommendations for handling sensitive information.

🤔 Should all data derived from human subjects always be de-identified?

Not all qualitative data collected from human participants requires de-identification. First, it depends on the agreement made with the research subject and the nature of the study. Also, oral history research and “on the record” interviews and resulting notes or transcripts can be shared with the respondent’s name, especially for well-known figures who are used to journalistic interviews. However, be aware of local practices: in some regions, interviewees may have the right to review the written record before it is shared or published. Data that is already part of the public record, such as public statements by politicians, does not need to be de-identified.

Direct vs Indirect Identifiers

A person’s identity can be disclosed from direct identifiers, which are unique to an individual, or indirect identifiers which, when linked with other available information, could identify someone.

Direct identifiers are pieces of information that can immediately identify an individual on their own, such as a social security number, full name, email address, home address, or phone number. In contrast, indirect identifiers are data points that do not uniquely identify someone by themselves but can do so when combined with other information. Examples of indirect identifiers include job title, gender, and ethnicity. The risk associated with indirect identifiers can vary depending on the context and the availability of additional data. Therefore, it is crucial to understand which data points are relevant within the context of the study and to consider both direct and indirect identifiers to effectively protect privacy and minimize the risk of identification.

Source: UCSB Library Data Literacy Series (perma.cc/LM2L-K8DN).

Initially, researchers might change names and disguise locations, but effectively managing identifying details in qualitative data requires a nuanced approach. Anonymity and privacy exist on a spectrum, so researchers must balance the risk of identification with the needs of their research.

When data isn’t fully anonymized, including cases of potential indirect identification, obtaining explicit consent is essential before public release. This consent should be secured either at the time of data collection or prior to publication.

One significant challenge arises with indirect identification, particularly in small groups where participants know each other. This issue can be exacerbated by existing power dynamics, such as those between a team and its leader. Therefore, researchers must transparently communicate the risks of identity disclosure to participants and respect their wishes.

To mitigate these risks, techniques such as using multiple pseudonyms, rephrasing quotes, and breaking known links can be effective. These methods help protect participant identities while allowing for restricted sharing of the dataset.

However, there may be situations where these strategies are not feasible. In such cases, researchers need to carefully consider how to obtain consent for disclosing an individual’s identity without inadvertently revealing the identities of others. Ultimately, the outcomes of these processes should be documented in the data management plan and ethics application, ensuring that all considerations are addressed.

💭 Discussion: What could be potential direct and indirect identifiers in the context of Sarah’s research?

Answer: Besides names, social media usernames or handles are considered direct identifiers, while patterns in use of custom tags and symbols, locations, bios and profile descriptions and catchphrases combined might re-identify users, especially in niche activities.

Consider the infamous Baby Reindeer Netflix series lawsuit, in which a simple keyword search on Twitter swiftly uncovered the identity of the real person behind the pseudonym used in the series.

Data De-identification Essentials

The existing literature provides limited guidance on de-identifying qualitative data, with most researchers relying on ad hoc strategies. Perhaps the most robust and up-to-date methodological framework for handling sensitive narrative data is found in Campbell and co-authors (2023) multiphased process inspired by common qualitative coding techniques.

In the first phase, the process involves consultations with a range of stakeholders and subject-matter experts to identify risks related to re-identifiability and concerns about data sharing. The second phase outlines an iterative approach to identifying potentially identifiable information and developing tailored remediation strategies through group review and consensus. The final phase includes multiple methods for evaluating the effectiveness of the de-identification efforts, ensuring that the remediated transcripts adequately protect participants’ privacy. If your project involves working with vulnerable or protected groups, or handling sensitive data, we strongly recommend reviewing and applying this framework to your research.

Qualitative Data De-identification Overview. Source: Campbell et al. (2023)

These three phases can be broken into tasks, actions and examples:

Phase goal	Tasks	Actions	Examples
Phase 1: Develop a process to distinguish potentially identifiable data	Create a coding framework	Consult with stakeholders and look for strategies followed by similar projects	Regulatory guidance (HIPPAA, IRB, relevant professional and research associations)
			Subject-matter experts familiar with the population studied
			Publicly available records that may contain same/similar information as the research-interview transcripts
		Draft a codebook	Scan for named entities (e.g., names, places, dates)
			List potential identifiable topics
			Guidance for evaluating ambiguous information others or public records might have and that combined could jeopardize privacy and confidentiality
Phase 2: Remediate potentially identifiable data	Establish a coding team*	Hire and train coders	Include coders with varying levels of familiarity with the data
	Review transcripts and propose re mediation plans	Highlight each data point and create an audit trail containing proposed edits	Draft blurred text
			Bracket redacted text
			Review proposed remediation plans and discuss as a team
	Implement remediation plans	Edit and redact text	Insert blurred text, remove redacted text, and insert summaries in the event of important long redactions
	Provide support to coding team if there will be repeated exposure to traumatic content* \| Check in with team members regarding their experiences of vicarious trauma \| Provide information and support regarding the emotional impact of repeated exposure to traumatic content and offer supportive resources \|
Phase 3: Assess the validity of the de-identification analyses	Select validity standards	Assess credibility (i.e., confidence in the accuracy of the findings)	Use of prolonged engagement, persistent engagement, and member checks to assess accuracy of the findings
		Assess dependability (i.e., the findings are consistent and could be repeated)	Use of codebooks, memos, and audit trails to document analyses
		Assess transferability (i.e., the findings have applicability in other con texts)	Provide audiences with sufficient detail about the project so they can assess whether conclusions are transferable to other settings
		Assess confirmability (i.e., the findings reflect the participants’ views, not the researchers’ biases)	Consider how researchers’ positionalities affected the processes and findings and when necessary, recenter the participants’ views \|

*Not applicable to small-scale studies. Source: Adapted from: Campbell et al. (2023).

Now, let’s focus on some practical ways to remediate potentially identifiable data in interview transcripts. General recommendations include:

Establish uniform de-identification rules at the start of your project and apply them consistently throughout, especially when working with a team.
Thoroughly review data to pinpoint any details that could lead to individual identification.
Assign a unique identifier to each participant to replace their name.
Replace real names (people, companies) with pseudonyms.
Generalize location details and dates (e.g., North Carolina → [Southeast], 1977 → [1975-1980])
Generalize meaning of detailed variables (e.g., specific professional position → occupation or area of expertise)
Remove or redact sensitive text or entire sections as needed.
Avoid blanking out or replacing items without any indication. The use of brackets indicates that something has been changed, modified, or deleted from its original form.
Maintain a master log of all replacements, aggregations, or removals made and keep it in a secure location separate from the de-identified data files.

In a single excerpt, you can integrate multiple de-identification techniques. The example below illustrates how an excerpt has been de-identified, with the modifications clearly indicated, while still retaining essential information for future analysis.

🏋️‍♀️ Exercise: Handling Identifiable Information

How many potential identifiers can you spot? How would you mitigate re-identification risk?

Identifiable Excerpt: I collaborated with EcoFashion Co., a company based in North Carolina, to launch a sustainable clothing line bearing my name, Chole Adams. Given that this line was closely associated with my personal brand, I insisted on strict terms for fair compensation and transparency in sourcing. Unfortunately, the CEO at the time, Mitchel, did not adhere to these commitments. It was later exposed in a Netflix documentary named Harmful Fashion, that some of the items were produced in Cambodia under very unfair labor practices, a fact Mitchel had failed to disclose to me.

Answer

There are seven potential identifiers that if combined with public information could potentially give away the interviewee’s identity.

De-identified Excerpt: I collaborated with [Company A] based in [the Southeastern U.S.] to launch a sustainable clothing line bearing my name [name redacted]. Given that this line was closely associated with my personal brand, I insisted on strict terms for fair compensation and transparency in sourcing. Unfortunately, the CEO at the time, [name redacted], did not adhere to these commitments. It was later exposed in a [Streaming Service name redacted] documentary named [Documentary Title redacted], that some items were produced in [Southeast Asia] under very unfair labor practices, which the CEO had failed to disclose to me.

This handout provides a compilation with some helpful tips:

Source: UCSB Library Data Literacy Series (perma.cc/LM2L-K8DN).

Having promised to keep participants’ identities confidential, Sarah has assigned pseudonyms to interviewees. However, she is uncertain how to handle indirect identifiers. We will help her to address these issues confidently.

🏋️‍♀Exercise: Helping Sarah De-identifying Transcripts

Let’s open the files for interviewee_01 and interviewee_03 located in sara-project/data/transcripts you might have already unzipped from project example folder. To make things easier, we have highlighted some potential sensitive pieces of information. What strategies can we implement to minimize the risk of interviewees’ identification?

The deid-transcripts folder contains a possible solution to this exercise. Please note that the names listed at the top of the document are pseudonyms assigned to the interviewees. After all, we want to ensure the privacy and confidentiality of participants by avoiding the use of real names, right?

Adopting best practices for de-identifying responses from your human participants can significantly enhance both the ease and reliability of this process. It’s crucial to incorporate de-identification considerations early in your planning phase as part of your overall data management strategy. Decisions on which data to collect, what to exclude, and how to inform participants about de-identification will profoundly impact how you can use and share the data in the future as we will explore in future episodes.

Automating the De-id Process

You might be wondering how to effectively perform data de-identification for qualitative data. A open source and free tool to support this processes is QualiAnon. Specifically designed for qualitative interview data, QualiAnon was developed by the Qualiservice Research Data Center from University of Bremen, Germany, to support the anonymization of text data. It is particularly useful for anonymizing interview transcripts in the qualitative social sciences, ensuring that data can be safely archived without compromising participant confidentiality.

For information on installation, user manual, and more visit: https://github.com/pangaea-data-publisher/qualianon/releases

Recommended/Cited Sources:

Campbell R, Javorka M, Engleton J, Fishwick K, Gregory K, Goodman-Williams R. Open-Science Guidance for Qualitative Research: An Empirically Validated Approach for De-Identifying Sensitive Narrative Data. Advances in Methods and Practices in Psychological Science. 2023;6(4). doi:10.1177/25152459231205832

Myers CA, Long SE, Polasek FO. Protecting participant privacy while maintaining content and context: Challenges in qualitative data De-identification and sharing. ProcAssoc Inf Sci Technol. 2020;57:e415. https://doi.org/10.1002/pra2.415