Public, private and government organizations are now, more than ever, concerned about how to implement data privacy, given that many countries have begun regulating by creating national standards and laws to protect sensitive information from being disclosed.
What might not be well appreciated, however, is that companies can more easily comply with privacy laws by using data that is truly anonymized. Under the European Union's General Data Protection Regulation (GDPR), for example, once individuals are no longer identifiable, the data no longer falls under the scope of the regulation.
Specifically, data anonymization is the process of protecting private or sensitive information by erasing, encrypting, or masking identifiers that connect an individual to stored data. Examples of personally identifiable data include names, social security numbers, and mobile phone numbers.
As long as the anonymized data sets do not relate to an identified or identifiable natural person, they can be published or shared with any party without legal obligations, and without the need for user consent.
Techniques to anonymize data
Fortunately, database providers are also beginning to appreciate the importance of anonymizing data and are providing tools to do so. For example, the open-source database PostgreSQL now has an extension available to implement anonymization, while EDB, one of the largest contributors to the Postgres open source project, has offered native data redaction capabilities in Postgres Advanced Server 11 since 2018.
However, anonymization is not something to be taken lightly. Care must be taken over how it is done, and the right approach will depend on how you plan to use the data.
For example, you need to decide whether the anonymization will be static or dynamic. Static means that the data is changed permanently on the database (or more usually, a copy of it). Dynamic means that the change is applied to the results of the query, and not the entire data set.
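To make the distinction concrete, here is a minimal Python sketch. It is not how any particular database implements redaction; the in-memory `RECORDS` list, the field names and the `mask_record` helper are all hypothetical, chosen only to illustrate the two modes.

```python
# Hypothetical in-memory "table"; names and values are made up for illustration.
RECORDS = [
    {"name": "Bob", "phone": "555-0101", "age": 28},
    {"name": "Alice", "phone": "555-0102", "age": 34},
]


def mask_record(record):
    """Return a copy of the record with its direct identifiers masked."""
    masked = dict(record)
    masked["name"] = "REDACTED"
    masked["phone"] = "XXX-XXXX"
    return masked


def static_anonymize(records):
    """Static: produce a permanently anonymized copy of the whole data set."""
    return [mask_record(r) for r in records]


def dynamic_query(records, min_age):
    """Dynamic: the stored data stays intact; only query results are masked."""
    return [mask_record(r) for r in records if r["age"] >= min_age]


# Static: this copy can be handed over; the originals are not in it.
anonymized_copy = static_anonymize(RECORDS)

# Dynamic: the caller sees masked rows, but RECORDS itself is unchanged.
results = dynamic_query(RECORDS, min_age=30)
```

In the static case the anonymized copy is what gets shared or stored, whereas in the dynamic case the original values remain in the source and must still be protected there.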
Most industries use static anonymization because it is a ‘once-and-done’ technique, with the added benefit that once anonymized, it doesn’t matter what happens to the data, even if it is stolen. Dynamic anonymization, by contrast, is a less mature technology for the moment, and there are very few customer stories that attest to its success.
Another consideration is how you anonymize the data. There are several different techniques available, each with its own benefits:
- Attribute or record suppression means deleting the attribute or record directly from the data set. There is no risk of re-identification from the deleted values, but there is permanent data loss.
- Pseudonymization is the use of fake or pseudo identifiers. Pseudo identifiers are created with a one-to-one mapping to the original identifiers, which means the pseudo data can be “translated” back to the original.
- Generalization makes the data more generic by grouping values into broad ranges. For example, although Bob is 28 years old, his age is recorded as between 20 and 30 years. However, the broader the generalization, the lower the utility of the data.
- Synthetic data uses completely artificial data to replace the original. It is suitable for testing purposes, and there is no risk of re-identification. However, large datasets may require high computing resources, so cost may become a factor.
- Data perturbation is when the data is modified by adding random noise. It is mostly suitable for numeric values.
- Data swapping is when values within a data set are rearranged between records, essentially a reshuffling of the data. However, it may create unusual conditions (e.g., if the genders of male and female patients are swapped in a medical database).
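Several of the techniques listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production implementation: the `patients` records, the helper names and the noise scale are all hypothetical, and synthetic data generation is left out because it usually relies on dedicated tooling.

```python
import random

# Hypothetical sample records; every value here is made up.
patients = [
    {"name": "Bob", "age": 28, "weight_kg": 80.0, "gender": "M"},
    {"name": "Alice", "age": 34, "weight_kg": 62.5, "gender": "F"},
    {"name": "Carol", "age": 51, "weight_kg": 70.2, "gender": "F"},
]


def drop_attribute(records, column):
    """Delete an attribute outright: the value is permanently lost."""
    return [{k: v for k, v in r.items() if k != column} for r in records]


def pseudonymize(value, forward, reverse):
    """Replace an identifier with a pseudo identifier (one-to-one mapping).

    `reverse` is the lookup that can translate pseudo data back, so it
    must be kept as secure as the original identifiers.
    """
    if value not in forward:
        pseudo = f"ID-{len(forward) + 1:04d}"
        forward[value] = pseudo
        reverse[pseudo] = value
    return forward[value]


def generalize_age(age, width=10):
    """Replace an exact age with a broad range."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def perturb(value, scale, rng):
    """Add uniform random noise in [-scale, +scale] to a numeric value."""
    return value + rng.uniform(-scale, scale)


def swap_column(records, column, rng):
    """Reshuffle one column's values across records."""
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
forward, reverse = {}, {}

without_gender = drop_attribute(patients, "gender")
bob_pseudo = pseudonymize("Bob", forward, reverse)
bob_range = generalize_age(28)  # "20-30"
noisy_weight = perturb(80.0, 2.0, rng)
swapped = swap_column(patients, "age", rng)
```

Note that the pseudonymization helper keeps a reverse lookup, which is exactly what makes the technique reversible. Whoever holds that mapping can re-identify the data.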
Security and reputational benefits vs impact on personalization
Anonymization is a key tool in trying to protect data privacy. However, it may also create issues down the line, especially if you intend to use the data to deliver a personalized experience to your visitors: the use of anonymized data may hamper the effectiveness of your marketing efforts.
Nevertheless, by now we should appreciate the risk of personal data being targeted for theft. If organizations do not double down on improving data privacy, digital identities of individuals will inevitably become compromised.
When this happens, the consequences are serious both for the individuals whose identities are stolen and for the organizations suffering the breach, including loss of customer trust, negative brand exposure, and potential litigation due to non-compliance with data privacy regulations.
Shilpa Oswal works for an R&D organization under the Ministry of Electronics and Information Technology in the Government of India, and has had experience implementing anonymizing procedures for e-government systems in the country.
Parts of this editorial have been adapted from a talk given at an event organised by EDB (formerly EnterpriseDB), an organization that provides software and services based on the open-source database PostgreSQL.