At the beginning of March, Oklahoma passed its own privacy law. Before that, in the middle of February, Virginia’s state Senate voted on a data privacy protection bill that was so uncontroversial that it passed unanimously. Great, except that weeks before that, legislators in New York, Washington state, and Utah started debating privacy bills, each with its own specific ideas about what data privacy means.
All of these are operating in the shadow of the influential California Consumer Privacy Act (CCPA) that went into effect on Jan. 1, 2020. The CCPA in turn was created with 2018’s sweeping European General Data Protection Regulation (GDPR) in the background. And now a US federal data privacy law has been proposed that would have uncertain effects on all those laws and the ones to come.
In fact, according to the United Nations Conference on Trade and Development, 128 countries now have data privacy laws.
Regulations follow the customers
This problem is compounded by the fact that your business doesn’t have to have offices in a region to be bound by its data privacy regulations, because most of these regulations apply wherever a company has customers. For example, the CCPA says that once a company passes a minimum size threshold, having a single customer in California brings the CCPA to bear.
It gets yet more complicated: federal acts, such as HIPAA for healthcare data, the Fair Credit Reporting Act, and the Children’s Online Privacy Protection Act have their own privacy requirements. Not to mention that every state in the US has its own separate law covering what to do when there’s a breach of privacy.
Complying with these regulations is mandatory and expensive, requiring the involvement of many functions of a business, including engineering, legal, product design, the user experience team, risk assessment, customer support, and, of course, multiple areas of the IT department.
The challenge is high, but there are strong incentives for companies to get it right: US companies that violate Europe’s GDPR can be assessed fines as high as €20 million or 4% of their annual global revenue, whichever is higher.
CTOs like me cannot wave a wand and bring all the world’s data privacy laws into alignment. Instead, we have to start thinking differently about data.
Data are getting harder
As the regulatory side of data is getting more complicated, the data themselves are getting harder to manage:
The total amount of data in the world has been estimated at 2 zettabytes in 2010 and 59 zettabytes in 2020, and is projected to reach 149 zettabytes in 2024; a zettabyte is a trillion gigabytes. Many of those bytes are being generated and stored by businesses.
Business data live in systems that span many generations, from mainframes to the latest clusters running NoSQL, and more and more in the Cloud.
Or, more properly, in the many clouds, each with its own access rules, protocols, features, drawbacks, and quirks.
The sprawl of data increases with every new system installed, application developed, and new business process moved onto computers. Each of these enlarges the breach surface — places at which breaches could occur.
The sensitivity of data increases as new applications capture new types of data and as adversaries not only have more data fragments they can put together to identify individuals, but also have increasingly powerful tools to set to the task.
Data privacy is becoming a more inclusive term as legislatures around the world generally have been bringing more types and categories under its protective umbrella.
Our global concerns about privacy have been rapidly escalating as we learn, sometimes through painful experience, the ways in which sensitive data can fail to be contained, and the devastating effects of breaches, whether intentional or not.
Our old data management pipelines and systems were not designed for the world we now find ourselves in. As a simple example, most systems were designed before it was so imperative to think through which fields contain personally identifiable information (PII), which fields could be used in conjunction with other fields or other datasets to derive PII, or, of course, how machine learning’s mighty statistical engines might derive PII based on patterns they discover in what seems like perfectly safe data. Data that were once safe may today have become risky, and we have to be prepared to discover that today’s safe data turn out to be keys for unlocking PII tomorrow.
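To make the "fields used in conjunction" risk concrete, here is a minimal sketch that counts how many records share each combination of seemingly harmless fields. The records, field names, and the helper smallest_group are hypothetical illustrations, not part of any particular product. A smallest group of size 1 means at least one person is unique on those fields and could be singled out by anyone holding a matching outside dataset.

```python
from collections import Counter

# Hypothetical records: no names or ID numbers, yet a row can still be unique.
records = [
    {"zip": "94110", "birth_year": 2001, "sex": "F", "procedure": "liver transplant"},
    {"zip": "94110", "birth_year": 2001, "sex": "F", "procedure": "checkup"},
    {"zip": "23219", "birth_year": 1941, "sex": "M", "procedure": "hip replacement"},
]


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values.

    Raises ValueError if rows is empty.
    """
    counts = Counter(tuple(row[field] for field in quasi_identifiers) for row in rows)
    return min(counts.values())


# The third record is the only one with its zip, birth year, and sex.
k = smallest_group(records, ["zip", "birth_year", "sex"])
print(k)
```

Running the same check whenever a new field or a new dataset is added is one way to notice that yesterday's safe combination has become risky.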
Worst of all, perhaps, is that every company has developed its own standards and processes for keeping PII safe, but because those methods often were developed for simpler data in simpler times, every company now faces the new world of privacy demands with old workflows, pipelines, and warehousing strategies.
Since we cannot simplify the world’s Big Book of Privacy Regulations in which each country gets its own unique chapter written in very small print, we can at least design our data management systems to work better in the global regulatory environment that is already complex and becoming more so every week.
Designing data for privacy
Here’s one approach to managing the bramble of privacy regulations: Keep track of every type of data protected by any form of regulation in any country in the world — a long, ever-growing list. Then mask all of that information no matter which countries you’re doing business with or in.
For example, if a 20-year-old files in California for a liver transplant, and an 80-year-old files in Virginia for a hip replacement, to comply with the health data privacy requirements in HIPAA, California’s CCPA, Virginia’s laws, and all of the rest of the constraints just within the United States, you could anonymize the superset of restricted data types.
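As a rough sketch of that superset idea, the snippet below takes the union of the restricted fields across several rule sets and masks every one of them. The field lists are placeholders I made up for illustration; they are not statements of what HIPAA, the CCPA, or Virginia's law actually restrict, and the function name mask_superset is likewise hypothetical.

```python
# Hypothetical rule sets. The field lists are illustrative only and do not
# reflect the actual legal scope of any regulation.
RESTRICTED_FIELDS = {
    "HIPAA": {"name", "birthdate", "zip"},
    "CCPA": {"name", "email"},
    "Virginia": {"name", "email", "precise_location"},
}


def mask_superset(record, rules=RESTRICTED_FIELDS):
    """Mask every field that any rule set restricts, wherever the record came from."""
    restricted = set().union(*rules.values())
    return {
        field: ("***" if field in restricted else value)
        for field, value in record.items()
    }


claim = {
    "name": "Jane Doe",
    "birthdate": "2001-03-15",
    "zip": "94110",
    "procedure": "liver transplant",
}
masked = mask_superset(claim)
print(masked)
```

Only the procedure field survives here, which is exactly the loss of analytic value that a blanket mask brings with it.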
That would lower your risk of non-compliance, but such a blunt-force approach would also reduce the value of your data. For example, if you’ve wiped out zip codes or birthdates as revealing PII, you will be handicapping the analytic tools you or a client might want to bring to bear on the data to discover geographic or age-based correlations. This can be especially damaging to the use of data to train sophisticated machine learning models that can bring real business value and social benefit.
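One commonly used middle path, sketched here under my own assumptions rather than as anything a specific regulation prescribes, is to coarsen such fields instead of wiping them: keep a zip code prefix and an age band so that geographic and age-based analysis remains possible. The helper names, the three-digit prefix, and the ten-year band width are all hypothetical choices, and whether a given level of coarsening satisfies a given law is a question for counsel.

```python
from datetime import date


def generalize_zip(zip_code, keep=3):
    """Keep only the leading digits of a zip code and pad the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def age_band(birthdate, today, width=10):
    """Replace an exact birthdate with an age range such as '10-19'."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    age = today.year - birthdate.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print(generalize_zip("94110"))
print(age_band(date(1941, 1, 1), today=date(2021, 3, 1)))
```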
So this blunt-force approach needs to be part of an approach to data management that is more flexible, agile, and responsive to user needs, while still being fully protective of private data. Doing this means keeping a full-value, unmasked master data collection under lock and key, while quickly and easily generating datasets that have reliably masked any and all data that might violate the world’s privacy regulations.
Think of it as a dynamic library model, with a touch of magic added to it. This library keeps a master copy of all of its books in a highly secure facility. No one gets direct access to them except the library’s trusted administrators. But this library can produce a copy of anything in its master library, exactly according to the user’s wishes and the library’s stringent privacy requirements. If it’s a library of health records maintained by a hospital, the medical staff can instantly see a patient’s record complete with the necessary PII. But researchers analyzing those records get the data they need with none of the PII. A machine learning team looking at the equity of treatments and outcomes would get the data they need, which will likely include fields that are irrelevant to those forecasting hospital supply requirements, while the attached pharmacy gets exactly and only the patient data it needs.
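The library model can be sketched in a few lines, assuming a simple allow-list of fields per role. Everything named here (MASTER, VIEWS, make_copy, the roles, and the sample record) is hypothetical; a real system would add authentication, auditing, and masking rules far richer than dropping fields.

```python
# Hypothetical master collection: full-value records kept under lock and key.
MASTER = {
    "r1": {
        "name": "Jane Doe",
        "birthdate": "2001-03-15",
        "diagnosis": "liver failure",
        "prescription": "tacrolimus",
        "outcome": "recovering",
    },
}

# Hypothetical allow-lists: the fields each kind of reader may receive.
VIEWS = {
    "medical_staff": {"name", "birthdate", "diagnosis", "prescription", "outcome"},
    "researcher": {"diagnosis", "outcome"},
    "pharmacy": {"name", "prescription"},
}


def make_copy(role, master=MASTER, views=VIEWS):
    """Produce a fresh copy of the master data containing only the fields allowed for a role."""
    allowed = views[role]  # an unknown role raises KeyError instead of leaking data
    return [
        {field: value for field, value in record.items() if field in allowed}
        for record in master.values()
    ]


print(make_copy("researcher"))
print(make_copy("pharmacy"))
```

Because every copy is derived from the master on request, no reader ever holds the unmasked original unless their role allows it.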
The bit of magic comes from the dynamism of this architecture. For example, the need to support the Right to Be Forgotten is becoming increasingly widespread; it’s already required by the GDPR and CCPA. If a patient asks the hospital to honor her right to have what it knows about her deleted, and those data sit in fifty places outside of the master data center, the business can burn through resources tracking down and expunging each copy. It can be even harder to be confident that it has found them all. But if the master dataset is centralized, and if the updating of the masked sets is automated and immediate, the organization can be confident that the library of available datasets will also be in compliance. Like magic.
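Here is a minimal sketch of that automation, assuming all masked datasets are derived from one master store and rebuilt whenever it changes. The Library class, its method names, and the sample data are hypothetical. Note the limit of the sketch: it only covers copies the library itself publishes, not files that were exported and saved elsewhere.

```python
class Library:
    """Hypothetical master store that regenerates every masked copy on change."""

    def __init__(self, master, views):
        self.master = dict(master)  # full-value records, keyed by an opaque id
        self.views = views          # role -> set of fields that role may see
        self.published = {}
        self.refresh()

    def refresh(self):
        """Rebuild every derived dataset from the current master."""
        self.published = {
            role: {
                record_id: {f: v for f, v in record.items() if f in allowed}
                for record_id, record in self.master.items()
            }
            for role, allowed in self.views.items()
        }

    def forget(self, record_id):
        """Delete one person's record, then regenerate all derived copies."""
        self.master.pop(record_id, None)
        self.refresh()


library = Library(
    master={
        "r1": {"name": "Jane Doe", "diagnosis": "liver failure"},
        "r2": {"name": "John Roe", "diagnosis": "hip fracture"},
    },
    views={"researcher": {"diagnosis"}, "pharmacy": {"name"}},
)
library.forget("r1")
print(library.published)
```

A single call removes the record from the master and from every dataset built on it, so there is no list of fifty locations to chase.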
This architecture lowers the operational price of compliance, which can be dauntingly steep. Rather than taking days, weeks, or even months, generating and updating privacy-compliant datasets can happen in a reasonable approximation of real time.
Every CTO knows how much this matters to business. But regulators and businesses generally are not on the same time scales. Regulators care first and foremost about compliance. Businesses want to comply but also go fast. That’s especially true for a company’s developers, who want to increase the business’s competitive advantage by innovating. No company wants its best developers using old data simply because it took too long to get the latest.
No business can afford to wait until the world comes to complete agreement about what privacy means and what regulations we need to implement it. Fortunately, modern data management tools and services can provide exactly these sorts of automated workflows while meeting and embracing data privacy requirements.