Data utility can be preserved while enhancing data privacy

Organizations need to strike a difficult balance when launching strategic AI initiatives: data needs to be accessible without compromising compliance with privacy regulations or the speed of business innovation. Customer trust and brand reputation are key competitive advantages, so accelerated digital transformation and growth rely on businesses being smart about protecting sensitive customer data while still preserving data utility for AI and analytics teams.

Three questions organizations need to confront when it comes to leveraging customer data are:

When organizations do not have ready answers to the questions above, Artificial Intelligence projects often stall and collaboration using meaningful data is limited. Gartner predicts that by 2024, the use of data protection techniques will increase industry collaborations on AI projects by 70%.

IBM AutoPrivacy framework and the key use cases delivered via IBM Cloud Pak® for Data. Today I will expand on the advanced data protection use case, which is one of the key capabilities in the AutoPrivacy framework.

Data protection and de-identification of sensitive data are not new concepts. Although they have been well known for many years, most enterprises did not apply them consistently. The enforcement of GDPR has drastically changed that, and in the post-GDPR era enterprises are hyperaware of the data protection regulations they must adhere to. With the enforcement of GDPR (Europe), CCPA (California), LGPD (Brazil) and many other data protection laws in recent months, consumers are now well aware of their privacy rights and are demanding that enterprises provide transparent privacy protection approaches.

Historically, enterprises have used many methods of sensitive data protection, including redaction and various forms of masking such as substitution, shuffling or randomization. However, with the use of deep learning neural networks in AI, data science and analytical modeling, the risk of re-identification has been increasing. Hence, there is a need for newer data protection techniques and robust encryption algorithms that can enhance privacy while preserving the utility of the data.

By far, the most important requirement from IBM customers has been the consistent enforcement of data protection policies, regardless of where the data resides.

Data cannot simply be de-identified randomly; important relationships must be maintained. Format preservation is a fundamental requirement. Values must be de-identified consistently across the enterprise, respecting relationships across multiple data assets. For example, de-identification of a credit card number, personal first and last names, or any other entity identifier must be consistently repeatable across data sources in on-premises and hybrid cloud environments.
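To make the idea of consistent, format-preserving de-identification concrete, here is a minimal Python sketch. It is an illustration only, not the method used in any IBM product: the function name `pseudonymize_digits` and the demo key are hypothetical, and the keyed-hash digit substitution shown here is a simplification rather than a standardized format-preserving encryption scheme. It is one-way and can map different inputs to the same output.

```python
import hashlib
import hmac
import string


def pseudonymize_digits(value: str, key: bytes) -> str:
    """Replace every ASCII digit with a keyed pseudo-random digit.

    Separators (spaces, dashes) stay where they are, so the output has the
    same layout as the input. The same digits and key always produce the
    same masked digits, regardless of which separators were used.
    """
    digits = "".join(c for c in value if c in string.digits)
    # Keyed hash over the digits only, so formatting differences do not matter.
    mac = hmac.new(key, digits.encode("ascii"), hashlib.sha256).digest()

    out = []
    i = 0
    for c in value:
        if c in string.digits:
            # Map one hash byte to one digit (slightly biased; fine for a sketch).
            out.append(str(mac[i % len(mac)] % 10))
            i += 1
        else:
            out.append(c)
    return "".join(out)


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # hypothetical key for illustration
    print(pseudonymize_digits("4111-1111-1111-1111", key))
    print(pseudonymize_digits("4111 1111 1111 1111", key))
```

Because the masked digits depend only on the original digits and the key, two systems that share the key would mask the same card number identically, which is what keeps joins across data sources intact.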

In addition, I have often encountered unique industry use cases where certain data elements need special treatment. For example, in financial services and healthcare, the time intervals between certain dates should be the same whether unmasked or masked. The accuracy of disease treatment dates in healthcare is critical for biomedical research, so when shifting dates, it is very important to maintain the right intervals. Similarly, the interval between a date of birth and the date of an auto policy agreement (in other words, the customer's age) may make a very big difference in the cost and available features of auto insurance.
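One common way to keep intervals intact is to shift every date belonging to the same person by the same secret offset. The sketch below illustrates that idea under stated assumptions: the helper names, the 365-day bound and the use of a keyed hash to derive the offset are illustrative choices, not a description of how any particular product implements date shifting.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_days_for(entity_id: str, key: bytes, max_shift: int = 365) -> int:
    """Derive a stable offset in [-max_shift, max_shift] days for one entity."""
    mac = hmac.new(key, entity_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(mac[:4], "big") % (2 * max_shift + 1) - max_shift


def shift_date(d: date, entity_id: str, key: bytes) -> date:
    """Shift a date by the entity's offset; all of its dates move together."""
    return d + timedelta(days=shift_days_for(entity_id, key))


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # hypothetical key for illustration
    birth = date(1985, 3, 14)
    policy_start = date(2021, 6, 1)

    masked_birth = shift_date(birth, "customer-42", key)
    masked_policy = shift_date(policy_start, "customer-42", key)

    # The interval (and therefore the derived age) is unchanged by masking.
    print(policy_start - birth == masked_policy - masked_birth)
```

Since both dates for a customer move by the same number of days, the age at policy start computed from masked data matches the age computed from the original data.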

Most customers require support for custom de-identification when it comes to complex, multi-field computation using a low-code or no-code approach. There are also several use cases that require the addition of statistical noise to hide individual data and surface only group-level information for analytics.
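As an illustration of adding statistical noise to group-level results, the sketch below perturbs per-group counts with Laplace noise, a mechanism borrowed from the differential privacy literature. The post does not say which noise distribution is used in practice, so the Laplace choice, the `epsilon` parameter and the function names are assumptions made for this example.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_group_counts(
    groups: Iterable[str],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Count records per group and add Laplace noise to each count.

    One individual changes a count by at most 1, so the noise scale is
    1 / epsilon. A smaller epsilon means more noise and more privacy.
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    counts = Counter(groups)
    return {g: c + laplace_noise(scale, rng) for g, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical records: one region label per customer.
    regions = ["north"] * 120 + ["south"] * 95 + ["east"] * 40
    print(noisy_group_counts(regions, epsilon=0.5))
```

Analysts see only the perturbed totals per group, so a single customer's presence or absence is hard to infer, while the overall shape of the distribution remains usable.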

These rich data protection and consistent policy enforcement capabilities are available via IBM Watson® Knowledge Catalog Enterprise Edition to address a wide range of use cases.

The future is bright as the latest privacy enhancing technologies such as differential privacy, synthetic data fabrication and more are brought into the solution. These technologies, paired with the power of IBM Cloud Pak for Data, will allow data science teams to make choices along the privacy-utility spectrum and continue to push the boundaries of AI initiatives.