The GDPR Clock Is Ticking: Pseudonymization and Test Data Quality
With the EU General Data Protection Regulation (GDPR) coming into full effect on May 25, 2018, organizations must adjust how they handle test data privacy in order to comply with new legislation and avoid fines.
In my last post, I covered a few important points your company should consider when starting a test data privacy project, including inventorying sensitive data in tests and defining a category for each column (field) deemed sensitive. I also touched on the idea of pseudonymization, which the GDPR defines as processing personal data so that it can “no longer be attributed to a specific data subject without the use of additional information.”
In a test data privacy project, the end users of disguised data (the testers) typically raise two objections to how pseudonymization affects test data quality. The first objection is easily handled, but the second must be approached delicately and with the right solution.
Objection 1: I Can’t Test If I Don’t Find My Customer
For most of my professional career, I’ve worked with or for application testers. They usually have their “favorite” customers—test cases they regularly use—who are selectable by name, more often by an ID, and have complex sets of products/services.
The favorite customer is essential for successful testing, and usually is a real person. Typically, if pseudonymization is used on test data, testers will complain their customer has “disappeared.” Sound familiar?
I had an interesting discussion with a group of testers at an insurance company regarding this. They actually claimed any tampering with their test data (which was real data) would make testing impossible. Our discussion went more or less like this:
Marcin: I’m in my early forties, male, married, with good health. I work with computers and I travel a lot. I don’t smoke. I’m not into any extreme sports. Do you have a life insurance policy for me?
Testers: The company will find something for you.
Marcin: Great. Now, let’s say you need to test the policy application. What information will you need to make sure I have a good user experience?
Testers: Your name.
Marcin: Why? It wasn’t within the criteria for my policy, was it? Or do they base the policy premium on my name?
Testers: No, they don’t, but—
Marcin: Wait. Is my policy a unique case at your company? Am I a particularly special client of yours?
Testers: No, there are hundreds of thousands of clients like you.
Marcin: There you go. You don’t need my name to verify the application works correctly for me; you need to formulate criteria for your tests!
So, the response to objection number one is only a matter of educating testers about test data privacy and helping them break habits. I know this is possible based on experience. For example, I worked for a bank in the UK where business analysts were able to single out 400 clients statistically representative of their whole client base (some 1.5 million clients). Each of the 400 records was precisely described and disguising the names and addresses was painless.
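To make the idea of selecting by criteria rather than by name concrete, here is a minimal Python sketch. The Customer fields, the sample records and the find_test_customers helper are all hypothetical and exist only for illustration; a real project would query its own test database instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str      # disguised value; never used for selection
    age: int
    gender: str
    smoker: bool
    married: bool


def find_test_customers(customers, *, min_age, max_age, gender, smoker, married):
    """Return the customers that match the test criteria, ignoring names."""
    return [
        c for c in customers
        if min_age <= c.age <= max_age
        and c.gender == gender
        and c.smoker == smoker
        and c.married == married
    ]


# Invented sample records with already disguised names.
customers = [
    Customer(1, "Alan Example", 42, "M", False, True),
    Customer(2, "Beth Sample", 35, "F", False, False),
    Customer(3, "Carl Placeholder", 44, "M", True, True),
    Customer(4, "Dan Madeup", 41, "M", False, True),
]

# "Early forties, male, married, non-smoker" expressed as criteria.
matches = find_test_customers(
    customers, min_age=40, max_age=44, gender="M", smoker=False, married=True
)
print([c.customer_id for c in matches])  # prints [1, 4]
```

Because the name plays no part in the selection, it can be disguised freely without the tester "losing" the customer.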
Objection 2: Pseudonymization Will Spoil My Test Data Quality
The second objection, that pseudonymization will ruin test data quality, goes deeper than whether to use real names (or IDs). In my previous post, I used the following example of test data:
Marcin Grabiński (my real name)
Norden Rd, Maidenhead, Berkshire SL6 4AY, UK (a valid address in the UK, but not mine)
09.06.1976 (not my real date of birth)
+1 313.227.7088 (a valid phone number but not mine)
Further, what if we changed the above address (a real one) to:
Norden Rd, Maidenhead, Berkshire SL4 1NJ, UK
Legally, this dataset is fine. The problem is it includes a UK address and a US telephone number. What if there’s a validation rule stating those two pieces of information must match? Additionally, the pieces of the new address are individually correct, but the combination of the town, street and postcode is invalid. Logical disagreements between data can ruin test data quality, especially if address validation is done by an external service.
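A validation rule of the kind described above might look like the following Python sketch. The DIALING_PREFIXES table and the phone_matches_country function are assumptions made for illustration; they are not taken from any particular application or validation service.

```python
# Hypothetical lookup of ISO country codes to international dialing prefixes.
DIALING_PREFIXES = {"GB": "+44", "US": "+1", "PL": "+48"}


def phone_matches_country(country_code: str, phone: str) -> bool:
    """Check that a phone number starts with the prefix of the address country."""
    prefix = DIALING_PREFIXES.get(country_code)
    if prefix is None:
        raise ValueError(f"No dialing prefix known for {country_code!r}")
    # Drop spaces and dots so "+1 313.227.7088" becomes "+13132277088".
    normalized = phone.replace(" ", "").replace(".", "")
    return normalized.startswith(prefix)


# The record from the example: a UK address with a US phone number.
print(phone_matches_country("GB", "+1 313.227.7088"))  # prints False
print(phone_matches_country("US", "+1 313.227.7088"))  # prints True
```

If the application under test enforces a rule like this, the disguise routine has to replace the address and the phone number together so that they still agree.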
And what about credit card numbers? It’s easy to change a card number into “1234 5678 9012 3456” but then the number is not valid and tests would fail. How about national IDs? They usually encode the gender and date of birth and include a check digit. You change one digit and it immediately becomes invalid. And even if the “masked” national ID is valid per se, will it still be consistent with the rest of the person’s data?
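Payment card numbers normally carry a Luhn check digit, which is why a naive replacement fails validation. The Python sketch below shows the check and one way a disguise routine could produce a replacement number that still passes it. The function names are my own, and a real disguise tool may well work differently.

```python
def luhn_checksum(digits: str) -> int:
    """Return the Luhn sum modulo 10 for a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10


def is_luhn_valid(number: str) -> bool:
    """Validate a card number, ignoring spaces and other separators."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return len(digits) > 1 and luhn_checksum(digits) == 0


def with_check_digit(partial: str) -> str:
    """Append the Luhn check digit to a string of digits."""
    check = (10 - luhn_checksum(partial + "0")) % 10
    return partial + str(check)


print(is_luhn_valid("1234 5678 9012 3456"))   # prints False
print(with_check_digit("123456789012345"))    # prints 1234567890123452
print(is_luhn_valid("1234567890123452"))      # prints True
```

The same principle applies to national IDs, although each country uses its own check digit algorithm and encodes different attributes.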
These are good points to raise, and at first they seem to make a strong case against disguising data. However, doing it well is only going to become more important as the GDPR begins to penalize companies for using real data. Companies therefore don’t really have the option to refuse, and the truth is that disguising data won’t ruin your test data quality.
Rather, in the analysis (preparation) phase of your test data privacy project, any such constraints must be identified and documented. The analysis must show if the constraints found are important during the test process. If there is no address validation routine used during tests, why bother keeping the address valid? One fake address could be used for all records!
In sum, before any disguise rule is coded, it is absolutely essential to:
- Create an inventory of sensitive data and assign all columns (fields) to a category
- Analyze the application, data and test requirements and document all valid constraints
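One possible way to record the outcome of these two steps is a simple inventory structure, sketched below in Python. The SensitiveColumn fields, the table and column names, and the categories are hypothetical examples rather than a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SensitiveColumn:
    table: str
    column: str
    category: str                                   # e.g. "address", "card number"
    constraints: List[str] = field(default_factory=list)
    validated_in_tests: bool = False                # does any test enforce the constraints?


# Hypothetical inventory entries for illustration only.
inventory = [
    SensitiveColumn("customer", "last_name", "name"),
    SensitiveColumn("customer", "postcode", "address",
                    constraints=["must match town and street"],
                    validated_in_tests=False),
    SensitiveColumn("customer", "card_number", "card number",
                    constraints=["Luhn check digit"],
                    validated_in_tests=True),
]


def columns_needing_valid_values(columns):
    """List columns whose constraints are actually checked during tests."""
    return [
        f"{c.table}.{c.column}"
        for c in columns
        if c.constraints and c.validated_in_tests
    ]


print(columns_needing_valid_values(inventory))  # prints ['customer.card_number']
```

Columns that do not appear in the result can be disguised with simple fixed or random values, while the others need a disguise rule that preserves the documented constraint.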
In my next post, I’ll explain how to move from data analysis into design. To do so, we’ll first have a look at some disguise techniques. Stay tuned by subscribing to InsideTechTalk.com.
To learn more about test data privacy in light of the GDPR, read the other posts in my “The GDPR Clock Is Ticking” blog series:
- How to Start a Test Data Privacy Project
- Data Disguise Techniques
- Creating a Data Lookup Table
- Accessing a Data Lookup Table
- Two-tier Access to a Lookup Table