De-Identifying Data (1 Viewer)

Isaac

Lifelong Learner
Local time
Today, 06:09
Joined
Mar 14, 2017
Messages
8,778
I'm actually doing this in Excel but posting it in Access for traffic's sake, and because the actual thrust of my question has nothing to do with whether it is done in Access or Excel, it's just asking for ideas.

I've written a macro (think 'query' if you want) that changes data for the purpose of de-sensitizing it. You can pick a workbook, worksheet and column and the it runs through and changes every vowel to "x". That's just a start for proof of concept.

What I'm wondering now is how to make it more effective. If my manager sees that I have (obviously) changed "Johnson" to "Jxhnsxn", that might not be good enough. :)

I'm thinking of coming up with something a little bit more random, and that covers consonants as well, but it needs to be predictable enough so that data de-sensitized in one datasource - which is supposed to match another data source - still matches that other datasource.

I guess what I'm wondering is, given this situation I've already come up with, do you have any ideas merely as to the logic of how I might approach it?

I suppose if I just chose a specific replacement vowel for each vowel, and then one for each consonants (or at least most consonants), that may fit the bill - any other ideas?
 

theDBguy

I’m here to help
Staff member
Local time
Today, 06:09
Joined
Oct 29, 2018
Messages
21,496
Sounds like you're almost talking about encryption. :)
 

Minty

AWF VIP
Local time
Today, 14:09
Joined
Jul 26, 2013
Messages
10,371
I had a list of about 1000 names.
I just ran a query that listed all the Distinct First Names, and matched with a random one from the list.
I stored that in a temp table and then globally replaced each real name with the random name.

Rinse and repeat with the surname using a different random list. Get some interesting results...
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 09:09
Joined
Feb 19, 2002
Messages
43,368
When I was building a set of test data I used the name and address scramble described by Minty and shifted digits in the phone numbers. Such as swapping 3 and 6 and 4 and 7. But if I just want to obfuscate it and don't care if it makes sense. I pick the third, first, last, and fifth digits and drop all the rest.
 

Isaac

Lifelong Learner
Local time
Today, 06:09
Joined
Mar 14, 2017
Messages
8,778
Thanks to everyone for the suggestions. I eventually created a fixed 'key' by using some one-time randomizing techniques for all alphabetter letters and for 1-9 numbers, then did a loop that just finds and replaces character by character. I will just use this once in a while, not often, so this works lightning fast and I just call it in the immediate window. Thanks again. Posting and reading answers helped me get my thoughts in gear.

This way the data still matches.

Some day I may get into encryption @theDBguy , for now I'll leave it as 'over my head'. :)
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 09:09
Joined
Feb 19, 2002
Messages
43,368
I'm not sure encryption is useful for something like this. The idea is to be able to use valid data but to obscure certain pieces of identifying data like names, phones, addresses, SSN. I took some care with the model I used because I was delivering it as a set of test data that could be reloaded after the user had done the training and perhaps reload it again, clean, to train a new group of users. There were associated import files and word docs that were filled via automation so the whole set needed to be coherent. But, if I'm going to load a cut down database here for someone to help to debug, more than likely, just picking a couple of characters from the sensitive fields would suffice.
 

Cronk

Registered User.
Local time
Today, 23:09
Joined
Jul 4, 2013
Messages
2,772
I've had to do this a few times. I always laughed to myself when the requirement was requested by the manager. We try so hard to maintain data integrity that to do the opposite is an anathema.

I would substitute something like Surname1, ..2,..3 for actual surname and not keep the same SurnameX for a particular person. Similarly for address fields. Phone numbers I'd replace all with 999-999-999 or whatever. Quantities and similar apply a random factor. Similarly for dates of shipping whatever a random offset (+/-) of the actual date.

This hashes the data. There is no encryption to decipher the data.
 

Users who are viewing this thread

Top Bottom