De-Identifying Data

Isaac · Sep 29, 2022

I'm actually doing this in Excel but posting it in Access for traffic's sake, and because the actual thrust of my question has nothing to do with whether it is done in Access or Excel, it's just asking for ideas.

I've written a macro (think 'query' if you want) that changes data for the purpose of de-sensitizing it. You can pick a workbook, worksheet and column and the it runs through and changes every vowel to "x". That's just a start for proof of concept.

What I'm wondering now is how to make it more effective. If my manager sees that I have (obviously) changed "Johnson" to "Jxhnsxn", that might not be good enough.

I'm thinking of coming up with something a little bit more random, and that covers consonants as well, but it needs to be predictable enough so that data de-sensitized in one datasource - which is supposed to match another data source - still matches that other datasource.

I guess what I'm wondering is, given this situation I've already come up with, do you have any ideas merely as to the logic of how I might approach it?

I suppose if I just chose a specific replacement vowel for each vowel, and then one for each consonants (or at least most consonants), that may fit the bill - any other ideas?

theDBguy · Sep 29, 2022

Sounds like you're almost talking about encryption.

bastanu · Sep 29, 2022

Why not use some hashing functions (https://en.wikibooks.org/wiki/Visual_Basic_for_Applications/String_Hashing_in_VBA)?

Minty · Sep 29, 2022

I had a list of about 1000 names.
I just ran a query that listed all the Distinct First Names, and matched with a random one from the list.
I stored that in a temp table and then globally replaced each real name with the random name.

Rinse and repeat with the surname using a different random list. Get some interesting results...

jdraw · Sep 29, 2022

Isaac,

Take a look at Micron's randomizing approach.

Pat Hartman · Sep 30, 2022

When I was building a set of test data I used the name and address scramble described by Minty and shifted digits in the phone numbers. Such as swapping 3 and 6 and 4 and 7. But if I just want to obfuscate it and don't care if it makes sense. I pick the third, first, last, and fifth digits and drop all the rest.

Isaac · Sep 30, 2022

Thanks to everyone for the suggestions. I eventually created a fixed 'key' by using some one-time randomizing techniques for all alphabetter letters and for 1-9 numbers, then did a loop that just finds and replaces character by character. I will just use this once in a while, not often, so this works lightning fast and I just call it in the immediate window. Thanks again. Posting and reading answers helped me get my thoughts in gear.

This way the data still matches.

Some day I may get into encryption @theDBguy , for now I'll leave it as 'over my head'.

Pat Hartman · Sep 30, 2022

I'm not sure encryption is useful for something like this. The idea is to be able to use valid data but to obscure certain pieces of identifying data like names, phones, addresses, SSN. I took some care with the model I used because I was delivering it as a set of test data that could be reloaded after the user had done the training and perhaps reload it again, clean, to train a new group of users. There were associated import files and word docs that were filled via automation so the whole set needed to be coherent. But, if I'm going to load a cut down database here for someone to help to debug, more than likely, just picking a couple of characters from the sensitive fields would suffice.

Cronk · Oct 2, 2022

I've had to do this a few times. I always laughed to myself when the requirement was requested by the manager. We try so hard to maintain data integrity that to do the opposite is an anathema.

I would substitute something like Surname1, ..2,..3 for actual surname and not keep the same SurnameX for a particular person. Similarly for address fields. Phone numbers I'd replace all with 999-999-999 or whatever. Quantities and similar apply a random factor. Similarly for dates of shipping whatever a random offset (+/-) of the actual date.

This hashes the data. There is no encryption to decipher the data.

De-Identifying Data

Isaac

Lifelong Learner

theDBguy

I’m here to help

bastanu

AWF VIP

Minty

AWF VIP

jdraw

Super Moderator

Pat Hartman

Super Moderator

Isaac

Lifelong Learner

Pat Hartman

Super Moderator

Cronk

Registered User.

Similar threads

Users who are viewing this thread