I have a person table with completely inconsistent location data, and I need to consolidate/fix it. Looking for ideas and methods.

Bettany

I'm building a database that imports Excel contact data, and the data (many thousands of records) is very inconsistent. The Excel doc has a "coverage" field ("coverage" indicates the area a particular contact covers/manages, so it can span multiple geographic locations), and the entries are wildly inconsistent: sometimes it's just a country ("England"), sometimes a state ("New York"), sometimes a region ("Northwest"), and sometimes a combination ("New York City, New York"). In short, there's no consistency at all. I believe I have two main options for cleaning this up and normalizing (please share thoughts on whether there are more):

  1. Create a many-to-many relationship for each type of data I find: people_country, people_city, people_region, people_state, etc.
  2. Denormalize all coverage entries into a single "locations" table, then create a many-to-many relation through a people_locations junction table, with a lookup table that provides metadata for each location
Never encountered something quite like this situation before, so wanted to ask the community for pros and cons on these two methods, or an entirely different method.
 
3. Make a mapping table.

You need a table that has 2 fields:

InputCoverage, OutputCoverage

InputCoverage holds every single value you ever receive in that Coverage field. OutputCoverage is what you want it to ultimately be. For example:

InputCoverage, OutputCoverage
New York, New York
NY, New York
NYC, New York
DC, Washington D.C.
Washington D.C., Washington D.C.
Wash DC, Washington D.C.

You build this by running a make-table query against your input file:

Code:
SELECT InputTable.coverage AS InputCoverage, InputTable.coverage AS OutputCoverage INTO CoverageMapping
FROM InputTable
GROUP BY InputTable.coverage;

Once the table is created, open it, sort by InputCoverage, and manually clean it up by merging the various InputCoverage values into standard OutputCoverage values as demonstrated above. Once done, you can use that table to convert the values you receive into the values you want.
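As a sketch of that last step (assuming Access-style SQL and the table/field names from the example above), applying the cleaned-up mapping back to the imported data is a single update query joined on the raw value:

Code:
UPDATE InputTable INNER JOIN CoverageMapping
    ON InputTable.coverage = CoverageMapping.InputCoverage
SET InputTable.coverage = CoverageMapping.OutputCoverage;

Alternatively, leave InputTable untouched and just join to CoverageMapping whenever you need the standardized value.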
 

This is absolutely brilliant! Thank you! Once I've eliminated inconsistencies in how people entered the same location, how do I best tackle the table relationships? I suppose I ultimately need a many-to-many (M:M) relationship between people and the locations they cover. So I could create City, State, Country, Region, etc tables, and if a person only has a city listed, I create an entry in the person_city junction table, and so on for the other geographic entities? Or do I ultimately need a single person_location table that maps every coverage value (previously entered on the Excel sheet but now corrected)?
 
I'd need a better big-picture view of what you are doing to advise properly: what does the organization this database is for actually do? What does this database help accomplish within that?

So I create City, State, Country, Region, etc tables

I generally don't like over-normalizing location data. I'd only have one location table that lists city/state/country/region in one record, not a separate table for each entity. Then, if you have a many-to-many relationship between people and locations, I'd just have that location table, a person table, and a junction between the two, for a total of 3 tables.

In fact, that mapping table could be your full location table. Just include the InputCoverage field and then put all other location fields in it as well.
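A minimal sketch of that three-table layout (Access-style DDL; every table and field name here is illustrative, not something the thread prescribes):

Code:
CREATE TABLE Location (
    LocationID AUTOINCREMENT PRIMARY KEY,
    InputCoverage TEXT(255),
    LocationName TEXT(255),
    City TEXT(100),
    State TEXT(100),
    Country TEXT(100),
    Region TEXT(100)
);

CREATE TABLE Person (
    PersonID AUTOINCREMENT PRIMARY KEY,
    PersonName TEXT(255)
);

CREATE TABLE PersonLocation (
    PersonID LONG NOT NULL REFERENCES Person (PersonID),
    LocationID LONG NOT NULL REFERENCES Location (LocationID),
    PRIMARY KEY (PersonID, LocationID)
);

InputCoverage holds the raw imported value and LocationName the standardized one; City/State/Country/Region can stay blank when a row describes a broader area. The junction table is what lets each person cover any number of locations.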
 
Take a step back first. Could coverage actually be a whole country, a whole state, a whole city, or a whole region within a country or a state? Just be careful before you assume coverage can only be defined as one thing. You could have a locations or coverage table with a field named LocationDescription with entries of "England" or "New York, New York" or "The Pacific Northwest". They could be perfectly valid coverage areas.

Also, could each person have multiple locations attached to them? Or does each location or coverage area have multiple people attached to it?

Does this post relate to your other post with respect to questionnaires? I think we need to see your actual table and relationship design before going any further. It sounds like you may be unsure as to how to design your project.
 
When you say you are building the database, is this for your company? You really need to get an understanding of what they are trying to do with the location, and whether it needs fixing at all, which is similar to what @LarryE just said.
 
I generally don't like over-normalizing location data. I'd only have one location table that lists city/state/country/region in one record, not a separate table for each entity.
If the programmer is managing these tables, I would agree but not if the user is managing it. It would be too easy to end up with a situation like the OP has. At least with separate tables, there is less possibility of the users adding typos or abbreviations. Also, if the db has security, you can restrict updates to this table(s) to just a few "admin" types.

I did have a project like this for a multi-national reinsurance company. They had about 20 customer files tied to different applications. The files shared a core of the same customers, but the records differed slightly: minor differences in the name, like "Co" instead of "Company" or "Co.", and similar differences in address. So I had to standardize the data within each file, attempt to remove duplicates, and then standardize the files across all the applications. It isn't a trivial task.

I did have a lot of help, though. I found software that the client purchased which did a lot of the analysis for me. It had two modules. One was used during data entry: it looked at the customer name and address you were adding and determined if something similar already existed. The other was used for the cleanup: it calculated a score for matching data. If the score was above 85%, for example, we could assume the records were the same and accept the match automatically; for less firm matches, say 50% to 85%, the users had to decide; below 50% there was no match. I can't remember the name exactly, but if they're still in business, you might find them. I did find some oblique references to the product in a proposal I made for how we would use the products. Look for Name3 and a clustering engine from a company with the initials SSA.
 
