Matching logic

Probably not difficult but I would advise against it.
In case where matches aren't exact, I would want someone to check the results rather than trust any algorithm

I have a system for checking a batch of online payments by parents against the reference number for their child so it can be credited to the correct student. That ref number consists of LastName followed by the StudentID
e.g. Smith13254 or HaleAllen12892
Usually parents do it correctly but where there are errors (e.g. wrong studentID such as Smith 13524 or modified names such as Hale-Allen12892) a series of checks are made using several methods. The end user then needs to confirm (or possibly reject) the suggested match or matches. In the rare cases where no clear match can be determined, parents are contacted directly to determine origin of payment

In that application, I decided against using fuzzy match algorithms as I found them to be counterproductive for this purpose in testing.

In case its of any use to you, I've attached a PDF summarising the process I use
 

Attachments

isladogs - Very Professionally designed and documented System.
 
Bear in mind that one field (name for instance) may not be enough to guarantee a match. You should include other info to insure its a match such as date of birth, address , or customer number. You could have a John Doe or a Jon Doe thus the reason for returning a list of possible matches with other identifiers.
 
How hard would this be.
Change to a BATCH matching process.
The records in the downloaded Table from the Suppliers would contain (among many other fields):
The Customer of ours which was relayed to the clerk and keyed in by the clerk at the time of purchase
A blank field (bestmatch) to be BATCH filled in by just the #1 Levenshtein Match
A blank field (Bestmatch%) to be BATCH filled in by the TOP Levenshtein match %

After playing with this a lot and trying to do a lot of common mistakes, the Levenshtein and the FAYT seem to have the most utility. Rarely can I even come up with an example were soundex would get the correct solution and Levenshtein does not get it. It is pretty hard to make something sound the same that also does not have a lot of simlar substrings.

However, you do not want to give them just the best choice but like I did give them several choices. There is no reason (except academic curiosity) to show the percentage. If they keyed it in really wrong, you can get a 100% but just happens to match the wrong record.

You want to return several matches. Doing this with 10k records and doing some realistic miskeys, it will usually be in the top 5-10 records. Then you want to be able to quickly look at other fields to verify it is who you want by popping up the full record and looking at other fields. You could even return an exact match, but happens to be the wrong person because you unfortunately coded a wrong name but that name already exists.

Assume you have
Johns
Jones
Joans

You client is Joans, but they key him in as Jones. If you return only the exact match you would get Jones, but unfortunately not correct. If you return the top 10, the clerk should be able to click them to pop up other details and verify. If not they can go to the next record.
 
@dcavaiani
Whilst the school payments app is highly unlikely to be something you want to purchase, you are welcome to look at the DEMO version if you think it might be useful for ideas. See School Payment System
Its a fully working DEMO with data but an ACCDE so code not available.
BACS payment matching is just one small part of this large app
 
Here is the latest update showing several ways to search for a partial string or misspelling. There are 2 versions of soundex and 2 versions of Levenshtein (one is called Simil). None of the code is mine for these 4 functions. I provided the links earlier. The supposedly "updated" SoundEx2 seems inferior to the orginal SoundEx and Simil seems inferior to the standard Levenshtein. I would just get rid of those, but i included just to show there are different instantiations of the same Idea. I am sure there are many others.
Soundex was developed specifically for encoding english last names. Single words with a few syllables (from what I read it was for Census databases). It finds similar sounding names if they start with the same letter. It will find Kathy and Kathie but will not find Cathy. IMO there is not much utility for Soundex in this context. If Soundex can find a likely match it is likely Levenshtein will too. But levenshtein is way more versatile. It handles long strings.
IMO the Class module I developed for FAYT that turns any listbox into a find as you type, just blows these method out of the water for finding partial matches within a table. I provided that as well to show the utility. That would be my main approach. If you cannot find what you are looking for with that then you likely have bigger problems. In my demo you can find a person with a partial email, phone, lastname or first name. I would use that in conjunction with Levenshtein as a backup. FAYT can find a string anywhere in the text, but Levenshtein can find partial multiple strings. So its real strength would be in something like a misspelling of "Robbert". FAYT could find words with Rob or words with bert, but if you type in "Robert" you will get a very high Levenshtein value.
FYI, this is not meant to demonstrate how to incorporate these checks into a DB, it was built to give visibility into these different algorithms and methods of partial matching / "fuzzy searches." I fully realize in a real application you would have indices and other fields on which to check.
 

Attachments

Users who are viewing this thread

Back
Top Bottom