Levenshtein Distance

TimInAus

Hi All,


I'm hoping someone can help me with an issue I'm having.


I'm trying to design a fuzzy search feature for a C# application that uses MS Access as the DB. I am searching on a list of horse names; there are 130,000 horse records in the database. I've tried lots of different algorithms, but the one that seems to suit my needs best is Levenshtein distance. I'm really just trying to account for spelling mistakes when the user tries to find a horse. I've added the VBA to the database as a function, which I then use in my SQL. I can run the function in the SELECT part of the statement without any issues: 'SELECT name, levenshtein(a,b) FROM Horses' returns the whole table in under 2 seconds. But as soon as I put it in the WHERE clause the query takes 5+ minutes: 'SELECT name FROM Horses WHERE levenshtein(a,b) <= 2'.


Has anyone experienced the massive time difference when including a function in the WHERE clause?


Is there a better solution for my fuzzy search?


Any help is appreciated.


Thanks :)
Tim
 
Firstly, welcome to AWF!

This Levenshtein distance algorithm sounds to me like something theoretical, the sort of thing theoretical computer scientists would deal with and know. You also used the word "fuzzy", which made me think of fuzzy logic in AI. You didn't, however, mention what the algorithm does, what the input data is, or what the outcome should be.

What is your overall goal?
 
All distance algorithms take a lot of time. Yes, I have used the Jaro-Winkler algorithm in the WHERE clause and it takes forever.

I finally decided to do the Jaro-Winkler distance calculations in the select clause (it still takes a long time) and store the results in a table with the primary key of the records being compared. The problem with this is that the result set is a Cartesian product, which is itself difficult to deal with.

So I finally decided to further classify my records (in my case, I used state and zip code which won't help you) before calculating the distance.

What you are trying to do is not simple (just saying there are no simple answers).

What you could do is calculate and store the distance for all pairs once and then, whenever a new record (aka horse) is entered, calculate all possible pairs for that particular horse.

Any way you go, the process is slow. Use the very best computer(s) you can get your hands on. You'd be surprised at how much difference a better computer will make.
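
To give a rough idea of the pair approach in Tim's terms, here is a sketch in VBA/DAO. The HorseDistances table, its columns, and the cutoff of 3 are all assumptions of mine for illustration, not something from Tim's database:

Code:
' Sketch only: when a new horse is added, compare it against all existing
' horses once and store the close pairs, so later searches can hit a plain table.
' HorseDistances (HorseA, HorseB, Distance) is an assumed table, not from the thread.
Public Sub StoreDistancesForNewHorse(lngNewID As Long, strNewName As String)
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim intDist As Integer

    Set db = CurrentDb
    Set rs = db.OpenRecordset("SELECT ID, name FROM Horses WHERE ID <> " & lngNewID, dbOpenSnapshot)

    Do While Not rs.EOF
        intDist = levenshtein(strNewName, rs!Name)
        If intDist <= 3 Then   ' arbitrary cutoff so we don't store the full Cartesian product
            db.Execute "INSERT INTO HorseDistances (HorseA, HorseB, Distance) " & _
                       "VALUES (" & lngNewID & ", " & rs!ID & ", " & intDist & ")", dbFailOnError
        End If
        rs.MoveNext
    Loop

    rs.Close
End Sub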
 
George, have you tried generating the records using a recordset? I've found in some cases iterating a recordset for a Select query is faster than nesting loops.
 
I've tried just about every method known (at least, every method known by me). In the OP's case, he is dealing with something moderately simple (comparing 1 horse name to all other horse names).

Looking at the samples, however, I am confused because there is no indication in the SQL statement that more than one "horse" is being compared. But maybe that's where a and b come from.

In my case, I was comparing hundreds of thousands of business names to hundreds of thousands of other business names, some of which were no longer in business.

In every case, the resolution requires a full or modified (by finding a way to limit the comparisons) Cartesian product.

You know, Google does this millions of times a day. I wonder what algorithm they use to determine that you've misspelt your query and that there are multiple options?
 
The OP would need to elaborate.

I was explaining to another poster how Google does searches and returns results based on relevance. I would imagine things are done rather differently there. They would have a data dictionary against which they compare words, and that dictionary would be driven by some really fast algorithm too.
 
What if you tried this:

Code:
SELECT ..., lh.levenshtein_value
   FROM horses
   INNER JOIN (
       SELECT horseID, levenshtein(a,b) AS levenshtein_value
       FROM horses
   ) AS lh
   ON horses.horseID = lh.horseID
WHERE lh.levenshtein_value <= 2;

?
 
Side notes:

1) I'm very puzzled at the apparent difference between SELECT and WHERE especially if the number of rows hasn't changed. If a sample database can be uploaded, I'd love to fiddle with it.

2) Regarding Google et al, they actually don't use a relational database. In Google's case they use a product named Bigtable, which you can read up on in Wikipedia, but the reason they can do it fast is that it's not relational in the first place.
 
I can't even begin to wrap my head around BigTable. It looks like the ideal way to do what Google does, and it seems like an ideal way to replace distance algorithms. I actually considered using Google Earth or Google Maps to resolve my business problem, but so many of the businesses in our database are out of business and were located in places that no longer exist (the businesses have been torn down and the streets they were located on are gone).

I wonder if there's a way to integrate a Google solution for the OP's problem?
 
Not enough information to consider BigTable. Even if we did, it would be a milestone project.
 
I could be dead wrong and imagining things, but the OP makes it sound like he's searching for all possible variations of a given spelling of a horse name and using the distance algorithm to calculate how closely a name matches the given criterion? IOW, this has nothing to do with Google Maps.

I copied a sample algorithm from wiki:
Code:
int LevenshteinDistance(char s[1..m], char t[1..n])
 {
   // d is a table with m+1 rows and n+1 columns
   declare int d[0..m, 0..n]
  
   for i from 0 to m
     d[i, 0] := i // deletion
   for j from 0 to n
     d[0, j] := j // insertion
  
   for j from 1 to n
   {
     for i from 1 to m
     {
       if s[i] = t[j] then 
         d[i, j] := d[i-1, j-1]
       else
         d[i, j] := minimum
                    (
                      d[i-1, j] + 1,  // deletion
                      d[i, j-1] + 1,  // insertion
                      d[i-1, j-1] + 1 // substitution
                    )
     }
   }
  
   return d[m, n]  // note: row index (m) first, to match the declaration d[0..m, 0..n]
 }

Looking at the code, I would bet that what the OP meant by "(a,b)" was: a = the actual horse name as recorded in the table, b = the user's input for a horse name.

I'm also assuming there are no recordsets opened or additional querying being done in VBA code - there is no need to do so.

If the OP had said that the "SELECT..." with a WHERE works but that moving the function into the WHERE is suddenly slow, that would be easily explained: a function in a WHERE clause forces a table scan. But the OP also says that the SELECT... had no criteria at all, so the time required to execute the function for every record should be roughly the same. The only difference here is that the WHERE has to be able to figure out which rows return a distance of 2 or less... perhaps by creating an array in memory to list the result, whereas the SELECT can just dump the results.


Some more questions for OP:

1) You say the SELECT... executes instantly... but if you move to the last record, does it take just as long as it does with the function in the WHERE clause?

2) Do you know how to use SHOWPLAN? If so, paste the results. Otherwise, post a sample so I can try it out for myself.
 
I believe BigTable is not even commercially available, so it wouldn't be just a milestone project but more of a warez project or vaporware project. ;)
 
Hehe! We'll do it just like Microsoft does - get the idea from a competitor, change things a little bit, then sell it :D

The OP definitely has a lot of questions to answer now.
 
I've attached a sample. I pinched the code from here (I'm too lazy to re-write the wikipedia code).

There are no horses in the sample. To populate, run the subroutine "PopulateHorseTbl" in the module "modCreateHorses" - currently set to populate 120000 "horses" (random strings of 5-8 letters).

Then run the query "qryCloseMatches".

I found that without the WHERE clause the query seemed to fly. Just to be sure, I clicked "last record". But with the WHERE clause, it ran like a dog. Also, even with the results on screen, my PC seemed to hang for quite a few seconds.

I think the approach should be to do a single pass of the horse table using a recordset and pull each close match into a temporary table.
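
Something along these lines is what I mean (sketch only; the tmpCloseMatches scratch table and its columns are my own invention, not part of the attached sample):

Code:
' Sketch: one pass over Horses, writing close matches to a scratch table
' that a normal query (or the C# front end) can then read instantly.
Public Sub BuildCloseMatches(strSearch As String)
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim intDist As Integer

    Set db = CurrentDb
    db.Execute "DELETE FROM tmpCloseMatches", dbFailOnError   ' clear the previous search

    Set rs = db.OpenRecordset("SELECT ID, name FROM Horses", dbOpenSnapshot)
    Do While Not rs.EOF
        intDist = levenshtein(rs!Name, strSearch)
        If intDist <= 2 Then
            db.Execute "INSERT INTO tmpCloseMatches (HorseID, Distance) " & _
                       "VALUES (" & rs!ID & ", " & intDist & ")", dbFailOnError
        End If
        rs.MoveNext
    Loop
    rs.Close
End Sub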

Chris
 


Wow, thanks for all the responses guys! I'll try to answer your questions and give more info.

I am re-writing an application that records details about racehorses. The original version was written 18 years ago in early C++ using Access as the DB; I’m re-writing it using C# with Access for the DB. A lot of the original code is unreadable, so I’m having to re-invent the wheel for a lot of it. One of the functions I’m having to re-do is ‘Search Horse’, with the option to do a fuzzy search.

By ‘fuzzy search’ I mean a non-exact search that will account for user spelling mistakes when searching for a horse by name. Some typical horse names might be ‘Tom Pepper’, ‘Decade’, ‘Rhythm’ & ‘Miss Devine’; there are approx 130,000 records in the DB. The ‘Search Horse’ form allows the user to enter a horse name, click search, and have the results returned to them in a DataGrid. What I want is functionality that will return ‘Tom Pepper’ if something like ‘Tom Popper’ or ‘Pepper’ is entered as the search string.

From the research that I’ve done it seems the best way to implement something like this is to use an algorithm like Levenshtein distance to calculate the ‘closeness’ of the search string compared with the records in the database, and return those that fall within the closeness threshold. Levenshtein distance basically calculates how many changes are required to turn one string into another, e.g. Levenshtein(‘Tom Pepper’, ‘Tom Popper’) would = 1 and Levenshtein(‘Tom Pepper’, ‘Pepper’) would = 4.

Below is the Levenshtein distance function that I’m using:

Code:
Option Compare Database
Option Explicit

Function levenshtein(a As String, b As String) As Integer

    Dim i As Integer
    Dim j As Integer
    Dim cost As Integer
    Dim d() As Integer
    Dim min1 As Integer
    Dim min2 As Integer
    Dim min3 As Integer

    If Len(a) = 0 Then
        levenshtein = Len(b)
        Exit Function
    End If

    If Len(b) = 0 Then
        levenshtein = Len(a)
        Exit Function
    End If

    ReDim d(Len(a), Len(b))

    For i = 0 To Len(a)
        d(i, 0) = i
    Next

    For j = 0 To Len(b)
        d(0, j) = j
    Next

    For i = 1 To Len(a)
        For j = 1 To Len(b)

            If Mid(a, i, 1) = Mid(b, j, 1) Then
                cost = 0
            Else
                cost = 1
            End If

            ' Since the Min() function is not a part of VBA, we'll "emulate" it below
            min1 = (d(i - 1, j) + 1)        ' deletion
            min2 = (d(i, j - 1) + 1)        ' insertion
            min3 = (d(i - 1, j - 1) + cost) ' substitution

            If min1 <= min2 And min1 <= min3 Then
                d(i, j) = min1
            ElseIf min2 <= min1 And min2 <= min3 Then
                d(i, j) = min2
            Else
                d(i, j) = min3
            End If

            ' In Excel we can use the Min() function that is included
            ' as a method of the WorksheetFunction object:
            'd(i, j) = Application.WorksheetFunction.Min(min1, min2, min3)
        Next
    Next

    levenshtein = d(Len(a), Len(b))

End Function

I’ve been trying to keep this functionality in the DB but maybe I’m going the wrong way about it.

I thought I was going the right way about it, as I thought my only problem was getting the function to work in the WHERE clause without the query taking 5 minutes to execute. I tried to simplify the queries for my example but I seem to have caused a little confusion. The full queries are below, with the search string variable replaced by ‘Tall Lady’:

"SELECT ID, name, levenshtein(horses.name,'Tall Lady') FROM Horses" takes approx 1 minute

"SELECT ID, name FROM Horses WHERE levenshtein(horses.name,'Tall Lady') <= 2" goes for 5+ minutes before I cancel the query.

Thanks Banana, you’re exactly right about what I’m trying to achieve. Answering your first question: no, I was not going to the last row; I thought the results only displayed when the query was finished. Navigating to the last row shows that the query is taking about 1 minute rather than 2 seconds. That’s still a big difference between my two queries, but it probably makes my approach unacceptable anyway, as even 1 minute would be too long to make the user wait to have their horse returned.

I can’t understand why one takes longer than the other does when they are both running the same function on the same number of records.

The last version of the software returns ‘Tom Pepper’ in under two seconds if ‘Tom Pipper’ is entered, so I know this can be done. I thought I was close to a solution but now I feel I’m back to square one.

Am I trying to achieve something that can’t be done in Access? Should I be filtering my records in the C# code maybe?

Thanks again all for responding.

Cheers
Tim


EDIT: Stopher, just saw your response. Thanks, I'll check it out a little later on.
 
Well, I did an output test using a recordset and got it down to about 11 seconds. But then, I wasn't experiencing the 5 minutes that you had.

By the way, the code you are using has some inefficiencies. For instance, the expression Mid(a, i, 1) is inside the double loop, but its value depends only on the outer loop variable, so it's being re-calculated needlessly on every pass of the inner loop. If you look at the code in the link I provided, you can see that they have moved the expression out to the outer loop.

I wonder if this could be improved further by taking the MID functions out of the loops altogether. Instead, use the MID function to read the characters of each string into cells of two corresponding arrays.

So suppose you are comparing a 7-character string with a 9-character string. With the current method you are going to perform 63 Mid() calls. But using two arrays, there will be only 16.
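
Roughly what I have in mind is below (sketch only; it reuses the a, b, i, j and cost variables already declared in Tim's function, and only replaces the character comparison inside the loops):

Code:
' Sketch: read each string into an array of single characters once,
' so the double loop compares array elements instead of calling Mid().
Dim aChars() As String, bChars() As String

ReDim aChars(1 To Len(a))
ReDim bChars(1 To Len(b))

For i = 1 To Len(a)
    aChars(i) = Mid$(a, i, 1)   ' Len(a) calls to Mid$ in total
Next
For j = 1 To Len(b)
    bChars(j) = Mid$(b, j, 1)   ' Len(b) calls to Mid$ in total
Next

' ...then, inside the existing double loop:
If aChars(i) = bChars(j) Then
    cost = 0
Else
    cost = 1
End If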

Chris
 
Hey, stopher, thanks for sharing a bit more efficient code.

I put it in and, as expected, it still did not address the issue the OP is having here.

On the sample I created, I queried against 300,024 employees' first names using both variants: my first variant, which is very similar to Tim's, and the one you linked.

Both variants performed near-instantly when the function was in the SELECT clause, but move it to the WHERE clause and it's slow as molasses. I can get the first few results, but it's MoveLast that kills me: waiting >3 minutes to get to the last record.

More importantly, what doesn't make sense is that the number of rows examined hasn't changed between the two cases. That is, in both cases we are scanning the whole set of 300,024 employees. The results may differ (e.g. SELECT returns the whole thing while WHERE returns only a part) but the number of rows examined is the same. So at best, WHERE should be slightly slower, to account for building a filter or whatever, but certainly not 100x slower.

FWIW, here's the SHOWPLAN results:

When LD is in the WHERE clause
Code:
--- temp query ---

- Inputs to Query -
Table 'employees'
- End inputs to Query -

01) Restrict rows of table employees
      by scanning
      testing expression "LD("XXXX",[first_name])<=2"



When the LD is in SELECT:
Code:
--- temp query ---

- Inputs to Query -
Table 'employees'
    Using index 'PrimaryKey'
    Having Indexes:
    PrimaryKey 300024 entries, 503 pages, 300024 values
      which has 1 column, fixed, unique, primary-key, no-nulls
    hire_date 300022 entries, 849 pages, 5434 values
      which has 1 column, fixed
- End inputs to Query -

01) Scan table 'employees'
    Using index 'PrimaryKey'

Just to be sure, I had included the primary key in the LD-in-WHERE query but it made no difference, as expected.



I'm thinking there is something wrong going on here, and it's not Tim's fault. I'll have to work out the 'why', but if Tim is after an immediate solution, I would think that George's suggestion of using a temporary table and then querying upon it would produce better results than trying to do it in a single query. Another alternative is to add another criterion to restrict the rows: Jet is smart enough to evaluate a criterion that can use an index before it evaluates other criteria that cannot, so if that is a possibility, it may help a bit.
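
For example, something along these lines (just a sketch: it assumes the user typed the first couple of letters correctly, which may be too strong an assumption for a spelling-mistake search, and that name is indexed):

Code:
' Sketch: let Jet narrow the rows with an index-usable LIKE before the
' expensive levenshtein() criterion is evaluated.
Dim strSearch As String
Dim strSQL As String

strSearch = "Tall Lady"
strSQL = "SELECT ID, name FROM Horses " & _
         "WHERE name LIKE '" & Left$(strSearch, 2) & "*' " & _
         "AND levenshtein(name, '" & strSearch & "') <= 2"
' The LIKE 'Ta*' part can use the index on name; levenshtein() then
' only has to run on that much smaller set of rows.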

Tim, you asked about doing it in C#. The issue is that, to be callable from the query, the function has to be COM-compatible, and the LD function will be called for each row, so that means as many calls to the function as there are rows in the table. What will really hurt is crossing the .NET-COM boundary that many times. An alternative is to build an array from the results, perhaps by using GetRows, and pass it as-is to the .NET environment so you only need to cross the boundary once, albeit with a big array, and then evaluate the results there. But that seems to me more work than necessary when you can just use a temporary table and query upon it.
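
If anyone does try the array route, the VBA side might look roughly like this (sketch only; the hand-off to the C# side itself isn't shown):

Code:
' Sketch: pull the whole name list across in one call with GetRows,
' so there is no per-row boundary crossing.
Dim rs As DAO.Recordset
Dim varRows As Variant
Dim lngCount As Long

Set rs = CurrentDb.OpenRecordset("SELECT ID, name FROM Horses", dbOpenSnapshot)
If Not rs.EOF Then
    rs.MoveLast                      ' force a full fetch so RecordCount is accurate
    lngCount = rs.RecordCount
    rs.MoveFirst
    varRows = rs.GetRows(lngCount)   ' 2-D Variant array: varRows(field, row)
End If
rs.Close
' varRows can now be handed across to .NET in one go, or filtered right here in VBA.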
 
Thanks for your advice everyone. Looks like I'll have to go with a different approach. It's a shame that there isn't a publicly accessible solution for these types of searches.
 
Woo, I forgot about that thread!

There's a possible solution you may want to look at - let me get it.
 
So, back then, I had asked other MVPs about this, and the general conclusion explaining the difference was this:

By default, Jet does "lazy fetching" and delays as much evaluation as it can until it is actually needed. Therefore, when we open a recordset in a form or datasheet or whatever, it evaluates only enough records to fill the screen, basically. It then keeps evaluating the other records in the background, but if the user clicks the Last button, it interrupts its running background evaluation and immediately evaluates the last few records. But when you move the function to the WHERE clause, there is no possibility of lazy fetching, so we're stuck waiting for it to complete in its entirety.


But that doesn't address your issue: needing to get a horse's name quickly enough. For that, Tom Wickerath, a long-time MVP, was very kind to come up with this sample, basically combining both a Soundex function and a Levenshtein function to provide a better-performing search. I've not tested it as I've been tied up with other things, but I hope you find his sample very useful.
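
I haven't looked inside Tom's sample yet, so the following is only my guess at the general shape of that combination. It assumes a Soundex() VBA function and a precomputed, indexed SoundexCode column on the Horses table, which is not necessarily how Tom actually did it:

Code:
' Sketch: use an indexed Soundex code to shrink the candidate list,
' then apply the expensive Levenshtein test only to those few rows.
' Soundex() and the SoundexCode column are assumptions for illustration.
Dim strSearch As String
Dim strSQL As String

strSearch = "Tall Lady"
strSQL = "SELECT ID, name FROM Horses " & _
         "WHERE SoundexCode = '" & Soundex(strSearch) & "' " & _
         "AND levenshtein(name, '" & strSearch & "') <= 2"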

HTH.
 
