• Welcome to Valhalla Legends Archive.
 

Compare text

Started by Spht, July 13, 2007, 10:17 AM

Previous topic - Next topic

Spht

Haven't even thought of how i'm going to do it, but i'd like to compare two strings (no larger than 256 bytes) and return a number between 0 and 100 for how much they match.

ie, given "test" and "candy" would return 0.  "test" and "a test" would probably return about 95%.  "bob" and "bill" is maybe 5%.  "Listen carefully:  Do not touch" and "Do not touch" is maybe 60%

It's in C++, and needs to be able to perform well with 100,000 simultaneous calls

Shouldn't be too difficult to do, haven't put any thought into the best way to do it yet, so figured i'd just post here and see what you smart people come up with.  Count how many of each letter appear and compare the amounts to the other string maybe?

rabbit

I'm not exactly sure if it'd be exactly what you want, but personally I'd tally the frequencies of each character and then comparing the results of that.
Grif: Yeah, and the people in the red states are mad because the people in the blue states are mean to them and want them to pay money for roads and schools instead of cool things like NASCAR and shotguns.  Also, there's something about ketchup in there.

iago

Perhaps rabbit's solution, with some kind of weighting for how far apart the letters are? So when letters are in nearly the same position (at the beginning, end, middle, etc.), the words are considered more similar. Also, you might want to ignore vowels (or count them for less), since they can often be interchanged without changing the meaning as much as most consonants.

It probably wouldn't help, but there is a "sounds like" operator in MySQL, which returns true or false depending on if the two entries sound like each other. I believe that there's a formula for how this is done on MySQL's Web site.
This'll make an interesting test for broken AV:
QuoteX5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*