DonationCoder.com Software > N.A.N.Y. 2019

NANY 2019 - StringSimilarity - Release

HelmutWe:
Great work. Have you considered commercial use? It could be used to identify typos in databases, that is, variations of data compared against an official list, such as names of locations, chemical formulae, engine parts, etc.

Uit_VerKoop3:
Hi, nice stuff.
Is it possible to build in a comparison based on trigrams?
Start at position 1 of the string, take 3 characters and hold them, shift 1 position and do the same, until the end.
Do the same for the second string.
See how many trigrams the two strings have in common and calculate that as a percentage of the total number of trigrams in string 1.
This gives a good idea of the similarity between different spellings of street names, etc.
Keep up the good work
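
For illustration, the trigram comparison described above could look roughly like this in Python. This is only a sketch and not part of StringSimilarity; the function names are made up, and it assumes that a trigram of string 1 counts as a match if it occurs anywhere in string 2.

```python
def trigrams(text, n=3):
    """Return the overlapping n-character slices of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def trigram_match_percent(first, second):
    """Percentage of first's trigrams that also occur somewhere in second."""
    first_grams = trigrams(first)
    if not first_grams:
        # Assumption: a string shorter than 3 characters has no trigrams,
        # so the score is reported as 0 rather than dividing by zero.
        return 0.0
    second_grams = set(trigrams(second))
    shared = sum(1 for gram in first_grams if gram in second_grams)
    return 100.0 * shared / len(first_grams)


print(trigram_match_percent("Main Street", "Main Str."))
```

Because the total is taken from string 1 only, the result depends on the order of the arguments: swapping the two strings can give a different percentage.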

Ath:
Start at pos 1 of the string take 3 characters and hold them, shift 1 position and do the same, until the end.
Do the same for the second string.
See how many trig's are the same among the strings and calculate the percentage to the total number of trig's of string 1.
-Uit_VerKoop3 (January 07, 2019, 03:22 AM)
--- End quote ---
I'm not quite sure, and I'm no expert on the subject, but isn't this the 'F23 Jaccard index' calculation from StringSimilarity? You'd have to multiply by 100 to get a percentage, if you need that; I kept all results except Damerau-Levenshtein (number of edits) in the 0..1 range. The documentation for the F23 library and an excerpt of its algorithms can be found here.
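
For comparison, here is a minimal Python sketch of the textbook Jaccard index over character trigram sets. It is not the F23 library's code, and the function names and the handling of very short strings are assumptions. The textbook definition divides the number of shared trigrams by the number of distinct trigrams in both strings together, rather than by the trigram count of string 1.

```python
def ngram_set(text, n=3):
    """Return the set of distinct n-character slices of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_index(first, second, n=3):
    """Jaccard index |A & B| / |A | B| of the two n-gram sets, in the 0..1 range."""
    a, b = ngram_set(first, n), ngram_set(second, n)
    if not a and not b:
        # Assumption: two strings too short to have any n-grams
        # are treated as fully similar.
        return 1.0
    return len(a & b) / len(a | b)


score = jaccard_index("Main Street", "Main Str.")
print(score)        # value in the 0..1 range
print(score * 100)  # the same value as a percentage
```

Unlike the percentage relative to string 1, this value is the same whichever string is passed first.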

HelmutWe:
Hi Uit_VerKoop3. What you have in mind is very similar to my thoughts, which I wasn't able to explain further. Only I wasn't thinking about trigrams but "n-grams", so to speak. My proposal was splitting a string into all possible substrings. For example, "Microsoft" would consist of:
M, i, c, r, o, s, o, f, t, plus Mi, ic, cr, ro, os, so, of, ft plus Mic, icr, cro, ros, oso, sof, oft plus Micr, icro, cros... and so on. But thinking about this kind of pattern recognition, I got carried away and ended up thinking about pattern recognition for Chinese. And I'm still thinking. "10001" is similar to "7asdz7" in a way, isn't it? Or "HAL" is similar to "IBM"? (Space Odyssey)
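
The splitting of "Microsoft" into all its substrings can be sketched in Python as follows. This only illustrates the enumeration, not a similarity measure, and the function name is made up.

```python
def all_substrings(text):
    """Return every contiguous substring of text, grouped by length 1..len(text)."""
    return [
        [text[i:i + n] for i in range(len(text) - n + 1)]
        for n in range(1, len(text) + 1)
    ]


for group in all_substrings("Microsoft"):
    print(", ".join(group))
```

A string of length k has k * (k + 1) / 2 such substrings, so the count grows quadratically with the length of the string.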