
Author Topic: NANY 2019 - StringSimilarity - Release  (Read 11805 times)

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
NANY 2019 - StringSimilarity - Release
« on: November 11, 2018, 05:19 AM »
NANY 2019 Entry Information

Application Name: StringSimilarity
Version: 1.3.0.0
Short Description: Compare 2 strings/texts, giving a distance/proximity score using several algorithms
Supported OSes: Windows 7 or newer (.NET 4.5 or newer required)
Web Page: StringSimilarity page
Download Location: StringSimilarity page
System Requirements
  • Windows 7/8/10
  • .NET 4.5 or newer
Version History
  • 2018-11-24 1.3.0.0: Added Load from (txt) file feature.
  • 2018-11-11 1.2.0.0: NANY Release. Show all algorithm results in a grid
  • 2018-10-26 1.1.0.0: (not released) Added extra algorithms, restore screen size/position and settings.
  • 2018-10-24 1.0.0.0: Initial release
Author: Ath


Description
In response to a request by HelmutWe, I searched for and found an algorithm that seemed to match the request, and wrapped it in a C#/WinForms application.
After adding a few similar but different algorithms, also found on the internet, and reshaping the UI a bit to handle larger texts and a results grid, the current incarnation is now available.

Used algorithms and sources:
Algorithm | Source
Jaro-Winkler | Ronnie Overby's Jaro-Winkler
Damerau-Levenshtein | Wicked Shimmy's Damerau Levenshtein
F23 Sorensen-Dice coefficient | Feature 23's StringSimilarity.NET library
F23 Cosine similarity | (see above)
F23 Jaccard index | (see above)
F23 Normalized Levenshtein | (see above)

Features
Compare 2 strings/texts and calculate their similarity.

Planned Features
Allow selecting 2 files and determining their similarity

Requested features
Make available as separate dll for use from other tools (undecided yet)

Screenshots
Initial screen:
StringSimilarity-1.2.0-initial.png

Comparing Similarity and Simelarity
StringSimilarity-1.2.0-similarity.png

Usage
Installation
- Unzip the file to its own directory
- Run the exe
(A settings.xml file will be created when closing the application)

Using the Application
- Enter some texts to compare
- Select desired options
- Results are updated immediately when both strings are non-empty
- Results can be sorted by clicking a column title

Uninstallation
- Close the application
- Remove all files

Known Issues
To be reported by users, none so far...
« Last Edit: November 24, 2018, 08:05 AM by Ath, Reason: Updated to 1.3.0 »

Ath

Re: NANY 2019 - StringSimilarity - Release
« Reply #1 on: November 11, 2018, 05:20 AM »
(Reserved for future use)
« Last Edit: November 11, 2018, 05:27 AM by Ath »

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • Posts: 40,896
Re: NANY 2019 - StringSimilarity - Release
« Reply #2 on: November 11, 2018, 05:35 AM »
very nice  :up:

KodeZwerg

  • Honorary Member
  • Joined in 2018
  • Posts: 718
Re: NANY 2019 - StringSimilarity - Release
« Reply #3 on: November 11, 2018, 05:59 AM »
I love such stuff, cool project!
Thank you for wrapping it up so far!

Is exporting this to a DLL planned? ...so programmers can reach it through an API, with the functions you let us use...

Ath

Re: NANY 2019 - StringSimilarity - Release
« Reply #4 on: November 24, 2018, 08:15 AM »
StringSimilarity has been updated to 1.3.0, adding the ability to load text from txt files (*.txt or *.*). Note that there is no file-content check!

Also, the download has been moved to my DCMembers space, link available from the first post, above.



Quote from KodeZwerg: "Is exporting this to a DLL planned?"
Haven't decided on this request yet, as it's quite a bit of work for such a small feature-set :huh:
What would you want to use it for? Or would a command-line enabled version (with somewhat formatted output) suffice?

HelmutWe

  • Supporting Member
  • Joined in 2018
  • Posts: 62
Re: NANY 2019 - StringSimilarity - Release
« Reply #5 on: January 07, 2019, 01:55 AM »
Great work. Have you considered commercial use? It could be used to identify typos in databases, that is, variations of data compared to an official list, such as names of locations, chemical formulae, engine parts, etc.
And the best that you can hope for is to die in your sleep (Schlitz/Rogers)

Uit_VerKoop3

  • Supporting Member
  • Joined in 2013
  • Posts: 1
Re: NANY 2019 - StringSimilarity - Release
« Reply #6 on: January 07, 2019, 03:22 AM »
Hi, nice stuff.
Is it possible to build in a comparison based on trigrams?
Start at position 1 of the string, take 3 characters and hold them, shift 1 position and do the same, until the end.
Do the same for the second string.
See how many trigrams the two strings have in common and calculate that as a percentage of the total number of trigrams in string 1.
In this way you get a good idea of the similarity between different spellings of street names, etc.
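
The steps above can be sketched in a few lines. This is only an illustration in Python, not code from StringSimilarity (which is a C# application). The function names are made up for the example, and it assumes a case-sensitive comparison in which every trigram occurrence in string 1 is counted, since the description leaves both points open.

```python
def trigrams(text):
    """Return every overlapping 3-character window of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def trigram_overlap(first, second):
    """Fraction (0..1) of the first string's trigrams that also occur in the second.

    Assumption: a string shorter than 3 characters has no trigrams,
    and the score is then defined as 0.0.
    """
    grams_first = trigrams(first)
    if not grams_first:
        return 0.0
    grams_second = set(trigrams(second))
    shared = sum(1 for gram in grams_first if gram in grams_second)
    return shared / len(grams_first)


if __name__ == "__main__":
    # 5 of the 8 trigrams of "Similarity" also occur in "Simelarity".
    print(trigram_overlap("Similarity", "Simelarity") * 100)  # 62.5
```

Because the score is divided by the trigram count of string 1 only, swapping the two strings can change the result when their lengths differ.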
Keep up the good work

Ath

Re: NANY 2019 - StringSimilarity - Release
« Reply #7 on: January 07, 2019, 02:33 PM »
Quote from Uit_VerKoop3: "Start at pos 1 of the string take 3 characters and hold them, shift 1 position and do the same, until the end.
Do the same for the second string.
See how many trig's are the same among the strings and calculate the percentage to the total number of trig's of string 1."
I'm not an expert on the subject, so I'm not quite sure, but isn't this the 'F23 Jaccard index' calculation from StringSimilarity? If you need a percentage, multiply the result by 100: I kept all results in the 0..1 range, except Damerau-Levenshtein, which reports the number of edits. The documentation for the F23 library and an excerpt of its algorithms can be found here
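
For comparison, here is a minimal Python sketch of a Jaccard index over 3-character shingles. It is not the F23 library's code: the shingle size of 3 and the handling of very short strings are assumptions made for this example and have not been checked against the library. By definition the Jaccard index divides the number of shared shingles by the number of distinct shingles in both strings together, so it is close to, but not the same as, a percentage relative to string 1 alone.

```python
def shingles(text, k=3):
    """Return the set of distinct k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_index(first, second, k=3):
    """Jaccard index (0..1): shared shingles divided by all distinct shingles.

    Assumption: two strings that both have no shingles score 1.0.
    """
    set_first = shingles(first, k)
    set_second = shingles(second, k)
    if not set_first and not set_second:
        return 1.0
    return len(set_first & set_second) / len(set_first | set_second)


if __name__ == "__main__":
    score = jaccard_index("Similarity", "Simelarity")
    # 5 shared trigrams out of 11 distinct ones; multiply by 100 for a percentage.
    print(round(score * 100, 1))  # 45.5
```

Unlike a score relative to string 1, this value is the same regardless of which string is passed first.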

HelmutWe

Re: NANY 2019 - StringSimilarity - Release
« Reply #8 on: January 08, 2019, 02:12 AM »
Hi Uit_VerKoop3. What you have in mind is very similar to my own thoughts, which I wasn't able to explain further. Only I wasn't thinking about trigrams but "n-grams", so to speak. My proposal was splitting a string into all possible substrings. For example, "Microsoft" would consist of:
M, i, c, r, o, s, o, f, t, plus Mi, ic, cr, ro, os, so, of, ft plus Mic, icr, cro, ros, oso, sof, oft plus Micr, icro, cros... and so on. But thinking about this kind of pattern recognition I got carried away, ending up with thoughts about pattern recognition for Chinese. And I'm still thinking. "10001" is similar to "7asdz7" in a way, isn't it? Or "HAL" is similar to "IBM"? (Space Odyssey)
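
The "all possible substrings" split can be sketched as follows. This is a Python illustration only, with a made-up function name. It lists the substrings by increasing length and keeps duplicates (the single character "o" appears twice in "Microsoft").

```python
def all_ngrams(text):
    """Return every contiguous substring of text, shortest first, duplicates kept."""
    return [
        text[start:start + size]
        for size in range(1, len(text) + 1)
        for start in range(len(text) - size + 1)
    ]


if __name__ == "__main__":
    grams = all_ngrams("Microsoft")
    print(grams[:9])    # the nine single characters
    print(grams[9:17])  # the eight 2-character substrings
    print(len(grams))   # 45
```

A string of length n has n(n+1)/2 such substrings, so the count grows quadratically with the text length, which matters when comparing larger texts.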