Topic: comparing two big different lists of strings/filenames

paradisusvic

  • Participant
  • Joined in 2024
  • *
  • Posts: 54
  • Call me Vic!
    • View Profile
    • Read more about this member.
    • Donate to Member
Re: comparing two big different lists of strings/filenames
« Reply #25 on: May 28, 2024, 10:56 AM »
separating content vs function words
-paradisusvic (May 21, 2024, 10:51 AM)

These are the function words that made it to the list:

a, an, the, this, that, these, those, my, your, their, our, some, many, few, all, and, but, or, so, because, although, in, of, on, with, by, at, over, under, he, she, it, they, we, you, me, him, her, is, am, are, was, were, has, have, had, can, could, may, might, shall, should, will, would, must, who, what, when, where, why, how

Function words are filtered out of a title only when both conditions hold:

  • The title has two or more words (this protects titles that are just a single function word from the list, such as "her").
  • At least one word remains after filtering.
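
The two conditions above could be sketched roughly like this in Python. This is only an illustration: the program itself is not written in Python, the names FUNCTION_WORDS and strip_function_words are hypothetical, and leaving the title unchanged when a condition fails is an assumption.

```python
# Function words exactly as listed in the post above.
FUNCTION_WORDS = frozenset("""
a an the this that these those my your their our some many few all
and but or so because although in of on with by at over under
he she it they we you me him her is am are was were has have had
can could may might shall should will would must
who what when where why how
""".split())


def strip_function_words(title: str) -> str:
    """Drop function words from a title, subject to the two conditions."""
    words = title.split()
    # Condition 1: only titles with two or more words are filtered.
    if len(words) < 2:
        return title
    kept = [w for w in words if w.lower() not in FUNCTION_WORDS]
    # Condition 2: at least one word must survive the filtering.
    # (Assumption: if nothing survives, the title is left as it was.)
    if not kept:
        return title
    return " ".join(kept)


print(strip_function_words("Her"))
print(strip_function_words("The Lord of the Rings"))
```

In this sketch "Her" is left alone because it is a single word, and a title made up entirely of function words keeps all of its words.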

(I'm home & coding! :Thmbsup:)
My name's Victor but do feel free to call me Vic!

❤️ Support on Patreon @ www.patreon.com/paradisusis

GitHub: github.com/paradisusis

🌟 Good projects deserve support 🌎🌍🌏

✉ Email / Extend your PayPal support ❤️: paradisusvic@gmail.com
« Last Edit: May 28, 2024, 11:01 AM by paradisusvic »

paradisusvic

« Reply #26 on: May 28, 2024, 07:01 PM »
MovieList-compn ALPHA 2 is released :up::


New GUI:

MovieListCompn_ALPHA-2.png



BTW, it also works on Linux!

MovieListCompn_ALPHA-2_Linux.png

Ubuntu 22.04 + Mono runtime.

compn

  • Supporting Member
  • Joined in 2006
  • **
  • default avatar
  • Posts: 54
    • View Profile
    • Donate to Member
Re: comparing two big different lists of strings/filenames
« Reply #27 on: May 29, 2024, 03:12 PM »
crash.jpg

paradisusvic

« Reply #28 on: May 30, 2024, 07:14 PM »
Hello & thanks for your report; the latest ALPHA 3 version processes the file correctly:


MovieList_compn-Screenshot from 2024-05-30 19-56-47.png



Titles are now processed and normalized internally before comparison. As a test, "Stjärne" was edited to "Stjarne" in the second file, and the two entries still matched:

Woman and Gramophone Johannes Stjärne Nilsson & Ola Simonsson,
First list:
Woman and Gramophone Johannes Stjärne Nilsson & Ola Simonsson, 2000
Second list:
Woman and Gramophone Johannes Stjarne Nilsson & Ola Simonsson, 2000

Also, this version has other minor internal pre-processing updates for punctuation and handling multiple spaces that should make direct title matching more efficient :Thmbsup:
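
For readers wondering what this kind of normalization can look like, here is a minimal Python sketch. The thread does not show the program's actual steps, so the function name normalize_title and the exact order of operations (diacritic folding, punctuation removal, space collapsing, lowercasing) are assumptions.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Return a comparison key for a title (hypothetical helper)."""
    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks (a-umlaut becomes plain "a").
    decomposed = unicodedata.normalize("NFKD", title)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces so words stay separated.
    no_punct = re.sub(r"[^\w\s]", " ", folded)
    # Collapse runs of whitespace and ignore case.
    return re.sub(r"\s+", " ", no_punct).strip().lower()


first = "Johannes Stj\u00e4rne Nilsson & Ola Simonsson, 2000"  # with a-umlaut
second = "Johannes Stjarne Nilsson & Ola Simonsson, 2000"
print(normalize_title(first) == normalize_title(second))
```

Letters that have no Unicode decomposition (for example "ø") pass through this sketch unchanged, so a real implementation would need extra handling for them.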
« Last Edit: May 31, 2024, 09:03 AM by paradisusvic »

paradisusvic

« Reply #29 on: May 31, 2024, 06:22 AM »
Okay! Regarding the steps on Reply #5:

This is being tackled in two steps/releases:

- First, the MovieList for direct movie comparison.

- Second, Fuzzy-matching for adding more/partial results.
-paradisusvic (May 15, 2024, 07:38 PM)

Direct movie comparison is now good enough, so the next step is to implement fuzzy matching on the pre-processed titles that this step produces :up:

The library of choice is "FuzzySharp", which is based on the previously-mentioned FuzzyWuzzy, hence it is functionally equivalent for our purposes.

« Last Edit: May 31, 2024, 08:52 AM by paradisusvic »

paradisusvic

« Reply #30 on: June 01, 2024, 12:11 AM »
The new ALPHA 4 version implements Fuzzy Matching :Thmbsup::


The new GUI has a checkbox for toggling between direct and fuzzy comparison:

MovieListCompn_ALPHA-4_encircled-checkbox.png

The library of choice is "FuzzySharp"
-paradisusvic (May 31, 2024, 06:22 AM)

The fuzzy algorithm and cutoff value are configurable:

MovieListCompn_ALPHA-4_expanded-combo.png
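
To give an idea of what "algorithm" and "cutoff" mean here, below is a rough Python sketch. It does not use FuzzySharp: the scores come from Python's standard difflib as a stand-in, so the numbers will differ from the program's. The scorer names, the SCORERS table and fuzzy_matches are all hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length slice of the longer."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    return max(ratio(shorter, longer[i:i + width])
               for i in range(len(longer) - width + 1))


def token_sort_ratio(a: str, b: str) -> int:
    """Similarity after sorting the words, so word order does not matter."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


SCORERS = {"ratio": ratio, "partial": partial_ratio, "token_sort": token_sort_ratio}


def fuzzy_matches(title, candidates, algorithm="ratio", cutoff=75):
    """Keep the candidates whose score against title reaches the cutoff."""
    scorer = SCORERS[algorithm]
    return [c for c in candidates if scorer(title, c) >= cutoff]


titles = ["anthropophagus", "antropophagus", "buster"]
print(fuzzy_matches("anthropophagous", titles, algorithm="ratio", cutoff=75))
```

A higher cutoff keeps only the closer matches; a different scorer changes what "close" means, for example by ignoring word order or by allowing one title to be contained in the other.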



Time to proceed to the "Bells & Whistles" to make the program complete :)
« Last Edit: June 01, 2024, 12:36 AM by paradisusvic »

compn

« Reply #31 on: June 01, 2024, 01:41 AM »
fuzzy match works!

document the default ratio / partial ratio / token set etc from fuzzysharp into help menu?

alpha-4 has a weird bug where the match result lists are switched. it's not an important bug, i'm just reporting it.

Anthropophagous
First list:
Anthropophagous.1980.REMASTERED.1080p.BluRay.x264.DTS-FGT.mkv.torrent
Anthropophagus (1980) DTS (Custom 5.1 Mix).torrent
Anthropophagus 1980 1080p GBR Blu-ray AVC LPCM 2.0-CultFilmsT.torrent
Anthropophagus.1980.1080p.BluRay.x264-FAPCAVE.torrent
Anthropophagus.1980.BD.Baggerinc.torrent
ANTROPOPHAGUS.Mg.torrent
Antropophagus.torrent
Second list:
Anthropophagous [1980]

fuzzy matching is slower. i know there is nothing that can be done to speed this up, no worries. a progress/working indicator or progress bar would be nice. maybe just a little animated gif ?

compn

« Reply #32 on: June 01, 2024, 01:47 AM »
here are some stress test lists. these are from youtube titles and it crashes. probably due to non english character.

compn

« Reply #33 on: June 01, 2024, 09:56 PM »
program crashes on more characters:

Dark Lady of Kung Fu µûŸoÓ° (1981) º£°¶°æ.avi.torrent

paradisusvic

« Reply #34 on: June 02, 2024, 11:40 AM »
program crashes on more characters:

Dark Lady of Kung Fu µûŸoÓ° (1981) º£°¶°æ.avi.torrent

Thanks for the report!

The next version is going to strip the remaining foreign-language (non-ASCII) characters for internal comparison, while caching the original title for output/display :)

paradisusvic

« Reply #35 on: June 04, 2024, 06:51 PM »
Dark Lady of Kung Fu µûŸoÓ° (1981) º£°¶°æ.avi.torrent

You're correct: pre-processing throws an error when processing such odd titles.

The easy way out would be to strip non-alphanumeric characters in the traditional file-friendly way ("[^A-Za-z0-9_. ]+") but it can lose some title-delimiter information in the process.
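
As an illustration of that trade-off, here is a small Python sketch that applies the same character class. Whether the stripped characters would be replaced by nothing or by a space is not stated in the thread; this sketch assumes nothing, and the function name is hypothetical.

```python
import re

# The "file-friendly" character class quoted above.
FILE_FRIENDLY = re.compile(r"[^A-Za-z0-9_. ]+")


def strip_file_unfriendly(title: str) -> str:
    """Remove every run of characters outside A-Z, a-z, 0-9, '_', '.', ' '."""
    return FILE_FRIENDLY.sub("", title)


for name in (
    "Anthropophagous [1980]",
    "Anthropophagus.1980.1080p.BluRay.x264-FAPCAVE.torrent",
):
    print(strip_file_unfriendly(name))
```

Brackets, parentheses and hyphens all disappear, so "[1980]" is no longer marked off as a year and "x264-FAPCAVE" fuses into a single token.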

I'm fixing it by keeping the processing "as is" for well-formed file names and stripping characters step by step only for the ones that throw the error.
« Last Edit: June 04, 2024, 10:00 PM by paradisusvic »

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,616
    • View Profile
    • Donate to Member
Re: comparing two big different lists of strings/filenames
« Reply #36 on: June 05, 2024, 01:13 AM »
Wouldn't that all be fixed by handling strings as Unicode? Because stripping 'non-ascii' from titles would mean significant information being removed from these titles :huh:

paradisusvic

« Reply #37 on: June 05, 2024, 02:04 AM »
Wouldn't that all be fixed by handling strings as Unicode?

Hello & good day! Your Unicode observation is spot on. In this case, though, the app needs a smaller character set to compare the words in movie titles; we are currently folding characters to ASCII to keep the program's complexity reasonable, by our standards, that is :)

Certainly, other implementations of fuzzy movie file name comparison can do it differently; we are just streamlining things here for our purposes. The less, the better in this case.
« Last Edit: June 05, 2024, 05:43 AM by paradisusvic »

compn

« Reply #38 on: June 08, 2024, 03:30 PM »
another string that causes crash.

"FÃ¥rö Document    [Documentary]   [1970].mkv"

can we get an update program that doesn't crash on these? :)

also this crashes: (i might edit and put more below)
"Movies\\[1971]\\David Lean   A Self Portrait   [Documentary]   [1971]"
« Last Edit: June 08, 2024, 03:39 PM by compn »

paradisusvic

« Reply #39 on: June 08, 2024, 03:37 PM »
another string that causes crash.

"FÃ¥rö Document    [Documentary]   [1970].mkv"

can we get an update program that doesn't crash on these? :)

It's solved! In the end, pre-processing is done with the regex "[^\u0000-\u007F]+", followed by stripping all invalid path characters.

This approach fixed every crash caused by unusual symbols, in exchange for folding the foreign characters in the display title. Of course, the actual file name that was passed in remains intact.
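
A rough Python equivalent of that two-step clean-up might look like the sketch below. The non-ASCII regex is the one quoted above; everything else is an assumption: the matched characters are simply removed, the "invalid path characters" are taken to be the usual Windows set, and the function name is hypothetical.

```python
import re

# Regex quoted in the post: any run of characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\u0000-\u007F]+")

# Assumption: Windows-style invalid file name characters plus control characters.
INVALID_PATH_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def to_comparison_key(title: str) -> str:
    """Return an ASCII-only, path-safe version of a title for comparison."""
    ascii_only = NON_ASCII.sub("", title)
    return INVALID_PATH_CHARS.sub("", ascii_only)


# The two crashing samples from the reports above, written with escapes.
samples = [
    "F\u00c3\u00a5r\u00f6 Document    [Documentary]   [1970].mkv",
    "Movies\\[1971]\\David Lean   A Self Portrait   [Documentary]   [1971]",
]
for original in samples:
    # Keep the original for output; use only the key for matching.
    print(repr(to_comparison_key(original)))
```

The original string is never modified, which matches the point made above that only the comparison/display form changes while the passed file name stays intact.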

I'm bumping the version to 0.2.0 to call main functionality complete :Thmbsup:

(Going to push to GitHub as time permits later today/tomorrow)

compn

« Reply #40 on: June 08, 2024, 03:59 PM »
thanks vic

paradisusvic

« Reply #41 on: June 08, 2024, 04:31 PM »
thanks vic

My pleasure 🤗

Please feel free to post a sorted list with the next functionality you need, in the order you want it developed :up:

e.g. progress bar, auto saving, drag and drop, command line interface... plus any other.

(simply place those you consider the most important ones to implement at the top of the list)



Meanwhile, I'll keep "tweaking" this version's code for release.

Cheers!
« Last Edit: June 09, 2024, 12:34 AM by paradisusvic »

compn

« Reply #42 on: June 08, 2024, 06:13 PM »
 :Thmbsup: :Thmbsup: :Thmbsup:

1. command line interface if possible.


paradisusvic

« Reply #43 on: June 09, 2024, 12:27 AM »
Hello & good day! New v0.2.0 is released:


1. command line interface ...

...On the way!

Cheers!

compn

« Reply #44 on: June 09, 2024, 07:42 PM »
no more crashing!
also it doesnt use a lot of ram,  even on large lists. i am impressed!

paradisusvic

« Reply #45 on: June 10, 2024, 07:23 PM »
no more crashing!
also it doesnt use a lot of ram,  even on large lists. i am impressed!

Glad you're liking it! :Thmbsup:

For separation of concerns, the new CLI version has been moved to its own thread and code base:

https://www.donationcoder.com/forum/index.php?topic=54226.0

The current Windows/GUI version continues with v0.3.0, which adds drag & drop on the list text boxes plus other features, under its own code repository, as seen above.

compn

« Reply #46 on: June 17, 2024, 09:53 PM »
still run into annoying stuff like this with fuzzy match though :(

Bruiser
First list:
Bruiser
Gunbuster (1988)
buster.1988.internal.bdrip.x264-ghouls.mkv
Second list:
Buster (1988).mkv.torrent

paradisusvic

« Reply #47 on: June 17, 2024, 10:57 PM »
still run into annoying stuff like this with fuzzy match though :(

Bruiser
First list:
Bruiser
Gunbuster (1988)
buster.1988.internal.bdrip.x264-ghouls.mkv
Second list:
Buster (1988).mkv.torrent

Hello & good day! You may want to try "Weighted" as the algorithm, together with different "cutoff" values.

Cutoff defaults to 75. In your case, you may want to try with 80, then 85, then 90 up to 95.

(Until you find the configuration that behaves closest to what you need/expect)
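
To show why raising the cutoff helps with cases like this, here is a small Python sketch that sweeps the suggested values. It uses Python's standard difflib as a stand-in scorer, not the program's "Weighted" algorithm, so the actual scores in MovieList-compn will differ; the titles are assumed to be already normalized and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Stand-in similarity score on a 0-100 scale (difflib, not FuzzySharp)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def matches_at(title, candidates, cutoff):
    """Candidates whose score against title reaches the cutoff."""
    return [c for c in candidates if score(title, c) >= cutoff]


# Assumed normalized forms of the titles from the report above.
first_list = ["bruiser", "gunbuster", "buster"]

for cutoff in (75, 80, 85, 90, 95):
    print(cutoff, matches_at("buster", first_list, cutoff))
```

With this stand-in scorer, "bruiser" is accepted at the default cutoff of 75 but drops out at 80, which is the kind of change to look for when trying the higher values.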

Cheers! :Thmbsup:
« Last Edit: June 18, 2024, 07:15 AM by paradisusvic »