Post New Requests Here / Re: comparing two big different lists of strings/filenames
« Last post by compn on Today at 12:18 AM »
I see Vic is on the task, but I think what you're asking for is usually called fuzzy string matching. Try a Web search, there seems plenty of Python work on it, and take a look at Comparing Strings Is Easy With FuzzyWuzzy.
-rjbull (May 15, 2024, 05:23 PM)
thanks, i didn't know what it was called.
this stackoverflow post has some info about scoring fuzzy matches, and the output sounds like exactly what i was describing, although i won't know how useful that output is until i try it. and i wouldn't want the match percentage written to a new list anyhow.
https://stackoverflo...h-very-similar-names
Output:
print(results)
    sample_name         actual_name             score
0   jtsports            JT Sports LLC           79.0
1   tombaseball         Tom Baseball Inc.       81.0
2   context express     Context Express LLC     95.0
3   zb sicily           ZB Sicily LLC           95.0
4   lightening express  Lightening Express LLC  95.0
5   fire roads          Fire Road Express       86.0
6   NaN                 Earth Treks             NaN
7   NaN                 TS Sports LLC           NaN
8   NaN                 MM Baseball Inc.        NaN
9   NaN                 Contact Express LLC     NaN
10  NaN                 AB Sicily LLC           NaN
11  NaN                 Lightening Roads LLC    NaN
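for anyone wondering how that kind of score comes about, here is a rough sketch using Python's standard difflib instead of FuzzyWuzzy, so the numbers will not match the table above. the helper names score and best_match and the cutoff of 70 are made up for illustration, not taken from the stackoverflow answer.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Similarity of two strings on a 0-100 scale, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio(), 1)

def best_match(name, candidates, cutoff=70):
    """Return (candidate, score) for the closest candidate,
    or (None, None) if nothing reaches the cutoff (assumed value)."""
    scored = [(c, score(name, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1], default=(None, 0))
    return best if best[1] >= cutoff else (None, None)

# A few names borrowed from the table above.
actual = ["JT Sports LLC", "Tom Baseball Inc.", "Earth Treks"]
for sample in ["jtsports", "tombaseball"]:
    print(sample, "->", best_match(sample, actual))
```

names that find no partner above the cutoff come back as (None, None), which plays the same role as the NaN rows in the table.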
- First, the MovieList for direct movie comparison.
-paradisusvic (May 15, 2024, 07:38 PM)
this already exists i think. its just uniq -d
for example, with the lists i provided:
C:\>cat list* | sort | uniq -d
The.Last.Married.Couple.In.America.1980.720p.BluRay.x264.AAC-[YTS.MX].mp4
The.Osterman.Weekend.1983.720p.BluRay.x264.YIFY.mp4
The.Wild.Life.1984.720p.BluRay.x264.AAC-[YTS.MX].mp4
C:\>uniq --version
uniq (textutils) 2.1
C:\>cat --version
cat (GNU textutils) 2.0
but that wouldn't tell me which list has the duplicate, because i had to concatenate the files first.
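one way around that, as a minimal sketch: keep track of which list each line came from while counting. the function name find_duplicates and the list names "list1" and "list2" are placeholders, and the first filename in the demo is invented for illustration. this only finds exact matches, same as uniq -d.

```python
from collections import defaultdict

def find_duplicates(named_lists):
    """named_lists maps a list name to its lines.
    Returns {line: {list_name: count}} for every line seen more than once overall."""
    seen = defaultdict(lambda: defaultdict(int))
    for name, lines in named_lists.items():
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                seen[line][name] += 1
    return {
        line: dict(counts)
        for line, counts in seen.items()
        if sum(counts.values()) > 1
    }

# In-memory stand-ins for the two list files.
lists = {
    "list1": [
        "The.Osterman.Weekend.1983.720p.BluRay.x264.YIFY.mp4",
        "Some.Other.Movie.1990.mp4",  # invented filename
    ],
    "list2": [
        "The.Osterman.Weekend.1983.720p.BluRay.x264.YIFY.mp4",
    ],
}
for line, where in find_duplicates(lists).items():
    print(line, where)
```

to run it on real files, read each one with open(path).read().splitlines() and pass the results in under whatever names you like.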
all i really need is a fuzzy uniq for comparing two lists...
and if i google fuzzy uniq, there's 'funiq':
https://github.com/mjfisheruk/funiq
Funiq (fuzzy uniq) is a command line tool for performing fuzzy string matching against lists of words.
and the examples it is using to compare? movie titles! hahaha :finger pointing at brain.meme:
but the ability to exclude words and characters would make for better matches. i see some fuzzy matching toolkits use scrubbers to scrub the inputs first.
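a scrubber like that could look roughly like the sketch below. it is not how funiq or any particular toolkit does it; the IGNORE set, the scrub and similarity helpers, and the second filename in the demo are all assumptions made up for illustration. the idea is to lowercase, split on anything that is not a letter or digit, drop the ignored tokens, and only then compare.

```python
import re
from difflib import SequenceMatcher

# Hypothetical ignore list; extend it with whatever tags appear in your own lists.
IGNORE = {"720p", "1080p", "bluray", "x264", "aac", "yify", "yts", "mx", "mp4", "mkv"}

def scrub(filename):
    """Lowercase, split on non-alphanumeric characters, drop ignored tokens."""
    tokens = re.split(r"[^a-z0-9]+", filename.lower())
    return " ".join(t for t in tokens if t and t not in IGNORE)

def similarity(a, b):
    """Ratio between 0.0 and 1.0 computed on the scrubbed names."""
    return SequenceMatcher(None, scrub(a), scrub(b)).ratio()

a = "The.Wild.Life.1984.720p.BluRay.x264.AAC-[YTS.MX].mp4"
b = "The Wild Life (1984) 1080p.mkv"  # invented variant of the same title
print(scrub(a))
print(scrub(b))
print(similarity(a, b))
```

with the release tags and punctuation stripped, both names reduce to the same title and year, so they compare as identical even though the raw filenames look quite different.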
- Second, Fuzzy-matching for adding more/partial results.
Tokens + ignore list of words
fuzzy matching looks difficult