Author Topic: Brilliant Text Editor (search functions) (Read 22606 times)

riposa · « **on:** February 13, 2007, 10:36 PM »

Hello,

I posted this elsewhere, but the manager told me to post it here, as he also thought it was a good question, and he was also interested in replies. Here is the question...

What editor will search a file for a name and return both the places where it is spelled correctly and the places where it is misspelled, with the editor smart enough to make a good guess at the name you wanted? This would be of great value to me, since I deal with lots of misspelled names. Right now with Textpad, all I can do is search separately on the first name and the last name, or search on only the first few letters or on clumps of letters in the name that stand out, which is very time-consuming and tedious. The same name might be misspelled three times, three different ways. Also, some names, like Belarusian ones, are spelled by sound, such as Stahurskaia or Stahurskaya: when the correct spelling is not available, the original editor guesses, and the result depends on how the name was transliterated from Cyrillic characters.

Is there a brilliant editor out there for this?

For instance, if I search for Luperini, it would bring back Luporini, Luperin, Loperoni, and so on; it guesses and brings back everything close, or even lets me customize the search more.
When I search cycling race results for a given year, a certain name might be listed 600 times, but in results from past years names are often misspelled, so I miss all those instances with Textpad. What I need is for the editor to bring back everything that is a reasonable match, in close proximity.

thanks,
Bruce

mouser · « **Reply #1 on:** February 13, 2007, 10:43 PM »

I'd really like to hear suggestions for search tools (or text editors with built-in multi-file search) which can do such advanced searching based on phonetics or misspelled words, etc.

Darwin · « **Reply #2 on:** February 14, 2007, 12:23 AM »

Won't EditPad do this? It's got built in reg expression searching...

mouser · « **Reply #3 on:** February 14, 2007, 12:31 AM »

we're not talking about regex, we're talking about a smart search that knows how to search for misspelled versions of a word.

Darwin · « **Reply #4 on:** February 14, 2007, 12:35 AM »

OK, I'm obviously out of my depth here. I thought that reg ex searching would allow one to do just what the OP is asking. However, given the fact that I find reg ex to be somewhat mysterious (ie, I have a vague idea of its power and an even vaguer idea of how to write reg ex searches), smart searching would be wonderful

riposa · « **Reply #5 on:** February 14, 2007, 04:20 AM »

Someone posted this elsewhere. It is an editor that claims to do something along these lines. I downloaded it, but have yet to try it.

Bruce

--------------------------------

A text editor that supports soundex (an English-oriented phonetic matching algorithm) searches.

Power Edit ($30): http://www.galcott.com/pe.htm

f0dder · « **Reply #6 on:** February 14, 2007, 05:35 AM »

RegEx *could* do what is asked for, but it would be extremely tedious to construct.

SoundEx searching, customized for different languages, does indeed sound like what you want. Would probably be nice if it showed a list with all hits (including some context), and let you go to that line, instead of the regular "find next" tediousness.

Iirc GNU aspell has internationalized/table-based soundex?
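
To make the soundex idea concrete, here is a minimal sketch of the classic American Soundex rules plus a "list all hits" helper of the kind described above. It is not how Power Edit or aspell implement it; the function names `soundex` and `soundex_hits` and the sample result lines are made up for illustration, and only plain A-Z letters are handled (accented letters are ignored).

```python
import re

# Standard American Soundex digit groups.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name):
    """Return the 4-character Soundex code of name ('' if it has no A-Z letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                    # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                       # H and W do not separate equal codes
        code = _CODES.get(ch, "")          # vowels and Y have no code
        if code and code != prev:
            result += code                 # adjacent equal codes collapse into one
        prev = code
    return (result + "000")[:4]            # pad or truncate to 4 characters


def soundex_hits(lines, name):
    """List (line_number, word, line) for every word that sounds like name."""
    target = soundex(name)
    hits = []
    for number, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            if soundex(word) == target:
                hits.append((number, word, line))
    return hits


# Hypothetical result lines, for illustration only.
results = ["1. Fabiana Luperini", "2. Jeannie Longo",
           "3. F. Luporini", "4. Loperoni F."]
for number, word, line in soundex_hits(results, "Luperini"):
    print(f"line {number}: {word!r} in {line!r}")
```

Luperini, Luporini and Loperoni all reduce to L165, so all three lines are reported, while Longo (L520) is not. Stahurskaia and Stahurskaya also share one code (S362).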

mouser · « **Reply #7 on:** February 14, 2007, 06:13 AM »

this would be a great assignment for the brand new LEVEL 2 TASKS in the Programming School..

riposa · « **Reply #8 on:** February 14, 2007, 11:53 AM »

With Textpad, Find in Files does indeed build a list of all the instances of a specific word, not only in one file but in any number of files, including subdirectories if you wish, so the sky's the limit, which is great. The results also list which files and which lines each instance occurs on. Extremely effective and efficient, but the problem with Textpad, of course, is that it cannot bring back close matches. There is no option for that. I am only mildly familiar with Boolean searching, regex, and such. For instance, Webseeker by Blue Squirrel goes out to the search engines and brings back close matches for its search queries. This is exactly what needs to be an option in Textpad, but perhaps soundex or something along those lines might work. Otherwise, I am also aware of grep in Perl, which could be used to build a Perl utility for just such a task. Perhaps such a utility exists already; I don't know without searching the web ad nauseam through hundreds of utility libraries to find just such a jewel.

Bruce

Ruffnekk · « **Reply #9 on:** February 14, 2007, 12:02 PM »

I don't know of any editor capable of it, but this is now on my to-do list for my upcoming application. I can't reveal a lot about it yet (I don't want to), but it will take a couple of months before it's finished, I'm afraid. But it's something I know how to implement (or have ideas about) and I will definitely have a go at it.

kimmchii · « **Reply #10 on:** February 14, 2007, 03:35 PM »

i use InfoRapid search a lot:

InfoRapid Search & Replace is one of the most powerful search and replace utilities you can find. With its built-in HTML filter, it's excellent for searching and previewing HTML documents (also in the cache of your internet browser) and supports many advanced search/replace techniques including regular expressions and phonetic searches. One of the nicest features is the fact that you can preview the search results with highlighted keywords as they were found, using a built-in mini browser.

For instance, if I search for Luperini, it will bring back, Luporini, Luperin, Loperoni, and so on, it guesses and brings back everything close, or even customize the search more.

can't you search L*p*r*?

mwb1100 · « **Reply #11 on:** February 14, 2007, 03:51 PM »

can't you search L*p*r*?
-kimmchii (February 14, 2007, 03:35 PM)

That'll work for a lot of situations, and it might be good enough for the original poster's needs. But you'd miss such things that might be considered close-enough matches like:

Luberini
Luperim

which have the same soundex code as Luporini, Luperin, and Loperoni.

And you'd have matches like:

Longpastor
Langpresser

which probably shouldn't match.

That problem could maybe be fixed with more targeted regex: L[aeiou]*p[aeiou]*r[aeiou]*
But I'd hate to have to come up with a custom regex each time I wanted to locate some names.
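
A quick way to see what each pattern actually catches is to run both over the names mentioned in this thread. This is only a sketch: it assumes the wildcard is matched case-sensitively against the whole word (Python's `fnmatch.fnmatchcase`) and that the regex only has to match at the start of the word (`re.match`); the variable names are made up.

```python
import fnmatch
import re

names = ["Luporini", "Luperin", "Loperoni", "Luberini", "Luperim",
         "Longpastor", "Langpresser"]

wildcard = "L*p*r*"                                        # simple * wildcard
vowel_regex = re.compile(r"L[aeiou]*p[aeiou]*r[aeiou]*")   # only vowels in between

# Whole-word, case-sensitive wildcard match.
wildcard_hits = [n for n in names if fnmatch.fnmatchcase(n, wildcard)]

# Regex anchored at the start of the word; trailing letters are allowed.
regex_hits = [n for n in names if vowel_regex.match(n)]

print("wildcard:", wildcard_hits)
print("regex:   ", regex_hits)
```

Under these assumptions the wildcard pulls in Longpastor and Langpresser and the vowel-only regex drops them. Luperim contains L, p and r in order, so both patterns do catch it; Luberini is the one that neither finds, since it has no p at all.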

kimmchii · « **Reply #12 on:** February 14, 2007, 04:15 PM »

ahh i see, thanks for the explanation.

riposa · « **Reply #13 on:** February 14, 2007, 05:58 PM »

I downloaded InfoRapid, so I have two now to try out. As for regex, I really think that for common users the function in an editor needs to be as simple as the easy button in the Staples commercial: a check box like the one for (exact match), except this box would say (closest match) or (nearest match)! Or perhaps (exact match and near match together). That takes out all the tedium, which otherwise leaves common users completely in the dark. I have to edit thousands of pages of results from 30 years of bicycle racing, with some riders' names being misspelled many times over the years. If I miss a name that is misspelled, then I might miss the results for that race, which is not acceptable. So instead of Luperini, the safest search might be just (Lup), which will probably bring back much more than what I want in the search results. These archives are on my hard drive, so it's extremely fast to search, if I have the right utility.

Some riders' names, like Pucinskaite, I have heard pronounced Pushing-Sky-Ah, and some pronounce it Pushing-Scooty. So when Euro journalists, writers, and editors write names, they sometimes guess and write them as they sound to them, or as they think they would sound in the native language, if they do not have the correct spelling in hand. For instance, the Belarusian champion Zinaida Stahurskaia, who was recently busted for selling steroids, has always had her name spelled two different ways, and no one seems to know how it's really spelled, as it's always Stahurskaia or Stahurskaya on the Internet. This name is literally spelled hundreds of times both ways on Google, so it's impossible to know.

Now the professional rider Jeannie Longo would be a good example. Of course I could just use her last name, Longo. But for the sake of shining some light on this, if I use the first name, sometimes it's spelled Jeannie, Jeanie, Jeanne, and so on.

Add to that, many Euro names use their own accent marks on certain letters, which wreaks havoc on search functions. For example, Magali Le Floc'h might just be spelled Magali Le Floch, or LeFloch. Letters like ù, é, ê, Â are used in French names, and other accents are used in other languages. Often when editors write the names in English they just use regular letters, but sometimes they don't, and what you are left with is names that are sometimes not only misspelled but also contain these little devils, which really throw a wrench in the works. So what is needed is for the editor to bring back close matches that include these little devils, as well as those without. Otherwise I will miss their names! It should not ignore the accents in close matches!

Some of these names are misspelled many ways, but usually close, because the names are complex, like Polikeviciute, Zabelinskaya, Polkhanova, Vzesniauskaite. Now add some accent marks along with the misspellings and you have alphabet soup!

Now the final wrench in the works! When I scan with Omnipage Pro OCR, it does a good job, but it also misspells names, which adds to the fog of war here. So if an editor can be clever enough to bring back close matches, or even gross matches as a last resort, that would still be better than the nonsense I am going through now!

So you can see the need for an easy button like Staples! The easy button is, beside the box for exact match, a box that says close match! (misspellings, accents and all!)

thanks,
Bruce

mouser · « **Reply #14 on:** February 14, 2007, 06:09 PM »

what you are asking for should exist and can exist. I suspect there are programs like this already, maybe some good free ones. please do take the time to scour the web for them. give us a report on what you find. if there are no good free programs that do this, it's something we should consider writing.

Darwin · « **Reply #15 on:** February 14, 2007, 09:33 PM »

Bruce - I couldn't agree with you more about the need for ease of use. I hope mouser is right and that there is a text editor available already that will do what you seek. I look forward to hearing more about it when you find it (I took a kick at googling for it this morning and didn't turn up very much).

Renegade · « **Reply #16 on:** February 15, 2007, 03:41 PM »

Regular expressions won't solve this problem from the OP. (I've never heard of a utility that will do this.)

The *real* problem is that 3 months ago you saved a file called "financial analysis.xls", or at least so you thought... The reality is that you made a typo and named the file "fnamcial analisis.xls". What you then need is spell checking as you search.

If you made that typo error, how would you know enough to use a regular expression? Wouldn't you have corrected it when you first saved the file?

The other thing is that while a few of us speak "regular expression speak" quite well, it's not a very common language for most people, and the market size is very small. So it's unlikely that anyone would ever want to write a utility to do this. i.e. Why satisfy all 3 people that want it when you can write something that thousands or millions of people would want? The utility would have to use a very dumbed down version of regex like the typical * and ? notation. Most people don't even know how to use ? though...

Ruffnekk · « **Reply #17 on:** February 15, 2007, 03:48 PM »

The way I'm solving this in my application is to compare all words which are approximately the same length (configurable) to the search term. I've designed scoring rules that will assign a score depending on the characters that match, their relative position and order in the word. The score is a percentage between 0 and 100. The higher the score, the higher the similarity. The threshold can be configured and a lower threshold will result in more matches of course.
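
The general shape of such a scored search can be sketched as follows. The scoring rules themselves are not given in this thread, so this sketch swaps in `difflib.SequenceMatcher` from the Python standard library as a stand-in score; the function names, the default threshold of 70 and the length window of 2 are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity of two words as a percentage (0 to 100)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def close_matches(words, term, threshold=70.0, max_len_diff=2):
    """Return (word, score) pairs scoring at least threshold, best first.

    Words whose length differs from the search term by more than
    max_len_diff characters are skipped without being scored.
    """
    scored = []
    for word in words:
        if abs(len(word) - len(term)) > max_len_diff:
            continue                      # not "approximately the same length"
        score = similarity(word, term)
        if score >= threshold:
            scored.append((word, round(score, 1)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


words = ["Luperini", "Luporini", "Luperin", "Loperoni", "Longo", "Zabelinskaya"]
for word, score in close_matches(words, "Luperini"):
    print(f"{score:5.1f}  {word}")
```

Raising `threshold` to 80 drops Loperoni from the list, and lowering it lets more distant spellings through, which is the trade-off described above.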

riposa · « **Reply #18 on:** February 15, 2007, 11:02 PM »

You might also consider programming your editor or utility to treat accented letters like ù, é, ê, Â as regular letters. That way, when names contain these accents, they won't be overlooked in the search results. It's because the names or words are sometimes spelled with these accents and sometimes without, depending on who is writing the names. Treating the accented letters as regular ones will bring back the same name whether it is spelled with or without the accents. Just an idea for you to consider.
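
One common way to do this, sketched here with Python's standard `unicodedata` module, is to decompose each accented letter into a base letter plus a combining mark and then drop the marks before comparing. The helper names `fold_accents` and `same_name` are made up, and the accented spelling Pučinskaitė is used purely as an illustration.

```python
import unicodedata


def fold_accents(text):
    """Strip accent marks: 'é' becomes 'e', 'Â' becomes 'A', and so on."""
    # NFKD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_name(a, b):
    """True if two spellings differ only by accents and letter case."""
    return fold_accents(a).casefold() == fold_accents(b).casefold()


print(fold_accents("ù é ê Â"))                     # u e e A
print(same_name("Pučinskaitė", "Pucinskaite"))     # True
```

Folding both the search term and the text keeps the accented and unaccented spellings in the same result list. Letters that have no base-plus-accent decomposition, such as ø or ł, pass through unchanged and would need their own mapping.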

Good luck with your software!

thanks,
Bruce

Ruffnekk · « **Reply #19 on:** February 15, 2007, 11:06 PM »

You might also consider the fact to program your editor or utility to treat accent letters like ù, é, ê, Â as regular letters.
-riposa (February 15, 2007, 11:02 PM)

Yes I forgot to mention that. I'm from Europe so I'm used to the concept of accented letters and I will take them into account.

riposa · « **Reply #20 on:** February 15, 2007, 11:11 PM »

Send me an email when it's in beta, if you want, depending on what your intentions are. I don't know who you would like to share your software with, privately or publicly, for a price or as shareware. Freeware I doubt, since of course it might fetch a fair price!

thanks,
Bruce


Ruffnekk · « **Reply #21 on:** February 16, 2007, 12:20 AM »

It's still far from a beta version, but the program will be freeware no matter what and I will probably publish some alphas and betas on DC to get opinions and bug reports.

mouser · « **Reply #22 on:** February 16, 2007, 01:02 AM »

Ruffnekk you should absolutely definitely look into some of the free spellcheck libraries and soundex functions as f0dder suggested - that might very well be a very good thing to add to your program.. the idea of calculating scores based on how likely one word is to stand in for another is a good idea.. make your code modular so that it's easy to add more functions that could check for various plausible alternatives for a word.

Ruffnekk · « **Reply #23 on:** February 16, 2007, 01:05 AM »

Yes I plan to make it a plugin / add-on which can be customized easily, but first I have to make the core application function well enough

After that I will start researching how to handle this particular subject. I find it very useful myself so it's on the top of my todo-list

f0dder · « **Reply #24 on:** February 16, 2007, 08:25 AM »

Per-locale translation tables (to handle accents and the like) are probably also a good idea... unless you wanna go full unicode
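
A rough sketch of what per-locale tables could look like, with generic Unicode accent stripping as the fallback. The two tables are small illustrative assumptions (lowercase letters only), not complete or authoritative transliteration rules, and `LOCALE_TABLES` and `fold_for_locale` are made-up names.

```python
import unicodedata

# Hypothetical per-locale tables; real ones would cover far more letters.
LOCALE_TABLES = {
    "de": str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}),
    "da": str.maketrans({"æ": "ae", "ø": "oe", "å": "aa"}),
}


def fold_for_locale(text, locale):
    """Apply the locale's table if there is one, then strip remaining accents."""
    # Compose first so the table sees single precomposed letters like 'ü'.
    text = unicodedata.normalize("NFC", text)
    table = LOCALE_TABLES.get(locale)
    if table is not None:
        text = text.translate(table)
    # Generic fallback: drop whatever combining marks are left.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_for_locale("Müller", "de"))   # Mueller
print(fold_for_locale("Müller", "fr"))   # Muller (no table, generic folding)
print(fold_for_locale("Søren", "da"))    # Soeren
```

The table step handles the cases where a locale has its own convention for writing a letter without the accent; everything else falls through to the locale-independent folding.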