General Software Discussion / Re: Desktop search; NTFS file numbers
« on: January 11, 2015, 12:42 PM »
Hi jity2,
When I said that sharing is a step beyond finding out, I definitely didn't have you in mind, and I apologize for not having made my point of view clear enough: I very much appreciate your sharing your findings. In fact, it was your details that motivated me to add some more of my own. ;-)
Since you bring up my remark on path/file name lengths, let me quote from TechRepublic:
http://www.techrepublic.com/forums/questions/any-recommended-limits-on-number-of-files-in-an-ntfs-directory/
"Something else to look out for
TonytheTiger 8 years ago
is the length limits. Filenames can be up to 255 characters, but path and file combined can only be 260 characters (you can get around the path part by using Subst or 'net use' and setting a drive letter farther down in the directory structure)."
Well, I would try file and folder naming hygiene first, and in most cases, that should do it.
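To see where naming hygiene is needed in the first place, a folder tree can be scanned for names that break the limits quoted above. This is only a minimal sketch: the 255 and 260 figures are taken from the quote as-is, and the function names are made up for illustration.

```python
import os

MAX_COMPONENT = 255  # single file/folder name limit, as quoted above
MAX_PATH = 260       # combined path + file name limit, as quoted above

def classify(dirpath, name):
    """Return why an entry breaks the quoted limits, or None if it is fine."""
    if len(name) > MAX_COMPONENT:
        return "name too long"
    # +1 accounts for the separator between folder path and name
    if len(dirpath) + 1 + len(name) > MAX_PATH:
        return "path too long"
    return None

def find_long_names(root):
    """Walk a tree and yield (full_path, reason) for every offending entry."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            reason = classify(dirpath, name)
            if reason:
                yield os.path.join(dirpath, name), reason
```

Running `find_long_names` over the data drive gives a list of candidates to rename or move up before reaching for Subst or 'net use'.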
And since you mention HTML and all those innumerable, worthless "service files": that reminds me how important it is for a "smarter" search tool to be able to leave all of these out of its index, NOT by suffix alone, but by a combination of suffix and position within the file structure and, if needed, content as well. In fact, you separate your files/folders into an application part and a contents part (most of us do), but HTML and similar formats blur this distinction again by shuffling service code into the "contents" area of your data, so a smarter search tool should obviously clear out that intrusion.
Also, above, I forgot to mention Armando's remark:
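What "suffix plus vicinity" could look like, as a rough sketch only: a file is skipped when it has a typical service suffix AND sits below a "..._files" folder that has a matching saved page next to it. The suffix list and the "_files" folder convention (as produced by browsers' "save complete page") are assumptions for illustration, not how any of the tools discussed here actually works.

```python
from pathlib import PureWindowsPath

# Assumed list of "service" suffixes; adjust to taste.
SERVICE_SUFFIXES = {".js", ".css", ".gif", ".png", ".ico", ".woff"}
HTML_SUFFIXES = (".html", ".htm")

def is_service_file(path, all_paths):
    """True if path looks like a service file of a saved web page.

    all_paths is a set of PureWindowsPath objects for every known file.
    """
    p = PureWindowsPath(path)
    if p.suffix.lower() not in SERVICE_SUFFIXES:
        return False
    for folder in p.parents:
        if folder.name.lower().endswith("_files"):
            stem = folder.name[:-len("_files")]
            # Vicinity check: is there a page.html right beside page_files?
            for ext in HTML_SUFFIXES:
                if folder.with_name(stem + ext) in all_paths:
                    return True
    return False
```

A stand-alone logo.png among your documents stays indexed, while the same suffix inside report_files next to report.html is left out; a content check could be added as a third criterion where this is not enough.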
"Why do I mix Archivarius and DtSearch ? Simply because their algorithms for dealing with space and dashes are different and lead to different results. But if I had to choose one (but I woudn't...), I'd probably go with DtSearch : indexing is fairly quick and there are more search options to get what you want. Archivarius is fast too, but its search syntax isn't as sophisticated. Both could have better interface.
I use everything for filename/foldername search as it's so quick and its search syntax is very flexible and powerful (e.g. Regex can be used). (...) [Edit: about X1 : used to be my favorite, many years ago, but had to drop it because of performance reason and inaccuracy : it wouldn't index bigger documents well enough. See my comments earlier in the thread. To me, accuracy and precision are of absolute importance. If I'm looking for something and can't get to it when I know it's there... and then I'm forced to search "by hand"... There's a BIG problem.]"
The second passage I bolded (X1 not indexing bigger documents well enough) is both very important, if it still applies, and open to question, since they clearly did some work on their tool; the question is how deep that work went. In short: problem resolved, or not?
The first bolded passage (the two tools handling spaces and dashes differently) is of the very highest interest. It should certainly not stay buried on page 32 of some thread but be investigated further, all the more so since my observation about French accents applies here analogously: there are many parallel spellings of the same phrase, with or without hyphens (let alone dashes), or "written together", i.e. as one word (or even abbreviated). It gets more complicated when the phrase has more than two elements: a space between the first two but a hyphen between the second and third, or the other way round...
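To make the space/hyphen/one-word problem concrete, here is a minimal sketch of one possible approach: at indexing time, every run of up to three consecutive words is also stored in its "written together" form, and the query is collapsed the same way. The function names and the three-word limit are assumptions for illustration; nothing here is claimed to be what dtSearch or Archivarius do.

```python
import re
from collections import defaultdict

def join_key(phrase):
    """Collapse a phrase to one lowercase word: 'Full-Text' -> 'fulltext'."""
    return "".join(re.findall(r"\w+", phrase.lower()))

def build_index(docs, max_words=3):
    """Map every joined run of 1..max_words consecutive words to doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = re.findall(r"\w+", text.lower())
        for i in range(len(words)):
            for n in range(1, max_words + 1):
                if i + n <= len(words):
                    index["".join(words[i:i + n])].add(doc_id)
    return index

def search(index, phrase):
    """Find docs containing the phrase in any space/hyphen/joined spelling."""
    return index.get(join_key(phrase), set())
```

With this, "full text", "full-text" and "fulltext" all find the same documents, whichever spelling the document itself uses. The price is a bigger index and the odd false hit where two neighbouring words happen to join into a different word ("a part" / "apart").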
Which makes me wonder which of these tools might be able to treat British and American English spellings as equal, but without resorting to "fuzzy searching", which would bring masses of unwanted pseudo-hits...
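The non-fuzzy way to do this would be a plain lookup table of known spelling pairs, applied to both the indexed words and the query. A tiny sketch, with a made-up four-entry table standing in for a real one:

```python
# Hypothetical mini-table: British spelling -> American spelling.
SPELLING_VARIANTS = {
    "colour": "color",
    "organise": "organize",
    "centre": "center",
    "catalogue": "catalog",
}

def canonical(word):
    """Map a word to its canonical spelling; unknown words pass through."""
    w = word.lower()
    return SPELLING_VARIANTS.get(w, w)

def matches(query, text):
    """True if every query word occurs in text, ignoring spelling variants."""
    wanted = {canonical(w) for w in query.split()}
    found = {canonical(w) for w in text.split()}
    return wanted <= found
```

Since only listed pairs are treated as equal, "colour" finds "color" but never "collar", which is exactly the difference to fuzzy matching.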
(Irony of history: askSam, through its overwhelming success in those ancient times, "killed" another, similar "full text db" program which HAD semantic search, whilst AS, even 30 years later, never got there (and has now been moribund for some 5, 6 or 8 years)... and semantic search still cannot be found in any of those 2- and low-3-figure desktop search tools (only in 4- and 5-figure corporate tools, it seems). All this is about market considerations, not technology: technology-wise, leaving (possible) AI aside, it would all come down to some additional concordance tables, mostly at indexing time and less so at search time.)
And no, I'm not trying to talk you into running dtSearch indexing for days: it would just put unnecessary strain on your hardware, and from your findings and what we think we know, we can fairly safely assume it would take somewhere between 6 and 8 full days of indexing, where X1 needs 10 and Copernic needs 15. (Even though I'm musing about possible surprises, and then, you've run your stuff for 25 consecutive days now, so some 5 days more, percentage-wise... ;-) ) Let's just say that would have been utterly instructive. ;-)
EDIT:
The 8.3 problem/solution is often mentioned; in
http://stackoverflow.com/questions/115882/how-do-you-deal-with-lots-of-small-files it is explained best:
"NTFS actually will perform fine with many more than 10,000 files in a directory as long as you tell it to stop creating alternative file names compatible with 16 bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory because Windows looks at the files in the directory to make sure the name they are creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change as '8 dot 3' name files are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect."
(Dan Finucane, community wiki answer, edited Oct 24 '08)
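Before changing anything, it may be worth checking what the value is currently set to. A minimal, read-only sketch using Python's standard winreg module; the key path and value name are the ones from the quote, the function name is made up, and on anything but Windows it simply returns None.

```python
import sys

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\FileSystem"
VALUE_NAME = "NtfsDisable8dot3NameCreation"

def read_8dot3_setting():
    """Return the current registry value, or None if it cannot be read."""
    if sys.platform != "win32":
        return None  # winreg only exists on Windows
    import winreg
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except OSError:
        return None  # key or value missing, or access denied

if __name__ == "__main__":
    print(VALUE_NAME, "=", read_8dot3_setting())
```

The sketch deliberately does not write to the registry; setting the value to 1 as described in the quote needs administrator rights and the reboot mentioned there.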
ANOTHER EDIT:
Since Copernic is (again) on bits today, here's another point relevant to our subject (from over there):
"Jaap Aap Hi, is my understanding from the fine print correct that this includes the upgrade to v5? And will sorting by relevance be included at that point?"
I would have worded this as "SOME sorting by relevance", since there are innumerable ways of implementing relevance sorting in a search tool. But it's clear as day that this functionality, while of the highest possible importance, has not been given due attention by developers so far (at least not by those of the "consumer products" discussed here), if I'm not mistaken. (That said, a badly implemented "display-by-relevance" would obviously need to come with an option to disable it.)
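Just to show how little "SOME sorting by relevance" can mean, here is about the simplest variant imaginable: hits are ordered by how often the query words occur relative to document length, with a switch to turn the sorting off. This is one assumption among the innumerable possible ones, not what Copernic or any other tool does.

```python
def rank(hits, query, by_relevance=True):
    """Order (name, text) hits by a naive term-density score.

    With by_relevance=False the hits come back in their original order.
    """
    if not by_relevance:
        return list(hits)
    terms = query.lower().split()

    def score(hit):
        _name, text = hit
        words = text.lower().split()
        # Occurrences of all query terms, relative to document length
        return sum(words.count(t) for t in terms) / (len(words) or 1)

    return sorted(hits, key=score, reverse=True)
```

A short note that mentions the search terms three times beats a long document that mentions them once; whether that is the "relevance" you want is exactly the open question, hence the option to disable it.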