What are the rules for local v. public index/search of DC Forum + Wayback saves?

IainB:
What are the rules for local v. public index/search of DC Forum + Wayback saves? I'm not sure whether the rules are somehow broken, or working as specified (rightly or wrongly).

When trying to see if I might be able to help/contribute more in a discussion on a topic in DCF, I will often resort to:

* (a) Exploring the DC Forum - by using the local search functionality provided - to search for some of the terms used in the discussion, and to see what else might have been mentioned about that topic elsewhere on the Forum.


* (b) Exploring more widely by doing a duckgo search of some of the terms used in the discussion, to see what else might have been mentioned about that topic elsewhere on the Internet.


* (c) Conducting Wayback searches for links/URLs which are old or give a 404 response.

If the searches turned up some useful links then I would often post whatever I found - e.g., the DC Forum links, or the duckgo search URL, or the Wayback URL - into the discussion, on the basis that being specific with references was likely to be more useful than simply saying something such as, for example, "We already discussed that on DC Forum", or "RTFM", or "Look in Wayback", or something.

However, I am a bit confuzzled by what I have observed in an ad hoc manner as and when I have performed searches, and over an extended period from some time back, because sometimes:

* (i) Some text strings of more than adequate length could not be found in local searches of DCF - that is, using the DCF local search function. I had presumed this could be the result of (say) either a local indexing failure, or deliberate site blocking (non-indexing) of certain strings/posts by admins.


* (ii) Some text strings of more than adequate length were not found by duckgo searches, even though the strings appeared in one or more DC Forum posts. I had presumed this could be the result of deliberate site blocking (e.g., via robots.txt) of certain posts, or other blocking, so that search crawlers were inhibited/blocked.


* (iii) Some discussion threads in the main DCF forum being unavailable in Wayback beyond a certain point due to robots.txt blocking. I had presumed this could be the result of deliberate site blocking of certain strings/posts by admins.


* (iv) Some posts appearing in the DCF "Best Of Blog" but with the wrong date and time of posting. I had presumed this could be the result of an indexing rule error.

Any daylight on this would be helpful.

TaoPhoenix:
...
However, I am a bit confuzzled by what I have observed in an ad hoc manner as and when I have performed searches, and over an extended period from some time back, because sometimes:

* (i) Some text strings of more than adequate length could not be found in local searches of DCF - that is, using the DCF local search function. I had presumed this could be the result of (say) either a local indexing failure, or deliberate site blocking (non-indexing) of certain strings/posts by admins.


* (ii) Some text strings of more than adequate length were not found by duckgo searches, even though the strings appeared in one or more DC Forum posts. I had presumed this could be the result of deliberate site blocking (e.g., via robots.txt) of certain posts, or other blocking, so that search crawlers were inhibited/blocked.


* (iii) Some discussion threads in the main DCF forum being unavailable in Wayback beyond a certain point due to robots.txt blocking. I had presumed this could be the result of deliberate site blocking of certain strings/posts by admins.
...
Any daylight on this would be helpful.
-IainB (October 05, 2015, 02:41 PM)
--- End quote ---
I've seen much of this too Iain, so here are my guesses:

1. The new search is a disaster. I think it was one of the days I was working on the Getting Things Done thread and for the life of me I couldn't get it to come up in the *new* DC search. It seems to do "OR" searches so I'd get all threads for "getting", and "Things" and "Done".

The old board had a pretty good hand optimized search by someone from DC, and Mouser hasn't had time / figured out how to port it over.

2. The crawlers of different search engines work differently, and with varying success. Unfortunately, I have to report that I like DuckGo's philosophy, but its engine simply gave me mediocre results, a lot like you are saying: you can be staring at the relevant DC post in one window and DuckGo simply refuses to pull it.

Then change engines and, as much as I *don't* like their philosophy, Google IS basically the best except in weird cases. I've done the same search in like 4 engines (including Binged-Yahoo and Startpage) and they come up with different results of varying quality. So it can't be a block per se if Google finds it and Duck doesn't.

3. Maybe it costs money to run scans at sufficient intervals or something, but Wayback is simply, sadly, not reliable for thoroughness, and I know you are a thorough person. So a page that, for example, you know changes every day only has something like 5-20 captures for that month. Again, Google's spider is the best because it runs at lightning speed, and I think I recall that once a post I had made an hour earlier showed up in Google.
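
To make the OR-versus-AND point in item 1 concrete, here is a minimal Python sketch. It is not the forum's actual search code: the `search` function and the thread titles are invented for illustration, and real forum search also involves indexing, stemming and ranking that are left out here. It only shows why an OR match on "Getting Things Done" drags in every thread containing any one of the three words, while an AND match keeps only the thread containing all of them.

```python
# Hypothetical thread titles, invented purely for illustration.
TITLES = [
    "Getting Things Done experiment",
    "Getting started with screenshots",
    "Things I learned this week",
    "Done with my old laptop",
]


def tokenize(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


def search(titles, query, mode="and"):
    """Return titles matching the query words.

    mode="and": every query word must appear in the title.
    mode="or":  at least one query word must appear in the title.
    """
    terms = tokenize(query)
    match = all if mode == "and" else any
    return [
        title
        for title in titles
        if match(term in tokenize(title) for term in terms)
    ]


if __name__ == "__main__":
    print("AND:", search(TITLES, "Getting Things Done", mode="and"))
    print("OR: ", search(TITLES, "Getting Things Done", mode="or"))
```

With these made-up titles, the AND search returns only the first thread, while the OR search returns all four, which matches the behaviour described in item 1.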

IainB:
@TaoPhoenix: Thanks for that. I think I had seen you bellyaching about the search before, but I didn't understand that as, from my perspective, the local search always worked pretty well, and google or duckgo searches gave similar/consistent results.
By the way, duckgo incorporates google searches.

It is curious though.
Local search: if you give it a 7-letter word, it will tend to reliably give you all the occurrences of that string, but sometimes it will consistently omit at least one occurrence that you know of. I presumed this could be due to an indexing failure.

Public searches: google and duckgo searches consistently do or do not display the same strings, and it is consistent over time too.

Wayback: Wayback itself tells you when robots.txt is blocking.

Wrong date and time of posting: is true also.

Having managed several projects to implement Internet/Intranet websites and CMS (Content Management Systems), I am aware that many/most of these behaviours can generally be set and are controllable by the website owner.

mouser:
Some discussion threads in the main DCF forum being unavailable in Wayback beyond a certain point due to robots.txt blocking. I had presumed this could be the result of deliberate site blocking of certain strings/posts by admins.
--- End quote ---

the only blocking we do in robots.txt is to block redundant non-canonical versions of pages -- ie the mobile versions of pages where duplicate content is available on another, indexed page.

the only other blocking that should take place is in the basement area (since you must sign up to view those pages).

other than that we try very hard (with custom constantly updated site map) to ensure that google and other indexes can index everything on dc as fast as possible.  we impose no limits on indexing speed, etc. the more the better.  and i should note that google tends to index dc forum posts VERY quickly -- usually within the hour.

we welcome the wayback machine to index everything :)
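
For anyone wanting to check how rules of this kind behave, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules, paths and `example.com` URLs below are hypothetical and are not copied from the real DC robots.txt; they only mimic the two kinds of blocking described above (mobile duplicates and a sign-up-only area).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: these paths are invented for illustration and are
# NOT taken from the real DC robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mobile/
Disallow: /basement/
"""


def is_crawlable(url, user_agent="*"):
    """Return True if the sample rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/forum/topic-1",     # canonical page
        "https://example.com/mobile/topic-1",    # mobile duplicate
        "https://example.com/basement/topic-2",  # members-only area
    ):
        print(url, "->", "allowed" if is_crawlable(url) else "blocked")
```

Under these assumed rules, a crawler that honours robots.txt would skip the mobile and basement URLs but still fetch the canonical forum page. To test a live site instead, `RobotFileParser` can be pointed at a real robots.txt with `set_url()` followed by `read()`.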

mouser:
1. The new search is a disaster. I think it was one of the days I was working on the Getting Things Done thread and for the life of me I couldn't get it to come up in the *new* DC search. It seems to do "OR" searches so I'd get all threads for "getting", and "Things" and "Done".
The old board had a pretty good hand optimized search by someone from DC, and Mouser hasn't had time / figured out how to port it over.
--- End quote ---

im not convinced it's a disaster -- it has some pros and cons.

having said that i think i will try to port over the old custom search so we can compare.
