  • September 30, 2016, 11:57:00 PM
Author Topic: What are the rules for local v. public index/search of DC Forum + Wayback saves?  (Read 1063 times)

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 6,054
  • Slartibartfarst
What are the rules for local v. public index/search of DC Forum + Wayback saves? I'm not sure whether the rules are somehow broken, or working as specified (rightly or wrongly).

When trying to see if I might be able to help/contribute more in a discussion on a topic in DCF, I will often resort to:
  • (a) Exploring the DC Forum - by using the local search functionality provided - to search for some of the terms used in the discussion, and to see what else might have been mentioned about that topic elsewhere on the Forum.

  • (b) Exploring more widely by doing a duckgo search of some of the terms used in the discussion, to see what else might have been mentioned about that topic elsewhere on the Internet.

  • (c) Conducting Wayback searches for links/URLs which are old or give a 404 response.

If the searches turned up some useful links then I would often post whatever I found - e.g., the DC Forum links, or the duckgo search URL, or the Wayback URL - into the discussion, on the basis that being specific with references was likely to be more useful than simply saying something such as, for example, "We already discussed that on DC Forum", or "RTFM", or "Look in Wayback", or something.

However, I am a bit confuzzled by what I have observed in an ad hoc manner as and when I have performed searches, and over an extended period from some time back, because sometimes:
  • (i) Some text strings of more than adequate length could not be found in local searches of DCF - that is, using the DCF local search function. I had presumed this could be the result of (say) either a local indexing failure, or deliberate site blocking (non-indexing) of certain strings/posts by admins.

  • (ii) Some text strings of more than adequate length were not found in some or any DC Forum posts, in duckgo searches, even though the strings were in one or more DC Forum posts. I had presumed this could be the result of (say) deliberate site blocking - e.g., (say) robots.txt of certain strings/posts, or other blocking, so that search crawlers were inhibited/blocked.

  • (iii) Some discussion threads in the main DCF forum being unavailable in Wayback beyond a certain point due to robots.txt blocking. I had presumed this could be the result of deliberate site blocking of certain strings/posts by admins.

  • (iv) Some posts appearing in the DCF "Best Of Blog" but with the wrong date and time of posting. I had presumed this could be the result of an indexing rule error.

Any daylight on this would be helpful.

TaoPhoenix

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 4,548
Quote
...
However, I am a bit confuzzled by what I have observed in an ad hoc manner as and when I have performed searches, and over an extended period from some time back, because sometimes:
  • (i) Some text strings of more than adequate length could not be found in local searches of DCF - that is, using the DCF local search function. I had presumed this could be the result of (say) either a local indexing failure, or deliberate site blocking (non-indexing) of certain strings/posts by admins.

  • (ii) Some text strings of more than adequate length were not found in some or any DC Forum posts, in duckgo searches, even though the strings were in one or more DC Forum posts. I had presumed this could be the result of (say) deliberate site blocking - e.g., (say) robots.txt of certain strings/posts, or other blocking, so that search crawlers were inhibited/blocked.

  • (iii) Some discussion threads in the main DCF forum being unavailable in Wayback beyond a certain point due to robots.txt blocking. I had presumed this could be the result of deliberate site blocking of certain strings/posts by admins.
    ...
    Any daylight on this would be helpful.
I've seen much of this too Iain, so here are my guesses:

1. The new search is a disaster. I think it was one of the days I was working on the Getting Things Done thread and for the life of me I couldn't get it to come up in the *new* DC search. It seems to do "OR" searches so I'd get all threads for "getting", and "Things" and "Done".

The old board had a pretty good hand optimized search by someone from DC, and Mouser hasn't had time / figured out how to port it over.

2. The crawlers of different search engines work differently, and with varying success. Unfortunately, I have to report that although I like DuckGo's philosophy, its engine simply gave me mediocre results, a lot like you are saying: you can be staring at the relevant DC post in one window and DuckGo simply refuses to pull it.

Then change engines and as much as I *don't* like their philosophy, Google IS basically the best except weird cases. I've done the same search in like 4 engines (including Binged-Yahoo and Startpage) and they come up with different results of varying quality. So it can't be a block per se if Google finds it and Duck doesn't.

3. Maybe it costs money to run scans at sufficient intervals or something, but Wayback is simply, sadly, not reliable for thoroughness, and I know you are a thorough person. So, for example, a page that you know changes every day only has something like 5-20 snapshots for that month. Again, Google's spider is the best because it runs at lightning speed, and I think I recall that a post I had made an hour earlier once showed up in Google.
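
To illustrate the "OR" behaviour described in point 1, here is a minimal Python sketch. It is not the forum's actual search code: the thread titles and the functions `or_search` and `and_search` are made up for illustration. It only shows why matching any word of a query returns far more threads than matching all of them.

```python
# Hypothetical thread titles, for illustration only.
THREADS = [
    "Getting Things Done",
    "Getting started with screenshots",
    "Things I wish I had known",
    "Done with this browser",
]


def words(text):
    """Lower-case a string and split it into a set of words."""
    return set(text.lower().split())


def or_search(query, titles):
    """Return titles containing ANY query word (the behaviour complained about)."""
    wanted = words(query)
    return [t for t in titles if wanted & words(t)]


def and_search(query, titles):
    """Return titles containing ALL query words."""
    wanted = words(query)
    return [t for t in titles if wanted <= words(t)]


or_hits = or_search("getting things done", THREADS)
and_hits = and_search("getting things done", THREADS)
print(len(or_hits), "threads match with OR")
print(len(and_hits), "thread(s) match with AND:", and_hits)
```

With these made-up titles, the OR version returns every thread, while the AND version returns only the one thread that contains all three words.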

IainB

@TaoPhoenix: Thanks for that. I think I had seen you bellyaching about the search before, but I didn't understand that as, from my perspective, the local search always worked pretty well, and google or duckgo searches gave similar/consistent results.
By the way, duckgo incorporates google searches.

It is curious though.
Local search: if you give it a 7-letter word, it will tend to reliably give you all the occurrences of that string, but sometimes it will consistently omit at least one occurrence that you know of. I presumed this could be due to an indexing failure.

Public searches: google and duckgo searches consistently do or do not display the same strings, and it is consistent over time too.

Wayback: Wayback itself tells you when robots.txt is blocking.

Wrong date and time of posting: is true also.

Having managed several projects to implement Internet/Intranet websites and CMS (Content Management Systems), I am aware that many/most of these behaviours can generally be set and are controllable by the website owner.

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 36,276
Quote
Some discussion threads in the main DCF forum being unavailable in Wayback beyond a certain point due to robots.txt blocking. I had presumed this could be the result of deliberate site blocking of certain strings/posts by admins.

the only blocking we do in robots.txt is to block redundant non-canonical versions of pages -- ie the mobile versions of pages where duplicate content is available on another, indexed page.

the only other blocking that should take place is in the basement area (since you must sign up to view those pages).

other than that we try very hard (with custom constantly updated site map) to ensure that google and other indexes can index everything on dc as fast as possible.  we impose no limits on indexing speed, etc. the more the better.  and i should note that google tends to index dc forum posts VERY quickly -- usually within the hour.

we welcome the wayback machine to index everything :)
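
For anyone who wants to check how robots.txt rules of the kind described above would treat a given page, here is a rough sketch using Python's standard `urllib.robotparser`. The rules, the `/mobile/` and `/basement/` paths, the `example.com` host and the `ExampleBot` agent name are all assumptions made up for illustration; they are not the site's real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, NOT the site's real robots.txt: block a mobile
# duplicate of each page and a members-only area, allow everything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mobile/
Disallow: /basement/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a crawler identifying as `agent` may fetch `path`."""
    return parser.can_fetch(agent, "https://example.com" + path)


for path in ("/forum/topic-123", "/mobile/topic-123", "/basement/topic-9"):
    print(path, "->", "allowed" if allowed(path) else "blocked")
```

Under these assumed rules, the ordinary forum page stays crawlable and only the mobile duplicate and the members-only area are blocked. To test a live site, `set_url()` and `read()` on the same parser would fetch the real file instead of the string used here.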

mouser

Quote
1. The new search is a disaster. I think it was one of the days I was working on the Getting Things Done thread and for the life of me I couldn't get it to come up in the *new* DC search. It seems to do "OR" searches so I'd get all threads for "getting", and "Things" and "Done".
The old board had a pretty good hand optimized search by someone from DC, and Mouser hasn't had time / figured out how to port it over.

im not convinced it's a disaster -- it has some pros and cons.

having said that i think i will try to port over the old custom search so we can compare.

mouser

Quote
and i should note that google tends to index dc forum posts VERY quickly -- usually within the hour.

at least they used to -- i did some searches for posts from today and can't find them.

why isn't google indexing our pages quickly now.. i have no idea. is there a way to find out? doesn't appear so. is there someone to ask? doesn't appear so.  what do we do? nothing. just wait for the gods of google to decide who they want to favor or disfavor as usual.

IainB

Thanks for the info about the rules @mouser.
...we welcome the wayback machine to index everything
_____________________
- presumably that refers to "everything" except that which is blocked by robots.txt or other means?

Interesting about the public search content indexing latency, but which failed google/duckgo search is temporary and due to this latency, and which is permanent and due to some content never being indexed? Could the latency or failure to index be attributable to long server response times to the crawlers, or something, which could lead to default timeouts if maximum response time performance thresholds were exceeded?
I often find the DCF website to be sluggish, but had attributed that to FF not the website.

None of that accounts for this sort of - what seems to be - date/time error though:

[Attached image: DCF - date-time error of post in Best Of Blog (750).png]

I could make a guess as to how it got that date/time, but it still seems wrong.
___________________________________

f0dder

  • Charter Honorary Member
  • Joined in 2005
  • ***
  • Posts: 9,027
  • [Well, THAT escalated quickly!]
Quote
2. The crawlers of different search engines work differently, and with varying success. Unfortunately, I have to report that although I like DuckGo's philosophy, its engine simply gave me mediocre results, a lot like you are saying: you can be staring at the relevant DC post in one window and DuckGo simply refuses to pull it.

Then change engines and as much as I *don't* like their philosophy, Google IS basically the best except weird cases. I've done the same search in like 4 engines (including Binged-Yahoo and Startpage) and they come up with different results of varying quality. So it can't be a block per se if Google finds it and Duck doesn't.
That's my experience as well. I tried using DuckDuckGo for a while after Google was forced to obey the atrocious "right to be forgotten" EU crap, but DDG had pathetic results :(
- carpe noctem

f0dder

Quote
why isn't google indexing our pages quickly now.. i have no idea. is there a way to find out? doesn't appear so. is there someone to ask? doesn't appear so.  what do we do? nothing. just wait for the gods of google to decide who they want to favor or disfavor as usual.
The Google Webmaster Tools say nothing?
- carpe noctem

mouser

Quote
The Google Webmaster Tools say nothing?

no not really. i did check the sitemap, and it is listing recent pages, and it does show as being parsed by google without errors.  google webmaster tools shows that they have "decided" not to index most of the pages in the sitemap, though it doesn't say why. welcome to google world.
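
The "is the sitemap listing recent pages" check mentioned above can be done by hand roughly as follows. This is only a sketch: the sitemap content, URLs and dates are invented, and `recent_entries` is a hypothetical helper, not part of the site's tooling. It confirms what the sitemap advertises, not whether any search engine has actually indexed those pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; the URLs and dates are made up.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/forum/topic-1</loc><lastmod>2016-09-30</lastmod></url>
  <url><loc>https://example.com/forum/topic-2</loc><lastmod>2016-09-12</lastmod></url>
  <url><loc>https://example.com/forum/topic-3</loc><lastmod>2016-09-29</lastmod></url>
</urlset>
"""

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recent_entries(xml_text, since):
    """Return (lastmod, loc) pairs with lastmod >= since, newest first."""
    root = ET.fromstring(xml_text)
    found = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # ISO-style dates (YYYY-MM-DD) sort correctly as plain strings.
        if lastmod and lastmod >= since:
            found.append((lastmod, loc))
    return sorted(found, reverse=True)


recent = recent_entries(SITEMAP_XML, "2016-09-25")
for lastmod, loc in recent:
    print(lastmod, loc)
```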

Deozaan

  • Charter Member
  • Joined in 2006
  • ***
  • Points: 1
  • Posts: 7,651
Quote
I could make a guess as to how it got that date/time, but it still seems wrong.

The date is for the post in that thread which you made on September 26th. Or I guess maybe it's just the date/time the blog entry was made.