

Scraper too expensive at 20 bucks


peter.s:
(The original post at BitsDuJour, referred to here, was: "$19+ seems an awful lot of money for software you can get the same type of thing for nothing. (...)")

The problem lies elsewhere. A price of 20 bucks is certainly not a deal breaker, nor would 40 bucks (the original price) be, and there are competitors that cost several hundred bucks and which are not necessarily better, or much better.

First,

if you search for "download manager", the web (and the people whose contributions constitute it) mixes up web scrapers (like A1) with tools for downloading files the user has specified beforehand, but which do the downloading over multiple threads instead of just one, thereby using a possibly fast internet connection to its fullest; of course, most scrapers include such accelerating functionality too. This lack of discrimination in what commentators call a "download manager" does not facilitate the discussion to begin with; you should perhaps distinguish "scrapers" and "download accelerators" for a start, and there is also a middle category, pseudo-scrapers which just download the current page without following its links.
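(To make the distinction concrete, here is a minimal Python sketch of the "download accelerator" side only: it fetches a fixed, user-supplied list of URLs over several parallel threads and follows no links at all. Real accelerators additionally split single large files into HTTP byte ranges; the URLs below are placeholders, and the third-party "requests" package is assumed.)

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests  # third-party: pip install requests


def fetch(url, out_dir="downloads"):
    """Download one URL to disk; no links are followed."""
    os.makedirs(out_dir, exist_ok=True)
    name = os.path.basename(urlparse(url).path) or "index.html"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(os.path.join(out_dir, name), "wb") as fh:
        fh.write(resp.content)
    return url


# Placeholder URLs - a download accelerator works from a list like this.
urls = ["https://example.com/file1.zip", "https://example.com/file2.zip"]

with ThreadPoolExecutor(max_workers=4) as pool:  # 4 parallel connections
    for finished in pool.map(fetch, urls):
        print("finished:", finished)
```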

Second,

the big problem for scrapers nowadays is Ajax and database techniques: many of today's web pages are no longer static, but are built up from multiple elements coming from various sources, and you do not even see those scripts in full; the scripts you can read via "view page source" refer back to scripts on their servers, and almost anything done behind those scenes cannot be replicated by ANY scraper (not even by guessing parts of it and building up some alternative functionality from those guesses), so the remark that A1's copies of scraped Ajax pages do not "work" is meaningless.

The only other remark about A1 I found on the web was that you get "the whole page" instead of just the photos when you would like to download only the photos of a web page; IF that is right, it was indeed a weakness of A1, since these "selected content only" questions are the core functionality today's scrapers could and should have, within the general framework described above in which "original web page functionality" can no longer be replicated for many pages (which are often the ones of most interest = with the most money behind them = with both the "best" content and lots of money for ace programming).
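(As an illustration of what "selected content only" means in its simplest, static form, a small Python sketch that pulls just the image URLs out of a single page instead of saving the whole page; the page URL is a placeholder, and "requests" plus "beautifulsoup4" are assumed.)

```python
from urllib.parse import urljoin

import requests                      # pip install requests
from bs4 import BeautifulSoup        # pip install beautifulsoup4

page_url = "https://example.com/gallery.html"    # placeholder page
html = requests.get(page_url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Collect absolute URLs of all images on this one (static) page ...
image_urls = [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]

# ... and these, rather than the whole page, would then be downloaded.
for u in image_urls:
    print(u)
```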

Thus, "taking up" with server-side programming has become almost impossible for developers anyway, so they should revert to optimization of choosing selected content, and of making that content available, at least in a static way, and it goes without saying that multiple different degrees of optimization of that functionality are imaginable: built-in "macros" could replicate at least some standard connections between screen/data elements "on your side", and of which the original triggers are lost, by downloading, but this would involve lots of user-sided decisions to be made, and hence lots of dialogs the scraper would offer the user to begin with ("click on an element you want as a trigger, then select data (in a table e.g.) that would be made available from that trigger", or then, big data tables, which then you would hierarchically "sort" in groups, in order to make that data meaningful again).

It's clear as day that the better the scraper's guesses in such scenarios, the easier such partial re-constitution of the original data would often become, and also that programming such guesses, and the services offered from them, would be both very "expensive" in programming terms and a never-ending task, all this because today's web technologies succeed in hiding what's done on the server side.

In other words, from yesterday's even very complicated but static, or pseudo-dynamic, web pages (i.e. everything fetched from databases, but in a stringent, easily replicated way) to today's dynamic web pages, it has been a step beyond what scrapers could sensibly be expected to handle.

But it's also obvious that scrapers should at least handle perfectly "what they've got", and the above-mentioned example (as said, found on the web) of "just downloading the pics of a page", whilst being totally realistic, is far from sufficient as a feature request:

In so many instances, the pics on the current page are either just thumbs, or pics in some intermediate resolution, and the link to the full-resolution pic is only available from the dedicated page of that middle-resolution pic; the situation is further complicated by the fact that often the first or second resolution is available but the third is not, and that within the same start page, i.e. for the same task on arrival, for some pics the scraper / script would have to follow two or three links, while for other pics linked to on the same page it would have to follow just one or two.

This being said, of course, such "get the best available resolution for the pics on current page" should be standard functionality for a scraper.
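(A hedged sketch of what such "best available resolution" logic could look like; the CSS selectors are purely hypothetical, since every site names its thumbnail and "larger version" links differently.)

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def soup_of(url):
    return BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")


def best_images(gallery_url, max_depth=3):
    """From each thumbnail, follow up to max_depth links and keep the last
    (i.e. presumably largest) image version found along the way."""
    results = []
    for a in soup_of(gallery_url).select("a.thumb"):            # hypothetical selector
        url, depth, best = urljoin(gallery_url, a["href"]), 0, None
        while url and depth < max_depth:
            page = soup_of(url)
            img = page.select_one("img.full") or page.select_one("img")   # hypothetical
            if img and img.get("src"):
                best = urljoin(url, img["src"])
            nxt = page.select_one("a.larger-version")           # hypothetical "bigger" link
            url = urljoin(url, nxt["href"]) if nxt else None
            depth += 1
        if best:
            results.append(best)
    return results
```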

But, all this being said, it also seems quite evident to me that for tasks beyond such "elaborate standard tasks" (which could be made available by the scraper "guessing" possibly relevant links, having the user choose from the intermediate results, and then building up the necessary "rule(s)" for the site in question), scraper programming comes with the additional problem that such "specific rule building" would be split into a) what the scraper makes available and b) what the user can make out of those pre-fetched instruments; in fact, the better, easier, and ultimately far more powerful solution (because the limitations of the intermediate step would be done away with, together with that intermediate step) would be to do your own scripting, ideally with some library of standards at your disposal.

(Readers here at DC will remember my - unanswered - question about how to jump immediately to "page x" (e.g. 50) of an "endless" Ajax page of perhaps 300 such partial "pages" (or whatever you like to call those additions), instead of "endlessly" scrolling down to it.)

Anyway, precise selection of what the user wants to scrape, and of "what not", should be possible in detail, and not only for links to follow on the start page, but also for links further down, at the very least for links "on page 2", i.e. on several kinds (!) of pages which only have in common that all of them are one level "down" from the respective "start page" (I assume there are multiple similar such "start pages", all of them to be treated in a similar, but not identical, way; see above).
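(Sketched in Python, such per-level selection might look roughly like this; the regexes stand in for the "kinds of links" mentioned above and would have to be adapted per site.)

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# depth -> pattern a link must match to be followed on that level
RULES = {
    0: re.compile(r"/Subject/"),                   # links to follow on the start page(s)
    1: re.compile(r"/detail/"),                    # links to follow "on page 2"
    2: re.compile(r"\.(jpg|jpeg|png)$", re.I),     # leaf level: the files themselves
}


def crawl(url, depth=0, seen=None):
    seen = seen if seen is not None else set()
    if depth not in RULES or url in seen:
        return
    seen.add(url)
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if RULES[depth].search(link):              # only the allowed kind for this level
            print("  " * depth + "follow: " + link)
            crawl(link, depth + 1, seen)
```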

Third,

so many scrapers (and download accelerators, too) tout their respective accelerating power, but few, if any, mention the biggest problem of them all: more and more server programs quickly throw your IP(s!) and even your PC out of their access scheme should you dare scrape big content and/or, repeatedly, updated content; and again, as above, the more elaborate the content and their server-side page-build-up programming, the higher the chances that they have sophisticated scraper detection, too.

What most people do not know when they choose their tunnel (VPN) provider is that in such "heavy-scraping" scenarios it's quite "risky" to sign a full-year contract (let alone anything beyond a year), and that there are special providers where you instead rent multiple IPs at the same time - which comes at a price.

With these multiple addresses, many scraping guys think they are on the safe side - but what good are multiple addresses "abroad" (from the server's pov) when, in country x, no such provider can provide you any, or more than just a handful of, "national" IPs?
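(For illustration only, a minimal sketch of rotating a small pool of rented proxy IPs between requests; the proxy endpoints are placeholders, and, as argued, this does nothing about the geography problem - the exit IPs are what they are.)

```python
import itertools

import requests

# Placeholder endpoints for a small pool of rented proxy IPs.
PROXIES = [
    "http://user:pass@proxy1.example.net:8080",
    "http://user:pass@proxy2.example.net:8080",
    "http://user:pass@proxy3.example.net:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)


def fetch_via_pool(url):
    """Send each request through the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```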

And it does not end there. How does your script's behaviour look, from the server's pov again? Do you really think they cannot "put it all together again" when your scraping follows detectable rules? To begin with, the work is probably split between your IPs with no overlap at all - which makes reassembling the parts on your side easy, but which is obviously a big mistake, since exactly that kind of pattern is detectable on theirs, right? He, he...

And you're spacing your requests, of course, so the server won't detect that it's a machine fetching the data? He, he, again: just spacing the requests in time does not mean the server will think it is seeing a real person looking for the data the way a bona fide prospect would.

Not to speak of the fact that bona fide prospects look in certain standard ways which nevertheless are never quite the same, and that they don't just download sequentially ("sequential" does not mean follow link 1, then 2, then 3; it can be link 35, 482, 47, whatever - but download, download, download!); they go back to some earlier page, press F5 here or there (but not systematically, of course), and so on, in endless variations. As soon as a script seems detectable, those servers alert a real person on their side, who will then look into things, relying on their scripts for further pattern detection: time of day of such a "session", amount of data downloaded, number of elements downloaded, order in which (sub-)elements are downloaded (patterns too similar and/or not "real-life" enough).
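(Again just a sketch of the "pacing" side: randomized delays, shuffled order, the occasional re-visit; as argued above, this alone will not fool a server that correlates sessions, volumes and times of day.)

```python
import random
import time

import requests


def browse_like_a_person(urls):
    """Fetch URLs in shuffled order with irregular pauses and occasional
    re-visits - i.e. pacing only, which by itself proves nothing."""
    queue = list(urls)
    random.shuffle(queue)                        # not sequential
    history = []
    for url in queue:
        requests.get(url, timeout=30)
        history.append(url)
        time.sleep(random.uniform(2.0, 15.0))    # irregular spacing
        if random.random() < 0.15:               # sometimes go back or "press F5"
            requests.get(random.choice(history), timeout=30)
            time.sleep(random.uniform(1.0, 8.0))
```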

Then, even if you get all this nearly perfect, by having your machines replicate the real-life behaviour of different real persons, even most real-life prospects will not remain interested in the same or similar data over the years - most of them not even over several months in a row!

And all this with the concurrent problem of the geographic distribution of your IPs again: where almost all of their bona fide prospects sit in some specific country, or even in some specific region of that country, all of the above problems, even if resolved in perfect ways (and this necessarily includes lots of overlaps if you want your global scheme to remain "realistic"), will be only partial solutions and will not work for long if you cannot solve the problem of how to fake IPs and their geography, instead of just renting some.

My 2 cents, to put into perspective the somewhat naïve "$19+ seems an awful lot of money for software you can get the same type of thing for nothing." - and I certainly left out additional aspects I didn't think of on the fly.

ThomasSchulz:
Note to admin: This thread originates from http://www.bitsdujour.com/software/a1-website-download

Since Peter decided to post his response here (his site?) I am posting here as well. If admin wants to delete this thread, I understand. (Peter, we can move it to my support forum or email.)

...

Anyhow, that was a very long post which makes it hard to answer but I will try :)

A1 Website Download can download and convert links for entire websites, sections of websites and more. It does not matter if the pages are served dynamically or not. You can also use it to only download e.g. images or PDF files, and not download the actual web pages. For those who need it, it includes some very rich filtering options (including support for regular expressions), so you can define exactly what you want. This download is all automatic and does not include any guessing or prompting the user. The crawler simply dives through all pages and tries to discover all links (including those in e.g. CSS and Javascript files) and fix them, if necessary, when downloaded to disk. (You can also partially control/configure how this is done depending on your specific needs.) You are correct though, sometimes the Javascript parsing is not perfect, but it works *pretty well* for most sites. (And the problem you mention about website downloaders in general only downloading thumbnails should not be a problem at all. However, should you know of websites causing issues, please do feel free to drop me an email.)

Please note that A1 Website Download is not meant to be used as a scraper that extracts specific data (e.g. catalogs) and converts such data into e.g. .CSV files. (I offer a different tool for that.) It is meant for what the name states: downloading websites, or larger portions thereof.

You mention that many webservers will ban IP addresses when you download websites. I can say that the default settings in A1 Website Download are configured so it will not cause problems for most. This also means the default settings only run at about 1/10 of max speed, which will still be fast enough to download most websites reasonably quickly. In addition, you can, if you wish, configure very precisely how many connections to use, pauses in-between and much more. As such, what you describe is a relatively rare problem with many workarounds if you consult the documentation or drop me an email.

The only thing I can fully agree with is that downloading AJAX websites can be very problematic. However, A1 Website Download implemented some AJAX support around 4.0.x - more precisely for those websites that implement Google's suggestion on making AJAX websites crawlable: https://developers.google.com/webmasters/ajax-crawling/ Another area that can be problematic is login support, as the systems behind it get more and more complex. When I started developing the A1 series in 2006, session cookies behind POST forms were used on many websites, which was very easy to handle. Now it is much more complicated and hit or miss.
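For those interested, the URL mapping in that Google suggestion is simple: a "#!" (hash-bang) URL is requested from the server with the fragment moved into an "_escaped_fragment_" query parameter, and the server answers with a pre-rendered snapshot. A rough sketch of the mapping only (not A1's code):

```python
from urllib.parse import quote


def escaped_fragment_url(url):
    """Map a '#!' URL to its '_escaped_fragment_' equivalent."""
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return base + sep + "_escaped_fragment_=" + quote(fragment, safe="")


print(escaped_fragment_url("https://example.com/app#!page=50"))
# -> https://example.com/app?_escaped_fragment_=page%3D50
```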

Note: I tried to answer most of your questions - if I missed anything, feel free to drop me an email, and I will explain it more fully and/or point to the relevant help page. Your post is good because it describes much of the work that goes into building a website downloader, so even if we disagree, and your fundamental opinion is that such "website download" jobs should be built custom for each website (if I understood you correctly), it gives a good impression of the overall problems faced!


wraith808:
Welcome to the site, ThomasSchulz!  Hope you stick around for more than a response... we're a loose conglomeration of software enthusiasts and coders, 'led' by the inestimable mouser.  Thanks for the response, and hope to see you around on a more congenial basis.

peter.s:
My post above was not so much about A1 or any other specific tool, but meant in general; the A1 offer at bits was just the trigger for my considerations. But of course I gave the link to DC over there, which Thomas then promptly followed... ;-) Here's my answer from over there, with "over there" meaning here, and so on... ;-) :

Thomas,

Just to clarify, it was not my intention to denigrate A1, and I very much hope the title I gave the thread cited above appears as perfectly ironic as it was intended.

I should have clarified above - I do it here though - that I consider A1 a perfectly worthy example of a "standard scraper", and more so, possibly "perfect", or at least very good, for almost any "un-professional", i.e. amateur, scraping task (and from your comments, I see that the imperfections of A1 I found described in the rare web comments about it have been dealt with in the meantime).

Also, there seems to be a slight misunderstanding: automatisms are good, but the possibility of deselecting and tweaking automatisms is even better, since very often scrapers follow links too fervently, instead of following only some kinds/groups of links (and I mentioned the programming problems involved in making such choices available). It's not about ".jpg only", or even "pics within a given size range only" and such; also, the standard "this page and its children down to 1/2/3 levels" is not really helpful, since (even for "amateurs") it is often necessary to follow links of some kinds rather deep whilst not following links of other kinds.

As for the "heavy scraping problem", there is also a legal problem, which consists of new-kind "authors' rights", in most European countries, to "data bases", even if those db's only consist of third-party advertizing / good offerings, with no content contribution whatsoever from the owner of the target site (but who, e.g. for vacancy notices, gets often paid 800 euro, some 1,000 bucks for publishing that ad for a mere 4 weeks, AND holds "authors' rights" to that same ad as part of his db); this being said, it's clear as day that such considerations are perfectly irrelevant within the context of a "consumer product" like A1, and this denomination of mine is certainly not meant in order to tear down A1 either.

But there clearly is a schism between professional use, for which highly elaborate custom scripting is necessary (and, as explained, not even sufficient), and "consumer use"; in this latter category, the above-mentioned tweaking possibilities for "which groups of links to follow, and how, respectively" could certainly make the necessary distinction among "consumer scrapers".

Or let's get as precise as it gets: years ago, I trialled several such "consumer scrapers" in order to get all of William Gedney's Kentucky 1964 and 1972 photos (i.e. not even photo scraping is all about porn; sometimes it's about art) from the Foundation's web site, but in the best resolution available there, and that was not possible with those scrapers, since there was an intermediate quality level between the thumbs and the quality I was after - perhaps I did it wrong at the time; anyway, I succeeded by writing my own download script.

Just for fun, I checked that page again:

http://library.duke.edu/digitalcollections/gedney/Subject/Cornett%20Family

and verified the current state of things:

pages of kind a: some 50 pages with thumbs (for more than 900 photos),

then target pages (= down 1 level, for intermediate photo quality),

and there, links to the third level, with the full quality, but also many, many links to other things:

Whilst from level 1 to level 2 it's "easy", it's obvious that on level-2 pages highly selective link following (i.e. just follow the link to the full-size pic and nothing else) would be asked for, but that is probably not possible with most consumer scrapers even today: would it be possible to tweak A1's link following that way? Again, we're speaking not of general settings, but of settings applying to level-2 pages only.

Well, whilst my first DC post was rather theoretical, here we've got a real-life example. It's clear as day that if a tool for 20 or 40 bucks did this, with some easy tweaking, I'd call such a tool "near perfect": it's all about discrete, selective link following. ;-)

(Or then, I'd have to continue to do my own scripts for every such task I encounter...)
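(For illustration, the kind of throw-away script I mean, sketched in Python for the Gedney case; the link rule and the file-extension pattern are hypothetical, since the Duke site's actual markup may differ, and "requests" plus "beautifulsoup4" are assumed.)

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START = "http://library.duke.edu/digitalcollections/gedney/Subject/Cornett%20Family"


def soup_of(url):
    return BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")


for a in soup_of(START).find_all("a", href=True):           # level 1: the thumbs page
    if "/gedney/" not in a["href"]:                          # ignore navigation etc. (hypothetical rule)
        continue
    detail = urljoin(START, a["href"])                       # level 2: one photo's own page
    full = soup_of(detail).find("a", href=re.compile(r"\.(jpe?g|tiff?)$", re.I))
    if full:                                                 # level 3: the full-resolution file
        print("full resolution:", urljoin(detail, full["href"]))
```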



P.S. If by this scheme of mine, Gedney's work will get some more followers, that wouldn't be a bad thing either. ;-)

ThomasSchulz:
Hi Peter,

Since you can use any number of regular expressions to

1)
Match URLs you want:
* Excluded from analysis
* Limited (limit-to) analysis to
(all filters in the two above get combined when deciding if a URL should be analyzed)

2)
Match URLs you want:
* Excluded from output/download-and-keep
* Limited (limit-to) output/download to
(all filters in the two above get combined when deciding if a URL should be output/downloaded-and-kept)

it should be possible to do what you want. It is one of the more complex but also powerful features of the A1 tools.
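To illustrate how such combined "exclude" and "limit-to" lists typically behave (this is only a generic sketch of the filter logic, not A1's actual code; the example patterns are made up for the Gedney case above):

```python
import re


def url_passes(url, exclude_patterns, limit_to_patterns):
    """Generic combined-filter logic: any 'exclude' match rejects the URL;
    if 'limit-to' patterns exist, at least one of them must match."""
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    if limit_to_patterns:
        return any(re.search(p, url) for p in limit_to_patterns)
    return True


# e.g. analyse only the Gedney pages, but never the search pages:
print(url_passes(
    "http://library.duke.edu/digitalcollections/gedney/Subject/Cornett%20Family",
    exclude_patterns=[r"/search"],
    limit_to_patterns=[r"/gedney/"],
))
```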

But I will take a look later! :)
