Messages - peter.s [ switch to compact view ]

Pages: prev1 2 3 [4] 5 6 7 8 9 ... 24next
General Software Discussion / Re: The impossible thing software
« on: January 30, 2015, 09:20 AM »
Is there any site which publishes the most outrageous obscenities, created inadvertently by their authors' lack of English? If there is, I would happily suggest my top 1 find of many years for their home page.

"You want to automate the presents to your mother.

You must [sic] control the tastes of your mother in a database.
And the limit you may waste. And so on."

My formatting. And my kudos. And let's face it, it's not a language prob to 100 p.c. Simply wonderful. And I very much like the gif idea. Cabaret.

General Software Discussion / Re: Scraper too expensive at 20 bucks
« on: January 18, 2015, 12:55 PM »
"I could create a project that only analyzes pages related to gedney and downloads all files used/linked from them."
We assumed that was not we wanted to do, in so general a way.

"After that, one could, if one so chooses, either:
1) start to limit further, so it does not downloaded linked pages/content outside wanted areas ."
Areas? Link groups, I presume.

"2) e.g. limit it to only download and keep images."
Again, that would often download much too much, and I prefer downloading hygiene to lots of discard later on. Ultimate irony: That discarding unwanted pics could then possibly be done within a file manager allowing for regex selection.

"URL / link normalization is always done before any filter is tested, so all that is zero problem. General rule: As long as a browser can understand something in a website, so can the crawler smiley (AJAX being an exception.)"
Brilliant! (But in some contradiction to your next paragraph?)

"Replacing content/links/URLs inside page content is also a far larger system than regular expressions."
As explained above, not necessarily so. I originally started with a (dumb) misunderstanding of mine: That regex in your scrapers was for choosing the parts of the text the user wanted to preserve; obviously, that is not their use in your software, but it's for the users' determining the links to be followed. Now, from a regex match to a regex replace there (and with input, as explained above, from the user, and which the user will have identified from manually loading pages and looking into their respective url pattern, and/or from looking into the respective page sources), it's not that big a step, and...

"It is essentially a huge multi-threaded engine that is behind it all with a ton of queues, lists and what not to ensure everything is done correctly and optimal as possible."
I don't know about traditional multi-threading, but in downloading, the term multi-threading is often used for multiple, concurrent downloading; I don't know about that either. But: It's evident, from my explications above, that all this "do different things at different download levels" is easy, as soon as you accept that in fact, doing different (and specific) things at different (= at the specific) download levels, is the natural way of doing things, and that "download whole webpages" is very Late Ninetees, even if today it's possible to exlude, by real hard work, and totally insufficiently, only SOME unwanted meanders (and all the advertizing, socialising and whatever).

"and what not"
This is a formula regularly used for deprecating, or more precisely, it shows that the writer in question has not yet got the importance of the detail(s) he's hiding behind such an all-englobing "and what not"; btw, another sophist's formula of choice is "questions", when no questions have been asked (yet) (see your top post): in fact, many of us will convene that corporations that don't deliver the service you asked, and paid, for, and if then you dare utter some requirements, these will, by corporations of that (= not-so-much customer-centered) kind, invariably met by them saying, "your questions". (end of ot)

"to ensure everything is done correctly and optimal as possible"
Well, that's why I suggested adding a second "mode", for stringent downloading, instead of trying to mingle it all together into existant, traditional mode, this second "mode" "only" going down some 5, 6 levels, but from the

"with a ton of queues, lists"
Of course, you'd be free to use lists instead of arrays, and even though that would imply some multiplication of storage elements indeed, I wouldn't expect a ton of additional elements from this choice.

"It is of course a never ending process of optimizations"
I mentioned this above: Where a custom-made script is straightforward, the intermediate step of providing, in a commercial tool, dialogs and such in order to build, within some necessary confines, some variants of proceeds, in order to run something what more or less mimicks, for standard tasks, such a custom-made script, or at least a lot of what the latter only would have provided in the last resort, complicates things for the developer of such a tool.  But then, identify standard tasks (e.g. from what you hear / listen to (?) in your forum), then write additional "modes", as the one described above, and I'm even more precise in my description: Do one big list field, with multiple indentation: Let users build up a tree (which then you process by nested loops). This will make available bifurcations even, i.e. down level 3, with 2 (or more) link kinds to follow, each of them having their own process commandments further down, i.e., in this example, 1 level 3, but 2 different levels 4, 1 with level 5 and 6, the other one with just level 1, or even it ends at level 4 for that line of scraping. The next programming step would of course be to integrate your multi-threading into this alternative mode, and then, a prof. version could even allow for user's indication how often these specific lines are to be updated (re-check for changes in that site's contents). It's not "endless", but it can be done in sensible installments, according to real-life scenarios and how to best handle them, and following the (in most cases, tree) structure of a site in a perfectly coordinated way, without gathering rubbish laying on the way, should be a core task (of which the coding will take one week since "everything around" is already there), which then could perhaps be refined in some details from the observations of your forum posters.

"which is why writing such a product can end up taking lots of time smiley"
One week, as said, and including debugging - since most of the design work has been delivered free of charge.

Thomas, it's clear as day I don't need such a crawler anymore, but not anybody who would like to do some neat scraping has got the scripting routine I've got over the years now. It's not that anybody's asking you any particular effort in one way or any other, it's just that you've got some weeks' advance over your competitors; some weeks I say because DC's got quite tons of hits from professionals (I had been astonished you hadn't become aware of DC yet before my linking here?)


General Software Discussion / Re: Scraper too expensive at 20 bucks
« on: January 17, 2015, 10:14 AM »
Hi Thomas,

I did not try myself, but I think you're right, and I realize I was too much focused upon the level1-level2 thing (which A1 (and presumably all the others, either) obviously doesn't do (yet)), but which is not really necessary either; in fact, the core functionality that is needed, is "follow ONLY links that meet (a) certain regex pattern(s)", AND it's necessary to have several such regexes if needed (as in this scenario where the level 1 regex would be different from the level 2 regex).

Then, most of tasks will be realizable this way, understood that BOTH (i.e. all) regexes will invariably be applied to ANY such level, which only in very rare cases could be a problem - in my script I applied to that site, I had differentiated the regex for level 1 and level 2, building a subroutine for that lower level, but we see here that I could have simplified the routine, according to my example as described above.

Unfortunately, and I only discover this now, putting these things together again, it was a "bad" example, i.e. not as straightforward as described my me yesterday. In fact:


level one = thumbs page, code line for the very first thumb is:

      <li class="grid_3 alpha clearBoth"><a href="/digitalcollections/gedney_KY0001/"><img src="" alt=""/><br/>Man with no fingers on right hand lighting a cigarette; view from interior of ...</a></li>


will just bring a single thumb again!,

whilst the intermediate-quality page, displayed by a click on the thumb, has the url

Such a direct link is nowhere on the source (= multiple thumbs) page, but compare with the

<a href="/digitalcollections/gedney_KY0001/">

part of the above line; this means you can identify the core info by a regex fetching that line, then you need to build a variable taking this core info


and putting the necessary part

before that element fetched from the source page.

It goes without saying that I make abstraction here from the specific detail of these pages that all photos are just numbered (with leading zeroes though), so that this part of the script can be greatly simplified; I've seen other such pages where there was some sort of a "numbering", but in a perfectly aleatoric way, some hundreds of numbers only, but from a range of 100,000, so a simple "compound url" function, with just numbering 1...n would NOT be sufficient in many instances, and I very much fear A1 (and "all" the "others") do NOT have such a "compose the target url from a variable, and a static component" function yet?

In other words, you would not only need a "match regex" functionality, but a "regex replace" functionality, either for a copy of the original page source, and before the reading-the-source-for-link-following is done, or, simpler, as described, for building an intermediate variable to be processed as link to follow then.

Also, and this in not new from yesterday, there are MANY such (to-be-compounded-first) links to follow, not just one, and such a scraper (here: A1) should be able to do the necessary processing for all of these. In other words, it would be best if internally, an array would be built up for being processed then url by url.


Now, having followed the compound link, you are on page

with a button "All Sizes", and with some more hampering, you'll get to the highest-quality page, with the url

Here again, it's evident that by knowing the "architecture" of these links, you simply could run a script with those "KY0001" counting up from 1 to n, but as said, it's not always as easy as Duke Univ. makes things for us; thus we fetch the target url from the "intermediate" 's page source:

This link is present in the page source, but if it wasn't, there are several other such links, with "embed", "mid" and "thm", so here again, some regex REPLACE should be possible in cases the direct link is not to be found within the page source.

Whilst in our example, on this second level, there is only 1 link to follow, there are many cases where even on level 2, there are several links to follow - or then, it's the other way round, and on level 1, there is just 1 link, but then, on level 2, there are several (similar).

In fact, my current example is for Gedney 1964, but there is also Gedney 1972 (and others, of a lot less interest in this "artsy" context), i.e. I left out level 1 (= multiple links to multiple portfolios), and also left out level 2 (= multiple thumbs page per portfolio), so that the level 1 in my example is in fact already level 3 = one of those multiple-thumbs pages, of several, within one of several portfolios, of one portfolio (of several).

This means you have to provide functionality for multiple similar links (= not just one) on several levels in a row, in order to meet realistic scenarios, and this means you should not provide simple variables, but arrays, for the very first 4 levels, or perhaps even for 5 levels in a row.


In all this, I assume that all links in one array are "similar", i.e. are to be treated in a similar way "on arrival" on the target page, or more precisely, that on every target page of the links of some such array, the subsequent array building for links will be done in the same way.

It's obvious, though, that for many such scrape jobs, there will be dis-similar links to be followed, but I also think that in order to not over-complicate such a tool, you could ask the user to follow the same "page 0" (= source page), with different such tasks, in several installments.

As long as link following will be done just for "links of one kind" (i.e. not: single links), no chaos whatsoever will ensue.

Also, from the above, I conceptually deduct that there should be two different "modes" (at least):
- follow every link meeting the regex pattern(s), and
- follow only links of a certain kind;

in this second case, different regex pattern should be possible for different levels (you see, I cling to this idea: it would make things so much easier for the user (clarity!), and it would not represent any programming difficulty to achieve; also, it would make the program run "faster", neater: to build up the intermediate link arrays for the levels deeper down, no search for unnecessary regex matches (= not occuring at those levels anyway) would be forced upon the respective source texts (and the machine; and less risk for accidental unwanted matches there)).

From the above, I do not think A1 would have been able to execute my given example (= otherwise than by just counting from 1 to n, that is), as there will probably not be the possibility to construct compound target urls yet, all the less so for similar groups? In other words, and as explained above, I suppose today it's "follow every link matching one or several regex patterns (and not ANY link, as in the previous generation of scrapers)", but it would not be possible to build up such links if they are not found already, in full, within the respective page source?

Btw, with today's more and more Ajax and similar pages, I suppose that functionality to build up links (up from information the user will have to put into the program, by manually following links and then checking for the elements of the respective target url "upon arrival") will become more and more important in order for such a tool to not fail on many pages it could not handle otherwise?

Well, I love to come up with complications! ;-)

General Software Discussion / Re: Scraper too expensive at 20 bucks
« on: January 16, 2015, 01:19 PM »
My post above was not so much about any A1 or other, but meant in general, A1 bits offer just being the trigger for my considerations, but of course I gave the link to DC over there, which Thomas then promptly followed... ;-) Here's my answer from over there, there "over there" meaning here, and so on... ;-) :


Just to clarify, it was not my intention to denigrate A1, and I very much hope the title I gave the thread cited above appears as perfectly ironic as it was intended.

I should have clarified above - I do it here though - that I consider A1 as a perfectly worthy example of a "standard scraper", and more so, possibly "perfect" or at least very good, for almost any "un-professional", i.e. amateur scraping task (and from your comments, I see those imperfections of A1 I found described in the rare web comments of it, have been dealt with in the meantime).

Also, there seems to be a slight misunderstanding, automatisms are good, but the possibility of deselecting and tweaking automatisms is even better, since very often, scrapers follow links too fervently, instead of following only some kinds/groups of links (and I mentioned the programming problems in order to make such choices available): it's not about ".jpg only", or even of "pics within a given size range only" and such; also, standard "this page and its children down to 1/2/3 levels" is not really helpful, since (even for "amateurs"), it often would be necessary to follow links of some kind rather deep, whilst not following links of other kinds.

As for the "heavy scraping problem", there is also a legal problem, which consists of new-kind "authors' rights", in most European countries, to "data bases", even if those db's only consist of third-party advertizing / good offerings, with no content contribution whatsoever from the owner of the target site (but who, e.g. for vacancy notices, gets often paid 800 euro, some 1,000 bucks for publishing that ad for a mere 4 weeks, AND holds "authors' rights" to that same ad as part of his db); this being said, it's clear as day that such considerations are perfectly irrelevant within the context of a "consumer product" like A1, and this denomination of mine is certainly not meant in order to tear down A1 either.

But there clearly is a schism between professional use, for which highly elaborate custom scripting is necessary (and, as explained, not even sufficient), and "consumer use", and in this latter category, the above-mentioned tweaking possibilities for "which groups of links to follow then how, respectively", could certainly make the necessary distinction among "consumer scrapers".

Or let's get as precise as it gets: Years ago, I trialled several such "consumer scrapers", in order to get all of William Gedney's Kentucky 1964 and 1972 photos (i.e. not even photo scraping is all about porn, but sometimes it's about art), from the Foundation's web site, but in the best resolution available there, and that was not possible with those scrapers since there was an intermediate quality level, between thumbs and the quality I was after - perhaps I did it wrong at the time; anyway, I succeeded by writing my own download script.

Just for fun, I checked that page again:

and verified the current state of things:

pages of kind a: some 50 pages with thumbs (for more than 900 photos),

then target pages (= down 1 level, for intermediate photo quality),

and there, links to the third level, with the full quality, but also many, many links to other things:

Whilst from level 1 to level 2, it's "easy", it's obvious that for level 2 pages, highly-selective-only link following (i.e. just follow the link to the pic-in-full and nothing else) would be asked for, but probably is not possible with most consumer scrapers even today: Would it be possible to tweak A1's link following that way? Again, we're speaking not of general specifics, but of specifics applying to level-2 pages only.

Well, whilst my first DC post was rather theoretical, here we've got a real-life example. It's clear as day that if a tool for 20 or 40 bucks did this, with some easy tweaking, I'd call such a tool "near perfect": it's all about discrete, selectivity of link following. ;-)

(Or then, I'd have to continue to do my own scripts for every such task I encounter...)

P.S. If by this scheme of mine, Gedney's work will get some more followers, that wouldn't be a bad thing either. ;-)

General Software Discussion / Scraper too expensive at 20 bucks
« on: January 16, 2015, 06:34 AM »
(Original post at bits and referred to here was "$19 + seems an awful lot of money for software you can get the same type of thing for nothing. (...)".)

The problem lies elsewhere. A price of 20 bucks is certainly not a deal breaker, neither would be 40 bucks (original price), and there are competitors that cost several hundred bucks, and which are not necessarily better or much better.


if you search for "download manager", the web (and the people who constitute it by their respective contributions) mix up web scrapers (like A1) and tools for downloading files specified beforehand by the user, but the download of which will then be done within multiple threads, instead of just one, by this using your possible fast internet connection to its fullest; of course, most of the scrapers will include such accelerating functionality, too. Thus, the lacking discriminating effort in what commentators see as a "download manager" does not facilitate the discussion to begin with; you should perhaps use the terms "scrapers", and "download accelerators", for a start, but there is also some "middle thing", pseudo-scrapers who just download the current page, but without following its links.


the big problem for scrapers nowadays is Ajax and database techniques, i.e. many of today's web pages are not static anymore, but are built up from multiple elements coming from various sources, and you do not even see those scripts in full; scripts you can read by "see page source" refer back to scripts on their servers, and almost anything that is done behind these scenes, cannot be replicated by ANY scraper (i.e. not even by guessing parts of it, and from building up some alternative functionality from those guesses), so the remark that A1's pages from scraped Ajax pages do not "work" is meaningless.

The only other remark re A1 I found in the web was, you will get "the whole page", instead of just the photos, in case you would like to download just the photos of a web page; IF that is right, that was a weakness of A1 indeed, since these "choosing selected content only" questions are the core functionality today's scrapers could and should have, in the above-described general framework in which "original web page functionality" can not be replicated anymore, for many pages (which often are the ones which are of most interest = with the most money behind = with both the "best" content, and lots of money for ace programming).

Thus, "taking up" with server-side programming has become almost impossible for developers anyway, so they should revert to optimization of choosing selected content, and of making that content available, at least in a static way, and it goes without saying that multiple different degrees of optimization of that functionality are imaginable: built-in "macros" could replicate at least some standard connections between screen/data elements "on your side", and of which the original triggers are lost, by downloading, but this would involve lots of user-sided decisions to be made, and hence lots of dialogs the scraper would offer the user to begin with ("click on an element you want as a trigger, then select data (in a table e.g.) that would be made available from that trigger", or then, big data tables, which then you would hierarchically "sort" in groups, in order to make that data meaningful again).

It's clear as day that the better the guesses of the scraper in such scenarios, the easier such partial re-consitution of the original data would often become, and also, that programming such guesses-and-services-offered-from-those would both be very "expensive" in programming, and be a never-ending task, all this because today's web technologies succeed in hiding what's done on the server side.

In other words, from even very complicated but static, and even pseudo-dynamic (i.e. get it all out of databases, but in a stringent, easily-to-be-replicated way) web pages yesterday, to today's dynamic web pages, it has been a step beyond what scrapers sensibly would have been able to handle.

But it's obvious also that scrapers should at least perfectly handle "what they've got", and the above-mentioned example (as said, found in the web) of "just downloading the pics of a page", whilst being totally realistic, is far from being sufficient as a feature request:

In so many instances, the pics of the current page are either just thumbs, or then, just pics in some intermediate resolution, and the link to the full-resolution pic is not available but from the dedicated page of that middle-resolution pic, and the situation is further complicated by the fact that often, the first or second resolution is available, but the third resolution is not, and that within the same start page, i.e. for the same task at arrival, for some pics, the scraper / script would have to follow two or three links, in for other pics linked to at the same page, it would have to follow just one or two.

This being said, of course, such "get the best available resolution for the pics on current page" should be standard functionality for a scraper.

But, all this being said, it also appears as quite evident to me that for tasks beyond such "elaborate standard tasks" (and which could be made available by the scraper "guessing" possibly relevant links, then have the user choose from the intermediate results, and then the scraper building up the necessary "rule(s)" for the site in question), scraper programming comes with the additional problem that such "specific rule building" would be split into a) what the scraper would make available and b) what the user could make out of these pre-fetched instruments, whilst in fact, the better, easier, and ultimately far more powerful solution (because the limitations of the intermediate step would be done away, together with that intermediate step) would be to do scripting, but ideally having some library of standards at your disposal.

(Readers here in DC will remember my - unanswered - question here how to immediately get to "page x" (e.g. 50) of an "endless" Ajax page (of perhaps 300 such partial "pages" (or whatever you like to name those additions), instead of "endlessly" scrolling down to it.)

Anyway, precise selection of what the user wants to scrape, and of "what not", should be possible in detail, and not only for links to follow on start page, but also for links further down, at the very least for links "on page 2", i.e. on several kinds (!) of pages which only have in common the fact that all of them are one level "down" from the respective "start page" (I assume there are multiple but similar such "start pages", all of them to be treated in a similar (but not identical, see above) way.


so many scrapers (and download accelerators, too) tout their respective accelerating power, but few, if ever one, mention the biggest problem of them all: More and more server programs quickly throw your IP(s!) and even your PC out of their access scheme, should you dare scrape big content and/or, repeatedly, updated content, and again, as above, the more elaborate the content and their server-side page-build-up programming, the higher the chances are that they have sophisticated scraper detection, too.

What most people do not know, when they choose their tunnel provider, is the fact that in such "heavy-scraping" scenarios, it's quite "risky" to get a full-year contract (let alone something beyond a year), and that there are special tunnel providers where you rent multiple IPs at the same time instead - which comes at a price.

With these multiple addresses, many scraping guys think they are on the safe side - well, what's multiple addresses "abroad" (from the server's pov), and when in country x no such provider can provide you any, or more than just a handful of "national" IPs?

And it does not end there. How "visually good" is your script, from the server's pov again? Don't you think they cannot "put it all together again" when your scraping follows detectable rules? To begin with, your scraping is probably mutually exclusive, which is obviously a big mistake, but which facilitates combining the parts on your side, right? He, he...

And you're spacing your requests, of course, in order for the server not to detect it's a machine fetching the data? He, he, again, just spacing the requests in time does not mean the server will think it detects some real person, looking for the data in a way some bona fide prospect would look for that data.

Not to speak of bona fide prospects looking in certain standard ways, but which never are the same though, and that they don't do just sequential downloading ("sequential" does not mean, follow link 1, then 2, then 3, but link 35, 482, 47, whatever, but download, download, download!), but revert to some page before, press F5 here or there (but not systematically of course), and so on, and in endless ways: As soon as there is a possible script to be detected, those servers send a signal on a real person on their side, and who will then look into things, relying on their scripts-for-further-pattern-detection: time of the day for such a "session", amount of data downloaded, number of elements downloaded, order in which (sub-) elements are downloaded (patterns, too similar and/or or not "real-life" enough).

Then, even if you quite perfect all this, by having your machines replicating real-life behavior of different real persons, even most real-life prospects will not remain interested in the same or similar data over the years, and most of them, not even over months in a row!

And all this with the concurrent problem of the geographic repartition of your IPs again: Where almost all of their bona fide prospects would sit in some specific country, or even in some specific region of that country, and so all of the above problems, even if resolved in perfect ways (and this necessarily included lots of overlaps if you want your global scheme to remain "realistic") will be only partial solutions and not work for long if you cannot resolve the problem of how to fake IPs and their geography, instead of just renting some.

My 2 cent to put into perspective some naïve, "$19 + seems an awful lot of money for software you can get the same type of thing for nothing.", and I certainly left out additional aspects I didn't think of on the fly.

Pages: prev1 2 3 [4] 5 6 7 8 9 ... 24next
Go to full version