Scraper too expensive at 20 bucks
peter.s:
Hi Thomas,
I did not try it myself, but I think you're right, and I realize I was too focused on the level-1/level-2 thing (which A1, and presumably all the others, obviously doesn't do (yet)), but which is not really necessary either; in fact, the core functionality that is needed is "follow ONLY links that meet (a) certain regex pattern(s)", AND it must be possible to have several such regexes if needed (as in this scenario, where the level-1 regex would differ from the level-2 regex).
Most tasks will then be realizable this way, with the understanding that BOTH (i.e. all) regexes will invariably be applied at EVERY level, which only in very rare cases could be a problem - in the script I applied to that site, I had differentiated the regexes for level 1 and level 2, building a subroutine for the lower level, but we see here that I could have simplified the routine along the lines of my example described above.
Unfortunately, and I only discover this now, putting these things together again, it was a "bad" example, i.e. not as straightforward as described by me yesterday. In fact:
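In script terms, this "follow ONLY links matching the given regexes, all applied at every level" rule is trivial to sketch (Python here; the patterns and the hrefs list are made-up illustrations, not anything A1 actually uses):

```python
import re

# Sketch of "follow ONLY links meeting certain regex pattern(s)".
# Both patterns and the hrefs list are invented for illustration.
patterns = [
    re.compile(r"^/digitalcollections/gedney_KY\d{4}/$"),  # intermediate pages
    re.compile(r"/media/jpg/gedney/lrg/KY\d{4}\.jpg$"),    # big images
]

def links_to_follow(hrefs):
    """Keep only the links matching at least one of the patterns."""
    return [h for h in hrefs if any(p.search(h) for p in patterns)]

hrefs = [
    "/digitalcollections/gedney_KY0001/",
    "/about/contact/",  # ignored: matches no pattern
    "http://library.duke.edu/digitalcollections/media/jpg/gedney/lrg/KY0001.jpg",
]
print(links_to_follow(hrefs))  # first and third entry only
```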
1)
level one = thumbs page; the code line for the very first thumb is:
<li class="grid_3 alpha clearBoth"><a href="/digitalcollections/gedney_KY0001/"><img src="http://library.duke.edu/digitalcollections/media/jpg/gedney/thm/KY0001.jpg" alt=""/><br/>Man with no fingers on right hand lighting a cigarette; view from interior of ...</a></li>
and
http://library.duke.edu/digitalcollections/media/jpg/gedney/thm/KY0001.jpg
will just bring up a single thumb again,
whilst the intermediate-quality page, displayed by a click on the thumb, has the url
http://library.duke.edu/digitalcollections/gedney_KY0001/
Such a direct link is nowhere on the source (= multiple-thumbs) page, but compare it with the
<a href="/digitalcollections/gedney_KY0001/">
part of the line above; this means you can identify the core info with a regex fetching that line, then build a variable taking this core info
/digitalcollections/gedney_KY0001/
and putting the necessary part
http://library.duke.edu
before that element fetched from the source page.
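As a minimal Python sketch of this fetch-the-core-info-and-compound-it step (the regex is my own illustration, not a claim about any tool's internals):

```python
import re

BASE = "http://library.duke.edu"  # the static component to put in front

# The thumb line from the level-one page source, as quoted above:
line = ('<li class="grid_3 alpha clearBoth">'
        '<a href="/digitalcollections/gedney_KY0001/">'
        '<img src="http://library.duke.edu/digitalcollections/media/jpg/'
        'gedney/thm/KY0001.jpg" alt=""/></a></li>')

# Fetch the core info with a regex, then compound the target url:
m = re.search(r'href="(/digitalcollections/[^"]+/)"', line)
target = BASE + m.group(1)
print(target)  # http://library.duke.edu/digitalcollections/gedney_KY0001/
```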
It goes without saying that I am abstracting here from the specific detail of these pages that all photos are simply numbered (with leading zeroes, though), so that this part of the script can be greatly simplified; I've seen other such pages where there was some sort of "numbering", but in a perfectly random way - some hundreds of numbers only, drawn from a range of 100,000 - so a simple "compound url" function that just counts 1...n would NOT be sufficient in many instances, and I very much fear A1 (and "all" the "others") do NOT yet have such a "compose the target url from a variable and a static component" function?
In other words, you would not only need "regex match" functionality, but "regex replace" functionality as well, applied either to a copy of the original page source before the reading-the-source-for-link-following is done, or, simpler, as described, to build an intermediate variable which is then processed as the link to follow.
Also, and this is not new from yesterday: there are MANY such (to-be-compounded-first) links to follow, not just one, and such a scraper (here: A1) should be able to do the necessary processing for all of them. In other words, it would be best if, internally, an array were built up and then processed url by url.
2)
Now, having followed the compound link, you are on page
http://library.duke.edu/digitalcollections/gedney_KY0001/
with a button "All Sizes"; after some more fiddling, you'll get to the highest-quality page, with the url
http://library.duke.edu/digitalcollections/media/jpg/gedney/lrg/KY0001.jpg
Here again, it's evident that, knowing the "architecture" of these links, you could simply run a script with that "KY0001" counting up from 1 to n; but, as said, it's not always as easy as Duke University makes things for us, so we fetch the target url from the intermediate page's source:
http://library.duke.edu/digitalcollections/media/jpg/gedney/lrg/KY0001.jpg
This link is present in the page source, but if it weren't, there are several other such links, with "embed", "mid" and "thm", so here again, some regex REPLACE should be possible in case the direct link is not to be found within the page source.
Whilst in our example there is only 1 link to follow on this second level, there are many cases where even on level 2 there are several links to follow - or it's the other way round: on level 1 there is just 1 link, but then, on level 2, there are several (similar ones).
In fact, my current example is for Gedney 1964, but there is also Gedney 1972 (and others, of much less interest in this "artsy" context), i.e. I left out level 1 (= multiple links to multiple portfolios) and also level 2 (= multiple thumbs pages per portfolio), so that the level 1 of my example is in fact already level 3: one of several multiple-thumbs pages, within one of several portfolios.
This means you have to provide functionality for multiple similar links (= not just one) on several levels in a row in order to meet realistic scenarios, and this means you should provide not simple variables but arrays, for at least the first 4 levels, or perhaps even for 5 levels in a row.
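A one-line regex REPLACE suffices for this derivation (Python sketch; hypothetical, since in the Duke case the "lrg" link is present in the source anyway):

```python
import re

# Derive the "lrg" url from any of the "embed"/"mid"/"thm" variants
# found in the page source, via a regex replace:
thm = "http://library.duke.edu/digitalcollections/media/jpg/gedney/thm/KY0001.jpg"
lrg = re.sub(r"/(thm|mid|embed)/", "/lrg/", thm)
print(lrg)  # .../media/jpg/gedney/lrg/KY0001.jpg
```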
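A sketch of this array-per-level processing, with a different regex per level (Python; fetch() here returns canned page sources instead of doing real HTTP requests, and both patterns are illustrative):

```python
import re

# Level-by-level crawl sketch: one array of urls per level, and a
# DIFFERENT regex per level. fetch() is a stand-in returning canned
# page sources; a real crawler would do an HTTP GET.
PAGES = {
    "page0": 'see <a href="/digitalcollections/gedney_KY0001/">thumb</a>',
    "/digitalcollections/gedney_KY0001/":
        'img at http://library.duke.edu/digitalcollections/media/jpg/gedney/lrg/KY0001.jpg',
}

def fetch(url):
    return PAGES.get(url, "")

LEVEL_PATTERNS = [                       # one pattern per level
    r'href="(/digitalcollections/[^"]+/)"',
    r'(http://\S+/lrg/\S+\.jpg)',
]

level = ["page0"]                        # the array for level 0
for pattern in LEVEL_PATTERNS:
    next_level = []
    for url in level:                    # processed url by url
        next_level += re.findall(pattern, fetch(url))
    level = next_level

print(level)  # the final array of target urls
```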
3)
In all this, I assume that all links in one array are "similar", i.e. are to be treated in the same way "on arrival" at the target page - or, more precisely, that on every target page reached via the links of such an array, the subsequent array building for links will be done in the same way.
It's obvious, though, that for many such scrape jobs there will be dissimilar links to follow, but I also think that, in order not to over-complicate such a tool, you could ask the user to process the same "page 0" (= source page) with different such tasks, in several installments.
As long as link following will be done just for "links of one kind" (i.e. not: single links), no chaos whatsoever will ensue.
Also, from the above, I conceptually deduce that there should be (at least) two different "modes":
- follow every link meeting the regex pattern(s), and
- follow only links of a certain kind;
in this second case, different regex patterns should be possible for different levels (you see, I cling to this idea: it would make things so much easier for the user (clarity!), and it would not present any programming difficulty; also, it would make the program run faster and more neatly: when building up the intermediate link arrays for the deeper levels, no search for unnecessary regex matches (= ones not occurring at those levels anyway) would be forced upon the respective source texts and the machine, and there would be less risk of accidental unwanted matches there).
From the above, I do not think A1 would have been able to execute my given example (otherwise than by just counting from 1 to n, that is), as there is probably no possibility yet to construct compound target urls, much less for groups of similar ones? In other words, and as explained above, I suppose today it's "follow every link matching one or several regex patterns" (and not ANY link, as in the previous generation of scrapers), but it would not be possible to build up such links if they are not already found, in full, within the respective page source?
Btw, with today's more and more AJAX-like pages, I suppose that functionality to build up links (from information the user will have put into the program after manually following links and checking the elements of the respective target url "on arrival") will become more and more important if such a tool is not to fail on the many pages it could not handle otherwise?
Well, I love to come up with complications! ;-)
ThomasSchulz:
Hi,
I could create a project that only analyzes pages related to e.g. gedney and download all files used/linked from them.
After that, one could, if one so chooses, configure the crawler further to:
1) limit it further, so it does not download linked pages/content outside the wanted areas.
2) limit it to only download and keep images.
URL / link normalization is always done before any filter is tested, so all that is zero problem. General rule: As long as a browser can understand something in a website, so can the crawler :) (AJAX being an exception.)
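For illustration, this normalization is what e.g. Python's stdlib urljoin does before one would test any filter (a sketch of the principle only, not of A1's actual code):

```python
from urllib.parse import urljoin

# Relative, root-relative and absolute forms of the "same" link all
# normalize to one canonical url before any filter would be tested:
base = "http://library.duke.edu/digitalcollections/gedney_KY0001/"
forms = [
    "/digitalcollections/gedney_KY0002/",                         # root-relative
    "../gedney_KY0002/",                                          # relative
    "http://library.duke.edu/digitalcollections/gedney_KY0002/",  # absolute
]
normalized = {urljoin(base, f) for f in forms}
print(normalized)  # a single canonical url
```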
Replacing content/links/URLs inside page content also requires a far larger system than regular expressions. It is essentially a huge multi-threaded engine behind it all, with a ton of queues, lists and what not, to ensure everything is done correctly and as optimally as possible. It is of course a never-ending process of optimizations, which is why writing such a product can end up taking lots of time :)
peter.s:
"I could create a project that only analyzes pages related to gedney and downloads all files used/linked from them."
We had assumed that was not what we wanted to do, in so general a way.
"After that, one could, if one so chooses, either:
1) start to limit it further, so it does not download linked pages/content outside the wanted areas."
Areas? Link groups, I presume.
"2) e.g. limit it to only download and keep images."
Again, that would often download much too much, and I prefer downloading hygiene to lots of discard later on. Ultimate irony: That discarding unwanted pics could then possibly be done within a file manager allowing for regex selection.
"URL / link normalization is always done before any filter is tested, so all that is zero problem. General rule: As long as a browser can understand something in a website, so can the crawler :) (AJAX being an exception.)"
Brilliant! (But in some contradiction to your next paragraph?)
"Replacing content/links/URLs inside page content is also a far larger system than regular expressions."
As explained above, not necessarily so. I originally started from a (dumb) misunderstanding of mine: that regex in your scrapers was for choosing the parts of the text the user wanted to preserve; obviously, that is not their use in your software - they are for the user's determining of the links to be followed. Now, from a regex match to a regex replace there (with input, as explained above, from the user, which the user will have identified by manually loading pages and looking into their respective url patterns and/or page sources), it's not that big a step, and...
"It is essentially a huge multi-threaded engine that is behind it all with a ton of queues, lists and what not to ensure everything is done correctly and optimal as possible."
I don't know about traditional multi-threading, but in downloading, the term is often used for multiple, concurrent downloads; I don't know about that either. But it's evident, from my explications above, that all this "do different things at different download levels" is easy, as soon as you accept that doing different (and specific) things at different (= at the specific) download levels is in fact the natural way of doing things, and that "download whole web pages" is very late Nineties, even if today it's possible to exclude, with real hard work and totally insufficiently, only SOME unwanted meanders (and all the advertising, socialising and whatever).
"and what not"
This is a formula regularly used for deprecation, or more precisely, it shows that the writer in question has not yet grasped the importance of the detail(s) he is hiding behind such an all-encompassing "and what not"; btw, another sophist's formula of choice is "questions" when no questions have been asked (yet) (see your top post): in fact, many of us will agree that when corporations don't deliver the service you asked, and paid, for, and you then dare utter some requirements, corporations of that (= not-so-customer-centered) kind will invariably meet these with "your questions". (end of ot)
"to ensure everything is done correctly and optimal as possible"
Well, that's why I suggested adding a second "mode", for stringent downloading, instead of trying to mingle it all together into the existent, traditional mode - this second "mode" "only" going down some 5 or 6 levels.
"with a ton of queues, lists"
Of course, you'd be free to use lists instead of arrays, and even though that would indeed imply some multiplication of storage elements, I wouldn't expect a ton of additional elements from this choice.
"It is of course a never ending process of optimizations"
I mentioned this above: where a custom-made script is straightforward, the intermediate step of providing, in a commercial tool, dialogs and such in order to build, within some necessary confines, variants of procedure - something that more or less mimics, for standard tasks, such a custom-made script, or at least a lot of what the latter would have provided only as a last resort - complicates things for the developer of such a tool.
But then, identify standard tasks (e.g. from what you hear in your forum), then write additional "modes", such as the one described above. To be even more precise in my description: do one big list field with multiple indentation, i.e. let users build up a tree (which you then process by nested loops). This would even make bifurcations available, i.e. down at level 3, 2 (or more) link kinds to follow, each of them with their own processing rules further down - in this example, 1 level 3, but 2 different levels 4, one with levels 5 and 6, the other perhaps ending at level 4 for that line of scraping.
The next programming step would of course be to integrate your multi-threading into this alternative mode, and then a professional version could even allow the user to indicate how often these specific lines are to be updated (re-checked for changes in that site's contents). It's not "endless", but it can be done in sensible installments, according to real-life scenarios and how best to handle them; following the (in most cases, tree) structure of a site in a perfectly coordinated way, without gathering the rubbish lying along the way, should be a core task (of which the coding will take one week, since "everything around" is already there), which could then be refined in some details from the observations of your forum posters.
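The tree-of-link-kinds idea can be sketched with a small recursive walk (Python; fetch() returns canned sources, and the tree and patterns are invented for illustration):

```python
import re

# Each tree node = one "link kind": a regex for the links to follow at
# that level, plus child nodes for the levels below. fetch() returns
# canned page sources here; a real crawler would do HTTP requests.
PAGES = {
    "portfolio": '<a href="/thumbs/1/">t</a> <a href="/thumbs/2/">t</a>',
    "/thumbs/1/": "img at http://example.org/lrg/a.jpg",
    "/thumbs/2/": "img at http://example.org/lrg/b.jpg",
}

def fetch(url):
    return PAGES.get(url, "")

TREE = {
    "pattern": r'href="(/thumbs/\d+/)"',               # thumbs pages
    "children": [{"pattern": r"(http://\S+\.jpg)",     # big images
                  "children": []}],
}

def crawl(url, node, found):
    for link in re.findall(node["pattern"], fetch(url)):
        if not node["children"]:      # leaf of the tree: a target url
            found.append(link)
        for child in node["children"]:
            crawl(link, child, found)
    return found

print(crawl("portfolio", TREE, []))  # the two .jpg targets
```

A bifurcation is then just a node with two (or more) children, each carrying its own pattern and sub-tree.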
"which is why writing such a product can end up taking lots of time :)"
One week, as said, and including debugging - since most of the design work has been delivered free of charge.
Thomas, it's clear as day that I don't need such a crawler anymore, but not everybody who would like to do some neat scraping has the scripting routine I've acquired over the years. It's not that anybody's asking any particular effort of you one way or the other; it's just that you've got some weeks' advance over your competitors - some weeks, I say, because DC gets quite a ton of hits from professionals (I had been astonished you hadn't become aware of DC before my linking here?)
;-)
ThomasSchulz:
It took a couple of minutes to create a project file for A1 Website Download that could download and allow offline surfing of the wanted pages and images, including the big images (and exclude the rest of the website / images). I figured it would be easier if you saw the project, in case there was a misunderstanding (either by me or by you) about whether A1WD would be able to handle the scenario you outlined.
I think we are talking past each other: Reading your posts again, I think you may be more interested in scraping specific data/images than in downloading websites / large parts of them. Since A1 Website Download is not meant to scrape bits of each page - e.g. summaries or whatever people like to scrape - it is not something A1WD will ever do, as it's not what it is meant to solve.
"2) e.g. limit it to only download and keep images."
"Again, that would often download much too much, and I prefer downloading hygiene to lots of discard later on. Ultimate irony: That discarding unwanted pics could then possibly be done within a file manager allowing for regex selection."
I think we are talking past each other: A1WD would not download files excluded... But yes, it would have to download the content pages to discover the links/references to the images. If that is your criticism, I understand - i.e. if you in your proposed solution do not want to crawl pages to discover the files to download, but instead download files from a pre-generated list. For those who have the time to write custom downloaders for specific websites - that is of course always a good solution if you are not interested in downloading websites themselves.
For reference, I wrote my first "proof of concept" (but working) crawler and sitemapper in less than two days in 2003. It generated nice HTML sitemaps for my test websites... In 2005 I picked it up again and started working on the core crawler engine and all around it. Never stopped developing on it ever since. (Powers multiple sibling tools, among those A1 Website Download, so that is partly why.)
Anyhow, it seems you have a different idea for users to define what they want scraped/downloaded. And as a general statement: If you or anyone else have a plan on releasing a product that somehow improves or complements what is already out there - then I think you should go for it and release it!
I believe more in my own design (which I think is pretty unparalleled in flexibility and configuration compared to what else is currently out there in website downloaders), but it is a moot point to discuss from my point of view. (And you are certainly entitled to believe your proposed design is better.) I was mainly interested in hearing if anyone could present a practical real-world problem out there - if you recall, I immediately agreed that websites that rely heavily on AJAX can be a problem.
When I started, I joined http://asp-software.org/ but I do not really engage much in developer forums anymore except those specific to the tools and problems I encounter.
I wish you well on all your future endeavours!