Hi Thomas,
I have not tried it myself, but I think you're right. I realize I was too focused on the level 1 / level 2 thing, which A1 (and presumably all the others) obviously doesn't do (yet), but which is not really necessary either. The core functionality that is needed is "follow ONLY links that match (a) certain regex pattern(s)", AND it must be possible to have several such regexes if needed (as in this scenario, where the level 1 regex would be different from the level 2 regex).
Most tasks will then be realizable this way, on the understanding that ALL regexes will invariably be applied at EVERY level, which could be a problem only in very rare cases. In the script I applied to that site, I had differentiated the regex for level 1 and level 2, building a subroutine for the lower level, but we see here that I could have simplified the routine, in line with my example as described above.
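To make the "several regexes, all applied at every level" idea concrete, here is a minimal Python sketch. It is not how A1 works internally (I don't know that); the two patterns and the example.org links are made up for illustration.

```python
import re

# Hypothetical patterns: one aimed at level-1 links, one at level-2 links.
# Both are tried against every link, whatever level it was found on.
PATTERNS = [
    re.compile(r"^/digitalcollections/gedney_[A-Z]{2}\d{4}/$"),
    re.compile(r"/lrg/[A-Z]{2}\d{4}\.jpg$"),
]

def links_to_follow(links, patterns=PATTERNS):
    """Keep only the links that match at least one of the patterns."""
    return [link for link in links if any(p.search(link) for p in patterns)]

# Made-up list of links as they might be found on a page.
found = [
    "/digitalcollections/gedney_KY0001/",
    "/about/",
    "http://example.org/gedney/lrg/KY0001.jpg",
    "http://example.org/gedney/thm/KY0001.jpg",
]
print(links_to_follow(found))
```

The "/about/" link and the thumb link are dropped because neither pattern matches them.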
Unfortunately, and I only discover this now, putting these things together again, it was a "bad" example, i.e. not as straightforward as described by me yesterday. In fact:
1)
level one = thumbs page, code line for the very first thumb is:
<li class="grid_3 alpha clearBoth"><a href="/digitalcollections/gedney_KY0001/"><img src="
http://library.duke....edney/thm/KY0001.jpg" alt=""/><br/>Man with no fingers on right hand lighting a cigarette; view from interior of ...</a></li>
and
http://library.duke....edney/thm/KY0001.jpg
will just bring up a single thumb again!,
whilst the intermediate-quality page, displayed by a click on the thumb, has the url
http://library.duke....tions/gedney_KY0001/
Such a direct link is nowhere on the source (= multiple thumbs) page, but compare with the
<a href="/digitalcollections/gedney_KY0001/">
part of the above line; this means you can identify the core info by a regex fetching that line, then you need to build a variable taking this core info
/digitalcollections/gedney_KY0001/
and putting the necessary part
http://library.duke.edu
before that element fetched from the source page.
It goes without saying that I am disregarding here the specific detail of these pages that all photos are simply numbered (with leading zeroes, though), so that this part of the script could be greatly simplified. I've seen other such pages where there was some sort of "numbering", but in a perfectly random way: only some hundreds of numbers, but from a range of 100,000. So a simple "compound url" function that just counts 1...n would NOT be sufficient in many instances, and I very much fear A1 (and "all" the "others") do NOT have such a "compose the target url from a variable and a static component" function yet?
In other words, you would need not only a "match regex" functionality but also a "regex replace" functionality, either applied to a copy of the original page source before the reading-the-source-for-link-following is done, or, simpler, as described, for building an intermediate variable to be processed as the link to follow.
Also, and this is not new from yesterday, there are MANY such (to-be-compounded-first) links to follow, not just one, and such a scraper (here: A1) should be able to do the necessary processing for all of these. In other words, it would be best if an array were built up internally, to be processed url by url.
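Step 1 can be sketched in a few lines of Python: fetch the relative hrefs with a regex, then put the static host part in front of each one and collect the results in an array. The regex and the sample page source are my assumptions (the image paths are placeholders); only the href and the host are taken from above.

```python
import re
from urllib.parse import urljoin

BASE = "http://library.duke.edu"  # static part, as named above

# Assumed pattern: captures the relative href of each thumb link.
HREF_RE = re.compile(r'<a href="(/digitalcollections/gedney_[A-Z]{2}\d{4}/)">')

def compound_urls(page_source, base=BASE):
    """Build the array of full URLs to follow from the relative hrefs."""
    return [urljoin(base, path) for path in HREF_RE.findall(page_source)]

# Made-up page source with two thumbs (image paths are placeholders).
source = (
    '<li><a href="/digitalcollections/gedney_KY0001/"><img src="t1.jpg" alt=""/></a></li>'
    '<li><a href="/digitalcollections/gedney_KY0002/"><img src="t2.jpg" alt=""/></a></li>'
)

for url in compound_urls(source):
    print(url)
```

Because the regex captures whatever identifier is in the href, this works just as well when the "numbering" is random rather than 1...n.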
2)
Now, having followed the compound link, you are on page
http://library.duke....tions/gedney_KY0001/
with a button "All Sizes", and with some more fiddling, you'll get to the highest-quality page, with the url
http://library.duke....edney/lrg/KY0001.jpg
Here again, it's evident that, knowing the "architecture" of these links, you could simply run a script with those "KY0001" counting up from 1 to n, but as said, it's not always as easy as Duke Univ. makes things for us; thus we fetch the target url from the "intermediate" page's source:
http://library.duke....edney/lrg/KY0001.jpg
This link is present in the page source, but if it weren't, there are several other such links, with "embed", "mid" and "thm", so here again, some regex REPLACE should be possible in case the direct link is not to be found within the page source.
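The regex REPLACE variant could look like the following Python sketch. The example.org URLs are hypothetical stand-ins (I don't want to guess the full Duke paths), and I'm assuming the size marker is the last directory before the file name.

```python
import re

# Assumption: the size marker ("thm", "mid" or "embed") is the last
# directory before the .jpg file name.
SIZE_RE = re.compile(r"/(thm|mid|embed)/(?=[^/]+\.jpg$)")

def to_large(url):
    """Rewrite a thumb/mid/embed image URL into its "lrg" counterpart."""
    return SIZE_RE.sub("/lrg/", url, count=1)

# Hypothetical URLs; only the thm/mid/lrg naming is taken from the post.
print(to_large("http://example.org/gedney/thm/KY0001.jpg"))
print(to_large("http://example.org/gedney/mid/KY0001.jpg"))
print(to_large("http://example.org/gedney/lrg/KY0001.jpg"))  # left unchanged
```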
Whilst in our example, on this second level, there is only 1 link to follow, there are many cases where even on level 2, there are several links to follow - or then, it's the other way round, and on level 1, there is just 1 link, but then, on level 2, there are several (similar).
In fact, my current example is for Gedney 1964, but there is also Gedney 1972 (and others, of a lot less interest in this "artsy" context), i.e. I left out level 1 (= multiple links to multiple portfolios), and also left out level 2 (= multiple thumbs pages per portfolio), so that level 1 in my example is in fact already level 3 = one of several multiple-thumbs pages within one of several portfolios.
This means you have to provide functionality for multiple similar links (= not just one) on several levels in a row, in order to meet realistic scenarios, and this means you should not provide simple variables, but arrays, for the very first 4 levels, or perhaps even for 5 levels in a row.
3)
In all this, I assume that all links in one array are "similar", i.e. are to be treated in a similar way "on arrival" on the target page, or more precisely, that on every target page of the links of some such array, the subsequent array building for links will be done in the same way.
It's obvious, though, that for many such scrape jobs, there will be dis-similar links to be followed, but I also think that in order to not over-complicate such a tool, you could ask the user to follow the same "page 0" (= source page), with different such tasks, in several installments.
As long as link following will be done just for "links of one kind" (i.e. not: single links), no chaos whatsoever will ensue.
Also, from the above, I conceptually deduce that there should be two different "modes" (at least):
- follow every link meeting the regex pattern(s), and
- follow only links of a certain kind;
in this second case, different regex patterns should be possible for different levels. You see, I cling to this idea: it would make things so much easier for the user (clarity!), and it would not represent any programming difficulty to achieve. It would also make the program run "faster" and neater: when building up the intermediate link arrays for the deeper levels, no search for unnecessary regex matches (= not occurring at those levels anyway) would be forced upon the respective source texts (and the machine), and there would be less risk of accidental unwanted matches there.
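Here is a rough Python sketch of that second mode, with one regex per level and one array of links per level. Everything in it is hypothetical: the function name, the in-memory "site" that stands in for real page fetches, and the patterns. It only shows that the per-level bookkeeping is small.

```python
import re
from urllib.parse import urljoin

def crawl_by_level(start_url, level_patterns, fetch):
    """Return one list of URLs per level, starting with [start_url].

    level_patterns[i] is applied only to the pages found on level i.
    """
    levels = [[start_url]]
    for pattern in level_patterns:
        next_level = []
        for url in levels[-1]:
            for href in pattern.findall(fetch(url)):
                target = urljoin(url, href)
                if target not in next_level:  # skip duplicates
                    next_level.append(target)
        levels.append(next_level)
    return levels

# Made-up in-memory site: portfolios -> thumbs pages -> large images.
PAGES = {
    "http://example.org/": '<a href="/p/1964/">1964</a> <a href="/p/1972/">1972</a> <a href="/about/">About</a>',
    "http://example.org/p/1964/": '<a href="/img/KY0001/">t</a> <a href="/img/KY0002/">t</a>',
    "http://example.org/p/1972/": '<a href="/img/NY0001/">t</a>',
    "http://example.org/img/KY0001/": '<a href="/lrg/KY0001.jpg">All Sizes</a>',
    "http://example.org/img/KY0002/": '<a href="/lrg/KY0002.jpg">All Sizes</a>',
    "http://example.org/img/NY0001/": '<a href="/lrg/NY0001.jpg">All Sizes</a>',
}

patterns = [
    re.compile(r'href="(/p/\d{4}/)"'),                # level 0 -> portfolios
    re.compile(r'href="(/img/[A-Z]{2}\d{4}/)"'),      # portfolios -> image pages
    re.compile(r'href="(/lrg/[A-Z]{2}\d{4}\.jpg)"'),  # image pages -> large jpgs
]

levels = crawl_by_level("http://example.org/", patterns, lambda u: PAGES.get(u, ""))
for depth, urls in enumerate(levels):
    print(depth, urls)
```

For the first mode ("follow every link meeting the pattern(s)"), you would simply try the same list of patterns on every level instead of one pattern per level.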
From the above, I do not think A1 would have been able to execute my given example (= otherwise than by just counting from 1 to n, that is), as there will probably not be the possibility to construct compound target urls yet, all the less so for similar groups? In other words, and as explained above, I suppose today it's "follow every link matching one or several regex patterns (and not ANY link, as in the previous generation of scrapers)", but it would not be possible to build up such links if they are not found already, in full, within the respective page source?
Btw, with more and more Ajax and similar pages around today, I suppose that functionality to build up links (from information the user will have to put into the program, gathered by manually following links and then checking the elements of the respective target url "upon arrival") will become more and more important, so that such a tool does not fail on the many pages it could not handle otherwise?
Well, I love to come up with complications! ;-)