Messages - ThomasSchulz

General Software Discussion / Re: Scraper too expensive at 20 bucks

« on: January 18, 2015, 02:58 PM »

It took a couple of minutes to create a project file for A1 Website Download that could download, and allow offline surfing of, the wanted pages and images, including the big images (and exclude the rest of the website and its images). I figured it would be easier if you saw the project, in case there was a misunderstanding (either by me or you) about whether A1WD would be able to handle the scenario you outlined.

I think we are talking past each other: Reading your posts again, I think you may be more interested in scraping specific data/images than downloading websites or large parts of them. Since A1 Website Download is not meant to scrape bits of each page - e.g. summaries or whatever people like to scrape - it is not something A1WD will ever do, as it's not what it is meant to solve.

2) e.g. limit it to only download and keep images."
Again, that would often download much too much, and I prefer downloading hygiene to lots of discard later on. Ultimate irony: That discarding unwanted pics could then possibly be done within a file manager allowing for regex selection.

I think we are talking past each other: A1WD would not download files excluded... But yes, it would have to download the content pages to discover the links/references to the images. If that is your criticism, I understand - i.e. if you in your proposed solution do not want to crawl pages to discover the files to download, but instead download files from a pre-generated list. For those who have the time to write custom downloaders for specific websites - that is of course always a good solution if you are not interested in downloading websites themselves.

For reference, I wrote my first "proof of concept" (but working) crawler and sitemapper in less than two days in 2003. It generated nice HTML sitemaps for my test websites... In 2005 I picked it up again and started working on the core crawler engine and all around it. Never stopped developing on it ever since. (Powers multiple sibling tools, among those A1 Website Download, so that is partly why.)

Anyhow, it seems you have a different idea for users to define what they want scraped/downloaded. And as a general statement: If you or anyone else have a plan on releasing a product that somehow improves or complements what is already out there - then I think you should go for it and release it!

I believe more in my own design (which I think is pretty unparalleled in flexibility and configuration compared to what else is currently out there among website downloaders), but it is a moot point to discuss from my point of view. (And you are certainly entitled to believe your proposed design is better.) I was mainly interested in hearing if anyone could present a practical real-world problem out there - if you recall, I immediately agreed that websites that rely heavily on AJAX can be a problem.

When I started, I joined http://asp-software.org/ but I do not really engage much in developer forums anymore except those specific to the tools and problems I encounter.

I wish you well on all your future endeavours!

General Software Discussion / Re: Scraper too expensive at 20 bucks

« on: January 17, 2015, 12:47 PM »

Hi,

I could create a project that only analyzes pages related to e.g. gedney and download all files used/linked from them.

After that, one could, if one so chooses, configure the crawler further to:
1) limit further, so it does not download linked pages/content outside wanted areas.
2) limit it to only download and keep images.

URL / link normalization is always done before any filter is tested, so all that is zero problem. General rule: As long as a browser can understand something in a website, so can the crawler. (AJAX being an exception.)
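To illustrate why normalizing first makes filters simpler, here is a minimal Python sketch. It is not A1WD's code: the `normalize` helper, the `example.com` URLs and the `IMAGES_ONLY` pattern are all made up for the example, and a real crawler normalizes far more than this.

```python
import re
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

def normalize(page_url: str, link: str) -> str:
    """Resolve a link against the page it was found on and tidy it up."""
    absolute = urljoin(page_url, link)          # resolves ../ and relative paths
    absolute, _fragment = urldefrag(absolute)   # drop any #fragment
    parts = urlsplit(absolute)
    # Scheme and host are case-insensitive, so lowercase them (a simplification).
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

# Hypothetical filter: keep only JPEGs below /gallery/.
IMAGES_ONLY = re.compile(r"^http://example\.com/gallery/.*\.jpg$")

page = "http://example.com/artists/index.html"
raw_links = ["../gallery/big/001.jpg#top",
             "HTTP://EXAMPLE.COM/gallery/002.jpg",
             "/about.html"]

# Normalize first, then test the filter against the normalized form only.
normalized = [normalize(page, link) for link in raw_links]
kept = [url for url in normalized if IMAGES_ONLY.search(url)]
```

Because the filter only ever sees the normalized form, one pattern covers the relative link, the upper-case link and any other spelling of the same URL.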

Replacing content/links/URLs inside page content is also a far larger system than regular expressions. Behind it all is essentially a huge multi-threaded engine with a ton of queues, lists and whatnot to ensure everything is done correctly and as optimally as possible. It is of course a never-ending process of optimization, which is why writing such a product can end up taking lots of time.

General Software Discussion / Re: Scraper too expensive at 20 bucks

« on: January 16, 2015, 10:59 PM »

Hi Peter,

Since you can use any number of regular expressions to

1)
Match URLs you want:
* Excluded from analysis
* Limit-to analysis to
(all filters in the two above get combined when deciding if a URL should be analyzed)

2)
Match URLs you want:
* Excluded from output/download-and-kept
* Limit-to output/download to
(all filters in the two above get combined when deciding if a URL should be output/downloaded-and-kept)

It should be possible to do what you want. It is one of the more complex but also powerful features of A1 tools.
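As a rough illustration of the two filter stages described above, here is a small Python sketch. It is not how A1 tools are implemented: the `FilterSet` class, the patterns and the URLs are invented for the example, and the way "exclude" and "limit-to" are combined (no exclude may match, and at least one limit-to must match if any are defined) is an assumption.

```python
import re
from dataclasses import dataclass, field

@dataclass
class FilterSet:
    """Hypothetical pair of regex lists for one stage (analysis or output)."""
    limit_to: list = field(default_factory=list)
    exclude: list = field(default_factory=list)

    def allows(self, url: str) -> bool:
        # Assumed rule: any matching exclude pattern rejects the URL...
        if any(re.search(p, url) for p in self.exclude):
            return False
        # ...and if limit-to patterns exist, at least one must match.
        if self.limit_to and not any(re.search(p, url) for p in self.limit_to):
            return False
        return True

# Stage 1: which pages get analyzed (crawled for links).
analysis = FilterSet(limit_to=[r"/artists/gedney/"], exclude=[r"\?print=1"])
# Stage 2: which URLs get downloaded and kept.
output = FilterSet(limit_to=[r"\.(jpe?g|png)$"], exclude=[r"/thumbs/"])

def plan(url: str) -> tuple:
    """Return (analyze?, keep?) for a normalized URL."""
    return analysis.allows(url), output.allows(url)
```

With these made-up patterns, a gedney content page is analyzed but not kept, while a big image linked from it is kept without being analyzed itself.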

But I will take a look later!

General Software Discussion / Re: Scraper too expensive at 20 bucks

« on: January 16, 2015, 07:56 AM »

Note to admin: This thread originates from http://www.bitsdujou.../a1-website-download

Since Peter decided to post his response here (his site?) I am posting here as well. If admin wants to delete this thread, I understand. (Peter, we can move it to my support forum or email.)

...

Anyhow, that was a very long post, which makes it hard to answer, but I will try.

A1 Website Download can download and convert links for entire websites, sections of websites and more. It does not matter if the pages are served dynamically or not. You can also use it to only download e.g. images or PDF files, and not download the actual web pages. For those who need it, it includes some very rich filtering options (including support for regular expressions), so you can define exactly what you want. This download is all automatic and does not include any guessing or prompting the user. The crawler simply dives through all pages and tries to discover all links (including those in e.g. CSS and Javascript files) and fix them, if necessary, when downloaded to disk. (You can also partially control/configure how this is done depending on your specific needs.) You are correct though, sometimes Javascript parsing is not perfect, but it works *pretty well* for most sites. (And the problem you mention about website downloaders in general only downloading thumbnails should not be a problem at all. However, should you know of websites causing issues, please do feel free to drop an email.)

Please note that A1 Website Download is not meant to be used as a scraper that extracts specific data (e.g. catalogs) and converts such data into e.g. .CSV files. (I offer a different tool for that.) It is meant for what the name states: downloading websites or larger portions thereof.

You mention that many webservers will ban IP addresses when you download websites. I can say that the default settings in A1 Website Download are configured so it will not cause problems for most. This also means that the default settings only run at about 1/10 of max speed, which will still be fast enough to download most websites reasonably fast. In addition, you can, if you wish, configure very precisely how many connections to use, pauses in-between and much more. As such, what you describe is a relatively rare problem with many workarounds if you consult the documentation or drop me an email.
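To make the idea of "number of connections plus pauses in-between" concrete, here is a generic Python sketch of a throttled downloader. It does not reflect A1 Website Download's actual settings or internals: `polite_fetch_all`, its parameter names and the stand-in `fake_fetch` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def polite_fetch_all(urls, fetch, max_connections=2, pause_seconds=0.5,
                     sleep=time.sleep):
    """Fetch every URL with at most max_connections in flight at once.

    fetch is any callable taking a URL; sleep is injectable so the
    pause can be observed or skipped in tests.
    """
    def worker(url):
        result = fetch(url)
        sleep(pause_seconds)  # this connection rests before taking the next URL
        return url, result

    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return dict(pool.map(worker, urls))

# Usage with a stand-in fetch function, so no real server is contacted.
def fake_fetch(url):
    return len(url)

pauses = []
results = polite_fetch_all(["a", "bb", "ccc"], fake_fetch,
                           max_connections=2, pause_seconds=0.5,
                           sleep=pauses.append)
```

Lowering `max_connections` or raising `pause_seconds` trades speed for a lighter load on the server, which is the kind of trade-off described above.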

The only thing I can fully agree with is that downloading AJAX websites can be very problematic. However, A1 Website Download implemented some AJAX support around 4.0.x - more precisely, for those websites that implement Google's suggestion on making AJAX websites crawlable: https://developers.g...sters/ajax-crawling/ Another area that can be problematic is login support, as the systems behind it get more and more complex. When I started developing the A1 series in 2006, session cookies behind POST forms were used on many websites, which was very easy to handle. Now, it is much more complicated and hit or miss.
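For readers unfamiliar with the AJAX crawling suggestion mentioned above: as I understand that scheme, a URL containing a "#!" fragment is requested by the crawler with the fragment moved into an `_escaped_fragment_` query parameter. The Python sketch below only illustrates that mapping; the function name and URLs are made up, the escaping is a simplification (it escapes more characters than strictly required), and it is not A1WD's code.

```python
from urllib.parse import quote

def to_escaped_fragment(url: str) -> str:
    """Map a '#!' URL to the form a crawler would request from the server."""
    if "#!" not in url:
        return url  # nothing to rewrite
    base, fragment = url.split("#!", 1)
    # Append to an existing query string if there is one.
    separator = "&" if "?" in base else "?"
    # Keep '=' and '/' readable; '&', '#', '%', '+' and spaces get percent-escaped.
    return f"{base}{separator}_escaped_fragment_={quote(fragment, safe='=/')}"

simple = to_escaped_fragment("http://example.com/page#!key=value")
with_query = to_escaped_fragment("http://example.com/page?lang=en#!state=a&b")
```

The server is then expected to answer such a request with a static HTML snapshot of what the browser would have rendered for the "#!" URL.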

Note: I tried to answer most of your questions - if I missed anything, feel free to drop me an email, and I will explain it more fully and/or point to the relevant help page. Your post is good because it describes much of the work that goes into building a website downloader, so even if we disagree, and your fundamental opinion is that such "website download" jobs should be built custom for each website (if I understood you correctly), it gives a good impression of the overall problems faced!
