DonationCoder.com Forum

DonationCoder.com Software => Coding Snacks => Finished Programs => Topic started by: rayman3003 on August 04, 2017, 01:59 PM

Title: DONE: Extracting All Image Links from a Booru Website
Post by: rayman3003 on August 04, 2017, 01:59 PM
I want to extract the links of all images (png, jpg, jpeg, gif) on a booru website. (Boorus are a kind of image-hosting website.)

Here is an example:

http://mspabooru.com/
(contains 161151 images)

(https://kek.gg/i/8653RC.jpg)

When clicking on "posts", we can see the images:

http://mspabooru.com/index.php?page=post&s=list

(https://kek.gg/i/6C5Ldh.jpg)

Then, after clicking on any image, it goes to a page:

http://mspabooru.com/index.php?page=post&s=view&id=166035

(https://kek.gg/i/6kK7SK.jpg)

which contains the link (URL) of that image; on the left side of the page, we click the "Original image" link:

http://mspabooru.com//images/15/953debec5a4f550b37a7b0cabe5395a2.png

-------------

Now I'm looking for a tool (an app or a website) that extracts all these "Original image" links (URLs) quickly, so I can then import the list of URLs into a download manager and leech them all.

I would give that tool these links:

http://mspabooru.com/index.php?page=post&s=view&id=161151
http://mspabooru.com/index.php?page=post&s=view&id=161150
http://mspabooru.com/index.php?page=post&s=view&id=161149
...
http://mspabooru.com/index.php?page=post&s=view&id=1

Then it would give me back the link of the image on each page:

http://mspabooru.com//images/15/aa2652c67b307eefb48ca3672fe09517.jpeg
http://mspabooru.com//images/15/fae66c978b7e52ada0e5332b5dbe2abd.png
http://mspabooru.com//images/15/38d72ad818e1471c5955db17ce2e89c4.jpeg
...
http://mspabooru.com//images/15/38d72ad818e14645955db17ce2e89c4.jpeg
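In other words, the extraction step could be sketched in Python (a hypothetical sketch, not an existing tool; it assumes each direct image URL appears verbatim in the page's HTML source, and all names here are made up for illustration):

```python
import re
import urllib.request

# Regex for direct image links (png/jpg/jpeg/gif) inside a page's HTML source.
IMAGE_RE = re.compile(r'https?://[^\s"\'<>]+\.(?:png|jpg|jpeg|gif)')

def extract_image_links(html):
    """Return the unique non-thumbnail image URLs found in one page's HTML."""
    seen = []
    for url in IMAGE_RE.findall(html):
        if "thumbnail" not in url and url not in seen:
            seen.append(url)
    return seen

def links_for_post(post_id):
    """Fetch one post page and extract its image links (URL layout assumed)."""
    page = "http://mspabooru.com/index.php?page=post&s=view&id=%d" % post_id
    with urllib.request.urlopen(page) as resp:
        return extract_image_links(resp.read().decode("utf-8", "replace"))
```

Looping `links_for_post` over id=1..161151 and writing the results to a text file would produce exactly the list described above.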

So please give me any suggestions.
(Or maybe a coder here could help me; I don't think it's hard to code a tiny app that does this.)

Thank u  ;D
Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: Curt on August 04, 2017, 05:50 PM
Do you really need to keep a list of the links, or is it okay to "merely" get the pictures?


Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: rayman3003 on August 05, 2017, 12:23 AM
Do you really need to keep a list of the links, or is it okay to "merely" get the pictures?

My goal is to get the pictures, but having the list of links lets me put them into a download manager and leech them faster.  :) (So I prefer the list of links.)
Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: 4wd on August 05, 2017, 06:44 AM
Wouldn't it be easier to use a Download Manager within the browser?

e.g. Firefox + DownThemAll, or Chrome-based + GetThemAll

Both of these allow you to filter on filetype.
Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: rayman3003 on August 05, 2017, 10:24 AM
Wouldn't it be easier to use a Download Manager within the browser?

e.g. Firefox + DownThemAll, or Chrome-based + GetThemAll

Both of these allow you to filter on filetype.

OK, that's also a download manager.  :P

I have DownThemAll in my Firefox. Please show me how to grab all the images with DownThemAll from all 161,151 of these pages:

http://mspabooru.com/index.php?page=post&s=view&id=161151
http://mspabooru.com/index.php?page=post&s=view&id=161150
http://mspabooru.com/index.php?page=post&s=view&id=161149
...
http://mspabooru.com/index.php?page=post&s=view&id=1
Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: skwire on August 05, 2017, 05:12 PM
Here's an AutoHotkey example showing how to download all the images from the example site you gave:

Code: Autohotkey
;http://mspabooru.com/index.php?page=post&s=view&id=166035
sSourceURL := "http://mspabooru.com/index.php?page=post&s=view&id="
nPageStart := 1
nPageEnd   := 161151

; Create directory to dump images to.
FileCreateDir, % A_ScriptDir . "\images"

Loop, % nPageEnd
{
    If ( A_Index < nPageStart )
    {
        Continue
    }
    Else
    {
        ; Update tray icon tooltip.
        Menu, Tray, Tip, % "Processing URL number: " . A_Index

        ; Download HTML page.
        URLDownloadToFile, % sSourceURL . A_Index, % A_ScriptDir . "\temp.html"

        ; Read in HTML source.
        FileRead, myData, % A_ScriptDir . "\temp.html"

        ; Parse HTML source for image URLs.
        Loop, Parse, myData, `n, `r
        {
            ; Match image URLs.
            If ( RegExMatch( A_LoopField, "(https?:\/\/.*\.(?:png|jpg|jpeg|gif))", Match ) )
            {
                ; Crack the URL into its parts.
                SplitPath, % Match1, OutFileName, OutDir, OutExtension, OutNameNoExt, OutDrive

                ; Skip any images with "thumbnail" in the filename.
                If ! InStr( Match1, "thumbnail" )
                {
                    ; Download the image.
                    URLDownloadToFile, % Match1, % A_ScriptDir . "\images\" . OutFileName
                }
            }
        }
    }
}
MsgBox, Done!
Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: rayman3003 on August 06, 2017, 04:36 AM
Here's an AutoHotkey example showing how to download all the images from the example site you gave:

Thank u very much. It works like a charm.

But unfortunately, it's sooooooo slow. It leeched 2 MB in 2 minutes!

With download managers, I can grab at least 15 MB in 2 minutes! Download managers fetch files simultaneously, so they are much faster; that's why I prefer a list of image links over leeching them with a non-downloader tool (like AutoHotkey).

Speed is very important for me in this case (like I said in the first post), because I want to leech more than 800,000 images across three different booru websites.  :-[

But again, Thank u for your time.  :Thmbsup:
Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: Ath on August 06, 2017, 05:06 AM
As said, it's an example of how to get a lot of files from that particular website: each page has to be downloaded to extract the actual download URL for its image.

You might want to do some work on it, as downloading 800,000 files into one directory isn't something Windows is very fond of :o
Splitting the from/to range across several scripts lets you run them in parallel (one script per CPU core seems reasonable), and that way you can run several sites in parallel too; but you might get your IP banned from the server for hammering the site with that many requests :-\
A possible speed improvement is to update the tray icon tooltip not for every page but only every 10th or so; you still get a notion of progress, and setting the tooltip, especially while Windows is displaying it, is usually quite slow. (NB: haven't tested it myself.)
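Ath's splitting idea could also be sketched in Python instead of several copies of the script, by chunking the ID range across a thread pool (a hypothetical sketch: `process_chunk` is a placeholder for the fetch-and-parse work, and a real run should throttle requests to avoid hammering the server):

```python
from concurrent.futures import ThreadPoolExecutor

def split_range(start, end, n_chunks):
    """Split the inclusive ID range [start, end] into n_chunks contiguous chunks."""
    total = end - start + 1
    size, extra = divmod(total, n_chunks)
    chunks, lo = [], start
    for i in range(n_chunks):
        hi = lo + size - 1 + (1 if i < extra else 0)  # spread the remainder
        if hi >= lo:
            chunks.append((lo, hi))
        lo = hi + 1
    return chunks

def process_chunk(bounds):
    lo, hi = bounds
    # Placeholder: fetch each post page in [lo, hi] and append its image
    # links to a per-chunk file, as the AutoHotkey script does for one range.
    return (lo, hi)

if __name__ == "__main__":
    # e.g. one worker per CPU core, as suggested above.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_chunk, split_range(1, 161151, 4)))
```

Writing each chunk's links to its own file also sidesteps the 800,000-files-in-one-directory problem.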
Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: IainB on August 06, 2017, 06:46 AM
Might be worth looking at the references to "image download" in the thread Re: Firefox Extensions: Your favorite or most useful (https://www.donationcoder.com/forum/index.php?topic=1685.msg282733;topicseen#msg282733)

Also see:
...Does it really work? Wow. It hasn't been updated for 11 months, so I assumed...
__________________
It most decidedly does work, and you can crawl any website, gathering specific file types.
For example, from the Mozilla FoxySpider Add-on page (https://addons.mozilla.org/en-US/firefox/addon/foxyspider/):
____________________________
About this Add-on
With FoxySpider you can:
  • Get all photos from an entire website
  • Get all video clips from an entire website
  • Get all audio files from an entire website
  • Well, actually get any file type you want from an entire website
FoxySpider can be used to create a thumbnail gallery containing links to rich media files of any file types you are interested in. It can also crawl deep to any level on a website and display the applicable files it found in the same gallery. FoxySpider is useful for different media content pages (music, video, images, documents), thumbnail gallery post (TGP) sites, podcasts. You can narrow and expand the search to support exactly what you want.
Once the thumbnail gallery is created you can view, download or share (on Facebook and Twitter) every file that was fetched by FoxySpider.
____________________________

Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: skwire on August 06, 2017, 06:51 AM
But unfortunately, Its sooooooo slow. It leeched 2MB in 2 minutes!

This is going to be due to your location and ISP speed, since I was able to grab 30+ megs of images in two minutes even with just that single-threaded code.  Here's a modification that simply creates a text file of image links.  As Ath mentioned, you can comment out line 17 for a tiny speedup, but I don't think it will make a noticeable difference.

Code: Autohotkey
  1. ;http://mspabooru.com/index.php?page=post&s=view&id=166035
  2. sSourceURL := "http://mspabooru.com/index.php?page=post&s=view&id="
  3. nPageStart := 1
  4. nPageEnd   := 161151
  5.
  6. FileCreateDir, % A_ScriptDir . "\images"
  7.
  8. Loop, % nPageEnd
  9. {
  10.     If ( A_Index < nPageStart )
  11.     {
  12.         Continue
  13.     }
  14.     Else
  15.     {
  16.         ; Update tray icon tooltip.
  17.         Menu, Tray, Tip, % "Processing URL number: " . A_Index
  18.
  19.         ; Download HTML page.
  20.         URLDownloadToFile, % sSourceURL . A_Index, % A_ScriptDir . "\temp.html"
  21.
  22.         ; Read in HTML source.
  23.         FileRead, myData, % A_ScriptDir . "\temp.html"
  24.
  25.         ; Parse HTML source for image URLs.
  26.         Loop, Parse, myData, `n, `r
  27.         {
  28.             ; Match image URLs.
  29.             If ( RegExMatch( A_LoopField, "(https?:\/\/.*\.(?:png|jpg|jpeg|gif))", Match ) )
  30.             {
  31.                 ; Crack the URL into its parts.
  32.                 SplitPath, % Match1, OutFileName, OutDir, OutExtension, OutNameNoExt, OutDrive
  33.
  34.                 ; Skip any images with "thumbnail" or "width" in the URL.
  35.                 If ! ( InStr( Match1, "thumbnail" ) OR InStr( Match1, "width" ) )
  36.                 {
  37.                     ; Download the image (disabled; this version only collects links).
  38.                     ; URLDownloadToFile, % Match1, % A_ScriptDir . "\images\" . OutFileName
  39.
  40.                     ; Create a list of links.
  41.                     FileAppend, % Match1 . "`r`n", % A_ScriptDir . "\ImageLinks.txt"
  42.                 }
  43.             }
  44.         }
  45.     }
  46. }
  47. MsgBox, Done!
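If a download manager isn't handy, the resulting ImageLinks.txt could also be leeched with simultaneous connections from a short script. A hypothetical Python sketch (file and directory names are assumptions; a real run should cap `workers` to stay polite to the server):

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def filename_for(url):
    """Take the last path component of a URL as the local filename."""
    return url.rsplit("/", 1)[-1]

def download_one(url, out_dir="images"):
    """Download one image URL into out_dir, keeping its original filename."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename_for(url))
    urllib.request.urlretrieve(url, path)
    return path

def download_all(links_file="ImageLinks.txt", workers=8):
    """Fetch every link in links_file using `workers` simultaneous connections."""
    with open(links_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_one, urls))
```

The thread pool gives the simultaneous downloads that make dedicated download managers feel fast.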
Title: Re: DONE: Extracting All Image Links from a Booru Website
Post by: rayman3003 on August 06, 2017, 07:41 AM
But unfortunately, Its sooooooo slow. It leeched 2MB in 2 minutes!

This is going to be due to your location and ISP speed, since I was able to grab 30+ megs of images in two minutes even with just that single-threaded code.  Here's a modification that simply creates a text file of image links.  As Ath mentioned, you can comment out line 17 for a tiny speedup, but I don't think it will make a noticeable difference.


Thank u. This time it worked just as I wanted.  ;D
(And thanks to others that helped me in this topic  :up: )