Topic: DONE: Extracting All Image Links from a Booru Website

rayman3003

  • Participant
  • Joined in 2014
  • Posts: 5
DONE: Extracting All Image Links from a Booru Website
« on: August 04, 2017, 01:59 PM »
I want to extract the links of all images (png, jpg, jpeg, gif) on a booru website. (Boorus are a kind of image-hosting website.)

Here is an example:

http://mspabooru.com/
(contains 161151 images)



When we click on posts, we can see the images:

http://mspabooru.com/index.php?page=post&s=list



Then, clicking on any image opens its own page:

http://mspabooru.com/index.php?page=post&s=view&id=166035



That page contains the link (URL) to the image, which we get by clicking the "Original image" link on the left side of the page:

http://mspabooru.com//images/15/953debec5a4f550b37a7b0cabe5395a2.png

-------------

Now I'm looking for a tool (an app or an online service) that helps me extract all these "Original image" links (URLs) quickly, so that I can import the list of URLs into a download manager and leech them all.

I give that tool these links:

http://mspabooru.com/index.php?page=post&s=view&id=161151
http://mspabooru.com/index.php?page=post&s=view&id=161150
http://mspabooru.com/index.php?page=post&s=view&id=161149
...
http://mspabooru.com/index.php?page=post&s=view&id=1

Then it gives me the links to the images on each page:

http://mspabooru.com//images/15/aa2652c67b307eefb48ca3672fe09517.jpeg
http://mspabooru.com//images/15/fae66c978b7e52ada0e5332b5dbe2abd.png
http://mspabooru.com//images/15/38d72ad818e1471c5955db17ce2e89c4.jpeg
...
http://mspabooru.com//images/15/38d72ad818e14645955db17ce2e89c4.jpeg

So please give me any suggestions.
(Or maybe a coder here could help me; I think it's not hard to code a tiny app that does this.)

Thank you  ;D

Curt

  • Supporting Member
  • Joined in 2006
  • Posts: 7,566
Re: DONE: Extracting All Image Links from a Booru Website
« Reply #1 on: August 04, 2017, 05:50 PM »
Do you really need to keep a list of the links, or is it okay to "merely" get the pictures?



rayman3003

Re: DONE: Extracting All Image Links from a Booru Website
« Reply #2 on: August 05, 2017, 12:23 AM »
Quote from: Curt on August 04, 2017, 05:50 PM
Do you really need to keep a list of the links, or is it okay to "merely" get the pictures?

My goal is to get the pictures. But having the list of links lets me put them into a download manager and leech them faster.  :) (So I prefer the list of links.)
« Last Edit: August 05, 2017, 12:34 AM by rayman3003 »

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,644
Re: DONE: Extracting All Image Links from a Booru Website
« Reply #3 on: August 05, 2017, 06:44 AM »
Wouldn't it be easier to use a Download Manager within the browser?

e.g. Firefox + DownThemAll, or a Chrome-based browser + GetThemAll

Both of these allow you to filter on filetype.

rayman3003

Re: DONE: Extracting All Image Links from a Booru Website
« Reply #4 on: August 05, 2017, 10:24 AM »
Quote from: 4wd on August 05, 2017, 06:44 AM
Wouldn't it be easier to use a Download Manager within the browser?
eg. Firefox + DownThemAll  or  Chrome-based + GetThemAll
Both of these allow you to filter on filetype.

OK, that's also a download manager.  :P

I have DownThemAll in my Firefox. Please show me how to grab all the images with DownThemAll from all of these 161151 pages:

http://mspabooru.com/index.php?page=post&s=view&id=161151
http://mspabooru.com/index.php?page=post&s=view&id=161150
http://mspabooru.com/index.php?page=post&s=view&id=161149
...
http://mspabooru.com/index.php?page=post&s=view&id=1

skwire

  • Global Moderator
  • Joined in 2005
  • Posts: 5,287
Re: DONE: Extracting All Image Links from a Booru Website
« Reply #5 on: August 05, 2017, 05:12 PM »
Here's an AutoHotkey example showing how to download all the images from the example site you gave:

Code: Autohotkey
; Example page: http://mspabooru.com/index.php?page=post&s=view&id=166035
sSourceURL := "http://mspabooru.com/index.php?page=post&s=view&id="
nPageStart := 1
nPageEnd   := 161151

; Create directory to dump images to.
FileCreateDir, % A_ScriptDir . "\images"

Loop, % nPageEnd
{
    ; Skip page IDs below the requested start.
    If ( A_Index < nPageStart )
    {
        Continue
    }

    ; Update tray icon tooltip.
    Menu, Tray, Tip, % "Processing URL number: " . A_Index

    ; Download HTML page; skip this ID if the download fails,
    ; so a stale temp.html is not parsed again.
    URLDownloadToFile, % sSourceURL . A_Index, % A_ScriptDir . "\temp.html"
    If ( ErrorLevel )
    {
        Continue
    }

    ; Read in HTML source.
    FileRead, myData, % A_ScriptDir . "\temp.html"

    ; Parse HTML source line by line for image URLs.
    Loop, Parse, myData, `n, `r
    {
        ; Match image URLs.
        If ( RegExMatch( A_LoopField, "(https?:\/\/.*\.(?:png|jpg|jpeg|gif))", Match ) )
        {
            ; Crack the URL into its parts.
            SplitPath, % Match1, OutFileName, OutDir, OutExtension, OutNameNoExt, OutDrive

            ; Skip any images with "thumbnail" in the URL.
            If ( ! InStr( Match1, "thumbnail" ) )
            {
                ; Download the image.
                URLDownloadToFile, % Match1, % A_ScriptDir . "\images\" . OutFileName
            }
        }
    }
}
MsgBox, Done!
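
For anyone who wants to try the URL-matching step outside AutoHotkey, here is a minimal Python sketch of the same idea. It is not part of the script above. It assumes the image URLs appear in the page HTML as quoted absolute links; the thumbnail URL in the sample is made up for illustration, and the names `IMAGE_URL`, `SKIP_MARKERS` and `extract_image_urls` are hypothetical. Unlike the `.*` in the script's pattern, the character class here stops at quotes, whitespace and angle brackets, so a single match cannot run across several HTML attributes.

```python
import re

# Bounded pattern: the URL body may not contain quotes, whitespace or
# angle brackets, so a match ends at the closing quote of the attribute.
IMAGE_URL = re.compile(
    r"""https?://[^\s"'<>]+?\.(?:png|jpe?g|gif)\b""", re.IGNORECASE
)

# Substrings the thread's scripts use to reject unwanted matches.
SKIP_MARKERS = ("thumbnail", "width")


def extract_image_urls(html):
    """Return image URLs found in html, in order, without duplicates."""
    seen = set()
    urls = []
    for match in IMAGE_URL.finditer(html):
        url = match.group(0)
        if any(marker in url for marker in SKIP_MARKERS):
            continue  # thumbnails and resized variants are not wanted
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Sample HTML: the first URL is a hypothetical thumbnail, the second is the
# "Original image" link quoted in the first post.
sample = (
    '<img src="http://mspabooru.com/thumbnails/15/thumbnail_953debec.jpg" width="150">\n'
    '<a href="http://mspabooru.com//images/15/953debec5a4f550b37a7b0cabe5395a2.png">'
    "Original image</a>"
)
print(extract_image_urls(sample))
```

Because `finditer` scans the whole text, this version also picks up more than one image URL per line, which the line-by-line `RegExMatch` loop does not.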

rayman3003

Re: DONE: Extracting All Image Links from a Booru Website
« Reply #6 on: August 06, 2017, 04:36 AM »
Quote from: skwire on August 05, 2017, 05:12 PM
Here's an AutoHotkey example showing how to download all the images from the example site you gave:

Thank you very much. It works like a charm.

But unfortunately, it's sooooooo slow. It leeched 2MB in 2 minutes!

But with a download manager, I could grab at least 15MB in 2 minutes! Download managers can download files simultaneously, so I get more speed with them. That's why I prefer a list of image links over leeching the images with a non-downloader tool (like AutoHotkey).

Speed is very important to me in this case (as I said in the first post), because I want to leech more than 800,000 images from three different booru websites.  :-[

But again, thank you for your time.  :Thmbsup:

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,629
Re: DONE: Extracting All Image Links from a Booru Website
« Reply #7 on: August 06, 2017, 05:06 AM »
As said, it's an example of how to get a lot of files from that particular website. Each page has to be downloaded to extract the actual download URL of its image.

You might want to do some work on it, as downloading 800,000 files into one directory isn't something Windows is very fond of :o
Splitting the from/to range across several scripts lets you run them in parallel (one script per CPU core seems reasonable), and this way you can process several sites in parallel too. But you might get your IP banned by the server for hammering the site with that many requests :-\
A possible speed improvement is to set the tray icon tooltip only for every 10th page or so instead of for each page, so you still have a notion of progress. This helps especially if Windows displays the tooltip each time it's set, which is usually quite slow. (NB: Haven't tested it myself.)
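
The three suggestions above (split the ID range, avoid one giant directory, report progress only every 10th page) can be sketched roughly as follows. This is an illustrative Python sketch, not something tested against the site. The function names, the 1000-IDs-per-folder bucket size and the folder naming scheme are all assumptions made for the example.

```python
def split_range(start, end, parts):
    """Split the inclusive ID range [start, end] into at most `parts` chunks."""
    total = end - start + 1
    if total <= 0 or parts <= 0:
        return []
    parts = min(parts, total)  # never create empty chunks
    base, extra = divmod(total, parts)
    chunks = []
    low = start
    for i in range(parts):
        # The first `extra` chunks take one additional ID each.
        size = base + (1 if i < extra else 0)
        chunks.append((low, low + size - 1))
        low += size
    return chunks


def bucket_dir(page_id, per_dir=1000):
    """Name a subdirectory so that each one covers `per_dir` page IDs."""
    first = ((page_id - 1) // per_dir) * per_dir + 1
    return "images/{:06d}-{:06d}".format(first, first + per_dir - 1)


def should_report(index, every=10):
    """True on every `every`-th item, for throttled progress updates."""
    return index % every == 0


# One (from, to) pair per script instance, e.g. for four parallel runs.
print(split_range(1, 161151, 4))
# Folder for the example post ID from the first post.
print(bucket_dir(166035))
```

Each `(from, to)` pair would become the `nPageStart`/`nPageEnd` of one script instance. Keeping the number of instances small is the safer choice given the risk of an IP ban mentioned above.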

IainB

  • Supporting Member
  • Joined in 2008
  • Posts: 7,544
  • @Slartibartfarst
Re: DONE: Extracting All Image Links from a Booru Website
« Reply #8 on: August 06, 2017, 06:46 AM »
Might be worth looking at the references to "image download" in the thread Re: Firefox Extensions: Your favorite or most useful

Also see:
...Does it really work? Wow. It hasn't been updated for 11 months, so I assumed...
__________________
It most decidedly does work, and you can crawl any website, gathering specific file types.
For example, from the Mozilla FoxySpider Add-on page:
____________________________
About this Add-on
With FoxySpider you can:
  • Get all photos from an entire website
  • Get all video clips from an entire website
  • Get all audio files from an entire website
  • Well, actually get any file type you want from an entire website
FoxySpider can be used to create a thumbnail gallery containing links to rich media files of any file types you are interested in. It can also crawl deep to any level on a website and display the applicable files it found in the same gallery. FoxySpider is useful for different media content pages (music, video, images, documents), thumbnail gallery post (TGP) sites, podcasts. You can narrow and expand the search to support exactly what you want.
Once the thumbnail gallery is created you can view, download or share (on Facebook and Twitter) every file that was fetched by FoxySpider.
____________________________

« Last Edit: August 06, 2017, 06:57 AM by IainB »

skwire

Re: DONE: Extracting All Image Links from a Booru Website
« Reply #9 on: August 06, 2017, 06:51 AM »
Quote from: rayman3003 on August 06, 2017, 04:36 AM
But unfortunately, it's sooooooo slow. It leeched 2MB in 2 minutes!

This is going to be due to your location and ISP speed since I was able to grab 30+ megs of images in two minutes even with just that single-threaded code.  Here's a modification that simply creates a text file of image links.  As Ath mentioned, you can comment out line 17 for a tiny speedup but I don't think it's going to make a noticeable difference.

Code: Autohotkey
  1. ;http://mspabooru.com/index.php?page=post&s=view&id=166035
  2. sSourceURL := "http://mspabooru.com/index.php?page=post&s=view&id="
  3. nPageStart := 1
  4. nPageEnd   := 161151
  5.  
  6. FileCreateDir, % A_ScriptDir . "\images"
  7.  
  8. Loop, % nPageEnd
  9. {
  10.     If ( A_Index < nPageStart )
  11.     {
  12.         Continue
  13.     }
  14.     Else
  15.     {
  16.         ; Update tray icon tooltip.
  17.         Menu, Tray, Tip, % "Processing URL number: " . A_Index
  18.  
  19.         ; Download HTML page.
  20.         URLDownloadToFile, % sSourceURL . A_Index, % A_ScriptDir . "\temp.html"
  21.  
  22.         ; Read in HTML source.
  23.         FileRead, myData, % A_ScriptDir . "\temp.html"
  24.  
  25.         ; Parse HTML source for image URLs.
  26.         Loop, Parse, myData, `n, `r
  27.         {
  28.             ; Match image URLs.
  29.             If ( RegExMatch( A_LoopField, "(https?:\/\/.*\.(?:png|jpg|jpeg|gif))", Match ) )
  30.             {
  31.                 ; Crack the URL into its parts.
  32.                 SplitPath, % Match1, OutFileName, OutDir, OutExtension, OutNameNoExt, OutDrive
  33.  
  34.                 ; Skip any matches with "thumbnail" or "width" in the URL.
  35.                 If ! ( InStr( Match1, "thumbnail" ) OR Instr( Match1, "width" ) )
  36.                 {
  37.                     ; Download the image.
  38.                     ; URLDownloadToFile, % Match1, % A_ScriptDir . "\images\" . OutFileName
  39.  
  40.                     ; Create a list of links.
  41.                     FileAppend, % Match1 . "`r`n", % A_ScriptDir . "\ImageLinks.txt"
  42.                 }
  43.             }
  44.         }
  45.     }
  46. }
  47. MsgBox, Done!
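
Since speed was the concern, here is a hedged Python sketch of how the same link list could be built with several page downloads in flight at once. It is an alternative to the single-threaded script above, not a translation of it. `fetch_page`, `collect_links`, `write_links` and the `workers` default are hypothetical names and values chosen for illustration, and `extract` is a caller-supplied function that turns a page's HTML into a list of image URLs.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Page URL prefix taken from the thread; the post ID is appended to it.
PAGE_URL = "http://mspabooru.com/index.php?page=post&s=view&id="


def fetch_page(page_id):
    """Download one post page; return its HTML, or "" on a network error."""
    try:
        with urlopen(PAGE_URL + str(page_id), timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return ""


def collect_links(page_ids, extract, fetch=fetch_page, workers=4):
    """Fetch pages concurrently and return all extracted links in page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(fetch, page_ids)  # map() keeps the input order
        return [link for html in pages for link in extract(html)]


def write_links(links, path="ImageLinks.txt"):
    """Write one link per line, CRLF-terminated like the AutoHotkey output."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.writelines(link + "\r\n" for link in links)
```

A call such as `write_links(collect_links(range(1, 101), extract))` would process the first hundred post IDs. A small `workers` value is deliberate, because many simultaneous requests increase the chance of the IP ban Ath warned about.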

rayman3003

Re: DONE: Extracting All Image Links from a Booru Website
« Reply #10 on: August 06, 2017, 07:41 AM »
Quote from: skwire on August 06, 2017, 06:51 AM
Quote from: rayman3003 on August 06, 2017, 04:36 AM
But unfortunately, it's sooooooo slow. It leeched 2MB in 2 minutes!

This is going to be due to your location and ISP speed since I was able to grab 30+ megs of images in two minutes even with just that single-threaded code.  Here's a modification that simply creates a text file of image links.  As Ath mentioned, you can comment out line 17 for a tiny speedup but I don't think it's going to make a noticeable difference.


Thank you. This time it worked just as I wanted. Thank you.  ;D
(And thanks to the others who helped me in this topic  :up: )