Author Topic: DONE: Extracting All Image Links from a Booru Website (Read 15253 times)

rayman3003 · « **on:** August 04, 2017, 01:59 PM »

I want to extract the links to all images (PNG, JPG, JPEG, GIF) on a booru website. (Boorus are a kind of image-hosting website.)

Here is an example:

http://mspabooru.com/

(contains 161151 images)

Clicking on "Posts" shows the images:

http://mspabooru.com/index.php?page=post&s=list

Then, clicking on any image opens its page:

http://mspabooru.com/index.php?page=post&s=view&id=166035

That page contains the link (URL) to the image. It is on the left side of the page, behind the "Original image" link:

http://mspabooru.com//images/15/953debec5a4f550b37a7b0cabe5395a2.png

-------------

Now I'm looking for a tool (an app or a website) that helps me extract all of these "Original image" links (URLs) quickly, so that I can import the list of URLs into a download manager and leech them all.

I would give the tool these links:

http://mspabooru.com/index.php?page=post&s=view&id=161151
http://mspabooru.com/index.php?page=post&s=view&id=161150
http://mspabooru.com/index.php?page=post&s=view&id=161149
...
.
.
.
http://mspabooru.com/index.php?page=post&s=view&id=1

Then it would give me the link to the image on each page:

http://mspabooru.com//images/15/aa2652c67b307eefb48ca3672fe09517.jpeg
http://mspabooru.com//images/15/fae66c978b7e52ada0e5332b5dbe2abd.png
http://mspabooru.com//images/15/38d72ad818e1471c5955db17ce2e89c4.jpeg
...
.
.
http://mspabooru.com//images/15/38d72ad818e14645955db17ce2e89c4.jpeg

So please give me any suggestions.
(Or maybe a coder here could help me; I think it's not hard to code a tiny app that does this.)

Thank you

Curt · « **Reply #1 on:** August 04, 2017, 05:50 PM »

Do you really need to keep a list of the links, or is it okay to "merely" get the pictures?

rayman3003 · « **Reply #2 on:** August 05, 2017, 12:23 AM »

Do you really need to keep a list of the links, or is it okay to "merely" get the pictures?
-Curt (August 04, 2017, 05:50 PM)

My goal is to get the pictures. But having the list of links lets me put them in a download manager and leech them faster.

(So I prefer the list of links.)

4wd · « **Reply #3 on:** August 05, 2017, 06:44 AM »

Wouldn't it be easier to use a Download Manager within the browser?

eg. Firefox + DownThemAll or Chrome-based + GetThemAll

Both of these allow you to filter on filetype.

rayman3003 · « **Reply #4 on:** August 05, 2017, 10:24 AM »

Wouldn't it be easier to use a Download Manager within the browser?

eg. Firefox + DownThemAll or Chrome-based + GetThemAll

Both of these allow you to filter on filetype.
-4wd (August 05, 2017, 06:44 AM)

OK, that's also a download manager.

I have DownThemAll in my Firefox. Please show me how to grab all the images with DownThemAll from all of these 161151 pages:

http://mspabooru.com/index.php?page=post&s=view&id=161151
http://mspabooru.com/index.php?page=post&s=view&id=161150
http://mspabooru.com/index.php?page=post&s=view&id=161149
...
.
.
.
http://mspabooru.com/index.php?page=post&s=view&id=1

skwire · « **Reply #5 on:** August 05, 2017, 05:12 PM »

Here's an AutoHotkey example showing how to download all the images from the example site you gave:

Code: Autohotkey

;http://mspabooru.com/index.php?page=post&s=view&id=166035
sSourceURL := "http://mspabooru.com/index.php?page=post&s=view&id="
nPageStart := 1
nPageEnd   := 161151
 
; Create directory to dump images to.
FileCreateDir, % A_ScriptDir . "\images"
 
Loop, % nPageEnd
{
    If ( A_Index < nPageStart )
    {
        Continue
    }
    Else
    {
        ; Update tray icon tooltip.
        Menu, Tray, Tip, % "Processing URL number: " . A_Index
        
        ; Download HTML page.
        URLDownloadToFile, % sSourceURL . A_Index, % A_ScriptDir . "\temp.html"
        
        ; Read in HTML source.
        FileRead, myData, % A_ScriptDir . "\temp.html"
        
        ; Parse HTML source for image URLs.
        Loop, Parse, myData, `n, `r
        {
            ; Match image URLs.  
            If ( RegExMatch( A_LoopField, "(https?:\/\/.*\.(?:png|jpg|jpeg|gif))", Match ) ) 
            {
                ; Crack the URL into its parts.
                SplitPath, % Match1, OutFileName, OutDir, OutExtension, OutNameNoExt, OutDrive
                
                ; Skip any images with "thumbnail" in the filename.
                If ! InStr( Match1, "thumbnail" )
                {
                    ; Download the image.
                    URLDownloadToFile, % Match1, % A_ScriptDir . "\images\" . OutFileName
                }
            }
        }
    }
}
MsgBox, Done!

rayman3003 · « **Reply #6 on:** August 06, 2017, 04:36 AM »

Here's an AutoHotkey example showing how to download all the images from the example site you gave:
-skwire (August 05, 2017, 05:12 PM)

Thank you very much. It works like a charm.

But unfortunately, it's sooooooo slow. It leeched 2 MB in 2 minutes!

With a download manager, I could grab at least 15 MB in 2 minutes! Download managers can download files simultaneously, so I get more speed with them. That's why I prefer a list of image links over leeching the images with a non-downloader tool (like AutoHotkey).

Speed is very important to me in this case (like I said in the first post), because I want to leech more than 800,000 images across three different booru websites.

But again, thank you for your time.

Ath · « **Reply #7 on:** August 06, 2017, 05:06 AM »

As said, it's an example of how to get a lot of files from that particular website. Each page has to be downloaded to extract the actual download-url per image.

You might want to do some work on it, as downloading 800,000 files into one directory isn't something Windows is very fond of.

Splitting the from/to range across several scripts lets you run them in parallel (one script per CPU core seems reasonable), and this way you can run several sites in parallel too. But you might get your IP banned from the server for hammering the site with that many requests :-\
A possible speed improvement could be to not set the tray icon tooltip for every page, but only for every 10th or so. You would still have a notion of progress, and setting the tooltip is usually quite slow, especially if Windows is displaying it at that moment. (NB: Haven't tested it myself.)
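
A minimal, untested sketch of how those three suggestions could look in the loop of the script above (AutoHotkey v1.1 syntax). The range values, the bucket size nPerDir, and the variable names nPage and sSubDir are assumptions made for illustration; they are not part of the original script.

```autohotkey
; Hypothetical values: each copy of the script gets its own ID range,
; e.g. 1-40000 here, 40001-80000 in a second copy, and so on.
nPageStart := 1
nPageEnd   := 40000
nPerDir    := 1000    ; Assumed bucket size: page IDs per subdirectory.

nPage := nPageStart
While ( nPage <= nPageEnd )
{
    ; Refresh the tray icon tooltip only on every 10th page.
    If ( Mod( nPage, 10 ) = 0 )
    {
        Menu, Tray, Tip, % "Processing URL number: " . nPage
    }

    ; One subdirectory per block of nPerDir page IDs:
    ; IDs 1-1000 go to "images\0", IDs 1001-2000 to "images\1", etc.
    sSubDir := A_ScriptDir . "\images\" . Floor( ( nPage - 1 ) / nPerDir )
    FileCreateDir, % sSubDir

    ; ... download and parse page nPage here as in the original script,
    ; saving any images into sSubDir instead of "\images" ...

    nPage++
}
MsgBox, Done!
```

Starting the counter at nPageStart avoids looping over the skipped IDs at all, and bucketing by page ID keeps each directory to at most nPerDir pages' worth of files. Running several copies in parallel still multiplies the request rate against the server, so the ban risk mentioned above applies.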

IainB · « **Reply #8 on:** August 06, 2017, 06:46 AM »

It might be worth looking at the references to "image download" in the thread "Re: Firefox Extensions: Your favorite or most useful".

Also see:

...Does it really work? Wow. It hasn't been updated for 11 months, so I assumed...
__________________
-Curt (March 12, 2014, 08:47 AM)
It most decidedly does work, and you can crawl any website, gathering specific file types.
For example, from the Mozilla FoxySpider Add-on page:
____________________________
About this Add-on
With FoxySpider you can:
Get all photos from an entire website
Get all video clips from an entire website
Get all audio files from an entire website
Well, actually get any file type you want from an entire website
FoxySpider can be used to create a thumbnail gallery containing links to rich media files of any file types you are interested in. It can also crawl deep to any level on a website and display the applicable files it found in the same gallery. FoxySpider is useful for different media content pages (music, video, images, documents), thumbnail gallery post (TGP) sites, podcasts. You can narrow and expand the search to support exactly what you want.
Once the thumbnail gallery is created you can view, download or share (on Facebook and Twitter) every file that was fetched by FoxySpider.
____________________________
-IainB (March 12, 2014, 04:12 PM)

skwire · « **Reply #9 on:** August 06, 2017, 06:51 AM »

But unfortunately, it's sooooooo slow. It leeched 2 MB in 2 minutes!
-rayman3003 (August 06, 2017, 04:36 AM)

This is going to be due to your location and ISP speed since I was able to grab 30+ megs of images in two minutes even with just that single-threaded code. Here's a modification that simply creates a text file of image links. As Ath mentioned, you can comment out line 17 for a tiny speedup but I don't think it's going to make a noticeable difference.

Code: Autohotkey

;http://mspabooru.com/index.php?page=post&s=view&id=166035
sSourceURL := "http://mspabooru.com/index.php?page=post&s=view&id="
nPageStart := 1
nPageEnd   := 161151
 
FileCreateDir, % A_ScriptDir . "\images"
 
Loop, % nPageEnd
{
    If ( A_Index < nPageStart )
    {
        Continue
    }
    Else
    {
        ; Update tray icon tooltip.
        Menu, Tray, Tip, % "Processing URL number: " . A_Index
 
        ; Download HTML page.
        URLDownloadToFile, % sSourceURL . A_Index, % A_ScriptDir . "\temp.html"
 
        ; Read in HTML source.
        FileRead, myData, % A_ScriptDir . "\temp.html"
 
        ; Parse HTML source for image URLs.
        Loop, Parse, myData, `n, `r
        {
            ; Match image URLs.
            If ( RegExMatch( A_LoopField, "(https?:\/\/.*\.(?:png|jpg|jpeg|gif))", Match ) )
            {
                ; Crack the URL into its parts.
                SplitPath, % Match1, OutFileName, OutDir, OutExtension, OutNameNoExt, OutDrive
 
                ; Skip any image URLs containing "thumbnail" or "width".
                If ! ( InStr( Match1, "thumbnail" ) OR Instr( Match1, "width" ) )
                {
                    ; Download the image.
                    ; URLDownloadToFile, % Match1, % A_ScriptDir . "\images\" . OutFileName
 
                    ; Create a list of links.
                    FileAppend, % Match1 . "`r`n", % A_ScriptDir . "\ImageLinks.txt"
                }
            }
        }
    }
}
MsgBox, Done!

rayman3003 · « **Reply #10 on:** August 06, 2017, 07:41 AM »

But unfortunately, it's sooooooo slow. It leeched 2 MB in 2 minutes!
-rayman3003 (August 06, 2017, 04:36 AM)

This is going to be due to your location and ISP speed since I was able to grab 30+ megs of images in two minutes even with just that single-threaded code. Here's a modification that simply creates a text file of image links. As Ath mentioned, you can comment out line 17 for a tiny speedup but I don't think it's going to make a noticeable difference.

-skwire (August 06, 2017, 06:51 AM)

Thank you. This time it worked just as I wanted.

(And thanks to the others who helped me in this topic.)
