Have a little faith, will you?
Hope you don't take my posts as grumpy-old-man ranting. I'm just interested in these things, and some of what you're saying sounds weird compared to my own experiences. But I can handle being proved wrong, and I always like learning new stuff.
(Also, I've been spending quite some time looking at backup software lately - pretty much everything sucks in one way or another. The closest I've come yet are Genie Timeline, which was kinda nice but had bugs and shortcomings, and Crashplan, which does some of the stuff GTL sucked at better, but has its own problems - *sigh*.)
Hm, you might have a point wrt. warm cache querying - but have you tested the code across several OSes, especially pre-Vista? That's when Microsoft started doing a lot of work on lock-free data structures and algorithms in the kernel. Have you tested on XP and below?
This is with a warmed-up cache. C:\ was scanned in full immediately before this test. Interestingly enough, playing with the order in which sub-directories are queued for scanning can speed things up by an additional 5-10%:
Hrm, the last time I played with different scanning techniques was back on XP - that's some years ago, which also means quite a bit slower hardware. I tested NTFS, FAT32 and even ISO9660 (on a physical CD, since that's the slowest seek speed I had available). I tried depth- vs. breadth-first, tried eliminating SetCurrentDirectory calls since that'd mean fewer user<>kernel transitions (and I had hoped the CWD wouldn't change, but it did - FindFirstFile probably changes directory internally), spent some effort on making the traversal non-recursive and eliminating as many memory allocations as possible... and nothing really made much of a difference. It was hellish doing cold boots between each and every benchmark.
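For reference, the non-recursive, queue-based traversal I'm describing has roughly this shape - sketched in Python with os.scandir rather than the raw FindFirstFile calls I actually used, so take it as the technique, not my original code:

```python
import os
from collections import deque

def scan_tree(root):
    """Non-recursive breadth-first traversal using an explicit queue.

    No SetCurrentDirectory-style chdir calls: every entry is addressed
    by its full path, so the process CWD never changes.
    """
    files = []
    pending = deque([root])
    while pending:
        directory = pending.popleft()
        try:
            with os.scandir(directory) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        pending.append(entry.path)
                    else:
                        files.append(entry.path)
        except PermissionError:
            continue  # unreadable directory: skip it instead of aborting the scan
    return files
```

Switching `pending.popleft()` to `pending.pop()` turns the same loop into depth-first, which makes the two orders easy to benchmark against each other.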
Can't remember if that was before or after I got a Raptor disk - so it might have been on hardware before NCQ got commonplace, and it was definitely on XP. Still, even with NCQ, it's my experience that you don't need a lot of active streams before performance dies
with mechanical disks. For SSDs, the story is entirely different, though - there, on some models, a moderate queue depth can be necessary to reach full performance. So a cold-scan on an SSD might benefit from multiple threads - I'd be surprised if a mechanical disk did, though!
Got any benchmark code you're willing to share? I'd be interested in trying it out on my own system; I'm afraid I didn't keep the stuff I wrote back then (and there were no threaded versions anyway).
there will always be time when the OS is not doing anything for us, because our app is busy copying what it got from the OS into its own data structures. So if we have 2+ threads pulling at the API, it eliminates these idle OS times.
It's my understanding that what you're generally waiting for when traversing the filesystem is disk I/O - the CPU overhead of copying data structures and user<>kernel switches should be dwarfed by the I/O. Which is why I'm surprised you say multiple threads help when there's a mechanical disk involved. I'd like to verify that myself - and I'd like it even more if somebody can find a good explanation.
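For anyone who wants to test the claim themselves, here's roughly what I'd benchmark against the single-threaded version - a portable Python sketch, with the thread count and all names being my own arbitrary choices. The point is just that one thread can copy results into app-side structures while another is blocked inside the OS enumeration call:

```python
import os
import queue
import threading

def threaded_scan(root, worker_count=4):
    """Several threads pull directories from a shared work queue."""
    work = queue.Queue()
    work.put(root)
    files = []
    lock = threading.Lock()
    outstanding = [1]  # directories queued or in flight, guarded by lock

    def worker():
        while True:
            try:
                directory = work.get(timeout=0.05)
            except queue.Empty:
                with lock:
                    if outstanding[0] == 0:
                        return  # queue drained and no directory still in flight
                continue
            sub_files, sub_dirs = [], []
            try:
                with os.scandir(directory) as entries:
                    for entry in entries:
                        if entry.is_dir(follow_symlinks=False):
                            sub_dirs.append(entry.path)
                        else:
                            sub_files.append(entry.path)
            except PermissionError:
                pass  # unreadable directory: treat as empty
            with lock:
                files.extend(sub_files)
                # +1 per child *before* it is enqueued, -1 for this directory,
                # so outstanding only hits zero when everything is done
                outstanding[0] += len(sub_dirs) - 1
            for d in sub_dirs:
                work.put(d)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return files
```

Timing this against the single-threaded loop on cold and warm caches, mechanical disk vs. SSD, should settle the argument one way or the other.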
Thirdly, the problem of the OS sitting idle becomes even more pronounced when you do an over-the-network scan.
That's a part where I'm fully convinced you're right, even without seeing benchmarks
- there's indeed quite some latency even on a LAN, and the SMB/CIFS protocol sucks.
I got the point re: marketing speak though. I will try and back it up with the graphs
Please also change the wording, though
- even with graphs, the sentence is still suspicious. I'm too tired at the moment to come up with something better that isn't going to confuse normal people.
With regards to the MFT/USN - I really don't want to descend to that level. I considered using USN, for example, for move detection, and it is - basically - support hell. As much as I love troubleshooting NTFS nuances, this is just not my cup of tea.
It's a nastily low level to be operating at - and it definitely shouldn't be the only scanning method available, since it might break at any time in the future. I'm also not sure MFT scanning is the best fit for a backup program; it's my understanding you pretty much have to read the MFT in its entirety (possibly lots of memory use, constructing a larger in-memory graph than necessary, or spending CPU on pruning items you're not interested in?) - but g'darnit, it's fast. WizTree can scan my entire source partition in a fraction of the time it takes to traverse just part of it via API calls...
USN is tricky to get right, and I haven't had time to play enough with it myself. But IMHO the speed benefits should make it worth it. Without USN parsing, after (re)starting the backup program, you have to do a complete traversal of all paths in the backup set. It's quite a lot faster to simply scan the USN journal and pick up changes - but yes, it's complex.
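The bookkeeping side of that is simple enough to sketch. Note this is only the checkpoint logic around the journal - the real records would come from DeviceIoControl with FSCTL_READ_USN_JOURNAL, which I'm faking here with plain dicts, so the record shape is my invention:

```python
def changes_since(journal, last_usn):
    """Pick up only paths changed since our saved checkpoint, instead of
    re-walking every path in the backup set on restart.

    `journal` stands in for the NTFS change journal: an ordered list of
    records, each with a monotonically increasing 'usn' and a 'path'.
    """
    changed = []
    next_usn = last_usn
    for record in journal:
        if record["usn"] > last_usn:
            changed.append(record["path"])
            next_usn = max(next_usn, record["usn"])
    # persist next_usn as the new checkpoint before the next run
    return changed, next_usn
```

The complexity you mention lives in everything this skips: journal wrap/overflow (forcing a full rescan anyway), rename/move records, and mapping file reference numbers back to paths.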
What about Symlinks, Hardlinks and Junctions? Do you handle those correctly, and have they given you much headache?