Author Topic: Mass checksum checker (Read 21916 times)

imtrobin · « **on:** February 13, 2010, 02:39 AM »

Hi

I'm looking for a tool to mass-verify checksums, like Advanced Checksum Verifier (www.irnis.net), but that tool doesn't support Unicode file names, and the author no longer responds to emails. Can you recommend any good ones?

Krishean · « **Reply #1 on:** February 13, 2010, 02:55 AM »

i use quicksfv for verifying checksums, i believe it does unicode properly, the only drawback is it only does crc32/md5, does not do sha1 (yet?)

Krishean · « **Reply #2 on:** February 13, 2010, 03:00 AM »

nevermind, does not handle japanese (unicode) files properly

Krishean · « **Reply #3 on:** February 13, 2010, 03:06 AM »

you might try HashCheck, i have heard good things about it, but i haven't tried it personally. it appears that it might do what you are looking for because it has several non-ascii language translations (and not supporting filenames in someone's native language while having a language file would be silly). it also does more than just crc32/md5

imtrobin · « **Reply #4 on:** February 13, 2010, 10:33 AM »

Thanks, but I'm looking for a mass file verifier to verify TBs of data, not individual files. One of the nice things about ACSV is that it recurses through every folder and builds the checksum file, which I can use later to verify whether any file got corrupted.
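For readers who want to see what such a workflow involves, here is a minimal Python 3 sketch of the "recurse, build a checksum file, verify later" idea. It is not how ACSV or any tool named in this thread works; the function names (`hash_file`, `build_manifest`, `verify_manifest`) and the manifest layout are assumptions made for illustration. Python 3 handles paths as Unicode strings, and the manifest is written as UTF-8, so non-ASCII file names should survive the round trip.

```python
import hashlib
import os


def hash_file(path, algo="sha1", chunk_size=1 << 20):
    """Hash one file in chunks so large files never sit fully in memory."""
    digest = hashlib.new(algo)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, manifest_path, algo="sha1"):
    """Walk root recursively; write one '<hex digest> *<relative path>' line per file."""
    manifest_abs = os.path.abspath(manifest_path)
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as out:
        for folder, _subdirs, files in os.walk(root):
            for name in sorted(files):
                full = os.path.join(folder, name)
                if os.path.abspath(full) == manifest_abs:
                    continue  # do not hash the manifest itself
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                out.write(f"{hash_file(full, algo)} *{rel}\n")


def verify_manifest(root, manifest_path, algo="sha1"):
    """Return (relative path, problem) pairs for files that are missing or changed."""
    problems = []
    with open(manifest_path, encoding="utf-8") as lines:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            expected, rel = line.split(" *", 1)
            full = os.path.join(root, *rel.split("/"))
            if not os.path.isfile(full):
                problems.append((rel, "missing"))
            elif hash_file(full, algo) != expected:
                problems.append((rel, "changed"))
    return problems
```

Running `build_manifest` once and `verify_manifest` at any later date gives an empty list when nothing has changed. The line layout resembles what `sha1sum` writes in binary mode, but compatibility with other tools is not guaranteed by this sketch.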

Krishean · « **Reply #5 on:** February 13, 2010, 07:35 PM »

with quicksfv it will recurse through subdirectories and add all the files it finds to the sfv/md5 file, you just have to select the top-level directory. unfortunately as i said it doesn't seem to work with unicode filenames

i'm not sure if HashCheck has the same functionality, and i know quicksfv will not drop a checksum file in each subdirectory. i am sure someone here could program a unicode-compliant application for what you want, and if i had more time i'm sure i could do it myself

tinjaw · « **Reply #6 on:** February 13, 2010, 08:04 PM »

I tried a little searching and found the following: (I have no experience with it at all.)

ExactFile - Make and test MD5, CRC32, SHA1, SFV, md5sum, sha1sum and other hashes, quick and easy.
http://www.exactfile.com/

(emphasis mine)

* A file integrity verification tool:
   o Use it to make sure files copied to CD-ROM are bit-perfect copies,
   o Use it to make sure backups copied from one drive to another are just right,
   o Use it to make sure files haven’t been changed or damaged over time.
* Multi-threaded, so your extra CPU cores get used when scanning multiple files and work gets done faster.
* Happy with Unicode file names, so it doesn’t fail when you’re using it on files named in Japanese, Hebrew, Chinese, or any other language.
* Supports multiple checksum routines (hashes), like MD5, SHA1, CRC32, RIPEMD and others.
* Supports recursive directory scanning.
* Supports Very Big Files — If it’s on your hard drive, ExactFile can handle it.
* Does everything popular file summer utilities do, like fsum, md5sum, sha1sum, sfv, etc, but better!
* Compatible with popular file checksum digest formats.
* For Windows 2000, XP, and Vista.
* GUI. Easy to use to get checksums for individual files, create checksum digests, and test checksum digests. Does not require the console version or any external DLLs.
* FREE.

ExactFile is currently under development and in public beta. Download here. Watch the blog for info and updates.

There is also a command line version.
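
The "multi-threaded" point in the feature list can be illustrated with a short Python sketch. This is only an assumption about the general technique, not ExactFile's actual implementation; `sha1_of` and `hash_many` are hypothetical names. Python's `hashlib` releases the interpreter lock while hashing larger buffers, so a thread pool can keep several cores busy, although on a single spinning disk the drive is often the real bottleneck.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_many(paths, workers=4):
    """Hash several files concurrently; returns {path: hex digest}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input paths
        return dict(zip(paths, pool.map(sha1_of, paths)))
```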

Krishean · « **Reply #7 on:** February 13, 2010, 08:17 PM »

nice find, i might just try that myself

i also have been looking for a better checksum utility because of quicksfv's shortcomings (unicode/sha1)

imtrobin · « **Reply #8 on:** February 13, 2010, 11:14 PM »

This looks cool, testing with 1 TB of data now, looking great! I can't believe I paid for ACSV, and the author doesn't reply.

widgewunner · « **Reply #9 on:** February 14, 2010, 01:30 AM »

Multiple files:
If your data is located in one branch on a file system, you can use GIT, the version control software used to manage the Linux kernel. It uses SHA1 hashes to identify every version of every file in the tree as well as the entire tree (and every version of the tree). The design is actually quite simple and secure - the entire history of the tree of files is rendered down to one single SHA1 hash. If any version of any file is modified (or the disk is corrupted in any way), this SHA1 is changed. Thus, if the SHA1 has not changed, you can be sure that all the versions of every file are intact. (This web page does a pretty good job of explaining how this works.) Setting up GIT is quite simple using a command line interface. It's free and open source. The preferred Windows version is available at Google code: msysgit.
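
As a small illustration of the per-file hashing described above: Git's id for a file's contents (a "blob") is the SHA-1 of a short header followed by the raw bytes. The helper name `git_blob_id` below is made up for this sketch; the header layout (`blob`, a space, the decimal size, a NUL byte) is Git's object format.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Compute the id Git assigns to file contents stored as a blob."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
```

Because of the header, these ids will not match what `sha1sum` prints for the same file, so they cannot be compared directly with checksum files produced by other tools. `git hash-object <file>` should print the same value as this function.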

I've only recently gotten into using source control software but have become a fan of GIT's elegant design and useful functionality. It was this lecture by Linus Torvalds on YouTube that turned me onto the beauty of GIT. If you go to the Git documentation home page (http://git-scm.com/documentation) there is a link to this and other videos describing GIT. I've been using it heavily for a couple months now and have had no trouble whatsoever. It is a very cool tool!

Single file:
One of the first apps I install when setting up a new box is: hashtab. Once installed, just right click on any file and select Properties. On Windows, this little beauty simply adds a new File Hashes tab to the file's property sheet which displays the various hashes of the file like so:

http://i29.photobucket.com/albums/c253/ridge-runner/SCREENSHOTS/HashTab.png

Just copy any hash code text into the Hash Comparison text box and it gives a green check mark indicating which hash type has matched (or a red X if none matched). You can select to display a variety of different hash algorithms - here is the settings page which shows the supported hash types:

http://i29.photobucket.com/albums/c253/ridge-runner/SCREENSHOTS/HashTabSettings.png
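
The comparison box described above can be approximated in a few lines of Python. This is a rough sketch of the idea, not HashTab's code; `matching_algorithms` and its default algorithm list are assumptions. It hashes the file once with several algorithms and reports which of them, if any, matches the pasted string.

```python
import hashlib


def matching_algorithms(path, pasted_hash, algorithms=("md5", "sha1", "sha256")):
    """Return the names of the algorithms whose digest of the file equals pasted_hash."""
    wanted = pasted_hash.strip().lower()  # tolerate stray whitespace and upper-case hex
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            for digest in digests.values():
                digest.update(chunk)  # one read pass feeds every algorithm
    return [name for name, digest in digests.items() if digest.hexdigest() == wanted]
```

An empty list plays the role of the red X; a non-empty list names the hash type that matched.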

Hope this helps!

parkint · « **Reply #10 on:** February 14, 2010, 06:55 AM »

As you know, I am also a big fan of Git.
When you gain a little proficiency with it, this is a great resource to the power of Git.

f0dder · « **Reply #11 on:** February 14, 2010, 07:15 AM »

widgewunner: git is powerful and interesting, but suggesting it as a way to get file hashes? That's kinda like using a frying pan to drive in nails

(oh, and while the implementation model might be elegant, git as a whole definitely isn't - ugh!).

imtrobin · « **Reply #12 on:** February 14, 2010, 07:36 AM »

I think Git is a little unsuitable, it keeps a copy of the whole file in one revision. It's good for distributed code, but not for file verifying.

widgewunner · « **Reply #13 on:** February 14, 2010, 12:25 PM »

As you know, I am also a big fan of Git.
When you gain a little proficiency with it, this is a great resource to the power of Git.
-parkint (February 14, 2010, 06:55 AM)

I agree. (In fact, I just received my hard copy of the book last week). And Scott Chacon's Gitcasts are also very good. I have found both the online and written documentation to be nothing short of excellent. (My first book on git was O'Reilly's: Version Control with Git by Jon Loeliger - also highly recommended.)

... git is powerful and interesting, but suggesting it as a way to get file hashes? That's kinda like using a frying pan to drive in nails ...
-f0dder (February 14, 2010, 07:15 AM)

LOL! Yes, you have a point here. (Although a frying pan does a pretty good job of it!)

... (oh, and while the implementation model might be elegant, git as a whole definitely isn't - ugh!).
-f0dder (February 14, 2010, 07:15 AM)

Can you elaborate? My understanding is that when Git first came out it was quite difficult to use, and the documentation was lousy, but it has since matured and those days are gone. I am new to Git and have had nothing but a good experience with it so far. It is small, lightning fast and non-obtrusive. It compresses your data down to a bare minimum. And contrary to what some might believe, it is actually very easy to use. Once installed, setting up a repository to track a directory tree consists of four commands:

cd branch/to/be/followed # change to the directory you want to be managed
git init # initialize a new repository
git add . # recursively add all files and folders
git commit -m "initial commit" # commit the tree to the repository

Just repeat the last two commands any time you want to add a new version to the repository. Yes, there are much more powerful and complex commands that git can perform, but these are completely unnecessary for the purpose described here. There is also a GUI interface, but I can't comment on that as I am a command line kind of guy for this kind of stuff.

It not only provides you with a SHA1 hash of every version of every file in your tree (and thus guarantees the integrity of each and every one), it has very powerful ways for you to inspect the changes that have been made to the files over time. It also has commands for copying entire repositories to other drives/servers which provides a very effective backup methodology.

However, I am not at all sure how well it would handle terabytes of data!?

I think Git is a little unsuitable, it keeps a copy of the whole file in one revision. It's good for distributed code, but not for file verifying.
-imtrobin (February 14, 2010, 07:36 AM)

I would disagree. Accurate file verification is one of the founding premises of Git. Yes it stores entire files, but it is very efficient. In Linus's talk, he mentions that the repository containing the entire history of the Linux sources (from 2005-2007), was only half the size of one checked out version of the source tree itself!

p.s. You guys did go check out hashtab didn't you? It is definitely a "must-have"!

Cheers!

f0dder · « **Reply #14 on:** February 14, 2010, 02:26 PM »

... (oh, and while the implementation model might be elegant, git as a whole definitely isn't - ugh!).
-f0dder (February 14, 2010, 07:15 AM)
Can you elaborate? My understanding is that when Git first came out it was quite difficult to use, and the documentation was lousy, but it has since matured and those days are gone.
-widgewunner (February 14, 2010, 12:25 PM)

I'll try

First, let me state that I think the git model seems pretty solid overall: the way repository information is stored in the .git folder (and the way it's structured), the way server communication is done for remote repositories et cetera. I haven't looked into each and every detail (e.g. I don't know if file blobs are stored directly or if they're compressed (can't see why they would be)), but I understand the idea of storing blobs and referring to just about everything through their SHA-1 hash values (I wonder why SHA-256 wasn't chosen, considering some known SHA-1 defects, but not too big a deal - the focus isn't to guard against attackers but to avoid collisions under normal circumstances).

My gripes are more around the end-user tools. One thing is that the Windows port is still a bit rough (blame the *u*x people for not writing properly modular and portable code), this is something I can live with - but gee, even after creating hardlinks for all the git-blablabla.exe in libexec/git-core, the msysgit install still takes >120meg disk space... subversion is 7.4meg. Of course msysgit comes with a lot more than svn, but I shouldn't need all that extra. And I don't currently have time to check what is absolutely necessary and what's just icing on the cake; considering that git was originally a bunch of scripts, and the unix tradition of piecing small things together with duct tape, I don't feel like playing around right now.

The more important thing is how you use the tools. Sure, for single-developer local-only no-branch usage, it's pretty much a no-brainer, and most of the terminology matches traditional C-VCS. There's some subtle differences here and there that you have to be aware of, though - like what HEAD means. IMHO it would have been better to use new terminology for some things - like "stage" instead of "add" (having "add" overloaded to handle both add-new-file and add-to-staging-area is bad). "Checkout" for switching branches doesn't seem like the smartest definition to me, either. And not knowing about renames (but depending on client-tool to discover this, probably through matching SHA-1 values?) also seems like a bit of a mistake. Relatively minor points, but things that can introduce confusion...

Where things can get hairy is when you collaborate with other people, especially when juggling branches and "history rewriting". Git has some very powerful features that let you mess things up bigtime - which by itself might not be a problem, but with the various overloads the commands have, and given that history rewriting (i.e., commit --amend and rebase) seems to be a pretty common operation, you really have to be pretty careful. Some of it isn't much of an issue if you're a single developer (although you can inadvertently destroy branch history, which can be bad), but you need to be really careful with rebasing once you work with other people - the Pro Git book has a good example of why.

All that said, I'm considering moving my own VCS to git. D-VCSs clearly have advantages over C-VCS, and while the git-windows tools have rough edges and you have to be careful and you can do some insane things, it's fast and I believe the underlying technology has gotten an important bunch of things right. I'll probably check out some of the other D-VCS tools before deciding, like bazaar and mercurial.

I am new to Git and have had nothing but a good experience with it so far. It is small, lightning fast and non-obtrusive.
-widgewunner (February 14, 2010, 12:25 PM)

Wouldn't say it's small (at least not msysgit), but fast indeed, and I like that it has a single .git folder instead of a per-subfolder .svn (that's un-obtrusive for me).

It compresses your data down to a bare minimum.
-widgewunner (February 14, 2010, 12:25 PM)

Does it actually compress anything (apart from server communications), or are you just referring to only storing each blob once, identified by SHA-1 hash value? It's also my understanding that files are stored in entirety, whereas other systems store patchsets. This makes checkouts extremely fast, but if you have huge files and only change a few lines, it does take up more disk space (usually not a big problem, most sane people don't store server logs under vcs, and keep source code files small... 50 committed edits to the main Notepad++ .cpp file would be ~16meg though).
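
On the compression question: Git writes each object under .git/objects as a zlib-compressed stream (so-called loose objects), and packing, for example via `git gc`, can additionally store similar objects as deltas. The Python sketch below mimics only the loose-object encoding of a single blob; the function names are invented for illustration, and it does not read or write a real repository.

```python
import zlib


def encode_loose_blob(data: bytes) -> bytes:
    """Build the bytes Git would store for a loose blob: zlib('blob <size>\\0' + data)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return zlib.compress(header + data)


def decode_loose_object(raw: bytes):
    """Inverse of encode_loose_blob: return (object type, payload bytes)."""
    store = zlib.decompress(raw)
    header, _, body = store.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match payload length")
    return kind.decode("ascii"), body
```

Each committed version of a file is still a complete object of this kind until the repository is packed, so the "50 edits of a large file" concern is real for loose objects, just smaller than the raw file sizes suggest for text.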

However, I am not at all sure how well it would handle terabytes of data!?
-widgewunner (February 14, 2010, 12:25 PM)

Better than other VCSes... but it's not suitable for just computing hashes

I think Git is a little unsuitable, it keeps a copy of the whole file in one revision. It's good for distributed code, but not for file verifying.
-imtrobin (February 14, 2010, 07:36 AM)
I would disagree. Accurate file verification is one of the founding premises of Git. Yes it stores entire files, but it is very efficient. In Linus's talk, he mentions that the repository containing the entire history of the Linux sources (from 2005-2007), was only half the size of one checked out version of the source tree itself!
-widgewunner (February 14, 2010, 12:25 PM)

The point is that to "compute hashes" with git, you'll be putting files under version control. You don't necessarily want to do that for a bunch of, say, ISO images. Heck, I'd even say it's very likely you don't want to do this. First, you don't need the file under VCS, second you don't want the extra duplicate it creates (remember, every file will live in the .git object stash as well as a checked out copy).

Anyway, a lot of this discussion should really probably be split out to a topic about git, since it's drifted quite far away from the topic of file checksumming.

widgewunner · « **Reply #15 on:** February 14, 2010, 08:48 PM »

I agree with everything you just said. Thanks for the input. When I said Git was small, I was certainly wrong with regard to the tool itself. I guess I was thinking in terms of Git's impact on the tree you are putting under revision control. Where CVS and SVN obtrusively place folders in every directory in a tree, Git only needs one. And the size of the repository is small - yes, Git does file compression into what it calls "pack" files.

And I have to admit that you are all absolutely correct when you say Git is not really appropriate for the specific task of file verification. Especially for binary files. I guess my recent infatuation with this tool has got me wanting to evangelize its praises, and it seemed to me if someone was asking about verifying a bunch of files, they may also be wanting to track changes as well - in which case Git may be something worth looking into.

FYI - the msysgit installation is sort of like cygwin. It includes a mini-unix environment which includes the following command line tools (which explains its size):

basename, bash, bzip2, cat, chmod, cmp, cp, curl, cut, date, diff, du,
env, expr, false, find, gawk, git, git-*, gpg, gpgkeys_curl, gpgkeys_finger,
gpgkeys_hkp, gpgkeys_ldap, gpgsplit, gpgv, grep, gzip, head, id, kill,
less, ln, ls, md5sum, mkdir, msmtp, mv, openssl, patch, perl, ps, rm, rmdir,
rxvt, scp, sed, sh, sleep, sort, split, ssh, ssh-add, ssh-agent, ssh-keygen,
ssh-keyscan, tail, tar, tclsh, tclsh85, tee, touch, tr, true, uname, uniq,
vim, wc, wish, wish85, xargs, CA, tclConfig and tkConfig.

Sorry for the distraction. Back to your regular scheduled programming...

imtrobin · « **Reply #16 on:** February 15, 2010, 10:16 PM »

I tested with ExactFinder. Unfortunately, it doesn't seem stable enough. For some files, it can't calculate the hash, and when verifying, some files cause errors too.

MerleOne · « **Reply #17 on:** February 16, 2010, 06:10 AM »

I'd go for CDCheck, it's perfect for mass CRC/MD5 (creation and verify). See : http://www.kvipu.com/CDCheck/

tinjaw · « **Reply #18 on:** February 16, 2010, 06:46 AM »

ExactFinder
-imtrobin (February 15, 2010, 10:16 PM)

I assume you mean ExactFile. That's too bad. I'd be interested to see what the developer says if you sent them your findings. Do you have the time to email them a detailed bug report?

imtrobin · « **Reply #19 on:** February 16, 2010, 09:27 AM »

Sure, I would. What does he need? But I don't see any commonality among the failed files.

NigelH · « **Reply #20 on:** February 16, 2010, 08:29 PM »

md5deep etc from sourceforge?
does directory recursion etc.
There are some restrictions on Unicode though.