

Mass checksum checker


parkint:
As you know, I am also a big fan of Git.
When you gain a little proficiency with it, this is a great resource on the power of Git.

f0dder:
widgewunner: git is powerful and interesting, but suggesting it as a way to get file hashes? That's kinda like using a frying pan to drive in nails :) (oh, and while the implementation model might be elegant, git as a whole definitely isn't - ugh!).

imtrobin:
I think Git is a little unsuitable; it keeps a copy of the whole file for every revision. It's good for distributed code, but not for file verification.

widgewunner:
As you know, I am also a big fan of Git.
When you gain a little proficiency with it, this is a great resource on the power of Git.
-parkint (February 14, 2010, 06:55 AM)
--- End quote ---
I agree. (In fact, I just received my hard copy of the book last week.) And Scott Chacon's Gitcasts are also very good. I have found both the online and written documentation to be nothing short of excellent. (e.g., my first book on git was O'Reilly's Version Control with Git by Jon Loeliger - also highly recommended.)

... git is powerful and interesting, but suggesting it as a way to get file hashes? That's kinda like using a frying pan to drive in nails :) ...-f0dder (February 14, 2010, 07:15 AM)
--- End quote ---
LOL! Yes, you have a point here. (Although a frying pan does a pretty good job of it!)

... (oh, and while the implementation model might be elegant, git as a whole definitely isn't - ugh!).-f0dder (February 14, 2010, 07:15 AM)
--- End quote ---
Can you elaborate? My understanding is that when Git first came out it was quite difficult to use, and the documentation was lousy, but it has since matured and those days are gone. I am new to Git and have had nothing but a good experience with it so far. It is small, lightning fast and non-obtrusive. It compresses your data down to a bare minimum. And contrary to what some might believe, it is actually very easy to use. Once installed, setting up a repository to track a directory tree consists of four commands:

--- ---cd branch/to/be/followed        # change to the directory you want to be managed
git init                        # initialize a new repository
git add .                       # recursively add all files and folders
git commit -m "initial commit"  # commit the tree to the repository
Just repeat the last two commands any time you want to add a new version to the repository. Yes, there are much more powerful and complex commands that git can perform, but these are completely unnecessary for the purpose described here. There is also a GUI interface, but I can't comment on that as I am a command-line kind of guy for this sort of thing.
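For instance, recording a later snapshot is just (a minimal sketch; the commit message is whatever you like):

--- ---git add .                       # stage new and modified files since the last commit
git commit -m "weekly snapshot" # record the new version (message is just an example)
git status                      # optional: confirm nothing was left unstaged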

It not only provides you with the SHA-1 hash of every version of every file in your tree (and thus guarantees the integrity of each and every one), it gives you very powerful ways to inspect the changes that have been made to the files over time. It also has commands for copying entire repositories to other drives/servers, which makes for a very effective backup methodology.
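To make that concrete, here is the kind of thing I mean (just a sketch; the backup path is made up):

--- ---git ls-files -s                              # show the SHA-1 of every tracked file
git log --stat                               # review what changed, commit by commit
git fsck                                     # verify the integrity of every object
git clone --mirror . /mnt/backup/repo.git   # mirror the whole repository elsewhere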

However, I am not at all sure how well it would handle terabytes of data!?

I think Git is a little unsuitable; it keeps a copy of the whole file for every revision. It's good for distributed code, but not for file verification.
-imtrobin (February 14, 2010, 07:36 AM)
--- End quote ---
I would disagree. Accurate file verification is one of the founding premises of Git. Yes, it stores entire files, but it is very efficient. In Linus's talk, he mentions that the repository containing the entire history of the Linux sources (from 2005-2007) was only half the size of a single checked-out version of the source tree itself!

p.s. You guys did go check out HashTab, didn't you? It is definitely a "must-have"!

Cheers!

f0dder:
... (oh, and while the implementation model might be elegant, git as a whole definitely isn't - ugh!).-f0dder (February 14, 2010, 07:15 AM)
--- End quote ---
Can you elaborate? My understanding is that when Git first came out it was quite difficult to use, and the documentation was lousy, but it has since matured and those days are gone.-widgewunner (February 14, 2010, 12:25 PM)
--- End quote ---
I'll try :)

First, let me state that I think the git model seems pretty solid overall: the way repository information is stored in the .git folder (and the way it's structured), the way server communication is done for remote repositories et cetera. I haven't looked into each and every detail (e.g. I don't know if file blobs are stored directly or if they're compressed (can't see why they would be)), but I understand the idea of storing blobs and referring to just about everything through their SHA-1 hash values (I wonder why SHA-256 wasn't chosen, considering some known SHA-1 defects, but not too big a deal - the focus isn't to guard against attackers but to avoid collisions under normal circumstances).
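As an aside, for anyone tempted to use git's hashes for file verification: git doesn't hash the raw file, it hashes a "blob <size>\0" header plus the contents, so the values won't match a plain sha1sum. Quick sketch (file name made up):

--- ---echo 'hello' > demo.txt                       # 6 bytes including the newline
git hash-object demo.txt                       # the SHA-1 git would name this blob
printf 'blob 6\0' | cat - demo.txt | sha1sum   # same value, computed by hand
sha1sum demo.txt                               # plain hash of the file - a different value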

My gripes are more around the end-user tools. One thing is that the Windows port is still a bit rough (blame the *u*x people for not writing properly modular and portable code); this is something I can live with - but gee, even after creating hardlinks for all the git-blablabla.exe in libexec/git-core, the msysgit install still takes >120meg disk space... subversion is 7.4meg. Of course msysgit comes with a lot more than svn, but I shouldn't need all that extra. And I don't currently have time to check what is absolutely necessary and what's just icing on the cake; considering that git was originally a bunch of scripts, and the unix tradition of piecing small things together with duct tape, I don't feel like playing around right now :)

The more important thing is how you use the tools. Sure, for single-developer local-only no-branch usage, it's pretty much a no-brainer, and most of the terminology matches traditional C-VCS. There are some subtle differences here and there that you have to be aware of, though - like what HEAD means. IMHO it would have been better to use new terminology for some things - like "stage" instead of "add" (having "add" overloaded to handle both add-new-file and add-to-staging-area is bad). "Checkout" for switching branches doesn't seem like the smartest definition to me, either. And not knowing about renames (but depending on the client tool to discover this, probably through matching SHA-1 values?) also seems like a bit of a mistake. Relatively minor points, but things that can introduce confusion...
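To illustrate the "add" overload (a sketch, file names made up):

--- ---echo new > brand-new.txt
git add brand-new.txt    # "add" meaning: start tracking this file
echo edit >> tracked.txt
git add tracked.txt      # "add" meaning: stage this change - "stage" would be the honest name
git commit -m "both"     # only what was added/staged ends up in the commit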

Where things can get hairy is when you collaborate with other people, especially on juggling with branches and "history rewriting". Git has some very powerful features that let you mess things up bigtime - which by itself might not be a problem, but with the various overloads the commands have, and the fact that history-rewriting operations (i.e., commit --amend and rebase) seem to be pretty common, you really have to be careful. Some of this isn't much of an issue if you're a single developer (although you can inadvertently destroy branch history, which can be bad), but you need to be really careful with rebasing once you work with other people - the Pro Git book has a good example of why :)
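The dangerous pattern, in sketch form (branch name made up):

--- ---git commit --amend -m "better message"   # rewrites the last commit: it gets a new SHA-1
git rebase master                        # replays your commits on top of master: all new SHA-1s
# if someone already pulled the old commits, their history has now diverged,
# and the next merge will resurrect the commits you thought you rewrote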

All that said, I'm considering moving my own VCS to git. D-VCSs clearly have advantages over C-VCS, and while the git-windows tools have rough edges and you have to be careful and you can do some insane things, it's fast and I believe the underlying technology has gotten an important bunch of things right. I'll probably check out some of the other D-VCS tools before deciding, like bazaar and mercurial.

I am new to Git and have had nothing but a good experience with it so far. It is small, lightning fast and non-obtrusive.-widgewunner (February 14, 2010, 12:25 PM)
--- End quote ---
Wouldn't say it's small (at least not msysgit :)), but fast indeed, and I like that it has a single .git folder instead of a per-subfolder .svn (that's un-obtrusive for me).

It compresses your data down to a bare minimum.-widgewunner (February 14, 2010, 12:25 PM)
--- End quote ---
Does it actually compress anything (apart from server communications), or are you just referring to only storing each blob once, identified by SHA-1 hash value? It's also my understanding that files are stored in their entirety, whereas other systems store patchsets. This makes checkouts extremely fast, but if you have huge files and only change a few lines, it does take up more disk space (usually not a big problem: most sane people don't store server logs under vcs, and keep source code files small... 50 committed edits to the main Notepad++ .cpp file would be ~16meg though :)).
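One way to check would be something like this (a sketch; the paths are made up):

--- ---git init /tmp/ctest && cd /tmp/ctest
seq 1 100000 > big.txt                  # a few hundred KB of very compressible text
git add big.txt && git commit -m test
du -sh big.txt .git/objects             # compare the file with the object store
git count-objects -v                    # loose vs. packed object statistics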

However, I am not at all sure how well it would handle terabytes of data!?-widgewunner (February 14, 2010, 12:25 PM)
--- End quote ---
Better than other VCSes... but it's not suitable for just computing hashes :)

I think Git is a little unsuitable; it keeps a copy of the whole file for every revision. It's good for distributed code, but not for file verification.
-imtrobin (February 14, 2010, 07:36 AM)
--- End quote ---
I would disagree. Accurate file verification is one of the founding premises of Git. Yes, it stores entire files, but it is very efficient. In Linus's talk, he mentions that the repository containing the entire history of the Linux sources (from 2005-2007) was only half the size of a single checked-out version of the source tree itself!-widgewunner (February 14, 2010, 12:25 PM)
--- End quote ---
The point is that to "compute hashes" with git, you'll be putting files under version control. You don't necessarily want to do that for a bunch of, say, ISO images. Heck, I'd even say it's very likely you don't want to do this. First, you don't need the file under VCS; second, you don't want the extra duplicate it creates (remember, every file will live in the .git object stash as well as a checked-out copy).
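For a pile of ISOs, the classic tools already do mass checksumming without any repository (a sketch, file names made up):

--- ---sha1sum *.iso > isos.sha1    # record a checksum for every image
sha1sum -c isos.sha1          # later: verify them all in one pass
git hash-object *.iso         # even git will hash files without storing them (no -w flag)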

Anyway, a lot of this discussion should really probably be split out to a topic about git, since it's drifted quite far away from the topic of file checksumming.
