
The Best Way to Handle Finding and Removing Duplicate Files


In trying to create an all-encompassing archive of all the company documents, I have ended up with an extremely bloated folder containing more redundant info than anything else.  There are probably 10 copies of the exact same file, all in different places.  I can't risk running a mass "delete all but one copy" (though I would like to) and am looking for the best way to scan this huge repository, maybe in small chunks the first time, to try to remove as many as I can.
Some of these files are backup copies of PST files over 5GB in size that I may have 3 or 4 copies of.

This has been a work in progress for some time, and as it all starts getting into a single location I am finding this more and more often.  The ability to scan for duplicates exceeding 500MB might be a good start.

Years ago I had a program I used a lot on MP3s and photos, but I don't remember the name and it may not even be Windows 8 compatible.  If you can point me to anything at all right now, just cleaning out the duplicated PST files would be a huge help.

The more control the program offers over filters, the better.  I doubt I would even throw away the extras, but getting them out of the main reference area would be a big help.

Well, up to 4GB in size you could use VBinDiff but HxD will handle files up to 8EB ... if you feel the need  :D

Possibly a good start would be to generate hashes of each file (SHA-1, etc.), then you need only compare those with matching hashes - you could probably whip up a command file to do it  ;)
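For anyone who'd rather not write the command file by hand, here is a rough Python sketch of that hash-first idea - group files by size first (which is free), then hash only the size collisions. The names (`sha1_of`, `find_duplicates`) and the `min_size` filter (matching the 500MB idea above) are my own invention, not from any tool mentioned in this thread:

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1 << 20  # read 1 MiB at a time so multi-GB PSTs don't exhaust RAM

def sha1_of(path):
    """Return the SHA-1 hex digest of a file, streamed in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root, min_size=0):
    """Group files under root by size, hash only the size collisions,
    and return {digest: [paths]} for groups of two or more files."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(p)
            except OSError:
                continue  # skip unreadable entries
            if size >= min_size:
                by_size[size].append(p)
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot be a duplicate
        for p in paths:
            by_hash[sha1_of(p)].append(p)
    return {d: ps for d, ps in by_hash.items() if len(ps) > 1}
```

Something like `find_duplicates(r"D:\archive", min_size=500 * 2**20)` would then report only the big offenders, leaving the small stuff for a later pass.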

EDIT: The inbuilt Windows fc command can apparently also handle files >4GB.

Do all these files reside in the same directory?

Maybe I'll look at a simple, (or not), command file.

My exercises in comparing/selecting duplicates and removing/deleting them have usually tended to be of sub-folders nested within a folder/directory, whose contents have been treated as a "flat file" (i.e., the process has not necessitated documents being moved from their original holding folders first). I use xplorer² to do this, and it has an exceptionally powerful and handy duplicate checker.

Here is a partial screen clip of the duplicate checker being selected for the nested folders shown:

The folders' contents are first treated as a flat file.
Here is a partial screen clip of the duplicate checker selection box floating above the flat file display (which is in a "scrap" pane that can be operated on in various ways as a logical object). Note the checks for "size" and "content" (checksum), and "select all duplicates":

Here is a partial screen clip of the list of all the duplicates, with all but one duplicate (I think it might be the earliest-dated or something) being "auto-selected" in each case (this is also a scrap pane):

You can then select or unselect files on the list, at your whim, and operate on them as a set - e.g., copy them into a .ZIP file for archiving before deleting them en bloc.

I don't know the constraints (if any) on max file sizes. The user guide should describe them - it's a PDF file that can be downloaded, along with a trial version, from (the xplorer² website).
The support site is , which also has an online manual.

I use the xplorer² PRO version, but the Ultimate may suit your needs better.
For comparison, refer
Hope this helps or is of use.

A modification of one of my other command files that's around here somewhere.


It requires the following three (3) files from the GNU DiffUtils for Windows package (they're faster than the native Windows commands).  Put them in the same directory as the command file.

* cmp.exe
* libiconv2.dll
* libintl3.dll
These are available from the following two archives (they're both on SourceForge, and each less than 1MB in size):

Just double-click on it; it'll prompt you for two directories and the extension of the files to compare.

It'll output to both the console it opens and a log file, which you get the option to open at the end.

As a matter of interest, it did a compare of two 8.17GB ISOs in ~3:20 - the files were on separate HDDs for speed, otherwise you're going to get disk thrashing.


* It recurses through both directories chosen looking for matching file extensions.
* It performs a size comparison before it resorts to binary comparison, (seems logical).
* Because filenames are re-expanded when passed between routines via call, names containing % (or ^) may not be handled properly - rename them.
* Recommend you choose directories on different HDDs to avoid disk thrashing.
If you want, I can put up the native Windows command version but it is woefully slow by comparison.

--- Code: Text ---
@if (@CodeSection == @Batch) @then

@echo off
color 1a
setlocal EnableExtensions
setlocal DisableDelayedExpansion

echo Select SOURCE folder
for /F "delims=" %%S in ('CScript //nologo //E:JScript "%~F0"') do (
    set srce=%%S
)
if "%srce%"=="" goto :End2
echo Selected folder: "%srce%"

echo Select DESTINATION folder
for /F "delims=" %%D in ('CScript //nologo //E:JScript "%~F0"') do (
    set dest=%%D
)
if "%dest%"=="" goto :End2
echo Selected folder: "%dest%"

if "%srce%"=="%dest%" goto :End1

:GetExt
echo.
set /P ext=Please enter extension [eg. jpg] ENTER to exit: 
if not defined ext goto :End2
set /a totfiles=0
set /a matchfiles=0
set logfile="%~dp0FCompare.log"

echo --------------------------- >>"%logfile%"
echo %date%  %time% >>"%logfile%"
echo --------------------------- >>"%logfile%"

for /r "%srce%" %%a in (*.%ext%) do (call :CheckSize "%dest%" %%~za "%%~fa")

:End
echo Total files:    %totfiles%
echo Matching files: %matchfiles%
if %matchfiles% equ 0 (echo Matching files: 0 >>"%logfile%")
set totfiles=
set matchfiles=
set ext=
if exist "%logfile%" call :ViewLog
goto :GetExt

:End1
color 1c
echo **** SOURCE and DESTINATION are the same! ****

:End2
set srce=
set dest=
pause
exit

:ViewLog
set /p view=View logfile [y,n]
if "%view%"=="y" (start notepad.exe "%logfile%")
set view=
goto :EOF

:CheckSize
set /a totfiles+=1
for /r %1 %%b in (*.%ext%) do (
    if %2==%%~zb (
        echo.
        echo Comparing: "%~3" "%%~b"
        echo Sizes: %2   %%~zb
        .\cmp.exe -s "%~3" "%%~b"
        rem cmp.exe returns 0 on an exact match, so test for errorlevel below 1
        if not errorlevel 1 (call :Matching "%~3" "%%~b")
    )
)
goto :EOF

:Matching
set /a matchfiles+=1
echo - Match -
echo Match: "%~1" --- "%~2" >>"%logfile%"
goto :EOF

endlocal
rem End of Batch section
@end

// JScript section
// Creates a dialog box that enables the user to select a folder and display it.
var title = "Select a folder", rootFolder = 0x11;
var shl = new ActiveXObject("Shell.Application");
var folder = shl.BrowseForFolder(0, title, 0, rootFolder);
WScript.Stdout.WriteLine(folder ? folder.self.path : "");
Log file example:

--- Code: Text ---
---------------------------
Sun 26/10/2014  22:19:46.65
---------------------------
Match: "R:\test\dir1\FC3 - Blood Dragon.iso" --- "D:\dir2\why.iso"
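For anyone who can't (or doesn't want to) install the DiffUtils binaries, the size-check-then-byte-compare core of the script can be sketched in a few lines of Python. `files_identical` is a hypothetical helper of my own, not part of the command file above; like cmp, it streams in chunks rather than loading whole files:

```python
import os

CHUNK = 1 << 20  # 1 MiB reads keep memory flat even on 8 GB ISOs

def files_identical(a, b):
    """Cheap size check first, then a streamed byte-for-byte compare -
    the same order of operations as the size test + cmp.exe above."""
    if os.path.getsize(a) != os.path.getsize(b):
        return False
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ba = fa.read(CHUNK)
            bb = fb.read(CHUNK)
            if ba != bb:
                return False  # first differing chunk ends the compare
            if not ba:
                return True   # both files exhausted with no difference
```

The same caveat about disk thrashing applies: comparing two huge files on the same spindle will be slow regardless of the language.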

If you have everything in one location, it might be easier to use fsum to build a list of the MD5s of the files, then you can sort the result and see what is duplicated. Names and file extensions can be misleading.  ;)

Navigate to the top level and use "fsum -r *.* > checksums.txt"  and you'll get a nice report. Then you could "cat checksums.txt | sort | uniq -D -w 33 > dups.txt" to get a good list of just the duplicates. (You'd need unxutils for cat and uniq - and other useful stuff too.)
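If you don't have unxutils handy, the `sort | uniq -D -w 33` step can be approximated in a few lines of Python. This assumes fsum's report lines start with a 32-character MD5 hex digest followed by the filename (check your version's actual output); `duplicate_report` is my own name, and unlike uniq it compares digests case-insensitively:

```python
from collections import defaultdict

def duplicate_report(lines):
    """Group 'checksum  filename' report lines by their first 32
    characters (an MD5 hex digest) and return only the groups with
    two or more entries - roughly what uniq -D -w 33 produces."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if len(line) > 33:  # digest + separator + at least one name char
            groups[line[:32].lower()].append(line)
    return {h: g for h, g in groups.items() if len(g) > 1}
```

Feed it the lines of checksums.txt and write the surviving groups out as your dups.txt equivalent.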

Both reports are attached here for a sample folder.

