DonationCoder.com Software > Coding Snacks

The Best Way to handle finding and removing Duplicate Files

questorfla:
In trying to create an all-encompassing archive of all the company documents, I have ended up with an extremely bloated folder containing more redundant info than anything else.  There are probably 10 copies of the exact same file, all in different places.  I can't risk running a mass "delete all but one copy" (though I would like to) and am looking for the best way to scan this huge repository, maybe in small chunks the first time, to remove as many as I can.
Some of these files are backup copies of PST files over 5GB in size that I may have 3 or 4 copies of.

This has been a work in progress for some time, and as it all starts getting into a single location I am finding this more and more often.  The ability to scan for duplicates exceeding 500MB might be a good start.

Years ago I had a program I used a lot on MP3s and photos, but I don't remember the name and it may not even be Windows 8 compatible.  If you can point me to anything at all right now, just cleaning out the duplicated PST files would be a huge help.

The more filter control the program offers, the better.  I doubt I would even throw away the extras, but getting them out of the main reference area would be a big help.

4wd:
Well, up to 4GB in size you could use VBinDiff but HxD will handle files up to 8EB ... if you feel the need  :D

Possibly a good start would be to generate hashes on each file, (SHA-1, etc), then you need only compare those with matching hashes - you could probably whip up a command file to do it  ;)
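As a rough illustration of the hash idea, here is a minimal Python sketch (not the command file mentioned above). It assumes Python is available, groups files by size first so only same-size files get hashed, and reads each file in chunks so a 5GB PST never has to fit in memory. The names `file_sha1` and `find_duplicates`, the `min_size` filter and the `D:\archive` path are all made up for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_sha1(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte files never sit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, min_size=0):
    """Return lists of paths under root that share both size and SHA-1."""
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            if size >= min_size:
                by_size[size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_sha1(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


if __name__ == "__main__":
    # Hypothetical starting folder; only report files of 500 MB or more.
    for group in find_duplicates(r"D:\archive", min_size=500 * 1024 * 1024):
        print(group)
```

This only reports groups of identical files and deletes nothing. If a matching hash alone is not enough reassurance, the files in each group can still be checked with a byte-by-byte compare before anything is moved.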

EDIT: The inbuilt Windows fc command can apparently also handle files >4GB.

Do all these files reside in the same directory?

Maybe I'll look at a simple, (or not), command file.

IainB:
My exercises in comparing/selecting duplicates and removing/deleting them have usually tended to be of sub-folders nested within a folder/directory, whose contents have been treated as a "flat file" (i.e., the process has not necessitated documents being moved from their original holding folders first). I use xplorer² to do this, and it has an exceptionally powerful and handy duplicate checker.
Example:

Here is a partial screen clip of the duplicate checker being selected for the nested folders shown:



The folders' contents are first treated as a flat file.
Here is a partial screen clip of the duplicate checker selection box floating above the flat file display (which is in a "scrap" pane that can be operated on in various ways as a logical object). Note the checks for "size" and "content" (checksum), and "select all duplicates":



Here is a partial screen clip of the list of all the duplicates, with all but one duplicate (I think it might be the earliest-dated or something) being "auto-selected" in each case (this is also a scrap pane):



You can then select or unselect files on the list, at your whim, and operate on them as a set - e.g., copy them into a .ZIP file for archiving before deleting them en bloc.

I don't know the constraints (if any) on max file sizes. The user guide should describe such, and is a PDF file that can be downloaded from zabkat.com (the xplorer² website) with a trial version.
The support site is http://zabkat.com/x2support.htm , which also has an online manual.

I use the xplorer² PRO version, but the Ultimate may suit your needs better.
For comparison, refer to http://zabkat.com/x2down.htm
Hope this helps or is of use.

4wd:
A modification of one of my other command files that's around here somewhere.

FCompare-GNU.cmd

REQUIREMENTS:
It requires the following three (3) files from the GNU DiffUtils for Windows packages, (because they're faster than the native Windows commands).  Put them in the same directory as the command file.


* cmp.exe
* libiconv2.dll
* libintl3.dll
These are available from the following two archives, (they're both on SourceForge, both less than 1MB in size):
Binaries
Dependencies


RUNNING:
Just double-click on it, it'll prompt you for two directories and the extension type of the files to compare.

It'll output to both the console it opens and a log file, which you get the option to open at the end.

As a matter of interest, it did a compare of two 8.17GB ISOs in ~3:20 - the files were on separate HDDs for speed, otherwise you're going to get disk thrashing.

NOTES:

* It recurses through both directories chosen looking for matching file extensions.
* It performs a size comparison before it resorts to binary comparison, (seems logical).
* Because it uses DisableDelayedExpansion it's not going to work properly if your filenames have ! in them - rename them.
* Recommend you choose directories on different HDDs to avoid disk thrashing.
If you want, I can put up the native Windows command version but it is woefully slow by comparison.
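For anyone who would rather read the logic than the batch syntax, the approach in the notes (recurse both directories, compare sizes first, then compare contents) can be sketched in Python. This is an illustration under assumptions, not a port of the command file: the function names are made up, it uses the standard library's `filecmp.cmp` where the command file uses `cmp.exe`, and it indexes the destination by size once where the command file rescans the destination for each source file.

```python
import filecmp
import os


def files_with_ext(root, ext):
    """Yield every file under root whose name ends with .ext (case-insensitive)."""
    suffix = "." + ext.lower().lstrip(".")
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(suffix):
                yield os.path.join(folder, name)


def compare_trees(source, dest, ext):
    """Return (source_path, dest_path) pairs whose contents are identical."""
    # Index the destination by size once, so each source file only meets
    # candidates that could possibly match.
    dest_by_size = {}
    for path in files_with_ext(dest, ext):
        dest_by_size.setdefault(os.path.getsize(path), []).append(path)

    matches = []
    for src in files_with_ext(source, ext):
        for candidate in dest_by_size.get(os.path.getsize(src), []):
            # shallow=False forces a comparison of the actual file contents.
            if filecmp.cmp(src, candidate, shallow=False):
                matches.append((src, candidate))
    return matches
```

Calling `compare_trees(source, dest, "iso")` with two folder paths returns the matching pairs, which can then be printed or logged. The advice about keeping the two directories on different HDDs applies here just as much.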


--- Code: Text ---
@if (@CodeSection == @Batch) @then
@echo off
color 1a
setlocal EnableExtensions
setlocal DisableDelayedExpansion

rem This file is re-run through CScript as JScript to show the folder picker.
echo Select SOURCE folder
for /F "delims=" %%S in ('CScript //nologo //E:JScript "%~F0"') do (
    set srce=%%S
)
if "%srce%"=="" goto :End2
echo Selected folder: "%srce%"

echo Select DESTINATION folder
for /F "delims=" %%D in ('CScript //nologo //E:JScript "%~F0"') do (
    set dest=%%D
)
if "%dest%"=="" goto :End2
echo Selected folder: "%dest%"

if "%srce%"=="%dest%" goto :End1

:GetExt
echo.
set /P ext=Please enter extension [eg. jpg] ENTER to exit: 
if not defined ext goto :End2
set /a totfiles=0
set /a matchfiles=0
rem Quote the whole assignment so the value itself holds no quote characters.
set "logfile=%~dp0FCompare.log"

echo --------------------------- >>"%logfile%"
echo %date%  %time% >>"%logfile%"
echo --------------------------- >>"%logfile%"

for /r "%srce%" %%a in (*.%ext%) do (call :CheckSize "%dest%" %%~za "%%~fa")

:End
echo Total files:    %totfiles%
echo Matching files: %matchfiles%
if %matchfiles% equ 0 (echo Matching files: 0 >>"%logfile%")
set totfiles=
set matchfiles=
set ext=
if exist "%logfile%" call :ViewLog
goto :GetExt

:End1
color 1c
echo **** SOURCE and DESTINATION are the same! ****

:End2
set srce=
set dest=
pause
exit

:ViewLog
set /p view=View logfile [y,n]
if "%view%"=="y" (start notepad.exe "%logfile%")
set view=
goto :EOF

rem %1 = destination folder, %2 = source file size, %3 = source file path
:CheckSize
set /a totfiles+=1
for /r %1 %%b in (*.%ext%) do (
    if %2==%%~zb (
        echo.
        echo Comparing: "%~3" "%%~b"
        echo Sizes: %2   %%~zb
        .\cmp.exe -s "%~3" "%%~b"
        rem cmp exits with 0 only when the files are identical. "if errorlevel 0"
        rem is true for every exit code, so test "not errorlevel 1" instead.
        if not errorlevel 1 (call :Matching "%~3" "%%~b")
    )
)
goto :EOF

:Matching
set /a matchfiles+=1
echo - Match -
echo Match: "%~1" --- "%~2" >>"%logfile%"
goto :EOF

endlocal

rem End of Batch section
@end

// JScript section
// Creates a dialog box that enables the user to select a folder and display it.
var title = "Select a folder", rootFolder = 0x11;
var shl = new ActiveXObject("Shell.Application");
var folder = shl.BrowseForFolder(0, title, 0, rootFolder);
WScript.Stdout.WriteLine(folder ? folder.self.path : "");

Log file example:

--- Code: Text ---
---------------------------
Sun 26/10/2014  22:19:46.65
---------------------------
Match: "R:\test\dir1\FC3 - Blood Dragon.iso" --- "D:\dir2\why.iso"

x16wda:
If you have everything in one location, it might be easier to use fsum to build a list of the MD5s of the files, then you can sort the result and see what is duplicated. Names and file extensions can be misleading.  ;)

Navigate to the top level and use "fsum -r *.* > checksums.txt"  and you'll get a nice report. Then you could "cat checksums.txt | sort | uniq -D -w 33 > dups.txt" to get a good list of just the duplicates. (You'd need unxutils for cat and uniq - and other useful stuff too.)

Both reports are attached here for a sample folder.
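If unxutils is not to hand, the sort-and-uniq step can be approximated in a few lines of Python. This is only a sketch, and it assumes each line of the report starts with a 32-character hex MD5 followed by a separator and the file name, which is what the "-w 33" above implies. The function name `duplicate_groups` is made up, and the report format should be checked against a real checksums.txt before relying on it.

```python
from collections import defaultdict


def duplicate_groups(report_lines, digest_length=32):
    """Group file names by leading digest and keep digests seen more than once."""
    hex_digits = set("0123456789abcdef")
    by_digest = defaultdict(list)
    for line in report_lines:
        line = line.rstrip("\r\n")
        digest = line[:digest_length].lower()
        # Skip blank lines, comments and anything not starting with a hex digest.
        if len(digest) < digest_length or not set(digest) <= hex_digits:
            continue
        # Assumed separator: spaces and/or an asterisk between digest and name.
        name = line[digest_length:].lstrip(" *")
        by_digest[digest].append(name)
    return {d: names for d, names in by_digest.items() if len(names) > 1}
```

Passing an open file works, e.g. `duplicate_groups(open("checksums.txt"))`, and the result maps each repeated checksum to the list of files that share it.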
