Comparing Directories Recursively
What is the best tool to compare file hashes in two different drives/directories such as after copying a large number of files from one drive to another? I used cp -au to copy directories, not rsync, since it is between local disks.
I found a mention of hashdeep on the ‘net, which means first running it against the first directory to generate a file with checksums, then running it a second time against the second directory using that checksum file. Hashdeep, however, is not in the CentOS repository and, according to the ‘net, is possibly no longer maintained.
I also found md5deep which seems similar.
Are there other tools for this kind of automatic comparison? What I am really looking for is a list of files that exist in only one place or whose checksums do not match.
14 thoughts on - Comparing Directories Recursively
diff --brief -r dir1/ dir2/
might do what you need.
If you also want to see differences for files that may not exist in either directory:
diff --brief -Nr dir1/ dir2/
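As a minimal sketch of the difference between the two invocations (the directory trees here are throwaway examples built just for illustration):

```shell
# Illustrative only: build two small trees, then compare them.
demo=$(mktemp -d)
mkdir -p "$demo/dir1" "$demo/dir2"
echo "same content" > "$demo/dir1/a.txt"
echo "same content" > "$demo/dir2/a.txt"
echo "only in dir1" > "$demo/dir1/b.txt"

# diff exits 1 when the trees differ, so tolerate that in scripts.
# Without -N, a one-sided file is listed as "Only in ...";
# with -N, it is treated as empty on the missing side and reported as differing.
diff --brief -r  "$demo/dir1" "$demo/dir2" || true
diff --brief -Nr "$demo/dir1" "$demo/dir2" || true
```

Identical files (a.txt above) produce no output at all, which is what makes --brief usable as a quick "did the copy succeed" check.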
But is diff not best suited for text files?
The standard unix diff will show if the files are the same or not:
$ diff 1.bin 2.bin
Binary files 1.bin and 2.bin differ
If there is no output from the command, it means that the files have no differences.
Since you don’t need to know exactly how the files are different (the mere fact that they are different is what you do want to know), that should do it.
source:
find . -type f -exec md5sum \{\} \; > checksum.list
destination:
md5sum -c checksum.list
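The two-step workflow above can be sketched end to end (the paths here are throwaway examples). One detail worth hedging: md5sum -c looks files up by the paths recorded in the list, so the find should be run from inside the source directory to produce relative paths that are also valid from the destination:

```shell
# Illustrative sketch of the checksum-list workflow.
work=$(mktemp -d)
mkdir -p "$work/source" "$work/destination"
echo "payload" > "$work/source/file1"
cp "$work/source/file1" "$work/destination/file1"

# Source side: record a checksum for every regular file, with relative paths.
(cd "$work/source" && find . -type f -exec md5sum {} \; > "$work/checksum.list")

# Destination side: verify; md5sum -c prints one OK/FAILED line per file
# and exits non-zero on any mismatch or missing file.
(cd "$work/destination" && md5sum -c "$work/checksum.list")
```

Note that this catches corrupted or missing copies, but not extra files that exist only on the destination; the diff or rsync approaches cover that case.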
Wouldn’t diff be faster because it doesn’t have to read to the end of every file and it isn’t really calculating anything? Or am I looking at this in the wrong way.
Hi,
[snip]
rsync obviously offers the ‘exist in only one place’ feature but also offers checksum comparisons (in version 3 and higher, I understand)…
-c, --checksum
This changes the way rsync checks if the files have been changed
and are in need of a transfer. Without this option, rsync uses
a “quick check” that (by default) checks if each file’s size and
time of last modification match between the sender and receiver.
This option changes this to compare a 128-bit checksum for each
file that has a matching size. Generating the checksums means
that both sides will expend a lot of disk I/O reading all the
data in the files in the transfer (and this is prior to any
reading that will be done to transfer changed files), so this
can slow things down significantly.
The sending side generates its checksums while it is doing the
file-system scan that builds the list of the available files.
The receiver generates its checksums when it is scanning for
changed files, and will checksum any file that has the same size
as the corresponding sender’s file: files with either a changed
size or a changed checksum are selected for transfer.
Note that rsync always verifies that each transferred file was
correctly reconstructed on the receiving side by checking a
whole-file checksum that is generated as the file is transferred,
but that automatic after-the-transfer verification has
nothing to do with this option’s before-the-transfer “Does this
file need to be updated?” check.
For protocol 30 and beyond (first supported in 3.0.0), the
checksum used is MD5. For older protocols, the checksum used is
MD4.
Rich.
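Putting that man-page description to work as a pure comparison, a rough sketch (paths are placeholders; this assumes a reasonably recent rsync 3.x): combining -c with --dry-run and -i (itemize changes) lists every file that would be transferred, i.e. files that are missing or whose checksums differ, without copying anything:

```shell
# Sketch: rsync as a checksum-based comparator; nothing is modified.
base=$(mktemp -d)
mkdir -p "$base/src" "$base/dst"
echo "alpha" > "$base/src/a"
echo "alpha" > "$base/dst/a"
echo "beta"  > "$base/src/b"   # present only on the source side

# -r recurse, -i itemize differences, -c compare by checksum,
# -n (--dry-run) report only, transfer nothing.
rsync -ricn "$base/src/" "$base/dst/"
```

Files with matching checksums produce no itemized line, so only the one-sided file b shows up in the output; swap the two arguments to catch files that exist only on the destination.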
Speed was not stated as a requirement, although the MD5 algorithm is fast by design and the question did ask to “compare file hashes”. Nevertheless, I also use diff to compare. It depends on your needs …
I did end up using diff which seemed to work well.
Thank you, this time I used diff.
Thank you, saving this for the future.
Ok!
Great, used as suggested!
In article <20171027175431.e265479c4f9b4658fe2179bf@sasktel.net>, Frank Cox wrote:
If the files are the same (which is what the OP is hoping), then diff does indeed have to read to the end of both files to be certain of this. Only if they differ can it stop reading the files as soon as a difference between them is found.
Cheers Tony
I typically use ‘rsync -av -c --dry-run ${dir1}/ ${dir2}/’ (or some variation) for this.