Checksums For Git Repo Content?

Hi all,

Since the vault for 7.3.1611 was cleared out last Sunday (20170207) – why is that? – I’m using git to download an “SRPM”, or more accurately, its contents.

However, using git has one major drawback: It is missing checksums for the files.

Are there any plans to provide checksums for the files in git so I can be sure that what I download has not been tampered with? SRPMS are signed, so we can be sure their content is what you provided.

Regards, Leonard.

20 thoughts on - Checksums For Git Repo Content?

  • Hello John,

    SRPMS are signed which allows the integrity of the contents to be checked. Such an integrity check is missing from the git repo.

    Either a checksum file for each file, or a single checksums file per package/release holding the checksums for all files of said package/release (including the tarballs that are downloaded with get_sources.sh), would do.
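
    (Purely as an illustration of the idea, assuming a checked-out package directory whose tarballs have already been fetched with get_sources.sh; the paths and package name are only examples:)

    $ cd rpms/bc
    $ find SOURCES SPECS -type f -print0 | sort -z | xargs -0 sha256sum > CHECKSUMS
    $ sha256sum -c CHECKSUMS    # anyone can later re-verify the whole tree with this one command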

    Regards, Leonard.

  • Red Hat exports the source code to the repo; I don’t think they are going to change what they put in. It is an extracted SRPM.

  • It shouldn’t be hard to generate a checksum file. Or should this request be directed at Red Hat?

    Regards, Leonard.

  • At the time of extraction, the .metadata file is created (again, not by us, but by the Red Hat team that distributes the source), and the sha1sums for all the non-text sources are in there, while all the text sources live in git itself.

    You can see who modifies any of those files (the text sources and the text .metadata file).
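
    (For reference, the downloaded sources can be checked against .metadata by hand; this sketch assumes each line of that file is simply a sha1 digest followed by a path:)

    $ while read -r sum path; do printf '%s  %s\n' "$sum" "$path"; done < .metadata | sha1sum -c -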

  • Aha, .metadata, well, for bc, for example, I see only a checksum for the tarball, but not for the patch files. For the kernel it contains checksums for some (all?) source files, but again, not for the patches.

    Is this something you guys could pass on to Red Hat? If not, where should I direct a request to add checksums for patches to that metadata file?

    Regards, Leonard.

  • The patch files are in git as text files, right? Why would you need checksums of those? That is the purpose of git, right?

    There are checksums of all the NON-text (binary) files in the metadata file.

  • Checksums are there to make sure that you get what you are supposed to get. That is also true for text files. (A source tarball is just a bunch of text files in an archive.)

    Having checksums for all files (like in an SRPM) is a guarantee and saves the user the trouble of having to hand-check them. So yes, checksums for patch files and spec files would be appreciated and useful.
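
    (For comparison, this is the one-command integrity check an SRPM gives you today; the package file name below is only an example:)

    $ rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7    # trust the CentOS signing key
    $ rpm -K bc-1.06.95-13.el7.src.rpm                      # checks the embedded digests and the GPG signature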

    Regards, Leonard.

  • Git already has the protection you’re looking for. As part of its core design, git uses a hash chain to verify the integrity of its history.
    Every change and every file is thus protected. It’s impossible to insert changes or to modify the history of the git repository in a way that wouldn’t be extremely visible to all users.

    If you check out a module using git, and fetch its external sources using get_sources.sh, you can rest assured that every file used to build an RPM has been hashed and verified.
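
    (A rough sketch of what that means in practice; the clone URL, branch and file path here are assumptions, not gospel:)

    $ git clone -b c7 https://git.centos.org/rpms/bc.git && cd bc
    $ git fsck --full                     # re-hashes every object and checks the links between them
    $ git hash-object SPECS/bc.spec       # recompute a file's blob id from the working tree...
    $ git rev-parse HEAD:SPECS/bc.spec    # ...and compare it with the id the commit actually references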

  • Hello Gordon,

    Alright, understood. Only the sources downloaded with get_sources.sh need a checksum then, which are exactly the ones listed in .metadata.

    Thanks for clearing this up, and sorry Johnny for the fuss :) .

    Regards, Leonard.

  • What failure model are you trying to solve for, specifically?

    If you’re worried about malicious tampering of the files on the server, how would your request solve anything? If you don’t trust the Git repo you’re cloning from, why would you trust a checksum file stored in that same repo?

    If you’re worried about a MITM attack, any MITM that can modify Git data in-flight can produce bogus checksum files in-flight, too.

    If you’re worried about corrupted data at rest on the remote server or corruption introduced during the transfer, Git already solves this:

    https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

    If you want to verify that a given Git clone is consistent:

    $ git fsck --full --strict

    Git can do this because its contents are a type of Merkle tree:

    https://en.wikipedia.org/wiki/Merkle_tree

    Merkle trees are highly resistant to attacks, particularly in the case of source code, where an attack must not only change the attacked resource; the change also has to a) create some effect desired by the attacker and b) still be legal code in the programming language being used. Getting both effects while still maintaining the same SHA-1 hash is Difficult.™
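
    (You can look at that tree of hashes directly in any clone; illustrative commands, output omitted:)

    $ git cat-file -p HEAD                # the commit records exactly one tree hash
    $ git cat-file -p 'HEAD^{tree}'       # that tree records the hash of every file and subdirectory

    Change any blob and every hash above it, including the commit id itself, changes as well.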

    I don’t know Git internals, but I would expect the above git-fsck command to be pointless immediately after a clone, because Git should be doing something like what it does during the clone process. (I’ve been disappointed by Git’s behavior before, though, so…)

    That command should only have a useful effect after a later git pull command in order to detect whether the local copy has bitrotted in the meantime.

    A checksum guarantees nothing by itself. A file’s checksum is only as trustworthy as the source of that checksum. If you don’t trust the source to give you a correct file, you can’t trust that same source to give you a valid checksum. Any bad actor that can compromise one can compromise the other.

    *Distributed* checksums can sometimes be helpful, if they’re maintained by disparate parties on distributed servers. Here, you’re asking some third party to assert that they got a copy of the same RPM (or whatever) and that they got checksum XXXXXXX for it. That devolves into a trust relationship, rather than the math problem it naively looks like: do you trust that party not to be compromised by the same party that produced the RPM in question?

    Another trust problem — which is again a people problem rather than a math problem — is cryptographic signatures. A signed SRPM is only as trustworthy as the provider of the signing certificate. Certificate authorities are getting caught doing untrustworthy things *all the time*. Have you vetted your trusted CAs, or are you relying on a third party to do that? Why do you trust that third party to do that job thoroughly?

  • Not to stir up a hornets’ nest, but how does Google’s announcement at https://shattered.it affect this now? (Executive summary: Google has successfully produced two different PDF files that hash to the same SHA-1.) There is a whole paragraph on ‘How is GIT affected?’
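
    (The collision is easy to see for yourself, assuming the two demo PDFs are still published at those URLs:)

    $ curl -sO https://shattered.io/static/shattered-1.pdf
    $ curl -sO https://shattered.io/static/shattered-2.pdf
    $ sha1sum shattered-*.pdf      # same SHA-1 digest for both files
    $ sha256sum shattered-*.pdf    # different SHA-256 digests, so the files really do differ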

  • Ridiculous? Seriously? I don’t think it’s time to be in panic mode, but it is time to prepare for the generation-after-next of GPUs, which will be able to produce SHA-1 collisions quickly enough to keep up with a git repo.

    Dan Goodin disagrees with the idea that this isn’t a problem for git in the long run.
    See:
    https://arstechnica.com/security/2017/02/at-deaths-door-for-years-widely-used-sha1-function-is-now-dead/
    (not a bit of hyperbole in that headline, right?) Maybe it’s a bit premature to call it ‘dead’, but it is definitely in its death throes.

    Google is scheduled to release the source code to produce arbitrary “identical-prefix” collisions of SHA-1 hashes in three months. You need about $110,000 worth of compute time to pull off the attack, and that number will go down. We’re basically at the same place now with SHA-1 as we were in 2010 with MD5.

    The full paper can be read at https://shattered.io/static/shattered.pdf

    And an interesting discussion on git’s potential handling of a SHA-1 collision on a blob is at https://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob

    It may not be urgent, but it’s not ridiculous.

  • I think the issue is that you not only have to create a trojan file that matches the same hash, but one that matches the same hash and doesn’t break compiling.

    Maybe you could do that by padding the file in a comment, but I suspect it would be extremely difficult to pull off.
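
    (The padding itself is what has to be brute-forced: any small tweak produces a completely unrelated hash, so the attacker has to grind through an enormous number of variants. Illustrative only:)

    $ printf '++i;\n' | git hash-object --stdin
    $ printf '++i; /* padding */\n' | git hash-object --stdin    # a tiny tweak, a wildly different blob id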

    Bottom line is git needs to move to sha256 but I do not believe there is any present danger.

    I’d be more worried about fraudulently issued TLS certs combined with a DNS cache attack when doing a git checkout. That would be easier to pull off.

  • To replace pre-existing checkins in place, you have to execute what’s called a second-preimage attack, which is much, much harder than the collision attack presented by Google.

    The collision attack gives you the freedom to change both files until they match, whereas fixing one of the artifacts ahead of time requires you to pull off a second-preimage attack. Since the fear up-thread is about whether we can trust what’s already in the CentOS Git repos, only a second-preimage attack will do.

    There is a way to use a collision attack against Git or similar systems:

    https://news.ycombinator.com/item?id=13715887

    However, realize that in this context, it means you’d have to:

    1. Get the Red Hat or CentOS folks to accept the good version of your patch. (i.e. The benign version of the evil patch you want to get into RHEL and CentOS.)

    2. Hope that the committer doesn’t modify your patch before committing it, thus breaking the match to the evil version you spent $100k and a month of time creating.

    3. MITM the Git sync protocol between git.CentOS.org and the target site to inject your evil version into the sync stream. Since git.CentOS.org redirects to HTTPS by default and issues HTTPS URLs for you to clone from, you also have to break TLS, since unbroken TLS prevents MITM attacks. That, or someone has to *aim* while shooting themselves in the foot, going out of their way to remove the “s” from the URL.

    4. Since git.CentOS.org is apparently not mirrored, you have to execute this attack between git.CentOS.org and all end users of their service that you wish to attack, rather than poison one or more of the mirrors by MITMing the mirror’s connection back to git.CentOS.org.

    So yeah, it’s still Difficult.™

    All of this is not to say that Git doesn’t have a problem. It does. It’s just that the problem in question doesn’t affect the integrity of git.CentOS.org, as far as I can see.
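
    (One cheap sanity check against the MITM scenario in steps 3 and 4: ask git.CentOS.org directly which commit it is serving and compare that with what your clone ended up with; the repo URL and branch name are assumptions:)

    $ git ls-remote https://git.centos.org/rpms/bc.git refs/heads/c7    # what the server says the branch points at
    $ git rev-parse origin/c7                                           # what your clone actually has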

  • No, that’s the easy bit. If I want to replace this line of C code:

    ++i;

    with this:

    system("dd if=/dev/zero of=/dev/sda bs=1m");

    the attack presented by Google shows that you merely need to modify the evil version of the file (the Git checkin, in this case) until it matches the good version according to SHA1, which is spoofable given sufficient resources. Those resources let you fuzz the patch until you succeed:

    system("dd if=/dev/../dev/zero of=/dev/sda bs=1m");
    system("dd if=/dev/zero of=/dev/sda"); /* 0tt^V&Y3qeF3qIGlUS */

    etc.

    That’s why this is considered a collision attack rather than a second-preimage attack: both versions of the data can be adjusted until you find a collision, rather than just the new version of the data.

    There is present danger. Just not for the git.CentOS.org use case, for the reasons I laid out in my other reply.

    TLS has other defenses that prevent this attack from working against it:

    https://news.ycombinator.com/item?id=13715349

    We’ve been to this rodeo before.

  • Thanks for the good answer, Warren.

    Since last posting on this, I’ve been watching traffic on NANOG about it, and then Linus weighed in on the issue at https://plus.google.com/+LinusTorvalds/posts/7tp2gYWQugL which, in a nutshell, says:

    1. The sky isn’t falling, even though there is an actual issue here.

    2. There are a couple of patches mitigating the primary modes of this attack.

    3. Git will be upgrading to another hash, and that upgrade won’t break existing repos.

    So even in Linus’ words it’s not a ridiculous conversation, but it’s not super urgent, either. That is exactly the kind of statement I was after, and the information I was looking for.