Hacker News
Hidden GitHub commits and how to reveal them (neodyme.io)
94 points by chuckhend on Feb 23, 2024 | 41 comments


This highlights why it's so important that any secret that gets committed must be rotated. Simply removing it from the git history isn't enough, because it can still linger, it's just harder to find.


Full disclosure, I work for GitHub, but push protection from Secret Scanning is awesome for this because your nearly leaked secret doesn’t make it to the remote, and it gives you instructions on how to fix your local repo!


Why does GitHub provide no way for a repository administrator to self-service a git gc? I seem to recall reading a blog post that suggested GitHub had invested a bunch of engineering resource in making cleaning up unreachable objects much more scalable.


I haven’t reached out internally (and I’m not on a related team); the following is my own understanding.

The blog post was most likely this one: https://github.blog/2022-09-13-scaling-gits-garbage-collecti...

And I think it answers the product vision for it well (why it’s automatic):

> We have used this idea at GitHub with great success, and now treat garbage collection as a hands-off process from start to finish.

GitHub also provides these docs for what to do if there is sensitive data in your repo. The process is quite involved, but given the huge amount of internal knowledge of both GitHub internals and git internals, I would trust their advice:

https://docs.github.com/en/authentication/keeping-your-accou...

You can also contact support or create/join a community discussion: https://github.com/orgs/community/discussions

If you feel strongly that a feature you need is missing, by adding your voice, you increase visibility of the request. I think GitHub does offer solutions to this problem though, including eventual GC automatically.


That's the actual insane problem.

I noticed long ago that unreferenced commits survive on GitHub for long, but I couldn't find a way to discover them.

I know that GitHub stores together the objects of many repositories, but they should have implemented and offered a way to gc them when they came up with that optimization.

Sure, there would still be the chance that someone already obtained the objects by the time you gc them, but it's a much lesser risk than leaving them there indefinitely (and they could provide a log of the last fetches to better assess the impact of the erroneous push).


> chance that someone already obtained the objects by the time you gc them

I was under the impression that there are various 'mirror github' projects that listen to the GitHub change event API and immediately crawl some/all commits.

If so, this isn't a chance - it is certain.


Ok, there's also the problem of accesses through the web interface, but it probably wouldn't take much to provide a short-lived log of them as well


We turned that on about a year ago, and that totally helped reduce the silly. The new dashboards are nice too, letting you spot which application team needs a phone call. The 'This is still active' warning is fantastic. Wish all providers would give you the API to show that.


This is a useful feature but can only provide a degree of protection.

To a certain extent, your approach of considering any mistakenly pushed commit as public is laudable, but it still seems unreasonable to me to not provide an analogue to gc


Is it committed or pushed?

If I commit something locally, reset it and push to remote something else does it leave a trace?


It is still in your local repository, but it's not pushed to the remote repository. So forensics on your local machine may reveal it (probably until you run git gc, but I'm not an expert on git forensics), but it's safe otherwise.


While I agree that you should rotate accidentally exposed secrets, it should be noted that you can remove old history from git reflog by expiring it

I see git reflog kinda like an OS recycle bin


That's only client side, and you also need to "gc" it to get rid of it, or it will still be in .git/objects and can be retrieved via something like `git cat-file`.
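A minimal sketch of the local cleanup described here, assuming a plain local clone and that `git` is on the PATH. The helper name `purge_unreachable` and the constant `PURGE_COMMANDS` are made up for illustration; the two git invocations are the usual "expire the reflog, then prune" pair. This only affects the local repository, not any remote.

```python
import subprocess
from pathlib import Path

# Step 1 drops every reflog entry, so nothing keeps the old commits reachable.
# Step 2 garbage-collects and prunes unreachable objects immediately.
PURGE_COMMANDS = [
    ["git", "reflog", "expire", "--expire=now", "--all"],
    ["git", "gc", "--prune=now"],
]


def purge_unreachable(repo_dir: Path) -> None:
    """Run the cleanup commands inside repo_dir (hypothetical helper)."""
    for command in PURGE_COMMANDS:
        # check=True raises CalledProcessError if git reports a failure.
        subprocess.run(command, cwd=repo_dir, check=True)
```

Until both steps have run, the object can still sit in `.git/objects` and be read back with `git cat-file -p <id>`.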


We're getting a little off-topic, but even `git add` will put it in the object store without even committing! I once saved my boss's bacon with that. He had git-added a presentation file he'd been working on to commit it, but accidentally nuked his changes with `git reset --hard` before committing. He mentioned his mistake in chat, and we were able to recover the lost object by sorting files in the object directory by last-modified and cat-filing it back out by that ID. He bought me a beer for that after work that day. Good times.

Read gitcore-tutorial(7), folks. You too might save someone's bacon, some day.
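The recovery trick above works because a staged file is already stored as a loose object. Here is a rough Python sketch of that storage format, assuming a SHA-1 repository and loose (unpacked) objects only. The function names are invented for illustration and this is not how git itself is implemented; in a real repository you would simply use `git cat-file -p <id>`.

```python
import hashlib
import zlib
from pathlib import Path


def blob_header(content: bytes) -> bytes:
    """Header git prepends to a blob: b"blob <size>\\0"."""
    return b"blob " + str(len(content)).encode("ascii") + b"\0"


def blob_id(content: bytes) -> str:
    """SHA-1 object ID of a blob with this content."""
    return hashlib.sha1(blob_header(content) + content).hexdigest()


def write_loose_blob(objects_dir: Path, content: bytes) -> Path:
    """Store content in the loose-object layout: <2 hex>/<38 hex>, zlib-compressed."""
    oid = blob_id(content)
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(blob_header(content) + content))
    return path


def read_loose_object(path: Path) -> tuple[str, bytes]:
    """Decompress a loose object and split it into (type, payload)."""
    raw = zlib.decompress(path.read_bytes())
    header, _, payload = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type.decode("ascii"), payload


def newest_loose_objects(objects_dir: Path, limit: int = 5) -> list[Path]:
    """Loose object files, most recently modified first (the 'sort by mtime' step)."""
    files = [
        p for p in objects_dir.glob("[0-9a-f][0-9a-f]/*")
        if p.is_file() and len(p.name) == 38
    ]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)[:limit]
```

Pointing `newest_loose_objects` at a repository's `.git/objects` directory and passing the results to `read_loose_object` mirrors the "sort by last-modified, then cat-file" steps from the story.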


I also know this, because I accidentally added a huge media file and `git add .` took too long to return and then `.git` was huge lol


How do you do that once it's on GitHub's servers?



I think this isn't true anymore since at least the introduction of the unwanted Github activity view: https://docs.github.com/en/repositories/viewing-activity-and...


[deleted]


That link provides instructions on how to access commits from the reflog of a GitHub repository, not how to expire or otherwise delete them.


You don’t even need the pushes API to see commits that were force pushed away. You can get the head of any branch at a given time using `gitrevisions` [1] syntax any place that you would normally put a branch or commit.

E.g., to see the state of the cpython main branch on January 1 we can ask for `main@{2024-01-01}`:

https://github.com/python/cpython/tree/main@{2024-01-01}

This does not walk the commit history, but instead the server-side reflog, so it’s immune to force pushing and can only be avoided by GC of the reflog or repo. Definitely contact GH support if you pushed something you shouldn’t have.

[1] https://git-scm.com/docs/gitrevisions
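As a small illustration of the URL pattern above, this sketch builds such a link for any repository. The helper name `ref_at_date_url` is hypothetical, and it only assumes what the cpython example shows: that a `<ref>@{<date>}` expression is accepted in the tree URL.

```python
from datetime import date


def ref_at_date_url(owner: str, repo: str, branch: str, when: date) -> str:
    """Build a tree URL using the gitrevisions `<ref>@{<date>}` form."""
    # Doubled braces in the f-string emit the literal { and } around the date.
    return f"https://github.com/{owner}/{repo}/tree/{branch}@{{{when.isoformat()}}}"


print(ref_at_date_url("python", "cpython", "main", date(2024, 1, 1)))
```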


Ehm no, reflogs are notoriously local...

To be 100% sure that something hasn't changed recently I tried and, nope, your revision command only looks at the local reflog, after a forced push you get different answers from the original repository (that has the full reflog) and a new clone.


I'm very confused by your comment. The grandparent comment talks about using the gitrevisions syntax in a GitHub URL to search the reflog stored on GitHub. Nothing to do with your local clone of a repository.


Yes, I refer to the reflog on GitHub’s git file servers which tracks the historical state of refs there.


I'm sorry, I was on mobile and for some strange reason didn't check the whole link. I guess I was tired.

Incredible, I didn't know about it.


If you've inadvertently committed, say, copyrighted material to GitHub, and want to fully erase it, is there a way? Other than contacting GitHub as this article mentions.

Even if you contact them, GitHub says[1] that they will not remove "non-sensitive data", but makes no reference to copyrighted material.

[1] https://docs.github.com/en/authentication/keeping-your-accou...


If it's a copyright violation (be sure that it ACTUALLY is!) they will remove content in response to a DMCA request, but any forks will only be removed if you manually find them and issue a request for each fork. This isn't very useful if you accidentally uploaded your own copyrighted material though, since that's not a violation you could issue a notice for.


Can you DMCA yourself for someone else's copyrighted material? That's what I'm talking about here.


You have to be the copyright holder or their representative, so no, it would technically be illegal to DMCA yourself for violating someone else's IP. If you asked github support nicely they might help, though.


I don't think there's a need to erase copyrighted material? If it's your material then the copyright still holds. If it's not your material it's a problem between GitHub and the copyright holder who can DMCA the "hidden commit" if for some godforsaken reason the copyright holder somehow found the commit and cares.


There isn't, you need to contact them so they can delete the offending objects.


Is this an issue with git or GitHub only? If this is an issue with GitHub only, I won't use it anymore for personal projects.


Mostly a Git issue. In general Git won't remove old data pushed to remotes. Maybe if they run a garbage collection.

However GitHub does exacerbate it a little by providing APIs that list commits that are no longer in the history. However there are other ways to get this info such as brute-forcing short prefixes of commits.

But really this is another case of the general problem that once you publish information you can't unpublish it. If you push a secret to a repo you can't 100% reliably clean it up. You should assume that everyone with the repo took a copy.
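To give a sense of how small the short-prefix search space is, here is a sketch that enumerates candidate commit URLs. It assumes the host resolves abbreviated hashes at a `/commit/<prefix>` URL and that four hex digits is the shortest accepted prefix; both are assumptions, the function names are made up, and no requests are sent.

```python
from itertools import product
from typing import Iterator

HEX_DIGITS = "0123456789abcdef"


def short_prefixes(length: int = 4) -> Iterator[str]:
    """Yield every hex string of the given length, in ascending order."""
    for digits in product(HEX_DIGITS, repeat=length):
        yield "".join(digits)


def candidate_commit_urls(owner: str, repo: str, length: int = 4) -> Iterator[str]:
    """Yield one candidate commit URL per short prefix (nothing is fetched)."""
    base = f"https://github.com/{owner}/{repo}/commit/"
    for prefix in short_prefixes(length):
        yield base + prefix


# 16**4 candidates cover every possible 4-digit prefix.
print(sum(1 for _ in short_prefixes(4)))
```

With only 16^4 = 65536 four-digit prefixes, an exhaustive walk is cheap, which is why hiding a commit behind an "unknown" hash offers little protection.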


It's not really an issue, it's just that the assumption that removing a commit from the history actually deletes it is not correct. That holds for both Git and GitHub, and probably most other Git hosts.

Also in general, don't assume that you can remove anything from the internet once it has been published.


It is an issue. It means there's no way to actually delete commits from a GitHub repo.

And it is a GitHub issue. If you were self-hosting you could just run `git prune`, `git gc`, or `git repack`, or whatever the magic command is.


If your remote is publicly accessible (GitHub or not) anyone could have cloned it while the sensitive data was there and no magic command will make that go away


Right, but it’s not uncommon for a repo to be private with sensitive data that is identified and “removed” (using something like bfg or git-filter-branch) before being made public.

Naturally, if it’s a key or something else revocable those extra precautions should be taken regardless of using these tools, but that isn’t an option for some types of data and this implies that users have no systematic recourse.


This is a classic binary security fallacy. It's like saying "there's no point having a lock on your front door because you occasionally leave it open and then anyone could walk in!".

You know you are arguing that it should be impossible to delete things from a website right?


Git can potentially clean up dangling commits with `git gc --aggressive --prune=now`. GitLab offers this as part of housekeeping. However, be aware: this garbage collection does not work if you, e.g., reference a commit in an issue (like creating an incident that references the offending commit).


These commits can be deleted via `git gc`. Which part of GitHub's "architecture" prevents them from running that?


Does anyone know if tools like TruffleHog scan these?



