GitHub was down again (githubstatus.com)
166 points by originof on July 15, 2020 | 124 comments



I was seeing PRs failing to update & webhooks failing to trigger upon pushing code for 30 minutes before GH's status page acknowledged anything. I'm surprised they don't have monitoring in place that would catch webhooks failing within minutes of the failure beginning.


At large bureaucratic organisations there are often political implications to changing the official status, so it often lags behind reality until it cannot be swept under the rug anymore.

Not saying it's right, just an observation


As a CTO who took over responsibility for an often-failing eCommerce website that lost millions when down, I fell into the same trap of trying to sweep things under the rug.

Until I decided to no longer do that and my life improved considerably.


Yeah, it's very frustrating - especially if your customers are technical: they are seeing the errors while the status page says everything is fine.

I've seen status pages and error counts tied to bonuses, which only caused a giant mess of misaligned incentives and internal lies: customers are unhappy, developers are unhappy, and management lies to upper management. It's so much easier to focus efforts on real problems, be honest, and improve. Thank goodness I don't work there anymore (cough cough Google)


Do these companies not have live error reporting and tracing? Like surely GitHub got alerts that things weren't working? Why don't they just hook up their status page and their alerts? Or is it a political/relationship thing, and they want to have a human give out the status page updates?

This could have been caught with a cron job and some curl requests :\
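Something along these lines (a rough sketch - the endpoint, canary repo, and alert address here are just placeholders, not anything GitHub actually runs) would have flagged it:

    #!/bin/sh
    # Minimal canary, run from cron (e.g. * * * * * /usr/local/bin/github-check.sh):
    # probe the API and a repo we control, page someone if either fails.
    if ! curl -fsS --max-time 10 https://api.github.com/zen >/dev/null; then
        echo "GitHub API check failed at $(date -u)" | mail -s "GitHub check failed" oncall@example.com
    fi
    if ! git ls-remote https://github.com/example-org/canary-repo.git HEAD >/dev/null 2>&1; then
        echo "git ls-remote check failed at $(date -u)" | mail -s "GitHub check failed" oncall@example.com
    fi

Webhook delivery is harder to probe from the outside, but even this level of black-box checking catches "pushes aren't going through" long before a human updates a status page.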


In all honesty, they're typically just banking on people not noticing, trying to make as little fuss as possible and get it back up before it gets to Twitter. The problem is when it's not just a small blip: they haven't acknowledged it, it goes mainstream, and it's still down, which just leads to concerns about transparency.

Building infra, I have to work around all sorts of 3rd party services going out or having blips throughout the day - docker registries, caches, BGP, etc. It's totally an expected part of infra design, but not every team has the time or need to build in the resiliency. I see tons of outages that never get reported or, IMO, aren't reported adequately.

With that said, I'm no angel - I get all my service-down notifications through Slack, so when Slack's down..


What's the point in having a status page if it's a political artifact? In that case, it serves zero customer value.


It’s still an indicator that “no you’re not crazy, we’re having issues on our end”, but not a foolproof one. It’s kind of sad that Twitter is usually the best place to confirm an outage as it begins, rather than the software providers themselves. I assume that if they actually exposed global availability metrics, in most cases it would not look as good as they would want it to.


Down detector is based on twitter complaints and is pretty damn accurate.


3 words: Service Level Agreement

I've caught a big cloud provider not reporting a degraded service. I assume they knew, but politics and $ come in and it's easier to just gaslight everyone. I get it, but my frustration is worth losing a trailing 9.

I think there should be some 3rd party continuously testing APIs. Degraded states are downtime!


Honestly I think this can be true at any size organization. Small startups often take the approach of "let's hope no one noticed while we try to fix it", it's just that they have fewer users to notice so it's more likely to work.


They used to have real-time graphs and stuff on their status page. That's a thing of the past; with a more distributed system, they're probably not sure whether the service is down everywhere. If a node somewhere is still up, they might consider the service up. I don't know much, but it comes down to the kind of downtime measurement they have.


That's the way it is _everywhere_.

For example status.digitalocean.com is _not_ real time, it's manually updated.

And it's irritating as fuck.


Had the same issue this morning. The lagging status always causes the issue of "is it you, me or GitHub?" snaffoos. Really annoying to have these issues so consistently. Would switch to gitea or similar in a moment given the choice.


Just a friendly correction: SNAFU: Situation Normal: All Fucked Up


This misspelling brought to you by foobar.

Foobar: for when you are too polite to say FUBAR (Fucked Up Beyond All Recognition).


My two favourite lakes in the Yukon - SNAFU and TARFU (Things Are Really Fucked Up).

Named, of course, by the Army when they built the Alaska Highway.


I had never given any thought to this word as I had heard and used it since childhood, had no idea that was the origin!


GitHub Actions downtime is becoming painful for us. Having been lured on there with 10,000 included minutes, which they shortly thereafter dropped to 3,000, I feel aggrieved paying for overages incurred when Actions regularly shits the bed.


We're also having outages with Azure DevOps Pipelines every other month or so, it seems. And that's as paying customers - for hours there's no mention on the status page and we are stuck, unable to merge PRs or release our app in the standard way.


Paying Saucelabs customer here.

It's gotten more reliable over time (especially selenium events being dropped on the floor causing tests to stall and fail), but I used to have to babysit it quite a bit and there were quite a number of times where IE instances just would not spool up (with a multiple minute timeout set). Sometimes it was a one-shot thing, other times it went on for hours.

During these incidents the average allocation times listed on their status page would double for Windows VMs (I don't recall the exact numbers but they were on the order of 10 seconds vs 5) but nothing would be red, and most of the time nothing ever did go red.

And that's what you get for using averages for things and dividing infinity by n improperly.


This is weird because we haven't encountered any real issues with agents in Azure DevOps Pipelines. I think we've maybe had a single downtime in the last 6 months. They recently removed the .NET Core 2.2 SDK without any notice and broke our builds, but that's another thing.


Maybe it has something to do with us using the macOS hosts/agents? For deprecating things, I know that they sometimes do brownouts, see https://devblogs.microsoft.com/devops/removing-older-images-...


I couldn't find ANYTHING about the 2.2 removal. Just one day our unit test projects stopped working. I understand that 2.1 is LTS, but still.


GitHub Actions has been a huge letdown for me. The uptime issues and the lack of support for so many basic CI features are killing it for me (and have been for a year).

The only reason we're using it is because it's free..


Honestly I've had the opposite experience. With so many community actions available, I've had little trouble finding anything I could dream up. Sure, some of the actions features are a little immature but they are improving with time. The uptime issues are annoying and I feel like the lack of transparency is not helping that situation, but as far as CI solutions go, I feel like my move to actions has been a great way to get up and running with far less effort than other offerings like Code Pipeline.


Here we are again: me taking a break on Hacker News because all my webhooks and pull requests are fucked and I have no idea how far out of sync my devops tools are with the real state of affairs. I have pretty much had enough of this. It is too disruptive to our process. It is causing fragility and loss of confidence in our build pipeline.

At this point, we would probably be better off just bolting some lightweight git solution onto our devops tools (which are 100% custom in-house developed), rather than fighting with some more durably hosted offering from GitHub et al.

Anyone who posts that "but you can't make it more reliable than Microsoft" line is not thinking about the dependencies between systems and the considerable impact incurred on a service just by virtue of it being a publicly accessible platform without any cost barrier to entry. Sure, bringing it in-house might bring additional difficulties, but I think I can eliminate a shitload of existing difficulties if we moved from webhooks across the public internet to a direct method invocation within the same binary image.


We've been self-hosting GitLab CE for a couple of years. It's been great. No downtime, upgrades are seamless; it's fast and it just works.


Gitlab is probably at the top of the list of candidates if we go down this road. I don't necessarily need it to be in the same binary as my devops tools, but certainly no further than localhost or another machine on the same network.


or, you know, host a gitlab instance yourself & call it a day



I've been having a great experience with https://sourcehut.org/ as well.


I'll use your comment to say that Federation[1] has also been discussed in Gitlab for 2 years now.

Frankly I can't wait. Imagine being able to reference other users across instances with @username:instance or something to that extent, or projects and tickets.

1. https://gitlab.com/gitlab-org/gitlab/-/issues/6468


The first two links miss the idea a bit, I'd say.

I don't often need a web interface to a git repo. I can pull and do everything locally.

What I do use GitHub for is (1) code review and approval process, (2) CICD / actions, (3) releases to push stuff out.

The branch / tag / file browser is a nice addition, but it's not key. Rendering README.md is almost as important, if not more.


Gitea has a lot of those features already: https://docs.gitea.io/en-us/comparison/

Issues with a green 'X' mean they link to a feature on their issue tracker.

And, as far as I know, they are working on integrating CI/CD right now. They already have support for other non-integrated CI/CD platforms: https://docs.gitea.io/en-us/ci-cd/



For those that don't know, Gitea forked from Gogs a while back and the two are very much being developed with different philosophies. If you take a look at the active contributors for Gitea and Gogs, you can tell how much they differ now.

https://imgur.com/ZExNVV4

https://imgur.com/v0fGXgv

There are two active contributors for Gogs, while Gitea has 27. Note, the number of contributors can't tell you if one has higher quality or not, I just wanted to point out the difference in development philosophy.

Given that Gitea has significantly more active developers working on it, we can probably assume it can add functionality faster than Gogs though.


There have been at least three major outages (e.g. failing to git clone a repo) in the past week alone, all three of which have gone unreported (and NOT shown on their incident page), but I have email confirmation from GitHub support of these issues. It's almost time to switch to GitLab. I have hundreds of repositories, organizations, and packages to transfer, and while it will be daunting... I need reliability. I have several paid GitHub orgs and accounts as well.


To be fair they've been busy fixing the issue of slavery nomenclature in that time too. Respect where respect is due, important issues are being tackled here, you can't do everything at once.

https://twitter.com/natfriedman/status/1271253144442253312


I cannot say if this comment is being sarcastic, but for the record I found it hilarious. Thanks :)


Hilarious. GitHub has their fingers on the pulse of what developers and their customers really want. Not stability, but pretending to do things to help POCs through mindless censorship.


GitHub used to have a pretty cool status page, with all kinds of real-time graphs. Does anyone know what happened to it? It makes me really sad that the status page is now a plain lie; I had to visit HN to get confirmation that they are having issues again and that it wasn't just me.


I read an article about it, look for "status page evolution" https://nimbleindustries.io/2020/06/04/has-github-been-down-...


> But that could be all a part of coordinated effort to be more transparent about their service status, an effort that should be applauded.

Microsoft could be pushing for transparency. Or people are more relaxed about transparency now that GitHub has its exit. How long did GitHub know they were looking to be acquired? Maybe this analysis should look at a longer time interval..


From the first two graphs it looks like they are a lot less liberal about using "down" instead of "warn".


The best triage policies I've ever gotten to work with had severity and priority separated.

Severity went something like this (sometimes the numbers flip which always confuses at least 20% of the team about whether things are almost normal or people are hunting each other for sport).

1: data loss

2: some workflows blocked

3: some workflows unavailable w/ workarounds (ie other routes)

4: Everything else except irritations

5: Irritations

Having the UI break while the underlying functionality still works is not good, but people can still do their jobs, if more slowly. It's important to classify these separately from S2 and S4. There is urgency but don't panic. Go eat lunch or have your planning meeting, then go fix it. If data is getting lost ain't nobody doing nothin' until we figure it out, and then some people can go back to work but don't interrupt the people still working on it.

I think the problem is that so many metrically dysfunctional people, to the point of cliché, have rationalized that an S2 means only 20% of our customers can't do their jobs, so we are degraded but still working normally - when really a yellow status should be reserved for S3, while S2 should be at least orange, although those affected will be upset that it's not red.

Over time that 20% will shift around to most of your customers. Eventually several times, and then you'll wonder why everyone is talking trash about you on HN. It's not like that many people were affected!


The status page clearly states (when I look) that:

Incident on 2020-07-15 15:41 UTC

We are investigating reports of degraded performance.

Posted 9 minutes ago. Jul 15, 2020 - 15:41 UTC


We have the GitHub status RSS integrated into our Slack channel. One of my company's engineers noticed the outage at 15:06 UTC; the RSS feed picked it up at 15:49 UTC, though the message text says it was from 15:41 UTC. (And I think RSS polls, so there's some inherent lag, so I'd take the 15:41 UTC timestamp.) The half hour in between was us debugging, thinking it was us.


The last straw that got me out of mobile was working at a place with bad engineering discipline (or more precisely, bad management of engineering discipline). They were either paranoid or just didn't trust the team, and every time there was a blip in traffic someone in management would ride the engineers until they could prove it was on the other end. It almost always was. When I later saw the "stack trace or GTFO" comic I had a pretty clear idea what the author was feeling.

Eventually they rearranged the cube walls so management had to get more exercise to come harass the team. Yes, it was better use of space and the windows (in part due to my input), but that's not why the 2 people who started disassembling the cubes were doing it. "Fit of pique" is a phrase I don't get to use as often as I like, but that's what it was, whether cooler heads legitimized it or not.

Oddly, someone tried to blame my failure to convert to FTE on my interactions with one of those two engineers. He was all bark and not that much bite though. I could already handle him almost as well as anybody else and I was the new guy. No, they were trying to get everyone pagers and if that same kind of interaction happened at 2 am, I was gonna say something that got me fired. Found a much better offer and I stayed at the next place for 5 years, working on a surprising array of things and nobody ever said the p-word to me since.


Currently it does, indeed. From my Slack logs, at 15:00 UTC I noticed problems. I'm pretty sure that message is manually created, at least 41 minutes after the fact.


That's the most annoying thing. Usually when I get notifications from monitoring about some issue, the first thing I do is check the vendor or provider's status page to see whether it's an issue on their end. If there's nothing, I go and investigate.

Recently, more and more of them take 10-15 minutes until they mention a service outage. I don't work in super HA, I don't want to get an alarm because a single ping failed etc, so I'm lenient and have a few minutes of a delay in alarms. If I'm writing an internal incident report before the official status page is updated, that's bad.

This seems similar: external users noticing the outage and posting on HN before GitHub notices & acknowledges it.


Idea for a startup: a paid service that does independent health checks of popular services, with the ability to select which services I'd like to be notified about.


PagerDuty, OpsGenie, etc.?


The company I work for moved to GitLab because we became pessimistic about GitHub over the past few years. I don't really have a strong opinion on which is better, though - I still keep my private repositories on GitHub. However, I feel that Microsoft will start feeling the pain soon as more people in the development community sour on GitHub.


Why do you think they will feel the pain ever?


GitHub had been in growth mode up until the acquisition. If GitHub stops being the nirvana for developers that it once was, it will be another dark mark in the history of MS acquisitions. Moreover, considering how sentiment-influenced the stock market is at the moment, continued news of one of their products having outages could easily shed a considerable amount of Microsoft's valuation, ~1%. They say stocks only go up nowadays, but when everything goes up, whoever grows at the slowest rate is really going down. I'd assume that the Microsoft executive team won't be happy with the new perception of GitHub.


Why is outage history pre-acquisition removed from their history? If you try to go back in time it seems they only retain history up to a couple months after the acquisition. Is this just a 2 year retention policy or something being swept under the rug?


I can go back all the way to 2010: https://imgur.com/DsSKcFV


Wow, it used to be so much more detailed! I get they probably can't have that level of "casual" disclosure now that they are so big, but man the current status updates just feel so... useless and unhelpful in comparison.


I have never worked at Github or MS and have no inside info on this, but it may be as simple as having switched to a MS-run system for outage history tracking as part of their own M&A integration.


I think they switched to Atlassian StatusPage[1] a while ago.

[1] https://www.atlassian.com/software/statuspage


This change must have happened this year. I remember comparing pre and post acquisition outage rates a few months ago.

If it is not purposefully being swept under the rug, it sure is convenient.


I think it's worth noting that it also corresponds with the Actions feature (which went GA 11/11/19).


What's the easiest way to duplicate all your Github repositories, with history, somewhere else?

Ideally, I'd like to have two synchronized repositories, for no single point of failure, organizational or otherwise.


Run git on a personal server[1]? It's not as complicated as you might think. It's probably much more usable to set up GitLab, though.

Then set up the alternative remote on your repos.

[1]https://www.linux.com/training-tutorials/how-run-your-own-gi...


I posted this on another thread. If you only want the commits, something like this works,

    ssh user@git.example.com
    mkdir project-1.git
    cd project-1.git
    git init --bare
    exit
    git remote add alternate user@git.example.com:project-1.git
All you need is SSH


All you need to get all commits and all tags/branches:

  git clone --mirror https://github.com/you/repo
Push those to another server:

  git remote add new https://gitlab.com/you/repo
  git push --mirror new


I also use it that way. GitLab does the same if you use SSH. They provide hooks on init so that GitLab knows if something happens.


That's fine for your code repository.

However GitHub has more:

- Bug tickets
- PRs
- Wiki
- many projects use their GitHub Pages as their primary homepage
- and, more recently, GitHub Actions

And then: often collaborators are only known and identified by their GitHub handle. Running your own server requires some mechanism to identify them again and a way to handle their access credentials (SSH keys etc.).

Moving a mildly successful project isn't easy. Good if more people plan for that eventuality, even if they stay on GH for the time being.


    git remote set-url --add origin git@somewhere.else:my-project
This will make it so that every time you push to "origin", it'll push twice, to two places. You can repeat this to add a third or more.
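If you want to sanity-check it (both URLs below are placeholders; the first is whatever your origin already pointed at), git will list every URL configured for the remote:

    git config --get-all remote.origin.url
    # git@github.com:you/my-project
    # git@somewhere.else:my-project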


Gitlab has a feature set for this that's automated but it may be rolled under a paid plan if you want it to be bidirectional.

Building this in git itself is not hard at all, and there's likely a script or plugin for gogs or gitea.


Other answers talking about using git features are assuming that you don't care about the Wiki/PRs/Issues/Labels/etc., which are GitHub metadata and not part of your repo history.

But GitLab does have feature support for extensive importing and mirroring. https://docs.gitlab.com/ee/user/project/import/github.html (Import your project from GitHub to GitLab) has a section on project mirroring.


    git remote add NAME URL
    git push NAME
It will not transfer GitHub-specific content (issues, PRs, wiki, etc.), though.


git clone ... which you already have

git remote add originFoo URL

git push --all originFoo

There are other flags like --mirror, but I've never used them.

Source: https://git-scm.com/docs/git-push


You could probably use GitHub Actions to push any commits and branches to GitLab or anywhere else.
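For example (just a sketch - the GitLab URL and the GITLAB_TOKEN secret name are placeholders you'd set up yourself), the script step of such a workflow could boil down to a mirror push authenticated with a personal access token:

    # inside a scheduled (or push-triggered) workflow step:
    git clone --mirror https://github.com/you/repo tmp-mirror
    cd tmp-mirror
    git push --mirror "https://oauth2:${GITLAB_TOKEN}@gitlab.com/you/repo.git"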


Where does GitHub publish post-mortems of downtime? I only see things like "We have deployed a fix and are monitoring recovery." in the GitHub status history, which doesn't provide details.



The latest commits don't appear in the commits tab.


Yeah, I'm seeing the same. The branch reflects the new commit, but the PR open for that branch does not show it.


I’ve had this before. Closing and reopening the PR does seem to do the trick sometimes.


Same


For a free Gitea instance, you can try https://codeberg.org


Do I have to repeat this over and over again? If these non-profit open-source projects [0] are able to self-host a git solution like a GitLab, Gitea, cgit or Phabricator instance somewhere, surely your team or open-source project can too.

Even a self-hosted GH Enterprise would suffice for some businesses, though it would be overkill for others. The WireGuard author has been self-hosting on his own git solution (cgit) for years. [1]

This is problematic since many JS/TS, Go and Rust packages are on GitHub, which many developers rely on. Thus, it would be risky to tie an open-source project to GitHub-specific features (Actions, Apps, etc.).

[0] https://news.ycombinator.com/item?id=23818020

[1] https://git.zx2c4.com


Wait, will you really repeat this until you no longer see a "github down" link on HN frontpage?

That is dedication.


That's it, I'm launching SubversionHub.


At least you won't have to rename "trunk" to something else years later for some PC reason.


Awesome, that's where I can coordinate DDoS attacks with the community, right?


You have made me laugh.


Come on, blue ocean.

CVSHub.


Why not MercurialHub?


How do they not have a single update in over 1.5 hours? This is ridiculous.


Ugh the last month has been pretty difficult. Hope they get better soon.


So that's why my automated build wasn't triggered ~4 hours ago. I was like "no way GitHub is having issues again, they were down just the other day, it's probably just Docker Hub's fault". If they decide to publish a blog post about this series of outages later, I bet it would be pretty interesting.


It's been having issues all day. Wanted to show a coworker some changes I was proposing but the site wouldn't show the changes I'd pushed to my pull request. Ended up just having him pull the changes.

FWIW the git backend always seems rock solid in comparison to the front end they have displaying it.


I'm not sure this time. I had a PR update and kick off a build half an hour or so ago, only to see the build fail because git couldn't parse what it got from the clone operation.


I really want to move to GitLab but its UI is atrocious... it looks too much like a mobile app on desktop.


Noticing issues on GitHub, CircleCI, and LaunchDarkly.


I had a problem with GitHub a while ago when I tried merging a PR to the master branch: the merge commits showed up on master but the PR was still open. I would repeatedly click the merge button but the PR wouldn't show as merged.


Likely unrelated, but I recently noticed that GitHub stopped updating my activity overview for July. I definitely pushed commits, but they don't show up. Anyone else having a similar issue?


Are you pushing to master branch?


Yes, I was. However, I was pushing to master of a fork. Maybe that's the reason.


What is GitLab like in terms of downtime? I looked at their status history page and I'm seeing a lot of incidents, but it's hard to figure out what that actually means.


That's why the self-hosted options are there -- and why GitLab has a competitive advantage in this sense.

Cloud solutions are great -- however they have a golden rule, don't go down ever. This is seriously damaging to GitHub's reputation.


> That's why the self-hosted options are there -- and why GitLab has a competitive advantage in this sense.

GitHub has GitHub Enterprise

> Cloud solutions are great -- however they have a golden rule, don't go down ever.

Oh where oh where can I sign up for this mythical unicorn cloud service?


IIRC GitLab is super transparent about service outages, so it may not be that they're less reliable, but that they're more honest about it.


Gitlab used to have more outages than github, but these days they're about the same or even better than github. Also, they're really transparent about handling outages. They post link directly to the issue page in their status page so you see all those gitlab employees frantically trying to restore the service. I was pretty mad when they were having the last outage because I can't finish my work, but after checking the issue page and seeing how hard they work, I felt bad and decided to cut them some slack :)


Running your own git server is trivial (rough sketch below). I have been doing it for years on a very cheap DigitalOcean instance. Set up SSH keys, lock it down with ufw, done.

If that is not enough, run your own instance of gitlab.

If that is not enough, use GitLab.
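For the plain-git-over-SSH option above, a rough sketch of the setup (user, host, and repo names are illustrative, and this isn't a full hardening guide):

    # On a fresh VPS:
    adduser --disabled-password --gecos "" git   # dedicated user that owns the repos
    mkdir -p /home/git/.ssh
    cp ~/.ssh/authorized_keys /home/git/.ssh/    # or add your team's public keys
    chown -R git:git /home/git/.ssh
    ufw allow OpenSSH && ufw enable              # nothing else exposed
    sudo -u git git init --bare /home/git/myproject.git

    # Then, from your workstation:
    git remote add backup git@your-server:myproject.git
    git push backup --all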

Microsoft is going to attempt to make a profit on Github. That's okay, but based on past experience and current issues, their business model is lock-in not service.

I suspect the same is true for NPM.


At the current rate, whatever you host will have better uptime than GitHub.


They're probably using the Facebook SDK /s


Scaling Rails is hard? GitHub needs to move to a CDN and static site deployment instead.


Git is distributed.


Did Microsoft adopt Scrum?


The Microsoft Effect


[flagged]


> Eschew flamebait. Don't introduce flamewar topics unless you have something genuinely new to say. Avoid unrelated controversies and generic tangents.

https://news.ycombinator.com/newsguidelines.html


Don’t host yourself, it’s impossible to meet the reliability of the professionals


No, it's not. Apart from scheduled downtime when nobody's using it (e.g. restarts in the morning to update the kernel), it's not that hard to beat GitHub's uptime for a small Gitea instance. My power's on more than GitHub is up.

A UPS and a tethered smartphone would get me three nines uptime-while-anyone-needs-it, which is well in excess of what I need.


I think the OP was being sarcastic.


We migrated to on-prem GitLab running on k8s via the official Helm chart a year ago. We have ~50 users and so far have only had downtime when they required us to migrate from PostgreSQL 9.6 to 11 with the release of GitLab 13, and that was planned. We upgrade multiple times a month to stay up-to-date with the latest patches, and it's painless.

I don't regret it.


I think that a one-size-fits-all "host it yourself / don't host it yourself" answer is the wrong approach. Some organizations that have dedicated devops people and can easily maintain their own servers may get better uptime and reliability from their own instance. For smaller shops that don't have the time or expertise, I think it is true that git hosting is one of the many services that should be handled by a cloud service, whether GitHub or GitLab (or someone else).


... HN is mostly professional software developers though. Running a GitLab instance really isn't hard.


I used to agree. Now I work with a locally hosted github. It is down all the time and sometimes it just deletes all of the work from the past day. I thought it wasn't possible to do much worse, but I was obviously wrong.


At work I'm managing a GitLab instance for 15k users and 5k projects. Uptime has been 100% for the past year, except for a few minutes of planned downtime every month for the monthly upgrade. To be honest, I expected it to be a lot harder and to run into trouble... but I always find answers quickly in the GitLab docs or forum.


Hmm, I am a professional, so it is easily possible to meet my own reliability.


For many organizations, that's still true.



