I was seeing PRs failing to update & webhooks failing to trigger upon pushing code for 30 minutes before GH's status page acknowledged anything. I'm surprised they don't have monitoring in place that would catch webhooks failing within minutes of the failure beginning.
At large bureaucratic organisations there are often political implications to changing the official status, so it often lags behind reality until it can't be swept under the rug anymore.
As the CTO who took over an often-failing eCommerce website that lost millions when it was down, I fell into the same trap of trying to sweep things under the rug.
Until I decided to stop doing that, and my life improved considerably.
Yeah, it's very frustrating - especially if your customers are technical: they can see the errors while the status page says everything is fine.
I've seen status pages and error counts tied to bonuses, which only caused a giant mess of bad incentive alignment and internal lies: customers are unhappy, developers are unhappy, management is lying to upper management. It's so much easier to focus efforts on real problems, be honest, and improve. Thank goodness I don't work there anymore (cough cough Google).
Do these companies not have live error reporting and tracing? Surely GitHub got alerts that things weren't working? Why don't they just hook up their status page to their alerts? Or is it a political/relationship thing, where they want a human to give out the status page updates?
This could have been caught with a cron job and some curl requests :\
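Something along these lines, run from cron every minute or two, would do it - a rough sketch, with the probe endpoint and alert webhook as placeholders for whatever you actually care about:

    #!/usr/bin/env python3
    """Minimal external health probe; run it from cron every minute or two."""
    import json
    import sys
    import urllib.request

    # Placeholder endpoints -- swap in the service and alert hook you actually use.
    CHECK_URL = "https://api.github.com/zen"          # any cheap, unauthenticated endpoint
    ALERT_WEBHOOK = "https://example.com/alert-hook"  # e.g. a Slack incoming webhook

    def probe(url: str, timeout: float = 10.0) -> bool:
        """Return True if the endpoint answers with a 2xx within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False

    def alert(message: str) -> None:
        """POST a plain-text alert to the (placeholder) webhook."""
        body = json.dumps({"text": message}).encode()
        req = urllib.request.Request(
            ALERT_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req, timeout=10)

    if __name__ == "__main__":
        if not probe(CHECK_URL):
            alert(f"Health check failed for {CHECK_URL}")
            sys.exit(1)

Not a replacement for real monitoring, but it's the kind of thing that notices a dead endpoint within a minute or two instead of half an hour.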
In all honesty, they're typically just banking on people not noticing, trying to make as little fuss about it as possible and get it back up before it gets to Twitter. The problem is when it's not just a small blip: they haven't addressed it, it goes mainstream, and it's still down, which just leads to concerns about transparency.
Building infra, I have to work around all sorts of 3rd-party services going out or having blips throughout the day - docker registries, caches, BGP, etc. It's totally an expected part of infra design, but not every team has the time or need to build in that resiliency. I see tons of outages that never get reported or, IMO, aren't reported adequately.
With that said, I'm no angel: I get all my service-down notifications through Slack, so when Slack's down..
It's still an indicator that "no, you're not crazy, we're having issues on our end", but not a foolproof one. It's kind of sad that Twitter is usually the best place to confirm an outage as it begins, rather than the software providers themselves. I assume that if they actually exposed global availability metrics, in most cases it would not look as good as they would want it to.
I've caught a big cloud provider not reporting a degraded service. I assume they knew, but politics and $ come in and it's easier to just gaslight everyone. I get it, but my frustration is worth losing a trailing 9.
I think there should be some 3rd party continuously testing APIs. Degraded states are downtime!
Honestly I think this can be true at any size organization. Small startups often take the approach of "let's hope no one noticed while we try to fix it", it's just that they have fewer users to notice so it's more likely to work.
They used to have real-time graphs and stuff on their status page. That's a thing of the past; with a more distributed system, they're probably not sure whether the service is down everywhere. If a node somewhere is still up, they might consider the service up. I don't know much, but it comes down to the kind of downtime measurement system they have.
Had the same issue this morning. The lagging status always causes the "is it you, me, or GitHub?" snafu. Really annoying to have these issues so consistently. Would switch to Gitea or similar in a moment given the choice.
GitHub Actions downtime is becoming painful for us. Having been lured in with 10,000 included minutes, which they shortly thereafter dropped to 3,000, I feel aggrieved paying for overages incurred from Actions regularly shitting the bed.
We're also having outages at Azure DevOps Pipelines every other month or so, it seems. And that's a paid service - for hours there's no mention on the status page and we're stuck, unable to merge PRs or release our app in the standard way.
It's gotten more reliable over time (especially Selenium events being dropped on the floor, causing tests to stall and fail), but I used to have to babysit it quite a bit, and there were quite a number of times when IE instances just would not spool up (with a multiple-minute timeout set). Sometimes it was a one-shot thing, other times it went on for hours.
During these incidents the average allocation times listed on their status page would double for Windows VMs (I don't recall the exact numbers but they were on the order of 10 seconds vs 5) but nothing would be red, and most of the time nothing ever did go red.
And that's what you get for using averages for things and dividing infinity by n improperly.
This is weird, because we haven't encountered any real issues with agents in Azure DevOps Pipelines. I think we've maybe had a single downtime in the last 6 months. They recently removed the .NET Core 2.2 SDK without any notice and broke our builds, but that's another thing.
GitHub Actions has been a huge letdown for me. The uptime issues and the lack of support for so many basic CI features are killing it for me (and have been for a year).
The only reason we're using it is because it's free..
Honestly I've had the opposite experience. With so many community actions available, I've had little trouble finding anything I could dream up. Sure, some of the actions features are a little immature but they are improving with time. The uptime issues are annoying and I feel like the lack of transparency is not helping that situation, but as far as CI solutions go, I feel like my move to actions has been a great way to get up and running with far less effort than other offerings like Code Pipeline.
Here we are again: me taking a break on Hacker News because all my webhooks and pull requests are fucked and I have no idea how the state my devops tools report relates to the real state of affairs. I have pretty much had enough of this. It is too disruptive to our process. It is causing fragility and loss of confidence in our build pipeline.
At this point, we would probably be better off just bolting some lightweight git solution onto our devops tools (which are 100% custom in-house developed), rather than fighting with some more-durably-hosted offering from GitHub et al.
Anyone who posts the "but you can't make it more reliable than Microsoft" line is not thinking about the dependencies between systems and the considerable impact incurred on a service just by virtue of it being a publicly accessible platform with no cost barrier to entry. Sure, bringing it in house might bring additional difficulties, but I think I can eliminate a shitload of existing difficulties if we moved from webhooks across the public internet to a direct method invocation within the same binary image.
Gitlab is probably at the top of the list of candidates if we go down this road. I don't necessarily need it to be in the same binary as my devops tools, but certainly no further than localhost or another machine on the same network.
I'll use your comment to say that Federation[1] has also been discussed in Gitlab for 2 years now.
Frankly I can't wait. Imagine being able to reference other users across instances with @username:instance or something to that effect - or projects and tickets.
Issues with a green 'X' mean they link to a feature on their issue tracker.
And, as far as I know, they are working on integrating CI/CD right now. They already have support for other non-integrated CI/CD platforms: https://docs.gitea.io/en-us/ci-cd/
For those who don't know, Gitea forked from Gogs a while back, and the two are very much being developed with different philosophies. If you take a look at the active contributors for Gitea and Gogs, you can tell how much they differ now.
There are two active contributors for Gogs, while Gitea has 27. Note, the number of contributors can't tell you whether one is higher quality than the other; I just wanted to point out the difference in development philosophy.
Given that Gitea has significantly more active developers working on it, we can probably assume it can add functionality faster than Gogs though.
There have been at least three major outages (e.g. git clone of a repo failing) in the past week alone. All three went unreported (and are NOT shown on their incident page), but I have email confirmation from GitHub support of these issues. It's almost time to switch to GitLab. I have hundreds of repositories, organizations, and packages to transfer, and while it will be daunting... I need reliability. I have several paid GitHub orgs and accounts as well.
To be fair they've been busy fixing the issue of slavery nomenclature in that time too. Respect where respect is due, important issues are being tackled here, you can't do everything at once.
Hilarious. GitHub has their fingers on the pulse of what developers and their customers really want. Not stability, but pretending to do things to help POCs through mindless censorship.
GitHub used to have a pretty cool status page, with all kinds of real-time graphs. Does anyone know what happened to it? It makes me really sad that the current status page is a plain lie; I had to visit HN to get confirmation that they are having issues again and that it wasn't just me.
> But that could be all a part of coordinated effort to be more transparent about their service status, an effort that should be applauded.
Microsoft could be pushing for transparency. Or people are more relaxed about transparency now that GitHub has its exit. How long did GitHub know they were looking to be acquired? Maybe this analysis should look at a longer time interval..
The best triage policies I've ever gotten to work with had severity and priority separated.
Severity went something like this (sometimes the numbers flip which always confuses at least 20% of the team about whether things are almost normal or people are hunting each other for sport).
1: Data loss
2: Some workflows blocked
3: Some workflows unavailable, with workarounds (i.e. other routes)
4: Everything else except irritations
5: Irritations
Having a UI break while the underlying functionality still works is not good, but people can still do their jobs, if more slowly. It's important to classify these separately from S2 and S4. There is urgency, but don't panic: go eat lunch or have your planning meeting, then go fix it. If data is getting lost, ain't nobody doing nothin' until we figure it out, and then some people can go back to work, but don't interrupt the people still working on it.
I think the problem is that so many metrically dysfunctional people, to the point of cliché, have rationalized that an S2 means "only 20% of our customers can't do their jobs, so we are degraded but still working normally", when really a yellow status should be reserved for S3, and S2 should be at least orange, even though those affected will be upset that it's not red.
Over time that 20% will shift around until it has hit most of your customers, eventually several times over, and then you'll wonder why everyone is talking trash about you on HN. It's not like that many people were affected!
We have the GitHub status RSS integrated into our Slack channel. One of my company's engineers noticed the outage at 15:06 UTC; the RSS feed picked it up at 15:49 UTC, though the message text says it was from 15:41 UTC. (And I think RSS polls, so there's some inherent lag, so I'd take the 15:41 UTC timestamp.) The half hour in between was us debugging, thinking it was us.
The last straw that got me out of mobile was working at a place with bad engineering discipline (or more precisely, bad management of engineering discipline). They were either paranoid or just didn't trust the team, and every time there was a blip in traffic someone in management would ride the engineers until they could prove it was on the other end. It almost always was. When I later saw the "stack trace or GTFO" comic I had a pretty clear idea what the author was feeling.
Eventually they rearranged the cube walls so management had to get more exercise to come harass the team. Yes, it was better use of space and the windows (in part due to my input), but that's not why the 2 people who started disassembling the cubes were doing it. "Fit of pique" is a phrase I don't get to use as often as I like, but that's what it was, whether cooler heads legitimized it or not.
Oddly, someone tried to blame my failure to convert to FTE on my interactions with one of those two engineers. He was all bark and not that much bite, though; I could already handle him almost as well as anybody else, and I was the new guy. No, they were trying to get everyone pagers, and if that same kind of interaction had happened at 2 am, I was gonna say something that got me fired. Found a much better offer, stayed at the next place for 5 years working on a surprising array of things, and nobody has said the p-word to me since.
Currently it does, indeed. From my Slack logs, at 15:00 UTC I noticed problems. I'm pretty sure that message is manually created, at least 41 minutes after the fact.
That's the most annoying thing. Usually when I get notifications from monitoring about some issue, the first thing I do is check the vendor or provider's status page to see whether it's an issue on their end. If there's nothing, I go and investigate.
Recently, more and more of them take 10-15 minutes before they mention a service outage. I don't work in super-HA; I don't want to get an alarm because a single ping failed, etc., so I'm lenient and have a few minutes of delay in my alarms. If I'm writing an internal incident report before the official status page is updated, that's bad.
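By "lenient" I mean something like requiring a few consecutive failed probes before paging, rather than alarming on the first failed ping - a rough sketch of the idea, with made-up names and thresholds:

    import time

    FAILURE_THRESHOLD = 3   # consecutive failures before we page
    PROBE_INTERVAL_S = 60   # one probe per minute => roughly 3 minutes of lag before an alarm

    def watch(probe, alert):
        """Run `probe()` forever; call `alert()` only after a run of consecutive failures.

        `probe` returns True/False, `alert` sends the notification; both are
        placeholders for whatever monitoring plumbing you already have.
        """
        consecutive_failures = 0
        while True:
            if probe():
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures == FAILURE_THRESHOLD:
                    alert(f"{FAILURE_THRESHOLD} consecutive probe failures")
            time.sleep(PROBE_INTERVAL_S)

That few minutes of debounce is all I mean by lenient; the vendor status pages are still lagging well behind even that.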
This seems similar: external users noticing the outage and posting on HN before GitHub notices & acknowledges it.
Idea for a startup: a paid service that does independent health checks on popular services, with the ability to select which services I'd like to be notified about.
The company I work for moved to GitLab because we'd grown pessimistic about GitHub over the past few years. I don't really have a strong opinion on which is better, though; I still keep my private repositories on GitHub. However, I feel that Microsoft will start feeling the pain soon as more people in the development community get sour on GitHub.
GitHub had been in growth mode up until the acquisition. If GitHub stops being the nirvana for developers that it once was, it will be another dark mark in the history of MS acquisitions. Moreover, considering how sentiment-driven the stock market is at the moment, continued news of one of their products having outages could easily shed a considerable amount of Microsoft's valuation, ~1%. They say stocks only go up nowadays, but when everything goes up, whoever grows at the slowest rate is really going down. I'd assume that the Microsoft executive team won't be happy with the new perception of GitHub.
Why is outage history pre-acquisition removed from their history? If you try to go back in time it seems they only retain history up to a couple months after the acquisition. Is this just a 2 year retention policy or something being swept under the rug?
Wow, it used to be so much more detailed! I get they probably can't have that level of "casual" disclosure now that they are so big, but man the current status updates just feel so... useless and unhelpful in comparison.
I have never worked at Github or MS and have no inside info on this, but it may be as simple as having switched to a MS-run system for outage history tracking as part of their own M&A integration.
However GitHub has more:
- Bug tickets
- PRs
- Wiki
- many projects use their GitHub Pages site as their primary homepage
- newly, GitHub Actions
And then: often collaborators are only known and identified by their GitHub handle. Running your own server requires some mechanism to identify them again and a way to handle their access credentials (SSH keys etc.).
Moving a mildly successful project isn't easy. Good if more people plan for that eventuality, even if they stay on GH for the time being.
Other answers talking about using git features are assuming that you don't care about the Wiki/PRs/Issues/Labels/etc., which are GitHub metadata and not part of your repo history.
Where does GitHub publish post-mortems of downtime? I only see things like "We have deployed a fix and are monitoring recovery." in the GitHub status history, which doesn't provide details.
Do I have to repeat this over and over again? If these non-profit open-source projects [0] are able to self-host a git solution like a GitLab, Gitea, cgit or Phabricator instance somewhere, surely your team or open-source project can too.
Even a self-hosted GH Enterprise would suffice for some businesses, though it would be overkill for others. I've even seen the WireGuard author self-hosting his own git solution with cgit for years. [1]
This is problematic since many JS/TS, Go and Rust packages that developers rely on are hosted on GitHub. Thus, it would be risky to tie an open-source project to GitHub-specific features (Actions, Apps, etc.).
So that's why my automated build wasn't triggered ~4 hours ago. I was like "no way GitHub is having issues again, they were down just the other day, it's probably just Docker Hub's fault". If they decided to publish a blog post about this series of outages later, I bet it would be pretty interesting.
It's been having issues all day. Wanted to show a coworker some changes I was proposing but the site wouldn't show the changes I'd pushed to my pull request. Ended up just having him pull the changes.
FWIW the git backend always seems rock solid in comparison to the front end they have displaying it.
I'm not sure this time. I had a PR update and kick off a build half an hour or so ago, only to see the build fail because git couldn't parse what it got from the clone operation.
I had a problem with GitHub a while ago when I tried merging a PR to the master branch: the merge commits showed up on master, but the PR was still open. I would repeatedly click the merge button, but the PR wouldn't show as merged.
Likely unrelated, but I recently noticed that GitHub stopped updating my activity overview for July. I definitely pushed commits, but they don't show up. Anyone else having a similar issue?
What is GitLab like in terms of downtime? I looked at their status history page and I'm seeing a lot of incidents, but it's hard to figure out what they actually mean.
GitLab used to have more outages than GitHub, but these days they're about the same or even better. Also, they're really transparent about handling outages: they post a link directly to the issue page on their status page, so you can see all those GitLab employees frantically trying to restore the service. I was pretty mad during their last outage because I couldn't finish my work, but after checking the issue page and seeing how hard they were working, I felt bad and decided to cut them some slack :)
Running your own git server is trivial. I have been doing it for years on a very cheap digital ocean instance. Set up ssh keys, lock it down with ufw, done.
If that is not enough, run your own instance of gitlab.
If that is not enough use Gitlab.
Microsoft is going to attempt to make a profit on Github. That's okay, but based on past experience and current issues, their business model is lock-in not service.
> Eschew flamebait. Don't introduce flamewar topics unless you have something genuinely new to say. Avoid unrelated controversies and generic tangents.
No, it's not. Apart from scheduled downtime when nobody's using it (e.g. restarts in the morning to update the kernel), it's not that hard to beat GitHub's uptime for a small Gitea instance. My power's on more than GitHub is up.
A UPS and a tethered smartphone would get me three nines uptime-while-anyone-needs-it, which is well in excess of what I need.
We migrated to on-prem GitLab running on k8s via the official Helm chart a year ago. We have ~50 users and so far have only had downtime when they required us to migrate from PostgreSQL 9.6 to 11 with the release of GitLab 13, and that was planned. We upgrade multiple times a month to stay up-to-date with the latest patches, and it's painless.
I think that a one-size-fits-all answer to host-it-yourself vs. don't-host-it-yourself is the wrong approach. Some organizations that have dedicated devops people and can easily maintain their own servers may be able to get better uptime and reliability from their own instance. For smaller shops that don't have the time or expertise, I think it is true that git hosting is one of the many services that should be handled by a cloud service, whether GitHub or GitLab (or someone else).
I used to agree. Now I work with a locally hosted GitHub. It is down all the time and sometimes it just deletes all of the work from the past day. I thought it wasn't possible to do much worse, but I was obviously wrong.
At work I'm managing a GitLab instance for 15k users and 5k projects. Uptime has been 100% for the past year, except for a few minutes of planned downtime every month for the monthly upgrade.
To be honest, I expected it to be a lot harder and to run into trouble... but I always find answers quickly in the GitLab docs or forum.