
(I run our Ops team.) The time burden goes down as we add more people. It's still a burden, though -- and our compensation (in theory) reflects that.

FWIW, we're doing everything we can to make this "work hours only," M-F, which we could solve by hiring tons of people immediately, but we also have other ideals, like keeping the company as small as possible, that we want to realize too.

There's an open and ongoing discussion about making improvements in this area, and I'm thankful that David and Jason have been receptive to many of the suggestions I, or anyone else on our team, has had.

My personal stance is that we should do everything we can to give Ops a 40-hour work week during regular working hours and no more, even if that means people get cut a lot of extra slack to recover after a late-night page, etc. (Hopefully our team would back me up in saying I encourage people to take reasonable time to make up "lost" hours.)

(Also fwiw, I participate equally in the on call rotations.)


So you've got 6 people and want to have 8.

There's a big problem with being on call one week out of every 6 or 8: you lose touch with the procedures. Sure, your four-year veterans know everything by heart - but the first few shifts of a newbie are going to be perilous. I recommend making the shifts shorter and more frequent.

Presumably one person is on-call and everyone else can be called in / woken up as necessary. So - split each day into two halves, and ask people to be on-call for a 12 hour period.

Rotate the roster around so that Jane doesn't always have the same Friday-afternoon shift, nobody has 2 shifts in a row, and put it in a shared calendar so you can always see who has the watch.

With 6 people, you'll each take 2 and a third shifts per week (14 shifts split 6 ways). At 7, it's an even 2 shifts per week.
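
To make the arithmetic concrete, here's a minimal Ruby sketch of generating such a roster -- plain round-robin over 12-hour blocks. (The names, start date, and shift times are placeholders, not anything any particular team uses.)

    require "date"

    PEOPLE = %w[alice bob carol dave erin frank]   # hypothetical team of 6
    start  = Date.new(2017, 1, 2)                  # an arbitrary Monday

    # Two 12-hour shifts per day, assigned round-robin. With more than 2
    # people nobody ever pulls back-to-back shifts, and since 14 shifts a
    # week doesn't divide evenly by 6, the pattern drifts each week, so
    # nobody is stuck with the same Friday-afternoon slot forever.
    roster = (0...28).map do |i|                   # two weeks of shifts
      day    = start + (i / 2)
      hours  = i.even? ? "08:00-20:00" : "20:00-08:00"
      person = PEOPLE[i % PEOPLE.size]
      [day, hours, person]
    end

    roster.each { |day, hours, person| puts "#{day}  #{hours}  #{person}" }

    # 14 shifts/week split 6 ways = 2 1/3 shifts per person per week;
    # with a 7th person it drops to an even 2.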

Benefits:

- much less of a burden than a whole week of readiness

- brains work better when they haven't been pummeled for a week at a time (at least, mine does)

- easily scales fairly when you have more people, or when someone leaves, but keeps everyone in the loop. When you have 14 people in Ops, you only have one shift a week, but you get one every week.

- much more family-friendly

OK, why two 12-hour periods instead of splitting the day into 8, 6, or 4? Because people lose track too easily. Trying to schedule around your kid's concert or music lessons with smaller chunks is hard to keep in your head - and trying to work that in with a one-week shift is nigh-impossible.

Why not a 24 hour shift? Because it's really hard to recover from that. Humans are generally awake for about 15-17 hours a day. Shifting a few hours is generally doable.

I would recommend that for anyone who took an alert call during non-core hours, you automatically expect them to take the next normal day to recover. I know that when I get woken up at 4AM, I'll run out of steam by 2 or 3PM.


How do you even, like, commute to work with a 5-minute time-to-online requirement? Even for essential food shopping, I can sometimes be more than a 5-minute walk from the car and laptop. Or going to the bathroom.


> FWIW, we're doing everything we can to make this "work hours only," M-F, which we could solve by hiring tons of people immediately, but we also have other ideals, like keeping the company as small as possible, that we want to realize too.

And this is what I can't stand about small businesses. Nothing makes me run from a company faster than hearing it's committed to its ideals over what's best for everyone. Small businesses, for whatever reason, are the ones most likely to shout "Honor before reason!" and shoot themselves, their employees, and their customers in the foot. You don't see megacorps doing this.

What's best for everyone is to hire "tons of people immediately". If your ideals conflict with that, then you need to jettison your ideals.


I think 5 minutes is an ideal target, and not a hard and fast rule, right? Otherwise how could you even go to the bathroom?


We rotate daily, except on weekends, which are covered by a single person. Works well for us.


Basecamp | Ops/Sysadmin | Chicago, IL | REMOTE, Full-time | https://basecamp.com/

Basecamp solves the critical problems that every growing business deals with. We say it’s the saner, calmer, organized way to manage projects and communicate company-wide.

Basecamp Ops is responsible for infrastructure across 3 colocation sites in the United States and uses both Google Cloud and Amazon Web Services too. We're heavily a Ruby on Rails shop, though there are a few other languages hanging around in our deployments. If you're passionate about delivering fast and reliable sites at an awesome company that will respect you and help you grow personally and professionally, please get in touch: https://basecamp.workable.com/j/A5A189B311. (Oh yeah, we have amazing benefits too: https://m.signalvnoise.com/employee-benefits-at-basecamp-d2d...)


Asking candidates to:

1) describe themselves in a thousand words or less

2) write up a mock outage report

3) pitch something new to "the team"

just to submit their CV seems a little much. Pass.


Counterpoint: I'm AOK with these ideas.

I love take-home style interviews. They're without exception the most informative for both the interviewee and the company: much more can be conveyed than in an hour face-to-face, it's less stressful both to produce and to review, and it's just plain less painful to schedule.

Doing it before the in-person stage is a little unusual (maybe after the first phone call, to confirm it's not a waste of time for a position that's already filled, etc., would make more sense?), but I respect the idea.


Thanks for your feedback. It's understandable that this might be too much for some people.

To date, we've had a single person skip those questions and still submit. This person was not a qualified applicant.

Based on the applications we've received, this very minimal list of requirements appears to be working as desired. We're interested in employees who are going to do the best work of their careers. I'm okay with having a slightly higher bar (in terms of application effort) for that type of person.


>"I'm okay with having a slightly higher bar (in terms of application effort) for that type of person."

Asking people to jump through hoops is not raising any kind of bar.


(Full disclosure I work with the author, Noah, at Basecamp.)

Not sure if it's more important than math, but you're right: it's extremely important.

One of the biggest ongoing problems we have isn't getting data in -- it's helping everyone get data out. Even with training sessions, documentation, and some fairly fleshed-out "self help" tools, there is still confusion about where to look and how to "find" and "combine" the right data to answer a given question.

One of the ways we've partially solved this problem is through Tableau, which is a commercial solution. (I was skeptical about whether we would stick with solutions like Tableau, but it has been worth every penny.)


Niklaus Wirth coined the idea:

Programs = Algorithms + Data Structures

It's easy to think that the "Algorithms" side (where it's more obvious how to apply math) is dominant. But in fact, it's the data side.

Math is worthless if the data doesn't fit it. You can't do binary search until the data is ordered/unique. That's why I think math is secondary.

Of course, data structures have math in them, and still the data side dominates. Your O(1) get only works when the data structure is made for it...

There probably exist some counterpoints (anyway, I'm more of a database guy with minimal math skills), but I think the idea works most of the time.
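
A toy sketch in Ruby, just for illustration: the binary search is only valid once the data has been sorted to fit it, and the O(1) get only exists because a structure was built for it.

    ids = [42, 7, 19, 3, 88]

    # Binary search is useless on the raw array; the data has to be
    # shaped (sorted) to fit the algorithm first.
    sorted = ids.sort
    puts sorted.bsearch { |x| x >= 19 }   # => 19 (Array#bsearch assumes sorted input)

    # The O(1) "get" only works because we built a structure made for it.
    index = {}
    ids.each { |id| index[id] = true }
    puts index.key?(19)                   # => true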


Basecamp (formerly 37signals) - Chicago IL (But you can be anywhere. REMOTE, FULL TIME)

Android Lead Dev - https://basecamp.com/jobs/android

Marketing Designer - https://signalvnoise.com/posts/3841


We treat all environments as production. However, we add additional security to staging since it's the most experimental. (Additional security meaning it can only be accessed via VPN and after two-factor authentication.)

To be very clear: beta, production, rollout, and staging environments are all secure. They run all the same front-end security provisions, same data center, same patches, etc. The difference is that they run differing versions of the code base, depending on the stage of development.


(I do ops at 37signals...) We have two top-of-the-line ("high end") load balancers at each site. They malfunctioned. When they both malfunction at once, it doesn't matter whether it's active/active or active/passive or anything like that.


In the operating room, where some would argue procedural checklists like this count the most, they use a pre-op timeout procedure. Usually this is to ensure the right patient is being operated on, in the right place(s), and that the right operation is being performed. The same happens before the patient is "closed": a count is taken of every bit of material/tooling used in the procedure to make sure nothing is "left behind" (in the patient). Sources: 1) http://www3.aaos.org/member/safety/guidelines.cfm for more information. 2) My wife, who was a surgery resident.


We talked with some popular "mysql consultants" about things like mysql-master-ha, and there was always a lot of hand-waving and murmuring about how existing solutions were incomplete / had lots of bugs / were not to be trusted in production.

It looks to me like they do mostly the same thing. We wrote our script in Ruby because it's what we know and have the most expertise with, which makes it easy for us to debug.

mysql_role_swap is not black magic: it doesn't do automatic failover, and it doesn't run in the background / daemonize itself. It fails gracefully in a way we find predictable and useful.


The problem with rolling your own is that you're going to make mistakes. MHA, PRM, MMM, Tungsten, DRBD - all were written by people who know a lot about MySQL database failover, and yes, they still have problems. The reason they have problems is because doing failover properly under load is a hard problem to solve.

As for them not production ready? Sounds like a line being fed by someone who wants to sell their own solution. I have personally seen all of these solutions in use in production systems.

Some of the potential problems I note with your solution (and if you haven't been bitten by these yet, you will at some point):

1) You need to break existing connections to the old master. This is the single biggest cause of split brain that I've seen in these situations. SET GLOBAL read_only=1 only affects new connections.

2) You should verify that the slave has read and applied all of the logs from the master, to ensure that no data is lost when the slave is stopped.

3) You really should capture the binlog file & position from the slave after stopping the slave, to avoid getting inconsistent data in your two databases (there's a rough sketch of checks 2 and 3 after this list).

4) You may want to ensure that the ARP change actually took effect on your proxy machine: prior to unlocking the proxy, make a connection from the proxy server to the VIP and verify that the server_id matches the slave's server_id.
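
Not mysql_role_swap itself, just a rough Ruby sketch of what checks 2) and 3) could look like, assuming the mysql2 gem (hostnames and credentials are placeholders):

    require "mysql2"

    slave = Mysql2::Client.new(host: "db2.example.internal",   # hypothetical replica host
                               username: "ops",
                               password: ENV["DB_PASSWORD"])

    status = slave.query("SHOW SLAVE STATUS").first
    abort "replication is not configured on this host" unless status

    # 2) The slave must have read *and* applied everything it received
    #    from the master before it is safe to stop and promote it.
    caught_up = status["Master_Log_File"] == status["Relay_Master_Log_File"] &&
                status["Read_Master_Log_Pos"] == status["Exec_Master_Log_Pos"]
    abort "slave is still applying relay logs; refusing to promote" unless caught_up

    # 3) Capture the slave's own binlog coordinates at this moment so the
    #    old master can later be re-pointed at a known-consistent position.
    coords = slave.query("SHOW MASTER STATUS").first
    puts "promote using coordinates #{coords['File']}:#{coords['Position']}"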

MHA has a number of problems when considered as an automated failover solution. However, it's not required to run in an automated fashion to do manual failovers like this, and is quite good at doing them correctly; it encapsulates the lessons learned from its predecessors, and the knowledge of the engineers who have been bitten by doing a failover under load incorrectly.

There are some really great tools out there, and not using them (or learning from them) is going to cause your DBAs hours of hard work to get your DBs back in operational order when something goes wrong. Not if.


We write out the cluster config and update the script using Chef.

In an emergency situation, chef-client is a wee bit slow ;)

Also we like having the script standalone so there's just one moving part.


Heh, that's fair enough. More moving parts is always a problem. Thanks for sharing!


(I'm the author of the post).

These aren't blades, and I dislike blades for the hardware/vendor lock-in reasons you mentioned. Blades usually share power, network, etc. These only share power + the baseboard management controller. We can lose a single power supply and still keep on going. We've distributed the applications over multiple chassis, just in case though.


Ah, I haven't messed with these things yet. Do you like them?


The BMC is horrible. It needs a cold reset frequently, and it often doesn't work at all when accessed from a non-Windows machine. Otherwise they work really well. I like that the sleds can be removed from the front, too.

