> We've hacked together a Ruby script that retrieves a console screenshot via IPMI and checks the color in the image to determine if we've hit a failure or not.
That's pretty funny, yet it sounds all too familiar to many of us; every now and then we all do these sorts of nasty hacks.
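Out of curiosity, the core of that kind of check is tiny. A rough sketch in pure Ruby, with made-up thresholds and a made-up "mostly blue means failure screen" heuristic (the real script presumably decodes the IPMI screenshot into pixels first):

```ruby
# Hypothetical sketch of the color-check idea: given decoded RGB pixel tuples
# from a console screenshot, guess whether the box is sitting on a failure
# screen. The threshold and the "mostly blue" heuristic are assumptions.
def failure_screen?(pixels, threshold: 0.6)
  blue = pixels.count { |r, g, b| b > 150 && r < 100 && g < 100 }
  blue.to_f / pixels.size > threshold
end

healthy = Array.new(100) { [0, 0, 0] }    # a text console is mostly black
broken  = Array.new(100) { [0, 0, 200] }  # an error screen is mostly blue

failure_screen?(healthy)  # => false
failure_screen?(broken)   # => true
```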
Hah! I wrote a script a few months back and had to solve the issue of figuring out the state of a program running under Wine, and that was my solution (though not in Ruby, just a quick bash script).
I was pretty happy with the results, but it felt like an incredibly crude way to solve the problem. Now that I read this is being used at much higher levels than I'll get to, maybe it's not so bad :)
Versus just buying Opengear console servers and something like conserver to get the actual text via serial, like most large Unix environments (every job I've worked at) do. Then you can just scrape the text.
Right, this surprised me. They're doing IPMI, so why not SOL (Serial over LAN) to get the raw stream?
Actually, on second thought, I'm not so surprised. I've read many threads about Supermicro IPMI and people's frustration with it (reliance on outdated Java, hacked-together wrappers over VNC) that make it seem like a deliberate choice to obfuscate things -just- enough to make other tools difficult.
No, a career building those types of tools taught me SOL is garbage from most vendors. Cray, Dell, and HP are (arguably) best, with mostly reliable SOL, but they're still awful. If you paste too big a buffer into an SOL session, the Dell DRAC will freeze, so you have to kill and restart the serial connection. If you have > 1000 machines, hardware serial is the best thing to do for management, in addition to IPMI for power management.
* boot a custom live CD (a la Knoppix) over PXE
* Live CD places the node into a database if it doesn't exist yet,
using dmidecode to find serial numbers and such
* Live CD keeps querying the database for instructions
* Engineer adds a profile to the node in the database
* Live CD slices up the disk to match the profile
* Live CD fetches a tarball of the base OS from a URL
and throws it on the metal. Runs grub setup. Reboots.
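The steps above boil down to a small agent loop. A toy sketch of just the control flow (everything here is a stand-in: the `db` hash would really be HTTP calls to the provisioning database, registration would run dmidecode, and the partition/untar/grub steps would shell out):

```ruby
# Minimal sketch of the live CD agent loop; names and steps are illustrative.
def provision(node, db)
  # Register the node if the database hasn't seen it yet.
  db[node] ||= { "serial" => "from-dmidecode" }
  # Keep querying until an engineer has attached a profile.
  sleep 1 until (profile = db[node]["profile"])
  steps = ["slice up disks per profile '#{profile}'",
           "fetch base OS tarball and unpack onto the metal",
           "run grub setup",
           "reboot"]
  steps.each { |s| puts "#{node}: #{s}" }
  steps.size
end
```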
Kind of amazing that the state of the art in this area hasn't changed in nearly 10 years. The marketing angle is funny as well - everything has to be a cloud now. I think I used to call what we built at Optiver a "private cloud" but "Metal Cloud" is nicely buzzwordy as well.
PS - Hi Marty, unknown nick here but I'm sure you can figure out who I am. :D
I did something sort of similar with live CD images in Cobbler, for booting hardware not on a PXE network. It could also register machines seen for the first time by creating system records for them.
Well, no. But IMHO, iron is only doable if you have enough of it. I learned this the hard way at my previous contract, where we had only a handful of servers, all of them production.
When you want to automate your complete infra, including rolling out hardware, you need hardware to develop and test on. The entropy of life will ensure that at exactly the moment you need to reinstall that PostgreSQL slave from scratch, the PXE server is unreachable, or the server has a different disk controller, or an iLO certificate has expired. Or something stupid.
Test your code. And for Ops this means: machines that are solely for the testing pleasure of Ops. No other function.
But that is perfectly fine! That only means that you cannot have an automated install procedure that the company can rely on. There is really nothing wrong with a little manual labour at this stage. Do not spend weeks and weeks on automating this without having the environment to test your code.
Hardware provisioning is a dying art. I would love to see a modern-day xCAT[0] clone that's easy to install and configure and with proper multi-platform support. Foreman is half-way there, but AFAIK doesn't do BMC provisioning and discovery, which is a big deal.
Maybe you were not aware because it's a plugin. We kind of have that problem in the Foreman community: plugins are not as visible as they should be, and they can contain key features.
I too would be interested in the answer. From my perspective, Debian is the server Linux distro par excellence, and in my experience the folks who choose Ubuntu have been devs who don't actually use Linux (e.g., the sorts who develop on a Mac or in a VM rather than on a personal Linux system). It's not really fair to Ubuntu, which is decent enough in its own way, but I tend to consider the choice of Ubuntu to be a bit of an architecture smell.
I'm honestly interested in what the valid reasons to prefer Ubuntu over Debian (particularly on the server side) are.
Newer packages, sane LTS policy, easier to get non-free firmware/drivers going (as in, the default CD comes with them), seemingly more support from third parties.
They're both pretty awful due to their automagic(al tendency to break down in mysterious ways), but if I have to choose, I'll go with Ubuntu.
Disclaimer: My main box runs Gentoo and I own no Mac machines, if that changes anything in your vision of Ubuntu users.
Those two are in opposition: Debian (generally) has new-enough packages, but it's stable, which is what one wants on a server system. Meanwhile, Debian's LTS story is better than Ubuntu's: just upgrade, and know it will work.
> easier to get non-free firmware/drivers
But how often is that needed for server systems? And of course, there're the ethical & engineering issues with using proprietary software in the first place.
> seemingly more support from third parties
There is that, but if we all wanted more support from third parties, we'd have stuck with Windows, no?
> They're both pretty awful due to their automagic(al tendency to break down in mysterious ways)
I've not experienced that with Debian in a long time. I used to have issues with Ubuntu, but I don't think that they were generally all that bad. Better than what I used to experience with Macs and Windows back in the 90s, anyway.
This is a leading word, as is a lot of that paragraph. For many tools some companies use, Debian certainly is NOT "new-enough" in many of its package choices. Nor is Ubuntu inherently NOT "stable", and it still trails a little behind the leading edge. As for upgrades, I've watched many a server upgrade seamlessly from 10.04 LTS to 12.04 to 14.04. I'm sure there have been many a person and many a thread without seamless experiences, but the same applies to Debian; heck, even the release manual has a section entitled "How To Recover A Broken System" with reference to system upgrades.
"non-free firmware/drivers"
How often needed? In this article alone, IPMI, BMC, RAID, BIOS.
"And of course, there're the ethical & engineering issues"
This is a derailment. What exactly are the ethical issues for a closed source company in using other proprietary software?
I'm by no means an Ubuntu fanatic. It has its share of issues, absolutely. I have everything from FreeBSD to Debian to RHEL to OmniOS to administer, and they all have strengths and weaknesses.
A better question, in my mind, is why they'd choose Ubuntu over CentOS or RHEL, since they're running on Dell hardware, and Kickstart is far, far superior to Debian/Ubuntu's PXE install (and preseeding is awful to configure). My org went through a lot of pain because of this choice (which preceded me) and I'd hate for anyone to go through what I went through.
Also Dell's maintenance tooling barely works on Ubuntu at all; they don't even officially support their OpenManage stack on it. And forget about online firmware upgrades.
CentOS is a turd of a distribution, I'm certain the only reason it has any marketshare is because it's the only supported "free" OS for cPanel/WHM which a lot of web hosts provide for non-technical customers.
Unfortunately most of the material on ChatOps currently covers only how to get Hubot to display cat pictures or other trivia [1]. Maybe it's because each company should create their own "chat API" but I'd also like to hear some real, inside "war stories".
Does anyone know what app GitHub uses for chat? It looks like a simple and elegant UI over Basecamp.
I'm somewhat split. There's definite value in having a shared history of what people have done, but I prefer that to take the form of command-line tools pushing status updates to HipChat or whatever. You lose so much convenience by pretending HipChat's chat box is a terminal: everything from command history to being able to quickly iterate over the contents of a file or set environment variables.
One of the most compelling aspects of it for me isn't so much the shared history, but just visibility of what people have done. It's a really effective way of transferring knowledge of how things are done. You can easily drop into the room and watch play-by-play how a given task is done.
Semi-off-topic, but I am genuinely scared by giving a chat bot full root access to your infrastructure. This just doesn't seem like a mature enough, AAA-enabled channel. Especially when there's a third party (HipChat/Slack/...) involved.
These lines seem odd to me, maybe it's just the wording:
> [gPanel] Deploying DNS via Heaven...
> hubot is deploying dns/master (deadbeef) to production.
> hubot's production deployment of dns/master (deadbeef) is done! (6s)
Is this just an IMO odd use of the word "deploying" or does a DNS change really mean building and deploying a new package/image?
I've always thought of 'deploy' as a generic term for pushing any change to production. I think this is pretty typical - e.g. you deploy your application, even if that just means updating some files.
I deploy public DNS via new images. Image builds are fast, and our DNS changes rarely. When DNS changes frequently I wouldn't recommend it, though. Our internal DNS is using SkyDNS2 (backed by Etcd) instead, because that changes frequently (service registrations etc. as vms are started/stopped). But for the public DNS we like having the one DNS change => one git commit => one Docker image mapping to see who/why/when DNS changed.
>Once we've gathered all the information we need about the machine, we enter configuring, where we assign a static IP address to the IPMI interface and tweak our BIOS settings. From there we move to firmware_upgrade where we update FCB, BMC, BIOS, RAID, and any other firmware we'd like to manage on the system.
In theory it should be, if you have a tightly controlled hardware process (and in this case, Dell, who is used to selling servers configured to initially PXE boot, etc.), and you have some 'expect/send' scripting in place.
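The quoted lifecycle reads like a simple linear state machine, which is also roughly what you'd script against with expect/send. A minimal sketch (only configuring and firmware_upgrade are named in the quote; the surrounding state names here are guesses):

```ruby
# Linear provisioning state machine. 'configuring' and 'firmware_upgrade' come
# from the quoted article; 'discovering', 'installing', 'done' are assumptions.
STATES = %w[discovering configuring firmware_upgrade installing done].freeze

def next_state(current)
  i = STATES.index(current)
  raise ArgumentError, "unknown state: #{current}" unless i
  STATES[[i + 1, STATES.size - 1].min]  # the terminal state stays put
end

next_state("configuring")  # => "firmware_upgrade"
```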
If you are not otherwise running OpenStack, I'd say maintaining it is way too much effort for relatively basic functionality. They would also still need custom work for asset databases and the memory checks, so they don't gain much.
A consultant from Ubuntu called Ironic a "stillbirth" and said its name was indicative of its fit with the rest of OpenStack. His team used MAAS, which I gather serves most of the same purpose:
This is pretty cool but is there really that much benefit to doing this? What size IT staff do the savings justify? Does that change if you use Amazon's market-driven options like spot pricing and capacity planning discounts?
The equivalent of an m4.10xlarge running 24/7, which would cost $1814/month, costs about $400-$500/month to lease from Dell or HP.
There are other costs, like cooling, power, peering, networking gear, colocation/building costs, having spare parts on hand, paying sysadmins, et cetera, that are going to vary based on your requirements and region.
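Just running the parent's raw numbers (and deliberately ignoring those extra costs) shows why people bother at fleet scale:

```ruby
# Back-of-the-envelope math using only the figures quoted above.
aws_monthly   = 1814  # m4.10xlarge equivalent, 24/7, per the parent comment
lease_monthly = 450   # midpoint of the $400-$500 Dell/HP lease figure
yearly_delta  = (aws_monthly - lease_monthly) * 12
puts yearly_delta     # => 16368 (dollars per machine per year, before the
                      #    cooling/power/staff costs listed above)
```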
This seems wrong to me. This was the state of the art ~3 years ago. Now, I feel like all of the machines should already be provisioned with an OS and a basic image, and an orchestration system like CoreOS / Mesos / Docker should specialize them.
IMHO, requiring specific hardware, or an entire machine, should be the exception, not the rule.
Sometimes you have a workload that really isn't a fit for virtualization/containers/whatever the latest Rails hotness is, at all, and you just need to throw a couple of cargo trailers of insanely massively-spec'd servers at the problem. In those cases, your 'old school' server provisioning toolkit had better be on-point.
It's easy to forget just how ridiculously powerful bare iron is these days. Go to Dell.com and see how much RAM you can cram into a U or three or four today in 2015. Or see how many IOPS a modern NetApp or Symmetrix (EMC) can push with 'flashcache' or million-dollar SSDs. It is ridiculous, and while a lot of those platforms are meant for 'building your own private cloud', etc, there's a non-trivial amount of workloads/projects where bare-iron is the best tool for the job.
Containers are just a namespacing tool, though; you're still running on bare metal (well, bare Linux). Docker in particular runs on AuFS, which is slow, but other containerization tools just use a chroot.
A lot of other latency-sensitive applications tend to have so many adverse performance conditions (that can usually be remediated with a lot of blood sweat and tears) under virtualization that it becomes easier to just go bare metal and deal with physical infrastructure overhead.
Even if you were going to run CoreOS or Mesos on the machine, you'd still manage booting it into your specific image, which you can change, rather than trusting the pre-installed Dell version and managing that relationship.
Now there's probably some room for debate on whether these guys' jobs should just be outsourced to Amazon, but GitHub has some pretty good uptime and they seem to know what they're doing, so they've probably already won that debate.
Just because you use Containers/VMs for most of your apps doesn't mean that the lower levels don't need attention: installing OSes in the first place, hardware testing (both initially and to identify defects later), ...
And for important fileservers and databases you're going to run on specific hardware for a long time.
For whatever it's worth, SmartDataCenter, Joyent's open source SmartOS-based system for operating a cloud[1], does exactly this[2] -- and (as explained in an old but still technically valid video[3]) we have found it to be a huge, huge win. And we even made good on the Docker piece when we used SDC as the engine behind Triton[4] -- and have found it all to be every bit as potent as you suggest![5]
The big issue I have with that is that it involves trusting vendors to get network boot right, especially the looping part of "loop until DHCP gets a response". One of the cheap vendors tries 30 times and then goes to a boot-failed screen after trying the disk.
Also, about 1 time out of 4-5000 or so, a network boot fails. Not sure why.
That's where iLO comes in. iLO is horrible, but you can ssh to it and set all manner of stuff.
When we didn't have PXE, we had a script that told iLO to boot from CD, and that the CD was located at http://something/bootme.iso. iLO would always have network, and would magically pass the .iso to the server as the device to boot from.
If you have IPMI on the server this doesn't become such a big problem - you can reasonably trigger resets/reboots if it's not up after a given amount of time.
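That loop is simple enough to script. A sketch of the ipmitool side (host and user are placeholders, credential handling is omitted, and both the `host_up?` check and actually exec'ing the command are left out):

```ruby
# Build the ipmitool invocation for a remote power cycle as an argument list,
# ready to pass to system(*cmd). The -P/password handling is omitted.
def power_cycle_cmd(bmc_host, user: "admin")
  ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
   "chassis", "power", "cycle"]
end

# e.g. after a boot deadline passes (host_up? is a hypothetical check):
#   system(*power_cycle_cmd("bmc-web-01")) unless host_up?("web-01")
```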
We buy the cheapest server that meets our needs, and buy it in somewhat larger quantities (often double what was originally envisioned for less than was originally budgeted). Much more efficient.
But it does mean no IPMI. However, I built a small circuit that sits on a power cable and can interrupt it with a relay, driven over a bus plugged into our server, so we can do the reboot thing.
I've been meaning to redo that power-cable circuit using WiFi as the linking technology, now that the ESP8266 is available.
GitHub's physical infrastructure team doesn't dictate what technologies our engineers can run on our hardware. We are interested in providing reliable server resources in an easily consumable manner. If someone wants to provision hardware to run Docker containers or similar, that's great!
We may eventually offer higher order infrastructure or platform services internally, but it's not our current focus.
I built something similar for a managed hosting provider ~10 years ago. That doesn't make it any less useful now, and this does a ton more than our tool did, far more elegantly.
At some point someone needs to manage the actual hardware, whether that's you or a middleman, and when you're handling hundreds or even thousands of devices it's just not practical without automation.