I've been using Linux for 20 years and used to do this all the time, back when Linux --or more specifically the X Window System-- was less stable than today. I'd make sure to always compile the kernel with MagicSysRQ (you need it for the combo mentioned in TFA).
It wasn't even mandatory to restart the system: the trick was to use first MagicSysRQ and then issue a "vgareset" (IIRC I couldn't even see what I was typing, but the command was taken into account) and then, miracle: I could unlock frozen X Window sessions (more specifically: kill X, "reset" the GPU and then restart a new X Window session).
Note that very often Linux itself is fine: it's just X which is frozen. Heck, if you have another machine on your LAN and allow SSH in, you can SSH in and kill X / vgareset without needing MagicSysRQ.
But for quite a few years now X has been so stable that such hackery ain't needed anymore. Moreover, if I recall correctly, "vgareset" only existed for 32-bit systems (at least at one point). Nowadays my Linux workstation regularly reaches 6 months of uptime (there are only very rarely known remote root exploits mandating a kernel upgrade) so I've kinda "forgotten" how MagicSysRQ works ^ ^
>Mind you I don't think I ever used "Stop A" on a Sun for anything constructive...
Back at the dot-com I worked at, we used to have a SPARCstation 5 that we used for SPARC-II arch builds. It was a nice machine, but it had a dead NVRAM battery so it would lose its configuration every time we lost power in the building (which was a lot, because rolling blackouts used to be a thing in California).
Anyway, an interesting quirk about the SPARCstation machines is that the MAC address for the NIC is stored in NVRAM, and when you lose the NVRAM settings it defaults to ff:ff:ff:ff:ff:ff AKA the broadcast address. So by default, on boot, this machine would start sending out DHCP requests and other network traffic with a source address of the broadcast address. The switches we were using did not like that, and would start flooding all of their ports with traffic. The only way we found to fix it was to reboot the switches.
So, there is at least one constructive use for "Stop A": you can use it to configure a MAC address on your SS5 so that it doesn't inadvertently bring down the whole network in a massive broadcast storm.
Sort of. Stop-A drops you into the firmware prompt (OpenBoot), suspending the OS in the process. OpenBoot, of course, implements a Forth interpreter shell.
Ahh, the good old days... I once implemented a nice little boot device selecting utility for some of my Suns, all written in Forth and executed by the firmware on every boot.
Actually, I did a lot of RPN development on Suns. Not in Forth but in PostScript within NeWS/OpenWindows developing front ends for large scale Lisp applications.
Yes, I believe that's true. Stop-A pops you to a prompt (which I believe is all running Forth) where you can do all kinds of cool stuff. It completely suspends the OS. I remember I was once able to TFTP in a custom boot logo that gets loaded right into the BIOS.
I've also seen one graphical app hang X. If you Ctrl-Alt-F1 and switch to your getty session you can sometimes kill the one app and return to your X session.
It's a lot easier just to reboot and let fsck patch up your filesystem... flushing buffers isn't going to help you get back the last paragraph of your PhD thesis if Libre Office hasn't called write().
In any case, in ~8 years of using Linux, I don't think I've had a freeze that was the kernel and not just X.
One easy way to get kernel lockups is to use fglrx.
But agreed on 'just reboot': you should be using a resilient filesystem anyway, and Magic SysRq only works when it's enabled beforehand, and only for some kinds of lockups, and if the data is in the frozen application, syncing the disks isn't going to help.
And what if it's not your PhD thesis in Libre Office but a busy database server which can easily get corrupted if you don't flush.
> I don't think I've had a freeze that was the kernel and not just X.
Device drivers sometimes have bugs, especially when the device is not working properly (I had system freezes when there was a misbehaving capacitor on the graphics card).
> And what if it's not your PhD thesis in Libre Office but a busy database server which can easily get corrupted if you don't flush.
No properly configured and working ACID compliant RDBMS should lose any data when the server is reset or stopped. If it does, then it is either a problem with the hardware, OS, configuration or the RDBMS itself. The application must also be able to handle the DB disappearing, though. Sadly this is often not the case.
The disk may lie. The 's' will get the OS to send everything out to the disk. Actually writing it to the platter (or flash part, or whatever) is at the disk's discretion.
It's at the disk's discretion as far as the laws of physics are concerned, but this would be a severely broken disk prone to losing data, and if it came from a major server disk vendor, the vendor would take a pretty serious hit to its reputation.
Yes, but my point was that I could send data to an ACID compliant server, and kill it before the commit happened and data will be lost. Just trying to point out to the parent poster that sending is not enough, you need to wait for the commit.
Yes, I agree, but just several weeks ago I had to recover a MySQL InnoDB database after a power problem. MySQL is very popular, and it turns out it is an unsafe database server according to your definition. Well maybe it is.
Edit:
Besides, hard drives have write caches, and they can report a successful write operation to the OS when the data is still in its cache physically.
Yep, MySQL is an unsafe database. The fact that it's popular just goes to show how powerful marketing is on the mind of most people.
Hard drives are supposed to flush their caches before they report the end of a flush operation to the OS (some flush into flash, but they flush). If yours does not, it's defective. Go ahead and make use of the warranty.
MySQL is only ACID compliant under a very specific set of configuration parameters. This makes it even more dangerous than an RDBMS with binary acidity.
Re HD lying, this is rare except with dubious cheap flash sticks. The caching command set is well defined and operating systems know how to issue the relevant SCSI/SATA commands. It's critical for correct functioning of journaling filesystems such as NTFS and Ext4.
I remember some sort of magic invocation like that years ago for our medical "device" product, where we had to manage hundreds of remote node instances. Something along those lines. I don't remember why the developer (Gabe) came up with three syncs being the magic number. Maybe three times was just paranoia. =)
Back in the bad old days of pre-UNIX Macs, it was a common troubleshooting step to reset your PRAM. This is battery backed up RAM that holds some basic settings, and if it got corrupted somehow it could cause weird problems. You'd reboot while holding down command, option, P, and R, then wait for the boot chime to sound a second time indicating that it had been reset, then release the keys and boot normally.
Somehow this advice got mutated so that you'd keep holding the keys until you heard two boot chimes (thus resetting the stuff twice). And then it started to grow. Three was common. Some people would advise more. I'm pretty sure that doing it more than once never helped anything, but there we are.
(The cmd-opt-P-R sequence still works on modern Macs and I actually used it to resurrect a machine that wouldn't start up just a month ago, but it's far less frequently needed now.)
It's the same with the battery stats resetting and the Dalvik cache wiping these days in Android land. You do it three+ times.
Or the "Repair permissions" thing in OS X. You do it several times as well.
It's like whenever there's this one-step fix thing that a system utility does, the Common Man will interpret it as needing to repeat 3+ times in order for it to be effective.
>Run it and be amazed how much your disks/raid/OS lie. ("lie" = an fsync doesn't work)
>It seems everything from PATA consumer disks to high-end server-class SCSI disks lie like crazy. Yes, that includes SATA there in the middle. I'll discuss fixing your storage components in a second.
I believe the thinking is that because sync isn't instant, especially on older, slower hard drives, having to type it again gives it time to actually complete.
Back in the day (old Unix), the sync call would return right away, and the kernel would sync in the background. Unless there was a current background sync happening -- then sync would block until the first one finished, which is why you would have two sync's in a row. The third sync was thrown in just for luck.
I picked up the "sync three times" thing from an AIX kernel developer, who did it from before AIX had a "shutdown". (Yes, he would do "sync, sync, sync, power-off".) My theory was that using it three times gave the system time to actually sync the data.
This is likely a sun thing. Halt on (SPARC) Solaris does not shutdown the OS. It issues a reset command to the firmware. The halt command on Solaris is roughly equivalent to pressing the reset button on a PC.
When a Sun was particularly hosed we used sync;sync;sync; halt to reset and (hopefully) not lose any data (sync forces the OS write buffers to flush).
I wouldn't advise other people doing this as the filesystem can get seriously broken under certain circumstances.
For example, on a Linux laptop with the hard-drive encrypted with dm-crypt, I simply lost access to my drive due to repeated hard reboots. I don't know if I could have recovered my data from it or not, but after repeated attempts of googling for the error message and following advice I simply gave up and later reinstalled everything from scratch (it's a good thing I constantly make backups ;-)).
when doing kernel development on embedded systems or on a real host (not within a virtual machine), it's sometimes useful to get system information when it crashes. the cool thing is that it's also possible to send the sysrq through a console serial port by first sending a break command.
anyway, someone already mentioned a few mnemonics, I learned one, a quick googling led me here:
Or, if you're on anything approaching a modern system:
1. Tap the power button (one single brief tap)
2. Wait a minute
3. If the system hasn't already shut down or isn't obviously in the process of shutting down or you just get impatient: hold the power button down
4. Now that the system is off: tap the power button.
Modern systems (whether server, desktop tower or laptop) have a "soft" power button, that when tapped briefly sends a signal to the OS. Most flavors of linux are configured so that receiving that signal initiates a proper shutdown, just like "shutdown", "poweroff" or a GUI shutdown. All of this has been true for quite a few years now.
If the system is so locked up that the power button initiated shutdown doesn't work, you might as well just pull the power, which holding the button down for a second or two will do. Even if this happens, you're probably using a journaling filesystem that comes back up cleanly.
None of that will help with unsaved data in an application, because you almost certainly have to tell the application to save, and if X has locked up there is no way to do that. A clean shutdown might signal the app with a terminate signal and allow for data to be saved before doing a hard kill.
This is not for the situation in the article, but a slightly different one. You have a remote server, there's an ssh session still open to it but it's somehow lost access to all its mount points including /
You have nothing to work with but bash builtins and /proc, you're 100 miles away and you need to get it up and running again NOW. Emergency reboot -
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
How could that happen and leave your system in a state where it is still accessible and can be fixed by a reboot?
If you think you might face that kind of trouble you should keep a copy of Busybox and whatever other tools you might need on a RAM drive. You'd have an opportunity to figure out if rebooting would lead to a usable system.
The machine would occasionally stop responding completely. I got into the habit of leaving an open ssh session going from another box so I could try and poke around. The root drive was on a (very fast) usb 3.0 stick on an internal header on an add-in card.
There was a problem with the card or the driver as every so often something would go wrong and everything would stop responding again. The shell I left open revealed almost nothing as the root drive was gone, but it could be used to reboot the machine (thanks to the trick above), which would then be good again for another few days.
Sync never returns, all filesystems have been lost.
Yes, it's exactly like pulling and replugging, which was exactly what I needed!
--edit-- is sync even a bash builtin? Looks like it's /bin/sync on my systems. / had been lost (it was on a usb stick on an internal header on an unreliable usb3 card, I later found out.)
--edit 2-- if you meant also echoing s to the sysrq-trigger, it seemed to kill the session
That's because the default behaviour of CTRL + ALT + DEL is to issue a proper reboot.
That, and because X consumes those keys, making them useless when X goes bad (which is about all the times that Linux freezes and it's not a hardware fault).
It's possible that their computer is frozen enough to need this but it's more probable that this person hasn't figured out how to disable DontZap in newer versions of xorg. You need to disable it in your xorg.conf in order to have ctrl-alt-bksp work again:
That was one of my favorite Magic SysRq keys. It's the Linux analog to Windows's Ctrl-Alt-Del.
It's the "Secure Access Key" (SAK): You press that key and it kills all programs hooked to the TTY (incl. X, in your case) and displays a proper login prompt so that you can know what you're about to login to was run by the system and not a clever malware trying to steal your password.
Linux gets frozen, what do you do? I remember to press nine buttons on my keyboard, duh. and why? Because a hard reset will "make you a lot of problems?" I am not convinced, and I am not impressed.
This is a laughably awful shortcut for anything. Why make the supposedly safe stuff so out of reach? I would have to use a separate computer to look up how to type this command. Even if I knew the command off the top of my head, what if my keyboard lacks a PrtSc (SysRq) button?
When folks are pointing to Windows for design cues, the open source community ought to reconsider some aspects of how it has implemented certain features.
It isn't equivalent, but it is approximate; folks were mentioning ctrl-alt-dlt. Three keys instead of nine keys represents a 66% increase in efficiency.
I would've said it was approximate enough in the Windows 3.1 days when it could perform a soft reboot. But since Windows 2000, it has had completely different functionality (sometimes allowing you to log in, sometimes giving you a process manager oslt), not even remotely approximate to the magic sysrq sequence presented here. Do tell me if this has changed since Windows Vista.
Linux, by the way, can also respond to Ctrl-alt-del, and depending on the desktop it might perform functions similar to Windows.
Before this try Ctrl+Alt+F1. You might get a full-screen terminal where you can log in and kill the offending process. Ctrl+Alt+F7 to get back to the desktop.
I had to use this just today when ddd stole all other mouse and keyboard input.
That's because it's turned off by default on modern X.
You can enable them in the keyboard options of your DE, or directly with setxkbmap on startup (sorry can't look up the exact options right now) if you're not using one.
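For what it's worth, the XKB option I know of for this is terminate:ctrl_alt_bksp. A minimal sketch, assuming setxkbmap is installed and you run it from inside the X session (e.g. from ~/.xinitrc); the function name enable_zap is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: re-enable Ctrl+Alt+Backspace for the current X session.
# terminate:ctrl_alt_bksp is the XKB option name; enable_zap is just a label.
enable_zap() {
    # Only meaningful inside an X session with setxkbmap available.
    if [ -z "${DISPLAY:-}" ] || ! command -v setxkbmap >/dev/null 2>&1; then
        echo "no X session or setxkbmap not found; skipping" >&2
        return 1
    fi
    setxkbmap -option terminate:ctrl_alt_bksp
}
```

Call enable_zap once at session start. The setting only lasts for that session, so it has to be reapplied on every login.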
if your X is hung (ie, NOT "linux"), why not switch to a VC and fix it? yes, it's good to know about magic-sysrq, but most people don't even understand the layering of X on a VC, and the fact that other VCs are available.
of course, fixing why your X hung would be wise too, since for any normal distro and mainstream hardware, that's just not going to happen. if you really want to be fubar, have that f@cking POS systemd die on you... (yes, on-topic, since only sysrq saves the day.)
More often than not, it's your WM, DE or a GUI application eating up your RAM (Chromium used to be deadly for this) that's causing the freeze.
In more than a decade of running Linux on my desktop (and at work), I genuinely can't think of a single instance when I've not been able to pull up a virtual console from a frozen desktop (albeit often a laggy one).
Thanks for trying to diagnose my computer over a web forum, but I am competent enough to identify when and how my computer has failed. I'm not interested in anecdotes from users. I develop video drivers. Am I allowed to experience these lockups now?
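Once you're on that virtual console, something along these lines shows who is eating the RAM. It's only a sketch: it reads /proc directly so it doesn't depend on ps or top being usable, and top_mem is a name I made up.

```shell
#!/bin/sh
# Hypothetical helper: list the N processes with the largest resident set,
# one per line as "<rss in kB> <pid> <name>", largest first.
top_mem() {
    count="${1:-5}"
    for dir in /proc/[0-9]*; do
        # Kernel threads have no VmRSS line, and a process may exit mid-scan;
        # both cases simply produce no output.
        awk -v pid="${dir#/proc/}" '
            /^Name:/  { name = $2 }
            /^VmRSS:/ { rss = $2 }
            END       { if (rss != "") printf "%d %s %s\n", rss, pid, name }
        ' "$dir/status" 2>/dev/null
    done | sort -rn | head -n "$count"
}
```

Run top_mem 5, then kill -TERM the pid of the culprit (and only reach for kill -KILL if it ignores that) before switching back to the desktop.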
Firstly, I'm not just a Linux user. Like you, I'm a developer too. Given the demographic of this forum, it would pay for you not to assume that you're the only one on here that works in the industry (in fact I even hinted at that when I said I use Linux at work - but never mind)
Secondly, I was making a general comment about people's desktops rather than talking specifically about your example (given the lack of details you posted, it would be insane of me to assume I could diagnose your fault with any precision). My point was that generally when people think their computer has locked up / X has crashed, it's actually one of the items I mentioned earlier that's at fault.
The snappy reply was appreciated though </sarcasm>. But given just how unusual your circumstances are (assuming what you said is true) and how much you seem to hate it when others discuss these topics with you; it might be an idea if you clarify your position a little better the next time such a topic arises. Like maybe saying "my crashes aren't typical because I'm a kernel developer, but.....". This way people don't accidentally post something that hits one of those raw nerves you have and it saves us all from a lot of unnecessary condescension.
When I learned this trick, it was because an X input driver was locking up[0]. The virtual consoles were completely inaccessible. Occasionally it would show me a blank screen for my efforts, but more often not even SysRq would work.
[0] I'm actually not 100% sure about that. Thank God I don't have to worry about it anymore.
unused
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator,
(etc...)
What about when an SSH terminal gets locked up? If I'm running something and I want to abort it, I often find I can't because of the kernel thrashing. Often I see a nontrivial amount of CPU time given to kswapd but sometimes not, and more often I don't have any monitoring up to see what's going on, or the terminal I'm monitoring is also nonresponsive.
This is especially a problem with Docker, I've found, because when I've got a container thrashing the system like this, I can't access the host or any other containers.
Is this because I'm running single-core instances and the problem goes away once I have more than one thread available? Or should I be restricting workloads to a block device not used for caching? Does anyone else experience this?
Docker can set CPU, memory, and network-bandwidth quotas for containers. If you can guarantee that the aggregate total of each quota is less than the amount of that resource available to the host, then in the worst case you'll always have some of each left for a "rescue" container.
This throws away one of the neat advantages of containers compared to VMs, though, which is that you can overcommit (or "thin-provision", if you want to be charitable) the host's resource allocation, since it's extremely unlikely that all your services will balloon in resource-consumption simultaneously.
A less guaranteed, but more economical strategy, is to just set quotas on each container such that if it individually starts using more than, say, 80% of the host's resources for an extended period, it'll be terminated. This doesn't save you from a bad interaction between containers that makes everything explode in parallel (e.g. all your containers attempting to reconnect to a stuck service with no backoff) but it should save you in the majority of cases where each host has a heterogeneous container-load, and horizontal scale happens between hosts rather than internal to them.
(The dev-side solution to this problem, though, is to just set up your software architecture such that any host that gets into such a state can hard-reboot, without you losing anything of consequence. http://12factor.net -type containers excel at this, and it looks to be the strategy https://coreos.com embraces as well.)
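The "aggregate quota must stay under the host total" check is simple arithmetic. A rough sketch for memory only, in MiB; quota_fits and the numbers are made up for illustration, and the limits themselves would be whatever you pass to docker run via --memory (CPU has the analogous --cpus flag):

```shell
#!/bin/sh
# Hypothetical helper: succeed if the per-container memory limits plus a
# reserve kept back for a rescue shell fit within the host's memory.
# Usage: quota_fits <host MiB> <reserve MiB> <limit MiB>...
quota_fits() {
    host_mb="$1"
    reserve_mb="$2"
    shift 2
    total="$reserve_mb"
    for limit in "$@"; do
        total=$(( total + limit ))
    done
    [ "$total" -le "$host_mb" ]
}
```

For example, quota_fits 8192 512 2048 2048 1024 succeeds, while quota_fits 4096 512 2048 2048 fails because the limits plus the reserve exceed the host. An overcommitted setup is one where you knowingly let this check fail.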
Docker supports (supported?) what LXC supported, which was relative CPU shares. I am not sure it would help if it supported something else (like quota minimum/maximum). When the IO subsystem is hit with a memory intensive workload that thrashes swap, all your guarantees go out the window. From my perspective (staring at top while an interactive terminal is not responding), it's the kernel that owns the CPU shares being used, and the IO volume, and it looks like the container is "almost idle" because most of the CPU time is being spent by the kernel.
I don't know if that's something Docker can fix. It might be something kernel devs can fix, though.
The most common reason I've had to do this is that I've exhausted RAM and swap, and the system is disk thrashing for spare pages. In some setups, Linux does not like to OOM a process when it really should. (That's adjustable, by the way.)
Also, you don't need to go all the way through it. In my above example, Magic SysRq+RE or Magic SysRq+REI is usually enough. E and I send SIGTERM and SIGKILL respectively, so all the processes are now dead, and we have plenty of RAM. Sometimes it does take a bit of waiting: the system is disk thrashing. That said, you can then just resume, though you'll likely need to switch to a VT and restart X and other things. (But you get to keep your uptime.)
There's also F, which invokes the OOM killer, though I often forget about it.
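If the keyboard is out of reach but you still have a root shell, the same letters can be written to /proc/sysrq-trigger. A small sketch: sysrq_send and the SYSRQ_TRIGGER override are my own inventions so it can be dry-run against an ordinary file, and note that sending e or i this way will most likely kill the shell you typed it in.

```shell
#!/bin/sh
# Hypothetical helper: write one SysRq command letter to the trigger file.
# SYSRQ_TRIGGER is an illustrative override for dry runs; the real path is
# /proc/sysrq-trigger and writing to it needs root.
sysrq_send() {
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    case "$1" in
        [a-z0-9]) printf '%s\n' "$1" > "$trigger" ;;
        *) echo "usage: sysrq_send <single key>" >&2; return 2 ;;
    esac
}
```

sysrq_send f asks the kernel to run the OOM killer once, which is usually a gentler first step than terminating everything.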
What do you mean 'linux gets frozen'? I literally can't remember the last time a linux box froze on me. As others said, I would just reboot it now and let fsck pick up the pieces.
That's going to be faster than the time spent to search google for the magic incantation.
This would be where keyboards with hardware-level macro support (like the Kinesis Advantage) come to be useful.
... to be honest, I can't think of another scenario. I have an Advantage for every place I spend extended computer time in -- it's too big and clunky to carry around...
The only time my linux machine has ever locked up was when the usb driver seemed to crash, in which case I couldn't do this anyway. The only other way I could have possibly restarted was to ssh to my machine from my phone, but I opted for the power button instead.
Hmm, I wonder if something like this extends to mac. Sometimes the login manager crashes and I can't do anything. Music still playing, but no mouse or keyboard reaction. I can ssh in, but I don't know what to run.
I learnt about this a few months back... and never had to use it. Seriously, on my AMD APU netbook running Mint, I've had no freezes whatsoever, despite the hardware being a bit odd. A neat trick though.
Can't get it to work on Fedora 20. Are you supposed to press and hold Ctrl, Alt, PrtSc(SysRq), r, e, i, s, u, and b all together? Does it have to be in that order?
Switch to a terminal (Ctrl+Alt+F2) and do AltGr+PrtScr+H - if you've got Magic SysRq enabled that will print the help information (a list of commands with their (k)eycodes) to the terminal. If it doesn't print anything then you need to RTFM to find out how to enable it. AFAIR I had to do this shortly after I switched to Kubuntu as it was disabled by default - I often had to use Ctrl+Alt+Backspace or AltGr+SysRq+K to kill X as there's a hardware bug on my Nvidia graphics card.
My 8yo has a mnemonic for it something about Elephants and Umbrellas but I've always done reissub (now with extra sync'ing power). For a time the process for him was login, Alt+F2, konsole, Ctrl+R, mine, Enter, AltGr+SysRq+R, E, I, S, U, B; then repeat the first part and you're finally ready to play Minecraft!
If AltGr is right Alt, it does print a bunch of stuff. On my system, SysRq is under PrtSc. Does that mean I have to do Fn+PrtSc to get SysRq or just PrtSc is enough.
If it prints out without pressing Fn then you're doing it right. Laptop keyboards are weird but usually it's just the PrtScr key as it acts as SysRq when using the Alt[Gr] modifier.
Control isn't necessary - I'm not sure why it would be specified. On a laptop you might have to hold the Fn key to hit PrtScr, but aside from that it's two fingers to hit 'Right Alt' and 'PrtScr' and the other hand can mash r,e,i,s,u,b.
In my workplace I have a Dell Keyboard that has the SysRq/Scroll Lock/Pause Keys on the top right, above the numpad. Holding Alt+SysRq with one hand is just impossible.
Yeah, with a typical ANSI layout it's a small stretch but definitely doable.
This is the layout that I have (and the only keyboard layout I'll ever buy, because every other throws the pipe key and backslash in a random spot along with randomly sizing the enter key)
Let it go, let it go!
Can't hold it back any more.
Let it go, let it go!
Turn away and slam the door.
I don't care what they're going to say.
Let the storm rage on.
The server never bothered me anyway.
Certain Magic SysRq sequences are now disabled by default in Ubuntu. You can re-enable them by editing /etc/sysctl.d/10-magic-sysrq.conf. Alternately, you can enable them temporarily by
# echo 1 > /proc/sys/kernel/sysrq
(Or substitute "1" by the number described in magic-sysrq.conf, above)
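That number is a bitmask, so it helps to see what a given value actually allows. A sketch of a decoder, with the bit meanings taken from the kernel's sysrq documentation as I remember it (worth double-checking against your kernel version); sysrq_decode is a made-up name:

```shell
#!/bin/sh
# Hypothetical helper: print the SysRq function groups enabled by a
# kernel.sysrq value. 0 disables everything, 1 enables everything,
# anything else is treated as a bitmask.
sysrq_decode() {
    mask="$1"
    if [ "$mask" -eq 0 ]; then echo "disabled"; return 0; fi
    if [ "$mask" -eq 1 ]; then echo "all functions enabled"; return 0; fi
    for entry in \
        "2:console log level" \
        "4:keyboard control (SAK, unraw)" \
        "8:process debugging dumps" \
        "16:sync" \
        "32:remount read-only" \
        "64:signal processes (term, kill, oom-kill)" \
        "128:reboot/poweroff" \
        "256:nice all RT tasks"
    do
        bit="${entry%%:*}"
        name="${entry#*:}"
        if [ $(( mask & bit )) -ne 0 ]; then echo "$name"; fi
    done
}
```

Try sysrq_decode "$(cat /proc/sys/kernel/sysrq)" on your own box. A value of 176, for instance, decodes to sync, remount read-only and reboot/poweroff, which means S, U and B work from the keyboard but R, E and I do not.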