Quoting (and splitting) is basic but eludes a lot of people including developers. I'd guess around one quarter to one third of all questions / issues raised with the command line tools I've maintained relate to quoting and so don't actually have anything to do with the tool in question.
I think it's important for anyone — developers especially — using a shell to understand basic shell principles.
Shell is a REPL that has some special filesystem-related helpers. Looking through $PATH for any +x file that matches the name of the first word entered. Performing glob expansion (*, ?, {foo,bar}...) on all but the first word. Backticks. Environment variable scoping (difference between `export FOO=bar` and `FOO=bar ./my_program`). Presence of "." and ".." in the FS. And so forth.
Once someone wraps their head around the execution model, they should be golden.
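To make the export distinction from the list above concrete, a quick demo (bash; FOO is just a throwaway name):

$ FOO=bar sh -c 'echo "${FOO:-unset}"'    # one-shot assignment: the child sees it...
bar
$ echo "${FOO:-unset}"                    # ...but the current shell does not
unset
$ export FOO=bar                          # export: the current shell and every child from now on
$ sh -c 'echo "${FOO:-unset}"'
bar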
If we're talking about bash, I think it's a mistake to refer to any of this as "glob expansion", a term which doesn't appear in the bash man page. Rather, it differentiates between multiple types of expansion and the order in which they take place. Quote:
The order of expansions is: brace expansion, tilde expansion, parameter, variable and arithmetic expansion and command substitution (done in a left-to-right fashion), word splitting, and pathname expansion.
On systems that can support it, there is an additional expansion available: process substitution.
Furthermore, expansion absolutely does work on the first word, and it's not obvious to me why someone might think it doesn't. The simplest example is probably tilde expansion, i.e. the ability to execute ~/bin/foo or ~someone/bin/bar. But the other types work too, e.g. brace expansion:
$ ls l1
ls: l1: No such file or directory
$ l{s,1}
ls: l1: No such file or directory
$ touch l1
$ l{s,1}
l1
Command substitution:
$ $(echo l)s -l l1
-rw-r--r-- 1 user wheel 0 10 Dec 14:12 l1
Arithmetic expansion:
$ perl$((4+1)).$((36/2)) -E 'say "foo"'
foo
Even pathname expansion - here, `fi*` expands to `find` from $CWD, which is then found in $PATH and executed:
$ ls -l
$ touch find foo bar quux
$ fi* . -name f\*
./foo
./find
> I think it's a mistake to refer to any of this as "glob expansion"
Since we are being pedantic regarding the terms, while I can't really find "glob expansion" in the Bash manual either, it documents a bunch of "glob" settings: "noglob", "GLOBIGNORE", "dotglob", "extglob"...[1]
FWIW, I expected to find these in the Bash info page on Ubuntu, but curiously, "info bash" does not bring up the info page (just the fallback manpage), while "info find" does work. Whatever happened to (tex)info pages?
The * character is addressed in pattern matching rather than parameter expansion.
From the bash manpage (version 5.0):
The special pattern characters have the following meanings:
* Matches any string, including the null string. When the globstar shell option is enabled, and * is used in a pathname expansion context, two adjacent *s used as a single pattern will match all files and zero or more directories and subdirectories. If followed by a /, two adjacent *s will match only directories and subdirectories.
The GNU bash manual describes the '-f' option as "Disable filename expansion (globbing)."
"Globbing" is how * expansion is generally referenced in other documents, if not often within the bash manpage or info pages themselves.
> ...expansion absolutely does work on the first word
As was already pointed out, you are right. It does not work for matching executables in $PATH, which is what I was, for some reason, trying (badly) to communicate.
For me, it is a mess no matter how you look at it.
It is because there are so many layers. The way commands are parsed is already not that easy, but then the command itself has its own rules, and it is not uncommon to pass a script as an argument.
For example, if you type "sh -c ls", first it is parsed into exec("sh", "-c", "ls"), then "sh" parses its arguments "-c" and "ls", then a sub-shell is started that parses the command "ls", and only then is the actual "ls" program launched.
This is a simple example, but at every step there are quirks: first all the shell parsing issues, with globs, interpolation and all that; then how the command parses its arguments (be careful about "-"); then back to the shell parser for the sub-command; and "ls" itself has its quirks, like locales for instance.
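A minimal way to see the two parsing passes, using printf to bracket each argument the final program receives:

$ sh -c 'printf "[%s]\n" "a b"'    # the inner quotes survive the outer parse: one argument
[a b]
$ sh -c 'printf "[%s]\n" a b'      # no inner quotes: the sub-shell splits into two arguments
[a]
[b]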
I have done Perl (line noise), C++ (memory unsafe with too many features), PHP (injections everywhere), JS (with IE6 support) and none of these languages make me more uncomfortable than UNIX shells.
I think one reason for this is that quoting and wildcard expansion work very differently across platforms. On Windows it's the tool's responsibility to do wildcard expansion and parse quotes, but on UNIX the shell usually does wildcard expansion and (AFAIK) also removes quotes before invoking the tool. This is especially "fun" in Python scripts which invoke command line tools and must work across UNIX-oids and Windows, or when trying to run UNIX command line tools on the Windows command line.
On Windows, even splitting arguments is done by the application: WinMain, for example, gets a single string, and that is not something glued together by code below WinMain but simply how processes work on Windows (CreateProcess and the other process-creation APIs likewise take a single command line). SSHv2 has a similar quirk, where the command line for an exec request is just one string and gets executed via <user's shell> -c <string goes here>, which makes the exact semantics dependent on the shell set for the user. This might have been an attempt at lowest-common-denominator engineering for Windows.
Overall Windows command line is and always was broken and unsound due to being based on the flawed DOS batch model.
The article manages to make this point in the most hopelessly long-winded way possible, going mile after mile on "guess what's wrong! did you guess yet?" and fails to fully cover a nuanced subject in the process.
Informative writing gets to the point instead of playing gotcha games.
This is why it's generally a bad idea to allow non-quoted values if the quoting is not necessary. It sounds great - people will have to type less! - but in practice it just means they don't know exactly when they have to quote and they make mistakes.
Thankfully most languages didn't make this mistake but some did: CMake, YAML, Bash, TCL. All notoriously awful.
It's a fair point. I think the best solution would be for shells to have two modes - interactive and batch mode. Batch mode would require strict quoting.
Alternatively it could let you type without quotes and then automatically quote it for you (e.g. you press shift-enter and it auto-quotes it).
It's fine for interactive use because you can debug errors when they happen. The issue is with scripts.
Sometimes you need to deliberately not quote, though. For example, an output of a program might be a list of files, which I then might want to operate on using "for filename in $files". If I do this then I must have a guarantee that the files do not have spaces in them, but this is usually not a particularly onerous requirement for a closed system that doesn't have untrusted filename inputs. Splitting data using text separated by whitespace is part of the essence of Unix (see TAOUP). If you want to ban this kind of thing too, then you might as well not permit shell scripting at all.
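A sketch of that pattern (the '*.log' name is just an example), safe only under the no-whitespace guarantee:

files=$(find . -name '*.log')    # newline-separated output from a program
for f in $files; do              # deliberately unquoted: relies on word splitting
    gzip -- "$f"                 # quote here; the splitting already happened
done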
The problem with “works most of the time” is that it stops working when you’re not around. So it might be fine for a one-off shell command, but you never know what the next guy will pass into it.
The biggest problem I see with people dealing with Tcl is that they carry over the assumption that {} delineates scope (or something else lexical). Of course, this is wildly untrue in Tcl, as {} is simply a "quote" operator.
This stuffs people up horribly until they unlearn the idea that {} has semantic meaning.
But I do not see an advantage either way. I think you are attributing to quoting issues that are actually caused by type-punning. The quoting rules of Tcl are simple. I mostly encountered problems when people do things like
# x is not actually a list, but may appear that way if a and b are nice
set x "$a $b"
# Proper way:
set x [list $a $b]
But I fail to see how quoting could address or alleviate this.
> Thankfully most languages didn't make this mistake but some did: CMake, YAML, Bash, TCL.
As well as Perl and PHP, though both have since recanted (Perl started 20 years ago with strict mode, which IIRC forbids barewords in most contexts; and PHP deprecated barewords in 7.2 and removed them in 8).
Off the top of my head, the only places strict Perl allows a bareword are the left-hand side of the => fat comma operator and pre-declared subroutines (IIRC; I never could quite remember the rules for that).
IMHO this is a design error in find. Tools should not heavily depend on characters which are also shell metacharacters, particularly in ways which are similar enough to the way the shell uses them to cause confusion when they collide.
I'd argue that this is a design error in glob expansion. It's highly unintuitive that globs are passed "raw" when nothing matches and it's perfectly expected that users will be confused about how globs behave if their behavior depends entirely on the contents of the file system. Human brains simply don't work that way and it's terrible user experience to expect users to be able to predict whether the glob will match before executing it.
A more intuitive solution would be to resolve globs to nothing (the empty string) if they don't match. In most users' mental models "*.jpg" means "any file ending with .jpg" when they pass it to a command like "rm" or "mv", but in reality it means "any file ending with .jpg in the current directory, or the literal string '*.jpg' if there are none" which is nonsensical. If it expanded to nothing, it would be obvious that it needs escaping when passed to commands like find because find would complain about not being passed a file name. The failure case is outright user-hostile and error-prone.
Also if every tool used its own metacharacters, users would have to re-learn them for every tool. It's common for config files to also use globs, for example. Even if we just had one set of characters for shells and one set for other tools, it's often difficult to draw the line, especially if tools use shells and thus would need to translate from one set to another. If this is a design error in find, it's a design error in all tooling that isn't a shell and we need to throw it all out and start from scratch.
"A more intuitive solution would be to resolve globs to nothing (the empty string) if they don't match."
I agree, so I use shopt -s nullglob in both bash scripts and interactive shells. It retrained my interactive habits with find very quickly because unquoted wildcards no longer work in the common case where there's no match in the current directory.
(I like dotglob, extglob, globstar and pipefail everywhere too: saner defaults than emulating traditional shell behaviours.)
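The before/after in an empty directory:

$ echo *.jpg             # default bash: the unmatched glob is passed through literally
*.jpg
$ shopt -s nullglob
$ echo *.jpg             # nullglob: it expands to nothing, so echo prints a blank line

$ find . -name *.jpg     # and the find mistake surfaces immediately instead of silently misbehaving
find: missing argument to `-name'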
There are tools that use their own metacharacters (mmv(1) comes to mind).
The issue with find here is that globs act consistently, but the result depends on when the evaluation is performed.
find(1) treats * just as the shell does, but only if the shell hasn't got to the * first. So to prevent the shell from pattern-matching any * characters, you've got to quote or escape them using single quotes (') or backslashes (\).
Quoting to prevent interpretation of metacharacters is actually commonly encountered in computer contexts. I'm having to quote instances of * but not \ for example in this comment to HN as both have semantic meaning to the content parser. Note that the escape character '\' is only significant when used immediately preceding a *.
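Concretely, any of these deliver a literal * to find (pathname expansion doesn't happen inside either kind of quote):

$ find . -name '*.txt'    # single quotes
$ find . -name "*.txt"    # double quotes
$ find . -name \*.txt     # backslash escape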
I don't think it's that easy. The syntax of `find` is weird enough as it is, with flags that must be passed in the correct order and so many options to choose from. On top of all that, one would now need to learn a new tool-specific regex or globbing system. I don't see that as any easier than remembering to put quotes around the search term.
I don't disagree with your central point that this problem can be fixed with better design. But the ergonomics are a hard problem to solve.
On top of shell glob expansion nuances, I think it's important to note that I hope nobody runs a destructive action before doing a "dry run" first (unless they trust their backups and/or git history).
IOW, I hope all your find deletes are preceded with a run without the -delete flag. If you get into the habit of doing that and then doing any destructive action using a separate process (piping into xargs), you'll be spared from any hair pulling.
My preference is for any complicated shell magic to first do a run which performs `echo` of whatever operation I was doing on a file (e.g. `find | xargs echo sed -i 's/foo/bar/'`), but there are other ways to achieve the same thing.
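Put together, the two-step habit might look like this (with the caveat, raised below, that the echo preview mangles names that need quoting):

$ find . -name '*.orig'                    # dry run: just list the matches
$ find . -name '*.orig' | xargs echo rm    # preview roughly what would run
$ find . -name '*.orig' -delete            # only now the destructive pass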
The echo nukes your quotes so you can't run the precise output of the echo, but yeah, if I'm doing weird shit with variables in a command I do that.
Another tip for anyone who runs history commands like `!!` or `rm !$`: the :p modifier just prints what the shell would have executed and adds the full command to your history, so you can hit up-arrow and then enter.
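For instance (the previous command here is made up):

$ rm -rf ./build
$ !!:p             # printed, not executed, and pushed onto history
rm -rf ./build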
In case it wasn't clear: I'd do the echo run first to validate roughly what commands I would be running (minus some escaping gotchas that you raise), and then another run without the echo.
The echo run is mostly meant as a very quick trick, but it fails in a large number of cases (even something as simple as piping that command through another).
It's a little bit longer, but bash/zsh support `printf '%q ' <args...>`, which produces output where any characters that may be interpreted by the shell will be escaped.
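For example:

$ printf '%q ' rm -- 'my file' '*.jpg'; echo
rm -- my\ file \*.jpg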
Keep your audio muted too, maybe even make a schedule for it to automute when your wife sleeps :)
Don't surf from bed; get up, do something, and go back to bed when you are ready to sleep. (Nope, I am not your wife, even if I sound like her :)
Now that I've fixed your problems, how do I get back to sleep after I wake up or get woken up at 3-4am? :)
There's no autoplay audio for me. I just tried on Firefox for Windows (desktop) and Chrome on Android (mobile, obvs). Did you click on the video by accident maybe?
I have this shortcut for such cases. An important difference from your command is that I'm still using glob, not regex. You can use `-regex` if needed.
I was being serious. Most often you get to choose, or at least to know, the names of your files. In those cases find|grep is very useful. You do not need your shell commands to always work in the fully general case.
Glad to see fd is suggested in the first comment. Unless you're writing a script to run elsewhere I don't really see a compelling reason to use find on your own machine these days. fd just feels sensible.
This has nothing to do with find, but with how shell glob expansion works: if unmatched, the argument is passed as-is; if it matches something, it expands right away. It is made worse by the way find works, but this is shell-101 (just like knowing not to do `rm -rf $FOO/`).
FWIW, just seeing the title, this is what I expected to read.
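(For anyone who hasn't been bitten by the `rm -rf $FOO/` variant, prepending echo shows what an unset or empty FOO turns it into:)

$ FOO=
$ echo rm -rf $FOO/    # the empty expansion collapses the path to /
rm -rf /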
DOS got a lot of things wrong about command line parsing, but pushing wildcard expansion to the applications themselves prevented this whole class of errors.
And it even allowed things like 'RENAME *.txt *.bak'. One of the most surprising things when starting to use Linux was that there was no simple way, using the default utilities, to accomplish the same thing...
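(The usual shell workaround is a loop, e.g. in bash:)

$ for f in *.txt; do mv -- "$f" "${f%.txt}.bak"; done    # strip .txt, append .bak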
The DOS approach means that glob expansion must be consistently implemented by each and every utility, and since that is impossible, it is inconsistent across tools.
Unix/Linux takes the view that the shell is responsible for interpreting parameter expansion, and that once expanded, those parameters are then handed off to the invoked command. Yes, there are edge-cases, but they are the same edge cases for each command. There's only one set of knowledge to internalise.
(I began that internalisation some 35 years ago and am tremendously grateful that much of the knowledge has been cumulative rather than substituting for earlier methods. The areas where I consistently have the greatest challenges are those in which behaviours or methods have changed, in cases multiple times, over the years and/or decades, especially for tools which are seldom used.)
The bash shell isn't fully consistent, I'll grant. But it's also what I'm using much of the day, every day for interactions, and even its inconsistencies become hardwired over time.
I think on balance that's one of the things DOS got wrong. I find the Unix way of doing arguments much nicer. Part of that is consistency: that rename command may be convenient, but it's pretty strange and counterintuitive that it treats the wildcard differently from most other commands, and the same convenience could have been achieved with another syntax.
This is one thing the C-shell (which was my shell for decades) got right.
% echo no*such*file
echo: No match.
Whereas in bash you can get used to this behaviour until it bites you. Luckily, my tcsh usage has already given me the "finger memory" to use the quotes.
Good explanation, but he didn't say how to quote: single, double, double quotes surrounded by single, etc. Quoting is quite confusing and rarely taught; you just kind of learn it by accident and trial and error.
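For the record, the bash short version: single quotes make everything literal, double quotes still allow $, ` and \ to do their thing, and each kind of quote is just an ordinary character inside the other. A quick cheat sheet (the /home/user path is illustrative):

$ echo '$HOME'       # single quotes: fully literal
$HOME
$ echo "$HOME"       # double quotes: expansion happens, but no word splitting of the result
/home/user
$ echo "'$HOME'"     # single quotes inside double quotes are just characters
'/home/user'
$ echo '"$HOME"'     # and vice versa
"$HOME"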