Quoting (and splitting) is basic but eludes a lot of people including developers. I'd guess around one quarter to one third of all questions / issues raised with the command line tools I've maintained relate to quoting and so don't actually have anything to do with the tool in question.
I think it's important for anyone — developers especially — using a shell to understand basic shell principles.
Shell is a REPL that has some special filesystem-related helpers. Looking through $PATH for any +x file that matches the name of the first word entered. Performing glob expansion (*, ?, {foo,bar}...) on all but the first word. Backticks. Environment variable scoping (difference between `export FOO=bar` and `FOO=bar ./my_program`). Presence of "." and ".." in the FS. And so forth.
Once someone wraps their head around the execution model, they should be golden.
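To make the export distinction from the list above concrete, a quick demo (bash; FOO is just a throwaway name):

$ FOO=bar sh -c 'echo "${FOO:-unset}"'    # one-shot assignment: the child sees it...
bar
$ echo "${FOO:-unset}"                    # ...but the current shell does not
unset
$ export FOO=bar                          # export: the current shell and every child from now on
$ sh -c 'echo "${FOO:-unset}"'
bar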
If we're talking about bash, I think it's a mistake to refer to any of this as "glob expansion", a term which doesn't appear in the bash man page. Rather, it differentiates between multiple types of expansion and the order in which they take place. Quote:
The order of expansions is: brace expansion, tilde expansion, parameter, variable and arithmetic expansion and command substitution (done in a left-to-right fashion), word splitting, and pathname expansion.
On systems that can support it, there is an additional expansion available: process substitution.
Furthermore, expansion absolutely does work on the first word, and it's not obvious to me why someone might think it doesn't. The simplest example is probably tilde expansion, i.e. the ability to execute ~/bin/foo or ~someone/bin/bar. But the other types work too, e.g. brace expansion:
$ ls l1
ls: l1: No such file or directory
$ l{s,1}
ls: l1: No such file or directory
$ touch l1
$ l{s,1}
l1
Command substitution:
$ $(echo l)s -l l1
-rw-r--r-- 1 user wheel 0 10 Dec 14:12 l1
Arithmetic expansion:
$ perl$((4+1)).$((36/2)) -E 'say "foo"'
foo
Even pathname expansion - here, `fi*` expands to `find` from $CWD, which is then found in $PATH and executed:
$ ls -l
$ touch find foo bar quux
$ fi* . -name f\*
./foo
./find
> I think it's a mistake to refer to any of this as "glob expansion"
Since we are being pedantic regarding the terms, while I can't really find "glob expansion" in the Bash manual either, it documents a bunch of "glob" settings: "noglob", "GLOBIGNORE", "dotglob", "extglob"...[1]
FWIW, I expected to find these in the Bash info page on Ubuntu, but curiously, "info bash" does not bring up the info page (just the fallback manpage), while "info find" does work. Whatever happened to (tex)info pages?
The * character is addressed in pattern matching rather than parameter expansion.
From the bash manpage (version 5.0):
The special pattern characters have the following meanings:
* Matches any string, including the null string. When the globstar shell option is enabled, and * is used in a pathname expansion context, two adjacent *s used as a single pattern will match all files and zero or more directories and subdirectories. If followed by a /, two adjacent *s will match only directories and subdirectories.
The GNU bash manual describes the '-f' option as "Disable filename expansion (globbing)."
"Globbing" is how * expansion is generally referenced in other documents, if not often within the bash manpage or info pages themselves.
> ...expansion absolutely does work on the first word
As was already pointed out, you are right. It does not work for matching executables in $PATH, which is what I was, for some reason, trying (badly) to communicate.
For me, it is a mess no matter how you look at it.
It is because there are so many layers. The way commands are parsed is already not that easy, but then the command itself has its own rules, and it is not uncommon to pass a script as an argument.
For example, if you type "sh -c ls", first it is parsed into exec("sh", "-c", "ls"), then "sh" parses its arguments "-c" and "ls", then a sub-shell is started that parses the command "ls", and only then is the actual "ls" program launched.
This is a simple example, but at every step there are quirks: first all the shell parsing issues, with globs, interpolation and all that; then how the command parses its arguments (be careful about "-"); then back to the shell parser for the sub-command; and "ls" itself has its quirks, like locales for instance.
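A minimal way to see the two parsing passes, using printf to bracket each argument the final program receives:

$ sh -c 'printf "[%s]\n" "a b"'    # the inner quotes survive the outer parse: one argument
[a b]
$ sh -c 'printf "[%s]\n" a b'      # no inner quotes: the sub-shell splits into two arguments
[a]
[b]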
I have done Perl (line noise), C++ (memory unsafe with too many features), PHP (injections everywhere), JS (with IE6 support) and none of these languages make me more uncomfortable than UNIX shells.
I think one reason for this is that quoting and wildcard expansion work very differently across platforms. On Windows it's the tool's responsibility to do wildcard expansion and parse quotes, but on UNIX the shell usually does wildcard expansion and (AFAIK) also removes quotes before invoking the tool. This is especially "fun" in Python scripts which invoke command line tools and must work across UNIX-oids and Windows, or when trying to run UNIX command line tools on the Windows command line.
On Windows, even splitting arguments is done by the application: WinMain, for example, gets a single string, and that is not something glued together by code below WinMain but simply how processes work on Windows (CreateProcess and the other process-creation APIs likewise take a single command line). SSHv2 has a similar quirk, where the command line for an exec request is just one string and gets executed via <user's shell> -c <string goes here>, which makes the exact semantics dependent on the shell set for the user. This might have been an attempt at lowest-common-denominator engineering for Windows.
Overall Windows command line is and always was broken and unsound due to being based on the flawed DOS batch model.
The article manages to make this point in the most hopelessly long-winded way possible, going mile after mile on "guess what's wrong! did you guess yet?" and fails to fully cover a nuanced subject in the process.
Informative writing gets to the point instead of playing gotcha games.
This is why it's generally a bad idea to allow non-quoted values if the quoting is not necessary. It sounds great - people will have to type less! - but in practice it just means they don't know exactly when they have to quote and they make mistakes.
Thankfully most languages didn't make this mistake but some did: CMake, YAML, Bash, TCL. All notoriously awful.
It's a fair point. I think the best solution would be for shells to have two modes - interactive and batch mode. Batch mode would require strict quoting.
Alternatively it could let you type without quotes and then automatically quote it for you (e.g. you press shift-enter and it auto-quotes it).
It's fine for interactive use because you can debug errors when they happen. The issue is with scripts.
Sometimes you need to deliberately not quote, though. For example, an output of a program might be a list of files, which I then might want to operate on using "for filename in $files". If I do this then I must have a guarantee that the files do not have spaces in them, but this is usually not a particularly onerous requirement for a closed system that doesn't have untrusted filename inputs. Splitting data using text separated by whitespace is part of the essence of Unix (see TAOUP). If you want to ban this kind of thing too, then you might as well not permit shell scripting at all.
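A sketch of that pattern (the '*.log' name is just an example), safe only under the no-whitespace guarantee:

files=$(find . -name '*.log')    # newline-separated output from a program
for f in $files; do              # deliberately unquoted: relies on word splitting
    gzip -- "$f"                 # quote here; the splitting already happened
done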
The problem with “works most of the time” is that it stops working when you’re not around. So it might be fine for a one-off shell command, but you never know what the next guy will pass into it.
The biggest problem I see with people dealing with Tcl is that they carry over the assumption that {} delineates scope (or something else lexical). Of course, this is wildly untrue in Tcl, as {} is simply a "quote" operator.
This stuffs people up horribly until they unlearn the idea that {} has semantic meaning.
But I do not see an advantage either way. I think you are attributing to quoting issues that are actually caused by type-punning. The quoting rules of Tcl are simple. I mostly encountered problems when people do things like
# x is not actually a list, but may appear that way if a and b are nice
set x "$a $b"
# Proper way:
set x [list $a $b]
But I fail to see how quoting could address or alleviate this.
> Thankfully most languages didn't make this mistake but some did: CMake, YAML, Bash, TCL.
As well as Perl and PHP, though both have since recanted (Perl started 20 years ago with strict mode, which IIRC forbids barewords in most contexts; and PHP deprecated barewords in 7.2 and removed them in 8).
Off the top of my head, the only places strict Perl allows a bareword are the left-hand side of the => fat comma operator and pre-declared subroutines (IIRC; I never could quite remember the rules for that).
IMHO this is a design error in find. Tools should not heavily depend on characters which are also shell metacharacters, particularly in ways which are similar enough to the way the shell uses them to cause confusion when they collide.
I'd argue that this is a design error in glob expansion. It's highly unintuitive that globs are passed "raw" when nothing matches and it's perfectly expected that users will be confused about how globs behave if their behavior depends entirely on the contents of the file system. Human brains simply don't work that way and it's terrible user experience to expect users to be able to predict whether the glob will match before executing it.
A more intuitive solution would be to resolve globs to nothing (the empty string) if they don't match. In most users' mental models "*.jpg" means "any file ending with .jpg" when they pass it to a command like "rm" or "mv", but in reality it means "any file ending with .jpg in the current directory, or the literal string '*.jpg' if there are none" which is nonsensical. If it expanded to nothing, it would be obvious that it needs escaping when passed to commands like find because find would complain about not being passed a file name. The failure case is outright user-hostile and error-prone.
Also if every tool used its own metacharacters, users would have to re-learn them for every tool. It's common for config files to also use globs, for example. Even if we just had one set of characters for shells and one set for other tools, it's often difficult to draw the line, especially if tools use shells and thus would need to translate from one set to another. If this is a design error in find, it's a design error in all tooling that isn't a shell and we need to throw it all out and start from scratch.
"A more intuitive solution would be to resolve globs to nothing (the empty string) if they don't match."
I agree, so I use shopt -s nullglob in both bash scripts and interactive shells. It retrained my interactive habits with find very quickly because unquoted wildcards no longer work in the common case where there's no match in the current directory.
(I like dotglob, extglob, globstar and pipefail everywhere too: saner defaults than emulating traditional shell behaviours.)
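The before/after in an empty directory:

$ echo *.jpg             # default bash: the unmatched glob is passed through literally
*.jpg
$ shopt -s nullglob
$ echo *.jpg             # nullglob: it expands to nothing, so echo prints a blank line

$ find . -name *.jpg     # and the find mistake surfaces immediately instead of silently misbehaving
find: missing argument to `-name'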
There are tools that use their own metacharacters (mmv(1) comes to mind).
The issue with find here is that globs act consistently, but the result depends on when the evaluation is performed.
find(1) treats * just as the shell does, but only if the shell hasn't got to the * first. So to prevent the shell from pattern-matching any * characters, you've got to quote or escape them using single quotes (') or backslashes (\).
Quoting to prevent interpretation of metacharacters is actually commonly encountered in computer contexts. I'm having to quote instances of * but not \ for example in this comment to HN as both have semantic meaning to the content parser. Note that the escape character '\' is only significant when used immediately preceding a *.
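Concretely, any of these deliver a literal * to find (pathname expansion doesn't happen inside either kind of quote):

$ find . -name '*.txt'    # single quotes
$ find . -name "*.txt"    # double quotes
$ find . -name \*.txt     # backslash escape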
I don't think it's that easy. The syntax of `find` is weird enough as it is, with flags that must be passed in the correct order and so many options to choose from. On top of all that, one would now need to learn a new tool-specific regex or globbing system. I don't see that as any easier than remembering to put quotes around the search term.
I don't disagree with your central point that this problem can be fixed with better design. But the ergonomics are a hard problem to solve.
On top of shell glob expansion nuances, I think it's important to note that I hope nobody runs a destructive action before doing a "dry run" first (unless they trust their backups and/or git history).
IOW, I hope all your find deletes are preceded with a run without the -delete flag. If you get into the habit of doing that and then doing any destructive action using a separate process (piping into xargs), you'll be spared from any hair pulling.
My preference is for any complicated shell magic to first do a run which performs `echo` of whatever operation I was doing on a file (e.g. `find | xargs echo sed -i 's/foo/bar/'`), but there are other ways to achieve the same thing.
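Put together, the two-step habit might look like this (with the caveat, raised below, that the echo preview mangles names that need quoting):

$ find . -name '*.orig'                    # dry run: just list the matches
$ find . -name '*.orig' | xargs echo rm    # preview roughly what would run
$ find . -name '*.orig' -delete            # only now the destructive pass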
The echo nukes your quotes so you can't run the precise output of the echo, but yeah, if I'm doing weird shit with variables in a command I do that.
Another tip for anyone who runs history commands like `!!` or `rm !$`: the :p modifier just prints what the shell would have executed and adds the full command to your history, so you can hit up-arrow and then enter.
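For instance (the previous command here is made up):

$ rm -rf ./build
$ !!:p             # printed, not executed, and pushed onto history
rm -rf ./build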
In case it wasn't clear: I'd do the echo run first to validate roughly what commands I would be running (minus some escaping gotchas that you raise), and then another run without the echo.
The echo run is mostly meant as a very quick trick, but it fails in a large number of cases (even something as simple as piping that command through another).
It's a little bit longer, but bash/zsh support `printf '%q ' <args...>`, which produces output where any characters that may be interpreted by the shell will be escaped.
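For example:

$ printf '%q ' rm -- 'my file' '*.jpg'; echo
rm -- my\ file \*.jpg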
Keep your audio muted too, maybe even make a schedule for it to automute when your wife sleeps :)
Don't surf from bed; get up, do something, and go back to bed when you are ready to sleep. (Nope, I am not your wife, even if I sound like her :)
Now that I've fixed your problems, how do I get back to sleep after I wake up or get woken up at 3-4am? :)
There's no autoplay audio for me. I just tried on Firefox for Windows (desktop) and Chrome on Android (mobile, obvs). Did you click on the video by accident maybe?
I have this shortcut for such cases. An important difference from your command is that I'm still using glob, not regex. You can use `-regex` if needed.
I was being serious. Most often you get to choose, or at least to know, the names of your files. In those cases find|grep is very useful. You do not need your shell commands to always work in the fully general case.
Glad to see fd is suggested in the first comment. Unless you're writing a script to run elsewhere I don't really see a compelling reason to use find on your own machine these days. fd just feels sensible.
This has nothing to do with find, but with how shell glob expansion works: if unmatched, the argument is passed as-is; if it matches something, it expands right away. It is made worse by the way find works, but this is shell-101 (just like knowing not to do `rm -rf $FOO/`).
FWIW, just seeing the title, this is what I expected to read.
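(For anyone who hasn't been bitten by the `rm -rf $FOO/` variant, prepending echo shows what an unset or empty FOO turns it into:)

$ FOO=
$ echo rm -rf $FOO/    # the empty expansion collapses the path to /
rm -rf /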
DOS got a lot of things wrong about command line parsing, but pushing wildcard expansion to the applications themselves prevented this whole class of errors.
And it even allowed things like 'RENAME *.txt *.bak'. One of the most surprising things when starting to use Linux was that there was no simple way, using the default utilities, to accomplish the same thing...
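(The usual shell workaround is a loop, e.g. in bash:)

$ for f in *.txt; do mv -- "$f" "${f%.txt}.bak"; done    # strip .txt, append .bak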
The DOS approach means that glob expansion must be consistently implemented by each and every utility, and since that is impossible, it is inconsistent across tools.
Unix/Linux takes the view that the shell is responsible for interpreting parameter expansion, and that once expanded, those parameters are then handed off to the invoked command. Yes, there are edge-cases, but they are the same edge cases for each command. There's only one set of knowledge to internalise.
(I began that internalisation some 35 years ago and am tremendously grateful that much of the knowledge has been cumulative rather than substituting for earlier methods. The areas where I consistently have the greatest challenges are those in which behaviours or methods have changed, in cases multiple times, over the years and/or decades, especially for tools which are seldom used.)
The bash shell isn't fully consistent, I'll grant. But it's also what I'm using much of the day, every day for interactions, and even its inconsistencies become hardwired over time.
I think on balance that's one of the things DOS got wrong. I find the Unix way of doing arguments much nicer. Part of that is consistency: that rename command may be convenient, but it's pretty strange and counterintuitive that it treats the wildcard differently from most other commands, and the same convenience could have been achieved with another syntax.
This is one thing the C-shell (which was my shell for decades) got right.
% echo no*such*file
echo: No match.
Whereas in bash you can get used to this behaviour until it bites you. Luckily, my tcsh usage has already given me the "finger memory" to use the quotes.
Good explanation, but he didn't say how to quote: single, double, double quotes surrounded by single, etc. Quoting is quite confusing and rarely taught; you just kind of learn it by accident and trial and error.
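For the record, the bash short version: single quotes make everything literal, double quotes still allow $, ` and \ to do their thing, and each kind of quote is just an ordinary character inside the other. A quick cheat sheet (the /home/user path is illustrative):

$ echo '$HOME'       # single quotes: fully literal
$HOME
$ echo "$HOME"       # double quotes: expansion happens, but no word splitting of the result
/home/user
$ echo "'$HOME'"     # single quotes inside double quotes are just characters
'/home/user'
$ echo '"$HOME"'     # and vice versa
"$HOME"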