Hacker News

The second point here made me realize that it'd be super useful for a grep tool to have a "super case insensitive" mode which expands a search for, say, "FooBar|first_name" to something like /foo[-_]?bar|first[-_]?name/i, so that any camel/snake/pascal/kebab/etc case will match. In fact, I struggle to come up with situations where that wouldn't be a great default.
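For illustration, such an expansion is only a few lines in Python; here is a rough sketch (function name hypothetical) that splits the query into words at case changes and existing separators, then rejoins them with an optional `-` or `_` in between:

```python
import re

def super_insensitive(query: str) -> re.Pattern:
    """Compile a query like "FooBar" or "first_name" into a regex that
    matches camelCase, snake_case, PascalCase and kebab-case variants."""
    # Words start at a letter and run through the following lowercase
    # letters; digits form their own word. Separators (- _) are dropped.
    words = re.findall(r"[A-Za-z][a-z]*|\d+", query)
    # Allow an optional - or _ between adjacent words, ignoring case.
    return re.compile(r"[-_]?".join(map(re.escape, words)), re.IGNORECASE)
```

For example, `super_insensitive("FooBar")` compiles to `/Foo[-_]?Bar/i`, which matches `foo_bar`, `foo-bar`, `fooBar` and `FooBar`.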


Hey, I just created a new tool called Super Grep that does exactly what you described.

I implemented a format-agnostic search that can match patterns across naming conventions like camelCase, snake_case, PascalCase and kebab-case. If needed, I'll add support for space-separated words too.

I've just published the tool to PyPI, so you can easily install it using pip (`pip install super-grep`), and then you just run it from the command line with `super-grep`. You can let me know if you think there's a smarter name for it.

Source: https://www.github.com/msmolkin/super-grep


You should post this as a Show HN! But maybe wait a while (like a couple weeks or something) for the current thread to get flushed out of the hivemind cache.

If you do, email a link to hn@ycombinator.com and we'll put it in the second-chance pool (https://news.ycombinator.com/pool, explained at https://news.ycombinator.com/item?id=26998308), so it will get a random placement on HN's front page.


Wow, thanks so much for the encouragement and advice, dang! I'm honored to receive a personal response from you, and so soon after posting. I really appreciate the suggestion to post this as a Show HN. If I end up doing it, I'll definitely wait a bit; thanks for that suggestion, as I would otherwise have thought to do the opposite. Nice of you to offer to put it in the second-chance pool as well.


pretty cool, and to me a better approach than the prescriptive advice from the OP. to me the crux of the argument is making code more readable to a popular tool; but if this could be well integrated into common IDEs (or even grep itself), most of what remains comes down to personal preference.


wow this is so cool!! it feels super amazing to dump a random idea on HN and then somebody makes it! i'm installing python as we speak just so i can use this.


Adding to that, I'm often bitten trying to search for user strings because they're split across lines to adhere to 80 characters.

So if I'm trying to locate the error message "because the disk is full" but it's in the code as:

  ... + " because the " + 
    "disk is full")
then it will fail.

So really, combining both our use cases, what would be great is to simply search for a given case-insensitive alphanumeric string in files that skips all non-alphanumeric characters.

So if I search for:

  Foobar2
it would match all of:

  FooBar2
  foo_bar[2]
  "Foo " + \
    ("bar 2")
  foo.bar.2
And then in the search results, even if you get some accidental hits, you can be happy knowing that you didn't miss anything.
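A minimal sketch of that "skip everything non-alphanumeric" search (Python, function name made up): flatten both the needle and the haystack down to lowercase alphanumerics, remember where each kept character came from, and report the original offsets of each hit.

```python
def fuzzy_find(needle: str, haystack: str) -> list[int]:
    """Return character offsets in `haystack` where `needle` matches,
    ignoring case and every non-alphanumeric character (newlines too)."""
    # Keep only alphanumerics, remembering where each one came from.
    chars, offsets = [], []
    for i, ch in enumerate(haystack):
        if ch.isalnum():
            chars.append(ch.lower())
            offsets.append(i)
    flat = "".join(chars)
    target = "".join(ch.lower() for ch in needle if ch.isalnum())
    hits, start = [], 0
    while (pos := flat.find(target, start)) != -1:
        hits.append(offsets[pos])
        start = pos + 1
    return hits
```

This finds "Foobar2" inside `foo_bar[2]`, and finds "because the disk is full" even when the string literal is split across two source lines.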


These are both problems I regularly have. The first one I saw immediately when reading the title of this submission was the "super case insensitive" search, which I often need when working on Go codebases, particularly when using a combination of Go structs and YAML or JSON. It also happens with command line arguments being converted to variables.

But the string split thing you mentioned happens a lot when searching for OpenStack error messages in Python, which are often split across lines like you showed. My current solution is to randomly shift what I'm searching for, or to try to pick the most unique line.


fwiw I pretty frequently use `first.?name` - the odds of it matching something like "FirstSname" are low enough that it's not an issue, and it finds all cases and all common separators in one shot.

(`first\S?name` is usually better, since by not matching whitespace it skips prose comments describing the thing, but `.` is easier to remember and type, so I usually just use that)


> "super case insensitive"

Let's say someone were to make a plugin for their favorite IDE for this kind of search. What would the details look like?

To keep it simple, let's assume we just do the super-case-insensitivity, without the other regex conditions. Let's say the user searches for "first_name" and wants to find "FirstName".

one simple solution would be to have a convention where a word starts or ends, e.g. with " ". So the user would enter "first name" into the plugin's search field. The plugin turns it into "/first[-_]?name/i" and gives this regexp to the normal search of the IDE.

another simple solution would be to ignore all word boundaries. So when the user enters "first name", the regexp would become "/f[-_]?i[-_]?r[-_]?s[-_]?t[-_]?n[-_]?a[-_]?m[-_]?e/i". Then the search would not only be super-case-insensitive, but super-duper-case-insensitive. I guess the biggest downside would be that this could get very slow.
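Both transformations are only a few lines each; a sketch in Python (helper names hypothetical):

```python
import re

def word_boundary_pattern(query: str) -> str:
    """First approach: the user marks word boundaries with spaces
    ("first name"); each space becomes an optional - or _."""
    return "(?i:" + r"[-_]?".join(map(re.escape, query.split())) + ")"

def per_character_pattern(query: str) -> str:
    """Second approach: allow an optional separator between *every*
    character (the "super-duper-case-insensitive" variant)."""
    chars = [re.escape(c) for c in query if not c.isspace()]
    return "(?i:" + r"[-_]?".join(chars) + ")"
```

`word_boundary_pattern("first name")` yields `(?i:first[-_]?name)`; the per-character variant additionally matches oddities like `f_irst_n_ame`, which is where both the slowdown and the extra false positives come from.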

I think implementing a plugin like this would be trivial for most IDEs that support plugins.

Am I missing something?


Hm I'd go even simpler than that. Notably, I'd not do this:

> So the user would enter "first name" into the plugin's search field.

Why wouldn't the user just enter "first_name" or "firstName" or something like that? I'm thinking about situations like, you're looking at backend code that's snake_cased, but you also want it to catch frontend code that's camelCased. So when you search for "first_name" you automagically also match "firstName" (and "FirstName" and "first-name" and so on). I wouldn't personally introduce some convention that adds spaces into the mix, I'd simply convert anything that looks snake/kebab/pascal/camel-cased into a regex that matches all 4 forms.

Could even be as stupid as converting "first_name" or "firstName", or "FirstName" etc into "first_name|firstname|first-name", no character classes needed. That catches pretty much every naming convention right? (assuming it's searched for with case insensitivity)
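That conversion really is nearly a one-liner; a hedged sketch (function name made up):

```python
import re

def naming_variants(query: str) -> str:
    """Turn "first_name", "firstName" or "FirstName" into the alternation
    "first_name|firstname|first-name", meant to be searched with /i."""
    words = [w.lower() for w in re.findall(r"[A-Za-z][a-z]*|\d+", query)]
    return "|".join(("_".join(words), "".join(words), "-".join(words)))
```

Any of the four spellings as input produces the same alternation, so the user can type whichever convention they're currently looking at.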


> "first_name" or "firstName"

Ya. Query tokenizer would emit "first" and "name" for both. That'd be neat.


Shame on me for jumping past the simple solutions, but...

If you're going that far, and you're in a context which probably has a parser for the underlying language ready at hand, you might as well just convert all tokens to a common format and do the same with the queries. So searches for foo-bar find strings like FooBar because they both normalize to foo_bar.

Then you can index by more than just line number. For instance you might find "foo" and "bar" even when "foo = 6" shows up in a file called "bar.py" or when they show up on separate lines but still in the same function.
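As a sketch of that normalize-then-index idea (Python, toy index keyed by the normalized token):

```python
import re
from collections import defaultdict

def normalize(token: str) -> str:
    """Collapse any naming convention onto canonical snake_case, so
    FooBar, foo-bar and foo_bar all become "foo_bar"."""
    words = re.findall(r"[A-Za-z][a-z]*|\d+", token)
    return "_".join(w.lower() for w in words)

def build_index(files: dict[str, str]) -> dict[str, set]:
    """Map each normalized token to the (filename, line number) pairs
    where some spelling of it occurs."""
    index = defaultdict(set)
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), 1):
            for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_-]*", line):
                index[normalize(tok)].add((name, lineno))
    return index
```

Searching for "foo-bar" then reduces to a dictionary lookup of `normalize("foo-bar")`, and the index values could just as well carry function names or other scopes instead of line numbers.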


IIUC, you're not missing anything, though your interpretation differs from mine*. He wasn't saying it'd be hard, he was saying it should be done.

* my understanding was simply that the regex would (A) recognize `[a-z][A-Z]` and inject optional _'s and -'s between... and (B) notice mid-word hyphens or underscores and switch them to search for both.


The best way would be to make an escape code that matches zero or one punctuation characters.

So you'd search for "/first\_name/i".


That already exists as "?" and was used in their example:

  /first[-_]?name/i
Or to use your example, just checking for underscores and not also dashes:

  /first_?name/i
Backslash is already used to change special characters like "?" from these meanings into just "use this character without interpreting it" (or the reverse, in some dialects).


It would be a mistake to try to solve this problem with regexes.


This reminds me of the substitution mode of Tim Pope's amazing vim plugin [abolish](https://github.com/tpope/vim-abolish?tab=readme-ov-file#subs...)

Basically in vim to substitute text you'd usually do something with :substitute (or :s), like:

:%s/textToSubstitute/replacementText/g

...and have to add a pattern for each differently-cased version of the text.

With the :Subvert command (or :S) you can do all three at once, while maintaining the casing for each replacement. So this:

textToSubstitute

TextToSubstitute

texttosubstitute

:%S/textToSubstitute/replacementText/g

...results in:

replacementText

ReplacementText

replacementtext
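The case-preserving half of that behavior can be sketched in a few lines of Python (a toy approximation, not tpope's actual logic):

```python
def preserve_case(template: str, replacement: str) -> str:
    """Re-case `replacement` to mimic `template`: ALLCAPS, alllower,
    leading capital, or otherwise as given."""
    if template.isupper():
        return replacement.upper()
    if template.islower():
        return replacement.lower()
    if template[:1].isupper():
        return replacement[:1].upper() + replacement[1:]
    return replacement
```

Each matched spelling of the pattern picks the re-casing rule it itself exhibits, which is how one :S command replaces all three forms at once.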


The Emacs replace command[1] defaults to preserving UPCASE, Capitalized, and lowercase too.

[1] https://www.gnu.org/software/emacs/manual/html_node/emacs/Re...


Of course it does. Or it wouldn't be Emacs.


Also, just realised while looking at the docs that it works for search as well as replacement, with:

:S/textToFind

matching all of textToFind, TextToFind, texttofind and TEXTTOFIND.

But not TeXttOfFiND.

Golly!


In vim, I believe there's a setting for this ('smartcase', used together with 'ignorecase'), which makes a search case-sensitive only when the pattern contains capitals.

In my setup, `/foo` will match `FoO` and so on, but `/Foo` will only match `Foo`.


There's undoubtedly a setting, but if you don't want it on all the time you can always add \c at the end of the search term, like /foo\c, to force case insensitivity.


I think Nim has this?


Nim comes bundled with a `nimgrep` tool [0] that is essentially grep on steroids. It has a `-y` flag for style-insensitive matching, so "fooBar", "foo_bar" and even "Foo__Ba_R" can all be matched with the simple pattern "foobar".

The other killer feature of nimgrep is that instead of regexes you can use PEG grammars [1].

  [0] - https://nim-lang.github.io/Nim/nimgrep.html
  [1] - https://nim-lang.org/docs/pegs.html


Fzf?


Fuzzy search is not the same. For instance, it might by default match not only “FooBar” and “foo_bar” but also e.g. “FooQux(BarQuux)”, which in a large code base might mean hundreds of false positives.


Ideally there'd be some sort of ranking or scoring to sort by. FooQux(BarQuux) would rank much lower than FooBar when searching for FooBar or "Foo Bar", but might still be useful in the results if ranked and displayed lower.


Indeed, that's a good solution – and I believe e.g. fzf does some sort of ranking by default. The devil is however in the details:

One minor inconvenience is that the scoring should ideally differ per filetype. For instance, Python would count "foo-bar" as two symbols ("foo minus bar") whereas Lisp would count it as one symbol, and that should ideally produce different scores when searching for "foobar" in each. Similarly, "foo(bar)" should ideally score lower than "foo_bar" for symbol search, even though the keywords are separated by the same number of characters.

I think this can be accommodated by keeping a per-language list of symbols and associated "penalties", which can be used to calculate "how far apart" keywords are in the search results, weighted by language semantics :)
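One way to encode those per-language penalties (a sketch; the table values are entirely made up for illustration):

```python
# Hypothetical penalty tables: a separator that is idiomatic in one
# language ("-" in Lisp) is a likely false positive in another (Python).
PENALTIES = {
    "python": {"_": 0, ".": 1, "(": 3, "-": 5},
    "lisp":   {"-": 0, ".": 2, "(": 3, "_": 2},
}

def gap_penalty(between: str, language: str) -> int:
    """Score the characters found between two matched keywords;
    lower means the match is more likely intentional."""
    table = PENALTIES.get(language, {})
    # Characters with no entry get a flat default penalty.
    return sum(table.get(ch, 4) for ch in between)
```

So a "foo-bar" hit would sort above "foo(bar)" in a Lisp file but below it in a Python file, matching the intuition above.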


Let's say you have a FilterModal component and you're using it like this: x-filter-modal

Improving the IDE to find one or the other by searching for either is missing the point of the article, which is that consistency is important.

I'd rather have a simple IDE and a good codebase than the opposite. In the example I gave, the worst part is that it's the framework that forces you to use these two names for the same thing.


My point is that if grep tools were more powerful we wouldn't need this very particular kind of consistency, which gives us the very big benefit of being allowed to keep every part of the codebase in its idiomatic naming convention.

I didn't miss the point, I disagreed with the point because I think it's a tool problem, not a code problem. I agree with most other points in the article.





