One thing that was not obvious to me for a long time: the stricter your language’s formatting is, the easier it is to grep the source code.
I work a lot with Go, where all code in our repository is gofmt'ed. You can get quite far with regular expressions for finding/analyzing Go code.
(And when regexps don’t cut it anymore, Go has excellent infrastructure for working with it programmatically. http://golang.org/s/types-tutorial is a great introduction!)
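To make that concrete, here is a small sketch of my own (the name FooBar, the flags, and the assumption of GNU grep are all mine, not from the comment): because gofmt puts every top-level declaration at the start of a line, a plain regex reliably finds both function and method definitions.

    # gofmt guarantees "func " starts in column 0 for top-level declarations,
    # so this finds functions and methods named FooBar (GNU grep assumed):
    grep -rn --include='*.go' -E '^func (\([^)]+\) )?FooBar\(' .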
I think it's because there are fewer potential substrings to check for matches, since most of the characters you add to a regex also add to the minimum length of the strings it can match.
That's not true. They work just fine. Typically, substring search algorithms are implemented at the encoding level, e.g., on UTF-8 directly. If you just treat that as an alphabet of size 256, then algorithms like Boyer-Moore work out of the box.
But the skip-ahead stuff isn't the most important thing nowadays. The key is staying in the fast vectorized skip loop as long as possible.
I noticed severe slowdowns when passing in the /u flag on my regexes, even with big fixed ASCII strings in the middle of the patterns. They were taking 10 times as long to complete.
That doesn't imply that things like Boyer-Moore suddenly stop being effective. Without more details (which regex engine? what regex? what corpus? which programming language?) it's impossible to state the cause, but it could be as simple as the regex engine not being smart enough to use a literal searcher in that case.
Related to this, it is generally a very good idea to be strict when naming functions, parameters, variables, etc. so that each concept has exactly one name throughout the codebase.
I encountered a similar problem in a C++ codebase and the debugging logs it produced. There was an error being reported in the logs as "weak ptr expired" or something like that. I grepped the whole source code (a gigantic project) for it. No results. Going back and forth several times. Feeling stupid beyond imagination. Then I copy-pasted what was actually printed in the logs into my grep query (previously I had been typing it in manually). It quickly found a match. Turns out someone had written "weak ptr" as "week ptr". Everyone on the team had a good laugh.
But how do you effectively organize/enforce this for a code base of several million LOC where geographically distributed teams are working on different ends of the system all the time?
The amount of cross team coordination is staggering.
Fail verification in CI if the change doesn’t pass your checks. You can check anything, such as whether it has a duplicate name already in the codebase.
It wouldn’t be something provided by the CI tool, you’d have to write the test yourself. At the end of the day it’s just another test, albeit a more complex one than a standard unit test.
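As a hedged sketch of what such a check could look like (the "userId"/"userID" names and the *.go glob are made up for illustration, not anything from this thread):

    #!/bin/sh
    # Fail the build if a second spelling of an already-named concept appears.
    # Here the team is assumed to have standardized on "userID".
    if grep -rn --include='*.go' 'userId' . ; then
      echo "Found 'userId'; use 'userID' for consistency." >&2
      exit 1
    fi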
Don't use grep. Use ag[0], which is specifically designed for searching code. It's much faster, honors .gitignore, and the output can be piped back through grep if you like.
ag FooBar | grep -v Baz
It's in brew/apt/yum etc as `the_silver_searcher` (although brew install ag works fine too).
Same experience here, started with ack and switched to ag and then to rg for speed. I've found them roughly equivalent in functionality, but for those who need specific features here's a link to a feature comparison table:
Thanks for ripgrep, I use it daily and was recently going through the source code to learn how to build production quality Rust apps! (https://github.com/BurntSushi/ripgrep)
Yes, this is true, although ripgrep is more of a hybrid than ag is. ag has numerous problems with being treated as a normal `grep` tool, whereas ripgrep does not. (Although, to be clear, ripgrep is not POSIX compatible.)
It's not that much faster in raw search speed (as in, not "over 50% faster"). It's faster to invoke, since you have much less to type to scan recursively with an ignore list. However, I find that in many projects .gitignore is too extensive, because it includes generated code, which is often quite informative. Then it's still nice to use those alternative grep-likes, but not by much. Besides, when you can't install things easily, it's hard to beat something that's already there and everywhere else.
ag provides sane defaults and settings for developers. grep is ubiquitous and great, but to do what most developers want it requires some guidance, whereas ag focuses on being what you want most of the time.
What I mean by that is that I enjoy the smart case sensitivity (as in, if there are no caps in my pattern it defaults to case-insensitive, but if there are any caps it matches case-sensitively), the fast filename searching with -g, or both at once with -G.
I've tried rg and while it was faster, it also didn't provide as good support for filename searching. Ag is still what I consider "so fast I almost don't believe it"
It's like the 'tldr' command (https://tldr.sh/). Of course I still use man pages, but having something that gets me what I really need very quickly is important.
Steps:
1. Use ag
2. If ag isn't present, try to install ag
3. If I can't install ag, then use grep or find, no big deal.
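In script form, that fallback might look roughly like this (a sketch; the grep flags are just my own defaults):

    # Prefer ag when it's installed, otherwise fall back to plain grep.
    search() {
      if command -v ag >/dev/null 2>&1; then
        ag "$@"
      else
        grep -rn "$@"
      fi
    }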
> it also didn't provide as good support for filename searching
Could you elaborate on this? Is it because you need to type more? If so, I'd suggest one of two things. 1) use `fd` for searching for files, which is dedicated to that purpose. 2) define `alias rgf="rg --files | rg"` (or similar) and use `rgf foo` just like you would use `ag -g foo`.
I have stopped using the AST-integrated searching in VS Code for Go. Instead of clicking through to a declaration, it is now faster in my larger codebase (especially one that uses interfaces a lot) to just search for substrings. The AST search still works most of the time, but sometimes it fails, and it is usually just plain slow.
The ability to easily grep for functions in C-like code is why I've come to appreciate projects defining their functions like:
int
foo_func(void) {
You can grep for `^foo_func\b` to get to a declaration or definition, or `^foo_func\b.* {$` to get to a definition or `^foo_func\b.* ;` to get to a declaration. This is instead of using something like `^\w.* \bfoo_func\(`, which is what you'd need for:
int foo_func(void) {
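For example, against a tree laid out in the first style, these searches work (GNU grep assumed for \b; 'src/' is a placeholder, and the declarations-only pattern assumes declarations also put the name at the start of the line):

    grep -rn '^foo_func\b' src/            # declarations and definitions
    grep -rn '^foo_func\b.*[{]$' src/      # definitions only
    grep -rn '^foo_func\b.*;' src/         # declarations only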
By the way, anyone know of a way to insert a literal asterisk here without having to follow it up with a space?
> If not, the reviewer can quickly dismiss it as a false positive
This is where you could be wrong. We would need to give a reason for dismissing it, and then the risk officer would need to approve it (or reject it). False positives can be a real pain in the ass.
The post's core message seems to be lost on HN. It's about screening sources for supposedly insecure and/or injection-prone funcs using simple text scanning (such as strcat, which is still worth flagging in iOS apps even though it is a C std API func); supposedly, greppability is also about quickly finding the code locations of messages and variables. But the comments are all about Rust or Go superiority, irrelevant grep implementation details, and AST-based code analysis tools, when those are specifically dismissed in TFA as producing too many false positives. Talk about bubbles and echo chambers.
I do a few VERY SIMPLE greps. The most useful is a pre-commit hook to check that no blacklisted env vars exist in the commit diff. So, useful.
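Something along these lines, presumably (a sketch with made-up variable names; the actual hook isn't shown here):

    #!/bin/sh
    # .git/hooks/pre-commit: reject the commit if the staged diff adds a
    # blacklisted env var (the names below are examples).
    blacklist='AWS_SECRET_ACCESS_KEY|INTERNAL_API_TOKEN'
    if git diff --cached -U0 | grep -E "^\+.*($blacklist)"; then
      echo "Commit rejected: blacklisted env var found in the diff." >&2
      exit 1
    fi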
Grepping leans into shell. Though if you have other environments available (Python, JavaScript, etc.), it makes sense to lean into them, e.g. I use JavaScript to examine my package.json to ensure my dependency SemVers are "exact".
That said, I rarely write static-analysis scripts: In JavaScript-world there is already a plethora of easily configurable linting & type-checking tools. If I wanted to focus in on static-analysis etc I'd probably reach for https://danger.systems/js/
Side note: My CI generates a metrics.csv file, which serves as a "metric catch-all" for any script I might write, e.g. grep to count "// TODO" and "test.skip" strings, plus my JavaScript tests generate performance metrics (via monkey-patching React).
I don't actually DO ANYTHING with these metrics, but I'm quite happy knowing the CI is chugging away at its little metric diary. One day I'll plug it into something.
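For what it's worth, the grep side of such a step can be as small as this (a sketch; 'src/' and the metric names are assumptions):

    # Append a couple of grep-derived numbers to the catch-all metrics file.
    todo_count=$(grep -rn '// TODO' src/ | wc -l)
    skip_count=$(grep -rn 'test\.skip' src/ | wc -l)
    echo "todos,$todo_count" >> metrics.csv
    echo "skipped_tests,$skip_count" >> metrics.csv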
While there's nothing to install and initial results appear quickly, the false positives that grep or any string search tool generates will make the cynics shoot down this simple attempt to find problems in the source code.
Problems that arose:
- what about use of those questionable APIs/constants in strings (perhaps for logging) or in comments?
- some of the APIs listed in the article were only questionable when certain values were used - sometimes you can get grep/search tool of choice to play along, but if the API call spans multiple lines or the constant has been assigned to a variable that is used instead, then a plain string search won't help.
- it's hard to ignore previously flagged but accepted uses of the API/constants (one possible workaround is sketched after this list).
- so there's a possible bug reported, but devs usually want to see the context of the problem (the code that contains the problem) quickly/easily. Some text editors can grok the grep output and place the cursor at the particular line/character with the problem, some can't.
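On the third point, one crude workaround is a baseline file of already-reviewed hits, so only new occurrences are reported (the file names and patterns here are assumptions, and line numbers are deliberately omitted so the baseline stays stable across edits):

    # Collect current hits, then show only those not already in the baseline.
    grep -rHoE 'strcat\(|sprintf\(' src/ > findings.txt
    grep -vxFf baseline.txt findings.txt || echo "no new findings"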
If you go down that road to try and reduce false positives, you'll end up with a parser for your development language of choice.
I haven't tried this approach, but having spent years using one of the best commercial SAST tools, I'm reluctant to dismiss it too quickly.
My SAST generates tons of false positives and is unforgivably slow. If this is orders of magnitude faster, it might be worth the extra false positives.
As a side note, my dream is a SAST that comments directly in the PR like a human reviewer would. Maybe that exists?
The SAST program is probably doing a lot more than a string search tool does.
If the SAST has to process C/C++ source code, then the SAST will parse all the #include'd header files. The SAST may track values to determine if illegal/uninitialized values are used.
A string search tool will skip doing all of that.
If the class of problems you're looking for contains only bad functions/constants, then a string search tool may be fine.
But as I mentioned before, the string search tool may get confused if these bad strings occur in strings/comments/irrelevant #if/#else/#elif sections.
There is another class of bugs, dealing with data values, which a string search tool can't handle easily.
As an example, PC-Lint lists the type of problems the program may flag - https://www.gimpel.com/html/lintchks.htm.
A string search tool won't know about classes and virtual destructors or other concepts relevant to the programming language in question.
With a string search tool, you'd either invoke it several times with different search strings over the same source code or, slightly more efficiently, use one long pattern containing all your search strings as alternate search targets.
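For instance, with grep the two variants look like this (the function names and paths are placeholders):

    # One long pattern with alternates:
    grep -rnE 'strcat\(|strcpy\(|sprintf\(|gets\(' src/
    # ...or keep the search strings in a file, one per line, and pass it with -f:
    grep -rnf banned-patterns.txt src/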
In either case, when the string search tool spits out a positive result, it won't explain why there is a problem. The dev will have to know, or look up, the problem associated with that search result.
When I worked in this area, C/C++ compilers stopped at syntax errors. Most have since gotten better at flagging popular problems like variable assignments within if statements, operator precedence bugs, and printf format string bugs.
Some divisions at Microsoft required devs to run a lightweight SAST before committing changes to locate possible problems ASAP.
It's relatively easy to integrate an SAST into your build system to scan the modified source code just before you're ready to commit the changes.
I tend to work a lot in Lisp and XML, both of which are more or less trees if you squint (Lisp syntax famously being the AST, thanks to homoiconicity), and it always makes me wonder whether there are better command-line tree search or tree diff algorithms out there (extra awesome if they work with git merge strategies). I mean, whitespace preference is fine and all, but sometimes you just don’t care :p
They don't \0-terminate the target on overflow, so you still need to test for that condition. So most people will have a wrapper around those to ensure the \0 is there.
strlcpy has the braindamage that it returns the length of the source buffer, which means it has to traverse the entire buffer to figure out the length.
If you want to copy out the first line from a buffer that happens to be a 10TB mapped file, that strlcpy call will take a long time to finish. If you are using strncpy/strlcpy because you don't trust the src buffer is properly null terminated but you still want to stop the copy at the first null or when the buffer is full, well, you're out of luck because strlcpy is going to blast past the end of the source buffer regardless.
I would have been much happier if it had just returned a flag indicating either successful copy (0), buffer was truncated (1), or an error occurred and errno was set (-1). Possible errors could be that the src or dest was NULL or the size was 0 (ERR_BAD_ARGUMENT).
In addition to what unilynx mentions about strncpy(), the size arguments are also, effectively, the remaining space in the destination buffer, not the entire space in the destination buffer.
So, you have to figure that out. It isn't hard (hell, it's trivial), but I think you're either going to be aware of the pitfalls (and then these functions are mostly not going to help you) or you're not, in which case you're just as likely to pass the wrong value for the size (dest's size vs. src's size) and overflow the buffer anyway.
Honestly, if I had to do more than a trivial amount of string manipulation in C, I'd be wrapping it in a mini library to manage some sort of stronger string type, or finding such a library (glib? ICU?) very quickly, depending on needs. std::string was one of the things in C++ that made me question why anyone was still using C, given how much less error-prone it is, comparatively. (std::string is not without problems; I only mean as compared to char * in C.)
Random idea: maybe you could supercharge this by introducing to grep some constructs from programming languages. Right now you have things like "word character", "whitespace", and "start of line"; in the supercharged version you would also have "function", "identifier", and so on.
It's very very fast / almost instant even with hundreds of source code files and millions of lines of code.
I hit Ctrl+T and can then search everything; this gives me a dropdown that filters as I type. Select the item in the dropdown and it goes to that source file.
I can also type:
/t and search just types
/m members
/mm methods
/u unit tests
/f file
/fp project
/e event
/mp property
/mf field
/ff project folder
e.g.
/t Foo
will find all the Foos
/mm SavePhoto
will find any methods called SavePhoto
Same works in JetBrains Rider for C# stuff.
I couldn't dev without this now, and it's all built into my IDE.
Just a small note that I would highly recommend ripgrep[0] over standard grep. It's another modern tool created by leveraging Rust, and it's from BurntSushi[1], who is excellent.
GNU grep is fast as well, but by default it doesn't ignore anything. Granted, it's handy to have this configured out of the box, but I prefer to know and use the flags, and perhaps write a shell script wrapper, and I find that in practice it feels just as fast as rg or the others. My main point in doing this is to avoid the situation where I'm on a machine other than my laptop and want to get going right away without having to install anything. There's a trade-off in all things; I just prefer it this way.
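Such a wrapper can be as small as this (a sketch; the excluded directories are just examples):

    #!/bin/sh
    # A thin grep wrapper: recursive, line numbers, skip binary files and a
    # couple of directories that are rarely worth searching.
    exec grep -rn --binary-files=without-match \
         --exclude-dir=.git --exclude-dir=node_modules "$@"

Saved somewhere on PATH as, say, `g`, running `g foo .` gives roughly the defaults the smarter tools ship with.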
Portability is key. There’s a reason these gnu tools have such staying power. It’s not necessarily because they’re the best, but because they’re ubiquitous.
Grep is really fast at the actual search (GNU grep at least); the gain there is mostly that "smarter" tools will ignore e.g. VCS data or binary files by default, whereas grep will trawl through your PNGs and git packfiles.
Excerpt: "The result of this is that, in the limit, GNU grep averages fewer than 3 x86 instructions executed for each input byte it actually looks at (and it skips many bytes entirely)."
> much of Mike Haertel’s advice in this post is still good. The bits about literal scanning, avoiding searching line-by-line, and paying attention to your input handling are on the money.
Terrible, terrible advice. Cumbersome, error-prone, and slow as molasses. A quick test: Searching for 'asdfadsgf' in the Linux kernel repository takes 0.25 s using rg, 12 s using GNU fgrep -f, and 228 s (!) using your command.
You know, when your ideology results in the worst results of all, you should really reconsider your ideology.
.. which falls over as soon as you have a file with a space in the name.
Edit: this highlights the big weakness in the "UNIX philosophy": the only record delimiter conventionally recognized in pipelines is the newline, yet the shell treats characters that are perfectly legal in filenames as filename delimiters, causing a cascade of delimiter bugs. Sometimes you really do need a bit more structure to your data.
(The UNIX philosophy is best understood in contrast to what went before - the COBOL or JCL style where files had fixed records, in turn based on fixed-column punchcard layouts.)
It's easy to unsafely handle filenames. I've seen it at my job where we do a lot of bash scripting. However, there are good guides on doing it the right way. [0]
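One standard technique (whether or not it's the one in that guide) is null-delimited filenames, roughly like this (bash shown; the paths and pattern are placeholders):

    # find -print0 / xargs -0 keep filenames intact even with spaces or newlines:
    find src/ -type f -name '*.c' -print0 | xargs -0 grep -n 'foo_func'
    # The same idea for a shell-side loop over files:
    find src/ -type f -print0 | while IFS= read -r -d '' f; do
        printf '%s\n' "$f"
    done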
Powershell falls over in the other direction: the objects flowing down the pipeline are "magic" and can't be serialised, or even necessarily inspected with normal tools. For most unix operations you can replace
foo | sort
with
foo > file
sort < file
I like the idea of powershell, but every time I try to do something complicated with it I'm disappointed.
By the way, PowerShell serialization is orders of magnitude better than anything *nix has to offer, as you can use objects from another machine's shell just as if they existed on your local one.
> I like the idea of powershell, but every time I try to do something complicated with it I'm disappointed.
I did some very complicated things in PowerShell. For example, check out the script that keeps ~300 mainstream packages on Chocolatey up to date, all in a few minutes, with a bunch of self-maintenance features.
I suppose this is a good demonstration of the best way of getting a right answer being to post a wrong one, as I spent a long time trying to work out how to do this last time I needed it.
rg searches my repo in under a second. Your command takes over a minute, with a warm cache. It also takes thirty seconds to type. And it doesn't work for files with a space in them. Even on single files rg is faster for me than fgrep. For example my tags file is 2GB large. rg takes 0.4 seconds to search it, fgrep takes 0.6 seconds.
Both grep and rg will do literal search with all the whizbang optimisations they could think of if they're given a literal (a string with no metacharacters) or the -F option (to not interpret the input as a regex).
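For example (a pattern chosen so the difference is visible; 'src/' is a placeholder):

    # -F forces literal matching, so '.' and '(' are not treated as regex syntax:
    grep -rnF 'foo.bar(' src/
    rg -nF 'foo.bar(' src/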
Like a few other tools (ack, ag, and pt), it's specialized for searching source code; in addition, it's really fast. The repo contains detailed comparisons with grep and an FAQ.
In general I think that's very good advice. In this particular instance, however, a dumb old grep might be superior, because it could catch potential security vulnerabilities that are not explicitly hardcoded in the source code, for instance by grepping through the compilation artifacts. Sure, you'll get a bunch of false positives that way, but at least you know that nothing is slipping through the cracks.
To be clear, you can disable all smart filtering in ripgrep. e.g., `rg -uuu foo` should be equivalent to `grep -r foo`. And if you want to exhaustively search binary files, then you need to add the `-a` flag to both commands.
It "reinvented" it and made it dramatically faster. Rust is also available on FreeBSD. I don't think you actually need Rust to run ripgrep. It's not like it's an interpreted language.
As for the dramatically faster, the ripgrep author doesn't claim this. What he claims (and supports with benchmarks) is the obverse, that there are no other tools dramatically faster than ripgrep.
Basically the stated goal is to be "fancy" like ack etc. and yet remain as fast as good ole' grep.
You obviously haven't tried it on either of those. They are "second tier", which means one is completely on one's own.
What in your opinion would have to be the size of the source code to warrant jumping through the hoops to get this software running, as opposed to a combination of find + xargs + egrep,fgrep,awk?
Except tools like ripgrep aren't equivalent to find + grep. There is no simple invocation of find + grep that does what ripgrep does automatically. `git grep` would be closer.
You talk about being antagonized, but many of your comments in this thread have stated either outright incorrect things, or moved the goalposts, without acknowledging either one even when others point it out. Talk about infuriating.
Just because a misguided person such as myself wrote a piece of software doesn't mean you get to be rude to everyone who talks about it or suggests it.
Sorry, replying to you while responding to the parent because their post is already flagged.
Annatar, both the Rust compiler and ripgrep are available as packages in pkgsrc. The number of hoops one needs to jump in order to use this tool on your niche platform is exactly one. And that hoop is not even on fire.
Keep moving those goal posts though. Hopefully you can move them far enough to keep the Venn diagram of your mistruths and people who recognize them completely disjoint.