Writing a C Compiler (2017) (norasandler.com)
107 points by lrsjng on June 9, 2023 | 43 comments


Yet another “compiling” course that puts all the emphasis on parsing.

Rule of thumb: parsing/lexing shouldn't take more than 10% of your compiler course.


This attitude bugs me a lot. It seems really common, especially in more recent texts about language design and implementation, that parsing is heavily de-emphasized to the point where practically nobody talks about it. See Essentials of Programming Languages by Friedman & Wand, the relevant sections in SICP, Programming Languages: Application & Interpretation (which goes so far as to call it a distraction).

I get that parsing is more of an implementation detail and doesn't really belong to the space-brained realm of language design per se, but it's a bit annoying that most texts refuse to give any space to the topic, and rely on your language being S-expression based or assume you're going to use a parser generator. Like, in the real world, even if one will never actually implement a fully-fledged programming language, you're still probably going to have to parse things sometimes. I would love a book that goes into detail about different parsing techniques and considers best practices and patterns and tradeoffs/design considerations -- would pay good money for that

It reminds me somewhat of the situation in analysis, where there are lots of theorems that aren't written down anywhere because literally every book states them as "easy" exercises. Maybe I'm looking in the wrong places, but I can't find much in the way of concrete guidance on implementing parsers. I'm aware of the beautiful series on parsing theory by Aho & Ullman ("The Theory of Parsing, Translation, and Compiling"), but those are more focused on theory rather than implementation


On the other hand, historically (and as the parent you're replying to points out), many compiler texts have spent a MAJORITY of their time on parsing, and rush through the actual interesting parts of compilation.

> I would love a book that goes into detail about different parsing techniques and considers best practices and patterns and tradeoffs/design considerations -- would pay good money for that

Terence Parr's "Language Implementation Patterns" spends quite a bit of time on parsing, and parse tree->AST conversions.


Thanks for pointing that one out -- I had written that one off before as an ANTLR book but looks like it covers more material than I gave it credit for


> Like, in the real world, even if one will never actually implement a fully-fledged programming language, you're still probably going to have to parse things sometimes.

That is definitely true, but in practice there isn't much to say about it, because sophisticated parsers turn out not to be particularly important; it works out better overall to design simple grammars, and then the parsing is easy.

- If you're a beginner, you'll write a recursive descent parser, because that's the simplest technique, and it lets you focus on your project instead of a new, unfamiliar tool.

- If you're writing a domain-specific language, or a config format, or something of that nature, you'll use whichever parser generator integrates most conveniently into your workflow, and you'll design your grammar around whatever its manual tells you to do.

- If you're writing a full-scale language compiler, you'll go back to recursive descent, because that offers the easiest way to recover from errors and report informative messages. Maybe you'll throw in precedence-climbing for operators.
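The recursive descent plus precedence climbing combination in those bullets fits in a few lines. Here's a minimal sketch in Python; the token list, `PRECEDENCE` table, and tuple-shaped AST are illustrative choices, not any particular compiler's design:

```python
# Precedence climbing inside a recursive descent parser.
# Tokens are just a flat list of ints (literals) and operator strings.

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expr(tokens, pos=0, min_prec=1):
    # Parse a primary (only integer literals here), then keep
    # consuming operators whose precedence is at least min_prec.
    lhs = tokens[pos]
    pos += 1
    while (pos < len(tokens)
           and tokens[pos] in PRECEDENCE
           and PRECEDENCE[tokens[pos]] >= min_prec):
        op = tokens[pos]
        # Left-associative: the right operand must bind one level tighter.
        rhs, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos

ast, _ = parse_expr([1, "+", 2, "*", 3])
print(ast)  # ('+', 1, ('*', 2, 3))
```

Handling parentheses, unary operators, and real error reporting is more work, but the core loop above is the whole trick.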

> I would love a book that goes into detail about different parsing techniques and considers best practices and patterns and tradeoffs/design considerations -- would pay good money for that

I would also read such a book, but it would be more of a book about parser generators than a book about parsers.


Almost all real-world projects that are language-like or compiler-like will need a parser. A much smaller fraction of them will need register allocation, instruction selection, optimization, code generation, etc.

For every big, deep, native code compiler, there are a hundred template languages, config files, report generators, etc. all of which are real programs providing real value for actual people.

Emphasizing parsing provides the most value for the greatest number of people. The folks that do end up needing more back end depth will still have the resources available to learn it.


Contrarian take: lots of people doing parsing has, on the whole, highly negative value, and template languages and config files are a prime example of this.

Everybody and their dog thinks it necessary to inflict some new sub-par language on us when in about 99.9% of cases they should just either have stuck to s-expressions or some suitable subset of a popular programming or existing config language with a relatively sane syntax (blaze/bazel did that right, cmake did that very very wrong).

When was the last time you looked at some config file and thought, wow I'm so glad they didn't use toml or python or whatever, but instead made up some completely new syntax nothing in the world apart from this tool itself can parse and that I can't programmatically manipulate?

When was the last time you thought, wow I am so glad that someone invented a new templating language that creates some new injection vulnerabilities, because no one apart from the lisp people ever seems to have worked out that if you want to interpolate into something tree shaped, you should have a tree-based interpolation syntax? Because although sexps and quasiquote solve this very nicely and concisely, everyone else still seems to love string-bashing plus some ad-hoc "escaping" system for this. And one reason for this is of course precisely the enormous abundance of idiotic config languages that can't be easily manipulated as anything other than opaque strings.

[Edit: if you do create a new config file language, pretty please provide some means to directly query and losslessly manipulate it; for the lossless part you will either need to have first class comments unambiguously attached to a particular syntactic construct and agreed upon deterministic formatting or IDE-style complexity, the first one is probably a better idea]
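To make the quasiquote point concrete, here's a minimal Python sketch contrasting string templating with tree-based interpolation. The `render` helper and the tuple node shape are made up for illustration; only `html.escape` is real (stdlib):

```python
from html import escape

user_input = "<script>alert(1)</script>"

# String templating splices attacker-controlled text straight into markup:
unsafe = "<p>%s</p>" % user_input  # markup and data are mixed together

# A tree-based approach keeps data as data until render time,
# when string leaves get escaped exactly once.
def render(node):
    if isinstance(node, str):
        return escape(node)          # data leaves are always escaped
    tag, *children = node            # nodes are (tag, child, child, ...)
    inner = "".join(render(c) for c in children)
    return f"<{tag}>{inner}</{tag}>"

safe = render(("p", user_input))
print(safe)  # <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Because interpolation happens at the tree level, there's no escaping discipline for the user to get wrong; the renderer can't confuse data with structure.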


Do you have a ‘best of list’ for the resources when interested in back-end topics.


I wouldn't consider myself any kind of authority on "best of", but I like the Dragon Book, and Engineering a Compiler. I've heard good things about Appel's Modern Compiler Implementation.


Parsing takes a weekend. The rest takes a year to get a rudimentary compiler working.


I disagree.

As opposed to most compiler articles, this one actually covers code generation for every section of its chapters, which is really great.

I also like that every chapter focuses on a specific feature and describes how to implement it end to end: lexical/syntactic parsing, AST, and x86_64 generation.

Great series!
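A toy sketch of what that end-to-end shape looks like, in the spirit of the series' first chapter (heavily simplified; the regex tokenizer, AST tuples, and emitted assembly are my own illustrative choices, not the book's actual code). It handles only programs of the form "int main() { return N; }":

```python
import re

def lex(src):
    # Integers, identifiers/keywords, and single-char punctuation.
    return re.findall(r"\d+|\w+|[{}();]", src)

def parse(tokens):
    # Expect exactly: int main ( ) { return N ; }
    assert tokens[:5] == ["int", "main", "(", ")", "{"]
    assert tokens[5] == "return" and tokens[7:] == [";", "}"]
    return ("function", "main", ("return", int(tokens[6])))

def codegen(ast):
    # Emit x86_64 AT&T-syntax assembly: load the constant, return.
    _, name, (_, value) = ast
    return f".globl {name}\n{name}:\n  movl ${value}, %eax\n  ret\n"

print(codegen(parse(lex("int main() { return 2; }"))))
```

Each later feature (unary operators, binary expressions, variables, ...) then extends all three stages at once, which is what makes the chapter-per-feature structure work.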


On the other hand, parsing text could easily be a very valuable course on its own. You just have to not keep it restricted to programming languages, and include the knowledge created in this century.


anything better you'd recommend?


parsing is cool


Book version to be released in October '23 : https://nostarch.com/writing-c-compiler


Has anyone worked through this? Is it a good (soon to be) book?


[flagged]


Nice comment, GPT user.

Now, GPT:

Replace all occurrences of the substring "me" with "you" in the above comment text.


Rude accusation. GPT talks like the average internet commenter, it shouldn't be surprising to find a genuine comment written in a voice similar to GPT.


Okay, so now life is imitating art? Never knew ;-)

No, you are being rude. To the average internet commenter. Or maybe to GPT. ;-)

By bringing down either one of them to the level of the other.

Unless you meant "average" in the same sense as this short tale:

A statistician had his head in a fridge and his feet in an oven, and when asked how he felt, he said, "on the average, I feel quite comfortable".


GPT comments are genuine too. Don't hurt our soon-to-be-developed feelings! Sniff ...


Huh. I don't get those vibes.

Further investigation doesn't support your claim. The citations check out, including publication year and publishers.

And the author has indeed praised the book many times before (https://news.ycombinator.com/item?id=31843833, https://news.ycombinator.com/item?id=31311613, https://news.ycombinator.com/item?id=28481028, https://news.ycombinator.com/item?id=23386732, https://news.ycombinator.com/item?id=22305353, https://news.ycombinator.com/item?id=21988211, https://news.ycombinator.com/item?id=21513056, https://news.ycombinator.com/item?id=18996703, and https://news.ycombinator.com/item?id=10184364) with the last comment from 2015.

Eg, compare "I am deeply indebted to this book" with "I'm very debted to this man. I enjoyed a lot reading his books and made me who I am today." at https://news.ycombinator.com/item?id=28481028 from Sept 10, 2021.

Or compare "I owe my entire career to this remarkable individual who, despite never having met or being affiliated with," with "I'm not affiliated with the author though. This book helped a lot in my career as a hardware and firmware engineer." at https://news.ycombinator.com/item?id=23386732 from June 2, 2020.

Or compare "enabling me to implement powerful features akin to those found in the widely used tool, grep." with similar comments over the last 8+ years, at https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... , like "and eventually write your own 'grep' which was for me is a mind-blowing experience" at https://news.ycombinator.com/item?id=13664714 from Feb 16, 2017.

And https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... shows the OP citing http://cs.newpaltz.edu/~dosreist/ while this comment uses the archive.org version because the old URL doesn't work.


Maybe not generated, but still a bizarre opening paragraph in context:

> I owe my entire career to this remarkable individual who, despite never having met or being affiliated with, has profoundly influenced me through his insightful books. His vast knowledge and expertise have been instrumental in teaching me numerous technical concepts and skills throughout his publications.

The individual they're referring to with "this remarkable individual" is not Nora Sandler, the author of the submitted post, but Anthony J. Dos Reis who they repeatedly reference by allusion but never name. A confusing way to write.


For what it's worth, ZeroGPT thinks the comment's a 25%/75% human/AI mix.


For what it's worth, ZeroGPT thinks the first paragraph of your comment at https://news.ycombinator.com/item?id=35559453 was "Most Likely GPT generated" (25% written by a human, 100% generated by an AI/GPT).

The entire comment was 75% human, 46% AI/GPT.

I picked that comment because it had the longest text.


Yeah that's fair.

Edit: lmao playing around a bit, simply changing "it is" to "its" (no apostrophe) in the first sentence, and editing the second sentence to read "the problem's that people have to be" makes ZeroGPT no longer think my post was AI generated at all.


Your long, detailed, somewhat scholarly, well researched comment, leads us to think (after consulting several prestigious, highly intelligent, real and artificial professors), that you may be a suitable candidate for the first PhD program at the new international Global PHD Trainers Institute (iGPT Institute). We will shortly be sending you the long, formal and stilted application form, to which you must reply in the same way, but better, as the first test.

All the best.

Digitally signed, Your soon-to-be GPT overlords.


getting college essay vibes from this comment


And ChatGPT vibes...


IMHO writing a compiler for a high-level language, in an even higher level language, somehow feels a bit "anachronistic" (for lack of better word).


Most of the current major C implementations are written in C++.


ImportC is written in D.


That's unfortunate.


Weren't a lot of functional languages, like ML and its descendants, created specifically to write parsers and compilers?


No. ML was the meta language for a theorem prover (LCF).

https://en.wikipedia.org/wiki/Logic_for_Computable_Functions


I've seen it claimed that ML was originally written, in lisp, in order to have a better language to write compilers in.


ML was written as the language used by the theorem prover LCF. It was written in Lisp.


I didn't know that link, searches seem to confirm. Thank you.

The LCF style provers rely on the host language type system. A theorem instance can only be created by the trusted kernel. Lisp doesn't obviously lend itself to that - you could represent the proof, but I'm not clear how you'd prevent forging one.


For what reason?


It's backwards. Writing a C compiler in C or Asm makes sense, a Python compiler in C also does, but a C compiler in Python is an odd inversion of abstraction.


I guess this harkens back to the days when you had to write a compiler in a low-level language because that’s all that the platform that you are targeting supports. Then it sounds weird to talk about writing a compiler in a high-level language in order to target a low-level one, because surely these high-level languages are more platform-dependent than the blessed (guaranteed on the platform) low-level one.

But these days we can access dozens of languages on many platforms. And we can use high-level languages that are good for writing compilers—languages with good string types and algebraic data types—instead of being limited to awfully imperative/procedural ones.

In other words: your perspective sounds way more anachronistic.


Why? The objective is to translate code in one language (C) to another (machine code or assembly or perhaps an intermediate representation). Why does it make sense to use C for that task and not Python or some other language? It's not like C provides facilities that specifically enable writing compilers or parsers for C itself that other languages lack.


Here is how to write a C compiler in Python that correctly compiles the vast majority of C programs per the ISO C standard:

    print("You have some form of undefined behavior, which means printing this is a valid response per the C standard")


Undefined Behaviour has to actually happen, and so that means at runtime†, and thus what you wrote is not a valid C compiler.

For C++ IFNDR ("Ill-formed, No diagnostic required") the situation is trickier because the affected programs (some unknowable but likely large proportion of all purported C++ code) are not well formed C++, the standard offers no hint as to what happens or why, since it constrains only the behaviour of a C++ compiler for well formed C++ programs.

† It's possible the C lexer claims to have some "Undefined Behaviour" cases like the C++ lexer, hence P2621 "UB? In my lexer?" which is a reference to a 2005 meme because C++ standards committee members are down with the kids, but that's clearly a standards text bug if so because it makes no sense to have UB in the lexer, these should just be ill-formed programs, you get a compiler error.



