My Favorite Rust Function Signature (brandons.me)
238 points by brundolf on Sept 17, 2020 | 118 comments


You don't have to write the lifetimes here, this can just be

  fn tokenize(code: &str) -> impl Iterator<Item=&str> {
but including them to be more explicit does make sense for the post.


This falls under lifetime elision where the compiler can assume a lifetime if only one is used.

https://doc.rust-lang.org/reference/lifetime-elision.html


In that case, how does the caller know that the results will last as long as the input? Does it assume `'static` or is there some cleverness?


So, every reference has a lifetime. But writing them out all the time isn't great for humans. So we looked at existing code, and found that the vast majority of lifetimes followed a few simple patterns. So we implemented these patterns as "elision rules": patterns where you're allowed to leave out, that is, elide, the lifetime.

Here are the rules:

1. Every reference as a parameter gets its own lifetime parameter. There is only one in this case.

2. If there is exactly one input lifetime, then every elided lifetime in the return type is assigned to it.

3. If there are multiple parameters, but it's a method, not a free function, then the lifetime assigned to self is assigned to all outputs.

This is written in a bit of an abstract way, but #2 there describes this function: we have exactly one input lifetime, so it gets assigned to every output lifetime, and there's only one output here. This means the output is not 'static, and in fact, the function would fail to compile if it were assumed to be.
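
Concretely, for the post's signature (with `split_whitespace` as a placeholder body here, not the article's actual tokenizer):

    // What you write, with the lifetime elided:
    fn tokenize(code: &str) -> impl Iterator<Item = &str> {
        code.split_whitespace() // placeholder body
    }
    
    // Roughly what the compiler treats it as, per rules #1 and #2 above:
    fn tokenize_explicit<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
        code.split_whitespace()
    }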

The original RFC, if you're curious https://rust-lang.github.io/rfcs/0141-lifetime-elision.html

(And, I actually think there's been some tweaks over time; if you have no inputs, the outputs can turn into 'static... the core of it is these three things though. Oh yeah, that is in the reference, which is the up to date spec, https://doc.rust-lang.org/reference/lifetime-elision.html; I linked the RFC for historical reference because it's interesting, and has some neat stats.)


(For some reason the semicolon at the end of your reference link seems to mess it up. Here's the working link: https://doc.rust-lang.org/reference/lifetime-elision.html )


Maybe because a semicolon is valid in URLs. Although the top answer on stack overflow also says: "However, if you're plucking them from relatively unstructured prose, it's probably safe to assume a semicolon at the end of a URL is meant as sentence punctuation."

Because you can never know how a given website will handle the input, I always leave a space after the URL and before punctuation in a text box, like so: http://example.com ;


Elision never assumes 'static, since that's an unusual case that has very few real-world uses.

This leaves the only remaining possibility: references in the return type must have the same lifetime as references in the function's arguments. Without a GC, or leaking memory, you can't make a Rust reference/lifetime out of thin air, so there's always such a relationship.

When a function takes multiple arguments by reference, the compiler will ask you to disambiguate which one is used for the return type. The borrow checker intentionally checks borrowing against function prototypes, not function bodies.
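
For example, a hypothetical signature (names made up) where the result can only borrow from one of two reference parameters:

    // The returned field borrows from `record`, not `separator`, and the
    // explicit 'a is what tells both the caller and the borrow checker that.
    fn first_field<'a>(record: &'a str, separator: &str) -> &'a str {
        record.split(separator).next().unwrap_or(record)
    }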


> Elision never assumes 'static

I'm pretty sure boxed trait objects (maybe all boxed objects?) are assumed to be 'static. That is, Box<dyn MyTrait> is shorthand for Box<dyn MyTrait + 'static>. If you want non-static lifetimes you have to do something like Box<dyn MyTrait<'_> + '_>

(I only just learnt this so take it with a pinch of salt.)
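
For what it's worth, here's a minimal sketch of where that 'static default shows up (all names made up): returning a boxed trait object that borrows from its input needs an explicit lifetime bound, because a bare `Box<dyn Speak>` means `Box<dyn Speak + 'static>`.

    trait Speak {
        fn speak(&self) -> &str;
    }
    
    struct Borrowed<'a> {
        text: &'a str,
    }
    
    impl<'a> Speak for Borrowed<'a> {
        fn speak(&self) -> &str {
            self.text
        }
    }
    
    // Writing `-> Box<dyn Speak>` here fails to compile, because that means
    // `Box<dyn Speak + 'static>` and the boxed value borrows from `s`.
    fn boxed_speaker<'a>(s: &'a str) -> Box<dyn Speak + 'a> {
        Box::new(Borrowed { text: s })
    }
    
    fn main() {
        let owned = String::from("hello");
        let speaker = boxed_speaker(&owned);
        println!("{}", speaker.speak());
    }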


You might be interested to know that the + '_ is inferred from the lifetime of the trait bound it comes from. By default, if no lifetime is involved, what will be inferred is 'static (namely, an owned thing). This means that if you write Box<dyn MyTrait<'_>> that's enough. If MyTrait doesn't have a lifetime parameter then you might have to write Box<dyn MyTrait + '_> to "flip the inference" from the default, but that only comes up in practice on return types, particularly with impl Trait.

TL;DR: Trait<'a> is actually understood as Trait<'a> + 'a.


I don't think so. Try removing the `+ '_` from this code:

https://play.rust-lang.org/?version=stable&mode=debug&editio...


Oh, you're right. Clearly I'd forgotten the rule. `'a` is not inferred for `dyn Trait<'a>`, it is inferred for `&'a dyn Trait`[1].

[1]: https://github.com/rust-lang/rfcs/blob/master/text/0599-defa...


Aha! Thanks for the link. Why isn't this in the official documentation anywhere?

Also I'm going to point people here next time someone tries to claim that Rust is simple.


If you look at the date, that RFC is from before my time. My guess would be that you don't need to know the rule: 95% of the time you don't need to use `+ '_` anywhere (these rules predate the ability to use `'_` to begin with, which is why the RFC talks about `+ 'a` only) because the behavior matches the user's expectation (not having to write `&'a Foo<'a> + 'a`), and, nowadays, in the 5% of cases where you do need it, the compiler will tell you about it:

  help: to declare that the trait object captures data from argument `self`, you can add an explicit `'_` lifetime bound
     |
  14 |     fn things(&self) -> Box<dyn Things<'_> + '_> {
     |                                            ^^^^
Edit: having said that, the docs do explain these rules[1], reachable from the trait object documentation[2], while it seems that The Book doesn't.

I'm also noticing that my confusion was likely due to `&'r Ref<'q, Trait>` being interpreted as `&'r Ref<'q, Trait+'q>`, which is behavior that was introduced in a follow up RFC[3].

[1]: https://doc.rust-lang.org/reference/lifetime-elision.html#de...

[2]: https://doc.rust-lang.org/reference/types/trait-object.html?...

[3]: https://github.com/rust-lang/rfcs/blob/master/text/1156-adju...


I believe 'static is assumed if there are no inputs.


No, not quite, the signature will be `fn foo<'a>() -> &'a u8` (there is no way to elide this)

However, Rust lifetimes are stretchy-squeezy, and a free lifetime is basically the same as 'static when it comes to covariant lifetimes, since both are "I can be whatever you want me to be".

However if the lifetime is invariant (e.g. in https://play.rust-lang.org/?version=stable&mode=debug&editio...), it is actually not 'static and will map to whatever lifetime is requested by the context.


Wibbly-wobbly timey-wimey? :)



This is determined by lifetime elision [1]. Specifically (quoting):

- Each elided lifetime in the parameters becomes a distinct lifetime parameter.

- If there is exactly one lifetime used in the parameters (elided or not), that lifetime is assigned to all elided output lifetimes.

So, since the signature has one `&str` as a parameter, that lifetime is assigned to the output lifetime as well, per these rules.

[1]: https://doc.rust-lang.org/reference/lifetime-elision.html

(Edited to improve formatting)


Ahh. So if I added a second parameter it would require the explicit lifetime? That may be what I was doing before that made me think I needed to be explicit all the time.


Yes - if there are two references as parameters, there is no way for the compiler to decide which one the output reference may be borrowed from, and so which lifetime to assign to the output, so it must be indicated explicitly by the programmer.

There are a few example cases in the doc linked above that walk through the conditions where it's necessary to add explicit lifetime annotations.
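
For example, a hypothetical two-parameter variant of the post's function (the second parameter and the body are placeholders):

    // Without the explicit 'a, the compiler can't tell whether the tokens
    // borrow from `code` or from `_locale`, so elision fails here.
    fn tokenize_two<'a>(code: &'a str, _locale: &str) -> impl Iterator<Item = &'a str> {
        code.split_whitespace() // placeholder body
    }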


  // this function foo
  fn foo(s: &str) -> &str {
      &s[0..1]
  }
  
  // is equivalent to this function
  fn foo_desugared<'a>(s: &'a str) -> &'a str {
      &s[0..1]
  }
  
  // this works too because the lifetime of "" is 'static because
  // it is a str that lives in the .data segment, which will live for "any 'a"
  fn bar(s: &str) -> &str {
      &""
  }
  
  fn main() {
      // as shown here
      bar(&String::new());
  }
  
  // but this won't work, because `x` gets `drop`ed at the end of `baz`
  fn baz(x: &str) -> &str {
      let x = String::new();
      &x
  }
https://play.rust-lang.org/?version=stable&mode=debug&editio...

Keep in mind that lifetimes have the opposite behavior of generic types. A T: Trait type parameter means "something that implements at least Trait, but might implement other things that you won't have access to", while an &'a returned borrow means "something that lives at least up to 'a, but might live less".


>Keep in mind that lifetimes have the opposite behavior of generic types. A T: Trait type parameter means "something that implements at least Trait, but might implement other things that you won't have access to", while an &'a returned borrow means "something that lives at least up to 'a, but might live less".

I don't know what you're trying to say here, but taken at face value it is wrong, and your own `fn bar` example demonstrates it being wrong. "something that lives at least up to 'a but might live less" is a contradiction.

Perhaps you meant "... but might live longer" ? If so, it would be correct, and also would not be "the opposite behavior of generic types."


Given

  fn foo(x: &str) -> &str {
      x
  }
calling `let z = foo(y);` is valid. The lifetime of `y` conforms to the lifetime of `x` in the `foo` signature, while the lifetime of `z` can be as long as the one for `x` (up to x's lifetime), but it is shorter (unless you try to do `drop(y)`, which would be a borrow checker error).


Okay, so you were talking about it from the POV of the caller, not the function. In that case, you meant "at most", not "at least". That is:

>while an &'a returned borrow means "something that lives at most up to 'a, but might live less".


I don't know the exact rules, but in general:

  fn foo(bar: &x) -> &y
is inferred to be:

  fn foo<'a>(bar: &'a x) -> &'a y


> I don't know the exact rules

This pretty much sums it up for me. You need to know so many rules... you need to know what those symbols mean... when people refer to Perl's syntax that they frown upon, they are referring to this: heavy use of symbols whose meanings must be memorized. I cannot just look at the code and understand it as I can with say, Ada. Pretty frustrating. I do not have to learn Ada to know what the code does. Funny that. :D At times Rust code looks very similar to Perl and Haskell. Especially Perl.


Yeah I mean, it really depends on your perspective. The compiler has your back; there's no need to memorize the rules, because that's what it's there for.

Forget the rules and don't write a lifetime where you needed one? You get this:

    error[E0106]: missing lifetime specifier
     --> src/lib.rs:1:29
      |
    1 | fn foo(x: &i32, y: &i32) -> &i32 {
      |           ----     ----     ^ expected named lifetime parameter
      |
      = help: this function's return type contains a borrowed value, but the signature does not say whether it is borrowed from `x` or `y`
    help: consider introducing a named lifetime parameter
      |
    1 | fn foo<'a>(x: &'a i32, y: &'a i32) -> &'a i32 {
      |       ^^^^    ^^^^^^^     ^^^^^^^     ^^^
It even shows the syntax and where things changed.

Now, it's not an AI; this may not be the signature you wanted, exactly, but the point is, you don't need to memorize the elision rules. You just don't write any lifetimes until the compiler makes you deal with it, and then you deal with your problem, and move on with life.


Yeah, I do find the compiler errors useful, I just wish I did not have to see them to understand the code. I am not really trying to "trashtalk" Rust or anything, I just "keep on" expressing my opinion on my dislike of heavy use of symbols. That is all. :) I have written plenty of Perl code, too, and I was even proud of myself! Problem is that after a few weeks I could not tell what particular bits of code did. :D

"There comes a time when words are more appropriate than symbols."


I have a camel tattoo. I don’t find Rust to look anything like Perl. YMMV :)

I find Ada very hard to read. ... come to think of it, doesn’t it also have unbalanced apostrophes? It’s been a while.


> have unbalanced apostrophes? It’s been a while.

I see what you did there (but maybe you don't see it yourself :-)

PS: at least MY parentheses are balanced.


Yeah, it does have its differences, that is true.

> unbalanced apostrophes?

What are you referring to exactly? Can you give me an example? Do you mean something like Integer'Last, which is the largest value of type Integer?


Yes! This syntax is in the first example on https://learn.adacore.com/.


As for the above example: they are called attributes. Here is a list of attributes: https://en.wikibooks.org/wiki/Ada_Programming/Attributes

> The concept of attributes is pretty unique to Ada. Attributes allow you to get - and sometimes set - information about objects or other language entities such as types. A good example is the Size attribute. It describes the size of an object or a type in bits:

    type Byte is range -128 .. 127; -- The range fits into 8 bits but the compiler is still free to choose.
    for Byte'Size use 8; -- Forces the compiler to use 8 bits.
I find it particularly neat.

Of course one of the best things I love about Ada is pre- and post-conditions, or contract-based programming (that are actually checked, unlike in languages like Kotlin that just bolted on some contracts that are only there for the reader, i.e. pretty useless, to be honest). Ranged types are awesome, too: you tell the compiler the range you want and the compiler chooses an appropriately sized integer to suit the range, which is the important bit. For example, instead of asking for a "16-bit signed integer", you say "I want a type which can store values between -1 and 1000", and you can even give the compiler a hint asking it to pick a faster type or a smaller type. So you say:

    type Index is range -1 .. 1000;
Types like these are much more readable, and you can use attributes on them such as 'Min, 'Max, 'Range, and so forth. It is such a good feature: you can remove the range checks if you want, and you can even have them checked at compile time. These kinds of features are what I would like to see from a C replacement.

See more at https://learn.adacore.com/courses/intro-to-ada/chapters/stro....

---

https://www.adaic.org/learn/materials/intro/part1/

https://www.adaic.org/resources/add_content/docs/craft/html/...

https://www.adaic.org/advantages/features-benefits/

https://www.radford.edu/nokie/classes/320/designGoals.html

https://hackaday.com/2019/09/10/why-ada-is-the-language-you-...

Those links might give you some idea about the language and its goals, if you really are interested. Of course there are much more, up-to-date sources, but I think these got most of it right.


Thanks. I went through all of the learn.adacore.com stuff, but it was like 18 months ago.


You know what? You might be even able to implement this in Rust!

    index = num(-1->1000);
or something. :D

I dunno how attributes would work, but man, they really are handy. It is very, very difficult to shoot yourself in the foot.


Rust's "associated constants" are a very similar feature. You can also do runtime checked ranged integer types, and compile-time checked ones, though that's very awkward at the moment. It'll get un-awkward at some point; an implementation of the feature needed is in nightly.
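
For illustration, a runtime-checked version might look something like this (a rough sketch only; the `Index` type and its bounds are made up to mirror the Ada example above):

    // A poor man's Ada range type: a newtype whose constructor enforces the range.
    #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    struct Index(i16);
    
    impl Index {
        // Associated constants standing in for Ada's 'First and 'Last attributes.
        const FIRST: i16 = -1;
        const LAST: i16 = 1000;
    
        fn new(value: i16) -> Option<Self> {
            if (Self::FIRST..=Self::LAST).contains(&value) {
                Some(Index(value))
            } else {
                None
            }
        }
    }
    
    fn main() {
        assert!(Index::new(500).is_some());
        assert!(Index::new(2000).is_none());
    }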


You don't need to know the rules though, it's impossible to get the rules wrong, the compiler will check for you.


> "Pretty frustrating. I do not have to learn Ada to know what the code does"

That seems unlikely, or anyone - even if they hadn't learned to program ever - would be able to read Ada. More likely that a) you've already learned enough to cover Ada by learning other programming languages or b) you assume you know what the Ada code does but are incorrect.

Let's charitably go with a) then what you're implicitly supporting is that "all languages must look and behave the same, and there's no possible benefit in a language being different enough that it requires me to learn something new".

And therefore that the lowest-common-denominator intersection of all popular languages combined is also the pinnacle of language design.

Something which seems unlikely.


I am referring to the explicitness of the code. What the code does is not hidden behind symbols, for example. I probably have learnt enough to cover Ada by learning other programming languages, but it somehow does not apply to reading Rust vs. reading Ada, is what I meant. Of course you do have to follow some guidelines to make the code even more readable, but still. For me, without knowing Ada and Rust, Ada is way much easier to understand. Ada was designed with maintainability and readability in mind, and I think it shows, at least according to my experience. People have written stuff like "I see Ada as a verbose language, however, unlike say Java, the verbosity actually benefits readability.". I find it to be true.

Perhaps you could compare some Ada code (something from AdaCore (GNAT, GNATColl)) with the Rust equivalent.


> "For me, without knowing Ada and Rust, Ada is way much easier to understand."

If you don't know Ada then you don't /know/ that you understand it, you're assuming you do - and if you do that's because it looks like languages you already know. And if that's what you're saying is good about it, then what you're saying is "things I know are good". You're seeing "Ada looks familiar" and saying "Ada has good design" as if those are the same thing. You only need to look at some beginner programming discussions to see that block-structured, scoped, procedural, looped, branched, etc. features are not things people are born understanding - that you know them already and Ada has them is convenient for you.

> "What the code does is not hidden behind symbols, for example."

I looked at the Wikipedia page, it has equals, colon-equals, slash-equals, greater-than, less-than, apostrophe, parentheses, ampersand, double-quotes, equals-greater-than which seems to be an arrow not a comparison, double-dots, and of course "when" "is" "protected" etc. they're as much symbols as they are words.

> "Ada and Rust, Ada is way much easier to understand"

I'll quote Erik Naggum from an old comp.lang.lisp(?) comment: "When I work at this system up to 12 hours a day, I'm profoundly uninterested in what user interface a novice user would prefer.".


Interesting! For some reason I thought I tried something like that and it was being forced into a static lifetime, whereas the explicit lifetime made it more flexible. Now I'm curious what it was I was trying originally that didn't work, if it wasn't this.


You probably didn't have a reference as a parameter to the function.


I'm confident I did; I may have had additional parameters though: https://news.ycombinator.com/item?id=24506034


Thanks for posting this. I follow Rust, but haven't used it. It seemed strange to me that tying all the lifetimes like that wasn't the default.


It isn't done for multiple arguments because that would make the function's signature more restrictive than what you'd normally want.


> The downside is that, if you perform this as a separate pass, your parser now has to iterate over all of the source code twice. This may not be the end of the world: tokenizing isn't the most expensive operation. But it isn't ideal, so some parsers combine the two passes into a single one, saving cycles at the expense of readability.

This paragraph is a bit muddled:

* If you keep the two passes separate, then the second pass is over the token stream. So it's irrelevant that "tokenizing isn't the most expensive operation" because you're never going to do the actual tokenising a second time.

* If you combine the two passes then you're not "saving cycles" because you're still doing both of the loops - it's just that they're now happening in parallel rather than in sequence. This confusion is repeated later when he says the combined one is "without the runtime overhead of a second loop".

* What you are saving is memory, because you don't have to keep the whole token stream in memory at once.

But really this is a nitpick on a small note. The overall article is great.


Nitpick^2: You are probably saving a few cycles due to loop overhead around pointer offsets and bounds checking, + not having to load the items into cache twice, if the sequence is too long to fit into cache.


Agreed! I definitely wasn't telling the whole truth by saying the amount of cycles spent on looping is ultimately the same. Although the difference is complex enough that I could imagine it going either way (but you're right that the cache effects would probably be the clincher).


Lexers/parsers, when written together, generally interface with each other either as "parser calls lexer.PeekToken()/lexer.AdvanceToken() until it's done parsing, or lexer returns EOF", or, more rarely, "lexer calls parser.ConsumeToken(tok) until there is no more input left, and calls parser.Done()". IIRC, SQLite uses the second approach to parse SQL.

At no point does any of these schemes require the lexer to read the whole file into memory, and chop it in one go into a huge list of ready-to-be-consumed tokens.
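
In Rust terms, the first, pull-based shape is roughly this kind of thing (a sketch only; the `Lexer` type and method names are made up, and whitespace splitting stands in for real lexing):

    // The parser owns one of these and pulls tokens on demand; no token list
    // is ever materialized in memory.
    struct Lexer<'a> {
        tokens: std::iter::Peekable<std::str::SplitWhitespace<'a>>,
    }
    
    impl<'a> Lexer<'a> {
        fn new(code: &'a str) -> Self {
            Lexer { tokens: code.split_whitespace().peekable() }
        }
    
        // "PeekToken": look at the next token without consuming it.
        fn peek_token(&mut self) -> Option<&'a str> {
            self.tokens.peek().copied()
        }
    
        // "AdvanceToken": consume and return the next token.
        fn advance_token(&mut self) -> Option<&'a str> {
            self.tokens.next()
        }
    }
    
    fn main() {
        let mut lexer = Lexer::new("let x = 1");
        assert_eq!(lexer.peek_token(), Some("let"));
        assert_eq!(lexer.advance_token(), Some("let"));
        assert_eq!(lexer.advance_token(), Some("x"));
    }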


Sure, in practice, no one ever builds the whole list of tokens in memory, because they all use the same sort of idea described in the article (and described again in your comment). But the article's author chose to compare against doing it that way and explain why they don't, and that's what I was commenting on.


I don't think this is a nitpick, this is a valid point that actually plays even more into the concept of Rust's fearless concurrency.

You're not avoiding the runtime overhead of a second loop but being able to operate on both streams in parallel would be a far more complicated concept in C/C++. This actually reminds me of that influential Rob Pike talk about writing a concurrent parser in Go.


Ah, I wasn't sure whether to open that can of worms in my comment. The truth is I was sloppy using the word "parallel". The article describes interleaving the tasks but not actually multithreading them. Doing that in C++ would be just as easy as the article did it in Rust: you make a little iterator class (not necessarily STL iterator compliant, just that sort of idea) that hides the whole lexer behind an interface that yields the next token each time. It wouldn't have the lifetime guarantees described in the article, but in this sort of context, if you're single threading, the object lifetimes are straightforward anyway.

Maybe you could take advantage of Rust's lifetime guarantees to multithread things more easily than C++. That would potentially be a much more interesting article, if it turns out to be a lot easier than in C++.


Really enjoyed seeing this broken down and explained. Is a function like that considered idiomatic Rust?


Yep! See my comment about the lifetimes, though, you may not actually write them out in the real code, but conceptually, this is Rust's bread and butter.


I'm learning Rust for the first time and found that The Rust Book does a pretty good job of covering language concepts. I actually just finished reading the section on lifetimes last night: https://doc.rust-lang.org/book/ch10-03-lifetime-syntax.html

My experience so far with Rust has been a huge reduction in mental burden/bookkeeping compared to when I write modern C++.


That function really is a good example for explaining lifetimes in Rust.


The function may be a good example, but this blog post doesn't really do a good job at explaining it, at least not to people unfamiliar with Rust (which I assume it's aimed at). If I understood correctly, if you assign the lifetime 'a (or 'x, 'l, whatever) to all parameters, you tell the compiler that they all have the same lifetime? And if you wanted to have different lifetimes, you could use 'a and 'b? But the article doesn't explicitly mention that, so you could end up thinking that 'a is some kind of a magical constant that you have to use...


The target audience was programmers who are at least familiar with the basic concepts of stack, heap, and pointers, including but not limited to newer Rust programmers.

You make a good point about 'a; I could have made that clearer. In practice it works the same way as a generic type (the letter is arbitrary, and it's declared inside the < >), which I assumed people would be familiar with, but I can see how the connection between the two may not have been obvious.


You can still edit it to improve it in that regard. Maybe add a footnote or something.


Yes, I plan to go back and make several improvements this evening after I'm off work :)


Yep, you’re correct. Naming the first (especially when it’s also the only) lifetime parameter 'a is just a convention, like naming the first type parameter T.


That's a good point. When I first started trying to learn Rust, I also thought that there was some semantic meaning to 'a, since everyone uses that exact same name in all the examples.


It'll be fun when we get first-class generators and you can write a simple, imperative function and no helper types and still use this type signature :)



Oh that's really neat, I didn't know about that


Wow, that’s some really great stuff :)


Yeah, in my case I did code.chars().filter_map(...), where it returns a Some() at the end of each complete token. But generators would definitely be a little cleaner!


> This is something that you cannot do both a) this safely and b) this efficiently in any other language.

This is wrong. Pretty sure you could do this in ATS for example.


I may have overstated that. What I really meant was "any other language that I'm passingly familiar with", which includes the majority of mainstream languages.


Substrings in older versions of Java used to reuse the character array of the original string instead of making a copy. The flip side is that the original string must be retained in memory until all the substrings have “become garbage”. So there is a trade-off.


The Rust version has that same tradeoff; the string can't be deallocated until all the tokens' lifetimes have ended. (In this particular use case, that's not really a cost, since you probably need to keep the string available the whole time anyway.)

Go, which is garbage-collected, also does the slice-by-reference trick (but fails criterion #1 because it doesn't really have iterators).


Yeah. But there's also a bigger tradeoff which is having garbage collection in the first place. Rust lets you depend on immutability without having garbage collection to clean up the values for you, which is a really special thing.


You might find generics useful as well, to support any parameter that can be referenced as a `&str`, such as a `String`:

    fn tokenize<S: AsRef<str>>(code: S) -> ...
        code.as_ref()
Edit: whoops, was too quick, apparently this isn't a great idea here.


In cases like this, it can be useful to have the generic function call a separate, private function which does the implementation and is _not_ generic. Generic functions are monomorphized, meaning copies are generated for each concrete type they're called with. Separating them out in this way can reduce the size of the generated code after monomorphization.
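
A sketch of that pattern in an unrelated context (the function names are made up, and this example has nothing to do with the tokenizer above):

    use std::path::Path;
    
    // The thin generic shim gets monomorphized per concrete type, but the
    // non-generic inner function is compiled only once.
    fn line_count<P: AsRef<Path>>(path: P) -> std::io::Result<usize> {
        fn inner(path: &Path) -> std::io::Result<usize> {
            Ok(std::fs::read_to_string(path)?.lines().count())
        }
        inner(path.as_ref())
    }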


In cases like this, the generic function cannot return an Iterator of &str that borrow from the input because the input is being consumed, so it's a bad suggestion. The input needs to be &str to be able to return borrows from it.

If they had tried to complete their code and compile it, they would've noticed there's no way the function could return `code.as_ref()`.


It's true that in this case their plan of making the input generic wouldn't work, as it relies on the output being borrowed. The technique of reducing the size of monomorphized functions would be useful in other contexts, but not this one.


What would be the ... in this case?


It's a bad suggestion for this case [1]. The ... would have to be `impl Iterator<Item = String>`, because the function would not have anything to return a borrow from. That is exactly the problem the original article was trying to avoid in the first place.

[1]: And arguably a bad suggestion in general. If a fn needs a String, it should take a String. If it needs a &str, it should take a &str, and if the caller has a String they can call the fn with a borrow. fns that are generic on AsRef<str> are usually only found where the ergonomics make it worth it. Eg the standard library's filesystem functions often take AsRef<Path> parameters because string literals are AsRef<Path>, which means you can write

   fs::read_to_string("/etc/passwd")
instead of a more cumbersome

   fs::read_to_string(Path::new(OsStr::new("/etc/passwd")))


You're right! Didn't think about it much before posting.


Tokenizers usually need to retain some information about where each token came from, like the range in the source.

In C++ I might do it like this:

    struct token {
        string::iterator start;
        string::iterator end;
    };
and then the range in the original source can be recovered via:

    size_t offset = tok.start - src.begin();
What is the preferred way to do this in Rust? Is there something better than using indexes?

Rust doesn't appear to have a way to say "give me the location of this &str in that &str" (even though it seems like it would be safe).


Where my mind personally goes is to return a Token struct on each iteration instead of a &str. This can still have the same lifetime guarantees, though:

  struct Token<'a> {
    s: &'a str,
    start_index: usize,
  }
I believe you can get the len() of a &str, so you could just store the start index and derive the end index.
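
For instance (repeating the hypothetical Token struct so the sketch stands on its own):

    struct Token<'a> {
        s: &'a str,
        start_index: usize,
    }
    
    // The end index (in bytes) falls out of the stored start and the token's length.
    fn end_index(tok: &Token<'_>) -> usize {
        tok.start_index + tok.s.len()
    }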


> Rust doesn't appear to have a way to say "give me the location of this &str in that &str" (even though it seems like it would be safe).

That's a fantastic idea. It seems like it would be not just safe, but quite efficient, too.

    fn offset_of(parent: &str, child: &str) -> Option<usize> {
        let cp = child.as_ptr() as usize;
        let pp = parent.as_ptr() as usize;
        if pp <= cp && cp <= pp + parent.len() {
            Some(cp - pp)
        } else {
            None
        }
    }
(Edit: as can be expected, my hastily-thrown-together example code has several bugs in it. I think there's an off-by-one error in the ending condition, and I'm really not sure if it handles multi-byte characters properly.)


You could do that, with the caveat that a lot of tokens won't be &strs, like numeric literals, so you will need an alternative way of keeping positions/Spans into the source code for those.


Well, if your tokenizer returns `impl Iterator<Item=&str>` (as in the OP), every token will in fact be a &str...

In any case, this scheme is pretty limited, since you probably want other info in your token span like line numbers.


In `rustc` we keep `Span`s, which are byte offsets for start and end. We only compute line and columns when about to display diagnostics, not before, and at that we also account for "non-1-byte-to-1-col" chars (some are 0-visible-width, many other are 2 cols wide, and tabs are normalized to 4 cols wide, we need the normalization for accurate underlines). This reduces memory utilization (you don't store line/cols you will likely not need) and computation (you don't compute the line/cols you won't need). Both of these costs are actually quite low, but we also have a ton of `Span`s.
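
A rough sketch of that byte-offset idea (the `Span` fields and the `line_col` helper here are illustrative, not rustc's actual definitions):

    // A span is just two byte offsets into the source.
    #[derive(Debug, Clone, Copy)]
    struct Span {
        lo: u32, // byte offset of the start of the token
        hi: u32, // byte offset one past the end of the token
    }
    
    // Line/column are computed lazily, only when a diagnostic needs them.
    // These are byte-based columns that ignore tabs and wide characters,
    // which is exactly the extra normalization work described above.
    // Assumes byte_offset lies on a char boundary (as a span endpoint would).
    fn line_col(src: &str, byte_offset: usize) -> (usize, usize) {
        let before = &src[..byte_offset];
        let line = before.matches('\n').count() + 1;
        let col = byte_offset - before.rfind('\n').map_or(0, |i| i + 1) + 1;
        (line, col)
    }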


My understanding of rust-analyzer's internals is limited, but I believe https://github.com/rust-analyzer/rust-analyzer/blob/master/c... is what it does here; that is, it keeps track of the start and ending positions as it goes https://github.com/rust-analyzer/rust-analyzer/blob/master/c...


You should keep track of it in the way you want to present it, which in this case is probably the line and column numbers. I don't think it's much different in C++ and Rust.


Calculating line and column is expensive (think tabs and wcwidth) so it's better to defer that until you actually need it. Plus there are often very many tokens, so you want the size to be as small as possible.

One common trick in C++ is to make source locations just a pointer. This gives you access to the token's contents (just dereference it), but you can also figure out which source buffer it came from, and where in that file, by comparing pointers.


The size doesn't matter if you're using an iterator and throwing the tokens away, only storing/presenting them if there's an error.

Computing the position might have some overhead, but it's probably better than going over the whole file from the top when you want to turn that pointer into line/column numbers.

As for the content, in Rust this would be in your token structure (or rather in Rust your Token::StringLiteral enum variant), storing start/end pointers for each token seems weird to me.

[edit: Rustc uses byte indexes though, so my intuition might have missed something important for the design of larger compilers: https://github.com/rust-lang/rust/blob/7bdb5dee7bac15458b10b...]


Computing the column/line pair would indeed be cheap on the Lexer as you could keep track of newlines easily.

As for the tokens themselves, they (or at least their position/Span) will be used beyond parsing. Otherwise you won't be able to point at the right place when you have a type error, for example.


> Computing the column/line pair would indeed be cheap on the Lexer

Calculating line and column may be complicated if you're doing it for display. Perhaps the input has a flag emoji, with a variation selector, followed by a tab... So it makes sense to store the minimum you can and defer line/column calculation, especially if your parse is perf sensitive.

One place this can go wrong is when you have to show a lot of source locations, e.g. you're printing a backtrace for a stack overflow in an interpreter. Here you can easily get into an N^2 situation. So you want to defer but also memoize.


For lines it is easy: you lex bare `\n` as a "token" and increment a counter; for columns you have to have a counter as well. You already have to account for Unicode codepoints to tokenize correctly. It is also true that there's a mismatch between displayed columns and the byte position for multibyte Unicode codepoints, but you also need to account for `\t` when doing so, which is plain old ASCII. Sadly this has to be a "best effort" action or you need to escape the output, otherwise you might get things like an underline shifted by the difference between the anticipated visual width and the byte length, and this might change from under you when the underlying platform is updated.

With all of that said, the approach taken in `rustc` is to have an arena of all `Span`s for every token and embed "pointers" (not actual pointers) to them. This is done because we embed `Span`s in everything in order to be able to give reasonable diagnostic messages.


Codepoints do not match characters. You only need to deal with codepoints to tokenize, but to find the column you need to count characters, which requires a Unicode database. I think this is what the parent was referring to, though I suspect a lot of compilers simply ignore this issue.


> What is the preferred way to do this in Rust? Is there something better than using indexes?

I don't think so.

> Rust doesn't appear to have a way to say "give me the location of this &str in that &str" (even though it seems like it would be safe).

'course there is: str::find (https://doc.rust-lang.org/std/primitive.str.html#method.find)

It looks a bit abstract because it takes a Pattern rather than an `&str`, but a char, a slice of chars, an &str, or a predicate function, are all patterns.


I think the point is to have a free method of doing so. Since a &str is a pointer and length, it contains all the information already of where it is inside the parent &str. This also covers the case of when the same token is repeated in the text.

With src = `let abc = "abc"` and token = `src[4..7]`, the 4 and 7 right there are the values we want. And if a str is a pointer and a length, then those values should be retrievable with simple arithmetic, rather than a string search.

The only issue is if the str isn't a substring. The signature would need to be either

`fn location(&self, child: &str) -> Option<(usize, usize)>` or `unsafe fn location_unchecked(&self, child: &str) -> (usize, usize)`


str::find performs a substring search, which is incorrect if the same token appears more than once.

What I want is the literal offset of a string slice in another, aka pointer arithmetic in C:

   fn derp() -> usize {
      let src = "xxxxxxx";
      let tok = &src[3..4];
      let offset = tok.as_ptr() - src.as_ptr();
      offset // should be 3
   }
this should be doable safely; I think it's just an annoying hole in the API.


cast both of those as_ptr()s to usize and this does what you want.


I know it's just an example, but in case anyone tries to grab the code - it's worth noting that this will panic if you try to slice into the middle of a UTF-8 sequence, and will give byte offsets and not char offsets, so it might not give the answer you expect if src contains multibyte chars before tok. Also I'm fairly new to Rust so it's possible there's some other way to do it that I just don't know about.


> What I want is the literal offset of a string slice in another, aka pointer arithmetic in C:

I see, I'd misunderstood the original, sorry.

> this should be doable safely

The language doesn't know that there's any relationship between those two strings, so the method would probably need to be fallible, and be more convoluted than this.


Presumably the parent wasn't asking about searching for one str slice in another, but rather locating the original offset of a string slice derived from another.


Ah I see, I'd misunderstood.

But there's no guarantee that two &str have any sort of overlap; that relationship is not preserved by the language, so it would either be unsafe or be fallible. It'd basically be pointer-twiddling.


Sounds like a solid case to use std::string_view.


I really don’t like the usage of ' in the function signature; it simply looks wrong.

In general, that is my main complaint as I am trying to learn Rust: the syntax seems to be all over the place. Maybe that feeling will fade away as I get more used to the language.


Almost none of it is unique to Rust, but it does draw inspiration from a wide variety of influences, so it can feel like that for sure.

This syntax was taken from OCaml.

Incidentally, nobody really loves it, but nobody could come up with anything clearly better either.


In most cases you don't need to specify lifetimes. The compiler is smart enough to infer it for a lot of cases. As noted elsewhere by Steve, it's not even required for this signature.


I also used to feel the same way. But listening to Crust of Rust: Lifetime Annotations by Jon Gjengset [0] helped me understand its need.

[0] https://www.youtube.com/watch?v=rAl-9HwD858


If that's your main complaint, then good for Rust!


This might be a good thread to mention nom[1], a parser combinator in Rust.

1. https://github.com/Geal/nom


You can actually kind of do this with `std::shared_ptr` [0], but it's going to be nowhere near as efficient[1] as the Rust mechanism. Maybe a single-threaded implementation of it could come close but you'd have to roll it yourself or use something like `boost::local_shared_ptr`. You may also need to go to intrusive pointers to make sure there's memory alignment.

Of course the Rust variant is not only cheaper (no ref counts), it's also guaranteed thread-safe (which again, C++ would have to use something like the atomic `shared_ptr` to achieve)

TLDR: Even when C++ can kind of do it, you have to give up concurrency safety to get close, while adding memory pressure and still not being as fast, whereas Rust gives you guaranteed concurrency safety for free, regardless of accidental misuse of your API.

[0] Currently that would be constructor #8 on https://en.cppreference.com/w/cpp/memory/shared_ptr/shared_p...

[1] Not only are you always paying for atomic ref counting even when you don't need it, an aliased `shared_ptr` will be as expensive to take a ref for as normally constructing it; there's possibly no way to `make_shared` with aliasing. This means the aliased shared_ptr will be even more expensive every time you need to access the ref count, since the control block loses cache coherency. On the other hand, if you structure it more carefully, maybe you could vend `const std::shared_ptr<std::string>&` in your iterator so that in the single-threaded case your only overhead is constructing the aliased shared_ptr once, and you leave it to the user to take a copy when transferring to another thread.


Nobody in their right mind will use `shared_ptr` in this context. It will be `string_view` in and out with lifetime management left as an exercise to the programmer.


I read your parent as saying "assume we start from needing this to be checked," and then saying that this is the best that can be done. You're absolutely 1000% right that nobody would write that ever and would use string_view and not have it statically checked.


That’s exactly it. Just showing how trying to get the same safety guarantee in C++ is an either/or between performance and safety.


You're over-thinking it! You don't need dynamic memory allocation and there are no concurrency issues when tokenising text.


why do we need the lifetime after the function name? i.e. `tokenize<'a>`


To make the function generic over some lifetime 'a, similar to how generic functions can be generic over some type T (e.g. foobar<T>). This is usually not necessary to write as it is elided in common use cases.
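
The analogy in code (function names made up):

    // `T` is "some type the caller picks"; the function works for any T.
    fn first_generic<T>(x: T, _y: T) -> T {
        x
    }
    
    // `'a` is "some lifetime the caller's reference happens to have";
    // the function works for any 'a, and ties the output to that input.
    fn first_borrowed<'a>(x: &'a str, _y: &str) -> &'a str {
        x
    }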


Thanks! I think what was confusing me was that in some examples I see it omitted and in some not. Now it's clear.


You can think of it like a generic type parameter. The <'a> is essentially saying "for some lifetime 'a...", and then we refer to that lifetime elsewhere to bind those usages together


Thanks!



