
^F intern -> not found.

But string interning is what they're doing, isn't it?

> Dictionary compression is a well-known and widely applied technique. The basic idea is that you store all the unique input values within a dictionary, and then you compress the input by substituting the input values with smaller fixed-size integer keys that act as offsets into the dictionary. Building a CedarDB dictionary on our input data and compressing the data would look like this:

That's string interning!!

Is interning just too old a concept now and it has to be rediscovered/reinvented and renamed?
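For what it's worth, the classic form is exposed directly in Python via `sys.intern`; a quick sketch (behavior shown is CPython's, which only guarantees shared identity for strings you intern explicitly):

```python
import sys

# Strings built at runtime are not automatically interned:
a = "".join(["dict", "compress"])
b = "".join(["dict", "compress"])
assert a == b and a is not b   # equal contents, two separate objects

# Interning maps both to one canonical copy:
a = sys.intern(a)
b = sys.intern(b)
assert a is b                  # now the very same object
```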



String interning only stores whole strings, dictionary compression stores substrings. Essentially string interning is a trivial subset of dictionary compression (because you don’t need a substringing scheme, which is the hard part).


Doesn't interning usually refer to when you only consider identical copies, as opposed to dictionary compression where you allow for concatenations? E.g.

  Interning:
  1: "foo"
  2: "bar"

  my_string = "foo" // stored as ref->1
  my_other_string = "foobarbaz" // not found & too long to get interned, stored as "foobarbaz"

  Dictionary compression:
  1: "foo"
  2: "bar"
  
  my_string = "foo" // stored as ref->1
  my_other_string = "foobarbaz" // stored as ref->1,ref->2,"baz" (or ref->1,ref->2,ref->3 and "baz" is added to the dict)
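The distinction above can be made runnable in a few lines of Python (toy code with greedy left-to-right matching, nothing like a real compressor's tokenizer):

```python
def intern_encode(s, table):
    """Whole-string interning: substitute only on an exact match."""
    return table[s] if s in table else s

def dict_encode(s, table):
    """Toy dictionary compression: greedily replace known substrings."""
    out, i = [], 0
    while i < len(s):
        for word, ref in table.items():
            if s.startswith(word, i):
                out.append(ref)
                i += len(word)
                break
        else:
            out.append(s[i])  # no dictionary word starts here
            i += 1
    return out

table = {"foo": 1, "bar": 2}
print(intern_encode("foo", table))        # 1
print(intern_encode("foobarbaz", table))  # 'foobarbaz' (no exact match)
print(dict_encode("foobarbaz", table))    # [1, 2, 'b', 'a', 'z']
```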


  my_string = "foo" // stored as ref->1
Only if you explicitly intern the string. Interning can be expensive because

- the runtime has to check whether the string already is interned.

- you’re permanently creating the string, so if it isn’t reused later, that means your memory usage needlessly goes up.

Both get more expensive the more strings you intern.

I think interning is a hack that very rarely, if ever, should be used. It likely won’t work well for large strings, as these tend to be unique, and use cases where it helps for short strings often are better handled by using enums or symbols, or by using a custom set of strings. If you do the latter, you have more control over memory usage; you can do things such as removing the least recently used strings, or ditching the entire cache when you’re done needing it (prime example: parsing a large XML file with many repeated nodes).
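A custom pool along those lines is short to write. A minimal LRU-bounded sketch in Python (the `StringCache` name and 1024 default are made up for illustration):

```python
from collections import OrderedDict

class StringCache:
    """Bounded string pool: returns a canonical copy of each string,
    evicts the least recently used entry when full, and can simply be
    dropped when you're done with it."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._pool = OrderedDict()

    def intern(self, s):
        if s in self._pool:
            self._pool.move_to_end(s)       # mark as recently used
            return self._pool[s]
        if len(self._pool) >= self.capacity:
            self._pool.popitem(last=False)  # evict the LRU entry
        self._pool[s] = s
        return s

# e.g. canonicalizing repeated tag names while parsing a big XML file:
cache = StringCache()
tags = [cache.intern("".join(["it", "em"])) for _ in range(1000)]
assert all(t is tags[0] for t in tags)  # one object, not 1000 copies
# when parsing finishes, `del cache` frees the whole pool at once
```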


> you’re permanently creating the string, so if it isn’t reused later, that means your memory usage needlessly goes up.

Nowadays lots of runtimes will GC interned strings to avoid this. It can mean churn in the interning table, but it avoids mis-estimations bloating the process too much.


Makes me think of an iolist in erlang/elixir.


> rediscovered/reinvented and renamed

It's not old yet; it'll be old once it's been renamed two or three times...



