Hacker News

java.lang.String exposes UTF-16: conceptually, it's a wrapper around an array of 16-bit chars. Thus, it's possible to create a String which is NOT valid Unicode by including unpaired surrogates.

As a result, the JVM has no native data type that represents a "valid Unicode string". This is unfortunate, because if java.lang.String enforced validity, it would let us make some helpful assumptions.
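To make that concrete, here is a minimal sketch. The class name LoneSurrogate and the helper isWellFormedUtf16 are made up for illustration; only the Character and String calls are standard JDK API.

```java
class LoneSurrogate {
    // Hypothetical helper: true if every surrogate in s belongs to a
    // correctly ordered high/low pair, i.e. s is well-formed UTF-16.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The constructor accepts the array without complaint.
        String broken = new String(new char[] {'a', (char) 0xD83D, 'b'});
        String paired = "ok \uD83D\uDE00"; // a properly paired emoji

        System.out.println(isWellFormedUtf16(broken)); // false
        System.out.println(isWellFormedUtf16(paired)); // true
    }
}
```

Nothing in String itself runs a check like this, which is the whole complaint.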



Well, every standard way to turn that string into bytes (or read it back) goes through a charset encoder, and the usual choices are Unicode encodings, unless you go out of your way to plug in a different encoder.
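For what it's worth, here is a rough sketch of what plugging in an encoder looks like when you want malformed input reported rather than silently replaced. StrictEncode and encodeStrict are names I made up; the CharsetEncoder calls are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncode {
    // Hypothetical helper: encodes s as UTF-8, throwing instead of
    // substituting a replacement when s contains an unpaired surrogate.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String broken = new String(new char[] {'a', (char) 0xD800, 'b'});
        try {
            encodeStrict(broken);
            System.out.println("encoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

So the check is available at the byte boundary, but only if the caller asks for it.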

What are you trying to do, export a pointer and write the raw bytes to some destination while assuming they're correct Unicode? If you're doing something that low level, it's always possible to corrupt your data and end up with invalid Unicode: just set an invalid byte/rune somewhere in that byte string. Direct memory access always throws guarantees out the window.

It is annoying that getBytes() has to allocate and fill a byte array because of the mismatch between char and byte, but you can work around it when necessary. That's not really related to "not being Unicode enough" either. If anything, it's "too Unicode", with the insistence on the char type for internal structure.
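One possible workaround, sketched under the assumption that you control the output buffer: encode into a reusable ByteBuffer instead of calling getBytes(). ReusableEncode and encodeInto are hypothetical names; CharBuffer.wrap still creates a small wrapper object, but no new byte array is allocated per call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ReusableEncode {
    // Hypothetical helper: encodes s into out (cleared first), flips out
    // for reading, and returns the number of bytes written.
    static int encodeInto(CharsetEncoder encoder, String s, ByteBuffer out) {
        out.clear();
        encoder.reset();
        CoderResult result = encoder.encode(CharBuffer.wrap(s), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow (buffer too small) or malformed input.
            throw new IllegalStateException("encode failed: " + result);
        }
        out.flip();
        return out.remaining();
    }

    public static void main(String[] args) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer scratch = ByteBuffer.allocate(256); // reused across calls

        for (String s : new String[] {"abc", "h\u00e9llo"}) {
            int n = encodeInto(encoder, s, scratch);
            System.out.println(s.length() + " chars -> " + n + " bytes");
        }
    }
}
```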


No, one of the constructors for String takes only char[] as a parameter. You can pass in an arbitrary array of chars, even invalid UTF-16.

You're correct that well-written code should never do this. However, there is no guarantee that some library you're using doesn't. That means you can never assume that 'new String(oldString.getBytes("UTF-8"), "UTF-8").equals(oldString)' holds, which has some unfortunate side effects if you're doing anything involving serialization and equality.
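A small sketch of the failure, assuming a String built from a char[] with an unpaired surrogate (RoundTrip and roundTrip are illustrative names, and StandardCharsets.UTF_8 stands in for the "UTF-8" string argument):

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Hypothetical helper: serialize to UTF-8 bytes and read back.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String valid = "caf\u00e9";
        String broken = new String(new char[] {'a', (char) 0xD800, 'b'});

        System.out.println(roundTrip(valid).equals(valid));   // true
        // getBytes cannot encode the lone surrogate, so it substitutes the
        // charset's replacement bytes and the result no longer matches.
        System.out.println(roundTrip(broken).equals(broken)); // false
    }
}
```

The substitution is silent, so the mismatch only shows up later, for example as a failed lookup after deserialization.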

I agree that Java's String API is generally quite well-designed, but the ability to access the raw UTF-16 is a very big leak in the abstraction.


If that ability were lacking, other people would be complaining about it. Abstractions should not prevent you from accessing the bits underneath: they should make it unnecessary. They never completely succeed at that, because there are always fringe use cases you didn't foresee.



