
As for the spaces, in some (maybe most?) languages you don't even need a NN to add them: words made of two or more words aren't that common, and when they do occur you probably want the composite one. So it boils down to starting from the beginning of the text and looking up in a dictionary the longest string that is a valid word, then repeating from where that word ends. The only language I know of that uses a lot of composite words (I mean words made by sticking two or more words together) is German, but I think that looking for the longest sequence occurring in a dictionary would be correct most of the time.
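A minimal sketch of that greedy longest-match idea, to make it concrete. The function name `segment_greedy`, the toy dictionary, and the single-character fallback for text that matches nothing are all assumptions for illustration, not something specified above.

```python
def segment_greedy(text, dictionary):
    """Split unspaced text by repeatedly taking the longest dictionary word.

    Assumption: if nothing in the dictionary matches at the current
    position, emit one character and move on, so the loop always ends.
    """
    if not dictionary:
        return list(text)
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: fall back to one character.
            words.append(text[i])
            i += 1
    return words


toy_dictionary = {"the", "quick", "brown", "fox"}
print(" ".join(segment_greedy("thequickbrownfox", toy_dictionary)))
```

With a real word list the dictionary would be much larger, and that is exactly where the longest match can start picking the wrong word.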


I think you're significantly underestimating how many words could be retokenized into multiple words even before considering how concatenation affects things. For example: Concatenate is a word, but so are con, catenate, cat, and enate. Yes, no two of those are likely to be used in sequence, but I don't think that's a very reliable rule overall—"a" and "an" are both common words and negative prefixes.
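To put a number on the ambiguity, here is a rough sketch that enumerates every way a string can be split into dictionary words. The function name `all_segmentations` and the tiny dictionaries are hypothetical; the second input, "anice", is my own illustration of the a/an issue rather than an example from the thread.

```python
def all_segmentations(text, dictionary):
    """Return every split of text into a sequence of dictionary words."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for j in range(1, len(text) + 1):
        head = text[:j]
        if head in dictionary:
            # Keep this prefix and split the remainder in every possible way.
            for rest in all_segmentations(text[j:], dictionary):
                results.append([head] + rest)
    return results


words = {"concatenate", "con", "catenate", "cat", "enate"}
for split in all_segmentations("concatenate", words):
    print(" ".join(split))

# "a nice" and "an ice" are both valid readings of the same letters.
for split in all_segmentations("anice", {"a", "an", "nice", "ice"}):
    print(" ".join(split))
```

Longest-match would pick "concatenate" in the first case, which is fine, but in the second it picks "an ice" with no way of knowing whether "a nice" was meant. This naive recursion is exponential in the worst case, so it is only suitable for short strings.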


Maybe you're right. I was biased by my native language, which doesn't have the a/an problem that English has.



