Yeah, CJK (Chinese, Japanese, Korean) breaking is particularly complex. Google has done a lot of work here and has this open-source implementation, which uses NLP. It's the best I've personally come across: https://github.com/google/budou
It does basic path finding, and then picks the best path based on the following rules:
1) Fewest words
2) Least variance in word length (e.g. prefer a 2-2 character split over a 3-1 split)
3) Solo freedom (this is based on corpus analysis which tags each character with a probability of being a 1-character word). For example, 王家庭 is either "Wang household" (王 家庭) or "Prince's courtyard" (王家 庭), and we split it as "Wang household" because Wang (王) is a common name that frequently appears in isolation, while 庭 is less likely to appear in isolation. It is interesting that solo freedom works better than comparing the corpus frequency of "Prince" (王家) vs "Household" (家庭).
It works reasonably well. A surprising number of people use it every day.
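To make those rules concrete, here is a tiny sketch (my own toy code with made-up data, not the actual implementation): enumerate every dictionary-consistent split, then rank candidates by word count, length variance, and the solo freedom of any single-character words.

```python
# Toy sketch of the rules above -- not the actual implementation.
# The lexicon and solo-freedom scores are made-up illustrative data.
LEXICON = {"王", "家", "庭", "王家", "家庭"}
SOLO_FREEDOM = {"王": 0.9, "家": 0.2, "庭": 0.1}  # P(char appears as a 1-char word)

def segmentations(text):
    """Yield every way to split `text` into lexicon entries (the path-finding step)."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in LEXICON:
            for tail in segmentations(text[end:]):
                yield [head] + tail

def length_variance(words):
    lengths = [len(w) for w in words]
    mean = sum(lengths) / len(lengths)
    return sum((n - mean) ** 2 for n in lengths) / len(lengths)

def rank(words):
    # Rule 1: fewest words; Rule 2: least variance in word length;
    # Rule 3: prefer splits whose single-character words are likely to stand alone.
    solo = sum(SOLO_FREEDOM.get(w, 0.0) for w in words if len(w) == 1)
    return (len(words), length_variance(words), -solo)

print(min(segmentations("王家庭"), key=rank))  # ['王', '家庭']
```

With this toy data, 王 / 家庭 and 王家 / 庭 tie on rules 1 and 2, and solo freedom breaks the tie in favour of 王 / 家庭.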
What? 王家 doesn't mean "prince". Or at least, there is no such dictionary entry in the ABC dictionary or in the 汉语大词典. I would expect 王家 to mean "prince's household", in the same way that 皇家 means "imperial household".
汉语大词典 has two glosses for 王家:
1. 犹王室,王朝,朝廷。 [Equivalent to "royal family"/"royal court".]
2. 王侯之家。 [An aristocratic household.]
There's no problem with the concept of the phrase 王家庭 meaning "prince's courtyard", since a home can easily contain a courtyard. But the phrase should arguably be segmented 王-家-庭. (Or not -- there's very little to distinguish the idea 'one word, "the prince's household"' from 'two words, "prince"/"household"'.) Regardless of that choice, the courtyard is being associated with a household, not a person.
So to illustrate the situation, the input would be like "Royal, House, Hold", and whether it's supposed to be "Royal-Household", "Mr. Royal's household", or "Royal House, hold ..." is up to context, right?
Correct, though just to be clear, "hold" doesn't correspond to anything in the Chinese. (I only mention this because on a character-by-character basis, "royal" and "house" are fairly decent glosses of 王 and 家.)
"King" is also a decently well-known English surname, so we can draw a pretty close analogy to the distinction between 王家, the royal household, and 王家, the 王 family. Compare the English sentences
1. That's the king's house. [The king lives there]
2. That's the Kings' house. [The Kings live there]
My Chinese ability rounds down to non-existent, and I apologize for constructing a questionable example. Wang as a family name vs. Wang as "king" (or "royal-") was the first thing that came to mind. Since you know Chinese well, do you know of a good example of a 1-2 vs. 2-1 split where both are reasonable but one is much more likely to be semantically correct?
I have been told there are quite long sentences that can humorously be segmented two ways with completely different meanings, but I can't find any. My favorite in English is that expertsexchange.com had to change its domain to experts-exchange.com :)
> I apologize for constructing a questionable example
No need; as an example of segmentation, the one you've presented is fine. The biggest problem, which is mostly irrelevant here, is that "prince's courtyard" is pretty archaic. I was objecting to the translation-in-passing of 王家 as "prince". It's defensible to segment the "prince's courtyard" sense 1-1-1, but it's also defensible to segment it 2-1.
I don't have an example of the type you're looking for to hand, but I will see about finding one.
Thanks for sharing! I also wrote a Chinese segmentation and parallel translation program, with colours for the tones. Mine only takes one definition for the literal translation, though, unlike yours, which only handles a single line but gives more translations.
I don't think their own lexing backend is actually open source, as Budou just relies on a choice of 3 backends (MeCab, TinySegmenter, Google NLP) to do the lexing. I'm assuming Google NLP performs the best, but that isn't free and certainly isn't open source.
I think the implementation used in Chrome is different from the ones used in Budou. The implementation in Chrome is dictionary based, as one of the parent threads mentioned, and that is completely open source, although it probably doesn't produce as good a result as their homegrown NLP stuff.
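As a very rough illustration of what "dictionary based" can mean (my own toy code and toy word list, not the actual Chrome implementation, which is considerably more elaborate), a forward longest-match breaker looks roughly like this:

```python
# Toy forward longest-match breaker over a tiny word list -- my own simplification,
# not the real thing; actual dictionary-based breakers use large dictionaries and
# smarter path selection.
WORDS = {"今日", "も", "元気", "です"}  # made-up miniature dictionary
MAX_LEN = max(len(w) for w in WORDS)

def longest_match_break(text):
    chunks, i = [], 0
    while i < len(text):
        # Take the longest dictionary word starting at position i,
        # falling back to a single character if nothing matches.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in WORDS or length == 1:
                chunks.append(candidate)
                i += length
                break
    return chunks

print(longest_match_break("今日も元気です"))  # ['今日', 'も', '元気', 'です']
```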
Korean is especially difficult. Chinese uses only hanzi and you have a limited set of logogram combinations that result in a word. Japanese is easier because they often add kana to the ends of words (e.g. for verb conjugation) and you can use that (in addition to the Chinese algorithm) to delineate words - there is no word* where a kanji character follows a kana character. Korean, on the other hand, uses only phonetic characters with no semantic component, so you just have to guess the way a human guesses.
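Here is a toy sketch of that kana/kanji boundary heuristic (my own code with a made-up example sentence; as the reply below notes, script changes alone only give coarse chunks, not a full segmentation):

```python
# Toy sketch of the kana->kanji boundary heuristic -- my own code, made-up example.
def is_kana(ch):
    return 0x3040 <= ord(ch) <= 0x30FF  # hiragana + katakana blocks

def is_kanji(ch):
    return 0x4E00 <= ord(ch) <= 0x9FFF  # common CJK unified ideographs

def split_at_kana_kanji_boundaries(text):
    """Break only where a kanji follows a kana, per the heuristic described above."""
    if not text:
        return []
    chunks, current = [], text[0]
    for prev, ch in zip(text, text[1:]):
        if is_kana(prev) and is_kanji(ch):
            chunks.append(current)
            current = ch
        else:
            current += ch
    chunks.append(current)
    return chunks

print(split_at_kana_kanji_boundaries("猫が魚を食べた"))
# ['猫が', '魚を', '食べた'] -- coarse chunks that a dictionary pass would refine further
```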
Your comment is a mix of a few right things and a lot of wrong ones, so I'll give more information so the casual reader doesn't form an incorrect opinion based on it.
- Korean segmentation is way easier than Chinese and Japanese because it uses spaces between words (the spaces thus actually delimit words, unlike in Vietnamese, which puts them on syllable boundaries; Vietnamese consequently still requires segmentation. See the toy snippet after this list)
- Chinese and Japanese segmentation are hard NLP problems that are not solved, so they are in no way "easier" than the same task for other languages
- The limited set of valid character combinations that form words in Chinese doesn't mean segmentation is easy, because there is still ambiguity in how a sentence can be split. There is still no tool that produces a "perfect" result
- Differences in script are indeed used in some segmentation algorithms for Japanese, but that doesn't solve the issue completely
- The phonetic/non-phonetic parts of Chinese characters have been used in at least one research paper (too lazy to find the reference again; it didn't work well anyway) but are not part of state-of-the-art methods. So contemporary Korean not using a lot of Hanja anymore has no influence on the difficulty of segmenting it
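To make the first point concrete, here is a toy illustration with my own example sentences (both mean "I am a student"): splitting on whitespace already gives word-level tokens in Korean, while in Vietnamese it gives syllables that still need to be grouped into words.

```python
# Toy illustration; the example sentences are my own, both meaning "I am a student".
korean = "나는 학생입니다"
vietnamese = "tôi là sinh viên"

print(korean.split())      # ['나는', '학생입니다'] -- spaces fall on word(-ish) boundaries
print(vietnamese.split())  # ['tôi', 'là', 'sinh', 'viên'] -- 'sinh viên' ("student") is
                           # one word split across two syllables, so grouping is still needed
```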
Curious where you came up with "phonetic characters with no semantic component". It's just an alphabet with spaces, with each block representing a syllable. It's easier than Latin.
I don't speak Korean, but my understanding is that some applications don't just split on spaces because of how spaces are used in loan words. The Korean for "travel bag" has a space in it, but you might want it as one token, for example. There's a fork of MeCab that has some different cost calculations related to whitespace for use with Korean.
That's interesting, but it would be nice to see the actual example. Not sure why you would use the loan word for travel bag, or what Hangul you're talking about exactly.
Sorry, "travel bag" is just an example I remember someone mentioning before. You can see the Mecab fork with example output here, but the docs are all in Korean so I can't really follow it.