> maybe you want to break down characters into their component radicals (I don't actually know this, I don't work on Chinese NLP, and have not run this theory past any chinese speakers)
No you don't, if what you want to process is text. You're right however that a big problem is the segmentation that must happen before any processing, and that cannot be done 100% correctly by software. Thus, errors compound down the chain.
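To make the compounding concrete, here's a minimal sketch using the real jieba segmenter (its output depends on its bundled dictionary, so the printed result is only indicative):

```python
import jieba

# A classically ambiguous string: it can segment as
#   南京市 / 长江大桥   ("Nanjing City / Yangtze River Bridge") or
#   南京 / 市长 / 江大桥 ("Nanjing / mayor / Jiang Daqiao").
sentence = "南京市长江大桥"

tokens = jieba.lcut(sentence)  # the segmenter commits to one reading
print(tokens)  # e.g. ['南京市', '长江大桥']

# Every downstream component (tagger, parser, language model) consumes
# these tokens as given, so a wrong split here silently propagates.
```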
I'm not "shitting" on the idea. I gave you an informed opinion as someone with Japanese & Mandarin knowledge working in NLP research.
Did you read more than the title of the paper you linked? Because the Stanford paper states:
"Results and Discussion
We consistently observed a decrease in performance (i.e. increased for perplexity) with radicals as compared to baseline, in contrast to a significant increase in performance with part-of-speech tags. [...] Such a robust trend indicates that radicals are likely not actually very useful features in language modeling"
For most tasks, you won't get more information about a word by looking at its character decomposition, in the same way that the individual letters of a lemma won't help you with the task.
There are existing use cases, however. It is useful when building dictionaries for human beings (for search, for example; I put such a tool online just yesterday) and when trying to automatically guess the reading of a character.
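To illustrate the dictionary use case, radical-indexed lookup is essentially just a map from radical to characters (the table below is a tiny hand-made sample, nothing like real dictionary data):

```python
# Toy radical index; real dictionaries file thousands of characters
# under the ~214 Kangxi radicals.
RADICAL_INDEX = {
    "水": ["河", "海", "湖"],  # water radical (written 氵 in these characters)
    "木": ["林", "森", "桜"],  # tree radical
}

def lookup_by_radical(radical: str) -> list[str]:
    """Return the characters the dictionary files under this radical."""
    return RADICAL_INDEX.get(radical, [])

print(lookup_by_radical("水"))  # ['河', '海', '湖']
```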
Arbitrarily saying "No you don't" isn't indicative of an informed opinion.
I haven't really dug into these papers, though the Stanford paper does say "This conclusion is consistent with results from part-of-speech tagging experiments, where we found that radicals of previous word are not a helpful feature, although the radical of the current word is", whereas the quote you pulled out has to do with language modeling.
Though I wouldn't consider a single negative result from before the deep learning trend took off necessarily indicative of the approach's value.
The more recent paper, on the other hand, reports a positive boost from its "hierarchical radical embeddings" over traditional word or character embeddings on four classification tasks. Not that this is necessarily meaningful either.
In my mind, the usefulness of this would be not that you get new information per se, but that you could generalize some amount of knowledge to rare or out-of-vocabulary words.
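Here's a minimal sketch of that idea, with a made-up decomposition table and toy 2-d vectors standing in for learned radical embeddings:

```python
import numpy as np

# Hypothetical data for illustration only; a real system would learn
# these vectors and take decompositions from an IDS database.
COMPONENTS = {"沐": ["水", "木"]}  # 沐 "to wash" = water + wood
component_vecs = {
    "水": np.array([1.0, 0.0]),
    "木": np.array([0.0, 1.0]),
}

def oov_embedding(char: str) -> np.ndarray:
    """Compose a vector for a character never seen during training."""
    return np.mean([component_vecs[c] for c in COMPONENTS[char]], axis=0)

print(oov_embedding("沐"))  # [0.5 0.5] -- inherits features of both parts
```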
Since you work in the field though, do you have any pointers to good papers on Chinese NLP?
I don't have general pointers, but here are a few interesting things I've read or downloaded:
- https://aclanthology.info/pdf/I/I05/I05-7002.pdf This paper makes use of radicals to build an ontology, and it does so with a stunning amount of depth (historical context, variants, etc.) that most works overlook. Too bad no data is available.
Anyway, I think that getting a fair understanding of the writing system requires learning about 600 characters in either Chinese or Japanese, plus the basics of the chosen language.