You are absolutely correct. In fact, most NLP software ignores document formatting, which conveys a lot of information as well. For example, section headings must be treated differently from the text that makes up the body of a section. It's very hard even to identify section headings, and then it's hard to take advantage of them, since the big transformer models simply accept a flat stream of unspecialized tokens.