You are absolutely correct. In fact, most NLP software ignores document formatting, which also conveys a lot of information. For example, section headings must be treated differently from the text that makes up the body of a section. It's very hard even to identify section headings, and then it's hard to take advantage of them, since the big transformer models simply accept a flat stream of undifferentiated tokens.