You can't just wave your hand and tell someone that words are broken up into sub-word tokens, transformed into a numerical representation, and fed to a transformer, and expect them to understand what is happening. How is anyone supposed to understand what a transformer does without understanding what the actual inputs are (i.e., word embeddings)? Plus, those embeddings are directly related to the self-attention scores calculated in the transformer. Understanding what an embedding is, then, is extremely relevant — see the sketch below.
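
To make that concrete, here's a minimal sketch of the pipeline in Python: sub-word tokens are mapped to integer IDs, the IDs index rows of an embedding matrix, and raw attention scores fall out as dot products between those rows. The vocabulary, the `##` sub-word convention, and the random embedding matrix are all made up for illustration, not taken from any real tokenizer:

```python
import numpy as np

# Hypothetical toy sub-word vocabulary (purely illustrative)
vocab = {"trans": 0, "##form": 1, "##er": 2, "model": 3}

# The word "transformer" is split into sub-word tokens,
# then each token is mapped to its integer ID
tokens = ["trans", "##form", "##er"]
token_ids = [vocab[t] for t in tokens]  # [0, 1, 2]

# Each ID indexes a row of a learned embedding matrix
# (random here; in a real model these weights are trained)
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((len(vocab), 8))  # vocab_size x d_model
embeddings = embedding_matrix[token_ids]                 # 3 x 8

# Raw self-attention scores are scaled dot products between embeddings
# (a real transformer first applies learned query/key projections)
scores = embeddings @ embeddings.T / np.sqrt(embeddings.shape[1])
print(scores.shape)  # (3, 3) -- one score for every pair of tokens
```

That last step is the point: the numbers in the embedding rows feed straight into the pairwise scores, which is why you can't explain attention without first explaining embeddings.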