Training this pipeline was already quite expensive, so I compared against all the models and papers I could find online, but I couldn't afford to train a full new model just to check wav2vec2 with BPE.
That said, I did check against exhaustively allowing all 1-4 character tokens, which is pretty similar to BPE, and that performed worse in every situation.
Even without retraining anything, I would be curious how different the tokens produced by your technique are compared to an off-the-shelf solution like SentencePiece with the same output vocabulary size.
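If you ever want to eyeball that comparison cheaply, a rough sketch of what I mean is below: train a plain SentencePiece BPE model at the same vocab size on the same text and look at the overlap between the two token inventories. The file names, the vocab size, and the assumption that your vocabulary can be dumped to a one-token-per-line text file are all placeholders on my part, not anything from your setup.

```python
# Sketch only: compare a learned token vocabulary against a SentencePiece BPE
# baseline of the same size. Paths and VOCAB_SIZE are hypothetical.
import sentencepiece as spm

VOCAB_SIZE = 1024  # set to whatever size your technique produced

# Train a plain BPE baseline on the same text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe_baseline",
    vocab_size=VOCAB_SIZE,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_baseline.model")
# Strip SentencePiece's word-boundary marker so the comparison is on raw strings.
bpe_tokens = {sp.id_to_piece(i).lstrip("▁") for i in range(sp.get_piece_size())}

# Hypothetical dump of the learned vocabulary, one token per line.
with open("learned_vocab.txt") as f:
    learned_tokens = {line.strip() for line in f if line.strip()}

overlap = learned_tokens & bpe_tokens
union = learned_tokens | bpe_tokens
print(f"shared tokens: {len(overlap)}")
print(f"Jaccard similarity: {len(overlap) / len(union):.3f}")
print("only in learned vocab:", sorted(learned_tokens - bpe_tokens)[:20])
print("only in BPE vocab:", sorted(bpe_tokens - learned_tokens)[:20])
```

Even just the Jaccard number and a skim of the symmetric difference would say a lot about whether the two approaches are converging on the same units.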