
Sorry, not really.

Training this pipeline was already quite expensive, so I compared against all the models and papers I could find online, but I couldn't afford to train a full new model just to check wav2vec2 with BPE.

That said, I did compare against a vocabulary that exhaustively allows all 1-4 character tokens, which is pretty similar in spirit to BPE, and it performed worse in every situation.
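
For concreteness, here is a minimal sketch of how that kind of exhaustive vocabulary can be built, assuming it means every character n-gram of length 1-4 that actually occurs in the training transcripts; the file name and this interpretation are my own illustration, not details from the original pipeline:

    from collections import Counter

    # Hypothetical transcript file; the real training data isn't part of this thread.
    with open("transcripts.txt") as f:
        text = f.read()

    # Collect every character n-gram of length 1 through 4 that actually occurs.
    counts = Counter(
        text[i : i + n]
        for n in range(1, 5)
        for i in range(len(text) - n + 1)
    )
    vocab = set(counts)
    print(f"{len(vocab)} candidate tokens of length 1-4")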



Even without retraining everything, I would be curious how different the tokens produced by your technique are from those of an off-the-shelf solution like SentencePiece with the same output vocabulary size.
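
One cheap way to quantify that, assuming you have the training transcripts and a dump of the pipeline's token vocabulary (the file names and vocab_size below are placeholders, not values from the paper), is to train a SentencePiece BPE model at the same vocabulary size and measure the overlap between the two token sets:

    import sentencepiece as spm

    # Train a SentencePiece BPE model at the same vocabulary size as the pipeline.
    # 'transcripts.txt' and vocab_size=1000 are illustrative placeholders.
    spm.SentencePieceTrainer.train(
        input="transcripts.txt",
        model_prefix="sp_bpe",
        vocab_size=1000,
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="sp_bpe.model")
    # Strip SentencePiece's word-boundary marker so pieces are plain strings.
    sp_tokens = {
        sp.id_to_piece(i).replace("\u2581", "") for i in range(sp.get_piece_size())
    }

    # Hypothetical dump of the pipeline's learned tokens, one per line.
    with open("pipeline_tokens.txt") as f:
        pipeline_tokens = {line.strip() for line in f if line.strip()}

    overlap = pipeline_tokens & sp_tokens
    jaccard = len(overlap) / len(pipeline_tokens | sp_tokens)
    print(f"shared tokens: {len(overlap)}, Jaccard similarity: {jaccard:.2f}")

A high Jaccard similarity would suggest the learned tokens are mostly rediscovering what BPE would pick anyway; a low one would make the technique's contribution easier to see without a full retraining run.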



