Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Gemini uses SentencePiece [1], and the proprietary Gemini models share the same tokenizer vocabulary as Gemma [2, 3, 4].

Out of the large proprietary western AI labs (OpenAI, Anthropic, Google), only Anthropic with Claude 3 and newer lack local tokenizers.

[1] https://github.com/google/sentencepiece

[2] https://github.com/googleapis/python-aiplatform/blob/main/ve...

[3] https://storage.googleapis.com/deepmind-media/gemma/gemma-2-...: "We inherit from the large Gemini vocabulary (256k entries)."

[4] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini 2.0."






Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: