More

osanseviero · 2025-05-21T05:02:45 1747803765

Hi! The model is 8B if you also load the vision and audio components. We just used the text model in LMArena.

lostmsu · 2025-05-21T19:47:06 1747856826

Are vision and audio components available yet?

osanseviero · 2025-04-20T15:14:38 1745162078

Hi! Omar from the Gemma team here.

Last time we only released the quantized GGUFs. Only llama.cpp users could use it (+ Ollama, but without vision).

Now, we released the unquantized checkpoints, so anyone can quantize themselves and use in their favorite tools, including Ollama with vision, MLX, LM Studio, etc. MLX folks also found that the model worked decently with 3 bits compared to naive 3-bit, so by releasing the unquantized checkpoints we allow further experimentation and research.

TL;DR. One was a release in a specific format/tool, we followed-up with a full release of artifacts that enable the community to do much more.

oezi · 2025-04-20T15:24:25 1745162665

Hey Omar, is there any chance that Gemma 3 might get a speech (ASR/AST/TTS) release?

osanseviero · 2025-03-12T15:23:13 1741792993

Please make sure to update to the latest llama.cpp version

osanseviero · on Dec 28, 2024

Nat Friedman leads the project. He was GitHub's CEO, among many other things. He funds many interesting ambitious projects, such as the Vesuvius Challenge (https://scrollprize.org/)

sizzle · on Dec 30, 2024

Really cool thanks for pointing me in the right direction

osanseviero · on Nov 14, 2024

Yes, they are still used

- Encoder based models have much faster inference (are auto-regressive) and are smaller. They are great for applications where speed and efficiency are key. - Most embedding models are BERT-based (see MTEB leaderboard). So widely used for retrieval. - They are also used to filter data for pre-training decoder models. The Llama 3 authors used a quality classifier (DistilRoberta) to generate quality scores for documents. Something similar is done for FineWeb Edu

itchyjunk · on Nov 14, 2024

Wait, I thought GPT's were autoregressive and encoder only like BERT used masked tokens? You're saying BERT is auto-regressive or am I misunderstanding?

woadwarrior01 · on Nov 14, 2024

You're right. Encoder only models like BERT aren't auto-regressive and are trained with the MLM objective. Decoder only (GPT) and encoder-decoder (T5) models are auto-regressive and are trained with the CLM and sometimes the PrefixLM objectives.

ipsum2 · on Nov 14, 2024

You can mask out the tokens at the end, so its technically autoregressive.

osanseviero · on July 23, 2024

Yes, there are a few dozen full open source models (license, code, data, models)

blackeyeblitzar · on July 23, 2024

What are some of the other ones? I am aware mainly of OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...)

osanseviero · on July 16, 2024

Hi all! I'm Omar from Hugging Face. Happy to answer any questions you might have about Hugging Face in general, llamas, and open ML!

brianjking · on July 16, 2024

I'd love to work at Hugging Face! Happen to be hiring any new dev relations/product/AI engineer type roles?

This is such a great piece.

ZoomerCretin · on July 16, 2024

Could you talk more about HuggingFace's new benchmark for LLMs? When did it become obvious that the old benchmarks were no longer sufficient:

swyx · on July 16, 2024

[author here] we interviewed the maintainer of that leaderboard if you want to hear from her directly! https://www.latent.space/p/benchmarks-201

tldr: old benchmarks saturated, methodology was liable to a lot of subtle biases. as she mentions on the pod, they're already working on leaderboard v3.

Hooray_Darakian · on July 16, 2024

How large is the staff at hugging face?

abidlabs · on July 16, 2024

We have ~220 total team members across all roles

osanseviero · on April 12, 2024

Zephyr 141B is a Mixtral 8x22B fine-tune. Here are some interesting details

- Base model: Mixtral 8x22B, 8 experts, 141B total params, 35B activated params

- Fine-tuned with ORPO, a new alignment algorithm with no SFT step (hence much faster than DPO/PPO)

- Trained with 7K open data instances -> high-quality, synthetic, multi-turn

- Apache 2

Everything is open:

- Final Model: https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v...

- Base Model: https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1

- Fine-tune data: https://huggingface.co/datasets/argilla/distilabel-capybara-...

- Recipe/code to train the model: https://huggingface.co/datasets/argilla/distilabel-capybara-...

- Open-source inference engine: https://github.com/huggingface/text-generation-inference

- Open-source UI code https://github.com/huggingface/chat-ui

Have fun!

loudmax · on April 12, 2024

I like that they say how the model was trained for 1.3 hours on 4 nodes of 8 x H100s. By my rough calculation, that should probably have cost around $100 or so. (At $2 per hour, x 8 gpus x 4 nodes). Not free, but pretty cheap in the scheme of things. At least, once you know what you're doing.

dloss · on April 12, 2024

I wanted to write that TGI inference engine is not Open Source anymore, but they have reverted the license back to Apache 2.0 for the new version TGI v2.0: https://github.com/huggingface/text-generation-inference/rel...

Good news!

leblancfg · on April 12, 2024

What does ORPO stand for? Can't seem to find related links.

cateye · on April 12, 2024

Odds Ratio Preference Optimization (ORPO): https://arxiv.org/abs/2403.07691

osanseviero · on March 17, 2024

The model is also at https://huggingface.co/xai-org

osanseviero · on March 5, 2024

The dataset has

- 2 million patches

- 1068x1068 pixel patches

- 2.5 trillion pixels

Read more in https://huggingface.co/posts/aliFrancis/293058125194160