My guess, which could be completely wrong, is that Anthropic spent more resources on interpretability and it's paying off.
I remember when I first started using activation maps while building image classification models, and it was like, what on earth was I doing before this... just blindly trusting the loss.
How do you discover biases and issues with training data without interpretability?
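For anyone curious what "activation maps" means in practice here, a rough sketch of the Grad-CAM-style approach I'm describing (this assumes PyTorch/torchvision; the model, layer choice, and random input are just illustrative, not anything specific from my projects):

```python
# Minimal Grad-CAM-style activation map sketch.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0].detach()

# Hook the last conv block; a common default choice, not the only option.
layer = model.layer4[-1]
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

img = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed input image
logits = model(img)
logits[0, logits.argmax()].backward()  # gradient of the top-class score

# Weight each feature map by its average gradient, then ReLU and normalize.
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```

Overlay that heatmap on the input and you immediately see whether the model is looking at the object or at some spurious background cue, which is exactly where training-data biases tend to show up.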