When I looked into this briefly, my impression was that it's extremely hard to do mechanistic interpretability beyond very simple cases like CNN classification or toy problems like arithmetic in transformers. Not to say it isn't a worthy pursuit, but for many researchers the difficulty is hard to justify, since the results won't make the kind of splash a new model-training result does.
Yeah, it is harder than other things, but if we can train a model to explain collections of pixels in human language, then we might be able to do something similar with collections of activations.
I don't know if that's the right direction; it's just an example that comes to mind easily.
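Very roughly, the analogy I have in mind looks something like the sketch below: just as an image-captioning model maps pixels to text, you could map a layer's activations to a short description. Everything here is made up for illustration (the `ActivationCaptioner` name, the toy subject model, the tiny GRU decoder); it assumes PyTorch and isn't any existing method.

```python
import torch
import torch.nn as nn

class ActivationCaptioner(nn.Module):
    """Maps an activation vector to caption-token logits, analogous to
    captioning an image, but over activations instead of pixels."""
    def __init__(self, act_dim: int, vocab_size: int, hidden: int = 256, max_len: int = 16):
        super().__init__()
        self.max_len = max_len
        self.encode = nn.Linear(act_dim, hidden)       # embed the activation vector
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab_size)  # logits over caption tokens

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, act_dim), captured from some layer via a forward hook
        h = torch.tanh(self.encode(activations))                  # (batch, hidden)
        # Simplest possible decoder: feed the encoded activation at every step.
        steps = h.unsqueeze(1).repeat(1, self.max_len, 1)         # (batch, max_len, hidden)
        out, _ = self.decoder(steps, h.unsqueeze(0).contiguous())
        return self.to_vocab(out)                                 # (batch, max_len, vocab)

if __name__ == "__main__":
    # Toy "subject" model; a forward hook captures one layer's activations.
    subject = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    captured = {}
    subject[1].register_forward_hook(lambda m, i, o: captured.update(act=o.detach()))
    _ = subject(torch.randn(8, 32))                  # run the model; hook grabs activations
    captioner = ActivationCaptioner(act_dim=64, vocab_size=1000)
    logits = captioner(captured["act"])              # (8, 16, 1000) caption-token logits
    print(logits.shape)
```

You'd still need (activation, explanation) pairs to train the captioning part on, which is of course the hard bit.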
If someone figures out how to do this, I think their models will be far more capable and reliable.