
I think you are missing the key part of the appeal here (or framing it as a negative).

Let's look at your question. Writing an equation for a cat is hard, actually really hard. Humans cannot reliably explain their decisions here. If I ask a person how they classify cat versus not-cat, the answer will invariably be something along the lines of "well, it has the general shape of a cat", which is really just a huge combination of heuristics that took about ten years to work out. There is quite a lot of work in neuroscience suggesting that the actual decision you make when you classify a cat happens before a rationale is developed.

We could encode a function for that, but it relies on us knowing a lot about cats, which takes time and only works for toy examples.

If you use a convolutional neural network, you can get close to human-level performance on much more complex tasks with little domain-specific insight. There is no universal law for classifying handwritten letters; they are an individual's interpretation of some symbols we made up. This task will always be 'non-rigorous' because the underlying thing is not actually well defined. When does a 3 become an 8?
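To make "learn the heuristics" concrete, here is a minimal numpy sketch of the one operation a CNN stacks and learns: a 2D convolution. The edge-detecting kernel is hand-set purely for illustration; in a real network those weights are exactly what gets learned from data.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: the core op a CNN stacks and learns."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy "image": a vertical bright stripe in column 2.
img = np.zeros((5, 5))
img[:, 2] = 1.0

# A hand-set vertical-edge kernel; a CNN would learn this from examples instead.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = conv2d(img, kernel)
print(response.shape)  # (3, 3)
```

The response is strongly positive just left of the stripe and strongly negative just right of it, which is the kind of low-level feature a network composes into "general shape of a cat"-style heuristics.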

So we could have a person toil away and come up with a bunch of heuristics that we encode in a regression, but why is this better than having a machine learn those heuristics? Most problems are not life or death. What is the real added value in having people hand-craft features for predicting traffic-jam delays or customer retention, when the end use is probably just a rough indication?

As somebody who does research using a huge range of models, I object to the idea that we should be guided by our intuition: our intuition is mostly wrong about non-trivial problems.

Basically any "equation" somebody has discovered for what happens in a neutron star is "simple" for a reason. Either there is a large amount of observational data, it is a consequence of some already well-proven theory, it relies on something well established to narrow the range of possible descriptions immensely, or (most commonly in my experience) the equation is basically a human version of deep learning: grad students toil away making tweaks and heuristics until the description fits the data somewhat well, and then there is some attempt to ascribe meaning after the fact.

For example, we can describe the trajectory of a comet using a "few" lines of high-school-level math. This means it is actually feasible for a person to have a reasonable intuition about what is happening, because the problem is dominated by a handful of important variables. Good luck getting anywhere near that simple a description of cats (again, a domain where the line between what is and isn't a cat is not even a property of its physical attributes, so the problem is not well defined under your requirement). To tell whether something is or is not a cat would require a DNA sequence; that is how we define the cat. So by your own definition, we do not have sufficient data in our dataset to do this classification properly.
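To give a sense of what "a few lines of high-school math" means here, this is a sketch (made-up units with GM = 1, invented starting conditions) of a bound two-body orbit integrated with leapfrog: the whole problem really is just one position vector, one velocity vector, and Newton's inverse-square law.

```python
import numpy as np

# Two-body orbit under Newtonian gravity, in units where GM = 1.
GM = 1.0
pos = np.array([1.0, 0.0])   # starting position
vel = np.array([0.0, 1.2])   # tangential speed chosen so the orbit is bound

dt = 1e-3

def accel(p):
    """Inverse-square gravitational acceleration toward the origin."""
    return -GM * p / np.linalg.norm(p) ** 3

# Leapfrog (velocity Verlet): a few lines of high-school mechanics, applied repeatedly.
for _ in range(20000):
    a = accel(pos)
    vel += 0.5 * dt * a
    pos += dt * vel
    vel += 0.5 * dt * accel(pos)

# Specific orbital energy E = v^2/2 - GM/r; E < 0 means a bound, comet-like ellipse,
# and a good integrator keeps it (nearly) constant.
energy = 0.5 * vel @ vel - GM / np.linalg.norm(pos)
print(energy)
```

The point is that a handful of variables dominate, so both the simulation and a person's intuition about it stay manageable; nothing like this exists for "is it a cat".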

I'm not sure you really understand the point you make about "statistical tests for bullshit". Most statistical tests are themselves ivory towers of theory and assumptions which nobody ever verifies in practice (which is as unscientific as anything you accuse machine learning of). And people do use well-grounded ways of evaluating machine learning models: cross-validation is very common, predates most machine learning, and has various "correctness" results.
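A bare-bones sketch of k-fold cross-validation, with a toy nearest-centroid classifier standing in for whatever model you like. The data and the classifier are invented for illustration; the procedure itself is the standard one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: class 0 near (0, 0), class 1 near (3, 3).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in "model": classify each test point by its nearest class centroid."""
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def k_fold_accuracy(X, y, k=5):
    """Plain k-fold cross-validation: hold out each fold once, train on the rest."""
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        preds = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append((preds == y[test_idx]).mean())
    return float(np.mean(scores))

acc = k_fold_accuracy(X, y)
print(acc)  # well-separated classes, so accuracy should be near 1.0
```

Nothing in `k_fold_accuracy` knows what the model is; swap the predict function for an opaque learned model and the evaluation is identical.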

For any model we build, if we do not have data that encodes some pathological behaviour to test it on, then there is no test, no statistical procedure, that can tell us the model is flawed. If we do have that data, we can run the exact same test on a black-box model.
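A toy illustration of that last point: at evaluation time any model, hand-crafted or opaque, is just a predict function, so the same check on pathological inputs runs on both. All the data, thresholds, and "models" here are invented for illustration.

```python
import numpy as np

# Edge cases we worry about: zero, a borderline value, and extreme magnitudes.
X_patho = np.array([[0.0], [30.0], [1e9], [-1e9]])
y_patho = np.array([0, 1, 1, 0])

def handcrafted_rule(x):
    """A human-designed threshold rule."""
    return (x[:, 0] > 50).astype(int)

def black_box(x):
    """Stands in for an opaque learned model."""
    return (x[:, 0] > 10).astype(int)

def failure_rate(predict, X, y):
    """The same check applies to any predict function, however it was built."""
    return float(np.mean(predict(X) != y))

hand_err = failure_rate(handcrafted_rule, X_patho, y_patho)  # 0.25: misses the borderline case
bb_err = failure_rate(black_box, X_patho, y_patho)           # 0.0
print(hand_err, bb_err)
```

Without the pathological rows in `X_patho`, neither model's flaw (or lack of one) is detectable; with them, the transparency of the model is irrelevant to running the check.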

You should not conflate science with formalism or complexity. Running a statistical test is pseudoscientific unless you do it correctly and appropriately.

Saying something is not scientific because the data may not contain enough information to fully answer the question is flat-out wrong.


