>Generally we prefer to find a _global_ minimum of the error function because then we can expect the resulting approximator to generalise better to data that was not available during training.
Sorry to nitpick, but is this true? We are doing optimization here and a global minimum is just a better solution than a non-global minimum. Is there a connection to generalisation here?
It's like cpgxiii says. You're right to nitpick though, because there are no certainties. We optimise on a set of data sampled from a distribution that is probably not the real distribution, so there's some amount of sampling error. Even if we find the global optimum on our sampled data, there's no reason why it's going to be close to the global optimum on our testing data.
But - there are some guarantees. Under PAC-Learning assumptions we can place an upper bound on the expected error as a function of the number of training examples and the size of the hypothesis space (the set of possible models). The maths is in a paper called Occam's Razor: https://www.sciencedirect.com/science/article/abs/pii/002001...
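To make that concrete: the bound in question is essentially the standard finite-hypothesis-class PAC result (the Occam's Razor paper generalises it to hypotheses of bounded description length). Here's a back-of-envelope sketch; the function name is mine, not from the paper:

```python
import math

def occam_sample_bound(hyp_space_size, epsilon, delta):
    """Number of training examples sufficient so that, with probability
    at least 1 - delta, any hypothesis consistent with the training set
    has true error at most epsilon (finite hypothesis class version)."""
    return math.ceil((math.log(hyp_space_size) + math.log(1 / delta)) / epsilon)

# e.g. a hypothesis space of 2**20 models, 5% error, 99% confidence:
m = occam_sample_bound(2**20, epsilon=0.05, delta=0.01)  # 370 examples
```

Note how the bound grows only logarithmically in the size of the hypothesis space, which is why it stays useful even for very large model classes.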
Unfortunately, PAC-Learning presupposes that the sampling distribution is the same as the real distribution, which, as I said above, we can't know for sure.
In any case, I think most people would agree that a model that can reach the global minimum of training error on a large dataset has a better chance of reaching the global minimum of generalisation error (i.e. in the real world) than a model that gets stuck in local minima on the training data. Modulo assumptions.
In an ML context, a global optimum often corresponds to better performance and (hopefully) a more general solution.
In a motion planning or controls context, a local minimum can often mean a configuration or path that is infeasible due to collision, or wildly inefficient.
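The local-vs-global distinction is easy to see even in one dimension. A toy sketch (all names and constants are mine, chosen just for illustration): plain gradient descent on a quartic with two minima converges to whichever basin it starts in, so one start point finds the global minimum and the other gets stuck in the local one.

```python
def f(x):
    # Quartic with a global minimum near x = -1.30
    # and a shallower local minimum near x = 1.13.
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x_global = gradient_descent(-2.0)  # lands in the global minimum's basin
x_local = gradient_descent(2.0)    # gets stuck in the local minimum
```

Both runs satisfy the first-order optimality condition, but `f(x_local)` is noticeably worse than `f(x_global)`, which is the whole point of the distinction.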