I nearly snorted my coffee through my nose at this:
"Simulated annealing's strength is that it
avoids getting caught at local maxima ..."
It's true that SA generally avoids local maxima better than simplistic hill-climbing, but I've found that shotgun stochastic ballistic hill-climbing works at least as well.
> shotgun stochastic ballistic hill-climbing works at least as well
This is the method that I'd like a lot of these stochastic optimization algorithms compared to, especially genetic algorithms.
At least brute-force sampling of the entire space is guaranteed to find the global extremum at some point - an argument I've heard for random sampling & minimising vs. GAs for random structure searching in materials science (a very high-dimensional problem). It doesn't encode any assumptions about the surface, so it can't be optimal, but that can be a benefit if you want to find something surprising.
>> shotgun stochastic ballistic hill-climbing
>> works at least as well
> This is the method that I'd like a lot of these
> stochastic optimization algorithms compared to,
> especially genetic algorithms.
> At least brute force sampling of the entire space ...
Not sure you meant it to do so, but this sounds like you think that "shotgun stochastic ballistic hill-climbing" is a brute force sampling of the entire space. It's not, it's a stochastic optimization algorithm. Brute force sampling of the entire space, or even reasonable portions of it, is completely infeasible for the problems I deal with, so I'm not sure what you're saying here. You might like to explain a little more.
Now that I've read your other comment with the terminology, I think "shotgun hill-climbing" is the appropriate phrase for the sort of materials-science thing I'm talking about - you start atoms in random configurations, with minimal bias, then minimise using the forces on those atoms (available analytically) with gradient descent or whatever optimiser (usually BFGS) until you find local minima. There's no momentum involved, and the gradient is available, but you still start in random positions and minimise to local minima. That sort of method is in competition with other structure-finding algorithms that use GAs, SA, or other fudged hybrids.
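Schematically, something like this (a toy Python sketch - the quadratic-plus-ripples energy function is just a stand-in for a real potential-energy surface, and a real run would feed analytic forces to the optimiser):

    import numpy as np
    from scipy.optimize import minimize

    def energy(x):
        # Stand-in for a real potential-energy surface; a structure
        # search would compute this (and its forces) from the atoms.
        return np.sum(x**2) + np.sum(np.sin(3.0 * x))

    rng = np.random.default_rng(0)
    dof = 30         # degrees of freedom, e.g. 10 atoms x 3 coordinates
    starts = 50      # random starting configurations, minimal bias

    best = None
    for _ in range(starts):
        x0 = rng.uniform(-2.0, 2.0, size=dof)      # random configuration
        res = minimize(energy, x0, method='BFGS')  # descend to a local minimum
        if best is None or res.fun < best.fun:
            best = res
    print(best.fun)  # the distribution of res.fun across starts is informative too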
My reference to "brute force" was that, as far as I understand, in the limit of lots of attempts, "shotgun" will approach "brute force" -- imagine a really nasty landscape with lots of tiny basins. I understand "stochastic ballistic" modifies things slightly.
Yes to all - thanks for the clarification. The only comment is that in high-dimensional cases shotgun might theoretically approach brute force, but in practice it never does, because there are just too many places to start.
Discrete gradients are always available - you can step in random directions and re-evaluate, giving an approximation. If your space is smooth(ish) (and usually it's "smooth enough") then you can use these local approximations.
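Concretely, something like this (my own sketch of the idea; the rescaling assumes unit directions drawn uniformly at random):

    import numpy as np

    def approx_gradient(f, x, n_dirs=10, eps=1e-4, rng=None):
        # Estimate grad f(x) from a few random directional differences -
        # an SPSA-style trick for when no analytic gradient exists.
        rng = rng or np.random.default_rng()
        fx = f(x)
        g = np.zeros_like(x)
        for _ in range(n_dirs):
            d = rng.standard_normal(x.shape)
            d /= np.linalg.norm(d)                # unit direction
            slope = (f(x + eps * d) - fx) / eps   # directional derivative
            g += slope * d                        # project back onto the direction
        return g * (x.size / n_dirs)              # rescale the sampled projections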
Yes, sometimes it's not possible, but usually local random search gives enough to use the concept of "gradient" to help. Mostly I find that people doing the optimisation/search don't know enough about the techniques they're using to adapt them and transform their problem to get the best out of things.
Yes, but it still makes no sense to compare GA performance with a gradient-based method on the problem GP is talking about, because an explicit gradient is available.
It depends on the problem you're working on and the number of variables you need to optimize. If you only have <20 variables on a nice smooth surface then yeah, a shotgun approach is probably going to work - then again, pretty much anything will work on a problem of that size.
On the other hand, if you're working with >100 variables on a highly diverging surface you're going to want an iterative approach. In particular you need something that is history-dependent and avoids regions that have been previously searched. SA is great because it's simple, iterative, insanely stable, has great breadth- and depth-searching properties, and it's easily modifiable.
In comparison, approaches like GA might converge faster in some cases, but because they're more complex and system-dependent it's going to take longer to fine-tune them to your particular problem.
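For what it's worth, the SA loop itself is tiny - a generic minimising version (parameter names mine) is just:

    import math, random

    def anneal(cost, x0, neighbour, t0=1.0, cooling=0.999, steps=100000):
        # Accept any downhill move; accept uphill moves with probability
        # exp(-delta/T); cool T geometrically so the breadth search
        # gradually turns into a depth search.
        x, c = x0, cost(x0)
        best = (c, x)
        t = t0
        for _ in range(steps):
            y = neighbour(x)             # random jiggle of the current state
            d = cost(y) - c
            if d < 0 or random.random() < math.exp(-d / t):
                x, c = y, c + d
                if c < best[0]:
                    best = (c, x)
            t *= cooling
        return best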
> On the other hand if you're working with >100 variables
> on a highly diverging surface you're going to want an
> iterative approach.
I usually am, and I do - you seem to be misunderstanding the use of the word "shotgun" when accompanied by all the other words. The idea is to start in lots of places, randomly chosen, well spread, and then start climbing from all of them. As you do so you gain velocity - perhaps better thought of as minimising an error rather than maximising a fitness or "profit" function. As you drop down the surface you gain speed, changing direction according to a random sampling of the local gradients. Keep doing this and you tend to climb out of the well again, but if you then bleed off the energy you end up stuck in a deep well.
Cross-comparing the behaviours of the different starting points lets you think globally about some of the structure, but that usually only helps if it's got global structure. Usually it's better just to start with lots of points and look at the distribution of error of where they end up.
* Shotgun - start in lots of places and run simultaneously
* Ballistic - have your location include a velocity that changes as you move over the surface
* Stochastic - don't try to analyse your local surroundings deterministically - with hundreds or thousands of dimensions you can't, so you sample projections of the local gradients.
* Hill-climbing - or hill descending, if you think of it as minimising error.
In general, I find it faster than either of SA or GA.
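Here's roughly how those four pieces fit together - a minimal sketch, with hand-picked parameters and my own guess at the gradient-sampling details:

    import numpy as np

    def ssbh_descend(f, dim, starts=16, steps=2000, dt=0.01,
                     friction=0.002, n_dirs=8, eps=1e-4, seed=0):
        # Shotgun: many random, well-spread starting points.
        # Stochastic: sample the gradient from random projections.
        # Ballistic: carry a velocity; friction bleeds energy off so
        # each particle eventually settles in a (hopefully deep) well.
        rng = np.random.default_rng(seed)
        ends = []
        for _ in range(starts):
            x = rng.uniform(-3.0, 3.0, size=dim)
            v = np.zeros(dim)
            for _ in range(steps):
                fx = f(x)
                g = np.zeros(dim)
                for _ in range(n_dirs):
                    d = rng.standard_normal(dim)
                    d /= np.linalg.norm(d)
                    g += ((f(x + eps * d) - fx) / eps) * d
                g *= dim / n_dirs                  # rescale the projections
                v = (1.0 - friction) * v - dt * g  # accelerate, bleed energy
                x = x + dt * v
            ends.append((float(f(x)), x))
        ends.sort(key=lambda e: e[0])
        return ends

    f = lambda x: np.sum(x**2) + np.sum(np.sin(3.0 * x))  # bumpy bowl
    print(ssbh_descend(f, dim=200)[0][0])                 # best error found

Cross-comparing the full `ends` list across starting points is where the global structure, if any, shows up.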
Do you have a repo with your source, or an example? In my experience conjugate gradient, Powell's method, or any other line-search-based optimization is essentially useless with >100 variables or so. Optimizing anything with thousands of variables is a feat; I'd be interested to see more details of how you do it.
Everything I do in this field is commercial-in-confidence, so I can't share any of my existing code like this. I may yet write this up, get clearance, and blog about it, but that won't happen real soon.
I also found Powell's method largely useless in high-dimensional spaces, but I think that's because it's semi-static. The search point doesn't gain momentum, so it quickly gets stuck in local minima. If the search point "picks up speed" then it can coast up and back out of minima. Then you bleed off the speed at a greater or lesser rate, so the search location (which you can think of as a particle) eventually doesn't have enough energy to get out of the minimum it's in. This is Boltzmann-like behaviour - where you end up depends less on the area (measure) of your catchment, and more on the depth of your local optimum. In this sense it shares some characteristics with SA.
Hence the "ballistic" part of this.
And truth be told, in several hundred dimensions everything is a crap shoot. High-dimensional spheres are spiky, and the terrain you're exploring is effectively a mesa with wells and flagpoles. I've just had more success with SSBHD than with SA, GA, Powell's method, gradient descent, or anything else.
"Shotgun" just means "with restarts" (right?) but what do you mean by "ballistic"? "With momentum" maybe? I can't see that term used anywhere in the literature.
Sorry - how does what differ from a "learning rate"? Learning rate as used in what connection? As far as I can see there is no connection between what I've described and anything I will call a learning rate in any of the other methods, so you'll have to be more explicit.
Ah, I meant your notion of velocity/ballistic. I'm thinking of the learning rate parameter you might use to control how quickly a gradient descent algorithm converges for some machine learning algorithm.
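In update-rule form, the difference I mean is (a toy 1-D sketch, notation mine):

    def grad(x):              # toy gradient, for f(x) = x**2
        return 2.0 * x

    lr, mu = 0.1, 0.9
    x_plain = x_ball = 5.0
    v = 0.0
    for _ in range(100):
        x_plain -= lr * grad(x_plain)   # learning rate scales each step directly
        v = mu * v - lr * grad(x_ball)  # here it feeds a persistent velocity,
        x_ball += v                     # which mu < 1 slowly bleeds off
    print(x_plain, x_ball)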
But all these things work the same way:
* Do something random and see how good it is;
* Jiggle it at random and see if it's better;
* Consider changing.
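In skeleton form (my own rendering of those three bullets):

    import random

    def stochastic_search(score, x, jiggle, accept, steps=10000):
        # The shared skeleton: what distinguishes the family members is
        # the accept rule - greedy hill-climbing takes only improvements,
        # SA takes worse moves with temperature-dependent probability, etc.
        s = score(x)             # do something random, see how good it is
        for _ in range(steps):
            y = jiggle(x)        # jiggle it at random
            t = score(y)         # see if it's better
            if accept(t, s):     # consider changing
                x, s = y, t
        return x, s

    best, best_score = stochastic_search(
        score=lambda x: -(x - 3.0) ** 2,              # toy objective (maximise)
        x=random.uniform(-10.0, 10.0),
        jiggle=lambda x: x + random.gauss(0.0, 0.1),
        accept=lambda new, old: new > old,            # greedy hill-climbing
    )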