Yeah, and plain Q-learning can iteratively improve a policy in any environment, with every single loop leading to improvement, and yet it hasn't really solved much of anything in the decades since it was introduced (just some toy problems).

My point is that we don't know where the asymptote lies. Computers have had self-improving algorithms since the 60s, and since the 60s people have been making the same bold claim you are: that because an iterative process for improvement has been discovered, we're close to superhuman AI.