This isn't the place for such a debate. There needs to be a proper legal inquiry into the massacre; not trial by media.
I wasn't making any assertions of my own as to who was responsible; I was just pointing out that responsibility has not yet been properly determined and thus should not be asserted as fact as Pueyo is doing here. And I'm not being pedantic; this is an extremely important point. Many propagandistic stories are pushed during a war. We should be careful not to buy into them.
As dimensionality grows, data becomes more linearly separable, so fewer dimensions are significant in distinguishing the data (though for any pair of points, those dimensions will be different).
Gradient descent, as an exhaustive search over all those dimensions, "may be" less performant in very high dimensions, where we have enough data to be "right some of the time" when randomly choosing which dimension to discriminate on.
If we force zero loss, i.e., an interpolation regime, then we're getting interpolation as usual. Can we get there faster as dimensionality increases?
It's plausible if count(relevant dimensions) << count(dimensions), and if the set of discriminating dimensions for any two random points is itself random.
That does not make sense; can you expand? The steepest descent direction is far more useful than a random direction in high dimensions, because as the dimensionality grows a random direction is increasingly likely to be nearly orthogonal to the steepest descent direction.
The random directions don't have unit length. They are drawn from a multivariate normal. They have equal variance along all directions, including the direction of the gradient. The descent along the orthogonal directions somehow cancels out. The descent along the gradient direction somehow becomes more efficient; I don't know why yet.
I agree, in expectation, the orthogonal directions cancel out. But the variance does not cancel out. Are you using some sort of momentum-like optimizer?
Oh, you're talking about using the forward gradient method. This still does not make much sense because if the projection of the gradient in that direction is zero or near zero, that is a very weak step in reducing the loss.
Ok, to recap: this whole article that we are discussing here is about the forward "gradient" method. Which, despite the name, only calculates a single directional derivative. If this were all, that would be a bit interesting, but not much else.
The article goes on to show how to make use of this in optimization. I called their descent algorithm the "random direction descent", because that's what it is. You choose a random direction from a multivariate normal distribution with zero mean and identity covariance matrix (section 3.4). You choose a "step size" eta. Then you go along the chosen direction by eta times the directional derivative (times -1, so you go "downhill"). That is all explained at the top of page 4, Algorithm 1.
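To make the recap concrete, here is a minimal numpy sketch of that update rule on a toy quadratic loss. The loss, the dimensionality, and eta are my own illustrative choices; note also that the paper obtains the directional derivative via forward-mode AD without ever forming the full gradient, which I only do here because the toy loss makes it trivial:

    import numpy as np

    def forward_gradient_step(theta, grad_fn, eta, rng):
        # One step of the random direction descent recapped above:
        # draw v ~ N(0, I), compute the directional derivative grad . v,
        # then move by -eta * (grad . v) * v (Algorithm 1, top of page 4).
        v = rng.standard_normal(theta.shape)   # random direction, N(0, I)
        dir_deriv = grad_fn(theta) @ v         # directional derivative along v
        return theta - eta * dir_deriv * v     # step "downhill" along v

    # Toy quadratic: f(theta) = 0.5 * ||theta||^2, so grad(theta) = theta.
    rng = np.random.default_rng(0)
    theta = rng.standard_normal(100)
    for _ in range(1000):
        theta = forward_gradient_step(theta, lambda t: t, eta=0.01, rng=rng)
    print(np.linalg.norm(theta))  # should end up far below the starting norm (~10)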
Why does this work if, as you correctly noticed, the chosen direction is almost perpendicular to the gradient? (One of the "features" of high-dimensional spaces: most of the volume of a unit sphere is concentrated near the equator.)
The answer is this: if you split the chosen random direction into the component along the gradient and the orthogonal component, then the first one has variance 1 and the second has variance (N-1), since the overall variance is N, where N is the number of dimensions. The component along the gradient makes you move downhill, while the orthogonal component makes you move level-wise. The orthogonal component doesn't make things any better or worse.
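A quick numerical check of that split, with an arbitrary fixed unit vector standing in for the gradient direction (my own toy setup, not from the paper):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 1000
    g_hat = np.zeros(N)
    g_hat[0] = 1.0                               # unit vector standing in for the gradient direction

    v = rng.standard_normal((10000, N))          # 10k random directions from N(0, I)
    along = v @ g_hat                            # component of each direction along the gradient
    orth_sq = (v ** 2).sum(axis=1) - along ** 2  # squared norm of the orthogonal component

    print(np.var(along))                 # ~1: variance along the gradient direction
    print(orth_sq.mean())                # ~(N-1): mean squared norm of the orthogonal part
    print((v ** 2).sum(axis=1).mean())   # ~N: total, as stated above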
Another commenter in this thread claims that this method adds noise, and the noise is sometimes so high that the method is useless. Well, that's not what I observed.
Why did I observe that the random direction descent outperforms the gradient descent? With the argument so far, you'd expect it to do no worse, but why does it do better? This I don't have an explanation for, but maybe it's the fact that you don't have a constant step size, but rather an effective step size with variance 1. With classical gradient descent, as you approach the minimum, the step becomes smaller and smaller. With this one, it still becomes smaller and smaller on average, but sometimes it's a bit bigger and sometimes even smaller. It appears there's some gain from the extra stochasticity, so you end up reaching the minimum faster. I'm not sure why, but that's what I observed.
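For anyone who wants to poke at this themselves, this is the kind of toy comparison I would run. The quadratic loss, the dimensionality, eta, and the step count are all arbitrary choices of mine, so whichever method wins on this particular setup shouldn't be over-interpreted:

    import numpy as np

    def loss(theta):
        return 0.5 * float(theta @ theta)   # toy quadratic; its gradient is just theta

    rng = np.random.default_rng(2)
    theta0 = rng.standard_normal(500)
    eta, steps = 1e-3, 2000

    # Plain gradient descent.
    theta_gd = theta0.copy()
    for _ in range(steps):
        theta_gd = theta_gd - eta * theta_gd

    # Random direction ("forward gradient") descent, as recapped above.
    theta_fg = theta0.copy()
    for _ in range(steps):
        v = rng.standard_normal(theta_fg.shape)
        theta_fg = theta_fg - eta * (theta_fg @ v) * v

    print("GD  final loss:", loss(theta_gd))
    print("FGD final loss:", loss(theta_fg))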
Can a viable business be made around targets/victims sending their phishing attacks to a reverse engineering lab? The more valuable the target, the more likely an unknown and valuable exploit might be surfaced. As a consequence, this increases the expected cost of attacking these targets/victims.
But if we simply take away consequences, then no one has any reason not to steal, and more people will steal. It may not deter everyone, but fewer people will be willing to gamble if the odds of facing consequences go up.
The alternative is to view it as inevitable and therefore do nothing. But then those not doing it fall behind those who are, and suddenly those who weren't stealing before feel like losers for holding back: no one was facing any consequences, and because they held back they are now worse off.
Can someone provide some explicit search queries so we can see the bad examples? Lots of criticism is being doled out in that thread without an actual example to see for myself.
Agree that the Reddit thread is more relevant to me because it has multiple perspectives in a discussion format, but I don't see any particularly bad links in the first search. The first one just has more conventional review sites like GoodHousekeeping and Forbes.
2) It does not appear the person writing the review actually purchased any of the knives. The article is an assembly of quotes from chefs. While chefs do have domain knowledge, and probably sometimes have good opinions, there's no way to tell whether any of them are sponsored by different knife brands. It's not a REVIEW comparing the actual products; it's somebody who googled chef knife quotes and cut/pasted a page together. The review itself does NOTHING to actually compare the products to each other.
That's how I felt at first, but getting deeper into the Swin transformer paper, it actually makes a fair bit of sense - convolutions can be likened to self-attention ops that can only attend to local neighborhoods around pixels. That's a fairly sensible assumption for image data, but it also makes sense that more general attention would better capture complex spatial relationships if you can find a way to make it computationally feasible. Swin transformers certainly go through some contortions to get there, and I bet we'll see cleaner hierarchical architectures in the future, but the results speak for themselves.
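Here's a rough 1D toy of that "attention restricted to a local neighborhood" idea, in numpy. The shapes, the identity projections, and the masking radius are my own simplifications; a real Swin block uses shifted 2D windows and learned Q/K/V projections:

    import numpy as np

    def local_self_attention(x, radius):
        # Self-attention over a 1D "image" of shape (length, channels) where
        # each position may only attend to neighbors within `radius`,
        # mimicking the local receptive field of a convolution.
        L, C = x.shape
        q, k, v = x, x, x                                   # identity projections for brevity
        scores = q @ k.T / np.sqrt(C)                       # (L, L) attention logits
        idx = np.arange(L)
        mask = np.abs(idx[:, None] - idx[None, :]) > radius
        scores = np.where(mask, -np.inf, scores)            # forbid attending outside the window
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ v                                  # each output mixes only its neighborhood

    x = np.random.default_rng(3).standard_normal((16, 8))
    print(local_self_attention(x, radius=2).shape)          # (16, 8)

The difference from a plain convolution is that the mixing weights here depend on the content of the window rather than being fixed learned kernels, which is roughly where the extra expressiveness comes from.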
The transformer in transformer (TnT) model looks promising - you can set up multiple overlapping domains of attention, at arbitrary scales over the input.
Not as much as you'd think. The original paper sets up its models so that Swin-T ~ ResNet-50 and Swin-S ~ ResNet-101 in compute and memory usage. They're still a bit higher in my experience, but I can also do drop-in replacements for ResNets and get better results on the same tasks and datasets, even when the datasets aren't huge.
For me it was quite the opposite feeling: after the "Attention Is All You Need" paper, I thought that convolutions would become obsolete quite fast. AFAIK it still hasn't happened completely; something is still missing in unifying the two approaches.