cracker_jacks's comments

"College became default"


> The Russians are suggesting that the killings were of 'Russian friendly' Ukrainians, killed after the troops left.

Satellite imagery taken before Russian troops left shows the massacre happened before the withdrawal. https://twitter.com/nytimes/status/1511060668610457612


This isn't the place for such a debate. There needs to be a proper legal inquiry into the massacre, not a trial by media.

I wasn't making any assertions of my own as to who was responsible; I was just pointing out that responsibility has not yet been properly determined and thus should not be asserted as fact as Pueyo is doing here. And I'm not being pedantic; this is an extremely important point. Many propagandistic stories are pushed during a war. We should be careful not to buy into them.


> The more dimensions, the simpler the surface. With enough dimensionality, everything is a straight line.

Even assuming this statement were true (which is a big if), that does not explain how it would outperform gradient descent.


As dimensionality grows, data becomes more linearly separable, so fewer dimensions are significant in distinguishing the data (though for any pair of points, those dimensions will be different).

Gradient descent, as an exhaustive search over all those dimensions, may be less performant in very high dimensions, where we have enough data to be "right some of the time" when randomly choosing which dimension to discriminate on.

If we force zero loss, i.e., an interpolation regime, then we're getting interpolation as usual. Can we get there faster when dimensionality increases?

It's plausible if count(relevant dimensions) << count(dimensions), and if the discriminating dimensions for any two random points are themselves random.
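
A rough sketch of the separability claim, under assumptions of my own (Gaussian data, purely random labels, sklearn's LinearSVC; the accuracy reported is training accuracy, so this only illustrates separability, nothing more):

    # Minimal sketch: random points with random labels become linearly
    # separable as dimensionality grows (number of points held fixed).
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n_points = 100

    for dim in (2, 10, 50, 200):
        X = rng.standard_normal((n_points, dim))
        y = rng.integers(0, 2, n_points)           # labels are pure noise
        clf = LinearSVC(C=1e6, max_iter=100000).fit(X, y)
        acc = clf.score(X, y)                      # training accuracy
        print(f"dim={dim:4d}  training accuracy={acc:.2f}")
    # Expect the accuracy to climb toward 1.0 once dim approaches n_points.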


I believe what you are describing is grid search, not gradient descent. Gradient descent is not an exhaustive search.


It is if you see vec_a . vec_b as a traversal over the dimensions of both vectors.


That does not make sense; can you expand? The steepest descent direction is far more useful than a random direction in high dimensions, because that random direction has a greater chance of being nearly orthogonal to the steepest descent direction.


The random directions don’t have unit length. They are drawn from a multivariate normal. They have equal variance along all directions, including the direction of the gradient. The descent along the orthogonal directions somehow cancels out. The descent along the gradient direction somehow becomes more efficient; I don’t know why yet.


I agree, in expectation, the orthogonal directions cancel out. But the variance does not cancel out. Are you using some sort of momentum-like optimizer?


> But the variance does not cancel out.

They are multiplied by the derivative in that direction, which is zero.


Oh, you're talking about using the forward gradient method. This still does not make much sense because if the projection of the gradient in that direction is zero or near zero, that is a very weak step in reducing the loss.


OK, to recap: the whole article we are discussing here is about the forward "gradient" method, which, despite the name, only calculates a single directional derivative. If this were all, it would be a bit interesting, but not much else.

The article goes on to show how to make use of this in optimization. I called their descent algorithm "random direction descent", because that's what it is. You choose a random direction from a multivariate normal distribution with zero mean and identity covariance matrix (section 3.4). You choose a "step size" eta. Then you go along the chosen direction by eta times the directional derivative (times -1, so you go "downhill"). That is all explained at the top of page 4, Algorithm 1.
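
In code, that step would look roughly like this (a sketch, not the authors' implementation; the toy quadratic loss and the step size eta are placeholders of mine):

    # Sketch of the random-direction ("forward gradient") descent step
    # described above, using JAX's forward-mode jvp for the directional
    # derivative.
    import jax
    import jax.numpy as jnp

    def loss(theta):
        return jnp.sum(theta ** 2)  # toy objective, not from the paper

    def forward_gradient_step(theta, key, eta=0.01):
        v = jax.random.normal(key, theta.shape)       # v ~ N(0, I), section 3.4
        _, dir_deriv = jax.jvp(loss, (theta,), (v,))  # directional derivative <grad, v>
        return theta - eta * dir_deriv * v            # step along -v * <grad, v>

    theta = jnp.ones(1000)
    key = jax.random.PRNGKey(0)
    for _ in range(500):
        key, sub = jax.random.split(key)
        theta = forward_gradient_step(theta, sub)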

Why does this work if, as you correctly noticed, the chosen direction is almost perpendicular to the gradient (one of the "features" of high-dimensional spaces: most of the volume of a unit sphere is close to the equator)?

The answer is this: if you split the chosen random direction into the component along the gradient and the orthogonal component, then the first one has variance 1 and the second variance (N-1), since the overall variance is N, where N is the number of dimensions. The gradient component makes you move downhill, while the orthogonal component makes you move level-wise. The orthogonal component doesn't make things any better or worse.
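
A quick numpy check of that split (the unit "gradient" direction here is arbitrary, chosen only for illustration):

    # For v ~ N(0, I_N): the component along a fixed unit vector g has
    # variance 1, and the orthogonal remainder carries total variance N - 1.
    import numpy as np

    rng = np.random.default_rng(0)
    N, samples = 1000, 20000
    g = rng.standard_normal(N)
    g /= np.linalg.norm(g)                        # fixed unit "gradient" direction

    V = rng.standard_normal((samples, N))         # rows are v ~ N(0, I_N)
    along = V @ g                                 # scalar components along g
    orth = V - np.outer(along, g)                 # orthogonal remainders

    print(np.var(along))                          # ~ 1
    print(np.mean(np.sum(orth ** 2, axis=1)))     # ~ N - 1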

Another commenter in this thread claims that this method adds noise, and the noise is sometimes so high that the method is useless. Well, that's not what I observed.

Why did I observe that the random direction descent outperforms gradient descent? With the argument so far, you'd expect it to do no worse, but why does it do better? This I don't have an explanation for, but maybe it's the fact that you don't have a constant step size, but rather a step size that has variance 1. With classical gradient descent, as you approach the minimum, the step becomes smaller and smaller. With this one, it still becomes smaller and smaller on average, but sometimes it's bigger and sometimes smaller. It appears there's some gain from the extra stochasticity, so you end up reaching the minimum faster. I'm not sure why, but that's what I observed.


Yes, I am certain the background this particular website provides will be super objective and unbiased.....


Can a viable business be made around targets/victims sending their phishing attacks to a reverse engineering lab? The more valuable the target, the more likely an unknown and valuable exploit might be surfaced. As a consequence, this increases the expected cost of attacking these targets/victims.


Like https://www.paypal.com/us/security/report-suspicious-message... ?

> Forward suspicious email to spoof@paypal.com


lol, 'even', I find HN in particular loves pitchforks.


These are some topics which are a lightning rod for this, and I’ve seen some in the last day:

Green political parties who don’t support nuclear power.

Comments describing positive aspects of covid driven lockdowns.


Don't forget systemd.


Is "there's someone wrong on the internets" the same as pitchforks?


"Air tag" like trackers on decoy packages mixed in might be a cheap solution to this issue.


They don't really disincentivize the behaviour though. Dye packs are a thing; people still rob banks.

People who feel they have nothing to lose don't care if they get caught either.

The solution is to provide people with a decent life by default. They can then contribute to society comfortably.


But if we simply take away consequences now, no one has any reason not to steal, and then more people will steal. It may not deter all people, but fewer people will be willing to gamble if the odds of facing consequences go up.

The alternative is to view it as inevitable and therefore do nothing. But then those not doing it fall behind those doing it, and suddenly those who weren't stealing before feel like losers for holding back: no one was facing any consequences, and because they held back they are now worse off.


Can someone provide some explicit search queries so we can see the bad examples? Lots of criticism is being doled out in that thread without an actual example to see for myself.


Try “best kitchen knife set” and compare it to “best kitchen knife set reddit”


Agree that the reddit result is more relevant to me because it has multiple perspectives in a discussion format, but I don't see any particularly bad links with the first search. The first one just has more conventional review sites like GoodHousekeeping and Forbes.


I believe I found the Forbes article you mentioned. It's exactly the kind of thing that one wouldn't want to bubble to the surface.

1) It's by a Forbes contributor, who makes listicles for a living. https://www.forbes.com/sites/forbes-personal-shopper/people/...

2) It does not appear the person writing the review actually purchased any of the knives. The article is an assembly of quotes from chefs. While chefs do have domain knowledge, and probably sometimes have good opinions, there's no way to tell if any are sponsored by different knife brands. It's not a REVIEW comparing the actual products; it's somebody who googled chef knife quotes and cut/pasted a page together. The review itself does NOTHING to actually compare the products to each other.


Transformers subverting convolution on their own turf (vision) was certainly unexpected.


That's how I felt at first, but getting deeper into the Swin Transformer paper it actually makes a fair bit of sense - convolutions can be likened to self-attention ops that can only attend to local neighborhoods around pixels. That's a fairly sensible assumption for image data, but it also makes sense that more general attention would better capture complex spatial relationships if you can find a way to make it computationally feasible. Swin transformers certainly go through some contortions to get there, and I bet we'll see cleaner hierarchical architectures in the future, but the results speak for themselves.
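
To make the analogy concrete, here's a toy 1-D numpy sketch (my own illustration, not Swin itself): masking self-attention to a local window makes it act like a content-dependent, convolution-style operator.

    # Self-attention restricted to a local window around each position.
    import numpy as np

    def local_self_attention(x, window=3):
        # x: (seq_len, dim) token/pixel features
        seq_len, dim = x.shape
        scores = x @ x.T / np.sqrt(dim)                   # all-pairs attention scores
        idx = np.arange(seq_len)
        mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
        scores = np.where(mask, scores, -np.inf)          # attend only to neighbors
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ x                                # convolution-like local mixing

    out = local_self_attention(np.random.randn(16, 8))
    print(out.shape)   # (16, 8)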


The transformer in transformer (TnT) model looks promising - you can set up multiple overlapping domains of attention, at arbitrary scales over the input.


But you have to pay the price of losing the inductive bias of CNNs.

Swin transformers are still CPU/memory (and data) intensive compared to CNNs, right?


Not as much as you'd think. The original paper sets up its models so that Swin-T ~ ResNet-50 and Swin-S ~ ResNet-101 in compute and memory usage. They're still a bit higher in my experience, but I can also do drop-in replacements for ResNets and get better results on the same tasks and datasets, even when the datasets aren't huge.


For me it was quite the opposite feeling: after the "Attention Is All You Need" paper, I thought convolutions would become obsolete quite fast. AFAIK it still hasn't happened completely; something is still missing in unifying the two approaches.

