Statistical Computing for Scientists and Engineers (zabaras.com)
186 points by mubaris on Dec 31, 2017 | 32 comments


I’ve got this one and a few others on a Github repo:

https://github.com/melling/MathAndScienceNotes/tree/master/s...

I haven’t listened to the Notre Dame course yet. How would others rate it?

I have gone through the UC Berkeley 131A class. I have high-level notes:

https://github.com/melling/MathAndScienceNotes/blob/master/s...

and detailed notes in PDF for the first four classes:

https://github.com/melling/MathAndScienceNotes/blob/master/s...

Actually, I wrote this up in a blog post yesterday:

https://h4labs.wordpress.com/2017/12/30/learning-probability...


Thank you so much for sharing this!!!


This lecture course appears to attempt to cover way, way too much material in each lecture.


The professor himself probably agrees with you. In lecture 18 he states:

At 3:40 - For this topic he actually has 6 or 7 lectures and 100 pages of notes, but it's compressed into an introduction.

At 7:10 - He states HMMs would normally take 2-3 lectures, but he's going to compress them into half a lecture.

Honestly, I had courses like this in grad school. They were typically seminars and usually graded on a curve because people crammed stuff into their brains as quickly as possible and barely understood anything! They were meant to give you broad coverage of a field, not comprehensively cover any particular set of topics.


Cryptography in undergrad was a course like that at my school. Here's the most basic understanding of group theory we can possibly get away with; here's how it applies to RSA. Next lecture, now we're moving on to elliptic curves...

Perhaps some topics are simply too big to really teach effectively through lectures. You need to go digging on your own to understand all the myriad details, and put in the hours alone with a textbook.


The point of those lectures, especially in undergrad, is to reduce the "unknown unknowns". You don't know that you could look up how some technique actually works if you haven't ever heard about it. If you want all the details you need good textbooks, or, for more advanced topics, the original papers.


How much material have the profs digested in order to build these huge curriculums? Because they're not just listing "anything they've read about the subject", are they?


Some professors have had 20+ years to digest an enormous amount of information. Keep in mind they have 2 major advantages over people in industry when learning this stuff.

At universities that offer PhDs, being a professor forces you to develop a strong theoretical foundation, which makes it easier to grasp the theory of related subjects. Example: If you have a deep understanding of math/statistics theory, it's much easier to understand a paper on machine learning, even if you're not a computer science professor.

Professors at research oriented universities supervise PhD research students and each student tends to be a multiplier on their supervisor's knowledge since each PhD is a collection and extension of an existing research area. A good PhD student is basically a massive funnel of information.

I'll also say, don't ascribe more abilities or skills to a professor than what you see in the presentation. Sometimes having a strong theoretical orientation/experience comes at the expense of a practical one. Don't assume that having a strong theoretical foundation in Stats/ML automatically makes you a good Data Scientist or ML Engineer.


It looks like a graduate class. It's not expected that this is the first time someone in the course has come across most of the concepts and methods.

The syllabus could almost be from an inverse theory class in my field (geology), albeit one with more focus on the underlying mathematics. I don't think it's trying to do too much, it's just not trying to be an intro course.


> it's just not trying to be an intro course.

> It's not expected that this is the first time someone in the course has come across most of the concepts and methods.

It is just cramming at least 4-5 stats classes in there, that's all.

Unless you've taken the right combination of courses, it'll be your first time anyway. There's multivariate stats, two courses' worth of Bayesian (one traditional and one nonparametric), one course of computational stats, and whatever the heck else I haven't encountered.

Bayesian statistics isn't even taught in my grad program at all. I had two months to learn it.

I disagree with the "not expected to be the first time" point, and maaaaybe it's just to get your feet wet.

Considering the stuff the professor is going over, it seems more sink-or-swim.


Bayesian statistics isn't even taught in your grad program?!?!

That seems impossibly out of step with the world today. How can they explain something as simple as a multi-armed bandit from that point of view?
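
For readers who haven't seen it, the Bayesian treatment is short enough to sketch. A minimal Thompson-sampling example for a two-armed Bernoulli bandit (the arm probabilities, priors, and horizon below are made up for illustration):

```python
import random

# Hypothetical two-armed Bernoulli bandit; the true rates are made up.
true_rates = [0.4, 0.6]

# Independent Beta(1, 1) priors on each arm's success probability.
alpha = [1.0, 1.0]
beta = [1.0, 1.0]

for t in range(1000):
    # Thompson sampling: draw a rate from each arm's posterior, play the best.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(2)]
    arm = samples.index(max(samples))

    # Observe a Bernoulli reward and update that arm's posterior.
    reward = 1 if random.random() < true_rates[arm] else 0
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:",
      [alpha[i] / (alpha[i] + beta[i]) for i in range(2)])
```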


I can never shake off the feeling that statistics is somewhat lacking compared to the rest of the "fundamental" sciences. To me it just lacks the sort of brutal honesty that is present, say, in physics. And how come the math often "looks" scary? To me this also seems intentional, like someone is trying to hide the lack of real content. Honestly, do we really need a whole field to run a curve through a cloud of points? Please, dispute me in the comments below, I would really like to be wrong about this.

Edit: Let the down-voting begin


> To me this also seems intentional, like someone is trying to hide the lack of real content.

You are half correct. Something is intentionally hidden, but it's not the lack of real content.

The real stuff is called 'Mathematical Statistics', and what you learn in school is 'Statistics for Scientists and Engineers' or 'Applied Statistics'.

Statistics looks random to you because it's useful everywhere and has to be taught to so many people in other fields. You get a basic set of pre-selected tools that are useful, and the real stuff is hidden.

If you really want to spend the time and effort to see it as fundamental from the ground up, you have to learn the mathematical background: set theory, Borel sets, Lebesgue measures and integrals, Fourier integrals, etc. Then you start working up from the fundamental axioms of probability and random variables. The payday for a practically oriented mind comes later, and learning it may not be as motivating.
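
For concreteness, the "fundamental axioms" referred to here are the standard measure-theoretic ones (the notation below is the usual textbook formulation, not anything specific to this course):

```latex
% A probability space (\Omega, \mathcal{F}, P): \mathcal{F} is a sigma-algebra
% of subsets of \Omega, and P is a measure with total mass one.
\[
P(\Omega) = 1, \qquad
P(A) \ge 0 \;\; \forall A \in \mathcal{F}, \qquad
P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)
\quad \text{for pairwise disjoint } A_i \in \mathcal{F}.
\]
% A random variable is then a measurable map X : \Omega \to \mathbb{R},
% and its distribution is the pushforward measure P \circ X^{-1}.
```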


I wouldn't recommend learning Lebesgue integration, and so forth, as a starting point for getting to know more about statistics than the canned procedures you may have been exposed to. That is math, not statistics. It's useful once you get to a certain level, but is not essential for learning the concepts. And it's actually detrimental if it gives you the idea that learning more math (beyond first-year calculus and linear algebra) is the road to statistical understanding.


The OP asked about fundamentals and rigor.

There are two ways to learn statistics: statistical intuition vs. learning the core concepts. When probability is formulated using measure theory, things really click together with the rest of mathematics.

> That is math, not statistics.

Sometimes at the university level the statistics department teaches applied statistics, and you need to go to the mathematics department to learn mathematical statistics. What counts as 'real' statistics is a matter of opinion, but non-applied statistics is pure abstract mathematics.


It's a mistake to think you're learning statistics when you're learning about measure theory, just like it's a mistake to think you're learning physics when you're learning about differential equations. The mathematical tools used in a subject are not the same as the subject itself.


It’s been a while since I’ve done serious statistical work (I’m a logician). There are many different kinds of distributions (i.e. lines through a field), from which you would make very different inferences - relationships which satisfy a power law can sort of look like a Gaussian/normal distribution. Finding the “right” distribution when you know you’re working with incomplete data isn’t simple.

I can assure you they aren’t trying to make the math look scary - to be a bit blunt, a comment like that is usually a sign that someone hasn’t made a serious attempt to engage with the field. That sort of bad faith regarding a core discipline of the mathematical sciences probably should be downvoted.


As a statistician I will try to defend my field.

Statistics is the field that represents noise and uncertainty by quantifying them.

It may seem unclear to you, or lacking in brutal honesty, but the honest truth is that the world is not perfect. If Kobe made 100% of his free throws, you could model them perfectly without statistics, but sometimes he misses, and because of that we need statistics to account for the fact that, while we expect him to make his free throws, he sometimes misses.
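
To make that concrete, here's a toy sketch (the shot counts below are invented) of treating free throws as Bernoulli trials with a Beta prior, so misses are part of the model rather than something it can't represent:

```python
# Toy free-throw model: each attempt is a Bernoulli trial with unknown
# make probability p. With a Beta(a, b) prior, the posterior after
# observing `makes` made shots and `misses` missed shots is
# Beta(a + makes, b + misses).
a, b = 1.0, 1.0            # uniform prior on p
makes, attempts = 17, 20   # made-up data
misses = attempts - makes

post_a, post_b = a + makes, b + misses
posterior_mean = post_a / (post_a + post_b)
print(f"posterior mean make probability: {posterior_mean:.3f}")
```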

It is also the field that defines what 'significant' means. Statistical significance is the difference between 'this med really works' and 'it's a placebo / the effect isn't significant enough'.

Another concrete example of imperfection is data. Data collection (missing data, measurement error, etc.) is not perfect, and statistics helps. You can do ordinary least squares without caring about statistics, but finding outliers in the data, handling missing data, etc. require statistics, iirc.


>"As a statistician I will try to defend my field. [...] Statistically significant is the difference between this med really works versus it's a placebo/not sig enough."

Nope, you are making at least two errors here. You can search for "statistical significance vs practical significance" and "research hypothesis vs statistical hypothesis" to learn more.

That is a confusing topic, though; it would be best done away with altogether, as this course appears to do.


In general that is an inaccurate and naive criticism of a vast area of work. I also think you are confusing statistics with its application by scientists.

However, there is some truth in this regarding the application by scientists. I did my PhD in the field of computational statistics applied to population genetics. There you are usually trying to infer past history, and therefore have no hope of experimental verification. Although some important advances in computational statistics originate in that field, there were also lots of papers just throwing increasingly fancy Bayesian MCMC software at datasets without the honesty of experimental physics about what is knowable and what sadly might not be.

Also, many scientists, away from physics, simply do not have the mathematical maturity to understand most of statistics.


It's certainly a field plagued by some intentional obfuscation and annoying jargon (kriging, anyone?), but there's real value in it like anything else. You can prove that to yourself easily: choose a noise process, generate x,y samples off a line plus noise, try ordinary least squares, and then some Gaussian processes...
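
Something along these lines, for instance (a rough sketch assuming numpy and scikit-learn are available; the slope, intercept, and noise level are arbitrary choices):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Samples from a line y = 2x + 1 plus Gaussian noise (arbitrary choices).
x = np.sort(rng.uniform(0, 10, size=50))
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.shape)

# Ordinary least squares via the normal equations.
X = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"OLS fit: y = {slope:.2f} x + {intercept:.2f}")

# A Gaussian process with an RBF + noise kernel fit to the same data.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(x.reshape(-1, 1), y)
mean, std = gp.predict(np.array([[5.0]]), return_std=True)
print(f"GP prediction at x = 5: {mean[0]:.2f} +/- {std[0]:.2f}")
```

The point isn't which fit is "better"; it's that once you've chosen the noise process yourself, you can see exactly what each method recovers and how uncertain it is.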

Basically the value of statistics lies in the fact that there's some human-tractable set of noise processes and correlations which appear universally in a wide spectrum of real phenomena.

Knowing the real mechanism which generates the noise is always better, but not always tractable.


I glanced at the lecture notes and this looks much different (imo better) than your usual stats course. I didn't even see any mention of the usual hypothesis/significance testing mumbojumbo that has been causing all the trouble. They say:

"It is a course intended for those that value the role of Bayesian inference and machine learning on [sic] their research."


> Let the down-voting begin

It violates the HN guidelines to include this sort of baity noise in comments here. We'd appreciate it if you'd read and follow the rules. They're at https://news.ycombinator.com/newsguidelines.html.


I had to be a bit cheeky to provoke discussion, and it seems that I did. I only edited the original post to voice my concern about down-votes initially coming faster than the thoughtful comments we are all hoping for. Fortunately, in the end HN didn't disappoint.


There are some problems with the field of statistics. They may stem partially from its role in "certifying" results in other sciences, which induces a certain conservatism, and a desire by many less scrupulous practitioners to just crank out the result they want, never mind understanding how the method works, or whether it is appropriate. It also hasn't (perhaps until recently) been that "sexy" a field, so it may attract fewer really bright people. Finally, there is a tendency for bright people who come into the field from math to think that statistics is a sub-field of math, when in fact there are philosophical issues of inductive inference that are not just math, and practical skills in data analysis that are not necessarily easy just because you're good at math.

Some particular problems:

1) Scary math and pointlessly obscure terminology are indeed a problem. For unnecessarily scary math, the early literature on Dirichlet process mixture models is a good example - almost like they were designed to be incomprehensible to most people who do actually have enough background to use the results. At a lower level, there are pointlessly obscure and misleading terms like "score function", "coefficient of determination", and worst of all, "standard error of the estimate" for the estimated standard deviation of residuals in a regression model (worst since it is not in fact a "standard error" by the general definition of that term).

2) Introductory statistics is generally taught from a naive "frequentist" perspective, because that's been the tradition for the last century or so. The justifications offered in such courses for using p-values and confidence intervals are not defensible - they just sound plausible if you don't know better. There is no good solution, since more sophisticated frequentist arguments will be beyond the level of the course, and shifting to a Bayesian perspective cuts the students off from the scientific literature with p-values, etc. that they will need to be able to read.

3) Outsiders coming into the field often have strange ideas. You might think that physicists capable of building a billion dollar accelerator would be able to recognize when a statistical method they think of is nonsense, but you'd be wrong. There's a tendency for anyone who learns information theory before statistics to think that information theory is tremendously relevant - but no, rephrasing maximum likelihood or Bayesian methods in information theory terms may sometimes be slightly helpful in thinking about them, but doesn't really add anything fundamental. And no, there's nothing particularly special or interesting about distributions that maximize entropy subject to some (generally arbitrarily selected) constraint.

4) There's a tendency to want more than you can get. There is no one "objectively correct" model/prior/analysis for a data set. Subjective assessments are unavoidable. But a lot of people don't want to accept this fact, and devote great efforts to ways of trying to pretend otherwise.

However, if you think statistics is just a simple matter of running a curve through a cloud of points, you're very wrong. Even running a curve through a cloud of points is a complicated and subtle enough task that deep issues arise, and these issues become much more obvious if you're trying to fit a function of hundreds of variables rather than just one. And if you're trying not just to "fit" data but to come to valid conclusions about cause and effect, or about underlying latent variables that will provide useful information in new contexts, then you really do need to know a lot.


> There's a tendency for anyone who learns information theory before statistics to think that information theory is tremendously relevant - but no, rephrasing maximum likelihood or Bayesian methods in information theory terms may sometimes be slightly helpful in thinking about them, but doesn't really add anything fundamental.

> And no, there's nothing particularly special or interesting about distributions that maximize entropy subject to some (generally arbitrarily selected) constraint.

Wow, two heavyweight opinions there. Care to elaborate?


Some ideas, like the Minimum Description Length principle derived from information theory, turn out to be just rephrasings of already-known statistical ideas. This can occasionally provide more insight, but can also lead to ridiculously irrelevant things like looking in detail at how one might produce a code for data, rather than just saying the code would have length equal to minus log2 of the probability of the data, which of course leads to just forgetting about the code altogether and looking at probabilities instead.

The maximum entropy idea is just wrong (in general), in that there is no good argument for doing it. Actually, it's "not even wrong": maximizing the entropy subject to the observed values of some expectations is just not possible, since we do not observe expectations, but rather particular finite data sets.
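
For readers unfamiliar with the idea being criticized, the textbook example of such a result is that, among all densities with a given mean and variance, differential entropy is maximized by the Gaussian:

```latex
% Among densities p on \mathbb{R} with fixed mean \mu and variance \sigma^2,
% the differential entropy h(p) = -\int p(x)\,\log p(x)\,dx is maximized by
% the normal density, with
\[
\max_{p}\; h(p) \;=\; h\big(\mathcal{N}(\mu, \sigma^2)\big)
\;=\; \tfrac{1}{2}\log\!\big(2\pi e\,\sigma^2\big).
\]
% The parent comment's objection is that in practice one has a finite data
% set, not known expectations, so the constraints themselves are estimates.
```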


Thank you for the very informative and balanced comment.


I have seen people in physics and other physical sciences work with statistics. My sense is that the less someone has to worry about "measurement error", the less intuitive statistics seems to them.

For example, in some areas of research, once you have the right instrument to take measurements, you can just plot the results, and your measurements (or some transformation of them) will show up as a linear function of whatever you manipulated.

But really, I'm not totally sure what you mean, since in any situation where there's uncertainty, it makes sense to me that you'd want to try to capture that uncertainty in your analysis, and that's statistics. Make those analyses more complex (e.g. taking measurements from a field with obvious spatial dependence between measures), and the models become more complex too.


That's really the issue, isn't it? That nearly all data is at least somewhat noisy. Making inferences from it is just a big blur without a systematic way to handle it.


Ah, but not just any curve will do, will it? You need a lot of "scary looking math" to make your case for where you draw that curve.


There was a short tale (or maybe it was an interview, I cannot remember) where Isaac Asimov recounted how, as a chemistry student, he was given a set of laboratory experiments as homework and the chore of plotting them together with a trend line.

After n tries (as the data from the experiments were spread semi-randomly) he drew a line roughly in the middle of the diagram and called the whole thing a "shotgun diagram" (or something similar).

When he got back to class, he was surprised by his classmates' results, which more or less led to a neat line.

Then the professor gave him the top mark, as the experiment was intentionally designed to produce "senseless" results, and he was the only one in the class to have actually honestly reported them, while all the others had evidently faked or invented theirs.

This is on a similar note:

http://pages.cs.wisc.edu/~kovar/hall.html



