Most people don't appreciate linear regression.
1) All common statistical tests are linear models: https://lindeloev.github.io/tests-as-linear/
2) Linear models are linear in the parameters, not necessarily in the predictors! E.g. y = a*sin(x) + b*x^2 is a linear model (linear in a and b).
3) By choosing an appropriate spline basis, many non-linear relationships between the predictors and the response can be modelled by linear models (see the quick R sketch after this list).
4) And if that flexibility isn't enough, by virtue of Taylor's theorem, linear relations are often a good approximation of non-linear ones.
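To make points 2 and 3 concrete, here is a minimal R sketch (made-up data; bs() from the splines package is just one possible basis choice):

  library(splines)
  set.seed(1)
  x <- runif(200, 0, 10)
  y <- sin(x) + 0.1 * x^2 + rnorm(200, sd = 0.3)   # a non-linear truth
  fit <- lm(y ~ bs(x, df = 6))                     # still an ordinary linear model in the basis coefficients
  plot(x, y); lines(sort(x), fitted(fit)[order(x)], col = 2)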
These are all fantastic points, and I strongly agree that most people don't appreciate linear models nearly enough.
Another one I would add that is very important: Human beings, especially in groups, can only reasonably make linear decisions.
That is, when we are in a meeting making decisions for the direction of the company we can only say things like "we need to increase ad spend, while reducing the other costs of acquisition such as discount vouchers". If you want to find the balance between "increasing ad spend" while "decreasing other costs" that's a simple linear model.
Even if you have a great non-linear model, it's not even a matter of "interpretability" so much as "actionability". You can bring the results of a regression analysis to a meeting and very quickly model different strategies with reasonable directional confidence.
I struggled communicating actionable insights upward until I started to really understand regression analysis. After that it became amazingly simple to quickly crack open and understand fairly complex business processes.
I have a degree in statistics yet I've never thought about the relationship between linear models and business decisions in this way. You're absolutely right. This is the best comment I've read all month.
I don't follow - could you explain this with a couple of examples? What would a business proposal look like that is analogous to a nonlinear model vs. one that is analogous to a linear model?
I’m neither of the previous posters, so I may be off…
For simplicity, I’m going to assume each variable in the model is independent of every other variable.
We can interpret the coefficients in linear models directly. The interpretation holds over the range of values the model was fit on, and it is the same across that whole range. (We can’t extrapolate outside of what’s been modelled.)
y = c1*x1 + c2*x2 + … + cn*xn
The sign tells you the direction (+ means it increases y, - means it decreases y), and the magnitude of the coefficient tells you how much y changes for a 1-unit change in that x.
Since this is linear, you get the same change in the output for a given change in the input, no matter your starting point.
So, say the regression model finds that x1, x3, and x5 have positive coefficients and x2, x4 have negative coefficients. If you want y to increase, either start doing more of x1, x3, x5 or do less of x2, x4. Depending on what these are and your limited investment budget, for example, you may pick x3 if it has the largest positive coefficient.
Again, since this is linear, you can keep on putting resources into the largest coefficient and get the same increase up until your model is no longer valid.
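As a toy R illustration of that reading (data frame and variable names invented):

  # d is a hypothetical data frame: y = outcome, x1, x2 = levers you control
  fit <- lm(y ~ x1 + x2, data = d)
  coef(fit)   # sign = direction of each lever; magnitude = change in y per 1-unit change in that x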
For non-linear models, you can still interpret the coefficients, but the interpretation depends on your starting conditions and where you are on the graph.
There may be asymptotes in your non-linear model, so there is a point of diminishing returns where if you keep putting resources into a variable with a positive coefficient, this will not keep getting you commensurate results.
Sorry I don’t have any actual examples here and I don’t have time to go digging through my old textbooks to look for any.
How I understand the comment: a non-linear suggestion is that the budget for X should be 300k. The (supposedly linear) alternative is that the budget for X should increase.
What I think is the important part is that it is better to ask decision makers to set a continuous parameter than to make binary yes/no or go/no-go decisions. When it's a decision by committee, I can see why that is.
> Another one I would add that is very important: Human beings, especially in groups, can only reasonably make linear decisions.
No, that's not true. Human groups are very able to make discrete decisions. Actually, often they tend to go for discrete decisions, when something continuous (and perhaps linear) would be a lot better.
(Just to be clear: if you force your linear models to make discrete predictions, they are no longer linear in any sense of the word. That's why linear optimisation is a problem that can be solved in polynomial time, while integer linear optimisation is NP-complete.
Even convex optimisation, which is no longer linear but still continuous, can be solved in roughly polynomial time.)
Often people demand more decisive decisions, of 'yes'/'no' or concrete action, not shades of grey and fiddling at the margins.
Getting people to even appreciate linear models is already a step forward. Like it or not, your business strategy meetings are already a step ahead of what most people would naturally be inclined to do.
Yes. I also found that in many cases being able to turn problems that require discrete decisions into problems that admit continuous decisions, eg by re-arranging how the business works etc, can unlock a lot of business value.
In my concrete cases I mostly saw that in the direct sense of being able to deploy more mathematics and operations research, eg for netting out (partially) offsetting financial instruments for a bank.
But by introspection you can come up with more examples. E.g. that's a common selling point for running your servers on AWS instead of building your own hardware.
I often make fun of McKinsey-style four quadrants when overused, but they really boil down to something that makes a lot of sense in communicating a problem space:
a) carefully choose the two most important dimensions of concern (as Alan Kay said: the correct point of view is worth 80 IQ points)
b) make them binary: are we happy here or do we need to change?
In a way similar to the Pareto principle, you keep a surprising amount of the value in something “so simple it can't possibly be that useful”.
Of course, you can also weaponise the choice of axes for your (office) politics: pick the two axes right, and the policy outcome you want to pick might already be baked into the whole process from the start.
If you like shrinkage (I do), I highly recommend the work of Matthew Stephens, e.g. ashr [1] and vash [2] for shrinkage based on an empirically derived prior.
Especially when you use the mixed model (aka MLM) framework to automatically select the smoothing penalty for your splines. So in one simple and very intuitive framework, you can estimate linear and nonlinear effects, account for repeated measurements and nested data, and model binary, count, or continuous outcomes (and more), all fitting the model in one shot, yielding statistically valid confidence intervals and p-values.
R's mgcv package (which does all of the above) is probably the single reason I'm still using R as my primary stats language.
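A minimal mgcv sketch of what I mean (variable and data names are made up):

  library(mgcv)
  # nonlinear effect of age, linear effect of treatment, random intercept per
  # subject (a factor) for repeated measurements; swap the family for binary/count outcomes
  fit <- gam(y ~ treatment + s(age) + s(subject, bs = "re"),
             family = gaussian, method = "REML", data = dat)
  summary(fit)   # smoothing penalties chosen by REML; approximate CIs and p-values for the smooths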
statsmodels is the closest thing in Python to R. statsmodels has mixed-model support, but mgcv apparently requires more. It is well above my pay grade, but this seems relevant: https://github.com/statsmodels/statsmodels/issues/8029 (i.e. no out-of-the-box support; you might be able to build an approximation on your own).
> Human beings, especially in groups, can only reasonably make linear decisions.
There are absolutely decisions that need to get made, and do get made, that are not linear. Step functions are a great example. "We need to decide if we are going to accept this acquisition offer" is an example of a decision with step function utility. You can try to "linearize" it and then apply a threshold -- "let's agree on a model for the value at which we would accept an acquisition offer" -- but in many ways that obscures that the utility function can be arbitrarily non-linear.
But the parent comment is not talking about constrained optimization, just gradient following.
In the context of this post, that’s just “which of these N discrete variables, if moved from 0 to 1, will increase the quantity of interest according to the linear model?” “Which will decrease it?”
The question is not, “if I can only set M of these N variables to 1, which should I choose?”
That’s a good question, and it leads to problems in NP, but that’s not what the comment was referring to.
> In the context of this post, that’s just “which of these N discrete variables, if moved from 0 to 1, will increase the quantity of interest according to the linear model?” “Which will decrease it?”
Yes, you are right in that abstract setting.
If you always have the full hypercube available, the problem is as easy as you describe. But if there are constraints between the variables, it gets hairier.
Which seems almost ironic, because continuous linear optimization arguably doesn't really exist, since real numbers can only be approximated, and so we're always doing discrete linear optimization at some level.
If all the numbers that appear in your constraints are rational (p/q with finite p and q), then any solution is also a rational number (with finite numerator and finite denominator).
(Well, any finite solution. Your solution could also be unbounded, then you might have infinities in there.)
> Human beings, especially in groups, can only reasonably make linear decisions.
This seems to be getting a lot of attention. I couldn't agree more: we assume linearity all the time because reasoning non-linearly is exceptionally difficult. Yes, we can do it sometimes, but it is not the default. Reasoning linearly has its flaws, and we should recognize we are making an imperfect decision, but it is still extremely useful.
For point (3): in most of my academic research and work in industry, I have used Generalized Additive Models with great technical success (i.e., they fit the data well). Still, I have noticed that they are rarely understood or given proper appreciation by stakeholders (a broad category), mostly out of laziness and habit.
I've looked at additive models, but I have so far shied away because I've read that they are not super equipped to deal with non-additive interactions.
They actually deal with non-additive "low-order" interactions quite well. In R's mgcv, for example, let's say you had data from many years of temperature readings across a wide geographic area, so your data are (lat, long, year, temperature). mgcv lets you fit a model along these lines (a sketch of the formula the description implies):
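  temperature ~ te(lat, long) + s(year) + ti(lat, long, year)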
where you have (1) a nonlinear two-way interaction (i.e. a smooth surface) across two spatial dimensions, (2) a univariate nonlinear effect of time, and (3) a three-way nonlinear interaction, i.e. "does the pattern of temperature distributions shift over time?"
You still can't do arbitrary high-order interactions like you can get out of tree-based methods (xgboost & friends), but that's a small price to pay for valid confidence intervals and p-values. For example, the model above will give you a p-value for the ti() term, which you can use as formal statistical evidence to say, at a given level of confidence, whether a spatiotemporal trend exists.
A common problem I encounter in the literature is authors over-interpreting the slopes of a model with quadratic terms (e.g. Y = age + age^2) at the lowest and highest ages. Invariably the plot (not the confidence intervals) will seem to indicate declines (for example) at the oldest ages (example: random example off the internet [1]), when really the apparent negative slope is due to quadratic models not being able to model an asymptote.
The approach I've used (when I do not have a theoretically driven choice to work with) is fractional polynomials [2], e.g. x^s where s is in {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, and then picking a strategy to select the best-fitting polynomial while avoiding overfitting.
It's not a bad technique; I've tried others like piecewise polynomial regression, knots, etc. [3], but I could not figure out how to test (for example) for a group interaction between two knotted splines. Also additive models.
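A rough base-R sketch of that selection step (assuming a single predictor x > 0 and response y; real FP procedures like [2] guard against overfitting more carefully):

  powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)
  fits <- lapply(powers, function(s) {
    xs <- if (s == 0) log(x) else x^s     # by convention, power 0 means log(x)
    lm(y ~ xs)
  })
  powers[which.min(sapply(fits, AIC))]    # crude: keep the power with the lowest AIC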
For my applications, using natural cubic splines provided by the 'ns' function in R, combined with trying out where knots should be positioned, is sufficient. Maybe have a look at the gratia package [1] for plotting lots of diagnostics around spline fits.
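Something along these lines (knot placements invented):

  library(splines)
  fit <- lm(y ~ ns(x, knots = c(2, 5, 8)))   # natural cubic spline; try a few knot placements and compare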
An SVM is purely a linear model from the right perspective, and if you're being really reductive, ReLU neural networks are piecewise linear. I think this framing may obscure more than it helps; picking the right transformation for your particular case is a highly nontrivial problem: why sin(x) and x^2 rather than, say, tanh(x) and x^(1/2)?
I have very little math knowledge and point 2 surprises me. Some quick googling suggests that a linear model should produce a straight line when graphed but the example equation you offered isn't straight. I'm missing something basic aren't I?
The things being learned here are (a, b), and you do that using data (x, y). We can rewrite our input in the form z = {sin(x), x^2}, and then we have the model y = a*z_1 + b*z_2, which is now obviously linear in z. Since x is given to us and z is just a function of x, nothing strange is happening here. Just manipulating the data.
When statisticians talk about linear models, they talk about the parameters being linear, not your variables x_0..x_n. So y = a*sin(x) + b is a linear model, because y is linear in a and b.
IANAS, but the example is not linear in x. However, you can pick one or more axes where it would be linear. In this case, for y = a*sin(x) + b*x^2, you set x' = sin(x) and x'' = x^2 and plot y = a*x' + b*x''. You can also pick an arbitrary function for y and do a similar transformation.
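In R the rewrite is literally:

  x1 <- sin(x); x2 <- x^2
  fit <- lm(y ~ x1 + x2)    # same fit as lm(y ~ sin(x) + I(x^2)); linear in the coefficients a, b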
As a student who's only been exposed to stats in undergrad (in the context of using multiple regression in Econometrics), where can I learn more about this? Especially about choosing a spline basis and Taylor's theorem?
Re 2): then you end up doing feature engineering. For applications where you don't know the data-generating process, it is often better to just throw everything at the model and let it extract the features.
I don't disagree in the context of the current tools. But this has always been a bugbear of mine: data science has an unhealthy bias towards modeling over data preparation.
I'd love to see tools in the ecosystem around extracting relevant features that can then be used in a lower-cost, more predictable model.
If you want to convert people into loving linear models (and you should), we need to make sure that they learn the difference between 'linear models' and 'linear models fit using OLS'.
I've met smart people who can't wrap their heads around how it's possible to create a linear model where the number of parameters exceeds the number of data points (that's an OLS restriction).
Or they're worried about how they can apply their formula for calculating the standard error of the parameters. Bruh, it's the future and we have big computers. Just bootstrap 'em and don't make any assumptions.
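For the standard errors, a bare-bones nonparametric bootstrap is a handful of lines of base R (d is a hypothetical data frame with response y and predictors x1, x2):

  boot_coefs <- replicate(2000, {
    i <- sample(nrow(d), replace = TRUE)      # resample rows with replacement
    coef(lm(y ~ x1 + x2, data = d[i, ]))
  })
  apply(boot_coefs, 1, sd)                    # bootstrap standard errors, no distributional formula needed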
> where the number of parameters exceeds the number of data points
Linear models have many solutions fitting the data exactly in that parameter regime, many more fitting it approximately for any metric still satisfying the idea that identical outputs are preferable, and sometimes multiple solutions even with more data.
So... not just for OLS, but for most metrics (where you'd prefer to match or approximately match the data), the parameters are underconstrained.
How much that matters depends on lots of things. If you have additional constraints (a common one that's particularly easy to program is looking for a minimum-norm solution), that trivially solves the problem. Otherwise, you might still have issues. E.g., non-minimum-norm solutions often perform badly on slightly out-of-distribution samples (since those extra basis vectors were unconstrained and thus might be large).
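For concreteness, the minimum-norm least-squares solution in the p > n case is one line via the pseudoinverse (assuming X is an n x p design matrix and y the response):

  beta_mn <- MASS::ginv(X) %*% y   # Moore-Penrose pseudoinverse picks the minimum-norm solution among the least-squares fits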
Is there something I'm missing where 'linear models' are used to represent something wildly different than I'm used to? Are people using norms with discontinuities or something in practice? Is the criticism of OLS perhaps unrelated to the overparameterization issue? I think I'm missing some detail that would relate all of those.
> If you want to convert people into loving linear models (and you should), we need to make sure that they learn the difference between 'linear models' and 'linear models fit using OLS'
Help me understand the pitch. What linear models are you referring to here that aren’t estimated with OLS? How should I wrap my head around having more parameters than observations?
Yeah but let’s not go crazy. Linear models perform very badly on partition-able tabular data where tree models excel. They are also obviously no replacement or competition in deep learning related tasks.
Point 3 — just pick the right basis — is very difficult outside a handful of kernels that are known to work. And how are you going to extrapolate your spline for prediction for example? Linearly is usually the answer…
Point 4 — sure, for differentiable functions, but most people are fitting data, not functions, and if you knew the generating function, why would you bother with a linear model?
The most important skill in regression is to RECOGNIZE the intercept. It sounds trivial, and is, until you start including interactions between terms. The number of times I've found a young graduate student screw this up...
Take a simple linear model involving a test score, age in years (age range 7-16 years), and a binary categorical variable for autism diagnosis (0 = control, 1 = autism):
score = age + diagnosis + age:diagnosis
score = b0 + b1*age + b2*diagnosis + b3*(age:diagnosis)
If b2 is significant, the naive student would say, "look, a group difference!", not realizing this is the predicted group difference at the intercept, which is when participants are 0 years old. The fix: center age at the mean, the median, or better yet the age you are most interested in. Once interactions are in the equation, all "lower order" parameter estimates are interpreted in reference to the intercept.
They might also note a significant effect of age and then assume it applies to both groups, but b1 only tells you the predicted slope for the reference group (controls), while the interaction tests whether the age slopes differ between groups. Moreover, even if the interaction isn't significant, the age effect in the autism group might not significantly differ from zero; the data are in the wishy-washy zone, and you have to be careful in how you interpret them.
To some here all this will seem obvious, but for many, getting their head firmly into the conditional interpretation of parameters when there are interaction terms takes work. (Note: for now I am ignoring other ways of coding groups (grand mean vs. one group as the reference), but the lesson remains: understand what the intercept means and to whom/what it refers.)
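In R, the centring fix is a one-liner (hypothetical data frame d with score, age, and diagnosis coded 0/1):

  d$age_c <- d$age - 10                          # centre age at an age of interest (say 10), not at 0
  fit <- lm(score ~ age_c * diagnosis, data = d)
  summary(fit)
  # 'diagnosis' is now the predicted group difference at age 10, 'age_c' is the slope
  # for the reference (control) group only, and the interaction tests the slope difference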
I always struggle to get a good intuition for models with interaction terms. I usually try to write down, for every class of responses, which terms of the model contribute to it, and that often helps with interpretation. There's also the ExploreModelMatrix package [1], which helps with that task.
If I said something stupid above, please let me know. I'm always learning. If you are a strong Bayesian who doesn't like p-values, that is also fine. I get it. I just wanted to provide my observations about a great number of bright students I've worked with who have nevertheless struggled to fluidly interpret models with interaction terms, and point them in the right direction.
When I was at CMU a decade ago I took 36-401 and 36-402 (then taught by Shalizi) and they were both very good statistical classes and they forced me to learn base R, for better or for worse.
A big weakness of linear regression that I had to learn the hard way is that the assumptions needed for valid interpretation of the coefficients are easy to satisfy in small educational datasets but rarely hold for messy real-world data.
It depends. The most important assumption is independence of the observations. If that is not given, you have to either account for correlated responses using a mixed-effects model or mean-aggregate those responses (computing the mean decreases the variance but also reduces the number of data points and those two cancel each other out in calculating the t-statistic of the Wald test).
With regard to other assumptions, e.g. normality of the residuals, linear models can often deal with some degree of violation against those. But I agree that it's always good to understand the influence of those violations, e.g. by using simulations and making p-value histograms of null-data.
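For example, with lme4 (hypothetical column names), a random intercept per subject handles the correlated repeated measurements:

  library(lme4)
  fit <- lmer(y ~ x + (1 | subject), data = d)   # random intercept per subject instead of mean-aggregating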
It depends on the "severity" of the violation of assumptions--you can also use GAMs to add flexible nonlinear relationships--and the amount of data you are working with. Statistical modeling is a nuanced job.
They may not know at CMU that the vast majority of applied, trained-on-data statistical models that help run the modern world seriously violate one or more of the model's assumptions.
I love that Ridge Regression is introduced in the context of multicollinearity. It seems almost everyone these days learns about it as a regularization technique to prevent overfitting, but one of its fundamental use cases (and indeed its origin, I believe) is in balancing weights among highly correlated (or nearly linearly dependent) predictors, which can cause huge problems even if you have plenty of data.
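A hand-rolled sketch of that point (lambda fixed by hand here; in practice you'd cross-validate it, e.g. glmnet with alpha = 0):

  ridge <- function(X, y, lambda) {
    X <- scale(X); y <- y - mean(y)       # standardise so the penalty treats columns comparably
    solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
  }
  # with two nearly collinear columns in X, the OLS (lambda = 0) coefficients blow up
  # in opposite directions; even a small lambda balances the weight between them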
I'd love to see linear regression taught by say a quant researcher from Citadel. How do these guys use it? What do they particularly care about? Any theoretical results that meaningfully change the way they view problems? And so on.
I have some experience. Variants of regularization are a must. There are just too few samples and too much noise per sample.
In a related problem, covariance matrix estimation, variants of shrinkage are popular, the most straightforward being linear shrinkage (Ledoit & Wolf).
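The structure of linear shrinkage is simple even though estimating the intensity is the clever part (R here is a T x N matrix of returns; delta is a placeholder, where Ledoit-Wolf derive the optimal value from the data):

  S <- cov(R)
  target <- mean(diag(S)) * diag(ncol(R))    # shrink towards a scaled identity matrix
  delta <- 0.2                               # placeholder intensity; Ledoit-Wolf estimate this optimally
  S_shrunk <- (1 - delta) * S + delta * target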
Excepting neural nets, I think most people doing regression simply use linear regression with above type touches based on the domain.
Particularly in finance you fool yourself too much with more complex models.
Yes, these are good points and probably the most important ones as far as the maths is concerned, though I would say regularisation methods are really standard things one learns in any ML/stats course.
Ledoit-Wolf shrinkage is indeed more exotic and very useful.
> There are just too few samples and too much noise per sample.
Call it 2000 liquid products on the US exchanges. Many years of data. Even if you downsample from tick data to 1-minute bars, that doesn't feel like you're struggling for a large in-sample period?
It sounds like you are assuming the joint distribution of returns in the future is equal to that of the past, and assuming away potential time dependence.
These may be valid assumptions, but even if they are, "sample size" is always relative to the variance between sample units, and that variance can be quite large for financial data. In some cases even infinite!
They may have been referring to (for example) reported financial results or news events which are more infrequent/rare but may have outsized impact on market prices.
Linear regression - and with a single predictor at that - is the workhorse. As if the cross-product x'y on its own is too little, dividing it by the dot product x'x is just right (the regression coefficient), and dividing it again by another dot product y'y (with a square root, giving the correlation) would be overdoing it. :-)
There is no big mystery I'm afraid, there is no big reveal. It's as Jim Simons described in the Numberphile video interview: a slow, painstaking accumulation of weak signals, plus crafting and improving the various boxes of the system (the interfaces between them are largely known). The fitting method used does not buy that much in the grand scheme of things - as long as it does not ruin things, that is.
(I've not been at Citadel, but I've been in quant R&D and trading for the last 20 years.)
We had to revisit linear regression multiple times in different courses during my undergrad. It's fascinating that optimality is provable using statistics and probability theory, provided the assumptions hold, of course.
For my CS PhD I looked mostly at regression problems using deep learning models. I didn't look at this specifically, but I still think it would be neat if there were some way to translate the rigorous proofs and theorems for classical linear models to deep regression models.
It looks like this article does not mention it, but linear regression will also exhibit the double descent phenomenon commonly seen in deep learning. You would need to introduce some regularization in order to see this. It would be nice to add that discussion.
Are there some papers in particular that you're referring to? Does the second descent happen after the model becomes overparameterized, like with neural nets? What kind of regularization?
"Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle" (arXiv, submitted 24 Mar 2023), by Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W. Rocks, Ila Rani Fiete, Oluwasanmi Koyejo:
Double descent is a surprising phenomenon in machine learning, in which as the number of model parameters grows relative to the number of data, test error drops as models grow ever larger into the highly overparameterized (data undersampled) regime. This drop in test error flies against classical learning theory on overfitting and has arguably underpinned the success of large models in machine learning. This non-monotonic behavior of test loss depends on the number of data, the dimensionality of the data and the number of model parameters. Here, we briefly describe double descent, then provide an explanation of why double descent occurs in an informal and approachable manner, requiring only familiarity with linear algebra and introductory probability. We provide visual intuition using polynomial regression, then mathematically analyze double descent with ordinary linear regression and identify three interpretable factors that, when simultaneously all present, together create double descent. We demonstrate that double descent occurs on real data when using ordinary linear regression, then demonstrate that double descent does not occur when any of the three factors are ablated. We use this understanding to shed light on recent observations in nonlinear models concerning superposition and double descent. Code is publicly available
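A quick simulation sketch of the ordinary-linear-regression case they describe (isotropic Gaussian features and the minimum-norm least-squares fit; all details invented):

  set.seed(1)
  n <- 50; p_max <- 150; n_test <- 500
  X <- matrix(rnorm((n + n_test) * p_max), n + n_test, p_max)
  b_true <- rnorm(p_max) / sqrt(p_max)
  y <- X %*% b_true + rnorm(n + n_test)
  train <- 1:n; test <- (n + 1):(n + n_test)
  test_mse <- sapply(seq(5, p_max, 5), function(p) {
    b <- MASS::ginv(X[train, 1:p, drop = FALSE]) %*% y[train]   # minimum-norm least-squares fit on p features
    mean((X[test, 1:p, drop = FALSE] %*% b - y[test])^2)
  })
  # test_mse typically peaks near p = n (the interpolation threshold) and descends again beyond it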
Thanks for sharing. As someone teaching regression (with XGBoost) this month, this is a good read. Very well written and approachable, unlike many academic texts.
I particularly like chapter 6, visual diagnosis. Very well done.