Most people don't appreciate linear regression.
1) All common statistical tests are linear models: https://lindeloev.github.io/tests-as-linear/
2) Linear models are linear in the parameters, not necessarily in the predictors! E.g. y = a*sin(x) + b*x^2 is a linear model.
3) By choosing an appropriate spline basis, many non-linear relationships between the predictors and the response can be modelled by linear models.
4) And if that flexibility isn't enough, by virtue of Taylor's theorem, linear relations are often a good approximation of non-linear ones.
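A quick R sketch of point 4 (made-up function and range, purely illustrative): over a narrow interval, a straight line tracks a smooth nonlinear function very closely.

    x <- seq(10, 11, length.out = 200)
    y <- exp(x / 5)                  # a nonlinear "truth"
    fit <- lm(y ~ x)                 # first-order (linear) approximation
    max(abs(resid(fit))) / mean(y)   # relative error is tiny over this range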
These are all fantastic points, and I strongly agree that most people don't appreciate linear models nearly enough.
Another one I would add that is very important: Human beings, especially in groups, can only reasonably make linear decisions.
That is, when we are in a meeting making decisions about the direction of the company, we can only say things like "we need to increase ad spend while reducing other costs of acquisition, such as discount vouchers". If you want to find the balance between "increasing ad spend" and "decreasing other costs", that's a simple linear model.
Even if you have a great non-linear model, it's not even a matter of "interpretability" so much as "actionability". You can bring the results of a regression analysis to a meeting and very quickly model different strategies with reasonable directional confidence.
I struggled communicating actionable insights upward until I started to really understand regression analysis. After that it became amazingly simple to quickly crack open and understand fairly complex business processes.
I have a degree in statistics yet I've never thought about the relationship between linear models and business decisions in this way. You're absolutely right. This is the best comment I've read all month.
I don't follow - could you explain this with a couple of examples? What would a business proposal look like that is analogous to a nonlinear model vs. one that is analogous to a linear model?
I’m neither of the previous posters, so I may be off…
For simplicity, I’m going to assume each variable in the model is independent of every other variable.
We can interpret the coefficients in linear models directly. Each relationship holds over the range of values the model was fit on, and it is the same across that whole range. (We can't extrapolate outside of what's been modelled.)
y = c1*x1 + c2*x2 + … + cn*xn
The sign tells you the direction (+ means it will increase the value of y, - means it will decrease the value of y), and the value of the coefficient tells you how much y will change for a 1-unit change in that x.
Since this is linear, you get the same change to the output for the relevant increases no matter your starting point.
So, say the regression model shows that x1, x3, and x5 have positive coefficients and x2, x4 have negative coefficients. If you want y to increase, either start doing more of x1, x3, x5 or do less of x2, x4. Depending on what these are and your limited investment budget, for example, you may pick x3 if that has the largest positive coefficient.
Again, since this is linear, you can keep on putting resources into the largest coefficient and get the same increase up until your model is no longer valid.
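A minimal R sketch of that reading, with simulated data and hypothetical variable names (borrowing the ad-spend example from upthread):

    set.seed(1)
    n <- 500
    ad_spend  <- runif(n, 0, 10)                           # hypothetical drivers
    discounts <- runif(n, 0, 10)
    revenue   <- 3 * ad_spend - 2 * discounts + rnorm(n)   # made-up "truth"

    fit <- lm(revenue ~ ad_spend + discounts)
    coef(fit)
    # ad_spend  ~ +3: each extra unit of ad spend adds about 3 to revenue
    # discounts ~ -2: each extra unit of discounting removes about 2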
For non-linear models, you can still interpret the coefficients, but the interpretation depends on your starting conditions and where you are on the graph.
There may be asymptotes in your non-linear model, so there is a point of diminishing returns where if you keep putting resources into a variable with a positive coefficient, this will not keep getting you commensurate results.
Sorry I don’t have any actual examples here and I don’t have time to go digging through my old textbooks to look for any.
How I understand the comment: a non-linear suggestion is that the budget for X should be 300k. The (supposedly linear) alternative is that the budget for X should increase.
What I think is the important part is that it is better to ask decision makers for decisions on setting a continuous parameter than to make binary yes/no or go/no-go decisions. When it's a decision by committee, I can see why that is.
> Another one I would add that is very important: Human beings, especially in groups, can only reasonably make linear decisions.
No, that's not true. Human groups are very able to make discrete decisions. Actually, often they tend to go for discrete decisions, when something continuous (and perhaps linear) would be a lot better.
(Just to be clear: if you force your linear models to make discrete predictions, they are no longer linear in any sense of the word. That's why linear optimisation is a problem that can be solved in polynomial time, while integer linear optimisation is NP-complete.
Even convex optimisation, which is no longer linear but still continuous, can be solved in roughly polynomial time.)
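For concreteness, a tiny sketch with R's lpSolve (assuming that package; numbers made up). The toy instance only shows that the continuous and integer answers differ; the complexity gap between the two only bites at scale.

    library(lpSolve)

    # maximise 3x + 2y subject to x + y <= 4.5, x, y >= 0
    obj <- c(3, 2)
    con <- matrix(c(1, 1), nrow = 1)

    lp("max", obj, con, "<=", 4.5)$solution                  # LP:  x = 4.5, y = 0
    lp("max", obj, con, "<=", 4.5, all.int = TRUE)$solution  # ILP: x = 4,   y = 0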
Often people demand more decisive decisions, of 'yes'/'no' or concrete action, not shades of grey and fiddling at the margins.
Getting people to even appreciate linear models is already a step forward. Like it or not, your business strategy meetings are already a step ahead of what most people would naturally be inclined to do.
Yes. I also found that in many cases being able to turn problems that require discrete decisions into problems that admit continuous decisions, eg by re-arranging how the business works etc, can unlock a lot of business value.
In my concrete cases I mostly saw that in the direct sense of being able to deploy more mathematics and operations research, eg for netting out (partially) offsetting financial instruments for a bank.
But by introspection you can come up with more examples. Eg that's a common selling point for running your servers on AWS instead of building your own hardware.
I often make fun of McKinsey-style four quadrants when overused, but they really boil down to something that makes a lot of sense in communicating a problem space:
a) carefully choose the two most important dimensions of concern (as Alan Kay said: the correct point of view is worth 80 IQ points)
b) make them binary: are we happy here or do we need to change?
In a way similar to the Pareto principle, you keep a surprising amount of value in something "so simple it can't possibly be so useful".
Of course, you can also weaponise the choice of axes for your (office) politics: pick the two axes right, and the policy outcome you want to pick might already be baked into the whole process from the start.
If you like shrinkage (I do), I highly recommend the work of Matthew Stephens, e.g. ashr [1] and vash [2] for shrinkage based on an empirically derived prior.
Especially when you use the mixed model (aka MLM) framework to automatically select the smoothing penalty for your splines. So in one simple and very intuitive framework, you can estimate linear and nonlinear effects, account for repeated measurements and nested data, and model binary, count, or continuous outcomes (and more), all fitting the model in one shot, yielding statistically valid confidence intervals and p-values.
R's mgcv package (which does all of the above) is probably the single reason I'm still using R as my primary stats language.
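A minimal sketch of that one-shot workflow (hypothetical data frame d and column names; method = "REML" does the automatic smoothness selection, and bs = "re" adds a subject-level random intercept for the repeated measurements):

    library(mgcv)

    # count outcome, possibly nonlinear dose effect, linear group effect,
    # and a random intercept per subject (subject must be a factor)
    fit <- gam(count ~ s(dose) + group + s(subject, bs = "re"),
               family = poisson(), method = "REML", data = d)

    summary(fit)   # approximate p-values and CIs for smooth and parametric terms
    plot(fit)      # estimated smooths with confidence bands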
statsmodels is the closest thing in Python to R. statsmodels has mixed-model support, but matching mgcv apparently requires more than that. It is well above my paygrade, but this seems relevant: https://github.com/statsmodels/statsmodels/issues/8029 (i.e. no out-of-the-box support; you might be able to build an approximation on your own).
> Human beings, especially in groups, can only reasonably make linear decisions.
There are absolutely decisions that need to get made, and do get made, that are not linear. Step functions are a great example. "We need to decide if we are going to accept this acquisition offer" is an example of a decision with step function utility. You can try to "linearize" it and then apply a threshold -- "let's agree on a model for the value at which we would accept an acquisition offer" -- but in many ways that obscures that the utility function can be arbitrarily non-linear.
But the parent comment is not talking about constrained optimization, just gradient following.
In the context of this post, that’s just “which of these N discrete variables, if moved from 0 to 1, will increase the quantity of interest according to the linear model?” “Which will decrease it?”
The question is not, “if I can only set M of these N variables to 1, which should I choose?”
That’s a good question, and it leads to problems in NP, but that’s not what the comment was referring to.
> In the context of this post, that’s just “which of these N discrete variables, if moved from 0 to 1, will increase the quantity of interest according to the linear model?” “Which will decrease it?”
Yes, you are right in that abstract setting.
If you always have the full hypercube of options available, the problem is as easy as you describe. But if there are constraints between the variables, it gets hairier.
Which seems almost ironic, because continuous linear optimization arguably doesn't really exist, since real numbers can only be approximated, and so we're always doing discrete linear optimization at some level.
If all the numbers that appear in your constraints are rational (p/q with finite p and q), then any solution is also a rational number (with finite numerator and finite denominator).
(Well, any finite solution. Your solution could also be unbounded, then you might have infinities in there.)
> Human beings, especially in groups, can only reasonably make linear decisions.
This seems to be getting a lot of attention. I couldn't agree more: we assume linearity all the time because reasoning non-linearly is exceptionally difficult. Yes, we can do it sometimes, but it is not the default. Reasoning linearly has its flaws, and we should recognize we are making an imperfect decision, but it is still extremely useful.
For point (3), in most of my academic research and work in industry, I have used Generalized Additive Models with great technical success (i.e., they fit the data well). Still, I have noticed that they are rarely understood or given proper appreciation by stakeholders (a broad category). Out of laziness and habit, mostly.
I've looked at additive models, but I have so far shied away because I've read that they are not super equipped to deal with non-additive interactions.
They actually deal with non-additive "low-order" interactions quite well. In R's mgcv, for example, let's say you had data from many years of temperature readings across a wide geographic area, so your data are (lat, long, year, temperature). mgcv lets you fit a model like:

    temperature ~ s(lat, long) + s(year) + ti(lat, long, year)

where you have (1) a nonlinear two-way interaction (i.e. a smooth surface) across two spatial dimensions, (2) a univariate nonlinear effect of time, and (3) a three-way nonlinear interaction, i.e. "does the pattern of temperature distributions shift over time?"
You still can't do arbitrary high-order interactions like you can get out of tree-based methods (xgboost & friends), but that's a small price to pay for valid confidence intervals and p-values. For example, the model above will give you a p-value for the ti() term, which you can use as formal statistical evidence for whether, and at what level of confidence, a spatiotemporal trend exists.
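A sketch of pulling that evidence out (assuming the data frame is called temps):

    library(mgcv)

    fit <- gam(temperature ~ s(lat, long) + s(year) + ti(lat, long, year),
               method = "REML", data = temps)
    summary(fit)$s.table   # approximate p-values per smooth, incl. the ti() term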
A common problem I encounter in the literature is authors over-interpreting the slopes of a model with quadratic terms (e.g. Y = age + age^2) at the lowest and highest ages. Invariably the plot (not the confidence intervals) will seem to indicate declines (for example) at the oldest ages (example: random example off the internet [1]), when really the apparent negative slope is due to quadratic models not being able to model an asymptote.
The approach I've used (when I do not have a theoretically driven choice to work with) is fractional polynomials [2], e.g. x^s where s = {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, and then picking a strategy for selecting the best-fitting polynomial while avoiding overfitting.
It's not a bad technique; I've tried others like piecewise polynomial regression, knots, etc. [3], but I could not figure out how to test (for example) for a group interaction between two knotted splines. I've also tried additive models.
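A crude sketch of the one-term version of that idea, picking the power by AIC (hypothetical d$x and d$y; the published fractional-polynomial procedures do a more careful, multi-term selection):

    powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)   # FP convention: power 0 means log(x); assumes x > 0
    aics <- sapply(powers, function(s) {
      xs <- if (s == 0) log(d$x) else d$x^s      # transformed predictor
      AIC(lm(d$y ~ xs))                          # one-term fractional polynomial fit
    })
    powers[which.min(aics)]                      # best single power by AIC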
For my applications, using natural cubic splines provided by the 'ns' function in R, combined with trying out where knots should be positioned, is sufficient. Maybe have a look at the gratia package [1] for plotting lots of diagnostics around spline fits.
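For what it's worth, a minimal version of that (hypothetical data and variable names; the knot locations are something you'd tune):

    library(splines)

    # natural cubic spline in age with hand-placed interior knots
    fit <- lm(outcome ~ ns(age, knots = c(30, 50, 70)), data = d)

    # predictions on a grid, to eyeball the fitted curve
    grid <- data.frame(age = seq(20, 80, by = 1))
    pred <- predict(fit, newdata = grid, interval = "confidence")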
An SVM is purely a linear model from the right perspective, and if you're being really reductive, ReLU neural networks are piecewise linear. I think this may be obscuring more than it helps; picking the right transformation for your particular case is a highly nontrivial problem: why sin(x) and x^2, rather than, say, tanh(x) and x^(1/2)?
I have very little math knowledge and point 2 surprises me. Some quick googling suggests that a linear model should produce a straight line when graphed but the example equation you offered isn't straight. I'm missing something basic aren't I?
The things being learned here are (a, b), and you do that using data (x, y). We can rewrite our input to be of the form z = {sin(x), x^2}, and then we have the model y = a*z_1 + b*z_2, which is now obviously linear in z. Since x is given to us and z is just a function of x, nothing strange is happening here. Just manipulating the data.
When statisticians talk about linear models, they talk about the parameters being linear, not your variables x_0..x_n. So y = a*sin(x) + b is a linear model, because y is linear in a and b.
IANAS, but the example is not linear in x. But you can pick one or more axes where it would be linear. In this case, for y = a*sin(x) + b*x^2, you set x' = sin(x) and x" = x^2 and plot y = a*x' + b*x". You can also pick an arbitrary function for y and do a similar transformation.
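In R, that reparameterisation is literally just this (made-up a, b, and noise):

    set.seed(42)
    x <- runif(200, 0, 10)
    y <- 2 * sin(x) + 0.5 * x^2 + rnorm(200, sd = 0.2)   # true a = 2, b = 0.5

    z1 <- sin(x)                 # x'  = sin(x)
    z2 <- x^2                    # x'' = x^2
    coef(lm(y ~ z1 + z2 - 1))    # plain linear regression recovers a and b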
As a student who's only been exposed to stats in undergrad (in the context of using multiple regression in Econometrics), where can I learn more about this? especially about choosing a spline basis and Taylor's theorem?
Re. 2) Then you end up doing feature engineering. For applications where you don't know the data-generating process, it is often better to just throw everything at the model and let it extract the features.
I don't disagree in the context of the current tools. But this has always been a bugbear of mine: data science has an unhealthy bias towards modeling over data preparation.
I'd love to see tools in the ecosystem around extracting relevant features that then can be used on a lower cost, more predictable model.
If you want to convert people into loving linear models (and you should), we need to make sure that they learn the difference between 'linear models' and 'linear models fit using OLS'.
I've met smart people that can't wrap their head around how it's possible to create a linear model where the number of parameters exceeds the number of data points (that's an OLS restriction).
Or they're worried about how they can apply their formula for calculating the std error on the parameters. Bruh, it's the future and we have big computers. Just bootstrap em and don't make any assumptions.
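Something like this is all it takes (a sketch; assuming a data frame d with columns y, x1, x2):

    set.seed(7)
    boot_coefs <- replicate(2000, {
      idx <- sample(nrow(d), replace = TRUE)    # resample rows with replacement
      coef(lm(y ~ x1 + x2, data = d[idx, ]))    # refit on the resample
    })
    apply(boot_coefs, 1, sd)                    # bootstrap SE for each coefficient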
> where the number of parameters exceeds the number of data points
In that parameter regime, linear models have many solutions that fit the data exactly, many more that fit it approximately under any metric for which matching the outputs is still preferred, and sometimes multiple solutions even with more data points than parameters.
So, not just for OLS but for most metrics (where you'd prefer to match or approximately match the data), the parameters are underconstrained.
How much that matters depends on lots of things. If you have additional constraints (a common one that's particularly easy to program is looking for a minimum-norm solution), that trivially solves the problem. Otherwise, you might still have issues. E.g., non-minimum-norm solutions often perform badly on slightly out-of-distribution samples (since those extra basis vectors were unconstrained and thus might be large).
Is there something I'm missing where 'linear models' are used to represent something wildly different than I'm used to? Are people using norms with discontinuities or something in practice? Is the criticism of OLS perhaps unrelated to the overparameterization issue? I think I'm missing some detail that would relate all of those.
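To make the minimum-norm point above concrete, a small sketch with more parameters than data points (using MASS::ginv, the Moore-Penrose pseudoinverse; purely illustrative):

    library(MASS)

    set.seed(3)
    n <- 20; p <- 100                        # p > n: plain OLS has no unique solution
    X <- matrix(rnorm(n * p), n, p)
    beta_true <- c(rep(1, 5), rep(0, p - 5))
    y <- X %*% beta_true + rnorm(n, sd = 0.1)

    beta_hat <- ginv(X) %*% y                # minimum-norm fit among exact solutions
    max(abs(X %*% beta_hat - y))             # training data is matched (essentially) exactly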
> If you want to convert people into loving linear models (and you should), we need to make sure that they learn the difference between 'linear models' and 'linear models fit using OLS'
Help me understand the pitch. What linear models are you referring to here that aren’t estimated with OLS? How should I wrap my head around having more parameters than observations?
Yeah, but let's not go crazy. Linear models perform very badly on partitionable tabular data where tree models excel. They are also obviously not a replacement for, or competitive with, deep learning on the tasks deep learning is suited to.
Point 3 — just pick the right basis — is very difficult outside a handful of kernels that are known to work. And how are you going to extrapolate your spline for prediction for example? Linearly is usually the answer…
Point 4 — sure, for differentiable functions, but most people are fitting data, not functions, and if you knew its generating function, why would you bother with a linear model?