Subset selection, regularization, and shrinkage

#1 hrothgar

Group: Advanced Members
Posts: 15,485
Joined: 2003-February-13
Gender:Male
Location:Natick, MA
Interests:Travel
Cooking
Brewing
Hiking

Posted 2011-November-21, 14:31

FWIW, here's an amusing little post on feature selection techniques. Stay tuned for next week, when I compare sequential feature selection to ridge regression and the lasso. (For those who care, I am shamelessly stealing examples from Tibshirani. In my defense, I do include footnotes, and this is much more about showing how to get stuff done in MATLAB code than breaking new ground.)

http://blogs.mathworks.com/loren/

Alderaan delenda est

#2 hrothgar

Posted 2011-December-06, 11:27

The second posting is now available.

(This piece introduces a newish regression algorithm called "Lasso", which offers some significant advantages compared to traditional linear regression.)

http://blogs.mathwor...ization-part-2/

Alderaan delenda est

#3 jdeegan

Group: Advanced Members
Posts: 1,427
Joined: 2005-August-12
Gender:Male
Interests:Economics
Finance
Bridge bidding theory
Cooking
Downhill skiing

Posted 2011-December-06, 12:11

ex transpose ex inverse, ex transpose y. I have been puzzling over a, thus far mysterious, ridge regression package included in my software for years. What are the alleged advantages of this revolutionary technique?

You gotta luv Krugman. Humble country school teacher comes from nowhere to be the Rush Limbaugh of the left. Ain't America grand.

#4 WellSpyder

Group: Advanced Members
Posts: 1,627
Joined: 2009-November-30
Location:Oxfordshire, England

Posted 2011-December-06, 12:32

jdeegan, on 2011-December-06, 12:11, said:

I have been puzzling over a, thus far mysterious, ridge regression package included in my software for years. What are the alleged advantages of this revolutionary technique?

I have to confess that despite being a practicing economist I haven't come across ridge regression or lasso before (my thanks to hrothgar for posting these links). But I'm well aware of the dangers of "over-fitting" when estimating equations using standard regression techniques that then don't prove very useful for forecasting - for example, including too many lags because they appear to be significant when looking at t-stats, etc. Putting a premium on parsimony therefore seems an attractive idea. Doing it by "forcing" coefficients towards zero doesn't necessarily seem an intuitive way of doing it, but I can see advantages in this, too, eg in the cases where traditional estimation can often produce a series of lags with opposite signs, which even though they can be significant in helping to fit the data over the estimation period seem unlikely to be able to help in predicting the future.

#5 hrothgar

Posted 2011-December-06, 12:38

The following presumes that you've read stuff on lasso...

Linear regression identifies the set of coefficients that minimizes the sum of the squared errors between the predicted and actual values.

Lasso changes this minimization problem. We identify the set of coefficients that minimizes the sum of the squared errors plus a penalty: the sum of the absolute values of the regression coefficients, scaled by a tuning parameter. (We're using an L1 norm.)

Ridge regression (aka Tikhonov regularization) is the same as lasso except we substitute an L2 norm for the L1 norm. This time around we identify the set of coefficients that minimizes the sum of the squared errors plus the (scaled) sum of the squares of the coefficients. As usual, the math is a lot easier with an L2 norm, which is why Tikhonov solved this problem a long time before lasso was a twinkle in Tibshirani's eye...
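
For reference, the three minimization problems described above can be written out as follows. The notation is an assumption for illustration and is not taken from the blog posts: y is the response vector, X is the matrix of predictors, beta is the coefficient vector, and lambda >= 0 is a tuning parameter that sets the strength of the penalty.

```latex
\begin{aligned}
\hat\beta^{\text{OLS}}   &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 \\
\hat\beta^{\text{lasso}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\hat\beta^{\text{ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\[4pt]
\hat\beta^{\text{OLS}}   &= (X^\top X)^{-1} X^\top y \quad \text{(when } X^\top X \text{ is invertible)} \\
\hat\beta^{\text{ridge}} &= (X^\top X + \lambda I)^{-1} X^\top y \quad \text{(for } \lambda > 0\text{)}
\end{aligned}
```

The closed form for ridge is what makes the L2 case "easier": adding lambda * I keeps the matrix invertible even when X'X is not. The lasso has no closed form in general and is solved iteratively.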

As for motivation:

1. The predictive accuracy of linear regression models suffers dramatically if you have relatively wide data sets with strong correlation between your independent variables.

2. Regularization techniques like ridge regression and lasso are often able to significantly improve predictive accuracy by reducing variance (at the cost of increasing your bias).

3. Lasso and ridge regression differ in the choice of the norm. The L1 norm will cause the lasso to quickly drive individual regression coefficients completely to zero, thereby acting as a feature selection technique. The L2 norm used by ridge will preserve larger numbers of independent variables within the model.

4. There is also something known as an elastic net, whose penalty is a convex combination of the ridge and lasso penalties and which offers many of the best properties of both.
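
The difference described in points 3 and 4 shows up even in the simplest possible setting. The sketch below assumes a single predictor scaled to unit norm, so that each penalized estimate is a simple function of the unpenalized coefficient z. The function names, the example coefficients, and the values of lam and alpha are illustrative assumptions, and the penalties are parameterized as 0.5*(z - b)^2 + lam*|b| for lasso, 0.5*(z - b)^2 + 0.5*lam*b^2 for ridge, and 0.5*(z - b)^2 + lam*(alpha*|b| + 0.5*(1 - alpha)*b^2) for the elastic net.

```python
def soft_threshold(z, t):
    """Lasso estimate: move z toward zero by t, landing exactly on zero if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_shrink(z, lam):
    """Ridge estimate: rescale z toward zero; never exactly zero unless z is."""
    return z / (1.0 + lam)


def elastic_net_shrink(z, lam, alpha):
    """Elastic net estimate: soft-threshold by the L1 part, then rescale by the L2 part."""
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Hypothetical unpenalized coefficients, two large and two small.
ols_coefs = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

lasso = [soft_threshold(z, lam) for z in ols_coefs]
ridge = [ridge_shrink(z, lam) for z in ols_coefs]
enet = [elastic_net_shrink(z, lam, 0.5) for z in ols_coefs]

for z, l, r, e in zip(ols_coefs, lasso, ridge, enet):
    print(f"ols={z:+.3f}  lasso={l:+.3f}  ridge={r:+.3f}  enet={e:+.3f}")
```

In this sketch the lasso sets both small coefficients exactly to zero, ridge shrinks all four but keeps every one of them in the model, and the elastic net does some of each.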

Alderaan delenda est

#6 S2000magic

Group: Full Members
Posts: 439
Joined: 2011-November-11
Gender:Male
Location:Yorba Linda, CA
Interests:magic, horseback riding, hiking, camping, F1 racing, bridge, mathematics, finance, teaching

Posted 2011-December-06, 13:17

hrothgar, on 2011-December-06, 12:38, said:

1. The predictive accuracy of linear regression models suffers dramatically if you have relatively wide data sets with strong correlation between your independent variables.

So lasso and ridge overcome the problem of multicollinearity?

BCIII

"If you're driving [the Honda S2000] with the top up, the storm outside had better have a name."

Simplify the complicated side; don't complify the simplicated side.

#7 hrothgar

Posted 2011-December-06, 13:22

S2000magic, on 2011-December-06, 13:17, said:

So lasso and ridge overcome the problem of multicollinearity?

Much of the time, yes. However, you're decreasing variance by increasing bias.

Alderaan delenda est

#8 hrothgar

Posted 2011-December-06, 13:34

delete

Alderaan delenda est

#9 hrothgar

Posted 2011-December-06, 13:50

WellSpyder, on 2011-December-06, 12:32, said:

Putting a premium on parsimony therefore seems an attractive idea. Doing it by "forcing" coefficients towards zero doesn't necessarily seem an intuitive way of doing it, but I can see advantages in this, too, eg in the cases where traditional estimation can often produce a series of lags with opposite signs, which even though they can be significant in helping to fit the data over the estimation period seem unlikely to be able to help in predicting the future.

Here's an intuitive explanation that might help.

Assume that you have a linear model where Y = f(X1, X2, ..., XN) + a noise vector.
Furthermore, let's assume that one of these variables is a linear function of another.

If you run your regression, the program will probably throw some warning about a rank-deficient matrix, the reason being that you can't estimate unique values for these two coefficients. Any pair of coefficients that yields the same combined effect fits the data equally well.

Now perturb one of your observations by epsilon so that you no longer have this whole "rank deficiency" issue. Your regression is going to run perfectly fine. However, there's a catch... Relatively minor changes to your noise vector are going to cause enormous swings in your regression coefficients for the two correlated variables. Sometimes they'll be sitting at (+500, +800), the next at (-15, -24), the time after that at (-2500, -4000). If you want to believe that these coefficients have some real world meaning, this behavior is really annoying.

Adding in the regularization term penalizes solutions that are far removed from zero and makes the entire process much more stable.
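
The instability described above can be reproduced in a few lines. The following is a minimal sketch in plain Python, not the MATLAB code from the blog posts; the data, the seed, the penalty value lam = 1.0, and the helper names fit_two_predictors and coefficient_spread are all illustrative assumptions. It builds X2 as exactly 2 * X1, perturbs one observation by a small epsilon, and then refits the model under fresh noise draws, with and without a ridge penalty.

```python
import random


def fit_two_predictors(x1, x2, y, lam=0.0):
    """Solve (X'X + lam*I) b = X'y for two predictors and no intercept.

    lam = 0 gives ordinary least squares; lam > 0 gives ridge regression.
    """
    a11 = sum(u * u for u in x1) + lam
    a22 = sum(v * v for v in x2) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return (a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det


rng = random.Random(0)
n = 50
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [2.0 * u for u in x1]  # X2 is an exact linear function of X1 ...
x2[0] += 1e-3               # ... until one observation is perturbed by epsilon


def coefficient_spread(lam, trials=20):
    """Range of the fitted X1 coefficient across refits that differ only in the noise."""
    estimates = []
    for _ in range(trials):
        # True model: Y = 1*X1 + 1*X2 + noise
        y = [u + v + rng.gauss(0.0, 1.0) for u, v in zip(x1, x2)]
        b1, _b2 = fit_two_predictors(x1, x2, y, lam)
        estimates.append(b1)
    return max(estimates) - min(estimates)


ols_spread = coefficient_spread(lam=0.0)
ridge_spread = coefficient_spread(lam=1.0)
print(f"spread of X1 coefficient, no penalty: {ols_spread:.3f}")
print(f"spread of X1 coefficient, ridge:      {ridge_spread:.3f}")
```

In this sketch the unpenalized coefficient should swing far more widely from one noise draw to the next than the ridge coefficient does, although the exact numbers depend on the seed and on the chosen penalty.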

Alderaan delenda est

#10 jdeegan

Posted 2011-December-06, 15:35

I think I get it. If multicollinearity is your problem, ridge regression is a possible remedy. I always favored leaving out the surplus variable(s), or building a better model.

#11 S2000magic

Posted 2011-December-06, 15:47

jdeegan, on 2011-December-06, 15:35, said:

I think I get it. If multicollinearity is your problem, ridge regression is a possible remedy. I always favored leaving out the surplus variable(s), or building a better model.

That's so old-school; you probably also bid suits you have.

BCIII

"If you're driving [the Honda S2000] with the top up, the storm outside had better have a name."

Simplify the complicated side; don't complify the simplicated side.

#12 jdeegan

Posted 2011-December-06, 17:50

And I double for penalties whenever possible.

#13 S2000magic

Posted 2011-December-06, 22:32

jdeegan, on 2011-December-06, 17:50, said:

And I double for penalties whenever possible.

OK, now you're scaring me.

BCIII

"If you're driving [the Honda S2000] with the top up, the storm outside had better have a name."

Simplify the complicated side; don't complify the simplicated side.

#14 jdeegan

Posted 2011-December-07, 00:23

My wife, who is on the far side of 60, drives a Porsche Carrera rag top with a rear spoiler that deploys when you get over 70 mph. But, I would love a spin in your hi rev Honda.

#15 WellSpyder

Posted 2011-December-07, 04:07

jdeegan, on 2011-December-06, 15:35, said:

If multicollinearity is your problem, ... I always favored leaving out the surplus variable(s).

And which variable is that?

#16 S2000magic

Posted 2011-December-07, 08:04

jdeegan, on 2011-December-07, 00:23, said:

My wife, who is on the far side of 60, drives a Porsche Carrera rag top with a rear spoiler that deploys when you get over 70 mph. But, I would love a spin in your hi rev Honda.

I drove a 911S for many years (20,000 miles when I bought it, 180,000 miles when I sold it), and I can tell you hands down the S2000 is more fun to drive than the Porsche; and the Porsche was a blast to drive!

There is something sweet about a 9,000 RPM redline.

BCIII

"If you're driving [the Honda S2000] with the top up, the storm outside had better have a name."

Simplify the complicated side; don't complify the simplicated side.

#17 S2000magic

Posted 2011-December-07, 08:07

WellSpyder, on 2011-December-07, 04:07, said:

And which variable is that?

If they're sufficiently strongly correlated (positively or negatively), does it really matter which one(s) you drop?

BCIII

"If you're driving [the Honda S2000] with the top up, the storm outside had better have a name."

Simplify the complicated side; don't complify the simplicated side.

#18 helene_t

The Abbess

Group: Advanced Members
Posts: 17,197
Joined: 2004-April-22
Gender:Female
Location:UK

Posted 2011-December-07, 08:29

hrothgar, on 2011-December-06, 13:22, said:

Much of the time, yes. However, you're decreasing variance by increasing bias

LASSO is not good at dealing with correlated predictors. If there are two strongly correlated predictors it may simply be impossible to determine which of the two is the causal one and which one works only through confounding with the other. In that case, the most robust thing you can do is to give each of them approximately equal influence. This is what RIDGE does. Stepwise AIC has the same problem as LASSO.
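
To make the contrast concrete, here is a small sketch of the extreme case in which the two predictors are identical. Everything in it is an illustrative assumption: the toy data, the penalty value, and the helper names ridge_two and lasso_two. The lasso is fitted by cyclic coordinate descent on 0.5*||y - Xb||^2 + lam*||b||_1, and ridge by solving (X'X + lam*I) b = X'y, so the two values of lam are not directly comparable in size.

```python
def soft_threshold(z, t):
    """Move z toward zero by t, landing exactly on zero if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_two(x1, x2, y, lam):
    """Ridge fit for two predictors, no intercept: solve (X'X + lam*I) b = X'y."""
    a11 = sum(u * u for u in x1) + lam
    a22 = sum(v * v for v in x2) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return [(a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det]


def lasso_two(x1, x2, y, lam, sweeps=100):
    """Lasso fit for two predictors by cyclic coordinate descent, starting from zero."""
    cols = [x1, x2]
    b = [0.0, 0.0]
    for _ in range(sweeps):
        for j in (0, 1):
            k = 1 - j
            # Residual with predictor j left out of the fit
            partial = [t - cols[k][i] * b[k] for i, t in enumerate(y)]
            rho = sum(c * r for c, r in zip(cols[j], partial))
            b[j] = soft_threshold(rho, lam) / sum(c * c for c in cols[j])
    return b


x = [1.0, -1.0, 1.0, -1.0]
y = [2.0 * v for v in x]  # noise-free response driven by the shared signal

ridge_b = ridge_two(x, x, y, lam=1.0)
lasso_b = lasso_two(x, x, y, lam=1.0)
print("ridge:", ridge_b)
print("lasso:", lasso_b)
```

In this sketch ridge gives the two copies the same coefficient, while the lasso puts all of the weight on one copy and sets the other to zero. Which copy the lasso keeps here is an artifact of the update order, which is the arbitrariness being described.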

So if your main concern is to deal correctly with correlated predictors, RIDGE is preferable to just about everything else, although I suppose the best thing to do would be to have a serious talk with the domain expert to try to get to a more advanced model that captures the domain knowledge better. For example, you might put an L2 (RIDGE) penalty on coefficients that belong to clusters of two or more correlated predictors, while putting an L1 (LASSO) penalty on the lonely riders. RIDGE and LASSO are somewhat ad hoc methods; they are the methods you will use when you have large data sets but shallow domain knowledge.

As for the bias, yes, but that is intentional: you apply biased estimators like RIDGE when the bias is a virtue. You have a prior belief that small coefficients are more plausible than large ones, so the mean (or mode) of the posterior belief must be smaller in magnitude than an unbiased estimate.
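
The prior-belief reading can be spelled out with a short derivation. The setup is an assumption for illustration: Gaussian noise with variance sigma^2, and independent zero-mean priors on the coefficients with scale tau (Gaussian) or b (Laplace).

```latex
\begin{aligned}
y \mid \beta &\sim \mathcal{N}(X\beta,\ \sigma^2 I) \\[4pt]
\beta_j \sim \mathcal{N}(0, \tau^2):\quad
-\log p(\beta \mid y) &= \frac{1}{2\sigma^2}\lVert y - X\beta\rVert_2^2
  + \frac{1}{2\tau^2}\lVert \beta \rVert_2^2 + \text{const} \\
&\Rightarrow\ \text{ridge with } \lambda = \sigma^2/\tau^2 \\[4pt]
\beta_j \sim \text{Laplace}(0, b):\quad
-\log p(\beta \mid y) &= \frac{1}{2\sigma^2}\lVert y - X\beta\rVert_2^2
  + \frac{1}{b}\lVert \beta \rVert_1 + \text{const} \\
&\Rightarrow\ \text{lasso with } \lambda = 2\sigma^2/b
\end{aligned}
```

Under the Gaussian prior the posterior is itself Gaussian, so its mean and mode coincide and both equal the ridge estimate. Under the Laplace prior the lasso estimate is the posterior mode only. The values of lambda are stated for the convention in which the penalty is added to the unscaled sum of squared errors.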

The world would be such a happy place, if only everyone played Acol :) --- TramTicket

#19 nige1

5-level belongs to me

Group: Advanced Members
Posts: 9,128
Joined: 2004-August-30
Gender:Male
Location:Glasgow Scotland
Interests:Poems Computers

Posted 2011-December-07, 08:33

Thank you Hrothgar. Loren Shure is brilliant at making this beautiful stuff more comprehensible to tyros like me. MATLAB seems powerful and succinct. A bit like APL, a language devised by Ken Iverson of IBM, popular in the 60s and 70s that works best with a special mathematical character-set. More recently, in rec.games.bridge, Charles Brenner uses APL to solve Bridge probability problems, without recourse to crude simulation.

Guthrie.tech

#20 hrothgar

Posted 2011-December-07, 08:55

nige1, on 2011-December-07, 08:33, said:

Thank you Hrothgar. Loren Shure is brilliant at making this beautiful stuff more comprehensible to tyros like me.

Loren is, indeed, great.
However, I feel obliged to point out that those two articles (and all the code) were authored by moi...

Alderaan delenda est
