Centering is one of those statistical topics that everyone seems to have heard about, but few people understand well. It has developed a mystique that is completely unnecessary.

Centering simply means subtracting a single value from all of your data points. It changes the scale of a variable and is generally applied to predictors. It’s called centering because people often use the mean as the value they subtract (so the new mean becomes 0), but it doesn’t have to be the mean. In fact, there are many situations in which a value other than the mean is more meaningful.
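As a minimal sketch, here is what centering looks like in NumPy. The values and the cutoff of 15 are purely illustrative:

```python
import numpy as np

# Hypothetical predictor values (illustration only)
x = np.array([10.0, 12.0, 15.0, 18.0, 20.0])

# Center at the mean: the centered variable now has mean 0
x_centered = x - x.mean()

# Center at some other meaningful value, e.g. a hypothetical cutoff of 15
x_cutoff = x - 15.0

print(x_centered.mean())  # effectively 0
```

Either way, the spread and shape of the variable are unchanged; only its zero point moves.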

While centering can be done in a simple linear regression, its real benefits arise when the model contains multiplicative terms: interaction terms or quadratic terms (X squared).

There are two reasons to center. The first is when an interaction term is formed by multiplying two predictor variables that are on a positive scale. When you multiply them to create the interaction, the numbers close to 0 stay close to 0 and the high numbers get really high, so the interaction term ends up highly correlated with the original variables.

But this is easy to check. Just create the multiplicative term in your dataset, then run a correlation between that interaction term and the original predictors. While correlations are not the best way to test for multicollinearity, they give you a quick check.
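This quick check might look like the following sketch, using two simulated positive predictors (the uniform distributions and sample size are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent predictors on a positive scale (simulated for illustration)
x1 = rng.uniform(1, 10, size=1000)
x2 = rng.uniform(1, 10, size=1000)

# Create the multiplicative (interaction) term
interaction = x1 * x2

# Correlate the interaction term with each original predictor
r1 = np.corrcoef(x1, interaction)[0, 1]
r2 = np.corrcoef(x2, interaction)[0, 1]
print(f"corr(x1, x1*x2) = {r1:.2f}")
print(f"corr(x2, x1*x2) = {r2:.2f}")
```

Even though x1 and x2 are independent here, both correlations come out substantial, because large values of either predictor produce large products.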

Then try again, but first center one of your IVs.

Centering one of your variables at the mean (or some other meaningful value near the middle of the distribution) makes roughly half of its values negative (since the centered variable’s mean is now 0). When multiplied by the other, still-positive variable, the products no longer all increase together.
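A sketch of that before-and-after comparison, again with simulated positive predictors (distributions and sample size are illustrative assumptions). Note that centering x1 specifically reduces the interaction's correlation with the *other* predictor, x2:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, size=1000)
x2 = rng.uniform(1, 10, size=1000)

# Before: interaction built from the raw positive predictors
r_raw = np.corrcoef(x2, x1 * x2)[0, 1]

# After: center x1 first, so about half its values are negative
x1c = x1 - x1.mean()
r_centered = np.corrcoef(x2, x1c * x2)[0, 1]

print(f"raw:      corr(x2, x1*x2)  = {r_raw:.2f}")
print(f"centered: corr(x2, x1c*x2) = {r_centered:.2f}")
```

The raw correlation is substantial; after centering x1 it drops to near zero, since products of negative and positive deviations now cancel out. In practice, centering both predictors is common, since each centered variable reduces the interaction's correlation with the other.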

The other reason to center is that it makes the parameter estimates (regression coefficients, or betas) easier to interpret.
