Linear regression models are easy to fit and interpret. Moreover, they are surprisingly accurate in many real-world situations where the relationship between the response and the predictors is approximately linear. However, it is often the case that not all the variables used in a multiple regression model are associated with the response.
Including irrelevant variables leads to unnecessary complexity. By removing them we can obtain a model that is more easily interpreted. Additionally, when the number of predictors is close to the number of samples, linear models tend to overfit and therefore to perform poorly on new data.
The best subset selection approach identifies a subset of the predictors that appears to be related to the response. We can then fit a model using least squares on the reduced set of variables.
Here, we are going to do that using the classic `swiss` data set provided with R's `datasets` package. We can have a look at its documentation.
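A minimal way to load and inspect the data — `swiss` ships with base R, so no extra packages are needed:

```r
# swiss is part of the built-in datasets package
data(swiss)

# Open the documentation page and take a first look at the data
?swiss
head(swiss)
str(swiss)
```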
There we see that it contains a standardized fertility measure and socio-economic indicators for each of the 47 French-speaking provinces of Switzerland at about 1888. It is composed of 47 observations on 6 variables, each of which is in percent (i.e. in [0, 100]), including:
- Fertility as a common standardized fertility measure.
- Agriculture as % of men involved in agriculture as occupation.
- Examination as % of draftees receiving the highest mark on the army examination.
- Education as % of draftees with education beyond primary school.
- Catholic as % of Catholics (as opposed to Protestant).
- Infant.Mortality as % of live births who live less than 1 year.
We are interested in predicting infant mortality using a multiple linear regression model.
We can have a quick look at how these variables interact using some plots.
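A scatterplot matrix is a simple way to see all pairwise relationships at once; a sketch:

```r
# Scatterplot matrix of all variable pairs, with a smoothed trend in each panel
pairs(swiss, panel = panel.smooth)
```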
And the correlation matrix.
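The correlation matrix can be computed directly:

```r
# Pairwise correlations, rounded for readability
round(cor(swiss), 2)
```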
We can see that `Infant.Mortality` is positively correlated (obviously) with being `Catholic`, and negatively with `Education`. Additionally, we see that `Fertility` is positively correlated with `Catholic` and with `Agriculture`, and negatively with `Education` and `Examination`.
Let us now select among these predictors using best subset selection. The `regsubsets()` function in the `leaps` package does exactly this. It performs best subset selection by identifying, for each model size, the best model that contains that number of predictors, where best is quantified using RSS.
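A sketch of the fit, assuming the `leaps` package is available (`install.packages("leaps")` otherwise); the object names `regfit` and `reg_summary` are our own choice:

```r
library(leaps)

# Best subset selection of predictors for Infant.Mortality;
# with 5 candidate predictors, models of size 1 to 5 are considered
regfit <- regsubsets(Infant.Mortality ~ ., data = swiss)
reg_summary <- summary(regfit)
```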
The `summary()` command outputs the best set of variables for each model size. The `outmat` field on the summary contains a matrix with the best subset of predictors for models with 1 to 5 predictors. For example, the best model with two variables uses `Fertility` and `Education` as predictors for `Infant.Mortality`. We can also see that all models include `Fertility`, and that all models with at least two variables also include `Education`.
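The selection matrix can be inspected like this (the block refits from scratch so it runs on its own):

```r
library(leaps)

regfit <- regsubsets(Infant.Mortality ~ ., data = swiss)
reg_summary <- summary(regfit)

# One row per model size; a "*" marks the variables included in that model
reg_summary$outmat
```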
The summary object also includes metrics such as adjusted R², Cp and BIC, which we can use to determine the best overall model.
Adjusted R² tells us that the best model is the one with two variables, that is, `Fertility` and `Education`. Using Cp we arrive at the same conclusion.
However, using BIC we should go for the model using just `Fertility` as a predictor. We can plot this information.
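A self-contained sketch of how to extract the size preferred by each criterion and plot them side by side:

```r
library(leaps)

regfit <- regsubsets(Infant.Mortality ~ ., data = swiss)
reg_summary <- summary(regfit)

# Model size preferred by each criterion
which.max(reg_summary$adjr2)  # adjusted R^2: larger is better
which.min(reg_summary$cp)     # Mallows' Cp: smaller is better
which.min(reg_summary$bic)    # BIC: smaller is better

# Plot each criterion against the number of variables
par(mfrow = c(1, 3))
plot(reg_summary$adjr2, xlab = "Number of variables",
     ylab = "Adjusted R^2", type = "l")
plot(reg_summary$cp, xlab = "Number of variables",
     ylab = "Cp", type = "l")
plot(reg_summary$bic, xlab = "Number of variables",
     ylab = "BIC", type = "l")
```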
From there we can see that the 2-variable model is not that bad regarding the BIC criterion. So, as a conclusion, we show the coefficients of the 2-variable model, which uses `Fertility` and `Education` as inputs to predict `Infant.Mortality`.
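The coefficients of the chosen model can be extracted with `coef()`, passing the desired model size:

```r
library(leaps)

regfit <- regsubsets(Infant.Mortality ~ ., data = swiss)

# Coefficients of the best 2-variable model
coef(regfit, 2)
```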