Linear regression models are easy to fit and interpret. Moreover, they are surprisingly accurate in many real-world situations where the relationship between the response and the predictors is approximately linear. However, it is often the case that not all the variables used in a multiple regression model are associated with the response.

Including irrelevant variables leads to **unnecessary complexity**. By removing
them we can obtain a model that is more **easily interpreted**. Additionally, when
the number of predictors is close to the number of samples, linear models
**tend to overfit** and therefore perform poorly on new data.

The *best subset model selection* approach identifies a subset of the
predictors that appear to be related to the response. We can then fit
a model using least squares on the reduced set of variables.

Here, we are going to do that using `R`.

#### The `swiss` data set

We will use the classic `swiss` data set provided with the R `datasets` package.
We can have a look at its documentation.
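For instance, in a standard R session we can load the data and inspect it like this (`swiss` ships with base R, so no extra packages are needed):

```r
# The swiss data set is part of base R's 'datasets' package
data(swiss)

# Open the documentation and inspect the structure of the data
?swiss
str(swiss)
```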

There we see that it contains *standardized fertility measure and socio-economic
indicators for each of the 47 French-speaking provinces of Switzerland at about
1888*. It is composed of 47 observations on 6 variables, each of which is in
percent (i.e. in [0, 100]), including:

- Fertility.
- Agriculture as % of men involved in agriculture as occupation.
- Examination as % of draftees receiving highest mark on army examination.
- Education as % of education beyond primary school for draftees.
- Catholic as % of catholic (as opposed to protestant).
- Infant mortality as % of live births who live less than 1 year.

We are interested in predicting infant mortality using a multiple linear regression model.

We can have a quick look at how these variables interact using some plots.
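One simple option, sketched here with base R graphics, is a scatterplot matrix:

```r
# Scatterplot matrix of every pair of variables in the data set
pairs(swiss)
```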

And the correlation matrix.
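The correlation matrix can be computed with base R's `cor`; rounding makes it easier to scan:

```r
# Pairwise correlations, rounded to two decimals for readability
round(cor(swiss), 2)
```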

We can see that `Infant.Mortality` is positively correlated with `Fertility`
(obviously) and with being `Catholic`, and negatively with `Examination` and
`Education`. Additionally, we see that `Fertility` is positively correlated with
being `Catholic` and with `Agriculture`, and negatively with `Education` and
`Examination`.

Let us now select among these predictors using best subset selection. The
function `regsubsets` in the `leaps` package does exactly this. It performs
best subset selection by identifying, for each model size, the best model that
contains that number of predictors, where best is quantified using RSS.

The `summary()` command outputs the best set of variables for each model size.
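A minimal sketch of the call could look like this (assuming the `leaps` package is installed; the object names are our own):

```r
# install.packages("leaps")  # if not already installed
library(leaps)

# Best subset selection of predictors for Infant.Mortality,
# considering all other variables in the swiss data set
best_fit <- regsubsets(Infant.Mortality ~ ., data = swiss)
best_summary <- summary(best_fit)

# Best subset of predictors for each model size
best_summary$outmat
```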

The `outmat` field of the summary contains a matrix with the best subset of
predictors for models with 1 to 5 predictors. For example, the best model with
two variables includes `Fertility` and `Education` as predictors for
`Infant.Mortality`. We can also see that all models include `Fertility`, and
that all models with at least two variables also include `Education`. The
summary object also includes metrics such as *adjusted R2*, *Cp*, or *BIC*,
that we can use to determine the best overall model.
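To compare the metrics, one possibility (refitting here so the snippet is self-contained; requires `leaps`) is:

```r
library(leaps)
best_summary <- summary(regsubsets(Infant.Mortality ~ ., data = swiss))

# Model size preferred by each metric
which.max(best_summary$adjr2)  # adjusted R2: larger is better
which.min(best_summary$cp)     # Cp: smaller is better
which.min(best_summary$bic)    # BIC: smaller is better
```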

Adjusted R2 tells us that the best model is the one with two variables, that is,
`Fertility` and `Education`.

Using Cp we arrive at the same conclusion.

However, using BIC we should go for the model using just `Fertility` as a
predictor. We can plot this information.
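For example, we can plot the BIC of the best model of each size (again refitting so the snippet stands alone):

```r
library(leaps)
best_summary <- summary(regsubsets(Infant.Mortality ~ ., data = swiss))

# BIC of the best model for each number of predictors
plot(best_summary$bic, type = "b",
     xlab = "Number of predictors", ylab = "BIC")
```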

From there we can see that the 2-variable model is not that bad regarding BIC.
So, as a conclusion, we show the coefficients of the 2-variable model using
`Fertility` and `Education` as inputs to predict `Infant.Mortality`.
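The `coef` method of `regsubsets` takes the model size as its second argument; a self-contained sketch:

```r
library(leaps)
best_fit <- regsubsets(Infant.Mortality ~ ., data = swiss)

# Coefficients of the best model with two predictors
coef(best_fit, 2)
```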