Feature selection & regularization (regression)
Chapter last updated: 13 February 2025
Learning outcomes/objective: Learn…
- …about model regularization/feature selection (focus on ridge regression and lasso).
- …how to implement ridge regression and lasso in R with tidymodels.
- …what a tuning parameter can be used for (\(\lambda\) in ridge regression/lasso)
Sources: James et al. (2013, Ch. 6), ISLR tidymodels labs by Emil Hvitfeldt
1 Concepts
- Feature selection
- also known as variable selection, attribute selection or variable subset selection
- process of selecting subset of relevant features/predictors for our model
- Aims: interpretability, reduce training time, avoid curse of dimensionality, improve data-model compatibility (cf. Wikipedia, 12.02.2025)
- Regularization
- = either the choice of model or modifications to the learning algorithm
- Penalizing the complexity of a model to reduce overfitting
- Theory vs. algorithmic feature selection
- ML largely focuses on automated selection, although theory may still play a role
2 Why use alternatives to least squares?
- Why use another fitting procedure instead of least squares? (James et al. 2013, Ch. 6)
- Alternatives can yield better prediction accuracy and model interpretability
- Prediction Accuracy: Provided true relationship between response and predictors is approximately linear, least squares estimates will have low bias
- If \(n \gg p\)1, least squares estimates have low bias; if not, there is a risk of overfitting
- By shrinking estimated coefficients → reduce variance at the cost of small increase in bias (James et al. 2013, 204)
- Model Interpretability: Often some features used in a multiple regression model are not associated with the outcome
- Including such irrelevant features leads to unnecessary complexity in the model
- Idea: remove these variables (set corresponding coefficient estimates to 0) to obtain a more interpretable model
3 Methods for feature/variables selection
- Subset selection: identify subset of \(p\) predictors that we believe to be related to outcome
- …then train the model using the reduced set of variables (using least squares)
- e.g., best subset selection (James et al. 2013, Ch. 6.1.1), forward/backward stepwise selection (relying on model fit) (James et al. 2013, Ch. 6.1.2)
- Tidymodels: “not a huge priority” (Stackoverflow, 12.02.2025)2
- Shrinkage: train model involving all \(p\) predictors but shrink coefficients to zero relative to the least squares estimates
- This shrinkage (also known as regularization) has the effect of reducing variance
- Depending on type of shrinkage, some coefficients may be estimated to be exactly zero3
- e.g., ridge regression & the lasso (James et al. 2013, Ch. 6.2.1, 6.2.2)
- Dimension reduction: involves projecting the \(p\) predictors into an \(M\)-dimensional subspace, where \(M<p\).4
4 High dimensional data
- Most traditional statistical techniques are intended for the low-dimensional setting in which \(n\)5 is much greater than \(p\)6 (James et al. 2013, 238)
- For data sets containing more features than observations (called high-dimensional), classical approaches7 are not appropriate (James et al. 2013, 239); a short simulation after this list illustrates this
- But certain less flexible fitting approaches, such as forward stepwise selection, ridge regression, the lasso, and principal components regression, are particularly useful (James et al. 2013, 239)
- They avoid overfitting by using less flexible fitting approach than least squares (James et al. 2013, 239)
- Q: Why are the latter less flexible? (see Section 7)
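A minimal simulation (hypothetical setup, not the ESS data used in the labs below) illustrates the \(p>n\) problem: least squares cannot estimate all coefficients, while a ridge fit via glmnet still returns an estimate for every predictor.
library(glmnet) # penalized regression (ridge/lasso)
set.seed(42)
n <- 30; p <- 50                       # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(2 * X[, 1] + rnorm(n))  # only the first predictor matters
# Least squares: 21 of the 51 coefficients (incl. intercept) cannot be estimated (NA)
fit_ls <- lm(y ~ X)
sum(is.na(coef(fit_ls)))               # number of inestimable coefficients
# Ridge regression (alpha = 0 in glmnet) still returns (shrunken) estimates for all p predictors
fit_rr <- glmnet(X, y, alpha = 0, lambda = 1)
sum(is.na(as.vector(coef(fit_rr))))    # 0 -> ridge estimates exist for every coefficient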
5 Ridge regression (RR) (1)
- Least squares (LS) estimator: least squares fitting procedure estimates \(\beta_{0}, \beta_{1}, ..., \beta_{p}\) using the values that minimize RSS
- Ridge regression (RR): seeks coefficient estimates that fit data well (minimize RSS) + shrinkage penalty
- See formula (6.5) in (James et al. 2013, 215)8
- shrinkage penalty \(\lambda \sum_{j=1}^{p}\beta_{j}^{2}\) is small when \(\beta_{1}, ..., \beta_{p}\) are close to zero
- penalty has the effect of shrinking the estimates \(\beta_{j}\) toward zero
- Tuning parameter \(\lambda\) serves to control the relative impact of these two terms (RSS and shrinkage penalty) on the regression coefficient estimates
- When \(\lambda = 0\) the penalty term has no effect and ridge regression will produce least squares estimates
- As \(\lambda \rightarrow \infty\), impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero
- The penalty is only applied to the coefficients, not the intercept
- Ridge regression will produce a different set of coefficient estimates \(\hat{\beta}_{\lambda}^{R}\) for each \(\lambda\)
- Hyperparameter tuning: fit many models with different \(\lambda\)s to find the model that performs “best” (a minimal closed-form sketch follows below)
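To make the role of \(\lambda\) concrete, here is a minimal sketch (simulated data, hypothetical setup) that computes the closed-form ridge solution \(\hat{\beta}_{\lambda}^{R} = (X^{\top}X + \lambda I)^{-1}X^{\top}y\) for standardized predictors and a centered outcome, for several values of \(\lambda\):
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))             # standardized predictors
y <- as.vector(X %*% c(3, -2, 1, 0, 0) + rnorm(n))
y <- y - mean(y)                                   # centered outcome, so no intercept is needed
# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(lambda) {
  as.vector(solve(t(X) %*% X + lambda * diag(p), t(X) %*% y))
}
lambdas <- c(0, 10, 100, 10000)                    # lambda = 0 reproduces least squares
estimates <- sapply(lambdas, ridge_coef)
colnames(estimates) <- paste0("lambda_", lambdas)
round(estimates, 3)                                # all estimates are pulled toward zero as lambda grows
With \(\lambda = 0\) the first column reproduces the least squares estimates; moving right, every coefficient shrinks toward zero.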
6 Ridge regression (RR) (2): Visualization
- Figure 1 displays regression coefficient estimates for the Credit data (James et al. 2013, Ch. 6.2)
- Left-hand panel: each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of \(\lambda\)
- to the left \(\lambda\) is essentially zero, so ridge coefficient estimates are the same as the usual least squares estimates
- as \(\lambda\) increases, ridge coefficient estimates shrink towards zero9
- when \(\lambda\) is extremely large, all of the ridge coefficient estimates are basically zero = null model
- While standard LS coefficient estimates are scale equivariant, RR coefficients are not
- Apply ridge regression after standardizing the predictors!10 (a short glmnet sketch for plotting coefficient paths follows below)
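To produce a plot in the spirit of Figure 1, one can fit the full ridge path with glmnet directly and plot the coefficient estimates against \(\log(\lambda)\). A minimal sketch on simulated data (not the Credit data from James et al. 2013):
library(glmnet)
set.seed(2)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% c(3, -2, 1.5, rep(0, p - 3)) + rnorm(n))
# alpha = 0 -> ridge; glmnet standardizes the predictors internally by default
fit_path <- glmnet(X, y, alpha = 0)
plot(fit_path, xvar = "lambda", label = TRUE)  # one curve per coefficient, plotted against log(lambda)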

7 Ridge regression (RR) (3): Why?
- Why Does Ridge Regression (RR) Improve Over Least Squares (LS)?
- Advantage rooted in bias-variance trade-off
- As \(\lambda\) increases, flexibility of ridge regression fit decreases11, leading to decreased variance but increased bias
- Q: What was variance and bias again in the context of ML? (see bias-variance trade-off)
- If relationship between outcome and features is close to linear, LS estimates will have low bias but may have high variance
- \(\rightarrow\) small change in the training data can cause a large change in the LS coefficient estimates
- In particular, when number of variables \(p\) is almost as large as number of observations \(n\)12
- If \(p>n\), LS estimates do not even have a unique solution, whereas RR still works
- Summary: RR works best in situations where the LS estimates have high variance (a small simulation below illustrates this)
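A small simulation (hypothetical setup with \(p\) close to \(n\)) makes this tangible: refitting least squares and ridge on repeated samples shows that the ridge estimate of a single coefficient varies far less across samples.
library(glmnet)
set.seed(3)
p <- 40; n <- 50                        # p close to n -> unstable least squares
beta <- c(2, rep(0, p - 1))             # only the first predictor matters
one_draw <- function() {
  X <- matrix(rnorm(n * p), n, p)
  y <- as.vector(X %*% beta + rnorm(n))
  ls_est    <- unname(coef(lm(y ~ X))[2])                       # LS estimate of beta_1
  ridge_est <- coef(glmnet(X, y, alpha = 0, lambda = 1))[2, 1]  # ridge estimate of beta_1
  c(ls = ls_est, ridge = ridge_est)
}
estimates <- replicate(200, one_draw())
apply(estimates, 1, sd)                 # ridge estimates vary much less across samples (at the cost of some bias)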
8 The Lasso
- Ridge regression (RR) has one disadvantage
- Unlike subset selection methods13 that select a variable subset, RR will include all \(p\) predictors in final model
- Penalty \(\lambda \sum_{j=1}^{p}\beta_{j}^{2}\) will shrink all coefficients towards zero, but not set them exactly to zero14
- No problem for accuracy, but challenging for model interpretation when \(p\) is large
- Lasso (least absolute shrinkage and selection operator)
- Like RR, the lasso shrinks the coefficient estimates towards zero, but additionally…
- …its penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero…
- …when the tuning parameter \(\lambda\) is sufficiently large (cf. James et al. 2013, Eq. 6.7, p. 219)
- Lasso performs variable selection leading to easier interpretation → yields sparse models
- As in RR, selecting a good value of \(\lambda\) is critical (a short glmnet sketch follows below)
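A quick way to see this behaviour is to fit ridge and lasso with glmnet at the same \(\lambda\) and count the coefficients that end up exactly at zero (minimal sketch on simulated data):
library(glmnet)
set.seed(4)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% c(3, -2, 1.5, rep(0, p - 3)) + rnorm(n))
fit_ridge <- glmnet(X, y, alpha = 0, lambda = 0.5)  # ridge (alpha = 0)
fit_lasso <- glmnet(X, y, alpha = 1, lambda = 0.5)  # lasso (alpha = 1)
sum(as.vector(coef(fit_ridge))[-1] == 0)  # 0 -> ridge keeps every predictor
sum(as.vector(coef(fit_lasso))[-1] == 0)  # many exact zeros -> the lasso performs variable selection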
9 Lasso vs. Ridge Regression
- Lasso advantages: produces simpler, more interpretable models involving only a subset of the features
- But, which method leads to better prediction accuracy?
- Neither ridge regression nor lasso universally dominates the other
- Lasso works better when a small number of predictors have substantial coefficients and the remaining predictors have small or zero coefficients
- Ridge regression performs better when the outcome depends on many predictors with coefficients of roughly equal size
- Both methods can reduce variance at the cost of slight bias increase, improving prediction accuracy
10 Lab: Ridge Regression
See earlier description here (right click and open in new tab).
We first import the data into R:
10.1 Inspecting the dataset
The dataset is relatively large with 1977 rows and 346 columns. The skim() function can be used to get a quick overview.
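For example (output omitted here because of the large number of columns):
library(skimr)
skim(data)                                              # summary statistics for all 346 columns
skim(data) %>%
  dplyr::filter(skim_variable == "life_satisfaction")   # or focus on the outcome only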
10.2 Split the data
Below we split the data and create resampled partitions with vfold_cv(), stored in an object called data_folds. Hence, we have the original data_train and data_test, but also resamples of data_train in data_folds.
# Extract data with missing outcome
data_missing_outcome <- data %>% filter(is.na(life_satisfaction))
dim(data_missing_outcome)
[1] 205 346
# Omit individuals with missing outcome from data
data <- data %>% drop_na(life_satisfaction) # ?drop_na
dim(data)
[1] 1772 346
# Split the data into training and test data
set.seed(345)
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
<Training/Testing/Total>
<1417/355/1772>
# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [1133/284]> Fold1
2 <split [1133/284]> Fold2
3 <split [1134/283]> Fold3
4 <split [1134/283]> Fold4
5 <split [1134/283]> Fold5
The training data has 1417 rows, the test data has 355. The training data is further split into 5 folds.
10.3 Simple ridge regression (all in one)
Below we provide a quick example of a ridge regression (note that in the recipe below we standardize the predictors).
- Arguments of linear_reg() when using the glmnet engine (a short sketch of the three specification variants follows after this list):
- mixture: specifies the type of regularization; mixture = 0 specifies only ridge regularization, mixture = 1 specifies only lasso regularization; setting mixture to a value between 0 and 1 lets us use both
- penalty: the penalty value (\(\lambda\)) we need to set when using the glmnet engine (here a fixed value; see the model specification below)
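For reference, a sketch of the three parsnip specifications that differ only in mixture (the penalty values are purely illustrative; only the ridge specification is used in this lab):
library(tidymodels)
library(glmnet)
ridge_spec   <- linear_reg(mixture = 0,   penalty = 2)   %>% set_engine("glmnet")  # ridge only
lasso_spec   <- linear_reg(mixture = 1,   penalty = 0.1) %>% set_engine("glmnet")  # lasso only
elastic_spec <- linear_reg(mixture = 0.5, penalty = 0.1) %>% set_engine("glmnet")  # elastic net: a mix of both penalties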
# Define a recipe for preprocessing
recipe_ridge <- recipe(life_satisfaction ~ ., data = data_train) %>%
update_role(respondent_id, new_role = "ID") %>% # define as ID variable
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# Extract and preview data + recipe (directly with $)
data_preprocessed <- prep(recipe_ridge, data_train)$template
dim(data_preprocessed)
[1] 1417 931
# Define a model
model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
set_engine("glmnet") %>% # lm engine
set_mode("regression") # regression problem
# Define a workflow
workflow_ridge <- workflow() %>% # create empty workflow
add_recipe(recipe_ridge) %>% # add recipe
add_model(model_ridge) # add model
# Fit the workflow (including recipe and model)
fit_ridge <- workflow_ridge %>%
fit_resamples(resamples = data_folds, # specify cv-datasets
metrics = metric_set(rmse, rsq, mae))
collect_metrics(fit_ridge)
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 mae standard 1.31 5 0.0242 Preprocessor1_Model1
2 rmse standard 1.73 5 0.0350 Preprocessor1_Model1
3 rsq standard 0.404 5 0.0294 Preprocessor1_Model1
If we are happy with the performance of our model (evaluated using resampling), we can fit it on the full training set and use the resulting parameters to obtain predictions for the test dataset. Subsequently, we calculate the accuracy in the test data.
# Fit model in full training dataset
fit_final_ridge <- workflow_ridge %>%
parsnip::fit(data = data_train)
# Inspect coefficients:
tidy(fit_final_ridge) %>% slice(1:5) # show first 5 coefficients
# A tibble: 5 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 7.44 2
2 unemployed_active -0.0269 2
3 unemployed -0.0506 2
4 education 0.0145 2
5 news_politics_minutes 0.0329 2
# Test data: Predictions + accuracy
metrics_combined <- metric_set(mae, rmse, rsq) # Use several metrics
augment(fit_final_ridge , new_data = data_test) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.28
2 rmse standard 1.72
3 rsq standard 0.374
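Above, the penalty was fixed at 2 more or less arbitrarily. In practice, \(\lambda\) is treated as a tuning parameter (cf. Section 5). The sketch below (assuming the recipe_ridge and data_folds objects from above) marks the penalty with tune(), evaluates a grid of values via cross-validation, and finalizes the workflow with the best one:
# Mark the penalty as a tuning parameter
model_ridge_tune <- linear_reg(mixture = 0, penalty = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("regression")
workflow_ridge_tune <- workflow() %>%
  add_recipe(recipe_ridge) %>%
  add_model(model_ridge_tune)
# Try a grid of penalty values (on the log10 scale) using the resampled folds
penalty_grid <- grid_regular(penalty(range = c(-3, 2)), levels = 30)
tune_res <- tune_grid(workflow_ridge_tune,
                      resamples = data_folds,
                      grid = penalty_grid,
                      metrics = metric_set(rmse))
# Pick the best penalty and refit on the full training data
best_penalty <- select_best(tune_res, metric = "rmse")
workflow_ridge_final <- finalize_workflow(workflow_ridge_tune, best_penalty)
fit_ridge_tuned <- workflow_ridge_final %>%
  parsnip::fit(data = data_train)
collect_metrics(tune_res) or autoplot(tune_res) can be used to inspect how performance changes across the penalty grid.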
11 Lab: Lasso
See earlier description here (right click and open in new tab).
We first import the data into R:
Below we split the data and create resampled partitions with vfold_cv(), stored in an object called data_folds. Hence, we have the original data_train and data_test, but also resamples of data_train in data_folds.
# Extract data with missing outcome
data_missing_outcome <- data %>% filter(is.na(life_satisfaction))
dim(data_missing_outcome)
[1] 205 346
# Omit individuals with missing outcome from data
data <- data %>% drop_na(life_satisfaction) # ?drop_na
dim(data)
[1] 1772 346
# Split the data into training and test data
set.seed(345)
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
<Training/Testing/Total>
<1417/355/1772>
# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [1133/284]> Fold1
2 <split [1133/284]> Fold2
3 <split [1134/283]> Fold3
4 <split [1134/283]> Fold4
5 <split [1134/283]> Fold5
# Define the recipe
recipe_lasso <- recipe(life_satisfaction ~ ., data = data_train) %>%
update_role(respondent_id, new_role = "ID") %>% # define as ID variable
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# Specify the model
model_lasso <-
linear_reg(penalty = 0.1, mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
# Define the workflow
workflow_lasso <- workflow() %>%
add_recipe(recipe_lasso) %>%
add_model(model_lasso)
# Fit the workflow (including recipe and model)
fit_lasso <- workflow_lasso %>%
fit_resamples(resamples = data_folds, # specify cv-datasets
metrics = metric_set(rmse, rsq, mae))
collect_metrics(fit_lasso)
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 mae standard 1.11 5 0.0262 Preprocessor1_Model1
2 rmse standard 1.54 5 0.0370 Preprocessor1_Model1
3 rsq standard 0.528 5 0.0218 Preprocessor1_Model1
# Fit model in full training dataset
fit_final_lasso <- workflow_lasso %>%
parsnip::fit(data = data_train)
# Inspect coefficients
tidy(fit_final_lasso) %>% filter(estimate!=0) # omit coefficients = 0
# A tibble: 34 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 6.31 0.1
2 trust_people 0.00201 0.1
3 people_fair 0.116 0.1
4 trust_police 0.0581 0.1
5 trust_united_nations 0.00234 0.1
6 left_right_scale 0.0671 0.1
7 satisfied_economy 0.151 0.1
8 satisfied_government 0.200 0.1
9 state_health_services 0.0314 0.1
10 happiness 1.00 0.1
# ℹ 24 more rows
# Test data: Predictions + accuracy
metrics_combined <- metric_set(mae, rmse, rsq) # Use several metrics
augment(fit_final_lasso , new_data = data_test) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.11
2 rmse standard 1.56
3 rsq standard 0.490
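Since glmnet internally fits the entire regularization path, we can also inspect how the lasso coefficients enter the model as \(\lambda\) decreases (this sketch assumes the fitted workflow fit_final_lasso from above):
# Extract the underlying glmnet object from the fitted workflow and plot the coefficient path
fit_final_lasso %>%
  extract_fit_engine() %>%
  plot(xvar = "lambda", label = TRUE)  # each curve: one coefficient as a function of log(lambda)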
12 Exercise/Homework
- Below is code to help you (you need to replace all ...)
- Use the data from above (European Social Survey (ESS)). Use the code below to load it, drop missings and to produce training, test as well as resampled data.
- Define three models (model_lm = linear regression, model_ridge = ridge regression, model_lasso = lasso regression) and create two recipes, recipe_lm for the linear regression (the model should only include three predictors: unemployed + age + education) and recipe_ridge_lasso for the other two models. Store the three models and two recipes in three workflows: workflow_lm, workflow_ridge, workflow_lasso
- Train these three workflows and evaluate their accuracy using resampling (below some code to get you started).
- Pick the best workflow (and corresponding model), fit it on the whole training data and evaluate the accuracy on the test data.
We first import the data into R:
# 1. ####
# Drop missings on outcome
data <- data %>%
drop_na(life_satisfaction)
# Split the data into training and test data
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
# Extract the two datasets
data_train <- training(data_split)
data_... <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
set.seed(345)
data_folds <- vfold_cv(..., v = 5) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# 2. ####
# RECIPES
recipe_lm <- recipe(life_satisfaction ~ unemployed + age + education, data = data_train) %>%
step_filter_missing(all_predictors(), threshold = 0.01) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
recipe_ridge_lasso <- recipe(life_satisfaction ~ ., data = ...) %>%
update_role(respondent_id, new_role = "ID") %>% # define as ID variable
step_filter_missing(all_predictors(), threshold = 0.01) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# MODELS
model_lm <- linear_reg() %>% # linear model
set_engine("lm") %>% # lm engine
set_mode("regression") # regression problem
model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
set_engine("glmnet") %>%
set_mode("regression") # regression problem
model_lasso <-
linear_reg(mixture = 1, penalty = 0.05) %>%
... %>%
...
# WORKFLOWS
workflow_lm <- workflow() %>%
add_recipe(recipe_lm) %>%
add_model(model_lm)
workflow_ridge <- ...
workflow_lasso <- workflow() %>%
add_recipe(recipe_ridge_lasso) %>%
add_model(model_lasso)
# 3. ####
# TRAINING/FITTING
fit_lm <- workflow_lm %>% fit_resamples(resamples = data_folds)
fit_ridge <- ...
fit_lasso <- ...
# RESAMPLED DATA: COLLECT METRICS
collect_metrics(fit_lm)
collect_metrics(...)
collect_metrics(...)
# Lasso performed best!!
# FIT LASSO TO FULL TRAINING DATA
fit_lasso <- ... %>% parsnip::fit(data = ...)
# Metrics: Training data
metrics_combined <- metric_set(rsq, rmse, mae)
augment(fit_lasso, new_data = ...) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
# Metrics: Test data
augment(fit_lasso, new_data = ...) %>%
metrics_combined(truth = life_satisfaction, estimate = ...)
Solution
# 1. ####
# Drop missings on outcome
data <- data %>%
drop_na(life_satisfaction)
# Split the data into training and test data
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
set.seed(345)
data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# 2. ####
# RECIPES
recipe_lm <- recipe(life_satisfaction ~ unemployed + age + education, data = data_train) %>%
step_filter_missing(all_predictors(), threshold = 0.01) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
recipe_ridge_lasso <- recipe(life_satisfaction ~ ., data = data_train) %>%
update_role(respondent_id, new_role = "ID") %>% # define as ID variable
step_filter_missing(all_predictors(), threshold = 0.01) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# MODELS
model_lm <- linear_reg() %>% # linear model
set_engine("lm") %>% # lm engine
set_mode("regression") # regression problem
model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
set_engine("lm") %>% # lm engine
set_mode("regression") # regression problem
model_lasso <-
linear_reg(mixture = 1, penalty = 0.05) %>%
set_mode("regression") %>%
set_engine("glmnet")
# WORKFLOWS
workflow_lm <- workflow() %>%
add_recipe(recipe_lm) %>%
add_model(model_lm)
workflow_ridge <- workflow() %>%
add_recipe(recipe_ridge_lasso) %>%
add_model(model_ridge)
workflow_lasso <- workflow() %>%
add_recipe(recipe_ridge_lasso) %>%
add_model(model_lasso)
# 3. ####
# TRAINING/FITTING
fit_lm <- workflow_lm %>% fit_resamples(resamples = data_folds)
fit_ridge <- workflow_ridge %>% fit_resamples(resamples = data_folds)
fit_lasso <- workflow_lasso %>% fit_resamples(resamples = data_folds)
# RESAMPLED DATA: COLLECT METRICS
collect_metrics(fit_lm)
collect_metrics(fit_ridge)
collect_metrics(fit_lasso)
# Lasso performed best!!
# FIT LASSO TO FULL TRAINING DATA
fit_lasso <- workflow_lasso %>% parsnip::fit(data = data_train)
tidy(fit_lasso)
View(tidy(fit_lasso))
View(tidy(fit_lasso) %>% filter(estimate!=0))
# Metrics: Training data
metrics_combined <- metric_set(rsq, rmse, mae)
augment(fit_lasso, new_data = data_train) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
# Metrics: Test data
augment(fit_lasso, new_data = data_test) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
13 All the code
# namer::unname_chunks("03_01_predictive_models.qmd")
# namer::name_chunks("03_01_predictive_models.qmd")
options(scipen=99999)
#| echo: false
library(tidyverse)
library(lemon)
library(knitr)
library(kableExtra)
library(dplyr)
library(plotly)
library(randomNames)
library(stargazer)
library(tidymodels)
library(tidyverse)
library(glmnet)
load(file = "www/data/data_ess.Rdata")
library(tidymodels)
library(tidyverse)
library(glmnet)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
"173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))
library(skimr)
library(naniar)
#skim(data)
nrow(data)
ncol(data)
# Visualize missing for downsampled data
vis_miss(data %>% sample_n(1000), warn_large_data = FALSE)
# Again: Visualize missing for downsampled data
vis_miss(data %>% sample_n(1000), warn_large_data = FALSE)
# Extract data with missing outcome
data_missing_outcome <- data %>% filter(is.na(life_satisfaction))
dim(data_missing_outcome)
# Omit individuals with missing outcome from data
data <- data %>% drop_na(life_satisfaction) # ?drop_na
dim(data)
# Split the data into training and test data
set.seed(345)
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# You can also try loo_cv(data_train) instead
# Define a recipe for preprocessing
recipe_ridge <- recipe(life_satisfaction ~ ., data = data_train) %>%
update_role(respondent_id, new_role = "ID") %>% # define as ID variable
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# Extract and preview data + recipe (directly with $)
data_preprocessed <- prep(recipe_ridge, data_train)$template
dim(data_preprocessed)
# Define a model
model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
set_engine("glmnet") %>% # lm engine
set_mode("regression") # regression problem
# Define a workflow
workflow_ridge <- workflow() %>% # create empty workflow
add_recipe(recipe_ridge) %>% # add recipe
add_model(model_ridge) # add model
# Fit the workflow (including recipe and model)
fit_ridge <- workflow_ridge %>%
fit_resamples(resamples = data_folds, # specify cv-datasets
metrics = metric_set(rmse, rsq, mae))
collect_metrics(fit_ridge)
# Fit model in full training dataset
fit_final_ridge <- workflow_ridge %>%
parsnip::fit(data = data_train)
# Inspect coefficients:
tidy(fit_final_ridge) %>% slice(1:5) # show first 5 coefficients
# Test data: Predictions + accuracy
metrics_combined <- metric_set(mae, rmse, rsq) # Use several metrics
augment(fit_final_ridge , new_data = data_test) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
library(tidymodels)
library(tidyverse)
library(glmnet)
load(file = "www/data/data_ess.Rdata")
library(tidymodels)
library(tidyverse)
library(glmnet)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
"173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))
# Extract data with missing outcome
data_missing_outcome <- data %>% filter(is.na(life_satisfaction))
dim(data_missing_outcome)
# Omit individuals with missing outcome from data
data <- data %>% drop_na(life_satisfaction) # ?drop_na
dim(data)
# Split the data into training and test data
set.seed(345)
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# Define the recipe
recipe_lasso <- recipe(life_satisfaction ~ ., data = data_train) %>%
update_role(respondent_id, new_role = "ID") %>% # define as ID variable
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# Specify the model
model_lasso <-
linear_reg(penalty = 0.1, mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
# Define the workflow
workflow_lasso <- workflow() %>%
add_recipe(recipe_lasso) %>%
add_model(model_lasso)
# Fit the workflow (including recipe and model)
fit_lasso <- workflow_lasso %>%
fit_resamples(resamples = data_folds, # specify cv-datasets
metrics = metric_set(rmse, rsq, mae))
collect_metrics(fit_lasso)
# Fit model in full training dataset
fit_final_lasso <- workflow_lasso %>%
parsnip::fit(data = data_train)
# Inspect coefficients
tidy(fit_final_lasso) %>% filter(estimate!=0) # omit coefficients = 0
# Test data: Predictions + accuracy
metrics_combined <- metric_set(mae, rmse, rsq) # Use several metrics
augment(fit_final_lasso , new_data = data_test) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
rm(list=ls())
library(tidymodels)
library(tidyverse)
load(file = "www/data/data_ess.Rdata")
rm(list=ls())
library(tidymodels)
library(tidyverse)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
"173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))
# 1. ####
# Drop missings on outcome
data <- data %>%
drop_na(life_satisfaction)
# Split the data into training and test data
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
# Extract the two datasets
data_train <- training(data_split)
data_... <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
set.seed(345)
data_folds <- vfold_cv(..., v = 5) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# 2. ####
# RECIPES
recipe_lm <- recipe(life_satisfaction ~ unemployed + age + education, data = data_train) %>%
step_filter_missing(all_predictors(), threshold = 0.01) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
recipe_ridge_lasso <- recipe(life_satisfaction ~ ., data = ...) %>%
update_role(respondent_id, new_role = "ID") %>% # define as ID variable
step_filter_missing(all_predictors(), threshold = 0.01) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# MODELS
model_lm <- linear_reg() %>% # linear model
set_engine("lm") %>% # lm engine
set_mode("regression") # regression problem
model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
set_engine("glmnet") %>%
set_mode("regression") # regression problem
model_lasso <-
linear_reg(mixture = 1, penalty = 0.05) %>%
... %>%
...
# WORKFLOWS
workflow_lm <- workflow() %>%
add_recipe(recipe_lm) %>%
add_model(model_lm)
workflow_ridge <- ...
workflow_lasso <- workflow() %>%
add_recipe(recipe_ridge_lasso) %>%
add_model(model_lasso)
# 3. ####
# TRAINING/FITTING
fit_lm <- workflow_lm %>% fit_resamples(resamples = data_folds)
fit_ridge <- ...
fit_lasso <- ...
# RESAMPLED DATA: COLLECT METRICS
collect_metrics(fit_lm)
collect_metrics(...)
collect_metrics(...)
# Lasso performed best!!
# FIT LASSO TO FULL TRAINING DATA
fit_lasso <- ... %>% parsnip::fit(data = ...)
# Metrics: Training data
metrics_combined <- metric_set(rsq, rmse, mae)
augment(fit_lasso, new_data = ...) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
# Metrics: Test data
augment(fit_lasso, new_data = ...) %>%
metrics_combined(truth = life_satisfaction, estimate = ...)
# 1. ####
# Drop missings on outcome
data <- data %>%
drop_na(life_satisfaction)
# Split the data into training and test data
data_split <- initial_split(data, prop = 0.80)
data_split # Inspect
# Extract the two datasets
data_train <- training(data_split)
data_test <- testing(data_split) # Do not touch until the end!
# Create resampled partitions
set.seed(345)
data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
data_folds # data_folds now contains several resamples of our training data
# 2. ####
# RECIPES
recipe_lm <- recipe(life_satisfaction ~ unemployed + age + education, data = data_train) %>%
step_filter_missing(all_predictors(), threshold = 0.01) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
recipe_ridge_lasso <- recipe(life_satisfaction ~ ., data = data_train) %>%
update_role(respondent_id, new_role = "ID") %>% # define as ID variable
step_filter_missing(all_predictors(), threshold = 0.01) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_zv(all_predictors()) %>% # remove predictors with zero variance
step_normalize(all_numeric_predictors()) %>% # normalize predictors
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# MODELS
model_lm <- linear_reg() %>% # linear model
set_engine("lm") %>% # lm engine
set_mode("regression") # regression problem
model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
set_engine("lm") %>% # lm engine
set_mode("regression") # regression problem
model_lasso <-
linear_reg(mixture = 1, penalty = 0.05) %>%
set_mode("regression") %>%
set_engine("glmnet")
# WORKFLOWS
workflow_lm <- workflow() %>%
add_recipe(recipe_lm) %>%
add_model(model_lm)
workflow_ridge <- workflow() %>%
add_recipe(recipe_ridge_lasso) %>%
add_model(model_ridge)
workflow_lasso <- workflow() %>%
add_recipe(recipe_ridge_lasso) %>%
add_model(model_lasso)
# 3. ####
# TRAINING/FITTING
fit_lm <- workflow_lm %>% fit_resamples(resamples = data_folds)
fit_ridge <- workflow_ridge %>% fit_resamples(resamples = data_folds)
fit_lasso <- workflow_lasso %>% fit_resamples(resamples = data_folds)
# RESAMPLED DATA: COLLECT METRICS
collect_metrics(fit_lm)
collect_metrics(fit_ridge)
collect_metrics(fit_lasso)
# Lasso performed best!!
# FIT LASSO TO FULL TRAINING DATA
fit_lasso <- workflow_lasso %>% parsnip::fit(data = data_train)
tidy(fit_lasso)
View(tidy(fit_lasso))
View(tidy(fit_lasso) %>% filter(estimate!=0))
# Metrics: Training data
metrics_combined <- metric_set(rsq, rmse, mae)
augment(fit_lasso, new_data = data_train) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
# Metrics: Test data
augment(fit_lasso, new_data = data_test) %>%
metrics_combined(truth = life_satisfaction, estimate = .pred)
labs = knitr::all_labels()
ignore_chunks <- labs[str_detect(labs, "exclude|setup|solution|get-labels")]
labs = setdiff(labs, ignore_chunks)
References
Footnotes
\(\gg\) means “much larger than”↩︎
“not a huge priority since other methods (e.g. regularization) are uniformly superior methods to feature selection.” (Stackoverflow, 12.02.2025)↩︎
i.e., shrinkage methods can perform variable selection↩︎
Achieved by computing \(M\) different linear combinations (or projections) of the variables. Then these \(M\) projections are used as predictors to fit a linear regression model by least squares, e.g., principal components regression, partial least squares (James et al. 2013, Ch. 6.3.1, 6.3.2)↩︎
the number of observations↩︎
the number of features↩︎
e.g., least squares linear regression↩︎
RR minimizes \(RSS + \lambda\sum_{j=1}^{p}\beta_{j}^{2}\). Like least squares, RR seeks coefficient estimates that fit the data well (make RSS small). But the shrinkage penalty \(\lambda\sum_{j=1}^{p}\beta_{j}^{2}\) is small when the \(\beta_{j}\) are close to zero, so it has the effect of shrinking the estimates of \(\beta_{j}\) toward zero.↩︎
Occasionally coefficients can increase as \(\lambda\) increases (cf. predictor “Rating”)↩︎
Resulting in standard deviation = 1 for all predictors and final fit independent of predictor scales.↩︎
Because the coefficients (slopes) shrink, the fitted (multidimensional) plane is not allowed to adapt as much to the data as in a normal linear least squares model.↩︎
In that case LS estimates will be extremely variable (cf. James et al. 2013, fig. 6.5)↩︎
e.g., best subset, forward stepwise, and backward stepwise selection↩︎
…unless \(\lambda = \infty\)↩︎