Feature selection & regularization (regression)

Chapter last updated: 13 Februar 2025

Learning outcomes/objective: Learn…

…about model regularization/feature selection (focus on ridge regression and lasso).
…how to implement ridge regression and lasso in R with tidymodels.
…what a tuning parameter can be used for (\(\lambda\) in ridge regression/lasso)

Sources: James et al. (2013, Ch. 6), ISLR tidymodels labs by Emil Hvitfeldt

1 Concepts

Feature selection
- also known as variable selection, attribute selection or variable subset selection
- process of selecting subset of relevant features/predictors for our model
- Aims: intepretability, reduce training time, avoid curse of dimensionality, improve data-model compatibility (cf. Wikipedia, 12.02.2025)
Regularization
- = either choice of the model or modifications to the algorithm
- Penalizing the complexity of a model to reduce overfitting
Theory vs. algorithmic feature selection
- ML largely focuses on automated selection, albeit theory may play a role

2 Why using alternatives to least squares?

Why use another fitting procedure instead of least squares? (James et al. 2013, Ch. 6)
- Alternatives can yield better prediction accuracy and model interpretability
Prediction Accuracy: Provided true relationship between response and predictors is approximately linear, least squares estimates will have low bias
- If \(n \gg p\)¹ least squares estimates have low bias but if not possibility of overfitting
- By shrinking estimated coefficients → reduce variance at the cost of small increase in bias (James et al. 2013, 204)
Model Interpretability: Often some features used in a multiple regression model are not associated with the outcome
- Including such irrelevant features leads to unnecessary complexity in the model
- Idea: remove these variables (set corresponding coefficient estimates to 0) to obtain a more interpretable model

3 Methods for feature/variables selection

Subset selection: identify subset of \(p\) predictors that we believe to be related to outcome
- ..then train model using reduced set of variables (using least squares)
- e.g., best subset selection (James et al. 2013, Ch. 6.1.1), forward/backward stepwise selection (relying on model fit) (James et al. 2013, Ch. 6.1.2)
- Tidymodels: “not a huge priority” (Stackoverflow, 12.02.2025)²
Shrinkage: train model involving all \(p\) predictors but shrink coefficients to zero relative to the least squares estimates
- This shrinkage (also known as regularization) has the effect of reducing variance
- Depending on type of shrinkage, some coefficients may be estimated to be exactly zero³
- e.g., ridge regression & the lasso (James et al. 2013, Ch. 6.2.1, 6.2.2)
Dimension reduction: involves projecting the \(p\) predictors into a M-dimensional subspace, where \(M<p\).⁴

4 High dimensional data

Most traditional statistical techniques intended for low-dimensional setting in which \(n\)⁵ is much greater than \(p\)⁶(James et al. 2013, 238)
For data sets containing more features than observations called high-dimensional classical approaches⁷ are not appropriate (James et al. 2013, 239)
But certain less flexible least squares models, such as forward stepwise selection, ridge regression, the lasso, and principal components regression, are particularly useful (James et al. 2013, 239)
They avoid overfitting by using less flexible fitting approach than least squares (James et al. 2013, 239)
Q: Why are the latter less flexible?!^ (see Section 7)

5 Ridge regression (RR) (1)

Least squares (LS) estimator: least squares fitting procedure estimates \(\beta_{0}, \beta_{1}, ..., \beta_{p}\) using the values that minimize RSS
Ridge regression (RR): seeks coefficient estimates that fit data well (minimize RSS) + shrinkage penalty
- See formula (6.5) in (James et al. 2013, 215)⁸
shrinkage penalty \(\lambda \sum_{j=1}^{p}\beta_{j}^{2}\) is small when \(\beta_{0}, \beta_{1}, ..., \beta_{p}\) close to zero
- penality has effect of shrinking the estimates \(\beta_{j}\) to zero
- Tuning parameter \(\lambda\) serves to control the relative impact of these two terms (RSS and shrinking penalty) on the regression coefficient estimates
  - When \(\lambda = 0\) the penalty term has no effect and ridge regression will produce least squares estimates
  - As \(\lambda \rightarrow \infty\), impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero
- Penality is only applied to coefficients not the intercept
Ridge regression will produce a different set of coefficient estimates \(\hat{\beta}_{\lambda}^{R}\) for each \(\lambda\)
- Hyperparameter tuning: fit many models with different \(\lambda\)s to find the model the performs “best”

6 Ridge regression (RR) (2): Visualization

Figure 1 displays regression coefficient estimates for the Credit data (James et al. 2013, Ch. 6.2)
- Left-hand panel: each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of \(\lambda\)
  - to the left \(\lambda\) is essentially zero, so ridge coefficient estimates are the same as the usual least squares estimates
  - as \(\lambda\) increases, ridge coefficient estimates shrink towards zero⁹
  - when \(\lambda\) is extremely large, all of the ridge coefficient estimates are basically zero = null model
While standard LS coefficient estimates are scale equivalent, RR coefficients are not
- Apply ridge regression after standardizing the predictors!¹⁰

Figure 1: Source: James et al. (2013), Ch. 6.2, Figure 6.4

7 Ridge regression (RR) (3): Why?

Why Does Ridge Regression (RR) Improve Over Least Squares (LS)?

Advantage rooted in bias-variance trade-off
- As \(\lambda\) increases, flexibility of ridge regression fit decreases¹¹, leading to decreased variance but increased bias
  - Q: What was variance and bias again in the context of ML? (see bias-variance trade-off)
If relationship between outcome and features is close to linear, LS estimates will have low bias but may have high variance
- \(\rightarrow\) small change in the training data can cause a large change in the LS coefficient estimates
- In particular, when number of variables \(p\) is almost as large as number of observations \(n\)¹²
- If \(p>n\), LS estimates do not even have a unique solution, whereas RR still works
Summary: RR works best in situations where the LS estimates have high variance

8 The Lasso

Ridge regression (RR) has one disadvantage
- Unlike subset selection methods¹³ that select a variable subset, RR will include all \(p\) predictors in final model
- Penalty \(\lambda \sum_{j=1}^{p}\beta_{j}^{2}\) will shrink all coefficients towards zero, but not set them exactly to 0¹⁴
- No problem for accuracy, but challenging for model interpretation when \(p\) is large
Lasso (least absolute shrinkage and selection operator)
- As RR lasso shrinks the coefficient estimates towards zero but additionally…
- …its penality has effect of forcing some of coefficient estimates to be exactly equal to zero..
- ..when tuning parameter \(\lambda\) is sufficiently large (cf. James et al. 2013, Eq. 6.7, p.219)
- Lasso performs variable selection leading to easier interpretation → yields sparse models
- As in RR selecting good value of \(\lambda\) is critical

9 Lasso vs. Ridge Regression

Lasso advantages: produces simpler, more interpretable models involving feature subset
But, which method leads to better prediction accuracy?
- Neither ridge regression nor lasso universally dominates the other
- Lasso works better when few predictors have significant coefficients, others have small/zero coefficients
- Ridge regression performs better with many predictors of roughly equal coefficient size
- Both methods can reduce variance at the cost of slight bias increase, improving prediction accuracy

10 Lab: Ridge Regression

See earlier description here (right click and open in new tab).

We first import the data into R:

library(tidymodels)
library(tidyverse)
library(glmnet)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))

10.1 Inspecting the dataset

The dataset if relatively large with 1977 rows and 346 columns. The skim() function can be used to get a quick overview.

library(skimr)
library(naniar)

#skim(data)
nrow(data)

[1] 1977

ncol(data)

[1] 346

# Visualize missing for downsampled data
  vis_miss(data %>% sample_n(1000), warn_large_data = FALSE)

# Again: Visualize missing for downsampled data
  vis_miss(data %>% sample_n(1000), warn_large_data = FALSE)

10.2 Split the data

Below we split the data and create resampled partitions with vfold_cv() stored in an object called data_folds. Hence, we have the original data_train and data_test but also already resamples of data_train in data_folds.

# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(life_satisfaction))
  dim(data_missing_outcome)

[1] 205 346

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(life_satisfaction) # ?drop_na
  dim(data)

[1] 1772  346

# Split the data into training and test data
  set.seed(345)
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

<Training/Testing/Total>
<1417/355/1772>

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!
  
# Create resampled partitions
  data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data

#  5-fold cross-validation 
# A tibble: 5 × 2
  splits             id   
  <list>             <chr>
1 <split [1133/284]> Fold1
2 <split [1133/284]> Fold2
3 <split [1134/283]> Fold3
4 <split [1134/283]> Fold4
5 <split [1134/283]> Fold5

  # You can also try loo_cv(data_train) instead

The training data has 1417 rows, the test data has 355. The training data is further split into 10 folds.

10.3 Simple ridge regression (all in one)

Below we provide a quick example of a ridge regression (beware: below in the recipe we standardize predictors).

Arguments of glmnet linear_reg()
- mixture = 0: to specify a ridge model
  - mixture = 0 specifies only ridge regularization
  - mixture = 1 specifies only lasso regularization
  - Setting mixture to a value between 0 and 1 lets us use both
- penalty = 0: Penality we need to set when using glmnet engine (for now 0)

# Define a recipe for preprocessing
  recipe_ridge <- recipe(life_satisfaction ~ ., data = data_train) %>%
    update_role(respondent_id, new_role = "ID") %>% # define as ID variable
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())

  # Extract and preview data + recipe (direclty with $)
  data_preprocessed <- prep(recipe_ridge, data_train)$template
  dim(data_preprocessed)

[1] 1417  931

# Define a model
  model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
    set_engine("glmnet") %>% # lm engine
    set_mode("regression") # regression problem
  
# Define a workflow
  workflow_ridge <- workflow() %>% # create empty workflow
    add_recipe(recipe_ridge) %>% # add recipe
    add_model(model_ridge) # add model  
  
# Fit the workflow (including recipe and model)
  fit_ridge <- workflow_ridge %>% 
    fit_resamples(resamples = data_folds, # specify cv-datasets
    metrics = metric_set(rmse, rsq, mae))

  
  collect_metrics(fit_ridge)

# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard   1.31      5  0.0242 Preprocessor1_Model1
2 rmse    standard   1.73      5  0.0350 Preprocessor1_Model1
3 rsq     standard   0.404     5  0.0294 Preprocessor1_Model1

If we are happy with the performance of our model (evaluated using resampling), we can fit it on the full training set and use the resulting parameters to obtain predictions in the test dataset. Subsequently we calculate the accuracy in the test data.

# Fit model in full training dataset
  fit_final_ridge <- workflow_ridge %>% 
              parsnip::fit(data = data_train)

# Inspect coefficients: 
  tidy(fit_final_ridge) %>% slice(1:5) # show first 5 coefficients

# A tibble: 5 × 3
  term                  estimate penalty
  <chr>                    <dbl>   <dbl>
1 (Intercept)             7.44         2
2 unemployed_active      -0.0269       2
3 unemployed             -0.0506       2
4 education               0.0145       2
5 news_politics_minutes   0.0329       2

# Test data: Predictions + accuracy
  metrics_combined <- metric_set(mae, rmse, rsq) # Use several metrics
  
  augment(fit_final_ridge , new_data = data_test) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard       1.28 
2 rmse    standard       1.72 
3 rsq     standard       0.374

11 Lab: Lasso

See earlier description here (right click and open in new tab).

We first import the data into R:

library(tidymodels)
library(tidyverse)
library(glmnet)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))

# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(life_satisfaction))
  dim(data_missing_outcome)

[1] 205 346

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(life_satisfaction) # ?drop_na
  dim(data)

[1] 1772  346

# Split the data into training and test data
  set.seed(345)
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

<Training/Testing/Total>
<1417/355/1772>

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!
  
# Create resampled partitions
  data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data

#  5-fold cross-validation 
# A tibble: 5 × 2
  splits             id   
  <list>             <chr>
1 <split [1133/284]> Fold1
2 <split [1133/284]> Fold2
3 <split [1134/283]> Fold3
4 <split [1134/283]> Fold4
5 <split [1134/283]> Fold5

# Define the recipe
recipe_lasso <- recipe(life_satisfaction ~ ., data = data_train) %>%
    update_role(respondent_id, new_role = "ID") %>% # define as ID variable
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())

# Specify the model
  model_lasso <- 
    linear_reg(penalty = 0.1, mixture = 1) %>% 
    set_mode("regression") %>% 
    set_engine("glmnet") 

# Define the workflow
  workflow_lasso <- workflow() %>% 
    add_recipe(recipe_lasso) %>% 
    add_model(model_lasso)


# Fit the workflow (including recipe and model)
  fit_lasso <- workflow_lasso %>% 
    fit_resamples(resamples = data_folds, # specify cv-datasets
    metrics = metric_set(rmse, rsq, mae))

  
  collect_metrics(fit_lasso)

# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard   1.11      5  0.0262 Preprocessor1_Model1
2 rmse    standard   1.54      5  0.0370 Preprocessor1_Model1
3 rsq     standard   0.528     5  0.0218 Preprocessor1_Model1

# Fit model in full training dataset
  fit_final_lasso <- workflow_lasso %>% 
              parsnip::fit(data = data_train)
  
# Inspect coefficients
  tidy(fit_final_lasso) %>% filter(estimate!=0) # omit coefficients = 0

# A tibble: 34 × 3
   term                  estimate penalty
   <chr>                    <dbl>   <dbl>
 1 (Intercept)            6.31        0.1
 2 trust_people           0.00201     0.1
 3 people_fair            0.116       0.1
 4 trust_police           0.0581      0.1
 5 trust_united_nations   0.00234     0.1
 6 left_right_scale       0.0671      0.1
 7 satisfied_economy      0.151       0.1
 8 satisfied_government   0.200       0.1
 9 state_health_services  0.0314      0.1
10 happiness              1.00        0.1
# ℹ 24 more rows

# Test data: Predictions + accuracy
  metrics_combined <- metric_set(mae, rmse, rsq) # Use several metrics
  
  augment(fit_final_lasso , new_data = data_test) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard       1.11 
2 rmse    standard       1.56 
3 rsq     standard       0.490

12 Exercise/Homework

Below code to help you (you need to replace all ...)

Use the data from above (European Social Survey (ESS)). Use the code below to load it, drop missings and to produce training, test as well as resampled data.
Define three models (model_lm = linear regression, model_ridge = ridge regression, model_lasso = lasso regression) and create two recipes, recipe_lm for the linear regression (the model should only include three predictors: unemployed + age + education) and recipe_ridge_lasso for the other two models. Store the three models and two recipes in three workflows workflow_lm, workflow_ridge, workflow_lasso
Train these three workflows and evaluate their accuracy using resampling (below some code to get you started).
Pick the best workflow (and corresponding model), fit it on the whole training data and evaluate the accuracy on the test data.

We first import the data into R:

rm(list=ls())
library(tidymodels)
library(tidyverse)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))

# 1. ####
# Drop missings on outcome
  data <- data %>% 
    drop_na(life_satisfaction)

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_... <- testing(data_split) # Do not touch until the end!

# Create resampled partitions
  set.seed(345)
  data_folds <- vfold_cv(..., v = 5) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data  
  
  
  
# 2. ####  
# RECIPES
  recipe_lm <- recipe(life_satisfaction ~ unemployed + age + education, data = data_train) %>%
    step_filter_missing(all_predictors(), threshold = 0.01) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())


  recipe_ridge_lasso <- recipe(life_satisfaction ~ ., data = ...) %>%
    update_role(respondent_id, new_role = "ID") %>% # define as ID variable
    step_filter_missing(all_predictors(), threshold = 0.01) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())
  
# MODELS
  model_lm <- linear_reg() %>% # linear model
    set_engine("lm") %>% # lm engine
    set_mode("regression") # regression problem
  
  model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
    set_engine("glmnet") %>%
    set_mode("regression") # regression problem
  
  model_lasso <- 
    linear_reg(mixture = 1, penalty = 0.05) %>% 
    ... %>%
    ... 

  
# WORKFLOWS
  workflow_lm <- workflow() %>% 
                  add_recipe(recipe_lm) %>% 
                  add_model(model_lm)  
  
  workflow_ridge <- ...
  
  workflow_lasso <- workflow() %>% 
                  add_recipe(recipe_ridge_lasso) %>% 
                  add_model(model_lasso) 
  
  
# 3. ####
# TRAINING/FITTING
  fit_lm <- workflow_lm %>% fit_resamples(resamples = data_folds)
  fit_ridge <- ...
  fit_lasso <- ...
  
# RESAMPLED DATA: COLLECT METRICS  
  collect_metrics(fit_lm)
  collect_metrics(...)
  collect_metrics(...)
  # Lasso performed best!!
  
# FIT LASSO TO FULL TRAINING DATA
  fit_lasso <- ... %>% parsnip::fit(data = ...)
  
# Metrics: Training data
  metrics_combined <- metric_set(rsq, rmse, mae)
  
  augment(fit_lasso, new_data = ...) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)
  
# Metrics: Test data
  augment(fit_lasso, new_data = ...) %>%
    metrics_combined(truth = life_satisfaction, estimate = ...)

Solution

# 1. ####
# Drop missings on outcome
  data <- data %>% 
    drop_na(life_satisfaction)

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!

# Create resampled partitions
  set.seed(345)
  data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data  
  
  
  
# 2. ####  
# RECIPES
  recipe_lm <- recipe(life_satisfaction ~ unemployed + age + education, data = data_train) %>%
    step_filter_missing(all_predictors(), threshold = 0.01) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())


  recipe_ridge_lasso <- recipe(life_satisfaction ~ ., data = data_train) %>%
    update_role(respondent_id, new_role = "ID") %>% # define as ID variable
    step_filter_missing(all_predictors(), threshold = 0.01) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())
  
# MODELS
  model_lm <- linear_reg() %>% # linear model
    set_engine("lm") %>% # lm engine
    set_mode("regression") # regression problem
  
  model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
    set_engine("lm") %>% # lm engine
    set_mode("regression") # regression problem
  
  model_lasso <- 
    linear_reg(mixture = 1, penalty = 0.05) %>% 
    set_mode("regression") %>% 
    set_engine("glmnet") 
  
# WORKFLOWS
  workflow_lm <- workflow() %>% 
                  add_recipe(recipe_lm) %>% 
                  add_model(model_lm)  
  
  workflow_ridge <- workflow() %>% 
                  add_recipe(recipe_ridge_lasso) %>% 
                  add_model(model_ridge) 
  
  workflow_lasso <- workflow() %>% 
                  add_recipe(recipe_ridge_lasso) %>% 
                  add_model(model_lasso) 
  
  
# 3. ####
# TRAINING/FITTING
  fit_lm <- workflow_lm %>% fit_resamples(resamples = data_folds)
  fit_ridge <- workflow_ridge %>% fit_resamples(resamples = data_folds)
  fit_lasso <- workflow_lasso %>% fit_resamples(resamples = data_folds)
  
# RESAMPLED DATA: COLLECT METRICS  
  collect_metrics(fit_lm)
  collect_metrics(fit_ridge)
  collect_metrics(fit_lasso)
  # Lasso performed best!!
  
# FIT LASSO TO FULL TRAINING DATA
  fit_lasso <- workflow_lasso %>% parsnip::fit(data = data_train)
  tidy(fit_lasso)
  View(tidy(fit_lasso))
  View(tidy(fit_lasso) %>% filter(estimate!=0))
  
# Metrics: Training data
  metrics_combined <- metric_set(rsq, rmse, mae)
  
  augment(fit_lasso, new_data = data_train) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)
  
# Metrics: Test data
  augment(fit_lasso, new_data = data_test) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)

13 All the code

# namer::unname_chunks("03_01_predictive_models.qmd")
# namer::name_chunks("03_01_predictive_models.qmd")
options(scipen=99999)

#| echo: false
library(tidyverse)
library(lemon)
library(knitr)
library(kableExtra)
library(dplyr)
library(plotly)
library(randomNames)
library(stargazer)
library(tidymodels)
library(tidyverse)
library(glmnet)
load(file = "www/data/data_ess.Rdata")
library(tidymodels)
library(tidyverse)
library(glmnet)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))
library(skimr)
library(naniar)

#skim(data)
nrow(data)
ncol(data)

# Visualize missing for downsampled data
  vis_miss(data %>% sample_n(1000), warn_large_data = FALSE)

# Again: Visualize missing for downsampled data
  vis_miss(data %>% sample_n(1000), warn_large_data = FALSE)
# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(life_satisfaction))
  dim(data_missing_outcome)

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(life_satisfaction) # ?drop_na
  dim(data)


# Split the data into training and test data
  set.seed(345)
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!
  
# Create resampled partitions
  data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data  
  # You can also try loo_cv(data_train) instead
  
# Define a recipe for preprocessing
  recipe_ridge <- recipe(life_satisfaction ~ ., data = data_train) %>%
    update_role(respondent_id, new_role = "ID") %>% # define as ID variable
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())

  # Extract and preview data + recipe (direclty with $)
  data_preprocessed <- prep(recipe_ridge, data_train)$template
  dim(data_preprocessed)
  
# Define a model
  model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
    set_engine("glmnet") %>% # lm engine
    set_mode("regression") # regression problem
  
# Define a workflow
  workflow_ridge <- workflow() %>% # create empty workflow
    add_recipe(recipe_ridge) %>% # add recipe
    add_model(model_ridge) # add model  
  
# Fit the workflow (including recipe and model)
  fit_ridge <- workflow_ridge %>% 
    fit_resamples(resamples = data_folds, # specify cv-datasets
    metrics = metric_set(rmse, rsq, mae))

  
  collect_metrics(fit_ridge)
  
  
# Fit model in full training dataset
  fit_final_ridge <- workflow_ridge %>% 
              parsnip::fit(data = data_train)

# Inspect coefficients: 
  tidy(fit_final_ridge) %>% slice(1:5) # show first 5 coefficients

# Test data: Predictions + accuracy
  metrics_combined <- metric_set(mae, rmse, rsq) # Use several metrics
  
  augment(fit_final_ridge , new_data = data_test) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)
library(tidymodels)
library(tidyverse)
library(glmnet)
load(file = "www/data/data_ess.Rdata")
library(tidymodels)
library(tidyverse)
library(glmnet)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))
# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(life_satisfaction))
  dim(data_missing_outcome)

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(life_satisfaction) # ?drop_na
  dim(data)


# Split the data into training and test data
  set.seed(345)
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!
  
# Create resampled partitions
  data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data  
# Define the recipe
recipe_lasso <- recipe(life_satisfaction ~ ., data = data_train) %>%
    update_role(respondent_id, new_role = "ID") %>% # define as ID variable
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())

# Specify the model
  model_lasso <- 
    linear_reg(penalty = 0.1, mixture = 1) %>% 
    set_mode("regression") %>% 
    set_engine("glmnet") 

# Define the workflow
  workflow_lasso <- workflow() %>% 
    add_recipe(recipe_lasso) %>% 
    add_model(model_lasso)


# Fit the workflow (including recipe and model)
  fit_lasso <- workflow_lasso %>% 
    fit_resamples(resamples = data_folds, # specify cv-datasets
    metrics = metric_set(rmse, rsq, mae))

  
  collect_metrics(fit_lasso)
  
  
  
# Fit model in full training dataset
  fit_final_lasso <- workflow_lasso %>% 
              parsnip::fit(data = data_train)
  
# Inspect coefficients
  tidy(fit_final_lasso) %>% filter(estimate!=0) # omit coefficients = 0

# Test data: Predictions + accuracy
  metrics_combined <- metric_set(mae, rmse, rsq) # Use several metrics
  
  augment(fit_final_lasso , new_data = data_test) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)
rm(list=ls())
library(tidymodels)
library(tidyverse)
load(file = "www/data/data_ess.Rdata")
rm(list=ls())
library(tidymodels)
library(tidyverse)
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))
# 1. ####
# Drop missings on outcome
  data <- data %>% 
    drop_na(life_satisfaction)

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_... <- testing(data_split) # Do not touch until the end!

# Create resampled partitions
  set.seed(345)
  data_folds <- vfold_cv(..., v = 5) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data  
  
  
  
# 2. ####  
# RECIPES
  recipe_lm <- recipe(life_satisfaction ~ unemployed + age + education, data = data_train) %>%
    step_filter_missing(all_predictors(), threshold = 0.01) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())


  recipe_ridge_lasso <- recipe(life_satisfaction ~ ., data = ...) %>%
    update_role(respondent_id, new_role = "ID") %>% # define as ID variable
    step_filter_missing(all_predictors(), threshold = 0.01) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())
  
# MODELS
  model_lm <- linear_reg() %>% # linear model
    set_engine("lm") %>% # lm engine
    set_mode("regression") # regression problem
  
  model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
    set_engine("glmnet") %>%
    set_mode("regression") # regression problem
  
  model_lasso <- 
    linear_reg(mixture = 1, penalty = 0.05) %>% 
    ... %>%
    ... 

  
# WORKFLOWS
  workflow_lm <- workflow() %>% 
                  add_recipe(recipe_lm) %>% 
                  add_model(model_lm)  
  
  workflow_ridge <- ...
  
  workflow_lasso <- workflow() %>% 
                  add_recipe(recipe_ridge_lasso) %>% 
                  add_model(model_lasso) 
  
  
# 3. ####
# TRAINING/FITTING
  fit_lm <- workflow_lm %>% fit_resamples(resamples = data_folds)
  fit_ridge <- ...
  fit_lasso <- ...
  
# RESAMPLED DATA: COLLECT METRICS  
  collect_metrics(fit_lm)
  collect_metrics(...)
  collect_metrics(...)
  # Lasso performed best!!
  
# FIT LASSO TO FULL TRAINING DATA
  fit_lasso <- ... %>% parsnip::fit(data = ...)
  
# Metrics: Training data
  metrics_combined <- metric_set(rsq, rmse, mae)
  
  augment(fit_lasso, new_data = ...) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)
  
# Metrics: Test data
  augment(fit_lasso, new_data = ...) %>%
    metrics_combined(truth = life_satisfaction, estimate = ...)
# 1. ####
# Drop missings on outcome
  data <- data %>% 
    drop_na(life_satisfaction)

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!

# Create resampled partitions
  set.seed(345)
  data_folds <- vfold_cv(data_train, v = 5) # V-fold/k-fold cross-validation
  data_folds # data_folds now contains several resamples of our training data  
  
  
  
# 2. ####  
# RECIPES
  recipe_lm <- recipe(life_satisfaction ~ unemployed + age + education, data = data_train) %>%
    step_filter_missing(all_predictors(), threshold = 0.01) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())


  recipe_ridge_lasso <- recipe(life_satisfaction ~ ., data = data_train) %>%
    update_role(respondent_id, new_role = "ID") %>% # define as ID variable
    step_filter_missing(all_predictors(), threshold = 0.01) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_zv(all_predictors()) %>% # remove predictors with zero variance
    step_normalize(all_numeric_predictors()) %>% # normalize predictors
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors())
  
# MODELS
  model_lm <- linear_reg() %>% # linear model
    set_engine("lm") %>% # lm engine
    set_mode("regression") # regression problem
  
  model_ridge <- linear_reg(mixture = 0, penalty = 2) %>% # ridge regression model
    set_engine("lm") %>% # lm engine
    set_mode("regression") # regression problem
  
  model_lasso <- 
    linear_reg(mixture = 1, penalty = 0.05) %>% 
    set_mode("regression") %>% 
    set_engine("glmnet") 
  
# WORKFLOWS
  workflow_lm <- workflow() %>% 
                  add_recipe(recipe_lm) %>% 
                  add_model(model_lm)  
  
  workflow_ridge <- workflow() %>% 
                  add_recipe(recipe_ridge_lasso) %>% 
                  add_model(model_ridge) 
  
  workflow_lasso <- workflow() %>% 
                  add_recipe(recipe_ridge_lasso) %>% 
                  add_model(model_lasso) 
  
  
# 3. ####
# TRAINING/FITTING
  fit_lm <- workflow_lm %>% fit_resamples(resamples = data_folds)
  fit_ridge <- workflow_ridge %>% fit_resamples(resamples = data_folds)
  fit_lasso <- workflow_lasso %>% fit_resamples(resamples = data_folds)
  
# RESAMPLED DATA: COLLECT METRICS  
  collect_metrics(fit_lm)
  collect_metrics(fit_ridge)
  collect_metrics(fit_lasso)
  # Lasso performed best!!
  
# FIT LASSO TO FULL TRAINING DATA
  fit_lasso <- workflow_lasso %>% parsnip::fit(data = data_train)
  tidy(fit_lasso)
  View(tidy(fit_lasso))
  View(tidy(fit_lasso) %>% filter(estimate!=0))
  
# Metrics: Training data
  metrics_combined <- metric_set(rsq, rmse, mae)
  
  augment(fit_lasso, new_data = data_train) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)
  
# Metrics: Test data
  augment(fit_lasso, new_data = data_test) %>%
    metrics_combined(truth = life_satisfaction, estimate = .pred)
labs = knitr::all_labels()
ignore_chunks <- labs[str_detect(labs, "exclude|setup|solution|get-labels")]
labs = setdiff(labs, ignore_chunks)

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer.

Footnotes

\(\gg =\) much larger↩︎
“not a huge priority since other methods (e.g. regularization) are uniformly superior methods to feature selection.” (Stackoverflow, 12.02.2025)↩︎
,i.e., shrinkage methods can perform variable selection↩︎
Achieved by computing \(M\) different linear combinations (or projections) of the variables. Then these \(M\) projections are used as predictors to fit a linear regression model by least squares, e.g., principal components regression, partial least squares (James et al. 2013, Ch. 6.3.1, 6.3.2)↩︎
the number of observations↩︎
the number of features↩︎
e.g., least squares linear regression↩︎
RR minimizes \(RSS + \lambda\sum_{j=1}^{p}\beta_{j}^{2}\). Like least squares, RSS seeks coefficients estimates that fit the data well (make RSS small). But also \(\lambda\sum_{j=1}^{p}\beta_{j}^{2}\) (shrinkage penalty) is small when \(\beta\)s are close to zero so it has effect of shrinking estimates of \(\beta_{j}\) toward zero.↩︎
Occasionally coefficients can increase as \(\lambda\) increases (cf. predictor “Rating”)↩︎
Resulting in standard deviation = 1 for all predictors and final fit independent of predictor scales.↩︎
Because the coefficients (slopes) shrink, multidimensional plane is not allowed to adapt as much to the data as for a normal linear least squares model.↩︎
In that case LS estimates will be extremely variable (cf. James et al. 2013, fig. 6.5)↩︎
e.g., best subset, forward stepwise, and backward stepwise selection↩︎
..unless \(\lambda = \infty\)↩︎