The total number of points possible for this homework is 38. The number of points for each question is written below, and questions marked as “bonus” are optional. Submit the knitted html file from this Rmd to Gradescope.
If you collaborated with anybody for this homework, put their names here:
SOLUTION GOES HERE
SOLUTION GOES HERE
Using the `glmnet` package, fit a ridge regression and a lasso regression (two separate models), each over a grid of tuning parameter values \(\lambda\) chosen by the `glmnet()` function, with cardiovascular mortality as the response and all the lagged features as predictors (you should have 20 in total: 10 from particulate matter, and 10 from temperature). However, make sure you do this in a split-sample setup for validation, as follows, for each of ridge and lasso:

- fit the `glmnet` object on the first half of the time series;
- make predictions on the second half of the time series, for each \(\lambda\);
- record the MAE of the predictions on the second half, for each \(\lambda\);
- choose and report the value of \(\lambda\) with the lowest MAE;
- plot the cardiovascular mortality time series, along with the predictions on the second half, and print the MAE and the selected value of \(\lambda\) on the plot.
You can build off the code given in the regularization lecture (weeks 5-6: “Regularization and smoothing”) for fitting the ridge and lasso models, and the regression lecture (weeks 3-4: “Regression and prediction”) for the split-sample validation. Note carefully that the lecture code includes lag 0, and here we do not, so that we can make ex-ante 4-week ahead forecasts!
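To make the setup concrete, here is a minimal sketch of the split-sample loop for ridge (lasso is identical with `alpha = 1`). It is not the official solution: the predictor matrix `x` (the 20 lagged features) and response vector `y` (cardiovascular mortality) are assumed names, already built and aligned for 4-week-ahead forecasting as in the lecture code.

```r
# A minimal sketch, not the official solution: assumes a predictor matrix `x`
# (20 lagged features) and a response vector `y` (cardiovascular mortality)
# have already been built, with rows aligned for 4-week-ahead forecasting.
library(glmnet)

n     = length(y)
train = 1:floor(n / 2)              # first half: fitting
test  = (floor(n / 2) + 1):n        # second half: validation

ridge = glmnet(x[train, ], y[train], alpha = 0)   # alpha = 0: ridge;
                                                  # alpha = 1: lasso
yhat  = predict(ridge, newx = x[test, ])          # one column per lambda
mae   = colMeans(abs(yhat - y[test]))             # MAE for each lambda

lam_best = ridge$lambda[which.min(mae)]           # lambda with lowest MAE
```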
# CODE GOES HERE
# CODE GOES HERE
(8 pts) Repeat Q3, except implement time series cross-validation instead of split-sample validation. You should begin time series cross-validation on the second half of the time series, treating the first half as a burn-in set. Also, be sure to fit each ridge or lasso model using a trailing window of 200 time points (not all past).
Warning: doing time series cross-validation properly will require us to pay attention to the following. The `glmnet()` function chooses a sequence of tuning parameter values \(\lambda\) based on the passed feature matrix `x` and response vector `y` (its first two arguments). However, in time series cross-validation, these change at each time point. So if you just call `glmnet()` naively, then you will not have a consistent \(\lambda\) sequence over which to calculate MAE and perform tuning.

You can circumvent this issue by defining your own \(\lambda\) sequence ahead of time, and forcing `glmnet()` to use it by passing it through its `lambda` argument. Indeed, the best thing to do here is just to use the `lambda` sequence that `glmnet()` itself derived for the ridge and lasso models fit to the first half of the time series, which you already have from Q3. Do this, and then, just as in Q3, produce a plot of the cardiovascular mortality time series along with the predictions from time series CV on the second half, and print the MAE and the selected value of \(\lambda\) on the plot.
You can build off the code given in the regression lecture for time series cross-validation (or the code you wrote in Homework 2 to implement time series cross-validation).
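For reference, here is a sketch of what the CV loop might look like, under the same assumed names as the Q3 sketch above (`x`, `y`), with `lam0` holding the \(\lambda\) sequence saved from the Q3 fit on the first half; shown for ridge, with lasso analogous.

```r
# A sketch only: `x`, `y` as in the Q3 sketch, and `lam0` is the lambda
# sequence glmnet() derived on the first half (e.g., ridge$lambda from Q3).
# Assumes the burn-in half has at least 200 points, so the window stays
# in range; forecast alignment follows the lecture code (lags only, no lag 0).
library(glmnet)

t0   = floor(length(y) / 2)     # burn-in: start CV on the second half
w    = 200                      # trailing window length
pred = matrix(NA, length(y) - t0, length(lam0))

for (t in (t0 + 1):length(y)) {
  tr  = (t - w):(t - 1)         # trailing window of 200 time points
  fit = glmnet(x[tr, ], y[tr], alpha = 0, lambda = lam0)  # force the grid
  pred[t - t0, ] = predict(fit, newx = x[t, , drop = FALSE])
}

mae      = colMeans(abs(pred - y[(t0 + 1):length(y)]))    # MAE per lambda
lam_best = lam0[which.min(mae)]
```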
# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE
(Bonus) Implement 5-fold cross-validation in order to tune \(\lambda\) in trend filtering applied to the Boston marathon men's data set. Recall the description of how to set folds in a special "structured" way, for tuning smoothers, given near the end of the lecture notes (weeks 5-6: "Regularization and smoothing"). The code below shows how to run trend filtering and derive estimates at the held-out points for one fold. You can build off this code for your solution. You will need to install the `glmgen` package from GitHub, which you can do using the code that has been commented out.
Important note: just like `glmnet()`, the `trendfilter()` function (in the `glmgen` package) computes its own `lambda` sequence. So you will need to define an initial `lambda` sequence to pass to each subsequent call to `trendfilter()`, so that you can have a consistent grid of tuning parameter values over which to perform cross-validation. We do this below by using the `lambda` sequence that `trendfilter()` itself derived when the trend filtering model is fit on the full data set.
After implementing cross-validation, compute and report the \(\lambda\) value with the smallest cross-validated MAE. Then, lastly, plot the solution at this value of \(\lambda\) when the model is fit to the full data set (this is already available in the `tf` object below).
# devtools::install_github("glmgen/glmgen", subdir = "R_pkg/glmgen")
library(glmgen)
library(fpp3)
boston = boston_marathon |>
  filter(Year >= 1924) |>
  filter(Event == "Men's open division") |>
  mutate(Minutes = as.numeric(Time) / 60) |>
  select(Year, Minutes)
# Fit trend filtering on the entire data in order to grab the lambda sequence
tf = trendfilter(x = boston$Year, y = boston$Minutes, k = 1)
lambda = tf$lambda
n = nrow(boston) # Number of points
k = 5 # Number of folds
inds = rep_len(1:k, n) # Fold indices
# Fit trend filtering on all points but those in first fold. We are forcing it
# to use the lambda sequence that we saved above
tf_subset = trendfilter(x = boston$Year[inds != 1],
                        y = boston$Minutes[inds != 1],
                        k = 1, lambda = lambda)
# Compute the predictions on the points in the first fold. Plot the predictions
# (as a sanity check) at a particular value of lambda in the middle of the grid
yhat = predict(tf_subset, x.new = boston$Year[inds == 1])
## Warning: In predict:
## Predict called at new x values out of the original range.
plot(boston$Year, boston$Minutes, col = 8)
points(boston$Year[inds == 1], yhat[, 25], col = 2, pch = 19)
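To give a sense of how the full bonus solution might be structured, here is one way to extend the per-fold code above into the complete 5-fold loop. It is a sketch, not the official solution, reusing `boston`, `lambda`, `inds`, `n`, and `k` from above.

```r
# Sketch: accumulate held-out absolute errors for every lambda, fold by fold
err = matrix(NA, n, length(lambda))
for (f in 1:k) {
  tf_f = trendfilter(x = boston$Year[inds != f],
                     y = boston$Minutes[inds != f],
                     k = 1, lambda = lambda)       # same lambda grid each fold
  yhat_f = predict(tf_f, x.new = boston$Year[inds == f])
  err[inds == f, ] = abs(yhat_f - boston$Minutes[inds == f])
}
cv_mae = colMeans(err)                # cross-validated MAE per lambda
lambda[which.min(cv_mae)]             # lambda with the smallest CV MAE
```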
SOLUTION GOES HERE
`acf()`, and compare to the analytic formula you derived.
# CODE GOES HERE
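Since the question text above is cut off, here is a purely illustrative stand-in: it compares the sample ACF from `acf()` for a simulated AR(1) process against its known analytic ACF \(\rho(h) = \phi^h\). Swap in the process and the formula from the actual question.

```r
# Illustrative stand-in only (the question above is truncated): compare the
# sample ACF of a simulated AR(1) to its analytic ACF, rho(h) = phi^h
set.seed(1)
phi = 0.6
z   = arima.sim(model = list(ar = phi), n = 2000)
a   = acf(z, lag.max = 10, plot = FALSE)
cbind(sample = a$acf[-1], analytic = phi^(1:10))  # side-by-side comparison
```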