The total number of points possible for this homework is 34. The number of points for each question is written below, and questions marked as “bonus” are optional. Submit the knitted html file from this Rmd to Gradescope.
If you collaborated with anybody for this homework, put their names here:
SOLUTION GOES HERE
SOLUTION GOES HERE
# CODE GOES HERE
SOLUTION GOES HERE
SOLUTION GOES HERE
SOLUTION GOES HERE
SOLUTION GOES HERE
SOLUTION GOES HERE
SOLUTION GOES HERE
SOLUTION GOES HERE
SOLUTION GOES HERE
When we learned time series cross-validation in lecture (weeks 3-4, "Linear regression and prediction"), we implemented it "manually", by writing a loop in R to iterate over time, rebuild models, and so on. The fable package in R does it differently. It relies on data being stored in a class known as a `tsibble`, which is like a special data frame for time series. You can then use a function called `stretch_tsibble()` to "prepare it" for time series cross-validation. Take a look at what it does with this example:
library(tidyverse)
library(fpp3)
dat = tsibble(date = as.Date("2023-10-01") + 0:9,
              value = 1:10 + rnorm(10, sd = 0.25),
              index = date)
dat
## # A tsibble: 10 x 2 [1D]
## date value
## <date> <dbl>
## 1 2023-10-01 1.38
## 2 2023-10-02 1.85
## 3 2023-10-03 3.10
## 4 2023-10-04 4.04
## 5 2023-10-05 4.43
## 6 2023-10-06 5.70
## 7 2023-10-07 6.63
## 8 2023-10-08 8.13
## 9 2023-10-09 9.06
## 10 2023-10-10 10.3
dat_stretched = dat |> stretch_tsibble(.init = 3)
dat_stretched
## # A tsibble: 52 x 3 [1D]
## # Key: .id [8]
## date value .id
## <date> <dbl> <int>
## 1 2023-10-01 1.38 1
## 2 2023-10-02 1.85 1
## 3 2023-10-03 3.10 1
## 4 2023-10-01 1.38 2
## 5 2023-10-02 1.85 2
## 6 2023-10-03 3.10 2
## 7 2023-10-04 4.04 2
## 8 2023-10-01 1.38 3
## 9 2023-10-02 1.85 3
## 10 2023-10-03 3.10 3
## # ℹ 42 more rows
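Since only the first 10 rows are printed, it can help to tabulate the number of rows in each `.id` group directly. The snippet below is one way to check this (a sketch, rebuilding the same example; values of `value` will differ since no seed was set above):

```r
library(fpp3)

# Rebuild the example data and stretch it, as above
dat = tsibble(date = as.Date("2023-10-01") + 0:9,
              value = 1:10 + rnorm(10, sd = 0.25),
              index = date)
dat_stretched = dat |> stretch_tsibble(.init = 3)

# Count rows per .id: slices of sizes 3, 4, ..., 10, totaling 52 rows
dat_stretched |>
  as_tibble() |>
  count(.id)
```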
What this does is take the first 3 entries of the time series and assign them `.id = 1`. Then it appends the first 4 entries of the time series with `.id = 2`, then the first 5 entries with `.id = 3`, and so on. Downstream, when we go to fit a forecast model with fable, it will (by default) fit a separate model to the data in each level of the `.id` column. By making forecasts at, say, horizon `h = 1`, these are precisely the 1-step ahead forecasts that we would generate in time series CV:
dat_fc = dat_stretched |>
model(RW = RW(value ~ drift())) |>
forecast(h = 1)
dat_fc
## # A fable: 8 x 5 [1D]
## # Key: .id, .model [8]
## .id .model date value .mean
## <int> <chr> <date> <dist> <dbl>
## 1 1 RW 2023-10-04 N(4, 0.47) 3.97
## 2 2 RW 2023-10-05 N(4.9, 0.21) 4.93
## 3 3 RW 2023-10-06 N(5.2, 0.21) 5.19
## 4 4 RW 2023-10-07 N(6.6, 0.21) 6.56
## 5 5 RW 2023-10-08 N(7.5, 0.17) 7.51
## 6 6 RW 2023-10-09 N(9.1, 0.2) 9.09
## 7 7 RW 2023-10-10 N(10, 0.17) 10.0
## 8 8 RW 2023-10-11 N(11, 0.15) 11.3
The `.mean` column gives us the point forecasts. To evaluate these, we could join the original data `dat` to the point forecasts in `dat_fc`, align by the `date` column, and compute whatever metrics we wanted. However, there is also a handy function that does all of this for us, called `accuracy()`. It computes a bunch of common metrics, and here we just pull out the `MAE` column:
accuracy(dat_fc, dat) |> select(.model, MAE)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 1 observation is missing at 2023-10-11
## # A tibble: 1 × 2
## .model MAE
## <chr> <dbl>
## 1 RW 0.298
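For reference, here is roughly what the manual join described above might look like (a sketch, rebuilding the same objects; `left_join` by `date` is one reasonable way to align forecasts with observations, and the exact MAE will differ since no seed was set):

```r
library(fpp3)

# Rebuild the objects from the example above
dat = tsibble(date = as.Date("2023-10-01") + 0:9,
              value = 1:10 + rnorm(10, sd = 0.25),
              index = date)
dat_fc = dat |>
  stretch_tsibble(.init = 3) |>
  model(RW = RW(value ~ drift())) |>
  forecast(h = 1)

# Join point forecasts to observed values by date, then compute the MAE;
# the last forecast date (2023-10-11) has no observation, hence na.rm = TRUE
dat_fc |>
  as_tibble() |>
  select(date, .mean) |>
  left_join(as_tibble(dat), by = "date") |>
  summarize(MAE = mean(abs(.mean - value), na.rm = TRUE))
```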
Now for the questions.

Suppose we have a `tsibble`, call it `x`, with `n` rows, and we run `stretch_tsibble(x, .init = t0)`. How many rows does the output have? Derive the answer mathematically (as an explicit formula involving \(n, t_0\)), and then verify it with a couple of code examples.

SOLUTION GOES HERE
# CODE GOES HERE
Verify that the MAE computed above, for the drift forecaster applied to `dat`, matches the MAE from a manual implementation of time series CV with the same forecaster. (Your manual implementation can build off the code from the regression lecture, and/or from previous homeworks.)

# CODE GOES HERE
This question concerns the `leisure` data set from the HA book, which the code excerpt below (taken from the ARIMA lecture) prepares for us. Use time series CV, implemented using `stretch_tsibble()`, `model()`, and `forecast()`, as described above, to evaluate the MAE of the following four models:

The last models are motivated by the exploratory analysis done in lecture. The first two remove the seasonal component, which should not be very good (since there is clear seasonality in the data). For each model, use a burn-in period of length 50 (i.e., set `.init = 50` in the call to `stretch_tsibble()`). A key difference in how you implement time series CV compared to the above examples: you should consider 1-step, 2-step, all the way through 12-step ahead forecasts. But do not worry! This can be handled with an appropriate call to `forecast()`. For each model, calculate the MAE by averaging over all forecast horizons (1 through 12). Report the results and rank the models by their MAE.
leisure = us_employment |>
  filter(Title == "Leisure and Hospitality", year(Month) > 2000) |>
  mutate(Employed = Employed/1000) |>
  select(Month, Employed)
# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE