The total number of points possible for this homework is 46. The number of points for each question is written below, and questions marked as “bonus” are optional. Submit the knitted html file from this Rmd to Gradescope.
If you collaborated with anybody for this homework, put their names here:
First, a few notes about this homework. This last homework is more open-ended than the previous homeworks and is something in between a regular homework and a project. You should do your best to make all of your work and your reasoning clear, well-written, and expository in nature. In the more open-ended parts, this will help us to identify rigorous thinking and assign marks accordingly.
This homework is an exercise in COVID-19 forecasting. You should know that true (prospective) COVID-19 forecasting is much harder than the situation you face in this homework, because of data revisions: forecasters working in real time did not have access to the finalized data that you have access to now, in retrospect. Instead, they had access to preliminary data that was subject to revisions, sometimes very large and irregular ones, making forecasting much harder. See, for example, McDonald et al. (2021) for a discussion of this.
First, download the files `cases_deaths.csv` and `forecasts.csv` from the course GitHub repo. The former contains weekly reported COVID-19 case and death counts for every U.S. state, from July 4, 2020 to March 4, 2023. The latter contains 1-4 week ahead forecasts of weekly reported COVID-19 death counts for every U.S. state over that same time period. These forecasts were generated by the COVIDhub ensemble model, which is an ensemble of all qualifying forecast submissions to the U.S. COVID-19 Forecast Hub, and was the basis of official CDC communications during the pandemic. See Ray et al. (2022) for a discussion of the ensemble model.
The code below loads in these data frames (which you can run once the downloaded files are in your working directory) and plots death curves along with one set of 1-4 week ahead forecasts (and 80% prediction intervals) for each of CA, NY, and PA.
```{r}
library(tidyverse)

cases_deaths = read.csv("cases_deaths.csv")
forecasts = read.csv("forecasts.csv")
cases_deaths = cases_deaths |> mutate(date = as.Date(date))
forecasts = forecasts |> mutate(target_date = as.Date(target_date))

fc_date = as.Date("2021-08-21")
states = c("ca", "ny", "pa")

ggplot() +
  geom_line(
    data = cases_deaths |> filter(geo_value %in% states) |>
      mutate(alpha = ifelse(date <= fc_date-7, 1, 0.8)), # account for plotting
      # quirk here: line color for a segment assigned to the point on the left
    aes(x = date, y = deaths, alpha = alpha)) +
  geom_vline(xintercept = fc_date, linetype = 2) +
  geom_ribbon(
    data = forecasts |> filter(geo_value %in% states) |>
      mutate(forecast_date = target_date - ahead * 7) |>
      filter(forecast_date == fc_date),
    aes(x = target_date, ymin = forecast_0.1, ymax = forecast_0.9,
        fill = geo_value), alpha = 0.5) +
  geom_line(
    data = forecasts |> filter(geo_value %in% states) |>
      mutate(forecast_date = target_date - ahead * 7) |>
      filter(forecast_date == fc_date),
    aes(x = target_date, y = forecast_0.5, color = geo_value),
    linewidth = 1.25) +
  facet_wrap(vars(geo_value), scales = "free_y", ncol = 1) +
  labs(x = "Date", y = "Reported Covid-19 deaths") +
  theme_bw() + theme(legend.position = "none")
```
(4 pts) For each state, compute the sample auto-correlation function for COVID-19 deaths over the time period July 04, 2020 through March 27, 2021. This period captures the “winter 2020 wave”. Average the auto-correlation (take an average per lag) across all of the states. Plot the state-averaged auto-correlation function.
Do the same, but for the sample cross-correlation function between COVID-19 deaths and COVID-19 cases. Plot the state-averaged cross-correlation function. Discuss what you see in these plots.
# CODE GOES HERE
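As a starting point, here is a minimal sketch of the per-lag averaging with base R's `acf()` and `ccf()`. It uses simulated data in place of `cases_deaths.csv` (the column names `geo_value`, `cases`, and `deaths` match the loaded data frame, but the series values and the `max_lag` choice are purely illustrative).

```r
# Sketch: state-averaged ACF of deaths and CCF of deaths vs. cases,
# on simulated data standing in for the real cases_deaths data frame.
set.seed(1)
states <- c("ca", "ny", "pa")
dates  <- seq(as.Date("2020-07-04"), as.Date("2021-03-27"), by = "week")
sim <- do.call(rbind, lapply(states, function(s) {
  data.frame(geo_value = s, date = dates,
             cases  = abs(cumsum(rnorm(length(dates)))) * 100,
             deaths = abs(cumsum(rnorm(length(dates)))) * 10)
}))

max_lag <- 10

# Per-state ACF of deaths; one column per state, one row per lag 0..max_lag
acf_mat <- sapply(states, function(s) {
  y <- sim$deaths[sim$geo_value == s]
  acf(y, lag.max = max_lag, plot = FALSE)$acf[, 1, 1]
})
avg_acf <- rowMeans(acf_mat)  # average per lag across states

# Per-state CCF between deaths and cases (lags -max_lag..max_lag)
ccf_mat <- sapply(states, function(s) {
  d <- sim[sim$geo_value == s, ]
  ccf(d$deaths, d$cases, lag.max = max_lag, plot = FALSE)$acf[, 1, 1]
})
avg_ccf <- rowMeans(ccf_mat)

plot(0:max_lag, avg_acf, type = "h",
     xlab = "Lag", ylab = "State-averaged ACF")
```

With the real data, the same `sapply()`/`rowMeans()` pattern applies over all states rather than just three.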
(8 pts) For each state, separately, fit an AR model to directly forecast the death count at a horizon \(h\) from the latest three values: \[ \hat{y}_{t+h|t} = \sum_{k=0}^2 \hat\alpha_{h,k} \, y_{t-k} \] Do this for horizons \(h=1,2,3,4\), i.e., 1-4 weeks ahead. Note that you will thus fit 4 separate models per state.
Use time series cross-validation (CV) to evaluate forecasts from these AR models starting at a target date of April 3, 2021 and going through the end of the data set. You can do this by manually forming the lags yourself, fitting the AR model with `lm()`, and then manually implementing the time series CV loop, as we did in the regression lecture (weeks 3-4: "Regression and prediction").
For each horizon (1-4 weeks ahead), report the mean absolute scaled error (MASE) from time series CV, averaged over all states. Plot the state-averaged MASE from the AR models as a function of horizon.
# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE
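To illustrate the mechanics, here is a minimal sketch of the direct AR forecast, the time series CV loop, and a MASE computation for a single series and a single horizon. It uses a simulated series in place of one state's death counts; the choice of `t0` as the CV start and the use of the in-sample MAE of the one-step naive forecast as the MASE denominator are assumptions for illustration, not part of the assignment prompt.

```r
# Sketch: direct AR(3) forecast at horizon h, fit with lm(), evaluated
# with a manual time series CV loop on simulated data.
set.seed(1)
y  <- abs(cumsum(rnorm(140))) * 10  # stand-in for one state's weekly deaths
h  <- 2                             # horizon, in weeks
t0 <- 100                           # first CV time point (stand-in for Apr 3, 2021)

errs <- c()
for (t in t0:(length(y) - h)) {
  # Training pairs: regress y_{s+h} on y_s, y_{s-1}, y_{s-2}, for s <= t - h,
  # so that only data observed by time t is ever used
  s   <- 3:(t - h)
  df  <- data.frame(yh = y[s + h], l0 = y[s], l1 = y[s - 1], l2 = y[s - 2])
  fit <- lm(yh ~ l0 + l1 + l2, data = df)
  pred <- predict(fit, newdata = data.frame(l0 = y[t], l1 = y[t - 1],
                                            l2 = y[t - 2]))
  errs <- c(errs, abs(y[t + h] - pred))
}

# MASE: mean absolute error scaled by the in-sample MAE of the naive
# (last value carried forward) one-step forecast
mase <- mean(errs) / mean(abs(diff(y[1:t0])))
```

For the full question, this loop would run per state and per horizon \(h = 1, 2, 3, 4\), with the MASE then averaged over states at each horizon.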
Fit an ARIMA model to each state's death counts using `ARIMA()` from the `fable` package, and evaluate it with time series CV to make 1-4 week ahead forecasts, starting at a target date of April 3, 2021 and going through the end of the data set. (You may still want to implement the time series CV loop manually, instead of relying on `stretch_tsibble()`, as the latter may be too slow due to the large memory hit.) As before, plot the MASE from time series CV averaged over states, as a function of horizon.

# CODE GOES HERE
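The shape of the CV loop carries over directly. As a portable illustration, the sketch below uses base R's `arima()` with a fixed order as a stand-in for `fable`'s automatic `ARIMA()` (with `fable` you would build a `tsibble`, call `model(ARIMA(deaths))`, and then `forecast(h = 4)`); the fixed order `c(1, 1, 0)` and the simulated series are assumptions for illustration only.

```r
# Sketch: ARIMA forecasts at horizons 1-4 inside a manual time series CV
# loop, on simulated data; base arima() stands in for fable's ARIMA().
set.seed(1)
y  <- abs(cumsum(rnorm(140))) * 10
t0 <- 100

abs_err <- matrix(NA, nrow = length(y) - 4 - t0 + 1, ncol = 4)
for (t in t0:(length(y) - 4)) {
  fit  <- arima(y[1:t], order = c(1, 1, 0))  # fit on data up to time t only
  pred <- predict(fit, n.ahead = 4)$pred     # 1-4 step ahead point forecasts
  abs_err[t - t0 + 1, ] <- abs(y[t + 1:4] - pred)
}

# One MASE value per horizon, scaled by the naive one-step in-sample MAE
mase_by_h <- colMeans(abs_err) / mean(abs(diff(y[1:t0])))
```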
Fit two exponential smoothing models using `ETS()` from the `fable` package: Holt's linear trend and damped linear trend (each with additive errors). Evaluate them using time series CV, just as in Q7, plotting the state-averaged MASE as a function of horizon. Compare and discuss the MASE curves you have seen for all methods thus far.

# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE
# CODE GOES HERE
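For reference, in `fable` Holt's linear trend with additive errors is specified as `ETS(deaths ~ error("A") + trend("A"))`, and the damped variant as `ETS(deaths ~ error("A") + trend("Ad"))`. The portable sketch below illustrates the undamped case with base R's `HoltWinters()` (setting `gamma = FALSE` drops the seasonal component, leaving level plus trend); the damped version is not available in base R, so there it is sketched in comments only, and the simulated series is an assumption for illustration.

```r
# Sketch: Holt's linear trend via base HoltWinters() as a stand-in for
# fable's ETS(deaths ~ error("A") + trend("A")), on simulated data.
set.seed(1)
y <- ts(abs(cumsum(rnorm(60))) * 10, frequency = 1)

fit  <- HoltWinters(y, gamma = FALSE)  # level + trend, no seasonality
pred <- predict(fit, n.ahead = 4)      # 1-4 step ahead point forecasts

# With fable, the damped counterpart would be:
#   model(ETS(deaths ~ error("A") + trend("Ad")))
```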
(6 pts) For each of the 1-4 week ahead forecasts, ensemble the predictions from the models you have fit thus far (2 AR models, 2 ARX models, 1 ARIMA model, 2 ETS models, plus any other forecasters you may have fit in Q12). Do so in two ways: take the average of the forecasts, and take the median (always for a single target). Compare the MASE of these ensemble models from time series CV on the same time period as in the previous questions. What do you see?
Now instead of simply reporting the average MASE over all states, plot the whole distribution of MASE values over all states (as a histogram), per horizon. Compare the ensemble models to one or two of the previous models (say the AR and ARX models). What do you see?
# CODE GOES HERE
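The ensembling step itself is simple once the individual models' predictions are aligned by target. Here is a minimal sketch in which a matrix holds one row per target (state, target date, horizon) and one column per model; the simulated truth and predictions, and the use of MAE rather than MASE for scoring, are assumptions for illustration only.

```r
# Sketch: mean and median ensembles across models, per target, on
# simulated predictions standing in for the fitted forecasters.
set.seed(1)
truth <- abs(rnorm(50)) * 10
preds <- sapply(1:7, function(m) truth + rnorm(50, sd = m))  # 7 models

ens_mean   <- rowMeans(preds)          # average ensemble, per target
ens_median <- apply(preds, 1, median)  # median ensemble, per target

mae <- c(mean   = mean(abs(truth - ens_mean)),
         median = mean(abs(truth - ens_median)))
```

With the real forecasts, the rows would be formed by joining each model's CV predictions on state, target date, and horizon before taking the row-wise mean or median.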
Compare the 1-4 week ahead forecasts from the COVIDhub ensemble model, contained in the `forecasts` data frame, to those from your ensembles. Recall that the COVIDhub ensemble was created in real-time with only access to partial data (subject to revision). How does it compare?

# CODE GOES HERE
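Scoring the COVIDhub ensemble amounts to joining its point forecasts (the `forecast_0.5` column) to the reported death counts by state and target date, then averaging the errors per horizon. The sketch below shows that join on simulated values; the column names follow the loaded data frames, but the data, the single state, and the use of MAE rather than MASE are assumptions for illustration only.

```r
# Sketch: scoring COVIDhub point forecasts against reported deaths,
# per horizon, on simulated stand-ins for the two data frames.
set.seed(1)
dates <- seq(as.Date("2021-04-03"), by = "week", length.out = 30)

truth <- data.frame(geo_value = "ca", date = dates,
                    deaths = abs(cumsum(rnorm(30))) * 10)
hub <- expand.grid(geo_value = "ca", target_date = dates, ahead = 1:4,
                   stringsAsFactors = FALSE)
hub$forecast_0.5 <- abs(rnorm(nrow(hub))) * 10

# Align each forecast with the realized death count for its target date
scored <- merge(hub, truth, by.x = c("geo_value", "target_date"),
                by.y = c("geo_value", "date"))

# Mean absolute error of the median forecast, per horizon 1-4
mae_by_h <- tapply(abs(scored$deaths - scored$forecast_0.5),
                   scored$ahead, mean)
```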