DATA 521: Time Series Analysis and Forecasting

DATA 521

DATA 521 is Time Series Analysis and Forecasting.


COVID-19 Forecasting

The summary page


About

Forecasting is predicting the future. That’s hard. There is a certain science to forecasting that we can make use of, all the while recognizing that to get leverage on the future by using the past we have to assume that things are similar to the past in the relevant ways. This point should never be lost.

We follow the general workflow outlined by Rob J Hyndman and George Athanasopoulos in Forecasting: Principles and Practice. The key is the workflow. Tidying means organizing the data around an index, a proper date/time that can be understood as appropriately sequential [with this sequence denotable], and a key that identifies some set of distinct time series that are [potentially] stored with the same or similar index of time. Time is central to time series and to the tsibble. With this resolved, there is graphical, decomposition, and feature understanding before the application of models. Almost all the action is in adding models to the toolbox: basic time series regressions, ETS models of varying forms, ARIMA, dynamic regressions and their integration with ARIMA errors, STL for seasonal adjustment paired with ARIMA or ETS models (STL+), and advances including aggregation, hierarchical and grouped time series with forecast reconciliation, prophet, VAR, TBATS, neural networks, and others.
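To make the index/key idea concrete, here is a minimal sketch; the data frame sales_df and its columns date, store, and revenue are made-up placeholders rather than course data.

library(fpp3)
# sales_df: a tidy data frame with columns date (daily), store, and revenue (placeholders)
sales_ts <- sales_df %>%
  mutate(date = as.Date(date)) %>%        # a proper date/time to serve as the index
  as_tsibble(index = date,                # the index: what orders the observations
             key = store)                 # the key: one distinct series per store
sales_ts %>% autoplot(revenue)            # graphical understanding comes before modeling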

The core issue remains the criteria for evaluation. Model fit tells us how well we do in the data that we used to fit the model. That’s important to know. But we often really want to know which model is best over a given forecast horizon. We can use stretch_tsibble(.init, .step) to decide how much data is required to produce a credible initial forecast and over steps of what size to repeat it. This allows us to average over a whole collection of future forecasts that are implications of each model. Pairing that with accuracy(., original_tsibble) allows us to evaluate which model has been best at forecasting (h periods out, with steps of .step size) over the determined horizon. Because we only have one time series and cannot explicitly re-run time [though bootstrapping/bagging helps], we can at least know which model has performed best in our fixed-horizon task.
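As a sketch of that workflow (the model names and the placeholders my_tsibble and value are illustrative, not course code):

# Rolling-origin cross-validation with stretch_tsibble() and accuracy()
cv_fits <- my_tsibble %>%
  stretch_tsibble(.init = 24, .step = 4) %>%   # first origin uses 24 observations; add 4 per re-fit
  model(ets   = ETS(value),
        arima = ARIMA(value))
cv_fits %>%
  forecast(h = 8) %>%          # h-step forecasts from every origin
  accuracy(my_tsibble) %>%     # score each forecast against the observed series
  arrange(RMSE)                # which model has been best over this horizon?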

The course textbook: Forecasting: Principles and Practice (3rd edition) by Rob J Hyndman and George Athanasopoulos

Software: R, with the fpp3 collection of packages.


Slides

Complete week 4 data for week 5:

load(url("https://github.com/robertwwalker/DADMStuff/raw/master/Ch4HA.RData"))

Some Resources:

Sliding Windows with Slider

Data

I will use the data from the New York Times on COVID. I want to illustrate the use of slider for the creation of moving averages on tsibble structures. To provide some context for the data, let me import it, transform it a bit to make it more sensible [removing negative values], and turn it into a tsibble.

library(tidyverse); library(fpp3); library(hrbrthemes)
NYT.COVIDN <- read.csv(url("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"))
# Define a tsibble; the date is imported as character so mutate that first.
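What follows is a minimal sketch of that step (my reconstruction rather than the post’s verbatim code); the NYT file has columns date, state, fips, cases, and deaths, with cases and deaths cumulative.

library(slider)
NYT.tsibble <- NYT.COVIDN %>%
  mutate(date = as.Date(date)) %>%                      # the date arrives as character
  as_tsibble(index = date, key = state) %>%             # one series per state
  group_by_key() %>%
  mutate(new.cases = pmax(cases - lag(cases, default = 0), 0),                  # daily counts; negatives removed
         MA7 = slide_dbl(new.cases, mean, .before = 6, .complete = TRUE)) %>%   # trailing 7-day moving average
  ungroup()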

Working with Equities

I loaded the following quietly.

library(knitr); library(tidyverse); library(tidyquant); library(fpp3); library(hrbrthemes); library(kableExtra)

A bit on equities

tidyquant is a very handy source for equities data. The data are returned as a tibble.

Ford <- tq_get("F", from="2019-01-01")
Ford
## # A tibble: 582 x 8
##   symbol date        open  high   low close volume adjusted
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
## 1 F      2019-01-02  7.53  8.02  7.
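A hedged sketch of the next step (my own illustration, not necessarily the post’s code): because trading days are irregular, the tibble can either be declared an irregular tsibble or re-indexed by trading day before fpp3 tools are applied.

Ford.ts <- Ford %>%
  as_tsibble(index = date, key = symbol, regular = FALSE)   # declare the interval irregular; weekends are not gaps
Ford.td <- Ford %>%
  mutate(trading_day = row_number()) %>%                    # or re-index on consecutive trading days
  as_tsibble(index = trading_day, key = symbol)
Ford.td %>% autoplot(adjusted)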

Timetk: an Exploration and Comparison

I want to use a dataset from fpp3 and have a look at how the code translation works for the things that they do. The organizing form for the data differs a bit between the two: fpp3 is built around the tsibble as a construct, while timetk dodges that with somewhat more explicit declarations. In many ways, as some of you have learned through frustration with aggregation and tsibble, this is more flexible because not all time is neatly embedded in all other time [weeks <=> months]. A sketch of the two aggregation styles follows.
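As a hedged sketch (my own illustration; daily_data, date, and value are placeholder names), monthly aggregation of a daily series runs through index_by() on a tsibble but through summarise_by_time() on a plain tibble with timetk.

library(fpp3); library(timetk)
# fpp3 / tsibble: aggregate via the declared index
monthly_ts <- daily_data %>%
  as_tsibble(index = date) %>%
  index_by(month = yearmonth(date)) %>%
  summarise(total = sum(value))
# timetk: the same aggregation with an explicit date-variable declaration
monthly_tk <- daily_data %>%
  summarise_by_time(.date_var = date, .by = "month", total = sum(value))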

fredr is neat

FRED via fredr

Federal Reserve Economic Data [FRED] is a wonderful public resource, and the R API that connects to it is very easy to use for the things that I have previously needed. For example, one of my students was interested in commercial credit default data. I used the FRED search instructions from the following vignette to find those data. My first step was the vignette for using fredr.
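As a hedged sketch of that search step (my own example; the search text is illustrative and not necessarily the query used), fredr exposes FRED’s search endpoint directly:

library(fredr)
fredr_set_key("YOUR_FRED_API_KEY")                       # free key from the St. Louis Fed
candidates <- fredr_series_search_text("delinquency rate commercial loans")
candidates %>% dplyr::select(id, title, frequency) %>% head()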

DATA 521: Forecasting COVID-19

This is the summary forecasting page. It contains a few examples that are more or less complete on their own. The course textbook: Forecasting: Principles and Practice, 3rd Edition

Employment

R Markdown

You will need to create a FRED API key.

library(fredr); library(tidyverse); library(fpp3)
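A hedged sketch of pulling an employment series (the exact series used in the post is not shown here; PAYEMS, total nonfarm payrolls, stands in as an assumed example):

fredr_set_key("YOUR_FRED_API_KEY")
payems <- fredr(series_id = "PAYEMS") %>%        # monthly total nonfarm payrolls
  mutate(Month = yearmonth(date)) %>%
  select(Month, Employment = value) %>%
  as_tsibble(index = Month)
payems %>% autoplot(Employment)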

Spurious Regressions

Regression Basics

A regression of two completely random variables, in this case normal, should result in about 5% of t-statistics with p-values less than 0.05. This is the basis for the interpretation of a p-value: how often would we see something this extreme, or even more extreme, randomly?

NReg <- function(junk) {
  y1 <- rnorm(200)
  y2 <- rnorm(200)
  return(summary(lm(y1~y2))$coefficients[2,4])
}
NReg.Result <- data.frame(Res=sapply(1:1000, function(x) {NReg(x)}))
table(NReg.Result$Res < 0.05)
##
## FALSE  TRUE
##   949    51

Checks out.
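The baseline behaves as advertised; the spurious-regression problem appears when the two series are independent random walks. As a hedged continuation (my own sketch of the standard demonstration, not necessarily the post’s code), replacing the white noise with cumulative sums inflates the rejection rate far beyond 5%:

RWReg <- function(junk) {
  y1 <- cumsum(rnorm(200))            # independent random walks
  y2 <- cumsum(rnorm(200))
  return(summary(lm(y1 ~ y2))$coefficients[2, 4])
}
RWReg.Result <- data.frame(Res = sapply(1:1000, function(x) { RWReg(x) }))
table(RWReg.Result$Res < 0.05)
# TRUE counts typically run far above 50 in 1000: levels regressions of
# independent random walks reject the null much too often.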