
Confidence intervals using statistical tests#

Confidence intervals are crucial in verification for quantifying uncertainty. They are often used when testing whether one forecast system is better than another.

Modified Diebold Mariano test#

Background#

One of the most widely used methods to test the difference between two competing forecasts in the stats community is the Diebold Mariano (DM) test (Diebold and Mariano 1995). Since then, there have been several modifications to the original test statistic, two of which we have implemented in scores. They are

  • the Harvey, Leybourne and Newbold (1997) modification

  • the Hering and Genton (2011) modification

Both methods differ from Diebold and Mariano (1995) in how they estimate the spectral density of the timeseries of differences at frequency 0. The HLN method uses an improved, less biased estimate (see V_hat, Equation (5) in Harvey et al. 1997). However, this estimate can sometimes be nonpositive, in which case NaN is returned.

The HG method estimates the spectral density component using an exponential model for the autocovariances of timeseries of differences, so that positivity is guaranteed. Hering and Genton (2011) fit model parameters using empirical autocovariances computed up to half of the maximum lag. In this implementation, empirical autocovariances are computed up to half of the maximum lag or a lag of h, whichever is larger. The two model parameters (sigma and theta, in HG’s notation) are assumed to be positive and are computed using scipy.optimize.least_squares.
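
To give a feel for what the HG method does internally, the sketch below fits an exponential model to the empirical autocovariances of a toy timeseries of score differences using scipy.optimize.least_squares. The model form sigma**2 * exp(-lag / theta), the initial guesses and the lag cutoff are illustrative assumptions only, not necessarily the exact parameterisation used in scores.

[ ]:
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
d = rng.normal(size=200)  # a toy timeseries of score differences
d = d - d.mean()
n = len(d)

# Empirical autocovariances up to a chosen maximum lag (n // 2 here, purely for illustration)
max_lag = n // 2
lags = np.arange(max_lag + 1)
emp_acov = np.array([np.sum(d[: n - k] * d[k:]) / n for k in range(max_lag + 1)])

# Residuals of an assumed exponential autocovariance model: sigma**2 * exp(-lag / theta)
def residuals(params):
    sigma, theta = params
    return sigma**2 * np.exp(-lags / theta) - emp_acov

# Positive bounds keep sigma and theta positive, so the fitted autocovariances
# (and hence the implied spectral density estimate at frequency 0) stay positive
fit = least_squares(residuals, x0=[1.0, 1.0], bounds=([1e-8, 1e-8], [np.inf, np.inf]))
sigma_hat, theta_hat = fit.x
print(sigma_hat, theta_hat)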

Why use the DM test?

Unlike many other statistical tests, the DM test statistic

  • does not assume that forecast errors are normally distributed,

  • accounts for scores that are serially correlated.

For instance, there is likely to be some correlation between a lead day x forecast issued on day i, and a lead day x forecast issued on day i + 1. The DM statistic handles this serial correlation using the autocorrelation function.
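
As a quick illustration of this kind of serial correlation, the sketch below builds a toy series of daily scores with some persistence and computes its lag-1 autocorrelation. The series and the AR(1) coefficient are purely illustrative.

[ ]:
import numpy as np

rng = np.random.default_rng(42)

# Build a toy series of 100 daily scores with some persistence (an AR(1)-style process)
daily_scores = np.zeros(100)
noise = rng.normal(size=100)
for i in range(1, 100):
    daily_scores[i] = 0.6 * daily_scores[i - 1] + noise[i]

# Lag-1 autocorrelation: correlation between the series and itself shifted by one day
lag1_autocorr = np.corrcoef(daily_scores[:-1], daily_scores[1:])[0, 1]
print(round(lag1_autocorr, 2))  # noticeably above zero, i.e. the scores are serially correlated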

References#

  • Diebold, F.X. and Mariano, R.S. (1995). Comparing predictive accuracy. Journal of Business and Economic Statistics, 13, 253-263.

  • Harvey, D., Leybourne, S. and Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13, 281-291.

  • Hering, A.S. and Genton, M.G. (2011). Comparing spatial predictions. Technometrics, 53, 414-425.

Here we show a practical example of using the DM test to calculate the confidence that one forecast system performed better than another.

[1]:
import xarray as xr
import numpy as np
import pandas as pd
from scores.stats import statistical_tests
from scores.continuous import mse
from scipy.stats import skewnorm

Let’s generate some synthetic observations and two corresponding synthetic forecasts.

[2]:
# Create observations for 100 dates
obs = 10 * np.random.random((100, 100))
obs = xr.DataArray(
    data=obs,
    dims=["time", "x"],
    coords={"time": pd.date_range("2023-01-01", "2023-04-10"), "x": np.arange(0, 100)}
)

# Create forecasts for 7 lead days
fcst = xr.DataArray(data=[1]*7, dims="lead_day", coords={"lead_day": np.arange(1, 8)})
fcst = fcst * obs

# Create two synthetic forecasts that are based on the observations + noise
fcst_a = fcst + skewnorm.rvs(4, size=(7, 100, 100))
fcst_b = fcst + skewnorm.rvs(1, size=(7, 100, 100))

We next calculate the MSE for our two forecast systems.

[3]:
fcst_a_mse = mse(fcst_a, obs, preserve_dims=["time", "lead_day"])
fcst_b_mse = mse(fcst_b, obs, preserve_dims=["time", "lead_day"])
diff = fcst_a_mse - fcst_b_mse

For the DM test statistic, we need to define the value of lag h for our timeseries. For each timeseries of score differences of h-step ahead forecasts, we ask, “How many observations of the phenomenon will be made between making the forecast and having the observation that will validate the forecast?”

For example, suppose that the phenomenon is afternoon precipitation accumulation in New Zealand (00 UTC to 06 UTC each day). Then a forecast for Day+1 that is issued at 03 UTC on Day+0 will be a 2-step ahead forecast, since Day+0 and Day+1 accumulations must be observed before the forecast can be verified. On the other hand, a forecast for Day+1 issued at 09 UTC on Day+0 will be a 1-step ahead forecast.
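
The sketch below makes this counting explicit. The timestamps are hypothetical and simply mirror the precipitation example above: h is the number of observations made after the forecast is issued, up to and including the observation that verifies it.

[ ]:
import pandas as pd

# Hypothetical times mirroring the New Zealand precipitation example above
issue_time = pd.Timestamp("2023-01-01 03:00")  # forecast for Day+1 issued at 03 UTC on Day+0
verif_time = pd.Timestamp("2023-01-02 06:00")  # Day+1 accumulation is observed at 06 UTC on Day+1

# Daily accumulation observations become available at 06 UTC each day
obs_times = pd.date_range("2023-01-01 06:00", periods=5, freq="D")

# h counts the observations made after issuing the forecast, up to and including verification
h = int(((obs_times > issue_time) & (obs_times <= verif_time)).sum())
print(h)  # 2, i.e. a 2-step ahead forecast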

For our synthetic example, let’s assume that diff is the difference in MSE for maximum temperature forecasts. Suppose that only two maximum temperature observations will be made between issuing the lead day 1 forecast and verifying it, so the lead day 1 forecast is considered a 2-step ahead forecast, a lead day 2 forecast is considered a 3-step ahead forecast, and so on.

The value of h for each timeseries in the array needs to be specified in one of the sets of coordinates. We will add h coords to diff, our timeseries of differences in MSE.

[4]:
diff = diff.assign_coords(h=("lead_day", [2, 3, 4, 5, 6, 7, 8]))

We’re now ready to calculate the DM test statistic!

[5]:
dm_result = statistical_tests.diebold_mariano(diff, "lead_day", "h")
dm_result
[5]:
<xarray.Dataset>
Dimensions:          (lead_day: 7)
Coordinates:
  * lead_day         (lead_day) int64 1 2 3 4 5 6 7
Data variables:
    mean             (lead_day) float64 0.01582 -0.002308 ... 0.009162 0.008215
    dm_test_stat     (lead_day) float64 0.7692 -0.1241 -0.4839 ... 0.5187 0.4419
    timeseries_len   (lead_day) int64 100 100 100 100 100 100 100
    confidence_gt_0  (lead_day) float64 0.7791 0.4506 0.3142 ... 0.698 0.6707
    ci_upper         (lead_day) float64 0.05612 0.03414 ... 0.04378 0.04465
    ci_lower         (lead_day) float64 -0.02449 -0.03875 ... -0.02545 -0.02822

In the output above you’ll see

  • “mean”: the mean value for each timeseries, ignoring NaNs.

  • “dm_test_stat”: the modified Diebold-Mariano test statistic for each timeseries.

  • “timeseries_len”: the length of each timeseries, with NaNs removed.

  • “confidence_gt_0”: the confidence that the mean value of the population is greater than zero, based on the specified statistic_distribution argument. In this example, it tells us the confidence that fcst_b performed better than fcst_a (based on the MSE) for each lead day.

  • “ci_upper”: the upper end point of a confidence interval about the mean at the specified confidence_level.

  • “ci_lower”: the lower end point of a confidence interval about the mean at the specified confidence_level.
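
As a usage sketch, we can pull out the lead days where the test gives high confidence that fcst_b outperformed fcst_a, or check whether the confidence interval about the mean difference lies entirely above zero. The 0.95 threshold below is just an illustrative choice.

[ ]:
# Lead days where we are at least 95% confident that fcst_b outperformed fcst_a (based on MSE)
confident_lead_days = dm_result["lead_day"].where(dm_result["confidence_gt_0"] > 0.95, drop=True)

# Alternatively, a confidence interval that lies entirely above zero indicates that
# fcst_b outperformed fcst_a at the specified confidence_level
ci_excludes_zero = dm_result["ci_lower"] > 0

print(confident_lead_days.values)
print(ci_excludes_zero.values)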