Interactive online version:

Quantile Interval Score and Interval Score#

The quantile interval and interval scores are consistent scoring functions to evaluate prediction intervals, where the aim is to predict an interval that captures uncertainty at a given confidence level. Quantile interval score can accept any values for the lower and upper quantile levels as long as \(0 < \text{lower quantile level} < \text{upper quantile level} < 1\) is met, whereas interval score is only for forecast intervals with symmetric quantile level range. For example lower and upper quantile levels of 0.2 and 0.9 can be used by the quantile interval score but not by the interval score.

The quantile interval score function is defined as the sum of three penalties, as follows:

Interval Width Penalty:

\[W = q_{u} - q_{l}\]
Over-Prediction Penalty:

\[\begin{split} O = \begin{cases} \frac{1}{\alpha_l} \cdot (q_{l} - y) & \text{if } y < q_{l} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
Under-Prediction Penalty:

\[\begin{split} U = \begin{cases} \frac{1}{1 - \alpha_u} \cdot (y - q_{u}) & \text{if } y > q_{u} \\ 0 & \text{otherwise} \end{cases}\end{split}\]

So, the total score can be written as:

\[S = W + O + U\]

where:

\(S\) is the scoring function (here quantile interval score),
\(q_l\) is the lower quantile forecast,
\(q_u\) is the upper quantile forecast,
\(y\) is the observation,
\(\alpha_l\) is the lower quantile level,
\(\alpha_u\) is the upper quantile level.

Lower values of the quantile interval scores are better. As you can see, this score penalises the width of the interval as well as the extent to which the observations fall outside of the interval forecast (i.e., under-prediction penalty when the observations are larger than the upper-quantile forecasts, and over-prediction penalty when the observations are smaller than the lower-quantile forecasts).

In the case of interval score, over- and under-prediction penalties can be simplified to:

Over-Prediction Penalty:

\[\begin{split} O = \begin{cases} \frac{2}{\alpha} \cdot (q_{l} - y) & \text{if } y < q_{l} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
Under-Prediction Penalty:

\[\begin{split} U = \begin{cases} \frac{2}{\alpha} \cdot (y - q_{u}) & \text{if } y > q_{u} \\ 0 & \text{otherwise} \end{cases}\end{split}\]

where \(\alpha = 1 - \text{interval range}\). For example, for a forecast with 0.5 interval range (i.e., 50% confidence level), \(\alpha\) is 0.5, and considering the symmetric quantile range condition, the lower and upper quantile levels would be \(\frac{\alpha}{2} = 0.25\) and \(1 - \frac{\alpha}{2} = 0.75\), respectively.

To learn more about quantile interval and interval scores please see the following articles:

Winkler, R. L. (1972). A Decision-Theoretic Approach to Interval Estimation. Journal of the American Statistical Association, 67(337), 187. https://doi.org/10.2307/2284720
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359-378. Section 6.2. https://doi.org/10.1198/016214506000001437

Now let’s look at an example where we demonstrate how to use the quantile interval and interval scores:

[1]:

from scores.continuous import quantile_interval_score, interval_score
import numpy as np
import pandas as pd
import xarray as xr

[ ]:

# help(quantile_interval_score)  # Uncomment this to see the help message

[ ]:

# help(interval_score)  # Uncomment this to see the help message

In this example, we generate synthetic observation data and interval forecasts for temperature under two scenarios:

Scenario 1: There is a forecast 50% chance that the temperature will be between 15 and 20 °C, with a lower quantile of 0.10 (indicating a 10% chance that the temperature may be cooler than 15 °C) and an upper quantile of 0.60 (indicating a 40% chance that the temperature may be warmer than 20 °C).

Scenario 2: There is a forecast 50% chance that the temperature may be between 15 and 20 °C, with a lower quantile of 0.25 (indicating a 25% chance that the temperature may be cooler than 15 °C) and an upper quantile of 0.75 (indicating a 25% chance that the temperature may be warmer than 20 °C).

You might have already guessed why we chose these scenarios. Scenario 1 represents a case where the quantile range is not symmetric, so we can only use the quantile interval score for this scenario. In contrast, Scenario 2 has a symmetric quantile range allowing us to use both the quantile interval score and the interval score, which should yield the same results.

[2]:

np.random.seed(42)
lat = np.linspace(-90, 90, 10)
lon = np.linspace(-180, 180, 20)
times = pd.date_range('2023-11-19', periods=24, freq='h')
forecast_lower_1 = np.random.uniform(13.0, 15.0, size=(len(lat), len(lon), len(times)))
forecast_lower_da_1 = xr.DataArray(
    forecast_lower_1,
    dims=('lat', 'lon', 'time'),
    coords={'lat': lat, 'lon': lon, 'time': times}
)
forecast_upper_1 = np.random.uniform(20.0, 23.0, size=(len(lat), len(lon), len(times)))
forecast_upper_da_1 = xr.DataArray(
    forecast_upper_1,
    dims=('lat', 'lon', 'time'),
    coords={'lat': lat, 'lon': lon, 'time': times}
)

forecast_lower_2 = np.random.uniform(12.0, 15.0, size=(len(lat), len(lon), len(times)))
forecast_lower_da_2 = xr.DataArray(
    forecast_lower_2,
    dims=('lat', 'lon', 'time'),
    coords={'lat': lat, 'lon': lon, 'time': times}
)
forecast_upper_2 = np.random.uniform(20.0, 22.0, size=(len(lat), len(lon), len(times)))
forecast_upper_da_2 = xr.DataArray(
    forecast_upper_2,
    dims=('lat', 'lon', 'time'),
    coords={'lat': lat, 'lon': lon, 'time': times}
)

observations = np.random.uniform(10.0, 24.0, size=(len(lat), len(lon), len(times)))
obs_da = xr.DataArray(
    observations,
    dims=('lat', 'lon', 'time'),
    coords={'lat': lat, 'lon': lon, 'time': times}
)

Now let’s calculate scores for both scenarios. At first we will calculate the aggregated score. Later we will see how to calculate scores for specific dimension(s) (e.g., for each time forecast step) by using preserve_dims or reduce_dims arguments.

[3]:

overall_qis_sc1 = quantile_interval_score(
    obs=obs_da,
    fcst_lower_qtile=forecast_lower_da_1,
    fcst_upper_qtile=forecast_upper_da_1,
    lower_qtile_level=0.1,
    upper_qtile_level=0.6
)
overall_qis_sc1

As mentioned, we can use either the quantile interval score or the interval score for Scenario 2 due to the symmetric quantile range of forecast intervals in this scenario. Both scores will yield the same results. Let’s verify this now:

[4]:

# Quantile interval score for forecast Scenario 2
overall_qis_sc2 = quantile_interval_score(
    obs=obs_da,
    fcst_lower_qtile=forecast_lower_da_2,
    fcst_upper_qtile=forecast_upper_da_2,
    lower_qtile_level=0.25,
    upper_qtile_level=0.75
)

[5]:

# Interval score for forecast Scenario 2
overall_is_sc2 = interval_score(
    obs=obs_da,
    fcst_lower_qtile=forecast_lower_da_2,
    fcst_upper_qtile=forecast_upper_da_2,
    interval_range=0.5,
)
overall_is_sc2

In this particular example, we see that forecast intervals with quantile levels of 0.25 and 0.75 achieve a better score (smaller value) compared to those with quantile levels of 0.1 and 0.6. You can create your own examples to explores different scenarios.

Now let’s see how the this score varies over the forecast time for our two examples. In this case we need to set preserve_dims=['time'].

[6]:

qis_time = quantile_interval_score(
    obs=obs_da,
    fcst_lower_qtile=forecast_lower_da_1,
    fcst_upper_qtile=forecast_upper_da_1,
    lower_qtile_level=0.05,
    upper_qtile_level=0.95,
    preserve_dims=['time']
)
qis_time

[8]:

is_time = interval_score(
    obs=obs_da,
    fcst_lower_qtile=forecast_lower_da_2,
    fcst_upper_qtile=forecast_upper_da_2,
    interval_range=0.5,
    preserve_dims=['time']
)
is_time

Users have the option to input an array of weights within quantile_interval_score and interval_score to compute a weighted average score. An example scenario could involve assigning weights to regions based on their population when computing these scores. The following is an example of using weights, with 1 mostly but a larger weight at a few locations based on (lat, lon).

[9]:

weights_data = np.ones_like(observations)
weights_data[:, -4:, -4:] = 3
weights_data[:, -6:-4, -6:-4] = 3
weights = xr.DataArray(weights_data, dims=('lat', 'lon', 'time'))
weighted_qis = quantile_interval_score(
    obs=obs_da,
    fcst_lower_qtile=forecast_lower_da_1,
    fcst_upper_qtile=forecast_upper_da_1,
    lower_qtile_level=0.05,
    upper_qtile_level=0.95,
    weights=weights
)
weighted_qis

[10]:

weighted_is = interval_score(
    obs=obs_da,
    fcst_lower_qtile=forecast_lower_da_2,
    fcst_upper_qtile=forecast_upper_da_2,
    interval_range=0.5,
    weights=weights
)
weighted_is