Threshold weighted scores#
Introduction#
There has long been a desire to evaluate the performance of forecasts of extreme events using verification methods that cannot be hedged. Such evaluation brings together two concepts: the avoidance of hedging, particularly of probabilities, and a focus on extreme conditions (e.g. very hot conditions).
Hedging occurs when a forecast is optimised for a verification metric rather than for a forecast directive (see Murphy (1978)). In the context of an automated system, hedging occurs when the model or system is designed or trained according to a verification metric which is misaligned to the forecast directive.
An example of a forecast directive is to predict the 90th percentile wind speed. If these 90th percentile forecasts were evaluated against observed wind speeds using mean absolute error (MAE), the forecaster or forecast system developer faces a dilemma: produce forecasts that are consistent with the forecast directive, or produce forecasts that optimise the expected MAE (since the expected MAE is optimised by forecasting the 50th percentile wind speed).
Hedging is of greatest concern when considering the probabilities of extreme events on which significant decisions are made. These tensions can be removed by using a consistent verification score which will not penalise the forecaster or model designer for meeting the forecast directive - i.e. these scores do not allow hedging (see the Consistent Scores tutorial).
Threshold-weighted consistent scoring functions also allow us to measure the performance of forecasts with a focus on a range of decision thresholds while still preventing hedging. An example of someone acting on a decision threshold is a manager deciding to shut the roof of a stadium if the forecast temperature is expected to exceed 35°C. Threshold-weighted scores allow us to measure the performance of predicting such extremes or some other range of decision thresholds of interest.
Threshold-weighted scores contrast with metrics like mean squared error (MSE) and mean absolute error (MAE), which can be interpreted as evaluating performance when equal weight is placed on all decision thresholds (Ehm et al. 2016).
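To make "equal weight on all decision thresholds" concrete, here is a sketch of the mixture representation of Ehm et al. (2016) for the squared error of a forecast x against an observation y (the symbols x, y, θ, S and w below are notation for this sketch only, not identifiers from the scores package): the squared error is an integral of elementary scores over all decision thresholds θ, and threshold weighting simply replaces the constant weight on θ with a chosen weight function w(θ).

$$(x - y)^2 = 2\int_{-\infty}^{\infty} S_\theta(x, y)\,\mathrm{d}\theta,
\qquad
S_\theta(x, y) =
\begin{cases}
|y - \theta| & \text{if } \min(x, y) \le \theta < \max(x, y),\\
0 & \text{otherwise}.
\end{cases}$$

A threshold weighted squared error is then $2\int_{-\infty}^{\infty} w(\theta)\, S_\theta(x, y)\,\mathrm{d}\theta$; the rectangular and trapezoidal weighting schemes described below are particular choices of $w$.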
As another example, suppose health agencies will start taking preventative action for treating heat-related illness whenever the temperature forecast for the following day exceeds 35°C, and that this preventative action becomes increasingly important with even higher temperature forecasts. If the temperature forecast is 20°C and the observed temperature is 10°C, then health agencies will have made the correct decision to not take preventative action, despite the large forecast error. In this case, the MSE would be 100°C². However, if we calculate a threshold weighted MSE (twMSE) that places an equal weight on decision thresholds above 35°C and a zero weight on decision thresholds below 35°C, the twMSE will give a value of 0 in this case. If at least one of the forecast value and the observed value is above 35°C and they differ from each other, then there will be a non-zero error.
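As a minimal sketch of this example (the single forecast-observation pair is taken from the paragraph above, and the call pattern matches the one demonstrated later in this tutorial):

import numpy as np
import xarray as xr
from scores.continuous import mse, tw_squared_error

# A single forecast of 20°C against an observation of 10°C
fcst = xr.DataArray([20.0], dims=["time"], coords={"time": [0]})
obs = xr.DataArray([10.0], dims=["time"], coords={"time": [0]})

# Unweighted MSE: a large error of 100°C²
print(mse(fcst, obs).item())
# twMSE with a weight of 1 on decision thresholds at or above 35°C and 0 below:
# the error is irrelevant to the 35°C decision, so the score is 0
print(tw_squared_error(fcst, obs, interval_where_one=(35, np.inf)).item())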
The arguments for threshold-weighting (interval_where_one and interval_where_positive) are separate and different from the weights argument that appears in most metrics in scores. weights of the normal kind can still be used in the threshold weighted scores. The weights argument weights the errors at each point before calculating the mean score. Examples include weighting the results based on latitude to account for grid size, or weighting the results based on population density. For more information see the weighting results tutorial.
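As a brief sketch of combining the two kinds of weighting (the latitude grid and data values here are illustrative assumptions, and we assume the threshold weighted scores accept the same weights argument as the other metrics, as described above):

import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

# Illustrative rainfall (mm) forecasts and observations on a small latitude grid
lat = np.array([-60.0, -30.0, 0.0, 30.0, 60.0])
fcst = xr.DataArray([42.0, 10.0, 38.0, 45.0, 5.0], dims=["latitude"], coords={"latitude": lat})
obs = xr.DataArray([48.0, 12.0, 30.0, 41.0, 7.0], dims=["latitude"], coords={"latitude": lat})

# Cosine-of-latitude weights, a common way to account for grid cell area
area_weights = xr.DataArray(np.cos(np.deg2rad(lat)), dims=["latitude"], coords={"latitude": lat})

# Threshold weighting (decision thresholds of 40mm and above) and point weighting applied together
print(tw_squared_error(fcst, obs, interval_where_one=(40, np.inf), weights=area_weights).item())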
The following threshold weighted scores are available in scores:
scores.continuous.tw_squared_error
scores.continuous.tw_absolute_error
scores.continuous.tw_quantile_score
scores.continuous.tw_expectile_score
scores.continuous.tw_huber_loss
There are two types of threshold weighting that are supported: rectangular and trapezoidal. More flexible threshold weighting schemes are discussed at the end of this tutorial.
Rectangular Threshold Weighting#
To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 ≤ threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf. Only a single interval can be provided. See example 3 below for a demonstration of how to calculate a threshold weighted score for multiple intervals.
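As a small sketch of the rectangular call pattern (the rainfall values are illustrative assumptions):

import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

fcst = xr.DataArray([2.0, 8.0, 15.0], dims=["time"], coords={"time": [0, 1, 2]})
obs = xr.DataArray([3.0, 12.0, 20.0], dims=["time"], coords={"time": [0, 1, 2]})

# Weight of 1 on decision thresholds satisfying 0 <= threshold < 10, and 0 otherwise
lower = tw_squared_error(fcst, obs, interval_where_one=(0, 10))
# Weight of 1 on decision thresholds of 10 and above, using an infinite endpoint
upper = tw_squared_error(fcst, obs, interval_where_one=(10, np.inf))
print(lower.item(), upper.item())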
Trapezoidal Threshold Weighting#
To specify a trapezoidal threshold weight, first define interval_where_one, which determines where there will be a threshold weight of 1 using the desired endpoints. Next, determine where the threshold weights should linearly increase to the left of the rectangular threshold weight and linearly decrease to its right. This is done using the interval_where_positive argument. Consider an example where a threshold weight of 1 is applied to decision thresholds satisfying 2 ≤ threshold < 4. Additionally, we want the threshold weight to increase linearly from 0 to 1 on the interval [-2, 2), decrease linearly from 1 to 0 on the interval [4, 10], and be 0 otherwise. This can be done by setting interval_where_positive=(-2, 10) and interval_where_one=(2, 4). Endpoints of interval_where_positive can only be infinite if the corresponding interval_where_one endpoint is infinite, and corresponding endpoints of interval_where_positive and interval_where_one must otherwise differ. In example 2, we show a case where we set interval_where_positive=(30, np.inf) and interval_where_one=(40, np.inf). This corresponds to the threshold weight linearly increasing from 0 to 1 on the interval [30, 40), and a threshold weight of 1 on the interval [40, infinity). These linear increases and decreases in the threshold weight are what give "trapezoidal threshold weighting" its name.
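A minimal sketch of the trapezoidal example above (the forecast and observation values are illustrative assumptions):

import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

fcst = xr.DataArray([1.0, 3.0, 6.0], dims=["time"], coords={"time": [0, 1, 2]})
obs = xr.DataArray([0.0, 5.0, 9.0], dims=["time"], coords={"time": [0, 1, 2]})

# Threshold weight rises linearly from 0 to 1 over [-2, 2), is 1 over [2, 4),
# falls linearly from 1 to 0 over [4, 10], and is 0 elsewhere
trapezoidal = tw_squared_error(
    fcst, obs, interval_where_one=(2, 4), interval_where_positive=(-2, 10)
)
print(trapezoidal.item())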
The implementation of these threshold weighted scores in scores is based on Taggart, R. (2022).
Example 1 - Rectangular Threshold Weighting#
Suppose that two forecast systems are issuing forecasts for the mean (i.e. expected value) rainfall accumulation. We want to evaluate the performance for decision thresholds above 40mm. Let’s generate the same synthetic data as in the Consistent Scores tutorial and calculate the twMSE.
[1]:
from scores.continuous import (
mse,
murphy_score,
murphy_thetas,
tw_absolute_error,
tw_expectile_score,
tw_huber_loss,
tw_quantile_score,
tw_squared_error,
)
from scipy.stats import skewnorm
import numpy as np
import xarray as xr
import plotly.graph_objects as go
from plotly.subplots import make_subplots
np.random.seed(100)
[2]:
# Generate some synthetic rainfall observations between 0 and 50mm
N = 1000
obs = xr.DataArray(data=50 * np.random.random(N), dims=["time"], coords={"time": np.arange(0, N)})
obs = obs.clip(min=0) # don't allow negative rainfall
# Generate synthetic forecasts by adding noise to each observation
fcst1 = 0.9 * obs + skewnorm.rvs(4, size=N) # fcst1 has a low bias
fcst1 = fcst1.clip(min=0) # don't allow negative rainfall
fcst2 = 1.1 * obs - skewnorm.rvs(4, size=N) # fcst2 has a high bias
fcst2 = fcst2.clip(min=0) # don't allow negative rainfall
[3]:
# First, if we calculate the MSE of fcst1 and fcst2, we will see that they have similar predictive performance.
print(f"fcst1 MSE = {mse(fcst1, obs).item()}")
print(f"fcst2 MSE = {mse(fcst2, obs).item()}")
fcst1 MSE = 5.397562928167134
fcst2 MSE = 5.3763094346565685
The MSE values of the two forecasts are similar.
Let's now apply a rectangular threshold weight that gives a weight of 1 to decision thresholds above 40mm and a weight of 0 to decision thresholds below 40mm, and calculate the twMSE.
[4]:
fcst1_tw_mse_upper = tw_squared_error(fcst1, obs, interval_where_one=(40, np.inf))
fcst2_tw_mse_upper = tw_squared_error(fcst2, obs, interval_where_one=(40, np.inf))
print(f"fcst1 twMSE = {fcst1_tw_mse_upper.item()}")
print(f"fcst2 twMSE = {fcst2_tw_mse_upper.item()}")
fcst1 twMSE = 2.2395142870695333
fcst2 twMSE = 3.2119291100148106
This time fcst1 scores noticeably better than fcst2, even though their unweighted MSE values were similar. Why is this? We can visualise what's going on with a Murphy Diagram.
[5]:
# Generate a list of thresholds of interest
thetas = murphy_thetas([fcst1, fcst2], obs, "expectile")
# Calculate the average elementary score for the mean (0.5 expectile) for each threshold theta
ms1 = 2 * murphy_score(fcst1, obs, thetas, functional="expectile", alpha=0.5)
ms2 = 2 * murphy_score(fcst2, obs, thetas, functional="expectile", alpha=0.5)
fig = make_subplots(rows=2, cols=1)
fig.add_trace(
go.Scatter(x=ms1.theta, y=ms1.total, mode="lines", name="fcst1", line=dict(color="#1b9e77")), row=1, col=1
)
fig.add_trace(
go.Scatter(
x=ms1.theta,
y=ms1.total.where(ms1.theta >= 40),
mode="lines",
line=dict(color="rgba(27,158,119, 1)"),
fillcolor="rgba(27,158,119, 0.5)",
fill="tozeroy",
showlegend=False,
),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(x=ms2.theta, y=ms2.total, mode="lines", name="fcst2", line=dict(color="#7570b3")), row=1, col=1
)
fig.add_trace(
go.Scatter(
x=ms2.theta,
y=ms2.total.where(ms2.theta >= 40),
mode="lines",
fill="tozeroy",
line=dict(color="#7570b3"),
fillcolor="rgba(117,112,179, 0.5)",
showlegend=False,
),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(x=[0, 40, 40, 55], y=[0, 0, 1, 1], mode="lines", name="threshold weight", line=dict(color="black")),
row=2,
col=1,
)
fig.update_layout(
xaxis_title="Rainfall (mm)",
yaxis_title="Economic Regret",
width=800,
height=600,
legend=dict(x=0.01, y=0.99, xanchor="left", yanchor="top"),
margin=dict(l=50, r=20, b=20, t=20),
)
fig.update_yaxes(title_text="Threshold Weight", row=2, col=1)
fig.update_xaxes(title_text="Rainfall (mm)", row=2, col=1)
fig