Interactive online version: Binder badge. Download notebook.

Threshold weighted scores#

Introduction#

For a long time, people have wanted to be able to evaluate the performance of forecasts of extreme events using verification methods that can’t be hedged. Such evaluation brings together two concepts - avoidance of hedging, particularly of probabilities, and also a focus on extreme conditions (e.g. very hot conditions).

Hedging occurs when a forecast is optimised for a verification metric rather than for a forecast directive (see Murphy (1978)). In the context of an automated system, hedging occurs when the model or system is designed or trained according to a verification metric which is misaligned to the forecast directive.

An example of a forecast directive is to predict the 90th percentile wind speed. If the 90th percentile forecast wind speed forecasts were evaluated against observed wind speeds using mean absolute error (MAE), the forecaster or forecast system developer is faced with the dilemma of whether to produce forecasts that are consistent with the forecast directive or if they should be produced to optimise the expected MAE (since MAE optimises for the 50th percentile wind speed).

Hedging is of greatest concern when considering the probabilities of extreme events on which significant decisions are made. These tensions can be removed by using a consistent verification score which will not penalise the forecaster or model designer for meeting the forecast directive - i.e. these scores do not allow hedging (see the Consistent Scores tutorial).

Threshold-weighted consistent scoring functions also allow us to measure the performance of forecasts with a focus on a range of decision thresholds while still preventing hedging. An example of someone acting on a decision threshold is a manager deciding to shut the roof of a stadium if the forecast temperature is expected to exceed 35°C. Threshold-weighted scores allow us to measure the performance of predicting such extremes or some other range of decision thresholds of interest.

Threshold-weighted scores contrast with metrics like mean squared error (MSE) and mean absolute error (MAE), which can be interpreted as evaluating performance when equal weight is placed on all decision thresholds (Ehm 2016).

As another example, suppose health agencies will start taking preventative action for treating heat-related illness whenever the temperature forecast for the following day exceeds 35°C and that this preventative action is increasingly important with even higher temperature forecasts. If the temperature forecast is 20°C and the observed temperature is 10°C, then health agencies will have made the correct decision to not take preventative action, despite the large magnitude in forecast error. In this case, the MSE would be 100°C2. However, if we calculate a threshold weighted MSE (twMSE) that places an equal weight on decision thresholds above 35°C and a zero weight on decision thresholds below 35°C, the twMSE will give a value of 0 in this case. If at least one of the forecast value and the observed value is above 35°C and they differ from each other, then there will be a non-zero error.

The arguments for threshold-weighting (interval_where_one and interval_where_positive) are separate and different from the weights argument that appears in most metrics in scores. weights of the normal kind can still be used in the threshold weighted scores. The weights argument weights the errors at each point before calculating the mean score. Examples including weighting the results based on latitude to account for grid size or by weighting results based on population density. For more information see the weighting results tutorial.

The following threshold weighted scores are available in scores:

  • scores.continuous.tw_squared_error

  • scores.continuous.tw_absolute_error

  • scores.continuous.tw_quantile_score

  • scores.continuous.tw_expectile_score

  • scores.continuous.tw_huber_loss

There are two types of threshold weighting that are supported: rectangular and trapezoidal. More flexible threshold weighting schemes are discussed at the end of this tutorial.

Rectangular Threshold Weighting#

To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 ≤ threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf. Only a single interval can be provided. See example 3 below for a demonstration of how to calculate a threshold weighted score for multiple intervals.

Trapezoidal Threshold Weighting#

To specify a trapezoidal threshold weight, first define interval_where_one which determines where there will be a threshold weight of 1 using desired endpoints. Next, determine where the threshold weights should linearly increase to the left of the rectangular threshold weight and decrease to the right of the rectangular threshold weight. This is done using the interval_where_positive argument. Consider an example where a threshold weight of 1 is applied to decision thresholds satisfying 2 ≤ threshold < 4. Additionally we want the threshold weight to increase linearly from 0 to 1 on the interval [-2, 2) and decrease linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. This can be done by setting interval_where_positive=(-2, 10) and interval_where_one=(2, 4). Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite. In example 2, we show a case where we set interval_where_positive=(30, np.inf) and interval_where_one=(40, np.inf). This corresponds to the threshold weight linearly increasing from 0 to 1 on the interval [30, 40), and a threshold weight of 1 on the interval [40, infinity). These linear increases and decrease in the threshold weighting are referred to as “trapezoidal threshold weighting”.

The implementation of these threshold weighted scores in scores is based on Taggart, R. (2022).

Example 1 - Rectangular Threshold Weighting#

Suppose that two forecast systems are issuing forecasts for the mean (i.e. expected value) rainfall accumulation. We want to evaluate the performance for decision thresholds above 40mm. Let’s generate the same synthetic data as in the Consistent Scores tutorial and calculate the twMSE.

[1]:
from scores.continuous import (
    mse,
    murphy_score,
    murphy_thetas,
    tw_absolute_error,
    tw_expectile_score,
    tw_huber_loss,
    tw_quantile_score,
    tw_squared_error,
)
from scipy.stats import skewnorm
import numpy as np
import xarray as xr
import plotly.graph_objects as go
from plotly.subplots import make_subplots

np.random.seed(100)
[2]:
# Generate some synthetic rainfall observations between 0 and 50mm
N = 1000
obs = xr.DataArray(data=50 * np.random.random(N), dims=["time"], coords={"time": np.arange(0, N)})
obs = obs.clip(min=0)  # don't allow negative rainfall

# Generate synthetic forecasts by adding noise to each observation
fcst1 = 0.9 * obs + skewnorm.rvs(4, size=N)  # fcst1 has a low bias
fcst1 = fcst1.clip(min=0)  # don't allow negative rainfall
fcst2 = 1.1 * obs - skewnorm.rvs(4, size=N)  # fcst2 has a high bias
fcst2 = fcst2.clip(min=0)  # don't allow negative rainfall
[3]:
# First if we calculate the MSE of fcst1 and fcst2 we will see that that have similar predictive performance.
print(f"fcst1 MSE = {mse(fcst1, obs).item()}")
print(f"fcst2 MSE = {mse(fcst2, obs).item()}")
fcst1 MSE = 5.397562928167134
fcst2 MSE = 5.3763094346565685

The MSE of each forecast are similar.

Let’s apply a rectangular threshold weight that gives equal weight to decision thresholds above 40mm, and gives zero weight to thresholds below 40mm and calculate the twMSE.

[4]:
fcst1_tw_mse_upper = tw_squared_error(fcst1, obs, interval_where_one=(40, np.inf))
fcst2_tw_mse_upper = tw_squared_error(fcst2, obs, interval_where_one=(40, np.inf))
print(f"fcst1 twMSE = {fcst1_tw_mse_upper.item()}")
print(f"fcst2 twMSE = {fcst2_tw_mse_upper.item()}")
fcst1 twMSE = 2.2395142870695333
fcst2 twMSE = 3.2119291100148106

Why is this? We can visualise what’s going on with a Murphy Diagram.

[5]:
# Generate a list of thresholds of interest
thetas = murphy_thetas([fcst1, fcst2], obs, "expectile")

# Calculate the average elementary score for the mean (0.5 expectile) for each threshold theta
ms1 = 2 * murphy_score(fcst1, obs, thetas, functional="expectile", alpha=0.5)
ms2 = 2 * murphy_score(fcst2, obs, thetas, functional="expectile", alpha=0.5)

fig = make_subplots(rows=2, cols=1)
fig.add_trace(
    go.Scatter(x=ms1.theta, y=ms1.total, mode="lines", name="fcst1", line=dict(color="#1b9e77")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms1.theta,
        y=ms1.total.where(ms2.theta >= 40),
        mode="lines",
        line=dict(color="rgba(27,158,119, 1)"),
        fillcolor="rgba(27,158,119, 0.5)",
        fill="tozeroy",
        showlegend=False,
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=ms2.theta, y=ms2.total, mode="lines", name="fcst2", line=dict(color="#7570b3")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms2.theta,
        y=ms2.total.where(ms2.theta >= 40),
        mode="lines",
        fill="tozeroy",
        line=dict(color="#7570b3"),
        fillcolor="rgba(117,112,179, 0.5)",
        showlegend=False,
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=[0, 40, 40, 55], y=[0, 0, 1, 1], mode="lines", name="threshold weight", line=dict(color="black")),
    row=2,
    col=1,
)
fig.update_layout(
    xaxis_title="Rainfall (mm)",
    yaxis_title="Economic Regret",
    width=800,
    height=600,
    legend=dict(x=0.01, y=0.99, xanchor="left", yanchor="top"),
    margin=dict(l=50, r=20, b=20, t=20),
)
fig.update_yaxes(title_text="Threshold Weight", row=2, col=1)
fig.update_xaxes(title_text="Rainfall (mm)", row=2, col=1)
fig
  • Our rectangular threshold weight is shown in the lower subplot. It shows a weighting of 1 for rainfall thresholds 40mm and above, and a weight of 0 otherwise.

  • The area under the lines across all decision thresholds equals the MSE (due to the scaling factor that we added to the Murphy score calculation).

  • The area under the curves where rainfall is 40mm and above is proportional to the twMSE.

  • fcst2 (purple) has a larger shaded area than fcst1 (green) and so has a lower (better) twMSE.

Let’s now apply a rectangular threshold weight with a value of 1 for all decision thresholds below 40mm, and a value of zero for all thresholds 40mm and higher.

[6]:
fcst1_tw_mse_lower = tw_squared_error(fcst1, obs, interval_where_one=(-np.inf, 40))
fcst2_tw_mse_lower = tw_squared_error(fcst2, obs, interval_where_one=(-np.inf, 40))
print(f"fcst1 twMSE = {fcst1_tw_mse_lower.item()}")
print(f"fcst2 twMSE = {fcst2_tw_mse_lower.item()}")
fcst1 twMSE = 3.1580486410975848
fcst2 twMSE = 2.1643803246417592

If we add the twMSE values thresholds above 40mm and the twMSE values for thresholds below 40mm, we can see that it equals the MSE.

[7]:
fcst1_tw_mse_upper + fcst1_tw_mse_lower
[7]:
<xarray.DataArray ()>
array(5.39756293)
[8]:
mse(fcst1, obs)
[8]:
<xarray.DataArray ()>
array(5.39756293)

Example 2 - Trapezoidal Threshold Weighting#

If we want to gradually increase the threshold weights between 30mm and 40mm, we add in some trapezoidal threshold weighting.

To do this, we add the arg interval_where_positive=(30, np.inf) to the function call.

[9]:
fcst1_tw_mse_upper = tw_squared_error(fcst1, obs, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf))
fcst2_tw_mse_upper = tw_squared_error(fcst2, obs, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf))
print(f"fcst1 twMSE = {fcst1_tw_mse_upper.item()}")
print(f"fcst2 twMSE = {fcst2_tw_mse_upper.item()}")
fcst1 twMSE = 3.2650463986170224
fcst2 twMSE = 3.90266301088566

We can also visualise this on Murphy Diagrams

[10]:
def weight_function(x):
    if x < 30:
        return 0
    elif 30 <= x <= 40:
        return (x - 30) / 10
    else:
        return 1


y_values1 = np.array([weight_function(x) for x in ms1.theta])
y_values2 = np.array([weight_function(x) for x in ms2.theta])

fig = make_subplots(rows=2, cols=1)
fig.add_trace(
    go.Scatter(x=ms1.theta, y=ms1.total, mode="lines", name="fcst1", line=dict(color="#1b9e77")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms1.theta,
        y=ms1.total * y_values1,
        mode="lines",
        line=dict(width=0),
        fillcolor="rgba(27,158,119, 0.5)",
        fill="tozeroy",
        showlegend=False,
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=ms2.theta, y=ms2.total, mode="lines", name="fcst2", line=dict(color="#7570b3")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms2.theta,
        y=ms2.total * y_values2,
        mode="lines",
        fill="tozeroy",
        line=dict(width=0),
        fillcolor="rgba(117,112,179, 0.5)",
        showlegend=False,
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=[0, 30, 40, 55], y=[0, 0, 1, 1], mode="lines", name="threshold weight", line=dict(color="black")),
    row=2,
    col=1,
)
fig.update_layout(
    xaxis_title="Rainfall (mm)",
    yaxis_title="Economic Regret",
    width=800,
    height=600,
    margin=dict(l=50, r=20, b=20, t=20),
    legend=dict(x=0.01, y=0.99, xanchor="left", yanchor="top"),
)
fig.update_yaxes(title_text="Threshold Weight", row=2, col=1)
fig.update_xaxes(title_text="Rainfall (mm)", row=2, col=1)
fig

We can see how the threshold weighting function linearly increases the area under the curve between 30 and 40mm.

Example 3 - Using rectangular threshold weights to weight both high and low extremes.#

Suppose our synthetic data is now temperature data and we want to weight decision thresholds below 4°C and above 40°C. We can calculate the threshold weighted scores twice; one for cold decision thresholds and one for hot decision thresholds and then sum those twMSE results together.

[11]:
fcst1_tw_mse_hot = tw_squared_error(fcst1, obs, interval_where_one=(40, np.inf))
fcst2_tw_mse_hot = tw_squared_error(fcst2, obs, interval_where_one=(40, np.inf))

fcst1_tw_mse_cold = tw_squared_error(fcst1, obs, interval_where_one=(-np.inf, 4))
fcst2_tw_mse_cold = tw_squared_error(fcst2, obs, interval_where_one=(-np.inf, 4))

fcst1_tw_mse = fcst1_tw_mse_hot + fcst1_tw_mse_cold
fcst2_tw_mse = fcst2_tw_mse_hot + fcst2_tw_mse_cold

print(f"fcst1 twMSE = {fcst1_tw_mse.item()}")
print(f"fcst2 twMSE = {fcst2_tw_mse.item()}")
fcst1 twMSE = 2.2835391217454557
fcst2 twMSE = 3.269698259598884

We can again visualise what’s happening with a Murphy Diagram

[12]:
# Generate a list of thresholds of interest
thetas = murphy_thetas([fcst1, fcst2], obs, "expectile")

# Calculate the average elementary score for the mean (0.5 expectile) for each threshold theta
ms1 = 2 * murphy_score(fcst1, obs, thetas, functional="expectile", alpha=0.5)
ms2 = 2 * murphy_score(fcst2, obs, thetas, functional="expectile", alpha=0.5)

fig = make_subplots(rows=2, cols=1)
fig.add_trace(
    go.Scatter(x=ms1.theta, y=ms1.total, mode="lines", name="fcst1", line=dict(color="#1b9e77")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms1.theta,
        y=xr.where((ms1.theta >= 40) | (ms1.theta < 4), ms1.total, 0),
        mode="lines",
        line=dict(color="rgba(27,158,119, 1)"),
        fillcolor="rgba(27,158,119, 0.5)",
        fill="tozeroy",
        showlegend=False,
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=ms2.theta, y=ms2.total, mode="lines", name="fcst2", line=dict(color="#7570b3")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms2.theta,
        y=xr.where((ms2.theta >= 40) | (ms2.theta < 4), ms2.total, 0),
        mode="lines",
        fill="tozeroy",
        line=dict(color="#7570b3"),
        fillcolor="rgba(117,112,179, 0.5)",
        showlegend=False,
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(
        x=[0, 4, 4, 40, 40, 55], y=[1, 1, 0, 0, 1, 1], mode="lines", name="threshold weight", line=dict(color="black")
    ),
    row=2,
    col=1,
)
fig.update_layout(
    xaxis_title="Temperature (°C)",
    yaxis_title="Economic Regret",
    width=800,
    height=600,
    legend=dict(x=0.01, y=0.99, xanchor="left", yanchor="top"),
    margin=dict(l=50, r=20, b=20, t=20),
)
fig.update_yaxes(title_text="Threshold Weight", row=2, col=1)
fig.update_xaxes(title_text="Temperature (°C)", row=2, col=1)
fig

We can see that the area under the Murphy diagram curves are shaded in for decision threshold less than 4°C and above 40°C which aligns with the threshold weight function.

Other Threshold Weighted Scores#

In this tutorial, we demonstrated how the tw_squared_error function could be used to calculate twMSE. There are other threshold weighted scores that work the same way. We show how they can be called.

Threshold Weighted Absolute Error#

[13]:
tw_absolute_error(fcst1, obs, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf))
[13]:
<xarray.DataArray ()>
array(0.91448512)

Threshold Weighted Quantile Score#

We need to use the alpha arg to specify the quantile level

[14]:
tw_quantile_score(fcst1, obs, alpha=0.9, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf))
[14]:
<xarray.DataArray ()>
array(0.8230366)

Threshold Weighted Expectile score#

We need to use the alpha arg to specify the expectile level

[15]:
tw_expectile_score(fcst1, obs, alpha=0.9, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf))
[15]:
<xarray.DataArray ()>
array(2.93854176)

Threshold Weighted Huber Loss#

We need to specify the Huber parameter arg; huber_param

[16]:
tw_huber_loss(fcst1, obs, huber_param=2, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf))
[16]:
<xarray.DataArray ()>
array(1.27710106)

Things To Try Next#

  • Play around with different interval_where_one and interval_where_positive values.

  • Rather than providing a float for interval_where_one and interval_where_positive, provide xr.DataArray objects where these values vary across a dimension(s).

  • If you require more complicated threshold weightings than rectangular and trapezoidal threshold weightings, you can use scores’ consistent scores. See the Consistent Scoring Rules tutorial.

  • Try to replicate example three using scores.continuous.consistent_expectile_score, so that the function only has to be called once to evaluate both hot and cold extremes at the same time.

Reference#

Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320.