
The FIxed Risk Multicategorical (FIRM) Score#

The FIRM score is a verification framework for scoring multicategorical forecasts and warnings (Taggart et al. 2022). The framework is tied to pre-defined risk thresholds (e.g., forecast the highest category for which the probability of observing that category or higher exceeds \(X\)%).

FIRM factors in the costs and losses associated with being on the wrong side of a decision threshold, and is therefore aligned to forecast value (where a missed event is typically more costly than a false alarm). This is in contrast to most other verification scores for multicategorical forecasts, where the optimal probability/risk threshold depends on the sample climatological frequency. Many of these other verification scores encourage excessive over-forecasting of rare events, causing many false alarms (particularly for longer lead days) and eroding users’ confidence in the forecasts.

Lower FIRM scores are better (zero is best).

How does the FIRM framework work?#

A user needs to specify the following:

  1. An increasing sequence of category thresholds

  2. The weights for the relative importance of each decision threshold

  3. The risk threshold, a value between 0 and 1 that reflects the cost of a miss relative to the cost of a false alarm (higher values mean that misses are penalised more heavily than false alarms)

  4. Optional: a discount distance parameter, which reduces the penalty for near misses and near false alarms

Example#

Let’s say that we have a wind warning service with three categories: no warning, gale warning and storm warning. This means that our category thresholds are [34, 48], where the values correspond to wind magnitude forecasts (knots). It is twice as important to get the gale warning vs no warning decision correct as the storm warning vs gale warning decision, so our threshold weights are [2, 1]. The risk threshold of the service is 0.7, reflecting that a miss is costlier than a false alarm. To optimise the expected FIRM score, one must forecast the highest category for which the probability of observing that category or higher is at least 30%. This follows from \(P(\text{event}) \ge 1 - \text{risk threshold} = 1 - 0.7 = 0.3\). In this wind warning service we don’t discount the penalty for near misses or near false alarms.
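To make the decision rule concrete, here is a minimal sketch (not part of scores) that applies it to some made-up exceedance probabilities; with a risk threshold of 0.7, a category is forecast whenever its exceedance probability is at least 1 - 0.7 = 0.3.

# Hypothetical probabilities of observing each category threshold or higher (illustrative only)
prob_gale_or_higher = 0.45   # P(wind >= 34 knots)
prob_storm_or_higher = 0.20  # P(wind >= 48 knots)

risk_threshold = 0.7
required_probability = 1 - risk_threshold  # 0.3

# Forecast the highest category whose exceedance probability meets the required probability
category = "no warning"
if prob_gale_or_higher >= required_probability:
    category = "gale warning"
if prob_storm_or_higher >= required_probability:
    category = "storm warning"

print(category)  # "gale warning" for these illustrative probabilities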

To summarise our FIRM framework parameters:

  1. Our category thresholds values are [34, 48]

  2. Our threshold weights are [2, 1]

  3. Our risk threshold is 0.7

Our scoring matrix based on these parameters is

|           | None | Gale | Storm |
|-----------|------|------|-------|
| **None**  | 0    | 1.4  | 2.1   |
| **Gale**  | 0.6  | 0    | 0.7   |
| **Storm** | 0.9  | 0.3  | 0     |

where the column headers correspond to the observed category and the row headers correspond to the forecast category. We can see that:

  • There is no penalty along the diagonal if you forecast the correct category.

  • You get a bigger penalty for missing an event than for issuing a false alarm. This is due to the risk threshold being set at 0.7.

  • You get twice the penalty for forecasting no warning and observing gale-force winds (1.4) as for forecasting gales and observing storm-force winds (0.7), due to the threshold weights. These penalties sum when the forecast is out by two categories: forecasting no warning and observing storm-force winds incurs 1.4 + 0.7 = 2.1. The sketch below reproduces the full matrix from the parameters.
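The following is a minimal sketch of how the matrix entries can be reproduced. It assumes (consistent with the values above) that each decision threshold contributes weight × risk threshold for a miss and weight × (1 − risk threshold) for a false alarm, with contributions summed across thresholds.

import numpy as np

risk_threshold = 0.7
weights = [2, 1]  # weights for the 34 knot and 48 knot decision thresholds

# Penalty contributed by each decision threshold
miss = [w * risk_threshold for w in weights]               # [1.4, 0.7]
false_alarm = [w * (1 - risk_threshold) for w in weights]  # [0.6, 0.3]

# Rows are the forecast category, columns are the observed category (None, Gale, Storm)
scoring_matrix = np.array(
    [
        [0.0, miss[0], miss[0] + miss[1]],
        [false_alarm[0], 0.0, miss[1]],
        [false_alarm[0] + false_alarm[1], false_alarm[1], 0.0],
    ]
)
print(scoring_matrix)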

For a detailed explanation of the FIRM calculations and various extensions, please see

Taggart, R., Loveday, N. and Griffiths, D., 2022. A scoring framework for tiered warnings and multicategorical forecasts based on fixed risk measures. Quarterly Journal of the Royal Meteorological Society, 148, 1389-1406. https://doi.org/10.1002/qj.4266.

Python example#

Let’s verify two synthetic wind warning services based on the hypothetical service described above, where the forecast service directive is to forecast the highest category that has at least a 30% chance of occurring. Forecast A’s wind forecasts are unbiased compared to the observations. Forecast B is the same as Forecast A, except that the underlying wind values are increased by 5% before the same noise is added. We want to see if we can improve our forecast score by slightly over-warning in Forecast B.

We assume that our categorical warning service is derived by converting continuous forecasts and observations into our three categories. This conversion is handled within the implementation of FIRM in scores.

[26]:
import xarray as xr
import numpy as np
import pandas as pd

from scores.categorical import firm
from scipy.stats import norm
[82]:
# Multiplicative bias factor
BIAS_FACTOR = 1.05
[83]:
# Create observations for 100 dates
obs = 50 * np.random.random((100, 100))
obs = xr.DataArray(
    data=obs, dims=["time", "x"], coords={"time": pd.date_range("2023-01-01", "2023-04-10"), "x": np.arange(0, 100)}
)

# Create forecasts for 7 lead days
fcst = xr.DataArray(data=[1] * 7, dims="lead_day", coords={"lead_day": np.arange(1, 8)})
fcst = fcst * obs

# Create two forecasts. Forecast A has no bias compared to the observations, but
# Forecast B is biased upwards to align better with the forecast directive of
# forecasting the highest category that has at least a 30% chance of occurring
noise = 4 * norm.rvs(size=(7, 100, 100))
fcst_a = fcst + noise
fcst_b = BIAS_FACTOR * fcst + noise
[75]:
# Calculate FIRM scores for both forecasts. The positional arguments after the
# observations are the risk threshold (0.7), the category thresholds ([34, 48] knots)
# and the threshold weights ([2, 1]). preserve_dims="lead_day" gives a score per lead day
fcst_a_firm = firm(fcst_a, obs, 0.7, [34, 48], [2, 1], preserve_dims="lead_day")
fcst_b_firm = firm(fcst_b, obs, 0.7, [34, 48], [2, 1], preserve_dims="lead_day")
[80]:
# View results for lead day 1
fcst_a_firm.sel(lead_day=1)
[80]:
<xarray.Dataset>
Dimensions:                ()
Coordinates:
    lead_day               int64 1
Data variables:
    firm_score             float64 0.08464
    overforecast_penalty   float64 0.02808
    underforecast_penalty  float64 0.05656
[81]:
fcst_b_firm.sel(lead_day=1)
[81]:
<xarray.Dataset>
Dimensions:                ()
Coordinates:
    lead_day               int64 1
Data variables:
    firm_score             float64 0.07648
    overforecast_penalty   float64 0.04722
    underforecast_penalty  float64 0.02926

We can see that the FIRM score for Forecast B is lower (better) than for Forecast A. This is because we biased our forecast up slightly to align better with the service directive of “forecast the highest category that has at least a 30% chance of occurring”. Although the number of false alarms increased for Forecast B, the FIRM score improved because of the reduction in (costlier) missed events.

The output also includes the overforecast and underforecast penalties, which show the relative contributions of over- and under-forecasting to the overall FIRM score. This decomposition can be helpful when adjusting a forecast strategy.
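For example, here is a minimal sketch (using the datasets computed above) that shows, for every lead day, the fraction of each forecast’s FIRM score that comes from missed events.

# Fraction of the FIRM score attributable to underforecasting (missed events), per lead day
for name, score in [("Forecast A", fcst_a_firm), ("Forecast B", fcst_b_firm)]:
    miss_fraction = score["underforecast_penalty"] / score["firm_score"]
    print(name, miss_fraction.round(2).values)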

Further extensions#

  • Instead of passing in a list of floats for the categorical thresholds argument, pass in a list of xr.DataArrays where the thresholds vary spatially and are based on climatological values (see the sketch after this list)

  • Test the impact of varying the risk threshold

  • Solve for the optimal way to bias the forecasts to optimise the verification score

  • Test the impact of the discount distance parameter

  • Explore some of the extensions in the FIRM paper
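As a starting point for the first extension, here is a minimal sketch using the forecasts and observations from above. It assumes the categorical thresholds argument accepts xr.DataArrays that broadcast against the forecasts, and the threshold values themselves are illustrative rather than real climatological values.

# Illustrative spatially varying thresholds (not real climatological values)
gale_threshold = xr.DataArray(34 + 2 * np.sin(np.arange(100) / 10), dims=["x"], coords={"x": np.arange(100)})
storm_threshold = gale_threshold + 14  # keep the thresholds an increasing sequence everywhere

firm(fcst_a, obs, 0.7, [gale_threshold, storm_threshold], [2, 1], preserve_dims="lead_day")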