The Risk Matrix Score#
CAUTION: The risk matrix score is a novel metric that is still undergoing mathematical peer review. The implementation may change in line with the peer review process.
The risk matrix score provides a consistent way of scoring forecasts and warnings within a risk matrix framework (Taggart & Wilke 2024). A lower score is better.
In this tutorial we will illustrate the risk matrix framework using a slimmed down version of the synthetic case study from Taggart & Wilke (2024). Consider a simple warning service for hazardous outdoor heat with warning levels “Nil”, “Yellow Warning”, “Orange Warning” and “Red Warning”. The warning level that is issued depends on a combination of certainty and severity, as depicted in the matrix below.
The certainty categories are defined by probability thresholds (0.1, 0.3, 0.5) and the three nested severity categories (MOD+, SEV+ and EXT) are defined by three temperature thresholds (35, 37, 40) in degrees Celsius. In this example we will assume that the lower probability threshold is included in the certainty category (e.g. “possible” corresponds to probabilities \(p\) satisfying \(0.1 \leq p < 0.3\)).
A forecaster makes a risk assessment by selecting one certainty category for each of the nested severity categories, according to their probability distribution of daily maximum temperature. This is equivalent to selecting one cell in each of the three right-most columns in the risk matrix. The warning level is determined by the highest colour of the selected cells. For example, if a forecaster selects “likely” for MOD+ (yellow), “possible” for SEV+ (yellow) and “possible” for EXT (orange) then this generates an Orange Warning.
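To make this mapping concrete, here is a minimal sketch of how exceedance probabilities for the three nested severity categories translate into a warning level. The helper functions below are purely illustrative (they are not part of the scores package); the scaling array matches the colouring of the risk matrix illustrated above.

```python
import numpy as np

# Rows are certainty categories (highest certainty first), columns are
# [below MOD, MOD+, SEV+, EXT]; values 0-3 are Nil/Yellow/Orange/Red.
SCALING = np.array([
    [0, 1, 2, 3],  # p >= 0.5
    [0, 1, 2, 2],  # 0.3 <= p < 0.5
    [0, 0, 1, 2],  # 0.1 <= p < 0.3
    [0, 0, 0, 0],  # p < 0.1
])
PROB_THRESHOLDS = [0.5, 0.3, 0.1]
LEVELS = ["Nil", "Yellow Warning", "Orange Warning", "Red Warning"]

def certainty_row(p):
    """Return the row index of the certainty category containing p."""
    for row, threshold in enumerate(PROB_THRESHOLDS):
        if p >= threshold:
            return row
    return 3  # p < 0.1

def warning_level(p_mod, p_sev, p_ext):
    """The warning level is the highest colour among the selected cells."""
    colours = [
        SCALING[certainty_row(p_mod), 1],  # MOD+ column
        SCALING[certainty_row(p_sev), 2],  # SEV+ column
        SCALING[certainty_row(p_ext), 3],  # EXT column
    ]
    return LEVELS[max(colours)]

# "likely" MOD+, "possible" SEV+, "possible" EXT -> Orange Warning
print(warning_level(0.4, 0.2, 0.15))
```

This reproduces the worked example in the paragraph above: yellow, yellow and orange cells combine to an Orange Warning.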
Risk matrix scores from a synthetic experiment#
We now set up a synthetic experiment which will generate forecasts from three different forecasters and corresponding observations. We will score the forecasters using the risk matrix score in two different ways: (1) use a choice of decision weights that evaluates how they fill out the risk matrix, and (2) use a choice of decision weights that evaluates the warning levels they issue.
Let \(N(\mu,\sigma^2)\) denote the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Suppose that each observed daily maximum temperature \(y\) is generated from random variables \(y_1\), \(y_2\) and \(y_3\) using the formula \(y = y_1 +y_2 +y_3\), where each random variable \(y_i\) is independent of the others, \(y_1 \sim N(20,10^2)\), \(y_2 \sim N(0,5^2)\) and \(y_3 \sim N(0,2^2)\). The random variables \(y_1\), \(y_2\) and \(y_3\) can be thought to represent variability from seasonal, synoptic scale and mesoscale weather processes respectively. The climatic distribution of maximum temperature is \(N(20,10^2+5^2+2^2)\). The climatic probability of exceeding 35°C, 37°C and 40°C on any given randomly selected day is therefore 0.093, 0.067 and 0.039.
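The quoted climatic exceedance probabilities can be checked directly from the climatic distribution:

```python
import numpy as np
from scipy.stats import norm

# Maximum temperature is climatically N(20, 10^2 + 5^2 + 2^2)
climatic_sd = np.sqrt(10 ** 2 + 5 ** 2 + 2 ** 2)

# probability of exceeding each temperature threshold
probs = 1 - norm.cdf([35, 37, 40], loc=20, scale=climatic_sd)
print(np.round(probs, 3))  # [0.093 0.067 0.039]
```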
We now consider three forecasters named NeverWarnNate, SeasonalSam and SynopticSally. NeverWarnNate only knows about the climatic distribution \(N(20,10^2+5^2+2^2)\) and uses this to fill out the risk matrix. SeasonalSam has access to seasonal information \(y_1\) but not synoptic or mesoscale information forecasts. He uses the probability distribution \(N(y_1,5^2 +2^2)\) to fill out the risk matrix. SynopticSally has access to seasonal information \(y_1\) and synoptic information \(y_2\) but not mesoscale information. She uses the probability distribution \(N(y_1 +y_2,2^2)\) to fill out the risk matrix.
This experiment is repeated independently 10000 times.
We will now implement this in Python. To calculate the risk matrix score:

- the observations need to be a 1 if the observation lies in the severity category and a 0 if not;
- the forecasts need to be in the form of a probability that the observation lies in the severity category.
Let’s start by generating the random variables and observations.
[1]:
import numpy as np
import xarray as xr
from scipy.stats import norm
# the number of forecast cases
n_fcst_cases = 10000
# temperature thresholds
temp_thresholds = [35, 37, 40]
# probability thresholds
prob_thresholds = [0.1, 0.3, 0.5]
# severity coords
sev_coords = ["MOD+", "SEV+", "EXT"]
# generate y1, y2 and y3
y1 = xr.DataArray(
data=norm.rvs(loc=20, scale=10, size=n_fcst_cases),
dims=['fcst_case'],
coords={'fcst_case': np.arange(n_fcst_cases)}
)
y2 = xr.DataArray(
data=norm.rvs(loc=0, scale=5, size=n_fcst_cases),
dims=['fcst_case'],
coords={'fcst_case': np.arange(n_fcst_cases)}
)
y3 = xr.DataArray(
data=norm.rvs(loc=0, scale=2, size=n_fcst_cases),
dims=['fcst_case'],
coords={'fcst_case': np.arange(n_fcst_cases)}
)
# generate broadcast arrays for the random variables
da_temp_thresholds = xr.DataArray(
data=temp_thresholds,
dims=["severity"],
coords={"severity": sev_coords},
)
y1, y2, y3, da_temp_thresholds = xr.broadcast(y1, y2, y3, da_temp_thresholds)
# calculate the observations in degrees Celsius
obs = y1 + y2 + y3
# we need to convert this to categorical observations: 1 if obs is in the severity category and 0 if not
obs = (obs > da_temp_thresholds).astype(int)
obs
[1]:
<xarray.DataArray (fcst_case: 10000, severity: 3)> Size: 240kB
array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       ...,
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])
Coordinates:
  * fcst_case  (fcst_case) int64 80kB 0 1 2 3 4 5 ... 9995 9996 9997 9998 9999
  * severity   (severity) <U4 48B 'MOD+' 'SEV+' 'EXT'
Now we generate the forecasts.
[2]:
# calculate SynopticSally's probability of exceedance forecasts
sally = xr.full_like(y1, np.nan)
sally.values = 1 - norm.cdf(da_temp_thresholds, loc=y1 + y2, scale=2)
# calculate SeasonalSam's probability of exceedance forecasts
sam = xr.full_like(y1, np.nan)
sam.values = 1 - norm.cdf(da_temp_thresholds, loc=y1, scale=np.sqrt(5 ** 2 + 2 ** 2))
# calculate NeverWarnNate's probability of exceedance forecasts, using the climatic mean of 20
nate = xr.full_like(y1, np.nan)
nate.values = 1 - norm.cdf(da_temp_thresholds, loc=20, scale=np.sqrt(10 ** 2 + 5 ** 2 + 2 ** 2))
# we'll combine the forecasts into one data set
fcsts = xr.Dataset({
"SynopticSally": sally,
"SeasonalSam": sam,
"NeverWarnNate": nate,
})
fcsts
[2]:
<xarray.Dataset> Size: 800kB
Dimensions:        (fcst_case: 10000, severity: 3)
Coordinates:
  * fcst_case      (fcst_case) int64 80kB 0 1 2 3 4 ... 9995 9996 9997 9998 9999
  * severity       (severity) <U4 48B 'MOD+' 'SEV+' 'EXT'
Data variables:
    SynopticSally  (fcst_case, severity) float64 240kB 0.0 0.0 0.0 ... 0.0 0.0
    SeasonalSam    (fcst_case, severity) float64 240kB 0.004026 ... 5.059e-12
    NeverWarnNate  (fcst_case, severity) float64 240kB 0.0933 ... 0.03913
Scoring the forecasters#
We now score the forecasters using the risk matrix score. The risk matrix score requires us to specify weights for each decision point in the matrix. There are 9 decision points for this particular risk matrix, each point depending on a combination of the nested severity category and probability threshold.
We’ll consider two sets of weights. The first is where we apply a weight of 1 to each decision point. The method scores.emerging.matrix_weights_to_array
will generate the xarray weight array for us.
[3]:
from scores.emerging import risk_matrix_score, matrix_weights_to_array, weights_from_warning_scaling
uniform_weight_matrix = np.array([
[1, 1, 1], # weights for the 0.5 probability threshold
[1, 1, 1], # weights for the 0.3 probability threshold
[1, 1, 1], # weights for the 0.1 probability threshold
])
uniform_weight_matrix = matrix_weights_to_array(uniform_weight_matrix, "severity", sev_coords, "prob_threshold", prob_thresholds)
uniform_weight_matrix
[3]:
<xarray.DataArray (prob_threshold: 3, severity: 3)> Size: 72B
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])
Coordinates:
  * prob_threshold  (prob_threshold) float64 24B 0.5 0.3 0.1
  * severity        (severity) <U4 48B 'MOD+' 'SEV+' 'EXT'
We now calculate the risk matrix score, averaged across all forecast cases. We use the threshold_assignment="lower"
default because certainty categories include the lower probability threshold.
[4]:
risk_matrix_score(fcsts, obs, uniform_weight_matrix, "severity", "prob_threshold", reduce_dims="all")
[4]:
<xarray.Dataset> Size: 24B
Dimensions:        ()
Data variables:
    SynopticSally  float64 8B 0.0619
    SeasonalSam    float64 8B 0.1861
    NeverWarnNate  float64 8B 0.4129
A lower score is better, so SynopticSally has done best overall, followed by SeasonalSam then NeverWarnNate. This is to be expected because SynopticSally makes predictions based on better information than the other forecasters.
We now calculate the risk matrix score, but where we only put weights on decision points that affect the final warning level outcome. To do this, we need to generate the weight matrix based on the warning scaling (i.e., the colouring scheme in the risk matrix) and on supplied assessment weights. Assessment weights indicate the importance of discriminating between the different warning level decision points. We are going to choose assessment weights of (1, 2, 3), which puts a weight of

- 1 for accurately discriminating between Nil Warning and Yellow Warning (or higher);
- 2 for accurately discriminating between Yellow Warning (or lower) and Orange Warning (or higher);
- 3 for accurately discriminating between Orange Warning (or lower) and Red Warning.
The method scores.emerging.weights_from_warning_scaling
can be used to generate the risk matrix decision weights based on the warning scaling and assessment weights.
[5]:
# the warning scaling is an array of values from 0 to 3 where
# 0 = Nil warning
# 1 = Yellow warning
# 2 = Orange warning
# 3 = Red warning
# and where the numbers match the colours in the risk matrix illustrated above.
warning_scaling = np.array([
[0, 1, 2, 3],
[0, 1, 2, 2],
[0, 0, 1, 2],
[0, 0, 0, 0],
])
warning_weight_matrix = weights_from_warning_scaling(
warning_scaling, [1, 2, 3], "severity", sev_coords, "prob_threshold", prob_thresholds
)
warning_weight_matrix
[5]:
<xarray.DataArray (prob_threshold: 3, severity: 3)> Size: 72B
array([[0., 0., 3.],
       [1., 2., 0.],
       [0., 1., 2.]])
Coordinates:
  * prob_threshold  (prob_threshold) float64 24B 0.5 0.3 0.1
  * severity        (severity) <U4 48B 'MOD+' 'SEV+' 'EXT'
Note that the weight for the (MOD+, 0.5) decision point is 0. This is because the warning level doesn’t change at this decision point (see graphic at the top of the tutorial). That is, if one is only interested in the final warning level outcome, discriminating between probabilities above and below the 0.5 threshold for the MOD+ severity category is inconsequential. The other decision points that have a weight of zero are (MOD+, 0.1), (SEV+, 0.5) and (EXT, 0.3).
We now calculate the risk matrix score when emphasis is placed on the final warning decision.
[6]:
risk_matrix_score(fcsts, obs, warning_weight_matrix, "severity", "prob_threshold", reduce_dims="all")
[6]:
<xarray.Dataset> Size: 24B
Dimensions:        ()
Data variables:
    SynopticSally  float64 8B 0.05471
    SeasonalSam    float64 8B 0.1648
    NeverWarnNate  float64 8B 0.3465
Again, SynopticSally performs best.
Things to try next#
Try adding two new forecasters to the mix. The first new forecaster doubles SynopticSally’s probability forecasts (clipped at 1), while the second new forecaster halves SynopticSally’s probability forecasts. How do you think these new forecasters will rank (using the risk matrix score) compared with SeasonalSam? Can you find two different decision_weights
that result in different rankings among the forecasters?
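One way to construct the two new forecasters is sketched below, using a small stand-in array in place of SynopticSally’s forecasts (in the experiment above you would apply the same operations to `fcsts["SynopticSally"]`; the forecaster names here are our own invention):

```python
import numpy as np
import xarray as xr

# stand-in for SynopticSally's probability of exceedance forecasts
sally = xr.DataArray(np.array([0.2, 0.6, 0.9]), dims=["fcst_case"])

# first new forecaster: doubled probabilities, clipped at 1
doubler = (2 * sally).clip(max=1)

# second new forecaster: halved probabilities
halver = sally / 2

print(doubler.values)  # [0.4 1.  1. ]
print(halver.values)   # [0.1 0.3 0.45]
```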
Reference#
Taggart, R. J. & Wilke, D. J. (2024). Warnings based on risk matrices: a coherent framework with consistent evaluation. In preparation.