
Brier Score#

The Brier score is the most commonly used verification metric for evaluating probabilistic forecasts of binary outcomes, such as a “chance of rainfall” forecast.

Probabilistic forecasts can be expressed as values between 0 and 1, or as an ensemble. We can use scores to evaluate both kinds of forecasts.

Using the Brier score to evaluate probabilistic forecasts (expressed as values between 0 and 1) of binary events#

Probabilistic forecasts of binary events can be expressed with values between 0 and 1, and observations are exactly 0 (event did not occur), or 1 (event occurred).

The metric is then calculated in the same way as the mean squared error (MSE) and is defined as

\[s(x,y) = (x - y)^2\]

Where

  • \(x\) is the forecast between 0 and 1, and

  • \(y\) is the observation which is either 0 or 1.

The Brier score is a strictly proper scoring rule and is negatively oriented (lower values are better): a perfect score is 0 and the worst possible score is 1.
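For example, a forecast of 0.7 scores well if the event occurs and poorly if it does not (a quick check of the formula in plain Python, separate from the scores API):

x = 0.7  # forecast probability of the event
print((x - 1) ** 2)  # event occurred (y = 1): score ≈ 0.09
print((x - 0) ** 2)  # event did not occur (y = 0): score ≈ 0.49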

[11]:
from scores.probability import brier_score
from scipy.stats import beta, binom

import numpy as np
import xarray as xr

np.random.seed(100)
[12]:
# To learn more about the implementation of the Brier score, uncomment the following
# help(brier_score)

We generate two synthetic forecasts. By design, fcst1 is a good forecast, while fcst2 is a poor forecast. We measure the difference in skill by calculating and comparing their Brier scores.

[13]:
fcst1 = beta.rvs(2, 1, size=1000)  # forecast probabilities drawn from Beta(2, 1)
obs = binom.rvs(1, fcst1)  # observations generated from fcst1, so fcst1 is well calibrated
fcst2 = beta.rvs(0.5, 1, size=1000)  # forecast probabilities unrelated to the observations
fcst1 = xr.DataArray(data=fcst1, dims="time", coords={"time": np.arange(0, 1000)})
fcst2 = xr.DataArray(data=fcst2, dims="time", coords={"time": np.arange(0, 1000)})
obs = xr.DataArray(data=obs, dims="time", coords={"time": np.arange(0, 1000)})
[14]:
brier_fcst1 = brier_score(fcst1, obs)
brier_fcst2 = brier_score(fcst2, obs)

print(f"Brier score for fcst1 = {brier_fcst1.item():.2f}")
print(f"Brier score for fcst2 = {brier_fcst2.item():.2f}")
Brier score for fcst1 = 0.17
Brier score for fcst2 = 0.43

As expected, fcst1 has the lower (better) Brier score, quantifying the degree to which it outperforms fcst2.
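Since the Brier score for probability forecasts is simply the MSE against the binary observations, we can cross-check the result by computing the MSE directly with xarray (a quick sanity check, not part of the scores API):

mse_fcst1 = ((fcst1 - obs) ** 2).mean()  # mean squared error over the time dimension
print(f"MSE for fcst1 = {mse_fcst1.item():.2f}")  # should match the Brier score reported above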

Notes#

  • If you are using the Brier score on large data with Dask, consider setting the check_args arg to False in brier_score.

  • In the future, calculation of the Brier score components will be added.

  • You may be interested in working through the Murphy Diagram tutorial which allows you to break down the performance of the Brier score based on each threshold probability.

Using the Brier score to evaluate ensemble forecasts#

scores can also be used to evaluate ensemble forecasts with the Brier score. By default, scores calculates the fair Brier score, which is defined as

\[s_{i,y} = \left(\frac{i}{m} - y\right)^2 - \frac{i(m-i)}{m^2(m-1)}\]

Where:

  • \(i\) is the number of ensemble members that predict the event

  • \(m\) is the total number of ensemble members

  • \(y\) is the observed outcome (1 if the event occurred, 0 otherwise)

It is possible to calculate the Brier score without the fair correction (by setting fair_correction=False). In this case, the Brier score is calculated as

\[s_{i,y} = \left(\frac{i}{m} - y\right)^2\]

Without the fair adjustment, the Brier score may favour ensembles whose members behave as though they were sampled from a different distribution than the observations. For more information, see Ferro (2014).
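To make the correction term concrete, here is a small worked example in plain Python (a sketch of the two formulas above, not the scores implementation):

# One timestep: m = 2 ensemble members, i = 1 member predicts the event, event observed (y = 1)
m, i, y = 2, 1, 1
unadjusted = (i / m - y) ** 2  # (0.5 - 1)^2 = 0.25
fair = unadjusted - (i * (m - i)) / (m**2 * (m - 1))  # 0.25 - 1/4 = 0.0
print(unadjusted, fair)

An ideal two-member ensemble can only issue probabilities of 0, 0.5 or 1, and the fair correction removes the penalty that this coarse resolution would otherwise incur.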

Let’s first run a synthetic experiment with 10,000 timesteps to demonstrate the impact of the fair correction using the scores.probability.brier_score_for_ensemble function. Suppose the observations are randomly drawn from the distribution \(\mathrm{Pr}(y=1) = 0.5\). We then create two ensemble forecasts. The first (which we will call fcst_a) has 2 ensemble members, where each member’s value is randomly drawn from the same distribution as the observations; it is an ideal forecast. The second ensemble (called fcst_b) has 20 ensemble members but is biased: each member’s value is randomly drawn from the distribution \(\mathrm{Pr}(y=1) = 0.4\).

Let’s now calculate the Brier score with and without the fair adjustment.

[15]:
from scores.probability import brier_score_for_ensemble

# Uncomment the following line to learn more about the ensemble Brier score
# help(brier_score_for_ensemble)

N = 10000  # number of timesteps
M1 = 2  # ensemble members in fcst_a
M2 = 20  # ensemble members in fcst_b
obs = np.random.choice([0, 1], size=N, p=[0.5, 0.5])
time = np.arange(N)
obs = xr.DataArray(obs, dims=["time"], coords={"time": time})

# fcst_a: ideal 2-member ensemble drawn from the same distribution as the observations
fcst_a = np.random.choice([0, 1], size=(M1, N), p=[0.5, 0.5])
ensemble = np.arange(M1)
fcst_a = xr.DataArray(fcst_a, dims=["ensemble", "time"], coords={"ensemble": ensemble, "time": time})

# fcst_b: biased 20-member ensemble with Pr(y=1) = 0.4
fcst_b = np.random.choice([0, 1], size=(M2, N), p=[0.6, 0.4])
ensemble = np.arange(M2)
fcst_b = xr.DataArray(fcst_b, dims=["ensemble", "time"], coords={"ensemble": ensemble, "time": time})

First we calculate the fair Brier score for both ensembles. The event_thresholds arg defines the threshold at (or above) which an event is considered to have occurred. By default, events include the exact value of the threshold.

[16]:
brier_score_for_ensemble(fcst_a, obs, event_thresholds=0.5, ensemble_member_dim="ensemble")
[16]:
<xarray.DataArray (threshold: 1)>
array([0.2532])
Coordinates:
  * threshold  (threshold) float64 0.5
[17]:
brier_score_for_ensemble(fcst_b, obs, event_thresholds=0.5, ensemble_member_dim="ensemble")
[17]:
<xarray.DataArray (threshold: 1)>
array([0.26047579])
Coordinates:
  * threshold  (threshold) float64 0.5

As expected, fcst_a had a lower (better) Brier score than fcst_b. Now, let’s calculate the Brier score without the fair adjustment.

[18]:
brier_score_for_ensemble(fcst_a, obs, event_thresholds=0.5, ensemble_member_dim="ensemble", fair_correction=False)
[18]:
<xarray.DataArray (threshold: 1)>
array([0.37765])
Coordinates:
  * threshold  (threshold) float64 0.5
[19]:
brier_score_for_ensemble(fcst_b, obs, event_thresholds=0.5, ensemble_member_dim="ensemble", fair_correction=False)
[19]:
<xarray.DataArray (threshold: 1)>
array([0.27247375])
Coordinates:
  * threshold  (threshold) float64 0.5

We can see that without the fair correction, brier_score_for_ensemble ranked the biased fcst_b well ahead of the ideal fcst_a. This shows the importance of using fair scores with ensembles to avoid misleading results!

Note: if a forecast with only one ensemble member is passed into the function, the unadjusted Brier score is returned in order to avoid a divide-by-zero error.
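For example, selecting a single member from fcst_a (a quick check of this behaviour):

# With one member, the fair correction would divide by m - 1 = 0, so the unadjusted score is returned
brier_score_for_ensemble(fcst_a.isel(ensemble=[0]), obs, event_thresholds=0.5, ensemble_member_dim="ensemble")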

Calculating the Brier score for multiple thresholds with ensembles#

You can also calculate the Brier score for multiple thresholds at once.

[20]:
brier_score_for_ensemble(fcst_a, obs, event_thresholds=[-1, 0.5, 1, 3], ensemble_member_dim="ensemble")
[20]:
<xarray.DataArray (threshold: 4)>
array([0.    , 0.2532, 0.2532, 0.    ])
Coordinates:
  * threshold  (threshold) float64 -1.0 0.5 1.0 3.0

Things to try next#

  • You can pass a list of thresholds to the event_thresholds arg if you want to calculate the Brier score for multiple thresholds, as demonstrated above.

  • You can set event_threshold_mode=operator.gt if you don’t want the exact threshold value to be counted as an event (see the sketch below).
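As a sketch of the second point, using the event_threshold_mode argument named above (output omitted here): with operator.gt and a threshold of 1, member values exactly equal to 1 no longer count as events.

import operator

# Strict inequality: member values equal to the threshold (1) are no longer counted as events,
# so every forecast probability becomes 0 for this binary ensemble
brier_score_for_ensemble(fcst_a, obs, event_thresholds=1, ensemble_member_dim="ensemble", event_threshold_mode=operator.gt)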

References#

  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1-3. https://doi.org/fp62r6

  • Ferro, C. A. T. (2014). Fair scores for ensemble forecasts. Quarterly Journal of the Royal Meteorological Society, 140(683), 1917–1923. https://doi.org/10.1002/qj.2270