API Documentation#

scores.continuous#

The functions below are imported from their implementation modules into the public scores.continuous API.

scores.continuous.additive_bias(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the additive bias, which is also sometimes called the mean error.

It is defined as

\[\text{Additive bias} = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i), \text{ where } x = \text{the forecast, and } y = \text{the observation}\]

See “Mean error” section at https://www.cawcr.gov.au/projects/verification/ for more information

Parameters:
  • fcst (DataArray | Dataset) – Forecast or predicted variables.

  • obs (DataArray | Dataset) – Observed variables.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the additive bias. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the additive bias. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | Dataset | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

Returns:

An xarray object with the additive bias of a forecast.

Return type:

DataArray | Dataset
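
A minimal usage sketch (the data values below are invented for illustration; any xarray objects with matching dimensions work the same way):

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([[3.0, 1.0], [2.0, 0.0]], dims=["time", "station"])
>>> obs = xr.DataArray([[2.0, 1.0], [4.0, 2.0]], dims=["time", "station"])
>>> bias = scores.continuous.additive_bias(fcst, obs)  # mean of (1, 0, -2, -2) = -0.75
>>> bias_per_station = scores.continuous.additive_bias(fcst, obs, reduce_dims=["time"])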

scores.continuous.mean_error(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the mean error, which is also sometimes called the additive bias.

It is defined as

\[\text{mean error} = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i), \text{ where } x = \text{the forecast, and } y = \text{the observation}\]

See “Mean error” section at https://www.cawcr.gov.au/projects/verification/ for more information

Parameters:
  • fcst (DataArray | Dataset) – Forecast or predicted variables.

  • obs (DataArray | Dataset) – Observed variables.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the mean error. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the mean error. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | Dataset | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

Returns:

An xarray object with the mean error of a forecast.

Return type:

DataArray | Dataset

scores.continuous.mae(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, is_angular=False)#

Calculates the mean absolute error from forecast and observed data.

A detailed explanation is on https://en.wikipedia.org/wiki/Mean_absolute_error

\[\frac{1}{n} \sum_{i=1}^n | \text{forecast}_i - \text{observed}_i |\]
Parameters:
  • fcst (DataArray | Dataset | Series) – Forecast or predicted variables in xarray or pandas.

  • obs (DataArray | Dataset | Series) – Observed variables in xarray or pandas.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating MAE. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating MAE. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

  • is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

By default an xarray DataArray containing a single floating point number representing the mean absolute error for the supplied data. All dimensions will be reduced.

Alternatively, an xarray structure with dimensions preserved as appropriate containing the score along reduced dimensions

Return type:

DataArray | Dataset | Series
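
An illustrative sketch with made-up values; per the signature above, pandas Series inputs are also accepted and behave the same way:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([3.0, 1.0, 4.0], dims="time")
>>> obs = xr.DataArray([2.0, 2.0, 6.0], dims="time")
>>> mae_all = scores.continuous.mae(fcst, obs)  # (1 + 1 + 2) / 3 = 1.33...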

scores.continuous.mse(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, is_angular=False)#

Calculates the mean squared error from forecast and observed data.

See “Mean squared error” section at https://www.cawcr.gov.au/projects/verification/#MSE for more information

\[\frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2\]
Parameters:
  • fcst (Union[xr.Dataset, xr.DataArray, pd.DataFrame, pd.Series]) – Forecast or predicted variables in xarray or pandas.

  • obs (Union[xr.Dataset, xr.DataArray, pd.DataFrame, pd.Series]) – Observed variables in xarray or pandas.

  • reduce_dims (Union[str, Iterable[str]]) – Optionally specify which dimensions to reduce when calculating MSE. All other dimensions will be preserved.

  • preserve_dims (Union[str, Iterable[str]]) – Optionally specify which dimensions to preserve when calculating MSE. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the squared error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

  • is_angular (bool | None) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

By default, an object containing a single floating point number representing the mean squared error for the supplied data, with all dimensions reduced.

Otherwise, an object representing the mean squared error, reduced along the relevant dimensions and weighted appropriately.

Return type:

Union[xr.Dataset, xr.DataArray, pd.DataFrame, pd.Series]
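
A small sketch (invented data) contrasting the default reduction with preserve_dims="all", which returns the squared error at each point:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([3.0, 1.0], dims="time")
>>> obs = xr.DataArray([2.0, 3.0], dims="time")
>>> mse_all = scores.continuous.mse(fcst, obs)  # (1 + 4) / 2 = 2.5
>>> sq_errors = scores.continuous.mse(fcst, obs, preserve_dims="all")  # 1.0 and 4.0 per point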

scores.continuous.quantile_score(fcst, obs, alpha, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates a score that targets alpha-quantiles. Use with alpha = 0.5 for forecasts of the median. Use with alpha = 0.9 for forecasts of the 90th percentile.

Parameters:
  • fcst (DataArray | Dataset) – array of forecasts

  • obs (DataArray | Dataset) – array of observations

  • alpha (float) – A value between 0 and 1 (exclusive)

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the quantile score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating quantile score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | Dataset | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

Returns:

A DataArray with values being the mean generalised piecewise linear (GPL) scoring function, with dimensions determined by reduce_dims or preserve_dims. If neither is supplied, the returned DataArray will have only one element, the overall mean GPL score.

Raises:

ValueError – if alpha is not between 0 and 1.

Return type:

DataArray | Dataset

Notes

\[\begin{split} gpl(x) = \begin{cases}\alpha * (-x) & x \leq 0\\ (1-\alpha) x & x > 0\end{cases}\end{split}\]
where:
  • \(\alpha\) is the targeted quantile.

  • \(x\) is the difference, fcst - obs

References

T. Gneiting, “Making and evaluating point forecasts”, J. Amer. Stat. Assoc., Vol. 106 No. 494 (June 2011), pp. 754–755, Theorem 9
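
A brief sketch with invented numbers, scoring forecasts of the 90th percentile (alpha = 0.9); the per-case penalties follow the GPL function above:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([10.0, 12.0], dims="time")
>>> obs = xr.DataArray([8.0, 15.0], dims="time")
>>> qs = scores.continuous.quantile_score(fcst, obs, alpha=0.9)
>>> # penalties: 0.1 * 2 (over-forecast) and 0.9 * 3 (under-forecast); mean = 1.45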

scores.continuous.rmse(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, is_angular=False)#

Calculates the root mean squared error (RMSE) from forecast and observed data.

A detailed explanation is on https://en.wikipedia.org/wiki/Root-mean-square_deviation

\[\sqrt{\frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2}\]
Parameters:
  • fcst (DataArray | Dataset | Series) – Forecast or predicted variables in xarray or pandas.

  • obs (DataArray | Dataset | Series) – Observed variables in xarray or pandas.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating RMSE. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating RMSE. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

  • is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

By default, an object containing a single floating point number representing the root mean squared error for the supplied data, with all dimensions reduced.

Otherwise, an object representing the root mean squared error, reduced along the relevant dimensions and weighted appropriately.

Return type:

DataArray | Dataset | Series
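
A short sketch (fabricated wind directions, in degrees) showing the effect of is_angular, which measures differences around the circle:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([350.0, 10.0], dims="time")
>>> obs = xr.DataArray([10.0, 350.0], dims="time")
>>> rmse_naive = scores.continuous.rmse(fcst, obs)  # treats 350 vs 10 as a 340 degree error
>>> rmse_angular = scores.continuous.rmse(fcst, obs, is_angular=True)  # treats 350 vs 10 as a 20 degree error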

scores.continuous.murphy_score(fcst, obs, thetas, *, functional, alpha, huber_a=None, decomposition=False, reduce_dims=None, preserve_dims=None)#

Returns the mean elementary score (Ehm et al. 2016), also known as the Murphy score, evaluated at decision thresholds specified by thetas. Optionally returns a decomposition of the score in terms of penalties for over- and under-forecasting.

Select functional="quantile" and alpha=0.5 for the median functional. Select functional="expectile" and alpha=0.5 for the mean (i.e., expectation) functional.

Consider using murphy_thetas() to generate thetas. If utilising dask, you may want to store thetas as a netCDF on disk and pass it in as an xarray object. Otherwise, very large objects may be created when fcst, obs and thetas are broadcast together.

Parameters:
  • fcst (DataArray) – Forecast numerical values.

  • obs (DataArray) – Observed numerical values.

  • thetas (Sequence[float] | DataArray) – Theta thresholds.

  • functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.

  • alpha (float) – Risk parameter (i.e. quantile or expectile level) for the functional. Must be between 0 and 1.

  • huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.

  • decomposition (bool) – True to return penalty values due to under- and over-fcst as well as the total score, False to return total score only.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Murphy score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Murphy score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the Murphy score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

Returns:

An xr.Dataset with dimensions based on the preserve_dims or reduce_dims arg as well as a “theta” dimension with values matching thetas input.

If decomposition is False, the dataset will contain a single data variable, “total”.

If decomposition is True, it will also have “underforecast” and “overforecast” data variables in addition to “total”.

Raises:
  • ValueError – If functional is not one of the expected functions.

  • ValueError – If alpha is not strictly between 0 and 1.

  • ValueError – If huber_a is not strictly greater than 0.

Return type:

Dataset

References

For mean elementary score definitions, see:
  • Theorem 1 of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154

  • Theorem 5.3 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957

scores.continuous.murphy_thetas(forecasts, obs, functional, *, huber_a=None, left_limit_delta=None)#

Return the decision thresholds (theta values) at which to evaluate elementary scores for the construction of Murphy diagrams.

This function may generate a very large number of theta thresholds if the forecast and obs data arrays are large and their values have high precision, which may lead to long computational times for Murphy diagrams. To reduce the number of thetas, users can first round forecast and obs values to an appropriate resolution.

Parameters:
  • forecasts (list[DataArray]) – Forecast values, one array per source.

  • obs (DataArray) – Observed values.

  • functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.

  • huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.

  • left_limit_delta (float | None) – Approximation of the left hand limit, used for “huber” and “expectile” functionals. Must be greater than or equal to 0. None will be treated as 0. Ideally, left_limit_delta should be small relative to the fcst and obs precision, and not greater than that precision.

Returns:

Theta thresholds to be used to compute Murphy scores.

Return type:

List[float]

Raises:
  • ValueError – If functional is not one of the expected functions.

  • ValueError – If huber_a is not strictly greater than 0.

  • ValueError – If left_limit_delta is not greater than or equal to 0.

Notes

For theta values at which to evaluate elementary scores, see

  • Corollary 2 (p.521) of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154

  • Corollary 5.6 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957
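
A sketch (toy data) combining murphy_thetas() with murphy_score(), as suggested in the murphy_score() documentation; note that functional and alpha are keyword-only:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([4.0, 7.0, 2.0], dims="time")
>>> obs = xr.DataArray([5.0, 6.0, 1.0], dims="time")
>>> thetas = scores.continuous.murphy_thetas([fcst], obs, "quantile")
>>> ms = scores.continuous.murphy_score(fcst, obs, thetas, functional="quantile", alpha=0.5)
>>> ms["total"]  # mean elementary score at each theta threshold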

scores.continuous.flip_flop_index(data: DataArray, sampling_dim: str, *, is_angular: bool = False, **selections: Iterable[int]) Dataset#
scores.continuous.flip_flop_index(data: DataArray, sampling_dim: str, *, is_angular: bool = False, **selections: None) DataArray

Calculates the Flip-Flop Index along the dimensions sampling_dim.

Parameters:
  • data – Data from which to draw subsets.

  • sampling_dim – The name of the dimension along which to calculate the Flip-Flop Index.

  • is_angular – specifies whether data is directional data (e.g. wind direction).

  • **selections – Additional keyword arguments specify subsets to draw from the dimension sampling_dim of the supplied data before calculation of the Flip-Flop Index. e.g. days123=[1, 2, 3]

Returns:

If selections are not supplied: An xarray.DataArray, the Flip-Flop Index calculated by collapsing the dimension sampling_dim.

If selections are supplied: An xarray.Dataset. Each data variable is a supplied keyword argument, and corresponds to selecting the values specified from sampling_dim of data. The Flip-Flop Index is calculated for each of these selections.

Return type:

DataArray | Dataset

Notes

\[\text{Flip-Flop Index} = \frac{1}{N-2} \left[ \left(\sum\limits_{i=1}^{N-1}|x_i - x_{i+1}|\right) - \left(\max_{j}\{x_j\} - \min_{j}\{x_j\}\right) \right]\]

Where \(N\) is the number of data points, and \(x_i\) is the \(i^{\text{th}}\) data point.

Examples

>>> data = xr.DataArray([50, 20, 40, 80], coords={'lead_day': [1, 2, 3, 4]})
>>> flip_flop_index(data, 'lead_day')
<xarray.DataArray ()>
array(15.0)
Attributes:
    sampling_dim: lead_day
>>> flip_flop_index(data, 'lead_day', days123=[1, 2, 3], all_days=[1, 2, 3, 4])
<xarray.Dataset>
Dimensions:   ()
Coordinates:
    *empty*
Data variables:
    days123   float64 20.0
    all_days  float64 15.0
Attributes:
    selections: {'days123': [1, 2, 3], 'all_days': [1, 2, 3, 4]}
    sampling_dim: lead_day
scores.continuous.flip_flop_index_proportion_exceeding(data, sampling_dim, thresholds, *, is_angular=False, preserve_dims=None, reduce_dims=None, **selections)#

Calculates the Flip-Flop Index and returns the proportion exceeding (or equal to) each of the supplied thresholds.

Parameters:
  • data (DataArray) – Data from which to draw subsets.

  • sampling_dim (str) – The name of the dimension along which to calculate the Flip-Flop Index.

  • thresholds (Iterable) – The proportion of Flip-Flop Index results equal to or exceeding these thresholds will be calculated.

  • is_angular (bool) – specifies whether data is directional data (e.g. wind direction).

  • reduce_dims (Iterable[Hashable] | None) – Dimensions to reduce.

  • preserve_dims (Iterable[Hashable] | None) – Dimensions to preserve.

  • **selections (Iterable[int]) – Additional keyword arguments specify subsets to draw from the dimension sampling_dim of the supplied data before calculation of the Flip-Flop Index. e.g. days123=[1, 2, 3]

Returns:

If selections are not supplied - An xarray.DataArray with dimensions dims + ‘threshold’. The DataArray is the proportion of the Flip-Flop Index calculated by collapsing dimension sampling_dim exceeding or equal to thresholds.

If selections are supplied - An xarray.Dataset with dimensions dims + ‘threshold’. There is a data variable for each keyword in selections, each corresponding to the Flip-Flop Index proportion exceeding for the subset of data specified by the keyword values.

Examples

>>> data = xr.DataArray(
...     [[50, 20, 40, 80], [10, 50, 10, 100], [0, 30, 20, 50]],
...     dims=['station_number', 'lead_day'],
...     coords=[[10001, 10002, 10003], [1, 2, 3, 4]]
... )
>>> flip_flop_index_proportion_exceeding(data, 'lead_day', [20])
<xarray.DataArray (threshold: 1)>
array([ 0.33333333])
Coordinates:
  * threshold  (threshold) int64 20
Attributes:
    sampling_dim: lead_day
>>> flip_flop_index_proportion_exceeding(
...     data, 'lead_day', [20], days123=[1, 2, 3], all_days=[1, 2, 3, 4]
... )
<xarray.Dataset>
Dimensions:    (threshold: 1)
Coordinates:
  * threshold  (threshold) int64 20
Data variables:
    days123    (threshold) float64 0.6667
    all_days   (threshold) float64 0.3333
Attributes:
    selections: {'days123': [1, 2, 3], 'all_days': [1, 2, 3, 4]}
    sampling_dim: lead_day
scores.continuous.correlation.pearsonr(fcst, obs, *, reduce_dims=None, preserve_dims=None)#

Calculates Pearson’s correlation coefficient between two xarray DataArrays.

\[\rho = \frac{\sum_{i=1}^{n}{(x_i - \bar{x})(y_i - \bar{y})}} {\sqrt{\sum_{i=1}^{n}{(x_i-\bar{x})^2}\sum_{i=1}^{n}{(y_i - \bar{y})^2}}}\]
where:
  • \(\rho\) = Pearson’s correlation coefficient

  • \(x_i\) = the values of x in a sample (i.e. forecast values)

  • \(\bar{x}\) = the mean value of the forecast sample

  • \(y_i\) = the values of y in a sample (i.e. observed values)

  • \(\bar{y}\) = the mean value of the observed sample

Parameters:
  • fcst (DataArray) – Forecast or predicted variables

  • obs (DataArray) – Observed variables.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Pearson’s correlation coefficient. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Pearson’s correlation coefficient. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

Returns:

An xarray object with Pearson’s correlation coefficient values

Return type:

xr.DataArray

Note

This function isn’t set up to take weights.

Reference:

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
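
A quick sketch with invented, perfectly correlated data (module path as documented above):

>>> import xarray as xr
>>> import scores.continuous.correlation
>>> fcst = xr.DataArray([1.0, 2.0, 3.0, 4.0], dims="time")
>>> obs = xr.DataArray([2.0, 4.0, 6.0, 8.0], dims="time")
>>> r = scores.continuous.correlation.pearsonr(fcst, obs)  # obs is a linear function of fcst, so r = 1.0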

scores.continuous.multiplicative_bias(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the multiplicative bias.

Most suited for forecasts that have a lower bound at 0, such as wind speed. Will return np.inf where the mean of obs across the dims to be reduced is 0. It is defined as

\[\text{Multiplicative bias} = \frac{\frac{1}{N}\sum_{i=1}^{N}x_i}{\frac{1}{N}\sum_{i=1}^{N}y_i}, \text{ where } x = \text{the forecast, and } y = \text{the observation}\]

See “(Multiplicative) bias” section at https://www.cawcr.gov.au/projects/verification/ for more information

Parameters:
  • fcst (DataArray | Dataset) – Forecast or predicted variables.

  • obs (DataArray | Dataset) – Observed variables.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the multiplicative bias. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the multiplicative bias. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | Dataset | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

Returns:

An xarray object with the multiplicative bias of a forecast.

Return type:

DataArray | Dataset
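
An illustrative sketch (toy wind-speed-like values); the result is the ratio of the forecast mean to the observed mean:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([6.0, 4.0], dims="time")
>>> obs = xr.DataArray([4.0, 4.0], dims="time")
>>> mb = scores.continuous.multiplicative_bias(fcst, obs)  # mean(fcst) / mean(obs) = 5 / 4 = 1.25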

scores.continuous.pbias(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the percent bias, which is the ratio of the additive bias to the mean observed value, multiplied by 100.

Percent bias is used for evaluating and comparing forecast accuracy across stations or datasets with varying magnitudes. By expressing the error as a percentage of the observed value, it allows for standardised comparisons, enabling assessment of forecast performance regardless of the absolute scale of values. Like scores.continuous.multiplicative_bias(), pbias will return np.inf where the mean of obs across the dims to be reduced is 0. It is defined as

\[\text{Percent bias} = 100 \cdot \frac{\sum_{i=1}^{N}(x_i - y_i)}{\sum_{i=1}^{N} y_i}\]
where:
  • \(x_i\) = the values of x in a sample (i.e. forecast values)

  • \(y_i\) = the values of y in a sample (i.e. observed values)

See “pbias” section at https://search.r-project.org/CRAN/refmans/hydroGOF/html/pbias.html for more information

Parameters:
  • fcst (DataArray | Dataset) – Forecast or predicted variables.

  • obs (DataArray | Dataset) – Observed variables.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the percent bias. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the percent bias. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | Dataset | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

Returns:

An xarray object with the percent bias of a forecast.

Return type:

DataArray | Dataset

References

  • Sorooshian, S., Duan, Q., & Gupta, V. K. (1993). Calibration of rainfall-runoff models: Application of global optimization to the Sacramento Soil Moisture Accounting Model. Water Resources Research, 29(4), 1185-1194. https://doi.org/10.1029/92WR02617

  • Alfieri, L., Pappenberger, F., Wetterhall, F., Haiden, T., Richardson, D., & Salamon, P. (2014). Evaluation of ensemble streamflow predictions in Europe. Journal of Hydrology, 517, 913-922. https://doi.org/10.1016/j.jhydrol.2014.06.035

  • Dawson, C. W., Abrahart, R. J., & See, L. M. (2007). HydroTest: A web-based toolbox of evaluation metrics for the standardised assessment of hydrological forecasts. Environmental Modelling and Software, 22(7), 1034-1052. https://doi.org/10.1016/j.envsoft.2006.06.008

  • Moriasi, D. N., Arnold, J. G., Van Liew, M. W., Bingner, R. L., Harmel, R. D., & Veith, T. L. (2007). Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Transactions of the ASABE, 50(3), 885-900. https://doi.org/10.13031/2013.23153
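
A minimal sketch with made-up streamflow-like values:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([11.0, 22.0], dims="time")
>>> obs = xr.DataArray([10.0, 20.0], dims="time")
>>> pb = scores.continuous.pbias(fcst, obs)  # 100 * (1 + 2) / (10 + 20) = 10 (percent)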

scores.continuous.kge(fcst, obs, *, reduce_dims=None, preserve_dims=None, scaling_factors=None, include_components=False)#

Calculate the Kling-Gupta Efficiency (KGE) between observed and simulated (or forecast) values.

KGE is a performance metric that decomposes the error into three components: correlation, variability, and bias. It is computed as:

\[\text{KGE} = 1 - \sqrt{\left[s_\rho \cdot (\rho - 1)\right]^2 + \left[s_\alpha \cdot (\alpha - 1)\right]^2 + \left[s_\beta \cdot (\beta - 1)\right]^2}\]
\[\alpha = \frac{\sigma_x}{\sigma_y}\]
\[\beta = \frac{\mu_x}{\mu_y}\]
where:
  • \(\rho\) = Pearson’s correlation coefficient between observed and forecast values as defined in scores.continuous.correlation.pearsonr()

  • \(\alpha\) is the ratio of the standard deviations (variability ratio)

  • \(\beta\) is the ratio of the means (bias)

  • \(x\) and \(y\) are forecast and observed values, respectively

  • \(\mu_x\) and \(\mu_y\) are the means of forecast and observed values, respectively

  • \(\sigma_x\) and \(\sigma_y\) are the standard deviations of forecast and observed values, respectively

  • \(s_\rho\), \(s_\alpha\) and \(s_\beta\) are the scaling factors for the correlation coefficient \(\rho\), the variability term \(\alpha\) and the bias term \(\beta\)

Parameters:
  • fcst (DataArray) – Forecast or predicted variables.

  • obs (DataArray) – Observed variables.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the KGE. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the KGE. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be all NaN with the same shape/dimensionality as the forecast because the standard deviation is zero for a single point.

  • scaling_factors (list[float] | ndarray | None) – A 3-element vector or list describing the weights for each term in the KGE. Defined by: scaling_factors = [\(s_\rho\), \(s_\alpha\), \(s_\beta\)] to apply to the correlation term \(\rho\), the variability term \(\alpha\) and the bias term \(\beta\) respectively. Defaults to (1.0, 1.0, 1.0). (See equation 10 in Gupta et al. (2009) for their definitions.)

  • include_components (bool) – If True, the function also returns the individual terms contributing to the KGE score.

Returns:

The Kling-Gupta Efficiency (KGE) score as an xarray DataArray.

If include_components is True, the function returns an xarray.Dataset with the following variables:

  • kge: The KGE score.

  • rho: The Pearson correlation coefficient.

  • alpha: The variability ratio.

  • beta: The bias term.

Return type:

DataArray | Dataset

Notes

  • Statistics are calculated only from values for which both observations and simulations are not null values.

  • This function isn’t set up to take weights.

  • Currently this function works only on xr.DataArray.

  • When preserve_dims is set to ‘all’, the function returns NaN, similar to the Pearson correlation coefficient calculation for a single data point because the standard deviation is zero for a single point.

References

  • Gupta, H. V., Kling, H., Yilmaz, K. K., & Martinez, G. F. (2009). Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modeling. Journal of Hydrology, 377(1-2), 80-91. https://doi.org/10.1016/j.jhydrol.2009.08.003.

  • Knoben, W. J. M., Freer, J. E., & Woods, R. A. (2019). Technical note: Inherent benchmark or not? Comparing Nash-Sutcliffe and Kling-Gupta efficiency scores. Hydrology and Earth System Sciences, 23(10), 4323-4331. https://doi.org/10.5194/hess-23-4323-2019.

Examples

>>> kge_s = kge(forecasts, obs, preserve_dims='lat')  # if data is of dimension {lat, time}, the KGE value is computed across the time dimension
>>> kge_s = kge(forecasts, obs, reduce_dims="time")  # if data is of dimension {lat, time}, reduce_dims="time" is the same as preserve_dims='lat'
>>> kge_s = kge(forecasts, obs, include_components=True)  # kge_s is a dataset of all three components and the kge value itself
>>> kge_s_weighted = kge(forecasts, obs, scaling_factors=(0.5, 1.0, 2.0))  # with scaling factors
scores.continuous.isotonic_fit(fcst, obs, *, weight=None, functional='mean', bootstraps=None, quantile_level=None, solver=None, confidence_level=0.9, min_non_nan=1, report_bootstrap_results=False)#

Calculates an isotonic regression fit to observations given forecasts as the explanatory values. Optionally returns confidence bands on the fit via bootstrapping. Forecasts and observations are scalar-valued (i.e., integers, floats or NaN).

If forecasts target the mean or a quantile functional, then the user can select that functional. Otherwise the user needs to supply a solver function

  • of one variable (if no weights are supplied), or

  • of two variables (if weights are supplied).

The solver function is used to find an optimal regression fit in each block, subject to a non-decreasing monotonicity constraint, using the pool adjacent violators (PAV) algorithm.

An example of a solver function of one variable is obs -> np.mean(obs). An example of a solver function of two variables is (obs, weight) -> np.average(obs, weights=weight).

Ties (where one forecast value has more than one observation value) are handled as follows. Forecasts are sorted in ascending order, and observations sorted to match. Where ties occur, observations are sorted in descending order within each subset of ties. Isotonic regression is then performed on this sorted observation sequence.

This implementation uses sklearn.isotonic.IsotonicRegression when functional=”mean”.

Parameters:
  • fcst (ndarray | DataArray) – np.array or xr.DataArray of forecast (or explanatory) values. Values must be float, integer or NaN.

  • obs (ndarray | DataArray) – np.array or xr.DataArray of corresponding observation (or response) values. Must be the same shape as fcst. Values must be float, integer or NaN.

  • weight (ndarray | DataArray | None) – positive weights to apply to each forecast-observation pair. Must be the same shape as fcst, or None (which is equivalent to applying equal weights).

  • functional (Literal['mean', 'quantile'] | None) – Functional that the forecasts are targeting. Either “mean” or “quantile” or None. If None then solver must be supplied. The current implementation for “quantile” does not accept weights. If weighted quantile regression is desired then the user should supply an appropriate solver.

  • bootstraps (int | None) – the number of bootstrap samples to perform for calculating the regression confidence band. Set to None if a confidence band is not required.

  • quantile_level (float | None) – the level of the quantile functional if functional=’quantile’. Must be strictly between 0 and 1.

  • solver (Callable[[ndarray, ndarray], float] | Callable[[ndarray], float] | None) – function that accepts 1D numpy array of observations and returns a float. Function values give the regression fit for each block as determined by the PAV algorithm. If weights are supplied, the function must have a second argument that accepts 1D numpy array of weights of same length as the observations.

  • confidence_level (float | None) – Confidence level of the confidence band, strictly between 0 and 1. For example, a confidence level of 0.5 will calculate the confidence band by computing the 25th and 75th percentiles of the bootstrapped samples.

  • min_non_nan (int | None) – The minimum number of non-NaN bootstrapped values to calculate confidence bands at a particular forecast value.

  • report_bootstrap_results (bool | None) – Specifies whether to keep the bootstrapped regression values in the return dictionary. Default is False, to keep the output size small.

Returns:

Dictionary with the following keys:

  • “unique_fcst_sorted”: 1D numpy array of remaining forecast values sorted in ascending order, after any NaNs from fcst, obs and weight are removed; only unique values are kept to keep the output size reasonably small.

  • “fcst_counts”: 1D numpy array of forecast counts for the unique, sorted forecast values.

  • “regression_values”: 1D numpy array of regression values corresponding to “unique_fcst_sorted” values.

  • “regression_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “regression_values”), for any supplied argument (1D numpy array) of potential forecast values.

  • “bootstrap_results”: if report_bootstrap_results=True, a 2D numpy array of bootstrapped regression values is included in the return dictionary. Each row gives the interpolated regression values from a particular bootstrapped sample, evaluated at “fcst_sorted” values. If m is the number of bootstrap samples and n = len(fcst_sorted), then it has shape (m, n). We emphasise that this array corresponds to fcst_sorted, not unique_fcst_sorted.

  • “confidence_band_lower_values”: values of the lower confidence band threshold, evaluated at “unique_fcst_sorted” values.

  • “confidence_band_upper_values”: values of the upper confidence band threshold, evaluated at “unique_fcst_sorted” values.

  • “confidence_band_lower_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “confidence_band_lower_values”), given any argument (1D numpy array) of potential forecast values.

  • “confidence_band_upper_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “confidence_band_upper_values”), given any argument (1D numpy array) of potential forecast values.

  • “confidence_band_levels”: tuple giving the quantile levels used to calculate the confidence band.

Return type:

dict

Raises:
  • ValueError – if fcst and obs are np.arrays and don’t have the same shape.

  • ValueError – if fcst and weight are np.arrays and don’t have the same shape.

  • ValueError – if fcst and obs are xarray.DataArrays and don’t have the same dimensions.

  • ValueError – if fcst and weight are xarray.DataArrays and don’t have the same dimensions.

  • ValueError – if any entries of fcst, obs or weight are not an integer, float or NaN.

  • ValueError – if there are no fcst and obs pairs after NaNs are removed.

  • ValueError – if functional is not one of “mean”, “quantile” or None.

  • ValueError – if quantile_level is not strictly between 0 and 1 when functional=”quantile”.

  • ValueError – if weight is not None when functional=”quantile”.

  • ValueError – if functional and solver are both None.

  • ValueError – if not exactly one of functional and solver is None.

  • ValueError – if bootstraps is not a positive integer.

  • ValueError – if confidence_level is not strictly between 0 and 1.

Note: This function only keeps the unique values of fcst_sorted to keep the volume of the return dictionary small. The forecast counts are also included, so users can use them to create a forecast histogram (usually displayed in reliability diagrams).

References
  • de Leeuw, Hornik and Mair. “Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods”, Journal of Statistical Software, 2009.

  • Dimitriadis, Gneiting and Jordan. “Stable reliability diagrams for probabilistic classifiers”, PNAS, Vol. 118 No. 8, 2020. Available at https://www.pnas.org/doi/10.1073/pnas.2016191118

  • Jordan, Mühlemann, and Ziegel. “Optimal solutions to the isotonic regression problem”, 2020 (version 2), available on arxiv at https://arxiv.org/abs/1904.04761
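
A small sketch (synthetic forecast-observation pairs) of a mean-functional isotonic fit; the dictionary keys accessed below are as listed in the Returns section above:

>>> import numpy as np
>>> import scores.continuous
>>> fcst = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> obs = np.array([1.2, 1.8, 3.5, 3.0, 5.5])
>>> fit = scores.continuous.isotonic_fit(fcst, obs, functional="mean")
>>> fit["regression_values"]  # non-decreasing fit at each unique forecast value
>>> fit["regression_func"](np.array([2.5]))  # interpolated fit at new forecast values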

scores.continuous.consistent_expectile_score(fcst, obs, alpha, phi, phi_prime, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the score using a scoring function that is consistent for the alpha-expectile functional, based on a supplied convex function phi. See Gneiting (2011), or Equation (10) from Taggart (2022).

\[\begin{split}S(x, y) = \begin{cases} (1 - \alpha)(\phi(y) - \phi(x) - \phi'(x)(y-x)), & y < x \\ \alpha(\phi(y) - \phi(x) - \phi'(x)(y-x)), & x \leq y \end{cases}\end{split}\]
where
  • \(x\) is the forecast

  • \(y\) is the observation

  • \(\alpha\) is the expectile level

  • \(\phi\) is a convex function of a single variable

  • \(\phi'\) is the subderivative of \(\phi\)

  • \(S(x,y)\) is the score.

Note that if \(\phi\) is differentiable then \(\phi'\) is its derivative.

The score is negatively oriented with \(0 \leq S(x,y) < \infty\).

We strongly recommend that you work through the tutorial https://scores.readthedocs.io/en/stable/tutorials/Consistent_Scores.html to help you understand how to use the consistent scoring functions in scores.

Parameters:
  • fcst (DataArray) – array of forecast values.

  • obs (DataArray) – array of corresponding observation values.

  • alpha (float) – expectile level. Must be strictly between 0 and 1.

  • phi (Callable[[DataArray], DataArray]) – a convex function on the real numbers, accepting a single array like argument.

  • phi_prime (Callable[[DataArray], DataArray]) – a subderivative of phi, accepting a single array like argument.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the consistent expectile score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the consistent expectile score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the consistent expectile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

Returns:

array of (mean) scores that is consistent for the alpha-expectile functional, with dimensions determined by reduce_dims or preserve_dims. If neither is supplied, the returned DataArray will have only one entry, the overall mean score.

Raises:

ValueError – if alpha is not strictly between 0 and 1.

Return type:

DataArray

References

  • Gneiting, T. (2011). Making and Evaluating Point Forecasts. Journal of the American Statistical Association, 106(494), 746–762. https://doi.org/10.1198/jasa.2011.r10138

  • Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306–320. https://doi.org/10.1002/qj.4206
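
A sketch (invented data) using phi(s) = s**2, whose derivative is 2s; with alpha = 0.5 this choice reduces the score to half the squared error:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([3.0, 1.0], dims="time")
>>> obs = xr.DataArray([2.0, 3.0], dims="time")
>>> score = scores.continuous.consistent_expectile_score(
...     fcst, obs, alpha=0.5, phi=lambda s: s**2, phi_prime=lambda s: 2 * s
... )  # 0.5 * mean((fcst - obs)**2) = 0.5 * 2.5 = 1.25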

scores.continuous.consistent_quantile_score(fcst, obs, alpha, g, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the score that is consistent for the alpha-quantile functional, based on nondecreasing function g. See Gneiting (2011), or Equation (8) from Taggart (2022).

\[\begin{split}S(x, y) = \begin{cases} (1 - \alpha)(g(x) - g(y)), & y < x \\ \alpha(g(y) - g(x)), & x \leq y \end{cases}\end{split}\]
where
  • \(x\) is the forecast

  • \(y\) is the observation

  • \(\alpha\) is the quantile level

  • \(g\) is a nondecreasing function of a single variable

  • \(S(x,y)\) is the score.

The score is negatively oriented with \(0 \leq S(x,y) < \infty\).

We strongly recommend that you work through the tutorial https://scores.readthedocs.io/en/stable/tutorials/Consistent_Scores.html to help you understand how to use the consistent scoring functions in scores.

Parameters:
  • fcst (DataArray) – array of forecast values.

  • obs (DataArray) – array of corresponding observation values.

  • alpha (float) – quantile level. Must be strictly between 0 and 1.

  • g (Callable[[DataArray], DataArray]) – nondecreasing function on the real numbers, accepting a single array like argument.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the consistent quantile score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the consistent quantile score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the consistent quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

Returns:

array of (mean) scores that are consistent for the alpha-quantile functional, with dimensions determined by reduce_dims or preserve_dims. If neither is supplied, the returned DataArray will have only one entry, the overall mean score.

Raises:

ValueError – if alpha is not strictly between 0 and 1.

Return type:

DataArray

References

  • Gneiting, T. (2011). Making and Evaluating Point Forecasts. Journal of the American Statistical Association, 106(494), 746–762. https://doi.org/10.1198/jasa.2011.r10138

  • Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306–320. https://doi.org/10.1002/qj.4206
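
A sketch (toy values) using the identity function for g; with alpha = 0.5 this choice gives half the mean absolute error:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([3.0, 1.0], dims="time")
>>> obs = xr.DataArray([2.0, 3.0], dims="time")
>>> score = scores.continuous.consistent_quantile_score(
...     fcst, obs, alpha=0.5, g=lambda s: s
... )  # 0.5 * mean(|fcst - obs|) = 0.5 * 1.5 = 0.75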

scores.continuous.consistent_huber_score(fcst, obs, huber_param, phi, phi_prime, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the score that is consistent for the Huber mean functional with tuning parameter huber_param, based on convex function phi. See Taggart (2022a), or Equation (11) from Taggart (2022b). See Taggart (2022b), end of Section 3.4, for the standard formula.

\[S(x,y) = \frac{1}{2}(\phi(y)-\phi(\kappa_v (x-y) + y) + \kappa_v (x-y)\phi'(x))\]
where
  • \(\kappa_v(x) = \mathrm{max}(-v, \mathrm{min}(x, v))\) is the “capping function”

  • \(v\) is the Huber parameter

  • \(x\) is the forecast

  • \(y\) is the observation

  • \(\phi\) is a convex function of a single variable

  • \(\phi'\) is the subderivative of \(\phi\)

  • \(S(x,y)\) is the score.

The score is negatively oriented with \(0 \leq S(x,y) < \infty\).

We strongly recommend that you work through the tutorial https://scores.readthedocs.io/en/stable/tutorials/Consistent_Scores.html to help you understand how to use the consistent scoring functions in scores.

Parameters:
  • fcst (DataArray) – array of forecast values.

  • obs (DataArray) – array of corresponding observation values.

  • huber_param (float) – Huber mean tuning parameter. This corresponds to the transition point between linear and quadratic loss for Huber loss. Must be positive.

  • phi (Callable[[DataArray], DataArray]) – a convex function on the real numbers, accepting a single array like argument.

  • phi_prime (Callable[[DataArray], DataArray]) – a subderivative of phi, accepting a single array like argument.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the consistent Huber score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the consistent Huber score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the consistent Huber score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

Returns:

array of (mean) scores that is consistent for the Huber mean functional, with dimensions determined by reduce_dims or preserve_dims. If neither is supplied, the returned DataArray will have only one entry, the overall mean score.

Raises:

ValueError – if huber_param <= 0.

Return type:

DataArray

References

  • Taggart, R. J. (2022a). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-ejs1957

  • Taggart, R. (2022b). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306–320. https://doi.org/10.1002/qj.4206
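
A sketch (invented values) with phi(s) = s**2 and its derivative 2s; with this choice the score reduces to the standard Huber loss with transition point huber_param:

>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([3.0, 10.0], dims="time")
>>> obs = xr.DataArray([2.0, 2.0], dims="time")
>>> score = scores.continuous.consistent_huber_score(
...     fcst, obs, huber_param=2.0, phi=lambda s: s**2, phi_prime=lambda s: 2 * s
... )  # per-case Huber losses are 0.5 * 1**2 = 0.5 and 2 * (8 - 1) = 14; mean = 7.25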

scores.continuous.tw_quantile_score(fcst, obs, alpha, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted quantile score.

For more flexible threshold weighting schemes, see scores.continuous.consistent_quantile_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.
  • To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.

  • To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:
  • fcst (DataArray) – array of forecast values.

  • obs (DataArray) – array of corresponding observation values.

  • alpha (float) – the quantile level. Must be strictly between 0 and 1.

  • interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.

  • interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_quantile_score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_quantile_score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_quantile_score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom). Note that these weights are different to threshold weighting which is done by decision threshold.

Returns:

xarray data array of the threshold weighted quantile error

Raises:
  • ValueError – if interval_where_one is not length 2.

  • ValueError – if interval_where_one is not increasing.

  • ValueError – if alpha is not in the open interval (0,1).

  • ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal weight.

Return type:

DataArray

References

Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320. https://doi.org/10.1002/qj.4206
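
A sketch (fabricated data) of a rectangular threshold weight that only rewards performance for decision thresholds at or above 10:

>>> import numpy as np
>>> import xarray as xr
>>> import scores.continuous
>>> fcst = xr.DataArray([8.0, 15.0, 3.0], dims="time")
>>> obs = xr.DataArray([12.0, 14.0, 2.0], dims="time")
>>> tw = scores.continuous.tw_quantile_score(
...     fcst, obs, alpha=0.5, interval_where_one=(10, np.inf)
... )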

scores.continuous.tw_absolute_error(fcst, obs, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted absolute error.

For more flexible threshold weighting schemes, see scores.continuous.consistent_quantile_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.
  • To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.

  • To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:
  • fcst (DataArray) – array of forecast values.

  • obs (DataArray) – array of corresponding observation values.

  • interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.

  • interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_absolute_error. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_absolute_error. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_absolute_error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom). Note that these weights are different to threshold weighting which is done by decision threshold.

Returns:

xarray data array of the threshold weighted absolute error

Raises:
  • ValueError – if interval_where_one is not length 2.

  • ValueError – if interval_where_one is not increasing.

  • ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal weight.

Return type:

DataArray

References

Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320. https://doi.org/10.1002/qj.4206
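
For example, a rectangular threshold weight that is 1 for decision thresholds at or above 35 and 0 otherwise can be applied as follows (a minimal sketch; the synthetic data values and dimension names are illustrative assumptions):

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import tw_absolute_error
>>> fcst = xr.DataArray(np.random.uniform(20, 45, size=(30, 15)), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.uniform(20, 45, size=(30, 15)), dims=['time', 'station'])
>>> tw_absolute_error(fcst, obs, interval_where_one=(35, np.inf))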

scores.continuous.tw_squared_error(fcst, obs, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted squared error.

For more flexible threshold weighting schemes, see scores.continuous.consistent_expectile_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.
  • To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.

  • To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:
  • fcst (DataArray) – array of forecast values.

  • obs (DataArray) – array of corresponding observation values.

  • interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.

  • interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_squared_error. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_squared_error. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_squared_error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom). Note that these weights are different to threshold weighting which is done by decision threshold.

Returns:

xarray data array of the threshold weighted squared error

Raises:
  • ValueError – if interval_where_one is not length 2.

  • ValueError – if interval_where_one is not increasing.

  • ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal weight.

Return type:

DataArray

References

Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320. https://doi.org/10.1002/qj.4206
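
For example, a trapezoidal threshold weight that increases linearly from 0 to 1 over [30, 35) and is 1 at or above 35 can be specified as follows (a minimal sketch; the synthetic data values and dimension names are illustrative assumptions):

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import tw_squared_error
>>> fcst = xr.DataArray(np.random.uniform(20, 45, size=(30, 15)), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.uniform(20, 45, size=(30, 15)), dims=['time', 'station'])
>>> tw_squared_error(fcst, obs, interval_where_one=(35, np.inf), interval_where_positive=(30, np.inf))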

scores.continuous.tw_huber_loss(fcst, obs, huber_param, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted Huber loss.

For more flexible threshold weighting schemes, see scores.continuous.consistent_huber_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.
  • To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.

  • To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:
  • fcst (DataArray) – array of forecast values.

  • obs (DataArray) – array of corresponding observation values.

  • huber_param (float) – the Huber transition parameter.

  • interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.

  • interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_huber_loss. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_huber_loss. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_huber_loss at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom). Note that these weights are different to threshold weighting which is done by decision threshold.

Returns:

xarray data array of the threshold weighted Huber loss

Raises:
  • ValueError – if interval_where_one is not length 2.

  • ValueError – if interval_where_one is not increasing.

  • ValueError – if alpha is not in the open interval (0,1).

  • ValueError – if huber_param is not positive.

  • ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal threshold weight.

Return type:

DataArray

References

Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320. https://doi.org/10.1002/qj.4206
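
For example, the threshold weighted Huber loss with a transition parameter of 2 and a rectangular threshold weight above 35 can be calculated as follows (a minimal sketch; the synthetic data values and dimension names are illustrative assumptions):

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import tw_huber_loss
>>> fcst = xr.DataArray(np.random.uniform(20, 45, size=(30, 15)), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.uniform(20, 45, size=(30, 15)), dims=['time', 'station'])
>>> tw_huber_loss(fcst, obs, 2, (35, np.inf))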

scores.continuous.tw_expectile_score(fcst, obs, alpha, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted expectile score.

For more flexible threshold weighting schemes, see scores.continuous.consistent_expectile_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.
  • To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.

  • To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:
  • fcst (DataArray) – array of forecast values.

  • obs (DataArray) – array of corresponding observation values.

  • alpha (float) – expectile level. Must be strictly between 0 and 1.

  • interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.

  • interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_expectile_score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_expectile_score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_expectile_score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom). Note that these weights are different to threshold weighting which is done by decision threshold.

Returns:

xarray data array of the threshold weighted expectile score

Raises:
  • ValueError – if interval_where_one is not length 2.

  • ValueError – if interval_where_one is not increasing.

  • ValueError – if alpha is not in the open interval (0,1).

  • ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal weight.

Return type:

DataArray

References

Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320. https://doi.org/10.1002/qj.4206
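
For example, the threshold weighted expectile score at expectile level 0.9 with a rectangular threshold weight above 35 can be calculated as follows (a minimal sketch; the synthetic data values and dimension names are illustrative assumptions):

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import tw_expectile_score
>>> fcst = xr.DataArray(np.random.uniform(20, 45, size=(30, 15)), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.uniform(20, 45, size=(30, 15)), dims=['time', 'station'])
>>> tw_expectile_score(fcst, obs, 0.9, (35, np.inf))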

scores.continuous.quantile_interval_score(fcst_lower_qtile, fcst_upper_qtile, obs, lower_qtile_level, upper_qtile_level, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the quantile interval score for interval forecasts. This score penalises the interval’s width and whether the observation value lies outside the interval.

\[\text{quantile interval score} = \underbrace{(q_{u} - q_{l})}_{\text{interval width penalty}} + \underbrace{\frac{1}{\alpha_l} \cdot (q_{l} - y) \cdot \mathbb{1}(y < q_{l})}_{\text{over-prediction penalty}} + \underbrace{\frac{1}{1 - \alpha_u} \cdot (y - q_{u}) \cdot \mathbb{1}(y > q_{u})}_{\text{under-prediction penalty}}\]
where
  • \(q_u\) is the forecast at the upper quantile

  • \(q_l\) is the forecast at the lower quantile

  • \(\alpha_u\) is the upper quantile level

  • \(\alpha_l\) is the lower quantile level

  • \(y\) is the observation

  • \(\mathbb{1}(condition)\) is an indicator function that is 1 when the condition is true, and 0 otherwise.

Parameters:
  • fcst_lower_qtile (DataArray) – array of forecast values at the lower quantile.

  • fcst_upper_qtile (DataArray) – array of forecast values at the upper quantile.

  • obs (DataArray) – array of observations.

  • lower_qtile_level (float) – Quantile level between 0 and 1 (exclusive) for fcst_lower_qtile.

  • upper_qtile_level (float) – Quantile level between 0 and 1 (exclusive) for fcst_upper_qtile.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the quantile interval score. All other dimensions will be preserved. As a special case, “all” will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the quantile interval score. All other dimensions will be reduced. As a special case, “all” will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom) when aggregating the mean score across dimensions. Alternatively, it could be used for masking data.

Returns:

A Dataset with the dimensions specified in dims. The dataset will have the following data variables:

  • interval_width_penalty: the interval width penalty contribution of the quantile interval score

  • overprediction_penalty: the over-prediction penalty contribution of the quantile interval score

  • underprediction_penalty: the under-prediction penalty contribution of the quantile interval score

  • total: sum of all penalties

Raises:
  • ValueError – If not (0 < lower_qtile_level < upper_qtile_level < 1).

  • ValueError – If (fcst_lower_qtile > fcst_upper_qtile).any().

Return type:

Dataset

References

Winkler, R. L. (1972). A Decision-Theoretic Approach to Interval Estimation. Journal of the American Statistical Association, 67(337), 187. https://doi.org/10.2307/2284720

Examples

Calculate the quantile interval score for forecast intervals with lower and upper quantile levels of 0.1 and 0.6, respectively.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import quantile_interval_score
>>> fcst_lower_level = xr.DataArray(np.random.uniform(10, 15, size=(30, 15)), dims=['time', 'station'])
>>> fcst_upper_level = xr.DataArray(np.random.uniform(20, 25, size=(30, 15)), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.uniform(8, 27,size=(30, 15)), dims=['time', 'station'])
>>> quantile_interval_score(fcst_lower_level, fcst_upper_level, obs, 0.1, 0.6)

scores.continuous.interval_score(fcst_lower_qtile, fcst_upper_qtile, obs, interval_range, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the interval score for interval forecasts. This function calls the scores.continuous.quantile_interval_score() function to calculate the interval score for cases where the quantile level range is symmetric.

\[\text{interval score} = \underbrace{(q_{u} - q_{l})}_{\text{interval width penalty}} + \underbrace{\frac{2}{\alpha} \cdot (q_{l} - y) \cdot \mathbb{1}(y < q_{l})}_{\text{over-prediction penalty}} + \underbrace{\frac{2}{\alpha} \cdot (y - q_{u}) \cdot \mathbb{1}(y > q_{u})}_{\text{under-prediction penalty}}\]
where
  • \(q_u\) is the forecast at the upper quantile

  • \(q_l\) is the forecast at the lower quantile

  • \(\alpha\) is the significance level, equal to \(1 - \text{interval_range}\)

  • \(y\) is the observation

  • \(\mathbb{1}(condition)\) is an indicator function that is 1 when the condition is true, and 0 otherwise.

Parameters:
  • fcst_lower_qtile (DataArray) – array of forecast values at the lower quantile.

  • fcst_upper_qtile (DataArray) – array of forecast values at the upper quantile.

  • obs (DataArray) – array of observations.

  • interval_range (float) – Range (length) of interval (e.g., 0.9 for 90% confidence level, which will result in lower quantile of 0.05 and upper quantile of 0.95). Must be strictly between 0 and 1.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the interval score. All other dimensions will be preserved. As a special case, “all” will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the interval score. All other dimensions will be reduced. As a special case, “all” will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom) when aggregating the mean score across dimensions. Alternatively, it could be used for masking data.

Returns:

A Dataset with the dimensions specified in dims. The dataset will have the following data variables:

  • interval_width_penalty: the interval width penalty contribution of the interval score

  • overprediction_penalty: the over-prediction penalty contribution of the interval score

  • underprediction_penalty: the under-prediction penalty contribution of the interval score

  • total: sum of all penalties

The lower and upper quantile levels are derived from the interval range: lower_qtile_level = (1 - interval_range) / 2 and upper_qtile_level = (1 + interval_range) / 2.

Raises:
  • ValueError – If not 0 < interval_range < 1.

  • ValueError – If (fcst_lower_qtile > fcst_upper_qtile).any().

Return type:

Dataset

References

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359-378. Section 6.2. https://doi.org/10.1198/016214506000001437

Examples

Calculate the interval score for forecast intervals with an interval range of 0.5 (i.e., lower and upper quantile levels are 0.25 and 0.75, respectively).

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import interval_score
>>> fcst_lower_level = xr.DataArray(np.random.uniform(10, 15, size=(30, 15)), dims=['time', 'station'])
>>> fcst_upper_level = xr.DataArray(np.random.uniform(20, 25, size=(30, 15)), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.uniform(8, 27,size=(30, 15)), dims=['time', 'station'])
>>> interval_score(fcst_lower_level, fcst_upper_level, obs, 0.5)

scores.probability#

scores.probability.crps_cdf(fcst, obs, *, threshold_dim='threshold', threshold_weight=None, additional_thresholds=None, propagate_nans=True, fcst_fill_method='linear', threshold_weight_fill_method='forward', integration_method='exact', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the CRPS probabilistic metric given CDF input.

Calculates the continuous ranked probability score (CRPS), or the mean CRPS over specified dimensions, given forecasts in the form of predictive cumulative distribution functions (CDFs). Can also calculate threshold-weighted versions of the CRPS by supplying a threshold_weight.

Predictive CDFs here are described by an indexed set of values rather than by closed formulae. As a result, the resolution or number of points at which the CDF is realised has an impact on the calculation of areas under and over the curve to obtain the CRPS result. The term ‘threshold’ is used to describe the dimension which is used as an index for predictive CDF values. Various techniques are used to interpolate CDF values between indexed thresholds.

Given:
  • a predictive CDF fcst indexed at thresholds by variable x

  • an observation in CDF form obs_cdf (i.e., \(\text{obs_cdf}(x) = 0\) if \(x < \text{obs}\) and 1 if \(x >= \text{obs}\))

  • a threshold_weight array indexed by variable x

The threshold-weighted CRPS is given by:
  • \(twCRPS = \int_{-\infty}^{\infty}{[\text{threshold_weight}(x) \times (\text{fcst}(x) - \text{obs_cdf}(x))^2]\text{d}x}\), over all thresholds x.

  • The usual CRPS is the threshold-weighted CRPS with \(\text{threshold_weight}(x) = 1\) for all x.

This can be decomposed into an over-forecast penalty:

\(\int_{-\infty}^{\infty}{[\text{threshold_weight}(x) \times (\text{fcst}(x) - \text{obs_cdf}(x))^2]\text{d}x}\), over all thresholds x where x >= obs

and an under-forecast penalty:

\(\int_{-\infty}^{\infty}{[\text{threshold_weight}(x) \times (\text{fcst}(x) - \text{obs_cdf}(x))^2]\text{d}x}\), over all thresholds x where x <= obs.

To obtain the components of the CRPS score, set include_components to True.

Note that there are several ways to decompose the CRPS and this decomposition differs from the one used in the scores.probability.crps_for_ensemble() function.

Note that the function crps_cdf is designed so that the obs argument contains actual observed values. crps_cdf will convert obs into CDF form in order to calculate the CRPS.

To calculate CRPS, integration is applied over the set of thresholds x taken from:
  • fcst[threshold_dim].values,

  • obs.values.

  • threshold_weight[threshold_dim].values if applicable.

  • additional_thresholds if applicable.

  • (with NaN values excluded)

There are two methods of integration:
  • “exact” gives the exact integral under the assumption that fcst is continuous and piecewise linear between its specified values, and that threshold_weight (if supplied) is piecewise constant and right-continuous between its specified values.

  • “trapz” simply uses a trapezoidal rule using the specified values, and so is an approximation of the CRPS. To get an accurate approximation, the density of threshold values can be increased by supplying additional_thresholds.

Both methods of calculating CRPS may require adding additional values to the threshold_dim dimension in fcst and (if supplied) threshold_weight. fcst_fill_method and threshold_weight_fill_method specify how fcst and threshold_weight are to be filled at these additional points.

The function crps_cdf calculates the CRPS using forecast CDF values ‘as is’. No checks are performed to ensure that CDF values in fcst are nondecreasing along threshold_dim. Checks are conducted on fcst and threshold_weight (if applicable) to ensure that coordinates are increasing along threshold_dim.

Parameters:
  • fcst (DataArray) – array of forecast CDFs, with the threshold dimension given by threshold_dim.

  • obs (DataArray) – array of observations, not in CDF form.

  • threshold_dim (str) – name of the dimension in fcst that indexes the thresholds.

  • threshold_weight (DataArray | None) – weight to be applied along threshold_dim to calculate threshold-weighted CRPS. Must contain threshold_dim as a dimension, and may also include other dimensions from fcst. If threshold_weight=None, a weight of 1 is applied everywhere, which gives the usual CRPS.

  • additional_thresholds (Iterable[float] | None) – additional thresholds values to add to fcst and (if applicable) threshold_weight prior to calculating CRPS.

  • propagate_nans (bool) – If True, propagate NaN values along threshold_dim in fcst and threshold_weight prior to calculating CRPS. This will result in CRPS being NaN for these cases. If False, NaN values in fcst and weight will be replaced, wherever possible, with non-NaN values using the fill method specified by fcst_fill_method and threshold_weight_fill_method.

  • fcst_fill_method (Literal['linear', 'step', 'forward', 'backward']) –

    how to fill values in fcst when NaNs have been introduced (by including additional thresholds) or are specified to be removed (by setting propagate_nans=False). Select one of:

    • ”linear”: use linear interpolation, then replace any leading or trailing NaNs using linear extrapolation. Afterwards, all values are clipped to the closed interval [0, 1].

    • ”step”: apply forward filling, then replace any leading NaNs with 0.

    • ”forward”: first apply forward filling, then remove any leading NaNs by back filling.

    • ”backward”: first apply back filling, then remove any trailing NaNs by forward filling.

    • (In most cases, “linear” is likely the appropriate choice.)

  • threshold_weight_fill_method (Literal['linear', 'step', 'forward', 'backward']) – how to fill values in threshold_weight when NaNs have been introduced (by including additional thresholds) or are specified to be removed (by setting propagate_nans=False). Select one of “linear”, “step”, “forward” or “backward”. If the weight function is continuous, “linear” is probably the best choice. If it is an increasing step function, “forward” may be best.

  • integration_method (str) – one of “exact” or “trapz”.

  • preserve_dims (Tuple[str]) – dimensions to preserve in the output. All other dimensions are collapsed by taking the mean.

  • reduce_dims (Tuple[str]) – dimensions to reduce in the output by taking the mean. All other dimensions are preserved.

  • weights – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

  • include_components (bool) – if True, include the under and over forecast components of the score in the returned dataset.

Returns:

The following are the produced Dataset variables:
  • ”total”: the total CRPS.

  • ”underforecast_penalty”: the under-forecast penalty contribution of the CRPS.

  • ”overforecast_penalty”: the over-forecast penalty contribution of the CRPS.

Return type:

xr.Dataset

Raises:
  • ValueError – if threshold_dim is not a dimension of fcst.

  • ValueError – if threshold_dim is not a dimension of threshold_weight when threshold_weight is not None.

  • ValueError – if threshold_dim is a dimension of obs.

  • ValueError – if dimensions of obs are not also dimensions of fcst.

  • ValueError – if dimensions of threshold_weight are not also dimensions of fcst when threshold_weight is not None.

  • ValueError – if dims is not a subset of dimensions of fcst.

  • ValueError – if fcst_fill_method is not one of ‘linear’, ‘step’, ‘forward’ or ‘backward’.

  • ValueError – if threshold_weight_fill_method is not one of ‘linear’, ‘step’, ‘forward’ or ‘backward’.

  • ValueError – if fcst[threshold_dim] has less than 2 values.

  • ValueError – if coordinates in fcst[threshold_dim] are not increasing.

  • ValueError – if threshold_weight is not None and coordinates in threshold_weight[threshold_dim] are not increasing.

  • ValueError – if threshold_weight has negative values.

References

  • Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Management Science, 22(10), 1087–1095. https://doi.org/10.1287/mnsc.22.10.1087

  • Gneiting, T., & Ranjan, R. (2011). Comparing Density Forecasts Using Threshold- and Quantile-Weighted Scoring Rules. Journal of Business & Economic Statistics, 29(3), 411–422. https://doi.org/10.1198/jbes.2010.08110
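
As a minimal sketch (not drawn from the library's own examples), the call below evaluates the CRPS for synthetic forecast CDFs indexed along a 'threshold' dimension; the data construction and dimension names are illustrative assumptions, with np.sort used only so that each CDF is nondecreasing along the threshold dimension:

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import crps_cdf
>>> thresholds = np.arange(0.0, 10.5, 0.5)
>>> fcst = xr.DataArray(
...     np.sort(np.random.rand(20, len(thresholds)), axis=1),
...     dims=['time', 'threshold'],
...     coords={'threshold': thresholds},
... )
>>> obs = xr.DataArray(np.random.uniform(0, 10, size=20), dims=['time'])
>>> crps_cdf(fcst, obs, threshold_dim='threshold')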

scores.probability.adjust_fcst_for_crps(fcst, threshold_dim, obs, *, decreasing_tolerance=0, additional_thresholds=None, fcst_fill_method='linear', integration_method='exact')#

This function takes an array of forecast cumulative distribution functions (CDFs), fcst.

If fcst does not decrease outside the specified tolerance, it is returned unchanged.

Otherwise, the CDF envelope for fcst is computed, and the CDF from among:
  • fcst,

  • the upper envelope, and

  • the lower envelope

that has the higher (i.e. worse) CRPS is returned. In the event of a tie, preference is given in the order fcst then upper.

See scores.probability.functions.cdf_envelope for details about the CDF envelope.

The use case for this is when, either due to rounding or poor forecast process, the forecast CDF fcst fails to be nondecreasing. Rather than simply replacing fcst with NaN, adjust_fcst_for_crps returns a CDF for which CRPS can be calculated, but possibly with a predictive performance cost as measured by CRPS.

Whether a CDF is decreasing outside specified tolerance is determined as follows. For each CDF in fcst, the sum of incremental decreases along the threshold dimension is calculated. For example, if the CDF values are

[0, 0.4, 0.3, 0.9, 0.88, 1]

then the sum of incremental decreases is -0.12. This CDF decreases outside specified tolerance if 0.12 > decreasing_tolerance.

The adjusted array of forecast CDFs is determined as follows:
  • any NaN values in fcst are propagated along threshold_dim so that in each case the entire CDF is NaN;

  • any CDFs in fcst that are decreasing within specified tolerance are unchanged;

  • any CDFs in fcst that are decreasing outside specified tolerance are replaced with whichever of the upper or lower CDF envelope gives the highest CRPS, unless the original values give a higher CRPS in which case original values are kept.

See scores.probability.functions.cdf_envelope for a description of the ‘CDF envelope’.

If propagating NaNs is not desired, the user may first fill NaNs in fcst using scores.probability.functions.fill_cdf.

The CRPS for each forecast case is calculated using crps, with a weight of 1.

Parameters:
  • fcst (DataArray) – DataArray of CDF values with threshold dimension threshold_dim.

  • threshold_dim (str) – name of the threshold dimension in fcst.

  • obs (DataArray) – DataArray of observations.

  • decreasing_tolerance (float) – nonnegative tolerance value.

  • additional_thresholds (Iterable[float] | None) – optional additional thresholds passed on to crps when calculating CRPS.

  • fcst_fill_method (Literal['linear', 'step', 'forward', 'backward']) – fcst fill method passed on to crps when calculating CRPS.

  • integration_method (Literal['exact', 'trapz']) – integration method passed on to crps when calculating CRPS.

Returns:

An xarray DataArray of possibly adjusted forecast CDFs, where adjustments are made to penalise CDFs that decrease outside tolerance.

Raises:
  • ValueError – If threshold_dim is not a dimension of fcst.

  • ValueError – If decreasing_tolerance is negative.

Return type:

DataArray
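
As a minimal sketch (not drawn from the library's own examples), the call below adjusts a single forecast CDF that decreases by a total of 0.12 (the same example values given above); the dimension names, observation value and tolerance are illustrative assumptions:

>>> import xarray as xr
>>> from scores.probability import adjust_fcst_for_crps
>>> fcst = xr.DataArray(
...     [[0.0, 0.4, 0.3, 0.9, 0.88, 1.0]],
...     dims=['time', 'threshold'],
...     coords={'threshold': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]},
... )
>>> obs = xr.DataArray([2.5], dims=['time'])
>>> adjust_fcst_for_crps(fcst, 'threshold', obs, decreasing_tolerance=0.05)

Since the sum of incremental decreases (0.12) exceeds the tolerance (0.05), this CDF would be replaced by whichever of its upper or lower envelope gives the higher CRPS.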

scores.probability.crps_step_threshold_weight(step_points, threshold_dim, *, threshold_values=None, steppoints_in_thresholds=True, steppoint_precision=0, weight_upper=True)#

Generates an array of weights based on DataArray step points.

Creates an array of threshold weights, which can be used to calculate threshold-weighted CRPS, based on a step function. Applies a weight of 1 when step_point >= threshold, and a weight of 0 otherwise. Zeros and ones in the output weight function can be reversed by setting weight_upper=False.

Parameters:
  • step_points (xr.DataArray) – points at which the weight function changes value from 0 to 1.

  • threshold_dim (str) – name of the threshold dimension in the returned array weight function.

  • threshold_values (Iterable[float] | None) – thresholds at which to calculate weights.

  • steppoints_in_thresholds (bool) – include step_points among the threshold_dim values.

  • steppoint_precision (float) – precision at which to round step_points prior to calculating the weight function. Select 0 for no rounding.

  • weight_upper (bool) – If true, returns a weight of 1 if step_point >= threshold, and a weight of 0 otherwise. If false, returns a weight of 0 if step_point >= threshold, and a weight of 1 otherwise.

Returns:

Zeros and ones with the dimensions in step_points and an additional threshold_dim dimension.

Return type:

(xr.DataArray)
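
As a minimal sketch (not drawn from the library's own examples), the call below builds an array of step threshold weights on a 'threshold' dimension for three stations; the step points, threshold values and dimension names are illustrative assumptions:

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import crps_step_threshold_weight
>>> step_points = xr.DataArray([10.0, 12.0, 14.0], dims=['station'])
>>> crps_step_threshold_weight(step_points, 'threshold', threshold_values=np.arange(0.0, 20.0, 1.0))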

scores.probability.crps_cdf_brier_decomposition(fcst, obs, *, threshold_dim='threshold', additional_thresholds=None, fcst_fill_method='linear', reduce_dims=None, preserve_dims=None)#

Given an array fcst of predictive CDF values indexed along threshold_dim, and an array obs of observations, calculates the mean Brier score for each index along threshold_dim. Since the mean CRPS is the integral of the mean Brier score over all thresholds, this gives a threshold decomposition of the mean CRPS.

If there are any NaNs along the threshold dimension of fcst, then NaNs are propagated along this dimension prior to calculating the decomposition. If propagating NaNs is not desired, the user may first fill NaNs in fcst using scores.probability.functions.fill_cdf.

Parameters:
  • fcst (xr.DataArray) – DataArray of CDF values with threshold dimension threshold_dim.

  • obs (xr.DataArray) – DataArray of observations, not in CDF form.

  • threshold_dim (str) – name of the threshold dimension in fcst.

  • additional_thresholds (Optional[Iterable[float]]) – additional thresholds at which to calculate the mean Brier score.

  • fcst_fill_method (Literal["linear", "step", "forward", "backward"]) –

    How to fill NaN values in fcst that arise from new user-supplied thresholds or thresholds derived from observations.

    • ”linear”: use linear interpolation, and if needed also extrapolate linearly. Clip to 0 and 1. Needs at least two non-NaN values for interpolation, so returns NaNs where this condition fails.

    • ”step”: use forward filling then set remaining leading NaNs to 0. Produces a step function CDF (i.e. piecewise constant).

    • ”forward”: use forward filling then fill any remaining leading NaNs with backward filling.

    • ”backward”: use backward filling then fill any remaining trailing NaNs with forward filling.

  • reduce_dims (Iterable[str] | None) – dimensions to reduce in the output by taking the mean. All other dimensions are preserved.

  • preserve_dims (Iterable[str] | None) – dimensions to preserve in the output. All other dimensions are reduced by taking the mean. The dimension threshold_dim is always preserved, even if not specified here.

Returns:

  • “total_penalty”: the mean Brier score for each threshold.

  • ”underforecast_penalty”: the mean of the underforecast penalties for the Brier score. For a particular forecast case, this component equals 0 if the event didn’t occur and the Brier score if it did.

  • ”overforecast_penalty”: the mean of the overforecast penalties for the Brier score. For a particular forecast case, this component equals 0 if the event did occur and the Brier score if it did not.

Return type:

An xarray Dataset with data_vars

Raises:
  • ValueError – if threshold_dim is not a dimension of fcst.

  • ValueError – if threshold_dim is a dimension of obs.

  • ValueError – if dimensions of obs are not also among the dimensions of fcst.

  • ValueError – if dimensions in dims is not among the dimensions of fcst.

  • ValueError – if fcst_fill_method is not one of ‘linear’, ‘step’, ‘forward’ or ‘backward’.

  • ValueError – if coordinates in fcst[threshold_dim] are not increasing.
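
As a minimal sketch (not drawn from the library's own examples), the call below computes the threshold decomposition of the mean CRPS for synthetic forecast CDFs; the data construction and dimension names are illustrative assumptions:

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import crps_cdf_brier_decomposition
>>> thresholds = np.arange(0.0, 10.5, 0.5)
>>> fcst = xr.DataArray(
...     np.sort(np.random.rand(20, len(thresholds)), axis=1),
...     dims=['time', 'threshold'],
...     coords={'threshold': thresholds},
... )
>>> obs = xr.DataArray(np.random.uniform(0, 10, size=20), dims=['time'])
>>> crps_cdf_brier_decomposition(fcst, obs, threshold_dim='threshold')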

scores.probability.crps_for_ensemble(fcst, obs, ensemble_member_dim, *, method='ecdf', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the continuous ranked probability score (CRPS) given an ensemble of forecasts. An ensemble of forecasts can also be thought of as a random sample from the predictive distribution.

Given an observation y, and ensemble member values \(x_i\) (for \(1 \leq i \leq M\)), the CRPS is calculated by the formula

\[CRPS(x_i, x_j, y) = \frac{\sum_{i=1}^{M}(|x_i - y|)}{M} - \frac{\sum_{i=1}^{M}\sum_{j=1}^{M}(|x_i - x_j|)}{2K}\]

where the first sum is iterated over \(1 \leq i \leq M\) and the second sum is iterated over \(1 \leq i \leq M\) and \(1 \leq j \leq M\).

The value of the constant K in this formula depends on the method:
  • If method=”ecdf” then \(K = M ^ 2\). In this case the CRPS value returned is the exact CRPS value for the empirical cumulative distribution function constructed using the ensemble values.

  • If method=”fair” then \(K = M(M - 1)\). In this case the CRPS value returned is the approximated CRPS where the ensemble values can be interpreted as a random sample from the underlying predictive distribution. This interpretation stems from the formula \(\text{CRPS}(F, y) = \mathbb{E}(|X - y|) - \frac{1}{2}\mathbb{E}(|X - X'|)\), where X and X’ are independent samples of the predictive distribution F, y is the observation (possibly unknown) and E denotes the expectation. This choice of K gives an unbiased estimate for the second expectation.

When the include_components flag is set to True, the CRPS components are calculated as

\[CRPS(x_i, x_j, y) = O(x_i, y) + U(x_i, y) - S(x_i, x_j)\]
where
  • \(O(x_i, y) = \frac{\sum_{i=1}^{M} ((x_i - y) \mathbb{1}{\{x_i > y\}})}{M}\) which is the overforecast penalty.

  • \(U(x_i, y) = \frac{\sum_{i=1}^{M} ((y - x_i) \mathbb{1}{\{x_i < y\}})}{M}\) which is the underforecast penalty.

  • \(S(x_i, x_j) = \frac{\sum_{i=1}^{M}\sum_{j=1}^{M}(|x_i - x_j|)}{2K}\) which is the forecast spread term.

Note that there are several ways to decompose the CRPS and this decomposition differs from the one used in the scores.probability.crps_cdf() function.

Parameters:
  • fcst (DataArray | Dataset) – Forecast data. Must have a dimension ensemble_member_dim.

  • obs (DataArray | Dataset) – Observation data.

  • ensemble_member_dim (str) – the dimension that specifies the ensemble member or the sample from the predictive distribution.

  • method (Literal['ecdf', 'fair']) – Either “ecdf” or “fair”.

  • reduce_dims (Sequence[str] | None) – Dimensions to reduce. Can be “all” to reduce all dimensions.

  • preserve_dims (Sequence[str] | None) – Dimensions to preserve. Can be “all” to preserve all dimensions.

  • weights (DataArray | Dataset | None) – Weights for calculating a weighted mean of individual scores.

  • include_components (bool | None) – If True, returns the CRPS with underforecast and overforecast penalties, as well as the forecast spread term (see description above).

Returns:

xarray object of (weighted mean) CRPS values.

Raises:

ValueError – when method is not one of “ecdf” or “fair”.

Return type:

DataArray | Dataset

References

    1. Ferro, C. A. T. (2014), “Fair scores for ensemble forecasts”, Quarterly Journal of the Royal Meteorological Society, 140(683):1917-1923. https://doi.org/10.1002/qj.2270

    2. Gneiting, T. and Raftery, A. E. (2007), “Strictly proper scoring rules, prediction, and estimation”, Journal of the American Statistical Association, 102(477):359-378. https://doi.org/10.1198/016214506000001437

    3. Zamo, M. and Naveau, P. (2018), “Estimation of the Continuous Ranked Probability Score with Limited Information and Applications to Ensemble Weather Forecasts”, Mathematical Geosciences, 50:209-234. https://doi.org/10.1007/s11004-017-9709-7
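
As a minimal sketch mirroring the ensemble examples elsewhere in this documentation, the call below computes the mean fair CRPS for a synthetic ensemble; the data values and dimension names are illustrative assumptions:

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import crps_for_ensemble
>>> fcst = xr.DataArray(np.random.rand(10, 10), dims=['time', 'ensemble'])
>>> obs = xr.DataArray(np.random.rand(10), dims=['time'])
>>> crps_for_ensemble(fcst, obs, 'ensemble', method='fair')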

scores.probability.tw_crps_for_ensemble(fcst, obs, ensemble_member_dim, chaining_func, *, chaining_func_kwargs=None, method='ecdf', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the threshold weighted continuous ranked probability score (twCRPS) given ensemble input using a chaining function chaining_func (see below). An ensemble of forecasts can also be thought of as a random sample from the predictive distribution.

The twCRPS is calculated by the formula

\[\text{twCRPS}(F, y) = \mathbb{E}_F \left| v(X) - v(y) \right| - \frac{1}{2} \mathbb{E}_F \left| v(X) - v(X') \right|,\]

where \(X\) and \(X'\) are independent samples of the predictive distribution \(F\), \(y\) is the observation, and \(v\) is a ‘chaining function’.

The chaining function \(v\) is the antiderivative of the threshold weight function \(w\), which is a non-negative function that assigns a weight to each threshold value. For example, if we wanted to assign a threshold weight of 1 to thresholds above threshold \(t\) and a threshold weight of 0 to thresholds below \(t\), our threshold weight function would be \(w(x) = \mathbb{1}{(x > t)}\), where \(\mathbb{1}\) is the indicator function which returns a value of 1 if the condition is true and 0 otherwise. A chaining function would then be \(v(x) = \text{max}(x, t)\).

There are currently two methods available for calculating the twCRPS: “ecdf” and “fair”.
  • If method=”ecdf” then the twCRPS value returned is the exact twCRPS value for the empirical cumulative distribution function constructed using the ensemble values.

  • If method=”fair” then the twCRPS value returned is the approximated twCRPS where the ensemble values can be interpreted as a random sample from the underlying predictive distribution. See https://doi.org/10.1002/qj.2270 for more details on the fair CRPS which are relevant for the fair twCRPS.

The ensemble representation of the empirical twCRPS is

\[\text{twCRPS}(F_{\text{ens}}, y; v) = \frac{1}{M} \sum_{m=1}^{M} \left| v(x_m) - v(y) \right| - \frac{1}{2M^2} \sum_{m=1}^{M} \sum_{j=1}^{M} \left| v(x_m) - v(x_j) \right|,\]

where \(M\) is the number of ensemble members.

While the ensemble representation of the fair twCRPS is

\[\text{twCRPS}(F_{\text{ens}}, y; v) = \frac{1}{M} \sum_{m=1}^{M} \left| v(x_m) - v(y) \right| - \frac{1}{2M(M - 1)} \sum_{m=1}^{M} \sum_{j=1}^{M} \left| v(x_m) - v(x_j) \right|.\]
Parameters:
  • fcst (DataArray | Dataset) – Forecast data. Must have a dimension ensemble_member_dim.

  • obs (DataArray | Dataset) – Observation data.

  • ensemble_member_dim (str) – the dimension that specifies the ensemble member or the sample from the predictive distribution.

  • chaining_func (Callable[[DataArray | Dataset], DataArray | Dataset]) – the chaining function.

  • chaining_func_kwargs (dict[str, Any] | None) – keyword arguments for the chaining function.

  • method (Literal['ecdf', 'fair']) – Either “ecdf” for the empirical twCRPS or “fair” for the Fair twCRPS.

  • reduce_dims (Sequence[str] | None) – Dimensions to reduce. Can be “all” to reduce all dimensions.

  • preserve_dims (Sequence[str] | None) – Dimensions to preserve. Can be “all” to preserve all dimensions.

  • weights (DataArray | Dataset | None) – Weights for calculating a weighted mean of individual scores. Note that these weights are different to threshold weighting which is done by decision threshold.

  • include_components (bool | None) – If True, returns the twCRPS with underforecast and overforecast penalties, as well as the forecast spread term. See scores.probability.crps_for_ensemble() for more details on the decomposition.

Returns:

xarray object of twCRPS values.

Raises:

ValueError – when method is not one of “ecdf” or “fair”.

Return type:

DataArray

Notes

Chaining functions can be created to vary the weights across given dimensions such as varying the weights by climatological values.

References

  • Allen, S., Ginsbourger, D., & Ziegel, J. (2023). Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3), 906-940. https://doi.org/10.1137/22M1532184.

  • Allen, S. (2024). Weighted scoringRules: Emphasizing Particular Outcomes When Evaluating Probabilistic Forecasts. Journal of Statistical Software, 110(8), 1-26. https://doi.org/10.18637/jss.v110.i08

Examples

Calculate the twCRPS for an ensemble of forecasts where the chaining function is derived from a weight function that assigns a weight of 1 to thresholds above 0.5 and a weight of 0 to thresholds below 0.5.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import tw_crps_for_ensemble
>>> fcst = xr.DataArray(np.random.rand(10, 10), dims=['time', 'ensemble'])
>>> obs = xr.DataArray(np.random.rand(10), dims=['time'])
>>> tw_crps_for_ensemble(fcst, obs, 'ensemble', lambda x: np.maximum(x, 0.5))

scores.probability.tail_tw_crps_for_ensemble(fcst, obs, ensemble_member_dim, threshold, *, tail='upper', method='ecdf', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the threshold weighted continuous ranked probability score (twCRPS) weighted for a tail of the distribution from ensemble input.

A threshold weight of 1 is assigned for values of the tail and a threshold weight of 0 otherwise. The threshold value of where the tail begins is specified by the threshold argument. The tail argument specifies whether the tail is the upper or lower tail. For example, if we only care about values above 40 degrees C, we can set threshold=40 and tail="upper".

For more flexible weighting options and the relevant equations, see the scores.probability.tw_crps_for_ensemble() function.

Parameters:
  • fcst (DataArray | Dataset) – Forecast data. Must have a dimension ensemble_member_dim.

  • obs (DataArray | Dataset) – Observation data.

  • ensemble_member_dim (str) – the dimension that specifies the ensemble member or the sample from the predictive distribution.

  • threshold (DataArray | Dataset | float) – the threshold value for where the tail begins. It can either be a float for a single threshold or an xarray object if the threshold varies across dimensions (e.g., climatological values).

  • tail (Literal['upper', 'lower']) – the tail of the distribution to weight. Either “upper” or “lower”.

  • method (Literal['ecdf', 'fair']) – Either “ecdf” or “fair”. See scores.probability.tw_crps_for_ensemble() for more details.

  • reduce_dims (Sequence[str] | None) – Dimensions to reduce. Can be “all” to reduce all dimensions.

  • preserve_dims (Sequence[str] | None) – Dimensions to preserve. Can be “all” to preserve all dimensions.

  • weights (DataArray | Dataset | None) – Weights for calculating a weighted mean of individual scores. Note that these weights are different to threshold weighting which is done by decision threshold.

  • include_components (bool | None) – If True, returns the twCRPS with underforecast and overforecast penalties, as well as the forecast spread term. See scores.probability.crps_for_ensemble() for more details on the decomposition.

Returns:

xarray object of twCRPS values that has been weighted based on the tail.

Raises:
  • ValueError – when tail is not one of “upper” or “lower”.

  • ValueError – when method is not one of “ecdf” or “fair”.

Return type:

DataArray | Dataset

References

  • Allen, S., Ginsbourger, D., & Ziegel, J. (2023). Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3), 906-940. https://doi.org/10.1137/22M1532184.

  • Allen, S. (2024). Weighted scoringRules: Emphasizing Particular Outcomes When Evaluating Probabilistic Forecasts. Journal of Statistical Software, 110(8), 1-26. https://doi.org/10.18637/jss.v110.i08

Examples

Calculate the twCRPS for an ensemble where we assign a threshold weight of 1 to thresholds above 0.5 and a threshold weight of 0 to thresholds below 0.5.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import tail_tw_crps_for_ensemble
>>> fcst = xr.DataArray(np.random.rand(10, 10), dims=['time', 'ensemble'])
>>> obs = xr.DataArray(np.random.rand(10), dims=['time'])
>>> tail_tw_crps_for_ensemble(fcst, obs, 'ensemble', 0.5, tail='upper')

scores.probability.interval_tw_crps_for_ensemble(fcst, obs, ensemble_member_dim, lower_threshold, upper_threshold, *, method='ecdf', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the threshold weighted continuous ranked probability score (twCRPS) for ensemble forecasts where the threshold weight is 1 on a specified interval and 0 otherwise.

The threshold values that define the bounds of the interval are given by the lower_threshold and upper_threshold arguments. For example, if we only want to focus on temperatures between -20 and -10 degrees C, where aircraft icing is most likely, we can set lower_threshold=-20 and upper_threshold=-10.

For more flexible weighting options and the relevant equations, see the scores.probability.tw_crps_for_ensemble() function.

Parameters:
  • fcst (DataArray | Dataset) – Forecast data. Must have a dimension ensemble_member_dim.

  • obs (DataArray | Dataset) – Observation data.

  • ensemble_member_dim (str) – the dimension that specifies the ensemble member or the sample from the predictive distribution.

  • lower_threshold (DataArray | float) – the threshold value for where the interval begins. It can either be a float for a single threshold or an xarray object if the threshold varies across dimensions (e.g., climatological values).

  • upper_threshold (DataArray | float) – the threshold value for where the interval ends. It can either be a float for a single threshold or an xarray object if the threshold varies across dimensions (e.g., climatological values).

  • method (Literal['ecdf', 'fair']) – Either “ecdf” or “fair”. See scores.probability.tw_crps_for_ensemble() for more details.

  • reduce_dims (Sequence[str] | None) – Dimensions to reduce. Can be “all” to reduce all dimensions.

  • preserve_dims (Sequence[str] | None) – Dimensions to preserve. Can be “all” to preserve all dimensions.

  • weights (DataArray | Dataset | None) – Weights for calculating a weighted mean of individual scores. Note that these weights are different to threshold weighting which is done by decision threshold.

  • include_components (bool | None) – If True, returns the twCRPS with underforecast and overforecast penalties, as well as the forecast spread term. See scores.probability.crps_for_ensemble() for more details on the decomposition.

Returns:

xarray object of twCRPS values where the threshold weighting is based on an interval.

Raises:
  • ValueError – when lower_threshold is not less than upper_threshold.

  • ValueError – when method is not one of “ecdf” or “fair”.

Return type:

DataArray | Dataset

References

  • Allen, S., Ginsbourger, D., & Ziegel, J. (2023). Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3), 906-940. https://doi.org/10.1137/22M1532184.

  • Allen, S. (2024). Weighted scoringRules: Emphasizing Particular Outcomes When Evaluating Probabilistic Forecasts. Journal of Statistical Software, 110(8), 1-26. https://doi.org/10.18637/jss.v110.i08

Examples

Calculate the twCRPS for an ensemble where we assign a threshold weight of 1 to thresholds between -20 and -10 and a threshold weight of 0 to thresholds outside that interval.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import interval_tw_crps_for_ensemble
>>> fcst = xr.DataArray(np.random.uniform(-40, 20, size=(30, 15)), dims=['time', 'ensemble'])
>>> obs = xr.DataArray(np.random.uniform(-40, 20, size=30), dims=['time'])
>>> interval_tw_crps_for_ensemble(fcst, obs, 'ensemble', -20, -10)

scores.probability.murphy_score(fcst, obs, thetas, *, functional, alpha, huber_a=None, decomposition=False, reduce_dims=None, preserve_dims=None)#

Returns the mean elementary score (Ehm et al. 2016), also known as the Murphy score, evaluated at decision thresholds specified by thetas. Optionally returns a decomposition of the score in terms of penalties for over- and under-forecasting.

Select functional="quantile" and alpha=0.5 for the median functional. Select functional="expectile" and alpha=0.5 for the mean (i.e., expectation) functional.

Consider using murphy_thetas() to generate thetas. If utilising dask, you may want to store thetas as a netCDF on disk and pass it in as an xarray object. Otherwise, very large objects may be created when fcst, obs and thetas are broadcast together.

Parameters:
  • fcst (DataArray) – Forecast numerical values.

  • obs (DataArray) – Observed numerical values.

  • thetas (Sequence[float] | DataArray) – Theta thresholds.

  • functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.

  • alpha (float) – Risk parameter (i.e. quantile or expectile level) for the functional. Must be between 0 and 1.

  • huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.

  • decomposition (bool) – True to return penalty values due to under- and over-fcst as well as the total score, False to return total score only.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Murphy score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Murphy score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the Murphy score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

Returns:

An xr.Dataset with dimensions based on the preserve_dims or reduce_dims arg as well as a “theta” dimension with values matching thetas input.

If decomposition is False, the dataset’s variables will contain 1 element “total”.

If decomposition is True, in addition to “total”, it will have “underforecast” and “overforecast” data_vars.

Raises:
  • ValueError – If functional is not one of the expected functions.

  • ValueError – If alpha is not strictly between 0 and 1.

  • ValueError – If huber_a is not strictly greater than 0.

Return type:

Dataset

References

For mean elementary score definitions, see:
  • Theorem 1 of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154

  • Theorem 5.3 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957
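
As a minimal sketch (not drawn from the library's own examples), the call below evaluates mean elementary scores for the median functional at theta thresholds generated by murphy_thetas(); the synthetic data values and dimension name are illustrative assumptions:

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import murphy_score, murphy_thetas
>>> fcst = xr.DataArray(np.random.uniform(0, 10, size=20), dims=['time'])
>>> obs = xr.DataArray(np.random.uniform(0, 10, size=20), dims=['time'])
>>> thetas = murphy_thetas([fcst], obs, 'quantile')
>>> murphy_score(fcst, obs, thetas, functional='quantile', alpha=0.5, decomposition=True)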

scores.probability.murphy_thetas(forecasts, obs, functional, *, huber_a=None, left_limit_delta=None)#

Return the decision thresholds (theta values) at which to evaluate elementary scores for the construction of Murphy diagrams.

This function may generate a very large number of theta thresholds if the forecast and obs data arrays are large and their values have high precision, which may lead to long computational times for Murphy diagrams. To reduce the number of thetas, users can first round forecast and obs values to an appropriate resolution.

Parameters:
  • forecasts (list[DataArray]) – Forecast values, one array per source.

  • obs (DataArray) – Observed values.

  • functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.

  • huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.

  • left_limit_delta (float | None) – Approximation of the left hand limit, used for “huber” and “expectile” functionals. Must be greater than or equal to 0. None will be treated as 0. Ideally, left_limit_delta should be small relative to the fcst and obs precision, and not greater than that precision.

Returns:

Theta thresholds to be used to compute Murphy scores.

Return type:

List[float]

Raises:
  • ValueError – If functional is not one of the expected functions.

  • ValueError – If huber_a is not strictly greater than 0.

  • ValueError – If left_limit_delta is not greater than or equal to 0.

Notes

For theta values at which to evaluate elementary scores, see

  • Corollary 2 (p.521) of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154

  • Corollary 5.6 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957
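
As a rough illustration (not one of the package’s own examples), thetas can be generated from a set of forecast sources and then passed to the Murphy score entry above. The forecast and obs values are rounded first to limit the number of thresholds, as suggested in the note above; the data below is an assumption.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import murphy_thetas
>>> fcst_a = xr.DataArray(np.round(np.random.rand(100), 2), dims=['time'])
>>> fcst_b = xr.DataArray(np.round(np.random.rand(100), 2), dims=['time'])
>>> obs = xr.DataArray(np.round(np.random.rand(100), 2), dims=['time'])
>>> thetas = murphy_thetas([fcst_a, fcst_b], obs, 'expectile')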

scores.probability.roc_curve_data(fcst, obs, thresholds, *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

Calculates data required for plotting a Receiver (Relative) Operating Characteristic (ROC) curve, including the area under the curve (AUC). The ROC curve is used as a way to measure the discrimination ability of a particular forecast.

The AUC is the probability that the forecast probability of a random event is higher than the forecast probability of a random non-event.

Parameters:
  • fcst (DataArray) – An array of probabilistic forecasts for a binary event in the range [0, 1].

  • obs (DataArray) – An array of binary values where 1 is an event and 0 is a non-event.

  • thresholds (Iterable[float]) – Monotonic increasing values between 0 and 1, the thresholds at and above which to convert the probabilistic forecast to a value of 1 (an ‘event’).

  • reduce_dims (Sequence[str] | None) – Optionally specify which dimensions to reduce when calculating the ROC curve data. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Sequence[str] | None) – Optionally specify which dimensions to preserve when calculating ROC curve data. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the values will be the ROC curve at each point (i.e. single-value comparison against observed) for each threshold, and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom).

  • check_args (bool) – Checks if obs data only contains values in the set {0, 1, np.nan}. You may want to skip this check if you are sure about your input data and want to improve the performance when working with dask.

Returns:

  • ‘POD’ (the probability of detection)

  • ‘POFD’ (the probability of false detection)

  • ‘AUC’ (the area under the ROC curve)

POD and POFD have dimensions dims + ‘threshold’, while AUC has dimensions dims.

Return type:

An xarray.Dataset with data variables

Raises:
  • ValueError – if fcst contains values outside of the range [0, 1]

  • ValueError – if obs contains non-nan values not in the set {0, 1}

  • ValueError – if ‘threshold’ is a dimension in fcst.

  • ValueError – if values in thresholds are not monotonic increasing or are outside the range [0, 1]

Notes

The probabilistic fcst is converted to a deterministic forecast for each threshold in thresholds. If a value in fcst is greater than or equal to the threshold, then it is converted into a ‘forecast event’ (fcst = 1), and a ‘forecast non-event’ (fcst = 0) otherwise. The probability of detection (POD) and probability of false detection (POFD) are calculated for the converted forecast. From the POD and POFD data, the area under the ROC curve is calculated.

Ideally concave ROC curves should be generated rather than traditional ROC curves.
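
A minimal usage sketch for roc_curve_data. This is illustrative only; the random forecast and binary observation data below are assumptions chosen for the example.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import roc_curve_data
>>> fcst = xr.DataArray(np.random.rand(1000), dims=['time'])
>>> obs = xr.DataArray((np.random.rand(1000) > 0.5).astype(float), dims=['time'])
>>> result = roc_curve_data(fcst, obs, thresholds=[0.0, 0.25, 0.5, 0.75, 1.0])
>>> result['AUC']  # area under the ROC curve
>>> result['POD']  # probability of detection, one value per threshold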

scores.probability.brier_score(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

Calculates the Brier score for forecast and observed data.

\[\text{brier score} = \frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2\]

If you want to speed up performance with Dask and are confident of your input data, or if you want observations to take values between 0 and 1, set check_args to False.

Parameters:
  • fcst (DataArray | Dataset) – Forecast or predicted variables in xarray.

  • obs (DataArray | Dataset) – Observed variables in xarray.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Brier score. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Brier score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom).

  • check_args (bool) – will perform some tests on the data if set to True.

Raises:
  • ValueError – if fcst contains non-nan values outside of the range [0, 1]

  • ValueError – if obs contains non-nan values not in the set {0, 1}

Return type:

DataArray | Dataset

References
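
A minimal usage sketch for brier_score. This is illustrative only; the random data and dimension names are assumptions, not part of the package documentation.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import brier_score
>>> fcst = xr.DataArray(np.random.rand(10, 10), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.randint(0, 2, (10, 10)).astype(float), dims=['time', 'station'])
>>> brier_score(fcst, obs)  # mean Brier score over all dimensions
>>> brier_score(fcst, obs, preserve_dims=['station'])  # one score per station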

scores.probability.brier_score_for_ensemble(fcst, obs, ensemble_member_dim, event_thresholds, *, reduce_dims=None, preserve_dims=None, weights=None, fair_correction=True, event_threshold_operator=<built-in function ge>, threshold_dim='threshold')#

Calculates the Brier score for an ensemble forecast for the provided thresholds. By default, the fair Brier score is calculated, which is a modified version of the Brier score that applies a correction related to the number of ensemble members and the number of ensemble members that predict the event.

Given an event defined by event_threshold, the fair Brier score for ensembles is defined as

\[s_{i,y} = \left(\frac{i}{m} - y\right)^2 - \frac{i(m-i)}{m^2(m-1)}\]

Where:

  • \(i\) is the number of ensemble members that predict the event

  • \(m\) is the total number of ensemble members

  • \(y\) is the binary observation (0 if the event was not observed or 1 otherwise)

When the fair correction is not applied (i.e., fair_correction=False), the Brier score is calculated as:

\[s_{i,y} = \left(\frac{i}{m} - y\right)^2\]

The fair correction will only be applied at coordinates where the number of ensemble members is greater than 1 and fair_correction=True. When fair_correction is set to True and the number of ensemble members is 1, the fair correction will not be applied in order to avoid division by zero. If you want to require a minimum number of ensemble members before the Brier score is calculated, we recommend preprocessing your data to remove data points with fewer than that minimum number of ensemble members.

Parameters:
  • fcst (DataArray | Dataset) – Real valued ensemble forecasts in xarray. The proportion of ensemble members above a threshold is calculated within this function.

  • obs (DataArray | Dataset) – Real valued observations in xarray. These values are converted to binary values in this function based on the supplied event thresholds and operators.

  • ensemble_member_dim (str) – The dimension name of the ensemble member.

  • event_thresholds (Real | Sequence[Real]) – The threshold(s) that define what is considered an event. If multiple thresholds are provided, they should be monotonically increasing with no NaNs.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Brier score. All other dimensions will be preserved.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude) when calculating the mean score across specified dimensions via reduce_dims or preserve_dims.

  • fair_correction (bool) – Whether or not to apply the fair Brier score correction. Default is True.

  • event_threshold_operator (Callable) – The mode to use to define an event relative to a supplied threshold. Default is operator.ge, which means that an event occurs if the observation is greater than or equal to the threshold. Similarly, an event will have been considered to be forecast by an ensemble member if the forecast is greater than or equal to the threshold. Passing in operator.lt will give the same results as operator.ge as it is mathematically equivalent in this case. Alternatively, you can provide operator.gt which means that an event occurs if the observation is strictly greater than the threshold. Again, an event for an ensemble member will have been considered to be forecast if the forecast is greater than the threshold. Using operator.le will give the same results as operator.gt as it is mathematically equivalent in this case.

  • threshold_dim (str) – The name of the threshold dimension in the output. Default is ‘threshold’. It must not exist as a dimension in the forecast or observation data.

Returns:

The Brier score for the ensemble forecast.

Raises:
  • ValueError – if values in event_thresholds are not monotonically increasing.

  • ValueError – if fcst contains the dimension provided in threshold_dim.

  • ValueError – if obs contains the dimension provided in threshold_dim.

  • ValueError – if weights contains the dimension provided in threshold_dim.

  • ValueError – if ensemble_member_dim is not in fcst dimensions.

Return type:

DataArray | Dataset

References

  • Ferro, C. A. T. (2013). Fair scores for ensemble forecasts. Quarterly Journal of the Royal Meteorological Society, 140(683), 1917–1923. https://doi.org/10.1002/qj.2270

Examples

Calculate the Brier score for an ensemble forecast for a single threshold:

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import brier_score_for_ensemble
>>> fcst = 10 * xr.DataArray(np.random.rand(10, 10), dims=['time', 'ensemble'])
>>> obs = 10 * xr.DataArray(np.random.rand(10), dims=['time'])
>>> brier_score_for_ensemble(fcst, obs, ensemble_member_dim='ensemble', event_thresholds=0.5)

Calculate the Brier score for an ensemble forecast for multiple thresholds:

>>> event_thresholds = [0.1, 5, 9]
>>> brier_score_for_ensemble(fcst, obs, ensemble_member_dim='ensemble', event_thresholds=event_thresholds)

scores.probability.isotonic_fit(fcst, obs, *, weight=None, functional='mean', bootstraps=None, quantile_level=None, solver=None, confidence_level=0.9, min_non_nan=1, report_bootstrap_results=False)#

Calculates an isotonic regression fit to observations given forecasts as the explanatory values. Optionally returns confidence bands on the fit via bootstrapping. Forecasts and observations are scalar-valued (i.e., integers, floats or NaN).

If forecasts target the mean or a quantile functional, then the user can select that functional. Otherwise the user needs to supply a solver function

  • of one variable (if no weights are supplied), or

  • of two variables (if weights are supplied).

The solver function is used to find an optimal regression fit in each block, subject to a non-decreasing monotonicity constraint, using the pool adjacent violators (PAV) algorithm.

An example of a solver function of one variable is obs -> np.mean(obs). An example of a solver function of two variables is (obs, weight) -> np.average(obs, weights=weight).

Ties (where one forecast value has more than one observation value) are handled as follows. Forecasts are sorted in ascending order, and observations sorted to match. Where ties occur, observations are sorted in descending order within each subset of ties. Isotonic regression is then performed on this sorted observation sequence.

This implementation uses sklearn.isotonic.IsotonicRegression when functional=”mean”.

Parameters:
  • fcst (ndarray | DataArray) – np.array or xr.DataArray of forecast (or explanatory) values. Values must be float, integer or NaN.

  • obs (ndarray | DataArray) – np.array or xr.DataArray of corresponding observation (or response) values. Must be the same shape as fcst. Values must be float, integer or NaN.

  • weight (ndarray | DataArray | None) – positive weights to apply to each forecast-observation pair. Must be the same shape as fcst, or None (which is equivalent to applying equal weights).

  • functional (Literal['mean', 'quantile'] | None) – Functional that the forecasts are targeting. Either “mean” or “quantile” or None. If None then solver must be supplied. The current implementation for “quantile” does not accept weights. If weighted quantile regression is desired then the user should supply an appropriate solver.

  • bootstraps (int | None) – the number of bootstrap samples to perform for calculating the regression confidence band. Set to None if a confidence band is not required.

  • quantile_level (float | None) – the level of the quantile functional if functional=’quantile’. Must be strictly between 0 and 1.

  • solver (Callable[[ndarray, ndarray], float] | Callable[[ndarray], float] | None) – function that accepts 1D numpy array of observations and returns a float. Function values give the regression fit for each block as determined by the PAV algorithm. If weights are supplied, the function must have a second argument that accepts 1D numpy array of weights of same length as the observations.

  • confidence_level (float | None) – Confidence level of the confidence band, strictly between 0 and 1. For example, a confidence level of 0.5 will calculate the confidence band by computing the 25th and 75th percentiles of the bootstrapped samples.

  • min_non_nan (int | None) – The minimum number of non-NaN bootstrapped values to calculate confidence bands at a particular forecast value.

  • report_bootstrap_results (bool | None) – Specifies whether to keep the bootstrapped regression values in the return dictionary. Default is False to keep the output result small.

Returns:

  • “unique_fcst_sorted”: 1D numpy array of remaining forecast values sorted in ascending order, after any NaNs from fcst, obs and weight are removed, and only unique values are kept to keep the output size reasonably small.

  • “fcst_counts”: 1D numpy array of forecast counts for the unique sorted forecast values.

  • “regression_values”: 1D numpy array of regression values corresponding to “unique_fcst_sorted” values.

  • “regression_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “regression_values”), for any supplied argument (1D numpy array) of potential forecast values.

  • “bootstrap_results”: when report_bootstrap_results=True, a 2D numpy array of bootstrapped regression values is included in the return dictionary. Each row gives the interpolated regression values from a particular bootstrapped sample, evaluated at “fcst_sorted” values. If m is the number of bootstrap samples and n = len(fcst_sorted) then it has shape (m, n). We emphasise that this array corresponds to fcst_sorted, not unique_fcst_sorted.

  • “confidence_band_lower_values”: values of the lower confidence band threshold, evaluated at “unique_fcst_sorted” values.

  • “confidence_band_upper_values”: values of the upper confidence band threshold, evaluated at “unique_fcst_sorted” values.

  • “confidence_band_lower_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “confidence_band_lower_values”), given any argument (1D numpy array) of potential forecast values.

  • “confidence_band_upper_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “confidence_band_upper_values”), given any argument (1D numpy array) of potential forecast values.

  • “confidence_band_levels”: tuple giving the quantile levels used to calculate the confidence band.

Return type:

Dictionary with the following keys

Raises:
  • ValueError – if fcst and obs are np.arrays and don’t have the same shape.

  • ValueError – if fcst and weight are np.arrays and don’t have the same shape.

  • ValueError – if fcst and obs are xarray.DataArrays and don’t have the same dimensions.

  • ValueError – if fcst and weight are xarray.DataArrays and don’t have the same dimensions.

  • ValueError – if any entries of fcst, obs or weight are not an integer, float or NaN.

  • ValueError – if there are no fcst and obs pairs after NaNs are removed.

  • ValueError – if functional is not one of “mean”, “quantile” or None.

  • ValueError – if quantile_level is not strictly between 0 and 1 when functional=”quantile”.

  • ValueError – if weight is not None when functional=”quantile”.

  • ValueError – if functional and solver are both None.

  • ValueError – if not exactly one of functional and solver is None.

  • ValueError – if bootstraps is not a positive integer.

  • ValueError – if confidence_level is not strictly between 0 and 1.

Note: This function only keeps the unique values of fcst_sorted to keep the volume of the return dictionary small. The forecast counts are also included, so users can use them to create a forecast histogram (usually displayed in reliability diagrams).

References
  • de Leeuw, Hornik and Mair. “Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods”, Journal of Statistical Software, 2009.

  • Dimitriadis, Gneiting and Jordan. “Stable reliability diagrams for probabilistic classifiers”, PNAS, Vol. 118 No. 8, 2020. Available at https://www.pnas.org/doi/10.1073/pnas.2016191118

  • Jordan, Mühlemann, and Ziegel. “Optimal solutions to the isotonic regression problem”, 2020 (version 2), available on arxiv at https://arxiv.org/abs/1904.04761
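
A minimal usage sketch for isotonic_fit. This is illustrative only; the synthetic, roughly calibrated forecast data below is an assumption chosen for the example.

>>> import numpy as np
>>> from scores.probability import isotonic_fit
>>> rng = np.random.default_rng(42)
>>> fcst = rng.uniform(0, 40, 1000)
>>> obs = fcst + rng.normal(0, 5, 1000)  # roughly calibrated forecasts with noise
>>> result = isotonic_fit(fcst, obs, functional='mean', bootstraps=100)
>>> result['regression_values']  # fit at each unique sorted forecast value
>>> result['confidence_band_lower_values']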

scores.categorical#

scores.categorical.firm(fcst, obs, risk_parameter, categorical_thresholds, threshold_weights, *, discount_distance=0, reduce_dims=None, preserve_dims=None, weights=None, threshold_assignment='lower')#

Calculates the FIxed Risk Multicategorical (FIRM) score including the underforecast and overforecast penalties.

categorical_thresholds and threshold_weights must be the same length.

Parameters:
  • fcst (DataArray) – An array of real-valued forecasts that we want to treat categorically.

  • obs (DataArray) – An array of real-valued observations that we want to treat categorically.

  • risk_parameter (float) – Risk parameter (alpha) for the FIRM score. The value must satisfy 0 < risk_parameter < 1.

  • categorical_thresholds (Sequence[float] | Sequence[DataArray]) – Category thresholds (thetas) to delineate the categories. A sequence of xr.DataArrays may be supplied to allow for different thresholds at each coordinate (e.g., thresholds determined by climatology).

  • threshold_weights (Sequence[float | DataArray]) – Weights that specify the relative importance of forecasting on the correct side of each category threshold. Either a positive float can be supplied for each categorical threshold or an xr.DataArray (with no negative values) can be provided for each categorical threshold as long as its dims are a subset of obs dims. NaN values are allowed in the xr.DataArray. For each NaN value at a given coordinate, the FIRM score will be NaN at that coordinate, before dims are collapsed.

  • discount_distance (float | None) – An optional discounting distance parameter which satisfies discount_distance >= 0 such that the cost of misses and false alarms are discounted whenever the observation is within distance discount_distance of the forecast category. A value of 0 will not apply any discounting.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the FIRM score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating FIRM. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the FIRM score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

  • threshold_assignment (str | None) – Specifies whether the intervals defining the categories are left or right closed. That is whether the decision threshold is included in the upper (left closed) or lower (right closed) category. Defaults to “lower”.

Returns:

  • firm_score: A score for a single category for each coord based on the FIRM framework.

  • overforecast_penalty: Penalty for False Alarms.

  • underforecast_penalty: Penalty for Misses.

Return type:

An xarray Dataset with data vars

Raises:
  • ValueError – if len(categorical_thresholds) < 1.

  • ValueError – if categorical_thresholds and threshold_weights lengths are not equal.

  • ValueError – if risk_parameter <= 0 or >= 1.

  • ValueError – if any values in threshold_weights are <= 0.

  • ValueError – if discount_distance is not None and < 0.

  • scores.utils.DimensionError – if threshold_weights is a list of xr.DataArrays and if the dimensions of these xr.DataArrays is not a subset of the obs dims.

Note

Setting discount distance to None or 0 will mean that no discounting is applied. This means that errors will be penalised strictly categorically.

Setting discount distance to np.inf means that the cost of a miss is always proportional to the distance of the observation from the threshold, and similarly for false alarms.

References

Taggart, R., Loveday, N. and Griffiths, D., 2022. A scoring framework for tiered warnings and multicategorical forecasts based on fixed risk measures. Quarterly Journal of the Royal Meteorological Society, 148(744), pp.1389-1406.
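
A minimal usage sketch for firm. This is illustrative only; the category thresholds, threshold weights and random data below are placeholders chosen for the example, with risk_parameter set to 0.7.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.categorical import firm
>>> fcst = xr.DataArray(20 * np.random.rand(10, 10), dims=['time', 'station'])
>>> obs = xr.DataArray(20 * np.random.rand(10, 10), dims=['time', 'station'])
>>> result = firm(fcst, obs, 0.7, [5.0, 15.0], [1.0, 2.0])
>>> result['firm_score']
>>> result['underforecast_penalty']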

scores.categorical.probability_of_detection(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

Calculates the Probability of Detection (POD), also known as the Hit Rate. This is the proportion of observed events (obs = 1) that were correctly forecast as an event (fcst = 1).

Parameters:
  • fcst (DataArray | Dataset) – An array containing binary values in the set {0, 1, np.nan}

  • obs (DataArray | Dataset) – An array containing binary values in the set {0, 1, np.nan}

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to sum when calculating the POD. All other dimensions will not be summed. As a special case, ‘all’ will allow all dimensions to be summed. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to sum across all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to not sum when calculating the POD. All other dimensions will be summed. As a special case, ‘all’ will allow all dimensions to not be summed. In this case, the result will be in the same shape/dimensionality as the forecast, and the values will be the POD at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

  • check_args (bool | None) – Checks if fcst and obs data only contain values in the set {0, 1, np.nan}. You may want to skip this check if you are sure about your input data and want to improve the performance when working with dask.

Returns:

A DataArray of the Probability of Detection.

Raises:

ValueError – if there are values in fcst and obs that are not in the set {0, 1, np.nan} and check_args is true.

Return type:

DataArray | Dataset
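
A minimal usage sketch for probability_of_detection. This is illustrative only; the random binary forecast and observation data are assumptions.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.categorical import probability_of_detection
>>> fcst = xr.DataArray(np.random.randint(0, 2, 100).astype(float), dims=['time'])
>>> obs = xr.DataArray(np.random.randint(0, 2, 100).astype(float), dims=['time'])
>>> probability_of_detection(fcst, obs)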

scores.categorical.probability_of_false_detection(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

Calculates the Probability of False Detection (POFD), also known as False Alarm Rate (not to be confused with the False Alarm Ratio). The POFD is the proportion of observed non-events (obs = 0) that were incorrectly forecast as event (i.e. fcst = 1).

Parameters:
  • fcst (DataArray | Dataset) – An array containing binary values in the set {0, 1, np.nan}

  • obs (DataArray | Dataset) – An array containing binary values in the set {0, 1, np.nan}

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to sum when calculating the POFD. All other dimensions will not be summed. As a special case, ‘all’ will allow all dimensions to be summed. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to sum across all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to not sum when calculating the POFD. All other dimensions will be summed. As a special case, ‘all’ will allow all dimensions to not be summed. In this case, the result will be in the same shape/dimensionality as the forecast, and the values will be the POFD at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom)

  • check_args (bool | None) – Checks if fcst and obs data only contain values in the set {0, 1, np.nan}. You may want to skip this check if you are sure about your input data and want to improve the performance when working with dask.

Returns:

A DataArray of the Probability of False Detection.

Raises:

ValueError – if there are values in fcst and obs that are not in the set {0, 1, np.nan} and check_args is true.

Return type:

DataArray | Dataset

class scores.categorical.BinaryContingencyManager(fcst_events, obs_events)#

See https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html for a detailed walkthrough showing the use in practice.

A BinaryContingencyManager holds the underlying binary forecast and observed event data, from which it builds a contingency table and provides scoring functionality based on that table.

A BinaryContingencyManager is typically created by an EventOperator, such as a ThresholdOperator, but can also be created directly from binary data if the user wishes.

Supported operations include:
  • “Transforming” the data in various ways, such as dimensional reduction

  • Producing contingency tables

  • Calculating scores and metrics based on contingency tables

The full data comprises several n-dimensional binary arrays which can be considered maps of:
  • True positives (hits)

  • False positives (false alarms)

  • True negatives (correct negatives)

  • False negatives (misses)

These masks, in addition to supporting score calculations, can be accessed and used for:

  • Plotting these attributes on a map

  • Masking these values by a geographical region prior to score calculation

As such, the per-pixel information is useful as well as the overall ratios involved.

BinaryContingencyManager inherits from (uses) the BasicContingencyManager class to provide score calculations on the final contingency table. Documentation for the available scores is found in the BasicContingencyManager API entry but the calls can be made directly against instances of BinaryContingencyManager where performance or transformation are not a concern.

Parameters:
  • fcst_events (DataArray | Dataset | Series)

  • obs_events (DataArray | Dataset | Series)
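
A minimal sketch of constructing a BinaryContingencyManager directly from binary event data and scoring it. This is illustrative only (in practice an EventOperator such as ThresholdEventOperator would usually build the event tables, and the random 0/1 data below is an assumption).

>>> import numpy as np
>>> import xarray as xr
>>> from scores.categorical import BinaryContingencyManager
>>> fcst_events = xr.DataArray(np.random.randint(0, 2, (10, 10)), dims=['time', 'station'])
>>> obs_events = xr.DataArray(np.random.randint(0, 2, (10, 10)), dims=['time', 'station'])
>>> cm = BinaryContingencyManager(fcst_events, obs_events)
>>> cm.probability_of_detection()  # scores can be called directly on the manager
>>> per_station = cm.transform(preserve_dims=['station'])  # or build a per-station contingency table first
>>> per_station.accuracy()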

transform(*, reduce_dims=None, preserve_dims=None)#

Compute the contingency table, preserving or reducing the specified dimensions.

Parameters:
  • reduce_dims (-) – Dimensions to reduce. Can be “all” to reduce all dimensions.

  • preserve_dims (-) – Dimensions to preserve. Can be “all” to preserve all dimensions.

Returns:

A scores class which supports efficient calculation of contingency metrics.

Return type:

scores.categorical.BasicContingencyManager

class scores.categorical.BasicContingencyManager(counts)#

See https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html for a detailed walkthrough showing the use of this class in practice.

A BasicContingencyManager object provides the scoring functions which are calculated from a contingency table. A BasicContingencyManager can be efficiently and repeatedly queried for a wide variety of scores.

A BasicContingencyManager is produced when a BinaryContingencyManager is transformed. It is also possible to create a BasicContingencyManager from event counts or a contingency table, although this is not a common user requirement.

A contingency table is built only from event counts, losing the connection to the actual event tables in their full dimensionality. The event count data is much smaller than the full event tables, particularly when considering very large data sets like Numerical Weather Prediction (NWP) data, which could be terabytes to petabytes in size.

By contrast, a BinaryContingencyManager retains the full event data, which provides more flexibility but may reduce efficiency.

Parameters:

counts (dict)
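
Once a BinaryContingencyManager has been transformed, the resulting BasicContingencyManager can be queried repeatedly for different scores. A minimal sketch follows; the random binary event data is an assumption chosen for the example.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.categorical import BinaryContingencyManager
>>> fcst_events = xr.DataArray(np.random.randint(0, 2, 100), dims=['time'])
>>> obs_events = xr.DataArray(np.random.randint(0, 2, 100), dims=['time'])
>>> basic = BinaryContingencyManager(fcst_events, obs_events).transform(reduce_dims='all')
>>> basic.accuracy()
>>> basic.peirce_skill_score()
>>> basic.get_counts()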

accuracy()#

Identical to fraction_correct().

Accuracy calculates the proportion of forecasts which are correct.

Returns:

A DataArray containing the accuracy score

Return type:

xr.DataArray

\[\text{accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{total count}}\]

Notes

  • Range: 0 to 1, where 1 indicates a perfect score.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#ACC

base_rate()#

The observed event frequency.

Returns:

An xarray object containing the base rate.

Return type:

xr.DataArray

\[\text{base rate} = \frac{\text{true positives} + \text{false negatives}}{\text{total count}}\]

Notes

  • Range: 0 to 1, where 1 indicates the event occurred every time.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

References

Hogan, R. J. & Mason, I. B. (2011). Deterministic forecasts of binary events. In I. T. Jolliffe & D. B. Stephenson (Eds.), Forecast verification: A practitioner’s guide in atmospheric science (2nd ed., pp. 39-51). https://doi.org/10.1002/9781119960003.ch3

bias_score()#

Identical to frequency_bias().

How did the forecast frequency of “yes” events compare to the observed frequency of “yes” events?

Returns:

An xarray object containing the bias score

Return type:

xr.DataArray

\[\text{frequency bias} = \frac{\text{true positives} + \text{false positives}}{\text{true positives} + \text{false negatives}}\]

Notes

  • Range: 0 to ∞ (infinity), where 1 indicates a perfect score.

  • “True positives” is the same as “hits”.

  • “False positives” is the same as “false alarms”.

  • “False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#BIAS

cohens_kappa()#

Identical to heidke_skill_score().

Calculates Cohen’s kappa.

What was the accuracy of the forecast relative to that of random chance?

Returns:

An xarray object containing the Cohen’s Kappa score

Return type:

xr.DataArray

\[\begin{split}\\ \text{Cohen's Kappa} \left(\kappa\right) = \frac{\text{true positives} + \text{true negatives} - E_{random}}{ \text{total count} - E_{\text{random}}}\end{split}\]

where

\[\begin{split}\begin{aligned} E_{\text{random}} &= \text{expected correct matches due to random chance} \\&= \frac{\left(\text{tp} + \text{fn}\right) \cdot \left(\text{tp} + \text{fp}\right) + \left(\text{tn} + \text{fn}\right) \cdot \left(\text{tn} + \text{fp}\right)}{ \text{total count}} \end{aligned}\end{split}\]

Notes

  • Range: -1 to 1, 0 indicates no skill. Perfect score: 1.

  • “True positives” (\(\text{tp}\)) is the same as “hits”.

  • “False negatives” (\(\text{fn}\)) is the same as “misses”.

  • “False positives” (\(\text{fp}\)) is the same as “false alarms”.

  • “True negatives” (\(\text{tn}\)) is the same as “correct negatives”.

References

critical_success_index()#

Identical to threat_score().

Returns:

An xarray object containing the critical success index

Return type:

xr.DataArray

\[\text{threat score} = \frac{\text{true positives}}{\text{true positives} + \text{false positives} + \text{false negatives}}\]

Notes

  • Range: 0 to 1, 0 indicates no skill. Perfect score: 1.

  • Often known as CSI.

  • “True positives” is the same as “hits”

  • “False positives” is the same as “false alarms”

  • “True negatives” is the same as “correct negatives”

References

https://www.cawcr.gov.au/projects/verification/#CSI

equitable_threat_score()#

Identical to gilberts_skill_score().

Calculates the Equitable threat score.

How well did the forecast “yes” events correspond to the observed “yes” events (accounting for hits due to chance)?

Returns:

An xarray object containing the equitable threat score

Return type:

xr.DataArray

\[\text{ETS} = \frac{\text{true positives} - \text{true positives} _\text{random}} {\text{true positives} + \text{false negatives} + \text{false positives} - \text{true positives} _\text{random}}\]

where

\[\text{true_positives}_{\text{random}} = \frac{(\text{true positives} + \text{false negatives}) (\text{true positives} + \text{false positives})}{\text{total count}}\]

Notes

  • Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

  • “False positives” is the same as “false alarms”

References

  • Gilbert, G.K., 1884. Finley’s tornado predictions. American Meteorological Journal, 1(5), pp.166–172.

  • Hogan, R.J., Ferro, C.A., Jolliffe, I.T. and Stephenson, D.B., 2010. Equitability revisited: Why the “equitable threat score” is not equitable. Weather and Forecasting, 25(2), pp.710-726. https://doi.org/10.1175/2009WAF2222350.1

f1_score()#

Calculates the F1 score.

Returns:

An xarray object containing the F1 score

Return type:

xr.DataArray

\[\text{F1} = \frac{2 \cdot \text{true positives}}{(2 \cdot \text{true positives}) + \text{false positives} + \text{false negatives}}\]

Notes

  • “True positives” is the same as “hits”.

  • “False positives” is the same as “false alarms”.

  • “False negatives” is the same as “misses”.

References

false_alarm_rate()#

Identical to probability_of_false_detection().

What fraction of the non-events were incorrectly predicted?

Returns:

An xarray object containing the false alarm rate

Return type:

xr.DataArray

\[\text{false alarm rate} = \frac{\text{false positives}}{\text{true negatives} + \text{false positives}}\]

Notes

  • Range: 0 to 1. Perfect score: 0.

  • Not to be confused with the false alarm ratio.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#POFD

false_alarm_ratio()#

The false alarm ratio (FAR) calculates the fraction of the predicted “yes” events which did not eventuate (i.e., were false alarms).

Returns:

An xarray object containing the false alarm ratio

Return type:

xr.DataArray

\[\text{false alarm ratio} = \frac{\text{false positives}}{\text{true positives} + \text{false positives}}\]

Notes

  • Range: 0 to 1. Perfect score: 0.

  • Not to be confused with the False Alarm Rate.

  • “False positives” is the same as “false alarms”.

  • “True positives” is the same as “hits”.

References

https://www.cawcr.gov.au/projects/verification/#FAR

forecast_rate()#

The forecast event frequency.

Returns:

An xarray object containing the forecast rate.

Return type:

xr.DataArray

\[\text{forecast rate} = \frac{\text{true positives} + \text{false positives}}{\text{total count}}\]

Notes

  • Range: 0 to 1, where 1 indicates the event was forecast every time.

  • “True positives” is the same as “hits”.

  • “False positives” is the same as “false alarms”.

References

Hogan, R. J. & Mason, I. B. (2011). Deterministic forecasts of binary events. In I. T. Jolliffe & D. B. Stephenson (Eds.), Forecast verification: A practitioner’s guide in atmospheric science (2nd ed., pp. 39-51). https://doi.org/10.1002/9781119960003.ch3

format_table(positive_value_name='Positive', negative_value_name='Negative')#

Formats a contingency table from a 2x2 contingency manager into a confusion matrix.

Parameters:
  • positive_value_name (str) – A string with the value to be displayed on the contingency table visual. The default value is ‘Positive’.

  • negative_value_name (str) – A string with the value to be displayed on the contingency table visual. The default value is ‘Negative’.

Returns:

A Pandas DataFrame representing a 2x2 binary contingency table if possible; otherwise, an xr.DataArray showing higher-dimensional contingency table information.

Return type:

pandas.DataFrame

This function retrieves the contingency table, extracts the relevant counts, and constructs a 2x2 confusion matrix represented as a Pandas DataFrame. This function does not support HTML rendering of tables where dimensionality has been preserved.

Example

Assuming self.xr_table returns an xr.DataArray:

array([ 5., 11.,  1.,  1., 18.])
Coordinates:
  * contingency  (contingency) <U11 'tp_count' ... 'total_count'

This function will return:

          Positive  Negative  Total
Positive         5         1      6
Negative         1        11     12
Total            6        12     18

fraction_correct()#

Identical to accuracy().

Fraction correct calculates the proportion of forecasts which are correct.

Returns:

An xarray object containing the fraction correct.

Return type:

xr.DataArray

\[\text{fraction correct} = \frac{\text{true positives} + \text{true negatives}}{\text{total count}}\]

Notes

  • Range: 0 to 1, where 1 indicates a perfect score.

  • “True positives” is the same as “hits”.

  • “True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#ACC

frequency_bias()#

Identical to bias_score().

How did the forecast frequency of “yes” events compare to the observed frequency of “yes” events?

Returns:

An xarray object containing the frequency bias

Return type:

xr.DataArray

\[\text{frequency bias} = \frac{\text{true positives} + \text{false positives}}{\text{true positives} + \text{false negatives}}\]

Notes

  • Range: 0 to ∞ (infinity), where 1 indicates a perfect score.

  • “True positives” is the same as “hits”.

  • “False positives” is the same as “false alarms”.

  • “False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#BIAS

get_counts()#
Returns:

A dictionary of the contingency table counts. Values are xr.DataArray instances. The keys are:

  • tp_count for true positive count

  • tn_count for true negative count,

  • fp_count for false positive count

  • fn_count for false negative count

  • total_count for the total number of events.

Return type:

dict

Here is an example of what may be returned:

{'tp_count': <xarray.DataArray ()> Size: 8B array(5.),
 'tn_count': <xarray.DataArray ()> Size: 8B array(11.),
 'fp_count': <xarray.DataArray ()> Size: 8B array(1.),
 'fn_count': <xarray.DataArray ()> Size: 8B array(1.),
 'total_count': <xarray.DataArray ()> Size: 8B array(18.)}
get_table()#
Returns:

The contingency table as an xarray object. Contains a coordinate dimension called ‘contingency’. Valid coordinates for this dimension are:

  • tp_count for true positive count

  • tn_count for true negative count

  • fp_count for false positive count

  • fn_count for false negative count

  • total_count for the total number of events.

Return type:

xr.DataArray

Here is an example of what may be returned:

array([ 5., 11.,  1.,  1., 18.])
Coordinates:
  * contingency  (contingency) <U11 'tp_count' ... 'total_count'
Indexes:
    contingency  PandasIndex
Attributes: (0)

gilberts_skill_score()#

Identical to equitable_threat_score().

Calculates the Gilbert skill score.

How well did the forecast “yes” events correspond to the observed “yes” events (accounting for hits due to chance)?

Returns:

An xarray object containing the Gilberts Skill Score

Return type:

xr.DataArray

\[\text{GSS} = \frac{\text{true positives} - \text{true positives} _\text{random}} {\text{true positives} + \text{false negatives} + \text{false positives} - \text{true positives} _\text{random}}\]

where

\[\text{true_positives}_{\text{random}} = \frac{(\text{true positives} + \text{false negatives}) (\text{true positives} + \text{false positives})}{\text{total count}}\]

Notes

  • Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”

  • “False positives” is the same as “false alarms”.

References

  • Gilbert, G.K., 1884. Finley’s tornado predictions. American Meteorological Journal, 1(5), pp.166–172.

  • Hogan, R.J., Ferro, C.A., Jolliffe, I.T. and Stephenson, D.B., 2010. Equitability revisited: Why the “equitable threat score” is not equitable. Weather and Forecasting, 25(2), pp.710-726. https://doi.org/10.1175/2009WAF2222350.1

hanssen_and_kuipers_discriminant()#

Identical to peirce_skill_score() and true_skill_statistic().

How well did the forecast separate the “yes” events from the “no” events?

Returns:

An xarray object containing Hanssen and Kuipers’ Discriminant

Return type:

xr.DataArray

\[\text{HK} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} - \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}\]

where \(\text{HK}\) is the Hanssen and Kuipers Discriminant

Notes

  • Range: -1 to 1, 0 indicates no skill. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#HK

heidke_skill_score()#

Identical to cohens_kappa().

Calculates the Heidke skill score.

What was the accuracy of the forecast relative to that of random chance?

Returns:

An xarray object containing the Heidke skill score

Return type:

xr.DataArray

\[\begin{split}\\ \text{HSS} = \frac{\text{true positives} + \text{true negatives} - E_{random}}{ \text{total count} - E_{\text{random}}}\end{split}\]

where

\[\begin{split}\begin{aligned} E_{\text{random}} &= \text{expected correct matches due to random chance} \\&= \frac{\left(\text{tp} + \text{fn}\right) \cdot \left(\text{tp} + \text{fp}\right) + \left(\text{tn} + \text{fn}\right) \cdot \left(\text{tn} + \text{fp}\right)}{ \text{total count}} \end{aligned}\end{split}\]

Notes

  • Range: -1 to 1, 0 indicates no skill. Perfect score: 1.

  • HSS = Heidke Skill Score

  • “True positives” (\(\text{tp}\)) is the same as “hits”.

  • “False negatives” (\(\text{fn}\)) is the same as “misses”.

  • “False positives” (\(\text{fp}\)) is the same as “false alarms”.

  • “True negatives” (\(\text{tn}\)) is the same as “correct negatives”.

References

hit_rate()#

Identical to true_positive_rate(), probability_of_detection(), sensitivity() and recall().

Calculates the proportion of the observed events that were correctly forecast.

Returns:

An xarray object containing the hit rate

Return type:

xr.DataArray

\[\text{hit rate} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#POD

negative_predictive_value()#

Calculates the negative predictive value (NPV).

Of all the times a negative result was predicted, how often was the prediction actually correct?

Returns:

An xarray object containing the negative predictive value score

Return type:

xr.DataArray

\[\text{negative predictive value} = \frac{\text{true negatives}}{\text{true negatives} + \text{false negatives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True negatives” is the same as “correct rejections”.

  • “False negatives” is the same as “missed detections”.

References

odds_ratio()#

Calculates the odds ratio

What is the ratio of the odds of a “yes” forecast being correct, to the odds of a “yes” forecast being wrong?

Returns:

An xarray object containing the odds ratio

Return type:

xr.DataArray

\[\begin{split}\begin{aligned} \text{odds ratio} &= \left[\frac{\text{POD}}{\text{1 - POD}}\right] \div \left[\frac{\text{POFD}}{\text{1 - POFD}}\right] \\ &= \frac{\text{true positives} \cdot \text{true negatives}}{\text{false positives} \cdot \text{false negatives}} \end{aligned}\end{split}\]

where

\[\text{POD} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

and

\[\text{POFD} = \frac{\text{false positives}}{\text{true negatives} + \text{false positives}}\]

Notes

  • Range: 0 to ∞, 1 indicates no skill. Perfect score: ∞.

  • POD = Probability of Detection

  • POFD = Probability of False Detection

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

odds_ratio_skill_score()#

Identical to yules_q().

Calculates the odds ratio skill score.

What was the improvement of the forecast over random chance?

Returns:

An xarray object containing the odds ratio skill score

Return type:

xr.DataArray

\[\begin{split}\begin{aligned} \text{ORSS} &= \frac{\text{OR} - 1}{\text{OR} + 1} \\ &= \frac{\text{true positives} \cdot \text{true negatives} - \text{false positives} \cdot \text{false negatives}}{ \text{true positives} \cdot \text{true negatives} + \text{false positives} \cdot \text{false negatives}} \end{aligned}\end{split}\]

Notes

  • Range: -1 to 1, 0 indicates no skill. Perfect score: 1.

  • ORSS = Odds Ratio Skill Score

  • OR = Odds ratio, see: odds_ratio()

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

peirce_skill_score()#

Identical to hanssen_and_kuipers_discriminant() and true_skill_statistic().

How well did the forecast separate the “yes” events from the “no” events?

Returns:

An xarray object containing the Peirce Skill Score

Return type:

xr.DataArray

\[\text{Peirce skill score} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} - \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}\]

Notes

  • Range: -1 to 1, 0 indicates no skill. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

positive_predictive_value()#

Identical to success_ratio() and precision().

What proportion of the forecast events actually eventuated?

Returns:

An xarray object containing the positive predictive value score

Return type:

xr.DataArray

\[\text{positive predictive value} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False positives” is the same as “false alarms”.

References

precision()#

Identical to success_ratio() and positive_predictive_value().

What proportion of the forecast events actually eventuated?

Returns:

An xarray object containing the precision score

Return type:

xr.DataArray

\[\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False positives” is the same as “false alarms”.

References

probability_of_detection()#

Probability of detection (POD) is identical to hit_rate(), true_positive_rate(), sensitivity() and recall().

Calculates the proportion of the observed events that were correctly forecast.

Returns:

An xarray object containing the probability of detection

Return type:

xr.DataArray

\[\text{hit rate} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True positives” is the same as “hits”

  • “False negatives” is the same as “misses”

References

https://www.cawcr.gov.au/projects/verification/#POD

probability_of_false_detection()#

Identical to false_alarm_rate().

What fraction of the non-events were incorrectly predicted?

Returns:

An xarray object containing the probability of false detection

Return type:

xr.DataArray

\[\text{probability of false detection} = \frac{\text{false positives}}{\text{true negatives} + \text{false positives}}\]

Notes

  • Range: 0 to 1. Perfect score: 0.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#POFD

recall()#

Identical to hit_rate(), probability_of_detection(), true_positive_rate(), and sensitivity().

Calculates the proportion of the observed events that were correctly forecast.

Returns:

An xarray object containing the probability of detection

Return type:

xr.DataArray

\[\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

References

sensitivity()#

Identical to hit_rate(), probability_of_detection(), true_positive_rate(), and recall().

Calculates the proportion of the observed events that were correctly forecast.

Returns:

An xarray object containing the probability of detection

Return type:

xr.DataArray

\[\text{sensitivity} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

References

specificity()#

Identical to true_negative_rate().

The probability that an observed non-event will be correctly predicted.

Returns:

An xarray object containing the true negative rate (specificity).

Return type:

xr.DataArray

\[\text{specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}\]

Notes

  • “True negatives” is the same as “correct negatives”.

  • “False positives” is the same as “false alarms”.

Reference:
success_ratio()#

Identical to precision() and positive_predictive_value().

What proportion of the forecast events actually eventuated?

Returns:

An xarray object containing the success ratio

Return type:

xr.DataArray

\[\text{success ratio} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False positives” is the same as “false alarms”.

References

https://www.cawcr.gov.au/projects/verification/#SR

symmetric_extremal_dependence_index()#

Calculates the Symmetric Extremal Dependence Index (SEDI).

Returns:

An xarray object containing the SEDI score

Return type:

xr.DataArray

\[\frac{\ln(\text{POFD}) - \ln(\text{POD}) + \ln(1 - \text{POD}) - \ln(1 - \text{POFD})} {\ln(\text{POFD}) + \ln(\text{POD}) + \ln(1 - \text{POD}) + \ln(1 - \text{POFD})}\]

where

\[\text{POD} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

and

\[\text{POFD} = \frac{\text{false positives}}{\text{true negatives} + \text{false positives}}\]

Notes

  • POD = Probability of Detection

  • POFD = Probability of False Detection

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

  • Range: -1 to 1, Perfect score: 1.

References

Ferro, C.A.T. and Stephenson, D.B., 2011. Extremal dependence indices: Improved verification measures for deterministic forecasts of rare binary events. Weather and Forecasting, 26(5), pp.699-713. https://doi.org/10.1175/WAF-D-10-05030.1

threat_score()#

Identical to critical_success_index().

Returns:

An xarray object containing the threat score

Return type:

xr.DataArray

\[\text{threat score} = \frac{\text{true positives}}{\text{true positives} + \text{false positives} + \text{false negatives}}\]

Notes

  • Range: 0 to 1, 0 indicates no skill. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#CSI

true_negative_rate()#

Identical to specificity().

The probability that an observed non-event will be correctly predicted.

Returns:

An xarray object containing the true negative rate.

Return type:

xr.DataArray

\[\text{true negative rate} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}\]

Notes

  • “True negatives” is the same as “correct negatives”.

  • “False positives” is the same as “false alarms”.

Reference:

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

true_positive_rate()#

Identical to hit_rate(), probability_of_detection(), sensitivity() and recall().

The proportion of the observed events that were correctly forecast.

Returns:

An xarray object containing the true positive rate

Return type:

xr.DataArray

\[\text{true positive rate} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

  • Range: 0 to 1. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#POD

true_skill_statistic()#

Identical to peirce_skill_score() and hanssen_and_kuipers_discriminant().

How well did the forecast separate the “yes” events from the “no” events?

Returns:

An xarray object containing the true skill statistic

Return type:

xr.DataArray

\[\text{true skill statistic} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} - \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}\]

Notes

  • Range: -1 to 1, 0 indicates no skill. Perfect score: 1.

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#HK

yules_q()#

Identical to odds_ratio_skill_score().

Calculates Yule’s Q.

What was the improvement of the forecast over random chance?

Returns:

An xarray object containing Yule’s Q

Return type:

xr.DataArray

\[\begin{split}\begin{aligned} \text{Yule's Q} &= \frac{\text{OR} - 1}{\text{OR} + 1} \\ &= \frac{\text{true positives} \cdot \text{true negatives} - \text{false positives} \cdot \text{false negatives}}{ \text{true positives} \cdot \text{true negatives} + \text{false positives} \cdot \text{false negatives}} \end{aligned}\end{split}\]

Notes

  • Range: -1 to 1, 0 indicates no skill. Perfect score: 1.

  • OR = Odds ratio, see: odds_ratio().

  • “True positives” is the same as “hits”.

  • “False negatives” is the same as “misses”.

  • “False positives” is the same as “false alarms”.

  • “True negatives” is the same as “correct negatives”.

References

Stephenson, D.B., 2000. Use of the “odds ratio” for diagnosing forecast skill. Weather and Forecasting, 15(2), pp.221-232. https://doi.org/10.1175/1520-0434(2000)015%3C0221:UOTORF%3E2.0.CO;2

class scores.categorical.ThresholdEventOperator(*, precision=8, default_event_threshold=0.001, default_op_fn=<built-in function ge>)#

See https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html for a detailed walkthrough showing the use in practice.

A ThresholdEventOperator is used to produce a BinaryContingencyManager from forecast and observed data. It considers an event to be defined when the forecast and observed variables meet a particular threshold condition (e.g. rainfall above 1mm).

The class may be used for any variable, for any threshold, and for any comparison operator (e.g. greater-than, less-than, greater-than-or-equal-to, … )

make_contingency_manager(fcst, obs, *, event_threshold=None, op_fn=None)#

Using this function requires a careful understanding of the structure of the data and the use of the operator function. The default operator is a simple greater-than operator, so this will work on a simple DataArray. To work on a DataSet, a richer understanding is required. It is recommended to work through the tutorial at https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html . This tutorial reviews more complex use cases, including multivariate gridded model data, and station data structures.

Parameters:
  • fcst (DataArray | Dataset | Series)

  • obs (DataArray | Dataset | Series)

Return type:

BinaryContingencyManager

make_event_tables(fcst, obs, *, event_threshold=None, op_fn=None)#

Using this function requires a careful understanding of the structure of the data and the use of the operator function. The default operator is a simple greater-than operator, so this will work on a simple DataArray. To work on a DataSet, a richer understanding is required. It is recommended to work through the tutorial at https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html . This tutorial reviews more complex use cases, including multivariate gridded model data, and station data structures.

Parameters:
  • fcst (DataArray | Dataset | Series)

  • obs (DataArray | Dataset | Series)
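
A minimal sketch of typical usage (not taken from the scores documentation): the forecast and observed values, and the 1.0 threshold, are purely illustrative, and it is assumed that the binary contingency scores documented earlier in this section (e.g. true_positive_rate()) are exposed as methods of the returned BinaryContingencyManager.

>>> import operator
>>> import xarray as xr
>>> from scores.categorical import ThresholdEventOperator
>>> fcst = xr.DataArray([0.2, 1.4, 0.0, 3.1], dims=["time"])
>>> obs = xr.DataArray([0.0, 2.0, 0.3, 1.5], dims=["time"])
>>> # Define an "event" as a value >= 1.0 (e.g. rainfall above 1 mm)
>>> event_op = ThresholdEventOperator(default_event_threshold=1.0, default_op_fn=operator.ge)
>>> manager = event_op.make_contingency_manager(fcst, obs)
>>> manager.true_positive_rate()  # assumed: contingency scores above are methods of the manager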

class scores.categorical.EventOperator#

Abstract Base Class (ABC) for event operators which can be used in deriving contingency tables. ABCs are not used directly but instead define the requirements for other classes and may be used for type checking.

abstract make_contingency_manager(fcst, obs, *, event_threshold=None, op_fn=None)#

This method should be over-ridden to return a contingency table.

Parameters:
  • fcst (DataArray | Dataset | Series)

  • obs (DataArray | Dataset | Series)

abstract make_event_tables(fcst, obs, *, event_threshold=None, op_fn=None)#

This method should be over-ridden to return forecast and observed event tables

Parameters:
  • fcst (DataArray | Dataset | Series)

  • obs (DataArray | Dataset | Series)

scores.spatial#

class scores.fast.fss.typing.FssComputeMethod(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#

Choice of compute backend for FSS.

INVALID = -1#

invalid backend

NUMPY = 1#

computations using numpy (default)

NUMBA = 2#

computations using numba (currently unimplemented)

scores.spatial.fss_2d(fcst, obs, *, event_threshold, window_size, spatial_dims, zero_padding=False, reduce_dims=None, preserve_dims=None, threshold_operator=None, compute_method=FssComputeMethod.NUMPY, dask='forbidden')#

Uses fss_2d_single_field() to compute the Fractions Skill Score (FSS) for each 2D spatial field in the DataArray, and then aggregates the resulting scores over the gathered (non-spatial) dimensions.

The aggregation method is the intended extension of the score defined by Roberts and Lean (2008) for multiple forecasts [1, 2].

Note

This method takes in a threshold_operator (defaulting to numpy.greater) to compare the input fields against the event_threshold, and is compatible with lightweight numpy operators. If you would like to do more advanced binary discretization, consider doing this separately and using scores.spatial.fss_2d_binary() instead, which takes in a pre-discretized binary field.

Optionally aggregates the output along other dimensions if reduce_dims or (mutually exclusive) preserve_dims are specified.

For the implementation for a single 2-D field, see fss_2d_single_field(). This offers the user the ability to use their own aggregation methods, rather than relying on xarray.apply_ufunc.

Warning

dask option is not fully tested

Parameters:
  • fcst (DataArray) – An array of forecasts

  • obs (DataArray) – An array of observations (same spatial shape as fcst)

  • event_threshold (float) – A scalar threshold against which the fcst and obs fields are compared (using threshold_operator) to generate binary “event” fields.

  • window_size (Tuple[int, int]) – A pair of positive integers (height, width) defining the sliding window; the window size must be greater than 0 and fit within the shape of obs and fcst.

  • spatial_dims (Tuple[str, str]) – A pair of dimension names (x, y) where x and y are the spatial dimensions to slide the window along to compute the FSS. e.g. ("lat", "lon").

  • zero_padding (bool) – If set to true, applies a 0-valued window border around the data field before computing the FSS. If set to false (default), it uses the edges (thickness = window size) of the input fields as the border. One can think of it as: zero_padding = False => inner border; zero_padding = True => outer border with 0 values.

  • reduce_dims (Iterable[Hashable] | None) – Optional set of additional dimensions to aggregate over (mutually exclusive to preserve_dims).

  • preserve_dims (Iterable[Hashable] | None) – Optional set of dimensions to keep (all other dimensions are reduced); By default all dimensions except spatial_dims are preserved (mutually exclusive to reduce_dims).

  • threshold_operator (Callable | None) – The threshold operator used to generate the binary event field. E.g. np.greater. Note: this may depend on the backend compute method and not all operators may be supported. Generally, np.greater, np.less and their counterparts. Defaults to “np.greater”.

  • compute_method (FssComputeMethod) – currently only supports FssComputeMethod.NUMPY see: scores.fast.fss.typing.FssComputeMethod

  • dask (str) – See xarray.apply_ufunc for options

Returns:

An xarray.DataArray containing the FSS computed over the spatial_dims

The resultant array will have the score grouped against the remaining dimensions, unless reduce_dims/preserve_dims are specified; in which case, they will be aggregated over the specified dimensions accordingly.

For an exact usage please refer to FSS.ipynb in the tutorials.

Raises:
  • DimensionError – Various errors are thrown if the input dimensions do not conform, e.g. if the window size is larger than the input arrays, or if the spatial dimensions in the args are missing in the input arrays.

  • DimensionWarning – If spatial_dims are attempting to be preserved e.g. in preserve_dims

Return type:

DataArray

References

  1. Roberts, N. M., and H. W. Lean, 2008: Scale-Selective Verification of Rainfall Accumulations from High-Resolution Forecasts of Convective Events. Monthly Weather Review, 136, 78–97, https://doi.org/10.1175/2007mwr2123.1.

  2. Mittermaier, M. P., 2021: A “Meta” Analysis of the Fractions Skill Score: The Limiting Case and Implications for Aggregation. Monthly Weather Review, 149, 3491–3504, https://doi.org/10.1175/mwr-d-18-0106.1.
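
A minimal illustrative sketch (not taken from the scores documentation), using synthetic random fields; the threshold and window size are arbitrary. By default all non-spatial dimensions (here "time") are preserved, so the result contains one FSS value per time step.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.spatial import fss_2d
>>> rng = np.random.default_rng(42)
>>> fcst = xr.DataArray(rng.random((2, 40, 60)), dims=["time", "lat", "lon"])
>>> obs = xr.DataArray(rng.random((2, 40, 60)), dims=["time", "lat", "lon"])
>>> fss_2d(fcst, obs, event_threshold=0.7, window_size=(5, 5), spatial_dims=("lat", "lon"))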

scores.spatial.fss_2d_binary(fcst, obs, *, window_size, spatial_dims, zero_padding=False, reduce_dims=None, preserve_dims=None, compute_method=FssComputeMethod.NUMPY, check_boolean=True, dask='forbidden')#

Computes the Fractions Skill Score (FSS) using a binary field.

Takes in a binary (True or False) field of fcst and obs events. For example the output of a binary threshold applied to a continuous field or an event operator.

Optionally the user can set check_boolean to True to check that the field is boolean before any computations. Note: this asserts that the underlying fields are of type np.bool_ but does not coerce them. If the user is confident that the input field is binary (but not necessarily boolean), they may set this flag to False, to allow for more flexible binary configurations such as 0 and 1; this should give the same results.

Uses fss_2d from the scores.spatial module to perform fss computation, but without the threshold operation.

See also

scores.spatial.fss_2d() for more details, the continuous version, and detailed argument definitions.

Parameters:
  • fcst (DataArray)

  • obs (DataArray)

  • window_size (Tuple[int, int])

  • spatial_dims (Tuple[str, str])

  • zero_padding (bool)

  • reduce_dims (Iterable[Hashable] | None)

  • preserve_dims (Iterable[Hashable] | None)

  • compute_method (FssComputeMethod)

  • check_boolean (bool)

  • dask (str)

Return type:

DataArray

scores.spatial.fss_2d_single_field(fcst, obs, *, event_threshold, window_size, zero_padding=False, threshold_operator=None, compute_method=FssComputeMethod.NUMPY)#

Calculates the Fractions Skill Score (FSS) [1] for a given forecast and observed 2-D field.

The FSS is computed by counting the squared sum of forecast and observation events in a given window size. This is repeated over all possible window positions that can fit in the input arrays. A common method to do this is via a sliding window.

While the various compute backends may have their own implementations; most of them use some form of prefix sum algorithm to compute this quickly.

For 2-D fields this data structure is known as the “Summed Area Table” [2].

Once the squared sums are computed, the final FSS value can be derived by accumulating the squared sums [3].

The caller is responsible for making sure the input fields are in the 2-D spatial domain. (Although it should work for any np.array as long as it’s 2D, and window_size is appropriately sized.)

Parameters:
  • fcst (ndarray[Any, dtype[float]]) – An array of forecasts

  • obs (ndarray[Any, dtype[float]]) – An array of observations (same spatial shape as fcst)

  • event_threshold (float) – A scalar to compare fcst and obs fields to generate a binary “event” field.

  • window_size (Tuple[int, int]) – A pair of positive integers (height, width) defining the sliding window; the window dimensions must be greater than 0 and fit within the shape of obs and fcst.

  • threshold_operator (Callable | None) – The threshold operator used to generate the binary event field. E.g. np.greater. Note: this may depend on the backend compute method and not all operators may be supported. Generally, np.greater, np.less and their counterparts. Defaults to “np.greater”.

  • compute_method (FssComputeMethod) – currently only supports FssComputeMethod.NUMPY; see scores.fast.fss.typing.FssComputeMethod.

  • zero_padding (bool)

Returns:

A float representing the accumulated FSS.

Raises:

DimensionError – Various errors are thrown if the input dimensions do not conform, e.g. if the window size is larger than the input arrays, or if the spatial dimensions of the input arrays do not match.

Return type:

float64

References

  1. Roberts, N. M., and H. W. Lean, 2008: Scale-Selective Verification of Rainfall Accumulations from High-Resolution Forecasts of Convective Events. Monthly Weather Review 136, 78–97, https://doi.org/10.1175/2007mwr2123.1.

  2. https://en.wikipedia.org/wiki/Summed-area_table

  3. FAGGIAN, N., B. ROUX, P. STEINLE, and B. EBERT, 2015: Fast calculation of the fractions skill score. MAUSAM, 66, 457–466, https://doi.org/10.54302/mausam.v66i3.555.
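
A minimal illustrative sketch (not taken from the scores documentation), using synthetic 2-D numpy fields; the threshold and window size are arbitrary.

>>> import numpy as np
>>> from scores.spatial import fss_2d_single_field
>>> rng = np.random.default_rng(0)
>>> fcst = rng.random((40, 60))
>>> obs = rng.random((40, 60))
>>> fss_2d_single_field(fcst, obs, event_threshold=0.7, window_size=(5, 5))  # returns a single float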

scores.stats#

scores.stats.statistical_tests.diebold_mariano(da_timeseries, ts_dim, h_coord, *, method='HG', confidence_level=0.95, statistic_distribution='normal')#

Given an array of (multiple) timeseries, with each timeseries consisting of score differences for h-step ahead forecasts, calculates a modified Diebold-Mariano test statistic for each timeseries. Several other statistics are also returned such as the confidence that the population mean of score differences is greater than zero and confidence intervals for that mean.

Two methods for calculating the test statistic have been implemented: the “HG” method of Hering and Genton (2011) and the “HLN” method of Harvey, Leybourne and Newbold (1997). The default “HG” method has the advantage of only generating positive estimates for the spectral density contribution to the test statistic. For further details see scores.stats.confidence_intervals.impl._dm_test_statistic.

Prior to any calculations, NaNs are removed from each timeseries. If there are NaNs in da_timeseries then a warning will occur. This is because NaNs may impact the autocovariance calculation.

To determine the value of h for each timeseries of score differences of h-step ahead forecasts, one may ask ‘How many observations of the phenomenon will be made between making the forecast and having the observation that will validate the forecast?’ For example, suppose that the phenomenon is afternoon precipitation accumulation in New Zealand (00z to 06z each day). Then a Day+1 forecast issued at 03z on Day+0 will be a 2-step ahead forecast, since Day+0 and Day+1 accumulations will be observed before the forecast can be validated. On the other hand, a Day+1 forecast issued at 09z on Day+0 will be a 1-step ahead forecast. The value of h for each timeseries in the array needs to be specified in one of the sets of coordinates. See the example below.

Confidence intervals and “confidence_gt_0” statistics are calculated using the test statistic, which is assumed to have either the standard normal distribution or Student’s t distribution with n - 1 degrees of freedom (where n is the length of the timeseries). The distribution used is specified by statistic_distribution. See Harvey, Leybourne and Newbold (1997) for why the t distribution may be preferred, especially for shorter timeseries.

If da_timeseries is a chunked array, data will be brought into memory during this calculation due to the autocovariance implementation.

Parameters:
  • da_timeseries (DataArray) – a 2 dimensional array containing the timeseries.

  • ts_dim (str) – name of the dimension which identifies each timeseries in the array.

  • h_coord (str) – name of the coordinate specifying, for each timeseries, that the timeseries is an h-step ahead forecast. h_coord coordinates must be indexed by the dimension ts_dim.

  • method (Literal['HG', 'HLN']) – method for calculating the test statistic, one of “HG” or “HLN”.

  • confidence_level (float) – the confidence level, between 0 and 1 exclusive, at which to calculate confidence intervals.

  • statistic_distribution (Literal['normal', 't']) – the distribution of the test-statistic under the null hypothesis of equipredictive skill. Used to calculate the “confidence_gt_0” statistic and confidence intervals. One of “normal” or “t” (for Student’s t distribution).

Returns:

  • “mean”: the mean value for each timeseries, ignoring NaNs

  • ”dm_test_stat”: the modified Diebold-Mariano test statistic for each timeseries

  • ”timeseries_len”: the length of each timeseries, with NaNs removed.

  • ”confidence_gt_0”: the confidence that the mean value of the population is greater than zero, based on the specified statistic_distribution. Precisely, it is the value of the cumulative distribution function evaluated at dm_test_stat.

  • ”ci_upper”: the upper end point of a confidence interval about the mean at specified confidence_level.

  • ”ci_lower”: the lower end point of a confidence interval about the mean at specified confidence_level.

Return type:

Dataset, indexed by ts_dim, with six variables

Raises:
  • ValueError – if method is not one of “HG” or “HLN”.

  • ValueError – if statistic_distribution is not one of “normal” or “t”.

  • ValueError – if 0 < confidence_level < 1 fails.

  • ValueError – if len(da_timeseries.dims) != 2.

  • ValueError – if ts_dim is not a dimension of da_timeseries.

  • ValueError – if h_coord is not a coordinate of da_timeseries.

  • ValueError – if ts_dim is not the only dimension of da_timeseries[h_coord].

  • ValueError – if h_coord values aren’t positive integers.

  • ValueError – if h_coord values aren’t less than the lengths of the timeseries after NaNs are removed.

  • RuntimeWarning – if there is a NaN in any of the timeseries of score differences.

References

  • Diebold and Mariano, ‘Comparing predictive accuracy’, Journal of Business and Economic Statistics 13 (1995), 253-265.

  • Hering and Genton, ‘Comparing spatial predictions’, Technometrics 53 no. 4 (2011), 414-425.

  • Harvey, Leybourne and Newbold, ‘Testing the equality of prediction mean squared errors’, International Journal of Forecasting 13 (1997), 281-291.

Example

This array gives three timeseries of score differences. Coordinates in the “lead_day” dimension uniquely identify each timeseries. Here ts_dim=”lead_day”. Coordinates in the “valid_date” dimension give the forecast validity timestamp of each item in the timeseries. The “h” coordinates specify that the timeseries are for 2, 3 and 4-step ahead forecasts respectively. Here h_coord=”h”.

>>> da_timeseries = xr.DataArray(
...     data=[[1, 2, 3.0, 4, np.nan], [2.0, 1, -3, -1, 0], [1.0, 1, 1, 1, 1]],
...     dims=["lead_day", "valid_date"],
...     coords={
...         "lead_day": [1, 2, 3],
...         "valid_date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"],
...         "h": ("lead_day", [2, 3, 4]),
...     },
... )
>>> diebold_mariano(da_timeseries, "lead_day", "h")

scores.processing#

scores.processing.isotonic_fit(fcst, obs, *, weight=None, functional='mean', bootstraps=None, quantile_level=None, solver=None, confidence_level=0.9, min_non_nan=1, report_bootstrap_results=False)#

Calculates an isotonic regression fit to observations given forecasts as the explanatory values. Optionally returns confidence bands on the fit via bootstrapping. Forecasts and observations are scalar-valued (i.e., integers, floats or NaN).

If forecasts target the mean or a quantile functional, then the user can select that functional. Otherwise the user needs to supply a solver function

  • of one variable (if no weights are supplied), or

  • of two variables (if weights are supplied).

The solver function is used to find an optimal regression fit in each block, subject to a non-decreasing monotonicity constraint, using the pool adjacent violators (PAV) algorithm.

An example of a solver function of one variable is obs -> np.mean(obs). An example of a solver function of two variables is (obs, weight) -> np.average(obs, weights=weight).

Ties (where one forecast value has more than one observation value) are handled as follows. Forecasts are sorted in ascending order, and observations sorted to match. Where ties occur, observations are sorted in descending order within each subset of ties. Isotonic regression is then performed on this sorted observation sequence.

This implementation uses sklearn.isotonic.IsotonicRegression when functional=”mean”.

Parameters:
  • fcst (ndarray | DataArray) – np.array or xr.DataArray of forecast (or explanatory) values. Values must be float, integer or NaN.

  • obs (ndarray | DataArray) – np.array or xr.DataArray of corresponding observation (or response) values. Must be the same shape as fcst. Values must be float, integer or NaN.

  • weight (ndarray | DataArray | None) – positive weights to apply to each forecast-observation pair. Must be the same shape as fcst, or None (which is equivalent to applying equal weights).

  • functional (Literal['mean', 'quantile'] | None) – Functional that the forecasts are targeting. Either “mean” or “quantile” or None. If None then solver must be supplied. The current implementation for “quantile” does not accept weights. If weighted quantile regression is desired then the user should supply an appropriate solver.

  • bootstraps (int | None) – the number of bootstrap samples to perform for calculating the regression confidence band. Set to None if a confidence band is not required.

  • quantile_level (float | None) – the level of the quantile functional if functional=’quantile’. Must be strictly between 0 and 1.

  • solver (Callable[[ndarray, ndarray], float] | Callable[[ndarray], float] | None) – function that accepts 1D numpy array of observations and returns a float. Function values give the regression fit for each block as determined by the PAV algorithm. If weights are supplied, the function must have a second argument that accepts 1D numpy array of weights of same length as the observations.

  • confidence_level (float | None) – Confidence level of the confidence band, strictly between 0 and 1. For example, a confidence level of 0.5 will calculate the confidence band by computing the 25th and 75th percentiles of the bootstrapped samples.

  • min_non_nan (int | None) – The minimum number of non-NaN bootstrapped values to calculate confidence bands at a particular forecast value.

  • report_bootstrap_results (bool | None) – Specifies whether to keep the bootstrapped regression values in the return dictionary. Default is False, to keep the output small.

Returns:

  • “unique_fcst_sorted”: 1D numpy array of remaining forecast values sorted in ascending order, after any NaNs from fcst, obs and weight are removed, and only unique values are kept to keep the output size reasonably small.

  • ”fcst_counts”: 1D numpy array of forecast counts for the unique, sorted forecast values.

  • ”regression_values”: 1D numpy array of regression values corresponding to “unique_fcst_sorted” values.

  • ”regression_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “regression_values”), for any supplied argument (1D numpy array) of potential forecast values.

  • ”bootstrap_results”: in the case of report_bootstrap_results=True, a 2D numpy array of bootstrapped regression values is included in the return dictionary. Each row gives the interpolated regression values from a particular bootstrapped sample, evaluated at “fcst_sorted” values. If m is the number of bootstrap samples and n = len(fcst_sorted) then it has shape (m, n). We emphasise that this array corresponds to fcst_sorted, not unique_fcst_sorted.

  • ”confidence_band_lower_values”: values of the lower confidence band threshold, evaluated at “unique_fcst_sorted” values.

  • ”confidence_band_upper_values”: values of the upper confidence band threshold, evaluated at “unique_fcst_sorted” values.

  • ”confidence_band_lower_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “confidence_band_lower_values”), given any argument (1D numpy array) of potential forecast values.

  • ”confidence_band_upper_func”: function that returns the regression fit based on linear interpolation of (“fcst_sorted”, “confidence_band_upper_values”), given any argument (1D numpy array) of potential forecast values.

  • ”confidence_band_levels”: tuple giving the quantile levels used to calculate the confidence band.

Return type:

Dictionary with the following keys

Raises:
  • ValueError – if fcst and obs are np.arrays and don’t have the same shape.

  • ValueError – if fcst and weight are np.arrays and don’t have the same shape.

  • ValueError – if fcst and obs are xarray.DataArrays and don’t have the same dimensions.

  • ValueError – if fcst and weight are xarray.DataArrays and don’t have the same dimensions.

  • ValueError – if any entries of fcst, obs or weight are not an integer, float or NaN.

  • ValueError – if there are no fcst and obs pairs after NaNs are removed.

  • ValueError – if functional is not one of “mean”, “quantile” or None.

  • ValueError – if quantile_level is not strictly between 0 and 1 when functional=”quantile”.

  • ValueError – if weight is not None when functional=”quantile”.

  • ValueError – if functional and solver are both None.

  • ValueError – if not exactly one of functional and solver is None.

  • ValueError – if bootstraps is not a positive integer.

  • ValueError – if confidence_level is not strictly between 0 and 1.

Note: This function only keeps the unique values of fcst_sorted to keep the volume of the return dictionary small. The forecast counts are also included, so users can use them to create a forecast histogram (usually displayed in reliability diagrams).

References
  • de Leeuw, Hornik and Mair. “Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods”, Journal of Statistical Software, 2009.

  • Dimitriadis, Gneiting and Jordan. “Stable reliability diagrams for probabilistic classifiers”, PNAS, Vol. 118 No. 8, 2020. Available at https://www.pnas.org/doi/10.1073/pnas.2016191118

  • Jordan, Mühlemann, and Ziegel. “Optimal solutions to the isotonic regression problem”, 2020 (version 2), available on arxiv at https://arxiv.org/abs/1904.04761
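
A minimal illustrative sketch (not taken from the scores documentation); the forecast and observation values are arbitrary. The fitted regression values are nondecreasing in the forecast values and are returned via the dictionary keys described above.

>>> import numpy as np
>>> from scores.processing import isotonic_fit
>>> fcst = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> obs = np.array([1.3, 1.1, 3.4, 3.1, 5.2])
>>> fit = isotonic_fit(fcst, obs, functional="mean")
>>> fit["unique_fcst_sorted"], fit["regression_values"]  # nondecreasing regression fit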

scores.processing.broadcast_and_match_nan(*args: Dataset) tuple[Dataset, ...]#
scores.processing.broadcast_and_match_nan(*args: DataArray) tuple[DataArray, ...]

Input xarray data objects are ‘matched’: they are broadcast against each other (forced to have the same dimensions), and the positions of NaNs are forced onto all DataArrays. This matching process is applied across all supplied DataArrays, as well as all DataArrays inside supplied Datasets.

Parameters:

*args – any number of xarray data objects supplied as positional arguments. See examples below.

Returns:

A tuple of data objects of the same length as the number of data objects supplied as input. Each returned object is the ‘matched’ version of the input.

Raises:

ValueError – if any input args is not an xarray data object.

Examples

>>> # Matching xarray data objects
>>> da1_matched, ds_matched, da2_matched = scores.processing.broadcast_and_match_nan(da1, ds, da2)
>>> # Matching a tuple of xarray data objects
>>> input_tuple = (da1, ds, da2)
>>> matched_tuple = broadcast_and_match_nan(*input_tuple)
>>> da1_matched = matched_tuple[0]
>>> ds_matched = matched_tuple[1]
>>> da2_matched = matched_tuple[2]

scores.processing.comparative_discretise(data, comparison, mode, *, abs_tolerance=None)#

Converts the values of data to 0 or 1 based on how they relate to the specified values in comparison via the mode operator.

Parameters:
  • data (DataArray | Dataset) – The data to convert to discrete values.

  • comparison (DataArray | float | int) – The values to which to compare data.

  • mode (Callable | str) –

    Specifies the required relation of data to thresholds for a value to fall in the ‘event’ category (i.e. assigned to 1). The mode can be a string or a Python operator function from the list below. Allowed modes are:

    • operator.ge: Where values in data greater than or equal to the corresponding threshold are assigned as 1.

    • operator.gt: Where values in data greater than the corresponding threshold are assigned as 1.

    • operator.le: Where values in data less than or equal to the corresponding threshold are assigned as 1.

    • operator.lt: Where values in data less than the corresponding threshold are assigned as 1.

    • operator.eq: values in data equal to the corresponding threshold are assigned as 1.

    • operator.ne: values in data not equal to the corresponding threshold are assigned as 1.

    • ’>=’ values in data greater than or equal to the corresponding threshold are assigned as 1.

    • ’>’ values in data greater than the corresponding threshold are assigned as 1.

    • ’<=’ values in data less than or equal to the corresponding threshold are assigned as 1.

    • ’<’ values in data less than the corresponding threshold are assigned as 1.

    • ’==’ values in data equal to the corresponding threshold are assigned as 1.

    • ’!=’ values in data not equal to the corresponding threshold are assigned as 1.

  • abs_tolerance (float | None) – If supplied, values in data that are within abs_tolerance of a threshold are considered to be equal to that threshold. This is generally used to correct for floating point rounding, e.g. we may want to consider 1.0000000000000002 as equal to 1.

Returns:

An xarray data object of the same type as data. The dimensions of the output are the union of all dimensions in data and comparison. The values of the output are either 0 or 1 or NaN, depending on the truth of the operation data <mode> comparison.

Raises:
  • ValueError – if abs_tolerance is not a non-negative float.

  • ValueError – if mode is not valid.

  • TypeError – if comparison is not a float, int or xarray.DataArray.

Return type:

DataArray | Dataset
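
A minimal illustrative sketch (not taken from the scores documentation), comparing a DataArray against a scalar with the '>=' mode:

>>> import xarray as xr
>>> from scores.processing import comparative_discretise
>>> data = xr.DataArray([0.2, 0.5, 0.8], dims=["time"])
>>> comparative_discretise(data, 0.5, ">=")  # values become 0., 1., 1.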

scores.processing.binary_discretise(data, thresholds, mode, *, abs_tolerance=None, autosqueeze=False)#

Converts the values of data to 0 or 1 for each threshold in thresholds according to the operation defined by mode.

Parameters:
  • data (DataArray | Dataset) – The data to convert to discrete values.

  • thresholds (Iterable[Hashable] | None) – Threshold(s) at which to convert the values of data to 0 or 1.

  • mode (Callable | str) –

    Specifies the required relation of data to thresholds for a value to fall in the ‘event’ category (i.e. assigned to 1). The mode can be a string or a Python operator function from the list below. Allowed modes are:

    • operator.ge: Where values in data greater than or equal to the corresponding threshold are assigned as 1.

    • operator.gt: Where values in data greater than the corresponding threshold are assigned as 1.

    • operator.le: Where values in data less than or equal to the corresponding threshold are assigned as 1.

    • operator.lt: Where values in data less than the corresponding threshold are assigned as 1.

    • operator.eq: values in data equal to the corresponding threshold are assigned as 1.

    • operator.ne: values in data not equal to the corresponding threshold are assigned as 1.

    • ’>=’ values in data greater than or equal to the corresponding threshold are assigned as 1.

    • ’>’ values in data greater than the corresponding threshold are assigned as 1.

    • ’<=’ values in data less than or equal to the corresponding threshold are assigned as 1.

    • ’<’ values in data less than the corresponding threshold are assigned as 1.

    • ’==’ values in data equal to the corresponding threshold are assigned as 1.

    • ’!=’ values in data not equal to the corresponding threshold are assigned as 1.

  • abs_tolerance (float | None) – If supplied, values in data that are within abs_tolerance of a threshold are considered to be equal to that threshold. This is generally used to correct for floating point rounding, e.g. we may want to consider 1.0000000000000002 as equal to 1

  • autosqueeze (bool | None) – If True and only one threshold is supplied, then the dimension ‘threshold’ is squeezed out of the output. If thresholds is float-like, then this is forced to True, otherwise defaults to False.

Returns:

An xarray data object with the type and dimensions of data, plus an extra dimension ‘threshold’ if autosqueeze is False. The values of the output are either 0 or 1, depending on whether data <mode> threshold is True or not (although NaNs are preserved).

Raises:
  • ValueError – if ‘threshold’ is a dimension in data.

  • ValueError – if values in thresholds are not monotonically increasing.
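
A minimal illustrative sketch (not taken from the scores documentation), discretising at two thresholds with operator.ge; the output gains a ‘threshold’ dimension of length 2.

>>> import operator
>>> import xarray as xr
>>> from scores.processing import binary_discretise
>>> data = xr.DataArray([0.1, 0.4, 0.9], dims=["time"])
>>> binary_discretise(data, [0.25, 0.5], operator.ge)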

scores.processing.binary_discretise_proportion(data, thresholds, mode, *, reduce_dims=None, preserve_dims=None, abs_tolerance=None, autosqueeze=False)#

Returns the proportion of data in each category. The categories are defined by the relationship of data to threshold as specified by the operation mode.

Parameters:
  • data (DataArray | Dataset) – The data to convert into 0 and 1 according the thresholds before calculating the proportion.

  • thresholds (Iterable) – Threshold(s) against which to compare the data; the proportion of values satisfying the mode relation to each threshold will be calculated.

  • mode (Callable | str) –

    Specifies the required relation of data to thresholds for a value to fall in the ‘event’ category (i.e. assigned to 1). The mode can be a string or a Python operator function from the list below. Allowed modes are:

    • operator.ge: Where values in data greater than or equal to the corresponding threshold are assigned as 1.

    • operator.gt: Where values in data greater than the corresponding threshold are assigned as 1.

    • operator.le: Where values in data less than or equal to the corresponding threshold are assigned as 1.

    • operator.lt: Where values in data less than the corresponding threshold are assigned as 1.

    • operator.eq: values in data equal to the corresponding threshold are assigned as 1.

    • operator.ne: values in data not equal to the corresponding threshold are assigned as 1.

    • ’>=’ values in data greater than or equal to the corresponding threshold are assigned as 1.

    • ’>’ values in data greater than the corresponding threshold are assigned as 1.

    • ’<=’ values in data less than or equal to the corresponding threshold are assigned as 1.

    • ’<’ values in data less than the corresponding threshold are assigned as 1.

    • ’==’ values in data equal to the corresponding threshold are assigned as 1.

    • ’!=’ values in data not equal to the corresponding threshold are assigned as 1.

  • reduce_dims (Iterable[Hashable] | None) – Dimensions to reduce.

  • preserve_dims (Iterable[Hashable] | None) – Dimensions to preserve.

  • abs_tolerance (float | None) – If supplied, values in data that are within abs_tolerance of a threshold are considered to be equal to that threshold. This is generally used to correct for floating point rounding, e.g. we may want to consider 1.0000000000000002 as equal to 1.

  • autosqueeze (bool) – If True and only one threshold is supplied, then the dimension ‘threshold’ is squeezed out of the output. If thresholds is float-like, then this is forced to True, otherwise defaults to False.

Returns:

An xarray data object with the type of data, dimension dims + ‘threshold’. The values of the output are the proportion of data that satisfy the relationship to thresholds as specified by mode.

Examples

>>> import operator
>>> import xarray as xr
>>> from scores.processing import binary_discretise_proportion
>>> data = xr.DataArray([0, 0.5, 0.5, 1])
>>> binary_discretise_proportion(data, [0, 0.5, 1], operator.eq)
<xarray.DataArray (threshold: 3)>
array([ 0.25,  0.5 ,  0.25])
Coordinates:
  * threshold  (threshold) float64 0.0 0.5 1.0
Attributes:
    discretisation_tolerance: 0
    discretisation_mode: ==
>>> binary_discretise_proportion(data, [0, 0.5, 1], operator.ge)
<xarray.DataArray (threshold: 3)>
array([ 1.  ,  0.75,  0.25])
Coordinates:
  * threshold  (threshold) float64 0.0 0.5 1.0
Attributes:
    discretisation_tolerance: 0
    discretisation_mode: >=

scores.processing.proportion_exceeding(data, thresholds, *, reduce_dims=None, preserve_dims=None)#

Calculates the proportion of data equal to or exceeding thresholds.

Parameters:
  • data (xarray.Dataset or xarray.DataArray) – The data from which to calculate the proportion exceeding thresholds

  • thresholds (iterable) – The proportion of values equal to or exceeding these thresholds will be calculated.

  • reduce_dims (Iterable[Hashable] | None) – Dimensions to reduce.

  • preserve_dims (Iterable[Hashable] | None) – Dimensions to preserve.

Returns:

An xarray data object with the type of data and dimensions dims + ‘threshold’. The values are the proportion of data that are greater than or equal to the corresponding threshold.
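
A minimal illustrative sketch (not taken from the scores documentation), reusing the data from the binary_discretise_proportion example above; the proportions equal to or exceeding the thresholds 0, 0.5 and 1 are 1.0, 0.75 and 0.25 respectively, returned along a new ‘threshold’ dimension.

>>> import xarray as xr
>>> from scores.processing import proportion_exceeding
>>> data = xr.DataArray([0, 0.5, 0.5, 1])
>>> proportion_exceeding(data, [0, 0.5, 1])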

scores.processing.cdf.round_values(array, rounding_precision, *, final_round_decpl=7)#

Round data array to specified precision.

Rounding is done differently from xarray.DataArray.round or numpy.round, where the number of decimal places is specified. Here, the rounding precision is instead specified as a float, and the value is rounded to the nearest multiple of rounding_precision.

For example, 3.73 rounded to precision 0.2 is 3.8, and 37.3 rounded to precision 20 is 40.

Assumes that rounding_precision >= 0, with 0 indicating that no rounding is to be performed. If rounding_precision > 0, a final round to final_round_decpl decimal places is performed to remove artefacts of the Python rounding process.

Parameters:
  • array (xr.DataArray) – array of data to be rounded

  • rounding_precision (float) – rounding precision

  • final_round_decpl (int) – final round to specified number of decimal places when rounding_precision > 0.

Returns:

DataArray with rounded values.

Return type:

xr.DataArray

Raises:

ValueError – If rounding_precision < 0.
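
A minimal illustrative sketch (not taken from the scores documentation), reproducing the worked examples above:

>>> import xarray as xr
>>> from scores.processing.cdf import round_values
>>> round_values(xr.DataArray([3.73]), 0.2)  # rounds to 3.8
>>> round_values(xr.DataArray([37.3]), 20)   # rounds to 40.0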

scores.processing.cdf.propagate_nan(cdf, threshold_dim)#

Propagates the NaN values from a “cdf” variable along the threshold_dim.

Parameters:
  • cdf (xr.DataArray) – CDF values, so that P(X <= threshold) = cdf_value for each threshold in the threshold_dim dimension.

  • threshold_dim (str) – name of the threshold dimension in cdf.

Returns:

cdf variable with NaNs propagated.

Return type:

xr.DataArray

Raises:

ValueError – If threshold_dim is not a dimension of cdf.

scores.processing.cdf.observed_cdf(obs, threshold_dim, *, threshold_values=None, include_obs_in_thresholds=True, precision=0)#

Returns a data array of observations converted into CDF format.

Such that:

returned_value = 0 if threshold < observation

returned_value = 1 if threshold >= observation

Parameters:
  • obs (xr.DataArray) – observations

  • threshold_dim (str) – name of dimension in returned array that contains the threshold values.

  • threshold_values (Optional[Iterable[float]]) – values to include among thresholds.

  • include_obs_in_thresholds (bool) – if True, include (rounded) observed values among thresholds.

  • precision (float) – precision applied to observed values prior to constructing the CDF and thresholds. Select 0 for highest precision (i.e. no rounding).

Returns:

Observed CDFs and thresholds in the threshold_dim dimension.

Return type:

xr.DataArray

Raises:
  • ValueError – if precision < 0.

  • ValueError – if all observations are NaN and no non-NaN threshold_values are supplied.
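
A minimal illustrative sketch (not taken from the scores documentation): the observations and threshold values are arbitrary. The returned thresholds are the union of the supplied threshold_values and the (rounded) observed values, with CDF values of 0 below each observation and 1 at or above it.

>>> import xarray as xr
>>> from scores.processing.cdf import observed_cdf
>>> obs = xr.DataArray([0.4, 1.2], dims=["station"])
>>> observed_cdf(obs, "threshold", threshold_values=[0.0, 0.5, 1.0, 1.5])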

scores.processing.cdf.integrate_square_piecewise_linear(function_values, threshold_dim)#

Calculates integral values and collapses threshold_dim.

Calculates \(\int{(F(t)^2)}\), where:
  • If t is a threshold value in threshold_dim then F(t) is in function_values,

  • F is piecewise linear between each of the t values in threshold_dim.

This function assumes that:
  • threshold_dim is a dimension of function_values

  • coordinates of threshold_dim are increasing.

Parameters:
  • function_values (xr.DataArray) – array of function values F(t).

  • threshold_dim (str) – name of the dimension along which to integrate.

Returns:

Integral values and threshold_dim collapsed:

  • Returns value of the integral with threshold_dim collapsed and other dimensions preserved.

  • Returns NaN if there are less than two non-NaN function_values.

Return type:

xr.DataArray

scores.processing.cdf.add_thresholds(cdf, threshold_dim, new_thresholds, fill_method, *, min_nonnan=2)#

Takes a CDF data array with dimension threshold_dim and adds values from new_thresholds.

The CDF is then filled to replace any NaN values. The array cdf requires at least 2 non-NaN values along threshold_dim.

Parameters:
  • cdf (xr.DataArray) – array of CDF values.

  • threshold_dim (str) – name of the threshold dimension in cdf.

  • new_thresholds (Iterable[float]) – new thresholds to add to cdf.

  • fill_method (Literal["linear", "step", "forward", "backward", "none"]) – one of “linear”, “step”, “forward” or “backward”, as described in fill_cdf. If no filling, set to “none”.

  • min_nonnan (int) – passed onto fill_cdf for performing filling.

Returns:

The CDF with the additional thresholds added, with values at those thresholds determined by the specified fill method.

Return type:

xr.DataArray

scores.processing.cdf.fill_cdf(cdf, threshold_dim, method, min_nonnan)#

Fills NaNs in a CDF of a real-valued random variable along threshold_dim with appropriate values between 0 and 1.

Parameters:
  • cdf (xr.DataArray) – CDF values, where P(Y <= threshold) = cdf_value for each threshold in threshold_dim.

  • threshold_dim (str) – the threshold dimension in the CDF, along which filling is performed.

  • method (Literal["linear", "step", "forward", "backward"]) –

    one of:

    • ”linear”: use linear interpolation, and if needed also extrapolate linearly. Clip to 0 and 1. Needs at least two non-NaN values for interpolation, so returns NaNs where this condition fails.

    • ”step”: use forward filling then set remaining leading NaNs to 0. Produces a step function CDF (i.e. piecewise constant).

    • ”forward”: use forward filling then fill any remaining leading NaNs with backward filling.

    • ”backward”: use backward filling then fill any remaining trailing NaNs with forward filling.

  • min_nonnan (int) – the minimum number of non-NaN entries required along threshold_dim for filling to be performed. All CDF values are set to np.nan where this condition fails. min_nonnan must be at least 2 for the “linear” method, and at least 1 for the other methods.

Returns:

Containing the same values as cdf but with NaNs filled.

Return type:

xr.DataArray

Raises:
  • ValueError – If threshold_dim is not a dimension of cdf.

  • ValueError – If min_nonnan < 1 when method=”step” or if min_nonnan < 2 when method=”linear”.

  • ValueError – If method is not “linear”, “step”, “forward” or “backward”.

  • ValueError – If any non-NaN value of cdf lies outside the unit interval [0,1].
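
A minimal illustrative sketch (not taken from the scores documentation) of linear filling; the NaNs are linearly interpolated between the surrounding non-NaN CDF values, giving approximately [0.0, 0.3, 0.6, 0.8, 1.0].

>>> import numpy as np
>>> import xarray as xr
>>> from scores.processing.cdf import fill_cdf
>>> cdf = xr.DataArray(
...     [0.0, np.nan, 0.6, np.nan, 1.0],
...     dims=["threshold"],
...     coords={"threshold": [0, 1, 2, 3, 4]},
... )
>>> fill_cdf(cdf, "threshold", "linear", 2)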

scores.processing.cdf.decreasing_cdfs(cdf, threshold_dim, tolerance)#

A CDF of a real-valued random variable should be nondecreasing along threshold_dim.

This is sometimes violated due to rounding issues or a bad forecast process. decreasing_cdfs checks whether CDF values decrease beyond a specified tolerance; that is, whether the sum of the incremental decreases is less than -tolerance.

For example, if the CDF values are [0, 0.4, 0.3, 0.9, 0.88, 1] then the sum of incremental decreases is -0.12. Given a specified positive tolerance, the CDF values decrease beyond tolerance if the sum of incremental decreases < -tolerance.

Intended use is for CDFs with increasing coordinates along threshold_dim dimension, and where either each CDF is always NaN or always non-NaN.

Parameters:
  • cdf (xr.DataArray) – data array of CDF values

  • threshold_dim (str) – threshold dimension, such that P(Y < threshold) = cdf_value.

  • tolerance (float) – nonnegative tolerance value.

Returns:

Containing threshold_dim collapsed and values True if and only if the CDF is decreasing outside tolerance. If the CDF consists only of NaNs then the value is False.

Return type:

xr.DataArray

Raises:
  • ValueError – If threshold_dim is not a dimension of cdf.

  • ValueError – If tolerance is negative.

  • ValueError – If coordinates are not increasing along threshold_dim.

  • ValueError – If some, but not all, CDF values in cdf along threshold_dim are NaN.
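
A minimal illustrative sketch (not taken from the scores documentation), using the CDF values from the worked example above; the sum of incremental decreases is -0.12, which is below -tolerance for tolerance=0.1, so the result is True.

>>> import xarray as xr
>>> from scores.processing.cdf import decreasing_cdfs
>>> cdf = xr.DataArray(
...     [0.0, 0.4, 0.3, 0.9, 0.88, 1.0],
...     dims=["threshold"],
...     coords={"threshold": [0, 1, 2, 3, 4, 5]},
... )
>>> decreasing_cdfs(cdf, "threshold", 0.1)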

scores.processing.cdf.cdf_envelope(cdf, threshold_dim)#

Forecast cumulative distribution functions (CDFs) for real-valued random variables.

CDFs that are reconstructed from known points on the distribution should be nondecreasing with respect to the threshold dimension. However, sometimes this may fail due to rounding or poor forecast process. This function returns the “envelope” of the original CDF, which consists of two bounding CDFs, both of which are nondecreasing.

The following example shows values from an original CDF that has a decreasing subsequence (and so is not a true CDF). The resulting “upper” and “lower” CDFs minimally adjust “original” so that “lower” <= “original” <= “upper”.

  • “original”: [0, .5, .2, .8, 1]

  • “upper”: [0, .5, .5, .8, 1]

  • “lower”: [0, .2, .2, .8, 1]

This function does not perform checks that 0 <= cdf <= 1.

Parameters:
  • cdf (xr.DataArray) – forecast CDF with thresholds in the threshold_dim.

  • threshold_dim (str) – dimension in cdf that contains the threshold ordinates.

Returns:

An xarray DataArray consisting of three CDF arrays indexed along the “cdf_type” dimension with the following indices:

  • ”original”: same data as cdf.

  • ”upper”: minimally adjusted “original” CDF that is nondecreasing and satisfies “upper” >= “original”.

  • ”lower”: minimally adjusted “original” CDF that is nondecreasing and satisfies “lower” <= “original”.

NaN values in cdf are maintained in “original”, “upper” and “lower”.

Return type:

xr.DataArray

Raises:

ValueError – If threshold_dim is not a dimension of cdf.
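
A minimal illustrative sketch (not taken from the scores documentation), using the worked example above; the returned array is indexed by the “cdf_type” dimension.

>>> import xarray as xr
>>> from scores.processing.cdf import cdf_envelope
>>> cdf = xr.DataArray(
...     [0.0, 0.5, 0.2, 0.8, 1.0],
...     dims=["threshold"],
...     coords={"threshold": [0, 1, 2, 3, 4]},
... )
>>> env = cdf_envelope(cdf, "threshold")
>>> env.sel(cdf_type="upper")  # [0. , 0.5, 0.5, 0.8, 1. ]
>>> env.sel(cdf_type="lower")  # [0. , 0.2, 0.2, 0.8, 1. ]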

scores.pandas#

scores.pandas.continuous.mse(fcst, obs, *, is_angular=False)#

Calculates the mean squared error from forecast and observed data.

A detailed explanation is on https://en.wikipedia.org/wiki/Mean_squared_error

\[\frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2\]

Notes

Dimensional reduction is not supported for pandas and the user should convert their data to xarray to formulate the call to the base metric, scores.continuous.mse.

Parameters:
  • fcst (Type[Series]) – Forecast or predicted variables in pandas.

  • obs (Type[Series]) – Observed variables in pandas.

  • is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

An object containing a single floating point number representing the mean squared error for the supplied data. All dimensions will be reduced.

Return type:

pandas.Series

scores.pandas.continuous.rmse(fcst, obs, *, is_angular=False)#

Calculates the root mean squared error from forecast and observed data.

A detailed explanation is on https://en.wikipedia.org/wiki/Root-mean-square_deviation

\[\sqrt{\frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2}\]

Notes

Dimensional reduction is not supported for pandas and users wishing this extra functionality should convert their data to xarray to formulate the call to scores.continuous.rmse.

Parameters:
  • fcst (Type[Series]) – Forecast or predicted variables in pandas.

  • obs (Type[Series]) – Observed variables in pandas.

  • is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

An object containing a single floating point number representing the root mean squared error for the supplied data. All dimensions will be reduced.

Return type:

pandas.Series

scores.pandas.continuous.mae(fcst, obs, *, is_angular=False)#

Calculates the mean absolute error from forecast and observed data.

A detailed explanation is on https://en.wikipedia.org/wiki/Mean_absolute_error

\[\frac{1}{n} \sum_{i=1}^n | \text{forecast}_i - \text{observed}_i |\]

Notes

Dimensional reduction is not supported for pandas and the user should convert their data to xarray to formulate the call to scores.continuous.mae.

Parameters:
  • fcst (Type[Series]) – Forecast or predicted variables in pandas.

  • obs (Type[Series]) – Observed variables in pandas.

  • is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

An object containing a single floating point number representing the mean absolute error for the supplied data. All dimensions will be reduced.

Return type:

pandas.Series
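
A minimal illustrative sketch (not taken from the scores documentation) covering the three pandas metrics, assuming the scores.pandas.continuous module is imported as documented above; the data are arbitrary and each call reduces to a single value (the MAE here is 0.5).

>>> import pandas as pd
>>> from scores.pandas import continuous
>>> fcst = pd.Series([1.0, 2.0, 3.0])
>>> obs = pd.Series([1.5, 2.0, 2.0])
>>> continuous.mae(fcst, obs)
>>> continuous.rmse(fcst, obs)
>>> continuous.mse(fcst, obs)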

scores.emerging#

scores.emerging.risk_matrix_score(fcst, obs, decision_weights, severity_dim, prob_threshold_dim, *, threshold_assignment='lower', reduce_dims=None, preserve_dims=None, weights=None)#

Caution

This is an implementation of a novel metric that is still undergoing mathematical peer review. This implementation may change in line with the peer review process.

Calculates the risk matrix score of Taggart & Wilke (2024).

Let \((S_1, \ldots,S_m)\) denote the tuple of nested severity categories, let \((p_1, \ldots,p_n)\) denote the probability thresholds that delineate the certainty categories, and let \(w_{i,j}\) denote the weight applied to the decision point associated with severity category \(S_i\) and probability threshold \(p_j\).

In this implementation of the risk matrix score, a forecast \(F\) is of the form \(F=(f_1,\ldots,f_m)\) where \(f_i\) denotes the forecast probability that the observation lies in severity category \(S_i\). A corresponding observation \(y\) is given by \(y=(y_1,\ldots,y_m)\) where \(y_i\) is 1 if the observation lies in severity category \(S_i\) and 0 otherwise. Then the risk matrix score \(\text{RMS}\) is given by the formula

\[\text{RMS}(F,y) = \sum_{i=1}^m\sum_{j=1}^n w_{i,j} \, s_j(f_i, y_i),\]

where

\[\begin{split}s_j(f_i,y_i) = \begin{cases} p_j & \text{if } y_i=0\text{ and } f_i \geq p_j \\ 1 - p_j & \text{if } y_i=1\text{ and } f_i < p_j \\ 0 & \text{otherwise.} \end{cases}\end{split}\]

The formula above is for the case where the threshold_assignment is “lower”, with the adjustments (\(f_i > p_j\) and \(f_i \leq p_j\)) applied when the threshold_assignment is “upper”.

Parameters:
  • fcst (DataArray | Dataset) – an array of forecast probabilities for the observation lying in each severity category. Must have a dimension severity_dim.

  • obs (DataArray | Dataset) – an array of binary observations with a value of 1 if the observation was in the severity category and 0 otherwise. Must have a dimension severity_dim.

  • decision_weights (DataArray) – an array of non-negative weights to apply to each risk matrix decision point, indexed by coordinates in severity_dim and prob_threshold_dim.

  • severity_dim (str) – the dimension specifying severity categories.

  • prob_threshold_dim (str) – the dimension in decision_weights specifying probability thresholds.

  • threshold_assignment (str | None) – Either “upper” or “lower”. Specifies whether the probability intervals defining the certainty categories, with endpoints given by the decision thresholds in decision_weights, are left or right closed. That is, whether the probability decision threshold is included in the upper (left closed) or lower (right closed) certainty category. Defaults to “lower”.

  • reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the mean risk matrix score, where the mean is taken over all forecast cases. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the mean risk matrix score, where the mean is taken over all forecast cases. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved, apart from severity_dim and prob_threshold_dim. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

  • weights (DataArray | None) – Optionally provide an array for weighted averaging (e.g. by area, by latitude, by population, custom) of scores across all forecast cases.

Returns:

An xarray object of risk matrix scores, averaged according to specified weights across appropriate dimensions.

Raises:
  • ValueError – if severity_dim is not a dimension of fcst, obs or decision_weights.

  • ValueError – if severity_dim is a dimension of weights.

  • ValueError – if prob_threshold_dim is not a dimension of decision_weights.

  • ValueError – if prob_threshold_dim is a dimension of weights.

  • ValueError – if decision_weights does not have exactly 2 dimensions.

  • ValueError – if fcst values are negative or greater than 1.

  • ValueError – if obs values are other than 0, 1 or nan.

  • ValueError – if prob_threshold_dim coordinates are not strictly between 0 and 1.

  • ValueError – if severity_dim coordinates differ for any of fcst, obs or decision_weights.

  • ValueError – if threshold_assignment is not “upper” or “lower”.

Return type:

DataArray | Dataset

References

  • Taggart, R. J., & Wilke, D. J. (2024). Warnings based on risk matrices: a coherent framework with consistent evaluation. In preparation.

Examples

Calculate the risk matrix score where the risk matrix has three nested severity categories (“MOD+”, “SEV+” and “EXT”) and three probability thresholds (0.1, 0.3 and 0.5). The decision weights place greater emphasis on higher end severity.

>>> import xarray as xr
>>> from scores.emerging import risk_matrix_score
>>> decision_weights = xr.DataArray(
...     data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]],
...     dims=["probability_threshold", "severity"],
...     coords={'probability_threshold': [0.1, 0.3, 0.5], 'severity': ["MOD+", "SEV+", "EXT"]}
... )
>>> fcst = xr.DataArray(
...     data=[[0.45, 0.22, 0.05], [0.65, 0.32, 0.09]],
...     dims=["time", "severity"],
...     coords={'time': [0, 1], 'severity': ["MOD+", "SEV+", "EXT"]}
... )
>>> obs = xr.DataArray(
...     data=[[1, 1, 0], [1, 0, 0]],
...     dims=["time", "severity"],
...     coords={'time': [0, 1], 'severity': ["MOD+", "SEV+", "EXT"]}
... )
>>> risk_matrix_score(fcst, obs, decision_weights, "severity", "probability_threshold")

scores.emerging.matrix_weights_to_array(matrix_weights, severity_dim, severity_coords, prob_threshold_dim, prob_threshold_coords)#

Caution

This function is part of an implementation of a novel metric that is still undergoing mathematical peer review. This implementation may change in line with the peer review process.

Generates a 2-dimensional xr.DataArray of weights for each decision point, which is used for the scores.emerging.risk_matrix_score() calculation. Assumes that values toward the left in matrix_weights correspond to less severe categories, while values towards the top in matrix_weights correspond to higher probability thresholds.

Parameters:
  • matrix_weights (ndarray) – array of weights to place on each risk matrix decision point, with rows (ascending) corresponding to (increasing) probability thresholds, and columns (left to right) corresponding to (increasing) severity categories.

  • severity_dim (str) – name of the severity category dimension.

  • severity_coords (Iterable) – labels for each of the severity categories, in order of increasing severity. Does NOT include the lowest severity category for which no warning would be issued.

  • prob_threshold_dim (str) – name of the probability threshold dimension.

  • prob_threshold_coords (Iterable) – list of the probability decision thresholds in the risk matrix, strictly between 0 and 1.

Returns:

xarray data array of weights for each risk matrix decision point, indexed by prob_threshold_dim and severity_dim.

Raises:
  • ValueError – if matrix_weights isn’t two dimensional.

  • ValueError – if number of rows of matrix_weights doesn’t equal length of prob_threshold_coords.

  • ValueError – if number of columns of matrix_weights doesn’t equal length of severity_coords.

  • ValueError – if prob_threshold_coords aren’t strictly between 0 and 1.

Return type:

DataArray

References

  • Taggart, R. J., & Wilke, D. J. (2024). Warnings based on risk matrices: a coherent framework with consistent evaluation. In preparation.

Examples

Returns weights for each risk matrix decision point, where weights increase with increasing severity category and decrease with increasing probability threshold.

>>> import numpy as np
>>> from scores.emerging import matrix_weights_to_array
>>> matrix_weights = np.array([
...     [1, 2, 3],
...     [2, 4, 6],
...     [3, 6, 9],
... ])
>>> matrix_weights_to_array(
...     matrix_weights, "severity", ["MOD+", "SEV+", "EXT"], "prob_threshold", [0.1, 0.3, 0.5]
... )
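As a brief follow-on sketch (assuming the call above is assigned to a variable), the return value is a 2-dimensional data array indexed by the two decision-point dimensions, so it can be passed directly as decision_weights to scores.emerging.risk_matrix_score().

>>> decision_weights = matrix_weights_to_array(
...     matrix_weights, "severity", ["MOD+", "SEV+", "EXT"], "prob_threshold", [0.1, 0.3, 0.5]
... )
>>> sorted(decision_weights.dims)
['prob_threshold', 'severity']
>>> decision_weights.shape
(3, 3)
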
scores.emerging.weights_from_warning_scaling(scaling_matrix, assessment_weights, severity_dim, severity_coords, prob_threshold_dim, prob_threshold_coords)#

Caution

This function is part of an implementation of a novel metric that is still undergoing mathematical peer review. This implementation may change in line with the peer review process.

Given a warning scaling matrix, assessment weights and other inputs, returns the weights for each risk matrix decision point as an xarray data array. The returned data array is designed to be used for the scores.emerging.risk_matrix_score() calculation.

Comprehensive checks are made on scaling_matrix to ensure it satisfies the properties of warning scaling in Table 1 of Taggart & Wilke (2024).

Parameters:
  • scaling_matrix (ndarray) – a 2-dimensional matrix encoding the warning scaling. Warning levels are given integer values 0, 1, …, q. The top row corresponds to the highest certainty category, while the right-most column corresponds to the most severe category.

  • assessment_weights (Iterable) – positive weights used for warning assessment. The kth weight (corresponding to assessment_weights[k-1]) is proportional to the importance of accurately discriminating between warning states below level k and warning states at or above level k. len(assessment_weights) must be at least q.

  • severity_dim (str) – name of the severity category dimension.

  • severity_coords (Iterable) – labels for each of the severity categories, in order of increasing severity. Does NOT include the lowest severity category for which no warning would be issued.

  • prob_threshold_dim (str) – name of the probability threshold dimension.

  • prob_threshold_coords (Iterable) – list of the probability decision thresholds in the risk matrix, strictly between 0 and 1.

Returns:

xarray data array of weights for each risk matrix decision point, indexed by prob_threshold_dim and severity_dim.

Raises:
  • ValueError – if scaling_matrix isn’t two dimensional.

  • ValueError – if scaling_matrix has entries that are not non-negative integers.

  • ValueError – if the first column or last row of scaling_matrix has nonzero entries.

  • ValueError – if scaling_matrix decreases along any row (moving left to right).

  • ValueError – if scaling_matrix increases along any column (moving top to bottom).

  • ValueError – if the number of rows of scaling_matrix isn’t one more than the length of prob_threshold_coords.

  • ValueError – if the number of columns of scaling_matrix isn’t one more than the length of severity_coords.

  • ValueError – if len(assessment_weights) is less than the maximum value in scaling_matrix.

  • ValueError – if prob_threshold_coords aren’t strictly between 0 and 1.

  • ValueError – if assessment_weights aren’t strictly positive.

Return type:

DataArray

References

  • Taggart, R. J., & Wilke, D. J. (2024). Warnings based on risk matrices: a coherent framework with consistent evaluation. In preparation.

Examples

Returns weights for each risk matrix decision point, for the SHORT-RANGE scaling matrix of Taggart & Wilke (2024), with ESCALATION assessment weights.

>>> import numpy as np
>>> from scores.emerging import weights_from_warning_scaling
>>> scaling = np.array([
...     [0, 2, 3, 3],
...     [0, 1, 2, 3],
...     [0, 1, 1, 2],
...     [0, 0, 0, 0],
... ])
>>> weights_from_warning_scaling(
...     scaling, [1, 2, 3], "severity", ["MOD+", "SEV+", "EXT"], "prob_threshold", [0.1, 0.3, 0.5]
... )
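The returned weights can then be supplied to scores.emerging.risk_matrix_score(). The sketch below continues the example above; the fcst and obs arrays are hypothetical single-case inputs constructed purely for illustration.

>>> import xarray as xr
>>> from scores.emerging import risk_matrix_score
>>> decision_weights = weights_from_warning_scaling(
...     scaling, [1, 2, 3], "severity", ["MOD+", "SEV+", "EXT"], "prob_threshold", [0.1, 0.3, 0.5]
... )
>>> # hypothetical probability-of-exceedance forecasts and binary observations
>>> fcst = xr.DataArray(
...     data=[0.55, 0.25, 0.05], dims=["severity"], coords={"severity": ["MOD+", "SEV+", "EXT"]}
... )
>>> obs = xr.DataArray(
...     data=[1, 0, 0], dims=["severity"], coords={"severity": ["MOD+", "SEV+", "EXT"]}
... )
>>> risk_matrix_score(fcst, obs, decision_weights, "severity", "prob_threshold")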