API Documentation

API Documentation#

scores.continuous#

Import the functions from the implementations into the public API

scores.continuous.additive_bias(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the additive bias which is also sometimes called the mean error.

It is defined as

\[\text{Additive bias} =\frac{1}{N}\sum_{i=1}^{N}(x_i - y_i) \text{where } x = \text{the forecast, and } y = \text{the observation}\]

See “Mean error” section at https://jwgfvr.github.io/forecastverification/index.html#meanerror for more information.

Parameters:

fcst (DataArray | Dataset) – Forecast or predicted variables.
obs (DataArray | Dataset) – Observed variables.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the additive bias. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the additive bias. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, NaN values in weights can be replaced by weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

An xarray object with the additive bias of a forecast.

Return type:

DataArray | Dataset

References

https://jwgfvr.github.io/forecastverification/index.html#meanerror

scores.continuous.mean_error(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the mean error which is also sometimes called the additive bias.

It is defined as

\[\text{mean error} =\frac{1}{N}\sum_{i=1}^{N}(x_i - y_i) \text{where } x = \text{the forecast, and } y = \text{the observation}\]

See “Mean error” section at https://jwgfvr.github.io/forecastverification/index.html#meanerror for more information.

Parameters:

fcst (DataArray | Dataset) – Forecast or predicted variables.
obs (DataArray | Dataset) – Observed variables.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the mean error. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the mean error. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, NaN values in weights can be replaced by weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

An xarray object with the mean error of a forecast.

Return type:

DataArray | Dataset

References

https://jwgfvr.github.io/forecastverification/index.html#meanerror

scores.continuous.mae(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, is_angular=False)#

Calculates the mean absolute error from forecast and observed data.

A detailed explanation is on https://en.wikipedia.org/wiki/Mean_absolute_error

\[\frac{1}{n} \sum_{i=1}^n | \text{forecast}_i - \text{observed}_i |\]

Parameters:

fcst (DataArray | Dataset | Series) – Forecast or predicted variables in xarray or pandas.
obs (DataArray | Dataset | Series) – Observed variables in xarray or pandas.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating MAE. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating MAE. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, NaN values in weights can be replaced by weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

By default an xarray DataArray containing a single floating point number representing the mean absolute error for the supplied data. All dimensions will be reduced.

Alternatively, an xarray structure with dimensions preserved as appropriate containing the score along reduced dimensions

Raises:

ValueError – If fcst and obs are not xarray objects and weights is not None.

Return type:

DataArray | Dataset | Series

scores.continuous.mse(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, is_angular=False)#

Calculates the mean squared error from forecast and observed data.

See “Mean squared error” section at https://jwgfvr.github.io/forecastverification/index.html#MSE for more information.

\[\frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2\]

Parameters:

fcst (Union[xr.Dataset, xr.DataArray, pd.Dataframe, pd.Series]) – Forecast or predicted variables in xarray or pandas.
obs (Union[xr.Dataset, xr.DataArray, pd.Dataframe, pd.Series]) – Observed variables in xarray or pandas.
reduce_dims (Union[str, Iterable[str]) – Optionally specify which dimensions to reduce when calculating MSE. All other dimensions will be preserved.
preserve_dims (Union[str, Iterable[str]) – Optionally specify which dimensions to preserve when calculating MSE. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the squared error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, NaN values in weights can be replaced by weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
is_angular (bool | None) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

An object containing: a single floating point number representing the mean absolute error for the supplied data. All dimensions will be reduced. Otherwise: Returns an object representing the mean squared error, reduced along the relevant dimensions and weighted appropriately.

Return type:

Union[xr.Dataset, xr.DataArray, pd.Dataframe, pd.Series]

Raises:

ValueError – If fcst and obs are not xarray objects and weights is not None.

References

scores.continuous.quantile_score(fcst, obs, alpha, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates a score that targets alpha-quantiles. Use with alpha = 0.5 for forecasts of the median. Use with alpha = 0.9 for forecasts of the 90th percentile.

Parameters:

fcst (DataArray | Dataset) – array of forecasts
obs (DataArray | Dataset) – array of observations
alpha (float) – A value between 0 and 1 (exclusive)
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the quantile score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating quantile score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

A DataArray with values being the mean generalised piecewise linear (GPL) scoring function, with the dimensions specified in dims. If dims is None, the returned DataArray will have only one element, the overall mean GPL score.

Raises:

ValueError – if alpha is not between 0 and 1.

Return type:

DataArray | Dataset

Notes

\[\begin{split} gpl(x) = \begin{cases}\alpha * (-x) & x \leq 0\\ (1-\alpha) x & x > 0\end{cases}\end{split}\]

where:

\(\alpha\) is the targeted quantile.
\(x\) is the difference, fcst - obs

References

T. Gneiting, “Making and evaluating point forecasts”, J. Amer. Stat. Assoc., Vol. 106 No. 494 (June 2011), pp. 754–755, Theorem 9

scores.continuous.rmse(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, is_angular=False)#

Calculate the Root Mean Squared Error

A detailed explanation is on https://en.wikipedia.org/wiki/Root-mean-square_deviation

\[\sqrt{\frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2}\]

Parameters:

fcst (DataArray | Dataset | Series) – Forecast or predicted variables in xarray or pandas.
obs (DataArray | Dataset | Series) – Observed variables in xarray or pandas.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating RMSE. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating RMSE. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, NaN values in weights can be replaced by weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

An object containing: a single floating point number representing the root mean squared error for the supplied data. All dimensions will be reduced. Otherwise: Returns an object representing the root mean squared error, reduced along the relevant dimensions and weighted appropriately.

Raises:

ValueError – If fcst and obs are not xarray objects and weights is not None.

Return type:

DataArray | Dataset | Series

scores.continuous.murphy_score(fcst, obs, thetas, *, functional, alpha, huber_a=None, decomposition=False, reduce_dims=None, preserve_dims=None)#

Returns the mean elementary score (Ehm et. al. 2016), also known as Murphy score, evaluated at decision thresholds specified by thetas. Optionally returns a decomposition of the score in terms of penalties for over- and under-forecasting.

Select functional="quantile" and alpha=0.5 for the median functional. Select functional="expectile" and alpha=0.5 for the mean (i.e., expectation) functional.

Consider using murphy_thetas() to generate thetas. If utilising dask, you may want to store thetas as a netCDF on disk and pass it in as an xarray object. Otherwise, very large objects may be created when fcst, obs and thetas are broadcast together.

Parameters:

fcst (DataArray) – Forecast numerical values.
obs (DataArray) – Observed numerical values.
thetas (Sequence[float] | DataArray) – Theta thresholds.
functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.
alpha (float) – Risk parameter (i.e. quantile or expectile level) for the functional. Must be between 0 and 1.
huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.
decomposition (bool) – True to return penalty values due to under- and over-fcst as well as the total score, False to return total score only.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Murphy score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Murphy score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the FIRM score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

Returns:

An xr.Dataset with dimensions based on the preserve_dims or reduce_dims arg as well as a “theta” dimension with values matching thetas input.

If decomposition is False, the dataset’s variables will contain 1 element “total”.

If decomposition is True, in addition to “total”, it will have “underforecast” and “overforecast” data_vars.

Raises:

ValueError – If functional is not one of the expected functions.
ValueError – If alpha is not strictly between 0 and 1.
ValueError – If huber_a is not strictly greater than 0.

Return type:

Dataset

References

For mean elementary score definitions, see:

Theorem 1 of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154
Theorem 5.3 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957

scores.continuous.murphy_thetas(forecasts, obs, functional, *, huber_a=None, left_limit_delta=None)#

Return the decision thresholds (theta values) at which to evaluate elementary scores for the construction of Murphy diagrams.

This function may generate a very large number of theta thresholds if the forecast and obs data arrays are large and their values have high precision, which may lead to long computational times for Murphy diagrams. To reduce the number of thetas, users can first round forecast and obs values to an appropriate resolution.

Parameters:

forecasts (list[DataArray]) – Forecast values, one array per source.
obs (DataArray) – Observed values.
functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.
huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.
left_limit_delta (float | None) – Approximation of the left hand limit, used for “huber” and “expectile” functionals. Must be greater than or equal to 0. None will be treated as 0. Ideally, left_limit_delta should be small relative to the fcst and obs precision, and not greater than that precision.

Returns:

theta thresholds to be used to compute murphy scores.

Return type:

List[float]

Raises:

ValueError – If functional is not one of the expected functions.
ValueError – If huber_a is not strictly greater than 0.
ValueError – If left_limit_delta is not greater than or equal to 0.

Notes

For theta values at which to evaluate elementary scores, see

Corollary 2 (p.521) of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154
Corollary 5.6 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957

scores.continuous.flip_flop_index(data: DataArray, sampling_dim: str, *, is_angular: bool = False, **selections: Iterable[int]) → Dataset#

scores.continuous.flip_flop_index(data: DataArray, sampling_dim: str, *, is_angular: bool = False, **selections: None) → DataArray

Calculates the Flip-Flop Index along the dimensions sampling_dim.

Parameters:

data – Data from which to draw subsets.
sampling_dim – The name of the dimension along which to calculate the Flip-Flop Index.
is_angular – specifies whether data is directional data (e.g. wind direction).
**selections – Additional keyword arguments specify subsets to draw from the dimension sampling_dim of the supplied data before calculation of the Flip_Flop Index. e.g. days123=[1, 2, 3]

Returns:

An xarray.DataArray, the Flip-Flop Index by collapsing the dimension sampling_dim.

If selections are supplied: An xarray.Dataset. Each data variable is a supplied key-word argument, and corresponds to selecting the values specified from sampling_dim of data. The Flip-Flop Index is calculated for each of these selections.

Return type:

If selections are not supplied

Notes

\[\text{{Flip-Flop Index}} = \frac{{1}}{{N-2}} \left [ \left(\sum\limits_{{i=1}}^{{N-1}}|x_i - x_{{i+1}}|\right) - \left(\max_{{j}}\{{x_j\}} - \min_{{j}}\{{x_j\}}\right) \right ]\]

Where \(N\) is the number of data points, and \(x_i\) is the \(i^{{\text{{th}}}}\) data point.

Examples

>>> data = xr.DataArray([50, 20, 40, 80], coords={{'lead_day': [1, 2, 3, 4]}})

>>> flip_flop_index(data, 'lead_day')
<xarray.DataArray ()>
array(15.0)
Attributes:
    sampling_dim: lead_day

>>> flip_flop_index(data, 'lead_day', days123=[1, 2, 3], all_days=[1, 2, 3, 4])
<xarray.Dataset>
Dimensions:   ()
Coordinates:
    *empty*
Data variables:
    days123   float64 20.0
    all_days  float64 15.0
Attributes:
    selections: {{'days123': [1, 2, 3], 'all_days': [1, 2, 3, 4]}}
    sampling_dim: lead_day

scores.continuous.flip_flop_index_proportion_exceeding(data, sampling_dim, thresholds, *, is_angular=False, preserve_dims=None, reduce_dims=None, **selections)#

Calculates the Flip-Flop Index and returns the proportion exceeding (or equal to) each of the supplied thresholds.

Parameters:

data (DataArray) – Data from which to draw subsets.
sampling_dim (str) – The name of the dimension along which to calculate
thresholds (Iterable) – The proportion of Flip-Flop Index results equal to or exceeding these thresholds will be calculated. the Flip-Flop Index.
is_angular (bool) – specifies whether data is directional data (e.g. wind direction).
reduce_dims (Iterable[Hashable] | None) – Dimensions to reduce.
preserve_dims (Iterable[Hashable] | None) – Dimensions to preserve.
**selections (Iterable[int]) – Additional keyword arguments specify subsets to draw from the dimension sampling_dim of the supplied data before calculation of the Flip_Flop Index. e.g. days123=[1, 2, 3]

Returns:

If selections are not supplied - An xarray.DataArray with dimensions dims + ‘threshold’. The DataArray is the proportion of the Flip-Flop Index calculated by collapsing dimension sampling_dim exceeding or equal to thresholds.

If selections are supplied - An xarray.Dataset with dimensions dims + ‘threshold’. There is a data variable for each keyword in selections, and corresponds to the Flip-Flop Index proportion exceeding for the subset of data specified by the keyword values.

Examples

>>> data = xr.DataArray(
...     [[50, 20, 40, 80], [10, 50, 10, 100], [0, 30, 20, 50]],
...     dims=['station_number', 'lead_day'],
...     coords=[[10001, 10002, 10003], [1, 2, 3, 4]]
... )

>>> flip_flop_index_proportion_exceeding(data, 'lead_day', [20])
<xarray.DataArray (threshold: 1)>
array([ 0.33333333])
Coordinates:
  * threshold  (threshold) int64 20
Attributes:
    sampling_dim: lead_day

>>> flip_flop_index_proportion_exceeding(
...     data, 'lead_day', [20], days123=[1, 2, 3], all_days=[1, 2, 3, 4]
... )
<xarray.Dataset>
Dimensions:    (threshold: 1)
Coordinates:
  * threshold  (threshold) int64 20
Data variables:
    days123    (threshold) float64 0.6667
    all_days   (threshold) float64 0.3333
Attributes:
    selections: {{'days123': [1, 2, 3], 'all_days': [1, 2, 3, 4]}}
    sampling_dim: lead_day

See also

scores.continuous.flip_flop_index()

scores.continuous.correlation.pearsonr(fcst, obs, *, reduce_dims=None, preserve_dims=None)#

Calculates the Pearson’s correlation coefficient between two xarray DataArrays

\[\rho = \frac{\sum_{i=1}^{n}{(x_i - \bar{x})(y_i - \bar{y})}} {\sqrt{\sum_{i=1}^{n}{(x_i-\bar{x})^2}\sum_{i=1}^{n}{(y_i - \bar{y})^2}}}\]

where:

\(\rho\) = Pearson’s correlation coefficient
\(x_i\) = the values of x in a sample (i.e. forecast values)
\(\bar{x}\) = the mean value of the forecast sample
\(y_i\) = the values of y in a sample (i.e. observed values)
\(\bar{y}\) = the mean value of the observed sample value

Parameters:

fcst (DataArray | Dataset) – Forecast or predicted variables
obs (DataArray | Dataset) – Observed variables.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Pearson’s correlation coefficient. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Pearson’s correlation coefficient. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.

Returns:

An xarray object with Pearson’s correlation coefficient values

Raises:

ValueError – If a user tries to preserve all dimensions a ValueError will be raised.
TypeError – If the input types are not xarray DataArrays or Datasets.
ValueError – If the input Datasets do not have the same data variables.

Return type:

DataArray | Dataset

Note

This function isn’t set up to take weights.

Reference:: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

Example

>>> import xarray as xr
>>> import numpy as np
>>> from scores.continuous.correlation.correlation_impl import pearsonr
>>> # Create example forecast and observation DataArrays
>>> fcst = xr.DataArray(np.random.rand(10, 5), dims=("time", "location"))
>>> obs = xr.DataArray(np.random.rand(10, 5), dims=("time", "location"))
>>> # Calculate Pearson's correlation coefficient
>>> result = pearsonr(fcst, obs, reduce_dims="time")
>>> print(result)

scores.continuous.correlation.spearmanr(fcst, obs, *, reduce_dims=None, preserve_dims=None)#

Calculates the Spearman’s rank correlation coefficient between two xarray objects.

\[r_s = \rho\big(R(x), R(y)\big)\]

where:

\(\rho\) is the Pearson correlation coefficient.
\(R\) is the ranking operator.

Parameters:

fcst (DataArray | Dataset) – Forecast or predicted variables.
obs (DataArray | Dataset) – Observed variables.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Spearman’s rank correlation coefficient. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Spearman’s rank correlation coefficient. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved.

Returns:

An xarray object with Spearman’s rank correlation coefficient values.

Raises:

ValueError – If a user tries to preserve all dimensions ValueError will be raised.
TypeError – If the input types are not xarray DataArrays or Datasets.
ValueError – If the input Datasets do not have the same data variables.

Return type:

DataArray | Dataset

Note

This function isn’t set up to take weights.

Reference:: Spearman, C. (1904). The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15(1), 72–101. https://doi.org/10.2307/1412159

Example

>>> import xarray as xr
>>> import numpy as np
>>> from scores.continuous.correlation.correlation_impl import spearmanr
>>> # Create example forecast and observation DataArrays
>>> fcst = xr.DataArray(np.random.rand(10, 5), dims=("time", "location"))
>>> obs = xr.DataArray(np.random.rand(10, 5), dims=("time", "location"))
>>> # Calculate Spearman's rank correlation coefficient
>>> result = spearmanr(fcst, obs, reduce_dims="time")
>>> print(result)

scores.continuous.multiplicative_bias(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the multiplicative bias.

Most suited for forecasts that have a lower bound at 0 such as wind speed. Will return a np.inf where the mean of obs across the dims to be reduced is 0. It is defined as

\[\text{{Multiplicative bias}} = \frac{\frac{1}{N}\sum_{i=1}^{N}x_i}{\frac{1}{N}\sum_{i=1}^{N}y_i} \text{where } x = \text{the forecast, and } y = \text{the observation}\]

See “(Multiplicative) bias” section at https://jwgfvr.github.io/forecastverification/index.html#multiplicative_bias for more information.

Parameters:

fcst (DataArray | Dataset) – Forecast or predicted variables.
obs (DataArray | Dataset) – Observed variables.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the multiplicative bias. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the multiplicative bias. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, NaN values in weights can be replaced by weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

An xarray object with the multiplicative bias of a forecast.

Return type:

DataArray | Dataset

References

https://jwgfvr.github.io/forecastverification/index.html#multiplicative_bias

scores.continuous.pbias(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the percent bias, which is the ratio of the additive bias to the mean observed value, multiplied by 100.

Percent bias is used for evaluating and comparing forecast accuracy across stations or datasets with varying magnitudes. By expressing the error as a percentage of the observed value, it allows for standardised comparisons, enabling assessment of forecast performance regardless of the absolute scale of values. Like scores.continuous.multiplicative_bias(), pbias will return a np.inf where the mean of obs across the dims to be reduced is 0. It is defined as

\[\text{Percent bias} = 100 \cdot \frac{\sum_{i=1}^{N}(x_i - y_i)}{\sum_{i=1}^{N} y_i}\]

where:

\(x_i\) = the values of x in a sample (i.e. forecast values)
\(y_i\) = the values of y in a sample (i.e. observed values)

See “pbias” section at https://search.r-project.org/CRAN/refmans/hydroGOF/html/pbias.html for more information

Parameters:

fcst (DataArray | Dataset) – Forecast or predicted variables.
obs (DataArray | Dataset) – Observed variables.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the percent bias. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the percent bias. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, NaN values in weights can be replaced by weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

An xarray object with the percent bias of a forecast.

Return type:

DataArray | Dataset

References

Sorooshian, S., Duan, Q., & Gupta, V. K. (1993). Calibration of rainfall-runoff models: Application of global optimization to the Sacramento Soil Moisture Accounting Model. Water Resources Research, 29(4), 1185-1194. https://doi.org/10.1029/92WR02617
Alfieri, L., Pappenberger, F., Wetterhall, F., Haiden, T., Richardson, D., & Salamon, P. (2014). Evaluation of ensemble streamflow predictions in Europe. Journal of Hydrology, 517, 913-922. https://doi.org/10.1016/j.jhydrol.2014.06.035
Dawson, C. W., Abrahart, R. J., & See, L. M. (2007). HydroTest: A web-based toolbox of evaluation metrics for the standardised assessment of hydrological forecasts. Environmental Modelling and Software, 22(7), 1034-1052. https://doi.org/10.1016/j.envsoft.2006.06.008
Moriasi, D. N., Arnold, J. G., Van Liew, M. W., Bingner, R. L., Harmel, R. D., & Veith, T. L. (2007). Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Transactions of the ASABE, 50(3), 885-900. https://doi.org/10.13031/2013.23153

scores.continuous.percent_within_x(fcst, obs, threshold, *, reduce_dims=None, preserve_dims=None, is_angular=False, decimals=None, is_inclusive=True)#

Computes the proportion of forecasts within a specified absolute tolerance of the observations.

This score calculates the percentage of forecast values that are within a specified threshold (plus an optional tolerance) of the observed values. This metric is particularly useful when evaluating how often a forecast falls within an acceptable error band, regardless of direction.

The general formulations are:

\[\text{Percent within X (exclusive)} = 100 \cdot \frac{\sum_{i=1}^{N} \mathbf{1}\left(|x_i - y_i| < \tau\right)} {\sum_{i=1}^{N} \mathbf{1}_{\text{valid}}}\]

\[\text{Percent within X (inclusive)} = 100 \cdot \frac{\sum_{i=1}^{N} \mathbf{1}\left(|x_i - y_i| \leq \tau\right)} {\sum_{i=1}^{N} \mathbf{1}_{\text{valid}}}\]

where:

\(x_i\) is the forecast value at index \(i\)
\(y_i\) is the observed value at index \(i\)
\(\tau\) is the absolute error threshold
\(\mathbf{1}(\cdot)\) is the indicator function
\(\mathbf{1}_{\text{valid}}\) is 1 where both \(x_i\) and \(y_i\) are not missing (NaN), 0 otherwise

Parameters:

fcst (DataArray | Dataset) – Forecast or predicted variables.
obs (DataArray | Dataset) – Observed variables.
threshold (float) – The main threshold to test closeness against.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the percent within X . All other dimensions will be preserved. Only one of reduce_dims and preserve_dims can be specified.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the percent within X. All other dimensions will be reduced. Only one of reduce_dims and preserve_dims can be specified.
is_angular (bool | None) – If True, uses angular distance in degrees.
decimals (int | None) – A way to avoid floating-point precision issues. Absolute errors are rounded to this many digits before threshold calculations.
is_inclusive (bool | None) – Whether to treat the condition as inclusive (<=) or exclusive (<).

Returns:

Percent of forecasts within tolerance for each preserved dimension.

Return type:

DataArray | Dataset

Important

This metric can incentivise hedging (such as introducing biases to maximise one’s score). See the tutorial for more information.

Note

The result is bounded in [0, 100].
NaNs in forecasts or observations are excluded.
If total valid forecast-observation pairs are zero, output is NaN.

Examples

>>> import numpy as np
>>> import xarray as xr
>>> from scores.standard_impl import percent_within_x
>>> obs_raw = np.array(
...     [
...         [[1,2,3], [4,5,6]],
...         [[3,2,1], [6,5,4]],
...         [[3,2,5], [2,2,6]],
...         [[5,2,3], [4,-1,4]],
...     ]
...
... )  # dimension lengths: x=4, y=2, t=3
>>> obs = xr.DataArray(obs_raw, dims=["x", "y", "t"])
>>> fcst = obs * 1.2 + 0.1  # add some synthetic bias and variance

>>> # Example 1:
>>> # percent of forecasts with less than or equal to 0.5 absolute error
>>> # reduce over t - time - should produce an xy-grid (4 by 2)
>>> percent_within_x(fcst=fcst, obs=obs, threshold=0.5, is_inclusive=True, reduce_dims=["t"])
<xarray.DataArray (x: 4, y: 2)> Size: 64B
array([[66.66666667,  0.        ],
       [66.66666667,  0.        ],
       [33.33333333, 66.66666667],
       [33.33333333, 33.33333333]])
Dimensions without coordinates: x, y

>>> # Example 2:
>>> # percent of forecasts with less than or equal to 0.5 absolute error
>>> # reduce over (x, y) - space - should be a t-vector (3 by 1)
>>> percent_within_x(fcst=fcst, obs=obs, threshold=0.5, is_inclusive=True, reduce_dims=["x","y"])
<xarray.DataArray (t: 3)> Size: 24B
array([25., 75., 12.5])
Dimensions without coordinates: t

>>> # Example 3:
>>> # percent of forecasts with less than 0.5 absolute error (is_inclusive=False)
>>> # reduce over (x, y) - space - should be a t-vector (3 by 1)
>>> percent_within_x(fcst=fcst, obs=obs, threshold=0.5, is_inclusive=False, reduce_dims=["x","y"])
<xarray.DataArray (t: 3)> Size: 24B
array([12.5., 12.5., 12.5])
Dimensions without coordinates: t

>>> # Example 4:
>>> # Controlling floating-point precision issues
>>> np.set_printoptions(precision=17) # make floating-point precision issues visible
>>> obs = xr.DataArray([0.1 + 0.2], dims=["t"])
>>> print(f'obs {obs.values}')
obs [0.30000000000000004]
>>> fcst = obs + 0.3
>>> print(f'fcst {fcst.values}')
fcst [0.6000000000000001]
>>> unrounded = percent_within_x(fcst=fcst, obs=obs, threshold=0.3, is_inclusive=True, decimals=20)
>>> print(f'incorrectly creating a penalty: {unrounded.values}')
incorrectly ignoring a success: 0.0
>>> rounded = percent_within_x(fcst=fcst, obs=obs, threshold=0.3, is_inclusive=True, decimals=5)
>>> print(f'correctly recognising a success: {rounded.values}')
correctly recognising a success: 100.0

scores.continuous.kge(fcst, obs, *, reduce_dims=None, preserve_dims=None, scaling_factors=None, include_components=False)#

Calculate the Kling-Gupta Efficiency (KGE) between observed and simulated (or forecast) values.

KGE is a performance metric that decomposes the error into three components: correlation, variability, and bias. It is computed as:

\[\text{KGE} = 1 - \sqrt{\left[s_\rho \cdot (\rho - 1)\right]^2 + \left[s_\alpha \cdot (\alpha - 1)\right]^2 + \left[s_\beta \cdot (\beta - 1)\right]^2}\]

\[\alpha = \frac{\sigma_x}{\sigma_y}\]

\[\beta = \frac{\mu_x}{\mu_y}\]

where:

\(\rho\) = Pearson’s correlation coefficient between observed and forecast values as defined in scores.continuous.correlation.pearsonr()
\(\alpha\) is the ratio of the standard deviations (variability ratio)
\(\beta\) is the ratio of the means (bias)
\(x\) and \(y\) are forecast and observed values, respectively
\(\mu_x\) and \(\mu_y\) are the means of forecast and observed values, respectively
\(\sigma_x\) and \(\sigma_y\) are the standard deviations of forecast and observed values, respectively
\(s_\rho\), \(s_\alpha\) and \(s_\beta\) are the scaling factors for the correlation coefficient \(\rho\), the variability term \(\alpha\) and the bias term \(\beta\)

Parameters:

fcst (DataArray) – Forecast or predicted variables.
obs (DataArray) – Observed variables.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the KGE. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the KGE. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be all NaN with the same shape/dimensionality as the forecast because the standard deviation is zero for a single point.
scaling_factors (list[float] | ndarray | None) – A 3-element vector or list describing the weights for each term in the KGE. Defined by: scaling_factors = [\(s_\rho\), \(s_\alpha\), \(s_\beta\)] to apply to the correlation term \(\rho\), the variability term \(\alpha\) and the bias term \(\beta\) respectively. Defaults to (1.0, 1.0, 1.0). (See equation 10 in Gupta et al. (2009) for definitions of them).
include_components (bool | False) – If True, the function also returns the individual terms contributing to the KGE score.

Returns:

The Kling-Gupta Efficiency (KGE) score as an xarray DataArray.

If include_components is True, the function returns xarray.Dataset kge_s with the following variables:

kge: The KGE score.
rho: The Pearson correlation coefficient.
alpha: The variability ratio.
beta: The bias term.

Return type:

DataArray | Dataset

Notes

Statistics are calculated only from values for which both observations and simulations are not null values.
This function isn’t set up to take weights.
Currently this function is working only on xr.DataArray.
When preserve_dims is set to ‘all’, the function returns NaN, similar to the Pearson correlation coefficient calculation for a single data point because the standard deviation is zero for a single point.

References

Gupta, H. V., Kling, H., Yilmaz, K. K., & Martinez, G. F. (2009). Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modeling. Journal of Hydrology, 377(1-2), 80-91. https://doi.org/10.1016/j.jhydrol.2009.08.003.
Knoben, W. J. M., Freer, J. E., & Woods, R. A. (2019). Technical note: Inherent benchmark or not? Comparing Nash-Sutcliffe and Kling-Gupta efficiency scores. Hydrology and Earth System Sciences, 23(10), 4323-4331. https://doi.org/10.5194/hess-23-4323-2019.

Examples

>>> kge_s = kge(forecasts, obs,preserve_dims='lat')  # if data is of dimension {lat,time}, kge value is computed across the time dimension
>>> kge_s = kge(forecasts, obs,reduce_dims="time")  # if data is of dimension {lat,time}, reduce_dims="time" is same as preserve_dims='lat'
>>> kge_s = kge(forecasts, obs, include_components=True) # kge_s is dataset of all three components and kge value itself
>>> kge_s_weighted = kge(forecasts, obs, scaling_factors=(0.5, 1.0, 2.0)) # with scaling factors

scores.continuous.nse(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, is_angular=False)#

The Nash-Sutcliffe model Efficiency coefficient (NSE) is primarily used in hydrology to assess the skill of model predictions (of e.g. “discharge”).

While NSE is often calculated over observations and model predictions in the time dimension, it is actually a fairly generic statistical measure that determines the relative magnitude of the residual variance (“noise”) compared to the measured data variance (“information”) (Nash and Sutcliffe, 1970). Incidentally, it is (inversely) related to the signal-to-noise ratio (SNR).

The general formulation of NSE is as follows:

\[\text{NSE} = 1 - \frac{\sum_i{(O_i - S_i)^2}}{\sum_i{(O_i - \bar{O})^2}}\]

where:

\(i\) is a generic “indexer” representing the set of datapoints along the dimensions being reduced e.g. time (\(t\)) or xy-coordinates (\((x, y)\)). The latter represents reduction over two dimensions as an example.
\(O_i\) is the observation at index \(i\).
\(S_i\) is the “forecast” or model simulation at index \(i\).
\(\bar{O}\) is the mean observation of the set of indexed samples as specified by reduce_dims and preserve_dims.

Parameters:

fcst (DataArray | Dataset) – “Forecast” or predicted variables.
obs (DataArray | Dataset) – Observed variables.
reduce_dims (Iterable[Hashable] | None) – dimensions to reduce when calculating the NSE. (i.e. NSE will be calculated using datapoints along these dimensions as samples, the other dimensions will be preserved).
preserve_dims (Iterable[Hashable] | None) – dimensions to preserve. Mutually exclusive to reduce_dims. All dimensions not specified here will be reduced as described in reduce_dims. Note: preserve_dims="all" is not supported for NSE. See notes below.
weights (DataArray | Dataset | None) – Optional weighting to apply to the NSE computation. Typically weights are applied over the time dimension but can vary by location as well. Weights must be non-negative and specified for each data point (i.e. the user must not assume broadcasting will handle appropriate assignment of weights for this score).
is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

NSE score for each preserved dimension

xr.Dataset: if fcst, obs and optionally weights are all datasets.

xr.DataArray: ditto above - where inputs are all dataarrays

See comments below for more information on mixed xarray data types (which this score does not handle) and type isomorphism.

Raises:

DimensionError – If any dimension checks fail.
KeyError – If no dimensions are being reduced - NSE requries at least 1 dimension to be reduced to compute the observation variance.
IndexError – If the dimensions being reduced only have 1 value - not sufficient for computation of obs variance.
UserWarning – If weights are negative, or invalid (e.g. all zeros or all NaNs).
UserWarning – If attempting to divide by 0. The computation will still succeed but produce np.nan (numerator is also 0) or -np.inf where divide by zero would occur.
Exception – Any other errors or warnings not otherwise listed due to calculations associated with utility functions such as gather_dimensions.
RuntimeError – If something went wrong with the underlying implementation. The user will be prompted to report this as a github issue.

Return type:

DataArray | Dataset

Supplementary details:

Nash-Sutcliffe efficiencies range from -Inf to 1. Essentially, the closer to 1, the more accurate the model is.
- NSE = 1, corresponds to a perfect match of the model to the obs.
- NSE = 0, indicates that the model is as accurate as the mean obs.
- -Inf < NSE < 0, indicates that the mean obs is better predictor than the model.
The optional weights argument can additionally be used to perform a weighted NSE (wNSE). Although, this is a generic set of weights, and it is the user’s responsiblility to define them appropriately. Typically this is the observation itself (Hundecha, Y., & Bárdossy, A., 2004).
weights must be non-negative. Therefore, the observations must ideally also be non-negative (or formulated appropriately) if used as weights.
While is_angular is typically not used for this score, NSE is generic enough that it _could_ be used in wider context, and hence is kept as an option. It is defaulted to False as that’s the typical use-case.

Important

This score does not allow mixed xarray data structures as inputs. Either provide all xr.DataArray or all xr.Dataset exclusively, for the fcst, obs and (optionally) weights arguments.

This is an intentionally imposed constraint to make sure the inner computations are simple to check and deterministic. See tips below for more information.

Warning

Operations between dataarrays are not guarenteed to preserve names. If the user is working with dataarrays, it is assumed that preserving names is not a major requirement. If a user needs the name preserved, they should explicitly convert all data array inputs to datasets using xr.DataArray.to_dataset(...) , and verify that the naming is retained appropriately before calling the score.

For operations where ONLY xr.DataArray inputs are used, the returned score will have its name forced to the name of this score i.e. “NSE”, for simplicity.

See tips below for more information.

Note

For Hydrology in particular \(i = t\), the reduced dimension is usually the time dimension. However, in order to keep things generic, this function does not explicitly mandate a time dimension be provided. Instead it requires it has a hard requirement that at least one dimension is being reduced from either a specification of reduce_dims or preserve_dims (mutually exclusive).

The reason is that the observation variance cannot be non-zero if nothing is being reduced.

As a side-effect of the above requirement, preserve_dims="all" is not allowed and will naturally throw an error.

Note

Divide by zero is allowed - to accomodate scenarios where all obs entries in the group being reduced is constant (0 obs variance).

While these may cause divide by zero errors, they should not halt execution of computations for other valid coordinates - so a warning is issued instead to prompt the user to double check the data.

It may also be that divide by zero is unavoidable - in which case we still want to return the correctly calculated values. To this end, this is how numpy resolves divide by zero:

# note: psuedocode
n / 0  =  NaN   : if n == 0  (represented by np.nan)
       = -Inf   : if n == 0  (represented by -np.inf)

Tip

When dealing with dask arrays dask, no computation will happen until .compute(...) is called on the returned score.

Tip

Work with datasets where possible with NSE, or for any score that supports datasets for that matter. Datasets maintain structural integrity better than their dataarray counterparts and also are compatible with higher order types like xr.DataTree.

Operations between datasets are more predictable than operations with mixed types. Data arrays on the other hand may ignore names and broadcast liberally even when names do not match, and this may not be consistent depending on the operation. This may or may not be the intented behaviour the user expects. Operations between only dataarrays are fine as long as preserving names is not mandatory.

Examples

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import nse
>>> obs_raw = np.array(
...     [
...         [[1,2,3], [4,5,6]],
...         [[3,2,1], [6,5,4]],
...         [[3,2,5], [2,2,6]],
...         [[5,2,3], [4,-1,4]],
...     ]
...
... )  # dimension lengths: x=4, y=2, t=3
>>> obs = xr.DataArray(obs_raw, dims=["x", "y", "t"])
>>> fcst = obs * 1.2 + 0.1  # add some synthetic bias and variance
>>> # Example 1:
>>> # reduce over t - time - should produce a xy-grid (4 by 2)
>>> nse(obs, fcst, reduce_dims=["t"])
<xarray.DataArray (x: 4, y: 2)> Size: 64B
array([[ 0.71180556, -0.28819444],
       [ 0.71180556, -0.28819444],
       [ 0.70982143,  0.85742188],
       [ 0.70982143,  0.93208333]])
Dimensions without coordinates: x, y
>>> # Example 2:
>>> # reduce over (x, y) - space - should be a t-vector (3 by 1)
>>> nse(obs, fcst, reduce_dims=["x", "y"])
<xarray.DataArray (t: 3)> Size: 24B
array([0.77469136, 0.90123457, 0.74722222])
Dimensions without coordinates: t

References

Nash, J. E., & Sutcliffe, J. V. (1970). River flow forecasting through conceptual models part I — A discussion of principles. In Journal of Hydrology (Vol. 10, Issue 3, pp. 282– 290). Elsevier BV. https://doi.org/10.1016/0022-1694%2870%2990255-6
Hundecha, Y., & Bárdossy, A. (2004). Modeling of the effect of land use changes on the runoff generation of a river basin through parameter regionalization of a watershed model. Journal of Hydrology, 292(1-4), 281-295. https://doi.org/10.1016/j.jhydrol.2004.01.002

scores.continuous.isotonic_fit(fcst, obs, *, weight=None, functional='mean', bootstraps=None, quantile_level=None, solver=None, confidence_level=0.9, min_non_nan=1, report_bootstrap_results=False)#

Calculates an isotonic regression fit to observations given forecasts as the explanatory values. Optionally returns confidence bands on the fit via bootstrapping. Forecasts and observations are scalar-valued (i.e, integers, floats or NaN).

If forecasts target the mean or a quantile functional, then the user can select that functional. Otherwise the user needs to supply a solver function

of one variable (if no weights are supplied), or

of two variables (if weights are supplied).

The solver function is used to find an optimal regression fit in each block, subject to a non-decreasing monotonicity constraint, using the pool adjacent violators (PAV) algorithm.

An example of a solver function of one variable is obs -> np.mean(obs) An example of a solver function of two variables is (obs, weight) -> np.average(obs, weights=weight).

Ties (where one forecast value has more than one observation value) are handled as follows. Forecasts are sorted in ascending order, and observations sorted to match. Where ties occur, observations are sorted in descending order within each subset of ties. Isotonic regression is then performed on this sorted observation sequence.

This implementation uses sklearn.isotonic.IsotonicRegression when functional=”mean”.

Parameters:

fcst (ndarray | DataArray) – np.array or xr.DataArray of forecast (or explanatory) values. Values must be float, integer or NaN.
obs (ndarray | DataArray) – np.array or xr.DataArray of corresponding observation (or response) values. Must be the same shape as fcst. Values must be float, integer or NaN.
weight (ndarray | DataArray | None) – positive weights to apply to each forecast-observation pair. Must be the same shape as fcst, or None (which is equivalent to applying equal weights).
functional (Literal['mean', 'quantile'] | None) – Functional that the forecasts are targeting. Either “mean” or “quantile” or None. If None then solver must be supplied. The current implementation for “quantile” does not accept weights. If weighted quantile regression is desired then the user should should supply an appropriate solver.
bootstraps (int | None) – the number of bootstrap samples to perform for calculating the regression confidence band. Set to None if a confidence band is not required.
quantile_level (float | None) – the level of the quantile functional if functional=’quantile’. Must be strictly between 0 and 1.
solver (Callable[[ndarray, ndarray], float] | Callable[[ndarray], float] | None) – function that accepts 1D numpy array of observations and returns a float. Function values give the regression fit for each block as determined by the PAV algorithm. If weights are supplied, the function must have a second argument that accepts 1D numpy array of weights of same length as the observations.
confidence_level (float | None) – Confidence level of the confidence band, strictly between 0 and 1. For example, a confidence level of 0.5 will calculate the confidence band by computing the 25th and 75th percentiles of the bootstrapped samples.
min_non_nan (int | None) – The minimum number of non-NaN bootstrapped values to calculate confidence bands at a particular forecast value.
report_bootstrap_results (bool | None) – This specifies to whether keep the bootstrapped regression values in return dictionary. Default is False to keep the output result small.

Returns:

“unique_fcst_sorted”: 1D numpy array of remaining forecast values sorted in
ascending order, after any NaNs from fcst, obs and weight are removed, and only unique values are kept to keep the output size reasonably small.
”fcst_counts”: 1D numpy array of forecast counts for unique values of forecast sorted
”regression_values”: 1D numpy array of regression values corresponding to
”unique_fcst_sorted” values.
”regression_func”: function that returns the regression fit based on linear
interpolation of (“fcst_sorted”, “regression_values”), for any supplied argument (1D numpy array) of potential forecast values.
”bootstrap_results”: in the case of report_bootstrap_results=True, 2D numpy
array of bootstrapped regression values is included in return dictionary. Each row gives the interpolated regression values from a particular bootstrapped sample, evaluated at “fcst_sorted” values. If m is the number of bootstrap samples and n = len(fcst_sorted) then it is has shape (m, n). We emphasise that this array corresponds to fcst_sorted not unique_fcst_sorted.
”confidence_band_lower_values”: values of lower confidence band threshold, evaluated
at “unique_fcst_sorted” values.
”confidence_band_upper_values”: values of upper confidence band threshold, evaluated
at “unique_fcst_sorted” values.
”confidence_band_lower_func”: function that returns regression fit based on linear
interpolation of (“fcst_sorted”, “confidence_band_lower_values”), given any argument (1D numpy array) of potential forecast values.
”confidence_band_upper_func”: function that returns regression fit based on linear
interpolation of (“fcst_sorted”, “confidence_band_upper_values”), given any argument (1D numpy array) of potential forecast values.
”confidence_band_levels”: tuple giving the quantile levels used to calculate the
confidence band.

Return type:

Dictionary with the following keys

Raises:

ValueError – if fcst and obs are np.arrays and don’t have the same shape.
ValueError – if fcst and weight are np.arrays and don’t have the same shape.
ValueError – if fcst and obs are xarray.DataArrays and don’t have the same dimensions.
ValueError – if fcst and weight are xarray.DataArrays and don’t have the same dimensions.
ValueError – if any entries of fcst, obs or weight are not an integer, float or NaN.
ValueError – if there are no fcst and obs pairs after NaNs are removed.
ValueError – if functional is not one of “mean”, “quantile” or None.
ValueError – if quantile_level is not strictly between 0 and 1 when functional=”quantile”.
ValueError – if weight is not None when functional=”quantile”.
ValueError – if functional and solver are both None.
ValueError – if not exactly one of functional and solver is None.
ValueError – if bootstraps is not a positive integer.
ValueError – if confidence_level is not strictly between 0 and 1.

Note: This function only keeps the unique values of fcst_sorted to keep the volume of

the return dictionary small. The forecast counts is also included, so users can it to create forecast histogram (usually displayed in the reliability diagrams).

References

de Leeuw, Hornik and Mair. “Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods”, Journal of Statistical Software, 2009.
Dimitriadis, Gneiting and Jordan. “Stable reliability diagrams for probabilistic classifiers”, PNAS, Vol. 118 No. 8, 2020. Available at https://www.pnas.org/doi/10.1073/pnas.2016191118
Jordan, Mühlemann, and Ziegel. “Optimal solutions to the isotonic regression problem”, 2020 (version 2), available on arxiv at https://arxiv.org/abs/1904.04761

scores.continuous.consistent_expectile_score(fcst, obs, alpha, phi, phi_prime, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the score using a scoring function that is consistent for the alpha-expectile functional, based on a supplied convex function phi. See Geniting (2011), or Equation (10) from Taggart (2022).

\[\begin{split}S(x, y) = \begin{cases} (1 - \alpha)(\phi(y) - \phi(x) - \phi'(x)(y-x)), & y < x \\ \alpha(\phi(y) - \phi(x) - \phi'(x)(y-x)), & x \leq y \end{cases}\end{split}\]

where

\(x\) is the forecast
\(y\) is the observation
\(\alpha\) is the expectile level
\(\phi\) is a convex function of a single variable
\(\phi'\) is the subderivative of \(\phi\)
\(S(x,y)\) is the score.

Note that if \(\phi\) is differentiable then \(\phi'\) is its derivative.

The score is negatively oriented with \(0 \leq S(x,y) < \infty\).

We strongly recommend that you work through the tutorial https://scores.readthedocs.io/en/stable/tutorials/Consistent_Scores.html to help you understand how to use the consistent scoring functions in scores.

Parameters:

fcst (DataArray) – array of forecast values.
obs (DataArray) – array of corresponding observation values.
alpha (float) – expectile level. Must be strictly between 0 and 1.
phi (Callable[[DataArray], DataArray]) – a convex function on the real numbers, accepting a single array like argument.
phi_prime (Callable[[DataArray], DataArray]) – a subderivative of phi, accepting a single array like argument.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the consistent expectile score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the consistent quantile score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the consistent quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

array of (mean) scores that is consistent for alpha-expectile functional, with the dimensions specified by dims. If dims is None, the returned DataArray will have only one entry, the overall mean score.

Raises:

ValueError – if alpha is not strictly between 0 and 1.

Return type:

DataArray

References

Gneiting, T. (2011). Making and Evaluating Point Forecasts. Journal of the American Statistical Association, 106(494), 746–762. https://doi.org/10.1198/jasa.2011.r10138
Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306–320. https://doi.org/10.1002/qj.4206

scores.continuous.consistent_quantile_score(fcst, obs, alpha, g, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the score that is consistent for the alpha-quantile functional, based on nondecreasing function g. See Gneiting (2011), or Equation (8) from Taggart (2022).

\[\begin{split}S(x, y) = \begin{cases} (1 - \alpha)(g(x) - g(y)), & y < x \\ \alpha(g(y) - g(x)), & x \leq y \end{cases}\end{split}\]

where

\(x\) is the forecast
\(y\) is the observation
\(\alpha\) is the quantile level
\(g\) is a nondecreasing function of a single variable
\(S(x,y)\) is the score.

The score is negatively oriented with \(0 \leq S(x,y) < \infty\).

Parameters:

fcst (DataArray) – array of forecast values.
obs (DataArray) – array of corresponding observation values.
alpha (float) – quantile level. Must be strictly between 0 and 1.
g (Callable[[DataArray], DataArray]) – nondecreasing function on the real numbers, accepting a single array like argument.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the consistent quantile score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the consistent quantile score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the consistent quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

array of (mean) scores that are consistent for alpha-quantile functional, with the dimensions specified by dims. If dims is None, the returned DataArray will have only one entry, the overall mean score.

Raises:

ValueError – if alpha is not strictly between 0 and 1.

Return type:

DataArray

References

Gneiting, T. (2011). Making and Evaluating Point Forecasts. Journal of the American Statistical Association, 106(494), 746–762. https://doi.org/10.1198/jasa.2011.r10138
Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306–320. https://doi.org/10.1002/qj.4206

scores.continuous.consistent_huber_score(fcst, obs, huber_param, phi, phi_prime, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the score that is consistent for the Huber mean functional with tuning parameter tuning_param, based on convex function phi. See Taggart (2022a), or Equation (11) from Taggart (2022b). See Taggart (2022b), end of Section 3.4, for the standard formula.

\[S(x,y) = \frac{1}{2}(\phi(y)-\phi(\kappa_v (x-y) + y) + \kappa_v (x-y)\phi'(x))\]

where

\(\kappa_v(x) = \mathrm{max}(-v, \mathrm{min}(x, v))\) is the “capping function”
\(v\) is the Huber parameter
\(x\) is the forecast
\(y\) is the observation
\(\phi\) is a convex function of a single variable
\(\phi'\) is the subderivative of \(\phi\)
\(S(x,y)\) is the score.

The score is negatively oriented with \(0 \leq S(x,y) < \infty\).

Parameters:

fcst (DataArray) – array of forecast values.
obs (DataArray) – array of corresponding observation values.
huber_param (float) – Huber mean tuning parameter. This corresponds to the transition point between linear and quadratic loss for Huber loss. Must be positive.
phi (Callable[[DataArray], DataArray]) – a convex function on the real numbers, accepting a single array like argument.
phi_prime (Callable[[DataArray], DataArray]) – a subderivative of phi, accepting a single array like argument.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the consistent Huber score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the consistent Huber score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the consistent Huber score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

array of (mean) scores that is consistent for Huber mean functional, with the dimensions specified by dims. If dims is None, the returned DataArray will have only one entry, the overall mean score.

Raises:

ValueError – if huber_param <= 0.

Return type:

DataArray

References

Taggart, R. J. (2022a). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-ejs1957
Taggart, R. (2022b). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306–320. https://doi.org/10.1002/qj.4206

scores.continuous.tw_quantile_score(fcst, obs, alpha, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted quantile score.

For more flexible threshold weighting schemes, see scores.continuous.consistent_quantile_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.

To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.
To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:

fcst (DataArray) – array of forecast values.
obs (DataArray) – array of corresponding observation values.
alpha (float) – the quantile level. Must be strictly between 0 and 1.
interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.
interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_quantile_score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_quantile_score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_quantile_score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

xarray data array of the threshold weighted quantile error

Raises:

ValueError – if interval_where_one is not length 2.
ValueError – if interval_where_one is not increasing.
ValueError – if alpha is not in the open interval (0,1).
ValueError – if interval_where_one is not increasing.
ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal weight.

Return type:

DataArray

References

Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320. https://doi.org/10.1002/qj.4206

scores.continuous.tw_absolute_error(fcst, obs, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted absolute error.

For more flexible threshold weighting schemes, see scores.continuous.consistent_quantile_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.

To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.
To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:

fcst (DataArray) – array of forecast values.
obs (DataArray) – array of corresponding observation values.
interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.
interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_absolute_error. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_absolute_error. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_absolute_error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

xarray data array of the threshold weighted absolute error

Raises:

ValueError – if interval_where_one is not length 2.
ValueError – if interval_where_one is not increasing.
ValueError – if interval_where_one is not increasing.
ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal weight.

Return type:

DataArray

References

scores.continuous.tw_squared_error(fcst, obs, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the the threshold weighted squared error.

For more flexible threshold weighting schemes, see scores.continuous.consistent_expectile_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.

To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.
To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:

fcst (DataArray) – array of forecast values.
obs (DataArray) – array of corresponding observation values.
interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.
interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_square_error. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_squared_error. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_squared_error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

xarray data array of the threshold weighted squared error

Raises:

ValueError – if interval_where_one is not length 2.
ValueError – if interval_where_one is not increasing.
ValueError – if interval_where_one is not increasing.
ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal weight.

Return type:

DataArray

Reference:: Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320. https://doi.org/10.1002/qj.4206

scores.continuous.tw_huber_loss(fcst, obs, huber_param, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted Huber loss.

For more flexible threshold weighting schemes, see scores.continuous.consistent_huber_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.

To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.
To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:

fcst (DataArray) – array of forecast values.
obs (DataArray) – array of corresponding observation values.
huber_param (float) – the Huber transition parameter.
interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.
interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_expectile_score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_expectile_score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_expectile_score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

xarray data array of the threshold weighted expectile error

Raises:

ValueError – if interval_where_one is not length 2.
ValueError – if interval_where_one is not increasing.
ValueError – if alpha is not in the open interval (0,1).
ValueError – if huber_param is not positive
ValueError – if interval_where_one is not increasing.
ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal threshold weight.

Return type:

DataArray

References

scores.continuous.tw_expectile_score(fcst, obs, alpha, interval_where_one, *, interval_where_positive=None, reduce_dims=None, preserve_dims=None, weights=None)#

Returns the threshold weighted expectile score.

For more flexible threshold weighting schemes, see scores.continuous.consistent_expectile_score().

Two types of threshold weighting are supported: rectangular and trapezoidal.

To specify a rectangular threshold weight, set interval_where_positive=None and set interval_where_one to be the interval where the threshold weight is 1. For example, if interval_where_one=(0, 10) then a threshold weight of 1 is applied to decision thresholds satisfying 0 <= threshold < 10, and a threshold weight of 0 is applied otherwise. Interval endpoints can be -numpy.inf or numpy.inf.
To specify a trapezoidal threshold weight, specify interval_where_positive and interval_where_one using desired endpoints. For example, if interval_where_positive=(-2, 10) and interval_where_one=(2, 4) then a threshold weight of 1 is applied to decision thresholds satisfying 2 <= threshold < 4. The threshold weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. Interval endpoints can only be infinite if the corresponding interval_where_one endpoint is infinite. End points of interval_where_positive and interval_where_one must differ except when the endpoints are infinite.

Parameters:

fcst (DataArray) – array of forecast values.
obs (DataArray) – array of corresponding observation values.
alpha (float) – expectile level. Must be strictly between 0 and 1.
interval_where_one (Tuple[int | float | DataArray, int | float | DataArray]) – endpoints of the interval where the threshold weights are 1. Must be increasing. Infinite endpoints are permissible. By supplying a tuple of arrays, endpoints can vary with dimension.
interval_where_positive (Tuple[int | float | DataArray, int | float | DataArray] | None) – endpoints of the interval where the threshold weights are positive. Must be increasing. Infinite endpoints are only permissible when the corresponding interval_where_one endpoint is infinite. By supplying a tuple of arrays, endpoints can vary with dimension.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the threshold_weighted_expectile_score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the threshold_weighted_expectile_score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the threshold_weighted_expectile_score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

xarray data array of the threshold weighted expectile error

Raises:

ValueError – if interval_where_one is not length 2.
ValueError – if interval_where_one is not increasing.
ValueError – if alpha is not in the open interval (0,1).
ValueError – if interval_where_one and interval_where_positive do not specify a valid trapezoidal weight.

Return type:

DataArray

References

scores.continuous.quantile_interval_score(fcst_lower_qtile, fcst_upper_qtile, obs, lower_qtile_level, upper_qtile_level, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the quantile interval score for interval forecasts. This score penalises the interval’s width and whether the observation value lies outside the interval.

\[\text{quantile interval score} = \underbrace{(q_{u} - q_{l})}_{\text{interval width penalty}} + \underbrace{\frac{1}{\alpha_l} \cdot (q_{l} - y) \cdot \mathbb{1}(y < q_{l})}_{\text{over-prediction penalty}} + \underbrace{\frac{1}{1 - \alpha_u} \cdot (y - q_{u}) \cdot \mathbb{1}(y > q_{u})}_{\text{under-prediction penalty}}\]

where

\(q_u\) is the forecast at the upper quantile
\(q_l\) is the forecast at the lower quantile
\(\alpha_u\) is the upper quantile level
\(\alpha_l\) is the lower quantile level
\(y\) is the observation
\(\mathbb{1}(condition)\) is an indicator function that is 1 when the condition is true, and 0 otherwise.

Parameters:

fcst_lower_qtile (DataArray) – array of forecast values at the lower quantile.
fcst_upper_qtile (DataArray) – array of forecast values at the upper quantile.
obs (DataArray) – array of observations.
lower_qtile_level (float) – Quantile level between 0 and 1 (exclusive) for fcst_lower_qtile.
upper_qtile_level (float) – Quantile level between 0 and 1 (exclusive) for fcst_upper_qtile.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the quantile interval score. All other dimensions will be preserved. As a special case, “all” will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the quantile interval score. All other dimensions will be reduced. As a special case, “all” will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

A Dataset with the dimensions specified in dims. The dataset will have the following data variables:

interval_width_penalty: the interval width penalty contribution of the quantile interval score
overprediction_penalty: the over-prediction penalty contribution of the quantile interval score
underprediction_penalty: the under-prediction penalty contribution of the quantile interval score
total: sum of all penalties

Raises:

ValueError – If not (0 < lower_qtile_level < upper_qtile_level < 1).
ValueError – If (fcst_lower_qtile > fcst_upper_qtile).any().

Return type:

Dataset

References

Winkler, R. L. (1972). A Decision-Theoretic Approach to Interval Estimation. Journal of the American Statistical Association, 67(337), 187. https://doi.org/10.2307/2284720

Examples

Calculate the quantile interval score for forecast intervals with lower and upper quantile levels of 0.1 and 0.6, respectively.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import quantile_interval_score
>>> fcst_lower_level = xr.DataArray(np.random.uniform(10, 15, size=(30, 15)), dims=['time', 'station'])
>>> fcst_upper_level = xr.DataArray(np.random.uniform(20, 25, size=(30, 15)), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.uniform(8, 27,size=(30, 15)), dims=['time', 'station'])
>>> quantile_interval_score(fcst_lower_level, fcst_upper_level, obs, 0.1, 0.6)

scores.continuous.interval_score(fcst_lower_qtile, fcst_upper_qtile, obs, interval_range, *, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the interval score for interval forecasts. This function calls the scores.continuous.quantile_interval_score() function to calculate the interval score for cases where the quantile level range is symmetric.

\[\text{interval score} = \underbrace{(q_{u} - q_{l})}_{\text{interval width penalty}} + \underbrace{\frac{2}{\alpha} \cdot (q_{l} - y) \cdot \mathbb{1}(y < q_{l})}_{\text{over-prediction penalty}} + \underbrace{\frac{2}{\alpha} \cdot (y - q_{u}) \cdot \mathbb{1}(y < q_{u})}_{\text{under-prediction penalty}}\]

where

\(q_u\) is the forecast at the upper quantile
\(q_l\) is the forecast at the lower quantile
\(\alpha\) is the confidence level that is equal to \(1 - \text{interval_range}\)
\(y\) is the observation
\(\mathbb{1}(condition)\) is an indicator function that is 1 when the condition is true, and 0 otherwise.

Parameters:

fcst_lower_qtile (DataArray) – array of forecast values at the lower quantile.
fcst_upper_qtile (DataArray) – array of forecast values at the upper quantile.
obs (DataArray) – array of observations.
interval_range (float) – Range (length) of interval (e.g., 0.9 for 90% confidence level, which will result in lower quantile of 0.05 and upper quantile of 0.95). Must be strictly between 0 and 1.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the interval score. All other dimensions will be preserved. As a special case, “all” will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the interval score. All other dimensions will be reduced. As a special case, “all” will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the quantile score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

A Dataset with the dimensions specified in dims. The dataset will have the following data variables:

interval_width_penalty: the interval width penalty contribution of the interval score
overprediction_penalty: the over-prediction penalty contribution of the interval score
underprediction_penalty: the under-prediction penalty contribution of the interval score
total: sum of all penalties

As can be seen in the interval score equation, the lower and upper quantile levels are derived from the interval range: lower_qtile_level = (1 - interval_range) / 2 and upper_qtile_level = (1 + interval_range) / 2.

Raises:

ValueError – If not 0 < interval_range < 1.
ValueError – If (fcst_lower_qtile > fcst_upper_qtile).any().

Return type:

Dataset

References

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359-378. Section 6.2. https://doi.org/10.1198/016214506000001437

See also:
scores.continuous.quantile_interval_score()

Examples

Calculate the interval score for forecast intervals with an interval range of 0.5 (i.e., lower and upper quantile levels are 0.25 and 0.75, respectively).

>>> import numpy as np
>>> import xarray as xr
>>> from scores.continuous import interval_score
>>> fcst_lower_level = xr.DataArray(np.random.uniform(10, 15, size=(30, 15)), dims=['time', 'station'])
>>> fcst_upper_level = xr.DataArray(np.random.uniform(20, 25, size=(30, 15)), dims=['time', 'station'])
>>> obs = xr.DataArray(np.random.uniform(8, 27,size=(30, 15)), dims=['time', 'station'])
>>> interval_score(fcst_lower_level, fcst_upper_level, obs, 0.5)

scores.continuous.qq(fcst, obs, *, quantiles, interpolation_method='linear', data_source_dim='data_source', reduce_dims=None, preserve_dims=None)#

Calculate the quantiles of empirical forecast and observation data for a quantile-quantile (Q-Q) plot.

Forecast and observation data are broadcast and NaN values are matched before calculating the quantiles. This function calculates the quantiles for the user to plot using their preferred plotting library.

Currently this function is implemented for working with raw forecast and observation data rather than a theoretical probability distribution.

Parameters:

fcst (DataArray | Dataset) – Forecast data.
obs (DataArray | Dataset) – Observation data.
quantiles (Iterable[float]) – Quantiles to compute, e.g., [0.1, 0.5, 0.9].
interpolation_method (str) – Interpolation method for estimating the quantile value, default is ‘linear’. Valid options include ‘inverted_cdf’, ‘averaged_inverted_cdf’, ‘closest_observation’, ‘interpolated_inverted_cdf’, ‘hazen’, ‘weibull’, ‘linear’, ‘median_unbiased’, ‘normal_unbiased’, ‘lower’, ‘higher’, ‘midpoint’, and ‘nearest’. See numpy.quantile() for more information on interpolation method options.
data_source_dim (str) – Name of the new dimension to be created for the data source, default is ‘data_source’. This dimension will contain two values: ‘fcst’ and ‘obs’, representing the forecast and observation data, respectively.
reduce_dims (Iterable[Hashable] | None) – Dimensions to reduce over when calculating quantiles.
preserve_dims (Iterable[Hashable] | None) – Dimensions to preserve when calculating quantiles.

Returns:

An xarray object with a new dimension called “data_source”.

Raises:

ValueError – If the interpolation method is invalid.
ValueError – if quantiles are not between 0 and 1.
ValueError – If a dimension in the input data has the same name as data_source_dim
ValueError – If fcst and obs are Datasets with different data variables.
ValueError – If a user tries to preserve all dimensions.
TypeError – If fcst and obs are not both xarray DataArrays or Datasets.

Return type:

DataArray | Dataset

References

Déqué, M. (2011). Deterministic forecasts of continuous variables. In I. T. Jolliffe & D. B. Stephenson (Eds.), Forecast verification: A practitioner’s guide in atmospheric science (2nd ed., pp. 77–94). Wiley. https://doi.org/10.1002/9781119960003.ch5

Example

>>> import xarray as xr
>>> import numpy as np
>>> from scores.plotdata import qq
>>> fcst = xr.DataArray(np.random.rand(100), dims='time')
>>> obs = xr.DataArray(np.random.rand(100), dims='time')
>>> result = qq(fcst, obs, quantiles=[0.1, 0.5, 0.9])

scores.probability#

scores.probability.crps_cdf(fcst, obs, *, threshold_dim='threshold', threshold_weight=None, additional_thresholds=None, propagate_nans=True, fcst_fill_method='linear', threshold_weight_fill_method='forward', integration_method='exact', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the CRPS probabilistic metric given CDF input.

Calculates the continuous ranked probability score (CRPS), or the mean CRPS over specified dimensions, given forecasts in the form of predictive cumulative distribution functions (CDFs). Can also calculate threshold-weighted versions of the CRPS by supplying a threshold_weight.

Predictive CDFs here are described by an indexed set of values rather than by closed forumulae. As a result, the resolution or number of points at which the CDF is realised has an impact on the calculation of areas under and over the curve to obtain the CRPS result. The term ‘threshold’ is used to describe the dimension which is used as an index for predictive CDF values. Various techniques are used to interpolate CDF values between indexed thresholds.

Given:

a predictive CDF fcst indexed at thresholds by variable x
an observation in CDF form obs_cdf (i.e., \(\text{obs_cdf}(x) = 0\) if \(x < \text{obs}\) and 1 if \(x >= \text{obs}\))
a threshold_weight array indexed by variable x

The threshold-weighted CRPS is given by:

\(twCRPS = \int_{-\infty}^{\infty}{[\text{threshold_weight}(x) \times (\text{fcst}(x) - \text{obs_cdf}(x))^2]\text{d}x}\), over all thresholds x.
The usual CRPS is the threshold-weighted CRPS with \(\text{threshold_weight}(x) = 1\) for all x.

This can be decomposed into an over-forecast penalty:

\(\int_{-\infty}^{\infty}{[\text{threshold_weight}(x) \times \text{fcst}(x) - \text{obs_cdf}(x))^2]\text{d}x}\), over all thresholds x where x >= obs

and an under-forecast penalty:

\(\int_{-\infty}^{\infty}{[\text{threshold_weight}(x) \times \text{(fcst}(x) - \text{obs_cdf}(x)^2]\text{d}x}\), over all thresholds x where x <= obs.

To obtain the components of the CRPS score, set include_components to True.

Note that there are several ways to decompose the CRPS and this decomposition differs from the one used in the scores.probability.crps_for_ensemble() function.

Note that the function crps_cdf is designed so that the obs argument contains actual observed values. crps_cdf will convert obs into CDF form in order to calculate the CRPS.

To calculate CRPS, integration is applied over the set of thresholds x taken from:

fcst[threshold_dim].values,
obs.values.
threshold_weight[threshold_dim].values if applicable.
additional_thresholds if applicable.
(with NaN values excluded)

There are two methods of integration:

“exact” gives the exact integral under that assumption that that fcst is continuous and piecewise linear between its specified values, and that threshold_weight (if supplied) is piecewise constant and right-continuous between its specified values.
“trapz” simply uses a trapezoidal rule using the specified values, and so is an approximation of the CRPS. To get an accurate approximation, the density of threshold values can be increased by supplying additional_thresholds.

Both methods of calculating CRPS may require adding additional values to the threshold_dim dimension in fcst and (if supplied) threshold_weight. fcst_fill_method and weight_fill_method specify how fcst and weight are to be filled at these additional points.

The function crps_cdf calculates the CRPS using forecast CDF values ‘as is’. No checks are performed to ensure that CDF values in fcst are nondecreasing along threshold_dim. Checks are conducted on fcst and threshold_weight (if applicable) to ensure that coordinates are increasing along threshold_dim.

Parameters:

fcst (DataArray) – array of forecast CDFs, with the threshold dimension given by threshold_dim.
obs (DataArray) – array of observations, not in CDF form.
threshold_dim (str) – name of the dimension in fcst that indexes the thresholds.
threshold_weight (DataArray | None) – weight to be applied along threshold_dim to calculate threshold-weighted CRPS. Must contain threshold_dim as a dimension, and may also include other dimensions from fcst. If weight=None, a weight of 1 is applied everywhere, which gives the usual CRPS.
additional_thresholds (Iterable[float] | None) – additional thresholds values to add to fcst and (if applicable) threshold_weight prior to calculating CRPS.
propagate_nans (bool) – If True, propagate NaN values along threshold_dim in fcst and threshold_weight prior to calculating CRPS. This will result in CRPS being NaN for these cases. If False, NaN values in fcst and weight will be replaced, wherever possible, with non-NaN values using the fill method specified by fcst_fill_method and threshold_weight_fill_method.
fcst_fill_method (Literal['linear', 'step', 'forward', 'backward']) –
how to fill values in fcst when NaNs have been introduced (by including additional thresholds) or are specified to be removed (by setting propagate_nans=False). Select one of:
- ”linear”: use linear interpolation, then replace any leading or trailing NaNs using linear extrapolation. Afterwards, all values are clipped to the closed interval [0, 1].
- ”step”: apply forward filling, then replace any leading NaNs with 0.
- ”forward”: first apply forward filling, then remove any leading NaNs by back filling.
- ”backward”: first apply back filling, then remove any trailing NaNs by forward filling.
- (In most cases, “linear” is likely the appropriate choice.)
threshold_weight_fill_method (Literal['linear', 'step', 'forward', 'backward']) – how to fill values in threshold_weight when NaNs have been introduced (by including additional thresholds) or are specified to be removed (by setting propagate_nans=False). Select one of “linear”, “step”, “forward” or “backward”. If the weight function is continuous, “linear” is probably the best choice. If it is an increasing step function, “forward” may be best.
integration_method (str) – one of “exact” or “trapz”.
preserve_dims (Tuple[str]) – dimensions to preserve in the output. All other dimensions are collapsed by taking the mean.
reduce_dims (Tuple[str]) – dimensions to reduce in the output by taking the mean. All other dimensions are preserved.
weights – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
include_components (bool) – if True, include the under and over forecast components of the score in the returned dataset.

Returns:

The following are the produced Dataset variables:

”total” the total CRPS.
”underforecast_penalty”: the under-forecast penalty contribution of the CRPS.
”overforecast_penalty”: the over-forecast penalty contribution of the CRPS.

Return type:

xr.Dataset

Raises:

ValueError – if threshold_dim is not a dimension of fcst.
ValueError – if threshold_dim is not a dimension of threshold_weight when threshold_weight is not None.
ValueError – if threshold_dim is a dimension of obs.
ValueError – if dimensions of obs are not also dimensions of fcst.
ValueError – if dimensions of threshold_weight are not also dimensions of fcst when threshold_weight is not None.
ValueError – if dims is not a subset of dimensions of fcst.
ValueError – if fcst_fill_method is not one of ‘linear’, ‘step’, ‘forward’ or ‘backward’.
ValueError – if weight_fill_method is not one of ‘linear’, ‘step’, ‘forward’ or ‘backward’.
ValueError – if fcst[threshold_dim] has less than 2 values.
ValueError – if coordinates in fcst[threshold_dim] are not increasing.
ValueError – if threshold_weight is not None and coordinates in threshold_weight[threshold_dim] are not increasing.
ValueError – if threshold_weight has negative values.

See also

scores.probability.crps_cdf_brier_decomposition()
scores.probability.crps_for_ensemble()

References

Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Management Science, 22(10), 1087–1095. https://doi.org/10.1287/mnsc.22.10.1087
Gneiting, T., & Ranjan, R. (2011). Comparing Density Forecasts Using Threshold- and Quantile-Weighted Scoring Rules. Journal of Business & Economic Statistics, 29(3), 411–422. https://doi.org/10.1198/jbes.2010.08110

scores.probability.adjust_fcst_for_crps(fcst, threshold_dim, obs, *, decreasing_tolerance=0, additional_thresholds=None, fcst_fill_method='linear', integration_method='exact')#

This function takes a forecast cumulative distribution functions (CDF) fcst.

If fcst is not decreasing outside of specified tolerance, it returns fcst.

Otherwise, the CDF envelope for fcst is computed, and the CDF from among:

fcst,
the upper envelope, and
the lower envelope

that has the higher (i.e. worse) CRPS is returned. In the event of a tie, preference is given in the order fcst then upper.

See scores.probability.functions.cdf_envelope for details about the CDF envelope.

The use case for this is when, either due to rounding or poor forecast process, the forecast CDF fcst fails to be nondecreasing. Rather than simply replacing fcst with NaN, adjust_fcst_for_crps returns a CDF for which CRPS can be calculated, but possibly with a predictive performance cost as measured by CRPS.

Whether a CDF is decreasing outside specified tolerance is determined as follows. For each CDF in fcst, the sum of incremental decreases along the threshold dimension is calculated. For example, if the CDF values are

[0, 0.4, 0.3, 0.9, 0.88, 1]

then the sum of incremental decreases is -0.12. This CDF decreases outside specified tolerance if 0.12 > decreasing_tolerance.

The adjusted array of forecast CDFs is determined as follows:

any NaN values in fcst are propagated along threshold_dim so that in each case the entire CDF is NaN;
any CDFs in fcst that are decreasing within specified tolerance are unchanged;
any CDFs in fcst that are decreasing outside specified tolerance are replaced with whichever of the upper or lower CDF envelope gives the highest CRPS, unless the original values give a higher CRPS in which case original values are kept.

See scores.probability.functions.cdf_envelope for a description of the ‘CDF envelope’.

If propagating NaNs is not desired, the user may first fill NaNs in fcst using scores.probability.functions.fill_cdf.

The CRPS for each forecast case is calculated using crps, with a weight of 1.

Parameters:

fcst (DataArray) – DataArray of CDF values with threshold dimension threshold_dim.
threshold_dim (str) – name of the threshold dimension in fcst.
obs (DataArray) – DataArray of observations.
decreasing_tolerance (float) – nonnegative tolerance value.
additional_thresholds (Iterable[float] | None) – optional additional thresholds passed on to crps when calculating CRPS.
fcst_fill_method (Literal['linear', 'step', 'forward', 'backward']) – fcst fill method passed on to crps when calculating CRPS.
integration_method (Literal['exact', 'trapz']) – integration method passed on to crps when calculating CRPS.

Returns:

An xarray DataArray of possibly adjusted forecast CDFs, where adjustments are made to penalise CDFs that decrease outside tolerance.

Raises:

ValueError – If threshold_dim is not a dimension of fcst.
ValueError – If decreasing_tolerance is negative.

Return type:

DataArray

scores.probability.crps_step_threshold_weight(step_points, threshold_dim, *, threshold_values=None, steppoints_in_thresholds=True, steppoint_precision=0, weight_upper=True)#

Generates an array of weights based on DataArray step points.

Creates an array of threshold weights, which can be used to calculate threshold-weighted CRPS, based on a step function. Applies a weight of 1 when step_point >= threshold, and a weight of 0 otherwise. Zeros and ones in the output weight function can be reversed by setting weight_upper=False.

Parameters:

step_points (xr.DataArray) – points at which the weight function changes value from 0 to 1.
threshold_dim (str) – name of the threshold dimension in the returned array weight function.
threshold_values (str) – thresholds at which to calculate weights.
steppoints_in_thresholds (bool) – include step_points among the threshold_dim values.
steppoint_precision (float) – precision at which to round step_points prior to calculating the weight function. Select 0 for no rounding.
weight_upper (bool) – If true, returns a weight of 1 if step_point >= threshold, and a weight of 0 otherwise. If false, returns a weight of 0 if step_point >= threshold, and a weight of 1 otherwise.

Returns:

Zeros and ones with the dimensions in step_points and an additional threshold_dim dimension.

Return type:

(xr.DataArray)

scores.probability.crps_cdf_brier_decomposition(fcst, obs, *, threshold_dim='threshold', additional_thresholds=None, fcst_fill_method='linear', reduce_dims=None, preserve_dims=None)#

Given an array fcst of predictive CDF values indexed along threshold_dim, and an array obs of observations, calculates the mean Brier score for each index along threshold_dim. Since the mean CRPS is the integral of the mean Brier score over all thresholds, this gives a threshold decomposition of the mean CRPS.

If any there are any NaNs along the threshold dimension of fcst, then NaNs are propagated along this dimension prior to calculating the decomposition. If propagating NaNs is not desired, the user may first fill NaNs in fcst using scores.probability.functions.fill_cdf.

Parameters:

fcst (xr.DataArray) – DataArray of CDF values with threshold dimension threshold_dim.
obs (xr.DataArray) – DataArray of observations, not in CDF form.
threshold_dim (str) – name of the threshold dimension in fcst.
additional_thresholds (Optional[Iterable[float]]) – additional thresholds at which to calculate the mean Brier score.
fcst_fill_method (Literal["linear", "step", "forward", "backward"]) –
How to fill NaN values in fcst that arise from new user-supplied thresholds or thresholds derived from observations.
- ”linear”: use linear interpolation, and if needed also extrapolate linearly. Clip to 0 and 1. Needs at least two non-NaN values for interpolation, so returns NaNs where this condition fails.
- ”step”: use forward filling then set remaining leading NaNs to 0. Produces a step function CDF (i.e. piecewise constant).
- ”forward”: use forward filling then fill any remaining leading NaNs with backward filling.
- ”backward”: use backward filling then fill any remaining trailing NaNs with forward filling.
dims – dimensions to preserve in the output. The dimension threshold_dim is always preserved, even if not specified here.
reduce_dims (Iterable[str] | None)
preserve_dims (Iterable[str] | None)

Returns:

“total_penalty”: the mean Brier score for each threshold.
”underforecast_penalty”: the mean of the underforecast penalties for the Brier score. For a particular
forecast case, this component equals 0 if the event didn’t occur and the Brier score if it did.
”overforecast_penalty”: the mean of the overforecast penalties for the Brier score. For a particular
forecast case, this component equals 0 if the event did occur and the Brier score if it did not.

Return type:

An xarray Dataset with data_vars

Raises:

ValueError – if threshold_dim is not a dimension of fcst.
ValueError – if threshold_dim is a dimension of obs.
ValueError – if dimensions of obs are not also among the dimensions of fcst.
ValueError – if dimensions in dims is not among the dimensions of fcst.
ValueError – if fcst_fill_method is not one of ‘linear’, ‘step’, ‘forward’ or ‘backward’.
ValueError – if coordinates in fcst[threshold_dim] are not increasing.

scores.probability.crps_for_ensemble(fcst, obs, ensemble_member_dim, *, method='ecdf', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the continuous ranked probability score (CRPS) given an ensemble of forecasts. An ensemble of forecasts can also be thought of as a random sample from the predictive distribution.

Given an observation y, and ensemble member values \(x_i\) (for \(1 \leq i \leq M\)), the CRPS is calculated by the formula

\[CRPS(x_i, x_j, y) = \frac{\sum_{i=1}^{M}(|x_i - y|)}{M} - \frac{\sum_{i=1}^{M}\sum_{j=1}^{M}(|x_i - x_j|)}{2K}\]

where the first sum is iterated over \(1 \leq i \leq M\) and the second sum is iterated over \(1 \leq i \leq M\) and \(1 \leq j \leq M\).

The value of the constant K in this formula depends on the method:

If method=”ecdf” then \(K = M ^ 2\). In this case the CRPS value returned is the exact CRPS value for the empirical cumulative distribution function constructed using the ensemble values.
If method=”fair” then \(K = M(M - 1)\). In this case the CRPS value returned is the approximated CRPS where the ensemble values can be interpreted as a random sample from the underlying predictive distribution. This interpretation stems from the formula \(\text{CRPS}(F, y) = \mathbb{E}(|X - y|) - \frac{1}{2}\mathbb{E}(|X - X'|)\), where X and X’ are independent samples of the predictive distribution F, y is the observation (possibly unknown) and E denotes the expectation. This choice of K gives an unbiased estimate for the second expectation.

When the include_components flag is set to True, the CRPS components are calculated as

\[CRPS(x_i, x_j, y) = O(x_i, y) + U(x_i, y) - S(x_i, x_j)\]

where

\(O(x_i, y) = \frac{\sum_{i=1}^{M} ((x_i - y) \mathbb{1}{\{x_i > y\}})}{M}\) which is the overforecast penalty.
\(U(x_i, y) = \frac{\sum_{i=1}^{M} ((y - x_i) \mathbb{1}{\{x_i < y\}})}{M}\) which is the underforecast penalty.
\(S(x_i, x_j) = \frac{\sum_{i=1}^{M}\sum_{j=1}^{M}(|x_i - x_j|)}{2K}\) which is the forecast spread term.

Note that there are several ways to decompose the CRPS and this decomposition differs from the one used in the scores_probability.crps_cdf() function.

Parameters:

fcst (DataArray | Dataset) – Forecast data. Must have a dimension ensemble_member_dim.
obs (DataArray | Dataset) – Observation data.
ensemble_member_dim (str) – the dimension that specifies the ensemble member or the sample from the predictive distribution.
method (Literal['ecdf', 'fair']) – Either “ecdf” or “fair”.
reduce_dims (Sequence[str] | None) – Dimensions to reduce. Can be “all” to reduce all dimensions.
preserve_dims (Sequence[str] | None) – Dimensions to preserve. Can be “all” to preserve all dimensions.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
include_components (bool | None) – If True, returns the CRPS with underforecast and overforecast penalties, as well as the forecast spread term (see description above).

Returns:

xarray object of (weighted mean) CRPS values.

Raises:

ValueError – when method is not one of “ecdf” or “fair”.

Return type:

DataArray | Dataset

References

1. Ferro (2014), “Fair scores for ensemble forecasts”, Quarterly Journal of the Royal Meteorol Society, 140(683):1917-1923. https://doi.org/10.1002/qj.2270
1. Gneiting T and A. Raftery (2007), “Strictly proper scoring rules, prediction, and estimation”, Journal of the American Statistical Association, 102(477):359-378. https://doi.org/10.1198/016214506000001437
1. Zamo and P. Naveau (2018), “Estimation of the Continuous Ranked Probability Score with Limited Information and Applications to Ensemble Weather Forecasts”, Mathematical Geosciences 50:209-234, https://doi.org/10.1007/s11004-017-9709-7

scores.probability.tw_crps_for_ensemble(fcst, obs, ensemble_member_dim, chaining_func, *, chaining_func_kwargs=None, method='ecdf', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the threshold weighted continuous ranked probability score (twCRPS) given ensemble input using a chaining function chaining_func (see below). An ensemble of forecasts can also be thought of as a random sample from the predictive distribution.

The twCRPS is calculated by the formula

\[\text{twCRPS}(F, y) = \mathbb{E}_F \left| v(X) - v(y) \right| - \frac{1}{2} \mathbb{E}_F \left| v(X) - v(X') \right|,\]

where \(X\) and \(X'\) are independent samples of the predictive distribution \(F\), \(y\) is the observation, and \(v\) is a ‘chaining function’.

The chaining function \(v\) is the antiderivative of the threshold weight function \(w\), which is a non-negative function that assigns a weight to each threshold value. For example, if we wanted to assign a threshold weight of 1 to thresholds above threshold \(t\) and a threshold weight of 0 to thresholds below \(t\), our threshold weight function would be \(w(x) = \mathbb{1}{(x > t)}\), where \(\mathbb{1}\) is the indicator function which returns a value of 1 if the condition is true and 0 otherwise. A chaining function would then be \(v(x) = \text{max}(x, t)\).

There are currently two methods available for calculating the twCRPS: “ecdf” and “fair”.

If method=”ecdf” then the twCRPS value returned is the exact twCRPS value for the empirical cumulative distribution function constructed using the ensemble values.
If method=”fair” then the twCRPS value returned is the approximated twCRPS where the ensemble values can be interpreted as a random sample from the underlying predictive distribution. See https://doi.org/10.1002/qj.2270 for more details on the fair CRPS which are relevant for the fair twCRPS.

The ensemble representation of the empirical twCRPS is

\[\text{twCRPS}(F_{\text{ens}}, y; v) = \frac{1}{M} \sum_{m=1}^{M} \left| v(x_m) - v(y) \right| - \frac{1}{2M^2} \sum_{m=1}^{M} \sum_{j=1}^{M} \left| v(x_m) - v(x_j) \right|,\]

where \(M\) is the number of ensemble members.

While the ensemble representation of the fair twCRPS is

\[\text{twCRPS}(F_{\text{ens}}, y; v) = \frac{1}{M} \sum_{m=1}^{M} \left| v(x_m) - v(y) \right| - \frac{1}{2M(M - 1)} \sum_{m=1}^{M} \sum_{j=1}^{M} \left| v(x_m) - v(x_j) \right|.\]

Parameters:

fcst (DataArray | Dataset) – Forecast data. Must have a dimension ensemble_member_dim.
obs (DataArray | Dataset) – Observation data.
ensemble_member_dim (str) – the dimension that specifies the ensemble member or the sample from the predictive distribution.
chaining_func (Callable[[DataArray | Dataset], DataArray | Dataset]) – the chaining function.
chaining_func_kwargs (dict[str, Any] | None) – keyword arguments for the chaining function.
method (Literal['ecdf', 'fair']) – Either “ecdf” for the empirical twCRPS or “fair” for the Fair twCRPS.
reduce_dims (Sequence[str] | None) – Dimensions to reduce. Can be “all” to reduce all dimensions.
preserve_dims (Sequence[str] | None) – Dimensions to preserve. Can be “all” to preserve all dimensions.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
include_components (bool | None) – If True, returns the twCRPS with underforecast and overforecast penalties, as well as the forecast spread term. See scores.probability.crps_for_ensemble() for more details on the decomposition.

Returns:

xarray object of twCRPS values.

Raises:

ValueError – when method is not one of “ecdf” or “fair”.

Return type:

DataArray

Notes

Chaining functions can be created to vary the weights across given dimensions such as varying the weights by climatological values.

References

Allen, S., Ginsbourger, D., & Ziegel, J. (2023). Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3), 906-940. https://doi.org/10.1137/22M1532184.
Allen, S. (2024). Weighted scoringRules: Emphasizing Particular Outcomes When Evaluating Probabilistic Forecasts. Journal of Statistical Software, 110(8), 1-26. https://doi.org/10.18637/jss.v110.i08

Examples

Calculate the twCRPS for an ensemble of forecasts where the chaining function is derived from a weight function that assigns a weight of 1 to thresholds above 0.5 and a weight of 0 to thresholds below 0.5.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import tw_crps_for_ensemble
>>> fcst = xr.DataArray(np.random.rand(10, 10), dims=['time', 'ensemble'])
>>> obs = xr.DataArray(np.random.rand(10), dims=['time'])
>>> tw_crps_for_ensemble(fcst, obs, 'ensemble', lambda x: np.maximum(x, 0.5))

scores.probability.tail_tw_crps_for_ensemble(fcst, obs, ensemble_member_dim, threshold, *, tail='upper', method='ecdf', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the threshold weighted continuous ranked probability score (twCRPS) weighted for a tail of the distribution from ensemble input.

A threshold weight of 1 is assigned for values of the tail and a threshold weight of 0 otherwise. The threshold value of where the tail begins is specified by the threshold argument. The tail argument specifies whether the tail is the upper or lower tail. For example, if we only care about values above 40 degrees C, we can set threshold=40 and tail="upper".

For more flexible weighting options and the relevant equations, see the scores.probability.tw_crps_for_ensemble() function.

Parameters:

fcst (DataArray | Dataset) – Forecast data. Must have a dimension ensemble_member_dim.
obs (DataArray | Dataset) – Observation data.
ensemble_member_dim (str) – the dimension that specifies the ensemble member or the sample from the predictive distribution.
threshold (DataArray | Dataset | float) – the threshold value for where the tail begins. It can either be a float for a single threshold or an xarray object if the threshold varies across dimensions (e.g., climatological values).
tail (Literal['upper', 'lower']) – the tail of the distribution to weight. Either “upper” or “lower”.
method (Literal['ecdf', 'fair']) – Either “ecdf” or “fair”. See scores.probability.tw_crps_for_ensemble() for more details.
reduce_dims (Sequence[str] | None) – Dimensions to reduce. Can be “all” to reduce all dimensions.
preserve_dims (Sequence[str] | None) – Dimensions to preserve. Can be “all” to preserve all dimensions.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
include_components (bool | None) – If True, returns the twCRPS with underforecast and overforecast penalties, as well as the forecast spread term. See scores.probability.crps_for_ensemble() for more details on the decomposition.

Returns:

xarray object of twCRPS values that has been weighted based on the tail.

Raises:

ValueError – when tail is not one of “upper” or “lower”.
ValueError – when method is not one of “ecdf” or “fair”.

Return type:

DataArray | Dataset

References

Allen, S., Ginsbourger, D., & Ziegel, J. (2023). Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3), 906-940. https://doi.org/10.1137/22M1532184.
Allen, S. (2024). Weighted scoringRules: Emphasizing Particular Outcomes When Evaluating Probabilistic Forecasts. Journal of Statistical Software, 110(8), 1-26. https://doi.org/10.18637/jss.v110.i08

Examples

Calculate the twCRPS for an ensemble where we assign a threshold weight of 1 to thresholds above 0.5 and a threshold weight of 0 to thresholds below 0.5.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import tail_tw_crps_for_ensemble
>>> fcst = xr.DataArray(np.random.rand(10, 10), dims=['time', 'ensemble'])
>>> obs = xr.DataArray(np.random.rand(10), dims=['time'])
>>> tail_tw_crps_for_ensemble(fcst, obs, 'ensemble', 0.5, tail='upper')

scores.probability.interval_tw_crps_for_ensemble(fcst, obs, ensemble_member_dim, lower_threshold, upper_threshold, *, method='ecdf', reduce_dims=None, preserve_dims=None, weights=None, include_components=False)#

Calculates the threshold weighted continuous ranked probability score (twCRPS) for ensemble forecasts where the threshold weight is 1 on a specified interval and 0 otherwise.

The threshold values that define the bounds of the interval are given by the lower_threshold and upper_threshold arguments. For example, if we only want to foucs on the temperatures between -10 and -20 degrees C where aircraft icing is most likely, we can set lower_threshold=-20 and upper_threshold=-10.

For more flexible weighting options and the relevant equations, see the scores.probability.tw_crps_for_ensemble() function.

Parameters:

fcst (DataArray | Dataset) – Forecast data. Must have a dimension ensemble_member_dim.
obs (DataArray | Dataset) – Observation data.
ensemble_member_dim (str) – the dimension that specifies the ensemble member or the sample from the predictive distribution.
lower_threshold (DataArray | float) – the threshold value for where the interval begins. It can either be a float for a single threshold or an xarray object if the threshold varies across dimensions (e.g., climatological values).
upper_threshold (DataArray | float) – the threshold value for where the interval ends. It can either be a float for a single threshold or an xarray object if the threshold varies across dimensions (e.g., climatological values).
method (Literal['ecdf', 'fair']) – Either “ecdf” or “fair”. See scores.probability.tw_crps_for_ensemble() for more details.
reduce_dims (Sequence[str] | None) – Dimensions to reduce. Can be “all” to reduce all dimensions.
preserve_dims (Sequence[str] | None) – Dimensions to preserve. Can be “all” to preserve all dimensions.
weights (DataArray | Dataset | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
include_components (bool | None) – If True, returns the twCRPS with underforecast and overforecast penalties, as well as the forecast spread term. See scores.probability.crps_for_ensemble() for more details on the decomposition.

Returns:

xarray object of twCRPS values where the threshold weighting is based on an interval.

Raises:

ValueError – when lower_threshold is not less than upper_threshold.
ValueError – when method is not one of “ecdf” or “fair”.

Return type:

DataArray | Dataset

References

Allen, S., Ginsbourger, D., & Ziegel, J. (2023). Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3), 906-940. https://doi.org/10.1137/22M1532184.
Allen, S. (2024). Weighted scoringRules: Emphasizing Particular Outcomes When Evaluating Probabilistic Forecasts. Journal of Statistical Software, 110(8), 1-26. https://doi.org/10.18637/jss.v110.i08

Examples

Calculate the twCRPS for an ensemble where we assign a threshold weight of 1 to thresholds between -20 and -10 and a threshold weight of 0 to thresholds outside that interval.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import interval_tw_crps_for_ensemble
>>> fcst = xr.DataArray(np.random.uniform(-40, 20, size=(30, 15)), dims=['time', 'ensemble'])
>>> obs = xr.DataArray(np.random.uniform(-40, 20, size=30), dims=['time'])
>>> interval_tw_crps_for_ensemble(fcst, obs, 'ensemble', -20, 10)

scores.probability.murphy_score(fcst, obs, thetas, *, functional, alpha, huber_a=None, decomposition=False, reduce_dims=None, preserve_dims=None)#

Select functional="quantile" and alpha=0.5 for the median functional. Select functional="expectile" and alpha=0.5 for the mean (i.e., expectation) functional.

Parameters:

fcst (DataArray) – Forecast numerical values.
obs (DataArray) – Observed numerical values.
thetas (Sequence[float] | DataArray) – Theta thresholds.
functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.
alpha (float) – Risk parameter (i.e. quantile or expectile level) for the functional. Must be between 0 and 1.
huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.
decomposition (bool) – True to return penalty values due to under- and over-fcst as well as the total score, False to return total score only.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Murphy score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Murphy score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the FIRM score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

Returns:

An xr.Dataset with dimensions based on the preserve_dims or reduce_dims arg as well as a “theta” dimension with values matching thetas input.

If decomposition is False, the dataset’s variables will contain 1 element “total”.

If decomposition is True, in addition to “total”, it will have “underforecast” and “overforecast” data_vars.

Raises:

ValueError – If functional is not one of the expected functions.
ValueError – If alpha is not strictly between 0 and 1.
ValueError – If huber_a is not strictly greater than 0.

Return type:

Dataset

References

For mean elementary score definitions, see:

Theorem 1 of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154
Theorem 5.3 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957

scores.probability.murphy_thetas(forecasts, obs, functional, *, huber_a=None, left_limit_delta=None)#

Return the decision thresholds (theta values) at which to evaluate elementary scores for the construction of Murphy diagrams.

Parameters:

forecasts (list[DataArray]) – Forecast values, one array per source.
obs (DataArray) – Observed values.
functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.
huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.
left_limit_delta (float | None) – Approximation of the left hand limit, used for “huber” and “expectile” functionals. Must be greater than or equal to 0. None will be treated as 0. Ideally, left_limit_delta should be small relative to the fcst and obs precision, and not greater than that precision.

Returns:

theta thresholds to be used to compute murphy scores.

Return type:

List[float]

Raises:

ValueError – If functional is not one of the expected functions.
ValueError – If huber_a is not strictly greater than 0.
ValueError – If left_limit_delta is not greater than or equal to 0.

Notes

For theta values at which to evaluate elementary scores, see

Corollary 2 (p.521) of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154
Corollary 5.6 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957

scores.probability.roc_curve_data(fcst, obs, thresholds='auto', *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

Calculates data required for plotting a Receiver (Relative) Operating Characteristic (ROC) curve, including the area under the curve (AUC). The ROC curve is used as a way to measure the discrimination ability of a particular forecast.

The AUC is the probability that the forecast probability of a random event is higher than the forecast probability of a random non-event.

Parameters:

fcst (DataArray) – An array of probabilistic forecasts for a binary event in the range [0, 1].
obs (DataArray) – An array of binary values where 1 is an event and 0 is a non-event.
thresholds (str | Iterable[float]) – By default, when thresholds = "auto", the ROC thresholds are automatically generated. Otherwise, you can supply an iterable of floats with monotonic increasing values between 0 and 1, which are the thresholds at and above which to convert the probabilistic forecast to a value of 1 (an ‘event’). If there are many unique forecast values, this can lead to a very large number of automatically generated thresholds. If performance is slow, consider supplying thresholds manually as an iterable of floats. Np.inf is added automatically to the end of the thresholds to ensure that the full ROC curve is produced. Similarly, if 0 is not included, it will be added automatically to ensure the full ROC curve is produced.
reduce_dims (Sequence[str] | None) – Optionally specify which dimensions to reduce when calculating the ROC curve data. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Sequence[str] | None) – Optionally specify which dimensions to preserve when calculating ROC curve data. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the values will be the ROC curve at each point (i.e. single-value comparison against observed) for each threshold, and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
check_args (bool) – Checks if obs data only contains values in the set {0, 1, np.nan}. You may want to skip this check if you are sure about your input data and want to improve the performance when working with dask.

Returns:

‘POD’ (the probability of detection)
’POFD’ (the probability of false detection)
’AUC’ (the area under the ROC curve)

POD and POFD have dimensions dims + ‘threshold’, while AUC has dimensions dims.

Return type:

An xarray.Dataset with data variables

Raises:

ValueError – if fcst contains values outside of the range [0, 1].
ValueError – if obs contains non-nan values not in the set {0, 1}.
ValueError – if ‘threshold’ is a dimension in fcst.
ValueError – if values in thresholds are not monotonic increasing or are outside the range [0, 1].
ValueError – if thresholds is a string that is not “auto”.
ValueError – if thresholds is an empty iterable.

Warning

If the number of automatically generated thresholds is very large (>1000), a warning is raised suggesting that the user supply thresholds manually as an iterable of floats if performance is slow.

Notes

If thresholds is an iterable of floats, the probabilistic fcst is converted to a deterministic forecast for each threshold in thresholds. If a value in fcst is greater than or equal to the threshold, then it is converted into a ‘forecast event’ (fcst = 1), and a ‘forecast non-event’ (fcst = 0) otherwise. The probability of detection (POD) and probability of false detection (POFD) are calculated for the converted forecast. From the POD and POFD data, the area under the ROC curve is calculated. An additional threshold of np.inf is added to the end of thresholds so that it always has a value when POD=0 and POFD=0.

If threshold="auto" which is the default, then the thresholds used are the ordered, unique forecast values.

Ideally concave ROC curves should be generated rather than traditional ROC curves.

Example

>>> import xarray as xr
>>> import numpy as np
>>> from scores.probability import roc_curve_data
>>> fcst = xr.DataArray(np.random.rand(3, 4), dims=["time", "location"])
>>> obs = xr.DataArray(np.random.randint(0, 2, size=(3, 4)), dims=["time", "location"])
>>> result = roc_curve_data(fcst, obs)

scores.probability.brier_score(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

Calculates the Brier score for forecast and observed data.

\[\text{brier score} = \frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2\]

If you want to speed up performance with Dask and are confident of your input data, or if you want observations to take values between 0 and 1, set check_args to False.

Parameters:

fcst (DataArray | Dataset) – Forecast or predicted variables in xarray.
obs (DataArray | Dataset) – Observed variables in xarray.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Brier score. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Brier score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
check_args (bool) – will perform some tests on the data if set to True.

Raises:

ValueError – if fcst contains non-nan values outside of the range [0, 1]
ValueError – if obs contains non-nan values not in the set {0, 1}

Return type:

DataArray | Dataset

References

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1-3. https://doi.org/fp62r6
https://en.wikipedia.org/wiki/Brier_score

See also

scores.probability.brier_score_for_ensemble()

scores.probability.brier_score_for_ensemble(fcst, obs, ensemble_member_dim, event_thresholds, *, reduce_dims=None, preserve_dims=None, weights=None, fair_correction=True, event_threshold_operator=<built-in function ge>, threshold_dim='threshold')#

Calculates the Brier score for an ensemble forecast for the provided thresholds. By default, the fair Brier score is calculated, which is a modified version of the Brier score that applies a correction related to the number of ensemble members and the number of ensemble members that predict the event.

Given an event defined by event_threshold, the fair Brier score for ensembles is defined as

\[s_{i,y} = \left(\frac{i}{m} - y\right)^2 - \frac{i(m-i)}{m^2(m-1)}\]

Where:

\(i\) is the number of ensemble members that predict the event
\(m\) is the total number of ensemble members
\(y\) is the binary observation (0 if the event was not observed or 1 otherwise)

When the fair correction is not applied (i.e., fair_correction=False), the Brier score is calculated as:

\[s_{i,y} = \left(\frac{i}{m} - y\right)^2\]

The fair correction will only be applied at coordinates where the number of ensemble members is greater than 1 and fair_correction=True. When the fair_correction is set to True and the number of ensemble members is 1, the fair correction will not be applied to avoid division by zero. If you want to set the minimum number of ensemble members required to calculate the Brier score, we recommend preprocessing your data to remove data points with fewer than the minimum number of ensemble members that you want.

Parameters:

fcst (DataArray | Dataset) – Real valued ensemble forecasts in xarray. The proportion of ensemble members
function. (above a threshold is calculated within this)
obs (DataArray | Dataset) – Real valued observations in xarray. These values are converted to binary values in this function based on the supplied event thresholds and operators.
ensemble_member_dim (str) – The dimension name of the ensemble member.
event_thresholds (Real | Sequence[Real]) – The threshold(s) that define what is considered an event. If multiple thresholds are provided, they should be monotonically increasing with no NaNs.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Brier score. All other dimensions will be preserved.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the absolute error at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
fair_correction (bool) – Whether or not to apply the fair Brier score correction. Default is True.
event_threshold_operator (Callable) – The mode to use to define an event relative to a supplied threshold. Default is operator.ge, which means that an event occurs if the observation is greater than or equal to the threshold. Similarly, an event will have been considered to be forecast by an ensemble member if the forecast is greater than or equal to the threshold. Passing in operator.lt will give the same results as it is mathematically equivalent in this case. Alternatively, you can provide operator.gt which means that an event occurs if the observation is strictly greater than the threshold. Again, an event for an ensemble member will have been considered to be forecast if the forecast is greater than the threshold. Using``operator.le`` will give the same results as operator.gt as it is mathematically equivalent in this case.
threshold_dim (str) – The name of the threshold dimension in the output. Default is ‘threshold’. It must not exist as a dimension in the forecast or observation data.

Returns:

The Brier score for the ensemble forecast.

Raises:

ValueError – if values in event_thresholds are not monotonically increasing.
ValueError – if fcst contains the dimension provided in threshold_dim.
ValueError – if obs contains the dimension provided in threshold_dim.
ValueError – if weights contains the dimension provided in threshold_dim.
ValueError – if ensemble_member_dim is not in fcst dimensions.

Return type:

DataArray | Dataset

References

Ferro, C. A. T. (2013). Fair scores for ensemble forecasts. Quarterly
Journal of the Royal Meteorological Society, 140(683), 1917–1923. https://doi.org/10.1002/qj.2270

See also

scores.probability.brier_score()

Examples

Calculate the Brier score for an ensemble forecast for a single threshold:

>>> import numpy as np
>>> import xarray as xr
>>> from scores.probability import brier_score_for_ensemble
>>> fcst = 10 * xr.DataArray(np.random.rand(10, 10), dims=['time', 'ensemble'])
>>> obs = 10 * xr.DataArray(np.random.rand(10), dims=['time'])
>>> brier_score_for_ensemble(fcst, obs, ensemble_member_dim='ensemble', thresholds=0.5)

Calculate the Brier score for an ensemble forecast for multiple thresholds: >>> thresholds = [0.1, 5, 9] >>> brier_score_for_ensemble(fcst, obs, ensemble_member_dim=’ensemble’, thresholds=thresholds)

scores.probability.isotonic_fit(fcst, obs, *, weight=None, functional='mean', bootstraps=None, quantile_level=None, solver=None, confidence_level=0.9, min_non_nan=1, report_bootstrap_results=False)#

If forecasts target the mean or a quantile functional, then the user can select that functional. Otherwise the user needs to supply a solver function

of one variable (if no weights are supplied), or

of two variables (if weights are supplied).

The solver function is used to find an optimal regression fit in each block, subject to a non-decreasing monotonicity constraint, using the pool adjacent violators (PAV) algorithm.

An example of a solver function of one variable is obs -> np.mean(obs) An example of a solver function of two variables is (obs, weight) -> np.average(obs, weights=weight).

This implementation uses sklearn.isotonic.IsotonicRegression when functional=”mean”.

Parameters:

fcst (ndarray | DataArray) – np.array or xr.DataArray of forecast (or explanatory) values. Values must be float, integer or NaN.
obs (ndarray | DataArray) – np.array or xr.DataArray of corresponding observation (or response) values. Must be the same shape as fcst. Values must be float, integer or NaN.
weight (ndarray | DataArray | None) – positive weights to apply to each forecast-observation pair. Must be the same shape as fcst, or None (which is equivalent to applying equal weights).
functional (Literal['mean', 'quantile'] | None) – Functional that the forecasts are targeting. Either “mean” or “quantile” or None. If None then solver must be supplied. The current implementation for “quantile” does not accept weights. If weighted quantile regression is desired then the user should should supply an appropriate solver.
bootstraps (int | None) – the number of bootstrap samples to perform for calculating the regression confidence band. Set to None if a confidence band is not required.
quantile_level (float | None) – the level of the quantile functional if functional=’quantile’. Must be strictly between 0 and 1.
solver (Callable[[ndarray, ndarray], float] | Callable[[ndarray], float] | None) – function that accepts 1D numpy array of observations and returns a float. Function values give the regression fit for each block as determined by the PAV algorithm. If weights are supplied, the function must have a second argument that accepts 1D numpy array of weights of same length as the observations.
confidence_level (float | None) – Confidence level of the confidence band, strictly between 0 and 1. For example, a confidence level of 0.5 will calculate the confidence band by computing the 25th and 75th percentiles of the bootstrapped samples.
min_non_nan (int | None) – The minimum number of non-NaN bootstrapped values to calculate confidence bands at a particular forecast value.
report_bootstrap_results (bool | None) – This specifies to whether keep the bootstrapped regression values in return dictionary. Default is False to keep the output result small.

Returns:

“unique_fcst_sorted”: 1D numpy array of remaining forecast values sorted in
ascending order, after any NaNs from fcst, obs and weight are removed, and only unique values are kept to keep the output size reasonably small.
”fcst_counts”: 1D numpy array of forecast counts for unique values of forecast sorted
”regression_values”: 1D numpy array of regression values corresponding to
”unique_fcst_sorted” values.
”regression_func”: function that returns the regression fit based on linear
interpolation of (“fcst_sorted”, “regression_values”), for any supplied argument (1D numpy array) of potential forecast values.
”bootstrap_results”: in the case of report_bootstrap_results=True, 2D numpy
array of bootstrapped regression values is included in return dictionary. Each row gives the interpolated regression values from a particular bootstrapped sample, evaluated at “fcst_sorted” values. If m is the number of bootstrap samples and n = len(fcst_sorted) then it is has shape (m, n). We emphasise that this array corresponds to fcst_sorted not unique_fcst_sorted.
”confidence_band_lower_values”: values of lower confidence band threshold, evaluated
at “unique_fcst_sorted” values.
”confidence_band_upper_values”: values of upper confidence band threshold, evaluated
at “unique_fcst_sorted” values.
”confidence_band_lower_func”: function that returns regression fit based on linear
interpolation of (“fcst_sorted”, “confidence_band_lower_values”), given any argument (1D numpy array) of potential forecast values.
”confidence_band_upper_func”: function that returns regression fit based on linear
interpolation of (“fcst_sorted”, “confidence_band_upper_values”), given any argument (1D numpy array) of potential forecast values.
”confidence_band_levels”: tuple giving the quantile levels used to calculate the
confidence band.

Return type:

Dictionary with the following keys

Raises:

ValueError – if fcst and obs are np.arrays and don’t have the same shape.
ValueError – if fcst and weight are np.arrays and don’t have the same shape.
ValueError – if fcst and obs are xarray.DataArrays and don’t have the same dimensions.
ValueError – if fcst and weight are xarray.DataArrays and don’t have the same dimensions.
ValueError – if any entries of fcst, obs or weight are not an integer, float or NaN.
ValueError – if there are no fcst and obs pairs after NaNs are removed.
ValueError – if functional is not one of “mean”, “quantile” or None.
ValueError – if quantile_level is not strictly between 0 and 1 when functional=”quantile”.
ValueError – if weight is not None when functional=”quantile”.
ValueError – if functional and solver are both None.
ValueError – if not exactly one of functional and solver is None.
ValueError – if bootstraps is not a positive integer.
ValueError – if confidence_level is not strictly between 0 and 1.

Note: This function only keeps the unique values of fcst_sorted to keep the volume of

the return dictionary small. The forecast counts is also included, so users can it to create forecast histogram (usually displayed in the reliability diagrams).

References

de Leeuw, Hornik and Mair. “Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods”, Journal of Statistical Software, 2009.
Dimitriadis, Gneiting and Jordan. “Stable reliability diagrams for probabilistic classifiers”, PNAS, Vol. 118 No. 8, 2020. Available at https://www.pnas.org/doi/10.1073/pnas.2016191118
Jordan, Mühlemann, and Ziegel. “Optimal solutions to the isotonic regression problem”, 2020 (version 2), available on arxiv at https://arxiv.org/abs/1904.04761

scores.categorical#

scores.categorical.firm(fcst, obs, risk_parameter, categorical_thresholds, threshold_weights, *, discount_distance=0, reduce_dims=None, preserve_dims=None, weights=None, threshold_assignment='lower', include_components=False)#

Calculates the FIxed Risk Multicategorical (FIRM) score including the underforecast and overforecast penalties.

categorical_thresholds and threshold_weights must be the same length.

Parameters:

fcst (DataArray | Dataset) – An array of real-valued forecasts that we want to treat categorically.
obs (DataArray | Dataset) – An array of real-valued observations that we want to treat categorically.
risk_parameter (float) – Risk parameter (alpha) for the FIRM score. The value must satisfy 0 < risk_parameter < 1.
categorical_thresholds (Sequence[float] | Sequence[DataArray]) – Category thresholds (thetas) to delineate the categories. A sequence of xr.DataArrays may be supplied to allow for different thresholds at each coordinate (e.g., thresholds determined by climatology).
threshold_weights (Sequence[float | DataArray]) – Weights that specify the relative importance of forecasting on the correct side of each category threshold. Either a positive float can be supplied for each categorical threshold or an xr.DataArray (with no negative values) can be provided for each categorical threshold as long as its dims are a subset of obs dims. NaN values are allowed in the xr.DataArray. For each NaN value at a given coordinate, the FIRM score will be NaN at that coordinate, before dims are collapsed.
discount_distance (float | None) – An optional discounting distance parameter which satisfies discount_distance >= 0 such that the cost of misses and false alarms are discounted whenever the observation is within distance discount_distance of the forecast category. A value of 0 will not apply any discounting.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the FIRM score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating FIRM. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the FIRM score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
threshold_assignment (str | None) – Specifies whether the intervals defining the categories are left or right closed. That is whether the decision threshold is included in the upper (left closed) or lower (right closed) category. Defaults to “lower”.
include_components (bool) – If True, then the function returns the FIRM score along with the overforecast and underforecast penalties as a new dimension called “components”. If False (default), then only the FIRM score is returned.

Returns:

An xarray object with the FIRM score. If include_components is true, then a new dimension called “components” is created that contains the following coordinates:

firm_score: A score for a single category for each coord based on the FIRM framework.
overforecast_penalty: Penalty for False Alarms.
underforecast_penalty: Penalty for Misses.

Raises:

ValueError – if len(categorical_thresholds) < 1.
ValueError – if categorical_thresholds and threshold_weights lengths are not equal.
ValueError – if risk_parameter <= 0 or >= 1.
ValueError – if any values in threshold_weights are <= 0.
ValueError – if discount_distance is not None and < 0.
scores.utils.DimensionError – if threshold_weights is a list of xr.DataArrays and if the dimensions of these xr.DataArrays is not a subset of the obs dims.
ValueError – if threshold_assignment is not “upper” or “lower”.
ValueError – if the dimension name “components” is used in fcst or obs.
TypeError – If fcst and obs are not both xarray DataArrays or Datasets.

Return type:

DataArray | Dataset

Note

Setting discount distance to None or 0, will mean that no discounting is applied. This means that errors will be penalised strictly categorically.

Setting discount distance to np.inf means that the cost of a miss is always proportional to the distance of the observation from the threshold, and similarly for false alarms.

Example

>>> import numpy as np
>>> import xarray as xr
>>> from scores.categorical import firm
>>> fcst = xr.DataArray(np.random.rand(4, 6, 8), dims=['time', 'lat', 'lon'])
>>> obs = xr.DataArray(np.random.rand(4, 6, 8), dims=['time', 'lat', 'lon'])
>>> categorical_thresholds = [0.2, 0.5, 0.8]
>>> risk_parameter = 0.1
>>> firm_score = firm(fcst, obs, risk_parameter, categorical_thresholds)

References

Taggart, R., Loveday, N. and Griffiths, D., 2022. A scoring framework for tiered warnings and multicategorical forecasts based on fixed risk measures. Quarterly Journal of the Royal Meteorological Society, 148(744), pp.1389-1406.

scores.categorical.probability_of_detection(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

Calculates the Probability of Detection (POD), also known as the Hit Rate. This is the proportion of observed events (obs = 1) that were correctly forecast as an event (fcst = 1).

Parameters:

fcst (DataArray | Dataset) – An array containing binary values in the set {0, 1, np.nan}
obs (DataArray | Dataset) – An array containing binary values in the set {0, 1, np.nan}
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to sum when calculating the POD. All other dimensions will be not summed. As a special case, ‘all’ will allow all dimensions to be summed. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to sum across all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to not sum when calculating the POD. All other dimensions will be summed. As a special case, ‘all’ will allow all dimensions to be not summed. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the POD score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
check_args (bool | None) – Checks if fcst and `obs data only contains values in the set {0, 1, np.nan}. You may want to skip this check if you are sure about your input data and want to improve the performance when working with dask.

Returns:

A DataArray of the Probability of Detection.

Raises:

ValueError – if there are values in fcst and obs that are not in the set {0, 1, np.nan} and check_args is true.

Return type:

DataArray | Dataset

scores.categorical.probability_of_false_detection(fcst, obs, *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

Calculates the Probability of False Detection (POFD), also known as False Alarm Rate (not to be confused with the False Alarm Ratio). The POFD is the proportion of observed non-events (obs = 0) that were incorrectly forecast as event (i.e. fcst = 1).

Parameters:

fcst (DataArray | Dataset) – An array containing binary values in the set {0, 1, np.nan}
obs (DataArray | Dataset) – An array containing binary values in the set {0, 1, np.nan}
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to sum when calculating the POFD. All other dimensions will be not summed. As a special case, ‘all’ will allow all dimensions to be summed. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to sum across all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to not sum when calculating the POFD. All other dimensions will be summed. As a special case, ‘all’ will allow all dimensions to be not summed. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the POD score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
check_args (bool | None) – Checks if fcst and `obs data only contains values in the set {0, 1, np.nan}. You may want to skip this check if you are sure about your input data and want to improve the performance when working with dask.

Returns:

A DataArray of the Probability of False Detection.

Raises:

ValueError – if there are values in fcst and obs that are not in the set {0, 1, np.nan} and check_args is true.

Return type:

DataArray | Dataset

class scores.categorical.BinaryContingencyManager(fcst_events, obs_events)#

See https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html for a detailed walkthrough showing the use in practice.

A BinaryContingencyManager holds the underlying binary forecast and observed event data, from which it builds a contingency table and provides scoring functionality based on that table.

A BinaryContingencyManager is typically created by an EventOperator, such as a ThresholdOperator, but can also be created directly from binary data if the user wishes.

Supported operations include:

“Transforming” the data in various ways, such as dimensional reduction
Producing contingency tables
Calculating scores and metrics based on contingency tables

The full data comprises several n-dimensional binary arrays which can be considered maps of:

True positives (hits)
False positives (false alarms)
True negatives (correct negatives)
False negatives (misses)

These masks, in addition to supporting score calculations, can be accessed and used for:

Plotting these attributes on a map
Masking these values by a geographical region prior to score calculation

As such, the per-pixel information is useful as well as the overall ratios involved.

BinaryContingencyManager inherits from (uses) the BasicContingencyManager class to provide score calculations on the final contingency table. Documentation for the available scores is found in the BasicContingencyManager API entry but the calls can be made directly against instances of BinaryContingencyManager where performance or transformation are not a concern.

Parameters:

fcst_events (DataArray | Dataset | Series)
obs_events (DataArray | Dataset | Series)

transform(*, reduce_dims=None, preserve_dims=None)#

Compute the contingency table, preserving or reducing the specified dimensions.

Parameters:

reduce_dims (-) – Dimensions to reduce. Can be “all” to reduce all dimensions.
preserve_dims (-) – Dimensions to preserve. Can be “all” to preserve all dimensions.

Returns:

A scores class which supports efficient calculation of contingency metrics.

Return type:

scores.categorical.BasicContingencyManager

class scores.categorical.BasicContingencyManager(counts)#

See https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html for a detailed walkthrough showing the use of this class in practice.

A BasicContingencyManager object provides the scoring functions which are calculated from a contingency table. A BasicContingencyManager can be efficiently and repeatedly queried for a wide variety of scores.

A BasicContingencyManager is produced when a BinaryContingencyManager is transformed. It is also possible to create a BasicContingencyManager from event counts or a contingency table, although this not a common user requirement.

A contingency table is built only from event counts, losing the connection to the actual event tables in their full dimensionality. The event count data is much smaller than the full event tables, particularly when considering very large data sets like Numerical Weather Prediction (NWP) data, which could be terabytes to petabytes in size.

By contrast, A BinaryContingencyManager retains the full event data, which provides some more flexbility but may reduce efficiency.

Parameters:: counts (dict)

accuracy()#

Identical to fraction_correct().

Accuracy calculates the proportion of forecasts which are correct.

Returns:: A DataArray containing the accuracy score
Return type:: xr.DataArray

\[\text{accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{total count}}\]

Notes

Range: 0 to 1, where 1 indicates a perfect score.
“True positives” is the same at “hits”.
“False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#ACC

base_rate()#

The observed event frequency.

Returns:: An xarray object containing the base rate.
Return type:: xr.DataArray

\[\text{base rate} = \frac{\text{true positives} + \text{false negatives}}{\text{total count}}\]

Notes

Range: 0 to 1, where 1 indicates the event occurred every time.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.

References

Hogan, R. J. & Mason, I. B. (2011). Deterministic forecasts of binary events. In I. T. Jolliffe & D. B. Stephenson (Eds.), Forecast verification: A practitioner’s guide in atmospheric science (2nd ed., pp. 39-51). https://doi.org/10.1002/9781119960003.ch3

bias_score()#

Identical to frequency_bias().

How did the forecast frequency of “yes” events compare to the observed frequency of “yes” events?

Returns:: An xarray object containing the bias score
Return type:: xr.DataArray

\[\text{frequency bias} = \frac{\text{true positives} + \text{false positives}}{\text{true positives} + \text{false negatives}}\]

Notes

Range: 0 to ∞ (infinity), where 1 indicates a perfect score.
“True positives” is the same as “hits”.
“False positives” is the same as “false alarms”.
“False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#BIAS

cohens_kappa()#

Identical to heidke_skill_score().

Calculates Cohen’s kappa.

What was the accuracy of the forecast relative to that of random chance?

Returns:: An xarray object containing the Cohen’s Kappa score
Return type:: xr.DataArray

\[\begin{split}\\ \text{Cohen's Kappa} \left(\kappa\right) = \frac{\text{true positives} + \text{true negatives} - E_{random}}{ \text{total count} - E_{\text{random}}}\end{split}\]

where

\[\begin{split}\begin{aligned} E_{\text{random}} &= \text{expected correct matches due to random chance} \\&= \frac{\left(\text{tp} + \text{fn}\right) \cdot \left(\text{tp} + \text{fp}\right) + \left(\text{tn} + \text{fn}\right) \cdot \left(\text{tn} + \text{fp}\right)}{ \text{total count}} \end{aligned}\end{split}\]

Notes

Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
“True positives” (\(\text{tp}\)) is the same as “hits”.
“False negatives” (\(\text{fn}\)) is the same as “misses”.
“False positives” (\(\text{fp}\)) is the same as “false alarms”.
“True negatives” (\(\text{tn}\)) is the same as “correct negatives”.

References

https://en.wikipedia.org/wiki/Cohen%27s_kappa

critical_success_index()#

Identical to threat_score().

Returns:: An xarray object containing the critical success index
Return type:: xr.DataArray

\[\text{threat score} = \frac{\text{true positives}}{\text{true positives} + \text{false positives} + \text{false negatives}}\]

Notes

Range: 0 to 1, 0 indicates no skill. Perfect score: 1.
Often known as CSI.
“True positives” is the same as “hits”
“False positives” is the same as “false alarms”
“True negatives” is the same as “correct negatives”

References

https://www.cawcr.gov.au/projects/verification/#CSI

equitable_threat_score()#

Identical to gilberts_skill_score().

Calculates the Equitable threat score.

How well did the forecast “yes” events correspond to the observed “yes” events (accounting for hits due to chance)?

Returns:: An xarray object containing the equitable threat score
Return type:: xr.DataArray

\[\text{ETS} = \frac{\text{true positives} - \text{true positives} _\text{random}} {\text{true positives} + \text{false negatives} + \text{false positives} - \text{true positives} _\text{random}}\]

where

\[\text{true_positives}_{\text{random}} = \frac{(\text{true positives} + \text{false negatives}) (\text{true positives} + \text{false positives})}{\text{total count}}\]

Notes

Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.
“False positives” is the same as “false alarms”

References

Gilbert, G.K., 1884. Finley’s tornado predictions. American Meteorological Journal, 1(5), pp.166–172.
Hogan, R.J., Ferro, C.A., Jolliffe, I.T. and Stephenson, D.B., 2010. Equitability revisited: Why the “equitable threat score” is not equitable. Weather and Forecasting, 25(2), pp.710-726. https://doi.org/10.1175/2009WAF2222350.1

f1_score()#

Calculates the F1 score.

Returns:: An xarray object containing the F1 score
Return type:: xr.DataArray

\[\text{F1} = \frac{2 \cdot \text{true positives}}{(2 \cdot \text{true positives}) + \text{false positives} + \text{false negatives}}\]

Notes

“True positives” is the same as “hits”.
“False positives” is the same as “false alarms”.
“False negatives” is the same as “misses”.

References

https://en.wikipedia.org/wiki/F-score

false_alarm_rate()#

Identical to probability_of_false_detection.

What fraction of the non-events were incorrectly predicted?

Returns:: An xarray object containing the false alarm rate
Return type:: xr.DataArray

\[\text{false alarm rate} = \frac{\text{false positives}}{\text{true negatives} + \text{false positives}}\]

Notes

Range: 0 to 1. Perfect score: 0.
Not to be confused with the false alarm ratio.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#POFD

false_alarm_ratio()#

The false alarm ratio (FAR) calculates the fraction of the predicted “yes” events which did not eventuate (i.e., were false alarms).

Returns:: An xarray object containing the false alarm ratio
Return type:: xr.DataArray

\[\text{false alarm ratio} = \frac{\text{false positives}}{\text{true positives} + \text{false positives}}\]

Notes

Range: 0 to 1. Perfect score: 0.
Not to be confused with the False Alarm Rate.
“False positives” is the same as “false alarms”.
“True positives” is the same as “hits”.

References

https://www.cawcr.gov.au/projects/verification/#FAR

forecast_rate()#

The forecast event frequency.

Returns:: An xarray object containing the forecast rate.
Return type:: xr.DataArray

\[\text{forecast rate} = \frac{\text{true positives} + \text{false positives}}{\text{total count}}\]

Notes

Range: 0 to 1, where 1 indicates the event was forecast every time.
“True positives” is the same as “hits”.
“False positives” is the same as “false alarms”.

References

format_table(positive_value_name='Positive', negative_value_name='Negative')#

Formats a contingency table from a 2x2 contingency manager into a confusion matrix.

Parameters:

positive_value_name (str) – A string with the value to be displayed on the contingency table visual. The default value is ‘Positive’.
negative_value_name (str) – A string with the value to be displayed on the contingency table visual. The default value is ‘Negative’.

Returns:

A Pandas DataFrame representing 2x2 binary contingency table if possible. xr.DataArray: An xarray DataArray showing higher-dimension contingency table information otherwise

Return type:

pandas.DataFrame

This function retrieves the contingency table, extracts the relevant counts, and constructs a 2x2 confusion matrix represented as a Pandas DataFrame. This function does not support HTML rendering of tables where dimensionality has been preserved.

Example

Assuming self.xr_table returns an xr.Array:

array([ 5., 11.,  1.,  1., 18.])

Coordinates:

    contingency
    (contingency)
    <U11
    'tp_count' ... 'total_count'

This function will return:

        Positive  Negative Total
Positive       5       1       6
Negative       1      11       12
Total          6      12       18

fraction_correct()#

Identical to accuracy().

Fraction correct calculates the proportion of forecasts which are correct.

Returns:: An xarray object containing the fraction correct.
Return type:: xr.DataArray

\[\text{fraction correct} = \frac{\text{true positives} + \text{true negatives}}{\text{total count}}\]

Notes

Range: 0 to 1, where 1 indicates a perfect score.
“True positives” is the same as “hits”.
“True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#ACC

frequency_bias()#

Identical to bias_score().

How did the forecast frequency of “yes” events compare to the observed frequency of “yes” events?

Returns:: An xarray object containing the frequency bias
Return type:: xr.DataAray

\[\text{frequency bias} = \frac{\text{true positives} + \text{false positives}}{\text{true positives} + \text{false negatives}}\]

Notes

Range: 0 to ∞ (infinity), where 1 indicates a perfect score.
“True positives” is the same as “hits”.
“False positives” is the same as “false alarms”.
“False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#BIAS

get_counts()#

Returns:

A dictionary of the contingency table counts. Values are xr.DataArray instances. The keys are:

tp_count for true positive count
tn_count for true negative count,
fp_count for false positive count
fn_count for false negative count
total_count for the total number of events.

Return type:

dict

Here is an example of what may be returned:

{'tp_count': <xarray.DataArray ()> Size: 8B array(5.),
 'tn_count': <xarray.DataArray ()> Size: 8B array(11.),
 'fp_count': <xarray.DataArray ()> Size: 8B array(1.),
 'fn_count': <xarray.DataArray ()> Size: 8B array(1.),
 'total_count': <xarray.DataArray ()> Size: 8B array(18.)}

get_table()#

Returns:

The contingency table as an xarray object. Contains a coordinate dimension called ‘contingency’. Valid coordinates for this dimenstion are:

tp_count for true positive count
tn_count for true negative count
fp_count for false positive count
fn_count for false negative count
total_count for the total number of events.

Return type:

xr.DataArray

Here is an example of what may be returned:

array([ 5., 11.,  1.,  1., 18.])

Coordinates:

    contingency
    (contingency)
    <U11
    'tp_count' ... 'total_count'

Indexes:

    contingency
    PandasIndex

Attributes: (0)

gilberts_skill_score()#

Identical to equitable_threat_score().

Calculates the Gilbert skill score.

How well did the forecast “yes” events correspond to the observed “yes” events (accounting for hits due to chance)?

Returns:: An xarray object containing the Gilberts Skill Score
Return type:: xr.DataArray

\[\text{GSS} = \frac{\text{true positives} - \text{true positives} _\text{random}} {\text{true positives} + \text{false negatives} + \text{false positives} - \text{true positivies} _\text{random}}\]

where

\[\text{true_positives}_{\text{random}} = \frac{(\text{true positives} + \text{false negatives}) (\text{true positives} + \text{false positives})}{\text{total count}}\]

Notes

Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”
“False positives” is the same as “false alarms”.

References

Gilbert, G.K., 1884. Finley’s tornado predictions. American Meteorological Journal, 1(5), pp.166–172.
Hogan, R.J., Ferro, C.A., Jolliffe, I.T. and Stephenson, D.B., 2010. Equitability revisited: Why the “equitable threat score” is not equitable. Weather and Forecasting, 25(2), pp.710-726. https://doi.org/10.1175/2009WAF2222350.1

hanssen_and_kuipers_discriminant()#

Identical to peirce_skill_score() and true_skill_statistic().

How well did the forecast separate the “yes” events from the “no” events?

Returns:: An xarray object containing Hanssen and Kuipers’ Discriminant
Return type:: xr.DataArray

\[\text{HK} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} - \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}\]

where \(\text{HK}\) is Hansen and Kuipers Discriminant

Notes

Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#HK

heidke_skill_score()#

Identical to cohens_kappa().

Calculates the Heidke skill score.

What was the accuracy of the forecast relative to that of random chance?

Returns:: An xarray object containing the Heidke skill score
Return type:: xr.DataArray

\[\begin{split}\\ \text{HSS} = \frac{\text{true positives} + \text{true negatives} - E_{random}}{ \text{total count} - E_{\text{random}}}\end{split}\]

where

Notes

Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
HSS = Heidke Skill Score
“True positives” (\(\text{tp}\)) is the same as “hits”.
“False negatives” (\(\text{fn}\)) is the same as “misses”.
“False positives” (\(\text{fp}\)) is the same as “false alarms”.
“True negatives” (\(\text{tn}\)) is the same as “correct negatives”.

References

https://en.wikipedia.org/wiki/Cohen%27s_kappa

hit_rate()#

Identical to true_positive_rate(), probability_of_detection, sensitivity() and recall().

Calculates the proportion of the observed events that were correctly forecast.

Returns:: An xarray object containing the hit rate
Return type:: xr.DataArray

\[\text{true positives} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#POD

negative_predictive_value()#

Calculates the negative predictive value (NPV).

Of all the times a negative result was predicted, how often was the prediction actually correct?

Returns:: An xarray object containing the negative predictive value score
Return type:: xr.DataArray

\[\text{negative predictive value} = \frac{\text{true negatives}}{\text{true negatives} + \text{false negatives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True negatives” is the same as “correct rejections”.
“False negatives” is the same as “missed detections”.

References

https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values
Monaghan, T. F., Rahman, S. N., Agudelo, C. W., Wein, A. J., Lazar, J. M., Everaert, K., & Dmochowski, R. R. (2021). Foundational statistical principles in medical research: Sensitivity, specificity, positive predictive value, and negative predictive value. Medicina, 57(5), 503. https://doi.org/10.3390/medicina57050503

odds_ratio()#

Calculates the odds ratio

What is the ratio of the odds of a “yes” forecast being correct, to the odds of a “yes” forecast being wrong?

Returns:: An xarray object containing the odds ratio
Return type:: xr.DataArray

\[\begin{split}\begin{aligned} \text{odds ratio} &= \left[\frac{\text{POD}}{\text{1 - POD}}\right] \div \left[\frac{\text{POFD}}{\text{1 - POFD}}\right] \\ &= \frac{\text{true positives} \cdot \text{true negatives}}{\text{false positives} \cdot \text{false negatives}} \end{aligned}\end{split}\]

where

\[\text{POD} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

and

\[\text{POFD} = \frac{\text{false positives}}{\text{true negatives} + \text{false positives}}\]

Notes

Range: 0 to ∞, 1 indicates no skill. Perfect score: ∞.
POD = Probability of Detection
POFD = Probability of False Detection
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

Stephenson, D.B., 2000. Use of the “odds ratio” for diagnosing forecast skill. Weather and Forecasting, 15(2), pp.221-232. https://doi.org/10.1175/1520-0434(2000)015%3C0221:UOTORF%3E2.0.CO;2

odds_ratio_skill_score()#

Identical to yules_q().

Calculates the odds ratio skill score.

What was the improvement of the forecast over random chance?

Returns:: An xarray object containing the odds ratio skill score
Return type:: xr.DataArray

\[\begin{split}\begin{aligned} \text{ORSS} &= \frac{\text{OR} - 1}{\text{OR} + 1} \\ &= \frac{\text{true positives} \cdot \text{true negatives} - \text{false positives} \cdot \text{false negatives}}{ \text{true positives} \cdot \text{true negatives} + \text{false positives} \cdot \text{false negatives}} \end{aligned}\end{split}\]

Notes

Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
ORSS = Odds Ratio Skill Score
OR = Odds ratio, see: odds_ratio()
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

Stephenson, D.B., 2000. Use of the “odds ratio” for diagnosing forecast skill. Weather and Forecasting, 15(2), pp.221-232. https://doi.org/10.1175/1520-0434(2000)015%3C0221:UOTORF%3E2.0.CO;2

peirce_skill_score()#

Identical to hanssen_and_kuipers_discriminant() and true_skill_statistic().

How well did the forecast separate the “yes” events from the “no” events?

Returns:: An xarray object containing the Peirce Skill Score
Return type:: xr.DataArray

\[\text{Peirce skill score} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} - \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}\]

Notes

Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#HK
Peirce, C.S., 1884. The numerical measure of the success of predictions. Science, ns-4(93), pp.453-454. https://doi.org/10.1126/science.ns-4.93.453.b

positive_predictive_value()#

Identical to success_ratio() and precision().

What proportion of the forecast events actually eventuated?

Returns:: An xarray object containing the positive predictive value score
Return type:: xr.DataArray

\[\text{positive predictive value} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True positives” is the same as “hits”.
“False positives” is the same as “false alarms”.

References

https://www.cawcr.gov.au/projects/verification/#SR
https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values
Monaghan, T. F., Rahman, S. N., Agudelo, C. W., Wein, A. J., Lazar, J. M., Everaert, K., & Dmochowski, R. R. (2021). Foundational statistical principles in medical research: Sensitivity, specificity, positive predictive value, and negative predictive value. Medicina, 57(5), 503. https://doi.org/10.3390/medicina57050503

precision()#

Identical to success_ratio() and positive_predictive_value().

What proportion of the forecast events actually eventuated?

Returns:: An xarray object containing the precision score
Return type:: xr.DataArray

\[\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True positives” is the same as “hits”.
“False positives” is the same as “false alarms”.

References

probability_of_detection()#

Probability of detection (POD) is identical to hit_rate(), true_positive_rate(), sensitivity() and recall().

Calculates the proportion of the observed events that were correctly forecast.

Returns:: An xarray object containing the probability of detection
Return type:: xr.DataArray

\[\text{hit rate} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True positives” is the same as “hits”
“False negatives” is the same as “misses”

References

https://www.cawcr.gov.au/projects/verification/#POD

probability_of_false_detection()#

Identical to false_alarm_rate().

What fraction of the non-events were incorrectly predicted?

Returns:: An xarray object containing the probability of false detection
Return type:: xr.DataArray

\[\text{probability of false detection} = \frac{\text{false positives}}{\text{true negatives} + \text{false positives}}\]

Notes

Range: 0 to 1. Perfect score: 0.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#POFD

recall()#

Identical to hit_rate(), probability_of_detection, true_positive_rate(), and sensitivity().

Calculates the proportion of the observed events that were correctly forecast.

Returns:: An xarray object containing the probability of detection
Return type:: xr.DataArray

\[\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.

References

sensitivity()#

Identical to hit_rate(), probability_of_detection, true_positive_rate(), and recall().

Calculates the proportion of the observed events that were correctly forecast.

Returns:: An xarray object containing the probability of detection
Return type:: xr.DataArray

\[\text{sensitivity} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#POD
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
Monaghan, T. F., Rahman, S. N., Agudelo, C. W., Wein, A. J., Lazar, J. M., Everaert, K., & Dmochowski, R. R. (2021). Foundational statistical principles in medical research: Sensitivity, specificity, positive predictive value, and negative predictive value. Medicina, 57(5), 503. https://doi.org/10.3390/medicina57050503

specificity()#

Identical to true_negative_rate().

The probability that an observed non-event will be correctly predicted.

Returns:: An xarray object containing the true negative rate (specificity).
Return type:: xr.DataArray

\[\text{specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}\]

Notes

“True negatives” is the same as “correct negatives”.
“False positives” is the same as “false alarms”.

Reference:

https://en.wikipedia.org/wiki/Sensitivity_and_specificity
Monaghan, T. F., Rahman, S. N., Agudelo, C. W., Wein, A. J., Lazar, J. M., Everaert, K., & Dmochowski, R. R. (2021). Foundational statistical principles in medical research: Sensitivity, specificity, positive predictive value, and negative predictive value. Medicina, 57(5), 503. https://doi.org/10.3390/medicina57050503

success_ratio()#

Identical to precision() and positive_predictive_value().

What proportion of the forecast events actually eventuated?

Returns:: An xarray object containing the success ratio
Return type:: xr.DataArray

\[\text{success ratio} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True positives” is the same as “hits”.
“False positives” is the same as “false alarms”.

References

https://www.cawcr.gov.au/projects/verification/#SR

symmetric_extremal_dependence_index()#

Calculates the Symmetric Extremal Dependence Index (SEDI).

Returns:: An xarray object containing the SEDI score
Return type:: xr.DataArray

\[\frac{\ln(\text{POFD}) - \ln(\text{POD}) + \ln(\text{1-POD}) - \ln(\text{1 -POFD})} {\ln(\text{POFD}) + \ln(\text{POD}) + \ln(\text{1-POD}) + \ln(\text{1 -POFD})}\]

where

\[\text{POD} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

and

\[\text{POFD} = \frac{\text{false positives}}{\text{true negatives} + \text{false positives}}\]

Notes

POD = Probability of Detection
POFD = Probability of False Detection
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.
Range: -1 to 1, Perfect score: 1.

References

Ferro, C.A.T. and Stephenson, D.B., 2011. Extremal dependence indices: Improved verification measures for deterministic forecasts of rare binary events. Weather and Forecasting, 26(5), pp.699-713. https://doi.org/10.1175/WAF-D-10-05030.1

threat_score()#

Identical to critical_success_index().

Returns:: An xarray object containing the threat score
Return type:: xr.DataArray

\[\text{threat score} = \frac{\text{true positives}}{\text{true positives} + \text{false positives} + \text{false negatives}}\]

Notes

Range: 0 to 1, 0 indicates no skill. Perfect score: 1.
“True positives” is the same as “hits”.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#CSI

true_negative_rate()#

Identical to specificity().

The probability that an observed non-event will be correctly predicted.

Returns:: An xarray object containing the true negative rate.
Return type:: xr.DataArray

\[\text{true negative rate} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}\]

Notes

“True negatives” is the same as “correct negatives”.
“False positives” is the same as “false alarms”.

Reference:: https://en.wikipedia.org/wiki/Sensitivity_and_specificity

true_positive_rate()#

Identical to hit_rate(), probability_of_detection, sensitivity() and recall().

The proportion of the observed events that were correctly forecast.

Returns:: An xarray object containing the true positive rate
Return type:: xr.DataArray

\[\text{true positive rate} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\]

Notes

Range: 0 to 1. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.

References

https://www.cawcr.gov.au/projects/verification/#POD

true_skill_statistic()#

Identical to peirce_skill_score() and hanssen_and_kuipers_discriminant().

How well did the forecast separate the “yes” events from the “no” events?

Returns:: An xarray object containing the true skill statistic
Return type:: xr.DataArray

\[\text{true skill statistic} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} - \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}\]

Notes

Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

https://www.cawcr.gov.au/projects/verification/#HK

yules_q()#

Identical to odds_ratio_skill_score().

Calculates the Yule’s Q.

What was the improvement of the forecast over random chance?

Returns:: An xarray object containing Yule’s Q
Return type:: xr.DataArray

\[\begin{split}\begin{aligned} \text{Yule's Q} &= \frac{\text{OR} - 1}{\text{OR} + 1} \\ &= \frac{\text{true positives} \cdot \text{true negatives} - \text{false positives} \cdot \text{false negatives}}{ \text{true positives} \cdot \text{true negatives} + \text{false positives} \cdot \text{false negatives}} \end{aligned}\end{split}\]

Notes

Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
OR = Odds ratio, see: odds_ratio().
“True positives” is the same as “hits”.
“False negatives” is the same as “misses”.
“False positives” is the same as “false alarms”.
“True negatives” is the same as “correct negatives”.

References

Stephenson, D.B., 2000. Use of the “odds ratio” for diagnosing forecast skill. Weather and Forecasting, 15(2), pp.221-232. https://doi.org/10.1175/1520-0434(2000)015%3C0221:UOTORF%3E2.0.CO;2

class scores.categorical.ThresholdEventOperator(*, precision=8, default_event_threshold=0.001, default_op_fn=<built-in function ge>)#

See https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html for a detailed walkthrough showing the use in practice.

A ThresholdEventOperator is used to produce a BinaryContingencyManager from forecast and observed data. It considers an event to be defined when the forecast and observed variables meet a particular threshold condition (e.g. rainfall above 1mm).

The class may be used for any variable, for any threshold, and for any comparison operator (e.g. greater-than, less-than, greater-than-or-equal-to, … )

make_contingency_manager(fcst, obs, *, event_threshold=None, op_fn=None)#

Using this function requires a careful understanding of the structure of the data and the use of the operator function. The default operator is a simple greater-than operator, so this will work on a simple DataArray. To work on a DataSet, a richer understanding is required. It is recommended to work through the tutorial at https://scores.readthedocs.io/en/stable/tutorials/Binary_Contingency_Scores.html . This tutorial reviews more complex use cases, including multivariate gridded model data, and station data structures.

Parameters:

fcst (DataArray | Dataset | Series)
obs (DataArray | Dataset | Series)

Return type:

BinaryContingencyManager

make_event_tables(fcst, obs, *, event_threshold=None, op_fn=None)#

Parameters:

fcst (DataArray | Dataset | Series)
obs (DataArray | Dataset | Series)

class scores.categorical.EventOperator#

Abstract Base Class (ABC) for event operators which can be used in deriving contingency tables. ABCs are not used directly but instead define the requirements for other classes and may be used for type checking.

abstractmethod make_contingency_manager(fcst, obs, *, event_threshold=None, op_fn=None)#

This method should be over-ridden to return a contingency table.

Parameters:

fcst (DataArray | Dataset | Series)
obs (DataArray | Dataset | Series)

abstractmethod make_event_tables(fcst, obs, *, event_threshold=None, op_fn=None)#

This method should be over-ridden to return forecast and observed event tables

Parameters:

fcst (DataArray | Dataset | Series)
obs (DataArray | Dataset | Series)

scores.categorical.risk_matrix_score(fcst, obs, decision_weights, severity_dim, prob_threshold_dim, *, threshold_assignment='lower', reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the risk matrix score of Taggart & Wilke (2025).

Let \((S_1, \ldots,S_m)\) denote the tuple of nested severity categories, let \((p_1, \ldots,p_n)\) denote the probability thresholds that delineate the certainty categories, and let \(w_{i,j}\) denote the weight applied to the decision point associated with severity category \(S_i\) and probability threshold \(p_j\).

In this implementation of the risk matrix score, a forecast \(F\) is of the form \(F=(f_1,\ldots,f_m)\) where \(f_i\) denotes the forecast probability that the observation lies in severity category \(S_i\). A corresponding observation \(y\) is given by \(y=(y_1,\ldots,y_m)\) where \(y_i\) is 1 if the observation lies in severity category \(S_i\) and 0 otherwise. Then the risk matrix score \(\text{RMaS}\) is given by the formula

\[\text{RMaS}(F,y) = \sum_{i=1}^m\sum_{j=1}^n w_{i,j} \, s_j(f_i, y_i),\]

where

\[\begin{split}s_j(f_i,y_i) = \begin{cases} p_j & \text{if } y_i=0\text{ and } f_i \geq p_j \\ 1 - p_j & \text{if } y_i=1\text{ and } f_i < p_j \\ 0 & \text{otherwise.} \end{cases}\end{split}\]

The formula above is for the case where the threshold_assignment is “lower”, with the adjustments (\(f_i > p_j\) and \(f_i \leq p_j\)) applied when the threshold_assignment is “upper”.

Parameters:

fcst (DataArray | Dataset) – an array of forecast probabilities for the observation lying in each severity category. Must have a dimension severity_dim.
obs (DataArray | Dataset) – an array of binary observations with a value of 1 if the observation was in the severity category and 0 otherwise. Must have a dimension severity_dim.
decision_weights (DataArray) – an array of non-negative weights to apply to each risk matrix decision point, indexed by coordinates in severity_dim and prob_threshold_dim.
severity_dim (str) – the dimension specifying severity categories.
prob_threshold_dim (str) – the dimension in decision_weights specifying probability thresholds.
threshold_assignment (str | None) – Either “upper” or “lower”. Specifies whether the probability intervals defining the certainty categories, with endpoints given by the decision thresholds in decision_weights, are left or right closed. That is, whether the probability decision threshold is included in the upper (left closed) or lower (right closed) certainty category. Defaults to “lower”.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the mean risk matrix score, where the mean is taken over all forecast cases. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the mean risk matrix score, where the mean is taken over all forecast cases. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved, apart from severity_dim and prob_threshold_dim. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

An xarray object of risk matrix scores, averaged according to specified weights across appropriate dimensions.

Raises:

ValueError – if severity_dim is not a dimension of fcst, obs or decision_weights.
ValueError – if severity_dim is a dimension of weights.
ValueError – if prob_threshold_dim is not a dimension of decision_weights.
ValueError – if prob_threshold_dim is a dimension of weights.
ValueError – if decision_weights does not have exactly 2 dimensions.
ValueError – if fcst values are negative or greater than 1.
ValueError – if obs values are other than 0, 1 or nan.
ValueError – if prob_threshold_dim coordinates are not strictly between 0 and 1.
ValueError – if severity_dim coordinates differ for any of fcst, obs or decision_weights.
ValueError – if threshold_assignment is not “upper” or lower”.

Return type:

DataArray | Dataset

References

Taggart, R. J., & Wilke, D. J. (2025). Warnings based on risk matrices: A coherent framework with consistent evaluation. Natural Hazards and Earth System Sciences, 25(8), 2657–2677. https://doi.org/10.5194/nhess-25-2657-2025

Examples

Calculate the risk matrix score where the risk matrix has three nested severity categories (“MOD+”, “SEV+” and “EXT”) and three probability thresholds (0.1, 0.3 and 0.5). The decision weights place greater emphasis on higher end severity.

>>> import xarray as xr
>>> from scores.categorical import risk_matrix_score
>>> decision_weights = xr.DataArray(
>>>     data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]],
>>>     dims=["probability_threshold", "severity"],
>>>     coords={'probability_threshold': [0.1, 0.3, 0.5], 'severity': ["MOD+", "SEV+", "EXT"]}
>>> )
>>> fcst = xr.DataArray(
>>>     data=[[0.45, 0.22, 0.05], [0.65, 0.32, 0.09]],
>>>     dims=["time", "severity"],
>>>     coords={'time': [0, 1], 'severity': ["MOD+", "SEV+", "EXT"]}
>>> )
>>> obs = xr.DataArray(
>>>     data=[[1, 1, 0], [1, 0, 0]],
>>>     dims=["time", "severity"],
>>>     coords={'time': [0, 1], 'severity': ["MOD+", "SEV+", "EXT"]}
>>> )
>>> risk_matrix_score(fcst, obs, decision_weights, "severity", "probability_threshold")

scores.categorical.matrix_weights_to_array(matrix_weights, severity_dim, severity_coords, prob_threshold_dim, prob_threshold_coords)#

Generates a 2-dimensional xr.DataArray of weights for each decision point, which is used for the scores.categorical.risk_matrix_score() calculation. Assumes that values toward the left in matrix_weights correspond to less severe categories, while values towards the top in matrix_weights correspond to higher probability thresholds.

Parameters:

matrix_weights (ndarray) – array of weights to place on each risk matrix decision point, with rows (ascending) corresponding to (increasing) probability thresholds, and columns (left to right) corresponding to (increasing) severity categories.
severity_dim (str) – name of the severity category dimension.
severity_coords (Iterable) – labels for each of the severity categories, in order of increasing severity. Does NOT include the lowest severity category for which no warning would be issued.
prob_threshold_dim (str) – name of the probability threshold dimension.
prob_threshold_coords (Iterable) – list of the probability decision thresholds in the risk matrix, strictly between 0 and 1.

Returns:

xarray data array of weights for each risk matrix decision point, indexed by prob_threshold_dim and severity_dim.

Raises:

ValueError – if matrix_weights isn’t two dimensional.
ValueError – if number of rows of matrix_weights doesn’t equal length of prob_threshold_coords.
ValueError – if number of columns of matrix_weights doesn’t equal length of severity_coords.
ValueError – if prob_threshold_coords aren’t strictly between 0 and 1.

Return type:

DataArray

References

Taggart, R. J., & Wilke, D. J. (2025). Warnings based on risk matrices: A coherent framework with consistent evaluation. Natural Hazards and Earth System Sciences, 25(8), 2657–2677. https://doi.org/10.5194/nhess-25-2657-2025

Examples

Returns weights for each risk matrix decision point, where weights increase with increasing severity category and decrease with increasing probability threshold.

>>> import numpy as np
>>> from scores.categorical import matrix_weights_to_array
>>> matrix_weights = np.array([
>>>     [1, 2, 3],
>>>     [2, 4, 6],
>>>     [3, 6, 9],
>>> ])
>>> matrix_weights_to_array(
>>>     matrix_weights, "severity", ["MOD+", "SEV+", "EXT"], "prob_threshold", [0.1, 0.3, 0.5]
>>> )

scores.categorical.weights_from_warning_scaling(scaling_matrix, evaluation_weights, severity_dim, severity_coords, prob_threshold_dim, prob_threshold_coords)#

Given a warning scaling matrix, evaluation weights and other inputs, returns the weights for each risk matrix decision point as an xarray data array. The returned data array is designed to be used for the scores.categorical.risk_matrix_score() calculation.

Comprehensive checks are made on scaling_matrix to ensure it satisfies the properties of warning scaling in Table 1 of Taggart & Wilke (2025).

Parameters:

scaling_matrix (ndarray) – a 2-dimensional matrix encoding the warning scaling. Warning levels are given integer values 0, 1, …, q. The top row corresponds to the highest certainty category while the right-most column corresponds to most severe category.
evaluation_weights (Iterable) – positive weights used for warning evaluation. The kth weight (corresponding to evaluation_weights[k-1]) is proportional to the importance of accurately discriminating between warning states below level k and warning states at or above level k. len(evaluation_weights) must be at least q.
severity_dim (str) – name of the severity category dimension.
severity_coords (Iterable) – labels for each of the severity categories, in order of increasing severity. Does NOT include the lowest severity category for which no warning would be issued.
prob_threshold_dim (str) – name of the probability threshold dimension.
prob_threshold_coords (Iterable) – list of the probability decision thresholds in the risk matrix, strictly between 0 and 1.

Returns:

xarray data array of weights for each risk matrix decision point, indexed by prob_threshold_dim and severity_dim.

Raises:

ValueError – if matrix_weights isn’t two dimensional.
ValueError – if scaling_matrix has entries that are not non-negative integers.
ValueError – if the first column or last row of scaling_matrix has nonzero entries.
ValueError – if scaling_matrix decreases along any row (moving left to right).
ValueError – if scaling_matrix increases along any column (moving top to bottom).
ValueError – if number of rows of matrix_weights doesn’t equal length of prob_threshold_coords.
ValueError – if number of columns of matrix_weights doesn’t equal length of severity_coords.
ValueError – len(evaluation_weights) is less than the maximum value in scaling_matrix.
ValueError – if prob_threshold_coords aren’t strictly between 0 and 1.
ValueError – if evaluation_weights aren’t strictly positive.

Return type:

DataArray

References

Taggart, R. J., & Wilke, D. J. (2025). Warnings based on risk matrices: A coherent framework with consistent evaluation. Natural Hazards and Earth System Sciences, 25(8), 2657–2677. https://doi.org/10.5194/nhess-25-2657-2025

Examples

Returns weights for each risk matrix decision point, for the SHORT-RANGE scaling matrix of Taggart & Wilke (2025), with ESCALATION evaluation weights.

>>> import numpy as np
>>> from scores.categorical import weights_from_warning_scaling
>>> scaling = np.array([
>>>     [0, 2, 3, 3],
>>>     [0, 1, 2, 3],
>>>     [0, 1, 1, 2],
>>>     [0, 0, 0, 0],
>>> ])
>>> weights_from_warning_scaling(
>>>     scaling, [1, 2, 3],  "severity", ["MOD+", "SEV+", "EXT"], "prob_threshold", [0.1, 0.3, 0.5]
>>> )

scores.categorical.seeps(fcst, obs, prob_dry, light_heavy_threshold, *, dry_light_threshold=0.2, mask_clim_extremes=True, lower_masked_value=0.1, upper_masked_value=0.85, reduce_dims=None, preserve_dims=None, weights=None)#

Calculates the stable equitable error in probability space (SEEPS) score.

When used to evaluate precipitation forecasts, the SEEPS score calculates the performance of a forecast across three categories:

Dry weather (e.g., less than or equal to 0.2mm),
Light precipitation (the climatological lower two-thirds of rainfall conditioned on it raining),
Heavy precipitation (the climatological upper one-third of rainfall conditioned on it raining).

The SEEPS penalty matrix is defined as

\[\begin{split}s = \frac{1}{2} \left( \begin{matrix} 0 & \frac{1}{1-p_1} & \frac{1}{p_3}+\frac{1}{1-p_1} \\ \frac{1}{p_1} & 0 & \frac{1}{p_3} \\ \frac{1}{p_1}+\frac{1}{1-p_3} & \frac{1}{1-p_3} & 0 \end{matrix} \right)\end{split}\]

where

\(p_1\) is the climatological probability of the dry weather category,
\(p_3\) is the climatological probability of the heavy precipitation category,
The rows correspond to the forecast category (dry, light, heavy),
The columns correspond to the observation category (dry, light, heavy),

as defined in Eq 15 in Rodwell et al. (2010).

Let \(p_2\) denote the climatological probability of light precipitation occuring. Note that since \(p_2 = 2p_3\) and \(p_1 + p_2 + p_3 = 1\), then \(p_3 = (1 - p_1) / 3\) can be substituted into the penalty matrix. In this implementation, the user only provides \(p_1\) with the prob_dry arg and the function calculates \(p_3\) internally. Additionally, this implementation of the score is negatively oriented, meaning that lower scores are better.

Sometimes in the literature, a positively oriented version of SEEPS is used, which is defined as \(1 - \mathrm{SEEPS}\).

By default, the scores are only calculated for points where \(p_1 \in [0.1, 0.85]\) as per Rodwell et al. (2010). This can be changed by setting mask_clim_extremes to False or by changing the lower_masked_value and upper_masked_value parameters.

Parameters:

fcst (DataArray) – An array of real-valued forecasts (e.g., precipitation forecasts in mm).
obs (DataArray) – An array of real-valued observations (e.g., precipitation forecasts in mm).
prob_dry (DataArray) – The climatological probability of the dry weather category. This is called \(p_1\) in the SEEPS penalty matrix. Must be in the range [0, 1].
light_heavy_threshold (DataArray) – An array of the rainfall thresholds (e.g., in mm) that separates light and heavy precipitation. The threshold itself is included in the light precipitation category.
dry_light_threshold (float | None) – The threshold (e.g., in mm) that separates dry weather from light precipitation. Defaults to 0.2. Dry weather is defined as less than or equal to this threshold.
mask_clim_extremes (bool | None) – If True, mask out the score at points where \(p_1\) is less than lower_masked_value or greater than upper_masked_value. Instead a NaN is returned at these points. Defaults to True.
lower_masked_value (float | None) – The SEEPS score is masked at points where prob_dry is less than this value. Defaults to 0.1.
upper_masked_value (float | None) – The SEEPS score is masked at points where prob_dry is greater than this value. Defaults to 0.85.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the SEEPS score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the SEEPS score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the SEEPS score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.

Returns:

An xarray DataArray containing the SEEPS score.

Raises:

ValueError – if any values in prob_dry are outside the range [0, 1].

Return type:

DataArray

Warning

This function raises a warning if any values in prob_dry are exactly equal to 0 or 1.

References

Rodwell, M. J., Richardson, D. S., Hewson, T. D., & Haiden, T. (2010). A new equitable score suitable for verifying precipitation in numerical weather prediction. Quarterly Journal of the Royal Meteorological Society, 136(650), 1344–1363. https://doi.org/10.1002/qj.656

Examples

>>> import numpy as np
>>> import xarray as xr
>>> from scores.categorical import seeps
>>> fcst = xr.DataArray(np.random.rand(4, 6, 8), dims=['time', 'lat', 'lon'])
>>> obs = xr.DataArray(np.random.rand(4, 6, 8), dims=['time', 'lat', 'lon'])
>>> prob_dry = xr.DataArray(np.random.rand(6, 8), dims=['lat', 'lon'])
>>> light_heavy_threshold = 2 * xr.DataArray(np.random.rand(4, 6, 8), dims=['time', 'lat', 'lon'])
>>> seeps(fcst, obs, prob_dry, light_heavy_threshold=light_heavy_threshold)
<xarray.DataArray ()>
Size: 8B
array(0.84333334)

scores.spatial#

class scores.fast.fss.typing.FssComputeMethod(value)#

Choice of compute backend for FSS.

INVALID = -1#: invalid backend

NUMPY = 1#: computations using numpy (default)

NUMBA = 2#: computations using numba (currently unimplemented)

scores.spatial.fss_2d(fcst, obs, *, event_threshold, window_size, spatial_dims, zero_padding=False, reduce_dims=None, preserve_dims=None, threshold_operator=None, compute_method=FssComputeMethod.NUMPY, dask='forbidden')#

Uses fss_2d_single_field() to compute the Fractions Skill Score (FSS) for each 2D spatial field in the DataArray and then aggregates them over the output of gather dimensions.

The aggregation method is the intended extension of the score defined by Robert and Leans (2008) for multiple forecasts ^[1,2]

Note

This method takes in a threshold_operator to compare the input fields against the event_threshold which defaults to numpy.greater, and is compatible with lightweight numpy operators. If you would like to do more advanced binary discretization consider doing this separately and using scores.spatial.fss_2d_binary() instead, which takes in a pre-discretized binary field.

Optionally aggregates the output along other dimensions if reduce_dims or (mutually exclusive) preserve_dims are specified.

For implementation for a single 2-D field, see: fss_2d_single_field(). This offers the user the ability to use their own aggregation methods, rather than xarray.ufunc

Warning

dask option is not fully tested

Parameters:

fcst (DataArray) – An array of forecasts
obs (DataArray) – An array of observations (same spatial shape as fcst)
event_threshold (float) – A scalar to compare fcst and obs fields to generate a binary “event” field. (defaults to np.greater).
window_size (Tuple[int, int]) – A pair of positive integers (height, width) of the sliding window_size; the window size must be greater than 0 and fit within the shape of obs and fcst.
spatial_dims (Tuple[str, str]) – A pair of dimension names (x, y) where x and y are the spatial dimensions to slide the window along to compute the FSS. e.g. ("lat", "lon").
zero_padding (bool) – If set to true, applies a 0-valued window border around the data field before computing the FSS. If set to false (default), it uses the edges (thickness = window size) of the input fields as the border. One can think of it as: - zero_padding = False => inner border - zero_padding = True => outer border with 0 values
reduce_dims (Iterable[Hashable] | None) – Optional set of additional dimensions to aggregate over (mutually exclusive to preserve_dims).
preserve_dims (Iterable[Hashable] | None) – Optional set of dimensions to keep (all other dimensions are reduced); By default all dimensions except spatial_dims are preserved (mutually exclusive to reduce_dims).
threshold_operator (Callable | None) – The threshold operator used to generate the binary event field. E.g. np.greater. Note: this may depend on the backend compute method and not all operators may be supported. Generally, np.greater, np.less and their counterparts. Defaults to “np.greater”.
compute_method (FssComputeMethod) – currently only supports FssComputeMethod.NUMPY see: scores.fast.fss.typing.FssComputeMethod
dask (str) – See xarray.apply_ufunc for options

Returns:

An xarray.DataArray containing the FSS computed over the spatial_dims

The resultant array will have the score grouped against the remaining dimensions, unless reduce_dims/preserve_dims are specified; in which case, they will be aggregated over the specified dimensions accordingly.

For an exact usage please refer to FSS.ipynb in the tutorials.

Raises:

DimensionError – Various errors are thrown if the input dimensions do not conform. e.g. if the window size is larger than the input arrays, or if the the spatial dimensions in the args are missing in the input arrays.
DimensionWarning – If spatial_dims are attempting to be preserved e.g. in preserve_dims

Return type:

DataArray

References

Roberts, N. M., and H. W. Lean, 2008: Scale-Selective Verification of Rainfall Accumulations from High-Resolution Forecasts of Convective Events. Monthly Weather Review, 136, 78–97, https://doi.org/10.1175/2007mwr2123.1.
Mittermaier, M. P., 2021: A “Meta” Analysis of the Fractions Skill Score: The Limiting Case and Implications for Aggregation. Monthly Weather Review, 149, 3491–3504, https://doi.org/10.1175/mwr-d-18-0106.1.

scores.spatial.fss_2d_binary(fcst, obs, *, window_size, spatial_dims, zero_padding=False, reduce_dims=None, preserve_dims=None, compute_method=FssComputeMethod.NUMPY, check_boolean=True, dask='forbidden')#

Computes the Fractions Skill Score (FSS) using a binary field.

Takes in a binary (True or False) field of fcst and obs events. For example the output of a binary threshold applied to a continuous field or an event operator.

Optionally the user can set check_boolean to True to check that the field is boolean before any computations. Note: this asserts that the underlying fields are of type np.bool_ but does not coerce them. If the user is confident that the input field is binary (but not necessarily boolean), they may set this flag to False, to allow for more flexible binary configurations such as 0 and 1; this should give the same results.

Uses fss_2d from the scores.spatial module to perform fss computation, but without the threshold operation.

See also

scores.spatial.fss_2d() for more details and the continuous version. As well as detailed argument definitions.

Parameters:

fcst (DataArray)
obs (DataArray)
window_size (Tuple[int, int])
spatial_dims (Tuple[str, str])
zero_padding (bool)
reduce_dims (Iterable[Hashable] | None)
preserve_dims (Iterable[Hashable] | None)
compute_method (FssComputeMethod)
check_boolean (bool)
dask (str)

Return type:

DataArray

scores.spatial.fss_2d_single_field(fcst, obs, *, event_threshold, window_size, zero_padding=False, threshold_operator=None, compute_method=FssComputeMethod.NUMPY)#

Calculates the Fractions Skill Score (FSS) ^[1] for a given forecast and observed 2-D field.

The FSS is computed by counting the squared sum of forecast and observation events in a given window size. This is repeated over all possible window positions that can fit in the input arrays. A common method to do this is via a sliding window.

While the various compute backends may have their own implementations; most of them use some form of prefix sum algorithm to compute this quickly.

For 2-D fields this data structure is known as the “Summed Area Table” ^[2].

Once the squared sums are computed, the final FSS value can be derived by accumulating the squared sums ^[3].

The caller is responsible for making sure the input fields are in the 2-D spatial domain. (Although it should work for any np.array as long as it’s 2D, and window_size is appropriately sized.)

Parameters:

fcst (ndarray[tuple[Any, ...], dtype[float]]) – An array of forecasts
obs (ndarray[tuple[Any, ...], dtype[float]]) – An array of observations (same spatial shape as fcst)
event_threshold (float) – A scalar to compare fcst and obs fields to generate a binary “event” field.
window_size (Tuple[int, int]) – A pair of positive integers height, width) of the sliding window; the window dimensions must be greater than 0 and fit within the shape of obs and fcst.
threshold_operator (Callable | None) – The threshold operator used to generate the binary event field. E.g. np.greater. Note: this may depend on the backend compute method and not all operators may be supported. Generally, np.greater, np.less and their counterparts. Defaults to “np.greater”.
compute_method (FssComputeMethod) – currently only supports FssComputeMethod.NUMPY see: :py:class:FssComputeMethod
zero_padding (bool)

Returns:

A float representing the accumulated FSS.

Raises:

DimensionError – Various errors are thrown if the input dimensions do not conform. e.g. if the window size is larger than the input arrays, or if the the spatial dimensions of the input arrays do not match.

Return type:

float64

References

Roberts, N. M., and H. W. Lean, 2008: Scale-Selective Verification of Rainfall Accumulations from High-Resolution Forecasts of Convective Events. Monthly Weather Review 136, 78–97, https://doi.org/10.1175/2007mwr2123.1.
https://en.wikipedia.org/wiki/Summed-area_table
FAGGIAN, N., B. ROUX, P. STEINLE, and B. EBERT, 2015: Fast calculation of the fractions skill score. MAUSAM, 66, 457–466, https://doi.org/10.54302/mausam.v66i3.555.

scores.stats#

scores.stats.statistical_tests.diebold_mariano(da_timeseries, ts_dim, h_coord, *, method='HG', confidence_level=0.95, statistic_distribution='normal')#

Given an array of (multiple) timeseries, with each timeseries consisting of score differences for h-step ahead forecasts, calculates a modified Diebold-Mariano test statistic for each timeseries. Several other statistics are also returned such as the confidence that the population mean of score differences is greater than zero and confidence intervals for that mean.

Two methods for calculating the test statistic have been implemented: the “HG” method Hering and Genton (2011) and the “HLN” method of Harvey, Leybourne and Newbold (1997). The default “HG” method has an advantage of only generating positive estimates for the spectral density contribution to the test statistic. For further details see scores.stats.confidence_intervals.impl._dm_test_statistic.

Prior to any calculations, NaNs are removed from each timeseries. If there are NaNs in da_timeseries then a warning will occur. This is because NaNs may impact the autocovariance calculation.

To determine the value of h for each timeseries of score differences of h-step ahead forecasts, one may ask ‘How many observations of the phenomenon will be made between making the forecast and having the observation that will validate the forecast?’ For example, suppose that the phenomenon is afternoon precipitation accumulation in New Zealand (00z to 06z each day). Then a Day+1 forecast issued at 03z on Day+0 will be a 2-ahead forecast, since Day+0 and Day+1 accumulations will be observed before the forecast can be validated. On the other hand, a Day+1 forecast issued at 09z on Day+0 will be a 1-step ahead forecast. The value of h for each timeseries in the array needs to be specified in one of the sets of coordinates. See the example below.

Confidence intervals and “confidence_gt_0” statistics are calculated using the test statistic, which is assumed to have either the standard normal distribution or Student’s t distribution with n - 1 degrees of freedom (where n is the length of the timeseries). The distribution used is specified by statistic_distribution. See Harvey, Leybourne and Newbold (1997) for why the t distribution may be preferred, especially for shorter timeseries.

If da_timeseries is a chunked array, data will be brought into memory during this calculation due to the autocovariance implementation.

Parameters:

da_timeseries (DataArray) – a 2 dimensional array containing the timeseries.
ts_dim (str) – name of the dimension which identifies each timeseries in the array.
h_coord (str) – name of the coordinate specifying, for each timeseries, that the timeseries is an h-step ahead forecast. h_coord coordinates must be indexed by the dimension ts_dim.
method (Literal['HG', 'HLN']) – method for calculating the test statistic, one of “HG” or “HLN”.
confidence_level (float) – the confidence level, between 0 and 1 exclusive, at which to calculate confidence intervals.
statistic_distribution (Literal['normal', 't']) – the distribution of the test-statistic under the null hypothesis of equipredictive skill. Used to calculate the “confidence_gt_0” statistic and confidence intervals. One of “normal” or “t” (for Student’s t distribution).

Returns:

“mean”: the mean value for each timeseries, ignoring NaNs
”dm_test_stat”: the modified Diebold-Mariano test statistic for each timeseries
”timeseries_len”: the length of each timeseries, with NaNs removed.
”confidence_gt_0”: the confidence that the mean value of the population is greater than zero, based on the specified statistic_distribution. Precisely, it is the value of the cumululative distribution function evaluated at dm_test_stat.
”ci_upper”: the upper end point of a confidence interval about the mean at specified confidence_level.
”ci_lower”: the lower end point of a confidence interval about the mean at specified confidence_level.

Return type:

Dataset, indexed by ts_dim, with six variables

Raises:

ValueError – if method is not one of “HG” or “HLN”.
ValueError – if statistic_distribution is not one of “normal” or “t”.
ValueError – if 0 < confidence_level < 1 fails.
ValueError – if len(da_timeseries.dims) != 2.
ValueError – if ts_dim is not a dimension of da_timeseries.
ValueError – if h_coord is not a coordinate of da_timeseries.
ValueError – if ts_dim is not the only dimension of da_timeseries[h_coord].
ValueError – if h_coord values aren’t positive integers.
ValueError – if h_coord values aren’t less than the lengths of the timeseries after NaNs are removed.
RuntimeWarnning – if there is a NaN in diffs.

References

Diebold and Mariano, ‘Comparing predictive accuracy’, Journal of Business and Economic Statistics 13 (1995), 253-265.
Hering and Genton, ‘Comparing spatial predictions’, Technometrics 53 no. 4 (2011), 414-425.
Harvey, Leybourne and Newbold, ‘Testing the equality of prediction mean squared errors’, International Journal of Forecasting 13 (1997), 281-291.

Example

This array gives three timeseries of score differences. Coordinates in the “lead_day” dimension uniquely identify each timeseries. Here ts_dim=”lead_day”. Coordinates in the “valid_date” dimension give the forecast validity timestamp of each item in the timeseries. The “h” coordinates specify that the timeseries are for 2, 3 and 4-step ahead forecasts respectively. Here h_coord=”h”.

>>> da_timeseries = xr.DataArray(
...     data=[[1, 2, 3.0, 4, np.nan], [2.0, 1, -3, -1, 0], [1.0, 1, 1, 1, 1]],
...     dims=["lead_day", "valid_date"],
...     coords={
...         "lead_day": [1, 2, 3],
...         "valid_date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"],
...         "h": ("lead_day", [2, 3, 4]),
...     },
... )

>>> dm_test_stats(da_timeseries, "lead_day", "h")

scores.processing#

scores.processing.aggregate(values, *, reduce_dims, weights=None, method='mean')#

Computes a weighted or unweighted aggregation of the input data across specified dimensions. The input data is typically the “score” at each point.

This function applies a mean reduction or a sum over the dimensions given by reduce_dims on the input values, optionally using weights to compute a weighted mean or sum. The method arg specifies if you want to produce a weighted mean or weighted sum.

If reduce_dims is None, no aggregation is performed and the original values are returned unchanged.

If weights is None, an unweighted mean or sum is computed. If weights are provided, negative weights are not allowed and will raise a ValueError.

If weights are provided but reduce_dims is None (i.e., no reduction), a UserWarning is emitted since the weights will be ignored.

Weights must not contain NaN values. Missing values can be filled by weights.fillna(0) if you would like to assign a weight of zero to those points (e.g., masking).

Parameters:

values (DataArray | Dataset) – Input data to be aggregated.
reduce_dims (Iterable[Hashable] | None) – Dimensions over which to apply the mean. Can be a string, list of strings, or None. If None, no reduction is performed.
weights (DataArray | Dataset | None) – Weights to apply for weighted averaging. Must be broadcastable to values and contain no negative values. If None, an unweighted mean is calculated. Defaults to None.
method (str) – Aggregation method to use. Either “mean” or “sum”. Defaults to “mean”.

Returns:

An xarray object (same type as the input) with (un)weighted mean or sum of values

Raises:

ValueError – If weights contains any negative values.
ValueError – if weights contains any NaN values
ValueError – if method is not ‘mean’ or ‘sum’
ValueError – if weights is an xr.Dataset when values is an xr.DataArray

Return type:

DataArray | Dataset

Warning

UserWarning: If weights are provided but no reduction is performed (reduce_dims is None), a warning is issued since weights are ignored.

Examples

>>> import xarray as xr
>>> import numpy as np
>>> da = xr.DataArray(np.arange(6).reshape(2, 3), dims=['x', 'y'])
>>> weights = xr.DataArray([1, 2], dims=['x'])
>>> apply_weighted_mean(da, reduce_dims=['x'], weights=weights)
<xarray.DataArray (y: 3)>
array([2., 3., 4.])
Dimensions without coordinates: y

scores.processing.block_bootstrap(array_list, *, blocks, n_iteration, exclude_dims=None, circular=True)#

Perform block bootstrapping on provided arrays. The function creates new arrays by repeatedly bootstrapping along specified dimensions and stacking the new arrays along a new “iteration” dimension. Additionally, it includes internal functions for chunk size calculation and handling Dask arrays for chunk size limitation.

Parameters:

array_list (List[DataArray | Dataset] | DataArray | Dataset) – The data to bootstrap, which can be a single xarray object or a list of multiple xarray objects. In the case where multiple datasets are passed, each dataset can have its own set of dimension. However, for successful bootstrapping, dimensions across all input arrays must be nested. For instance, for block.keys=['d1', 'd2', 'd3'], an array with dimension ‘d1’ and ‘d2’ is valid, but an array with only dimension ‘d2’ is not valid. All datasets are bootstrapped according to the same random samples along available dimensions.
blocks (Dict[str, int]) – A dictionary specifying the dimension(s) to bootstrap and the block sizes to use along each dimension: {dimension: block_size}. The keys represent the dimensions to be bootstrapped, and the values indicate the block sizes along each dimension. The dimension provided here should exist in the data provided in array_list.
n_iteration (int) – The number of iterations to repeat the bootstrapping process. Determines how many bootstrapped arrays will be generated and stacked along the iteration dimension.
exclude_dims (List[List[str]] | None) – An optional parameter indicating the dimensions to be excluded during bootstrapping for each array provided in array_list. This parameter expects a list of lists, where each inner list corresponds to the dimensions to be excluded for the respective array. By default, the assumption is that no dimensions are excluded, and all arrays are bootstrapped across all specified dimensions in blocks.
circular (bool) – A boolean flag indicating whether circular block bootstrapping should be performed. Circular bootstrapping means that bootstrapping continues from the beginning when the end of the data is reached. By default, this parameter is set to True.

Returns:

If a single Dataset/DataArray (XarrayLike) is provided, the functions returns a bootstrapped XarrayLike object along the “iteration” dimension. If multiple XarrayLike objects are provided, it returns a tuple of bootstrapped XarrayLike objects, each stacked along the “iteration” dimension.

Raises:

ValueError – If bootstrapped dimensions don’t consistent sizes across arrays_list.
ValueError – If there is not at least one input array that contains all dimensions in blocks.keys().
ValueError – If exclude_dims is not a list of lists.
ValueError – If the list exclude_dims is not the same length as the number of as array_list.

Return type:

DataArray | Dataset | Tuple[DataArray | Dataset, …]

References

Gilleland, E. (2020). Bootstrap Methods for Statistical Inference. Part I: Comparative Forecast Verification for Continuous Variables. Journal of Atmospheric and Oceanic Technology, 37(11), 2117–2134. https://doi.org/10.1175/jtech-d-20-0069.1
Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Academic press. https://doi.org/10.1016/C2017-0-03921-6

Examples

Bootstrap a fcst and obs dataset along the time and space dimensions with block sizes of 10 for each dimension. The bootstrapping is repeated 1000 times.

>>> import numpy as np
>>> import xarray as xr
>>> from scores.processing import block_bootstrap
>>> obs = xr.DataArray(np.random.rand(100, 100), dims=["time", "space"])
>>> fcst = xr.DataArray(np.random.rand(100, 100), dims=["time", "space"])
>>> bootstrapped_obs, bootstrapped_fcst = block_bootstrap(
...     [obs, fcst],
...     blocks={"time": 10, "space": 10},
...     n_iteration=1000,
... )

scores.processing.isotonic_fit(fcst, obs, *, weight=None, functional='mean', bootstraps=None, quantile_level=None, solver=None, confidence_level=0.9, min_non_nan=1, report_bootstrap_results=False)#

If forecasts target the mean or a quantile functional, then the user can select that functional. Otherwise the user needs to supply a solver function

of one variable (if no weights are supplied), or

of two variables (if weights are supplied).

The solver function is used to find an optimal regression fit in each block, subject to a non-decreasing monotonicity constraint, using the pool adjacent violators (PAV) algorithm.

An example of a solver function of one variable is obs -> np.mean(obs) An example of a solver function of two variables is (obs, weight) -> np.average(obs, weights=weight).

This implementation uses sklearn.isotonic.IsotonicRegression when functional=”mean”.

Parameters:

fcst (ndarray | DataArray) – np.array or xr.DataArray of forecast (or explanatory) values. Values must be float, integer or NaN.
obs (ndarray | DataArray) – np.array or xr.DataArray of corresponding observation (or response) values. Must be the same shape as fcst. Values must be float, integer or NaN.
weight (ndarray | DataArray | None) – positive weights to apply to each forecast-observation pair. Must be the same shape as fcst, or None (which is equivalent to applying equal weights).
functional (Literal['mean', 'quantile'] | None) – Functional that the forecasts are targeting. Either “mean” or “quantile” or None. If None then solver must be supplied. The current implementation for “quantile” does not accept weights. If weighted quantile regression is desired then the user should should supply an appropriate solver.
bootstraps (int | None) – the number of bootstrap samples to perform for calculating the regression confidence band. Set to None if a confidence band is not required.
quantile_level (float | None) – the level of the quantile functional if functional=’quantile’. Must be strictly between 0 and 1.
solver (Callable[[ndarray, ndarray], float] | Callable[[ndarray], float] | None) – function that accepts 1D numpy array of observations and returns a float. Function values give the regression fit for each block as determined by the PAV algorithm. If weights are supplied, the function must have a second argument that accepts 1D numpy array of weights of same length as the observations.
confidence_level (float | None) – Confidence level of the confidence band, strictly between 0 and 1. For example, a confidence level of 0.5 will calculate the confidence band by computing the 25th and 75th percentiles of the bootstrapped samples.
min_non_nan (int | None) – The minimum number of non-NaN bootstrapped values to calculate confidence bands at a particular forecast value.
report_bootstrap_results (bool | None) – This specifies to whether keep the bootstrapped regression values in return dictionary. Default is False to keep the output result small.

Returns:

“unique_fcst_sorted”: 1D numpy array of remaining forecast values sorted in
ascending order, after any NaNs from fcst, obs and weight are removed, and only unique values are kept to keep the output size reasonably small.
”fcst_counts”: 1D numpy array of forecast counts for unique values of forecast sorted
”regression_values”: 1D numpy array of regression values corresponding to
”unique_fcst_sorted” values.
”regression_func”: function that returns the regression fit based on linear
interpolation of (“fcst_sorted”, “regression_values”), for any supplied argument (1D numpy array) of potential forecast values.
”bootstrap_results”: in the case of report_bootstrap_results=True, 2D numpy
array of bootstrapped regression values is included in return dictionary. Each row gives the interpolated regression values from a particular bootstrapped sample, evaluated at “fcst_sorted” values. If m is the number of bootstrap samples and n = len(fcst_sorted) then it is has shape (m, n). We emphasise that this array corresponds to fcst_sorted not unique_fcst_sorted.
”confidence_band_lower_values”: values of lower confidence band threshold, evaluated
at “unique_fcst_sorted” values.
”confidence_band_upper_values”: values of upper confidence band threshold, evaluated
at “unique_fcst_sorted” values.
”confidence_band_lower_func”: function that returns regression fit based on linear
interpolation of (“fcst_sorted”, “confidence_band_lower_values”), given any argument (1D numpy array) of potential forecast values.
”confidence_band_upper_func”: function that returns regression fit based on linear
interpolation of (“fcst_sorted”, “confidence_band_upper_values”), given any argument (1D numpy array) of potential forecast values.
”confidence_band_levels”: tuple giving the quantile levels used to calculate the
confidence band.

Return type:

Dictionary with the following keys

Raises:

ValueError – if fcst and obs are np.arrays and don’t have the same shape.
ValueError – if fcst and weight are np.arrays and don’t have the same shape.
ValueError – if fcst and obs are xarray.DataArrays and don’t have the same dimensions.
ValueError – if fcst and weight are xarray.DataArrays and don’t have the same dimensions.
ValueError – if any entries of fcst, obs or weight are not an integer, float or NaN.
ValueError – if there are no fcst and obs pairs after NaNs are removed.
ValueError – if functional is not one of “mean”, “quantile” or None.
ValueError – if quantile_level is not strictly between 0 and 1 when functional=”quantile”.
ValueError – if weight is not None when functional=”quantile”.
ValueError – if functional and solver are both None.
ValueError – if not exactly one of functional and solver is None.
ValueError – if bootstraps is not a positive integer.
ValueError – if confidence_level is not strictly between 0 and 1.

Note: This function only keeps the unique values of fcst_sorted to keep the volume of

the return dictionary small. The forecast counts is also included, so users can it to create forecast histogram (usually displayed in the reliability diagrams).

References

de Leeuw, Hornik and Mair. “Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods”, Journal of Statistical Software, 2009.
Dimitriadis, Gneiting and Jordan. “Stable reliability diagrams for probabilistic classifiers”, PNAS, Vol. 118 No. 8, 2020. Available at https://www.pnas.org/doi/10.1073/pnas.2016191118
Jordan, Mühlemann, and Ziegel. “Optimal solutions to the isotonic regression problem”, 2020 (version 2), available on arxiv at https://arxiv.org/abs/1904.04761

scores.processing.broadcast_and_match_nan(*args: Dataset) → tuple[Dataset, ...]#

scores.processing.broadcast_and_match_nan(*args: DataArray) → tuple[DataArray, ...]

Input xarray data objects are ‘matched’ - they are broadcast against each other (forced to have the same dimensions), and the position of nans are forced onto all DataArrays. This matching process is applied across all supplied DataArrays, as well as all DataArrays inside supplied Datasets.

Parameters:: *args – any number of xarray data objects supplied as positional arguments. See examples below.
Returns:: A tuple of data objects of the same length as the number of data objects supplied as input. Each returned object is the ‘matched’ version of the input.
Raises:: ValueError – if any input args is not an xarray data object.

Examples

>>> # Matching xarray data objects
>>> da1_matched, ds_matched, da2_matched = scores.processing.broadcast_and_match_nan(da1, ds, da2)

>>> # Matching a tuple of xarray data objects
>>> input_tuple = (da1, ds, da2)
>>> matched_tuple = broadcast_and_match_nan(*input_tuple)
>>> da1_matched = matched_tuple[0]
>>> ds_matched = matched_tuple[1]
>>> da2_matched = matched_tuple[2]

scores.processing.comparative_discretise(data, comparison, mode, *, abs_tolerance=None)#

Converts the values of data to 0 or 1 based on how they relate to the specified values in comparison via the mode operator.

Parameters:

data (DataArray | Dataset) – The data to convert to discrete values.
comparison (DataArray | float | int) – The values to which to compare data.
mode (Callable | str) –
Specifies the required relation of data to thresholds for a value to fall in the ‘event’ category (i.e. assigned to 1). The mode can be a string or a Python operator function from the list below. Allowed modes are:
- operator.ge: Where values in data greater than or equal to the corresponding threshold are assigned as 1.
- operator.gt: Where values in data greater than the corresponding threshold are assigned as 1.
- operator.le: Where values in data less than or equal to the corresponding threshold are assigned as 1.
- operator.lt: Where values in data less than the corresponding threshold are assigned as 1.
- operator.eq: values in data equal to the corresponding threshold are assigned as 1.
- operator.ne: values in data not equal to the corresponding threshold are assigned as 1.
- ’>=’ values in data greater than or equal to the corresponding threshold are assigned as 1.
- ’>’ values in data greater than the corresponding threshold are assigned as 1.
- ’<=’ values in data less than or equal to the corresponding threshold are assigned as 1.
- ’<’ values in data less than the corresponding threshold are assigned as 1.
- ’==’ values in data equal to the corresponding threshold are assigned as 1.
- ’!=’ values in data not equal to the corresponding threshold are assigned as 1.
abs_tolerance (float | None) – If supplied, values in data that are within abs_tolerance of a threshold are considered to be equal to that threshold. This is generally used to correct for floating point rounding, e.g. we may want to consider 1.0000000000000002 as equal to 1.

Returns:

An xarray data object of the same type as data. The dimensions of the output are the union of all dimensions in data and comparison. The values of the output are either 0 or 1 or NaN, depending on the truth of the operation data <mode> comparison.

Raises:

ValueError – if abs_tolerance is not a non-negative float.
ValueError – if mode is not valid.
TypeError – if comparison is not a float, int or xarray.DataArray.

Return type:

DataArray | Dataset

scores.processing.binary_discretise(data, thresholds, mode, *, abs_tolerance=None, autosqueeze=False)#

Converts the values of data to 0 or 1 for each threshold in thresholds according to the operation defined by mode.

Parameters:

data (DataArray | Dataset) – The data to convert to discrete values.
thresholds (Real | Sequence[Real]) – Threshold(s) at which to convert the values of data to 0 or 1.
mode (Callable | str) –
Specifies the required relation of data to thresholds for a value to fall in the ‘event’ category (i.e. assigned to 1). The mode can be a string or a Python operator function from the list below. Allowed modes are:
- operator.ge: Where values in data greater than or equal to the corresponding threshold are assigned as 1.
- operator.gt: Where values in data greater than the corresponding threshold are assigned as 1.
- operator.le: Where values in data less than or equal to the corresponding threshold are assigned as 1.
- operator.lt: Where values in data less than the corresponding threshold are assigned as 1.
- operator.eq: values in data equal to the corresponding threshold are assigned as 1.
- operator.ne: values in data not equal to the corresponding threshold are assigned as 1.
- ’>=’ values in data greater than or equal to the corresponding threshold are assigned as 1.
- ’>’ values in data greater than the corresponding threshold are assigned as 1.
- ’<=’ values in data less than or equal to the corresponding threshold are assigned as 1.
- ’<’ values in data less than the corresponding threshold are assigned as 1.
- ’==’ values in data equal to the corresponding threshold are assigned as 1.
- ’!=’ values in data not equal to the corresponding threshold are assigned as 1.
abs_tolerance (float | None) – If supplied, values in data that are within abs_tolerance of a threshold are considered to be equal to that threshold. This is generally used to correct for floating point rounding, e.g. we may want to consider 1.0000000000000002 as equal to 1
autosqueeze (bool | None) – If True and only one threshold is supplied, then the dimension ‘threshold’ is squeezed out of the output. If thresholds is float-like, then this is forced to True, otherwise defaults to False.

Returns:

An xarray data object with the type and dimensions of data, plus an extra dimension ‘threshold’ if autosqueeze is False. The values of the output are either 0 or 1, depending on whether data <mode> threshold is True or not (although NaNs are preserved).

Raises:

ValueError – if ‘threshold’ is a dimension in data.
ValueError – if “Values in thresholds are not monotonic increasing”

Return type:

DataArray | Dataset

scores.processing.binary_discretise_proportion(data, thresholds, mode, *, reduce_dims=None, preserve_dims=None, abs_tolerance=None, autosqueeze=False)#

Returns the proportion of data in each category. The categories are defined by the relationship of data to threshold as specified by the operation mode.

Parameters:

data (DataArray | Dataset) – The data to convert into 0 and 1 according the thresholds before calculating the proportion.
thresholds (Iterable) – The proportion of values equal to or exceeding these thresholds will be calculated.
mode (Callable | str) –
Specifies the required relation of data to thresholds for a value to fall in the ‘event’ category (i.e. assigned to 1). The mode can be a string or a Python operator function from the list below. Allowed modes are:
- operator.ge: Where values in data greater than or equal to the corresponding threshold are assigned as 1.
- operator.gt: Where values in data greater than the corresponding threshold are assigned as 1.
- operator.le: Where values in data less than or equal to the corresponding threshold are assigned as 1.
- operator.lt: Where values in data less than the corresponding threshold are assigned as 1.
- operator.eq: values in data equal to the corresponding threshold are assigned as 1.
- operator.ne: values in data not equal to the corresponding threshold are assigned as 1.
- ’>=’ values in data greater than or equal to the corresponding threshold are assigned as 1.
- ’>’ values in data greater than the corresponding threshold are assigned as 1.
- ’<=’ values in data less than or equal to the corresponding threshold are assigned as 1.
- ’<’ values in data less than the corresponding threshold are assigned as 1.
- ’==’ values in data equal to the corresponding threshold are assigned as 1.
- ’!=’ values in data not equal to the corresponding threshold are assigned as 1.
reduce_dims (Iterable[Hashable] | None) – Dimensions to reduce.
preserve_dims (Iterable[Hashable] | None) – Dimensions to preserve.
abs_tolerance (bool | None) – If supplied, values in data that are within abs_tolerance of a threshold are considered to be equal to that threshold. This is generally used to correct for floating point rounding, e.g. we may want to consider 1.0000000000000002 as equal to 1.
autosqueeze (bool) – If True and only one threshold is supplied, then the dimension ‘threshold’ is squeezed out of the output. If thresholds is float-like, then this is forced to True, otherwise defaults to False.

Returns:

An xarray data object with the type of data, dimension dims + ‘threshold’. The values of the output are the proportion of data that satisfy the relationship to thresholds as specified by mode.

Examples

>>> import operator
>>> import xarray as xr
>>> from scores.processing import binary_discretise_proportion
>>> data = xr.DataArray([0, 0.5, 0.5, 1])
>>> binary_discretise_proportion(data, [0, 0.5, 1], operator.eq)
<xarray.DataArray (threshold: 3)>
array([ 0.25,  0.5 ,  0.25])
Coordinates:
  * threshold  (threshold) float64 0.0 0.5 1.0
Attributes:
    discretisation_tolerance: 0
    discretisation_mode: ==

>>> binary_discretise_proportion(data, [0, 0.5, 1], operator.ge)
<xarray.DataArray (threshold: 3)>
array([ 1.  ,  0.75,  0.25])
Coordinates:
  * threshold  (threshold) float64 0.0 0.5 1.0
Attributes:
    discretisation_tolerance: 0
    discretisation_mode: >=

See also

scores.processing.binary_discretise()

scores.processing.proportion_exceeding(data, thresholds, *, reduce_dims=None, preserve_dims=None)#

Calculates the proportion of data equal to or exceeding thresholds.

Parameters:

data (xarray.Dataset or xarray.DataArray) – The data from which to calculate the proportion exceeding thresholds
thresholds (iterable) – The proportion of values equal to or exceeding these thresholds will be calculated.
reduce_dims (Iterable[Hashable] | None) – Dimensions to reduce.
preserve_dims (Iterable[Hashable] | None) – Dimensions to preserve.

Returns:

An xarray data object with the type of data and dimensions dims + ‘threshold’. The values are the proportion of data that are greater than or equal to the corresponding threshold.

scores.processing.cdf.round_values(array, rounding_precision, *, final_round_decpl=7)#

Round data array to specified precision.

Rounding is done differently to xarray.DataArray.round or numpy.round where the number of decimal places is specified in those cases. Instead, here the rounding precision is specified as a float. The value is rounded to the nearest value that is divisible by rounding_precision.

For example, 3.73 rounded to precision 0.2 is 3.8, and 37.3 rounded to precision 20 is 40.

Assumes that rounding_precision >=0, with 0 indicating no rounding to be performed. If rounding_precision > 0, a final round to final_round_decpl decimal places is performed to remove artefacts of python rounding process.

Parameters:

array (xr.DataArray) – array of data to be rounded
rounding_precision (float) – rounding precision
final_round_decpl (int) – final round to specified number of decimal places when rounding_precision > 0.

Returns:

DataArray with rounded values.

Return type:

xr.DataArray

Raises:

ValueError – If rounding_precision < 0.

scores.processing.cdf.propagate_nan(cdf, threshold_dim)#

Propagates the NaN values from a “cdf” variable along the threshold_dim.

Parameters:

cdf (xr.DataArray) – CDF values, so that P(X <= threshold) = cdf_value for each threshold in the threshold_dim dimension.
threshold_dim (str) – name of the threshold dimension in cdf.

Returns:

cdf variable with NaNs propagated.

Return type:

xr.DataArray

Raises:

ValueError – If threshold_dim is not a dimension of cdf.

scores.processing.cdf.observed_cdf(obs, threshold_dim, *, threshold_values=None, include_obs_in_thresholds=True, precision=0)#

Returns a data array of observations converted into CDF format.

Such that:: returned_value = 0 if threshold < observation returned_value = 1 if threshold >= observation

Parameters:

obs (xr.DataArray) – observations
threshold_dim (str) – name of dimension in returned array that contains the threshold values.
threshold_values (Optional[Iterable[float]]) – values to include among thresholds.
include_obs_in_thresholds (bool) – if True, include (rounded) observed values among thresholds.
precision (float) – precision applied to observed values prior to constructing the CDF and thresholds. Select 0 for highest precision (i.e. no rounding).

Returns:

Observed CDFs and thresholds in the threshold_dim dimension.

Return type:

xr.DataArray

Raises:

ValueError – if precision < 0.
ValueError – if all observations are NaN and no non-NaN threshold_values are not supplied.

scores.processing.cdf.integrate_square_piecewise_linear(function_values, threshold_dim)#

Calculates integral values and collapses threshold_dim.

Calculates \(\int{(F(t)^2)}\), where:

If t is a threshold value in threshold_dim then F(t) is in function_values,
F is piecewise linear between each of the t values in threshold_dim.

This function assumes that:

threshold_dim is a dimension of function_values
coordinates of threshold_dim are increasing.

Parameters:

function_values (xr.DataArray) – array of function values F(t).
threshold_dim (xr.DataArray) – dimension along which to integrate.

Returns:

Integral values and threshold_dim collapsed:

Returns value of the integral with threshold_dim collapsed and other dimensions preserved.
Returns NaN if there are less than two non-NaN function_values.

Return type:

xr.DataArray

scores.processing.cdf.add_thresholds(cdf, threshold_dim, new_thresholds, fill_method, *, min_nonnan=2)#

Takes a CDF data array with dimension threshold_dim and adds values from new_thresholds.

The CDF is then filled to replace any NaN values. The array cdf requires at least 2 non-NaN values along threshold_dim.

Parameters:

cdf (xr.DataArray) – array of CDF values.
threshold_dim (str) – name of the threshold dimension in cdf.
new_thresholds (Iterable[float]) – new thresholds to add to cdf.
fill_method (Literal["linear", "step", "forward", "backward", "none"]) – one of “linear”, “step”, “forward” or “backward”, as described in fill_cdf. If no filling, set to “none”.
min_nonnan (int) – passed onto fill_cdf for performing filling.

Returns:

Additional thresholds, and values at those thresholds determined by the specified fill method.

Return type:

xr.DataArray

scores.processing.cdf.fill_cdf(cdf, threshold_dim, method, min_nonnan)#

Fills NaNs in a CDF of a real-valued random variable along threshold_dim with appropriate values between 0 and 1.

Parameters:

cdf (xr.DataArray) – CDF values, where P(Y <= threshold) = cdf_value for each threshold in threshold_dim.
threshold_dim (str) – the threshold dimension in the CDF, along which filling is performed.
method (Literal["linear", "step", "forward", "backward"]) –
one of:
- ”linear”: use linear interpolation, and if needed also extrapolate linearly. Clip to 0 and 1. Needs at least two non-NaN values for interpolation, so returns NaNs where this condition fails.
- ”step”: use forward filling then set remaining leading NaNs to 0. Produces a step function CDF (i.e. piecewise constant).
- ”forward”: use forward filling then fill any remaining leading NaNs with backward filling.
- ”backward”: use backward filling then fill any remaining trailing NaNs with forward filling.
min_nonnan (int) – the minimum number of non-NaN entries required along threshold_dim for filling to be performed. All CDF values are set to np.nan where this condition fails. min_nonnan must be at least 2 for the “linear” method, and at least 1 for the other methods.

Returns:

Containing the same values as cdf but with NaNs filled.

Return type:

xr.DataArray

Raises:

ValueError – If threshold_dim is not a dimension of cdf.
ValueError – If min_nonnan < 1 when method=”step” or if min_nonnan < 2 when method=”linear”.
ValueError – If method is not “linear”, “step”, “forward” or “backward”.
ValueError – If any non-NaN value of cdf lies outside the unit interval [0,1].

scores.processing.cdf.decreasing_cdfs(cdf, threshold_dim, tolerance)#

A CDF of a real-valued random variable should be nondecreasing along threshold_dim.

This is sometimes violated due to rounding issues or bad forecast process. decreasing_cdfs checks CDF values decrease beyond specified tolerance; that is, whenever the sum of the incremental decreases exceeds tolerarance.

For example, if the CDF values are [0, 0.4, 0.3, 0.9, 0.88, 1] then the sum of incremental decreases is -0.12. Given a specified positive tolerance, the CDF values decrease beyond tolerance if the sum of incremental decreases < -tolerance.

Intended use is for CDFs with increasing coordinates along threshold_dim dimension, and where either each CDF is always NaN or always non-NaN.

Parameters:

cdf (xr.DataArray) – data array of CDF values
threshold_dim (str) – threshold dimension, such that P(Y < threshold) = cdf_value.
tolerance (float) – nonnegative tolerance value.

Returns:

Containing threshold_dim collapsed and values True if and only if the CDF is decreasing outside tolerance. If the CDF consists only of NaNs then the value is False.

Return type:

xr.DataArray

Raises:

ValueError – If threshold_dim is not a dimension of cdf.
ValueError – If tolerance is negative.
ValueError – If coordinates are not increasing along threshold_dim.
ValueError – If some, but not all, CDF values in cdf along threshold_dim are NaN.

scores.processing.cdf.cdf_envelope(cdf, threshold_dim)#

Forecast cumulative distribution functions (CDFs) for real-valued random variables.

CDFs that are reconstructed from known points on the distribution should be nondecreasing with respect to the threshold dimension. However, sometimes this may fail due to rounding or poor forecast process. This function returns the “envelope” of the original CDF, which consists of two bounding CDFs, both of which are nondecreasing.

The following example shows values from an original CDF that has a decreasing subsequence (and so is not a true CDF). The resulting “upper” and “lower” CDFs minimally adjust “original” so that “lower” <= “original” <= “upper”.

“original”: [0, .5, .2, .8, 1]
“upper”: [0, .5, .5, .8, 1]
“lower”: [0, .2, .2, .8, 1]

This function does not perform checks that 0 <= cdf <= 1.

Parameters:

cdf (xr.DataArray) – forecast CDF with thresholds in the thresholds_dim.
threshold_dim (str) – dimension in fcst_cdf that contains the threshold ordinates.

Returns:

An xarray DataArray consisting of three CDF arrays indexed along the “cdf_type” dimension with the following indices:

”original”: same data as cdf.

”upper”: minimally adjusted “original” CDF that is nondecreasing and satisfies “upper” >= “original”.

”lower”: minimally adjusted “original” CDF that is nondecreasing and satisfies “lower” <= “original”.

NaN values in cdf are maintained in “original”, “upper” and “lower”.

Return type:

xr.DataArray

Raises:

ValueError – If threshold_dim is not a dimension of cdf.

scores.plotdata#

scores.plotdata.murphy_score(fcst, obs, thetas, *, functional, alpha, huber_a=None, decomposition=False, reduce_dims=None, preserve_dims=None)#

Select functional="quantile" and alpha=0.5 for the median functional. Select functional="expectile" and alpha=0.5 for the mean (i.e., expectation) functional.

Parameters:

fcst (DataArray) – Forecast numerical values.
obs (DataArray) – Observed numerical values.
thetas (Sequence[float] | DataArray) – Theta thresholds.
functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.
alpha (float) – Risk parameter (i.e. quantile or expectile level) for the functional. Must be between 0 and 1.
huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.
decomposition (bool) – True to return penalty values due to under- and over-fcst as well as the total score, False to return total score only.
reduce_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to reduce when calculating the Murphy score. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Iterable[Hashable] | None) – Optionally specify which dimensions to preserve when calculating the Murphy score. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the errors will be the FIRM score at each point (i.e. single-value comparison against observed), and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.

Returns:

An xr.Dataset with dimensions based on the preserve_dims or reduce_dims arg as well as a “theta” dimension with values matching thetas input.

If decomposition is False, the dataset’s variables will contain 1 element “total”.

If decomposition is True, in addition to “total”, it will have “underforecast” and “overforecast” data_vars.

Raises:

ValueError – If functional is not one of the expected functions.
ValueError – If alpha is not strictly between 0 and 1.
ValueError – If huber_a is not strictly greater than 0.

Return type:

Dataset

References

For mean elementary score definitions, see:

Theorem 1 of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154
Theorem 5.3 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957

scores.plotdata.murphy_thetas(forecasts, obs, functional, *, huber_a=None, left_limit_delta=None)#

Return the decision thresholds (theta values) at which to evaluate elementary scores for the construction of Murphy diagrams.

Parameters:

forecasts (list[DataArray]) – Forecast values, one array per source.
obs (DataArray) – Observed values.
functional (Literal['quantile', 'huber', 'expectile']) – The type of elementary scoring function, one of “quantile”, “huber”, or “expectile”.
huber_a (float | None) – Huber transition parameter, used for “huber” functional only. Must be strictly greater than 0.
left_limit_delta (float | None) – Approximation of the left hand limit, used for “huber” and “expectile” functionals. Must be greater than or equal to 0. None will be treated as 0. Ideally, left_limit_delta should be small relative to the fcst and obs precision, and not greater than that precision.

Returns:

theta thresholds to be used to compute murphy scores.

Return type:

List[float]

Raises:

ValueError – If functional is not one of the expected functions.
ValueError – If huber_a is not strictly greater than 0.
ValueError – If left_limit_delta is not greater than or equal to 0.

Notes

For theta values at which to evaluate elementary scores, see

Corollary 2 (p.521) of Ehm, W., Gneiting, T., Jordan, A., & Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(3), 505–562. https://doi.org/10.1111/rssb.12154
Corollary 5.6 of Taggart, R. J. (2022). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, 16(1), 201-231. https://doi.org/10.1214/21-EJS1957

scores.plotdata.qq(fcst, obs, *, quantiles, interpolation_method='linear', data_source_dim='data_source', reduce_dims=None, preserve_dims=None)#

Calculate the quantiles of empirical forecast and observation data for a quantile-quantile (Q-Q) plot.

Currently this function is implemented for working with raw forecast and observation data rather than a theoretical probability distribution.

Parameters:

fcst (DataArray | Dataset) – Forecast data.
obs (DataArray | Dataset) – Observation data.
quantiles (Iterable[float]) – Quantiles to compute, e.g., [0.1, 0.5, 0.9].
interpolation_method (str) – Interpolation method for estimating the quantile value, default is ‘linear’. Valid options include ‘inverted_cdf’, ‘averaged_inverted_cdf’, ‘closest_observation’, ‘interpolated_inverted_cdf’, ‘hazen’, ‘weibull’, ‘linear’, ‘median_unbiased’, ‘normal_unbiased’, ‘lower’, ‘higher’, ‘midpoint’, and ‘nearest’. See numpy.quantile() for more information on interpolation method options.
data_source_dim (str) – Name of the new dimension to be created for the data source, default is ‘data_source’. This dimension will contain two values: ‘fcst’ and ‘obs’, representing the forecast and observation data, respectively.
reduce_dims (Iterable[Hashable] | None) – Dimensions to reduce over when calculating quantiles.
preserve_dims (Iterable[Hashable] | None) – Dimensions to preserve when calculating quantiles.

Returns:

An xarray object with a new dimension called “data_source”.

Raises:

ValueError – If the interpolation method is invalid.
ValueError – if quantiles are not between 0 and 1.
ValueError – If a dimension in the input data has the same name as data_source_dim
ValueError – If fcst and obs are Datasets with different data variables.
ValueError – If a user tries to preserve all dimensions.
TypeError – If fcst and obs are not both xarray DataArrays or Datasets.

Return type:

DataArray | Dataset

References

Example

>>> import xarray as xr
>>> import numpy as np
>>> from scores.plotdata import qq
>>> fcst = xr.DataArray(np.random.rand(100), dims='time')
>>> obs = xr.DataArray(np.random.rand(100), dims='time')
>>> result = qq(fcst, obs, quantiles=[0.1, 0.5, 0.9])

scores.plotdata.roc(fcst, obs, thresholds='auto', *, reduce_dims=None, preserve_dims=None, weights=None, check_args=True)#

The AUC is the probability that the forecast probability of a random event is higher than the forecast probability of a random non-event.

Parameters:

fcst (DataArray) – An array of probabilistic forecasts for a binary event in the range [0, 1].
obs (DataArray) – An array of binary values where 1 is an event and 0 is a non-event.
thresholds (str | Iterable[float]) – By default, when thresholds = "auto", the ROC thresholds are automatically generated. Otherwise, you can supply an iterable of floats with monotonic increasing values between 0 and 1, which are the thresholds at and above which to convert the probabilistic forecast to a value of 1 (an ‘event’). If there are many unique forecast values, this can lead to a very large number of automatically generated thresholds. If performance is slow, consider supplying thresholds manually as an iterable of floats. Np.inf is added automatically to the end of the thresholds to ensure that the full ROC curve is produced. Similarly, if 0 is not included, it will be added automatically to ensure the full ROC curve is produced.
reduce_dims (Sequence[str] | None) – Optionally specify which dimensions to reduce when calculating the ROC curve data. All other dimensions will be preserved. As a special case, ‘all’ will allow all dimensions to be reduced. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
preserve_dims (Sequence[str] | None) – Optionally specify which dimensions to preserve when calculating ROC curve data. All other dimensions will be reduced. As a special case, ‘all’ will allow all dimensions to be preserved. In this case, the result will be in the same shape/dimensionality as the forecast, and the values will be the ROC curve at each point (i.e. single-value comparison against observed) for each threshold, and the forecast and observed dimensions must match precisely. Only one of reduce_dims and preserve_dims can be supplied. The default behaviour if neither are supplied is to reduce all dims.
weights (DataArray | None) – An array of weights to apply to the score (e.g., weighting a grid by latitude). If None, no weights are applied. If provided, the weights must be broadcastable to the data dimensions and must not contain negative or NaN values. If appropriate, users can choose to replace NaN values in weights by calling weights.fillna(0). The weighting approach follows xarray.computation.weighted.DataArrayWeighted. See the scores weighting tutorial for more information on how to use weights.
check_args (bool) – Checks if obs data only contains values in the set {0, 1, np.nan}. You may want to skip this check if you are sure about your input data and want to improve the performance when working with dask.

Returns:

‘POD’ (the probability of detection)
’POFD’ (the probability of false detection)
’AUC’ (the area under the ROC curve)

POD and POFD have dimensions dims + ‘threshold’, while AUC has dimensions dims.

Return type:

An xarray.Dataset with data variables

Raises:

ValueError – if fcst contains values outside of the range [0, 1].
ValueError – if obs contains non-nan values not in the set {0, 1}.
ValueError – if ‘threshold’ is a dimension in fcst.
ValueError – if values in thresholds are not monotonic increasing or are outside the range [0, 1].
ValueError – if thresholds is a string that is not “auto”.
ValueError – if thresholds is an empty iterable.

Warning

If the number of automatically generated thresholds is very large (>1000), a warning is raised suggesting that the user supply thresholds manually as an iterable of floats if performance is slow.

Notes

If threshold="auto" which is the default, then the thresholds used are the ordered, unique forecast values.

Ideally concave ROC curves should be generated rather than traditional ROC curves.

Example

>>> import xarray as xr
>>> import numpy as np
>>> from scores.probability import roc_curve_data
>>> fcst = xr.DataArray(np.random.rand(3, 4), dims=["time", "location"])
>>> obs = xr.DataArray(np.random.randint(0, 2, size=(3, 4)), dims=["time", "location"])
>>> result = roc_curve_data(fcst, obs)

scores.pandas#

scores.pandas.continuous.mse(fcst, obs, *, is_angular=False)#

Calculates the mean squared error from forecast and observed data.

A detailed explanation is on https://en.wikipedia.org/wiki/Mean_squared_error

\[\frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2\]

Notes

Dimensional reduction is not supported for pandas and the user should convert their data to xarray to formulate the call to the base metric, scores.continuous.mse.

Parameters:

fcst (Type[Series]) – Forecast or predicted variables in pandas.
obs (Type[Series]) – Observed variables in pandas.
is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

An object containing a single floating point number representing the mean squared error for the supplied data. All dimensions will be reduced.

Return type:

pandas.Series

scores.pandas.continuous.rmse(fcst, obs, *, is_angular=False)#

Calculate the Root Mean Squared Error from xarray or pandas objects.

A detailed explanation is on https://en.wikipedia.org/wiki/Root-mean-square_deviation

\[\sqrt{\frac{1}{n} \sum_{i=1}^n (\text{forecast}_i - \text{observed}_i)^2}\]

Notes

Dimensional reduction is not supported for pandas and users wishing this extra functionality should convert their data to xarray to formulate the call to scores.continuous.rmse.

Parameters:

fcst (Type[Series]) – Forecast or predicted variables in pandas.
obs (Type[Series]) – Observed variables in pandas.
is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

An object containing a single floating point number representing the root mean squared error for the supplied data. All dimensions will be reduced.

Return type:

pandas.Series

scores.pandas.continuous.mae(fcst, obs, *, is_angular=False)#

Calculates the mean absolute error from forecast and observed data.

A detailed explanation is on https://en.wikipedia.org/wiki/Mean_absolute_error

\[\frac{1}{n} \sum_{i=1}^n | \text{forecast}_i - \text{observed}_i |\]

Notes

Dimensional reduction is not supported for pandas and the user should convert their data to xarray to formulate the call to scores.continuous.mae.

Parameters:

fcst (Type[Series]) – Forecast or predicted variables in pandas.
obs (Type[Series]) – Observed variables in pandas.
is_angular (bool) – specifies whether fcst and obs are angular data (e.g. wind direction). If True, a different function is used to calculate the difference between fcst and obs, which accounts for circularity. Angular fcst and obs data should be in degrees rather than radians.

Returns:

An object containing a single floating point number representing the mean absolute error for the supplied data. All dimensions will be reduced.

Return type:

pandas.Series

API Documentation

Contents

API Documentation#

scores.continuous#

scores.probability#

scores.categorical#

scores.spatial#

scores.stats#

scores.processing#

scores.plotdata#

scores.pandas#