# Key Features of scores

## Data Handling
- Works with labelled, n-dimensional data (e.g., geospatial, vertical and temporal dimensions) for both point-based and gridded data.
- `scores` can effectively handle the dimensionality, data size and data structures commonly used for:
  - gridded Earth system data (e.g., numerical weather prediction models)
  - tabular, point, latitude/longitude or site-based data (e.g., forecasts for specific locations).
- Handles missing data, masking of data and weighting of results (e.g., by area, by latitude, by population).
- Supports xarray datatypes, and works with NetCDF4, HDF5, Zarr and GRIB data formats, among others.
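The idea behind weighting results by area or latitude can be sketched with a small, self-contained numpy example, independent of the `scores` API itself. The grid, error values and latitudes below are made up for illustration; the point is that grid-cell area shrinks with the cosine of latitude, so high-latitude cells should contribute less to an aggregate score.

```python
import numpy as np

# Hypothetical 3x4 grid of absolute forecast errors at three latitude bands.
lats = np.array([0.0, 30.0, 60.0])          # degrees north (illustrative)
abs_errors = np.array([
    [1.0, 2.0, 1.0, 2.0],
    [2.0, 2.0, 2.0, 2.0],
    [4.0, 4.0, 4.0, 4.0],
])

# Area weighting: one cos(latitude) weight per latitude row.
weights = np.cos(np.deg2rad(lats))

# Weighted MAE: average the per-row mean errors, weighted by row area.
weighted_mae = np.average(abs_errors.mean(axis=1), weights=weights)
```

Here the unweighted mean error is 2.5, while the weighted result is lower, because the largest errors sit at 60° where cells cover less area.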
## Usability
- A companion Jupyter Notebook tutorial for each metric and statistical test, demonstrating its use in practice.
- Over 60 metrics, statistical techniques and data processing tools, including:
  - commonly used metrics (e.g., mean absolute error (MAE), root mean squared error (RMSE))
  - novel scores not commonly found elsewhere (e.g., the FIxed Risk Multicategorical (FIRM) score (Taggart et al., 2022), the Flip-Flop Index (Griffiths et al., 2019, 2021))
  - recently developed, user-focused scores and diagnostics, including:
    - threshold-weighted scores, such as the threshold-weighted continuous ranked probability score (twCRPS) (Gneiting and Ranjan, 2011) and the threshold-weighted mean squared error (MSE) (Taggart, 2022)
    - Murphy diagrams (Ehm et al., 2016)
    - isotonic regression for reliability diagrams (Dimitriadis et al., 2021)
  - statistical tests (e.g., the Diebold-Mariano test (Diebold & Mariano, 1995), with both the Harvey et al. (1997) and Hering & Genton (2011) modifications).
- All scores and statistical techniques have undergone a thorough scientific and software review.
- A dedicated area for emerging scores that are still undergoing research and development. This provides a clear mechanism for sharing, accessing and collaborating on new scores, and for easily re-using versioned implementations of them.
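As an informal illustration of the arithmetic behind the commonly used metrics listed above (`scores` itself computes these on labelled xarray data structures), here is a minimal numpy sketch with made-up forecast and observation values:

```python
import numpy as np

# Illustrative forecasts and matching observations.
fcst = np.array([2.0, 4.0, 6.0, 8.0])
obs = np.array([1.0, 5.0, 6.0, 10.0])

# Mean absolute error: the average magnitude of the forecast errors.
mae = np.mean(np.abs(fcst - obs))

# Root mean squared error: penalises large errors more heavily than MAE.
rmse = np.sqrt(np.mean((fcst - obs) ** 2))
```

For these values MAE is 1.0, while RMSE is about 1.22: the single 2-unit error pushes RMSE above MAE, which is exactly the behaviour that distinguishes the two metrics.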
## Compatibility
- Highly modular: provides its own implementations, avoids extensive dependencies and offers a consistent API.
- Easy to integrate and use in a wide variety of environments. It has been used on workstations, servers and in high performance computing (supercomputing) environments.
- Maintains 100% automated test coverage.
- Uses Dask for scaling and performance.
- Some metrics work with pandas, and we aim to expand this capability.
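The chunk-wise reduction pattern that underlies Dask-style scaling can be sketched without Dask itself: each chunk contributes a partial sum and count, and the partials are combined at the end, so no single step needs the whole array in memory. The arrays and chunk size below are purely illustrative.

```python
import numpy as np

# Illustrative data: forecasts 0..9 against a constant observation.
fcst = np.arange(10.0)
obs = np.full(10, 4.5)

# Accumulate partial sums of squared error over 4-element chunks.
sq_sum, count = 0.0, 0
for start in range(0, len(fcst), 4):
    f, o = fcst[start:start + 4], obs[start:start + 4]
    sq_sum += float(np.sum((f - o) ** 2))
    count += len(f)

# Combining the partials reproduces the full-array MSE exactly.
mse = sq_sum / count
```

Because sums and counts combine associatively, the chunked result matches `np.mean((fcst - obs) ** 2)` computed in one pass, which is why metrics of this form parallelise well under Dask.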