Interactive online version: Binder badge. Download notebook.

Percent Within X#

The “percent within X” score evaluates how often a continuous single-valued forecast falls within a specified absolute error threshold of the observation. This is a simple and easy to explain metric, particularly useful for applications where a forecast is considered “close enough” if it lies within a certain band around the observation.

Unlike traditional error metrics (e.g. MAE or RMSE), the percent within X score does not penalise outliers more harshly- it simply counts whether each forecast is inside or outside the acceptable range.

Two versions are supported:

  • Inclusive: Forecast is within or equal to the threshold (|fcst - obs| τ)

  • Exclusive: Forecast must be strictly within the threshold (|fcst - obs| < τ)

This score is expressed as a percentage (0-100), representing the fraction of valid forecast-observation pairs that satisfy the condition.

Important Caveat#

While the score is intuitive and operationally appealing, it is not a consistent scoring function for common forecast directives such as “forecast the expected (mean) value” . This means that it can encourage hedging. Forecasters may be incentivised to report values that maximise their score rather than reflect their true belief.

For example, when verifying wind speed forecasts with a fixed threshold (e.g. within 5 knots), a forecast of 5 knots will be scored as “correct” for any observed value between 0 and 10 knots. In contrast, a forecast of 0 will only be correct for observations between 0 and 5 knots. This creates an incentive to forecast values that maximise the width of the “acceptable” observation band - particularly in skewed or bounded distributions like wind speed or precipitation. In such cases, the best-scoring forecast may systematically over- or under-predict the truth, depending on how the threshold interacts with physical limits.

This issue arises because:

  • The score only considers whether the forecast falls within a fixed band, not how close it is to the observation.

  • All errors outside the threshold are treated equally, whether they are just over or wildly off.

  • The threshold creates an artificial cutoff that can distort optimisation.

As a result, percent within X is best understood as a user-facing diagnostic, valuable for answering operational questions like “how often do we get within 2 degrees?” However, it should not be used for model training or for rigorous forecast intercomparison.

For applications with a forecast directive expressed as “forecast the expected (mean) or the median value”, consider using a consistent scoring metric such as MSE or MAE respectively.

Mathematical Formulation#

Inclusive form:

\[\text{PERCENT WITHIN X} = 100 \cdot \frac{\sum_{i=1}^{N} \mathbf{1}(|x_i - y_i| \leq \tau)}{\sum_{i=1}^{N} \mathbf{1}_{\text{valid}}}\]

Exclusive form:

\[\text{PERCENT WITHIN X} = 100 \cdot \frac{\sum_{i=1}^{N} \mathbf{1}(|x_i - y_i| < \tau)}{\sum_{i=1}^{N} \mathbf{1}_{\text{valid}}}\]

Where:

  • \(x_i\): forecast value

  • \(y_i\): observed value

  • \(\tau\): error threshold

  • \(\mathbf{1}(\cdot)\): indicator function (1 if the condition is true, 0 if it is false)

  • \(\mathbf{1}_{\text{valid}}\): 1 if both forecast and observation are not NaN

Implementation Notes in scores#

  • The threshold must be specified.

  • Set is_inclusive=True (default) for τ logic; set to False for strict < τ.

  • Supports angular data with is_angular=True.

  • Supports reduce_dims and preserve_dims to aggregate across or retain specific dimensions.

  • Floating point precision can be controlled via decimals.

  • The score is positively oriented: higher values indicate better agreement.

Calculating the percent within X score in scores#

[ ]:
from scores.continuous import percent_within_x, mae

# You can view the documentation for the function by running the following
# help(percent_within_x)

import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt

Synthetic Forecast and Observation Data#

To demonstrate the behavior of the percent within X score, we construct synthetic forecast and observation data with realistic characteristics.

The dataset simulates a gridded variable over:

  • 100 daily timesteps (time)

  • 7 lead days (lead)

  • 3 spatial locations (location)

Observations#

Observations are generated from a normal distribution with:

  • Mean = 2.5

  • Standard deviation = 5

These values are intended to reflect a continuous geophysical variable (e.g., temperature anomaly) with no inherent lower or upper bound.

Later, we will experiment with variables bounded at zero which follow a skewed distribution, similar to precipitation or wind speed.

Forecasts#

Forecasts are designed to:

  • Have the same mean and standard deviation as the observations

  • Be positively correlated with the observations at a level that decreases with increasing leadtime

[2]:
# Dimensions
n_days = 100
n_leads = 7
n_locs = 3

# Coordinate labels
time = pd.date_range(start="2025-01-01", periods=n_days, freq="D")
lead = np.arange(1, n_leads + 1)
location = [f"loc_{i}" for i in range(n_locs)]

rng = np.random.default_rng(seed=42)  # for reproducibility

# Step 1: Create realistic observations
obs_vals = rng.normal(loc=2.5, scale=5, size=(n_days, n_locs))
obs = xr.DataArray(obs_vals, coords=[time, location], dims=["time", "location"])

# Step 2: Broadcast obs across lead to build correlated forecasts
obs_mean = obs.mean().item()
obs_std = obs.std().item()
obs_norm = (obs - obs_mean) / obs_std

# Add synthetic lead-dependent correlation
rho_array = np.linspace(0.95, 0.05, n_leads).reshape((1, n_leads, 1))
obs_expanded = obs_norm.values[:, None, :]

noise = rng.normal(size=obs_expanded.shape)
fcst_norm = rho_array * obs_expanded + np.sqrt(1 - rho_array**2) * noise
fcst_vals = fcst_norm * obs_std + obs_mean

fcst = xr.DataArray(fcst_vals, coords=[time, lead, location], dims=["time", "lead", "location"])

Now we calculate the percentage of forecasts that fall within 2 units (e.g., degrees Celsius) of the corresponding observations. By setting is_inclusive=True, we count forecasts as correct if their absolute error is less than or equal to 2 units.

The results are aggregated over time and location, so the output shows the score as a function of lead time. This is equivalent to specifying preserve_dims=["lead"] in the function call.

[3]:
percent_within = percent_within_x(fcst=fcst, obs=obs, threshold=2.0, is_inclusive=True, reduce_dims=["time","location"])
fig, ax = plt.subplots()
percent_within.plot(ax=ax, marker='o')
ax.set_title("Percent Within ±2 Units Across Lead Times")
ax.set_ylabel("Percent Within (%)")
ax.set_xlabel("Forecast Lead Time")
ax.set_ylim(0, 100)
ax.annotate("Better", xy=(0.05, 0.95), xycoords='axes fraction',
            ha='left', va='top', fontsize=10)

ax.annotate("Worse", xy=(0.05, 0.05), xycoords='axes fraction',
            ha='left', va='bottom', fontsize=10)
plt.show()
../_build/doctrees/nbsphinx/tutorials_Percent_Within_X_5_0.png

Next, consider a variable that cannot fall below zero (e.g., wind speed) and a single-valued forecast interpreted as the median of its predictive distribution. In this setting, the appropriate Consistent Scoring Rule is the Mean Absolute Error (MAE). We also examine the percentage of forecasts that fall within 5 knots of the observation. However, we show that this “percent within X” metric is not a Consistent Scoring Rule, since a forecaster could game it by never predicting values below 5 knots.

[4]:

# Step 1: Simulate observations as synoptic signal and mesoscale noise. The samples may be a combination of locations and days. rng = np.random.default_rng(40) # for n_samples = 2000 synoptic_signal = np.clip(rng.normal(loc=3.0, scale=5, size=n_samples),0, None) # Mesoscale noise: unpredictable but from known distribution mesoscale_noise = rng.normal(loc=0.0, scale=5, size=n_samples) # Observations: the true values obs_vals = np.clip(synoptic_signal + mesoscale_noise, 0, None) # Optimal Forecast: the forecaster knows the synoptic signal perfectly and bases their forecast on that. # The forecaster only knows the characteristics of the mesoscale noise and adjusts the forecast by the median of the # noise distribution (which is 0). fcst_vals = synoptic_signal.copy() # Hedged Forecast: this forecaster attempts to game the verification by clipping the forecast to never go below 5 knots. fcst_hedged_vals = np.clip(fcst_vals, 5, None) # Step 2. Package into DataArray objects. coords = np.arange(n_samples) obs = xr.DataArray(obs_vals, coords=[coords], dims=["samples"]) fcst = xr.DataArray(fcst_vals, coords=[coords], dims=["samples"]) fcst_hedged = xr.DataArray(fcst_hedged_vals, coords=[coords], dims=["samples"])

Now we will visualise the effects of hedging on performance. In the below scatterplots the entries falling between the dashed lines are within 5 knots of the observed. By changing any forecast below 5 knots to equal a forecast of exactly 5 knots, we bring more outcomes into the acceptable range (the diagonal dashed lines). Already, the introduction of an over-forecasting bias of low values is obvious.

[ ]:
# Flatten across time and location
obs_flat = obs.values.flatten()
fcst_flat = fcst.values.flatten()
fcst_hedged_flat = fcst_hedged.values.flatten()

# Create side-by-side plots
fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharey=True, sharex=True)

# Determine range of plots
obs_max = obs_flat.max()
fcst_max = fcst_flat.max()
fcst_hedge_max = fcst_hedged_flat.max()
max_values = [obs_max, fcst_max, fcst_hedge_max]
max_val = max(max_values)

# Plot unhedged forecasts versus observations
axes[0].scatter(fcst_flat, obs_flat, alpha=0.5, edgecolor='k', linewidth=0.2)
axes[0].plot([0, max_val+5], [0, max_val+5], '-', color='gray')
axes[0].plot([5, max_val+10], [0, max_val+5], '--', color='gray')
axes[0].plot([-5, max_val], [0, max_val+5], '--', color='gray')
axes[0].set_xlim(0, max_val+5)
axes[0].set_ylim(0, max_val+5)
axes[0].set_title("Forecast (Original) vs Observed")
axes[0].set_xlabel("Forecast (knots)")
axes[0].set_ylabel("Observed (knots)")

# Plot hedged forecasts versus observations
axes[1].scatter(fcst_hedged_flat, obs_flat, alpha=0.5, edgecolor='k', linewidth=0.2)
axes[1].plot([0, max_val+5], [0, max_val+5], '-', color='gray')
axes[1].plot([5, max_val+10], [0, max_val+5], '--', color='gray')
axes[1].plot([-5, max_val+5], [0, max_val+10], '--', color='gray')
axes[1].set_xlim(0, max_val+5)
axes[1].set_ylim(0, max_val+5)
axes[1].set_title("Forecast (Hedged ≥5) vs Observed")
axes[1].set_xlabel("Forecast (knots)")

plt.tight_layout()
plt.show()
../_build/doctrees/nbsphinx/tutorials_Percent_Within_X_9_0.png

Next we calculate the performance and the improvements associated with hedging become clear (an increase in the percent of forecasts within 5 knots). However, this comes at the expense of increased Mean Absolute Error (MAE). Remember that the MAE is the consistent scoring function when the single-valued forecast is the median value of the predicted distribution.

[ ]:
# Calculate percent within X
percent_within_x_orig = percent_within_x(fcst=fcst, obs=obs, threshold=5.0, is_inclusive=True)
percent_within_x_hedged = percent_within_x(fcst=fcst_hedged, obs=obs, threshold=5.0, is_inclusive=True)

# Calculate mean absolute error
mae_orig = mae(fcst=fcst, obs=obs)
mae_hedged = mae(fcst=fcst_hedged, obs=obs)

# Extract scalar values
pwx_orig = float(percent_within_x_orig.values)
pwx_hedged = float(percent_within_x_hedged.values)
mae_orig_val = float(mae_orig.values)
mae_hedged_val = float(mae_hedged.values)

# Create summary table
summary_df = pd.DataFrame({
    "percent within ±5": [pwx_orig, pwx_hedged],
    "MAE (knots)": [mae_orig_val, mae_hedged_val],
}, index=["Original Forecast", "Hedged Forecast (≥5)"])

# Round for clarity
summary_df = summary_df.round(3)

print(summary_df)
                      Percent Within ±5  MAE (knots)
Original Forecast                 81.20        2.847
Hedged Forecast (≥5)              88.35        3.656

Things to try next#

Explore the is_angular option within percent_within_x, relevant if forecasting direction in degrees.

Explore the decimals option which is meant to help control floating point precision issues. For example, a value that should be 5 might be imported as 4.999999998, causing it to incorrectly fall outside a threshold - rounding the error before comparison avoids this.