A spread-versus-error framework to reliably quantify the potential for subseasonal windows of forecast opportunity

Rupp, Philip; Spaeth, Jonas; Birner, Thomas

doi:10.5194/wcd-7-767-2026

Articles | Volume 7, issue 2

https://doi.org/10.5194/wcd-7-767-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

https://doi.org/10.5194/wcd-7-767-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Articles | Volume 7, issue 2

Research article

|

12 May 2026

Research article |

| 12 May 2026

A spread-versus-error framework to reliably quantify the potential for subseasonal windows of forecast opportunity

Philip Rupp, Jonas Spaeth, and Thomas Birner

Download

Final revised paper (published on 12 May 2026)
Supplement to the final revised paper
Preprint (discussion started on 23 Oct 2025)
Supplement to the preprint

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-4925', Anonymous Referee #1, 28 Nov 2025
The manuscript “A spread-versus-error framework to reliably quantify the potential for subseasonal windows of forecast opportunity” by Rupp et al. explores the relationship between the ensemble spread and forecast error in sub-seasonal ensemble forecasts (days 14-46) by ECMWF system and in a statistical toy model. The authors propose an approach, based on spread-error relationship, to identify regions where variations in ensemble spread correlate with variations in forecast error and demonstrate, using a simple statistical model, that spread-error relationship can be deteriorated by insufficient sampling, lack of physical processes that modulate predictability, and model deficiencies.
The paper provides several interesting ideas, in particular exploring the connection between intra-forecast and inter-forecast variability of the spread, and illustrating several critical issues of sub-seasonal forecasting (such of under-sampling) using the toy model. I have no doubt that the paper should be published in WCD. However, I ask the authors to clarify several critical points before publication.

Major points.
I find that the term “the potential for windows of forecast opportunity” is obscure. I suspect that what the authors mean is “the potential to make skillful forecasts”. Instead, the current message is “the potential for opportunity to make skillful forecasts”. If this is really what the authors want to say, then I wonder what it means in practice.

The authors focus on one property of the forecast – reliability. However, I am used to think of skillful forecasts in terms of accuracy. The forecasts may lack accuracy because of low predictability even if the forecasting system is reliable. Consequently, I am used to think of windows of forecast opportunity as of periods with enhanced skill, and accuracy sufficient for decision making. I feel that your analysis, as illustrated in Figure 3, only highlights areas where physical processes modulate predictability, however it leaves open the question of whether the predictability in these regions is ever sufficient for making skillful forecasts. Therefore, I do not agree with the following statement at L468-469: “Our spread-error framework shows that, over large areas of the Northern Hemisphere, those windows are opened by slowly varying teleconnections”. I feel it is difficult to discuss forecasting opportunity without analyzing accuracy (for example, anomaly correlation coefficients) and therefore I ask the authors to be more careful about their definitions and be more critical about implication of their findings.

I am not sure why spread-error scatterplots should be made using daily values. The authors show in Figure 7 that intra-variations are spurious; thus, a lot of spread in Figure 2b is just noise. Why not define SRS using time-averaged (e.g. weekly mean) statistics?

The authors make important point about time averaging (L13-14); however, this point is only illustrated by a supplementary figure (Figure S3). If the point is important enough to be elevated to the abstract, then the figure should be a part of the main manuscript.

Specific points:
L61-64: Are these assertions supported by research, or is it your hypothesis? If this is the former, a reference is needed. If this is your hypotheses, please be clear about it.
L113: Provide full reference for Leutbecher et al.
L114-115: “A comparison between the IFS model and the CNRM model further shows qualitatively robust patterns (discussed in Section 6).” Robust patterns of what? Also, more information about the used CNRM data is needed.
L115-116: It is quite difficult to comprehend what exactly “forecast spread reliability is influenced by the potential for windows of opportunity” means. I am not sure which definition of “reliability” the authors are using. A reliable ensemble forecast system (or any other forecast system that provides probabilistic forecasts) is one whose predicted probabilities correspond to the observed frequencies; this is what a reliability diagram illustrates. It would help if the authors provided the definition of reliability they are using. In addition, what is the difference between “windows of opportunity” and “potential for windows of opportunity”? “Opportunity” and “potential” sound synonymous to me.
L125-127: “However, if the ensemble size is small, sampling errors will be relatively large. In such a case, some forecast/time step with, e.g., low spread, could be also associated with comparably large error, as the spread is simply underestimated due to sampling error.” You assume that spread is not a good predictor for accuracy, but has this been studied? Also, how to define whether the ensemble size is small or not? The size you are using (50 members at least) does not sound small to me.
Figure 2: Have you tried plotting only the “inter” component of your variance separation, rather than showing daily spread and error, which are mostly noise?
Figure 2 captions: “Red dashed line” not “Orange dashed line”
L151: How do you define “anomaly”? Figure 2 shows only positive values. For anomalies I would expect both positive (above climatology) and negative (below climatology) values.
L175: Do you assume that ensemble mean is well represented in the toy model, or do you also assume it is well represented in operational forecasts? Is this assumption justified?
L242: Does your assumption hold? I understand that, as you under-sample the forecast distribution, the variability of the spread will in general increase. However, I believe that the variability of ensemble mean would also increase, leading to increased error. Why this would not be the case?
L251: If the error is overestimated then how this can lead to a lower error?
L235-255: I cannot understand your explanations for decreased SRS in experiment (b), and I am not sure that you can explain it without analysing variability of ensemble mean.
L262-270: Do you mean that a larger ensemble size than 100 members would be required to capture the spread-error relationship in the case shown in panel “c”? Have you tested this with your toy model?
L271: “intrincic” -> ” intrinsic”
L289-290: Can you be more specific about which effects are unsystematic? I understand that insufficient number of cases leads to unsystematic effects, but can for example small sample size lead to unsystematic effects, or does it always lead to decreased SRS?
L324-329: Can you provide equations for the inter- and intra- variability?
L341: I do not know what the journal’s policy is, but I would prefer to see the definition of the theoretical sampling error estimate in the text rather than in figure captions.
L351-352: I presume you refer to Figure 4d? It would be nice to explicitly refer to this figure in the text, for clarity.
L388-389: It took me a while to figure out that you are using different colour scale for Figs. 9b and 9d. I suggest using the same scale because you are making the point about smallness of the anomalies in Fig.9d, which cannot be seen with the present scales.
Citation: https://doi.org/10.5194/egusphere-2025-4925-RC1
- AC1: 'Reply on RC1', Philip Rupp, 25 Mar 2026
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4925/egusphere-2025-4925-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-4925-AC1
RC2:
'Comment on egusphere-2025-4925', Tim Woollings, 09 Jan 2026

This is a nice paper which investigates the potential of ensemble spread to provide useful indication of likely forecast error on S2S timescales. The methods are novel and varied, the application sound and the results should prove useful to the forecasting community. I recommend publication after considering the comments below.

1. Fig 1 shows dominant regions of mean spread over the North Pacific and Atlantic. The shaded regions of large variability in spread lie on the flanks of these, so can the variability in spread be interpreted as (predominantly) north-south shifts of the usual regions of high spread aligning with the jets / storm tracks?

2. All days with lead times 14-46 days are combined in lots of the analysis here. Two thoughts on this: 1) Given the high day-to-day autocorrelation, the number of independent samples will be much less than it seems. Does this need to be taken into account anywhere? Perhaps the binning limits the impact of this. 2) The examples shown (eg fig 2a) show that the ensemble spread has saturated by day 14, which is good. This seems to be a necessary condition, as otherwise there might be a trivial link between spread and error as both are related to lead time. Can the authors confirm that saturation by day 14 is seen everywhere, not just at the couple of points shown?

3. The raw data fig 2b suggests low error values for the largest spread values (>25000), which goes against the overall relationship. Is this common or just a feature of this location?

4. Fig 3 plots the slope for every NH point, with hatching marking points where the slope is not significantly different from zero. Does this mean that at all the non-hatched points the correlation between spread and error is significant (accounting for autocorrelation)?

5. I’m not sure I understand the pink shading in fig 4 - this doesn’t seem to agree with the binned data shown by the black dots. In particular some of the dots with large values of spread and error seem to have very large spread values compared to the shading - is that right?

6. The SRS maps in fig 6 are interesting. Eg it looks like the model spread is considerably more reliable for the northern centre of the NAO than the southern centre. Is this consistent with any other literature?

7. The relation to inter-over-intra variability in fig 8 is interesting. Can this be taken further back, eg to basic variances of the real atmosphere such as shown for differnent frequency bands in Blackmon et al (1984), and others?

8. Section 5 could be rounded off with a summary number - eg what fraction of spatial variance in SRS is explained by the perfect model test?

9. The whole paper is framed around ‘windows of opportunity’, ie the low-spread end of the spectrum. Is there any interest in the high-spread end of the spectrum (walls of adversity perhaps…?)? The method uses a linear fit across the whole range of spread - do the results reflect the high-spread end of the relationship as much as the low-spread end?

10. There are several new results given in the Conclusions & Discussion section, which are important enough to make the abstract. Consider moving these into the main paper.

Minor:

- SRS of 0.6 is given as a summary figure in the abstract which is a nice idea but might be hard to interpret without knowing more about what SRS is.

- line 28: I would say that the whole ensemble is the ‘actual prediction’, not just the ensemble mean.

- line 43: ‘areas occasionally associated with anomalously low spread’ are highlighted here, but could it also be occasionally high spread?

- line 132: ‘potential’ windows of opportunity?

- line 397-8: ref to support this statement.

- line 451: consider linking to https://doi.org/10.48550/arXiv.2411.17694 on signal-noise issues in subseasonal forecasts.

Typos:

- line 80: Ref style
- line 108: forecasts
- line 220: considerably
- line 271: intrinsic
- fig 6 caption: black line rather than grey?
- line 335: not essentially
- fig 10 caption: check line colours

Citation: https://doi.org/10.5194/egusphere-2025-4925-RC2
- AC2: 'Reply on RC2', Philip Rupp, 25 Mar 2026
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4925/egusphere-2025-4925-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-4925-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Philip Rupp on behalf of the Authors (26 Mar 2026) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (09 Apr 2026) by Tim Woollings

RR by Tim Woollings (20 Apr 2026)

RR by Anonymous Referee #1 (21 Apr 2026)

For final publication, the manuscript should be

accepted as is

accepted subject to technical corrections

accepted subject to minor revisions

reconsidered after major revisions

rejected

Were a revised manuscript to be sent for another round of reviews:

I would be willing to review the revised manuscript.

I would not be willing to review the revised manuscript.

Suggestions for revision or reasons for rejection

I would like to thank the authors for carefully addressing my comments on the previous version of the manuscript. I find the revised manuscript substantially improved and much clearer. Some further minor points are listed below; otherwise I am happy to recommend the manuscript for publication.

Minor comments:
L193: The text still suggests that Figure 2 shows anomalies while the full values are shown.

L196-198: It might be useful to add reference to a standard statistical textbook, e.g.: Wilks, D.S. (2006) Statistical methods in the atmospheric sciences: an introduction, 2nd edition. Oxford, UK: Academic Press.

L234: Suggest adding “…the RELATIVELY high reliability of spread in these regions…”

L297: Remove double “to”

L600-601: If by “both studies” you mean Scaife and Smith (2018) and Roberts and Vitart (2024) then I suggest spelling them explicitly (“both Scaife and Smith (2018) and Roberts and Vitart (2024)”) because the former appears only in the previous paragraph. Also, Scaife and Smith did not analyze the subseasonal regime, therefore I suggest correcting to: “…subseasonal and seasonal regime.”

L483-484: “Reanalysis observations” sounds strange. Perhaps once clarified that you use reanalysis as a substitute for observations you could use just “reanalysis”? Also, you could use “verification” wherever appropriate.

L492: Again “reanalysis observations”, see my previous comment.

L661: The reference to Karpechko et al. (2025) is this one: Karpechko, A. Yu., Butler, A. H., and Vitart, F.: Signal, noise and skill in sub-seasonal forecasts: the role of teleconnections, Weather Clim. Dynam., 6, 1661–1681, https://doi.org/10.5194/wcd-6-1661-2025, 2025.

Hide

ED: Publish subject to minor revisions (review by editor) (21 Apr 2026) by Tim Woollings

AR by Philip Rupp on behalf of the Authors (26 Apr 2026) Author's response Author's tracked changes Manuscript

ED: Publish as is (29 Apr 2026) by Tim Woollings

AR by Philip Rupp on behalf of the Authors (03 May 2026) Manuscript

Short summary

Weather forecasts several weeks ahead are uncertain, but this uncertainty itself can change depending on large-scale atmospheric conditions. We present a new way to measure how well forecasts capture these changes in uncertainty. Our results show that reliability of uncertainty varies strongly with region and is linked to slow, predictable patterns in the atmosphere. These findings help identify periods when forecasts are more trustworthy.