Impacts of tropical forecast errors on two extreme precipitation events: insights from relaxation experiments using machine-learning weather prediction models

Li, Siyu; Dias, Juliana; Moore, Benjamin; Quinting, Julian

doi:10.5194/wcd-7-787-2026

Articles | Volume 7, issue 2

https://doi.org/10.5194/wcd-7-787-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 19 May 2026

Final revised paper (published on 19 May 2026)
Preprint (discussion started on 14 Jan 2026)

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2026-35', Yannick Peings, 17 Feb 2026

The paper examines the week 3-4 prediction skill of two machine learning weather prediction (MLWP) models for two climate events that brought significant precipitation over California in 2022/23 (December-January 2022/23 and February-March 2023). The two MLWP models, NeuralGCM and Pangu-Weather, are compared to a traditional S2S General Circulation Model (GCM), UFS. The authors use a relaxation technique (nudging) to impose observed atmospheric variability in the tropics in a set of ensemble reforecasts, which they compare to the original reforecasts of the two climate events. They find that imposing accurate tropical variability largely improves the prediction skill of the North Pacific atmospheric circulation and associated moisture flux at week 3-4 lead time, especially for the December case study. This is true for both MLWP models and UFS, with comparable physical mechanisms that lead to the improvement (Rossby wave sources in the subtropics). This demonstrates that improved S2S prediction in the tropics would induce a higher prediction skill of such precipitation events in the mid-latitudes, and also that the new generation of MLWP models exhibits skill and mechanisms comparable to those of the traditional physics-based forecast models when such tropical relaxation techniques are used (at much lower computational cost). The prediction skill of the two MLWP models is in fact slightly higher than that of UFS for the two case studies, without and with tropical relaxation, but as noted by the authors, a more robust comparison of prediction skill would require a more systematic evaluation over a greater number of cases.
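
As a rough illustration of the relaxation (nudging) idea summarised above, the sketch below blends a model field toward a reference field using a latitude-dependent weight. The function names, the band edges (15° and 25°) and the cosine taper are hypothetical choices made for illustration only; they are not taken from the manuscript.

```python
import numpy as np

def taper_weights(lat, full_lat=15.0, zero_lat=25.0):
    """Relaxation weight per latitude (assumed shape, not the authors').

    Weight is 1 for |lat| <= full_lat, 0 for |lat| >= zero_lat,
    and follows a half-cosine taper in between.
    """
    abs_lat = np.abs(np.asarray(lat, dtype=float))
    frac = np.clip((abs_lat - full_lat) / (zero_lat - full_lat), 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * frac))

def relax(model, reference, lat, strength=1.0):
    """Blend a (lat, lon) model field toward a reference field.

    strength=1.0 fully replaces the model state where the weight is 1;
    smaller values only partially correct it at each step.
    """
    w = strength * taper_weights(lat)[:, None]  # broadcast over longitude
    return model + w * (reference - model)
```

With `strength=1.0` the field inside the inner band is replaced by the reference at every step, while a smaller value corresponds to a gentler correction.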

The paper is a nice contribution to the field of S2S prediction; it is clear and well written. However, there is room for improvement, and I have some comments and suggestions listed below.

1) l. 28, when discussing the potential for S2S prediction using MLWP models, some references are missing to reflect what has been done already. For instance, the two following papers are relevant references to include as they discuss and demonstrate the advance of S2S forecast skill using these models.
Weyn, J. A., Durran, D. R., Caruana, R., & Cresswell-Clay, N. (2021). Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models. Journal of Advances in Modeling Earth Systems, 13(7), e2021MS002502. https://doi.org/10.1029/2021ms002502
Chen, L., Zhong, X., Li, H., Wu, J., Lu, B., Chen, D., et al. (2024). A machine learning model that outperforms conventional global subseasonal forecast models. Nature Communications, 15(1), 6425. https://doi.org/10.1038/s41467-024-50714-1

2) l. 85 : “Sea surface temperature are prescribed from ERA5.” Can you detail here? Do you maintain SST anomalies from initialization (persistent SST anomalies)?

3) l. 91: “This leads to its significantly lower computational resource requirements compared to the other two models of this study.” Could you give a rough estimate of each MLWP model’s computational cost here, relative to UFS?

4) Section 2.3: it sounds like the daily anomalies for the models are calculated from the ERA5 daily climatology. Ideally the model anomalies should be calculated using the model daily climatology, but this requires a set of hindcasts over a sufficiently long period. I do not think that using the model climatology would significantly change the results, but this should be mentioned for transparency.

5) l. 125, it is unclear what the “model replay” experiment is used for in the study.

6) Section 3.1: the December case study has also been highlighted in our recent paper (Peings et al. 2026) as a window of opportunity for S2S forecasting. The three models used in our study (two MLWP models and the ECMWF S2S model) exhibit good prediction skill for this period at week 2 as shown in the paper, but we also found good skill for week 3 and more generally for the week 2-4 window. We also performed a sensitivity study with one of the MLWP models to demonstrate that the skill was coming from the tropics. I think this paper is worth citing because it aligns with the results presented here.
Peings, Y., Dong, C., Mahesh, A., Pritchard, M., Collins, W., & Magnusdottir, G. (2026). Subseasonal forecasting and MJO teleconnections in machine learning weather prediction models. Journal of Geophysical Research: Atmospheres, 131, e2025JD044910. https://doi.org/10.1029/2025JD044910

7) The section about the physical mechanism leading to more skillful predictions for the two case studies would benefit from being developed. The RWS anomalies of Fig. 3 and Fig. 5 are noisy and not very explicit. I think it would be interesting to see how they bridge the tropics with the extratropics, i.e., showing the associated Rossby wave, maybe at different lead times (week 1, 2 and 3) to show its development. You could also show how the deep convection anomalies in the tropics differ in CRL versus NTR as a function of time, maybe using a Hovmöller plot (time versus longitude), which would reveal how MJO propagation changes with nudging and makes for a more accurate teleconnection. The paper only includes 6 figures so there is room for a couple of figures further detailing the tropics-extratropics teleconnection leading to improved skill in the North Pacific/North America sector (especially for the December case).

8) In the conclusion, when stating that “However, drawing more definitive conclusions will require a systematic evaluation over multiple years and similar events to assess the generalization of these results”, it should be mentioned that a systematic evaluation of the S2S forecast skill for the North Pacific/Western North America region has been done for NeuralGCM (Peings et al. 2026). The study shows that two MLWP models (SFNO-HENS and NeuralGCM) exhibit comparable S2S skill to ECMWF for the case of the MJO and North Pacific atmospheric patterns during the October-March season.

9) l. 296: “This suggests that a better representation of the tropical atmospheric state in the models would have improved the prediction of this particular event”. The conclusion would benefit from a discussion of how the MLWP models have the potential to improve prediction skill in the tropics, and consequently in the mid-latitudes (if they do).
Do the authors anticipate that S2S forecast skill will improve with future developments in both traditional dynamical models and machine-learning weather prediction (MLWP) systems? Or does the current similarity in S2S skill between MLWP models and GCMs indicate an intrinsic predictability limit of the climate system that may be difficult to surpass?
Nudging simulations such as those presented in the paper are valuable for investigating mechanisms and tracing potential sources of predictability for specific events. However, do we realistically expect S2S forecasts in the tropics to become sufficiently accurate to substantially improve prediction skill in the mid-latitudes? I know that is the million-dollar question, but it would be worthwhile to address it in the conclusion to place the results in a broader predictability context.
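
The Hovmöller diagnostic suggested in comment 7 can be sketched as follows. This is a minimal illustration that assumes a `(time, lat, lon)` anomaly array and a 15°S to 15°N averaging band; the function name, the band limits and the cosine-latitude weighting are assumptions, not details from the manuscript.

```python
import numpy as np

def hovmoller(field, lat, lat_min=-15.0, lat_max=15.0):
    """Meridional mean of a (time, lat, lon) field over a latitude band.

    Returns a (time, lon) array suitable for a time-longitude plot.
    Latitudes are weighted by cos(lat) to account for grid-cell area.
    """
    lat = np.asarray(lat, dtype=float)
    band = (lat >= lat_min) & (lat <= lat_max)
    weights = np.cos(np.deg2rad(lat[band]))
    return np.average(field[:, band, :], axis=1, weights=weights)
```

Applying the same function to the CRL and NTR anomalies and differencing the two `(time, lon)` arrays would show where and when the nudged tropical signal departs from the free-running forecast.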

Citation: https://doi.org/10.5194/egusphere-2026-35-RC1
- AC1: 'Reply on RC1', Siyu Li, 02 Mar 2026
  
  Thank you for your time and effort in reviewing our work.
  I have attached a PDF file that responds to your comments in detail. Most of the suggestions have been adopted in the revised manuscript, which will be uploaded later.
  Best regards
  Siyu Li
  
  Citation: https://doi.org/10.5194/egusphere-2026-35-AC1
RC2: 'Comment on egusphere-2026-35', Anonymous Referee #2, 26 Feb 2026

The study discusses how constraining the atmospheric state in the tropics improves sub-seasonal forecasts of two extreme precipitation events in western North America in the winter of 2022/2023 in two machine-learning weather prediction (MLWP) models. It complements an earlier study by some of the authors, where an equivalent relaxation experiment was performed in a physics-based weather prediction model. It is found that the response to the tropical constraint in the MLWP models is similar to that in the physics-based model, although somewhat weaker owing to a stronger baseline performance. An analysis of Rossby wave sources indicates that the MLWP models simulate the tropical-extratropical teleconnections contributing to the extreme events in a physically consistent way. The authors emphasise the general point that such relaxation experiments are a useful diagnostic tool to understand and improve sub-seasonal predictions.
The paper is a useful contribution to the field of MLWP model evaluation and merits publication. Since it discusses two very specific case studies, it lacks the generality that the title suggests, but on the other hand there is value in a detailed assessment of how MLWP forecasts represent tropical-extratropical teleconnections for specific mid-latitude extreme precipitation events. The presentation is mostly clear and concise, but I would recommend some clarifications and revisions, as well as adding some further analysis and discussion; see the comments below.
- Would it be worth investigating the reasons for a different importance of tropical forcing between the two cases a bit more, e.g. by looking into origin and propagation of forecast errors over time? The RWS diagnostic only works for tropical sources, but it would be good to quantify mid- and high-latitude contributions.

- The title is too general for what is being presented. I suggest starting from the title of the reference study by Moore et al. ("Impacts of tropical forecast errors on weeks 3–4 extreme precipitation predictions over California during winter 2022–23") and modifying it to reflect the new aspect of relaxing MLWP models.

- l. 7: the fact that only tropical relaxation is considered should be mentioned earlier than this, potentially in the first sentence of the abstract.

- l. 73: "6-hour forecast increment" - I suspect you are referring to the output available, not to the model time step (which would be hard to believe). Please clarify.

- l. 85: The fact that NeuralGCM uses "perfect" SST prescribed from ERA5 strikes me as important. It means that the NeuralGCM setup could not issue real-time forecasts, and it should have an unfair advantage over the other models considered. What is your view, maybe you can add some discussion or analysis on this?

- ll. 89-90 ("..., Pangu-Weather is ..."): Please specify which Pangu model you are actually using for your inference relaxation study. Is it the one with a 24h time step?

- ll. 90ff. ("Overall the model is..."): I don't understand this, please rephrase. With "auto-regressive during training" I assume you are referring to rolling out for more than one model time step during training and minimizing the loss computed from the rolled-out errors. This is indeed more costly during training, but has no impact on inference cost. The real reason that Pangu is the cheapest among the models you are considering is probably that it does not need to run a GCM dynamical core (expensive, both UFS and NeuralGCM have it).

- l. 96: Why did you choose to compute the climatology over this long period? Can you please check whether substantial trends are present for the variables you are considering? If this is the case, there is the risk that anomaly correlations presented are inappropriately dominated by these trends.

- Table 1: The "Replay to ERA5" experiment is never used in the manuscript. Why? Please either remove the reference to this experiment, or use it when discussing the results.

- ll. 115ff.: Did you test the sensitivity to the width of the tapering region, or to the functional shape of the transition? If yes, a comment on that would be helpful.

- ll. 119f. ("Relaxed region is corrected by 100% at each time step"): You say on line 108 that the relaxation "gently steers the model state", which is inconsistent with a 100% replacement of the model forecast by the reference. It may be worth clarifying this on line 108 and elsewhere.

- ll. 127 - 131: Can you please explain the motivation or justification for relaxing different variables for different models? One might argue that this makes the experiments less comparable.

- ll. 148-151: Please elaborate on how you compute and interpret the anomaly correlation; it is left a bit vague. As I understand it, this is the pattern correlation between the verification anomaly in Fig. 1 and the forecast anomalies in Figs. 2 and 4. Is this correct?

- l. 172: I would not call this a forecast bust. Larger errors are expected for any extreme event occurring in the observations when forecasts have modest levels of skill and tend to predict climatology.

- Figure 1: The green lines are really hard to see - is it worth making separate panels?

- l. 181: please define how the water vapour flux is computed (I assume you are showing the magnitude of the vector quantity). Also please add some discussion on why you do not use precipitation directly and whether this constitutes a caveat of the study. It might be worth citing Lavers et al., Weather and Forecasting (2017), https://doi.org/10.1175/WAF-D-17-0073.1 in this context.

- Figure 2 caption: Looks to me like the bold green line is at 40 not at 20.

- Figure 3: this is extremely hard to see, even on a very large screen. Please revise to have fewer panels or less details in each.

- l. 235 ("This similarity suggests..."): OK but what does this mean? That only sources in the deep tropics matter?

- l. 257: Could this also be because NeuralGCM sees the observed SST (see one of my earlier comments)?

- l. 269: This is a trivial result: any relaxation of ensembles towards a common reference state will reduce the ensemble spread.

- Figure 6: Seeing Z500 anomaly correlations of close to 1 for almost every single ensemble member at 3-4 weeks lead time makes me wonder whether the ACC you compute is a discerning enough metric. Can you please show the same plot for MAE? A reader could conclude from the extremely high ACC for most ensemble members that there is near-perfect deterministic forecast skill in the sub-seasonal range for these events, which I would be sceptical about even with strong impact from tropical sources of predictability.

- l. 276: Can you discuss a bit more why there is less impact for the February event? Given the flow pattern, a stronger impact of mid- or high-latitude dynamics is plausible (see also the Moore et al. study).

- l. 277 (we did not investigate precip directly): I think you need to be upfront about this caveat and discuss it in the methods section.

- l. 304 ("reduced tropical influence on the event"): As mentioned before, it would be good to have some further analysis on this.
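
To make the metrics discussed in the comments on ll. 148-151 and Figure 6 concrete, here is a minimal sketch of an area-weighted pattern anomaly correlation and a mean absolute error on a `(lat, lon)` grid. The function names, the centred form of the correlation and the cosine-latitude weighting are assumptions for illustration; the manuscript's exact definition may differ.

```python
import numpy as np

def _area_weights(lat, nlon):
    """cos(lat) weights broadcast to a (lat, lon) grid."""
    w = np.cos(np.deg2rad(np.asarray(lat, dtype=float)))
    return np.repeat(w[:, None], nlon, axis=1)

def pattern_acc(fcst_anom, obs_anom, lat):
    """Centred, area-weighted pattern correlation of two anomaly maps."""
    w = _area_weights(lat, fcst_anom.shape[1])
    f = fcst_anom - np.average(fcst_anom, weights=w)
    o = obs_anom - np.average(obs_anom, weights=w)
    cov = np.sum(w * f * o)
    return cov / np.sqrt(np.sum(w * f**2) * np.sum(w * o**2))

def weighted_mae(fcst, obs, lat):
    """Area-weighted mean absolute error between two maps."""
    w = _area_weights(lat, fcst.shape[1])
    return np.average(np.abs(fcst - obs), weights=w)
```

Because the correlation is insensitive to amplitude, a forecast anomaly with the right pattern but twice the observed magnitude still scores 1, whereas the MAE is nonzero. This is one reason the two metrics can tell different stories.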

Citation: https://doi.org/10.5194/egusphere-2026-35-RC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Siyu Li on behalf of the Authors (25 Mar 2026)

ED: Referee Nomination & Report Request started (03 Apr 2026) by Daniela Domeisen

RR by Yannick Peings (10 Apr 2026)

Suggestions for revision or reasons for rejection

Thanks to the authors for addressing my comments. The paper is in great shape, I only have a few minor comments.

1) Panels g-i in Fig. 4 are not particularly useful because the most important point here is that the reforecasts more closely match ERA5. Rather than showing NTR-CRL, I suggest you show where the error is reduced. You have the original error, CRL-ERA5, and you now have a new error, NTR-ERA5; these panels could show the difference in errors (NTR-ERA5 minus CRL-ERA5), marking where the error is reduced (say in green or blue) and where it is increased (in red?). This would directly show the reader where nudging has led to improvement in the VP200 anomalies. This is only a suggestion for consideration by the authors, since the current figure is fine but could be more informative in my opinion.

2) Caption of Fig. 4 says the interval for VP200 from ERA5 is 6.10-6 m2/s-1, but the contour labels suggest otherwise. Unless the contour labels are for the shading? Please make sure the contour intervals are correct here.

3) Panels g-i in Fig. 7, same comment as for Fig. 4.

4) Section 2.2.2: in an Annex, I recommend that you show the equivalent of Fig. 2c,f,i, but for the experiments with fixed SST/SIC from initial conditions, to demonstrate that prescribing observed SST/SIC does not impact forecast skill significantly.

5) Section 2.2.3: say a word about the treatment of SST and SIC in Pangu-Weather (they are not included in its set of fields, which are atmospheric only).

6) l. 268: a space is missing (“thatPeings”) and there are two dots after “et al”;
l. 295: a space is missing before the parenthesis (after “onwards”).
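
The error-difference panels suggested in comments 1 and 3 could be computed along the following lines. This sketch assumes the comparison is made on absolute errors, which is one possible reading of "NTR-ERA5 minus CRL-ERA5"; the function name is hypothetical.

```python
import numpy as np

def error_change(ntr, crl, era5):
    """Change in absolute error due to nudging: |NTR - ERA5| - |CRL - ERA5|.

    Negative values mark grid points where the nudged run (NTR) is closer
    to ERA5 than the control run (CRL); positive values mark degradation.
    """
    return np.abs(ntr - era5) - np.abs(crl - era5)
```

Plotting the result with a diverging colour map centred on zero would show at a glance where the relaxation improved or degraded the VP200 anomalies.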

RR by Anonymous Referee #2 (13 Apr 2026)

ED: Publish subject to technical corrections (13 Apr 2026) by Daniela Domeisen

AR by Siyu Li on behalf of the Authors (21 Apr 2026)

Short summary

Weather forecasts weeks to months ahead, called subseasonal forecasts, help communities prepare for floods or droughts but are hard to make accurately. We tested a method called relaxation, which nudges parts of a model to see how different regions affect predictions. Using two machine learning models and a traditional model, we found the machine learning models performed better. Relaxation offers a simple, low-cost way to improve future forecasts.