Quantifying uncertainty in simulations of the West African monsoon with the use of surrogate models

Fischer, Matthias; Knippertz, Peter; van der Linden, Roderick; Lemburg, Alexander; Pante, Gregor; Proppe, Carsten; Marsham, John H.

doi:https://doi.org/10.5194/wcd-5-511-2024

Articles | Volume 5, issue 2

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 17 Apr 2024

Final revised paper (published on 17 Apr 2024)
Preprint (discussion started on 08 Sep 2023)

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2023-1922', Anonymous Referee #1, 21 Nov 2023

General Comments:
This study makes use of the technique of surrogate modelling to quantify the uncertainty contributions and effects of selected model parameters on a variety of output fields and quantities of interest that characterize the West African Monsoon (WAM) in the ICON operational weather prediction model. This is a novel use of the surrogate modelling (emulation) approach to assess the model behaviour of the monsoon system under uncertainty. The paper definitely falls within the scope of Weather and Climate Dynamics and EGU, and is written to a high standard. However, there are several points that I believe need clarification (see specific comments below). In particular, I’m not fully convinced about the need/justification for the transformation of the input space with the PDFs prior to sampling for the training data of the surrogate models - see comment ‘Line 195-196…’. I’m also concerned about the interpretation of ‘p-values’ for the Kruskal-Wallis statistical testing, which seems misleading and needs amending - see comment ‘Lines 522-523…’.
Once these issues and the further issues/comments below are addressed, I would recommend the publication of the manuscript in Weather and Climate Dynamics.
Specific Comments:
- Line 19-20: ‘…rather affects…’ This sounds weird and is unclear. Please clarify.
- Line 96: ‘…within the past years…’ This is vague (what timescale is ‘the past’?) and needs rephrasing. Maybe ‘…within the last XX years…’, or, ‘…within recent years…’?
- Line 114: ‘…In meteorology, universal kriging has been applied in very few studies…’. I don’t agree with this. I think a version of universal kriging has been applied in several studies that relate to meteorology, including tropical sea breeze convection (e.g. 10.1029/2019JD031699) and deep convective clouds and hail (e.g. 10.5194/acp-20-2201-2020). Please amend to reflect this.
- Line 134: ‘…as well as parameter studies.’ – What do you mean by ‘parameter studies’. Is this not the sensitivity analysis? Please explain / clarify.
- Table 1: I find the way the parameters are presented in Table 1 quite confusing. Only one distribution is Beta, and yet the parameter columns are labelled as beta parameters as a default? It would be much easier to understand if the distributions were just written in full in a single ‘PDF’ column, e.g. Normal(1.0, 0.34²) or Normal(μ=1.0, σ²=0.34²), and I recommend doing this.
- Line 143: ‘…define meaningful PDFs representing the full epistemic uncertainty.’ Is this possible? Is the ‘full epistemic uncertainty’ actually known? Epistemic uncertainty is uncertainty due to a lack of knowledge, so it includes the ‘unknown unknowns’ as well as the ‘known unknowns’. Hence, I think it may be more correct to say that these PDFs contain our best knowledge of the associated epistemic uncertainty, rather than the *full* epistemic uncertainty. Please amend appropriately.
- Lines 144-145: For clarity, change ‘…physics, such as grid-scale…’ to ‘…physics. These are the grid-scale…’. ‘such as’ suggests there may be other options, but the following list contains all of the parameters considered. Also, at the end of the sentence (L 146), please add a reference to Table 1.
- Line 160: ‘Particularly for the last three parameters within the family of convection parametrization’ – This does not read well and needs clarification. Which parameters are being referred to here? Also, the order of the parameter descriptions in the text does not correspond to the order they are listed in Table 1, which may confuse the reader – please consider aligning these orders.
- Line 165: ‘…of entrorg and zvz0i…’ Here and elsewhere, I find it difficult to remember the descriptions of the parameters from the model names/acronyms, which don’t seem intuitive to me. I think it would help to be more descriptive in the text, e.g. ‘…of the entrainment and terminal ice velocity parameters, entrorg and zvz0i,…’ so the reader doesn’t have to keep looking things up.
- Line 168: ‘…in the ensemble physics perturbations…’. What are these? This needs more explanation.
- Lines 174-175: ‘…in the case of a fundamental sensitivity analysis, a uniform distribution is not necessarily a good choice, as there is no physical foundation for assuming a jump in the PDF from a constant value to zero at the upper and lower limits.’ I’m not sure I fully agree here. For sensitivity analysis, uniform distributions are used to reflect that (under current knowledge) a parameter’s value is equally likely to be any value across a given range. Beyond that range (and so the physical meaning of it) is irrelevant to the sensitivity analysis, as it doesn’t analyse beyond those limits.
In terms of physical foundations for the distribution choices that are used in this study (Table 1, Figure 5), what is the ‘physical foundation’ for the shapes and rates of decay in the tails of the PDFs selected? How exactly were these distributions chosen? (Was there a robust elicitation process?) And how realistic are they?
The appropriateness of non-uniform distributions can also be questioned – When multiple peaked marginal PDFs come together this can highly bias the sampling of a multi-dimensional parameter space and effectively places a strong constraint on that parameter space prior to any actual calibration. How confident are the authors in the accuracy of these distribution specifications and the constraints to their analysis that these PDFs impose?
- Lines 195-196: ‘Since probability varies strongly across the input space, it is meaningful to train the model with higher accuracy in regions with higher probability.’ I’m not sure I fully understand the logic of this… just because the probability distributions suggest you may not sample an area of parameter space as frequently as another (i.e. in a sensitivity analysis), does that mean that you should want or accept higher error in the predictions (and so less-informed predictions) when you do sample there? Could having variable errors in prediction accuracy across the parameter space lead to bias in the results (e.g. sensitivity analysis) from the sampling (even with lower frequency of samples) of the areas (edges of parameter space) with lower probability / lower accuracy? Has this been tested?
In my mind, one should want the surrogate model (emulator) to be as good a representation as possible of the complex model (simulator) across all of the parameter uncertainty space considered, to then be confident in using that representation in place of the simulator for all sampled parameter combinations.
I think to take this approach of weighted emulator accuracy, you need to be highly confident in the accuracy of the parameter PDFs being used to create that weighting (connects to comment above). However, in the conclusion (Lines 670-671) you suggest this is not the case. Also, other factors such as the smoothness of the output response surface can affect the number of parameter combinations required to obtain a reasonable emulator (a rougher response surface may require more training information) – Would a rougher surface in an area of lower probability exacerbate the potential bias in results from prediction accuracy in such a weighted approach?
I’m interested to know how different the results would be if the training data were sampled evenly over the physical parameter uncertainty space without the PDFs – This would indicate the need/benefit, or not, for this more complex and weighted sampling approach.
- Line 214: ‘…be used to employ sensitivity studies in a resource-friendly way.’ What is meant by ‘resource-friendly’? Please clarify.
- Line 234: ‘Furthermore, we add i. i. d. Gaussian noise with variance sigma_n^2…’. It isn’t clear how this is done. Please clarify.
- Line 314: Is there a general reference for the ICON model, for if a reader wants to find out more details?
- Line 330: ‘…QoIs are thus averaged over these four August periods…’ Does the averaging over the 4 years lead to an overall behaviour that is still realistic? (i.e. Is it possible that for a process, the different meteorology might lead to a high value or a low value, but then the averaging leads to a more central value that is never observed?)
- Line 335: Is it possible to give an indication of the actual amount/size of data that is stored (required level of storage for if someone wanted to repeat this).
- Lines 339-345: Please add units to all of the characteristics of the WAM.
- Section 2.4: Please give the units for each of the QoIs.
- Line 364: ‘…all QoIs are averaged over the study period…’ Please give more detail and clarity on the averaging periods/resolution (here, and/or with the individual QoIs below). How are they averaged? – Daily? 6-hourly?
- Line 371: ‘…the longitudinal range is chosen…’ Is this a fixed longitudinal range that is the same for all simulations?
- Line 477: I think it might be useful to give a full definition of the parameter names on their first use in this section for the general reader, as they are not obvious from the acronyms.
- Figure 6: The labelling ’1), 2),…’ is difficult to see, especially when under dark shading. Could the numbering not be included with the names on the left/right for better clarity?
- Line 515: ‘…all other model parameters are set to their mean values…’ Why is the mean value used for this choice? And not, say, the model’s default values? How does this fixed choice for other parameters affect the results shown in Figure 5?
- Lines 522-523: Here and elsewhere (including the Fig 7, Fig 8 and Fig 9 captions) I am very concerned about, and do not agree with, the interpretation of ‘p-values’ for the Kruskal-Wallis testing. For p-values, the general rules from basic statistics are that a p-value, p ≥ 0.05 shows no evidence against the null hypothesis, H0, being tested, that 0.01 < p ≤ 0.05 indicates weak evidence against H0, that 0.001 < p ≤ 0.01 indicates strong evidence against H0, and then p ≤ 0.001 is very strong evidence against H0. Hence, to say that 0.05 < p < 0.1 shows high significance, and p < 0.05 shows very high significance is just clearly misleading. Please update the results and figures to have an appropriate interpretation of the p-values.
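For readers who want to check the evidence tiers listed in the last specific comment, a minimal Python sketch follows. It is an illustration only and is not part of the referee report or the manuscript. The function names and the three toy samples are hypothetical, and counting p = 0.05 as 'weak' is an assumption. The H statistic omits the tie correction, and the closed-form p-value exp(-H/2) holds only for three groups (chi-squared reference distribution with 2 degrees of freedom).

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic from pooled average ranks (no tie correction)."""
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        i = j
    weighted = sum(r * r / len(sample) for r, sample in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)


def p_value_three_groups(h):
    """Chi-squared survival function with 2 degrees of freedom: exp(-h / 2)."""
    return math.exp(-h / 2.0)


def evidence_against_h0(p):
    """Map a p-value to the evidence tiers quoted in the comment above."""
    if p <= 0.001:
        return "very strong"
    if p <= 0.01:
        return "strong"
    if p <= 0.05:  # assumption: the boundary p = 0.05 is counted as weak evidence
        return "weak"
    return "none"


# Hypothetical toy samples, not data from the study.
samples = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat = kruskal_wallis_h(samples)
p_val = p_value_three_groups(h_stat)
print(f"H = {h_stat:.2f}, p = {p_val:.4f}, evidence: {evidence_against_h0(p_val)}")
print("p = 0.08 ->", evidence_against_h0(0.08))
```

Under these tiers any value in the range 0.05 < p < 0.1 maps to 'none', which is the objection raised against labelling that range as 'high significance'.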
Technical Corrections:
- Line 14: Change: ‘Results show that….’ to ‘The results show that…’, for readability.
- Line 50: ‘…remain to be fraught…’. This sounds strange. I would re-phrase to ‘…are fraught…’
- Line 73: Change: ‘…deep convection has large effect…’ to ‘…deep convection has a large effect…’
- Line 137: Change ‘elaborated’ to ‘explained’.
- Line 151: Change ‘…soil in form of…’ to ‘…soil in the form of…’
- Line 154: Change ‘last two’ to ‘remaining’.
- Line 171: Change ‘…as it is also…’ to ‘…as is also…’.
- Line 322: Change ‘although convection…’ to ‘although the convection…’. Also, Should ‘forecast’ be ‘forecasting’?
- Line 360: Change ‘because latter’ to ‘because the latter’
- Line 361: Change ‘and causes’ to ‘which causes’
- Line 452: Change ‘is no requirement’ to ‘is not a requirement’.
- Figure 4 caption: Change ‘as result of’ to ‘resulting from’.
- Line 579: Should ‘pressure’ be ‘mean sea level pressure’?
- Line 616: Where you reference figure 4, perhaps also reference/indicate the colours for the parameters that this is referring to, for ease of interpretation?
- Line 620: Change ‘in a significant amount…’ to ‘by a significant amount…’
- Line 633: Change ‘midlevel’ to ‘mid-level’
- Line 640: Change ‘only little’ to ‘only a little’.

Citation: https://doi.org/10.5194/egusphere-2023-1922-RC1
RC2: 'Comment on egusphere-2023-1922', Anonymous Referee #2, 26 Nov 2023

The study “Quantifying uncertainty in simulations of the West African Monsoon with the use of surrogate models” by M. Fischer et al. addresses uncertainties in modeling the West African Monsoon (WAM). Uncertainty quantification is based on emulators for the dependence of WAM characteristics, as modeled by the ICON model, on 6 sub-grid-scale parameters, 2 each in relation to deep convection, the sub-cloud layer, and the boundary layer. Results include that interactions in the effects of these parameters are weak, that precipitation shows the most complex dependence on all 6 parameters, and that the two parameters related to convection, notably entrainment rate and terminal fall speed of ice, have the largest effect overall. The study is thorough and comprehensive, both methodologically and in its process-based interpretation of the results.
I recommend publication of the manuscript after the following minor clarifications:
- As another example for the use of Universal Kriging in the Atmospheric Sciences, the following publication has recently employed it to obtain counter-factuals for shipping pollution: Diamond, M. S., Director, H. M., Eastman, R., Possner, A., & Wood, R. (2020). Substantial Cloud Brightening from Shipping in Subtropical Low Clouds. AGU Advances, 1, e2019AV000111. https://doi.org/10.1029/2019AV000111

- Just as a suggestion, I wonder whether some readers, notably those familiar with Gaussian Process emulators with Leeds involvement, might find it easier to relate to Section 2.2.2 if function choices were contrasted to those used in this literature, which to my knowledge, e.g., often assumes a Matérn co-variance structure, and would refer to the “aleatoric uncertainty due to weather noise” as “nugget effect”.

- I would ask the authors to discuss further why there is so little interaction between the parameters. After all, the quantification of such interactions is a key strength of their approach. Could this be a consequence of the domain expertise that went into the selection of the 6 parameters?

- In how far are the below-cloud parameters related to cold-pool dynamics? Does their weak control on WAM characteristics imply anything for the relevance of parameterizing cold pools?

- The choice of using a 4-year “climatology” seems an important one, especially since emulators are cross-validated, and not tested on unseen data (i.e. an unseen 4-year period). Even though I would be surprised if the overall results would depend on this choice, some further elaboration would be helpful.

- Section 2.5 was not detailed enough for me to fully grasp how the spatially resolved results were obtained.

- L459ff: Isn’t the aleatoric uncertainty sigma_n quantified?

- L382f: the “factorization” strategy needs elaboration.

- L149: description of tkhmin needs more detail.

Citation: https://doi.org/10.5194/egusphere-2023-1922-RC2
AC1: 'Comment on egusphere-2023-1922', Matthias Fischer, 13 Dec 2023

We would like to thank the reviewers for their constructive and helpful comments on the manuscript. Overall, we agree with the given remarks and provide a short response below. For other minor (technical) comments that are not mentioned below, we do not provide responses here but will modify the respective parts of the manuscript.
- Line 19-20: ‘…rather affects…’ This sounds weird and is unclear. Please clarify.
will be clarified
- Line 96: ‘…within the past years…’ This is vague (what timescale is ‘the past’?) and needs rephrasing. Maybe ‘…within the last XX years…’, or, ‘…within recent years…’?
will be clarified
- Line 114: ‘…In meteorology, universal kriging has been applied in very few studies…’. I don’t agree with this. I think a version of universal kriging has been applied in several studies that relate to meteorology, including tropical sea breeze convection (e.g. 10.1029/2019JD031699) and deep convective clouds and hail (e.g. 10.5194/acp-20-2201-2020). Please amend to reflect this.
Universal kriging indeed has been applied in a few studies. The proposed references by both reviewers (Diamond, M. S. (2020), Wellmann (2020)) are good examples which we will include in our literature review.

However, in J. M. Park (2020) no background is given about the universal kriging method and whether/how/why it is applied rather than using simple or ordinary kriging. Therefore, we decided not to include it in the context of universal kriging.
- Line 134: ‘…as well as parameter studies.’ – What do you mean by ‘parameter studies’. Is this not the sensitivity analysis? Please explain / clarify.
will be clarified
- Table 1: I find the way the parameters are presented in Table 1 quite confusing. Only one distribution is Beta, and yet the parameter columns are labelled as beta parameters as a default? It would be much easier to understand if the distributions were just written in full in a single ‘PDF’ column, e.g. Normal(1.0, 0.34²) or Normal(μ=1.0, σ²=0.34²), and I recommend doing this.
We agree that this labeling, which was chosen to keep the notation plain and compact, may be misleading. We will modify the notation according to the suggestion.
- Line 143: ‘…define meaningful PDFs representing the full epistemic uncertainty.’ Is this possible? Is the ‘full epistemic uncertainty’ actually known? Epistemic uncertainty is uncertainty due to a lack of knowledge, so it includes the ‘unknown unknowns’ as well as the ‘known unknowns’. Hence, I think it may be more correct to say that these PDFs contain our best knowledge of the associated epistemic uncertainty, rather than the *full* epistemic uncertainty. Please amend appropriately.
Thank you for this comment - this is absolutely correct and we will adapt the formulation accordingly.
- Lines 144-145: For clarity, change ‘…physics, such as grid-scale…’ to ‘…physics. These are the grid-scale…’. ‘such as’ suggests there may be other options, but the following list contains all of the parameters considered. Also, at the end of the sentence (L 146), please add a reference to Table 1.
will be clarified
- Line 160: ‘Particularly for the last three parameters within the family of convection parametrization’ – This does not read well and needs clarification. Which parameters are being referred to here? Also, the order of the parameter descriptions in the text does not correspond to the order they are listed in Table 1, which may confuse the reader – please consider aligning these orders.
will be clarified. For the order and grouping of parameters, we will adapt the description to match Table 1 and the Results section.
- Line 165: ‘…of entrorg and zvz0i…’ Here and elsewhere, I find it difficult to remember the descriptions of the parameters from the model names/acronyms, which don’t seem intuitive to me. I think it would help to be more descriptive in the text, e.g. ‘…of the entrainment and terminal ice velocity parameters, entrorg and zvz0i,…’ so the reader doesn’t have to keep looking things up.
will be clarified
- Line 168: ‘…in the ensemble physics perturbations…’. What are these? This needs more explanation.
will be clarified with better reference to the given literature source
- Lines 174-175: ‘…in the case of a fundamental sensitivity analysis, a uniform distribution is not necessarily a good choice, as there is no physical foundation for assuming a jump in the PDF from a constant value to zero at the upper and lower limits.’ I’m not sure I fully agree here. For sensitivity analysis, uniform distributions are used to reflect that (under current knowledge) a parameter’s value is equally likely to be any value across a given range. Beyond that range (and so the physical meaning of it) is irrelevant to the sensitivity analysis, as it doesn’t analyze beyond those limits.

In terms of physical foundations for the distribution choices that are used in this study (Table 1, Figure 5), what is the ‘physical foundation’ for the shapes and rates of decay in the tails of the PDFs selected? How exactly were these distributions chosen? (Was there a robust elicitation process?) And how realistic are they?

The appropriateness of non-uniform distributions can also be questioned – When multiple peaked marginal PDFs come together this can highly bias the sampling of a multi-dimensional parameter space and effectively places a strong constraint on that parameter space prior to any actual calibration. How confident are the authors in the accuracy of these distribution specifications and the constraints to their analysis that these PDFs impose?
We understand and emphasize that assigning PDFs to the parameters is a crucial step of the whole analysis, which is by no means trivial. The selected PDFs do have a direct impact on the results of the global sensitivity analysis, but not on the parameter studies, where the relationships between the QoIs and the physical parameters are shown. For the global sensitivity analysis, we may ask whether a uniform or a more sophisticated PDF choice is more meaningful. Here, we concluded that a uniform distribution with equal probabilities within a certain range and zero probability beyond the limits would be rather dubious, because parameter values close to the limits within the range would contribute to the global sensitivity analysis with ‘full’ weight and parameter values beyond the limits (but still close to the range) with zero weight. Although a uniform distribution would not be the best choice in our opinion, defining other distributions is challenging. We already explain our choices in L178-184 but will expand this and add that other definitions, e.g. uniform distributions, are often preferred by other authors.
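To make the statement that the selected PDFs directly affect the global sensitivity analysis concrete, the following Python sketch may help. It is an illustration under strong assumptions and not the analysis from the manuscript: it assumes a variance-based notion of sensitivity, a purely additive two-parameter toy model with independent inputs, and hypothetical ranges and standard deviations that are not the values from Table 1. For such a model, the first-order index of input i is c_i² Var(x_i) divided by the total output variance.

```python
def uniform_variance(lo, hi):
    """Variance of a uniform distribution on [lo, hi]."""
    return (hi - lo) ** 2 / 12.0


def first_order_indices(coeffs, variances):
    """First-order variance-based indices for y = sum(c_i * x_i), independent inputs."""
    contributions = [c * c * v for c, v in zip(coeffs, variances)]
    total = sum(contributions)
    return [part / total for part in contributions]


coeffs = [1.0, 1.0]  # hypothetical toy model y = x1 + x2

# Case 1: both inputs uniform on the hypothetical range [0, 2].
uniform_case = first_order_indices(coeffs, [uniform_variance(0.0, 2.0)] * 2)

# Case 2: the first input follows a peaked normal PDF with a hypothetical
# standard deviation of 0.34; the second input stays uniform on [0, 2].
mixed_case = first_order_indices(coeffs, [0.34 ** 2, uniform_variance(0.0, 2.0)])

print("uniform PDFs:      ", [round(s, 3) for s in uniform_case])
print("peaked PDF for x1: ", [round(s, 3) for s in mixed_case])
```

The model and its coefficients are identical in both cases; only the assumed input PDFs differ, and the share of output variance attributed to each parameter changes with them.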
- Lines 195-196: ‘Since probability varies strongly across the input space, it is meaningful to train the model with higher accuracy in regions with higher probability.’ I’m not sure I fully understand the logic of this… just because the probability distributions suggest you may not sample an area of parameter space as frequently as another (i.e. in a sensitivity analysis), does that mean that you should want or accept higher error in the predictions (and so less-informed predictions) when you do sample there? Could having variable errors in prediction accuracy across the parameter space lead to bias in the results (e.g. sensitivity analysis) from the sampling (even with lower frequency of samples) of the areas (edges of parameter space) with lower probability / lower accuracy? Has this been tested?

In my mind, one should want the surrogate model (emulator) to be as good a representation as possible of the complex model (simulator) across all of the parameter uncertainty space considered, to then be confident in using that representation in place of the simulator for all sampled parameter combinations.

I think to take this approach of weighted emulator accuracy, you need to be highly confident in the accuracy of the parameter PDFs being used to create that weighting (connects to comment above). However, in the conclusion (Lines 670-671) you suggest this is not the case. Also, other factors such as the smoothness of the output response surface can affect the number of parameter combinations required to obtain a reasonable emulator (a rougher response surface may require more training information) – Would a rougher surface in an area of lower probability exacerbate the potential bias in results from prediction accuracy in such a weighted approach?

I’m interested to know how different the results would be if the training data were sampled evenly over the physical parameter uncertainty space without the PDFs – This would indicate the need/benefit, or not, for this more complex and weighted sampling approach.
We create the surrogate models in order to carry out global sensitivity analyses. The density of points in the parameter space that are used for conducting the sensitivity analysis corresponds to the probability distribution. This means that, in order to get a more precise sensitivity analysis, it is meaningful for the model to be more accurate in regions of the parameter space with higher PDF values. If the only aim were to construct a surrogate model that is just as accurate in the 'tails' of the parameters, then we would concede that the reviewer's comment is correct. However, a comparison study using the meteorological model is difficult: to do this, the entire set of ICON model runs would have to be carried out again with parameter combinations that were sampled differently. Furthermore, we suspect that the results would not differ much. In our opinion, our approach is the more intuitive/elegant one and also optimal in terms of the global sensitivity analysis.

A comparative study may be the subject of future research using less expensive toy/academic problems in a more methodological/mathematical paper. It would indeed be interesting to investigate whether rough model behavior in the tails could lead to lower overall accuracy or even biases.
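A toy comparison of this kind could be set up along the following lines. This Python sketch is only an assumption about what such a test might look like and is not taken from the manuscript. The one-dimensional `toy_model`, the standard normal input PDF, the range of ±3 standard deviations for the evenly spaced design, and all function names are hypothetical. A piecewise-linear interpolant also stands in for the kriging surrogate to keep the example short.

```python
import bisect
import math
import random
from statistics import NormalDist


def toy_model(x):
    """Hypothetical stand-in for an expensive simulator output."""
    return math.sin(3.0 * x) + 0.5 * x * x


def pdf_weighted_design(dist, n):
    """Training points at midpoint quantiles, so they are dense where the PDF is high."""
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


def even_design(lo, hi, n):
    """Training points spaced evenly over [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def fit_interpolant(xs, model):
    """Piecewise-linear surrogate through (x, model(x)); constant outside the design."""
    ys = [model(x) for x in xs]

    def predict(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        hi = bisect.bisect_right(xs, x)
        t = (x - xs[hi - 1]) / (xs[hi] - xs[hi - 1])
        return ys[hi - 1] + t * (ys[hi] - ys[hi - 1])

    return predict


def rmse_under_pdf(predict, model, dist, n_samples=20000, seed=0):
    """Root-mean-square surrogate error at points drawn from the input PDF."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(dist.mean, dist.stdev)
        total += (predict(x) - model(x)) ** 2
    return math.sqrt(total / n_samples)


input_pdf = NormalDist(mu=0.0, sigma=1.0)  # hypothetical input PDF
n_train = 12
weighted = fit_interpolant(pdf_weighted_design(input_pdf, n_train), toy_model)
even = fit_interpolant(even_design(-3.0, 3.0, n_train), toy_model)
print("PDF-weighted design RMSE: ", rmse_under_pdf(weighted, toy_model, input_pdf))
print("evenly spaced design RMSE:", rmse_under_pdf(even, toy_model, input_pdf))
```

Evaluating the same error measure only on samples from the tails, or with a rougher `toy_model` in the tails, would probe the bias concern raised by the reviewer.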
- Line 214: ‘…be used to employ sensitivity studies in a resource-friendly way.’ What is meant by ‘resource-friendly’? Please clarify.
will be clarified
- Line 234: ‘Furthermore, we add i. i. d. Gaussian noise with variance sigma_n^2…’. It isn’t clear how this is done. Please clarify.
will be clarified (see comment below L459ff)
- Line 314: Is there a general reference for the ICON model, for if a reader wants to find out more details?
We referred to one version of the ICON manual (Reinert D., 2019), but we will revise this and add a reference where the ICON model is introduced in the manuscript.
- Line 330: ‘…QoIs are thus averaged over these four August periods…’ Does the averaging over the 4 years lead to an overall behavior that is still realistic? (i.e. Is it possible that for a process, the different meteorology might lead to a high value or a low value, but then the averaging leads to a more central value that is never observed?)
In a preliminary study, we only included one August period and found that the fluctuations in the relationship between parameters and outputs were relatively large. Therefore, we averaged over four months (always August, to cover a similar climatology). Due to the fluctuations for individual years, it was not possible to investigate the differences between the years with sufficient significance. However, the fact that we get a more robust signal by using four months (significant results in the model validation) strongly suggests that we obtain a smoothing rather than a cancellation of the individual signals. Thus, we are confident that we have a realistic estimate of the averaged behavior.

- Line 335: Is it possible to give an indication of the actual amount/size of data that is stored (required level of storage for if someone wanted to repeat this).
will be added
- Lines 339-345: Please add units to all of the characteristics of the WAM.
will be added
- Section 2.4: Please give the units for each of the QoIs.
will be added
- Line 364: ‘…all QoIs are averaged over the study period…’ Please give more detail and clarity on the averaging periods/resolution (here, and/or with the individual QoIs below). How are they averaged? – Daily? 6-hourly?
We will add more detail here and make clear which time resolution is used for averaging.
- Line 371: ‘…the longitudinal range is chosen…’ Is this a fixed longitudinal range that is the same for all simulations?
Yes, this is chosen for all simulation outputs, as the topography is the same, and to make the results comparable.
- Line 477: I think it might be useful to give a full definition of the parameter names on their first use in this section for the general reader, as they are not obvious from the acronyms.
will be added
- Figure 6: The labelling ’1), 2),…’ is difficult to see, especially when under dark shading. Could the numbering not be included with the names on the left/right for better clarity?
will be clarified. We tried to use a consistent layout/labelling with the following figures, but we will make this clearer.
- Line 515: ‘…all other model parameters are set to their mean values…’ Why is the mean value used for this choice? And not, say, the model’s default values? How does this fixed choice for other parameters affect the results shown in Figure 5?
The mean values correspond to the default values, as the PDFs are defined that way. In L513-514 we emphasized that this illustration is meaningful, i.e. it is expected to be similar (only having vertical shifts) if the other parameters are chosen differently. We will elaborate on this in more detail.
- Lines 522-523: Here and elsewhere (including the Fig 7, Fig 8 and Fig 9 captions) I am very concerned about, and do not agree with, the interpretation of ‘p-values’ for the Kruskal-Wallis testing. For p-values, the general rules from basic statistics are that a p-value, p ≥ 0.05 shows no evidence against the null hypothesis, H0, being tested, that 0.01 < p ≤ 0.05 indicates weak evidence against H0, that 0.001 < p ≤ 0.01 indicates strong evidence against H0, and then p ≤ 0.001 is very strong evidence against H0. Hence, to say that 0.05 < p < 0.1 shows high significance, and p < 0.05 shows very high significance is just clearly misleading. Please update the results and figures to have an appropriate interpretation of the p-values.
Thank you for this comment. We will adjust the interpretation of the statistical test. The terms "very high significance" and "high significance" are misleading. We prefer to use levels (in percentage) rather than the chosen labels. The interpretation will then need some adaptation.
- As another example for the use of Universal Kriging in the Atmospheric Sciences, the following publication has recently employed it to obtain counter-factuals for shipping pollution: Diamond, M. S., Director, H. M., Eastman, R., Possner, A., & Wood, R. (2020). Substantial Cloud Brightening from Shipping in Subtropical Low Clouds. AGU Advances, 1, e2019AV000111. https://doi.org/10.1029/2019AV000111
See above comment (L114): We will include this in our literature review.
- Just as a suggestion, I wonder whether some readers, notably those familiar with Gaussian Process emulators with Leeds involvement, might find it easier to relate to Section 2.2.2 if function choices were contrasted to those used in this literature, which to my knowledge, e.g., often assumes a Matérn co-variance structure, and would refer to the “aleatoric uncertainty due to weather noise” as “nugget effect”.
We agree that using other covariance functions such as the Matérn covariance is often meaningful. Even though the squared exponential function worked very well in our case, we will now mention the Matérn function as an established alternative choice. Also, we will refer to the ‘nugget effect’ to make the explanation more accessible to readers from different communities.
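For readers less familiar with the covariance choices mentioned here, the standard textbook forms can be sketched in Python as follows. This is an illustration, not the implementation used in the manuscript. The hyperparameter values, the example inputs, and the function names are hypothetical, and only the Matérn cases with smoothness 3/2 and 5/2 are shown. The nugget enters as a noise variance added on the diagonal of the covariance matrix.

```python
import math


def squared_exponential(r, signal_var=1.0, length=1.0):
    """Squared exponential covariance as a function of distance r."""
    return signal_var * math.exp(-0.5 * (r / length) ** 2)


def matern_32(r, signal_var=1.0, length=1.0):
    """Matern covariance with smoothness nu = 3/2."""
    s = math.sqrt(3.0) * abs(r) / length
    return signal_var * (1.0 + s) * math.exp(-s)


def matern_52(r, signal_var=1.0, length=1.0):
    """Matern covariance with smoothness nu = 5/2."""
    s = math.sqrt(5.0) * abs(r) / length
    return signal_var * (1.0 + s + s * s / 3.0) * math.exp(-s)


def covariance_matrix(xs, kernel, noise_var=0.0):
    """Covariance of noisy outputs: kernel values plus the nugget on the diagonal."""
    return [
        [kernel(abs(a - b)) + (noise_var if i == j else 0.0) for j, b in enumerate(xs)]
        for i, a in enumerate(xs)
    ]


inputs = [0.0, 0.5, 1.0, 2.0]  # hypothetical one-dimensional inputs
kernels = [
    ("squared exponential", squared_exponential),
    ("Matern 3/2", matern_32),
    ("Matern 5/2", matern_52),
]
for name, kernel in kernels:
    first_row = covariance_matrix(inputs, kernel, noise_var=0.1)[0]
    print(name, [round(v, 3) for v in first_row])
```

All three kernels equal the signal variance at zero distance and decay with distance; they differ in how smooth the resulting response surface is assumed to be.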
- I would ask the authors to discuss further why there is so little interaction between the parameters. After all, the quantification of such interactions is a key strength of their approach. Could this be a consequence of the domain expertise that went into the selection of the 6 parameters?
When selecting the 6 parameters, we aimed to include various effects on the WAM system, but we did not expect the parameter interactions to be so small. We would expect the interactions to be larger if we broadened the parameter ranges (PDFs).
- To what extent are the below-cloud parameters related to cold-pool dynamics? Does their weak control on WAM characteristics imply anything for the relevance of parameterizing cold pools?
The below-cloud parameters control how much rain is evaporated underneath the clouds and therefore modify surface rain and thermodynamic profiles. More evaporation creates cooler subcloud layers, which in turn leads to a larger negative buoyancy relative to neighboring grid cells and thus a larger lateral acceleration. This has some resemblance to having stronger cold pools, but the grid spacing we use in our experiments (13 km) is probably not fine enough to fully resolve this process, including the triggering of new storms through cold pools. Nevertheless, the results we find for these parameters give a first indication of the general relevance of cold pools for the monsoon system and thus of the potential gain from a cold pool parameterization, which would attempt to represent the subgrid aspects of the problem.
- The choice of using a 4-year “climatology” seems an important one, especially since emulators are cross-validated, and not tested on unseen data (i.e. an unseen 4-year period). Even though I would be surprised if the overall results would depend on this choice, some further elaboration would be helpful.
See answer above to “Line 330”
- Section 2.5 was not detailed enough for me to fully grasp how the spatially resolved results were obtained.
The explanation is indeed quite theoretical. We will add some detail to make it more intuitive.
- L459ff: Isn’t the aleatoric uncertainty sigma_n quantified?
Yes, it is determined by maximizing the likelihood (like all hyperparameters) and then gives insight into the aleatoric uncertainty of the surrogate model. We will explain this in more detail.
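As a rough illustration of how a noise level such as sigma_n can be obtained by maximizing the likelihood, the sketch below evaluates the Gaussian process log marginal likelihood on a small grid of candidate values. It is a simplified stand-in, not the authors' procedure: it assumes a zero mean, a squared exponential kernel with fixed unit length scale and variance, synthetic one-dimensional data, and a grid search over sigma_n only, whereas the response above states that all hyperparameters are optimized. All names and values are hypothetical.

```python
import math
import random


def se_kernel(x1, x2, length_scale=1.0):
    """Squared exponential kernel with unit variance."""
    return math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def log_marginal_likelihood(xs, ys, sigma_n):
    """Log marginal likelihood of a zero-mean GP with noise variance sigma_n**2."""
    n = len(xs)
    cov = [[se_kernel(xs[i], xs[j]) + (sigma_n ** 2 if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    # Forward substitution for L z = y, so that y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((ys[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    data_fit = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * data_fit - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


def estimate_sigma_n(xs, ys, grid):
    """Candidate noise level on the grid with the highest log marginal likelihood."""
    return max(grid, key=lambda s: log_marginal_likelihood(xs, ys, s))


# Synthetic data: a smooth signal with and without added noise (assumed values).
xs = [0.5 * i for i in range(20)]
clean = [math.sin(x) for x in xs]
rng = random.Random(0)
noisy = [y + rng.gauss(0.0, 0.3) for y in clean]

grid = [0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0]
sigma_clean = estimate_sigma_n(xs, clean, grid)
sigma_noisy = estimate_sigma_n(xs, noisy, grid)
```

In this toy setting the estimated noise level is larger for the noisy data than for the clean data, which is the sense in which the fitted sigma_n carries information about aleatoric uncertainty. In practice a gradient-based optimizer over all hyperparameters would replace the grid search.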
- L382f: the “factorization” strategy needs elaboration.
This will be clarified.
- L149: description of tkhmin needs more detail.
More detail will be added, together with a further reference to DWD literature.

Citation: https://doi.org/10.5194/egusphere-2023-1922-AC1

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Matthias Fischer on behalf of the Authors (10 Jan 2024)

ED: Referee Nomination & Report Request started (12 Jan 2024) by Stephan Pfahl

RR by Anonymous Referee #1 (16 Feb 2024)

Suggestions for revision or reasons for rejection

2nd Review of paper: ‘Quantifying uncertainty in simulations of the West African Monsoon with the use of surrogate models’.

I have one specific comment below that I think still needs to be addressed properly in the manuscript. All of my other comments have been dealt with well in the updated manuscript. Once addressed, along with the few technical corrections I spotted, I would recommend the publication of the manuscript in Weather and Climate Dynamics.

Specific Comment:

- Section 2.1.1, Para1: Lines 203-205: ‘Since probability varies strongly across the input space, it is meaningful to train the model with higher accuracy in regions with higher probability. This is because we construct surrogate models particularly for performing global sensitivity analysis.’
- Lines 206-207: ‘Thus, for a more accurate sensitivity analysis, it is crucial for the model to exhibit higher accuracy in regions of the parameter space where the PDF values are greater.’

I think these statements are in some ways misleading, and this paragraph should be edited to remove any misconceptions, ensure clarity, and acknowledge the potential negative consequences of the sampling strategy applied. Following the authors’ response to my previous comment on the sampling for the training data (on training the model with higher accuracy in regions with higher probability, previously lines 195-196), I’m still not convinced about the validity of this. I can understand that the authors want the surrogate to be as accurate as possible where they will sample more, but with a fixed amount of training data in total (as I think is the case here, using n = 10*p, where p is the number of parameters perturbed), this approach must have the opposite effect on the accuracy of the samples in areas of lower probability (by reducing it), which, although sampled less, **can/will still be sampled in a global sensitivity analysis and so can affect the results of it**. There is **no evidence** provided (or to my knowledge) to say that this approach / sampling strategy is **crucial** to obtain a more accurate sensitivity analysis, and I think it is just as possible that it could lead to less accurate sensitivity results.

The fact that the authors intend to perform a global sensitivity analysis doesn’t make sense to me as a reason to vary the accuracy of the underlying model (here, the emulator/surrogate model) whose sensitivity you want to understand. The effects of the PDFs are still accounted for in the sampling of the sensitivity analysis procedure itself, and so this seems to be an unnecessary step that has the potential to induce possibly significant inaccuracy in some emulator predictions and hence in the obtained sensitivity results. When constructing a surrogate model, technical aspects such as changes in the smoothness of the surface that one is trying to approximate can affect the emulator’s accuracy around the input space and so be valid reasons for requiring more or less training data in different areas. In my experience, if more data is needed, this is added in addition to the base training sample of size 10*p. Given this, it also seems possible that the sampling strategy described could lead to fewer training points in areas of input space where the Gaussian process might already find the output more difficult to capture well [if they happen to be the areas of lower probability], which would then lead to a poorer representation of the climate model, which could affect sensitivity results.

I understand that it isn’t possible (due to computational expense) to re-run the study with a uniform training sample for the surrogate model and do the direct comparison, and also that validation of the surrogate models should provide some evidence that the emulator prediction is reasonable across input space [this evidence seems limited here in showing prediction accuracy in different areas of input space]. However, I think it is important that any caveats of the sampling strategy used are clearly acknowledged [i.e. that the global sensitivity analysis **can/will** still sample in areas of low probability, where the emulator here will be less accurate, which could adversely affect the resulting sensitivities] and that all statements of something being ‘better’ or ‘crucial’ are either evidenced or not used.
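The trade-off described in this comment can be made concrete with a small one-dimensional sketch. It is an illustration under stated assumptions, not a reproduction of the study: the input PDF is taken to be a hypothetical normal distribution on a [0, 1] parameter range (truncation is ignored), the training budget is n = 10*p with p = 6, and a "tail" is defined as lying more than two standard deviations from the mean. The sketch counts how many training points each design places in the tails and how many Monte Carlo samples of a PDF-based sensitivity analysis fall there.

```python
import random
from statistics import NormalDist

N_PARAMS = 6                # number of perturbed parameters, as in the study
N_TRAIN = 10 * N_PARAMS     # fixed training budget n = 10*p
pdf = NormalDist(mu=0.5, sigma=0.15)  # hypothetical input PDF on a [0, 1] range


def in_tail(x):
    """True if x lies more than two standard deviations from the PDF mean."""
    return abs(x - pdf.mean) > 2.0 * pdf.stdev


# Stratified midpoints in (0, 1), used directly or mapped through the inverse CDF.
strata = [(i + 0.5) / N_TRAIN for i in range(N_TRAIN)]
uniform_design = strata                              # evenly spread training points
weighted_design = [pdf.inv_cdf(u) for u in strata]   # PDF-weighted training points

train_tail_uniform = sum(in_tail(x) for x in uniform_design)
train_tail_weighted = sum(in_tail(x) for x in weighted_design)

# Monte Carlo inputs of a sensitivity analysis drawn from the same PDF.
rng = random.Random(0)
gsa_samples = [rng.gauss(pdf.mean, pdf.stdev) for _ in range(10_000)]
gsa_tail = sum(in_tail(x) for x in gsa_samples)
```

Under these assumptions the evenly spread design places 24 of its 60 training points in the tails and the PDF-weighted design only 2, while a few percent of the sensitivity-analysis samples still require emulator predictions there. The counts describe training density only; whether the emulator is actually less accurate in the tails depends on the smoothness of the response and would need to be checked by validation.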

Technical Corrections:
- Line 48: Should the word ‘trough’ be ‘through’ ?
- Line 96: Should the word ‘economy’ be ‘economics’ ?
- Line 170: ‘…to counteracts this underestimation…’ should be ‘…to counteract this underestimation…’

ED: Publish subject to minor revisions (review by editor) (16 Feb 2024) by Stephan Pfahl

AR by Matthias Fischer on behalf of the Authors (26 Feb 2024)

ED: Publish as is (01 Mar 2024) by Stephan Pfahl

AR by Matthias Fischer on behalf of the Authors (05 Mar 2024)

Short summary

Our research enhances the understanding of the complex dynamics within the West African monsoon system by analyzing the impact of specific model parameters on its characteristics. Employing surrogate models, we identified critical factors such as the entrainment rate and the fall velocity of ice. Precise definition of these parameters in weather models could improve forecast accuracy, thus enabling better strategies to manage and reduce the impact of weather events.