the Creative Commons Attribution 4.0 License.
Quantifying uncertainty in simulations of the West African monsoon with the use of surrogate models
Matthias Fischer
Peter Knippertz
Roderick van der Linden
Alexander Lemburg
Gregor Pante
Carsten Proppe
John H. Marsham
 Final revised paper (published on 17 Apr 2024)
 Preprint (discussion started on 08 Sep 2023)
Interactive discussion
Status: closed

RC1: 'Comment on egusphere-2023-1922', Anonymous Referee #1, 21 Nov 2023
General Comments:
This study makes use of the technique of surrogate modelling to quantify the uncertainty contributions and effects of selected model parameters on a variety of output fields and quantities of interest that characterize the West African Monsoon (WAM) in the ICON operational weather prediction model. This is a novel use of the surrogate modelling (emulation) approach to assess the model behaviour of the monsoon system under uncertainty. The paper definitely falls within the scope of Weather and Climate Dynamics and EGU, and is written to a high standard. However, there are several points that I believe need clarification (see specific comments below). In particular, I’m not fully convinced about the need/justification for the transform of the input space with the PDFs prior to sampling for the training data of the surrogate models (see comment ‘Lines 195-196…’). I’m also concerned about the interpretation of ‘p-values’ for the Kruskal-Wallis statistical testing, which seems misleading and needs amending (see comment ‘Lines 522-523…’).
Once these issues and the further issues/comments below are addressed, I would recommend the publication of the manuscript in Weather and Climate Dynamics.
Specific Comments:
 Line 19-20: ‘…rather affects…’ This sounds weird and is unclear. Please clarify.
 Line 96: ‘…within the past years…’ This is vague (what timescale is ‘the past’?) and needs rephrasing. Maybe ‘…within the last XX years…’, or, ‘…within recent years…’?
 Line 114: ‘…In meteorology, universal kriging has been applied in very few studies…’. I don’t agree with this. I think a version of universal kriging has been applied in several studies that relate to meteorology, including tropical sea breeze convection (e.g. 10.1029/2019JD031699) and deep convective clouds and hail (e.g. 10.5194/acp2022012020). Please amend to reflect this.
 Line 134: ‘…as well as parameter studies.’ – What do you mean by ‘parameter studies’. Is this not the sensitivity analysis? Please explain / clarify.
 Table 1: I find the way the parameters are presented in Table 1 quite confusing. Only one distribution is Beta, and yet the parameter columns are labelled as beta parameters as a default? It would be much easier to understand if the distributions were just written in full in a single ‘PDF’ column, e.g. Normal(1.0, 0.34^{2}) or Normal(μ=1.0, σ^{2}=0.34^{2}), and I recommend doing this.
 Line 143: ‘…define meaningful PDFs representing the full epistemic uncertainty.’ Is this possible? Is the ‘full epistemic uncertainty’ actually known? [Epistemic uncertainty is uncertainty due to a lack of knowledge – so, this includes the ‘unknown unknowns’ part as well as the ‘known unknowns’.] Hence, I think it may be more correct to say that these PDFs will contain our best knowledge of the associated epistemic uncertainty, rather than the *full* epistemic uncertainty. Please amend appropriately.
 Lines 144-145: For clarity, change ‘…physics, such as grid-scale…’ to ‘…physics. These are the grid-scale…’. ‘such as’ suggests there may be other options, but the following list contains all of the parameters considered. Also, at the end of the sentence (L 146), please add a reference to Table 1.
 Line 160: ‘Particularly for the last three parameters within the family of convection parametrization’ – This does not read well and needs clarification. Which parameters are being referred to here? Also, the order of the parameter descriptions in the text does not correspond to the order they are listed in Table 1, which may confuse the reader – please consider aligning these orders.
 Line 165: ‘…of entrorg and zvz0i…’ Here and elsewhere, I find it difficult to remember the descriptions of the parameters from the model names/acronyms, which don’t seem intuitive to me. I think it would help to be more descriptive in the text, e.g. ‘…of the entrainment and terminal ice velocity parameters, entrorg and zvz0i,…’ so the reader doesn’t have to keep looking things up.
 Line 168: ‘…in the ensemble physics perturbations…’. What are these? This needs more explanation.
 Lines 174-175: ‘…in the case of a fundamental sensitivity analysis, a uniform distribution is not necessarily a good choice, as there is no physical foundation for assuming a jump in the PDF from a constant value to zero at the upper and lower limits.’ I’m not sure I fully agree here. For sensitivity analysis, uniform distributions are used to reflect that (under current knowledge) a parameter’s value is equally likely to be any value across a given range. Beyond that range (and so the physical meaning of it) is irrelevant to the sensitivity analysis, as it doesn’t analyse beyond those limits.
In terms of physical foundations for the distribution choices that are used in this study (Table 1, Figure 5), what is the ‘physical foundation’ for the shapes and rates of decay in the tails of the PDFs selected? – How exactly were these distributions chosen? (Was there a robust elicitation process?) And, how realistic are they?
The appropriateness of non-uniform distributions can also be questioned – When multiple peaked marginal PDFs come together this can highly bias the sampling of a multidimensional parameter space and effectively places a strong constraint on that parameter space prior to any actual calibration. How confident are the authors in the accuracy of these distribution specifications and the constraints to their analysis that these PDFs impose?
 Lines 195-196: ‘Since probability varies strongly across the input space, it is meaningful to train the model with higher accuracy in regions with higher probability.’ I’m not sure I fully understand the logic of this… just because the probability distributions suggest you may not sample an area of parameter space as frequently as another (i.e. in a sensitivity analysis), does that mean that you should want or accept higher error in the predictions (and so less-informed predictions) when you do sample there? Could having variable errors in prediction accuracy across the parameter space lead to bias in the results (e.g. sensitivity analysis) from the sampling (even with lower frequency of samples) of the areas (edges of parameter space) with lower probability / lower accuracy? Has this been tested?
In my mind, one should want the surrogate model (emulator) to be as good a representation as possible of the complex model (simulator) across all of the parameter uncertainty space considered, to then be confident in using that representation in place of the simulator for all sampled parameter combinations.
I think to take this approach of weighted emulator accuracy, you need to be highly confident in the accuracy of the parameter PDFs being used to create that weighting (connects to comment above). However, in the conclusion (Lines 670-671) you suggest this is not the case. Also, other factors such as the smoothness of the output response surface can affect the number of parameter combinations required to obtain a reasonable emulator (a rougher response surface may require more training information) – Would a rougher surface in an area of lower probability exacerbate the potential bias in results from prediction accuracy in such a weighted approach?
I’m interested to know how different the results would be if the training data were sampled evenly over the physical parameter uncertainty space without the PDFs – This would indicate the need/benefit, or not, for this more complex and weighted sampling approach.
 Line 214: ‘…be used to employ sensitivity studies in a resource-friendly way.’ What is meant by ‘resource-friendly’? Please clarify.
 Line 234: ‘Furthermore, we add i. i. d. Gaussian noise with variance sigma_n^2…’. It isn’t clear how this is done. Please clarify.
 Line 314: Is there a general reference for the ICON model, for if a reader wants to find out more details?
 Line 330: ‘…QoIs are thus averaged over these four August periods…’ Does the averaging over the 4 years lead to an overall behaviour that is still realistic? (i.e. Is it possible that for a process, the different meteorology might lead to a high value or a low value, but then the averaging leads to a more central value that is never observed?)
 Line 335: Is it possible to give an indication of the actual amount/size of data that is stored (required level of storage for if someone wanted to repeat this).
 Lines 339-345: Please add units to all of the characteristics of the WAM.
 Section 2.4: Please give the units for each of the QoIs.
 Line 364: ‘…all QoIs are averaged over the study period…’ Please give more detail and clarity on the averaging periods/resolution (here, and/or with the individual QoI’s below). How are they averaged? – Daily? 6-hourly?
 Line 371: ‘…the longitudinal range is chosen…’ Is this a fixed longitudinal range that is the same for all simulations?
 Line 477: I think it might be useful to give a full definition of the parameter names on their first use in this section for the general reader, as they are not obvious from the acronyms.
 Figure 6: The labelling ’1), 2),…’ is difficult to see, especially when under dark shading. Could the numbering not be included with the names on the left/right for better clarity?
 Lines 515: ‘…all other model parameters are set to their mean values…’ Why is the mean value used for this choice? And not, say, the model’s default values? How does this fixed choice for other parameters affect the results shown in Figure 5?
 Lines 522-523: Here and elsewhere (including the Fig 7, Fig 8 and Fig 9 captions) I am very concerned about, and do not agree with, the interpretation of ‘p-values’ for the Kruskal-Wallis testing. For p-values, the general rules from basic statistics are that a p-value, p ≥ 0.05 shows no evidence against the null hypothesis, H0, being tested, that 0.01 < p ≤ 0.05 indicates weak evidence against H0, that 0.001 < p ≤ 0.01 indicates strong evidence against H0, and then p ≤ 0.001 is very strong evidence against H0. Hence, to say that 0.05 < p < 0.1 shows high significance, and p < 0.05 shows very high significance is just clearly misleading. Please update the results and figures to have an appropriate interpretation of the p-values.
Technical Corrections:
 Line 14: Change: ‘Results show that….’ to ‘The results show that…’, for readability.
 Line 50: ‘…remain to be fraught…’. This sounds strange. I would rephrase to ‘…are fraught…’
 Line 73: Change: ‘…deep convection has large effect…’ to ‘…deep convection has a large effect…’
 Line 137: Change ‘elaborated’ to ‘explained’.
 Line 151: Change ‘…soil in form of…’ to ‘…soil in the form of…’
 Line 154: Change ‘last two’ to ‘remaining’.
 Line 171: Change ‘…as it is also…’ to ‘…as is also…’.
 Line 322: Change ‘although convection…’ to ‘although the convection…’. Also, Should ‘forecast’ be ‘forecasting’?
 Line 360: Change ‘because latter’ to ‘because the latter’
 Line 361: Change ‘and causes’ to ‘which causes’
 Line 452: Change ‘is no requirement’ to ‘is not a requirement’.
 Figure 4 caption: Change ‘as result of’ to ‘resulting from’.
 Line 579: Should ‘pressure’ be ‘mean sea level pressure’?
 Line 616: Where you reference figure 4, perhaps also reference/indicate the colours for the parameters that this is referring to, for ease of interpretation?
 Line 620: Change ‘in a significant amount…’ to ‘by a significant amount…’
 Line 633: Change ‘midlevel’ to ‘midlevel’
 Line 640: Change ‘only little’ to ‘only a little’.
Citation: https://doi.org/10.5194/egusphere-2023-1922-RC1
RC2: 'Comment on egusphere-2023-1922', Anonymous Referee #2, 26 Nov 2023
The study “Quantifying uncertainty in simulations of the West African Monsoon with the use of surrogate models” by M. Fischer et al. addresses uncertainties in modeling the West African Monsoon (WAM). Uncertainty quantification is based on emulators for the dependence of WAM characteristics as modeled by the ICON model on 6 subgrid-scale parameters, 2 each in relation to deep convection, the subcloud, and boundary layer. Results include that interactions in the effects of these parameters are weak, precipitation shows the most complex dependence on all 6 parameters, and that the two parameters related to convection, notably entrainment rate and terminal fall speed of ice, have the largest effect overall. The study is thorough and comprehensive, both methodologically and in its process-based interpretation of the results.
I recommend publication of the manuscript after the following minor clarifications:
 As another example for the use of Universal Kriging in the Atmospheric Sciences, the following publication has recently employed it to obtain counterfactuals for shipping pollution: Diamond, M. S., Director, H. M., Eastman, R., Possner, A., & Wood, R. (2020). Substantial Cloud Brightening from Shipping in Subtropical Low Clouds. AGU Advances, 1, e2019AV000111. https://doi.org/10.1029/2019AV000111
 Just as a suggestion, I wonder whether some readers, notably those familiar with Gaussian Process emulators with Leeds involvement, might find it easier to relate to Section 2.2.2 if function choices were contrasted to those used in this literature, which to my knowledge, e.g., often assumes a Matérn covariance structure, and would refer to the “aleatoric uncertainty due to weather noise” as “nugget effect”.
 I would ask the authors to discuss further why there is so little interaction between the parameters. After all, the quantification of such interactions is a key strength of their approach. Could this be a consequence of the domain expertise that went into the selection of the 6 parameters?
 To what extent are the below-cloud parameters related to cold-pool dynamics? Does their weak control on WAM characteristics imply anything for the relevance of parameterizing cold pools?
 The choice of using a 4-year “climatology” seems an important one, especially since emulators are cross-validated, and not tested on unseen data (i.e. an unseen 4-year period). Even though I would be surprised if the overall results would depend on this choice, some further elaboration would be helpful.
 Section 2.5 was not detailed enough for me to fully grasp how the spatially resolved results were obtained.
 L459ff: Isn’t the aleatoric uncertainty sigma_n quantified?
 L382f: the “factorization” strategy needs elaboration.
 L149: description of tkhmin needs more detail.
Citation: https://doi.org/10.5194/egusphere-2023-1922-RC2
AC1: 'Comment on egusphere-2023-1922', Matthias Fischer, 13 Dec 2023
We would like to thank the reviewers for their constructive and helpful comments on the manuscript. Overall, we agree with the given remarks and provide a short response below. For other minor (technical) comments that are not mentioned below, we do not provide responses here but will modify the respective parts of the manuscript.
 Line 19-20: ‘…rather affects…’ This sounds weird and is unclear. Please clarify.
will be clarified
 Line 96: ‘…within the past years…’ This is vague (what timescale is ‘the past’?) and needs rephrasing. Maybe ‘…within the last XX years…’, or, ‘…within recent years…’?
will be clarified
 Line 114: ‘…In meteorology, universal kriging has been applied in very few studies…’. I don’t agree with this. I think a version of universal kriging has been applied in several studies that relate to meteorology, including tropical sea breeze convection (e.g. 10.1029/2019JD031699) and deep convective clouds and hail (e.g. 10.5194/acp2022012020). Please amend to reflect this.
Universal kriging indeed has been applied in a few studies. The proposed references by both reviewers (Diamond, M. S. (2020), Wellmann (2020)) are good examples which we will include in our literature review.
However, in J. M. Park (2020) no background is given about the universal kriging method and whether/how/why it is applied rather than using simple or ordinary kriging. Therefore, we decided not to include it in the context of universal kriging.
 Line 134: ‘…as well as parameter studies.’ – What do you mean by ‘parameter studies’. Is this not the sensitivity analysis? Please explain / clarify.
will be clarified
 Table 1: I find the way the parameters are presented in Table 1 quite confusing. Only one distribution is Beta, and yet the parameter columns are labelled as beta parameters as a default? It would be much easier to understand if the distributions were just written in full in a single ‘PDF’ column, e.g. Normal(1.0, 0.34^{2}) or Normal(μ=1.0, σ^{2}=0.34^{2}), and I recommend doing this.
We agree that this labeling, which was chosen to keep the notation plain and compact, may be misleading. We will modify the notation according to the suggestion.
 Line 143: ‘…define meaningful PDFs representing the full epistemic uncertainty.’ Is this possible? Is the ‘full epistemic uncertainty’ actually known? [Epistemic uncertainty is uncertainty due to a lack of knowledge – so, this includes the ‘unknown unknowns’ part as well as the ‘known unknowns’.] Hence, I think it may be more correct to say that these PDFs will contain our best knowledge of the associated epistemic uncertainty, rather than the *full* epistemic uncertainty. Please amend appropriately.
Thank you for this comment; it is absolutely correct, and we will adapt the formulation accordingly.
 Lines 144-145: For clarity, change ‘…physics, such as grid-scale…’ to ‘…physics. These are the grid-scale…’. ‘such as’ suggests there may be other options, but the following list contains all of the parameters considered. Also, at the end of the sentence (L 146), please add a reference to Table 1.
will be clarified
 Line 160: ‘Particularly for the last three parameters within the family of convection parametrization’ – This does not read well and needs clarification. Which parameters are being referred to here? Also, the order of the parameter descriptions in the text does not correspond to the order they are listed in Table 1, which may confuse the reader – please consider aligning these orders.
will be clarified. For the order and grouping of parameters, we will adapt the description to match Table 1 and the Results section.
 Line 165: ‘…of entrorg and zvz0i…’ Here and elsewhere, I find it difficult to remember the descriptions of the parameters from the model names/acronyms, which don’t seem intuitive to me. I think it would help to be more descriptive in the text, e.g. ‘…of the entrainment and terminal ice velocity parameters, entrorg and zvz0i,…’ so the reader doesn’t have to keep looking things up.
will be clarified
 Line 168: ‘…in the ensemble physics perturbations…’. What are these? This needs more explanation.
will be clarified with better reference to the given literature source
 Lines 174-175: ‘…in the case of a fundamental sensitivity analysis, a uniform distribution is not necessarily a good choice, as there is no physical foundation for assuming a jump in the PDF from a constant value to zero at the upper and lower limits.’ I’m not sure I fully agree here. For sensitivity analysis, uniform distributions are used to reflect that (under current knowledge) a parameter’s value is equally likely to be any value across a given range. Beyond that range (and so the physical meaning of it) is irrelevant to the sensitivity analysis, as it doesn’t analyze beyond those limits.
In terms of physical foundations for the distribution choices that are used in this study (Table 1, Figure 5), what is the ‘physical foundation’ for the shapes and rates of decay in the tails of the PDFs selected? – How exactly were these distributions chosen? (Was there a robust elicitation process?) And, how realistic are they?
The appropriateness of non-uniform distributions can also be questioned – When multiple peaked marginal PDFs come together this can highly bias the sampling of a multidimensional parameter space and effectively places a strong constraint on that parameter space prior to any actual calibration. How confident are the authors in the accuracy of these distribution specifications and the constraints to their analysis that these PDFs impose?
We understand and emphasize that assigning PDFs to the parameters is a crucial step of the whole analysis, which is by no means trivial. The selected PDFs do have a direct impact on the results of the global sensitivity analysis, but not on the parameter studies, where the relationships between the QoIs and physical parameters are shown. For the global sensitivity analysis, we may ask whether a uniform or a more sophisticated PDF choice is more meaningful. Here, we concluded that a uniform distribution with equal probabilities within a certain range and zero probability beyond the limits would be rather dubious, because parameter values close to the limits within the range would contribute to the global sensitivity analysis with ‘full’ weight and parameter values beyond the limit (but still close to the range) with zero weight. Although a uniform distribution would not be the best choice in our opinion, defining other distributions is challenging. We already elaborate on our choices in L178-184 but will expand this and add that other definitions, e.g. uniform distributions, are often preferred by other authors.
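The statement that the selected PDFs directly affect the global sensitivity analysis can be illustrated with a minimal sketch. It assumes a purely additive toy model y = a1*x1 + a2*x2 with independent inputs, for which the first-order Sobol indices have a closed form. The model, the coefficients and the two input distributions are hypothetical and are not taken from the manuscript.

```python
from typing import List, Sequence


def uniform_variance(lower: float, upper: float) -> float:
    """Variance of a uniform distribution on [lower, upper]."""
    return (upper - lower) ** 2 / 12.0


def first_order_indices(coeffs: Sequence[float],
                        variances: Sequence[float]) -> List[float]:
    """First-order Sobol indices of y = sum_i a_i * x_i (independent inputs).

    For this additive model Var(y) = sum_i a_i^2 * Var(x_i), so that
    S_i = a_i^2 * Var(x_i) / Var(y).
    """
    contributions = [a * a * v for a, v in zip(coeffs, variances)]
    total = sum(contributions)
    return [c / total for c in contributions]


# Hypothetical toy setup: both inputs enter the model with equal weight.
coeffs = [1.0, 1.0]

# Case A: both inputs uniform on [0, 1].
both_uniform = first_order_indices(coeffs, [uniform_variance(0.0, 1.0)] * 2)

# Case B: second input replaced by a peaked normal PDF with sigma = 0.1.
peaked_second = first_order_indices(
    coeffs, [uniform_variance(0.0, 1.0), 0.1 ** 2]
)

print(both_uniform)
print(peaked_second)
```

In this toy case the two inputs share the variance equally when both are uniform, whereas narrowing the second PDF shifts roughly 89 % of the output variance to the first input, even though the model itself is unchanged.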
 Lines 195-196: ‘Since probability varies strongly across the input space, it is meaningful to train the model with higher accuracy in regions with higher probability.’ I’m not sure I fully understand the logic of this… just because the probability distributions suggest you may not sample an area of parameter space as frequently as another (i.e. in a sensitivity analysis), does that mean that you should want or accept higher error in the predictions (and so less-informed predictions) when you do sample there? Could having variable errors in prediction accuracy across the parameter space lead to bias in the results (e.g. sensitivity analysis) from the sampling (even with lower frequency of samples) of the areas (edges of parameter space) with lower probability / lower accuracy? Has this been tested?
In my mind, one should want the surrogate model (emulator) to be as good a representation as possible of the complex model (simulator) across all of the parameter uncertainty space considered, to then be confident in using that representation in place of the simulator for all sampled parameter combinations.
I think to take this approach of weighted emulator accuracy, you need to be highly confident in the accuracy of the parameter PDFs being used to create that weighting (connects to comment above). However, in the conclusion (Lines 670-671) you suggest this is not the case. Also, other factors such as the smoothness of the output response surface can affect the number of parameter combinations required to obtain a reasonable emulator (a rougher response surface may require more training information) – Would a rougher surface in an area of lower probability exacerbate the potential bias in results from prediction accuracy in such a weighted approach?
I’m interested to know how different the results would be if the training data were sampled evenly over the physical parameter uncertainty space without the PDFs – This would indicate the need/benefit, or not, for this more complex and weighted sampling approach.
We create the surrogate models in order to carry out global sensitivity analyses. The density of points in the parameter space used for the sensitivity analysis corresponds to the probability distribution. This means that, to get a more precise sensitivity analysis, it is meaningful that the model is more accurate in regions of the parameter space with higher PDF values. If the only aim is to construct a surrogate model that should be just as accurate in the 'tails' of the parameters, then we would concede that the reviewer's comment is correct. However, a comparison study using the meteorological model is difficult: to do this, the entire set of ICON model runs would have to be carried out again with differently sampled parameter combinations. Furthermore, we suspect that the results would not differ much. In our opinion, our approach is the more intuitive/elegant one and also optimal in terms of the global sensitivity analysis.
A comparative study may be the subject of future research using less expensive toy/academic problems in a more methodological/mathematical paper. It would indeed be interesting to investigate whether rough model behavior in the tails could lead to lower overall accuracy or even biases.
 Line 214: ‘…be used to employ sensitivity studies in a resource-friendly way.’ What is meant by ‘resource-friendly’? Please clarify.
will be clarified
 Line 234: ‘Furthermore, we add i. i. d. Gaussian noise with variance sigma_n^2…’. It isn’t clear how this is done. Please clarify.
will be clarified (see comment below L459ff)
 Line 314: Is there a general reference for the ICON model, for if a reader wants to find out more details?
We referred to one version of the ICON manual (Reinert D., 2019), but we will revise this and add a reference where the ICON model is introduced in the manuscript.
 Line 330: ‘…QoIs are thus averaged over these four August periods…’ Does the averaging over the 4 years lead to an overall behavior that is still realistic? (i.e. Is it possible that for a process, the different meteorology might lead to a high value or a low value, but then the averaging leads to a more central value that is never observed?)
In a preliminary study, we only included one August period and found that the fluctuations in the relationship between parameters and outputs were relatively large. Therefore, we averaged over 4 months (always August to cover similar climatology). Due to the fluctuations for individual years, it was not possible to investigate the differences between the years with sufficient significance. However, the fact that we get a more robust signal by using four months (significant results in the model validation) strongly suggests that we obtain a smoothing rather than a cancellation of the individual signals. Thus, we are confident that we have a realistic estimate of the averaged behavior.
 Line 335: Is it possible to give an indication of the actual amount/size of data that is stored (required level of storage for if someone wanted to repeat this).
will be added
 Lines 339-345: Please add units to all of the characteristics of the WAM.
 Lines 339345: Please add units to all of the characteristics of the WAM.
will be added
 Section 2.4: Please give the units for each of the QoIs.
will be added
 Line 364: ‘…all QoIs are averaged over the study period…’ Please give more detail and clarity on the averaging periods/resolution (here, and/or with the individual QoI’s below). How are they averaged? – Daily? 6-hourly?
We will add more detail here and make clear which time resolution is used for averaging.
 Line 371: ‘…the longitudinal range is chosen…’ Is this a fixed longitudinal range that is the same for all simulations?
Yes, this is chosen for all simulation outputs, as the topography is the same, and to make the results comparable.
 Line 477: I think it might be useful to give a full definition of the parameter names on their first use in this section for the general reader, as they are not obvious from the acronyms.
will be added
 Figure 6: The labelling ’1), 2),…’ is difficult to see, especially when under dark shading. Could the numbering not be included with the names on the left/right for better clarity?
will be clarified. We tried to use a consistent layout/labelling with the following figures, but we will make this clearer.
 Lines 515: ‘…all other model parameters are set to their mean values…’ Why is the mean value used for this choice? And not, say, the model’s default values? How does this fixed choice for other parameters affect the results shown in Figure 5?
The mean values correspond to the default values as the PDFs are defined that way. In L513-514 we emphasized that this illustration is meaningful, i.e. it is expected to be similar (only having vertical shifts) if the other parameters are chosen differently. We will elaborate on this in more detail.
 Lines 522-523: Here and elsewhere (including the Fig 7, Fig 8 and Fig 9 captions) I am very concerned about, and do not agree with, the interpretation of ‘p-values’ for the Kruskal-Wallis testing. For p-values, the general rules from basic statistics are that a p-value, p ≥ 0.05 shows no evidence against the null hypothesis, H0, being tested, that 0.01 < p ≤ 0.05 indicates weak evidence against H0, that 0.001 < p ≤ 0.01 indicates strong evidence against H0, and then p ≤ 0.001 is very strong evidence against H0. Hence, to say that 0.05 < p < 0.1 shows high significance, and p < 0.05 shows very high significance is just clearly misleading. Please update the results and figures to have an appropriate interpretation of the p-values.
Thank you for this comment. We will adjust the interpretation of the statistical test. The terms "very high significance" and "high significance" are misleading. We prefer to use levels (in percentage) rather than the chosen labels. The interpretation will then need some adaptation.
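The evidence scale quoted in the referee comment can be written down as a small lookup, which may help when relabelling the figures. This is only a sketch of that verbal scale; the function name is hypothetical, and because the quoted ranges overlap at exactly p = 0.05, the sketch assumes that value falls in the "weak" category.

```python
def evidence_against_h0(p: float) -> str:
    """Map a p-value to the verbal evidence scale quoted by the referee.

    Assumption: p = 0.05 is treated as 'weak' evidence, since the quoted
    ranges (p >= 0.05 and 0.01 < p <= 0.05) overlap at that value.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie in [0, 1]")
    if p <= 0.001:
        return "very strong"
    if p <= 0.01:
        return "strong"
    if p <= 0.05:
        return "weak"
    return "none"


# Under this scale, a p-value between 0.05 and 0.1 carries no evidence
# against H0 rather than 'high significance'.
print(evidence_against_h0(0.08))
print(evidence_against_h0(0.03))
```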
 As another example for the use of Universal Kriging in the Atmospheric Sciences, the following publication has recently employed it to obtain counterfactuals for shipping pollution: Diamond, M. S., Director, H. M., Eastman, R., Possner, A., & Wood, R. (2020). Substantial Cloud Brightening from Shipping in Subtropical Low Clouds. AGU Advances, 1, e2019AV000111. https://doi.org/10.1029/2019AV000111
See above comment (L114): We will include this in our literature review.
 Just as a suggestion, I wonder whether some readers, notably those familiar with Gaussian Process emulators with Leeds involvement, might find it easier to relate to Section 2.2.2 if function choices were contrasted to those used in this literature, which to my knowledge, e.g., often assumes a Matérn covariance structure, and would refer to the “aleatoric uncertainty due to weather noise” as “nugget effect”.
We agree that using other covariance functions such as the Matérn covariance is often meaningful. Even though the squared exponential function worked very well in our case, we will now mention the Matérn function as an established alternative choice. Also, we will refer to the ‘nugget effect’ to make the explanation more accessible to readers from different communities.
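For readers comparing the two conventions, the covariance choices mentioned here can be written out in standard textbook form. This is a generic sketch rather than the manuscript's exact formulation: the symbols σ_f (signal standard deviation), ℓ_d (length scale of input d) and the scaled distance r, as well as the choice of Matérn smoothness 5/2, are assumptions made for illustration.

```latex
% Scaled distance between two input points x and x' in D dimensions
r(x, x') = \sqrt{\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\ell_d^2}}

% Squared exponential covariance
k_{\mathrm{SE}}(x, x') = \sigma_f^2 \exp\!\left(-\tfrac{1}{2}\, r^2\right)

% Matern covariance with smoothness 5/2
k_{\mathrm{M}5/2}(x, x') = \sigma_f^2 \left(1 + \sqrt{5}\, r + \tfrac{5}{3}\, r^2\right) \exp\!\left(-\sqrt{5}\, r\right)

% i.i.d. Gaussian noise ("nugget effect") added on the diagonal
\operatorname{Cov}[y_i, y_j] = k(x_i, x_j) + \sigma_n^2\, \delta_{ij}
```

The squared exponential yields infinitely differentiable sample functions, while the Matérn 5/2 form is only twice differentiable, which is one reason it is often preferred for rougher response surfaces.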
 I would ask the authors to discuss further why there is so little interaction between the parameters. After all, the quantification of such interactions is a key strength of their approach. Could this be a consequence of the domain expertise that went into the selection of the 6 parameters?
When selecting the 6 parameters, we aimed to include various effects on the WAM system, but we did not expect the parameter interactions to be so weak. We would expect the interactions to be larger if we broadened the parameter ranges (PDFs).
 To what extent are the below-cloud parameters related to cold-pool dynamics? Does their weak control on WAM characteristics imply anything for the relevance of parameterizing cold pools?
The below-cloud parameters control how much rain is evaporated underneath the clouds and therefore modify surface rain and thermodynamic profiles. More evaporation creates cooler subcloud layers, which in turn leads to a larger negative buoyancy relative to neighboring grid cells and thus a larger lateral acceleration. This bears some resemblance to having stronger cold pools, but the grid spacing we use in our experiments (13 km) is probably not fine enough to fully resolve this process, including the triggering of new storms through cold pools. Nevertheless, the results we find for these parameters give a first indication of the general relevance of cold pools for the monsoon system and thus of the potential gain from a cold pool parameterization, which would attempt to represent the subgrid aspects of the problem.
 The choice of using a 4-year “climatology” seems an important one, especially since emulators are cross-validated, and not tested on unseen data (i.e. an unseen 4-year period). Even though I would be surprised if the overall results would depend on this choice, some further elaboration would be helpful.
See answer above to “Line 330”
 Section 2.5 was not detailed enough for me to fully grasp how the spatially resolved results were obtained.
The explanation is indeed quite theoretical. We will add some detail to make it more intuitive.
 L459ff: Isn’t the aleatoric uncertainty sigma_n quantified?
Yes, it is determined by maximizing the likelihood (as are all hyperparameters) and then gives insight into the aleatoric uncertainty of the surrogate model. We will explain this in more detail.
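As a hedged illustration of what maximizing the likelihood means here, the log marginal likelihood of a Gaussian process with a known mean vector m can be written as below. This is the generic textbook expression and a simplification of the universal kriging setting, where the trend coefficients are estimated as well; the symbols θ (kernel hyperparameters), K_θ (kernel matrix over the n training inputs) and m are assumptions for illustration.

```latex
% Covariance of the noisy training outputs
K_y = K_\theta + \sigma_n^2 I_n

% Log marginal likelihood of the n training outputs y
\log p(y \mid X, \theta, \sigma_n) =
  -\tfrac{1}{2} (y - m)^{\top} K_y^{-1} (y - m)
  - \tfrac{1}{2} \log \det K_y
  - \tfrac{n}{2} \log 2\pi

% Hyperparameters, including the noise level, chosen jointly
(\hat{\theta}, \hat{\sigma}_n) = \arg\max_{\theta,\, \sigma_n} \log p(y \mid X, \theta, \sigma_n)
```

The fitted value of σ_n then serves as an estimate of the scatter in the training outputs that the smooth kernel part cannot explain.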
 L382f: the “factorization” strategy needs elaboration.
will be clarified
 L149: description of tkhmin needs more detail.
More detail will be added with another reference to literature of the DWD.
Citation: https://doi.org/10.5194/egusphere-2023-1922-AC1