Below are some thoughts and recommendations; I suggest moving forward with this manuscript with minor revisions.
The authors present a revised version of their ML-based surface front detection method. One certainly cannot dispute the usefulness of feature-based detection methods, but my reservation about whether manual surface analysis should guide the development of a next-generation automated feature-based method remains.
The discrepancy between the traditional TFP-based methods, which arguably have their own shortcomings, and the presented method for emulating DWD and WCP fronts does not, in my opinion, indicate a particular weakness of the TFP-based methods. It must instead be considered in light of the weakness of the manual surface maps, which either fail to account for, erroneously indicate, or displace relevant surface features otherwise correctly detected by the automated TFP-based method. The question remains as to what should be accepted as ground truth, and as stated earlier, I cannot recommend relying fully on manual analysis. This position is in contrast to several statements by the authors, who continue to hold on to manual analysis as a ground truth and consequently continue to argue throughout the manuscript that traditional TFP-based methods are outperformed. The answer might simply be that the manual analysis is erroneous in many cases and the ML method has learned this bias, while the TFP-based method – in fact – outperforms both. It is simply a fundamentally different viewpoint.
Even though the authors have improved their comparison between a TFP method and their new method, I still think the comparison is incorrect. A proper choice of baseline method must always be seen relative to the ground truth. Here, manual surface charts, which are drawn based on several variables at several heights, are used as ground truth, while the chosen baseline uses one variable at one height and was never developed with the specific goal of reproducing surface charts. As already mentioned, I recommend removing this comparison, since it is not necessary for the publication. Only the comparison with an earlier ML method would make sense. However, since even the authors of this study argue in their reply that they are unable to handle the code provided by one of the earlier ML methods, I am concerned about the reproducibility of these studies. At the end of this document, I recommend another ML method that uses the same ground truth – maybe this code is more user friendly and can serve as the baseline method the authors wish to have.
Nevertheless, in their revised introduction, the authors have addressed aspects of this discussion. Since the need for labeled training data is so central to their method, there is basically no option for the training of the ML method but to accept the authors' decision and evaluate the manuscript from their viewpoint.
To me, then, the automated method gives us gridded front data that might be useful for meteorological research related to phenomena associated with the passage of surface fronts. The presented example, a confirmation of earlier studies that addressed the question of front-related extreme precipitation events, unfortunately leaves open the question of what exactly can be learned using ML-based methods, given that the authors basically show that the method reproduces exactly what was previously found using a TFP-based method. Recommendation: It would be helpful to at least give some indication at the end of this section of what exact new physical insight can now be generated with the new method that could not be generated before.
The climatological application is otherwise a very nice example that would motivate a section on the explainability of data-driven methods. While the presented method produces climatological patterns in agreement with previous findings, it is, beyond that, capable of separating different front types in a clear manner. Of particular interest would be to understand which variables are key to the learning process, and whether the climatological patterns would look different if the model were trained on only a single variable. A particular strength of the ML method could be its ability to reproduce manual analysis from a low number of input features. To me, however, the results seem to stem from the combination of various input channels, while traditional methods often rely on a single variable, which seems insufficient to separate different front types. Layer-wise relevance propagation might be a simple way of showing which variables allow the network to develop this ability. Recommendation: It would be useful to give some indications in this direction at the end of the corresponding section.
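As a concrete illustration of the kind of attribution analysis suggested above, the following sketch uses gradient×input, a simple first-order relative of layer-wise relevance propagation, to rank input channels by their contribution to one output class. The model, channel count, and grid size are placeholders, not the authors' network:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the authors' U-Net: any model mapping
# (batch, channels, H, W) -> per-class scores works here.
model = nn.Sequential(nn.Conv2d(5, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 4, 1))

x = torch.randn(1, 5, 32, 64, requires_grad=True)  # 5 input variables
out = model(x)

# Gradient x input: a first-order attribution that coincides with LRP
# for piecewise-linear networks under common propagation rules.
score = out[:, 1].sum()          # total score of one front class
score.backward()
relevance = (x.grad * x).abs()   # shape (1, 5, 32, 64)

# Aggregate over space: which input variable drives this class?
per_channel = relevance.sum(dim=(0, 2, 3))
print(per_channel)
```

Repeating this per front class (and per trained model) would directly answer which channels the network relies on to separate front types.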
In the summary it is argued that the method can also be applied to higher-resolution data. I think this is not the case. To make the method mesh-independent, the input training data would need to be converted to continuous space, and training would need to be performed in continuous space, as is done in random-feature methods or, eventually, also in FFT space; operator-learning approaches of this kind effectively learn mappings between Banach spaces.
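To illustrate what a continuous-space, mesh-independent formulation looks like in contrast to a grid-tied network, here is a minimal random-feature regression sketch: the model is defined on continuous coordinates, so a fit obtained from a coarse grid can be evaluated directly on a finer grid. The target field and feature count are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target field sampled on a coarse 1-D grid (stand-in for a variable field).
x_train = np.linspace(0.0, 1.0, 32)
y_train = np.sin(2 * np.pi * x_train)

# Random Fourier features: the model lives in continuous space, so the
# training grid is just a sample of coordinates, not part of the model.
W = rng.normal(scale=10.0, size=64)
b = rng.uniform(0, 2 * np.pi, size=64)

def features(x):
    return np.cos(np.outer(x, W) + b)

coef, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

# Evaluate on a finer grid the model never saw: mesh independence.
x_fine = np.linspace(0.0, 1.0, 256)
y_fine = features(x_fine) @ coef
```

A U-Net trained on fixed 128x256 crops has no analogous way to be queried at new resolutions without retraining or interpolation of its inputs.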
The authors are encouraged to add more references to their statements relating fronts to, for example, wind gusts or extreme weather.
Some ML related questions:
- Is the batch normalization really needed? Usually, it accelerates the training process and additionally improves the skill. However, from a theoretical viewpoint, it is unclear why this is the case, and thus it might not be needed in this particular application.
- Why is the dropout rate set to 0.2? Is there any overfitting without it? How does this relate to the problem of choosing arbitrary thresholds? I recommend a brief discussion of the sensitivity.
- Why did you choose three dropout layers and average-pooling steps in your U-Net architecture, and not fewer or more?
- Why does the number of channels change from 330 to 64 after the first encoding block, while for all further encoding blocks it increases by a factor of two?
- A reference for the U-Net should also be given to Shelhamer et al. 2016 (doi: 10.1109/TPAMI.2016.2572683).
- L. 345: How did you determine the deformation factor of k=3? Shouldn't the choice be tested against randomness in some way? And, as before, how does this choice relate to the problem of choosing arbitrary thresholds, a common weakness of traditional methods?
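To make the sensitivity recommendations above concrete, a sweep of the kind I have in mind could look like the following sketch, which retrains a toy model at several dropout rates and compares held-out losses. The model, data, and rate grid are placeholders; the same protocol applies to the deformation factor k:

```python
import torch
import torch.nn as nn

# Toy classification data standing in for the front-detection task.
torch.manual_seed(0)
X = torch.randn(256, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()

def val_loss(p_drop: float) -> float:
    """Train a small model with the given dropout rate, return held-out loss."""
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
                          nn.Dropout(p_drop), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X[:192]), y[:192])  # first 192 samples: training
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        return loss_fn(model(X[192:]), y[192:]).item()  # last 64: validation

# Sweep the hyperparameter and report; a flat curve means the choice is benign.
results = {p: val_loss(p) for p in (0.0, 0.2, 0.5)}
print(results)
```

Even a coarse three-point sweep like this, reported in a sentence, would address the arbitrariness concern.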
Several previous studies have questioned the usefulness of front lines in general and for use in next-generation front detection methods; these studies rather recommend using frontal regions or frontal volumes. Is everything done in this section needed simply to obtain front lines?
I recommend removing this section and the corresponding comparison in Section 3.1.1. Also, it is noted that only midlatitude fronts are included for the TFP method, but in Section 3.2.2 the opposite is done.
Even though POD and SR are intuitive measures, I recommend explaining the meaning of nmws and nws more clearly. The latter is “the count of all provided fronts”, the former “all fronts that could be matched”. What does “provided” refer to (provided by whom)? What is a front that is provided but cannot be matched?
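For clarity, my reading of the two measures can be stated in a few lines, with "provided" assumed to mean the predicted front objects and matching done against the analysed fronts; the counts in the example are invented:

```python
# A minimal sketch of object-based POD and SR under the assumption that
# matching pairs predicted front objects with analysed front objects.
def pod(n_matched_obs: int, n_obs: int) -> float:
    """Probability of detection: matched analysed fronts / all analysed fronts."""
    return n_matched_obs / n_obs

def success_ratio(n_matched_pred: int, n_pred: int) -> float:
    """Success ratio: matched predicted fronts / all predicted fronts."""
    return n_matched_pred / n_pred

# Example: 80 of 100 analysed fronts are matched; the model predicted 120
# fronts, of which 80 are matched. A provided-but-unmatched front would be
# a prediction with no analysed counterpart within the matching tolerance.
print(pod(80, 100), success_ratio(80, 120))  # 0.8 0.666...
```

If this reading is correct, stating it this explicitly in the text would remove the ambiguity.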
Fig. 6 is missing a color bar for the gray shading.
Fig. 6 The yellow class is labelled as “no class” but there seems to be no yellow label in the figure.
Overall, I am afraid I do not understand the purpose of this section. Is it about showing that DWD and WCP fronts have gradients?
Fig. 9: What is the variance of the shown averaged values for each line, and are the differences between the methods within or outside, for example, the range given by ±2 standard deviations of the sample that went into the averaging for each method? The lines all look very similar to me and may not be significantly different from each other.
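A simple overlap check of the kind I have in mind, using hypothetical per-sample cross-front profiles in place of the data behind Fig. 9:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented per-sample profiles for two methods (samples x distance bins),
# standing in for the samples averaged into each line of Fig. 9.
method_a = rng.normal(loc=1.00, scale=0.3, size=(500, 21))
method_b = rng.normal(loc=1.05, scale=0.3, size=(500, 21))

def band(samples):
    """Mean profile with a +/- 2 standard deviation envelope per bin."""
    mean = samples.mean(axis=0)
    sd = samples.std(axis=0, ddof=1)
    return mean - 2 * sd, mean + 2 * sd

lo_a, hi_a = band(method_a)
lo_b, hi_b = band(method_b)

# Where the two envelopes overlap, the difference between the averaged
# lines is within the sample spread and should not be over-interpreted.
overlap = (lo_a <= hi_b) & (lo_b <= hi_a)
print(overlap.all())
```

Shading such envelopes behind the lines in Fig. 9 would let readers judge this at a glance.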
In all honesty, this section does not add much to the paper. It should be removed, as the paper can be published without this information.
I am afraid I do not support the use of an attribution measure with an attribution radius defined in terms of degrees. I would assume that 2.5 degrees corresponds to a different area/distance at different latitudes, so you will attribute less precipitation to fronts at higher latitudes, won't you?
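The latitude dependence is easy to quantify: the ground length of a fixed longitude interval scales with the cosine of latitude, so a 2.5-degree radius spans only about half the distance at 60° that it does at the equator (assuming a spherical Earth):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean spherical Earth radius

def lon_degree_km(lat_deg: float) -> float:
    """Ground length of one degree of longitude at the given latitude."""
    return math.radians(1.0) * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))

for lat in (0, 30, 60):
    # Zonal extent of a 2.5-degree attribution radius at this latitude.
    print(lat, round(2.5 * lon_degree_km(lat), 1))
```

A radius defined in kilometres (or a cos-latitude-scaled degree radius) would avoid this systematic bias.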
I am not sure the difference between fr and a(fr) is fully clear. Is the first the number of fronts at a grid point and the second a probability? What do you mean by “grid point p is associated with a front” other than “a front occurs at p”?
Fig. 10: Maybe I missed it but why are the polar regions not shown?
Fig. 10-12: Some words in the figure titles are capitalized, others are not.
The authors may consider the following paper, which appears to target the same ground truth but uses a random forest method. I guess that this is the baseline method the authors are looking for.
Bochenek, B.; Ustrnul, Z.; Wypych, A.; Kubacka, D. Machine Learning-Based Front Detection in Central Europe. Atmosphere 2021, 12, 1312. https://doi.org/10.3390/atmos12101312
This is a good, well-written paper that should be of interest to the readership. However, I have a couple of minor comments that could be addressed in a revised version of the paper.
The paper is longer than it needs to be, and some information is spread across the paper, which makes it difficult to extract the relevant pieces. E.g., the vertical levels are introduced in l.117, but only in l.196 is it mentioned that just 9 pressure levels are used. Why not describe the dataset augmentation together with the data in Section 2.1?
Sections 3.1 and 3.2: I have tried for a while to understand why you present results for the validation AND the test dataset, and gave up. Why do you need Section 3.1? You write in l.300: “We validated our model during training using 1460 samples of data from 2017. We evaluated our trained models on 1 year of data from 2016 using an object based evaluation described as described later in this section.” This does not really explain why you need the two sections. Also, Section 3.1 starts with “The trained models were evaluated on test sets…”, which generates ultimate confusion. Do you lose any information when removing Section 3.1? Maybe I am missing something.
l.193: I do not understand this. You say you “ignore the outer 20 pixel”. But then, in the caption of Figure 4, you say that the brighter areas can be used as input. Are they used as input but not predicted? But then the output domain should be smaller than the input domain in Figure 3…? And why do you crop to 128x256 pixels (l.199)? And then there is again a confusing mention of the 5 degree border in the caption of Table 2…
l.8: I would not call the baseline model “ETH”. ETH is a very large institution.
l.21: Maybe add a reference to the Mei-Yu front?
l.22: “Determining the position and propagation of surface fronts plays an important role for weather forecasting”. Well, for the prediction of the position, yes. But is the same true for the automatic detection? Fronts can easily be identified in field maps by the trained eye. Why do we need the ability to detect them automatically with ML? I do understand why, but it would be good if this were made more explicit in the introduction; otherwise it seems that you have a hammer and are searching for nails.
l.24: What are empirical guidelines?
Section 2.1: Maybe I missed it, but do you actually state the resolution of the NWS and DWD datasets somewhere (or the resolution equivalent of the PNG image)?
Figure 3: I do not understand the encode and decode blocks. Can you add some info here? Also, what are the white boxes the “copy” arrows end in?
l.198: “If both labels are available”. What does this mean? At a certain point in time? Why should this matter?
Table 2: The whole caption should be reformulated. “For the global region this border is included within the mentioned range.” ?
l.242: This paragraph is important but very difficult to understand. It should be rewritten.
l.279: I would not use “t” for the index of the channels as “t” is often used for time.
l.280-282: I do not understand this. “individually for each batch”? “more emphasize onto classification”? Either equation (2) holds, or it does not.
l.289: Why did you not evaluate the baseline at 0.25 degree? I guess there are good reasons, but please state them.
Table 3: You might as well remove the “Stationary” line.
Table 5: “The suffix “all”…” I do not understand this sentence.
l.488: I find this a bit confusing. You would not leave out a certain region in a real-world application, so why here?
l.253: predicted fronts
l.301: remove “described”
l.346: “be be”
l.349: “slight edge”?
l.351: “fact that training”
l.403: “most likely”
l.445: “and the European data”
Caption Figure 7: “on the for the”
l.514: “for is the lack”