Bonferroni's correction on hold-out data is an excellent suggestion. To adapt it to time series forecasting, one could perform temporal cross-validation with rolling windows and track the variance of the performance over time.
Unfortunately, the computation time would explode if the ML method's optimization were rerun naively for every window. Requiring precise measurements of statistical significance would crowd out all researchers except those at Big Tech.
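To make the rolling-window idea concrete, here is a minimal Python sketch of what I have in mind. It is an illustration under my own assumptions, not a reference implementation: the helper names (`rolling_window_splits`, `fold_errors`, `last_value_forecast`, `bonferroni_reject`) are hypothetical, the series is synthetic, and a naive last-value forecaster stands in for a real ML method. The p-values passed to the Bonferroni check are placeholders, since how to obtain valid p-values from autocorrelated fold errors is a separate question.

```python
import numpy as np


def rolling_window_splits(n_obs, train_size, test_size, step=None):
    """Yield (train_idx, test_idx) pairs; each test block directly follows its training window."""
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_obs:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step


def last_value_forecast(train, horizon):
    """Stand-in forecaster (assumption): repeat the last observed value."""
    return np.full(horizon, train[-1])


def fold_errors(series, forecaster, splits):
    """Mean absolute error of the forecaster on each rolling window."""
    return np.array([
        np.mean(np.abs(series[test] - forecaster(series[train], len(test))))
        for train, test in splits
    ])


def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m, with m the number of comparisons."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


# Synthetic random walk, used only to exercise the code.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))

splits = list(rolling_window_splits(len(series), train_size=50, test_size=10))
errors = fold_errors(series, last_value_forecast, splits)

# Performance and its variance across windows, i.e. through time.
print("folds:", len(splits))
print("mean MAE:", errors.mean(), "variance of MAE:", errors.var(ddof=1))

# Placeholder p-values for three hypothetical model comparisons.
print(bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05))
```

With a real ML method, each window means one more full fit, and each additional model compared tightens the Bonferroni threshold, which is where the cost concern comes from.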