Following this recipe means spending a lot of time before you see any results (how much depends on the task you are trying to solve). Classification is an easy task to apply this recipe to, but once you venture into object detection or pose estimation, data collection, labeling, and setting up training and evaluation infrastructure are much more complex.
Can you expand a little bit? I often find that if I skip one or more of the steps mentioned here, later debugging is tremendously harder (and often involves going back to these steps anyway). Some of this advice, like visualization, is well supported in many frameworks, usually through TensorBoard. Other steps are really just good common-sense, try-first-or-you-will-regret-it-later measures that don't require a significant time investment.