(Un?)fortunately H.264 is doing far more than MPEG I-frames. Each frame can reference up to 16 other frames, and each frame is also divided into variable-size blocks 4 to 16 pixels in dimension. This arbitrary blocking of the white frames is likely what is consuming so much space.
If you encoded in MPEG I'm sure you would dramatically reduce the file size, but not as magically as you would think. It will still store a new white I-frame every 16 frames by default, though many encoders will let you specify a different interval.
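To put a rough number on that, here is a minimal sketch of the arithmetic. The 10-minute length and 30 fps are made-up figures for illustration, and `keyframe_count` is a hypothetical helper, not part of any encoder.

```python
def keyframe_count(total_frames: int, keyframe_interval: int) -> int:
    """I-frames stored if a new one starts every `keyframe_interval` frames."""
    if total_frames <= 0:
        return 0
    # Ceiling division: a partial final group still needs its own I-frame.
    return (total_frames + keyframe_interval - 1) // keyframe_interval


# Assumed example: a 10-minute all-white video at 30 fps.
total_frames = 10 * 60 * 30  # 18000 frames

print(keyframe_count(total_frames, 16))            # 1125 identical white I-frames
print(keyframe_count(total_frames, total_frames))  # 1 if the interval spans the whole video
```

So with a 16-frame interval the same white picture gets stored over a thousand times, which is why raising the interval matters for this kind of video.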
In theory I guess YouTube could spend n times as much processing to encode each video in n formats, find the smallest for that particular video and serve that encoding, perhaps only doing so once a video reaches x views to cut out yy% of wasted computation, but then they would have to support n codecs on m devices instead of just 1.
We can assume the YouTube video encoders are decent, and they could likely encode such videos with only one I-frame. Though they may use more I-frames to make seeking faster.
I'm on mobile but I will check that later using yt-dlp and ffmpeg to list the I-frames.
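For anyone who wants to try it first, a rough sketch of that check in Python. It assumes the video has already been downloaded with yt-dlp to a local file, and that `ffprobe` (shipped with ffmpeg) is on the PATH and recent enough to report `pts_time` per frame. The function names are my own.

```python
import json
import subprocess


def ffprobe_command(path: str) -> list[str]:
    """ffprobe invocation that dumps picture type and timestamp per video frame as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "frame=pict_type,pts_time",
        "-of", "json",
        path,
    ]


def iframe_times(ffprobe_json: str) -> list[float]:
    """Timestamps (seconds) of the frames ffprobe reported as I-frames."""
    frames = json.loads(ffprobe_json).get("frames", [])
    return [
        float(frame["pts_time"])
        for frame in frames
        if frame.get("pict_type") == "I" and "pts_time" in frame
    ]


def list_iframes(path: str) -> list[float]:
    """Run ffprobe on a local file and return its I-frame timestamps."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return iframe_times(result.stdout)
```

Calling `list_iframes("video.mp4")` on the downloaded file (the filename is just a placeholder) gives one timestamp per I-frame, so the length of the list and the gaps between entries show how often YouTube's encoder inserted them.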
Given the number of videos that are a single static image plus audio, surely YouTube has built in a special case for them.