youtube-dl -k gave me two files, one webm/opus audio file and one mp4/av1 video file.
The files were roughly the same size: the audio was 9728k, the video 9476k.
The video had 127 keyframes out of 17030 frames total, distributed over 567 seconds of video (one keyframe every 4.5 seconds). I don't have a method at hand to measure the encoded size of the keyframes, but if I extract the first 90k of the file using dd (bs=90k count=1), I get 21 frames, so 90k is an upper bound on the size actually needed to encode one single keyframe; more shouldn't be necessary.
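For what it's worth, here is a rough sketch of one way to measure the keyframe sizes directly. It assumes ffprobe is installed and relies on its per-packet listing, where keyframe packets carry a flags value starting with "K" (the exact string differs between ffprobe versions, so only the first character is checked). The function names are made up for illustration, and packet sizes are not exactly the same thing as bytes in the container, so treat the numbers as an estimate.

```python
import json
import subprocess


def probe_packets(path):
    """Run ffprobe and return per-packet JSON for the first video stream."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "packet=size,flags",
        "-of", "json",
        path,
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def keyframe_stats(ffprobe_json):
    """Return (keyframe_count, keyframe_bytes, packet_count, total_bytes)."""
    packets = json.loads(ffprobe_json).get("packets", [])
    key_count = key_bytes = total_bytes = 0
    for pkt in packets:
        size = int(pkt["size"])  # ffprobe emits sizes as strings in JSON
        total_bytes += size
        # Assumption: keyframe packets have a flags string starting with "K".
        if pkt.get("flags", "").startswith("K"):
            key_count += 1
            key_bytes += size
    return key_count, key_bytes, len(packets), total_bytes
```

Something like `keyframe_stats(probe_packets("video.mp4"))` (file name hypothetical) would then give the share of the file that the keyframes actually occupy.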
So of the 9.4M video file, > 99% is waste. Quite literally, since 1% of 9476k is about 95k, which is more than the 90k that already covers a keyframe. With the addition of audio, the amount of waste has to be adjusted to roughly half of the total audio+video data. Still a large amount of data that could be saved, especially at YouTube scale, where traffic reductions in the sub-percent range are material for promotions.
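The arithmetic behind those percentages, as a small sketch. The numbers are the ones quoted in this comment; the assumption (mine, and a generous one) is that a single keyframe of at most 90k is all the video data a static image really needs. The function names are just for illustration.

```python
VIDEO_KIB = 9476      # video file size from the comment
AUDIO_KIB = 9728      # audio file size from the comment
KEYFRAME_KIB = 90     # assumed upper bound for one keyframe (the dd experiment)
DURATION_S = 567
KEYFRAMES = 127


def waste_fractions(video_kib, audio_kib, needed_kib):
    """Return (waste as share of video, waste as share of audio+video)."""
    wasted = video_kib - needed_kib
    return wasted / video_kib, wasted / (video_kib + audio_kib)


def keyframe_interval(duration_s, keyframes):
    """Average spacing between keyframes in seconds."""
    return duration_s / keyframes


video_share, total_share = waste_fractions(VIDEO_KIB, AUDIO_KIB, KEYFRAME_KIB)
print(f"waste within video file: {video_share:.1%}")
print(f"waste within audio+video: {total_share:.1%}")
print(f"keyframe interval: {keyframe_interval(DURATION_S, KEYFRAMES):.2f} s")
```

With these inputs the video-only share comes out just above 99% and the combined share just under half, which matches the figures above.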
This could also be solved by YouTube detecting these situations and simply spacing keyframes farther apart, or using duplicate-frame decimation/VFR encoding. Seeking would still be fast, since the player could just fetch and skip a huge bunch of empty P-frames.
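To illustrate the seeking argument, a minimal sketch of what a player would have to do with sparse keyframes: find the last keyframe at or before the target, then decode forward through the frames in between. This is a toy model with a constant frame rate, not how any particular player is implemented, and `seek_plan` is a made-up name.

```python
import bisect


def seek_plan(keyframe_times, target, fps):
    """Return (start_time, frames_to_skip) for seeking to `target` seconds.

    keyframe_times must be sorted ascending. Decoding starts at the last
    keyframe at or before the target; the frames in between are decoded and
    discarded (cheap if they are near-empty P-frames).
    """
    i = bisect.bisect_right(keyframe_times, target) - 1
    if i < 0:
        raise ValueError("target precedes the first keyframe")
    start = keyframe_times[i]
    return start, int(round((target - start) * fps))


# Keyframes only every 5 minutes, seeking to just before the second one:
print(seek_plan([0.0, 300.0], 299.0, 30))
```

Even in that worst case the player skips a few thousand frames, which is only fast if those frames really are close to empty.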
For static image videos YouTube's H.264 version of the video is generally significantly smaller than either the VP9 or AV1 versions. Either YouTube's detection and modification of the encoder settings for this type of video or something intrinsic to the x264 encoder delivers better results from their H.264 encoding pipeline.
The 480p video-only file sizes for this video as reported by youtube-dl have VP9 at 9.70MiB, AV1 at 9.25MiB, and H.264 at 3.55MiB.
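Put as average bitrates, a quick sketch. It assumes this is the same 567-second video discussed upthread (the AV1 size matches: 9476k is about 9.25MiB); the helper name is made up.

```python
DURATION_S = 567  # assumption: same video as the 567-second one upthread
SIZES_MIB = {"VP9": 9.70, "AV1": 9.25, "H.264": 3.55}


def avg_bitrate_kbps(size_mib, duration_s):
    """Average bitrate in kbit/s for a file of size_mib MiB."""
    return size_mib * 1024 * 1024 * 8 / duration_s / 1000


for codec, size in SIZES_MIB.items():
    ratio = size / SIZES_MIB["H.264"]
    rate = avg_bitrate_kbps(size, DURATION_S)
    print(f"{codec}: {rate:.0f} kbit/s, {ratio:.2f}x the H.264 size")
```

So the VP9 and AV1 files are well over twice the size of the H.264 one for what is a single still picture.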
So we give them something 13 kb-ish[0] and they compress it to 4.16 MB. I'm sure I'm already not putting it kindly enough. Combined with the tiny battery in my phone I really feel like people are pulling a prank on me. I'm a behavioral experiment now! I think my roomba just looked at me.
YouTube's a video encoding site. The original video that was given to YouTube will have been substantially larger than 13 kb-ish.
If your content is audio only then use an audio hosting site like SoundCloud. Or, if you really want to use YouTube, then you need to get them to develop an audio-specific option.
I think we already have tons of brilliant people developing video codecs. I'm sure they are all well aware there isn't a way to define part(s) of a video as a slide show. What is missing is people pointing out how silly this is. The keyframes also define the seek points. It's a mess.
Yes, the formats and players should be changed to also allow more seek points than the key frames, exactly for these "picture not changing" scenarios but with the audio content we want to seek and consume.
And then additionally, there should be better recognition of the "picture not changing" conditions, to allow better use of such a feature.
I moan about this a lot with codec ppl. I think they are just building on top of what they have?
I was trying to distribute a 3 hour lecture with a single slide. I suppose seeing the speaker for a few seconds would be nice too. haha, I'm asking a lot? The slide had a lot of detail so the video uhhh I mean the jpg got enormous, and with few keyframes, no seeking? No thanks, I'll do the "compression" in javascript. (eeewww!)
If it's a solid color or simple enough for the static image compression algorithms in play to optimize away, even those keyframes would be tiny. Some codecs will just optimize those keyframes away entirely.