It seems "trivial" enough to detect a total shot change, and then for things like fast-moving action sequences you just collapse all the short shots into a single longer shot (e.g. any shot of 5 seconds or less gets merged with any neighboring shot that is also 5 seconds or less).
And then pick a single frame per shot: either the one from the exact middle, or else the most "still" frame, i.e. the one that shows the least change from its neighboring frames.
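To make that concrete, here is a rough Python sketch of the two steps after detection. It assumes the shot boundaries and per-frame difference scores already come from some detector that is not shown. The names `collapse_short_shots`, `middle_frame`, `stillest_frame` and `frame_diffs` are made up for illustration, and averaging the differences to the previous and next frame is just one possible reading of "least change from neighboring ones".

```python
from typing import List, Sequence, Tuple

Shot = Tuple[float, float]  # (start_sec, end_sec); assumed sorted and contiguous


def collapse_short_shots(shots: Sequence[Shot], max_short: float = 5.0) -> List[Shot]:
    """Merge every run of adjacent shots that are each <= max_short seconds."""
    merged: List[Shot] = []
    prev_was_short = False
    for start, end in shots:
        is_short = (end - start) <= max_short
        if is_short and prev_was_short:
            # Extend the previous (short) shot instead of starting a new one.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
        prev_was_short = is_short
    return merged


def middle_frame(first: int, last: int) -> int:
    """Frame index at the middle of the inclusive range [first, last]."""
    return (first + last) // 2


def stillest_frame(frame_diffs: Sequence[float], first: int, last: int) -> int:
    """Frame in [first, last] with the lowest average change to its neighbors.

    Assumption: frame_diffs[i] is some difference score between frame i and
    frame i + 1 (e.g. mean absolute pixel difference), computed elsewhere.
    Only neighbors inside the same shot are counted.
    """
    if first == last:
        return first

    def score(i: int) -> float:
        neighbors = []
        if i > first:
            neighbors.append(frame_diffs[i - 1])
        if i < last:
            neighbors.append(frame_diffs[i])
        return sum(neighbors) / len(neighbors)

    return min(range(first, last + 1), key=score)
```

Note that a merged run can end up longer than 5 seconds. The check is on each original shot's length, so a long burst of quick cuts still collapses into one shot.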