Ah, I didn't realize that always happened. I thought it was only if you did something that might have OS specific rendering characteristics (text-draws, etc).
Unfortunately canvas (rgb'ish) can't overlay as efficiently as <video> (yuv'ish), so there is some power cost relative to the lowest power video overlays.
It really only matters in long form content where nothing else on the page is changing though.
Basically after detecting silence for 30 seconds or so it switches from a sink backed by the OS audio device to a null sink.
Note: since this uses a different clock than the audio device, we have received some reports that, when the context is finally used, there can be some distortion at specific tones. The workaround is for sites to use the suspend/resume API mentioned in the article.
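For anyone who wants to try it, the workaround is just the standard AudioContext suspend/resume calls. A minimal sketch (the function names and the 0.2 s beep are only illustrative):

  const ctx = new AudioContext();

  // Suspend once you expect a long stretch of silence (e.g. the UI goes idle),
  // so the context isn't ticking against the wrong clock in the background.
  async function goIdle(): Promise<void> {
    await ctx.suspend();
  }

  // Resume right before producing sound again; this re-attaches the context
  // to the real audio device and its clock.
  async function playBeep(): Promise<void> {
    if (ctx.state === "suspended") {
      await ctx.resume();
    }
    const osc = ctx.createOscillator();
    osc.connect(ctx.destination);
    osc.start();
    osc.stop(ctx.currentTime + 0.2);
  }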
It's unfortunately Chromium-only for now, and I wanted to keep code simple. I've got a PoC lying around with VideoFrame and whatnot, but I thought this would be better for a post.
I still remembered my id from all those years ago, 1569200. I was excited to read others were logging in with their old numbers, so I tried the password I thought I had used, but no luck.
Very cool! I think it's missing some entries though. I'm pretty sure we've had at least one in third_party/ffmpeg. Those fixes often land upstream first which might make tracking difficult.
Even if you have crash reporting disabled there should be a .dmp generated somewhere in the user profile directory. Manually uploading that to a bug at https://crbug.com/new would allow a Chrome developer to debug it.
If you can't share the dump for similar reasons to why you have crash reporting disabled, you can build minidump_stackwalk from Chromium and use it to generate an unsymbolized stack trace that you can post to the bug. A Chrome developer can then symbolize it.
Thanks for the nice write-up! I work on the WebCodecs team at Chrome. I'm glad to hear it's mostly working for you. If you (or anyone else) have specific requests for new knobs regarding "We may need more encoding options, like non-reference frames or SVC", please file issues at https://github.com/w3c/webcodecs/issues
And I have some issues with the copyTo method of VideoFrame: on mobile (Pixel 7 Pro) it is unreliable and outputs an all-zero Uint8Array beyond 20 frames, to the point that I am forced to render each frame to an OffscreenCanvas instead. Also, the many frame output formats around RGBA/R8, with reduced range (16-235) or full range (0-255), make it hard to use in my convoluted way.
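For context, the two read-back paths look roughly like this (just a sketch of the pattern, not my exact code):

  // Direct path: copy the frame's pixel data into a typed array.
  async function readPixels(frame: VideoFrame): Promise<Uint8Array> {
    const data = new Uint8Array(frame.allocationSize());
    await frame.copyTo(data);
    return data; // layout depends on frame.format (e.g. I420, NV12, RGBA)
  }

  // Fallback path: draw the frame and read RGBA back from a 2D context.
  function readViaCanvas(frame: VideoFrame): ImageData {
    const canvas = new OffscreenCanvas(frame.displayWidth, frame.displayHeight);
    const ctx = canvas.getContext("2d")!;
    ctx.drawImage(frame, 0, 0);
    return ctx.getImageData(0, 0, canvas.width, canvas.height);
  }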
Please file an issue at https://crbug.com/new with the details and we can take a look. Are you rendering frames in order?
Android may have some quirks due to legacy MediaCodec restrictions around how we more commonly need frames for video elements: frames only work in sequential order, since they must be released to an output texture to be accessed (and releasing invalidates prior frames, to speed things up on very old MediaCodecs).
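In practice that means consuming outputs in arrival order and closing each frame promptly; a rough sketch of that pattern (the canvas setup is just illustrative):

  const canvas = new OffscreenCanvas(1280, 720);
  const canvasCtx = canvas.getContext("2d")!;

  const decoder = new VideoDecoder({
    // Render (or copy out) each frame and close it right away so the
    // underlying MediaCodec buffers can be recycled.
    output: (frame: VideoFrame) => {
      canvasCtx.drawImage(frame, 0, 0);
      frame.close();
    },
    error: (e: DOMException) => console.error(e),
  });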
* maybe possible already, but it’s not immediately clear how to change the bitrate of the encoder dynamically when doing VBR/CBR (seems like you can only do it with per-frame quantization params which isn’t very friendly)
* being able to specify the reference frame to use for encoding p frames
* being able to generate slices efficiently / display them easily. For example, Oculus Link encodes 1/n of the video in parallel encoders and decodes similarly. This way your encoding time only contributes 1/n frame encode/decode worth of latency because the rest is amortized with tx+decode of other slices. I suspect the biggest requirement here is to be able to cheaply and easily get N VideoFrames OR be able to cheaply split a VideoFrame into horizontal or vertical slices.
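On the last point, the cheapest way I can see to express slices with today's API would be cropping with visibleRect and feeding each strip to its own encoder. A rough sketch (whether this is actually zero-copy, and whether arbitrary crop rectangles are accepted for a given pixel format, is implementation-dependent):

  // Split a frame into n horizontal strips via visibleRect cropping.
  function splitIntoStrips(frame: VideoFrame, n: number): VideoFrame[] {
    // Keep the strip height even, since 4:2:0 formats may require alignment.
    const stripHeight = Math.floor(frame.codedHeight / n / 2) * 2;
    const strips: VideoFrame[] = [];
    for (let i = 0; i < n; i++) {
      strips.push(new VideoFrame(frame, {
        visibleRect: {
          x: 0,
          y: i * stripHeight,
          width: frame.codedWidth,
          height: stripHeight,
        },
      }));
    }
    return strips;
  }

  // Each strip would then go to its own VideoEncoder:
  //   encoders[i].encode(strips[i]); strips[i].close();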
* Does splitting frames in WebGPU/WebGL work for the use case here? I'm not sure we could do anything internally (we're at the mercy of hardware decode implementations) without implementing such a shader.
> what kind of scheme are you thinking beyond per frame QP
Ideally I'd like to be able to set the CBR / VBR bitrate instead of some vague QP parameter that I manually have to profile to figure out how it corresponds to a bitrate for a given encoder. Of course, maybe encoders don't actually support this? I can't recall. It's been a while.
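For reference, the declarative side looks like this today; changing the target mid-stream presumably means a full reconfigure (the codec string and numbers are illustrative, and whether reconfigure preserves encoder state is implementation-dependent):

  const encoder = new VideoEncoder({
    output: (chunk, metadata) => { /* mux / send */ },
    error: (e) => console.error(e),
  });

  const config: VideoEncoderConfig = {
    codec: "avc1.42001f",     // illustrative
    width: 1280,
    height: 720,
    framerate: 30,
    bitrate: 2_000_000,       // 2 Mbps target
    bitrateMode: "constant",  // or "variable"
  };
  encoder.configure(config);

  // Later, the only obvious way to "change the bitrate dynamically":
  encoder.configure({ ...config, bitrate: 4_000_000 });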
> Does splitting frames in WebGPU/WebGL work for the use case here? I'm not sure we could do anything internally (we're at the mercy of hardware decode implementations) without implementing such a shader.
I don't think you need a shader. We did it at Oculus Link with existing HW encoders and it worked fine (at least for AMD and NVidia - not 100% sure about Intel's capabilities). It did require some bitmunging to muck with the NVidia H264 bitstream to make the parallel QCOM decoders happy with slices coming from a single encoder session* but it wasn't that significant a problem.
For video streaming, supporting a standard for webcams to deliver slices with timestamped information about the rolling shutter (+ maybe IMU for mobile use cases) would help create a market for premium low-latency webcams. You'd need to figure out how to implement just-in-time rolling shutter corrections on the display side to mitigate the downsides of rolling shutter, but the extra IMU information would be very useful (many mobile camera display packages support this functionality). VR displays often have rolling shutter, so a rolling-shutter webcam + display together would really make it possible to do "just in time" corrections for where pixels end up to adjust for latency. I'm not sure how much you'd get out of that, but my hunch is that if you work out all the details you should be able to shave off nearly a frame of latency glass to glass.
Speaking of adjustments, extracting motion vectors from the video is also useful, at least for VR, so that you can give the compositor the relevant information to apply last-minute corrections for that "locked to your motion" feeling (counteracts motion sickness).
On a related note, with HW GPU encoders, it would be nice to have the webcam frame sent from the webcam directly to the GPU instead of round-tripping into a CPU buffer that you then either transport to the GPU or encode on the CPU - this should save a few ms of latency. Think NVidia's Direct standards but extended so that the GPU can grab the frame from the webcam, encode & maybe even send it out over Ethernet directly (the Ethernet part would be particularly valuable for tech like Stadia / GeForce now). I know the HW standards for that don't actually exist yet, but it might be interesting to explore with NVidia, AMD, and Intel what HW acceleration of that data path might look like.
* NVidia's encoder supports slices directly and has an artificial limit on the number of encoder sessions on consumer drivers (they raised it in the past few years but IIRC it's still anemic). That, however, means that the generated slices have some incorrect parameters in the bitstream if you want to decode them independently, so you have to muck with the bitstream in a trivial way so that the decoders see independent, valid H264 bitstreams they can decode. On AMD you don't have a limit on the number of encoder sessions.
Ah I see what you mean. It'd probably be hard for us to standardize this in a way that worked across platforms which likely precludes us from doing anything quickly here. The stuff easiest to standardize for WebCodecs is stuff that's already standardized as part of the relevant codec spec (e.g, AVC, AV1, etc) and well supported on a significant range of hardware.
> ... instead of round-tripping into a CPU buffer
We're working on optimizing this in 2024; we do avoid CPU buffers in some cases, but not as many as we could.
> It'd probably be hard for us to standardize this in a way that worked across platforms which likely precludes us from doing anything quickly here. The stuff easiest to standardize for WebCodecs is stuff that's already standardized as part of the relevant codec spec (e.g, AVC, AV1, etc) and well supported on a significant range of hardware.
As I said, Oculus Link worked with off-the-shelf encoders. Only the NVidia one needed some special work, and even that's no longer needed since they raised the number of encoder sessions (and the amount of work was really trivial - just adjusting some header information in the H.264 framing). I think all you really need is the ability to either slice a VideoFrame into strips at zero cost and have the user feed them into separate encoders, OR to request sliced encoding and implement it under the hood however works (either multiple encoder sessions or the NVidia slice API if using NVENC). You can even make support for sliced encoding optional and implement it just for the backends where it's doable.
I'm currently working with WebCodecs to get (the long awaited) frame-by-frame seeking and reverse playback working in the browser. And it even seems to work, though the VideoDecoder queuing logic gives me some grief here. Any tips on figuring out how many chunks have to be queued for a specific VideoFrame to pop out?
An aside: to work with video/container files, be sure to check out the libav.js project, which can be used to demux streams (WebCodecs doesn't do this) and even as a polyfill decoder for browsers without WebCodecs support!
Thanks. I appreciate that making an API that can be implemented with the wide variety of decoding implementations is not an easy task.
But to be specific, this is a bit problematic with I-frame-only videos too, and with optimizeForLatency enabled (that does make the queue shorter). I can of course .flush() to get the frames out, but this is too slow for smooth playback.
I think I could just keep pushing chunks until I see the frame I want come out, but that would have to be done in an async "busy loop", which feels a bit nasty. But I think this is also what the "official" examples do.
Something like "enqueue" event (similarly to dequeue) that more chunks after last .decode() are needed to saturate the decoder would allow for a clean implementation. Don't know if this is possible with all backends though.
Often Chrome doesn't know when more frames are needed either, so it's not something we could add an API for unfortunately.
Yes, just feeding inputs 1 by 1 for each dequeue event until you get the number of outputs you want in your steady state is the best way. It minimizes memory usage. I'll see about updating the MDN documentation to state this better.
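A rough sketch of that pattern for the seek case (assuming "chunks" is an array of demuxed EncodedVideoChunks starting at a keyframe and "config" is a matching VideoDecoderConfig, e.g. from your demuxer):

  function decodeUpTo(
    chunks: EncodedVideoChunk[],
    config: VideoDecoderConfig,
    targetTimestamp: number,
  ): Promise<VideoFrame> {
    return new Promise((resolve, reject) => {
      let next = 0;
      let done = false;
      const decoder = new VideoDecoder({
        output: (frame) => {
          if (!done && frame.timestamp >= targetTimestamp) {
            done = true;
            resolve(frame); // caller closes this frame when finished with it
          } else {
            frame.close();  // not the frame we want; release it immediately
          }
        },
        error: reject,
      });
      decoder.configure(config);
      // Feed exactly one chunk at a time; the "dequeue" event tells us when
      // the decoder has consumed it and is ready for the next one.
      const feedOne = () => {
        if (done) return;
        if (next < chunks.length) decoder.decode(chunks[next++]);
        else decoder.flush().catch(reject); // drain whatever is left
      };
      decoder.addEventListener("dequeue", feedOne);
      feedOne();
    });
  }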
Wow, great to see some work in this space. I've been wanting to do reverse playback, frame-accurate seek, and step-by-step forward and back rendering in the browser for esports game analysis. The regular video tag gets you some of the way there, but navigating frame by frame will sometimes jump an extra frame. Likewise, trying to stop at an exact point will often be 1 or 2 frames off where you should be. Firefox is much worse: when pausing at a time you could be +-12 frames from where you should be.
I must find some time to dig into this, thanks for sharing it.
I have it working with WebCodecs, but currently only for i-frame-only videos, and all the decoded frames are read into memory. Not impossible to lift these restrictions, but the current WebCodecs API will likely make it a bit brittle (and/or janky). For my current case this is not a big problem, so I haven't fought with it too much.
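The brute-force version of that approach looks roughly like the sketch below: decode everything up front, keep the VideoFrames in an array, and step or reverse by index. Memory grows with clip length, so this only really works for short clips ("chunks" and "config" are assumed to come from a demuxer):

  async function decodeAll(
    chunks: EncodedVideoChunk[],
    config: VideoDecoderConfig,
  ): Promise<VideoFrame[]> {
    const frames: VideoFrame[] = [];
    const decoder = new VideoDecoder({
      output: (frame) => frames.push(frame),
      error: (e) => console.error(e),
    });
    decoder.configure(config);
    for (const chunk of chunks) decoder.decode(chunk);
    await decoder.flush(); // resolves once every frame has been emitted
    return frames;         // draw frames[i] with drawImage(); close() when done
  }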
Figuring out libav.js demuxing may be a bit of a challenge, even though the API is quite nice as traditional AV APIs go. I'll put out my small wrapper for these in a few days.
Edit: to be clear, I don't have anything to do with libav.js other than happening to find it and using it to scratch my itch. Most demuxing examples for WebCodecs use mp4box.js, which really makes one a bit uncomfortably intimate with the guts of the MP4 format.
It's just a lot of work to get everything right. It's kind of working, but I removed synchronization because the signaling between the WebWorker and AudioWorklet got too convoluted. It all makes sense; I just wish there was an easier way to emit audio.
While you're here, how difficult would it be to implement echo cancellation? The current demo is uni-directional but we'll need to make it bi-directional for conferencing.