In this particular case, no: the image wouldn't be sent to the PC to be rendered into its framebuffer; rather, the PC would just draw an empty window and report that window's geometry to the monitor. The monitor, with two HDMI leads plugged in, would be responsible for compositing the inputs together according to the geometry the PC reported, but all internal to itself.
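To make the division of labour concrete, here's a sketch of the PC-side half. Everything in it is hypothetical -- no shipping monitor exposes such an interface -- and the message layout is made up for illustration; a real transport would presumably be DDC/CI or an HDMI-CEC vendor command, if a vendor ever defined one.

    /* Hypothetical sketch: the PC tracks the empty window's rectangle and
       packs it into a vendor-defined "composite input 2 here" message for
       the monitor. The message format is invented for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    struct geom_report {
        uint8_t  input;        /* which HDMI input the monitor should composite */
        uint16_t x, y, w, h;   /* window rectangle in desktop pixels */
    };

    static size_t pack_geom(const struct geom_report *g, uint8_t out[9])
    {
        out[0] = g->input;
        out[1] = g->x >> 8; out[2] = g->x & 0xff;
        out[3] = g->y >> 8; out[4] = g->y & 0xff;
        out[5] = g->w >> 8; out[6] = g->w & 0xff;
        out[7] = g->h >> 8; out[8] = g->h & 0xff;
        return 9;
    }

    int main(void)
    {
        /* e.g. the empty window currently sits at (640,360), sized 1280x720 */
        struct geom_report g = { 2, 640, 360, 1280, 720 };
        uint8_t buf[9];
        size_t n = pack_geom(&g, buf);
        for (size_t i = 0; i < n; i++) printf("%02x ", buf[i]);
        printf("\n");
        return 0;
    }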
That's how hardware-accelerated video decoding used to work back in the Windows XP days, IIRC (before GPU-based desktop composition): the video player would be a blank black square, and the GPU would be told to draw the video at those coordinates.
Because of how it was implemented, you could drag VLC around while the video was playing and the video would stay "behind" everything, with the VLC window acting as a "hole" through which you could see it. (So you could move the window to the left and see half a black square on the left, and the left-most half of the video on the right.)
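That behavior falls out of destination color keying, which is what that era's hardware overlays typically used: the scanout hardware watched the desktop output for a "key" color and substituted the video plane wherever it saw it. A conceptual sketch of just that per-pixel rule (not any real overlay API, and the key was often a specific near-black or magenta value rather than pure black):

    /* Conceptual stand-in for what the overlay/scanout hardware did:
       wherever the desktop framebuffer holds the key color, show the
       independently-positioned video plane instead. */
    #include <stdint.h>
    #include <stdio.h>

    #define KEY_COLOR 0x00000000u   /* the "magic" black the player painted */

    /* desktop, video and out are rows of 0x00RRGGBB pixels */
    void composite_scanline(const uint32_t *desktop, const uint32_t *video,
                            uint32_t *out, int width)
    {
        for (int x = 0; x < width; x++) {
            /* The video plane ignores window positions entirely, which is
               why a dragged window becomes a "hole" onto it. */
            out[x] = (desktop[x] == KEY_COLOR) ? video[x] : desktop[x];
        }
    }

    int main(void)
    {
        uint32_t desktop[4] = { 0xffffff, KEY_COLOR, KEY_COLOR, 0xffffff };
        uint32_t video[4]   = { 0x102030, 0x405060, 0x708090, 0xa0b0c0 };
        uint32_t out[4];
        composite_scanline(desktop, video, out, 4);
        for (int i = 0; i < 4; i++) printf("%06x ", (unsigned)out[i]);
        printf("\n");  /* the video shows through only where the key color was */
        return 0;
    }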
Nowadays, with desktop composition (a.k.a. DWM), Windows just blacks out DRM content in any frames/video it hands to an app that's capturing the screen, and sends the composed desktop, video included, only to the display. (And if you have GPU recording software like NVIDIA ShadowPlay running, it switches off when DRM content starts playing.) You can see it in action with the Netflix UWP app. Of course, a bunch of software that's supposed to respect DRM -- like Netflix in Google Chrome -- doesn't really follow that spec and can still be screenshotted/video-captured like any other app.
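The protected media path that the Netflix UWP app rides on is its own plumbing, but roughly the app-level version of the same idea is SetWindowDisplayAffinity: flag a window and capture APIs get black (or nothing at all) for it, while the physical display still shows it normally. A minimal sketch in plain Win32 C, error handling omitted:

    /* Minimal sketch: opt a window out of screen capture with
       SetWindowDisplayAffinity. Screenshot/capture paths see black or
       nothing for this window; the monitor shows it normally. */
    #include <windows.h>

    #ifndef WDA_EXCLUDEFROMCAPTURE
    #define WDA_EXCLUDEFROMCAPTURE 0x00000011  /* Windows 10 2004+; older SDKs lack the define */
    #endif

    static LRESULT CALLBACK WndProc(HWND h, UINT m, WPARAM w, LPARAM l)
    {
        if (m == WM_DESTROY) { PostQuitMessage(0); return 0; }
        return DefWindowProcA(h, m, w, l);
    }

    int WINAPI WinMain(HINSTANCE hInst, HINSTANCE prev, LPSTR cmd, int show)
    {
        WNDCLASSA wc = {0};
        wc.lpfnWndProc   = WndProc;
        wc.hInstance     = hInst;
        wc.lpszClassName = "NoCaptureDemo";
        RegisterClassA(&wc);

        HWND hwnd = CreateWindowA("NoCaptureDemo", "Try screenshotting me",
                                  WS_OVERLAPPEDWINDOW | WS_VISIBLE,
                                  CW_USEDEFAULT, CW_USEDEFAULT, 640, 360,
                                  NULL, NULL, hInst, NULL);

        /* WDA_MONITOR (contents blacked out in captures) also exists and
           works on older Windows versions. */
        SetWindowDisplayAffinity(hwnd, WDA_EXCLUDEFROMCAPTURE);

        MSG msg;
        while (GetMessageA(&msg, NULL, 0, 0) > 0) {
            TranslateMessage(&msg);
            DispatchMessageA(&msg);
        }
        return 0;
    }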
It provides lower-resolution content to those devices, capped at 720p.
Which is effectively the same as Netflix; Netflix just provides higher resolutions to PCs when they can block screen recording entirely. Disney might eventually do the same, I guess.
This isn't a firmware issue: enabling it would require adding hardware to the scaler ASIC to actually process multiple video streams, and increasing the buffer size and bandwidth n-fold so that it can synchronize and composite the unsynced video sources (which also introduces 1+ frames of latency).
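To put rough numbers on the "n-fold" part (back-of-the-envelope only; assumes 4K60 at 24 bits per pixel, and that each extra unsynced input needs about a full frame of buffering to absorb the misaligned vsyncs):

    /* Back-of-the-envelope cost of compositing a second, unsynced 4K60
       input inside a scaler, assuming 24 bpp. */
    #include <stdio.h>

    int main(void)
    {
        const double w = 3840, h = 2160, fps = 60, bytes_per_px = 3;

        double frame_bytes = w * h * bytes_per_px;   /* one frame      */
        double stream_bw   = frame_bytes * fps;      /* per input feed */

        printf("one 4K frame:        %.1f MB\n",  frame_bytes / 1e6);
        printf("one 4K60 stream:     %.2f GB/s\n", stream_bw  / 1e9);
        printf("two unsynced inputs: %.2f GB/s in, plus ~%.1f MB of frame\n"
               "buffer and at least one frame (%.1f ms) of added latency\n",
               2 * stream_bw / 1e9, frame_bytes / 1e6, 1000.0 / fps);
        return 0;
    }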
The GP was assuming a scaler that already supported a Picture-in-Picture or split-screen feature—as many modern monitors do!
The problem (and it is purely a firmware problem) is that all such monitors make this kind of support "dumb", with hardcoded geometry (e.g. split-screen as exactly halving or quartering the screen; PiP as putting one input in a box composited over exactly the upper-left quarter of the lower-right quadrant of the screen).
There's nothing in the scaler ASIC that particularly benefits from these numbers being hardcoded; its job wouldn't be any more complex if they were controllable via machine registers POKEable using HDMI-CEC commands.
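In other words, the firmware change is roughly "read the PiP rectangle from registers instead of constants." A sketch of what that could look like on the monitor side; the register layout and the payload (presumably riding on CEC's <Vendor Command With ID>, opcode 0xA0) are entirely made up:

    /* Hypothetical monitor-firmware sketch: PiP geometry lives in a few
       scaler registers instead of compiled-in constants, and a
       vendor-specific CEC command just POKEs them. */
    #include <stdint.h>
    #include <stdio.h>

    /* Today: hardcoded "upper-left quarter of the lower-right quadrant".
       Instead: geometry registers the compositor reads every frame.     */
    static volatile uint16_t pip_reg[4];   /* x, y, w, h in output pixels */

    /* Handler for a made-up CEC vendor payload:
       [0x01 = "set PiP rect", x_hi, x_lo, y_hi, y_lo, w_hi, w_lo, h_hi, h_lo] */
    void on_cec_vendor_command(const uint8_t *payload, int len)
    {
        if (len != 9 || payload[0] != 0x01)
            return;                          /* not our hypothetical opcode */
        for (int i = 0; i < 4; i++)
            pip_reg[i] = (uint16_t)((payload[1 + 2*i] << 8) | payload[2 + 2*i]);
    }

    /* The compositor's inner loop doesn't get any harder: it already asks
       "is this output pixel inside the PiP rect?" -- the rect is just no
       longer a constant. */
    int inside_pip(int x, int y)
    {
        return x >= pip_reg[0] && x < pip_reg[0] + pip_reg[2] &&
               y >= pip_reg[1] && y < pip_reg[1] + pip_reg[3];
    }

    int main(void)
    {
        /* e.g. the host asks for a 960x540 PiP at (2880,1620) */
        uint8_t cmd[9] = { 0x01, 0x0B,0x40, 0x06,0x54, 0x03,0xC0, 0x02,0x1C };
        on_cec_vendor_command(cmd, 9);
        printf("PiP rect: %d,%d %dx%d\n",
               pip_reg[0], pip_reg[1], pip_reg[2], pip_reg[3]);
        return 0;
    }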