If you ignore the FPU (I think it can be power gated off), the two cores should be roughly the same size and have roughly the same power consumption.
Dual issue sounds like it would add a bunch of complexity, but ARM describe it as "limited" (and that's about all I can say; I couldn't find any documentation). The impression I get is that it's really simple.
Something along the lines of "if two 16-bit instructions are 32-bit aligned, and they go down different pipelines, and they aren't dependent on each other", then execute both. There might be limitations such as the second instruction not being able to access registers at all (for example, a branch instruction), or that it must only access registers from a separate register file bank, meaning you don't even have to add extra read/write ports to the register file.
If the feature is limited enough, you could get it down to just a few hundred gates in the instruction decode stage, taking advantage of resources in later stages that would have otherwise been idle.
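To make that guess concrete, here's a rough sketch of the kind of pairing check I have in mind. This is purely my speculation, not ARM's documented rule: the `Insn` fields, the pipe names and `can_dual_issue` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical pre-decoded instruction; the fields are illustrative only."""
    size_bits: int                      # 16 or 32
    pipe: str                           # e.g. "alu", "mem", "branch"
    reads: frozenset = frozenset()      # registers read
    writes: frozenset = frozenset()     # registers written


def can_dual_issue(first: Insn, second: Insn, addr: int) -> bool:
    """Guess at a 'limited' dual-issue rule; addr is the address of `first`."""
    # Both instructions must be 16-bit and share one 32-bit aligned word.
    if first.size_bits != 16 or second.size_bits != 16:
        return False
    if addr % 4 != 0:
        return False
    # They must go down different pipelines.
    if first.pipe == second.pipe:
        return False
    # The second must not read or overwrite anything the first writes.
    if first.writes & (second.reads | second.writes):
        return False
    return True


add = Insn(16, "alu", frozenset({"r1", "r2"}), frozenset({"r0"}))
ldr = Insn(16, "mem", frozenset({"r3"}), frozenset({"r4"}))
ldr_dep = Insn(16, "mem", frozenset({"r0"}), frozenset({"r4"}))

print(can_dual_issue(add, ldr, 0x100))      # independent pair, aligned
print(can_dual_issue(add, ldr_dep, 0x100))  # load address depends on the add
print(can_dual_issue(add, ldr, 0x102))      # pair straddles two 32-bit words
```

Each test is a cheap comparison on bits the decoder already has, which is why I think the whole thing could fit in a few hundred gates.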
According to ARM's specs, the Cortex-M33 takes the exact same area as the Cortex-M4 (the rough older equivalent without dual-issue, and arguably equal to the Hazard3), uses 2.5% less power and gets 17% more performance in the CoreMark benchmark.
That is exactly what the "limited dual issue" is - two non-conflicting pre-decoded instructions (either 16b+16b or if a stall has occurred) can be sent down the execution pipe at the same time. I believe the pair must be a memory op and an ALU op.
I would expect the ARM cores to be much larger, as well as use much more power.