I'm pretty sure it's just serially summing the network weights, which results in an accumulated offset to the self-attention layers of the transformer. It doesn't do any kind of analysis of the networks before applying them to make them "play nice" together; it just loops and sums.
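To make that concrete, here's a rough sketch of the "loop and sum" behaviour I mean. This is not taken from any actual UI's code: `apply_networks`, the `(delta, strength)` pairs, and the tiny 2x2 matrices are all made up for illustration, and I'm assuming each network boils down to a weight delta plus a strength multiplier.

```python
# Hypothetical sketch: names and shapes are illustrative, not from any real codebase.

def apply_networks(base_weight, networks):
    """Return base_weight plus every network's scaled delta, applied one after another."""
    # Copy row by row so the original weight matrix is left untouched.
    merged = [row[:] for row in base_weight]
    for delta, strength in networks:
        # No cross-network analysis here: each delta is simply added on top.
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += strength * value
    return merged


# Toy stand-in for one attention weight matrix and two networks.
base = [[1.0, 0.0], [0.0, 1.0]]
net_a = ([[0.5, 0.0], [0.0, 0.5]], 1.0)  # (delta, strength)
net_b = ([[0.0, 2.0], [2.0, 0.0]], 0.5)

merged = apply_networks(base, [net_a, net_b])
```

If it really works like this, the order of the networks wouldn't matter, since plain addition commutes, and nothing would stop two networks from pushing the same weights in conflicting directions.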
Uh, this isn't true at all, at least with auto1111.