Check out this model. I've had limited success with it.
The best I've managed so far is to take the speaker labels it gives and attach them to whichever segments Whisper spits out that overlap in time. That means some sentences end up with multiple speakers, but that's mostly where there's cross-talk anyway. I'd say it gets the speaker right ~80% of the time, with the 5 speakers I've tried it on across ~16 hours of audio.
https://huggingface.co/pyannote/speaker-diarization
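
For anyone wanting to try the overlap approach, here's a rough sketch of the label-merging step only. It assumes you've already got the Whisper segments as dicts with `start`, `end` and `text`, and the diarization output flattened into `(start, end, speaker)` tuples. The names `label_segments` and `overlap_seconds`, the input shapes, and the sample data are made up for illustration, not taken from either library.

```python
from typing import Dict, List, Tuple

# Hypothetical shape for a diarization turn: (start_sec, end_sec, speaker_label)
Turn = Tuple[float, float, str]


def overlap_seconds(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments: List[Dict], turns: List[Turn]) -> List[Dict]:
    """Attach every speaker whose turn overlaps a transcript segment.

    A segment can end up with several speakers (e.g. cross-talk) or none.
    """
    labeled = []
    for seg in segments:
        speakers: List[str] = []
        for t_start, t_end, speaker in turns:
            overlaps = overlap_seconds(seg["start"], seg["end"], t_start, t_end) > 0
            if overlaps and speaker not in speakers:
                speakers.append(speaker)
        # Copy the segment so the input list is left untouched
        labeled.append({**seg, "speakers": speakers})
    return labeled


# Made-up sample data, just to show the shapes
segments = [
    {"start": 0.0, "end": 4.0, "text": "So how did the test go?"},
    {"start": 4.0, "end": 9.0, "text": "Pretty well, I think."},
]
turns = [(0.0, 3.5, "SPEAKER_00"), (3.8, 9.0, "SPEAKER_01")]

result = label_segments(segments, turns)
for seg in result:
    print(seg["speakers"], seg["text"])
```

In the sample, the first segment picks up both speakers because the second turn starts 0.2 s before the segment ends, which is the multiple-speakers-per-sentence effect described above. If that's too noisy, one option would be to keep only the speaker with the largest overlap per segment.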