I agree that the lack of standards and baselines in the fraud detection space isn't ideal. One example: some fraud products build models using human labels as the prediction target. Radar, on the other hand, tries to predict whether a charge actually turns out to be fraudulent (we use dispute/chargeback data we get directly from card issuers/networks). These are in fact different problems, and because the industry generally doesn't have a consistent target, discourse and comparisons end up more muddled.
(And on class imbalance: we spent quite a bit of time experimenting/analyzing how to deal with it—we found that sampling rate has a marginal impact on performance but not a huge one.)
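To make the sampling-rate experiment concrete, here's a minimal sketch of the kind of setup involved: keep every positive (fraudulent) example and keep each negative with some probability, then vary that rate and compare model performance. The `downsample_negatives` helper is hypothetical, purely for illustration, and not Radar's actual pipeline.

```python
import random

def downsample_negatives(examples, labels, rate, seed=0):
    """Keep all positive (fraud) examples; keep each negative
    with probability `rate`. Illustrative only."""
    rng = random.Random(seed)
    kept = [(x, y) for x, y in zip(examples, labels)
            if y == 1 or rng.random() < rate]
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return xs, ys

# Toy data: 1 fraudulent charge per 100 legitimate ones.
X = list(range(1000))
y = [1 if i % 100 == 0 else 0 for i in range(1000)]

# Try rate=0.1; in an experiment you'd sweep this and compare metrics.
Xs, ys = downsample_negatives(X, y, rate=0.1)
```

One caveat worth noting: if you downsample negatives, the model's raw predicted probabilities are biased upward relative to the true base rate, so you'd typically recalibrate scores (or correct for the sampling rate) before using them as probabilities.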