
One possible approach is to make a model that, besides its intended prediction, also predicts race, and penalise its ability to predict race any better than chance. If the model's representations can't predict race, it has to use non-biased features to make its predictions. But I guess in many situations this would degrade accuracy, and that won't be welcome from a business point of view.


How would one do this? If you have an ML model with outputs classifying credit risk and outputs classifying race, it's easy to learn coefficients that are bad at classifying race but still take race into account when classifying credit risk.

That is, what's your loss function and how does it prevent race from being considered in the credit decision?


It's a setup somewhat similar to GANs (and even closer to a related method called Fader Networks):

- a first network takes the input data and returns a representation A (like an embedding vector): let's call it the "censor network"

- a second network takes this embedding A as input and is trained to predict the class that should be censored (for example, the gender of a person): the "discriminator network"

- a third network takes the same embedding A as input and is trained to predict the real task of interest (for example, the probability of credit default): the "predictor network"

The idea is that training the censor to make the discriminator fail (predict the wrong class) while keeping the predictor accurate forces the censor to learn a transformation of the input data that keeps the task-related information in the embedding A but removes the information correlated with the "censored class" (information that could otherwise be used to discriminate).

Here's a reference on this kind of method, but it's still an active area of study in ML and many papers have followed this one: https://arxiv.org/pdf/1801.07593.pdf
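To make the three-network setup concrete, here's a minimal numpy sketch using linear layers in place of deep networks and a sign-flipped ("gradient reversal") update for the censor. All data, dimensions, and hyperparameters are synthetic and hypothetical, chosen so that one input feature leaks the protected class while the label depends on other features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): feature 0 leaks the protected class z,
# while the label y depends only on features 1 and 2.
n, d, k = 2000, 5, 3
z = rng.integers(0, 2, n).astype(float)        # protected class to censor
x = rng.normal(size=(n, d))
x[:, 0] += 2.0 * z                             # feature 0 is correlated with z
y = (x[:, 1] + x[:, 2] + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Linear stand-ins for the three networks:
# censor C: input -> embedding A; predictor wp, discriminator wq: A -> logit.
C = rng.normal(scale=0.1, size=(d, k))
wp = np.zeros(k)
wq = np.zeros(k)
lr, lam = 0.1, 1.0                             # lam weighs the adversarial term

for step in range(3000):
    A = x @ C                                  # embedding A
    gy = (sigmoid(A @ wp) - y) / n             # grad of task loss w.r.t. logits
    gz = (sigmoid(A @ wq) - z) / n             # grad of discriminator loss
    wp -= lr * A.T @ gy                        # predictor minimises task loss
    wq -= lr * A.T @ gz                        # discriminator minimises its loss
    # Censor minimises the task loss but MAXIMISES the discriminator loss:
    # the flipped sign on the lam term is the gradient reversal.
    C -= lr * (x.T @ (gy[:, None] * wp[None, :])
               - lam * x.T @ (gz[:, None] * wq[None, :]))

A = x @ C
task_acc = ((sigmoid(A @ wp) > 0.5) == (y > 0.5)).mean()
disc_acc = ((sigmoid(A @ wq) > 0.5) == (z > 0.5)).mean()
print(f"task accuracy {task_acc:.2f}, z-from-embedding accuracy {disc_acc:.2f}")
```

If the censoring works, task accuracy stays high while the discriminator's accuracy on the embedding falls toward chance; in a serious evaluation you'd also retrain a fresh discriminator on the frozen embedding to check that the protected information is really gone, not just hidden from this particular adversary.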


Neat. But might it not just be easier to predict the influence of race and then use that to adjust the output/threshold?
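One common post-processing version of that idea can be sketched in a few lines: instead of removing the protected information from the model, keep the scores as-is and pick group-specific decision thresholds so both groups are accepted at the same rate. Everything here is synthetic and hypothetical (scores, group sizes, the 0.8 shift):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic scores (hypothetical): group 1's scores run systematically lower,
# e.g. because the model leaned on a feature correlated with group membership.
scores = rng.beta(2, 5, size=1000)
group = rng.integers(0, 2, size=1000)
scores[group == 1] *= 0.8

target_rate = 0.2                              # desired acceptance rate
# Per-group threshold at the (1 - target_rate) quantile of that group's
# scores, so both groups are accepted at the same rate (demographic parity).
thresholds = {g: np.quantile(scores[group == g], 1 - target_rate)
              for g in (0, 1)}
accept = scores > np.array([thresholds[g] for g in group])

rate0 = accept[group == 0].mean()
rate1 = accept[group == 1].mean()
print(f"acceptance rate: group 0 = {rate0:.2f}, group 1 = {rate1:.2f}")
```

The trade-off is that this only equalises outcomes after the fact, and it requires knowing the protected attribute at decision time; it doesn't remove the protected information from the model's internals the way the censor approach does.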



