Determining Gender of a Name with 80% Accuracy Using Only Three Features

Rangi42 · on April 5, 2016

They never say which last letters are associated with which sex. I'm guessing that "o" is usually male and "a" is female, but, what about the other 24? Is "n" usually male ("-son")? Does "y" have a bias?

(Regarding sex vs. gender: yes, they aren't perfectly correlated and some people don't stay with their assigned gender, but AFAIK they often/usually choose a new name which probably matches their gender? Thus why "dead names" are a thing.)

woodman · on April 5, 2016

Those who find this interesting might also be interested in exploring the additional dimensions offered: state and year. Without using any machine learning, you anchor a fact (name, sex, state, birth year range) and get the probability for the unanchored facts. The more you can anchor the higher your level of confidence. Now this is way off topic, but if that sounded interesting... "Constraint Logic Propagation Conflict Spreadsheets" by William Taysom [0]

[0] https://www.youtube.com/watch?v=voG5-15aDu4
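
A rough sketch of the "anchor some facts, read off the rest" idea, using plain filtering and counting rather than the constraint-propagation approach in the talk. The records, the field names, and the `conditional` helper are all made up for illustration; the real data has far more rows, and this version matches an exact year rather than a range.

```python
from collections import Counter

# Hypothetical toy records: (name, sex, state, birth_year, count).
RECORDS = [
    ("Leslie", "F", "CA", 1990, 120),
    ("Leslie", "M", "CA", 1990, 10),
    ("Leslie", "F", "CA", 1940, 30),
    ("Leslie", "M", "CA", 1940, 70),
    ("Leslie", "F", "NY", 1990, 80),
]
FIELDS = ("name", "sex", "state", "year")


def conditional(records, target, **anchors):
    """Estimate P(target | anchors) by filtering on the anchored fields
    and summing the birth counts of the rows that remain."""
    idx = {field: i for i, field in enumerate(FIELDS)}
    totals = Counter()
    for rec in records:
        if all(rec[idx[key]] == value for key, value in anchors.items()):
            totals[rec[idx[target]]] += rec[4]
    n = sum(totals.values())
    return {key: count / n for key, count in totals.items()} if n else {}


# Anchoring more facts narrows the answer.
print(conditional(RECORDS, "sex", name="Leslie"))
print(conditional(RECORDS, "sex", name="Leslie", year=1940))
```

With these invented numbers, anchoring the birth year flips the most likely sex for the name, which is the kind of effect the extra dimensions are meant to expose.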

danso · on April 5, 2016

Using the SSN baby names data is a staple in my data journalism classes. I like using it as an example of how even a brute force comparison can be decent enough (90% or better) to do gender analysis on a wide variety of public datasets, such as campaign finance and public payroll.

Here's one example I've created regarding the Pulitzer board composition: https://github.com/compciv/gendered-pulitzer-board

Note: it was only an example...obviously a little tweaking is needed to put into actual production. Also, obviously has varied effectiveness on datasets with non-traditional American names. Here's one of the more comprehensive efforts by a student, on New Yorker bylines:

https://github.com/alecglassford/compciv-2016/blob/master/pr...

The SSN name data seems almost certainly flawed to a small degree...I.e. It's just hard to believe that there are dozens of boys named Jennifer (and yet, strangely, no boys named Sue!)...but we're talking about infinitesimal rounding errors. The vast, vast majority of names are 99% one way or the other...with a few exceptions such as Leslie...though you can mitigate that by using older years of the SSN database.

And here's a battery of SQL queries relating to baby names and gender analysis: http://2015.padjo.org/tutorials/babynames-and-college-salari...

So I have to strongly disagree with OP that 80% accuracy is something to be astounded by when it comes to gender classification...however, I do agree that in terms of features, doing a simple frequency count of last characters...or number of vowels overall can be a strong indicator of gender for a name, with female names tending for the softer sound. I wonder how much more using Soundex would add to the accuracy? Creating a trained name classifier would be a fun project in service of a tool that could gender classify how masculine or feminine a made-up name sounds like...which would be a slightly useful tool if you were a fantasy fiction writer, though I suppose if you were to be a successful writer, your ear would be trained well enough for the purpose to not delegate it to a computational tool.
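
For anyone who wants to try the last-character count, here is a minimal sketch. The sample names and both helper functions are hypothetical stand-ins; a real run would read the name files and weight each name by its birth count.

```python
from collections import defaultdict

# Hypothetical toy sample of (name, sex) pairs, not real frequencies.
SAMPLE = [
    ("Maria", "F"), ("Anna", "F"), ("Emily", "F"), ("Megan", "F"),
    ("Marco", "M"), ("Jason", "M"), ("Ethan", "M"), ("Joshua", "M"),
]


def last_letter_table(pairs):
    """Map each final letter to the share of sampled names ending in it
    that are labeled female."""
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex in pairs:
        counts[name[-1].lower()][sex] += 1
    return {ch: c["F"] / (c["F"] + c["M"]) for ch, c in counts.items()}


def vowel_count(name):
    """Count the vowels in a name (treating only a, e, i, o, u as vowels)."""
    return sum(ch in "aeiou" for ch in name.lower())


for letter, share in sorted(last_letter_table(SAMPLE).items()):
    print(letter, round(share, 2))
```

A table like this is also what you would need to answer which letters lean which way, since it gives a female share per final letter instead of a single accuracy figure.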

youngprogrammer · on April 5, 2016

> The SSN name data seems almost certainly flawed to a small degree...I.e. It's just hard to believe that there are dozens of boys named Jennifer (and yet, strangely, no boys named Sue!)...but we're talking about infinitesimal rounding errors. The vast, vast majority of names are 99% one way or the other...with a few exceptions such as Leslie...though you can mitigate that by using older years of the SSN database.

I agree that the SSN data has flaws, but I only took names with at least 20 people. But the classification is probably iffy, as some names are classified as both male and female.

> So I have to strongly disagree with OP that 80% accuracy is something to be astounded by when it comes to gender classification...

I originally hypothesized that I could reach 90% accuracy, but I could only get up to 82% max. As stated in the blog, 80% is the accuracy of a mammogram detecting cancer in a 40-45 year old woman, which is pretty good for 3 features!

> I wonder how much more using Soundex would add to the accuracy? Creating a trained name classifier would be a fun project in service of a tool that could gender classify how masculine or feminine a made-up name sounds like...which would be a slightly useful tool if you were a fantasy fiction writer, though I suppose if you were to be a successful writer, your ear would be trained well enough for the purpose to not delegate it to a computational tool.

This would be very interesting to see!

jventura · on April 5, 2016

I agree with your comment in the sense that 80% is not as large a number as it seems. One has to understand that in statistics the baseline technique is random guessing. In the case of genders for the USA, random guessing would probably have around 50% accuracy (http://kff.org/other/state-indicator/distribution-by-gender/), so 80% is just 30 percentage points above the baseline (there's still the other 2/5 of the gap to cover).

I used to work with Text-Mining techniques and it was quite easy to come up with a statistical metric that could reach 70%-75% accuracy, for almost all problems that I've dealt with (mainly in the sub-area of concept extraction - information retrieval). I assume it is similar for this guy, so that is why he reaches these kinds of values very quickly. For instance, with only the character frequency and order he reports results of 75% which, if you look carefully, is just 20%-25% above the random baseline, not "pure" 75%.
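
The baseline arithmetic can be written out as a small sketch. The 50% chance level is the figure assumed above; the function name is made up for illustration.

```python
def gain_over_baseline(accuracy, baseline=0.5):
    """Return (points above the baseline, fraction of the gap between the
    baseline and a perfect score that the classifier has closed)."""
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    points = accuracy - baseline
    return points, points / (1.0 - baseline)


for acc in (0.75, 0.80):
    points, closed = gain_over_baseline(acc)
    print(f"{acc:.0%}: +{points:.0%} over chance, {closed:.0%} of the gap closed")
```

For 80% accuracy this gives 30 points over chance, or 3/5 of the gap between guessing and a perfect score, which leaves the 2/5 mentioned above.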

Then he trains a classifier (Random Forest) and extracts the most relevant features that the classifier used, which are the presence of an 'a' character in the last two positions of the name (the "order of 'a'" feature seems quite irrelevant to me). With these features he reaches the 80% ("random" + 30%) accuracy, which is a good number but still far from 100%. I would say that these two features are to be expected, as, at least in my mother tongue (Portuguese), feminine names end with an 'a' (e.g. Maria, Ana, Roberta [vs Roberto - masc], etc.).

But all in all it can be considered a good exercise!

amelius · on April 5, 2016

Note that the statistics may depend on birth year for certain names.

realusername · on April 5, 2016

* Determining Gender of an (american) Name [...]

Still impressive but it would be interesting to see the result for other countries. Maybe for some countries, the gender could be harder or easier to guess. I can see how some people would use this to try to reconstruct a database where gender is missing.

carlob · on April 5, 2016

You can probably get to 80% accuracy in Italian just using the last letter. If it's A it's most likely a female, if it's O a man.

Counterexamples among the 50 most common male names given in 2014 are:

    Andrea 2.46%

    Mattia 2.39%

    Luca 1.14%

    Elia 0.45%

Counterexamples among the 50 most common female names given in 2014 are:

    None

There are a handful of names ending with E or a consonant for both genders, which will have to be placed at random.

Source (in Italian): http://www.istat.it/it/prodotti/contenuti-interattivi/calcol...
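
The rule fits in a few lines. This is only a sketch of the heuristic as described here, with a coin flip for the leftover endings; the function name is made up.

```python
import random


def guess_italian(name, rng=random):
    """Guess "F" or "M" from the last letter alone: 'a' means female,
    'o' means male, anything else is placed at random."""
    cleaned = name.strip().lower()
    if not cleaned:
        raise ValueError("empty name")
    if cleaned[-1] == "a":
        return "F"
    if cleaned[-1] == "o":
        return "M"
    return rng.choice(["F", "M"])


for name in ("Maria", "Marco", "Andrea", "Luca"):
    print(name, guess_italian(name))
```

Andrea and Luca come out as "F", which is exactly the error the male counterexamples listed above would produce.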

scotty79 · on April 5, 2016

> If it's A it's most likely a female

Same in Polish. If 'a' is last you get like 98% probability it's a female name, and if the last letter is not 'a' then 98% it's a male name.

carlob · on April 5, 2016

But Kuba is a male name :)

s_q_b · on April 5, 2016

Many names in Romance languages preserve the ending of the original Latin declension. The first declension feminine ending, the "base" ending for a woman's name, is "a."

Thus many Romance countries (and Christian countries, since the Catholic Church requires the child to take the name of a saint for Baptism) have a disproportionate number of female names that end in "a."

The "a" phenom signifying femininity is even suspected to go all the way back to Proto-Indo-European.

SeanDav · on April 5, 2016

I am confused. Why use a dataset to train an AI, when one can just use a decent dataset as a look up table and dispense with the complexity of machine learning, while achieving close to perfect accuracy?

Obviously as an exercise in machine learning, this is perfectly acceptable, but as a solution to a problem, not so much.

youngprogrammer · on April 5, 2016

For me, this was an exercise in practicing machine learning and I found it very interesting that you could get 80% accuracy with so few features. A look up table works very well, but if a name does not exist in the table, you could possibly use some kind of ML to guess the gender.

jknz · on April 5, 2016

As soon as a name is not present in the database, the approach based on machine learning is useful.

It is very likely that the name absent from the db will have features that reveal its gender.

So 1) try db lookup, if absent 2) use a classification algorithm (which will always provide an answer)
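
A minimal sketch of that two-step flow. The table values are invented, and `fallback_guess` is just a last-letter stand-in for whatever trained classifier you would actually plug in.

```python
# Hypothetical lookup table: name -> share of recorded births that were female.
FEMALE_SHARE = {"jennifer": 0.998, "james": 0.004, "leslie": 0.62}


def fallback_guess(name):
    """Stand-in for a trained classifier: a bare last-letter heuristic."""
    return "F" if name.lower().endswith("a") else "M"


def guess_gender(name, table=FEMALE_SHARE, fallback=fallback_guess):
    """Step 1: try the table. Step 2: fall back to the classifier, which
    always returns an answer. Also reports which path was taken."""
    key = name.strip().lower()
    share = table.get(key)
    if share is not None:
        return ("F" if share >= 0.5 else "M", "lookup")
    return (fallback(key), "classifier")


for name in ("Jennifer", "James", "Zorba", "Xander"):
    print(name, guess_gender(name))
```

Returning the path alongside the guess lets downstream code treat classifier guesses as less trustworthy than table hits.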

ryandrake · on April 5, 2016

A non-software-developer and a developer need shopping lists for their next trip to the supermarket. Developer: "I know! First we'll set up Postgres, then hook up the controller logic... Oh, we'll need a front end that works well on both desktop and mobile..." The non-developer grabs a pen and paper and is at Safeway in 5 minutes.

avereveard · on April 5, 2016

but what about Andrea?

function_seven · on April 5, 2016

In the United States, most likely female.

carlob · on April 5, 2016

Everywhere except Italy most likely female, in Italy, mostly male.

zyx321 · on April 5, 2016

So it's similar to Sasha? Female in the US and other English language regions, Male in French and German language regions, unisex in eastern Europe.

onion2k · on April 5, 2016

It's determining sex, not gender. Sex is a binary physical attribute (plus some edge cases); gender is a much more fluid aspect of who we are. By writing code that puts people into only "male" or "female" and ignores everything else that people identify with, you're disenfranchising a lot of people.

This is definitely some interesting work, but I really would recommend not using it in the real world. You almost certainly don't need to know a user's gender; if you really do then you should ask them what it is and use their own definition.

lifeindeath · on April 5, 2016

He said determine gender of a name, not of a person. Names (as part of the speech, not as people identifiers) have genders, grammar says so. Then one can argue that 80% accuracy is not that much, but as a first approach to machine learning it is actually an interesting choice.

thaumasiotes · on April 5, 2016

> Names (as part of the speech, not as people identifiers) have genders, grammar says so.

You'd have to be speaking a language with gender marked on the noun; English is not such a language.

Pilfer · on April 5, 2016

blond/blonde

actor/actress

fiancé/fiancée

widower/widow

waiter/waitress

thaumasiotes · on April 5, 2016

None of those are nouns with grammatical gender (semantic gender, yes. Grammatical gender, no). English is not such a language. Our pronouns are marked for gender; our nouns, including names, aren't.

thaumasiotes · on April 5, 2016

It's worth observing, as a fun fact, that English obeys a grammatical gender system which is not masculine/feminine:

in modern American English, "who" is a pronoun for people (and anything being granted a sort of honorary personhood, like a pet), and "what" is a pronoun for non-people. The third-person pronouns follow the same distinction, with "it" being marked for nonpersonhood. This is a change from historical usage, and is why e.g. some people today will be offended by another person referring to their baby as "it".

thecatspaw · on April 5, 2016

A name can have multiple genders, for example Kim is male and female

thro21887 · on April 5, 2016

There is very strong correlation between sex and gender ;-)

collyw · on April 5, 2016

http://www.oxforddictionaries.com/definition/english/gender

Seems that gender can be used both ways.

kbenson · on April 5, 2016

I wish English made it easier to just use gender neutral speech when referring to individuals. The fact that the third-person neutral pronoun is "it", which in English generally carries a negative connotation when applied to an individual, makes the whole situation kind of tricky. I don't imagine someone would be pleased to be referred to as "it", and I wouldn't enjoy using it if I knew it was making someone uncomfortable, or worse, upset.

hug · on April 5, 2016

Third person gender neutral pronoun in English is the singular "they", and has been for literally centuries. Don't let prescriptivists tell you otherwise.

woodman · on April 5, 2016

It places the lotion in the basket. Yup, uncomfortable.

captainmuon · on April 5, 2016

Isn't it the other way around? The name (birthname or assumed) reflects a person's social identity and thus is related primarily to their gender. It is only indirectly correlated to their sex (biological makeup), which you cannot directly infer from the name they specify. So I'd say this guesses a person's gender.

Somewhat orthogonal is the issue that the guess is binary (m/f). But the thing is, as long as the accuracy is ~80%, the total mismatch rate is much larger than the rate of people that are mismatched because there is no category for them.

Obviously, you shouldn't guess someone's sex and/or gender and present it to them or others. But this can be used for a lot of other things:

- If you have a large dataset of names and other information, but no gender info, and want to see if e.g. women are underrepresented. Even with a weak classifier, you can set bounds.

- If you have a real-name, real-data policy, you can use this as a pre-screening to see if the information entered is plausible. It's of course up to you to then do something responsible with that information. I'd prefer to allow pseudonyms or assumed identities in most cases, but sometimes it's not possible (e.g. if this is an e-government or insurance project).
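
To make the "set bounds" point from the first item concrete, here is one way it could work. This sketch assumes the classifier is equally accurate on male and female names, which nothing in the article establishes, and that you only know its accuracy to within a range. The function names and the example numbers are made up.

```python
def corrected_share(observed, accuracy):
    """Back out the true female share from the share the classifier labels
    female, assuming the same accuracy on both classes:
        observed = true * accuracy + (1 - true) * (1 - accuracy)
    The result is clipped to [0, 1]."""
    if not 0.5 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0.5, 1]")
    estimate = (observed - (1.0 - accuracy)) / (2.0 * accuracy - 1.0)
    return min(1.0, max(0.0, estimate))


def share_bounds(observed, acc_low, acc_high):
    """Bound the true share when accuracy is only known to lie in a range.
    The correction is monotone in accuracy, so the endpoints suffice."""
    ends = (corrected_share(observed, acc_low),
            corrected_share(observed, acc_high))
    return min(ends), max(ends)


# Example: 35% of names are labeled female, accuracy somewhere in 75%-85%.
low, high = share_bounds(0.35, 0.75, 0.85)
print(f"true female share roughly between {low:.1%} and {high:.1%}")
```

If the error rates differ by class, the same algebra applies with two separate rates, but the bounds get wider.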

zyx321 · on April 5, 2016

>If you have a real-name, real-data policy, you can use this as a pre-screening to see if the information entered is plausible.

If people want to lie, they are going to give you a fake gender to match their fake name. What you're suggesting would flag 20% of your legitimate users and 0% of malicious users.

zyx321 · on April 5, 2016

Approximately 99% of biologically male people identify with a (predominantly) male gender identity, and approximately 99% of biologically female people identify with a (predominantly) female gender identity. Of the remainder, approximately 99% identify as the opposite gender (born male, identifying female, or vice versa). Those people will usually prefer to change their name to reflect their gender.

Unless you are specifically targeting liberal arts college students or Tumblr's otherkin community, Gender: Male/Female is going to cover the vast majority of your users. Unless you are running an adult dating website, there's no reason to ask your users' biological sex ever.

All that being said, I do agree that if you want to know a user's gender you should be asking them rather than trying to guess. And since it doesn't cost anything extra you might as well throw "other" in there as an option.

powera · on April 5, 2016

I would think it shouldn't be used in the real world because an 80% guess rate of gender based on name is really terrible compared to other naive approaches, like "list of 1000 most common names by gender".

youngprogrammer · on April 5, 2016

80% is not very good for practical uses (the Gender-Classification-as-a-Service offerings on the web most likely use a map of names to probability of gender), but I think it is very good for 3 features.

jacalata · on April 5, 2016

Well, why is that constraint interesting?

youngprogrammer · on April 5, 2016

This is interesting because using only last 2 letters of a name and the position of a's, and throwing away all the other information, you can guess the gender of a name with 80% accuracy.
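
As a sketch, the feature extraction might look like the following. This is one plausible reading of "last 2 letters plus the position of a's", not the exact encoding from the blog post, and the function name is made up.

```python
def three_features(name):
    """Reduce a name to three features: its last letter, its second-to-last
    letter, and the zero-based positions of every 'a' in it."""
    n = name.strip().lower()
    return {
        "last_letter": n[-1:],
        "second_to_last": n[-2:-1],
        "a_positions": tuple(i for i, ch in enumerate(n) if ch == "a"),
    }


for name in ("Maria", "Jason", "Jo"):
    print(name, three_features(name))
```

Everything else about the name is discarded, so "Maria" and "Daria" become indistinguishable to the classifier.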

jacalata · on April 5, 2016

That's really just restating what you did, not explaining why it's interesting to have done it. Is three features a common limitation of systems? Is this useful in a way that augments existing tools like comparing the name to the SSN database? There is definitely a place for articles saying "this is a toy problem you can use technique x on" but it's worth differentiating that from "this is a novel technique/a discovery made using this technique", and it feels a little like you've done the first kind and you're trying to label it as the second kind.

onion2k · on April 5, 2016

The point still stands though - guessing a user's gender based on their name whatever solution you use will get it wrong every time the user doesn't identify with the normative gender for that given name. It's much more user friendly to ask (and include an option for "Prefer not to say").

LyndsySimon · on April 5, 2016

> get it wrong every time the user doesn't identify with the normative gender for that given name

While I'm not really buying your root argument, this statement is spot on.

Case in point - I'm a man named Lyndsy.

cjslep · on April 5, 2016

Serious question: How can a name have a biological sex? (Wouldn't the fact that penis-people and vagina-people could be named Amy, but society heavily biases towards one, imply names are indicators of societal genders and not biological sex?)

I agree with your above points about the dangers of making guesses on gender. I'm just not well versed in the progressive gender concepts and don't understand the semantics you are arguing for.

jacalata · on April 5, 2016

I think that by 'biological sex' they are saying it is identifying the gender of the person as it was assigned to them at birth, which is usually based on visible biological sex characteristics.

profmonocle · on April 5, 2016

No one is suggesting this should be done in actual production software - even the author says it's just "an easy project to learn machine learning".

I appreciate the point you're trying to make, but it's kind of a straw man argument.

thecatspaw · on April 5, 2016

usually in forms the gender is irrelevant, what matters most of the time (if it even matters) is the sex

DanBC · on April 5, 2016

If you're gathering sex:

i) you want to know how a person identifies

ii) you're gathering demographic data so you also want to get trans status

At this point there's no point asking about gender. Certainly in the UK asking for sex and then asking for gender is going to be seen as hostile to trans people, while asking for what sex someone identifies as and then asking about trans status is seen as less hostile.

I can't think of a situation where you'd ask for the sex and not for trans status.

captainmuon · on April 5, 2016

In my field, there is mandatory radiation protection for pregnant women. If you start, you want to ask on the form for the (biological, birth) sex, and if it is not female you can simply skip the part where you inform the worker about rights and responsibilities in case of pregnancies.

Honestly, I don't think you need to ask for any (gender|sex)-information in most cases, but it is often added out of completism. The only valid reasons are:

- Self-presentation of the user, e.g. in a profile page -> allow the user to write whatever they want

- Advertising, argh, but you often can't get around this. I don't know if advertisers care about how people identify or about trans people, but they probably want a simple m/f flag that fits enough people.

- Medical or legal stuff

collyw · on April 5, 2016

Considering the vast majority of people are not trans, then I think that in most situations it would be perfectly acceptable only to ask the sex.

sydneysider · on April 5, 2016

Blah blah blah, go back to liberal arts college

dang · on April 5, 2016

Please don't do this here.

nness · on April 5, 2016

Work out how to do this for ethnic make-up and Twitter may have a job for you...

jonathankoren · on April 5, 2016

That is trivially easy.

You simply find the most common names in each country with predominant language. Want to find Indian names? Most common names in India. Latino names? Most common names in Mexico. etc. It's not perfect, but nothing is. Case in point:

As a wise man once sang, "I'm not black like Barry White. No, I am white like Frank Black is."

erikpukinskis · on April 5, 2016

This kind of thing makes me nervous, because I can't think of any uses for this kind of thing that aren't pretty nefarious in my mind. I immediately imagine targeting women with makeup and celebrities and men with cars and sports. Even if you could achieve 100% accuracy, you're pigeonholing your users in some sort of weird gender jail where behavior consistent with their gender is reinforced.

Maybe this is a failure of imagination on my part though... Do other people have a sense of some altruistic feature that would rely on tech like this?

Terr_ · on April 5, 2016

Nefarious? Guessing the gender of a name is something humans do all the time, especially with the aid of learned conventions in a society. Janus and Janice, Don and Dawn...

Even if the computer improves to twice human accuracy (i.e. mistakes half as often) I don't see how it'll reveal or bias anything that wasn't already about to happen.

wyattpeak · on April 5, 2016

It's worth noting that in plenty of languages, names can be trivially gendered with accuracy approaching 100%.

I haven't seen Icelanders rising up to make their names more ambiguous, and Italians don't seem unduly overrun with nefarious lipstick ads.

I also guarantee that advertisers can do a good deal better than 80% by browsing habits alone.

frobozz · on April 5, 2016

Dear Ms. Pukinskis,

I'm writing to you to let you know that as you haven't provided us with a preferred form of address, our computer has automatically chosen one for you (see above). If you'd prefer us to use a different form, please let us know.

danso · on April 5, 2016

The US government has already used last name data to guess ethnicity of persons when sending out notices for a class action lawsuit: http://www.consumerfinance.gov/reports/using-publicly-availa...

SixSigma · on April 5, 2016

> I immediately imagine targeting women with makeup and celebrities and men with cars and sports.

Then it is your imagination being nefarious.