They never say which last letters are associated with which sex. I'm guessing that "o" is usually male and "a" is female, but what about the other 24? Is "n" usually male ("-son")? Does "y" have a bias?
(Regarding sex vs. gender: yes, they aren't perfectly correlated and some people don't stay with their assigned gender, but AFAIK they often/usually choose a new name which probably matches their gender? Thus why "dead names" are a thing.)
Those who find this interesting might also be interested in exploring the additional dimensions offered: state and year. Without using any machine learning, you anchor a fact (name, sex, state, birth year range) and get the probability for the unanchored facts. The more you can anchor the higher your level of confidence. Now this is way off topic, but if that sounded interesting... "Constraint Logic Propagation Conflict Spreadsheets" by William Taysom [0]
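To make the "anchor some facts, read off the rest" idea concrete, here is a minimal sketch that does it by plain filtering and counting. The rows are made-up numbers (not real baby-name data), the `distribution` helper is a hypothetical name, and a single birth year stands in for the year range mentioned above.

```python
from collections import defaultdict

# Made-up toy rows shaped like a state-level baby-name file:
# (name, sex, state, birth_year, count). These are NOT real figures.
ROWS = [
    ("Leslie", "M", "NY", 1940, 60),
    ("Leslie", "F", "NY", 1940, 40),
    ("Leslie", "M", "NY", 1990, 5),
    ("Leslie", "F", "NY", 1990, 95),
    ("Leslie", "F", "CA", 1990, 100),
]
FIELDS = ("name", "sex", "state", "year")


def distribution(rows, target, **anchors):
    """Estimate P(target | anchors) by filtering rows and summing counts."""
    idx = {field: i for i, field in enumerate(FIELDS)}
    totals = defaultdict(int)
    for row in rows:
        # Keep only rows that agree with every anchored fact.
        if all(row[idx[key]] == value for key, value in anchors.items()):
            totals[row[idx[target]]] += row[-1]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no rows match the anchors
    return {value: count / grand_total for value, count in totals.items()}


# Anchoring more facts narrows the answer:
by_name = distribution(ROWS, "sex", name="Leslie")
by_name_and_year = distribution(ROWS, "sex", name="Leslie", year=1940)
```

With these toy rows, anchoring only the name leans female, while also anchoring the 1940 birth year flips the estimate toward male.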
Using the SSN baby names data is a staple in my data journalism classes. I like using it as an example of how even a brute force comparison can be decent enough (90% or better) to do gender analysis on a wide variety of public datasets, such as campaign finance and public payroll.
Note: it was only an example...obviously a little tweaking is needed to put it into actual production. Also, it obviously has varied effectiveness on datasets with non-traditional American names. Here's one of the more comprehensive efforts by a student, on New Yorker bylines:
The SSN name data seems almost certainly flawed to a small degree...I.e. It's just hard to believe that there are dozens of boys named Jennifer (and yet, strangely, no boys named Sue!)...but we're talking about infinitesimal rounding errors. The vast, vast majority of names are 99% one way or the other...with a few exceptions such as Leslie...though you can mitigate that by using older years of the SSN database.
So I have to strongly disagree with OP that 80% accuracy is something to be astounded by when it comes to gender classification...however, I do agree that in terms of features, doing a simple frequency count of last characters...or number of vowels overall can be a strong indicator of gender for a name, with female names tending toward the softer sound. I wonder how much more using Soundex would add to the accuracy? Creating a trained name classifier would be a fun project in service of a tool that could gender classify how masculine or feminine a made-up name sounds like...which would be a slightly useful tool if you were a fantasy fiction writer, though I suppose if you were to be a successful writer, your ear would be trained well enough for the purpose to not delegate it to a computational tool.
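For anyone curious what the "simple frequency count of last characters" looks like, here is a rough sketch. It is not the blog author's code: the training list is a tiny hand-picked sample for illustration, and `fit_last_letter` and `predict` are hypothetical names. A real run would use the full name file with its counts.

```python
from collections import Counter, defaultdict

# Tiny hand-picked sample, for illustration only.
TRAIN = [
    ("Maria", "F"), ("Anna", "F"), ("Julia", "F"), ("Emily", "F"),
    ("Roberto", "M"), ("Marco", "M"), ("John", "M"), ("Jason", "M"),
]


def fit_last_letter(pairs):
    """Count how often each final letter appears with each sex."""
    counts = defaultdict(Counter)
    for name, sex in pairs:
        counts[name[-1].lower()][sex] += 1
    return counts


def predict(counts, name, default=None):
    """Return the most frequent sex for the name's last letter, if seen."""
    seen = counts.get(name[-1].lower())
    if not seen:
        return default  # last letter never appeared in training
    return seen.most_common(1)[0][0]


model = fit_last_letter(TRAIN)
```

Here `predict(model, "Sophia")` falls back on the "a" counts and `predict(model, "Pedro")` on the "o" counts. A letter the sample never saw returns the default instead of a guess.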
> The SSN name data seems almost certainly flawed to a small degree...I.e. It's just hard to believe that there are dozens of boys named Jennifer (and yet, strangely, no boys named Sue!)...but we're talking about infinitesimal rounding errors. The vast, vast majority of names are 99% one way or the other...with a few exceptions such as Leslie...though you can mitigate that by using older years of the SSN database.
I agree that the SSN data has flaws, but I only took names with at least 20 people. But the classification is probably iffy, as some names are classified as both male and female.
> So I have to strongly disagree with OP that 80% accuracy is something to be astounded by when it comes to gender classification...
I originally hypothesized that I could reach 90% accuracy, but I could only get up to 82% max. As stated in the blog, 80% is the accuracy of a mammogram detecting cancer in a 40-45 year old woman, so 80% is pretty good for 3 features!
> I wonder how much more using Soundex would add to the accuracy? Creating a trained name classifier would be a fun project in service of a tool that could gender classify how masculine or feminine a made-up name sounds like...which would be a slightly useful tool if you were a fantasy fiction writer, though I suppose if you were to be a successful writer, your ear would be trained well enough for the purpose to not delegate it to a computational tool.
I agree with your comment in the sense that 80% is not as large a number as it seems. One has to understand that in statistics the baseline technique is random guessing. In the case of genders in the USA, random guessing would probably have around 50% accuracy (http://kff.org/other/state-indicator/distribution-by-gender/), so 80% is just 30 percentage points above the baseline (there's still the other 2/5 of the gap to cover).
I used to work with text-mining techniques and it was quite easy to come up with a statistical metric that could reach 70%-75% accuracy for almost all the problems that I dealt with (mainly in the sub-area of concept extraction - information retrieval). I assume it is similar for this guy, which is why he reaches these kinds of values very fast. For instance, with only the character frequency and order he reports results of 75% which, if you look carefully, is just 20-25 percentage points above the random baseline, not a "pure" 75%.
Then he trains a classifier (Random Forest) and extracts the most relevant features that the classifier used, which is the existence of an 'a' character in the last two positions of the name (the order-of-'a' feature seems quite irrelevant to me). With these features he reaches 80% accuracy ("random" + 30 points), which is a good number but still far from perfect. I would say that these two features are to be expected, as, at least in my mother tongue (Portuguese), feminine names end with an 'a' (e.g. Maria, Ana, Roberta [vs Roberto - masc], etc.).
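To spell out the baseline arithmetic, here is a small sketch. The function name is made up for illustration, and the 50% baseline is the rough assumption used above, not a measured figure.

```python
def gain_over_baseline(accuracy, baseline=0.5):
    """Return (points above baseline, share of the baseline-to-100% gap closed)."""
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    points = accuracy - baseline
    share_of_gap = points / (1.0 - baseline)
    return points, share_of_gap


# 80% accuracy against a 50% random baseline:
points, share = gain_over_baseline(0.80)
```

For 80% this gives roughly 0.30 points above random and about 3/5 of the gap closed, which is where "the other 2/5" comes from. For 75% it gives 0.25 and half the gap.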
But all in all it can be considered a good exercise!
Still impressive but it would be interesting to see the result for other countries. Maybe for some countries, the gender could be harder or easier to guess. I can see how some people would use this to try to reconstruct a database where gender is missing.
Many names in Romance languages preserve the ending of the original Latin declension. The first-declension feminine ending, the "base" ending for a woman's name, is "a."
Thus many Romance countries (and Christian countries, since the Catholic Church requires the child to take the name of a saint for Baptism) have a disproportionate number of female names that end in "a."
The "a" phenom signifying femininity is even suspected to go all the way back to Proto-Indo-European.
I am confused. Why use a dataset to train an AI, when one can just use a decent dataset as a look up table and dispense with the complexity of machine learning, while achieving close to perfect accuracy?
Obviously as an exercise in machine learning, this is perfectly acceptable, but as a solution to a problem, not so much.
For me, this was an exercise in practicing machine learning and I found it very interesting that you could get 80% accuracy with so few features. A lookup table works very well, but if a name does not exist in the table, you could possibly use some kind of ML to guess the gender.
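Roughly what I have in mind, as a sketch only: the table values are invented, the names `FEMALE_SHARE`, `guess_by_last_letter` and `guess` are hypothetical, and the "ends in a" rule is a crude stand-in for whatever trained model you would actually plug in.

```python
# Hypothetical lookup table: lowercase name -> share of births recorded
# as female. The values are invented for this example.
FEMALE_SHARE = {"jennifer": 0.99, "john": 0.01, "leslie": 0.75}


def guess_by_last_letter(name):
    """Crude stand-in for a trained model: names ending in 'a' -> 'F'."""
    return "F" if name.lower().endswith("a") else "M"


def guess(name, table=FEMALE_SHARE):
    """Use the lookup table when the name is known, else fall back."""
    share = table.get(name.lower())
    if share is not None:
        return ("F" if share >= 0.5 else "M"), "lookup"
    return guess_by_last_letter(name), "fallback"
```

Returning the source ("lookup" or "fallback") alongside the guess lets you treat the fallback answers as lower confidence downstream.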
A non-software-developer and a developer need shopping lists for their next trip to the supermarket. Developer: "I know! First we'll set up Postgres, then hook up the controller logic... Oh, we'll need a front end that works well on both desktop and mobile..." The non-developer grabs a pen and paper and is at Safeway in 5 minutes.
It's determining sex, not gender. Sex is a binary physical attribute (plus some edge cases); gender is a much more fluid aspect of who we are. By writing code that puts people into only "male" or "female" and ignoring everything else that people identify with, you're disenfranchising a lot of people.
This is definitely some interesting work, but I really would recommend not using it in the real world. You almost certainly don't need to know a user's gender; if you really do then you should ask them what it is and use their own definition.
He said determine the gender of a name, not of a person. Names (as parts of speech, not as identifiers of people) have genders; grammar says so. One can argue that 80% accuracy is not that much, but as a first approach to machine learning it is actually an interesting choice.
None of those are nouns with grammatical gender (semantic gender, yes. Grammatical gender, no). English is not such a language. Our pronouns are marked for gender; our nouns, including names, aren't.
It's worth observing, as a fun fact, that English obeys a grammatical gender system which is not masculine/feminine: in modern American English, "who" is a pronoun for people (and anything being granted a sort of honorary personhood, like a pet), and "what" is a pronoun for non-people. The third-person pronouns follow the same distinction, with "it" being marked for nonpersonhood. This is a change from historical usage, and is why e.g. some people today will be offended by another person referring to their baby as "it".
I wish English made it easier to just use gender neutral speech when referring to individuals. The fact that the third-person neutral pronoun is "it", which in English generally carries a negative connotation when applied to an individual, makes the whole situation kind of tricky. I don't imagine someone would be pleased to be referred to as "it", and I wouldn't enjoy using it if I knew it was making someone uncomfortable, or worse, upset.
Third person gender neutral pronoun in English is the singular "they", and has been for literally centuries. Don't let prescriptivists tell you otherwise.
Isn't it the other way around? The name (birthname or assumed) reflects a person's social identity and thus is related primarily to their gender. It is only indirectly correlated to their sex (biological makeup), which you cannot directly infer from the name they specify. So I'd say this guesses a person's gender.
Somewhat orthogonal is the issue that the guess is binary (m/f). But the thing is, as long as the accuracy is ~80%, the total mismatch rate is much larger than the rate of people who are mismatched because there is no category for them.
Obviously, you shouldn't guess someone's sex and/or gender and present it to them or others. But this can be used for a lot of other things:
- If you have a large dataset of names and other information, but no gender info, and want to see if e.g. women are underrepresented. Even with a weak classifier, you can set bounds.
- If you have a real-name, real-data policy, you can use this as a pre-screening to see if the information entered is plausible. It's of course up to you to then do something responsible with that information. I'd prefer to allow pseudonyms or assumed identities in most cases, but sometimes it's not possible (e.g. if this is an e-government or insurance project).
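On the "set bounds" point in the first bullet, here is a minimal sketch of what I mean. It assumes you know the classifier's overall error rate on this kind of data (that number, and the function names, are assumptions for illustration). Since each misclassified name can shift the count by at most one, the true share cannot be further from the predicted share than the error rate.

```python
def share_bounds(predicted_share, error_rate):
    """Hard bounds on the true share, given the overall error rate."""
    low = max(0.0, predicted_share - error_rate)
    high = min(1.0, predicted_share + error_rate)
    return low, high


def corrected_share(predicted_share, error_rate):
    """Point estimate under the extra assumption that errors are symmetric
    (the same error rate for both classes)."""
    if error_rate >= 0.5:
        raise ValueError("error_rate must be below 0.5 for this correction")
    return (predicted_share - error_rate) / (1.0 - 2.0 * error_rate)


# Example: classifier labels 25% of bylines as women, with a 10% error rate.
low, high = share_bounds(0.25, 0.10)
estimate = corrected_share(0.25, 0.10)
```

So even in the worst case for that example the true share stays at or below 35%, which is already enough to say something about underrepresentation.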
>If you have a real-name, real-data policy, you can use this as a pre-screening to see if the information entered is plausible.
If people want to lie, they are going to give you a fake gender to match their fake name. What you're suggesting would flag 20% of your legitimate users and 0% of malicious users.
Approximately 99% of biologically male people identify with a (predominantly) male gender identity, and approximately 99% of biologically female people identify with a (predominantly) female gender identity. Of the remainder, approximately 99% identify as the opposite gender (born male, identifying female, or vice versa). Those people will usually prefer to change their name to reflect their gender.
Unless you are specifically targeting liberal arts college students or Tumblr's otherkin community, Gender: Male/Female is going to cover the vast majority of your users. Unless you are running an adult dating website, there's no reason to ask your users' biological sex ever.
All that being said, I do agree that if you want to know a user's gender you should be asking them rather than trying to guess. And since it doesn't cost anything extra you might as well throw "other" in there as an option.
I would think it shouldn't be used in the real world because an 80% guess rate of gender based on name is really terrible compared to other naive approaches, like "list of 1000 most common names by gender".
80% is not very good for practical uses (the Gender-Classification-as-a-Service offerings on the web most likely use a map of names to probability of gender), but I think it is very good for 3 features.
This is interesting because using only the last 2 letters of a name and the position of a's, and throwing away all the other information, you can guess the gender of a name with 80% accuracy.
That's really just restating what you did, not explaining why it's interesting to have done it. Is three features a common limitation of systems? Is this useful in a way that augments existing tools like comparing the name to the SSN database?
There is definitely a place for articles saying "this is a toy problem you can use technique x on" but it's worth differentiating that from "this is a novel technique/a discovery made using this technique", and it feels a little like you've done the first kind and you're trying to label it as the second kind.
The point still stands though - guessing a user's gender based on their name, whatever solution you use, will get it wrong every time the user doesn't identify with the normative gender for that given name. It's much more user friendly to ask (and include an option for "Prefer not to say").
Serious question: How can a name have a biological sex? (Wouldn't the fact that penis-people and vagina-people could be named Amy, but society heavily biases towards one, imply names are indicators of societal genders and not biological sex?)
I agree with your above points about the dangers of making guesses on gender. I'm just not well versed in the progressive gender concepts and don't understand the semantics you are arguing for.
I think that by 'biological sex' they are saying it is identifying the gender of the person as it was assigned to them at birth, which is usually based on visible biological sex characteristics.
ii) you're gathering demographic data so you also want to get trans status
At this point there's no point asking about gender. Certainly in the UK asking for sex and then asking for gender is going to be seen as hostile to trans people, while asking for what sex someone identifies as and then asking about trans status is seen as less hostile.
I can't think of a situation where you'd ask for the sex and not for trans status.
In my field, there is mandatory radiation protection for pregnant women. When someone starts, you want to ask on the form for the (biological, birth) sex, and if it is not female you can simply skip the part where you inform the worker about rights and responsibilities in case of pregnancy.
Honestly, I don't think you need to ask for any (gender|sex)-information in most cases, but it is often added out of completism. The only valid reasons are:
- Self-presentation of the user, e.g. in a profile page -> allow the user to write whatever they want
- Advertising, argh, but you often can't get around this. I don't know if advertisers care about how people identify or about trans people, but they probably want a simple m/f flag that fits enough people.
- Medical or legal stuff
You simply find the most common names in each country with predominant language. Want to find Indian names? Most common names in India. Latino names? Most common names in Mexico. etc. It's not perfect, but nothing is. Case in point:
As a wise man once sang, "I'm not black like Barry White. No, I am white like Frank Black is."
This kind of thing makes me nervous, because I can't think of any uses for this kind of thing that aren't pretty nefarious in my mind. I immediately imagine targeting women with makeup and celebrities and men with cars and sports. Even if you could achieve 100% accuracy, you're pigeonholing your users in some sort of weird gender jail where behavior consistent with their gender is reinforced.
Maybe this is a failure of imagination on my part though... Do other people have a sense of some altruistic feature that would rely on tech like this?
Nefarious? Guessing the gender of a name is something humans do all the time, especially with the aid of learned conventions in a society. Janus and Janice, Don and Dawn...
Even if the computer improves to twice human accuracy (i.e. mistakes half as often) I don't see how it'll reveal or bias anything that wasn't already about to happen.
I'm writing to you to let you know that as you haven't provided us with a preferred form of address, our computer has automatically chosen one for you (see above). If you'd prefer us to use a different form, please let us know.