Names.io – Global Exhaustive Scraped Name Db (github.com/debdut)
147 points by debdut on Oct 8, 2020 | 95 comments



I have strong doubts this is exhaustive or that it could be. For one, it doesn't contain the name "Genoveffa" which is the Italianized form of Jennifer. There are Wikipedia-level people named thus, so it's not that uncommon.[0]

There's an infinite amount of misspelled/localized names outside their country of origin e.g. Maicol for Michael, Sandiago for Santiago, Uilliam, Villiam, Willian etc etc

Additional "anecdata": I live in a country (Hungary) where you have to apply for a special permit to give an "unusual" name to a kid, and since my son has an Italian name we had to do that.

You can see the list of names people requested to add, and it includes random stuff like "Magneto" (which isn't on this list either).

This, to say that basically any given word can be a name in some country, it's likely not possible to have an exhaustive list.

Maybe replace with "extensive".

[0] https://en.wikipedia.org/wiki/Genoveffa_Franchini


Well, we've been found out. My family and I don't exist. We had a good run.

EDIT: Yeah, this list is absurdly non-exhaustive. My wife's maiden name isn't even listed and it's a common English surname, noun and verb.


Well, the list doesn't even have the names of the last three French presidents, the last two Belgian prime ministers, etc.

Actually, I just can't find any French name that I look up. Except for "Michel" (another former Belgian prime minister) but that one was not a very high bar.

So yeah, it's pretty much non-exhaustive and it doesn't take long to find it out.


After a quick scan I could tell that the list of surnames has no names in it that contain letters beyond the basic latin alphabet. That by itself excludes a lot of actually valid names.


It's also missing all the names that are composed of multiple words (and thus contain spaces), which is very common for Dutch last names.


And Dutch first names: Gert-Jan etc.


Note that the list also misses out on Capitalisation. If you want to send messages to Gert-Jan then it needs to be Gert-Jan and not Gert-jan or gert-jan.

With last names this is also an important matter. McDonald is not mcdonald.

Then there is F.W. de Klerk. Not 'De Klerk' or 'de klerk'.

Some people take offence at their name being written wrongly, in this list by having no Title Case going on there could be a few people 'offended'!
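To make the capitalisation point concrete, here is a minimal sketch (standard library only) of what happens when you lowercase a name and then try to restore it with a generic title-casing rule. The helper name `round_trip` is made up for illustration and is not part of the repo.

```python
def round_trip(name: str) -> str:
    """Lowercase a name, then try to restore it with str.title()."""
    # str.title() uppercases the first letter of every run of letters and
    # lowercases the rest; it knows nothing about naming conventions.
    return name.lower().title()


for original in ["Gert-Jan", "McDonald", "de Klerk"]:
    restored = round_trip(original)
    status = "ok" if restored == original else "changed"
    print(f"{original!r} -> {restored!r} ({status})")
```

The hyphenated name happens to survive, but "McDonald" and "de Klerk" do not, so once the original casing is thrown away there is no general rule that brings it back.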


It seems perfectly common surnames are also missed if nobody with that name happened to have a child in the last x years (where x is quite small).

I'm not going to try do statistics, but I reckon that's quite a lot. Census would be a better source than birth registers.


Did search for the Basque name "Uxue" and it's not present...

https://en.wikipedia.org/wiki/Uxue


Swedish names are under-represented. For example, there are exactly zero hyphenated names starting with "sven".


It doesn't contain my last name and it's not the most unique one in the world (especially in France).


Yes, my two first names are missing on the list.


It sort of sucks that the project has a domain name for a project name ("names.io") but you don't own the domain.

I went to the domain because the GitHub didn't describe the format of the data. So I'd also beef up the README.


Somewhat related and worth pointing out is that the whole world does not use family names, or the family name as a last name.

My wife was annoyed when she came to the US and every form has a “first” and “last” field, but she doesn’t have a last name. Her passport for example, only has a “name” field.


I feel for her. My wife was in the exact same situation, like almost everyone in her home country she had just one given name. When we moved here to the states she found it highly annoying and even limited some opportunities.

Recently she decided to legally change her name to something that fits in better here. Although it has been an emotional decision for her, she felt she had to as it's very hard to live in the US without a first and last name.

Edit: "Annoying" doesn't even begin to cover her experience. Everything from getting a bank account, to buying a car, to getting a marriage license was a whole process of explaining until we were blue in the face, escalating to a manager, being told basically to get lost, etc. etc.. It left her in tears more than once.


Why wouldn't she just use your last name when you got married? Seems like a pretty simple solution.


Burmese women (she's Burmese) aren't used to changing their names at marriage and she found the practice demeaning. We never really considered it beforehand. But, I think if she had fully realized at the time how hard it would be to live here with one name, then yes, she would have taken my last name. You live and learn.


I don’t get why people still take it for granted that a woman should change her name on getting married and use the husband’s name. People should have the names assigned to them at birth or whatever they’ve changed it to as per their wish. I personally find the changing of names on marriage as an erasure of identity, even if the person is ok with it because of internalization or cultural conditioning.


I find it actually demeaning, but people often couldn't care less.

There are laws in France from revolution times [0] that basically set your name in stone because anyone using whatever name they liked without need for any paper trail basically caused administrative and bureaucratic hell.

But nowadays even the administration works around that to comply with the tradition with things like "nom d'usage" (you can see it as "lastname nickname"), which is basically allowing you to use another name for non-legal purposes, but people still think it's legal and shove it in every form as regular "nom".

I repeatedly heard from people working in HR that a significant amount of the questions they get the first few months of employees are just (married) women complaining about their last name "being wrong" on their payslips simply because they think that's no longer their legal name though it is.

[0] Loi du 6 fructidor an II (August 23rd 1794), liberal translation of article 1: no citizen will carry any other name than that of the birth certificate.


In Quebec, changing your last name is not permitted since the 70s. So when non-English-speaking people use the same family name, we tend to assume they are siblings.


The comment you're responding to comes from someone who in another thread demanded that global names for other things conform to their own culturally-specific and creatively exsanguinated expectations. Earlier in their comment history you can find them complaining about "South Asians" and justifying casual racism by claiming they "optimise for efficiency". Another page or two back, and they're describing sex as an obligation of marriage.

The point being, reactionary bigots tend to out themselves, and that's the obnoxious worldview driving by here. Erasure of someone's identity in conformance to external expectations is something that - remarkably - remains a thriving and actively promoted idea. This makes it all the more important to confront and openly, firmly reject.


> Seems like a pretty simple solution

So is changing her name. But people shouldn't need to do gymnastics for such a trifling matter. This is like cutting a foot to fit a shoe.

Cultures have different naming conventions, and not all cultures pressure women to take names from husbands.

The real simple solution is to accept that the common ground of name form is a string. One string.


If you move to another country I am of the opinion that you should have a certain respect for the conventions of the welcoming country. So you might not agree with their naming conventions, but why should I expect the country to change just because I show up on the scene?

In an English speaking country, having a separate surname aids in sorting and presentation of family units. The character set is typically A-Z. In Spain e.g. one expects a person to have two surnames. The default (though this can be changed) is that the first surname is from the father and the second one from the mother. By looking at the order of your name you can get information about family structures. Character set is A-Z + Ñ + umlauts and accent marks.

Going to an English speaking country I could expect them to spell my first name correctly; but since it contains a character outside A-Z I change my name to comply with their modus operandi. Yes, their computer system probably supports UTF8, but most people have never heard of this character and you won't find it on their keyboard. No problem, I change my name to comply with their system.

In Spain official forms often expect two surnames and certain characters. No problem, I use an extra hyphen, change my name, use my middle name as a surname or whatever makes the system happy.

Is it perfect? No. Does it really matter? No. So I just respect their customs and get on with my life as a respectful guest in the country where I am living.


If it's a government office or some service that only serves people from one country, then sure whatever, be a hard-ass if that's what you want to do.

The problem is it's every SaaS online service in the world many of which happily do business in Myanmar. You can't just tell an entire country with its own unique culture and 50 million citizens "FU, you're wrong", and when you do, it's you that's wrong not them! Everyone from Facebook, LinkedIn, Google, Microsoft, Amazon, to federated platforms like Matrix or Mastodon, to developer tools, to every forum on the Internet [1] requires a first and last name! These platforms have literally millions of users or customers around the world - and at least one entire country - that don't fit the <first> <last> mold and I find it highly disrespectful to force it on them like this.

[1] My info might be out of date on some of these as I've been out of the country and not paid much attention for a few years. I would hope that by now at least some of these big multinational companies would get it right. However, suffice it to say, the vast majority of SaaS platforms today are disrespectful to people with one name.


I also translate my name into a more pronounceable string of letters for English speaking people's sake; forcing a different language on people is disrespectful. We don't disagree there.

However, I don't see how a person having no first/last name, and expecting to be referred to that way, can be construed as disrespect to the English-speaking culture. There is no need to learn another language; all it takes is to "know" that the person has one string as a name.

You said it helps with sorting, and that is exactly why I said cutting a foot to fit a shoe. It's asking people to change their name for the sake of paper work.

I don't expect to show up and make a country change, for the sake of me. My name is already in first/last form.

I wish countries, and systems (frankly many of which are expanding to be used worldwide) to change, because I believe it to be the right direction, and in the end will save everyone's time.


One does not agree or disagree with naming conventions, one merely comes from a place that has one. Everything would be a lot simpler if there were just a single Name field accepting any Unicode string in forms across the globe. The goal should be to describe a person, not to propagate local naming conventions.
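For what it's worth, a single-field model is not much code. This is only a sketch of the idea: the class name `PersonName` and the "must not be blank" rule are my own assumptions, not anything a real form library prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonName:
    # The name exactly as the person writes it: any Unicode string,
    # one word or several, with no first/last split imposed.
    full_name: str

    def __post_init__(self) -> None:
        # The only check made here (an assumption): the name is not blank.
        if not self.full_name.strip():
            raise ValueError("name must not be empty")


print(PersonName("F.W. de Klerk").full_name)
print(PersonName("Gert-Jan").full_name)
```

Anything beyond that (sorting, salutations, initials) would have to be asked for separately rather than guessed from the string.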


It is quite annoying to see that insistence on last names or family names in most places.

In India, which is a single country but has many different languages (22 official and nearly a 1000 unofficial, including dialects) and cultures, I see the same insistence on last name/surname. There are states and districts where mononyms are quite common, but governments insist on first and last name in many forms. I’ve come to believe that those who make up the rules and are in the majority decide everything for others. In India, it’s the people around the national capital who assume that everybody has a last name or that everybody in the country knows Hindi (India has no national language, much to the chagrin of Hindi speakers who seem to believe it is the one).

Funnily, I’ve also noticed that the U.S. consulates have a system where they deal with a last name not being available, even when the name has more than one word separated by spaces in it. In such cases, they add “LNU” (Last Name Unknown) in the last name field in visas. In some cases I’ve seen “FNU” (First Name Unknown) too.


Indeed, when I clicked through, I expected one of the comments to already have linked to https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


I am from Tamilnadu, India. We don't have a common family name as last name. And our names mostly have single word. Here children add their father's name as last name. And women once get married will have their husband's name as last name. My name is Kumaran and my father's name is Rajendhiran. So, my full name is Kumaran Rajendhiran.


This case isn’t that uncommon! Worldwide, two of the most common naming conventions are:

1. Given name + group descriptor (profession, locality, tribe, etc. such as Jim Baker or Joan Rivers)

2. Given name + parent’s given name and gender (such as Dwayne Johnson or Jóhanna Sigurðardóttir)

Iceland is especially fun, as they not only append a parentage name, but alongside -son and -dottir they’ve introduced a gender neutral suffix (-bur), and they don’t want you to make up first names out of the blue!

“[Iceland’s] Personal Names Committee maintains an official register of approved Icelandic given names and governs the introduction of new given names into Icelandic culture.” https://en.wikipedia.org/wiki/Icelandic_Naming_Committee


One word names are called “mononyms”. Your example is again an example. You may have others around you with the same name but write it as Kumaran R (the initial derived from the first letter of the father’s name or the place of origin). Some may not even use an initial. These naming schemes, which are as valid as any other on the planet, have already caused a lot of trouble for people with PAN cards and Aadhaar (I recommend you read up on those if you aren’t aware).


Related anecdote.

Long ago, a lot of forms from American websites wouldn't let me fill in my last name, as it technically contains 3 words. I had to concat "van der" to my last name to finish registration.


Another anecdote.

Colleague last name: O'Donnell. You can already see where this is going.

Bonus points: has an email address as "FirstName.O'Donnell"@domain.com which basically works in <1% of the sites though it's completely legal, just because nobody cares about specs. Obviously he has other email addresses. The good thing is he gets 0 spam in that inbox.
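A rough illustration of why such addresses bounce off sign-up forms. The pattern below is hypothetical, just the kind of thing one often sees, and is not taken from any particular site. RFC 5322 allows an apostrophe in the local part, but a character class like this one does not.

```python
import re

# Hypothetical "typical" form validation pattern (an assumption for
# illustration): letters, digits and a handful of symbols before the @.
NAIVE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def naive_accepts(address: str) -> bool:
    """Return True if the whole address matches the naive pattern."""
    return NAIVE_EMAIL.fullmatch(address) is not None


for address in [
    "firstname.odonnell@example.com",
    "firstname.o'donnell@example.com",
    "\"FirstName.O'Donnell\"@example.com",
]:
    print(address, "->", naive_accepts(address))
```

Only the first address gets through; both the bare apostrophe and the quoted local part are rejected even though they are legal.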


I had a similar email address for many years. I kept putting in tickets asking if we could change it, or at least create an alias sans-apostrophe, but never got a reply.

Eventually they came to me and asked if I'd mind changing it. I pointed out I'd requested this roughly 10 times by that point, and we discovered their ticket system had been eating it.

For bonus points, I'm in Ireland, the one place on earth you'd expect o'names to be handled gracefully.


Similar to this, some countries have two first and/or two last names. I had my share of trouble with government forms requiring to input the second names I don't have.


A friend also has the same issue with his passport, but he had survived many years believing his first of 2 names was his first name, and the 2nd name his last. It worked okay until he moved cities (in Germany) and the authorities decided, "According to your passport, these 2 names is your last name, and your first name in our system should be '-'".

Luckily, like everything in Germany, there was a form to fill to fix it.


The Netherlands has the reverse: we have last name prefixes (Van, De, Van Der, Van De, ...) which are separate, and when sorting on last name they shouldn't be counted, because otherwise the V and D sections become very crowded.
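A minimal sketch of that sorting rule, assuming a short and deliberately incomplete prefix list; `PREFIXES` and `sort_key` are names I made up, and real systems usually store the prefix in its own field instead of parsing it out.

```python
# Assumed, incomplete set of Dutch surname prefixes, for illustration only.
PREFIXES = ("van der", "van de", "van den", "van", "de", "den", "ter", "ten")


def sort_key(surname: str) -> tuple:
    """Sort on the main part of the surname, using the prefix as a tie-breaker."""
    lowered = surname.lower()
    # Try longer prefixes first so that "van der" wins over "van".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if lowered.startswith(prefix + " "):
            return (lowered[len(prefix) + 1:], prefix)
    return (lowered, "")


surnames = ["van der Berg", "de Vries", "Jansen", "van Dijk", "Bakker"]
print(sorted(surnames, key=sort_key))
# ['Bakker', 'van der Berg', 'van Dijk', 'Jansen', 'de Vries']
```

With a plain alphabetical sort, three of those five names would land under D or V.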


Where is she from?


She is from Myanmar. In addition to not having family names, people from Myanmar typically have names consisting of 2-4 words. Her name is 4 words. This causes additional issues as many people and systems assume that names are single words. Many web forms won't allow multiple words as first or last name. It's maddening.


I know Myanmar is one country where basically everyone has just one name, but there are probably others.

https://en.wikipedia.org/wiki/Burmese_names


I think it's an "exhaustive" list under the limitations for those who do have a first and last name. Once you expand it to surnames and given names, it's a lot more complex.


> I think it's an "exhaustive" list under the limitations for those who do have a first and last name.

It is not. I have a single first and a single last name, both ASCII-safe. I am not on the list.


It is not exhaustive in any sense. There are enormously famous people, like Charles de Gaulle or Jacques Chirac whose last names are apparently not part of the exhaustive list.


No, 'Ford' is missing.



sometimes "falsehoods that other people believe and pay developers to do"


Looks like it doesn’t yet incorporate the Census surname data, which has more than 160K U.S. last names: https://www.census.gov/topics/population/genealogy/data.html


Will add it thanks


I have worked on projects where I needed to extract firstnames and lastnames and if you want to use this dataset to extract names, here are some caveats:

- firstnames can be lastnames as well

- common words can be names as well

- some stop words can be names

- the order can change, you can write firstname, lastname or lastname, firstname

- Some names are as short as one letter

Using ML can be useful if you can separate people by origin or in more homogeneous population.


> Some names are as short as one letter

I recently read a French political news story referring to O. It must be a misprint, I thought at first. But there is a French cabinet minister named Cedric O.

https://en.wikipedia.org/wiki/C%C3%A9dric_O


A French colleague's last name is "Le". As in, the main stopword in French.


Ouch, that must lead to a lot of confusion. Especially given that "Le" is a fairly common surname prefix in France (eg Yann Le Cun, John Le Carré)


It's a very common Vietnamese surname so it's not unheard of in France though, it's often pronounced as "lé", maybe because it's the right pronunciation (I don't know) but in any case it makes it immediately different from the common word "le".

Although when written, if that person has a typical French name like "Jean Le" there might be people who think it's been truncated indeed.


Indeed I was thinking about him, but he's probably not the only one


Never read "Story of O" before? ;)


Calling it exhaustive is a bit much. It doesn't have many names of my friends.

Here in India you get very long/sometimes weird lastnames. Many are not there on the list.


You should contribute!


Why? It'll never be exhaustive. Names are infinite.

I'd rather waste my time on HN :)


It's missing 9/10 of the most popular Croatian surnames, or to be more generous 3/10 if you agree that the Anglicised version of Knežević etc. is the same name. IMO they are not the same.


It's likewise missing every Norwegian name containing æ, ø and å – of which there are a lot (some are included if you accept the aa representation of å, but the rest are still missing).
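For anyone wondering how a scrape ends up ASCII-only: a common trick is to decompose characters and drop the combining marks. I don't know whether the repo does this (that part is a guess); the sketch below only shows what the trick does to names like these. `strip_accents` is a made-up helper and the sample names are just illustrations.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for name in ["Knežević", "Håkon", "Søren", "Æsa"]:
    print(name, "->", strip_accents(name))
# Knežević -> Knezevic
# Håkon -> Hakon
# Søren -> Søren
# Æsa -> Æsa
```

Note that å folds to "a" rather than the traditional "aa", and that ø and æ have no decomposition at all, so they pass through unchanged and would then be dropped by any filter that only keeps A-Z.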


Definitely not exhaustive, please don't use this for validation.


I can't wait for the first website to tell me they won't do business with me because I have an invalid name. It's just the next step from forcing me to remove "invalid characters".


It doesn't even contain the author's name


Exhaustive. Phooey. My last name of Morearty isn't in there.


Hi Brian! didn't know you also read HN ;P


Doesn't have my last name. Neither of my friends.

Probably a good list, but far from exhaustive.


Exhaustive? Does not seem to be that exhaustive to me.

It doesn't have my last name nor the last names of some family and friends I tried.

Perhaps just "a large list of scraped names"?


I wrote a CLI tool for name generation awhile back:

https://github.com/ironarachne/namegen

It doesn't have the volume of names that this one does, but it does have custom rules for names (e.g., Icelandic last names), and it can generate Thai names.


I appreciate the work on this repo to date, and as many comments below point out, it is not yet truly exhaustive. For those of us with additional data sources we can and should submit a pull request.

Thanks for the effort, I intend to use this to enrich my test data generation scripts.


Also I found the first names consist of names that are most likely, in fact, lastnames in ascii. Heikkila is most probably, really, only, a common Finnish surname Heikkilä. This makes me wonder how much overlap and discrepancy the lists might actually have.


Does not have the first name of my wife. No, you may not know her name....she goes to another school.

Aside, not that exhaustive though her name is a combination of two first letters from her dad and from her moms names. It is in the wild though, have heard others with her name.


That's an interesting thing. However, it would be more interesting (and usable, for example, in gamedev) if each name would also contain a reference to, for example, top-3 countries/cultures in which the name is popular.


~160k first names

~100k last names

Out of 7-8 billion people this is really all (or most) of the first and last names? We aren’t a very creative species I guess. I especially would have expected the number of first names to be at least an order of magnitude larger.


It's not. Not even close. It doesn't even have my last name for instance. And that's a western name.

Then there's also the following issues:

- The list contains mostly names in basic latin script, which already excludes most of the world population.

- It's also pretty bold of the author to assume that everyone even has a "first" and a "last" name. Not all names work like that.

- The list contains a bunch of extraneous stuff like "�" and "</pre></body></html> (>\k����  " and "रामकिशो&amp;"

- The list is converted to lowercase. You can't just change the capitalization of names and expect them to be the same. Not all scripts work like that.

- In some languages/cultures a first/last name could be pretty much any word you can find in a dictionary, or a combination of them. You'd pretty much need to add that language's entire dictionary.

I could go on but instead I'll just link to this: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


You're right but I don't think staying accurate is all that important in this case. Judging by the "Features" section this seems to be mostly concerned with recognizing names in strings which means you can discard some extra information as long as you can still accurately match names without too many false positives. I can't say anything about exhaustiveness.


Many people prefer popular first names. For example, there are probably more than 150 million people with the first name Muhammad or some spelling variation thereof.


That has nothing to do with 'preferring a popular first name', that is a part of how people are named according to lineage in islam.


I also noticed it contains names like "carpenterjr" in the surnames list, which is almost certainly a data collection error.


Other names that are hard to come by: company names and product names. But you can get addresses from Open Street Map and Open Addresses.


The exhaustive list only has 90K surnames, whilst Japan has 300K variations of surname.


Cool, my last name is not in this.


Could this help improve state of the art named entity recognition?


When would this be useful?


I'm training models to generate fake names for fake resumes -> https://fake.jsonresume.org (When would this also be useful aha)

Going to play with this data set.


Testing data, or creating character names for a story.


Removal of Personally identifiable information (PII) from text data is the first obvious application that came to mind.


If you do that you will likely wipe out a lot of non-person-name data too, just think of common words used as last names, like "Brown"[0].

It seems to me if you have PII in text data you should treat the whole thing as PII.

[0] although the dataset doesn't actually contain "Brown" as a valid last or first name, go figure.
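A quick sketch of the over-redaction problem, using a tiny stand-in set instead of the real dataset (the four entries and the `redact` helper are assumptions for illustration). Since the list is lowercased, the lookup has to be case-insensitive, which is exactly what causes the trouble.

```python
import re

# Tiny stand-in for a lowercased surname list (illustrative only).
SURNAMES = {"black", "green", "baker", "ford"}


def redact(text: str) -> str:
    """Replace every word found in the surname set with a placeholder."""
    def replace(match):
        word = match.group(0)
        return "[NAME]" if word.lower() in SURNAMES else word

    return re.sub(r"[A-Za-z]+", replace, text)


print(redact("Ms Green asked the baker to paint the door green."))
# Ms [NAME] asked the [NAME] to paint the door [NAME].
```

One real name is removed along with two ordinary words, and any surname that is missing from the list (as plenty are) would slip through untouched.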


Yeah, calling it "exhaustive" is plainly ridiculous. (It does have "Black" and "Green", if you want alternative examples, though "White" is also missing...) Seems like just scraping author names from a few bibliographic databases could've filled in some of the more obvious holes.


As a first heuristic for finding names in text I would imagine


randombabynames.io


I'm curious what was your stack for scraping this much data?


The code is in the repo. Stack is “wget” mostly it seems.


It's exhaustive and yet my first name is not there...


Good job


Nice job! It seems like you have a fairly rich data set. I could see this being really useful for soon to be parents trying to think up baby names. For anyone where this is your use case or you just find these types of name lists interesting, then you may also want to check out https://mashword.com.

Mashword is a word mashup name generator service that we recently built that recognizes many common human names. One of our primary use cases is finding alternatives or unique spellings to traditional or common names (e.g. https://mashword.com/search?words=rebecca) It does not support all of the names in these lists, but we are adding and growing our support for names all the time.



