Popular Names by Country Dataset

I needed a dataset of popular names by country for testing, but I couldn’t find one that had everything I needed. So I made my own!

Need a free dataset of popular names by country, including CJK and RTL examples, plus romanization and counts, all for a boatload of countries? Me too! Keep reading to hear more about what I put together.

My Problem and My Solution

Since I knew that what was available wasn’t what I needed, I took the time to figure out exactly what I did need. Here are the criteria I came up with, and how the dataset I built meets each of them:

Free — This dataset is released under the Creative Commons CC0 license.
Popular names — Included names are reported to be the most popular, by country. Each name includes a count of people with the name within the country when available.
Big Enough for Testing — 2,370 Forenames and 2,278 Surnames, many with multiple representations, i.e., different spellings and native vs. romanized forms.
Multinational — Forenames from 106 countries and surnames from 75, generally with at least 10 of each per country. In particular, many names from CJK and RTL languages are included.
Clear provenance — Data was pulled from https://en.wikipedia.org/wiki/Lists_of_most_common_surnames and https://en.wikipedia.org/wiki/List_of_most_popular_given_names during the week of July 8, 2023.
Easy-to-use — Data is available in simple JSON formats.
Romanization — Names in non-Latin scripts include a romanized form, either as provided by the source or generated with Google Translate.

All of these things let me build my i18n and l10n tests quickly and easily.

For reference, the following countries are represented:

🇦🇩🇦🇪🇦🇱🇦🇲🇦🇷🇦🇹🇦🇺🇦🇼🇦🇿🇧🇦🇧🇩🇧🇪🇧🇬🇧🇴🇧🇷🇧🇾🇨🇦🇨🇭🇨🇱🇨🇳🇨🇴🇨🇷🇨🇺🇨🇾🇨🇿🇩🇪🇩🇰🇩🇴🇩🇿🇪🇨🇪🇪🇪🇬🇪🇸🇫🇮🇫🇯🇫🇴🇫🇷🇬🇧🇬🇪🇬🇬🇬🇮🇬🇱🇬🇶🇬🇷🇬🇹🇭🇷🇭🇹🇭🇺🇮🇪🇮🇱🇮🇲🇮🇳🇮🇶🇮🇷🇮🇸🇮🇹🇯🇪🇯🇲🇯🇴🇯🇵🇰🇬🇰🇭🇰🇷🇰🇼🇰🇿🇱🇧🇱🇮🇱🇰🇱🇹🇱🇺🇱🇻🇱🇾🇲🇦🇲🇨🇲🇩🇲🇪🇲🇰🇲🇱🇲🇳🇲🇹🇲🇽🇲🇾🇳🇱🇳🇴🇳🇵🇳🇿🇵🇦🇵🇪🇵🇫🇵🇭🇵🇰🇵🇱🇵🇷🇵🇹🇵🇾🇷🇴🇷🇸🇷🇺🇸🇦🇸🇮🇸🇰🇸🇲🇸🇷🇸🇻🇹🇭🇹🇯🇹🇳🇹🇷🇹🇼🇺🇦🇺🇸🇺🇾🇻🇪🇻🇳🇽🇰🇿🇦

The Data

The dataset consists of the following data files:

Surnames

common-surnames-by-country.csv — This is the “master” surname file. All other surname files are generated from this file, either directly or indirectly. The format is not documented, but it’s not hard to grok, especially if you refer to surnames2json.py.
common-surnames-by-country.json — The same data as common-surnames-by-country.csv, but in a clearer JSON format.
common-surnames-by-country.min.json — The same data as common-surnames-by-country.json, just minified.
common-surnames.txt — Just want the names? Then this is the file for you. Contains all unique surnames, one per line.
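
Since the .txt files hold one name per line, they are the quickest way to get names into a test. Here is a minimal Python sketch of how that might look. It assumes the file is UTF-8 encoded, and the helper names (parse_names, load_names, is_rtl, is_latin_only) are my own illustrations, not part of the dataset or its scripts.

```python
import unicodedata
from pathlib import Path


def parse_names(text: str) -> list[str]:
    """Split one-name-per-line text into a list, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_names(path: str) -> list[str]:
    """Read a one-name-per-line file such as common-surnames.txt.

    Assumption: the file is UTF-8 encoded.
    """
    return parse_names(Path(path).read_text(encoding="utf-8"))


def is_rtl(name: str) -> bool:
    """True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in name)


def is_latin_only(name: str) -> bool:
    """True if every alphabetic character is a Latin-script letter."""
    return all(
        "LATIN" in unicodedata.name(ch, "") for ch in name if ch.isalpha()
    )
```

With these in place, something like `[n for n in load_names("common-surnames.txt") if is_rtl(n)]` would pull out just the right-to-left names for a layout test, and `is_latin_only` can separate native-script forms from romanized ones.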

Forenames

common-forenames-by-country.csv — This is the “master” forename file. All other forename files are generated from this file, either directly or indirectly. The format is not documented, but it’s not hard to grok, especially if you refer to forenames2json.py.
common-forenames-by-country.json — The same data as common-forenames-by-country.csv, but in a clearer JSON format.
common-forenames-by-country.min.json — The same data as common-forenames-by-country.json, just minified.
common-forenames.txt — Just want the names? Then this is the file for you. Contains all unique forenames, one per line.
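
I haven’t spelled out the JSON layout in this post, so rather than guess at field names, here is a small, schema-agnostic Python sketch that outlines the shape of any parsed JSON file. The describe and describe_file helpers are hypothetical names for illustration, and UTF-8 encoding is an assumption.

```python
import json
from pathlib import Path


def describe(node, depth: int = 0, max_depth: int = 3) -> list[str]:
    """Return indented lines outlining the shape of a parsed JSON value."""
    pad = "  " * depth
    if isinstance(node, dict):
        lines = [f"{pad}dict with {len(node)} key(s)"]
        if depth < max_depth:
            # Show only the first key so the outline stays short.
            for key in list(node)[:1]:
                lines.append(f"{pad}  {key!r}:")
                lines.extend(describe(node[key], depth + 2, max_depth))
        return lines
    if isinstance(node, list):
        lines = [f"{pad}list with {len(node)} item(s)"]
        if node and depth < max_depth:
            # Assume items are homogeneous and describe only the first one.
            lines.extend(describe(node[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(node).__name__}: {node!r}"]


def describe_file(path: str) -> str:
    """Load a JSON file (assumed UTF-8) and return its outline as text."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return "\n".join(describe(data))
```

Calling `print(describe_file("common-forenames-by-country.json"))` should give a quick outline of the nesting, which is usually enough to decide how to walk the data in a test.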

Downloading

You can find all of these files on the releases page of the repo. They are all released into the public domain under the Creative Commons CC0 license.

Prior Art

Certainly, I’m not the first person with this need, or this idea. In my search, I found many existing datasets that do some of the above things right, but not all of them. If you need a names dataset, you might also check out these other excellent options:

The U.S. Census provides outstanding data, but only for US surnames.
The U.S. Social Security Administration provides outstanding data, too, but only for US forenames.
FiveThirtyEight’s most common name dataset is also US-only, since it’s based on the Census.
solenium/names-dataset is simple and easy to use, but doesn’t indicate which countries the names come from, or how popular they are.
davidam/damegender is well-documented, easy to use, and multinational, but only covers forenames.
philipperemy/name-dataset is clearly comprehensive, but there is no indication of the most popular names, and there is no romanization. (And some users may have legitimate concerns about the provenance of the data, too.)
census.name is paid-only.