Frequently Asked Questions
Q: Why a new database of cultural/geographic/linguistic distances?
A: We decided to update these variables for three reasons. First, new and better data are now available to construct these measures. Second, fundamental changes over time warrant updates: this applies in particular to geographic measures (i.e. changes in country borders and codes). Finally, we were able to develop better variable definitions and methodologies that improve both the measurement accuracy and the interpretability of the resulting variables.
Q: What are the differences between these variables and the previous ones in Spolaore and Wacziarg (2016)?
A: The new data improve on the old along four dimensions:
Coverage: many more countries are covered, and in some instances we are able to add a time dimension to the data.
Accuracy: higher-quality data have become available. For example, the old linguistic distance measures relied on Ethnologue data matched to population groups by Fearon (2003), which covered 433 languages. To create the new linguistic distance, we used the original Ethnologue data along with newly available user populations, which cover 6,737 languages.
Methodology: we took the opportunity to improve the methodology, with the intent of improving both the measurement and the interpretability of the resulting measures. For example, for linguistic and religious distances, we introduced the concept of "Normalized tree distance" in dealing with language family trees: this feature accounts for the fact that certain branches of the family tree are longer (certain languages are farther away from the root).
Transparency and replicability: in line with broader trends in the profession, we want to share replication code, so that colleagues can re-construct the measures themselves, inspect the code, and suggest improvements or corrections.
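To make the idea of a normalized tree distance more concrete, here is a minimal Python sketch. It is an illustration only and not the formula used in the database: the function names, the toy classification paths, and the choice to normalize by the depth of the deeper of the two languages are all assumptions made for this example. The replication code is the authoritative reference for the actual definition.

```python
def shared_nodes(path_a, path_b):
    """Count the nodes two classification paths share, starting from the root."""
    count = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        count += 1
    return count


def normalized_tree_distance(path_a, path_b):
    """Hypothetical normalization: 1 minus the share of common nodes.

    Dividing by the depth of the deeper path (an assumption of this sketch)
    keeps the distance in [0, 1] even when branches have different lengths.
    """
    depth = max(len(path_a), len(path_b))
    if depth == 0:
        return 0.0
    return 1.0 - shared_nodes(path_a, path_b) / depth


# Toy classification paths (made-up labels, not Ethnologue data).
lang_x = ["root", "a", "b", "x"]
lang_y = ["root", "a", "c", "d", "y"]

# The two paths share 2 nodes; the deeper path has 5 nodes.
print(normalized_tree_distance(lang_x, lang_y))
```

Under this assumed normalization, two languages with identical paths are at distance 0, and two languages that share no node at all are at distance 1, regardless of how deep their respective branches are.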