# Geographic Distance

We define the geographic distance between two countries, in kilometers, as the population-weighted average distance over all pairs of cities drawn from the two countries, subject to the inclusion criteria described under Methodology.

Dataset Format: Comma-Separated Values (.csv)

Data type: cross-sectional.

Number of cities processed: 26,341.

Number of countries covered: 245.

Number of country dyads: 60,025 (245 × 245, i.e. ordered pairs, including each country paired with itself).

Variables included:

Population-weighted Average Distance (in km)

Population-weighted Average Distance (fraction of Earth's hemicircumference)

Border: dummy variable indicating whether the two countries share a land border

Population-weighted Latitudinal Distance (difference in degrees)

Population-weighted Longitudinal Distance (difference in degrees)
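The second distance variable rescales kilometers by half of the Earth's circumference, so that two antipodal points are at distance 1. A minimal sketch of that conversion follows; the function name and the Earth radius of 6,371 km are assumptions for illustration, since the constant used to build the dataset is not stated here.

```python
import math

# Assumed mean Earth radius in km; the dataset's own constant is not stated.
EARTH_RADIUS_KM = 6371.0


def km_to_hemicircumference_fraction(distance_km: float,
                                     radius_km: float = EARTH_RADIUS_KM) -> float:
    """Express a great-circle distance as a share of half the Earth's circumference."""
    hemicircumference_km = math.pi * radius_km  # half of 2 * pi * R
    return distance_km / hemicircumference_km
```

Under this assumed radius the hemicircumference is about 20,015 km, so a distance of 10,000 km corresponds to a fraction of roughly 0.5.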

### Methodology

Our new geographic distance database builds on the methodology of the widely used CEPII GeoDist database by Mayer and Zignago (2011). It provides updated and improved bilateral distance measures for 245 countries, up from 224 in GeoDist.

We use the GeoNames database to obtain more current and comprehensive information on city populations and locations. Specifically, we start from the cities500 file, which provides the geographic coordinates and population of 209,311 cities, including all cities with a population of over 500. One important advantage of this file over GeoDist is that it provides up-to-date country names, ISO codes and boundaries.

Starting from this file, we select a subset of the available cities to keep the computation feasible. Consistent with the methodology of GeoDist, which uses the 25 largest cities of each country to compute the population-weighted distance, we keep the top-50 cities of every country. Some countries have fewer than 50 cities, in which case we keep them all. The other key difference from GeoDist is that we aim to retain as much of the rest of the dataset as possible. In particular, for each country covered, we aim to cover at least half of the country's population. In practice, within each country, we rank cities in descending order of population (rank 1 is the largest), and we discard only cities that have both a rank above 25 and a cumulative population share of over 1/2. The resulting dataset includes 26,341 cities, a five-fold increase over GeoDist.
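The discard rule can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the dataset: `select_cities` and its parameters are hypothetical names, the cumulative share is taken over the cities ranked strictly ahead of the one being tested (so that the kept cities reach at least half of the total), and the total is the summed population of the country's listed cities. The rank cutoff is left as a parameter.

```python
from typing import List, Tuple

City = Tuple[str, int]  # (city name, population)


def select_cities(cities: List[City],
                  rank_cutoff: int = 25,
                  share_cutoff: float = 0.5) -> List[City]:
    """Keep cities of one country unless they fail both the rank and share tests."""
    # Rank 1 is the most populous city.
    ranked = sorted(cities, key=lambda c: c[1], reverse=True)
    total = sum(pop for _, pop in ranked)
    kept: List[City] = []
    cumulative = 0
    for rank, (name, pop) in enumerate(ranked, start=1):
        # Assumption: share of population held by cities ranked ahead of this one.
        share_before = cumulative / total if total > 0 else 0.0
        if rank > rank_cutoff and share_before > share_cutoff:
            break  # rank and share only grow, so every later city is discarded too
        kept.append((name, pop))
        cumulative += pop
    return kept
```

A country with fewer cities than the rank cutoff keeps all of them, which matches the treatment of small countries described above.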

We use the Haversine formula to compute the distance between each pair of cities from their latitudes and longitudes. We then aggregate these city-level distances into country-level population-weighted averages. For internal distances (a country's distance to itself), we exclude zero distances, that is, the distance from a city to itself.
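A minimal sketch of these two steps is given below. The Haversine function is standard; the aggregation is an assumption for illustration: it weights each city pair by the product of the two populations and renormalizes the weights after dropping same-city pairs in the internal case. The function names and the Earth radius of 6,371 km are likewise assumptions rather than details taken from the dataset.

```python
import math
from typing import List, Optional, Tuple

City = Tuple[float, float, float]  # (latitude in degrees, longitude in degrees, population)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float,
                 radius_km: float = 6371.0) -> float:
    """Great-circle distance between two points, in km (assumed mean Earth radius)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing the argument slightly above 1.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def weighted_distance_km(cities_a: List[City], cities_b: List[City],
                         internal: bool = False) -> Optional[float]:
    """Population-weighted average distance over city pairs.

    With internal=True, cities_a and cities_b must be the same list, and
    pairs of a city with itself are skipped. Returns None if no pair remains.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for i, (lat_a, lon_a, pop_a) in enumerate(cities_a):
        for j, (lat_b, lon_b, pop_b) in enumerate(cities_b):
            if internal and i == j:
                continue  # exclude the zero distance of a city to itself
            weight = pop_a * pop_b  # assumed weighting: product of populations
            weighted_sum += weight * haversine_km(lat_a, lon_a, lat_b, lon_b)
            weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None
```

For a country with a single city, the internal call returns `None`, which corresponds to the undefined case that needs a separate imputation.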

Thirteen very small countries in the dataset have only one settlement (e.g. the Falkland/Malvinas Islands and the Vatican). For these countries the internal population-weighted distance is undefined, so we impute it from the country's land area using Leamer's formula, √(area/π). We verified empirically that, among common alternatives, this formula gives the best area-based prediction of a country's internal distance, as measured by the mean squared error of the predicted log internal distance.
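The imputation amounts to taking the radius of a disk with the same area as the country. A short sketch, with a hypothetical function name and assuming the land area is given in square kilometers:

```python
import math


def leamer_internal_distance_km(area_km2: float) -> float:
    """Internal distance as the radius of a disk with the country's area: sqrt(area / pi)."""
    if area_km2 < 0:
        raise ValueError("area must be non-negative")
    return math.sqrt(area_km2 / math.pi)
```

For example, a hypothetical territory of 100 km² would be assigned an internal distance of about 5.64 km.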

We also include analogous population-weighted measures of latitudinal and longitudinal distance (in degrees) and a dummy variable indicating whether each country pair shares a land border, which lets researchers control for contiguity effects when estimating gravity models.