Data source ideas
Many of you, dear readers, are either teaching or studying metrics and are looking for good data sources for assignments, term projects, or just practicing new skills. Here are some suggestions.
The textbook
- We used dozens of datasets. Check the dataset review section
Data about the economy and society, mostly country level
- World Bank – international data on almost everything, partly used in the textbook
- Our World in Data – a recent and great collection of data that became famous because of its Covid coverage
- FRED – mostly USA, but some international
- OECD – standard macro data for OECD countries
- World Inequality Database – on the historical evolution of the world distribution of income and wealth within and between countries
- NBER business cycles – information on GDP growth and contraction by the committee that calls recessions
- FED consumer finance survey – The Survey of Consumer Finances (SCF) is a triennial cross-sectional survey of U.S. families. The study is sponsored by the Federal Reserve Board in cooperation with the Department of the Treasury, and the data are collected by NORC at the University of Chicago.
- USA inequality historical data – Ellora Derenoncourt, Chi Hyun Kim, Moritz Kuhn, and Moritz Schularick’s wonderful dataset on USA racial inequality (and more), used in their research
Data about firms, business
- World Bank microdata
- EBRD business surveys
- World Management Survey – used in several case studies in the textbook
- OECD on multinational companies, their measurement, controlled foreign firms
- ECB CompNet European firm data – harmonized firm-level datasets
- US historical industry data – 60 years
- Open Ownership Register is a database identifying the beneficial owners of 8.4 million companies (mostly Denmark, Slovakia, the UK, and Ukraine). Records include company names, addresses as well as the owners’ names, nationalities, and interests in those companies.
Data about people
- World Values Survey – a regular international survey on values
- CDC NHANES – US health surveys, partly used in the textbook
- IPUMS – census and survey data from around the world integrated across time and space
- USA Integrated Postsecondary Education Data System (IPEDS) – data on institutions of higher education in the United States, including enrollment, financial aid, and degree completion
- USA Early Childhood Longitudinal Study (ECLS) – data about children’s knowledge, skills, and development from birth through elementary school
Global trade
- UN ComTrade – the most well-known and widely used trade data
- WTO datasets – several downloadable datasets on trade in goods and services
- CEPII datasets BACI – BACI provides data on bilateral trade flows for 200 countries at the product level (5000 products). Products correspond to the “Harmonized System” nomenclature (6 digit code).
- US product level data by Peter Schott at Yale. Also technical data on matching datasets
- CEPII Gravity country pair data – trade, distance between country pairs
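Product-level trade data such as BACI is often too fine-grained for a first look, so a common first step is to aggregate 6-digit Harmonized System codes up to 2-digit HS chapters (the first two digits of the code). Below is a minimal sketch of that step. The rows are made-up toy values, and the tuple layout and the function names `hs_chapter` and `aggregate_by_chapter` are assumptions for illustration, not BACI's actual column names or any official tooling.

```python
from collections import defaultdict

# Toy rows in a BACI-like layout (hypothetical values, not real trade data):
# (exporter, importer, 6-digit HS product code, trade value)
flows = [
    ("DEU", "FRA", "870323", 120.0),
    ("DEU", "FRA", "870332", 80.0),
    ("DEU", "FRA", "220421", 15.5),
    ("FRA", "DEU", "220421", 40.0),
]


def hs_chapter(hs6: str) -> str:
    """Return the 2-digit HS chapter of a 6-digit code.

    Codes read in as numbers can lose a leading zero (e.g. "10121"),
    so the code is left-padded to six characters first.
    """
    return hs6.zfill(6)[:2]


def aggregate_by_chapter(rows):
    """Sum trade values per (exporter, importer, HS chapter)."""
    totals = defaultdict(float)
    for exporter, importer, hs6, value in rows:
        totals[(exporter, importer, hs_chapter(hs6))] += value
    return dict(totals)


chapter_totals = aggregate_by_chapter(flows)
for key, value in sorted(chapter_totals.items()):
    print(key, value)
```

Keeping product codes as strings rather than integers is the main design choice here: it preserves leading zeros, which matter for chapters 01 to 09.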
Data on cities, locations
- Eurostat European Cities – socio-economic, population, transport data on European cities
- City-data.com – USA city information on a variety of features, from schools to restaurant inspections
- Airbnb – massive resource of Airbnb offers around the world by Inside Airbnb, partly used in the textbook
Finance
- Yahoo Finance has historical data on stocks, bonds, and indices (such as Microsoft, used in the textbook); it also has a great Python API, yFinance
- Google Finance – great API that may also be linked to Google Sheets
Culture and language
- Open movies database – get movie data via API
- IMDB data – get movie data from the popular site IMDB
- CEPII language and history – country pair level info on shared common language, historical links (like colonial ties)
- Domestic and International Common Language Database (DICL)
Climate, environment, energy
- NOAA Climate Data Online – provides free access to NCDC’s archive of global historical weather and climate data
- City climate data
- BP global energy – energy consumption and CO2 emissions
- Air quality – air pollution data via an API
- Global wildfires by Global Wildfire Information System (GWIS), for instance burnt areas
- EuroCrops – a dataset collection combining all publicly available self-declared crop reporting datasets from countries of the European Union.
Government, policy
- Open tenders – you could get government contracting datasets from 33 countries
- Election results and institutions around the globe – a collaborative project
- Quality of Government – several open-source datasets, some extra information, all about Quality of Government
- The Global Open Data Index provides the most comprehensive snapshot available of the state of open government data publication. Read about the methodology
- PPEG (Political Parties, Presidents, Elections, and Governments) – The PPEG Database, with data from around the world, brings together a range of datasets produced by the department “Democracy & Democratization” of the WZB Berlin Social Science Center.
- Central bank policy rates – BIS collection of international policy rates, monthly
Sports data
- Football/Soccer: Football-data.co.uk – teams, games, odds, partly used in the textbook – great way to simply download data
- Football/Soccer: Soccerway and whoscored – a great deal of football data, but you may need web scraping to collect datasets.
- Baseball: Sean Lahman Baseball collection
- Baseball: Baseball-Reference
- Tennis: tennis-data.co.uk
- Football – a Python package to get Transfermarkt datasets
Transport, travel, commute
- Open flights data – flight routes, airport locations. Data for 2014-2017 only.
- US airline tickets – Bureau of Transportation Statistics’ Passenger Origin and Destination (O&D) Survey. An earlier version is used in the textbook
- Commuting zones datasets – Facebook collected data on users’ position to estimate commuting zone areas. Check out the Data overview as well
Health, medical, Covid
- Covid data hub – a unified dataset of worldwide fine-grained case data, merged with exogenous variables helpful for a better understanding of COVID-19, by Emanuele Guidotti. Now has an R package
- Our world in data / Covid page
- SGIM Research Dataset Compendium is designed to assist investigators conducting research on existing datasets, with a particular emphasis on health services research, clinical epidemiology, and research on medical education. Public dataset list.
Historical data
- Medieval and Early Modern Data Bank – The Medieval and Early Modern Data Bank (MEMDB) is a project established at Rutgers University to provide scholars with an expanding library of information in electronic format on the medieval and early modern periods of European history, circa 800-1815 C.E. It has six different datasets on prices, currencies, and textile production, for example on European currency exchange rates in medieval times.
- Yale historical financial research data – old stock markets, plus cool stuff like data on the South Sea Bubble of 1720
Technology
- The chips dataset covers 2185 CPUs and 2668 GPUs.
Collections
- Data is plural spreadsheet – one of the coolest data collections, based on Data is plural, a newsletter by Jeremy Singer-Vine
- Public API collection – just a wonderful collection of APIs to a plethora of sources, really great on environment, finance, popular culture, transport, and many more
- 538, some datasets shared – Fivethirtyeight.com is a politics, sports, and entertainment website focusing on data-driven analysis.
- Tableau – If interested in more sports data, check out the collection by Tableau
- Machine Learning Repository – for machine learning projects, a gateway to a wealth of resources is the University of California, Irvine’s repo
- Tidy Tuesdays, a weekly data project series. Not only for R users: a great collection for data wrangling and visualization projects.
- Data.world is a fantastic collection of datasets, such as various sources on the environment. Partly free, may need sign-up
- R datasets - a collection of datasets included in various R packages by Vincent Arel-Bundock
- Datahub’s collection – another collection of data on the economy, environment, and more
- Social Science Data Sources & Statistical Methods
- Trade, globalization, tax datasets by Baptiste Souillard
- Education data, USA and beyond by Paul Bruno