Reproducibility package for Global Poverty Estimation Using Private and Public Sector Big Data Sources

2024


Reference ID

PP_WLD_2024_67-v01

DOI

https://doi.org/10.60572/zk6t-x836

Author(s)

Robert Marty, Alice Duhaut

Collections

Journal articles

Created on

Mar 09, 2024

Last modified

Sep 23, 2024

Overview

Abstract

Household surveys give a precise estimate of poverty; however, surveys are costly and are fielded infrequently. We demonstrate the importance of jointly using multiple public and private sector data sources to estimate levels and changes in wealth for a large set of countries. We train models using 63,854 survey cluster locations across 59 countries, relying on data from satellites, Facebook Marketing information, and OpenStreetMaps. The model generalizes previous approaches to a wide set of countries. On average, across countries, the model explains 55% (min = 14%; max = 85%) of the variation in levels of wealth at the survey cluster level and 59% (min = 0%; max = 93%) of the variation at the district level, and the model explains 4% (min = 0%; max = 17%) and 6% (min = 0%; max = 26%) of the variation of changes in wealth at the cluster and district levels. Models perform best in lower-income countries and in countries with higher variance in wealth. Features from nighttime lights, OpenStreetMaps, and land cover data are most important in explaining levels of wealth, and features from nighttime lights are most important in explaining changes in wealth.

Reproducibility Package

Scripts

Readme

Link: https://reproducibility.worldbank.org/index.php/catalog/110/download/287/README.pdf

Reproducibility package (code and partial data) for Global Poverty Estimation Using Private and Public Sector Big Data Sources

File name

PP_WLD_2024_67-v01.zip

Zip package

PP_WLD_2024_67-v01.zip

Title

Reproducibility package (code and partial data) for Global Poverty Estimation Using Private and Public Sector Big Data Sources

Date

2024-02

Description

The code in this folder generates the tables and figures in the paper "Global Poverty Estimation Using Private and Public Sector Big Data Sources" by Robert Marty and Alice Duhaut.

Dependencies

All dependencies are in the renv of the package and explicitly mentioned in the scripts.

Instructions

See README in the reproducibility package.

Notes

Computational reproducibility verified by Development Impact (DIME) Analytics team, World Bank.

Source code repository

Repository name	URI
Reproducible Research Repository (World Bank)	https://reproducibility.worldbank.org
GitHub	https://github.com/dime-worldbank/big-data-poverty-estimation

Software

R

Name

R

Version

4.2.0

Python

Name

Python

Version

3.9

Stata

Name

Stata

Version

Reproducibility

Technology environment

The code was reproduced in a computer with the following specifications:
• OS: Windows 11 Home
• Processor: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz 2.3GHz
• Memory available: 15.8 GB
• Software version: R 4.2

Technology requirements

~2 hours run-time starting from analysis-ready data

Reproduction instructions

This replication package is divided into two parts:

Part 1: Creating Analysis-Ready Datasets from Raw Data

The first phase focuses on preparing the raw data for analysis. Due to data sharing restrictions, users must manually download specific datasets from their respective sources. The package's README file provides detailed instructions, including URLs, necessary access protocols, and guidelines for organizing the data folder, which is crucial for the package’s proper operation.
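Before running the raw-data scripts, it can help to confirm that the manually downloaded files are where the package expects them. The following is a minimal sketch, not part of the package: the function name `missing_raw_dirs` is hypothetical, and the folder list is taken only from the paths named in the Data section of this page, so the README remains the authoritative source for the full layout.

```python
from pathlib import Path

# Raw-data folders named on this catalog page (relative to the Data folder).
# Assumption: this list is illustrative; the package README is authoritative.
RAW_DATA_DIRS = [
    "DHS/RawData",
    "DHS_nga_policy_experiment/RawData",
    "LSMS/RawData/individual_files",
    "DMSPOLS_VIIRS_Harmonized/RawData",
    "Globcover/RawData",
    "OSM/RawData",
    "Sentinel 5P Pollution/RawData",
]


def missing_raw_dirs(data_root):
    """Return the expected raw-data folders that are absent or empty."""
    root = Path(data_root)
    missing = []
    for rel in RAW_DATA_DIRS:
        folder = root / rel
        # A folder counts as missing if it does not exist or has no entries.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(rel)
    return missing
```

Calling `missing_raw_dirs("Data")` from the package root returns the folders that still need data; an empty list means every folder listed above contains at least one entry.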

The _main.R script is set up to skip the process of creating analysis-ready datasets from raw data and instead loads pre-prepared datasets directly. However, users can modify a parameter in _main.R to run scripts that transform the raw data into analysis-ready datasets. This process involves:

• Manually downloading raw data from the specified sources.
• Running a series of scripts written in Stata and Python. While _main.R points to these scripts, they must be opened and executed manually by the user, as documented in the README file.

Part 2: Replicating the Analysis Using Analysis-Ready Datasets

After the raw data has been processed into analysis-ready datasets, the _main.R script continues with the analysis. Users need to adjust the file paths, and the script will automatically run the necessary analyses, producing the figures and tables shown in the paper.
For this replication, the replicators ran the analysis using the analysis-ready datasets because processing the raw data requires significant time and computational resources. The verification process began with an intermediate dataset provided by the authors. Although the reproducibility package includes the code needed to construct this intermediate dataset, the replicators did not verify that step because of the extensive time required: approximately five months for the Facebook data and two weeks for the other sources.

Data

Datasets

Demographic and Health Surveys (DHS)

Name

Demographic and Health Surveys (DHS)

Note

The user should download the datasets from the link below and place them in Data/DHS/RawData; this directory contains folders that indicate which datasets need to be downloaded. For example, the 2020 "HR" (Household Recode) dataset for Kenya should be placed here: /KE/KE_2020_MIS_03292022_2054_82518/KEHR81DT. After this, the user must run the scripts in 01_clean_dhs. The analysis-ready, cleaned datasets will be located within the /Data folder at DHS/FinalData/Merged Datasets/survey_alldata_clean.[Rds/csv].

Access policy

The dataset is public but cannot be republished, and thus is not included in the reproducibility package.

Data URL

https://dhsprogram.com/data/

Demographic and Health Surveys (DHS) - Nigeria

Name

Demographic and Health Surveys (DHS) - Nigeria

Note

The paper includes a specific analysis of Nigeria. Following a similar process as above, data should be downloaded and placed in Data/DHS_nga_policy_experiment/RawData. The cleaning code will then produce the analysis-ready cleaned datasets. For this, the user must run the scripts in 01_clean_dhs 01_clean_dhs_nga_experiment. The analysis-ready, cleaned datasets within the /Data folder will be located at DHS_nga_policy_experiment/FinalData/Merged Datasets/survey_alldata_clean.[Rds/csv].

Access policy

The dataset is public but cannot be republished, and thus is not included in the reproducibility package.

Data URL

https://dhsprogram.com/data/

Living Standards Measurement Study (LSMS)

Name

Living Standards Measurement Study (LSMS)

Note

To start the analysis from the original data, navigate to Data/LSMS/RawData/individual_files, where you'll find a dedicated folder for each country. Inside each country's folder, there's a README file that documents the specific datasets you will need to download into that folder. Go to the links below and download the indicated files. After downloading and placing the files in the folder, you must execute the scripts located in 01_clean_lsms to generate the analysis-ready datasets. The cleaned datasets will be placed in the package at Data/LSMS/FinalData/Merged Datasets/survey_alldata_clean.[Rds/csv].

Access policy

The dataset is public, but not published with the package. Clear instructions to download the data and run the analysis are available in the README file.

Data URL

BEN: https://microdata.worldbank.org/index.php/catalog/4291/; BFA: https://microdata.worldbank.org/index.php/catalog/4290; CIV: https://microdata.worldbank.org/index.php/catalog/4292; ETH: https://microdata.worldbank.org/index.php/catalog/3823; MWI: https://microdata.worldbank.org/index.php/catalog/3818; TGO: https://microdata.worldbank.org/index.php/catalog/4298

Harmonized Nighttime Lights

Name

Harmonized Nighttime Lights

Note

To download the data from the source, go to the link below, download the files, and place them in Data/DMSPOLS_VIIRS_Harmonized/RawData. To create the analysis-ready datasets, run the scripts in 02_get_process_ancillary_data/DMSPOLS_VIIRS_Harmonized. Source: Li, Xuecao; Zhou, Yuyu; Zhao, Min; Zhao, Xia (2020). Harmonization of DMSP and VIIRS nighttime light data from 1992-2020 at the global scale. figshare. Dataset. https://doi.org/10.6084/m9.figshare.9828827.v5

Access policy

Published with the reproducibility package.

Data URL

https://figshare.com/articles/dataset/Harmonization_of_DMSP_and_VIIRS_nighttime_light_data_from_1992-2018_at_the_global_scale/9828827/5

ESA Land Cover Classification Gridded Maps

Name

ESA Land Cover Classification Gridded Maps

Note

To download the data, follow the link below, create an account, download the data, and place the files in Data/Globcover/RawData. For 1992 to 2015 data, put the ESACCI-LC-L4-LCCS-Map-300m-P1Y-1992_2015-v2.0.7.tif file in the /1992_2015_data folder. For 2016 to 2018 data, (1) put the .nc files in the 2016_2018 folder, then (2) use the globcover_netcdf_to_geotiff script to convert the .nc files to .tif files. Then run the scripts in 02_get_process_ancillary_data/Globcover. Source: Copernicus Climate Change Service, Climate Data Store, (2019): Land cover classification gridded maps from 1992 to present derived from satellite observation. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). DOI: 10.24381/cds.006f2c9a. In the analysis-ready data, GlobCover variables are distinguished by a numeric ID (e.g., _2, _3, etc.). The following dataset shows what each ID corresponds to: /Data/Globcover/RawData/gc_classes.csv

Access policy

The data is publicly available but has not been included in the reproducibility package due to the large size of the files. Detailed instructions for accessing the data and understanding the folder structure are provided.

Data URL

https://cds.climate.copernicus.eu/cdsapp#!/dataset/satellite-land-cover?tab=form

Open Street Maps

Name

Open Street Maps

Note

Users must download data from Geofabrik at the link below. To find data for a specific country, (1) click the continent the country is in, (2) click the name of the country, (3) click "raw directory index", and (4) find the relevant date to download; the file to download is the one that ends in shp.zip. Unzip the downloaded file and place the result in the relevant folder within Data/OSM/RawData; this folder contains subfolders for each country and year for which OpenStreetMap data needs to be downloaded and stored. For example, the data downloaded and unzipped from kenya-210101-free.shp.zip should be placed in Data/OSM/RawData/kenya-210101-free.shp
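The unzip-and-place step can be scripted. The sketch below is an illustration rather than code from the package: the helper name `extract_osm_archive` is hypothetical, and it assumes only the naming rule in the example above, where the target folder carries the archive's name without the trailing `.zip`.

```python
import zipfile
from pathlib import Path


def extract_osm_archive(zip_path, osm_raw_dir):
    """Unzip <name>.shp.zip into <osm_raw_dir>/<name>.shp and return that folder.

    Example: kenya-210101-free.shp.zip -> <osm_raw_dir>/kenya-210101-free.shp
    """
    zip_path = Path(zip_path)
    if not zip_path.name.endswith(".shp.zip"):
        raise ValueError(f"expected a file ending in .shp.zip, got {zip_path.name}")
    # Drop only the final ".zip" so the folder keeps the ".shp" suffix.
    target = Path(osm_raw_dir) / zip_path.name[: -len(".zip")]
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For instance, `extract_osm_archive("kenya-210101-free.shp.zip", "Data/OSM/RawData")` would write the shapefiles to Data/OSM/RawData/kenya-210101-free.shp.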

Access policy

The data is publicly available but has not been included in the reproducibility package due to the large size of the shape files. Detailed instructions for accessing the data and understanding the folder structure are provided.

Data URL

https://download.geofabrik.de/

GADM maps and data

Name

GADM maps and data

Note

00_download_gadm downloads GADM data that is used in cleaning survey data.

Access policy

The code to download the information is directly provided with the package (script 00_download_gadm).

Sentinel 5P Pollution Data

Name

Sentinel 5P Pollution Data

Note

To obtain this data, run 01_download_s5p.js in the Google Earth Engine code editor and place the output in Data/Sentinel 5P Pollution/RawData.

Access policy

The data is public but cannot be republished, so it is not included in the package; the code to download it is included.

Facebook Marketing

Name

Facebook Marketing

Note

This data is extracted directly by the code in 02_get_process_ancillary_data. This part of the code took five months to run, so the analysis-ready datasets are included in the reproducibility package at Data/Facebook Marketing/. In the analysis-ready data, Facebook variables are distinguished by a numeric ID (e.g., _2, _3, etc.). The following dataset shows what each parameter ID corresponds to: /Data/Facebook Marketing/FinalData/facebook_marketing_parameters_clean.[Rds/csv]
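Resolving a numeric suffix such as _2 against the parameters file can be done in two small steps. This is a sketch under stated assumptions, not the package's code: the function names are hypothetical, and because this page does not document the column names of the lookup file, they are passed in by the caller rather than assumed.

```python
import csv
import re

# Matches names that end in an underscore followed by digits, e.g. "stem_12".
_SUFFIX = re.compile(r"^(?P<stem>.+)_(?P<id>\d+)$")


def split_id_suffix(variable_name):
    """Split 'stem_12' into ('stem', 12); return (name, None) if no numeric suffix."""
    match = _SUFFIX.match(variable_name)
    if match is None:
        return variable_name, None
    return match.group("stem"), int(match.group("id"))


def load_id_lookup(csv_path, id_column, label_column):
    """Read a lookup CSV into {id: label}; column names are supplied by the caller."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {
            int(row[id_column]): row[label_column]
            for row in csv.DictReader(handle)
        }
```

The same pattern would apply to any other variable family that is keyed by a numeric ID and ships with a lookup file, once the actual column names have been checked in that file.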

Access policy

Public. The data and the code to download it are directly included in the package.

Data URL

https://developers.facebook.com/docs/marketing-api

NASA's Black Marble

Name

NASA's Black Marble

Note

This data is downloaded directly by the scripts 01_download_black_marble_annual.R and 01_download_black_marble_monthly.R, which retrieve NASA's Black Marble data. After execution, the files will be saved at NTL Black Marble/FinalData. Please see the links below for more information about NASA's Black Marble data.

Access policy

Public. The package includes code that enables direct download of the information.

Data URL

https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/5000/VNP46A3/, https://worldbank.github.io/blackmarbler/

World Development Indicators

Name

World Development Indicators

Note

Located at Data/WDI/FinalData/wdi; the code to retrieve it is 02_get_process_ancillary_data/WDI/download_wdi.R

Access policy

Published with the package and the code to retrieve this is published in the package.

Data URL

https://github.com/vincentarelbundock/WDI

Data statement

Some data is confidential and has not been included in the package. For more details, please refer to the README file.

Description

Output

Global Poverty Estimation Using Private and Public Sector Big Data Sources

Type

Published Paper

Title

Global Poverty Estimation Using Private and Public Sector Big Data Sources

Authors

Robert Marty and Alice Duhaut

URL

https://www.nature.com/articles/s41598-023-49564-6

DOI

https://doi.org/10.1038/s41598-023-49564-6

Authors

Author	Affiliation	Email
Robert Marty	World Bank	rmarty@worldbank.org
Alice Duhaut	World Bank	aduhaut@worldbank.org

Date of production

2024-02

Scope and coverage

Geographic locations

Location	Code
World	WLD

Disclaimer

The materials in the reproducibility packages are distributed as they were prepared by the staff of the International Bank for Reconstruction and Development/The World Bank. The findings, interpretations, and conclusions expressed in this event do not necessarily reflect the views of the World Bank, the Executive Directors of the World Bank, or the governments they represent. The World Bank does not guarantee the accuracy of the materials included in the reproducibility package.

Access and rights

License

Name	URI
Modified BSD3	https://opensource.org/license/bsd-3-clause/

Contacts

Name	Affiliation	Email
Robert Marty	World Bank	rmarty@worldbank.org
Reproducibility WBG	World Bank	reproducibility@worldbank.org

Information on metadata

Producers

Name	Abbreviation	Affiliation	Role
Reproducibility WBG	DIME	World Bank - Development Impact Department	Verification and preparation of metadata

Date of Production

2024-02-14

Document version
