Reproducible Research Repository

Reproducibility package for Global Poverty Estimation Using Private and Public Sector Big Data Sources

2024
Reference ID
PP_WLD_2024_67-v01
DOI
https://doi.org/10.60572/zk6t-x836
Author(s)
Robert Marty, Alice Duhaut
Collections
Journal articles
Created on
Mar 09, 2024
Last modified
Sep 23, 2024
  • Overview

    Abstract

    Household surveys give a precise estimate of poverty; however, surveys are costly and are fielded infrequently. We demonstrate the importance of jointly using multiple public and private sector data sources to estimate levels and changes in wealth for a large set of countries. We train models using 63,854 survey cluster locations across 59 countries, relying on data from satellites, Facebook Marketing information, and OpenStreetMaps. The model generalizes previous approaches to a wide set of countries. On average, across countries, the model explains 55% (min = 14%; max = 85%) of the variation in levels of wealth at the survey cluster level and 59% (min = 0%; max = 93%) of the variation at the district level, and the model explains 4% (min = 0%; max = 17%) and 6% (min = 0%; max = 26%) of the variation of changes in wealth at the cluster and district levels. Models perform best in lower-income countries and in countries with higher variance in wealth. Features from nighttime lights, OpenStreetMaps, and land cover data are most important in explaining levels of wealth, and features from nighttime lights are most important in explaining changes in wealth.

    Reproducibility Package

    Scripts
    Readme
    Link: https://reproducibility.worldbank.org/index.php/catalog/110/download/287/README.pdf
    Reproducibility package (code and partial data) for Global Poverty Estimation Using Private and Public Sector Big Data Sources
    Title
    Reproducibility package (code and partial data) for Global Poverty Estimation Using Private and Public Sector Big Data Sources
    Date
    2024-02
    Description
    The code in this folder generates the tables and figures in the paper "Global Poverty Estimation Using Private and Public Sector Big Data Sources" by Robert Marty and Alice Duhaut.
    Dependencies
    All dependencies are in the renv of the package and explicitly mentioned in the scripts.
    Instructions
    See README in the reproducibility package.
    Notes
    Computational reproducibility verified by Development Impact (DIME) Analytics team, World Bank.
    Source code repository
    Repository name URI
    Reproducible Research Repository (World Bank) https://reproducibility.worldbank.org
    Github https://github.com/dime-worldbank/big-data-poverty-estimation
    Software
    R
    Name
    R
    Version
    4.2.0
    Python
    Name
    Python
    Version
    3.9
    Stata
    Name
    Stata
    Version
    17

    Reproducibility

    Technology environment

    The code was reproduced on a computer with the following specifications:
    • OS: Windows 11 Home
    • Processor: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz 2.3GHz
    • Memory available: 15.8 GB
    • Software version: R 4.2

    Technology requirements

    ~2 hours run-time starting from analysis-ready data

    Reproduction instructions

    This replication package is divided into two parts:

    Part 1: Creating Analysis-Ready Datasets from Raw Data

    The first phase focuses on preparing the raw data for analysis. Due to data sharing restrictions, users must manually download specific datasets from their respective sources. The package's README file provides detailed instructions, including URLs, necessary access protocols, and guidelines for organizing the data folder, which is crucial for the package’s proper operation.

    The _main.R script is set up to skip the process of creating analysis-ready datasets from raw data and instead loads pre-prepared datasets directly. However, users can modify a parameter in _main.R to run scripts that transform the raw data into analysis-ready datasets. This process involves:

    • Manually downloading raw data from the specified sources.
    • Running a series of scripts written in Stata and Python. While _main.R points to these scripts, they must be opened and executed manually by the user, as documented in the README file.
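
Before switching _main.R to the raw-data path, it can help to confirm that the manually downloaded files sit where the scripts expect them. The following is a minimal Python sketch and is not part of the package. The folder list is copied from the dataset notes on this page; the function name `missing_raw_dirs` and the `data_root` argument are hypothetical. It only reports folders that are absent or empty. Because the package ships some of these folders with placeholder subfolders or README files, a non-empty folder does not prove that a download is complete.

```python
from pathlib import Path

# Raw-data folders named in the dataset notes on this page (relative to Data/).
RAW_DIRS = [
    "DHS/RawData",
    "DHS_nga_policy_experiment/RawData",
    "LSMS/RawData/individual_files",
    "DMSPOLS_VIIRS_Harmonized/RawData",
    "Globcover/RawData",
    "OSM/RawData",
    "Sentinel 5P Pollution/RawData",
]


def missing_raw_dirs(data_root):
    """Return the raw-data folders that do not exist or contain no entries."""
    root = Path(data_root)
    missing = []
    for rel in RAW_DIRS:
        folder = root / rel
        # A folder counts as missing if it is absent or has nothing inside it.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # "Data" is an assumed location; point this at the package's Data folder.
    for rel in missing_raw_dirs("Data"):
        print(f"Missing or empty: Data/{rel}")
```

Running the check from the package root before the Stata and Python cleaning scripts gives a quick list of folders that still need attention; the README remains the authoritative description of the expected layout.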

    Part 2: Replicating the Analysis Using Analysis-Ready Datasets

    • After the raw data has been processed into analysis-ready datasets, the _main.R script continues with the analysis. Users need to adjust the file paths, and the script will automatically run the necessary analyses, producing the figures and tables shown in the paper.
    • For this replication, the replicators ran the analysis using the analysis-ready datasets due to the significant time and computational resources required to process the raw data. The verification process began with an intermediate dataset provided by the authors. Although the reproducibility package includes the necessary code to construct this intermediate dataset, we did not verify this process due to the extensive time required (approximately five months for Facebook data and two weeks for other sources).

    Data

    Datasets
    Demographic and Health Surveys (DHS)
    Name
    Demographic and Health Surveys (DHS)
    Note
    The user should download the datasets from the link below and put the data in Data/DHS/RawData; this directory contains folders that indicate which datasets need to be downloaded. For example, 2020 data for Kenya for the "HR" (Household Recode) dataset should be placed here: /KE/KE_2020_MIS_03292022_2054_82518/KEHR81DT. After this, the user must run the scripts in 01_clean_dhs. The analysis-ready, cleaned datasets will be located within the /Data folder at DHS/FinalData/Merged Datasets/survey_alldata_clean.[Rds/csv].
    Access policy
    The dataset is public but cannot be republished, and thus is not included in the reproducibility package.
    Data URL
    https://dhsprogram.com/data/
    Demographic and Health Surveys (DHS) - Nigeria
    Name
    Demographic and Health Surveys (DHS) - Nigeria
    Note
    The paper includes a specific analysis of Nigeria. Following a similar process as above, data should be downloaded and placed in Data/DHS_nga_policy_experiment/RawData. The cleaning code will then produce the analysis-ready cleaned datasets. For this, the user must run the scripts in 01_clean_dhs 01_clean_dhs_nga_experiment. The analysis-ready, cleaned datasets within the /Data folder will be located at DHS_nga_policy_experiment/FinalData/Merged Datasets/survey_alldata_clean.[Rds/csv].
    Access policy
    The dataset is public but cannot be republished, and thus is not included in the reproducibility package.
    Data URL
    https://dhsprogram.com/data/
    Living Standards Measurement Study (LSMS)
    Name
    Living Standards Measurement Study (LSMS)
    Note
    To start the analysis from the original data, navigate to Data/LSMS/RawData/individual_files, where you'll find a dedicated folder for each country. Inside each country's folder, there's a README file that documents the specific datasets you will need to download into that folder. Go to the links below and download the indicated files. After downloading and placing the files in the folder, you must execute the scripts located in 01_clean_lsms to generate the analysis-ready datasets. The cleaned datasets will be placed in the package at Data/LSMS/FinalData/Merged Datasets/survey_alldata_clean.[Rds/csv].
    Access policy
    The dataset is public, but not published with the package. Clear instructions to download the data and run the analysis are available in the README file.
    Data URL
    BEN: https://microdata.worldbank.org/index.php/catalog/4291/; BFA: https://microdata.worldbank.org/index.php/catalog/4290; CIV: https://microdata.worldbank.org/index.php/catalog/4292; ETH: https://microdata.worldbank.org/index.php/catalog/3823; MWI: https://microdata.worldbank.org/index.php/catalog/3818; TGO: https://microdata.worldbank.org/index.php/catalog/4298
    Harmonized Nighttime Lights
    Name
    Harmonized Nighttime Lights
    Note
    To download the data from the source, go to the link below and place the files in Data/DMSPOLS_VIIRS_Harmonized/RawData. To create the analysis-ready datasets, run the scripts in 02_get_process_ancillary_data/DMSPOLS_VIIRS_Harmonized. Source: Li, Xuecao; Zhou, Yuyu; Zhao, Min; Zhao, Xia (2020). Harmonization of DMSP and VIIRS nighttime light data from 1992-2020 at the global scale. figshare. Dataset. https://doi.org/10.6084/m9.figshare.9828827.v5
    Access policy
    Published with the reproducibility package.
    Data URL
    https://figshare.com/articles/dataset/Harmonization_of_DMSP_and_VIIRS_nighttime_light_data_from_1992-2018_at_the_global_scale/9828827/5
    ESA Land Cover Classification Gridded Maps
    Name
    ESA Land Cover Classification Gridded Maps
    Note
    To download the data, follow the link below, create an account, download the data, and place the files in Data/Globcover/RawData. For 1992 to 2015 data, put the ESACCI-LC-L4-LCCS-Map-300m-P1Y-1992_2015-v2.0.7.tif file in the /1992_2015_data folder. For 2016 to 2018 data, (1) put the .nc files in the 2016_2018 folder, then (2) use the globcover_netcdf_to_geotiff script to convert the .nc files to .tif files. Then run the scripts in 02_get_process_ancillary_data/Globcover. Source: Copernicus Climate Change Service, Climate Data Store, (2019): Land cover classification gridded maps from 1992 to present derived from satellite observation. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). DOI: 10.24381/cds.006f2c9a. In the analysis-ready data, GlobCover variables are distinguished by a numeric ID (e.g., _2, _3, etc.). The following dataset shows what each parameter ID corresponds to: /Data/Globcover/RawData/gc_classes.csv
    Access policy
    The data is publicly available but has not been included in the reproducibility package due to the large size of the files. Detailed instructions for accessing the data and understanding the folder structure are provided.
    Data URL
    https://cds.climate.copernicus.eu/cdsapp#!/dataset/satellite-land-cover?tab=form
    Open Street Maps
    Name
    Open Street Maps
    Note
    Users must download data from Geofabrik at the link below. To find data for a specific country, (1) click the continent the country is in, (2) click the name of the country, (3) click "raw directory index", (4) and find the relevant date to download; the file that ends in shp.zip should be downloaded. Download the file and unzip it. Place the file in the relevant folder within Data/OSM/RawData; this folder contains subfolders for each country and year where OpenStreetMap data needs to be downloaded and stored. For example, the data downloaded and unzipped from kenya-210101-free.shp.zip should be placed in Data/OSM/RawData/kenya-210101-free.shp
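
The unzip-and-place step can be sketched in Python as follows. This is an illustration under stated assumptions, not code from the package: the function names `osm_target_dir` and `unzip_osm` are hypothetical, and the sketch assumes the archive holds its shapefiles at the top level, so that extracting into a folder named after the archive (minus `.zip`) reproduces the layout in the Kenya example above.

```python
import zipfile
from pathlib import Path


def osm_target_dir(zip_path, osm_raw_dir):
    """Map e.g. kenya-210101-free.shp.zip to <osm_raw_dir>/kenya-210101-free.shp."""
    name = Path(zip_path).name
    if not name.endswith(".shp.zip"):
        raise ValueError(f"Expected a file ending in .shp.zip, got {name!r}")
    # Drop only the trailing ".zip" so the folder keeps its ".shp" suffix.
    return Path(osm_raw_dir) / name[: -len(".zip")]


def unzip_osm(zip_path, osm_raw_dir):
    """Extract a downloaded shapefile archive into its matching folder."""
    target = osm_target_dir(zip_path, osm_raw_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `unzip_osm("kenya-210101-free.shp.zip", "Data/OSM/RawData")` would extract into Data/OSM/RawData/kenya-210101-free.shp. If an archive turns out to contain its own top-level folder, the files would end up one level deeper and need to be moved by hand.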
    Access policy
    The data is publicly available but has not been included in the reproducibility package due to the large size of the shape files. Detailed instructions for accessing the data and understanding the folder structure are provided.
    Data URL
    https://download.geofabrik.de/
    GADM maps and data
    Name
    GADM maps and data
    Note
    00_download_gadm downloads GADM data that is used in cleaning survey data.
    Access policy
    The code to download the information is directly provided with the package (script 00_download_gadm).
    Sentinel 5P Pollution Data
    Name
    Sentinel 5P Pollution Data
    Note
    To obtain this data, run the script 01_download_s5p.js in the Google Earth Engine code editor and put the data in Data/Sentinel 5P Pollution/RawData.
    Access policy
    Public but does not allow republication and therefore is not included in the package, but the code to download it is directly included in the package.
    Facebook Marketing
    Name
    Facebook Marketing
    Note
    This data is directly extracted in the code in 02_get_process_ancillary_data. This part of the code took five months to run, and therefore the ready-to-analyze datasets are directly included in the reproducibility package at Data/Facebook Marketing/. In the analysis-ready data, Facebook variables will be distinguished by a number ID (e.g., _2, _3, etc). The following dataset shows what each parameter ID corresponds to /Data/Facebook Marketing/FinalData/facebook_marketing_parameters_clean.[Rds/csv]
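
Resolving a numeric suffix such as `_2` to its meaning amounts to a lookup against the parameter file. The Python sketch below shows one way to do it and makes several assumptions: the column names `param_id` and `param_name` are hypothetical defaults (check the header of the actual CSV), and the variable name in the usage note is invented for illustration. The same approach would apply to the GlobCover lookup file with its own column names.

```python
import csv
import re


def load_parameter_labels(csv_path, id_col="param_id", label_col="param_name"):
    """Read an ID-to-label lookup; the column names are hypothetical defaults."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[id_col].strip(): row[label_col] for row in csv.DictReader(handle)}


def parameter_id(variable_name):
    """Return the trailing numeric ID of a variable name, or None if there is none."""
    match = re.search(r"_(\d+)$", variable_name)
    return match.group(1) if match else None


def describe_variable(variable_name, labels):
    """Look up the label for a variable's numeric suffix, if it has one."""
    return labels.get(parameter_id(variable_name))
```

With the lookup loaded once, `describe_variable("some_facebook_variable_2", labels)` returns the label stored for ID 2, or None when the name has no numeric suffix or the ID is not in the file.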
    Access policy
    Public. The data and the code to download it are directly included in the package.
    Data URL
    https://developers.facebook.com/docs/marketing-api
    NASA's Black Marble
    Name
    NASA's Black Marble
    Note
    This data is downloaded directly by the scripts 01_download_black_marble_annual.R and 01_download_black_marble_monthly.R. After execution, the files will be saved at NTL Black Marble/FinalData. Please see the links below for more information about NASA's Black Marble data.
    Access policy
    Public. The package includes code that enables direct download of the information.
    Data URL
    https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/5000/VNP46A3/, https://worldbank.github.io/blackmarbler/
    World Development Indicators
    Name
    World Development Indicators
    Note
    Located at Data/WDI/FinalData/wdi; the code to retrieve it is 02_get_process_ancillary_data/WDI/download_wdi.R
    Access policy
    Published with the package and the code to retrieve this is published in the package.
    Data URL
    https://github.com/vincentarelbundock/WDI
    Data statement

    Some data is confidential and has not been included in the package. For more details, please refer to the README file.

    Description

    Output
    Global Poverty Estimation Using Private and Public Sector Big Data Sources
    Type
    Published Paper
    Title
    Global Poverty Estimation Using Private and Public Sector Big Data Sources
    Authors
    Robert Marty and Alice Duhaut
    URL
    https://www.nature.com/articles/s41598-023-49564-6
    DOI
    https://doi.org/10.1038/s41598-023-49564-6
    Authors
    Author Affiliation Email
    Robert Marty World Bank rmarty@worldbank.org
    Alice Duhaut World Bank aduhaut@worldbank.org
    Date of production

    2024-02

    Scope and coverage

    Geographic locations
    Location Code
    World WLD

    Disclaimer

    Disclaimer

    The materials in the reproducibility packages are distributed as they were prepared by the staff of the International Bank for Reconstruction and Development/The World Bank. The findings, interpretations, and conclusions expressed in this event do not necessarily reflect the views of the World Bank, the Executive Directors of the World Bank, or the governments they represent. The World Bank does not guarantee the accuracy of the materials included in the reproducibility package.

    Access and rights

    License
    Name URI
    Modified BSD3 https://opensource.org/license/bsd-3-clause/

    Contacts

    Contacts
    Name Affiliation Email
    Robert Marty World Bank rmarty@worldbank.org
    Reproducibility WBG World Bank reproducibility@worldbank.org

    Information on metadata

    Producers
    Name: Reproducibility WBG
    Abbreviation: DIME
    Affiliation: World Bank - Development Impact Department
    Role: Verification and preparation of metadata
    Date of Production

    2024-02-14

    Document version

    1

    © The World Bank Group, All Rights Reserved.