A realistic base file for testing privacy-preserving data analysis and publication algorithms


We wanted to make you all aware of a new data asset that the Labor Dynamics Institute, supported by the Sloan Foundation, has now made available.

We have built a synthetic population of the United States entirely from public-use data in the American Community Survey. Unlike some of the other synthetic data projects we have undertaken in the past, these data were constructed to provide a realistic base file for testing privacy-preserving data analysis and publication algorithms. The records in the synthetic population are all actual records from the responses to the American Community Survey (2010-2014) and the data dictionary is the Census Bureau’s data dictionary for the ACS PUMS 2010-2014 (also available at http://doi.org/10.3886/E100486V1).

Unlike all of the toy data sets in common use in the CS community, this one is at scale, and contains all of the public-use microdata. There’s one record for every household in the U.S., every person, and every group quarters for the reference date July 1, 2012. (The group quarters data are severely limited because this synthetic population was built from public-use data.) It does not contain all tabulation variables for every summary table the Census Bureau produces, but it’s a very good start.

Background info

Some familiarity with the way the ACS is coded is required to interpret these data:

  • Geography is coded only to the Public Use Microdata Area (PUMA, approx. 100,000 persons). Any algorithm that performs well for general analysis of these data down to the PUMA will certainly be a candidate to consider for implementation down to the block level.
  • The U.S. Census Bureau publishes 11 billion tabulations from these data every year. They are summarized here: https://www.census.gov/programs-surveys/acs/data.html.

Availability of data and code

Caveats

Some caveats are in order. We used the Bayesian bootstrap method (Rubin 1981) to construct the synthetic population. This means that there are many duplicate households, group quarters, and persons. All the households are internally consistent. To test algorithms at scale using these data, you can take simple random samples of households or persons. This is very similar to the method used by DPBench to rescale toy data sets, except that we built the full-scale model for you.
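As a rough illustration of the two ideas above, here is a minimal Python sketch using only the standard library. It is not the code used to build the data set: the function names, the toy records, and the `hh_id` field (a hypothetical unique identifier for each synthetic household row) are all assumptions made for the example. The first function shows one common reading of a Bayesian bootstrap resample (Dirichlet(1, ..., 1) weights over the source records, then draws with replacement), which is why duplicates appear. The second takes a simple random sample of households and keeps every person in the sampled households, so the sample stays internally consistent.

```python
import random


def bayesian_bootstrap(records, n_out, rng):
    """Draw n_out records using Dirichlet(1, ..., 1) weights over the inputs."""
    # Normalised Exponential(1) draws form a Dirichlet(1, ..., 1) weight vector.
    draws = [rng.expovariate(1.0) for _ in records]
    total = sum(draws)
    weights = [d / total for d in draws]
    # Sampling with replacement under these weights produces duplicate records.
    return rng.choices(records, weights=weights, k=n_out)


def sample_households(households, persons, n, rng):
    """Simple random sample of n household rows plus all of their person rows."""
    chosen = rng.sample(households, n)  # without replacement
    ids = {h["hh_id"] for h in chosen}
    return chosen, [p for p in persons if p["hh_id"] in ids]


rng = random.Random(2012)

# Toy stand-in for source records; "serial" is an illustrative field name.
source = [{"serial": s} for s in ("A", "B", "C", "D")]
synthetic = bayesian_bootstrap(source, 12, rng)

# Give every synthetic household row its own hypothetical hh_id,
# and attach two toy person rows to each household.
households = [{"hh_id": i, "serial": r["serial"]} for i, r in enumerate(synthetic)]
persons = [{"hh_id": h["hh_id"], "person": k} for h in households for k in range(2)]

sample_hh, sample_pp = sample_households(households, persons, 5, rng)
print(len(sample_hh), len(sample_pp))  # 5 10
```

Sampling whole households, rather than persons independently, is the design choice that preserves within-household consistency. If you only need person-level statistics, sampling person rows directly works the same way with `rng.sample(persons, n)`.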

A paper describing the methodology in more detail is forthcoming.

Feedback requested

Whether we invest in a version 2.0 of these data, with detailed geography but still based on public-use data, depends upon the feedback we get on the usefulness of this asset. Please write us at ldi@cornell.edu to provide feedback, or examples of uses of the data.

Citing

  • W. Sexton, J. M. Abowd, I. M. Schmutte, and L. Vilhuber, “Synthetic population housing and person records for the United States [Dataset],” ICPSR – Interuniversity Consortium for Political and Social Research [distributor] 2017.
    @techreport{openICPSR:e100274v1,
    author = {Sexton, William and Abowd, John M. and Schmutte, Ian M. and Vilhuber, Lars},
    title = {Synthetic population housing and person records for the {United States} [Dataset]},
    year = {2017},
    doi = {10.3886/e100274v1},
    institution = {ICPSR - Interuniversity Consortium for Political and Social Research [distributor]},
    owner = {vilhuber},
    timestamp = {2017.05.25},
    }
  • W. Sexton, J. M. Abowd, I. M. Schmutte, and L. Vilhuber, “Synthetic population housing and person records for the United States [Dataset],” Zenodo [distributor] 2017.
    @techreport{zenodo:556121,
    author = {Sexton, William and Abowd, John M. and Schmutte, Ian M. and Vilhuber, Lars},
    title = {Synthetic population housing and person records for
    the {United States} [Dataset]},
    month = apr,
    year = 2017,
    note = {{This work is supported under Grant G-2015-13903
    from the Alfred P. Sloan Foundation on "The
    Economics of Socially-Efficient Privacy and
    Confidentiality Management for Statistical
    Agencies" (PI: John M. Abowd,
    https://www.ilr.cornell.edu/labor-dynamics-institute/research/project-19)}},
    institution={Zenodo [distributor]},
    doi = {10.5281/zenodo.556121},
    url = {https://doi.org/10.5281/zenodo.556121}
    }

Funding

We acknowledge funding by the National Science Foundation (CNS-1012593, SES-1131848) and the Alfred P. Sloan Foundation (G-2015-13903).