Publications

Published papers and proceedings

2019
  • J. M. Abowd and I. M. Schmutte, “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices,” American Economic Review, vol. 109, iss. 1, 2019.
    [Abstract] [DOI] [URL] [Bibtex]

    Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.

    @article{abowdschmutte.aer.2018,
    title={An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices},
    author={John M. Abowd and Ian M. Schmutte},
    year={2019},
    journal={American Economic Review},
    number={1},
    volume={109},
    doi={10.1257/aer.20170627},
    url={https://www.aeaweb.org/articles?id=10.1257/aer.20170627},
    abstract = {Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.}
    }
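
    A minimal numerical sketch of the paper's central rule, choosing the privacy-loss parameter where the marginal benefit of accuracy equals the marginal cost of privacy. The functional forms and constants below are illustrative assumptions, not the paper's calibration.

    # Illustrative sketch: pick epsilon where the marginal benefit of
    # accuracy equals the marginal cost of privacy loss; the functional
    # forms below are assumptions, not the paper's calibration.
    import numpy as np

    def benefit(eps):
        return np.log(1.0 + eps)   # accuracy with diminishing returns (assumed)

    def cost(eps):
        return 0.25 * eps ** 2     # convex privacy cost (assumed)

    eps_grid = np.linspace(0.01, 5.0, 1000)
    eps_star = eps_grid[np.argmax(benefit(eps_grid) - cost(eps_grid))]
    # First-order condition: 1/(1+eps) = 0.5*eps, so eps* = 1 here.
    print(f"optimal epsilon ~ {eps_star:.2f}")
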
  • J. M. Abowd, I. M. Schmutte, W. N. Sexton, and L. Vilhuber, “Why the Economics Profession Must Actively Participate in the Privacy Protection Debate,” AEA Papers and Proceedings, 2019.
    [Bibtex]
    @article{aeapp2019,
    author = {John M. Abowd and Ian M. Schmutte and William N. Sexton and Lars Vilhuber},
    title = {Why the Economics Profession Must Actively Participate in the Privacy Protection Debate},
    year = {2019},
    month = may,
    journal = {AEA Papers and Proceedings},
    owner = {vilhuber},
    timestamp = {2019.04.04},
    }
2018
  • J. M. Abowd, F. Kramarz, S. Perez-Duarte, and I. M. Schmutte, “Sorting Between and Within Industries: A Testable Model of Assortative Matching,” Annals of Economics and Statistics, iss. 129, pp. 1-32, 2018.
    [Abstract] [DOI] [URL] [Bibtex]

    We test Shimer’s (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting: more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated.

    @Article{annalsSorting,
    author = {John M. Abowd and Francis Kramarz and Sebastien Perez-Duarte and Ian M. Schmutte},
    title = {Sorting Between and Within Industries: A Testable Model of Assortative Matching},
    journal = {Annals of Economics and Statistics},
    year = {2018},
    number = {129},
    pages = {1-32},
    doi = {10.15609/annaeconstat2009.129.0001},
    url = {https://doi.org/10.15609/annaeconstat2009.129.0001},
    abstract = {We test Shimer's (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting: more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated.}
    }
  • J. M. Abowd, K. L. McKinney, and N. Zhao, “Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data,” Journal of Labor Economics, vol. 36, iss. S1, pp. 183-300, 2018.
    [Abstract] [DOI] [URL] [Bibtex]

    Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20%, the middle 60% and the top 20% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment. To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index. We show that, while the differences between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility. High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there.

    @Article{jole2018,
    author = {John M. Abowd and Kevin L. McKinney and Nellie Zhao},
    title = {Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data},
    journal = {Journal of Labor Economics},
    year=2018,
    volume={36},
    number={S1},
    pages={183-300},
    doi = {10.1086/694104},
    url = {https://doi.org/10.1086/694104},
    abstract = {Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20\%, the middle 60\% and the top 20\% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment. To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index. We show that, while the difference between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility. High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there.},
    owner = {vilhuber},
    timestamp = {2017.09.21},
    }
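
    The log-earnings model with additive worker and firm effects described in this abstract is a two-way fixed-effects (AKM-style) decomposition. Below is a minimal sketch of that estimation on simulated toy data via a sparse least-squares solve; the dimensions and data are invented for illustration and are not the paper's production code.

    # Sketch of a two-way (worker + firm) fixed-effects earnings model on
    # toy data; an AKM-style decomposition, not the paper's actual code.
    import numpy as np
    from scipy.sparse import coo_matrix
    from scipy.sparse.linalg import lsqr

    rng = np.random.default_rng(0)
    n_workers, n_firms, n_obs = 200, 20, 1000
    worker = rng.integers(0, n_workers, n_obs)
    firm = rng.integers(0, n_firms, n_obs)
    theta = rng.normal(0.0, 0.5, n_workers)   # true worker effects (assumed)
    psi = rng.normal(0.0, 0.3, n_firms)       # true firm effects (assumed)
    y = theta[worker] + psi[firm] + rng.normal(0.0, 0.2, n_obs)

    # Design matrix with one dummy column per worker and per firm.
    rows = np.repeat(np.arange(n_obs), 2)
    cols = np.column_stack([worker, n_workers + firm]).ravel()
    X = coo_matrix((np.ones(2 * n_obs), (rows, cols)),
                   shape=(n_obs, n_workers + n_firms))
    est = lsqr(X.tocsr(), y)[0]               # effects identified up to a constant
    print("corr(true, estimated worker effects):",
          round(float(np.corrcoef(theta, est[:n_workers])[0, 1]), 2))
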
  • D. H. Weinberg, J. M. Abowd, R. F. Belli, N. Cressie, D. C. Folch, S. H. Holan, M. C. Levenstein, K. M. Olson, J. P. Reiter, M. D. Shapiro, J. Smyth, L. Soh, B. D. Spencer, S. E. Spielman, L. Vilhuber, and C. K. Wikle, “Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the U.S. Statistical System?,” Journal of Survey Statistics and Methodology, 2018.
    [Abstract] [DOI] [URL] [Bibtex]

    The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives.

    @Article{ncrn-summary,
    author = {Daniel H. Weinberg and John M. Abowd and Robert F. Belli and Noel Cressie and David C. Folch and Scott H. Holan and Margaret C. Levenstein and Kristen M. Olson and Jerome P. Reiter and Matthew D. Shapiro and Jolene Smyth and Leen-Kiat Soh and Bruce D. Spencer and Seth E. Spielman and Lars Vilhuber and Christopher K. Wikle},
    title = {{Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the U.S. Statistical System?}},
    journal = {Journal of Survey Statistics and Methodology},
    year = {2018},
    abstract = {The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives.},
    doi = {10.1093/jssam/smy023},
    eprint = {/oup/backfile/content_public/journal/jssam/pap/10.1093_jssam_smy023/1/smy023.pdf},
    url = {https://doi.org/10.1093/jssam/smy023},
    }
  • A. Slavković and L. Vilhuber, “Remembering Stephen Fienberg,” Journal of Privacy and Confidentiality, vol. 8, iss. 1, 2018.
    [DOI] [Bibtex]
    @Article{Slavkovic2018,
    author = {Aleksandra Slavković and Lars Vilhuber},
    title = {Remembering Stephen Fienberg},
    journal = {Journal of Privacy and Confidentiality},
    year = {2018},
    volume = {8},
    number = {1},
    month = {dec},
    doi = {10.29012/jpc.685},
    owner = {vilhuber},
    publisher = {Journal of Privacy and Confidentiality},
    timestamp = {2019.04.04},
    }
  • L. Vilhuber, “Relaunching the Journal of Privacy and Confidentiality,” Journal of Privacy and Confidentiality, vol. 8, iss. 1, 2018.
    [DOI] [Bibtex]
    @Article{Vilhuber2018,
    author = {Lars Vilhuber},
    title = {Relaunching the Journal of Privacy and Confidentiality},
    journal = {Journal of Privacy and Confidentiality},
    year = {2018},
    volume = {8},
    number = {1},
    month = {dec},
    doi = {10.29012/jpc.706},
    owner = {vilhuber},
    publisher = {Journal of Privacy and Confidentiality},
    timestamp = {2019.04.04},
    }
2017
  • J. M. Abowd, “How Will Statistical Agencies Operate When All Data Are Private?,” Journal of Privacy and Confidentiality, vol. 7, iss. 3, 2017.
    [Abstract] [DOI] [URL] [Bibtex]

    The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it, compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations, blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.

    @Article{Abowd:JPC:2017,
    author = {John M. Abowd},
    title = {How Will Statistical Agencies Operate When All Data Are Private?},
    journal = {Journal of Privacy and Confidentiality},
    year = {2017},
    volume = {7},
    number = {3},
    abstract = {The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the ``Big Data'' era. There are orders of magnitude more data outside an agency's firewall than inside it, compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was ``asked'' in a context wholly outside the agency's operations, blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.},
    owner = {vilhuber},
    timestamp = {2017.05.03},
    url = {https://doi.org/10.29012/jpc.v7i3.404},
    doi = {10.29012/jpc.v7i3.404}
    }
  • L. Vilhuber and C. Lagoze, “Making Confidential Data Part of Reproducible Research,” Chance, 2017.
    [URL] [Bibtex]
    @article {chance:2017,
    title = {Making Confidential Data Part of Reproducible Research},
    journal = {Chance},
    year = {2017},
    month = {09/2017},
    url = {http://chance.amstat.org/2017/09/reproducible-research/},
    author = {Vilhuber, Lars and Lagoze, Carl}
    }
  • S. Haney, A. Machanavajjhala, J. M. Abowd, M. Graham, and M. Kutzbach, “Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics,” Proceedings of the 2017 ACM International Conference on Management of Data, 2017.
    [Abstract] [DOI] [URL] [Bibtex]

    National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ε ≥ 1, the additive error introduced by our provably private algorithms is comparable to, and in some cases better than, the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.

    @Article{2541,
    author = {Samuel Haney and Ashwin Machanavajjhala and John M. Abowd and Matthew Graham and Mark Kutzbach},
    title = {Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics},
    journal = {Proceedings of the 2017 ACM International Conference on Management of Data},
    year = {2017},
    abstract = {National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures.
    In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ε>= 1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.},
    doi = {10.1145/3035918.3035940},
    isbn = {978-1-4503-4197-4},
    owner = {vilhuber},
    timestamp = {2017.09.28},
    url = {http://dl.acm.org/citation.cfm?doid=3035918.3035940},
    }
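
    For contrast with the customized mechanisms developed in this paper, here is a minimal sketch of the textbook baseline it benchmarks against: an ε-differentially private release of a count table via the Laplace mechanism. The counts and ε below are toy values, and this is the standard mechanism, not the paper's Pufferfish-based algorithms.

    # Textbook Laplace mechanism for a count table (sensitivity 1 per cell
    # under add/remove of one record); a baseline sketch only.
    import numpy as np

    def laplace_release(counts, epsilon, rng=np.random.default_rng()):
        scale = 1.0 / epsilon                  # sensitivity / epsilon
        return np.asarray(counts, float) + rng.laplace(0.0, scale, len(counts))

    true_counts = [120, 43, 7, 0]              # toy employment counts
    print(np.round(laplace_release(true_counts, epsilon=1.0), 1))
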
2016
  • J. Miranda and L. Vilhuber, “Using partially synthetic microdata to protect sensitive cells in business statistics,” Statistical Journal of the IAOS, vol. 32, iss. 1, pp. 69–80, 2016.
    [Abstract] [DOI] [URL] [Bibtex]

    We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau’s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

    @Article{MirandaVilhuber-SJIAOS2016,
    author = {Javier Miranda and Lars Vilhuber},
    title = {Using partially synthetic microdata to protect sensitive cells in business statistics},
    journal = {Statistical Journal of the IAOS},
    year = {2016},
    volume = {32},
    number = {1},
    pages = {69--80},
    month = {Feb},
    abstract = {We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).},
    doi = {10.3233/SJI-160963},
    file = {:MirandaVilhuber-SJIAOS2016.pdf:PDF},
    issn = {1874-7655},
    owner = {vilhuber},
    publisher = {IOS Press},
    timestamp = {2016.09.30},
    url = {http://doi.org/10.3233/SJI-160963},
    }
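
    A minimal sketch of the blending idea the abstract describes: observed values are published for non-sensitive cells, and model-based synthetic values replace the sensitive ones. The sensitivity rule and the synthesizer below are placeholder assumptions, not the algorithms evaluated in the paper.

    # Sketch: blend observed and synthetic values in a tabulation, using
    # synthetic draws only in cells flagged sensitive (placeholder rule).
    import numpy as np

    rng = np.random.default_rng(1)
    observed = np.array([5400.0, 870.0, 12.0, 3.0])   # toy cell totals
    sensitive = observed < 20                         # assumed threshold rule

    # Placeholder synthesizer standing in for a model fitted elsewhere.
    synthetic = np.exp(rng.normal(np.log(observed + 1.0), 0.3))

    published = np.where(sensitive, synthetic, observed)
    print(np.round(published, 1))
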
  • J. M. Abowd and K. L. McKinney, “Noise infusion as a confidentiality protection measure for graph-based statistics,” Statistical Journal of the IAOS, vol. 32, iss. 1, pp. 127–135, 2016.
    [Abstract] [DOI] [URL] [Bibtex]

    We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.

    @Article{AbowdMcKinney-SJIAOS2016,
    author = {John M. Abowd and Kevin L. McKinney},
    title = {Noise infusion as a confidentiality protection measure for graph-based statistics},
    journal = {Statistical Journal of the IAOS},
    year = {2016},
    volume = {32},
    number = {1},
    pages = {127--135},
    month = {Feb},
    abstract = {We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau's Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.},
    doi = {10.3233/SJI-160958},
    file = {:https\://ecommons.cornell.edu/bitstream/handle/1813/42338/AbowdMcKinney-with%20galley%20corrections.pdf?sequence=2&isAllowed=y:URL;:AbowdMcKinney-SJIAOS2016.pdf:PDF},
    issn = {1874-7655},
    owner = {vilhuber},
    publisher = {IOS Press},
    timestamp = {2016.09.30},
    url = {http://doi.org/10.3233/SJI-160958},
    }
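
    A minimal sketch of the multiplicative noise infusion described above: every employer receives a permanent fuzz factor that distorts its inputs by at least a minimum and at most a maximum percentage, so no unmodified respondent value is released. The bounds and the uniform draw below are illustrative assumptions, not the production QWI noise distribution.

    # Sketch of QWI-style multiplicative noise infusion; bounds and the
    # uniform draw are assumptions, not the production parameters.
    import numpy as np

    a, b = 0.01, 0.10                 # min/max fractional distortion (assumed)
    rng = np.random.default_rng(42)

    def fuzz_factors(n_employers):
        mag = rng.uniform(a, b, n_employers)        # at least a, at most b
        sign = rng.choice([-1.0, 1.0], n_employers)
        return 1.0 + sign * mag                     # never exactly 1.0

    employment = np.array([120.0, 35.0, 8.0, 410.0, 56.0])
    factors = fuzz_factors(len(employment))         # fixed once per employer
    print(np.round(employment * factors, 1))        # protected inputs
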
  • L. Vilhuber, J. M. Abowd, and J. P. Reiter, “Synthetic establishment microdata around the world,” Statistical Journal of the IAOS, vol. 32, iss. 1, pp. 65–68, 2016.
    [Abstract] [DOI] [URL] [Bibtex]

    In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.

    @Article{VilhuberAbowdReiter-SJIAOS2016,
    author = {Lars Vilhuber and John M. Abowd and Jerome P. Reiter},
    title = {Synthetic establishment microdata around the world},
    journal = {Statistical Journal of the IAOS},
    year = {2016},
    volume = {32},
    number = {1},
    pages = {65--68},
    month = {Feb},
    abstract = {In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic \emph{establishment} microdata. This overview situates those papers, published in this issue, within the broader literature.},
    doi = {10.3233/SJI-160964},
    file = {:VilhuberAbowdReiter-SJIAOS2016.pdf:PDF},
    issn = {1874-7655},
    owner = {vilhuber},
    publisher = {IOS Press},
    timestamp = {2016.09.30},
    url = {http://doi.org/10.3233/SJI-160964},
    }
2015
  • J. M. Abowd and I. Schmutte, “Economic analysis and statistical disclosure limitation,” Brookings Papers on Economic Activity, vol. Fall 2015, 2015.
    [Abstract] [URL] [Bibtex]

    This paper explores the consequences for economic research of methods used by statistical agencies to protect confidentiality of their respondents. We first review the concepts of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. Our main objective is to shed light on the effects of statistical disclosure limitation for empirical economic research. In general, the standard approach of ignoring statistical disclosure limitation leads to incorrect inference. We formalize statistical disclosure methods in a model of the data publication process. In the model, the statistical agency collects data from a population, but publishes a version of the data that have been intentionally distorted. The model allows us to characterize what it means for statistical disclosure limitation to be ignorable, and to characterize what happens when it is not. We then consider the effects of statistical disclosure limitation for regression analysis, instrumental variable analysis, and regression discontinuity design. Because statistical agencies do not always report the methods they use to protect confidentiality, we use our model to characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.

    @article{AbowdSchmutte_BPEA2015,
    jstor_articletype = {research-article},
    title = {Economic analysis and statistical disclosure limitation},
    Author = {John M. Abowd and Ian Schmutte},
    journal = {Brookings Papers on Economic Activity},
    volume = {Fall 2015},
    url = {http://www.brookings.edu/about/projects/bpea/papers/2015/economic-analysis-statistical-disclosure-limitation},
    ISSN = {00072303},
    abstract = {This paper explores the consequences for economic research of methods used by statistical agencies to protect confidentiality of their respondents. We first review the concepts of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. Our main objective is to shed light on the effects of statistical disclosure limitation for empirical economic research. In general, the standard approach of ignoring statistical disclosure limitation leads to incorrect inference. We formalize statistical disclosure methods in a model of the data publication process. In the model, the statistical agency collects data from a population, but published a version of the data that have been intentionally distorted. The model allows us to characterize what it means for statistical disclosure limitation to be ignorable, and to characterize what happens when it is not. We then consider the effects of statistical disclosure limitation for regression analysis, instrumental variable analysis, and regression discontinuity design. Because statistical agencies do not always report the methods they use to protect confidentiality, we use our model to characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.},
    language = {English},
    year = {2015},
    publisher = {Brookings Institution Press},
    copyright = {Copyright © 2015 Brookings Institution Press},
    }
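
    One concrete instance of the paper's point that ignoring statistical disclosure limitation yields incorrect inference is classical attenuation: noise infused into a regressor biases the OLS slope toward zero. A self-contained sketch with assumed parameters:

    # Sketch: SDL noise added to a regressor attenuates the OLS slope
    # (classical measurement error); all parameters here are assumptions.
    import numpy as np

    rng = np.random.default_rng(7)
    n = 100_000
    x = rng.normal(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 1.0, n)     # true slope = 2
    x_sdl = x + rng.normal(0.0, 0.5, n)       # noise-infused published regressor

    C = np.cov(x_sdl, y)
    # Expected slope: 2 * var(x)/(var(x)+var(noise)) = 2/1.25 = 1.6
    print(round(C[0, 1] / C[0, 0], 2))
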
  • M. J. Schneider and J. M. Abowd, “A new method for protecting interrelated time series with Bayesian prior distributions and synthetic data,” Journal of the Royal Statistical Society: Series A (Statistics in Society), p. n/a–n/a, 2015.
    [Abstract] [DOI] [URL] [Bibtex]

    Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between protection of confidentiality and quality of inference. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The US Census Bureau collects millions of interrelated time series microdata that are hierarchical and contain many 0s and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian generalized linear mixed models with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the magnitudes or number of entities. We find that, as the prior distributions of the variance components in the Bayesian generalized linear mixed model become more precise towards zero, protection of confidentiality increases and the quality of inference deteriorates. We evaluate our methodology by using a strict privacy measure, empirical differential privacy and a newly defined risk measure, the probability of range identification, which directly measures attribute disclosure risk. We illustrate our results with the US Census Bureau’s quarterly workforce indicators.

    @article {RSSA:RSSA12100,
    author = {Schneider, Matthew J. and Abowd, John M.},
    title = {A new method for protecting interrelated time series with Bayesian prior distributions and synthetic data},
    journal = {Journal of the Royal Statistical Society: Series A (Statistics in Society)},
    issn = {1467-985X},
    url = {http://dx.doi.org/10.1111/rssa.12100},
    doi = {10.1111/rssa.12100},
    pages = {n/a--n/a},
    keywords = {Administrative data, Empirical differential privacy, Informative prior distributions, Statistical disclosure limitation, Synthetic data, Zero-inflated mixed models},
    year = {2015},
    abstract = {Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between protection of confidentiality and quality of inference. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The US Census Bureau collects millions of interrelated time series microdata that are hierarchical and contain many 0s and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian generalized linear mixed models with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the magnitudes or number of entities. We find that, as the prior distributions of the variance components in the Bayesian generalized linear mixed model become more precise towards zero, protection of confidentiality increases and the quality of inference deteriorates. We evaluate our methodology by using a strict privacy measure, empirical differential privacy and a newly defined risk measure, the probability of range identification, which directly measures attribute disclosure risk. We illustrate our results with the US Census Bureau's quarterly workforce indicators.},
    }
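
    A minimal sketch of the release mechanism the abstract describes, stripped to its core: instead of the confidential counts, publish draws from a fitted zero-inflated count model. The fixed parameters below stand in for the paper's Bayesian generalized linear mixed model and its privacy-preserving priors.

    # Sketch: release zero-inflated Poisson draws in place of confidential
    # counts; fixed parameters stand in for the paper's Bayesian GLMM.
    import numpy as np

    rng = np.random.default_rng(3)
    p_zero, lam = 0.4, 6.0            # assumed zero-inflation and rate

    def synthesize(n):
        structural_zero = rng.random(n) < p_zero
        return np.where(structural_zero, 0, rng.poisson(lam, n))

    print(synthesize(12))             # synthetic series for release
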
2014
  • A. Shrivastava and P. Li, “Graph Kernels via Functional Embedding,” CoRR, vol. abs/1404.5214, 2014.
    [Abstract] [URL] [Bibtex]

    We propose a representation of a graph as a functional object derived from the power iteration of the underlying adjacency matrix. The proposed functional representation is a graph invariant, i.e., the functional remains unchanged under any reordering of the vertices. This property eliminates the difficulty of handling exponentially many isomorphic forms. The Bhattacharyya kernel constructed between these functionals significantly outperforms the state-of-the-art graph kernels on 3 out of the 4 standard benchmark graph classification datasets, demonstrating the superiority of our approach. The proposed methodology is simple and runs in time linear in the number of edges, which makes our kernel more efficient and scalable compared to many widely adopted graph kernels with running time cubic in the number of vertices.

    @Article{DBLP:journals/corr/Shrivastava014,
    Title = {Graph Kernels via Functional Embedding},
    Author = {Anshumali Shrivastava and Ping Li},
    Journal = {CoRR},
    Year = {2014},
    Volume = {abs/1404.5214},
    URL = {http://arxiv.org/abs/1404.5214},
    Owner = {vilhuber},
    Abstract = {We propose a representation of graph as a functional object derived from the power iteration of the underlying adjacency matrix. The proposed functional representation is a graph invariant, i.e., the functional remains unchanged under any reordering of the vertices. This property eliminates the difficulty of handling exponentially many isomorphic forms. Bhattacharyya kernel constructed between these functionals significantly outperforms the state-of-the-art graph kernels on 3 out of the 4 standard benchmark graph classification datasets, demonstrating the superiority of our approach. The proposed methodology is simple and runs in time linear in the number of edges, which makes our kernel more efficient and scalable compared to many widely adopted graph kernels with running time cubic in the number of vertices.},
    Timestamp = {2014.07.09}
    }
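
    A minimal sketch of the paper's starting point: summarize a graph by the sequence of power iterates of its adjacency matrix. Sorting each iterate is one simple way to obtain invariance to vertex relabeling; it stands in for the paper's exact functional construction and Bhattacharyya kernel.

    # Sketch: graph summary from power iteration of the adjacency matrix;
    # sorting each iterate gives relabeling invariance (a stand-in for the
    # paper's exact construction).
    import numpy as np

    def power_iteration_embedding(adj, steps=3):
        adj = np.asarray(adj, float)
        v = np.ones(adj.shape[0]) / adj.shape[0]
        feats = []
        for _ in range(steps):
            v = adj @ v
            v = v / (np.linalg.norm(v) + 1e-12)
            feats.append(np.sort(v))          # order-invariant summary
        return np.concatenate(feats)

    triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    print(np.round(power_iteration_embedding(triangle), 3))
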
2013
  • C. Lagoze, W. C. Block, J. Williams, J. M. Abowd, and L. Vilhuber, “Data Management of Confidential Data,” International Journal of Digital Curation, vol. 8, iss. 1, pp. 265-278, 2013.
    [Abstract] [DOI] [Bibtex]

    Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data.

    @Article{DBLP:journals/ijdc/LagozeBWAV13,
    Title = {Data Management of Confidential Data},
    Author = {Carl Lagoze and William C. Block and Jeremy Williams and John M. Abowd and Lars Vilhuber},
    Journal = {International Journal of Digital Curation},
    Year = {2013},
    Note = {Presented at 8th International Digital Curation Conference 2013, Amsterdam. See also http://hdl.handle.net/1813/30924},
    Number = {1},
    Pages = {265-278},
    Volume = {8},
    Abstract = {Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data.},
    Bibsource = {DBLP, http://dblp.uni-trier.de},
    Doi = {10.2218/ijdc.v8i1.259},
    Owner = {vilhuber},
    Timestamp = {2013.10.09}
    }
2012
  • R. Srivastava, P. Li, and D. Sengupta, “Testing for Membership to the IFRA and the NBU Classes of Distributions,” Journal of Machine Learning Research – Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), vol. 22, pp. 1099-1107, 2012.
    [Abstract] [URL] [Bibtex]

    This paper provides test procedures to determine whether the probability distribution underlying a set of non-negative valued samples belongs to the Increasing Failure Rate Average (IFRA) class or the New Better than Used (NBU) class. Membership of a distribution to one of these classes is known to have implications which are important in reliability, queuing theory, game theory and other disciplines. Our proposed test is based on the Kolmogorov-Smirnov distance between an empirical cumulative hazard function and its best approximation from the class of distributions constituting the null hypothesis. It turns out that the least favorable distribution, which produces the largest probability of Type I error of each of the tests, is the exponential distribution. This fact is used to produce an appropriate cut-off or p-value. Monte Carlo simulations are conducted to check small sample size (i.e., significance) and power of the test. Usefulness of the test is illustrated through the analysis of a set of monthly family expenditure data collected by the National Sample Survey Organization of the Government of India.

    @Article{SrivastavaLS12,
    author = {Radhendushka Srivastava and Ping Li and Debasis Sengupta},
    title = {Testing for Membership to the IFRA and the NBU Classes of Distributions},
    journal = {Journal of Machine Learning Research - Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012)},
    year = {2012},
    volume = {22},
    pages = {1099-1107},
    abstract = {This paper provides test procedures to determine whether the probability distribution underlying a set of non-negative valued samples belongs to the Increasing Failure Rate Average (IFRA) class or the New Better than Used (NBU) class. Membership of a distribution to one of these classes is known to have implications which are important in reliability, queuing theory, game theory and other disciplines. Our proposed test is based on the Kolmogorov-Smirnov distance between an empirical cumulative hazard function and its best approximation from the class of distributions constituting the null hypothesis. It turns out that the least favorable distribution, which produces the largest probability of Type I error of each of the tests, is the exponential distribution. This fact is used to produce an appropriate cut-off or p-value. Monte Carlo simulations are conducted to check small sample size (i.e., significance) and power of the test. Usefulness of the test is illustrated through the analysis of a set of monthly family expenditure data collected by the National Sample Survey Organization of the Government of India.},
    bibsource = {DBLP, http://dblp.uni-trier.de},
    file = {srivastava12.pdf:http\://www.jmlr.org/proceedings/papers/v22/srivastava12/srivastava12.pdf:PDF},
    url = {http://www.jmlr.org/proceedings/papers/v22/srivastava12.html},
    }
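
    A minimal sketch of the test's core quantity: the Kolmogorov-Smirnov distance between the empirical distribution and a fitted exponential, with a Monte Carlo cut-off that treats the exponential as the least favorable null, as the abstract notes. The calibration below is naive and illustrative, not the paper's exact procedure.

    # Sketch: KS distance to a fitted exponential, with a naive Monte Carlo
    # p-value using the exponential as the least favorable distribution.
    import numpy as np

    def ks_to_exponential(x):
        x = np.sort(np.asarray(x, float))
        n = len(x)
        cdf = 1.0 - np.exp(-x / x.mean())          # fitted exponential CDF
        d_plus = (np.arange(1, n + 1) / n - cdf).max()
        d_minus = (cdf - np.arange(0, n) / n).max()
        return max(d_plus, d_minus)

    rng = np.random.default_rng(11)
    sample = rng.weibull(2.0, 200)                 # toy IFRA-class data
    stat = ks_to_exponential(sample)
    null = [ks_to_exponential(rng.exponential(1.0, 200)) for _ in range(500)]
    print(round(stat, 3), "p ~", np.mean(np.array(null) >= stat))
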

Preprints at eCommons and elsewhere

We publish freely accessible copies of papers and preprints at the Cornell eCommons repository and elsewhere.

eCommons

2019
  • L. Vilhuber and W. Block, “Outcomes report | Cornell Node of the NSF-Census Research Network,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:65011, 2019.
    [Abstract] [URL] [Bibtex]

    Description and List of Outcomes of the Cornell node of the NSF-Census Research Network.

    @techreport{handle:1813:65011,
    Title = {Outcomes report | Cornell Node of the NSF-Census Research Network},
    Author = {Vilhuber, Lars and Block, William},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2019},
    number={1813:65011},
    URL = {https://hdl.handle.net/1813/65011},
    abstract ={Description and List of Outcomes of the Cornell node of the NSF-Census Research Network.}
    }
2018
  • L. Vilhuber and W. Block, “Cornell Node of the NSF-Census Research Network – Annual Report to NSF for 2018,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:65010, 2018.
    [Abstract] [URL] [Bibtex]

    This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2018.

    @techreport{handle:1813:65010,
    Title = {Cornell Node of the NSF-Census Research Network - Annual Report to NSF for 2018},
    Author = {Vilhuber, Lars and Block, William},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2018},
    number={1813:65010},
    URL = {https://hdl.handle.net/1813/65010},
    abstract ={This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2018.}
    }
  • J. M. Abowd and I. M. Schmutte, “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:58669, 2018.
    [Abstract] [URL] [Bibtex]

    Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy. Any opinions and conclusions are those of the authors and do not represent the views of the Census Bureau, NSF, or the Sloan Foundation. We thank the Center for Labor Economics at UC–Berkeley and Isaac Newton Institute for Mathematical Sciences, Cambridge (EPSRC grant no. EP/K032208/1) for support and hospitality. We are extremely grateful for very valuable comments and guidance from the editor, Pinelopi Goldberg, and six anonymous referees. We acknowledge helpful comments from Robin Bachman, Nick Bloom, Larry Blume, David Card, Michael Castro, Jennifer Childs, Melissa Creech, Cynthia Dwork, Casey Eggleston, John Eltinge, Stephen Fienberg, Mark Kutzbach, Ron Jarmin, Christa Jones, Dan Kifer, Ashwin Machanavajjhala, Frank McSherry, Gerome Miklau, Kobbi Nissim, Paul Oyer, Mallesh Pai, Jerry Reiter, Eric Slud, Adam Smith, Bruce Spencer, Sara Sullivan, Salil Vadhan, Lars Vilhuber, Glen Weyl, and Nellie Zhao, along with seminar and conference participants at the U.S. Census Bureau, Cornell, CREST, George Mason, Georgetown, Microsoft Research–NYC, University of Washington Evans School, and SOLE. William Sexton provided excellent research assistance. No confidential data were used in this paper. Supplemental materials available at http://doi.org/10.5281/zenodo.1345775. The authors declare that they have no relevant or material financial interests that relate to the research described in this paper.

    @techreport{handle:1813:58669,
    Title = {An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices},
    Author = {Abowd, John M. and Schmutte, Ian M.},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2018},
    number={1813:58669},
    URL = {https://hdl.handle.net/1813/58669},
    abstract ={Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy.
    Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem,
    we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal
    benefit. Our model of production, from computer science, assumes data are published using an efficient
    differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand
    for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making.
    Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.
    Any opinions and conclusions are those of the authors and do not represent the views of the Census Bureau, NSF, or the Sloan Foundation. We thank the Center for Labor Economics at UC–Berkeley and Isaac Newton Institute for Mathematical Sciences, Cambridge (EPSRC grant no. EP/K032208/1) for support and hospitality.
    We are extremely grateful for very valuable comments and guidance from the editor, Pinelopi Goldberg, and six anonymous referees. We acknowledge helpful comments from Robin Bachman, Nick Bloom, Larry Blume, David Card, Michael Castro, Jennifer Childs, Melissa Creech, Cynthia Dwork, Casey Eggleston, John Eltinge, Stephen Fienberg, Mark Kutzbach, Ron Jarmin, Christa Jones, Dan Kifer, Ashwin Machanavajjhala, Frank McSherry, Gerome Miklau, Kobbi Nissim, Paul Oyer, Mallesh Pai, Jerry Reiter, Eric Slud, Adam Smith, Bruce Spencer, Sara Sullivan, Salil Vadhan, Lars Vilhuber, Glen Weyl, and Nellie Zhao, along with seminar and conference participants at the U.S. Census Bureau, Cornell, CREST, George Mason, Georgetown, Microsoft Research–NYC, University of Washington Evans School, and SOLE.
    William Sexton provided excellent research assistance.
    No confidential data were used in this paper. Supplemental materials available at http://doi.org/10.5281/zenodo.1345775. The authors declare that they have no relevant or material financial interests that relate to the research described in this paper.}
    }
2017
  • L. Vilhuber and W. Block, “Cornell Node of the NSF-Census Research Network – Annual Report to NSF for 2017,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:65009, 2017.
    [Abstract] [URL] [Bibtex]

    This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2017.

    @techreport{handle:1813:65009,
    Title = {Cornell Node of the NSF-Census Research Network - Annual Report to NSF for 2017},
    Author = {Vilhuber, Lars and Block, William},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2017},
    number={1813:65009},
    URL = {https://hdl.handle.net/1813/65009},
    abstract ={This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2017.}
    }
2016
  • L. Vilhuber and W. Block, “Cornell Node of the NSF-Census Research Network – Annual Report to NSF for 2016,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:65008, 2016.
    [Abstract] [URL] [Bibtex]

    This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2016.

    @techreport{handle:1813:65008,
    Title = {Cornell Node of the NSF-Census Research Network - Annual Report to NSF for 2016},
    Author = {Vilhuber, Lars and Block, William},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2016},
    number={1813:65008},
    URL = {https://hdl.handle.net/1813/65008},
    abstract ={This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2016.}
    }
2015
  • J. M. Abowd, W. Block, and L. Vilhuber, “Cornell Node of the NSF-Census Research Network – Annual Report to NSF for 2015,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:65007, 2015.
    [Abstract] [URL] [Bibtex]

    This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2015.

    @techreport{handle:1813:65007,
    Title = {Cornell Node of the NSF-Census Research Network - Annual Report to NSF for 2015},
    Author = {Abowd, John M. and Block, William and Vilhuber, Lars},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2015},
    number={1813:65007},
    URL = {https://hdl.handle.net/1813/65007},
    abstract ={This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2015.}
    }
2014
  • J. M. Abowd, W. Block, P. Li, and L. Vilhuber, “Cornell Node of the NSF-Census Research Network – Annual Report to NSF for 2014,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:65005, 2014.
    [Abstract] [URL] [Bibtex]

    This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2014.

    @techreport{handle:1813:65005,
    Title = {Cornell Node of the NSF-Census Research Network - Annual Report to NSF for 2014},
    Author = {Abowd, John M. and Block, William and Li, Ping and Vilhuber, Lars},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2014},
    number={1813:65005},
    URL = {https://hdl.handle.net/1813/65005},
    abstract ={This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2014.}
    }
2013
  • J. M. Abowd, W. Block, P. Li, and L. Vilhuber, “Cornell Node of the NSF-Census Research Network – Annual Report to NSF for 2013,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:65004, 2013.
    [Abstract] [URL] [Bibtex]

    This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2013.

    @techreport{handle:1813:65004,
    Title = {Cornell Node of the NSF-Census Research Network - Annual Report to NSF for 2013},
    Author = {Abowd, John M. and Block, William and Li, Ping and Vilhuber, Lars},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2013},
    number={1813:65004},
    URL = {https://hdl.handle.net/1813/65004},
    abstract ={This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2013.}
    }
  • C. Lagoze, J. Williams, and L. Vilhuber, “Encoding Provenance Metadata for Social Science Datasets,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:55327, 2013.
    [Abstract] [URL] [Bibtex]

    Recording provenance is a key requirement for data-centric scholarship, allowing researchers to evaluate the integrity of source data sets and reproduce, and thereby, validate results. Provenance has become even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. Recent work by the W3C on the PROV model provides the foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We apply that model to complex, but characteristic, provenance examples of social science data, describe scenarios that make scholarly use of those provenance descriptions, and propose a manner for encoding this provenance metadata within the widely-used DDI metadata standard. Submitted to Metadata and Semantics Research (MTSR 2013) conference.

    @techreport{handle:1813:55327,
    Title = {Encoding Provenance Metadata for Social Science Datasets},
    Author = {Lagoze, Carl and Williams, Jeremy and Vilhuber, Lars},
    institution = { NSF Census Research Network - NCRN-Cornell },
    type = {Preprint} ,
    Year = {2013},
    number={1813:55327},
    URL = {https://hdl.handle.net/1813/55327},
    abstract ={Recording provenance is a key requirement for data-centric scholarship, allowing researchers to evaluate the integrity of source data sets and reproduce, and thereby, validate results. Provenance has become even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. Recent work by the W3C on the PROV model provides the foundation for semantically-rich, interoperable, and
    web-compatible provenance metadata. We apply that model to complex, but characteristic, provenance examples of social science data, describe scenarios
    that make scholarly use of those provenance descriptions, and propose a manner for encoding this provenance metadata within the widely-used DDI metadata
    standard.
    Submitted to Metadata and Semantics Research (MTSR 2013) conference.}
    }
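
    The PROV model referenced in the entry above is often serialized as PROV-JSON. A minimal Python sketch of such a record for a derived dataset might look as follows; the entity and activity names (ex:source_survey, ex:disclosure_edit, and so on) are illustrative assumptions, not identifiers from the paper.

    import json

    # A minimal PROV-style provenance record assembled as plain JSON.
    # Production documents would typically be built with a dedicated PROV
    # library and validated against the W3C PROV-JSON serialization.
    prov_record = {
        "prefix": {"ex": "http://example.org/"},
        "entity": {
            "ex:source_survey": {"prov:label": "Restricted-use survey extract"},
            "ex:public_file": {"prov:label": "Public-use derived dataset"},
        },
        "activity": {
            "ex:disclosure_edit": {"prov:label": "Disclosure-limitation edits"},
        },
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": "ex:public_file",
                     "prov:activity": "ex:disclosure_edit"},
        },
        "wasDerivedFrom": {
            "_:d1": {"prov:generatedEntity": "ex:public_file",
                     "prov:usedEntity": "ex:source_survey"},
        },
    }

    print(json.dumps(prov_record, indent=2))

    A derivation chain of this kind is what the paper proposes to embed within DDI metadata.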
2012
  • J. M. Abowd, W. Block, P. Li, and L. Vilhuber, “Cornell Node of the NSF-Census Research Network – Annual Report to NSF for 2012,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:65003, 2012.
    [Abstract] [URL] [Bibtex]

    This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2012.

    @techreport{handle:1813:65003,
    Title = {Cornell Node of the NSF-Census Research Network - Annual Report to NSF for 2012},
    Author = {Abowd, John M. and Block, William and Li, Ping and Vilhuber, Lars},
    institution = {NSF Census Research Network - NCRN-Cornell},
    type = {Preprint},
    Year = {2012},
    number={1813:65003},
    URL = {https://hdl.handle.net/1813/65003},
    abstract ={This is the annual report by the Cornell node of the NSF-Census Research Network to NSF for 2012.}
    }

Elsewhere

2019
  • J. Abowd, I. Schmutte, W. Sexton, and L. Vilhuber, “Introductory Readings in Formal Privacy for Economists,” Labor Dynamics Institute, zenodo.2621345, 2019.
    [DOI] [Bibtex]
    @TechReport{privbib20190402,
    author = {Abowd, John and Schmutte, Ian and Sexton, William and Vilhuber, Lars},
    title = {Introductory Readings in Formal Privacy for Economists},
    institution = {Labor Dynamics Institute},
    year = {2019},
    number = {zenodo.2621345},
    doi = {10.5281/zenodo.2621345},
    keywords = {Privacy, Official Statistics, Differential Privacy, Economics, Economics of Privacy, Statistical Disclosure Limitation},
    language = {en},
    owner = {vilhuber},
    publisher = {Zenodo},
    timestamp = {2019.04.04},
    }
  • J. M. Abowd, I. M. Schmutte, W. N. Sexton, and L. Vilhuber, “Why the Economics Profession Must Actively Participate in the Privacy Protection Debate,” Labor Dynamics Institute, Document 51, 2019.
    [URL] [Bibtex]
    @TechReport{ldi51,
    author = {John M. Abowd and Ian M. Schmutte and William N. Sexton and Lars Vilhuber},
    title = {Why the Economics Profession Must Actively Participate in the Privacy Protection Debate},
    institution = {Labor Dynamics Institute},
    year = {2019},
    type = {Document},
    number = {51},
    month = may,
    owner = {vilhuber},
    timestamp = {2019.04.04},
    url = {https://digitalcommons.ilr.cornell.edu/ldi/51/},
    }
2018
  • J. M. Abowd and I. M. Schmutte, “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices,” Center for Economic Studies, U.S. Census Bureau, Working Papers 18-35, 2018.
    [Abstract] [URL] [Bibtex]

    Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.

    @TechReport{RePEc:cen:wpaper:18-35,
    author={John M. Abowd and Ian M. Schmutte},
    title={{An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices}},
    year=2018,
    month=Aug,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/18-35.html},
    number={18-35},
    abstract={Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.},
    }
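
    The "marginal cost equals marginal benefit" rule in the abstract above can be written in stylized form. The notation below (privacy loss ε, accuracy I(ε), benefit B, cost C) is our own shorthand for exposition, not the paper's:

    \[
      \max_{\varepsilon \ge 0}\; B\bigl(I(\varepsilon)\bigr) - C(\varepsilon)
      \qquad\Longrightarrow\qquad
      B'\bigl(I(\varepsilon^{*})\bigr)\,I'(\varepsilon^{*}) \;=\; C'(\varepsilon^{*}),
    \]

    that is, the optimal privacy loss ε* equates the marginal benefit of the accuracy it buys with its marginal cost.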
  • J. M. Abowd and I. M. Schmutte, “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices,” arXiv, preprint, 2018.
    [URL] [Bibtex]
    @techreport{abowd2018economic,
    title={An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices},
    author={John M. Abowd and Ian M. Schmutte},
    year={2018},
    eprint={1808.06303},
    archivePrefix={arXiv},
    primaryClass={cs.CR},
    url = {https://arxiv.org/abs/1808.06303},
    institution = {arXiv},
    type = {preprint}
    }
  • J. M. Abowd, I. M. Schmutte, and L. Vilhuber, “Disclosure Limitation and Confidentiality Protection in Linked Data,” Center for Economic Studies, U.S. Census Bureau, Working Papers 18-07, 2018.
    [Abstract] [URL] [Bibtex]

    Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For examples, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.

    @TechReport{RePEc:cen:wpaper:18-07,
    author={John M. Abowd and Ian M. Schmutte and Lars Vilhuber},
    title={{Disclosure Limitation and Confidentiality Protection in Linked Data}},
    year=2018,
    month=Jan,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/18-07.html},
    number={18-07},
    abstract={Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For examples, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.},
    }
  • L. B. Reeder, J. C. Stanley, and L. Vilhuber, “Codebook for the SIPP Synthetic Beta 7.0 (PDF version),” Cornell Institute for Social and Economic Research and Labor Dynamics Institute, Cornell University, PDF and DDI code V20181102b-pdf, 2018.
    [Abstract] [DOI] [URL] [Bibtex]

    The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.

    @techreport{reeder_lori_b_2018_1477099,
    title = {{Codebook for the SIPP Synthetic Beta 7.0 (PDF version)}},
    author = {Reeder, Lori B. and Stanley, Jordan C. and Vilhuber, Lars},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute}. Cornell University},
    type = {PDF and DDI code},
    number = {V20181102b-pdf},
    month = nov,
    year = 2018,
    doi = {10.5281/zenodo.1477099},
    url = {https://doi.org/10.5281/zenodo.1477099},
    abstract = {The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.}
    }
  • L. B. Reeder, J. C. Stanley, and L. Vilhuber, “Codebook for the SIPP Synthetic Beta 7.0 (DDI-C and PDF),” Labor Dynamics Institute, Cornell University, Codebook, 2018.
    [Abstract] [DOI] [URL] [Bibtex]

    The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.

    @techreport{reeder_lori_b_2018_1477097,
    author = {Reeder, Lori B. and Stanley, Jordan C. and Vilhuber, Lars},
    institution = {{Labor Dynamics Institute}. Cornell University},
    type = {Codebook},
    title = {{Codebook for the SIPP Synthetic Beta 7.0 (DDI-C and PDF)}},
    month = nov,
    year = 2018,
    doi = {10.5281/zenodo.1477097},
    url = {https://doi.org/10.5281/zenodo.1477097},
    abstract = {The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.}
    }
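
    The synthesis described in the two codebook entries above replaces confidential values with model-based draws that preserve covariate relationships. The toy Python sketch below illustrates only that general idea on simulated data; it assumes nothing about the SSB's actual, far more elaborate, synthesis procedure.

    import numpy as np

    rng = np.random.default_rng(20181102)

    # Toy "confidential" data: sensitive variable y, covariate x.
    n = 1000
    x = rng.normal(size=n)
    y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

    # Fit a simple model on the confidential data...
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid_sd = np.std(y - X @ beta, ddof=2)

    # ...then release model-based draws instead of the real values.
    y_synthetic = X @ beta + rng.normal(scale=resid_sd, size=n)

    # Record-level values differ, but the covariate relationship survives:
    print("slope on confidential data:", round(np.polyfit(x, y, 1)[0], 2))
    print("slope on synthetic data:  ", round(np.polyfit(x, y_synthetic, 1)[0], 2))

    Here no synthetic record equals its confidential counterpart, yet a regression on the synthetic file recovers approximately the same slope as one on the confidential file.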
2017
  • L. Vilhuber and I. M. Schmutte, “Proceedings from the 2016 NSF-Sloan Workshop on Practical Privacy,” Labor Dynamics Institute, Cornell University, Document 33, 2017.
    [Abstract] [URL] [Bibtex]

    On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau's flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas. Funding for the workshop was provided by the National Science Foundation (CNS-1012593) and the Alfred P. Sloan Foundation. Organizational support was provided by the Research and Methodology Directorate at the U.S. Census Bureau and the Labor Dynamics Institute at Cornell University.

    @TechReport{Vilhuber:LDI:2017:33,
    author = {Vilhuber, Lars and Schmutte, Ian M.},
    title = {Proceedings from the 2016 NSF-Sloan Workshop on Practical Privacy},
    institution = {Labor Dynamics Institute, Cornell University},
    year = {2017},
    type = {Document},
    number = {33},
    abstract = {On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau's flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.
    Funding for the workshop was provided by the National Science Foundation (CNS-1012593) and the Alfred P. Sloan Foundation. Organizational support was provided by the Research and Methodology Directorate at the U.S. Census Bureau and the Labor Dynamics Institute at Cornell University.},
    comment = {Funding by National Science Foundation (CNS-1012593) and the Alfred P. Sloan Foundation},
    owner = {vilhuber},
    timestamp = {2017.05.03},
    url = {http://digitalcommons.ilr.cornell.edu/ldi/33/},
    }
  • J. M. Abowd, F. Kramarz, S. Perez-Duarte, and I. M. Schmutte, “Sorting Between and Within Industries: A Testable Model of Assortative Matching,” Labor Dynamics Institute, Document 40, 2017.
    [Abstract] [URL] [Bibtex]

    We test Shimer’s (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting–more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated.

    @TechReport{ldi40,
    author = {John M. Abowd and Francis Kramarz and Sebastien Perez-Duarte and Ian M. Schmutte},
    title = {Sorting Between and Within Industries: A Testable Model of Assortative Matching},
    institution = {Labor Dynamics Institute},
    year = {2017},
    type = {Document},
    number = {40},
    abstract = {We test Shimer's (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting--more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated.},
    owner = {vilhuber},
    timestamp = {2017.09.21},
    url = {http://digitalcommons.ilr.cornell.edu/ldi/40/},
    }
  • J. M. Abowd and I. M. Schmutte, “Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods,” Labor Dynamics Institute, Document 37, 2017.
    [Abstract] [URL] [Bibtex]

    We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner's problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.

    @TechReport{ldi37,
    author = {John M. Abowd and Ian M. Schmutte},
    title = {Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods},
    institution = {Labor Dynamics Institute},
    year = {2017},
    type = {Document},
    number = {37},
    month = {04/2017},
    abstract = {We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner{\textquoteright}s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.},
    owner = {vilhuber},
    timestamp = {2017.09.28},
    url = {http://digitalcommons.ilr.cornell.edu/ldi/37/},
    }
  • J. M. Abowd and I. M. Schmutte, “Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods,” Center for Economic Studies, U.S. Census Bureau, Working Papers 17-37, 2017.
    [Abstract] [URL] [Bibtex]

    We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently undersupplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.

    @TechReport{RePEc:cen:wpaper:17-37,
    author={John M. Abowd and Ian M. Schmutte},
    title={{Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods}},
    year=2017,
    month=Jan,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/17-37.html},
    number={17-37},
    abstract={We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently undersupplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.},
    keywords={Demand for public statistics; Technology for statistical agencies; Optimal data accuracy; Optimal co},
    }
  • S. Haney, A. Machanavajjhala, J. M. Abowd, M. Graham, and M. Kutzbach, “Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics,” Cornell University, Preprint 1813:49652, 2017.
    [Abstract] [URL] [Bibtex]

    National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ϵ>=1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL techniques.

    @TechReport{handle:1813:49652,
    author = {Haney, Samuel and Machanavajjhala, Ashwin and Abowd, John M and Graham, Matthew and Kutzbach, Mark},
    title = {Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics},
    institution = {Cornell University},
    year = {2017},
    type = {Preprint},
    number = {1813:49652},
    abstract = {National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ϵ>=1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL techniques.},
    owner = {vilhuber},
    timestamp = {2017.09.28},
    url = {http://hdl.handle.net/1813/49652},
    }
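
    As background for the preceding entry, the generic baseline it benchmarks against is the textbook Laplace mechanism, sketched below in Python for a single count query at privacy-loss parameter ε. This is the kind of off-the-shelf differentially private mechanism the paper finds too noisy for ER-EE tabulations, not the paper's customized algorithms.

    import numpy as np

    rng = np.random.default_rng(0)

    def laplace_count(true_count: int, epsilon: float) -> float:
        """Release a count under epsilon-differential privacy.

        Adding or removing one record changes a count by at most 1
        (sensitivity 1), so Laplace noise with scale 1/epsilon yields
        an epsilon-differentially private answer for this one query.
        """
        return true_count + rng.laplace(scale=1.0 / epsilon)

    # At epsilon = 1 (the regime evaluated in the paper), the noise is
    # typically on the order of +/-1, which matters most for small cells.
    print(laplace_count(42, epsilon=1.0))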
  • L. Vilhuber, S. Kinney, and I. Schmutte, “Proceedings from the Synthetic LBD International Seminar,” Labor Dynamics Institute, Cornell University, Document 44, 2017.
    [Abstract] [URL] [Bibtex]

    On May 9, 2017, we hosted a seminar to discuss the conditions necessary to implement the SynLBD approach with interested parties, with the goal of providing a straightforward toolkit to implement the same procedure on other data. The proceedings summarize the discussions during the workshop. Funding for the workshop was provided by the National Science Foundation (Grants 1012593; 1131848) and the Alfred P. Sloan Foundation (G-2015-13903). Organizational support was provided by the Labor Dynamics Institute at Cornell University.

    @TechReport{ProceedingsSynLBD2017,
    author = {Lars Vilhuber and Saki Kinney and Ian Schmutte},
    title = {Proceedings from the Synthetic LBD International Seminar},
    institution = {Labor Dynamics Institute, Cornell University},
    year = {2017},
    type = {Document},
    number = {44},
    abstract = {On May 9, 2017, we hosted a seminar to discuss the conditions necessary to implement the SynLBD approach with interested parties, with the goal of providing a straightforward toolkit to implement the same procedure on other data. The proceedings summarize the discussions during the workshop.
    Funding for the workshop was provided by the National Science Foundation (Grants 1012593; 1131848) and the Alfred P. Sloan Foundation (G-2015-13903). Organizational support was provided by the Labor Dynamics Institute at Cornell University.},
    owner = {vilhuber},
    timestamp = {2017.09.28},
    url = {http://digitalcommons.ilr.cornell.edu/ldi/44/},
    }
  • L. Vilhuber and I. Schmutte, “Proceedings from the 2016 NSF-Sloan Workshop on Practical Privacy,” Cornell University, Preprint 1813:46197, 2017.
    [Abstract] [URL] [Bibtex]

    On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau's flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.

    @TechReport{handle:1813:46197,
    author = {Vilhuber, Lars and Schmutte, Ian},
    title = {Proceedings from the 2016 NSF-Sloan Workshop on Practical Privacy},
    institution = {Cornell University},
    year = {2017},
    type = {Preprint},
    number = {1813:46197},
    abstract = {On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau{\textquoteright}s flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.},
    owner = {vilhuber},
    timestamp = {2017.09.28},
    url = {http://hdl.handle.net/1813/46197},
    }
  • L. Vilhuber and I. Schmutte, “Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy,” Labor Dynamics Institute, Cornell University, Document 43, 2017.
    [Abstract] [URL] [Bibtex]

    These proceedings report on a workshop hosted at the U.S. Census Bureau on May 8, 2017. Our purpose was to gather experts from various backgrounds together to continue discussing the development of formal privacy systems for Census Bureau data products. This workshop was a successor to a previous workshop held in October 2016 (Vilhuber & Schmutte 2017). At our prior workshop, we hosted computer scientists, survey statisticians, and economists, all of whom were experts in data privacy. At that time we discussed the practical implementation of cutting-edge methods for publishing data with formal, provable privacy guarantees, with a focus on applications to Census Bureau data products. The teams developing those applications were just starting out when our first workshop took place, and we spent our time brainstorming solutions to the various problems researchers were encountering, or anticipated encountering. For these cutting-edge formal privacy models, there had been very little effort in the academic literature to apply those methods in real-world settings with large, messy data. We therefore brought together an expanded group of specialists from academia and government who could shed light on technical challenges, subject matter challenges and address how data users might react to changes in data availability and publishing standards. In May 2017, we organized a follow-up workshop, which these proceedings report on. We reviewed progress made in four different areas. The four topics discussed as part of the workshop were 1. the 2020 Decennial Census; 2. the American Community Survey (ACS); 3. the 2017 Economic Census; 4. measuring the demand for privacy and for data quality. As in our earlier workshop, our goals were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.

    @TechReport{ProceedingsNSFSloan2017,
    author = {Lars Vilhuber and Ian Schmutte},
    title = {Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy},
    institution = {Labor Dynamics Institute, Cornell University},
    year = {2017},
    type = {Document},
    number = {43},
    abstract = {These proceedings report on a workshop hosted at the U.S. Census Bureau on May 8, 2017. Our purpose was to gather experts from various backgrounds together to continue discussing the development of formal privacy systems for Census Bureau data products. This workshop was a successor to a previous workshop held in October 2016 (Vilhuber & Schmutte 2017). At our prior workshop, we hosted computer scientists, survey statisticians, and economists, all of whom were experts in data privacy. At that time we discussed the practical implementation of cutting-edge methods for publishing data with formal, provable privacy guarantees, with a focus on applications to Census Bureau data products. The teams developing those applications were just starting out when our first workshop took place, and we spent our time brainstorming solutions to the various problems researchers were encountering, or anticipated encountering. For these cutting-edge formal privacy models, there had been very little effort in the academic literature to apply those methods in real-world settings with large, messy data. We therefore brought together an expanded group of specialists from academia and government who could shed light on technical challenges, subject matter challenges and address how data users might react to changes in data availability and publishing standards.
    In May 2017, we organized a follow-up workshop, which these proceedings report on. We reviewed progress made in four different areas. The four topics discussed as part of the workshop were 1. the 2020 Decennial Census; 2. the American Community Survey (ACS); 3. the 2017 Economic Census; 4. measuring the demand for privacy and for data quality.
    As in our earlier workshop, our goals were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.},
    owner = {vilhuber},
    timestamp = {2017.09.28},
    url = {http://digitalcommons.ilr.cornell.edu/ldi/43/},
    }
  • L. Vilhuber and C. Lagoze, “Making Confidential Data Part of Reproducible Research,” Labor Dynamics Institute, Cornell University, Document 41, 2017.
    [URL] [Bibtex]
    @TechReport{VilhuberLagozeLDI2017,
    author = {Lars Vilhuber and Carl Lagoze},
    title = {Making Confidential Data Part of Reproducible Research},
    institution = {Labor Dynamics Institute, Cornell University},
    year = {2017},
    type = {Document},
    number = {41},
    owner = {vilhuber},
    timestamp = {2017.09.28},
    url = {http://digitalcommons.ilr.cornell.edu/ldi/41/},
    }
  • D. H. Weinberg, J. M. Abowd, R. F. Belli, N. Cressie, D. C. Folch, S. H. Holan, M. C. Levenstein, K. M. Olson, J. P. Reiter, M. D. Shapiro, J. Smyth, L. Soh, B. D. Spencer, S. E. Spielman, L. Vilhuber, and C. K. Wikle, “Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the U.S. Statistical System?,” Center for Economic Studies, U.S. Census Bureau, Working Papers 17-59r, 2017.
    [Abstract] [URL] [Bibtex]

    The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives.

    @TechReport{RePEc:cen:wpaper:17-59r,
    author={Daniel H. Weinberg and John M. Abowd and Robert F. Belli and Noel Cressie and David C. Folch and Scott H. Holan and Margaret C. Levenstein and Kristen M. Olson and Jerome P. Reiter and Matthew D. Shapiro and Jolene Smyth and Leen-Kiat Soh and Bruce D. Spencer and Seth E. Spielman and Lars Vilhuber and Christopher K. Wikle},
    title={{Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the U.S. Statistical System?}},
    year=2017,
    month=Jan,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/17-59r.html},
    number={17-59r},
    abstract={The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN's research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives.},
    }
  • K. L. McKinney, A. S. Green, L. Vilhuber, and J. M. Abowd, “Total Error and Variability Measures with Integrated Disclosure Limitation for Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in On The Map,” Center for Economic Studies, U.S. Census Bureau, Working Papers 17-71, 2017.
    [Abstract] [URL] [Bibtex]

    We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.

    @TechReport{RePEc:cen:wpaper:17-71,
    author={Kevin L. McKinney and Andrew S. Green and Lars Vilhuber and John M. Abowd},
    title={{Total Error and Variability Measures with Integrated Disclosure Limitation for Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in On The Map}},
    year=2017,
    month=Jan,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/17-71.html},
    number={17-71},
    abstract={We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.},
    keywords={Multiple imputation; Total quality measures; Employment statistics; Earnings statistics; Total surve},
    }
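
    The total variability measures in the preceding entry build on Rubin's (1987) combining rules for m implicates, which in their standard form read as follows; the paper derives specialized versions consistent with the QWI/LODES disclosure avoidance system:

    \[
      \bar{q}_m = \frac{1}{m}\sum_{l=1}^{m} q^{(l)}, \qquad
      \bar{u}_m = \frac{1}{m}\sum_{l=1}^{m} u^{(l)}, \qquad
      b_m = \frac{1}{m-1}\sum_{l=1}^{m}\bigl(q^{(l)} - \bar{q}_m\bigr)^{2},
    \]
    \[
      T_m = \bar{u}_m + \Bigl(1 + \frac{1}{m}\Bigr) b_m,
    \]

    where q^(l) and u^(l) are the point estimate and its variance from implicate l, and T_m is the total variability of the combined estimate.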
  • A. S. Green, M. J. Kutzbach, and L. Vilhuber, “Two Perspectives on Commuting: A Comparison of Home to Work Flows Across Job-Linked Survey and Administrative Files,” Center for Economic Studies, U.S. Census Bureau, Working Papers 17-34, 2017.
    [Abstract] [URL] [Bibtex]

    Commuting flows and workplace employment data have a wide constituency of users including urban and regional planners, social science and transportation researchers, and businesses. The U.S. Census Bureau releases two, national data products that give the magnitude and characteristics of home to work flows. The American Community Survey (ACS) tabulates households’ responses on employment, workplace, and commuting behavior. The Longitudinal Employer-Household Dynamics (LEHD) program tabulates administrative records on jobs in the LEHD Origin-Destination Employment Statistics (LODES). Design differences across the datasets lead to divergence in a comparable statistic: county-to-county aggregate commute flows. To understand differences in the public use data, this study compares ACS and LEHD source files, using identifying information and probabilistic matching to join person and job records. In our assessment, we compare commuting statistics for job frames linked on person, employment status, employer, and workplace and we identify person and job characteristics as well as design features of the data frames that explain aggregate differences. We find a lower rate of within-county commuting and farther commutes in LODES. We attribute these greater distances to differences in workplace reporting and to uncertainty of establishment assignments in LEHD for workers at multi-unit employers. Minor contributing factors include differences in residence location and ACS workplace edits. The results of this analysis and the data infrastructure developed will support further work to understand and enhance commuting statistics in both datasets.

    @TechReport{RePEc:cen:wpaper:17-34,
    author={Andrew S. Green and Mark J. Kutzbach and Lars Vilhuber},
    title={{Two Perspectives on Commuting: A Comparison of Home to Work Flows Across Job-Linked Survey and Administrative Files}},
    year=2017,
    month=Jan,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/17-34.html},
    number={17-34},
    abstract={Commuting flows and workplace employment data have a wide constituency of users including urban and regional planners, social science and transportation researchers, and businesses. The U.S. Census Bureau releases two, national data products that give the magnitude and characteristics of home to work flows. The American Community Survey (ACS) tabulates households’ responses on employment, workplace, and commuting behavior. The Longitudinal Employer-Household Dynamics (LEHD) program tabulates administrative records on jobs in the LEHD Origin-Destination Employment Statistics (LODES). Design differences across the datasets lead to divergence in a comparable statistic: county-to-county aggregate commute flows. To understand differences in the public use data, this study compares ACS and LEHD source files, using identifying information and probabilistic matching to join person and job records. In our assessment, we compare commuting statistics for job frames linked on person, employment status, employer, and workplace and we identify person and job characteristics as well as design features of the data frames that explain aggregate differences. We find a lower rate of within-county commuting and farther commutes in LODES. We attribute these greater distances to differences in workplace reporting and to uncertainty of establishment assignments in LEHD for workers at multi-unit employers. Minor contributing factors include differences in residence location and ACS workplace edits. The results of this analysis and the data infrastructure developed will support further work to understand and enhance commuting statistics in both datasets.},
    keywords={U.S. Census Bureau; LEHD; LODES; ACS; Employer-employee matched data; Commuting; Record linkage},
    }
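
    Probabilistic matching of the kind mentioned in the entry above is commonly done with Fellegi-Sunter-style agreement weights: agreement on each identifier contributes a log likelihood-ratio weight. The Python sketch below shows that scoring logic with made-up m- and u-probabilities; the actual ACS-LEHD linkage uses protected identifiers and production parameters not described here.

    import math

    # Hypothetical per-field m-probabilities (agreement given a true match)
    # and u-probabilities (agreement by chance among non-matches).
    FIELDS = {
        "name":   {"m": 0.95, "u": 0.01},
        "dob":    {"m": 0.98, "u": 0.003},
        "county": {"m": 0.90, "u": 0.05},
    }

    def match_weight(agreements: dict) -> float:
        """Sum Fellegi-Sunter log2 likelihood-ratio weights over fields."""
        total = 0.0
        for field, p in FIELDS.items():
            if agreements[field]:
                total += math.log2(p["m"] / p["u"])
            else:
                total += math.log2((1 - p["m"]) / (1 - p["u"]))
        return total

    # Records agreeing on name and date of birth but not county still
    # score well above zero, so they would likely be linked.
    print(round(match_weight({"name": True, "dob": True, "county": False}), 2))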
2016
  • J. M. Abowd, K. L. McKinney, and I. M. Schmutte, “Modeling Endogenous Mobility in Wage Determination,” Labor Dynamics Institute, Document 28, 2016.
    [Abstract] [URL] [Bibtex]

    We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates.

    @TechReport{AbowdMcKinneySchmutte-LDI2016,
    author = {John M. Abowd and Kevin L. McKinney and Ian M. Schmutte},
    title = {Modeling Endogenous Mobility in Wage Determination},
    institution = {Labor Dynamics Institute},
    year = {2016},
    type = {Document},
    number = {28},
    month = may,
    abstract = {We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates.},
    owner = {vilhuber},
    timestamp = {2016.09.30},
    url = {http://digitalcommons.ilr.cornell.edu/ldi/28/},
    }
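    The benchmark against which the paper's endogenous-mobility correction is assessed is the standard two-way fixed-effects (worker and firm) wage decomposition. A minimal sketch of that benchmark on a toy panel, assuming exactly what the paper relaxes (exogenous mobility) and ignoring covariates, connected-set restrictions, and normalizations:

    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from scipy.sparse.linalg import lsqr

    # Toy matched panel: observation i is worker[i] employed at firm[i]
    # with log earnings y[i]. Real applications use millions of job-years.
    worker = np.array([0, 0, 1, 1, 2, 2, 3, 3])
    firm   = np.array([0, 1, 1, 0, 1, 2, 2, 0])
    y      = np.array([1.0, 1.6, 2.1, 1.4, 2.3, 2.9, 2.5, 1.2])

    n_obs = len(y)
    n_w, n_f = worker.max() + 1, firm.max() + 1
    rows = np.arange(n_obs)

    # Sparse design matrix [worker dummies | firm dummies].
    D = csr_matrix((np.ones(n_obs), (rows, worker)), shape=(n_obs, n_w))
    F = csr_matrix((np.ones(n_obs), (rows, firm)), shape=(n_obs, n_f))
    X = hstack([D, F]).tocsr()

    # Least-squares solve; lsqr picks a minimum-norm solution, which
    # stands in for the usual location normalization of the firm effects.
    theta = lsqr(X, y)[0]
    worker_fe, firm_fe = theta[:n_w], theta[n_w:]
    residuals = y - X @ theta

    print("worker effects:", worker_fe.round(2))
    print("firm effects:  ", firm_fe.round(2))
    print("residuals:     ", residuals.round(2))

    The paper's residual diagnostics ask whether residuals like these are systematically related to subsequent job mobility; exogenous mobility implies they should not be.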
  • J. M. Abowd, “How Will Statistical Agencies Operate When All Data Are Private?,” Labor Dynamics Institute, Cornell University, Document 30, 2016.
    [Abstract] [Bibtex]

    The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it, compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations, blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.

    @TechReport{Abowd:LDI:2016:30,
    author = {John M. Abowd},
    title = {How Will Statistical Agencies Operate When All Data Are Private?},
    institution = {Labor Dynamics Institute, Cornell University},
    year = {2016},
    type = {Document},
    number = {30},
    abstract = {The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the ``Big Data'' era. There are orders of magnitude more data outside an agency's firewall than inside it, compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was ``asked'' in a context wholly outside the agency's operations, blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.},
    owner = {vilhuber},
    timestamp = {2017.05.03},
    xurl = {http://digitalcommons.ilr.cornell.edu/ldi/30/},
    }
  • J. M. Abowd, “Why Statistical Agencies Need to Take Privacy-loss Budgets Seriously, and What It Means When They Do,” Labor Dynamics Institute, Cornell University, Document 32, 2016.
    [Abstract] [Bibtex]

    To appear on fcsm.sites.usa.gov, as presented to the 2016 FCSM Statistical Policy Seminar.

    @TechReport{Abowd:LDI:2016:32,
    author = {Abowd, John M.},
    title = {Why Statistical Agencies Need to Take Privacy-loss Budgets Seriously, and What It Means When They Do},
    institution = {Labor Dynamics Institute, Cornell University},
    year = {2016},
    type = {Document},
    number = {32},
    abstract = {To appear on fcsm.sites.usa.gov, as presented to the 2016 FCSM Statistical Policy Seminar.},
    owner = {vilhuber},
    timestamp = {2017.05.03},
    xurl = {http://digitalcommons.ilr.cornell.edu/ldi/32/},
    }
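    The “privacy-loss budget” of the title refers to standard differential-privacy accounting: under basic sequential composition, the ε parameters of successive releases from the same confidential data add up, so an agency must ration a fixed total across its publications. A minimal sketch of such a ledger (basic composition only):

    class PrivacyLossBudget:
        """Ledger for cumulative privacy loss under basic sequential
        composition, where the epsilons of successive releases add."""

        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon, release):
            if self.spent + epsilon > self.total:
                raise RuntimeError(f"budget exhausted: cannot publish {release!r}")
            self.spent += epsilon
            print(f"{release}: eps {epsilon:.2f} spent, "
                  f"{self.total - self.spent:.2f} remaining")

    budget = PrivacyLossBudget(total_epsilon=1.0)
    budget.charge(0.25, "county tabulation")
    budget.charge(0.50, "tract tabulation")
    budget.charge(0.50, "block tabulation")   # raises: would exceed the budget

    Advanced composition theorems and newer accounting methods tighten these sums, but the rationing logic is the same.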
  • L. Vilhuber, “ncrncornell/ced2ar-synlbd-codebook: DDI Codebook for the Synthetic LBD,” Labor Dynamics Institute, Cornell University, PDF and DDI code, 2016.
    [Abstract] [DOI] [URL] [Bibtex]

    Codebook for the Synthetic LBD, a Census Bureau data product; see https://www.census.gov/ces/dataproducts/synlbd/. The SynLBD usage model relies on a Synthetic Data Server, maintained (as of 2018) by Cornell University; see https://www2.vrdc.cornell.edu/news/synthetic-data-server/. Live version of the DDI codebook at https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/synlbd/

    @techreport{lars_vilhuber_2016_2527910,
    author = {Lars Vilhuber},
    title = {{ncrncornell/ced2ar-synlbd-codebook: DDI Codebook for the Synthetic LBD}},
    institution = { {Labor Dynamics Institute}. Cornell University},
    type = {PDF and DDI code},
    month = nov,
    year = 2016,
    doi = {10.5281/zenodo.2527910},
    url = {https://doi.org/10.5281/zenodo.2527910},
    abstract = {Codebook for the Synthetic LBD, a Census Bureau data product, see \url{https://www.census.gov/ces/dataproducts/synlbd/}.
    The SynLBD usage model relies on a Synthetic Data Server, maintained (as of 2018) by Cornell University, see \url{https://www2.vrdc.cornell.edu/news/synthetic-data-server/}.
    Live version of the DDI codebook at \url{https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/synlbd/}
    }
    }
2015
  • J. M. Abowd and I. Schmutte, “Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods,” Labor Dynamics Institute, Document 22, 2015.
    [Abstract] [URL] [Bibtex]

    We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner's problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner's problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.

    @TechReport{AbowdSchmutte_LDI2016-22,
    author = {John M. Abowd and Ian Schmutte},
    title = {Revisiting the Economics of Privacy: {P}opulation Statistics and Confidentiality Protection as Public Goods},
    institution = {Labor Dynamics Institute},
    year = {2015},
    type = {Document},
    number = {22},
    month = jan,
    abstract = {We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner's problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner's problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.},
    language = {English},
    owner = {vilhuber},
    timestamp = {2016.09.30},
    url = {http://digitalcommons.ilr.cornell.edu/ldi/22/},
    volume = {Fall 2015},
    }
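    The production possibilities frontier the abstract invokes comes from inverting a query-release accuracy bound. In stylized form (our notation and a simplified bound, not the paper's exact statement): if published accuracy is I = 1 − α and the mechanism's error shrinks like α = c/√(nε) for a database of size n, then

    % Stylized accuracy-privacy frontier; the constant c absorbs the
    % query class, the universe size, and the failure probability beta.
    \[
      I(\varepsilon) \;=\; 1 - \alpha(\varepsilon)
                     \;=\; 1 - \frac{c}{\sqrt{n\,\varepsilon}},
      \qquad
      \frac{dI}{d\varepsilon} \;=\; \frac{c}{2}\, n^{-1/2}\, \varepsilon^{-3/2} \;>\; 0 .
    \]

    Accuracy rises with privacy loss at a diminishing rate, which is what makes an interior optimum, where the marginal benefit of accuracy equals the marginal cost of privacy loss, well defined.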
  • L. Vilhuber, “ncrncornell/ced2ar-nber-ces-codebook: Codebook for NBER-CES Manufacturing Industry Database,” Labor Dynamics Institute, Cornell University, PDF and DDI code, 2015.
    [Abstract] [DOI] [URL] [Bibtex]

    Codebook for the NBER-CES Manufacturing Industry Database (2009) [NAICS and SIC], by Randy A. Becker, Wayne B. Gray, Jordan Marvakov, and Eric J. Bartelsman. Main website: https://www.nber.org/data/nberces5809.html (note: a newer version is available at http://www.nber.org/data/nberces.html; this codebook does not necessarily reflect the more recent version). Live version of the DDI codebook at https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/nber-ces/

    @techreport{lars_vilhuber_2015_2527908,
    author = {Lars Vilhuber},
    title = {{ncrncornell/ced2ar-nber-ces-codebook: Codebook for NBER-CES Manufacturing Industry Database}},
    month = nov,
    year = 2015,
    doi = {10.5281/zenodo.2527908},
    url = {https://doi.org/10.5281/zenodo.2527908},
    type = {PDF and DDI code},
    institution = { {Labor Dynamics Institute}. Cornell University},
    abstract = {Codebook for NBER-CES Manufacturing Industry Database (2009) [NAICS and SIC], by Randy A. Becker, Wayne B. Gray, Jordan Marvakov, and Eric J. Bartelsman
    Main website: \url{https://www.nber.org/data/nberces5809.html} (note: a newer version is available at \url{http://www.nber.org/data/nberces.html} - this codebook does not necessarily reflect the more recent version.)
    Live version of the DDI codebook at \url{https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/nber-ces/}}
    }
  • L. Vilhuber, “ncrncornell/ced2ar-nqwi-codebook: Codebook for the National QWI [Codebook file],” Labor Dynamics Institute, Cornell University, PDF and DDI code, 2015.
    [Abstract] [DOI] [URL] [Bibtex]

    Codebook for the early research version of National QWI. Live version of the DDI codebook at https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/nqwi/

    @techreport{lars_vilhuber_2015_2527906,
    author = {Lars Vilhuber},
    title = {{ncrncornell/ced2ar-nqwi-codebook: Codebook for the National QWI [Codebook file]}},
    month = oct,
    year = 2015,
    doi = {10.5281/zenodo.2527906},
    url = {https://doi.org/10.5281/zenodo.2527906},
    type = {PDF and DDI code},
    institution = { {Labor Dynamics Institute}. Cornell University},
    abstract = {Codebook for the early research version of National QWI.
    Live version of the DDI codebook at \url{https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/nqwi/}}
    }

Download of complete BibTeX files

Interested users can download the complete BibTeX files for the two sections above.

Electronic metadata

We have created and maintain electronic metadata (otherwise known as “online codebooks”) for a number of datasets, displayed using CED²AR; a short sketch of how such a codebook can be read programmatically follows the list below:

  • Cornell NSF-Census Research Network, “NBER-CES Manufacturing Industry Database (NAICS, 2009) [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2013.
    [URL] [Bibtex]
    @TECHREPORT{CED2AR-NBER-naics2009,
    author = {{Cornell NSF-Census Research Network}},
    title = {NBER-CES Manufacturing Industry Database (NAICS, 2009) [Codebook file]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2013},
    url = {https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/nber-ces/v/naics2009}
    }
  • Cornell NSF-Census Research Network, “NBER-CES Manufacturing Industry Database (SIC, 2009) [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2013.
    [URL] [Bibtex]
    @TECHREPORT{CED2AR-NBER-sic2009,
    author = {{Cornell NSF-Census Research Network}},
    title = {NBER-CES Manufacturing Industry Database (SIC, 2009) [Codebook file]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2013},
    url = {https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/nber-ces/v/sic2009}
    }
  • Reeder, Lori B., Martha Stinson, Kelly E. Trageser, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta v5.1 [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2014.
    [URL] [Bibtex]
    @TECHREPORT{CED2AR-SSBv51,
    author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
    title = {Codebook for the {SIPP} {S}ynthetic {B}eta v5.1 [Codebook file]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2014},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v51}
    }
  • Reeder, Lori B., Martha Stinson, Kelly E. Trageser, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta v6.0 [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2015.
    [URL] [Bibtex]
    @TECHREPORT{CED2AR-SSBv6,
    author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
    title = {Codebook for the {SIPP} {S}ynthetic {B}eta v6.0 [Codebook file]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2015},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v6}
    }
  • Reeder, Lori B., Martha Stinson, Kelly E. Trageser, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta v6.0.2 [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2015.
    [URL] [Bibtex]
    @TECHREPORT{CED2AR-SSBv602,
    author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
    title = {Codebook for the {SIPP} {S}ynthetic {B}eta v6.0.2 [Codebook file]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2015},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v602}
    }
  • Reeder, Lori B., Jordan C. Stanley, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta v7 [Online],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute. Cornell University, Ithaca, NY, USA, DDI-C document, 2018.
    [URL] [Bibtex]
    @techreport{CED2AR-SSBv7,
    author = {Lori B. Reeder and Jordan C. Stanley and Lars Vilhuber},
    title = {Codebook for the {SIPP} {S}ynthetic {B}eta v7 [Online]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute}. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2018},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v7}
    }
  • Vilhuber, Lars, “Codebook for the Synthetic LBD Version 2.0 [Codebook file],” Comprehensive Extensible Data Documentation and Access Repository (CED2AR), Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2013.
    [URL] [Bibtex]
    @TECHREPORT{CED2AR-SynLBDv2,
    author = { Lars Vilhuber },
    title = {Codebook for the Synthetic LBD Version 2.0 [Codebook file]},
    institution = {{Comprehensive Extensible Data Documentation and Access Repository (CED2AR)}, Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University},
    type = {DDI-C document},
    address = {Ithaca, NY, USA},
    year = {2013},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/synlbd/v/v2}
    }
  • Reeder, Lori B., Jordan C. Stanley, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta 7.0 (DDI-C and PDF),” Labor Dynamics Institute, Cornell University, Codebook, 2018.
    [Abstract] [DOI] [URL] [Bibtex]

    The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.

    @techreport{reeder_lori_b_2018_1477097,
    author = {Reeder, Lori B. and Stanley, Jordan C. and Vilhuber, Lars},
    institution = { {Labor Dynamics Institute}. Cornell University},
    type = {Codebook},
    title = {{Codebook for the SIPP Synthetic Beta 7.0 (DDI-C and PDF)}},
    month = nov,
    year = 2018,
    doi = {10.5281/zenodo.1477097},
    url = {https://doi.org/10.5281/zenodo.1477097},
    abstract = {The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.}
    }
  • Reeder, Lori B., Jordan C. Stanley, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta 7.0 (PDF version),” Cornell Institute for Social and Economic Research and Labor Dynamics Institute, Cornell University, PDF and DDI code V20181102b-pdf, 2018.
    [Abstract] [DOI] [URL] [Bibtex]

    The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.

    @techreport{reeder_lori_b_2018_1477099,
    title = {{Codebook for the SIPP Synthetic Beta 7.0 (PDF version)}},
    author = {Reeder, Lori B. and Stanley, Jordan C. and Vilhuber, Lars},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute}. Cornell University},
    type = {PDF and DDI code},
    number = {V20181102b-pdf},
    month = nov,
    year = 2018,
    doi = {10.5281/zenodo.1477099},
    url = {https://doi.org/10.5281/zenodo.1477099},
    abstract = {The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.}
    }
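    The SSB abstracts above describe synthesis: fitting models to the confidential data and replacing values with draws from those models, so that covariate relationships survive while no record keeps its true values. A toy regression-based illustration of the general idea (not the Census Bureau's actual procedure):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "confidential" data: log earnings depend on years of schooling.
    n = 1_000
    school = rng.normal(13, 2, n)
    earn = 1.0 + 0.1 * school + rng.normal(0, 0.3, n)

    # Fit a simple model on the confidential data ...
    X = np.column_stack([np.ones(n), school])
    beta, *_ = np.linalg.lstsq(X, earn, rcond=None)
    sigma = np.std(earn - X @ beta)

    # ... then replace earnings with draws from the fitted model.
    earn_syn = X @ beta + rng.normal(0, sigma, n)

    print("corr, confidential:", np.corrcoef(school, earn)[0, 1].round(3))
    print("corr, synthetic:   ", np.corrcoef(school, earn_syn)[0, 1].round(3))

    The two printed correlations should roughly agree, even though every individual synthetic earnings value differs from the original.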

Other codebooks are available at http://www2.ncrn.cornell.edu/ced2ar-web/.
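As a rough illustration of what these DDI-C codebooks look like to a consumer (the sketch promised above), the following parses a variable list from a miniature DDI Codebook 2.5 document using Python's standard library. The embedded XML is invented for the example, with element and namespace names following the DDI 2.5 schema as we understand it:

    import xml.etree.ElementTree as ET

    # Invented miniature DDI-C 2.5 document; real codebooks are far larger.
    DDI = """\
    <codeBook xmlns="ddi:codebook:2_5">
      <dataDscr>
        <var name="emp_total">
          <labl>Total employment, beginning of quarter</labl>
        </var>
        <var name="hir_a">
          <labl>Hires (all), accessions</labl>
        </var>
      </dataDscr>
    </codeBook>
    """

    NS = {"ddi": "ddi:codebook:2_5"}
    root = ET.fromstring(DDI)

    # List every variable name with its human-readable label.
    for var in root.findall(".//ddi:dataDscr/ddi:var", NS):
        labl = var.find("ddi:labl", NS)
        print(f"{var.get('name'):<12} {labl.text if labl is not None else ''}")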

Some of the datasets are hosted by the Cornell VirtualRDC.

We have also generated a draft bibliography of Census Bureau datasets available in the FSRDC; it should be considered only a starting point for proper data citation.

Data repositories

Where possible, we publish the data created by our projects, or the data necessary to replicate papers, in publicly accessible repositories.

Code publication

We publish, where possible, source code and metadata standards in openly accessible locations.

Presentations

Our presentations are archived in the Cornell eCommons presentations repository.