# Publications

## Published papers and proceedings

forthcoming
• J. M. Abowd, “How Will Statistical Agencies Operate When All Data Are Private?,” Journal of Privacy and Confidentiality, vol. 7, iss. 3, forthcoming.

The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it, compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations, blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.

@Article{Abowd:JPC:2017,
author = {John M. Abowd},
title = {How Will Statistical Agencies Operate When All Data Are Private?},
journal = {Journal of Privacy and Confidentiality},
year = {forthcoming},
volume = {7},
number = {3},
abstract = {The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the ``Big Data'' era. There are orders of magnitude more data outside an agency's firewall than inside it, compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was ``asked'' in a context wholly outside the agency's operations, blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.},
owner = {vilhuber},
timestamp = {2017.05.03},
url = {http://repository.cmu.edu/jpc_forthcoming/4/},
}
2017
• S. Haney, A. Machanavajjhala, J. M. Abowd, M. Graham, M. Kutzbach, and L. Vilhuber, “Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics,” in Proceedings of the 2017 International Conference on Management of Data, ACM, 2017, vol. forthcoming.

National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter $\epsilon \geq 1$, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.

@InCollection{HaneySIGMOD2017,
author = {Samuel Haney and Ashwin Machanavajjhala and John M. Abowd and Matthew Graham and Mark Kutzbach and Lars Vilhuber},
title = {Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics},
booktitle = {Proceedings of the 2017 International Conference on Management of Data},
publisher = {ACM},
year = {2017},
volume = {forthcoming},
series = {SIGMOD '17},
abstract = {National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures.
In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter $\epsilon \geq 1$, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.},
acmid = {3035940},
doi = {10.1145/3035918.3035940},
journal = {SIGMOD},
owner = {vilhuber},
timestamp = {2017.03.01},
url = {http://dx.doi.org/10.1145/3035918.3035940},
}
• L. Vilhuber, I. M. Schmutte, and J. M. Abowd, “Proceedings from the 2016 NSF-Sloan Workshop on Practical Privacy,” Labor Dynamics Institute, Cornell University, Document 33, 2017.

On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy-preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau’s flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas. Funding for the workshop was provided by the National Science Foundation (CNS-1012593) and the Alfred P. Sloan Foundation. Organizational support was provided by the Research and Methodology Directorate at the U.S. Census Bureau and the Labor Dynamics Institute at Cornell University.

@TechReport{Vilhuber:LDI:2017:33,
author = {Vilhuber, Lars and Schmutte, Ian M. and Abowd, John M.},
title = {Proceedings from the 2016 NSF-Sloan Workshop on Practical Privacy},
institution = {Labor Dynamics Institute, Cornell University},
year = {2017},
type = {Document},
number = {33},
abstract = {On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy-preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau's flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.
Funding for the workshop was provided by the National Science Foundation (CNS-1012593) and the Alfred P. Sloan Foundation. Organizational support was provided by the Research and Methodology Directorate at the U.S. Census Bureau and the Labor Dynamics Institute at Cornell University.},
comment = {Funding by National Science Foundation (CNS-1012593) and the Alfred P. Sloan Foundation},
owner = {vilhuber},
timestamp = {2017.05.03},
url = {http://digitalcommons.ilr.cornell.edu/ldi/33/},
xurl = {http://digitalcommons.ilr.cornell.edu/ldi/33/},
}
2016
• J. Miranda and L. Vilhuber, “Using partially synthetic microdata to protect sensitive cells in business statistics,” Statistical Journal of the IAOS, vol. 32, iss. 1, pp. 69-80, 2016.

We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau’s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

@Article{MirandaVilhuber-SJIAOS2016,
author = {Javier Miranda and Lars Vilhuber},
title = {Using partially synthetic microdata to protect sensitive cells in business statistics},
journal = {Statistical Journal of the IAOS},
year = {2016},
volume = {32},
number = {1},
pages = {69--80},
month = {Feb},
abstract = {We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).},
doi = {10.3233/SJI-160963},
file = {:MirandaVilhuber-SJIAOS2016.pdf:PDF},
issn = {1874-7655},
owner = {vilhuber},
publisher = {IOS Press},
timestamp = {2016.09.30},
url = {http://doi.org/10.3233/SJI-160963},
}
• J. M. Abowd and K. L. McKinney, “Noise infusion as a confidentiality protection measure for graph-based statistics,” Statistical Journal of the IAOS, vol. 32, iss. 1, pp. 127-135, 2016.

We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.

@Article{AbowdMcKinney-SJIAOS2016,
author = {John M. Abowd and Kevin L. McKinney},
title = {Noise infusion as a confidentiality protection measure for graph-based statistics},
journal = {Statistical Journal of the IAOS},
year = {2016},
volume = {32},
number = {1},
pages = {127--135},
month = {Feb},
abstract = {We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau's Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.},
doi = {10.3233/SJI-160958},
file = {:AbowdMcKinney-SJIAOS2016.pdf:PDF},
issn = {1874-7655},
owner = {vilhuber},
publisher = {IOS Press},
timestamp = {2016.09.30},
url = {http://doi.org/10.3233/SJI-160958},
}
• L. Vilhuber, J. M. Abowd, and J. P. Reiter, “Synthetic establishment microdata around the world,” Statistical Journal of the IAOS, vol. 32, iss. 1, pp. 65-68, 2016.

In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic *establishment* microdata. This overview situates those papers, published in this issue, within the broader literature.

@Article{VilhuberAbowdReiter-SJIAOS2016,
author = {Lars Vilhuber and John M. Abowd and Jerome P. Reiter},
title = {Synthetic establishment microdata around the world},
journal = {Statistical Journal of the IAOS},
year = {2016},
volume = {32},
number = {1},
pages = {65--68},
month = {Feb},
abstract = {In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic \emph{establishment} microdata. This overview situates those papers, published in this issue, within the broader literature.},
doi = {10.3233/SJI-160964},
file = {:VilhuberAbowdReiter-SJIAOS2016.pdf:PDF},
issn = {1874-7655},
owner = {vilhuber},
publisher = {IOS Press},
timestamp = {2016.09.30},
url = {http://doi.org/10.3233/SJI-160964},
}
• J. M. Abowd, K. L. McKinney, and I. M. Schmutte, “Modeling Endogenous Mobility in Wage Determination,” Labor Dynamics Institute, Document 28, 2016.

We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates.

@TechReport{AbowdMcKinneySchmutte-LDI2016,
author = {John M. Abowd and Kevin L. McKinney and Ian M. Schmutte},
title = {Modeling Endogenous Mobility in Wage Determination},
institution = {Labor Dynamics Institute},
year = {2016},
type = {Document},
number = {28},
month = may,
abstract = {We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates.},
owner = {vilhuber},
timestamp = {2016.09.30},
url = {http://digitalcommons.ilr.cornell.edu/ldi/28/},
}
• J. M. Abowd, “How Will Statistical Agencies Operate When All Data Are Private?,” Labor Dynamics Institute, Cornell University, Document 30, 2016.

The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it, compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations, blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.

@TechReport{Abowd:LDI:2016:30,
author = {John M. Abowd},
title = {How Will Statistical Agencies Operate When All Data Are Private?},
institution = {Labor Dynamics Institute, Cornell University},
year = {2016},
type = {Document},
number = {30},
abstract = {The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the ``Big Data'' era. There are orders of magnitude more data outside an agency's firewall than inside it, compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was ``asked'' in a context wholly outside the agency's operations, blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.},
owner = {vilhuber},
timestamp = {2017.05.03},
xurl = {http://digitalcommons.ilr.cornell.edu/ldi/30/},
}
• J. M. Abowd, “Why Statistical Agencies Need to Take Privacy-loss Budgets Seriously, and What It Means When They Do,” Labor Dynamics Institute, Cornell University, Document 32, 2016.

To appear on fcsm.sites.usa.gov, as presented to the 2016 FCSM Statistical Policy Seminar.

@TechReport{Abowd:LDI:2016:32,
author = {Abowd, John M.},
title = {Why Statistical Agencies Need to Take Privacy-loss Budgets Seriously, and What It Means When They Do},
institution = {Labor Dynamics Institute, Cornell University},
year = {2016},
type = {Document},
number = {32},
abstract = {To appear on fcsm.sites.usa.gov, as presented to the 2016 FCSM Statistical Policy Seminar.},
owner = {vilhuber},
timestamp = {2017.05.03},
xurl = {http://digitalcommons.ilr.cornell.edu/ldi/32/},
}
2015
• J. M. Abowd and I. Schmutte, “Economic analysis and statistical disclosure limitation,” Brookings Papers on Economic Activity, vol. Spring 2015, 2015.

This paper explores the consequences for economic research of methods used by statistical agencies to protect confidentiality of their respondents. We first review the concepts of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. Our main objective is to shed light on the effects of statistical disclosure limitation for empirical economic research. In general, the standard approach of ignoring statistical disclosure limitation leads to incorrect inference. We formalize statistical disclosure methods in a model of the data publication process. In the model, the statistical agency collects data from a population, but published a version of the data that have been intentionally distorted. The model allows us to characterize what it means for statistical disclosure limitation to be ignorable, and to characterize what happens when it is not. We then consider the effects of statistical disclosure limitation for regression analysis, instrumental variable analysis, and regression discontinuity design. Because statistical agencies do not always report the methods they use to protect confidentiality, we use our model to characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.

@Article{AbowdSchmutte_BPEA2015,
author = {John M. Abowd and Ian Schmutte},
title = {Economic analysis and statistical disclosure limitation},
journal = {Brookings Papers on Economic Activity},
year = {2015},
volume = {Spring 2015},
abstract = {This paper explores the consequences for economic research of methods used by statistical agencies to protect confidentiality of their respondents. We first review the concepts of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. Our main objective is to shed light on the effects of statistical disclosure limitation for empirical economic research. In general, the standard approach of ignoring statistical disclosure limitation leads to incorrect inference. We formalize statistical disclosure methods in a model of the data publication process. In the model, the statistical agency collects data from a population, but published a version of the data that have been intentionally distorted. The model allows us to characterize what it means for statistical disclosure limitation to be ignorable, and to characterize what happens when it is not. We then consider the effects of statistical disclosure limitation for regression analysis, instrumental variable analysis, and regression discontinuity design. Because statistical agencies do not always report the methods they use to protect confidentiality, we use our model to characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.},
issn = {00072303},
jstor_articletype = {research-article},
language = {English},
publisher = {Brookings Institution Press},
}
• M. J. Schneider and J. M. Abowd, “A new method for protecting interrelated time series with Bayesian prior distributions and synthetic data,” Journal of the Royal Statistical Society: Series A (Statistics in Society), p. n/a–n/a, 2015.

Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between protection of confidentiality and quality of inference. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The US Census Bureau collects millions of interrelated time series microdata that are hierarchical and contain many 0s and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian generalized linear mixed models with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the magnitudes or number of entities. We find that, as the prior distributions of the variance components in the Bayesian generalized linear mixed model become more precise towards zero, protection of confidentiality increases and the quality of inference deteriorates. We evaluate our methodology by using a strict privacy measure, empirical differential privacy and a newly defined risk measure, the probability of range identification, which directly measures attribute disclosure risk. We illustrate our results with the US Census Bureau’s quarterly workforce indicators.

@article {RSSA:RSSA12100,
author = {Schneider, Matthew J. and Abowd, John M.},
title = {A new method for protecting interrelated time series with Bayesian prior distributions and synthetic data},
journal = {Journal of the Royal Statistical Society: Series A (Statistics in Society)},
issn = {1467-985X},
pages = {n/a--n/a},
keywords = {Administrative data, Empirical differential privacy, Informative prior distributions, Statistical disclosure limitation, Synthetic data, Zero-inflated mixed models},
year = {2015},
abstract = {Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between protection of confidentiality and quality of inference. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The US Census Bureau collects millions of interrelated time series microdata that are hierarchical and contain many 0s and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian generalized linear mixed models with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the magnitudes or number of entities. We find that, as the prior distributions of the variance components in the Bayesian generalized linear mixed model become more precise towards zero, protection of confidentiality increases and the quality of inference deteriorates. We evaluate our methodology by using a strict privacy measure, empirical differential privacy and a newly defined risk measure, the probability of range identification, which directly measures attribute disclosure risk. We illustrate our results with the US Census Bureau's quarterly workforce indicators.},
}
• J. M. Abowd and I. Schmutte, “Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods,” Labor Dynamics Institute, Document 22, 2015.

We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner’s problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner’s problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.

@TechReport{AbowdSchmutte_LDI2016-22,
author = {John M. Abowd and Ian Schmutte},
title = {Revisiting the Economics of Privacy: {P}opulation Statistics and Confidentiality Protection as Public Goods},
institution = {Labor Dynamics Institute},
year = {2015},
type = {Document},
number = {22},
month = jan,
abstract = {We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner's problem using the technology set implied by $(\epsilon, \delta)$-differential privacy with $(\alpha, \beta)$-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner's problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.},
language = {English},
owner = {vilhuber},
timestamp = {2016.09.30},
url = {http://digitalcommons.ilr.cornell.edu/ldi/22/},
volume = {Fall 2015},
}
2014
• C. Lagoze, L. Vilhuber, J. Williams, B. Perry, and W. C. Block, “CED²AR: The Comprehensive Extensible Data Documentation and Access Repository,” in ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014), London, United Kingdom, 2014.
[Abstract] [DOI] [URL] [Bibtex]

Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data.

@InProceedings{LagozeJCDL2014,
author = {Carl Lagoze and Lars Vilhuber and Jeremy Williams and Benjamin Perry and William C. Block},
title = {CED²AR: The Comprehensive Extensible Data Documentation and Access Repository},
booktitle = {ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014)},
year = {2014},
month = {sep},
organization = {ACM/IEEE},
publisher = {Institute of Electrical {\&} Electronics Engineers ({IEEE})},
note = {Presented at the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014)},
abstract = {Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data.},
doi = {10.1109/JCDL.2014.6970178},
owner = {vilhuber},
timestamp = {2014.07.09},
url = {http://dx.doi.org/10.1109/JCDL.2014.6970178},
}
• A. Shrivastava and P. Li, “Graph Kernels via Functional Embedding,” CoRR, vol. abs/1404.5214, 2014.
[Abstract] [URL] [Bibtex]

We propose a representation of graph as a functional object derived from the power iteration of the underlying adjacency matrix. The proposed functional representation is a graph invariant, i.e., the functional remains unchanged under any reordering of the vertices. This property eliminates the difficulty of handling exponentially many isomorphic forms. Bhattacharyya kernel constructed between these functionals significantly outperforms the state-of-the-art graph kernels on 3 out of the 4 standard benchmark graph classification datasets, demonstrating the superiority of our approach. The proposed methodology is simple and runs in time linear in the number of edges, which makes our kernel more efficient and scalable compared to many widely adopted graph kernels with running time cubic in the number of vertices.

@Article{DBLP:journals/corr/Shrivastava014,
Title = {Graph Kernels via Functional Embedding},
Author = {Anshumali Shrivastava and Ping Li},
Journal = {CoRR},
Year = {2014},
Volume = {abs/1404.5214},
URL = {http://arxiv.org/abs/1404.5214},
Owner = {vilhuber},
Abstract = {We propose a representation of graph as a functional object derived from the power iteration of the underlying adjacency matrix. The proposed functional representation is a graph invariant, i.e., the functional remains unchanged under any reordering of the vertices. This property eliminates the difficulty of handling exponentially many isomorphic forms. Bhattacharyya kernel constructed between these functionals significantly outperforms the state-of-the-art graph kernels on 3 out of the 4 standard benchmark graph classification datasets, demonstrating the superiority of our approach. The proposed methodology is simple and runs in time linear in the number of edges, which makes our kernel more efficient and scalable compared to many widely adopted graph kernels with running time cubic in the number of vertices.},
Timestamp = {2014.07.09}
}
• A. Shrivastava and P. Li, “In Defense of MinHash Over SimHash,” in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), Reykjavik, Iceland, 2014.
[Abstract] [PDF] [URL] [Bibtex]

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality $S^2 \le R \le \frac{S}{2-S}$. Our worst case analysis can show that MinHash significantly outperforms SimHash in high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often $R \ge \frac{S}{z-S}$ holds where z is only slightly larger than 2 (e.g., $z \le 2.1$). Our restricted worst case analysis by assuming $\frac{S}{z-S} \le R \le \frac{S}{2-S}$ shows that MinHash indeed significantly outperforms SimHash even in low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.

@InProceedings{Ping2014,
Title = {In Defense of MinHash Over SimHash},
Author = {Anshumali Shrivastava and Ping Li},
Booktitle = {Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS)},
Year = {2014},
Volume = {33},
Owner = {vilhuber},
URL = {http://jmlr.org/proceedings/papers/v33/shrivastava14.html},
Abstract = {MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality $S^2 \le R \le \frac{S}{2-S}$. Our worst case analysis can show that MinHash significantly outperforms SimHash in high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often $R \ge \frac{S}{z-S}$ holds where z is only slightly larger than 2 (e.g., $z \le 2.1$). Our restricted worst case analysis by assuming $\frac{S}{z-S} \le R \le \frac{S}{2-S}$ shows that MinHash indeed significantly outperforms SimHash even in low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.},
pdf = {http://jmlr.org/proceedings/papers/v33/shrivastava14.pdf},
Timestamp = {2014.07.09}
}
• J. Drechsler and L. Vilhuber, “Synthetic Longitudinal Business Databases for International Comparisons,” in Privacy in Statistical Databases, J. Domingo-Ferrer, Ed., Springer International Publishing, 2014, vol. 8744, pp. 243-252.
[Abstract] [DOI] [URL] [Bibtex]

International comparison studies on economic activity are often hampered by the fact that access to business microdata is very limited on an international level. A recently launched project tries to overcome these limitations by improving access to Business Censuses from multiple countries based on synthetic data. Starting from the synthetic version of the longitudinally edited version of the U.S. Business Register (the Longitudinal Business Database, LBD), the idea is to create similar data products in other countries by applying the synthesis methodology developed for the LBD to generate synthetic replicates that could be distributed without confidentiality concerns. In this paper we present some first results of this project based on German business data collected at the Institute for Employment Research.

@InCollection{psd2014a,
author = {Drechsler, Jörg and Vilhuber, Lars},
title = {Synthetic Longitudinal Business Databases for International Comparisons},
booktitle = {Privacy in Statistical Databases},
publisher = {Springer International Publishing},
year = {2014},
editor = {Domingo-Ferrer, Josep},
volume = {8744},
series = {Lecture Notes in Computer Science},
pages = {243-252},
abstract = {International comparison studies on economic activity are often hampered by the fact that access to business microdata is very limited on an international level. A recently launched project tries to overcome these limitations by improving access to Business Censuses from multiple countries based on synthetic data. Starting from the synthetic version of the longitudinally edited version of the U.S. Business Register (the Longitudinal Business Database, LBD), the idea is to create similar data products in other countries by applying the synthesis methodology developed for the LBD to generate synthetic replicates that could be distributed without confidentiality concerns. In this paper we present some first results of this project based on German business data collected at the Institute for Employment Research.},
doi = {10.1007/978-3-319-11257-2_19},
isbn = {978-3-319-11256-5},
keywords = {business data; confidentiality; international comparison; multiple imputation; synthetic},
language = {English},
url = {http://dx.doi.org/10.1007/978-3-319-11257-2_19},
}
• J. Miranda and L. Vilhuber, “Using Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results,” in Privacy in Statistical Databases, J. Domingo-Ferrer, Ed., Springer International Publishing, 2014, vol. 8744, pp. 232-242.
[Abstract] [DOI] [URL] [Bibtex]

The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.

@incollection{psd2014b,
year={2014},
isbn={978-3-319-11256-5},
booktitle={Privacy in Statistical Databases},
volume={8744},
series={Lecture Notes in Computer Science},
editor={Domingo-Ferrer, Josep},
doi={10.1007/978-3-319-11257-2_18},
title={Using Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results},
url={http://dx.doi.org/10.1007/978-3-319-11257-2_18},
publisher={Springer International Publishing},
keywords={synthetic data; statistical disclosure limitation; time-series; local labor markets; gross job flows; confidentiality protection},
author={Miranda, Javier and Vilhuber, Lars},
pages={232-242},
language={English},
abstract={The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.}
}
2013
• C. Lagoze, W. C. Block, J. Williams, J. M. Abowd, and L. Vilhuber, “Data Management of Confidential Data,” International Journal of Digital Curation, vol. 8, iss. 1, pp. 265-278, 2013.
[Abstract] [DOI] [Bibtex]

Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data.

@Article{DBLP:journals/ijdc/LagozeBWAV13,
Title = {Data Management of Confidential Data},
Author = {Carl Lagoze and William C. Block and Jeremy Williams and John M. Abowd and Lars Vilhuber},
Journal = {International Journal of Digital Curation},
Year = {2013},
Note = {Presented at 8th International Digital Curation Conference 2013, Amsterdam. See also http://hdl.handle.net/1813/30924},
Number = {1},
Pages = {265-278},
Volume = {8},
Abstract = {Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data.},
Bibsource = {DBLP, http://dblp.uni-trier.de},
Doi = {10.2218/ijdc.v8i1.259},
Owner = {vilhuber},
Timestamp = {2013.10.09}
}
• C. Lagoze, W. C. Block, J. Williams, and L. Vilhuber, “Encoding Provenance of Social Science Data: Integrating PROV with DDI,” in 5th Annual European DDI User Conference, 2013.
[Abstract] [DOI] [URL] [Bibtex]

Provenance is a key component of evaluating the integrity and reusability of data for scholarship. While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface.

@InProceedings{LagozeEtAl2013,
author = {Carl Lagoze and William C. Block and Jeremy Williams and Lars Vilhuber},
title = {Encoding Provenance of Social Science Data: Integrating PROV with {DDI}},
booktitle = {5th Annual European DDI User Conference},
year = {2013},
doi = {10.3886/eDDILagoze},
abstract = {Provenance is a key component of evaluating the integrity and reusability of data for scholarship. While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface.},
file = {:LagozeEtAl2013:PDF},
issn = {2153-8247},
keywords = {Metadata, Provenance, DDI, eSocial Science},
owner = {vilhuber},
timestamp = {2013.10.09},
url = {http://www.eddi-conferences.eu/ocs/index.php/eddi/EDDI13/paper/view/115},
}
• C. Lagoze, J. Williams, and L. Vilhuber, “Encoding Provenance Metadata for Social Science Datasets,” in Metadata and Semantics Research, 2013, pp. 123-134.
[Abstract] [DOI] [URL] [Bibtex]

Recording provenance is a key requirement for data-centric scholarship, allowing researchers to evaluate the integrity of source data sets and reproduce, and thereby, validate results. Provenance has become even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. Recent work by the W3C on the PROV model provides the foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We apply that model to complex, but characteristic, provenance examples of social science data, describe scenarios that make scholarly use of those provenance descriptions, and propose a manner for encoding this provenance metadata within the widely-used DDI metadata standard.

@InProceedings{LagozeEtAl2013b,
Title = {Encoding Provenance Metadata for Social Science Datasets},
Author = {Lagoze, Carl and Williams, Jeremy and Vilhuber, Lars},
Booktitle = {Metadata and Semantics Research},
Year = {2013},
Editor = {Garoufallou, Emmanouel and Greenberg, Jane},
Pages = {123-134},
Publisher = {Springer International Publishing},
Series = {Communications in Computer and Information Science},
Volume = {390},
Abstract ={Recording provenance is a key requirement for data-centric scholarship, allowing researchers to evaluate the integrity of source data sets and reproduce, and thereby, validate results. Provenance has become even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. Recent work by the W3C on the PROV model provides the foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We apply that model to complex, but characteristic, provenance examples of social science data, describe scenarios that make scholarly use of those provenance descriptions, and propose a manner for encoding this provenance metadata within the widely-used DDI metadata standard.},
Doi = {10.1007/978-3-319-03437-9_13},
ISBN = {978-3-319-03436-2},
Keywords = {Metadata; Provenance; DDI; eSocial Science},
Owner = {vilhuber},
Timestamp = {2013.11.05},
Url = {http://dx.doi.org/10.1007/978-3-319-03437-9_13}
}
• P. Li, A. Shrivastava, and A. C. König, “b-Bit Minwise Hashing in Practice,” in Internetware 2013, 2013.
[Abstract] [URL] [Bibtex]

Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce very similar learning results as using fully random permutations. Experiments on datasets of up to 200GB are presented.

@InProceedings{PingShrivastava2013,
author = {Ping Li and Anshumali Shrivastava and König, Arnd Christian},
title = {b-Bit Minwise Hashing in Practice},
booktitle = {Internetware 2013},
year = {2013},
month = {October},
abstract = {Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce very similar learning results as using fully random permutations. Experiments on datasets of up to 200GB are presented.},
file = {http://ecommons.library.cornell.edu/bitstream/1813/37986/2/a13-li.pdf},
owner = {vilhuber},
timestamp = {2013.10.07},
url = {http://www.nudt.edu.cn/internetware2013/},
}
• P. Li and C. Zhang, “Exact Sparse Recovery with L0 Projections,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2013, pp. 302-310.
[Abstract] [DOI] [URL] [Bibtex]

Many applications (e.g., anomaly detection) concern sparse signals. This paper focuses on the problem of recovering a K-sparse signal $x \in \mathbb{R}^{1\times N}$, i.e., $K \ll N$ and $\sum_{i=1}^{N} 1\{x_i \neq 0\} = K$. In the mainstream framework of compressed sensing (CS), x is recovered from M linear measurements $y = xS \in \mathbb{R}^{1\times M}$, where $S \in \mathbb{R}^{N\times M}$ is often a Gaussian (or Gaussian-like) design matrix. In our proposed method, the design matrix S is generated from an α-stable distribution with α ≈ 0. Our decoding algorithm mainly requires one linear scan of the coordinates, followed by a few iterations on a small number of coordinates which are "undetermined" in the previous iteration. Our practical algorithm consists of two estimators. In the first iteration, the (absolute) minimum estimator is able to filter out a majority of the zero coordinates. The gap estimator, which is applied in each iteration, can accurately recover the magnitudes of the nonzero coordinates. Comparisons with linear programming (LP) and orthogonal matching pursuit (OMP) demonstrate that our algorithm can be significantly faster in decoding speed and more accurate in recovery quality, for the task of exact sparse recovery. Our procedure is robust against measurement noise. Even when there are insufficient measurements, our algorithm can still reliably recover a significant portion of the nonzero coordinates.

@InProceedings{LiZhang2013a,
author = {Li, Ping and Zhang, Cun-Hui},
title = {Exact Sparse Recovery with L0 Projections},
booktitle = {Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
series = {KDD '13},
year = {2013},
isbn = {978-1-4503-2174-7},
location = {Chicago, Illinois, USA},
pages = {302--310},
numpages = {9},
Abstract ={Many applications (e.g., anomaly detection) concern sparse signals. This paper focuses on the problem of recovering a K-sparse signal $x \in \mathbb{R}^{1\times N}$, i.e., $K \ll N$ and $\sum_{i=1}^{N} 1\{x_i \neq 0\} = K$. In the mainstream framework of compressed sensing (CS), x is recovered from M linear measurements $y = xS \in \mathbb{R}^{1\times M}$, where $S \in \mathbb{R}^{N\times M}$ is often a Gaussian (or Gaussian-like) design matrix. In our proposed method, the design matrix S is generated from an α-stable distribution with α ≈ 0. Our decoding algorithm mainly requires one linear scan of the coordinates, followed by a few iterations on a small number of coordinates which are "undetermined" in the previous iteration. Our practical algorithm consists of two estimators. In the first iteration, the (absolute) minimum estimator is able to filter out a majority of the zero coordinates. The gap estimator, which is applied in each iteration, can accurately recover the magnitudes of the nonzero coordinates. Comparisons with linear programming (LP) and orthogonal matching pursuit (OMP) demonstrate that our algorithm can be significantly faster in decoding speed and more accurate in recovery quality, for the task of exact sparse recovery. Our procedure is robust against measurement noise. Even when there are insufficient measurements, our algorithm can still reliably recover a significant portion of the nonzero coordinates.},
url = {http://doi.acm.org/10.1145/2487575.2487694},
doi = {10.1145/2487575.2487694},
acmid = {2487694},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {compressed sensing, l0 projections, stable distributions},
}
• A. Shrivastava and P. Li, “Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search,” in Advances in Neural Information Processing Systems 26, 2013, pp. 791-799.
[Abstract] [PDF] [URL] [Bibtex]

We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: $R^{3way} = \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|}$, $S_1, S_2, S_3 \in C$, where C is a size n collection of sets (or binary vectors). We show that approximate $R^{3way}$ similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of $R^{3way}$ search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of $R^{3way}$ resemblance over the pairwise case in improving retrieval quality.

@InProceedings{ShrivastavaLi2013a,
title = {Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search},
author = {Shrivastava, Anshumali and Li, Ping},
booktitle = {Advances in Neural Information Processing Systems 26},
editor = {C.J.C. Burges and L. Bottou and M. Welling and Z. Ghahramani and K.Q. Weinberger},
pages = {791--799},
year = {2013},
publisher = {Curran Associates, Inc.},
abstract = {We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: $R^{3way} = \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|}$, $S_1, S_2, S_3 \in C$, where C is a size n collection of sets (or binary vectors). We show that approximate $R^{3way}$ similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of $R^{3way}$ search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of $R^{3way}$ resemblance over the pairwise case in improving retrieval quality.},
url = {http://papers.nips.cc/paper/5216-beyond-pairwise-provably-fast-algorithms-for-approximate-k-way-similarity-search/},
pdf = {http://papers.nips.cc/paper/5216-beyond-pairwise-provably-fast-algorithms-for-approximate-k-way-similarity-search.pdf},
Owner = {vilhuber},
Timestamp = {2013.09.06}
}
2012
• J. M. Abowd, L. Vilhuber, and W. Block, “A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs,” in Privacy in Statistical Databases, J. Domingo-Ferrer and I. Tinnirello, Eds., Springer Berlin Heidelberg, 2012, vol. 7556, pp. 216-225.
[Abstract] [DOI] [URL] [Bibtex]

We develop the core of a method for solving the data archive and curation problem that confronts the custodians of restricted-access research data and the scientific users of such data. Our solution recognizes the dual protections afforded by physical security and access limitation protocols. It is based on extensible tools and can be easily incorporated into existing instructional materials.

@InCollection{raey,
Title = {A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs},
Author = {Abowd, John M. and Vilhuber, Lars and Block, William},
Booktitle = {Privacy in Statistical Databases},
Publisher = {Springer Berlin Heidelberg},
Year = {2012},
Editor = {Domingo-Ferrer, Josep and Tinnirello, Ilenia},
Pages = {216-225},
Series = {Lecture Notes in Computer Science},
Volume = {7556},
Abstract ={We develop the core of a method for solving the data archive and curation problem that confronts the custodians of restricted-access research data and the scientific users of such data. Our solution recognizes the dual protections afforded by physical security and access limitation protocols. It is based on extensible tools and can be easily incorporated into existing instructional materials.},
Doi = {10.1007/978-3-642-33627-0_17},
ISBN = {978-3-642-33626-3},
Keywords = {Data Archive; Data Curation; Statistical Disclosure Limitation; Privacy-preserving Datamining},
Url = {http://dx.doi.org/10.1007/978-3-642-33627-0_17}
}
• P. Li, A. Owen, and C. Zhang, “One Permutation Hashing,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., 2012, pp. 3122-3130.
[Abstract] [URL] [Bibtex]

While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) k=500 permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple one permutation hashing scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing k permutations to just one would be much more energy-efficient, which might be an important perspective as minwise hashing is commonly deployed in the search industry. While the theoretical probability analysis is interesting, our experiments on similarity estimation and SVM & logistic regression also confirm the theoretical results.

@InCollection{NIPS2012_1436,
author = {Ping Li and Art Owen and Cun-Hui Zhang},
title = {One Permutation Hashing},
booktitle = {Advances in Neural Information Processing Systems 25},
year = {2012},
editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
pages = {3122--3130},
abstract = {While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) k=500 permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple \textbf{one permutation hashing} scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing k permutations to just one would be much more \textbf{energy-efficient}, which might be an important perspective as minwise hashing is commonly deployed in the search industry. While the theoretical probability analysis is interesting, our experiments on similarity estimation and SVM \& logistic regression also confirm the theoretical results.},
file = {4778-one-permutation-hashing.pdf:http\://papers.nips.cc/paper/4778-one-permutation-hashing.pdf:PDF},
url = {http://papers.nips.cc/paper/4778-one-permutation-hashing},
}
• P. Li, A. Shrivastava, and A. C. König, “GPU-based minwise hashing: GPU-based minwise hashing,” in Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume), 2012, pp. 565-566.
[Abstract] [DOI] [URL] [Bibtex]

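Minwise hashing is a standard technique for efficient set similarity estimation in the context of search. The recent work of b-bit minwise hashing provided a substantial improvement by storing only the lowest b bits of each hashed value. Both minwise hashing and b-bit minwise hashing require an expensive preprocessing step for applying k (e.g., k=500) permutations on the entire data in order to compute k minimal values as the hashed data. In this paper, we developed a parallelization scheme using GPUs, which reduced the processing time by a factor of 20-80. Reducing the preprocessing time is highly beneficial in practice, for example, for duplicate web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers (when the test data are not preprocessed).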

@InProceedings{LiSK12,
Title = {GPU-based minwise hashing: GPU-based minwise hashing},
Author = {Ping Li and Anshumali Shrivastava and Arnd Christian K{\"o}nig},
Booktitle = {Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume)},
Year = {2012},
Pages = {565-566},
Abstract ={Minwise hashing is a standard technique for efficient set similarity estimation in the context of search. The recent work of b-bit minwise hashing provided a substantial improvement by storing only the lowest b bits of each hashed value. Both minwise hashing and b-bit minwise hashing require an expensive preprocessing step for applying k (e.g., k=500) permutations on the entire data in order to compute k minimal values as the hashed data. In this paper, we developed a parallelization scheme using GPUs, which reduced the processing time by a factor of 20-80. Reducing the preprocessing time is highly beneficial in practice, for example, for duplicate web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers (when the test data are not preprocessed).},
Bibsource = {DBLP, http://dblp.uni-trier.de},
Doi = {10.1145/2187980.2188129},
Url = {http://doi.acm.org/10.1145/2187980.2188129}
}
• P. Li and C. Zhang, “Entropy Estimations Using Correlated Symmetric Stable Random Projections,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., 2012, pp. 3185-3193.
[Abstract] [URL] [Bibtex]

Methods for efficiently estimating the Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Our experiments confirm that this method is able to substantially better approximate the Shannon entropy compared to the prior state-of-the-art.

@InCollection{NIPS2012_1456,
author = {Ping Li and Cun-Hui Zhang},
title = {Entropy Estimations Using Correlated Symmetric Stable Random Projections},
booktitle = {Advances in Neural Information Processing Systems 25},
year = {2012},
editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
pages = {3185--3193},
abstract = {Methods for efficiently estimating the Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Our experiments confirm that this method is able to substantially better approximate the Shannon entropy compared to the prior state-of-the-art.},
file = {http://papers.nips.cc/paper/4667-entropy-estimations-using-correlated-symmetric-stable-random-projections.pdf},
url = {http://papers.nips.cc/paper/4667-entropy-estimations-using-correlated-symmetric-stable-random-projections},
}
• A. Shrivastava and P. Li, “Fast Near Neighbor Search in High-Dimensional Binary Data,” in The European Conference on Machine Learning (ECML 2012), 2012.
[Abstract] [PDF] [Bibtex]

Numerous applications in search, databases, machine learning, and computer vision, can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a very simple and effective strategy for sub-linear time near neighbor search, by creating hash tables directly using the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections.

@InProceedings{ShrivastavaL12,
author = {Anshumali Shrivastava and Ping Li},
title = {Fast Near Neighbor Search in High-Dimensional Binary Data},
booktitle = {The European Conference on Machine Learning (ECML 2012)},
year = {2012},
abstract = {Numerous applications in search, databases, machine learning,
and computer vision, can benefit from efficient algorithms for near
neighbor search. This paper proposes a simple framework for fast near
neighbor search in high-dimensional binary data, which are common in
practice (e.g., text). We develop a very simple and effective strategy for
sub-linear time near neighbor search, by creating hash tables directly
using the bits generated by b-bit minwise hashing. The advantages of
our method are demonstrated through thorough comparisons with two
strong baselines: spectral hashing and sign (1-bit) random projections.
},
PDF = {http://www.cs.bris.ac.uk/~flach/ECMLPKDD2012papers/1125548.pdf},
}
• R. Srivastava, P. Li, and D. Sengupta, “Testing for Membership to the IFRA and the NBU Classes of Distributions,” Journal of Machine Learning Research – Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), vol. 22, pp. 1099-1107, 2012.
[Abstract] [URL] [Bibtex]

This paper provides test procedures to determine whether the probability distribution underlying a set of non-negative valued samples belongs to the Increasing Failure Rate Average (IFRA) class or the New Better than Used (NBU) class. Membership of a distribution to one of these classes is known to have implications which are important in reliability, queuing theory, game theory and other disciplines. Our proposed test is based on the Kolmogorov-Smirnov distance between an empirical cumulative hazard function and its best approximation from the class of distributions constituting the null hypothesis. It turns out that the least favorable distribution, which produces the largest probability of Type I error of each of the tests, is the exponential distribution. This fact is used to produce an appropriate cut-off or p-value. Monte Carlo simulations are conducted to check small sample size (i.e., significance) and power of the test. Usefulness of the test is illustrated through the analysis of a set of monthly family expenditure data collected by the National Sample Survey Organization of the Government of India.

@Article{SrivastavaLS12,
author = {Radhendushka Srivastava and Ping Li and Debasis Sengupta},
title = {Testing for Membership to the IFRA and the NBU Classes of Distributions},
journal = {Journal of Machine Learning Research - Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012)},
year = {2012},
volume = {22},
pages = {1099-1107},
abstract = {This paper provides test procedures to determine whether the probability distribution underlying a set of non-negative valued samples belongs to the Increasing Failure Rate Average (IFRA) class or the New Better than Used (NBU) class. Membership of a distribution to one of these classes is known to have implications which are important in reliability, queuing theory, game theory and other disciplines. Our proposed test is based on the Kolmogorov-Smirnov distance between an empirical cumulative hazard function and its best approximation from the class of distributions constituting the null hypothesis. It turns out that the least favorable distribution, which produces the largest probability of Type I error of each of the tests, is the exponential distribution. This fact is used to produce an appropriate cut-off or p-value. Monte Carlo simulations are conducted to check small sample size (i.e., significance) and power of the test. Usefulness of the test is illustrated through the analysis of a set of monthly family expenditure data collected by the National Sample Survey Organization of the Government of India.},
bibsource = {DBLP, http://dblp.uni-trier.de},
file = {srivastava12.pdf:http\://www.jmlr.org/proceedings/papers/v22/srivastava12/srivastava12.pdf:PDF},
url = {http://www.jmlr.org/proceedings/papers/v22/srivastava12.html},
}
• X. Sun, A. Shrivastava, and P. Li, “Fast Multi-task Learning for Query Spelling Correction,” in The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012), 2012, pp. 285-294.
[Abstract] [DOI] [URL] [Bibtex]

In this paper, we explore the use of a novel online multi-task learning framework for the task of search query spelling correction. In our procedure, correction candidates are initially generated by a ranker-based system and then re-ranked by our multi-task learning algorithm. With the proposed multi-task learning method, we are able to effectively transfer information from different and highly biased training datasets, for improving spelling correction on all datasets. Our experiments are conducted on three query spelling correction datasets including the well-known TREC benchmark dataset. The experimental results demonstrate that our proposed method considerably outperforms the existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. Compared to the commonly used online learning methods which typically require more than (e.g.,) 60 training passes, our proposed method is able to closely reach the empirical optimum in about 5 passes.

@InProceedings{CIKM-SunSL12,
Title = {Fast Multi-task Learning for Query Spelling Correction},
Author = {Xu Sun and Anshumali Shrivastava and Ping Li},
Booktitle = {The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012)},
Year = {2012},
Pages = {285--294},
Abstract ={In this paper, we explore the use of a novel online multi-task learning framework for the task of search query spelling correction. In our procedure, correction candidates are initially generated by a ranker-based system and then re-ranked by our multi-task learning algorithm. With the proposed multi-task learning method, we are able to effectively transfer information from different and highly biased training datasets, for improving spelling correction on all datasets. Our experiments are conducted on three query spelling correction datasets including the well-known TREC benchmark dataset. The experimental results demonstrate that our proposed method considerably outperforms the existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. Compared to the commonly used online learning methods which typically require more than (e.g.,) 60 training passes, our proposed method is able to closely reach the empirical optimum in about 5 passes.},
Doi = {10.1145/2396761.2396800},
Url = {http://dx.doi.org/10.1145/2396761.2396800}
}
• X. Sun, A. Shrivastava, and P. Li, “Query spelling correction using multi-task learning,” in Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume), 2012, pp. 613-614.
[Abstract] [DOI] [URL] [Bibtex]

This paper explores the use of online multi-task learning for search query spelling correction, by effectively transferring information from different and biased training datasets for improving spelling correction across datasets. Experiments were conducted on three query spelling correction datasets, including the well-known TREC benchmark data. Our experimental results demonstrate that the proposed method considerably outperforms existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. In contrast to existing methods which typically require more than (e.g.,) 50 training passes, our algorithm can very closely approach the empirical optimum in around five passes.

@InProceedings{WWW-SunSL12,
Title = {Query spelling correction using multi-task learning},
Author = {Xu Sun and Anshumali Shrivastava and Ping Li},
Booktitle = {Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume)},
Year = {2012},
Pages = {613-614},
Abstract ={This paper explores the use of online multi-task learning for search query spelling correction, by effectively transferring information from different and biased training datasets for improving spelling correction across datasets. Experiments were conducted on three query spelling correction datasets, including the well-known TREC benchmark data. Our experimental results demonstrate that the proposed method considerably outperforms existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. In contrast to existing methods which typically require more than (e.g.,) 50 training passes, our algorithm can very closely approach the empirical optimum in around five passes.},
Bibsource = {DBLP, http://dblp.uni-trier.de},
Doi = {10.1145/2187980.2188153},
Url = {http://doi.acm.org/10.1145/2187980.2188153}
}

## eCommons Preprints

We publish freely accessible copies of papers and preprints at the Cornell eCommons repository.

2017
• S. Haney, A. Machanavajjhala, J. M. Abowd, M. Graham, and M. Kutzbach, “Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:49652, 2017.
[Abstract] [URL] [Bibtex]

@techreport{handle:1813:49652,
Title = {Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics},
Author = {Haney, Samuel and Machanavajjhala, Ashwin and Abowd, John M and Graham, Matthew and Kutzbach, Mark},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2017},
number={1813:49652},
URL = {http://hdl.handle.net/1813/49652},
abstract ={National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ϵ≥1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional}
}
• L. Vilhuber, I. Schmutte, and J. M. Abowd, “Proceedings from the 2016 NSF–Sloan Workshop on Practical Privacy,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:46197, 2017.
[Abstract] [URL] [Bibtex]

On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau’s flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.

@techreport{handle:1813:46197,
Title = {Proceedings from the 2016 NSF–Sloan Workshop on Practical Privacy},
Author = {Vilhuber, Lars and Schmutte, Ian and Abowd, John M.},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2017},
number={1813:46197},
URL = {http://hdl.handle.net/1813/46197},
abstract ={On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau’s flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.}
}
2016
• J. M. Abowd, “How Will Statistical Agencies Operate When All Data Are Private,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:49653, 2016.
[Abstract] [URL] [Bibtex]

@techreport{handle:1813:49653,
Title = {How Will Statistical Agencies Operate When All Data Are Private},
Author = {Abowd, John M},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2016},
number={1813:49653},
URL = {http://hdl.handle.net/1813/49653},
abstract ={The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.
}
}
• J. M. Abowd, “How Will Statistical Agencies Operate When All Data Are Private?,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:44663, 2016.
[Abstract] [URL] [Bibtex]

The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.

@techreport{handle:1813:44663,
Title = {How Will Statistical Agencies Operate When All Data Are Private?},
Author = {Abowd, John M.},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2016},
number={1813:44663},
URL = {http://hdl.handle.net/1813/44663},
abstract ={The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.}
}
• L. Vilhuber, J. M. Abowd, and J. P. Reiter, “Synthetic Establishment Microdata Around the World,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:42340, 2016.
[Abstract] [URL] [Bibtex]

In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.

@techreport{handle:1813:42340,
Title = {Synthetic Establishment Microdata Around the World},
Author = {Vilhuber, Lars and Abowd, John M. and Reiter, Jerome P.},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2016},
number={1813:42340},
URL = {http://hdl.handle.net/1813/42340},
abstract ={In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.}
}
• L. Vilhuber and J. Miranda, “Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:42339, 2016.
[Abstract] [URL] [Bibtex]

We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau’s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

@techreport{handle:1813:42339,
Title = {Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics},
Author = {Vilhuber, Lars and Miranda, Javier},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2016},
number={1813:42339},
URL = {http://hdl.handle.net/1813/42339},
abstract ={We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).}
}
• J. M. Abowd and K. L. McKinney, “Noise Infusion as a Confidentiality Protection Measure for Graph-Based Statistics,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:42338, 2016.
[Abstract] [URL] [Bibtex]

We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.

@techreport{handle:1813:42338,
Title = {Noise Infusion as a Confidentiality Protection Measure for Graph-Based Statistics},
Author = {Abowd, John M. and McKinney, Kevin L.},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2016},
number={1813:42338},
URL = {http://hdl.handle.net/1813/42338},
abstract ={We use the bipartite graph representation of longitudinally linked employer-employee
data, and the associated projections onto the employer and employee
nodes, respectively, to characterize the set of potential statistical summaries
that the trusted custodian might produce. We consider noise infusion as the
primary confidentiality protection method. We show that a relatively straightforward
extension of the dynamic noise-infusion method used in the U.S. Census
Bureau’s Quarterly Workforce Indicators can be adapted to provide the same
confidentiality guarantees for the graph-based statistics: all inputs have been
modified by a minimum percentage deviation (i.e., no actual respondent data are
used) and, as the number of entities contributing to a particular statistic increases,
the accuracy of that statistic approaches the unprotected value. Our method also
ensures that the protected statistics will be identical in all releases based on the
same inputs.}
}
2015
• M. J. Schneider and J. M. Abowd, “A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:40828, 2015.
[Abstract] [URL] [Bibtex]

Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between confidentiality protection and inference quality. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The United States Census Bureau collects millions of interrelated time series micro-data that are hierarchical and contain many zeros and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian Generalized Linear Mixed Models (BGLMM) with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the magnitudes or number of entities. We find that as the prior distributions of the variance components in the BGLMM become more precise toward zero, confidentiality protection increases and inference quality deteriorates. We evaluate our methodology using a strict privacy measure, empirical differential privacy, and a newly defined risk measure, Probability of Range Identification (PoRI), which directly measures attribute disclosure risk. We illustrate our results with the U.S. Census Bureau’s Quarterly Workforce Indicators.

@techreport{handle:1813:40828,
Title = {A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data},
Author = {Schneider, Matthew J. and Abowd, John M.},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2015},
number={1813:40828},
URL = {http://hdl.handle.net/1813/40828},
abstract ={Organizations disseminate statistical summaries of administrative data via the Web for unrestricted
public use. They balance the trade-off between confidentiality protection and inference quality. Recent developments
in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features
of underlying data by releasing altered data generated from a posterior predictive distribution. The United States
Census Bureau collects millions of interrelated time series micro-data that are hierarchical and contain many zeros
and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small
magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use
zero-inflated extensions of Bayesian Generalized Linear Mixed Models (BGLMM) with privacy-preserving prior
distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small
groups of entities without suppression based on the magnitudes or number of entities. We find that as the prior
distributions of the variance components in the BGLMM become more precise toward zero, confidentiality protection
increases and inference quality deteriorates. We evaluate our methodology using a strict privacy measure, empirical
differential privacy, and a newly defined risk measure, Probability of Range Identification (PoRI), which directly
measures attribute disclosure risk. We illustrate our results with the U.S. Census Bureau’s Quarterly Workforce
Indicators.}
}
2014
• C. Lagoze, L. Vilhuber, J. Williams, B. Perry, and W. C. Block, “CED²AR: The Comprehensive Extensible Data Documentation and Access Repository,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:44702, 2014.
[Abstract] [URL] [Bibtex]

We describe the design, implementation, and deployment of the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). This is a metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata through either a web-based user interface or programmatically through a search API, all the while reusing and linking to existing archive- and provider-generated metadata. CED²AR is distinguished from other metadata repository-based applications due to requirements that derive from its social science context. These include the need to cloak confidential data and metadata and manage complex provenance chains. Presented at the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), Sept 8-12, 2014.

@techreport{handle:1813:44702,
Title = {CED²AR: The Comprehensive Extensible Data Documentation and Access Repository},
Author = {Lagoze, Carl and Vilhuber, Lars and Williams, Jeremy and Perry, Benjamin and Block, William C.},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2014},
number={1813:44702},
URL = {http://hdl.handle.net/1813/44702},
abstract ={We describe the design, implementation, and deployment of the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). This is a metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata through either a web-based user interface or programmatically through a search API, all the while reusing and linking to existing archive- and provider-generated metadata. CED²AR is distinguished from other metadata repository-based
applications due to requirements that derive from its social science context. These include the need to cloak confidential data and metadata and manage complex provenance chains.
Presented at the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), Sept 8-12, 2014.}
}
• J. Miranda and L. Vilhuber, “Using partially synthetic data to replace suppression in the Business Dynamics Statistics: early results,” NSF Census Research Network – NCRN-Cornell, Preprint 1813:40852, 2014.
[Abstract] [URL] [Bibtex]

The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.

@techreport{handle:1813:40852,
Title = {Using partially synthetic data to replace suppression in the Business Dynamics Statistics: early results},
Author = {Miranda, Javier and Vilhuber, Lars},
institution = { NSF Census Research Network - NCRN-Cornell },
type = {Preprint} ,
Year = {2014},
number={1813:40852},
URL = {http://hdl.handle.net/1813/40852},
abstract ={The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as
additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.}
}

Interested users might want to download the complete Bibtex files for the two sections above:

We have created and maintain electronic metadata (otherwise known as “online codebooks”) for a number of datasets, displayed using CED²AR:

• Cornell NSF-Census Research Network, “NBER-CES Manufacturing Industry Database (NAICS, 2009) [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2013.
[URL] [Bibtex]
@TECHREPORT{CED2AR-NBER-naics2009,
author = {{Cornell NSF-Census Research Network}},
title = {NBER-CES Manufacturing Industry Database (NAICS, 2009) [Codebook file]},
institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
type = {{DDI-C} document},
year = {2013},
url = {https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/nber-ces/v/naics2009}
}
• Cornell NSF-Census Research Network, “NBER-CES Manufacturing Industry Database (SIC, 2009) [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2013.
[URL] [Bibtex]
@TECHREPORT{CED2AR-NBER-sic2009,
author = {{Cornell NSF-Census Research Network}},
title = {NBER-CES Manufacturing Industry Database (SIC, 2009) [Codebook file]},
institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
type = {{DDI-C} document},
year = {2013},
url = {https://www2.ncrn.cornell.edu/ced2ar-web/codebooks/nber-ces/v/sic2009}
}
• Reeder, Lori B., Martha Stinson, Kelly E. Trageser, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta v5.1 [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2014.
[URL] [Bibtex]
@TECHREPORT{CED2AR-SSBv51,
author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
title = {Codebook for the {SIPP} {S}ynthetic {B}eta v5.1 [Codebook file]},
institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
type = {{DDI-C} document},
year = {2014},
url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v51}
}
• Reeder, Lori B., Martha Stinson, Kelly E. Trageser, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta v6.0 [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2015.
[URL] [Bibtex]
@TECHREPORT{CED2AR-SSBv6,
author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
title = {Codebook for the {SIPP} {S}ynthetic {B}eta v6.0 [Codebook file]},
institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
type = {{DDI-C} document},
year = {2015},
url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v6}
}
• Reeder, Lori B., Martha Stinson, Kelly E. Trageser, and Lars Vilhuber, “Codebook for the SIPP Synthetic Beta v6.0.2 [Codebook file],” Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2015.
[URL] [Bibtex]
@TECHREPORT{CED2AR-SSBv602,
author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
title = {Codebook for the {SIPP} {S}ynthetic {B}eta v6.0.2 [Codebook file]},
institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
type = {{DDI-C} document},
year = {2015},
url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v602}
}
• Vilhuber, Lars, “Codebook for the Synthetic LBD Version 2.0 [Codebook file],” Comprehensive Extensible Data Documentation and Access Repository (CED2AR), Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2013.
[URL] [Bibtex]
@TECHREPORT{CED2AR-SynLBDv2,
author = { Lars Vilhuber },
title = {Codebook for the Synthetic LBD Version 2.0 [Codebook file]},
institution = {{Comprehensive Extensible Data Documentation and Access Repository (CED2AR)}, Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University},
type = {DDI-C document},
year = {2013},
url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/synlbd/v/v2}
}

and others, available at http://www2.ncrn.cornell.edu/ced2ar-web/.

Some of the datasets are hosted by the Cornell VirtualRDC:

## Data repositories

Where possible, we publish the data created by our projects, or the data necessary to replicate papers, in publicly accessible repositories:

## Code publication

We publish, where possible, source code and metadata standards in openly accessible locations:

## Presentations

Our presentations are archived on the Cornell eCommons presentations repository.