Training

Understanding Social and Economic Data

The course “Understanding Social and Economic Data”, known at Cornell University and elsewhere as INFO7470, is designed to provide students with a detailed overview of the U.S. federal statistical system: where the data come from and how they can be used for research. The course also aims to teach students basic and advanced techniques for acquiring and transforming raw information into social and economic data. It is taught as a mixture of self-guided online videos (MOOC-style) and in-classroom discussions of the material. Students from multiple universities, scattered across the country and participating via videoconference, and from multiple disciplines (economics, demography, geography, statistics) contribute to the discussion. More information, including a summary of the course, can be found on the course website.

Additional information

The course is being taught as a hybrid online+telepresence class in the Spring of 2016. We expect future classes to be made available as self-paced online-only classes. More information on the class can be found on the class website at http://www.vrdc.cornell.edu/info7470.

Abstract

The course is designed to teach students basic and advanced techniques for acquiring and transforming raw information into social and economic data. The 2016 version is particularly aimed at American Ph.D. students who are interested in using public-use and confidential U.S. Census Bureau data, as well as the confidential data of other American statistical agencies that cooperate with the Census Bureau. We cover the legal, historical, statistical, computing, and social science aspects of the data “production” process. Students will learn some of the statistical procedures necessary to handle the complex linked data sets increasingly available as confidential data, and will apply some of those techniques in class. New ways of accessing restricted-access data will be presented and, in some cases, tested.

Major emphasis is placed on U.S. Census Bureau data that are accessible from the Federal Statistical Research Data Center (FSRDC) network, which is administered by the Census Bureau on behalf of the collaborating statistical agencies. Graduate students and faculty who are planning to use RDC-based data, or are seriously considering it, should pay particular attention to the lab related to the proposal process. The RDC-accessible data products covered in the course include the internal files used to manage the Census Bureau’s household and establishment frames; the Longitudinal Employer-Household Dynamics (LEHD) micro data; the Longitudinal Business Database (LBD) and its predecessor, the Longitudinal Research Database (LRD); internal versions of the Survey of Income and Program Participation (SIPP), Current Population Survey (CPS), American Community Survey (ACS), American Housing Survey (AHS), and the 1990, 2000, and 2010 Decennial Censuses of Population and Housing; the Employer and Non-employer Business Registers (BR and SSEL); the Censuses and Annual Surveys of Manufactures, Mining, Services, Retail Trade, Wholesale Trade, Construction, Transportation, Communications, and Utilities; the Business Expenditures Survey; the Characteristics of Business Owners; and others.

Core topics

  • Basic statistical principles of populations and sampling frames (no survey background assumed)
  • Acquiring data via samples, censuses, administrative records, transaction logging, and web scraping
  • Law, economics, and statistics of data privacy and confidentiality protection
  • Data linking and integration techniques (probabilistic record linking, illustrated in the sketch after this list; multivariate statistical matching)
  • Data editing and imputation techniques
  • Analytical methods for complex linked data sets, relational databases, and networks
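
To give a flavor of the record-linking topic, the following is a minimal Fellegi-Sunter style sketch in Python. The field names, the m/u probabilities, and the classification cutoff are illustrative assumptions only; they are not values taught in the course or used by any statistical agency.

    # Hypothetical sketch of Fellegi-Sunter probabilistic record linkage.
    # All field names, probabilities, and cutoffs are assumed for illustration.
    import math

    # m = P(field agrees | same entity), u = P(field agrees | different entities)
    M_U = {
        "last_name":  (0.95, 0.01),
        "first_name": (0.90, 0.05),
        "birth_year": (0.85, 0.02),
    }

    def match_weight(rec_a, rec_b):
        """Sum of log2 agreement/disagreement weights over the compared fields."""
        total = 0.0
        for field, (m, u) in M_U.items():
            if rec_a.get(field) == rec_b.get(field):
                total += math.log2(m / u)              # evidence for a link
            else:
                total += math.log2((1 - m) / (1 - u))  # evidence against a link
        return total

    a = {"last_name": "Smith", "first_name": "Ann",  "birth_year": 1980}
    b = {"last_name": "Smith", "first_name": "Anne", "birth_year": 1980}

    w = match_weight(a, b)
    print(w, "link" if w > 5.0 else "non-link")        # the cutoff of 5 is arbitrary

In practice, candidate pairs are first restricted by blocking, and the cutoffs are calibrated to trade off false matches against missed matches.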

Learning objectives

  • To understand the history and components of the U.S. federal statistical system, and how these functions are organized in some other countries–you should be able to find the data you want and know who controls access to them
  • To recognize the source data for federal statistical products, and use these files properly even if they are only supported as restricted-access confidential data–once you have the source data you should know how to analyze them whether or not they were edited and released for public-use
  • To understand the data acquisition, edit, imputation, weighting, confidentiality protections, publications, and underlying microdata for major household and business data products in the federal statistical system–in preparing and executing your analysis, you should be able to take responsibility for the data preparation needed to create accurate, useful analysis files
  • To use spatial, temporal, and network modeling methods, especially Bayesian hierarchical models, as research tools when working with the micro-data and public-use files from major household and business data products–you should be able to recognize and model the statistical and econometric complexities that occur when data are aggregated over time and space and from multiple sources
  • To produce replicable, properly curated research results based on confidential and public-use data files–you should know how to document the complete provenance of your analysis and the curation of essential elements for reproduction of your results from the original data files

Instructors

Undergraduate Training

As part of the development of metadata standards and tools, we have involved undergraduate students in the development of the software and of the data, as well as in the testing and application of the tools.

Software development

CS 5150 Software Engineering is a course given each fall term at Cornell University. The course is “an introduction to the practical problems of specifying, designing, building, testing, and delivering reliable software systems. Other topics covered in lectures include professionalism, project management, and the legal framework for software development. As a central part of the course, student teams carry out projects for real clients. Each project includes all aspects of software development from a feasibility study to final delivery.” [*] In the fall of 2012, the Cornell NCRN node submitted a project, which was selected by a number of students. The project contributed to the first user interface for CED²AR.

In addition, several students have been engaged as undergraduate assistants, as interns, and through their senior projects. Students who participated are listed on our people page.

Data development

Undergraduate researchers helped to create data that simulated the structure of confidential data, using existing unstructured metadata on confidential data (“zero-obs datasets”) and public-use data derived from the confidential data (e.g., ACS and CPS). These files helped in the development of our metadata enhancements.
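
As a rough illustration, a “zero-obs” file is a dataset that carries the variable names and types of a confidential file but contains no records. The pandas sketch below shows the idea; the variable names and types are hypothetical, not the actual ACS or CPS layout, nor the node’s exact procedure.

    # Hypothetical sketch: build a "zero-obs" file that carries the variable
    # names and types of a confidential dataset but contains no records.
    import pandas as pd

    # Schema as it might be recovered from metadata describing the confidential
    # file; these variable names and dtypes are invented for illustration.
    schema = {
        "household_id":  "int64",
        "person_number": "int64",
        "age":           "int64",
        "earnings":      "float64",
        "state_fips":    "object",
    }

    # Zero observations, full structure: enough to exercise metadata tools
    # (variable lists, codebooks) without exposing any confidential content.
    zero_obs = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in schema.items()})

    print(zero_obs.dtypes)   # the structure is preserved...
    print(len(zero_obs))     # ...but there are 0 rows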

Replication project

The Labor Dynamics Institute runs a summer undergraduate research activity to replicate published articles. These replication activities provide a testing ground for standards-based workflow documentation tools.

Graduate Training

The node has trained or engaged several graduate students in computer science and economics. Graduate students assist primarily in research. Interested students can contact us.

Papers co-authored by graduate students

Published papers

2014
  • Shrivastava, Anshumali and Ping Li, “Graph Kernels via Functional Embedding,” CoRR, vol. abs/1404.5214, 2014.

Working papers