12 - Extracting Information from Big Data: Issues of Measurement, Inference and Linkage

Published online by Cambridge University Press:  05 July 2014

Frauke Kreuter
University of Maryland
Roger D. Peng
Johns Hopkins Bloomberg School of Public Health
Julia Lane
American Institutes for Research, Washington DC
Victoria Stodden
Columbia University, New York
Stefan Bender
Institute for Employment Research of the German Federal Employment Agency
Helen Nissenbaum
New York University

Big data pose several interesting and new challenges to statisticians and others who want to extract information from data. As Groves pointedly commented, the era is “appropriately called Big Data as opposed to Big Information,” because there is a lot of work for analysts before information can be gained from “auxiliary traces of some process that is going on in the society.” The analytic challenges most often discussed are those related to three of the Vs that are used to characterize big data. The volume of truly massive data requires expansion of processing techniques that match modern hardware infrastructure, cloud computing with appropriate optimization mechanisms, and re-engineering of storage systems. The velocity of the data calls for algorithms that allow learning and updating on a continuous basis, and of course the computing infrastructure to do so. Finally, the variety of the data structures requires statistical methods that more easily allow for the combination of different data types collected at different levels, sometimes with a temporal and geographic structure.

However, when it comes to privacy and confidentiality, the challenges of extracting (meaningful) information from big data are in our view similar to those associated with data of much smaller size, surveys being one example. For any statistician or quantitatively working (social) scientist there are two main concerns when extracting information from data, which we summarize here as concerns about measurement and concerns about inference. Both of these aspects can be affected by privacy and confidentiality concerns.

Privacy, Big Data, and the Public Good: Frameworks for Engagement, pp. 257–275
Publisher: Cambridge University Press
Print publication year: 2014

O’Neil, C. and Schutt, R., Doing Data Science (Sebastopol, CA: O’Reilly Media, 2014)
Groves, R. M. and Lyberg, L., “Total Survey Error,” Public Opinion Quarterly 74, no. 5 (2010): 849–879
Valliant, R., Dever, J. A., and Kreuter, F., Practical Tools for Sampling and Weighting (New York: Springer, 2013)
Rosenbaum, R. and Rubin, D. B., “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika 70, no. 1 (April 1983): 41–55
Frangakis, C. and Rubin, D., “Principal Stratification in Causal Inference,” Biometrics 58 (2002): 21–29
Singer, E., “Confidentiality, Risk Perception, and Survey Participation,” Chance 17, no. 3 (2004): 30–34
Singer, E., Mathiowetz, N., and Couper, M. P., “The Role of Privacy and Confidentiality as Factors in Response to the 1990 Census,” Public Opinion Quarterly 57 (1993): 465–482
Groves, R. and Peytcheva, E., “The Impact of Nonresponse Rates on Nonresponse Bias: A Meta-Analysis,” Public Opinion Quarterly 72, no. 2 (2008): 167–189
Groves, R. M., “Three Eras of Survey Research,” Public Opinion Quarterly 75, no. 5 (2011): 861–871
Schmieder, J., von Wachter, T., and Bender, S., “The Effects of Extended Unemployment Insurance over the Business Cycle: Evidence from Regression Discontinuity Estimates over 20 Years,” Quarterly Journal of Economics 127, no. 2 (2012): 701–752
Card, D., Heining, J., and Kline, P., “Workplace Heterogeneity and the Rise of West German Wage Inequality,” Quarterly Journal of Economics 128, no. 3 (2013): 967–1015
Tourangeau, R. and Yan, T., “Sensitive Questions in Surveys,” Psychological Bulletin 133, no. 5 (2007): 859–883
Brown, V. R. and Vaughn, E. D., “The Writing on the (Facebook) Wall: The Use of Social Networking Sites in Hiring Decisions,” Journal of Business and Psychology 26, no. 2 (2011): 219–225
Karl, K., Peluchette, J., and Schlaegel, C., “Who’s Posting Facebook Faux Pas? A Cross-Cultural Examination of Personality Differences,” International Journal of Selection and Assessment 18, no. 2 (2010): 174–186
Kreuter, F., Presser, S., and Tourangeau, R., “Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity,” Public Opinion Quarterly 72, no. 5 (2008): 847–865
Couper, M. P., “Is the Sky Falling? New Technology, Changing Media, and the Future of Surveys,” Survey Research Methods 7, no. 3 (2013): 145–156
Prewitt, K., “The 2012 Morris Hansen Lecture: Thank you Morris, et al., for Westat, et al.,” Journal of Official Statistics 29, no. 2 (2013): 223–231
Yan, T. and Olson, K., “Analyzing Paradata to Investigate Measurement Error,” in Improving Surveys with Paradata: Making Use of Process Information, ed. Kreuter, F. (Hoboken, NJ: Wiley, 2013)
Mayer-Schönberger, V. and Cukier, K., Big Data: A Revolution That Will Transform How We Live, Work and Think (London: John Murray, 2013)
Lessler, J. T. and Kalsbeek, W. D., Nonsampling Error in Surveys (Hoboken, NJ: Wiley, 1992)
Bosnjak, M., Haas, I., Galesic, M., Kaczmirek, L., Bandilla, W., and Couper, M. P., “Sample Composition Discrepancies in Different Stages of a Probability-Based Online Panel,” Field Methods 25, no. 4 (2013): 339–360
Dever, J. A., Rafferty, A., and Valliant, R., “Internet Surveys: Can Statistical Adjustment Eliminate Coverage Bias?” Survey Research Methods 2, no. 2 (2008): 47–62
Couper, M. P., Kapteyn, A., Schonlau, M., and Winter, J., “Noncoverage and Nonresponse in an Internet Survey,” Social Science Research 36, no. 1 (2007): 131–148
Schonlau, M., Van Soest, A., Kapteyn, A., and Couper, M., “Selection Bias in Web Surveys and the Use of Propensity Scores,” Sociological Methods and Research 37, no. 3 (2009): 291–318
Singer, E., “Toward a Benefit-Cost Theory of Survey Participation: Evidence, Further Tests, and Implications,” Journal of Official Statistics 27, no. 2 (2011): 379–392
Zandbergen, P. A., “Accuracy of iPhone Locations: A Comparison of Assisted GPS, WiFi and Cellular Positioning,” Transactions in GIS 13, no. s1 (2009): 5–25
Kosinski, M., Stillwell, D., and Graepel, T., “Private Traits and Attributes are Predictable from Digital Records of Human Behavior,” Proceedings of the National Academy of Sciences 110, no. 15 (2013): 5802–5805
Valliant, R. and Dever, J., “Estimating Propensity Adjustments for Volunteer Web Surveys,” Sociological Methods and Research 40 (2011): 105–137
Dever, J., Rafferty, A., and Valliant, R., “Internet Surveys: Can Statistical Adjustments Eliminate Coverage Bias?” Survey Research Methods 2 (2008): 47–60
Couper, “Is the Sky Falling,” and AAPOR, “Report of the AAPOR Task Force on Non-Probability Sampling,” Journal of Survey Statistics and Methodology 1 (2013): 90–143
Massey, Douglas S. and Tourangeau, Roger, The Nonresponse Challenge to Surveys and Statistics, ANNALS of the American Academy of Political and Social Science Series 645 (Thousand Oaks, CA: Sage, 2013)
Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J., “The Use of Propensity Scores to Assess the Generalizability of Results from Randomized Trials,” Journal of the Royal Statistical Society, Series A 174, no. 2 (2011): 369–386
Cole, S. R. and Stuart, E. A., “Generalizing Evidence from Randomized Clinical Trials to Target Populations: The ACTG-320 Trial,” American Journal of Epidemiology 172 (2010): 107–115
Smith, T., “The Report of the International Workshop on Using Multi-Level Data from Sample Frames, Auxiliary Databases, Paradata and Related Sources to Detect and Adjust for Nonresponse Bias in Surveys,” International Journal of Public Opinion Research 23 (2011): 389–402
Sakshaug, J. and Kreuter, F., “Assessing the Magnitude of Non-Consent Biases in Linked Survey and Administrative Data,” Survey Research Methods 6, no. 2 (2012): 113–122
Singer, E., Hippler, H. J., and Schwarz, N., “Confidentiality Assurances in Surveys: Reassurance or Threat?” International Journal of Public Opinion Research 4, no. 3 (1992): 256–268
Bates, N., Dalhammer, J., and Singer, E., “Privacy Concerns, Too Busy, or Just Not Interested: Using Doorstep Concerns to Predict Survey Nonresponse,” Journal of Official Statistics 24, no. 4 (2008): 591–612
Couper, M. P., Singer, E., Conrad, F. G., and Groves, R. M., “Experimental Studies of Disclosure Risk, Disclosure Harm, Topic Sensitivity, and Survey Participation,” Journal of Official Statistics 26, no. 2 (2010): 287–300
Sakshaug, J., Tutz, V., and Kreuter, F., “Placement, Wording, and Interviewers: Identifying Correlates of Consent to Link Survey and Administrative Data,” Survey Research Methods 7, no. 2 (2013): 133–144
Schnell, R., “Combining Surveys with Non-Questionnaire Data: Overview and Introduction,” in Improving Survey Methods: Lessons from Recent Research, ed. Engel, U., Jann, B., Lynn, P., Scherpenzeel, A., and Sturgis, P. (New York: Psychology Press, 2014)
Eckman, S. and English, N., “Creating Housing Unit Frames from Address Databases: Geocoding Precision and Net Coverage Rates,” Field Methods 24, no. 4 (2012): 399–408
Groves, R., “Designed Data” and “Organic Data,” Director’s Blog, (accessed January 20, 2014)
Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M., “Controlled Experiments on the Web: Survey and Practical Guide,” Data Mining and Knowledge Discovery 18 (2009): 140–181