Data, Tools, and Privacy

Ryen W. White

doi:10.1017/CBO9781139525305.016

Many of the methods covered in this book depend on the availability of data on how people interact with search systems. It is therefore important to discuss how searcher data are collected, and what data are available for research purposes. A central concern in mining, analyzing, and applying these data is searcher privacy, which permeates all aspects of collection and use – from obtaining searchers' consent to collect the data at the outset, to de-identification, aggregation, and restrictions on sharing and applying the data (Horvitz and Mulligan, 2015). The collection of such interaction data is standard practice for large commercial entities, such as Web search engines, which use the data to understand how people interact with their services and to improve the user experience. Because of privacy concerns, once the data are collected, they are usually not shareable with external parties. Efforts to release data (e.g., by America Online in 2006) have led to serious privacy breaches associated with a failure to completely anonymize the dataset. Events such as this make future broad data releases unlikely. Limited releases under license to researchers and the extreme anonymization of datasets have been used as strategies to address privacy challenges and promote research into behavioral analysis and user modeling.

In this chapter, I discuss the need for shared resources (e.g., datasets), tools (e.g., logging support), and infrastructure that are necessary to build and evaluate competitive search systems. These pillars are important when comparing or coordinating the performance of interactive search systems across multiple experimental sites. In the context of the TREC Interactive Track, in which a single search system was used as a baseline at all sites, Lagergren and Over (1998) described an experimental design for cross-site comparisons of experimental results (i.e., a matrix design to which participating sites must strictly adhere). The design addressed issues such as two-way interactions and effects specific to how the experiment was conducted at a particular site. It involved significant coordination effort and was still focused on comparing systems. Important alternative goals include advancing our understanding of search behavior, improving the design of systems to support searching, and facilitating comparability between laboratory studies performed at different sites.

12 - Data, Tools, and Privacy
