References

Published online by Cambridge University Press:  13 March 2020

Ron Kohavi
Affiliation:
Microsoft
Diane Tang
Affiliation:
Google
Ya Xu
Affiliation:
LinkedIn
Type
Chapter
Information
Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, pp. 246–265
Publisher: Cambridge University Press
Print publication year: 2020

References

Abadi, Martin, Chu, Andy, Goodfellow, Ian, McMahan, H. Brendan, Mironov, Ilya, Talwar, Kunal, and Zhang, Li. 2016. “Deep Learning with Differential Privacy.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.
Abrahamse, Peter. 2016. “How 8 Different A/B Testing Tools Affect Site Speed.” CXL: All Things Data-Driven Marketing. May 16. https://conversionxl.com/blog/testing-tools-site-speed/.
ACM. 2018. ACM Code of Ethics and Professional Conduct. June 22. www.acm.org/code-of-ethics.
Alvarez, Cindy. 2017. Lean Customer Development: Building Products Your Customers Will Buy. O’Reilly.
Angrist, Joshua D., and Pischke, Jörn-Steffen. 2014. Mastering ‘Metrics: The Path from Cause to Effect. Princeton University Press.
Angrist, Joshua D., and Pischke, Jörn-Steffen. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Apple, Inc. 2017. “Phased Release for Automatic Updates Now Available.” June 5. https://developer.apple.com/app-store-connect/whats-new/?id=31070842.
Apple, Inc. 2018. “Use Low Power Mode to Save Battery Life on Your iPhone.” Apple. September 25. https://support.apple.com/en-us/HT205234.
Athey, Susan, and Imbens, Guido. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” PNAS: Proceedings of the National Academy of Sciences. 7353–7360. doi: https://doi.org/10.1073/pnas.1510489113.
Azevedo, Eduardo M., Deng, Alex, Olea, Jose Montiel, Rao, Justin M., and Weyl, E. Glen. 2019. “A/B Testing with Fat Tails.” February 26. Available at SSRN: https://ssrn.com/abstract=3171224 or http://dx.doi.org/10.2139/ssrn.3171224.
Backstrom, Lars, and Kleinberg, Jon. 2011. “Network Bucket Testing.” WWW ’11 Proceedings of the 20th International Conference on World Wide Web. Hyderabad, India: ACM. 615–624.
Bailar, John C. 1983. “Introduction.” In Clinical Trials: Issues and Approaches, by Shapiro, Stuart and Louis, Thomas. Marcel Dekker.
Bakshy, Eytan, Balandat, Max, and Kashin, Kostya. 2019. “Open-sourcing Ax and BoTorch: New AI tools for adaptive experimentation.” Facebook Artificial Intelligence. May 1. https://ai.facebook.com/blog/open-sourcing-ax-and-botorch-new-ai-tools-for-adaptive-experimentation/.
Bakshy, Eytan, and Frachtenberg, Eitan. 2015. “Design and Analysis of Benchmarking Experiments for Distributed Internet Services.” WWW ’15: Proceedings of the 24th International Conference on World Wide Web. Florence, Italy: ACM. 108–118. doi: https://doi.org/10.1145/2736277.2741082.
Bakshy, Eytan, Eckles, Dean, and Bernstein, Michael. 2014. “Designing and Deploying Online Field Experiments.” International World Wide Web Conference (WWW 2014). https://facebook.com//download/255785951270811/planout.pdf.
Barajas, Joel, Akella, Ram, Hotan, Marius, and Flores, Aaron. 2016. “Experimental Designs and Estimation for Online Display Advertising Attribution in Marketplaces.” Marketing Science: the Marketing Journal of the Institute for Operations Research and the Management Sciences 35: 465–483.
Barrilleaux, Bonnie, and Wang, Dylan. 2018. “Spreading the Love in the LinkedIn Feed with Creator-Side Optimization.” LinkedIn Engineering. October 16. https://engineering.linkedin.com/blog/2018/10/linkedin-feed-with-creator-side-optimization.
Basin, David, Debois, Soren, and Hildebrandt, Thomas. 2018. “On Purpose and by Necessity: Compliance under the GDPR.” Financial Cryptography and Data Security 2018. IFCA. Preproceedings 21.
Benbunan-Fich, Raquel. 2017. “The Ethics of Online Research with Unsuspecting Users: From A/B Testing to C/D Experimentation.” Research Ethics 13 (3–4): 200–218. doi: https://doi.org/10.1177/1747016116680664.
Benjamin, Daniel J., Berger, James O., Johannesson, Magnus, Nosek, Brian A., Wagenmakers, E.-J., Berk, Richard, Bollen, Kenneth A., et al. 2017. “Redefine Statistical Significance.” Nature Human Behaviour 2 (1): 6–10. https://www.nature.com/articles/s41562-017-0189-z.
Beshears, John, Choi, James J., Laibson, David, Madrian, Brigitte C., and Milkman, Katherine L. 2011. The Effect of Providing Peer Information on Retirement Savings Decisions. NBER Working Paper Series, National Bureau of Economic Research. www.nber.org/papers/w17345.
Billingsley, Patrick. 1995. Probability and Measure. Wiley.
Blake, Thomas, and Coey, Dominic. 2014. “Why Marketplace Experimentation is Harder Than it Seems: The Role of Test-Control Interference.” EC ’14 Proceedings of the Fifteenth ACM Conference on Economics and Computation. Palo Alto, CA: ACM. 567–582.
Blank, Steven Gary. 2005. The Four Steps to the Epiphany: Successful Strategies for Products that Win. Cafepress.com.
Blocker, Craig, Conway, John, Demortier, Luc, Heinrich, Joel, Junk, Tom, Lyons, Louis, and Punzi, Giovanni. 2006. “Simple Facts about P-Values.” The Rockefeller University. January 5. http://physics.rockefeller.edu/luc/technical_reports/cdf8023_facts_about_p_values.pdf.
Bodlewski, Mike. 2017. “When Slower UX is Better UX.” Web Designer Depot. Sep 25. https://www.webdesignerdepot.com/2017/09/when-slower-ux-is-better-ux/.
Bojinov, Iavor, and Shephard, Neil. 2017. “Time Series Experiments and Causal Estimands: Exact Randomization Tests and Trading.” arXiv of Cornell University. July 18. arXiv:1706.07840.
Borden, Peter. 2014. “How Optimizely (Almost) Got Me Fired.” The SumAll Blog: Where E-commerce and Social Media Meet. June 18. https://blog.sumall.com/journal/optimizely-got-me-fired.html.
Bowman, Douglas. 2009. “Goodbye, Google.” stopdesign. March 20. https://stopdesign.com/archive/2009/03/20/goodbye-google.html.
Box, George E.P., Hunter, J. Stuart, and Hunter, William G. 2005. Statistics for Experimenters: Design, Innovation, and Discovery. 2nd edition. John Wiley & Sons, Inc.
Bell, Brooks. 2015. “Click Summit 2015 Keynote Presentation.” Brooks Bell. www.brooksbell.com/wp-content/uploads/2015/05/BrooksBell_ClickSummit15_Keynote1.pdf.
Brown, Morton B. 1975. “A Method for Combining Non-Independent, One-Sided Tests of Significance.” Biometrics 31 (4): 987–992. www.jstor.org/stable/2529826.
Brutlag, Jake, Abrams, Zoe, and Meenan, Pat. 2011. “Above the Fold Time: Measuring Web Page Performance Visually.” Velocity: Web Performance and Operations Conference.
Buhrmester, Michael, Kwang, Tracy, and Gosling, Samuel. 2011. “Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality Data?” Perspectives on Psychological Science, Feb 3.
Campbell, Donald T. 1979. “Assessing the Impact of Planned Social Change.” Evaluation and Program Planning 2: 67–90. https://doi.org/10.1016/0149-7189(79)90048-X.
Card, David, and Krueger, Alan B. 1994. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” The American Economic Review 84 (4): 772–793. https://www.jstor.org/stable/2118030.
Casella, George, and Berger, Roger L. 2001. Statistical Inference. 2nd edition. Cengage Learning.
CDC. 2015. The Tuskegee Timeline. December. https://www.cdc.gov/tuskegee/timeline.htm.
Chamandy, Nicholas. 2016. “Experimentation in a Ridesharing Marketplace.” Lyft Engineering. September 2. https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e.
Chan, David, Ge, Rong, Gershony, Ori, Hesterberg, Tim, and Lambert, Diane. 2010. “Evaluating Online Ad Campaigns in a Pipeline: Causal Models at Scale.” Proceedings of ACM SIGKDD.
Chapelle, Olivier, Joachims, Thorsten, Radlinski, Filip, and Yue, Yisong. 2012. “Large-Scale Validation and Analysis of Interleaved Search Evaluation.” ACM Transactions on Information Systems, February.
Chaplin, Charlie. 1964. My Autobiography. Simon & Schuster.
Reichardt, Charles S., and Mark, Melvin M. 2004. “Quasi Experimentation.” In Handbook of Practical Program Evaluation, by Wholey, Joseph S., Hatry, Harry P. and Newcomer, Kathryn E. Jossey-Bass.
Chatham, Bob, Temkin, Bruce D., and Amato, Michelle. 2004. A Primer on A/B Testing. Forrester Research.
Chen, Nanyu, Liu, Min, and Xu, Ya. 2019. “How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments.” WSDM ’19 Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. Melbourne, VIC, Australia: ACM. 501–509. https://dl.acm.org/citation.cfm?id=3291000.
Chrystal, K. Alec, and Mizen, Paul D. 2001. Goodhart’s Law: Its Origins, Meaning and Implications for Monetary Policy. Prepared for the Festschrift in honor of Charles Goodhart held on 15–16 November 2001 at the Bank of England. http://cyberlibris.typepad.com/blog/files/Goodharts_Law.pdf.
Coey, Dominic, and Cunningham, Tom. 2019. “Improving Treatment Effect Estimators Through Experiment Splitting.” WWW ’19: The Web Conference. San Francisco, CA, USA: ACM. 285–295. doi:https://dl.acm.org/citation.cfm?doid=3308558.3313452.
Collis, David. 2016. “Lean Strategy.” Harvard Business Review 62–68. https://hbr.org/2016/03/lean-strategy.
Concato, John, Shah, Nirav, and Horwitz, Ralph I. 2000. “Randomized, Controlled Trials, Observational Studies, and the Hierarchy of Research Designs.” The New England Journal of Medicine 342 (25): 1887–1892. doi:https://www.nejm.org/doi/10.1056/NEJM200006223422507.
Cox, David Roxbee. 1958. Planning of Experiments. New York: John Wiley.
Croll, Alistair, and Yoskovitz, Benjamin. 2013. Lean Analytics: Use Data to Build a Better Startup Faster. O’Reilly Media.
Crook, Thomas, Frasca, Brian, Kohavi, Ron, and Longbotham, Roger. 2009. “Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.” KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 1105–1114.
Cross, Robert G., and Dixit, Ashutosh. 2005. “Customer-centric Pricing: The Surprising Secret for Profitability.” Business Horizons, 488.
Deb, Anirban, Bhattacharya, Suman, Gu, Jeremey, Zhuo, Tianxia, Feng, Eva, and Liu, Mandie. 2018. “Under the Hood of Uber’s Experimentation Platform.” Uber Engineering. August 28. https://eng.uber.com/xp.
Deng, Alex. 2015. “Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments.” Florence, IT: ACM. 923–928.
Deng, Alex, and Hu, Victor. 2015. “Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments.” WSDM ’15: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. Shanghai, China: ACM. 349–358. doi:https://doi.org/10.1145/2684822.2685307.
Deng, Alex, Lu, Jiannan, and Chen, Shouyuan. 2016. “Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing.” 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). Montreal, QC, Canada: IEEE. doi:https://doi.org/10.1109/DSAA.2016.33.
Deng, Alex, Knoblich, Ulf, and Lu, Jiannan. 2018. “Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas.” 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Deng, Alex, Lu, Jiannan, and Litz, Jonathan. 2017. “Trustworthy Analysis of Online A/B Tests: Pitfalls, Challenges and Solutions.” WSDM: The Tenth International Conference on Web Search and Data Mining. Cambridge, UK.
Deng, Alex, Xu, Ya, Kohavi, Ron, and Walker, Toby. 2013. “Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data.” WSDM 2013: Sixth ACM International Conference on Web Search and Data Mining.
Deng, Shaojie, Longbotham, Roger, Walker, Toby, and Xu, Ya. 2011. “Choice of Randomization Unit in Online Controlled Experiments.” Joint Statistical Meetings Proceedings. 4866–4877.
Denrell, Jerker. 2005. “Selection Bias and the Perils of Benchmarking.” Harvard Business Review 83 (4): 114–119.
Dickhaus, Thorsten. 2014. Simultaneous Statistical Inference: With Applications in the Life Sciences. Springer. https://www.springer.com/cda/content/document/cda_downloaddocument/9783642451812-c2.pdf.
Dickson, Paul. 1999. The Official Rules and Explanations: The Original Guide to Surviving the Electronic Age With Wit, Wisdom, and Laughter. Federal Street Pr.
Djulbegovic, Benjamin, and Hozo, Iztok. 2002. “At What Degree of Belief in a Research Hypothesis Is a Trial in Humans Justified?” Journal of Evaluation in Clinical Practice, June 13.
Dmitriev, Pavel, and Wu, Xian. 2016. “Measuring Metrics.” CIKM: Conference on Information and Knowledge Management. Indianapolis, IN. http://bit.ly/measuringMetrics.
Dmitriev, Pavel, Gupta, Somit, Kim, Dong Woo, and Vaz, Garnet. 2017. “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments.” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2017). Halifax, NS, Canada: ACM. 1427–1436. http://doi.acm.org/10.1145/3097983.3098024.
Dmitriev, Pavel, Frasca, Brian, Gupta, Somit, Kohavi, Ron, and Vaz, Garnet. 2016. “Pitfalls of Long-Term Online Controlled Experiments.” 2016 IEEE International Conference on Big Data (Big Data). Washington DC. 1367–1376. http://bit.ly/expLongTerm.
Doerr, John. 2018. Measure What Matters: How Google, Bono, and the Gates Foundation Rock the World with OKRs. Portfolio.
Doll, Richard. 1998. “Controlled Trials: the 1948 Watershed.” BMJ. doi:https://doi.org/10.1136/bmj.317.7167.1217.
Dutta, Kaushik, and Vandermeer, Debra. 2018. “Caching to Reduce Mobile App Energy Consumption.” ACM Transactions on the Web (TWEB), February 12(1): Article No. 5.
Dwork, Cynthia, and Roth, Aaron. 2014. “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends in Computer Science 211–407.
Eckles, Dean, Karrer, Brian, and Ugander, Johan. 2017. “Design and Analysis of Experiments in Networks: Reducing Bias from Interference.” Journal of Causal Inference 5(1). www.deaneckles.com/misc/Eckles_Karrer_Ugander_Reducing_Bias_from_Interference.pdf.
Edgington, Eugene S. 1972. “An Additive Method for Combining Probability Values from Independent Experiments.” The Journal of Psychology 80 (2): 351–363.
Edmonds, Andy, White, Ryan W., Morris, Dan, and Drucker, Steven M. 2007. “Instrumenting the Dynamic Web.” Journal of Web Engineering. (3): 244–260. www.microsoft.com/en-us/research/wp-content/uploads/2016/02/edmondsjwe2007.pdf.
Efron, Bradley, and Tibshirani, Robert J. 1994. An Introduction to the Bootstrap. Chapman & Hall/CRC.
EGAP. 2018. “10 Things to Know About Heterogeneous Treatment Effects.” EGAP: Evidence in Government and Politics. egap.org/methods-guides/10-things-heterogeneous-treatment-effects.
Ehrenberg, A.S.C. 1975. “The Teaching of Statistics: Corrections and Comments.” Journal of the Royal Statistical Society. Series A 138 (4): 543–545. https://www.jstor.org/stable/2345216.
Eisenberg, Bryan. 2005. “How to Improve A/B Testing.” ClickZ Network. April 29. www.clickz.com/clickz/column/1717234/how-improve-a-b-testing.
Eisenberg, Bryan. 2004. A/B Testing for the Mathematically Disinclined. May 7. http://www.clickz.com/showPage.html?page=3349901.
Eisenberg, Bryan, and Quarto-vonTivadar, John. 2008. Always Be Testing: The Complete Guide to Google Website Optimizer. Sybex.
eMarketer. 2016. “Microsoft Ad Revenues Continue to Rebound.” April 20. https://www.emarketer.com/Article/Microsoft-Ad-Revenues-Continue-Rebound/1013854.
European Commission. 2016. EU GDPR.ORG. https://eugdpr.org/.
Fabijan, Aleksander, Dmitriev, Pavel, Olsson, Helena Holmstrom, and Bosch, Jan. 2018. “Online Controlled Experimentation at Scale: An Empirical Survey on the Current State of A/B Testing.” Euromicro Conference on Software Engineering and Advanced Applications (SEAA). Prague, Czechia. doi:10.1109/SEAA.2018.00021.
Fabijan, Aleksander, Dmitriev, Pavel, Olsson, Helena Holmstrom, and Bosch, Jan. 2017. “The Evolution of Continuous Experimentation in Software Product Development: from Data to a Data-Driven Organization at Scale.” ICSE ’17 Proceedings of the 39th International Conference on Software Engineering. Buenos Aires, Argentina: IEEE Press. 770–780. doi:https://doi.org/10.1109/ICSE.2017.76.
Fabijan, Aleksander, Gupchup, Jayant, Gupta, Somit, Omhover, Jeff, Qin, Wen, Vermeer, Lukas, and Dmitriev, Pavel. 2019. “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners.” KDD ’19: The 25th SIGKDD International Conference on Knowledge Discovery and Data Mining. Anchorage, Alaska, USA: ACM.
Fabijan, Aleksander, Dmitriev, Pavel, McFarland, Colin, Vermeer, Lukas, Olsson, Helena Holmström, and Bosch, Jan. 2018. “Experimentation Growth: Evolving Trustworthy A/B Testing Capabilities in Online Software Companies.” Journal of Software: Evolution and Process 30 (12:e2113). doi:https://doi.org/10.1002/smr.2113.
FAT/ML. 2019. Fairness, Accountability, and Transparency in Machine Learning. http://www.fatml.org/.
Fisher, Ronald Aylmer. 1925. Statistical Methods for Research Workers. Oliver and Boyd. http://psychclassics.yorku.ca/Fisher/Methods/.
Forte, Michael. 2019. “Misadventures in experiments for growth.” The Unofficial Google Data Science Blog. April 16. www.unofficialgoogledatascience.com/2019/04/misadventures-in-experiments-for-growth.html.
Freedman, Benjamin. 1987. “Equipoise and the Ethics of Clinical Research.” The New England Journal of Medicine 317 (3): 141–145. doi:https://www.nejm.org/doi/full/10.1056/NEJM198707163170304.
Gelman, Andrew, and Carlin, John. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6): 641–651. doi:10.1177/1745691614551642.
Gelman, Andrew, and Little, Thomas C. 1997. “Poststratification into Many Categories Using Hierarchical Logistic Regression.” Survey Methodology 23 (2): 127–135. www150.statcan.gc.ca/n1/en/pub/12-001-x/1997002/article/3616-eng.pdf.
Georgiev, Georgi Zdravkov. 2019. Statistical Methods in Online A/B Testing: Statistics for Data-Driven Business Decisions and Risk Management in e-Commerce. Independently published. www.abtestingstats.com.
Georgiev, Georgi Zdravkov. 2018. “Analysis of 115 A/B Tests: Average Lift is 4%, Most Lack Statistical Power.” Analytics Toolkit. June 26. http://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-tests-average-lift-statistical-power/.
Gerber, Alan S., and Green, Donald P. 2012. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton & Company. https://www.amazon.com/Field-Experiments-Design-Analysis-Interpretation/dp/0393979954.
Goldratt, Eliyahu M. 1990. The Haystack Syndrome. North River Press.
Goldstein, Noah J., Martin, Steve J., and Cialdini, Robert B. 2008. Yes!: 50 Scientifically Proven Ways to Be Persuasive. Free Press.
Goodhart, Charles A. E. 1975. Problems of Monetary Management: The UK Experience. Vol. 1, in Papers in Monetary Economics, by Reserve Bank of Australia.
Goodman, Steven. 2008. “A Dirty Dozen: Twelve P-Value Misconceptions.” Seminars in Hematology. doi:https://doi.org/10.1053/j.seminhematol.2008.04.003.
Google. 2019. Processing Logs at Scale Using Cloud Dataflow. March 19. https://cloud.google.com/solutions/processing-logs-at-scale-using-dataflow.
Google. 2011. “Ads Quality Improvements Rolling Out Globally.” Google Inside AdWords. October 3. https://adwords.googleblog.com/2011/10/ads-quality-improvements-rolling-out.html.
Google Console. 2019. “Release App Updates with Staged Rollouts.” Google Console Help. https://support.google.com/googleplay/android-developer/answer/6346149?hl=en.
Google. 2019. “Helping Advertisers Comply with the GDPR.” Google Ads Help. https://support.google.com/google-ads/answer/9028179?hl=en.
Gordon, Brett R., Zettelmeyer, Florian, Bhargava, Neha, and Chapsky, Dan. 2018. “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook (forthcoming at Marketing Science).” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3033144.
Goward, Chris. 2015. “Delivering Profitable ‘A-ha!’ Moments Everyday.” Conversion Hotel. Texel, The Netherlands. www.slideshare.net/webanalisten/chris-goward-strategy-conversion-hotel-2015.
Goward, Chris. 2012. You Should Test That: Conversion Optimization for More Leads, Sales and Profit or The Art and Science of Optimized Marketing. Sybex.
Greenhalgh, Trisha. 2014. How to Read a Paper: The Basics of Evidence-Based Medicine. BMJ Books. https://www.amazon.com/gp/product/B00IPG7GLC.
Greenhalgh, Trisha. 1997. “How to Read a Paper: Getting Your Bearings (deciding what the paper is about).” BMJ 315 (7102): 243–246. doi:10.1136/bmj.315.7102.243.
Greenland, Sander, Senn, Stephen J., Rothman, Kenneth J., Carlin, John B., Poole, Charles, Goodman, Steven N., and Altman, Douglas G. 2016. “Statistical Tests, P Values, Confidence Intervals, and Power: a Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–350. https://dx.doi.org/10.1007%2Fs10654-016-0149-3.
Grimes, Carrie, Tang, Diane, and Russell, Daniel M. 2007. “Query Logs Alone are not Enough.” International Conference of the World Wide Web, May.
Grove, Andrew S. 1995. High Output Management. 2nd edition. Vintage.
Groves, Robert M., Fowler, Floyd J. Jr, Couper, Mick P., Lepkowski, James M., Singer, Eleanor, and Tourangeau, Roger. 2009. Survey Methodology, 2nd edition. Wiley.
Gui, Han, Xu, Ya, Bhasin, Anmol, and Han, Jiawei. 2015. “Network A/B Testing From Sampling to Estimation.” WWW ’15 Proceedings of the 24th International Conference on World Wide Web. Florence, IT: ACM. 399–409.
Gupta, Somit, Ulanova, Lucy, Bhardwaj, Sumit, Dmitriev, Pavel, Raff, Paul, and Fabijan, Aleksander. 2018. “The Anatomy of a Large-Scale Online Experimentation Platform.” IEEE International Conference on Software Architecture.
Gupta, Somit, Kohavi, Ronny, Tang, Diane, Xu, Ya, et al. 2019. “Top Challenges from the first Practical Online Controlled Experiments Summit.” Edited by Dong, Xin Luna, Teredesai, Ankur and Zafarani, Reza. SIGKDD Explorations (ACM) 21 (1). https://bit.ly/OCESummit1.
Guyatt, Gordon H., Sackett, David L., Sinclair, John C., Hayward, Robert, Cook, Deborah J., and Cook, Richard J. 1995. “Users’ Guides to the Medical Literature: IX. A method for Grading Health Care Recommendations.” Journal of the American Medical Association (JAMA) 274 (22): 1800–1804. doi:https://doi.org/10.1001%2Fjama.1995.03530220066035.
Harden, K. Paige, Mendle, Jane, Hill, Jennifer E., Turkheimer, Eric, and Emery, Robert E. 2008. “Rethinking Timing of First Sex and Delinquency.” Journal of Youth and Adolescence 37 (4): 373–385. doi:https://doi.org/10.1007/s10964-007-9228-9.
Harford, Tim. 2014. The Undercover Economist Strikes Back: How to Run – or Ruin – an Economy. Riverhead Books.
Hauser, John R., and Katz, Gerry. 1998. “Metrics: You Are What You Measure!” European Management Journal 16 (5): 516–528. http://www.mit.edu/~hauser/Papers/metrics%20you%20are%20what%20you%20measure.pdf.
Health and Human Services. 2018a. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html.
Health and Human Services. 2018b. Health Information Privacy. https://www.hhs.gov/hipaa/index.html.
Health and Human Services. 2018c. Summary of the HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html.
Hedges, Larry, and Olkin, Ingram. 2014. Statistical Methods for Meta-Analysis. Academic Press.
Hemkens, Lars, Contopoulos-Ioannidis, Despina, and Ioannidis, John. 2016. “Routinely Collected Data and Comparative Effectiveness Evidence: Promises and Limitations.” CMAJ, May 17.
HIPAA Journal. 2018. What is Considered Protected Health Information Under HIPAA. April 2. https://www.hipaajournal.com/what-is-considered-protected-health-information-under-hipaa/.
Hochberg, Yosef, and Benjamini, Yoav. 1995. “Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B 57 (1): 289–300.
Hodge, Victoria, and Austin, Jim. 2004. “A Survey of Outlier Detection Methodologies.” Journal of Artificial Intelligence Review. 85–126.
Hohnhold, Henning, O’Brien, Deirdre, and Tang, Diane. 2015. “Focus on the Long-Term: It’s better for Users and Business.” Proceedings 21st Conference on Knowledge Discovery and Data Mining (KDD 2015). Sydney, Australia: ACM. http://dl.acm.org/citation.cfm?doid=2783258.2788583.
Holson, Laura M. 2009. “Putting a Bolder Face on Google.” NY Times. February 28. https://www.nytimes.com/2009/03/01/business/01marissa.html.
Holtz, David Michael. 2018. “Limiting Bias from Test-Control Interference In Online Marketplace Experiments.” DSpace@MIT. http://hdl.handle.net/1721.1/117999.
Hoover, Kevin D. 2008. “Phillips Curve.” In Henderson, R. David, Concise Encyclopedia of Economics. http://www.econlib.org/library/Enc/PhillipsCurve.html.
Huang, Jason, Reiley, David, and Raibov, Nickolai M. 2018. “Measuring Consumer Sensitivity to Audio Advertising: A Field Experiment on Pandora Internet Radio.” David Reiley, Jr. April 21. http://davidreiley.com/papers/PandoraListenerDemandCurve.pdf.
Huang, Jeff, White, Ryen W., and Dumais, Susan. 2012. “No Clicks, No Problem: Using Cursor Movements to Understand and Improve Search.” Proceedings of SIGCHI.
Huang, Yanping, You, Jane, Wang, Iris, Cao, Feng, and Gao, Ian. 2015. Data Science Interviews Exposed. CreateSpace.
Hubbard, Douglas W. 2014. How to Measure Anything: Finding the Value of Intangibles in Business. 3rd edition. Wiley.
Huffman, Scott. 2008. Search Evaluation at Google. September 15. https://googleblog.blogspot.com/2008/09/search-evaluation-at-google.html.
Imbens, Guido W., and Rubin, Donald B. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Ioannidis, John P. 2005. “Contradicted and Initially Stronger Effects in Highly Cited Clinical Research.” The Journal of the American Medical Association 294 (2).
Jackson, Simon. 2018. “How Booking.com increases the power of online experiments with CUPED.” Booking.ai. January 22. https://booking.ai/how-booking-com-increases-the-power-of-online-experiments-with-cuped-995d186fff1d.
Joachims, Thorsten, Granka, Laura, Pan, Bing, Hembrooke, Helene, and Gay, Geri. 2005. “Accurately Interpreting Clickthrough Data as Implicit Feedback.” SIGIR, August.
Johari, Ramesh, Pekelis, Leonid, Koomen, Pete, and Walsh, David. 2017. “Peeking at A/B Tests.” KDD ’17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Halifax, NS, Canada: ACM. 1517–1525. doi:https://doi.org/10.1145/3097983.3097992.
Kaplan, Robert S., and Norton, David P. 1996. The Balanced Scorecard: Translating Strategy into Action. Harvard Business School Press.
Katzir, Liran, Liberty, Edo, and Somekh, Oren. 2012. “Framework and Algorithms for Network Bucket Testing.” Proceedings of the 21st International Conference on World Wide Web 1029–1036.
Kaushik, Avinash. 2006. “Experimentation and Testing: A Primer.” Occam’s Razor. May 22. www.kaushik.net/avinash/2006/05/experimentation-and-testing-a-primer.html.
Keppel, Geoffrey, Saufley, William H., and Tokunaga, Howard. 1992. Introduction to Design and Analysis. 2nd edition. W.H. Freeman and Company.
Kesar, Alhan. 2018. 11 Ways to Stop FOOC’ing up your A/B tests. August 9. www.widerfunnel.com/stop-fooc-ab-tests/.
King, Gary, and Nielsen, Richard. 2018. Why Propensity Scores Should Not Be Used for Matching. Working paper. https://gking.harvard.edu/publications/why-propensity-scores-should-not-be-used-formatching.
King, Rochelle, Churchill, Elizabeth F., and Tan, Caitlin. 2017. Designing with Data: Improving the User Experience with A/B Testing. O’Reilly Media.
Kingston, Robert. 2015. Does Optimizely Slow Down a Site’s Performance. January 18. https://www.quora.com/Does-Optimizely-slow-down-a-sites-performance/answer/Robert-Kingston.
Knapp, Michael S., Swinnerton, Juli A., Copland, Michael A., and Monpas-Huber, Jack. 2006. Data-Informed Leadership in Education. Center for the Study of Teaching and Policy, University of Washington, Seattle, WA: Wallace Foundation. https://www.wallacefoundation.org/knowledge-center/Documents/1-Data-Informed-Leadership.pdf.
Kohavi, Ron. 2019. “HiPPO FAQ.” ExP Experimentation Platform. http://bitly.com/HIPPOExplained.
Kohavi, Ron. 2016. “Pitfalls in Online Controlled Experiments.” CODE ’16: Conference on Digital Experimentation. MIT. https://bit.ly/Code2016Kohavi.
Kohavi, Ron. 2014. “Customer Review of A/B Testing: The Most Powerful Way to Turn Clicks Into Customers.” Amazon.com. May 27. www.amazon.com/gp/customer-reviews/R44BH2HO30T18.
Kohavi, Ron. 2010. “Online Controlled Experiments: Listening to the Customers, not to the HiPPO.” Keynote at EC10: the 11th ACM Conference on Electronic Commerce. www.exp-platform.com/Documents/2010-06%20EC10.pptx.
Kohavi, Ron. 2003. Real-world Insights from Mining Retail E-Commerce Data. Stanford, CA, May 22. http://ai.stanford.edu/~ronnyk/realInsights.ppt.
Kohavi, Ron, and Longbotham, Roger. 2017. “Online Controlled Experiments and A/B Tests.” In Encyclopedia of Machine Learning and Data Mining, by Sammut, Claude and Webb, Geoffrey I. Springer. www.springer.com/us/book/9781489976857.
Kohavi, Ron, and Longbotham, Roger. 2010. “Unexpected Results in Online Controlled Experiments.” SIGKDD Explorations, December. http://bit.ly/expUnexpected.
Kohavi, Ron and Parekh, Rajesh. 2003. “Ten Supplementary Analyses to Improve E-commerce Web Sites.” WebKDD. http://ai.stanford.edu/~ronnyk/supplementaryAnalyses.pdf.
Kohavi, Ron, and Thomke, Stefan. 2017. “The Surprising Power of Online Experiments.” Harvard Business Review (September–October): 74–92. http://exp-platform.com/hbr-the-surprising-power-of-online-experiments/.
Kohavi, Ron, Crook, Thomas, and Longbotham, Roger. 2009. “Online Experimentation at Microsoft.” Third Workshop on Data Mining Case Studies and Practice Prize. http://bit.ly/expMicrosoft.
Kohavi, Ron, Longbotham, Roger, and Walker, Toby. 2010. “Online Experiments: Practical Lessons.” IEEE Computer, September: 82–85. http://bit.ly/expPracticalLessons.
Kohavi, Ron, Tang, Diane, and Xu, Ya. 2019. “History of Controlled Experiments.” Practical Guide to Trustworthy Online Controlled Experiments. https://bit.ly/experimentGuideHistory.
Kohavi, Ron, Deng, Alex, Longbotham, Roger, and Xu, Ya. 2014. “Seven Rules of Thumb for Web Site Experimenters.” Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14). http://bit.ly/expRulesOfThumb.
Kohavi, Ron, Longbotham, Roger, Sommerfield, Dan, and Henne, Randal M. 2009. “Controlled Experiments on the Web: Survey and Practical Guide.” Data Mining and Knowledge Discovery 18: 140–181. http://bit.ly/expSurvey.
Kohavi, Ron, Deng, Alex, Frasca, Brian, Longbotham, Roger, Walker, Toby, and Xu, Ya. 2012. “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained.” Proceedings of the 18th Conference on Knowledge Discovery and Data Mining. http://bit.ly/expPuzzling.
Kohavi, Ron, Deng, Alex, Frasca, Brian, Walker, Toby, Xu, Ya, and Pohlmann, Nils. 2013. “Online Controlled Experiments at Large Scale.” KDD 2013: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Kohavi, Ron, Messner, David, Eliot, Seth, Ferres, Juan Lavista, Henne, Randy, Kannappan, Vignesh, and Wang, Justin. 2010. “Tracking Users’ Clicks and Submits: Tradeoffs between User Experience and Data Loss.” Experimentation Platform. September 28. www.exp-platform.com/Documents/TrackingUserClicksSubmits.pdf.
Kramer, Adam, Guillory, Jamie, and Hancock, Jeffrey. 2014. “Experimental evidence of massive-scale emotional contagion through social networks.” PNAS, June 17.
Kuhn, Thomas. 1996. The Structure of Scientific Revolutions. 3rd edition. University of Chicago Press.
Laja, Peep. 2019. “How to Avoid a Website Redesign FAIL.” CXL. March 8. https://conversionxl.com/show/avoid-redesign-fail/.
Lax, Jeffrey R., and Phillips, Justin H. 2009. “How Should We Estimate Public Opinion in The States?” American Journal of Political Science 53 (1): 107–121. www.columbia.edu/~jhp2121/publications/HowShouldWeEstimateOpinion.pdf.
Lee, Jess. 2013. Fake Door. April 10. www.jessyoko.com/blog/2013/04/10/fake-doors/.
Lee, Minyong R, and Shen, Milan. 2018. “Winner’s Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments.” KDD 2018: The 24th ACM Conference on Knowledge Discovery and Data Mining. London: ACM.
Lehmann, Erich L., and Romano, Joseph P. 2005. Testing Statistical Hypotheses. Springer.
Levy, Steven. 2014. “Why The New Obamacare Website is Going to Work This Time.” www.wired.com/2014/06/healthcare-gov-revamp/.
Lewis, Randall A, Rao, Justin M, and Reiley, David. 2011. “Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising.” Proceedings of the 20th ACM International World Wide Web Conference (WWW). 157–166. https://ssrn.com/abstract=2080235.
Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. 2010. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” WWW 2010: Proceedings of the 19th International Conference on World Wide Web. Raleigh, North Carolina. https://arxiv.org/pdf/1003.0146.pdf.
Linden, Greg. 2006. Early Amazon: Shopping Cart Recommendations. April 25. http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html.
Linden, Greg. 2006. “Marissa Mayer at Web 2.0.” Geeking with Greg. November 9. http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html.
Linowski, Jakub. 2018a. Good UI: Learn from What We Try and Test. https://goodui.org/.
Linowski, Jakub. 2018b. No Coupon. https://goodui.org/patterns/1/.
Liu, Min, Sun, Xiaohui, Varshney, Maneesh, and Xu, Ya. 2018. “Large-Scale Online Experimentation with Quantile Metrics.” Joint Statistical Meeting, Statistical Consulting Section. Alexandria, VA: American Statistical Association. 2849–2860.
Loukides, Michael, Mason, Hilary, and Patil, D.J. 2018. Ethics and Data Science. O’Reilly Media.
Lu, Luo, and Liu, Chuang. 2014. “Separation Strategies for Three Pitfalls in A/B Testing.” KDD User Engagement Optimization Workshop. New York. www.ueo-workshop.com/wp-content/uploads/2014/04/Separation-strategies-for-three-pitfalls-in-AB-testing_withacknowledgments.pdf.
Lucas, Robert E. 1976. Econometric Policy Evaluation: A Critique. Vol. 1. In The Phillips Curve and Labor Markets, by Brunner, K. and Meltzer, A., 19–46. Carnegie-Rochester Conference on Public Policy.
Malinas, Gary, and Bigelow, John. 2004. “Simpson’s Paradox.” Stanford Encyclopedia of Philosophy. February 2. http://plato.stanford.edu/entries/paradox-simpson/.
Manzi, Jim. 2012. Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society. Basic Books.
Marks, Harry M. 1997. The Progress of Experiment: Science and Therapeutic Reform in the United States, 1900–1990. Cambridge University Press.
Marsden, Peter V., and Wright, James D. 2010. Handbook of Survey Research, 2nd Edition. Emerald Publishing Group Limited.
Marsh, Catherine, and Elliott, Jane. 2009. Exploring Data: An Introduction to Data Analysis for Social Scientists. 2nd edition. Polity.
Martin, Robert C. 2008. Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall.
Mason, Robert L., Gunst, Richard F., and Hess, James L. 1989. Statistical Design and Analysis of Experiments With Applications to Engineering and Science. John Wiley & Sons.
McChesney, Chris, Covey, Sean, and Huling, Jim. 2012. The 4 Disciplines of Execution: Achieving Your Wildly Important Goals. Free Press.
McClure, Dave. 2007. Startup Metrics for Pirates: AARRR!!! August 8. www.slideshare.net/dmc500hats/startup-metrics-for-pirates-long-version.
McCrary, Justin. 2008. “Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test.” Journal of Econometrics (142): 698–714.
McCullagh, Declan. 2006. AOL’s Disturbing Glimpse into Users’ Lives. August 9. www.cnet.com/news/aols-disturbing-glimpse-into-users-lives/.
McFarland, Colin. 2012. Experiment!: Website Conversion Rate Optimization with A/B and Multivariate Testing. New Riders.
McGue, Matt. 2014. Introduction to Human Behavioral Genetics, Unit 2: Twins: A Natural Experiment. Coursera. https://www.coursera.org/learn/behavioralgenetics/lecture/u8Zgt/2a-twins-a-natural-experiment.
McKinley, Dan. 2013. Testing to Cull the Living Flower. January. http://mcfunley.com/testing-to-cull-the-living-flower.
McKinley, Dan. 2012. Design for Continuous Experimentation: Talk and Slides. December 22. http://mcfunley.com/design-for-continuous-experimentation.
Mechanical Turk. 2019. Amazon Mechanical Turk. http://www.mturk.com.
Meenan, Patrick, Feng, Chao (Ray), and Petrovich, Mike. 2013. “Going Beyond Onload – How Fast Does It Feel?” Velocity: Web Performance and Operations conference, October 14–16. http://velocityconf.com/velocityny2013/public/schedule/detail/31344.
Meyer, Michelle N. 2018. “Ethical Considerations When Companies Study – and Fail to Study – Their Customers.” In The Cambridge Handbook of Consumer Privacy, by Selinger, Evan, Polonetsky, Jules and Tene, Omer. Cambridge University Press.
Meyer, Michelle N. 2015. “Two Cheers for Corporate Experimentation: The A/B Illusion and the Virtues of Data-Driven Innovation.” 13 Colo. Tech. L.J. 273. https://ssrn.com/abstract=2605132.
Meyer, Michelle N. 2012. Regulating the Production of Knowledge: Research Risk–Benefit Analysis and the Heterogeneity Problem. 65 Administrative Law Review 237; Harvard Public Law Working Paper. doi:http://dx.doi.org/10.2139/ssrn.2138624.
Meyer, Michelle N., Heck, Patrick R., Holtzman, Geoffrey S., Anderson, Stephen M., Cai, William, Watts, Duncan J., and Chabris, Christopher F. 2019. “Objecting to Experiments that Compare Two Unobjectionable Policies or Treatments.” PNAS: Proceedings of the National Academy of Sciences (National Academy of Sciences). doi:https://doi.org/10.1073/pnas.1820701116.
Milgram, Stanley. 2009. Obedience to Authority: An Experimental View. Harper Perennial Modern Thought.
Mitchell, Carl, Litz, Jonathan, Vaz, Garnet, and Drake, Andy. 2018. “Metrics Health Detection and AA Simulator.” Microsoft ExP (internal). August 13. https://aka.ms/exp/wiki/AASimulator.
Moran, Mike. 2008. Multivariate Testing in Action: Quicken Loan’s Regis Hadiaris on multivariate testing. December. www.biznology.com/2008/12/multivariate_testing_in_action/.
Moran, Mike. 2007. Do It Wrong Quickly: How the Web Changes the Old Marketing Rules. IBM Press.
Mosteller, Frederick, Gilbert, John P., and McPeek, Bucknam. 1983. “Controversies in Design and Analysis of Clinical Trials.” In Clinical Trials, by Shapiro, Stanley H. and Louis, Thomas A. New York, NY: Marcel Dekker, Inc.
MR Web. 2014. “Obituary: Audience Measurement Veteran Tony Twyman.” Daily Research News Online. November 12. www.mrweb.com/drno/news20011.htm.
Mudholkar, Govind S., and George, E. Olusegun. 1979. “The Logit Method for Combining Probabilities.” Edited by Rustagi, J. Symposium on Optimizing Methods in Statistics. Academic Press. 345–366. https://apps.dtic.mil/dtic/tr/fulltext/u2/a049993.pdf.
Mueller, Hendrik, and Sedley, Aaron. 2014. “HaTS: Large-Scale In-Product Measurement of User Attitudes & Experiences with Happiness Tracking Surveys.” OZCHI, December.
Neumann, Chris. 2017. Does Optimizely Slow Down a Site’s Performance? October 18. https://www.quora.com/Does-Optimizely-slow-down-a-sites-performance.
Newcomer, Kathryn E., Hatry, Harry P., and Wholey, Joseph S. 2015. Handbook of Practical Program Evaluation (Essential Texts for Nonprofit and Public Leadership and Management). Wiley.
Neyman, J. 1923. “On the Application of Probability Theory to Agricultural Experiments.” Statistical Science 465–472.
NSF. 2018. Frequently Asked Questions and Vignettes: Interpreting the Common Rule for the Protection of Human Subjects for Behavioral and Social Science Research. www.nsf.gov/bfa/dias/policy/hsfaqs.jsp.
Office for Human Research Protections. 1991. Federal Policy for the Protection of Human Subjects (‘Common Rule’). www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html.
Optimizely. 2018. “A/A Testing.” Optimizely. www.optimizely.com/optimization-glossary/aa-testing/.
Optimizely. 2018. “Implement the One-Line Snippet for Optimizely X.” Optimizely. February 28. https://help.optimizely.com/Set_Up_Optimizely/Implement_the_one-line_snippet_for_Optimizely_X.
Optimizely. 2018. Optimizely Maturity Model. www.optimizely.com/maturity-model/.
Orlin, Ben. 2016. Why Not to Trust Statistics. July 13. https://mathwithbaddrawings.com/2016/07/13/why-not-to-trust-statistics/.
Owen, Art, and Varian, Hal. 2018. Optimizing the Tie-Breaker Regression Discontinuity Design. August. http://statweb.stanford.edu/~owen/reports/tiebreaker.pdf.
Oxford Centre for Evidence-based Medicine. 2009. Levels of Evidence. March. www.cebm.net/oxford-centre-evidence-based-medicine-levels-evidence-march-2009/.
Park, David K., Gelman, Andrew, and Bafumi, Joseph. 2004. “Bayesian Multilevel Estimation with Poststratification: State-Level Estimates from National Polls.” Political Analysis 375–385.
Parmenter, David. 2015. Key Performance Indicators: Developing, Implementing, and Using Winning KPIs. 3rd edition. John Wiley & Sons, Inc.
Pearl, Judea. 2009. Causality: Models, Reasoning and Inference. 2nd edition. Cambridge University Press.
Pekelis, Leonid. 2015. “Statistics for the Internet Age: The Story behind Optimizely’s New Stats Engine.” Optimizely. January 20. https://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-the-story-behind-optimizelys-new-stats-engine/.
Pekelis, Leonid, Walsh, David, and Johari, Ramesh. 2015. “The New Stats Engine.” Optimizely. www.optimizely.com/resources/stats-engine-whitepaper/.
Peterson, Eric T. 2005. Web Site Measurement Hacks. O’Reilly Media.
Peterson, Eric T. 2004. Web Analytics Demystified: A Marketer’s Guide to Understanding How Your Web Site Affects Your Business. Celilo Group Media and CafePress.
Pfeffer, Jeffrey, and Sutton, Robert I. 1999. The Knowing-Doing Gap: How Smart Companies Turn Knowledge into Action. Harvard Business Review Press.
Phillips, A. W. 1958. “The Relation between Unemployment and the Rate of Change of Money Wage Rates in the United Kingdom, 1861–1957.” Economica, New Series 25 (100): 283–299. www.jstor.org/stable/2550759.
Porter, Michael E. 1998. Competitive Strategy: Techniques for Analyzing Industries and Competitors. Free Press.
Porter, Michael E. 1996. “What is Strategy?” Harvard Business Review 61–78.
Quarto-vonTivadar, John. 2006. “AB Testing: Too Little, Too Soon.” Future Now. www.futurenowinc.com/abtesting.pdf.
Radlinski, Filip, and Craswell, Nick. 2013. “Optimized Interleaving For Online Retrieval Evaluation.” International Conference on Web Search and Data Mining. Rome, IT: ACM. 245–254.
Rae, Barclay. 2014. “Watermelon SLAs – Making Sense of Green and Red Alerts.” Computer Weekly. September. https://www.computerweekly.com/opinion/Watermelon-SLAs-making-sense-of-green-and-red-alerts.
RAND. 1955. A Million Random Digits with 100,000 Normal Deviates. Glencoe, Ill: Free Press. www.rand.org/pubs/monograph_reports/MR1418.html.
Rawat, Girish. 2018. “Why Most Redesigns fail.” freeCodeCamp. December 4. https://medium.freecodecamp.org/why-most-redesigns-fail-6ecaaf1b584e.
Razali, Nornadiah Mohd, and Wah, Yap Bee. 2011. “Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests.” Journal of Statistical Modeling and Analytics, January 1: 21–33.
Reinhardt, Peter. 2016. Effect of Mobile App Size on Downloads. October 5. https://segment.com/blog/mobile-app-size-effect-on-downloads/.
Resnick, David. 2015. What is Ethics in Research & Why is it Important? December 1. www.niehs.nih.gov/research/resources/bioethics/whatis/index.cfm.
Ries, Eric. 2011. The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business.
Rodden, Kerry, Hutchinson, Hilary, and Fu, Xin. 2010. “Measuring the User Experience on a Large Scale: User-Centered Metrics for Web Applications.” Proceedings of CHI, April. https://ai.google/research/pubs/pub36299.
Romano, Joseph, Shaikh, Azeem M., and Wolf, Michael. 2016. “Multiple Testing.” In The New Palgrave Dictionary of Economics. Palgrave Macmillan.
Rosenbaum, Paul R, and Rubin, Donald B. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. doi:http://dx.doi.org/10.1093/biomet/70.1.41.
Rossi, Peter H., Lipsey, Mark W., and Freeman, Howard E. 2004. Evaluation: A Systematic Approach. 7th edition. Sage Publications, Inc.
Roy, Ranjit K. 2001. Design of Experiments using the Taguchi Approach: 16 Steps to Product and Process Improvement. John Wiley & Sons, Inc.
Rubin, Donald B. 1990. “Formal Modes of Statistical Inference for Causal Effects.” Journal of Statistical Planning and Inference 25 (3): 279–292.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701.
Rubin, Kenneth S. 2012. Essential Scrum: A Practical Guide to the Most Popular Agile Process. Addison-Wesley Professional.
Russell, Daniel M., and Grimes, Carrie. 2007. “Assigned Tasks Are Not the Same as Self-Chosen Web Searches.” HICSS'07: 40th Annual Hawaii International Conference on System Sciences, January. https://doi.org/10.1109/HICSS.2007.91.
Saint-Jacques, Guillaume B., Aral, Sinan, Airoldi, Edoardo, Brynjolfsson, Erik, and Xu, Ya. 2018. “The Strength of Weak Ties: Causal Evidence using People-You-May-Know Randomizations.” 141–152.
Saint-Jacques, Guillaume, Varshney, Maneesh, Simpson, Jeremy, and Xu, Ya. 2018. “Using Ego-Clusters to Measure Network Effects at LinkedIn.” Workshop on Information Systems and Economics. San Francisco, CA.
Samarati, Pierangela, and Sweeney, Latanya. 1998. “Protecting Privacy When Disclosing Information: k-anonymity and its Enforcement through Generalization and Suppression.” Proceedings of the IEEE Symposium on Research in Security and Privacy.
Schrage, Michael. 2014. The Innovator’s Hypothesis: How Cheap Experiments Are Worth More than Good Ideas. MIT Press.
Schrijvers, Ard. 2017. “Mobile Website Too Slow? Your Personalization Tools May Be to Blame.” Bloomreach. February 2. www.bloomreach.com/en/blog/2017/01/server-side-personalization-for-fast-mobile-pagespeed.html.
Schurman, Eric, and Brutlag, Jake. 2009. “Performance Related Changes and their User Impact.” Velocity 09: Velocity Web Performance and Operations Conference. www.youtube.com/watch?v=bQSE51-gr2s and www.slideshare.net/dyninc/the-user-and-business-impact-of-server-delays-additional-bytes-and-http-chunking-in-web-search-presentation.
Scott, Steven L. 2010. “A modern Bayesian look at the multi-armed bandit.” Applied Stochastic Models in Business and Industry 26 (6): 639–658. doi:https://doi.org/10.1002/asmb.874.
Segall, Ken. 2012. Insanely Simple: The Obsession That Drives Apple’s Success. Portfolio Hardcover.
Senn, Stephen. 2012. “Seven myths of randomisation in clinical trials.” Statistics in Medicine. doi:10.1002/sim.5713.
Shadish, William R., Cook, Thomas D., and Campbell, Donald T. 2001. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. 2nd edition. Cengage Learning.
Simpson, Edward H. 1951. “The Interpretation of Interaction in Contingency Tables.” Journal of the Royal Statistical Society, Ser. B, 238–241.
Sinofsky, Steven, and Iansiti, Marco. 2009. One Strategy: Organization, Planning, and Decision Making. Wiley.
Siroker, Dan, and Koomen, Pete. 2013. A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. Wiley.
Soriano, Jacopo. 2017. “Percent Change Estimation in Large Scale Online Experiments.” arXiv.org. November 3. https://arxiv.org/pdf/1711.00562.pdf.
Souders, Steve. 2013. “Moving Beyond window.onload().” High Performance Web Sites Blog. May 13. www.stevesouders.com/blog/2013/05/13/moving-beyond-window-onload/.
Souders, Steve. 2009. Even Faster Web Sites: Performance Best Practices for Web Developers. O’Reilly Media.
Souders, Steve. 2007. High Performance Web Sites: Essential Knowledge for Front-End Engineers. O’Reilly Media.
Spitzer, Dean R. 2007. Transforming Performance Measurement: Rethinking the Way We Measure and Drive Organizational Success. AMACOM.
Stephens-Davidowitz, Seth, Varian, Hal, and Smith, Michael D. 2017. “Super Returns to Super Bowl Ads?” Quantitative Marketing and Economics, March 1: 1–28.
Sterne, Jim. 2002. Web Metrics: Proven Methods for Measuring Web Site Success. John Wiley & Sons, Inc.
Strathern, Marilyn. 1997. “‘Improving ratings’: Audit in the British University System.” European Review 5 (3): 305–321. doi:10.1002/(SICI)1234-981X(199707)5:33.0.CO;2-4.
Student. 1908. “The Probable Error of a Mean.” Biometrika 6 (1): 1–25. https://www.jstor.org/stable/2331554.
Sullivan, Nicole. 2008. “Design Fast Websites.” Slideshare. October 14. www.slideshare.net/stubbornella/designing-fast-websites-presentation.
Tang, Diane, Agarwal, Ashish, O’Brien, Deirdre, and Meyer, Mike. 2010. “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation.” Proceedings 16th Conference on Knowledge Discovery and Data Mining.
The Guardian. 2014. OKCupid: We Experiment on Users. Everyone does. July 29. www.theguardian.com/technology/2014/jul/29/okcupid-experiment-human-beings-dating.
The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. 1979. The Belmont Report. April 18. www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html.
Thistlethwaite, Donald L., and Campbell, Donald T. 1960. “Regression-Discontinuity Analysis: An Alternative to the Ex-Post Facto Experiment.” Journal of Educational Psychology 51 (6): 309–317. doi:https://doi.org/10.1037%2Fh0044319.
Thomke, Stefan H. 2003. “Experimentation Matters: Unlocking the Potential of New Technologies for Innovation.”
Tiffany, Kaitlyn. 2017. “This Instagram Story Ad with a Fake Hair in It is Sort of Disturbing.” The Verge. December 11. www.theverge.com/tldr/2017/12/11/16763664/sneaker-ad-instagram-stories-swipe-up-trick.
Tolomei, Sam. 2017. Shrinking APKs, growing installs. November 20. https://medium.com/googleplaydev/shrinking-apks-growing-installs-5d3fcba23ce2.
Tutterow, Craig, and Saint-Jacques, Guillaume. 2019. Estimating Network Effects Using Naturally Occurring Peer Notification Queue Counterfactuals. February 19. https://arxiv.org/abs/1902.07133.
Tyler, Mary E., and Ledford, Jerri. 2006. Google Analytics. Wiley Publishing, Inc.
Tyurin, I.S. 2009. “On the Accuracy of the Gaussian Approximation.” Doklady Mathematics 429 (3): 312–316.
Ugander, Johan, Karrer, Brian, Backstrom, Lars, and Kleinberg, Jon. 2013. “Graph Cluster Randomization: Network Exposure to Multiple Universes.” Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 329–337.
van Belle, Gerald. 2008. Statistical Rules of Thumb. 2nd edition. Wiley-Interscience.
Vann, Michael G. 2003. “Of Rats, Rice, and Race: The Great Hanoi Rat Massacre, an Episode in French Colonial History.” French Colonial History 4: 191–203. https://muse.jhu.edu/article/42110.
Varian, Hal. 2016. “Causal inference in economics and marketing.” Proceedings of the National Academy of Sciences of the United States of America 7310–7315.
Varian, Hal R. 2007. “Kaizen, That Continuous Improvement Strategy, Finds Its Ideal Environment.” The New York Times. February 8. www.nytimes.com/2007/02/08/business/08scene.html.
Vaver, Jon, and Koehler, Jim. 2012. Periodic Measurement of Advertising Effectiveness Using Multiple-Test Period Geo Experiments. Google Inc.
Vaver, Jon, and Koehler, Jim. 2011. Measuring Ad Effectiveness Using Geo Experiments. Google, Inc.
Vickers, Andrew J. 2009. What Is a p-value Anyway? 34 Stories to Help You Actually Understand Statistics. Pearson. www.amazon.com/p-value-Stories-Actually-Understand-Statistics/dp/0321629302.
Vigen, Tyler. 2018. Spurious Correlations. http://tylervigen.com/spurious-correlations.
Wager, Stefan, and Athey, Susan. 2018. “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–1242. doi:https://doi.org/10.1080/01621459.2017.1319839.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer.
Weiss, Carol H. 1997. Evaluation: Methods for Studying Programs and Policies. 2nd edition. Prentice Hall.
Wider Funnel. 2018. “The State of Experimentation Maturity 2018.” Wider Funnel. www.widerfunnel.com/wp-content/uploads/2018/04/State-of-Experimentation-2018-Original-Research-Report.pdf.
Wikipedia contributors, Above the Fold. 2014. Wikipedia, The Free Encyclopedia. Jan. http://en.wikipedia.org/wiki/Above_the_fold.
Wikipedia contributors, Cobra Effect. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Cobra_effect.
Wikipedia contributors, Data Dredging. 2019. Data dredging. https://en.wikipedia.org/wiki/Data_dredging.
Wikipedia contributors, Eastern Air Lines Flight 401. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Eastern_Air_Lines_Flight_401.
Wikipedia contributors, List of .NET libraries and frameworks. 2019. https://en.wikipedia.org/wiki/List_of_.NET_libraries_and_frameworks#Logging_Frameworks.
Wikipedia contributors, Logging as a Service. 2019. Logging as a Service. https://en.wikipedia.org/wiki/Logging_as_a_service.
Wikipedia contributors, Multiple Comparisons Problem. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Multiple_comparisons_problem.
Wikipedia contributors, Perverse Incentive. 2019. https://en.wikipedia.org/wiki/Perverse_incentive.
Wikipedia contributors, Privacy by Design. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Privacy_by_design.
Wikipedia contributors, Semmelweis Reflex. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Semmelweis_reflex.
Wikipedia contributors, Simpson’s Paradox. 2019. Wikipedia, The Free Encyclopedia. Accessed February 28, 2008. http://en.wikipedia.org/wiki/Simpson%27s_paradox.
Wolf, Talia. 2018. “Why Most Redesigns Fail (and How to Make Sure Yours Doesn’t).” GetUplift. https://getuplift.co/why-most-redesigns-fail.
Xia, Tong, Bhardwaj, Sumit, Dmitriev, Pavel, and Fabijan, Aleksander. 2019. “Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout.” ICSE: 41st ACM/IEEE International Conference on Software Engineering. Montreal, Canada. www.researchgate.net/publication/333614382_Safe_Velocity_A_Practical_Guide_to_Software_Deployment_at_Scale_using_Controlled_Rollout.
Xie, Huizhi, and Aurisset, Juliette. 2016. “Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix.” KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: ACM. 645–654. http://doi.acm.org/10.1145/2939672.2939733.
Xu, Ya, and Chen, Nanyu. 2016. “Evaluating Mobile Apps with A/B and Quasi A/B Tests.” KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA: ACM. 313–322. http://doi.acm.org/10.1145/2939672.2939703.
Xu, Ya, Duan, Weitao, and Huang, Shaochen. 2018. “SQR: Balancing Speed, Quality and Risk in Online Experiments.” 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. London: Association for Computing Machinery. 895–904.
Xu, Ya, Chen, Nanyu, Fernandez, Adrian, Sinno, Omar, and Bhasin, Anmol. 2015. “From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks.” KDD ’15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney, NSW, Australia: ACM. 2227–2236. http://doi.acm.org/10.1145/2783258.2788602.
Yoon, Sangho. 2018. Designing A/B Tests in a Collaboration Network. www.unofficialgoogledatascience.com/2018/01/designing-ab-tests-in-collaboration.html.
Young, S. Stanley, and Karr, Allan. 2011. “Deming, data and observational studies: A process out of control and needing fixing.” Significance 8 (3).
Zhang, Fan, Joseph, Joshy, and James, Alexander, Zhuang, Peng Rickabaugh. 2018. Client-Side Activity Monitoring. US Patent US 10,165,071 B2. December 25.
Zhao, Zhenyu, Chen, Miao, Matheson, Don, and Stone, Maria. 2016. “Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation.” DSAA 2016: IEEE International Conference on Data Science and Advanced Analytics. IEEE. 498–507. doi:https://ieeexplore.ieee.org/document/7796936.
  • References
  • Ron Kohavi, Diane Tang, Ya Xu
  • Book: Trustworthy Online Controlled Experiments
  • Online publication: 13 March 2020
  • Chapter DOI: https://doi.org/10.1017/9781108653985.030