References

Ron Kohavi; Diane Tang; Ya Xu

doi:10.1017/9781108653985.030

Published online by Cambridge University Press: 13 March 2020

Ron Kohavi: Affiliation: Microsoft
Diane Tang: Affiliation: Google
Ya Xu: Affiliation: LinkedIn

Type: Chapter
Information: Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, pp. 246–265

Publisher: Cambridge University Press

Print publication year: 2020

References

Abadi, Martin, Chu, Andy, Goodfellow, Ian, McMahan, H. Brendan, Mironov, Ilya, Talwar, Kunal, and Zhang, Li. 2016. “Deep Learning with Differential Privacy.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.

Abrahamse, Peter. 2016. “How 8 Different A/B Testing Tools Affect Site Speed.” CXL: All Things Data-Driven Marketing. May 16. https://conversionxl.com/blog/testing-tools-site-speed/.

ACM. 2018. ACM Code of Ethics and Professional Conduct. June 22. www.acm.org/code-of-ethics.

Alvarez, Cindy. 2017. Lean Customer Development: Building Products Your Customers Will Buy. O’Reilly.

Angrist, Joshua D., and Pischke, Jörn-Steffen. 2014. Mastering ’Metrics: The Path from Cause to Effect. Princeton University Press.

Angrist, Joshua D., and Pischke, Jörn-Steffen. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

Apple, Inc. 2017. “Phased Release for Automatic Updates Now Available.” June 5. https://developer.apple.com/app-store-connect/whats-new/?id=31070842.

Apple, Inc. 2018. “Use Low Power Mode to Save Battery Life on Your iPhone.” Apple. September 25. https://support.apple.com/en-us/HT205234.

Athey, Susan, and Imbens, Guido. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” PNAS: Proceedings of the National Academy of Sciences. 7353–7360. doi: https://doi.org/10.1073/pnas.1510489113.

Azevedo, Eduardo M., Deng, Alex, Olea, Jose Montiel, Rao, Justin M., and Weyl, E. Glen. 2019. “A/B Testing with Fat Tails.” February 26. Available at SSRN: https://ssrn.com/abstract=3171224 or http://dx.doi.org/10.2139/ssrn.3171224.

Backstrom, Lars, and Kleinberg, Jon. 2011. “Network Bucket Testing.” WWW ’11 Proceedings of the 20th International Conference on World Wide Web. Hyderabad, India: ACM. 615–624.

Bailar, John C. 1983. “Introduction.” In Clinical Trials: Issues and Approaches, by Shapiro, Stuart and Louis, Thomas. Marcel Dekker.

Bakshy, Eytan, Balandat, Max, and Kashin, Kostya. 2019. “Open-sourcing Ax and BoTorch: New AI tools for adaptive experimentation.” Facebook Artificial Intelligence. May 1. https://ai.facebook.com/blog/open-sourcing-ax-and-botorch-new-ai-tools-for-adaptive-experimentation/.

Bakshy, Eytan, and Frachtenberg, Eitan. 2015. “Design and Analysis of Benchmarking Experiments for Distributed Internet Services.” WWW ’15: Proceedings of the 24th International Conference on World Wide Web. Florence, Italy: ACM. 108–118. doi: https://doi.org/10.1145/2736277.2741082.

Bakshy, Eytan, Eckles, Dean, and Bernstein, Michael. 2014. “Designing and Deploying Online Field Experiments.” International World Wide Web Conference (WWW 2014). https://facebook.com//download/255785951270811/planout.pdf.

Barajas, Joel, Akella, Ram, Hotan, Marius, and Flores, Aaron. 2016. “Experimental Designs and Estimation for Online Display Advertising Attribution in Marketplaces.” Marketing Science: the Marketing Journal of the Institute for Operations Research and the Management Sciences 35: 465–483.

Barrilleaux, Bonnie, and Wang, Dylan. 2018. “Spreading the Love in the LinkedIn Feed with Creator-Side Optimization.” LinkedIn Engineering. October 16. https://engineering.linkedin.com/blog/2018/10/linkedin-feed-with-creator-side-optimization.

Basin, David, Debois, Soren, and Hildebrandt, Thomas. 2018. “On Purpose and by Necessity: Compliance under the GDPR.” Financial Cryptography and Data Security 2018. IFCA. Preproceedings 21.

Benbunan-Fich, Raquel. 2017. “The Ethics of Online Research with Unsuspecting Users: From A/B Testing to C/D Experimentation.” Research Ethics 13 (3–4): 200–218. doi: https://doi.org/10.1177/1747016116680664.

Benjamin, Daniel J., Berger, James O., Johannesson, Magnus, Nosek, Brian A., Wagenmakers, E.-J., Berk, Richard, Bollen, Kenneth A., et al. 2017. “Redefine Statistical Significance.” Nature Human Behaviour 2 (1): 6–10. https://www.nature.com/articles/s41562-017-0189-z.

Beshears, John, Choi, James J., Laibson, David, Madrian, Brigitte C., and Milkman, Katherine L. 2011. The Effect of Providing Peer Information on Retirement Savings Decisions. NBER Working Paper Series, National Bureau of Economic Research. www.nber.org/papers/w17345.

Billingsley, Patrick. 1995. Probability and Measure. Wiley.

Blake, Thomas, and Coey, Dominic. 2014. “Why Marketplace Experimentation is Harder Than it Seems: The Role of Test-Control Interference.” EC ’14 Proceedings of the Fifteenth ACM Conference on Economics and Computation. Palo Alto, CA: ACM. 567–582.

Blank, Steven Gary. 2005. The Four Steps to the Epiphany: Successful Strategies for Products that Win. Cafepress.com.

Blocker, Craig, Conway, John, Demortier, Luc, Heinrich, Joel, Junk, Tom, Lyons, Louis, and Punzi, Giovanni. 2006. “Simple Facts about P-Values.” The Rockefeller University. January 5. http://physics.rockefeller.edu/luc/technical_reports/cdf8023_facts_about_p_values.pdf.

Bodlewski, Mike. 2017. “When Slower UX is Better UX.” Web Designer Depot. Sep 25. https://www.webdesignerdepot.com/2017/09/when-slower-ux-is-better-ux/.

Bojinov, Iavor, and Shephard, Neil. 2017. “Time Series Experiments and Causal Estimands: Exact Randomization Tests and Trading.” arXiv of Cornell University. July 18. arXiv:1706.07840.

Borden, Peter. 2014. “How Optimizely (Almost) Got Me Fired.” The SumAll Blog: Where E-commerce and Social Media Meet. June 18. https://blog.sumall.com/journal/optimizely-got-me-fired.html.

Bowman, Douglas. 2009. “Goodbye, Google.” stopdesign. March 20. https://stopdesign.com/archive/2009/03/20/goodbye-google.html.

Box, George E.P., Hunter, J. Stuart, and Hunter, William G. 2005. Statistics for Experimenters: Design, Innovation, and Discovery. 2nd edition. John Wiley & Sons, Inc.

Bell, Brooks. 2015. “Click Summit 2015 Keynote Presentation.” Brooks Bell. www.brooksbell.com/wp-content/uploads/2015/05/BrooksBell_ClickSummit15_Keynote1.pdf.

Brown, Morton B. 1975. “A Method for Combining Non-Independent, One-Sided Tests of Significance.” Biometrics 31 (4): 987–992. www.jstor.org/stable/2529826.

Brutlag, Jake, Abrams, Zoe, and Meenan, Pat. 2011. “Above the Fold Time: Measuring Web Page Performance Visually.” Velocity: Web Performance and Operations Conference.

Buhrmester, Michael, Kwang, Tracy, and Gosling, Samuel. 2011. “Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality Data?” Perspectives on Psychological Science, Feb 3.

Campbell, Donald T. 1979. “Assessing the Impact of Planned Social Change.” Evaluation and Program Planning 2: 67–90. https://doi.org/10.1016/0149-7189(79)90048-X.

Campbell’s law. 2018. Wikipedia. https://en.wikipedia.org/wiki/Campbell%27s_law.

Card, David, and Krueger, Alan B. 1994. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” The American Economic Review 84 (4): 772–793. https://www.jstor.org/stable/2118030.

Casella, George, and Berger, Roger L. 2001. Statistical Inference. 2nd edition. Cengage Learning.

CDC. 2015. The Tuskegee Timeline. December. https://www.cdc.gov/tuskegee/timeline.htm.

Chamandy, Nicholas. 2016. “Experimentation in a Ridesharing Marketplace.” Lyft Engineering. September 2. https://eng.lyft.com/experimentation-in-a-risharing-marketplace-b39db027a66e.

Chan, David, Ge, Rong, Gershony, Ori, Hesterberg, Tim, and Lambert, Diane. 2010. “Evaluating Online Ad Campaigns in a Pipeline: Causal Models at Scale.” Proceedings of ACM SIGKDD.

Chapelle, Olivier, Joachims, Thorsten, Radlinski, Filip, and Yue, Yisong. 2012. “Large-Scale Validation and Analysis of Interleaved Search Evaluation.” ACM Transactions on Information Systems, February.

Chaplin, Charlie. 1964. My Autobiography. Simon & Schuster.

Charles, Reichardt S., and Melvin, Mark M. 2004. “Quasi Experimentation.” In Handbook of Practical Program Evaluation, by Wholey, Joseph S., Hatry, Harry P. and Newcomer, Kathryn E. Jossey-Bass.

Chatham, Bob, Temkin, Bruce D., and Amato, Michelle. 2004. A Primer on A/B Testing. Forrester Research.

Chen, Nanyu, Liu, Min, and Xu, Ya. 2019. “How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments.” WSDM ’19 Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. Melbourne, VIC, Australia: ACM. 501–509. https://dl.acm.org/citation.cfm?id=3291000.

Chrystal, K. Alec, and Mizen, Paul D. 2001. Goodhart’s Law: Its Origins, Meaning and Implications for Monetary Policy. Prepared for the Festschrift in honor of Charles Goodhart held on 15–16 November 2001 at the Bank of England. http://cyberlibris.typepad.com/blog/files/Goodharts_Law.pdf.

Coey, Dominic, and Cunningham, Tom. 2019. “Improving Treatment Effect Estimators Through Experiment Splitting.” WWW ’19: The Web Conference. San Francisco, CA, USA: ACM. 285–295. doi:https://dl.acm.org/citation.cfm?doid=3308558.3313452.

Collis, David. 2016. “Lean Strategy.” Harvard Business Review 62–68. https://hbr.org/2016/03/lean-strategy.

Concato, John, Shah, Nirav, and Horwitz, Ralph I. 2000. “Randomized, Controlled Trials, Observational Studies, and the Hierarchy of Research Designs.” The New England Journal of Medicine 342 (25): 1887–1892. doi:https://www.nejm.org/doi/10.1056/NEJM200006223422507.

Cox, David Roxbee. 1958. Planning of Experiments. New York: John Wiley.

Croll, Alistair, and Yoskovitz, Benjamin. 2013. Lean Analytics: Use Data to Build a Better Startup Faster. O’Reilly Media.

Crook, Thomas, Frasca, Brian, Kohavi, Ron, and Longbotham, Roger. 2009. “Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.” KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 1105–1114.

Cross, Robert G., and Dixit, Ashutosh. 2005. “Customer-centric Pricing: The Surprising Secret for Profitability.” Business Horizons, 488.

Deb, Anirban, Bhattacharya, Suman, Gu, Jeremey, Zhuo, Tianxia, Feng, Eva, and Liu, Mandie. 2018. “Under the Hood of Uber’s Experimentation Platform.” Uber Engineering. August 28. https://eng.uber.com/xp.

Deng, Alex. 2015. “Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments.” Florence, IT: ACM. 923–928.

Deng, Alex, and Hu, Victor. 2015. “Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments.” WSDM ’15: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. Shanghai, China: ACM. 349–358. doi:https://doi.org/10.1145/2684822.2685307.

Deng, Alex, Lu, Jiannan, and Chen, Shouyuan. 2016. “Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing.” 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). Montreal, QC, Canada: IEEE. doi:https://doi.org/10.1109/DSAA.2016.33.

Deng, Alex, Knoblich, Ulf, and Lu, Jiannan. 2018. “Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas.” 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Deng, Alex, Lu, Jiannan, and Litz, Jonathan. 2017. “Trustworthy Analysis of Online A/B Tests: Pitfalls, Challenges and Solutions.” WSDM: The Tenth International Conference on Web Search and Data Mining. Cambridge, UK.

Deng, Alex, Xu, Ya, Kohavi, Ron, and Walker, Toby. 2013. “Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data.” WSDM 2013: Sixth ACM International Conference on Web Search and Data Mining.

Deng, Shaojie, Longbotham, Roger, Walker, Toby, and Xu, Ya. 2011. “Choice of Randomization Unit in Online Controlled Experiments.” Joint Statistical Meetings Proceedings. 4866–4877.

Denrell, Jerker. 2005. “Selection Bias and the Perils of Benchmarking.” Harvard Business Review 83 (4): 114–119.

Dickhaus, Thorsten. 2014. Simultaneous Statistical Inference: With Applications in the Life Sciences. Springer. https://www.springer.com/cda/content/document/cda_downloaddocument/9783642451812-c2.pdf.

Dickson, Paul. 1999. The Official Rules and Explanations: The Original Guide to Surviving the Electronic Age With Wit, Wisdom, and Laughter. Federal Street Pr.

Djulbegovic, Benjamin, and Hozo, Iztok. 2002. “At What Degree of Belief in a Research Hypothesis Is a Trial in Humans Justified?” Journal of Evaluation in Clinical Practice, June 13.

Dmitriev, Pavel, and Xian, Wu. 2016. “Measuring Metrics.” CIKM: Conference on Information and Knowledge Management. Indianapolis, IN. http://bit.ly/measuringMetrics.

Dmitriev, Pavel, Gupta, Somit, Kim, Dong Woo, and Vaz, Garnet. 2017. “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments.” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2017). Halifax, NS, Canada: ACM. 1427–1436. http://doi.acm.org/10.1145/3097983.3098024.

Dmitriev, Pavel, Frasca, Brian, Gupta, Somit, Kohavi, Ron, and Vaz, Garnet. 2016. “Pitfalls of Long-Term Online Controlled Experiments.” 2016 IEEE International Conference on Big Data (Big Data). Washington DC. 1367–1376. http://bit.ly/expLongTerm.

Doerr, John. 2018. Measure What Matters: How Google, Bono, and the Gates Foundation Rock the World with OKRs. Portfolio.

Doll, Richard. 1998. “Controlled Trials: the 1948 Watershed.” BMJ. doi:https://doi.org/10.1136/bmj.317.7167.1217.

Dutta, Kaushik, and Vandermeer, Debra. 2018. “Caching to Reduce Mobile App Energy Consumption.” ACM Transactions on the Web (TWEB), February 12(1): Article No. 5.

Dwork, Cynthia, and Roth, Aaron. 2014. “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends in Computer Science 211–407.

Eckles, Dean, Karrer, Brian, and Ugander, Johan. 2017. “Design and Analysis of Experiments in Networks: Reducing Bias from Interference.” Journal of Causal Inference 5(1). www.deaneckles.com/misc/Eckles_Karrer_Ugander_Reducing_Bias_from_Interference.pdf.

Edgington, Eugene S. 1972. “An Additive Method for Combining Probability Values from Independent Experiments.” The Journal of Psychology 80 (2): 351–363.

Edmonds, Andy, White, Ryan W., Morris, Dan, and Drucker, Steven M. 2007. “Instrumenting the Dynamic Web.” Journal of Web Engineering. (3): 244–260. www.microsoft.com/en-us/research/wp-content/uploads/2016/02/edmondsjwe2007.pdf.

Efron, Bradley, and Tibshirani, Robert J. 1994. An Introduction to the Bootstrap. Chapman & Hall/CRC.

EGAP. 2018. “10 Things to Know About Heterogeneous Treatment Effects.” EGAP: Evidence in Government and Politics. egap.org/methods-guides/10-things-heterogeneous-treatment-effects.

Ehrenberg, A.S.C. 1975. “The Teaching of Statistics: Corrections and Comments.” Journal of the Royal Statistical Society. Series A 138 (4): 543–545. https://www.jstor.org/stable/2345216.

Eisenberg, Bryan. 2005. “How to Improve A/B Testing.” ClickZ Network. April 29. www.clickz.com/clickz/column/1717234/how-improve-a-b-testing.

Eisenberg, Bryan. 2004. A/B Testing for the Mathematically Disinclined. May 7. http://www.clickz.com/showPage.html?page=3349901.

Eisenberg, Bryan, and Quarto-vonTivadar, John. 2008. Always Be Testing: The Complete Guide to Google Website Optimizer. Sybex.

eMarketer. 2016. “Microsoft Ad Revenues Continue to Rebound.” April 20. https://www.emarketer.com/Article/Microsoft-Ad-Revenues-Continue-Rebound/1013854.

European Commission. 2018. https://ec.europa.eu/commission/priorities/justice-and-fundamental-rights/data-protection/2018-reform-eu-data-protection-rules_en.

European Commission. 2016. EU GDPR.ORG. https://eugdpr.org/.

Fabijan, Aleksander, Dmitriev, Pavel, Olsson, Helena Holmstrom, and Bosch, Jan. 2018. “Online Controlled Experimentation at Scale: An Empirical Survey on the Current State of A/B Testing.” Euromicro Conference on Software Engineering and Advanced Applications (SEAA). Prague, Czechia. doi:10.1109/SEAA.2018.00021.

Fabijan, Aleksander, Dmitriev, Pavel, Olsson, Helena Holmstrom, and Bosch, Jan. 2017. “The Evolution of Continuous Experimentation in Software Product Development: from Data to a Data-Driven Organization at Scale.” ICSE ’17 Proceedings of the 39th International Conference on Software Engineering. Buenos Aires, Argentina: IEEE Press. 770–780. doi:https://doi.org/10.1109/ICSE.2017.76.

Fabijan, Aleksander, Gupchup, Jayant, Gupta, Somit, Omhover, Jeff, Qin, Wen, Vermeer, Lukas, and Dmitriev, Pavel. 2019. “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners.” KDD ’19: The 25th SIGKDD International Conference on Knowledge Discovery and Data Mining. Anchorage, Alaska, USA: ACM.

Fabijan, Aleksander, Dmitriev, Pavel, McFarland, Colin, Vermeer, Lukas, Olsson, Helena Holmström, and Bosch, Jan. 2018. “Experimentation Growth: Evolving Trustworthy A/B Testing Capabilities in Online Software Companies.” Journal of Software: Evolution and Process 30 (12:e2113). doi:https://doi.org/10.1002/smr.2113.

FAT/ML. 2019. Fairness, Accountability, and Transparency in Machine Learning. http://www.fatml.org/.

Fisher, Ronald Aylmer. 1925. Statistical Methods for Research Workers. Oliver and Boyd. http://psychclassics.yorku.ca/Fisher/Methods/.

Forte, Michael. 2019. “Misadventures in experiments for growth.” The Unofficial Google Data Science Blog. April 16. www.unofficialgoogledatascience.com/2019/04/misadventures-in-experiments-for-growth.html.

Freedman, Benjamin. 1987. “Equipoise and the Ethics of Clinical Research.” The New England Journal of Medicine 317 (3): 141–145. doi:https://www.nejm.org/doi/full/10.1056/NEJM198707163170304.

Gelman, Andrew, and Carlin, John. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6): 641–651. doi:10.1177/1745691614551642.

Gelman, Andrew, and Little, Thomas C. 1997. “Poststratification into Many Categories Using Hierarchical Logistic Regression.” Survey Methodology 23 (2): 127–135. www150.statcan.gc.ca/n1/en/pub/12-001-x/1997002/article/3616-eng.pdf.

Georgiev, Georgi Zdravkov. 2019. Statistical Methods in Online A/B Testing: Statistics for Data-Driven Business Decisions and Risk Management in e-Commerce. Independently published. www.abtestingstats.com.

Georgiev, Georgi Zdravkov. 2018. “Analysis of 115 A/B Tests: Average Lift is 4%, Most Lack Statistical Power.” Analytics Toolkit. June 26. http://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-tests-average-lift-statistical-power/.

Gerber, Alan S., and Green, Donald P. 2012. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton & Company. https://www.amazon.com/Field-Experiments-Design-Analysis-Interpretation/dp/0393979954.

Goldratt, Eliyahu M. 1990. The Haystack Syndrome. North River Press.

Goldstein, Noah J., Martin, Steve J., and Cialdini, Robert B. 2008. Yes!: 50 Scientifically Proven Ways to Be Persuasive. Free Press.

Goodhart, Charles A. E. 1975. Problems of Monetary Management: The UK Experience. Vol. 1, in Papers in Monetary Economics, by Reserve Bank of Australia.

Goodhart’s law. 2018. Wikipedia. https://en.wikipedia.org/wiki/Goodhart%27s_law.

Goodman, Steven. 2008. “A Dirty Dozen: Twelve P-Value Misconceptions.” Seminars in Hematology. doi:https://doi.org/10.1053/j.seminhematol.2008.04.003.

Google. 2019. Processing Logs at Scale Using Cloud Dataflow. March 19. https://cloud.google.com/solutions/processing-logs-at-scale-using-dataflow.

Google. 2018. Google Surveys. https://marketingplatform.google.com/about/surveys/.

Google. 2011. “Ads Quality Improvements Rolling Out Globally.” Google Inside AdWords. October 3. https://adwords.googleblog.com/2011/10/ads-quality-improvements-rolling-out.html.

Google Console. 2019. “Release App Updates with Staged Rollouts.” Google Console Help. https://support.google.com/googleplay/android-developer/answer/6346149?hl=en.

Google Developers. 2019. Reduce Your App Size. https://developer.android.com/topic/performance/reduce-apk-size.

Google, Helping Advertisers Comply with the GDPR. 2019. Google Ads Help. https://support.google.com/google-ads/answer/9028179?hl=en.

Google Website Optimizer. 2008. http://services.google.com/websiteoptimizer.

Gordon, Brett R., Zettelmeyer, Florian, Bhargava, Neha, and Chapsky, Dan. 2018. “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook (forthcoming at Marketing Science).” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3033144.

Goward, Chris. 2015. “Delivering Profitable ‘A-ha!’ Moments Everyday.” Conversion Hotel. Texel, The Netherlands. www.slideshare.net/webanalisten/chris-goward-strategy-conversion-hotel-2015.

Goward, Chris. 2012. You Should Test That: Conversion Optimization for More Leads, Sales and Profit or The Art and Science of Optimized Marketing. Sybex.

Greenhalgh, Trisha. 2014. How to Read a Paper: The Basics of Evidence-Based Medicine. BMJ Books. https://www.amazon.com/gp/product/B00IPG7GLC.

Greenhalgh, Trisha. 1997. “How to Read a Paper: Getting Your Bearings (deciding what the paper is about).” BMJ 315 (7102): 243–246. doi:10.1136/bmj.315.7102.243.

Greenland, Sander, Senn, Stephen J., Rothman, Kenneth J., Carlin, John B., Poole, Charles, Goodman, Steven N., and Altman, Douglas G. 2016. “Statistical Tests, P Values, Confidence Intervals, and Power: a Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–350. https://dx.doi.org/10.1007%2Fs10654-016-0149-3.

Grimes, Carrie, Tang, Diane, and Russell, Daniel M. 2007. “Query Logs Alone are not Enough.” International Conference of the World Wide Web, May.

Grove, Andrew S. 1995. High Output Management. 2nd edition. Vintage.

Groves, Robert M., Fowler, Floyd J. Jr, Couper, Mick P., Lepkowski, James M., Singer, Eleanor, and Tourangeau, Roger. 2009. Survey Methodology, 2nd edition. Wiley.

Gui, Han, Xu, Ya, Bhasin, Anmol, and Han, Jiawei. 2015. “Network A/B Testing From Sampling to Estimation.” WWW ’15 Proceedings of the 24th International Conference on World Wide Web. Florence, IT: ACM. 399–409.

Gupta, Somit, Ulanova, Lucy, Bhardwaj, Sumit, Dmitriev, Pavel, Raff, Paul, and Fabijan, Aleksander. 2018. “The Anatomy of a Large-Scale Online Experimentation Platform.” IEEE International Conference on Software Architecture.

Gupta, Somit, Kohavi, Ronny, Tang, Diane, Xu, Ya, et al. 2019. “Top Challenges from the first Practical Online Controlled Experiments Summit.” Edited by Dong, Xin Luna, Teredesai, Ankur and Zafarani, Reza. SIGKDD Explorations (ACM) 21 (1). https://bit.ly/OCESummit1.

Guyatt, Gordon H., Sackett, David L., Sinclair, John C., Hayward, Robert, Cook, Deborah J., and Cook, Richard J. 1995. “Users’ Guides to the Medical Literature: IX. A method for Grading Health Care Recommendations.” Journal of the American Medical Association (JAMA) 274 (22): 1800–1804. doi:https://doi.org/10.1001%2Fjama.1995.03530220066035.

Harden, K. Paige, Mendle, Jane, Hill, Jennifer E., Turkheimer, Eric, and Emery, Robert E. 2008. “Rethinking Timing of First Sex and Delinquency.” Journal of Youth and Adolescence 37 (4): 373–385. doi:https://doi.org/10.1007/s10964-007-9228-9.

Harford, Tim. 2014. The Undercover Economist Strikes Back: How to Run – or Ruin – an Economy. Riverhead Books.

Hauser, John R., and Katz, Gerry. 1998. “Metrics: You Are What You Measure!” European Management Journal 16 (5): 516–528. http://www.mit.edu/~hauser/Papers/metrics%20you%20are%20what%20you%20measure.pdf.

Health and Human Services. 2018a. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html.

Health and Human Services. 2018b. Health Information Privacy. https://www.hhs.gov/hipaa/index.html.

Health and Human Services. 2018c. Summary of the HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html.

Hedges, Larry, and Olkin, Ingram. 2014. Statistical Methods for Meta-Analysis. Academic Press.

Hemkens, Lars, Contopoulos-Ioannidis, Despina, and Ioannidis, John. 2016. “Routinely Collected Data and Comparative Effectiveness Evidence: Promises and Limitations.” CMAJ, May 17.

HIPAA Journal. 2018. What is Considered Protected Health Information Under HIPAA. April 2. https://www.hipaajournal.com/what-is-considered-protected-health-information-under-hipaa/.

Hochberg, Yosef, and Benjamini, Yoav. 1995. “Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B 57 (1): 289–300.

Hodge, Victoria, and Austin, Jim. 2004. “A Survey of Outlier Detection Methodologies.” Journal of Artificial Intelligence Review. 85–126.

Hohnhold, Henning, O’Brien, Deirdre, and Tang, Diane. 2015. “Focus on the Long-Term: It’s better for Users and Business.” Proceedings 21st Conference on Knowledge Discovery and Data Mining (KDD 2015). Sydney, Australia: ACM. http://dl.acm.org/citation.cfm?doid=2783258.2788583.

Holson, Laura M. 2009. “Putting a Bolder Face on Google.” NY Times. February 28. https://www.nytimes.com/2009/03/01/business/01marissa.html.

Holtz, David Michael. 2018. “Limiting Bias from Test-Control Interference In Online Marketplace Experiments.” DSpace@MIT. http://hdl.handle.net/1721.1/117999.

Hoover, Kevin D. 2008. “Phillips Curve.” In Henderson, R. David, Concise Encyclopedia of Economics. http://www.econlib.org/library/Enc/PhillipsCurve.html.

Huang, Jason, Reiley, David, and Raibov, Nickolai M. 2018. “Measuring Consumer Sensitivity to Audio Advertising: A Field Experiment on Pandora Internet Radio.” David Reiley, Jr. April 21. http://davidreiley.com/papers/PandoraListenerDemandCurve.pdf.

Huang, Jeff, White, Ryen W., and Dumais, Susan. 2012. “No Clicks, No Problem: Using Cursor Movements to Understand and Improve Search.” Proceedings of SIGCHI.

Huang, Yanping, You, Jane, Wang, Iris, Cao, Feng, and Gao, Ian. 2015. Data Science Interviews Exposed. CreateSpace.

Hubbard, Douglas W. 2014. How to Measure Anything: Finding the Value of Intangibles in Business. 3rd edition. Wiley.

Huffman, Scott. 2008. Search Evaluation at Google. September 15. https://googleblog.blogspot.com/2008/09/search-evaluation-at-google.html.

Imbens, Guido W., and Rubin, Donald B. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Ioannidis, John P. 2005. “Contradicted and Initially Stronger Effects in Highly Cited Clinical Research.” The Journal of the American Medical Association 294 (2).

Jackson, Simon. 2018. “How Booking.com increases the power of online experiments with CUPED.” Booking.ai. January 22. https://booking.ai/how-booking-com-increases-the-power-of-online-experiments-with-cuped-995d186fff1d.

Joachims, Thorsten, Granka, Laura, Pan, Bing, Hembrooke, Helene, and Gay, Geri. 2005. “Accurately Interpreting Clickthrough Data as Implicit Feedback.” SIGIR, August.

Johari, Ramesh, Pekelis, Leonid, Koomen, Pete, and Walsh, David. 2017. “Peeking at A/B Tests.” KDD ’17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Halifax, NS, Canada: ACM. 1517–1525. doi:https://doi.org/10.1145/3097983.3097992.

Kaplan, Robert S., and Norton, David P. 1996. The Balanced Scorecard: Translating Strategy into Action. Harvard Business School Press.

Katzir, Liran, Liberty, Edo, and Somekh, Oren. 2012. “Framework and Algorithms for Network Bucket Testing.” Proceedings of the 21st International Conference on World Wide Web 1029–1036.

Kaushik, Avinash. 2006. “Experimentation and Testing: A Primer.” Occam’s Razor. May 22. www.kaushik.net/avinash/2006/05/experimentation-and-testing-a-primer.html.

Keppel, Geoffrey, Saufley, William H., and Tokunaga, Howard. 1992. Introduction to Design and Analysis. 2nd edition. W.H. Freeman and Company.

Kesar, Alhan. 2018. 11 Ways to Stop FOOC’ing up your A/B tests. August 9. www.widerfunnel.com/stop-fooc-ab-tests/.

King, Gary, and Nielsen, Richard. 2018. Why Propensity Scores Should Not Be Used for Matching. Working paper. https://gking.harvard.edu/publications/why-propensity-scores-should-not-be-used-formatching.

King, Rochelle, Churchill, Elizabeth F., and Tan, Caitlin. 2017. Designing with Data: Improving the User Experience with A/B Testing. O’Reilly Media.

Kingston, Robert. 2015. Does Optimizely Slow Down a Site’s Performance. January 18. https://www.quora.com/Does-Optimizely-slow-down-a-sites-performance/answer/Robert-Kingston.

Knapp, Michael S., Swinnerton, Juli A., Copland, Michael A., and Monpas-Huber, Jack. 2006. Data-Informed Leadership in Education. Center for the Study of Teaching and Policy, University of Washington, Seattle, WA: Wallace Foundation. https://www.wallacefoundation.org/knowledge-center/Documents/1-Data-Informed-Leadership.pdf.

Kohavi, Ron. 2019. “HiPPO FAQ.” ExP Experimentation Platform. http://bitly.com/HIPPOExplained.Google Scholar

Kohavi, Ron. 2016. “Pitfalls in Online Controlled Experiments.” CODE ’16: Conference on Digital Experimentation. MIT. https://bit.ly/Code2016Kohavi.Google Scholar

Kohavi, Ron. 2014. “Customer Review of A/B Testing: The Most Powerful Way to Turn Clicks Into Customers.” Amazon.com. May 27. www.amazon.com/gp/customer-reviews/R44BH2HO30T18.Google Scholar

Kohavi, Ron. 2010. “Online Controlled Experiments: Listening to the Customers, not to the HiPPO.” Keynote at EC10: the 11th ACM Conference on Electronic Commerce. www.exp-platform.com/Documents/2010-06%20EC10.pptx.Google Scholar

Kohavi, Ron. 2003. Real-world Insights from Mining Retail E-Commerce Data. Stanford, CA, May 22. http://ai.stanford.edu/~ronnyk/realInsights.ppt.Google Scholar

Kohavi, Ron, and Longbotham, Roger. 2017. “Online Controlled Experiments and A/B Tests.” In Encyclopedia of Machine Learning and Data Mining, by Sammut, Claude and Webb, Geoffrey I. Springer. www.springer.com/us/book/9781489976857.Google Scholar

Kohavi, Ron, and Longbotham, Roger. 2010. “Unexpected Results in Online Controlled Experiments.” SIGKDD Explorations, December. http://bit.ly/expUnexpected.Google Scholar

Kohavi, Ron, and Parekh, Rajesh. 2003. “Ten Supplementary Analyses to Improve E-commerce Web Sites.” WebKDD. http://ai.stanford.edu/~ronnyk/supplementaryAnalyses.pdf.Google Scholar

Kohavi, Ron, and Thomke, Stefan. 2017. “The Surprising Power of Online Experiments.” Harvard Business Review (September–October): 74–92. http://exp-platform.com/hbr-the-surprising-power-of-online-experiments/.Google Scholar

Kohavi, Ron, Crook, Thomas, and Longbotham, Roger. 2009. “Online Experimentation at Microsoft.” Third Workshop on Data Mining Case Studies and Practice Prize. http://bit.ly/expMicrosoft.Google Scholar

Kohavi, Ron, Longbotham, Roger, and Walker, Toby. 2010. “Online Experiments: Practical Lessons.” IEEE Computer, September: 82–85. http://bit.ly/expPracticalLessons.Google Scholar

Kohavi, Ron, Tang, Diane, and Xu, Ya. 2019. “History of Controlled Experiments.” Practical Guide to Trustworthy Online Controlled Experiments. https://bit.ly/experimentGuideHistory.Google Scholar

Kohavi, Ron, Deng, Alex, Longbotham, Roger, and Xu, Ya. 2014. “Seven Rules of Thumb for Web Site Experimenters.” Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14). http://bit.ly/expRulesOfThumb.Google Scholar

Kohavi, Ron, Longbotham, Roger, Sommerfield, Dan, and Henne, Randal M. 2009. “Controlled Experiments on the Web: Survey and Practical Guide.” Data Mining and Knowledge Discovery 18: 140–181. http://bit.ly/expSurvey.Google Scholar

Kohavi, Ron, Deng, Alex, Frasca, Brian, Longbotham, Roger, Walker, Toby, and Xu, Ya. 2012. “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained.” Proceedings of the 18th Conference on Knowledge Discovery and Data Mining. http://bit.ly/expPuzzling.Google Scholar

Kohavi, Ron, Deng, Alex, Frasca, Brian, Walker, Toby, Xu, Ya, and Pohlmann, Nils. 2013. “Online Controlled Experiments at Large Scale.” KDD 2013: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Google Scholar

Kohavi, Ron, Messner, David, Eliot, Seth, Ferres, Juan Lavista, Henne, Randy, Kannappan, Vignesh, and Wang, Justin. 2010. “Tracking Users’ Clicks and Submits: Tradeoffs between User Experience and Data Loss.” Experimentation Platform. September 28. www.exp-platform.com/Documents/TrackingUserClicksSubmits.pdf.Google Scholar

Kramer, Adam, Guillory, Jamie, and Hancock, Jeffrey. 2014. “Experimental evidence of massive-scale emotional contagion through social networks.” PNAS, June 17.Google Scholar

Kuhn, Thomas. 1996. The Structure of Scientific Revolutions. 3rd edition. University of Chicago Press.Google Scholar

Laja, Peep. 2019. “How to Avoid a Website Redesign FAIL.” CXL. March 8. https://conversionxl.com/show/avoid-redesign-fail/.Google Scholar

Lax, Jeffrey R., and Phillips, Justin H. 2009. “How Should We Estimate Public Opinion in The States?” American Journal of Political Science 53 (1): 107–121. www.columbia.edu/~jhp2121/publications/HowShouldWeEstimateOpinion.pdf.Google Scholar

Lee, Jess. 2013. Fake Door. April 10. www.jessyoko.com/blog/2013/04/10/fake-doors/.Google Scholar

Lee, Minyong R, and Shen, Milan. 2018. “Winner’s Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments.” KDD 2018: The 24th ACM Conference on Knowledge Discovery and Data Mining. London: ACM.Google Scholar

Lehmann, Erich L., and Romano, Joseph P. 2005. Testing Statistical Hypotheses. Springer.Google Scholar

Levy, Steven. 2014. “Why The New Obamacare Website is Going to Work This Time.” www.wired.com/2014/06/healthcare-gov-revamp/.Google Scholar

Lewis, Randall A, Rao, Justin M, and Reiley, David. 2011. “Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising.” Proceedings of the 20th ACM International World Wide Web Conference (WWW). 157–166. https://ssrn.com/abstract=2080235.Google Scholar

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. 2010. “A Contextual-Bandit Approach to Personalized News Article Recommendation.” WWW 2010: Proceedings of the 19th International Conference on World Wide Web. Raleigh, North Carolina. https://arxiv.org/pdf/1003.0146.pdf.Google Scholar

Linden, Greg. 2006. Early Amazon: Shopping Cart Recommendations. April 25. http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html.Google Scholar

Linden, Greg. 2006. “Make Data Useful.” December. http://sites.google.com/site/glinden/Home/StanfordDataMining.2006-11-28.ppt.Google Scholar

Linden, Greg. 2006. “Marissa Mayer at Web 2.0.” Geeking with Greg. November 9. http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html.Google Scholar

Linowski, Jakub. 2018a. Good UI: Learn from What We Try and Test. https://goodui.org/.Google Scholar

Linowski, Jakub. 2018b. No Coupon. https://goodui.org/patterns/1/.Google Scholar

Liu, Min, Sun, Xiaohui, Varshney, Maneesh, and Xu, Ya. 2018. “Large-Scale Online Experimentation with Quantile Metrics.” Joint Statistical Meeting, Statistical Consulting Section. Alexandria, VA: American Statistical Association. 2849–2860.Google Scholar

Loukides, Michael, Mason, Hilary, and Patil, D.J. 2018. Ethics and Data Science. O’Reilly Media.Google Scholar

Lu, Luo, and Liu, Chuang. 2014. “Separation Strategies for Three Pitfalls in A/B Testing.” KDD User Engagement Optimization Workshop. New York. www.ueo-workshop.com/wp-content/uploads/2014/04/Separation-strategies-for-three-pitfalls-in-AB-testing_withacknowledgments.pdf.Google Scholar

Lucas critique. 2018. Wikipedia. https://en.wikipedia.org/wiki/Lucas_critique.Google Scholar

Lucas, Robert E. 1976. Econometric Policy Evaluation: A Critique. Vol. 1. In The Phillips Curve and Labor Markets, by Brunner, K. and Meltzer, A., 19–46. Carnegie-Rochester Conference on Public Policy.Google Scholar

Malinas, Gary, and Bigelow, John. 2004. “Simpson’s Paradox.” Stanford Encyclopedia of Philosophy. February 2. http://plato.stanford.edu/entries/paradox-simpson/.Google Scholar

Manzi, Jim. 2012. Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society. Basic Books.Google Scholar

Marks, Harry M. 1997. The Progress of Experiment: Science and Therapeutic Reform in the United States, 1900–1990. Cambridge University Press.Google Scholar

Marsden, Peter V., and Wright, James D. 2010. Handbook of Survey Research, 2nd Edition. Emerald Publishing Group Limited.Google Scholar

Marsh, Catherine, and Elliott, Jane. 2009. Exploring Data: An Introduction to Data Analysis for Social Scientists. 2nd edition. Polity.Google Scholar

Martin, Robert C. 2008. Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall.Google Scholar

Mason, Robert L., Gunst, Richard F., and Hess, James L. 1989. Statistical Design and Analysis of Experiments With Applications to Engineering and Science. John Wiley & Sons.Google Scholar

McChesney, Chris, Covey, Sean, and Huling, Jim. 2012. The 4 Disciplines of Execution: Achieving Your Wildly Important Goals. Free Press.Google Scholar

McClure, Dave. 2007. Startup Metrics for Pirates: AARRR!!! August 8. www.slideshare.net/dmc500hats/startup-metrics-for-pirates-long-version.Google Scholar

McCrary, Justin. 2008. “Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test.” Journal of Econometrics (142): 698–714.Google Scholar

McCullagh, Declan. 2006. AOL’s Disturbing Glimpse into Users’ Lives. August 9. www.cnet.com/news/aols-disturbing-glimpse-into-users-lives/.Google Scholar

McFarland, Colin. 2012. Experiment!: Website Conversion Rate Optimization with A/B and Multivariate Testing. New Riders.Google Scholar

McGue, Matt. 2014. Introduction to Human Behavioral Genetics, Unit 2: Twins: A Natural Experiment. Coursera. https://www.coursera.org/learn/behavioralgenetics/lecture/u8Zgt/2a-twins-a-natural-experiment.Google Scholar

McKinley, Dan. 2013. Testing to Cull the Living Flower. January. http://mcfunley.com/testing-to-cull-the-living-flower.Google Scholar

McKinley, Dan. 2012. Design for Continuous Experimentation: Talk and Slides. December 22. http://mcfunley.com/design-for-continuous-experimentation.Google Scholar

Mechanical Turk. 2019. Amazon Mechanical Turk. http://www.mturk.com.Google Scholar

Meenan, Patrick. 2012. “Speed Index.” WebPagetest. April. https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index.Google Scholar

Meenan, Patrick, Feng, Chao (Ray), and Petrovich, Mike. 2013. “Going Beyond Onload – How Fast Does It Feel?” Velocity: Web Performance and Operations conference, October 14–16. http://velocityconf.com/velocityny2013/public/schedule/detail/31344.Google Scholar

Meyer, Michelle N. 2018. “Ethical Considerations When Companies Study – and Fail to Study – Their Customers.” In The Cambridge Handbook of Consumer Privacy, by Selinger, Evan, Polonetsky, Jules and Tene, Omer. Cambridge University Press.Google Scholar

Meyer, Michelle N. 2015. “Two Cheers for Corporate Experimentation: The A/B Illusion and the Virtues of Data-Driven Innovation.” 13 Colo. Tech. L.J. 273. https://ssrn.com/abstract=2605132.Google Scholar

Meyer, Michelle N. 2012. Regulating the Production of Knowledge: Research Risk–Benefit Analysis and the Heterogeneity Problem. 65 Administrative Law Review 237; Harvard Public Law Working Paper. doi:http://dx.doi.org/10.2139/ssrn.2138624.Google Scholar

Meyer, Michelle N., Heck, Patrick R., Holtzman, Geoffrey S., Anderson, Stephen M., Cai, William, Watts, Duncan J., and Chabris, Christopher F. 2019. “Objecting to Experiments that Compare Two Unobjectionable Policies or Treatments.” PNAS: Proceedings of the National Academy of Sciences (National Academy of Sciences). doi:https://doi.org/10.1073/pnas.1820701116.Google Scholar

Milgram, Stanley. 2009. Obedience to Authority: An Experimental View. Harper Perennial Modern Thought.Google Scholar

Mitchell, Carl, Litz, Jonathan, Vaz, Garnet, and Drake, Andy. 2018. “Metrics Health Detection and AA Simulator.” Microsoft ExP (internal). August 13. https://aka.ms/exp/wiki/AASimulator.Google Scholar

Moran, Mike. 2008. Multivariate Testing in Action: Quicken Loan’s Regis Hadiaris on multivariate testing. December. www.biznology.com/2008/12/multivariate_testing_in_action/.Google Scholar

Moran, Mike. 2007. Do It Wrong Quickly: How the Web Changes the Old Marketing Rules. IBM Press.Google Scholar

Mosavat, Fareed. 2019. Twitter. Jan 29. https://twitter.com/far33d/status/1090400421842018304.Google Scholar

Mosteller, Frederick, Gilbert, John P., and McPeek, Bucknam. 1983. “Controversies in Design and Analysis of Clinical Trials.” In Clinical Trials, by Shapiro, Stanley H. and Louis, Thomas A. New York, NY: Marcel Dekker, Inc.Google Scholar

MR Web. 2014. “Obituary: Audience Measurement Veteran Tony Twyman.” Daily Research News Online. November 12. www.mrweb.com/drno/news20011.htm.Google Scholar

Mudholkar, Govind S., and George, E. Olusegun. 1979. “The Logit Method for Combining Probabilities.” Edited by Rustagi, J. Symposium on Optimizing Methods in Statistics. Academic Press. 345–366. https://apps.dtic.mil/dtic/tr/fulltext/u2/a049993.pdf.Google Scholar

Mueller, Hendrik, and Sedley, Aaron. 2014. “HaTS: Large-Scale In-Product Measurement of User Attitudes & Experiences with Happiness Tracking Surveys.” OZCHI, December.Google Scholar

Neumann, Chris. 2017. Does Optimizely Slow Down a Site’s Performance? October 18. https://www.quora.com/Does-Optimizely-slow-down-a-sites-performance.Google Scholar

Newcomer, Kathryn E., Hatry, Harry P., and Wholey, Joseph S. 2015. Handbook of Practical Program Evaluation (Essential Texts for Nonprofit and Public Leadership and Management). Wiley.Google Scholar

Neyman, J. 1923. “On the Application of Probability Theory to Agricultural Experiments.” Statistical Science 465–472.Google Scholar

NSF. 2018. Frequently Asked Questions and Vignettes: Interpreting the Common Rule for the Protection of Human Subjects for Behavioral and Social Science Research. www.nsf.gov/bfa/dias/policy/hsfaqs.jsp.Google Scholar

Office for Human Research Protections. 1991. Federal Policy for the Protection of Human Subjects (‘Common Rule’). www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html.Google Scholar

Optimizely. 2018. “A/A Testing.” Optimizely. www.optimizely.com/optimization-glossary/aa-testing/.Google Scholar

Optimizely. 2018. “Implement the One-Line Snippet for Optimizely X.” Optimizely. February 28. https://help.optimizely.com/Set_Up_Optimizely/Implement_the_one-line_snippet_for_Optimizely_X.Google Scholar

Optimizely. 2018. Optimizely Maturity Model. www.optimizely.com/maturity-model/.Google Scholar

Orlin, Ben. 2016. Why Not to Trust Statistics. July 13. https://mathwithbaddrawings.com/2016/07/13/why-not-to-trust-statistics/.Google Scholar

Owen, Art, and Varian, Hal. 2018. Optimizing the Tie-Breaker Regression Discontinuity Design. August. http://statweb.stanford.edu/~owen/reports/tiebreaker.pdf.Google Scholar

Oxford Centre for Evidence-based Medicine. 2009. Oxford Centre for Evidence-based Medicine – Levels of Evidence. March. www.cebm.net/oxford-centre-evidence-based-medicine-levels-evidence-march-2009/.Google Scholar

Park, David K., Gelman, Andrew, and Bafumi, Joseph. 2004. “Bayesian Multilevel Estimation with Poststratification: State-Level Estimates from National Polls.” Political Analysis 375–385.Google Scholar

Parmenter, David. 2015. Key Performance Indicators: Developing, Implementing, and Using Winning KPIs. 3rd edition. John Wiley & Sons, Inc.Google Scholar

Pearl, Judea. 2009. Causality: Models, Reasoning and Inference. 2nd edition. Cambridge University Press.Google Scholar

Pekelis, Leonid. 2015. “Statistics for the Internet Age: The Story behind Optimizely’s New Stats Engine.” Optimizely. January 20. https://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-the-story-behind-optimizelys-new-stats-engine/.Google Scholar

Pekelis, Leonid, Walsh, David, and Johari, Ramesh. 2015. “The New Stats Engine.” Optimizely. www.optimizely.com/resources/stats-engine-whitepaper/.Google Scholar

Peterson, Eric T. 2005. Web Site Measurement Hacks. O’Reilly Media.Google Scholar

Peterson, Eric T. 2004. Web Analytics Demystified: A Marketer’s Guide to Understanding How Your Web Site Affects Your Business. Celilo Group Media and CafePress.Google Scholar

Pfeffer, Jeffrey, and Sutton, Robert I. 1999. The Knowing-Doing Gap: How Smart Companies Turn Knowledge into Action. Harvard Business Review Press.Google Scholar

Phillips, A. W. 1958. “The Relation between Unemployment and the Rate of Change of Money Wage Rates in the United Kingdom, 1861–1957.” Economica, New Series 25 (100): 283–299. www.jstor.org/stable/2550759.Google Scholar

Porter, Michael E. 1998. Competitive Strategy: Techniques for Analyzing Industries and Competitors. Free Press.Google Scholar

Porter, Michael E. 1996. “What is Strategy.” Harvard Business Review 61–78.Google Scholar

Quarto-vonTivadar, John. 2006. “AB Testing: Too Little, Too Soon.” Future Now. www.futurenowinc.com/abtesting.pdf.Google Scholar

Radlinski, Filip, and Craswell, Nick. 2013. “Optimized Interleaving For Online Retrieval Evaluation.” International Conference on Web Search and Data Mining. Rome, IT: ACM. 245–254.Google Scholar

Rae, Barclay. 2014. “Watermelon SLAs – Making Sense of Green and Red Alerts.” Computer Weekly. September. https://www.computerweekly.com/opinion/Watermelon-SLAs-making-sense-of-green-and-red-alerts.Google Scholar

RAND. 1955. A Million Random Digits with 100,000 Normal Deviates. Glencoe, Ill: Free Press. www.rand.org/pubs/monograph_reports/MR1418.html.Google Scholar

Rawat, Girish. 2018. “Why Most Redesigns fail.” freeCodeCamp. December 4. https://medium.freecodecamp.org/why-most-redesigns-fail-6ecaaf1b584e.Google Scholar

Razali, Nornadiah Mohd, and Wah, Yap Bee. 2011. “Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests.” Journal of Statistical Modeling and Analytics, January 1: 21–33.Google Scholar

Reinhardt, Peter. 2016. Effect of Mobile App Size on Downloads. October 5. https://segment.com/blog/mobile-app-size-effect-on-downloads/.Google Scholar

Resnick, David. 2015. What is Ethics in Research & Why is it Important? December 1. www.niehs.nih.gov/research/resources/bioethics/whatis/index.cfm.Google Scholar

Ries, Eric. 2011. The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business.Google Scholar

Rodden, Kerry, Hutchinson, Hilary, and Fu, Xin. 2010. “Measuring the User Experience on a Large Scale: User-Centered Metrics for Web Applications.” Proceedings of CHI, April. https://ai.google/research/pubs/pub36299.Google Scholar

Romano, Joseph, Shaikh, Azeem M., and Wolf, Michael. 2016. “Multiple Testing.” In The New Palgrave Dictionary of Economics. Palgrave Macmillan.Google Scholar

Rosenbaum, Paul R, and Rubin, Donald B. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. doi:http://dx.doi.org/10.1093/biomet/70.1.41.Google Scholar

Rossi, Peter H., Lipsey, Mark W., and Freeman, Howard E. 2004. Evaluation: A Systematic Approach. 7th edition. Sage Publications, Inc.Google Scholar

Roy, Ranjit K. 2001. Design of Experiments using the Taguchi Approach: 16 Steps to Product and Process Improvement. John Wiley & Sons, Inc.Google Scholar

Rubin, Donald B. 1990. “Formal Modes of Statistical Inference for Causal Effects.” Journal of Statistical Planning and Inference 25 (3): 279–292.Google Scholar

Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701.Google Scholar

Rubin, Kenneth S. 2012. Essential Scrum: A Practical Guide to the Most Popular Agile Process. Addison-Wesley Professional.Google Scholar

Russell, Daniel M., and Grimes, Carrie. 2007. “Assigned Tasks Are Not the Same as Self-Chosen Web Searches.” HICSS'07: 40th Annual Hawaii International Conference on System Sciences, January. https://doi.org/10.1109/HICSS.2007.91.Google Scholar

Saint-Jacques, Guillaume B., Aral, Sinan, Airoldi, Edoardo, Brynjolfsson, Erik, and Xu, Ya. 2018. “The Strength of Weak Ties: Causal Evidence using People-You-May-Know Randomizations.” 141–152.Google Scholar

Saint-Jacques, Guillaume, Varshney, Maneesh, Simpson, Jeremy, and Xu, Ya. 2018. “Using Ego-Clusters to Measure Network Effects at LinkedIn.” Workshop on Information Systems and Economics. San Francisco, CA.Google Scholar

Samarati, Pierangela, and Sweeney, Latanya. 1998. “Protecting Privacy When Disclosing Information: k-anonymity and its Enforcement through Generalization and Suppression.” Proceedings of the IEEE Symposium on Research in Security and Privacy.Google Scholar

Schrage, Michael. 2014. The Innovator’s Hypothesis: How Cheap Experiments Are Worth More than Good Ideas. MIT Press.Google Scholar

Schrijvers, Ard. 2017. “Mobile Website Too Slow? Your Personalization Tools May Be to Blame.” Bloomreach. February 2. www.bloomreach.com/en/blog/2017/01/server-side-personalization-for-fast-mobile-pagespeed.html.Google Scholar

Schurman, Eric, and Brutlag, Jake. 2009. “Performance Related Changes and their User Impact.” Velocity 09: Velocity Web Performance and Operations Conference. www.youtube.com/watch?v=bQSE51-gr2s and www.slideshare.net/dyninc/the-user-and-business-impact-of-server-delays-additional-bytes-and-http-chunking-in-web-search-presentation.Google Scholar

Scott, Steven L. 2010. “A modern Bayesian look at the multi-armed bandit.” Applied Stochastic Models in Business and Industry 26 (6): 639–658. doi:https://doi.org/10.1002/asmb.874.Google Scholar

Segall, Ken. 2012. Insanely Simple: The Obsession That Drives Apple’s Success. Portfolio Hardcover.Google Scholar

Senn, Stephen. 2012. “Seven myths of randomisation in clinical trials.” Statistics in Medicine. doi:10.1002/sim.5713.Google Scholar

Shadish, William R., Cook, Thomas D., and Campbell, Donald T. 2001. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. 2nd edition. Cengage Learning.Google Scholar

Simpson, Edward H. 1951. “The Interpretation of Interaction in Contingency Tables.” Journal of the Royal Statistical Society, Ser. B, 238–241.Google Scholar

Sinofsky, Steven, and Iansiti, Marco. 2009. One Strategy: Organization, Planning, and Decision Making. Wiley.Google Scholar

Siroker, Dan, and Koomen, Pete. 2013. A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. Wiley.Google Scholar

Soriano, Jacopo. 2017. “Percent Change Estimation in Large Scale Online Experiments.” arXiv.org. November 3. https://arxiv.org/pdf/1711.00562.pdf.Google Scholar

Souders, Steve. 2013. “Moving Beyond window.onload().” High Performance Web Sites Blog. May 13. www.stevesouders.com/blog/2013/05/13/moving-beyond-window-onload/.Google Scholar

Souders, Steve. 2009. Even Faster Web Sites: Performance Best Practices for Web Developers. O’Reilly Media.Google Scholar

Souders, Steve. 2007. High Performance Web Sites: Essential Knowledge for Front-End Engineers. O’Reilly Media.Google Scholar

Spitzer, Dean R. 2007. Transforming Performance Measurement: Rethinking the Way We Measure and Drive Organizational Success. AMACOM.Google Scholar

Stephens-Davidowitz, Seth, Varian, Hal, and Smith, Michael D. 2017. “Super Returns to Super Bowl Ads?” Quantitative Marketing and Economics, March 1: 1–28.Google Scholar

Sterne, Jim. 2002. Web Metrics: Proven Methods for Measuring Web Site Success. John Wiley & Sons, Inc.Google Scholar

Strathern, Marilyn. 1997. “‘Improving ratings’: Audit in the British University System.” European Review 5 (3): 305–321. doi:10.1002/(SICI)1234-981X(199707)5:33.0.CO;2-4.Google Scholar

Student. 1908. “The Probable Error of a Mean.” Biometrika 6 (1): 1–25. https://www.jstor.org/stable/2331554.Google Scholar

Sullivan, Nicole. 2008. “Design Fast Websites.” Slideshare. October 14. www.slideshare.net/stubbornella/designing-fast-websites-presentation.Google Scholar

Tang, Diane, Agarwal, Ashish, O’Brien, Deirdre, and Meyer, Mike. 2010. “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation.” Proceedings 16th Conference on Knowledge Discovery and Data Mining.Google Scholar

The Guardian. 2014. OKCupid: We Experiment on Users. Everyone does. July 29. www.theguardian.com/technology/2014/jul/29/okcupid-experiment-human-beings-dating.Google Scholar

The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. 1979. The Belmont Report. April 18. www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html.Google Scholar

Thistlethwaite, Donald L., and Campbell, Donald T. 1960. “Regression-Discontinuity Analysis: An Alternative to the Ex-Post Facto Experiment.” Journal of Educational Psychology 51 (6): 309–317. doi:https://doi.org/10.1037%2Fh0044319.Google Scholar

Thomke, Stefan H. 2003. “Experimentation Matters: Unlocking the Potential of New Technologies for Innovation.”Google Scholar

Tiffany, Kaitlyn. 2017. “This Instagram Story Ad with a Fake Hair in It is Sort of Disturbing.” The Verge. December 11. www.theverge.com/tldr/2017/12/11/16763664/sneaker-ad-instagram-stories-swipe-up-trick.Google Scholar

Tolomei, Sam. 2017. Shrinking APKs, growing installs. November 20. https://medium.com/googleplaydev/shrinking-apks-growing-installs-5d3fcba23ce2.Google Scholar

Tutterow, Craig, and Saint-Jacques, Guillaume. 2019. Estimating Network Effects Using Naturally Occurring Peer Notification Queue Counterfactuals. February 19. https://arxiv.org/abs/1902.07133.Google Scholar

Tyler, Mary E., and Ledford, Jerri. 2006. Google Analytics. Wiley Publishing, Inc.Google Scholar

Tyurin, I.S. 2009. “On the Accuracy of the Gaussian Approximation.” Doklady Mathematics 429 (3): 312–316.Google Scholar

Ugander, Johan, Karrer, Brian, Backstrom, Lars, and Kleinberg, Jon. 2013. “Graph Cluster Randomization: Network Exposure to Multiple Universes.” Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 329–337.Google Scholar

van Belle, Gerald. 2008. Statistical Rules of Thumb. 2nd edition. Wiley-Interscience.Google Scholar

Vann, Michael G. 2003. “Of Rats, Rice, and Race: The Great Hanoi Rat Massacre, an Episode in French Colonial History.” French Colonial History 4: 191–203. https://muse.jhu.edu/article/42110.Google Scholar

Varian, Hal. 2016. “Causal inference in economics and marketing.” Proceedings of the National Academy of Sciences of the United States of America 7310–7315.Google Scholar

Varian, Hal R. 2007. “Kaizen, That Continuous Improvement Strategy, Finds Its Ideal Environment.” The New York Times. February 8. www.nytimes.com/2007/02/08/business/08scene.html.Google Scholar

Vaver, Jon, and Koehler, Jim. 2012. Periodic Measurement of Advertising Effectiveness Using Multiple-Test Period Geo Experiments. Google Inc.Google Scholar

Vaver, Jon, and Koehler, Jim. 2011. Measuring Ad Effectiveness Using Geo Experiments. Google, Inc.Google Scholar

Vickers, Andrew J. 2009. What Is a p-value Anyway? 34 Stories to Help You Actually Understand Statistics. Pearson. www.amazon.com/p-value-Stories-Actually-Understand-Statistics/dp/0321629302.Google Scholar

Vigen, Tyler. 2018. Spurious Correlations. http://tylervigen.com/spurious-correlations.Google Scholar

Wager, Stefan, and Athey, Susan. 2018. “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–1242. doi:https://doi.org/10.1080/01621459.2017.1319839.Google Scholar

Wagner, Jeremy. 2019. “Why Performance Matters.” Web Fundamentals. May. https://developers.google.com/web/fundamentals/performance/why-performance-matters/#performance_is_about_improving_conversions.Google Scholar

Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer.Google Scholar

Weiss, Carol H. 1997. Evaluation: Methods for Studying Programs and Policies. 2nd edition. Prentice Hall.Google Scholar

Wider Funnel. 2018. “The State of Experimentation Maturity 2018.” Wider Funnel. www.widerfunnel.com/wp-content/uploads/2018/04/State-of-Experimentation-2018-Original-Research-Report.pdf.Google Scholar

Wikipedia contributors, Above the Fold. 2014. Wikipedia, The Free Encyclopedia. Jan. http://en.wikipedia.org/wiki/Above_the_fold.Google Scholar

Wikipedia contributors, Cobra Effect. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Cobra_effect.Google Scholar

Wikipedia contributors, Data Dredging. 2019. Data dredging. https://en.wikipedia.org/wiki/Data_dredging.Google Scholar

Wikipedia contributors, Eastern Air Lines Flight 401. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Eastern_Air_Lines_Flight_401.Google Scholar

Wikipedia contributors, List of .NET libraries and frameworks. 2019. https://en.wikipedia.org/wiki/List_of_.NET_libraries_and_frameworks#Logging_Frameworks.Google Scholar

Wikipedia contributors, Logging as a Service. 2019. Logging as a Service. https://en.wikipedia.org/wiki/Logging_as_a_service.Google Scholar

Wikipedia contributors, Multiple Comparisons Problem. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Multiple_comparisons_problem.Google Scholar

Wikipedia contributors, Perverse Incentive. 2019. https://en.wikipedia.org/wiki/Perverse_incentive.Google Scholar

Wikipedia contributors, Privacy by Design. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Privacy_by_design.Google Scholar

Wikipedia contributors, Semmelweis Reflex. 2019. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Semmelweis_reflex.Google Scholar

Wikipedia contributors, Simpson’s Paradox. 2019. Wikipedia, The Free Encyclopedia. Accessed February 28, 2008. http://en.wikipedia.org/wiki/Simpson%27s_paradox.Google Scholar

Wolf, Talia. 2018. “Why Most Redesigns Fail (and How to Make Sure Yours Doesn’t).” GetUplift. https://getuplift.co/why-most-redesigns-fail.Google Scholar

Xia, Tong, Bhardwaj, Sumit, Dmitriev, Pavel, and Fabijan, Aleksander. 2019. “Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout.” ICSE: 41st ACM/IEEE International Conference on Software Engineering. Montreal, Canada. www.researchgate.net/publication/333614382_Safe_Velocity_A_Practical_Guide_to_Software_Deployment_at_Scale_using_Controlled_Rollout.Google Scholar

Xie, Huizhi, and Aurisset, Juliette. 2016. “Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix.” KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: ACM. 645–654. http://doi.acm.org/10.1145/2939672.2939733.Google Scholar

Xu, Ya, and Chen, Nanyu. 2016. “Evaluating Mobile Apps with A/B and Quasi A/B Tests.” KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA: ACM. 313–322. http://doi.acm.org/10.1145/2939672.2939703.Google Scholar

Xu, Ya, Duan, Weitao, and Huang, Shaochen. 2018. “SQR: Balancing Speed, Quality and Risk in Online Experiments.” 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. London: Association for Computing Machinery. 895–904.Google Scholar

Xu, Ya, Chen, Nanyu, Fernandez, Adrian, Sinno, Omar, and Bhasin, Anmol. 2015. “From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks.” KDD ’15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney, NSW, Australia: ACM. 2227–2236. http://doi.acm.org/10.1145/2783258.2788602.Google Scholar

Yoon, Sangho. 2018. Designing A/B Tests in a Collaboration Network. www.unofficialgoogledatascience.com/2018/01/designing-ab-tests-in-collaboration.html.Google Scholar

Young, S. Stanley, and Karr, Allan. 2011. “Deming, data and observational studies: A process out of control and needing fixing.” Significance 8 (3).Google Scholar

Zhang, Fan, Joseph, Joshy, Rickabaugh, James Alexander, and Zhuang, Peng. 2018. Client-Side Activity Monitoring. US Patent US 10,165,071 B2. December 25.Google Scholar

Zhao, Zhenyu, Chen, Miao, Matheson, Don, and Stone, Maria. 2016. “Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation.” DSAA 2016: IEEE International Conference on Data Science and Advanced Analytics. IEEE. 498–507. https://ieeexplore.ieee.org/document/7796936.Google Scholar
