For a test to be useful, it must be informative; that is, it must (at least some of the time) give different results depending on what is going on. In Chapter 1, we said we would simplify (at least initially) what is going on into just two homogeneous alternatives, D+ and D−. In this chapter, we consider the simplest type of tests, dichotomous tests, which have only two possible results (T+ and T−).
A test should give the same or similar results when administered repeatedly to the same individual within a time too short for real biological variation to take place. Results should be consistent whether the test is repeated by the same observer or instrument or by different observers or instruments. This desirable characteristic of a test is called “reliability” or “reproducibility.”
We have learned how to quantify the accuracy of dichotomous (Chapter 2) and multilevel (Chapter 3) tests. In this chapter, we turn to critical appraisal of studies of diagnostic test accuracy, with an emphasis on problems with study design that affect the interpretation or credibility of the results. After a general discussion of an approach to studies of diagnostic tests, we will review some common biases to which studies of test accuracy are uniquely or especially susceptible and conclude with an introduction to systematic reviews of test accuracy studies.
While screening tests share some features with diagnostic tests, they deserve a chapter of their own because of important differences. Whereas we generally do diagnostic tests on sick people to determine the cause of their symptoms, we generally do screening tests on healthy people with a low prior probability of disease. The problems of false positives and harms of treatment loom larger. In Chapter 4, on evaluating studies of diagnostic test accuracy, we assumed that accurate diagnosis would lead to better outcomes. The benefits and harms of screening tests are so closely tied to the associated treatments that it is hard to evaluate diagnosis and treatment separately. Instead, we compare outcomes such as mortality between those who receive the screening test and those who don’t. We postponed our discussion of screening until after our discussion of randomized trials because randomized trials are a key element in the evaluation of screening tests. Finally, because decisions about screening are often made at the population level, political and other nonmedical factors are more influential. Thus, in this chapter, we focus explicitly on the question of whether doing a screening test improves health, not just on how it alters disease probabilities, and we pay particular attention to biases and nonmedical factors that can lead to excessive screening.
In previous chapters, we discussed issues affecting evaluation and use of diagnostic tests: how to assess test reliability and accuracy, how to combine the results of tests with prior information to estimate disease probability, and how a test’s value depends on the decision it will guide and the relative cost of errors. In this chapter, we move from diagnosing prevalent disease to predicting incident outcomes. We will discuss the difference between diagnostic tests and risk predictions and then focus on evaluating predictions, specifically covering calibration, discrimination, net benefit calculations, and decision curves.
As we noted in the Preface and Chapter 1, because the purpose of doing diagnostic tests is often to determine how to treat the patient, we may need to quantify the effects of treatment to decide whether to do a test. For example, if the treatment for a disease provides a dramatic benefit, we should have a lower threshold for testing for that disease than if the treatment is of marginal or unknown efficacy. In Chapters 2, 3, and 6, we showed how the expected benefit of testing depends on the treatment threshold probability (P_TT = C/[C + B]) in addition to the prior probability and test characteristics. In this chapter, we discuss how to quantify the benefits and harms of treatments (which determine C and B) using the results of randomized trials. In Chapter 9, we will extend the discussion to observational studies of treatment efficacy; in Chapter 10, we will look at screening tests themselves as treatments and how to quantify their efficacy.
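The treatment threshold formula quoted above can be illustrated with a small sketch. It assumes, as an interpretation not spelled out in this summary, that C is the net cost of treating a patient who does not have the disease and B is the net benefit of treating a patient who does, both in the same units. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def treatment_threshold(c: float, b: float) -> float:
    """Treatment threshold probability, C / (C + B).

    Assumed meaning (not stated in the summary itself):
      c: net cost of treating a patient without the disease
      b: net benefit of treating a patient with the disease
    Both must be non-negative and in the same units.
    """
    if c < 0 or b < 0:
        raise ValueError("c and b must be non-negative")
    if c + b == 0:
        raise ValueError("c and b cannot both be zero")
    return c / (c + b)


# Hypothetical numbers: a treatment whose benefit is nine times its cost
# gives a lower threshold than one whose benefit merely equals its cost.
dramatic_benefit = treatment_threshold(c=1.0, b=9.0)
marginal_benefit = treatment_threshold(c=1.0, b=1.0)
print(dramatic_benefit, marginal_benefit)
```

With these made-up inputs, the larger benefit yields the smaller threshold, which matches the point that a dramatic treatment benefit justifies acting at a lower probability of disease.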
In the previous two chapters, we discussed using the results of randomized trials and observational studies to estimate treatment effects. We were primarily interested in measures of effect size and in problems with design (in randomized trials) and confounding (in observational studies) that could bias effect estimates. We did not focus on whether the apparent treatment effects could be a result of chance or attempt to quantify the precision of our effect estimates. The statistics used to help us with these issues, P-values and confidence intervals, are the subject of this chapter.
We wrestled for a long time with the question of whether to include the term “evidence-based” in the title of the first edition of this book. Although both of us are firm believers in the principles and goals of evidence-based medicine (EBM), as articulated by its first proponents, we also knew that the term “evidence-based” would be viewed negatively by some potential readers [2–4]. We decided to keep “evidence-based” in the title and use this chapter to directly address some of the criticisms of EBM, many of which we believe have merit. We also recognize that, as elegant and satisfying as evidence-based diagnosis is, there are some very real cognitive barriers to applying it in a clinical setting. These barriers are the second topic of this chapter. Finally, we end the book with some thoughts on the future of evidence-based diagnosis and why it will be increasingly important.
At this point, we know how to use the result of a single test to update the probability of disease but not how to combine the results from multiple tests, and we can evaluate risk prediction models but not create them. In making a clinical treatment decision (or any other decision), we usually consider multiple variables. This chapter is about combining the results of multiple tests with other information to estimate the probability of a disease or the risk of an outcome. We begin by reviewing the concept of test independence and then discuss how to deal with departures from independence, which are probably the rule rather than the exception. Next, we cover two common methods of combining variables to predict a binary condition or outcome: classification trees and logistic regression. Finally, we discuss the process and pitfalls of variable selection and the importance of model validation.
We said in Chapter 8 that randomized blinded trials are the best way to estimate treatment effects because they minimize the potential for confounding, co-interventions, and bias, thus maximizing the strength of causal inference. However, sometimes observational studies can be attractive alternatives to randomized trials because they may be more feasible, ethical, or elegant. Of course, the issue of inferring causality from observational studies is a major topic in classical risk factor epidemiology. In this chapter, we focus on observational studies of treatments rather than risk factors, describing methods of reducing or assessing confounding that are particularly applicable to such studies.