One approach for dealing with intractability is to utilize representations that permit certain queries of interest to be computable in polytime. Such tractable representations will ultimately be exponential in size for certain problems and they may also not be suitable for direct specification by users. Hence, they are typically generated from other specifications through a process known as knowledge compilation. In this chapter, we review a subset of these tractable representations, known as decomposable negation normal forms (DNNFs), which have proved influential in a number of applications, including formal verification, model-based diagnosis and probabilistic reasoning.
Many areas of computer science have shown a great interest in tractable and canonical representations of propositional knowledge bases (aka Boolean functions). The ordered binary decision diagram (OBDD) is one such representation that has received much attention and proved quite influential in a variety of areas. Within AI, the study of tractable representations also has a long tradition (e.g., [61, 30, 31, 49, 62, 14, 28, 19, 13, 52, 66, 50]). This area of research, also known as knowledge compilation, has become more systematic since it was shown that many known and useful representations are subsets of negation normal form (NNF) and correspond to imposing specific properties on NNF. The most fundamental of these properties turned out to be decomposability and determinism, giving rise to the corresponding language of DNNF and its subset, d-DNNF. This chapter is dedicated to DNNF and its subsets, which also include the influential language of OBDDs and the more recently introduced sentential decision diagrams (SDDs).
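As a concrete illustration of why decomposability buys tractability, here is a minimal sketch (using a hypothetical tuple encoding of circuits, not anything from this chapter): satisfiability of a DNNF circuit can be decided in a single bottom-up pass, because the conjuncts of every and-node mention disjoint sets of variables and can therefore be checked independently.

```python
# Satisfiability test on a DNNF circuit (hypothetical node encoding).
# Decomposability -- the children of every "and" node share no
# variables -- is what makes this linear-time pass sound: an "and" is
# satisfiable iff all children are, since they constrain disjoint variables.

def dnnf_sat(node):
    kind = node[0]
    if kind == "lit":          # ("lit", var, sign): always satisfiable
        return True
    if kind == "true":
        return True
    if kind == "false":
        return False
    if kind == "or":           # ("or", child1, child2, ...)
        return any(dnnf_sat(c) for c in node[1:])
    if kind == "and":          # ("and", child1, child2, ...)
        return all(dnnf_sat(c) for c in node[1:])
    raise ValueError(kind)

# (A or B) and (not C): decomposable, since {A, B} and {C} are disjoint
circuit = ("and",
           ("or", ("lit", "A", True), ("lit", "B", True)),
           ("lit", "C", False))
print(dnnf_sat(circuit))   # True
```

Without decomposability this test would be unsound: an and-node whose children share a variable could have each child satisfiable on its own yet no common satisfying assignment.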
We introduce probability calculus in this chapter as a tool for representing and reasoning with degrees of belief.
We provide in this chapter a framework for representing and reasoning with uncertain beliefs. According to this framework, each event is assigned a degree of belief which is interpreted as a probability that quantifies the belief in that event. Our focus in this chapter is on the semantics of degrees of belief, where we discuss their properties and the methods for revising them in light of new evidence. Computational and practical considerations relating to degrees of belief are discussed at length in future chapters.
We start in Section 3.2 by introducing degrees of belief, their basic properties, and the way they can be used to quantify uncertainty. We discuss the updating of degrees of belief in Section 3.3, where we show how they can increase or decrease depending on the new evidence made available. We then turn to the notion of independence in Section 3.4, which will be fundamental when reasoning about uncertain beliefs. The properties of degrees of belief are studied further in Section 3.5, where we introduce some of the key laws for manipulating them. We finally treat the subject of soft evidence in Sections 3.6 and 3.7, where we provide some tools for updating degrees of belief in light of uncertain information.
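The flavor of Sections 3.2 and 3.3 can be previewed with a small sketch using an invented weather example: degrees of belief form a normalized assignment over worlds, and evidence is absorbed by zeroing out the excluded worlds and renormalizing what remains.

```python
# Degrees of belief as a distribution over worlds (a made-up example,
# not the chapter's): evidence is absorbed by zeroing excluded worlds
# and renormalizing the rest.
belief = {                      # world -> degree of belief
    ("sunny", "dry"): 0.5,
    ("sunny", "wet"): 0.1,
    ("rainy", "dry"): 0.1,
    ("rainy", "wet"): 0.3,
}

def condition(belief, accepts):
    """New degrees of belief given evidence (a predicate on worlds)."""
    total = sum(p for w, p in belief.items() if accepts(w))
    return {w: (p / total if accepts(w) else 0.0)
            for w, p in belief.items()}

# Belief in rain rises from 0.4 to 0.75 after observing wet grass.
posterior = condition(belief, lambda w: w[1] == "wet")
print(posterior[("rainy", "wet")])   # 0.75
```

Note how the update can either raise or lower a belief: here the evidence raised the belief in rain, while the belief in sunshine dropped from 0.6 to 0.25.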
Degrees of belief
We have seen in Chapter 2 that a propositional knowledge base Δ classifies sentences into one of three categories: sentences that are implied by Δ, sentences whose negations are implied by Δ, and all other sentences (see Figure 2.2).
We consider in this chapter the problem of finding variable instantiations that have maximal probability under some given evidence. We present two classes of exact algorithms for this problem, one based on variable elimination and the other based on systematic search. We also present approximate algorithms based on local search.
Consider the Bayesian network in Figure 10.1, which concerns a population that is 55% male and 45% female. According to this network, members of this population can suffer from a medical condition C that is more likely to occur in males. Moreover, two diagnostic tests are available for detecting this condition, T1 and T2, with the second test being more effective on females. The CPTs of this network also reveal that the two tests are equally effective on males.
One can partition the members of this population into four different groups depending on whether they are male or female and whether they have the condition or not. Suppose that a person takes both tests and all we know is that the two tests yield the same result, leading to the evidence A=yes. We may then ask: What is the most likely group to which this individual belongs? This query is therefore asking for the most likely instantiation of variables S and C given evidence A=yes, which is technically known as a MAP instantiation. We have already discussed this class of queries in Chapter 5, where we referred to variables S and C as the MAP variables.
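To make the query concrete, here is a brute-force sketch with invented numbers (not the CPTs of Figure 10.1): a MAP instantiation simply maximizes the joint probability of the MAP variables together with the evidence.

```python
# Brute-force MAP over an explicit joint distribution (illustrative
# numbers only, not those of Figure 10.1). The MAP variables are S (sex)
# and C (condition); the evidence fixes A (tests agree) to "yes".
import itertools

vals = {"S": ["male", "female"], "C": ["yes", "no"]}

joint = {   # hypothetical joint P(S, C, A), summing to 1
    ("male", "yes", "yes"): 0.06, ("male", "yes", "no"): 0.02,
    ("male", "no", "yes"): 0.30, ("male", "no", "no"): 0.17,
    ("female", "yes", "yes"): 0.02, ("female", "yes", "no"): 0.02,
    ("female", "no", "yes"): 0.25, ("female", "no", "no"): 0.16,
}

def map_instantiation(evidence_a):
    best, best_p = None, -1.0
    for s, c in itertools.product(vals["S"], vals["C"]):
        # P(s, c, e): A is fully fixed by the evidence here, so no
        # summation over hidden variables is needed.
        p = joint[(s, c, evidence_a)]
        if p > best_p:
            best, best_p = (s, c), p
    return best, best_p

print(map_instantiation("yes"))   # (('male', 'no'), 0.3)
```

Exhaustive enumeration is exponential in the number of MAP variables, which is exactly why the chapter develops elimination- and search-based algorithms instead.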
Bayesian networks have received a lot of attention over the last few decades from both scientists and engineers, and across a number of fields, including artificial intelligence (AI), statistics, cognitive science, and philosophy.
Perhaps the largest impact that Bayesian networks have had is on the field of AI, where they were first introduced by Judea Pearl in the midst of a crisis that the field was undergoing in the late 1970s and early 1980s. This crisis was triggered by the surprising realization that a theory of plausible reasoning cannot be based solely on classical logic [McCarthy, 1977], as was strongly believed within the field for at least two decades [McCarthy, 1959]. This discovery has triggered a large number of responses by AI researchers, leading, for example, to the development of a new class of symbolic logics known as non-monotonic logics (e.g., [McCarthy, 1980; Reiter, 1980; McDermott and Doyle, 1980]). Pearl's introduction of Bayesian networks, which is best documented in his book [Pearl, 1988], was actually part of his larger response to these challenges, in which he advocated the use of probability theory as a basis for plausible reasoning and developed Bayesian networks as a practical tool for representing and computing probabilistic beliefs.
From a historical perspective, the earliest traces of using graphical representations of probabilistic information can be found in statistical physics [Gibbs, 1902] and genetics [Wright, 1921]. However, the current formulations of these representations are of a more recent origin and have been contributed by scientists from many fields.
We consider in this chapter the relationship between the values of parameters that quantify a Bayesian network and the values of probabilistic queries applied to these networks. In particular, we consider the impact of parameter changes on query values, and the amount of parameter change needed to enforce some constraints on these values.
Consider a laboratory that administers three tests for detecting pregnancy: a blood test, a urine test, and a scanning test. Assume also that these tests relate to the state of pregnancy as given by the network of Figure 16.1 (we treated this network in Chapter 5). According to this network, the prior probability of pregnancy is 87% after an artificial insemination procedure. Moreover, the posterior probability of pregnancy given three negative tests is 10.21%. Suppose now that this level of accuracy is not acceptable: the laboratory is interested in improving the tests so the posterior probability is no greater than 5% given three negative tests. The problem now becomes one of finding a certain set of network parameters (corresponding to the tests' false positive and negative rates) that guarantee the required accuracy. This is a classic problem of sensitivity analysis that we address in Section 16.3 as it is concerned with controlling network parameters to enforce some constraints on the queries of interest.
Assume now that we replace one of the tests with a more accurate one, leading to a new Bayesian network that results from updating the parameters corresponding to that test.
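The spirit of such analyses can be sketched on a deliberately simplified version of the problem: a single test with invented rates (not the three-test network of Figure 16.1), and a sweep over the false-negative parameter to see how the posterior responds.

```python
# Sensitivity of a posterior to one network parameter, sketched on a
# two-node network Pregnant -> Test with invented rates (the chapter's
# network has three tests and different CPTs).
def posterior_pregnant(false_negative):
    prior = 0.87                     # P(pregnant), as in the chapter
    false_positive = 0.10            # P(test = pos | not pregnant), invented
    # P(pregnant | test = negative) by Bayes rule
    p_neg_given_preg = false_negative
    p_neg_given_not = 1 - false_positive
    num = prior * p_neg_given_preg
    den = num + (1 - prior) * p_neg_given_not
    return num / den

# Sweep the false-negative rate to see how the posterior responds.
for fn in [0.10, 0.05, 0.02, 0.01]:
    print(fn, round(posterior_pregnant(fn), 4))
```

Even a very accurate single test cannot push the posterior below 5% against an 87% prior, which hints at why the scenario above requires three tests and a search over several parameters at once.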
We introduce Bayesian networks in this chapter as a modeling tool for compactly specifying joint probability distributions.
We have seen in Chapter 3 that joint probability distributions can be used to model uncertain beliefs and change them in the face of hard and soft evidence. We have also seen that the size of a joint probability distribution is exponential in the number of variables of interest, which introduces both modeling and computational difficulties. Even if these difficulties are addressed, one still needs to ensure that the synthesized distribution matches the beliefs held about a given situation. For example, if we are building a distribution that captures the beliefs of a medical expert, we may need to ensure some correspondence between the independencies held by the distribution and those believed by the expert. This may not be easy to enforce if the distribution is constructed by listing all possible worlds and assessing the belief in each world directly.
The Bayesian network is a graphical modeling tool for specifying probability distributions that, in principle, can address all of these difficulties. The Bayesian network relies on the basic insight that independence forms a significant aspect of beliefs and that it can be elicited relatively easily using the language of graphs. We start our discussion in Section 4.2 by exploring this key insight, and use our developments in Section 4.3 to provide a formal definition of the syntax and semantics of Bayesian networks.
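The key economy can be previewed with a sketch (a made-up two-variable network, not one from the book): a Bayesian network stores small conditional tables and recovers any joint probability via the chain rule.

```python
# A Bayesian network as CPTs, with the joint recovered by the chain
# rule P(rain, wet) = P(rain) * P(wet | rain). Invented numbers for a
# two-variable network, just to show the factored specification.
cpt_rain = {"yes": 0.2, "no": 0.8}                    # P(Rain)
cpt_wet = {("yes", "yes"): 0.9, ("no", "yes"): 0.1,   # P(Wet | Rain),
           ("yes", "no"): 0.2, ("no", "no"): 0.8}     # keyed (wet, rain)

def joint(rain, wet):
    return cpt_rain[rain] * cpt_wet[(wet, rain)]

# The factored form specifies all four joint entries; for n binary
# variables the savings over an explicit joint become exponential.
total = sum(joint(r, w) for r in ["yes", "no"] for w in ["yes", "no"])
print(total)   # sums to 1 (up to float rounding)
```

The independencies come in through which parents each CPT conditions on; here Wet depends on Rain directly, and larger networks encode their independencies in the same way through the graph.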
We present in this chapter a variation on the variable elimination algorithm, known as the jointree algorithm, which can be understood in terms of factor elimination. This algorithm improves on the complexity of variable elimination when answering multiple queries. It also forms the basis for a class of approximate inference algorithms that we discuss in Chapter 14.
Consider a Bayesian network and suppose that our goal is to compute the posterior marginal for each of its n variables. Given an elimination order of width w, we can compute a single marginal using variable elimination in O(n exp(w)) time and space, as we explained in Chapter 6. To compute all these marginals, we can then run variable elimination O(n) times, leading to a total complexity of O(n² exp(w)).
For large networks, the n² factor can be problematic even when the treewidth is small. The good news is that we can avoid this complexity and compute marginals for all network variables in only O(n exp(w)) time and space. This can be done using a more refined algorithm known as the jointree algorithm, which is the main subject of this chapter. The jointree algorithm will also compute the posterior marginals for other sets of variables, including all network families, where a family consists of a variable and its parents in the Bayesian network. Family marginals are especially important for sensitivity analysis, as discussed in Chapter 16, and for learning Bayesian networks, as discussed in Chapters 17 and 18.
We discuss in this chapter the process of learning Bayesian networks from data. The learning process is studied under different conditions, which relate to the nature of available data and the amount of prior knowledge we have on the Bayesian network.
Consider Figure 17.1, which depicts a Bayesian network structure from the domain of medical diagnosis (we treated this network in Chapter 5). Consider also the data set depicted in this figure. Each row in this data set is called a case and represents a medical record for a particular patient. Note that some of the cases are incomplete, where “?” indicates the unavailability of corresponding data for that patient. The data set is therefore said to be incomplete due to these missing values; otherwise, it is called a complete data set.
A key objective of this chapter is to provide techniques for estimating the parameters of a network structure given both complete and incomplete data sets. The techniques we provide therefore complement those given in Chapter 5 for constructing Bayesian networks. In particular we can now construct the network structure from either design information or by working with domain experts, as discussed in Chapter 5, and then use the techniques discussed in this chapter to estimate the CPTs of these structures from data. We also discuss techniques for learning the network structure itself, although our focus here is on complete data sets for reasons that we state later.
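For the complete-data case, parameter estimation reduces to counting, as the following sketch with a hypothetical two-variable data set illustrates: each CPT entry is estimated as an empirical fraction over the cases.

```python
# Maximum-likelihood CPT estimation from a complete data set by
# counting (the simple complete-data case; the records are invented).
from collections import Counter

# Each case records (Smokes, Cough) for one patient; no missing values.
cases = [("yes", "yes"), ("yes", "yes"), ("yes", "no"),
         ("no", "no"), ("no", "no"), ("no", "yes"), ("no", "no")]

def estimate_cpt(cases):
    """Estimate P(Cough | Smokes) as empirical fractions of the cases."""
    pair = Counter(cases)
    parent = Counter(s for s, _ in cases)
    return {(c, s): pair[(s, c)] / parent[s]
            for s in parent for c in ("yes", "no")}

cpt = estimate_cpt(cases)
print(cpt[("yes", "yes")])   # P(cough | smokes) = 2/3
```

Incomplete data breaks this simple recipe, since a case with a "?" cannot be assigned to a single count; that is what motivates the iterative methods treated later in the chapter.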
We discuss in this chapter computational techniques for exploiting certain properties of network parameters, allowing one to perform inference efficiently in some situations where the network treewidth can be quite large.
We discussed in Chapters 6–8 two paradigms for probabilistic inference based on elimination and conditioning, showing how they lead to algorithms whose time and space complexity are exponential in the network treewidth. These algorithms are often called structure-based since their performance is driven by the network structure and is independent of the specific values attained by network parameters. We also presented in Chapter 11 some CNF encodings of Bayesian networks, allowing us to reduce probabilistic inference to some well-known CNF tasks. The resulting CNFs were also independent of the specific values of network parameters and are therefore also structure-based.
However, the performance of inference algorithms can be enhanced considerably if one exploits the specific values of network parameters. The properties of network parameters that lend themselves to such exploitation are known as parametric or local structure. This type of structure typically manifests in networks involving logical constraints, context-specific independence, or local models of interaction, such as the noisy-or model discussed in Chapter 5.
In this chapter, we present a number of computational techniques for exploiting local structure that can be viewed as extensions of inference algorithms discussed in earlier chapters. We start in Section 13.2 with an overview of local structure and the impact it can have on the complexity of inference.
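To illustrate what a local model of interaction looks like, here is a sketch of the standard noisy-or construction (the cause names and inhibitor probabilities are invented): the entire CPT of the effect is induced by one parameter per cause, rather than one entry per parent instantiation.

```python
# Noisy-or as a compact local model (a standard construction; the
# numbers are invented). Each active cause Ci independently fails to
# trigger the effect E with inhibitor probability q_i, so
# P(E = no | active causes) = prod of the q_i (times a leak term).
def noisy_or(active_causes, q, leak=1.0):
    """P(E = yes | the given set of causes is active)."""
    p_all_fail = leak
    for cause in active_causes:
        p_all_fail *= q[cause]
    return 1.0 - p_all_fail

q = {"cold": 0.4, "flu": 0.1}         # inhibitor probabilities
print(noisy_or([], q))                # 0.0 (no leak, no active cause)
print(noisy_or(["cold", "flu"], q))   # 1 - 0.4 * 0.1
```

With n causes the model needs only n parameters (plus a leak), yet it determines all 2^n rows of the effect's CPT; exploiting that regularity, rather than treating the CPT as 2^n unrelated numbers, is the theme of this chapter.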
We introduce propositional logic in this chapter as a tool for representing and reasoning about events.
The notion of an event is central to both logical and probabilistic reasoning. In the former, we are interested in reasoning about the truth of events (facts), while in the latter we are interested in reasoning about their probabilities (degrees of belief). In either case, one needs a language for expressing events before one can write statements that declare their truth or specify their probabilities. Propositional logic, which is also known as Boolean logic or Boolean algebra, provides such a language.
We start in Section 2.2 by discussing the syntax of propositional sentences, which we use for expressing events. We then follow in Section 2.3 by discussing the semantics of propositional logic, where we define properties of propositional sentences, such as consistency and validity, and relationships among them, such as implication, equivalence, and mutual exclusiveness. The semantics of propositional logic are used in Section 2.4 to formally expose its limitations in supporting plausible reasoning. This also provides a good starting point for Chapter 3, where we show how degrees of belief can deal with these limitations.
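The semantic notions just listed can be sketched directly in code, with sentences represented as predicates over worlds (a toy encoding, not the chapter's notation): a sentence α implies β exactly when every world satisfying α also satisfies β.

```python
# World-based semantics for a tiny propositional language (a sketch;
# sentences are Python predicates over worlds). Alpha implies beta iff
# every world satisfying alpha also satisfies beta.
import itertools

VARS = ["rain", "wet"]

def worlds():
    for bits in itertools.product([True, False], repeat=len(VARS)):
        yield dict(zip(VARS, bits))

def implies(alpha, beta):
    return all(beta(w) for w in worlds() if alpha(w))

def equivalent(alpha, beta):
    return implies(alpha, beta) and implies(beta, alpha)

rain = lambda w: w["rain"]
wet = lambda w: w["wet"]
rain_then_wet = lambda w: (not w["rain"]) or w["wet"]

print(implies(lambda w: rain(w) and rain_then_wet(w), wet))  # True
print(implies(wet, rain))                                    # False
```

Consistency and validity fall out of the same machinery: a sentence is consistent if some world satisfies it and valid if every world does.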
In Section 2.5, we discuss variables whose values go beyond the traditional true and false values of propositional logic. This is critical for our treatment of probabilistic reasoning in Chapter 3, which relies on the use of multivalued variables.
This book is a thorough introduction to the formal foundations and practical applications of Bayesian networks. It provides an extensive discussion of techniques for building Bayesian networks that model real-world situations, including techniques for synthesizing models from design, learning models from data, and debugging models using sensitivity analysis. It also treats exact and approximate inference algorithms at both theoretical and practical levels. The treatment of exact algorithms covers the main inference paradigms based on elimination and conditioning and includes advanced methods for compiling Bayesian networks, time-space tradeoffs, and exploiting local structure of massively connected networks. The treatment of approximate algorithms covers the main inference paradigms based on sampling and optimization and includes influential algorithms such as importance sampling, MCMC, and belief propagation. The author assumes very little background on the covered subjects, supplying in-depth discussions for theoretically inclined readers and enough practical details to provide an algorithmic cookbook for the system developer.
We consider in this chapter three models of graph decomposition: elimination orders, jointrees, and dtrees, which underlie the key inference algorithms we discussed thus far. We present formal definitions of these models, provide polytime, width-preserving transformations between them, and show how the optimal construction of each of these models corresponds in a precise sense to the process of optimally triangulating a graph.
We presented three inference algorithms in previous chapters whose complexity can be exponential only in the network treewidth: variable elimination, factor elimination (jointree), and recursive conditioning. Each one of these algorithms can be viewed as decomposing the Bayesian network in a systematic manner, allowing us to reduce a query with respect to some network into a query with respect to a smaller network. In particular, variable elimination removes variables one at a time from the network, while factor elimination removes factors one at a time and recursive conditioning partitions the network into smaller pieces. We also saw how the decompositional choices made by these algorithms can be formalized using elimination orders, elimination trees (jointrees), and dtrees, respectively. In fact, the time and space complexity of each of these algorithms was characterized using the width of its corresponding decomposition model, which is lower-bounded by the treewidth.
We provide a more comprehensive treatment of decomposition models in this chapter including polytime, width-preserving transformations between them. These transformations allow us to convert any method for constructing low-width models of one type into low-width models of other types.
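One computation that recurs across all three models is the width of an elimination order. The following sketch computes it on a small made-up graph by simulating elimination with fill-in edges, which is the standard induced-width procedure.

```python
# Width of an elimination order on an undirected graph (a sketch of
# the standard induced-width computation; the graph is made up).
def width(graph, order):
    """graph: {node: set of neighbors}; returns the order's width."""
    g = {v: set(ns) for v, ns in graph.items()}
    w = 0
    for v in order:
        neighbors = g.pop(v)
        w = max(w, len(neighbors))
        # Connect v's remaining neighbors (fill-in edges) so that
        # later eliminations see the induced graph.
        for a in neighbors:
            g[a].discard(v)
            g[a] |= neighbors - {a}
    return w

graph = {"A": {"B"}, "B": {"A", "C", "D"},
         "C": {"B", "D"}, "D": {"B", "C"}}
print(width(graph, ["A", "B", "C", "D"]))   # 2
print(width(graph, ["B", "A", "C", "D"]))   # 3: eliminating B first is worse
```

The minimum of this quantity over all orders is the graph's treewidth, and the width-preserving transformations of this chapter let a good order be converted into an equally good jointree or dtree.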
We present in this chapter one of the simplest methods for general inference in Bayesian networks, which is based on the principle of variable elimination: A process by which we successively remove variables from a Bayesian network while maintaining its ability to answer queries of interest.
We saw in Chapter 5 how a number of real-world problems can be solved by posing queries with respect to Bayesian networks. We also identified four types of queries: probability of evidence, prior and posterior marginals, most probable explanation (MPE), and maximum a posteriori hypothesis (MAP). We present in this chapter one of the simplest inference algorithms for answering these types of queries, which is based on the principle of variable elimination. Our interest here will be restricted to computing the probability of evidence and marginal distributions, leaving the discussion of MPE and MAP queries to Chapter 10.
We start in Section 6.2 by introducing the process of eliminating a variable. This process relies on some basic operations on a class of functions known as factors, which we discuss in Section 6.3. We then introduce the variable elimination algorithm in Section 6.4 and see how it can be used to compute prior marginals in Section 6.5. The performance of variable elimination will critically depend on the order in which we eliminate variables. We discuss this issue in Section 6.6, where we also provide some heuristics for choosing good elimination orders.
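The two factor operations the chapter builds on, multiplication and summing out, can be sketched as follows (dict-based factors and invented numbers, not the chapter's notation): eliminating a variable means multiplying the factors that mention it and then summing it out of the product.

```python
# The two factor operations behind variable elimination, on factors
# stored as (variable list, table) pairs with tables keyed by value
# tuples (a sketch; the network and numbers are invented).
def multiply(f, g):
    """Product of two factors, matching them on shared variables."""
    fv, ft = f
    gv, gt = g
    shared = [v for v in fv if v in gv]
    out_vars = fv + [v for v in gv if v not in fv]
    out = {}
    for fk, fval in ft.items():
        for gk, gval in gt.items():
            if all(fk[fv.index(v)] == gk[gv.index(v)] for v in shared):
                key = fk + tuple(gk[gv.index(v)]
                                 for v in gv if v not in fv)
                out[key] = fval * gval
    return (out_vars, out)

def sum_out(f, var):
    """Eliminate var from factor f by summing over its values."""
    fv, ft = f
    i = fv.index(var)
    out = {}
    for k, val in ft.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return (fv[:i] + fv[i + 1:], out)

# P(B) from P(A) and P(B | A): multiply, then eliminate A.
fa = (["A"], {("y",): 0.3, ("n",): 0.7})
fba = (["A", "B"], {("y", "y"): 0.9, ("y", "n"): 0.1,
                    ("n", "y"): 0.4, ("n", "n"): 0.6})
pb = sum_out(multiply(fa, fba), "A")
print(pb[1][("y",)])   # P(B = y), about 0.55
```

Variable elimination is just this step applied repeatedly, and its cost is governed by the size of the largest intermediate factor, which is what the elimination-order heuristics of Section 6.6 try to keep small.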
We discuss in this chapter a class of approximate inference algorithms based on belief propagation. These algorithms provide a full spectrum of approximations, allowing one to trade off approximation quality with computational resources.
The algorithm of belief propagation was first introduced as a specialized algorithm that applied only to networks having a polytree structure. This algorithm, which we treated in Section 7.5.4, was later applied to networks with arbitrary structure and found to produce high-quality approximations in certain cases. This observation triggered a line of investigations into the semantics of belief propagation, which had the effect of introducing a generalization of the algorithm that provides a full spectrum of approximations with belief propagation approximations at one end and exact results at the other.
We discuss belief propagation as applied to polytrees in Section 14.2 and then discuss its application to more general networks in Section 14.3. The semantics of belief propagation are exposed in Section 14.4, showing how it can be viewed as searching for an approximate distribution that satisfies some interesting properties. These semantics will then be the basis for developing generalized belief propagation in Sections 14.5–14.7. An alternative semantics for belief propagation will also be given in Section 14.8, together with a corresponding generalization. The two generalizations of belief propagation differ not only in their semantics but also in the way they allow the user to trade off approximation quality with the computational resources needed to produce the approximations.
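On a chain, which is the simplest polytree, the exact algorithm of Section 14.2 reduces to forwarding one message per edge. The following sketch (invented CPTs, causal-direction messages only, no evidence) shows the idea.

```python
# Message passing on the chain A -> B -> C (a minimal sketch with
# invented CPTs): each node forwards its marginal to its child, and
# P(C) falls out of two matrix-vector products.
def send(message, cpt):
    """cpt[child][parent]; returns the child's marginal as a list."""
    return [sum(cpt[c][p] * message[p] for p in range(len(message)))
            for c in range(len(cpt))]

p_a = [0.6, 0.4]                      # P(A)
p_b_a = [[0.7, 0.2], [0.3, 0.8]]      # P(B | A), rows indexed by B
p_c_b = [[0.9, 0.5], [0.1, 0.5]]      # P(C | B), rows indexed by C

pi_b = send(p_a, p_b_a)               # message A -> B, equals P(B)
pi_c = send(pi_b, p_c_b)              # message B -> C, equals P(C)
print(pi_c)
```

On a polytree, messages in both directions along every edge yield exact marginals; applying the same local message updates to a network with loops is what produces the (iterative, approximate) belief propagation studied in this chapter.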
We consider in this chapter the computational complexity of probabilistic inference. We also provide some reductions of probabilistic inference to well known problems, allowing us to benefit from specialized algorithms that have been developed for these problems.
In previous chapters, we discussed algorithms for answering three types of queries with respect to a Bayesian network that induces a distribution Pr(X). In particular, given some evidence e we discussed algorithms for computing:
The probability of evidence e, Pr(e) (see Chapters 6–8)
The MPE probability for evidence e, MPE_Pr(e) (see Chapter 10)
The MAP probability for variables Q and evidence e, MAP_Pr(Q, e) (see Chapter 10).
In this chapter, we consider the complexity of three decision problems that correspond to these queries. In particular, given a number p, we consider the following problems:
D-PR: Is Pr(e) > p?
D-MPE: Is there a network instantiation x such that Pr(x, e) > p?
D-MAP: Given variables Q ⊆ X, is there an instantiation q such that Pr(q, e) > p?
We also consider a fourth decision problem that includes D-PR as a special case:
D-MAR: Given variables Q ⊆ X and instantiation q, is Pr(q|e) > p?
Note here that when e is the trivial instantiation, D-MAR reduces to asking whether Pr(q) > p, which is identical to D-PR.
We provide a number of results on these decision problems in this chapter. In particular, we show in Sections 11.2–11.4 that D-MPE is NP-complete, D-PR and D-MAR are PP-complete, and D-MAP is NP^PP-complete.
We discuss in this chapter a class of approximate inference algorithms based on stochastic sampling: a process by which we repeatedly simulate situations according to their probability and then estimate the probabilities of events based on the frequency of their occurrence in the simulated situations.
Consider the Bayesian network in Figure 15.1 and suppose that our goal is to estimate the probability of some event, say, wet grass. Stochastic sampling is a method for estimating such probabilities that works by measuring the frequency at which events materialize in a sequence of situations simulated according to their probability of occurrence. For example, if we simulate 100 situations and find out that the grass is wet in 30 of them, we estimate the probability of wet grass to be 3/10. As we see later, we can efficiently simulate situations according to their probability of occurrence by operating on the corresponding Bayesian network, a process that provides the basis for many of the sampling algorithms we consider in this chapter.
The statements of sampling algorithms are remarkably simple compared to the methods for exact inference discussed in previous chapters, and their accuracy can be made arbitrarily high by increasing the number of sampled situations. However, the design of appropriate sampling methods may not be trivial as we may need to focus the sampling process on a set of situations that are of particular interest.
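The simplicity is easy to see in code. Here is a sketch of forward (ancestral) sampling on a made-up two-variable network rather than the network of Figure 15.1: each situation is simulated by sampling the variables in topological order, and the query probability is estimated by a frequency.

```python
# Forward sampling on a two-variable network Rain -> Wet (invented
# CPTs), estimating P(wet) as the frequency of wet-grass samples.
import random

random.seed(0)   # fixed seed so the run is reproducible

def sample():
    rain = random.random() < 0.2        # P(rain) = 0.2
    p_wet = 0.9 if rain else 0.2        # P(wet | rain), P(wet | no rain)
    wet = random.random() < p_wet
    return rain, wet

n = 10_000
hits = sum(1 for _ in range(n) if sample()[1])
print(hits / n)   # close to the exact value 0.2*0.9 + 0.8*0.2 = 0.34
```

Estimating a conditional probability this way requires keeping only the samples consistent with the evidence, which can waste almost all samples when the evidence is unlikely; that difficulty is what motivates the importance sampling and MCMC methods of this chapter.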