#### (ML 13.1) (ML 13.2) Directed graphical models - introductory examples

(Directed) Graphica “Models” aka “Bayesian” networks

Better name would be “conditional independence diagrams” of probability distributions

Key notions:

• factorization of probability distributions
• notational device
• useful for visualization of (a) conditional independence properties
(b) inference algorithms (DP, MCMC)

Why conditional independence important? Tractable inference

Thnking “Graphically”

Let $A, B, C$ be random variables.

$p(a, b,c) = p(c\vert a, b)p(a, b) = p(c\vert a, b)p(b\vert a)p(a)$

If $B$ and $C$ are independent given $A$, then $p(c\vert a, b) = p(c\vert a)$.

#### (ML 13.3) (ML 13.4) Directed graphical models - formalism

Notation.

• DAG (Directed Acyclic Graph) : no directed cycles
• $pa(i) =$ parents of vertex

Definition. Given $(X_1, \ldots, X_n) \sim p$ (pmf or odf) and an ordered DAG $G$ on $n$ vertices, we say $X$ respects $G$ (or $p$ respects $G$) if

Remark. This does not imply that any particular random variables are conditionally dependent.

Example. $X = (X_1, X_2, X_3)$ mutually independent
Then $p(x_1, x_2, x_3) = p(x_1)p(x_2)p(x_3)$

Terminology. A Complete directed graph.

Example. Any distribution respects any complete DAG.

Remark.

• Can combine random variables into vectors
• If the factors are normalized conditional distributions, then the product is normalized also. (Exercise)

#### (ML 13.5) Generative process specification (“a handy convention”)

$X_1, X_2 \sim \text{Bernoulli}(1/2)$, independent
$X_3 \sim N(X_1 + X_2, \sigma^2)$
$X_4 \sim N(aX_2 + b, 1)$
$X_5 = \begin{cases} 1, \, \text{if} \,X_4 \ge 0,\\ 0, \, \text{else}. \end{cases}$

$X = (X_1, \ldots, X_5)$ respects the following graph

and wee have the following factorization $p(x_1, \ldots, x_5) = p(x_1)p(x_2)p(x_3\vert x_1, x_2)p(x_4\vert x_2)p(x_5\vert x_4)$.

Nice sampleing and visualization.

### Examples of (Directed) Graphical Models

#### (ML 13.6) Graphical model for Bayesian linear regression

$D = ((x_1, y_1), \ldots, (x_n, y_n)), x_i \in \mathbb{R}^d, y_i \in \mathbb{R}$
$f(x) = w^T\varphi(x)$
$w \sim N(0, \sigma_0^2I)$
$Y_i \sim N(w^T\varphi(x_i), \sigma^2)$, independent given $w$

$p(w, y_1, \ldots, y_n) = p(w)\prod_{i=1}^n p(y_i\vert w)$

Plate notation.

#### (ML 13.7) Graphical model for Bayesian Naive Bayes

$D = ((x^{(1)}, y_1), \ldots, (x^{(n)}, y_n)), x^{(i)} \in \mathbb{R}^d, y_i \in \{1, \ldots, m\}$
$\pi \sim \text{Dir}(\alpha)$
$r_{jy} \sim \text{Dir}(\beta)$ for $j = \{1, \ldots, d\}, y \in \{1, \ldots, m\}$
$Y \sim \pi, Y_i \sim \pi$
$X_j \sim r_{jY}, X_j^{(i)} \sim r_{jY_i}$

1 “Tail-tail”

#### (ML 13.10) (ML 13.11) D-separation

“The whole point of graphical models is to express the conditional independence properties of a probability distribution. And the D-separation criterion gives you a way to read off those conditional independence properties from the graphocal model for a probability distribution.”

D-separation is a vast generalization of these:

Terminology. Descendents

Let $G$ be a DAG. Let $A, B, C$ be disjoint subsets of vertices.
($A \cup B \cup C$ is not necessarily all of the vertices.)

Definition. A path (not necessarily directed) between two vertices is blocked (w.r.t. $C$) if it passes through a vertex $v$ s.t. either (a) the arrows are head-tail or tail-tail, and $v \in C$ or (b) the arrows are head-head and $v \not\in C$ and none of the descendents of $v$ are in $C$.

Definition. $A$ and $B$ are d-separated by $C$ if all paths from a vertex of $A$ to a vertex of $B$ are blocked (w.r.t. $C$).

Theorem (d-separation). If $A$ and $B$ are d-separated by $C$ then $A$ and $B$ are conditionally independent given $C$.

#### (ML 13.12) (ML 13.13) How to use D-separation - illustrative examples

Example 1. $C = \{3\}$ Are $X_i$ and $X_j$ independent given $X_3$?

Example 2.

Example 3.