# Decision Theory
Decision theory, combined with probability theory, allows us to make optimal decisions in situations involving uncertainty, such as those encountered in pattern recognition.
Classification problems can be broken down into two separate stages: the *inference stage* and the *decision stage*. The *inference stage* involves using the training data to learn a model for the joint distribution $p(\mathbf{x},\mathcal{C}_k)$, or equivalently $p(\mathbf{x},t)$, which gives us the most complete probabilistic description of the situation. In the *decision stage* we use these probabilities to make an optimal class assignment; this stage is generally very simple, even trivial, once we have solved the inference problem.
## Classification with decision regions
The idea of classification through decision regions is to divide the input space $\mathbb{R}^D$ into $K$ decision regions and assign each region to a class. The boundaries between decision regions are called *decision surfaces*.
In linear classification with a $D$-dimensional input space, the decision surface is a $(D-1)$-dimensional hyperplane.
To scale this strategy to multiple classes, the simplest idea for $K$ classes is to use $K-1$ one-versus-the-rest classifiers. However, this may lead to ambiguous regions in the input space, as shown in the figure below:
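As a minimal illustration (not from the text), the sketch below uses a hypothetical weight vector $\mathbf{w}$ and bias $w_0$ to define a linear decision surface $\mathbf{w}^\top\mathbf{x}+w_0=0$ in $D=2$ dimensions; the two decision regions are simply the half-spaces on either side of that line.
```python
import numpy as np

# Hypothetical parameters of a linear classifier in D = 2 dimensions.
# The decision surface {x : w @ x + w0 = 0} is a (D-1)-dimensional hyperplane
# (here, a line); the decision regions are the half-spaces on either side.
w = np.array([1.0, -2.0])
w0 = 0.5

def decision_region(x):
    """Return the index (0 or 1) of the decision region containing x."""
    return int(w @ x + w0 > 0)

print(decision_region(np.array([3.0, 1.0])))   # positive side of the surface -> 1
print(decision_region(np.array([-3.0, 1.0])))  # negative side of the surface -> 0
```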
![[amigious regions with oneVrest classifiers.jpg]]
## Minimizing the misclassification rate
In a classification problem, our goal is to minimize the number of misclassifications. Assume observations are drawn from the joint distribution $p(\mathbf{x},t)$. Then the probability of misclassification is given by:
$
\begin{aligned}
p(\text{mistake}) &=\sum_{i=1}^{K} \sum_{k \neq i} p\left(\mathbf{x} \in \mathcal{R}_{i}, \mathcal{C}_{k}\right) \\
&=1-\sum_{k=1}^{K} p\left(\mathbf{x} \in \mathcal{R}_{k}, \mathcal{C}_{k}\right)
\end{aligned}
$
Let's choose a decision rule that assigns each point to one of the classes: assign $\mathbf{x}$ to class $\mathcal{C}_k$ if $p\left(\mathbf{x}, t=\mathcal{C}_{k}\right)>p\left(\mathbf{x}, t=\mathcal{C}_{j}\right)$ for all $j \neq k$.
From the product rule of probability we have $p\left(\mathbf{x}, \mathcal{C}_{k}\right)=p\left(\mathcal{C}_{k} \mid \mathbf{x}\right) p(\mathbf{x})$. Since $p(\mathbf{x})$ is common to all classes, we conclude that the minimum probability of making a mistake is obtained if each value of $\mathbf{x}$ is assigned to the class for which the posterior probability $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$ is largest.
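As a quick sketch of this decision rule (the posterior values below are made up purely for illustration), assigning each point to the class with the largest posterior is just an argmax:
```python
import numpy as np

# Made-up posterior probabilities p(C_k | x) for three points x (rows)
# and three classes C_k (columns), purely for illustration.
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.4, 0.2],
])

# Minimum-misclassification decision rule: pick the class with the largest posterior.
decisions = np.argmax(posteriors, axis=1)
print(decisions)  # [0 2 0] (ties are broken in favour of the first maximum)
```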
Usually, the joint probability distributions for different classes overlap, so some errors are unavoidable, as shown in the diagrams below.
![[Screenshot 2020-09-22 at 3.20.41 PM.jpg]]
## Minimizing the expected loss
For many applications, the objective of classification is more complex than simply minimizing the number of misclassifications. Different types of misclassification have different consequences, which we must take into consideration while modelling the problem.
We formalize these issues by introducing a *loss function*. The optimal solution is the one that minimizes the loss function. Since the true class is unknown, we minimize the average loss, where the average is computed with respect to the joint probability distribution $p\left(\mathbf{x}, \mathcal{C}_{k}\right)$:
$
\mathbb{E}[L]=\sum_{k, j} L_{k j} \int_{\mathcal{R}_{j}} p\left(\mathbf{x}, \mathcal{C}_{k}\right) \mathrm{d} \mathbf{x}
$
Our goal is to choose the regions $\mathcal{R}_j$ so as to minimize the expected loss, which implies that each $\mathbf{x}$ should be assigned to the class $j$ that minimizes $\sum_{k=1}^{K} L_{k j}\, p\left(\mathbf{x}, \mathcal{C}_{k}\right)$, or equivalently (dividing by the common factor $p(\mathbf{x})$) minimizes $\sum_{k=1}^{K} L_{k j}\, p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$.
This is clearly trivial to do once we know the posterior class probabilities.
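A minimal sketch of minimum-expected-loss decisions, using a hypothetical asymmetric loss matrix and made-up posteriors (none of the numbers come from the text):
```python
import numpy as np

# Hypothetical loss matrix: L[k, j] is the loss incurred when the true class
# is k but we decide class j. Here, deciding class 0 when the truth is class 1
# is penalized 100 times more heavily than the reverse mistake.
L = np.array([[0.0,   1.0],
              [100.0, 0.0]])

# Made-up posteriors p(C_k | x) for three points (rows: x, columns: k).
posteriors = np.array([
    [0.9,   0.1],
    [0.999, 0.001],
    [0.5,   0.5],
])

# Expected loss of deciding class j at x: sum_k L[k, j] * p(C_k | x).
expected_loss = posteriors @ L                 # shape (n_points, n_classes)
decisions = np.argmin(expected_loss, axis=1)
print(decisions)  # [1 0 1]
```
Note that the first point is assigned to class 1 even though class 0 has the much larger posterior, precisely because the loss matrix makes that mistake so expensive.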
## Approaches to solving classification problems
The three distinct approaches to solving decision problems, in decreasing order of complexity, are:
### 1. Probabilistic generative models
First solve the inference problem of determining the class-conditional densities $p\left(\mathbf{x} \mid \mathcal{C}_{k}\right)$ for each class individually. Separately, infer the prior class probabilities $p\left(\mathcal{C}_{k}\right)$ from the fraction of the data set belonging to each class. Then use [[Bayes Theorem]] in the form
$
p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)=\frac{p\left(\mathbf{x} \mid \mathcal{C}_{k}\right) p\left(\mathcal{C}_{k}\right)}{p(\mathbf{x})}
$
to find the posterior class probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$. Then use decision theory to determine class membership for a new input $\mathbf{x}$. Thus we explicitly model the distribution of inputs as well as outputs. Such models are known as *generative models*: by sampling from the class-conditional densities, we can generate synthetic data points in the input space.
Probabilistic generative models are the most demanding approach because they involve finding the joint distribution over both $\mathbf{x}$ and $\mathcal{C}_k$. Usually, $\mathbf{x}$ has large dimensionality, so we may need a large training set, although the priors can be estimated simply from the class fractions in the training data. One advantage is that this approach also yields the marginal density of the data $p(\mathbf{x})$, which can be useful for detecting new data points that have low probability under the model and for which predictions may be of low accuracy; this is known as *outlier detection*.
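A toy sketch of the generative route, assuming Gaussian class-conditional densities and synthetic data (both assumptions are made here only for illustration; SciPy's `multivariate_normal` provides the densities):
```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy data: two classes drawn from different distributions.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class 0 samples
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(120, 2))   # class 1 samples

# Inference stage: fit a Gaussian p(x | C_k) per class and the priors p(C_k).
class_conditionals = [multivariate_normal(X.mean(axis=0), np.cov(X.T)) for X in (X0, X1)]
priors = np.array([len(X0), len(X1)], dtype=float)
priors /= priors.sum()

def posterior(x):
    """Posterior p(C_k | x) obtained from Bayes' theorem."""
    joint = np.array([d.pdf(x) * p for d, p in zip(class_conditionals, priors)])
    return joint / joint.sum()    # joint.sum() is the marginal p(x)

# Decision stage: assign x to the class with the largest posterior.
x_new = np.array([2.5, 2.0])
print(posterior(x_new), posterior(x_new).argmax())
```
The intermediate quantity `joint.sum()` is the marginal density $p(\mathbf{x})$ mentioned above, which could be thresholded for outlier detection.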
### 2. Probabilistic discriminative models
First solve the inference problem of directly determining the posterior class probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$, and subsequently use decision theory to assign each new $\mathbf{x}$ to one of the classes. Models that represent the posterior probabilities directly are called *discriminative models*.
When we only wish to make classification decisions, computing the full class-conditional densities $p\left(\mathbf{x} \mid \mathcal{C}_{k}\right)$ can be wasteful of computational resources and demand excessive data. Instead, we can directly model the posterior with a discriminative model.
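A sketch of the discriminative route, assuming scikit-learn is available and using logistic regression as a standard example of a model for $p(\mathcal{C}_k \mid \mathbf{x})$ (the data here is synthetic):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (same flavour as the generative sketch above).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(120, 2))])
t = np.array([0] * 100 + [1] * 120)

# Inference stage: model the posterior p(C_k | x) directly, without ever
# estimating the class-conditional densities p(x | C_k).
clf = LogisticRegression().fit(X, t)

# Decision stage: use the posteriors (take the argmax, or plug them into a loss matrix).
posteriors = clf.predict_proba([[2.5, 2.0]])
print(posteriors, posteriors.argmax(axis=1))
```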
There are several powerful reasons for wanting to compute the posterior probabilities:
1. Minimizing risk: if the loss matrix is revised, we can recompute the minimum-risk decision by adjusting $\sum_{k} L_{k j} p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$ without retraining the model.
2. Determining a rejection criterion that minimizes the expected loss for a given fraction of rejected data points (the reject option helps avoid making decisions on difficult cases; see the sketch after this list).
3. Compensating for class priors: when we train on an artificially balanced data set (to handle class imbalance), the resulting posteriors can be corrected by dividing by the balanced-set class fractions, multiplying by the true population priors, and renormalizing.
4. Combining models: we can break the problem down and build different models for different aspects of the inputs. As long as each model gives posterior probabilities for the classes, we can combine their outputs systematically using the rules of probability.
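As a sketch of the reject option mentioned in point 2 (with a made-up threshold $\theta$ and made-up posteriors), we simply refuse to classify any point whose largest posterior falls below $\theta$:
```python
import numpy as np

# Reject option: classify only when the largest posterior exceeds a threshold
# theta; otherwise mark the point as rejected (here encoded as -1) so that a
# more careful decision (e.g. by a human expert) can be made.
theta = 0.8
posteriors = np.array([   # made-up posteriors p(C_k | x)
    [0.95, 0.05],
    [0.55, 0.45],         # ambiguous point: will be rejected
    [0.10, 0.90],
])

decisions = np.where(posteriors.max(axis=1) >= theta,
                     posteriors.argmax(axis=1),   # confident: usual argmax rule
                     -1)                          # uncertain: reject
print(decisions)  # [ 0 -1  1]
```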
### 3. Discriminant functions
Find a function $f(\mathbf{x})$, called a discriminant function, which maps each input $\mathbf{x}$ directly onto a class label. In this approach we have no access to the posterior probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$.
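A minimal sketch of this approach, using $K=3$ hypothetical linear discriminant functions $y_k(\mathbf{x}) = \mathbf{w}_k^\top\mathbf{x} + w_{k0}$ and mapping $\mathbf{x}$ straight to the label $\arg\max_k y_k(\mathbf{x})$ (the weights are made up; nothing here is a probability):
```python
import numpy as np

# Hypothetical linear discriminant functions y_k(x) = W[k] @ x + b[k],
# one row of W (and one entry of b) per class.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, -0.5, 0.3])

def classify(x):
    """Map x directly onto a class label; no posterior probabilities involved."""
    return int(np.argmax(W @ x + b))

print(classify(np.array([2.0, 1.0])))   # -> 0
```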