# Lagrange Multipliers
Lagrange multipliers are used to find the stationary points (where the derivative of the function with respect to every parameter is zero) of a function of several variables subject to one or more constraints.
## Intuition
### Understanding Gradient Vectors
First, let's clarify what gradient vectors are and how they behave:
- The **gradient** of a function, denoted as $\nabla f(\mathbf{x})$ for a function $f$, points in the direction of the steepest ascent of that function. This means if you were to move in the direction of the gradient, you would increase the value of the function $f$ most rapidly.
- Similarly, the gradient of a constraint function $g(\mathbf{x})$, denoted as $\nabla g(\mathbf{x})$, points in the direction of the steepest ascent of $g$.
### Constraint Surfaces and Perpendicular Gradients
When you're dealing with a constraint of the form $g(\mathbf{x})=c$, the set of points that satisfy this equation form what's called a **constraint surface** in the space of the variable $\mathbf{x}$. The gradient $\nabla g(\mathbf{x})$ at any point on this surface is perpendicular to the surface. This perpendicularity is a geometric property of gradients: they are always normal (perpendicular) to level sets of the function, and the set of points where $g(\mathbf{x}) = c$ is a level set of $g$.
### How to optimize?
When maximizing $f(x)$ subject to $g(x) = c$, at the optimal point $\nabla f(x)$ and $\nabla g(x)$ must be parallel or anti-parallel (i.e., collinear).
Why?
- $\nabla g(x)$ is perpendicular to the constraint surface $g(x) = c$
- To maximize $f$ while staying on this level set $g(x) = c$, we can't have any component of $\nabla f(x)$ tangent to the surface (otherwise we could move along the surface to increase $f$ further)
- Therefore $\nabla f(x)$ must also be perpendicular to the constraint surface $g(x) = c$
- This gives us the way to optimize the constrained function! $\nabla f(x) + \lambda \nabla g(x) = 0$
Important details:
- Multiply by $\lambda$: you don't just add the gradients; you add $\lambda$ times the constraint gradient: $\nabla f(x) + \lambda \nabla g(x) = 0$
- Constraint must equal zero: the constraint needs to be in the form $g(x) = 0$. If you have $g(x) = c$, rewrite it as $g(x) - c = 0$.
- Solve the system: you get a system of equations:
  - $\nabla f(x) + \lambda \nabla g(x) = 0$ (gradient condition)
  - $g(x) = 0$ (constraint condition)

So the general recipe is:
- Set up $\nabla f + \lambda \nabla g = 0$
- Include your constraint $g(x) = 0$

Solve this system for both $x$ and $\lambda$. The $\lambda$ is crucial because it tells you the "exchange rate" between violating the constraint and improving the objective function.
**Example**: Find the point on line $x + y = 1$ closest to the origin. Minimize distance squared $f(x,y) = x^2 + y^2$ subject to $g(x,y) = x + y - 1 = 0$.
Setting up: $\nabla f = (2x, 2y)$ and $\nabla g = (1, 1)$
From $\nabla f + \lambda \nabla g = 0$: $(2x, 2y) + \lambda(1, 1) = (0, 0)$
This gives us: $2x = -\lambda$, $2y = -\lambda$, so $x = y$
Combined with constraint $x + y = 1$: solution is $x = y = 1/2$.
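The worked example above can be checked symbolically; a minimal sketch with sympy (assumed available):

```python
# Solve the example above symbolically: minimize x^2 + y^2 subject to x + y = 1.
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x**2 + y**2      # objective: squared distance to the origin
g = x + y - 1        # constraint written as g(x, y) = 0

L = f + lam * g      # Lagrangian
# Stationarity in x and y, plus the constraint itself
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
print(sol)  # single solution: x = y = 1/2, with lam = -1
```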
**Geometric picture**: You're driving on a winding mountain road (the constraint surface) trying to reach the highest elevation possible. At the optimal point, the road must be perfectly level (tangent to the mountain's elevation contours) - if it were still going uphill or downhill, you could drive further to get higher.
## Equality constraints optimization
The goal is to find the maximum of $f(\mathbf{x})$ subject to $g(\mathbf{x})=c$. The motivation for this method of constrained optimization comes from a geometric perspective.
Consider a $D$-dimensional variable $\mathbf{x}$. The constraint equation $g(\mathbf{x})=c$ (equivalently $g(\mathbf{x})-c=0$) then represents a $(D-1)$-dimensional surface in $\mathbf{x}$ space.
Note that the gradient $\nabla g(\mathbf{x})$ of the constraint function is perpendicular to the constraint surface (since the gradient points in the direction of steepest ascent and is normal to level sets). At a point $\mathbf{x}$ that maximizes $f(\mathbf{x})$ on the surface, the vector $\nabla f(\mathbf{x})$ must also be perpendicular to the constraint surface; otherwise we could increase the value of $f(\mathbf{x})$ by moving a short distance along the constraint surface. Thus $\nabla f$ and $\nabla g$ are parallel or anti-parallel vectors, so there must exist a parameter $\lambda$ such that
$
\nabla f(\mathbf{x})+\lambda \nabla g(\mathbf{x})=0
$
where $\lambda \neq 0$ is known as a _Lagrange multiplier_. Introducing the _Lagrangian function_:
$
L(\mathbf{x}, \lambda)=f(\mathbf{x})+\lambda(g(\mathbf{x})-c)
$
The constrained optimization of $f(\mathbf{x})$ can now be solved by finding the stationary points of $L(\mathbf{x},\lambda)$:
$
\frac{\partial}{\partial x} L(x, \lambda)=0 \quad \frac{\partial}{\partial \lambda} L(x, \lambda)=0
$
![[lagrange.jpg]]
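The stationarity conditions in $(\mathbf{x}, \lambda)$ jointly form a root-finding problem that can be solved numerically; a sketch with scipy (assumed available), using the toy problem $f(x_1, x_2) = 1 - x_1^2 - x_2^2$ subject to $x_1 + x_2 - 1 = 0$:

```python
# Find the stationary point of L(x, lam) = f(x) + lam * g(x) numerically.
from scipy.optimize import fsolve

def stationarity(z):
    x1, x2, lam = z
    # f = 1 - x1^2 - x2^2,  g = x1 + x2 - 1
    return [
        -2.0 * x1 + lam,  # dL/dx1 = 0
        -2.0 * x2 + lam,  # dL/dx2 = 0
        x1 + x2 - 1.0,    # dL/dlam = g(x) = 0
    ]

x1, x2, lam = fsolve(stationarity, x0=[0.0, 0.0, 0.0])
print(x1, x2, lam)  # constrained maximum at x1 = x2 = 1/2, with lam = 1
```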
## Inequality constraints optimization
We now consider the problem of maximizing $f(\mathbf{x})$ subject to an _inequality constraint_ of the form $g(\mathbf{x})\geq 0$.
Two kinds of solutions are now possible:
1. Stationary point lies in region $g(\mathbf{x}) \geq 0$ : inactive constraint
$
\nabla f(\mathbf{x})=0, \quad \mu=0
$
2. Stationary point lies on boundary $g(\mathbf{x})=0$ : active constraint
$
\nabla f(\mathbf{x})=-\mu \nabla g(\mathbf{x}), \quad \mu>0
$
In the second case, the solution lies on the boundary, which is analogous to the equality constraint; however, the sign now matters, since $f(\mathbf{x})$ is at a maximum only if its gradient is oriented away from the region $g(\mathbf{x})>0$, i.e. $\nabla f(\mathbf{x})=-\mu \nabla g(\mathbf{x})$ for some $\mu > 0$. In either case, $\mu g(\mathbf{x})=0$.
Therefore, the solution is obtained by optimizing the Lagrange function with respect to $\mathbf{x}$ and $\mu$ subject to the conditions
$
\mu \geq 0, \quad g(\mathbf{x}) \geq 0, \quad \mu g(\mathbf{x})=0
$
known as _Karush-Kuhn-Tucker (KKT)_ conditions.
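The two cases can be made concrete with a one-dimensional toy problem: maximize $f(x) = -(x-c)^2$ subject to $g(x) = x \geq 0$. The function name and closed-form solution below are illustrative, not from the source:

```python
def kkt_solve(c):
    """Maximize f(x) = -(x - c)**2 subject to g(x) = x >= 0 (closed form)."""
    if c >= 0:
        # Case 1: the unconstrained maximum x = c already satisfies g(x) >= 0,
        # so the constraint is inactive and mu = 0.
        return c, 0.0
    # Case 2: the unconstrained maximum violates the constraint, so the
    # solution sits on the boundary x = 0. There, grad f = -2(0 - c) = 2c,
    # and grad f + mu * grad g = 0 gives mu = -2c > 0.
    return 0.0, -2.0 * c

for c in (2.0, -2.0):
    x_star, mu = kkt_solve(c)
    # Verify the KKT conditions: mu >= 0, g(x*) >= 0, mu * g(x*) = 0
    assert mu >= 0.0 and x_star >= 0.0 and mu * x_star == 0.0
    print(c, x_star, mu)
```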
We pose this optimization problem via the _dual Lagrangian_, obtained by maximizing $L$ over the primal variables $\mathbf{x}$ for a fixed dual variable $\mu$:
$
\tilde{L}(\mu)=\max _{\mathbf{x}} L(\mathbf{x}, \mu)
$
with
$
L(\mathbf{x}, \mu)=f(\mathbf{x})+\mu g(\mathbf{x})
$
To obtain the dual Lagrangian analytically:
1. Use the stationarity condition $\nabla_{\mathbf{x}} L=0$ to eliminate $\mathbf{x}$ from $L$
2. This gives $\tilde{L}$, which now depends only on $\mu$
3. For any $\mu \geq 0$, $\tilde{L}(\mu)$ is an upper bound on the constrained maximum of $f$
Duality gap: $d^{*}-p^{*}$, which comes from the weak duality relation $p^{*}=\max _{\mathbf{x},\, g(\mathbf{x}) \geq 0} f(\mathbf{x}) \leq \min _{\mu \geq 0} \tilde{L}(\mu)=d^{*}$
For (almost all) convex problems:
1. Strong duality: $p^{*}=d^{*}$
2. So if we have solved the dual problem, we have solved the primal problem!
**Dual problem** (find the lowest upper bound):
$
\min \tilde{L}(\mu) \text { subject to } \mu \geq 0
$
Thus, to conclude, the algorithm is:
1. Define the Lagrangian: $L(\mathbf{x}, \mu)=f(\mathbf{x})+\mu g(\mathbf{x})$
2. Compute the dual Lagrangian $\tilde{L}(\mu)$ and solve the dual problem: $\mu^{*}=\underset{\mu}{\arg \min } \tilde{L}(\mu)$ subject to $\mu \geq 0$
3. Maximize the primal Lagrangian: $\mathbf{x}^{*}=\underset{\mathbf{x}}{\arg \max } L\left(\mathbf{x}, \mu^{*}\right)$
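A worked instance of these three steps, for maximizing $f(x) = -x^2$ subject to $g(x) = x - 1 \geq 0$; a sketch with sympy (assumed available):

```python
import sympy as sp

x, mu = sp.symbols("x mu", real=True)
f = -x**2
g = x - 1

L = f + mu * g                                  # step 1: Lagrangian
x_of_mu = sp.solve(sp.diff(L, x), x)[0]         # stationarity gives x = mu/2
L_dual = sp.expand(L.subs(x, x_of_mu))          # step 2: dual, mu**2/4 - mu
mu_star = sp.solve(sp.diff(L_dual, mu), mu)[0]  # step 3: minimize dual -> mu* = 2
x_star = x_of_mu.subs(mu, mu_star)              # recover x* = 1

# Strong duality holds here: d* = L~(mu*) equals p* = f(x*)
assert L_dual.subs(mu, mu_star) == f.subs(x, x_star)
print(mu_star, x_star)  # 2 and 1
```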
## Note about the signs
For building the Lagrangian it is quickest to remember the sign of the multiplier term:
- min problem & $\geq$ constraint $\Rightarrow$ $-$
- min & $\leq$ $\Rightarrow$ $+$
- max & $\geq$ $\Rightarrow$ $+$
- max & $\leq$ $\Rightarrow$ $-$
The Lagrangian is constructed by looking at the gradients of the function to be optimized and of the constraint function. A constrained optimization problem is solved when the gradient of the constraint relates to the gradient of the objective in a specific way: the gradients must be either parallel or anti-parallel. Since the gradient always points in the direction of steepest ascent (uphill, so to speak), the direction of the objective's gradient differs depending on whether the optimum is a minimum or a maximum. At the same time, the direction of the constraint's gradient depends on the direction of the inequality. So whether we need parallel or anti-parallel gradients depends both on the kind of problem and on the direction of the constraint (which determines on which side the constraint is actually active). For anti-parallel gradients the resulting sign is $+$, and for parallel gradients it is $-$.
## References:
1. Bishop, C. M. (2006). _Pattern Recognition and Machine Learning_, Appendix E.