# Lagrange Multipliers

Lagrange multipliers are used to find the stationary points (where the derivative of the function with respect to every parameter is zero) of a function of several variables subject to one or more constraints.

## Intuition

### Understanding Gradient Vectors

First, let's clarify what gradient vectors are and how they behave:

- The **gradient** of a function, denoted as $\nabla f(\mathbf{x})$ for a function $f$, points in the direction of steepest ascent of that function. This means if you were to move in the direction of the gradient, you would increase the value of the function $f$ most rapidly.
- Similarly, the gradient of a constraint function $g(\mathbf{x})$, denoted as $\nabla g(\mathbf{x})$, points in the direction of steepest ascent of $g$.

### Constraint Surfaces and Perpendicular Gradients

When you're dealing with a constraint of the form $g(\mathbf{x})=c$, the set of points that satisfy this equation forms what's called a **constraint surface** in the space of the variable $\mathbf{x}$. The gradient $\nabla g(\mathbf{x})$ at any point on this surface is perpendicular to the surface. This perpendicularity is a geometric property of gradients: they are always normal (perpendicular) to level sets of the function, and the set of points where $g(\mathbf{x}) = c$ is a level set of $g$.

### Maximizing a Function Subject to a Constraint

When you want to maximize $f(\mathbf{x})$ subject to $g(\mathbf{x})=c$, you're looking for points on the constraint surface where $f$ reaches its maximum value. However, because you're restricted to moving along the constraint surface, you cannot freely follow $\nabla f(\mathbf{x})$ unless it points exactly along the surface. In most situations, freely following $\nabla f(\mathbf{x})$ would lead you off the constraint surface, which violates the constraint.

At a constrained maximum, $\nabla f(\mathbf{x})$ must therefore be perpendicular to the tangent plane of the constraint surface at that point: if it had any component along the surface, you could move along the surface in that direction and still increase $f$. This means it must align with $\nabla g(\mathbf{x})$, since $\nabla g(\mathbf{x})$ is also perpendicular to the constraint surface at every point on it.

### Anti-Parallel Vectors and Lagrange Multipliers

Saying that $\nabla f(\mathbf{x})$ and $\nabla g(\mathbf{x})$ are anti-parallel means they point in exactly opposite directions at the optimal point (in general the two gradients may be parallel or anti-parallel; the sign of $\lambda$ accounts for either case). If the gradients were not aligned, $\nabla f(\mathbf{x})$ would have a non-zero component along the constraint surface, so you could still move along the surface and increase $f$, meaning you hadn't found the maximum.

The equation $\nabla f(\mathbf{x}) + \lambda \nabla g(\mathbf{x}) = 0$ encapsulates this relationship. The parameter $\lambda$ (known as a **Lagrange multiplier**) scales the gradient of the constraint function so that when added to the gradient of the objective function, the sum is zero. This equation is the core of the **Lagrange multiplier** method, a powerful technique for finding the extrema of functions subject to constraints.

In essence, the intuition here is geometric and algebraic: you're aligning the force (gradient) that drives you up the hill (maximizes $f$) with the wall (constraint $g(\mathbf{x})=c$) that confines your path, ensuring that any step you take not only keeps you within bounds but also takes you as high as possible within those bounds.
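To make the gradient-alignment picture concrete, here is a minimal numerical sketch (the toy objective and constraint are my own choices for illustration, not from the text): it takes the maximum of $f(x, y) = x + y$ on the unit circle $g(x, y) = 1 - x^2 - y^2 = 0$ and checks that $\nabla f$ and $\nabla g$ are anti-parallel there.

```python
import numpy as np

# Toy problem (assumed for illustration): maximize f(x, y) = x + y
# subject to g(x, y) = 1 - x^2 - y^2 = 0 (the unit circle).
def grad_f(p):
    return np.array([1.0, 1.0])   # gradient of f(x, y) = x + y

def grad_g(p):
    return -2.0 * p               # gradient of g(x, y) = 1 - x^2 - y^2

# The constrained maximum is known analytically: x = y = 1/sqrt(2).
p_star = np.array([1.0, 1.0]) / np.sqrt(2.0)

gf, gg = grad_f(p_star), grad_g(p_star)

# The 2-D "cross product" measures the component of grad f along the surface;
# it vanishes when grad f and grad g are (anti-)parallel.
print(np.isclose(gf[0] * gg[1] - gf[1] * gg[0], 0.0))   # True

# Recover the multiplier from grad f + lambda * grad g = 0.
lam = -gf[0] / gg[0]
print(np.allclose(gf + lam * gg, 0.0), lam)             # True, lambda = 1/sqrt(2) > 0
```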
## Equality constraints optimization

The goal is to find the maximum of $f(\mathbf{x})$ subject to $g(\mathbf{x})=c$. The motivation for this method of constrained optimization comes from a geometrical perspective. Consider a $D$-dimensional variable $\mathbf{x}$. The constraint equation $g(\mathbf{x})=0$ then represents a $(D-1)$-dimensional surface in $\mathbf{x}$ space.

Note that the gradient $\nabla g(\mathbf{x})$ of the constraint function will be perpendicular to the constraint surface (i.e. where $g(\mathbf{x}) = 0$), as the gradient points in the direction of steepest ascent. Now for the point $\mathbf{x}$ that maximizes $f(\mathbf{x})$, the vector $\nabla f(\mathbf{x})$ must also be perpendicular to the constraint surface, because otherwise we could increase the value of $f(\mathbf{x})$ by moving a short distance along the constraint surface. Thus $\nabla g$ and $\nabla f$ are parallel or anti-parallel vectors, so there must exist a parameter $\lambda$ such that

$$
\nabla f(\mathbf{x})+\lambda \nabla g(\mathbf{x})=0
$$

where $\lambda \neq 0$ is known as a _Lagrange multiplier_. Introducing the _Lagrangian function_:

$$
L(\mathbf{x}, \lambda)=f(\mathbf{x})+\lambda(g(\mathbf{x})-c)
$$

The constrained optimization of $f(\mathbf{x})$ can now be solved by finding the stationary points of $L(\mathbf{x},\lambda)$:

$$
\frac{\partial}{\partial \mathbf{x}} L(\mathbf{x}, \lambda)=0 \quad \frac{\partial}{\partial \lambda} L(\mathbf{x}, \lambda)=0
$$

![[lagrange.jpg]]

## Inequality constraints optimization

We now consider the problem of maximizing $f(\mathbf{x})$ subject to an _inequality constraint_ of the form $g(\mathbf{x})\geq 0$. Two kinds of solution are now possible:

1. Stationary point lies in the region $g(\mathbf{x}) > 0$: inactive constraint
$$
\nabla f(\mathbf{x})=0, \quad \mu=0
$$
2. Stationary point lies on the boundary $g(\mathbf{x})=0$: active constraint
$$
\nabla f(\mathbf{x})=-\mu \nabla g(\mathbf{x}), \quad \mu>0
$$

In the second case, the solution lies on the boundary, which is analogous to the equality constraint; however, the sign of the Lagrange multiplier now matters: $f(\mathbf{x})$ will only be at a maximum if its gradient is oriented away from the region $g(\mathbf{x})>0$, so $\nabla f(\mathbf{x})=-\mu \nabla g(\mathbf{x})$ for some value $\mu > 0$. For either of these cases, $\mu g(\mathbf{x})=0$. Therefore, the solution is obtained by optimizing the Lagrange function with respect to $\mathbf{x}$ and $\mu$ subject to the conditions

$$
\mu \geq 0, \quad g(\mathbf{x}) \geq 0, \quad \mu g(\mathbf{x})=0
$$

known as the _Karush-Kuhn-Tucker (KKT)_ conditions.

We pose this optimization problem with the _dual Lagrangian_, obtained by maximizing over the primal variable $\mathbf{x}$ for a fixed dual variable $\mu$:

$$
\tilde{L}(\mu)=\max _{\mathbf{x}} L(\mathbf{x}, \mu)
$$

with

$$
L(\mathbf{x}, \mu)=f(\mathbf{x})+\mu g(\mathbf{x})
$$

To obtain the dual Lagrangian analytically:

1. Use the stationarity condition $\nabla_{\mathbf{x}} L=0$ to eliminate $\mathbf{x}$ from $L$
2. This gives $\tilde{L}$, which now depends only on $\mu$
3. For every $\mu \geq 0$, $\tilde{L}(\mu)$ is an upper bound on the primal optimum

Duality gap: $d^{*}-p^{*}$ comes from the fact that

$$
p^{*}=\max _{\mathbf{x},\, g(\mathbf{x}) \geq 0} f(\mathbf{x}) \leq \min _{\mu \geq 0} \tilde{L}(\mu)=d^{*}
$$

For (almost all) convex problems:

1. Strong duality: $p^{*}=d^{*}$
2. So if we have solved the dual problem, we have solved the primal problem!

**Dual problem** (find the lowest upper bound):

$$
\min_{\mu} \tilde{L}(\mu) \text { subject to } \mu \geq 0
$$

Thus, to conclude, the algorithm is as follows (a worked toy example appears after the list):

1. Define the Lagrangian: $L(\mathbf{x}, \mu)=f(\mathbf{x})+\mu g(\mathbf{x})$
2. Compute the dual Lagrangian $\tilde{L}(\mu)$ and solve the dual problem: $\mu^{*}=\underset{\mu}{\arg \min } \tilde{L}(\mu)$ subject to $\mu \geq 0$
3. Maximize the primal Lagrangian: $\mathbf{x}^{*}=\underset{\mathbf{x}}{\arg \max } L\left(\mathbf{x}, \mu^{*}\right)$
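As a sanity check of these three steps, the following sketch works through the dual procedure symbolically on a one-dimensional toy problem (the specific $f$, $g$, and the use of `sympy` are my own assumptions for illustration, not part of the source).

```python
import sympy as sp

# Toy problem (assumed for illustration):
# maximize f(x) = -(x - 2)**2  subject to  g(x) = 1 - x >= 0.
x, mu = sp.symbols('x mu', real=True)
f = -(x - 2)**2
g = 1 - x

# Step 1: define the Lagrangian L(x, mu) = f(x) + mu * g(x).
L = f + mu * g

# Step 2a: use the stationarity condition dL/dx = 0 to eliminate x ...
x_of_mu = sp.solve(sp.diff(L, x), x)[0]          # x = 2 - mu/2

# ... which gives the dual Lagrangian L~(mu), depending only on mu.
L_dual = sp.simplify(L.subs(x, x_of_mu))         # mu**2/4 - mu

# Step 2b: solve the dual problem min_{mu >= 0} L~(mu).
mu_star = sp.solve(sp.diff(L_dual, mu), mu)[0]   # mu = 2 (>= 0, so feasible)

# Step 3: maximize the primal Lagrangian at mu* to recover x*.
x_star = x_of_mu.subs(mu, mu_star)               # x = 1 (on the boundary, g = 0)

# Strong duality check: p* = f(x*) equals d* = L~(mu*) (both equal -1 here).
print(x_star, f.subs(x, x_star), L_dual.subs(mu, mu_star))
```

Note that the KKT conditions hold at the solution: $\mu^{*} > 0$, $g(x^{*}) = 0$, and hence $\mu^{*} g(x^{*}) = 0$ (the constraint is active).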
## Note about the signs

For building the Lagrangian it is quickest to remember that:

- min problem & $\geq$ constraint $\Rightarrow$ $-$ sign for the Lagrange multiplier
- min & $\leq$ $\Rightarrow$ $+$
- max & $\geq$ $\Rightarrow$ $+$
- max & $\leq$ $\Rightarrow$ $-$

The Lagrangian is constructed by looking at the gradients of the function to be optimized and of the constraint function. A constrained optimization problem is solved when the gradient of the constraint relates in a certain way to the gradient of the function: the gradients must be either parallel or anti-parallel. Since the gradient always points in the direction of steepest slope (uphill, so to say), the direction of the objective function's gradient at the optimum differs depending on whether that optimum is a minimum or a maximum. At the same time, the direction of the constraint's gradient also differs depending on the direction of the inequality. So whether we aim for parallel or anti-parallel gradients when solving the optimization problem depends on both the kind of problem and the direction of the constraint (which determines on which side of the boundary the constraint is actually active). For anti-parallel gradients the resulting sign is $+$, and for parallel gradients it is $-$.

---

## References

1. Bishop, C. M. (2006). _Pattern Recognition and Machine Learning_, Appendix E.