The logistic regression (LR) cost function for a single training example is given by:
\[
\text{Cost} (h_\theta (x),y) =
\begin{cases}
-\log(h_\theta (x)) \quad &\text{if } y=1 \\
-\log(1-h_\theta (x)) \quad &\text{if } y=0
\end{cases}
\]
where \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\) is the logistic function.
Since \(y \in \{0,1\}\), we can combine the two cases into a single, equivalent equation:
\[
\text{Cost} (h_\theta (x),y) = -y\log(h_\theta (x)) - (1-y)\log(1-h_\theta (x))
\]
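As a quick sanity check, substituting the two possible labels recovers the piecewise cases:
\[
y=1:\; -1\cdot\log(h_\theta(x)) - 0\cdot\log(1-h_\theta(x)) = -\log(h_\theta(x)), \qquad
y=0:\; -0\cdot\log(h_\theta(x)) - 1\cdot\log(1-h_\theta(x)) = -\log(1-h_\theta(x)).
\]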
Averaging this over all \(m\) training examples gives the overall cost function for logistic regression:
\[
J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta (x^{(i)}))\right]
\]
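As a minimal NumPy sketch (the function and variable names here are my own, not from the original post), \(J(\theta)\) can be computed in vectorized form:

    import numpy as np

    def sigmoid(z):
        # logistic function g(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        # X: (m, n) matrix of examples, y: (m,) labels in {0, 1}, theta: (n,) parameters
        m = y.shape[0]
        h = sigmoid(X @ theta)  # h_theta(x^(i)) for every example at once
        return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m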
Our goal is to find the \(\theta\) that minimizes \(J(\theta)\), i.e. \(\min_\theta J(\theta)\). To do so, we use gradient descent, but we first need to find the partial derivatives \(\frac{\partial}{\partial \theta_j} J(\theta)\).
We're going to make use of a neat property of the logistic function:
\begin{align}
g'(z) &= \frac{d}{dz} \frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} \\
&= \frac{1+e^{-z}-1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right) \\
&= g(z) (1-g(z))
\end{align}
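As a quick numerical sanity check (my own sketch, not part of the original derivation), the identity can be verified against a central finite difference:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z, eps = 0.7, 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference estimate of g'(z)
    analytic = sigmoid(z) * (1 - sigmoid(z))                     # g(z)(1 - g(z))
    print(abs(numeric - analytic))                               # prints a tiny value, confirming the identity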
So, for our cost function:
\begin{align}
\frac{\partial}{\partial \theta_j} J(\theta) &= -\frac{1}{m} \left [\frac{\partial}{\partial \theta_j} \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta (x^{(i)})) \right] \\
&= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}\frac{1}{h_\theta(x^{(i)})}\frac{\partial}{\partial \theta_j}h_\theta(x^{(i)}) + (1-y^{(i)})\frac{1}{1-h_\theta(x^{(i)})}\left (-\frac{\partial}{\partial \theta_j}h_\theta (x^{(i)})\right) \right]
\end{align}
Using the chain rule and the logistic-function derivative above, \(\frac{\partial}{\partial \theta_j}h_\theta(x^{(i)}) = h_\theta(x^{(i)})\left(1-h_\theta(x^{(i)})\right)x_j^{(i)}\), where the \(x_j^{(i)}\) factor comes from differentiating \(\theta^Tx^{(i)}\) with respect to \(\theta_j\). Substituting this in, we see that
\begin{align}
\frac{\partial}{\partial \theta_j} J(\theta) &=-\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}\frac{x_j^{(i)}}{h_\theta(x^{(i)})}h_\theta(x^{(i)})(1-h_\theta(x^{(i)})) - (1-y^{(i)})\frac{x_j^{(i)}}{1-h_\theta(x^{(i)})}h_\theta (x^{(i)})(1-h_\theta(x^{(i)})) \right] \\
&= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}x_j^{(i)}(1-h_\theta(x^{(i)})) - (1-y^{(i)})x_j^{(i)}h_\theta (x^{(i)}) \right] \\
&= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}x_j^{(i)} - x_j^{(i)}h_\theta (x^{(i)}) \right] \\
\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{1}{m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)}) x_j^{(i)}
\end{align}
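In code, the whole vector of partial derivatives can be computed at once. The sketch below (my own naming, following the conventions of the earlier snippet) is vectorized over both examples and features:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient(theta, X, y):
        # X: (m, n), y: (m,), theta: (n,); returns the (n,) vector of partial derivatives
        m = y.shape[0]
        h = sigmoid(X @ theta)      # predictions h_theta(x^(i))
        return X.T @ (h - y) / m    # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) x^(i)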
This formula can now be used in the gradient descent update \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\), applied simultaneously for every \(j\).
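Putting it together, a bare-bones batch gradient descent loop might look like the sketch below (the learning rate and iteration count are my own illustrative choices, not values from the post):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent(X, y, alpha=0.1, iterations=1000):
        # X: (m, n) design matrix, y: (m,) labels in {0, 1}, alpha: learning rate
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iterations):
            h = sigmoid(X @ theta)
            grad = X.T @ (h - y) / m    # the gradient derived above
            theta -= alpha * grad       # simultaneous update of all theta_j
        return theta

If the first column of X is all ones, this learns the intercept \(\theta_0\) along with the feature weights.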