Sunday, August 19, 2012

Partial Derivative Logistic Regression Cost Function

Logistic regression is used for classification problems. As Andrew said, this is a bit confusing given the "regression" in the name.

The logistic regression cost function for a single training example is given by:
\[
\text{Cost} (h_\theta (x),y) =
\begin{cases}
 -\log(h_\theta (x)) \quad &\text{if } y=1 \\
 -\log(1-h_\theta (x)) \quad &\text{if } y=0
\end{cases}
\]

where \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\) is the logistic function.
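For intuition, a confident correct prediction is cheap and a confident wrong one is expensive. For example, if \(h_\theta(x) = 0.9\):
\[
\text{Cost}(0.9, 1) = -\log(0.9) \approx 0.105, \qquad \text{Cost}(0.9, 0) = -\log(0.1) \approx 2.303
\]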

Since \(y \in \{0,1\}\), we can combine the two cases into a single, equivalent expression:
\[
\text{Cost} (h_\theta (x),y) =  -y\log(h_\theta (x)) - (1-y)\log(1-h_\theta (x))
\]
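
To check the equivalence, substitute the two possible labels: with \(y=1\) the second term vanishes, and with \(y=0\) the first term vanishes.
\begin{align}
y=1 &: \quad -1\cdot\log(h_\theta (x)) - (1-1)\log(1-h_\theta (x)) = -\log(h_\theta (x)) \\
y=0 &: \quad -0\cdot\log(h_\theta (x)) - (1-0)\log(1-h_\theta (x)) = -\log(1-h_\theta (x))
\end{align}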

This leads to the overall cost function for the logistic regression:
\[
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta (x^{(i)})) \right]
\]
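
As a concrete reference, here is a minimal NumPy sketch of \(J(\theta)\). The names and shapes are assumptions for illustration: `X` is an \(m \times n\) design matrix (intercept column already included), `y` is a length-\(m\) vector of 0/1 labels, and `theta` is a length-\(n\) parameter vector.

```python
import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]
    m = y.size
    h = sigmoid(X @ theta)   # h_theta(x^(i)) for every example
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```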

Our goal is to find \(\min_\theta J(\theta)\). To do so, we use gradient descent, but we first need to find the partial derivatives \(\frac{\partial}{\partial \theta_j} J(\theta)\).

We're going to make use of a neat property of the logistic function:
\begin{align}
g'(z) &= \frac{d}{dz} \frac{1}{1+e^{-z}} = \frac{1}{(1+e^{-z})^2}e^{-z} \\
 &= \frac{1+e^{-z}-1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right) \\
 &= g(z) (1-g(z))
\end{align}
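
A quick numerical sanity check of this identity (a throwaway sketch, assuming NumPy is available): compare a central finite difference of \(g\) against \(g(z)(1-g(z))\) at a few points.

```python
import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic function

z = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)   # finite-difference estimate of g'(z)
analytic = g(z) * (1 - g(z))                      # the identity derived above
print(np.allclose(numeric, analytic))             # expect True
```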

So, for our cost function:
\begin{align}
\frac{\partial}{\partial \theta_j} J(\theta) &= -\frac{1}{m} \left [\frac{\partial}{\partial \theta_j} \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta (x^{(i)})) \right] \\
 &= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}\frac{1}{h_\theta(x^{(i)})}\frac{\partial}{\partial \theta_j}h_\theta(x^{(i)}) + (1-y^{(i)})\frac{1}{1-h_\theta(x^{(i)})}\left (-\frac{\partial}{\partial \theta_j}h_\theta (x^{(i)})\right) \right]
\end{align}
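
The remaining piece is \(\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})\). Writing \(h_\theta(x^{(i)}) = g(\theta^T x^{(i)})\) and applying the chain rule with \(g'(z) = g(z)(1-g(z))\), and noting that \(\frac{\partial}{\partial \theta_j}\theta^T x^{(i)} = x_j^{(i)}\), we get
\[
\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = h_\theta(x^{(i)})\left(1-h_\theta(x^{(i)})\right)x_j^{(i)}
\]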


Substituting this into each term of the sum, we see that
\begin{align}
\frac{\partial}{\partial \theta_j} J(\theta) &=-\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}\frac{x_j^{(i)}}{h_\theta(x^{(i)})}h_\theta(x^{(i)})(1-h_\theta(x^{(i)})) - (1-y^{(i)})\frac{x_j^{(i)}}{1-h_\theta(x^{(i)})}h_\theta (x^{(i)})(1-h_\theta(x^{(i)})) \right] \\
&=  -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}x_j^{(i)}(1-h_\theta(x^{(i)})) - (1-y^{(i)})x_j^{(i)}h_\theta (x^{(i)}) \right] \\
&=  -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}x_j^{(i)} - x_j^{(i)}h_\theta (x^{(i)}) \right] \\
\frac{\partial}{\partial \theta_j} J(\theta) &=  \frac{1}{m} \sum_{i=1}^m  (h_\theta (x^{(i)}) -  y^{(i)}) x_j^{(i)}
\end{align}

This formula can now be used in gradient descent.
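
For reference, the batch gradient descent update is \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\), applied simultaneously for all \(j\), where \(\alpha\) is the learning rate. Below is a minimal NumPy sketch of the gradient and the update loop, under the same assumed shapes as the cost sketch above (the sigmoid helper is repeated so the snippet stands alone); `alpha` and `iters` are illustrative defaults, not recommendations.

```python
import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i), for all j at once
    m = y.size
    return X.T @ (sigmoid(X @ theta) - y) / m

def gradient_descent(X, y, alpha=0.1, iters=1000):
    # simultaneous update: theta := theta - alpha * grad J(theta)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * gradient(theta, X, y)
    return theta
```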

