The logistic regression (LR) cost function for a single training example is given by:
\[
\text{Cost} (h_\theta (x),y) =
\begin{cases}
-\log(h_\theta (x)) \quad &\text{if } y=1 \\
-\log(1-h_\theta (x)) \quad &\text{if } y=0
\end{cases}
\]
where \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\) is the logistic function.
Since \(y \in \{0,1\}\), we can combine the two cases into a single, equivalent equation:
\[
\text{Cost} (h_\theta (x),y) = -y\log(h_\theta (x)) - (1-y)\log(1-h_\theta (x))
\]
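As a quick sanity check, substituting the two possible labels recovers the piecewise cases:
\[
y=1:\; -1\cdot\log(h_\theta(x)) - 0\cdot\log(1-h_\theta(x)) = -\log(h_\theta(x)), \qquad
y=0:\; -0\cdot\log(h_\theta(x)) - 1\cdot\log(1-h_\theta(x)) = -\log(1-h_\theta(x)).
\]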
Averaging this over all \(m\) training examples gives the overall cost function for logistic regression:
\[
J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta (x^{(i)}))\right]
\]
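As a minimal NumPy sketch (the function and variable names here are my own, not from the original post), \(J(\theta)\) can be computed in vectorized form:

    import numpy as np

    def sigmoid(z):
        # logistic function g(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        # X: (m, n) matrix of examples, y: (m,) labels in {0, 1}, theta: (n,) parameters
        m = y.shape[0]
        h = sigmoid(X @ theta)  # h_theta(x^(i)) for every example at once
        return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m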
Our goal is to find the \(\theta\) that minimizes \(J(\theta)\), i.e. \(\min_\theta J(\theta)\). To do so, we use gradient descent, but we first need to find the partial derivatives \(\frac{\partial}{\partial \theta_j} J(\theta)\).
We're going to make use of a neat property of the logistic function:
\begin{align}
g'(z) &= \frac{d}{dz} \frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} \\
&= \frac{1+e^{-z}-1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right) \\
&= g(z) (1-g(z))
\end{align}
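As a quick numerical sanity check (my own sketch, not part of the original derivation), the identity can be verified against a central finite difference:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z, eps = 0.7, 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference estimate of g'(z)
    analytic = sigmoid(z) * (1 - sigmoid(z))                     # g(z)(1 - g(z))
    print(abs(numeric - analytic))                               # prints a tiny value, confirming the identity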
So, for our cost function:
\begin{align}
\frac{\partial}{\partial \theta_j} J(\theta) &= -\frac{1}{m} \left [\frac{\partial}{\partial \theta_j} \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta (x^{(i)})) \right] \\
&= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}\frac{1}{h_\theta(x^{(i)})}\frac{\partial}{\partial \theta_j}h_\theta(x^{(i)}) + (1-y^{(i)})\frac{1}{1-h_\theta(x^{(i)})}\left (-\frac{\partial}{\partial \theta_j}h_\theta (x^{(i)})\right) \right]
\end{align}
Using the chain rule and the logistic-function derivative above, \(\frac{\partial}{\partial \theta_j}h_\theta(x^{(i)}) = h_\theta(x^{(i)})\left(1-h_\theta(x^{(i)})\right)x_j^{(i)}\), where the \(x_j^{(i)}\) factor comes from differentiating \(\theta^Tx^{(i)}\) with respect to \(\theta_j\). Substituting this in, we see that
\begin{align}
\frac{\partial}{\partial \theta_j} J(\theta) &=-\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}\frac{x_j^{(i)}}{h_\theta(x^{(i)})}h_\theta(x^{(i)})(1-h_\theta(x^{(i)})) - (1-y^{(i)})\frac{x_j^{(i)}}{1-h_\theta(x^{(i)})}h_\theta (x^{(i)})(1-h_\theta(x^{(i)})) \right] \\
&= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}x_j^{(i)}(1-h_\theta(x^{(i)})) - (1-y^{(i)})x_j^{(i)}h_\theta (x^{(i)}) \right] \\
&= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}x_j^{(i)} - x_j^{(i)}h_\theta (x^{(i)}) \right] \\
\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{1}{m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)}) x_j^{(i)}
\end{align}
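In code, the whole vector of partial derivatives can be computed at once. The sketch below (my own naming, following the conventions of the earlier snippet) is vectorized over both examples and features:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient(theta, X, y):
        # X: (m, n), y: (m,), theta: (n,); returns the (n,) vector of partial derivatives
        m = y.shape[0]
        h = sigmoid(X @ theta)      # predictions h_theta(x^(i))
        return X.T @ (h - y) / m    # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) x^(i)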
This formula can now be used in the gradient descent update \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\), applied simultaneously for every \(j\).
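Putting it together, a bare-bones batch gradient descent loop might look like the sketch below (the learning rate and iteration count are my own illustrative choices, not values from the post):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent(X, y, alpha=0.1, iterations=1000):
        # X: (m, n) design matrix, y: (m,) labels in {0, 1}, alpha: learning rate
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iterations):
            h = sigmoid(X @ theta)
            grad = X.T @ (h - y) / m    # the gradient derived above
            theta -= alpha * grad       # simultaneous update of all theta_j
        return theta

If the first column of X is all ones, this learns the intercept \(\theta_0\) along with the feature weights.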