The logistic regression cost function is given by:
\begin{align} \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \end{align}
where h_\theta(x) = \frac{1}{1+e^{-\theta^T x}} is the logistic function.
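As a concrete sketch (assuming NumPy; the function name `sigmoid` is our own choice), the logistic function is:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))
```

It maps any real input into the interval (0, 1), with sigmoid(0) = 0.5.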
Since y \in \{0, 1\} only, we can reduce the cost function to an equivalent, single equation:
\begin{align} \mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1-y)\log(1-h_\theta(x)) \end{align}
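One quick way to convince yourself of the equivalence (a small check in plain Python; `cost` is an illustrative name):

```python
import math

def cost(h, y):
    """Single-equation cost: -y*log(h) - (1-y)*log(1-h)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# With y = 1 the second term vanishes, leaving -log(h);
# with y = 0 the first term vanishes, leaving -log(1-h).
```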
This leads to the overall cost function for logistic regression:
\begin{align} J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] \end{align}
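A vectorized sketch of J(\theta) (assuming NumPy; the names are illustrative, and no numerical safeguard against log(0) is included):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_J(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]."""
    m = len(y)
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every example at once
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
```

With theta = 0 every prediction is 0.5, so J evaluates to log 2 regardless of the labels.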
Our goal is to find \min_\theta J(\theta). To do so, we use gradient descent, but we first need to find the partial derivatives \frac{\partial}{\partial \theta_j} J(\theta).
We're going to make use of a neat property of the logistic function:
\begin{align} g'(z) &= \frac{d}{dz} \frac{1}{1+e^{-z}} = \frac{1}{(1+e^{-z})^2}e^{-z} \\ &= \frac{1+e^{-z}-1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}(1-\frac{1}{1+e^{-z}}) \\ &= g(z) (1-g(z)) \end{align}
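This identity is easy to sanity-check numerically (a finite-difference sketch, assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Central finite difference approximates g'(z); compare with g(z)(1 - g(z)).
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
```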
So for our cost function:
\begin{align} \frac{\partial}{\partial \theta_j} J(\theta) &= -\frac{1}{m} \left [\frac{\partial}{\partial \theta_j} \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta (x^{(i)})) \right] \\ &= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}\frac{1}{h_\theta(x^{(i)})}\frac{\partial}{\partial \theta_j}h_\theta(x^{(i)}) + (1-y^{(i)})\frac{1}{1-h_\theta(x^{(i)})}\left (-\frac{\partial}{\partial \theta_j}h_\theta (x^{(i)})\right) \right] \end{align}
Using the chain rule and the derivative of the logistic function, we see that
\begin{align} \frac{\partial}{\partial \theta_j} J(\theta) &= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}\frac{x_j^{(i)}}{h_\theta(x^{(i)})}h_\theta(x^{(i)})(1-h_\theta(x^{(i)})) - (1-y^{(i)})\frac{x_j^{(i)}}{1-h_\theta(x^{(i)})}h_\theta (x^{(i)})(1-h_\theta(x^{(i)})) \right] \\ &= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}x_j^{(i)}(1-h_\theta(x^{(i)})) - (1-y^{(i)})x_j^{(i)}h_\theta (x^{(i)}) \right] \\ &= -\frac{1}{m} \left [ \sum_{i=1}^m y^{(i)}x_j^{(i)} - x_j^{(i)}h_\theta (x^{(i)}) \right] \\ \frac{\partial}{\partial \theta_j} J(\theta) &= \frac{1}{m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)}) x_j^{(i)} \end{align}
This formula can now be used in gradient descent.
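Putting it together, a batch gradient descent loop built on this derivative might look like the following sketch (assuming NumPy; `alpha`, `iters`, and the function names are our own choices, and there is no regularization or convergence check):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Vector of partials: (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x^(i)."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeatedly step theta against the gradient of J."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * gradient(theta, X, y)
    return theta
```

On a tiny linearly separable example (a column of ones for the intercept plus one feature), the learned boundary separates the two label groups.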