Logistic regression is like linear regression, except that the linear output is passed through a squashing function called the logistic (or sigmoid) function, which maps any real number into the interval (0,1). This makes logistic regression useful for classification problems, where the data set's output feature is 0 if the point is not in the class and 1 if it is. The hypothesis function is then interpreted as the probability that the point is in the class.
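To make the squashing step concrete, here is a minimal sketch of the logistic function and the resulting hypothesis. The names `sigmoid` and `hypothesis`, and the convention that the first feature is a constant 1 (so the first parameter acts as an intercept), are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real z toward the interval (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """Estimated probability that the point x is in the class.

    theta and x are equal-length lists. x[0] is assumed to be the
    constant 1, so theta[0] plays the role of the intercept.
    """
    z = sum(t * xi for t, xi in zip(theta, x))  # same linear part as linear regression
    return sigmoid(z)

# With all parameters at zero the model is maximally unsure.
p = hypothesis([0.0, 0.0], [1.0, 3.0])
```

A linear output of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural threshold for deciding class membership.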
Just as in linear regression, we define a cost function with an optional regularization term. The cost function for logistic regression is different from that of linear regression, but it still maintains the property that 0 is perfect, and higher is worse:
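For reference, the standard regularized cost for logistic regression can be written as follows. The notation is an assumption, not taken from this post: m is the number of training examples, n the number of features, h_theta(x) the logistic function applied to theta^T x, lambda the regularization strength, and the intercept theta_0 is left out of the penalty by the usual convention.

```latex
\[
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right)
       + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]
  + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
\]
```

Each example contributes -log h when y = 1 and -log(1 - h) when y = 0, so a confident correct prediction costs almost nothing and a confident wrong one costs a great deal.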
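Interestingly, with the cost defined like this, the gradient has the same form as in linear regression; the only difference is that the hypothesis is now the logistic function of the linear combination: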
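In the usual notation (assumed here: m training examples, h_theta(x) the logistic function applied to theta^T x, lambda the regularization strength, and an unregularized intercept theta_0), the gradient of the standard regularized cost works out to:

```latex
\[
\frac{\partial J}{\partial \theta_0}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_0^{(i)},
\qquad
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
  + \frac{\lambda}{m}\,\theta_j
  \quad (j \ge 1)
\]
```

The simplification comes from the derivative of the logistic function, which is h(1 - h) and cancels against the denominators produced by differentiating the logarithms.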
Because this cost function is convex, gradient descent with a suitable learning rate (or, in fact, any minimization algorithm that converges) gives us a solution which is a global optimum rather than a merely local one.
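As a rough illustration, the sketch below runs plain batch gradient descent from two different starting points on a small made-up data set and arrives at essentially the same parameters both times. Everything here is an assumption chosen for the example: the function names, the toy data, the learning rate `alpha`, and the iteration count. The gradient used is the standard one for the regularized logistic cost, with the intercept left unregularized.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def gradient(theta, X, y, lam=0.0):
    """Gradient of the (optionally regularized) logistic cost."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict_proba(theta, xi) - yi  # same "prediction minus target" term as linear regression
        for j, xij in enumerate(xi):
            grad[j] += err * xij / m
    for j in range(1, len(theta)):  # theta[0] (the intercept) is not penalized
        grad[j] += lam / m * theta[j]
    return grad

def gradient_descent(X, y, theta_start, alpha=0.5, iterations=5000, lam=0.0):
    theta = list(theta_start)
    for _ in range(iterations):
        grad = gradient(theta, X, y, lam)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up data: first column is the constant 1, second is a single feature.
# The classes overlap (x=1.5 is positive, x=2.0 is negative), so the optimum is finite.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta = gradient_descent(X, y, [0.0, 0.0])
theta_alt = gradient_descent(X, y, [3.0, -2.0])  # a different starting point
```

Comparing `theta` and `theta_alt` is a quick sanity check on convexity: with a non-convex cost, different starting points could end up in different minima.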
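When looking at costs from a learning curve perspective, use the logistic regression cost without the regularization term, even if the parameters were fitted with regularization. The regularization term is a device for fitting the parameters, not a measure of how well the model predicts.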
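One way to keep the two uses of the cost apart is a single function whose regularization strength defaults to zero, so that learning-curve code gets the unregularized value unless it asks otherwise. This is a sketch under assumptions: the name `cost`, the parameter `lam`, the clamping constant, and the data and parameters below are all invented for illustration.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(theta, X, y, lam=0.0):
    """Logistic regression cost; lam=0 gives the unregularized value."""
    m = len(X)
    eps = 1e-12  # clamp predictions away from 0 and 1 so log() stays finite
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        h = min(max(h, eps), 1.0 - eps)
        total -= yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    penalty = lam / (2.0 * m) * sum(t * t for t in theta[1:])  # intercept not penalized
    return total / m + penalty

# Hypothetical parameters, standing in for the result of a regularized fit.
theta = [-3.0, 2.0]
X_train = [[1.0, 0.5], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y_train = [0, 0, 1, 1]
X_val = [[1.0, 0.8], [1.0, 2.5]]
y_val = [0, 1]

fitting_objective = cost(theta, X_train, y_train, lam=1.0)  # what the optimizer minimizes
train_error = cost(theta, X_train, y_train)                 # what goes on the learning curve
val_error = cost(theta, X_val, y_val)
```

Plotting `train_error` and `val_error` against training set size then compares like with like, since neither includes a penalty that depends only on the size of the parameters.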
If there is more than one class, that is, the problem is one of multiclass classification, then we can simply train one set of parameters per class, each time treating that class as 1 and every other class as 0 (the one-vs-all approach). Then, when we evaluate a data point, we feed its features into all classifiers, which gives us a probability for each class that the point is in it. We can then simply choose the class with the highest probability.
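The two halves of this procedure, relabeling the targets for each classifier and picking the most probable class at prediction time, can be sketched as follows. The function names, class labels, and parameter values are hypothetical; the training step itself is omitted and the parameters are simply assumed to have been fitted already.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def binary_labels(y, cls):
    """Relabel a multiclass target as 1 for cls and 0 for every other class."""
    return [1 if label == cls else 0 for label in y]

def predict_class(thetas, x):
    """Return (best_class, probabilities) for the feature vector x.

    thetas maps each class label to that class's parameter list.
    """
    probs = {
        cls: sigmoid(sum(t * v for t, v in zip(theta, x)))
        for cls, theta in thetas.items()
    }
    best = max(probs, key=probs.get)  # class whose classifier is most confident
    return best, probs

# Targets used to train the classifier for class "a".
labels_for_a = binary_labels(["a", "b", "c", "a"], "a")

# Hypothetical parameters, one list per class, as if already trained.
thetas = {"a": [2.0, -1.5], "b": [-1.0, 0.2], "c": [-4.0, 1.5]}
best, probs = predict_class(thetas, [1.0, 4.0])
```

Because each classifier is trained independently, the per-class probabilities need not sum to 1; only their ranking is used to choose the class.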
The next post will unify the linear and logistic regression cost functions so that we can see they fall out from the same considerations.