The OP mistakenly believes that the relationship between these two functions is due to the number of samples (i.e. a single sample vs. all of them). However, the actual difference is simply how we choose our training labels.
In the case of binary classification we may assign the labels y=±1 or y=0,1.
As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z)=1-\sigma(z)$ and $\sigma(z)\in(0,1)$, with $\sigma(z)\to 1$ as $z\to+\infty$ and $\sigma(z)\to 0$ as $z\to-\infty$. If we pick the labels $y=0,1$ we may assign
$$P(y=1\mid z)=\sigma(z)=\frac{1}{1+e^{-z}},\qquad P(y=0\mid z)=1-\sigma(z)=\frac{1}{1+e^{z}}$$
which can be written more compactly as $P(y\mid z)=\sigma(z)^{y}\big(1-\sigma(z)\big)^{1-y}$.
It is easier to work with the log-likelihood, and maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For $m$ samples $\{x_i,y_i\}$, after taking the natural logarithm and some simplification we find:
$$l(z)=-\log\Big(\prod_i^m P(y_i\mid z_i)\Big)=-\sum_i^m\log\big(P(y_i\mid z_i)\big)=\sum_i^m\big(-y_i z_i+\log(1+e^{z_i})\big)$$
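As a small, unofficial illustration, here is what this simplified sum looks like in NumPy; the function name and the use of np.logaddexp for a stable $\log(1+e^{z_i})$ are my own choices, not taken from the linked notebook.

```python
import numpy as np

def neg_log_likelihood_01(z, y):
    """Negative log-likelihood l(z) for labels y in {0, 1}.

    z : array of scores z_i (for logistic regression, z_i = w . x_i)
    y : array of labels in {0, 1}
    """
    # np.logaddexp(0, z) evaluates log(1 + e^z) without overflow
    return np.sum(-y * z + np.logaddexp(0, z))
```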
The full derivation and additional information can be found in this Jupyter notebook. On the other hand, we may instead have used the labels $y=\pm 1$. It is then quite natural to assign
$$P(y\mid z)=\sigma(yz).$$
It is also clear that $P(y=-1\mid z)=\sigma(-z)=1-\sigma(z)$, i.e. the class labelled $0$ before now carries the label $-1$. Following the same steps as before, in this case we minimize the loss function
$$L(z)=-\log\Big(\prod_j^m P(y_j\mid z_j)\Big)=-\sum_j^m\log\big(P(y_j\mid z_j)\big)=\sum_j^m\log\big(1+e^{-y_j z_j}\big)$$
where the last step follows because the minus sign turns $-\log\sigma(y_j z_j)$ into the logarithm of its reciprocal, $\log\big(1+e^{-y_j z_j}\big)$. While we should not equate these two forms, given that in each form $y$ takes different values, the two are nevertheless equivalent:
$$-y_i z_i+\log(1+e^{z_i})\equiv\log\big(1+e^{-y_j z_j}\big)$$
The case $y_i=1$ is trivial to show. If $y_i\neq 1$, then $y_i=0$ on the left-hand side and $y_j=-1$ on the right-hand side, and both sides reduce to $\log(1+e^{z})$.
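To make the equivalence concrete, a quick numerical check (my own sketch, not part of the original argument) can map $\{0,1\}$ labels to $\{-1,+1\}$ and compare the two expressions term by term:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)             # arbitrary scores z_i
y01 = rng.integers(0, 2, size=5)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # the same labels mapped to {-1, +1}

# Left-hand side: -y_i z_i + log(1 + e^{z_i}), labels in {0, 1}
lhs = -y01 * z + np.logaddexp(0, z)
# Right-hand side: log(1 + e^{-y_j z_j}), labels in {-1, +1}
rhs = np.logaddexp(0, -ypm * z)

print(np.allclose(lhs, rhs))       # True: the two losses agree sample by sample
```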
While there may be fundamental reasons as to why we have two different forms (see Why there are two different logistic loss formulation / notations?), one reason to choose the former is for practical considerations. In the former we can use the property $\partial\sigma(z)/\partial z=\sigma(z)\big(1-\sigma(z)\big)$ to trivially calculate $\nabla l(z)$ and $\nabla^2 l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
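As a rough sketch of how that derivative property is used, the gradient and Hessian can be written as follows, assuming $z=Xw$ and labels in $\{0,1\}$ (the function names are my own); the Hessian $X^{\top}SX$ is positive semi-definite, which is the convexity argument referred to above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_and_hessian(w, X, y):
    """Gradient and Hessian of l(w) = sum_i [-y_i z_i + log(1 + e^{z_i})], z = X w,
    for labels y in {0, 1}, using d(sigma)/dz = sigma(z) (1 - sigma(z))."""
    z = X @ w
    p = sigmoid(z)
    grad = X.T @ (p - y)          # gradient: X^T (sigma(z) - y)
    S = np.diag(p * (1.0 - p))    # diag of sigma(z_i)(1 - sigma(z_i)), from the derivative property
    hess = X.T @ S @ X            # Hessian: X^T S X, positive semi-definite -> l is convex
    return grad, hess
```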
I learned the loss function for logistic regression as follows.
Logistic regression performs binary classification, and so the label outputs are binary, 0 or 1. Let $P(y=1\mid x)$ be the probability that the binary output $y$ is 1 given the input feature vector $x$. The coefficients $w$ are the weights that the algorithm is trying to learn:

$$P(y=1\mid x)=\frac{1}{1+e^{-w^{\top}x}}$$
Because logistic regression is binary, the probability $P(y=0\mid x)$ is simply 1 minus the term above.
The loss function $J(w)$ is built, for one training example, from (A) the label $y$ multiplied by the log of $P(y=1)$ and (B) the quantity $1-y$ multiplied by the log of $P(y=0)$; summing over the $m$ training examples and negating gives:

$$J(w)=-\sum_{i=1}^{m}\Big[y^{(i)}\log P\big(y=1\mid x^{(i)}\big)+\big(1-y^{(i)}\big)\log P\big(y=0\mid x^{(i)}\big)\Big]$$
where $y^{(i)}$ indicates the $i$th label in your training data. If a training instance has a label of 1, then $y^{(i)}=1$, leaving the left summand in place but making the right summand with $1-y^{(i)}$ become 0. On the other hand, if a training instance has $y=0$, then the right summand with the term $1-y^{(i)}$ remains in place, but the left summand becomes 0. Log probability is used for ease of calculation.
If we then replace $P(y=1)$ and $P(y=0)$ with the earlier expressions, we get:

$$J(w)=-\sum_{i=1}^{m}\bigg[y^{(i)}\log\frac{1}{1+e^{-w^{\top}x^{(i)}}}+\big(1-y^{(i)}\big)\log\Big(1-\frac{1}{1+e^{-w^{\top}x^{(i)}}}\Big)\bigg]$$
You can read more about this form in these Stanford lecture notes.
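For illustration only, a minimal gradient-descent loop that learns the weights $w$ by minimizing $J(w)$ might look like the sketch below; the learning rate, iteration count, and the $1/m$ averaging are my own choices, not prescribed by the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Learn the weights w by plain gradient descent on J(w).

    X : (m, n) matrix of input feature vectors
    y : (m,) array of binary labels in {0, 1}
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(X @ w)            # P(y = 1 | x) under the current weights
        grad = X.T @ (p - y) / m      # gradient of J(w), averaged over the m examples
        w -= lr * grad
    return w
```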
source
Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for $y=1$ and one for $y=0$:

$$\mathrm{Cost}\big(h_\theta(x),y\big)=\begin{cases}-\log\big(h_\theta(x)\big) & \text{if } y=1\\[2pt]-\log\big(1-h_\theta(x)\big) & \text{if } y=0\end{cases}$$

where $h_\theta(x)$ denotes the model's predicted probability that $y=1$.
When we put them together we have:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_\theta(x^{(i)})\big)+\big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big]$$
Multiplying by $y$ and $(1-y)$ in the above equation is a sneaky trick that lets us use the same equation to solve for both the $y=1$ and $y=0$ cases. If $y=0$, the first term cancels out. If $y=1$, the second term cancels out. In both cases we only perform the operation we need to perform.
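A tiny numerical check (my own, not from the cheatsheet) makes the cancellation visible by comparing the piecewise and combined forms:

```python
import numpy as np

h = np.array([0.9, 0.2, 0.7])   # example predicted probabilities h_theta(x)
y = np.array([1, 0, 1])         # example labels

# Piecewise form: -log(h) when y = 1, -log(1 - h) when y = 0
piecewise = np.where(y == 1, -np.log(h), -np.log(1 - h))
# Combined form: the y and (1 - y) factors switch the unused term off
combined = -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(np.allclose(piecewise, combined))   # True
```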
If you don't want to use a for loop, you can try a vectorized form of the equation above. The entire explanation can be viewed on the Machine Learning Cheatsheet.
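Here is one possible NumPy sketch of such a vectorized cost, assuming predictions $h=\sigma(X\theta)$; the function name and the $1/m$ average are my own choices rather than the cheatsheet's exact formulation.

```python
import numpy as np

def cost_vectorized(X, y, theta):
    """Vectorised cross-entropy cost: no explicit loop over the m training examples."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for all examples at once
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / len(y)
```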
source