The OP mistakenly believes that the relationship between these two functions is due to the number of samples (i.e. a single sample vs. all of them). However, the actual difference is simply how we choose our training labels.
In the case of binary classification we may assign the labels y=±1 or y=0,1.
As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z)=1-\sigma(z)$ and $\sigma(z)\in(0,1)$, with $\sigma(z)\to 1$ as $z\to+\infty$ and $\sigma(z)\to 0$ as $z\to-\infty$. If we pick the labels $y=0,1$ we may assign
$$P(y=1\mid z)=\sigma(z)=\frac{1}{1+e^{-z}},\qquad P(y=0\mid z)=1-\sigma(z)=\frac{1}{1+e^{z}}$$
which can be written more compactly as $P(y\mid z)=\sigma(z)^{y}\big(1-\sigma(z)\big)^{1-y}$.
It is easier to work with the log-likelihood, and maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For $m$ samples $\{x_i,y_i\}$, after taking the natural logarithm and some simplification we find:
$$l(z)=-\log\Big(\prod_i^m P(y_i\mid z_i)\Big)=-\sum_i^m\log\big(P(y_i\mid z_i)\big)=\sum_i^m\big(-y_i z_i+\log(1+e^{z_i})\big)$$
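As a small, unofficial illustration, here is what this simplified sum looks like in NumPy; the function name and the use of np.logaddexp for a stable $\log(1+e^{z_i})$ are my own choices, not taken from the linked notebook.

```python
import numpy as np

def neg_log_likelihood_01(z, y):
    """Negative log-likelihood l(z) for labels y in {0, 1}.

    z : array of scores z_i (for logistic regression, z_i = w . x_i)
    y : array of labels in {0, 1}
    """
    # np.logaddexp(0, z) evaluates log(1 + e^z) without overflow
    return np.sum(-y * z + np.logaddexp(0, z))
```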
The full derivation and additional information can be found in this Jupyter notebook. On the other hand, we may instead have used the labels $y=\pm 1$. It is then quite natural to assign
$$P(y\mid z)=\sigma(yz).$$
It is also clear that $P(y=-1\mid z)=\sigma(-z)=1-\sigma(z)$, i.e. the class labelled $0$ before now carries the label $-1$. Following the same steps as before, in this case we minimize the loss function
$$L(z)=-\log\Big(\prod_j^m P(y_j\mid z_j)\Big)=-\sum_j^m\log\big(P(y_j\mid z_j)\big)=\sum_j^m\log\big(1+e^{-y_j z_j}\big)$$
where the last step follows because the minus sign turns $-\log\sigma(y_j z_j)$ into the logarithm of its reciprocal, $\log\big(1+e^{-y_j z_j}\big)$. While we should not equate these two forms, given that in each form $y$ takes different values, the two are nevertheless equivalent:
$$-y_i z_i+\log(1+e^{z_i})\equiv\log\big(1+e^{-y_j z_j}\big)$$
The case $y_i=1$ is trivial to show. If $y_i\neq 1$, then $y_i=0$ on the left-hand side and $y_j=-1$ on the right-hand side, and both sides reduce to $\log(1+e^{z})$.
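To make the equivalence concrete, a quick numerical check (my own sketch, not part of the original argument) can map $\{0,1\}$ labels to $\{-1,+1\}$ and compare the two expressions term by term:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)             # arbitrary scores z_i
y01 = rng.integers(0, 2, size=5)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # the same labels mapped to {-1, +1}

# Left-hand side: -y_i z_i + log(1 + e^{z_i}), labels in {0, 1}
lhs = -y01 * z + np.logaddexp(0, z)
# Right-hand side: log(1 + e^{-y_j z_j}), labels in {-1, +1}
rhs = np.logaddexp(0, -ypm * z)

print(np.allclose(lhs, rhs))       # True: the two losses agree sample by sample
```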
While there may be fundamental reasons as to why we have two different forms (see Why there are two different logistic loss formulation / notations?), one reason to choose the former is for practical considerations. In the former we can use the property $\partial\sigma(z)/\partial z=\sigma(z)\big(1-\sigma(z)\big)$ to trivially calculate $\nabla l(z)$ and $\nabla^2 l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
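As a rough sketch of how that derivative property is used, the gradient and Hessian can be written as follows, assuming $z=Xw$ and labels in $\{0,1\}$ (the function names are my own); the Hessian $X^{\top}SX$ is positive semi-definite, which is the convexity argument referred to above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_and_hessian(w, X, y):
    """Gradient and Hessian of l(w) = sum_i [-y_i z_i + log(1 + e^{z_i})], z = X w,
    for labels y in {0, 1}, using d(sigma)/dz = sigma(z) (1 - sigma(z))."""
    z = X @ w
    p = sigmoid(z)
    grad = X.T @ (p - y)          # gradient: X^T (sigma(z) - y)
    S = np.diag(p * (1.0 - p))    # diag of sigma(z_i)(1 - sigma(z_i)), from the derivative property
    hess = X.T @ S @ X            # Hessian: X^T S X, positive semi-definite -> l is convex
    return grad, hess
```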
I learned the loss function for logistic regression as follows.
Logistic regression performs binary classification, and so the label outputs are binary, 0 or 1. Let $P(y=1\mid x)$ be the probability that the binary output $y$ is 1 given the input feature vector $x$. The coefficients $w$ are the weights that the algorithm is trying to learn:

$$P(y=1\mid x)=\frac{1}{1+e^{-w^{\top}x}}$$
Because logistic regression is binary, the probability $P(y=0\mid x)$ is simply 1 minus the term above.
The loss function $J(w)$ is built, for one training example, from (A) the label $y$ multiplied by the log of $P(y=1)$ and (B) the quantity $1-y$ multiplied by the log of $P(y=0)$; summing over the $m$ training examples and negating gives:

$$J(w)=-\sum_{i=1}^{m}\Big[y^{(i)}\log P\big(y=1\mid x^{(i)}\big)+\big(1-y^{(i)}\big)\log P\big(y=0\mid x^{(i)}\big)\Big]$$
where $y^{(i)}$ indicates the $i$th label in your training data. If a training instance has a label of 1, then $y^{(i)}=1$, leaving the left summand in place but making the right summand with $1-y^{(i)}$ become 0. On the other hand, if a training instance has $y=0$, then the right summand with the term $1-y^{(i)}$ remains in place, but the left summand becomes 0. Log probability is used for ease of calculation.
If we then replace $P(y=1)$ and $P(y=0)$ with the earlier expressions, we get:

$$J(w)=-\sum_{i=1}^{m}\bigg[y^{(i)}\log\frac{1}{1+e^{-w^{\top}x^{(i)}}}+\big(1-y^{(i)}\big)\log\Big(1-\frac{1}{1+e^{-w^{\top}x^{(i)}}}\Big)\bigg]$$
You can read more about this form in these Stanford lecture notes.
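For illustration only, a minimal gradient-descent loop that learns the weights $w$ by minimizing $J(w)$ might look like the sketch below; the learning rate, iteration count, and the $1/m$ averaging are my own choices, not prescribed by the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Learn the weights w by plain gradient descent on J(w).

    X : (m, n) matrix of input feature vectors
    y : (m,) array of binary labels in {0, 1}
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(X @ w)            # P(y = 1 | x) under the current weights
        grad = X.T @ (p - y) / m      # gradient of J(w), averaged over the m examples
        w -= lr * grad
    return w
```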
source
Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for $y=1$ and one for $y=0$:

$$\mathrm{Cost}\big(h_\theta(x),y\big)=\begin{cases}-\log\big(h_\theta(x)\big) & \text{if } y=1\\[2pt]-\log\big(1-h_\theta(x)\big) & \text{if } y=0\end{cases}$$

where $h_\theta(x)$ denotes the model's predicted probability that $y=1$.
When we put them together we have:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_\theta(x^{(i)})\big)+\big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big]$$
Multiplying by $y$ and $(1-y)$ in the above equation is a sneaky trick that lets us use the same equation to solve for both the $y=1$ and $y=0$ cases. If $y=0$, the first term cancels out. If $y=1$, the second term cancels out. In both cases we only perform the operation we need to perform.
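A tiny numerical check (my own, not from the cheatsheet) makes the cancellation visible by comparing the piecewise and combined forms:

```python
import numpy as np

h = np.array([0.9, 0.2, 0.7])   # example predicted probabilities h_theta(x)
y = np.array([1, 0, 1])         # example labels

# Piecewise form: -log(h) when y = 1, -log(1 - h) when y = 0
piecewise = np.where(y == 1, -np.log(h), -np.log(1 - h))
# Combined form: the y and (1 - y) factors switch the unused term off
combined = -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(np.allclose(piecewise, combined))   # True
```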
If you don't want to use a for loop, you can try a vectorized form of the equation above. The entire explanation can be viewed on the Machine Learning Cheatsheet.
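Here is one possible NumPy sketch of such a vectorized cost, assuming predictions $h=\sigma(X\theta)$; the function name and the $1/m$ average are my own choices rather than the cheatsheet's exact formulation.

```python
import numpy as np

def cost_vectorized(X, y, theta):
    """Vectorised cross-entropy cost: no explicit loop over the m training examples."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for all examples at once
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / len(y)
```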
source