Quais são as funções de custo comuns usadas na avaliação do desempenho de redes neurais?
Detalhes
(fique à vontade para pular o restante desta pergunta, minha intenção aqui é simplesmente fornecer esclarecimentos sobre a notação que as respostas podem usar para ajudá-las a serem mais compreensíveis para o leitor em geral)
Eu acho que seria útil ter uma lista de funções de custo comuns, juntamente com algumas maneiras pelas quais elas foram usadas na prática. Portanto, se outras pessoas estão interessadas nisso, acho que um wiki da comunidade é provavelmente a melhor abordagem, ou podemos removê-lo se não estiver relacionado ao tópico.
Notação
Então, para começar, gostaria de definir uma notação que todos usamos ao descrevê-las, para que as respostas se encaixem bem.
Esta notação é do livro de Neilsen .
Uma Rede Neural Feedforward é formada por várias camadas de neurônios conectados. Em seguida, recebe uma entrada, que "escorre" pela rede e, em seguida, a rede neural retorna um vetor de saída.
Mais formalmente, chamar a activação (aka saída) do neurónio no camada, onde é a elemento no vector de entrada.
Em seguida, podemos relacionar a entrada da próxima camada com a anterior por meio da seguinte relação:
Onde
é a função de ativação,
é o peso do neurônio na camada para o neurônio na camada ,
is the bias of the neuron in the layer, and
represents the activation value of the neuron in the layer.
Sometimes we write to represent , in other words, the activation value of a neuron before applying the activation function.
For more concise notation we can write
To use this formula to compute the output of a feedforward network for some input , set , then compute , , ...,, where m is the number of layers.
Introduction
A cost function is a measure of "how good" a neural network did with respect to it's given training sample and the expected output. It also may depend on variables such as weights and biases.
A cost function is a single value, not a vector, because it rates how good the neural network did as a whole.
Specifically, a cost function is of the form
where is our neural network's weights, is our neural network's biases, is the input of a single training sample, and is the desired output of that training sample. Note this function can also potentially be dependent on and for any neuron in layer , because those values are dependent on , , and .
In backpropagation, the cost function is used to compute the error of our output layer, , via
Which can also be written as a vector via
We will provide the gradient of the cost functions in terms of the second equation, but if one wants to prove these results themselves, using the first equation is recommended because it's easier to work with.
Cost function requirements
To be used in backpropagation, a cost function must satisfy two properties:
1: The cost function must be able to be written as an average
over cost functions for individual training examples, .
This is so it allows us to compute the gradient (with respect to weights and biases) for a single training example, and run Gradient Descent.
2: The cost function must not be dependent on any activation values of a neural network besides the output values .
Technically a cost function can be dependent on any or . We just make this restriction so we can backpropagte, because the equation for finding the gradient of the last layer is the only one that is dependent on the cost function (the rest are dependent on the next layer). If the cost function is dependent on other activation layers besides the output one, backpropagation will be invalid because the idea of "trickling backwards" no longer works.
Also, activation functions are required to have an output for all . Thus these cost functions need to only be defined within that range (for example, is valid since we are guaranteed ).
fonte
Respostas:
Here are those I understand so far. Most of these work best when given values between 0 and 1.
Quadratic cost
Also known as mean squared error, maximum likelihood, and sum squared error, this is defined as:
The gradient of this cost function with respect to the output of a neural network and some sampler is:
Cross-entropy cost
Also known as Bernoulli negative log-likelihood and Binary Cross-Entropy
The gradient of this cost function with respect to the output of a neural network and some sampler is:
Exponentional cost
This requires choosing some parameterτ that you think will give you the behavior you want. Typically you'll just need to play with this until things work good.
whereexp(x) is simply shorthand for ex .
The gradient of this cost function with respect to the output of a neural network and some sampler is:
I could rewrite outCEXP , but that seems redundant. Point is the gradient computes a vector and then multiplies it by CEXP .
Hellinger distance
You can find more about this here. This needs to have positive values, and ideally values between0 and 1 . The same is true for the following divergences.
The gradient of this cost function with respect to the output of a neural network and some sampler is:
Kullback–Leibler divergence
Also known as Information Divergence, Information Gain, Relative entropy, KLIC, or KL Divergence (See here).
Kullback–Leibler divergence is typically denoted
whereDKL(P∥Q) is a measure of the information lost when Q is used to approximate P . Thus we want to set P=Ei and Q=aL , because we want to measure how much information is lost when we use aij to approximate Eij . This gives us
The other divergences here use this same idea of settingP=Ei and Q=aL .
The gradient of this cost function with respect to the output of a neural network and some sampler is:
Generalized Kullback–Leibler divergence
From here.
The gradient of this cost function with respect to the output of a neural network and some sampler is:
Itakura–Saito distance
Also from here.
The gradient of this cost function with respect to the output of a neural network and some sampler is:
Where((aL)2)j=aLj⋅aLj . In other words, (aL)2 is simply equal to squaring each element of aL .
fonte
a*(1-a)
nota*(1+a)
Don't have the reputation to comment, but there are sign errors in those last 3 gradients.
In the KL divergence,
In the Itakura-Saito distance ,
fonte