Deep neural network - backpropagation with ReLU

17

I am having some difficulty deriving backpropagation with ReLU. I have done some work, but I am not sure whether I am on the right track.

Cost function: $\frac{1}{2}(y - \hat{y})^2$, where $y$ is the real value and $\hat{y}$ is the predicted value. Also assume that $x > 0$ always.


1-layer ReLU, where the weight at the 1st layer is $w_1$


$$\frac{dC}{dw_1} = \frac{dC}{dR}\frac{dR}{dw_1}$$

$$\frac{dC}{dw_1} = (y - ReLU(w_1 x))(x)$$


2-layer ReLU, where the weight at the 1st layer is $w_2$ and at the 2nd layer is $w_1$, and I want to update the 1st layer weight $w_2$


$$\frac{dC}{dw_2} = \frac{dC}{dR}\frac{dR}{dw_2}$$

$$\frac{dC}{dw_2} = (y - ReLU(w_1 \, ReLU(w_2 x)))(w_1 x)$$

Since $ReLU(w_1 \, ReLU(w_2 x)) = w_1 w_2 x$


3-layer ReLU, where the weight at the 1st layer is $w_3$, at the 2nd layer $w_2$ and at the 3rd layer $w_1$


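$$\frac{dC}{dw_3} = \frac{dC}{dR}\frac{dR}{dw_3}$$

$$\frac{dC}{dw_3} = (y - ReLU(w_1 \, ReLU(w_2 \, ReLU(w_3 x))))(w_1 w_2 x)$$

Since $ReLU(w_1 \, ReLU(w_2 \, ReLU(w_3 x))) = w_1 w_2 w_3 x$

So the chain rule only involves 2 derivatives, compared to a sigmoid, where it could be as long as $n$, the number of layers.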


Say I wanted to update the weights of all 3 layers, where $w_3$ is the 1st layer, $w_2$ is the 2nd layer and $w_1$ is the 3rd layer:

$$\frac{dC}{dw_1} = (y - ReLU(w_1 x))(x)$$

$$\frac{dC}{dw_2} = (y - ReLU(w_1 \, ReLU(w_2 x)))(w_1 x)$$

$$\frac{dC}{dw_3} = (y - ReLU(w_1 \, ReLU(w_2 \, ReLU(w_3 x))))(w_1 w_2 x)$$

If this derivation is correct, how does this prevent vanishing? Compare this to sigmoid, where we have a lot of multiplication by 0.25 in the equation, whereas ReLU does not have any constant value multiplication. If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

user1157751
@NeilSlater Thanks for your reply! Can you elaborate? I am not sure what you meant.
User1157751 28/05
Oh, I think I know what you meant. Well, the reason I brought up this question is that I am not sure whether the derivation is correct. I searched around and did not find an example of ReLU derived fully from scratch.
User1157751

Answers:

15

Working definitions of the ReLU function and its derivative:

$$ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{otherwise.} \end{cases}$$

$$\frac{d}{dx} ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise.} \end{cases}$$

The derivative is the unit step function. This does ignore a problem at $x = 0$, where the gradient is not strictly defined, but that is not a practical concern for neural networks. With the above formula, the derivative at 0 is 1, but you could equally treat it as 0 or 0.5 with no real impact on neural network performance.
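As a minimal sketch, the two working definitions can be written directly in plain Python. The function names `relu` and `step` are illustrative choices, and the value at exactly 0 follows the convention stated above (derivative of 1 at 0), which is only one of several acceptable choices.

```python
def relu(x: float) -> float:
    """ReLU(x): 0 for x < 0, x otherwise."""
    return x if x >= 0 else 0.0


def step(x: float) -> float:
    """Derivative of ReLU under the convention above: 0 for x < 0, 1 otherwise."""
    return 1.0 if x >= 0 else 0.0


# Evaluate both functions on a negative input, zero, and a positive input.
for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), step(x))
```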


Simplified network

With those definitions, let's take a look at your example networks.

You are running regression with cost function $C = \frac{1}{2}(y - \hat{y})^2$. You have defined $R$ as the output of the artificial neuron, but you have not defined an input value. I'll add that for completeness - call it $z$, add some indexing by layer, and I prefer lower-case for the vectors and upper-case for matrices, so $r^{(1)}$ is the output of the first layer, $z^{(1)}$ is its input and $W^{(0)}$ is the weight connecting the neuron to its input $x$ (in a larger network, that might connect to a deeper $r$ value instead). I have also adjusted the index number for the weight matrix - why that is will become clearer for the larger network. NB I am ignoring having more than one neuron in each layer for now.

Looking at your simple 1 layer, 1 neuron network, the feed-forward equations are:

$$z^{(1)} = W^{(0)} x$$

$$\hat{y} = r^{(1)} = ReLU(z^{(1)})$$

The derivative of the cost function w.r.t. an example estimate is:

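$$\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}} = \frac{\partial}{\partial r^{(1)}} \frac{1}{2}(y - r^{(1)})^2 = \frac{1}{2} \frac{\partial}{\partial r^{(1)}} \left( y^2 - 2 y r^{(1)} + (r^{(1)})^2 \right) = r^{(1)} - y$$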
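The same gradient can also be reached with the chain rule rather than by expanding the square. The short derivation below introduces an intermediate variable $U = y - \hat{y}$ purely for illustration and arrives at the same $r^{(1)} - y$:

```latex
\begin{aligned}
U &= y - \hat{y}, \qquad C = \tfrac{1}{2} U^2 \\
\frac{\partial C}{\partial \hat{y}}
  &= \frac{\partial C}{\partial U} \, \frac{\partial U}{\partial \hat{y}}
   = U \cdot (-1)
   = -(y - \hat{y})
   = \hat{y} - y
   = r^{(1)} - y
\end{aligned}
```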

Using the chain rule for back propagation to the pre-transform (z) value:

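$$\frac{\partial C}{\partial z^{(1)}} = \frac{\partial C}{\partial r^{(1)}} \frac{\partial r^{(1)}}{\partial z^{(1)}} = (r^{(1)} - y) \, Step(z^{(1)}) = (ReLU(z^{(1)}) - y) \, Step(z^{(1)})$$

This $\frac{\partial C}{\partial z^{(1)}}$ is an interim stage and critical part of backprop linking steps together. Derivations often skip this part because clever combinations of cost function and output layer mean that it is simplified. Here it is not.

To get the gradient with respect to the weight $W^{(0)}$, then it is another iteration of the chain rule:

$$\frac{\partial C}{\partial W^{(0)}} = \frac{\partial C}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(0)}} = (ReLU(z^{(1)}) - y) \, Step(z^{(1)}) \, x = (ReLU(W^{(0)} x) - y) \, Step(W^{(0)} x) \, x$$

. . . because $z^{(1)} = W^{(0)} x$ therefore $\frac{\partial z^{(1)}}{\partial W^{(0)}} = x$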

That is the full solution for your simplest network.
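One way to sanity-check this result is to compare it against a finite-difference estimate of the cost. The sketch below does that for the single-neuron network in plain Python; the numbers for `w`, `x` and `y` are arbitrary illustrative values, and the helper names are not from the original post.

```python
def relu(z: float) -> float:
    return z if z >= 0 else 0.0


def step(z: float) -> float:
    return 1.0 if z >= 0 else 0.0


def cost(w: float, x: float, y: float) -> float:
    """C = 1/2 (y - ReLU(w x))^2 for the one-neuron network."""
    return 0.5 * (y - relu(w * x)) ** 2


def grad_w(w: float, x: float, y: float) -> float:
    """Analytic gradient: (ReLU(w x) - y) * Step(w x) * x."""
    z = w * x
    return (relu(z) - y) * step(z) * x


# Arbitrary example values (assumptions for illustration only).
w, x, y = 0.8, 2.0, 1.0

analytic = grad_w(w, x, y)

# Central finite difference as an independent numerical check.
eps = 1e-6
numeric = (cost(w + eps, x, y) - cost(w - eps, x, y)) / (2 * eps)

print(analytic, numeric)
```

Note the sign: the gradient is built from (prediction minus target), so gradient descent moves the weight in the direction that reduces the error.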

However, in a layered network, you also need to carry the same logic down to the next layer. Also, you typically have more than one neuron in a layer.


More general ReLU network

If we add in more generic terms, then we can work with two arbitrary layers. Call them Layer (k) indexed by i, and Layer (k+1) indexed by j. The weights are now a matrix. So our feed-forward equations look like this:

$$z_j^{(k+1)} = \sum_i W_{ij}^{(k)} r_i^{(k)}$$

$$r_j^{(k+1)} = ReLU(z_j^{(k+1)})$$

In the output layer, then the initial gradient w.r.t. $r_j^{output}$ is still $r_j^{output} - y_j$. However, ignore that for now, and look at the generic way to back propagate, assuming we have already found $\frac{\partial C}{\partial r_j^{(k+1)}}$ - just note that this is ultimately where we get the output cost function gradients from. Then there are 3 equations we can write out following the chain rule:

First we need to get to the neuron input before applying ReLU:

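  1. $\frac{\partial C}{\partial z_j^{(k+1)}} = \frac{\partial C}{\partial r_j^{(k+1)}} \frac{\partial r_j^{(k+1)}}{\partial z_j^{(k+1)}} = \frac{\partial C}{\partial r_j^{(k+1)}} \, Step(z_j^{(k+1)})$

We also need to propagate the gradient to previous layers, which involves summing up all connected influences to each neuron:

  2. $\frac{\partial C}{\partial r_i^{(k)}} = \sum_j \frac{\partial C}{\partial z_j^{(k+1)}} \frac{\partial z_j^{(k+1)}}{\partial r_i^{(k)}} = \sum_j \frac{\partial C}{\partial z_j^{(k+1)}} W_{ij}^{(k)}$

And we need to connect this to the weights matrix in order to make adjustments later:

  3. $\frac{\partial C}{\partial W_{ij}^{(k)}} = \frac{\partial C}{\partial z_j^{(k+1)}} \frac{\partial z_j^{(k+1)}}{\partial W_{ij}^{(k)}} = \frac{\partial C}{\partial z_j^{(k+1)}} \, r_i^{(k)}$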

You can resolve these further (by substituting in previous values), or combine them (often steps 1 and 2 are combined to relate pre-transform gradients layer by layer). However the above is the most general form. You can also substitute the $Step(z_j^{(k+1)})$ in equation 1 for whatever the derivative function is of your current activation function - this is the only place where it affects the calculations.
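The three chain-rule equations can be sketched as a single layer-level backward function. The plain-Python version below is an illustration under stated assumptions, not a reference implementation: weights are stored as nested lists with `W[i][j]` connecting unit `i` of layer (k) to unit `j` of layer (k+1), there are no bias terms, and the tiny 2-2-1 network and its numbers are made up for the example.

```python
def relu(z):
    return z if z >= 0 else 0.0


def step(z):
    return 1.0 if z >= 0 else 0.0


def forward(W, r_prev):
    """z_j = sum_i W[i][j] * r_prev[i];  r_j = ReLU(z_j)."""
    n_in, n_out = len(r_prev), len(W[0])
    z = [sum(W[i][j] * r_prev[i] for i in range(n_in)) for j in range(n_out)]
    return z, [relu(v) for v in z]


def backward(W, r_prev, z, dC_dr):
    """Apply the three equations for one layer, given dC/dr of its output."""
    n_in, n_out = len(r_prev), len(z)
    # Equation 1: gradient at the neuron input, before ReLU.
    dC_dz = [dC_dr[j] * step(z[j]) for j in range(n_out)]
    # Equation 2: gradient passed down to the previous layer's outputs.
    dC_dr_prev = [sum(dC_dz[j] * W[i][j] for j in range(n_out)) for i in range(n_in)]
    # Equation 3: gradient for each weight in this layer.
    dC_dW = [[dC_dz[j] * r_prev[i] for j in range(n_out)] for i in range(n_in)]
    return dC_dz, dC_dr_prev, dC_dW


# Hypothetical 2-input, 2-hidden, 1-output network with made-up numbers.
x = [1.0, 2.0]
y = [1.0]
W0 = [[0.5, -1.0], [0.25, 0.75]]
W1 = [[2.0], [-1.0]]

z1, r1 = forward(W0, x)
z2, r2 = forward(W1, r1)

# Output-layer starting gradient for C = 1/2 sum_j (y_j - r_j)^2.
dC_dr2 = [r2[j] - y[j] for j in range(len(y))]

dC_dz2, dC_dr1, dC_dW1 = backward(W1, r1, z2, dC_dr2)
dC_dz1, dC_dx, dC_dW0 = backward(W0, x, z1, dC_dr1)

print(dC_dW1)
print(dC_dW0)
```

Swapping `step` for another activation's derivative inside equation 1 is the only change needed to reuse `backward` with a different transfer function.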


Back to your questions:

If this derivation is correct, how does this prevent vanishing?

Your derivation was not correct. However, that does not completely address your concerns.

The difference between using sigmoid versus ReLU is just in the step function compared to e.g. sigmoid's $y(1 - y)$, applied once per layer. As you can see from the generic layer-by-layer equations above, the gradient of the transfer function appears in one place only. The sigmoid's best case derivative adds a factor of 0.25 (when $x = 0$, $y = 0.5$), and it gets worse than that and saturates quickly to near zero derivative away from $x = 0$. The ReLU's gradient is either 0 or 1, and in a healthy network will be 1 often enough to have less gradient loss during backpropagation. This is not guaranteed, but experiments show that ReLU has good performance in deep networks.
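To get a feel for the size of this effect, the sketch below multiplies only the activation-derivative factor across a chain of layers. It deliberately ignores the weights and assumes the best case for each activation (every sigmoid unit at $x = 0$, every ReLU unit active), so it is an upper bound on what the transfer function alone contributes, not a simulation of a real network. The function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Sigmoid derivative written as y (1 - y), with y = sigmoid(x)."""
    y = sigmoid(x)
    return y * (1.0 - y)


def activation_factor(per_layer_grad: float, n_layers: int) -> float:
    """Product of the same activation derivative over n_layers layers."""
    return per_layer_grad ** n_layers


best_sigmoid = sigmoid_grad(0.0)  # best case for sigmoid
active_relu = 1.0                 # ReLU derivative when the unit is active

for n in (1, 5, 10, 50):
    print(n, activation_factor(best_sigmoid, n), activation_factor(active_relu, n))

# Away from x = 0 the sigmoid derivative is already much smaller.
print(sigmoid_grad(5.0))
```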

If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

Yes this can have an impact too. This can be a problem regardless of transfer function choice. In some combinations, ReLU may help keep exploding gradients under control too, because it does not saturate (so large weight norms will tend to be poor direct solutions and an optimiser is unlikely to move towards them). However, this is not guaranteed.
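The weight effect can be illustrated with a deliberately simplified chain: one neuron per layer, the same weight in every layer, and every ReLU unit active, so the backpropagated gradient is scaled by the weight once per layer. These are assumptions made only to keep the arithmetic visible; real networks have matrices of differing weights.

```python
def weight_factor(w: float, n_layers: int) -> float:
    """Scale applied to the gradient by n_layers identical weights w,
    assuming every ReLU unit along the path is active (derivative 1)."""
    return w ** n_layers


# Weights slightly below, at, and slightly above 1 over 100 layers.
for w in (0.9, 1.0, 1.1):
    print(w, weight_factor(w, 100))
```

Weights a little below 1 shrink the gradient towards zero and weights a little above 1 blow it up, independently of which transfer function is used.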

Neil Slater
Was a chain rule performed on $\frac{dC}{d\hat{y}}$?
user1157751
@user1157751: No, $\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}}$ because $\hat{y} = r^{(1)}$. The cost function $C$ is simple enough that you can take its derivative immediately. The only thing I haven't shown there is the expansion of the square - would you like me to add it?
Neil Slater
But $C$ is $\frac{1}{2}(y - \hat{y})^2$, don't we need to perform the chain rule so that we can take the derivative with respect to $\hat{y}$? $\frac{dC}{d\hat{y}} = \frac{dC}{dU}\frac{dU}{d\hat{y}}$, where $U = y - \hat{y}$. Apologies for asking really simple questions, my maths ability is probably causing trouble for you : (
user1157751
If you can make things simpler by expanding, then please do expand the square.
user1157751
@user1157751: Yes you could use the chain rule in that way, and it would give the same answer as I show. I just expanded the square - I'll show it.
Neil Slater