Why does Q-Learning diverge?

The state values of my Q-Learning algorithm keep diverging to infinity, which means my weights are diverging too. I use a neural network for my value mapping.

I've tried:

Clipping the "reward + discount * maximum action value" target (max/min set to 50/-50)
Setting a low learning rate (0.00001; I use classic backpropagation to update the weights)
Decreasing the reward values
Increasing the exploration rate
Normalizing the inputs to 1~100 (previously it was 0~1)
Changing the discount rate
Decreasing the number of layers in the neural network (just for validation)
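To make the first item concrete, here is a minimal sketch of what clipping the one-step target looks like. The function name `clipped_td_target` and the sample numbers are hypothetical; only the 50/-50 bounds and the discount of 0.75 come from the post.

```python
import numpy as np

def clipped_td_target(reward, next_q_values, gamma=0.75, lo=-50.0, hi=50.0):
    """One-step Q-learning target, clipped to the range [lo, hi]."""
    target = reward + gamma * np.max(next_q_values)
    return float(np.clip(target, lo, hi))

# Hypothetical numbers, for illustration only.
example_target = clipped_td_target(100.0, np.array([10.0, -3.0, 42.0]))
```

Clipping bounds the size of each individual target, but it does not by itself remove the feedback loop between the network's own estimates and its targets, which may be why it did not help here.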

I've heard that Q-Learning diverges with non-linear inputs, but is there anything else I can try to stop the weights from diverging?
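As background on the mechanism, divergence from bootstrapping with function approximation can be reproduced even with a linear approximator. The sketch below is a textbook-style two-state toy problem, not the game from this post: a single weight `w` gives the values `w` and `2 * w` to two states, the reward is always 0, and only the transition from the first state to the second is ever updated. All names and numbers are illustrative assumptions.

```python
def run_two_state_example(gamma=0.99, alpha=0.1, w=1.0, steps=100):
    """Semi-gradient TD(0) on a single transition s1 -> s2 with reward 0.

    Features: phi(s1) = 1 and phi(s2) = 2, so v(s1) = w and v(s2) = 2 * w.
    The true value of both states is 0.
    """
    history = [w]
    for _ in range(steps):
        td_error = 0.0 + gamma * (2.0 * w) - w  # r + gamma * v(s2) - v(s1)
        w += alpha * td_error * 1.0             # gradient of v(s1) w.r.t. w is 1
        history.append(w)
    return history

diverging = run_two_state_example(gamma=0.99)  # w grows without bound
shrinking = run_two_state_example(gamma=0.4)   # w decays towards 0
```

Each update multiplies `w` by `1 + alpha * (2 * gamma - 1)`, so the weight grows geometrically whenever `gamma > 0.5`: the target moves further away every time the estimate is moved towards it.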

Update #1, August 14, 2017:

I decided to add some specific details about what I'm doing right now, following a request.

Currently, I'm trying to make an agent learn how to fight in a top-down shooting game. The opponent is a simple bot that moves stochastically.

Each character has 9 actions to choose from on each turn:

move up
move down
move left
move right
shoot a bullet upwards
shoot a bullet downwards
shoot a bullet to the left
shoot a bullet to the right
do nothing

The rewards are:

if the agent hits the bot with a bullet, +100 (I've tried many different values)
if the agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)
if the agent tries to fire a bullet while bullets can't be fired (e.g. when the agent has just fired a bullet), -25 (not necessary, but I wanted the agent to be more efficient)
if the bot tries to go out of the arena, -20 (not necessary either, but I wanted the agent to be more efficient)

The inputs to the neural network are:

Distance between the agent and the bot on the X axis, normalized to 0~100
Distance between the agent and the bot on the Y axis, normalized to 0~100
The agent's x and y positions
The bot's x and y positions
The position of the bot's bullet. If the bot hasn't fired a bullet, the parameters are set to the bot's x and y positions.
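As an illustration of how such an input vector could be assembled, here is a minimal sketch that scales positions and signed distances to [-1, 1]. The helper `build_state` is hypothetical and leaves out the bullet position; the arena size and starting positions are the ones in the code further down. Scaling inputs to a small symmetric range is a common choice for neural networks, not something the post confirms as a fix.

```python
import numpy as np

ARENA_X, ARENA_Y = 1000.0, 800.0  # arena size taken from the posted code

def build_state(agent_x, agent_y, bot_x, bot_y):
    """Return positions and signed distances, each scaled to [-1, 1]."""
    # Positions lie in [0, arena size]; map them to [-1, 1].
    pos = np.array([agent_x / ARENA_X, agent_y / ARENA_Y,
                    bot_x / ARENA_X, bot_y / ARENA_Y]) * 2.0 - 1.0
    # Signed distances already lie in [-arena size, arena size].
    dist = np.array([(agent_x - bot_x) / ARENA_X,
                     (agent_y - bot_y) / ARENA_Y])
    return np.concatenate([pos, dist]).reshape(1, -1)

# Starting positions of the agent and the bot in the posted code.
state = build_state(50.0, 50.0, 950.0, 750.0)
```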

I've also fiddled with the inputs; I tried adding new features such as the x value of the agent's position (not the distance, but the actual position) and the position of the bot's bullet. None of them worked.

Here's the code:

from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Setup character dimensions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
    """Initializes parameters"""
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y

    agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
    bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
    agent_hp = bot_hp = character_init_health
    agent_beam_fire = bot_beam_fire = False
    agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
    agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
    global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x, \
    agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue, \
    agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

    disp.fill(aqua)
    draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
                            2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
    draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
                            disp_y / 2 - arena_y / 2, arena_x, arena_y))
    if bot_beam_fire == True:
        draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
        bot_beam_fire = False
    if agent_beam_fire == True:
        draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
        agent_beam_fire = False

    draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
    draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

    draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
                            border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
    draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
                            border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
    return random.randint(1, 9)

def beam_hit_detector(player):
    global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x, \
    bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, \
    bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

    if player == "bot":
        if bot_current_action == 1:
            if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 2:
            if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
        elif bot_current_action == 3:
            if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 4:
            if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
    else:
        if agent_current_action == 1:
            if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif agent_current_action == 2:
            if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False
        elif agent_current_action == 3:
            if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif agent_current_action == 4:
            if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False


def mapping(maximum, number):
    return number#int(number * maximum)

def action(agent_action, bot_action):
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, \
    bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, \
    agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

    agent_current_action = agent_action; bot_current_action = bot_action
    reward = 0; cont = True; successful = False; winner = ""
    if 1 <= bot_action <= 4:
        bot_beam_fire = True
        if bot_action == 1:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
            bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
        elif bot_action == 2:
            bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
        elif bot_action == 3:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
            bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
        elif bot_action == 4:
            bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

    elif 5 <= bot_action <= 8:
        if bot_action == 5:
            bot_y -= character_move_speed
            if bot_y <= disp_y/2 - arena_y/2:
                bot_y = disp_y/2 - arena_y/2
            elif agent_y <= bot_y <= agent_y + character_size:
                bot_y = agent_y + character_size
        elif bot_action == 6:
            bot_x += character_move_speed
            if bot_x >= disp_x/2 + arena_x/2 - character_size:
                bot_x = disp_x/2 + arena_x/2 - character_size
            elif agent_x <= bot_x + character_size <= agent_x + character_size:
                bot_x = agent_x - character_size
        elif bot_action == 7:
            bot_y += character_move_speed
            if bot_y + character_size >= disp_y/2 + arena_y/2:
                bot_y = disp_y/2 + arena_y/2 - character_size
            elif agent_y <= bot_y + character_size <= agent_y + character_size:
                bot_y = agent_y - character_size
        elif bot_action == 8:
            bot_x -= character_move_speed
            if bot_x <= disp_x/2 - arena_x/2:
                bot_x = disp_x/2 - arena_x/2
            elif agent_x <= bot_x <= agent_x + character_size:
                bot_x = agent_x + character_size

    if bot_beam_fire == True:
        if beam_hit_detector("bot"):
            #print "Agent Got Hit!"
            agent_hp -= beam_damage
            reward += -50
            bot_beam_size_x = bot_beam_size_y = 0
            bot_beam_x = bot_beam_y = beam_ob
            if agent_hp <= 0:
                cont = False
                winner = "Bot"

    if 1 <= agent_action <= 4:
        agent_beam_fire = True
        if agent_action == 1:
            if agent_y > disp_y/2 - arena_y/2:
                agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
                agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
            else:
                reward += -25
        elif agent_action == 2:
            if agent_x + character_size < disp_x/2 + arena_x/2:
                agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
            else:
                reward += -25
        elif agent_action == 3:
            if agent_y + character_size < disp_y/2 + arena_y/2:
                agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
                agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
            else:
                reward += -25
        elif agent_action == 4:
            if agent_x > disp_x/2 - arena_x/2:
                agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
            else:
                reward += -25

    elif 5 <= agent_action <= 8:
        if agent_action == 5:
            agent_y -= character_move_speed
            if agent_y <= disp_y/2 - arena_y/2:
                agent_y = disp_y/2 - arena_y/2
                reward += -5
            elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y + character_size
                reward += -2
        elif agent_action == 6:
            agent_x += character_move_speed
            if agent_x + character_size >= disp_x/2 + arena_x/2:
                agent_x = disp_x/2 + arena_x/2 - character_size
                reward += -5
            elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x - character_size
                reward += -2
        elif agent_action == 7:
            agent_y += character_move_speed
            if agent_y + character_size >= disp_y/2 + arena_y/2:
                agent_y = disp_y/2 + arena_y/2 - character_size
                reward += -5
            elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y - character_size
                reward += -2
        elif agent_action == 8:
            agent_x -= character_move_speed
            if agent_x <= disp_x/2 - arena_x/2:
                agent_x = disp_x/2 - arena_x/2
                reward += -5
            elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x + character_size
                reward += -2
    if agent_beam_fire == True:
        if beam_hit_detector("agent"):
            #print "Bot Got Hit!"
            bot_hp -= beam_damage
            reward += 50
            agent_beam_size_x = agent_beam_size_y = 0
            agent_beam_x = agent_beam_y = beam_ob
            if bot_hp <= 0:
                successful = True
                cont = False
                winner = "Agent"
    return reward, cont, successful, winner

def bot_beam_dir_detector():
    global bot_current_action
    if bot_current_action == 1:
        bot_beam_dir = 2
    elif bot_current_action == 2:
        bot_beam_dir = 4
    elif bot_current_action == 3:
        bot_beam_dir = 3
    elif bot_current_action == 4:
        bot_beam_dir = 1
    else:
        bot_beam_dir = 0
    return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
    sess.run(initialize)
    success = 0
    for i in tqdm(range(1, num_episodes)):
        #print "Episode #", i
        rAll = 0; d = False; c = True; j = 0
        param_init()
        samples = []
        while c == True:
            j += 1
            current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y) / float(arena_y)),
                                        mapping(complexity, float(bot_x) / float(arena_x)),
                                        mapping(complexity, float(bot_y) / float(arena_y)),
                                        #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                        #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                        mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                        bot_beam_dir
                                        ]])
            b = bot_take_action()
            if np.random.rand(1) < e or i <= 5:
                a = random.randint(0, 8)
            else:
                a, _ = sess.run([predict, Q],feed_dict={input_layer : current_state})
            r, c, d, winner = action(a + 1, b)
            bot_beam_dir = bot_beam_dir_detector()
            next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y) / float(arena_y)),
                                        mapping(complexity, float(bot_x) / float(arena_x)),
                                        mapping(complexity, float(bot_y) / float(arena_y)),
                                        #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                        #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                        mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                        bot_beam_dir
                                        ]])
            samples.append([current_state, a, r, next_state])
            if len(samples) > 10:
                for count in xrange(batch_size):
                    [batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
                    batch_allQ = sess.run(Q, feed_dict={input_layer : batch_current_state})
                    batch_Q1 = sess.run(Q, feed_dict = {input_layer : batch_next_state})
                    batch_maxQ1 = np.max(batch_Q1)
                    batch_targetQ = batch_allQ
                    # Update the action stored in the sampled transition, not the latest action `a`
                    batch_targetQ[0][action_taken] = reward + y * batch_maxQ1
                    sess.run([updateModel], feed_dict={input_layer : batch_current_state, next_Q : batch_targetQ})
            rAll += r
            screen_blit()
            if d == True:
                e = 1. / ((i / 50) + 10)
                success += 1
                break
            #print agent_hp, bot_hp
            display.update()

        jList.append(j)
        rList.append(rAll)
        print winner

I'm fairly sure that if you have pygame, Tensorflow and matplotlib installed in a Python environment, you should be able to see the animations of the bot and the agent "fighting".

I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original general problem.

Thanks!

Update #2, August 18, 2017:

Based on the advice of @NeilSlater, I've implemented experience replay in my model. The algorithm has improved, but I'm going to look for more improvement options that offer convergence.

Update #3, August 22, 2017:

I've noticed that if the agent hits the bot with a bullet on a turn and the action the bot took on that turn was not "fire a bullet", then the wrong actions would be given credit. So I've turned the bullets into beams, so that the bot/agent takes damage on the turn the beam is fired.

machine-learning python reinforcement-learning q-learning IronEdward

Are you using experience replay and bootstrapping values from a "frozen" copy of the recent network? These are approaches used in DQN. They are not guaranteed to work, but they may be necessary for stability. Are you using a Q($\lambda$) algorithm or just single-step Q-learning? Can you give some indication of what your environment and reward scheme are like? Single-step Q-learning fares badly when rewards are sparse, e.g. a final +1 or -1 reward at the end of a long episode.

Neil Slater
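For reference, the two kinds of target contrasted in the comment above can be written out. This is a sketch in standard textbook notation rather than anything taken from the posted code: $Q$ is the current action-value estimate and $n$ is a hypothetical look-ahead length.

```latex
% One-step Q-learning target: one observed reward, then bootstrap
G_t^{(1)} = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')

% n-step target: n observed rewards, then bootstrap
G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1} + \gamma^{n} \max_{a'} Q(S_{t+n}, a')
```

With sparse rewards, the one-step target carries reward information back only one state per update, whereas longer look-aheads, or a $\lambda$-weighted mixture of them as in Q($\lambda$), spread it over several states at once.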

OK, with your update, I would immediately suggest that you need experience replay and probably also alternating networks for bootstrapping, because those are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk this through in detail and take a look at your project's code to show an example, but it might take a day or two to get back to you with that level of detail.

Neil Slater

I have the code running, and if I understand it correctly, the bullets can be "steered" by the agent selecting actions 1 to 4 on each turn, i.e. the bullet can be moved in any direction while the agent stays still. Is that intentional? The bot doesn't do that, because it only fires when lined up on the grid with the agent, and it always picks the same direction.

Neil Slater

Almost right, but you don't store the bootstrap value; you re-calculate it when the step is sampled later. For each action taken, you store four things: State, Action, Next State, Reward. Then you take a small mini-batch from this list (1 per step is fine, but more, e.g. 10, is typical) and, for Q-learning, calculate the new max action and its value in order to create the supervised-learning mini-batch (also called the TD Target).

Neil Slater
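The storage-and-resampling scheme described in the comment above can be sketched as follows. This is an illustrative outline, not the asker's implementation: `ReplayBuffer`, `td_targets` and `q_fn` are hypothetical names, `q_fn` stands in for the network's forward pass, and the extra `done` flag (to skip bootstrapping at the end of an episode) is an addition beyond the four stored items mentioned in the comment.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest entries drop out first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        size = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), size)

def td_targets(batch, q_fn, gamma=0.75):
    """Recompute the bootstrap value at sampling time, using current estimates.

    q_fn(state) is a placeholder for the network and must return a 1-D
    array with one value per action.
    """
    targets = []
    for state, action, reward, next_state, done in batch:
        target_q = np.array(q_fn(state), dtype=float)
        bootstrap = 0.0 if done else gamma * np.max(q_fn(next_state))
        target_q[action] = reward + bootstrap  # only the taken action changes
        targets.append(target_q)
    return np.stack(targets)

def toy_q(state):
    """Stand-in for the network: the same two action values for every state."""
    return np.array([0.0, 2.0])

demo_batch = [(np.zeros(2), 0, 1.0, np.ones(2), False),
              (np.zeros(2), 1, 1.0, np.ones(2), True)]
demo_targets = td_targets(demo_batch, toy_q)
```

Note that each target is written at the index of the action stored in the sampled transition, not the action most recently taken in the game loop.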

That should be "frozen copy of the approximator (i.e. the neural network)" (if the quote is from one of my comments or answers, please point it out and I'll correct it). It is pretty simple: just keep two copies of the weight params $\mathbf{w}$, the "live" one that you update and the "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target, e.g. $R + \gamma \max_{a'} \hat{q}(S',a',\mathbf{w})$, use the "old" copy to calculate $\hat{q}$, but then train the "live" copy with those values.

Neil Slater
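The two-copy scheme from the comment above can be sketched with a small linear Q-function in NumPy. This is a simplified illustration under stated assumptions, not the asker's TensorFlow model: the class `LinearQ`, its parameters and the toy transition are all hypothetical, and `sync_every` is set to 2 only so the copy step is visible (the comment suggests every few hundred updates).

```python
import numpy as np

class LinearQ:
    """Tiny linear Q-function with a live and a frozen copy of the weights."""

    def __init__(self, n_features, n_actions, lr=0.01, gamma=0.75,
                 sync_every=200, seed=0):
        rng = np.random.default_rng(seed)
        self.w_live = rng.uniform(0.0, 0.1, size=(n_features, n_actions))
        self.w_old = self.w_live.copy()  # frozen copy used for TD targets
        self.lr, self.gamma, self.sync_every = lr, gamma, sync_every
        self.updates = 0

    def update(self, state, action, reward, next_state, done=False):
        # The TD target is computed from the frozen ("old") weights.
        bootstrap = 0.0 if done else self.gamma * np.max(next_state @ self.w_old)
        target = reward + bootstrap
        # The gradient step is applied to the live weights only.
        td_error = target - state @ self.w_live[:, action]
        self.w_live[:, action] += self.lr * td_error * state
        self.updates += 1
        if self.updates % self.sync_every == 0:
            self.w_old = self.w_live.copy()  # refresh the frozen copy
        return td_error

model = LinearQ(n_features=2, n_actions=3, sync_every=2)
s, s_next = np.array([1.0, 0.5]), np.array([0.5, 1.0])
model.update(s, 0, 1.0, s_next)
differs_after_one = not np.allclose(model.w_old, model.w_live)
model.update(s, 0, 1.0, s_next)
equal_after_sync = bool(np.allclose(model.w_old, model.w_live))
```

Because the targets come from weights that stay fixed between syncs, each block of updates behaves more like ordinary supervised regression towards a stationary target.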


Answers: