Background: I'm a programmer with some (half-forgotten) experience in statistics from university courses. I recently came across http://akinator.com and spent some time trying to make it fail. And who hasn't? :)
I decided to find out how it could work. After some googling, reading related blog posts, and adding a bit of my own (limited) knowledge to the resulting mix, I came up with the following model (I'm sure I'll use the wrong notation, please don't kill me for that):
There are subjects (S) and questions (Q). The goal of the predictor is to select the subject S that has the largest posterior probability of being the subject the user is thinking about, given the questions and answers collected so far.
Let the game G be the set of questions asked and answers given so far: $G = \{(q_1, a_1), (q_2, a_2), \ldots, (q_n, a_n)\}$.
Then the predictor is looking for $P(S \mid G) = \dfrac{P(G \mid S)\,P(S)}{P(G)}$.
The prior over subjects ($P(S)$) could simply be the number of times the subject has been guessed, divided by the total number of games.
Assuming that all answers are independent given the subject, we could compute the likelihood of subject S given game G as follows: $P(G \mid S) = \prod_{i=1}^{n} P(\{q_i, a_i\} \mid S)$.
We can estimate $P(\{q_i, a_i\} \mid S)$ if we keep track of which questions and answers were given when users had the given subject in mind: $P(\{q_i, a_i\} \mid S) = \dfrac{\text{number of times answer } a_i \text{ was given to } q_i \text{ in games about } S}{\text{number of times } q_i \text{ was asked in games about } S}$.
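To make the bookkeeping concrete, here is a minimal Python sketch of this counting model. The table names (`guess_count`, `asked_count`, `qa_count`) and the Laplace smoothing are my own illustrative choices, not part of any existing library; the counts are assumed to be filled in from logs of finished games.

```python
from collections import defaultdict

# guess_count[s]       = number of games in which subject s was the answer
# asked_count[(s, q)]  = number of times question q was asked in games about s
# qa_count[(s, q, a)]  = number of times answer a was given to q in games about s
guess_count = defaultdict(int)
asked_count = defaultdict(int)
qa_count = defaultdict(int)
total_games = 0

def prior(s):
    """P(S): fraction of past games in which s was the subject."""
    return guess_count[s] / max(total_games, 1)

def answer_likelihood(q, a, s, smoothing=1.0, n_answers=3):
    """P({q, a} | S), with Laplace smoothing so unseen pairs don't zero out the product."""
    return (qa_count[(s, q, a)] + smoothing) / (asked_count[(s, q)] + smoothing * n_answers)

def posterior(game, subjects):
    """P(S | G) for every subject, assuming answers are conditionally independent given S."""
    scores = {}
    for s in subjects:
        p = prior(s)
        for q, a in game:                   # game is a list of (question, answer) pairs
            p *= answer_likelihood(q, a, s)
        scores[s] = p
    z = sum(scores.values()) or 1.0         # normalize so the scores sum to 1
    return {s: p / z for s, p in scores.items()}
```

In practice you would also smooth the prior and work in log space to avoid underflow once many questions have been answered.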
Now, $P(S \mid G)$ defines a probability distribution over subjects, and when we need to select the next question we have to select the one for which the expected reduction in the entropy of this distribution is maximal: $\arg\max_{q_j}\Big(H[P(S \mid G)] - \sum_{a \in \{\text{yes}, \text{no}, \ldots\}} P(a \mid G)\, H[P(S \mid G \cup \{q_j, a\})]\Big)$.
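A sketch of this question-selection step, reusing the hypothetical `posterior` and `answer_likelihood` functions from the previous snippet; it scores each candidate question by the expected entropy of the updated distribution and picks the one that lowers it the most:

```python
import math

def entropy(dist):
    """Shannon entropy of a {subject: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_entropy_after(question, game, subjects, answers=("yes", "no", "dont_know")):
    """Expected entropy of P(S | G ∪ {q, a}), averaging over possible answers
    weighted by their predicted probability under the current posterior."""
    current = posterior(game, subjects)
    expected = 0.0
    for a in answers:
        # Probability the user gives answer a, marginalized over subjects.
        p_a = sum(current[s] * answer_likelihood(question, a, s) for s in subjects)
        if p_a == 0:
            continue
        updated = posterior(game + [(question, a)], subjects)
        expected += p_a * entropy(updated)
    return expected

def next_question(candidate_questions, game, subjects):
    """Pick the question with the lowest expected posterior entropy,
    i.e. the highest expected information gain."""
    return min(candidate_questions,
               key=lambda q: expected_entropy_after(q, game, subjects))
```

Note that this brute-force version recomputes the full posterior for every candidate question and every answer, which is exactly where the cost mentioned below comes from; since the naive-Bayes posterior factorizes, it could instead be updated incrementally by multiplying in one extra likelihood term and renormalizing.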
I've tried to implement this and it works. But, obviously, as the number of subjects goes up, performance degrades due to the need to recalculate $P(S \mid G)$ after each move and to compute the updated distribution for question selection.
I suspect that I have simply chosen the wrong model, constrained by the limits of my knowledge. Or maybe there is an error in the math. Please enlighten me: what should I familiarize myself with, or how should I change the predictor so that it can cope with millions of subjects and thousands of questions?
Answers:
This game looks similar to 20 Questions at http://20q.net, which its creator reports is based on a neural network. Here's one way to structure such a network, similar to the neural network described in Concept description vectors and the 20 question game.
You'd have an input unit for every question, where 0/1 represents a no/yes answer; input units start at 0.5 for questions that haven't been asked yet. You'd also have output units, one per question and one per possible subject (the "final question" list). Input units for questions that have been answered are set to 0 or 1, and the assumption is that the neural network has been trained so that its output units produce values close to 1 for questions that would get a "yes" answer, given the set of existing answers.
At each stage you would pick the question the NN is least sure about, i.e., the one whose output unit is closest to 0.5, ask that question, and set the corresponding input unit to the answer. At the last stage you pick the output unit from the "final question" list with the value closest to 1.

Each game of 20 questions gives 20 datapoints which you can use to update the NN's weights with back-propagation, i.e., you update the weights to make the outputs of the current neural network match the true answers given all the previous questions asked.
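Here is a rough numpy sketch of the loop this answer describes. The layer sizes, the single sigmoid layer, and the learning rate are my own assumptions for illustration, not details from the linked paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_subjects = 200, 1000      # made-up sizes

# One input unit per question; one output unit per question plus one per subject
# (the "final question" units). A single sigmoid layer keeps the sketch short.
W = rng.normal(scale=0.01, size=(n_questions + n_subjects, n_questions))
b = np.zeros(n_questions + n_subjects)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(answers):
    """answers: length-n_questions vector with 1 = yes, 0 = no, 0.5 = not asked yet."""
    return sigmoid(W @ answers + b)

def pick_question(answers, asked):
    """Ask the question the network is least sure about, i.e. output closest to 0.5."""
    out = forward(answers)[:n_questions].copy()
    out[list(asked)] = np.inf                       # never re-ask a question
    return int(np.argmin(np.abs(out - 0.5)))

def final_guess(answers):
    """Guess the subject whose 'final question' unit is closest to 1."""
    return int(np.argmax(forward(answers)[n_questions:]))

def train_on_game(history, true_subject, lr=0.1):
    """Back-propagation on the datapoints one game provides: replay the game and,
    at each step, push the units we know the truth for toward their observed values."""
    global W, b
    known = np.zeros(n_questions + n_subjects, dtype=bool)
    target = np.zeros(n_questions + n_subjects)
    for q, a in history:                            # history: list of (question, 0/1 answer)
        known[q], target[q] = True, a
    known[n_questions + true_subject] = True
    target[n_questions + true_subject] = 1.0
    answers = np.full(n_questions, 0.5)
    for q, a in history:
        out = forward(answers)
        grad = np.where(known, out - target, 0.0)   # sigmoid + cross-entropy gradient
        W -= lr * np.outer(grad, answers)
        b -= lr * grad
        answers[q] = a
```

In play you would alternate `pick_question` with writing the user's answer into `answers`, call `final_guess` when confident (or after 20 questions), and run `train_on_game` with the confirmed subject afterwards.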
I don’t think it is really a classification problem. 20 questions is often characterized as a compression problem. This actually matches better with the last part of your question where you talk about entropy.
See Chapter 5.7 (Google books) of
Cover, T.M. and Thomas, J.A. (2006) Elements of Information Theory
and also Huffman coding. This paper I found on arXiv may be of interest as well.
Gill, J.T. and Wu, W. (2010) "Twenty Questions Games Always End With Yes", http://arxiv.org/abs/1002.4907
For simplicity, assume yes/no questions (whereas akinator.com also allows "maybe" and "don't know"). Assume that every possible subject (everything akinator.com knows about) can be uniquely identified by a sequence of yes/no questions and answers, essentially a binary vector.
The questions that are asked (and their answers) define a recursive partitioning of the space of subjects. This partitioning also corresponds to a tree structure: the interior vertices of the tree correspond to questions, and the leaves correspond to subjects. The depth of a leaf is exactly the number of questions required to uniquely identify that subject. You could (trivially) identify every known subject by asking every possible question, but that is not interesting, because there are potentially hundreds of questions.
The connection with Huffman coding is that it gives an optimal way (under a certain probabilistic model) to construct the tree so that the average depth is (nearly) minimal. In other words, it tells you how to arrange the sequence of questions (construct a tree) so that the number of questions you need to ask is on average small. It uses a greedy approach.
There is, of course, more to akinator.com than this, but the basic idea is that you can think of the problem in terms of a tree and trying to minimize the average depth of its leaves.
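To make the Huffman connection concrete, here is a small Python sketch that builds a Huffman tree over subjects weighted by how often they come up (the subjects and frequencies are made up for illustration). A leaf's depth is the number of yes/no splits needed to reach that subject, and the greedy merging keeps the frequency-weighted average depth close to the entropy of the subject distribution.

```python
import heapq, itertools

def huffman_tree(subject_freqs):
    """subject_freqs: {subject: frequency}. Returns the root of a binary tree whose
    leaves are subjects; greedy Huffman merging (nearly) minimizes the
    frequency-weighted average leaf depth."""
    counter = itertools.count()       # tie-breaker so heapq never compares subtrees
    heap = [(f, next(counter), s) for s, f in subject_freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    return heap[0][2]

def average_depth(tree, subject_probs, depth=0):
    """Expected number of yes/no questions: probability-weighted leaf depth."""
    if not isinstance(tree, tuple):   # leaf: a single subject
        return subject_probs[tree] * depth
    left, right = tree
    return (average_depth(left, subject_probs, depth + 1) +
            average_depth(right, subject_probs, depth + 1))

freqs = {"Einstein": 0.4, "Mario": 0.3, "Batman": 0.2, "Cleopatra": 0.1}  # made-up priors
root = huffman_tree(freqs)
print(average_depth(root, freqs))     # 1.9 here, close to the entropy of freqs (~1.85 bits)
```

Note that the interior nodes here are "is your subject in this subset?" splits rather than natural-language questions, which is part of the gap between the coding-theory picture and what akinator.com actually has to do.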