I understood that SVMs are binary, linear classifiers (without the kernel trick). They have training data $(x_i, y_i)$ where $x_i$ is a vector and $y_i \in \{-1, +1\}$ is the class. Being binary, linear classifiers, the task is to find a hyperplane that separates the data points with label $-1$ from the data points with label $+1$.
Assume, for now, that the data points are linearly separable and we do not need slack variables.
Now I read that the training problem is the following optimization problem:
$$\min_{w,b} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w, x_i\rangle + b) \ge 1 \;\; \text{for all } i$$
I think I got that minimizing $\|w\|^2$ means maximizing the margin (however, I don't understand why it is the square here. Would anything change if one tried to minimize $\|w\|$ instead?).
I also understood that $y_i(\langle w, x_i\rangle + b) \ge 0$ means that the model has to be correct on the training data. However, in the constraint there is a $1$ and not a $0$. Why?
Answers:
First problem: minimizing $\|w\|$ or $\|w\|^2$:
It is correct that one wants to maximize the margin. This is actually done by maximizing $\frac{2}{\|w\|}$. This would be the "correct" way of doing it, but it is rather inconvenient. Let's first drop the $2$, as it is just a constant. Now if $\frac{1}{\|w\|}$ is maximal, $\|w\|$ will have to be as small as possible. We can thus find the identical solution by minimizing $\|w\|$.
Since $\|w\|$ is non-negative, minimizing $\|w\|$ and minimizing $\|w\|^2$ have the same solution, and the squared norm has the advantage of being differentiable everywhere. Finally, as we often have to calculate derivatives, we multiply the whole expression by a factor of $\frac{1}{2}$. This is done very often because $\frac{d}{dx}x^2 = 2x$ and thus $\frac{d}{dx}\frac{1}{2}x^2 = x$.
This is how we end up with the problem: minimize $\frac{1}{2}\|w\|^2$.
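Spelled out, the chain of equivalent problems (each subject to the same constraints) is:
$$\max_{w,b}\,\frac{2}{\|w\|}\;\Longleftrightarrow\;\max_{w,b}\,\frac{1}{\|w\|}\;\Longleftrightarrow\;\min_{w,b}\,\|w\|\;\Longleftrightarrow\;\min_{w,b}\,\tfrac{1}{2}\|w\|^2$$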
tl;dr: yes, minimizing $\|w\|$ instead of $\frac{1}{2}\|w\|^2$ would work.
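If you want to convince yourself numerically, here is a minimal sketch (not from the original answer; the toy dataset, the choice of SciPy's SLSQP solver, and all names are my own illustration) that solves the hard-margin primal twice, once with the objective $\frac{1}{2}\|w\|^2$ and once with $\|w\|$, and prints both solutions:

```python
# Minimal sketch: solve the hard-margin primal on toy data with SciPy's SLSQP
# solver, once with (1/2)||w||^2 and once with ||w||, to check that the
# resulting hyperplane is (numerically) the same.
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def solve_primal(objective):
    """Minimize objective(w) subject to y_i(<w, x_i> + b) >= 1 for all i."""
    # The optimization variable is params = [w_1, ..., w_d, b].
    constraints = [
        {"type": "ineq",  # SLSQP inequality constraints mean fun(params) >= 0
         "fun": lambda params, xi=xi, yi=yi:
             yi * (np.dot(params[:-1], xi) + params[-1]) - 1.0}
        for xi, yi in zip(X, y)
    ]
    x0 = np.ones(X.shape[1] + 1)  # start away from w = 0, where ||w|| is not smooth
    res = minimize(lambda p: objective(p[:-1]), x0,
                   method="SLSQP", constraints=constraints)
    return res.x[:-1], res.x[-1]

w_sq, b_sq = solve_primal(lambda w: 0.5 * np.dot(w, w))  # minimize (1/2)||w||^2
w_n,  b_n  = solve_primal(np.linalg.norm)                # minimize ||w||

print("with (1/2)||w||^2: w =", w_sq, " b =", b_sq, " margin =", 1 / np.linalg.norm(w_sq))
print("with ||w||       : w =", w_n,  " b =", b_n,  " margin =", 1 / np.linalg.norm(w_n))
```

Both runs should report (up to solver tolerance) the same $w$, $b$ and margin, which is exactly the equivalence argued above.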
Second problem: $\ge 0$ or $\ge 1$:
As already stated in the question, $y_i(\langle w, x_i\rangle + b) \ge 0$ means that the point has to be on the correct side of the hyperplane. However, this isn't enough: we want the point to be at least as far away as the margin (then the point is a support vector), or even further away.
Remember the definition of the hyperplane:
$$\mathcal{H} = \{x \mid \langle w, x\rangle + b = 0\}.$$
This description, however, is not unique: if we scale $w$ and $b$ by a constant $c$, we get an equivalent description of the same hyperplane. To make sure our optimization algorithm doesn't just scale $w$ and $b$ by constant factors to get a higher margin, we fix the scale by defining that the support vectors (the points closest to the hyperplane) have an output of exactly $1$, i.e. the margin is $\frac{1}{\|w\|}$. A support vector is thus characterized by $y_i(\langle w, x_i\rangle + b) = 1$.
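A tiny sketch of that scaling issue (the numbers here are made up purely for illustration): scaling $w$ and $b$ by $c$ leaves every geometric distance unchanged but inflates the raw output $\langle w, x\rangle + b$, which is why the output at the support vectors has to be pinned to $1$:

```python
# Minimal sketch: scaling (w, b) by a constant c describes the same hyperplane.
# The geometric distance (<w, x> + b) / ||w|| stays the same; only the raw
# output <w, x> + b grows with c.
import numpy as np

w = np.array([1.0, 2.0])   # made-up hyperplane parameters
b = -1.0
x = np.array([3.0, 0.5])   # made-up test point

for c in (1.0, 2.0, 10.0):
    wc, bc = c * w, c * b
    raw = np.dot(wc, x) + bc            # scales with c
    dist = raw / np.linalg.norm(wc)     # geometric distance: unchanged
    print(f"c = {c:5.1f}   raw output = {raw:7.2f}   distance = {dist:.4f}")
```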
As already mentioned earlier, we want all points to be either a support vector or even further away from the hyperplane. During training, we thus add the constraint $y_i(\langle w, x_i\rangle + b) \ge 1$, which ensures exactly that.
tl;dr: training points don't just need to be classified correctly; they have to be on the margin or further away.
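As a final sanity check, here is another sketch (again my own illustration, not part of the original answer): it approximates the hard-margin problem with scikit-learn's linear `SVC` and a very large `C`, then verifies that $y_i(\langle w, x_i\rangle + b) \ge 1$ holds for every training point, with equality (up to numerical tolerance) exactly at the support vectors:

```python
# Minimal sketch: approximate the hard-margin SVM with scikit-learn's linear
# SVC (huge C => slack is effectively forbidden) and check the constraints.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Functional margin y_i(<w, x_i> + b) for every training point.
margins = y * (X @ w + b)

for i, m in enumerate(margins):
    role = "support vector (= 1)" if i in clf.support_ else "beyond the margin (> 1)"
    print(f"point {i}: y_i(<w, x_i> + b) = {m:.3f}   {role}")
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
```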