
I understand that SVMs are binary, linear classifiers (without the kernel trick). They have training data $(x_i, y_i)$, where $x_i$ is a vector and $y_i \in \{-1, +1\}$ is the class. Because they are binary, linear classifiers, the task is to find a hyperplane which separates the data points with label $-1$ from the data points with label $+1$.

Suppose, for now, that the data points are linearly separable and we don't need slack variables.

Now I have read that the training problem is the following optimization problem:

  • $\min_{w,b} \frac{1}{2} \|w\|^2$
  • s.t. $y_i(\langle w, x_i \rangle + b) \geq 1$

I think I got that minimizing $\|w\|^2$ means maximizing the margin (however, I don't understand why the square is there. Would anything change if one tried to minimize $\|w\|$ instead?).

I also understood that $y_i(\langle w, x_i \rangle + b) \geq 0$ means that the model has to be correct on the training data. However, the constraint uses $\geq 1$ and not $\geq 0$. Why?
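For concreteness, here is a minimal numerical sketch of this primal problem (assuming the cvxpy package is available; the toy data set below is made up purely for illustration):

```python
# Minimal sketch of the hard-margin primal, assuming cvxpy is installed.
# The toy data set is made up purely for illustration.
import numpy as np
import cvxpy as cp

# Two linearly separable classes in 2D, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2) ||w||^2  subject to  y_i (<w, x_i> + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margins y_i(<w,x_i>+b):", y * (X @ w.value + b.value))  # all >= 1
```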

Martin Thoma
When minimizing in the math (setting the derivative to 0), the squared form probably turns out to give an easier equation.
paparazzo
See also: Alexander Ihler: Support Vector Machines (1): Linear SVMs, primal form on YouTube. 25.01.2015.
Martin Thoma

Answers:


First problem: minimizing $\|w\|$ or $\|w\|^2$:

It is correct that one wants to maximize the margin. This is actually done by maximizing $\frac{2}{\|w\|}$. This would be the "correct" way of doing it, but it is rather inconvenient. Let's first drop the $2$, as it is just a constant. Now if $\frac{1}{\|w\|}$ is maximal, $\|w\|$ will have to be as small as possible. We can thus find the identical solution by minimizing $\|w\|$.

$\|w\|$ can be calculated as $\sqrt{w^T w}$. As the square root is a monotonic function, any point $x$ which minimizes $f(x)$ will also minimize $\sqrt{f(x)}$. To find this point $x$ we thus don't have to calculate the square root and can minimize $w^T w = \|w\|^2$ directly.

Finally, as we often have to calculate derivatives, we multiply the whole expression by a factor $\frac{1}{2}$. This is done very often because of the derivative: $\frac{d}{dx} x^2 = 2x$, and thus $\frac{d}{dx} \frac{1}{2} x^2 = x$. This is how we end up with the problem: minimize $\frac{1}{2}\|w\|^2$.
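To make the convenience of the factor $\frac{1}{2}$ explicit (a small derivation added here, under the same setup):

$$\nabla_w \left( \tfrac{1}{2}\|w\|^2 \right) = w, \qquad \nabla_w \, \|w\| = \frac{w}{\|w\|} \quad (w \neq 0),$$

so the halved, squared objective has the simplest possible gradient, while the plain norm is not even differentiable at $w = 0$.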

tl;dr: yes, minimizing $\|w\|$ instead of $\frac{1}{2}\|w\|^2$ would work.

Second problem: $\geq 0$ or $\geq 1$:

As already stated in the question, $y_i(\langle w, x_i \rangle + b) \geq 0$ means that the point has to be on the correct side of the hyperplane. However, this isn't enough: we want the point to be at least as far away as the margin (then the point is a support vector), or even further away.

Remember the definition of the hyperplane,

$\mathcal{H} = \{x \mid \langle w, x \rangle + b = 0\}.$

This description, however, is not unique: if we scale $w$ and $b$ by a constant $c$, then we get an equivalent description of this hyperplane. To make sure our optimization algorithm doesn't just scale $w$ and $b$ by constant factors to get a higher margin, we define that the distance of a support vector from the hyperplane is always $1$, i.e. the margin is $\frac{1}{\|w\|}$. A support vector is thus characterized by $y_i(\langle w, x_i \rangle + b) = 1$.
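To spell out why the margin then equals $\frac{1}{\|w\|}$ (a short derivation added for clarity): the distance of any point $x_0$ to the hyperplane is

$$\operatorname{dist}(x_0, \mathcal{H}) = \frac{|\langle w, x_0 \rangle + b|}{\|w\|},$$

so for a support vector, where $|\langle w, x_s \rangle + b| = 1$, this distance is exactly $\frac{1}{\|w\|}$, and the full gap between the two classes is $\frac{2}{\|w\|}$.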

As already mentioned earlier, we want all points to either be a support vector or be even further away from the hyperplane. In training we thus add the constraint $y_i(\langle w, x_i \rangle + b) \geq 1$, which ensures exactly that.

tl;dr: Training points don't only need to be classified correctly; they have to lie on the margin or further away.
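As a sanity check of this characterization (a sketch, assuming scikit-learn is available; a very large $C$ approximates the hard-margin case), the fitted support vectors should satisfy $y_i(\langle w, x_i \rangle + b) \approx 1$:

```python
# Sketch: verify that support vectors sit (approximately) on the margin.
# Assumes scikit-learn; a large C approximates the hard-margin SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# y_i (<w, x_i> + b) for the support vectors should be ~1
sv = clf.support_vectors_
sv_labels = y[clf.support_]
print(sv_labels * (sv @ w + b))   # all values approximately 1.0
```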

hbaderts
Just to check if I understood it: instead of writing $\geq 1$, we could also use any constant $\epsilon$ and write $\geq \epsilon$, where $\epsilon > 0$?
Martin Thoma
In principle, yes. E.g. in soft-margin SVMs (where you allow some misclassifications or points within the margin), you use $\geq 1 - \xi_i$, so a point can fall $\xi_i$ short of the margin. Of course you then need a penalty term which forces most $\xi_i$ to be zero or at least very small.
hbaderts
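For reference, the standard soft-margin primal that this comment refers to (the full formulation is not spelled out above) is

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0,$$

where the term $C \sum_i \xi_i$ is exactly the penalty that keeps most $\xi_i$ at or near zero.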
I think in the above comment Martin was asking not about the soft-margin case, where you add a $\xi_i$ to let some points cross over, but rather just about what happens if you replace the $1$ with a different positive constant $\epsilon$. I believe the result in that case would be the same (i.e., you'd find the same separating plane), but $w$ would be scaled so that the margin would be $\frac{2\epsilon}{\|w\|}$ instead of $\frac{2}{\|w\|}$.
Tim Goodman
This is because $\langle w, x \rangle + b = \epsilon$ defines a plane orthogonal to $w$ and offset from the origin by $\frac{\epsilon - b}{\|w\|}$ in the direction of $w$. Likewise, $\langle w, x \rangle + b = -\epsilon$ defines a plane orthogonal to $w$ and offset from the origin by $\frac{-\epsilon - b}{\|w\|}$. So the distance between the two planes is $\frac{\epsilon - b}{\|w\|} - \frac{-\epsilon - b}{\|w\|} = \frac{2\epsilon}{\|w\|}$.
Tim Goodman
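A quick check of the offset claim (added for completeness): the point $x_0 = \frac{\epsilon - b}{\|w\|^2}\, w$ lies on the first plane, since

$$\langle w, x_0 \rangle + b = \frac{\epsilon - b}{\|w\|^2}\, \langle w, w \rangle + b = (\epsilon - b) + b = \epsilon,$$

and its signed distance from the origin along $w$ is $\frac{\langle w, x_0 \rangle}{\|w\|} = \frac{\epsilon - b}{\|w\|}$, as stated.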