Author: Olivier Chapelle (France) criteo labs
Personal Website: http://olivier.chapelle.cc/
• Yahoo! Research, Machine Learning group. (Oct.2006 – present)
Senior Research Scientist
Main projects are on machine learning applied to web search ranking and display advertising.
• Max Planck Institute, Tubingen, Empirical inference group. (Sep. 2002 – Oct. 2006)
Research Scientist
Design of semisupervised learning algorithms and kernel methods.
该作者primal相关的研究介绍在http://olivier.chapelle.cc/primal
Using Support Vector Machines in the primal don’t need to know anything about Lagrange multipliers, KKT conditions and duality to train an SVM.
07年发了一篇primal问题的论文(Training a Support Vector Machine in the Primal, Neural Computation)之后, 没有看到后续单独对primal的研究工作. 个人网站中提供了matlab的实现代码.
下面先学习一下 [ Training a Support Vector Machine in the Primal ] 这篇paper.
Abstract:
The primal problem can also be solved efficiently, both for linear and nonlinear SVMs. From the primal point of view new families of algorithms for large scale SVM training can be investigated.
1 Introduction:
Primal optimizations of linear SVMs have already been studied by Keerthi and DeCoste (2005); Mangasarian (2002).

S. Sathiya Keerthi http://www.keerthis.com/
Paper(2005 with Dennis DeCoste): A modified finite Newton method for fast solution of large scale linear SVMs

Olvi L. Mangasarian http://pages.cs.wisc.edu/~olvi/
Paper(2002): A Newton Method for Linear Programming
One of the main contributions of this paper is to complement those studies to include the nonlinear case. when the goal is to find an approximate solution, primal optimization is superior.
The primal SVM optimization problem is usually written is:
\(\mathop {\min }\limits_{{\bf{w}},b} :{\left\ {\bf{w}} \right\^2} + C\sum\limits_{i = 1}^n {\xi _i^p} \) under constraints \({y_i}({\bf{w}} \cdot {x_i} + b) \ge 1 – {\xi _i}\), \({\xi _i} \ge 0\).
In the literature there are usually two main reasons mentioned for solving this problem in the dual:
1. The duality theory provides a convenient way to deal with the constraints.
2. The dual optimization problem can be written in terms of dot products, thereby making it possible to use kernel functions.
section 2: We will start now with some general discussion about primal and dual optimization.
section 3: We will demonstrate those two reasons are not a limitation for solving the problem in the primal, mainly by writing the optimization problem as an unconstrained one and by using the representer theorem.
section 4: We will see that performing a Newton optimization in the primal yields exactly the same computational complexity as optimizing the dual.
section 5: that will be validated experimentally.
section 6: Possible advantages of a primal optimization are presented.
2 Links between primal and dual optimization:
We illustrate the connections through the example of regularized leastsquares (RLS).
The primal RLS problem can be written as:
\[\mathop {\min }\limits_{{\bf{w}} \in {{\bf{R}}^d}} :\lambda {{\bf{w}}^{\rm{T}}}{\bf{w}} + {\left\ {{\bf{Xw}}  {\bf{y}}} \right\^2}\]
where\(\lambda \) is the regularization parameter. This objective function is minimized for \({\bf{w}}{\rm{ = (}}{{\bf{X}}^{\rm{T}}}{\bf{X}}{\rm{ + }}\lambda {\bf{I}}{{\rm{)}}^{ – 1}}{{\bf{X}}^{\rm{T}}}{\bf{y}}\)and its minimum is \({{\bf{y}}^{\mathop{\rm T}\nolimits} }{\bf{y}} – {{\bf{y}}^{\rm{T}}}{\bf{X}}{{\rm{(}}{{\bf{X}}^{\rm{T}}}{\bf{X}}{\rm{ + }}\lambda {\bf{I}}{\rm{)}}^{ – 1}}{{\bf{X}}^{\rm{T}}}{\bf{y}}\).
The dual optimization problem is
\[\mathop {\max }\limits_{{\bf{\alpha }} \in {{\bf{R}}^d}} :2{{\bf{\alpha }}^{\rm{T}}}{\bf{y}}{\rm{  }}\frac{1}{\lambda }{{\bf{\alpha }}^{\rm{T}}}{{\rm{(}}{\bf{X}}{{\bf{X}}^{\rm{T}}}{\rm{ + }}\lambda {\bf{I}}{\rm{)}}^{  1}}{\bf{\alpha }}\]
The dual is maximized for \({\bf{\alpha }}{\rm{ = }}\lambda {{\rm{(}}{\bf{X}}{{\bf{X}}^{\rm{T}}}{\rm{ + }}\lambda {\bf{I}}{\rm{)}}^{ – 1}}{\bf{y}}\)and its maximum is \(\lambda {{\bf{y}}^{\rm{T}}}{{\rm{(}}{\bf{X}}{{\bf{X}}^{\rm{T}}}{\rm{ + }}\lambda {\bf{I}}{\rm{)}}^{ – 1}}{\bf{y}}\)
Thanks to Woodbury formula \(\lambda {{\rm{(}}{\bf{X}}{{\bf{X}}^{\rm{T}}}{\rm{ + }}\lambda {\bf{I}}{\rm{)}}^{ – 1}}{\rm{ = }}{\bf{I}} – {\bf{X}}{{\rm{(}}\lambda {\bf{I}}{\rm{ + }}{{\bf{X}}^{\rm{T}}}{\bf{X}}{\rm{)}}^{ – 1}}{{\bf{X}}^{\rm{T}}}\). With the equality, the primal and dual optimal values are the same.
In primal problem, the computation complexity of \({{\rm{(}}{{\bf{X}}^{\rm{T}}}{\bf{X}}{\rm{ + }}\lambda {\bf{I}}{\rm{)}}^{ – 1}}\) requires\(O(n{d^2} + {d^3})\). In dual problem, the computation complexity of \({{\rm{(}}{\bf{X}}{{\bf{X}}^{\rm{T}}}{\rm{ + }}\lambda {\bf{I}}{\rm{)}}^{ – 1}}\) requires\(O(d{n^2} + {n^3})\). But the matrix to invert is too big. So both for primal and dual optimization, the complexity is \(O(\max (n,d) + \min {(n,d)^2})\).
The difference between primal and dual optimization comes when computing approximate solutions. Because it is more focused on minimizing what we are interested in: the primal objective function.
3 Primal objective function:
Rewrite the primal SVM optimization as an unconstrained optimization problem:
\({\left\ {\bf{w}} \right\^2}{\rm{ + C}}\sum\limits_{i = 1}^n {L({y_i},{\bf{w}} \cdot {{\bf{x}}_i} + b)} \), with \(L(y,t) = \max {(0,1 – yt)^p}\). could be any loss function.
Write the optimization problem as an unconstrained one and by using the representer theorem.
\(\lambda {\bf{\beta K\beta }}{\rm{ + }}\sum\limits_{i = 1}^n {L({y_i},K_i^T{\bf{\beta }})} \).
4 Newton optimization:
The unconstrained objective function can be minimized using a variety of optimization techniques such as conjugate gradient. Here we will only consider Newton optimization as the similarities with dual optimization will then appear clearly.
We will focus on two loss functions: the quadratic penalization of the training errors and a differentiable approximation to the linear penalization, the Huber loss.
5 Experiments:
The experiments in this section can be considered as a sanity check to show that primal and dual optimization of a nonlinear SVM have similar time complexities. However, for linear SVMs, the primal optimization is definitely superior.
6 Advantages of primal optimization:
As explained throughout this paper, primal and dual optimization are very similar, and it is not surprising that they lead to same computational complexity, \(O(n{n_{SV}} + n_{SV}^3)\). So is there a reason to use one rather than the other?
We believe that primal optimization might have advantages for large scale optimization. Indeed, when the number of training points is large, the number of support vectors is also typically large and it becomes intractable to compute the exact solution. For this reason, one has to resort to approximations. But introducing approximations in the dual may not be wise. There is indeed no guarantee that an approximate dual solution yields a good approximate primal solution. Since what we are eventually interested in is a good primal objective function value, it is more straightforward to directly minimize it.
7 Conclusion:
In this paper, we have studied the primal optimization of nonlinear SVMs and derived the update rules for a Newton optimization. From these formulae, it appears clear that there are strong similarities between primal and dual optimization. Also, the corresponding implementation is very simple and does not require any optimization libraries.
The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation, for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000].
We do not pretend that primal optimization is better in general; our main motivation was to point out that primal and dual are two sides of the same coin and that there is no reason to look always at the same side. And by looking at the primal side, some new algorithms for finding approximate solutions emerge naturally. We believe that an approximate primal solution is in general superior to a dual one since an approximate dual solution can yield a primal one which is arbitrarily bad.
In addition to all the possibilities for approximate solutions metioned in this paper, the primal optimization also offers the advantage of tuning the hyperparameters simultaneously by performing a conjoint optimization on parameters and hyperparameters.