Application of Newton's Method to Parameter Optimization of the SVM Primal Problem

1. Newton's Method

Newton's method is also known as the quadratic method or second-order gradient method. A drawback of the gradient method is that the search may converge very slowly, so in some cases it is not an effective iterative scheme. Newton's method improves on the gradient method in the choice of search direction: it uses not only the gradient of the criterion function at the current search point, but also its second derivatives. In other words, it exploits more of the information available at the search point, so that the search direction points more accurately toward the optimum.

The basic idea of Newton's method is to try to reach the optimum, i.e. the minimum of \(J(\mathbf{w})\), in a single step. Consider again the second-order Taylor expansion of \(J(\mathbf{w})\):

\[J(\mathbf{w}) \approx J(\mathbf{w}_k) + \nabla J^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_k) + \frac{1}{2}(\mathbf{w}-\mathbf{w}_k)^{\mathrm{T}} D\,(\mathbf{w}-\mathbf{w}_k)\]

Setting \(\mathbf{w} = \mathbf{w}_{k+1}\) gives

\[J(\mathbf{w}_{k+1}) \approx J(\mathbf{w}_k) + \nabla J^{\mathrm{T}}(\mathbf{w}_{k+1}-\mathbf{w}_k) + \frac{1}{2}(\mathbf{w}_{k+1}-\mathbf{w}_k)^{\mathrm{T}} D\,(\mathbf{w}_{k+1}-\mathbf{w}_k)\]

We want the value of \(\mathbf{w}_{k+1}\) at which \(J(\mathbf{w}_{k+1})\) is minimized. Clearly we can differentiate the expression above with respect to \(\mathbf{w}_{k+1}\) and set the derivative to zero. By the rule for differentiating a scalar function with respect to a vector, we obtain

\[\frac{\mathrm{d}J(\mathbf{w}_{k+1})}{\mathrm{d}\mathbf{w}_{k+1}} = \nabla J + D(\mathbf{w}_{k+1}-\mathbf{w}_k) = 0\]

Left-multiplying both sides by \(D^{-1}\) gives \(D^{-1}\nabla J + \mathbf{w}_{k+1} - \mathbf{w}_k = 0\),

and therefore \(\mathbf{w}_{k+1} = \mathbf{w}_k - D^{-1}\nabla J\).

This is the Newton iteration formula. Since the criterion function \(J(\mathbf{w})\) is approximated by its second-order Taylor expansion, the expansion is exact whenever \(J(\mathbf{w})\) itself is a quadratic function, in which case the optimum is reached in a single step. In general, \(J(\mathbf{w})\) is not quadratic, so the Taylor expansion carries an approximation error and the search cannot reach the optimum in one step; the higher the degree of \(J(\mathbf{w})\), the more iterations are needed. Overall, because Newton's method uses second-derivative information about \(J(\mathbf{w})\) to correct the search direction, it converges quickly. However, the matrix \(D\) of second partial derivatives of \(J(\mathbf{w})\) (the Hessian) is expensive to compute, and the algorithm also requires its inverse \(D^{-1}\); this not only adds computational cost, but if \(D\) is singular, Newton's method cannot be used at all.
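
The iteration is short to write down in code. Below is a minimal MATLAB sketch of the update \(\mathbf{w}_{k+1} = \mathbf{w}_k - D^{-1}\nabla J\) on a toy criterion; the function handles and variable names are purely illustrative and are not part of the implementation in Section 3.

J    = @(w) w(1)^4 + 2*w(2)^2;          % toy criterion function
grad = @(w) [4*w(1)^3; 4*w(2)];         % its gradient
hess = @(w) [12*w(1)^2, 0; 0, 4];       % its Hessian D
w = [2; 1];                             % starting point w_0
for k = 1:50
  step = - hess(w) \ grad(w);           % Newton direction: solve D*step = -grad
  w = w + step;                         % full Newton step (no line search here)
  if norm(grad(w)) < 1e-8, break; end   % stop when the gradient is small
end
fprintf('Stopped after %d iterations, J(w) = %g\n', k, J(w));

Because the w(2)^2 term is exactly quadratic, that component converges in one step, while the quartic w(1)^4 term needs several iterations, illustrating the remark above about quadratic versus non-quadratic criteria.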

2. SVM in primal

For the unconstrained SVM problem the objective function is \(\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} L(y_i, \mathbf{w}\cdot\mathbf{x}_i + b)\), where the second term is the penalty term and \(C\) is the penalty factor; alternatively a regularization parameter \(\lambda = 1/C\) can be introduced. In the penalty term, \(L(y,t)\) is the loss function. Several loss functions commonly used in classification are listed below (note that these loss functions are not suitable for regression problems); a small code sketch of them follows the list:

L1 (hinge loss): \(L(y,t) = \max(0, 1-yt)\);

L2 (quadratic loss): \(L(y,t) = \max(0, 1-yt)^2\);

Huber loss: \(L(y,t) = \begin{cases} 0, & yt > 1+h \\ \dfrac{(1+h-yt)^2}{4h}, & |1-yt| \le h \\ 1-yt, & yt < 1-h \end{cases}\).
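
For reference, the three losses can be written as MATLAB anonymous functions. This is only an illustrative sketch; the handles hinge_loss, l2_loss and huber_loss are hypothetical names and are not used by the code in Section 3.

hinge_loss = @(y,t)   max(0, 1 - y.*t);                       % L1 (hinge) loss
l2_loss    = @(y,t)   max(0, 1 - y.*t).^2;                    % L2 (quadratic) loss
huber_loss = @(y,t,h) (abs(1 - y.*t) <= h) .* (1 + h - y.*t).^2/(4*h) ...
                    + (y.*t < 1 - h)       .* (1 - y.*t);     % zero when y.*t > 1+h
% Example: a point classified with a large margin and a misclassified point
y = [1; 1]; t = [2; -0.5]; h = 0.5;
disp([hinge_loss(y,t), l2_loss(y,t), huber_loss(y,t,h)]);     % one row of losses per point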

The most commonly used loss is the L2 loss. Substituting it into the objective gives \(\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\max\bigl(0,\,1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\bigr)^2\). Note that when \(1 - y_i(\mathbf{w}\cdot\mathbf{x}_i+b) > 0\), the training sample violates the margin (it is misclassified or lies inside the margin) and is marked as a support vector, abbreviated sv. The objective can therefore also be written as \(\|\mathbf{w}\|^2 + C\sum_{i\in sv}\bigl(1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\bigr)^2\). Our task is to find, from the training set \((\mathbf{x}_i, y_i)\) and with the help of an optimization algorithm, a pair \(\langle\mathbf{w}, b\rangle\) that makes the objective value globally optimal. If Newton's method is the chosen optimization method, then according to the Newton iteration formula we need the gradient and the Hessian with respect to \(\langle\mathbf{w}, b\rangle\):

\[\nabla = \begin{bmatrix} \dfrac{\partial f}{\partial \mathbf{w}} \\[2ex] \dfrac{\partial f}{\partial b} \end{bmatrix} = \begin{bmatrix} 2\mathbf{w} + 2C\displaystyle\sum_{i\in sv}\bigl(1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\bigr)(-y_i\mathbf{x}_i) \\[2ex] 2C\displaystyle\sum_{i\in sv}\bigl(1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\bigr)(-y_i) \end{bmatrix}\]
\[H = \begin{bmatrix} \dfrac{\partial^2 f}{\partial \mathbf{w}\,\partial \mathbf{w}} & \dfrac{\partial^2 f}{\partial \mathbf{w}\,\partial b} \\[2ex] \dfrac{\partial^2 f}{\partial b\,\partial \mathbf{w}} & \dfrac{\partial^2 f}{\partial b\,\partial b} \end{bmatrix} = \begin{bmatrix} 2I_d + 2C\displaystyle\sum_{i\in sv}\mathbf{x}_i\mathbf{x}_i^{\mathrm{T}} & 2C\displaystyle\sum_{i\in sv}\mathbf{x}_i \\[2ex] 2C\displaystyle\sum_{i\in sv}\mathbf{x}_i^{\mathrm{T}} & 2C\displaystyle\sum_{i\in sv} 1 \end{bmatrix}\]
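
The two expressions above are easy to check numerically. The following is a minimal MATLAB sketch, assuming X (n x d), Y (n x 1), w (d x 1), b and C are already in the workspace; it transcribes the formulas directly and is not the \(\lambda\)-scaled implementation used in primal_svm.m below.

out = 1 - Y.*(X*w + b);                      % margins 1 - y_i (w.x_i + b)
sv  = find(out > 0);                         % indices of the support vectors
Xsv = X(sv,:); Ysv = Y(sv);
grad = [2*w - 2*C*Xsv'*(Ysv.*out(sv));       % df/dw
        -2*C*sum(Ysv.*out(sv))];             % df/db
H = [2*eye(length(w)) + 2*C*(Xsv'*Xsv), 2*C*sum(Xsv,1)';
     2*C*sum(Xsv,1),                    2*C*length(sv)];
step = - H \ grad;                           % Newton direction for the stacked vector [w; b]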

Given the gradient and the Hessian, the Newton direction is obtained as \(step = -H^{-1}\nabla\). In general, however, the step length is not necessarily 1; an optimal step length \(t\) can be found by a one-dimensional line search. Here Newton's method is used again to find the optimal step length, which likewise requires a first and a second derivative: substitute \(\mathbf{w} \to \mathbf{w} + t\cdot step_{\mathbf{w}}\), \(b \to b + t\cdot step_b\) into the objective and differentiate once and twice with respect to \(t\) (the computation is similar to the above and straightforward; it is spelled out below).
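
Writing this step out for the L2 loss, with \(o_i(t)\) denoting the margin residual after a step of length \(t\) (this matches the line_search_linear routine below up to the scaling \(\lambda = 1/C\) and a common constant factor, which leave the Newton update for \(t\) unchanged):

\[\begin{aligned}
o_i(t) &= 1 - y_i\bigl((\mathbf{w}+t\,step_{\mathbf{w}})\cdot\mathbf{x}_i + b + t\,step_b\bigr) = out_i - t\,y_i(step_{\mathbf{w}}\cdot\mathbf{x}_i + step_b),\\
f(t) &= \|\mathbf{w}+t\,step_{\mathbf{w}}\|^2 + C\sum_{i\in sv(t)} o_i(t)^2,\\
f'(t) &= 2(\mathbf{w}+t\,step_{\mathbf{w}})\cdot step_{\mathbf{w}} - 2C\sum_{i\in sv(t)} o_i(t)\,y_i(step_{\mathbf{w}}\cdot\mathbf{x}_i + step_b),\\
f''(t) &= 2\|step_{\mathbf{w}}\|^2 + 2C\sum_{i\in sv(t)} (step_{\mathbf{w}}\cdot\mathbf{x}_i + step_b)^2,\\
t &\leftarrow t - f'(t)/f''(t).
\end{aligned}\]

The set \(sv(t)\) is recomputed for every trial value of \(t\), which is exactly what the line-search code below does via out2 and find(out2>0).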

3. Matlab implementation

Test/driver script test_primal_svm.m; modify it to fit your specific setting.

clear;clc;
global X
% Generating 3000 points in 100 dimensions
X = randn(3000,100); % 3000 x 100 matrix of standard-normal entries: 3000 samples, 100 dimensions each
% Label the 3000 randomly generated points
Y = sign(X(:,1)); % sign(x) = 1 when x > 0, -1 when x < 0
lambda = 1;
opt.loss_func = 3; % loss_func = 1: hinge loss; 2: L2 loss; 3: Huber loss
[w, b0 ]=primal_svm(1,Y,lambda,opt); % first argument linear = 1, i.e. a linear SVM

The function file primal_svm.m that it calls:

function [sol,b,obj] = primal_svm(linear,Y,lambda,opt)
% [SOL, B] = PRIMAL_SVM(LINEAR,Y,LAMBDA,OPT)
% Solves the SVM optimization problem in the primal (with quadratic
% penalization of the training errors).
%
% If LINEAR is 1, a global variable X containing the training inputs
% should be defined. X is an n x d matrix (n = number of points).
% If LINEAR is 0, a global variable K (the n x n kernel matrix) should be defined.
% Y is the target vector (+1 or -1, length n).
% LAMBDA is the regularization parameter ( = 1/C)
%
% IF LINEAR is 0, SOL is the expansion of the solution (vector beta of length n).
% IF LINEAR is 1, SOL is the hyperplane w (vector of length d).
% B is the bias
% The outputs on the training points are either K*SOL+B or X*SOL+B
% OBJ is the objective function value
%
% OPT is a structure containing the options (in brackets default values):
% cg: Do not use Newton, but nonlinear conjugate gradients [0]
% lin_cg: Compute the Newton step with linear CG
% [0 unless solving sparse linear SVM]
% iter_max_Newton: Maximum number of Newton steps [50]
% prec: Stopping criterion
% cg_prec and cg_it: stopping criteria for the linear CG.

% Copyright Olivier Chapelle, olivier.chapelle@tuebingen.mpg.de
% Last modified 25/08/2006

% nargin is the number of input arguments passed to the function;
% if the options struct was not supplied, use the default values
if nargin < 4 % Assign the options to their default values
  opt = []; % the struct has not been declared yet, so declare it
end;
% Check whether each field of opt has been defined (mainly cg)
if ~isfield(opt,'cg'), opt.cg = 0; end;
if ~isfield(opt,'lin_cg'), opt.lin_cg = 0; end;
if ~isfield(opt,'iter_max_Newton'), opt.iter_max_Newton = 50; end;
if ~isfield(opt,'prec'), opt.prec = 1e-5; end;
if ~isfield(opt,'cg_prec'), opt.cg_prec = 1e-4; end;
if ~isfield(opt,'cg_it'), opt.cg_it = 20; end;
% The default loss function is L2
if ~isfield(opt,'loss_func'), opt.loss_func = 2; end;
if opt.loss_func == 3 % if the loss is Huber, define the Huber parameter h
  opt.huber_h = 0.5; % h is usually chosen between 0.01 and 0.5
end
% Call the right function depending on problem type and CG / Newton
% Also check that X / K exists and that the dimension of Y is correct if linear
global X; % to access the global variable X, it must be re-declared as global here
if isempty(X), error('Global variable X undefined'); end;
[n,d] = size(X);
if issparse(X), opt.lin_cg = 1; end;
% check whether X uses sparse storage
if size(Y,1)~=n, error('Dimension error'); end;
% check that the number of labels matches the number of samples
if ~opt.cg
  [sol,obj] = primal_svm_linear (Y,lambda,opt); % train a linear SVM with Newton's method
else
  [sol,obj] = primal_svm_linear_cg(Y,lambda,opt); % train a linear SVM with nonlinear conjugate gradients
end;
% The last component of the solution is the bias b.
b = sol(end);
sol = sol(1:end-1);
fprintf('\n'); 

function [w,obj] = primal_svm_linear(Y,lambda,opt)
% -------------------------------
% Train a linear SVM using Newton
% -------------------------------
Each_Group_Num = 1500; % number of samples in each training subset
global X;
X_all = X;
Y_all = Y;
isGroup = false; % grouping flag
%%%%%%%%%%%% extract random sample subsets before the iterations start %%%%%%%%%%%%%
training = find(Y_all);
% The points with 0 are ignored.
n = length(training); % The real number of training points
if n>=Each_Group_Num % Train a subset first; split into groups if there are more points than Each_Group_Num
  isGroup = ~false;
  group_n = fix(n/Each_Group_Num); % Each_Group_Num points per group; the leftover points are discarded
  perm = randperm(n); % a random permutation of the integers 1..n
  for i =1:group_n % discarding the leftover points keeps all group vectors the same length
    Group.n(:,i) = training(perm((i-1)*Each_Group_Num+1:i*Each_Group_Num));
  end
end
%%%%%%%%%%%% extract random sample subsets before the iterations start %%%%%%%%%%%%%
if (~isGroup) % small training set, not split into groups
  [n,d] = size(X_all);
else
  X = X_all(Group.n(:,1),:); % extract X for the random subset
  Y = Y_all(Group.n(:,1)); % extract Y for the random subset
  [n,d] = size(X);
end
if opt.loss_func == 2
  w = zeros(d+1,1); % The last component of w is b.
end
if opt.loss_func == 3
  w = ones(d+1,1)/10;
  w(d+1) = 0; % for Huber, the starting point should make some samples sv: initialize w to 0.1 and b to 0
end
iter = 0;
t = 0; % initial step length is 0
step = zeros(d+1,1);
while 1
  iter = iter + 1;
  if isGroup
    group_num_iter = rem(iter,group_n);
    if group_num_iter == 0
      group_num_iter = group_n;
    end
    X = X_all(Group.n(:,group_num_iter),:); % extract X for the random subset
    Y = Y_all(Group.n(:,group_num_iter)); % extract Y for the random subset
  end
  % the new training group also searches for new sv using the previous step length and direction,
  % since w has already been updated before this iteration: w = w + t*step
  out = ones(n,1) - Y.*(X*w(1:end-1)+w(end));
  if iter > opt.iter_max_Newton;
    warning(sprintf(['Maximum number of Newton steps reached.' ...
    'Try larger lambda']));
    break;
  end;
  if opt.loss_func == 2
    [obj, grad, sv] = obj_fun_linear(w,Y,lambda,out); % objective value, gradient and support vectors for the L2 loss
  end
  if opt.loss_func == 3
    [obj, grad, sv, nsv] = obj_fun_linear_huber(w,Y,lambda,out,opt);
    % objective value, gradient and support vectors for the Huber loss
  end

  % Compute the Newton direction either exactly or by linear CG
  if opt.lin_cg
    % Advantage of linear CG when using sparse input: the Hessian is never
    % computed explicitly.
    [step, foo, relres] = minres(@hess_vect_mult, -grad,...
    opt.cg_prec,opt.cg_it,[],[],[],sv,lambda);
  else
    Xsv = X(sv,:); % rows of X corresponding to the support vectors
    % note: w and x are augmented (w carries b and x carries a constant 1),
    % so the gradient and Hessian are augmented accordingly:
    % grad = [df/dw; df/db]
    % hess = [df^2/dwdw, df^2/dwdb; df^2/dbdw, df^2/dbdb]
    if opt.loss_func == 2
      hess = lambda*diag([ones(d,1); 0]) + ... % Hessian
      [[Xsv'*Xsv sum(Xsv,1)']; [sum(Xsv) length(sv)]];
      % sum(x) and sum(x,1) both sum down the columns, returning a row vector of column sums
    end
    if opt.loss_func == 3
      hess = 2*lambda*diag([ones(d,1); 0]) + [[Xsv'*Xsv ...
      sum(Xsv,1)'];[sum(Xsv) length(sv)]]/(2*opt.huber_h);
    end
    step = - hess \ grad; % Newton direction
  end;

  % Do an exact line search
  if opt.loss_func == 2
    [t,out] = line_search_linear(w,step,out,Y,lambda);
    fprintf(['Iter = %d, Obj = %f, Nb of sv = %d, Newton decr = %.5f, ' ...
    'Line search = %.3f'],iter,obj,length(sv),-step'*grad/2,t);
  end
  if opt.loss_func == 3
    [t,out] = line_search_linear_huber(w,step,out,Y,lambda,opt);
    % The 1D line search can get stuck in an infinite loop; a simple workaround is a constant step t = 0.1
    % t = 0.2;
    % with t = 1 it diverges after 3 steps
    % with t = 0.1 it converges to obj = 187.69188 after 115 steps
    % with t = 0.2 it converges in about 40 steps, but the stopping criterion must be relaxed; obj = 178.27
    % with t = 0.3 it converges in about 22 steps, but the stopping criterion needs to be changed
    % with t >= 0.4 it diverges
    % with the 1D line search it converges in about 10 steps, but the stopping criterion must be relaxed
    % this is an empirical step length
    % with groups of 1500 training samples each, the iterations clearly work toward convergence but can still loop forever
    fprintf(['Iter = %d, Obj = %f, Nb of nsv = %d, Nb of sv = %d, Newton decr'...
    ' = %.3f, Line search = %.3f'],iter,obj,length(nsv),length(sv),-step'*grad/2,t);
  end
  if opt.lin_cg
    fprintf(', Lin CG acc = %.4f \n',relres);
  else
    fprintf(' \n');
  end;
  w = w + t*step;

  if -step'*grad < opt.prec * obj
    % Stop when the Newton decrement is small enough
    break;
  end;
end; 

function [obj, grad, sv] = obj_fun_linear(w,Y,lambda,out)
% Compute the objective function, its gradient and the set of support vectors
% Out is supposed to contain 1-Y.*(X*w)
global X
out = max(0,out);
w0 = w;
w0(end) = 0; % Do not penalize b
obj = sum(out.^2)/2 + lambda*w0'*w0/2;
% L2 penalization of the errors; note that .* is element-wise multiplication while * is matrix multiplication
grad = lambda*w0 - [((out.*Y)'*X)'; sum(out.*Y)]; % Gradient
sv = find(out>0); % returns the (column-major) indices of the elements satisfying the condition

%%%%%%%%%%%%%%%%%%%%%%% edit by jacoxu %%%%%%%%%%%%%%%%%%%%%%
function [obj, grad, sv, nsv] = obj_fun_linear_huber(w,Y,lambda,out,opt)
% Compute the objective function, its gradient and the set of support vectors
% Out is supposed to contain 1-Y.*(X*w)
global X
% Split out, i.e. 1-Y.*(X*w), into the regions of the Huber loss and build the diagonal indicator matrices I_0, I_1
I0 = (abs(out) <= opt.huber_h); % quadratic region of the Huber loss
sv = find(I0); % the sv of the Huber loss
E.I0 = diag(I0);
I1 = (out > opt.huber_h); % linear region of the Huber loss
nsv = find(I1); % the nsv of the Huber loss
E.I1 = diag(I1);
% the remaining points have zero loss and can be ignored

w0 = w; w0(end) = 0; % Do not penalize b
out_h = out + opt.huber_h;
obj = lambda*(w0'*w0) + sum(out_h(sv).^2)/(4*opt.huber_h) + sum(out(nsv));
% objective value with the Huber loss; .* is element-wise multiplication, * is matrix multiplication
grad = 2*lambda*w0 - [((E.I0*out_h.*Y)'*X)'/(2*opt.huber_h)+((E.I1*Y)'*X)';...
sum(out_h(sv).*Y(sv))/(2*opt.huber_h)+sum(Y(nsv))];
% ... is the line-continuation operator
%%%%%%%%%%%%%%%%%%%%%%% edit by jacoxu %%%%%%%%%%%%%%%%%%%%%%

function [t,out] = line_search_linear(w,d,out,Y,lambda)
% From the current solution w, do a line search in the direction d by
% 1D Newton minimization
% note: the line search is itself a Newton descent, searching for the optimal step length
global X
t = 0;
% Precompute some dots products
Xd = X*d(1:end-1)+d(end); % adding a scalar to a matrix/vector adds it to every element
wd = lambda * w(1:end-1)'*d(1:end-1);
dd = lambda * d(1:end-1)'*d(1:end-1);
while 1
  out2 = out - t*(Y.*Xd); % The new outputs after a step of length t
  sv = find(out2>0);
  g = wd + t*dd - (out2(sv).*Y(sv))'*Xd(sv); % The gradient (along the line)
  h = dd + Xd(sv)'*Xd(sv); % The second derivative (along the line)
  t = t - g/h; % Take the 1D Newton step. Note that if d was an exact Newton
  % direction, t is 1 after the first iteration.
  if g^2/h < 1e-10, break; end;
  % fprintf('%f %f\n',t,g^2/h)
end;
out = out2; % the new out for this group; when training in groups, this out is not reused

%%%%%%%%%%%%%%%%%%%%%%% edit by jacoxu %%%%%%%%%%%%%%%%%%%%%%
function [t,out] = line_search_linear_huber(w,d,out,Y,lambda,opt)
% From the current solution w, do a line search in the direction d by
% 1D Newton minimization
% note: the line search is itself a Newton descent, searching for the optimal step length
global X
t = 0;
iter_t = 0;
% Precompute some dots products
Xd = X*d(1:end-1)+d(end); % adding a scalar to a matrix/vector adds it to every element
wd = lambda * w(1:end-1)'*d(1:end-1);
dd = lambda * d(1:end-1)'*d(1:end-1);
while 1
  iter_t = iter_t + 1;
  out2 = out - t*(Y.*Xd); % The new outputs after a step of length t
  % Split out2, i.e. 1-Y.*(X*w) after the step, into the regions of the Huber loss
  I0 = (abs(out2) <= opt.huber_h); % quadratic region of the Huber loss
  sv = find(I0); % the sv of the Huber loss
  I1 = (out2 > opt.huber_h); % linear region of the Huber loss
  nsv = find(I1); % the nsv of the Huber loss
  E.I1 = diag(I1);
  % the remaining points have zero loss and can be ignored

  out2_h = out2 + opt.huber_h;
  g = 2*(wd+t*dd)-((out2_h(sv).*Y(sv))'*Xd(sv))/(2*opt.huber_h)-Y(nsv)'*Xd(nsv);
  % The gradient (along the line)
  h = 2*dd + ((Xd(sv))'*Xd(sv))/(2*opt.huber_h);
  t = t - 0.2*g/h; % Take the 1D Newton step. Note that if d was an exact Newton
  % direction, t is 1 after the first iteration.
  %%%%% with a coefficient of 1 this runs into trouble: the loop repeats forever %%%%%
  % with a factor of 0.2 it converges in 23 iterations; first step t = 0.1937
  % with a factor of 0.3 it converges in 17 iterations; first step t = 0.2052
  fprintf('Iter_t = %d, Newton decr = %.5f \n',iter_t,-g^2/h);
  if g^2/h < 1e-4
    break;
  end;
  % fprintf('%f %f\n',t,g^2/h)
end;
out = out2; % the new out for this group; when training in groups, this out is not reused
%%%%%%%%%%%%%%%%%%%%%%% edit by jacoxu %%%%%%%%%%%%%%%%%%%%%%

Reference: parts of the Matlab code are based on http://olivier.chapelle.cc/primal
