Problems Encountered in a MultiClass SVM Classification Attempt

I recently received some data: training and test samples extracted from images, with roughly 1.2M training samples, 1K-dimensional features, and 1K classes. The 4GB of data is split into 11 sub-files stored sequentially by class (note: not randomly shuffled, which causes trouble for the training later). I then attempted Linear SVM training, or rather just a quick test, since intuitively a traditional classifier will not perform well with this many classes. Still, I am recording the problems encountered during the test here, as a starting point for later attempts at improvement :-).

For the first step of the test: given such high-dimensional features, dropping \(b\) from the linear hyperplane \(f({\bf{x}}) = {\bf{w}} \cdot {\bf{x}} + b\) should not noticeably affect the result, and it considerably simplifies the gradient and Hessian of the L2-loss objective.
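To fix notation, the derivatives below correspond to the standard L2-loss primal objective, with \(sv\) denoting the set of margin-violating samples:

\[ f(\mathbf{w}, b) = \left\| \mathbf{w} \right\|^2 + C\sum\limits_{i \in sv} \left( 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \right)^2, \qquad sv = \left\{ i \;:\; y_i(\mathbf{w} \cdot \mathbf{x}_i + b) < 1 \right\} \]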
\[\begin{aligned}
\nabla &= \begin{bmatrix} \dfrac{\partial f}{\partial \mathbf{w}} \\[4pt] \dfrac{\partial f}{\partial b} \end{bmatrix}
= \begin{bmatrix} 2\mathbf{w} + 2C\sum\limits_{i \in sv} \left( 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \right)(-y_i \mathbf{x}_i) \\[4pt] 2C\sum\limits_{i \in sv} \left( 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \right)(-y_i) \end{bmatrix} \\
&\xrightarrow{\;b\,=\,0\;} \left[ \dfrac{\partial f}{\partial \mathbf{w}} \right]
= \left[ 2\mathbf{w} + 2C\sum\limits_{i \in sv} \left( 1 - y_i\,\mathbf{w} \cdot \mathbf{x}_i \right)(-y_i \mathbf{x}_i) \right]
\end{aligned}\]
\[\begin{aligned}
H &= \begin{bmatrix} \dfrac{\partial^2 f}{\partial \mathbf{w} \partial \mathbf{w}} & \dfrac{\partial^2 f}{\partial \mathbf{w} \partial b} \\[4pt] \dfrac{\partial^2 f}{\partial b \partial \mathbf{w}} & \dfrac{\partial^2 f}{\partial b \partial b} \end{bmatrix}
= \begin{bmatrix} 2I_d + 2C\sum\limits_{i \in sv} \mathbf{x}_i \mathbf{x}_i^{\mathrm{T}} & 2C\sum\limits_{i \in sv} \mathbf{x}_i \\[4pt] 2C\sum\limits_{i \in sv} \mathbf{x}_i^{\mathrm{T}} & 2C\sum\limits_{i \in sv} 1 \end{bmatrix} \\
&\xrightarrow{\;b\,=\,0\;} \left[ \dfrac{\partial^2 f}{\partial \mathbf{w} \partial \mathbf{w}} \right]
= \left[ 2I_d + 2C\sum\limits_{i \in sv} \mathbf{x}_i \mathbf{x}_i^{\mathrm{T}} \right]
\end{aligned}\]

The simplest approach to multiclass classification is 1-vs-all: train a binary classifier for each class, and repeat 1K times to obtain the multiclass classifier, whose decision rule is sketched below.
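At test time the 1K binary scores are combined with an argmax. A minimal sketch, assuming the trained hyperplanes (bias as the last entry) are stacked into a hypothetical (d+1) x 1000 matrix W_all and Xtest holds the test features:

scores = Xtest*W_all(1:end-1,:) + repmat(W_all(end,:),size(Xtest,1),1); % margins of all 1000 classifiers
[~, pred_class] = max(scores,[],2); % predict the class with the largest margin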

Since these 1K per-class problems are essentially symmetric, I started by training class 1 alone, using Newton's method to train the Linear SVM in the primal. The first problem encountered: how to handle data spread across 11 files? Following the idea of stochastic gradient descent, using only part of the dataset in each Newton iteration should be fine. On further thought, though, 1-vs-all classification is extremely imbalanced, which makes it all too easy for the positive samples to be discarded as noise during training. A remedy is to raise the penalty factor \(C\) in the objective for the positive samples by the ratio \(\frac{(all - pos)}{pos}\), which ensures the positives are not ignored despite the severe imbalance.

The next problem follows immediately. In stochastic gradient descent, the random samples drawn in each iteration are identically distributed; even in the extreme case of drawing a single sample per iteration, the draws still follow the data distribution, and over the iterations the gradient gradually shrinks until it converges within a small range. Our 11 files, however, are stored sequentially by class: all samples of class 1 sit in file 1. So even with the corrected penalty factor \(C\), the following 10 iterations train on negative samples only. What easily happens is that within each cycle over the 11 files, the \(\mathbf{w}\) trained on file 1 serves as the initial value for the next 10 iterations, and by the end of the cycle the solution has converged onto the negative set; when the second cycle starts on file 1, a large objective value and gradient appear again. Unless the samples are distributed fairly evenly across the 11 files, there is no way to guarantee steady convergence across the cycles. An experiment bears this out: I ran 23 iterations, with every 11 forming one cycle. The printed log below shows the objective value dropping rapidly over the last 10 files, but returning to a very large value at the start of each cycle, with no downward trend.

Iter = 1, Obj = 805229.500000, Nb of sv = 124946, Newton decr = 681920.00000, Line search = 1.077
Iter = 2, Obj = 123105.437500, Nb of sv = 124744, Newton decr = 122080.12500, Line search = 1.044
Iter = 3, Obj = 214.565216, Nb of sv = 123782, Newton decr = 137.74878, Line search = 1.394
Iter = 4, Obj = 29.986719, Nb of sv = 124711, Newton decr = 10.13229, Line search = 2.038
Iter = 5, Obj = 9.733945, Nb of sv = 124002, Newton decr = 2.52931, Line search = 1.636
Iter = 6, Obj = 5.415860, Nb of sv = 124202, Newton decr = 0.92609, Line search = 2.456
Iter = 7, Obj = 3.181893, Nb of sv = 123927, Newton decr = 0.52187, Line search = 1.753
Iter = 8, Obj = 2.258441, Nb of sv = 123998, Newton decr = 0.20214, Line search = 2.560
Iter = 9, Obj = 1.730960, Nb of sv = 123385, Newton decr = 0.17267, Line search = 1.710
Iter = 10, Obj = 1.430632, Nb of sv = 124104, Newton decr = 0.07477, Line search = 2.853
Iter = 11, Obj = 1.211763, Nb of sv = 19605, Newton decr = 0.33495, Line search = 1.973
Iter = 12, Obj = 3138217.250000, Nb of sv = 124946, Newton decr = 3004216.00000, Line search = 1.047
Iter = 13, Obj = 119924.625000, Nb of sv = 124744, Newton decr = 118802.75000, Line search = 1.050
Iter = 14, Obj = 208.002533, Nb of sv = 123782, Newton decr = 127.20361, Line search = 1.430
Iter = 15, Obj = 34.009930, Nb of sv = 124711, Newton decr = 11.09056, Line search = 2.065
Iter = 16, Obj = 11.553728, Nb of sv = 124002, Newton decr = 2.87595, Line search = 1.634
Iter = 17, Obj = 6.599584, Nb of sv = 124202, Newton decr = 1.10904, Line search = 2.516
Iter = 18, Obj = 3.882921, Nb of sv = 123927, Newton decr = 0.64134, Line search = 1.743
Iter = 19, Obj = 2.746929, Nb of sv = 123998, Newton decr = 0.23902, Line search = 2.597
Iter = 20, Obj = 2.118799, Nb of sv = 123385, Newton decr = 0.21016, Line search = 1.717
Iter = 21, Obj = 1.743901, Nb of sv = 124104, Newton decr = 0.09170, Line search = 2.824
Iter = 22, Obj = 1.479050, Nb of sv = 19605, Newton decr = 0.42116, Line search = 1.980
Iter = 23, Obj = 3157472.750000, Nb of sv = 124946, Newton decr = 3022556.00000, Line search = 1.047

Faced with this thorny situation, I was forced to merge the data from all 11 files into a single file so that every iteration uses the full dataset (which defeats the original plan: the idea had been to reach a converged result by training on small sample sets, without merging all the data). For now things have not gone as hoped, so I proceeded to test on the full data first.
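Had the samples been shuffled across the 11 files in the first place, each chunk would have been an i.i.d. draw from the full distribution and the cyclic divergence above would not occur. A minimal sketch of writing such shuffled splits, assuming hypothetical variables all_X and all_Y hold the merged features and labels:

perm  = randperm(size(all_X,1));   % random permutation of all 1.2M rows
chunk = ceil(length(perm)/11);     % samples per file
for k = 1:11
  idx = perm((k-1)*chunk+1 : min(k*chunk,length(perm)));
  Feats_rand_save = all_X(idx,:);  % shuffled feature chunk
  q_all = all_Y(idx,:);            % matching label rows
  save(sprintf('train_split_%d_of_11_shuffled.mat',k),'Feats_rand_save','q_all');
end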

Iterating this way yields the 1-vs-all linear classifier for class 1, which I then tested. The test set contains 50K samples over 1K classes, i.e. 50 samples per class. The result: after 20 iterations, classifying the test data with the hyperplane defined by the trained \(\mathbf{w}\), 5,860 of the 49,500 negative samples were misclassified as positive, and 6 of the 50 positive samples were misclassified as negative. Taking the class proportions into account, recall comes out at roughly 90%, but because of the 1-vs-all imbalance the rate of correct classifications among the samples predicted positive is very low. Testing classes 2 and 3 the same way gives similar statistics. (What if \(b\) is not dropped? I tested this afterwards; the results stay the same, with essentially no improvement.) Such multiclass problems are very common in image classification; after all, this matches how we humans perceive the world, as people typically distinguish at least 10K categories. (Another problem encountered here: after merging the 11 files, MATLAB cannot hold a variable as large as 4GB, so the merge has to be redone online on every run. And since I use matrix operations throughout, including the assignment of the penalty factor \(C\), MATLAB also cannot create a regular 1.2M x 1.2M matrix; the sparse storage format is mandatory.)
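As a quick sanity check of the class-1 numbers above:

TP = 50 - 6;  FN = 6;  FP = 5860;  % counts from the class-1 test above
recall    = TP/(TP+FN)             % = 0.88, i.e. roughly the 90% quoted
precision = TP/(TP+FP)             % = 0.0075: why the predicted positives are mostly wrong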

I later found a paper in ECCV 2010 that addresses exactly this vision multiclass classification problem: it tests traditional classifiers (accuracy is very low) and proposes several schemes for improving accuracy that are worth consulting.

Below is the MATLAB code that merges and processes the 11 files (this version does not drop \(b\)).
Test_primal_svm.m:

clear;clc;
lambda = 1;
opt.classifer_num = 1000; % number of classifiers (one per class)
opt.dim = 1000;           % dimension of the input vectors
opt.files_num = 11;       % number of data files
opt.loss_func = 2;        % loss: 1 = Hinge, 2 = L2, 3 = Huber
train_data_num = 0;
for file_iter = 1:opt.files_num
  filename = 'train_data/train_split_?_of_11_nystrom.mat';
  % replace ? with the index of the data file to load
  filename = strrep(filename,'?',int2str(file_iter));
  load(filename);  % load the data into memory
  new_data_num = length(q_all(:,1));
  train_data_X(train_data_num+1:(train_data_num+new_data_num),:) = Feats_rand_save;
  train_data_Y(train_data_num+1:(train_data_num+new_data_num),:) = q_all;
  train_data_num = train_data_num+new_data_num;
end
Feats_rand_save = train_data_X; % now holds all the training data
q_all = train_data_Y;

for classifer_iter = 1:opt.classifer_num % train the 1000 classifiers in turn
  [w, obj] = primal_svm(1,lambda,classifer_iter,Feats_rand_save,q_all,opt); % first argument: linear = 1
end

primal_svm.m:

function [sol,obj] = primal_svm(linear,lambda,classifer_iter,Feats_rand_save,q_all,opt)
% [SOL, OBJ] = PRIMAL_SVM(LINEAR,LAMBDA,CLASSIFER_ITER,FEATS_RAND_SAVE,Q_ALL,OPT)
% Solves the SVM optimization problem in the primal (with quadratic
%   penalization of the training errors).
%
% If LINEAR is 1, FEATS_RAND_SAVE contains the training inputs,
%   an n x d matrix (n = number of points).
% If LINEAR is 0, a global variable K (the n x n kernel matrix) should be defined.
% LAMBDA is the regularization parameter ( = 1/C)
%
% IF LINEAR is 0, SOL is the expansion of the solution (vector beta of length n).
% IF LINEAR is 1, SOL is the augmented hyperplane [w; b] (vector of length d+1).
% The outputs on the training points are either K*SOL or X*SOL(1:end-1)+SOL(end)
% OBJ is the objective function value
%
% OPT is a structure containing the options (in brackets default values):
%   cg: Do not use Newton, but nonlinear conjugate gradients [0]
%   lin_cg: Compute the Newton step with linear CG
%           [0 unless solving sparse linear SVM]
%   iter_max_Newton: Maximum number of Newton steps [100]
%   prec: Stopping criterion
%   cg_prec and cg_it: stopping criteria for the linear CG.

% Copyright Olivier Chapelle, olivier.chapelle@tuebingen.mpg.de
% Last modified 25/08/2006

% nargin is the number of arguments actually passed in; if the options
%   struct was not supplied, fill it with default values
  if nargin < 6  % Assign the options to their default values
    opt = []; % the struct must be declared before isfield can be used
  end;
  % Check whether each field of opt has been defined (mainly cg)
  if ~isfield(opt,'cg'),                opt.cg = 0;                     end;
  if ~isfield(opt,'lin_cg'),            opt.lin_cg = 0;                 end;
  if ~isfield(opt,'iter_max_Newton'),   opt.iter_max_Newton = 100;      end;
  if ~isfield(opt,'prec'),              opt.prec = 1e-6;                end;
  if ~isfield(opt,'cg_prec'),           opt.cg_prec = 1e-4;             end;
  if ~isfield(opt,'cg_it'),             opt.cg_it = 20;                 end;
  if ~isfield(opt,'dim'),               opt.dim = 2;                    end;
  if ~isfield(opt,'files_num'),         opt.files_num = 1;              end;
  % The default loss function is L2
  if ~isfield(opt,'loss_func'),         opt.loss_func = 2;              end;
  if opt.loss_func == 3  % if the loss is Huber, define the Huber parameter h
    opt.huber_h = 0.5; % h is usually chosen between 0.01 and 0.5
  end
  % Call the right function depending on problem type and CG / Newton
  [sol,obj] = primal_svm_linear(lambda,classifer_iter,Feats_rand_save,q_all,opt);
  fprintf('\n');

function [w,obj] = primal_svm_linear(lambda,classifer_iter,Feats_rand_save,q_all,opt)
% -------------------------------
% Train a linear SVM using Newton
% -------------------------------
  % Set the dimension and initial value before iterating; some sv must exist
  %   initially or the Hessian would not be invertible
  if opt.loss_func == 2
    w = zeros(opt.dim+1,1); % augmented: the last entry is b
  end
  if opt.loss_func == 3
    w = ones(opt.dim,1)/10;
    % the Huber initial condition should make some samples sv from the start,
    %   i.e. initialize w to 0.1
  end
  iter = 0; % iteration counter; starting at 0 keeps the rem-based filenames correct
  while 1
    % First extract the positive samples and count them, then correct the
    %   penalty weight C by the negative/positive ratio
    pos_flag = find(q_all(:,classifer_iter));
    pos_n = length(pos_flag);
    X = Feats_rand_save;
    % Binarize the multiclass labels into Y
    Y = -ones(length(q_all(:,1)),1); % initialize all labels as negative
    Y(pos_flag) = 1;
    [n,d] = size(X);
    if (size(Y,1)~=n)||(d~=opt.dim)
        error('Dimension error');
    end
    % Build the penalty factor C
    C_1d = ones(n,1);
    C_1d(pos_flag) = (n-pos_n)/pos_n; % boost the penalty factor C of the positives
    C = spdiags(C_1d,0,n,n); % the matrix is too large, so it must be stored sparse
    % A new batch of training data must also search for new sv along the previous
    %   step and direction; w was already updated before this iteration: w = w+t*step
    out = ones(n,1) - Y.*(X*w(1:end-1)+w(end)); % out = 1-Y.*(X*w+b)
    if iter > opt.iter_max_Newton;
      warning(sprintf(['Maximum number of Newton steps reached. ' ...
                       'Try larger lambda']));
      break;
    end;
    if opt.loss_func == 2
        [obj, grad, sv] = obj_fun_linear(w,X,Y,C,out); % L2 loss: objective value, gradient and support vectors
    end
    if opt.loss_func == 3
        [obj, grad, sv, nsv] = obj_fun_linear_huber(w,X,Y,lambda,out,opt);
        % Huber loss: objective value, gradient and support vectors
    end

    % Compute the Newton direction either exactly or by linear CG
    if opt.lin_cg
      % Advantage of linear CG when using sparse input: the Hessian is never
      %   computed explicitly.
      [step, foo, relres] = minres(@hess_vect_mult, -grad,...
                                   opt.cg_prec,opt.cg_it,[],[],[],sv,lambda);
    else
      I0 = zeros(n,1);
      I0(sv) = 1;
      E.I0 = spdiags(I0,0,n,n); % sparse diagonal matrix marking the sv
      % Note: w and x are augmented, w = [w; b], x = [x; 1], so the gradient
      %   and Hessian are augmented as well:
      % grad = [df/dw; df/db]
      % hess = [d2f/dwdw, d2f/dwdb; d2f/dbdw, d2f/dbdb]
      % If b is dropped, then:
      % grad = [df/dw]
      % hess = [d2f/dwdw]
      if opt.loss_func == 2
          hess = diag([ones(opt.dim,1);0]) + ...
          [[(C*E.I0*double(X))'*(E.I0*double(X)) sum(C*E.I0*double(X),1)'];...
          [sum(C*E.I0*double(X)) trace(C*E.I0)]];   % Hessian
      end
      if opt.loss_func == 3
          Xsv = X(sv,:); % extract the support vectors
          hess = 2*lambda*diag([ones(d,1); 0]) + [[Xsv'*Xsv ...
              sum(Xsv,1)'];[sum(Xsv) length(sv)]]/(2*opt.huber_h);
      end
      step  = - hess \ grad;   % Newton direction
    end;
    iter = iter + 1; % advance the iteration counter before printing
    % Do an exact line search
    if opt.loss_func == 2
      t = line_search_linear(w,step,out,X,Y,C);
    end
    if opt.loss_func == 3
  %   t = line_search_linear_huber(w,step,out,Y,lambda,opt);
  %   the 1D line search got stuck in an infinite loop, so simply fix the step size t
      t = 0.2;
      % t = 1:    diverges within 3 steps
      % t = 0.1:  converges in 115 steps to obj = 187.69188
      % t = 0.2:  converges in about 40 steps (Obj = 178.27), but the stopping
      %           criterion must be relaxed
      % t = 0.3:  converges in about 22 steps; the stopping criterion must be changed
      % t >= 0.4: diverges
      % with the 1D line search it converges in about 10 steps, but the stopping
      %   criterion must be relaxed
      % ????? an empirical step size ?????
      % with sub-training sets of 1500 samples per group, the iterations can be
      %   seen straining toward convergence, but still end in an infinite loop
    end
    w = w + t*step;
    if opt.loss_func == 2
      fprintf(['Iter = %d, Obj = %f, Nb of sv = %d, Newton decr = %.5f, ' ...
      'Line search = %.3f'],iter,obj,length(sv),-step'*grad/2,t);
    end
    if opt.loss_func == 3
      fprintf(['Iter = %d, Obj = %f, Nb of nsv = %d, Nb of sv = %d, Newton decr'...
      ' = %.3f, Line search = %.3f'],iter,obj,length(nsv),length(sv),-step'*grad/2,t);
    end
    if opt.lin_cg
      fprintf(', Lin CG acc = %.4f     \n',relres);
    else
      fprintf('      \n');
    end;
    w_filename ='w0001_?.mat'; % save w at every step
    % replace ? with the current iteration number
    w_filename = strrep(w_filename,'?',int2str(iter));
    save(w_filename,'w');
    if iter == 50
        iter = iter; % for debugging (breakpoint anchor)
    end
    if abs(step'*grad) < opt.prec * obj
      % Stop when the Newton decrement is small enough
      break;
    end;
  end;

function [obj, grad, sv] = obj_fun_linear(w,X,Y,C,out)
  % Compute the objective function, its gradient and the set of support vectors
  % out is supposed to contain 1-Y.*(X*w+b)
  out = max(0,out);
  w0 = w; w0(end) = 0; % do not penalize b
  obj = sum(C*double(out.^2))/2 + (w0'*w0)/2; % L2 penalization of the errors
  % note: .* multiplies elementwise, * is ordinary matrix multiplication
  grad = w0 - [((C*double(out.*Y))'*X)';sum(C*double(out.*Y))]; % Gradient
  sv = find(out>0); % indices of the nonzero entries (column-major order)

function t = line_search_linear(w,d,out,X,Y,C)
  % From the current solution w, do a line search in the direction d by
  % 1D Newton minimization
  % Note: the line search is itself a 1D Newton descent, searching for the optimal step size
  t = 0;
  n = length(Y);
  % Precompute some dots products
  Xd = X*d(1:end-1)+d(end); % adding a scalar to a matrix/vector adds it to every element
  wd = w(1:end-1)'*d(1:end-1);
  dd = d(1:end-1)'*d(1:end-1);
  while 1
    out2 = out - t*(Y.*Xd); % The new outputs after a step of length t
    I0 = (out2>=0);
    E.I0 = spdiags(I0,0,n,n); % sparse diagonal matrix marking the sv
    g = wd + t*dd -(C*E.I0*double(out2.*Y))'*Xd; % The gradient (along the line)
    h = dd + (C*E.I0*double(Xd))'*Xd; % The second derivative (along the line)
    t = t - g/h; % Take the 1D Newton step. Note that if d was an exact Newton
                 % direction, t is 1 after the first iteration.
    if g^2/h < 1e-8, break; end;
%    fprintf('%f %f\n',t,g^2/h)
  end;
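For completeness, a minimal sketch of applying one of the saved hyperplanes to the test set. Here test_X and test_Y are hypothetical names for the 50000 x 1000 feature matrix and the binarized (+1/-1) labels, and the file name assumes 20 Newton iterations were run:

load('w0001_20.mat');                     % w saved at the 20th Newton iteration
pred = sign(test_X*w(1:end-1) + w(end));  % augmented w: the last entry is the bias b
err  = mean(pred ~= test_Y)               % overall test error rate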
