[Data and code for this article] see [Github, https://github.com/jacoxu/STC2]
@article{xu2017self,
title={Self-Taught Convolutional Neural Networks for Short Text Clustering},
author={Xu, Jiaming and Xu, Bo and Wang, Peng and Zheng, Suncong and Tian, Guanhua and Zhao, Jun and Xu, Bo},
journal={Neural Networks},
doi={10.1016/j.neunet.2016.12.008},
year={2017}
}
The three datasets come from SearchSnippets, StackOverflow, and Biomedical.

SearchSnippets: Phan et al. [2] collected text snippets retrieved from a search engine using predefined phrases from 8 domains, yielding 12,340 short texts in total. The domains include business, computers, health, and education. The maximum sentence length is 38, the average sentence length is 17.88, and the vocabulary size is 30,642.

StackOverflow: We selected question titles under 20 different tags from the data [3] of the IT Q&A community StackOverflow, for a total of 20,000 short texts. The tags include svn, oracle, bash, and apache. The maximum sentence length is 34, the average sentence length is 8.31, and the vocabulary size is 22,956.

Biomedical: We selected paper titles under 20 MeSH topics from the official data [4] of the well-known biomedical platform BioASQ, for a total of 20,000 short texts. The topics include aging, chemistry, cats, and erythrocytes. The maximum sentence length is 53, the average sentence length is 12.88, and the vocabulary size is 18,888.
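The per-dataset statistics above (maximum sentence length, average sentence length, vocabulary size) are straightforward to compute from any tokenized corpus. A minimal Python sketch, using a hypothetical toy corpus in place of the real datasets:

```python
# Corpus statistics as reported above: max sentence length, average
# sentence length, and vocabulary size. The toy `corpus` is a
# hypothetical stand-in for the real tokenized short texts.
corpus = [
    ["cheap", "web", "hosting", "review"],
    ["java", "list", "sort"],
    ["protein", "folding", "in", "aging", "cells"],
]

max_len = max(len(doc) for doc in corpus)
avg_len = sum(len(doc) for doc in corpus) / len(corpus)
vocab_size = len({word for doc in corpus for word in doc})

print(max_len, round(avg_len, 2), vocab_size)  # 5 4.0 12
```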
Clustering results on these datasets are reported in the paper [1]; classification accuracy (ACC, %) is summarized below:
| Classification Methods | SearchSnippets | StackOverflow | Biomedical |
|------------------------|----------------|---------------|------------|
| SVM-Linear (TF)        | 67.72          | 83.70         | 71.48      |
| SVM-Linear (TFIDF)     | 70.96          | 84.55         | 71.55      |
| SVM-Kernel (TF)        | 62.32          | 79.05         | 68.73      |
| SVM-Kernel (TFIDF)     | 64.78          | 82.23         | 70.85      |
| SVM-Linear (AE)        | 87.15          | 81.90         | 62.75      |
| SVM-Kernel (AE)        | 87.63          | 81.43         | 62.80      |
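The ACC values in the table are plain classification accuracy in percent, the same quantity the script below computes in its final lines. As a minimal sketch of the metric:

```python
import numpy as np

def acc_percent(predicted, gold):
    """Classification accuracy in percent, matching the MATLAB line
    length(find(predict_label == testGnd)) / length(testGnd) * 100."""
    predicted = np.asarray(predicted)
    gold = np.asarray(gold)
    return float(np.mean(predicted == gold) * 100.0)

print(acc_percent([1, 2, 1, 3], [1, 2, 2, 3]))  # 75.0
```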
The text classifiers are a linear SVM and a Gaussian-kernel SVM; the features are term frequency (TF), TF-IDF, and averaged word embeddings (AE). The baseline implementation is as follows; the full data and code are available at [github]:
```matlab
clear; clc;
dataset = 'SearchSnippets'; % SearchSnippets, StackOverflow, Biomedical
method = 'SVM'; % SVM (Support Vector Machine)
isKernel = 0; % 1: SVM with Gaussian kernel; 0: linear SVM
Weighting = 'TF'; % TF, TFIDF or AE (Average Embedding)
dataStr = ['./../../dataset/', dataset, '-lite.mat'];
load(dataStr);
% Fix the random seeds for reproducibility (legacy syntax)
rand('state', 0)
randn('state', 0)
%%
disp('Step 1: get the train and test data ...')
if (strcmp(Weighting, 'TF'))
    testFea = fea(testIdx, :);
    trainFea = fea(trainIdx, :);
    testGnd = gnd(testIdx, :);
    trainGnd = gnd(trainIdx, :);
elseif (strcmp(Weighting, 'TFIDF'))
    testFea = fea(testIdx, :);
    trainFea = fea(trainIdx, :);
    testGnd = gnd(testIdx, :);
    trainGnd = gnd(trainIdx, :);
    [trainFea, testFea] = tf_idf(trainFea, testFea);
elseif (strcmp(Weighting, 'AE'))
    dataStr = ['./../../dataset/', dataset, '-STC2.mat'];
    load(dataStr);
    parameters.wordDim = 48;
    parameters.vocSize = size_vocab;
    disp('AE: generate average embedding vectors ...')
    % Step a. Initialize the word-vector matrix with small random values,
    % then overwrite the columns covered by pre-trained Word2vec vectors
    CR_E = randi([-25, 25], parameters.wordDim, parameters.vocSize) / 100;
    disp(strcat('Number of weights E:', num2str(size(CR_E))));
    vocab_emb_length = length(vocab_emb_Word2vec_48(1, :));
    if vocab_emb_length > size_vocab
        error(['Error, the size of vocab_emb is:', num2str(vocab_emb_length)])
    end
    CR_E(1:parameters.wordDim, vocab_emb_Word2vec_48_index) = ...
        vocab_emb_Word2vec_48(1:parameters.wordDim, 1:vocab_emb_length);
    % Step b. Average embedding: weight each word vector by its term
    % frequency and sum over the words in the text
    textSize = length(fea_All(:, 1));
    fea = zeros(textSize, parameters.wordDim); % preallocate
    for i = 1:textSize
        activeIdx = find(fea_All(i, :) > 0);
        tmp_fea_vector_weight = repmat(fea_All(i, activeIdx), parameters.wordDim, 1);
        tmp_fea_vector_matrix = CR_E(:, activeIdx) .* tmp_fea_vector_weight;
        tmp_fea_vector = sum(tmp_fea_vector_matrix, 2);
        fea(i, :) = tmp_fea_vector';
        if mod(i, 2000) == 0
            disp(['has averaged embedding number:', num2str(i)]);
        end
    end
    testFea = fea(testIdx, :);
    trainFea = fea(trainIdx, :);
    testGnd = gnd(testIdx, :);
    trainGnd = gnd(trainIdx, :);
end
testFea = normalize(testFea);
trainFea = normalize(trainFea);
%%
disp('Step 2: train model ...')
if (strcmp(method, 'SVM'))
    trainFea = sparse(trainFea);
    testFea = sparse(testFea);
    if ~isKernel
        disp('start training the linear SVM model (LIBLINEAR) ...')
        model = train(trainGnd, trainFea, '-q');
        disp('Step 3: predict test data via the linear SVM ...')
        [predict_label, accuracy, predict_scores] = predict(testGnd, testFea, model);
    else
        disp('start training the kernel SVM model (LIBSVM) ...')
        model = svmtrain(trainGnd, trainFea, '-t 2'); % -t 2: Gaussian (RBF) kernel
        disp('Step 3: predict test data via the kernel SVM ...')
        [predict_label, accuracy, predict_scores] = svmpredict(testGnd, testFea, model);
    end
end
AC = length(find(predict_label == testGnd)) / length(testGnd) * 100;
disp(['Accuracy is ', num2str(AC)])
```
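The two feature transforms the MATLAB script delegates to helper code, the `tf_idf` re-weighting (fit on the training split only) and the count-weighted embedding sum behind the AE features, can be sketched in NumPy. This is an illustrative analog under assumed matrix shapes, not the repo's actual helpers:

```python
import numpy as np

def tf_idf(train_tf, test_tf):
    """IDF is estimated on the training split only, then applied to both
    splits, mirroring [trainFea, testFea] = tf_idf(trainFea, testFea)."""
    n_train = train_tf.shape[0]
    df = np.count_nonzero(train_tf, axis=0)    # document frequency per term
    idf = np.log(n_train / np.maximum(df, 1))  # guard terms unseen in train
    return train_tf * idf, test_tf * idf

def average_embedding(tf_counts, emb):
    """AE features: each text becomes the term-frequency-weighted sum of
    its word vectors; `emb` is (word_dim, vocab_size), like CR_E above.
    (Despite the name, the MATLAB loop sums without dividing by length.)"""
    return tf_counts @ emb.T                   # (n_docs, word_dim)

# Toy example: 2 train docs, 1 test doc, vocabulary of 3 words
train_tf = np.array([[1., 0., 2.], [0., 1., 1.]])
test_tf = np.array([[2., 0., 0.]])
train_w, test_w = tf_idf(train_tf, test_tf)

emb = np.array([[1., 0., 2.],                  # word_dim = 2, vocab_size = 3
                [0., 1., 1.]])
ae = average_embedding(train_tf, emb)
print(train_w[0, 0], ae[0])                    # log(2), [5. 2.]
```

Either feature matrix would then be L2-normalized and fed to a linear SVM (e.g. LIBLINEAR, or scikit-learn's `LinearSVC`), as in Step 2 of the script above.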
[1]. J Xu, B Xu, P Wang, S Zheng, G Tian, J Zhao, B Xu. Self-Taught Convolutional Neural Networks for Short Text Clustering. Neural Networks, 2017.
[2]. X.-H. Phan, L.-M. Nguyen, S. Horiguchi, Learning to classify short and sparse text & web with hidden topics from large-scale data collections, WWW, 2008.
[3]. https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/download/train.zip
[4]. http://participants-area.bioasq.org/
This article originally appeared at: http://jacoxu.com/short_text_4_cluster_and_classification