Processing the Stack Overflow Q&A short-text dataset

The dataset is published on Kaggle: https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/data

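The train dataset is 3.47 GB and contains information such as each post's title and body. The file 2012-07 Stack Overflow .7z (6.20 GB) contains much more, including all of the comments. After extraction the result is a 20 GB .csv file.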

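Because the file is in .7z format, extracting it on a Linux server requires installing 7z first.
===================== Below this line: installing 7z =======================
7z (more precisely, 7-Zip) provides ready-made binary packages for offline installation, and you can also compile it yourself. This section installs from the bin package that 7z provides.
The Linux host is usually an x86 machine, and 7z provides a precompiled bin package, so installation is straightforward. The steps are:
1) Download the p7zip package from http://sourceforge.net/projects/p7zip/files/ or http://sourceforge.net/projects/p7zip/files/p7zip/. The latest version at the time of writing is 9.20.1;
2) Open the folder for that version. The page offers two downloads, a bin package and a source package. Here we use the bin package; for 9.20.1 the file name is p7zip_9.20.1_x86_linux_bin.tar.bz2;
3) Run the following commands on Linux (unpack and install):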
tar xjvf p7zip_9.20.1_x86_linux_bin.tar.bz2
cd p7zip_9.20.1
sh install.sh
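Note the permissions for the commands above: they need root privileges (install.sh writes under /usr/local), so it is best to prefix the tar and sh commands with sudo.
That completes the installation.
If the following error is reported when you use it:
/usr/local/bin/7za: /usr/local/lib/p7zip/7za: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory
the usual cause is that the x86 (32-bit) package was installed on an x64 server, which has no 32-bit loader at /lib/ld-linux.so.2. In that case, typically either use a 64-bit build (for example, one compiled from the source package) or install the system's 32-bit compatibility libraries.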
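To confirm that diagnosis, one option is to read the ELF header of the installed binary. The following is a minimal Python sketch, not part of the original post; the helper name elf_class is hypothetical. It relies only on the ELF file layout: the first four bytes are the magic number, and the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit binaries.

```python
ELF_MAGIC = b"\x7fELF"


def elf_class(path):
    """Return 32 or 64 for an ELF binary, or None if the file is not ELF.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    # Byte 4 is EI_CLASS: 1 means 32-bit, 2 means 64-bit.
    return {1: 32, 2: 64}.get(header[4])
```

For example, elf_class("/usr/local/lib/p7zip/7za") returning 32 on a 64-bit server would be consistent with the error above.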
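Extracting a 7z file
7za x phpMyAdmin-3.3.8.1-all-languages.7z -r -o./
Parameters:
x extracts the files and keeps the original directory tree (the e command also extracts, but it places every file directly in the output directory instead of in its original folder)
phpMyAdmin-3.3.8.1-all-languages.7z is the archive; phpMyAdmin is used here only as a test. By default the archive is looked up in the current directory
-r recurses into all subfolders (x already restores full paths, so this switch is usually not needed when extracting)
-o sets the output directory. There is no space after -o; the directory follows it directly. Take care with this.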
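If the extraction is driven from a script, the "no space after -o" rule is easy to get wrong. The following Python sketch mirrors the command above; the function names are hypothetical and it assumes 7za is on the PATH. The switch and the directory are joined into a single argument.

```python
import subprocess


def build_extract_command(archive, out_dir, keep_paths=True):
    """Build the 7za argument list for extracting an archive.

    Hypothetical helper: 'x' keeps the directory tree, 'e' flattens it.
    """
    command = "x" if keep_paths else "e"
    # -o and the directory form ONE argument, with no space in between.
    return ["7za", command, archive, "-r", "-o" + out_dir]


def extract(archive, out_dir):
    """Run 7za and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_extract_command(archive, out_dir), check=True)
```

Calling extract("phpMyAdmin-3.3.8.1-all-languages.7z", "./") would run the same command as shown above.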
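Compressing files/folders
7za a -t7z -r Mytest.7z /opt/phpMyAdmin-3.3.8.1-all-languages/*
Parameters:
a adds files/folders to an archive
-t sets the archive type, here 7z. It can be omitted, because 7z is the default type for 7za.
-r recurses into all subfolders
Mytest.7z is the name of the resulting archive
/opt/phpMyAdmin-3.3.8.1-all-languages/* is what gets compressed.
Note: 7za is not limited to the .7z format. It also handles other types such as zip, tar, and bzip2 (a .tar.bz2 file is a tar archive compressed with bzip2). As described above, select the type with -t.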
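The same command can be assembled in a script. This is a small Python sketch under the same assumptions as before (hypothetical function name, 7za on the PATH). It only builds the argument list, mirroring the order of the command above, so the archive type can be swapped through one parameter.

```python
def build_compress_command(archive, sources, archive_type="7z"):
    """Build the 7za argument list for adding files to an archive.

    Hypothetical helper: 'sources' is a list of files or folders.
    """
    # -t and the type form one argument, e.g. -t7z or -tzip.
    return ["7za", "a", "-t" + archive_type, "-r", archive, *sources]
```

The resulting list can be passed to subprocess.run(..., check=True) to execute it.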
===================== Above this line: installing 7z =======================

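After extraction you get a folder containing 36 GB of assorted .xml files. The next step is to parse the contents of these .xml files. The fields we mainly want are:

question title, question content, the tags tag1, tag2, tag3, tag4, tag5, and all of the comments.
The details are in readme.txt: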
Stack Overflow - Data Dump: August 2012
- Format: 7zipped
- Files:
  - **badges**.xml
    - UserId, e.g.: "420"
    - Name, e.g.: "Teacher"
    - Date, e.g.: "2008-09-15T08:55:03.923"
  - **comments**.xml
    - Id
    - PostId
    - Score
    - Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
    - CreationDate, e.g.: "2008-09-06T08:07:10.730"
    - UserId
  - **posts**.xml
    - Id
    - PostTypeId
      - 1: Question
      - 2: Answer
    - ParentID (only present if PostTypeId is 2)
    - AcceptedAnswerId (only present if PostTypeId is 1)
    - CreationDate
    - Score
    - ViewCount
    - Body
    - OwnerUserId
    - LastEditorUserId
    - LastEditorDisplayName="Jeff Atwood"
    - LastEditDate="2009-03-05T22:28:34.823"
    - LastActivityDate="2009-03-11T12:51:01.480"
    - CommunityOwnedDate="2009-03-11T12:51:01.480"
    - ClosedDate="2009-03-11T12:51:01.480"
    - Title=
    - Tags=
    - AnswerCount
    - CommentCount
    - FavoriteCount
  - **posthistory**.xml
    - Id
    - PostHistoryTypeId
      - 1: Initial Title - The first title a question is asked with.
      - 2: Initial Body - The first raw body text a post is submitted with.
      - 3: Initial Tags - The first tags a question is asked with.
      - 4: Edit Title - A question's title has been changed.
      - 5: Edit Body - A post's body has been changed, the raw text is stored here as markdown.
      - 6: Edit Tags - A question's tags have been changed.
      - 7: Rollback Title - A question's title has reverted to a previous version.
      - 8: Rollback Body - A post's body has reverted to a previous version - the raw text is stored here.
      - 9: Rollback Tags - A question's tags have reverted to a previous version.
      - 10: Post Closed - A post was voted to be closed.
      - 11: Post Reopened - A post was voted to be reopened.
      - 12: Post Deleted - A post was voted to be removed.
      - 13: Post Undeleted - A post was voted to be restored.
      - 14: Post Locked - A post was locked by a moderator.
      - 15: Post Unlocked - A post was unlocked by a moderator.
      - 16: Community Owned - A post has become community owned.
      - 17: Post Migrated - A post was migrated.
      - 18: Question Merged - A question has had another, deleted question merged into itself.
      - 19: Question Protected - A question was protected by a moderator
      - 20: Question Unprotected - A question was unprotected by a moderator
      - 21: Post Disassociated - An admin removes the OwnerUserId from a post.
      - 22: Question Unmerged - A previously merged question has had its answers and votes restored.
    - PostId
    - RevisionGUID: At times more than one type of history record can be recorded by a single action. All of these will be grouped using the same RevisionGUID
    - CreationDate: "2009-03-05T22:28:34.823"
    - UserId
    - UserDisplayName: populated if a user has been removed and no longer referenced by user Id
    - Comment: This field will contain the comment made by the user who edited a post
    - Text: A raw version of the new value for a given revision
      - If PostHistoryTypeId = 10, 11, 12, 13, 14, or 15 this column will contain a JSON encoded string with all users who have voted for the PostHistoryTypeId
      - If PostHistoryTypeId = 17 this column will contain migration details of either "from " or "to
    - CloseReasonId
      - 1: Exact Duplicate - This question covers exactly the same ground as earlier questions on this topic; its answers may be merged with another identical question.
      - 2: off-topic
      - 3: subjective
      - 4: not a real question
      - 7: too localized
  - **users**.xml
    - Id
    - Reputation
    - CreationDate
    - DisplayName
    - EmailHash
    - LastAccessDate
    - WebsiteUrl
    - Location
    - Age
    - AboutMe
    - Views
    - UpVotes
    - DownVotes
  - **votes**.xml
    - Id
    - PostId
    - VoteTypeId
      - ` 1`: AcceptedByOriginator
      - ` 2`: UpMod
      - ` 3`: DownMod
      - ` 4`: Offensive
      - ` 5`: Favorite - if VoteTypeId = 5 UserId will be populated
      - ` 6`: Close
      - ` 7`: Reopen
      - ` 8`: BountyStart
      - ` 9`: BountyClose
      - `10`: Deletion
      - `11`: Undeletion
      - `12`: Spam
      - `13`: InformModerator
    - CreationDate
    - UserId (only for VoteTypeId 5)
    - BountyAmount (only for VoteTypeId 9)
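The fields listed in the readme can be pulled out with a streaming parser, so that the multi-gigabyte files never have to fit in memory. The following is a minimal Python sketch, not the author's method. It assumes (the readme does not say so) that each record is a <row> element whose fields are XML attributes, and that Tags is stored in the form <java><xml>. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def iter_rows(source):
    """Yield the attributes of each <row> element as a dict, one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # drop processed rows so memory use stays flat


def split_tags(raw):
    """Turn '<java><xml>' into ['java', 'xml'] (assumed tag encoding)."""
    return [t for t in raw.replace(">", "<").split("<") if t]


def iter_questions(posts_source):
    """Yield title, content and tags for every question (PostTypeId 1)."""
    for row in iter_rows(posts_source):
        if row.get("PostTypeId") == "1":
            yield {
                "id": row["Id"],
                "title": row.get("Title", ""),
                "content": row.get("Body", ""),
                "tags": split_tags(row.get("Tags", "")),
            }


def comments_by_post(comments_source):
    """Group comment texts by the PostId they belong to."""
    grouped = defaultdict(list)
    for row in iter_rows(comments_source):
        grouped[row["PostId"]].append(row.get("Text", ""))
    return grouped
```

Both functions accept a file name or an open binary file, for example iter_questions("posts.xml"). Note that comments_by_post keeps every comment in memory, so for the full dump it may be necessary to write the groups to disk instead.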

However, this data turns out to be quite raw. Fortunately, the Kaggle competition also published a cleaned-up dataset: train
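Reading the cleaned file is much simpler than parsing the XML. As a rough illustration, here is a Python sketch that streams the CSV row by row and keeps only the title, content, and tags. The column names Title, BodyMarkdown, and Tag1 to Tag5 are assumptions about the train file and should be checked against its header line; the function name is hypothetical.

```python
import csv

# Assumed column names; verify them against the header of the actual file.
TAG_COLUMNS = ["Tag1", "Tag2", "Tag3", "Tag4", "Tag5"]


def iter_train_rows(csv_file):
    """Yield title, content and non-empty tags for each row of an open CSV file."""
    for row in csv.DictReader(csv_file):
        yield {
            "title": row.get("Title", ""),
            "content": row.get("BodyMarkdown", ""),
            "tags": [row[c] for c in TAG_COLUMNS if row.get(c)],
        }
```

When reading from disk, open the file with open(path, newline="", encoding="utf-8") so that quoted fields containing line breaks are handled by the csv module.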
===================== Below this line: parsing the .csv in Java ===============

====================== Above this line: parsing the .csv in Java ==============
