Some experiences with the Java version of LDA

LDA is a popular topic model, widely used in both research and applications. To date, the original paper describing LDA has been cited more than 7,150 times. Very luckily, some warm-hearted people have shared their code, so we can use it easily. Thanks for the spirit of sharing. Here are some of my experiences with the Java version of LDA:

First, download the LDA Java project (JGibbLDA). The author, Xuan-Hieu Phan, is very enthusiastic.

From my experience (possibly just my own problem), the command-line arguments do not take effect when entered through Eclipse. So I suggest packaging the project into a jar and running it from the command line, like this:

  1. java -jar jgibblda.jar -est -alpha 0.5 -beta 0.1 -ntopics 400 -dir ./ -dfile train.txt  
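If the arguments still do not take effect inside Eclipse, another option is to call the tool's main class directly with a hard-coded argument array. This is only a sketch and assumes the entry class of the jar is jgibblda.LDA, as in the distributed source; check the jar's manifest if yours differs.

    // Sketch: launch JGibbLDA estimation from inside Eclipse without relying
    // on the Run Configuration's "Program arguments" field.
    // Assumes the distributed entry class jgibblda.LDA is on the classpath.
    public class RunLdaEstimation {
        public static void main(String[] unused) throws Exception {
            String[] args = {
                "-est",
                "-alpha", "0.5",
                "-beta", "0.1",
                "-ntopics", "400",
                "-dir", "./",
                "-dfile", "train.txt"
            };
            jgibblda.LDA.main(args);  // same effect as: java -jar jgibblda.jar -est ...
        }
    }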

Since \(\hat \phi _j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + W\beta}\) and \(\hat \theta _j^{(d)} = \frac{n_j^{(d)} + \alpha}{n_\cdot^{(d)} + T\alpha}\),
the larger \(\alpha\) is set, the closer the topic distribution of each document gets to a uniform distribution (stronger smoothing), and vice versa. Likewise, the larger \(\beta\) is set, the smoother the word distribution of each topic, and vice versa. [Griffiths.04]

For short texts, it is recommended to set \(\alpha\) to a smaller value; see [WWW08] and [WWW13-BTM].
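To make the roles of \(\alpha\) and \(\beta\) concrete, here is a minimal sketch of how the two estimators above are computed from the Gibbs-sampling count tables. The array names (nw, nd, nwsum, ndsum) simply mirror the counts in the formulas and are illustrative; they are not taken verbatim from the JGibbLDA source.

    // Sketch: point estimates of phi (topic-word) and theta (document-topic)
    // from Gibbs-sampling counts, following the two formulas above.
    //   nw[w][j]  = number of times word w is assigned to topic j
    //   nwsum[j]  = total number of words assigned to topic j
    //   nd[d][j]  = number of words in document d assigned to topic j
    //   ndsum[d]  = total number of words in document d
    //   W = vocabulary size, T = number of topics. All names are illustrative.
    public class LdaEstimates {

        static double[][] estimatePhi(int[][] nw, int[] nwsum,
                                      int W, int T, double beta) {
            double[][] phi = new double[T][W];
            for (int j = 0; j < T; j++) {
                for (int w = 0; w < W; w++) {
                    phi[j][w] = (nw[w][j] + beta) / (nwsum[j] + W * beta);
                }
            }
            return phi;
        }

        static double[][] estimateTheta(int[][] nd, int[] ndsum,
                                        int D, int T, double alpha) {
            double[][] theta = new double[D][T];
            for (int d = 0; d < D; d++) {
                for (int j = 0; j < T; j++) {
                    theta[d][j] = (nd[d][j] + alpha) / (ndsum[d] + T * alpha);
                }
            }
            return theta;
        }
    }

A larger \(\alpha\) pushes every theta[d][j] toward 1/T, which is exactly the "more uniform" behaviour described above; the same holds for \(\beta\) and phi.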
Arguments that are not supplied, such as -alpha and -beta, fall back to their defaults. Note that -dir should not be omitted, even if you put the jar package and the training data in the same directory. Another noteworthy point is that the default -niters is 1000, not the 2000 described in the user's guide, and the default -niters for inference on new data is 100, not 20.
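If you want the 2000 iterations described in the user's guide, pass -niters explicitly, for example:

  1. java -jar jgibblda.jar -est -ntopics 400 -niters 2000 -dir ./ -dfile train.txt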

The input data format should look like this:
[TotalNum of the document]
[document1]
…..
[document..]
Pre-processing (e.g., removing stop words and rare words, stemming, etc.) should be done by the user.
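As a sketch, assuming the documents are already tokenized and cleaned, a training file in this format (first line holds the document count, then one whitespace-separated document per line) can be written like this; the class and file names are only examples:

    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    // Sketch: write pre-processed documents into the input format shown above:
    // first line = total number of documents, then one document per line.
    public class WriteLdaInput {
        static void write(List<String> docs, String path) throws Exception {
            try (PrintWriter out = new PrintWriter(path, StandardCharsets.UTF_8.name())) {
                out.println(docs.size());      // [TotalNum of the document]
                for (String doc : docs) {
                    out.println(doc.trim());   // tokens separated by spaces
                }
            }
        }

        public static void main(String[] args) throws Exception {
            write(List.of("topic model lda gibbs sampling",
                          "java jar command line arguments"),
                  "train.txt");
        }
    }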

Inference on new data:

  1. java -jar jgibblda.jar -inf -dir ./ -model model-final -dfile test.txt  
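After inference, the per-document topic distributions for the new data are written to a .theta file in the -dir directory, one document per line with one space-separated probability per topic. The exact file name below (test.txt.theta) is an assumption; check what the inference step actually produced. A minimal sketch for reading it:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: load a .theta file (one document per line, one space-separated
    // probability per topic). The file name is an assumption; check the
    // actual output written to -dir by the inference run.
    public class ReadTheta {
        static List<double[]> readTheta(String path) throws Exception {
            List<double[]> theta = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.trim().isEmpty()) continue;
                    String[] parts = line.trim().split("\\s+");
                    double[] row = new double[parts.length];
                    for (int j = 0; j < parts.length; j++) {
                        row[j] = Double.parseDouble(parts[j]);
                    }
                    theta.add(row);
                }
            }
            return theta;
        }

        public static void main(String[] args) throws Exception {
            List<double[]> theta = readTheta("test.txt.theta");
            System.out.println("documents: " + theta.size()
                    + ", topics: " + theta.get(0).length);
        }
    }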
