Four tutorial articles on text analysis worth reading carefully

These four articles come up frequently; the original post is at: http://blog.sciencenet.cn/blog-611051-535693.html
These are certainly not the only detailed, in-depth introductions to text analysis, just the ones I have read so far; recommendations of other good tutorial articles are welcome.

The first article gives a detailed introduction to parameter estimation for discrete data, rather than using the Gaussian distribution as the running example the way most textbooks do. What I find most worth reading is its use of Gibbs sampling for inference in LDA: the derivation of the relevant formulas is very thorough, making it essential reading for anyone who wants to understand LDA and related topic models.
@TECHREPORT{Hei09,
author = {Heinrich, Gregor},
title = {Parameter Estimation for Text Analysis},
institution = {vsonix GmbH and University of Leipzig},
year = {2009},
type = {Technical Report Version 2.9},
abstract = {Presents parameter estimation methods common with discrete probability
distributions, which is of particular interest in text modeling.
Starting with maximum likelihood, a posteriori and Bayesian estimation,
central concepts like conjugate distributions and Bayesian networks
are reviewed. As an application, the model of latent Dirichlet allocation
(LDA) is explained in detail with a full derivation of an approximate
inference algorithm based on Gibbs sampling, including a discussion
of Dirichlet hyperparameter estimation.},
}
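The collapsed Gibbs sampler that Heinrich derives can be sketched in a few lines. The following is a minimal illustration, not a reproduction of the report's notation: documents are assumed to be lists of integer word ids, the hyperparameters alpha and beta are symmetric scalars, and the function name and interface are my own. Each token's topic is resampled from its full conditional, with the document-topic and topic-word distributions integrated out.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Sketch of collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, V).
    K: number of topics; V: vocabulary size.
    Returns point estimates (theta, phi) of the document-topic and
    topic-word distributions, computed from the final counts.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # topic counts per document
    nkw = np.zeros((K, V))           # word counts per topic
    nk = np.zeros(K)                 # total tokens per topic
    z = []                           # topic assignment per token

    # random initialization of topic assignments
    for d, doc in enumerate(docs):
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current token from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z_i = k | z_-i, w), up to normalization
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

The key point the tutorial works out in detail is the form of the full conditional: because the Dirichlet priors are conjugate to the multinomials, theta and phi can be integrated out analytically, leaving only the count statistics above.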

The second article, just as its abstract states, is particularly well suited to computer scientists. It involves relatively little mathematics and is aimed at readers who do not care much about mathematical detail. "Uninitiated" roughly means a novice or outsider, which makes Resnik and Hardisty's purpose in writing it easy to see.
@TECHREPORT{RH10,
author = {Resnik, Philip and Hardisty, Eric},
title = {Gibbs Sampling for the Uninitiated},
institution = {University of Maryland},
year = {2010},
type = {Technical Report CS-TR-4956, UMIACS-TR-2010-04, LAMP-153},
abstract = {This document is intended for computer scientists who would like to
try out a Markov Chain Monte Carlo (MCMC) technique, particularly
in order to do inference with Bayesian models on problems related
to text processing. We try to keep theory to the absolute minimum
needed, though we work through the details much more explicitly than
you usually see even in ``introductory'' explanations. That means we've
attempted to be ridiculously explicit in our exposition and notation.

After providing the reasons and reasoning behind Gibbs sampling (and
at least nodding our heads in the direction of theory), we work through
an example application in detail—the derivation of a Gibbs sampler
for a Na\"{i}ve Bayes model. Along with the example, we discuss some
practical implementation issues, including the integrating out of
continuous parameters when possible. We conclude with some pointers
to literature that we've found to be somewhat more friendly to uninitiated
readers.

Note: as of June 3, 2010 we have corrected some small errors in the
original April 2010 report.},
keywords = {Gibbs Sampling; Markov Chain Monte Carlo; Na\"{i}ve Bayes; Bayesian
Inference; Tutorial},
url = {http://drum.lib.umd.edu/bitstream/1903/10058/3/gsfu.pdf}
}

The third article: Knight is a big name familiar to anyone working in NLP, so no further introduction is needed.
@ELECTRONIC{Kni09,
author = {Knight, Kevin},
title = {Bayesian Inference with Tears: A Tutorial Workbook for Natural Language
Researchers},
url = {http://www.isi.edu/natural-language/people/bayes-with-tears.pdf},
}

The fourth article, co-authored by Blei (the father of LDA) and his student Gershman, gives a detailed introduction to Bayesian nonparametric models, with a particularly intuitive discussion of the Chinese Restaurant Process (CRP) and the Indian Buffet Process.
@ARTICLE{GB11,
author = {Gershman, Samuel J. and Blei, David M.},
title = {A Tutorial on Bayesian Nonparametric Models},
journal = {Journal of Mathematical Psychology},
year = {2011},
abstract = {A key problem in statistical modeling is model selection, that is,
how to choose a model at an appropriate level of complexity. This
problem appears in many settings, most prominently in choosing the
number of clusters in mixture models or the number of factors in
factor analysis. In this tutorial, we describe Bayesian nonparametric
methods, a class of methods that side-steps this issue by allowing
the data to determine the complexity of the model. This tutorial
is a high-level introduction to Bayesian nonparametric methods and
contains several examples of their application.},
keywords = {Bayesian Methods; Chinese Restaurant Process; Indian Buffet Process},
}
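The Chinese Restaurant Process that Gershman and Blei discuss has a very short generative description: customer i sits at an existing table with probability proportional to the number of people already there, or starts a new table with probability proportional to the concentration parameter alpha. A minimal sketch (function name and interface are my own):

```python
import numpy as np

def crp_sample(n, alpha, seed=0):
    """Draw table assignments for n customers from a CRP
    with concentration parameter alpha.

    Returns (seating, tables): seating[i] is customer i's table index,
    tables[k] is the number of customers at table k.
    """
    rng = np.random.default_rng(seed)
    tables = []    # tables[k] = current size of table k
    seating = []
    for _ in range(n):
        # existing tables weighted by size, plus one slot for a new table
        weights = np.array(tables + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(tables):
            tables.append(1)   # customer opens a new table
        else:
            tables[k] += 1
        seating.append(k)
    return seating, tables
```

The "rich get richer" dynamic is visible directly in the weights: large tables attract more customers, while the number of occupied tables grows slowly (logarithmically in expectation) with n, which is exactly how the CRP lets the data determine model complexity.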
