Online latent Dirichlet allocation with infinite vocabulary

K Zhai, J Boyd-Graber - International Conference on Machine Learning, 2013 - proceedings.mlr.press
Abstract
Topic models based on latent Dirichlet allocation (LDA) assume a predefined vocabulary a priori. This is reasonable in batch settings, but not when data are revealed over time, as is the case with streaming/online algorithms. To address this lacuna, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings, rather than from a finite Dirichlet. We develop inference using online variational inference and, because we can consider only a finite number of words for each truncated topic, propose heuristics to dynamically organize, expand, and contract the set of words in our vocabulary truncation. We show that our model can successfully incorporate new words as it encounters new terms, and that it performs better than online LDA in evaluations of topic quality and classification performance.
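To make the two central ideas of the abstract concrete, here is a minimal sketch, not the authors' actual inference algorithm: it assumes a uniform character model with a geometric length prior as the base distribution over all strings, and a simple count-based expand/contract heuristic for the vocabulary truncation. All class names, parameters, and the ranking rule are hypothetical illustrations, not the paper's method.

```python
import math
from collections import Counter

class StringBaseDistribution:
    """Base distribution over all strings: a geometric length model with
    uniform per-character probabilities, so every string gets positive mass."""

    def __init__(self, alphabet_size=26, stop_prob=0.1):
        self.char_logp = -math.log(alphabet_size)  # uniform over characters
        self.log_stop = math.log(stop_prob)        # P(string ends here)
        self.log_cont = math.log(1.0 - stop_prob)  # P(string continues)

    def log_prob(self, word):
        # log P(w) = |w| * (log(1-s) + log P(char)) + log s
        return len(word) * (self.log_cont + self.char_logp) + self.log_stop


class TruncatedTopic:
    """One topic's truncated vocabulary: keep at most `trunc` words ranked by
    accumulated pseudo-counts; words outside the truncation fall back to the
    base distribution, so new terms from the stream are never impossible."""

    def __init__(self, base, trunc=1000):
        self.base = base
        self.trunc = trunc
        self.counts = Counter()

    def update(self, word, weight):
        # Expand: any new word seen in the stream can enter the truncation.
        self.counts[word] += weight
        # Contract: evict the lowest-weight word once over the size limit.
        if len(self.counts) > self.trunc:
            victim, _ = min(self.counts.items(), key=lambda kv: kv[1])
            del self.counts[victim]

    def log_weight(self, word):
        # Unnormalized log-score: pseudo-count if truncated-in, base otherwise.
        if word in self.counts:
            return math.log(self.counts[word])
        return self.base.log_prob(word)


base = StringBaseDistribution()
topic = TruncatedTopic(base, trunc=3)
for w, c in [("neural", 5.0), ("topic", 3.0), ("model", 2.0), ("gibbs", 4.0)]:
    topic.update(w, c)
print(sorted(topic.counts))            # 'model' was contracted out
print(topic.log_weight("unseenword"))  # still finite, via the base distribution
```

The key property the sketch illustrates is that, unlike a finite Dirichlet over a fixed vocabulary, the base distribution assigns positive probability to any string, so a word first encountered mid-stream can be scored immediately and promoted into the truncation as evidence for it accumulates.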