Speech Communication
Volume 50, Issue 10, October 2008, Pages 829-846

Evaluating user simulations with the Cramér–von Mises divergence

https://doi.org/10.1016/j.specom.2008.05.007

Abstract

User simulations are increasingly employed in the development and evaluation of spoken dialog systems. However, there is no accepted method for evaluating user simulations, which is problematic because the performance of new dialog management techniques is often evaluated on user simulations alone, not on real people. In this paper, we propose a novel method of evaluating user simulations. We view a user simulation as a predictor of the performance of a dialog system, where per-dialog performance is measured with a domain-specific scoring function. The divergence between the distribution of dialog scores in the real and simulated corpora provides a measure of the quality of the user simulation, and we argue that the Cramér–von Mises divergence is well-suited to this task. To demonstrate this technique, we study a corpus of callers with real information needs and show that the Cramér–von Mises divergence conforms to expectations. Finally, we present simple tools that enable practitioners to interpret the statistical significance of comparisons between user simulations.

Introduction

Traditionally, spoken dialog systems have been hand-built, which is problematic because a human designer needs to consider innumerable dialog situations, many of which can be difficult to foresee. To address this, researchers have begun incorporating machine learning techniques into spoken dialog systems. The idea is for a (human) designer to provide the high-level objectives, and for the machine learning algorithm to determine what to do in each dialog situation.

Machine learning algorithms for dialogs usually operate by exploring different dialog strategies and making incremental improvements. This process, called training, often requires thousands or millions of dialogs to complete, which is clearly infeasible with real users. As a result, machine learning algorithms are usually trained with a user simulation, which is a computer program or model that is intended to be a realistic substitute for a population of real users.

Ultimately, the success of a machine learning approach depends on the quality of the user simulation used to train it. Yet, despite this, there is no accepted method to evaluate user simulations. This is especially problematic because machine learning-based dialog systems are often trained and evaluated on user simulations alone, not on real users. Without some quantification of user simulation reliability, it is hard to judge claims about machine learning approaches not evaluated on real users.

In this paper, we suggest a quality measure for user simulations. Our quality measure is designed to fill a role similar to the one that word error rate (WER) fills for speech recognition accuracy. WER serves a valuable role by enabling speech recognizers to be rank-ordered, by quantifying improvements in a recognition algorithm, and by providing a measurement of the gap between observed and perfect performance. In the same way, the evaluation metric presented here enables user simulations to be rank-ordered, allows an improvement in a user simulation to be quantified, and provides a measurement of the gap between the observed and perfect user simulation.

Our evaluation method operates as follows. First, since different factors are important in different domains, our method relies on a domain-specific scoring function, which assigns a real-valued score to each dialog. Scores from real and simulated dialogs are aggregated to estimate two distributions, and the user simulation is evaluated by determining the similarity of these distributions using a normalized Cramér–von Mises divergence (Anderson, 1962).
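To make this concrete, the following sketch computes such a divergence in Python. The function names, and the rescaling to [0, 1] via the worst-case mismatch, are illustrative assumptions; the paper defines its own normalization constant.

    import numpy as np

    def empirical_cdf(sample, points):
        # Fraction of `sample` less than or equal to each value in `points`.
        sample = np.sort(np.asarray(sample, dtype=float))
        return np.searchsorted(sample, points, side="right") / len(sample)

    def cvm_divergence(real_scores, sim_scores):
        # Divergence between the empirical CDF of real dialog scores (the
        # "true" distribution) and that of simulated scores (the "modeled"
        # distribution), evaluated at the real score points and rescaled to
        # [0, 1]. This rescaling is an assumed normalization, not
        # necessarily the constant used in the paper.
        x = np.sort(np.asarray(real_scores, dtype=float))
        f_true = empirical_cdf(x, x)
        f_model = empirical_cdf(sim_scores, x)
        raw = np.sqrt(np.sum((f_true - f_model) ** 2))
        worst = np.sqrt(np.sum(f_true ** 2))  # modeled CDF identically 0
        return raw / worst

Under this sketch, a divergence of 0 means the simulated scores match the real score distribution at every real score point, and 1 means every simulated score falls above every real score, the worst case under this normalization.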

The normalized Cramér–von Mises divergence has a host of desirable properties for this task. First, it is designed to handle small samples from one or both distributions, which is significant because there may be only 50 or 100 real (human–machine) dialogs available in a given domain. In addition, the Cramér–von Mises divergence makes no assumption about the parametric form of the distributions – such as assuming a normal or uniform distribution – which is important because the parametric form of the score distributions will not be known. Moreover, the Cramér–von Mises divergence accounts for the notion of samples from a “true” distribution and a “modeled” distribution in a principled way. Finally, the normalization enables practitioners to report user simulation performance on an intuitive, standardized scale.
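For reference, the classical two-sample Cramér–von Mises criterion of Anderson (1962) compares the empirical distribution functions F_N and G_M of the two samples over the pooled observations z_1, ..., z_{N+M}:

    \omega^2 = \frac{NM}{(N+M)^2} \sum_{k=1}^{N+M} \bigl( F_N(z_k) - G_M(z_k) \bigr)^2

The normalized divergence used in this paper rescales a statistic of this family onto a standardized [0, 1] scale; the precise normalization is given in the full text.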

This paper is organized as follows. First, Section 2 reviews background and related work. Next, Section 3 states our assumptions, presents the evaluation procedure, and discusses its strengths and limitations. Then, Section 4 provides an illustration using real dialog data and confirms that the evaluation procedure agrees with common-sense intuition. Finally, recognizing that there may be a small number of real dialogs available, Section 5 tackles the important problem of data sparsity, developing a concise guide for practitioners to easily interpret the reliability of an evaluation. Section 6 then concludes.


Background and motivation

A spoken dialog system helps a user to accomplish some goal through spoken language, such as booking an airline reservation, restoring service to an internet connection, or selecting music in an automobile. Fig. 1 shows the logical components of a spoken dialog system. A dialog manager decides what to say to a user and passes a text string to a text-to-speech engine, which renders this text string as audio for the user to hear. The user speaks in response, and this audio is processed by a speech recognition engine, which converts it into text that is returned to the dialog manager.

Method

We start by addressing the overall objective of the user simulation. Although past work has argued that the aim of a user simulation is to engage in “realistic” dialogs (Schatzmann et al., 2005), basing an evaluation measure on realism seems problematic. Indeed, Schatzmann et al. (2005) report that “it is of course not possible to specify what levels of [evaluation metrics] need to be reached in order to claim that a user simulation is realistic.” Realism is a reasonable aim, but in practice it is difficult to quantify; we instead view a user simulation as a predictor of dialog system performance, with per-dialog performance measured by a domain-specific scoring function.

Example application to a real dialog system

In this section, we strive to show that the normalized Cramér–von Mises evaluation procedure agrees with common-sense intuition by studying a corpus of dialogs with a real dialog system. A series of user simulations is created, and it is shown that increasingly realistic user simulations yield decreasing Cramér–von Mises divergences. In other words, the Cramér–von Mises divergence correlates well with the qualitative difference between the real environment and the user simulations.
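A toy illustration of this design, reusing the cvm_divergence sketch above (the score distributions and sample sizes below are invented for illustration, not the paper's data):

    rng = np.random.default_rng(0)
    real = rng.normal(10.0, 3.0, size=100)  # stand-in for 100 real dialog scores
    simulations = {
        "well-matched":   rng.normal(10.0, 3.0, size=1000),
        "biased":         rng.normal(12.0, 3.0, size=1000),
        "over-dispersed": rng.normal(10.0, 8.0, size=1000),
    }
    for name, sim in simulations.items():
        print(f"{name}: {cvm_divergence(real, sim):.3f}")
    # The better-matched simulation should report the smaller divergence.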

Statistical significance of the Cramér–von Mises divergence

In the previous section, several user simulations were created and rank-ordered using the Cramér–von Mises divergence. While the rank-ordering agreed with common-sense expectations, it is important to confirm that the differences measured were statistically significant. More generally, given that the number of real dialogs is often limited, we seek to provide guidance to system developers and practitioners on the reliability of a rank ordering of user simulations calculated with the Cramér–von Mises divergence.
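The full text develops the paper's own significance tools; as a generic point of comparison only, a simple bootstrap over the real corpus (a sketch under assumed names, reusing cvm_divergence above, and not the paper's procedure) can indicate whether one simulation's advantage survives sampling noise:

    def bootstrap_preference(real_scores, sim_a, sim_b, n_boot=1000, seed=0):
        # Fraction of bootstrap resamples of the real corpus in which
        # simulation A attains a lower divergence than simulation B.
        # Not the paper's significance procedure; an illustrative stand-in.
        rng = np.random.default_rng(seed)
        real_scores = np.asarray(real_scores, dtype=float)
        wins = 0
        for _ in range(n_boot):
            resample = rng.choice(real_scores, size=len(real_scores), replace=True)
            if cvm_divergence(resample, sim_a) < cvm_divergence(resample, sim_b):
                wins += 1
        return wins / n_boot  # values near 1.0 suggest a reliable ordering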

Conclusions

In this paper, we have tackled the problem of evaluating and rank-ordering user simulations. This work has sought to provide system designers and practitioners with a simple, principled method of evaluating and rank-ordering user simulations, based on the normalized Cramér–von Mises divergence.

We view a user simulation as a predictive tool: a dialog system interacting with a population of users will produce a distribution over dialog scores, and the aim of a user simulation is to predict this distribution as accurately as possible.

Acknowledgements

Thanks to Bob Bell for many insightful conversations in developing this method. Thanks also to Vincent Goffin for facilitating access to the voice dialer code and logs, and to Srinivas Bangalore for helpful comments about the presentation in this paper. Finally, thanks to the three anonymous reviewers for insightful comments and guidance.

References

  • Williams, J.D. et al., 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language.
  • Ai, H., Litman, D.J., 2007. Knowledge consistent user simulations for dialog systems. In: Proc. Eurospeech, Antwerp,...
  • Anderson, T., 1962. On the distribution of the two-sample Cramér–von Mises criterion. Ann. Math. Statist.
  • Bui, T., Poel, M., Nijholt, A., Zwiers, J., 2007. A tractable DDN-POMDP approach to affective dialogue modeling for...
  • Cramér, H., 1928. On the composition of elementary errors. Second paper: Statistical applications. Skandinavisk Aktuarietidskrift.
  • Cuayáhuitl, H., Renals, S., Lemon, O., Shimodaira, H., 2005. Human–computer dialogue simulation using hidden Markov...
  • Denecke, M., Dohsaka, K., Nakano, M., 2004. Learning dialogue policies using state aggregation in reinforcement...
  • Eadie, W., Drijard, D., James, F., Roos, M., Sadoulet, B., 1971. Statistical Methods in Experimental Physics. North...
  • Filisko, E., Seneff, S., 2005. Developing city name acquisition strategies in spoken dialogue systems via user...
  • Frampton, M., Lemon, O., 2006. Learning more effective dialogue strategies using limited dialogue move features. In:...
  • Georgila, K., Henderson, J., Lemon, O., 2005. Learning user simulations for information state update dialogue systems....
  • Georgila, K., Henderson, J., Lemon, O., 2006. User simulation for spoken dialogue systems: Learning and evaluation. In:...
  • Goddeau, D., Pineau, J., 2000. Fast reinforcement learning of dialog strategies. In: Proc. Internat. Conf. on...
  • Heeman, P., 2007. Combining reinforcement learning with information-state update rules. In: Proc. Human Language...
  • Henderson, J., Lemon, O., Georgila, K., 2005. Hybrid reinforcement/supervised learning for dialogue policies from...
  • Kolmogorov, A., 1933. Sulla determinazione empirica di una legge di distribuzione. Giorn. Ist. Ital. Attuari.
  • Kullback, S. et al., 1951. On information and sufficiency. Ann. Math. Statist.
  • Lemon, O., Georgila, K., Henderson, J., 2006. Evaluating effectiveness and portability of reinforcement learned...
  • Levin, E., Pieraccini, R., 1997. A stochastic model of computer–human interaction for learning dialog strategies. In:...
  • Levin, E., Pieraccini, R., 2006. Value-based optimal decision for dialog systems. In: Proc. Workshop on Spoken Language...
  • Levin, E. et al., 2000. A stochastic model of human–machine interaction for learning dialogue strategies. IEEE Trans. Speech and Audio Processing.