Evaluating user simulations with the Cramér–von Mises divergence
Introduction
Traditionally, spoken dialog systems have been hand-built, which is problematic because a human designer needs to consider innumerable dialog situations, many of which can be difficult to foresee. To address this, researchers have begun incorporating machine learning techniques into spoken dialog systems. The idea is for a (human) designer to provide the high-level objectives, and for the machine learning algorithm to determine what to do in each dialog situation.
Machine learning algorithms for dialogs usually operate by exploring different dialog strategies and making incremental improvements. This process, called training, often requires thousands or millions of dialogs to complete, which is clearly infeasible with real users. As a result, machine learning algorithms are usually trained with a user simulation, which is a computer program or model that is intended to be a realistic substitute for a population of real users.
Ultimately, the success of a machine learning approach depends on the quality of the user simulation used to train it. Yet, despite this, there is no accepted method to evaluate user simulations. This is especially problematic because machine learning-based dialog systems are often trained and evaluated on user simulations alone, not on real users. Without some quantification of user simulation reliability, it is hard to judge claims about machine learning approaches not evaluated on real users.
In this paper, we suggest a quality measure for user simulations. Our quality measure is designed to fill a similar role as a metric like word error rate (WER) provides for speech recognition accuracy. WER serves a valuable role by enabling speech recognizers to be rank-ordered, by quantifying improvements in a recognition algorithm, and by providing a measurement of the gap between observed and perfect performance. In the same way, the evaluation metric presented here enables user simulations to be rank-ordered, allows an improvement in a user simulation to be quantified, and provides a measurement of the gap between the observed and perfect user simulation.
Our evaluation method operates as follows. First, since different factors are important in different domains, our method relies on a domain-specific scoring function, which assigns a real-valued score to each dialog. Scores from real and simulated dialogs are aggregated to estimate two distributions, and the user simulation is evaluated by determining the similarity of these distributions using a normalized Cramér–von Mises divergence (Anderson, 1962).
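To make this computation concrete, the comparison of two samples of dialog scores can be sketched as follows. The function name, the choice to evaluate both empirical CDFs at every observed score, and the √3 scaling (chosen so that completely disjoint score distributions approach a divergence of 1) are illustrative assumptions for this sketch; the paper's exact normalized form is the one defined in Section 3.

```python
import numpy as np

def normalized_cvm_divergence(real_scores, sim_scores):
    """Illustrative normalized Cramér-von Mises-style divergence between
    the empirical score distributions of real and simulated dialogs.

    The two empirical CDFs are compared at every observed score; the
    sqrt(3) factor rescales the mean squared CDF difference so that
    completely disjoint score samples yield a value close to 1.
    """
    x = np.sort(np.asarray(real_scores, dtype=float))
    y = np.sort(np.asarray(sim_scores, dtype=float))
    z = np.concatenate([x, y])                        # all evaluation points
    F = np.searchsorted(x, z, side="right") / len(x)  # empirical CDF, real scores
    G = np.searchsorted(y, z, side="right") / len(y)  # empirical CDF, simulated scores
    return float(np.sqrt(3.0 * np.mean((F - G) ** 2)))
```

With this scaling, identical score samples give a divergence of 0 and non-overlapping samples give a value near 1, matching the intuitive standardized scale described above.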
The normalized Cramér–von Mises divergence has a host of desirable properties for this task. First, it is designed to handle small samples from one or both distributions, which is significant because there may be only 50 or 100 real (human–machine) dialogs available in a given domain. In addition, the Cramér–von Mises divergence makes no assumption about the parametric form of the distributions – such as assuming a normal or uniform distribution – which is important because the parametric form of the score distributions will not be known. Moreover, the Cramér–von Mises divergence accounts for the notion of samples from a “true” distribution and a “modeled” distribution in a principled way. Finally, the normalization enables practitioners to report user simulation performance on an intuitive, standardized scale.
This paper is organized as follows. First, Section 2 reviews background and related work. Next, Section 3 states our assumptions, presents the evaluation procedure, and discusses its strengths and limitations. Then, Section 4 provides an illustration using real dialog data and confirms that the evaluation procedure agrees with common-sense intuition. Finally, recognizing that there may be a small number of real dialogs available, Section 5 tackles the important problem of data sparsity, developing a concise guide for practitioners to easily interpret the reliability of an evaluation. Section 6 then concludes.
Background and motivation
A spoken dialog system helps a user to accomplish some goal through spoken language, such as booking an airline reservation, restoring service to an internet connection, or selecting music in an automobile. Fig. 1 shows the logical components of a spoken dialog system. A dialog manager decides what to say to a user and passes a text string to a text-to-speech engine, which renders this text string as audio for the user to hear. The user speaks in response, and this audio is processed by a speech recognition engine.
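As a minimal sketch of this loop, the toy code below wires the components together; every class and method name here is a hypothetical stand-in rather than a real toolkit API. The point it illustrates is that a user simulation can replace the real user behind the same interface.

```python
# Toy sketch of one turn of the dialog loop (cf. Fig. 1).
# All interfaces are illustrative stand-ins, not a real toolkit API.

class EchoUserSim:
    """Toy user simulation: always answers with a fixed utterance."""
    def respond(self, prompt_audio):
        return "audio:two tickets please"

class CannedTTS:
    """Toy text-to-speech: wraps text in a fake audio marker."""
    def synthesize(self, text):
        return "audio:" + text

class PassThroughASR:
    """Toy speech recognizer: strips the fake audio marker."""
    def recognize(self, audio):
        return audio.removeprefix("audio:")

class SlotFillingDM:
    """Toy dialog manager: asks one question and records the answer."""
    def __init__(self):
        self.heard = None
    def next_prompt(self):
        return "How many tickets?"
    def observe(self, user_text):
        self.heard = user_text

def run_turn(dm, tts, asr, user):
    prompt_audio = tts.synthesize(dm.next_prompt())  # decide what to say, render audio
    user_audio = user.respond(prompt_audio)          # real user OR user simulation
    dm.observe(asr.recognize(user_audio))            # recognize speech, update state
```

Swapping `EchoUserSim` for a real user (or a richer simulation) leaves the rest of the pipeline unchanged, which is what makes simulation-based training possible.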
Method
We start by addressing the overall objective of the user simulation. Although past work has argued that the aim of a user simulation is to engage in “realistic” dialogs (Schatzmann et al., 2005), basing an evaluation measure on realism seems problematic. Indeed, Schatzmann et al. (2005) report that “it is of course not possible to specify what levels of [evaluation metrics] need to be reached in order to claim that a user simulation is realistic.” Realism is a reasonable aim, but in practice it does not provide a measurable criterion.
Example application to a real dialog system
In this section, we strive to show that the normalized Cramér–von Mises evaluation procedure agrees with common-sense intuition by studying a corpus of dialogs with a real dialog system. A series of user simulations is created, and it is shown that increasingly realistic user simulations yield decreasing Cramér–von Mises divergences. In other words, it is shown that the Cramér–von Mises divergence correlates well with the qualitative difference between the real environment and the user simulations.
Statistical significance of the Cramér–von Mises divergence
In the previous section, several user simulations were created and rank-ordered using the Cramér–von Mises divergence. While the rank-ordering agreed with common-sense expectations, it is important to confirm that the differences measured were statistically significant. More generally, given that the number of real dialogs is often limited, we seek to provide guidance to system developers and practitioners on the reliability of a rank-ordering of user simulations calculated with the Cramér–von Mises divergence.
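One generic way to probe such reliability with a limited pool of real dialogs is a bootstrap over the real scores: resample them with replacement, recompute each simulation's divergence, and measure how often the observed rank-ordering flips. The sketch below illustrates that general idea; it is not the specific significance analysis developed in this section, and the divergence helper uses an illustrative normalization.

```python
import numpy as np

def cvm(real, sim):
    """Illustrative normalized Cramér-von Mises-style divergence
    between two samples of dialog scores."""
    x, y = np.sort(np.asarray(real, dtype=float)), np.sort(np.asarray(sim, dtype=float))
    z = np.concatenate([x, y])
    F = np.searchsorted(x, z, side="right") / len(x)
    G = np.searchsorted(y, z, side="right") / len(y)
    return float(np.sqrt(3.0 * np.mean((F - G) ** 2)))

def rank_flip_rate(real, sim_a, sim_b, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples of the real scores in which the
    ordering cvm(real, sim_a) < cvm(real, sim_b) is reversed.

    A rate near 0 suggests the rank-ordering of the two simulations is
    stable given the available real dialogs; a rate near 0.5 suggests
    the data cannot distinguish them.
    """
    rng = np.random.default_rng(seed)
    real = np.asarray(real, dtype=float)
    flips = 0
    for _ in range(n_boot):
        resample = rng.choice(real, size=len(real), replace=True)
        if cvm(resample, sim_a) >= cvm(resample, sim_b):
            flips += 1
    return flips / n_boot
```

With, say, 100 real dialog scores, a flip rate well below 0.05 would indicate that the better-ranked simulation is reliably better despite the small sample.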
Conclusions
In this paper, we have tackled the problem of evaluating and rank-ordering user simulations. This work has sought to provide system designers and practitioners with a simple, principled method of evaluating and rank-ordering user simulations, based on the normalized Cramér–von Mises divergence.
We view a user simulation as a predictive tool: a dialog system interacting with a population of real users will produce a distribution of dialog scores, and the aim of a user simulation is to predict this distribution.
Acknowledgements
Thanks to Bob Bell for many insightful conversations in developing this method. Thanks also to Vincent Goffin for facilitating access to the voice dialer code and logs, and to Srinivas Bangalore for helpful comments about the presentation in this paper. Finally, thanks to the three anonymous reviewers for insightful comments and guidance.
References (43)
- Williams, J.D., Young, S., 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language.
- Ai, H., Litman, D.J., 2007. Knowledge consistent user simulations for dialog systems. In: Proc. Eurospeech, Antwerp,...
- Anderson, T.W., 1962. On the distribution of the two-sample Cramér–von Mises criterion. Ann. Math. Statist.
- Bui, T., Poel, M., Nijholt, A., Zwiers, J., 2007. A tractable DDN-POMDP approach to affective dialogue modeling for...
- Cramér, H., 1928. On the composition of elementary errors. Second paper: statistical applications. Skandinavisk Aktuarietidskrift.
- Cuayáhuitl, H., Renals, S., Lemon, O., Shimodaira, H., 2005. Human–computer dialogue simulation using hidden Markov...
- Denecke, M., Dohsaka, K., Nakano, M., 2004. Learning dialogue policies using state aggregation in reinforcement...
- Eadie, W., Drijard, D., James, F., Roos, M., Sadoulet, B., 1971. Statistical Methods in Experimental Physics. North...
- Filisko, E., Seneff, S., 2005. Developing city name acquisition strategies in spoken dialogue systems via user...
- Frampton, M., Lemon, O., 2006. Learning more effective dialogue strategies using limited dialogue move features. In:...