Improving predictive inference under covariate shift by weighting the log-likelihood function
Introduction
Let x be the explanatory variable, or covariate, and y be the response variable. In predictive inference with regression analysis, we are interested in estimating the conditional density q(y|x) of y given x using a parametric model. Let p(y|x,θ) be the model of the conditional density, parameterized by θ∈Θ. Having observed i.i.d. samples of size n, denoted by (x(n),y(n)) = {(xt,yt): t=1,…,n}, we obtain a predictive density p(y|x,θ̂) by giving an estimate θ̂ of θ. In this paper, we discuss improvement of the maximum likelihood estimate (MLE) under both (i) covariate shift in distribution and (ii) misspecification of the model, as explained below.
Let q1(x) be the density of x used for evaluation of the predictive performance, while q0(x) is the density of x in the observed data. We consider the Kullback–Leibler loss function

lossi(θ) = ∫ qi(x) ∫ q(y|x) log [ q(y|x) / p(y|x,θ) ] dy dx,  i = 0, 1,

and then employ loss1(θ̂) for evaluation of θ̂, rather than the usual loss0(θ̂). The situation q0(x)≠q1(x) will be called covariate shift in distribution, which is one of the premises of this paper.
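To see numerically why the distinction between loss0 and loss1 matters, the sketch below estimates the expected negative log-likelihood of a deliberately misspecified model under q(y|x)q0(x) and under q(y|x)q1(x) by Monte Carlo; these agree with the two Kullback–Leibler losses up to the entropy of q(y|x), which does not depend on θ. All densities here (normal q0 and q1, a quadratic truth, a linear model) are illustrative stand-ins of ours, not the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_loglik_under(sample_x, n=100_000):
    """Monte Carlo estimate of E[-log p(y|x, theta)] when x is drawn from
    the given sampler and y|x follows an (illustrative) quadratic truth."""
    x = sample_x(n)
    y = x**2 + rng.normal(0, 0.3, n)      # "true" q(y|x): quadratic + noise
    mu = 0.5 * x                          # misspecified linear model mean
    # negative log-density of N(mu, 0.3^2), averaged over the sample
    return np.mean(0.5 * np.log(2 * np.pi * 0.3**2)
                   + (y - mu)**2 / (2 * 0.3**2))

loss0 = neg_loglik_under(lambda n: rng.normal(0.0, 0.5, n))  # x from a q0-like density
loss1 = neg_loglik_under(lambda n: rng.normal(1.0, 0.5, n))  # x from a q1-like density
```

Because the linear model tracks the quadratic truth better near x=0 than near x=1, the loss evaluated under the shifted covariate density is much larger, even though the conditional density q(y|x) is the same in both cases.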
This situation is not as odd as it might look at first; in fact, it is seen in various fields. In sample surveys, q0(x) is determined by the sampling scheme, while q1(x) is determined by the population. In regression analysis, covariate shift often happens because of limited resources or the design of experiments. In the artificial neural network literature, “active learning” is a typical situation in which we control q0(x) for better prediction. We could also say that the distribution of x in future observations differs from that of the past observations; x is not necessarily distributed as q1(x) in the future, but we can give an imaginary q1(x) to specify the region of x where the prediction accuracy should be controlled. Note that q0(x) and/or q1(x) are often estimated from data, but we assume they are known or reasonably estimated in advance.
The second premise of this paper is misspecification of the model. Let θ̂ be the MLE of θ, and θ* be the asymptotic limit of θ̂ as n→∞. Under certain regularity conditions, the MLE is consistent, and p(y|x,θ*) = q(y|x) provided that the model is correctly specified. In practice, however, p(y|x,θ*) deviates more or less from q(y|x).
Under both the covariate shift and the misspecification, the MLE does not necessarily provide a good inference. We will show that the MLE is improved by giving a weight function w(x) of the covariate in the log-likelihood function

Lw(θ) = Σt=1,…,n w(xt) log p(yt|xt,θ),  (1.1)

where w(x) is a non-negative function of x. Then the maximum weighted log-likelihood estimate (MWLE), denoted by θ̂w, is obtained by maximizing (1.1) over Θ. It will be seen that the weight function w(x)=q1(x)/q0(x) is the optimal choice for sufficiently large n in terms of the expected loss with respect to q1(x). We denote the MWLE with this weight function by θ̂q1/q0. A comparison between the MLE and θ̂q1/q0 is made in the numerical example of polynomial regression in Section 2, and the asymptotic optimality of θ̂q1/q0 is shown in Section 3. Note that the MWLE turns out to downweight the observed samples that are not important for fitting the model with respect to the population. An interpretation of the MWLE as one of the robust estimation techniques is given in Section 9.
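For a concrete sense of (1.1), here is a minimal sketch of the MWLE for the simple normal linear model y|x ∼ N(β0 + β1x, σ²), where maximizing the weighted log-likelihood reduces to weighted least squares for the β's followed by a weighted residual variance for σ². The function name, the toy data, and the weight shown are illustrative choices of ours, not the paper's (in particular the weight is not q1/q0).

```python
import numpy as np

def mwle_normal_linear(x, y, w):
    """MWLE for y|x ~ N(b0 + b1*x, sigma^2): maximizing
    sum_t w_t * log p(y_t|x_t, theta) over (b0, b1, sigma) gives
    WLS for the betas and a w-weighted residual variance for sigma^2."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    resid = y - X @ beta
    sigma2 = np.sum(w * resid**2) / np.sum(w)
    return beta, sigma2

rng = np.random.default_rng(0)
x = rng.normal(0.5, 0.5, size=200)         # covariates from a q0-like density
y = 1.0 - x + rng.normal(0, 0.3, size=200)
w = np.exp(-2 * (x - 0.0)**2)              # illustrative weight emphasizing x near 0
beta, sigma2 = mwle_normal_linear(x, y, w)
```

With w ≡ 1 this reduces to the ordinary MLE, so the weight function is the only new ingredient of the estimator.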
This type of estimation is not new in statistics. Actually, θ̂q1/q0 may be regarded as a generalization of the pseudo-maximum likelihood estimation in sample surveys (Skinner et al., 1989, p. 80; Pfeffermann et al., 1998); the log-likelihood is weighted inversely proportionally to q0(x), the probability of selecting unit x, while q1(x) assigns equal probability to all possible values of x. The same idea is also seen in Rao (1991), where weighted maximum likelihood estimation is considered for unequally spaced time-series data.
The local likelihoods, or weighted likelihoods formally similar to (1.1), are found in the literature on semi-parametric inference. However, in the semi-parametric approach θ is estimated using a weight function concentrated locally around each x or (x,y); thus the estimate θ̂ in p(y|x,θ̂) depends on (x,y) as well as on the data (x(n),y(n)). On the other hand, we restrict our attention to a rather conventional parametric modeling approach here, and θ̂w depends only on the data.
In spite of the asymptotic optimality of w(x)=q1(x)/q0(x) mentioned above, another choice of the weight function can improve the expected loss for moderate sample size by trading off the bias and the variance of θ̂w. We develop a practical method for this improvement in Sections 4–7. The asymptotic expansion of the expected loss is given in Section 4, and a variant of the information criterion is derived as an estimate of the expected loss in Section 5. This new criterion is used to find a good w(x) as well as a good form of p(y|x,θ). The numerical example is revisited in Section 6, and a simulation study is given in Section 7.
In Section 8, we show that the Bayesian predictive density is also improved by considering the weight function. Finally, concluding remarks are given in Section 9. All the proofs are deferred to the appendix.
Illustrative example in regression
Here we consider normal regression to predict the response y using a polynomial function of the covariate x. Let the model p(y|x,θ) be the polynomial regression

y|x ∼ N(β0 + β1x + ⋯ + βd x^d, σ²),  (2.1)

where θ=(β0,…,βd,σ) and N(a,b) denotes the normal distribution with mean a and variance b. In the numerical example below, we assume the true q(y|x) is also given by (2.1) with d=3.
The density q0(x) of the covariate x is

x ∼ N(μ0, τ0²),

where μ0=0.5 and τ0²=0.5². This corresponds to the sampling scheme
Asymptotic properties of MWLE
Let Ei(·) denote the expectation with respect to q(y|x)qi(x) for i=0,1. Considering −Lw(θ) as the summation of i.i.d. random variables lw(xt,yt|θ) = −w(xt) log p(yt|xt,θ), it follows from the law of large numbers that −Lw(θ)/n → E0(lw(x,y|θ)) as n grows to infinity. Then we have θ̂w → θ*w in probability as n→∞, where θ*w is the minimizer of E0(lw(x,y|θ)) over θ∈Θ. Hereafter, we restrict our attention to proper w(x) such that E0(lw(x,y|θ)) exists for all θ∈Θ and that the Hessian of E0(lw(x,y|θ)) is non-singular at θ*w.
Expected loss
In the previous section, the optimal choice of w(x) was discussed in terms of the asymptotic bias of θ̂w. For moderate sample size, however, the variance of θ̂w due to sampling error should also be considered. In order to take account of both the bias and the variance, we employ the expected loss E0(n)(loss1(θ̂w)) to determine the optimal weight; E0(n)(·) denotes the expectation with respect to (x(n),y(n)), which follows ∏t=1,…,n q(yt|xt)q0(xt). Lemma 2. The expected loss is asymptotically expanded as
Information criterion
The performance of the MWLE for a specified w(x) is given by (4.1). However, we cannot calculate the value of the expected loss from it in practice, because q(y|x) is unknown. We provide a variant of the information criterion as an estimate of (4.1). Theorem 1. Let the information criterion for the MWLE be

ICw = −2 Σt=1,…,n (q1(xt)/q0(xt)) log p(yt|xt,θ̂w) + 2 tr(JwHw⁻¹),  (5.1)

where

Jw = E0[(q1(x)/q0(x)) w(x) (∂ log p(y|x,θ)/∂θ)(∂ log p(y|x,θ)/∂θ)′] at θ=θ*w,
Hw = −E0[w(x) ∂² log p(y|x,θ)/∂θ∂θ′] at θ=θ*w.

The matrices Jw and Hw may be replaced by their consistent estimates
Numerical example revisited
For normal linear regression, such as the polynomial regression given in (2.1), the β-components of θ̂w are obtained by weighted least squares (WLS) with weights w(xt). The σ-component of θ̂w is then given by σ̂w² = Σt w(xt)êt² / Σt w(xt), where êt = yt − ŷt is the residual. Letting ht, t=1,…,n, be the diagonal elements of the hat matrix used in the WLS, the information criterion (5.1) is calculated from
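The WLS quantities named above can be sketched as follows; `wls_polynomial` is an illustrative name of ours, and the hat-matrix diagonal ht is returned so that it could be plugged into the criterion, whose exact expression we leave to (5.1).

```python
import numpy as np

def wls_polynomial(x, y, w, d):
    """Weighted least squares fit of a degree-d polynomial: returns the
    beta estimates, the weighted residual variance (the sigma-component
    of the MWLE), the residuals, and the diagonal h_t of the hat matrix
    H = W^{1/2} X (X'WX)^{-1} X' W^{1/2} used in the WLS."""
    X = np.vander(x, d + 1, increasing=True)   # columns 1, x, ..., x^d
    G = np.linalg.inv(X.T @ (w[:, None] * X))  # (X'WX)^{-1}
    beta = G @ (X.T @ (w * y))
    resid = y - X @ beta
    sigma2 = np.sum(w * resid**2) / np.sum(w)
    Xh = np.sqrt(w)[:, None] * X               # W^{1/2} X
    # h_t = w_t * x_t' (X'WX)^{-1} x_t : leverage of observation t
    h = np.einsum('ij,jk,ik->i', Xh, G, Xh)
    return beta, sigma2, resid, h
```

Since the hat matrix is a projection, the leverages lie in [0,1] and sum to the number of fitted β-coefficients, d+1, which gives a quick sanity check on the computation.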
We apply
Simulation study
First we show simulation results in Tables 1–3, which confirm the theory of Sections 4 and 5. A large number N of replicates of the dataset of size n are generated from q(y|x)q0(x). We used (2.4) as q1(x). Four simulations of n=50, 100, 300, and 1000 are done with N=10⁵ for n=50–300 and N=10⁶ for n=1000. For each replicate of the dataset, θ̂w is calculated for λ=0,1 and d=0,1,2, where w(x)=(q1(x)/q0(x))^λ. Then the losses and the information criteria are calculated, and their averages over the N
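A stripped-down version of one such replicate, with illustrative stand-ins of ours for q0, q1, and the true curve (not the paper's settings), might look like this; λ=0 gives the MLE and λ=1 gives θ̂q1/q0, and the fit is scored by its mean squared error over covariates drawn from q1.

```python
import numpy as np

def normal_pdf(z, mu, tau):
    return np.exp(-(z - mu)**2 / (2 * tau**2)) / (tau * np.sqrt(2 * np.pi))

def simulate_once(rng, n=100, d=1, lam=1.0):
    """One replicate: fit a degree-d polynomial by MWLE with weight
    w(x) = (q1(x)/q0(x))**lam, then evaluate on covariates from q1.
    The densities and the true curve are illustrative, not the paper's."""
    mu0, tau0, mu1, tau1 = 0.5, 0.5, 0.0, 0.3
    x = rng.normal(mu0, tau0, n)                  # training covariates from q0
    y = x - x**3 + rng.normal(0, 0.3, n)          # true curve (illustrative)
    w = (normal_pdf(x, mu1, tau1) / normal_pdf(x, mu0, tau0))**lam
    X = np.vander(x, d + 1, increasing=True)
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    xt = rng.normal(mu1, tau1, 2000)              # evaluation covariates from q1
    pred = np.vander(xt, d + 1, increasing=True) @ beta
    return np.mean((pred - (xt - xt**3))**2)

rng = np.random.default_rng(1)
mse_mle  = np.mean([simulate_once(rng, lam=0.0) for _ in range(200)])
mse_mwle = np.mean([simulate_once(rng, lam=1.0) for _ in range(200)])
```

In this shifted, misspecified (d=1) setting, averaging over replicates shows the weighted fit tracking the true curve over the q1 region better than the unweighted MLE does.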
Bayesian inference
We have been working on the predictive density p(y|x,θ̂w), which is based on the MWLE θ̂w. This type of predictive density is occasionally called an estimative density in the literature. Another possibility is the Bayesian predictive density. Here we consider a weighted version of it, and examine its performance in prediction.
Let p(θ) be the prior density of θ. Given the data (x(n),y(n)), we shall define the weighted posterior density by

pw(θ|x(n),y(n)) ∝ exp(Lw(θ)) p(θ).

Then the predictive
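As a minimal sketch of this construction, with an illustrative one-parameter model and prior of our choosing (not the paper's), the weighted posterior can be approximated on a grid:

```python
import numpy as np

def weighted_posterior_grid(x, y, w, grid, prior_sd=10.0):
    """Grid approximation to the weighted posterior over the slope b of
    y | x ~ N(b*x, 1): p_w(b | data) is proportional to exp(L_w(b)) * p(b),
    where L_w(b) = sum_t w_t * log N(y_t; b*x_t, 1) (constants in b dropped)
    and the prior p is N(0, prior_sd^2). Returns probabilities over the grid."""
    logpost = np.array([
        -0.5 * np.sum(w * (y - b * x)**2) - b**2 / (2 * prior_sd**2)
        for b in grid
    ])
    logpost -= logpost.max()        # stabilize before exponentiating
    post = np.exp(logpost)
    return post / post.sum()
```

With w ≡ 1 this is the usual posterior, so its mean can be checked against the conjugate normal closed form, while a non-constant w tilts the posterior toward parameter values that fit the upweighted region of x.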
Concluding remarks
Although the ratio q1(x)/q0(x) has been assumed to be known, it is often estimated from data in practice. Assuming q1(x) is known, we tried three possibilities in the numerical example of Section 2: (i) q0(x) is specified correctly without unknown parameters. (ii) Assuming the normality of q0(x), the unknown μ0 and τ0 are estimated. (iii) Non-parametric kernel density estimation is applied to q0(x). It turns out that the MWLE is robust against the estimation of q1(x)/q0(x) and the results are
Acknowledgements
I would like to thank John Copas, Tony Hayter, Motoaki Kawanabe, Shinto Eguchi, and the reviewers for helpful comments and suggestions.
References (24)
- An Akaike information criterion for model selection in the presence of incomplete data. J. Statist. Plann. Inference (1998)
- A new look at the statistical model identification. IEEE Trans. Automat. Control (1974)
- Differential-Geometrical Methods in Statistics (1985)
- Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Ann. Inst. Statist. Math. (1994)
- Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B (1984)
- Approximate predictive likelihood. Biometrika (1986)
- Asymptotic prediction analysis. Biometrika (1976)
- The geometry of exponential families. Ann. Statist. (1978)
- Robust estimation – a weighted maximum likelihood approach. Internat. Statist. Rev. (1994)
- Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion). J. Roy. Statist. Soc. Ser. B (1984)
- Robust Statistics: The Approach Based on Influence Functions
- Asymptotic expansions associated with posterior distributions. Ann. Math. Statist.