Improving predictive inference under covariate shift by weighting the log-likelihood function

https://doi.org/10.1016/S0378-3758(00)00115-4

Abstract

A class of predictive densities is derived by weighting the observed samples in maximizing the log-likelihood function. This approach is effective in cases such as sample surveys or design of experiments, where the observed covariate follows a different distribution from that in the whole population. Under misspecification of the parametric model, the optimal choice of the weight function is asymptotically shown to be the ratio of the density function of the covariate in the population to that in the observations. This is the pseudo-maximum likelihood estimation of sample surveys. The optimality is defined by the expected Kullback–Leibler loss, and the optimal weight is obtained by considering the importance sampling identity. Under correct specification of the model, however, the ordinary maximum likelihood estimate (i.e. the uniform weight) is shown to be optimal asymptotically. For moderate sample size, the situation is in between the two extreme cases, and the weight function is selected by minimizing a variant of the information criterion derived as an estimate of the expected loss. The method is also applied to a weighted version of the Bayesian predictive density. Numerical examples as well as Monte Carlo simulations are shown for polynomial regression. A connection with robust parametric estimation is discussed.

Introduction

Let x be the explanatory variable or the covariate, and y be the response variable. In predictive inference with regression analysis, we are interested in estimating the conditional density q(y|x) of y given x, using a parametric model. Let p(y|x,θ) be the model of the conditional density, parameterized by θ = (θ1,…,θm)′ ∈ Θ ⊂ Rm. Having observed i.i.d. samples of size n, denoted by (x(n),y(n)) = ((xt,yt): t = 1,…,n), we obtain a predictive density p(y|x,θ̂) by plugging in an estimate θ̂ = θ̂(x(n),y(n)). In this paper, we discuss improvement of the maximum likelihood estimate (MLE) under both (i) covariate shift in distribution and (ii) misspecification of the model, as explained below.

Let q1(x) be the density of x for evaluation of the predictive performance, while q0(x) is the density of x in the observed data. We consider the Kullback–Leibler loss function

lossi(θ) ≔ −∫ qi(x) ∫ q(y|x) log p(y|x,θ) dy dx,  i = 0, 1,

and then employ loss1(θ̂) for evaluation of θ̂, rather than the usual loss0(θ̂). The situation q0(x) ≠ q1(x) will be called covariate shift in distribution, which is one of the premises of this paper.
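For concreteness, the following is a small numerical sketch (not taken from the paper; the toy densities, the fixed θ, and all function names are assumptions) of how loss1(θ) can be approximated from samples drawn under q0(x) by reweighting with the ratio q1(x)/q0(x), i.e. the importance-sampling identity mentioned in the abstract.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy illustration (assumed, not the paper's example): the true conditional
# density q(y|x) is y = x + eps, eps ~ N(0, 0.5^2); the working model is
# p(y|x, theta): y ~ N(theta*x, 0.5^2), with theta fixed below.
q0, q1 = norm(0.5, 0.5), norm(0.0, 0.3)   # covariate densities: q0 (observed), q1 (evaluation)
theta, n = 0.8, 200_000

def neg_log_p(y, x, theta, sigma=0.5):
    """-log p(y|x, theta) for the working model."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - theta * x) ** 2 / (2 * sigma**2)

# loss_1(theta) estimated directly from q1-samples (reference value)
x1 = rng.normal(0.0, 0.3, n)
y1 = x1 + rng.normal(0.0, 0.5, n)
loss1_direct = neg_log_p(y1, x1, theta).mean()

# The same quantity estimated from q0-samples via the importance-sampling
# identity E_1[f] = E_0[(q1(x)/q0(x)) f], i.e. without ever sampling from q1
x0 = rng.normal(0.5, 0.5, n)
y0 = x0 + rng.normal(0.0, 0.5, n)
loss1_importance = np.mean(q1.pdf(x0) / q0.pdf(x0) * neg_log_p(y0, x0, theta))

print(loss1_direct, loss1_importance)   # the two estimates should roughly agree
```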

This situation is not as odd as it might look at first; in fact, it is seen in various fields. In sample surveys, q0(x) is determined by the sampling scheme, while q1(x) is determined by the population. In regression analysis, covariate shift often happens because of limited resources or the design of experiments. In the artificial neural network literature, “active learning” is the typical situation where we control q0(x) for better prediction. Put differently, the distribution of x in future observations differs from that of the past observations; x is not necessarily distributed as q1(x) in the future, but we can give an imaginary q1(x) to specify the region of x where the prediction accuracy should be controlled. Note that q0(x) and/or q1(x) are often estimated from data, but we assume they are known or estimated reasonably in advance.

The second premise of this paper is misspecification of the model. Let θ̂0 be the MLE of θ, and θ0 be the asymptotic limit of θ̂0 as n→∞. Under certain regularity conditions, MLE is consistent and p(y|x,θ0)=q(y|x) provided that the model is correctly specified. In practice, however, p(y|x,θ0) deviates more or less from q(y|x).

Under both covariate shift and misspecification, MLE does not necessarily provide a good inference. We will show that MLE is improved by introducing a weight function w(x) of the covariate into the log-likelihood function

Lw(θ|x(n),y(n)) ≔ −∑_{t=1}^{n} lw(xt,yt|θ),  (1.1)

where lw(x,y|θ) = −w(x) log p(y|x,θ). Then the maximum weighted log-likelihood estimate (MWLE), denoted by θ̂w, is obtained by maximizing (1.1) over Θ. It will be seen that the weight function w(x) = q1(x)/q0(x) is the optimal choice for sufficiently large n in terms of the expected loss with respect to q1(x). We denote the MWLE with this weight function by θ̂1. A comparison between θ̂0 and θ̂1 is made in the numerical example of polynomial regression of Section 2, and the asymptotic optimality of θ̂1 is shown in Section 3. Note that MWLE amounts to downweighting the observed samples that are unimportant for fitting the model with respect to the population. An interpretation of MWLE as a robust estimation technique is given in Section 9.
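As a rough illustration, the following sketch computes θ̂w by numerically maximizing (1.1) for a simple working model; the data-generating process, the densities q0 and q1, and all variable names are assumptions chosen to loosely anticipate the example of Section 2, not the paper's exact settings. Setting w ≡ 1 recovers the ordinary MLE θ̂0, while w(x) = q1(x)/q0(x) gives θ̂1.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Assumed data-generating process for illustration: a cubic truth fitted by a
# straight-line working model, with covariates observed under q0 and the
# prediction evaluated under q1.
n = 200
x = rng.normal(0.5, 0.5, n)                        # x ~ q0
y = -x + x**3 + rng.normal(0.0, 0.3, n)            # true q(y|x)
w = norm(0.0, 0.3).pdf(x) / norm(0.5, 0.5).pdf(x)  # w(x) = q1(x)/q0(x)

def neg_Lw(theta, x, y, w):
    """-L_w(theta): the weighted negative log-likelihood of (1.1)."""
    b0, b1, log_sigma = theta
    logp = norm.logpdf(y, loc=b0 + b1 * x, scale=np.exp(log_sigma))
    return -np.sum(w * logp)

theta_hat_1 = minimize(neg_Lw, np.zeros(3), args=(x, y, w), method="BFGS").x
theta_hat_0 = minimize(neg_Lw, np.zeros(3), args=(x, y, np.ones(n)), method="BFGS").x
print("MWLE (w = q1/q0):", theta_hat_1)
print("MLE  (w = 1):    ", theta_hat_0)
```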

This type of estimation is not new in statistics. In fact, θ̂1 is regarded as a generalization of the pseudo-maximum likelihood estimation of sample surveys (Skinner et al., 1989, p. 80; Pfeffermann et al., 1998); the log-likelihood is weighted inversely proportionally to q0(x), the probability of selecting unit x, while q1(x) assigns equal probability to all possible values of x. The same idea is also seen in Rao (1991), where weighted maximum likelihood estimation is considered for unequally spaced time-series data.

Local likelihoods or weighted likelihoods formally similar to (1.1) are found in the literature on semi-parametric inference. In the semi-parametric approach, however, θ̂w is estimated using a weight function concentrated locally around each x or (x,y); thus θ̂w in p(y|x,θ̂w) depends on (x,y) as well as on the data (x(n),y(n)). Here, on the other hand, we restrict our attention to the rather conventional parametric modeling approach, in which θ̂w depends only on the data.

In spite of the asymptotic optimality of w(x) = q1(x)/q0(x) mentioned above, another choice of the weight function can improve the expected loss for moderate sample sizes by trading off the bias and the variance of θ̂w. We develop a practical method for this improvement in Sections 4–7. The asymptotic expansion of the expected loss is given in Section 4, and a variant of the information criterion is derived as an estimate of the expected loss in Section 5. This new criterion is used to find a good w(x) as well as a good form of p(y|x,θ). The numerical example is revisited in Section 6, and a simulation study is given in Section 7.

In Section 8, we show that the Bayesian predictive density is also improved by considering the weight function. Finally, concluding remarks are given in Section 9. All the proofs are deferred to the appendix.

Section snippets

Illustrative example in regression

Here we consider the normal regression model to predict the response y ∈ R using a polynomial function of x ∈ R. Let the model p(y|x,θ) be the polynomial regression

y = β0 + β1 x + ⋯ + βd x^d + ε,  ε ∼ N(0, σ²),  (2.1)

where θ = (β0,…,βd,σ) and N(a,b) denotes the normal distribution with mean a and variance b. In the numerical example below, we assume the true q(y|x) is also given by (2.1) with d = 3:

y = −x + x³ + ε,  ε ∼ N(0, 0.3²).

The density q0(x) of the covariate x is

x ∼ N(μ0, τ0²),

where μ0 = 0.5 and τ0² = 0.5². This corresponds to the sampling scheme
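A minimal sketch of this kind of setup is given below; since the snippet is cut short here, the evaluation density q1(x) and all variable names are assumptions for illustration. The sketch fits a deliberately misspecified degree-1 polynomial both by ordinary least squares (the MLE) and by weighted least squares with w(x) = q1(x)/q0(x) (the MWLE θ̂1), and compares their prediction error under q1.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Assumed setup: cubic truth as in Section 2, x observed under q0 = N(0.5, 0.5^2);
# q1 = N(0.0, 0.3^2) is an assumed evaluation density for this sketch.
n, d = 100, 1                                      # sample size and fitted polynomial degree
x = rng.normal(0.5, 0.5, n)
y = -x + x**3 + rng.normal(0.0, 0.3, n)
w = norm(0.0, 0.3).pdf(x) / norm(0.5, 0.5).pdf(x)

X = np.vander(x, d + 1, increasing=True)           # design matrix [1, x, ..., x^d]
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # MLE (uniform weight)
beta_wls = np.linalg.lstsq(np.sqrt(w)[:, None] * X,  # MWLE with w = q1/q0
                           np.sqrt(w) * y, rcond=None)[0]

# Compare prediction error where it matters, i.e. for x ~ q1
x_test = rng.normal(0.0, 0.3, 100_000)
y_test = -x_test + x_test**3 + rng.normal(0.0, 0.3, 100_000)
X_test = np.vander(x_test, d + 1, increasing=True)
print("MSE under q1, OLS fit:", np.mean((y_test - X_test @ beta_ols) ** 2))
print("MSE under q1, WLS fit:", np.mean((y_test - X_test @ beta_wls) ** 2))
```

Under this kind of misspecification (d = 1 while the truth is cubic), the weighted fit typically predicts better in the region emphasized by q1(x).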

Asymptotic properties of MWLE

Let Ei(·) denote the expectation with respect to q(y|x)qi(x) for i = 0, 1. Considering −Lw(θ) as the summation of i.i.d. random variables lw(xt,yt|θ), it follows from the law of large numbers that −Lw(θ)/n → E0(lw(x,y|θ)) as n grows to infinity. Then we have θ̂w → θw in probability as n → ∞, where θw is the minimizer of E0(lw(x,y|θ)) over θ ∈ Θ. Hereafter, we restrict our attention to proper w(x) such that E0(lw(x,y|θ)) exists for all θ ∈ Θ and that the Hessian of E0(lw(x,y|θ)) is non-singular at θw,

Expected loss

In the previous section, the optimal choice of w(x) was discussed in terms of the asymptotic bias θw − θ1. For moderate sample size, however, the variance of θ̂w due to sampling error should also be considered. In order to take account of both the bias and the variance, we employ the expected loss E0(n)(loss1(θ̂w)) to determine the optimal weight; E0(n)(·) denotes the expectation with respect to (x(n),y(n)), which follows ∏_{t=1}^{n} q(yt|xt) q0(xt).

Lemma 2

The expected loss is asymptotically expanded as E0(n)(loss1(θ̂w)) = …

Information criterion

The performance of MWLE for a specified w(x) is given by (4.1). However, we cannot calculate the value of the expected loss from it in practice, because q(y|x) is unknown. We provide a variant of the information criterion as an estimate of (4.1).

Theorem 1

Let the information criterion for MWLE be

ICw ≔ −2 L1(θ̂w) + 2 tr(Jw Hw⁻¹),  (5.1)

where

L1(θ) = ∑_{t=1}^{n} (q1(xt)/q0(xt)) log p(yt|xt,θ),
Jw = −E0[ (q1(x)/q0(x)) ∂log p(y|x,θ)/∂θ |θw · ∂lw(x,y|θ)/∂θ′ |θw ].

The matrices Jw and Hw may be replaced by their consistent estimates

Ĵw = −(1/n) ∑_{t=1}^{n} (q1(xt)/q0(xt)) ∂log p(yt|xt,θ)/∂θ |θ̂w · ∂lw(xt,yt|θ)/∂θ′ |θ̂w, …
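As a rough numerical sketch of how (5.1) can be evaluated for a generic parametric model, the following computes the plug-in estimates Ĵw and Ĥw by finite differences; the model, the data-generating process, and all names are assumptions carried over from the earlier sketches, and the finite-difference approach is a convenience of this sketch rather than anything from the paper. The closed-form expressions for the normal linear model given in the next section are preferable when they apply.

```python
import numpy as np
from scipy.optimize import approx_fprime, minimize
from scipy.stats import norm

rng = np.random.default_rng(3)

# Assumed setup as in the earlier sketches: straight-line working model,
# covariates observed under q0 = N(0.5, 0.5^2), evaluation density q1 = N(0, 0.3^2).
n = 200
x = rng.normal(0.5, 0.5, n)
y = -x + x**3 + rng.normal(0.0, 0.3, n)
r = norm(0.0, 0.3).pdf(x) / norm(0.5, 0.5).pdf(x)   # r_t = q1(x_t)/q0(x_t)
w = r                                               # weight w(x) = q1/q0

def logp(theta, x, y):
    """log p(y|x, theta) for the working model y ~ N(b0 + b1*x, sigma^2)."""
    b0, b1, log_sigma = theta
    return norm.logpdf(y, loc=b0 + b1 * x, scale=np.exp(log_sigma))

# MWLE theta_hat_w: maximize the weighted log-likelihood
theta_hat = minimize(lambda th: -np.sum(w * logp(th, x, y)), np.zeros(3),
                     method="BFGS").x

# Empirical J_w: average of r_t * w_t * score_t score_t' (scores by finite differences)
scores = np.array([approx_fprime(theta_hat, lambda th: logp(th, x[t], y[t]), 1e-6)
                   for t in range(n)])
J_hat = ((r * w)[:, None, None] * scores[:, :, None] * scores[:, None, :]).mean(axis=0)

# Empirical H_w: Hessian of the average of l_w = -w * log p, by finite differences
grad_mean_lw = lambda th: approx_fprime(th, lambda u: np.mean(-w * logp(u, x, y)), 1e-6)
H_hat = np.array([approx_fprime(theta_hat, lambda th: grad_mean_lw(th)[k], 1e-4)
                  for k in range(3)])
H_hat = (H_hat + H_hat.T) / 2                       # symmetrize numerical noise

L1 = np.sum(r * logp(theta_hat, x, y))              # L_1(theta_hat_w)
IC_w = -2 * L1 + 2 * np.trace(J_hat @ np.linalg.inv(H_hat))
print(IC_w)
```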

Numerical example revisited

For the normal linear regression, such as the polynomial regression given in (2.1), the β-components of θ̂w are obtained by weighted least squares (WLS) with weights w(xt). The σ-component of θ̂w is then given by σ̂² = ∑_{t=1}^{n} w(xt) ε̂t² / ĉw, where ĉw = ∑_{t=1}^{n} w(xt) and ε̂t is the residual. Letting ĥt, t = 1,…,n, be the diagonal elements of the hat matrix used in the WLS, the information criterion (5.1) is calculated from

−L1(θ̂w) = (1/2) ∑_{t=1}^{n} (q1(xt)/q0(xt)) ( ε̂t²/σ̂² + log(2πσ̂²) ),
tr(ĴwĤw⁻¹) = ∑_{t=1}^{n} (q1(xt)/q0(xt)) ( (ε̂t²/σ̂²) ĥt + (w(xt)/(2ĉw)) (ε̂t²/σ̂² − 1)² ).
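A minimal sketch of these computations, under the same assumed setup as in the earlier sketches (the densities, polynomial degree, and variable names are illustrative assumptions), is:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Assumed setup: cubic truth, degree-1 working model, x observed under
# q0 = N(0.5, 0.5^2), evaluation density q1 = N(0, 0.3^2).
n, d = 100, 1
x = rng.normal(0.5, 0.5, n)
y = -x + x**3 + rng.normal(0.0, 0.3, n)
r = norm(0.0, 0.3).pdf(x) / norm(0.5, 0.5).pdf(x)   # q1(x_t)/q0(x_t)
w = r                                                # weight w(x) = q1/q0

X = np.vander(x, d + 1, increasing=True)             # design matrix [1, x, ..., x^d]

# Weighted least squares: beta-components of the MWLE
W = np.diag(w)
XtWX = X.T @ W @ X
beta_hat = np.linalg.solve(XtWX, X.T @ (w * y))

# sigma-component and residuals
resid = y - X @ beta_hat
c_w = w.sum()
sigma2_hat = np.sum(w * resid**2) / c_w

# Diagonal elements of the hat matrix used in the WLS fit
h = np.diag(X @ np.linalg.solve(XtWX, X.T @ W))

# The two pieces of the information criterion (5.1)
neg_L1 = 0.5 * np.sum(r * (resid**2 / sigma2_hat + np.log(2 * np.pi * sigma2_hat)))
tr_JH = np.sum(r * (resid**2 / sigma2_hat * h
                    + w / (2 * c_w) * (resid**2 / sigma2_hat - 1) ** 2))
IC_w = 2 * neg_L1 + 2 * tr_JH
print(beta_hat, sigma2_hat, IC_w)
```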

We apply

Simulation study

First we show simulation results in Tables 1–3, which confirm the theory of Sections 4 and 5. A large number N of replicates of the dataset of size n are generated from the true model with the covariate density q0(x). We used (2.4) as q1(x). Four simulations with n = 50, 100, 300, and 1000 are done, with N = 10⁵ for n = 50–300 and N = 10⁶ for n = 1000. For each replicate of the dataset, θ̂w is calculated for λ = 0, 1 and d = 0, 1, 2. Then loss1(θ̂w), L1(θ̂w), and tr(ĴwĤw⁻¹) are calculated, and their averages over the N

Bayesian inference

We have been working on the predictive density p(y|x,θ̂w), which is based on the MWLE θ̂w. This type of predictive density is occasionally called an estimative density in the literature. Another possibility is the Bayesian predictive density. Here we consider a weighted version of it and examine its performance in prediction.

Let p(θ) be the prior density of θ. Given the data (x(n),y(n)), we shall define the weighted posterior density by

pw(θ|x(n),y(n)) ∝ p(θ) exp{Lw(θ|x(n),y(n))}.

Then the predictive
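A simple sketch of drawing from this weighted posterior is given below, assuming a straight-line working model, a weak normal prior on (β0, β1, log σ), and a random-walk Metropolis sampler; all of these choices, and all names, are assumptions for illustration rather than the paper's. The weighted Bayesian predictive density at a new point is then approximated by averaging p(y|x,θ) over the posterior draws.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

# Assumed setup as in the earlier sketches: straight-line working model,
# covariates from q0, weight w(x) = q1(x)/q0(x).
n = 100
x = rng.normal(0.5, 0.5, n)
y = -x + x**3 + rng.normal(0.0, 0.3, n)
w = norm(0.0, 0.3).pdf(x) / norm(0.5, 0.5).pdf(x)

def log_posterior_w(theta):
    """log of the weighted posterior: log p(theta) + L_w(theta), up to a constant."""
    b0, b1, log_sigma = theta
    log_prior = norm.logpdf(theta, 0.0, 10.0).sum()       # weak N(0, 10^2) prior (assumed)
    Lw = np.sum(w * norm.logpdf(y, b0 + b1 * x, np.exp(log_sigma)))
    return log_prior + Lw

# Random-walk Metropolis draws from the weighted posterior (a crude sketch)
theta, logp_cur, draws = np.zeros(3), log_posterior_w(np.zeros(3)), []
for it in range(20000):
    prop = theta + 0.05 * rng.standard_normal(3)
    logp_prop = log_posterior_w(prop)
    if np.log(rng.uniform()) < logp_prop - logp_cur:
        theta, logp_cur = prop, logp_prop
    if it >= 10000:                                       # keep the second half as samples
        draws.append(theta.copy())
draws = np.array(draws)

# Weighted Bayesian predictive density at a new point (x0, y0):
# the average of p(y0|x0, theta) over the posterior draws
x0, y0 = 0.0, 0.1
pred = np.mean(norm.pdf(y0, draws[:, 0] + draws[:, 1] * x0, np.exp(draws[:, 2])))
print(pred)
```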

Concluding remarks

Although the ratio q1(x)/q0(x) has been assumed to be known, it is often estimated from data in practice. Assuming q1(x) is known, we tried three possibilities in the numerical example of Section 2: (i) q0(x) is specified correctly without unknown parameters. (ii) Assuming the normality of q0(x), the unknown μ0 and τ0 are estimated. (iii) Non-parametric kernel density estimation is applied to q0(x). Then, it turns out that MWLE is robust against the estimation of q1(x)/q0(x) and the results are

Acknowledgements

I would like to thank John Copas, Tony Hayter, Motoaki Kawanabe, Shinto Eguchi, and the reviewers for helpful comments and suggestions.

References (24)

  • J.E. Cavanaugh et al., An Akaike information criterion for model selection in the presence of incomplete data, J. Statist. Plann. Inference (1998)
  • H. Akaike, A new look at the statistical model identification, IEEE Trans. Automat. Control (1974)
  • S. Amari, Differential-Geometrical Methods in Statistics (1985)
  • A. Basu et al., Minimum disparity estimation for continuous models: efficiency, distributions and robustness, Ann. Inst. Statist. Math. (1994)
  • N. Cressie et al., Multinomial goodness-of-fit tests, J. Roy. Statist. Soc. Ser. B (1984)
  • A.C. Davison, Approximate predictive likelihood, Biometrika (1986)
  • I.R. Dunsmore, Asymptotic prediction analysis, Biometrika (1976)
  • B. Efron, The geometry of exponential families, Ann. Statist. (1978)
  • C. Field et al., Robust estimation – a weighted maximum likelihood approach, Internat. Statist. Rev. (1994)
  • P.J. Green, Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion), J. Roy. Statist. Soc. Ser. B (1984)
  • F.R. Hampel et al., Robust Statistics: The Approach Based on Influence Functions (1986)
  • R.A. Johnson, Asymptotic expansions associated with posterior distributions, Ann. Math. Statist. (1970)