Chapter 10 Likelihood-Based Inferences
Previously, we discussed approximate sampling distribution theory for maximum likelihood estimators. Recall that if $\hat\theta$ is the MLE of $\theta$, the parameter of a distribution family $P_\theta$, then (under some technical conditions) $\hat\theta \,\dot\sim\, N\!\left(\theta,\, I(\theta)^{-1}\right)$ for large $n$, where $I(\theta)$ is the Fisher information
$$I(\theta) = -E\!\left[\frac{\partial^2}{\partial \theta^2}\,\ell(\theta; X_1,\ldots,X_n)\right]$$
for a random sample $X_1,\ldots,X_n$ from $P_\theta$.
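For a concrete illustration (a standard example, not from the claims data below), consider a random sample from an Exponential distribution with rate $\lambda$, so $\ell(\lambda) = n\log\lambda - \lambda\sum_{i=1}^n x_i$. Then
$$\frac{\partial^2 \ell}{\partial \lambda^2} = -\frac{n}{\lambda^2}, \qquad I(\lambda) = -E\!\left[\frac{\partial^2 \ell}{\partial \lambda^2}\right] = \frac{n}{\lambda^2},$$
so the MLE $\hat\lambda = 1/\bar X$ satisfies $\hat\lambda \,\dot\sim\, N\!\left(\lambda,\, \lambda^2/n\right)$ for large $n$.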
In this chapter we discuss how to use this and other approximate (large-sample) sampling distributions to make inferences about population parameters.
10.1 Wald tests and CIs
A $100(1-\alpha)\%$ Wald approximate CI for a scalar parameter $\theta$ has the form
$$\left(\hat\theta \pm z_{1-\alpha/2}\sqrt{I(\hat\theta)^{-1}}\right).$$
This CI is straightforward to obtain by the same argument we have used before. Start with the identity
$$1-\alpha \approx P\!\left(z_{\alpha/2} < \frac{\hat\theta - \theta}{\sqrt{I(\theta)^{-1}}} < z_{1-\alpha/2}\right),$$
which follows from the approximate pivot discussed above, and in Chapter 2. Then, simply multiply through by the standard error $\sqrt{I(\theta)^{-1}}$, subtract $\hat\theta$ from both sides, and multiply by $-1$ to get
$$1-\alpha \approx P\!\left(\hat\theta - z_{1-\alpha/2}\sqrt{I(\theta)^{-1}} < \theta < \hat\theta + z_{1-\alpha/2}\sqrt{I(\theta)^{-1}}\right).$$
Finally, since $\theta$ is unknown, replace $\theta$ by $\hat\theta$ in the standard error to yield the so-called Wald-type CI $\left(\hat\theta \pm z_{1-\alpha/2}\sqrt{I(\hat\theta)^{-1}}\right)$.
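As a minimal sketch of this formula in R (the function name wald_ci and its arguments are ours, not from the text):
wald_ci <- function(theta.hat, info, alpha = 0.05){
  # theta.hat: MLE of a scalar parameter
  # info: Fisher information I(theta.hat) evaluated at the MLE
  se <- sqrt(1 / info)                          # standard error = sqrt(I(theta.hat)^{-1})
  theta.hat + c(-1, 1) * qnorm(1 - alpha/2) * se
}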
If $\theta=(\theta_1,\ldots,\theta_p)$ is a vector, then the marginal Wald approximate CI for $\theta_j$ is given by
$$\left(\hat\theta_j \pm z_{1-\alpha/2}\sqrt{\left(I(\hat\theta)^{-1}\right)_{j,j}}\right),$$
where $I(\theta)$ is the Fisher information matrix and $\left(I(\hat\theta)^{-1}\right)_{j,j}$ is the $j$th diagonal entry of its inverse.
Besides CIs, the approximate pivot may be used for hypothesis tests as well. To test $H_0:\theta=\theta_0$ versus $H_a:\theta\neq\theta_0$ for a scalar parameter $\theta$, reject the null hypothesis if $|z| > z_{1-\alpha/2}$, where
$$z = \frac{\hat\theta - \theta_0}{\sqrt{I(\theta_0)^{-1}}}.$$
(Alternatively, one may use $\hat\theta$ in the standard error rather than $\theta_0$. Under the null hypothesis, there is little difference.)
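A minimal sketch of this test in R, assuming theta.hat, theta0, and the Fisher information info0 $= I(\theta_0)$ have already been computed (these names are ours):
z <- (theta.hat - theta0) / sqrt(1 / info0)   # Wald z statistic
p.value <- 2 * pnorm(-abs(z))                 # two-sided p-value
reject <- abs(z) > qnorm(1 - 0.05/2)          # reject H0 at level 0.05?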
For a vector parameter $\theta$, let
$$W = (\hat\theta-\theta_0)^\top I(\theta_0)\,(\hat\theta-\theta_0),$$
where $I(\theta_0)$ is the inverse of the approximate covariance matrix $I(\theta_0)^{-1}$ of $\hat\theta$. Under $H_0:\theta=\theta_0$, $W \,\dot\sim\, \chi^2(p)$. And, if we instead plug in the estimate $\hat\theta$ (i.e., use the estimated covariance $I(\hat\theta)^{-1}$) and set
$$V = (\hat\theta-\theta_0)^\top I(\hat\theta)\,(\hat\theta-\theta_0),$$
then $V \,\dot\sim\, F(p, n-p)$. So, we get either a Chi-Squared or F-based test for a vector parameter.
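A minimal sketch of the chi-squared version in R, assuming theta.hat and theta0 are length-$p$ vectors and info0 $= I(\theta_0)$ is the information matrix (names are ours):
d <- theta.hat - theta0
W <- drop(t(d) %*% info0 %*% d)                            # Wald quadratic form
p.value <- pchisq(W, df = length(theta0), lower.tail = FALSE)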
10.1.1 Example: Auto insurance claims
We have data on 100 auto insurance claims.
Given the auto claims are positively skewed, we will assume a Gamma distribution reasonably models the population and fit a Gamma to the data by maximum likelihood.
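The claims data themselves are not reproduced here. Purely so the code below is self-contained, one could stand in a simulated, positively skewed vector of 100 claims; this is an assumption, not the actual data set, and the printed output below comes from the real claims data.
set.seed(1)                                   # hypothetical stand-in data, for reproducibility
data <- rgamma(100, shape = 2, rate = 0.002)  # simulated claims, roughly on the scale of the fitted model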
The Gamma density has the form
$$f(x;\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\beta x}, \qquad x,\alpha,\beta>0,$$
where $(\alpha,\beta)$ are the shape and rate parameters. The likelihood for a random sample of size $n$ is
$$L(\alpha,\beta; X_1,\ldots,X_n) = \left(\frac{\beta^\alpha}{\Gamma(\alpha)}\right)^{n}\prod_{i=1}^n x_i^{\alpha-1}\, e^{-\beta\sum_{i=1}^n x_i}.$$
Taking the log, we have the loglikelihood function (suppressing dependence on the data)
$$\ell(\alpha,\beta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum_{i=1}^n\log(x_i) - \beta\sum_{i=1}^n x_i.$$
Next, we need the gradient (the first derivatives) with respect to the parameters $(\alpha,\beta)$:
$$\frac{\partial\ell}{\partial\alpha} = n\log\beta - n\,\mathrm{DiGamma}(\alpha) + \sum_{i=1}^n\log(x_i),$$
where the function $\mathrm{DiGamma}(\alpha)$ is the logarithmic derivative of the Gamma function. Next, we have
$$\frac{\partial\ell}{\partial\beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^n x_i.$$
We also need the second derivatives in order to compute the Fisher information:
$$\frac{\partial^2\ell}{\partial\alpha^2} = -n\,\mathrm{TriGamma}(\alpha),$$
where the TriGamma function is (you guessed it!) the second logarithmic derivative of the Gamma function $\Gamma(\alpha)$.
$$\frac{\partial^2\ell}{\partial\alpha\,\partial\beta} = \frac{n}{\beta}.$$
$$\frac{\partial^2\ell}{\partial\beta^2} = -\frac{n\alpha}{\beta^2}.$$
You may have guessed we will not be able to compute the MLEs by hand by solving $\partial\ell/\partial\alpha = 0$ and $\partial\ell/\partial\beta = 0$. Instead, we need an iterative solver (an algorithm). In R we can use optim to solve for the MLEs. Or, we can code our own version of Newton’s method…
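To spell out the step the code below implements: writing $\theta = (\alpha,\beta)$, Newton’s method repeatedly updates
$$\theta^{(t+1)} = \theta^{(t)} - H\!\left(\theta^{(t)}\right)^{-1}\nabla\ell\!\left(\theta^{(t)}\right),$$
where $\nabla\ell$ is the gradient and $H$ the Hessian of the loglikelihood, stopping once successive iterates change by less than a small tolerance.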
n <- length(data)

loglik <- function(param){
  # loglikelihood of the Gamma(shape = param[1], rate = param[2]) model
  sum(dgamma(data, shape = param[1], rate = param[2], log = TRUE))
}

grad <- function(param){
  # gradient of the loglikelihood with respect to (alpha, beta)
  g1 <- n*log(param[2]) - n*digamma(param[1]) + sum(log(data))
  g2 <- n*param[1]/param[2] - sum(data)
  return(c(g1, g2))
}

H <- function(param){
  # Hessian of the loglikelihood with respect to (alpha, beta)
  h1 <- -n*trigamma(param[1])
  h2 <- n/param[2]
  h3 <- -n*param[1]/(param[2]^2)
  return(matrix(c(h1, h2, h2, h3), 2, 2))
}

# method-of-moments starting values
m <- mean(data)
v <- var(data)
beta0 <- m/v
alpha0 <- m*beta0

# Newton's method
delta <- 0.00001
diff.step <- 1
par.old <- c(alpha0, beta0)
steps <- 0
while((diff.step > delta) & (steps < 100)){
  par.new <- par.old - solve(H(par.old)) %*% matrix(grad(par.old), 2, 1)
  diff.step <- max(abs(par.new - par.old))
  steps <- steps + 1
  par.old <- par.new
}
par.old
## [,1]
## [1,] 1.966640491
## [2,] 0.001948425
loglik(par.old)
## [1] -780.5969
H(par.old)
## [,1] [,2]
## [1,] -65.86955 51323.52
## [2,] 51323.51588 -51803341.74
solve(H(par.old))
## [,1] [,2]
## [1,] -6.657160e-02 -6.595499e-05
## [2,] -6.595499e-05 -8.464786e-08
mle.optim <- optim(c(alpha0, beta0), loglik, gr = grad, control = list(fnscale = -1), method = 'BFGS', hessian = TRUE)
## Warning in dgamma(data, shape = param[1], rate = param[2], log = TRUE): NaNs
## produced
## (the same warning is repeated several more times)
The warnings occur because BFGS occasionally steps to a negative shape or rate value, where dgamma returns NaN; the optimizer recovers and, as the output below shows, still converges.
mle.optim
## $par
## [1] 1.966641732 0.001948426
##
## $value
## [1] -780.5969
##
## $counts
## function gradient
## 34 4
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] -65.86951 54017.32
## [2,] 54017.31581 -70328540.92
solve(mle.optim$hessian)
## [,1] [,2]
## [1,] -4.101660e-02 -3.150366e-05
## [2,] -3.150366e-05 -3.841603e-08
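As a sketch of how these pieces combine into marginal Wald intervals (this computation is ours, not reproduced from the text): mle.optim$hessian is the Hessian of the loglikelihood at the MLE, so the observed information is its negative, and the inverse of that matrix estimates the covariance of $(\hat\alpha,\hat\beta)$.
mle <- mle.optim$par
cov.mle <- solve(-mle.optim$hessian)           # estimated covariance = inverse observed information
se <- sqrt(diag(cov.mle))                      # marginal standard errors
z <- qnorm(0.975)
rbind(alpha = mle[1] + c(-1, 1) * z * se[1],   # 95% Wald CI for the shape
      beta  = mle[2] + c(-1, 1) * z * se[2])   # 95% Wald CI for the rate
For this Gamma model the analytic Hessian does not depend on the data, so one could equally use H(mle) from the Newton code above in place of the numerically approximated mle.optim$hessian.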