Chapter 10 Likelihood-Based Inferences

Previously, we discussed approximate sampling distribution theory for maximum likelihood estimators. Recall that if $\hat\theta$ is the MLE of $\theta$, the parameter of a distribution family $P_\theta$, then (under some technical conditions) $\hat\theta \overset{\cdot}{\sim} N\left(\theta,\, I(\theta)^{-1}\right)$ for large $n$, where $I(\theta)$ is the Fisher information $$I(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2}\,\ell(\theta; X_1, \ldots, X_n)\right]$$ for a random sample $X_1, \ldots, X_n$ from $P_\theta$.



In this chapter we discuss how to use this (and other) approximate (large sample) sampling distributions to make inferences about population parameters.

10.1 Wald tests and CIs

A $100(1-\alpha)\%$ Wald approximate CI for a scalar parameter $\theta$ has the form $$\left(\hat\theta \pm z_{1-\alpha/2}\sqrt{I(\hat\theta)^{-1}}\right).$$

This CI is straightforward to obtain by the same argument we have used before. Start with the identity $$1-\alpha \approx P\left(z_{\alpha/2} < \frac{\hat\theta - \theta}{\sqrt{I(\theta)^{-1}}} < z_{1-\alpha/2}\right),$$ which follows from the approximate pivot discussed above, and in Chapter 2. Then, simply multiply through by the standard error $\sqrt{I(\theta)^{-1}}$, subtract $\hat\theta$ from both sides, and multiply by $-1$ to get $$1-\alpha \approx P\left(\hat\theta - z_{1-\alpha/2}\sqrt{I(\theta)^{-1}} < \theta < \hat\theta + z_{1-\alpha/2}\sqrt{I(\theta)^{-1}}\right).$$ Finally, since $\theta$ is unknown, replace $\theta$ by $\hat\theta$ in the standard error to yield the so-called Wald-type CI $\left(\hat\theta \pm z_{1-\alpha/2}\sqrt{I(\hat\theta)^{-1}}\right)$.
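As a quick illustration, here is a minimal sketch in R for an exponential model, where the MLE of the rate is $1/\bar X$ and the Fisher information is $I(\lambda) = n/\lambda^2$. The vector x below is simulated, hypothetical data standing in for an observed sample.

set.seed(1)
x <- rexp(50, rate = 2)       # hypothetical data; in practice, your observed sample
n <- length(x)
lambda.hat <- 1/mean(x)       # MLE of the exponential rate
se <- sqrt(lambda.hat^2/n)    # sqrt of the inverse Fisher information n/lambda^2
lambda.hat + c(-1, 1)*qnorm(0.975)*se   # approximate 95% Wald CI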

If $\theta = (\theta_1, \ldots, \theta_p)$ is a vector, then the marginal Wald approximate CI for $\theta_j$ is given by $$\left(\hat\theta_j \pm z_{1-\alpha/2}\sqrt{\left[I(\hat\theta)^{-1}\right]_{j,j}}\right),$$ where $I(\theta)$ is the Fisher information matrix and $\left[I(\hat\theta)^{-1}\right]_{j,j}$ is the $j$th diagonal entry of its inverse.
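In R, the marginal CIs only require the diagonal of the inverse information matrix. A sketch, where theta.hat and info are placeholder names for the MLE vector and the Fisher information matrix evaluated at the MLE:

se <- sqrt(diag(solve(info)))                  # marginal standard errors
cbind(lower = theta.hat - qnorm(0.975)*se,
      upper = theta.hat + qnorm(0.975)*se)     # 95% marginal Wald CIs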



Besides CIs, the approximate pivot may be used for hypothesis tests as well. To test $H_0: \theta = \theta_0$ versus $H_a: \theta \ne \theta_0$ for a scalar parameter $\theta$, reject the null hypothesis if $|z| > z_{1-\alpha/2}$, where $$z = \frac{\hat\theta - \theta_0}{\sqrt{I(\theta_0)^{-1}}}.$$ (Alternatively, one may use $\hat\theta$ in the standard error rather than $\theta_0$; under the null hypothesis, there is little difference.)
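Continuing the hypothetical exponential sketch above, a Wald test of $H_0: \lambda = \lambda_0$ might look like the following (the value of lambda0 is made up for illustration):

lambda0 <- 1.5                                 # hypothesized rate (hypothetical)
z <- (lambda.hat - lambda0)/sqrt(lambda0^2/n)  # standard error evaluated at lambda0
2*(1 - pnorm(abs(z)))                          # two-sided approximate p-value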



For a vector parameter $\theta$, let $$W = (\hat\theta - \theta_0)^\top I(\theta_0)\,(\hat\theta - \theta_0).$$ Under $H_0: \theta = \theta_0$, $W \overset{\cdot}{\sim} \chi^2(p)$. And, if we use the estimated covariance $I(\hat\theta)^{-1}$ instead, so that $$V = (\hat\theta - \theta_0)^\top I(\hat\theta)\,(\hat\theta - \theta_0),$$ then $V/p \overset{\cdot}{\sim} F(p, n-p)$. So, we get either a chi-squared or an F-based test for a vector parameter.
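A sketch of the chi-squared version in R, again with placeholder names: theta.hat is the MLE vector, theta0 the hypothesized vector, and info0 the Fisher information matrix evaluated at theta0.

d <- theta.hat - theta0
W <- as.numeric(t(d) %*% info0 %*% d)   # Wald statistic
p <- length(theta.hat)
1 - pchisq(W, df = p)                   # approximate p-value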

10.1.1 Example: Auto insurance claims

We have data on 100 auto insurance claims, stored in the R vector data.

Given that the auto claims are positively skewed, we will assume a Gamma distribution reasonably models the population and fit a Gamma to the data by maximum likelihood.

The Gamma density has the form $$f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-x\beta}, \qquad x, \alpha, \beta > 0,$$ where $(\alpha, \beta)$ are the shape and rate parameters. The likelihood for a random sample of size $n$ is $$L(\alpha, \beta; X_1, \ldots, X_n) = \left(\frac{\beta^\alpha}{\Gamma(\alpha)}\right)^{n} \prod_{i=1}^n x_i^{\alpha-1}\; e^{-\beta \sum_{i=1}^n x_i}.$$ Taking the log, we have the loglikelihood function (suppressing dependence on the data) $$\ell(\alpha, \beta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha - 1)\sum_{i=1}^n \log(x_i) - \beta\sum_{i=1}^n x_i.$$ Next, we need the gradient (the first partial derivatives) with respect to the parameters $(\alpha, \beta)$: $$\frac{\partial \ell}{\partial \alpha} = n\log\beta - n\,\text{DiGamma}(\alpha) + \sum_{i=1}^n \log(x_i),$$ where the function $\text{DiGamma}(\alpha)$ is the logarithmic derivative of the Gamma function. Next, we have $$\frac{\partial \ell}{\partial \beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^n x_i.$$

We also need the second derivatives in order to compute the Fisher information: $$\frac{\partial^2 \ell}{\partial \alpha^2} = -n\,\text{TriGamma}(\alpha),$$ where the TriGamma function is (you guessed it!) the second logarithmic derivative of the Gamma function $\Gamma(\alpha)$;
$$\frac{\partial^2 \ell}{\partial \alpha\, \partial \beta} = \frac{n}{\beta}, \qquad \frac{\partial^2 \ell}{\partial \beta^2} = -\frac{n\alpha}{\beta^2}.$$

You may have guessed that we will not be able to compute the MLEs by hand by solving $\partial\ell/\partial\alpha = 0$ and $\partial\ell/\partial\beta = 0$. Instead, we need an iterative solver (an algorithm). In R we can use optim to solve for the MLEs. Or, we can code our own version of Newton’s method…

n <- length(data)
# Gamma(shape, rate) loglikelihood; param = c(alpha, beta)
loglik <- function(param){
  sum(dgamma(data, shape = param[1], rate = param[2], log = TRUE))
}
# gradient (score) of the loglikelihood
grad <- function(param){
  g1 <- n*log(param[2]) - n*digamma(param[1]) + sum(log(data))
  g2 <- n*param[1]/param[2] - sum(data)
  return(c(g1,g2))
}
# Hessian of the loglikelihood
H <- function(param){
  h1 <- -n*trigamma(param[1])
  h2 <- n/param[2]
  h3 <- -n*param[1]/(param[2]^2)
  return(matrix(c(h1,h2,h2,h3),2,2))
}

# method-of-moments estimates used as starting values
m <- mean(data)
v <- var(data)

beta0 <- m/v
alpha0 <- m*beta0

# Newton's method: stop when the largest parameter change is below delta
delta <- 0.00001
diff.step <- 1
par.old <- c(alpha0, beta0)
steps <- 0
while((diff.step > delta) & (steps < 100)){
  par.new <- par.old - solve(H(par.old))%*%matrix(grad(par.old),2,1)
  diff.step <- max(abs(par.new - par.old))
  steps <- steps + 1
  par.old <- par.new
}
par.old
##             [,1]
## [1,] 1.966640491
## [2,] 0.001948425
loglik(par.old)
## [1] -780.5969
H(par.old)
##             [,1]         [,2]
## [1,]   -65.86955     51323.52
## [2,] 51323.51588 -51803341.74
solve(H(par.old))
##               [,1]          [,2]
## [1,] -6.657160e-02 -6.595499e-05
## [2,] -6.595499e-05 -8.464786e-08
# fnscale = -1 tells optim to maximize; hessian = TRUE returns the Hessian at the optimum
mle.optim <- optim(c(alpha0, beta0), loglik, gr = grad, control = list(fnscale = -1), method = 'BFGS', hessian = TRUE)
## Warning in dgamma(data, shape = param[1], rate = param[2], log = TRUE): NaNs
## produced
mle.optim
## $par
## [1] 1.966641732 0.001948426
## 
## $value
## [1] -780.5969
## 
## $counts
## function gradient 
##       34        4 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##             [,1]         [,2]
## [1,]   -65.86951     54017.32
## [2,] 54017.31581 -70328540.92
solve(mle.optim$hessian)
##               [,1]          [,2]
## [1,] -4.101660e-02 -3.150366e-05
## [2,] -3.150366e-05 -3.841603e-08
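To tie this back to the Wald intervals above: for the Gamma loglikelihood the Hessian does not involve the data values, so the negative Hessian at the MLE coincides with the Fisher information $I(\hat\theta)$. Here is a sketch of approximate 95% marginal Wald CIs for the shape and rate based on the Newton's-method fit:

info.inv <- solve(-H(par.old))     # inverse Fisher information at the MLE
se <- sqrt(diag(info.inv))         # standard errors for (alpha, beta)
est <- as.vector(par.old)
cbind(estimate = est,
      lower = est - qnorm(0.975)*se,
      upper = est + qnorm(0.975)*se)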