References
Arthern, R.J. and Hindmarsh, R.C.A. 2003. Optimal estimation of changes in the mass of ice sheets. J. Geophys. Res., 108(F1), 6007. (10.1029/2003JF000021.)
Berger, J.O. 1985. Statistical decision theory and Bayesian analysis. New York, Springer-Verlag.
Berliner, L.M. 1996. Hierarchical Bayesian time series models. In Hanson, K. and Silver, R., eds. Maximum entropy and Bayesian methods. Dordrecht, etc., Kluwer Academic Publishers, 15–22.
Berliner, L.M. 2003. Physical–statistical modelling in geophysics. J. Geophys. Res., 108(D24), 8776. (10.1029/2002JD002865.)
Berliner, L.M. and Wikle, C.K. 2006. Approximate importance sampling Monte Carlo for data assimilation. Physica D, 230(1–2), 37–49.
Berliner, L.M., Wikle, C.K. and Cressie, N. 2000. Long-lead prediction of Pacific SSTs via Bayesian dynamic modelling. J. Climate, 13(13), 3953–3968.
Berliner, L.M., Milliff, R.F. and Wikle, C.K. 2003. Bayesian hierarchical modelling of air–sea interaction. J. Geophys. Res., 108(C4), 3104. (10.1029/2002JC001413.)
Blatter, H. 1995. Velocity and stress fields in grounded glaciers: a simple algorithm for including deviatoric stress gradients. J. Glaciol., 41(138), 333–344.
Bruce, A. and Gao, H.Y. 1996. Applied wavelet analysis with S-PLUS. New York, Springer-Verlag.
DeGroot, M.H. 1970. Optimal statistical decisions. New York, McGraw-Hill.
Epstein, E.S. 1985. Statistical inference and prediction in climatology: a Bayesian approach. Boston, MA, American Meteorological Society.
Evensen, G. and van Leeuwen, P.J. 2000. An ensemble Kalman smoother for nonlinear dynamics. Mon. Weather Rev., 128(6), 1852–1867.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. 2004. Bayesian data analysis. Second edition. Boca Raton, FL, Chapman and Hall/CRC.
Goldstein, R.M., Engelhardt, H., Kamb, B. and Frolich, R.M. 1993. Satellite radar interferometry for monitoring ice sheet motion: application to an Antarctic ice stream. Science, 262(5139), 1525–1530.
Gudmundsson, G.H. 2006. Case study: estimating basal properties of glaciers from surface measurements. In Knight, P.G., ed. Glacier science and environmental change. Oxford, Blackwell, 415–417.
Joughin, I., Fahnestock, M., MacAyeal, D., Bamber, J.L. and Gogineni, P. 2001. Observation and analysis of ice flow in the largest Greenland ice stream. J. Geophys. Res., 106(D24), 34,021–34,034.
Kamb, B. and Echelmeyer, K.A. 1986. Stress-gradient coupling in glacier flow: I. Longitudinal averaging of the influence of ice thickness and surface slope. J. Glaciol., 32(111), 267–284.
Lorenc, A.C. 1986. Analysis methods for numerical weather prediction. Q. J. R. Meteorol. Soc., 112(474), 1177–1194.
MacAyeal, D.R. 1989. Large-scale ice flow over a viscous basal sediment: theory and application to Ice Stream B, Antarctica. J. Geophys. Res., 94(B4), 4071–4087.
MacAyeal, D.R. 1993. A tutorial on the use of control methods in ice-sheet modeling. J. Glaciol., 39(131), 91–98.
Meinhold, J. and Singpurwalla, N. 1983. Understanding the Kalman filter. Am. Stat., 37(2), 123–127.
Paterson, W.S.B. 1994. The physics of glaciers. Third edition. Oxford, etc., Elsevier.
Robert, C.P. and Casella, G. 1999. Monte Carlo statistical methods. New York, Springer-Verlag.
Tarantola, A. 1987. Inverse problem theory: methods for data fitting and model parameter estimation. New York, Elsevier.
Tebaldi, C., Smith, R.L., Nychka, D. and Mearns, L.O. 2005. Quantifying uncertainty in projections of regional climate change: a Bayesian approach to the analysis of multimodel ensembles. J. Climate, 18(10), 1524–1540.
Van der Veen, C.J. 1999. Fundamentals of glacier dynamics. Rotterdam, A.A. Balkema.
Vidakovic, B. 1999. Statistical modeling by wavelets. New York, etc., Wiley.
Whillans, I.M. 1987. Force budget of ice sheets. In Van der Veen, C.J. and Oerlemans, J., eds. Dynamics of the West Antarctic ice sheet. Dordrecht, etc., D. Reidel Publishing Co., 17–36.
Wikle, C.K. and Berliner, L.M. 2006. A Bayesian tutorial for data assimilation. Physica D, 230(1–2), 1–16.
Wikle, C.K., Berliner, L.M. and Milliff, R.F. 2003. Hierarchical Bayesian approach to boundary value problems with stochastic boundary conditions. Mon. Weather Rev., 131(6), 1051–1062.
Appendix A Details of the Hierarchical Model
Basal model: parameter priors
Our prior for the measurement-error variance is an inverse gamma distribution with mean 50 and standard deviation 15. We also assumed an inverse gamma prior distribution for with mean 2000 and standard deviation 200.
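Since both priors are specified through their mean and standard deviation, the conversion to the usual inverse gamma shape/scale parameterization may be worth recording (a minimal sketch; the function name is ours):

```python
def invgamma_params(mean, sd):
    """Shape and scale of an inverse gamma with the given mean and sd.

    For shape a > 2 and scale b: mean = b / (a - 1) and
    var = b^2 / ((a - 1)^2 (a - 2)); solving for (a, b) gives the
    formulas below.
    """
    shape = (mean / sd) ** 2 + 2.0   # a = (m/s)^2 + 2
    scale = mean * (shape - 1.0)     # b = m (a - 1)
    return shape, scale

# The two basal-model priors described above:
a1, b1 = invgamma_params(50.0, 15.0)     # measurement-error variance
a2, b2 = invgamma_params(2000.0, 200.0)  # second variance parameter
```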
While we could treat ϕ_{1} and ϕ_{2} as unknown parameters, they are of only minor interest here and small variations in their values do not affect our results. Hence, we considered ordinary least-squares fits to the data and used the resulting residuals to estimate ϕ_{1} and ϕ_{2}, resulting in
Next, we describe our prior for the wavelet coefficients. Regarding the four coefficients C_{s} for the smooth signal,
where μ is the vector of conventional, ordinary least-squares estimates of Haar-wavelet coefficients. Regarding the coefficients for the detail signal, we assume their prior means to be zero and that they are independent.
The model is completed by forming priors for the variances of the wavelet coefficients. We assumed these variances to be independent and to have inverse gamma distributions. For , we used prior mean 200^{2} and standard deviation 40. For the variance of the detail signals, we assumed prior mean 1000^{2} and prior standard deviation 100.
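The smooth/detail split underlying these priors can be illustrated with a minimal orthonormal Haar decomposition down to four smooth coefficients (a sketch with our own helper names, assuming the signal length is a power of two):

```python
import numpy as np

def haar_level(x):
    """One level of the orthonormal Haar transform: pairwise sums and
    differences, scaled so total energy is preserved."""
    x = np.asarray(x, dtype=float)
    smooth = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return smooth, detail

def haar_coeffs(x, n_smooth=4):
    """Repeatedly transform the smooth part until only `n_smooth`
    smooth coefficients remain; return them plus the detail signals."""
    details = []
    smooth = np.asarray(x, dtype=float)
    while smooth.size > n_smooth:
        smooth, d = haar_level(smooth)
        details.append(d)
    return smooth, details
```

Because the Haar basis is orthonormal, ordinary least-squares estimates of the coefficients (as used for the prior mean μ) coincide with the transform coefficients themselves.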
Surface modeling
Recall that we assume s(x) = ψ_{s}(x) + N_{s}(x) (Equation (21)), where the large-scale parameterized surface ψ_{s} is defined in Equation (22). Assume N_{s} is a zero-mean, Gaussian spatial stochastic process with constant variance and stationary correlation function γ(x, x′) = γ(x − x′).
To produce a data model readily usable to isolate the large-scale surface, we first assumed that the individual observations were conditionally independent, with means equal to the true surface values. Next, we formally eliminated the small-scale process N_{s} from the model via a probabilistic method. The result is observations whose means are equal to the large-scale model but whose values are correlated; informally, they inherit the correlation structure of N_{s}.
Rather than dealing with the overhead of modeling that correlation structure, we make a simple assumption: for all locations separated by at least 150 m, the correlation is approximately zero. (We arrived at this value by fitting the surface by conventional least-squares analysis and then inspecting the correlation function of the residuals of the fitted surface. Residuals at locations separated by more than 150 m were essentially uncorrelated.)
We then took a subsample of the surface data such that all observations were at least 150 m apart, which resulted in a subsample of 600 observations. We lose very little efficiency by basing our analysis of the surface on such a subsample: 600 is a very large sample size for estimating three parameters (μ, K and L). Further, we tried a total of six such subsamples and obtained essentially the same results. Of course, the analysis hinges on the approximation that the decorrelation length scale of small-scale variations is 150 m.
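The subsampling step can be sketched as a greedy pass over the sorted locations (an illustrative helper; different starting points yield alternative subsamples of the kind we compared):

```python
import numpy as np

def subsample_min_separation(x, min_sep=150.0, start=0):
    """Greedily keep sorted 1-D locations so that consecutive retained
    points (and hence every retained pair) are at least `min_sep` apart."""
    x = np.sort(np.asarray(x, dtype=float))
    keep = [start]
    for i in range(start + 1, x.size):
        if x[i] - x[keep[-1]] >= min_sep:
            keep.append(i)
    return x[keep]

# Example: observations every 50 m along a 3 km transect.
locs = np.arange(0.0, 3000.0, 50.0)
sub = subsample_min_separation(locs, 150.0)
```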
Data model
We assume that, conditional on the true surface and the measurement-error variance ,
where s is the vector of true surface values at the observation locations. Standard results from probability theory imply that if we integrate out N_{s} in Equation (21), the resulting data model is
where ψ_{s} is the vector of the large-scale parametric surface model evaluated at the observation locations and Γ is the matrix of correlations of N_{s} at pairs of those locations, as implied by the correlation function γ.
Process model
A Bayesian or a non-Bayesian analysis (e.g. generalized least squares) based on Equation (A3) requires modeling and estimation of Γ (in parallel with the estimation of Σ(ϕ_{1}, ϕ_{2}) in Equation (19)). Assuming a decorrelation length of 150 m, a subsample S_{l} such that all observations are at least 150 m apart has distribution
where and ψ_{s,l} is the vector of large-scale surface values at the subsample locations.
Parameter model
Under the assumptions and simplifications outlined above, we need only consider a prior for the parameters μ, K, L and . Probability theory greatly simplifies our task: having reduced the original problem to one of only four parameters and 600 conditionally independent observations, we can approximate the desired posterior. For all but the most highly concentrated priors, the posterior is approximately Gaussian, with mean given by the conventional least-squares estimates and covariance matrix given by the estimated inverse information matrix for our setting (Berger, 1985, section 4.9).
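For a linear design this Gaussian approximation is available in closed form; a sketch under that simplifying assumption (illustrative names, flat prior):

```python
import numpy as np

def gaussian_posterior_approx(X, y, sigma2):
    """Gaussian posterior approximation for linear-model coefficients
    under a flat prior: mean = least-squares estimate, covariance =
    sigma2 * (X^T X)^{-1}, the inverse of the information matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    return beta_hat, sigma2 * XtX_inv
```

For a nonlinear parametric model such as ψ_{s}, the same construction applies approximately with X replaced by the Jacobian of the fitted model at the least-squares solution.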
Velocity modeling
Data model
We assume
where u is the vector of true velocities at the observation locations and is the measurement-error variance.
Process model
Based on the corresponding smoothed versions of thickness (Equation (27)) and stress (Equation (28)), we consider the change-point models
This leads to the data models
as given in Equation (30).
Parameter model
We next describe our prior distribution for the unknown parameters k, u_{b}, A_{1}, A_{2}, and c.
Our prior for k, u_{b}, A_{1}, A_{2} and is the following so-called conjugate multivariate normal-inverse gamma distribution. The prior is described hierarchically as
where
and
where all off-diagonal elements of W are zero and
Under these specifications, the prior means of the regression coefficients are given in Equation (A11), the prior variances of the regression coefficients are
and the prior mean and variance of are both equal to 9 (Goldstein and others, 1983).
We remark that the prior above may seem cumbersome in that we model the regression-model parameters conditioned on the error variance . However, this strategy leads to a computable posterior distribution for the regression parameters and . Specifically,
where the first distribution is the usual multivariate normal posterior holding fixed, and is an inverse gamma distribution with updated parameters. Hence, it was a popular prior model before the advent of MCMC. A complete discussion and explanation is given by DeGroot (1970, ch. 9); see also Berger (1985, p. 288). We use the prior in this case because it readily adapts to our use of ISMC.
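For reference, the conjugate normal-inverse gamma update can be sketched in its generic linear-model form (illustrative names; the prior moments specific to our model are those of Equations (A11–A13)):

```python
import numpy as np

def nig_posterior(X, y, mu0, V0, a0, b0):
    """Conjugate normal-inverse gamma update for y = X b + e,
    e ~ N(0, sigma2 I), with prior b | sigma2 ~ N(mu0, sigma2 V0)
    and sigma2 ~ IG(a0, b0)."""
    n = y.size
    V0_inv = np.linalg.inv(V0)
    Vn_inv = V0_inv + X.T @ X
    Vn = np.linalg.inv(Vn_inv)
    # posterior mean of the coefficients
    mun = Vn @ (V0_inv @ mu0 + X.T @ y)
    # updated inverse gamma parameters for sigma2
    an = a0 + 0.5 * n
    bn = b0 + 0.5 * (y @ y + mu0 @ V0_inv @ mu0 - mun @ Vn_inv @ mun)
    return mun, Vn, an, bn
```

The posterior of the coefficients given sigma2 is N(mun, sigma2 Vn), and sigma2 has an IG(an, bn) posterior: exactly the two-stage structure described in the text.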
To closely mimic an uninformative analysis regarding the change point, we performed a least-squares piecewise linear regression with an unknown change point. We found very strong agreement on the value of c and hence treated c as known and equal to 127.3 km.
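Such a change-point search can be sketched by profiling the linear coefficients over a grid of candidate change points (an illustrative continuous piecewise-linear "hinge" parameterization, not necessarily the exact form of Equation (30)):

```python
import numpy as np

def fit_changepoint(x, y, candidates):
    """Piecewise-linear least squares with an unknown change point:
    fit the linear coefficients at each candidate c and keep the c
    that minimizes the residual sum of squares."""
    best_rss, best_c = np.inf, None
    for c in candidates:
        # basis: intercept, slope, and a slope change at c
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - c, 0.0)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        if rss < best_rss:
            best_rss, best_c = rss, c
    return best_c
```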
Appendix B MCMC–ISMC Approach
Our computational approach combines MCMC and ISMC. The following derivations are motivated by a need to make the ISMC steps efficient. To achieve efficiency, we seek to generate ensembles of unknown quantities that are reasonably similar to ensembles generated directly from the posterior distribution. To do so, we apply Bayes’ theorem to portions of the hierarchical model in an attempt to allow the observational data to strongly influence the simulations. The details of these derivations involve interplay of our modeling assumptions and repeated use of two basic facts from probability theory. First, for random quantities A and B, we have [A|B][B] = [B|A][A] (Bayes’ theorem). Second, such expressions hold conditionally: if C is also a random quantity, then [A|B, C][B|C] = [B|A, C][A|C].
We may write the full posterior distribution as
Note that we have incorporated the basal and surface datasets into a partial posterior distribution for the basal and surface processes and parameters. This model is amenable to simple MCMC. Further, as indicated in Figure 2, simulated smoothed basal topographies from this partial posterior are clearly responsive to the observations. Next, we rewrite the first two terms on the right side of Equation (B1) as
and then note that
In summary, we have
Our basic algorithm is as follows.

1. Via MCMC, produce an ensemble of size M from the partial posterior distribution (recall Equation (B1)). The result includes an ensemble of smoothed basal and surface elevations.

2. For each m, generate η^{m} from .

3. Similarly, generate from

Finally, define ISMC weights:
Coupling these weights with the ensemble members yields a weighted ensemble from the full posterior distribution.
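The weighting step can be sketched with self-normalized importance weights computed on the log scale for numerical stability (a toy illustration with our own names, not the exact weights of Equation (B5)):

```python
import numpy as np

def ismc_weights(log_target, log_proposal, samples):
    """Self-normalized importance weights for an ensemble drawn from the
    proposal, targeting an (unnormalized) target density.  Normalizing
    constants cancel, so the log densities may omit them."""
    log_w = log_target(samples) - log_proposal(samples)  # vectorized callables
    log_w -= log_w.max()                                 # guard against overflow
    w = np.exp(log_w)
    return w / w.sum()

# Toy check: reweight N(0, 2^2) proposal draws toward a N(1, 1) target.
rng = np.random.default_rng(0)
draws = rng.normal(0.0, 2.0, size=20000)
w = ismc_weights(lambda x: -0.5 * (x - 1.0) ** 2,
                 lambda x: -0.5 * (x / 2.0) ** 2,
                 draws)
posterior_mean = np.sum(w * draws)  # close to the target mean of 1
```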
We remark that the selection of the prior described in Appendix A for θ_{u} and enables the exact calculation of the partial posterior distribution needed in step 3 (Equation (A14)). However, there are two difficulties with implementation in our example. First, [U|ψ_{b}, ψ_{s}] is not easily obtained, disabling our ability to compute the weights in Equation (B5). To deal with this problem, we apply the general reasoning suggested by Berliner and Wikle (2006) to claim that the earlier steps in the analysis, particularly step 3, have sufficiently accounted for the information in the velocity data regarding the unknowns. Therefore, a good approximate analysis arises if we ignore these weights.
Similarly, step 2 is problematic in that the required distribution is not obtainable in closed form. Recall that our prior for the elements of η is that they are independent, uniform random variables on the interval (−100, 100). Hence,
for vectors η lying in the 2880-dimensional cube with sides given by (−100, 100). We can obtain the exact form of [U|η, ψ_{b}, ψ_{s}]. Specifically, it is a multivariate t density (Berger, 1985, p. 561), i.e.
where μ, W and (a, b) are defined in Equations (A11–A13) and Z = Z(η, ψ_{b}, ψ_{s}) is a 2880 × 4 matrix obtained by rewriting Equation (30) as
Although we have Equation (B7), it is intractable viewed as a probability density function for η. To deal with this problem, we appeal to importance sampling: rather than sampling from [η|U, ψ_{b}, ψ_{s}], we sample from another distribution, [η]_{IS}, chosen to mimic it. Specifically, we used a multivariate t distribution, with mean specified by inverting the regression model and degrees-of-freedom parameter equal to 10. By inverting the regression model, we mean the following. For each observation of velocity U(x) at a location x after the change point, and each ensemble member, our regression model (Equation (30)) is
where we have suppressed notation indicating dependence on the ensemble member. We then simply approximate e_{U} by 0, yielding the approximation
The right side of this expression is then used to specify the mean of [η]_{IS}. This may appear to be cheating, but it is perfectly legitimate: one may construct importance-sampling distributions in any required fashion, including using the data, as long as ISMC weights are used. These weights are readily approximated by
]_{IS}. This may appear to be cheating, but it is perfectly legal. One may construct importancesampling distributions in any required fashion, including using the data, as long as ISMC weights are used. These weights are readily approximated by