1. Introduction
One of the tasks presently facing glaciologists is to advise how the Greenland and Antarctic ice sheets might contribute to sea-level rise under the range of different climatic conditions that could occur in the future. To be genuinely useful to policymakers, planners of coastal infrastructure and other investors that are sensitive to future sea level, these glaciological forecasts will need to deliver information about the probability of the various possible outcomes that could be realized by the ice sheets. This makes it important to characterize the probability density function (PDF) of the sea-level contributions from Greenland and Antarctica.
Some estimates of how the climate of the atmosphere and oceans might evolve are available from general circulation models. These climate projections can be used to force dynamical models of the flow of an ice sheet, giving a forecast of the future contribution to sea level (e.g. Joughin and others, 2014; Cornford and others, 2015). At present these glaciological simulations are well adapted to investigating the sensitivity of the forecast to various perturbations in forcing, model parameters or initial conditions (e.g. Bindschadler and others, 2013). However, unless we can obtain reliable estimates of the probability of any particular perturbation actually occurring, the models cannot be used to evaluate the PDF of the contributions that ice from Greenland and Antarctica will make to sea level.
Glaciological forecasts over the next century or two will only be accurate if the models used to simulate the future begin in a state for which the geometry and flow speed are closely representative of the present-day ice sheets in Greenland and Antarctica. As in weather forecasting, the selection of the initial conditions for the simulation is an important component of the forecast. The procedure for setting up a model in a realistic starting state is known as initialization. One of the best ways to initialize large ice-sheet models is to use inverse methods (e.g. MacAyeal, 1992). These optimize the basal drag coefficient, viscosity or similar model parameters, to ensure that the model state from which the forecast proceeds agrees closely with a wide variety of measurements from satellites, aircraft and field campaigns. In this way the model starts from a state where the shape and flow speed of the ice accurately reflect what is happening now.
Ultimately, we are seeking to determine the complete PDF for the sea-level rise contribution from Greenland and Antarctica at times in the future that are relevant to planning decisions. Any errors in the initial conditions will propagate to give uncertainty in the simulation of future behavior. To quantify this uncertainty it is important to first characterize the uncertainty in the initial conditions in probabilistic terms. This makes a Bayesian approach to model initialization attractive, since it offers a probabilistic interpretation, while allowing information from satellites and other observational data to influence the joint PDF for viscosity, drag coefficient and other parameter values, as well as the initial values of state variables. Once this joint PDF has been obtained it can either be used to design ensembles for Monte Carlo experiments that evolve multiple simulations with a variety of initial conditions and parameter values, or as input to more formal probabilistic calculations that evaluate how uncertainty in the present state will propagate into the simulation used to forecast the future of the ice sheet. In this study we adopt a Bayesian approach to the problem of model initialization, and to the inversion of model parameters, such as the basal drag coefficient and viscosity.
A number of Bayesian inversions have been described previously by glaciologists (e.g. Berliner and others, 2008; Gudmundsson and Raymond, 2008; Raymond and Gudmundsson, 2009; Tarasov and others, 2012; Petra and others, 2014; Zammit-Mangion and others, 2014). A key requirement in applying Bayesian methods is the definition of a prior probability distribution for the parameters that we wish to identify. In this paper, our particular focus will be on how we can specify prior information for model parameters that we have very little useful information about. A good example is the basal drag coefficient. This can vary enormously, depending on details of the subglacial environment that are completely unknown to us. The drag coefficient can be effectively zero for ice floating on water, but effectively infinite for ice frozen motionless to the bed. In many places in Greenland and Antarctica we do not know which condition applies. Furthermore, we do not know whether there are narrow water-filled channels, large water-filled cavities or broad sheets of water under the ice, so we cannot specify the length scale on which the basal drag coefficient might vary with any certainty. This makes it difficult to specify the prior PDF that is needed for any Bayesian inversion of this parameter.
One of our goals in this study is to reduce the subjectivity attached to glaciological forecasts. The general approach of defining the initial state of an ice-sheet model using inverse methods and then running the model forward in time to produce a forecast of the future might seem to provide a strategy for prediction that is physically based, mechanistic and largely free of subjectivity. By free of subjectivity we mean that different scientists should provide the same forecasts of the future behaviour of the ice sheet, assuming: (1) they are given the same set of observations; (2) they make the same rheological assumptions about the deformation of ice or sediment; and (3) they use the same conservation equations in the physical model that represents forces and mass fluxes within the ice sheet. However, even with observations, rheological assumptions and conservation equations in common, there is scope for making subjective decisions in the application of inverse methods that are used to identify parameters in the model, or the initial values for state variables. This applies particularly to the specification of the prior PDF for those parameters.
Subjective decisions made in defining the prior PDF will influence the initial state, and this, in turn, will affect the forecast of the ice sheet. The rate of ice flow into the ocean is sensitive to the basal drag coefficient and the ice viscosity (Schoof, 2007). Furthermore, the forecast of the ice sheet is typically obtained by solving a nonlinear system of equations, and it may be quite sensitive to small changes in initial conditions or parameter values. Models that specify different prior PDFs for the spatial variations in viscosity and basal drag could potentially produce quite different projections of sea level.
The subjectivity attached to glaciological inverse methods is not usually emphasized, and not much consideration has been given to whether it is important or not, so we consider it in some detail here. We do not claim to have a recipe to eliminate all subjective decisions from glaciological forecasts, nor is it our intention to criticize glaciological inversions that have relied upon them. There will always be decisions about which model to use, which datasets to include, which parameters to invert for and which methods to use to regularize the inversion. In common with many previous studies, our work has involved a variety of such decisions in mapping spatial patterns of basal drag and estimating the flow speeds within the Antarctic ice sheet (Arthern and others, 2015). The motivation for the present study is to explore whether this approach can be improved upon by working within a probabilistic framework. Tasked with providing probabilistic estimates of the contribution of the ice sheets to sea level, our goal is that those forecasts should be made as objectively as currently available techniques allow.
2. Bayesian Inference of Model Parameters using Observations
Suppose we are trying to estimate a vector, θ = [θ_{1}, θ_{2}, …, θ_{N}]^{T}, of N parameters, which may include the basal drag coefficient, β, at many different locations and the viscosity, η, at many different points within the ice sheet. Bayes' theorem provides a recipe for modifying a prior PDF for these parameters, p(θ), to include the information provided by new data, x = [x_{1}, x_{2}, …, x_{M}]^{T}, which may include observations from satellites, aircraft, field parties or laboratory experiments. The result is the posterior PDF,

p_{p}(θ|x) = p_{l}(x|θ) p(θ) / p_{n}(x). (1)
The prior p(θ) is a PDF, defined such that p(θ) dθ is the probability that the parameters lie within a vanishingly small ‘volume’ element dθ = dθ_{1}dθ_{2} … dθ_{N}, located at θ, within an N-dimensional parameter space, Θ, that includes all possible values of the parameters. The term prior reflects that this is the PDF before we have taken account of the information provided by the data. The information provided by the data, x, is encoded in the likelihood function, p_{l}(x|θ). The likelihood function can be assumed known, provided two conditions are met. First, our physical model must be capable of estimating the measured quantities, x, if supplied with parameter values, θ. Second, we must be able to estimate the PDF of residuals between these model-based estimates and the data, x (e.g. if model deficiencies can be neglected, this amounts to knowing the distribution of the observational errors). The likelihood function, p_{l}(x|θ), is then proportional to the PDF for observing the data, x, given that the parameters take particular values, θ. The denominator, p_{n}(x), is defined as p_{n}(x) = ∫_{Θ} p_{l}(x|θ)p(θ)dθ, and can simply be viewed as a normalizing constant, defined so the posterior PDF gives a total probability of ∫_{Θ} p_{p}(θ|x)dθ = 1, when integrated over all possible values within the parameter space, Θ. To avoid ambiguity we will use subscripts to identify various different posterior PDFs (p_{p1}, p_{p2}, etc.), likelihoods (p_{l1}, p_{l2}, etc.), priors (p_{1}, p_{2}, etc.) and normalizing constants (p_{n1}, p_{n2}, etc.).
The notation for conditional probabilities, P(A|B), denotes the probability of event A given that B is true. The posterior, p_{p}(θ|x), is the PDF for the parameters, θ, given that the data, x, take the particular values observed. This means that, after we have taken account of all the information provided by the data, x, the posterior PDF, p_{p}(θ|x) dθ, gives the updated probability that the parameters lie within a small volume, dθ, of parameter space located at θ. Selecting the values of θ that maximize p_{p}(θ|x) provides a Bayesian estimate for the most likely value of the parameters.
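The Bayesian update can be made concrete with a minimal numerical sketch (ours, not from any particular glaciological inversion; the scalar parameter and all numerical values are invented for illustration). A broad Gaussian prior on a single parameter is combined with the likelihood of one noisy observation, and the posterior is normalized on a grid:

```python
import numpy as np

# Grid over a single (hypothetical) scalar parameter theta.
theta = np.linspace(0.0, 10.0, 1001)
dtheta = theta[1] - theta[0]

# Broad prior: we know little about theta beyond a plausible range.
prior = np.exp(-0.5 * ((theta - 5.0) / 3.0) ** 2)
prior /= prior.sum() * dtheta

# Likelihood of one noisy observation x = 4.2 with error sigma = 0.5.
x_obs, sigma = 4.2, 0.5
likelihood = np.exp(-0.5 * ((x_obs - theta) / sigma) ** 2)

# Posterior = likelihood * prior, divided by the normalizing constant p_n(x).
posterior = likelihood * prior
posterior /= posterior.sum() * dtheta

# Maximum a posteriori (MAP) estimate: the most likely parameter value.
theta_map = theta[np.argmax(posterior)]
```

Because the likelihood here is much narrower than the prior, the posterior peak sits close to the observation; this is the regime discussed next, in which the data dominate the prior.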
The likelihood function, p_{l}(x|θ), sometimes written L(θ; x) or L(θ), can be considered as a function of θ for the observed values of the data, x. It is sometimes the case when applying Bayes’ rule that the likelihood, L(θ), is negligible except within a narrowly confined region of parameter space, while the prior, p(θ), describes a much broader distribution. This situation would indicate great prior uncertainty in parameter values, θ, but much less uncertainty once the information from the data is incorporated using Bayes’ rule. In such cases, the information provided by the likelihood function, L(θ), overwhelms the information provided by the prior, p(θ). Specifying the prior accurately in such circumstances is perhaps not so important, since any sufficiently smooth function much broader than the likelihood function would produce a similar posterior PDF. However, we should not be complacent just because there are some circumstances in which it is not very important to specify the prior PDF accurately. There is no guarantee that this situation will correspond to glaciological inversions of the type that we are considering. Many aspects of the subglacial environment are barely constrained by data, so it is in our interests to specify the prior PDF carefully.
In this paper we will apply two principles advocated by Jaynes (2003) to constrain the choice of prior PDF: (1) we will exploit symmetries of the ice-sheet model, by requiring that the prior PDF is invariant to a group of transformations that do not alter the mathematical specification of the inverse problem, and (2) using this invariant prior as a reference function, we will include additional constraints by seeking the PDF that maximizes the relative entropy subject to those constraints. Both approaches are described in detail by Jaynes (2003), and we will only make brief introductory remarks about them (Sections 5 and 6). Our intention is to guide, and as far as possible eliminate, the subjective decisions made during the inverse problem that defines the initial state of the model from the observations, particularly with respect to the choice of a prior PDF for the parameters. Although we concentrate here on methods advocated by Jaynes (2003), reviews by Kass and Wasserman (1996) and Berger (2006) provide a broader perspective and include additional background on the use of formal rules for the selection of prior PDFs.
3. The Close Relationship between the Prior PDF and the Regularization of Inverse Methods
Although we will use Bayesian methods, our investigation is relevant to a wide variety of glaciological inverse methods, many of which have been described in the literature without mention of Bayes’ theorem.
In particular, our approach is related to a broad class of inverse methods that minimize a cost function, J_{misfit}, that quantifies the misfit between the model and data. A common example is choosing model parameters that minimize the mismatch between model velocities, u, and observations of the ice velocity, u*, from satellites, so that the cost function is the unweighted sum of the squares of the misfit, e.g. J_{misfit} = ½ ∑ |u − u*|^{2}, or some similar function weighted by estimates of the observational error covariance, R, e.g. J_{misfit} = ½ (u − u*)^{T} R^{−1} (u − u*). Other cost functions to characterize the misfit between the model and the data have been proposed (Arthern and Gudmundsson, 2010; Morlighem and others, 2010) and the choice of cost function is therefore one way that subjective decisions can influence the inversion.
There are other aspects of the inversion that require subjective decisions. Generally speaking, simply minimizing the mismatch with observations does not uniquely define the spatially varying fields of basal drag and viscosity. This is because many different combinations of parameters allow the model to agree equally well with the available observational data. This is often dealt with by constraining parameters, using some kind of explicit or implicit regularization.
Regularization introduces additional information about parameters (e.g. requiring that they be close to some estimated value, or that they are small in magnitude, or that they vary smoothly in space). Before proceeding we will describe how regularization of inverse methods as commonly applied in glaciology relates to our Bayesian inversion.
The purpose of regularization is either to turn an ill-posed problem into a well-posed problem, or an ill-conditioned problem into a well-conditioned problem. As defined by Hadamard (1902), an ill-posed problem either has no solution, more than one solution, or a solution that varies discontinuously when small perturbations are made to any quantitative information provided (i.e. data). In any of these three cases it becomes impossible to precisely define a solution to the problem. On a practical level, especially when performing calculations numerically, we may come across problems that are not exactly ill-posed in the above sense, but are ill-conditioned. This means that a unique solution exists, but small changes to the data from measurement errors or numerical round-off can result in large changes to that solution. If the resulting loss of precision is too great, we may be willing to constrain the solution in some other way, by regularizing the problem.
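The distinction can be illustrated with a small linear-algebra sketch (ours, not drawn from any glaciological model; the matrix and data values are invented). The system below has a unique solution, so it is well-posed in Hadamard's sense, but its two constraints are nearly parallel, so a tiny perturbation of the data produces a large change in the solution:

```python
import numpy as np

# Two nearly parallel linear constraints: a unique solution exists
# (well-posed), but the problem is ill-conditioned.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
x_clean = np.array([2.0, 2.0001])          # data consistent with solution [1, 1]
x_noisy = x_clean + np.array([0.0, 1e-4])  # add a tiny measurement error

sol_clean = np.linalg.solve(A, x_clean)
sol_noisy = np.linalg.solve(A, x_noisy)

# A perturbation of only 1e-4 in the data moves the solution
# from [1, 1] all the way to [0, 2].
cond = np.linalg.cond(A)   # large condition number (~4e4) flags the danger
```

The condition number quantifies how much relative error in the data can be amplified in the solution; regularization trades some fidelity to the data for a bound on this amplification.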
To be more concrete, we will give some simple examples of how a glaciological inversion can be ill-posed or ill-conditioned, beginning with an example of a problem that does not have a unique solution. Suppose we would like to find the initial state for a model of ice of constant thickness flowing down a featureless inclined plane. Furthermore, suppose we know the ice thickness, the surface elevation and the flow speed at the surface (i.e. the data). Suppose now that we have no information about the drag coefficient at the base of the ice, or the ice viscosity, but wish to determine these using inverse methods.
For a slab of the given thickness, the ice speed at the surface could be matched either by a rigid slab that is sliding at its base, or by a slab that is fixed at the base, but deforming through internal shearing (Fig. 1). None of the data provide information about the subsurface flow speed so we cannot distinguish between these two possibilities, or between these and some intermediate solution that is any combination of sliding and internal shearing that matches the specified surface velocity.
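A rough numerical sketch of this non-uniqueness (ours, using an illustrative linear-viscous slab with invented parameter values, not the rheology of any particular study): for any sufficiently large viscosity there is a drag coefficient that reproduces the same observed surface speed, so surface data alone cannot separate sliding from internal shearing.

```python
import numpy as np

# Slab of thickness H on slope alpha (linear-viscous rheology for simplicity;
# all parameter values are invented for illustration).
rho, g = 917.0, 9.81              # ice density (kg m^-3), gravity (m s^-2)
H, alpha = 1000.0, 0.01           # thickness (m), surface slope (rad)
tau_b = rho * g * H * np.sin(alpha)   # basal (driving) shear stress (Pa)

u_obs = 100.0 / 3.15576e7         # observed surface speed: 100 m/yr in m/s

def surface_speed(beta, eta):
    """Basal sliding (tau_b / beta) plus internal shearing (tau_b * H / (2 eta))."""
    return tau_b / beta + tau_b * H / (2.0 * eta)

# For each assumed viscosity, solve for the drag coefficient that fits u_obs.
etas = np.array([1e14, 1e15, 1e16])   # Pa s
betas = tau_b / (u_obs - tau_b * H / (2.0 * etas))

# Every (beta, eta) pair reproduces the observed surface speed exactly.
for beta, eta in zip(betas, etas):
    assert np.isclose(surface_speed(beta, eta), u_obs)
```

Each pair ranges from mostly shearing (low viscosity, high drag) to mostly sliding (high viscosity, lower drag), yet all match the data: the inversion has no unique solution without extra information.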
In practical applications, to avoid such non-uniqueness, it is rare that viscosity and basal drag are solved for simultaneously. Rather, it is commonly assumed that one or other of these quantities is known perfectly. On floating ice shelves the viscosity is usually solved for on the assumption that basal drag is zero. By contrast, on grounded ice, the basal drag is usually solved for on the assumption that the viscosity perfectly obeys some rheological flow law with known parameters, so that the viscosity can be considered known.
The assumption that either the basal drag or the viscosity is perfectly known regularizes what would otherwise be an ill-posed problem, by avoiding a multiplicity of non-unique solutions. However, there are difficulties with this approach. First, for ice shelves where the bathymetry is poorly mapped, it may be difficult to be certain there is no basal drag from some unidentified contact with the sea floor (Fürst and others, 2015). Second, on grounded ice, many factors that are not included in commonly used flow laws can affect the viscosity. These include such poorly known factors as impurity content, anisotropy, grain size, geothermal heating and damage from crevassing. This makes it problematic to assume we have perfect knowledge of the viscosity. Another consideration is that we have assumed in the problem specified above that the ice thickness is known perfectly. However, the thickness is poorly known in many places, and it might be useful to invert for this field, rather than assume it is known perfectly (Raymond and Gudmundsson, 2009). Clearly, the problem of non-uniqueness would become even more acute if the observations of ice thickness were unavailable and we tried to solve for ice thickness as well as the basal drag and viscosity.
In the above example there are three parameters that we would like to invert for: ice thickness, ice viscosity and basal drag coefficient. Rather than assume we have perfect knowledge of two of these, it would be more realistic to acknowledge that there is considerable uncertainty in each, and to seek a compromise solution that jointly reflects these uncertainties.
For problems with many uncertain parameters a Bayesian approach is attractive. Rather than one set of parameters that minimize the cost function, J_{misfit}, Bayesian inversion seeks the posterior joint PDF, p_{p}(θ|x), for the parameters. This means that the combined uncertainty in basal drag coefficient, viscosity and thickness can be evaluated. If we wish, we can later seek the values of the parameters that maximize the joint PDF, allowing us to solve simultaneously for the most likely values of all three quantities. To perform such Bayesian inversion we will need to define a prior PDF, p(θ), for the parameters. It is this aspect that we concentrate on in this paper.
As an example of ill-conditioning, suppose the slipperiness at the base of the slab is not uniform as assumed above, but has fluctuations on some scale. As the characteristic size of these fluctuations decreases, their effect on the flow at the surface will diminish, until they become too small to have any significant effect (Bahr and others, 1994; Gudmundsson, 2003). At the smallest scales their effect on the surface elevation and flow speed will be smaller than the accuracy with which these data are measured. Any inverse method that seeks to recover the fluctuations in basal drag on such a fine scale will be corrupted by the errors in surface elevation and surface velocity. In extreme cases, wild and unrealistic variations in basal drag might be introduced in an attempt to match the flow speed in the model to noise in the observations. This is known as overfitting. The usual remedy is to apply some form of regularization.
There are various different ways of regularizing inversions of basal drag to avoid overfitting, but a common approach is to enforce smoothness of the recovered pattern of basal drag. Many of the iterative algorithms that are used to minimize the cost function have a convenient property: they introduce features in basal drag on coarse scales in the first few iterations, then add progressively finer scales in later iterations (Maxwell and others, 2008; Habermann and others, 2012). Simply stopping after some number of iterations can prevent unrealistic fine-scale features being added. Deciding when to stop is a more vexing question, but there are criteria that can serve as a guide (Maxwell and others, 2008; Habermann and others, 2012). One remaining issue is that the regularized solution depends upon the initial guess for parameters used to start the very first iteration. Again, this is an opportunity for different people to make different choices.
A different form of regularization that is often used is Tikhonov regularization (e.g. Jay-Allemand and others, 2011; Gillet-Chaulet and others, 2012; Sergienko and Hindmarsh, 2013). Here the data–model misfit cost function, J_{misfit}, is replaced by J_{total} = J_{misfit} + J_{reg}, where J_{reg} is a term that penalizes solutions for the basal drag coefficient, β, that are not smooth and promotes those that are. A common choice is J_{reg} = λ_{reg} ∫|∇β|^{2} dS, for some constant λ_{reg}, which adds a term proportional to the area integral of the square of the magnitude of the horizontal gradient in basal drag coefficient (e.g. Sergienko and Hindmarsh, 2013). Adding this term to the data–model misfit cost function before minimization favors solutions for basal drag that have small gradients, hence the wildly fluctuating high-frequency oscillations that might otherwise be introduced by overfitting are reduced.
When Tikhonov regularization is used, the value of λ_{reg} can be varied to increase or decrease the strength of the regularization. It can be difficult to know what value to use for this parameter. Some heuristic conventions exist for selecting λ_{reg}, among them plotting the L-curve (e.g. Jay-Allemand and others, 2011), or making use of a discrepancy principle (Maxwell and others, 2008), but in real applications these do not always provide an obvious choice (Vogel, 1996).
It can also be difficult to know whether to penalize gradients in the drag parameter or its logarithm, i.e. J_{reg} = λ_{reg} ∫|∇ln β|^{2} dS. Other options include the square of basal drag, J_{reg} = λ_{reg} ∫β^{2} dS, or of its logarithm, J_{reg} = λ_{reg} ∫(ln β)^{2} dS, but it is not always obvious why one should use one form rather than another, or even some combination. It is clear there is scope for many different choices in applying Tikhonov regularization, and we have not even mentioned all of them.
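As a concrete sketch of gradient-penalty Tikhonov regularization (ours; a 1-D discrete analogue with invented values, not the scheme of any particular study), the penalized least-squares problem below has a closed-form solution via its normal equations, and increasing λ_{reg} trades data misfit for smoothness:

```python
import numpy as np

# 1-D Tikhonov sketch: recover a "drag" profile beta from noisy point data d
# by minimizing  J_total = |beta - d|^2 + lam * |D beta|^2,
# where D is a finite-difference gradient operator. Larger lam -> smoother.
rng = np.random.default_rng(0)
n = 200
x = np.linspace(0.0, 1.0, n)
beta_true = np.where(x < 0.5, 1.0, 3.0)          # step in basal drag
d = beta_true + 0.3 * rng.standard_normal(n)     # noisy observations

D = np.diff(np.eye(n), axis=0)                   # (n-1) x n gradient operator

def tikhonov(lam):
    # Normal equations: (I + lam * D^T D) beta = d
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, d)

beta_weak = tikhonov(0.1)      # mild smoothing: noise partly survives
beta_strong = tikhonov(100.0)  # heavy smoothing: noise gone, step blurred
```

Choosing between the weak and strong solutions is exactly the λ_{reg}-selection problem described above: the data alone do not say which is right.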
Regularization requires the introduction of information that does not come from the observational data, x , that we have available from satellites, aircraft, field observations or laboratory experiments. This extra information must come from somewhere. The source of much of the subjectivity that we refer to in this paper is that the practitioners of the inverse methods often simply decide what seems reasonable. It is here that many of the subjective decisions that we would prefer to avoid can arise.
How smooth should the field of basal drag be? What should be the starting guess for iterative minimization of the cost function? What form of Tikhonov regularization should be used? How much can the viscosity vary from some prescribed approximation of the ice rheology, such as Glen’s flow law? Viewed from the Bayesian perspective, all of these decisions amount to the selection of priors for basal drag and viscosity.
One of the attractions of Bayes’ theorem is that it can provide the joint PDF for the parameters, given some observations with known error distribution. Crucially, the theorem cannot be applied without a prior for the parameters. This requirement to define a prior PDF for the parameters brings into the open many of the subjective decisions that are often made in an ad hoc fashion in the process of regularizing inversions.
As noted in many studies using Bayesian methods (e.g. Gudmundsson and Raymond, 2008; Petra and others, 2014), the link between regularization and specification of the prior can often be made explicit by taking the negative of the logarithm of Eqn (1),

−ln p_{p}(θ|x) = −ln p_{l}(x|θ) − ln p(θ) + ln p_{n}(x). (2)
Now, we identify a misfit function, J_{misfit} = −ln p_{l}(x|θ), defined as the negative of the log-likelihood, a regularization term, J_{reg} = −ln p(θ), that is the negative of the logarithm of the prior, and J_{0} = ln p_{n}(x), which is just a constant offset for any given set of observations. Then it is clear that choosing parameters, θ, that maximize the posterior PDF is the same as choosing them to minimize a misfit function, J_{total} = J_{misfit} + J_{reg} + J_{0} = −ln p_{p}(θ|x). The relationship, J_{reg} = −ln p(θ), means for instance that quadratic regularization terms, such as J_{reg} = λ_{reg} ∫|∇β|^{2} dS, correspond to specifying a Gaussian density function for p(θ), proportional to exp(−λ_{reg} ∫|∇β|^{2} dS), and vice versa. From a Bayesian perspective the various options for Tikhonov regularization described above are just different ways of specifying a prior PDF for the parameters.
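This equivalence can be checked numerically with a one-parameter sketch (ours; all values invented for illustration): minimizing J_{total} = J_{misfit} + J_{reg} over a grid picks out the same value of θ as maximizing the unnormalized posterior exp(−J_{misfit}) exp(−J_{reg}):

```python
import numpy as np

theta = np.linspace(-5.0, 5.0, 2001)

x_obs, sigma = 1.5, 0.8   # one observation with known error (invented values)
lam = 0.5                 # regularization weight, i.e. the prior "strength"

j_misfit = 0.5 * ((x_obs - theta) / sigma) ** 2   # -ln p_l(x|theta) + const
j_reg = lam * theta ** 2                          # -ln p(theta) + const (Gaussian prior)
j_total = j_misfit + j_reg

posterior = np.exp(-j_misfit) * np.exp(-j_reg)    # unnormalized p_p(theta|x)

theta_min = theta[np.argmin(j_total)]   # regularized least-squares estimate
theta_map = theta[np.argmax(posterior)] # maximum a posteriori estimate
# The two estimates coincide (up to grid resolution).
```

The quadratic penalty λθ² pulls the estimate from the observation at 1.5 toward zero, exactly as a zero-mean Gaussian prior would.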
Working in the Bayesian framework provides some clarity to the definition of the cost function, J_{misfit}, since it suggests that if we want the most likely parameters we should use the negative log-likelihood function, −ln p_{l}(x|θ), to characterize the misfit with data, rather than unweighted least squares or some other choice. It also clarifies the process of regularization, since it requires that the information to be added is explicitly formulated in terms of a prior PDF for the parameters. However, simply adopting the Bayesian approach does not tell us what the prior, p(θ), should be. So how should priors be defined for parameters that we have so little information about?
4. Subjective Priors
One possible way to define a prior, p(θ), is to leave this up to the individual scientist performing the inversion. In the case of inverting for the basal drag under an ice sheet this seems a questionable choice. The posterior PDF, p_{p}(θ|x), will be used to define the parameters for the model, and these parameters are an important component of the sea-level forecast. The glaciological forecast usually requires us to solve a nonlinear system of equations, which may be sensitive to small changes in parameter values or initial conditions, and we know that flow of ice into the ocean is sensitive to the basal drag coefficient and the ice viscosity (Schoof, 2007). This suggests that models that specify different prior PDFs for the spatial variations in basal drag could produce quite different projections of sea level. At present it is difficult to know how important this effect could be, but as more forecasts are produced, each with different models and different inversion methods, it will become easier to evaluate the degree of spread among projections.
Often a great deal of effort and cost is expended in developing the physical model, collecting the observations, x , and characterizing the error covariance of those observations. It seems questionable to apply such dedication to deriving the likelihood, p _{l}( x  θ ), but then multiply this function by a prior that is left up to individual choice, either through explicit definition of a subjective prior, or implicitly through choices of regularization strategy. This could be justified if the scientist performing the inversion has some real insight into the range of variation and the length scale on which basal drag varies. As mentioned above, there is great uncertainty regarding the subglacial environment and it is difficult to know how the insight needed to define the prior would be obtained. We emphasize that the prior, p( θ ), is logically independent of the observations, x , that will be used in the inversion, so these observations cannot be used to provide insight into what the prior should be.
Another way of defining the prior, p(θ), would be to delegate the task to experts on the subglacial environment, asking them to define the prior for the basal drag, viscosity, etc. The justification for this would be that there are people who (without regard to the observations that we will use in the inversion) can provide us with useful information on the range of values that the viscosity and basal drag coefficient can take, and the length scales they will vary over. If such experts exist then their views should be taken into account in the definition of the prior, p(θ). However, it may be difficult to find anyone with such comprehensive information about the details of the subglacial environments of Greenland and Antarctica.
In the end, the main justifications for using subjective priors, or indeed heuristic approaches to regularization, may be (1) that they are easy to implement, (2) that it can plausibly be assumed, or checked after the fact, that the main results of the forecast are not too sensitive to the details of how this regularization is performed and (3) that it can be difficult to imagine what else could be done. The first point is certainly true, and should not be downplayed, since it allows large-scale calculations to be performed that could not be otherwise (e.g. Gillet-Chaulet and others, 2012; Morlighem and others, 2013; Joughin and others, 2014; Petra and others, 2014; Arthern and others, 2015; Cornford and others, 2015). The second point may well be true also, but seems to require that we address the third. After all, without first considering what else we might do to regularize the problem it is hard to argue that it won’t make much difference. In the following sections we outline two principles that have been advanced by Jaynes (2003) as a way of defining prior PDFs for parameters when minimal information about them is available.
5. Transformation Group Priors
Transformation group priors use symmetries of the problem to constrain the function, p( θ ), that is used as a prior PDF. In many mathematical problems knowledge of some particular symmetry can be extremely valuable, because it allows us to rule out a wide range of possible solutions that do not exhibit that symmetry. For instance, if there is some prior information available to us that can be written as mathematical expressions involving θ and if there are transformations that can be applied to these expressions that do not alter them in any way, then Jaynes (2003) argues that those transformations should also leave the prior, p( θ ), unchanged. The motivation is to ensure consistency, so that for two problems where we have the same prior information we assign the same prior probabilities (Jaynes, 2003). Based on this, Jaynes (2003) argues that we should select priors that are invariant to a group of transformations that do not alter the specified problem. Surprisingly, in some cases, identifying the symmetries in the form of a group of transformations that leave the problem unchanged and then requiring that the function p( θ ) is invariant to those transformations can completely determine which function to use as a prior. The value of using transformation group priors is perhaps best appreciated by imagining that we use a prior that does not respect the symmetries of the specified problem. Then we would, in effect, be claiming access to additional information that is not inherent in the problem specification, and, if called upon, we should be able to provide a reasoned explanation of where that information has come from.
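A minimal numeric sketch may help fix ideas (a standard textbook example, not part of this paper's derivation): for a parameter known only to be a positive scale, the prior p(θ) ∝ 1/θ assigns the same probability to a given interval of values whatever units are used, whereas a uniform prior does not share this invariance.

```python
import numpy as np

# Probability mass the (improper) scale prior p(theta) ∝ 1/theta assigns to [a, b]
def scale_prior_mass(a, b):
    return np.log(b / a)          # integral of 1/theta from a to b

# A change of units theta -> c*theta maps the interval [a, b] to [c*a, c*b]
a, b, c = 2.0, 5.0, 1.0e3
assert np.isclose(scale_prior_mass(a, b), scale_prior_mass(c * a, c * b))

# A uniform prior (mass ∝ b − a) is not invariant under the same rescaling
assert not np.isclose(b - a, c * b - c * a)
```

The same probability is assigned to "between 2 and 5 Pa s" and "between 2000 and 5000 mPa s", so the prior cannot be accused of smuggling in a preferred system of units.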
6. Maximizing Relative Entropy
In addition to the symmetries of the problem, we may have other information that is relevant to specification of the prior. Sometimes this information can be expressed in the form of constraints that the PDF must satisfy. One common class of constraints are expectations of the form

∫ f_{i}( θ ) p( θ ) d θ = F_{i}, for i = 1, 2, …, Q. (3)
For instance, if we have reason to believe that the expected value of the vector of parameters θ is some particular value, we would apply a constraint with f_{i}( θ ) = θ and F_{i} equal to that expected value. A similar constraint with f_{i} = F_{i} = 1 requires that the PDF, p( θ ), is normalized such that it integrates to one. Jaynes (2003) provides a recipe for incorporating such constraints, arguing that we should favor the PDF that maximizes the relative entropy subject to whatever constraints are imposed. The relative entropy of a PDF, p( θ ), is a functional, H(p), defined with respect to a reference PDF, π( θ ), as

H(p) = −∫ p( θ ) ln [ p( θ ) / π( θ ) ] d θ . (4)
Multiple constraints of the form given by Eqn (3) can be imposed using Lagrange multipliers λ = {λ_{1}, λ_{2}, …, λ_{Q}}, by seeking stationary points of the functional

H_{1}(p, λ ) = H(p) + Σ_{i} λ_{i} ( ∫ f_{i}( θ ) p( θ ) d θ − F_{i} ). (5)
As described by Jaynes (2003), when the normalization constraint is enforced and other constraints, i = 1, 2, …, Q, are also imposed, stationary points of H_{1}(p, λ ) are provided by PDFs of the form

p( θ ) = π( θ ) exp [ Σ_{i} λ_{i} f_{i}( θ ) ] / Z( λ ), (6)

where

Z( λ ) = ∫ π( θ ) exp [ Σ_{i} λ_{i} f_{i}( θ ) ] d θ , (7)

and the imposed constraints are recovered from

∂ ln Z( λ ) / ∂λ_{i} = F_{i}. (8)
Solving Eqn (8) often provides a convenient way of identifying values for the Lagrange multipliers, λ , such that H_{1} is stationary and the constraints are enforced. Once these values of λ have been obtained, Eqn (6) provides the PDF and Z( λ ) plays the role of a normalizing constant. If there are no constraints other than normalization, then finding stationary points of H_{1} results in p( θ ) = π( θ ). This means that π( θ ) can be viewed as a preliminary prior that will be modified, such that any additional constraints on p( θ ) are satisfied. Jaynes (2003) argues that the PDF, p( θ ), that maximizes H, subject to whatever constraints are imposed, has many attractive features. Roughly speaking, H can be viewed as quantifying the ‘uninformativeness’ of the PDF, p( θ ). Maximizing H is therefore a way of guarding against selecting a prior that is too prescriptive about which values of θ are likely. This provides a safeguard against ruling out things that could possibly happen. A prior probability obtained in this way is guaranteed to satisfy the constraints, but is otherwise as uninformative as possible.
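The recipe above can be illustrated with a small numerical sketch (all numbers are illustrative assumptions): on a discretized one-dimensional parameter space with a uniform reference PDF, we impose the single constraint E[θ] = 3 and solve for the Lagrange multiplier by bisection, recovering the exponential-family form referred to above as Eqn (6).

```python
import numpy as np

# Discretized parameter space with a uniform reference PDF pi(theta) on [0, 10]
theta = np.linspace(0.0, 10.0, 2001)
dtheta = theta[1] - theta[0]
pi_ref = np.ones_like(theta)
pi_ref /= pi_ref.sum() * dtheta          # normalize with the same quadrature

def maxent_pdf(lam):
    # p(theta) ∝ pi(theta) exp(lam * theta): the exponential-family form that
    # maximizes relative entropy subject to a constraint on the mean
    w = pi_ref * np.exp(lam * theta)
    return w / (w.sum() * dtheta)

def mean_of(lam):
    # E[theta] under the maximum-entropy PDF; monotonically increasing in lam
    return (theta * maxent_pdf(lam)).sum() * dtheta

# Bisection for the Lagrange multiplier enforcing E[theta] = 3
lo, hi = -5.0, 5.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mean_of(mid) < 3.0 else (lo, mid)
lam = 0.5 * (lo + hi)
p = maxent_pdf(lam)

# With no constraint beyond normalization, the answer is the reference PDF itself
assert np.allclose(maxent_pdf(0.0), pi_ref)
```

Because the uniform reference mean (5) exceeds the imposed mean (3), the multiplier comes out negative, tilting probability toward smaller θ by exactly the amount the constraint demands and no more.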
We would obviously prefer to have a very informative prior, since then we would know exactly which parameter values to use in our model. It may then seem strange that we are selecting the least informative prior possible, subject to the information introduced by the reference distribution and the constraints. The point is that we should only use an informative prior if we actually have the information to back it up. Here, once we have defined a reference distribution, the extra information is being introduced in the form of constraints, or through the data that we will introduce later via Bayes’ theorem. For each constraint that we impose, the prior will become more informative, relative to the original PDF, π( θ ). If we were to subjectively choose a prior more informative than demanded by the constraints we would be guilty of subjectively introducing information into the inversion without good reason, and this is exactly what we are hoping to avoid, as far as possible. A prior that is too prescriptive about which parameter values are possible will only lead to overconfidence in the accuracy of our forecasts and to surprises when the forecast fails to deliver such accuracy.
It may seem that the problem has now simply changed from finding p( θ ) to finding the preliminary prior, π( θ ). This is where a combination of the two approaches outlined above can be used. Jaynes (2003) suggests that invariance to a transformation group defining the symmetry of the specified problem should be used to define π( θ ). Having obtained π( θ ), any additional constraints can then be imposed by maximizing the relative entropy, H, subject to those constraints. This is the procedure that we will adopt in the rest of this paper.
7. Application to a Simple Glaciological Problem
To introduce the methods outlined above, we will consider the simple problem of estimating the viscosity and basal drag coefficient for a slab of uniform thickness flowing down a plane. Although this is a highly simplified problem compared with the initialization of large-scale models of the Greenland and Antarctic ice sheets, it will turn out to contain many of the essential features of the more difficult three-dimensional problem, and therefore serves as a useful starting point to illustrate the methods.
We define a coordinate system in which the x- and y-axes are parallel to the planar bed of the ice sheet, which slopes downwards at an angle α below horizontal in the direction of increasing x, with no slope in the direction of increasing y (Fig. 1). The z-axis is taken to be normal to the bed, positive upwards, with z = z_{b} defining the bed and z = z_{s} defining the surface. The thickness, h = z_{s} − z_{b}, is assumed uniform, and velocity in the x-direction, u(z), is a function of depth. Any vertical shearing within the slab leads to a shear stress σ_{xz}. The ice density, ρ, is assumed constant. The basal drag coefficient is β, and the viscosity within the slab is η, which is assumed constant with depth in this simplified problem. Within the slab, body forces from gravity are balanced by gradients in stress. At the lower boundary a linear sliding law is assumed, so the shear stress is σ_{xz} = βu. At the upper boundary the shear stress vanishes, so σ_{xz} = 0. For now, we will assume nothing about β or η, other than that they are positive constants.
For this simple system, conservation of momentum in the x-direction gives the system of equations:

∂/∂z ( η ∂u/∂z ) + ρ g sin α = 0, for z_{b} < z < z_{s},
η ∂u/∂z = β u, at z = z_{b},
η ∂u/∂z = 0, at z = z_{s}, (9)
where g is acceleration due to gravity. Generally, before performing an inversion using the model we will already have available a discrete version of the momentum equations. For illustrative purposes we can consider a simple finite-difference discretization of the above system on a uniform grid that has n velocity levels at grid spacing Δ = h/(n − 1) and velocities in the x-direction of u_{1}, u_{2}, etc. These are defined on each of the different levels, so u_{1} is the sliding velocity at the base and u_{n} is the flow velocity at the upper surface. We define

A = X ^{T} DX , (10)

with

D = diag (βΔ^{−1}, ηΔ^{−2}, …, ηΔ^{−2}), X_{ii} = 1 for i = 1, …, n, X_{i,i−1} = −1 for i = 2, …, n, and X_{ij} = 0 otherwise. (11)
The discretized system corresponding to Eqn (9) is then

Au = f . (12)
This is obviously a highly simplified model. We have only introduced it here so we have a very simple discrete system that we can use to illustrate how the methods can be applied in more general circumstances. More sophisticated ice-sheet models can be written in a very similar form, with A a symmetric positive-definite matrix, u a vector of velocities and f a vector comprised of body forces and forcing terms from boundary conditions.
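A sketch of this kind of discrete system can be assembled and solved numerically. The discretization below (unit lower-bidiagonal X, with D_{11} = β/Δ and D_{ii} = η/Δ² for i ≥ 2, matching the scalings quoted later for D_β and D_η) is an assumption consistent with the description in the text, and the physical constants are arbitrary illustrative values; the solution is compared with the analytic velocity profile for a uniform slab.

```python
import numpy as np

# Illustrative constants (assumptions, not values from the paper)
n = 1000                     # number of velocity levels
h = 1000.0                   # slab thickness (m)
rho, g = 917.0, 9.81         # ice density (kg m^-3), gravity (m s^-2)
alpha = np.deg2rad(0.5)      # bed slope
beta, eta = 1.0e10, 1.0e14   # basal drag coefficient, viscosity
dz = h / (n - 1)

# X: row 1 selects the basal velocity; rows 2..n take first differences
X = np.eye(n)
for i in range(1, n):
    X[i, i - 1] = -1.0
# D: D_11 = beta/dz and D_ii = eta/dz^2 for i >= 2
d = np.full(n, eta / dz**2)
d[0] = beta / dz
A = X.T @ (d[:, None] * X)   # A = X^T D X, symmetric positive-definite

# Body-force vector from the gravitational driving stress (trapezoidal end weights)
f = np.full(n, rho * g * np.sin(alpha))
f[0] *= 0.5
f[-1] *= 0.5

u = np.linalg.solve(A, f)    # forward solve of A u = f

# Analytic solution for the uniform slab, for comparison
z = np.linspace(0.0, h, n)
driving = rho * g * np.sin(alpha)
u_exact = driving * h / beta + (driving / eta) * (h * z - z**2 / 2)
```

The recovered profile shows basal sliding set by β plus the familiar parabolic shear deformation set by η, which is the forward problem whose inversion the rest of the paper considers.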
If A and f are known, solving the system defined by Eqn (12) provides an estimate, A ^{−1} f , for the velocities, u . In this paper we are not particularly interested in such a straightforward application of the model. Instead we would like to consider very general inferences about the velocity field, u , the forces, f , and the matrix, A , remembering that this matrix depends on the parameters, β and η, that we are trying to identify. We will also consider the possibility that the model is not perfect, so Eqn (12) is satisfied only approximately.
Note that in the above example, if we can estimate the matrix A , we can later derive the parameters β and η by computing D = X ^{−T} AX ^{−1}. To keep the following discussion as generally applicable as possible, we will not yet assume any particular form for the system matrix, A , except that it is positive-definite and symmetric. Later, we will return to the problem with the particular A defined by Eqn (10).
We will include in our vector of parameters, θ , all of the quantities that we might perhaps want to estimate. These will include velocities, u , forces, f , the upper triangular (including leading diagonal) elements, A ^{u}, of the symmetric matrix, A , and the upper triangular (including leading diagonal) elements, C ^{u}, of the model error covariance, C . Some of these would not usually be regarded as ‘parameters’, but we will continue to use this terminology for the unknown quantities that we would like to be able to estimate. We have only included the upper triangular elements of symmetric matrices in our list of parameters because the complete matrix, e.g. A ( A ^{u}), can always be recovered from these if needed.
For now, our only concern is to find a function that we can use as a prior for velocities, u , forces, f , and the upper triangular elements of A and C , based on what we know about the relationships between them.
It is important to recognize that we will not introduce any observations of any of the quantities until we have obtained the prior and are ready to include those observations using Bayes’ rule. Equally, once we have obtained a very general prior, we can later impose additional constraints on the form of the matrix, A . If expert knowledge or independent estimates of parameter values from laboratory experiments are available these can also be introduced later using Bayes’ theorem.
8. Deriving a General Prior for Discrete Symmetric Positive-Definite Systems
On the assumption that a symmetric positive-definite matrix, A , exists that relates velocities, u , and body forces, f , with finite error covariance and finite bias, we have the following prior information:

u ∈ ℝ^{n}, f ∈ ℝ^{n}, A and C are n × n symmetric positive-definite matrices, and E_{u , f | A , C}[ ∊∊ ^{T}] = C , where ∊ = u − A ^{−1} f . (13)
Since we are assuming that the model has already been discretized using some particular set of basis functions, the velocities, u , and body forces, f , belong to the set of real vectors of known dimension n. The matrices A and C belong to the set of n × n real symmetric positive-definite matrices. The conditional expectation, E_{u , f | A , C}[ ∊∊ ^{T}], is the average of ∊∊ ^{T} over the velocity, u , and body forces, f , for particular values of A and C . We will refer to the matrix C as the model error covariance. It is possible that the model is biased, so to be strict we should perhaps refer to this as the mean-squared error. It is defined as C = Cov( ∊ ) + bb ^{T}, where b = E_{u , f | A , C}[ ∊ ] is the expected bias of the model velocities represented by A ^{−1} f , averaged over all possible forcings, f , and Cov( ∊ ) = E_{u , f | A , C}[( ∊ − b )( ∊ − b )^{T}] is the error covariance of the model, averaged over all possible forcings.
Before looking at any data, we do not know anything about u , f , A or C , except for the information contained in Eqn (13). Before we can use Bayes’ theorem to make inferences about u , f , A and C from data, we need a prior PDF so that p( θ )d θ is the prior probability that the parameters lie within a small interval, d θ , of parameter space located at θ . For our particular choice of parameters, this takes the form

p( θ ) d θ = p( u , f , A ^{u}, C ^{u}) d u d f d A ^{u} d C ^{u}, (14)
where d u = Π_{i} du_{i}, d f = Π_{i} df_{i}, d A ^{u} = Π_{i , j≥i} dA_{ij} and d C ^{u} = Π_{i , j≥i} dC_{ij} are the standard (i.e. Lebesgue) measures for integration over elements of u , f , A ^{u} and C ^{u}, the latter two being the n(n + 1)/2 upper triangular elements of A and C , respectively. To label the domains of integration for u , f , A ^{u} and C ^{u}, we write Θ_{u}, Θ_{f}, Θ_{A} and Θ_{C}, such that u ∈ Θ_{u}, f ∈ Θ_{f}, A ^{u} ∈ Θ_{A} and C ^{u} ∈ Θ_{C}, and we write Θ = Θ_{u} × Θ_{f} × Θ_{A} × Θ_{C} for the parameter space that combines all of these domains.
The prior information defined by Eqn (13) is invariant under several transformations:

T1( Φ ): Orthogonal transformations, with Φ an n × n orthogonal matrix, such that Φ ^{T} Φ = ΦΦ ^{T} = I ,

u ↦ Φu , f ↦ Φf , A ↦ ΦAΦ ^{T}, C ↦ ΦCΦ ^{T}. (15)

T2(a, b): Change of units. Rescaling by a > 0 and b > 0,

u ↦ a u , f ↦ b f , A ↦ (b/a) A , C ↦ a^{2} C . (16)

T3( r ): Superposition of solutions,

u ↦ u + r , f ↦ f + Ar , A ↦ A , C ↦ C . (17)

T4(q): Switch velocities for forces and model for inverse. Repeated q times, with q = 1 or q = 2,

u ↦ f , f ↦ u , A ↦ A ^{−1}, C ↦ ACA . (18)
According to Jaynes (2003) it is important to specify the transformations as a mathematical group. Mathematically, a group is a set of elements (e.g. A, B, C, etc.) that also has an operation, ×, that takes two elements A and B of the set and relates them to a third element P. To be a group the set and the operation must together satisfy certain conditions: (1) the operation must be closed so that the product, P = A × B, always lies within the set; (2) the operation must be associative, so that for any A, B and C in the set, (A × B) × C = A × (B × C); (3) the set must contain an identity element, I, such that A × I = A for any element A; (4) each element of the set must have an inverse, A ^{−1}, such that A × A^{−1} = I. In Appendix A we show that the transformations T1 to T4 satisfy these conditions individually and that their direct product defines a transformation group.
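The group axioms can be spot-checked numerically for the orthogonal transformations T1, using matrix multiplication as the group operation (a sketch only; the full proof is the subject of Appendix A):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_orthogonal(n):
    # QR factorization of a random matrix yields an orthogonal factor Q
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

n = 5
I = np.eye(n)
P1, P2, P3 = (random_orthogonal(n) for _ in range(3))

# closure: the product of two orthogonal matrices is orthogonal
P = P1 @ P2
assert np.allclose(P @ P.T, I)
# associativity (holds for matrix multiplication generally)
assert np.allclose((P1 @ P2) @ P3, P1 @ (P2 @ P3))
# identity element, and inverse: the transpose inverts an orthogonal matrix
assert np.allclose(P1 @ I, P1)
assert np.allclose(P1 @ P1.T, I)
```

Analogous checks for T2 to T4 amount to composing parameter pairs, translations and the swap operation, each of which closes on itself in the same way.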
As noted by Reference JaynesJaynes (2003), the key role played by the preliminary reference prior, π( u , f , A ^{u}, C ^{u}), is to define a volume measure for the parameter space, Θ. To qualify as a transformation group prior, this volume measure must be invariant to the group of transformations that do not alter the mathematical specification of the problem, but it need not correspond to the Lebesgue volume measure, dV = d u d f d A ^{u} d C ^{u}, that would apply if the parameter space had a standard Euclidean geometry. The following distance metric is invariant under the group of transformations defined above:
where Tr indicates the trace, obtained by summing elements on the main diagonal. We have written d A , d C , d u and d f to represent infinitesimal changes to A , C , u and f . The invariance of ds^{2} to the transformation group can be shown by applying the transformations to Eqn (19), and using the invariance of Tr[ A ^{−1} d AA ^{−1} d A ] to inversion A ↦ A ^{−1}, scale changes A ↦ b A , and to congruent transformations of the form A ↦ BAB ^{T}, where B is an invertible n × n matrix (Moakher and Zéraï, 2011).
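The quoted invariances of Tr[ A ^{−1} d AA ^{−1} d A ] can be verified numerically (a sketch with randomly generated matrices; the shift added to B below is only an assumption to keep the example well conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

def rand_spd(n):
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)      # comfortably positive-definite

def metric_term(A, dA):
    # Tr[A^{-1} dA A^{-1} dA], the invariant building block of the metric
    M = np.linalg.solve(A, dA)
    return np.trace(M @ M)

A = rand_spd(n)
dA = rng.standard_normal((n, n))
dA = dA + dA.T                          # symmetric increment
s0 = metric_term(A, dA)

# invariance under congruence A -> B A B^T, dA -> B dA B^T
B = rng.standard_normal((n, n)) + 3.0 * np.eye(n)
assert np.isclose(metric_term(B @ A @ B.T, B @ dA @ B.T), s0)

# invariance under inversion A -> A^{-1}, dA -> -A^{-1} dA A^{-1}
Ainv = np.linalg.inv(A)
assert np.isclose(metric_term(Ainv, -Ainv @ dA @ Ainv), s0)

# invariance under rescaling A -> b A, dA -> b dA
assert np.isclose(metric_term(5.0 * A, 5.0 * dA), s0)
```

The inversion case follows because A ↦ A^{−1} perturbs as dA ↦ −A^{−1} dA A^{−1}, and the cyclic property of the trace then returns the original expression.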
From expressions given by Moakher and Zéraï (2011), the volume element that corresponds to the distance metric, ds^{2}, is
where d u , d f , d A ^{u} and d C ^{u} are the usual Lebesgue measures. The following transformation group prior is therefore a suitable preliminary prior for derivation of a maximum-entropy PDF,
where Z _{0} is a nondimensional constant that will be determined by normalization.
Because the matrices A and C are defined to be positive-definite, the function π( u , f , A ^{u}, C ^{u}) is finite everywhere within the parameter space, Θ, provided that Z_{0} is finite. However, π( u , f , A ^{u}, C ^{u}) is an ‘improper’ prior, as described by Jaynes (2003). This means that if we attempt to integrate Eqn (20) over the parameter space, Θ, we find that this integral does not exist. Therefore a finite Z_{0} cannot be defined such that the prior, π( u , f , A ^{u}, C ^{u}), is a normalized PDF over the entire parameter space, Θ. To allow us to interpret the function π( u , f , A ^{u}, C ^{u}) as a PDF, we will consider a restricted section of parameter space, Θ_{∊}, for which the necessary integral exists, and for which there is a well-defined limiting process Θ_{∊} → Θ as ∊ → 0^{+}. For example, rather than an improper uniform prior for a single parameter over the interval (−∞, ∞), a well-defined uniform prior can be defined over a range [−I/∊, I/∊], with I a constant that is finite and positive. Restricted domains Θ_{u,∊}, Θ_{f,∊}, Θ_{A,∊} and Θ_{C,∊}, subsets of the full domains of u , f , A ^{u} and C ^{u}, respectively, are defined in Appendix B. The restricted parameter space is then derived from the Cartesian product, Θ_{∊} = Θ_{u,∊} × Θ_{f,∊} × Θ_{A,∊} × Θ_{C,∊}. Later, having derived a posterior PDF that depends upon ∊, we can investigate its behavior in the limit ∊ → 0^{+}. In Appendix B, we also define a separate nondimensional parameter, γ, that controls the smallest diagonal entries of positive-definite matrices.
Having defined the restricted parameter space, Θ_{∊}, we can seek the PDF, p( u , f , A ^{u}, C ^{u}), that maximizes the relative entropy,

H = −∫_{Θ_{∊}} p( u , f , A ^{u}, C ^{u}) ln [ p( u , f , A ^{u}, C ^{u}) / π( u , f , A ^{u}, C ^{u}) ] d u d f d A ^{u} d C ^{u}, (22)
subject to whatever constraints are imposed. In our case the constraints are that the PDF must be normalized, so that

∫_{Θ_{∊}} p( u , f , A ^{u}, C ^{u}) d u d f d A ^{u} d C ^{u} = 1, (23)
and that the conditional expectation of the error covariance, E_{u , f | A , C}[( u − A ^{−1} f )( u − A ^{−1} f )^{T}], is equal to C . Using the product rule, P(A|B)P(B) = P(A, B), for conditional probabilities of events A and B gives

p( u , f , A ^{u}, C ^{u}) = p( u , f | A ^{u}, C ^{u}) p_{n2}( A ^{u}, C ^{u}). (24)
The constraint, E_{u , f | A , C}[( u − A ^{−1} f )( u − A ^{−1} f )^{T}] = C , takes the form

∫ ( u − A ^{−1} f )( u − A ^{−1} f )^{T} p( u , f | A ^{u}, C ^{u}) d u d f = C . (25)
The constraints can be imposed using Lagrange multipliers, λ_{1}, a scalar, and Λ_{2}( A ^{u}, C ^{u}), a positive-definite symmetric matrix. We seek stationary points for the following quantity:
Wherever p_{n2}( A ^{u}, C ^{u}) does not vanish, we can define
Then H _{2} is stationary for PDFs of the form
The normalization constraint is satisfied for
in which Z_{1} is regarded as a functional of Ψ ( A ^{u}, C ^{u}). To evaluate the function Ψ we require that the first variation of Z_{1}( Ψ ) with respect to Ψ is zero. This is analogous to identifying values for Lagrange multipliers by solving Eqn (8). Since the preliminary prior, π( u , f , A ^{u}, C ^{u}), is independent of u , carrying out the integration over the restricted domain of u provides the approximation
where π written without arguments represents the mathematical constant, and E.E. represents edge effects arising because we have integrated over the restricted domain of u rather than the full domain. Here we neglect these edge effects, in anticipation that they become unimportant in the limit ∊ → 0^{+}, whereupon the restricted domain tends to the full domain. Then, requiring that the first variation of Z_{1}( Ψ ) with respect to Ψ is zero provides
Since this must be true for any δΨ , and the quantities preceding δΨ in the integrand are all positive, applying the fundamental lemma of the calculus of variations provides
We therefore arrive at the following expression for the prior:
9. Introducing Additional Information using Bayes’ Theorem
Deriving the prior PDF is only the first step of our inversion. We still have information available to us that we have not used. In particular, we have not yet introduced any of the observational data, x . We will assume that these data provide an estimate, u *, for the velocity at the upper surface, and that they also allow us to estimate the forces, f *. Having obtained a prior density function and a likelihood function, we can write the posterior PDF (Eqn (1)) using Bayes’ rule as
with
where p_{l}( u *, f * | u , f , A ^{u}, C ^{u}) is the likelihood function and p_{n}( u *, f *) is the normalizing constant. The posterior PDF should then approach a well-defined limit as ∊ → 0^{+} and γ → 0^{+}, so that Θ_{∊} → Θ. If it does not, this may be a sign that the problem remains ill-posed and additional information is needed. Other information could also be introduced using Bayes’ theorem. For instance, we may have data from laboratory experiments that can help to constrain the ice viscosity, or there may be additional information about basal drag from experts in subglacial processes. We have tried to define a very general prior that does not rely on such expert knowledge, but if credible information from experts can be obtained there is no reason it could not be introduced later using Bayes’ theorem.
10. Returning to the Simple Slab Problem
We now return to our simple slab problem. To give a practical illustration of how a Bayesian estimation might proceed, we will derive a posterior PDF for the basal drag coefficient, β, and the viscosity, η, for our slab. For this example we will make numerous simplifying assumptions, some of which would not be applicable in real situations. Nevertheless, many of the methods that we will use would apply to more general problems.
Specifying the likelihood requires knowledge of the accuracy of the various observations that are to be introduced, and how the errors in those observations are correlated. In a real inversion the estimation of the likelihood function might be quite involved, but to illustrate the methods we will assume Gaussian distributed errors in the estimation of surface velocity, u *, and body forces, f *. We then have the likelihood function
where R _{u} and R _{f} are the error covariances for the observations u* and f *, respectively. Note that these are distinct from the model error, C , that was introduced previously. For our simple slab model we only have one observation of surface velocity, so u* and R _{u} are scalars, and H = [0, 0, 0, …, 0, 0, 1] simply selects the surface velocity, so that Hu = u _{n}.
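A sketch of this Gaussian log-likelihood (up to an additive constant, with illustrative sizes and values assumed) is:

```python
import numpy as np

def log_likelihood(u, f, u_star, f_star, R_u, R_f):
    # Gaussian errors in the observed surface velocity u* (scalar, covariance
    # R_u) and forces f* (covariance matrix R_f); constants dropped
    H = np.zeros(len(u))
    H[-1] = 1.0                          # H selects the surface velocity u_n
    r_u = u_star - H @ u
    r_f = f_star - f
    return -0.5 * r_u**2 / R_u - 0.5 * r_f @ np.linalg.solve(R_f, r_f)

# The likelihood is largest when the model state matches the observations
u = np.array([1.0, 2.0, 3.0])            # toy velocity profile (base to surface)
f = np.array([0.5, 0.5, 0.5])            # toy forces
R_f = 0.1 * np.eye(3)
exact = log_likelihood(u, f, u_star=3.0, f_star=f.copy(), R_u=0.05, R_f=R_f)
perturbed = log_likelihood(u, f, u_star=3.5, f_star=f.copy(), R_u=0.05, R_f=R_f)
assert exact > perturbed
```

Because only the surface velocity is observed, the row vector H discards all interior velocities, which is what leaves the interior profile to be constrained by the prior rather than the data.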
The observational data, u* and f *, are not the only information that we need to include. We also know that the form of the system matrix is given by A = X ^{T} DX , where D is diagonal and the matrix X is known. To begin with, we assume no information about the basal drag coefficient, β, or viscosity, η, except that these are positive, so we treat the matrix D as an unknown diagonal matrix with positive elements D ^{d} = [D_{11}, D_{22}, …, D_{nn}] on its diagonal. To introduce this information we apply a one-to-one coordinate transformation, defined in Appendix B, from A ^{u} to Y = [ D ^{d}, M ^{sl}] and we recognize that our system matrix is a special case of Eqns (B1) and (B2), where M is the identity matrix. This corresponds to M ^{sl} → 0, or, more precisely, all elements of M ^{sl} bounded within the interval [−δM, δM], defined by some vanishingly small quantity, δM. To transform the PDF to the new coordinates we need to know the Jacobian of the coordinate transform from A ^{u} to Y . This can be found by explicitly writing out the dependence of A_{ij} on M_{ij}, D_{ii} and X_{ij} and differentiating, then reordering the variables so that the Jacobian of this transformation is a triangular matrix (e.g. Magnus and Neudecker, 1980; Mathai, 1997). The determinant of this triangular matrix is then given by the product of its diagonal elements, which results in
This can be used to define the transformed PDF
Using the product rule, P(A|B)P(B) = P(A, B), for conditional probabilities of events A and B gives
which provides the posterior PDF for unknown quantities, given values for the data, u *, f *, and the known matrix elements, M ^{sl}. As usual, the denominator,
can be viewed as a normalizing constant. The notation refers to the restricted domain of integration for D ^{d}, with l_{D} positive and finite, and the standard (Lebesgue) measure is used.
Many similar manipulations can be considered. For instance, if we want to assume some particular model error, C ^{u}, or, more precisely, that all elements of C ^{u} lie within some small interval, [−δC, δC], of such an estimate for vanishingly small δC, we could modify this to
where the normalizing constant is then
If we are more interested in estimating the parameters D ^{d} than the forces and velocities within the slab, we can integrate over u and f to compute the marginalized distribution,
If we assume constant viscosity in the slab, we also know that D_{ii} = D_{22} for i > 2. To use this information we make a second transformation of coordinates, with D_{β} = βΔ^{−1} = D_{11}, D_{η} = ηΔ^{−2} = D_{22}, and a set of residuals that will be assumed zero, given by ζ = {D_{33}^{−1} − D_{22}^{−1}, D_{44}^{−1} − D_{22}^{−1}, …, D_{nn}^{−1} − D_{22}^{−1}}. There is a subtlety in the choice of ζ : later we will explain why we have taken residuals of inverses of diagonal elements, rather than of the diagonal elements themselves. We use the Jacobian
to give
Then the posterior PDF for D_{β} and D_{η} , assuming known ζ , M ^{sl}, C ^{u}, u* and f *, is
with
Making substitutions from the equations above, we have
Reassuringly, this can be interpreted as an application of Bayes’ rule (Eqn (1)), with parameters θ = {D_{β}, D_{η}, u , f } and data x = { ζ , M ^{sl}, C ^{u}, u *, f *}, followed by marginalization over u and f . The Jacobians just apply the changes of coordinates mentioned above, and the chain rule, P(A, B, C) = P(A|B, C)P(B|C)P(C), can be used to rewrite the denominator in the more usual form, p_{n}( x ).
To evaluate the prior, p( u , f , A ^{u}, C ^{u}), we need the determinant | A |. For our simple problem X is triangular and has a determinant given by the product of elements on the leading diagonal. This gives | X | = 1, so | A | is given by | A | = | X ^{T} DX | = | D || X |^{2} = | D |. To impose that viscosity is constant in the slab, we require that all elements of ζ are bounded within some small interval, [−δζ, δζ], for vanishingly small δζ. Then the factors appearing above, including | A |, are approximated by
If we take γ to be a constant, independent of ∊, then taking the limit ∊ → 0^{+} in Eqn (48), produces the posterior PDF over a restricted section of parameter space for which all D_{ii} are greater than some constant, I _{D} γ. Collecting all factors that do not depend on D_{β} or D_{η} into one constant of normalization, Z _{8}, then gives
where the integral is
This evaluates to
where π written without arguments represents the mathematical constant, and
For our simple problem, the dependence of the matrix, A ^{−1}, on D_{β} and D_{η} can be computed explicitly as
so Eqns (50) and (51), with substitutions from Eqns (53–55), provide the posterior PDF for D_{β} and D_{η} .
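As a quick numeric spot-check of the determinant identity | A | = | X ^{T} DX | = | D | used above, with a small stand-in for the slab matrices (the particular X and D below are assumptions consistent with the description in the text):

```python
import numpy as np

# Small instance of the slab system matrix A = X^T D X: unit lower-bidiagonal
# X and diagonal D with illustrative entries (e.g. D_11 = beta/dz, rest eta/dz^2)
n = 6
X = np.eye(n)
for i in range(1, n):
    X[i, i - 1] = -1.0
d = np.array([3.0, 2.0, 2.0, 2.0, 2.0, 2.0])
A = X.T @ np.diag(d) @ X

# X is triangular with unit diagonal, so |X| = 1 and hence |A| = |D|
assert np.isclose(np.linalg.det(X), 1.0)
assert np.isclose(np.linalg.det(A), np.prod(d))
```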
For this simple example we have assumed that viscosity in the slab is constant, so elements of ζ are bounded close to zero, and that A = X ^{T} DX , with X and D given by Eqn (11). We have also assumed that the model error, C , is known, that the thickness, h, is known and that observational data for velocity, u *, and forces, f *, along with their error covariances, R _{u} and R _{f}, are available. In the limit of vanishing model error and vanishing error in estimation of forces, such that HCH ^{T} + HA ^{−1} R _{f} A ^{−1} H ^{T} = 0, we can make the further simplifying assumption, C _{2} = R _{u}. Then the integral defined by Eqn (51) can be evaluated in the limit γ → 0^{+}. In that case, the posterior PDF defined by p _{p9} = lim _{γ} _{→0}+ p _{p8} can be normalized, resulting in
with
Figure 2 shows this posterior PDF for basal drag coefficient, β, and viscosity, η, calculated according to Eqn (56). The plots show results for nondimensionalized quantities. Colors show the PDF normalized by the maximum value. In this example we set n = 1000 and C_{2} = R_{u} = (0.05u*)^{2}, which corresponds to 5% error in velocity, perfect knowledge of forces, f *, and negligible model error.
Figure 3 shows multiple profiles of nondimensional velocity as a function of nondimensional depth, overlain on the same plot. Different curves are plotted for each combination of nondimensional basal drag coefficient and viscosity. Each curve is colored according to the posterior probability for the particular combination of nondimensional basal drag coefficient and viscosity that it represents. Profiles for higher probabilities are plotted on top of those with lower probabilities. The profile corresponding to the values of basal drag coefficient and viscosity that maximize the posterior PDF is shown as a white dashed curve.
Interestingly, even though we did not introduce any information about the viscosity of the slab or the basal drag coefficient, the posterior PDF shown in Figure 2 has a well-defined maximum. Figure 3 shows that the parameter values that maximize the posterior PDF correspond to a particular velocity profile through the slab. As discussed by Jaynes (2003), the maximum entropy distribution can be interpreted as the distribution that can be achieved in the greatest number of ways, subject to the imposed constraints, so we might perhaps expect this velocity profile to be realized with highest probability in a natural system, if viscosity is constant in the slab and the symmetries encoded by the transformation group are respected by the physical system, but there are otherwise no constraints on the values taken by the viscosity and basal drag coefficient. However, in a more realistic inversion, we would probably want to introduce information about the viscosity derived from laboratory experiments. If the uncertainty in viscosity can be estimated, so that a likelihood function can be derived, the information could be introduced using Bayes’ theorem. Figures 4 and 5 show the effect of weighting the posterior PDF shown in Figure 2 by a Gaussian likelihood function for the viscosity, corresponding to an independent laboratory-derived estimate. Again, there is a preferred velocity profile, but now the most likely parameter values correspond to a velocity profile that is almost uniform with depth. An alternative approach would be to require that the expected value of the viscosity is equal to the laboratory-derived value, using a constraint of the form described by Eqn (3). Although we have concentrated here on a simple linear rheology, similar constraints could be applied using values of viscosity predicted by Glen’s flow law, or some other rheology.
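The reweighting step can be sketched as follows, with a purely hypothetical stand-in for the unnormalized posterior (an assumed shape, not the paper's Eqn (56)) and an assumed laboratory viscosity estimate; the viscosity mode moves toward the laboratory value, as in Figures 4 and 5:

```python
import numpy as np

# Hypothetical unnormalized posterior on a (beta, eta) grid; the log-Gaussian
# shape below is assumed purely for illustration
beta = np.linspace(0.1, 5.0, 300)[:, None]
eta = np.linspace(0.1, 5.0, 300)[None, :]
post = np.exp(-(np.log(beta) - 0.5) ** 2 - np.log(eta) ** 2)

# Weight by a Gaussian likelihood for an assumed independent viscosity
# estimate eta* = 2 with standard deviation 0.3
eta_star, sigma = 2.0, 0.3
post_w = post * np.exp(-0.5 * ((eta - eta_star) / sigma) ** 2)

# Normalize each PDF by its maximum, as in the figures
post = post / post.max()
post_w = post_w / post_w.max()

# The viscosity mode moves toward the laboratory estimate
i0 = np.unravel_index(post.argmax(), post.shape)[1]
i1 = np.unravel_index(post_w.argmax(), post_w.shape)[1]
assert eta.ravel()[i1] > eta.ravel()[i0]
```

Because Bayes' theorem multiplies densities pointwise, each independent piece of information can be folded in as a further weighting of the same grid.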
11. Discussion
More realistic situations than a simple slab of uniform viscosity can obviously be imagined, but in this paper we wanted to illustrate the main features of the methods without introducing too many complications into the model. We have illustrated how manipulations of the PDF can be made using coordinate transformations, the product rule, marginalization and Bayes’ theorem. For more general problems, some of the details would be different, but many aspects of the approach outlined above could still be applied.
To make our example problem as simple as possible we have made a number of assumptions that could be questioned. In addition to the observational constraints, we have imposed constraints on ζ , to make the viscosity uniform within the slab, and on M ^{sl}, to impose the particular structure of our very simple finite-difference model. We have also assumed that C ^{u} and R_{f} can be made arbitrarily small, to impose the assumption of negligible model error and negligible errors in the estimation of forces. Of course, in a more realistic situation, the viscosity would not be constant, the model structure would be more complicated and it would be difficult to justify either the perfect model assumption or perfect knowledge of forces. As an alternative to neglecting model error, we could perhaps marginalize over the model error, C , treating it as a nuisance parameter, or use some other estimate for model error, based on the approximate magnitude of neglected terms in an asymptotic expansion describing the model. It would also be worth trying to relax the unrealistic assumption that errors in the estimation of the forces, f *, are negligible.
An important consideration in our example is that we simplified the PDF to a function of only two parameters so that it could be plotted. To do this, we imposed the constraints ζ → 0, M^{sl} → 0 and C → 0^{+}. However, these constraints do not provide sufficient information to define a well-defined posterior PDF unless we also describe how these limits are approached. This is an example of the Borel–Kolmogorov paradox (e.g. Jaynes, 2003). The important point to consider is that conditional probabilities, e.g. p(A|B), can only be defined unambiguously with respect to an event B that has nonzero probability, so we must consider a limiting process, such as the requirement specified above that all elements of ζ are bounded within some small interval, [−δζ, δζ], for vanishingly small δζ. Perhaps surprisingly, there are many different ways to represent uniform viscosity in the slab, and these can result in different posterior PDFs. Instead of using ζ, we could have represented constant viscosity by requiring that all elements of {(D_{33} − D_{22}), (D_{44} − D_{22}), …, (D_{nn} − D_{22})} are bounded within some vanishingly small interval. Then we would have obtained a different Jacobian instead of Eqn (44), and hence a different posterior PDF. Similarly, a different coordinate transformation could have been used in place of Eqns (B1) and (B2).
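The sensitivity to the limiting process can be illustrated numerically with a two-parameter toy version of the Borel–Kolmogorov paradox. The Jeffreys-type density 1/(xy) and the interval [0.5, 2] are illustrative assumptions, unrelated to the slab model; the point is only that conditioning on y − x → 0 and on y/x → 1 yields genuinely different limiting conditionals.

```python
import numpy as np

# Toy version of the Borel-Kolmogorov paradox.  Two positive scale
# parameters x and y have the Jeffreys-type joint density p(x, y) ~ 1/(x*y)
# on the square [a, b] x [a, b] (the range is an illustrative assumption).
a, b = 0.5, 2.0
x = np.linspace(a, b, 2001)
dx = x[1] - x[0]

# Condition on "x equals y", approached through two different limits.
# Limit 1: |y - x| < eps.  The strip has width 2*eps in y, so the limiting
# conditional density is proportional to (1/x**2) * 2*eps, i.e. to 1/x**2.
p1 = 1.0 / x**2
p1 /= p1.sum() * dx

# Limit 2: |y/x - 1| < eps.  The strip has width 2*eps*x in y, so the
# limiting conditional density is proportional to (1/x**2) * 2*eps*x, i.e. to 1/x.
p2 = 1.0 / x
p2 /= p2.sum() * dx

# The two limiting processes give different conditional PDFs, with
# different means for x.
m1 = (x * p1).sum() * dx
m2 = (x * p2).sum() * dx
print(round(m1, 3), round(m2, 3))
```

Both strips shrink onto the same line y = x, yet the surviving conditional densities differ by a factor of x, mirroring the way different representations of uniform viscosity produce different Jacobians and hence different posterior PDFs.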
In situations where different coordinates, such as ζ or the differences (D_{ii} − D_{22}), can be used, it can be difficult to know which option should apply. In our example there are additional desirable symmetries that can perhaps help. Allowing for a factor of CA^{−1} in Eqn (21) that originates from the final two terms in Eqn (19) and provides scale invariance, our use of Eqns (21), (37), (B1) and (B2) is consistent with treating diagonal elements of D, which are known to be positive, as scale parameters, and elements of M^{sl}, which are known to be bounded within [−1, 1], as location parameters in the terminology of Jeffreys (1961) and Jaynes (2003). The advantage of using ζ is then that we obtain a PDF that does not depend upon n in the limit n → ∞. This means the PDF we obtain converges to a well-defined function as we increase the resolution of the finite-difference model. It still seems possible that some other choice of variables might also satisfy this symmetry, so for now we make no claim that the particular posterior PDF that we have obtained is unique in satisfying all of the symmetries of the problem that we have identified. If it turns out not to be unique, the question would arise of whether there are other symmetries that we have yet to take into account. Each symmetry that we can identify places constraints on the prior, but there is no guarantee that it can be specified uniquely.
Although we advocate the Bayesian approach to inversion, there may be problems for which it is too expensive to derive the posterior PDF in full. To consider how maximizing the posterior PDF relates to minimization of a cost function, we can take the negative of the logarithm of Eqn (56). We obtain a cost function comprising a constant offset, J_{0}, a quadratic misfit term and an additional term 2 ln β + 2 ln η. Since HA^{−1} f* is a model-based estimate of the surface velocity, the quadratic misfit is defined by the negative of the log-likelihood function, and the term J_{reg} = 2 ln β + 2 ln η appears in place of the more conventional Tikhonov regularization terms mentioned in Section 3. In this particular example, for which we have assumed we can neglect model error, there are no arbitrary coefficients λ_{reg}, so there is no requirement for an L-curve analysis to fix the degree of regularization applied. The general case, where model error cannot be neglected, would not be so straightforward.
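A schematic illustration of how the 2 ln β term acts as a parameter-free penalty is sketched below. The toy sliding relation u(β) = f/β and all numerical values are assumptions adopted purely for illustration, not the model of this paper.

```python
import numpy as np

# Schematic, parameter-free regularization: minimize a quadratic misfit plus
# the 2*ln(beta) penalty for a single basal drag parameter beta.  The toy
# sliding relation u(beta) = f / beta and all numbers are made-up assumptions.
f = 10.0      # hypothetical driving force
u_obs = 2.0   # hypothetical observed surface velocity
sigma = 0.5   # hypothetical observational error

beta = np.linspace(0.1, 20.0, 20000)

# Quadratic misfit alone (negative log-likelihood).
J_misfit = ((u_obs - f / beta) / sigma) ** 2

# Add the penalty 2*ln(beta); no tunable coefficient lambda_reg is needed,
# so no L-curve analysis is required to fix the degree of regularization.
J = J_misfit + 2.0 * np.log(beta)

beta_ml = beta[np.argmin(J_misfit)]   # pure least squares gives beta = f/u_obs
beta_map = beta[np.argmin(J)]         # the log penalty pulls beta to smaller values
print(round(beta_ml, 3), round(beta_map, 3))
```

The penalized minimum sits slightly below the pure least-squares value, illustrating how the logarithmic terms discourage large drag without any arbitrary weighting coefficient.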
The terms 2 ln β and 2 ln η penalize large values of basal drag, β, and viscosity, η. This provides qualitative support for the algorithms that are commonly used in glaciological inversions (e.g. Morlighem and others, 2013; Joughin and others, 2014; Petra and others, 2014; Arthern and others, 2015). However, most previous approaches have not exploited the symmetries of the model in the specification of their regularization, or in the characterization of the prior PDF, and are therefore effectively introducing additional information about the subglacial environment into the inversion. In itself this does not represent a problem, if some reasoned explanation can be provided for where this information has come from, but our poor knowledge of the subglacial environment perhaps makes it difficult to furnish a very convincing explanation. Comparing many diverse approaches to the inversion (e.g. Favier and others, 2014) and assessing how they affect simulations of the future behavior of the ice sheet would offer one way to explore the consequences for predictions of sea level.
It will be interesting to explore how these methods apply in models with a greater number of spatial dimensions than the simple slab problem. The prior determined by Eqn (33) is very general, and could be used for any discretized system of equations with a symmetric positive-definite matrix. As an example, a more complicated three-dimensional model of Antarctica can also be written in the form A = X^{T} D X, with X known and D a diagonal matrix that depends upon viscosity and basal drag parameters (Arthern and others, 2015). In that case, the matrices X and D are different from those of the simple model considered above, and a different coordinate transformation would be needed to separate out the known and unknown parts of the matrix. Nevertheless, it seems likely that very similar methods could be used to provide a posterior PDF for the viscosity and basal drag in a realistic geographical setting.
Some of the mathematical aspects of our approach could be developed further. We have considered a discretized model from the outset, but the equivalent problem for continuous linear differential operators could also be investigated. Other avenues open to exploration include: the application of similar methods to a nonlinear ice rheology; further characterization of the group orbits produced as the transformation group acts on the parameter space; mathematical consideration of the length scales of basal drag that are important in the ice-sheet prediction problem; further investigation of sensitivity to the limiting process used to define the restricted parameter space in more general cases than considered here; and investigations that relate the prior defined here to other approaches (e.g. Jeffreys, 1961; Kass and Wasserman, 1996; Berger, 2006).
12. Conclusions
In this paper, we have described an exploratory study to investigate whether transformation group priors and the maximization of relative entropy might have a role to play in glaciological inversions for viscosity and basal drag. These inversions are used to initialize forecasts of the ice sheets, and their formulation in Bayesian terms is an essential prerequisite to probabilistic forecasts of the sea-level contribution from Greenland and Antarctica.
Our initial findings are that adopting a Bayesian approach that uses transformation group priors and the maximization of relative entropy does add complexity to the problem. Nevertheless, having investigated a very general problem with a model that is based upon a symmetric positivedefinite matrix, and having applied this to a highly simplified problem for a slab of ice flowing down an inclined plane, it does seem that these methods could be used to initialize icesheet models. Rather than an ad hoc and subjective approach to regularization of the glaciological inverse problem, this would provide a more formulaic approach to the definition of priors for the parameters.
The great advantage of the Bayesian approach is that it allows the complete probability distribution of the model parameters to be evaluated. This could be of considerable value, either in setting up ensembles for Monte Carlo simulations of the future behavior of an ice sheet, or in more formal calculations of the probability of various possible contributions to sea level that might come from the ice sheets of Greenland and Antarctica.
Acknowledgements
The research was funded by the UK Natural Environment Research Council, including NERC grant NE/L005212/1. I am extremely grateful to the editor and two anonymous reviewers for very insightful and useful comments that significantly improved the manuscript. Richard Hindmarsh, Hilmar Gudmundsson, Jan De Rydt, Jonathan Kingslake and Dan Goldberg all provided useful comments on an earlier draft that have led to improvements.
Appendix A
The Transformation Group
Mathematically, a group is a set of elements (e.g. A, B, C, etc.) together with an operation × that takes two elements, A and B, of the set and relates them to a third element, P. To be a group, the set and the operation must together satisfy certain conditions: (1) the operation must be closed, so that the product P = A × B always lies within the set; (2) the operation must be associative, so that for any A, B and C in the set (A × B) × C = A × (B × C); (3) the set must contain an identity element, I, such that A × I = A for any element A; (4) each element of the set must have an inverse A^{−1}, such that A × A^{−1} = I.
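These four axioms can be checked mechanically for a small concrete example. The sketch below verifies them for the cyclic group of planar rotations by multiples of 90°, represented as 2 × 2 integer matrices; the representation is chosen purely for illustration and is unrelated to the transformations of the main text.

```python
import itertools

# Verify the four group axioms for a small concrete group: planar rotations
# by multiples of 90 degrees, written as 2x2 integer matrices (nested tuples).
def matmul(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))

I = ((1, 0), (0, 1))    # identity element
R = ((0, -1), (1, 0))   # rotation by 90 degrees
group = [I, R, matmul(R, R), matmul(R, matmul(R, R))]

# (1) closure: every product lies within the set.
assert all(matmul(A, B) in group for A, B in itertools.product(group, repeat=2))
# (2) associativity: (A x B) x C = A x (B x C).
assert all(matmul(matmul(A, B), C) == matmul(A, matmul(B, C))
           for A, B, C in itertools.product(group, repeat=3))
# (3) identity: A x I = A for every A.
assert all(matmul(A, I) == A for A in group)
# (4) inverses: each A has some B in the set with A x B = I.
assert all(any(matmul(A, B) == I for B in group) for A in group)
print("group axioms verified")
```

The same exhaustive checks apply to any finite set of transformations, which is how the group properties of T1–T4 below can be confirmed in practice.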
Varying the parameter Φ within the set of orthogonal matrices defines the set of possible transformations, T1(Φ), that could be applied. For transformation groups the operation × represents successive application of two transformations from the set, so that T1(Φ_{2}) × T1(Φ_{1}) represents the transformation T1(Φ_{1}) followed by the transformation T1(Φ_{2}). It is then straightforward to show that T1(Φ), together with this operation, defines a group. Closure follows, since T1(Φ_{2}) × T1(Φ_{1}) = T1(Φ_{2} Φ_{1}) and Φ_{2} Φ_{1} is orthogonal. Associativity follows from the associativity of matrix multiplication.
The identity is I _{T1} = T1( I ), where I is the identity matrix. The inverse is T1( Φ ^{T}), since T1( Φ ^{T}) × T1( Φ ) = I _{T1}.
Similar considerations can be used to show that, when × represents successive application of the transformations, (T2(a, b), with ×), (T3(r), with ×) and (T4(q), with ×) also define transformation groups. For the transformation group defined by T2(a, b), closure follows from T2(a_{1}, b_{1}) × T2(a_{2}, b_{2}) = T2(a_{1} a_{2}, b_{1} b_{2}), associativity follows from that of multiplication, the identity is I_{T2} = T2(1, 1) and the inverse is T2(1/a, 1/b). For T3(r), closure follows from T3(r_{1}) × T3(r_{2}) = T3(r_{1} + r_{2}), associativity follows from that of addition, the identity is I_{T3} = T3(0) and the inverse is T3(−r). The transformation T4(1) is its own inverse, so T4(2) = T4(1) × T4(1) is the identity, I_{T4}. The set composed of T4(1) and T4(2), along with ×, defines a cyclic transformation group. For T4(q), closure and associativity (A × B) × C = A × (B × C) can be verified trivially for all possible combinations of A, B and C that are either T4(1) or the identity, I_{T4} (i.e. T4(2)).
The four groups (T1(Φ), with ×), (T2(a, b), with ×), (T3(r), with ×) and (T4(q), with ×) can be combined by taking their direct product. The direct product of groups is defined so that the sets of transformations are combined using their Cartesian product, and operations are applied component by component: a simple example of a direct product for two groups ({A_{1}, B_{1}} with ×_{1}) and ({C_{2}, D_{2}} with ×_{2}) is the group composed of the set of ordered pairs {(A_{1}, C_{2}), (A_{1}, D_{2}), (B_{1}, C_{2}), (B_{1}, D_{2})}, with the operation (A_{1}, C_{2}) × (B_{1}, D_{2}) = (A_{1} ×_{1} B_{1}, C_{2} ×_{2} D_{2}).
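A minimal sketch of the direct-product construction, using two hypothetical groups of order 2 in place of the transformation groups above:

```python
import itertools

# Direct product of two small groups, following the construction in the text:
# G1 = ({1, -1}, multiplication) and G2 = (Z_2 = {0, 1}, addition mod 2).
# Both are stand-ins for the transformation groups T1-T4.
G1 = [1, -1]
G2 = [0, 1]

def op(p, q):
    """Apply the two operations component by component to ordered pairs."""
    return (p[0] * q[0], (p[1] + q[1]) % 2)

# The underlying set is the Cartesian product of the two sets.
product_group = list(itertools.product(G1, G2))

identity = (1, 0)
# Identity, closure and inverses for the direct product follow componentwise.
assert all(op(p, identity) == p for p in product_group)
assert all(op(p, q) in product_group
           for p, q in itertools.product(product_group, repeat=2))
assert all(any(op(p, q) == identity for q in product_group) for p in product_group)
print(product_group)
```

Because each component operation satisfies the group axioms separately, the componentwise operation on ordered pairs inherits them automatically, which is why the four transformation groups can be combined this way.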
Appendix B
The Restricted Parameter Space
We define the restricted domains for u* and f* from the Cartesian product of n finite intervals, with l_{u} and l_{f} being positive constants. As ∊ → 0^{+}, these restricted domains approach their respective limiting domains. To introduce a similar restricted domain for the matrix A it is useful to first consider a transformation of coordinates from A^{u}. If X is an arbitrary invertible square matrix, such as the one defined by Eqn (11), any symmetric positive-definite matrix, A, can be written as A = X^{T} W X,
with W symmetric positive-definite. Let D = diag(W) be the diagonal matrix composed of the diagonal elements of W. All diagonal elements D^{d} = {D_{11}, D_{22}, …, D_{nn}} are positive, since W is symmetric positive-definite, so W can be further decomposed as W = D^{1/2} M D^{1/2}, where M has diagonal elements M_{ii} = 1. Our simple example system, defined by Eqn (11), corresponds to the special case where M is the identity. More generally, since W is symmetric positive-definite, M must also be symmetric positive-definite. A necessary (but not sufficient) condition for this is that the off-diagonal elements lie in the range −1 < M_{i,j≠i} < 1. This implies that M^{sl}, the elements of M that are strictly below the leading diagonal, lie within the bounded set specified by Eqn (B3), where M(M^{sl}) has diagonal elements M_{ii} = 1, elements M^{sl} strictly below the leading diagonal and elements above the leading diagonal determined by the fact that M is symmetric.
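The decomposition of W into its diagonal part D and a unit-diagonal matrix M can be sketched numerically as follows; the particular matrix W is a made-up example chosen only to be symmetric positive-definite.

```python
import numpy as np

# Decompose a symmetric positive-definite matrix W as W = D^(1/2) M D^(1/2),
# where D = diag(W) and M has unit diagonal.  The matrix W is a made-up example.
W = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.8],
              [0.5, 0.8, 2.0]])

D = np.diag(np.diag(W))                          # diagonal matrix of diagonal elements
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(W)))  # D^(-1/2)
M = D_inv_sqrt @ W @ D_inv_sqrt                  # unit-diagonal, correlation-like matrix

# M inherits symmetry and positive-definiteness from W; a necessary (but not
# sufficient) condition is that its off-diagonal elements lie in (-1, 1).
assert np.allclose(np.diag(M), 1.0)
off_diagonal = M[~np.eye(3, dtype=bool)]
assert np.all(np.abs(off_diagonal) < 1.0)

# Round trip: rebuild W from D and M.
W_back = np.sqrt(D) @ M @ np.sqrt(D)
assert np.allclose(W_back, W)
print(np.round(M, 3))
```

The round trip confirms that [D^{d}, M^{sl}] carries the same information as W, which is what makes the coordinate transformation to Y = [D^{d}, M^{sl}] one-to-one.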
For any particular invertible matrix, X, there is a one-to-one coordinate transformation from A^{u} to Y = [D^{d}, M^{sl}]. In our restricted parameter space we require that the diagonal elements of D lie in a range bounded above by l_{D}, with l_{D} finite and positive. For now, we will leave γ as a general parameter that controls the smallest values of D_{ii}. Our restricted domain for A is then constructed from these intervals and the set of admissible M^{sl}.
We can derive a restricted domain for C^{u} in a similar way, writing C^{u} in terms of a diagonal matrix of variances, V(V^{d}), restricted to a finite positive range bounded by l_{V}, and the symmetric positive-definite correlation matrix, P(P^{sl}), which has diagonal elements P_{ii} = 1 and elements P^{sl} strictly below the diagonal. The set of admissible P^{sl} is defined in the same way as Eqn (B3), but for P^{sl} rather than M^{sl}. The restricted parameter space is derived from the Cartesian product of these restricted domains.