## 1. Introduction

Mortality modelling is an important topic in actuarial science and insurance practice, dating back to the deterministic and one-dimensional Gompertz model. Over the past decades, stochastic approaches are rapidly developed, represented by the seminal (Lee and Carter, Reference Lee and Carter1992), or LC model. The LC model is based on a combination of temporal trends and age patterns in logged central mortality rates, which has become a standard for model and project dynamics of mortality rates. Its forecasting performance is well documented in existing literature (see, for example, Lee and Miller, Reference Lee and Miller2001, among others). Intrinsically, the LC model assumes that a single time-varying index drives the temporal dynamics of mortality rates. The associated forecast relies on the extrapolation of this index with an appropriate statistical linear time-series model. The popularity of LC stems from its simple construction and straightforward interpretations (Lee, Reference Lee2000; Denuit *et al*., Reference Denuit, Devolder and Goderniaux2007).

This paper focuses on modelling and forecasting multi-population mortality rates, which are more comprehensive than a single-population framework and are receiving attention from recent literature. For example, a multi-population framework is used to study a group of countries with similar socioeconomic situations or males and females in the same population (Li and Li, Reference Li and Li2017; Boonen and Li, Reference Boonen and Li2017). These models are motivated to assess the demographic basis risk involved in an index-based longevity hedge by comparing and projecting the reference and target populations’ mortality experience (Li and Lee, Reference Li and Lee2005). For instance, when projecting mortality for a smaller community with a thin volume of mortality data, the forecaster may aim to improve the credibility by modelling the smaller population jointly with a larger population. Another useful approach is to simultaneously forecast mortality rates of both males and females of the same population, thereby ensuring consistency of the sex differentials.

Regarding the multi-population mortality modelling, Li and Lee (Reference Li and Lee2005) are among the first to extend the LC model and propose a multiple-population counterpart, or the LL model. The LL framework uses a common factor to describe the long-term temporal trend shared by all countries within the investigated group to model the mortality rates. A country-specific factor is also adopted to describe the short-term country-specific patterns, effectively avoiding the undesirable divergence of mortality forecasts among populations in the long run.

More recently, other extensions of the LC model to the multi-population framework have flourished, and examples include Yang and Wang (Reference Yang and Wang2013), Zhou *et al*. (Reference Zhou, Wang, Kaufhold, Li and Tan2014), Li *et al*. (Reference Li, Zhou and Hardy2015) and Danesi *et al*. (Reference Danesi, Haberman and Millossovich2015). For instance, Kleinow (Reference Kleinow2015) proposed a common-age-effect (CAE) model to allow more than two populations, which extends the LL model. The age effect of CAE model, however, is assumed identical for all populations. This is motivated by the observation that obtained age effects are close to each other when estimated among different countries of similar socioeconomic structures. Thus, the number of parameters (i.e., age effects) can be reduced when simultaneously modelling their mortality experiences.

As pointed out by Chen *et al*. (Reference Chen, MacMinn and Sun2015), the CAE-type models might only be justified by the long-term mortality co-integration, yet it seems too strong to model the short-term mortality dependence. Alternatively, an ARMA-GARCH process with heavy-tailed innovations was proposed to filter the mortality dynamics of each population. The residual risk could then be fitted via a one-factor copula model. However, the increased complexity in its modelling structure is the drawback when compared with the LC-type models.

Extended from the classic LC model, this paper proposes a new multi-population approach based on a hierarchical structure to model all examined populations. Our approach effectively balances the dichotomy of short-term predictive power and long-term coherence. Specifically, unlike the existing CAE framework, in our model, population-wise age effects are allowed to improve the short-term forecasting accuracy. Further, similar to the models of Yang and Wang (Reference Yang and Wang2013) and Zhou *et al*. (Reference Zhou, Wang, Kaufhold, Li and Tan2014), the long-term coherence is considered, since all population-wise age effects are random vectors generated from the *same* parametric distribution. More importantly, our model manages to achieve the improved flexibility while retaining a relatively parsimonious model specification. Given the limited availability of mortality data, such a hierarchical structure effectively utilises cross-sectional information in the estimation and forecasting. As for the temporal dimensionality, a Vector Error Correction Model (VECM) is employed to model the co-movements among sub-populations. Relevant parameter restrictions are discussed for identification purpose, which also provides asymptotically coherent forecasts of mortality rates. The employed VECM also enables subsequent structural analyses, which examine the interdependence of mortality dynamics of sub-populations.

In terms of the parameter estimation of our model, the traditional singular value decomposition (SVD) for estimating the LC model is no longer feasible. Although the integrated likelihood function exists analytically, its exact computation involves large dimensional Kronecker products, resulting in difficulty to implement the standard maximum likelihood estimation algorithms. Due to the high dimensionality and the limited sample size of mortality data, other popular alternative such as the Expectation-Maximization technique is not pursued in this paper. To combat against the complexity, we implement the Bayesian estimation approach.

Over the past decades, Bayesian inference has become a popular statistical approach for its ability to capture complicated dynamics and its inherent parameter variability assumption (Czado *et al*., Reference Czado, Delwarde and Denuit2005; Pedroza, Reference Pedroza2006; Kogure and Kurachi, Reference Kogure and Kurachi2010; Li *et al*., Reference Li, Zhou and Hardy2015; Wong *et al*., Reference Wong, Forster and Smith2018; Lin and Tsai, Reference Lin and Tsai2022). The first Bayesian attempt to implement the LC model via state-space model is made by Pedroza (Reference Pedroza2006) for a single-population framework, followed by a two-population age-period-cohort model developed in Cairns *et al*. (Reference Cairns, Blake, Dowd, Coughlan and Khalaf-Allah2011). The benefits of using Bayesian methods in mortality modelling are fourfold:

• It is particularly suitable for constructing hierarchical models, where the inference is obtained iteratively through the conditional likelihoods.

• It allows for the imputation of the prior knowledge that one can blend in the belief of long-term coherence among the population groups (Hyndman

*et al*., Reference Hyndman, Booth and Yasmeen2013) via prior hyper-parameters.• Unlike the frequentist framework, the model parameters are random variables, which are convenient for superannuation and life insurance fund with heterogeneity among the policyholders (Cairns, Reference Cairns2000).

• Bayesian methods can straightforwardly deal with missing data (Li

*et al*., Reference Li, Zhou, Zhu, Chan and Chan2019).

Despite the sophisticated framework, our proposed hierarchical model can be efficiently estimated via the Bayesian Markov Chain Monte-Carlo (MCMC) algorithm. Nevertheless, a naive implementation of the MCMC is undesirable, due to the resulting slow mixing of the chain and poor estimation results. To address this, we develop an efficient MCMC algorithm for sampling posterior distributions and predictive densities. The model is first rewritten as a state-space model under a matrix-variate Gaussian formulation (Gupta and Varga, Reference Gupta and Varga1992). An algorithm is then proposed to sample the age random effects for all the populations in one block. Finally, instead of using the standard Kalman filter (Koopman and Durbin, Reference Koopman and Durbin2003) for sampling the latent factors, we employ an efficient precision sampler of Chan and Jeliazkov (Reference Chan and Jeliazkov2009) that substantially reduces the computational cost.

To demonstrate its usefulness, we apply our multi-population LC model to empirical mortality data of the Group of Seven (G7) countries, consisting of Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. The sample ranges from 1956 to 2019. Compared with other competing models, our baseline approach achieves favourable in-sample and out-of-sample forecasting results. A structural analysis of mortality dynamics is further implemented.

The contributions of our research are fourfold. First, we provide a parsimonious hierarchical framework and propose a multi-population LC model. This Bayesian hierarchical model extends Pedroza (Reference Pedroza2006)’s state-space representation by introducing random age effects to account for multi-population modelling. Compared with existing approaches, such as the CAE and those proposed in Yang and Wang (Reference Yang and Wang2013) and Zhou *et al*. (Reference Zhou, Wang, Kaufhold, Li and Tan2014), the more flexible population-wise random effects are allowed with minimal additional complexity. Lin and Tsai (Reference Lin and Tsai2022) also consider introducing Bayesian hierarchical method to multi-population modelling, but their underlying model of log mortality rate is a random walk with a drift process. In contrast, our paper develops a Bayesian hierarchical method based on the classical Lee–Carter model and its multi-factor extension. Second, via the VECM approach with an appropriate shrinkage prior, the proposed model has co-integrated temporal factors and thus achieves the desirable long-term coherence across populations. Thus, same as the seminal LL model, the long-run divergence in forecast mortality rates among populations is prevented, which cannot be ensured if independent LC models were adopted (Tuljapurkar *et al*., Reference Tuljapurkar, Li and Boe2000). Compared to the LL specification and other coherent LC extensions (Yang and Wang, Reference Yang and Wang2013; Zhou *et al*., Reference Zhou, Wang, Kaufhold, Li and Tan2014), our model is more flexible, such that temporal interactions across populations could be investigated. In particular, our VECM approach enables a structural analysis to examine the interdependence of mortality dynamics of G7 populations. To the best of our knowledge, this paper is among the first to consider such analyses. We demonstrate that structural shock to US mortality rate can permanently lead to mortality declines of other countries. In the long run, mortality dynamics of Japan and Germany have non-negligible contributions to US mortality changes. Third, an associated Bayesian algorithm to implement the estimation is developed. Different from a naive MCMC, our approach significantly reduces the computational cost and improves the reliability of estimation. Fourth, the empirical results, demonstrate the outperformance of our model, when both in- and out-of-sample forecasts are considered. Also, our empirical evidence shows significant variation in the age effects and significant temporal interactions among examined populations. This supports the importance of our hierarchical structure to more effectively utilise the cross-sectional information.

The remainder of the paper is organised as follows. Section 2 introduces our multi-population LC model and provides appropriate parameter restrictions for identification and asymptotic coherence. The model is then calibrated in the Bayesian framework in Section 3, and an efficient estimation algorithm is provided. The in-sample and out-of-sample empirical performances of our proposed models are then presented in Section 4, along with an empirical structural analysis. Finally, we conclude the paper in Section 5.

## 2. A multi-population Lee–Carter framework

### 2.1. The classic Lee–Carter specification

Lee and Carter (Reference Lee and Carter1992) developed an approach to study mortality data of the United States, which has been widely applied in actuarial and demographic research. The LC model to forecast mortality rates is essentially via extrapolating historical trends and predicting probability distributions of age-specific death rates (ASDR) using standard time-series procedures. The basic LC specification is displayed below:

for ages $x=x_0,\cdots,\omega$ and years $t=1,\cdots,T$ . ${\mathbf{y}}_t=[\log\!(m_{x_0,t}),\cdots,\log\!(m_{\omega,t})]'$ is the vector of logged ASDR. We also have that ${\boldsymbol{\alpha}}=[\alpha_{x_0},\cdots,\alpha_{\omega}]'$ , and $n=\omega-x_0+1$ denotes the number of age groups. The popularity of LC model stems from its simple interpretation, that $\alpha_x$ is the long term average of $\log\!(m_{x,t})$ , $\kappa_t$ is the common temporal trend of mortality change and assumed a latent factor, and $\beta_{x}$ is the relative sensitivity of the ASDR with respects to the time change.

To estimate this single population Lee–Carter model, the original approach is to use sample average for $\alpha_x$ and apply a singular value decomposition on $[{\mathbf{y}}_1,\cdots,{\mathbf{y}}_T]-{\boldsymbol{\alpha}}\mathbf{1}'_{\!\!T}$ to extract ${\boldsymbol{\kappa}}=[\kappa_1,\cdots,\kappa_T]'$ . Then, ${\boldsymbol{\kappa}}$ is adjusted to fit the reported life expectancy at each time. This second stage makes the model fit historical life expectancy exactly. The adjusted ${\boldsymbol{\kappa}}$ is then modelled using standard time-series methods, typically a random walk with a drift, as suggested in the original work of of Lee and Carter (Reference Lee and Carter1992), to produce mortality forecasts.

### 2.2. A multi-population extension of Lee–Carted model

Despite its popularity, independently fitting single-population LC models are insufficient to comprehensively study the mortality dynamics. To see this, in Figure 1, point forecasts of logged central death rates for age 65 (spanning 2020–2119) are depicted for all the G7 countries, where the LC model is independently fitted to each country’s mortality rate over 1956–2019, with male and female data combined. Obviously, those point forecasts in Figure 1 are non-coherent, that is, the predicted mortality rates of those populations diverge over time. This lack of long-term coherence motivated the class of coherent multi-population models (Li and Lee, Reference Li and Lee2005; Li *et al*., Reference Li, Zhou and Hardy2015).

Studying mortality experience in the multiple population context is a more complicated problem. Early explorations by Tuljapurkar *et al*. (Reference Tuljapurkar, Li and Boe2000) treated the mortality movements of each population independently and modelled the age-specific mortality via

for
$i=1,\cdots,I$
, where *I* denoting the number of populations. Various extensions were studied in the literature to allow dependence across populations, such as adopting population-dependent
$\kappa_t^i$
’s in Yang and Wang (Reference Yang and Wang2013) and Zhou *et al*. (Reference Zhou, Wang, Kaufhold, Li and Tan2014), and common age effects in Kleinow (Reference Kleinow2015).

To address the aforementioned empirical issues, this paper introduces a framework that jointly models all populations’ temporal effects and allows heterogeneous age effects. In particular, to ensure the coherent forecasts, a common distribution is shared among these age effects. The concept of coherence is formally defined and discussed in Section 2.3. Listed below is the specification of our baseline model.

Model 1 (The Multi-population LC model). *Let*
$y_t^i=[\log\!(m_{x_0,t}^i),\cdots,\log\!(m_{\omega,t}^i)]'$
*for*
$i=1,2,\cdots,I$
*, we assume that*

*with*
$F_a,F_b$
*and*
$F_\epsilon$
denote some multivariate parametric distributions and

*for all*
$i=1,2, \cdots,I$
*. For simplicity, we assume the covariance-variance matrix*
$\Omega$
*to be a diagonal matrix with*
$g_k$
*being the error variance for the*
$k{\text{th}}$
*age.*

*For the temporal movements of latent factors, we consider a Vector Error Correction Model (VECM) for*
$\boldsymbol{\kappa}_t=[\kappa_t^1,\cdots,\kappa_t^{I}]'$
*such that*

*where N denotes the multivariate Gaussian distribution. The matrix of long-run multipliers,*
$\Pi$
*, can be written as*
$\Pi = \mathbf{c}\mathbf{d}'$
*, where*
$\mathbf{c}$
*and*
$\mathbf{d}$
*are both full rank*
$I \times r$
*matrices and where*
$0 \le r \le I$
*is the number of co-integrating relationships. The matrix*
$\mathbf{d}$
*is called a cointegration matrix and*
$\mathbf{c}$
*is sometimes called a loading matrix. (From the Proposition 1 discussed in* Section 2.4
*, to ensure the long-term coherent mortality forecasts, we require that*
$\kappa^i_t$
*is an I(1) process for any*
$i \in \{1,2,\cdots,I\}$
*. This is a common assumption in mortality literature, for example,* Li and Lee (Reference Li and Lee2005), Yang and Wang (Reference Yang and Wang2013) *and* Zhou et al. (Reference Zhou, Wang, Kaufhold, Li and Tan2014) *among others. In that case, both*
$\Delta\boldsymbol{\kappa}_t$
*and*
$\mathbf{d}'\boldsymbol{\kappa}_t$
*are stationary, and hence, each element of*
$\mathbf{d}'\boldsymbol{\kappa}_t$
*represents a long-run equilibrium relation.) Especially, if*
$r=I$
*then all the elements of*
$\boldsymbol{\kappa}_t$
*are stationary, while if*
$r=0$
*then all the series are I(1) processes without any existing co-integration.*

Hence, the model parameter is

In Model 1, the co-movements of population-wise mortality rates are assumed to follow a VECM model without lagged differences, which is equivalent to a restricted VAR(1) model. The VECM is a theoretical-based approach which is useful for combining both long-run co-integration relationship and short-run corrections/adjustments of co-integrated variables towards the long-run equilibrium. Also, the VECM could be used to analyse interdependence of response variables via a structural analysis, which is popular in empirical fields, such as macroeconomics (Kunst and Neusser, Reference Kunst and Neusser1990; Granger, Reference Granger2004) and finance (Mukherjee and Naka, Reference Mukherjee and Naka1995). More recently, VECM/co-integration techniques have been employed to produce coherent forecasts in mortality modelling (see, for example, Yang and Wang, Reference Yang and Wang2013; Zhou *et al*., Reference Zhou, Wang, Kaufhold, Li and Tan2014; Hunt and Blake, Reference Hunt and Blake2015; Li and Lu, Reference Li and Lu2017; Hunt and Blake, Reference Hunt and Blake2018; Li *et al*., Reference Li, Lu and Lyu2021; Li and Shi, Reference Li and Shi2021a,b, among others). This is for the property of VECM such that the distribution
$\boldsymbol{\kappa}_t|\boldsymbol{\kappa}_{t-1}$
is known, which then allows coherent mortality forecasts. Please see Jarner and Jallbjørn (Reference Jarner and Jallbjørn2020) for a systematic review on benefits and drawbacks of co-integration based mortality models.

The second unique feature of Model 1 is the allowed heterogeneous age effects, that is,
$\alpha_x^i$
’s and
$\beta_x^i$
’s are different across populations. Yet, they are modelled as random effects drawn from a common distribution. The hierarchical structure actually assumes that multiple populations are heterogeneous sub-populations randomly drawn from a hypothetical super-population (represented by the common distribution). The advantages of this hierarchical structure are summarised below. First, this is a parsimonious specification that greatly reduces the dimension of model parameters. Second, the heterogeneity in the age effects is retained, which often provides a better short-term predictive power (Chen *et al*., Reference Chen, MacMinn and Sun2015). Last, as will be shown below, the common distribution and restrictions on
$\boldsymbol{\kappa}_t$
coefficients will enable asymptotic coherence in mortality forecasting (by Proposition 1, we also need
$\beta^i$
to be deterministic with
$||\mu_b||<\infty$
). Although it is beyond the scope of this paper, the hierarchical structure could also be extended to other LC-typed models, such as the LL. Compared to existing competitors, it is expected that more effectively utilising the cross-sectional information, as in Model 1, can generally improve forecasting accuracy for mortality data.

The multi-population LC model, as a special case of the linear dynamic factor models, is subject to identification issues (Li and Lu, Reference Li and Lu2017). The dynamic factor model is a flexible but parsimonious approach to study unobserved heterogeneity, and relevant parameter restrictions for identification purposes have been well discussed in the literature (Bai and Wang, Reference Bai and Wang2015). To identify dynamic factors (i.e., mortality indices $\boldsymbol{\kappa}_t$ ) in the Model 1, we consider two identification assumptions stated below:

Assumption 1.
*For all*
$i=1,\cdots,I$
,

(a) $\alpha^i,\beta^i,\kappa_t^i$

*and*$\epsilon_t^i$*are mutually independent*(b) $\mathbb{E}[\epsilon_t^i]=0$ .

Assumption 2.
*We assume*
$\kappa_{ - 1}^i = [\kappa_2^i, \cdots, \kappa_T^i]'$
*with*
$\kappa_1^i = 0$
, and

*where*
$N_S$
denotes a multivariate normal distribution truncated on a hyper-plane defined by S. Furthermore, we assume

and

*where*
$\Omega$
*is a diagonal matrix while*
$\Sigma_a$
*and*
$\Sigma_b$
*are two general positive-semidefinite matrices.*

As a result of Assumption 1, Model 1 reduces to the independent LC model given the random effects. The identification of our model then boils down to the identification of each single-population Lee-Carter model. Following Lee and Carter (Reference Lee and Carter1992), in Assumption 2, we let $\beta^{i}_{x_0}=1 \mbox{ and }\kappa_1^i = 0$ for $i \in I$ . The merits of the above constraints can be illustrated for the ease of computation and interpretation. (The advantage of this set constraint is that the parameters could still be interpreted as in the classical LC model, except that the interpretation of $\alpha_x^i$ changes slightly. In our case, $\alpha_x^i$ represents the mortality level at the base year. That is, $\mathbb{E}[\log\!(m_{x, 1}^i)] = (\alpha_x^i) + (\beta_x^i) (\kappa_1^i) = \alpha_x^i$ .) And the normality assumption is for simplicity purpose in empirical analysis. By introducing the Gaussian assumption (and conditionally conjugate priors in Section 3.1), one can easily obtain analytical full conditional posterior distributions for model parameters (and dynamic factors). This enables the application of a standard Gibbs sampler to approximate joint posterior distribution.

Note that our model offers a generalised framework and nests a range of specifications. For instance, if both $\Pi$ and $\Sigma_\kappa$ are diagonal matrices, it essentially becomes independent LC models for each population group (especially when $\Pi$ is a zero matrix, all the $\kappa^i_t$ s will follow independent random walks with drifts). It could also be shown that CAE-type models (Kleinow, Reference Kleinow2015) are nested, when associated parameter restrictions are implemented in Model 1. Specifically, the hierarchical random effects can easily reduce to the homogeneous models when $\Sigma_a$ or $\Sigma_b$ are zeros, which suggests that the age effects are the same among all populations. In our subsequent study, Bayesian shrinkage priors will be employed to impose the belief of certain nested model when performing the MCMC estimation for Model 1. The implementation of those priors will not enforce a draconian parameter restriction to make Model 1 a reduced structure. In other words, this effectively balances between the prior belief of the long-term coherence as for the CAE model and inherent features of the empirical data.

In comparison, we also consider two restricted cases, namely Model 2 and Model 3. Specifically, Model 2 assumes a homogeneous age effect such that all the $\beta^i$ ’s are the same across different populations. As will be discussed in Section 2.3, this condition is necessary to ensure the coherent forecasts. In that sense, Model 2 is a special case of Model 1 where $\Sigma_b$ is a zero matrix. Instead of a random effect, $F_b(\cdot|\mu_b,\Sigma_b)$ in Model 2 just serves as a hierarchical prior for the age effect $\beta$ .

Model 2 (The multi-population LC model with homogeneous age effect $\beta$ ).

*with a VAR(1) in the VECM representation for*
$\boldsymbol{\kappa}_t=[\kappa_t^1,\cdots,\kappa_t^{I}]'$
*such that*

*Hence, the model parameter is*
$\theta=\{b,\Pi,\Sigma_\kappa,\mu_a,\mu_b,\Sigma_a,\Sigma_b,\Omega\}$
.

In Model 3, we retain hierarchical structures of $\alpha^i$ and $\beta^i$ but assume that they are population-invariant. That is, in contrast to Model 1, Model 3 is a multi-population LC model with homogeneous age effects $\alpha$ and $\beta$ . This is also a special case of Model 1, where both $\Sigma_a$ and $\Sigma_b$ in Model 1 are zeros. Similar to Model 2, $F_a(\cdot|\mu_a,\Sigma_a)$ and $F_b(\cdot|\mu_b,\Sigma_b)$ are essentially hierarchical priors for the age effects $\alpha$ and $\beta$ , respectively, rather than random effects.

Model 3 (The Multi-population LC model with homogeneous age effects $\alpha$ and $\beta$ ).

*with a VAR(1) in the VECM representation for*
$\boldsymbol{\kappa}_t=[\kappa_t^1,\cdots,\kappa_t^{I}]'$
*such that*

*Hence, the model parameter is*
$\theta=\{b,\Pi,\Sigma_\kappa,\mu_a,\mu_b,\Sigma_a,\Sigma_b,\Omega\}$
.

### 2.3. Multi-population coherence

In this section, we discuss the concept of long-term coherence in the multi-population mortality modelling framework. Generally speaking, it means that death rates in two modelled populations do not diverge in the long run. Following Li and Lee (Reference Li and Lee2005), the formal definition is stated below.

Definition 1.
*The forecasts of a multi-population mortality model are asymptotically coherent if*

*for*
$i,j\in \{1,2,\cdots,I\}$
*and*
$k=1,2$
.

As outlined above, forecasting of the original LC model is the extrapolation of historical temporal trends. However, the price to pay for this simplicity is the incapability to ensure coherence. As demonstrated in Figure 1, a clear long-term divergence of the forecast mortality is demonstrated when LC models are independently fitted.

To realise the coherent forecasting, a simple strategy is to assume that $\beta^i=\beta^j$ and that the spread $\kappa_t^i-\kappa_{t}^j$ is mean-reverting. It can be shown that this condition is sufficient for a standard LC formulation. To see this, if all populations have the same $\beta_x$ and long term $\kappa_t^i$ , then the ratios of the mean ASDRs among populations would be constant over time at each age in the forecasts. Otherwise, its projections of some ASDR would differ from those of others over time. In the Proposition described below, the conditions for coherent modelling are derived.

Proposition 1.
*Suppose that*

2. $\beta^i$ are deterministic vectors with $||\mu_b||<\infty $ for all $i=1,2,\cdots,I$ (i.e., $\Sigma_b$ is a zero matrix); and

3. $\kappa^i_t$ is an I(1) process for any $i \in \{1,2,\cdots,I\}$ and $\kappa_t^i-\kappa_t^j$ is a weakly stationary process for $i,j \in \{1,2,\cdots,I\}$ and $i \neq j$ ,

*the forecast ASDR produced by Model 1 are asymptotically coherent.*

Proof. We need to justify that under the parameter constraints in Proposition 1, the mortality forecasts are not divergent for any $i,j \in \{1,2,\cdots,I\}$ and $i \ne j$ . Since

it is sufficient to prove mortality forecasts to be coherent (i.e., $\lim_{t \to \infty}\mathbb{E}\!\left[\frac{m_{x,t}^i}{m_{x,t}^j}\Bigg\vert \theta\right] < \infty$ ) by demonstrating that all the limits of $\mathbb{E}\!\left[\exp\!(\alpha_x^i-\alpha_x^j)\big\vert \theta\right]$ , $\mathbb{E}\!\left[\exp\!(\beta_x^i\kappa_t^i-\beta_x^j\kappa_t^j)\Big\vert \theta\right]$ and $\mathbb{E}\!\left[\exp\!(\epsilon_{x,t}^i-\epsilon_{x,t}^j)\Big\vert \theta\right]$ are finite when $t \to \infty$ under the assumptions in Proposition 1. To do so, it is then sufficient to show that $\mathbb{E}\!\left[\exp\!(\alpha_x^i-\alpha_x^j)\big\vert \theta\right]$ , $\mathbb{E}\!\left[\exp\!(\beta_x^i\kappa_t^i-\beta_x^j\kappa_t^j)\Big\vert \theta\right]$ and $\mathbb{E}\!\left[\exp\!(\epsilon_{x,t}^i-\epsilon_{x,t}^j)\Big\vert \theta\right]$ are all upper-bounded by their respective time-invariant constants.

We first prove that for any *t*, both
$\mathbb{E}\!\left[\exp\!(\alpha_x^i-\alpha_x^j)\big\vert \theta\right]$
and
$\mathbb{E}\!\left[\exp\!(\epsilon_{x,t}^i-\epsilon_{x,t}^j)\Big\vert \theta\right]$
are upper-bounded. Based on the prior setting,
$\alpha_x^i\vert \theta \sim N([\mu_a]_x, [\Sigma_a]_x)$
for age group *x* in population *i*, where
$[\mu_a]_x$
denotes the *x*-th element in the mean vector
$\mu_a$
and
$[\Sigma_a]_x$
is the *x*-th diagonal element of the variance-covariance matrix
$\Sigma_a$
. Therefore,
$\exp\!(\alpha_x^i)\vert \theta \sim \text{log-}N([\mu_a]_x, [\Sigma_a]_x)$
; here, log-*N* is the log-normal distribution. We can further deduce that
$\exp\!(\alpha_x^i - \alpha_x^j)\vert \theta \sim \text{log-}N(0, 2[\Sigma_a]_x)$
since
$\exp\!(\alpha_x^i)$
and
$\exp\!(\alpha_x^j)$
are identically, independently distributed. Hence, we can see

which is a bounded value dependent on the age only. Similarly, for any *t*,

where
$[\Omega]_x$
represents the *x*-th value on the diagonal line of
$\Omega$
.

The remaining task is to prove that
$\lim_{t\to\infty}\mathbb{E}\!\left[\exp\!(\beta_x^i\kappa_t^i-\beta_x^j\kappa_t^j)\Big\vert \theta\right]$
is finite. Based on the prior setting of
$\beta_x^i$
,
$\beta_x^i\vert \theta \sim N([\mu_b]_x, [\Sigma_b]_x)$
, and it is easy to show that
$\exp\!(\beta_x^i\kappa_t^i-\beta_x^j\kappa_t^j)\vert \theta,\kappa_t^i,\kappa_t^j \sim \text{log-}N([\mu_b]_x^i(\kappa_t^i-\kappa_t^j), [\Sigma_b]_x[(\kappa_t^i)^2+(\kappa_t^j)^2])$
. Then according to the assumptions in Proposition 1, that is,
$[\Sigma_b]_x = 0$
and
$\kappa_t^i-\kappa_t^j$
being a weakly stationary process, for any *t*, we will have

where $k_{ij}$ and $K_{ij}$ represent the stationary mean and variance of $\kappa_t^i-\kappa_t^j$ , respectively. The last equality holds because $\kappa_t^i-\kappa_t^j$ follows a stationary Gaussian process. To prove this, it is enough to show that $\boldsymbol{\kappa}_t$ follows a Gaussian process (stationarity of $\kappa_t^i-\kappa_t^j$ holds by assumption). Starting with an initial state $\boldsymbol{\kappa}_1=\mathbf{0}$ , the VECM form in Equation (2.2) gives us:

where $\Pi^*_i = (\mathbb{I}+\Pi)^i$ for $i=0,\cdots,t-1$ . Hence, $\boldsymbol{\kappa}_t$ follows a Gaussian distribution since it is a linear combination of several i.i.d Gaussian error terms.

In conclusion, under the assumptions of Proposition 1, we could prove that

Similarly, we can further prove that

which deduces that

Note that the above requirements for long-term coherence restricts our model towards one with constant $\beta$ ’s. Furthermore, unlike a standard LL approach, where the underlying co-movement is modelled via a multivariate random walk with drift, we assume that the $\kappa_t^i$ ’s are co-integrated I(1) processes. Those then boil Model 1 down to Model 2. However, Model 1 is much more flexible and might be able to provide more accurate short/medium-term forecasts than Models 2 and 3. In the next section, we discuss the use of Bayesian prior techniques in Model 1 to balance both the desirable long-term coherence and short-term data-specific dynamics.

### 2.4. Structural analysis

Structural analysis is commonly employed to investigate the interdependence of modelled response variables, especially in the macroeconomic literature (see, for example, Forni and Gambetti, Reference Forni and Gambetti2010; Barigozzi *et al*., Reference Barigozzi, Lippi and Luciani2021). Such an analysis is also applicable in our multi-population LC model to study the interdependence of mortality dynamics across sub-populations.

To implement a structural analysis, uncorrelated structural shocks need to be constructed first. In a VECM, a usual way is to consider the forecast errors (i.e., $\xi_t$ in Equation (2.2)) as linear combinations of the structural shocks:

where
$u_t$
are usually assumed to be orthonormal white noises, that is,
$u_t \sim N(0,\mathbb{I}_I)$
with
$\mathbb{I}_I$
being an identity matrix of size *I*. This normalization assumption implies that

The structural shocks could then be explained as random, unexpected events which can influence the mortality rates but exogenous to the currently employed mortality model.

It is widely known that the structural shocks described above are not unique. A common practice is to derive a unique $\Theta_0$ via a Cholesky decomposition of the variance-covariance matrix $\Sigma_{\kappa}$ . This implies that the resulting structural model has a recursive structure. The recursive method identifies structural shocks by imposing short-run restrictions. (There also exists some alternative identification schemes for VAR and VECM via, e.g., long-run restrictions or sign restrictions. Please refer to Lütkepohl, Reference Lütkepohl2005 and Kilian, Reference Kilian2013 for more details about structural VAR and structural VECM.) Specifically, structural shocks of one response variable can only contemporaneously affect variables that ranked after that response. Consequently, a meaningful (non-sample) information is usually needed for identifying the recursive order of the structural shocks.

#### 2.4.1. Impulse Response Function

Impulse response function (IRF) is a popular type of structural analysis. Specifically, this measures the response of one mortality index to an impulse (i.e., an exogenous structural shock) of another. Based on the normalization condition (2.9),
$\frac{\partial}{\partial u^j_t} \kappa^i_{t+h}$
is defined as the the *h*-step IRF of the response of *i*th population’s mortality index
$\kappa^i_{t+h}$
to a one-standard deviation exogenous change in *j*th structural shock
$u^j_t$
.

To derive an analytical form of IRF, we can rewrite the VECM described in Equation (2.2) as

where
$\Pi^*_i = (\mathbb{I}+\Pi)^i$
for
$i=0,\cdots,t+h-1$
. Hence, the aforementioned *h*-step IRF
$\frac{\partial}{\partial u^j_t} \kappa^i_{t+h}$
is given by

where $e_i$ denotes the $i^{\text{th}}$ column of the identity matrix $\mathbb{I}_I$ .

#### 2.4.2. Forecast error variance decomposition

Another important component of structural analysis is the forecast error variance decomposition (FEVD). This metric decomposes the variance of forecast error into the contributions from specific exogenous structural shocks. Essentially, FEVD provides information on how much a structural shock contributes to variations of a particular response variable, and the dynamics of those contributions. Specifically, the proportion
$p^h_{ij}$
of the *h*-step forecast error variance of *i*th population’s mortality index explained by the *j*th structural shock
$u^j_t$
, is given by:

where the denominator
$\sum^{I}_{j=1} \sum^{h-1}_{k=0}\{e'_{\!\!i}[(\mathbb{I}+\Pi)^k\Theta_0]e_j\}^2$
is the *h*-step forecast error variance of
$\kappa^i_{t+h}$
. For more details about FEVD in the VAR or VECM, please refer to Lütkepohl (Reference Lütkepohl2005).

### 2.5. A multi-population and multi-factor Lee–Carter model

In the previous sections, we mainly focus on extending the classical single-factor LC model to a multi-population specification (please refer to Model 1). It is possible to further incorporate multiple factors, which is discussed in this section.

Model 4 (The multi-population multi-factor LC model). *Let*
$y_t^i=[\log\!(m_{x_0,t}^i),\cdots,\log\!(m_{\omega,t}^i)]'$
*for*
$i=1,2,\cdots,I$
*, we assume that*

*with*
$F_a, F_{k,b}$
*and*
$F_\epsilon$
denote some multivariate parametric distributions and

*for all*
$i=1,2, \cdots,I$
*and*
$k=1,2,\cdots,p$
*. For simplicity, we assume the covariance-variance matrix*
$\Omega$
*to be a diagonal matrix with*
$g_k$
*being the error variance for the kth age.*

*For the temporal movements of latent factors, if*
$\boldsymbol{\kappa}_{kt}=[\kappa_{kt}^1,\cdots,\kappa_{kt}^{I}]'$
*is non-stationary, as investigated above, we consider a VECM such that*

*where N denotes the multivariate Gaussian distribution. While if*
$\boldsymbol{\kappa}_{kt}=[\kappa_{kt}^1,\cdots,\kappa_{kt}^{I}]'$
*is stationary, we consider a VAR model such that*

Hence, the model parameter is

Similar to Model 1, as a special case of the linear dynamic factor model, we need to impose some additional constraints to identify dynamic factors. Following the discussion in Bai and Wang (Reference Bai and Wang2015), two identification assumptions are employed and stated below:

Assumption 3.
*For all*
$i=1,\cdots,I$
*and*
$k=1,\cdots,p$
,

(a) $\alpha^i,\beta^i_k,\kappa_{kt}^i$

*and*$\epsilon_t^i$*are mutually independent*(b) $\mathbb{E}[\epsilon_t^i]=0$ .

Assumption 4.
*We assume*
$\kappa_{k,-1}^i = [\kappa_{k2}^i, \cdots, \kappa_{kT}^i]'$
*with*
$\kappa_{k1}^i = 0$
, and

*where*
$N_S$
*denotes a multivariate normal distribution truncated on a hyper-plane defined by S. That is, we constrain the first*
$k-1$
*elements of*
$\beta^i_k$
to be zeros and the kth element to be one. Furthermore, we assume

and

*where*
$\Omega$
*is a diagonal matrix, and*
$\Sigma_a$
*and*
$\Sigma_{k,b}$
*are general positive-semidefinite matrices.*

Finally, we discuss the relevant conditions for coherent modelling of Model 4. The proof of these conditions follows straightforwardly from the the process delineated in the Section 2.3. As such, only the sufficient conditions to achieve coherence are presented below.

Proposition 2.
*Suppose that*

2. For $k=1,\cdots,p$ , at least one of $\kappa^i_{kt}$ is an I(1) process for any $i \in \{1,2,\cdots,I\}$ . And for such a $\kappa^i_{kt}$ , $\kappa_{kt}^i-\kappa_{kt}^j$ should be weakly stationary for $i,j \in \{1,2,\cdots,I\}$ and $i \neq j$ , and $\beta^i_k$ are deterministic vectors with $||\mu_{k,b}||<\infty $ for all $i=1,2,\cdots,I$ (i.e., $\Sigma_{k,b}$ is a zero matrix); and

3. The remaining $\kappa^i_{kt}$ s are stationary, that is, I(0) processes of any $i \in \{1,2,\cdots,I\}$ .

*the forecast ASDR produced by Model 4 are asymptotically coherent.*

In particular, for a two-factor Model 4, it is plausible to assume that $\kappa^i_{1t}$ is an I(1) process characterized by mean-reverting $\kappa_{1t}^i-\kappa_{1t}^j$ , while $\kappa^i_{2t}$ is an I(0) process. Thus, $\kappa^i_{1t}$ can be described by a VECM, whereas $\kappa^i_{2t}$ conforms to a VAR model.

## 3. Efficient Bayesian estimation

As discussed above, the co-integration relationship justifies the long-term coherence assumption. However, focusing on this long-run property only may be too strong to model short/medium-term mortality dependence. In this section, we provide a Bayesian method for Model 1 to allow for more flexibility in the short/medium term. The extension of the subsequent Bayesian method to a multi-factor model, specifically Model 4, is convenient. Consequently, a detailed repetition of this process will be forgone.

### 3.1. Imposing shrinkage priors

Prior specification of Bayesian inference has two components: the parametric class and the prior hyper-parameters given in the parametric family. For the prior parametric distribution, we adopt conditionally conjugate priors structure that implies known conditional density, which is common in the literature for dynamic factor models (see Chan and Jeliazkov, Reference Chan and Jeliazkov2009; Bańbura *et al*., Reference Bańbura, Giannone and Reichlin2010; Njenga and Sherris, Reference Njenga and Sherris2020, for details). In particular, we set the distribution of the random effect parameters

and

where *IW* denote the inverse wishart distribution. For the error variance, we set

for $x = x_0, \dots, \omega$ . For the VECM coefficients, we consider

and

Under this construction, all conditional distributions are tractable, and this avoids the use of non-smooth samplers such as the Metropolis–Hasting algorithm. One can employ the standard Gibbs sampling algorithm to estimate the posterior distribution, leading to the algorithm’s geometric convergence.

In Bayesian data analysis, the prior distribution usually demonstrates one’s prior beliefs, which could be independent of empirical data. For example, since mortality rates of all ages are expected to continue to decline in the future, we could set the mean value
$m_6$
of mortality index’s drift term *b* as negative and the mean value
$m_2$
of age effect
$\mu_b$
to be positive. In other words, this prior structure effectively balances long-term belief (i.e., coherent mortality forecasts) and short-term empirical dynamics.

Bayesian shrinkage priors have been widely employed in the literature (Litterman, Reference Litterman1986) and are adopted in our estimation. To fulfil the first requirement in Proposition 1, that is, $\beta^i$ should be a constant, one can set a small value of $\phi_4$ for the prior of $\Sigma_b$ . The second condition is equivalent to the fact that coefficient matrix $\Pi$ in the VECM model of ${\boldsymbol{\kappa}}_t$ should be of a reduced rank. This is to shrink the movements of ${\boldsymbol{\kappa}}_t$ towards a co-integrated I(1) process. In particular, the rank of the coefficient matrix $\Pi$ should be $I-1$ , or equivalently, $\Pi$ should have a zero eigenvalue.

To specify the prior distribution of $\Pi$ , we follow a similar procedure developed in Litterman (Reference Litterman1986) and employ the Minnesota prior. The basic idea is to “center” the distribution of coefficients in $\mathbb{I}_I+\Pi$ so that the behaviour of each element in $\boldsymbol{\kappa}_t$ approximates a random walk with drift. Similarly, our prior belief that $\boldsymbol{\kappa}_t$ ’s are co-integrated over time could be formulated by setting the following moments for the prior distributions of the entries in $\Pi$ :

where $\lambda_1$ and $\lambda_2$ are two hyper-parameters of the prior distribution of $\Pi$ . The $vec(\Pi)$ is assumed to be normally distributed with a diagonal variance-covariance matrix. (As for a multi-factor Model 4, if $\boldsymbol{\kappa}_{kt}$ is a I(0) process, we can just assign $\mathbb{E}[(B_k)]$ as a null matrix.)

Roughly speaking, this prior specification assumes that $\kappa^i_t$ is a weighted average of $\kappa^i_{t-1}$ and $\kappa^{i+1}_{t-1}$ with the weight $\lambda_1$ . In order to avoid the existence of explosive roots, the range of $\lambda_1$ should be between 0 and 1. In addition, the hyper-parameter $\lambda_2$ controls the overall tightness of the prior distribution and represents the relative importance of prior beliefs compared with data-specific information. When $\lambda_2$ increases, the prior beliefs become less informative, and the sample information will be more dominant.

For example, when $J = 3$ , the prior belief is that

which is equivalent to expressing $\boldsymbol{\kappa}_t$ in the VAR form of

Equation (3.2) is exactly a VECM form of
$\boldsymbol{\kappa}_t$
with the co-integrating vectors being
$\begin{bmatrix}-1 & 1 & 0\end{bmatrix}$
and
$\begin{bmatrix}0 & -1 & 1\end{bmatrix}$
. In other words, this prior specification supposes that
$\kappa^i_t-\kappa^j_t$
is mean-reverting for any choice of *i* and *j*. Table 1 summarise all the hyperparameters that will be used in our empirical analysis.

Alternatively, instead of using shrinkage prior, it is also possible to restrict parameters of Model 1 to satisfy the long-term coherence assumptions. For instance, we can make use of Model 2 in which age effect $\beta_i=\beta$ are assumed to be the same across different populations. Additionally, in the VECM form of $\boldsymbol{\kappa}_t$ , the coefficient $\Pi$ can be written as a matrix product $\mathbf{c}\mathbf{d}'$ . Then we can apply the co-integration relations (i.e., $\kappa_t^i-\kappa_t^j$ is weakly stationary) by setting the co-integration matrix $\mathbf{d}'$ as a $(I-1)\times I$ matrix

It can be shown that such specifications produce the long-term coherent forecasts. However, as will be discussed in Section 4, this would be too restrictive for short/medium-term forecasting performance. Instead, imposing Bayesian prior to Model 1 is therefore a more flexible method.

### 3.2. The proposed MCMC algorithm

In this section, we discuss a block-based Gibbs sampler to estimate the proposed multi-population LC models. Note that high-dimensional latent random states are to be sampled (i.e., the $\alpha^i$ ’s, $\beta^i$ ’s and ${\boldsymbol{\kappa}}_t$ ’s), in addition to the model parameters. Given the model parameters, random effects $\alpha^i$ and $\beta^i$ can be sampled iteratively, and the VECM/VAR can be obtained via the Kalman Filter. However, due to the high dimensionality of mortality data, a naive implementation of such a sampler will lead to extensive computational costs. In this section, we present an efficient precision sampler.

#### 3.2.1. Jointly sampling the random age effects

As the discussion in Chan and Jeliazkov (Reference Chan and Jeliazkov2009), the efficiency of MCMC can be greatly improved, if the number of blocks could be reduced. Therefore, the following joint conditional posterior of all the random age effects is developed, which allows us to sample them as a whole block.

First, rewrite the general model as

where ${\mathbf{Y}}_t=[y_t^1,\cdots,y_t^I], A=[\alpha^1,\cdots,\alpha^I], {\boldsymbol{\beta}}=[\beta^1,\cdots,\beta^I], {\boldsymbol{\epsilon}}_t=[\epsilon_t^1,\cdots,\epsilon_t^I]$ . Since $\beta^{i}_{x_0}$ = 1 for $i \in I$ , ${\boldsymbol{\beta}}$ can be expressed as $[\mathbf{1}_I, {\boldsymbol{\beta}}^{'}_{(-1)}]^{'}$ . Based on this, we can sample

with

and

where

In this formulation,
$\Omega_{(-1)}$
means
$\Omega$
without the first row and column.
${\boldsymbol{\beta}}_{(-1)}$
,
${\mathbf{Y}}^{(-1)}_t$
and
$A_{(-1)}$
, respectively, represent
${\boldsymbol{\beta}}$
,
${\mathbf{Y}}_t$
and *A* with the first rows removed. This is equivalent to sampling them independently for each population, yet avoiding using a loop.

#### 3.2.2. Precision sampler for all the $\kappa$ ’s

The sampling of all the $\boldsymbol{\kappa}_t$ ’s usually involves the Kalman filter. In this paper, we develop the precision sampler that allows us to sample all the latent time-series jointly. This improves the transparency in the integrated likelihood and, consequently, allows for the associated model comparison. Recall that for the identification purpose, we have set ${\boldsymbol{\kappa}}_1=0$ . Hence, sampling is only required for ${\boldsymbol{\kappa}}=[{\boldsymbol{\kappa}}'_{\!\!2},\cdots,{\boldsymbol{\kappa}}'_{\!\!T}]$ . For the proposed model, we can rewrite

where ${\mathbf{Y}}=[vec({\mathbf{Y}}_2)',...,vec({\mathbf{Y}}_T)']', {\boldsymbol{\epsilon}}=[vec({\boldsymbol{\epsilon}}_2)',...,vec({\boldsymbol{\epsilon}}_T)']'$ , and

where $M_I$ is a matrix of dimension $I^2\times I$ that first converts the column vector into a diagonal matrix and then vectorizes it, that is,

In a similar fashion, we can write the VECM of latent factors as

where $\boldsymbol{\xi}=[\xi'_{\!\!2},\cdots,\xi'_{\!\!T}]'$ and

Hence, we derive the precision sampler

with

The most significant advantage of our precision sampler is that the precision matrix $K_k$ is a symmetric block-banded matrix with very few non-zero elements. In Section A of Supplementary Material, we provide a simplified example of $\kappa_t$ ’s precision matrix with red points representing non-zero elements in the matrix. Our sampling algorithm’s exact computational advantages are also demonstrated via simulation explorations in Section A of Supplementary Material. The presented precision matrix there is an intermediate product during the simulation, when the empirical analysis detailed in Section 4 is conducted. It can be seen that only a small number of non-zero entries falling in a narrow band around the precision matrix’s diagonal. From a computational point of view, this implies that we could reduce storage and computational costs by exploiting efficient algorithms designed for the sparse matrix. More details regarding the precision sampler can be found in Chan and Jeliazkov (Reference Chan and Jeliazkov2009).

## 4. Empirical application to the modelling of G7 mortality data

In this section, we examine the mortality data of the G7 countries, which are sourced from Human Mortality Database (2019). In a related study, the empirical results of Tuljapurkar *et al*. (Reference Tuljapurkar, Li and Boe2000) with data over 1950–1994 suggest a universal decline in mortality across all age groups in the G7 populations. They alluded that this trend places a constraint on any theory of society-driven mortality decline and provides a basis for stochastic mortality forecasting via the LC-type model. Instead of independently fitting LC models, this section presents a comprehensive analysis of the G7 mortality data based on our proposed multi-population LC models.

### 4.1. Empirical data set

We use the crude (un-smoothed) annual data for the period 1956–2019 for all the G7 countries. For each country, age-sex-specific death rates are available annually, from age 0 to age 110. Since mortality measures at very old ages are unreliable (Lee and Carter, Reference Lee and Carter1992), we constrain the maximum age as 89, such that $\omega = 89$ as in the model. Also, male and female mortality rates are combined for the following analyses.

### 4.2. Model comparison: preliminary results

To demonstrate the usefulness of our model, we first undertake a preliminary comparative analysis of a range of popular single-factor and two-factor models. For the single-factor case, our consideration encompasses Model 1, Model 2, Model 3, the LC model (Lee and Carter, Reference Lee and Carter1992), and the single-factor Li & Lee model (Li and Lee, Reference Li and Lee2005). With respect to the two-factor models, we compare the performance of Model 4, the two-factor LC model, and the two-factor Li & Lee model (Li and Lee, Reference Li and Lee2005).

To obtain comparable results, both the LC models and the Li & Lee models have been reformulated according to the state-space representations suggested by Pedroza (Reference Pedroza2006) (please refer to Section B of Supplementary Material for details). In particular, we consider the marginal likelihood as the basis for model comparison, which is a standard technique in Bayesian analysis (Koop, Reference Koop2003). It is worth mentioning that evaluating the marginal likelihood is usually a computationally challenging task. In practice, the most commonly used Bayesian information criteria (or BIC) approximates twice the log of the marginal likelihood (Schwarz, Reference Schwarz1978). To address the computational issue, Newton and Raftery (Reference Newton and Raftery1994) proposed a simple way to calculate marginal likelihood by using the posterior harmonic mean of the likelihood. Please refer to the Section C of Supplementary Material for more details. To more comprehensively compare the prediction accuracy of each model, we present the out-of-sample forecasting results in Section 4.4.

In Table 2, we present the marginal log-likelihoods for Model 1–4, the LC model, and the Li & Lee model. For Model 1–3, we consider three distinct values of the hyper-parameter $\lambda_2$ : 0.01 (moderate), 0.1 (weak), and 0.00001 (strong). These same values of $\lambda_2$ are also considered for the first factor in Model 4.

In general, two-factor models significantly outperform single-factor mortality models in terms of the marginal likelihood. Geweke and Amisano (Reference Geweke and Amisano2011) demonstrated that the in-sample marginal likelihood is intimately connected to the one-step-ahead predictive likelihood, thereby making it a good measure of short-term forecasting accuracy. Thus, the two-factor models showcase superior short-term forecasting capabilities compared to their single-factor counterparts.

Among all the single-factor models, the LC model exhibits the greatest marginal likelihood. In contrast, the Li & Lee model only outperforms Model 3, ranking it as the second-worst model. Among Model 1–3, Model 1, which incorporates heterogeneous random age effects, has the highest marginal likelihood. Conversely, Model 3, characterized by homogeneous age effects, has the lowest marginal likelihood. The marginal likelihood of Model 2 considerably surpasses Model 3, although it remains inferior to Model 1. Therefore, even though Model 2 satisfies long-term coherence conditions, Model 1 emerges as the preferred model when we focus on its superior short-term predictive performance. Moreover, Model 1 exhibits greater flexibility than Model 2, considering that Model 2 is actually a special case of Model 1 when $\Sigma_b$ converges to a zero matrix. Essentially, Model 1, employing a Bayesian shrinkage prior, more appropriately balances the trade-off between short to medium-term predictive accuracy and long-term coherence. The superior short- to medium-term predictive accuracy is attributable to the heterogeneous random age effects, while the long-term coherence is approximately attained by imposing a Bayesian shrinkage prior. Furthermore, in the selection of hyper-parameters, the optimal choice is determined to be $\lambda_2=0.1$ among three distinct $\lambda_2$ values.

Similarly, when considering two-factor mortality models, the two-factor LC model has the highest marginal likelihood. Conversely, the two-factor Li & Lee model results in the lowest marginal likelihood, which suggests comparatively less accurate short-term forecasting performance. Regarding the selection of the hyper-parameter $\lambda_2$ for Model 4, it is discerned that $\lambda_2=0.01$ is the optimal choice.

### 4.3. Fitted in-sample results and inferences

#### 4.3.1. The temporal trend of mortality rates

In this subsection concerning in-sample results, our primary focus is on the single-factor model. Specifically, we utilize Model 1, setting $\lambda_2=0.1$ for the in-sample analyses within this subsection. This selection is premised on its superior short-term forecasting performance relative to other competitive specifications. (Please refer to the Section E of Supplementary Material for the in-sample analyses for Model 4.) Figure 2 exhibits the temporal plot of fitted $\kappa_t^i$ separately for each country. The solid line represents the posterior mean of $\kappa_t^i$ , and the grey area depicts the corresponding 99% credible band. It appears that $\kappa_t^i$ has a persistent declining trend and thus indicates a non-stationary process. Those posterior means are consistent with the downward trends of historical mortality data of the G7 countries. The narrow widths of the credible band imply that the estimation of ${\boldsymbol{\kappa}}_t$ is reliable.

To compare the differences in mortality declines over years, we plot all the estimated posteriors means of $\kappa_t^i$ ’s in Figure 3. Despite some slowed mortality improvements over recent periods, the overall patterns might be roughly fitted by linear trends, especially for those post-1970s. According to the behaviours of $\kappa_t^i$ , G7 countries can be divided into three distinct groups:

1. Japan has the lowest mortality index across all seven countries, and the Japanese $\kappa_t^i$ declines at almost a constant rate (linear pattern) over the examined six decades.

2. Canada, France, (West) Germany, Italy and the UK: Although their $\kappa_t^i$ ’s decline rates are substantially lower than the Japanese counterpart before 1980s, all those countries’ $\kappa_t^i$ ’s tend to exhibit similar speeds of decline after 1990. Figure 3 shows that the estimated $\kappa_t^i$ ’s of the six countries are almost parallel to each other over 1990–2019.

3. USA has the highest mortality rates among G7 countries over the sample period. Unlike the other G7 countries, the marginal decline rate of the USA’s $\kappa_t$ decelerates, especially over most recent years. A flat curve is displayed since 2010. This somewhat deviates from the rest G7 countries’ overall temporal trend.

#### 4.3.2. The age effects of mortality rates

In Figure 4, we plot the estimated age effects, that is,
$\mu_a$
and
$\mu_b$
, respectively. Recall that the heterogeneous age effects in Model 1 are drawn randomly from a common distribution characterised by
$\mu_a$
and
$\mu_b$
, to ensure the coherence. For age *x*,
$\mu_a^x$
represents the (common) first-year (i.e., 1956) level of log mortality rate, and
$\mu_b$
indicates the (common) age-specific loading of
$\kappa_t^i$
. From the widths of 99% credible bands shown in Figures 4 and 5, the estimation uncertainty is relatively lower for
$\mu_a$
than for
$\mu_b$
. The age pattern of
$\mu_a$
has a classic ‘tick’ shape, which declines from age 0 to reach a minimum around age 12. The pattern then almost uniformly increases, except for a famous ‘accidental-hump’ over ages 15–25. The estimates of
$\mu_b^x$
are also consistent with empirical findings, such that mortality declines are faster at young ages than at very old ages.

In addition to $\mu_a$ and $\mu_b$ , Figures 5 and 6 present estimated age effects $\alpha^i$ and $\beta^i$ for each country, respectively. Recall that those are the unique feature of Model 1 to allow flexible population-wise age effects. Despite some minor differences, all the $\alpha^i$ ’s demonstrate rather similar patterns. More differences can be observed for the country-dependent $\beta^i$ ’s, suggesting various age-specific decline speeds across the G7 countries. In Figure 7, we plot the posterior means of $\alpha^i$ ’s and $\beta^i$ ’s for all the G7 countries together to facilitate comparison. The population-specific variations are demonstrated by differences among the corresponding curves. This validates the effectiveness of our specification to enable heterogeneous age effects in the Model 1. Again, we observe that $\alpha^i$ ’s of all the countries are relatively close to each other, whereas the $\beta^i$ ’s are more heterogeneous, especially for ages over 20–40 and 50–80.

#### 4.3.3. Cross-sectional dynamic structure

We now investigate relationships among ${\boldsymbol{\kappa}}_t$ in the VECM, characterised by the coefficient matrix $\Pi$ . In Table 3, posterior means of parameter $\Pi$ are presented, together with their corresponding standard errors. Those estimates could provide more information about interdependence of temporal factors in each country. We firstly verify the impact of prior belief of long-term coherence on the coefficient matrix $\Pi$ . Specifically, recall that our prior belief suggests that $\kappa^i_t-\kappa^j_t$ is weakly stationary for any $i \ne j$ . It can be seen that most of the diagonal elements of $\Pi$ are shrunk to $-\lambda_1$ (-0.1), and most of the first super-diagonal elements are also close to $\lambda_1$ (0.1). Most of the remaining cross-sectional relationships are insignificant, since a small $\lambda_2$ (0.1) is adopted to be consistent with our prior belief of the long-term coherence.

In Figure 8, we report posterior distributions of eigenvalues’ modulus computed from the simulated VECM coefficient $\Pi$ . Recall that to meet the coherence conditions, the Minnesota-type prior is adopted such that the coefficient $\Pi$ has a reduced rank. Figure 8 shows that the smallest eigenvalue (measured by modulus) of $\Pi$ is close to 0. Hence, the VECM coefficient $\Pi$ is indeed of rank $I-1$ . This supports the desirable co-integration of temporal factors, and thus all the countries’ mortality rates will not diverge in the long run (Li and Lu, Reference Li and Lu2017).

#### 4.3.4. Structural analysis

As stated in Section 2.4, a meaningful recursive relationship is needed to identify unique structural shocks. In this section, since GDP is believed an important factor on the mortality improvements (Boonen and Li, 2017), we choose an order according to G7 countries’ total GDP as of 2019, which are retrieved from World Bank Open Data (https://data.worldbank.org). Specifically, the response variables are ranked as USA, Japan, Germany, UK, France, Italy and Canada. Thus, this recursive structure implies that the US structural shock could contemporaneously affect mortality indices of others, while the shock of Canadian population cannot affect any of other contemporaneous mortality indices.

Note: $^{*}$ 0 is outside the 90% credible interval;

In Figure 9, we plot the responses of all the mortality indices to the US mortality shock of one standard deviation (around 3%). The solid blue lines represent posterior means of IRFs, and the grey areas and dashed red lines correspond to 68% and 95% credible intervals, respectively (It is common to use 68% and 95% credible intervals in the macroeconometrics literature, for example, see baumeister2019structural). Since mortality rates are consistently improving (reducing), our results suggest that a reduction in the US mortality rates may significantly lead to permanent decline of mortality rates in other G7 countries. For the contemporaneous impact, the point estimates of IRF range from 1 to 3%, whereas the influence at the 50th step varies within roughly the same range.

In Figure 10, the dynamics of FEVD of the US mortality index contributed by structural shocks of G7 populations are plotted. First, we observe that contributions of shocks of US mortality rates constantly reduce over time and achieves a minimum just below 30% at the 50th step. As for the contributions of other G7 countries, shocks of Japanese mortality rates is the largest (around 60% at the 50th step), followed by those of Germany population (around 10% at the 50th step). This is consistent with the assumed recursive order that Japan and German are the second and third largest economies, respectively, among all G7 countries.

Interpreting the impacts of mortality rates among different countries is typically challenging due to the multitude of influencing factors. Despite this, changes of Japan’s mortality rates could potentially influence US mortality rates through the following channels, given the tight social and economic connections between the two countries.

First, a significant shift in Japan’s mortality rates could indicate a new health trend, disease emergence or healthcare breakthrough. If these factors are globally pervasive or if the associated research is disseminated worldwidely, they could potentially impact US mortality rates. For instance, if a medical breakthrough originates in Japan and later adopted by the US, it could potentially reduce the US mortality rates. Given that Japan has the highest life expectancy worldwidely, it is plausible that Japanese mortality rates serve as an upper limit for other countries.

Second, as a significant global economy and a key trading partner of the US, shocks to Japan’s mortality rates could affect the size and productivity of its workforce, which could have a long-term economic impacts. Those impacts could then lead to non-negligible impacts to the US economy, which eventually affects factors such as the healthcare conditions and thus influence the US mortality rates. Specifically, a significant rise in mortality among Japan’s working-age population could have negative impacts on labour economics and thus reduce its economic outputs. Consequently, this might introduce economic downturns in the US, which then lead to increased mortality rates due to factors such as increased stress level and reduced healthcare spending.

### 4.4. Out-of-sample forecasting results

The out-of-sample forecasts of our proposed models can be obtained using the posterior predictive distributions of log mortality rates
${\mathbf{y}}_{T+s}$
, where
$s=1,2,\dots, h$
, with *h* the largest forecasting step. This will be based on the simulations of latent random states and model parameters. Please see the Section D of Supplementary Material for more details on the simulation of posterior predictive distributions.

#### 4.4.1. Out-of-sample forecasting performance comparison

In this section, we evaluate and compare forecasting performances of popular mortality models as considered in Section 4.2, including Models 1–4, the LC model and Li & Lee model. Our complete sample period is from 1956 to 2019, with the training sample spanning 1956–2009 and the test sample covering 2010–2019. At each step, we use only the data up to time *t*, denoted as
$\mathcal{F}_t$
, to obtain the posterior predictive density of the h-step ahead forecast
$\boldsymbol{y}_{t+h}$
.

Consistent with existing studies, such as Li and Lu (Reference Li and Lu2017) and Li and Shi (Reference Li and Shi2021a), the popular metric root mean squared forecast error (RMSFE) is employed to evaluate forecasting accuracy. Denote
$\mathbb{E}(y_{x,T+h} \vert \mathcal{F}_T)$
as the h-step-ahead point forecast (posterior expectation) for age *x* at year *T*, the RMSFE for the next *h* years is defined as

where
$y^o_{x,T+t}$
is the observed values of log mortality rate
$\log\! (m_{x,T+t})$
for age *x* at year
$T+t$
.

Table 4 provides a comprehensive comparison of the RMSFEs for all fitted single-factor mortality models. The RMSFEs are calculated at different forecast horizons, ranging from 1 to 10, and three various shrinkage hyperparameters ( $\lambda_2 = 0.1, 0.01, 0.00001$ ) are used for Models 1–3. Our results suggest that the forecasting performance varies with the forecast horizon and the shrinkage hyperparameter. Overall, the single-factor LC model demonstrates the best forecasting performance, while the common factor Li & Lee model substantially underperforms others. Despite this, it is worth noting that the LC model achieves a higher forecasting accuracy at the cost of producing incoherent forecasts in a longer term, whereas the single-factor Li & Lee model achieves the long-run coherence with the least accurate forecasts. Our proposed models, namely Model 1–3, manage to balance the trade-off between short-term forecasting performance and long-term coherence. Across different shrinkage hyperparameters and among the three proposed specifications, Model 1 is the best performing candidate in terms of forecasting accuracy.

Table 4 also indicates that the RMSFEs of the three proposed models are influenced by the shrinkage hyperparameters. Regardless of the model used, both weak ( $\lambda_2 = 0.1$ ) and strong ( $\lambda_2 = 0.00001$ ) priors outperform the moderate prior ( $\lambda_2 = 0.01$ ). This may be attributed to the flexibility of the weak prior and the strong prior’s ability to capture the long-term trend. Conversely, the moderate prior lacks these two advantages, resulting in poor forecasting performance. In particular, it is worth mentioning that Model 1 with a weak prior performs as the second best model across all competitors (including LC and Lee & Li models).

In Table 5, we further compare the forecasting performances of three two-factor mortality models: Model 4, the two-factor LC model and the augmented common factor Li & Lee model. Among them, Model 4 is the best performing model. When compared to single-factor counterparts, the improvements achieved by the two-factor models are primarily evident in short-term forecasting, particularly when the forecast horizon *h* is small. When the forecast horizon expands, Model 4 exhibits forecasting accuracy comparable to that of its single-factor counterpart. However, for the LC model and the Li & Lee model, the forecasting performances significantly deteriorate when
$h>5$
. This may be attributed to the hierarchical structure employed by Model 4, which results in fewer parameters than the LC and Li & Lee models and consequently reduces the possibility of overfitting. Finally, unlike the single-factor models, we find that the choice of hyperparameters does not have substantially affect the forecasting performance of Model 4.

Apart from the point forecasts, prediction intervals with high coverage are usually more important for mortality models to be used in actuarial practices. In Tables 6 and 7, we present the coverage ratios of the 95% prediction intervals of single-factor models and two-factor models, respectively. The average widths of the prediction intervals (to demonstrate the efficiency) are displayed in the Section F of Supplementary Material. It is worth noting that, for single-factor models, our proposed models can provide larger coverage ratios with comparable forecasting efficiency. In the case of two-factor models, LC and Li % Lee models results in abnormally wide prediction intervals at almost all forecasting steps. Despite the resulting large coverage ratios, those intervals are undesirable, compared with those of Model 4.

#### 4.4.2. Long-run predictions of mortality rates

We now compare the long-run predictions of mortality rates for all the G7 countries using the Model 1, for its best out-of-sample performance among the four proposed specifications. The forecast horizon *h* is chosen as 30 years beyond 2019 (up to 2049). To facilitate the comparison, the point forecasts (posterior means) are presented and discussed in this section.

To illustrate the long-term and age-specific forecasts, we present point forecasts of log mortality rates at age 65 for all the G7 countries in Figure 11. The left, middle and right panel displays results corresponding to Model 1 with a weak ( $\lambda_2 = 0.1$ ), moderate ( $\lambda_2 = 0.01$ ) and strong ( $\lambda_2 = 0.00001$ ) shrinkage parameter, respectively. We also display long-run forecasts of log mortality rates at age 65 using Model 2 in the Section G of Supplementary Material, where the results are similar to those from Model 1. When $\lambda_2 = 0.00001$ , despite the presence of some short-term fluctuations, the age-specific mortality rates exhibit parallel declining trends, implying their non-divergence over the long term. In contrast, when $\lambda_2=0.1$ , the data demonstrate relatively more diversified declining patterns, signifying the long-term divergence of these age-specific mortality rates across the G7 countries. This underscores the effectiveness of our prior parameter ( $\lambda_2$ ) in achieving the long-term coherence. Similar observations can be made from the plots of future life expectancy at birth provided in the Section H of Supplementary Material. By employing a strong shrinkage prior, the future life expectancy also appears to be nearly parallel across different countries.

## 5. Conclusion

In this paper, we present a new multi-population mortality framework based on the seminal Lee–Carter model. In particular, a hierarchical structure is assumed for the age effects, and a (structural) VECM is employed to fit the co-movements of the mortality dynamics. By employing the Bayesian inference with a shrinkage prior, the proposed model is flexible on balancing the short-term empirical patterns and long-term coherence in mortality forecasting. Building on the Bayesian MCMC literature, we construct an efficient precision block sampler that largely reduces the extensive computational cost of Kalman filter and small-blocked sampling. The application to the G7 data set demonstrates the usefulness of our model in understanding the mortality dynamics for actuarial practice.

## Supplementary Material

To view supplementary material for this article, please visit https://doi.org/10.1017/asb.2023.29.