### 2.1 Introduction

The basic principles of mixed model analysis will be explained by using a continuous outcome variable, i.e. they will be explained using a linear mixed model analysis. The most important basic principle to be considered is the fact that linear mixed model analysis can be seen as an extended linear regression analysis. So, to understand the basic principles of mixed model analysis, linear regression analysis must be the starting point.

Suppose that a cohort study is performed to investigate the relationship between physical activity and health. Both are continuous variables, and Eq. 2.1 describes the linear regression model. Figure 2.1 illustrates the result of this linear regression analysis.

health = *β*_{0} + *β*_{1}activity + *ε*   (2.1)

where ‘health’ = continuous outcome variable; *β*_{0} = intercept; *β*_{1} = regression coefficient for activity; ‘activity’ = continuous independent variable; and *ε* = error/residual.
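The coefficients in Eq. 2.1 are obtained by ordinary least squares. As a minimal sketch, the closed-form formulas for a single predictor can be written out in Python; the mini-dataset below is invented for illustration and is not the example study data:

```python
# Hypothetical mini-dataset (made-up numbers, not the example study)
activity = [40, 45, 50, 52, 55, 60]
health = [24, 27, 30, 31, 33, 36]

n = len(activity)
mean_x = sum(activity) / n
mean_y = sum(health) / n

# beta_1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(activity, health))
sxx = sum((x - mean_x) ** 2 for x in activity)
beta1 = sxy / sxx                 # slope: health difference per unit of activity
beta0 = mean_y - beta1 * mean_x   # intercept: health when activity equals 0
```

For real analyses a statistical package would be used, but the closed form makes explicit what *β*_{0} and *β*_{1} estimate.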

The intercept (*β*_{0}) is the value of the outcome variable (health) when the independent variable (physical activity) equals zero. The regression coefficient for activity (*β*_{1}) reflects the difference in health for subjects who differ one unit in physical activity. Suppose now that the analysis is adjusted for gender. The reason for this adjustment is that males are different from females. Therefore, an adjustment for gender makes sense (Eq. 2.2).

health = *β*_{0} + *β*_{1}activity + *β*_{2}gender + *ε*   (2.2)

where *β*_{2} = regression coefficient for gender.

Suppose that males are coded as 0 and females are coded as 1. The intercept *β*_{0} now reflects the intercept (i.e. the value for health when physical activity equals zero) only for males, while *β*_{0} + *β*_{2} reflects the intercept for females. So, an adjustment for gender actually means that the intercept of the regression line is assumed to be different for males and females (see Figure 2.2).

Suppose another situation in which the study is performed in a city with different neighbourhoods. It is reasonable to assume that subjects living in a particular neighbourhood are more alike than subjects living in different neighbourhoods. Subjects can live in a rich neighbourhood or in a poor neighbourhood, and the socioeconomic situation of subjects can be related to the outcome of the study (i.e. health). So, because the observations of the outcome variable within a certain neighbourhood are correlated with each other, the linear regression analysis should be adjusted for neighbourhood.

Again, an adjustment for neighbourhood actually means that different intercepts are estimated for each neighbourhood (see Figure 2.3). This is done by adding dummy variables for neighbourhood to the model (Eq. 2.3).

health = *β*_{0} + *β*_{1}activity + *β*_{2}neighbourhood_{1} + … + *β*_{m}neighbourhood_{m−1} + *ε*   (2.3)

where *β*_{2} to *β*_{m} = regression coefficients for the dummy variables representing the different neighbourhoods, and *m* = number of neighbourhoods.

Therefore, if there are 50 neighbourhoods involved in the study, 49 additional regression coefficients must be estimated in the linear regression analysis. This is a dramatic waste of power and efficiency, especially because the neighbourhood variable was only added to the regression analysis to be adjusted for. There is no real interest in the different health values for each of the separate neighbourhoods. A much more powerful and efficient way to adjust for neighbourhood is provided by mixed model analysis. The general idea behind a mixed model analysis can be seen as a three-step procedure. In the first step, the intercepts for the different neighbourhoods are estimated. In the second step, a normal distribution is drawn over the different intercepts. In the third step, the variance of that normal distribution is estimated. That variance is then added to the regression model in order to adjust for the neighbourhood. So, instead of estimating 49 regression coefficients to adjust for the neighbourhood, this adjustment is performed by adding only one variance parameter to the model. The estimation of the variance of the intercepts is also referred to as a random intercept. This is why mixed model analysis is also known as random coefficient analysis.
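The three-step idea can be made concrete with a toy calculation. In the sketch below the per-neighbourhood intercepts are hypothetical numbers, and the ‘steps’ are shown separately for clarity; an actual mixed model estimates the intercept variance jointly with all other parameters:

```python
import statistics

# Step 1 (hypothetical): one estimated intercept per neighbourhood
intercepts = [28.1, 31.4, 29.8, 33.0, 27.5, 30.9, 32.2, 29.0]

# Steps 2 and 3: assume the intercepts follow a normal distribution and
# summarise them by a single parameter, their variance
mean_intercept = statistics.fmean(intercepts)
random_intercept_variance = statistics.variance(intercepts)
```

Instead of 49 extra regression coefficients for 50 neighbourhoods, the adjustment is thus captured by one variance parameter.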

In mixed model terminology it is said that the observations that are made of the subjects are clustered within neighbourhoods. Because the observations of subjects within one neighbourhood are correlated, an adjustment must be made for neighbourhood. In this particular example the data contains a two-level structure: the subject is the lowest level, while the neighbourhood is the highest level (see Figure 2.4).

In general, mixed model analysis is a very efficient way of adjusting for a categorical variable with many categories. Of course, there is some sort of trade-off. This trade-off is the assumption that the distribution of intercepts for the different neighbourhoods is more or less normal, because the variance used in the adjustment is based on the normal distribution. So, when performing a mixed model analysis, it is important to realise that this normality assumption underlies the estimation procedure (see also Section 2.7).

### 2.2 Example

The example that will be used to explain the basic principles of mixed model analysis is a very simple one: a cross-sectional cohort study investigating the relationship between two continuous variables, physical activity and health. In this study, 684 subjects living in 48 different neighbourhoods are involved. So, in this example, the observations of the subjects are clustered within neighbourhoods. Table 2.1 shows the descriptive information of the data used in this example.

Table 2.1 Descriptive information of the example dataset

|  | Mean | Standard deviation |
| --- | --- | --- |
| Health | 30.3 | 6.7 |
| Physical activity | 50.2 | 5.8 |

Output 2.1 shows the result of the standard linear regression analysis without adjusting for the neighbourhood. This kind of analysis is also referred to as a naive regression analysis because it ignores the possible clustering of data, meaning it ignores the fact that the observations of subjects within the same neighbourhood can be correlated.

The same result can also be obtained from a so-called naive mixed model analysis: a mixed model analysis without adjusting for the neighbourhood. Output 2.2 shows the result of this analysis.

The first line of Output 2.2 shows that a mixed effects maximum-likelihood (ML) regression is performed. The first part of the right-hand column refers to the number of observations (684). In the last line of the left-hand column, the log likelihood is given. The log likelihood value is a result of the maximum-likelihood estimation procedure. The value itself is not informative but can be used in the likelihood ratio test. The first part of the output also gives the result of a statistical test. This test is a Chi-square test with one degree of freedom. The Chi-square value is 241.24 with a corresponding *p*‑value < 0.001. This is the result of a statistical test for all the variables in the model. In this analysis, the only variable in the model is activity. It should be noted that in general this statistical test does not provide interesting information.

The second part of the output shows the regression coefficients. This part is also referred to as the fixed part of the regression model. Besides the regression coefficient, the standard error, the *z*-value (which is calculated as the regression coefficient divided by the standard error), the corresponding *p*-value and the 95% confidence interval (CI) are given. The 95% CI is calculated as the regression coefficient ± 1.96 times the standard error. The regression coefficient for activity (0.5933817) reflects the difference in health when there is one unit difference in physical activity. The intercept (0.5743945) reflects the value of health when physical activity equals 0. It should be noted that the value of the intercept in this example is not informative, because a physical activity score of 0 does not occur in the example dataset. The *z*-value for the test of whether the regression coefficient for activity equals 0 is 15.53. The squared value of this number is (apart from rounding) equal to the value of 241.24 given in the first part of the output, which was used as a test statistic for all variables in the model. This is because a Chi-square distribution with one degree of freedom is equal to the standard normal distribution squared. So, the *z*-test for the variable activity gives exactly the same *p*‑value as the Chi-square test for all variables in the model.
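The quantities in the fixed part can be reproduced from the reported numbers. A small Python sketch, where the standard error is derived from the reported coefficient and *z*-value (since *z* = coefficient / standard error):

```python
import math

beta = 0.5933817  # regression coefficient for activity (Output 2.2)
z = 15.53         # reported z-value

se = beta / z                  # implied standard error
ci_low = beta - 1.96 * se      # lower bound of the 95% CI
ci_high = beta + 1.96 * se     # upper bound of the 95% CI

# two-sided p-value from the standard normal distribution
p = math.erfc(z / math.sqrt(2))
```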

The last part of the output shows the so-called random part of the regression model. Because a naive mixed model analysis was performed, the random part of the regression model only contains the residual variance, which is also known as the error variance or the unexplained variance.

In the second step of the example analysis, an adjustment is made for neighbourhood; in other words, a random intercept on the neighbourhood level is added to the model. As mentioned before, this is done by estimating the variance of the intercepts for the different neighbourhoods (based on a normal distribution). Output 2.3 shows the result of this analysis.

In the first part of Output 2.3 it can be seen that the group variable neighbourhood is added to the model and that there are 48 neighbourhoods. The right-hand column also shows the average number of subjects in a particular neighbourhood (14.3) and the minimum and maximum number of subjects (4 and 49 subjects, respectively).

The fixed part of the regression model provides the same information as before, although the numbers are slightly different, because in the analysis reported in Output 2.3 an adjustment for neighbourhood has been made. That adjustment can be found in the random part of the regression model, which now contains not only the residual variance but also the random intercept variance (4.018727). This number reflects the variance in intercepts for the different neighbourhoods around the intercept value given in the fixed part of the model (0.7898844). The variance is estimated from a normal distribution around the intercepts of all the different neighbourhoods (see Section 2.1).

The question then arises whether or not it is necessary to adjust for neighbourhood, or in other words, whether or not a random intercept should be added to the model. This question can be answered by performing the likelihood ratio test. The likelihood ratio test compares the −2 log likelihood of the model with a random intercept and the −2 log likelihood of the model without a random intercept. The difference between the −2 log likelihoods of the two models follows a Chi-square distribution, with the number of degrees of freedom equal to the difference in the number of parameters estimated in the two models. In the present example the difference between the two −2 log likelihoods equals 30.96, which follows a Chi-square distribution with one degree of freedom. There is one degree of freedom because, compared to the naive analysis, only the variance of the intercepts is additionally estimated in the second analysis, which is one parameter. Because the critical value for a Chi-square distribution with one degree of freedom is 3.84, the difference of 30.96 is highly significant.

It has been argued that when variance parameters are added to the model, the difference between the two −2 log likelihoods should be tested one-sided: because a variance can only be positive, the difference between the −2 log likelihoods can only go in one direction (Goldstein, 1995, 2003; Lin, 1997). It is rather strange that one-sided *p*-values are used for the likelihood ratio test in mixed model analysis, while two-sided *p*-values are used for likelihood ratio tests in, for instance, standard logistic or Cox regression analysis. In these standard situations basically the same phenomenon occurs, because adding variables to a model can only change the −2 log likelihood in one direction. Although in practice it does not matter much whether one-sided or two-sided *p*-values are used, it is important to realise that this contradiction exists in the literature. In the remaining part of this book two-sided *p*-values will be used for the likelihood ratio tests.
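The likelihood ratio test above is simple arithmetic on the −2 log likelihoods. A small Python sketch using the difference of 30.96 from the text; for one degree of freedom the Chi-square survival function reduces to erfc(√(x/2)):

```python
import math

diff = 30.96            # difference between the two -2 log likelihoods
critical_value = 3.84   # Chi-square, 1 df, alpha = 0.05

# two-sided p-value for a Chi-square statistic with 1 df
p_two_sided = math.erfc(math.sqrt(diff / 2))
p_one_sided = p_two_sided / 2  # the one-sided variant discussed in the text
```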

The result of the likelihood ratio test to compare the model with a random intercept and the model without a random intercept (i.e. a standard linear regression analysis) can also be found in the last line of the output.

The most interesting information in the output is still the fixed part of the regression model, which shows the regression coefficient for activity. Adjusted for neighbourhood, the regression coefficient for activity is 0.586818. So, a difference of one unit in activity between subjects is associated with a difference of 0.59 units in the continuous health outcome. When the *z*-test is performed for this regression coefficient, it is obvious that this relationship is highly significant. The 95% CI around the regression coefficient of 0.59 ranges from 0.52 to 0.66.

### 2.3 Intraclass Correlation Coefficient

Based on the random intercept variance and the residual variance, the so-called intraclass correlation coefficient (ICC) can be calculated. This ICC is an indication of the average correlation of the observations of subjects living in the same neighbourhood. The ICC is defined as the variance between neighbourhoods divided by the total variance, where the total variance is defined as the summation of the variance between neighbourhoods and the variance within neighbourhoods. The variance within neighbourhoods is equal to the residual variance. It seems to be counterintuitive that a correlation coefficient is calculated from variances. Therefore, Figure 2.5 illustrates this phenomenon. Figure 2.5a shows the distribution of a particular outcome variable, and in Figure 2.5b and c the outcome variable is divided into three groups (for instance three different neighbourhoods). In Figure 2.5b, the ICC is low, because the variance within groups is high and the variance between groups is low. Because in Figure 2.5c the groups are more spread out, the between-group variance increases and the within-group variance decreases. As a consequence, the ICC is high. In fact, Figure 2.5c shows the most extreme situation given the data, i.e. there is maximal between-group variance and minimal within-group variance, leading to the highest possible ICC, given the variance in the data. Thus, in fact, variance and correlation are related to each other one to one.

Going back to the results of the example given in Output 2.3, the ICC can be calculated by dividing the between-neighbourhood variance (4.018727) by the total variance, which is calculated by summation of the between-neighbourhood variance and the within-neighbourhood variance (4.018727 + 29.57958). So, in this example the ICC is 4.02/33.6 = 0.12. It should be noted that this ICC is not the pure ICC present in the data, because both variances are calculated from a model which includes the independent variable physical activity. Physical activity is related to the outcome health, so it reduces the remaining residual variance considerably. Besides that, it is possible that physical activity also explains some of the differences between the neighbourhoods, so the between-neighbourhood variance can also be influenced by physical activity. Therefore, it is better to calculate the pure ICC, which can be done with a so-called intercept-only model (i.e. a model with an intercept but without any independent variables). This intercept-only model has to include an adjustment for the neighbourhood, meaning it has to include a random intercept. Output 2.4 shows the result of this analysis.

Output 2.4 shows that both the random intercept variance and the residual variance are higher compared to the model with physical activity (Output 2.3). From the numbers given in Output 2.4 the pure ICC can be calculated. This ICC is 5.32/(5.32 + 40.06) = 0.12, which is equal to the ICC calculated from the model with physical activity. So, in this particular situation, adding physical activity to the model influences both variances by the same (relative) amount.
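Both ICC calculations are direct ratios of the reported variances:

```python
# Output 2.3: model including physical activity
between = 4.018727    # random intercept (between-neighbourhood) variance
within = 29.57958     # residual (within-neighbourhood) variance
icc_adjusted = between / (between + within)

# Output 2.4: intercept-only model
between0 = 5.32
within0 = 40.06
icc_crude = between0 / (between0 + within0)
```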

### 2.4 Random Slopes

Up to now, the only situation that has been considered is one in which the intercept of the regression model is allowed to differ between groups. Let us go back to the regression model in which the analysis of the relationship between physical activity and health was adjusted for gender (Eq. 2.2).

Suppose now that it is not only assumed that the intercepts are different for males and females, but also that the relationship between physical activity and health differs between males and females. To allow for that, an interaction term between physical activity and gender must be added to the regression model (Eq. 2.4). By adding this interaction term, different slopes of the regression line are estimated for males and females (see Figure 2.6).

health = *β*_{0} + *β*_{1}activity + *β*_{2}gender + *β*_{3}(activity × gender) + *ε*   (2.4)

where *β*_{3} = regression coefficient for the interaction between physical activity and gender.

When the possible effect modifier (i.e. gender) is a dichotomous variable, just one interaction term has to be added to the regression model. However, when it is not the interaction with gender that is of interest, but for instance the interaction with neighbourhood, the same problems that were described for the adjustment for neighbourhood arise. When the observations are clustered within neighbourhoods, it may be reasonable to assume that the relationship between physical activity and health is different for different neighbourhoods. In other words, in this situation different slopes of the regression line have to be estimated for each neighbourhood (see Figure 2.7).

health = *β*_{0} + *β*_{1}activity + *β*_{2}neighbourhood_{1} + … + *β*_{m}neighbourhood_{m−1} + *β*_{m+1}(activity × neighbourhood_{1}) + … + *β*_{2m−1}(activity × neighbourhood_{m−1}) + *ε*   (2.5)

where *β*_{m+1} to *β*_{2m−1} = regression coefficients for the interactions between physical activity and the dummy variables representing the different neighbourhoods, and *m* = number of neighbourhoods.

In this example with 48 neighbourhoods, 47 interaction terms have to be added to the regression model. Estimating 47 regression coefficients that are not of major interest will, again, lead to a loss of power and efficiency. Comparable to what has been described for the different intercepts, the interest is in the overall relationship between physical activity and health. To cope with this, mixed model analysis provides a very elegant solution, and it does so for this situation as well. As for the different intercepts, a three-step procedure can be applied. First, different regression coefficients for the relationship between physical activity and health are estimated for the different neighbourhoods. Then, a normal distribution is drawn over the different regression coefficients. Finally, the variance of the normal distribution is estimated. Now only one variance parameter reflects the difference in the relationship between physical activity and health for the different neighbourhoods. So, in addition to a random intercept, a random slope can also be considered. Again, it should be noted that the estimation of the variance of the slopes is based on a normal distribution.
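As for the random intercept, the three-step idea for slopes can be illustrated with a toy calculation; the per-neighbourhood slopes below are hypothetical, and a real mixed model estimates the slope variance jointly with the other parameters:

```python
import statistics

# Step 1 (hypothetical): one estimated slope for activity per neighbourhood
slopes = [0.52, 0.61, 0.58, 0.66, 0.49, 0.63, 0.55, 0.60]

# Steps 2 and 3: the overall relationship and a single variance parameter
overall_slope = statistics.fmean(slopes)
random_slope_variance = statistics.variance(slopes)
```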

### 2.5 Example

Let us go back to the example dataset that was described in Section 2.2. Output 2.5 shows the output of the mixed model analysis on the example dataset in which not only a random intercept is added to the model, but also a random slope for physical activity.

The first two parts of Output 2.5 (i.e. the general information and the fixed part of the regression model) look exactly the same as the output of the mixed model analysis with only a random intercept on neighbourhood level (see Output 2.3). The difference between the two outputs is found in the random part of the regression model. It can be seen that on the neighbourhood level the ‘var(activity)’ is added. This is the random slope variance for activity, i.e. the variance over the regression coefficients for physical activity for the different neighbourhoods. This variance is a very small number, but that does not necessarily mean that the random slope for physical activity is not important. To evaluate the importance of the random slope for activity, a likelihood ratio test must be performed.

The −2 log likelihood of the model with both a random intercept and a random slope for physical activity is −2 × −2153.4088 = 4306.8. Compared to the −2 log likelihood of the model with only a random intercept (also 4306.8), adding a random slope to the model did not lead to a significant improvement of the model. Therefore, it is not necessary to add a random slope for physical activity to the model. From Output 2.5 it can also be seen that the random slope for activity and the random intercept are assumed to be independent. It is questionable whether it makes sense to assume that the two random components are independent. Because they belong to the same neighbourhood, the intercept and slope for a particular neighbourhood are expected to be related to each other. Therefore, it makes more sense to model this dependency together with the random slope for physical activity. Output 2.6 shows the results of a mixed model analysis in which the random intercept and the random slope for activity are not assumed to be independent.

From the random part of the regression model in Output 2.6 it can be seen that besides the random intercept and the random slope for activity, the covariance between the random intercept and the random slope for activity (‘cov(activity,_cons)’) is modelled as well. It can also be seen that the random slope variance for activity is much higher than the random slope variance given in Output 2.5, and that the random intercept variance has increased enormously. The next step is to compare the two models with each other by using the likelihood ratio test. It is common to compare the model presented in Output 2.6 (i.e. the model with a random slope and the covariance between random intercept and slope) with the model with only a random intercept (Output 2.3). The −2 log likelihood of the model with a random intercept, a random slope for activity and the covariance between the random intercept and random slope is −2 × −2142.4879 = 4285. When this value is compared with the −2 log likelihood of the model with only a random intercept (i.e. 4306.8), it can be seen that there is an improvement of more than 21 points. This value has to be evaluated on a Chi-square distribution with two degrees of freedom, because besides the random slope for activity, the covariance between the random intercept and the random slope for activity was also added to the model. The critical value of the Chi-square distribution with two degrees of freedom is 5.99, so the model with a random intercept, a random slope for activity and the covariance between the two is significantly better than the model with only a random intercept. It should be noted that a model assuming independence between a random intercept and a random slope on the same level does not make much sense, so it is advised never to assume independence between two random components on the same level, but always to model the dependency together with the new random component.
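The comparison just described can be reproduced from the reported log likelihoods; for two degrees of freedom the Chi-square survival function has the closed form exp(−x/2):

```python
import math

ll_intercept_only = -2153.4088  # model with only a random intercept
ll_full = -2142.4879            # + random slope and covariance

diff = (-2 * ll_intercept_only) - (-2 * ll_full)

p = math.exp(-diff / 2)  # p-value, Chi-square with 2 df
critical_value = 5.99    # Chi-square, 2 df, alpha = 0.05
```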

The covariance between the random intercept and the random slope is also known as the correlation between the random intercept and the random slope or the interaction between the random intercept and the random slope. For the interpretation of this covariance, the sign is probably the most important. In the example the covariance has a negative sign, which indicates an inverse relationship between the random intercept and the random slope for activity. In other words, for neighbourhoods with a relatively high intercept, a relatively low slope is observed (see Figure 2.8). On the other hand, a positive sign of the covariance between a random intercept and a random slope indicates that the group with a relatively high intercept also has a relatively high slope.

Looking at Output 2.6 and Output 2.3, it was already mentioned that the random intercept variance in the model with only a random intercept is much lower than the random intercept variance in the model with a random intercept, a random slope and the covariance between the random intercept and the random slope (i.e. 4.0 versus 166.4). The high value of the random intercept variance in the latter model has to do with the fact that the intercept in this example does not have a real interpretation. As mentioned before, the value of the intercept indicates the value of the outcome variable ‘health’ when the independent variable ‘physical activity’ equals zero. In the example dataset, the physical activity score ranges between 29 and 61. Therefore, the intercept reflects a value that is far outside the range of observed values. In a situation with only a random intercept this does not influence the variance between the intercepts, because the difference between the regression lines is equal at each value of physical activity (see Figure 2.9a). However, when the slopes of the regression lines differ, this can have a major influence on the variance of the intercepts when the value of the intercept is non-informative (see Figure 2.9b).

A possible way in which to make the intercept more interpretable is to use the centred value of the independent variable in the analysis. This can be done by subtracting the average activity score from all individual observations. Because the result of this subtraction is that the average activity score in the dataset will be zero, the intercept can be interpreted as the outcome variable health for the average physical activity score. Output 2.7 shows the result of a mixed model analysis with a random intercept, a random slope for physical activity and the covariance between the random intercept and slope, when physical activity is centred.

From Output 2.7 it can be seen that the regression coefficient (and the random slope variance) for activity is, of course, exactly the same as before, but the magnitude of the intercept and the random intercept variance have changed considerably. It can further be seen that the random intercept variance now has more or less the same value as in the analysis with only a random intercept. To make the intercept and the random intercept variance more interpretable, it is often argued that the independent variables should be centred in all mixed model analyses. It should be noted that the regression coefficient of interest is not influenced by centring the independent variable(s) (see also Section 2.6.1).
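That centring changes the intercept but not the regression coefficient can be checked with a toy least-squares calculation (made-up numbers):

```python
import statistics

activity = [40, 45, 50, 52, 55, 60]  # hypothetical activity scores
health = [24, 27, 30, 31, 33, 36]    # hypothetical health scores

def fit(x, y):
    """Simple least-squares fit; returns (slope, intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

slope_raw, intercept_raw = fit(activity, health)

# centre activity by subtracting its mean
mean_act = statistics.fmean(activity)
slope_c, intercept_c = fit([a - mean_act for a in activity], health)
# slope is unchanged; the intercept now equals the mean health score
```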

### 2.6 Mixed Model Analysis with More than Two Levels

Up to now, only a relatively simple situation has been considered in which the individual observations were clustered within neighbourhoods. It is, however, also possible that the different neighbourhoods are clustered within, for instance, regions (see Figure 2.10).

It is not surprising that this clustering within a higher level can be treated in the same way as has been described for the clustering of the individual observations within neighbourhoods. So, for the different regions, a random intercept (i.e. the variance of the intercepts for the different regions) and a random slope for activity (i.e. the variance of the regression coefficients for activity for the different regions) can also be estimated. It should be realised that, in general, the model building procedure starts by adding random intercepts on the different levels to the model step by step and, after that, possible random slopes are considered.

#### 2.6.1 Example

Output 2.8 shows the result of a mixed model analysis in which a random intercept is assumed for neighbourhood as well as for region.

From the first part of Output 2.8 it can be seen that the outcome variable is measured on three levels (subjects, neighbourhoods and regions) and that there are 12 regions. The model with a random intercept on both the region level and the neighbourhood level can be seen as an extension of the model with only a random intercept on neighbourhood level, the result of which was shown in Output 2.3. The difference between the two can be found in the random part of the regression model, which now shows the random intercept variance on region level (2.947866) and the random intercept variance on neighbourhood level (1.128718). The necessity of the additional random intercept on region level can be evaluated with the likelihood ratio test. Therefore, the −2 log likelihood of the model with only a random intercept on neighbourhood level (−2 × −2153.4088 = 4306.8176) has to be compared to the −2 log likelihood of the model with a random intercept on both neighbourhood and region level (−2 × −2146.8629 = 4293.7258). This difference of 13.09 is evaluated on a Chi-square distribution with one degree of freedom and is highly statistically significant, so a random intercept on region level is necessary. Looking at the two variances provided in Output 2.8, it can be seen that most of the between-neighbourhood variance shown in Output 2.3 (the linear mixed model analysis with only a random intercept on neighbourhood level) is actually variance between regions. Only a relatively small part of the variance is still related to the neighbourhoods; this part now reflects the difference in health between neighbourhoods within regions.
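The likelihood ratio test for the extra random intercept on region level is again simple arithmetic on the reported log likelihoods:

```python
m_neighbourhood = -2 * -2153.4088  # -2LL, random intercept on neighbourhood only
m_both_levels = -2 * -2146.8629    # -2LL, random intercepts on both levels

diff = m_neighbourhood - m_both_levels  # evaluated on Chi-square, 1 df
critical_value = 3.84                   # Chi-square, 1 df, alpha = 0.05
```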

The next step in the modelling procedure is to add a random slope for activity on neighbourhood level to the model. Output 2.9 shows the result of this analysis.

From the random part of the regression model in Output 2.9 it can be seen that a random slope for activity on the neighbourhood level is added to the model. It can also be seen that the covariance between the random intercept and the random slope for activity on the neighbourhood level is added to the model. As has been mentioned before, it is rather strange to assume that the random intercept and random slope on a particular level are independent.

To evaluate whether a random slope for activity on neighbourhood level must be added to the model, the likelihood ratio test can be performed. To do so, the −2 log likelihood of the model with two random intercepts has to be compared with the −2 log likelihood of the model shown in Output 2.9. The −2 log likelihood of the latter equals −2 × −2141.5435 = 4283.087. The −2 log likelihood of the model with two random intercepts was 4293.7258. So, the difference between the two models equals 10.64. This difference follows a Chi-square distribution with two degrees of freedom – the variance of the slopes on neighbourhood level and the covariance between the random intercept and random slope on neighbourhood level – which is highly significant.

The last possibility for a random coefficient in the present model is allowing the regression coefficient for activity to differ between regions. When an analysis is performed in which the intercepts as well as the slopes for activity are assumed to be random on both neighbourhood and region level, an error message occurs. This error message mentions that there are problems with the Hessian matrix; in most situations, it says that the ‘Hessian matrix is not negative semidefinite’. Although this is a quite complicated mathematical issue, it basically means that no optimal solution for the likelihood function can be derived. There are several ways to deal with this problem. Most researchers will conclude that when there is no optimal solution, it is better not to add the random slope for activity on region level to the model and to report the results of the model with two random intercepts and a random slope for activity on neighbourhood level. Because activity is a continuous variable with values far above zero, another possibility is to use the centred variable for activity as the independent variable (see Section 2.5). Besides a better interpretation of the intercept and the random intercept variance, centring the independent variable also leads to less complicated estimation. When the same analysis is performed with the centred activity variable, there is no error message, but the iterations never stop, which also indicates that no optimal solution can be achieved. To stop the model iterating endlessly, a maximum number of iterations can be set. Output 2.10 shows the result of an analysis with a random intercept and a random slope for activity (the centred variable) on both neighbourhood and region level and a maximum of 50 iterations.

From Output 2.10 it can be seen that the estimation of the standard errors of the variances in the random part of the regression model leads to problems. It is therefore questionable whether the results of the analysis shown in Output 2.10 are valid. Another possibility to reach an optimal solution for the maximum-likelihood estimation is to increase the tolerance factor. When the log likelihood changes by a relative amount less than the tolerance factor, the solution is considered optimal. In some situations the optimal solution is not reached because the estimates jump between two possibilities for which the difference in log likelihood is slightly bigger than the tolerance factor. A slight increase in the tolerance factor will then sometimes lead to a proper solution. Output 2.11 shows the results of an analysis with a random intercept and random slope for activity (the centred variable) on both neighbourhood and region level with an increased tolerance factor.
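The interplay between a convergence tolerance and a maximum number of iterations can be illustrated with a toy iterative procedure. This is deliberately not the mixed model likelihood itself; the update functions and starting values are purely illustrative:

```python
# Toy illustration of iterative estimation with a tolerance and an
# iteration limit (not the actual mixed model likelihood).
def iterate(update, start, tol=1e-8, max_iter=50):
    value = start
    for i in range(1, max_iter + 1):
        new_value = update(value)
        # stop when the relative change drops below the tolerance
        if abs(new_value - value) <= tol * max(abs(value), 1.0):
            return new_value, i          # converged
        value = new_value
    return value, max_iter               # stopped only by the iteration limit

# A well-behaved update converges in a handful of iterations ...
root, n = iterate(lambda x: 0.5 * (x + 2.0 / x), start=1.0)  # Newton step for sqrt(2)
print(round(root, 6), n < 50)            # 1.414214 True

# ... while an oscillating update never satisfies the tolerance and is
# stopped by max_iter, mimicking estimates that keep jumping between
# two solutions.
_, n_osc = iterate(lambda x: -x, start=1.0)
print(n_osc)                             # 50
```

Increasing `tol` in the oscillating case would make the procedure declare convergence at one of the two candidate solutions, which is exactly the trick described above.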

When the two alternative ways to obtain an optimal solution are compared, it can be seen that the log likelihood values, the regression coefficients in the fixed part of the regression model and the estimates of the variance components are exactly the same. So, the results shown in Output 2.10 and Output 2.11 are probably valid. When the −2 log likelihood of the model with two random intercepts and two random slopes (−2 × −2134.7689 = 4269.5378) is compared to the −2 log likelihood of the model with two random intercepts and only a random slope for activity on neighbourhood level (4283.087), it can be seen that this difference (evaluated on a Chi-square distribution with two degrees of freedom) is highly significant. Note that in this particular example the alternative ways to obtain an optimal solution only give a solution when the centred variable for activity is used. This is, however, not a problem for the interpretation of the regression coefficient of interest, which is not influenced by centring a variable.

Thus, if it is believed that the alternative ways to obtain an optimal solution in this example are valid, the most appropriate estimate of the relationship between physical activity and health can be found in Output 2.10 and Output 2.11. The regression coefficient for activity is 0.55 and the corresponding 95% CI ranges from 0.42 to 0.68. The *p*-value of the relationship between physical activity and health is < 0.001. It should, however, be realised that most researchers will report the results of the analysis with two random intercepts and a random slope for activity on neighbourhood level, the results shown in Output 2.9.

### 2.7 Assumptions in Mixed Model Analysis

Because linear mixed model analysis is an extension of standard linear regression analysis, all assumptions for standard linear regression analysis also hold for linear mixed model analysis. So, the continuous outcome variable should be normally distributed, i.e. the residuals should be normally distributed. As in standard linear regression analysis, this can be investigated by inspection of the distribution of the residuals. Moreover, the residuals should also be uncorrelated. In most mixed model studies this should not be a big problem, because the reason for performing a mixed model analysis in the first place is that there are correlated observations (i.e. correlated residuals) in the data to be analysed. Basically, by using mixed model analysis the problem of these correlated residuals is more or less solved. However, there are research situations in which the use of mixed model analysis only partly solves the problem of the correlated residuals. This is, for instance, sometimes the case in longitudinal studies (see Chapter 9).

An additional assumption that is typical for mixed model analysis was already mentioned in the earlier sections: the intercepts and slopes from which the variance is estimated must be more or less normally distributed. Whether or not this is a reasonable assumption can, for instance, be investigated by analysing the mean values of the outcome variable for the different groups (i.e. neighbourhoods). Figure 2.11 shows the distribution of the mean values of health for the 48 neighbourhoods in the present example.

From Figure 2.11 it can be seen that the distribution of the mean values of health for the different neighbourhoods is slightly skewed to the left. However, the skewness is not that strong, so it can be concluded that the assumption regarding the normal distribution of the intercepts holds.
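Such a check can be sketched by computing the sample skewness of the group means: values near zero indicate rough symmetry, while clearly negative values indicate skewness to the left. The sketch below uses simulated neighbourhood means, since the actual data behind Figure 2.11 are not available here:

```python
import math
import random

def skewness(values):
    """Sample skewness: mean of cubed standardised deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

random.seed(1)
# 48 simulated neighbourhood means of health; the real means are the
# ones displayed in Figure 2.11.
neighbourhood_means = [random.gauss(30.5, 1.2) for _ in range(48)]

g1 = skewness(neighbourhood_means)
# values of |g1| well below 1 are usually taken as acceptably symmetric
print(abs(g1) < 1)
```

In practice one would compute this (or simply inspect a histogram) for the observed group means, exactly as is done informally with Figure 2.11.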

In addition to checking the assumptions of the linear mixed model analysis, it can also be important to investigate whether the model coefficients are influenced by certain data points or whether outliers are present in the analysed dataset. Because of the mixed model structure of the data, outliers (or influential data points) can occur at different levels. Observations of subjects can influence the overall relationship that is analysed or can be outliers on subject level. On the other hand, a single observation can also be an outlier for the particular neighbourhood to which that subject belongs; in other words, the subject's observation can be an outlier on neighbourhood level. For detailed information regarding outliers and influential data points in mixed model analysis, reference is made to Atkinson (1986), Barnett and Lewis (1994), Lawrence (1995) or Langford and Lewis (1998).

### 2.8 Comments

#### 2.8.1 Which Regression Coefficients Can Be Assumed to Be Random?

In the present example, only one independent variable was included in the model (i.e. physical activity). Because physical activity was measured on subject level it was theoretically possible to add a random slope for activity on neighbourhood level and region level to the model. A general rule concerning random slopes is that they can only be considered to be random at a level above the one on which the variable is measured. In the present example this means that because activity is measured on subject level, activity can only be assumed to be random on the levels above subject level (i.e. on neighbourhood level and on region level). In line with this, if a variable was measured on neighbourhood level, for instance the number of playgrounds in a particular neighbourhood, the regression coefficient (i.e. slope) for that particular variable can only be assumed to be random on region level.

#### 2.8.2 Random Regression Coefficients versus Fixed Regression Coefficients

Within mixed model analysis the regression model is divided into a fixed part and a random part. Therefore, a distinction can be made between fixed and random coefficients. It should be realised that this distinction differs from the distinction between random and fixed factors in the traditional analysis of variance. In analysis of variance a random factor is defined as: ‘a categorical variable in which the groups are a random sample of all possible groups about which conclusions are desired’ (e.g. neighbourhood or region). A fixed factor is defined as: ‘a categorical variable about which conclusions are desired for every group’ (e.g. gender). In mixed model analysis, however, a fixed regression coefficient is just the regression coefficient itself. In principle, all regression coefficients of a mixed model analysis are fixed, because in general one is interested in the magnitude of the regression coefficients. In addition to the fixed part of the regression coefficient, each regression coefficient can also be considered to be random (depending on the level on which the variable is measured [see Section 2.8.1]). This random part of the regression coefficient is the variation of the regression coefficient between the groups considered (e.g. neighbourhoods, regions, etc.). The term ‘random’ is probably not the most appropriate in this respect, because the regression coefficients are not really random; they are only assumed to be different for different groups.

#### 2.8.3 Maximum Likelihood versus Restricted Maximum Likelihood

In all analyses performed so far in this chapter, the regression coefficients and variances were estimated by maximum likelihood (ML). There is also another procedure available that can be used to estimate the regression coefficients and variances of a mixed model analysis, i.e. restricted maximum likelihood (REML). It should be noted that in most software packages the REML estimation procedure is the default (see also Chapter 13). There is no real consensus concerning the best estimation procedure. It is often argued that REML is better for the estimation of random variances, while ML is better for the estimation of the (fixed) regression coefficients. In general, the (fixed) regression coefficients are of major interest; therefore, in the examples presented in this book, a ML estimation procedure is used. To illustrate the differences between the ML and the REML estimation procedure, the relationship between physical activity and health with both a random intercept and a random slope on neighbourhood level and a random intercept on region level was also estimated with REML. Output 2.12 shows the result of this analysis, while Table 2.2 summarises the results of both analyses.

Table 2.2 *Regression coefficients and variance estimates (standard errors in parentheses) estimated with maximum likelihood (ML) and restricted maximum likelihood (REML)*

| | ML estimate | REML estimate |
|---|---|---|
| **Fixed part** | | |
| Intercept | 30.49 (0.45) | 30.50 (0.47) |
| Activity | 0.58 (0.05) | 0.58 (0.05) |
| **Random part** | | |
| Neighbourhood level: random intercept variance | 2.85 (1.74) | 2.69 (1.70) |
| Neighbourhood level: random slope variance for activity | 0.04 (0.02) | 0.04 (0.02) |
| Region level: random intercept variance | 1.08 (1.07) | 1.35 (1.25) |
| Log likelihood | −2141.5 | −2143.6 |

From Table 2.2 it can be seen that the differences between the two estimation procedures are very small. Not surprisingly, the only differences were observed for the variance of the random intercepts and the −2 log likelihood. So, the discussion about whether maximum likelihood or restricted maximum likelihood should be used for mixed model analyses is basically irrelevant. It should furthermore be noted that for restricted maximum-likelihood estimation the likelihood is totally based on the estimation of the variances in the random part of the model. Thus, when the likelihood ratio test is used for evaluating whether a particular variable should be added to the fixed part of the model, the likelihood estimated with restricted maximum likelihood is not valid and can therefore not be used (Morrell, Reference Morrell1998; Oehlert, Reference Oehlert2012).
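The essential difference between ML and REML can be shown in its simplest form: the variance of an i.i.d. normal sample. The ML estimator divides by n and is biased downwards because the mean is estimated from the same data; the REML-type estimator divides by n − 1, which corrects for that lost degree of freedom. This is only an analogy for the mixed model case, computed on made-up numbers:

```python
# Simplest illustration of the ML/REML distinction (an analogy, not the
# mixed model itself): variance of an i.i.d. normal sample.
def ml_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n      # divides by n

def reml_variance(values):
    n = len(values)
    return ml_variance(values) * n / (n - 1)             # divides by n - 1

sample = [28.1, 31.4, 30.2, 29.8, 33.0]                  # hypothetical health scores
print(round(ml_variance(sample), 3))                     # 2.68
print(round(reml_variance(sample), 3))                   # 3.35
```

The REML estimate is always somewhat larger, and the gap shrinks as the sample grows, which is consistent with the very small differences between the two columns of Table 2.2.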