Chapter 8 Bivariate relationships considering spatial pattern

Correlation coefficients are a well known measure of association between variables. The question is how these can or should be modified if we apply them to spatialy structured data. here we focus on association between numerical scaled variables and ignore association measures for ordinal or nominal variables.

8.1 Cheat sheet

The following packages are required for this chapter:

require(sf)
require(spdep)
require(sp)
require(ncf)
require(tmap)
require(ggplot2)
require(ggpubr)
require(gstat) # for simulation of autocorrelated data
require(dplyr)
require(SpatialPack) # for the modified t-test

Important commands:

  • spdep::lee.mc - permutation test for Lee’s L
  • spdep::lag.listw - computing a spatially lagged version of a variable vector
  • cor.test - Pearson’s r with test

8.2 Adjusted t-test

##’ Motivation

In presence of positive spatial autocorrelation our effective sample size is smaller than the sample size. This is due to shared information between observations. The normal t-test for Pearson’s r is therefore incorrect.

The simplest solution is to calculate the effective sample size under consideration of spatial autocorrelation.

The effective sample size \(n'\) in absence of autocorrelation: \(n' = n\) (number of observations). In presence of autocorrelation this needs to be modified to

\[n'= \frac{n^2}{\sum_{i=1}^n\sum_{j=1}^nCor(x_i, y_i)}\]

The exact formula depends on the from of the correlation function. For a first order autoregressive correlation structure (\(Cor(x_i, x_j)= \rho^{|i-j}\)):

\[n'=(1+2\frac{\rho}{1-\rho}(1-\frac{1}{n})-2\frac{\rho^2}{(1-\rho)^2} \frac{1-\rho^{n-1}}{n}) \cdot n\]

This can be simplified for large n:

\[n' \cong n \frac{1-\rho}{1+\rho}\]

# function to calculate the effective sampel size for large n and the simple first order autoregresive function
neff <- function(n, rho)
{
  res <- n*(1-rho)/(1+rho)
  return(res)
}
  • e.g. \(n=1000\) and \(\rho = 0.7\) \(\rightarrow n' =\) 176.4705882
  • e.g. \(n=1000\) and \(\rho = 0.9\) \(\rightarrow n' =\) 52.6315789

The modified t-test by Clifford and Richardson (Clifford et al. (1989) and Dutilleul et al. (1993)) uses the covariance for all observations in a distance band as a measure to capture spatial autocorrelation (cf. correlogram). The function is available as spatialPack::modified.ttest.

8.3 Bivariate Moran’s I

Bivariate Moran’s I is a measure of autocorrelation of X with the spatial lag of Y. As such, the resultant measure may overestimate the amount of spatial autocorrelation which may be a product of the inherent correlation of X and Y.

\[I_B = \frac{\sum_i(\sum_j w_{i,j}y_j\cdot x_i)}{\sum_i x_i^2}\] It only captures autocorrelation of one variable and completely ignores the spatial pattern of the second variable. \(I_{X,Y}\) might be different from \(I_{Y,X}\) which is not helpful for interpretation.

Using the spatial lag operator this can be written as:

\[I_{X,Y} = \frac{\sum_i(x_i-\bar{x})(\tilde{y}-\bar{y})}{\sqrt{\sum_i(x_i-\bar{x})^2} \sqrt{\sum_i(y_i-\bar{y})^2}} \cong \sqrt{SSS_Y} \cdot r_{x,\tilde{y}}\]

In general, the use of the bivariate Moran’s I is not recommended for the reason giving above.

8.4 Lee’s L - measuring association between spatially autocorrelated variables

Spatial autocorrelation also influences the correlation between variables. Using the standard Pearson correlation coefficient ignores the fact that part of the association originates from the spatial autocorrelation of the two variables. Lee (2001) developed an index to measure the bivariate spatial association by a fusion of Pearson’s correlation coefficient and Moran’s I. More precisely, Moran’s I is decomposed by a spatial smoothing scalar (SSS) that measures the degree of smoothing when a variable is transformed to its spatial lag. Lee’s L is the product of the SSS of the two variables and their Pearson correlation coefficient at spatial lags. Lee’s L measures similarity/dissimilarity among variables in terms of bivariate associations and their spatial clustering.

It is calculated using spdep::lee. Its significance is tested using lee.test or lee.mc in spdep.

Using the concept of the spatial lag (see Moran’s I):

\[ \tilde{x}_i = \sum_j w_{ij}x_j \]

Lee’s L can be expressed (for a row-standardized \(W\)) as follows:

\[ L_{X,Y} = \frac{\sum_i^n (\tilde{x}_i - \bar{x}) (\tilde{y}_i - \bar{y})} {\sqrt{\sum_i^n(x_i-\bar{x})^2} \sqrt{\sum_i^n(y_i-\bar{y})^2}} \]

The idea behind Lee’s L is that presence of spatial autocorrelation violates the assumption of independent data points. Therefore, test statistics can no longer be based on taking simply the number of observations as the degrees of freedom for the test statistic as neighboring observations share information (are not completely independent).

If one varies the spatial composition of the observations, one does not change the value of Pearson’s r but changes spatial autocorrelation (measured e.g. by Moran’s I). This is shown in figure 8.1 (Lee, 2001). Pearson’s r between A-B, A-C and B-C is always 0.422. However, the spatial structure (as captured by Moran’s I) is very different. If we think that spatial structure matters, it would be nice if this could be captured by a statistic - this is the purpose of Lee’s L.

Correlation betwenn abservations A and B and C. B and C show identical values that are differently spatially arranged. Therefore, Pearson's r is exaclty the same for A-B and A-C. However Moran's I for B and C is different. As this effects the amount of information shared between neighboring observations, a measure that reflects that seems necessary.

Figure 8.1: Correlation betwenn abservations A and B and C. B and C show identical values that are differently spatially arranged. Therefore, Pearson’s r is exaclty the same for A-B and A-C. However Moran’s I for B and C is different. As this effects the amount of information shared between neighboring observations, a measure that reflects that seems necessary.

We have already rewritten Moran’s I using the spatial lag operator above:

\[\tilde{x}_i = \sum_j w_{ij}x_j\]

\[ I = \frac{\sum_i^n (x_i - \bar{x}) (\tilde{x}_i - \bar{x}) } {\sqrt{\sum_i^n (x_i - \bar{x})^2 \sum_i^n (x_i - \bar{x})^2}}\]

If we now calculate Pearson’s r between the variable \(x\) and its lag \(\tilde{x}\) we can further rewrite the equation.

\[r_{x,\tilde{x}} = \frac{\sum_i(x_i-\bar{x})(\tilde{x}_i-\bar{x}))}{\sqrt{\sum_i(x_i-\bar{x})^2} \sqrt{\sum_i(x_i-\bar{x})^2}}\]

This allows us to write Moran’s I as:

\[I = \sqrt{\frac{\sum_i ( \tilde{x}_i -\bar{\tilde{x}}_i)^2}{\sum_i ( x_i -\bar{x}_i)^2}} \cdot r_{x,\tilde{x}}\]

This shows that Moran’s I is Pearson’s r between the variable and its spatial lag, scaled by the square root of the ratio of the variance of the spatial lagged variable and the variance of the variable.

If we rearrange the equation and use a clever approximation this can be further simplified (Lee, 2001).

\[I = \sqrt{\frac{\sum_i(\tilde{x}_i - \bar{x}_i)^2}{\sum_i(\tilde{x}_i - \bar{x}_i)^2}} \cdot \sqrt{\frac{\sum_i ( \tilde{x}_i -\bar{\tilde{x}}_i)^2}{\sum_i ( x_i -\bar{x}_i)^2}} \cdot r_{x,\tilde{x}} = \] \[ = \sqrt{\frac{\sum_i(\tilde{x}_i - \bar{x}_i)^2}{\sum_i(x_i -\bar{x}_i)^2}} \cdot \sqrt{\frac{\sum_i ( \tilde{x}_i -\bar{\tilde{x}}_i)^2}{\sum_i ( \tilde{x}_i - \bar{x}_i)^2}} \cdot r_{x,\tilde{x}} \]

As the mean of the original variable and its spatial lag can be expected to be very similar the can assume that \(\sqrt{\frac{\sum_i ( \tilde{x}_i -\bar{\tilde{x}}_i)^2}{\sum_i ( \tilde{x}_i - \bar{x}_i)^2}}\cong 1\).

\[I \cong \sqrt{\frac{\sum_i(\tilde{x}_i - \bar{x}_i)^2}{\sum_i(x_i -\bar{x}_i)^2}} 1 \cdot r_{x,\tilde{x}} = \sqrt{SSS} \cdot r_{x,\tilde{x}}\]

There \(SSS\) was termed spatial smoothing scalar by Lee (2001). If a variable is stronger spatially clustered, its SSS will be larger, as the variance of the original vector is less reduced when it is transformed to its spatial lag. SSS reveals substantive information about the spatial clustering of a variable. If a variable is stronger spatially clustered, its SSS is larger, because variance of the original vector is less reduced when it is transformed to its spatial lag.

SSS itself can be seen as a direction-free univariate spatial association measure that theoretically ranges from 0 to 1 (Lee, 2001).

Lets have a look at three examples from Lee (2001). The figures show the spatial patternof the variable and of its spatial lag as well as the SSS, Pearson’s r for the variable and its spatial lag es well as Moran’s I.

For the bivariate case this can be extended to Lee’s L which can be rewritten as3:

\[L_{x,y}=\sqrt{SSS_y}\sqrt{SSS_y} \cdot r_{\tilde{x}, \tilde{y}} = \sqrt{BSSS_{x,y}} \cdot r_{\tilde{x}, \tilde{y}}\]

8.5 Simulated data

We create a hexagon tesselation and add some spatially structure columns to it that can then be used to explore Lee’s L for these pattern.

set.seed(3)
aoi <- st_sfc(st_polygon(list(rbind(c(0,0), c(5,0), c(5,5), c(0,0)))))
hexgrid <- st_make_grid(aoi,  cellsize = 1, square = FALSE)
hexgrid <- st_sf(hexgrid)
n <- nrow(hexgrid)
hexgrid$id <- 1:n

hexgrid <- hexgrid %>% mutate(var1 = ifelse(id <10, 30, ifelse(id < 20, 5, 1) + rnorm(n, mean= 0, sd=.1)),
                              var2 = ifelse(id <10, 2, ifelse(id < 20, 6, 20) + rnorm(n, mean= 0, sd=.1) ))

hexgrid <- hexgrid %>% mutate(var3 = rnorm(n, mean= 4) + var1,
                              var4 = rpois(n, lambda = var1),
                              var5 = rnorm(n, mean=30),
                              var6 = rnorm(n, mean=30))
hexgrid <- hexgrid %>% mutate(var7 = var6 + rnorm(n, mean= 0, sd=.5) )
idx <- sample(hexgrid$id, size=n, replace = FALSE)
hexgrid <- hexgrid %>% mutate(var8 = var1[idx], var9= var2[idx])
hexgrid <- hexgrid %>% mutate(var10 = var1 + rnorm(n))
hexgrid <- hexgrid %>% mutate(var11 = var10[idx])

theBreaks <-seq(0,40, by=5)
thePalette <- "Reds"
m1 <- tm_shape(hexgrid) + tm_polygons(col="var1", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m2 <- tm_shape(hexgrid) + tm_polygons(col="var2", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m3 <- tm_shape(hexgrid) + tm_polygons(col="var3", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m4 <- tm_shape(hexgrid) + tm_polygons(col="var4", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m5 <- tm_shape(hexgrid) + tm_polygons(col="var5", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m6 <- tm_shape(hexgrid) + tm_polygons(col="var6", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m7 <- tm_shape(hexgrid) + tm_polygons(col="var7", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m8 <- tm_shape(hexgrid) + tm_polygons(col="var8", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m9 <- tm_shape(hexgrid) + tm_polygons(col="var9", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m10 <- tm_shape(hexgrid) + tm_polygons(col="var10", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)
m11 <- tm_shape(hexgrid) + tm_polygons(col="var11", breaks = theBreaks, palette= thePalette) +
  tm_layout(legend.outside = TRUE)


tmap_arrange(m1,m2,m3,m4, m5, m6,m7, m8, m9, m10, m11)

A contiguity neighborhood definition is best suited for the case of the demonstration.

hexnb <- poly2nb(hexgrid)
hexlistw <- nb2listw(hexnb, style = "W")

If we apply Moran’s test to the five simulated variables we see that all but the last three show significant spatial autocorrelation. (I am using sapply to map a function to a list but this is not of concern here.)

moran4vars <- lapply(2:8, FUN = function(x) moran.test(st_drop_geometry(hexgrid)[,x], listw = hexlistw))
moran4varsShorted <- sapply(moran4vars, FUN = function(x) c(x$estimate[1], x$p.value))
row.names(moran4varsShorted)[2] <- "p-value"
moran4varsShorted
##                           [,1]         [,2]         [,3]         [,4]
## Moran I statistic 6.778588e-01 8.140493e-01 6.773210e-01 6.101243e-01
## p-value           1.351183e-14 3.210743e-19 1.429188e-14 2.032328e-12
##                           [,5]       [,6]      [,7]
## Moran I statistic -0.007337836 0.01197456 0.1382407
## p-value            0.433931191 0.35233507 0.0370708

Let us explore the case of the obviously autocorrelated and shifted first two variables:

tmap_arrange(m1,m2)

lee.mc(hexgrid$var1, hexgrid$var2, listw = hexlistw, nsim = 499, alternative = "less")
## 
##  Monte-Carlo simulation of Lee's L
## 
## data:  hexgrid$var1 ,  hexgrid$var2 
## weights: hexlistw  
## number of simulations + 1: 500 
## 
## statistic = -0.63469, observed rank = 1, p-value = 0.002
## alternative hypothesis: less
cor.test(hexgrid$var1, hexgrid$var2)
## 
##  Pearson's product-moment correlation
## 
## data:  hexgrid$var1 and hexgrid$var2
## t = -8.7979, df = 43, p-value = 3.624e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8866491 -0.6646941
## sample estimates:
##        cor 
## -0.8017904

Compared to Pearson’s r that ignores spatial autocorrelation Lee’s L indicates a weaker but still clear negative association between var1 and var2.

Let’s repeat that with var3 which has added white noise.

tmap_arrange(m3,m2)

lee.mc(hexgrid$var3, hexgrid$var2, listw = hexlistw, nsim = 499, alternative = "less")
## 
##  Monte-Carlo simulation of Lee's L
## 
## data:  hexgrid$var3 ,  hexgrid$var2 
## weights: hexlistw  
## number of simulations + 1: 500 
## 
## statistic = -0.6389, observed rank = 1, p-value = 0.002
## alternative hypothesis: less
cor.test(hexgrid$var3, hexgrid$var2)
## 
##  Pearson's product-moment correlation
## 
## data:  hexgrid$var3 and hexgrid$var2
## t = -8.828, df = 43, p-value = 3.294e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8872331 -0.6662194
## sample estimates:
##       cor 
## -0.802766

Similar. Both methods indicate significant negative correlation between var2 and var3. Lee’s L is against smaller due to the effect of positive spatial autocorrelation in the data.

For the two white noise variables without spatial structure both association measures indicate expectedly no significant relationship.

tmap_arrange(m5,m6)

lee.mc(hexgrid$var6, hexgrid$var5, listw = hexlistw, nsim = 499, alternative = "less")
## 
##  Monte-Carlo simulation of Lee's L
## 
## data:  hexgrid$var6 ,  hexgrid$var5 
## weights: hexlistw  
## number of simulations + 1: 500 
## 
## statistic = 0.061969, observed rank = 425, p-value = 0.85
## alternative hypothesis: less
cor.test(hexgrid$var6, hexgrid$var5)
## 
##  Pearson's product-moment correlation
## 
## data:  hexgrid$var6 and hexgrid$var5
## t = 0.79209, df = 43, p-value = 0.4327
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.179947  0.399396
## sample estimates:
##       cor 
## 0.1199212

The last case is interesting as we are comparing two variables without significant spatial autocorrelation that were designed to be correlated. While Pearson’s r indicates strong autocorrelation, Lee’´s L suggest no significant association.

tmap_arrange(m7,m6)

lee.mc(hexgrid$var6, hexgrid$var7, listw = hexlistw, nsim = 999, alternative = "greater")
## 
##  Monte-Carlo simulation of Lee's L
## 
## data:  hexgrid$var6 ,  hexgrid$var7 
## weights: hexlistw  
## number of simulations + 1: 1000 
## 
## statistic = 0.169, observed rank = 574, p-value = 0.426
## alternative hypothesis: greater
cor.test(hexgrid$var6, hexgrid$var7)
## 
##  Pearson's product-moment correlation
## 
## data:  hexgrid$var6 and hexgrid$var7
## t = 8.9585, df = 43, p-value = 2.178e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6727348 0.8897199
## sample estimates:
##       cor 
## 0.8069249

8.6 The COVID-19 data case study - bivariate association

We will use the covid-19 incidence data for Germany that have been introduced previously.

covidKreise <- st_read(dsn = "data/covidKreise.gpkg", layer = "kreiseWithCovidMeldeWeeklyCleaned")
## Reading layer `kreiseWithCovidMeldeWeeklyCleaned' from data source 
##   `/home/slautenbach/Documents/lehre/HD/git_areas/spatial_data_science/spatial-data-science/data/covidKreise.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 400 features and 145 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6101444
## Projected CRS: ETRS89 / UTM zone 32N
polynb <- poly2nb(covidKreise)
poSurfKreise <- st_point_on_surface(covidKreise)
## Warning: st_point_on_surface assumes attributes are constant over geometries
distnb <- dnearneigh(poSurfKreise, d1=1, d2= 70*10^3)
unionedNb <- union.nb(distnb, polynb)
dlist <- nbdists(unionedNb, st_coordinates(poSurfKreise))
dlist <- lapply(dlist, function(x) 1/x)
unionedListw_d <- nb2listw(unionedNb, glist=dlist, style = "W")
ebiMc <- EBImoran.mc(n= st_drop_geometry(covidKreise)%>% pull("sumCasesTotal"),
              x= covidKreise$EWZ, listw = unionedListw_d, nsim =999)
## The default for subtract_mean_in_numerator set TRUE from February 2016

Let’s look at the association between the transformed response variable (empirical bayes modification) with two variables we will be using later on as predictors in our model: the votes for the right winged party at the national election in 2017 (Stimmenanteile.AfD) and the rurality of a district (Laendlichkeitmeasured as the share of population that lives in densely populated areas in the district).

First the normal Pearson correlation coefficient:

cor.test(x = covidKreise$Stimmenanteile.AfD, y = ebiMc$z)
## 
##  Pearson's product-moment correlation
## 
## data:  covidKreise$Stimmenanteile.AfD and ebiMc$z
## t = 10.226, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3748636 0.5304747
## sample estimates:
##       cor 
## 0.4561491
cor.test(x = covidKreise$Laendlichkeit, y = ebiMc$z)
## 
##  Pearson's product-moment correlation
## 
## data:  covidKreise$Laendlichkeit and ebiMc$z
## t = 3.6908, df = 398, p-value = 0.0002547
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08538521 0.27505941
## sample estimates:
##       cor 
## 0.1819139

When Lee’s L with the permutation test:

lee.mc(x = covidKreise$Stimmenanteile.AfD, y = ebiMc$z, 
       listw = unionedListw_d, nsim = 999)
## 
##  Monte-Carlo simulation of Lee's L
## 
## data:  covidKreise$Stimmenanteile.AfD ,  ebiMc$z 
## weights: unionedListw_d  
## number of simulations + 1: 1000 
## 
## statistic = 0.35156, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
lee.mc(x = covidKreise$Laendlichkeit, y = ebiMc$z, 
       listw = unionedListw_d, nsim = 999)
## 
##  Monte-Carlo simulation of Lee's L
## 
## data:  covidKreise$Laendlichkeit ,  ebiMc$z 
## weights: unionedListw_d  
## number of simulations + 1: 1000 
## 
## statistic = 0.11064, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

We see that the incorporation of spatial autocorrelation reduces the estimated strength of the relationship as both variables are positively spatially autocorrelated. As the spatial autocorrelation for rurality is weaker this leads to a lower difference bettwen Pearson’s r and Lee’s L.

moran.mc(covidKreise$Stimmenanteile.AfD, listw= unionedListw_d, nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  covidKreise$Stimmenanteile.AfD 
## weights: unionedListw_d  
## number of simulations + 1: 1000 
## 
## statistic = 0.73757, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
moran.mc(covidKreise$Laendlichkeit, listw= unionedListw_d, nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  covidKreise$Laendlichkeit 
## weights: unionedListw_d  
## number of simulations + 1: 1000 
## 
## statistic = 0.22267, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

References

Clifford, P., Richardson, S., Hemon, D., 1989. Assessing the Significance of the Correlation between Two Spatial Processes. Biometrics 45, 123. https://doi.org/10.2307/2532039
Dutilleul, P., Clifford, P., Richardson, S., Hemon, D., 1993. Modifying the t Test for Assessing the Correlation Between Two Spatial Processes. Biometrics 49, 305. https://doi.org/10.2307/2532625
Lee, S.-I., 2001. Developing a bivariate spatial association measure: An integration of Pearson’s r and Moran’s I. Journal of Geographical Systems 3, 369–385. https://doi.org/10.1007/s101090100064

  1. see Lee (2001) for details↩︎