Chapter 8 Bivariate relationships considering spatial pattern
Correlation coefficients are a well-known measure of association between variables. The question is how they can or should be modified when applied to spatially structured data. Here we focus on association between numerically scaled variables and ignore association measures for ordinal or nominal variables.
8.1 Cheat sheet
The following packages are required for this chapter:
require(sf)
require(spdep)
require(sp)
require(ncf)
require(tmap)
require(ggplot2)
require(ggpubr)
require(gstat) # for simulation of autocorrelated data
require(dplyr)
require(SpatialPack) # for the modified t-test
Important commands:
- spdep::lee.mc - permutation test for Lee’s L
- spdep::lag.listw - computing a spatially lagged version of a variable vector
- cor.test - Pearson’s r with the associated t-test
8.2 Adjusted t-test
Motivation
In the presence of positive spatial autocorrelation our effective sample size is smaller than the nominal sample size, because neighboring observations share information. The usual t-test for Pearson’s r is therefore incorrect.
The simplest solution is to calculate the effective sample size while accounting for spatial autocorrelation.
In the absence of autocorrelation the effective sample size \(n'\) equals the number of observations: \(n' = n\). In the presence of autocorrelation this needs to be modified to
\[n'= \frac{n^2}{\sum_{i=1}^n\sum_{j=1}^n Cor(x_i, x_j)}\]
The exact formula depends on the form of the correlation function. For a first order autoregressive correlation structure (\(Cor(x_i, x_j)= \rho^{|i-j|}\)):
\[n'= n \Bigg/ \left(1+2\frac{\rho}{1-\rho}\left(1-\frac{1}{n}\right)-2\frac{\rho^2}{(1-\rho)^2} \frac{1-\rho^{n-1}}{n}\right)\]
This can be simplified for large n:
\[n' \cong n \frac{1-\rho}{1+\rho}\]
# function to calculate the effective sample size for large n and the simple first order autoregressive correlation function
neff <- function(n, rho)
{
res <- n*(1-rho)/(1+rho)
return(res)
}
- e.g. \(n=1000\) and \(\rho = 0.7\) \(\rightarrow n' \approx 176.5\)
- e.g. \(n=1000\) and \(\rho = 0.9\) \(\rightarrow n' \approx 52.6\)
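The large-n approximation can be checked against the exact expression above; a sketch (the function name neff_exact is ad hoc, not from the text):

```r
# exact effective sample size for an AR(1) correlation structure;
# the large-n approximation neff() drops the terms that vanish as n grows
neff_exact <- function(n, rho) {
  denom <- 1 + 2 * rho / (1 - rho) * (1 - 1/n) -
    2 * rho^2 / (1 - rho)^2 * (1 - rho^(n - 1)) / n
  n / denom
}

neff_exact(1000, 0.7)  # close to the approximate value 176.5
neff_exact(1000, 0.9)  # close to the approximate value 52.6
```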
The modified t-test by Clifford and Richardson (Clifford et al. (1989) and Dutilleul et al. (1993)) uses the covariance of all observations in a distance band as a measure to capture spatial autocorrelation (cf. correlogram). The function is available as SpatialPack::modified.ttest.
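A minimal usage sketch; the data and coordinates here are simulated for illustration and are not part of the case study:

```r
require(SpatialPack)

# hypothetical data: two correlated variables observed at grid locations
set.seed(42)
coords <- as.matrix(expand.grid(x = 1:10, y = 1:10))
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Clifford/Richardson modified t-test: Pearson's r is unchanged, but the
# degrees of freedom are reduced according to the spatial autocorrelation
res <- modified.ttest(x, y, coords)
res
```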
8.3 Bivariate Moran’s I
Bivariate Moran’s I is a measure of the correlation of X with the spatial lag of Y. As such, the resulting measure may overestimate the spatial association, since part of it may simply reflect the inherent (non-spatial) correlation of X and Y.
\[I_B = \frac{\sum_i(\sum_j w_{i,j}y_j\cdot x_i)}{\sum_i x_i^2}\] It only captures the spatial pattern of one of the two variables and completely ignores that of the other. In addition, \(I_{X,Y}\) might differ from \(I_{Y,X}\), which is not helpful for interpretation.
Using the spatial lag operator this can be written as:
\[I_{X,Y} = \frac{\sum_i(x_i-\bar{x})(\tilde{y}_i-\bar{y})}{\sqrt{\sum_i(x_i-\bar{x})^2} \sqrt{\sum_i(y_i-\bar{y})^2}} \cong \sqrt{SSS_Y} \cdot r_{x,\tilde{y}}\]
In general, the use of the bivariate Moran’s I is not recommended for the reasons given above.
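Using the spatial lag operator, \(I_{X,Y}\) can be computed by hand; a sketch assuming a row-standardized weights object lw of class listw (the function name is illustrative):

```r
require(spdep)

# bivariate Moran's I as the (scaled) correlation of x with the
# spatial lag of y, following the spatial lag formulation above
bivariate_moran <- function(x, y, lw) {
  y_lag <- lag.listw(lw, y)  # spatially lagged y
  sum((x - mean(x)) * (y_lag - mean(y))) /
    (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
}
```

Note that bivariate_moran(x, y, lw) and bivariate_moran(y, x, lw) will in general differ, illustrating the interpretation problem mentioned above.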
8.5 Simulated data
We create a hexagonal tessellation and add some spatially structured columns to it that can then be used to explore Lee’s L for these patterns.
set.seed(3)
aoi <- st_sfc(st_polygon(list(rbind(c(0,0), c(5,0), c(5,5), c(0,0)))))
hexgrid <- st_make_grid(aoi, cellsize = 1, square = FALSE)
hexgrid <- st_sf(hexgrid)
n <- nrow(hexgrid)
hexgrid$id <- 1:n
hexgrid <- hexgrid %>% mutate(var1 = ifelse(id <10, 30, ifelse(id < 20, 5, 1) + rnorm(n, mean= 0, sd=.1)),
var2 = ifelse(id <10, 2, ifelse(id < 20, 6, 20) + rnorm(n, mean= 0, sd=.1) ))
hexgrid <- hexgrid %>% mutate(var3 = rnorm(n, mean= 4) + var1,
var4 = rpois(n, lambda = var1),
var5 = rnorm(n, mean=30),
var6 = rnorm(n, mean=30))
hexgrid <- hexgrid %>% mutate(var7 = var6 + rnorm(n, mean= 0, sd=.5) )
idx <- sample(hexgrid$id, size=n, replace = FALSE)
hexgrid <- hexgrid %>% mutate(var8 = var1[idx], var9= var2[idx])
hexgrid <- hexgrid %>% mutate(var10 = var1 + rnorm(n))
hexgrid <- hexgrid %>% mutate(var11 = var10[idx])
theBreaks <-seq(0,40, by=5)
thePalette <- "Reds"
m1 <- tm_shape(hexgrid) + tm_polygons(col="var1", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m2 <- tm_shape(hexgrid) + tm_polygons(col="var2", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m3 <- tm_shape(hexgrid) + tm_polygons(col="var3", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m4 <- tm_shape(hexgrid) + tm_polygons(col="var4", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m5 <- tm_shape(hexgrid) + tm_polygons(col="var5", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m6 <- tm_shape(hexgrid) + tm_polygons(col="var6", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m7 <- tm_shape(hexgrid) + tm_polygons(col="var7", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m8 <- tm_shape(hexgrid) + tm_polygons(col="var8", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m9 <- tm_shape(hexgrid) + tm_polygons(col="var9", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m10 <- tm_shape(hexgrid) + tm_polygons(col="var10", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
m11 <- tm_shape(hexgrid) + tm_polygons(col="var11", breaks = theBreaks, palette= thePalette) +
tm_layout(legend.outside = TRUE)
tmap_arrange(m1,m2,m3,m4, m5, m6,m7, m8, m9, m10, m11)
A contiguity neighborhood definition is best suited for this demonstration.
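The weights object hexlistw used in the following tests is not shown in the text; it can presumably be constructed along these lines (a sketch using contiguity and row standardization):

```r
require(spdep)

# contiguity neighbours of the hexagon cells, row-standardized weights
hexnb <- poly2nb(hexgrid)                 # polygons sharing a boundary
hexlistw <- nb2listw(hexnb, style = "W")  # row-standardized listw object
```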
If we apply Moran’s test to the seven simulated variables var1 to var7 we see that the first four show clearly significant spatial autocorrelation, the white noise variables var5 and var6 do not, and var7 is only borderline significant (p = 0.037). (I am using lapply and sapply to map a function over a list but this is not of concern here.)
moran4vars <- lapply(2:8, FUN = function(x) moran.test(st_drop_geometry(hexgrid)[,x], listw = hexlistw))
moran4varsShorted <- sapply(moran4vars, FUN = function(x) c(x$estimate[1], x$p.value))
row.names(moran4varsShorted)[2] <- "p-value"
moran4varsShorted
## [,1] [,2] [,3] [,4]
## Moran I statistic 6.778588e-01 8.140493e-01 6.773210e-01 6.101243e-01
## p-value 1.351183e-14 3.210743e-19 1.429188e-14 2.032328e-12
## [,5] [,6] [,7]
## Moran I statistic -0.007337836 0.01197456 0.1382407
## p-value 0.433931191 0.35233507 0.0370708
Let us explore the case of the obviously autocorrelated and shifted first two variables:
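The output below corresponds to calls of the following form (a sketch; nsim = 499 yields the 500 simulations reported, and the one-sided alternative reflects the expected negative association):

```r
require(spdep)

# permutation test for Lee's L between var1 and var2
lee.mc(hexgrid$var1, hexgrid$var2, listw = hexlistw,
       nsim = 499, alternative = "less")

# Pearson's r with the usual, spatially naive t-test for comparison
cor.test(hexgrid$var1, hexgrid$var2)
```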
##
## Monte-Carlo simulation of Lee's L
##
## data: hexgrid$var1 , hexgrid$var2
## weights: hexlistw
## number of simulations + 1: 500
##
## statistic = -0.63469, observed rank = 1, p-value = 0.002
## alternative hypothesis: less
##
## Pearson's product-moment correlation
##
## data: hexgrid$var1 and hexgrid$var2
## t = -8.7979, df = 43, p-value = 3.624e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8866491 -0.6646941
## sample estimates:
## cor
## -0.8017904
Compared to Pearson’s r, which ignores spatial autocorrelation, Lee’s L indicates a weaker but still clearly negative association between var1 and var2.
Let’s repeat that with var3, which is var1 with added white noise.
##
## Monte-Carlo simulation of Lee's L
##
## data: hexgrid$var3 , hexgrid$var2
## weights: hexlistw
## number of simulations + 1: 500
##
## statistic = -0.6389, observed rank = 1, p-value = 0.002
## alternative hypothesis: less
##
## Pearson's product-moment correlation
##
## data: hexgrid$var3 and hexgrid$var2
## t = -8.828, df = 43, p-value = 3.294e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8872331 -0.6662194
## sample estimates:
## cor
## -0.802766
Similar. Both methods indicate a significant negative correlation between var2 and var3. Lee’s L is again smaller in absolute value due to the effect of positive spatial autocorrelation in the data.
For the two white noise variables without spatial structure, both association measures indicate, as expected, no significant relationship.
##
## Monte-Carlo simulation of Lee's L
##
## data: hexgrid$var6 , hexgrid$var5
## weights: hexlistw
## number of simulations + 1: 500
##
## statistic = 0.061969, observed rank = 425, p-value = 0.85
## alternative hypothesis: less
##
## Pearson's product-moment correlation
##
## data: hexgrid$var6 and hexgrid$var5
## t = 0.79209, df = 43, p-value = 0.4327
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.179947 0.399396
## sample estimates:
## cor
## 0.1199212
The last case is interesting as we are comparing two variables without significant spatial autocorrelation that were designed to be correlated. While Pearson’s r indicates a strong correlation, Lee’s L suggests no significant spatial association.
##
## Monte-Carlo simulation of Lee's L
##
## data: hexgrid$var6 , hexgrid$var7
## weights: hexlistw
## number of simulations + 1: 1000
##
## statistic = 0.169, observed rank = 574, p-value = 0.426
## alternative hypothesis: greater
##
## Pearson's product-moment correlation
##
## data: hexgrid$var6 and hexgrid$var7
## t = 8.9585, df = 43, p-value = 2.178e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6727348 0.8897199
## sample estimates:
## cor
## 0.8069249
8.6 The COVID-19 data case study - bivariate association
We will use the COVID-19 incidence data for Germany that were introduced previously.
## Reading layer `kreiseWithCovidMeldeWeeklyCleaned' from data source
## `/home/slautenbach/Documents/lehre/HD/git_areas/spatial_data_science/spatial-data-science/data/covidKreise.gpkg'
## using driver `GPKG'
## Simple feature collection with 400 features and 145 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6101444
## Projected CRS: ETRS89 / UTM zone 32N
## Warning: st_point_on_surface assumes attributes are constant over geometries
distnb <- dnearneigh(poSurfKreise, d1 = 1, d2 = 70*10^3) # neighbours within 70 km
unionedNb <- union.nb(distnb, polynb) # union with the contiguity neighbours
dlist <- nbdists(unionedNb, st_coordinates(poSurfKreise))
dlist <- lapply(dlist, function(x) 1/x) # inverse distance weights
unionedListw_d <- nb2listw(unionedNb, glist = dlist, style = "W")
ebiMc <- EBImoran.mc(n= st_drop_geometry(covidKreise)%>% pull("sumCasesTotal"),
x= covidKreise$EWZ, listw = unionedListw_d, nsim =999)
## The default for subtract_mean_in_numerator set TRUE from February 2016
Let’s look at the association between the transformed response variable (empirical Bayes modification) and two variables we will use later on as predictors in our model: the vote share of the right-wing party AfD at the 2017 national election (Stimmenanteile.AfD) and the rurality of a district (Laendlichkeit, measured as the share of the population that lives in densely populated areas of the district).
First the normal Pearson correlation coefficient:
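The two tests shown below correspond to calls of this form (ebiMc$z holds the empirical Bayes transformed rates):

```r
# Pearson's r between the EB-transformed incidence and the two predictors
cor.test(covidKreise$Stimmenanteile.AfD, ebiMc$z)
cor.test(covidKreise$Laendlichkeit, ebiMc$z)
```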
##
## Pearson's product-moment correlation
##
## data: covidKreise$Stimmenanteile.AfD and ebiMc$z
## t = 10.226, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3748636 0.5304747
## sample estimates:
## cor
## 0.4561491
##
## Pearson's product-moment correlation
##
## data: covidKreise$Laendlichkeit and ebiMc$z
## t = 3.6908, df = 398, p-value = 0.0002547
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08538521 0.27505941
## sample estimates:
## cor
## 0.1819139
Now Lee’s L with the permutation test:
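These results correspond to calls of this form (a sketch; nsim = 999 yields the 1000 simulations reported):

```r
require(spdep)

# permutation tests for Lee's L between the EB-transformed incidence
# and the two predictors, using the inverse distance based weights
lee.mc(covidKreise$Stimmenanteile.AfD, ebiMc$z, listw = unionedListw_d,
       nsim = 999, alternative = "greater")
lee.mc(covidKreise$Laendlichkeit, ebiMc$z, listw = unionedListw_d,
       nsim = 999, alternative = "greater")
```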
##
## Monte-Carlo simulation of Lee's L
##
## data: covidKreise$Stimmenanteile.AfD , ebiMc$z
## weights: unionedListw_d
## number of simulations + 1: 1000
##
## statistic = 0.35156, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
##
## Monte-Carlo simulation of Lee's L
##
## data: covidKreise$Laendlichkeit , ebiMc$z
## weights: unionedListw_d
## number of simulations + 1: 1000
##
## statistic = 0.11064, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
We see that incorporating spatial autocorrelation reduces the estimated strength of the relationship, as both predictors are positively spatially autocorrelated. Since the spatial autocorrelation of rurality is weaker, the difference between Pearson’s r and Lee’s L is smaller for that variable. The Moran permutation tests below confirm the positive spatial autocorrelation of both predictors:
##
## Monte-Carlo simulation of Moran I
##
## data: covidKreise$Stimmenanteile.AfD
## weights: unionedListw_d
## number of simulations + 1: 1000
##
## statistic = 0.73757, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
##
## Monte-Carlo simulation of Moran I
##
## data: covidKreise$Laendlichkeit
## weights: unionedListw_d
## number of simulations + 1: 1000
##
## statistic = 0.22267, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater