Chapter 6 Defining neighborhood relationships and the spatial weight matrix W
6.1 Cheat sheet
The following packages are required for this chapter:
require(sf)
require(spdep)
require(sp)
require(ncf)
require(tmap)
require(ggplot2)
require(ggpubr)
require(gstat) # for simulation of autocorrelated data
require(dplyr)
require(tripack) # for triangulation, you might need to install this package
Important commands:
- spdep::dnearneigh - distance based neighbor definition
- spdep::knearneigh in combination with spdep::knn2nb - k-nearest neighbors definition
- spdep::poly2nb - contiguity based neighborhood definition
- spdep::nbdists - create distance weights for a neighborhood definition
- spdep::union.nb - union of two neighborhood lists
- spdep::nb2listw - standardization of a neighborhood list
- spdep::lag.listw - computing a spatially lagged version of a variable vector
6.2 Theory
6.2.1 The spatial weight matrix W
Spatial patterns are based on some definition of neighborhood relationships between the spatial units. These relationships are represented by a spatial weight matrix \(W\) that describes which spatial units are considered neighbors as well as the strength of that neighborhood relationship. Usually \(W\) is standardized and then sometimes denoted \(W^*\). Here we will be a bit casual with respect to the terminology and frequently use \(W\) for the standardized form as well.
In \(R\) the weight matrix is represented in the form of a weighted list, but it is easy to convert it to the matrix form and back, so this should not bother you.
\(W\) captures the spatial relationships and is therefore an essential part of many spatial analysis tools, including measures of global and local spatial autocorrelation, spatial eigenvector mapping, the spatial error and spatial lag model, and many more. Using a different weight matrix will lead to different results in many situations. Therefore, it is important to define neighborhood relationships in a way that is suitable for the data. As usual, there is no single best way of defining \(W\), as it depends on the problem at hand. \(W\) is a model of spatial relationships, and as for all models it is true that all models are wrong but some are useful. So we need to think and do our best to come up with a suitable neighborhood relationship. It might also be a good idea to test different reasonable ways of defining \(W\) and to compare the results.
In the following we will explore a few basic concepts for defining neighborhoods. Additionally, a few hints are given about which aspects of \(W\) should be checked. These provide a basis for modelling \(W\) - however, for real world applications you might consider additional ways of defining \(W\).
6.2.2 Different ways to define neighborhoods
Investigating spatial relationships depends on a definition of a neighborhood. The question is not as trivial as it sounds, especially for unevenly sized and shaped objects such as districts. We could define neighbors based e.g. on:
- a distance threshold for the centroids of the objects,
- whether two polygons share an edge,
- the closest k neighbors for each object,
- a topological definition such as a Delaunay triangulation,
- a combination of those.
If we had other data, we could also base the neighborhood definition e.g. on:
- the number of inhabitants of one district visiting the other district
- the average travel time (by car or public transport) between two districts, weighted by the population that is unevenly distributed in space
- …
In other contexts, researchers have specified neighborhood relationships based on the interaction between locations. In economic geography, for example, one might define neighborhoods (and the strength of interaction, i.e. the weights) based on import and export relationships. Such weights will usually not be symmetric. Other studies have based the weight matrix on hierarchical relationships derived from Central Place Theory. This theory assumes that smaller municipalities depend on larger cities (of higher centrality) but not vice versa. A couple of studies on tax competition have constructed weights based on relative market potential.
We will stick here with the first set of options, as these can be derived directly from the geometries.
The selection of a neighborhood definition should be based on the problem at hand.
Neighbors are considered e.g. for the calculation of spatial autocorrelation measures such as Moran’s I, for the creation of spatially lagged predictor variables, or for all types of spatial regression modelling, such as spatial eigenvector mapping, spatial lag and spatial error models or the Besag-York-Mollié model.
To assess the usefulness of a neighborhood definition for the question at hand one might consider the following aspects in addition to a visual inspection:
- are there islands (observations without neighbors)? Is it useful to consider these as real islands or are there connections in the real world? An island might be connected by ferry, bridge or airplane to other spatial units, so it might be good to incorporate that link manually. Islands cause problems for the subsequent analysis steps and their interpretation, so it is good to make sure that they are really warranted.
- is the neighborhood definition symmetric? k-nearest neighbor definitions frequently lead to asymmetric neighborhood definitions that are problematic for analysis and interpretation. It is possible to make an asymmetric neighborhood definition symmetric by a union of the upper and lower triangular parts of the adjacency matrix.
- are the neighbors distributed relatively evenly across the different spatial units or are there strong differences, and are these differences meaningful for the problem at hand?
For neighborhood definitions that contain islands one has to enforce the calculation by setting the parameter zero.policy = TRUE for all following steps. This implies that you know that the results have to be interpreted carefully and that these consequences are well understood by you. If you are not sure whether you understand the consequences, avoid neighborhood definitions that leave some spatial units with empty neighbor sets. It might be an option to exclude the affected spatial units from the further analysis, but this should be justified and needs to be clearly documented and reported. It is not possible to provide general advice in this respect, as the validity and usefulness of such a step depends on the data and the question at hand.
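As a minimal sketch (nbWithIslands is a hypothetical neighbor list that contains units without neighbors):
# hypothetical: nbWithIslands contains empty neighbor sets;
# zero.policy = TRUE allows nb2listw to proceed despite them
lw <- nb2listw(nbWithIslands, style = "W", zero.policy = TRUE)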
6.2.3 Are all neighbors of the same importance? Weighting of neighbors based on, for example, distance
It is possible to adjust the weight of neighbors based on additional information. Distance is frequently used for this purpose by defining an inverse distance weighting approach (\(w_{i,j} = \frac{1}{dist_{i,j}^p}\), with \(p\) as a tuning parameter that specifies how strongly weights drop with distance). Another way of defining distance relationships is by an exponential function with a negative coefficient: \(w_{i,j} = e^{- \lambda \cdot dist_{i,j}}\), \(\lambda > 0\).
The following functions are frequently used:
- “idw” - inverse distance weighting: \(w_{ij}= d_{ij}^{-\alpha}\)
- “exp” - exponential distance decay: \(w_{ij}= e^{-\alpha \cdot d_{ij}}\)
- “dpd” - double-power distance weights: \(w_{ij}= (1- (\frac{d_{ij}}{d_{max}})^{\alpha})^{\alpha}\) for \(d_{ij} \leq d_{max}\), 0 otherwise
Let’s have a look at the shape of these functions:
# inverse distance weighting: w = d^(-a)
idwf <- function(d, a)
{
  return(d^(-a))
}
# exponential distance decay: w = exp(-a * d)
expf <- function(d, a)
{
  return(exp(-a*d))
}
# double-power distance weights: zero beyond dmax
dpdf <- function(d, a, dmax)
{
  res <- ifelse(d > dmax,
                0,
                (1-(d/dmax)^a)^a )
  return(res)
}
IDW leads to very big weights for short distances. If your dataset contains observations with identical coordinates, the division by zero will result in infinite weights. These are typically replaced with the biggest weight encountered in the dataset. Larger power values \(\alpha\) lead to a stronger decay and to higher weights for short distances.
ggplot() + xlim(0,2) + ylim(0,6) +
geom_function(fun = idwf, args = list(a=0.5), mapping=aes(colour="a=0.5")) +
geom_function(fun = idwf, args = list(a=1), mapping=aes(colour="a=1")) +
geom_function(fun = idwf, args = list(a=2), mapping=aes(colour="a=2")) +
geom_function(fun = idwf, args = list(a=3), mapping=aes(colour="a=3")) +
ylab("weights") + labs(title = "idw")
For the exponential decay function, largest weights are fixed to one. Larger power parameters lead to a stronger decay.
ggplot() + xlim(0,20) +
geom_function(fun = expf, args = list(a=.5), mapping=aes(colour="a=.5")) +
geom_function(fun = expf, args = list(a=1), mapping=aes(colour="a=1")) +
geom_function(fun = expf, args = list(a=2), mapping=aes(colour="a=2")) +
geom_function(fun = expf, args = list(a=3), mapping=aes(colour="a=3")) +
ylab("weights") + labs(title = "exponential decay")
For the double-power distance weights function, largest weights are fixed to one as well. Larger power parameters lead to a weaker decay for short distances and to a stronger decay for longer distances.
dmax = 40
ggplot() + xlim(0,50) +
geom_function(fun = dpdf, args = list(a=0.5, dmax= dmax),
mapping=aes(colour="a=0.5")) +
geom_function(fun = dpdf, args = list(a=1, dmax= dmax),
mapping=aes(colour="a=1")) +
geom_function(fun = dpdf, args = list(a=2, dmax= dmax),
mapping=aes(colour="a=2")) +
geom_function(fun = dpdf, args = list(a=3, dmax= dmax),
mapping=aes(colour="a=3")) +
geom_function(fun = dpdf, args = list(a=4, dmax= dmax),
mapping=aes(colour="a=4")) +
ylab("weights") +
labs(title = "double-power distance weights", subtitle = "dmax = 4")
Hint: one has to be careful when setting the parameters, as the functions might lead to many zero weights for distances measured in meters due to numerical inaccuracies. So it is good practice to check the resulting weights before continuing.
As mentioned above, the strength of the relationship can also be based on information other than Euclidean distance, such as movement data, import-export relationships, travel time or other accessibility information, Central Place Theory, or information on the strength of economic or social networks.
There are also approaches that try to model the spatial weight matrix by means of Bayesian approaches (e.g. Lee and Mitchell, 2012). This, however, involves modelling a really large set of parameters even for a moderate number of spatial units, as the number of elements of the weight matrix grows quadratically with the number of spatial units.
6.2.4 Creation of the weight matrix W by standardizing neighborhood weights
In an additional step we need to create the weight matrix \(W\) from the (potentially distance weighted) neighborhood matrix \(C\). In R \(W\) is represented as a weighted list. The following options exist:
- B is the basic binary coding (1 if two spatial units are considered neighbors, zero otherwise)
- W is row standardised (weights sum over all links to n)
- C is globally standardised (weights sum over all links to n)
- U is equal to C divided by the number of neighbours (weights sum over all links to unity)
- S is the variance-stabilizing coding scheme proposed by Tiefelsdorf et al. 1999, p. 167-168 (weights sum over all links to n).
For most situations row standardized coding is recommended. It ensures that the weights of all neighbors of an observation sum to unity, which is a useful property: the weighted sum of a lagged attribute (predictor or response) is then simply the weighted mean across the neighbors. If binary coding were used this would not be the case, and observations with more neighbors would get more weight than observations with fewer neighbors.
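A minimal toy example (a 1 x 3 grid created with spdep::cell2nb, not part of the original analysis) illustrates the difference:
nbToy <- cell2nb(nrow = 1, ncol = 3) # three cells in a row, the middle cell has two neighbors
nb2listw(nbToy, style = "B")$weights # binary coding: all weights equal 1
nb2listw(nbToy, style = "W")$weights # row standardized: the middle cell gets 0.5 and 0.5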
6.2.4.1 Implications of spatial units with an empty neighborhood list
If zero.policy is set to TRUE, weight vectors of zero length are inserted for regions without neighbors in the neighbor list. These will in turn generate lag values of zero. The spatially lagged value of x for a zero-neighbor region will then be zero, which may (or may not) be a sensible choice.
6.3 The COVID-19 data case study
We will use the COVID-19 cases reported by the RKI as our example data set. The data is aggregated at the level of the districts (in German: Kreise) and summed up from the beginning of the pandemic until the 28th of April 2022. The data is stored in a GeoPackage - like most spatial data formats, this file format can be read directly with sf functionality.
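The reading chunk is not echoed in the rendered output; it presumably looks like this (the relative path is an assumption - adjust it to your setup):
covidKreise <- st_read("data/covidKreise.gpkg", # assumed path
                       layer = "kreiseWithCovidMeldeWeeklyCleaned")
names(covidKreise)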
## Reading layer `kreiseWithCovidMeldeWeeklyCleaned' from data source
## `/home/slautenbach/Documents/lehre/HD/git_areas/spatial_data_science/spatial-data-science/data/covidKreise.gpkg'
## using driver `GPKG'
## Simple feature collection with 400 features and 145 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6101444
## Projected CRS: ETRS89 / UTM zone 32N
## [1] "RS_0" "EWZ"
## [3] "GEN" "GF"
## [5] "BEZ" "ADE"
## [7] "BSG" "ARS"
## [9] "AGS" "SDV_ARS"
## [11] "ARS_0" "AGS_0"
## [13] "RS" "IdLandkreis"
## [15] "casesWeek_2020_01_02" "casesWeek_2020_01_23"
## [17] "casesWeek_2020_02_01" "casesWeek_2020_02_07"
## [19] "casesWeek_2020_02_15" "casesWeek_2020_02_20"
## [21] "casesWeek_2020_03_01" "casesWeek_2020_03_08"
## [23] "casesWeek_2020_03_15" "casesWeek_2020_03_22"
## [25] "casesWeek_2020_03_29" "casesWeek_2020_04_05"
## [27] "casesWeek_2020_04_12" "casesWeek_2020_04_19"
## [29] "casesWeek_2020_04_26" "casesWeek_2020_05_03"
## [31] "casesWeek_2020_05_10" "casesWeek_2020_05_17"
## [33] "casesWeek_2020_05_24" "casesWeek_2020_05_31"
## [35] "casesWeek_2020_06_07" "casesWeek_2020_06_14"
## [37] "casesWeek_2020_06_21" "casesWeek_2020_06_28"
## [39] "casesWeek_2020_07_05" "casesWeek_2020_07_12"
## [41] "casesWeek_2020_07_19" "casesWeek_2020_07_26"
## [43] "casesWeek_2020_08_02" "casesWeek_2020_08_09"
## [45] "casesWeek_2020_08_16" "casesWeek_2020_08_23"
## [47] "casesWeek_2020_08_30" "casesWeek_2020_09_06"
## [49] "casesWeek_2020_09_13" "casesWeek_2020_09_20"
## [51] "casesWeek_2020_09_27" "casesWeek_2020_10_04"
## [53] "casesWeek_2020_10_11" "casesWeek_2020_10_18"
## [55] "casesWeek_2020_10_25" "casesWeek_2020_11_01"
## [57] "casesWeek_2020_11_08" "casesWeek_2020_11_15"
## [59] "casesWeek_2020_11_22" "casesWeek_2020_11_29"
## [61] "casesWeek_2020_12_06" "casesWeek_2020_12_13"
## [63] "casesWeek_2020_12_20" "casesWeek_2020_12_27"
## [65] "casesWeek_2021_01_03" "casesWeek_2021_01_10"
## [67] "casesWeek_2021_01_17" "casesWeek_2021_01_24"
## [69] "casesWeek_2021_01_31" "casesWeek_2021_02_07"
## [71] "casesWeek_2021_02_14" "casesWeek_2021_02_21"
## [73] "casesWeek_2021_02_28" "casesWeek_2021_03_07"
## [75] "casesWeek_2021_03_14" "casesWeek_2021_03_21"
## [77] "casesWeek_2021_03_28" "casesWeek_2021_04_04"
## [79] "casesWeek_2021_04_11" "casesWeek_2021_04_18"
## [81] "casesWeek_2021_04_25" "casesWeek_2021_05_02"
## [83] "casesWeek_2021_05_09" "casesWeek_2021_05_16"
## [85] "casesWeek_2021_05_23" "casesWeek_2021_05_30"
## [87] "casesWeek_2021_06_06" "casesWeek_2021_06_13"
## [89] "casesWeek_2021_06_20" "casesWeek_2021_06_27"
## [91] "casesWeek_2021_07_04" "casesWeek_2021_07_11"
## [93] "casesWeek_2021_07_18" "casesWeek_2021_07_25"
## [95] "casesWeek_2021_08_01" "casesWeek_2021_08_08"
## [97] "casesWeek_2021_08_15" "casesWeek_2021_08_22"
## [99] "casesWeek_2021_08_29" "casesWeek_2021_09_05"
## [101] "casesWeek_2021_09_12" "casesWeek_2021_09_19"
## [103] "casesWeek_2021_09_26" "casesWeek_2021_10_03"
## [105] "casesWeek_2021_10_10" "casesWeek_2021_10_17"
## [107] "casesWeek_2021_10_24" "casesWeek_2021_10_31"
## [109] "casesWeek_2021_11_07" "casesWeek_2021_11_14"
## [111] "casesWeek_2021_11_21" "casesWeek_2021_11_28"
## [113] "casesWeek_2021_12_05" "casesWeek_2021_12_12"
## [115] "casesWeek_2021_12_19" "casesWeek_2021_12_26"
## [117] "casesWeek_2022_01_02" "casesWeek_2022_01_09"
## [119] "casesWeek_2022_01_16" "casesWeek_2022_01_23"
## [121] "casesWeek_2022_01_30" "casesWeek_2022_02_06"
## [123] "casesWeek_2022_02_13" "casesWeek_2022_02_20"
## [125] "casesWeek_2022_02_27" "casesWeek_2022_03_06"
## [127] "casesWeek_2022_03_13" "casesWeek_2022_03_20"
## [129] "casesWeek_2022_03_27" "casesWeek_2022_04_03"
## [131] "casesWeek_2022_04_10" "casesWeek_2022_04_17"
## [133] "casesWeek_2022_04_24" "casesWeek_2022_04_27"
## [135] "sumCasesTotal" "incidenceTotal"
## [137] "Langzeitarbeitslosenquote" "Hochqualifizierte"
## [139] "Breitbandversorgung" "Stimmenanteile.AfD"
## [141] "Studierende" "Einpendler"
## [143] "Auspendler" "Auslaenderanteil"
## [145] "Laendlichkeit" "geom"
The important fields are as follows:
- sumCasesTotal - the total number of cases in that district
- EWZ - the number of inhabitants
- incidenceTotal - sumCasesTotal / EWZ * 10^5, i.e. cases per 100,000 inhabitants
- GEN - name of the district
Let’s plot the data:
tm_shape(covidKreise) + tm_polygons(col="incidenceTotal", legend.hist = TRUE, palette= "-plasma",
legend.reverse = TRUE, title = "Incidence rate") +
tm_layout(legend.outside = TRUE, bg.color = "darkgrey", outer.bg.color = "lightgrey",
legend.outside.size = 0.5, attr.outside = TRUE, title = "COVID-19 incidence rate till 28th of April 2022",
legend.outside.position = "left") + tm_scale_bar()
The data seem spatially structured. This is what we would like to investigate here.
6.3.1 Defining neighbors and the spatial weight matrix W
For the data at hand, a definition of neighborhood is not as obvious as for gridded data. Therefore, we are going to explore a number of different possibilities and discuss them briefly.
6.3.1.1 Demonstration of different neighborhood definitions
6.3.1.1.1 Distance based neighborhood definition
We start with a distance based neighborhood definition. We could use the centroids of the districts (st_centroid(covidKreise, of_largest_polygon=TRUE)), but this might be problematic if we have districts that contain other districts (e.g. Landkreis and Stadtkreis Heilbronn) - the centroid might then be located outside of its polygon, which would be problematic. Therefore, we use st_point_on_surface (see the material from Einführung in die Geoinformatik), which returns a point that is guaranteed to be located inside the district.
A distance of 70km might be a good start for defining neighboring districts. The parameters d1 and d2 define the minimum and the maximum distance (in map units) at which a point is considered a neighbor of another point.
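The chunk that creates the points and the neighbor list is not echoed; a minimal sketch (the object names poSurfKreise and distnb are those used further below):
poSurfKreise <- st_point_on_surface(covidKreise) # point guaranteed to lie inside each district
distnb <- dnearneigh(st_coordinates(poSurfKreise), d1 = 0, d2 = 70000) # 70 km in map units (m)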
## Warning: st_point_on_surface assumes attributes are constant over geometries
How many neighbors do the different districts have? We use card to get the number of neighbors from the neighbor list and table to aggregate the information:
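The chunk behind the following output is presumably just:
table(card(distnb))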
##
## 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## 4 5 6 6 11 11 17 13 13 16 29 24 24 12 18 26 22 15 19 10 12 9 8 8 6 5
## 29 30 31 32 33 34 35 36 37
## 6 8 7 6 9 4 7 2 2
If we want a more visual representation, we can simply use card as input for a bar plot of the neighbor counts.
as.data.frame(table(card(distnb))) %>%
ggplot(aes(x=Var1, y=Freq)) +
geom_col() + xlab("Number of neighbors")
We can also plot the links between centroids in space.
plot(st_geometry(covidKreise), border = "grey60", reset = FALSE)
plot(distnb, coords=st_coordinates(poSurfKreise), col="red", cex=.7, add=TRUE)
As we can see, some districts have many more neighbors than others due to the size of the districts. As the size of the districts somewhat corresponds with the population size, we have more neighbors in more densely populated areas such as the Ruhrgebiet, the Rhein-Main or the Rhein-Neckar region.
6.3.1.1.2 Contiguity based neighborhood definition
Another way of defining neighborhoods for polygon data is based on edges (or nodes) shared between polygons. If two polygons share an edge/line segment (rook adjacency) or at least a node (queen adjacency), they are considered neighbors.
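A sketch of the corresponding call (queen adjacency is the default of poly2nb):
polynb <- poly2nb(covidKreise) # queen = TRUE by default
table(card(polynb))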
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 27 26 44 47 60 74 68 30 15 6 2 1
The number of neighbors is distributed more evenly and with a narrower range compared to the 70km distance based neighborhood definition.
as.data.frame(table(card(polynb))) %>%
ggplot(aes(x=Var1, y=Freq)) +
geom_col() + xlab("Number of neighbors")
plot(st_geometry(covidKreise), border = "grey60", reset = FALSE)
plot(polynb, coords=st_coordinates(poSurfKreise), col="red", cex=.7, add=TRUE)
Still, the more densely populated areas with smaller districts have more neighbors. However, the distribution is less severely skewed than before.
6.3.1.1.3 10-nearest neighbors
As an alternative we could use the 10 nearest neighbors (or any other number of neighbors). To create such a neighbor list we need to combine two functions: knearneigh to get the nearest neighbors and knn2nb to convert the result into a neighborhood list.
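A sketch of the two combined calls:
# knearneigh returns a knn object, knn2nb converts it into an nb list
k10nb <- knn2nb(knearneigh(st_coordinates(poSurfKreise), k = 10))
table(card(k10nb))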
##
## 10
## 400
Not unexpectedly, every district has 10 neighbors.
plot(st_geometry(covidKreise), border = "grey60", reset = FALSE)
plot(k10nb, coords=st_coordinates(poSurfKreise), col="red", cex=.7, add=TRUE)
Some districts are now connected to relatively distant districts. And importantly, the neighbor list is not symmetric:
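The FALSE below presumably stems from spdep's symmetry check:
is.symmetric.nb(k10nb)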
## [1] FALSE
If required, we could make this neighborhood list symmetric with make.sym.nb. This completes the list and thereby, of course, leads to a shift in the number of neighbors for the individual districts. The function returns a new neighborhood definition, i.e. it does not modify the existing neighborhood object.
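A sketch (the object name k10nbSym is an assumption):
k10nbSym <- make.sym.nb(k10nb) # adds the missing reciprocal links; k10nb stays unchanged
table(card(k10nbSym))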
##
## 10 11 12 13 14 15 16 17
## 113 110 72 47 38 12 5 3
6.3.1.1.4 Graph based neighborhoods
A couple of further approaches exist that define neighborhoods based on some geometrical property:
- Delaunay triangulation: all nodes connected that form a triangle ABC whose circumcircle contains no other nodes
- Gabriel graph: two nodes are connected if no other node lies inside the circle that has the segment between them as its diameter
- \(d(x,y) \leq \min\left(\sqrt{d(x,z)^2 + d(y,z)^2} \,\middle|\, z \in S\right)\)
- Relative neighborhood graph: two nodes are connected if the lens formed by the intersection of the circles around both points, each with radius equal to their distance, contains no other points
- i.e. two points are neighbors whenever there is no third point closer to both of them than they are to each other
- formally: \(d(x,y) \leq \min\left(\max(d(x,z), d(y,z)) \,\middle|\, z \in S\right)\), with \(d()\) the distance, \(S\) the set of points, \(x\) and \(y\) the points tested and \(z\) an arbitrary third point
- Sphere of influence:
- for a set of points, draw a circle around each point, using the distance to its closest neighbor as the radius
- if two of those circles intersect, the corresponding points are considered neighbors
- Minimum spanning tree: connects all nodes while minimizing the total edge length
The Gabriel graph is a subgraph of the Delaunay triangulation and has the relative neighborhood graph as a subgraph.
In the following I show a few figures (Figures 6.1, 6.3, 6.5, 6.7) that show how the different graph based neighborhood definitions work on a smallish and relatively regular point pattern.
Graph based neighborhoods can be defined as follows, using a set of points as a starting point. For polygon data these points can be derived by using sf::st_point_on_surface or sf::st_centroid.
# Delaunay triangulation
delauny_nb <- tri2nb(st_coordinates(poSurfKreise))
# sphere of influence
soi_nb <- graph2nb(soi.graph(delauny_nb, st_coordinates(poSurfKreise)))
# Gabriel graph
gabriel_nb <- graph2nb(gabrielneigh(st_coordinates(poSurfKreise)))
# relative neighborhood graph
rg_nb <- graph2nb(relativeneigh(st_coordinates(poSurfKreise)))
The distribution of the number of neighbors has a relatively small range. Please note that both the Gabriel graph and the relative neighborhood graph lead to a large number of districts without any neighbor, which makes them unsuitable for this dataset.
##
## 4 5 6 7 8 9
## 36 114 134 90 22 4
##
## 1 2 3 4 5 6 7 8 9
## 6 25 66 94 103 66 34 5 1
##
## 0 1 2 3 4 5 6
## 58 86 93 81 53 20 9
##
## 0 1 2 3 4
## 100 122 110 59 9
The Delaunay triangulation leads to some neighborhood relationships that do not seem really useful.
plot(st_geometry(covidKreise), border = "grey60", reset = FALSE)
plot(delauny_nb, coords=st_coordinates(poSurfKreise), col="red", cex=.7, add=TRUE )
mtext("Delauny Triangulation")
plot(st_geometry(covidKreise), border = "grey60", reset = FALSE)
plot(soi_nb, coords=st_coordinates(poSurfKreise), col="red", cex=.7, add=TRUE )
mtext("Sphere of influence")
6.3.1.1.5 Combination of contiguity and distance based neighborhood definitions
For the following analysis we will use a combination of the contiguity based and the distance based neighborhood definition.
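Presumably the union is built from the two neighbor lists created above (the name unionedNb is used in the subsequent steps):
unionedNb <- union.nb(distnb, polynb) # union of 70 km distance and contiguity neighbors
table(card(unionedNb))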
##
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## 2 1 4 6 5 13 9 18 14 12 17 30 23 23 13 16 28 22 16 19 10 12 9 7 9 6
## 28 29 30 31 32 33 34 35 36 37
## 5 6 8 7 6 9 4 7 2 2
6.3.2 Inverse distance based weighting of neighbors
We will weight the different distances between the points on surface by an inverse distance relationship. Note that we row-standardize the weights afterwards.
nbdists(nb, coords) calculates the Euclidean distances between all pairs of neighbors based on the point coordinates provided. The lapply command maps a function to the resulting distance list. Here the function simply is 1/x - i.e. we calculate the inverse of each distance. The resulting weights are then used for the glist argument of nb2listw, which creates the row-standardized (style = "W") weight matrix W.
dlist <- nbdists(unionedNb, st_coordinates(poSurfKreise))
dlist <- lapply(dlist, function(x) 1/x)
unionedListw_d <- nb2listw(unionedNb, glist=dlist, style = "W")
For plotting in a histogram we need to extract the weights, unlist them and store them in a data frame first. The weights are stored in the weights slot of the weighted list object. unlist flattens them into a vector which we use as a column in a data.frame that can be used with ggplot.
data.frame(weights=unlist(unionedListw_d$weights)) %>%
ggplot(mapping= aes(x= weights)) +
geom_histogram() +
labs(title="Weight matrix" , subtitle = "unioned contiguity and 50km distance nb, idw p=1, W")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We are of course not limited to a power of 1 but could just as well use higher powers. This puts more weight on closer neighbors.
dlist2 <- nbdists(unionedNb, st_coordinates(poSurfKreise))
dlist2 <- lapply(dlist2, function(x) x^(-2))
unionedListw_d2 <- nb2listw(unionedNb, glist=dlist2, style = "W")
## Warning in nb2listw(unionedNb, glist = dlist2, style = "W"): zero sum general
## weights
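The characteristics below were presumably printed via summary() on the new weights object:
summary(unionedListw_d2)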
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 400
## Number of nonzero links: 7132
## Percentage nonzero weights: 4.4575
## Average number of links: 17.83
## Link number distribution:
##
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## 2 1 4 6 5 13 9 18 14 12 17 30 23 23 13 16 28 22 16 19 10 12 9 7 9 6
## 28 29 30 31 32 33 34 35 36 37
## 5 6 8 7 6 9 4 7 2 2
## 2 least connected regions:
## 345 360 with 2 links
## 2 most connected regions:
## 160 198 with 37 links
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 400 160000 400 120.4916 1628.03
6.3.3 Spatial smoothing
As soon as we have the row-standardized spatial weight matrix \(W^*\) we can use it for operations that involve neighborhood relationships. One simple application is spatial smoothing, where we calculate the lagged value of a variable by multiplying \(W^*\) with the vector of interest. There are two ways to do this in spdep: converting the weighted list representation to a spatial weight matrix by means of listw2mat and using matrix multiplication (%*%) to multiply it with the vector of interest, or using the convenience function lag.listw. Both lead to the same result:
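A sketch of the comparison; the printed value is the maximum absolute difference between the two approaches:
W <- listw2mat(unionedListw_d) # weighted list to matrix representation
lagMatrix <- as.numeric(W %*% covidKreise$sumCasesTotal) # approach 1: matrix multiplication
lagListw <- lag.listw(unionedListw_d, covidKreise$sumCasesTotal) # approach 2: convenience function
max(abs(lagMatrix - lagListw))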
## [1] 0
However, the first approach returns a matrix (with 1 column) which we need to cast to a vector first (using as.numeric). Therefore the second approach is easier to use.
We need to spatially smooth the cases and the population at risk separately and then combine them:
covidKreise$sumCasesTotalLag1 <- lag.listw(x= unionedListw_d, var =covidKreise$sumCasesTotal )
covidKreise$EWZLag1 <- lag.listw(x= unionedListw_d, var =covidKreise$EWZ )
covidKreise <- covidKreise %>% mutate(incidenceTotalLag1 = sumCasesTotalLag1 / EWZLag1 * 10^5)
Next we plot those, overlaying a smoother and the 1:1 line.
covidKreise %>%
ggplot(aes(x=incidenceTotal, y = incidenceTotalLag1)) +
geom_point(alpha=.5) +
geom_smooth() +
geom_abline(slope=1, intercept = 0, lty=2, col="darkgrey") +
xlab("Incidence rate") + ylab("Incidence rate lagged")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As we have row standardized the weights we are simply calculating a weighted average of the neighbors (ignoring the district itself as it is not considered a neighbor of itself). We see that for districts with low incidence rates neighboring districts have on average higher values. For districts with high values the weighted mean of the neighbors is mostly lower.
#tmap_mode("plot")
breaks <- seq(15000, 45000, by = 5000)
m1 <- tm_shape(covidKreise) +
tm_polygons(col="incidenceTotal", breaks = breaks,
title = "Incidence rate") +
tm_scale_bar() +
tm_layout(legend.outside = TRUE, legend.outside.position = "left")
m2 <- tm_shape(covidKreise) +
tm_polygons(col="incidenceTotalLag1", breaks = breaks,
title = "Incidence rate,\nlagged") +
tm_scale_bar() +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
tmap_arrange(m1, m2)
Please note that we have defined breaks for the legend to ensure that the colors in both maps match.
We see that a few districts stand out from the trend (the smooth part): the Saxon Switzerland (Sächsische Schweiz) and five districts in Bavaria are higher than the smoothed surface, and in the West (e.g. the Ruhrgebiet) and North we see quite a few districts that are lower than the smoothed surface. However, the classification necessary for a choropleth map might trick us.
covidKreise %>%
mutate(incidenceTotalDiffLag1 = incidenceTotal - incidenceTotalLag1) %>%
tm_shape() +
tm_polygons(col="incidenceTotalDiffLag1", palette = "-RdBu",
title = "Incidence rate - lagged incidence rate", n=7) +
tm_scale_bar() +
tm_layout(legend.outside = TRUE, legend.outside.position = "left", attr.outside = TRUE)
## Variable(s) "incidenceTotalDiffLag1" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Using the differences between the lagged and the non-lagged incidence rate provides us with deeper insights into which districts differ most strongly from their neighbors.