Chapter 3 Non-Spatial Data

How to use this book inside RStudio:

  • Open a new, empty R Markdown file inside RStudio.
  • Make sure to download the datasets used in the chapter. Look out for links in the text.
  • Get the path right!
  • Either copy single code chunks and run them inside R (as R Markdown or as an R script),
  • or click on the edit button (top of the page, fourth symbol from the left) to get to the chapter's source, which you can then copy as a whole and paste into your R Markdown file.

The dataset we use for this introduction to non-spatial data exploration and wrangling comes from the Armed Conflict Location and Event Data Project (ACLED) (Raleigh et al., 2010). ACLED is a non-governmental organization that produces event data on conflicts worldwide. It provides aggregated data at the provincial and country level, but also disaggregated point locations of events. In this chapter we look at the curated regional Africa product, downloaded from https://acleddata.com/curated-data-files/. If you are interested in other products, you can register and generate an API key free of charge. The dataset contains information on the time and date of each event, actors, addressees, measures or actions, fatalities, sources and comments.

You may be familiar with the Heidelberg Institute for International Conflict Research, which works in the same area of political conflict and violence, but publishes its findings once a year in text form and as aggregated maps.

3.1 Loading data into R

From the download on the ACLED website we get an .xlsx file: Africa_1997-2022_Apr22.xlsx. You can also download the file directly from this book’s repository here. In R there are two common packages for reading xlsx files: readxl and xlsx. There is not much difference between using one or the other. We go with the first.

3.1.1 Tabular data formats

library(readxl)
acled_africa <- read_excel("data/Africa_1997-2022_Apr22.xlsx")

The xlsx file contains only one sheet and there are no empty lines to skip, so no further parameters need to be set. Another common tabular format is .csv, which can be loaded with the function read.csv().
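To make the .csv case concrete, here is a minimal sketch using base R only. The file content, the temporary path and the names csv_path and events_csv are made up for illustration and are not part of the ACLED data. For Excel files that do have several sheets or leading lines to skip, read_excel() accepts the arguments sheet and skip.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,year,events",
  "Algeria,1997,8",
  "Rwanda,1997,"
), csv_path)

# read.csv() returns a base data.frame; the empty numeric field becomes NA
events_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
str(events_csv)
```

The empty field in the last row is read as NA because the column is otherwise numeric.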

The <- operator assigns values to variables. In this case it assigns the output of the function read_excel() (functions are indicated by the parentheses) to the variable acled_africa. Most other programming languages use = instead. In R both operators are available, but with slightly different applications.

The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.

Source: R help page assignOps (open it with ?assignOps)
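One practical consequence of this difference shows up inside function calls. The following sketch (the helper assignment_demo is hypothetical and only exists for this illustration) checks whether a variable x exists after calling mean() with = and with <-.

```r
# Hypothetical helper: compares `=` and `<-` inside a function call
assignment_demo <- function() {
  mean(x = c(1, 2, 3))   # `=` only names the argument x of mean(); no variable is created
  created_by_equals <- exists("x", envir = environment(), inherits = FALSE)

  mean(x <- c(4, 5, 6))  # `<-` creates the variable x and passes its value to mean()
  created_by_arrow <- exists("x", envir = environment(), inherits = FALSE)

  c(equals = created_by_equals, arrow = created_by_arrow)
}

assignment_result <- assignment_demo()
assignment_result
```

Inside a call, = is interpreted as argument matching, whereas <- is an actual assignment in the calling environment.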

The path inside the double quotes is a relative path: it is resolved against the current working directory and points to the data file.
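If reading the file fails, the path is the first thing to check. A small sketch with base R functions, assuming the file sits in a folder called data below the working directory as in the chunk above:

```r
# Build the relative path in a platform-independent way
data_path <- file.path("data", "Africa_1997-2022_Apr22.xlsx")

getwd()                                     # the directory relative paths are resolved against
file.exists(data_path)                      # TRUE only if the file is found from there
normalizePath(data_path, mustWork = FALSE)  # the absolute path R would look for
```

If file.exists() returns FALSE, either move the file or adjust the path.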

3.1.2 JSON - attribute: value format

JSON files are used less often in R, but they can be read as well.

Excursus: What is the difference between tabular and key-value / attribute-value data structures again?

CSV structure:

country, 1990, 1995, 2000, 2005, 2010, 2019
"Algeria", 0.572, 0.595, 0.637, 0.685, 0.748,
"Rwanda", 0.248, , 0.341, 0.413, ,

JSON structure of the same data:

[
  {
    "1990": "0.572",
    "1995": "0.595",
    "2000": "0.637",
    "2005": "0.685",
    "2010": "0.748",
    "2019": "",
    "country": "Algeria"
  },
  {
    "1990": "0.248",
    "2000": "0.341",
    "2005": "0.413",

    "country": "Rwanda"
  }
]

With the library jsonlite, JSON files can be read into R and represented as data frames. The following chunk reads in a JSON file on HDI estimates downloaded from UNDP. You can download the dataset yourself here.

library(jsonlite)
hdi <- fromJSON("data/hdi.json") # read in JSON file as dataframe
head(hdi) # show the first 6 rows of the data frame
##    1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002
## 1 0.302 0.307 0.316 0.312 0.307 0.331 0.335 0.339 0.344 0.348  0.35 0.353 0.384
## 2  0.65 0.631 0.615 0.618 0.624 0.637 0.646 0.645 0.655 0.665 0.671 0.678 0.684
## 3 0.572 0.576 0.582 0.586  0.59 0.595 0.602 0.611 0.621 0.629 0.637 0.647 0.657
## 4  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 0.813 0.815  0.82
## 5  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 0.391   0.4  0.41 0.426
## 6  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
##    2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015
## 1 0.393 0.409 0.418 0.429 0.447 0.447  0.46 0.472 0.477 0.489 0.496   0.5   0.5
## 2 0.691 0.696 0.706 0.713 0.722 0.728 0.733 0.745 0.764 0.775 0.782 0.787 0.788
## 3 0.667 0.677 0.685  0.69   0.7 0.702 0.711 0.721 0.728 0.728 0.729 0.736  0.74
## 4 0.827 0.833 0.827 0.837 0.837  0.84 0.839 0.837 0.836 0.858 0.856 0.863 0.862
## 5 0.435 0.446  0.46 0.473 0.489 0.501 0.515 0.517 0.533 0.544 0.555 0.565 0.572
## 6  <NA>  <NA> 0.764 0.771 0.776 0.774 0.767 0.763 0.755 0.759  0.76  0.76 0.762
##    2016  2017  2018  2019 HDI Rank             Country
## 1 0.502 0.506 0.509 0.511      169         Afghanistan
## 2 0.788  0.79 0.792 0.795       69             Albania
## 3 0.743 0.745 0.746 0.748       91             Algeria
## 4 0.866 0.863 0.867 0.868       36             Andorra
## 5 0.578 0.582 0.582 0.581      148              Angola
## 6 0.765 0.768 0.772 0.778       78 Antigua and Barbuda
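The <NA> entries in the output above suggest that the year columns were read as character, because the values are stored as quoted strings in the JSON file. A minimal sketch of converting such columns to numeric follows. It uses a small inline JSON string modelled on the example structure above, not the real hdi.json, and the names json_text, hdi_small and year_cols are made up for illustration.

```r
library(jsonlite)

# Small inline JSON document in the same attribute: value style as above
json_text <- '[
  {"1990": "0.572", "2019": "", "country": "Algeria"},
  {"1990": "0.248", "country": "Rwanda"}
]'

hdi_small <- fromJSON(json_text)   # keys missing in a record become NA
sapply(hdi_small, class)           # all columns are character at this point

# Convert every column except the country name to numeric
year_cols <- setdiff(names(hdi_small), "country")
hdi_small[year_cols] <- lapply(hdi_small[year_cols], as.numeric)
hdi_small
```

After the conversion, numeric functions such as mean() can be applied to the year columns.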

3.1.3 Binary R specific data formats

There are other, more R-specific file formats like .RData and .rds. Both are binary, meaning they are encoded and not human-readable. The files are also compressed, which has the advantage of a smaller file size. These formats are only readable with R.

These formats are well suited for storing intermediate results from R workflows, for example after a computationally expensive or time-intensive part of the workflow.

save/load a single object/variable

saveRDS(variable, file = "path/to/file.rds")
readRDS(file = "path/to/file.rds") # returns the stored object; without an assignment it is only printed and no variable is created
variable <- readRDS(file = "path/to/file.rds") # assign the result to a variable name of your choice

save/load multiple objects/variables

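save(variable1, variable2, file = "path/to/file.RData")
load(file = "path/to/file.RData") # restores the objects under the names they were saved with; existing variables with the same names are overwritten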
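The following round trip is a small, self-contained sketch of both approaches. The objects scores and tags and the temporary file paths are made up for illustration.

```r
scores <- c(a = 1, b = 2)
tags <- c("x", "y")

# Single object: the name is chosen when reading the file back in
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, file = rds_path)
scores_restored <- readRDS(rds_path)

# Several objects: load() recreates them under their original names
rdata_path <- tempfile(fileext = ".RData")
save(scores, tags, file = rdata_path)
rm(scores, tags)                      # remove them to show that load() brings them back
restored_names <- load(rdata_path)    # load() returns the names of the restored objects
restored_names
```

Because load() silently overwrites existing objects with the same names, readRDS() is often the safer choice for a single intermediate result.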

3.2 Understanding R data structures

Executing the code chunks above creates entries in the Environment tab. Every variable or object we assign within an R session is listed there. Some, like lists or data frames, can be inspected further. Others are more complex, and only a brief description is shown.

3.2.1 Assessing R object types

With the following command you can check the class of an R object. This is especially useful when a type-specific function does not work. One of many possible causes, and probably the easiest to fix, is that the object type does not match the one the function expects. A simple example is trying to compute the average of a vector of character values. With class() we can verify the object type.

class(hdi)
## [1] "data.frame"
class(read.csv)
## [1] "function"
class(1)
## [1] "numeric"
class("Hello World")
## [1] "character"
class(acled_africa)
## [1] "tbl_df"     "tbl"        "data.frame"
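The character example mentioned above can be sketched as follows. The vector values_chr is made up for illustration; suppressWarnings() is only used to keep the output quiet.

```r
values_chr <- c("1", "2", "3")
class(values_chr)                               # "character"

# mean() on a character vector returns NA and emits a warning
mean_chr <- suppressWarnings(mean(values_chr))

# After converting to numeric the function works as expected
values_num <- as.numeric(values_chr)
mean_num <- mean(values_num)
mean_num
```

Checking class() first makes it obvious why the first call does not return a number.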

The object hdi is of class data.frame, the standard object type for tabular data in R and comparable to a pandas DataFrame in Python. It is organised in rows and columns, where rows represent observations and columns represent attributes. A tibble is a modern variant of the data.frame. When printed to the console, a tibble shows only the first 10 records and the data type of each column, together with the number of attributes/columns and records/rows. This is also observable in the Environment tab. With the following commands this can be checked directly in the console:

3.2.2 Assessing R data types

dim(hdi)
## [1] 189  32
dim(acled_africa)
## [1] 277825     29
nrow(acled_africa)
## [1] 277825
ncol(acled_africa)
## [1] 29

The ACLED dataset contains 277825 observations. In the case of ACLED, an observation is a political event. Each event has 29 attributes (given no NAs). What attributes does the dataset provide, and what types are they?

The next command prints the type of a single column/attribute. The subsequent command applies the same function to all columns in the tibble and prints each attribute name together with its type.

typeof(acled_africa$ISO)
## [1] "double"
sapply(acled_africa, typeof)
##              ISO    EVENT_ID_CNTY EVENT_ID_NO_CNTY       EVENT_DATE 
##         "double"      "character"         "double"         "double" 
##             YEAR   TIME_PRECISION       EVENT_TYPE   SUB_EVENT_TYPE 
##         "double"         "double"      "character"      "character" 
##           ACTOR1    ASSOC_ACTOR_1           INTER1           ACTOR2 
##      "character"      "character"         "double"      "character" 
##    ASSOC_ACTOR_2           INTER2      INTERACTION           REGION 
##      "character"         "double"         "double"      "character" 
##          COUNTRY           ADMIN1           ADMIN2           ADMIN3 
##      "character"      "character"      "character"        "logical" 
##         LOCATION         LATITUDE        LONGITUDE    GEO_PRECISION 
##      "character"         "double"         "double"         "double" 
##           SOURCE     SOURCE_SCALE            NOTES       FATALITIES 
##      "character"      "character"      "character"         "double" 
##        TIMESTAMP 
##         "double"

Some of the data types might be familiar from other programming languages or software applications. What happens if we apply the same function to the R objects we inspected with class() earlier?

typeof(hdi)
## [1] "list"
typeof(read.csv)
## [1] "closure"
typeof(1)
## [1] "double"
typeof("Hello World")
## [1] "character"
typeof(acled_africa)
## [1] "list"

1 is of class numeric, but of type double. Tibbles and data frames are actually lists. The brief background is that a data frame is a more complex two-dimensional object structure that is built upon the list type.
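That a data frame is a list of equally long columns can be verified with a few base R calls. The data frame small_df below is made up for illustration.

```r
small_df <- data.frame(
  country = c("Algeria", "Rwanda"),
  events = c(8, 3),
  stringsAsFactors = FALSE
)

class(small_df)       # "data.frame"
typeof(small_df)      # "list"
is.list(small_df)     # TRUE
length(small_df)      # list length = number of columns
small_df[["events"]]  # list-style access returns the column as a vector

# class and type also differ for integers and doubles
class(1L); typeof(1L) # "integer" "integer"
class(1);  typeof(1)  # "numeric" "double"
```

This is why functions such as sapply() iterate over the columns of a data frame.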

3.3 Exploring data

3.3.1 Handling data frames

Columns of data frames and tibbles can be selected either via their name and the $ operator, like tibble$columnName, or by index, like acled_africa[<ROW>,<COLUMN>], where <ROW> is the positional index of the row and <COLUMN> the positional index of the column.

  • With acled_africa[1,] the first row and all columns of a data frame or tibble are selected.
  • With acled_africa[,1] the first column and all rows of a data frame or tibble are selected.
  • With acled_africa[20,4] only the 20th row and the fourth column are selected, so basically a single cell / value.
acled_africa[1,] # first row
## # A tibble: 1 × 29
##     ISO EVENT_ID_CNTY EVENT_ID_NO_CNTY EVENT_DATE           YEAR TIME_PRECISION
##   <dbl> <chr>                    <dbl> <dttm>              <dbl>          <dbl>
## 1    12 ALG1                         1 1997-01-01 00:00:00  1997              1
## # ℹ 23 more variables: EVENT_TYPE <chr>, SUB_EVENT_TYPE <chr>, ACTOR1 <chr>,
## #   ASSOC_ACTOR_1 <chr>, INTER1 <dbl>, ACTOR2 <chr>, ASSOC_ACTOR_2 <chr>,
## #   INTER2 <dbl>, INTERACTION <dbl>, REGION <chr>, COUNTRY <chr>, ADMIN1 <chr>,
## #   ADMIN2 <chr>, ADMIN3 <lgl>, LOCATION <chr>, LATITUDE <dbl>,
## #   LONGITUDE <dbl>, GEO_PRECISION <dbl>, SOURCE <chr>, SOURCE_SCALE <chr>,
## #   NOTES <chr>, FATALITIES <dbl>, TIMESTAMP <dbl>
acled_africa[,9] # 9th column
## # A tibble: 277,825 × 1
##    ACTOR1                                
##    <chr>                                 
##  1 GIA: Armed Islamic Group              
##  2 GIA: Armed Islamic Group              
##  3 GIA: Armed Islamic Group              
##  4 GIA: Armed Islamic Group              
##  5 GIA: Armed Islamic Group              
##  6 GIA: Armed Islamic Group              
##  7 Police Forces of Algeria (1994-1999)  
##  8 GIA: Armed Islamic Group              
##  9 GIA: Armed Islamic Group              
## 10 Military Forces of Algeria (1994-1999)
## # ℹ 277,815 more rows
acled_africa[20,4] # 20th row, 4th column
## # A tibble: 1 × 1
##   EVENT_DATE         
##   <dttm>             
## 1 1997-01-16 00:00:00
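Positional indices break as soon as the column order changes, so selecting by name is often more robust. A small sketch with a made-up base data.frame called events_df:

```r
events_df <- data.frame(
  country = c("Algeria", "Rwanda", "Uganda"),
  year = c(1997, 1998, 1999),
  fatalities = c(5, 0, 2),
  stringsAsFactors = FALSE
)

events_df$fatalities                   # column by name, returned as a vector
events_df[["fatalities"]]              # the same with list-style access
events_df[2, "country"]                # single cell: second row, column "country"
events_df[, "country"]                 # a base data.frame simplifies this to a vector
events_df[, "country", drop = FALSE]   # drop = FALSE keeps a one-column data.frame
```

Note the difference to the tibble output above: a tibble never simplifies acled_africa[,9] to a vector but always returns a tibble.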

Another, more convenient way to check out a dataset’s attributes is the summary() function.

summary(acled_africa)
##       ISO        EVENT_ID_CNTY      EVENT_ID_NO_CNTY
##  Min.   : 12.0   Length:277825      Min.   :    1   
##  1st Qu.:231.0   Class :character   1st Qu.: 2020   
##  Median :566.0   Mode  :character   Median : 5372   
##  Mean   :510.2                      Mean   : 8028   
##  3rd Qu.:710.0                      3rd Qu.:10870   
##  Max.   :894.0                      Max.   :48946   
##    EVENT_DATE                          YEAR      TIME_PRECISION 
##  Min.   :1997-01-01 00:00:00.00   Min.   :1997   Min.   :1.000  
##  1st Qu.:2012-09-29 00:00:00.00   1st Qu.:2012   1st Qu.:1.000  
##  Median :2017-01-30 00:00:00.00   Median :2017   Median :1.000  
##  Mean   :2015-03-13 22:19:15.33   Mean   :2015   Mean   :1.135  
##  3rd Qu.:2020-04-02 00:00:00.00   3rd Qu.:2020   3rd Qu.:1.000  
##  Max.   :2022-04-22 00:00:00.00   Max.   :2022   Max.   :3.000  
##   EVENT_TYPE        SUB_EVENT_TYPE        ACTOR1          ASSOC_ACTOR_1     
##  Length:277825      Length:277825      Length:277825      Length:277825     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      INTER1         ACTOR2          ASSOC_ACTOR_2          INTER2     
##  Min.   :1.000   Length:277825      Length:277825      Min.   :0.000  
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.:0.000  
##  Median :3.000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :3.563                                         Mean   :3.231  
##  3rd Qu.:6.000                                         3rd Qu.:7.000  
##  Max.   :8.000                                         Max.   :8.000  
##   INTERACTION      REGION            COUNTRY             ADMIN1         
##  Min.   :10.0   Length:277825      Length:277825      Length:277825     
##  1st Qu.:15.0   Class :character   Class :character   Class :character  
##  Median :33.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :33.8                                                           
##  3rd Qu.:55.0                                                           
##  Max.   :88.0                                                           
##     ADMIN2           ADMIN3          LOCATION            LATITUDE      
##  Length:277825      Mode:logical   Length:277825      Min.   :-34.710  
##  Class :character   NA's:277825    Class :character   1st Qu.: -0.360  
##  Mode  :character                  Mode  :character   Median :  6.264  
##                                                       Mean   :  6.912  
##                                                       3rd Qu.: 13.500  
##                                                       Max.   : 37.282  
##    LONGITUDE       GEO_PRECISION      SOURCE          SOURCE_SCALE      
##  Min.   :-25.163   Min.   :1.000   Length:277825      Length:277825     
##  1st Qu.:  9.123   1st Qu.:1.000   Class :character   Class :character  
##  Median : 28.100   Median :1.000   Mode  :character   Mode  :character  
##  Mean   : 22.345   Mean   :1.275                                        
##  3rd Qu.: 33.010   3rd Qu.:1.000                                        
##  Max.   : 63.475   Max.   :3.000                                        
##     NOTES             FATALITIES         TIMESTAMP        
##  Length:277825      Min.   :   0.000   Min.   :1.553e+09  
##  Class :character   1st Qu.:   0.000   1st Qu.:1.611e+09  
##  Mode  :character   Median :   0.000   Median :1.619e+09  
##                     Mean   :   2.941   Mean   :1.614e+09  
##                     3rd Qu.:   1.000   3rd Qu.:1.628e+09  
##                     Max.   :1350.000   Max.   :1.651e+09

The summary() function provides not only the number of observations, the number of NAs and the data type of each column, but also the range and quartiles of numerical attributes.

Overview of the main attributes of interest

  • YEAR: Year when an event took place
  • EVENT_DATE: Exact date when an event took place
  • COUNTRY: Country in which the event took place
  • ACTOR1: Political actor who committed an action
  • ACTOR2: (Political) actor who is the addressee or target of an action
  • EVENT_TYPE: Type of event; 6 different types available
  • SUB_EVENT_TYPE: Subtype of event; 25 different subtypes available
  • FATALITIES: Number of fatalities caused by the event
  • LATITUDE & LONGITUDE: Together they provide a point coordinate georeference, which we are not going to use in this chapter

3.3.2 Missing data

Is there any missing data in the dataset? Missing data in R is represented as NA. With the following code we count the NAs in each column:

colSums(is.na(acled_africa))
##              ISO    EVENT_ID_CNTY EVENT_ID_NO_CNTY       EVENT_DATE 
##                0                0                0                0 
##             YEAR   TIME_PRECISION       EVENT_TYPE   SUB_EVENT_TYPE 
##                0                0                0                0 
##           ACTOR1    ASSOC_ACTOR_1           INTER1           ACTOR2 
##                0           206757                0            74458 
##    ASSOC_ACTOR_2           INTER2      INTERACTION           REGION 
##           225170                0                0                0 
##          COUNTRY           ADMIN1           ADMIN2           ADMIN3 
##                0                1             2151           277825 
##         LOCATION         LATITUDE        LONGITUDE    GEO_PRECISION 
##                0                0                0                0 
##           SOURCE     SOURCE_SCALE            NOTES       FATALITIES 
##                0                0             9288                0 
##        TIMESTAMP 
##                0

Every record contains a year and at least one main actor and event type.

The occurrence of NAs can have several implications for one’s analysis:

numeric_vector <- c(1,2,3,4,NA) # a numeric vector; a vector is one-dimensional and, unlike a list, only stores values of the same type
sum(numeric_vector) # we might expect the sum to be 10, but the NA makes the result NA
## [1] NA
sum(numeric_vector, na.rm = TRUE) # na.rm = TRUE drops NAs before summing; many summary functions accept this argument
## [1] 10
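Besides the na.rm argument there are a few other base R tools for dealing with NAs. The sketch below uses a made-up vector fatalities_demo.

```r
fatalities_demo <- c(1, 2, 3, 4, NA)

n_missing <- sum(is.na(fatalities_demo))           # count the NAs
mean_all <- mean(fatalities_demo)                  # NA, like sum() above
mean_known <- mean(fatalities_demo, na.rm = TRUE)  # computed on the 4 known values

# na.omit() is a function (not an argument) that removes the NAs
cleaned <- na.omit(fatalities_demo)
length(cleaned)
```

Whether dropping NAs is appropriate depends on why the values are missing, so it is worth counting them first.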

What is the distribution of events by year?

table(acled_africa$YEAR)
## 
##  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009 
##  3209  4546  4882  4175  3611  4297  3750  3174  2909  2739  3897  5083  3863 
##  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022 
##  4396  8136  9793 14234 16590 16984 17308 17665 19678 25618 32661 33971 10656
year_cross_eventtype <- table(acled_africa$YEAR, acled_africa$EVENT_TYPE)
head(year_cross_eventtype)
##       
##        Battles Explosions/Remote violence Protests Riots Strategic developments
##   1997    1184                        131      244   137                    615
##   1998    1650                        136      249   183                   1115
##   1999    2685                        198      218   174                    679
##   2000    1814                        224      256   227                    445
##   2001    1502                        127      231   267                    328
##   2002    1724                        183      270   245                    186
##       
##        Violence against civilians
##   1997                        898
##   1998                       1213
##   1999                        928
##   2000                       1209
##   2001                       1156
##   2002                       1689

The table() function creates a contingency table or crosstab. It displays the frequencies for a single attribute or for a combination of attributes. The frequency is the number of observations per value or per combination of values.
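A contingency table can be extended with totals or converted to shares using base R. The data frame toy_events is made up for illustration.

```r
toy_events <- data.frame(
  year = c(1997, 1997, 1998, 1998, 1998),
  type = c("Battles", "Riots", "Battles", "Battles", "Riots"),
  stringsAsFactors = FALSE
)

by_year <- table(toy_events$year)                  # frequency of a single attribute
cross <- table(toy_events$year, toy_events$type)   # crosstab of two attributes

addmargins(cross)              # adds row and column totals
prop.table(cross, margin = 1)  # shares within each row (year)
```

With margin = 1 each row sums to 1, which makes years with very different event counts comparable.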

3.3.3 Visualization

We can also answer that question by plotting the data. R has several major plotting packages:

  • graphics
  • lattice
  • ggplot2

All have their pros and cons, but we will focus on ggplot2 where possible.

Using the graphics package, we can call hist(), which requires a vector of values. Further parameters specify the title or the axis labels. The help file (?hist) lists more options for fine-tuning the plot.

hist(acled_africa$YEAR, main="Amount of events by year ACLED Africa", xlab = "time in years")
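By default hist() chooses the bin boundaries itself, so one bar does not necessarily correspond to one year. The following sketch, with a made-up vector years_demo, sets the breaks explicitly so that each calendar year gets its own bin. plot = FALSE only returns the computed counts instead of drawing.

```r
years_demo <- c(1997, 1997, 1998, 2000, 2000, 2000)

# Breaks at x.5 so that every integer year sits in the middle of its own bin
year_breaks <- seq(min(years_demo) - 0.5, max(years_demo) + 0.5, by = 1)

year_hist <- hist(years_demo, breaks = year_breaks, plot = FALSE)
year_hist$mids    # bin centres: the years
year_hist$counts  # events per year, including a 0 for the year without events
```

Dropping plot = FALSE draws the histogram with the same bins.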

Plotting with ggplot2 follows a different syntax. We start with ggplot to define the data.frame we will be using and to map aesthetics to variables/columns of that data frame. Next we add (with a plus sign) different graphical functions that specify e.g. the type of plot to be used. In the following example we distinguish the different event types, by stacking histograms for each event type.

library(ggplot2)

ggplot(data = acled_africa, mapping = aes(x=YEAR, fill=EVENT_TYPE)) +
  geom_histogram(binwidth=1) + # one bin per calendar year
  xlab("")

First the ggplot() function is called and the variables of interest are configured in the aesthetics parameter. With a + (!) operator subsequent ggplot functions are added to the plot. The pipe does not work here. The next ggplot function is a layer that specifies the type of plot, e.g. geom_point, geom_line, geom_bar or geom_tile. Other ggplot functions that can be added optionally are to configure scales, labeling, legend, the theme or facet options.
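Because layers are added with +, a plot can be stored in a variable and extended step by step. The following sketch uses a small made-up data frame toy_plot_data and geom_bar(), which counts rows per category, together with labs() and theme_minimal() as examples of optional components.

```r
library(ggplot2)

toy_plot_data <- data.frame(
  year = c(2019, 2019, 2020, 2020, 2020),
  event_type = c("Battles", "Riots", "Battles", "Battles", "Riots")
)

# Step 1: data and aesthetics plus one layer that defines the plot type
p <- ggplot(toy_plot_data, aes(x = factor(year), fill = event_type)) +
  geom_bar()

# Step 2: optional components for labels and theme are added to the stored plot
p <- p +
  labs(x = "Year", y = "Number of events", fill = "Event type") +
  theme_minimal()

p  # printing the object draws the plot
```

Storing the plot in a variable makes it easy to try different themes or labels without repeating the data and layer definitions.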

But what do we see in the plot? A sharp increase in 2019 is visible. What does this look like for a single country, e.g. the Democratic Republic of Congo?

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
drc_subset <- acled_africa |> filter(COUNTRY=="Democratic Republic of Congo") 

ggplot(drc_subset, aes(x=YEAR)) +
  geom_histogram(binwidth=1) + xlab("") # one bin per calendar year

ggplot(drc_subset, aes(x=YEAR, fill=EVENT_TYPE)) +
  geom_histogram(binwidth=1) + xlab("")

More resources on ggplot2:

  • ggplot2 essentials (sthda.com)
  • The R Graph Gallery (r-graph-gallery.com)

3.3.4 Subsetting

dplyr is the Swiss army knife package for data wrangling and manipulation, and the filter() function used above comes with it. The == operator tests for exact equality. Only rows for which the condition evaluates to TRUE are returned.

Selection of other logical operators:

Operator    Description
!=          not equal
>           greater than
<           less than
>=          greater than or equal to
<=          less than or equal to
a & b       logical AND: a AND b
a | b       logical OR: a OR b
a %in% b    test whether each element of a is contained in b
is.na(a)    test if a is NA
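The operators work the same way in base R subsetting and inside filter(). The sketch below uses base R and a made-up data frame ops_demo, so it runs without additional packages.

```r
ops_demo <- data.frame(
  country = c("Uganda", "Kenya", "Uganda", "Mali"),
  year = c(2019, 2021, 2022, 2021),
  fatalities = c(0, 3, NA, 1),
  stringsAsFactors = FALSE
)

# AND: both conditions must hold
recent_uganda <- ops_demo[ops_demo$country == "Uganda" & ops_demo$year >= 2020, ]

# %in%: the country must be one of several values
two_countries <- ops_demo[ops_demo$country %in% c("Uganda", "Kenya"), ]

# is.na() combined with a comparison: known and positive fatalities only
known_fatal <- ops_demo[!is.na(ops_demo$fatalities) & ops_demo$fatalities > 0, ]

nrow(recent_uganda); nrow(two_countries); nrow(known_fatal)
```

Testing for NA explicitly matters here, because a comparison such as NA > 0 yields NA rather than FALSE.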

The |> operator is the so-called pipe. The pipe is a way to avoid both nesting functions and storing intermediate results, which becomes clearer with longer chains of commands. It works by passing the result of the expression before the pipe as the first argument to the function after the pipe.

Before R version 4.1.0 the |> pipe was not part of base R, that is, the basic functionality provided by R without any extension libraries / packages. Instead there was the magrittr pipe %>%. The magrittr pipe is still very popular and you will see it in a lot of documentation and tutorials. For the simple chains used in this chapter, |> and %>% behave the same. However, in order to use the magrittr pipe you have to load either the library itself with library(magrittr) or a library that provides it via dependencies, like dplyr, tidyr or tidyverse.

In RStudio you can use the hotkey combination [ctrl] + [shift] + [m] to write pipes faster. Out of the box, RStudio inserts the magrittr pipe; in the settings you can change this to the native pipe.

# sequential, with individual assignments
numeric_vectrs <- c(1,2,3,4)
mean_vectrs <- mean(numeric_vectrs)
mean_string <- as.character(mean_vectrs)
paste0(mean_string, " is the mean")
## [1] "2.5 is the mean"
# nested
paste0( # function to concatenate character values without a separator
  as.character( # function to convert a value to the character type
    mean( # function to calculate the average of a numeric vector
      c(1,2,3,4) # a vector of numerics
      )
    )
  ," is the mean") # second input for the concatenate function
## [1] "2.5 is the mean"
# and with pipes:
c(1,2,3,4) |> mean() |> as.character() |> paste0(" is the mean")
## [1] "2.5 is the mean"
library(magrittr)
c(1,2,3,4) %>% mean() %>% as.character() %>% paste0(" is the mean")
## [1] "2.5 is the mean"

Back to the ACLED dataset.

  • How many events of the type Violence against civilians took place in the DRC after 2020 (that is, from 2021 onwards)?
# Counts the number of rows that meet filter criteria. 
drc_subset |>
  filter(EVENT_TYPE == "Violence against civilians" & YEAR > 2020) |>
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1661
  • How many fatalities occurred in DRC events of the type Violence against civilians after 2020 (from 2021 onwards)?
# Sums up the number of fatalities of each row that meets the filter criteria. 
drc_subset |>
  filter(EVENT_TYPE == "Violence against civilians" & YEAR > 2020) |>
  summarise(total_fatalities=sum(FATALITIES))
## # A tibble: 1 × 1
##   total_fatalities
##              <dbl>
## 1             3109

3.3.5 Transform

  • Which actor in the DRC perpetrated the most events of the type Violence against civilians after 2020 (from 2021 onwards), and how does this compare to Uganda?

DRC:

drc_subset |> 
  filter(EVENT_TYPE == "Violence against civilians" & YEAR > 2020) |>
  group_by(ACTOR1) |>
  summarise(violent_acts=n()) |> 
  arrange(desc(violent_acts))
## # A tibble: 107 × 2
##    ACTOR1                                                           violent_acts
##    <chr>                                                                   <int>
##  1 Unidentified Armed Group (Democratic Republic of Congo)                   578
##  2 ADF: Allied Democratic Forces                                             372
##  3 CODECO-URDPC: Cooperative for Development of Congo (Union of Re…          109
##  4 CODECO: Cooperative for Development of Congo                               79
##  5 Military Forces of the Democratic Republic of Congo (2019-)                64
##  6 Mayi Mayi Militia                                                          39
##  7 FPAC: Ituri Self-Defense Popular Front (Zaire)                             26
##  8 Batwa Ethnic Militia (Democratic Republic of Congo)                        24
##  9 Police Forces of the Democratic Republic of Congo (2019-)                  22
## 10 Chini Ya Kilima-FPIC: Patriotic and Integrationist Force of Con…           15
## # ℹ 97 more rows

Uganda:

acled_africa |>
  filter(EVENT_TYPE == "Violence against civilians" & YEAR > 2020 & COUNTRY=="Uganda") |>
  group_by(ACTOR1) |>
  summarise(violent_acts=n()) |> 
  arrange(desc(violent_acts))
## # A tibble: 31 × 2
##    ACTOR1                                               violent_acts
##    <chr>                                                       <int>
##  1 Unidentified Armed Group (Uganda)                              88
##  2 Police Forces of Uganda (1986-)                                80
##  3 Military Forces of Uganda (1986-)                              43
##  4 Karamajong Ethnic Militia (Uganda)                             35
##  5 Military Forces of Uganda (1986-) Local Defense Unit            8
##  6 Private Security Forces (Uganda)                                8
##  7 Police Forces of Uganda (1986-) Prison Guards                   7
##  8 Dodoth Ethnic Militia (Uganda)                                  3
##  9 Unidentified Communal Militia (Uganda)                          3
## 10 Unidentified Ethnic Militia (Uganda)                            3
## # ℹ 21 more rows

The group_by() function creates groups within the tibble. Subsequent functions, like summarise() in the example, are executed per group instead of on the whole table. n() is a context-dependent expression that returns the current group size, comparable to count().

arrange() sorts the tibble by one or more columns. The sort order is ascending by default; wrapping a column in desc() makes it descending.

Selection of other summary functions:

Function      Description
mean()        mean or average
median()      median
min()         minimum value
max()         maximum value
quantile()    nth quantile
sd()          standard deviation
var()         variance
first()       first value
last()        last value
What are the top 10 countries with respect to amount of events?

acled_africa |> 
  group_by(COUNTRY) |>
  summarize(event_count=n()) |> 
  arrange(desc(event_count)) 
## # A tibble: 57 × 2
##    COUNTRY                      event_count
##    <chr>                              <int>
##  1 Somalia                            36352
##  2 Democratic Republic of Congo       24249
##  3 Nigeria                            24193
##  4 Sudan                              16857
##  5 South Africa                       16476
##  6 Algeria                            11875
##  7 Egypt                              11300
##  8 Libya                              10905
##  9 Burundi                             9767
## 10 Tunisia                             9228
## # ℹ 47 more rows

What are the top 10 countries for the last 5 years with respect to all event types?

acled_africa |> 
  filter(YEAR > 2017) |>
  group_by(COUNTRY) |>
  summarize(event_count=n()) |> 
  arrange(desc(event_count)) 
## # A tibble: 57 × 2
##    COUNTRY                      event_count
##    <chr>                              <int>
##  1 Democratic Republic of Congo       12837
##  2 Nigeria                            12797
##  3 Somalia                            11662
##  4 South Africa                        7052
##  5 Algeria                             7017
##  6 Sudan                               5833
##  7 Tunisia                             5616
##  8 Burkina Faso                        4981
##  9 Cameroon                            4726
## 10 Mali                                4699
## # ℹ 47 more rows
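The outputs above show exactly ten rows only because a tibble prints its first ten records. To really cut the result down to the top entries, dplyr offers slice_head() and slice_max(). A sketch with a made-up data frame toy_counts and n = 2 instead of 10:

```r
library(dplyr)

toy_counts <- data.frame(
  country = c("A", "B", "C", "D"),
  event_count = c(5, 20, 12, 1)
)

# Sort descending and keep the first n rows
top2 <- toy_counts |>
  arrange(desc(event_count)) |>
  slice_head(n = 2)

# slice_max() sorts and cuts in one step
top2_alt <- toy_counts |>
  slice_max(event_count, n = 2)

top2
```

Note that slice_max() keeps ties by default, so it can return more than n rows when several countries share the same count.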

3.4 Modifying data

Next we will alter the tibble according to our needs. First we add a new column with simplified event types and reduce the columns to the ones we are interested in. Then we aggregate the events and pivot the table from the current long format to the wide format. We define the wide format as one record/row being unique by country and year.

acled_africa_altered <- acled_africa |>
  mutate(event_type_simple = case_when(
    EVENT_TYPE == "Battles" ~ "battles",
    EVENT_TYPE == "Explosions/Remote violence" ~ "remote_violence",
    EVENT_TYPE == "Protests" ~ "protests",
    EVENT_TYPE == "Riots" ~ "riots",
    EVENT_TYPE == "Strategic developments" ~ "strategic_dev",
    EVENT_TYPE == "Violence against civilians" ~ "violence_civilians"
  )) |> 
  select(YEAR, COUNTRY, event_type_simple) # keep only the columns needed for the aggregation

With the mutate() function of the dplyr library, one can modify existing columns or add new ones. In combination with case_when(), a column can be filled based on conditions. Take a look at this blogpost for further information. The select() function extracts the specified columns into a new tibble. Conversely, columns can be excluded with the following syntax: select(-c(columnA, columnB, columnC))
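Values that match none of the conditions in case_when() become NA. A catch-all condition at the end avoids that. The sketch below uses a made-up data frame toy_types; the label "other" is an arbitrary choice for illustration, and the example also shows the exclusion syntax of select().

```r
library(dplyr)

toy_types <- data.frame(
  EVENT_TYPE = c("Battles", "Riots", "Something new"),
  NOTES = c("a", "b", "c"),
  YEAR = c(2020, 2021, 2022)
)

recoded <- toy_types |>
  mutate(event_type_simple = case_when(
    EVENT_TYPE == "Battles" ~ "battles",
    EVENT_TYPE == "Riots" ~ "riots",
    TRUE ~ "other"          # fallback for everything not matched above
  )) |>
  select(-c(NOTES))         # drop a column instead of listing the ones to keep
recoded
```

The conditions are evaluated in order, so the fallback has to come last.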

Next we aggregate the tibble by country, year and event type.

aggregated_events <- acled_africa_altered |>
  group_by(COUNTRY, YEAR, event_type_simple) |>
  summarize(event_count=n()) 
## `summarise()` has grouped output by 'COUNTRY', 'YEAR'. You can override using
## the `.groups` argument.
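The message above is only informational. As a sketch on made-up data, the grouping can be dropped directly with the `.groups` argument, and dplyr's count() offers a shorter route to the same table. The toy rows are an assumption for illustration.

```r
library(dplyr)

# Toy data with the same column names as acled_africa_altered; rows are invented.
toy <- tibble(
  COUNTRY = c("Mali", "Mali", "Mali", "Sudan"),
  YEAR = c(2020, 2020, 2021, 2020),
  event_type_simple = c("battles", "battles", "riots", "riots")
)

# Variant 1: drop the grouping explicitly, which also silences the message
counts_a <- toy |>
  group_by(COUNTRY, YEAR, event_type_simple) |>
  summarize(event_count = n(), .groups = "drop")

# Variant 2: count() groups, counts and ungroups in one step
counts_b <- toy |>
  count(COUNTRY, YEAR, event_type_simple, name = "event_count")

counts_a
```

Both variants return an ungrouped tibble, so a later ungroup() call becomes redundant but remains harmless.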

The resulting tibble is about 6,590 rows long. We want a tibble where each combination of country and year defines a single row. For this we need to pivot the table from the long format to the wide one.

Figure: Example of pivoting a table from wide to long and from long to wide format

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
aggregated_events_wide <- aggregated_events |> 
  ungroup() |> # the tibble still contains grouping information from group_by, this function removes it.
  pivot_wider(names_from=event_type_simple, values_from=event_count) 
aggregated_events_wide
## # A tibble: 1,269 × 8
##    COUNTRY  YEAR battles remote_violence violence_civilians protests
##    <chr>   <dbl>   <int>           <int>              <int>    <int>
##  1 Algeria  1997       8              17                116       NA
##  2 Algeria  1998      14              13                 20        1
##  3 Algeria  1999      27              11                 25       NA
##  4 Algeria  2000      95              12                 61        2
##  5 Algeria  2001      78              11                 48       18
##  6 Algeria  2002      99              37                 72       12
##  7 Algeria  2003      91              18                 45       28
##  8 Algeria  2004      52              14                 22       12
##  9 Algeria  2005      50              18                 10        6
## 10 Algeria  2006      98              48                 32        7
## # ℹ 1,259 more rows
## # ℹ 2 more variables: strategic_dev <int>, riots <int>
aggregated_events_wide <- aggregated_events_wide |>
  rowwise() |> # without rowwise(), sum() would add up whole columns instead of one row at a time
  mutate(total = sum(battles, remote_violence, violence_civilians, protests, strategic_dev, riots, na.rm = TRUE)) # na.rm = TRUE ignores NAs; otherwise a single NA among the selected columns would make the result NA too

The tibble is now reduced to 1,269 rows. pivot_longer() performs the reverse operation. Next we add a column with the cumulative sum of total events per country over the years.
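A small round trip on made-up data may help to see both directions. This is a sketch, not the chapter's pipeline. The `values_fill = 0L` argument is an optional choice that assumes a missing combination means zero recorded events, which replaces the NAs seen in the output above.

```r
library(dplyr)
library(tidyr)

# Toy long table; column names follow aggregated_events, rows are invented.
long <- tibble(
  COUNTRY = c("Mali", "Mali", "Sudan"),
  YEAR = c(2020, 2020, 2020),
  event_type_simple = c("battles", "riots", "battles"),
  event_count = c(5L, 2L, 7L)
)

# long -> wide: one row per country and year, one column per event type
wide <- long |>
  pivot_wider(names_from = event_type_simple, values_from = event_count,
              values_fill = 0L)

# wide -> long: collect all event-type columns back into name/value pairs
long_again <- wide |>
  pivot_longer(cols = -c(COUNTRY, YEAR),
               names_to = "event_type_simple", values_to = "event_count")

wide
long_again
```

Note that the round trip yields four rows rather than the original three, because the filled-in zero for Sudan's riots is now an explicit record.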

aggregated_events_wide <- aggregated_events_wide |>
  group_by(COUNTRY) |>
  arrange(YEAR) |>
  mutate(cum_total = cumsum(total)) |>
  ungroup() |>
  arrange(COUNTRY,YEAR)
aggregated_events_wide
## # A tibble: 1,269 × 10
##    COUNTRY  YEAR battles remote_violence violence_civilians protests
##    <chr>   <dbl>   <int>           <int>              <int>    <int>
##  1 Algeria  1997       8              17                116       NA
##  2 Algeria  1998      14              13                 20        1
##  3 Algeria  1999      27              11                 25       NA
##  4 Algeria  2000      95              12                 61        2
##  5 Algeria  2001      78              11                 48       18
##  6 Algeria  2002      99              37                 72       12
##  7 Algeria  2003      91              18                 45       28
##  8 Algeria  2004      52              14                 22       12
##  9 Algeria  2005      50              18                 10        6
## 10 Algeria  2006      98              48                 32        7
## # ℹ 1,259 more rows
## # ℹ 4 more variables: strategic_dev <int>, riots <int>, total <int>,
## #   cum_total <int>
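To see what the grouped cumsum() does, here is a minimal sketch on made-up totals. cumsum() follows the current row order, so the rows are sorted by year before the running total is computed within each country.

```r
library(dplyr)

# Toy yearly totals, deliberately unsorted; the numbers are invented.
toy <- tibble(
  COUNTRY = c("Mali", "Sudan", "Mali", "Sudan"),
  YEAR = c(2021, 2020, 2020, 2021),
  total = c(3L, 10L, 1L, 20L)
)

toy_cum <- toy |>
  arrange(COUNTRY, YEAR) |>            # sort first, cumsum() depends on row order
  group_by(COUNTRY) |>
  mutate(cum_total = cumsum(total)) |> # the running total restarts for each country
  ungroup()

toy_cum
```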

Lastly, we plot the yearly totals by country as a facet plot.

aggregated_events_wide |>
  ggplot(aes(x=YEAR, y=total)) + 
  geom_line(alpha=0.5) +
  facet_wrap(~COUNTRY, ncol = 6)
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

With ggplot's facet_wrap() function, facet plots can be created from attributes that define groups within the data.

Now we join the HDI data by country and year.

names(hdi)
##  [1] "1990"     "1991"     "1992"     "1993"     "1994"     "1995"    
##  [7] "1996"     "1997"     "1998"     "1999"     "2000"     "2001"    
## [13] "2002"     "2003"     "2004"     "2005"     "2006"     "2007"    
## [19] "2008"     "2009"     "2010"     "2011"     "2012"     "2013"    
## [25] "2014"     "2015"     "2016"     "2017"     "2018"     "2019"    
## [31] "HDI Rank" "Country"
year_cols <- seq(from = 1990, to =2019, by=1) |> as.character()
year_cols
##  [1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [11] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [21] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018" "2019"
hdi_long <- hdi |> 
  pivot_longer(cols=all_of(year_cols), names_to = "year", values_to = "hdi") # all_of() tells pivot_longer() that year_cols is an external character vector holding the names of the columns to select

# check attributes to be used for joining
names(aggregated_events_wide)
##  [1] "COUNTRY"            "YEAR"               "battles"           
##  [4] "remote_violence"    "violence_civilians" "protests"          
##  [7] "strategic_dev"      "riots"              "total"             
## [10] "cum_total"
names(hdi_long)
## [1] "HDI Rank" "Country"  "year"     "hdi"
# verify datatype
typeof(aggregated_events_wide$COUNTRY)
## [1] "character"
typeof(hdi_long$Country)
## [1] "character"
typeof(aggregated_events_wide$YEAR)
## [1] "double"
typeof(hdi_long$year)
## [1] "character"
hdi_long$year <- as.numeric(hdi_long$year) # change character column to numeric
typeof(hdi_long$year) # worked
## [1] "double"
aggregated_events_wide_hdi <- aggregated_events_wide |>
  left_join(hdi_long, by=c("COUNTRY"="Country", "YEAR"="year"))

aggregated_events_wide_hdi
## # A tibble: 1,269 × 12
##    COUNTRY  YEAR battles remote_violence violence_civilians protests
##    <chr>   <dbl>   <int>           <int>              <int>    <int>
##  1 Algeria  1997       8              17                116       NA
##  2 Algeria  1998      14              13                 20        1
##  3 Algeria  1999      27              11                 25       NA
##  4 Algeria  2000      95              12                 61        2
##  5 Algeria  2001      78              11                 48       18
##  6 Algeria  2002      99              37                 72       12
##  7 Algeria  2003      91              18                 45       28
##  8 Algeria  2004      52              14                 22       12
##  9 Algeria  2005      50              18                 10        6
## 10 Algeria  2006      98              48                 32        7
## # ℹ 1,259 more rows
## # ℹ 6 more variables: strategic_dev <int>, riots <int>, total <int>,
## #   cum_total <int>, `HDI Rank` <chr>, hdi <chr>
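Two things are worth checking after a join like this. The output above lists hdi as a character column, and a left join silently leaves NA wherever the country names of the two sources differ. The sketch below shows both checks on made-up data. The country names and HDI values are placeholders, not taken from either dataset.

```r
library(dplyr)

# Toy stand-ins for aggregated_events_wide and hdi_long; all values are invented.
events <- tibble(
  COUNTRY = c("Country A", "Country B"),
  YEAR = c(2000, 2000),
  total = c(168L, 12L)
)
hdi_toy <- tibble(
  Country = c("Country A", "Republic of B"), # second name is spelled differently
  year = c(2000, 2000),
  hdi = c("0.6", "0.4")                      # stored as text, like hdi above
)

joined <- events |>
  left_join(hdi_toy, by = c("COUNTRY" = "Country", "YEAR" = "year")) |>
  mutate(hdi = as.numeric(hdi)) # convert the text values to numbers

# anti_join() returns the rows of events that found no partner in hdi_toy
unmatched <- events |>
  anti_join(hdi_toy, by = c("COUNTRY" = "Country", "YEAR" = "year"))

joined
unmatched
```

Rows listed in `unmatched` usually point to naming differences that need to be harmonized before the join, or to years that one source does not cover.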

3.5 Analyzing data

We load time series data on population, GDP per capita and life expectancy from gapminder. Gapminder is a dataset maintained by the Gapminder Foundation. The driving force behind the project was Hans Rosling, co-author of the book Factfulness. You can download the data here.

gapminder <- read.csv("data/gapminder.csv") |>
  tibble() # read.csv loads data as data.frame, with pipe and tibble function we get a tibble directly.

The gapminder dataset is a CSV file, so we use the function read.csv() to load it. We don't need to alter the parameters, as the file matches the function's defaults (comma as delimiter, period as decimal mark). If you use CSV files that were compiled, for instance, by a German authority, you may well need to adjust the parameters for string quoting, the delimiter and the decimal mark.
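As a sketch of such an adjustment, the made-up text below uses a semicolon as delimiter and a comma as decimal mark, which is a common layout in German-language exports. The content is invented and passed in through the `text` argument so the example runs without a file.

```r
# Made-up CSV content: semicolon-delimited, decimal comma
csv_text <- "name;value\nA;1,5\nB;2,25"

# Option 1: set delimiter and decimal mark explicitly
df_a <- read.csv(text = csv_text, sep = ";", dec = ",")

# Option 2: read.csv2() uses exactly these two settings as its defaults
df_b <- read.csv2(text = csv_text)

df_a
```

With the default settings of read.csv(), the same text would be read as a single column, so a quick look at the column names after loading is a cheap sanity check.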

The dataset contains the following attributes:

  • country: country name
  • continent: continent name
  • year: year of measurement, ranges from 1952 to 2007
  • lifeExp: life expectancy at birth, in years
  • pop: population
  • gdpPercap: GDP per capita (US$, inflation-adjusted)

Next we create some plots: life expectancy over time by continent, and a comparison of the years 1997 and 2007 in terms of life expectancy against GDP per capita. Can we predict life expectancy from GDP with a linear model?

gapminder |>
  ggplot(aes(x=year, y=lifeExp, group=country)) + 
  geom_line(alpha=0.5) +
  xlab("") +
  facet_wrap(~continent)

g_subset <- gapminder |>
  filter(year == 1997 | year == 2007)

g_subset |>
#  filter (year==2007) |>
  ggplot(aes(x=gdpPercap, y=lifeExp, color=country, size=pop)) + 
  geom_point() +
  facet_wrap(~year) + 
  xlab("GDP per capita [international US$]") +
  ylab("Life expectancy [years]") +
  theme(legend.position = "none")

gapminder |>
  filter(continent=="Africa") |>
  ggplot(aes(x=gdpPercap, y=lifeExp)) + # GDP on the x-axis, matching the axis labels
  geom_point() + xlab("GDP per capita [international US$]") +
  ylab("Life expectancy [years]") 

gapminder |>
  filter(continent=="Africa") |>
  ggplot(aes(x=gdpPercap, y=lifeExp)) + # GDP on the x-axis, matching the axis labels
  geom_point() +
  geom_smooth(method="lm") + xlab("GDP per capita [international US$]") +
  ylab("Life expectancy [years]") +
  theme_light()
## `geom_smooth()` using formula = 'y ~ x'

Next we create a linear model with the gapminder dataset. The dependent variable is life expectancy. GDP per capita is the predictor. In R the syntax follows this pattern:

[dependent variable] ~ [predictor A] + [predictor B] + [predictor C]

If the dataset contains NAs, the na.action parameter of lm() controls how they are handled, so set it according to your needs.
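A minimal sketch of the difference between two common settings, using made-up numbers: na.omit drops incomplete rows entirely, while na.exclude also drops them for fitting but pads residuals and fitted values with NA so they line up with the original rows.

```r
# Toy data with one missing response value; the numbers are invented.
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, NA, 6.2, 7.9, 10.1))

m_omit <- lm(y ~ x, data = d, na.action = na.omit)
m_excl <- lm(y ~ x, data = d, na.action = na.exclude)

length(residuals(m_omit)) # one residual per complete row
length(residuals(m_excl)) # one entry per original row, NA for the incomplete one
```

The estimated coefficients are identical in both cases. The choice only matters when residuals or predictions are written back into the original table.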

lmod <- lm(lifeExp ~ gdpPercap, data=gapminder)
summary(lmod)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.754  -7.758   2.176   8.225  18.426 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.396e+01  3.150e-01  171.29   <2e-16 ***
## gdpPercap   7.649e-04  2.579e-05   29.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
## F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16

The estimated effect of our predictor, GDP per capita, is significantly different from zero. This can be seen by comparing the regression coefficient estimate with its standard error or by looking at the associated p-value.
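The comparison of estimate and standard error can be reproduced by hand. The sketch below copies the rounded numbers from the summary output above, so the results agree with the printed values only up to rounding.

```r
# Values copied from the summary(lmod) output above (rounded as printed)
estimate  <- 7.649e-04
std_error <- 2.579e-05
df_resid  <- 1702

# The t value is the estimate expressed in units of its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 2)
p_value < 2e-16
```

An estimate that lies almost 30 standard errors away from zero is why the p-value falls below the smallest value that summary() prints.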

The model explains 34% of the variance, which is reported as R-squared. This indicates that there is a clear relationship, but also that other factors presumably explain much of the variability in the data.

If GDP per capita increases by 1000 US$ (purchasing power adjusted), life expectancy increases on average by about 0.76 years (1000 × 7.649e-04).
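The slope in the summary output is reported per single US$, so it has to be scaled before it is read as an effect per 1000 US$. A short sketch with the rounded coefficients copied from the output above follows. The GDP value of 5000 US$ is an arbitrary example.

```r
# Coefficients copied from the summary(lmod) output above (rounded as printed)
intercept     <- 53.96
slope_per_usd <- 7.649e-04

# Expected change in life expectancy (years) for 1000 US$ more GDP per capita
slope_per_1000_usd <- slope_per_usd * 1000

# Model prediction for a hypothetical GDP per capita of 5000 US$
predicted_life_exp <- intercept + slope_per_usd * 5000

slope_per_1000_usd
predicted_life_exp
```

With the fitted model at hand, `coef(lmod)` returns the unrounded coefficients and `predict(lmod, newdata = ...)` performs the same calculation.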

(The linear model has several shortcomings and could be improved in various ways (temporal autocorrelation, prediction of life expectancy below zero, …), but this is not the point here.)

References

Raleigh, C., Linke, A., Hegre, H., Karlsen, J., 2010. Introducing ACLED: An Armed Conflict Location and Event Dataset: Special Data Feature. Journal of Peace Research 47, 651–660. https://doi.org/10.1177/0022343310378914