Chapter 2 Working with R and RStudio

R is a programming language in which you enter commands for reading data, creating graphics and computing statistics, as well as program structures such as loops, functions or classes. RStudio is an IDE (Integrated Development Environment), i.e. a working environment that builds on R and simplifies working with R in many respects. This chapter covers how to install both programs and how to carry out basic tasks, including a short overview of how to work in RStudio.

2.1 Setting up R & RStudio

R and RStudio are available as binary installers for MS Windows, macOS and common Linux distributions (Ubuntu, Debian, Fedora, SUSE, Red Hat). For other platforms, the software has to be compiled from source. I have no experience with running them on smartphones or tablets. I recommend working on a laptop or desktop computer, if necessary in the PC pool.

It is important to install R first and RStudio second.

If an older version of R or RStudio is already installed, I recommend updating it. It is not necessary to install the very latest version. The LTS release of Ubuntu Linux, for example, usually ships a somewhat older version of R. This is not a problem.

RStudio is distributed in several editions. The free edition offers everything we need in this course (and probably in later professional life as well).

https://cran.r-project.org/

https://www.rstudio.com/products/rstudio/download/

2.2 R packages

The functionality of R can be extended with add-on packages. These can be installed from the command line or via the GUI.

The following code chunk can be used to install the packages needed in this course. This only has to be done once. The easiest way is to copy the code chunk and paste it into the R console (see below). The code checks which packages are already installed and installs the missing ones. You will probably be asked where you want to download the packages from. Choose CRAN.

Watch for error messages during the installation. Occasionally external libraries are missing and have to be installed separately. Usually, however, no problems should occur.

# Packages needed in this course
thePackages <- c("knitr", "tidyverse", "dplyr", "ggplot2", "sf", "s2", "raster", "terra", 
               "tmap",  "GGally", "corrplot", "ggpubr", "ggExtra", "fitdistrplus", "moments",
               "FAdist", "crch", "ggpmisc", "AICcmodavg", "polycor",
               "AMR", "DescTools", "ellipse")
# Install only those packages that are not installed yet
install.packages(setdiff(thePackages, rownames(installed.packages())))
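
To follow up on the advice about error messages, one possible check after the installation is to ask R whether each package can actually be loaded. This is a minimal sketch and not part of the original course code: pkgs is a shortened, hypothetical stand-in for the full package list, and notARealPackage123 is an invented name that only simulates a failed installation.

```r
# Hypothetical short list; in practice use the full list of course packages.
pkgs <- c("stats", "utils", "notARealPackage123")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(ok)

# Names of the packages that still need attention
failed <- names(ok)[!ok]
print(failed)
```

Packages listed in failed are candidates for a second look at the installation messages.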

2.3 RStudio interface

Figure 2.1: RStudio Interface

Source / Script

The Source pane in the upper left corner contains the R source code, i.e. the commands to be executed. Commands are executed top-down, i.e. commands further up run first. The order therefore matters: the data must be read in before they can be plotted. With the Run button at the top right of the pane you can:
execute the whole script (with all lines selected).
execute a single line: place the cursor in the line to be executed, then press Run or [ctrl] + [enter]. (The line does not have to be selected; it is enough for the cursor to be in the line.)
execute several lines or part of a line: select the part to be executed, then press Run or [ctrl] + [enter].
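
Why the order matters can be seen in a minimal sketch. The data frame measurements and its values are invented for illustration; in a real script the first step would typically read a file.

```r
# Step 1: create (or read in) the data first.
measurements <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))

# Step 2: only now can the data be used, e.g. summarised or plotted.
mean_y <- mean(measurements$y)
print(mean_y)
# plot(measurements$x, measurements$y) would also work from this point on.

# Running step 2 before step 1 stops with an "object not found" error,
# because measurements does not exist yet at that point.
```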

Console

Commands are executed in the Console. They are sent to the console and run there. Feedback such as error messages or results also appears here. A further tab holds the Terminal, which sends commands to the operating system; the terminal differs depending on the operating system. The Background Jobs tab lists all R operations running in the background, e.g. scripts that run independently of the console. This includes, for example, the installation of packages. The Render tab runs jobs that involve Knit or bookdown commands, i.e. jobs that turn R Markdown documents into HTML, LibreOffice/Microsoft Office documents or PDF.
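
Both kinds of console feedback, results and error messages, can be tried out directly. The following lines are only an illustrative sketch; tryCatch() is used here so that the error message is captured as text instead of stopping the run.

```r
# A result: the console prints the value of the expression.
result <- sqrt(16)
print(result)

# An error: taking the logarithm of text is not possible.
# tryCatch() captures the message so that the script keeps running.
msg <- tryCatch(log("a"), error = function(e) conditionMessage(e))
print(msg)
```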

Environment / History

The Environment panel at the top right lists all variables and functions defined in the current project. Individual variables can be clicked to get more information; data.frames are then displayed. The History tab lists all commands executed in the current project session.
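
The contents of the Environment panel can also be queried with commands. A small sketch with made-up objects (course_name and cases are hypothetical names chosen for this example):

```r
course_name <- "spatial data science"
cases <- data.frame(district = c("A", "B"), incidence = c(12.5, 30.1))

# ls() lists the names of the objects that are currently defined.
defined_objects <- ls()
print(defined_objects)

# str() gives a compact summary of the structure of an object.
str(cases)

# rm() removes an object again.
rm(course_name)
```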

Files/Plots/Packages/Help

The panel at the bottom right consists of the following tabs:
Files: a file explorer, e.g. for creating new folders or for moving, renaming or copying files. For some file formats such as .csv or .xlsx there is a graphical interface that can help with the import.
Plots: all plots created in the current session can be viewed here.
Packages: shows which packages are installed locally and lets you install further packages. It also shows whether updates are available.
Help: documentation of the installed packages. You can type command names here. Alternatively, you can get help for a command with ?, by typing the command name immediately after the question mark, e.g. ?read.csv
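
The help system can be reached from the console as well. The following sketch uses read.csv only as an example function; args() and formals() are shown as an assumption about what is handy for a quick look at the arguments, not as something the course prescribes.

```r
# Open the help page for read.csv; both lines are equivalent.
?read.csv
help("read.csv")

# args() prints the argument list of a function without opening the help page.
args(read.csv)

# formals() returns the arguments as a list; here only the first name is used.
first_argument <- names(formals(read.csv))[1]
print(first_argument)
```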

2.4 Paths and .Rproj

Working with R involves working with data. The data need to be loaded into R, so we need to tell R where to find them. Likewise, if we want to save the output of an analysis somewhere, we need to specify the location. Input and output data in R are accessed via paths. A dataset resides in a specific location on your hard drive in a folder structure. Likewise, R is executed at a specific location embedded in a folder structure. To load the desired dataset into R, the path to it must be defined correctly. There are two types of paths:

absolute paths and
relative paths

Consider the following example project file structure:

spatial_data_science/
├── notes.docx
├── spatial_data_science.Rproj
├── data
│   ├── acled_example.xlsx
│   ├── covid19_incidence_kreise.xlsx
│   ├── meuse.dbf
│   └── osm_bw.gpkg
└── src
    ├── exercise_1.Rmd
    ├── exercise_2.Rmd
    └── exercise_3.Rmd

Absolute paths start from the root of the file system. On a Windows machine, an absolute path looks like this:

c:/Documents and users/Desktop/Uni_stuff/classes/spatial_data_science/data/covid19_incidence_kreise.xlsx

Relative paths start from the location in which a program is currently running. In the following example, a program like R is running at the absolute path:

c:/Documents and users/Desktop/Uni_stuff/classes/spatial_data_science/src/

A relative path to the covid19 dataset looks like this: ../data/covid19_incidence_kreise.xlsx

The ../ indicates moving one folder level up from the current location.
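
How ../ is resolved can be tried out safely in a temporary folder. The sketch below rebuilds a small part of the example structure under tempdir(); the empty file created here is only a placeholder with the same name, not a real Excel file.

```r
# Rebuild part of the example project inside a temporary folder.
project <- file.path(tempdir(), "spatial_data_science")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "src"), showWarnings = FALSE)
file.create(file.path(project, "data", "covid19_incidence_kreise.xlsx"))

# Pretend R is running in src/ and remember the previous working directory.
old_wd <- setwd(file.path(project, "src"))

# From src/, go one level up (..) and then into data/.
relative_path <- "../data/covid19_incidence_kreise.xlsx"
found <- file.exists(relative_path)
print(found)

# normalizePath() turns the relative path into the matching absolute path.
absolute_path <- normalizePath(relative_path)
print(absolute_path)

# Restore the previous working directory.
setwd(old_wd)
```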

Hint: Windows specifies paths with the backslash \. However, a path written with single backslashes will not work in R, because R treats the backslash inside a string as an escape character. You need to use the forward slash / or a double backslash \\ instead.

c:/Documents and users/Desktop/Uni_stuff/classes/spatial_data_science/data/covid19_incidence_kreise.xlsx works fine
c:\Documents and users\Desktop\Uni_stuff\classes\spatial_data_science\data\covid19_incidence_kreise.xlsx does not work
c:\\Documents and users\\Desktop\\Uni_stuff\\classes\\spatial_data_science\\data\\covid19_incidence_kreise.xlsx works fine.
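
The behaviour of the backslash can be checked in the console. This is a small sketch with a shortened, hypothetical path; it shows that a double backslash in the code stands for a single backslash in the stored string and how such a path can be converted to forward slashes.

```r
forward <- "c:/data/file.xlsx"      # forward slashes
escaped <- "c:\\data\\file.xlsx"    # each \\ stands for one backslash

# cat() shows the string as R stores it, with single backslashes.
cat(escaped, "\n")

# A path copied from Windows can be converted to forward slashes.
converted <- gsub("\\", "/", escaped, fixed = TRUE)
print(identical(converted, forward))
```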

To check in which location the current R environment is running, use the command getwd(). To change it, use setwd(). However, it is bad practice to hard-code paths. A modern RStudio setup uses a project file (.Rproj) to define the working directory. In our example folder structure this would be the root folder of the course: spatial_data_science. Every script used in the project should use relative paths to input and output data.
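
A script inside such a project might define its paths as in the following sketch. It assumes the working directory is the project root; the results folder and the file name incidence_summary.csv are hypothetical and not part of the example structure.

```r
# Where is R currently running? In an RStudio project this is the project root.
print(getwd())

# Paths relative to the project root, built with file.path().
input_path  <- file.path("data", "covid19_incidence_kreise.xlsx")
output_path <- file.path("results", "incidence_summary.csv")  # hypothetical output
print(input_path)
print(output_path)

# Check that the input file is where the script expects it before reading it.
if (!file.exists(input_path)) {
  message("File not found relative to ", getwd(), ": ", input_path)
}
```

Because no absolute path appears in the script, the project folder can be moved or shared without changing the code.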

The following .gif shows how to create an R project from scratch:

Figure 2.2: How to create an R project in RStudio

2.5 R Markdown

For the exercises and assignments of this class we use R Markdown files. R Markdown files combine Markdown text, R code, results such as printed text or figures, and other media formats. They can be compiled into PDF, DOCX and HTML via the Knit function.

More on the R Markdown Syntax can be found here: https://rmarkdown.rstudio.com/lesson-1.html.

A general overview on Markdown Syntax can be found here: https://www.markdownguide.org/basic-syntax/

Figure 2.3: R Markdown Source file, knitting and result in html

The general structure of an R Markdown file is as follows:

a YAML header that specifies the author, the title, the name, the output format, …
code chunks that contain R code to be executed
normal text that can be formatted with the Markdown language
in addition, it is possible to include equations in LaTeX syntax and many more things that are beyond the scope of this course
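
The parts listed above can be seen together in a very small R Markdown file. The sketch below writes such a file to a temporary folder from R; the title, author and file name are invented, and html_document is just one possible output format.

```r
# Lines of a minimal, hypothetical R Markdown file
rmd_lines <- c(
  "---",                                    # YAML header starts
  "title: \"Exercise demo\"",
  "author: \"Jane Doe\"",
  "output: html_document",
  "---",                                    # YAML header ends
  "",
  "Normal text, formatted with *Markdown*.",
  "",
  "```{r}",                                 # code chunk starts
  "summary(cars)",
  "```",                                    # code chunk ends
  "",
  "An equation in LaTeX syntax: $y = a + b x$"
)

# Write the file to a temporary folder and show its contents.
rmd_file <- file.path(tempdir(), "exercise_demo.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")

# Command-line counterpart to the Knit button (needs rmarkdown and pandoc):
# rmarkdown::render(rmd_file)
```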

2.6 Further resources

How to get help when stuck?

Look up the function's documentation with ?<function name> within RStudio
Read through the vignettes of the package you are trying to use or understand. A vignette is a long-form guide to the main functions and use cases of a package, and a very convenient way of getting into its functionality. Here is an example vignette of the sf package. It contains information on the package development (author, GitHub page, version number), on how to install the package and troubleshoot installations, and references to the package's functions (the same documentation that is found via ? inside RStudio). The best part can be found under Articles: there you will find application examples of the most important functionality and explanations of the philosophy of the package.
Google it. R is very popular software with a big community. The problems you encounter have probably already been encountered by others and discussed online. When you google a problem, make sure to use the right terms. Here is an example query for finding out how to join data frames based on two or more columns:

R how to join two tables based on multiple columns?

Most likely such a Google search will take you to either Stack Exchange or Stack Overflow, two very popular question-and-answer platforms. There are subpages for GIS and R-related questions. Answers can be voted up and down by users and marked as a working solution by the person who asked the question.
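
For the example query above, an answer found this way might look like the following sketch. It uses merge() from base R, which is only one of several possible approaches; the two data frames and their column names are invented for illustration.

```r
# Two hypothetical tables that share the columns district and year
cases <- data.frame(
  district = c("A", "A", "B"),
  year     = c(2020, 2021, 2020),
  cases    = c(10, 20, 5)
)
population <- data.frame(
  district   = c("A", "A", "B"),
  year       = c(2020, 2021, 2020),
  population = c(1000, 1100, 500)
)

# Join on both columns at once by passing a vector of column names to 'by'.
joined <- merge(cases, population, by = c("district", "year"))
print(joined)
```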

Cheatsheets are quite popular in the R community. There are cheatsheets for a number of packages; here is a selection:

Books and online resources

Many of the books listed are available at the Heidelberg University Library in print, and some also as online resources.

(Carsten F. Dormann 2013): Very good introduction to parametric statistics using R, with examples from ecology.
- A precursor version of the book can be found on the CRAN website under Contributed Documentation
(Field, Miles, and Field 2012): Casually written overview of statistics with R. In English.
(Gotelli and Ellison 2004): Good overview of methods, but without examples in R or any other statistics software. Application examples from ecology. In English.
(Hedderich and Sachs 2018): Comprehensive reference work that contains examples in R alongside the theory.
(Jones, Harden, and Crawley 2022): Comprehensive introduction to statistics using R and to working with R. Application examples from ecology.
(Faraway 2004) and (Faraway 2016): Introductory and advanced regression methods. All applications in R. Broad range of examples from various application disciplines. In English.
- A precursor version of the books can be found on the CRAN website under Contributed Documentation
A large number of manuals in various languages can be found on the CRAN website under Contributed Documentation

Further reading / cited literature

Dormann, Carsten F. 2013. Parametrische Statistik. Springer.

Faraway, Julian J. 2004. Linear Models with R. Chapman and Hall/CRC.

Faraway, Julian J. 2016. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman and Hall/CRC.

Field, Andy P., Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. London; Thousand Oaks, Calif.: Sage.

Gotelli, Nicholas J, and Aaron M Ellison. 2004. A Primer of Ecological Statistics. Sinauer Associates Sunderland.

Hedderich, Jürgen, and Lothar Sachs. 2018. Angewandte Statistik. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-56657-2.

Jones, Elinor, Simon Harden, and Michael J Crawley. 2022. The R Book. John Wiley & Sons.