4  Working with R

4.1 Installation

To work with R, you will need to install two pieces of software.

  • R. This is the actual R software that is used to run R code.
  • RStudio. This is a graphical user interface (GUI) that makes working with R much easier and prettier.

Both programs can be downloaded for free, and are available for all main operating systems (Windows, macOS and Linux).

Installing R

To install R, you can download it from the CRAN (Comprehensive R Archive Network) website. Do not be alarmed by the website’s 1990s aesthetics. The website is legit.

Installing RStudio

RStudio can be downloaded from posit.co, the website of Posit, the developer of RStudio. Make sure to pick the latest version available for your operating system.

Once you have installed R and RStudio, you can start by launching RStudio. If everything was installed correctly, RStudio will automatically launch R as well.

The first time you open RStudio, you will likely see three separate windows. The first thing you want to do is open an R Script to work in. To do so, go to the toolbar and select File -> New File -> R Script.

You will now see four windows split evenly over the four corners of your screen:

  • In the top-left you have the text editor for the file that you are working in. This will most of the time be an R script or RMarkdown file.
  • In the top-right you can see the data and values that you are currently working with (environment) or view your history of input.
  • In the bottom-left you have the console, which is where you can enter and run code, and view the output. If you run code from your R script, it will also be executed in this console.
  • In the bottom-right you can browse through files on your computer, view help for functions, or view visualizations.

While you can directly enter code into your console (bottom-left), you should always work with R scripts (top-left). This allows you to keep track of what you are doing and save every step.

4.2 Using RStudio

Running code

Copy and paste the following example code into your R Script. For now, don’t bother understanding the syntax itself. Just focus on running it.

3 + 3
2 * 5
6 / 2
"some text"
"some more text"
sum(1,2,3,4,5)

You can run code by selecting the code and clicking on the Run button in the toolbar. However, we highly recommend getting used to using the keyboard shortcut, because this will greatly speed up your process. On Windows (and Linux) the shortcut is Ctrl + Enter. On Mac it’s Command + Enter.

There are two ways to run code:

  • If you select a specific piece of code (so that it is highlighted) you can run this specific selection. For example, select the first three lines (the three mathematical operations) and press Ctrl + Enter. This should then print the results for these three mathematical expressions. Note that you can also select a specific part on a line. Try selecting just the second 3 on the first line. This should just print the number 3.
  • If you haven’t made a selection, but your text cursor is somewhere on a line in your editor, you can press Ctrl + Enter to run the line where the cursor is at. This will also move the cursor to the next line, so you can walk through the code from top to bottom, running each line. Try starting on the first line, and pressing Ctrl + Enter six times, to run each line separately.

Using RStudio projects

It is best to put all your code in an RStudio project. This is essentially a folder on your computer in which you can store the R files and data for a project that you are working on. While you do not necessarily need a project to work with R, they are very convenient, and we strongly recommend using them.

To create a new project, go to the top-right corner of your RStudio window. Look for the button labeled Project: (None). Click on this button, and select New Project. Follow the instructions to create a new directory with a new project. Name the project “R introduction”.

Now, open a new R script and immediately save it (select File -> Save in the toolbar, or press Ctrl + S). Name the file my_first_r_script.r. In the bottom-right corner, under the Files tab, you’ll now see the file added to the project. The extension .r indicates that the file is an R script.

4.3 Names and objects

The power of assignment

In R, and in computer programming in general, the most essential operation is to assign objects to names. By object, we broadly mean any piece of information: a single number, a text, a list of numbers, or even an entire data set.

In plain terms, assignment is how you make R remember things by assigning them to a name. To assign an object to a name, we use the arrow notation: name <- value. For example:

x <- 2

Instead of using the arrow notation, you can also use the equal sign notation: name = object.

x = 2

We will in general always use the arrow notation. But if you encounter the equal sign notation, just remember that it’s the same thing.
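
If you want to convince yourself that the two notations do the same thing, here is a minimal sketch. The names with_arrow and with_equals are just made up for this illustration; identical() is a base R function that checks whether two objects are exactly the same.

```r
# Assign the same value with both notations (the names are illustrative).
with_arrow <- 2
with_equals = 2

# identical() returns TRUE when the two objects are exactly the same.
identical(with_arrow, with_equals)
```

This should print TRUE, since both names now refer to the value 2.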

By running the code x <- 2, you are saying: Assign the value 2 to the name x. Any objects that you assigned to names are stored in your Environment. You can see this environment in the top-right window, under the Environment tab. If you assigned 2 to x, you should see a table with the names (x) in the left column and a description of the object in the right column. For simple objects like numbers, this will just be the value (2).

From here on, when you use the name x in your code, it will refer to the value 2. So when we run the code x * 5 (x times 5) it will print the number 10:

x * 5
[1] 10

When running x * 5, R correctly prints the value 10. But why does it say [1] 10? The reason is that R always thinks of a number (or string) as a vector (i.e. a list of values) that can have one or multiple values. The [1] indicates that 10 is the first (and only) value.

If you print a longer vector, you can see that R prints [...] at the start of each line, just to help you see the position of individual values. The following code generates a vector with numbers from 1 to 50:

1:50
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
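
The position numbers are easier to appreciate when the values differ from their positions. The sketch below is only an illustration: the name v is made up, and the square brackets are base R notation for picking out the value at a given position.

```r
# A vector of 50 values that do not match their positions.
v <- 101:150

length(v)  # the number of values in the vector: 50
v[1]       # the value at position 1: 101
v[26]      # the value at position 26: 126
```

So if a printed line starts with [26], the first value on that line is the 26th value of the vector.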

Assigning versus printing

Notice that when you ran the code x <- 2, R didn’t print any values to the console (the bottom-left window). But when you ran x * 5, R did print the value 10. Basically, when you run code, and you DO NOT assign the result to a name, R will print the result to the console.

So the following code will NOT print the object ("I will not be printed") to the console, but will store it (you can see the name pop up in the Environment tab):

i_will_be_remembered <- "I will not be printed"

And the following object ("I will be printed") will be printed to the console, but not stored in the Environment.

"I will be printed"
[1] "I will be printed"
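
Sometimes you want both: store a result and see it. A small sketch of the common ways to do this in base R (the name result is just an example):

```r
result <- 3 * 4     # stored, nothing is printed
result              # running just the name prints the stored value
print(result)       # print() does the same thing explicitly
(result <- 3 * 4)   # parentheses around an assignment store AND print the value
```

The last three lines should each print [1] 12.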

Assigning different types of objects

You can assign any type of object to a name, and you can use any name, as long as it starts with a letter and doesn’t contain spaces or symbols (but underscores are OK)

a_number = 5
my_cats_name = "Hobbes"

If you run this code and check your Environment (top-right), you should now see these name-object pairs added.
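
One thing to keep in mind when choosing names is that R is case-sensitive. The names below are made up for this illustration:

```r
my_value  <- 1
My_value  <- 2   # a different name, because R distinguishes upper and lower case
my_value2 <- 3   # digits are fine, as long as the name starts with a letter

my_value == My_value   # FALSE: the two names refer to different objects
```

After running this, you should see three separate names in your Environment.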

Assigning results

Till now we only directly assigned objects to names. This is convenient, but the power of assignment really shines when you use it to store results. For example, we can also do this.

x = 5 + 10

This is a very simple example, but just think for a second what this allows us to do. Since we can assign anything to a name, we can break down any complicated procedure into multiple steps! For now, the key lesson is just to wrap your head around the syntax for assigning objects to names. This is fundamental to everything you will be doing in R (and in programming in general).
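
To make the idea of breaking a procedure into steps a bit more concrete, here is a small sketch that calculates an average in three steps. The scores and the names are hypothetical.

```r
# Step 1: add up three (made-up) test scores.
scores_total <- 7 + 8 + 9

# Step 2: store how many scores there are.
n_scores <- 3

# Step 3: use the two stored results to calculate the average.
average_score <- scores_total / n_scores
average_score   # 8
```

Each intermediate result has its own name, so you can inspect every step in the Environment tab.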

4.4 Types of objects

Common types of objects

You can work with many types of data in R. Here are some of the most common types of objects you’ll encounter:

  • Numeric: Numbers, like 5, 3.14, or -0.5.
  • Character: Text, like "Hello, world!".
  • Factor: Categorical data, like education_level or country.

Let’s see how these types of objects work in practice.

number <- 5
number

character <- "Hello, world!"
character

factor <- factor(c(1,2,2,3), labels = c("A", "B", "C"))
factor
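
If you are not sure what type an object has, you can ask R with the class() function. The sketch below uses made-up values; levels() is a base R function that lists the categories of a factor.

```r
class(5)                  # "numeric"
class("Hello, world!")    # "character"

# A hypothetical factor with three education levels.
education <- factor(c("low", "high", "high", "middle"))
class(education)          # "factor"
levels(education)         # the categories, sorted alphabetically: "high" "low" "middle"
```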

Types determine what you can do with an object

The type of an object determines what you can do with it. For example, you can perform mathematical operations on numeric objects, but not on character objects.

10 + 10    # returns 20
"10" + 10  # throws an error

You can coerce objects to different types

Sometimes you have an object of the wrong type. For instance, your numeric data might have been read in as a character object.

number <- "5"

If you want to perform mathematical operations on number, you first need to convert it to a numeric object. You can do this using the as.numeric function.

number <- as.numeric(number)
number
class(number)    # numeric

Note that you cannot always convert objects to a different type. Just use your common sense here.

as.numeric("I am not a number")

When coercion is not possible, R will return NA (missing value) and give you a warning.
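
The same thing happens when you convert several values at once. Here is a minimal sketch with made-up input. We wrap the conversion in suppressWarnings() only to keep the output tidy; without it you would get the warning mentioned above.

```r
# Hypothetical values that were read in as text.
raw_values <- c("5", "3.2", "unknown")

# Convert to numeric; the value that is not a number becomes NA.
clean_values <- suppressWarnings(as.numeric(raw_values))
clean_values               # 5.0 3.2 NA

# is.na() tells you which values are missing after the conversion.
is.na(clean_values)        # FALSE FALSE TRUE
sum(is.na(clean_values))   # 1 value could not be converted
```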

Vectors and Data Frames

In practice, you’ll rarely work with single values alone. However, many operations you can perform on a single value can also be applied to multiple values at once. Understanding how individual values combine to form larger data structures is key to working effectively with data.

From Single Values to Vectors

A vector is a collection of values of the same type (e.g., all numeric or all character). You create a vector by combining individual values using the c() (combine) function. Just like with single numbers, a vector has a type.

numbers <- c(1, 2, 3, 4, 5)
class(numbers)
[1] "numeric"

Now, you can perform operations on the entire vector at once:

numbers * 10
[1] 10 20 30 40 50
sum(numbers)
[1] 15
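
Because all values in a vector must have the same type, R will quietly convert values if you mix types. The sketch below (with made-up names) shows this, together with two more operations that work on every value of a vector at once.

```r
# Mixing numbers and text: everything becomes character.
mixed <- c(1, "two", 3)
class(mixed)        # "character"
mixed               # "1" "two" "3"

numbers <- c(1, 2, 3, 4, 5)
numbers + numbers   # 2 4 6 8 10
numbers > 2         # FALSE FALSE TRUE TRUE TRUE
```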

From Vectors to Data Frames

Single vectors are still not very useful for data analysis, since we’re often interested in relations between variables. This is where data frames come in.

A data frame is a table with rows and columns, where each row is an observation and each column is a variable. You can think of a data frame as a collection of vectors, where each vector represents a column in the dataset. We can create a data frame by combining vectors using the data.frame() function.1

country <- c("NL", "NL", "BE", "BE", "DE", "DE", "FR", "FR", "UK", "UK")
height  <- c(176 , 165 , 172 , 160 , 180 , 170 , 175 , 165 , 185 , 175 )

d <- data.frame(country, height)

Now, you can perform operations on the entire data frame at once! You can for instance perform a statistical test to see if the average height of people differs between countries. Just like with single values and vectors, we need to take the type of each column into account. For instance, we can’t perform a correlation analysis using the country column, because it’s a character vector.
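
As a first taste, here is a small sketch that inspects the data frame from above using base R functions only. It assumes R 4.0 or later, where data.frame() keeps text columns as character vectors. The name by_country is made up for this example.

```r
country <- c("NL", "NL", "BE", "BE", "DE", "DE", "FR", "FR", "UK", "UK")
height  <- c(176, 165, 172, 160, 180, 170, 175, 165, 185, 175)
d <- data.frame(country, height)

nrow(d)            # 10 rows (observations)
ncol(d)            # 2 columns (variables)
sapply(d, class)   # the type of each column: "character" and "numeric"

# The $ notation selects a single column as a vector.
mean(d$height)     # 172.3

# tapply() applies a function (here: mean) to height, separately per country.
by_country <- tapply(d$height, d$country, mean)
by_country
```

This only describes the data. It does not test whether the differences between countries are statistically meaningful.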

4.5 Functions

What is a function?

99% of what you do in R will involve using functions. A function in R is like a mini-program that you can use to perform specific tasks. It takes input, processes it, and gives you an output. For example, there are functions for:

  • importing data
  • computing descriptive statistics
  • performing statistical tests
  • visualizing data

A function in R has the form:

output <- function_name(argument1, argument2, ...)

  • function_name is a name to indicate which function you want to use. It is followed by parentheses.
  • arguments are the input of the function, and are inserted within the parentheses.
  • output is anything that is returned by the function.

For example, the function c combines multiple values into a vector.

x = c(1,2,3,4)

Now, we can use the mean function to calculate the mean of these numbers:

m <- mean(x)

The calculated mean, 2.5, is now assigned to the name m:

m
[1] 2.5

Optional arguments

In the c and mean functions above, all the arguments were required. To combine numbers into a vector, we needed to provide a list of numbers. To calculate a mean, we needed to provide a numeric vector.

In addition to the required arguments, a function can also have optional arguments, that give you more control over what a function does. For example, suppose we have a range of numbers that also contains a missing value. In R a missing value is called NA, which stands for Not Available:

x_with_missing <- c(1, 2, 3, NA, 4)

Now, if we call the mean function, R will say that the mean is unknown, since the fourth value is unknown:

mean(x_with_missing)
[1] NA

This is statistically a very correct answer. But often, if some values happen to be missing in your data, you want to be able to calculate the mean just for the numbers that are not missing. Fortunately, the mean function has an optional argument na.rm (remove NAs) that you can set to TRUE (or to T, which is short for TRUE) to ignore the NAs:

mean(x_with_missing, na.rm=TRUE)
[1] 2.5

Notice that for the required argument, we directly provide the input x_with_missing, but for the optional argument we include the argument name na.rm = TRUE. The reason is simply that there are other optional arguments, so we need to specify which one we’re using.
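
Many other functions have optional arguments that work the same way. A short sketch using a few base R functions:

```r
x_with_missing <- c(1, 2, 3, NA, 4)

sum(x_with_missing)                 # NA, because one value is missing
sum(x_with_missing, na.rm = TRUE)   # 10
max(x_with_missing, na.rm = TRUE)   # 4

# round() has an optional digits argument (the default is 0).
round(3.14159)                      # 3
round(3.14159, digits = 2)          # 3.14
```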

To learn more about what a function does and what arguments it has, you can look it up in the ‘Help’ pane in the bottom right, or run ?function_name in R.

?mean

Here you can learn about the na.rm argument that we just used!

If you are just getting to know R, we recommend first finishing the rest of the Getting Started section. Then once you get the hang of things, have a look at the Use ?function help page tutorial.

Using the pipe syntax

There is another common way to use functions in R using the pipe syntax. With the pipe syntax, you can pipe the first argument into the function, instead of putting it inside the parentheses. As you will see below, this allows you to create a pipeline of functions, which is often easier to read.

argument1 |> function_name(argument2, ...)

For example, the following two lines of code give identical output:

mean(x_with_missing, na.rm=T)
[1] 2.5
x_with_missing |> mean(na.rm=T)
[1] 2.5

Notice how our first argument, the required argument x_with_missing, is piped into the mean function. Inside the mean function, we only specify the second argument, the optional argument na.rm.

So why do we need this alternative way of doing the same thing? The reason is that when writing code, you shouldn’t just think about what the code does, but also about how easy the code is to read. This not only helps you prevent mistakes, but also makes your analysis transparent. As you’ll see later, you’ll encounter many cases where your analysis requires you to string together multiple functions. In these cases, pipes make your code much easier to read.

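For example, imagine we want to round the result (2.5) to a whole number. With the pipe syntax we can just add the round function to our pipeline. Note that R rounds a value that lies exactly halfway between two whole numbers to the nearest even one, so round(2.5) gives 2 rather than 3. If you always want to round up, use ceiling() instead.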

x_with_missing |>
  mean(na.rm=T) |>
  round()

You’ll see how powerful this can be later on, especially in the Data Management chapter. In order to prepare and clean up your data, you’ll often need to perform a series of functions in a specific order. The pipe syntax allows you to do this in a very readable way.
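
To see the difference in readability, here is a sketch that writes the same three-step calculation in two ways: once with nested functions and once with pipes. The names nested and piped are made up, and the |> pipe requires R version 4.1 or later.

```r
x_with_missing <- c(1, 2, 3, NA, 4)

# Nested: you have to read from the inside out.
nested <- round(sqrt(mean(x_with_missing, na.rm = TRUE)), digits = 2)

# Piped: you can read from top to bottom.
piped <- x_with_missing |>
  mean(na.rm = TRUE) |>
  sqrt() |>
  round(digits = 2)

identical(nested, piped)   # TRUE
piped                      # 1.58
```

Both versions do exactly the same thing: take the mean, take the square root of the mean, and round the result to two decimals.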

Mastering functions

There are some useful tricks for using functions in R that are good to know about. We do not discuss these here, because if you’re just starting out, there are more important things to learn first. But once you get the hang of things, you can learn more about these tricks in the Good to Know tutorial.

4.6 Packages

What is a package?

In R, a package is a collection of functions, data, and documentation that extends the capabilities of R. You can think of packages kind of like apps on your phone: they provide additional functionality that you can use to perform specific tasks. Also, like apps, you can install and uninstall packages as needed, directly from within R.

There are thousands of R packages available, which enable you to use almost any existing data analysis technique. R is not just a tool for statistical analysis, but also for data visualization, data collection, artificial intelligence, and much more. If there is anything you need to do, there is a good chance that someone has already written a package for it.

How to install

To use a package, you first need to install it. Installing a package only needs to be done once per system (unless you need to update it).

1. Install a package

Most packages are available on the Comprehensive R Archive Network (CRAN), which is the main repository for R packages. To install these packages, all you need to know is their name.

For example, there is a package called lubridate that makes it easier to work with dates and times in your data. To install this package, you can use the install.packages() function:

install.packages("lubridate")

When running install.packages(), you sometimes get the message There is a binary version available but the source version is later (we have mainly seen this on Mac). You are then asked whether you want to install from source the package which needs compilation (Yes/no). To answer this question, you have to type “yes” or “no” in your R console.

It is then usually best to say NO. This will install a slightly older version of the package, but it will be much faster and easier to install. You often don’t need the latest version, so it’s not worth the extra hassle.

In case you’re curious, the reason for this is that the newest version has not been prepared for your system yet. They do have the source code, but it has not yet been compiled into a binary version that is ready to use. Think of the source code as a recipe, and the binary version as the ready-made dish. If you really want to have the newest version you can say “yes”, but you’ll have to cook it yourself! The main problem is that you will often need to install some extra software to do this, which can be a hassle. So unless you really need the newest version, it’s usually best to say “no”, and just install the older version that has already been prepared for your system.

2. Load a package

Once you have installed a package, it is not yet loaded into your current R session. Similar to when you install a new app on your phone, you need to open it every time you want to use it.

To use a package in your current R session, you can load it with the library() function:

library(lubridate)
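
If you are not sure whether a package is already installed, you can check this from R before loading it. The sketch below is one possible way to do this. requireNamespace() is a base R function that returns TRUE if the package can be found, and has_lubridate is just a name we made up. The package::function notation is an alternative to library() that uses a single function from a package.

```r
# Check whether lubridate is installed, without loading it yet.
has_lubridate <- requireNamespace("lubridate", quietly = TRUE)

if (has_lubridate) {
  library(lubridate)               # load the package for this session
  print(ymd("2021-01-03"))         # ymd() turns a year-month-day text into a date

  # Alternative: call one function directly with the package:: prefix.
  print(lubridate::year(as.Date("2021-01-03")))
} else {
  message("lubridate is not installed yet; run install.packages(\"lubridate\") first.")
}
```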

Package documentation

Most packages come with good documentation that explains how to use the functions in the package. For individual functions you can use the ? operator to open the help page for that function, as we explain in the section on viewing function documentation. But there is often also more general documentation that explains the package as a whole. This is called a vignette, and if a package has one, you can open it with the vignette() function.

vignette("lubridate")

You can look up vignettes and function documentation online. This is also a good way to find new packages for whatever you’re trying to do.

4.7 Viewing Function Documentation in R

One of the things that new R users can find overwhelming is the idea that they need to learn all functions by heart. This is not the case! Aside from a handful of functions that you will use all the time, you will often need to look up how to use a function. Rather than learning everything by heart, you therefore need to learn some tricks for quickly looking up information about functions.

One of the most important tricks is to use the built-in help system in R. You can quickly access documentation for any function using the ? symbol. This is a powerful tool that can help you understand how to use functions, what arguments they require, and what they return.

How to Use the ? Symbol

To view the documentation for a specific function, you simply need to type ? followed by the function name. For example, if you want to learn more about the mean() function, you would type:

?mean

This will open the help page, often in the bottom right pane of RStudio.

How to read the help page

The help page for a function is divided into several sections. The most important sections are:

Description

A brief description of what the function does. For the mean() function, the description is: Generic function for the (trimmed) arithmetic mean.

By generic function, they mean that the function can have multiple implementations. When you think of the mean, you are probably thinking of the mean of a vector of numbers.

x = c(1,2,3,4)
mean(x)
[1] 2.5

But you can do more! For example, you can also calculate the mean of a vector of Date values:

dates = as.Date(c("2021-01-01", "2021-01-03"))
mean(dates)
[1] "2021-01-02"
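
Another input type that works is a logical vector, with values TRUE and FALSE. R then counts TRUE as 1 and FALSE as 0, so the mean is the proportion of TRUE values. A small sketch with made-up data:

```r
# Hypothetical answers to a yes/no question.
answers <- c(TRUE, FALSE, TRUE, TRUE)

mean(answers)   # 0.75, i.e. three out of four answers are TRUE
```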

Usage

The syntax of the function, including all the arguments it takes. For example, the mean() function has the following usage:

mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)

The first part tells you that the most basic way to use the function is to provide an argument called x. What x is, is explained in the arguments section that we discuss below.

The ... at the end means that the function can take additional arguments. This is because the mean function is a generic function. Depending on the type of input you provide (e.g., numbers, dates), some arguments might not be relevant.

The second part tells you that the default method for the mean function has two arguments in addition to x: trim and na.rm. Note an important difference with the x argument: these arguments have default values (0 and FALSE, respectively). This means that these arguments are optional. If you don’t specify them, the function will use these default values.
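
You can also look up the arguments and their default values from the console, without opening the help page. This is just an optional extra: formals() and args() are base R functions, and mean.default is the default method listed in the usage section.

```r
# The names of all arguments of the default method.
names(formals(mean.default))   # "x" "trim" "na.rm" "..."

# The default values of the two optional arguments.
formals(mean.default)$trim     # 0
formals(mean.default)$na.rm    # FALSE

# args() prints the same information in the form of the usage section.
args(mean.default)
```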

For example, notice that the na.rm argument is set to FALSE by default. As we can see in the Arguments section, this means that the function will not remove missing values by default. (NA stands for Not Available, and is used in R to indicate missing values, so na.rm is short for remove NAs).

x_with_missing <- c(1, 2, 3, NA, 4)
mean(x_with_missing)
[1] NA

If we want to remove missing values, we can set na.rm to TRUE:

mean(x_with_missing, na.rm=TRUE)
[1] 2.5

Notice how in the code above we specify the argument name na.rm = TRUE to indicate that we want to use this optional argument. For the x argument we don’t need to specify the argument name, because it’s the first argument and the function knows that the first argument is x. Generally speaking, if you don’t specify the argument name, R will assume that you are providing the arguments in the order that they are listed in the usage section. Let’s think a bit about when we should and should not use argument names!

You could decide to always use argument names:

mean(x=x_with_missing, na.rm=TRUE)
[1] 2.5

This is fine, and sometimes you might want to do this for sake of clarity. But it’s also often unnecessary. For the mean function, it is obvious that the first argument is the input over which you want to calculate the mean, so you don’t need to specify the argument name.

On the opposite end of the spectrum, you could decide to never use argument names:

mean(x_with_missing, 0, TRUE)
[1] 2.5

Here the three arguments follow the order in the usage section: x, trim, na.rm.

This has two obvious downsides:

  • It is not obvious what the 0 and TRUE arguments are. The reader might thus have to look up the function documentation.
  • We now also need to specify the trim argument, because it comes before na.rm in the usage section.

So in general, it is often good to use argument names for optional arguments, like na.rm. For required arguments, like x, it is often not necessary. Arguably the best way to use the mean function with na.rm is therefore:

mean(x_with_missing, na.rm=TRUE)
[1] 2.5
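
Two more things follow from how argument names work, shown in the sketch below. First, named arguments can be given in any order. Second, naming an optional argument lets you skip the ones that come before it. The vector with_outlier is made up for this example.

```r
x_with_missing <- c(1, 2, 3, NA, 4)

# Named arguments can be given in any order.
mean(na.rm = TRUE, x = x_with_missing)   # 2.5

# Using trim by name, without having to specify na.rm.
with_outlier <- c(1, 2, 3, 4, 100)
mean(with_outlier)                # 22
mean(with_outlier, trim = 0.2)    # 3: the lowest and highest 20% are dropped first
```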

Arguments

A description of all the arguments that the function takes. This should cover all the arguments that are listed in the usage section.

For example, the help page for the mean() function explains that the x argument can be a numeric vector, but also something like a logical or date vector. For the na.rm argument it explains that if set to TRUE, missing values will be removed before calculating the mean.

Value

The value section explains what the function returns (i.e. the output).

Examples

The examples section shows you how to use the function. Honestly, this is often the most useful part of the help page. If you are not sure how to use a function, a great way to learn is to look at the examples. Usually, you can directly copy-paste these examples into your script and run them to see how the function works.
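
Instead of copy-pasting, you can also let R run the examples for you with the example() function from base R. The sketch below assumes that the help files are available in your R installation; example_lines is a made-up name.

```r
# Get the example code from the help page of mean() as text, without running it.
example_lines <- example("mean", give.lines = TRUE)
head(example_lines)

# Run the examples and show both the code and the output in the console.
example("mean")
```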

4.8 Using tab completion

A very useful trick in RStudio, and in programming in general, is to use tab completion. The tab key on your keyboard (the one above the caps lock key) can be used to complete the name of a function or variable. And if there are multiple ways in which the name can be completed, R will show you all the options. (On some devices you might need to press Tab twice, or it might not work at all).

For example, type the following code in your R script:

mean()

Now place your cursor between the parentheses and press the Tab key. This should now list all the arguments for the mean function, including their descriptions!

Since the parentheses of the mean function were empty, you should only have seen the x argument (and the ... argument). But you can also use tab completion when adding additional arguments. Type the following code in your R script:

x = c(1,2,3,4)
mean(x, )

Now place your cursor between the parentheses after the comma, and press the Tab key. Now you should see the trim and na.rm arguments, because RStudio knows that these arguments apply to a mean function with a numeric input!

Try using tab completion everywhere

Well ok, not everywhere. But you might be surprised how often it can help you. It can even help you find files on your computer. If you use tab completion between quotes, RStudio will show you all the files in your working directory that match the characters you’ve typed so far. So you can use this inside functions like read_csv to quickly find the file you want to read.

library(tidyverse)
read_csv("")

Try it out!


  1. Throughout this book we will actually be using the tibble data structure from the tidyverse package, which is an improved version of the data.frame.