The goal of this document is to provide you with a basic understanding of R so you can begin to leverage it for data analysis. While R is not the only free software environment for data analysis, and perhaps not the most intuitive if your frame of reference is pretty much any other programming language, it is among the fastest growing programming languages worldwide (“Python Has Brought Computer Programming to a Vast New Audience” 2018).
Arguably, the key features for R are:
Now that you are sold on using R, let’s walk through the goals of this tutorial. In short, this document is designed to get you comfortable with the basics, such as:
Keep in mind that this document assumes you have little programming or data analysis experience. As such, it is intended to be comprehensive. Don’t let this over-specificity on “the basics” discourage you. The fact that you are curious about programming signals that you have some type of data requirement, whether that pertains to gathering, summarizing, transforming, tidying, exploring, visualizing, modeling, or presenting your data. This document will not cover most of those topics. Rather, this document is meant to serve as a primer for those eager to go down the R rabbit hole.
Finally, this document is one of a plethora of tutorials available to new users. If you have the opportunity, consult as many of these online resources as you can. Notable resources introducing R to new users include:
R is a free, open-source, and highly extensible language and environment for statistical computing and graphics (R Core Team 2018). In layman’s terms, R is a software suite primarily designed for data analysis, visualization, and communication, much like other tools (e.g., SPSS, SAS, etc.); however, R is free and open-source. These characteristics mean that the source code is available to anyone who wishes to access it. Distribution typically occurs through the Comprehensive R Archive Network (CRAN), a series of servers distributed around the world that facilitate the free distribution of the software. The latest version of R, as well as older versions, is archived and freely distributed through the CRAN website. The free distribution of the software includes a basic graphical user interface (GUI) where you can type commands and immediately see the results of executing them (see Figure 1).
In addition to the R software, the CRAN hosts a variety of open-source packages, which are bundles of share-able code, data, documentation, and tests intended to facilitate different types of tasks (e.g., clean, structure, and graph data). While R exists as a base environment with a reasonable amount of functionality, packages allow you to quickly expand the functionality and even interface with other programming languages, such as Python. To recap, packages are fundamental units of share-able code created by other R users intended to solve a common problem (Wickham 2015b).
Before moving forward, let’s set up some basic typographical conventions.
Italic - Indicating new terms, URLs, email addresses, file names, and file extensions.
Bold - Indicating the names of R packages.
Constant width - Used for program listings, as well as within paragraphs to refer to elements such as variable or function names, databases, data types, etc. In other words, it denotes code listings that should be typed as is or previously defined objects.
Constant width italic - Text that should be replaced with user-supplied values or determined by context.
Constant width bold - Shows commands or other text that users should type literally.
These conventions come from R for Data Science (Wickham and Grolemund 2017) and are included in order to help guide you through reading this documentation.
In this section we will cover a variety of jargon and programming principles crucial to R. Table 1 is a summary of the terms covered in the following subsections; this material was adapted from the R Language Definitions site.
Most of your actions in R will revolve around using commands in the console rather than dialog boxes, which might seem like a tall order at first; however, with time and practice, this workflow should become more enjoyable. The simplest way to wrap your brain around using commands is by thinking of them as using inputs and expecting outputs. This is to say, you provide R with an input, and it will produce an output. For instance you can type any number into the console as such:
1
## [1] 1
Notice the immediate output, which should mimic your numerical input. Let’s expand on this basic input/output model by including some functions. Base R is packed with features that allow you to work with data. For instance, to perform simple addition you should provide two numbers separated by an addition operator (+), like so:
1 + 1
## [1] 2
In this example, you used an arithmetic operator to connect two values and perform a simple calculation. In addition to operators, base R includes a variety of functions, which are sets of organized statements that perform a specific action, including creating other functions. Typical commands in R usually include one or more of the following:
For example, instead of performing the addition using the + operator, you can use the sum() function, which will take a series of values and return the sum of all arguments, like so:
sum(1, 1, 1)
## [1] 3
In order to store information in the R environment, the language uses an assignment operator, which is a statement that sets and/or resets a value stored and denoted by a variable name. The most common type of assignment in R will follow the general command form:
object <- function
Keep in mind that assignment commands contain a mixture of both objects and functions. However, assignments will always include an assignment operator (<-), like so:
added_value <- sum(1, 2, 3)
A simplified way to read the statement above would be “create an object named added_value, containing the value of the sum of 1 + 2 + 3”. Here the object is the value returned by the sum, and that object is bound to a name, added_value. Notice that no immediate output was produced by the previous command (at least not on the console). In order to call the value, type added_value into the console. What is the output? By typing the name of the object you summon the value. By this logic you should now be able to use the name of your object as a placeholder for the value(s) in the object. What happens if you add the added_value object to another number using the + operator? Finally, what is the result of adding the added_value object to itself?
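The questions above can be checked with a short sketch. It only reuses the added_value object defined earlier; lines starting with ## show the output you should expect on the console.

```r
# Bind the sum of 1, 2, and 3 to the name added_value
added_value <- sum(1, 2, 3)

# Typing the name prints the stored value
added_value
## [1] 6

# The name stands in for the value in further calculations
added_value + 1
## [1] 7

# Adding the object to itself doubles the stored value
added_value + added_value
## [1] 12
```

Note that none of these calculations change added_value itself; the name keeps pointing at 6 until you assign something new to it.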
Now that you have begun working with the console and creating objects, let’s take a look at the most important family of data types in R: vectors. This data type is the foundation on which more complex objects and structures are built. Vectors come in two flavors in R: atomic vectors and lists (Wickham 2019). We can define them as follows:
It is helpful to keep in mind two core concepts that pertain to vectors. First, these are one-dimensional structures. Think of them like a line of values all along the same axis. Second, they fall into two categories, homogeneous (atomic vectors) and heterogeneous (lists). More complex data structures build upon those two core concepts. For instance, matrices are two-dimensional homogeneous data structures. This is to say, they must contain the same type of values, but can have an x and y axis (think of columns and rows). Similarly, data frames are also two-dimensional, but their columns can hold different types of atomic vectors. Hadley Wickham, Chief Scientist at RStudio, summarizes these data types as follows (Wickham 2015a):
Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1D | Atomic Vector | List |
2D | Matrix | Data Frame |
Atomic vectors are homogeneous in nature, containing a single sequence of values of the same type, making them one-dimensional by design. The four atomic vector types most relevant to statistical analysis are:
- logical: TRUE or FALSE, which can be abbreviated to T and F respectively.
- double: A numeric class for floating point numbers. You may encounter doubles in decimal (e.g., 1.0, 1.2, etc.), scientific (1.23e4), or hexadecimal (e.g., 0xa, 0xab, 0xabc, etc.) form. Keep in mind that there are three special values associated with doubles:
  - Inf: Infinity.
  - -Inf: Negative infinity.
  - NaN: Not a number.
- integer: Much like doubles, these are numbers, but they are written followed by an L and do not contain fractional values (e.g., 1L, 1e4L, 0xaL).
- character: These contain text strings, which is to say any alphanumeric value or character surrounded by " or ' (e.g., "hello", "hello world", etc.).
The simplest way to create an atomic vector is using the c() function, which is short for combine and will allow you to bind multiple values, like so:
vector_lgl <- c(TRUE, FALSE, TRUE)
vector_dbl <- c(0x1, 2.0 , 3e0)
vector_int <- c(1L, 2L, 1:3L)
vector_chr <- c("Hello", "world", "!")
Much like with any object, you can print out the bound values by typing the name into the console. For instance, type the following into your console:
vector_lgl
vector_dbl
vector_int
vector_chr
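For reference, here is a sketch of what the console should show for each of the four vectors, followed by the three special double values. The divisions at the end are not part of the original walkthrough; they are only one assumed way of producing Inf, -Inf, and NaN.

```r
vector_lgl <- c(TRUE, FALSE, TRUE)
vector_dbl <- c(0x1, 2.0, 3e0)
vector_int <- c(1L, 2L, 1:3L)
vector_chr <- c("Hello", "world", "!")

vector_lgl
## [1]  TRUE FALSE  TRUE
vector_dbl
## [1] 1 2 3
vector_int
## [1] 1 2 1 2 3
vector_chr
## [1] "Hello" "world" "!"

# Special double values (illustrative divisions, not from the original text)
1 / 0
## [1] Inf
-1 / 0
## [1] -Inf
0 / 0
## [1] NaN
```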
Before we move away from atomic vectors and into more complex data structures, we should become familiar with the typeof() and length() functions. Both of these are part of base R and are fundamental tools for understanding crucial characteristics of a vector, such as its type and length.
typeof(vector_lgl)
## [1] "logical"
length(vector_lgl)
## [1] 3
Knowing the type of vector will be crucial down the line when you begin working with larger data. Each data type has specific attributes associated with it. In practical terms this means that each data type has “do’s” and “don’ts”. For instance, you can perform basic arithmetic on doubles and integers, but not on characters. Logicals are a special case: R coerces them to numbers in arithmetic, treating TRUE as 1 and FALSE as 0.
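A minimal sketch of these “do’s” and “don’ts” follows. One nuance worth seeing in practice is that R coerces logical values to numbers (TRUE as 1, FALSE as 0) when they appear in arithmetic. The tryCatch() wrapper and its replacement message are assumptions made for illustration, so that the failing line does not stop the script.

```r
vector_lgl <- c(TRUE, FALSE, TRUE)
vector_dbl <- c(1, 2, 3)
vector_chr <- c("Hello", "world", "!")

# Arithmetic on doubles is applied element by element
vector_dbl * 2
## [1] 2 4 6

# Logical values are coerced to numbers, so this counts the TRUE values
sum(vector_lgl)
## [1] 2

# Arithmetic on character vectors raises an error; tryCatch() swaps the
# error for a short message of our own choosing
chr_result <- tryCatch(vector_chr + 1,
                       error = function(e) "error: cannot add to text")
chr_result
## [1] "error: cannot add to text"
```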
Lists are similar to atomic vectors, with the crucial difference that they can, and usually do, contain a variety of data types. Think of them as a sequence where you can have all four vector types mentioned above, as well as other types (e.g., other lists, matrices, and data frames). This means that the objects along that sequence don’t have to be the same length or have the same dimensions (1D vs. 2D). This flexibility makes lists very versatile for storing multiple different types of data under one object name.
In order to create a list, you will enlist the aptly named function list(), which constructs or coerces data into a list type (R Core Team 2018). Let’s put that into motion:
my_list <- list(vector_lgl, vector_dbl, vector_int, vector_chr)
my_list
## [[1]]
## [1] TRUE FALSE TRUE
##
## [[2]]
## [1] 1 2 3
##
## [[3]]
## [1] 1 2 1 2 3
##
## [[4]]
## [1] "Hello" "world" "!"
Notice that the list above contains four objects, all atomic vectors, but of different flavors. However, each object is a single item on a sequence of objects, which means that we can go a step further and name each item along that sequence, much like naming the sequence itself:
my_list <- list(lgl = vector_lgl,
                dbl = vector_dbl,
                int = vector_int,
                chr = vector_chr)
my_list
## $lgl
## [1] TRUE FALSE TRUE
##
## $dbl
## [1] 1 2 3
##
## $int
## [1] 1 2 1 2 3
##
## $chr
## [1] "Hello" "world" "!"
A way to read the code above is “create an object named my_list, which is composed of a sequence of objects named lgl, dbl, int, and chr, which contain the values of the aforementioned atomic vectors”. Here we have an object name pointing at other object names, which enhances our ability to store data in more complex structures. To get a better sense of the list structure, use the str() function to display the internal structure of the R object (R Core Team 2018), like so:
str(my_list)
## List of 4
## $ lgl: logi [1:3] TRUE FALSE TRUE
## $ dbl: num [1:3] 1 2 3
## $ int: int [1:5] 1 2 1 2 3
## $ chr: chr [1:3] "Hello" "world" "!"
Also keep in mind that you can check whether an object is a list using the typeof() function. Additionally, you may want to count the number of items in this object using the length() function.
typeof(my_list)
length(my_list)
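As a sketch of what these two calls report, the chunk below rebuilds the named list with literal values so it can run on its own. The last line uses the $ operator to pull out a single named item; that step is an addition for illustration and shows that the length of the list differs from the length of the items inside it.

```r
my_list <- list(lgl = c(TRUE, FALSE, TRUE),
                dbl = c(1, 2, 3),
                int = c(1L, 2L, 1L, 2L, 3L),
                chr = c("Hello", "world", "!"))

typeof(my_list)
## [1] "list"

# length() counts the items in the list, not the values inside them
length(my_list)
## [1] 4

# Each item keeps its own length
length(my_list$int)
## [1] 5
```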
A matrix is a two-dimensional data structure in which each element must be the same type (e.g., all logical, double, character, or integer). Matrices are created with the matrix() function, like so:
my_matrix <- matrix(vector_dbl,
                    nrow = length(vector_dbl),
                    ncol = length(vector_dbl))
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2
## [3,]    3    3    3
Let’s unpack the statement above. The matrix() function receives the first argument, vector_dbl, which supplies the data to distribute across the rows and columns. In this instance, we are using the length of the numeric vector to set both dimensions of the matrix. Because a 3 x 3 matrix needs nine values and the vector has only three, R fills the matrix column by column and repeats the vector as needed. The output is a 3 x 3 square where each column is a copy of the initial vector.
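To make the filling behavior easier to see, here is a small sketch. The second matrix, built from 1:6, is not part of the original example; it is an assumed illustration showing that matrix() places values down the first column before moving to the next.

```r
vector_dbl <- c(1, 2, 3)
my_matrix <- matrix(vector_dbl, nrow = 3, ncol = 3)

# dim() reports rows first, then columns
dim(my_matrix)
## [1] 3 3

# A matrix is still a vector underneath, so length() counts every cell
length(my_matrix)
## [1] 9

# Values are placed column by column by default
matrix(1:6, nrow = 2, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```

If you would rather fill row by row, matrix() accepts a byrow = TRUE argument.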
Here we will not work with matrices extensively. However, keep in mind that these are a useful way to represent, manipulate, and study a linear map between finite-dimensional vectors. Put differently, matrices are crucial for the storage of data used in many mathematical and statistical analyses, particularly in linear algebra and graph theory.
The last structure to discuss is the data frame, which is perhaps the most useful and intuitive. Wickham’s table above states that this structure is two-dimensional and can contain a variety of data types. More formally, you can think of data frames as rectangular collections of named atomic vectors of equal length, where the vector names serve as column names and each row has a row name (in R, row.names). Informally, you may think of these as table-like structures similar to those used in SQL or other tabular data storage frameworks (e.g., Excel). Within R, a large portion of statisticians, scientists, programmers, and all-around users prefer storing their data in this format.
In order to create a data frame in R, you can use the data.frame() function, like so:
my_df <- data.frame(vector_lgl,
                    vector_dbl,
                    vector_chr)
my_df
##   vector_lgl vector_dbl vector_chr
## 1       TRUE          1      Hello
## 2      FALSE          2      world
## 3       TRUE          3          !
Notice that the column headers inherit the vector object names. Because data frames have both column and row names, you can use the colnames() and rownames() functions to obtain more information on the data frame. You can use column names in conjunction with different operators:
- The $ operator allows you to call out a single column at a time, like so:
my_df$vector_dbl
## [1] 1 2 3
- The [ and ] operators, used in conjunction with a comma and quotes (see the code chunk below). Keep in mind that this operator combination accesses rows first, then columns (e.g., [rows, columns]):
my_df[ , "vector_dbl"]
## [1] 1 2 3
# or
my_df["2", ]
##   vector_lgl vector_dbl vector_chr
## 2      FALSE          2      world
Unlike lists, which can also contain multiple types of vectors, here all vectors must have the same length. This is why we can think of data frames as rectangular structures. This property also means that in addition to length a data frame can have width. Put differently, we can count the rows and columns with functions like nrow() or ncol().
nrow(my_df)
ncol(my_df)
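Here is a sketch of the output to expect from these calls, along with the colnames() and rownames() functions mentioned earlier. The data frame is rebuilt with explicit column names so the chunk can run on its own.

```r
my_df <- data.frame(vector_lgl = c(TRUE, FALSE, TRUE),
                    vector_dbl = c(1, 2, 3),
                    vector_chr = c("Hello", "world", "!"))

nrow(my_df)
## [1] 3
ncol(my_df)
## [1] 3

colnames(my_df)
## [1] "vector_lgl" "vector_dbl" "vector_chr"

# Row names default to the row numbers, stored as text
rownames(my_df)
## [1] "1" "2" "3"
```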
Throughout this document you have been exposed to some functions (e.g., c(), typeof(), etc.). As mentioned previously, functions are repeatable instructions for the program to execute. As with any set of instructions, these can be simple or complex. However, the basic recipe for creating a function is rather straightforward:
function_name <- function(argument_1, argument_2, ...) {
  body
}
Let’s break down the recipe above:
- We use the assignment operator (<-) to create a new object containing our function. As such, much like with vectors, a function is an object with a name bound to it. However, in this case the object will contain a series of instructions.
- The function itself is created with the function() command.
- Inside the parentheses are the arguments (argument_1, etc.), also called formals. These can be any number of objects, such as numbers or data frames, to name a few. These are the inputs needed for the function to run.
- Everything between { and } is the body of a function. This section is where you will define the statements or instructions for the function. Note that the value of the last expression evaluated in the body is returned automatically.
With these basic building blocks in mind, let’s create a basic function. Our first task is to create a function that prints the string “hello world!” each time it is called:
say_hello <- function() {
  print("hello world!")
}
Now that we have defined our first function, you may proceed to call it by typing its name into the console:
say_hello()
## [1] "hello world!"
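Since say_hello() takes no inputs, it may help to see the same recipe with arguments. The sketch below defines a hypothetical function, add_two(), that is not part of the original tutorial; it shows the formals in action and the automatic return of the last expression.

```r
# add_two is a made-up name for illustration; x and y are its formals
add_two <- function(x, y) {
  # The last expression evaluated is returned automatically
  x + y
}

add_two(2, 3)
## [1] 5

# Arguments can also be passed by name, in any order
add_two(y = 10, x = 1)
## [1] 11

# formals() lists the arguments a function expects
names(formals(add_two))
## [1] "x" "y"
```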
Let’s elevate the level of difficulty a bit further, while using the basic components highlighted throughout this document. Many R users compile functions into bundles aimed at streamlining different tasks (e.g., data cleaning, analyzing, visualizing, sharing, etc.). These bundled sets of code are commonly known as R packages and are freely distributed through CRAN. The following code chunk checks whether each package in a set can be loaded locally and, if it cannot, runs the remote installation.
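check_packages <- function(.dependencies) {
  # Loop over the position of each package name in the vector
  for (i in seq_along(.dependencies)) {
    # requireNamespace() returns FALSE when the package cannot be loaded
    if (!requireNamespace(.dependencies[[i]], quietly = TRUE)) {
      # Install the missing package together with its dependencies
      install.packages(.dependencies[[i]], dependencies = TRUE)
    }
  }
}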
Let’s break down the code chunk above. First, we defined a new function named check_packages, which expects an argument named .dependencies. The function then generates a sequence of positions up to the length of the vector (for (i in seq_along(.dependencies)) {}). Each value in that vector is evaluated in order to determine whether or not the package can be loaded into your environment (if (!requireNamespace(.dependencies[[i]], quietly = TRUE)) {}); the quietly = TRUE argument suppresses the messages that loading would otherwise print. If a package cannot be loaded, the next step is to attempt to install the missing package along with its dependencies (install.packages(.dependencies[[i]], dependencies = TRUE)).
Now that you have a basic understanding of the check_packages() function, let’s put it to use. In the following set of tutorials we will be working with the tidyverse, a set of packages designed to work together that share a common philosophy of data and R programming. As such, it seems fitting to test whether or not this package is present on your machine:
packages <- c("tidyverse")
check_packages(.dependencies = packages)
Should the tidyverse already be installed on your computer, you will see no output on the console. However, should this package be missing, R will download and install it, printing its progress to the console.
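If you would like to see the check at the heart of check_packages() without installing anything, the sketch below calls requireNamespace() directly. It assumes that stats, which ships with R, is available, and it uses a made-up package name (notARealPackage123) to stand in for a missing package.

```r
# stats ships with every R installation, so this check should succeed
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats
## [1] TRUE

# notARealPackage123 is a hypothetical name used to trigger a failed check
has_fake <- requireNamespace("notARealPackage123", quietly = TRUE)
has_fake
## [1] FALSE
```

The result is stored in an object first because requireNamespace() returns its value invisibly, so calling it on its own prints nothing.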
If function writing does not appeal to you, don’t worry! Not all R users need to write functions. Much of your work in R can be done by leveraging others’ functions. However, it is important that you get a sense for the basic mechanics behind this incredibly empowering process.
As mentioned earlier in the document, R is a bit quirky, particularly if you have any experience programming in any other language. Keep in mind the following for your sanity (and to discourage you from quitting R early on):
A. Syntax, syntax, syntax (!!!): R, like other programming languages, reads what you type on the screen literally. As such, it does not interpret text the way a human reader would. For example, we understand that “John”, “john”, “ John”, and “John ” are all the same word, differing only in capitalization and in leading or trailing spaces. R, however, treats these as four different values and itemizes each one as an individual observation. Broadly speaking, there are two main types of syntactic errors to keep an eye out for: capitalization and spacing.
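A quick sketch of this literal reading follows, assuming the four spellings differ by one lowercase letter, one leading space, and one trailing space. The clean-up step with tolower() and trimws() is an addition for illustration; both are base R functions.

```r
people <- c("John", "john", " John", "John ")

# Only an exact match counts as equal
people == "John"
## [1]  TRUE FALSE FALSE FALSE

# R sees four distinct values
length(unique(people))
## [1] 4

# Lowercasing and trimming spaces collapses them into one value
cleaned <- tolower(trimws(people))
length(unique(cleaned))
## [1] 1
```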
A.1. Capitalization: We are accustomed to capitalizing letters at the beginning of new sentences. However, keep in mind that programming is not like general writing; for the most part you are hoping that the machine understands your commands and not a layman reader. To test this point begin by copying and pasting the following function into your R console:
Sys.time()
As mentioned previously, functions are statements that perform a specific task in R. We call them by providing the object name and some arguments. Above we used the Sys.time() function, which is a base R function that returns the system’s idea of the current time and date (R Core Team 2018). Notice that the first letter in this function is capitalized. Now copy and paste the following command:
sys.time()
This time you should see a red error message reading Error in sys.time() : could not find function "sys.time". A way to read this is “R could not find the sys.time() function because names in R are case sensitive, so the properly capitalized Sys.time() and the incorrectly written sys.time() are two different names.”
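You can confirm that R treats the two spellings as different names without triggering the error. The sketch below uses the base function exists(), which is not covered in this tutorial, and assumes a fresh session in which nobody has defined an object called sys.time.

```r
# The capitalized name is bound to a base R function
exists("Sys.time")
## [1] TRUE

# The lowercase spelling is a different name with nothing bound to it
exists("sys.time")
## [1] FALSE
```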
A.2. Spacing: Much like casing, spacing matters. However, unlike casing, R is not as cut and dried on how you must space certain items. Google’s R Style Guide provides multiple examples of when spacing matters. Here we will only focus on one scenario: placing spaces around binary operators (=, +, -, <-, etc.). Try copying and pasting the following example into the R console:
x<-1
In the example above x<-1 is an assignment, where you have used <- to assign the number 1 to the name x. As such, x now equals 1. To test this, type x into your console; the output should be the number 1. Now, let’s add some strategically placed spaces to the previous code as such:
x < -1
What output did you get from the code above? The output you should see is FALSE, stating that the object x is in fact not smaller than the value -1. Why do you think the output is different? The short answer is that the additional spacing changed the input, so that in the second example R reads the command as the following test: “is x smaller than -1?”, to which the answer is “no” or FALSE.
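The two readings can be placed side by side in a short sketch. The final assignment, which stores negative one in x, is an added illustration of how spaces around <- remove the ambiguity.

```r
# No spaces: R reads <- as the assignment operator
x<-1
x
## [1] 1

# A space between < and -: R reads a comparison, "is x less than -1?"
x < -1
## [1] FALSE

# Spaces around <- make the intent clear, even with a negative value
x <- -1
x
## [1] -1
```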
B. Document your code: It is a good idea to comment on your code regularly. Much like with any type of project, workflow, and/or documentation, consistent note taking is key to keeping your sanity. This point is especially pertinent when you are sharing code or if you have to step away from your code for a few days or even longer. Imagine this document with only code chunks. How long do you think it would take you to understand the code and our logic behind it?
Although your notes do not need to be as extensive as the narrative in this tutorial, they should be enough for you to understand what you are looking at, why you are doing something in a specific manner, etc. Luckily, R allows you to add notes and comments in your scripts and documentation. In order to insert notes or comments into your code, you should use the # symbol, which tells R to ignore the content to the right of the symbol. For example, try copying and pasting the following commands into the R console:
Sys.time() # The Sys.time() function returns the system's idea of the current time and date
As opposed to:
# Sys.time() The Sys.time() function returns the system's idea of the current time and date
What is the difference between these commands? Using the # symbol to the right of a function allows the function to run while still letting you include notes or comments. Alternatively, placing a # before the function disables it.
C. Work smarter, not harder: No need to memorize everything, just remember how to find help. For example, should you forget how to work with a specific function or package, you may type ? or ?? followed by the name of the function or package into your console. To get more information on the base R package, for instance, type the following into your R console:
?base
If you wanted to get information on all the help pages related to base, type ??base into your console. This process will also work with functions. For example, if you forget how to use the Sys.time() function, you may check the documentation using ?Sys.time.
D. Leverage the open-source community: Most R users are proud to belong to the open-source community where R, its source code, and its packages are available to the general public for use or modification. This means that we all continue experimenting with code to achieve better and more efficient results. It also means that we all learn from each other’s work. As such, there are a variety of platforms for learning and sharing programming knowledge. The preferred, and sometimes feared, platform for asking questions on R code is Stack Overflow. However, be warned that editors and contributors frown upon bad, poorly researched, badly documented, or repeat questions. In other words, you are encouraged to do a little research before asking questions online. It is very likely that others have a blog post, tutorial, or Stack Overflow thread with the answer to your question. That being said, should you not find an answer around the internet, keep in mind these parameters as you use Stack Overflow.
This document was intended to serve as a top-level guide introducing you to programming with R. As such, many specifics have not been covered in detail. Here the focus was on introducing you to the foundational building blocks from which you will continue building your understanding of working with data in R. To recap, in this document you have been exposed to:
Using these building blocks, you should now be able to begin exploring the R ecosystem with some basic knowledge of the components upon which most tools operate. Keep in mind that learning this new programming language is an iterative process with a high early cost for learning. Yet, many tools and resources have been laid out for you to use. Keep the following set of resources in your back pocket:
Happy R learning! 😀
“Python Has Brought Computer Programming to a Vast New Audience.” 2018. The Economist. Jul. 19. https://www.economist.com/science-and-technology/2018/07/19/python-has-brought-computer-programming-to-a-vast-new-audience.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley. 2015a. Advanced R. 1st ed. O’Reilly Media, Inc.
———. 2015b. R Packages: Organize, Test, and Share Your Code. 1st ed. O’Reilly Media, Inc.
———. 2019. Advanced R. 2nd ed. Chapman & Hall/CRC The R Series.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.