Getting started with R and RStudio

Why R?

The R programming language is free software developed with an eye towards statistical computing and data visualization. It has taken off in popularity over the last decade, now ranks among the most used programming languages in general, and is often the go-to language for data science.

So what’s all the fuss about? Chief among the things that you will come to love about R is its community: you will be hard pressed to find a more active one surrounding a programming language.1 The size and activity of this community mean that R stays free, the software improves, and almost anything you are looking to do in your analysis has a user-developed ‘package’ 2 that helps you get it done, whatever the stage in the process. Packages are code contributed by R users that bundles a particular and cohesive set of ‘functions’, or sub-tasks; we will get more into this later in this post and in detail later in the series. 3

The number of R packages grows every day, and packages now cover many more tasks than just statistics. To get a sense of this activity you can take a look at an interactive display of current downloads from the R package repository (or store) called CRAN that lives on the R Project website.

This display, by the way, was created with an R package called Shiny. We’ll get to building interactive websites (like this one) and documents later on in the series.

R, then, is a widely used open source programming language with a thriving community base that is among the best languages to do the robust and sophisticated analyses that are required for data science. This is an exciting time to be an R programmer and a data scientist, so let’s get started!

The beginning, of course, is downloading and installing the software. Follow this link to the R Project website: Download R. This is the home of the R programming language. There are many resources you can find here, including a current list of available R packages, resources for getting help (we will see soon, however, that RStudio has these built right into its user interface), and various other resources. For now click on the ‘Download R’ link as seen in Figure 1.

The sites listed here are ‘mirrors’, or alternative servers all over the world where the software is available. Any of these links will do the job, so let’s just select the first mirror ‘https://cloud.r-project.org/’ for simplicity’s sake.

Next select your operating system. I am working on a Mac so I will select the ‘OS X’ option.

You will now need to scroll down the page a bit and find the link to the most recent version of R. The current version at publication is R-3.4.1.pkg. Download the .pkg file to your hard drive.

You can download the .pkg file to any folder on your hard drive. After running this installation file, accept the prompt to move this file to the trash. You will not need it again.

Once it is on your hard drive, open the file by double-clicking it and follow the default installation instructions.

To see what gets installed on your machine you can click the ‘Customize’ link and see the various pieces of software that are bundled with the .pkg installer. Go ahead and do that and take a look.

Now you can see the list of software. Let’s focus on two items in the list, seen in Figure 7. First, you will see the R Framework, which is R itself. Among the other items we see the ‘R GUI’ in the list. This is a basic application interface to R. We’ll take a quick look at that application in the next section to begin our understanding of how to access R through applications and more robust IDEs.

We won’t spend much time working with the R Application GUI in our series, and will quickly move to using RStudio to interface with R, but it’s worth knowing that it’s there, and it will help us make a point about the distinction between R, the framework, and interfaces to R like this GUI and RStudio. Go ahead and finish the installation and move to the next section.

Working with the R console in the default R Application GUI

Before we get to downloading and working with the RStudio IDE, let’s take a quick look at an alternative way of connecting to R: the R Application GUI. The R Application GUI is installed together with R itself, as mentioned in the previous section. Opening that software, we get our first glimpse of the R console and will run our first line of code.

The console is where we run R interactively. That is to say that the software interprets our code on a line by line basis as we go; a sort of call and response method. The prompt is the line where the computer is ready and waiting for the user to enter code to be sent off to the interpreter. Let’s see what this looks like in action with the simplest of all code: adding 1 + 1. So at the prompt type exactly that, then hit enter/return and see what happens.

Admittedly this is not the kind of stuff that will lead you to a romance with R! But it’s our first step and now we know what the console is, what a prompt is, and what an interactive session is.

Why RStudio?

So to be clear, R is not an application. It is a programming language that resides deep on your computer. The R Application GUI is merely an interface to that language. The small bit of code we ran through the console here went directly to the interpreter that talks to R itself. This is an important concept, as you will see when we turn to working with a more powerful interface to R, RStudio. RStudio provides a console interface to R as well as a host of other extremely helpful features in its GUI: we can write code offline in the form of scripts, manage our scripts and other data resources, view data and plots, get help, and much, much more that will make working with R more efficient.

To download RStudio, we navigate to the RStudio website’s homepage. Scroll down a bit and you will see the ‘Download’ button for RStudio.

Follow the ‘Download’ button. On this page you will be presented with various versions of the RStudio software. We want the ‘RStudio Desktop - Open source license’ version. Scroll down to the bottom of this table showing the various options you get with the other versions and click the big green ‘Download’ button.

Now select the installer that matches your operating system.

That’s it. You now have the most current version of R and RStudio on your machine. We’re ready to get familiar with the features of RStudio.

Getting to know RStudio

Open RStudio. You will see an interface similar to the one below in Figure 13. Of course the aesthetics will vary depending on the operating system you are on, and there may be other small differences in the tabs in each of the panes. As you work with RStudio you may add new functionality to facilitate certain tasks.

RStudio Panes

The layout of RStudio includes four main ‘panes’: Console, Editor, Files, and the Environment panes. You already are familiar with the idea of the console, seen here in Figure 14. Remember that this is the area that is a direct line between the IDE and the R interpreter. When RStudio is first launched, it will start an R session and this has the effect of reporting some details about the version of R that you are running and the operating system that it is running on.
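If you want to see the same kind of startup details on demand, you can ask R for them from the Console. The following is a minimal sketch using base R only; the variable name session_summary is just an illustrative choice, and the values you see will depend on your own installation.

```r
# Report the R version and the operating system for the current session
R.version.string          # a single string naming the installed R version
R.version$platform        # the platform this copy of R was built for

# Collect two of these details in a named character vector
session_summary <- c(version = R.version.string,
                     os = Sys.info()[["sysname"]])
print(session_summary)
```

The function sessionInfo() gives a fuller report, including the packages that are currently loaded.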

The next logical pane to introduce is the Editor pane, seen in Figure 15. Whereas commands written in the console are run on a line by line basis interactively, the Editor will be a useful space for us to view, edit, and write R scripts, as well as other files. Scripts are simply files that contain a series of R commands that we can then run together at once, instead of line by line as we do in the console. This will be the main way we leverage the power of R: by creating scripts that run through a data analysis workflow, we can save these scripts as files to run at a future point, continue to develop, revise, and even share them.

The Files pane has various tabs associated with it. When the ‘Files’ tab is selected we will see files and folders (aka directories) on our machine. As we begin working with R in the next section, we will see how useful this tab is for viewing, editing, and deleting our scripts, data files, and other associated project files without having to browse our system as we normally would. There is also a set of other default tabs in this pane: ‘Plots’, ‘Packages’, and ‘Help’. Plots will show, and let us export, any plots we create during our R session. The Packages tab allows us to manage packages, including installing and updating them. And the Help tab is where we access information about R in general, or about particular packages or functions. You will find that Help is really a godsend, as very few R programmers can write code for a project from A to Z without refreshing their memory or exploring documentation on the usage of packages and functions.

The last of the four panes is the Environment pane. It includes a set of tabs as well. The main two are ‘Environment’ and ‘History’. The Environment tab allows us to see variables and objects that we create during our session. We haven’t discussed either of these concepts, but rest assured we will get there soon and you will understand how helpful this tab is to efficient R programming. History is just that; a history of the commands that have been sent to the R interpreter. These commands can come from our interactive session in the console or through running a series of commands in a script.

So there you have it: a brief overview of RStudio panes and tabs. Together these tools will provide ample resources for just about everything you will ever need to do data analysis with R. In the next section we will begin to see how we can leverage these resources in a basic R session.

It’s worth noting that RStudio is highly customizable; panes can be customized in appearance, placement, and tabs available. But for now we will leave the default layout to maintain a consistent format throughout this series. If you do want to experiment with the look and feel of RStudio, you can peruse the options under ‘Tools > Global Options’ in the application menu at the top of your screen.

R sessions in RStudio

To get a basic feel for working with R and RStudio let’s run through a basic example that will highlight each of the main panes and tabs that we covered in the previous section.

To get started let’s run the same code we ran before in the R Application GUI console but now in the RStudio console. Type the following code in the Console and hit Enter on your keyboard.

1 + 1

On hitting Enter, the code is sent to the R interpreter which responds with the result; 2. For now ignore the [1] that prefixes the line where the result is displayed.

Now let’s run three separate lines of code in the Console in sequence. Take care to enter this and all subsequent code input correctly; computers do not tolerate typos. Remember you’re the brains here, the computer is the brawn!

Quick tip: You can add the assignment operator <- via a keyboard shortcut by hitting option + - (Mac) and alt + - (PC).

x <- 1:26
y <- letters
paste(x, "-", y)

You will notice that the first two commands did not return anything. This is because we sent the results from each of these commands to the variables x and y, respectively. A variable is a data container that is named. This container, or variable, holds the result in memory and gives us access to the result when we use the variable name later on. As you will soon see, you will inevitably create a number of variables in any given data analysis. There are two ways to see a listing of variables you have created. The first can be done in the Console by typing the function ls().

ls()
## [1] "x" "y"

The second is to browse to the Environment pane under the Environment tab. You will see that the Environment tab provides much more information than just the names of the variables in memory. It also indicates the type of each variable, its dimensions (in this case length, as we are working with vectors), its memory size, and a summary of the values contained within.
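Much of what the Environment tab reports can also be requested from the Console. Here is a small sketch, using only base R functions, that inspects the same two variables; it is meant as an illustration of the idea rather than a picture of how RStudio gathers this information internally.

```r
# Recreate the two variables from the example above
x <- 1:26
y <- letters

class(x)        # the type of x: "integer"
class(y)        # the type of y: "character"
length(x)       # the number of elements in x: 26
object.size(y)  # the memory used by y (the value varies by system)
str(x)          # a compact summary of the structure and first values
```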

In the third command we make use of the variables x and y: pasting (with the built-in, or ‘base R’, function paste()) the results from x (a series of numbers from 1 to 26) and y (the letters of the alphabet) together with a hyphen as a separator.
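To make the default behavior concrete, here is a short sketch that stores the pasted result in a variable (pasted is just an illustrative name) and looks at the first three elements. Note the spaces that paste() places around the hyphen by default.

```r
x <- 1:26
y <- letters

# paste() joins its arguments element by element, separated by a space
pasted <- paste(x, "-", y)

# Show only the first three of the 26 results
head(pasted, 3)
## [1] "1 - a" "2 - b" "3 - c"
```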

If that was the result we were looking for then great. Job done. Imagine, however, that we wanted the result to return contiguous ‘number-hyphen-letter’ strings, like 1-a. Is that possible? Yes, it is. But don’t take my word for it, let’s find out using RStudio’s Help resources. Again, there are two ways to do this. The first is by inputting the code ?paste in the Console.

By prefixing a function, package, or dataset name with the ? operator, the Help tab in the Files pane is opened to the corresponding R documentation page.

The second approach is to manually browse to the Help tab and search for the function paste. Either works, and you might find yourself alternating between the two.

In some cases you might not remember the name of the function, package, or dataset that you want to find documentation for. For a more general search you can use the ?? operator instead of the ?. Or as an alternative use the search function in the Help tab.
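When you only half remember a name, base R offers a couple of Console-based search tools alongside ??. The sketch below uses apropos(), which lists objects whose names contain a pattern, and help.search(), which is the function behind the ?? operator. The search terms are only examples, and the exact results will depend on which packages you have installed.

```r
# List every object on the search path whose name contains "paste"
matches <- apropos("paste")
print(matches)

# ??concatenate is shorthand for this full-text search of the help pages
hits <- help.search("concatenate")
# Typing hits at the Console displays the matching documentation entries
```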

Back to our code. The R documentation shows that paste() can take an optional sub-command, or argument, that allows us to choose how the pasted elements are separated. With this knowledge in hand we can update our code. We will use the argument sep = "" to remove white space between the concatenated elements like so:

paste(x, "-", y, sep = "")
##  [1] "1-a"  "2-b"  "3-c"  "4-d"  "5-e"  "6-f"  "7-g"  "8-h"  "9-i"  "10-j"
## [11] "11-k" "12-l" "13-m" "14-n" "15-o" "16-p" "17-q" "18-r" "19-s" "20-t"
## [21] "21-u" "22-v" "23-w" "24-x" "25-y" "26-z"
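The same documentation page also describes paste0(), a base R shortcut that behaves like paste() with sep = "". As a quick sketch, we can confirm that the two approaches give the same result (the variable names here are illustrative):

```r
x <- 1:26
y <- letters

with_sep <- paste(x, "-", y, sep = "")  # the approach used above
with_paste0 <- paste0(x, "-", y)        # the shortcut with no separator

# Check that both vectors are exactly the same
identical(with_sep, with_paste0)
## [1] TRUE
```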

With time you will become more accustomed to reading and making sense of the R documentation. For now, however, it is sufficient to have learned how and where to access this documentation.

The previous code we have run is short and simple. Working with the Console is great for this type of quick and dirty exploring, or doing some introspection of variables using ls() or the help operator ? to view the R help documentation. However, data analysis will involve many more lines of code and using the Console directly will become cumbersome. A more convenient way to write code is to use the Editor pane.

Let’s run through a more involved and practical example of creating a basic word frequency analysis. Before we get to the code below, let me introduce you to installing and managing R packages. As we have seen a couple of times now, there is a Console-based method and a GUI-based method. This time I will start with the GUI method as it is very convenient and tends to be the method most people use. The first step is to open the Packages tab in the Files pane, seen below.

You will see a list of the R packages that are already installed on your machine. The base installation of R comes with a default set so you will already see some packages listed. However, you may not see all of the packages that appear in Figure 20 if you have not already manually installed new packages (for example the acs package for accessing US Census data).

Let’s install a few packages we need for the upcoming code: dplyr, ggplot2, and tidytext. Installing packages with the Packages tab is easy. First, click on the ‘Install’ button within the tab.

Type the names dplyr, ggplot2, and tidytext into the ‘Packages’ field leaving the other default configuration fields. You will notice as you type the package names, RStudio will pattern match the name which can be helpful to make sure you type the names correctly. Once you have the names entered, click the ‘Install’ button. At this point RStudio will run the installation code for these packages. As they install there will be quite a bit of output, some of it in red font, in the Console. When the installation is complete the prompt > at the Console will appear; ready to take another command.

The Console approach leverages the function install.packages(). To find out more about how to use this function, I encourage you to look at the R documentation using ?install.packages. I will leave this as an exercise for you to complete. Now back to our example code that we will enter in the Editor pane.
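If you would like a starting point for that exercise, here is one possible sketch. It checks which of the three packages are missing from your library and installs only those. The variable names and the interactive() check are my own additions for illustration: the check simply keeps the code from installing anything when it is run outside an interactive session.

```r
# The packages needed for the upcoming example
needed <- c("dplyr", "ggplot2", "tidytext")

# Keep only the names that are not already installed
to_install <- setdiff(needed, rownames(installed.packages()))

# Install the missing packages (skipped when nothing is missing or when
# the code is not being run interactively)
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running install.packages("dplyr") directly at the Console works just as well for a single package.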

Copy and paste the following code into the Editor pane.

library(dplyr) # data manipulation
library(ggplot2) # package for generating a word frequency plot
library(tidytext) # package for doing a word frequency analysis

text <- c("The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced no evidence that any irregularities took place.",
"The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, deserves the praise and thanks of the City of Atlanta for the manner in which the election was conducted.",
"The September-*october term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible irregularities in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr..",
"Only a relative handful of such reports was received, the jury said, considering the widespread interest in the election, the number of voters and the size of this city.") # first 4 lines from the Brown Corpus

text_df <- tibble(line = 1:4, text = text) # create a tabular dataset with the columns line and text (tibble() replaces the deprecated data_frame())

text.word.freq <- unnest_tokens(tbl = text_df, input = text, output = words) %>% # tokenize text into words
  count(words, sort = TRUE) %>% # count words and sort them
  head(10) # return only the first ten lines in the dataset

# plot the words on the x-axis and the count n on the y-axis using a bar plot ordered by n
ggplot(data = text.word.freq, aes(x = reorder(words, desc(n)), y = n)) +
  geom_bar(stat = "identity") +
  labs(x = "Words", y = "Count", title = "Word frequency plot")

Your code should look similar to what you see in Figure 22. Note that I have resized the Editor pane to view most of the code.

We can run the code in the Editor pane either line by line or as a complete script. Running code line by line can be useful when you are composing a script and want to test the results incrementally. To do this you move your cursor to the line you would like to run and then simply hit the keystroke command + enter (Mac) or ctrl + enter (PC). Run line 1 and see what happens!

Line 1 loads the dplyr package from our package library with the command library(dplyr). If you resize the Console pane you will see that the command was run and resulted in some details about the package printed to the Console. We also notice that the cursor has moved to the next line in the Editor pane, ready for us to run this line.

In this script I’ve also added comments to each line of code describing what each command does. Any time R finds a # symbol it ignores everything to the right on the same line. We can then use plain language to annotate our code. Commenting is a key best practice for coding in any programming language. It makes your code more legible to other users, as well as to the future you, who might come back to this code sometime down the road only to realize you have forgotten what it does!

Returning to running our code from the Editor, we can also run multiple lines, or even the entire script from the Editor pane in a similar fashion by selecting multiple lines, or all the lines and using the previous keystroke (command + enter, or ctrl + enter). RStudio provides a button ‘Run’ that can be used to run a script line by line or the ‘Source’ button to run the entire script.

Go ahead and run the entire script now by clicking ‘Source’. The script will be automatically sent line by line sequentially to the Console. The script generates a word frequency plot for the first four lines from the well-known Brown Corpus. The Plots tab in the Files pane will automatically open, revealing our plot.
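Running a whole script can also be done from the Console with the base R function source(), which is the same idea as the ‘Source’ button. The sketch below is self-contained: it writes a tiny, made-up two-line script to a temporary file and then runs it, so the file name and its contents are purely illustrative.

```r
# Write a small example script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("a <- 2", "b <- a * 21"), script_path)

# Run every command in the script, in order
source(script_path)

# The variables created by the script are now available in our session
b
## [1] 42
```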

There are many other options for running code from the Editor pane using keyboard shortcuts. Explore the ‘Code > Run region’ dropdown from the RStudio menu bar for more information.

Saving our work

At this point we may want to save the code to our hard drive to make sure we don’t lose our work, or to move on and write some other code. To save this R script navigate to the RStudio tool menu ‘File > Save’ or hit the keyboard shortcut command + s (Mac) or ctrl + s (PC). The next step is to choose where to save this file. Select a directory for this file to live. For now you can choose any directory you feel fit. As we move forward in this series and in more involved projects that you will work on, however, it’s best to do some organizational planning upfront to set up your main project directory, sub-directories, and so on so that it is clear where to save each type of file. This topic will be covered in an upcoming Recipe post dealing with project management using the RStudio ‘Projects’ feature.

For testing purposes, let’s save the file within the current working directory inside a directory we will create named getting_started/. What is the working directory? Well, it is the place that RStudio regards as “Home base” on your hard drive. To find out what this is, enter the function getwd() at the console. R will return a path to the current working directory. On my machine the path is /Users/francojc/Documents/Recipes. This notation describes the hierarchical scheme of directories. My current working directory, then, is the Recipes/ directory, which is located inside the Documents/ directory, which itself is located within my home directory francojc/.
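To see how a path breaks down into its parts, we can pick one apart with a few base R functions. This sketch uses the example path from my machine as a plain string; your own path from getwd() will of course differ, and the variable name wd is just for illustration.

```r
# The example working directory path, stored as a string
wd <- "/Users/francojc/Documents/Recipes"

basename(wd)  # the last directory in the path
## [1] "Recipes"

dirname(wd)   # everything that comes before it
## [1] "/Users/francojc/Documents"

# Build the path to a sub-directory without typing the separators by hand
file.path(wd, "getting_started")
## [1] "/Users/francojc/Documents/Recipes/getting_started"
```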

Let’s go ahead and open ‘File > Save’ from the RStudio menu and name our file basic_frequency_analysis.R. The use of _ here is to avoid white space in file and directory names. This is good practice, as working with paths in programming can be complicated by white space in some environments. There are various styles that are employed in programming, generally, and in R programming specifically. Just like commenting code, as mentioned previously, adopting accepted style guidelines can increase your code’s legibility for you and those who may work with code you share.

Once we have saved our file in our working directory, it will appear in the Files tab in RStudio. From this tab you can move, rename, and delete files as well as create, rename, and delete directories. It’s also worth pointing out that the path to our working directory is visible in this Files tab between the working directory listing and the button toolbar. This is a nice feature to see where the files and directories live on your machine without resorting to the OS file explorer.

Let’s create our getting_started/ directory inside of our working directory to house this file. Click the ‘New Folder’ button inside the Files tab and name the new directory. It will now appear in our listing. To move our file inside of this directory we can use the OS to manually move the file, but RStudio again provides tools to do this. Select the checkbox beside the file basic_frequency_analysis.R, then select ‘Move…’ from the ‘More’ dropdown menu.

After you make the move, RStudio will quickly ask you if you want to close the basic_frequency_analysis.R script as it no longer exists. It does exist, just not in the same location as before. This underscores the importance of paths in programming and programming tools like RStudio. It is therefore of utmost importance to be aware of the paths to files and directories. Luckily, RStudio provides us ample ways to monitor the paths to working files and directories!
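For the curious, the same create-and-move steps can be carried out with base R functions. The sketch below works inside a temporary directory that stands in for the working directory, so nothing on your machine is touched; the variable names are illustrative only.

```r
# A temporary directory standing in for our working directory
project_dir <- tempfile("project_")
dir.create(project_dir)

# An empty stand-in for the script we saved
script <- file.path(project_dir, "basic_frequency_analysis.R")
file.create(script)

# Create getting_started/ and move the script into it
new_dir <- file.path(project_dir, "getting_started")
dir.create(new_dir)
moved <- file.rename(script, file.path(new_dir, "basic_frequency_analysis.R"))

file.exists(script)   # FALSE: nothing lives at the old path any more
list.files(new_dir)   # the script is now listed in its new home
```

This is why RStudio reports that the script no longer exists: the old path now points at nothing, even though the file itself is safe in its new location.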

Closing an R session

At this point let’s say it’s time to close our session and quit RStudio. You can do this by simply navigating to the RStudio tool menu ‘File > Quit session…’. You will be presented with a dialogue box to save the workspace or not.

We can choose not to save the workspace and our files will be preserved (as long as they have been previously saved to the hard disk). What we lose, however, are the variables and history that RStudio has in memory at the current moment in our session. If you want to begin your next session with these variables and history loaded from the get-go, then you will want to save the workspace. Let’s choose to save the workspace to see this in action. After quitting RStudio, restart the application. Because we chose to save the workspace, we now have our variables and history in this new R session. You can view them in the Environment pane. If you navigate to the working directory in the Files tab you will also see that a couple of new files have been added to our files and directory list: .RData and .Rhistory.

When you start another session, R will look for these files and load them into the workspace for you to pick up and continue using. If you choose to quit an R session without saving the workspace, these files will not be created. If they already exist from a previous session, they will not be overwritten, and the earlier variables and history will be loaded.
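The saving and reloading of variables can be tried out by hand with the base R functions save() and load(). The following sketch writes two variables to a temporary .RData file rather than to the working directory, so the file location and variable names are only for illustration. When you quit and choose to save, R does something similar for the whole workspace (see ?save.image).

```r
x <- 1:26
y <- letters

# Save the two variables to a temporary .RData file
workspace_file <- tempfile(fileext = ".RData")
save(x, y, file = workspace_file)

# Remove them from memory, as if the session had ended
rm(x, y)

# Load them back; load() returns the names of the objects it restored
restored <- load(workspace_file)
restored
## [1] "x" "y"
```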

If you browse the working directory of a session with your operating system’s file explorer you may not see these files listed. Do not be alarmed. By default, the file browsers on macOS and Linux hide files whose names start with . as a convenience to the user. There are many of these types of files hanging around your OS. They often contain application-specific configuration information which usually does not require direct editing.
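R itself follows the same convention when listing files, and we can ask it to show the hidden ones. This sketch creates three empty files in a temporary directory (the directory and the file script.R are made up for the demonstration) and lists them with and without the all.files argument of list.files().

```r
# A temporary directory with two hidden files and one regular file
demo_dir <- tempfile("hidden_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c(".RData", ".Rhistory", "script.R")))

# By default, names starting with . are left out
list.files(demo_dir)
## [1] "script.R"

# all.files = TRUE includes them; no.. = TRUE drops the . and .. entries
list.files(demo_dir, all.files = TRUE, no.. = TRUE)
## [1] ".RData"    ".Rhistory" "script.R"
```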

Round up

We have covered a lot of ground in this post. You should now have a working understanding of how R and RStudio are related and the various panes that are available to develop code and manage your workspace. We have also scratched the surface on some topics that will be covered in more depth in future posts including variables, paths, and workspace and project management, to name a few. The next post in the Recipe series ‘Project management for scalable data analysis’ will deal with many of these topics as we set the foundation for working with more complex and realistic analyses.

1. Follow this link to see estimated figures of the number of R users around the world

2. Think of Steve Jobs’ famous sound bite: ‘There’s an app for that’, just replace ‘app’ with ‘package’ in the case of R.

3. Don’t fret over some of the terminology here, we will come back to these terms later on with detailed description and examples.