By: Charles Leech
Overview and Set Up
R Markdown is used for a variety of purposes across many disciplines. Political scientists use it to interpret data, as do professionals in marketing, advertising, and the social sciences. On a more basic level, R Markdown can create files and edit text much as Google Docs or Word can. A working knowledge of R Markdown can help you stand out in job applications, and may be the deciding factor for employers when the choice is close. By the end of this article, you will understand a set of commands that will help you analyze data and present your findings.
Installation and Opening Files
Before delving into the specifics of R Markdown, you must first have R and RStudio downloaded onto your computer. For R, go to https://www.r-project.org. For RStudio, use https://rstudio.com/products/rstudio/, but make sure to download the free Desktop version of the app. There is no need to spend exorbitant amounts of money for what we will be reviewing. Keep in mind that you will need a laptop or desktop computer, such as a PC or Mac; a Chromebook generally cannot run the desktop versions of R and RStudio.
Once R and RStudio are downloaded onto your computer, open RStudio. A little more setup is needed before we begin. You will want to open a new R Markdown file. You may have noticed that I have been using RStudio and R Markdown almost interchangeably. They are closely related but not the same: RStudio is the program we run, and R Markdown is the type of file that we will be creating. To begin, look at the menu bar in the top left corner of the screen. There, you will find an icon depicting a white sheet of paper with a green plus sign. Click it, then select “R Markdown…” from the menu.
Once you have completed that task, a menu will pop up asking for a.) the title of the document, b.) your name, and c.) the type of document you would like to create. You can explore these options on your own. For our purposes, name the document anything you like, and select Word for the Default Output Format.
Now we are almost ready to begin. First we have to install a few packages. This is done easily enough by clicking on the Tools tab in the RStudio menu bar, toward the top center of the screen. Once you click the Tools tab, a dropdown menu should open, where you will select “Install Packages…”. A dialog box will then pop up.
You will then type “tidyverse” into the bar and click Install. Now we are finally able to begin in earnest.
Data and Graphing
First, we have to select the data we wish to analyze. This could be in the form of an Excel document, a Google Sheets file, or a CSV file. As I am a political science major, we will be using a spreadsheet of my own that deals with the number of women in civil conflict and their involvement in the political process afterwards. You may also use this sheet if you would like, but if you have another set of data you would like to work with, that will work as well. To select the data we are using, we have to do two things: 1.) find where the data is saved on our computer, and 2.) set that location as the working directory. I will show you an easy way to do this. First, find the folder where your data is saved; for me this is the “Quan” folder on my computer. Then go to the “Files” pane and click on that folder.
Please note that your data will likely be saved in a different place than mine.
Then, click on the blue gear that says “More” next to it. A dropdown menu will appear, and you will select “Set as Working Directory”. Notice that in the Console in the bottom left corner, a line of code such as setwd("~/WITS") appears. Copy and paste that line of code into your markdown file, and then run it. To run code, simply place your cursor on the line of code and press Ctrl + Enter.
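If you would rather type the working-directory step than click through the menus, the following is a minimal sketch. The folder name is taken from the example above and is only a placeholder; the existence check is an addition of mine so the code does not stop with an error when the folder is missing.

```r
# Placeholder folder; replace it with the folder that holds your own data
data_folder <- "~/WITS"

if (dir.exists(data_folder)) {
  # Same effect as choosing "Set as Working Directory" in the Files pane
  setwd(data_folder)
} else {
  message("Folder not found: ", data_folder)
}

# Confirm where R is currently looking for files
current_dir <- getwd()
print(current_dir)

# List the files R can see there; your dataset should appear in this list
print(list.files())
```

Running getwd() at any time is a quick way to check which folder R is reading from.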
Library and Data Loading
Next, you will want to load in your libraries and your data. To do this, type the code
library(tidyverse)
and then run the chunk. Next, determine what kind of file you will be working with. I shall demonstrate with a CSV file. If your data is in a different file type, you can convert it within Excel by going to File → Export → Change File Type and then clicking CSV file. But I digress. Next, enter and run the following code.
data <- read_csv("The name of your dataset.csv")
Make sure that the quotation marks are present within the parentheses, and that the file name is typed out exactly, matching case and including the “.csv”. Please note that you should not actually type “The name of your dataset” but the actual name of your data file. If the command was executed correctly, there should be a new item in the Global Environment, located in the top right corner of the screen. A reminder: whenever I say to run your code, click on the line of code you want to run so that your text cursor is on that line, and then press the Ctrl and Enter keys at the same time. Additionally, you do not have to name the object you create “data” as I did, and the name of your dataset will differ from mine if you are using another data frame. Mine in this particular case is called IFPJ data.
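To see the loading step work from start to finish without needing my spreadsheet, here is a small sketch. It writes a tiny made-up CSV to a temporary folder and reads it back with read_csv(). The file name and all of the values are invented for illustration; only the column names per_womenLeg and Noncombat_D are borrowed from my data. The readr package is the part of the tidyverse that provides read_csv().

```r
library(readr)  # part of the tidyverse; provides read_csv()

# Write a tiny made-up CSV to a temporary file so the example runs anywhere.
# These rows are invented for illustration; they are not the article's data.
csv_path <- file.path(tempdir(), "example_data.csv")
writeLines(c(
  "group,per_womenLeg,Noncombat_D",
  "A,12.5,1",
  "B,30.0,0",
  "C,8.2,1"
), csv_path)

# A misspelled file name is the usual reason read_csv() fails, so check first
stopifnot(file.exists(csv_path))

data <- read_csv(csv_path)

# Quick checks that the data loaded the way you expected
print(dim(data))    # number of rows and columns
print(names(data))  # column names
print(head(data))   # first few rows
```

If the object shows up in the Global Environment and the column names look right, the data loaded correctly.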
Start to Code
Finally, your setup is done, and you can begin to code. It is always helpful to know the measures of central tendency and the distribution of your variables. So, let’s play around with the data. If you are using different data than me, feel free to use your own variables and adapt my instructions to your needs. For those of you who are using the same data as I am, we will be working with the percent of women in a given country’s legislature post-conflict; this variable is denoted as per_womenLeg in our data. First, you will want to open a new code chunk. To do so, type three backticks (which I will call pips) and then a pair of curly braces with the letter ‘r’ inside, then press the ‘Enter’ key a few times and end the code chunk with another three backticks, as follows:
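An empty R code chunk in an R Markdown file has the shape sketched here. The comment line is only a placeholder for your own code.

````markdown
```{r}
# your code goes here
```
````

Everything between the opening and closing lines is treated as R code when the chunk is run.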
Note that the pips (backticks) are located on the top left of your keyboard, above the Tab key. Now it's time to learn your first command for analyzing data. When interpreting data, information such as the median, mean, range, and standard deviation is very helpful. These give you a place to start and help determine what kind of analysis to proceed with. There are commands you could enter one by one that would eventually give you the information you need, but I am going to give you a shortcut. Said shortcut comes in the form of the quantile() command, which generates values for different quartiles of the data. I will spare you the explanation of what quartiles are, but if you want to know, I would recommend taking a stats class. To use the quantile() command, type a line such as
quantile(data$per_womenLeg, na.rm = TRUE)
in RStudio. Once you run the code, you will get five numbers with the percentages 0, 25, 50, 75, and 100 above them. Without getting bogged down in how we find these numbers or what some of their other applications are, I will tell you which numbers concern us for the moment. The number corresponding to 0% is the minimum, and the number under 100% is the maximum. The number with 50% above it is the median, the middle value of the data. Note that the median is not the same as the mean (the average); the mean comes from a separate command, mean(). You can find the range of this variable by simply subtracting the minimum from the maximum.
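As a worked illustration of reading the quantile() output, here is a short sketch on a handful of made-up percentages. The numbers are invented and are not from my dataset. It also computes the mean separately, to show that the mean and the 50% value (the median) can differ.

```r
# Made-up percentages for illustration only (not the article's data)
per_womenLeg <- c(5, 10, 12, 20, 33, NA)

# na.rm = TRUE tells R to ignore missing values (NA)
q <- quantile(per_womenLeg, na.rm = TRUE)
print(q)

minimum     <- q[["0%"]]    # smallest value
maximum     <- q[["100%"]]  # largest value
med         <- q[["50%"]]   # median, the middle value
value_range <- maximum - minimum

# The mean (average) is a separate calculation
avg <- mean(per_womenLeg, na.rm = TRUE)

print(c(min = minimum, max = maximum, median = med,
        range = value_range, mean = avg))
```

In this made-up example the median is 12 while the mean is 16, which is why the two should not be used interchangeably.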
Now that we have some information about one of the variables, you try. Try to find the min, max, median, and range for the variables for women in noncombat roles, combat roles, and leadership roles. Each of these is a dummy variable: if a group has a zero in a category, there aren’t women present, and if it has a one, there are women present. If you can’t remember what to do, just enter code along these lines:
quantile(data$Noncombat_D, na.rm = TRUE)
quantile(data$Combat_D, na.rm = TRUE)
quantile(data$Leader_D, na.rm = TRUE)
You may notice that when you ran the code for these variables, many of the numbers under each of the respective percentages were the same. This is because these are dummy variables, meaning that the only values that will ever be present are one and zero. When you look at the distribution, women in leadership are less common than the other two groups, and women in combat roles are less common than women in noncombat roles. Let’s run some code that will help us visualize these findings.
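For dummy variables, two other base R commands can make the distribution easier to read than quantile() alone. This is a small sketch on made-up zeros and ones, not my actual data: table() counts how many groups fall in each category, and the mean of a 0/1 variable gives the share of groups coded as one.

```r
# Made-up dummy variable for illustration: 1 = women present, 0 = not present
Leader_D <- c(0, 0, 0, 0, 1, 0, 1, 0)

# quantile() still runs, but most of the values repeat
print(quantile(Leader_D))

# Count how many groups fall in each category
counts <- table(Leader_D)
print(counts)

# The mean of a 0/1 variable is the share of groups coded 1
share_present <- mean(Leader_D)
print(share_present)
```

Comparing these shares across the three dummy variables is one way to check which roles are more or less common.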
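Let’s graph each of these variables on its own, starting with the percent of women in the legislature. Because this variable is continuous, a histogram is used here (a density curve would also work). You will need to enter the command:
ggplot(data, aes(per_womenLeg)) +
  geom_histogram()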
Once entered, a graph should appear. Please note that the plus sign must come at the end of the first line, as shown; if the plus sign is moved to the start of the second line, the code will not work. That being said, depending on whether you are using your own data, your code may look slightly different, but it must use the same format. Let’s break it down a little. In this code, the first thing you put in the parentheses of ggplot() is the name of your data frame. Then you will have a comma and the code aes(), and within those parentheses you will put the variable you are trying to measure. Then you will exit the parentheses, add a +, and press Enter to bring the code down a line. Then you will determine what kind of graph you need. You will always enter geom_, but what you put next determines what kind of graph you will get. geom_bar() will make a bar graph, geom_density() will make a density curve, and geom_histogram() will create, surprise surprise, a histogram. There are many kinds of graphs you can make, but for our first round of graphing, these are the ones we will be looking at. Now, using the information you have been given, try to graph the other variables on your own. Please note that these other variables are a different kind and will need a different graph. If you need help, the codes are below:
Presence in Noncombat Roles
ggplot(data, aes(Noncombat_D)) +
  geom_bar()
Presence in Combat Roles
ggplot(data, aes(Combat_D)) +
  geom_bar()
Presence in Leadership Roles
ggplot(data, aes(Leader_D)) +
  geom_bar()
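If you do not have my spreadsheet, the sketch here shows the same two kinds of graph on a small made-up data frame. The values are invented for illustration, and bins = 10 is an arbitrary choice of mine that simply avoids the default bin-width message.

```r
library(ggplot2)

# Made-up data for illustration only; not the article's dataset
data <- data.frame(
  per_womenLeg = c(5, 8, 12, 15, 15, 20, 22, 30, 35, 41),
  Noncombat_D  = c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1)
)

# Continuous variable: a histogram shows how the values are spread out
p_hist <- ggplot(data, aes(per_womenLeg)) +
  geom_histogram(bins = 10)

# Dummy (0/1) variable: a bar graph counts the groups in each category
p_bar <- ggplot(data, aes(Noncombat_D)) +
  geom_bar()

print(p_hist)
print(p_bar)
```

Swapping geom_histogram() for geom_density() in the first plot would draw a density curve from the same variable.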
If properly executed, the graphs should look as follows.
[Figures: Percent of Women in Legislature; Presence in Noncombat Roles; Presence in Combat Roles; Presence in Leadership]
As you can see, groups that have women in leadership and combat roles are rare, while groups that have women in noncombat roles are much more common. We, as researchers, know what these graphs mean, but if we are going to communicate our findings effectively, we need to make our graphs look a little better. The following picture shows what you need to add to the code graphing the percent of women in a legislature.
The command labs() adds a title and labels for the x-axis and the y-axis. Because the other three variables we are looking at are a different kind of variable, I will demonstrate again what to do with these kinds of variables before sending you off on your own.
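In case the picture is not visible, here is a sketch of how labs() attaches to a histogram. The data values are made up, and the title and axis wording are my own placeholder text rather than a copy of the picture, so adjust them to suit your graph.

```r
library(ggplot2)

# Made-up values for illustration only
data <- data.frame(per_womenLeg = c(5, 8, 12, 15, 15, 20, 22, 30, 35, 41))

# The label text below is hypothetical wording, not taken from the article's picture
p_labeled <- ggplot(data, aes(per_womenLeg)) +
  geom_histogram(bins = 10) +
  labs(title = "Distribution of Percent of Women in Legislature",
       x = "Percent of Women in Legislature",
       y = "Number of Countries")

print(p_labeled)
```

Each argument inside labs() is optional, so you can set only the title, or only one axis, if that is all you need.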
As you can see from the picture, there are two primary changes that we have to make in order to add labels to the graph. First, go into the aes(Noncombat_D) command and wrap the name of the variable in the command as.factor(). In the end, the new line of code should read aes(as.factor(Noncombat_D)). This tells R to treat the zeros and ones as categories, which enables the individual bars to be named. Next, you will want to actually add labels to each section of the graph. This is done by adding a plus sign after the geom_bar() command and then adding the code as seen in the picture. If you are unable to see the picture or would otherwise like it written out, here it is.
ggplot(data, aes(as.factor(Noncombat_D))) +
  geom_bar() +
  labs(title = "Distribution of Groups with Women Present in Noncombat Roles",
       x = "Are Women Present in Noncombat Roles?",
       y = "Number of Groups") +
  scale_x_discrete(labels = c("No", "Yes"))
Now that you have a feel for how the code should look, try it on your own. The correct code is shown in the picture below.
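For readers without the picture, here is a sketch of the same pattern applied to the combat variable, using made-up zeros and ones so it runs on its own. The label wording is adapted from the noncombat example and is an assumption about what the picture shows; the leadership graph would follow the same pattern with Leader_D.

```r
library(ggplot2)

# Made-up dummy values for illustration only; not the article's data
data <- data.frame(Combat_D = c(0, 0, 1, 0, 0, 1, 0, 0))

# as.factor() treats 0 and 1 as categories so each bar can be named
p_combat <- ggplot(data, aes(as.factor(Combat_D))) +
  geom_bar() +
  labs(title = "Distribution of Groups with Women Present in Combat Roles",
       x = "Are Women Present in Combat Roles?",
       y = "Number of Groups") +
  scale_x_discrete(labels = c("No", "Yes"))  # 0 becomes "No", 1 becomes "Yes"

print(p_combat)
```

The order of the labels matters: the first label goes to the lowest category (zero) and the second to the next (one).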
There is one last step until you have some of the most important skills that you will need to navigate R Markdown. For this last step, you will learn to graph the relationship between two different variables. However, we have to do some coding before we can start graphing. This step is only needed if you are graphing a categorical measure against a continuous measure. In our case, our dependent variable, the percent of women in legislature, is continuous, and our independent variables are categorical. If you are confused about what these different kinds of variables are, feel free to reach out to me, and I will explain them to the best of my ability. But I digress. To begin, enter and run the following command.
graphNoncm <- data %>%
  group_by(Noncombat_D) %>%
  summarise(mean_womenLeg = mean(per_womenLeg, na.rm = TRUE))
This creates a small new dataset with one row for each value of Noncombat_D and the average percent of women in legislature for that group, which is what you wish to graph. This process will need to be repeated for each independent variable. This can be done by keeping the exact same code and switching group_by(Noncombat_D) to group_by(Combat_D), and so on. Additionally, the first part of the code will need to change as well: instead of graphNoncm it could be graphCom or graphLeader. As you are creating a new dataset, it can be named whatever makes sense to you.
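To make it concrete what the summary dataset contains, here is a sketch that runs the same grouping step on a few made-up rows. The values are invented for illustration. The dplyr package is the part of the tidyverse that provides group_by(), summarise(), and the %>% pipe.

```r
library(dplyr)  # part of the tidyverse

# Made-up rows for illustration only; not the article's data
data <- data.frame(
  Noncombat_D  = c(0, 0, 1, 1, 1),
  per_womenLeg = c(10, 20, 30, NA, 40)
)

# One row per value of Noncombat_D, with the mean of per_womenLeg in each group.
# na.rm = TRUE drops the missing value before averaging.
graphNoncm <- data %>%
  group_by(Noncombat_D) %>%
  summarise(mean_womenLeg = mean(per_womenLeg, na.rm = TRUE))

print(graphNoncm)
```

In this made-up example the result has two rows: a mean of 15 for groups coded 0 and a mean of 35 for groups coded 1.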
Moving on, once the new dataset is created, the graphing process remains fairly similar to the process executed when we graphed a single variable. In the picture below, you will see the bare-bones code for graphing groups that have women in Noncombat roles against the percent of women in legislature, after the conflict is over. There are a few particular aspects of these new graphs that I want to bring your attention to, so I circled them in red.
The large chunk of code is the necessary step I mentioned earlier. It is crucial to creating an accurate graph, and it affects the code that follows it. The second circle shows our first deviation from the earlier graphs. The first bit of code after ggplot( is where the data for the graph will come from. When graphing our single variables, we got our information from the data dataset. However, as we needed to create another dataset to get an accurate reading, we need to change the code to the name of the new dataset. The third circle is also part of the changes created by making the new dataset. We want to graph the percent of women in legislature on the y-axis. Before this, that variable was called per_womenLeg; however, in the new dataset the name of the variable has changed to mean_womenLeg, so we need to make that change in our code. Our last circle deals with cosmetic changes. If you have managed to make it this far without the assistance of my pictures, first, I applaud you, and second, you may have noticed that certain titles for these graphs are too long and aren’t completely visible. The \n that I have included here moves the text onto another line, allowing the reader to read the complete title of your graph. I would also like to note that the slash in that bit of code is a backslash, not the forward slash on the same key as the question mark. It is located above the Enter key. I only say this because I have made this error numerous times and only recently have I seen the error of my ways.
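Since the picture may not be visible, here is a sketch of what such a two-variable graph could look like. It is built on a made-up summary table so it runs by itself. The choice of geom_col() and all of the label text are my assumptions for this illustration, not a copy of the pictured code.

```r
library(ggplot2)

# Made-up summary table for illustration; in practice this comes from
# the group_by() and summarise() step
graphNoncm <- data.frame(
  Noncombat_D   = c(0, 1),
  mean_womenLeg = c(15, 35)
)

# Data now comes from graphNoncm, and the y variable is mean_womenLeg.
# geom_col() (bar heights taken from the data) is an assumption here.
p_two <- ggplot(graphNoncm, aes(as.factor(Noncombat_D), mean_womenLeg)) +
  geom_col() +
  labs(title = "Average Percent of Women in Legislature\nby Presence of Women in Noncombat Roles",
       x = "Are Women Present in Noncombat Roles?",
       y = "Average Percent of Women in Legislature") +
  scale_x_discrete(labels = c("No", "Yes"))

print(p_two)
```

The \n inside the title is what splits it across two lines, so a long title stays fully visible.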
Taking the knowledge I have given you, I would like you to try and recreate this process for the other graphs. Again, if you get stuck, I will include pictures of the correct code below.
By looking at the graphs created by the code you just ran, you can see a possible connection between groups that have women in different roles in their revolution and whether they bring those women into the power structure after combat is over. I will not spend much time demonstrating the relationship or arguing for a causal mechanism, for this is the wrong space to do so. However, in reading this article, you have equipped yourself with some knowledge that can set you apart from your peers. I have heard anecdotes from students who have a working knowledge of R, who say that they were hired based on this skill alone. That being said, I am not claiming to be the end-all-be-all of R Markdown knowledge. This article simply covers some of the basics. If this article piqued your interest, there are classes here at NWU on the subject, or you can reach out to me with any other questions you may have. Additionally, there is a wealth of knowledge on the internet. I hope this has helped you feel more confident about R Markdown and some basic coding. Happy graphing!