Introduction
R has become a leading choice for data science professionals and statisticians, and its popularity for data analysis has grown substantially over the years. R is a GNU project that was initially developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. The source code of the R software environment is written primarily in C and Fortran. The founders named the programming language R after the first letter of their names. R is closely related to the S language developed at Bell Labs and is considered a different implementation of S. The two languages differ in some important ways, but most of the code written for S runs in R as well.
Some of the features of R are given below.
- Just like any other programming language, R has well-defined programming constructs, including variables, conditional statements, loops, functions, data types and so on.
- R provides data structures such as vectors, matrices, arrays and data frames, which users can use for performing statistical analysis and creating graphs.
- R supports object-oriented programming.
- R has mature, effective data handling and storage facilities. We can import data from CSV, MS Excel and other data sources; the data is held in memory and can be analyzed directly. We do not require an external database.
- There are a lot of tools available for performing data analysis within the R environment.
- R can be used to generate statistical graphs, which help in deriving business intelligence. It has advanced graphics and plotting abilities. R Plot is an interface available in R Tools for Visual Studio, which provides an advanced graphic display.
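As a rough illustration of the data structures mentioned in the list above, the sketch below builds a vector, a matrix and a data frame. The variable names and values are made up for this example.

```r
# A numeric vector of marks (values are made up for illustration).
marks <- c(78, 92, 65)

# A 2 x 3 matrix, filled column by column.
m <- matrix(1:6, nrow = 2)

# A data frame holds columns of different types side by side.
students <- data.frame(
  StudentName = c("Asha", "Ben", "Dev"),
  Java = marks,
  stringsAsFactors = FALSE
)

print(mean(marks))  # average of the vector
print(dim(m))       # number of rows and columns of the matrix
str(students)       # compact summary of the data frame
```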
Defining R and its features might look pretty vague. Let’s start and get our hands dirty.
This article is divided into two main sections, which are given below.
- Setting up R Environment and R Tools in Visual Studio IDE
- Understanding the power of R: analyzing data and deriving conclusions from it, using R.
Once this is completed, you will know how to set up an R environment locally, which will help you get started with R programming. Let’s head to the first section of the article.
Setting up R environment and R Tools in Visual Studio
R Tools for Visual Studio was released in March as a public preview. It lets you work with R programming in Visual Studio. However, there is a prerequisite for setting up R Tools in Visual Studio: an R language engine must be installed on the local machine, or else we will get the error shown below.
In order to set up the environment, we will first
- Install Microsoft R Open, which is an R language engine.
- Install R Tools for Visual Studio, which will help us to work with the data, using R programming.
Hence, let’s install Microsoft R Open, which will install the R language engine on the local machine. You can download Microsoft R Open from here.
Depending on the platform, you can choose the appropriate R Open executable to download. I have downloaded the Windows 7 platform executable.
Once we have downloaded the file, run the executable.
Click Continue to proceed with the installation.
Click Continue and select the Agreement.
This will start installing Microsoft R Open.
Click Finish. This completes the R language engine installation on the local machine.
Once completed, it provides us with an R console, where we can run R code.
Double-click the icon to open the R console.
We can test its functionality with ordinary arithmetic operations. However, the console is less interactive than a full IDE.
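A quick check of this kind could look like the sketch below. The variable names are only illustrative; any arithmetic expression typed at the console prompt will do.

```r
# Basic arithmetic to confirm that the R engine responds.
total     <- 2 + 3 * 4   # multiplication binds tighter than addition
ratio     <- 10 / 4      # division returns a double
remainder <- 17 %% 5     # modulo
quotient  <- 17 %/% 5    # integer division

print(c(total, ratio, remainder, quotient))
```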
Setup R Tools for Visual Studio
Hence, let’s spin up the Visual Studio IDE and install R Tools for Visual Studio (RTVS), which is a highly flexible and mature environment for R programming. You can get the executable here.
Click the executable and run it.
Close any active Visual Studio session for a smooth installation. Click Install.
This will start the installation of R Tools in Visual Studio.
Finally, the setup is complete. Let’s head to Visual Studio to check out the new addition.
Next to the Test menu, we have the new R Tools menu. Click Data Science Settings, so that the session opens in the Data Scientist profile.
Click Yes. This will reset Visual Studio layout to the snapshot, as shown below.
We have the R Interactive Window on the left side, where we will do the programming. The Variable Explorer at the top right is where we can analyze the loaded variables and import data from an external data source. R Plot at the bottom right corner is used to display the graphical output. We can switch back to the default Visual Studio layout once we are done with R programming.
Understand the Power of R - Analyze and derive conclusion from data using R
Let’s use R programming to dig into bulk data and derive the results for our specific queries. Here, I am using dummy student details in CSV format as the input, and I will try to derive answers to data-related questions. We will enter the commands in the R Interactive Window on the left side of the window.
First, we have to set the working directory, which can be done using the setwd function.
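A minimal sketch of this step is shown below. The folder is an assumption: tempdir() is used only because it always exists, and you would replace it with the folder that holds your own CSV file.

```r
# Remember where we started so that the change can be undone.
old_dir <- getwd()

# Hypothetical data folder; substitute the path that holds your CSV file.
data_dir <- tempdir()
setwd(data_dir)

# Confirm the switch and list any CSV files available in the folder.
print(getwd())
print(list.files(pattern = "\\.csv$"))

# Restore the original working directory.
setwd(old_dir)
```

getwd returns the current working directory, so it is a handy way to verify that setwd pointed where we expected.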
Now, let’s load the data into the R Tools environment in Visual Studio. We can use the read.csv function to import the data from a CSV file. I have placed the file R.csv, which contains the student details, in the working directory. The file looks as shown below:
Now, let’s load the CSV file into Visual Studio. We can read from other data sources, such as MS Excel, as well.
- StudentDetails<- read.csv("R.csv")
- print(StudentDetails)
Once the command is executed, it prints out the tabular data, as shown below.
The first column is the sequential row number. The rest of the columns are loaded as is from the CSV. We have a global variables window on the right side. Once the load is completed, it is populated with the data, which we can browse and explore.
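Since R.csv itself is not reproduced here, the sketch below writes a small stand-in file and loads it the same way. The five students and their marks are made up; only the column names are taken from the queries used later in this article.

```r
# Hypothetical stand-in for R.csv: five made-up students, with the same
# column names as the ones queried in this article.
sample_rows <- data.frame(
  StudentName = c("Asha", "Ben", "Chloe", "Dev", "Ella"),
  Sex    = c("Female", "Male", "Female", "Male", "Female"),
  City   = c("Darlington", "Leeds", "Darlington", "York", "Leeds"),
  Java   = c(78, 92, 92, 65, 85),
  C      = c(81, 70, 88, 59, 90),
  Ruby   = c(69, 75, 95, 72, 80),
  Python = c(88, 64, 91, 70, 77),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV file, then load it the same way as R.csv.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)
StudentDetails <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure and the first rows before querying.
str(StudentDetails)
print(head(StudentDetails, n = 3))
```

str and head give a quick overview of the loaded data without printing every row, which is useful once the file grows beyond a few students.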
Just below the Variable Explorer is R Plot, which is used for the graphical display of charts.
We have the details of 100 students. Now, let’s quickly analyze the student details data and derive the answers to some questions, using R programming.
What is the maximum mark in Java?
Java is one of the subjects. Let’s try to derive the maximum score among 100 students.
- MaxJava<- max(StudentDetails$Java)
- print(paste("The highest score in Java is",MaxJava,sep=":"))
max is the function used to get the maximum value from a collection, such as a data frame column.
StudentDetails$Java means we are querying the Java column present within the StudentDetails data frame. MaxJava is the variable that will hold the value. In order to concatenate two strings, we can use the paste function, which has the syntax shown below:
- paste("First String", "Second String", sep = "JoiningCharacter")
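To see how the sep argument changes the result, here is a small sketch. The value 92 is made up and stands in for the result of the max query.

```r
MaxJava <- 92  # made-up value for illustration

# sep is placed between every pair of arguments.
with_colon <- paste("The highest score in Java is", MaxJava, sep = ":")

# The default separator is a single space.
with_space <- paste("The highest score in Java is", MaxJava)

# paste0 joins its arguments with no separator at all.
no_sep <- paste0("Java", MaxJava)

print(with_colon)  # "The highest score in Java is:92"
print(with_space)  # "The highest score in Java is 92"
print(no_sep)      # "Java92"
```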
Count of all those who got the max mark in Java
- JavaToppers<- subset(StudentDetails, Java == max(StudentDetails$Java));
- ToppersCount<- nrow(JavaToppers)
- print(paste("Total number of top scorers in Java", ToppersCount, sep = ":"))
subset is used to derive a subset of the rows from the main data set, based on a matching condition (the maximum mark in Java). Subsequently, we have used nrow to get the number of rows present in the subset. ToppersCount is the variable that will hold the final value.
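If only the count is needed, the same answer can be obtained without building the subset first. The sketch below uses a made-up five-row data frame in place of the real student file.

```r
# Made-up stand-in for the student data (only the columns needed here).
StudentDetails <- data.frame(
  StudentName = c("Asha", "Ben", "Chloe", "Dev", "Ella"),
  Java = c(78, 92, 92, 65, 85),
  stringsAsFactors = FALSE
)

# Comparing the column with its maximum gives a logical vector.
is_topper <- StudentDetails$Java == max(StudentDetails$Java)

# Summing a logical vector counts its TRUE values.
ToppersCount <- sum(is_topper)
print(paste("Total number of top scorers in Java", ToppersCount, sep = ":"))
```

Both approaches give the same number; subset followed by nrow is convenient when the matching rows are needed afterwards as well.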
Details of all those who got the max mark in Java
- JavaToppers<- subset(StudentDetails, Java == max(StudentDetails$Java))
- print(JavaToppers)
Here, we are using the subset function to get the rows from the main data set that match the condition and display them as is. JavaToppers is the variable that will hold the final value.
Average score of a subject
- MeanPython<- mean(StudentDetails$Python)
- print(paste("The Average score in Python is", MeanPython, sep = ":"))
Here, we have used the mean function to calculate the average of StudentDetails$Python (i.e., the Python column present within the StudentDetails data set). MeanPython is the variable that will hold the final value.
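Real data often has gaps, and mean returns NA if any value is missing. The sketch below, which uses made-up marks, shows one way to handle that with the na.rm argument and to tidy the printed number with round.

```r
# Made-up Python marks; NA stands for a missing score.
Python <- c(88, 64, NA, 70, 77)

# mean returns NA when any value is missing...
plain_mean <- mean(Python)

# ...unless na.rm = TRUE drops the missing values first.
MeanPython <- mean(Python, na.rm = TRUE)

# round keeps the printed message tidy.
print(paste("The average score in Python is", round(MeanPython, 2), sep = ":"))
```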
Male Female Classification
- MaleRows<- subset(StudentDetails, Sex == "Male");
- MaleCount<- nrow(MaleRows)
- print(paste("Number of Male Students", MaleCount, sep = " - "))
- FemaleRows<- subset(StudentDetails, Sex == "Female");
- FemaleCount<- nrow(FemaleRows)
- print(paste("Number of Female Students", FemaleCount, sep = " - "))
Here, we have used the subset function to get the rows that match a condition. Afterwards, we have used nrow to get the count of the rows within the subset.
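As an alternative sketch, the table function counts every distinct value of a column in one call, so the two subsets are not needed. The data below is made up for illustration.

```r
# Made-up stand-in for the student data (only the columns needed here).
StudentDetails <- data.frame(
  StudentName = c("Asha", "Ben", "Chloe", "Dev", "Ella"),
  Sex = c("Female", "Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# table counts how often each distinct value occurs.
SexCounts <- table(StudentDetails$Sex)
print(SexCounts)

# Individual counts can be pulled out by name.
MaleCount   <- SexCounts[["Male"]]
FemaleCount <- SexCounts[["Female"]]
print(paste("Number of Male Students", MaleCount, sep = " - "))
print(paste("Number of Female Students", FemaleCount, sep = " - "))
```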
Student from the city of Darlington
- DarlingtonStudents<- subset(StudentDetails, City == "Darlington")
- print(DarlingtonStudents)
Just like the queries shown above, we have used subset here as well; only the condition is different.
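The same filter can also be written with logical indexing, and the %in% operator extends it to several cities at once. The sketch below uses made-up rows, and SelectedStudents is a name introduced only for this example.

```r
# Made-up stand-in for the student data (only the columns needed here).
StudentDetails <- data.frame(
  StudentName = c("Asha", "Ben", "Chloe", "Dev", "Ella"),
  City = c("Darlington", "Leeds", "Darlington", "York", "Leeds"),
  stringsAsFactors = FALSE
)

# Logical indexing: keep the rows where the condition is TRUE, all columns.
DarlingtonStudents <- StudentDetails[StudentDetails$City == "Darlington", ]
print(DarlingtonStudents)

# %in% matches against a set of values instead of a single one.
SelectedStudents <- subset(StudentDetails, City %in% c("Darlington", "York"))
print(SelectedStudents)
```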
Find the sum of subjects and list 3 overall toppers
- StudentDetails$Sum<- StudentDetails$Java + StudentDetails$C + StudentDetails$Ruby + StudentDetails$Python
- head(StudentDetails[order(StudentDetails$Sum, decreasing = TRUE),], n = 3)
Here, we are summing up the scores in Java, C, Ruby and Python and assigning the result to a new column, Sum, which is not present in the imported table. StudentDetails$Sum <- some value creates a new column in the table and assigns the value to it. Finally, we order the table in descending order of Sum and use the head function to get the top rows; n = 3 fetches only the first three rows.
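When there are many subjects, adding the columns one by one gets tedious. A possible alternative, sketched below on made-up data, is rowSums over a vector of column names (the subjects vector and TopThree are names introduced for this example).

```r
# Made-up stand-in for the student data.
StudentDetails <- data.frame(
  StudentName = c("Asha", "Ben", "Chloe", "Dev", "Ella"),
  Java   = c(78, 92, 92, 65, 85),
  C      = c(81, 70, 88, 59, 90),
  Ruby   = c(69, 75, 95, 72, 80),
  Python = c(88, 64, 91, 70, 77),
  stringsAsFactors = FALSE
)

# Columns to add up; extend this vector when more subjects are present.
subjects <- c("Java", "C", "Ruby", "Python")

# rowSums adds the selected columns row by row.
StudentDetails$Sum <- rowSums(StudentDetails[, subjects])

# Sort by the total in descending order and keep the first three rows.
TopThree <- head(StudentDetails[order(StudentDetails$Sum, decreasing = TRUE), ], n = 3)
print(TopThree[, c("StudentName", "Sum")])
```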
Group By Subject Toppers
- JavaToppers<- head(StudentDetails[order(StudentDetails$Java, decreasing = TRUE),], n = 4)
- RubyToppers<- head(StudentDetails[order(StudentDetails$Ruby, decreasing = TRUE),], n = 4)
- PythonToppers<- head(StudentDetails[order(StudentDetails$Python, decreasing = TRUE),], n = 4)
- print("The details of Java Toppers:")
- print(JavaToppers);
- print("The details of Ruby Toppers:")
- print(RubyToppers);
- print("The details of Python Toppers:")
- print(PythonToppers);
Here, we are sorting the Java scores in descending order, getting the first 4 rows using the head function and assigning them to the JavaToppers variable. Similarly, we are doing the same for the other subjects.
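Since the same steps repeat for every subject, they can be wrapped in a loop. The sketch below uses lapply over a vector of subject names on made-up data and keeps the top 2 rows because the sample has only five students; Toppers and subjects are names introduced for this example.

```r
# Made-up stand-in for the student data.
StudentDetails <- data.frame(
  StudentName = c("Asha", "Ben", "Chloe", "Dev", "Ella"),
  Java   = c(78, 92, 92, 65, 85),
  Ruby   = c(69, 75, 95, 72, 80),
  Python = c(88, 64, 91, 70, 77),
  stringsAsFactors = FALSE
)

subjects <- c("Java", "Ruby", "Python")

# For each subject, sort by that column in descending order and keep the top rows.
Toppers <- lapply(subjects, function(subject) {
  ranked <- StudentDetails[order(StudentDetails[[subject]], decreasing = TRUE), ]
  head(ranked, n = 2)
})
names(Toppers) <- subjects

# Print the toppers of each subject in turn.
for (subject in subjects) {
  print(paste("The details of", subject, "Toppers:"))
  print(Toppers[[subject]])
}
```

StudentDetails[[subject]] selects a column by a name held in a variable, which is what makes the loop possible; the $ operator only accepts a literal column name.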
Create Charts using R Plot
We can also create charts from the data set, using R Plot. This helps to derive meaningful information through data visualization. In order to create the plot, we can create a data set and use the barplot function to plot the chart in R Plot, as shown below. This will plot the marks in Java on the Y axis for each student.
- dataset = data.frame(StudentDetails)
- barplot(dataset$Java, names.arg = dataset$StudentName)
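A bar chart is often easier to read when the bars are sorted and labelled. The sketch below is one way to do that with base R on made-up data; the title and axis label are choices made for this example.

```r
# Made-up stand-in for the student data (only the columns needed here).
StudentDetails <- data.frame(
  StudentName = c("Asha", "Ben", "Chloe", "Dev", "Ella"),
  Java = c(78, 92, 92, 65, 85),
  stringsAsFactors = FALSE
)

# Sort the students by their Java marks, highest first.
ranked <- StudentDetails[order(StudentDetails$Java, decreasing = TRUE), ]

# las = 2 turns the student names sideways so long labels do not overlap.
# barplot returns the x positions of the bars, which can be used for annotations.
bar_positions <- barplot(
  ranked$Java,
  names.arg = ranked$StudentName,
  las = 2,
  ylab = "Java",
  main = "Java marks by student"
)
```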
If we want more from the graph plotting functionality, we can make use of the ggplot2 package, which has much richer functions. We can install it by running
- install.packages("ggplot2")
We can then plot the chart using the ggplot function, as shown below.
- library(ggplot2)
- ggplot(dataset, aes(x = StudentName, y = Java)) + geom_bar(stat = "identity", colour = "black", size = 2) +
- labs(x = "StudentName", y = "Java") + theme(axis.text.x = element_text(angle = 90, colour = "grey20", face = "bold", size = 25))
Summary
Thus, we have seen how we can use R programming for data analysis. This is just a kick start. R is powerful and can work with complex data.