Today, the R programming language is one of the most common tools in data science. It is popular because it covers data preparation, data visualization, statistical computation, and machine learning, and because it is both open source and backed by extensive package support.
In recent years, it has become more and more important to extract meaningful patterns from large-scale data and to add value to business units through growing data knowledge. Companies want to benefit from decision-making mechanisms built on data accumulated over the years.
Enterprises in many different sectors are making large profits by designing new data products with this data. For example, fraud detection systems that increase the profitability of insurance companies rely heavily on a data-driven approach. These systems are now being applied to a broader range of data types, such as images and videos, rather than just digital forms. Another popular use of data science is predictive maintenance, which is widely adopted by manufacturing and electricity distribution companies.
Many commercial and non-commercial tools are used in this area. The R language, which is open source and extensively used in the world of data science, has a very large collection of libraries for data analysis and machine learning. Users can connect to different data sources and access the data. They can also visualize and present it with extensive visualization libraries.
RStudio is one of the leading IDEs for the R programming language. Many R developers work with RStudio because it makes visual development easy and lets them document their work with its built-in documentation features. RStudio comes in two forms, RStudio Server and a local version, and each is available in commercial and open source editions. This tutorial uses the open source edition of RStudio Server.
In this tutorial, we will be installing RStudio on an Alibaba Cloud Elastic Compute Service (ECS) instance to perform statistical data analysis and train machine learning algorithms.
Choosing the Right Instance Type
Before we start the tutorial, it is important for you to decide on the hardware architecture according to your purpose of using RStudio.
Broadly, there are four types of needs for RStudio on Alibaba Cloud ECS:
1. To create an experimental environment
General-purpose ECS machines provided by Alibaba Cloud will be sufficient to conduct experimental studies and examine new packages with R.
2. For analytical modeling purposes only
This applies when little data editing and preparation is done in R, and machines with high RAM are required to run machine learning models on commonly available table structures. In this case, Alibaba Cloud’s Memory Optimized ECS machines are a good choice.
3. To utilize the power of parallel calculation
Parallel execution is especially important for machine learning algorithms. With packages that support parallelization in R, functions can easily run across multiple cores. For these core-bound workloads, the “Compute Optimized” machines offered by Alibaba Cloud are preferred.
4. For the big data world
Technologies such as Apache Hadoop and Apache Spark, which specialize in processing data at scale, are often used for big data processing. In this setup, the R packages are responsible only for orchestration and for connecting to these systems. For this reason, “General Purpose” hardware on Alibaba Cloud is enough for the RStudio installation, which does not require high RAM or CPU.
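Once an instance is running, a quick look at its cores and memory can confirm that the chosen family fits the workload. The following is a minimal sketch, assuming a Linux instance where `nproc` and `/proc/meminfo` are available; the variable names are illustrative only.

```shell
# Count the CPU cores available to this machine.
cores=$(nproc)

# Read total memory in KiB from /proc/meminfo and convert it to MiB.
mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
mem_mb=$((mem_kb / 1024))

echo "CPU cores: ${cores}"
echo "Total RAM: ${mem_mb} MiB"
```

A low core count with plenty of RAM suits the modeling case, while a high core count matters more for parallel workloads.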
Prerequisites
- A valid account on Alibaba Cloud. Sign up for a Free Trial if you don’t have one already.
- A new Alibaba Cloud ECS instance with Ubuntu 14.04/16.04 server installed.
- A public IP address is configured on the instance (via console).
- A root password is set up on the instance (via console).
Launch an Alibaba Cloud ECS Instance
Once you have created an account for Alibaba Cloud and selected the location and machine properties that you want, you can create an ECS instance.
After the machine is running, set the root password with the Reset Password button. When you create a new instance, a new public IP specific to your use is created, and you can connect to the instance with that public IP.
To connect to the Linux machine you have created, you can open the console in a web browser by clicking the “Connect” button on the Alibaba Cloud page, or you can connect over SSH from a terminal on your local computer (on Windows, use a utility such as PuTTY).
Use the following command to connect from a terminal on your local computer. Note: Replace <Your Public Ip> with your actual ECS IP address.
ssh root@<Your Public Ip>
After connecting to Ubuntu, refresh the package index from the SSH command line so that the packages installed next include the latest security and application fixes (the “-y” flag answers yes to prompts automatically). From this point on, all Linux commands are run as the root user.
# apt-get update -y
Install Base R and RStudio
Base R must be installed before RStudio, because RStudio is an IDE that works on top of an existing R installation.
# apt-get install r-base
For the RStudio installation, download the installer package from the RStudio site and start the installation with the “gdebi” command.
# apt-get install gdebi-core
# wget https://download2.rstudio.org/rstudio-server-1.1.456-amd64.deb
# sudo gdebi rstudio-server-1.1.456-amd64.deb
To check the RStudio status, go to the “/usr/sbin/” directory and run the following command.
root@iZgw87jn58ajcwb3gjnupwZ:/usr/sbin# rstudio-server status
If the installation went smoothly, output like the following is returned.
rstudio-server start/running, process 6982
RStudio has been successfully installed.
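As an extra check after installation, it can help to confirm that both the `R` and `rstudio-server` commands are on the PATH. This is a small sketch; `check_cmd` is a hypothetical helper name introduced here for illustration, not part of R or RStudio.

```shell
# Report whether a command is available on PATH.
# Returns non-zero when the command is missing.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Check the two commands this tutorial installs.
check_cmd R || true
check_cmd rstudio-server || true
```

If either line reports "missing", the corresponding install step most likely did not complete.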
Network Configuration in Alibaba Cloud
RStudio Server is a browser-based application, so users can access it from their web browsers without installing anything on their local computers. By default, RStudio Server communicates with client computers over port 8787.
If no suitable security group is defined when a new ECS instance is created, you will not be able to connect to the machine from external applications. You need to configure both inbound (ingress) and outbound rules in the Alibaba Cloud console. (By default, the SSH port is open when a new instance is created.)
A new ingress rule must be defined by clicking “Add to Security Group” under the “Security Group” tab on the Alibaba Cloud console to open the default 8787 port of RStudio Server.
If required during rule definition, a CIDR block of “0.0.0.0/0” can be used to allow access from anywhere in the world, or a more secure environment can be created by listing only the specific public IPs that are allowed to connect.
Once the security group is defined, you can connect to RStudio from a web browser on the local computer at http://<Your Public Ip>:8787.
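Before opening a browser, the port can be probed from the local computer to see whether the security group rule took effect. The sketch below assumes `curl` is installed; `rstudio_url` and `probe_rstudio` are hypothetical helper names, and 203.0.113.10 is a documentation placeholder address rather than a real instance.

```shell
# Build the RStudio Server URL for a host; the port defaults to 8787.
rstudio_url() {
    echo "http://$1:${2:-8787}"
}

# Print the HTTP status code returned by the server.
# curl prints 000 when no connection could be made within 5 seconds.
probe_rstudio() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$(rstudio_url "$@")"
}

# Usage (replace the placeholder with your instance's public IP):
# probe_rstudio 203.0.113.10
```

A result of 000 usually means the port is still blocked or the service is not running, while any real HTTP status code shows that the server is reachable.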
Configure and Manage RStudio
In RStudio, admin activities are especially important in enterprise environments. Admin users can diagnose problems, create new users, and restart the service if there is a problem with RStudio.
Users created on Linux can sign in to RStudio. A new user and password can be defined on Linux with the following command:
# adduser rstudio_user1
Connect to RStudio:
URL: http://<Your Public Ip>:8787
Username: rstudio_user1
Password: Defined password
With a default installation, the management and configuration files for RStudio are found in the following two locations.
- Configuration is done through the “rserver.conf” and “rsession.conf” files under “/etc/rstudio/”
- Administrative tasks are performed with the “rstudio-server” command under “/usr/sbin/”
Start, stop, and restart the service:
# /usr/sbin/rstudio-server start
# /usr/sbin/rstudio-server stop
# /usr/sbin/rstudio-server restart
Service status display:
# /usr/sbin/rstudio-server status
Currently active sessions display:
# /usr/sbin/rstudio-server active-sessions
PID TIME COMMAND
24878 00:00:00 /usr/lib/rstudio-server/bin/rsession -u rstudio_user1 --launcher-token 8C830F0
24895 00:00:00 /usr/lib/rstudio-server/bin/rsession -u rstudio --launcher-token 8C830F0
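When many sessions are open, it can be handy to reduce this listing to just the user names. The following sketch assumes the output format shown above, where each session line contains `-u <user>`; `session_users` is a hypothetical helper name.

```shell
# Print the unique user names that follow "-u" in active-sessions output.
# Reads the listing from standard input.
session_users() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "-u") print $(i + 1) }' | sort -u
}

# Usage:
# /usr/sbin/rstudio-server active-sessions | session_users
```

The header line is skipped naturally because it contains no `-u` field.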
Occasionally, RStudio’s default port may already be in use by another application, so you may need to change it.
To do so, add the following parameter to the “/etc/rstudio/rserver.conf” file. This parameter is not present by default.
Finally, if your Linux system uses the “xfs” filesystem, you should configure the “/etc/rstudio/rsession.conf” file with the following code:
To find out which filesystem type the Linux system uses:
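RStudio Server's documented setting for the listening port is `www-port`. As a hedged sketch of how such a line could be added or updated, the helper below edits a `key=value` file in place; `set_conf_option` is a hypothetical name, the port 8080 is an arbitrary example value, and GNU `sed` (as shipped with Ubuntu) is assumed.

```shell
# Set key=value in a config file: replace the line if the key already
# exists, otherwise append a new line at the end of the file.
set_conf_option() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        echo "${key}=${value}" >> "$file"
    fi
}

# Usage (8080 is only an example port):
# set_conf_option /etc/rstudio/rserver.conf www-port 8080
```

After changing the port, the security group rule also needs to allow the new port instead of 8787.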
# df -hT
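To read the filesystem type for a single path instead of scanning the full table, the `df` output can be filtered. This is a minimal sketch assuming GNU `df`; `fs_type` is a hypothetical helper name.

```shell
# Print the filesystem type of the mount that holds the given path.
# The path defaults to the root filesystem.
fs_type() {
    # -P keeps each entry on one line; -T adds the Type column (field 2).
    df -PT "${1:-/}" | awk 'NR == 2 { print $2 }'
}

fs_type /
```

If the printed value is `xfs`, the “rsession.conf” note above applies to your system.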
To activate changes to the “rserver.conf” or “rsession.conf” files, you need to restart the RStudio service.
Once you’ve installed RStudio, you can visualize data with powerful visualization packages, carry out statistical data analysis, and use many machine learning algorithms. Now you can start practicing data science with RStudio on an Alibaba Cloud Elastic Compute Service (ECS) instance.