If you work in data science, at some point you'll need computational horsepower. I work on a MacBook Pro, which has been great for learning but isn't enough for some of the things I'd like to do. While working on a Kaggle competition, I had one algorithm take more than 12 hours to run. That's when I decided to look for another option.
Initially I looked at buying a used workstation with no O/S. On Craigslist and eBay you can find a suitable system with 16 cores for less than $600, so this seemed like a good option. Then it occurred to me that Rackspace lets you spin up servers as needed and delete them just as quickly. The catch is that you have to put together the server that you want.
I did a little research and found that you can build your server on a small system, save an image of that server, then boot up a powerhouse with that image. That's pretty awesome. Their prices are very reasonable, and their documentation & support are fantastic to boot. I've created my image and run a couple of jobs on their hardware, and my bill so far this month is less than $20. No need to buy a workstation.
By the way, I know that this is possible with Amazon as well. I love shopping at Amazon, but AWS isn't my favorite. Call it personal preference.
If you know how to use R then you shouldn't have any trouble with this. I've fully documented my steps below, and walked through them a second time to make sure they work, so you should be able to get through it by following this article. Having said that, you should have some familiarity with Linux.
Quick note on Linux flavors. RStudio should work with any version of Linux that you care to use, but these instructions are specific to Ubuntu. I tried them with both 12.04 LTS and 13.10, and they apply to either version.
Congratulations. You now have a server in the cloud. We're done with the web interface for a while. The rest of our work will be done within the terminal.
Next we'll address a few security issues and update the server.
Connect with ssh root@xxx.xxx.xxx.xxx, where the x's represent your server's IP address. The first time you connect you'll see a message about an unrecognized RSA key. Go ahead and accept that.
$ # Log in to your server.
$ ssh root@xxx.xxx.xxx.xxx
$ # Change the root password.
$ passwd
$ # Update the server.
$ apt-get update
$ apt-get upgrade --show-upgraded
$ # Add a new user. Replace <sudo_username> with whatever you choose.
$ # Follow the on screen prompts to create the new user.
$ adduser <sudo_username>
$ # Grant sudo privileges to your new user.
$ adduser <sudo_username> sudo
$ # Add one more user.
$ adduser <rstudio_user>
$ # Prevent root from logging in. Once the file is open, scroll down until
$ # you see PermitRootLogin and change that setting to no.
$ vi /etc/ssh/sshd_config
Find PermitRootLogin and change that setting to no.
To save and exit, press <Esc>, then type :wq and hit <Return>.
$ # Restart the ssh service to apply these updates.
$ service ssh restart
$ # Log out
$ logout
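If you'd rather not make the PermitRootLogin change by hand in vi, it can also be scripted. What follows is a minimal sketch, not part of the steps I ran: disable_root_login is a hypothetical helper name, and it assumes a sed that accepts -E and -i.bak (GNU sed on Ubuntu does). It leaves a .bak copy of the file next to the original.

```shell
#!/bin/sh
# Hypothetical helper (not from the original walkthrough): force
# "PermitRootLogin no" in an sshd_config-style file, keeping a .bak copy.
disable_root_login() {
    config_file="$1"
    if grep -Eq '^[#[:space:]]*PermitRootLogin[[:space:]]' "$config_file"; then
        # Rewrite every existing PermitRootLogin line, commented out or not.
        sed -E -i.bak \
            's/^[#[:space:]]*PermitRootLogin[[:space:]].*/PermitRootLogin no/' \
            "$config_file"
    else
        # No such line yet: back up the file, then append the setting.
        cp "$config_file" "$config_file.bak"
        printf 'PermitRootLogin no\n' >> "$config_file"
    fi
}

# Example (run as root on the server, before restarting ssh):
#   disable_root_login /etc/ssh/sshd_config
```

The ssh service still has to be restarted afterwards for the setting to take effect.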
Now it's time to configure a firewall. Log back in as <sudo_username> first, since root can no longer connect over ssh.
$ # Check existing firewall rules. This will ask you for your password.
$ sudo iptables -L
$ # Create a file to store rules.
$ sudo touch /etc/iptables.firewall.rules
$ sudo vi /etc/iptables.firewall.rules
Copy & paste this:
*filter

# Allow all loopback (lo0) traffic and drop all traffic to 127/8 that doesn't use lo0
-A INPUT -i lo -j ACCEPT
-A INPUT -d 127.0.0.0/8 -j REJECT

# Accept all established inbound connections
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow all outbound traffic - you can modify this to only allow certain traffic
-A OUTPUT -j ACCEPT

# Allow HTTP and HTTPS connections from anywhere (the normal ports for websites and SSL).
-A INPUT -p tcp --dport 80 -j ACCEPT
-A INPUT -p tcp --dport 443 -j ACCEPT

# Allow RStudio Server connections (port 8787)
-A INPUT -p tcp --dport 8787 -j ACCEPT

# Allow SSH connections
# The --dport number should be the same port number you set in sshd_config
-A INPUT -p tcp -m state --state NEW --dport 22 -j ACCEPT

# Allow ping
-A INPUT -p icmp -j ACCEPT

# Log iptables denied calls
-A INPUT -m limit --limit 5/min -j LOG --log-prefix "iptables denied: " --log-level 7

# Drop all other inbound - default deny unless explicitly allowed policy
-A INPUT -j DROP
-A FORWARD -j DROP

COMMIT
Type :wq and hit Enter.
$ # Activate the rules.
$ sudo iptables-restore < /etc/iptables.firewall.rules
$ # Recheck the existing firewall rules.
$ sudo iptables -L
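Since these rules drop everything that isn't explicitly allowed, a typo on the SSH line can lock you out of the server. As a rough sanity check before activating a rules file, something like the sketch below could confirm that the ports this walkthrough depends on are still accepted. The helper names rules_allow_port and check_required_ports are hypothetical, and the check is a plain text match on the file, not a real parse of the ruleset.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the rules file has an INPUT rule that
# accepts traffic to the given destination port (simple text match).
rules_allow_port() {
    rules_file="$1"
    wanted_port="$2"
    grep -Eq -- "^-A INPUT .*--dport ${wanted_port} .*-j ACCEPT" "$rules_file"
}

# Check the ports this walkthrough relies on: 22 (SSH) and 8787 (RStudio).
check_required_ports() {
    rules_file_to_check="$1"
    for required_port in 22 8787; do
        if ! rules_allow_port "$rules_file_to_check" "$required_port"; then
            echo "port ${required_port} is not allowed in ${rules_file_to_check}" >&2
            return 1
        fi
    done
    echo "required ports are allowed"
}

# Example:
#   check_required_ports /etc/iptables.firewall.rules
```

If you moved SSH to a different port in sshd_config, the port list in check_required_ports would need to change to match.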
$ # Create a new file that calls the iptables rules.
$ sudo vi /etc/network/if-pre-up.d/firewall
Copy & paste this into the file.
#!/bin/sh
/sbin/iptables-restore < /etc/iptables.firewall.rules
$ # Make the file executable.
$ sudo chmod +x /etc/network/if-pre-up.d/firewall
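The firewall script only helps if it can actually be run when the network comes up, so the shebang line and the executable bit both matter. Here is a small sketch for double-checking those two things. check_hook_script is a hypothetical helper name, and it only inspects the file; it does not run it.

```shell
#!/bin/sh
# Hypothetical helper: check that a hook script starts with a shebang and
# has its executable bit set. It does not execute the script.
check_hook_script() {
    hook_file="$1"
    first_line=$(head -n 1 "$hook_file")
    case "$first_line" in
        '#!'*) ;;
        *)
            echo "missing shebang on first line of ${hook_file}" >&2
            return 1
            ;;
    esac
    if [ ! -x "$hook_file" ]; then
        echo "${hook_file} is not executable" >&2
        return 1
    fi
    echo "${hook_file} looks runnable"
}

# Example:
#   check_hook_script /etc/network/if-pre-up.d/firewall
```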
That's it for system setup & security. The hard part is over, and we're almost done.
$ # Install R.
$ sudo apt-get install r-base
This ran without error for me. If you have any difficulty with this, you’ll need to add your closest CRAN mirror to your /etc/apt/sources.list file. Check the CRAN Mirrors list for the server closest to you.
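For reference, here is a sketch of what adding a CRAN mirror might look like. I didn't need this step myself, so treat it as an assumption: cran_sources_line is a hypothetical helper, http://cran.example.org is a placeholder rather than a real mirror, and the "bin/linux/ubuntu <codename>/" layout should be confirmed against the instructions for the mirror you pick.

```shell
#!/bin/sh
# Hypothetical helper: print an apt source line for a CRAN mirror.
# The "<mirror>/bin/linux/ubuntu <codename>/" layout is an assumption;
# confirm it against your chosen mirror before relying on it.
cran_sources_line() {
    mirror_url="${1%/}"      # strip one trailing slash, if any
    ubuntu_codename="$2"     # e.g. "precise" for 12.04, "saucy" for 13.10
    printf 'deb %s/bin/linux/ubuntu %s/\n' "$mirror_url" "$ubuntu_codename"
}

# Example (placeholder mirror URL; lsb_release -cs prints the codename):
#   cran_sources_line "http://cran.example.org" "$(lsb_release -cs)" \
#       | sudo tee -a /etc/apt/sources.list
#   sudo apt-get update
```

apt may also complain about an unsigned repository until the mirror's signing key is added. That step isn't covered here.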
Once that’s complete, you’ll have R installed. You can now run it at any time by typing R at the prompt. Note that it’s case sensitive: typing r will get you an error message.
$ cd $HOME
$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1
$ wget http://download2.rstudio.org/rstudio-server-xx-amd64.deb
$ sudo gdebi rstudio-server-xx-amd64.deb
You now have RStudio up & running. You can access it by opening http://xxx.xxx.xxx.xxx:8787 in your browser.
I don’t recommend leaving this server up & running all the time. The benefit of doing this with Rackspace is that you can power servers up & down as needed. So save your server by creating an image.
Don’t lose your passwords. You’ve deleted the server, but when you launch it again you’ll need those to access it.
Also, when accessing RStudio, only log in using the <rstudio_user> user that you created. All of the work that you do in RStudio will be over an unsecured connection. This server will only be online for a few hours at most, so I'm not paranoid about security, but there's no need to tempt fate.
When you're working with the server, you'll need to find a way to load your data onto it. To save on costs I've been cleaning/preparing data on my system, then uploading it to Rackspace Files and accessing it there. Any output from the server should be stored in the $HOME directory of your <rstudio_user>.
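Because the firewall leaves port 22 open, plain scp is another way to pull output back from the <rstudio_user> home directory. This is a different technique from the Rackspace Files approach I described, and the sketch below is only an illustration: fetch_results and its DRY_RUN switch are hypothetical, and the user, address, and paths in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: copy a file or directory from the server to this
# machine with scp. With DRY_RUN=1 it only prints the command.
fetch_results() {
    remote_user="$1"     # e.g. the <rstudio_user> account
    server_ip="$2"       # your server's IP address
    remote_path="$3"     # path relative to the remote user's $HOME
    local_dir="$4"       # where to put the copy locally
    # Build the command once so the dry run shows what would be executed.
    set -- scp -r "${remote_user}@${server_ip}:${remote_path}" "$local_dir"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (placeholder address; set DRY_RUN=1 first to preview the command):
#   fetch_results "<rstudio_user>" xxx.xxx.xxx.xxx results ./results
```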
That's it, you're done! You just created an RStudio server. Total cost to create this for me was less than 50¢. The next time you need horsepower to crunch your data, you can access 32 cores with 120 GB of RAM within minutes. Power these up & down as needed.