Situation: you start your bioinformatics project very happy that you will be analyzing your sequencing data, be it transcriptomic, genomic or epigenomic. You have learned bash, R and statistics. Then you try to transfer all of the data to your computer and discover that, bleh, you don’t have enough storage. Then what?
You start panicking? No!
You start searching the web for any information on that, and are met with a gazillion new terms you have never heard before: server, cluster, cloud computing, ssh, SSD, bucket, to name a few. And now you panic for real.
Don’t worry, I am here to help! I’ll get straight to the point about the options available to tackle this problem when working with big data in bioinformatics.
Basically you have two main options:
Cloud computing
Server/cluster of your institution (when available)
Cloud computing
These are services provided by companies that let you pay for a temporary remote machine to run your analyses. Google Cloud, Amazon AWS and Azure are the most well-known providers of this type of service. Basically, you create an account, link a payment method, and start using it.
If this is your option, you need to learn the basics of analyzing data in the cloud. I would go with the simplest way, which is creating an instance. An instance is nothing more than a computer you configure remotely. You need to pick the number of CPUs, the amount of RAM and the storage, which can be HDD (popularly known as HD, cheaper and slower) or SSD (faster to transfer data, but more expensive). Then you just transfer your data to the instance using wget or over ssh (with scp, for example), depending on where the data will come from, and work from there (using the command line, of course).
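As a rough illustration, assuming you already have ssh access to the instance (the hostname, user and file names below are placeholders), transferring data could look like this:

    # Download public data directly onto the instance (run on the instance)
    wget https://example.org/data/sample_R1.fastq.gz

    # Or copy files from your own computer to the instance over ssh
    # (scp uses the ssh connection; user, address and paths are placeholders)
    scp sample_R1.fastq.gz your_user@instance-external-ip:/home/your_user/data/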
Google Cloud Compute Engine
Below I will give you some details on how to do this on Google Cloud Compute Engine. They give $300 of free credits to new Google Cloud users, so you can test it and potentially do the analysis you need for free. To give you an idea of costs, a general-purpose machine with 4 CPUs, 15GB of memory (RAM) and 100GB of HDD storage, running for 30 days, would cost you around 140 USD. You can calculate the expected amount for your personalized needs on the Google Cloud pricing calculator page.
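If you prefer the command line to the web console, a minimal sketch of creating such an instance with the gcloud tool could look like the lines below (the instance name and zone are placeholders, and n1-standard-4 is one machine type with 4 vCPUs and 15GB of RAM; check the Compute Engine documentation for current options and prices):

    # Create an instance with 4 vCPUs, 15GB of RAM and a 100GB standard (HDD) boot disk
    gcloud compute instances create my-rnaseq-instance \
        --machine-type=n1-standard-4 \
        --zone=us-central1-a \
        --boot-disk-size=100GB \
        --boot-disk-type=pd-standard

    # Open a terminal on the instance
    gcloud compute ssh my-rnaseq-instance --zone=us-central1-a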
After setting up your instance and transferring the data, you need to install from scratch all the software needed to run your analyses. For example, let’s say that you want to run an RNA-seq alignment using STAR. You will need to install STAR, transfer the reference genome or transcriptome to the instance, build the genome index for this program, and then run it. Don’t forget that any other tools you might need for earlier steps, such as QC, must be installed on the instance too. After all, it is a new computer you are renting for a limited time.
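To make this concrete, here is a minimal sketch of those steps, assuming a Linux instance with conda already installed and placeholder file names (check the STAR manual for the options that fit your data):

    # Install STAR, for example through the bioconda channel
    conda install -c bioconda star

    # Build the genome index (genome.fa and annotation.gtf are placeholders)
    STAR --runMode genomeGenerate \
         --genomeDir star_index \
         --genomeFastaFiles genome.fa \
         --sjdbGTFfile annotation.gtf \
         --runThreadN 4

    # Align paired-end reads against the index
    STAR --runThreadN 4 \
         --genomeDir star_index \
         --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix sample_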
After finishing your analysis, you need to deal with the resulting data. Let’s talk about this in another post, because it is a topic that deserves its own attention.
Server/cluster in your university
In this option there are two different ways a server might be provided to you: via a job scheduling strategy or by giving you access to a slice of the server.
Job scheduling strategy
A server, like the instance in the cloud computing option, is a remote computer to which you have access. With the job scheduling strategy, you learn how to structure your data, write your scripts and submit them to the job scheduler, then wait for your turn to run them, since other people send their jobs to the scheduler too. The most popular schedulers are Slurm and SGE.
Generally there is a limited free space (scratch) on the server for you to experiment with a slice of the data (so you are certain that your script is correct before sending it to the scheduler, for example). Some more established clusters allow you to submit to a specific queue (partition), depending on your demand for CPU (or GPU) and memory. These are also resources you can request per task when submitting your job to the scheduler.
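As an illustration of this route, a minimal Slurm submission script could look like the sketch below (the partition name, resource numbers and file names are placeholders; your cluster’s documentation will have the exact conventions):

    #!/bin/bash
    #SBATCH --job-name=star_align        # name shown in the queue
    #SBATCH --partition=general          # queue/partition (cluster-specific)
    #SBATCH --cpus-per-task=4            # CPUs requested for this task
    #SBATCH --mem=16G                    # memory requested
    #SBATCH --time=12:00:00              # maximum run time
    #SBATCH --output=star_align_%j.log   # log file (%j is the job ID)

    # The actual analysis command goes here
    STAR --runThreadN 4 --genomeDir star_index \
         --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
         --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate

You would submit this with sbatch and check your place in the queue with squeue.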
Using a slice of the server
Another way of doing this, generally in less experienced labs, is by requesting a specific slice of the server available at your university to use for your whole project. That would be: “I need 100GB of disk, 16GB of memory and 4 CPUs to run my analysis”. In my PhD I knew that I could have asked for this at my university, but I had no idea what my CPU, RAM and storage needs would be. Therefore I could not get it, and I still relied on my collaborator’s server to finish my analyses. So this is important knowledge to have.
Calculating the necessary computational resources for your bioinformatics analyses will need to be addressed in a new blog post.
I hope this post has helped you on your bioinformatics journey!