Amazon Web Services (AWS), the public-cloud business of Amazon.com, is now hosting 200 terabytes of data consisting of the DNA sequences of about 1,700 people. The data is available to the public through the cloud free of charge.

The data is from the 1000 Genomes Project, an international research effort by a consortium of 75 companies and organizations that aims to establish the most detailed catalogue of human genetic variation.

The US National Institutes of Health played a major role in moving the data to the cloud and opening it for public access. The organization will continue adding data from the 1000 Genomes Project to the Amazon cloud as the project’s participants work to reach their goal of sequencing the genomes of more than 2,600 people from 26 of the world’s populations.

Lisa Brooks, a program director at the National Human Genome Research Institute, said researchers who wanted access to public data sets like that of the 1000 Genomes Project used to have to download them from government data centers to their own systems or have them shipped on discs.

“This process took a long time, and that’s assuming a lab had the bandwidth to download the data and sufficient storage and compute infrastructure to hold and analyze the data once they had it. We are happy that the 1000 Genomes Project data are on AWS to give researchers anywhere in the world a simple way to access the data so they can put the data to work in their research.”

The data sets are stored in Amazon’s Simple Storage Service (S3) and its Elastic Block Store (EBS). Users can access the data through Amazon’s Elastic Compute Cloud (EC2) and its Elastic MapReduce (EMR). This means researchers no longer need to move the data in-house or get access to expensive equipment capable of storing and processing such a high volume of data.
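
As a rough illustration of what reading the data straight from S3 can look like, the sketch below builds an anonymous listing request for a public bucket and parses the XML that S3 returns. The article does not name the bucket, so the name `1000genomes` used here is an assumption, and the helper functions `listing_url`, `parse_listing` and `fetch_listing` are hypothetical names chosen for this example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the bucket name is not given in the article.
BUCKET = "1000genomes"

# XML namespace used by S3 bucket-listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 request URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from a listing response body."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def fetch_listing(bucket=BUCKET, prefix="", max_keys=10):
    """Fetch and parse one page of object keys (requires network access)."""
    with urllib.request.urlopen(listing_url(bucket, prefix, max_keys)) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

Calling `fetch_listing()` from an EC2 instance, or from any machine with internet access, would return the first few object names and sizes without copying the data set itself. Whether anonymous listing is permitted depends on how the bucket is configured.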

To access the data, visit Amazon's Public Data Sets catalog.