Cloud Computing Outlook

Cloud Dataverse: A data repository platform for the Cloud

By Mercè Crosas, Ph.D., Chief Data Science & Technology Officer, Harvard University

Imagine combining the power and scalability of cloud computing and storage with access to thousands of datasets hosted in a reliable, feature-rich data repository platform. Cloud Dataverse does exactly that: it brings a mature and widely used data repository platform, Dataverse, together with the OpenStack cloud platform.

This is a necessary next step. In the age of Big Data, when large and streaming datasets are becoming more common, repository and cloud platforms must converge so that data do not need to be moved constantly as they are processed, shared, and archived.

In the last decade, the use of data repository and cloud platforms has grown significantly, but mostly separately, without taking advantage of one another. According to the re3data website, there are now more than 1,800 public research data repositories used in academia, government, and business, supported by either open-source repository platforms (e.g., Dataverse, DSpace, CKAN, and Fedora), proprietary software (e.g., Figshare), or one-off databases (e.g., Protein Data Bank). Likewise, there has been an increase in the use of open-source software for creating private and open clouds (e.g., OpenStack and OpenNebula), as well as in the use of commercial public clouds (e.g., Amazon Web Services, Microsoft Azure, Google Cloud), by academia, government, and businesses alike. Now it is time to bring the two together.

" The Harvard Dataverse repository alone hosts more than 70,000 datasets with contributions from 500 research and academic institutions worldwide "

To understand the value of converging data repositories and cloud computing, consider the popular AWS public dataset service. Amazon hosts a variety of public datasets that range from census data, to an inventory of Google Books, to Human Genome information. Rather than spending hours downloading these datasets, AWS users can browse the AWS repository, locate a dataset, and then analyze it in situ using Amazon Elastic MapReduce. However, this is not a satisfying solution for data repository stakeholders. While AWS's public dataset service demonstrates the value of integrating access to datasets with cloud computing, it lacks the power of today's fully functional research data repositories, which follow best practices for data sharing. In our experience, many data owners, while willing to share datasets widely, are uncomfortable with making them fully public a priori. For example, some agencies require dataset users to sign terms of use so that the agency is not liable for incomplete anonymization of the dataset. A mechanism by which users can locate datasets and apply for access is critical to making these datasets available. Also, the effort to curate and publish a dataset can be enormous, and it is becoming critical to give dataset authors credit for this work. Finally, access to the metadata associated with a dataset is key to finding and reusing the data.

Data sharing is being adopted in many scientific communities as the way to make data accessible to others. This is driven by a number of factors, including recent open data policies, funder and journal requirements, and growing community awareness that reproducing a scientific claim requires access to the underlying data. The research community has developed standards and best practices to incentivize and improve the quality of data sharing. Each dataset must have:

1) A data citation to credit data authors

2) A registered global persistent identifier to locate and reference the dataset indefinitely (e.g. a Digital Object Identifier or DOI)

3) Well-defined restrictions, licenses, and terms of use, so users know how they may access the data

4) Rich metadata describing the dataset to help others find and reuse it (see the Joint Declaration of Data Citation Principles and the FAIR principles). These four elements are sketched together below.
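
To make these elements concrete, here is a hedged sketch, in Python, of how the four elements might be bundled into a single metadata record for one dataset. All field names and values are invented for illustration; actual Dataverse records use richer, versioned metadata blocks.

```python
# Illustrative only: an invented record bundling the four data-sharing
# elements above. Field names and values are hypothetical, not the
# Dataverse schema.
dataset_record = {
    # 1) Data citation crediting the authors
    "citation": ('Doe, Jane, 2016, "Boston Commute Survey", '
                 "https://doi.org/10.7910/DVN/EXAMPLE, Harvard Dataverse, V1"),
    # 2) Registered global persistent identifier
    "persistent_id": "doi:10.7910/DVN/EXAMPLE",
    # 3) Restrictions, license, and terms of use
    "terms_of_use": "CC0 1.0; restricted files require an approved access request",
    # 4) Rich descriptive metadata for discovery and reuse
    "metadata": {
        "title": "Boston Commute Survey",
        "authors": [{"name": "Doe, Jane", "affiliation": "Example University"}],
        "keywords": ["transportation", "survey", "Boston"],
        "description": "Commute-mode survey of Boston-area residents.",
    },
}
```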

The Dataverse repository platform enables building repositories without implementing from scratch all the standards and best practices needed to fully support data sharing and archiving. Dataverse provides additional features such as versioning of datasets, customized virtual repositories within the same hosting infrastructure, multiple roles and permissions to support data management and curation, tiered access based on granted permissions, and APIs to deposit, explore, or visualize the data. The Dataverse software has been developed since 2006 at the Institute for Quantitative Social Science at Harvard University. Like OpenStack, it is open source, with a growing user and developer community and 22 installations around the world, which can be federated to share metadata.
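
As a flavor of these APIs, the sketch below queries the Dataverse Search API to locate datasets by keyword. The server URL is hypothetical, and the response fields shown ("items", "name", "global_id") follow the documented Search API but should be checked against your installation's version.

```python
# A minimal sketch, assuming a Dataverse installation at SERVER that
# exposes the standard Search API. Not an official client.
import requests

SERVER = "https://dataverse.example.edu"  # hypothetical installation

resp = requests.get(
    f"{SERVER}/api/search",
    params={"q": "census", "type": "dataset", "per_page": 5},
)
resp.raise_for_status()

for item in resp.json()["data"]["items"]:
    # Each dataset result carries a name and a global persistent identifier.
    print(item["name"], "->", item.get("global_id", "n/a"))
```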

The Harvard Dataverse repository alone hosts more than 70,000 datasets with contributions from 500 research and academic institutions worldwide.

Cloud Dataverse benefits from the repository infrastructure and rich set of features provided by Dataverse, as well as from cloud technologies that enable storing and computing on large datasets. Our first implementation of Cloud Dataverse is with the Massachusetts Open Cloud (MOC), a regional public cloud effort by Harvard, Boston University, MIT, Northeastern, and UMass, along with a community of industry partners. How does Cloud Dataverse extend Dataverse? First, it integrates with the MOC's OpenStack Swift object storage. Swift provides scalable, low-cost storage optimized for large, unbounded datasets. This integration lets Dataverse users deposit and access large data files directly from Swift storage, without being limited by the Dataverse web interface and APIs, which can only handle datasets up to a few GBs. Second, it integrates with the MOC's OpenStack Keystone identity service. This allows data users to find a dataset in a Dataverse repository and seamlessly access the data in the cloud environment using their Dataverse credentials. And third, it integrates with the MOC's OpenStack Sahara service to manage access to compute-intensive data processing frameworks such as Hadoop or Spark. We are now starting to design how Cloud Dataverse can integrate with other Dataverse repositories so that datasets from federated repositories can be automatically made available in the cloud.
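
As a rough illustration of the Swift and Keystone side of this integration, the sketch below uses the python-swiftclient library to authenticate against a Keystone v3 endpoint and fetch one data file from a Swift container. The endpoint, credentials, container, and object names are all invented, and in Cloud Dataverse itself this plumbing is handled by the platform rather than by end-user scripts.

```python
# A minimal sketch, assuming hypothetical Keystone/Swift endpoints and
# credentials; not the Cloud Dataverse integration itself.
import swiftclient

conn = swiftclient.Connection(
    authurl="https://keystone.example.org:5000/v3",  # hypothetical Keystone endpoint
    user="alice",
    key="s3cret",
    auth_version="3",
    os_options={
        "project_name": "dataverse",       # hypothetical project
        "user_domain_name": "Default",
        "project_domain_name": "Default",
    },
)

# One plausible layout: a container per dataset (named after its DOI),
# with each object holding one data file.
headers, body = conn.get_object("doi-10-7910-DVN-EXAMPLE", "survey_2016.csv")
with open("survey_2016.csv", "wb") as f:
    f.write(body)
```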

With the convergence of these two growing open-source projects, Cloud Dataverse can expand its feature set and be useful to both the scientific and industry communities. But more importantly, Cloud Dataverse represents the necessary next step in combining cloud computing with data sharing.
