The Big Data Challenge – Recommendations by Mercè Crosas – Big Data Value

“Currently, Mercè’s team is in the process of implementing datatags for datasets in the Harvard Dataverse repository. This has been a big task due to legal compliance issues, security requirements and the conditions set by various data agreements. These datasets often contain sensitive information about individuals and therefore safeguards need to be put in place to protect these individuals. Policies on data sharing play a critical role in balancing the benefits and risks. The average citizen wants privacy and safety for their data but has little time for data governance. As the number of data-driven products is only expected to increase, so is citizens’ demand for privacy management. It is important to map the data beforehand because the manner in which relevant regulation is to be attached to the data depends on the data itself. When regulation changes, the datatags will have to be adapted as well, for instance by providing an updated version of the tag. For these purposes, the team works with lawyers who help verify the datatags. More recently, Mercè has been involved as one of the co-PIs of the OpenDP project, an open-source platform for differential privacy libraries. This work would allow researchers to mine and analyze sensitive datasets while preserving privacy, without ever accessing the data directly. Dataverse, DataTags, and OpenDP will together provide a privacy-preserving platform for sharing and analyzing sensitive data….”
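
For readers unfamiliar with differential privacy, the idea of answering a question about a sensitive dataset without exposing any individual record can be illustrated with the classic Laplace mechanism. The sketch below is only an illustration using the Python standard library; it is not OpenDP's API, and the names `laplace_noise` and `private_count` are hypothetical. It assumes a simple counting query, whose result changes by at most 1 when one person's record is added or removed.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution (inverse-CDF method)."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # Clamp the log argument so the edge case u == -0.5 cannot produce log(0).
    tail = max(1.0 - 2.0 * abs(u), 1e-300)
    return -scale * math.copysign(1.0, u) * math.log(tail)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching `predicate`.

    A count has sensitivity 1, so Laplace noise with scale 1/epsilon
    gives epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical toy data: ages of study participants.
    ages = [23, 35, 41, 29, 62, 57, 33, 48]
    noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
    print(f"Noisy count of participants aged 40+: {noisy:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy; the analyst only ever sees the noisy answer, never the underlying records.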

European Dataverse Workshop 2020

Date: January 23-24, 2020
Venue: UiT The Arctic University of Norway

Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data.

One Step Closer to the “Paper of the Future” | Research Data Management @Harvard

“As a researcher who is trying to understand the structure of the Milky Way, I often deal with very large astronomical datasets (terabytes of data, representing almost two billion unique stars). Every single dataset we use is publicly available to anyone, but the primary challenge in processing them is just how large they are. Most astronomical data hosting sites provide an option to remotely query sources through their web interface, but it is slow and inefficient for our science….

To circumvent this issue, we download all the catalogs locally to Harvard Odyssey, with each independent survey housed in a separate database. We use a special python-based tool (the “Large-Survey Database”) developed by a former post-doctoral scholar at Harvard, which allows us to perform fast queries of these databases simultaneously using the Odyssey computing cluster….

To extract information from each hdf5 file, we have developed a sophisticated Bayesian analysis pipeline that reads in our curated hdf5 files and outputs best fits for our model parameters (in our case, distances to local star-forming regions near the sun). Led by a graduate student and co-PI on the paper (Joshua Speagle), the python codebase is publicly available on GitHub with full API documentation. In the future, it will be archived with a permanent DOI on Zenodo. Also on GitHub users will find full working examples of the code, demonstrating how users can read in the publicly available data and output the same style of figures seen in the paper. Sample data are provided, and the demo is configured as a jupyter notebook, so interested users can walk through the methodology line-by-line….”

Open Access Week at Harvard Library 2018 | Communications

In celebration of OA Week, the Harvard Library Office for Scholarly Communication (OSC) will share some great news about OA and the Harvard community:

  • The OSC will launch a new OA policy that lets staff, researchers, and scholars use open-access licensing.
  • The OSC will share its annual statistics from around the world, highlighting the impact of Harvard’s scholarship.
  • The OSC will reveal the new and improved Harvard open-access repository, DASH (Digital Access to Scholarship at Harvard).

In addition, the Harvard Library OSC and the Research Data Management Program are teaming up to co-sponsor a series of events during OA Week, including an open-access open house and interactive workshops on ORCID, reproducibility, Dataverse, and more.

GitHub and more: sharing data & code | Innovations in Scholarly Communication

“Among those researchers that do archive and share data, GitHub is indeed the most often used, but just as many people indicate using ‘others’ (i.e. tools not mentioned as one of the preselected options). Figshare comes in third, followed by Bitbucket, Dryad, Dataverse, Zenodo and Pangaea (Figure 3)….Another surprising finding is the overall low use of Zenodo – a CERN-hosted repository that is the recommended archiving and sharing solution for data from EU-projects and -institutions. The fact that Zenodo is a data-sharing platform that is available to anyone (thus not just for EU project data) might not be widely known yet….”

Launch of Data Management Planning Tool | Office for Sponsored Programs

“As a result of collaborations with the Office of the Vice Provost for Research, Harvard University Information Technology, and IQSS, Harvard Library has launched a customized version of DMPTool, an online data management planning tool, for Harvard University. Data management plans—documents that outline what researchers will do with data during and after a project—are becoming increasingly required by funding agencies such as the National Institutes of Health and the National Science Foundation. The online tool provides step-by-step guidance for creating data management plans that include templates and examples; it also helps researchers create and share their plans, assisting them in how to address requirements specific to Harvard….”