Scientific Data Sharing Solutions
Saving wasted healthcare research and promoting scientific innovation
By Daniel Hwang (Weavechain), Martin Codyre (Fleming Protocol), & Angelina Lesnikova (sci2sci)
Introduction
In recent years, the “traditional” workflow of academic scientific research has been upgraded to promote cooperation, replicability, and dissemination of scientific knowledge.
This shift has been spurred by the challenges and failures that plague traditional systems, including the failure to reproduce and replicate published scientific studies. Lack of reproducibility has been blamed for harming both scientific research and public health: it slows scientific productivity and progress, wastes funding and time, and erodes the public’s trust in scientific research (Nature).
Today, new data sharing and management practices have set the foundation to move towards greater collaboration and institutionalized data storage and security. Introducing new data management standards yields several benefits: it encourages greater scientific progress, strengthens the legitimacy of research, and fosters goodwill among researchers.
For scientists, however, the new workflow has created new challenges. Many worry that sharing their data and research results will let competitors plagiarize or steal their work. Some have not been given the resources to share their data properly and may lack a background in data curation and metadata management. While informatics journals have developed archiving and analytic tools to optimize data management and analysis, researchers continue to struggle with managing increasingly large and complex datasets.
What is the current status of data management and sharing? Why should scientists consider sharing data? This article will provide an overview of scientific data sharing’s current landscape, general practices, and the specific case of patient data management. Finally, we’ll discuss how Fleming Protocol, sci2sci, and Weavechain provide critical solutions to address pain points.
Data Sharing Practices
Much of the responsibility for managing scientific data falls on the principal investigator (PI). The PI is the lead researcher for the grant project and holder of the independent grant, responsible for the integrity and management of the research project. The PI usually makes the important decisions regarding funding, data sharing, and project guidelines; in the context of data sharing, the PI authorizes data access and provides legitimate data credentials.
Requirements and responsibilities for data management plans differ across institutions, both in their details and in the tools used. Common best practices call for a plan to include:
Procedures for managing and storing data and confidentiality
Roles and responsibilities of research team members
Schedule for data sharing
Necessary documentation
Method of data collection/sharing
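The plan components listed above can also be kept as a small machine-readable record, which makes it easy to spot sections that have not been written yet. The following is a minimal Python sketch, not a template from any institution or funder; the class name, field names, and example values are all hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataManagementPlan:
    """Hypothetical record mirroring the plan components listed above."""
    storage_and_confidentiality: str = ""
    roles_and_responsibilities: str = ""
    sharing_schedule: str = ""
    documentation: str = ""
    collection_and_sharing_method: str = ""

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Example values are illustrative only.
plan = DataManagementPlan(
    storage_and_confidentiality="Encrypted institutional storage; de-identified records only",
    roles_and_responsibilities="PI authorizes access; data manager curates metadata",
    sharing_schedule="Deposit at time of publication",
)

# Lists the sections that still need to be filled in.
print(plan.missing_sections())
```

A check like this could run before a plan is submitted, so that an incomplete section is caught by the research team rather than by a reviewer.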
Methods for providing access to data vary with a number of factors, ranging from the sensitivity of the data to its size and complexity, or even the expected number of requests (IES). Typically, the PI and institution share data directly upon request, deposit it in a data archive or data repository, or combine these methods.
A data repository provides secure, long-term storage for research data. BU Data Services outlines important considerations when selecting a data repository:
Reputation – Is it endorsed by a funding agency, journal, or professional society?
Sustainability – Consider a repository that guarantees access to your data well over five years into the future
Visibility – Depositing your data in a repository will give you a unique identifier that will credit you when others cite your work. Most data repositories will provide a DOI, handle, etc.
Usability – Does the data repository provide access to other users?
Features – Does your repository have integrations with Open Science Framework, GitHub, or other commercial storage solutions?
Formats – Check if your data is compatible with the data repository of your choice
Rights – Make sure to read the terms and conditions!
For more information, check out the NIH Data Sharing Repositories, Registry of Research Data Repositories, and the Nature Data Repository Guidance site.
The 2023 NIH Data Management & Sharing Policy will be effective January 25, 2023, overhauling the 2003 NIH Data Sharing Policy, which has been in effect for more than fifteen years. Under the new DMS Policy, NIH will require PIs and institutions to include a plan and budget for the management of data, submit a DMS plan that follows the outline provided by the NIH, and comply with the approved DMS plan.
For more information, check the full policy guidelines here.
Patient Data Ownership // Fleming Protocol
From a societal perspective, biomedical science is ultimately practiced to cure patients of diseases and alleviate suffering. The “gold standard” for proving that a given intervention actually works is the randomized, double-blind, placebo-controlled clinical trial.
Pharmaceutical companies enroll patients to determine whether a new drug works. The patient will generally sign a nondisclosure agreement and an informed consent document that is probably 20+ pages of dense legalese, with buried language saying that the company can keep the data obtained from patients’ bodies private if it wishes. Ben Goldacre and his AllTrials campaign have made this point very well.
The patient has donated their medical data in the belief that this will now be used to further biomedical research and to help humanity. The problem here is that if the clinical trial doesn’t tell the story that the sponsor (usually the pharmaceutical company) wants to tell, the company most likely will keep the clinical data private and never divulge it to anybody else. Now, in the case that the study results in a drug approval, the company may make billions of dollars a year based on the data provided by the patient; however, the patient will never see any upside from this.
Fleming Protocol’s team believes that allowing patients to maintain ownership of their data and to control who gets access to it would ultimately result in better outcomes for humanity. They want to enable well-educated and organized patient groups to securely own the data produced during clinical trials. Ultimately, they want to see the patient groups themselves executing these trials. This is where we will see a true revolution in the development of more and more therapeutics for a given disease, simply because the incentives are then aligned for patients to own and solve their own problems.
Research Data Sharing Today // sci2sci
Despite the recent enthusiasm about scientific data sharing and certain attempts to popularize open science, including those by government agencies, the current state of data publishing is still very poor.
A number of factors contribute to this situation. First, there is, or has been until very recently, a lack of incentive to publish data. Scientists have mostly been judged by the impact factor of the journals in which they publish their research papers, and by little else. There are recent initiatives by the US and EU governments to change the status quo and incentivize research data sharing, but they have yet to materialize.
The second problem with research data sharing is purely technical. There are a number of public repositories, from data type-specific ones such as Gene Expression Omnibus or European Nucleotide Archive to general ones such as Zenodo, Dryad, or FigShare. However, it is simply difficult and time consuming to gather the data used for a manuscript and prepare it in a format suitable for public release.
In typical research practice, data is scattered across multiple devices and collaborators, with no standardization effort made to facilitate sharing. Metadata is written down loosely, and often by hand, which means that searching and digitizing it takes additional effort. In the absence of true incentives for high-quality data publication, researchers often dump their data into public repositories only because the journals where they publish sometimes require it, with no proper annotation or necessary metadata attached. As a result, the majority of the datasets currently available for public access can neither be used to verify the original findings nor be reused by researchers in other studies.
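To illustrate what “proper annotation” could mean in practice, here is a minimal Python sketch of a metadata completeness check run before a dataset is deposited. It is not taken from any repository’s submission requirements; the list of required fields, the function name, and the example record are assumptions made for illustration only.

```python
# Hypothetical minimum set of metadata fields; real repositories define their own.
REQUIRED_FIELDS = (
    "title",
    "creator",
    "date_collected",
    "file_format",
    "license",
    "description",
)


def missing_metadata(record: dict) -> list:
    """Return the required fields that are absent or blank in a metadata record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Illustrative record with two gaps: no license and a blank description.
record = {
    "title": "Example expression dataset",
    "creator": "Example Lab",
    "date_collected": "2022-06-01",
    "file_format": "CSV",
    "description": "   ",
}

# Lists the fields that would need to be completed before deposit.
print(missing_metadata(record))
```

Even a simple check of this kind, applied when the data is first collected rather than at publication time, would reduce the effort of reconstructing metadata later.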
One recent attempt to tackle this issue is the sci2sci platform. It is being built so that researchers can both work with research data and publish it in the same place, streamlining the entire research data lifecycle. The platform has yet to be released, but waitlist sign-up is available here.
Simply Stitching it All Together with Weavechain
Weavechain gives scientists the tools to easily implement these modern patterns and avoid the challenges of the past. While Web3 data holds the potential to dramatically increase both the security and the value of data, it has been impossible to implement for all but the most technical engineers. Weavechain’s smart hashing technology attaches to existing databases, providing an easy, GDPR/HIPAA compliant way to unlock benefits like immutability, data/compute lineage, confidential computing, data monetization, and more.
Its secure data collaboration solution supports the needs of both Fleming Protocol and sci2sci, as well as other organizations that benefit from modern data features. By bundling together many of the infrastructural requirements discussed here, it handles the plumbing so that scientists can focus on science.
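The immutability benefit mentioned above rests on a general idea: if a cryptographic hash of a record is stored separately, any later change to the record can be detected by recomputing the hash. The Python sketch below illustrates only that general idea using the standard library. It is not Weavechain’s implementation or API, and the function name and example record are hypothetical.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a SHA-256 hex digest of a record serialized in a canonical form."""
    # Sorting keys and fixing separators makes the serialization deterministic,
    # so the same record always produces the same digest.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative record; the fingerprint would be stored apart from the data.
original = {"patient_id": "P-001", "visit": 3, "systolic_bp": 128}
stored_fingerprint = record_fingerprint(original)

# A copy of the record with one value altered after the fact.
tampered = dict(original, systolic_bp=118)

print(record_fingerprint(original) == stored_fingerprint)  # True
print(record_fingerprint(tampered) == stored_fingerprint)  # False
```

Because only the fingerprint needs to be stored or shared, this kind of check can confirm that a dataset is unchanged without exposing the underlying records.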
Conclusion
Fleming Protocol, sci2sci, and Weavechain are among the many organizations paving the way for infrastructure that supports the funding, ownership, security, and dissemination of healthcare research data. The sci2sci publishing platform improves the discoverability of published data sets and promotes accessibility for both public and private uses. Fleming Protocol’s incentivisation and data monetization mechanisms allow patients and researchers to collaborate while maintaining privacy. And Weavechain is infrastructure technology that enables any researcher to easily join this collaborative data future.
The growing availability of scientific data has necessitated the growth and development of the data sharing industry and supporting regulations. To curb the production of irreproducible and wasteful data, scientists and institutions alike must provide the necessary tools and resources to effectively package and prepare scientific data. That future has never been closer than it is today!
To learn more about how Weavechain can help your organization:
Drop us a line at hello@weavechain.com
Subscribe to our newsletter