The exponential growth in scientific data over the past two decades has led to wider appreciation of the need to properly archive and share those data with the broader scientific community. In 2016, an international group of authors published an article in Scientific Data (DOI: 10.1038/sdata.2016.18) proposing that scientific data should be Findable, Accessible, Interoperable, and Reusable (FAIR). The benefits of FAIR data are widely acknowledged, but making this a reality for all types of data will take considerable effort.
Natural products research is an inherently multi-disciplinary enterprise with a rich history. The isolation, structure elucidation, and biological characterization of an individual compound can sometimes span many years. As such, the methodological details used to characterize the compounds become extremely important. The methods section of most manuscripts describing the discovery and characterization of a natural product contains insufficient information to allow others to reproduce that discovery. Some important details are glossed over with the ubiquitous phrase “as previously reported,” leaving a chain of references between the reader and the primary data. For some types of data, such as X-ray crystallographic data or biosynthetic gene cluster sequences, there are established procedures and repositories to share and report those data. For others, such as mass spectrometry (MS), there are emerging resources and standards. However, for nuclear magnetic resonance (NMR) data, there is currently no consensus on the platform or format for sharing. As such, much work is needed to bring natural products data into compliance with the FAIR principles.
NMR is an essential tool in natural products research, and yet there is currently no widely accepted platform for sharing these data among the research community. Multiple attempts have been made to create such a repository, but none has reached wide acceptance. For most natural products, NMR data are available only as a collection of peak tables and perhaps as a low-resolution PDF image in journal articles and supplemental information. Nearly all raw NMR data are housed locally in individual labs and academic institutions. As hardware and software platforms change over time, many of these data are lost or become very difficult to access. The tools to allow sharing of raw NMR data already exist. What is needed is a concerted effort to unite the natural products community around a common data repository and shared standards for reporting of raw NMR data.
This FOA is being released in conjunction with the Center for Natural Product Technology, Methodology and Productivity Optimization (NP-TEMPO), and the Botanical Dietary Supplements Research Centers Program. Collectively, the awards under these FOAs constitute the NIH Consortium for Advancing Research on Botanical and Other Natural Products (CARBON) Program. Applicants applying under this NP-NODE FOA are encouraged to collaborate, where appropriate, with the NP-TEMPO, the Botanical Dietary Supplement Research Centers, and other NCCIH, ODS, and NIH supported grantees as well as other national and international researchers.
The main objective of the NP-NODE is to create a data repository for sharing of raw NMR data. Ultimately, the NP-NODE aims to establish a powerful resource that will house the majority of the world’s natural products NMR data in a format that allows facile cross-linking with other data repositories. This requires the database to be structured and annotated in such a way as to support linking with other data repositories that contain different types of data on the same compounds. Parallel development of appropriate tools will permit the rapid interrogation of NMR spectra within the data repository. The NP-NODE should include basic tools to facilitate search and comparison of spectral data and associated structures while also allowing other groups to develop additional tools using the deposited data. To accomplish this, the grantee will need to establish the platform for the repository and collaborate widely with the natural products community to develop consensus on the raw data format, minimum data standards, and metadata for the repository. Additionally, the grantee is expected to actively solicit, through various channels, the deposition of data to populate the repository. Importantly, the grantee will also need to develop a plan for how deposited data are to be curated. Ultimately, a plan for the long-term viability of this repository will be required. Therefore, applicants should begin to consider models for this, but it is not expected that they will be in place within the timeframe of this initial phase of the project.
The NMR data repository is the cornerstone of what is expected to be a larger effort which seeks to establish data sharing standards for the natural products research community. For the purposes of this FOA, natural products include vitamins, minerals, and probiotics as well as small molecules derived from plants, fungi, bacteria, marine organisms or animals. They may also exist along a spectrum of complexity from crude extracts to purified constituents. The NP-NODE will also coordinate broadly with the natural products research community to establish good research practices related to reporting and sharing of all natural products related research data. The conversation should build understanding of, and agreement on, the importance of abiding by the FAIR data principles.
NCCIH will also utilize the NP-NODE to help address the principles of the NCCIH Product Integrity Policy. This policy establishes standards regarding the quality control data expected of any natural product to be used in an NCCIH funded grant. It is expected that the NP-NODE will assist NCCIH and potential NCCIH grantees prior to award to evaluate data submitted in response to this policy. This will include recommendations about additional pieces of data that would help fully address the requirements of the policy as well as potential collaborators in the form of companies, contract labs, or academic groups that could help acquire that information. Importantly, the NP-NODE is not expected to provide that data directly except in special cases. Furthermore, it is not expected that the NP-NODE will have expertise on hand to directly address the full range of products that are within the purview of NCCIH. Rather, the NP-NODE will serve an advisory function to assist prospective NCCIH grantees in finding resources to satisfy the requirements of the policy.
At a minimum, the NP-NODE is expected to have the following capabilities:
- A cloud platform to house raw NMR data;
- A user-friendly interface to allow for upload, download, search and analysis of NMR data in the repository;
- Tools and consensus standards to facilitate widespread utilization of the repository;
- Encouragement of additional tools from the natural products research community that leverage the repository data;
- The ability to coordinate information standards across the natural products research community that are compatible with existing standards, such as herbal CONSORT, ARRIVE, and the NCCIH Product Integrity Policy;
- The ability to assist NCCIH and ODS with implementation of the NCCIH Product Integrity Policy.
Developing a Platform
If an NMR data repository is to become widely accepted and utilized by the natural products research community, it must be housed in such a way that it can accommodate the types of data that need to be deposited. At a minimum, this should include raw 1D and 2D NMR data files. Proton and carbon are by far the most commonly observed nuclei, but the repository should be constructed so that it can expand to include NMR data for fluorine, nitrogen, phosphorus, etc. At the outset, the barrier for depositing data should be very low until broad agreement can be reached regarding a universal format. There are existing tools that can read the various vendor formats and transform the spectra into usable data. The NP-NODE should expect to handle much of the data curation and extraction effort for deposited spectra in the early stages of development.
The data repository must be capable of scaling as the volume of data and tools expands. Ideally, the repository will grow to include tens of thousands of NMR spectra. The repository must be robust enough to handle that volume, and there must be a reasonable expectation that it will exist long into the future and keep pace with advances in bioinformatic computing. For this reason, a cloud-based format is particularly attractive. It is expected that the repository will integrate with future NIH efforts related to data stewardship. As with any resource, the NP-NODE must consider how the repository will be sustained over time. Availability of continuous NIH support cannot be assumed. Therefore, applicants should describe possible models of sustainability that could be considered in the event NIH support ends. The applicant must demonstrate a deep understanding of the barriers that have prevented the widespread utilization of prior attempts at creating an NMR repository. A clear strategy must be presented for how these barriers will be overcome with the current effort.
All deposited data must receive a unique digital object identifier to allow easy retrieval, citation, and tracking. Furthermore, the data should be annotated in such a way that it allows for seamless integration with other data repositories (e.g. structures stored as SMILES strings). The NP-NODE must be a powerful resource on its own, but also extend and amplify its utility through linking with other data repositories such as PubChem and the various MS data repositories. The NP-NODE should have a well-established plan for redundant storage of the data to ensure its security and longevity.
The NP-NODE will need an interface to allow users to interact with the data. The applicant must demonstrate that their team has appropriate expertise and prior experience in developing user-friendly interfaces. The website should provide clear options for importing or exporting data. There should be preliminary tools to allow searching of the repository based on combinations of text, structural, and/or spectral features. This should include the ability to search on a full or partial structure or spectrum. Importantly, it must be possible to access the entire repository so that researchers can see how their data relate to all other spectra in the repository.
Creating Tools and Standards
For the deposited raw NMR data to be of any value to the research community, there must be tools that allow those data to be searched, compared, and otherwise manipulated. Basic tools that allow for transformation and visualization of individual spectra and simple search functionality for structural or spectral features should be deployed immediately. Longer term, the NP-NODE should work closely with tool developers to encourage design and deployment of more sophisticated tools that permit mining of the entire data set simultaneously to uncover patterns and clustering of spectral and structural features in ways that are becoming common with mass spectrometry data. For this to be possible, the interface should be structured in a way that allows the entire repository of spectra to be retrieved by individual users.
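As an illustration only, the kind of "simple search functionality for spectral features" described above could be sketched as a peak-list comparison. Everything in this sketch is an assumption and not part of the FOA: the `PeakList` record, the `match_fraction` and `search` helpers, the 0.5 ppm tolerance, the 0.8 score cutoff, and the chemical shift values, which are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeakList:
    """Hypothetical repository entry: an identifier plus 13C shifts in ppm."""
    record_id: str
    shifts_ppm: tuple


def match_fraction(query, candidate, tol_ppm=0.5):
    """Fraction of query peaks that have a candidate peak within tol_ppm.

    Simplification: several query peaks may match the same candidate peak.
    """
    if not query:
        return 0.0
    hits = sum(
        1 for q in query if any(abs(q - c) <= tol_ppm for c in candidate)
    )
    return hits / len(query)


def search(query, library, tol_ppm=0.5, min_score=0.8):
    """Return (score, record_id) pairs at or above min_score, best first."""
    scored = [
        (match_fraction(query, entry.shifts_ppm, tol_ppm), entry.record_id)
        for entry in library
    ]
    return sorted((pair for pair in scored if pair[0] >= min_score), reverse=True)


# Made-up example data, for illustration only.
library = [
    PeakList("entry-A", (170.0, 128.5, 55.3, 21.1, 77.0)),
    PeakList("entry-B", (200.0, 128.0, 30.0)),
]
query = (170.1, 128.4, 55.2, 21.0)
results = search(query, library)
```

A production tool would more likely use a one-to-one peak assignment and account for solvent and referencing differences; the sketch only shows why whole-repository access matters, since every entry has to be scored against the query.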
Procedures for uploading and downloading of data should be established and well described. Importantly, deposition of data must be as straightforward as possible to minimize the barrier for researchers to use the resource. This will have to also be balanced against the need to create a file format that contains sufficient information. To achieve that balance, the NP-NODE will need to work closely with all stakeholders to reach a consensus. Applicants should clearly describe how they will manage the competing priorities of rapid deployment and consensus building. Furthermore, the NP-NODE should allocate appropriate resources to allow for training of new users or on new tools as they come online.
For the NP-NODE to be successful, it will require interaction with a wide cross section of the natural products research community. Consensus will need to be reached regarding a variety of repository features. One of the first efforts in this regard will be on the format for the raw NMR data. There are a number of existing options including NMReData, nmrML, and NMR-STAR to name a few. Very early in the development of the repository, an agreement will have to be reached about how data will be deposited in the repository. Use of an existing data format is not a strict requirement as long as there is widespread agreement from diverse segments of the stakeholder community. Unanimity of opinions likely will not be reached on this topic, but the final decision must be the result of a dialogue with stakeholders to clearly understand the pros and cons of the various options. In parallel to the decision regarding data format is agreement regarding the metadata to be included for each data submission. In addition to the basic experimental NMR acquisition parameters, this could include important details such as the producing organism, collection coordinates, extraction and purification procedures and perhaps other information that would allow future researchers to reproduce the data. As the repository is developed, the NP-NODE also must actively encourage labs from around the world to deposit their NMR data in the repository. This might also include communication with journal editors about current or future policies related to sharing of NMR data.
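To make the idea of minimum metadata concrete, a deposition check might look like the sketch below. The field names and the choice of which fields are required are hypothetical; the FOA leaves both to community consensus, and none of this reflects NMReData, nmrML, or NMR-STAR.

```python
# Hypothetical minimum metadata for one deposited spectrum. The actual list
# is to be agreed with stakeholders; these names are illustrative only.
REQUIRED_FIELDS = (
    "nucleus",
    "solvent",
    "spectrometer_frequency_mhz",
    "producing_organism",
    "extraction_procedure",
)


def missing_metadata(record):
    """Return the required fields that are absent or empty in a submission."""
    return [
        field for field in REQUIRED_FIELDS if record.get(field) in (None, "")
    ]


# Made-up submission, for illustration only.
submission = {
    "nucleus": "1H",
    "solvent": "CDCl3",
    "spectrometer_frequency_mhz": 600.0,
    "producing_organism": "",
    "smiles": "CCO",
}
problems = missing_metadata(submission)
```

Reporting the missing fields, instead of rejecting the upload outright, is one way to keep the barrier to deposition low while still nudging depositors toward complete records.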
Extending beyond the NMR data repository, the NP-NODE is expected to engage the broader natural products research community in a conversation about ways to strengthen the rigor and reproducibility of the overall research enterprise in this field. It is expected that the NP-NODE will work closely with the other components of the CARBON program in this effort. In part, this will entail education regarding the FAIR data principles. Incomplete reporting of experimental details is an ongoing problem across the scientific landscape. The natural products community has specific challenges associated with sufficiently documenting the methodology associated with the collection, purification, and chemical characterization of compounds. The NP-NODE is expected to solicit and gather input to help make recommendations, if not set standards, regarding how to collect and report such data to meet the rigor and reproducibility criteria set forth by NIH and further codified in the FAIR data principles. Again, interaction with journal editors may be helpful in establishing and enforcing these recommendations.
As part of this broader coordination effort, NCCIH will utilize the NP-NODE to help implement the principles of the NCCIH Product Integrity Policy as described above. The number of applicants subject to this policy varies substantially throughout the year and across years, but typically ranges from 10 to 30 per year.
Structure and Governance
As outlined above, the NP-NODE has three major components, each of which contains a number of sub-elements and overlaps with the others in places. A majority of NP-NODE activity (~65%) will be devoted to the NMR data repository. Of the remaining activity, ~25% of the total will be devoted to the larger outreach and coordination efforts around implementation of the FAIR data principles across all aspects of natural products research. The final ~10% of NP-NODE activity will involve consultation on the NCCIH Product Integrity Policy.
The NP-NODE, in collaboration with NIH staff, will set annual milestones to ensure timely completion of all project goals. The milestones should establish clear metrics so that the degree to which they are achieved can be quantified. To monitor progress on those milestones, the NP-NODE will communicate frequently with NCCIH staff through regularly scheduled phone calls. Furthermore, the NP-NODE should include funds in its budget to attend the annual CARBON program meeting. Applicants should propose milestones for expected accomplishments in each year of the project.
Deadline: April 1, 2019 (letters of intent due 30 days prior to the deadline)