Metadata
RDF graphs • Embeddings • Traceability
Machine Actionable Metadata
DeSci Labs aims to build machine-actionable metadata in line with the FAIR principles. Our intent is to be the gold standard of Red Principle implementation while satisfying Blue Principles through widely used, standardized vocabularies. We achieve this by combining two methods of machine actionability.
RDF graphs through the semantic web
Research Objects are invaluable for machine-generated metadata regarding semantic types. Nodes are RDF graphs at their core. The Node RDF graph maps connections between digital objects that cannot be inferred from machine-generated metadata: in most cases, there is little to no mutual information between the embeddings derived from these components in the absence of user-inputted metadata records.
For example, there is no way to know that a particular .csv dataset file is linked to a given research paper without human-generated metadata annotating the .csv file and telling the machine that "this is the data of paper X". By adding components with their semantic types to your Node, we create such a relational mapping:
Data.csv -> isOpenDataComponentOf -> ResearchReport.pdf.
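The mapping above is a subject, predicate, object triple. As a minimal sketch of how such triples can be stored and queried, the snippet below uses plain Python tuples rather than a real RDF library. The `Triple` type, the `components_of` helper, the `analysis.py` file and the `isCodeComponentOf` predicate are hypothetical names introduced only for illustration; they are not part of the Node data model.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One subject -> predicate -> object statement in a Node graph."""
    subject: str
    predicate: str
    obj: str


# Toy graph. Only the first triple comes from the text above;
# the second one (file name and predicate) is a hypothetical example.
node_graph = [
    Triple("Data.csv", "isOpenDataComponentOf", "ResearchReport.pdf"),
    Triple("analysis.py", "isCodeComponentOf", "ResearchReport.pdf"),
]


def components_of(graph: List[Triple], target: str, predicate: str) -> List[str]:
    """Return every subject linked to `target` through `predicate`."""
    return [t.subject for t in graph if t.predicate == predicate and t.obj == target]


open_data = components_of(node_graph, "ResearchReport.pdf", "isOpenDataComponentOf")
print(open_data)  # ['Data.csv']
```

Without the first triple, nothing in the content of Data.csv alone would let a machine answer this query.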
Embeddings as supplementary metadata
To enhance machine actionability during FAIR's uptake phase, DeSci Labs automatically generates embeddings using OpenAI's API for all text-based components (manuscripts, preprints, presentations, code repositories, etc.) in a research object. These embeddings are stored in a vector database for efficient retrieval and updates, and are mapped to the PIDs of the digital objects. For manuscripts, we generate multiple runs of embeddings on the sections of the research report extracted from the PDFs.
At a high level, Node Metadata is:
Base metadata for components as provided in RO-Crate through Schema.org
RDF graph of functional types, contributors (ORCID) and funders (ROR)
Content embeddings over text-based digital objects
Traceability of change (who, what, when)
Example:
Once we have the RDF graph of a Node and embeddings over its digital objects, we can query the vector database to return all Nodes related to COVID-19. And because we know the relationships between Node components, we can run more advanced queries, for instance returning all open datasets or code repositories related to COVID-19.
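The combined query described above can be sketched as a similarity search followed by a filter on functional type. This is an illustrative toy, not the production pipeline: the in-memory `index`, the PIDs, the three-dimensional vectors and the `search` function are all assumptions made up for the example, and a real deployment would use embeddings from an embedding model stored in a vector database.

```python
import math
from typing import Dict, List, Optional, Set, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical index: PID -> (functional type, embedding).
index: Dict[str, Tuple[str, List[float]]] = {
    "pid:report-1": ("research report", [0.9, 0.1, 0.0]),
    "pid:dataset-1": ("dataset", [0.8, 0.2, 0.1]),
    "pid:code-1": ("code repository", [0.1, 0.9, 0.2]),
}


def search(
    query_vec: List[float],
    functional_types: Optional[Set[str]] = None,
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Return (PID, score) pairs above `threshold`, best match first.

    If `functional_types` is given, only components of those types are kept;
    this is the step that relies on the typed graph, not on the embeddings.
    """
    hits = []
    for pid, (ftype, vec) in index.items():
        if functional_types is not None and ftype not in functional_types:
            continue
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((pid, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Stand-in for the embedding of a query such as "COVID-19".
query = [1.0, 0.0, 0.0]
everything = [pid for pid, _ in search(query)]
only_data_and_code = [pid for pid, _ in search(query, {"dataset", "code repository"})]
```

Here `everything` contains the report and the dataset, while `only_data_and_code` narrows the same result to the dataset, because the code repository in this toy index is not similar to the query.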
Metadata Model
Our base metadata model can be seen in our open source code on GitHub. We use an RO-Crate implementation that has been modified to allow for IPLD and CIDs. The manifest files underpinning individual dPIDs can be viewed through the following link structures:
https://[prefix].dpid.org/[#]?jsonld
https://[prefix].dpid.org/[#]?raw
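As a small sketch of how these link structures might be assembled in a script, the helper below fills in the two placeholders. The function name `manifest_url` is hypothetical, the prefix value `example` is a stand-in and not a real dPID prefix, and treating `[#]` as an integer dPID number is an assumption.

```python
def manifest_url(prefix: str, dpid: int, view: str = "jsonld") -> str:
    """Build a manifest URL following https://[prefix].dpid.org/[#]?<view>.

    `view` must be one of the two query flags shown above: "jsonld" or "raw".
    """
    if view not in ("jsonld", "raw"):
        raise ValueError(f"unsupported view: {view!r}")
    return f"https://{prefix}.dpid.org/{dpid}?{view}"


# "example" is a placeholder prefix used only for illustration.
jsonld_link = manifest_url("example", 1)
raw_link = manifest_url("example", 1, view="raw")
```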
Metadata Collected
The metadata we collect can generally be split along three criteria (level of generation, user vs machine generation, required vs optional field) which are visualized in the table below. It should be noted that this data model is still undergoing revisions (specifically the JSON -> RO-Crate Transformer).
User Entered, Required Metadata
While we try our best to minimize the workload on a scientist to FAIRify their work, two metadata fields (license and functional type) require user entry on a component-by-component level. At scale this can be quite time-consuming. In both cases, we use a combination of interfaces and inheritance to optimize the data publishing experience for users.
License agreement: The absence of a clearly defined license agreement de facto blocks reuse of the digital objects, so a license is required. When you create a Node, you can choose a default license agreement that applies to all components of your Node. The default is overridden for a component if you manually set a different license for it, or if it conflicts with the license retrieved from an API call (e.g., if you select CC BY and your code on GitHub is under MIT, then MIT overrides CC BY for your code component). All Node metadata is licensed under CC0, and the metadata remains available even if the underlying data has been deleted, intentionally or not.
Functional type: The functional type refers to the nature of the digital object. For instance, "research report" and "code repository" are functional types. We use the functional types to create the RDF graph of your Node. In the future, it should be possible to infer the functional types based on content embeddings.
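The license inheritance described above can be sketched as a small resolution function. This is a hedged illustration, not the actual implementation: the name `effective_license` and its parameters are hypothetical, and the rule that a manual choice takes precedence over a license detected through an API is an assumption, since the text does not say which of the two wins when both are present.

```python
from typing import Optional


def effective_license(
    node_default: str,
    manual: Optional[str] = None,
    detected: Optional[str] = None,
) -> str:
    """Resolve the license that applies to a single component.

    node_default: default license chosen when the Node was created.
    manual: license the user set by hand for this component, if any.
    detected: license retrieved from an API call (e.g. from a code host), if any.
    """
    # Assumption: an explicit manual choice has the highest priority.
    if manual is not None:
        return manual
    # A detected license that conflicts with the default overrides it.
    if detected is not None and detected != node_default:
        return detected
    # Otherwise the component inherits the Node default.
    return node_default


code_license = effective_license("CC BY", detected="MIT")
data_license = effective_license("CC BY")
```

With the example from the text, `code_license` resolves to MIT for the code component, while `data_license` inherits CC BY from the Node default.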
FIP and Metadata Permanence Plan
DeSci Labs is working hand in hand with the GoFAIR Foundation to create a machine-actionable FIP. The current FIP Draft can be seen on FAIRConnect.
Future Metadata Efforts
Enhanced embeddings
For all digital objects in a research object, we generate embeddings using OpenAI's API. These embeddings are stored in a vector database for efficient retrieval and updates, and are mapped to the PIDs of the digital objects. For manuscripts, we generate multiple runs of embeddings on the sections of the research report extracted from the PDFs.
Machine-extracted metadata
We would like to extract authors, affiliations and grant funding information from research reports to expedite the process of adding co-authors to a Node. This means automating the matching of ORCID IDs and ROR PIDs. Time permitting, we may look into the concept of keyword and semantic extraction.
Community defined vocabulary integration
Tools like CEDAR workbench provide scientists with workflows to create machine-actionable metadata through the semantic web. We will be evaluating CEDAR and similar tools in the future with the possibility of integration in mind.
Publishing RDF metadata in addition to JSON and JSON-LD
It is quite possible that RDF will become a mandate in the FAIR ecosystem. Nodes currently publish through JSON-LD (as RO-Crate uses Schema.org). Conversion between JSON-LD and RDF is easy to do but difficult to do well. It is our opinion that bad metadata is worse than no metadata. When effective, well-resourced and maintained, community-accepted means of converting from JSON-LD to RDF become available, we will investigate means of integration.