Metadata

RDF graphs • Embeddings • Traceability

Machine Actionable Metadata

DeSci Labs aims to build machine-actionable metadata in line with the FAIR principles. Our intent is to be the gold standard of Red Principle implementation while satisfying Blue Principles through widely used, standardized vocabularies. We achieve this by combining two methods of machine actionability.

RDF graphs through the semantic web

Research Objects are invaluable for machine-generated metadata regarding semantic types. Nodes are RDF graphs at their core. The Node RDF graph maps connections between digital objects that machine-generated metadata cannot infer: in most cases, the embeddings derived from these components share little to no mutual information unless a user has entered metadata records linking them.

For example, there is no way to know that a particular .csv dataset file is linked to a given research paper without human-generated metadata that annotates the .csv file and tells the machine that "this is the data of paper X". By adding components with their semantic types to your Node, we create such a relational mapping:

Data.csv -> isOpenDataComponentOf -> ResearchReport.pdf.
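A minimal sketch of how such a mapping can be held and queried as subject, predicate, object triples. Only the first triple comes from the example above; the second triple, its predicate name, and the helper `subjects_of` are hypothetical, and a real Node would rely on an RDF library rather than plain tuples.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


NODE_GRAPH = [
    # Taken from the example in the text.
    Triple("Data.csv", "isOpenDataComponentOf", "ResearchReport.pdf"),
    # Hypothetical component and predicate, for illustration only.
    Triple("analysis.py", "isCodeComponentOf", "ResearchReport.pdf"),
]


def subjects_of(graph: list[Triple], predicate: str, obj: str) -> list[str]:
    """Return every subject linked to `obj` by `predicate`."""
    return [t.subject for t in graph if t.predicate == predicate and t.obj == obj]


open_data = subjects_of(NODE_GRAPH, "isOpenDataComponentOf", "ResearchReport.pdf")
print(open_data)
```

The point of the explicit predicate is that the link between the two files is stated, not guessed from their contents.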

Embeddings as supplementary metadata

To enhance machine actionability during FAIR's uptake phase, DeSci Labs automatically generates embeddings using OpenAI's API for all text-based components (manuscripts, preprints, presentations, code repositories, etc.) in a research object. These embeddings are stored in a vector database for efficient retrieval and updates, and are mapped to the PIDs of the digital objects. For manuscripts, we generate multiple runs of embeddings on the sections of the research report extracted from the PDFs.
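The indexing step described above might look roughly like the sketch below. `toy_embed` is a stand-in for the external embedding API call (no real API is invoked here), and the PID string, the section names, and the in-memory dict that replaces the vector database are all hypothetical.

```python
import hashlib
import math

DIM = 8  # toy dimensionality; real embedding models use far more dimensions


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding API call: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def index_sections(pid: str, sections: dict[str, str]) -> dict[tuple[str, str], list[float]]:
    """Embed each section separately and key the vector by (PID, section name)."""
    return {(pid, name): toy_embed(body) for name, body in sections.items()}


# Hypothetical PID and manuscript sections.
index = index_sections(
    "example-pid-1",
    {"abstract": "We study viral spread.", "methods": "We fit a compartmental model."},
)
print(sorted(index))
```

Keying each vector by PID and section keeps every embedding traceable to the digital object it was derived from.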

At a high level, Node Metadata is:

  • Base metadata for components as provided in RO-Crate through Schema.org

  • RDF graph of functional types, contributors (ORCID) and funders (ROR)

  • Content embeddings over text based digital objects

  • Traceability of change (who, what, when)

Example:

Once we have the RDF graph of a Node and embeddings over its digital objects, we can query the vector database to return all Nodes related to COVID19. And because we know the relationships between Node components, we can run more advanced queries, for instance returning all open datasets or code repositories related to COVID19.
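A rough sketch of such a combined query, assuming each component record carries both a functional type and an embedding. The PIDs, the three-dimensional vectors, the query vector, and the similarity threshold are made-up values for illustration; a real deployment would query the vector database instead of a Python list.

```python
import math

# Hypothetical records: (component PID, functional type, made-up embedding).
COMPONENTS = [
    ("pid-1/data.csv", "dataset", [0.9, 0.1, 0.0]),
    ("pid-1/report.pdf", "research report", [0.8, 0.2, 0.1]),
    ("pid-2/code", "code repository", [0.7, 0.3, 0.0]),
    ("pid-3/data.csv", "dataset", [0.0, 0.1, 0.9]),
]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec: list[float], wanted_types: set[str], threshold: float = 0.8) -> list[str]:
    """Filter by functional type first, then rank the rest by similarity."""
    hits = [
        (cosine(query_vec, vec), pid)
        for pid, ftype, vec in COMPONENTS
        if ftype in wanted_types
    ]
    return [pid for score, pid in sorted(hits, reverse=True) if score >= threshold]


covid_query = [1.0, 0.0, 0.0]  # stand-in for the embedding of "COVID19"
results = search(covid_query, {"dataset", "code repository"})
print(results)
```

The type filter is what the graph contributes: similarity alone would also return the research report, which this query does not ask for.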

Metadata Model

Our base metadata model can be seen in our open source code on GitHub. We use an RO-Crate implementation that has been modified to allow for IPLD and CIDs. The manifest files underpinning individual dPIDs can be viewed through the following link structures:

https://[prefix].dpid.org/[#]?jsonld

https://[prefix].dpid.org/[#]?raw
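A small helper that fills in the two link structures above, assuming `[#]` stands for the numeric dPID and `[prefix]` for the resolver prefix. The function name and the prefix value used in the call are hypothetical placeholders; no request is sent.

```python
VIEWS = ("jsonld", "raw")  # the two query flags shown in the link structures


def manifest_url(prefix: str, dpid: int, view: str) -> str:
    """Build a manifest URL for a dPID in the requested view."""
    if view not in VIEWS:
        raise ValueError(f"view must be one of {VIEWS}, got {view!r}")
    if dpid < 0:
        raise ValueError("dpid must be a non-negative integer")
    return f"https://{prefix}.dpid.org/{dpid}?{view}"


# "myprefix" is a placeholder, not a real prefix.
print(manifest_url("myprefix", 46, "jsonld"))
print(manifest_url("myprefix", 46, "raw"))
```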

Metadata Collected

The metadata we collect can generally be split along three criteria (level of generation, user vs machine generation, required vs optional field), which are shown in the table below. Note that this data model is still undergoing revisions (specifically the JSON -> RO-Crate Transformer).

| Metadata | Level | User vs Machine | Required vs Optional |
| --- | --- | --- | --- |
| Author(s) Name | Application | User | Optional |
| Author(s) ORCID | Application | User | Optional |
| Author(s) Google Scholar Profile | Application | User | Optional |
| Node Title | Node | User | Required |
| Field of Science | Node | User | Required |
| License Type (Default) | Node | User | Required |
| Version Number | Version | Machine | Required |
| Publishing Timestamp | Version | Machine | Required |
| Semantic Type | All Components | Machine | Required |
| Functional Type | All Components | User | Required |
| License Type (Component) | All Components | User | Optional |
| Keywords | All Components | User | Optional |
| Descriptions | All Components | User | Optional |
| Vectorized Embeddings | Text Components | Machine | Required |
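As an illustration of how the three criteria could drive a pre-publication check, the sketch below encodes a subset of the table rows and reports which required, user-entered fields are still empty. The `FieldSpec` class, the function name, and the draft record are hypothetical and do not reflect the actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    level: str      # Application, Node, Version, All Components, Text Components
    source: str     # "User" or "Machine"
    required: bool


# Subset of the table rows above.
FIELDS = {
    "Node Title": FieldSpec("Node", "User", True),
    "Field of Science": FieldSpec("Node", "User", True),
    "License Type (Default)": FieldSpec("Node", "User", True),
    "Functional Type": FieldSpec("All Components", "User", True),
    "Keywords": FieldSpec("All Components", "User", False),
    "Version Number": FieldSpec("Version", "Machine", True),
}


def missing_user_fields(record: dict[str, str]) -> list[str]:
    """List required fields that the user must still fill in."""
    return [
        name
        for name, spec in FIELDS.items()
        if spec.required and spec.source == "User" and not record.get(name)
    ]


draft = {"Node Title": "Example Node", "Keywords": "demo"}
print(missing_user_fields(draft))
```

Machine-generated fields such as the version number are skipped on purpose, since the user cannot be asked to supply them.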

User Entered, Required Metadata

While we try our best to minimize the workload on a scientist to FAIRify their work, two metadata fields (license and functional type) require user entry at the component level. At scale this can be quite time-consuming. In both cases, we use a combination of interfaces and inheritance to streamline the data publishing experience for users.

  • License agreement: Without a clearly defined license agreement, reuse of the digital objects is de facto blocked, so a license is enforced. When you create a Node, you choose a default license agreement that applies to all components of the Node. The default is overridden for a component when you manually change that component's license, or when it conflicts with a license retrieved from an API call (e.g., if you select CC BY and your code on GitHub is under MIT, then MIT overrides CC BY for your code component). All Node metadata is licensed under CC0, and the metadata remains available even if the underlying data has been deleted, intentionally or not.

  • Functional type: The functional type refers to the nature of the digital object. For instance, "research report" and "code repository" are functional types. We use the functional types to create the RDF graph of your Node. In the future, it should be possible to infer functional types from content embeddings.
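The license inheritance rule can be sketched as a small resolution function. The function and parameter names are hypothetical. The text does not say whether a manual change or an API-detected license takes priority when both exist; this sketch assumes the manual choice is checked first.

```python
from typing import Optional


def resolve_license(
    node_default: str,
    manual: Optional[str] = None,
    detected: Optional[str] = None,
) -> str:
    """Pick the effective license for one component."""
    if manual:
        # Assumption: an explicit user choice wins over everything else.
        return manual
    if detected and detected != node_default:
        # A license found via an API call that conflicts with the default.
        return detected
    return node_default


# The example from the text: Node default CC BY, code on GitHub under MIT.
code_license = resolve_license("CC BY", detected="MIT")
print(code_license)
```

Components with no manual change and no detected license simply inherit the Node default, which is what keeps the per-component workload low.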

FIP and Metadata Permanence Plan

DeSci Labs is working hand in hand with the GoFAIR Foundation to create a machine-actionable FIP. The current FIP Draft can be seen on FAIRConnect.

Future Metadata Efforts

Enhanced embeddings

For all digital objects in a research object, we generate embeddings using OpenAI's API. These embeddings are stored in a vector database for efficient retrieval and updates, and are mapped to the PIDs of the digital objects. For manuscripts, we generate multiple runs of embeddings on the sections of the research report extracted from the PDFs.

Machine-extracted metadata

We would like to extract authors, affiliations and grant funding information from research reports to expedite the process of adding co-authors to a Node. This means automating the matching of ORCID IDs and ROR PIDs. Time permitting, we may look into the concept of keyword and semantic extraction.
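One building block for automated ORCID matching is rejecting malformed identifiers before any lookup. ORCID iDs end in a check character computed with ISO 7064 MOD 11-2, which the sketch below verifies. This is only a syntactic check under that assumption about the workflow; it does not confirm that the iD belongs to a given author, and the function names are hypothetical.

```python
def orcid_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, character set, and check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    if not all(c in "0123456789" for c in compact[:15]):
        return False
    return orcid_check_char(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # a well-known sample iD
print(is_valid_orcid("0000-0002-1825-0098"))  # last character altered
```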

Community defined vocabulary integration

Tools like CEDAR workbench provide scientists with workflows to create machine actionable metadata through the semantic web. We will be evaluating CEDAR and similar tools in the future with the possibility of integration in mind.

Publishing RDF metadata in addition to JSON and JSON-LD

RDF may well become a mandate in the FAIR ecosystem. Nodes currently publish through JSON-LD (as RO-Crate uses Schema.org). Conversion between JSON-LD and RDF is easy to do but difficult to do well, and it is our opinion that bad metadata is worse than no metadata. When effective, well-maintained, and community-accepted means of converting from JSON-LD to RDF become available, we will investigate how to integrate them.
