User Guide: DeSci Nodes v1.0 [Capybara]

Metadata

RDF graphs • Embeddings • Traceability

Last updated 1 year ago

Machine Actionable Metadata

DeSci Labs aims to build machine actionable metadata in line with the FAIR principles. Our intent is to be the gold standard of Red Principle implementation while satisfying Blue Principles through widely used, standardized vocabularies. This is achieved through a combination of two methods of machine actionability.

RDF graphs through the semantic web

Research Objects are invaluable for machine-generated metadata regarding semantic types. Nodes are RDF graphs at their core. The Node RDF graph maps connections between digital objects that cannot be inferred from machine-generated metadata because, in most cases, there is little to no mutual information between the embeddings derived from these components in the absence of user-entered metadata records.

For example, there is no way to know that a particular .csv dataset file is linked to a given research paper in the absence of human-generated metadata annotating the .csv file and telling the machine that "this is the data of paper X". By adding components with their semantic types to your Node, we create such a relational mapping:

Data.csv -> isOpenDataComponentOf -> ResearchReport.pdf.
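
As a minimal sketch of the relational mapping above, the relationship can be modelled as a plain (subject, predicate, object) triple. The `Triple` type, the `graph` list and the `subjects_of` helper are hypothetical names used only for illustration and are not taken from the DeSci Nodes codebase; only the predicate `isOpenDataComponentOf` and the two file names come from the example above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One edge of a Node graph: subject -> predicate -> object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical in-memory graph; a real Node would serialize this as linked data.
graph = [
    Triple("Data.csv", "isOpenDataComponentOf", "ResearchReport.pdf"),
]


def subjects_of(graph: list[Triple], predicate: str, obj: str) -> list[str]:
    """Return every subject linked to `obj` through `predicate`."""
    return [t.subject for t in graph if t.predicate == predicate and t.obj == obj]


# "Which components are the open data of ResearchReport.pdf?"
datasets = subjects_of(graph, "isOpenDataComponentOf", "ResearchReport.pdf")
print(datasets)
```

The point of the sketch is that the link between the dataset and the paper exists only because someone stated it; nothing in the two files themselves would let a machine recover it.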

Embeddings as supplementary metadata

In order to enhance machine actionability during FAIR's uptake phase, DeSci Labs automatically generates embeddings using OpenAI's API for all text-based components (manuscripts, preprints, presentations, code repositories, etc.) in a research object. These embeddings are stored in a vector database for efficient retrieval and updates, and mapped to the PIDs of the digital objects. For manuscripts, we generate multiple runs of embeddings on the sections of the research report extracted from the PDFs.

At a high level, Node Metadata is:

  • Base metadata for components as provided in RO-Crate through Schema.org

  • RDF graph of functional types, contributors (ORCID) and funders (ROR)

  • Content embeddings over text based digital objects

  • Traceability of change (who, what, when)

Example:

Once we have the RDF graph of a Node and embeddings over its digital objects, we can query the vector database to return all Nodes related to COVID19. And because we know the relationships between Node components, we can run more advanced queries, for instance returning all open datasets or code repositories related to COVID19.
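
The combined query described in this example can be sketched as a similarity search followed by a filter on the component types recorded in the graph. Everything in the block is an assumption made for illustration: the three-dimensional vectors are hand-written stand-ins for real embeddings, a dictionary stands in for the vector database, the component identifiers are invented, and "dataset" is an assumed functional type label.

```python
import math

# Hypothetical toy data: hand-written 3-d vectors stand in for real embeddings,
# and a dict keyed by component identifier stands in for the vector database.
embeddings = {
    "node1/report.pdf": [0.9, 0.1, 0.0],
    "node1/data.csv": [0.8, 0.2, 0.1],
    "node2/code": [0.1, 0.9, 0.2],
}
# Functional type of each component, as it would be recorded in the Node graph.
functional_type = {
    "node1/report.pdf": "research report",
    "node1/data.csv": "dataset",
    "node2/code": "code repository",
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, wanted_types, threshold=0.8):
    """Components similar to the query AND of one of the wanted functional types."""
    return sorted(
        pid
        for pid, vec in embeddings.items()
        if cosine(query_vec, vec) >= threshold and functional_type[pid] in wanted_types
    )


covid_query = [1.0, 0.0, 0.0]  # pretend embedding of the query "COVID19"
hits = search(covid_query, {"dataset", "code repository"})
print(hits)
```

The embedding step alone would also return the research report, since it is just as close to the query; it is the type information from the graph that narrows the result to datasets and code.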

Metadata Model

Our base metadata model can be seen in our open source code on Github. We use an RO-Crate implementation which has been modified to allow for IPLD and CIDs. The manifest files underpinning individual dPIDs can be seen through the following link structures:

https://[prefix].dpid.org/[#]?jsonld

https://[prefix].dpid.org/[#]?raw

Metadata Collected

The metadata we collect can generally be split along three criteria (level of generation, user vs machine generation, required vs optional field), which are shown in the table below. It should be noted that this data model is still undergoing revisions (specifically the JSON -> RO-Crate Transformer).

| Metadata | Level | User vs Machine | Required vs Optional |
| --- | --- | --- | --- |
| Author(s) Name | Application | User | Optional |
| Author(s) ORCID | Application | User | Optional |
| Author(s) Google Scholar Profile | Application | User | Optional |
| Node Title | Node | User | Required |
| Field of Science | Node | User | Required |
| License Type (Default) | Node | User | Required |
| Version Number | Version | Machine | Required |
| Publishing Timestamp | Version | Machine | Required |
| Semantic Type | All Components | Machine | Required |
| Functional Type | All Components | User | Required |
| License Type (Component) | All Components | User | Optional |
| Keywords | All Components | User | Optional |
| Descriptions | All Components | User | Optional |
| Vectorized Embeddings | Text Components | Machine | Required |

User Entered, Required Metadata

While we try our best to minimize the workload on a scientist to FAIRify their work, two metadata fields (license and functional type) require user entry on a component-by-component level. At scale this can be quite time consuming. In both cases, we use a combination of interfaces and inheritance to optimize the data publishing experience for users.

License agreement: The absence of a clearly defined license agreement de facto blocks the re-use of digital objects, so a license is enforced. When you create a Node, you choose a default license agreement that applies to all components of your Node unless it is overridden, either by a license agreement that you set manually or by a conflicting license agreement retrieved from an API call (e.g., if you select CC BY and your code on Github is under MIT, then MIT will override CC BY for your code component). All Node metadata is licensed under CC0, and the metadata remains available even if the underlying data has been deleted, intentionally or not.

Functional type: The functional type refers to the nature of the digital object. For instance, "research report" and "code repository" are functional types. We use the functional types to create the RDF graph of your Node. In the future, it should be possible to infer the functional types based on content embeddings.

FIP and Metadata Permanence Plan

DeSci Labs is working hand in hand with the GoFAIR Foundation to create a machine-actionable FIP. The current FIP Draft can be seen on FAIRConnect.

Future Metadata Efforts

Enhanced embeddings

For all digital objects in a research object, we generate embeddings using OpenAI's API. These embeddings are stored in a vector database for efficient retrieval and updates, and mapped to the PIDs of the digital objects. For manuscripts, we generate multiple runs of embeddings on the sections of the research report extracted from the PDFs.

Machine-extracted metadata

We would like to extract authors, affiliations and grant funding information from research reports to expedite the process of adding co-authors to a Node. This means automating the matching of ORCID IDs and ROR PIDs. Time permitting, we may look into the concept of keyword and semantic extraction.

Community defined vocabulary integration

Tools like the CEDAR workbench provide scientists with workflows to create machine-actionable metadata through the semantic web. We will be evaluating CEDAR and similar tools in the future with the possibility of integration in mind.

Publishing RDF metadata in addition to JSON and JSON-LD

It's very possible that RDF will become a mandate in the FAIR ecosystem. Nodes currently publish through JSON-LD (as RO-Crate uses Schema.org). Conversion between JSON-LD and RDF is easy to do but difficult to do well. It is our opinion that bad metadata is worse than no metadata. When effective, resourced and maintained, community-accepted means of converting from JSON-LD to RDF become available, we will investigate means of integration.
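
The license inheritance described under User Entered, Required Metadata can be sketched as a small resolution function. This is an illustration under stated assumptions, not the Nodes implementation: the function name and parameters are hypothetical, and the order of precedence between a manually set license and an API-retrieved one is assumed, since the text only says that either can override the Node default.

```python
from typing import Optional


def effective_license(
    node_default: str,
    manual_override: Optional[str] = None,
    api_detected: Optional[str] = None,
) -> str:
    """Pick the license that applies to one component.

    Assumed precedence (not confirmed by the source): a manually set
    component license wins, then a license retrieved from an API call,
    then the Node default.
    """
    if manual_override is not None:
        return manual_override
    if api_detected is not None:
        return api_detected
    return node_default


# The example from the text: Node default CC BY, code on Github under MIT.
code_license = effective_license("CC BY", api_detected="MIT")
# A component with no override simply inherits the Node default.
data_license = effective_license("CC BY")
print(code_license, data_license)
```

With inheritance of this kind, the user sets one license per Node and only touches the components that differ from it.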