Persistent Identifiers 101
Persistent Identifiers • DOI • dPID
Last updated
A persistent identifier, or PID, is a unique identifier for a specific object, much like a driver's license or social security number. These identifiers act as long-lasting references to an object. The vast majority of entries in the scientific record are accompanied by a PID, typically a Digital Object Identifier (DOI). A DOI is a PID that resolves to the publisher's page containing the resource in question.
Digital object identifiers (DOIs) emerged in response to the need for cataloguing and interoperability between scientific publishers. They fulfilled important business priorities:
Cataloguing ownership of copyright
Metadata for digital distribution management and content repurposing
Maintaining control over accessing the data (e.g. through paywalls)
Preserving stable URLs when, following an acquisition, content is transferred to new IT infrastructure owned by a different entity
These business constraints came with substantial tradeoffs: a DOI rests on a social contract between the maintainer of the registry's lookup table and the registrant, who promises to maintain the URL to their proprietary server. As a result, DOIs are neither truly 'persistent' nor securely mapped to their underlying content, which is a tremendous obstacle to the goals set by the FAIR principles.
Though the DOI system was one of the best solutions available at the time, its shortcomings include:
Not persistent: content can change, either intentionally or not. There is no versioning schema for DOIs. DOIs need to be crawled for broken links and are expensive to maintain.
Fragmented: a separate PID has to be minted for every digital object. This is inefficient and fragments our knowledge graphs.
Inconsistent resolution: DOIs do not resolve in a consistent way, which makes machine readability extremely arduous.
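The first shortcoming can be illustrated with a toy model. The sketch below is not a real DOI resolver: the `registry` and `publisher_server` dictionaries, the DOI `10.1234/example`, and the URL are all hypothetical stand-ins for the registry's lookup table and the registrant's server. It only shows why an identifier that points at a location, rather than at content, can change or break without anyone noticing.

```python
import hashlib

# Hypothetical stand-ins: a registry lookup table (DOI -> URL) and a
# publisher's server (URL -> bytes). Neither models a real service.
registry = {"10.1234/example": "https://publisher.example/articles/42"}
publisher_server = {
    "https://publisher.example/articles/42": b"Results: effect size 0.8",
}


def resolve(doi):
    """Follow the two hops a DOI depends on; return None if either is broken."""
    url = registry.get(doi)
    return publisher_server.get(url)


def fingerprint(content):
    """SHA-256 digest of the bytes, used here to detect silent changes."""
    return hashlib.sha256(content).hexdigest()


doi = "10.1234/example"
before = fingerprint(resolve(doi))

# The publisher edits the page; the DOI itself stays exactly the same.
publisher_server[registry[doi]] = b"Results: effect size 0.3"
after = fingerprint(resolve(doi))
content_changed = before != after  # same identifier, different content

# The publisher migrates infrastructure without updating the registry.
del publisher_server[registry[doi]]
link_broken = resolve(doi) is None  # the identifier now points nowhere
```

In this model nothing about the identifier reveals either event, which is why such links have to be crawled and compared against an external record to detect change or breakage.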
Fast forward twenty years, and the DOI is used almost universally; it is the de facto primary key of the scientific record.
This has an effect on the incentives of publishers: Why pay for proprietary storage infrastructure when access cannot be monetised? Why incur high maintenance costs on data cataloguing?
With these large systemic changes in mind, there is a window of opportunity to fix the primary key of the scientific record and rethink its architectural requirements from first principles. DeSci Lab is building a PID system that lies at the heart of our vision for a truly open and decentralized repository for knowledge and aims to solve four main challenges:
We can preserve the DOI system as the topmost overlay on this new PID system, essentially augmenting the DOI system without changing it fundamentally. Compatibility is important because we want to prevent the proliferation of standards, preserve familiarity, and lower adoption costs.
In the meantime, the content monetisation strategy of the industry and the needs of primary research content consumers have radically shifted. We live in the era of open access, and the release of the and have called into question the business imperatives of gating content access.
Simultaneously, there is increasing demand for access to interactive research files such as models, datasets and notebooks. Because this demand often comes from analysts who want to reuse the information, access formats that make reuse easy are popular, for example convenient web apps or API calls integrated into the researcher's computational workflow.
Transitioning to , largely driven by labour costs associated with the personalised IT infrastructure to organise content and gate content access.
New requirements on the horizon, such as FAIR data storage, high-quality metadata, machine resolution of PIDs, and cloud computing, are certain to lead to soaring IT costs, which will aggravate the industry's ownership consolidation by favouring large players over small and medium operations. These costs will inevitably be passed on to researchers and funders.
Providing value to research communities: PIDs that are designed to be . Send programs over to the data, or import the content of PIDs directly into your workflows such as code, data, and .
Solving for data security: PIDs linked to content identifiers (CIDs) that preserve their underlying data and metadata. These PIDs do not rely on a social contract; rather, they are permanently registered on a distributed ledger.
Solving for artefact fragmentation: PIDs specifically designed for . A single PID can secure an entire linked data graph, and resolve securely to any .
Solving for inconsistent resolution: PIDs that resolve to a distributed, open, and in a predictable and machine-actionable way.
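To make the idea of a PID bound to its content more concrete, here is a minimal sketch under stated assumptions. It is not the DeSci implementation: `ToyRegistry`, `content_id`, and the PID `pid-1` are hypothetical names, a hex SHA-256 digest stands in for a real CID, and an in-memory dictionary stands in for both the ledger and the storage layer. The point is only that each version is recorded by a digest of its bytes, so a version can be added but never silently replaced.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Stand-in for a CID: the hex SHA-256 digest of the content."""
    return hashlib.sha256(data).hexdigest()


class ToyRegistry:
    """Hypothetical append-only registry: PID -> ordered list of content IDs."""

    def __init__(self):
        self._versions = {}  # pid -> [content_id, ...]
        self._store = {}     # content_id -> bytes (stand-in for storage)

    def publish(self, pid, data):
        """Record a new version; earlier versions are never overwritten."""
        cid = content_id(data)
        self._store[cid] = data
        self._versions.setdefault(pid, []).append(cid)
        return len(self._versions[pid]) - 1  # index of the new version

    def resolve(self, pid, version=-1):
        """Fetch a version and verify the bytes against the recorded ID."""
        cid = self._versions[pid][version]
        data = self._store[cid]
        if content_id(data) != cid:
            raise ValueError("content does not match its identifier")
        return data


reg = ToyRegistry()
v0 = reg.publish("pid-1", b"dataset v1")
v1 = reg.publish("pid-1", b"dataset v2")
latest = reg.resolve("pid-1")       # newest version by default
original = reg.resolve("pid-1", 0)  # earlier versions stay addressable
```

Because the identifier is derived from the bytes themselves, anyone who retrieves the data can recompute the digest and check it, so trust does not depend on whoever happens to host the file.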
You can learn more about our PID system and the Open State Repository.