Cool DOIs in Python
Posted on 12 September 2023
A DOI (digital object identifier) is a persistent identifier used to uniquely identify an object (usually a document or data set). DOIs are typically presented as a link consisting of a proxy, a prefix and a suffix, for example:
https://doi.org/10.1017/9781108778039
The proxy (here, https://doi.org/) is the location of a server that will resolve the DOI to the correct online location of the resource. The prefix (here, 10.1017) is assigned to an organization by a DOI registration agency such as CrossRef or DataCite; it forms a namespace that ensures that DOIs are globally unique. The suffix (here, 9781108778039) is chosen by the registrant and can, in principle, be almost anything (here it seems to be an ISBN). However, there is an increasing consensus, outlined in a DataCite blog article, that suffixes should be chosen to be:
- Opaque: that is, include no content that could be mistaken for semantic information (e.g. version numbers, institution names or abbreviations, dates, etc.)
- Web-safe: they should avoid characters that need to be escaped in URLs
- Short, human-readable and resistant to typographical errors (e.g. mistaking the digit 0 for the letter O)
The article recommends generating a random integer and encoding it with the base-32 scheme proposed by Douglas Crockford. DataCite released a tool, cirneco, for generating DOIs in this format, which look like:
https://doi.org/10.61092/drkw-vb9g
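As a rough illustration of this encoding, the sketch below converts between integers and Crockford-style base-32 strings. The function names `encode` and `decode` and the fixed width of 8 are assumptions made for this example and are not taken from cirneco. Decoding follows Crockford's convention of ignoring hyphens and case, and of reading the letter O as 0 and the letters I and L as 1, which is what makes the format forgiving of transcription errors.

```python
# Crockford's base-32 alphabet: digits plus letters, without I, L, O and U.
SYMBOLS = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
DECODE = {c: i for i, c in enumerate(SYMBOLS)}
# Look-alike letters are read as the digits they resemble.
DECODE.update({"O": 0, "I": 1, "L": 1})


def encode(n, width=8):
    """Encode a non-negative integer as a zero-padded base-32 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = ""
    while n:
        n, r = divmod(n, 32)
        digits = SYMBOLS[r] + digits
    return digits.rjust(width, "0")


def decode(text):
    """Decode a base-32 string, ignoring hyphens and letter case."""
    n = 0
    for c in text.upper().replace("-", ""):
        n = n * 32 + DECODE[c]
    return n


# A mistyped "O0Il" decodes to the same integer as "0011".
print(decode("O0Il") == decode("0011"))
# Round trip of the suffix shown above (hyphen removed by decode).
print(encode(decode("drkw-vb9g")).lower())
```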
However, the cirneco tool is written in Ruby. The code below implements the Cool DOI principles to generate random DOIs in Python. Since it doesn't include a checksum character, all eight characters of the suffix are random, giving a pool of $32^8 \approx 1.1$ trillion DOIs to draw from. It depends only on the Python standard library.
An online service implementing this code is also available on this site.
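Before the generator itself, a quick aside on the size of that pool. The sketch below checks the $32^8$ figure and estimates how likely it is that a batch of randomly drawn suffixes contains a repeat, using the standard birthday-problem approximation $p \approx 1 - e^{-n(n-1)/2N}$. The function name `collision_probability` is hypothetical, and the estimate assumes every suffix is drawn uniformly and independently.

```python
import math

POOL = 32**8  # eight symbols, 32 choices for each


def collision_probability(n, pool=POOL):
    """Approximate chance that n uniform random draws contain a repeat."""
    return -math.expm1(-n * (n - 1) / (2 * pool))


print(f"{POOL:,}")  # 1,099,511,627,776
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} DOIs: {collision_probability(n):.4%}")
```

Under these assumptions a repeat is very unlikely for a few thousand DOIs, but the chance grows to more than a third by about a million, so a registrant minting DOIs in bulk would probably still want to check new suffixes against those already issued.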
import random
class DOIGenerator:
    """A generator for DOIs conforming to the "Cool DOIs" convention.
    Cool DOIs (https://datacite.org/blog/cool-dois/) have the format:
    https://doi.org/<PREFIX>/xxxx-xxxx
    where <PREFIX> is the namespace assigned to an organization by a
    DOI registration agency (e.g. CrossRef or DataCite) and the suffix
    xxxx-xxxx consists of two blocks of characters chosen from an
    alphabet of symbols consisting of the digits 0-9 and letters A-Z
    excluding I, L, O and U. DOIs are case-insensitive.
    """
    symbols = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
    nsymbols = len(symbols)
    encode_dict = {i: c for i, c in enumerate(symbols)}
    def __init__(self, prefix):
        """Initialize a DOIGenerator.
        The DOI prefix must be provided, but no proxy (e.g. https://doi.org/).
        Any trailing slash character will be stripped.
        """
        # endswith() also copes with an empty prefix, where prefix[-1]
        # would raise an IndexError.
        if prefix.endswith("/"):
            prefix = prefix[:-1]
        self.prefix = prefix
    def make_doi(self, include_proxy=False):
        """Return a random DOI, optionally including the proxy."""
        n = random.randrange(32**8)
        suffix = ""
        while n > 0:
            r = n % DOIGenerator.nsymbols
            n //= DOIGenerator.nsymbols
            suffix = DOIGenerator.encode_dict[r] + suffix
        suffix = f"{suffix:>08}"
        suffix = suffix[:4] + "-" + suffix[4:]
        doi = f"{self.prefix}/{suffix}".lower()
        if include_proxy:
            return "https://doi.org/" + doi
        return doi
    def make_dois(self, ndois, include_proxy=False):
        """A generator yielding ndois DOIs, optionally with the proxy."""
        for _ in range(ndois):
            yield self.make_doi(include_proxy)
if __name__ == "__main__":
    # Test the DOI generator.
    doi_generator = DOIGenerator("10.61092")
    for doi in doi_generator.make_dois(15, True):
        print(doi)
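The generator above draws each suffix independently, so it does not by itself guarantee that a DOI has not been issued before. One possible way to handle this, sketched below as a standalone example, is to keep a record of issued suffixes and redraw on a clash. The names `random_suffix` and `unique_suffixes` and the `issued` argument are hypothetical; in practice the record of issued DOIs would presumably live in a database rather than in memory.

```python
import random

# Lower-case form of the 32-symbol alphabet used above.
SYMBOLS = "0123456789abcdefghjkmnpqrstvwxyz"


def random_suffix(rng=random):
    """Return one suffix in the xxxx-xxxx format."""
    # Eight independent symbols are equivalent to one integer below 32**8.
    chars = "".join(rng.choice(SYMBOLS) for _ in range(8))
    return f"{chars[:4]}-{chars[4:]}"


def unique_suffixes(count, issued=(), rng=random):
    """Return count distinct suffixes, none of which appear in issued."""
    seen = set(issued)
    fresh = []
    while len(fresh) < count:
        suffix = random_suffix(rng)
        if suffix not in seen:  # redraw on a clash
            seen.add(suffix)
            fresh.append(suffix)
    return fresh


if __name__ == "__main__":
    already_issued = {"drkw-vb9g"}
    for suffix in unique_suffixes(5, already_issued):
        print(f"10.61092/{suffix}")
```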