medrxiv api python – CASRAI Dictionary

The medRxiv API is a free, public REST interface — operated by openRxiv, the non-profit behind medRxiv and bioRxiv — that returns structured JSON or XML metadata for health-sciences preprints: titles, authors, abstracts, DOIs, dates, licences and subject categories. Developers query it by date range, by a fixed number of recent posts, or by a single DOI, with results paginated in batches of 100 via a cursor parameter. There is no API key requirement and no separate registration step.

A preprint, in this context, is a complete but not-yet-peer-reviewed manuscript; the medRxiv API is the machine-readable route to the metadata that describes it, distinct from the full-text PDF/XML mining pipeline hosted separately on Amazon S3.

What is the medRxiv API?
medRxiv API endpoints explained
Rate limits, formats and bulk access
Practical use cases for developers
Answer-first Q&A
Implications and what’s next

What is the medRxiv API?

The medRxiv API is a lightweight metadata service, not a search engine or a full-text repository. It shares its underlying infrastructure with the bioRxiv API — both are addressable via api.biorxiv.org or the medRxiv-branded mirror at api.medrxiv.org — because both preprint servers are co-managed by the same non-profit.

medRxiv and bioRxiv are operated by openRxiv, a non-profit founded by Cold Spring Harbor Laboratory (CSHL), Yale University and BMJ, and funded by contributions from supporters including the Chan Zuckerberg Initiative, Imperial College London and Stanford University. This governance detail matters for developers assessing service continuity: the API is not a commercial product with an SLA but a grant- and institution-funded public good, run under the same operational umbrella that screens and posts the preprints themselves.

medRxiv API endpoints explained

There are two distinct endpoint families: details, which returns preprint metadata, and pubs, which returns published-version linkage. They are easy to confuse, because both accept the same server, interval and cursor parameters but return different content.

Endpoint	Purpose	Format
`api.medrxiv.org/details/[server]/[interval]/[cursor]/[format]`	Preprint metadata — title, authors, abstract, DOI, posting date, category, licence	json or xml
`api.medrxiv.org/details/[server]/[DOI]/na/[format]`	Single-manuscript lookup by DOI	json or xml
`api.biorxiv.org/pubs/[server]/[interval]/[cursor]`	Published-version linkage — which journal, when, and the published DOI	json

server takes the value medrxiv or biorxiv. interval accepts three forms: two YYYY-MM-DD dates separated by a slash; a plain number for the N most recent posts; or a number suffixed d for the most recent N days. cursor defaults to 0 and advances in steps of 100, matching the fixed page size medRxiv documents for both endpoint families.
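The three interval forms and the DOI lookup can be sketched as small URL builders. This is a minimal illustration rather than an official client: the function names are hypothetical, and the host and path layout are taken from the endpoint table above.

```python
from datetime import date

BASE_URL = "https://api.medrxiv.org"  # host named in the endpoint table


def details_url(interval: str, cursor: int = 0,
                server: str = "medrxiv", fmt: str = "json") -> str:
    """Build a details URL from an interval string and a cursor offset."""
    return f"{BASE_URL}/details/{server}/{interval}/{cursor}/{fmt}"


def date_range(start: date, end: date) -> str:
    """Interval form 1: two YYYY-MM-DD dates separated by a slash."""
    return f"{start.isoformat()}/{end.isoformat()}"


def recent_posts(n: int) -> str:
    """Interval form 2: a plain number for the N most recent posts."""
    return str(n)


def recent_days(n: int) -> str:
    """Interval form 3: a number suffixed with d for the last N days."""
    return f"{n}d"


def doi_url(doi: str, server: str = "medrxiv", fmt: str = "json") -> str:
    """Single-manuscript lookup: the DOI replaces the interval, 'na' the cursor."""
    return f"{BASE_URL}/details/{server}/{doi}/na/{fmt}"


print(details_url(date_range(date(2020, 3, 21), date(2020, 3, 24)), cursor=45))
print(details_url(recent_posts(25)))
print(details_url(recent_days(10)))
```

Keeping the interval as a plain string means the same builder serves all three forms, and switching to bioRxiv is a matter of passing server="biorxiv".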

A working example: https://api.medrxiv.org/details/medrxiv/2020-03-21/2020-03-24/45 returns up to 100 medRxiv records posted in that window, starting at cursor offset 45. As the example shows, the cursor is an offset into the result set and need not be a multiple of 100. Each response includes a messages array reporting the cursor position and total count, which developers should use to drive pagination rather than hard-coding offsets.
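Pagination driven by the response itself might look like the sketch below. It assumes the records arrive under a collection key and that messages[0] carries a total field; both names should be checked against a live response. The fetch function is injected so the loop can be exercised without network access, and fetch_medrxiv_page is a hypothetical helper, not part of any library.

```python
from typing import Callable, Iterator


def iter_records(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every record, advancing the cursor by what each page returned."""
    cursor = 0
    while True:
        page = fetch_page(cursor)
        records = page.get("collection", [])  # assumed key for the record list
        if not records:
            break  # an empty page means there is nothing left to read
        yield from records
        cursor += len(records)  # no hard-coded page size
        messages = page.get("messages") or [{}]
        total = int(messages[0].get("total", 0))  # may arrive as a string
        if total and cursor >= total:
            break  # reported total reached, skip the extra empty request


def fetch_medrxiv_page(cursor: int) -> dict:
    """Hypothetical fetcher for one fixed date window (not called here)."""
    import requests  # third-party; imported lazily so the sketch loads without it

    url = f"https://api.medrxiv.org/details/medrxiv/2020-03-21/2020-03-24/{cursor}/json"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()
```

Calling list(iter_records(fetch_medrxiv_page)) would then walk the whole window one page at a time. Advancing by the number of records actually returned keeps the loop correct even if the page size ever changes.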

Rate limits, formats and bulk access

medRxiv does not publish a numeric requests-per-second quota for the metadata API, and it documents no rate-limit header. In practice, the cap of 100 records per cursor-paginated call is the only built-in throttle. Developers building high-volume harvesters should paginate sequentially rather than firing parallel requests, since the API is a shared public resource funded by academic and philanthropic contributions, not a commercially provisioned endpoint.
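Because no quota is documented, a harvester has to choose its own pacing. The sketch below wraps a single page fetch in a retry with exponential backoff. The attempt count and delays are illustrative assumptions, not values recommended by medRxiv, and call_with_backoff is a hypothetical helper name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn once at a time, pausing longer after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # in real use, narrow this to network/HTTP errors
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
    raise ValueError("attempts must be at least 1")
```

Each page is requested only after the previous one has succeeded or exhausted its retries, which keeps the harvester sequential. The sleep function is a parameter so the policy can be tested without waiting.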

For anything beyond metadata, meaning full-text PDF and XML mining at scale, medRxiv explicitly separates that workload onto a dedicated Amazon S3 requester-pays bucket (s3://medrxiv-src-monthly) rather than serving it through the JSON API. This is a deliberate architectural boundary: the metadata API is for discovery and linkage, and S3 is for bulk text and data mining (TDM). Pointing a bulk workload at the metadata endpoints is a common source of developer friction.

Metadata endpoints return JSON (default) or XML in OAI-PMH format.
Bulk full-text files are delivered monthly as MECA-format zip packages containing manifest XML, full-text XML and PDF.
All preprints are permanently archived at Portico, independent of the live API.
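Listing the requester-pays bucket mentioned above could be sketched as follows, assuming a boto3 S3 client is passed in. The bucket name comes from the text; the prefix layout inside the bucket is not described here, so it is left as a caller-supplied parameter. Requester-pays means the caller's AWS account is billed for requests and transfer.

```python
from typing import Iterator, Tuple


def iter_keys(s3_client, bucket: str = "medrxiv-src-monthly",
              prefix: str = "") -> Iterator[Tuple[str, int]]:
    """Yield (key, size_in_bytes) for every object under the prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        RequestPayer="requester",  # acknowledge that the caller pays
    )
    for page in pages:
        # Pages for an empty prefix carry no "Contents" entry at all.
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]


# Usage sketch (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# for key, size in iter_keys(boto3.client("s3")):
#     print(key, size)
```

Taking the client as an argument keeps credentials and region configuration outside the function and lets the listing logic be checked with a stand-in object.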

Practical use cases for developers

The API’s realistic use cases cluster around discovery, linkage and bibliometrics rather than full-text analysis:

Systematic review screening — pulling all preprints in a date range and subject category to feed a title/abstract screening pipeline before formal database searches.
Publication-linkage tracking — using the /pubs/ endpoint to detect when a tracked preprint has since been formally published, and in which journal.
Institutional repository harvesting — research offices ingesting metadata for preprints affiliated with their institution to populate internal dashboards.
Bibliometric and disease-surveillance research — reproducing analyses of preprinting velocity by subject category, a pattern widely used during infectious-disease outbreaks.
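Two of these use cases reduce to a few lines once the records are in memory. The sketch below assumes each record is a dict with doi, category and published fields, and that published holds the sentinel "NA" until a journal version exists. These field names and the sentinel are assumptions to verify against a real response, and the sample records are placeholders rather than real preprints.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_by_category(records: Iterable[dict]) -> Counter:
    """Tally preprints per subject category, normalising case and spacing."""
    return Counter(
        (r.get("category") or "unknown").strip().lower() for r in records
    )


def published_versions(records: Iterable[dict]) -> Dict[str, str]:
    """Map preprint DOI to published DOI for records that report one."""
    return {
        r["doi"]: r["published"]
        for r in records
        if r.get("published") not in (None, "", "NA")  # assumed sentinel values
    }


# Placeholder records for illustration only.
sample: List[dict] = [
    {"doi": "10.1101/placeholder-1", "category": "Epidemiology", "published": "NA"},
    {"doi": "10.1101/placeholder-2", "category": "epidemiology ",
     "published": "10.1000/journal-placeholder"},
    {"doi": "10.1101/placeholder-3", "category": "", "published": "NA"},
]

print(count_by_category(sample).most_common())
print(published_versions(sample))
```

For screening pipelines, the same normalised category value can serve as a filter before abstracts are passed to reviewers.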

A minimal Python request against the details endpoint needs no authentication:

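import requests

# Metadata for medRxiv preprints posted in the last 10 days, first page, as JSON.
url = "https://api.medrxiv.org/details/medrxiv/10d/0/json"
resp = requests.get(url, timeout=30)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
data = resp.json()

# messages[0] carries the paging summary, including the total match count.
print(data["messages"][0]["total"], "preprints in the last 10 days")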

The R package medrxivr (rOpenSci) wraps the same endpoints for R users who prefer not to handle pagination and JSON parsing directly.

Answer-first Q&A

Does medRxiv have a public API?

Yes. medRxiv provides a free, unrestricted public API at api.medrxiv.org and api.biorxiv.org, returning JSON or XML preprint metadata by date range, recent-post count, or DOI. No API key or account is required, and the same infrastructure also serves bioRxiv metadata.

What is the difference between the bioRxiv and medRxiv APIs?

They are the same API distinguished only by a server parameter set to medrxiv or biorxiv. Both preprint servers are co-managed by openRxiv and share identical endpoint syntax, pagination and response schemas, though each server applies separate scope and screening policies to submissions.

How do I cite a medRxiv preprint retrieved via the API?

Cite the preprint by its DOI, exactly as medRxiv’s own guidance specifies: author names, year, title, and “medRxiv doi: 10.1101/…”. For a specific revision, append the version-specific URL, because every version shares the same DOI and differs only by the version suffix in its address.
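A citation string in that order can be assembled from a metadata record, as in the sketch below. It assumes the record exposes authors, date, title, doi and version fields, and that version-specific pages follow the pattern www.medrxiv.org/content/DOIvN. Both are assumptions to confirm against a real record, and the punctuation is illustrative rather than medRxiv's prescribed style.

```python
CONTENT_BASE = "https://www.medrxiv.org/content"  # assumed address pattern


def format_citation(record: dict, include_version_url: bool = False) -> str:
    """Build 'authors (year). title. medRxiv doi: ...' from one record."""
    authors = record["authors"].strip()
    year = record["date"][:4]  # dates are assumed to be YYYY-MM-DD
    title = record["title"].strip().rstrip(".")
    citation = f"{authors} ({year}). {title}. medRxiv doi: {record['doi']}"
    if include_version_url:
        # The DOI is shared by all versions; the vN suffix pins one revision.
        citation += f" {CONTENT_BASE}/{record['doi']}v{record['version']}"
    return citation


# Placeholder record for illustration only.
example = {
    "authors": "Doe, J.; Roe, R.",
    "date": "2020-03-22",
    "title": "An example preprint title.",
    "doi": "10.1101/0000.00.00.00000000",
    "version": "2",
}
print(format_citation(example))
print(format_citation(example, include_version_url=True))
```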

Does medRxiv have an Impact Factor?

No. medRxiv’s own FAQ states plainly: “medRxiv is not a journal and so it has no Impact Factor.” Metadata pulled via the API includes posting dates, categories and licences, but never a journal-style Impact Factor field, because none exists for preprints on the server.

Implications and what’s next

For institutions and research administration offices, the practical implication is that medRxiv metadata is genuinely free to harvest at scale for compliance dashboards, preprint-to-publication tracking, and REF-style output monitoring — no licensing negotiation is required, unlike many commercial bibliographic APIs. The trade-off is that developers must build their own resilience: there is no documented rate-limit contract, no SLA, and no formal versioning notice channel beyond the help pages themselves.

The clearest forward risk is architectural drift rather than access restriction. Because the API is maintained by a philanthropically funded non-profit rather than a commercial vendor, integrations should be built defensively: drive pagination from the documented cursor and the counts each response reports rather than from assumptions about page size, and route bulk text-mining workloads to the S3 bucket rather than the JSON endpoints, where undocumented load is most likely to strain openRxiv’s infrastructure over time.

medRxiv API: Endpoints, Limits and Use Cases