CORE raises repository data quality by consolidating information from external datasets

CORE is constantly seeking ways to collect available OA resources and make research outputs easy to find, cite, link, assess, and reuse.

CORE has greatly increased the amount of content hosted directly in its database: last year the service provided access to approximately 12 million full texts, and today it hosts 18 million while continuing to enrich its data. Data from repositories often come without basic identifiers such as DOIs and ORCIDs. This makes linking papers in repositories to the published literature, and understanding the relations between them, a non-trivial task.

Over the last year, we have become excited about being able to offer a unique dataset: a dataset of full-text articles spanning pre-prints, reports, grey literature and theses, as well as the best peer-reviewed research papers, from repositories and journals. This dataset is complementary to other major scholarly datasets, including Microsoft Academic Graph (MAG), Crossref (the majority of articles in CORE do not have an equivalent article in Crossref) and ORCID.

We are now pleased to announce that all article metadata from Crossref are linked and integrated in the CORE data. Crossref is a consortium-led initiative that serves as a Digital Object Identifier (DOI) registration authority and contains around 100M metadata documents submitted by more than 4,500 publishers and organisations. More specifically, in an internal project we called MUCC, we have processed and linked data not only from Crossref, but also from MAG, Unpaywall, ORCID and PubMed. One of the key challenges in this work was to merge 130 million CORE records with approximately 100 million Crossref records without a common unique identifier. We approached this by defining, optimising and evaluating a number of heuristics for identifying the same records, balancing precision against performance. The processing was carried out on the big data cluster available at the Open University.
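To make the idea of identifier-free record linking more concrete, here is a minimal sketch of one possible heuristic. It is not the MUCC implementation: the field names (`id`, `title`, `year`, `doi`), the function names and the choice of "normalised title plus year" as the matching key are all assumptions made for illustration. The sketch favours precision by linking a record only when exactly one candidate shares its key.

```python
import re
import unicodedata
from collections import defaultdict


def normalise_title(title: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9]+", " ", ascii_only.lower())
    return cleaned.strip()


def blocking_key(record: dict) -> tuple:
    """Hypothetical heuristic: normalised title plus publication year."""
    return (normalise_title(record["title"]), record.get("year"))


def link_records(core_records: list, crossref_records: list) -> dict:
    """Return {core_id: doi} for CORE records with exactly one candidate."""
    # Index the Crossref side once so each CORE lookup is a dictionary access.
    index = defaultdict(list)
    for rec in crossref_records:
        index[blocking_key(rec)].append(rec["doi"])

    links = {}
    for rec in core_records:
        candidates = index.get(blocking_key(rec), [])
        # Favour precision: skip ambiguous keys rather than guess.
        if len(candidates) == 1:
            links[rec["id"]] = candidates[0]
    return links
```

Indexing one side by key keeps the work roughly linear in the number of records, which matters at the scale of hundreds of millions of records. A stricter or looser key (for example, adding the first author's surname, or allowing fuzzy title matches) would shift the balance between precision and performance.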

Figure 1. The growth of high quality metadata in CORE.

Relevant information from Crossref, such as DOIs and ISSNs, which was not directly accessible from the CORE data before, opens up a wealth of new information to explore; in this way publisher metadata are integrated with and enriched by the data available from repositories. For example, using this process we were able to collect more DOIs than were supplied to CORE by the global repositories network in their metadata. This is not surprising, as DOIs are rarely available from repositories when the metadata record is created prior to the publication date, i.e. prior to the minting of the DOI.

Before this work was completed, CORE had around 25 million articles with a DOI; it now has approximately 81 million. This means both that articles can be disambiguated more easily and that the DOI can be used to enrich an article's metadata record with additional metadata fields.
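As an illustration of how a DOI can act as a join key for enrichment, the following sketch fills in fields that are missing from a repository record using a publisher-side record with the same DOI. The record layout, the function names and the rule "fill gaps but never overwrite" are assumptions for this example, not a description of CORE's pipeline. DOIs are compared case-insensitively, since DOI names are not case-sensitive.

```python
def normalise_doi(doi: str) -> str:
    """Lower-case a DOI and strip common resolver prefixes."""
    doi = doi.strip().lower()
    for prefix in (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    ):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def enrich(record: dict, publisher_by_doi: dict) -> dict:
    """Return a copy of `record` with missing fields filled from publisher data.

    `publisher_by_doi` is a hypothetical mapping from normalised DOI to a
    dictionary of publisher-supplied fields.
    """
    enriched = dict(record)
    doi = record.get("doi")
    if not doi:
        return enriched  # nothing to join on

    extra = publisher_by_doi.get(normalise_doi(doi), {})
    for field, value in extra.items():
        # Only fill gaps; values supplied by the repository are kept as-is.
        if enriched.get(field) in (None, "", []):
            enriched[field] = value
    return enriched
```

Keeping the repository's own values and only filling gaps is one conservative policy; a real pipeline might instead record the provenance of each field or prefer the publisher's value for selected fields.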

How can this help? For example, CORE is already using these DOIs to connect with authors’ ORCID IDs and to help repositories enrich their metadata records. But there are many other use cases where these data are valuable, for example, to connect research outputs to research data, metrics, reader statistics etc. Another advantage of this integration is that these data can further support users in filtering and narrowing their searches while using the CORE Dataset, CORE API, CORE FastSync and the CORE search engine. The result of the integration is that CORE now has more metadata about journals, giving users more options for gathering new insights.

Petr Knoth says: “Our vision was to evolve CORE from just mirroring content from our data providers to being a service that adds value by improving the data quality of content from repositories. Essentially, embedding CORE in the global scholarly knowledge graph.”

We realised this vision by linking to complementary scholarly datasets, which in turn enables metadata validation and enrichment of CORE itself. The result is that an article in CORE is connected not only to one or more institutional repositories, but also to the relevant record in Crossref, ORCID, Microsoft Academic Graph and other available datasets, where applicable.

CORE data

Our further work aims not only to make CORE the world’s largest OA aggregator [1], but also to deliver value-added services by improving data quality, benefiting the scholarly ecosystem.

Our future plan is to feed these enrichments directly into our services, in particular the Repository Dashboard and all of the raw data services. CORE constantly tries to break down technological barriers and be accessible to the widest possible audience.

Petr Knoth, Nancy Pontika, Matteo Cancellieri, David Pride and Catherine Kuliavets


CORE is a not-for-profit service delivered by The Open University and Jisc.

[1] Currently, CORE provides access to 170,873,524 metadata records, together with 24,936,921 free-to-read full-text research records and 17,998,076 full texts hosted directly by CORE.
