CORE becomes the world’s largest aggregator

Blog post by Petr Knoth and Nancy Pontika of the CORE team (Open University) and Balviar Notay (Jisc)

CORE, a global aggregator of full text open access scientific content from repositories and journals, has been growing at a fast pace over the last few months. As of May 2018, CORE has aggregated over 131 million article metadata records, 93 million abstracts, 11 million hosted and validated full texts and over 78 million direct links to research papers hosted on other websites. Our dataset of full text papers has reached 49TB.  CORE is a jointly run service between the Open University and Jisc.

In an effort to see how CORE is doing in comparison to other services and initiatives in this field, we have compared our dataset with other relevant services as indicated in the table below. This shows that CORE has become the world’s largest aggregator according to several criteria.  In addition, CORE is unique in its endeavour to aggregate and expose not only metadata, but also full texts of open access research papers. No other service in our list provides this capability.

See comparison table:  How CORE compares – May 2018 [PDF]

The fact that CORE is the only service in the list that also hosts a large amount of open access documents makes the service particularly important to those interested in text and data analytics or other computational tasks over a large global collection of full texts of research papers. CORE provides access to its large collection of enriched full text content via its public API, through its data dumps and via CORE FastSync (premium API). This means that third-party services built on top of CORE content do not need to deal with the complexity of pulling full text documents from many different places at the time of access, which is non-trivial (and often results in blocked access), slow, error-prone (e.g. if resources move to a different URL) and cannot guarantee service performance. Instead, they can rely on pulling already preprocessed and validated data from CORE using one of the three services mentioned above, and they can be confident that they have access to the widest possible range of open access content.
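As a rough illustration of the first of these routes, the sketch below queries CORE's public search API. The endpoint path, query parameters and response field names are assumptions based on the CORE API (v2) documentation at the time of writing, and a registered API key is required; consult the current API documentation before relying on any of them.

```python
import requests

API_KEY = "YOUR_CORE_API_KEY"  # assumed: obtained by registering for the CORE public API
# Assumed v2 search endpoint; check the current CORE API docs for the exact path.
SEARCH_URL = "https://core.ac.uk/api-v2/articles/search/{query}"

def search_core(query, page=1, page_size=10):
    """Search CORE for articles matching `query` and return the parsed JSON response."""
    resp = requests.get(
        SEARCH_URL.format(query=query),
        params={"page": page, "pageSize": page_size, "apiKey": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    results = search_core("open access aggregation")
    # Assumed response shape: a "data" list of article records with fields such as
    # "title" and "downloadUrl" (the latter present when CORE hosts the full text).
    for record in results.get("data", []):
        print(record.get("title"), "->", record.get("downloadUrl"))
```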

One of the reasons why CORE has been able to put together such a large collection of content is that it supports a wide range of mechanisms for gathering data. For example, while aggregators such as BASE rely on OAI-PMH harvesting, CORE can pull content using OAI-PMH, ResourceSync and custom-built connectors (some of which make use of the CrossRef TDM API) to a variety of publishers as well as subject and preprint repositories.
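For context, the "traditional" OAI-PMH route can be sketched in a few lines using the Sickle library. The repository endpoint below is a placeholder, and this is only an illustration of OAI-PMH harvesting in general, not CORE's internal pipeline.

```python
from sickle import Sickle  # pip install sickle

# Placeholder OAI-PMH base URL of a repository to harvest (illustrative only).
OAI_ENDPOINT = "https://repository.example.org/oai"

sickle = Sickle(OAI_ENDPOINT)

# Incrementally pull Dublin Core metadata records; Sickle follows resumption tokens.
for record in sickle.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True):
    metadata = record.metadata  # dict of lists, e.g. {"title": [...], "identifier": [...]}
    title = metadata.get("title", [""])[0]
    identifiers = metadata.get("identifier", [])
    print(title, identifiers)
```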

Most of all, becoming the world’s largest aggregator can also be seen as a success indicator for CORE. Finding, aggregating and processing full text open access content is not trivial, especially when it needs to be done at scale. We are proud that the growth of content in CORE demonstrates that we are able to meet CORE’s mission to “aggregate all open access articles across relevant data sources worldwide, enrich this content and provide seamless access to it through a set of data services.”

13 replies on “CORE becomes the world’s largest aggregator”

Well done. If you managed to crawl additional unstated hybrid OA PDFs, that’s great! Hybrid OA is unreliable and needs archival. Providing public dumps is also great.

I’m a bit confused by some statements in the table, though.

First, I feel it’s missing SemanticScholar and CiteSeerX, which may host more than 11 million full texts (Google claims exactly 11.2 M for pdfs.semanticscholar.org, a surprising coincidence). Does CORE host any PDFs they don’t have?

Second, I don’t understand what is meant by “records with OA links”. Without some comparison of the deduplication methods, it’s hard to tell the real meaning of such numbers. You claim that BASE has 76M, while their own search finds 56M. Dissemin, which aggregates BASE and Unpaywall with some additional checks and clustering, finds 25M full texts in their data. According to “The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles”, the total number of CrossRef DOIs is about 67M and 19M are OA (even without requiring a license), so it would be truly extraordinary to have 76M OA works.

Third, why does BASE have a no for “Aggregates hybrid gold OA Articles”? They do, via CrossRef OAI-PMH. (But what does “key publishers” mean in this context?)

Fourth, it’s not clear whether your definition of OA considers the copyright license. Do your records track licenses at all? I can’t find a license selector on the advanced search and Google doesn’t find any freely licensed record: https://www.google.com/search?q=site:https://core.ac.uk/display/&as_rights=&tbs=sur:fmc . Tracking the license would also help release dumps under a free license, which can be more easily archived and used.

Thanks for the comments. First of all, CiteSeerX and SemanticScholar are not aggregation systems. CiteSeerX actually crawls the web, focusing primarily on computer science literature. I have heard (but am not 100% sure) that Semantic Scholar has deals with publishers which enable them to index the literature. Their Open Corpus, available here: https://labs.semanticscholar.org/, contains 7.2 million papers from computer science and neuroscience, so it is not multidisciplinary, represents a different sample and is not acquired by harvesting sources, but by crawling.

While it is true that BASE might be harvesting metadata from publishers, we have now clarified that the row “Aggregates full texts of Hybrid Gold Open Access Articles (key publishers)” refers only to aggregation of the full texts of hybrid OA articles, not their metadata.

Your comparison is somewhat biased because you use insider knowledge about CORE and compare that to your outsider knowledge about other services. For example, BASE, like CORE, uses a few “custom built connectors” to access important services that do not speak OAI-PMH. And like CORE, BASE offers data dumps for download, but only for people who have access to BASE’s OAI-PMH endpoint. Call it a premium service if you will.

Thank you for your comment. We have compiled this table from publicly available information. For example, in the case of Unpaywall, we were able to find a page from which dataset dumps can be downloaded. Unfortunately, we were unable to find a similar page for BASE. If you send us a URL to it, we will be able to validate the information and then we would be happy to update the table appropriately.

I was quite impressed with Peter’s CORE presentation at the recent Open Repositories conference, and I think this project has tremendous appeal and could have significant impact. I’ll repeat here the question I asked there, along with another one I forgot to ask. I asked whether the data set had been deduplicated; the reply was that this is planned future work, but that the estimated amount of duplication is about 15%. What I didn’t ask about was disambiguation: is the metadata complete enough that institutional affiliation is at least captured for the authors represented, and has any work been done to normalize/disambiguate authors? If not, is that work planned, or is there discussion that could be shared about how you view this issue?

These are excellent questions. Thank you for your comment. We are currently conducting a project to develop a deduplication service for scholarly articles, which should complete this year, after which we should be able to produce a fairly accurate estimate of this. I would rather not give numbers at this stage while the work is still ongoing. Having said that, all aggregators contain duplicates; even CrossRef has them. Also, the definition of a duplicate is not completely straightforward. For example, a preprint deposited in arXiv and then made available as a post-print in a repository might be roughly, but not exactly, the same article. It is important to note that the proportion of duplicates is also likely to increase in the future due to national policies requiring article deposits in institutional repositories.
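To make that ambiguity concrete, here is a minimal, hypothetical sketch of near-duplicate detection based on normalised titles. It is not CORE’s deduplication service, which is still under development; a production system would also weigh DOIs, author lists and full-text similarity.

```python
import re
from difflib import SequenceMatcher

def normalise_title(title):
    """Lowercase, strip punctuation and collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def likely_duplicates(title_a, title_b, threshold=0.95):
    """Treat two records as probable duplicates if their normalised titles are near-identical."""
    ratio = SequenceMatcher(None, normalise_title(title_a), normalise_title(title_b)).ratio()
    return ratio >= threshold

# A preprint and its repository post-print often differ only slightly in the title.
print(likely_duplicates(
    "CORE: Three Access Levels to Underpin Open Access",
    "CORE - three access levels to underpin open access."
))  # True
```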

With regards to author disambiguation, CORE currently makes use of ORCIDs, which we can recognise from metadata records in repositories. However, we still need to do more work to fully utilise this information across the CORE services, such as in search, where it has not yet been implemented. In future, we would also like to be able to recognise and extract ORCIDs directly from full texts and utilise the ORCID API to automatically enrich records that are missing ORCIDs with this information.
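As a rough sketch of what extracting ORCID iDs from full text might involve, the snippet below matches the standard 16-character ORCID pattern and verifies the ISO 7064 MOD 11-2 check digit. This is an illustration only, not CORE’s implementation; enrichment via the ORCID API would be a separate step.

```python
import re

ORCID_PATTERN = re.compile(r"\b(\d{4}-\d{4}-\d{4}-\d{3}[\dX])\b")

def orcid_checksum_ok(orcid):
    """Validate the final character using the ISO 7064 MOD 11-2 check-digit algorithm."""
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[-1] == expected

def extract_orcids(text):
    """Return all substrings of `text` that look like ORCID iDs and pass the checksum."""
    return [m for m in ORCID_PATTERN.findall(text) if orcid_checksum_ok(m)]

# The two iDs below are the examples used in ORCID's own documentation.
sample = "Author A (ORCID: 0000-0002-1825-0097) and Author B (0000-0001-5109-3700)."
print(extract_orcids(sample))
```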

Thanks for the reply (I’ll note that my comment has not been approved yet).

I’m not sure what qualifies as insider knowledge about BASE: I’m just referring to what is written in their documentation. True, I know BASE more because I’ve used it for longer. I’m now trying to use CORE as often as I use BASE, to understand more about it.

It’s interesting to know about the non-OAI-PMH connectors. It’s true, that’s an important limitation of a “traditional” aggregator like BASE. It’s worth mentioning in your comparison, I think.
