Weeding out duplicates on EPrints – new plug-in makes life easier

Steve Byford writes about this new plug-in.

A new plug-in for EPrints helps you spot, weed out or merge duplicate entries in your repository. It makes it much easier to capture the best information from repeat entries that describe the same article, and to safely discard any that are redundant. The plug-in is available now on the EPrints Bazaar.

To duplicate is human, to de-dup is hard

It’s all too easy to end up with multiple instances of the same article on your institutional repository. Co-authors of the same article may each have uploaded it separately, and you may have received notifications and full text of the same articles from services like Jisc’s Publications Router, or duplicate metadata from other sources. How can you spot that this has happened? And how can you tell if two entries are merely similar but genuinely about different articles?

EPrints already offered some rudimentary functionality to try to deal with this, within the “search issues” section of its administration area, but this is not often used: it’s not well known, and can be difficult to use, requiring an understanding of relatively obscure search syntax.

Finding out what users really need

To work out what functionality repository managers and administrators would find helpful, Jisc commissioned Key Perspectives to do some market research. They worked with staff from a range of institutions that are known to Jisc because they receive content from Publications Router, and so also have experience of capturing content by more than one method.

This research helped identify how best to improve on the “search issues” function and how to make its user interface more intuitive.

Delivering better functionality

Guided by these insights and based on their detailed knowledge of the software, EPrints Services’ developers at Southampton produced a solution that delivered the functionality that users had said would be helpful. The resulting Jisc-funded plug-in offers the following features:

– It provides a straightforward user interface within the administrator’s area, offering simple searching for similar entries across a number of metadata fields, most importantly the article’s title and its DOI.
– Searches return a list of possible duplicate records. From there, you can open a pop-up box in which you can quickly amend or retire a duplicate record.
– You can view summaries of two records side-by-side in a pop-up window to compare any fields that differ between them, and decide which of them you wish to retain or discard.
– If you decide that apparent duplicates are genuinely different, you can flag them as such so that they don’t reappear every time you run further searches.
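
To give a feel for what matching on title and DOI involves, here is a minimal Python sketch. It is not the plug-in's code (the plug-in runs inside EPrints itself), and the record layout, the function names and the normalisation rules are all assumptions made for illustration. It simply groups records whose DOI or title is identical once case, punctuation and resolver prefixes have been stripped.

```python
import re
from collections import defaultdict


def normalise_doi(doi):
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace.

    Assumption: ASCII-only matching is good enough for a sketch;
    accented characters are treated as separators here.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def candidate_duplicates(records):
    """Return groups of record ids that share a normalised DOI or title.

    `records` is a hypothetical list of dicts with 'id', 'title' and 'doi'
    keys; a real repository would supply these from its own metadata fields.
    """
    groups = defaultdict(set)
    for rec in records:
        if rec.get("doi"):
            groups[("doi", normalise_doi(rec["doi"]))].add(rec["id"])
        if rec.get("title"):
            groups[("title", normalise_title(rec["title"]))].add(rec["id"])
    # Only keys shared by more than one record are possible duplicates.
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


records = [
    {"id": 1, "title": "Weeding Out Duplicates!", "doi": "https://doi.org/10.1000/ABC"},
    {"id": 2, "title": "weeding out duplicates", "doi": "10.1000/abc"},
    {"id": 3, "title": "Another paper", "doi": ""},
]
print(candidate_duplicates(records))
```

Exact matching after normalisation only catches the easy cases. Titles that differ by a typo or a subtitle would need fuzzier comparison, which is why a side-by-side review step remains useful.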

Installing the plug-in

You can find out more about the plug-in, including technical information and screenshots of its user interface, on the EPrints wiki at https://wiki.eprints.org/w/Issues2.

The plug-in itself is available for download from the EPrints Bazaar at http://bazaar.eprints.org/523/.

It is good that the Issues2 plug-in offers improved searching and a better UI. On the processing side, however, it has deficiencies:
– It was built for a standard, out-of-the-box EPrints repository. However, many repositories are customized and do not use the standard id_number field for document identifiers such as DOI or PMID. How should those be configured?
– Documentation on configuration is completely missing (it was available for the previous plug-in: https://wiki.eprints.org/w/Issues). A cfg.d/z_issues2.pl file that demonstrates how to override the default settings should also be provided with the Issues2 plug-in.
– Extensibility: new issue conditions now require programming a plug-in instead of writing epscript code in an XML issues definition file. Not everyone is proficient in programming plug-ins for EPrints.
– Design flaw: checking for duplicates based on ISSN is rubbish. There may be hundreds of articles from the same journal in the repository, all sharing the same ISSN, and none of them is a duplicate.


One reply on “Weeding out duplicates on EPrints – new plug-in makes life easier”
