BibMatch Hacking Guide

Matching methods in BibMatch

When BibMatch receives a collection of records to match, it processes each record in turn and attempts to match it against the repository in question. The user specifies which meta-data fields of the input records to compare on, as well as the querystrings used to search the repository for potential matches.

To match effectively and to better ensure that the matching results are accurate, it is important to define proper search queries and meta-data fields to compare. Both should be evaluated based on the content being matched and the state of the repository. For example, thesis records may require a different matching configuration than conference proceedings.

The search results from these queries against the repository are then downloaded and each record in the search result is compared against the original record using specified matching options. The configuration of these methods is explained in the next section.
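The search-then-compare flow described above can be sketched as follows. This is a minimal illustration, not BibMatch's actual code: `match_records`, `search` and `compare` are hypothetical names, with `search` standing in for the repository query and `compare` for the validation of a single search result.

```python
def match_records(records, search, compare):
    """Return a dict mapping each input record's index to its accepted matches.

    `search(record)` is a hypothetical callable that returns candidate
    records from the repository; `compare(record, candidate)` is a
    hypothetical callable that returns True when the two records match.
    """
    results = {}
    for index, record in enumerate(records):
        # Step 1: query the repository for potential matches.
        candidates = search(record)
        # Step 2: keep only the candidates that pass the comparison.
        results[index] = [c for c in candidates if compare(record, c)]
    return results
```

Passing the two steps in as callables keeps the sketch independent of any particular repository or comparison rule.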

The primary matching methods in BibMatch are:

Configuring BibMatch

Matching rule-sets

BibMatch has a default configuration in invenio.conf, but depending on the nature of your MARC specification and content types you may want to create your own configuration, tailored to your data-sets, in invenio-local.conf. This configuration concerns the validation and comparison of search results, which involves specifying rules on which MARC fields to compare and how to compare them between records.

All this is done in invenio.conf, which you can override with invenio-local.conf, under the variable CFG_BIBMATCH_MATCH_VALIDATION_RULESETS. This variable is a list of rule-sets, each having specific rules regarding the MARC fields to compare.

Each rule-set consists of a pattern mapped to a list defining a set of rules, so a rule-set definition comes in two parts:

('980__ \$\$a(THESIS|Thesis)', # Pattern to be matched against record
[{  # If pattern match, each rule is defined like this
 'tags' : '245__%,242__%', # Identical tag strings can replace previous rules
 'threshold' : 0.8,
 'compare_mode' : 'lazy',
 'match_mode' : 'title',
 'result_mode' : 'normal' 
}])
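As a rough illustration of how such a pattern could select rules for a record, the sketch below assumes that a pattern has the form "TAG REGEX" and that field values are serialized as strings like `$$aTHESIS`. Both are assumptions made for this example, and `select_rules` is a hypothetical helper, not a BibMatch function.

```python
import re

# Rule-set list shaped like the example above (raw string keeps the
# backslashes so that "\$\$a" matches a literal "$$a").
RULESETS = [
    (r'980__ \$\$a(THESIS|Thesis)',
     [{'tags': '245__%,242__%',
       'threshold': 0.8,
       'compare_mode': 'lazy',
       'match_mode': 'title',
       'result_mode': 'normal'}]),
]

def select_rules(fields, rulesets):
    """Collect the rules of every rule-set whose pattern matches the record.

    `fields` is a hypothetical record representation: a dict mapping a tag
    such as '980__' to a list of serialized values such as '$$aTHESIS'.
    """
    selected = []
    for pattern, rules in rulesets:
        # Assumed pattern layout: tag, one space, then a regular expression.
        tag, regex = pattern.split(' ', 1)
        if any(re.search(regex, value) for value in fields.get(tag, [])):
            selected.extend(rules)
    return selected
```

With this layout, a record carrying `$$aTHESIS` in field 980 picks up the title rule, while a record without a matching 980 value gets no rules from this rule-set.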

A field is considered matching when all of its subfields or values match.

Another configurable option is CFG_BIBMATCH_FUZZY_MATCH_VALIDATION_LIMIT, which determines the minimum percentage of rules that must be positively matched when comparing two records. If the ratio of matched rules is above or equal to this limit, the match is considered fuzzy.
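The effect of this limit can be sketched as below. The function name `classify`, the limit value 0.65, and the treatment of a full set of matched rules as an exact match are assumptions made for illustration only.

```python
def classify(rule_outcomes, fuzzy_limit=0.65):
    """Classify a comparison from a list of per-rule True/False outcomes.

    `fuzzy_limit` plays the role of CFG_BIBMATCH_FUZZY_MATCH_VALIDATION_LIMIT;
    the value 0.65 is only an illustrative choice.
    """
    if not rule_outcomes:
        return 'no-match'
    matched = sum(1 for ok in rule_outcomes if ok)
    ratio = matched / float(len(rule_outcomes))
    if ratio == 1.0:
        # Assumption: every rule matching counts as an exact match.
        return 'match'
    if ratio >= fuzzy_limit:
        return 'fuzzy'
    return 'no-match'
```

For example, two matched rules out of three give a ratio of about 0.67, which is above the illustrative limit and would therefore be reported as fuzzy.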

CFG_BIBMATCH_SEARCH_RESULT_MATCH_LIMIT determines the maximum number of search results a single search can return before it is treated as a non-match. It is advised to keep this number fairly low (10-15) for time and performance reasons.
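A minimal sketch of this cut-off, assuming a hypothetical helper `usable_candidates` and a limit of 15 taken from the advised range:

```python
def usable_candidates(search_results, limit=15):
    """Return the candidates to compare, or an empty list if there are too many.

    `limit` plays the role of CFG_BIBMATCH_SEARCH_RESULT_MATCH_LIMIT.
    """
    if len(search_results) > limit:
        # Too many hits: the search is treated as a non-match.
        return []
    return list(search_results)
```

Discarding oversized result sets avoids downloading and comparing many records for a query that is too unspecific to be useful.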

Predefined templates

In the Invenio configuration file you can add short-hand mappings for BibMatch queries, for example predefined querystrings that standardize common matching queries. By default the following templates are given:

 title             - standard title search. Taken from 245__a (default)
 title-author      - title and author search (i.e. this is a title AND author a)
                     Taken from 245__a and 100__a
 reportnumber      - reportnumber search (i.e. reportnumber:REP-NO-123).
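To illustrate the idea of a template as a short-hand for a querystring, the sketch below maps two of the template names above to query patterns and fills them from a record. The `[tag]` placeholder syntax, the `TEMPLATES` dict and the `build_query` helper are assumptions made for this example and may differ from the actual configuration format.

```python
import re

# Hypothetical short-hand mappings; tags follow the template list above.
TEMPLATES = {
    'title': '[245__a]',
    'title-author': '[245__a] [100__a]',
}

def build_query(template_name, record):
    """Expand a template into a querystring using values from `record`.

    `record` is a hypothetical dict mapping a tag such as '245__a' to its
    value; tags missing from the record expand to an empty string.
    """
    template = TEMPLATES[template_name]
    return re.sub(r'\[([^\]]+)\]',
                  lambda m: record.get(m.group(1), ''),
                  template)
```

Keeping the mapping in one place means a common query can be referred to by name instead of being retyped for every matching run.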