Word Similarity/Similar Records Methods


Important regarding 'Word Similarity/Similar Records' method:
Because of tight connection with the search_engine and database, there
can only be one method using the "word_similarity" template, and the
code for this method as designated in the BibRank admin interface has
to be 'wrd'.

Preparing word similarity rank indexes:
The 'Word similarity' method works by generating an index over terms in
the tags specified in the configuration file for the given records. The
data is stored in two tables, on forward and one reverse. The forward
index has a list of terms, where each term as a dictionary containing
records using this term. The reverse index contains the opposite, a list
of records, where each record contains a dictionary of the terms it
contains (for the selected tags).  This means that the forward/reverse
index is to some degree similar to the tables created by BibIndex.

The main difference is that the rank method stores more information in
the table, based on how important the terms are, based on how many
times they have been used, and how important one term is in one
record. To minimize the number of terms to process, a few techniques
are used. Among these are stemming and stopword removal. Stemming
removes the end of a term, so that only the stem is left, this means
that 'looking' becomes 'look' and minimizes the size of the
database. Stopword removal removes very common words without meaning,
like 'the', 'one', 'me' in english. Terms that consists of numbers or
are below a certain limit is also ignored.

Since automatic language recognisition is not supported, each tag must
therefore be given a language for stemming to work. This means that in
the perfect world one tag should contain text in one language or
mostly in one language. If stemming is not wanted, the module can be
turned off, though lower rank quality may be expected.

Stopword removal works by checking if a term exists in a file, which
can contain any language necessary. Together with the default Invenio
installation, the file contains stopwords in french and english.

For a large Invenio installation (700 000 records), indexing takes
around 2 hours, including calculating the data needed for the
weighting scheme.

How the term importance is computed:
The method used is a variation of the well-known weighting scheme, the
vector model [1], in document retrieval.  The method is described in
[2] and called 'Log-entropy' weigthing scheme. For more detailed
explanation of the scheme, the paper should be consulted.  Since the
calculations necessary to calculate the number needed by the method is
too demanding, most of the numbers are calculated after the index over
term is created and stored in the database for later use.

Step by step at index time:(Using rebalance)
1. Load configuration for method
2. Begin a loop which loops through all records that should be added
3. Load content of tags in current record range
4. For each tag in all records, check each term if it should be used,
   check against stopword list, use stemming, if accepted, add to a structure
   the points from configfile for current tag.
5. Add to database the new values
6. Go back to 3 until no more records to be added.
7. Go through list of added terms, get list of all records containing these terms
8. Find all terms in records from last point.
9. For each record, calculate Fi, for each term calculate Gi
10.For each record, calculate normalization value Nj, add Gi value for each
   term to the structure in reverse index.
11.Adding the Gi value to each term in forward index, and adding the
   normalization value Nj to each term in each document

Word Similarity at search/rank time:
1. For each term, check if it can be used (like check agains stopword list),
   use stemming on the term if possible.
2. For each term, get dictionary from forward index, calculate rank values
   for each term.
3. Add any records not ranked to end of list.
4. Sort records.
5. Return sorted records

Similar records at search/rank time:
1. Get terms from reverse index which exists in record.
2. Sort terms and use only the most important ones for finding similar records.
3. For selected terms, rank the associated records
4. Sort records
5. Return sorted records

References:
[1] Modern Information Retrieval. Baeza-Yates/Ribeiro-Neto
[2] New term weighting formulas for the vector space method in information retrieval.
ORNL/TM-13756.E.Chisholm/T.G.Kolda