Word Similarity/Similar Records Methods

Important regarding the 'Word Similarity/Similar Records' method:

Because of its tight connection with the search_engine and the database, only one method can use the "word_similarity" template, and the code for this method, as designated in the BibRank admin interface, has to be 'wrd'.

Preparing word similarity rank indexes:

The 'Word similarity' method works by generating an index over the terms in the tags specified in the configuration file for the given records. The data is stored in two tables, one forward and one reverse. The forward index holds a list of terms, where each term has a dictionary of the records using that term. The reverse index holds the opposite: a list of records, where each record has a dictionary of the terms it contains (for the selected tags). The forward/reverse index is therefore to some degree similar to the tables created by BibIndex. The main difference is that the rank method stores more information in the tables: how important each term is overall, based on how many times it has been used, and how important each term is within each record.

To minimize the number of terms to process, a few techniques are used, among them stemming and stopword removal. Stemming removes the ending of a term so that only the stem is left; 'looking' becomes 'look', for example, which reduces the size of the database. Stopword removal removes very common words that carry little meaning, like 'the', 'one' and 'me' in English. Terms that consist of numbers or are below a certain limit are also ignored.

Since automatic language recognition is not supported, each tag must be given a language for stemming to work. Ideally, a tag should therefore contain text in one language, or mostly in one language. If stemming is not wanted, it can be turned off, though lower rank quality may be expected. Stopword removal works by checking whether a term exists in a stopword file, which can contain words from any languages necessary.
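The forward and reverse indexes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BibRank code: the stopword set, the minimum term length, the function names and the suffix-stripping toy_stem are all hypothetical stand-ins (a real installation would use its stopword file and a proper stemmer for the language of the tag), and plain occurrence counts are stored where the real tables hold richer weighting data.

```python
import re
from collections import defaultdict

# Hypothetical sample values; a real setup reads these from its configuration.
STOPWORDS = {"the", "one", "me", "for"}
MIN_TERM_LENGTH = 3  # assumed meaning of "below a certain limit"


def toy_stem(term):
    """Strip a few English suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def extract_terms(text):
    """Split text into lowercase words, drop unwanted ones, stem the rest."""
    terms = []
    for word in re.findall(r"\w+", text.lower()):
        if word.isdigit() or len(word) < MIN_TERM_LENGTH or word in STOPWORDS:
            continue
        terms.append(toy_stem(word))
    return terms


def build_indexes(records):
    """Return (forward, reverse) for a {recid: text} mapping.

    forward: {term: {recid: count}}
    reverse: {recid: {term: count}}
    """
    forward = defaultdict(dict)
    reverse = defaultdict(dict)
    for recid, text in records.items():
        for term in extract_terms(text):
            count = forward[term].get(recid, 0) + 1
            forward[term][recid] = count
            reverse[recid][term] = count
    return dict(forward), dict(reverse)


records = {
    1: "Looking for the boson in 2012",
    2: "One boson search",
}
forward, reverse = build_indexes(records)
```

With these two sample records, 'looking' is stored as 'look', while 'the', 'one', 'for', 'in' and '2012' never reach either index. The term 'boson' appears in the forward index with both record identifiers.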
The stopword file shipped with the default Invenio installation contains French and English stopwords. For a large Invenio installation (700 000 records), indexing takes around 2 hours, including calculating the data needed for the weighting scheme.

How the term importance is computed:

The method used is a variation of a well-known weighting scheme in document retrieval, the vector model [1]. The variation is described in [2], where it is called the 'Log-entropy' weighting scheme; that paper should be consulted for a more detailed explanation. Because the values needed by the method are too demanding to calculate at search time, most of them are calculated once the index over the terms has been created, and are stored in the database for later use.

Step by step at index time (using rebalance):

 1. Load the configuration for the method.
 2. Begin a loop over all records that should be added.
 3. Load the content of the tags for the current record range.
 4. For each tag in all records, check whether each term should be used: check it against the stopword list and apply stemming. If the term is accepted, add the points given in the configuration file for the current tag to a structure.
 5. Add the new values to the database.
 6. Go back to step 3 until there are no more records to add.
 7. Go through the list of added terms and get the list of all records containing these terms.
 8. Find all terms in the records from the previous step.
 9. For each record, calculate Fi; for each term, calculate Gi.
10. For each record, calculate the normalization value Nj, and add the Gi value for each term to the structure in the reverse index.
11. Add the Gi value to each term in the forward index, and add the normalization value Nj to each term in each document.

Word similarity at search/rank time:

 1. For each term, check whether it can be used (for example, check it against the stopword list), and apply stemming to the term if possible.
 2. For each term, get its dictionary from the forward index and calculate the rank values for the term.
 3. Add any records not ranked to the end of the list.
 4. Sort the records.
 5. Return the sorted records.

Similar records at search/rank time:

 1. Get the terms that exist in the record from the reverse index.
 2. Sort the terms and use only the most important ones for finding similar records.
 3. For the selected terms, rank the associated records.
 4. Sort the records.
 5. Return the sorted records.

References:

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
[2] E. Chisholm and T. G. Kolda. New term weighting formulas for the vector space method in information retrieval. ORNL/TM-13756.
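
Illustration of the weighting scheme:

As an illustration of the weighting discussed under 'How the term importance is computed', the sketch below computes textbook log-entropy weights from a reverse index and ranks records for a query. It is an assumption-laden sketch, not the BibRank implementation: the function names are hypothetical, the formulas are the commonly published log-entropy form (local weight log(1 + f), global weight 1 + sum(p * log(p)) / log(N), cosine normalization), and their correspondence to the Fi, Gi and Nj values named in the steps above is assumed rather than confirmed by this guide.

```python
import math


def log_entropy_weights(reverse):
    """Compute weights from a reverse index {recid: {term: frequency}}.

    Returns (global_weights, record_weights), where record_weights maps
    each record to its cosine-normalized term weights.
    """
    n_records = len(reverse)

    # Total frequency of each term over all records.
    totals = {}
    for terms in reverse.values():
        for term, freq in terms.items():
            totals[term] = totals.get(term, 0) + freq

    # Global weight: 1 + sum_j p * log(p) / log(N), with p = freq / total.
    entropy_sum = {term: 0.0 for term in totals}
    for terms in reverse.values():
        for term, freq in terms.items():
            p = freq / totals[term]
            entropy_sum[term] += p * math.log(p)
    log_n = math.log(n_records) if n_records > 1 else 1.0
    global_weights = {t: 1.0 + s / log_n for t, s in entropy_sum.items()}

    # Per-record weight: global * log(1 + freq), scaled to unit length.
    record_weights = {}
    for recid, terms in reverse.items():
        raw = {t: global_weights[t] * math.log(1 + f) for t, f in terms.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        record_weights[recid] = {
            t: (w / norm if norm else 0.0) for t, w in raw.items()
        }
    return global_weights, record_weights


def rank(query_terms, record_weights):
    """Sort record ids by summed weight of the query terms, best first."""
    scores = {
        recid: sum(terms.get(t, 0.0) for t in query_terms)
        for recid, terms in record_weights.items()
    }
    return sorted(scores, key=lambda recid: scores[recid], reverse=True)


reverse = {
    1: {"look": 1, "boson": 1},
    2: {"boson": 1, "search": 1},
}
global_weights, record_weights = log_entropy_weights(reverse)
ranked = rank(["look"], record_weights)
```

In this small example 'boson' occurs equally often in every record, so its global weight is approximately 0 and it contributes almost nothing to the ranking, while 'look' and 'search' each occur in a single record and receive a global weight of 1. Records that match no query term score 0 and therefore end up at the end of the sorted list.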