Link-based Ranking with/without External Citations

PageRank is one of the most popular link-based ranking methods for the WWW. "PageRank is a link analysis algorithm [..] that assigns a numerical weighting to each element of a hyperlinked set of documents, with the purpose of "measuring" its relative importance within the set." [Wikipedia]

If the "hyperlinked set of documents", meaning the citation graph, is almost complete, then the classical link-based ranking (a method inspired by the PageRank algorithm) should be used. Otherwise, if the citation graph is missing many citations, we advise using the link-based ranking with external citations.

Algorithm:

1. Read the citation data from the database or from a file that has the following format: x[tab]y, where the paper with recid x cites the paper with recid y. The citation data is stored in a key:value map, where the 'key' is a record id and the 'value' is the list of publications that cite publication 'key'.

2. Read the publication date of each paper from the database or from a file that has the following format: x[tab]y, where x is a recid and y is the publication year. There are several possibilities for retrieving the dates from the database:
   i) using the 260__c MARC tag, which contains only the publication year (this is the option that we are using);
   ii) using the 269__c MARC tag, which contains the complete publication date.
   For papers that do not have a publication date, we use the date of insertion in the database (961__x tag). If neither the publication date nor the insertion date is available, we use an average date (computed from the existing dates).

3. Read the convergence threshold, check_point and damping_factor parameters from the configuration file. (These parameters are specific to the link-based ranking method.)

4. There are two possibilities for computing the publications' weights: with or without the external citations.
   4.1. When using the external citations: read the necessary parameters (everything that starts with "ext_") from the configuration file.
   4.2. When not using the external citations: no additional parameters are needed.

5. Iteratively calculate the weight of each publication until the weights reach a stable state.

6. Write the ranks to the database and to a file. The name of the file and the number of ranks to output can be set in the configuration file. If the name of the file is not set, the ranks are written only to the database.
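
Steps 1 and 5 can be illustrated with a minimal Python sketch for the case without external citations. It is not the actual implementation: the function names read_citations and rank, the max_iterations cap, the default parameter values and the even redistribution of weight from papers that cite nothing are all assumptions made for illustration. The update rule is the standard PageRank formula, which may differ from the formula really used, and the check_point parameter is not modelled because its semantics are not described here.

```python
from collections import defaultdict


def read_citations(lines):
    """Parse 'x<TAB>y' lines (x cites y) into {cited recid: [citing recids]}."""
    cited_by = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        citing, cited = (int(value) for value in line.split("\t"))
        cited_by[cited].append(citing)
        cited_by.setdefault(citing, [])  # every paper gets an entry
    return dict(cited_by)


def rank(cited_by, damping_factor=0.85, threshold=1e-6, max_iterations=1000):
    """Iterate standard PageRank updates until the weights stabilise.

    Assumption: the parameter names mirror the configuration options
    mentioned in the text; the defaults are illustrative only.
    """
    papers = list(cited_by)
    if not papers:
        return {}
    n = len(papers)

    # Number of references made by each paper (its outgoing links).
    out_degree = {paper: 0 for paper in papers}
    for citing_papers in cited_by.values():
        for citing in citing_papers:
            out_degree[citing] += 1

    weights = {paper: 1.0 / n for paper in papers}
    for _ in range(max_iterations):
        # Assumption: papers that cite nothing share their weight evenly.
        dangling = sum(weights[p] for p in papers if out_degree[p] == 0)
        new_weights = {}
        for paper in papers:
            incoming = sum(weights[c] / out_degree[c] for c in cited_by[paper])
            new_weights[paper] = (
                (1.0 - damping_factor) / n
                + damping_factor * (incoming + dangling / n)
            )
        # Total absolute change is compared against the convergence threshold.
        change = sum(abs(new_weights[p] - weights[p]) for p in papers)
        weights = new_weights
        if change < threshold:
            break
    return weights


if __name__ == "__main__":
    citations = read_citations(["1\t3", "2\t3", "3\t4"])
    for recid, weight in sorted(rank(citations).items()):
        print(recid, round(weight, 4))
```

In this toy graph, papers 1 and 2 both cite paper 3, and paper 3 cites paper 4, so paper 4 ends up with the highest weight and the uncited papers 1 and 2 share the lowest.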