BibAuthorID Internals

BibAuthorID
-----------

DEFINITIONS
    Inspire: Publication system for high-energy physics
    Hepnames: A collection of known authors including their scientific history in the field
    Document: A document is, in a broader sense, a scientific artifact inside Inspire
        (a publication, a pre-print, a picture, a data set, etc.)
    Virtual Author (VA): The author as it appears on a document
    Real Author (RA): The final entity which ideally is a real individual researcher
    Feature: A feature of a VA or RA is a metric by which its owner can be
        compared to other entities


THE ALGORITHM IN SHORT
    Before the algorithm can run, one preparation step is required:
    - Read all names from the Inspire database and store them in a new table

    The algorithm itself is divided into several steps:
    1. Clustering--finding potentially related authors
    2. Matching--creating RA entities by pairwise comparison of VAs within a cluster
    3. Post-matching comparison--identifying identical RA entities through cross-cluster comparison


STEP I: CLUSTERING
    [CODE]
    FOR every last name in the database:
        FOR every name with that last name:
            FOR every document associated with the name:
                create new VA (record):
                    update cluster id of VA and create new cluster if necessary
                    mark as updated
                    mark as not connected
    [/CODE]
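
    The clustering loop above can be sketched in Python as follows. This is a
    minimal illustration, not the actual implementation: the "Last, First"
    name format, the dictionary layout of a VA and the use of the lower-cased
    last name as cluster id are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_by_last_name(name_records):
    """Group (name, document_id) pairs into clusters keyed by last name.

    Assumption: names look like "Last, First ..."; the real storage format
    and the VA record layout may differ.
    """
    clusters = defaultdict(list)
    for name, doc_id in name_records:
        last_name = name.split(",")[0].strip().lower()
        # One VA per (name, document) pair, flagged as in the pseudocode.
        va = {
            "name": name,
            "document": doc_id,
            "cluster": last_name,
            "updated": True,
            "connected": False,
        }
        clusters[last_name].append(va)
    return dict(clusters)


records = [("Ellis, John", 1), ("Ellis, J.", 2), ("Smith, Anna", 3)]
clusters = cluster_by_last_name(records)
```

    Here the two "Ellis" records end up in the same cluster, so only they are
    compared with each other in the matching step.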


STEP II: MATCHING
    The matching algorithm runs several times under different conditions.

    Run 1: run matching algorithm and let it find only updated VAs (the ones
        that have the update flag set) that have a full name (i.e. at least
        one first name).
    Run 2: run matching algorithm and let it find the rest of the updated and
        not-connected VAs.
    Run 3: run matching algorithm once or more and let it find only orphaned
        VAs (i.e. VAs that are neither updated nor connected)


    [CODE]
    IF mode == updated_fullname:
        queue := find all updated VAs that have a full name
           and sort them by the number of their features
    ELSEIF mode == updated:
        queue := find all updated VAs
           and sort them by the number of their features
    ELSEIF mode == orphaned:
        queue := find all disconnected VAs
           and sort them by the number of their features

    FOR every qVA in queue:
        cluster := find all VAs in the same cluster
        other_RAs := find all RAs that have VAs attached from the same cluster

        IF no other_RAs exist:
            create new RA and copy all features from qVA
        ELSE:
            FOR every other_RA in other_RAs:
                probabilities.add(compare qVA features with other_RA features)

            IF all probabilities < adding threshold:
                create new RA and copy all features from qVA
            ELSEIF (number of probabilities > adding threshold) == 1:
                add qVA to RA with highest probability
                    and copy features from qVA to that RA
            ELSE:
                mark qVA as not connected
                mark qVA as not updated
                continue with next qVA in queue

        mark qVA as connected
        mark qVA as not updated
    [/CODE]
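
    The three-way decision at the heart of the matching loop can be isolated
    in a small Python function. This is a sketch only: the function name, the
    return convention and the threshold value 0.5 used in the example are
    hypothetical, since the actual adding threshold is not stated here.

```python
def decide(probabilities, threshold):
    """Decide what to do with a queued VA.

    Returns ("new", None) to create a new RA, ("attach", i) to add the VA
    to the RA at index i, or ("defer", None) to leave the VA unconnected.
    """
    if all(p < threshold for p in probabilities):
        # Also covers the case where no other RA exists (empty list).
        return ("new", None)
    above = [i for i, p in enumerate(probabilities) if p > threshold]
    if len(above) == 1:
        return ("attach", above[0])
    # Several candidate RAs: ambiguous, retry in a later run.
    return ("defer", None)


example_threshold = 0.5  # hypothetical value, for illustration only
result_new = decide([0.1, 0.2], example_threshold)
result_attach = decide([0.2, 0.9], example_threshold)
result_defer = decide([0.8, 0.9], example_threshold)
```

    A deferred VA keeps its "not connected" flag and is picked up again by
    the orphaned run.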

STEP II: MODULES
    Up to this point, the algorithm is built as a framework. The framework
    provides all the methods needed to access the features of RAs and VAs and
    is extensible through modules. A module provides the functions needed to
    compare the features of a VA with the features of an RA. Currently, four
    modules are implemented to determine the correct attribution of a VA to
    an RA: name comparison, affiliation comparison, co-authorship comparison
    and a paper-equality test. The following snippets outline how each module
    works.

    MODULE 1: NAME COMPARISON
    [CODE]
        clean names by removing special chars (.-_/\[]{}())
        split names into last name, initials and names
        build name combinations

        FOR all possible name combinations on both sides:
            initials_p := compare initials:
                attribute weight to position of initial: pos/((1+n/2)*n)
                add up weights of matching initials

            names_p := compare names:
                attribute weight to position of name: pos/((1+n/2)*n)
                add up weights of matching names

            IF names_p > 0.6:
                initials_p_weight := 0.3
                names_p_weight := 0.7
            ELSEIF initials_p == 1.0 and names_p <= 0:
                initials_p_weight := 0
                names_p_weight := 0
            ELSE:
                initials_p_weight := 0.5
                names_p_weight := 0.5

            probabilities.add(names_p_weight * names_p +
                              initials_p_weight * initials_p)

        RETURN MAX(probabilities)
    [/CODE]
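
    The way the two partial scores are combined can be sketched in Python.
    The sketch covers only the weighting step, not the positional scoring.
    It assumes that the middle branch tests the scores initials_p and names_p
    themselves (the weights have no value at that point); the function name
    is hypothetical.

```python
def combine_name_scores(initials_p, names_p):
    """Weight and add the initials score and the full-name score."""
    if names_p > 0.6:
        # Full names agree well: trust them more than the initials.
        initials_w, names_w = 0.3, 0.7
    elif initials_p == 1.0 and names_p <= 0:
        # Initials match perfectly but no full name does: score nothing.
        initials_w, names_w = 0.0, 0.0
    else:
        initials_w, names_w = 0.5, 0.5
    return names_w * names_p + initials_w * initials_p


full_match = combine_name_scores(1.0, 1.0)
initials_only = combine_name_scores(1.0, 0.0)
partial = combine_name_scores(0.5, 0.0)
```

    Under this reading, a pair such as "J. Ellis" and "James Ellis" with
    matching initials but no matching full name contributes nothing for that
    name combination.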

    MODULE 2: AFFILIATION COMPARISON
    [CODE]
        IF no affiliation in common:
            RETURN 0.0

        FOR every common affiliation:
            IF common affiliation == "Unknown":
                common_affiliations.add(0)
            ELSE:
                common_affiliations.add(1)

            date_difference := find date difference in months

            IF date_difference > 600 (50 years):
                date_probabilities.add(0)
            ELSE:
                date_probabilities.add(e^(-0.05 * date_difference ^ 0.7))

        affiliation_p := AVERAGE(common_affiliations)
        date_p := AVERAGE(date_probabilities)

        RETURN (affiliation_p + date_p) / 2
    [/CODE]
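
    A Python sketch of the affiliation score may help to show how quickly the
    date term decays. The input format (a list of pairs of affiliation name
    and date difference in months) and the function names are assumptions
    made for this example; taking the absolute value of the difference is
    also an assumption.

```python
import math
from statistics import mean


def date_probability(months):
    """Decay term for a date difference given in months."""
    months = abs(months)  # assumption: only the size of the gap matters
    if months > 600:  # more than 50 years apart
        return 0.0
    return math.exp(-0.05 * months ** 0.7)


def affiliation_score(common):
    """Score a list of (affiliation_name, date_difference_in_months)."""
    if not common:
        return 0.0
    affiliation_p = mean(0.0 if name == "Unknown" else 1.0
                         for name, _ in common)
    date_p = mean(date_probability(months) for _, months in common)
    return (affiliation_p + date_p) / 2


same_time = affiliation_score([("CERN", 0)])
unknown = affiliation_score([("Unknown", 0)])
one_year = date_probability(12)
```

    With these constants, a shared known affiliation at the same date scores
    1.0, and the date term has dropped to roughly 0.75 after one year.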

    MODULE 3: COAUTHORSHIP COMPARISON
    [CODE]
        IF no coauthors on both sides:
            RETURN 0.0

        IF number of VA coauthors > 50:
            create hash of sorted coauthor list.
            IF RA holds same hash:
                RETURN 1.0
            ELSE:
                RETURN 0.0

        parity := find intersection of RA and VA coauthors

        FOR every coauthor in parity:
            IF coauthor is a collaboration:
                RETURN 1.0

        IF number of VA coauthors > 0:
            RETURN 1 - e^(-0.8 * len(parity)^0.7)
    [/CODE]
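
    The co-authorship module could look roughly like this in Python. Several
    details are assumptions made for the sketch: SHA-1 as the hash function,
    the "|" separator, the parameter names, and the reading that an empty
    co-author list on either side yields 0.0 (an empty intersection gives
    0.0 in the final formula anyway).

```python
import hashlib
import math

LARGE_LIST = 50  # size limit taken from the pseudocode above


def coauthor_hash(coauthors):
    """Hash a sorted co-author list (hash function is an assumption)."""
    joined = "|".join(sorted(coauthors))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def coauthor_probability(va_coauthors, ra_coauthors,
                         ra_hashes=(), collaborations=()):
    va, ra = set(va_coauthors), set(ra_coauthors)
    if not va or not ra:
        return 0.0
    if len(va) > LARGE_LIST:
        # Very long author lists are only compared as a whole.
        return 1.0 if coauthor_hash(va) in ra_hashes else 0.0
    shared = va & ra
    if any(name in collaborations for name in shared):
        return 1.0
    return 1 - math.exp(-0.8 * len(shared) ** 0.7)


one_shared = coauthor_probability({"a", "b"}, {"a", "c"})
```

    A single shared co-author already gives 1 - e^(-0.8), a little over 0.55,
    and each further shared co-author adds less than the previous one.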

    MODULE 4: PAPER-EQUALITY TEST
    [CODE]
        IF qVA is on a paper that is already part of the RA:
            RETURN impossible match
        ELSE:
            RETURN possible match
    [/CODE]


STEP III: POST-MATCHING COMPARISON
    [CODE]
        FOR every RA:
            compare features of RA with all features of all other RAs

            IF compatibility with other RA > 0.75:
                merge the two RAs into one RA

    [/CODE]
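
    The merge step can be illustrated with a simplified Python sketch. Here
    each RA is reduced to a set of features and the comparison function is a
    Jaccard similarity; both are stand-ins chosen for the example, not the
    comparison the modules above perform. The sketch also merges greedily in
    a single pass, which is a simplification of comparing every RA with all
    other RAs.

```python
def jaccard(a, b):
    """Stand-in similarity: shared features over all features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar(ras, compare=jaccard, threshold=0.75):
    """Merge RAs whose compatibility exceeds the threshold."""
    merged = []
    for ra in ras:
        for existing in merged:
            if compare(existing, ra) > threshold:
                existing.update(ra)  # fold the features into one RA
                break
        else:
            merged.append(set(ra))
    return merged


ras = [{"a", "b", "c", "d"}, {"a", "b", "c", "d", "e"}, {"x"}]
result = merge_similar(ras)
```

    The first two RAs share four of five features (0.8 > 0.75) and are merged;
    the third stays separate.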