Search Engine API
===================
There are three kinds of API you can use: XML API, JSON API,
and Python API.
1. XML API
==========
About:
Invenio has had a stable search API since its inception. You
can use the regular search interface to refine your query until you
find what you are looking for, and then amend a few URL parameters
to turn the query into an XML API one.
Syntax:
GET /search?p=...&of=...&ot=...&jrec=...&rg=...
Parameters:
p = pattern (i.e. your query)
of = output format (e.g. `xm` for MARCXML)
ot = output tags (e.g. `` to get all fields, `100` to get first authors only)
jrec = jump to record (e.g. 1 for the first hit)
rg = records-in-groups-of (e.g. 10 hits per page)
You can use other parameters as well; the list above mentions the
most useful ones. For full documentation on these and the other
`/search` URL parameters, please see Python API section 3.1 below.
Pros:
Easy web search -> API search context switch. Uses the same
parameters as the visible UI.
Cons:
The XML API output covers only MARC metadata.
Notes:
The master format of Invenio records is usually MARC. Hence
chances are you would like to use `of=xm` output format parameter
in your XML API queries in order to get the richest data.
Set `jrec` and `rg` appropriately to paginate. For example:
/search?p=ellis&of=xm&jrec=1&rg=10
/search?p=ellis&of=xm&jrec=11&rg=10
/search?p=ellis&of=xm&jrec=21&rg=10
[...]
Do not set `rg` too high; there is a server-wide safety limit on
it. (CFG_WEBSEARCH_MAX_RECORDS_IN_GROUPS)
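The pagination rule above (page n starts at hit (n - 1) * rg + 1) can be sketched as a small helper. This is only an illustration using the Python 3 standard library; the names `jrec_for_page` and `page_url` are hypothetical and are not part of Invenio.

```python
from urllib.parse import urlencode


def jrec_for_page(page, rg):
    """Return the 1-based `jrec` value of the first hit on `page`."""
    if page < 1 or rg < 1:
        raise ValueError("page and rg must be positive integers")
    return (page - 1) * rg + 1


def page_url(pattern, page, rg=10, of="xm"):
    """Build a relative /search URL for one page of results."""
    params = [("p", pattern), ("of", of),
              ("jrec", jrec_for_page(page, rg)), ("rg", rg)]
    return "/search?" + urlencode(params)


# Print the URLs of the first three pages of the `ellis' query.
for page in (1, 2, 3):
    print(page_url("ellis", page))
```

Keep `rg` below the server-wide limit mentioned above; the helper does not know that limit and does not enforce it.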
Example: (returning full XML output)
GET /search?p=ellis&of=xm
<!-- Search-Engine-Total-Number-Of-Results: 12 -->
<collection>
<record>
<controlfield tag="001">47</controlfield>
<controlfield tag="005">20140908173007.0</controlfield>
<datafield tag="037" ind1=" " ind2=" ">
<subfield code="a">hep-ph/0204132</subfield>
</datafield>
<datafield tag="041" ind1=" " ind2=" ">
<subfield code="a">eng</subfield>
</datafield>
...
Example: (returning XML output, first author (100) and title (245) fields only)
GET /search?p=ellis&of=xm&ot=100,245
<!-- Search-Engine-Total-Number-Of-Results: 12 -->
<collection>
<record>
<controlfield tag="001">47</controlfield>
<controlfield tag="005">20140908173007.0</controlfield>
<datafield tag="100" ind1=" " ind2=" ">
<subfield code="a">Shovkovy, I A</subfield>
<subfield code="u">Minnesota Univ.</subfield>
</datafield>
<datafield tag="245" ind1=" " ind2=" ">
<subfield code="a">Thermal conductivity of dense quark matter and cooling of stars</subfield>
</datafield>
</record>
...
Example: (returning the 250th page of a query, with 50 records per page)
GET /search?p=cern&of=xm&ot=100,245&jrec=12451&rg=50
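On the client side, XML output of the shape shown in the examples above can be read with the Python 3 standard library. The sketch below is an assumption-laden illustration: the `SAMPLE` string mirrors the excerpt above, the helper names are hypothetical, and a real server response may declare an XML namespace, in which case the lookups would need to be namespace-qualified.

```python
import re
import xml.etree.ElementTree as ET

# Sample modelled on the XML API excerpt above (no namespace assumed).
SAMPLE = """<!-- Search-Engine-Total-Number-Of-Results: 12 -->
<collection>
<record>
  <controlfield tag="001">47</controlfield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Shovkovy, I A</subfield>
    <subfield code="u">Minnesota Univ.</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Thermal conductivity of dense quark matter and cooling of stars</subfield>
  </datafield>
</record>
</collection>"""


def total_hits(text):
    """Read the total hit count from the leading XML comment, if present."""
    match = re.search(r"Search-Engine-Total-Number-Of-Results:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def subfield(record, tag, code):
    """Return the first subfield value for a MARC tag/code, or None."""
    node = record.find("datafield[@tag='%s']/subfield[@code='%s']" % (tag, code))
    return node.text if node is not None else None


def parse_records(text):
    """Extract record ID, first author (100 $a) and title (245 $a)."""
    root = ET.fromstring(text)
    records = []
    for record in root.findall("record"):
        recid = record.find("controlfield[@tag='001']")
        records.append({
            "recid": recid.text if recid is not None else None,
            "first_author": subfield(record, "100", "a"),
            "title": subfield(record, "245", "a"),
        })
    return records


print(total_hits(SAMPLE))
print(parse_records(SAMPLE))
```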
2. JSON API
===========
About:
Internally, Invenio records are represented in JSON. You can ask
for JSON output format (`of=recjson`) to obtain it. Otherwise use
the same parameters as in XML API, see section 1.
Pros:
The JSON API covers field abstraction (support for virtual fields,
e.g. number of citations or book circulation counts) as well as
master format abstraction (e.g. UNIMARC, EAD).
Cons:
May be unusably slow if `recjson` is not cached on the server.
(See `CFG_BIBUPLOAD_SERIALIZE_RECORD_STRUCTURE`.)
Not yet REST-ified; just an evolution of the HTTP XML API described
above.
Example: (who cites me?)
GET /search?p=refersto:author:maldacena&of=recjson&ot=recid,creation_date,authors[0],number_of_authors,system_control_number
[{
recid: 1290100,
creation_date: "2014-04-14T04:44:13",
authors: [{
first_name: "A.",
last_name: "Bernui",
full_name: "Bernui, A."
}],
number_of_authors: 3,
system_control_number: [
{
institute: "arXiv",
value: "oai:arXiv.org:1404.2936"
}
],
},
...]
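A client for the JSON API might build the query URL and summarise the response roughly as follows. This is a sketch under stated assumptions: it uses only the Python 3 standard library, the helper names are hypothetical, and `SAMPLE` is a strictly valid JSON rendering of the record shown above (quoted keys), which is how a JSON parser would need to receive it.

```python
import json
from urllib.parse import urlencode


def recjson_url(pattern, fields):
    """Build a relative /search URL asking for selected recjson fields."""
    params = [("p", pattern), ("of", "recjson"), ("ot", ",".join(fields))]
    return "/search?" + urlencode(params)


# Valid-JSON version of the example record above (assumed response shape).
SAMPLE = """[{
  "recid": 1290100,
  "creation_date": "2014-04-14T04:44:13",
  "authors": [{"first_name": "A.", "last_name": "Bernui",
               "full_name": "Bernui, A."}],
  "number_of_authors": 3,
  "system_control_number": [
    {"institute": "arXiv", "value": "oai:arXiv.org:1404.2936"}
  ]
}]"""


def summarize(payload):
    """Return (recid, first author's full name, author count) per record."""
    rows = []
    for rec in json.loads(payload):
        authors = rec.get("authors") or []
        first = authors[0].get("full_name") if authors else None
        rows.append((rec["recid"], first, rec.get("number_of_authors")))
    return rows


print(recjson_url("refersto:author:maldacena", ["recid", "authors[0]"]))
print(summarize(SAMPLE))
```

Note that `urlencode` percent-encodes the colons, commas and brackets in the query; the server decodes them back to the same parameter values.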
3. Python API
=============
The Invenio search engine can be called from within your Python programs
via a high-level, a mid-level, and a low-level API.
3.1. High-level API
-------------------
Description:
The high-level access to the search engine is provided by
exactly the same function as called from the web interface when
users submit their queries. This should guarantee exactly the
same behaviour, and means that you can pass to the high-level
API all the arguments as you see them in the URL.
There are two things to note: (i) the function does not check
whether a collection is restricted, so restricted collections
will be searched without asking for a password; (ii) the output
format argument (``of'') should be set to ``id'' (which is the
default value) to obtain the list of recIDs as the function's
return value.
Signature:
def perform_request_search(req=None, cc=CFG_SITE_NAME, c=None, p="", f="", rg=0, sf="", so="a", sp="", rm="", of="id", ot="", aas=0,
p1="", f1="", m1="", op1="", p2="", f2="", m2="", op2="", p3="", f3="", m3="", sc=0, jrec=0,
recid=-1, recidb=-1, sysno="", id=-1, idb=-1, sysnb="", action="", d1="",
d1y=0, d1m=0, d1d=0, d2="", d2y=0, d2m=0, d2d=0, dt="", verbose=0, ap=0, ln=CFG_SITE_LANG, ec=None, tab="",
wl=0, em=""):
"""Perform search or browse request, without checking for
authentication. Return list of recIDs found, if of=id.
Otherwise create web page.
The arguments are as follows:
req - mod_python Request class instance.
cc - current collection (e.g. "ATLAS"). The collection the
user started to search/browse from.
c - collection list (e.g. ["Theses", "Books"]). The
collections user may have selected/deselected when
starting to search from 'cc'.
p - pattern to search for (e.g. "ellis and muon or kaon").
f - field to search within (e.g. "author").
rg - records in groups of (e.g. "10"). Defines how many hits
per collection in the search results page are
displayed. (Note that `rg' is ignored in case of `of=id'.)
sf - sort field (e.g. "title").
so - sort order ("a"=ascending, "d"=descending).
sp - sort pattern (e.g. "CERN-") -- in case there are more
values in a sort field, this argument tells which one
to prefer
rm - ranking method (e.g. "jif"). Defines whether results
should be ranked by some known ranking method.
of - output format (e.g. "hb"). Usually a leading "h" means
HTML output (e.g. "hb" for HTML brief, "hd" for HTML
detailed), "x" means XML output, "t" means plain text
output, "id" means no output at all but to return the list
of recIDs found, "intbitset" means to return an intbitset
representation of the recIDs found (no sorting or ranking
will be performed). (Suitable for high-level API.)
ot - output only these MARC tags (e.g. "100,700,909C0b").
Useful if only some fields are to be shown in the
output, e.g. for library to control some fields.
em - output only part of the page.
aas - advanced search ("0" means no, "1" means yes). Whether
search was called from within the advanced search
interface.
p1 - first pattern to search for in the advanced search
interface. Much like 'p'.
f1 - first field to search within in the advanced search
interface. Much like 'f'.
m1 - first matching type in the advanced search interface.
("a" all of the words, "o" any of the words, "e" exact
phrase, "p" partial phrase, "r" regular expression).
op1 - first operator, to join the first and the second unit
in the advanced search interface. ("a" and, "o" or,
"n" not).
p2 - second pattern to search for in the advanced search
interface. Much like 'p'.
f2 - second field to search within in the advanced search
interface. Much like 'f'.
m2 - second matching type in the advanced search interface.
("a" all of the words, "o" any of the words, "e" exact
phrase, "p" partial phrase, "r" regular expression).
op2 - second operator, to join the second and the third unit
in the advanced search interface. ("a" and, "o" or,
"n" not).
p3 - third pattern to search for in the advanced search
interface. Much like 'p'.
f3 - third field to search within in the advanced search
interface. Much like 'f'.
m3 - third matching type in the advanced search interface.
("a" all of the words, "o" any of the words, "e" exact
phrase, "p" partial phrase, "r" regular expression).
sc - split by collection ("0" no, "1" yes). Governs whether
we want to present the results in a single huge list,
or split by collection.
jrec - jump to record (e.g. "234"). Used for navigation
inside the search results. (Note that `jrec' is ignored
in case of `of=id'.)
recid - display record ID (e.g. "20000"). Do not
search/browse but go straight away to the Detailed
record page for the given recID.
recidb - display record ID bis (e.g. "20010"). If greater than
'recid', then display records from recid to recidb.
Useful for example for dumping records from the
database for reformatting.
sysno - display old system SYS number (e.g. ""). If you
migrate to Invenio from another system, and store your
old SYS call numbers, you can use them instead of recid
if you wish so.
id - the same as recid, in case recid is not set. For
backwards compatibility.
idb - the same as recidb, in case recidb is not set. For
backwards compatibility.
sysnb - the same as sysno, in case sysno is not set. For
backwards compatibility.
action - action to do. "SEARCH" for searching, "Browse" for
browsing. Default is to search.
d1 - first datetime in full YYYY-mm-dd HH:MM:SS format
(e.g. "1998-08-23 12:34:56"). Useful for search limits
on creation/modification date (see 'dt' argument
below). Note that 'd1' takes precedence over d1y, d1m,
d1d if these are defined.
d1y - first date's year (e.g. "1998"). Useful for search
limits on creation/modification date.
d1m - first date's month (e.g. "08"). Useful for search
limits on creation/modification date.
d1d - first date's day (e.g. "23"). Useful for search
limits on creation/modification date.
d2 - second datetime in full YYYY-mm-dd HH:MM:SS format
(e.g. "1998-09-02 12:34:56"). Useful for search limits
on creation/modification date (see 'dt' argument
below). Note that 'd2' takes precedence over d2y, d2m,
d2d if these are defined.
d2y - second date's year (e.g. "1998"). Useful for search
limits on creation/modification date.
d2m - second date's month (e.g. "09"). Useful for search
limits on creation/modification date.
d2d - second date's day (e.g. "02"). Useful for search
limits on creation/modification date.
dt - first and second date's type (e.g. "c"). Specifies
whether to search in creation dates ("c") or in
modification dates ("m"). When dt is not set and d1*
and d2* are set, the default is "c".
verbose - verbose level (0=min, 9=max). Useful to print some
internal information on the searching process in case
something goes wrong.
ap - alternative patterns (0=no, 1=yes). In case no exact
match is found, the search engine can try alternative
patterns e.g. to replace non-alphanumeric characters by
a boolean query. ap defines if this is wanted.
ln - language of the search interface (e.g. "en"). Useful
for internationalization.
ec - list of external search engines to search as well
(e.g. "SPIRES HEP").
wl - wildcard limit (e.g. 100). Wildcard queries will be
limited to this many results.
"""
Examples: (retrieving record IDs)
>>> # import the function:
>>> from invenio.websearch.search_engine import perform_request_search
>>> # get all hits in a collection:
>>> perform_request_search(cc="ATLAS Communications")
>>> # search for the word `of' in Theses and Books:
>>> perform_request_search(p="of", c=["Theses","Books"])
>>> # search for `muon or kaon' within title:
>>> perform_request_search(p="muon or kaon", f="title")
>>> # phrase search (note the quotes):
>>> perform_request_search(p='"Ellis, J"', f="author")
>>> # regexp search for a report number:
>>> perform_request_search(p1="^CERN.*2003-001$", f1="reportnumber", m1="r")
>>> # moi inside Standards gives no hits...
>>> perform_request_search(p="moi", cc="Standards")
>>> # but it does if we use alternative patterns:
>>> perform_request_search(p="moi", cc="Standards", ap=1)
Example: (retrieving MARCXML)
>>> import cStringIO
>>> tmp = cStringIO.StringIO()
>>> perform_request_search(req=tmp, p='ellis', of='xm')
>>> out = tmp.getvalue()
>>> tmp.close()
>>> # `out' now contains MARCXML of 12 records found
Example: (retrieving Text MARC, certain tags only)
>>> import cStringIO
>>> tmp = cStringIO.StringIO()
>>> perform_request_search(req=tmp, p='higgs', of='tm', ot=['100', '700'])
>>> out = tmp.getvalue()
>>> tmp.close()
>>> print out
000000085 100__ $$aGirardello, L$$uINFN$$uUniversita di Milano-Bicocca
000000085 700__ $$aPorrati, Massimo
000000085 700__ $$aZaffaroni, A
000000001 100__ $$aPhotolab
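Because the high-level API accepts the same arguments that appear in the URL, a query refined in the web interface can be turned into keyword arguments mechanically. The helper below is a hypothetical sketch (it is not part of Invenio) written against the Python 3 standard library; it assumes that the arguments with integer defaults in the signature above should be converted to integers and that only `c` may repeat.

```python
from urllib.parse import urlsplit, parse_qs

# Arguments whose defaults are integers in the signature above.
INT_ARGS = {"rg", "aas", "sc", "jrec", "recid", "recidb", "id", "idb",
            "d1y", "d1m", "d1d", "d2y", "d2m", "d2d", "verbose", "ap", "wl"}
# Arguments that may be repeated in the URL (assumption: only `c`).
LIST_ARGS = {"c"}


def url_to_search_kwargs(url):
    """Turn a /search URL into a dict of keyword arguments."""
    query = parse_qs(urlsplit(url).query)
    kwargs = {}
    for name, values in query.items():
        if name in LIST_ARGS:
            kwargs[name] = values            # keep every occurrence
        elif name in INT_ARGS:
            kwargs[name] = int(values[-1])   # last occurrence wins
        else:
            kwargs[name] = values[-1]
    return kwargs


kwargs = url_to_search_kwargs("/search?p=ellis&f=author&c=Theses&c=Books&ap=1")
print(kwargs)
```

Inside an Invenio installation the result could then be passed on as `perform_request_search(**kwargs)`. Remove any `of` entry first (or set it to "id") if you want the list of recIDs back.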
3.2. Mid-level API
------------------
Description:
The mid-level API is provided by the search_pattern() function,
which only searches for the given pattern in the given field
according to the given matching type. This function does not
know anything about collections. The function does not wash its
arguments; it expects them to be `clean' already. The pattern
is split into `basic search units' for which a boolean query is
launched. The function returns an instance of the intbitset class.
Note that if you want to obtain the list of recIDs (as with the
high-level API), you can invoke the ``tolist()'' method on a
hitset.
Signature:
def search_pattern(req=None, p=None, f=None, m=None, ap=0, of="id", verbose=0, ln=CFG_SITE_LANG, display_nearest_terms_box=True, wl=0):
"""Search for complex pattern 'p' within field 'f' according to
matching type 'm'. Return hitset of recIDs.
The function uses a multi-stage searching algorithm in case no
exact match is found. See the Search Internals document for
detailed description.
The 'ap' argument governs whether alternative patterns are to
be used in case there is no direct hit for (p,f,m). For
example, whether to replace non-alphanumeric characters by
spaces if it would give some hits. See the Search Internals
document for detailed description. (ap=0 forbids the
alternative pattern usage, ap=1 permits it.)
'ap' is also internally used for allowing hidden tag search
(for requests coming from webcoll, for example). In this
case ap=-9.
The 'of' argument governs whether to print or not some
information to the user in case of no match found. (Usually it
prints the information in case of HTML formats, otherwise it's
silent).
The 'verbose' argument controls the level of debugging information
to be printed (0=least, 9=most).
All the parameters are assumed to have been previously washed.
This function is suitable as a mid-level API.
"""
Examples:
>>> # import the function:
>>> from invenio.websearch.search_engine import search_pattern
>>> # search for muon or kaon in any field:
>>> search_pattern(p="muon or kaon").tolist()
>>> # the following finds nothing by default...
>>> search_pattern(p="cern-moi").tolist()
>>> # ...but it does find something if we allow alternative patterns:
>>> search_pattern(p="cern-moi", ap=1).tolist()
>>> # wildcard search for a report number:
>>> search_pattern(p="CERN-LHC-PROJECT-REPORT-40*", f="reportnumber").tolist()
>>> # regexp search for a report number with possible trailing subjects:
>>> search_pattern(p="^CERN-LHC-PROJECT-REPORT-40(-|$)", f="reportnumber", m="r").tolist()
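The last two examples differ in what they accept after the trailing "40". The sketch below approximates that difference with the Python 3 standard library only: `fnmatch` stands in for the wildcard match and `re` for the regular expression match. Both are stand-ins rather than the matching machinery Invenio actually uses, and the report numbers are made up for illustration.

```python
import fnmatch
import re

# Hypothetical report numbers, for illustration only.
REPORT_NUMBERS = [
    "CERN-LHC-PROJECT-REPORT-40",
    "CERN-LHC-PROJECT-REPORT-40-REV",
    "CERN-LHC-PROJECT-REPORT-401",
]


def wildcard_hits(pattern, values):
    """Values matching a trailing-`*' wildcard pattern (case-sensitive)."""
    return [v for v in values if fnmatch.fnmatchcase(v, pattern)]


def regexp_hits(pattern, values):
    """Values in which the regular expression finds a match."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.search(v)]


wild = wildcard_hits("CERN-LHC-PROJECT-REPORT-40*", REPORT_NUMBERS)
regex = regexp_hits(r"^CERN-LHC-PROJECT-REPORT-40(-|$)", REPORT_NUMBERS)

# The wildcard also accepts "...-401"; the regexp only accepts "40"
# followed by a hyphen or by the end of the value.
print(wild)
print(regex)
```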
3.3. Low-level API
------------------
Description:
The low-level API is provided by the search_unit() function, which
assumes that its arguments are already basic search units.
Therefore it does not know anything about boolean queries, etc.
The function returns an instance of the intbitset class. Note that
if you want to obtain the list of recIDs (as with the high-level
API), you can invoke the ``tolist()'' method on a hitset.
Signature:
def search_unit(p, f=None, m=None, wl=0, ignore_synonyms=None):
"""Search for basic search unit defined by pattern 'p' and field
'f' and matching type 'm'. Return hitset of recIDs.
All the parameters are assumed to have been previously washed.
'p' is assumed to be already a ``basic search unit'' so that it
is searched as such and is not broken up in any way. Only
wildcard and span queries are being detected inside 'p'.
If CFG_WEBSEARCH_SYNONYM_KBRS is set and we are searching in
one of the indexes that has defined runtime synonym knowledge
base, then look up there and automatically enrich search
results with results for synonyms.
In case the wildcard limit (wl) is greater than 0 and this limit
is reached an InvenioWebSearchWildcardLimitError will be raised.
In case you want to call this function with no limit for the
wildcard queries, wl should be 0.
Parameter 'ignore_synonyms' is a list of terms for which we
should not try to further find a synonym.
This function is suitable as a low-level API.
"""
Examples:
>>> # import the function:
>>> from invenio.websearch.search_engine import search_unit
>>> # search moi in any field:
>>> search_unit(p="moi").tolist()
>>> # this one will not match:
>>> search_unit(p="muon or kaon").tolist()
>>> # regexp search for a report number with possible trailing subjects:
>>> search_unit(p="^CERN-PS-99-037(-|$)", f="reportnumber", m="r").tolist()
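The division of labour between the mid-level and low-level functions can be pictured with a toy model. Everything below is hypothetical and runs without Invenio: a small in-memory index replaces the real indexes, Python sets stand in for intbitset hitsets, and the boolean pattern is evaluated strictly left to right, which is a simplification chosen for this sketch.

```python
import re

# Hypothetical toy index: word -> set of record IDs.
TOY_INDEX = {
    "muon": {1, 2, 5},
    "kaon": {2, 7},
    "ellis": {5, 9},
}


def toy_search_unit(p):
    """Look up one basic search unit verbatim; no boolean parsing."""
    return set(TOY_INDEX.get(p, set()))


def toy_search_pattern(p):
    """Split `p' on and/or and combine the unit hitsets left to right."""
    tokens = re.split(r"\s+(and|or)\s+", p.strip())
    hits = toy_search_unit(tokens[0])
    for operator, unit in zip(tokens[1::2], tokens[2::2]):
        if operator == "and":
            hits &= toy_search_unit(unit)
        else:
            hits |= toy_search_unit(unit)
    return hits


# The low-level lookup treats the whole string as one unit and finds nothing,
# while the mid-level stand-in splits it and unions the two hitsets.
print(sorted(toy_search_unit("muon or kaon")))
print(sorted(toy_search_pattern("muon or kaon")))
print(sorted(toy_search_pattern("ellis and muon or kaon")))
```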