---
meta-title: EBI Search indexing
title: EBI Search indexing
description: The EBI provides a global search service across most of the data sources available at the institute.
cta: Call to action
hero-class: hero-building-blocks
hero: true
filter-nav: false
filter-dropdowns: false
layout: static
pagetype: meta-pattern
location: search-ebi-indexing.html
---
{{#markdown}}
### Glossary of terms used in the search guidelines

- Document (Lucene term): a virtual document consisting of a set of fields. A document can have several fields with the same name.
- Field (Lucene term): part of a document (see above). A field is a <name, content> pair. The name provides metadata, e.g. a row name in a database or the different parts of a web page or email (header, body, ...). The content contains the actual data. Both parts of a field are indexed, but the name is only available as structural information, i.e. one can search for something within a specific field, but a field name will usually not appear in the search result.
- Domain (EB-eye term): a data source, most of the time a database. For example: UniProtKB, PDBe, ...
- Domain tree, hierarchy (EB-eye term): all domains in the EB-eye are organised in a tree. Nodes of the tree are, for example, Protein Sequences or Small Molecules. Leaves of the tree are, for example, UniProtKB (parent: Protein Sequences) or ChEBI (parent: Small Molecules).
- Data provider (EB-eye term): a group or person who provides the data for a domain.
The EBI provides a global search service across most of the data sources available at the institute. The Lucene-based EBI search engine (also known as EB-eye) provides unified summary results for global searches over the majority of the EBI databases. The engine indexes a meaningful subset of data from the databases and returns summary information containing links to the original databases.
The engine has been built to accommodate the vast variety of data available across all databases at the EBI. Most of the databases at the EBI already have flat files or XML dumps, which are used by the search engine. Some databases use the XML format specifically defined for EB-eye to dump their data and make it available through the search engine.
To make it easier to maintain, and to guarantee the most up-to-date data, EB-eye has a mechanism for updating data automatically. At least once a day, all data sources are checked and analysed to identify possible updates, and the system automatically re-indexes them.
After each update, a footprint is generated and represents a signature for the data source. A new footprint is generated before each new update and is compared to the previous one. If they are equal, i.e. the data has not changed, no update is needed. If they are different, the data gets updated.
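The footprint mechanism can be sketched as a digest over the source files' metadata. This is an illustrative sketch, not the actual EB-eye implementation: here a footprint is a hash over (path, size, mtime) tuples, so any added, removed, or modified file changes the signature.

```python
import hashlib

def footprint(file_info):
    """Compute a signature ("footprint") for a data source.

    file_info is an iterable of (path, size, mtime) tuples describing
    the source files; any added, removed, or modified file changes
    the resulting digest. Sorting makes the digest order-independent.
    """
    digest = hashlib.sha256()
    for path, size, mtime in sorted(file_info):
        digest.update(f"{path}|{size}|{mtime}".encode("utf-8"))
    return digest.hexdigest()

def needs_update(old_footprint, file_info):
    """Re-index only when the new footprint differs from the stored one."""
    return footprint(file_info) != old_footprint
```

If the stored footprint and the freshly computed one match, the daily update can skip the domain entirely.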
In order to index data properly the EB-eye indexer needs some information:
Information needed for the EB-eye indexer | Description | Example |
---|---|---|
Analyser | Extracts information out of documents and transforms it into tokens, which can then be indexed. | For example, dates can be written in different formats. A date analyser tries to find the format used, transforms it into an internal representation and returns this as a token. EB-eye provides several analysers. Most of them are derived from Lucene's analysers, but we also developed our own, e.g. for chemical notation. If no analyser is available for a field, the standard analyser is used. This means a user can only find something if she queries the exact term. |
Store | Whether the data should be stored. Possible values are: YES, NO, COMPRESSED. At first glance a NO might be confusing, but EB-eye can index data without storing it. | This can be useful, for example, for keywords. An entry can have several keywords; these can be indexed with the entry so that the entry will be found when searching for the keywords. However, the keywords will not appear on the result page because they were not stored. |
Boost | Value for the "importance" of a field. EB-eye can give fields a boost factor. The higher the boost factor, the higher this entry will be ranked (in reality, it is a little more complicated). | Boost factors should be used, if at all, only for very important fields, such as IDs. For more information about the boost factor, please refer to the Lucene documentation. |
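The three pieces of information above could be modelled per field as follows. This is a hypothetical sketch, not the real EB-eye configuration schema; the field names and option names are illustrative.

```python
# Hypothetical per-field indexing configuration, mirroring the three
# pieces of information described above: analyser, store, and boost.
FIELD_CONFIG = {
    "id":          {"analyser": "keyword",  "store": "YES", "boost": 2.0},
    "description": {"analyser": "standard", "store": "YES", "boost": 1.0},
    # Indexed but not stored: searchable, yet absent from the result page.
    "keywords":    {"analyser": "standard", "store": "NO",  "boost": 1.0},
}

def is_searchable(field):
    """Any configured field is indexed, hence searchable."""
    return field in FIELD_CONFIG

def appears_in_results(field):
    """Only stored (YES or COMPRESSED) fields show up on the result page."""
    return FIELD_CONFIG.get(field, {}).get("store") in ("YES", "COMPRESSED")
```

Note how `keywords` is searchable but never displayed, exactly the situation the Store row describes.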
There are various input indexer implementations available in EB-eye, but mostly two are used. These two implementations are based on parsers which are used to describe the structure of the source files and index them.
By default, the source data are indexed in a distributed environment. The files are split into several chunks (sets of entries) or grouped into sets of files and are indexed in parallel by several machines. This allows us, if necessary, to index all domains in one go in less than half a day.
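The chunk-and-parallelise idea can be illustrated in a few lines. In EB-eye the chunks are dispatched to several machines; in this sketch a thread pool stands in for the worker machines, and `index_chunk` is a hypothetical stand-in for the real indexing step.

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(entries, size):
    """Split a list of entries into fixed-size chunks."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]

def index_chunk(chunk):
    """Stand-in for indexing one chunk on one worker machine;
    returns the number of entries 'indexed'."""
    return len(chunk)

def index_in_parallel(entries, chunk_size=1000, workers=4):
    # Each chunk is indexed independently, so the work parallelises
    # cleanly; summing the per-chunk counts verifies full coverage.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_chunk, chunks(entries, chunk_size)))
```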
The previous section explained what information is needed to index data. The input indexer most of the time relies on parsers, which in turn rely on a grammar to extract information from input files.
The grammar is a lexical representation of the data, associated with a set of actions to be executed. It helps to extract the structure of an entry and its various fields from the data source. Actions can be associated with this structure: from the grammar, an action can be executed for each entry and for each of its fields (e.g. dates, cross-references, authors), typically extracting the information and indexing it. A set of predefined actions using the information from the configuration files is available to ease the indexing. That is why it is important to have a detailed description of the data file format, to make sure the parsers properly match the corresponding data structure.
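As an illustration of the idea (not the actual EB-eye grammar language), a line-oriented flat-file parser can bind line prefixes to actions. The prefixes `ID`, `DE`, `DT` and the `//` entry terminator are modelled on common flat-file conventions, not a specific EB-eye grammar.

```python
# A "grammar": each line prefix maps to (field name, action). The action
# extracts the field content; here the only action is stripping whitespace.
GRAMMAR = {
    "ID": ("id", str.strip),
    "DE": ("description", str.strip),
    "DT": ("date", str.strip),
}

def parse_entries(lines):
    """Yield one field dict per entry; entries are terminated by '//'."""
    entry = {}
    for line in lines:
        if line.startswith("//"):
            if entry:
                yield entry
            entry = {}
            continue
        prefix, _, rest = line.partition("   ")
        if prefix in GRAMMAR:
            field, action = GRAMMAR[prefix]
            entry[field] = action(rest)
    if entry:  # final entry without a trailing terminator
        yield entry
```

Each yielded dict is what would then be handed to the indexer, field by field.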
After the data has been indexed, it is available for searching through the global search engine. Several types of searches are possible. Note that the following subsections refer to the internal search mechanisms. For the user interface, skip to the next section, "Web Interface".
The simplest search is the global search in all the fields indexed for a particular domain.
A more specific search is the field-specific search, where a query term is only searched in a particular field. This type of search is typically what an advanced search offers. Every field indexed for a domain will be available for a field-specific search, including the cross-reference fields.
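The two search types differ only in the query string. In standard Lucene syntax, which the engine is built on, a field-specific search prefixes the term with the field name; a minimal sketch:

```python
def build_query(term, field=None):
    """Build a Lucene-style query string.

    A global search uses the bare term, searched across all indexed
    fields; a field-specific (advanced) search targets one field,
    e.g. 'description:kinase'.
    """
    if field is None:
        return term           # global search
    return f"{field}:{term}"  # field-specific search
```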
An important feature of the search engine is the ability to use cross-references to navigate between different domains by jumping from entry to entry.
During the indexing of the data, the cross-reference information is extracted from the source files and stored as cross-reference fields in the index. Lucene imposes some restrictions that had to be bypassed and, as a result, the name of the referenced database and the ID of the referenced entry are the only information stored.
When launching a cross-reference search, the system will try to find the cross-references by looking for exact matches for the stored IDs. This means that if the cross-references indexed for a particular domain don't use IDs but an accession number or another kind of identifier, the system will not be able to retrieve the cross-reference. In this case the name for the cross-reference needs to be specified in the configuration file.
The previous sections explained how the EB-eye search engine works internally, from the automatic update and the indexing of the data to the handling of several types of queries and the retrieval of search results.
Another important element of the EB-eye is its web interface. The aim for the EB-eye web interface is a design that is as simple as possible. The following text provides some basic guidelines.
A basic search form should be present on all EBI pages and always at the same position. The syntax of the basic search form should be:
Presenting search results is part of the focus group exercise and is therefore likely to change. The following text therefore gives only very rough guidelines for result pages.
When searching for information, the way the results are displayed is essential for the user experience. Not enough detail is annoying when browsing through the results; too much information, on the other hand, might discourage users from using the service.
EB-eye can parse and index data files of different formats but also defines its own XML format (XML4dbDumps). This can be used for databases that currently don't have a flat file or an XML formatted dump and where there is no requirement to dump the whole database in a specified format.
As a rule of thumb:
An existing file format is preferable if:
XML4dbDumps format is preferable if:
Note: Whatever file format is used, the entries can be present in one or several files. There is no restriction on the number of entries per file.
To get data indexed by the EB-eye search engine, two things are needed:
To ease the maintenance of the EB-eye and to guarantee the most up-to-date data, an automatic data update mechanism has been implemented. If updates are available, the new data is downloaded, uncompressed if necessary, then re-indexed and redeployed to be visible to users. Additionally, metadata (release, release date, number of entries, ...) are generated from the data or from a release note, for verification and information purposes.
The following information is needed for this step:
```
# Comment
release=[release number or release date if no release defined]
release_date=[DD-MMM-YYYY]
entries=[number of entries]
```
You don't need to create a metadata file if you use the XML4dbDumps format, as it already contains this information.
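The metadata format above is simple enough to parse in a few lines. A sketch, assuming `key=value` lines and `#` comments as shown; converting `entries` to an integer is an illustrative addition, not a documented requirement:

```python
def parse_metadata(text):
    """Parse an EB-eye metadata file of key=value lines.

    Lines starting with '#' are comments and blank lines are ignored.
    """
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        meta[key.strip()] = value.strip()
    if "entries" in meta:
        # The entry count is used to verify the indexed data.
        meta["entries"] = int(meta["entries"])
    return meta
```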
e.g. for UniProt:
Root source URI: "/ebi/ftp/private/uniprot/4EBIES/knowledgebase/"
File pattern: "uniprot_.*\.dat\.gz"
Excluded sub dirs: ".*"
Metadata file: "/ebi/ftp/private/uniprot/4EBIES/knowledgebase/relnotes.txt"
e.g. for MSD:
Root source URI: "ftp://ftp.ebi.ac.uk/ebeye_msd"
File pattern: "MSDCHEM\.xml"
Excluded sub dirs: ".*"
Metadata file: not needed, as it is in the XML4dbDumps format
e.g. for GO:
Root source URI: "http://archive.geneontology.org/latest-termdb"
File pattern: "go_daily-termdb\.rdf-xml\.gz"
Excluded sub dirs: ".*"
In order to index data, the EB-eye search engine needs to know the format of the data (syntax) and how to index it (semantics). The format is needed to develop a parser, and the semantics define which fields should be stored under which names.
If a data provider decides not to use EB-eye's own XML4dbDumps data format, they need to provide sufficient information about their data format so that a parser can be written for it. It is important to define these fields, and how to index them, well, as this has a huge impact on the quality and relevance of the results.
A data provider needs to supply three pieces of information for each field to be indexed:
If you use the XML4dbDumps format, some fields are already defined (id, authors, keywords, date, ...). Additional fields can be defined in:
```xml
<additional_fields>
  <field name="namefield1">value1</field>
  <field name="namefield2">value2</field>
  ...
</additional_fields>
```
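A fragment in this shape can be generated with the standard library. A minimal sketch, producing exactly the element structure shown above (the field names and values are placeholders):

```python
import xml.etree.ElementTree as ET

def additional_fields_xml(fields):
    """Build an <additional_fields> element from a {name: value} mapping,
    matching the XML4dbDumps fragment shown above."""
    root = ET.Element("additional_fields")
    for name, value in fields.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = str(value)
    return ET.tostring(root, encoding="unicode")
```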
Another type of indexed field is cross-references to other databases.
If the XML4dbDumps format is used they are defined as:
```xml
<cross_references>
  <ref dbname="db2" dbkey="abc123"/>
  <ref dbname="db3" dbkey="abcdef"/>
  ...
</cross_references>
```
These cross-references can point to either internal databases that are indexed by the EB-eye (domains) or to external resources.
Note: external xrefs are not displayed at the moment, but will be in the future.
The internal xrefs defined in the data can use database names different from the ones EB-eye uses, and can also use a specific field for the identifier. E.g. a database contains xrefs with dbname="swiss-prot", dbkey="Q62594". This xref actually needs to point to the domain 'UniProtKB' and use the accession number 'Q62594'.
Note: you can add a suffix to the database name to add some 'semantics' to the cross-reference. For example, if you have xrefs to Ensembl which are actually xrefs to either transcripts or genes, you can name the fields ENSEMBL_TRANSCRIPT or ENSEMBL_GENE, so that users will be able to distinguish between the two. Both will internally point to the Ensembl domain. Data providers need to go through their xrefs and establish which database and field they point to.
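The alias resolution described above amounts to a lookup table. A hypothetical sketch: the concrete mappings mirror the examples in the text, and the lowercased fallback domain names are illustrative, not the real EB-eye configuration.

```python
# Hypothetical alias table mapping database names found in source xrefs
# (including semantic suffixes) to the EB-eye domain they point to.
XREF_ALIASES = {
    "swiss-prot":         "uniprotkb",  # uses the accession number, not an ID
    "ENSEMBL_TRANSCRIPT": "ensembl",
    "ENSEMBL_GENE":       "ensembl",
}

def resolve_xref(dbname, dbkey):
    """Return the (domain, identifier) an xref actually points to.

    Unknown names fall back to the lowercased dbname, a stand-in for
    the case where no alias is configured.
    """
    domain = XREF_ALIASES.get(dbname, dbname.lower())
    return domain, dbkey
```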
Here is an example of the information the EB-eye team needs for the different fields:
Field name in data | Brief field description | (NOT_)INDEXED / (NOT_)STORED | Type of value (regular expression, format, semantics, ...) |
---|---|---|---|
[field name] | [description] | [(NOT_)INDEXED/(NOT_)STORED] | [type of the value, specific format, list of values...] |
id | id of the entry | INDEXED, STORED | [A-Z][0-9]{4} |
name | name of the entry | INDEXED, STORED | English text |
last_update | last update | INDEXED, STORED | date |
Information needed for cross-references to other resources:
Cross-reference | Brief xref description | Domain name / external resource referenced | Field referenced (for domains) / URL (for external db) | Comment |
---|---|---|---|---|
swiss-prot | xref to UniProtKB | UniProtKB (domain) | AC | |
AFCS | xref to AFCS | AFCS (external db) | id | http://www.signaling-gateway.org/data/Y2H/cgi-bin/y2h_int.cgi?id=%{id} |
Please check for every xref whether the referenced resource is an EB-eye domain: https://www.ebi.ac.uk/ebisearch/statistics.ebi
The previous sections described how the EB-eye search engine works and explained the relationship between the configuration files, the indexing process and the web interface. The following paragraphs describe what can be done to improve the quality of the results and the user experience.
The EB-eye tries to offer its users the most up-to-date data. For this reason, an automatic update mechanism has been developed and runs every day to make sure the indexes are updated. However, the system relies on the data providers for these data and needs to know where the latest versions can be found. It is therefore important to define a static location where the EB-eye data for a domain are stored.
Another important stage of the update is the verification of the data. A clearly defined format for the source files is a good start. Most of the time a parser will be used to go through the data and index them. Unfortunately, the format of these data is sometimes unavailable or out of date and, as a result, writing the parsers becomes difficult and time-consuming. Providing a detailed description of the data structure, be it a description document, a DTD or an XML schema, will greatly help not only to write the parsers, but also to verify the source files.
Some data providers include release notes, which can be used by the automatic update to verify the data that have been indexed (the number of entries is one of the details that are really useful for verifying whether the data have been indexed correctly). Unfortunately, most of the indexed data cannot be verified because this information is missing or incorrect. Making sure that such information is available and accurate helps to guarantee the quality of the indexed data.
The parsers determine which fields and which information will be stored in the index. So, to ensure the quality of the data, data providers should establish a list of the fields to be indexed, with their names and descriptions, and how they are represented in the source files. This, together with a detailed description of the data format, will help in writing a parser and defining proper names for the fields. These names will be available in the Advanced Search, so they have to be meaningful to the user.
When using the EB-eye XML format, the names of the fields and their content must be clearly defined with the search application in mind. The additional fields section can prove really important for improving the quality of the search. A dump with only ids and names will never return any results when searching for common biological terms. If description or full-text additional fields are included, the search engine will provide much better results.
Another aspect to consider when selecting data to export for EB-eye is cross-references. Providing a maximum number of cross-references to a wide range of databases benefits users: by following cross-references, they will be able to navigate easily between, and explore, the different domains within the EB-eye.
However, cross-references have to be clearly identified, otherwise they might not be properly recognised by the system. Ideally, cross-references should use the correct database name and the corresponding ID (not an accession number), but obviously this is not always possible. If cross-references cannot be provided in the canonical way, please provide the information necessary for the EB-eye team to update the EB-eye configuration files with the new aliases and further cross-reference information.
EB-eye has only two different result pages, which can be slightly modified to improve the user experience.
The default layout displays, for each entry, the id, name and description, followed by the entry links and cross-reference links. Correctly defining this information ensures the coherence of the result display; therefore name and description should be stored in the index. If no obvious name or ID exists, data providers should define a meaningful name and ID. Data without an ID and name will only be indexed if a data provider can conclusively argue why she cannot provide them.
Links pointing to data have to be carefully defined. An important link is the ID field link, which redirects to the corresponding data provider's web site. The entry links should also be checked and reviewed by the data providers to make sure they are correct. Obviously, all links should resolve to a valid web page. EB-eye does not check whether the link behind an ID is valid. However, EB-eye does check for every cross-reference whether the site behind it exists. Obviously, EB-eye cannot check whether the content is valid.
Sometimes the default layout is not appropriate for displaying the results. In such cases, data providers should contact the EB-eye team to discuss a possible custom layout. For every layout, default or custom, simplicity should be one of the main objectives. Thus, only the information really needed to allow users to decide whether to visit the original site should be included.
{{/markdown}}