Author: mh17, 27 Nov 2009; Converted to HTML 02 Mar 2010
The pipeServer module is based on the existing fileServer and is configured using a [source] stanza in the ZMap config file for each pipe. The module takes a GFF format stream as input, and it is intended that only one or very few featuresets will be included in each pipe - the idea is to load data in parallel in order to reduce startup time.
Scripts are to be installed locally with ZMap, and the script directory will be identified in the [ZMap] stanza in ~/.ZMap/ZMap. Each script is defined by a URL, and parameters can be given in the query string.
Although the intention is to load much of the data directly from database sources instead of passing it via ACEDB, it is still necessary for ZMap to load the essential data from ACEDB (transcripts, featuresets-to-styles and style information), as GFF does not allow definition of styles. ZMap can also run from pipes/files only, provided styles are specified in another file.
The pipeServer module replaces the old fileServer module and supports both kinds of input.
[ZMap]
# define the location for scripts and data
# Of course these could be stored centrally somewhere if desired.
# and pipeServers can use absolute paths instead
script-dir = /nfs/users/nfs_m/mh17/ZMap/scripts
data-dir = /nfs/users/nfs_m/mh17/ZMap/scripts
# Each data source must be referenced in [ZMap] by listing the source stanzas like this
# Columns are displayed in the order given:
# in this example all the acedb features first then the 'other-feed' etc.
# later in the ZMap file each source must be defined in its own stanza.
sources = acedb ; other-feed ; yet_another_one
# If styles are not defined via ACEDB then a file must be given in the
# ZMap stanza (not the source stanza) eg:
stylesfile = /nfs/users/nfs_m/mh17/zmap/styles/ZMap.b0250.file.styles

[other-feed]
# config options for 'other-feed'
#
Once configured, ZMap will request data from each source in parallel, which should reduce loading time considerably. Each script will obtain and send the data 'somehow'. The scripts will replace the existing mechanism of Otterlace retrieving the data sequentially and adding it to ACEDB on startup.
Each source stanza has a 'delayed' option, which allows data to be requested on demand rather than at startup:
# delayed == connect from X-Remote request, otherwise connect on startup
delayed = true
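As a rough illustration of how the 'sources' list and the 'delayed' option fit together, here is a minimal Python sketch that splits the configured sources into those loaded at startup and those loaded on demand. It assumes the config text can be read by Python's configparser, which is only an approximation of ZMap's own parser; the names split_sources and EXAMPLE_CONFIG are hypothetical and the stanza contents are cut down for the example.

```python
import configparser

# Cut-down example config; stanza contents are illustrative only.
EXAMPLE_CONFIG = """
[ZMap]
sources = acedb ; other-feed

[acedb]
delayed = false

[other-feed]
url = pipe://getgff.pl?file=b0250_curated.gff
delayed = true
"""


def split_sources(config_text):
    """Return (startup_sources, delayed_sources) in the configured order."""
    # Interpolation is disabled so '%' characters in URLs are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    # Source names are separated by ';' in the [ZMap] stanza.
    raw = parser.get("ZMap", "sources")
    names = [name.strip() for name in raw.split(";") if name.strip()]

    at_startup, on_demand = [], []
    for name in names:
        # Assumption: a missing stanza or missing option means "not delayed".
        delayed = parser.has_section(name) and parser.getboolean(
            name, "delayed", fallback=False
        )
        (on_demand if delayed else at_startup).append(name)
    return at_startup, on_demand
```

Calling split_sources(EXAMPLE_CONFIG) keeps the column order given in 'sources' within each of the two lists.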
In each source stanza (one must exist for each data source) the syntax is the same as for existing file:// and acedb:// sources, but specifically for pipe:// sources we interpret the configuration as follows:
URLs take the form

<protocol>://[user][:password]@<host>[:port]/[url-path][;typecode][?query][#fragment]

<protocol> will be 'pipe'
user:password@host are not used and if present are ignored
port is not used and if present will be ignored
url-path is the path of the script. Note that according to http://rfc.net/rfc1738.html a single leading '/' signifies a relative path and two signifies absolute. We will interpret relative paths as relative to the ZMap scripts directory.
typecode is not used and will be ignored if present
query will be expanded into a normal argv vector
fragment is not used and will be ignored
Typically we expect a pipe:// data source to have only one (or very few) feature sets, as a major design aim is to exploit concurrent operation. Other configuration parameters will operate as normal (eg 'sequence=true' (which can only appear in one source) and 'navigator_sets=xxx,yyyy').
Here is an example for a test script that simply outputs an existing GFF file.
[b0250]
url = pipe://getgff.pl?file=b0250_curated.gff
featuresets = curated_features ; curated ; genomic_canonical
styles = curated_features ; curated ; genomic_canonical
A more realistic one with an absolute path... (but needs featuresets and styles and stylesfile specifying)
[human_xyx]
url=pipe:///software/anacode/bin/get_genes.pl?dataset=human&name=1&analysis=ccds_gene&end=161655109&csver=Otter&cs=chromosome&type=chr1-14&metakey=ens_livemirror_ccds_db&start=161542637&featuresets=CCDS:Coding;CCDS:Transcript
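To make the expansion of a pipe:// URL into an argv vector concrete, here is a minimal Python sketch. It is not ZMap's implementation. It simply follows the two example URLs above: a script name with no leading '/' is taken as relative to the scripts directory, and a path starting with '/' is taken as absolute. That reading is an assumption, as are the function name pipe_url_to_argv and the script_dir parameter. No percent-decoding is attempted, and user, password, host and port handling is left out.

```python
import posixpath


def pipe_url_to_argv(url, script_dir):
    """Expand a pipe:// URL into [script_path, 'key=value', ...]."""
    prefix = "pipe://"
    if not url.startswith(prefix):
        raise ValueError("not a pipe:// URL: " + url)
    rest = url[len(prefix):]

    # The fragment is not used, so drop anything after '#'.
    rest = rest.split("#", 1)[0]

    # Separate the script path from the query string.
    path, _, query = rest.partition("?")

    # The typecode is not used; it can only appear before the '?'.
    path = path.split(";", 1)[0]

    # Assumption: no leading '/' means relative to the scripts directory.
    if not path.startswith("/"):
        path = posixpath.join(script_dir, path)

    # Each '&'-separated item is passed through unchanged as key=value.
    args = [item for item in query.split("&") if item]
    return [path] + args
```

For instance, with a hypothetical scripts directory of /opt/zmap/scripts, the first example URL expands to the script path /opt/zmap/scripts/getgff.pl followed by the single argument file=b0250_curated.gff.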
A script must start with #!<program> or else it will not be exec'd. (Assuming Linux)
A script may obtain data in any way it likes but must output valid GFF data and nothing else on STDOUT (but anything is valid in a comment).
Brief error messages may be output to STDERR and these will be appended to the ZMap log. STDERR output is intended only to alert users of some failure (eg 'warning not all data found' or 'cannot connect to database') and not as a detailed log of script activity - if this is needed then the script should maintain its own log file. A warning message will be presented to the user, consisting of the last line in STDERR, and hopefully this will be enough to explain the situation without resorting to log files.
Regardless of whether an error message is sent, ZMap will attempt to use the GFF data provided.
ZMap will probably read STDERR after STDOUT is closed, and only if some error is encountered.
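The STDOUT/STDERR contract described above can be modelled with a short Python sketch. This only illustrates the contract from the caller's side and is not how ZMap itself is implemented: it collects both streams in one go rather than reading STDERR after STDOUT is closed. The names run_pipe_script and DEMO_ARGV are hypothetical, and the demo command stands in for a real script.

```python
import subprocess
import sys


def run_pipe_script(argv):
    """Run a script and return (gff_text, last_stderr_line_or_None)."""
    result = subprocess.run(argv, capture_output=True, text=True)

    # Only the last non-empty STDERR line is shown to the user as a warning.
    stderr_lines = [line for line in result.stderr.splitlines() if line.strip()]
    warning = stderr_lines[-1] if stderr_lines else None

    # The GFF data is returned whether or not an error message was sent.
    return result.stdout, warning


# Stand-in for a real script: one GFF header line on STDOUT and two
# messages on STDERR.
DEMO_ARGV = [
    sys.executable,
    "-c",
    "import sys; print('##gff-version 2'); "
    "sys.stderr.write('starting\\ncannot connect to database\\n')",
]
```

Running run_pipe_script(DEMO_ARGV) returns the GFF text together with the last STDERR line, so the caller can still use the data while presenting the warning.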
Arguments will be given in the format key=value with no preceding dashes; these will be as extracted from the server query string. (If people care about this we could change it...)
Extra arguments may be added subject to implementation:
zmap_start=zmap_start_coord # in zmap coordinates not bases (if configured)
zmap_end=zmap_end_coord
wait=9 # delay some seconds before sending data
       # (can be given in the query string, main use is for testing)
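On the script side, the key=value arguments could be handled along the following lines. This is a minimal Python sketch (the document's own example script is Perl). The function names parse_args and startup_delay are hypothetical, and treating a missing 'wait' as zero seconds is an assumption.

```python
def parse_args(argv):
    """Collect key=value arguments (no leading dashes) into a dict."""
    args = {}
    for item in argv:
        key, sep, value = item.partition("=")
        # Items without '=' are skipped rather than guessed at.
        if sep:
            args[key] = value
    return args


def startup_delay(args):
    """Seconds to wait before sending data; 0 if 'wait' was not given."""
    return int(args.get("wait", 0))
```

A script would typically call parse_args(sys.argv[1:]), sleep for startup_delay(args) seconds, and then write its GFF output to STDOUT.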