From d12a8b7c92abb36e260a3940ab559d2482219585 Mon Sep 17 00:00:00 2001 From: mh17 <mh17> Date: Mon, 16 Aug 2010 09:44:22 +0000 Subject: [PATCH] EST masking and featuresets --- doc/Design_notes/index.html | 15 +-- doc/Design_notes/notes/EST_mRNA.html | 147 +++++++++++++++++++++----- doc/Design_notes/notes/f_context.html | 25 ++++- 3 files changed, 144 insertions(+), 43 deletions(-) diff --git a/doc/Design_notes/index.html b/doc/Design_notes/index.html index 6393e625f..d9e7ce5d6 100644 --- a/doc/Design_notes/index.html +++ b/doc/Design_notes/index.html @@ -1,4 +1,4 @@ -<!-- $Id: index.html,v 1.19 2010-08-03 08:09:12 mh17 Exp $ --> +<!-- $Id: index.html,v 1.20 2010-08-16 09:44:22 mh17 Exp $ --> <h2>ZMap Design Documentation</h2> @@ -57,19 +57,6 @@ Coordinate systems<br /> </div> </fieldset> -<fieldset><legend>Thought for food</legend> -<h3>Some things that need investigating and writing up.</h3> -<p><ul> -<li><p>X-Remote appears to talk to zmapControl, zmapView, and zmapWindow, so far. Who is in control?</p> -</ul> -</p> -<h3>Things that are slow</h3> -<p><ul> -<li>3-Frame takes a while to paint, likely due to having to do a lot of sums, but hiding it should simply be a matter of hiding the columns, after which a re-painting would also be quick. -</ul> -</p> - -</fieldset> <fieldset><legend>Unfinished Buisness</legend> <p> diff --git a/doc/Design_notes/notes/EST_mRNA.html b/doc/Design_notes/notes/EST_mRNA.html index 609e7913e..573d8edf2 100644 --- a/doc/Design_notes/notes/EST_mRNA.html +++ b/doc/Design_notes/notes/EST_mRNA.html @@ -1,8 +1,5 @@ <!-- $Id $ --> <h2>RT 69060: Clustering (masking) EST's and mRNA's</h2> -<h3>Warning</h3> -<p> This was once intended to be an interactive process w/ otterlace- see <a href="Design_notes/notes/otter_zmap.shtml">here</a>. However, reading the text and a quick chat w/ the requestor suggest that an interactive process was not wanted. 
-</p> <fieldset><legend>Background</legend> <p>In 2008 ticket 69060 was created requesting this: @@ -18,34 +15,41 @@ lot fewer boxes. </pre> and recently (after the latest annotation test) we were requested to see how easy this would be to implement. </p> +<h3>Please Note</h3> +<h4>Design issues</h4> +<p> This was once intended to be an interactive process w/ otterlace- see <a href="Design_notes/notes/otter_zmap.shtml">here</a>. However, reading the text and a quick chat w/ the requestor suggest that an interactive process was not wanted. +</p> +<h4>Implementation issues</h4> +<p>Please refer to <a href="Design_notes/notes/f_context.shtml">this page</a> for some notes on loose ends that need attention.</p> </fieldset> <fieldset><legend>Things to define/ solve</legend> <h3>How to decide when to not display a feature?</h3> <p> -<ul> -<li> the feature must be an EST -<li> the EST is covered completely by a same species mRNA ? -<li> the EST is completely covered by any mRNA (eg EST_Other x vertebrate_mRNA) ? -</ul> +An EST feature (ie group of ESTs with the same name) will be masked if each feature is covered by an mRNA feature (ie group of mRNAs with the same name). +Optionally we can mask an EST column against itself and also do the same with mRNA data, but in this case we must only mask data that has the same number of 'exons' to avoid masking alternate splicing events. </p> <h3>How to choose features to compare?</h3> <p>As features are effectively blank data and thier names are 'user-defined' we cannot hard code these and one obvious way to mark featuresets for 'clustering' and non-display would be via styles eg: <pre> -[EST-Human] +[EST_Human] # list of featuresets to compare this featureset with -alignment-mask-sets = vertebrate_mRNA ; cDNA +alignment-mask-sets = self ; vertebrate_mRNA ; cDNA </pre> +If we wish to mask a featureset against itself then we include its own name on the mask-sets list or the key word 'self'. 
It will be more efficient to include a featureset's self reference first. +The 'self' keyword is necessary to allow a common style to be used for a number of EST columns. </p> + <h3>Display Options</h3> <p> -Masked data can be not displayed, or displayed in grey (for example). The grey colour can be specified as a window colour in the ZMap config file. An RC menu option will be provided to show masked data per column and if selected this will be displayed in grey (if configured) or as normal (if not). +Masked data can either be hidden or displayed in grey (for example). The grey colour can be specified as a window colour in the ZMap config file. <pre> [ZMapWindow] -colour-masked-feature = light grey +colour-masked-feature-border = light grey +colour-masked-feature-fill = dark grey </pre> </p> -If styles cause data to be hidden from the annotators it may be advisable to allow this to be switched off and to display all features on demand. A new option can be added to the RC menu to toggle display of masked features independantly of existing bump options; although the non-display fucntion is only useful to the user in bumped columns it will result in much faster display of non-bumped columns. This option will apply to the column whether bumped or not. Note that this flag must be stored in the featureset struct and is distinct from flags per feature which say whether or not each feature is displayed. +If styles cause data to be hidden from the annotators it may be advisable to allow this to be switched off and to display all features on demand. A new option can be added to the RC menu to toggle display of masked features independently of existing bump options; although the non-display function is only useful to the user in bumped columns it will result in much faster display of non-bumped columns. This option will apply to the column whether bumped or not. 
Note that this flag must be stored in the featureset struct and is distinct from flags per feature which say whether or not each feature is displayed. If no colours are defined as above or the column is not maskable then the menu option will not be visible and the masked features will never be displayed. </p> </fieldset> @@ -59,38 +63,37 @@ But as features may arrive at any time (via pipeservers or via delayed loading) <p> The fact that we expect featuresets to be displayed separately (esp w/ pipe servers) implies that we must be able to mask each EST featureset independantly against each mRNA featureset, although if several mRNA featuresets are available when an EST column is displayed then these could be processed together. </p> -<p>In the interests of simplicity masking shall be performed on the feature context - flags will be maintained in the feature context to decide which features are masked and how this is presented to the user can be indeependant of this. For an initial column display masking will be performed before display and then only those features not flagged as masked will be added to the canvas. For a display update (eg when the masking column (mRNA) arrives after the EST) the feature context will be masked and features being masked will be removed from the foo canvas on the fly. +<p>In the interests of simplicity masking shall be performed on the feature context - flags will be maintained in the feature context to decide which features are masked, and how this is presented to the user can be independent of this. </p> -Note that canvas items refer to features and not vice versa, but a canvas item can be found relatively quickly via its hash table. +Note that canvas items refer to features and not vice versa, but a canvas item can be found relatively quickly via a hash table in the ZMapWindow structure. 
</p> <h4>Implementation README</h4> <p> <ul> -<li>The list of featureset quarks stored in the styles is non canonicalised as for featureset or style ID's, but is whitespace normalised. See zMapConfigString2QuarkList(). +<li>The list of featureset quarks stored in the styles is canonicalised as for featureset or style ID's, and is whitespace normalised. See zMapConfigString2QuarkList(). <li>On merging styles we do not merge quark lists but instead just overwrite. (zMapStyleMerge()) <li>The style parameter for the list of quarks is a string, which is how the data is input and output. <li>A featureset to be masked has a list of featuresets to mask with, which correspond to the styles config. As we mask against a featureset and not a style this list can only be used one-way. Whenever a featureset completes loading then before displaying it (and only if has a masker list) it is masked. If there is no masker list then all other existing featuresets are scanned and if relevant masked with it. (this will be done by an execute function that operates on the feature context). As this only happens once per featureset load it is not a performance problem. -<li>Masking of featuresets is performed immediately after the merge featureset operation - this involves setting flags in the feature context, and when featuresets are loaded it is possible that they have to be RevComped. See <b>zmapView/getFeatures()</b> though to <b>justMergeContext()</b> and <b>justDrawContext()</b>. BTW <b>zMapFeatureContextMerge()</b> is in <b>zmapFeature.c</b> not <b>zmapFeatureContext.c</b>. -<li> Updatang of the non/displayed features must be performed after the conext merge, in line with existing practice. +<li>Masking of featuresets is performed immediately after the merge featureset operation - this involves setting flags in the feature context, and when featuresets are loaded it is possible that they have to be RevComped. 
See <b>zmapView/getFeatures()</b> through to <b>justMergeContext()</b> and <b>justDrawContext()</b>. BTW <b>zMapFeatureContextMerge()</b> is in <b>zmapFeature.c</b> not <b>zmapFeatureContext.c</b>. +<li> Updating of the non/displayed features must be performed after the context merge, in line with existing practice. </ul> </p> </p> </fieldset> -<fieldset><legend>Masking algorithms</legend> +</fieldset> +<fieldset><legend>Masking algorithm</legend> <p> There may be a combinatorial problem to solve. A sample session from human chr-4-04 presented about 4-500 vertebrate mRNA features and 2-3000 EST_Human and given that each feature may contain many parts a naive algorithm could get quite slow. </p> <p>The following is proposed: <ul> <li> sort the mRNA's into start coordinate order, then end coordinate reversed -<li> remove or flag mRNAs that are covered by another one - this will be easy as they are in order -<li> sort the ESTs into order as for the mRNAs and hide EST's that are covered by others -<li> scan through EST's and for each one: -<li> scan all the mRNA's until the start coordinate means no cover -<li> skip over mRNA's whose end coord means no cover - this can be optimised by including a pointer to the next feature with a different start coordinate - many features have the same start coordinate. -<li> express both featuresets as integer arrays or lists defining exon and intron like sections. +<li> sort the ESTs into order as for the mRNAs +<li> scan through EST's and mRNAs together, advancing each one till mRNA end coord >= EST start and then EST start >= mRNA start +<li> if neither is masked express both features as lists defining exon and intron like sections. <li> use a simple function to determine if one covers the other completely and if so hide that EST and quit. +<li> stop when the start coordinates mean cover is no longer possible </ul> </p> <p> @@ -98,5 +101,99 @@ There may be a combinatorial problem to solve. 
A sample session from human chr- <p>The feature structure contains a union depending on the featuretype and for EST and mRNA we use style mode=align, which corresponds to a homol data structure. The style data uses a similar structure but the structure is called an alignment. Extra flags will be added to these to handle masking status. </p> +</fieldset> + +<fieldset><legend>Sorted access to the feature context</legend> +<h3>Existing practice</h3> +<p>The Feature Context has features referenced by a hash table which provides random unsorted access. +<b>NB</b> Tests have revealed that creating a hash table of 100k items will fail: the function does not return, although 90K appears to work quite quickly. +The FooCanvas stores data in a GList (which is quite slow) and the Window items code maintains a hash table to map features to canvas items. The canvas items refer back to their features directly. +</p> +<p>To avoid disturbing existing code we will add data to the feature context to provide sorted access - it would be better to replace the hash table with a B-tree or similar but that is not very practical in the short term. +</p> +<p>A quick glance at GLib shows that it provides binary trees and n-way trees but neither of these is ideal. The n-way trees just provide more lists but obviously would be slow with our data on random access. The binary trees are an opaque structure and while they provide lookup and traverse functions they do not implement next or previous functions. +</p> +<h3>Relating feature context data to canvas items: alignments</h3> +<p>Alignment features correspond to features in Ensembl and represent gapped alignments. Each feature is represented by a feature.homol structure which includes a GArray to specify the gaps if present; these gaps are small and are displayed as lines across the feature. Several distinct features of the same name can be joined up with vertical lines. 
+When displayed each part (alignment feature) is expressed as a separate canvas item, and these can be re-arranged in different ways according to the choice of bump mode. In Ensembl the data is represented as several distinct items with the same name. +</p> +<p> This implies that we have to connect features of the same name together and then sort these composite objects into start coordinate order. Column bumping already does a similar task, but for canvas items - we need lists of features of the same name. To make the sorting and comparison algorithm run faster we will prepend each list with a 'whole alignment' entry which will hold the start and end coordinates of each set of alignments. +</p> +<h3>Sorting and maintaining data</h3> +<p>We expect EST data and mRNA data to be static within a ZMap session - once sorted there is no requirement to repeat the process. Additional data supplied by servers will be stored in distinct context featuresets regardless of whether they are to be displayed in the same column. <i>Note that this requires ZMap configuration to specify unique featureset names</i> - this is possible even if the ultimate sources of two data sets have the same name, via the <b>[featureset-source]</b> stanza. If in future we request features for subsets of the region being used then we will need to re-sort a featureset as new data arrives. Data will be sorted into a GList of pointers to the features which will be stored in the featureset structure. If new data arrives the set will be flagged as unsorted by freeing the list. +</p> + +</fieldset> + +<fieldset><legend>Choosing to hide or display masked features in grey</legend> + +<h3>Motivation </h3> +<p>We wish to hide (ie not display at all) features that are masked a) to unclutter the display and b) to run faster, and this option is chosen by not specifying a 'masked features colour' for the window. 
+</p> +<p>For testing and also as an option to show all the data but direct the user's attention to important features we wish to display masked features in grey, and this colour choice will logically apply to the window rather than a column. If a masked features colour is specified then these features will be displayed but initially hidden - display speed will be faster if they are not displayed but typically the difference will be minimal as we are dealing with only 1000 features, and display speed is dominated by protein alignments where volumes are more like 50k+. +</p> + +<h3>How is this done?</h3> +<p> +In <b>zmapWindowFeature.c/zmapWindowFeatureDraw()</b>, called from <b>zmapWindowDrawFeatures.c/ProcessFeature()</b> we test to see if the window colour has been set and if not just don't draw masked features, in which case we never have to consider what colour to use. +</p> +<p> +If the window colour is set then when we draw the features we have to access this colour by a process which is a little obscure. In <b>zmapWindowCanvasItem.c/zMapWindowCanvasItemAddInterval()</b> we call <b>zmapWindowAlignmentFeature.c/zmap_window_alignment_feature_set_colour()</b> via the over-ridden class structure function set_colour() for alignment features. This has no direct access to the colour data but we can access the canvas featureset group (aka column) via the canvas item parent references. This <b>ZMapWindowContainerFeatureSet</b> object has a reference to the window which is opaque, and we have to implement a ZMapWindow function to access the colour spec. Unfortunately we have to write two of these functions to get past two levels of scope incompatibility, one to get to the featureset and one to reach the window. +</p> + +<h3>Controlling the display and column menus</h3> +<p> +Each featureset in the view's context has a <b>masker_sorted_features</b> GList which is non-NULL if the column has been masked ie if masking has been configured. 
Each feature has flags to specify masked and displayed.</p> +<p> The featureset will be displayed in a column (or more in case of strand and frame) and when it is displayed in a column flags in the <b>ZMapWindowContainerFeatureset</b> will be set to say if the column is maskable (contains one or more featuresets that are masked) and masked (mask data is not displayed). These flags will drive the column menu.</p> +<p>Currently there are no plans to select display of individual featuresets in a column; if this changes it may be wise to add a list of contained featuresets to the ZMapWindowContainerFeatureset structure. </p> + +<h3>Toggling display of masked items</h3> +<p>By default masked data will not be displayed and if we select 'Show Masked Features' then we will scan all ZMapWindowCanvasItems in that column and display the masked data. Deselecting this will remove the masked items from the FooCanvas. Each canvas item refers to its feature and when the display status of a feature is changed the flag in the feature structure that says 'displayed' will be updated. +</p> +<p> +<b>NOTE</b>: We assume that each feature is displayed in one column only, but the situation where multiple windows have been opened is more complex. In line with existing practice each window in a view reflects the same display state and changes in one occur in the others. To accommodate this we must display or hide features according to menu choice rather than feature state. Each time we show or hide a feature we set the flags in the feature accordingly. +</p> +<p> +<b>NOTE</b>: If a featureset is loaded into a column we must also check the display state of the column. +As per normal we mask the new column and display unmasked features, but if the column state requires masked features to be shown then we unmask the new features. 
+</p> + +<h3>Where to put the show/hide function and implementation</h3> +<p><b>zmapWindowContainerFeatureset.c/zMapWindowContainerFeatureSetShowHideMaskedFeatures()</b> +</p> +<p>Hidden features were originally intended to be not displayed and we would have added or removed items from the canvas according to display state. However this is a relatively complicated process due to the item factory... and also the fact that featuresets can appear in several columns (not particularly relevant for current practice with ESTs and mRNAs but we have to code it properly). +Instead we display as normal and use foo_canvas_item_show/hide(). Data volumes in this case are not huge, typically 1000 features total, so performance will not be affected much, and operating the show/hide menu will be quicker. +</p> + +</fieldset> + +<fieldset><legend>Results</legend> +<p>Using [chr11-03_2098831-2266877] and 3 EST columns and vertebrate_mRNA (and each set masking itself) we get: +<pre> +masked = vertebrate_mrna;est_other;est_mouse;est_human +masker = est_human;est_mouse;vertebrate_mrna;est_other +mask set vertebrate_mRNA with set vertebrate_mRNA +masked 51, failed 705, tried 125 of 125 composite features + +mask set EST_Other with set EST_Other +masked 199, failed 1327, tried 322 of 322 composite features +mask set EST_Other with set vertebrate_mRNA +masked 80, failed 580, tried 322 of 322 composite features + +mask set EST_Mouse with set EST_Mouse +masked 57, failed 106, tried 87 of 87 composite features +mask set EST_Mouse with set vertebrate_mRNA +masked 21, failed 113, tried 87 of 87 composite features + +mask set EST_Human with set EST_Human +masked 338, failed 2274, tried 528 of 528 composite features +mask set EST_Human with set vertebrate_mRNA +masked 105, failed 742, tried 514 of 528 composite features +mask old by new +</pre> +Which in rough terms gives us 50% of vertebrate mRNA's and between 10%-25% of ESTs displayed. 
+</p> +<p>The algorithm appears to run efficiently: 'failed' is the number of comparisons made with no effect. The worst case is quadratic (like the obvious naive algorithm) but we are a long way from that. +</p> </fieldset> diff --git a/doc/Design_notes/notes/f_context.html b/doc/Design_notes/notes/f_context.html index 21b1a274f..4cfd15b0b 100644 --- a/doc/Design_notes/notes/f_context.html +++ b/doc/Design_notes/notes/f_context.html @@ -1,11 +1,11 @@ <!-- $Id $ --> <h2>Restructuring the Feature Context </h2> <fieldset><legend>Introduction</legend> -<p>Please refer to <a href= "Design_notes/modules/zmapFeature.shtml">zmapFeature</a> for a discussion of featuresets columns and styles. +<p>Please refer to <a href= "Design_notes/modules/zmapFeature.shtml">zmapFeature</a> for a discussion of featuresets, columns and styles. See the section ?? below for details of implementation and implications of this. </p> <p>Data in ZMap is stored in a <b>Feature Context</b> which is organised in a hierarchy of Context, Align, Block, FeatureSet, Features. The features themselves can be complex but this is not relevant here. </p> -<p>Historically with ACEDB ZMap would request a list of 'featuresets' from ACE and each one of these could turn out to include data from several sources. The acedbServer (and later pipeServers) module would request all the data and store this in a new Feature Context, and this feature context would be passed to the <b>zmapView</b> module complete. The <b>zmapView</b> module would then merge this data into its own feature context and the processcan be repeated for each server. As ZMap moves to being mainly pipeServer based (although ACE will still be supported and needed esp for genefinder features) and various analysis modules are added it become necessary to distinguish more clearly between source data (featuresets) and display layout (columns). 
+<p>Historically with ACEDB ZMap would request a list of 'featuresets' from ACE and each one of these could turn out to include data from several sources. The acedbServer (and later pipeServers) module would request all the data and store this in a new Feature Context, and this feature context would be passed to the <b>zmapView</b> module complete. The <b>zmapView</b> module would then merge this data into its own feature context and the process can be repeated for each server. As ZMap moves to being mainly pipeServer based (although ACE will still be supported and needed esp for genefinder features) and various analysis modules are added it becomes necessary to distinguish more clearly between source data (featuresets) and display layout (columns). </p> </fieldset> @@ -39,13 +39,13 @@ It would be possible to identify the type of each feature, but not to access all <h3>The View and Window data structures have more data</h3> <p>The Window now has a copy of the featureset_2_column mapping from the view to allow it to display data in the right place. </p> -<p>In the View the previous list of columns (used to define the display order) has been replaced by a hash table, initially using the same ZMapGFFSet structure used for view->featureset_to_column. The Window also has a copy of this. The window still has window->feature-set_names, whcih is a list of column in display order. +<p>In the View the previous list of columns (used to define the display order) has been replaced by a hash table, initially using the same ZMapGFFSet structure used for view->featureset_to_column. The Window also has a copy of this. The window still has window->feature_set_names, which is a list of columns in display order. </p> </fieldset> <fieldset><legend>Loose Ends</legend> <h3>View data and Window data</h3> -<p>The View has a few data structures used to handle mapping between data sources display columns and styles, and each window has a copy of these. 
Whenever a new window is opened or more data is added then this is merged into the view data and the update percolated through to all the windows. It's not clear why this copying is necessary, but in case it is the existing mechanism has been extended to the extra data items that have been added. It would be desirable to replace the copied data with a pointer to the view's data (or rather, give the window a pointer to the view). Perhaps this data has been copied for reasons of scope - if the window had access t th view data then it would have to incldue zmapView_P.h - yet maitaining copies introduces further scope for errors & bugs. +<p>The View has a few data structures used to handle mapping between data sources, display columns and styles, and each window has a copy of these. Whenever a new window is opened or more data is added then this is merged into the view data and the update percolated through to all the windows. It's not clear why this copying is necessary, but in case it is the existing mechanism has been extended to the extra data items that have been added. It would be desirable to replace the copied data with a pointer to the view's data (or rather, give the window a pointer to the view). Perhaps this data has been copied for reasons of scope - if the window had access to the view data then it would have to include zmapView_P.h - yet maintaining copies introduces further scope for errors & bugs. Perhaps a new structure could be invented to amalgamate all these items and pass them around. </p> <h3>Column data and ordering</h3> <p>It is hoped to remove window->feature_set_names, which is used for column ordering and use the columns data instead. The columns data is distinct from the featureset to column mapping but as a practical (interim) measure the same data structure is used to store both - historically they have been treated as the same data. 
A new data structure is needed for column data, which should include display-order, bump status (to be removed from the style), the list of required styles (remove from featureset_2_styles) @@ -76,4 +76,21 @@ It may be wise to review the columns dialog in terms of 'available columns', and <p>This is still not perfect, there are a few areas where memory is possibly not being freed, and the merge strategy is ad-hoc in places. There are also some unexpected side effects to be had while changing code - adding columns to the requested featuresets caused the featuresets_2_styles data to be lost. This is probably not code spontaneously breaking but more likely being used in new ways. </p> +</fieldset> + +<fieldset><legend>Organising featuresets</legend> +<h3>Context to Featureset</h3> +<p>In the feature context we represent a sequence of DNA and various associated features and this is organised as aligns, blocks and featuresets. These are stored as a tree structure containing hash tables that hold the next lower level in the hierarchy and given the small numbers of these groups there is no performance issue with this structure - items may be looked up individually or processed in turn efficiently. +</p> +<p>Each featureset contains data of one type (formerly several related types) and the data volumes can be quite large (eg 50k for some alignment featuresets). As these are stored in a hash table they can be looked up very quickly and also processed en masse with little performance overhead, but note that the features will not be sorted. +</p> +<h3>Feature context to Foo Canvas Item</h3> +<p>When displayed a featureset can be split into several columns on the screen and each display item has a pointer to its feature in the feature context. The Foo Canvas implements groups of data as lists and this can be sorted, but random access is potentially very slow. 
+</p> +<a name="sort_featureset"></a> +<h3>Processing Featureset data</h3> +<p>For some functions (eg <a href="Design_notes/notes/EST_mRNA.shtml">masking EST features</a>) we wish to process the feature context in order and this requires either a change of data structure or an additional set of pointers to each feature to allow sorted access. Please follow that link for implementation details. +</p> + + </fieldset> -- GitLab
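Reviewer's sketch (not part of the patch): the masking steps proposed in EST_mRNA.html (sort both sets by start coordinate then end coordinate reversed, scan them together, compare exon-like sections, stop when the start coordinates rule out cover) can be illustrated roughly as below. This is a minimal Python sketch under stated assumptions, not the ZMap implementation, which is in C against the feature context. The names `Composite`, `covers` and `mask_set` are hypothetical, and "covered" is assumed here to mean that every exon-like part of the target lies inside one part of the masker; the notes do not define it more precisely.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Composite:
    """Hypothetical stand-in for a group of alignments sharing one name."""
    name: str
    parts: List[Tuple[int, int]]  # exon-like sections as inclusive (start, end)
    masked: bool = False

    @property
    def start(self) -> int:
        return min(s for s, _ in self.parts)

    @property
    def end(self) -> int:
        return max(e for _, e in self.parts)


def sort_key(f: Composite) -> Tuple[int, int]:
    # Start coordinate ascending, then end coordinate reversed (longest first).
    return (f.start, -f.end)


def covers(masker: Composite, target: Composite) -> bool:
    # Assumption: cover means each target part fits inside some masker part.
    return all(
        any(ms <= ts and te <= me for ms, me in masker.parts)
        for ts, te in target.parts
    )


def mask_set(targets: List[Composite], maskers: List[Composite],
             same_exon_count: bool = False) -> int:
    """Flag targets covered by a masker; return how many were newly masked."""
    maskers = sorted(maskers, key=sort_key)
    first = 0  # maskers before this index end too early to cover anything left
    masked = 0
    for est in sorted(targets, key=sort_key):
        if est.masked:
            continue
        while first < len(maskers) and maskers[first].end < est.start:
            first += 1
        for mrna in maskers[first:]:
            if mrna.start > est.start:
                break  # sorted by start: cover is no longer possible
            if mrna is est or mrna.masked or mrna.end < est.end:
                continue
            if same_exon_count and len(mrna.parts) != len(est.parts):
                continue  # self-masking must not hide alternate splicing
            if covers(mrna, est):
                est.masked = True
                masked += 1
                break
    return masked
```

Masking a set against itself would be `mask_set(ests, ests, same_exon_count=True)`, following the note that self-masking should only hide data with the same number of 'exons'. The inner scan restarts from a moving lower bound rather than tracking both cursors exactly as the notes describe, so it is simpler but does somewhat more comparisons.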