Bugfix/delint MIM parser (!311) · Merge requests · ensembl-gh-mirror / ensembl

Merged Marek Szuba requested to merge bugfix/delint_MIMParser into feature/xref_sprint 6 years ago

Oct 24, 2018
- MIMParser: more compact definition of parameter defaults · 5cbf499b
  Marek Szuba authored 6 years ago
  
  5cbf499b
- MIMParser: Remove unused imports · 9667e76b
  Marek Szuba authored 6 years ago
  
  9667e76b
Oct 23, 2018
- MIMParser: add comments to synonym generation · ee56a2bd
  Marek Szuba authored 6 years ago
  
  ee56a2bd
- MIMParser: fix a typo in comments · a8c676c8
  Marek Szuba authored 6 years ago
  
  a8c676c8
Oct 18, 2018

MIMParser: move TI processing to a separate function as well · 189f7abb

Marek Szuba authored 6 years ago

This does not reduce the complexity of run() by much according to perlcritic,
but then again the record loop now fits on a single page. A single page
in a 100ish-row terminal, but still.

189f7abb

MIMParser: abort if we cannot get a record's TI field · 00f0afe9

Marek Szuba authored 6 years ago

We abort if we cannot extract detailed information from TI so why
should missing or malformed TI be any less fatal?

00f0afe9

MIMParser: extract accession from TI, not NO · 9cff6b08

Marek Szuba authored 6 years ago

Having confirmed with Mag the two numbers should always be the same,
limit the parsing to TI - it contains everything we need.

9cff6b08

MIMParser: first round of complexity and nesting reduction · 5a594bfe

Marek Szuba authored 6 years ago

Move NO and TI extraction to a separate function. Moreover, we now skip ahead
to the next record iteration if either of them does not exist, instead of
introducing two more nesting levels by only proceeding if they are defined.

5a594bfe

MIMParser: simplify processing of single-xref records · a2f3d023

Marek Szuba authored 6 years ago

Instead of a massive if-elsif cascade in which the only things different
were the source id passed to add_xref and the counter to increment,
create a generic object and choose the right source id / counter using
lookup hashes. The processing of two-record insertions takes advantage
of both counter selection and the generic object as well, although it is
not fully automated yet. Moved/removed entries only use generic counters
for now.

a2f3d023
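
The lookup-hash approach described in a2f3d023 can be sketched roughly as follows. This is not the parser's actual code: the mapping from record-type symbols to source names, the numeric source ids, and the helper name `assign_source` are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Carp;

# Assumption: which type symbol maps to which source is illustrative only.
my %source_name_for_type = (
    q{*} => 'MIM_GENE',
    q{#} => 'MIM_MORBID',
    q{%} => 'MIM_MORBID',
);

# Made-up numeric ids; in the pipeline they would come from the xref database.
my %source_id_for = ( MIM_GENE => 11, MIM_MORBID => 12 );
my %count_for     = ( MIM_GENE => 0,  MIM_MORBID => 0 );

# Hypothetical helper: fill in the source id of a generic xref and
# increment the counter that belongs to the same source.
sub assign_source {
    my ( $type, $xref ) = @_;
    my $source_name = $source_name_for_type{$type};
    if ( !defined $source_name ) {
        croak "No single source for record type '$type'";
    }
    $xref->{source_id} = $source_id_for{$source_name};
    $count_for{$source_name}++;
    return $xref;
}

my $xref = assign_source( q{*}, { acc => '100650', label => 'EXAMPLE' } );
print "source_id=$xref->{source_id} genes=$count_for{MIM_GENE}\n";
```

Adding a new single-xref record type then only means adding one entry to the first hash, rather than another branch in a cascade.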

MIMParser: show $removed_count in the summary · f30ca640
Marek Szuba authored 6 years ago

f30ca640

MIMParser: put get_source_id_for_source_name() error code in a constant · 4cd79359

Marek Szuba authored 6 years ago

Instead of repeating -1, use a descriptively named constant. This constant
should be defined in BaseParser but for the time being let us not touch
that module.

4cd79359

MIMParser: do not concatenate words in multi-line descriptions · 952f35e7

Marek Szuba authored 6 years ago

Previously we simply removed all newline characters throughout the TI field.
This resulted in e.g.

URINARY TRACT ABNORMALITY AND
CRYPTORCHIDISM;;

becoming

URINARY TRACT ABNORMALITY ANDCRYPTORCHIDISM;;

Now we only remove a newline if it is either immediately preceded or
immediately followed by ;; (i.e. the separator); otherwise replace
\n with a space instead.

952f35e7
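
The newline handling described in 952f35e7 can be illustrated with two substitutions. This is a minimal sketch under the rules stated in the commit message; the function name `normalise_ti_newlines` is hypothetical and the parser itself may implement the rule differently.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the rule from the commit message.
sub normalise_ti_newlines {
    my ($ti) = @_;

    # Drop a newline immediately preceded or followed by the ';;' separator.
    $ti =~ s{ (?<= ;; ) \n | \n (?= ;; ) }{}gmsx;

    # Any newline still left sits between two words: turn it into a space.
    $ti =~ s{ \n }{ }gmsx;

    return $ti;
}

my $ti = "URINARY TRACT ABNORMALITY AND\nCRYPTORCHIDISM;;\nSECOND TITLE";
print normalise_ti_newlines($ti), "\n";
```

With this input the words AND and CRYPTORCHIDISM stay separated by a space, while the newline after the separator disappears.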

MIMParser: get rid of the postfix form of if · 8f7f6bd1

Marek Szuba authored 6 years ago

The first case just reads better, IMHO. The second and third were wasteful
because they checked exactly the same condition twice instead of
just wrapping both print statements in the same block.

8f7f6bd1

MIMParser: improve the setting of parameter defaults · 18cefbda

Marek Szuba authored 6 years ago

Although the old ways of handling default values of $dbi and $verbose in
principle worked, they both left something to be desired:
 - for $verbose, using the bitwise OR was slightly confusing given
   verbosity levels of the xref pipeline are NOT bitmasks;
 - for $dbi, we used the frowned-upon (and somewhat noisy) postfix form
   of unless.
Both assignments use the logical Defined-OR (//) operator now.

18cefbda
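
As a reminder of why Defined-OR suits parameter defaults, the sketch below contrasts it with the logical OR. The function `resolve_defaults`, its argument layout, and the fallback value are assumptions for illustration, not the parser's real interface.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser's argument handling.
sub resolve_defaults {
    my ( $arg_ref, $fallback_dbi ) = @_;

    # // only falls back when the left-hand side is undef,
    # so an explicit verbosity of 0 is preserved.
    my $verbose = $arg_ref->{verbose} // 0;
    my $dbi     = $arg_ref->{dbi}     // $fallback_dbi;

    return ( $verbose, $dbi );
}

my ( $verbose, $dbi ) = resolve_defaults( { verbose => 0 }, 'default-handle' );
print "verbose=$verbose dbi=$dbi\n";

# For comparison: || treats every false value (0, '', undef) as missing.
my $level    = 0;
my $with_or  = $level || 3;    # 3, the explicit 0 is lost
my $with_dor = $level // 3;    # 0, the explicit 0 is kept
print "||: $with_or  //: $with_dor\n";
```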

MIMParser: consistently croak on errors · f52fc20f
Marek Szuba authored 6 years ago

f52fc20f

MIMParser: no longer require a fixed order of fields · 17ca46d8
Marek Szuba authored 6 years ago

17ca46d8

MIMParser: croak upon failure to retrieve MIM source IDs · 6a628949

Marek Szuba authored 6 years ago

Without this, if the relevant metadata is not present in the xref
database we are working with, the parser will make a lot of noise but only
actually fail upon trying to add a new xref to the database, i.e. much, much
later. Let us fail when the problem actually occurs, shall we?

Use croak() rather than 'return 1' because this is a set-up error rather
than a data-processing one.

6a628949
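
The fail-early check from 6a628949, combined with the named error constant from 4cd79359, might look roughly like this. The constant name, the stub lookup, and the source ids are assumptions; only the -1 error value and the use of croak() come from the commit messages.

```perl
use strict;
use warnings;
use Carp;

# Assumed name for the constant replacing the repeated literal -1.
use constant ERR_SOURCE_ID_NOT_FOUND => -1;

# Hypothetical stand-in for BaseParser's get_source_id_for_source_name();
# here only one source is "present in the database".
my %known_source_ids = ( MIM_GENE => 11 );

sub get_source_id_for_source_name {
    my ($name) = @_;
    return $known_source_ids{$name} // ERR_SOURCE_ID_NOT_FOUND;
}

# Resolve every source id up front and abort at the first one missing.
sub require_source_ids {
    my (@names) = @_;
    my %id_for;
    foreach my $name (@names) {
        my $id = get_source_id_for_source_name($name);
        if ( $id == ERR_SOURCE_ID_NOT_FOUND ) {
            croak "Failed to retrieve the source id for '$name'";
        }
        $id_for{$name} = $id;
    }
    return \%id_for;
}

my $ids = eval { require_source_ids(qw( MIM_GENE MIM_MORBID )) };
if ( !defined $ids ) {
    print "set-up failed: $@";
}
```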

MIMParser: use a local variable instead of $_ for storing input · 11633ec3
Marek Szuba authored 6 years ago

11633ec3

MIMParser: remove unused variables · a5617284
Marek Szuba authored 6 years ago

a5617284

MIMParser: Explicitly check for "REMOVED FROM DATABASE" records · 0fd02142

Marek Szuba authored 6 years ago

Previously, if the "MOVED TO" match failed we assumed the ^ record in
question indicated removal. Let us not be so optimistic any more and
actually check this, aborting in the event of removal not being the case
either.

0fd02142
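
A minimal sketch of the explicit check described in 0fd02142 follows, with the new accession captured directly into its target variable as in bc51c3a8. The function name and the exact patterns are assumptions; the real regular expressions in the parser may be stricter.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical classifier for the description of a ^ record.
sub classify_caret_record {
    my ($long_desc) = @_;

    # List-context match assigns the capture straight to $new_id.
    my ($new_id) = ( $long_desc =~ m{ MOVED \s+ TO \s+ (\d+) }msx );
    if ( defined $new_id ) {
        return ( 'moved', $new_id );
    }

    # Do not assume removal; check for it explicitly.
    if ( $long_desc =~ m{ REMOVED \s+ FROM \s+ DATABASE }msx ) {
        return ('removed');
    }

    croak "Unrecognised ^ record: '$long_desc'";
}

my ( $status, $new_id ) = classify_caret_record('MOVED TO 123456');
print "$status -> $new_id\n";
```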

MIMParser: Capture regex matches directly to target variables · bc51c3a8
Marek Szuba authored 6 years ago

Named or not, capture variables have a negative impact on performance.

bc51c3a8

MIMParser: simplify processing of "moved to" entries · 8e3f12c1

Marek Szuba authored 6 years ago

No need to reprocess the whole record, $long_desc already contains the
part we need to look at to extract the new ID.

8e3f12c1

MIMParser: add a comment describing overall data structure of input · f3f7a5ca
Marek Szuba authored 6 years ago

Could use something about the TI field as well.

f3f7a5ca

MIMParser: further clean-up of regular expressions · 86baf1a2

Marek Szuba authored 6 years ago

1. Match $_ explicitly where it used to be matched implicitly. Note that
   it has NOT been confirmed at this point whether there are any other
   parts in the code operating implicitly on $_, which is why we still
   use that variable. Will try changing to a non-magic one later;
2. Add /msx to all regular expressions. Only one of them, the matching
   of the phrase "MOVED TO", required any modifications;
3. Take advantage of /x to unpack the regexes into multiple lines and
   add some comments;
4. Have the capture of $long_desc from TI already exclude the leading
   whitespace, thus doing away with a substitution previously needed to
   get rid of it.

86baf1a2

MIMParser: string cosmetics · c6cc6af4

Marek Szuba authored 6 years ago

Short strings consisting entirely of punctuation marks stand out way
more when in q form than when surrounded by another set of punctuation
marks.

c6cc6af4

MIMParser: use IO::Handle to set input record separator · 3b013561

Marek Szuba authored 6 years ago

We already handle input through IO::Handle, so let's be consistent;
furthermore, "input_record_separator()" is way more readable than
"$/".

3b013561
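
For readers unfamiliar with the method named in 3b013561, here is a small sketch of reading delimiter-separated records with it. The `*RECORD*` delimiter and the helper `read_records` are assumptions for illustration. Note that IO::Handle documents `input_record_separator()` as a class method affecting the global `$/`, which is why the sketch restores the previous value afterwards.

```perl
use strict;
use warnings;
use Carp;
use IO::Handle;

# Hypothetical reader: returns every non-blank record from the handle.
sub read_records {
    my ($fh) = @_;

    # Returns the old separator, which we put back once we are done.
    my $previous = IO::Handle->input_record_separator('*RECORD*');

    my @records;
    while ( defined( my $record = $fh->getline() ) ) {
        chomp $record;    # strips the trailing '*RECORD*'
        if ( $record =~ m{ \S }msx ) {
            push @records, $record;
        }
    }

    IO::Handle->input_record_separator($previous);
    return @records;
}

my $data = "*RECORD*\nfirst\n*RECORD*\nsecond\n";
open my $fh, '<', \$data or croak "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;
print scalar(@records), " records read\n";
```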

MIMParser: abort if TI-field type/description line cannot be matched · af238eb7
Marek Szuba authored 6 years ago

af238eb7

MIMParser: fatal errors should go to STDERR, not STDOUT · a06d81f1
Marek Szuba authored 6 years ago

a06d81f1

MIMParser: remove unused outer-scope variables redeclared in inner scope · 12d90c56

Marek Szuba authored 6 years ago

Those three were never used in the outer scope (i.e. that of the while
loop) so let's just keep the inner declarations from the "if TI field"
block.

12d90c56

MIMParser: avoid escaping characters in regular expressions · e55cb794

Marek Szuba authored 6 years ago

Regexes in this parser are complicated enough even without backslashes
all over the place... Try to avoid escapes wherever possible:
 - do not escape characters which do not need to be escaped;
 - for all metacharacters, use single-character bracketed character
   classes instead;
 - in the event of a caret appearing in a multi-character bracketed
   character class, make sure it is not the first character - it loses
   its special meaning then;
 - finally, if the character to match is a caret itself (which would
   require escaping both in a regex string and in a single-character
   bracketed character class), use the named form (\N{CARET}) instead.

e55cb794
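
The first three points of e55cb794 can be illustrated as below. The patterns are simplified assumptions, not the parser's actual regular expressions; they only show that a bracketed class matches the same text as a backslash escape, and that a caret placed anywhere but first in a class is a literal.

```perl
use strict;
use warnings;

# Two equivalent ways to match a literal asterisk.
my $escaped   = qr{ \*FIELD\* \s TI }msx;
my $bracketed = qr{ [*]FIELD[*] \s TI }msx;

# Caret kept last in the class, so it is a literal and not a negation.
my $type_and_number = qr{ \A ([#%+*^]?) (\d+) }msx;

my $header = '*FIELD* TI';
my $same   = ( $header =~ $escaped ) && ( $header =~ $bracketed );
print 'both forms match: ', ( $same ? 'yes' : 'no' ), "\n";

my ( $type, $number ) = ( '^100500 MOVED TO 200100' =~ $type_and_number );
print "type=$type number=$number\n";
```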

MIMParser: apply perltidy · 0594a5e5

Marek Szuba authored 6 years ago

Fixes the mixing of tabs and spaces, trailing whitespace, and many others.

0594a5e5

MIMParser: use 'use parent' instead of 'use base' · b1ac21f4
Marek Szuba authored 6 years ago

b1ac21f4

Bugfix/delint MIM parser

Description

Use case

Benefits

Possible Drawbacks

Testing
