testing bad re-use of database handle
Created by: magaliruffier
PLEASE DO NOT MERGE
This is just an experiment at reproducing the bug of storing xrefs from one species in the wrong database. This was run with all the code on master apart from the BaseParser.pm which has been reverted to its state early 2018, when the database handle is cached at the BaseParser.pm level The parsers still pass the dbi argument to the various methods, but it is not used in the BaseParser.pm, as self->dbi is used instead
The eHive pipeline was run normally on 58 species in parallel. ZFIN xrefs are stored in the cow database, allowing us to reproduce the bad storage scenario. There might be other sources but they are harder to spot (how do you check automatically the refseq xrefs belong to the species they are stored against)
This is a sample data in the zebrafish database: select d.*, x.xref_id, x.accession, x.species_id, x.info_type from dependent_xref d, xref x where dependent_xref_id = xref_id and accession = 'ZDB-GENE-000125-12' limit 3; +----------------+----------------+-------------------+--------------------+-------------------+---------+--------------------+------------+-----------+ | object_xref_id | master_xref_id | dependent_xref_id | linkage_annotation | linkage_source_id | xref_id | accession | species_id | info_type | +----------------+----------------+-------------------+--------------------+-------------------+---------+--------------------+------------+-----------+ | NULL | 366791 | 461953 | NULL | 150 | 461953 | ZDB-GENE-000125-12 | 7955 | DEPENDENT | | NULL | 366790 | 461953 | NULL | 150 | 461953 | ZDB-GENE-000125-12 | 7955 | DEPENDENT | | NULL | 108521 | 461953 | NULL | 150 | 461953 | ZDB-GENE-000125-12 | 7955 | DEPENDENT | +----------------+----------------+-------------------+--------------------+-------------------+---------+--------------------+------------+-----------+
This is a sample data in the cow database: select d.*, x.xref_id, x.accession, x.species_id, x.info_type from dependent_xref d, xref x where dependent_xref_id = xref_id and accession = 'ZDB-GENE-000125-12' limit 3; +----------------+----------------+-------------------+--------------------+-------------------+---------+--------------------+------------+-----------+ | object_xref_id | master_xref_id | dependent_xref_id | linkage_annotation | linkage_source_id | xref_id | accession | species_id | info_type | +----------------+----------------+-------------------+--------------------+-------------------+---------+--------------------+------------+-----------+ | NULL | 366791 | 249592 | NULL | 150 | 249592 | ZDB-GENE-000125-12 | 7955 | DEPENDENT | | NULL | 366790 | 249592 | NULL | 150 | 249592 | ZDB-GENE-000125-12 | 7955 | DEPENDENT | | NULL | 108521 | 249592 | NULL | 150 | 249592 | ZDB-GENE-000125-12 | 7955 | DEPENDENT | +----------------+----------------+-------------------+--------------------+-------------------+---------+--------------------+------------+-----------+
What we notice is that we have the same xref, ZDB-GENE-000125-12, stored in both the cow and zebrafish databases, but with different xref_ids. In the dependent_xref table however, they point to the same master_xref_id. In cow, those master_xref_ids do not point to anything while in zebrafish they point to the Uniprot and RefSeq accessions the ZFIN_ID is dependent on as per the input file.
Looking at the ZFINParser.pm code, we have the following: my (%swiss) = %{$self->get_valid_codes("uniprot/",$species_id, $dbi)}; This will retrieve a list of accession to xref_id mappings for Uniprot xrefs. The method get_valid_codes comes from the BaseParser.pm module and uses the cached dbi when running the SQL query. This query has run on the zebrafish database, as these correspond to Uniprot xref_ids from the zebrafish database
Further down the parser, we have the following:
if(defined(swiss{
acc})){
foreach my xref_id (@{
swiss{$acc}}){
$self->add_dependent_xref({ master_xref_id => $xref_id,
Using the Uniprot xref id extracted from the zebrafish database as a master_xref_id, we now call the add_dependent_xref method from the BaseParser.pm module, which will create the actual insert statements for the dependent_xref and xref tables.
The method add_dependent_xref uses the cached dbi, which this time ends up being the cow database, as this is where the xrefs are then stored.
This seems to indicate that within the same eHive job, two different database handles are retrieved.
Any insights as to how that could happen would help us prevent this from happening again.