Remove duplicates from Compound Origins table
CHEBI:180706 is an example where we have duplicated data submitted in the Compound Origins table. This issue exists only with 2-star entries.
Fix: Remove one record from duplicate entries, but only delete a duplicate record if it has submitted status.
Run an SQL script, ideally in Oracle, but if not possible then in ETL for ChEBI2
Sample SQL to identify issue:
SELECT co.*
FROM compound_origins AS co
JOIN (
SELECT compound_id, species_text, species_accession, component_text, component_accession, source_type_id, source_accession
FROM compound_origins
WHERE status_id=9
GROUP BY compound_id, species_text, species_accession, component_text, component_accession, source_type_id, source_accession
HAVING COUNT(*) > 1
) AS duplicates
ON co.compound_id = duplicates.compound_id
AND co.species_text = duplicates.species_text
AND co.species_accession = duplicates.species_accession
AND co.component_text = duplicates.component_text
AND co.component_accession = duplicates.component_accession
AND co.source_type_id = duplicates.source_type_id
AND co.source_accession = duplicates.source_accession
JOIN compounds c ON co.compound_id = c.id
WHERE co.status_id = 9 AND c.stars = 2;
Some sample CHEBI IDs that have this duplication: 172378, 172379, 172380, 172381, 172382, 180702, 180706, 180708
CHEBI:30769 has "From Metabolights" in comments. There could be other records as well, need to remove this comment