
Clarification of meaning of empty directories in literature inputs

As a developer, I want to understand what empty directories in the literature inputs mean, so that I can tell whether we are seeing expected behaviour or a possible error condition.

Background

The inputs to the Open Targets literature pipeline are deposited by the EPMC pipeline under gs://otar025-epmc. From here, they are ingested by the Platform ETL for further processing.

When I list the available files and their sizes (gsutil du -h gs://otar025-epmc/Full-text), I see some entries where files were written but contain no data:

0 B          gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-0.jsonl
0 B          gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-1.jsonl
0 B          gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-3.jsonl
0 B          gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-6.jsonl
0 B          gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-9.jsonl
0 B          gs://otar025-epmc/Full-text/31_08_2022/

We are unsure if this is because there was no data to output or if something has gone wrong.
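To make it easier to see how widespread this is, the zero-byte entries could be picked out of the listing programmatically. The following is a minimal sketch, not part of either pipeline: the function name `find_empty_objects` is hypothetical, and it assumes every line has the form `<size> [unit] <path>` as in the output above, with no spaces in the paths.

```python
def find_empty_objects(du_output: str) -> list[str]:
    """Return the object paths that are reported with a size of zero.

    Assumption: each line looks like '<size> [unit] <path>', as in the
    listing above. Paths ending in '/' are directory entries and are
    skipped, so only files are reported.
    """
    empty = []
    for line in du_output.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # blank or malformed line
        size, path = tokens[0], tokens[-1]
        if size == "0" and not path.endswith("/"):
            empty.append(path)
    return empty


if __name__ == "__main__":
    import sys

    # Example: gsutil du -h gs://otar025-epmc/Full-text | python find_empty.py
    for path in find_empty_objects(sys.stdin.read()):
        print(path)
```

This only tells us which files are empty; it cannot tell us whether an empty file is expected, which is the question this issue is asking.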

Tasks

  • Confirm what is meant by the empty directories.
  • Potentially extend the pipeline to output a metadata file, including information on when the pipeline was run, whether everything worked as expected, and the expected file output size.
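As a starting point for discussing the second task, here is a rough sketch of what such a metadata file might contain. Everything in it is an assumption for illustration: the function names, the field names (`run_started`, `succeeded`, `files`, `expected_bytes`, `empty_files`) and the choice of JSON are hypothetical and do not reflect how the EPMC pipeline currently works.

```python
import json
from datetime import datetime


def build_run_metadata(
    run_started: datetime, succeeded: bool, file_sizes: dict[str, int]
) -> dict:
    """Assemble a run summary (hypothetical schema).

    file_sizes maps each output file name to its expected size in bytes.
    """
    return {
        "run_started": run_started.isoformat(),
        "succeeded": succeeded,
        "files": [
            {"name": name, "expected_bytes": size}
            for name, size in sorted(file_sizes.items())
        ],
        # Listing empty files explicitly lets a consumer distinguish
        # "no data to output" from a failed or truncated write.
        "empty_files": sorted(
            name for name, size in file_sizes.items() if size == 0
        ),
    }


def write_run_metadata(metadata: dict, path: str) -> None:
    """Write the summary as JSON next to the pipeline outputs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
```

With a file like this in each output directory, the Platform ETL could compare the actual object sizes against `expected_bytes` and treat a zero-byte file as expected only when it appears in `empty_files` and `succeeded` is true.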