Clarification of meaning of empty directories in literature inputs
As a developer, I want to understand why some directories and files in the literature inputs are empty, so that I can tell whether we are seeing expected behaviour or a possible error condition.
Background
The inputs to the Open Targets literature pipeline are deposited by the EPMC pipeline under gs://otar025-epmc. From here, they are ingested by the Platform ETL for further processing.
When I list the available files and their sizes (gsutil du -h gs://otar025-epmc/Full-text), I see some entries where files were written but contain no data:
0 B gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-0.jsonl
0 B gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-1.jsonl
0 B gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-3.jsonl
0 B gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-6.jsonl
0 B gs://otar025-epmc/Full-text/31_08_2022/NMP_patch-30-08-2022-9.jsonl
0 B gs://otar025-epmc/Full-text/31_08_2022/
We are unsure whether this is because there was no data to output or because something has gone wrong.
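To make this check repeatable, the listing above could be scanned for zero-byte entries automatically. The following is a minimal sketch, assuming the output has the "<size> <unit> <path>" shape shown above and that paths contain no spaces. The function names `find_empty_entries` and `split_files_and_prefixes` are hypothetical and not part of any existing pipeline code. Note that a 0 B entry ending in "/" may simply be the summed total for that prefix rather than a separate object.

```python
from typing import List, Tuple


def find_empty_entries(du_output: str) -> List[str]:
    """Return the paths reported as 0 B in `gsutil du -h` style output.

    Assumes each line looks like "<size> <unit> <path>" (as in the listing
    above) and that paths contain no whitespace.
    """
    empty = []
    for line in du_output.splitlines():
        parts = line.split()
        # Keep only lines whose size is exactly "0 B".
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "B":
            empty.append(parts[2])
    return empty


def split_files_and_prefixes(paths: List[str]) -> Tuple[List[str], List[str]]:
    """Separate object paths from directory-like prefixes (trailing "/")."""
    files = [p for p in paths if not p.endswith("/")]
    prefixes = [p for p in paths if p.endswith("/")]
    return files, prefixes
```

Feeding the captured output of the gsutil command into `find_empty_entries` and then `split_files_and_prefixes` gives one list of empty .jsonl files and one list of empty prefixes, which could be reported to the EPMC team when asking whether the empty output is expected.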
Tasks
- Confirm what is meant by the empty directories.
- Potentially extend the pipeline to output a metadata file, including information on when the pipeline was run, whether everything worked as expected, and the expected file output size.
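As a starting point for discussion, the metadata file from the second task could be a small JSON manifest written alongside each run's output. This is only a sketch under assumptions: the field names (`started_at`, `finished_at`, `succeeded`, `files`, and so on) and the helper names `build_run_manifest` and `manifest_matches` are hypothetical, and the real schema would need to be agreed with the EPMC pipeline maintainers.

```python
import json
from datetime import datetime
from typing import Any, Dict


def build_run_manifest(
    started_at: datetime,
    finished_at: datetime,
    succeeded: bool,
    file_sizes: Dict[str, int],
) -> Dict[str, Any]:
    """Build a manifest describing one pipeline run (hypothetical schema)."""
    return {
        "started_at": started_at.isoformat(),
        "finished_at": finished_at.isoformat(),
        "succeeded": succeeded,
        "file_count": len(file_sizes),
        "total_bytes": sum(file_sizes.values()),
        # One entry per output file, so empty files are recorded explicitly.
        "files": [
            {"name": name, "bytes": size}
            for name, size in sorted(file_sizes.items())
        ],
    }


def manifest_matches(manifest: Dict[str, Any], observed_sizes: Dict[str, int]) -> bool:
    """Check that a successful run's recorded sizes equal the observed sizes."""
    expected = {entry["name"]: entry["bytes"] for entry in manifest["files"]}
    return bool(manifest["succeeded"]) and expected == observed_sizes


def manifest_to_json(manifest: Dict[str, Any]) -> str:
    """Serialise the manifest for writing next to the run's output files."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

With a manifest like this, a consumer such as the Platform ETL could distinguish "the run succeeded and legitimately produced empty files" (sizes of 0 recorded with `succeeded` set to true) from "the run failed or the files were truncated" (a missing manifest, `succeeded` set to false, or a size mismatch).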