Change date format in literature output from DD-MM-YYYY to YYYY-MM-DD

As a developer, I want to quickly see what data is available to ingest into the Literature ETL.

Background

The EPMC pipeline runs daily to generate incremental outputs. These files are stored in gs://otar025-epmc/ under two subdirectories, Full-text and Abstracts. Within these subdirectories, the data is partitioned by date in day-first order (DD_MM_YYYY, with underscores as separators). When I list them with the command gsutil ls gs://otar025-epmc/Abstracts | head -n 10, I see:

gs://otar025-epmc/Abstracts/
gs://otar025-epmc/Abstracts/01_06_2022/
gs://otar025-epmc/Abstracts/01_07_2022/
gs://otar025-epmc/Abstracts/01_08_2022/
gs://otar025-epmc/Abstracts/02_06_2022/
gs://otar025-epmc/Abstracts/02_07_2022/
gs://otar025-epmc/Abstracts/02_08_2022/
gs://otar025-epmc/Abstracts/03_06_2022/
gs://otar025-epmc/Abstracts/03_07_2022/
gs://otar025-epmc/Abstracts/03_09_2022/

The directories are listed in lexicographic order, so we see everything generated on the first of each month, then everything generated on the second of each month, and so on. The listing is therefore not chronological. We would like to see something like the following:

gs://otar025-epmc/Abstracts/2022_05_31/
gs://otar025-epmc/Abstracts/2022_06_01/
gs://otar025-epmc/Abstracts/2022_06_02/

With the year first, lexicographic order matches chronological order, so the directories stay grouped by month.
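
To illustrate why the component order matters, here is a minimal Python sketch that formats the same dates both ways and sorts the resulting names as plain strings, which is how the listing orders them. The sample dates and the helper names day_first and year_first are made up for this example and are not taken from the EPMC pipeline.

```python
from datetime import date

# Hypothetical sample of run dates, for illustration only.
dates = [date(2022, 5, 31), date(2022, 6, 1), date(2022, 6, 2), date(2022, 7, 1)]


def day_first(d):
    """Format a date as DD_MM_YYYY (the current directory naming)."""
    return d.strftime("%d_%m_%Y")


def year_first(d):
    """Format a date as YYYY_MM_DD (the proposed directory naming)."""
    return d.strftime("%Y_%m_%d")


# Sort the names as strings, the way a lexicographic listing would.
day_first_sorted = sorted(day_first(d) for d in dates)
year_first_sorted = sorted(year_first(d) for d in dates)

print(day_first_sorted)   # grouped by day of month, not chronological
print(year_first_sorted)  # same order as the dates themselves
```

Because every component is zero-padded and ordered from most to least significant, the year-first names sort in the same order as the underlying dates.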

Tasks

  • Update the EPMC pipeline to write its output using the new year-first date format
  • Rename all the existing files to use the updated date format
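
As a rough sketch of the rename task, the following Python snippet builds a dry-run plan mapping each existing day-first directory to its year-first name. It only computes and prints the old and new paths and does not touch the bucket. The function names to_year_first and plan_renames are hypothetical, and the sample listing is abbreviated from the output above with one already-converted entry added to show that it is skipped.

```python
import re
from datetime import datetime

# Matches names such as 01_06_2022 (two digits, two digits, four digits).
DAY_FIRST = re.compile(r"\d{2}_\d{2}_\d{4}")


def to_year_first(name):
    """Convert 'DD_MM_YYYY' to 'YYYY_MM_DD'.

    Returns None when the name is not a valid day-first date, so
    unrelated or already-renamed directories are left alone.
    """
    if not DAY_FIRST.fullmatch(name):
        return None
    try:
        parsed = datetime.strptime(name, "%d_%m_%Y")
    except ValueError:
        return None  # e.g. 31_02_2022 is not a real date
    return parsed.strftime("%Y_%m_%d")


def plan_renames(urls):
    """Return (old_url, new_url) pairs for every day-first directory."""
    plan = []
    for url in urls:
        parent, _, name = url.rstrip("/").rpartition("/")
        new_name = to_year_first(name)
        if new_name is not None:
            plan.append((url, f"{parent}/{new_name}/"))
    return plan


listing = [
    "gs://otar025-epmc/Abstracts/",
    "gs://otar025-epmc/Abstracts/01_06_2022/",
    "gs://otar025-epmc/Abstracts/2022_06_02/",  # already year-first, skipped
]

for old, new in plan_renames(listing):
    print(old, "->", new)
```

Reviewing the printed plan before moving anything makes it easier to spot unexpected directory names. The actual move (for example with gsutil mv) is deliberately left out of the sketch.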