ensembl-hive merge requests
https://gitlab.ebi.ac.uk/ensembl-gh-mirror/ensembl-hive/-/merge_requests
2022-03-29T10:04:29Z

https://gitlab.ebi.ac.uk/ensembl-gh-mirror/ensembl-hive/-/merge_requests/66
Feature/reset job by input id
2022-03-29T10:04:29Z
Marek Szuba
*Created by: s-mm*

Reset a job with a given input_id and analyses pattern. Both options can be wildcard arguments.

https://gitlab.ebi.ac.uk/ensembl-gh-mirror/ensembl-hive/-/merge_requests/140
Throw if the requested registry file doesn't exist
2019-11-21T02:44:37Z
Marek Szuba
*Created by: muffato*
## Use case
When a worker or beekeeper is invoked with a wrong reg_conf argument, there is no explicit warning / error message, only potentially an error about the reg_alias not being found in the Registry.
## Description
`Registry::load_all` already has a flag to require the file to exist. I merely enable it and we now get a much clearer error message.
## Possible Drawbacks
This is a breaking change for people who use a url _and_ an invalid reg_conf at the same time. Previously, there would have been no warnings about the invalid reg_conf, and the database would have been connected to via its URL. Now eHive is going to complain about the reg_conf.
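
To make the behaviour change concrete, here is a minimal sketch in Python (not eHive's Perl code). The function name `resolve_connection` and its return values are hypothetical; the only things taken from the description above are that the URL used to win silently and that a missing reg_conf file now raises an error up front.

```python
import os
from typing import Optional, Tuple


def resolve_connection(url: Optional[str], reg_conf: Optional[str]) -> Tuple[str, str]:
    """Hypothetical illustration of the new behaviour, not eHive's real API.

    A reg_conf that points to a missing file is now an error, even when a
    URL is also given and would have been enough to connect.
    """
    if reg_conf is not None and not os.path.isfile(reg_conf):
        # New behaviour: fail early with an explicit message.
        raise FileNotFoundError(f"Registry file not found: {reg_conf}")
    if url is not None:
        # As described above, the URL is what the connection is made with.
        return ("url", url)
    if reg_conf is not None:
        return ("registry", reg_conf)
    raise ValueError("Neither a url nor a reg_conf was provided")
```

With this sketch, `resolve_connection("mysql://host/db", "/no/such/file")` raises instead of quietly connecting via the URL, which is exactly the case affected by the breaking change.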
## Testing
_Have you added/modified unit tests to test the changes?_
Yes, but requires Ensembl/ensembl#408
_If so, do the tests pass/fail?_
Yes
_Have you run the entire test suite and no regression was detected?_
Yes

https://gitlab.ebi.ac.uk/ensembl-gh-mirror/ensembl-hive/-/merge_requests/46
Improved scheduler
2019-11-20T15:51:35Z
Marek Szuba
*Created by: muffato*
## Main changes
Following #45, here is the meat of the change. There are three major new features:
1. Automatic setting of the batch-size.
In fact, the feature has been there for a very long time; I just wasn't very happy with the formula. I now try to achieve 5 claim operations per analysis per second. The estimated batch size is simply the optimal number of jobs to achieve that, given the number of running workers and the average runtime of a job.
2. New limiter based on the job throughput
This is similar to `hive_capacity` in the sense that all the analyses will be counted in the same limiter and limit each other, but the difference is that the average runtime of jobs is taken into account. The limiter caps the total _throughput_ (jobs per second) of the pipeline. I've checked most of the Compara databases of the last 6 months or so, and the maximum we could reach is ~2,000 jobs per second. However, I will ask for feedback on #ensembl-hive.
3. Submit fewer workers and ensure they all have enough work to do
The scheduler was very greedy and was basically trying to submit 1 worker per job (before limiting this number according to the capacities). For instance, if there are 20 jobs to run, and we know that the average job runtime is 1 second, eHive will still submit 20 workers even though we could afford submitting a single one. Submitting fewer workers is nicer to LSF, especially if you consider that very short jobs are penalized (the `underrun` exception), and that because workers are not guaranteed to start at the same time, when the 20th worker starts, the analysis may have been depleted.
The scheduler now aims at feeding each worker with ~2 minutes worth of work. There is a mechanism similar to the TailTrimming so that it will still submit a few more workers at the next loop, which helps when job runtimes are heterogeneous and workers get stuck on long jobs.
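
The arithmetic behind features 1 and 3 can be sketched as follows. This is an illustrative Python sketch, not the Perl code in this PR. The two targets (5 claims per second, ~2 minutes of work per worker) come from the text above; the function names, the claim-rate model (each worker claims once per batch) and the fallbacks for unknown runtimes are assumptions.

```python
import math

# Targets quoted in the description above.
TARGET_CLAIMS_PER_SEC = 5.0          # claim operations per analysis per second
TARGET_WORK_PER_WORKER_SEC = 120.0   # ~2 minutes of work per worker


def estimate_batch_size(num_running_workers: int, avg_job_runtime_sec: float) -> int:
    """Batch size giving roughly TARGET_CLAIMS_PER_SEC claims per second.

    Assumed model: a worker claims once per batch, i.e. once every
    batch_size * avg_job_runtime_sec seconds, so the overall claim rate is
    num_running_workers / (batch_size * avg_job_runtime_sec).
    """
    if num_running_workers <= 0 or avg_job_runtime_sec <= 0:
        return 1  # assumption: nothing to go on, claim jobs one at a time
    batch = num_running_workers / (TARGET_CLAIMS_PER_SEC * avg_job_runtime_sec)
    return max(1, math.ceil(batch))


def estimate_workers_needed(num_ready_jobs: int, avg_job_runtime_sec: float) -> int:
    """Number of workers so that each gets about TARGET_WORK_PER_WORKER_SEC of work."""
    if num_ready_jobs <= 0:
        return 0
    if avg_job_runtime_sec <= 0:
        # Assumption: with no runtime estimate, fall back to one worker per job.
        return num_ready_jobs
    total_work_sec = num_ready_jobs * avg_job_runtime_sec
    return max(1, math.ceil(total_work_sec / TARGET_WORK_PER_WORKER_SEC))
```

For the example given above (20 jobs with a 1-second average runtime), `estimate_workers_needed(20, 1.0)` yields a single worker rather than 20. The capacity and throughput limiters are not modelled here.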
## Related changes
1. I've made `estimate_num_required_workers` return the number of extra workers needed. In fact, it was used like that in Scheduler although defined as the total number of workers in AnalysisStats, but was working fine only because the limiters were enforced in both.
2. I have added a method in AnalysisStats to estimate `avg_msec_per_job` when no workers have reported their stats yet. This is done by reading how many live workers we have, and how many jobs they've been involved with. Even though MySQL timestamps only store whole seconds, this is still sufficient in most cases, as the new mechanisms are mostly effective on very quick analyses.
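
A rough sketch of the estimate described in point 2, written in Python for illustration only. The `LiveWorker` record and its fields are hypothetical stand-ins for whatever the worker table provides, and dividing total worker uptime by total jobs is an assumption about how the two counts are combined.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LiveWorker:
    # Hypothetical fields: uptime in whole seconds (timestamps have a
    # one-second resolution) and the number of jobs the worker has handled.
    seconds_alive: int
    jobs_handled: int


def estimate_avg_msec_per_job(workers: Iterable[LiveWorker]) -> Optional[float]:
    """Estimate the average job runtime (ms) from live workers only.

    Returns None when no job has been handled yet, since no estimate
    can be made in that case.
    """
    workers = list(workers)
    total_jobs = sum(w.jobs_handled for w in workers)
    if total_jobs == 0:
        return None
    total_seconds = sum(w.seconds_alive for w in workers)
    return 1000.0 * total_seconds / total_jobs
```

Because the uptime is only known to the nearest second, the error on each worker is at most one second spread over all of its jobs, which is why the estimate remains usable for quick analyses where each worker handles many jobs.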
## Other comments
This work is largely motivated by the issues encountered by someone new to eHive and trying to run a pipeline on a large dataset (e.g. LongMultiplication with tens of thousands of jobs). Ideally, people should not have to bother about batch-size or capacity, and eHive should run sensibly (I guess you'll remember which presentation I'm alluding to).
## Things left to do and test before this PR can be accepted
1. I've tested the automatic batch-size on several large Compara pipelines and it worked well. I have only tested the other changes in small pipelines, but will test them at scale too.
2. By default, the batch-size is set to 1 in Analysis.pm (unless defined otherwise). I will run a few more tests, but I would in fact change that to 0 so that the automatic setting is used and we can tell people to forget about having to set batch-size.
3. I've defined all the parameters as constants in AnalysisStats, but I feel they should ultimately go to the JSON config file. I think we've already mentioned we should properly version it and use it adequately, and I guess it will become more important now.
4. I will need to update the documentation
5. Maybe I should include those explanations in the commit messages themselves?

https://gitlab.ebi.ac.uk/ensembl-gh-mirror/ensembl-hive/-/merge_requests/139
[GuestLanguage/java] Don't run the tests in the before_install section as they are run by scripts/dev/travis_run_tests.sh
2019-08-22T14:28:08Z
Marek Szuba
*Created by: muffato*
## Use case
The Java tests are currently run twice: in `before_install` and in `scripts/dev/travis_run_tests.sh`. This causes two issues:
- Extra Travis time, which is really precious and should not be wasted
- The test build will immediately fail and only report the Java error. There could be other errors worth reporting as part of the build
## Description
Only run the tests in `scripts/dev/travis_run_tests.sh` (like all the other tests).
## Possible Drawbacks
Extra Travis time when the only issue is that the Java wrapper doesn't pass its own tests. But the same could be said of any component. It would make sense to test the critical components first (and fail early), but I don't think the Java wrapper is currently critical.
## Testing
_Have you added/modified unit tests to test the changes?_
N/A
_Have you run the entire test suite and no regression was detected?_
N/A

https://gitlab.ebi.ac.uk/ensembl-gh-mirror/ensembl-hive/-/merge_requests/40
enable overflow for accu keys, to_analysis_urls, and submission_cmd_args
2019-07-08T13:24:31Z
Marek Szuba
*Created by: ens-bwalts*

https://gitlab.ebi.ac.uk/ensembl-gh-mirror/ensembl-hive/-/merge_requests/100
Feature/log formatter
2019-07-08T13:14:41Z
Marek Szuba
*Created by: mira13*
## Use case
eHive needs unified output, so a wrapper for log4perl was created.
## Description
The Logger is used to initialise log4perl with command-line parameters.
JSON and text output have different log settings.
## Possible Drawbacks
Existing warnings can get mixed up with the new log info.
## Testing
We need to discuss whether any tests are needed.
_If so, do the tests pass/fail?_

https://gitlab.ebi.ac.uk/ensembl-gh-mirror/ensembl-hive/-/merge_requests/90
Tips included in documentation.
2019-04-24T09:07:26Z
Marek Szuba
*Created by: mira13*
## Requirements
No requirements
## Use case
Beginners who start using eHive look for additional information.
## Description
Tips have been added to the documentation for optimised usage of eHive.
https://www.ebi.ac.uk/panda/jira/browse/ENSCORESW-2437
## Possible Drawbacks
None
## Testing
No testing applicable.
_Have you added/modified unit tests to test the changes?_
_If so, do the tests pass/fail?_
_Have you run the entire test suite and no regression was detected?_