Improved scheduler
Created by: muffato
Main changes
Following #45, here is the meat of the change. There are three major new features:
- Automatic setting of the batch-size.
In fact, this feature has been there for a very long time; I just wasn't happy with the formula. The new one aims at 5 claim operations per analysis per second: the estimated batch size is simply the number of jobs each worker needs to claim at once to reach that rate, given the number of running workers and the average runtime of a job. (All three heuristics are sketched in the code snippet after this list.)
- New limiter based on the job throughput
This is similar to hive_capacity in the sense that all the analyses are counted in the same limiter and limit each other, but the difference is that the average runtime of jobs is taken into account: the limiter caps the total throughput (jobs per second) of the pipeline. I've checked most of the Compara databases of the last 6 months or so, and the maximum we could reach is ~2,000 jobs per second, but I will ask for feedback on #ensembl-hive.
- Submit fewer workers and ensure they all have enough work to do
The scheduler was very greedy: it basically tried to submit 1 worker per job (before limiting that number according to the capacities). For instance, if there are 20 jobs to run and we know that the average job runtime is 1 second, eHive would still submit 20 workers even though a single one would be enough. Submitting fewer workers is nicer to LSF, especially considering that very short jobs are penalised (the underrun exception), and that because workers are not guaranteed to start at the same time, the analysis may already be depleted by the time the 20th worker starts.
The scheduler now aims at feeding each worker with ~2 minutes' worth of work. There is a mechanism similar to the TailTrimming so that it will still submit a few more workers at the next loop, which helps when job runtimes are heterogeneous and workers get stuck on long jobs.
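To make the reasoning concrete, here is a rough Python sketch of the three heuristics. This is only an illustration of the arithmetic described above: the real implementation lives in the Perl Scheduler and AnalysisStats code, and the function names, parameter names and clamping details below are made up.

```python
import math

def estimated_batch_size(num_running_workers, avg_msec_per_job,
                         target_claims_per_sec=5):
    """Batch size such that all running workers of an analysis together issue
    roughly `target_claims_per_sec` claim operations per second."""
    avg_sec_per_job = max(avg_msec_per_job / 1000.0, 0.001)
    # Each worker claims a new batch every (batch_size * avg_sec_per_job) seconds,
    # so the analysis-wide claim rate is num_workers / (batch_size * avg_sec_per_job).
    batch_size = num_running_workers / (target_claims_per_sec * avg_sec_per_job)
    return max(1, math.ceil(batch_size))

def workers_allowed_by_throughput(max_jobs_per_sec, avg_msec_per_job):
    """Cap on the number of workers so that the whole pipeline completes at most
    `max_jobs_per_sec` jobs per second (each worker finishes ~1000/avg_msec jobs/s)."""
    jobs_per_sec_per_worker = 1000.0 / max(avg_msec_per_job, 1.0)
    return int(max_jobs_per_sec / jobs_per_sec_per_worker)

def extra_workers_to_submit(num_ready_jobs, avg_msec_per_job,
                            num_running_workers, sec_of_work_per_worker=120):
    """Submit only as many extra workers as needed to give every worker
    about two minutes' worth of ready jobs."""
    total_work_sec = num_ready_jobs * avg_msec_per_job / 1000.0
    workers_needed = math.ceil(total_work_sec / sec_of_work_per_worker)
    return max(0, workers_needed - num_running_workers)
```

The constants (5 claims/s, ~2 minutes of work per worker, and the throughput cap) are the ones discussed above; see also the note below about moving them to the JSON config file.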
Related changes
- I've made estimate_num_required_workers return the number of extra workers needed. In fact, it was already used like that in Scheduler, although it was defined as the total number of workers in AnalysisStats; it only worked because the limiters were enforced in both.
- I have added a method in AnalysisStats to estimate avg_msec_per_job when no workers have reported their stats yet. This is done by reading how many live workers there are and how many jobs they have been involved with. Even though MySQL timestamps only store whole seconds, this is sufficient in most cases, as the new mechanisms mostly matter for very quick analyses. A rough sketch of the estimate follows this list.
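For illustration only, the fallback estimate amounts to something like the sketch below; the function name and the shape of the worker data are made up, and the real code reads the worker table through the Perl adaptors.

```python
from datetime import datetime

def estimate_avg_msec_per_job(live_workers, now=None):
    """Fallback estimate of the average job runtime when no worker has reported
    its stats yet: divide the wall-clock time the live workers have been alive
    by the number of jobs they have attempted so far.

    `live_workers` is assumed to be a list of (born_timestamp, jobs_attempted)
    pairs; the timestamps only have whole-second resolution."""
    now = now or datetime.now()
    total_sec = sum((now - born).total_seconds() for born, _ in live_workers)
    total_jobs = sum(jobs for _, jobs in live_workers)
    if not total_jobs:
        return None  # nothing to base the estimate on yet
    return 1000.0 * total_sec / total_jobs
```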
Other comments
This work is largely motivated by the issues encountered by someone new to eHive trying to run a pipeline on a large dataset (e.g. LongMultiplication with tens of thousands of jobs). Ideally, people should not have to bother about batch-size or capacity: eHive should just run sensibly (I guess you'll remember which presentation I'm alluding to).
Things left to do and test before this PR can be accepted
- I've tested the automatic batch-size on several large Compara pipelines and it worked well. I have only tested the other changes on small pipelines, but will test them at scale too.
- By default, the batch-size is set to 1 in Analysis.pm (unless defined otherwise). I will run a few more tests, but I would in fact change that default to 0 so that the automatic setting kicks in and we can tell people to forget about setting batch-size.
- I've defined all the parameters as constants in AnalysisStats, but I feel they should ultimately go into the JSON config file. We've already mentioned that we should properly version it and use it adequately, and I guess that will become more important now.
- I will need to update the documentation.
- Maybe I should include these explanations in the commit messages themselves?