Added a hive_id index to the analysis_job table to help with dead_worker
job resetting. This allows a direct UPDATE ... WHERE ... SQL statement to be used.

Also changed the retry_count system: retry_count is now incremented only for jobs that failed (status in ('GET_INPUT','RUN','WRITE_OUTPUT')). Jobs that were only CLAIMED by the dead worker are reset without incrementing retry_count, since no attempt was ever made to run them.

The fetching of claimed jobs now also has an 'ORDER BY retry_count', so that jobs that have failed sit at the bottom of the list of jobs to process. This allows the 'bad' jobs to filter themselves out.
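
The reset and fetch logic described here can be sketched roughly as follows. This is an illustrative Python/sqlite3 sketch only, not the code from this commit: the reduced table layout, the 'READY' reset status, hive_id = 0 meaning "unowned", and the helper names make_demo_db, reset_dead_worker_jobs and fetch_claimed_jobs are all assumptions made for the example.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for analysis_job (reduced, assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE analysis_job ("
        " analysis_job_id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " hive_id INTEGER NOT NULL DEFAULT 0,"
        " retry_count INTEGER NOT NULL DEFAULT 0)"
    )
    # The index lets the UPDATE ... WHERE hive_id = ? statements locate
    # a dead worker's jobs without scanning the whole table.
    conn.execute("CREATE INDEX hive_id_idx ON analysis_job (hive_id)")
    return conn


def reset_dead_worker_jobs(conn, hive_id):
    """Release every unfinished job held by the dead worker `hive_id`."""
    # CLAIMED jobs were never attempted: reset them with no retry penalty.
    conn.execute(
        "UPDATE analysis_job SET status = 'READY', hive_id = 0 "
        "WHERE hive_id = ? AND status = 'CLAIMED'",
        (hive_id,),
    )
    # Jobs caught mid-execution count as failed: reset and bump retry_count.
    conn.execute(
        "UPDATE analysis_job SET status = 'READY', hive_id = 0, "
        "retry_count = retry_count + 1 "
        "WHERE hive_id = ? AND status IN ('GET_INPUT', 'RUN', 'WRITE_OUTPUT')",
        (hive_id,),
    )
    conn.commit()


def fetch_claimed_jobs(conn, hive_id):
    """Return the ids of jobs claimed by `hive_id`, least-retried first."""
    # The analysis_job_id tie-breaker is an addition for deterministic output.
    rows = conn.execute(
        "SELECT analysis_job_id FROM analysis_job "
        "WHERE hive_id = ? AND status = 'CLAIMED' "
        "ORDER BY retry_count, analysis_job_id",
        (hive_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because previously failed jobs carry a higher retry_count, a worker that claims a mixed batch processes the never-failed jobs first and reaches the repeat offenders last.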