Use the new traversal policy to simplify reharvest#92
qtomlinson merged 1 commit into clearlydefined:main
With the introduction of the "reharvestAlways" traversal policy, the integration tests can now be simplified.
@RomanIakovlev as per our discussion in the dev meeting, here is the PR with the explanation behind introducing the new traversal policy.
RomanIakovlev left a comment:
Looks nice! Do I understand correctly that this would make the harvest part of integration test ~2x faster by only doing a single harvesting per coordinate?
Yes, for new components (as can happen with randomly selected components). There is also a saving for partially harvested components. For fully harvested components, the improvement is small, because the default traversal policy used in the initial round of harvest respects the existing harvest results and bypasses most of the downloading and processing work.
Merging this PR depends on clearlydefined/crawler#598.
The harvest integration test is successful and the log can be found here.
Previously, the "always" traversal policy was used to retrigger the harvest of a component. This policy essentially reran all the tools that had previously executed successfully, which can be cumbersome, especially for integration tests. Sometimes certain tool results were not available, for example when the schema version had been updated: the previously harvested results became stale, and the tool results for the newly updated schema version were missing. When the tool result for a specific component is missing, the "always" policy leads to an "Unreachable for reprocessing" status and the tool is skipped.
To address this issue in integration testing, the approach was to first trigger a round of harvest using the default traversal policy. With the default policy, a tool runs for the component if there is no existing result with the correct schema version; if there is one, the tool is skipped and the existing result is used. This initial run ensured that a complete set of tool results (both new and old) was available for the correct schema version. Upon completion of the first run, a re-harvest was triggered with the "always" traversal policy so that all the tools were rerun and the tool results were updated.
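The two-round workaround can be illustrated with a small toy model. This is only a sketch of the behaviour described above, not the crawler's code: `harvest_round`, `CURRENT_SCHEMA`, the schema version strings, and the tool list are hypothetical names chosen for illustration.

```python
# Toy model of the "default" and "always" policies as described in this PR.
# All names and version strings here are illustrative assumptions.
CURRENT_SCHEMA = "2.0.0"
TOOLS = ("licensee", "scancode", "reuse")


def harvest_round(policy, stored):
    """Simulate one harvest round.

    `stored` maps tool name -> schema version of its stored result.
    Returns (stored_after, tools_run, tools_skipped_as_unreachable).
    """
    ran, skipped = [], []
    for tool in TOOLS:
        has_current = stored.get(tool) == CURRENT_SCHEMA
        if policy == "default":
            # Run only when no result exists at the current schema version.
            if not has_current:
                ran.append(tool)
        elif policy == "always":
            # Rerun tools that have a current result; a missing or stale
            # result means the tool is skipped ("Unreachable for reprocessing").
            if has_current:
                ran.append(tool)
            else:
                skipped.append(tool)
        else:
            raise ValueError(f"unknown policy: {policy}")
    stored_after = {**stored, **{tool: CURRENT_SCHEMA for tool in ran}}
    return stored_after, ran, skipped


# A component with one current result, one stale result, and one missing result.
stored = {"licensee": "2.0.0", "scancode": "1.0.0"}

# "always" on its own skips the stale and missing tools.
_, ran_always, skipped_always = harvest_round("always", stored)

# The workaround: a "default" round fills the gaps, then "always" reruns everything.
after_first, ran_first, _ = harvest_round("default", stored)
_, ran_second, skipped_second = harvest_round("always", after_first)
```

In this model, "always" alone reruns only `licensee`, whereas the "default" round followed by "always" ends with all three tools rerun and none skipped.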
With the introduction of the "reharvestAlways" traversal policy (clearlydefined/crawler#598), the integration tests can now be simplified. This PR handles the adaptation.
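As a rough illustration of the expected saving, the sketch below counts tool executions per component under the old two-round flow and under a single round. It assumes that "reharvestAlways" runs every tool whether or not a previous result exists; that assumption is inferred from this thread rather than taken from the crawler source, and the function names and tool list are hypothetical.

```python
# Toy comparison of tool executions; names and tool list are illustrative.
TOOLS = ("licensee", "scancode", "reuse")


def two_round_runs(current):
    """Old flow: a "default" round, then an "always" round.

    `current` is the set of tools that already have a result at the
    current schema version.
    """
    first = [tool for tool in TOOLS if tool not in current]  # default fills gaps
    second = list(TOOLS)  # always reruns every tool, now that all results exist
    return len(first) + len(second)


def single_round_runs(current):
    """New flow, ASSUMING "reharvestAlways" runs every tool unconditionally."""
    return len(TOOLS)


cases = {
    "new": set(),
    "partially harvested": {"licensee"},
    "fully harvested": set(TOOLS),
}
for name, current in cases.items():
    print(f"{name}: {two_round_runs(current)} -> {single_round_runs(current)}")
```

Under these assumptions a new component drops from 6 tool runs to 3, a partially harvested one from 5 to 3, and a fully harvested one stays at 3, which is consistent with the roughly 2x figure for new components discussed in the conversation. The model counts only tool executions, so the smaller per-round overhead saved for fully harvested components does not show up in it.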