This repository was archived by the owner on Aug 14, 2021. It is now read-only.

Restarting ParseCsvImportJob creates duplicates #233

@jonallured

Description

When a worker dyno dies in the middle of a ParseCsvImportJob and Sidekiq then restarts the job, the result is duplicate RawInput records for that Import. Say you have a CSV file with 10 rows and the ParseCsvImportJob gets through 5 before being killed. When the job runs again, it starts from the beginning and you end up with a total of 15 RawInput records.
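
A minimal sketch of the failure mode, assuming the job is a plain Sidekiq worker that loops over the CSV and creates one RawInput per row. The `csv_data` and `data` attributes are illustrative guesses; the real class internals aren't shown in this issue.

```ruby
require "sidekiq"
require "csv"

class ParseCsvImportJob
  include Sidekiq::Worker

  def perform(import_id)
    import = Import.find(import_id)

    # Every row unconditionally creates a RawInput. If the dyno dies
    # after row 5 of 10, Sidekiq's retry runs perform again from row 1,
    # so the first 5 rows are parsed twice: 15 RawInput records total.
    CSV.parse(import.csv_data, headers: true).each do |row|
      RawInput.create!(import: import, data: row.to_h)
    end
  end
end
```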

Does it matter?

The most obvious way this bug will show up is in the UI when a confused Importer is viewing their import. Rather than seeing a total count of 10 like they expect, they'll see 15.

In terms of data, what happens depends on what's in those 5 duplicated rows. In the case of a Rankable like Website, the second of the two rows will match the existing record and no new data will be created. For every other Rankable, however, duplicate data will be created.
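
A hypothetical sketch of that difference, assuming Website rows are matched on a unique attribute (`url` is an illustrative guess) while other Rankables insert unconditionally; the real lookup code isn't shown in this issue:

```ruby
# Re-parsing a Website row finds the existing record, so no duplicate:
Website.find_or_create_by!(url: row["url"])

# Re-parsing any other Rankable (SomeOtherRankable is a stand-in name)
# inserts a second record, so the duplicate sticks:
SomeOtherRankable.create!(name: row["name"])
```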

At resolution time, however, when we're sending this data to Redshift, it will be fine. The two pieces of data will tie for Source rank, but ties like this are already broken by created_at. Our existing logic will leave Redshift, Looker, and Importers none the wiser about this bloat.
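
A sketch of that tie-break, assuming the resolution query orders by rank and then created_at (`source_rank` is an assumed column name; the real query isn't shown here):

```ruby
# Two RawInputs tie on Source rank; the earlier created_at wins, so the
# duplicate never surfaces downstream.
winning_input = RawInput
  .where(import: import)
  .order(source_rank: :desc, created_at: :asc)
  .first
```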
