When a worker dyno that was in the middle of a `ParseCsvImportJob` dies and is then restarted by Sidekiq, the result is duplicate `RawInput` records for that `Import`. Say you have a CSV file with 10 rows and the `ParseCsvImportJob` gets through 5 before being killed. When the worker comes back up, it starts from the beginning, and you end up with a total of 15 `RawInput` records.
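The failure mode above can be simulated in a few lines. This is a hypothetical sketch, not the real job: the `records` array stands in for the `RawInput` table, and `parse_csv` stands in for `ParseCsvImportJob`'s row loop, which always starts from row zero on retry.

```ruby
# Stand-in for the RawInput table.
records = []

# Naive parse loop: always starts from the first row, with an optional
# simulated crash partway through (the dyno dying mid-job).
parse_csv = lambda do |rows, crash_after: nil|
  rows.each_with_index do |row, i|
    raise "worker killed" if crash_after && i >= crash_after
    records << row
  end
end

rows = (1..10).to_a

# First run persists 5 rows, then the worker dies.
begin
  parse_csv.call(rows, crash_after: 5)
rescue RuntimeError
end

# Sidekiq retries the job; it re-processes every row from the beginning.
parse_csv.call(rows)

puts records.size # => 15
```

Because nothing records how far the first run got, the retry's 10 rows land on top of the 5 already persisted.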
Does it matter?
The most obvious way this bug will show up is in the UI when a confused Importer is viewing their import. Rather than seeing a total count of 10 like they expect, they'll see 15.
In terms of data, what happens depends on what's in those 5 duplicated rows. For a `Rankable` like `Website`, the second of the two rows will match an existing record and no new data will be created. For all other `Rankable`s, however, duplicate data will be created.
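A toy illustration of why `Website` dedupes where other `Rankable`s don't: `Website` is looked up by a natural key (here assumed to be its URL), so a re-parsed row matches the existing record instead of creating a new one. The lookup logic below is a hypothetical sketch, not the app's actual code.

```ruby
# Stand-in store, keyed by the assumed natural key (the URL).
websites = {}

# find_or_create-style lookup: only creates when no record matches the key.
def find_or_create_website(websites, url)
  websites[url] ||= { url: url }
end

find_or_create_website(websites, "https://example.com")
# The duplicated row matches the existing record, so nothing new is created.
find_or_create_website(websites, "https://example.com")

puts websites.size # => 1
```

A `Rankable` without such a matching key would run the create path both times and end up with two records.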
At resolution time, however, when we send this data to Redshift, it will be fine. The two pieces of data will tie on `Source` rank, but ties like that are already broken by `created_at`. Our existing logic leaves Redshift, Looker, and Importers none the wiser that this bloat exists.
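The tie-break described above can be sketched like this. The field names mirror the post, but the resolution logic itself is an assumption: the record with the lowest rank wins, and `created_at` breaks ties, so the earlier original always beats its later duplicate.

```ruby
require "time"

# Two pieces of data that tie on Source rank; the duplicate was created later.
candidates = [
  { value: "original",  source_rank: 1, created_at: Time.parse("2019-01-01 10:00") },
  { value: "duplicate", source_rank: 1, created_at: Time.parse("2019-01-01 10:05") },
]

# Pick by rank first, then by created_at: earliest wins on a tie.
winner = candidates.min_by { |c| [c[:source_rank], c[:created_at]] }

puts winner[:value] # => "original"
```

Since the duplicate is always created after the original, it can never win the tie, which is why downstream consumers never see it.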