When a worker dyno that was in the middle of a `ParseCsvImportJob` dies and is then restarted by Sidekiq, the result is duplicate `RawInput` records for that `Import`. Say you have a CSV file with 10 rows and the `ParseCsvImportJob` gets through 5 before being killed. When the worker comes back up, it starts from the beginning, and you end up with a total of 15 `RawInput` records.
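The failure mode above can be simulated in a few lines. This is a hypothetical sketch, not the real job: the `records` array stands in for the `RawInput` table, and `parse_csv` stands in for `ParseCsvImportJob`'s row loop, which always starts from row zero on retry.

```ruby
# Stand-in for the RawInput table.
records = []

# Naive parse loop: always starts from the first row, with an optional
# simulated crash partway through (the dyno dying mid-job).
parse_csv = lambda do |rows, crash_after: nil|
  rows.each_with_index do |row, i|
    raise "worker killed" if crash_after && i >= crash_after
    records << row
  end
end

rows = (1..10).to_a

# First run persists 5 rows, then the worker dies.
begin
  parse_csv.call(rows, crash_after: 5)
rescue RuntimeError
end

# Sidekiq retries the job; it re-processes every row from the beginning.
parse_csv.call(rows)

puts records.size # => 15
```

Because nothing records how far the first run got, the retry's 10 rows land on top of the 5 already persisted.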
Does it matter?
The most obvious way this bug will show up is in the UI when a confused Importer is viewing their import. Rather than seeing a total count of 10 like they expect, they'll see 15.
In terms of data, what happens depends on what's in those 5 duplicated rows. For a `Rankable` like `Website`, the second of the two rows will match an existing record and no new data will be created. For all other `Rankable`s, however, duplicate data will be created.
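A toy illustration of why `Website` dedupes where other `Rankable`s don't: `Website` is looked up by a natural key (here assumed to be its URL), so a re-parsed row matches the existing record instead of creating a new one. The lookup logic below is a hypothetical sketch, not the app's actual code.

```ruby
# Stand-in store, keyed by the assumed natural key (the URL).
websites = {}

# find_or_create-style lookup: only creates when no record matches the key.
def find_or_create_website(websites, url)
  websites[url] ||= { url: url }
end

find_or_create_website(websites, "https://example.com")
# The duplicated row matches the existing record, so nothing new is created.
find_or_create_website(websites, "https://example.com")

puts websites.size # => 1
```

A `Rankable` without such a matching key would run the create path both times and end up with two records.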
At resolution time, however, when we send this data to Redshift, it will be fine. The two pieces of data will tie on `Source` rank, but ties like that are already broken by `created_at`. Our existing logic leaves Redshift, Looker, and Importers none the wiser that this bloat exists.
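The tie-break described above can be sketched like this. The field names mirror the post, but the resolution logic itself is an assumption: the record with the lowest rank wins, and `created_at` breaks ties, so the earlier original always beats its later duplicate.

```ruby
require "time"

# Two pieces of data that tie on Source rank; the duplicate was created later.
candidates = [
  { value: "original",  source_rank: 1, created_at: Time.parse("2019-01-01 10:00") },
  { value: "duplicate", source_rank: 1, created_at: Time.parse("2019-01-01 10:05") },
]

# Pick by rank first, then by created_at: earliest wins on a tie.
winner = candidates.min_by { |c| [c[:source_rank], c[:created_at]] }

puts winner[:value] # => "original"
```

Since the duplicate is always created after the original, it can never win the tie, which is why downstream consumers never see it.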