Closes #27
jmelot
left a comment
@jamesdunham a few things I was thinking about
sql/lens_metadata.sql
MAX(LOWER(id.value)) AS clean_doi,
MAX(year_published) as year,
ARRAY_AGG(author.last_name) AS last_names,
ARRAY_AGG(reference.lens_id) AS references
Again please sanity check me - they're unusual in that they provide both references and citations
Did this execute successfully for you? I get "Cannot query rows larger than 100MB limit.", which led me here, suggesting the issue is the size of the values over which we're applying array_agg?
Ugh I must have somehow forgotten to run this one. Sorry about that - can you take a look at the updated version I've added in the latest commit?
(I've written the tables to tmp.lens_ids and tmp.lens_metadata, fwiw)
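For illustration, here is a small Python sketch of one way aggregated arrays can balloon toward a row-size limit. It assumes, which this thread does not confirm, that authors and references are both unnested in the same query, so every author is paired with every reference before the aggregation runs. The record and the helper names (`unnest_both`, `array_agg`) are hypothetical.

```python
from itertools import product

# Hypothetical publication with two repeated fields (not real Lens data).
pub = {
    "lens_id": "pub-1",
    "authors": ["Smith", "Jones", "Lee"],
    "references": ["ref-1", "ref-2", "ref-3", "ref-4"],
}


def unnest_both(record):
    """Mimic unnesting two repeated fields at once: one row per (author, reference) pair."""
    return [
        {"lens_id": record["lens_id"], "last_name": author, "ref": ref}
        for author, ref in product(record["authors"], record["references"])
    ]


def array_agg(rows, column):
    """Mimic ARRAY_AGG: collect every value of a column, duplicates included."""
    return [row[column] for row in rows]


rows = unnest_both(pub)
last_names = array_agg(rows, "last_name")
references = array_agg(rows, "ref")

# Each aggregated array holds len(authors) * len(references) entries.
print(len(last_names), len(references))

# De-duplicating (the analogue of ARRAY_AGG(DISTINCT ...)) restores the original sizes.
print(len(set(last_names)), len(set(references)))
```

If this is what is happening, aggregating distinct values or aggregating each repeated field in its own subquery would keep the arrays at their natural size.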
AND lens_id in (select id from {{ staging_dataset }}.lens_ids)
)
(
SELECT
I'm guessing based on the name that the alias_lens_ids column maps lens ids to duplicates, so using those to match here
alias_lens_ids is undocumented so I'm not sure either. But from brief inspection, seems likely. We always have 1+ alias, and from spot checking, the array of aliases includes the lens_id.
select
  array_length(alias_lens_ids) as n_aliases,
  count(*) as n_pubs
from `gcp-cset-projects.lens.scholarly`
group by 1
order by 1
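The same spot check can be sketched in Python on a handful of rows. The records below are hypothetical stand-ins for `lens.scholarly`, and the sketch adds the complementary check that each alias array contains the publication's own lens_id.

```python
from collections import Counter

# Hypothetical rows standing in for lens.scholarly; the real table is not reproduced here.
pubs = [
    {"lens_id": "A", "alias_lens_ids": ["A"]},
    {"lens_id": "B", "alias_lens_ids": ["B", "B-old"]},
    {"lens_id": "C", "alias_lens_ids": ["C-old", "C"]},
]

# Equivalent of the GROUP BY query: number of publications per alias count.
alias_count_distribution = Counter(len(p["alias_lens_ids"]) for p in pubs)

# Publications whose alias array does not contain their own lens_id.
missing_self = [p["lens_id"] for p in pubs if p["lens_id"] not in p["alias_lens_ids"]]

for n_aliases, n_pubs in sorted(alias_count_distribution.items()):
    print(n_aliases, n_pubs)
print("missing self-alias:", missing_self)
```

An empty `missing_self` list on the real data would support reading alias_lens_ids as the full set of duplicate ids, including the canonical one.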
Done here with the exception of comments and questions above.
FROM
lens.scholarly
LEFT JOIN
dois
Is it intentional to join 1:M here? We string_agg() the references and array_agg() the author last names, but not the unnested DOIs, so just checking.
Yeah, it's intentional (but a reasonable question!). If we have multiple versions of the metadata for an article, we record them on different rows for the same orig_id. We aggregate the refs and authors because they're the same for each version of the article's metadata.
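A minimal Python sketch of the row shape described here, assuming one article with two DOI versions. The ids, DOIs, and the `build_metadata` helper are hypothetical, not taken from the PR.

```python
from collections import defaultdict

# Hypothetical inputs: one article with two DOI versions, plus its authors and references.
dois = [
    {"orig_id": "pub-1", "clean_doi": "10.1000/a"},
    {"orig_id": "pub-1", "clean_doi": "10.1000/b"},
]
authors = {"pub-1": ["Smith", "Jones"]}
references = {"pub-1": ["ref-1", "ref-2", "ref-3"]}


def build_metadata(dois, authors, references):
    """Emit one row per (orig_id, DOI); authors and references repeat on each row."""
    rows = []
    for doi in dois:
        orig_id = doi["orig_id"]
        rows.append({
            "orig_id": orig_id,
            "clean_doi": doi["clean_doi"],
            "last_names": list(authors.get(orig_id, [])),  # array_agg analogue
            "references": ",".join(references.get(orig_id, [])),  # string_agg analogue
        })
    return rows


metadata = build_metadata(dois, authors, references)

# Count output rows per orig_id to make the 1:M shape visible.
rows_per_id = defaultdict(int)
for row in metadata:
    rows_per_id[row["orig_id"]] += 1

print(dict(rows_per_id))
```

Each DOI version gets its own row, while the aggregated author and reference values are identical across the rows for a given orig_id.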