Add lens ids #28

Merged: jmelot merged 6 commits into master from 27-integrate-lens on Jul 13, 2023

Conversation

@jmelot (Contributor) commented Jun 9, 2023

Closes #27

@jmelot force-pushed the 27-integrate-lens branch from 7cef1e8 to 6f53213 on June 9, 2023 21:00
@jmelot (Contributor Author) left a comment

@jamesdunham a few things I was thinking about

MAX(LOWER(id.value)) AS clean_doi,
MAX(year_published) as year,
ARRAY_AGG(author.last_name) AS last_names,
ARRAY_AGG(reference.lens_id) AS references
Contributor Author

Again please sanity check me - they're unusual in that they provide both references and citations

Contributor

Did this execute successfully for you? I get "Cannot query rows larger than 100MB limit.", which led me here, suggesting the issue is with the size of the values over which we're applying array_agg?

Contributor Author

Ugh I must have somehow forgotten to run this one. Sorry about that - can you take a look at the updated version I've added in the latest commit?

Contributor Author

(I've written the tables to tmp.lens_ids and tmp.lens_metadata, fwiw)

AND lens_id in (select id from {{ staging_dataset }}.lens_ids)
)
(
SELECT
Contributor Author

I'm guessing, based on the name, that the alias_lens_ids column maps lens ids to their duplicates, so I'm using those to match here.

Contributor

alias_lens_ids is undocumented so I'm not sure either. But from brief inspection, seems likely. We always have 1+ alias, and from spot checking, the array of aliases includes the lens_id.

select 
   array_length(alias_lens_ids) as n_aliases,
   count(*) n_pubs
from `gcp-cset-projects.lens.scholarly`
group by 1
order by 1
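
A minimal sketch of the same spot check, written in Python over made-up rows rather than the real table. It tallies publications by alias count (the shape of the query's output) and lists any row whose alias array lacks its own lens_id. The record layout and sample ids are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical sample rows; the real data lives in the lens.scholarly table.
rows = [
    {"lens_id": "A", "alias_lens_ids": ["A"]},
    {"lens_id": "B", "alias_lens_ids": ["B", "B-old"]},
    {"lens_id": "C", "alias_lens_ids": ["C", "C-old-1", "C-old-2"]},
]

# Publications per alias-array length, mirroring the grouped query.
n_pubs_by_n_aliases = Counter(len(r["alias_lens_ids"]) for r in rows)

# Rows whose alias array does not contain their own lens_id.
missing_self = [
    r["lens_id"] for r in rows if r["lens_id"] not in r["alias_lens_ids"]
]

print(sorted(n_pubs_by_n_aliases.items()))
print(missing_self)
```

An empty `missing_self` list on a real sample would support the guess that each alias array includes the row's own id, though it would not prove it for the whole table.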

@jmelot requested a review from jamesdunham on June 9, 2023 21:09
@jamesdunham (Contributor)

Done here with the exception of comments and questions above.

FROM
lens.scholarly
LEFT JOIN
dois
@jamesdunham (Contributor), Jul 6, 2023

Is it intentional to join 1:M here? We string_agg() the references and array_agg() the author last names, but not the unnested DOIs, so just checking.

Contributor Author

Yeah, it's intentional (but a reasonable question!). If we have multiple versions of the metadata for an article, we record them on different rows for the same orig_id. We aggregate the refs and authors because they're the same for each version of the article's metadata.
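
To make the 1:M behaviour concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in for BigQuery. The table names, column names, and sample values are hypothetical, and GROUP_CONCAT stands in for string_agg(). It is not the PR's actual query, only an illustration of the row shape being discussed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scholarly (lens_id TEXT, year_published INTEGER);
    CREATE TABLE dois (lens_id TEXT, clean_doi TEXT);
    CREATE TABLE refs (lens_id TEXT, ref_id TEXT);
    -- One article with two DOIs and two references (made-up values).
    INSERT INTO scholarly VALUES ('p1', 2020);
    INSERT INTO dois VALUES ('p1', '10.1/a'), ('p1', '10.1/b');
    INSERT INTO refs VALUES ('p1', 'r1'), ('p1', 'r2');
""")

rows = conn.execute("""
    WITH agg_refs AS (
        -- References are collapsed to one value per article.
        SELECT lens_id, GROUP_CONCAT(ref_id) AS refs
        FROM refs
        GROUP BY lens_id
    )
    SELECT s.lens_id AS orig_id, d.clean_doi, s.year_published, a.refs
    FROM scholarly s
    LEFT JOIN dois d ON d.lens_id = s.lens_id          -- 1:M, not aggregated
    LEFT JOIN agg_refs a ON a.lens_id = s.lens_id      -- 1:1 after aggregation
    ORDER BY d.clean_doi
""").fetchall()

for row in rows:
    print(row)
```

Each DOI yields its own row for the same orig_id, while the aggregated references repeat unchanged on every row, which is the layout described in the reply above.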

@jmelot (Contributor Author) left a comment

ty!

@jmelot merged commit 63f0e23 into master on Jul 13, 2023
@jmelot deleted the 27-integrate-lens branch on July 13, 2023 17:42

Development

Successfully merging this pull request may close these issues.

Integrate Lens ids
