
feat(degreeworks): show json differences upon updating #373

Open
matt-franklin225 wants to merge 25 commits into main from show-degreeworks-json-diff


Conversation

matt-franklin225 (Contributor) commented Apr 17, 2026

Description

The DegreeWorks scraper now logs the differences between the data currently in the database and the data collected by the current scrape. The differences are computed as JSON diffs and printed to the console. The logs are divided into several categories: requirements (one each for GE, UC, Campuswide Honors Collegium 4-year, and CHC 2-year), majors, minors, specializations, and degrees awarded. To produce them, the scraped data is transformed into its database format and compared against the existing database rows.
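For illustration, here is a minimal sketch of the comparison described above: once the scraped data is in its database format, the two JSON values are walked together and every path that differs is recorded. `Json`, `Change`, and `diffJson` are hypothetical names and not identifiers from this PR, and the PR's actual diff output format may differ.

```typescript
// Hypothetical types; not taken from the PR.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface Change {
  path: string;
  before?: Json; // absent when the value was added
  after?: Json; // absent when the value was removed
}

function isPlainObject(value: Json | undefined): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively collects the paths at which two JSON values differ.
function diffJson(before: Json | undefined, after: Json | undefined, path = "$"): Change[] {
  if (isPlainObject(before) && isPlainObject(after)) {
    const changes: Change[] = [];
    // Visit the union of both key sets in a stable (sorted) order.
    for (const key of Object.keys({ ...before, ...after }).sort()) {
      changes.push(...diffJson(before[key], after[key], `${path}.${key}`));
    }
    return changes;
  }
  // Arrays and primitives are compared by their serialized form, so objects
  // nested inside arrays are sensitive to key order in this sketch.
  if (JSON.stringify(before) === JSON.stringify(after)) return [];
  const change: Change = { path };
  if (before !== undefined) change.before = before;
  if (after !== undefined) change.after = after;
  return [change];
}
```

Calling `diffJson(dbRow, scrapedRow)` with two made-up rows returns one entry per changed path; an empty array means the scrape matches the database.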

Related Issue

Closes issue #349

Motivation and Context

The DegreeWorks scraper collects a large amount of highly variable data. The variation comes not only from changes on the system side but also from the user running the scrape: depending on details such as the year the user entered UCI and whether they are in the honors program, the scraper receives different information. Until now, nothing was logged to show what was changing.

How Has This Been Tested?

This has been tested locally by running the DegreeWorks scraper with various cookies and catalogue years to confirm that changes are properly accounted for.

Screenshots (if appropriate):

image

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code involves a change to the database schema.
  • My code requires a change to the documentation.

matt-franklin225 changed the title from "Show degreeworks json diff" to "feat(degreeworks): show json differences upon updating" on Apr 19, 2026
Comment thread apps/data-pipeline/degreeworks-scraper/src/components/Scraper.ts Outdated
Comment thread apps/data-pipeline/degreeworks-scraper/src/components/Scraper.ts Outdated
Comment thread apps/data-pipeline/degreeworks-scraper/src/components/Scraper.ts Outdated
Comment thread apps/data-pipeline/degreeworks-scraper/src/components/Scraper.ts Outdated
Comment thread apps/data-pipeline/degreeworks-scraper/package.json Outdated
matt-franklin225 marked this pull request as ready for review April 27, 2026 02:35

laggycomputer (Member) left a comment

I see no code to allow the user to stop an update if the diff is not satisfactory, which was a major objective of this PR. See the course and instructor scrapers for examples of how this is implemented.

matt-franklin225 (Contributor, Author) commented

> I see no code to allow the user to stop an update if the diff is not satisfactory, which was a major objective of this PR. See the course and instructor scrapers for examples of how this is implemented.

This is now implemented and ready for re-review.
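A minimal sketch of the confirm-before-update flow discussed above, assuming the prompt and the database write are passed in as callbacks. `updateIfApproved`, `confirm`, and `applyUpdate` are hypothetical names, not identifiers from this PR; in the scraper, `confirm` could wrap a synchronous yes/no prompt such as `readlineSync.keyInYNStrict`. Skipping the write when there is no diff is also an assumption.

```typescript
type UpdateOutcome = "unchanged" | "applied" | "rejected";

// Hypothetical helper: shows a diff and only writes if the user approves it.
async function updateIfApproved<T>(
  label: string,
  diff: T | undefined,
  confirm: (question: string) => boolean,
  applyUpdate: () => Promise<void>,
): Promise<UpdateOutcome> {
  if (diff === undefined) {
    // Assumption: an undefined diff means the scrape matches the database.
    console.log(`No difference found for ${label}.`);
    return "unchanged";
  }
  console.log(`Difference between database and scraped ${label} data:`);
  console.log(diff);
  if (!confirm("Is this ok")) {
    console.log(`Update for ${label} was rejected; nothing was written.`);
    return "rejected";
  }
  await applyUpdate();
  return "applied";
}
```

Passing the prompt in as a parameter keeps the gate testable without a terminal, and the returned outcome lets the caller decide whether to continue with the remaining categories or abort the run.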

} else {
  console.log("Difference between database and scraped minors data:");
  console.log(minorsDiff);
  if (!readlineSync.keyInYNStrict("Is this ok")) {

Collaborator

If we press yes after several minutes have passed, the attempt to fetch the next set of data fails with an ECONNRESET error. This is a problem because (1) it can take more than several minutes to look over the changes for the major scrapes and (2) I can no longer visit Brandywine while waiting for my scraper to finish running in my dorm. A try {fetch} catch {retry fetch} pattern seems to work, although there might be a cleaner fix.
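One possible shape for the try {fetch} catch {retry fetch} pattern mentioned above, as a sketch rather than the fix applied in this PR. `withRetry` and `isConnectionReset` are hypothetical helpers, and the retry count and delay are arbitrary defaults. The check looks for the `ECONNRESET` code both on the error itself and on its `cause`, on the assumption that the HTTP client may wrap the underlying socket error.

```typescript
function hasCode(error: unknown, code: string): boolean {
  return typeof error === "object" && error !== null && (error as { code?: unknown }).code === code;
}

// True if the error, or the error it wraps in `cause`, is a connection reset.
function isConnectionReset(error: unknown): boolean {
  const cause =
    typeof error === "object" && error !== null ? (error as { cause?: unknown }).cause : undefined;
  return hasCode(error, "ECONNRESET") || hasCode(cause, "ECONNRESET");
}

// Hypothetical helper: retries `operation` only when the connection was reset.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  delayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on any error that is not a reset.
      if (attempt >= maxAttempts || !isConnectionReset(error)) throw error;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A request made after a long confirmation pause could then be wrapped as `await withRetry(() => fetch(url, options))`, where `url` and `options` stand for whatever the scraper already passes. Retrying only on connection resets keeps genuine failures, such as an expired cookie, visible instead of masking them.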

})
.toArray()
.sort(sortById);
const collegeBlockIds = await db

Collaborator

It makes me uncomfortable that we're inserting into the database inside Scraper.ts when the major data will be added inside Index.ts. Could we join each major with its college_requirement block inside the select statement for dbMajors and compare against that instead?
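To make the suggestion above concrete, the sketch below shows the row shape such a join would produce, written as an in-memory left join so it runs without a database. The row types and the `joinMajorsWithCollegeRequirements` name are hypothetical, as is the assumption that a major references its college requirement by id. With rows in this shape, scraped majors can be compared by requirement content, so no college requirement has to be inserted just to obtain an id for the diff.

```typescript
// Hypothetical row shapes; the real schema lives in the project's database code.
interface CollegeRequirementRow {
  id: string;
  requirements: unknown;
}

interface MajorRow {
  id: string;
  name: string;
  collegeRequirement: string | null; // assumed to reference CollegeRequirementRow.id
}

// In-memory equivalent of a LEFT JOIN from major to college_requirement:
// each major carries the requirement content instead of the requirement id.
function joinMajorsWithCollegeRequirements(majors: MajorRow[], blocks: CollegeRequirementRow[]) {
  const requirementsById = new Map(blocks.map((block) => [block.id, block.requirements] as const));
  return majors.map(({ collegeRequirement, ...rest }) => ({
    ...rest,
    collegeRequirements:
      collegeRequirement === null ? null : requirementsById.get(collegeRequirement) ?? null,
  }));
}
```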

for (const majorObj of scrapedMajors) {
  if (majorObj.collegeBlockIndex !== undefined) {
    (majorObj as typeof major.$inferInsert).collegeRequirement =
      collegeBlockIds[majorObj.collegeBlockIndex];

Collaborator

Unfortunately, we cannot always rely on the RETURNING clause of a Postgres insert to preserve the ordering of the input data (see the comments in Polymorphic Majors). However, this step could be skipped with the comment I made above.
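If the insert is kept, one way to avoid depending on row order is to match returned rows to scraped blocks by content instead of by index. The sketch below assumes each returned row includes the `requirements` payload it was inserted with; `ReturnedBlock`, `canonicalize`, and `resolveBlockIds` are hypothetical names. Keys are sorted before comparison because key order may differ after a round trip through the database.

```typescript
// Hypothetical shape of a row returned by the insert.
interface ReturnedBlock {
  id: string;
  requirements: unknown;
}

// Serializes a value with object keys sorted at every level.
function canonicalize(value: unknown): string {
  return JSON.stringify(value, (_key, v) => {
    if (v === null || typeof v !== "object" || Array.isArray(v)) return v;
    const source = v as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) sorted[key] = source[key];
    return sorted;
  });
}

// Returns the generated id for each scraped block, in scraped order,
// regardless of the order in which the database returned the rows.
function resolveBlockIds(scrapedBlocks: unknown[], returned: ReturnedBlock[]): string[] {
  const idByPayload = new Map<string, string>();
  for (const row of returned) idByPayload.set(canonicalize(row.requirements), row.id);
  return scrapedBlocks.map((block) => {
    const id = idByPayload.get(canonicalize(block));
    if (id === undefined) throw new Error("No returned row matches a scraped block");
    return id;
  });
}
```

With this mapping, `collegeBlockIds[majorObj.collegeBlockIndex]` would still index by scraped position, but the array itself no longer depends on the order of the returned rows.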

HwijungK (Collaborator) commented May 1, 2026

The entire logic may be better placed in index.ts instead of scraper.ts, since we're fiddling with the database quite a lot.


Development

Successfully merging this pull request may close these issues.

Show diffs when updating DegreeWorks JSON

3 participants