feat(degreeworks): show json differences upon updating #373
matt-franklin225 wants to merge 25 commits into
Implemented now for re-review
} else {
  console.log("Difference between database and scraped minors data:");
  console.log(minorsDiff);
  if (!readlineSync.keyInYNStrict("Is this ok")) {
If we press yes after several minutes have passed, the attempt to fetch the next set of data fails with an ECONNRESET error. This is problematic because (1) it can take more than several minutes to look over the changes for major scrapes and (2) I can no longer visit Brandywine while waiting for my scraper to finish running in my dorm. A try {fetch} catch {retry fetch} pattern seems to work, although there might be a cleaner fix.
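For reference, a minimal sketch of the try/catch retry pattern mentioned above. The helper name `withRetry` and its parameters are hypothetical and not part of this PR; it simply re-runs an async operation a fixed number of times, which is one way to recover when a connection has been reset while the prompt was waiting.

```typescript
// Hypothetical helper (not from the PR): retry an async operation a few times.
// Intended for wrapping a fetch that may fail with ECONNRESET after a long pause.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3, // total attempts, including the first one
  delayMs = 1000, // pause between attempts
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Wait briefly before opening a fresh connection.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed; surface the most recent error to the caller.
  throw lastError;
}
```

A call site might then look like `await withRetry(() => fetch(url, { headers }))`, where `url` and `headers` stand in for whatever the scraper already passes. This retries on any error, not only ECONNRESET, so a stricter version could inspect the error before retrying.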
  })
  .toArray()
  .sort(sortById);
const collegeBlockIds = await db
It makes me uncomfortable that we're inserting into the database inside Scraper.ts when the major data will be added inside Index.ts. Could we perform a join of the major and college_requirement block inside the select statement for dbMajors and compare that instead?
for (const majorObj of scrapedMajors) {
  if (majorObj.collegeBlockIndex !== undefined) {
    (majorObj as typeof major.$inferInsert).collegeRequirement =
      collegeBlockIds[majorObj.collegeBlockIndex];
Unfortunately, we cannot always rely on the RETURNING clause of a Postgres insert to have the same ordering as the input data (see the comments in Polymorphic Majors). However, this step could be skipped with the comment I made above.
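If the insert does stay, one order-independent option is to match returned rows to the scraped blocks by content instead of by position. The sketch below is only an illustration: the `InsertedBlock` and `ScrapedMajor` shapes, the `requirements` field, and the function names are assumptions, not the PR's actual schema. It also assumes each college block has distinct content, and it serializes with sorted keys because a JSON column may not preserve key order.

```typescript
// Hypothetical shapes (assumptions for illustration, not the real schema).
interface InsertedBlock {
  id: number;
  requirements: unknown;
}
interface ScrapedMajor {
  name: string;
  collegeBlockIndex?: number;
  collegeRequirement?: number;
}

// Serialize with sorted object keys so equal content yields equal strings.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([key, inner]) => `${JSON.stringify(key)}:${canonical(inner)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Link each major to its block ID by content, ignoring the order of insertedRows.
function linkCollegeBlocks(
  majors: ScrapedMajor[],
  scrapedBlocks: unknown[],
  insertedRows: InsertedBlock[],
): void {
  const idByContent = new Map<string, number>();
  for (const row of insertedRows) {
    idByContent.set(canonical(row.requirements), row.id);
  }
  for (const majorObj of majors) {
    if (majorObj.collegeBlockIndex === undefined) continue;
    const key = canonical(scrapedBlocks[majorObj.collegeBlockIndex]);
    const id = idByContent.get(key);
    if (id === undefined) {
      throw new Error(`No inserted block matches major ${majorObj.name}`);
    }
    majorObj.collegeRequirement = id;
  }
}
```

Throwing on a missing match makes a mismatch visible immediately instead of silently linking a major to the wrong block.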
The entire logic may be better placed in index.ts instead of scraper.ts, since we're fiddling with the database quite a lot.
Description
The DegreeWorks scraper now logs the differences between the information currently in the database and the information collected on the current scrape. These changes are formatted as JSON differences and displayed in the console. The logs are divided into several categories: requirements (one each for GE, UC, Campuswide Honors Collegium 4-year, and CHC 2-year), majors, minors, specializations, and degrees awarded. This was done by transforming the scraped data into its database format and comparing it to the actual database contents.
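To illustrate the idea of comparing two JSON values, here is a minimal hand-rolled sketch. It is not the diffing code used in this PR; the `Json` and `Change` types and the `diffJson` function are hypothetical names introduced only for this example. It walks both values and records the path of every leaf that was added, removed, or changed.

```typescript
// Hypothetical types and function for illustration; not the PR's implementation.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface Change {
  path: string;
  before?: Json; // undefined when the value was added
  after?: Json; // undefined when the value was removed
}

function isObject(value: Json | undefined): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function diffJson(before: Json | undefined, after: Json | undefined, path = "$"): Change[] {
  const changes: Change[] = [];
  if (isObject(before) && isObject(after)) {
    // Visit the union of keys in sorted order so the output is deterministic.
    const keys = Array.from(new Set(Object.keys(before).concat(Object.keys(after)))).sort();
    for (const key of keys) {
      changes.push(...diffJson(before[key], after[key], `${path}.${key}`));
    }
  } else if (Array.isArray(before) && Array.isArray(after)) {
    // Arrays are compared index by index; a missing index counts as added or removed.
    const length = Math.max(before.length, after.length);
    for (let i = 0; i < length; i++) {
      changes.push(...diffJson(before[i], after[i], `${path}[${i}]`));
    }
  } else if (JSON.stringify(before) !== JSON.stringify(after)) {
    // Leaf values, or values of different kinds, that do not match.
    changes.push({ path, before, after });
  }
  return changes;
}
```

Comparing arrays by index means an element inserted at the front shows up as a change at every later index, which is one reason the rows are worth sorting by a stable key before diffing.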
Related Issue
Closes issue #349
Motivation and Context
The DegreeWorks scraper collects a large amount of data that varies widely, not just because of changes on the system side but also because of differences between users. Depending on certain information about the user (such as the year the user entered UCI and whether they are in the honors program), we receive different information. Until now, nothing was logged to show what was changing.
How Has This Been Tested?
This has been tested locally by running the DegreeWorks scraper with various cookies and catalogue years to ensure that changes are being properly accounted for.
Screenshots (if appropriate):
Types of changes
Checklist: