The rendered test images differ subtly between platforms (macOS vs Linux; Linux conda vs Linux pip), in fonts and line thickness in particular, and this needs to be investigated. It would be nice to reduce the test tolerances so that the tests catch the real changes that matter (gridlines, labels).
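To illustrate why a loose tolerance is a problem, here is a minimal sketch, assuming the comparison is a root-mean-square (RMS) pixel difference checked against a threshold. The helper names (`rms_difference`, `images_match`), the tiny 4x4 grayscale "images", and the tolerance values are all hypothetical and are not taken from the actual test suite.

```python
import math


def rms_difference(expected, actual):
    """RMS pixel difference between two equal-sized grayscale images (lists of rows)."""
    if len(expected) != len(actual) or any(
        len(row_e) != len(row_a) for row_e, row_a in zip(expected, actual)
    ):
        raise ValueError("images must have the same shape")
    squared = [
        (e - a) ** 2
        for row_e, row_a in zip(expected, actual)
        for e, a in zip(row_e, row_a)
    ]
    return math.sqrt(sum(squared) / len(squared))


def images_match(expected, actual, tol):
    """Hypothetical pass/fail rule: the images match if the RMS difference is within tol."""
    return rms_difference(expected, actual) <= tol


# Hypothetical 4x4 baseline: a blank white image.
baseline = [[255] * 4 for _ in range(4)]

# Platform noise (assumed): every pixel is slightly off, e.g. from font rendering.
platform_noise = [[253] * 4 for _ in range(4)]

# A real change (assumed): one row is darker, standing in for a new gridline.
gridline_added = [row[:] for row in baseline]
gridline_added[1] = [200] * 4

loose_tol = 30.0  # illustrative value only
tight_tol = 5.0   # illustrative value only

# With the loose tolerance both images pass, so the gridline change goes unnoticed.
print(images_match(baseline, platform_noise, loose_tol))  # True
print(images_match(baseline, gridline_added, loose_tol))  # True

# With the tight tolerance the noise still passes, but the gridline change is caught.
print(images_match(baseline, platform_noise, tight_tol))  # True
print(images_match(baseline, gridline_added, tight_tol))  # False
```

In this toy case the tolerance can only be tightened because the platform noise is small; if the font and line-thickness differences were as large as the change of interest, no single threshold would separate them, which is why the platform differences need investigating first.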