Include annexed files in zip and tar.gz downloads#43
Include annexed files in zip and tar.gz downloads#43matrss wants to merge 11 commits intoneuropoly:git-annexfrom
Conversation
a064593 to
2e597e1
Compare
35962fc to
de39051
Compare
623fb15 to
d27dc49
Compare
|
I am a bit stuck writing a test case for this. I wanted to amend the integration test TestAPIDownloadArchive by also testing against an annex-enabled repository (or adding a similar test case against a git-annex repo). Unfortunately I can't get the integration test suite to run, even on the latest This seems to be rooted in recent changes to how the access tokens are constructed / what scopes are available. After replacing test output
Am I missing some step? |
968cd74 to
df9116c
Compare
2e597e1 to
807ab78
Compare
|
@kousu in case you haven't seen my comment above, I am a bit stuck on writing tests for this. It looks like the git-annex repositories used in the tests are missing in the testing setup. |
|
I'm sorry! I don't know why I didn't see this. Maybe my inbox was just too noisy but I thought I would have kept my eye out for this because I am very excited for this feature. I'll take a look today and give you feedback. Writing tests for this took me a long time too. |
Oh I see. I had this error message too! It is entirely related to upstream's recent token changes, the same ones that you had fix the build also changed how you have to write the tests. I have fixed this in #1 already by adding the missing scopes everywhere, e.g. so if you rebase/merge #1 into your branch that it should fix it up. I'll build it locally and try to see if it does. Again I'm super super sorry for holding this up. I see the problem now: I am only watching issues I'm tagged in; I'll change that, though if you definitely want my attention feel free to tag me for reviews / in comments like you did! And thanks a lot for this work, and I hope we can get this out together soon :) By the way, the reason the error is confusing is that they should be using calls calls
|
|
Nope: on my branch I'm currently getting The tests have indeed been failing for about a month, so I never actually fixed this properly, e.g. https://github.com/neuropoly/gitea/actions/runs/5390321986/job/14590474958#step:9:113 I'll try to figure out what they changed now, fix it, and let you know when you can rebase on me. |
|
Okay @matrss I fixed all the tokens that needed tokening and rebased your branch locally: and the tests are back to passing.
|
|
I pushed a fix for the database blocking issue. As far as I can tell this fixes parallel archive creation and also the non-responsive UI issue. Edit: upstream PR for this is go-gitea#27563 |
kousu
left a comment
There was a problem hiding this comment.
Okay I think this is quite good now! And has been held off for long enough.
Do you prefer me to squash this or include all your commits in series?
|
Maybe this squashes to three commits: |
|
I squashed it down to 3 commits as you suggested; that is probably easier to maintain as a patchset on top of upstream gitea, right? I also extended the commit messages a bit to describe what's done. |
|
I rebuilt and retested my load test by grabbing out several commit SHA1s and doing I can see it them running in parallel here, and also see that there's only one archive being created and I'm only using 1 CPU (-> 125% CPU) And, success, the UI is responsive now!! but, this is super weird, if I hit similar URLs from Firefox, each of those triggers a parallel archive? Here I tried downloading an extra 3 URLs from Firefox and found by CPU usage increased by 300% (-> 375% in this screenshot but it was flucating around 400%, not quite 300% ) and four archives running at once And firefox even starts downloading multiple archives at once. I don't know if this was caused by your unblocking workaround? that's super weird. I backed out the unblocking commit: and went to http://localhost:3000/admin and clicked "Supprimer toutes les archives des dépôts (ZIP, TAR.GZ, etc..)" and confirmed they were actually gone: Then I re-ran the load test: xargs parallel curl downloadsand I also tried loading, from firefox,
With this:
So I think the committing patch did something weird that allowed multiple archives to be generated at once; I don't know why I only seem to be able to trigger it from Firefox though. I also don't know that that's necessarily a bad thing; either people can DoS the site by blocking the UI, or DoS the site by spiking the CPU load; you can control CPU load in other ways, like systemd CPU limits. |
+x -xI wanted to see if git's executable (+x) file mode would survive being annexed. Unfortunately it does not. Here's a script to make a repo: script to generate test repoThis made for me: Notice that git has recorded "new file mode 100755" for annex-script.sh. I make an empty repo in gitea and upload to it: DetailsGitea's UI handles it okay: But downloading gives: So annex-script.sh loses its +x bit when its run through the archiver. Furthermore, even the non-+x files are wrong. Git only supports three file mode: 0644 (files), 0755 (scripts), and 120000 (links). annex-notes.txt should have come out Also the .tgz preserved on the annex files the username of the user running gitea (p115628/domain users), while all the rest are owned root/root. It must be copying from the modes of the files in .git/annex/objects, which due to lockdown are |
annex symlinks vs pointer filesOlder git-annex used symlinks; newer git annex defaults to using pointer files, though it's possible (unfortunately) to mix and match ( The archive export should be able to handle both formats in the same repo. create test repoThis made me I upload it DetailsAnd then download it It successfully recovered the contents And again both file modes are wrong, pointer and non-pointer, probably because the archive copied the mode from the annexed content not from what the header in git. symlinks + pointer files +
|
|
Okay I know I already approved this but I'm a jerk and I went back to nitpick and found this bug with the file modes. I patched it in 41bba27 and I think a did a good job but maybe there's something I haven't thought of; what do you think @matrss ? This bit is maybe a bit weird With this patch in place my tests come out all uniformly with the permission bits I expect: zip: tgz: I'm working on patching your tests to match now. They're pretty nice tests by the way! They're short and precise. |
This is intended behaviour. The number of parallel archive creations should be limited by the This should also apply for the curl downloads. There seems to be a race condition where two archive downloads started at the same time do not run in parallel. A delay in between the curl calls makes them run in parallel too, e.g. with:
As above, intended behaviour by gitea which was just broken with sqlite. Since the maximum number of workers is limited I don't think this qualifies as a DoS by CPU load. Worst case, set MAX_WORKERS to 1 and this should not be an issue.
Yes, that was an oversight in the case of pointer files.
For what it's worth, I did not yet understand how
Right, that is a bug as well.
That is what my implementation did, for all annexed files (symlink and pointer file).
I don't think git-annex prefers pointer files over symlinks in newer versions. As you noted, Edit:
Taken from: https://git-annex.branchable.com/tips/unlocked_files/ I am wondering why
Yes, it should. |
The permissions in this last line are what I find weird with |
modules/annex/annex_archive.go
Outdated
There was a problem hiding this comment.
Is this direct mapping correct? According to https://pkg.go.dev/code.gitea.io/gitea/modules/git#EntryMode git.EntryModeExec is 0o100755. According to https://pkg.go.dev/io/fs#FileMode ModePerm is 0o777 and according to https://pkg.go.dev/io/fs#FileMode.IsRegular regular files should have no ModeType bits set. Shouldn't therefore git.EntryModeExec map to fs.FileMode(0o755) instead? This applies for targz and git.EntryModeBlob as well.
|
Re: symlink vs pointer file I opened a bug report here for git-annex because I would really prefer |
|
I tested a bit and found this: For a file with executable permissions in git, For a file without executable permissions, We should probably match that behavior as well and I am working on a patch for that (and for the test suite). |
|
That's a weird inconsistency in git, I'd even wonder if it's a bug, but yes we should match it! Thank you for your incredibly good attention to detail and patience. It's clear you want to make a quality product like I do :) And sorry again for not making the time this week to finish up the patch I started. |
I wondered about that too. The code in git is pretty explicit about it though, e.g. for tar it internally uses an explicit |
|
I pushed another refactor of the archive tests, since I had that lying around already and it made it easier to deal with properly checking the file permissions. It also runs 100 parallel downloads against each of the download endpoints, which I mainly did to try if it handles parallel downloads properly. The file permissions should now match what |
|
Rebased again. I would probably squash this down to 3 commits again before this is merged. |
|
I squashed it down again. I think this is ready for another review now. |
[git-annex](https://git-annex.branchable.com/) is a more complicated cousin to git-lfs, storing large files in an optional-download side content. Unlike lfs, it allows mixing and matching storage remotes, so the content remote(s) doesn't need to be on the same server as the git remote, making it feasible to scatter a collection across cloud storage, old harddrives, or anywhere else storage can be scavenged. Since this can get complicated, fast, it has a content-tracking database (`git annex whereis`) to help find everything later. The use-case we imagine for including it in Gitea is just the simple case, where we're primarily emulating git-lfs: each repo has its large content at the same URL. Our motivation is so we can self-host https://www.datalad.org/ datasets, which currently are only hostable by fragilely scrounging together cloud storage -- and having to manage all the credentials associated with all the pieces -- or at https://openneuro.org which is fragile in its own ways. Supporting git-annex also allows multiple Gitea instance to be annex remotes for each other, mirroring the content or otherwise collaborating the split up the hosting costs. Enabling -------- TODO HTTP ---- TODO Permission Checking ------------------- This tweaks the API in routers/private/serv.go to expose the calling user's computed permission, instead of just returning HTTP 403. This doesn't fit in super well. It's the opposite from how the git-lfs support is done, where there's a complete list of possible subcommands and their matching permission levels, and then the API compares the requested with the actual level and returns HTTP 403 if the check fails. But it's necessary. The main git-annex verbs, 'git-annex-shell configlist' and 'git-annex-shell p2pstdio' are both either read-only or read-write operations, depending on the state on disk on either end of the connection and what the user asked it to ask for, with no way to know before git-annex examines the situation. So tell the level via GIT_ANNEX_READONLY and trust it to handle itself. In the older Gogs version, the permission was directly read in cmd/serv.go: ``` mode, err = db.UserAccessMode(user.ID, repo) ``` - https://github.com/G-Node/gogs/blob/966e925cf320beff768b192276774d9265706df5/internal/cmd/serv.go#L334 but in Gitea permission enforcement has been centralized in the API layer. (perhaps so the cmd layer can avoid making direct DB connections?) Deletion -------- git-annex has this "lockdown" feature where it tries really quite very hard to prevent you deleting its data, to the point that even an rm -rf won't do it: each file in annex/objects/ is nested inside a folder with read-only permissions. The recommended workaround is to run chmod -R +w when you're sure you actually want to delete a repo. See https://git-annex.branchable.com/internals/lockdown So we edit util.RemoveAll() to do just that, so now it's `chmod -R +w && rm -rf` instead of just `rm -rf`.
Fixes neuropoly#11 Tests: * `git annex init` * `git annex copy --from origin` * `git annex copy --to origin` over: * ssh for: * the owner * a collaborator * a read-only collaborator * a stranger in a * public repo * private repo And then confirms: * Deletion of the remote repo (to ensure lockdown isn't messing with us: https://git-annex.branchable.com/internals/lockdown/#comment-0cc5225dc5abe8eddeb843bfd2fdc382) ------ To support all this: * Add util.FileCmp() * Patch withKeyFile() so it can be nested in other copies of itself ------- Many thanks to Mathieu for giving style tips and catching several bugs, including a subtle one in util.filecmp() which neutered it. Co-authored-by: Mathieu Guay-Paquet <mathieu.guay-paquet@polymtl.ca>
Fixes neuropoly#8 Co-authored-by: Mathieu Guay-Paquet <mathieu.guaypaquet@gmail.com>
This makes HTTP symmetric with SSH clone URLs. This gives us the fancy feature of _anonymous_ downloads, so people can access datasets without having to set up an account or manage ssh keys. Previously, to access "open access" data shared this way, users would need to: 1. Create an account on gitea.example.com 2. Create ssh keys 3. Upload ssh keys (and make sure to find and upload the correct file) 4. `git clone git@gitea.example.com:user/dataset.git` 5. `cd dataset` 6. `git annex get` This cuts that down to just the last three steps: 1. `git clone https://gitea.example.com/user/dataset.git` 2. `cd dataset` 3. `git annex get` This is significantly simpler for downstream users, especially for those unfamiliar with the command line. Unfortunately there's no uploading. While git-annex supports uploading over HTTP to S3 and some other special remotes, it seems to fail on a _plain_ HTTP remote. See neuropoly#7 and https://git-annex.branchable.com/forum/HTTP_uploads/#comment-ce28adc128fdefe4c4c49628174d9b92. This is not a major loss since no one wants uploading to be anonymous anyway. To support private repos, I had to hunt down and patch a secret extra security corner that Gitea only applies to HTTP for some reason (services/auth/basic.go). This was guided by https://git-annex.branchable.com/tips/setup_a_public_repository_on_a_web_site/ Fixes neuropoly#3 Co-authored-by: Mathieu Guay-Paquet <mathieu.guaypaquet@polymtl.ca>
This moves the `annexObjectPath()` helper out of the tests and into a dedicated sub-package as `annex.ContentLocation()`, and expands it with `.Pointer()` (which validates using `git annex examinekey`), `.IsAnnexed()` and `.Content()` to make it a more useful module. The tests retain their own wrapper version of `ContentLocation()` because I tried to follow close to the API modules/lfs uses, which in terms of abstract `git.Blob` and `git.TreeEntry` objects, not in terms of `repoPath string`s which are more convenient for the tests.
Previously, Gitea's LFS support allowed direct-downloads of LFS content, via http://$HOSTNAME:$PORT/$USER/$REPO/media/branch/$BRANCH/$FILE Expand that grace to git-annex too. Now /media should provide the relevant *content* from the .git/annex/objects/ folder. This adds tests too. And expands the tests to try symlink-based annexing, since /media implicitly supports both that and pointer-file-based annexing.
This updates the repo index/file view endpoints so annex files match the way LFS files are rendered, making annexed files accessible via the web instead of being black boxes only accessible by git clone. This mostly just duplicates the existing LFS logic. It doesn't try to combine itself with the existing logic, to make merging with upstream easier. If upstream ever decides to accept, I would like to try to merge the redundant logic. The one bit that doesn't directly copy LFS is my choice to hide annex-symlinks. LFS files are always _pointer files_ and therefore always render with the "file" icon and no special label, but annex files come in two flavours: symlinks or pointer files. I've conflated both kinds to try to give a consistent experience. The tests in here ensure the correct download link (/media, from the last PR) renders in both the toolbar and, if a binary file (like most annexed files will be), in the main pane, but it also adds quite a bit of code to make sure text files that happen to be annexed are dug out and rendered inline like LFS files are.
Upstream can handle the full test suite; to avoid tedious waiting, we only test the code added in this fork.
This extends the archive creation logic to add annexed files to the created archives. The basic flow is this: 1. Create an archive using `git archive` 2. Read in that archive and write out a new one, replacing all annexed files with their annexed content; leaving the git-only files as-is The file permissions with which the annexed files are put into the archive are decided based on what `git archive` does for other files as well: - For tar.gz archives, executable files get permissions 0775 and regular files get 0664. - For zip archives, executable files get permissions 0755 and regular files are archived with "FAT permissions" rw, instead of unix permissions. If for a given archive request an annexed file is not present on the gitea instance then the content as tracked by git (i.e. a symlink or pointer file) is silently put into the resulting archive instead. Co-authored-by: Nick Guenther <nick.guenther@polymtl.ca>
Tests include:
- Compare the list of files in the resulting archive with the list of
files as tracked in the archived git tree.
- Compare the content of each file with what it should be (git blob
content or the annexed file, respectively).
- Check that the file mode matches the expected file mode for all
archived files.
- Check that the resulting archive has the archived commitID set as a
comment (as `git archive` does as well).
The tests are done for both the "web" endpoints at
`/<user>/<repo>/archive/<git-ref>.{tar.gz,zip}` and the "api-v1"
endpoints at `/api/v1/<user>/<repo>/archive/<git-ref>.{tar.gz,zip}`.
This commit can be dropped as soon as go-gitea#27563 is accepted.





Unfortunately there is no straightforward way to retroactively add additional files to tar or zip archives (at least not using the Writer based API provided by the go standard library). Therefore I chose a somewhat different approach:
For tar.gz:
git archiveto generate a tar archive of the git repo (i.e. what gitea did before)This should make sure that for standard git repos the output stays the same as before this change.
For zip it is mostly the same, apart from needing to read the
git archiveoutput fully into memory, since zip files cannot be handled in a streaming fashion:git archiveto generate a zip archive of the git repoIn general this approach should be pretty efficient, apart from needing to read the entire zip output from
git archiveinto memory. Since a git repo by itself is usually not too large this should be OK for now, though.TODOs:
handle the bundle archive type