bug: action fetches can fail too easily #1091
forgejo/runner#1091
Can you reproduce the bug on the Forgejo test instance?
No
Description
We occasionally see failures like this one: https://codeberg.org/ziglang/zig/actions/runs/68/jobs/4
I've only seen this on our RISC-V machines so far. It might just be because they have fairly low single-core performance, but I can't say for sure. In any case, it seems like there should be a retry mechanism put in place here.
Forgejo Version
Codeberg current
Runner Version
v11.1.2
How are you running Forgejo?
Codeberg
How are you running the Runner?
make build + systemd
Logs
Workflow file
https://codeberg.org/ziglang/zig/src/branch/master/.forgejo/workflows/ci.yaml
Note for anyone adding retry logic: make sure that the #1071 case is handled too. Until that is fixed, we don't want to retry on non-terminating errors, as they will never succeed.
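To make the retry idea concrete, here is a minimal Go sketch of a bounded retry that gives up immediately on errors classified as non-retryable. It is not the runner's code: `retryFetch`, `isTransient`, and the two sentinel errors are hypothetical names invented for illustration, and which real errors belong in each class (including the #1071 case) is left as an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors, used only for this sketch.
var (
	errTransient = errors.New("transient fetch failure (e.g. timeout)")
	errPermanent = errors.New("permanent fetch failure")
)

// retryFetch calls fetch up to maxAttempts times, sleeping delay*attempt
// between attempts. It returns at once when retryable reports false, so
// errors that can never succeed are not retried.
func retryFetch(maxAttempts int, delay time.Duration, fetch func() error, retryable func(error) bool) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fetch()
		if err == nil {
			return nil
		}
		if !retryable(err) {
			return fmt.Errorf("fetch failed, not retrying: %w", err)
		}
		if attempt < maxAttempts {
			time.Sleep(delay * time.Duration(attempt))
		}
	}
	return fmt.Errorf("fetch failed after %d attempts: %w", maxAttempts, err)
}

// isTransient is a placeholder classifier; a real one would have to
// inspect the actual git error.
func isTransient(err error) bool { return errors.Is(err, errTransient) }

func main() {
	calls := 0
	err := retryFetch(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // simulate two timeouts, then success
		}
		return nil
	}, isTransient)
	fmt.Println("calls:", calls, "err:", err)
}
```

Keeping the classifier as a separate function means the retry loop itself does not need to change once the set of retryable errors is better understood.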
The clone is re-used if it exists. That may be a way to work around these transient timeout errors, unless the runner is launched in a way that does not re-use the ~/.cache/act directory.
I wonder if there is a way to convince git to retry (via git options). I did not find anything so maybe not.
Is there a repository describing how the machine running this workflow is provisioned? I assume this is real hardware that you have, or is it qemu based?
All our CI machines are real hardware.
We don't have setup instructions all the way from "how to install the OS", but as far as the runner goes, we have this: https://codeberg.org/ziglang/infra/src/branch/master/forgejo-runner.md
For Linux, we generally prefer Debian, but the RISC-V machines in particular run Ubuntu - specifically, the Bianbu variant that SpacemiT maintain, as those machines are Milk-V Jupiters.
This is becoming a significant problem now; the error in the issue description is now happening on almost every run on this particular machine.
There's a stack of changes in this area that are merged and unreleased, or in-progress:
I'm hoping to finish #1162 (today + time for a review), and then push a new runner release.
I wouldn't mind doing a new release with just the first two changes if you want to see whether those have any impact; #1156 will probably improve the situation because effectively two fetches are being performed that get collapsed into one, and, if you switched to specific sha pinning, #1160 would make it entirely local during checkout.
An additional change that could be added would be to allow the fetch operation to fail, regardless of what the error is, as a warning, as long as the requested reference is valid in the current cache.
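As a rough illustration of that fallback, the following Go sketch downgrades a fetch failure to a warning when the requested reference already resolves in the local cache. It is only a sketch under assumptions: `fetchOrUseCache`, `errFetchTimeout`, and the `refInCache` callback are hypothetical and do not correspond to actual runner functions.

```go
package main

import (
	"errors"
	"fmt"
)

// errFetchTimeout is a hypothetical error standing in for any fetch failure.
var errFetchTimeout = errors.New("fetch timed out")

// fetchOrUseCache runs fetch. If fetch fails but refInCache reports that ref
// is already available locally, the failure is logged as a warning and the
// function returns (true, nil) to signal that the cached copy is being used.
// If fetch fails and ref is not cached, the error is returned.
func fetchOrUseCache(ref string, fetch func() error, refInCache func(string) bool) (bool, error) {
	err := fetch()
	if err == nil {
		return false, nil
	}
	if refInCache(ref) {
		fmt.Printf("warning: fetch failed (%v); using cached %q\n", err, ref)
		return true, nil
	}
	return false, fmt.Errorf("fetch failed and %q is not cached: %w", ref, err)
}

func main() {
	usedCache, err := fetchOrUseCache(
		"v1", // hypothetical action reference
		func() error { return errFetchTimeout },
		func(string) bool { return true }, // pretend the reference is cached
	)
	fmt.Println("usedCache:", usedCache, "err:", err)
}
```

One trade-off with this approach: a cached branch or tag may be stale after a failed fetch, since those references can move, whereas a commit sha cannot.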
It should be fine to wait for a release that includes #1162; we do still have 3 other RISC-V machines that don't exhibit this problem so we're not completely blocked.
Efficiency in git operations has been improved with Forgejo Runner v12.0.0's release: #1156 removed an entire pull operation during every action, #1162 removed clones when references change, and #1160 gives you an option to pin your action references and no remote access will be required at all (once it is cached).
I think those are the best fixes that we can do here; if timeouts remain a problem and you don't want to pin sha references, then the next steps might be looking at configuration on data.forgejo.org?
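The reason sha pinning can avoid the network is that a full commit sha is immutable, so a cached copy never needs refreshing. The Go sketch below shows that decision in isolation; `isFullSHA` and `needsRemote` are hypothetical helpers, not the logic from #1160, and the sketch assumes 40-character lowercase SHA-1 object names.

```go
package main

import "fmt"

// isFullSHA reports whether ref looks like a full 40-character lowercase
// hexadecimal SHA-1 object name. Assumption: abbreviated or uppercase
// hashes are treated like any other mutable reference.
func isFullSHA(ref string) bool {
	if len(ref) != 40 {
		return false
	}
	for _, c := range ref {
		isDigit := c >= '0' && c <= '9'
		isLowerHex := c >= 'a' && c <= 'f'
		if !isDigit && !isLowerHex {
			return false
		}
	}
	return true
}

// needsRemote reports whether resolving ref requires contacting the remote.
// Only a pinned full sha that is already cached can be resolved locally;
// branches and tags may have moved, so they always need a fetch.
func needsRemote(ref string, cached bool) bool {
	return !(isFullSHA(ref) && cached)
}

func main() {
	// Placeholder sha for illustration, not a real commit.
	pinned := "0123456789abcdef0123456789abcdef01234567"
	fmt.Println(needsRemote(pinned, true))
	fmt.Println(needsRemote(pinned, false))
	fmt.Println(needsRemote("v4", true))
}
```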
I'm going to tentatively close this issue, but please feel free to reopen it if needed.