bug: action fetches can fail too easily #1091

Closed
opened 2025-10-14 11:45:28 +00:00 by alexrp · 9 comments
Contributor

Can you reproduce the bug on the Forgejo test instance?

No

Description

We occasionally see failures like this one: https://codeberg.org/ziglang/zig/actions/runs/68/jobs/4

I've only seen this on our RISC-V machines so far. It might just be because they have fairly low single-core performance, but I can't say for sure. In any case, it seems like there should be a retry mechanism put in place here.

Forgejo Version

Codeberg current

Runner Version

v11.1.2

How are you running Forgejo?

Codeberg

How are you running the Runner?

make build + systemd

Logs

callisto(version:v11.1.2) received task 1910864 of job riscv64-linux-debug, be triggered by event: workflow_dispatch
workflow prepared
evaluating expression 'success()'
expression 'success()' evaluated to 'true'
  ☁  git clone 'https://data.forgejo.org/actions/checkout' # ref=v4
  cloning https://data.forgejo.org/actions/checkout to /home/ci/.cache/act/c3/fe249fe73091a17d6638fe1341e7bd0bcc3466ce52323c0688e83e2463a4ab
unexpected client error: unexpected requesting "https://data.forgejo.org/actions/checkout/git-upload-pack" status code: 408
skipping post step for 'Checkout'; step was not executed
🏁  Job failed
unexpected client error: unexpected requesting "https://data.forgejo.org/actions/checkout/git-upload-pack" status code: 408

Workflow file

https://codeberg.org/ziglang/zig/src/branch/master/.forgejo/workflows/ci.yaml

Note for anyone adding retry logic: make sure the #1071 case is handled too. Until that is fixed, we don't want to retry on non-terminating errors, as they will never succeed.
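
To make the retry idea concrete, here is a minimal sketch in Python (not the runner's actual code). It retries only on errors that look transient, with exponential backoff, and re-raises anything else immediately. The `FetchError` type, the `fetch` callable, and the set of status codes treated as transient are all assumptions made for illustration; the #1071 case is not modeled here.

```python
import time

# Assumption: HTTP status codes treated as transient. 408 is the one seen in
# the logs above; the others are illustrative guesses.
TRANSIENT_STATUS = {408, 429, 502, 503, 504}


class FetchError(Exception):
    """Hypothetical error carrying the HTTP status of a failed fetch."""

    def __init__(self, status):
        super().__init__(f"unexpected status code: {status}")
        self.status = status


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except FetchError as err:
            is_last = attempt == attempts - 1
            # Errors that can never succeed are re-raised without retrying.
            if err.status not in TRANSIENT_STATUS or is_last:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the backoff can be replaced in tests; a real implementation would probably also add jitter.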

Contributor

The clone is re-used if it exists, which may be a way to work around these transient timeout errors, unless the runner is launched in a way that does not re-use the ~/.cache/act directory.

Contributor

I wonder if there is a way to convince git to retry (via git options). I did not find anything so maybe not.

Contributor

Is there a repository describing how the machine running this workflow is provisioned? I assume this is real hardware that you have, or is it QEMU-based?

Author
Contributor

All our CI machines are real hardware.

We don't have setup instructions all the way from "how to install the OS", but as far as the runner goes, we have this: https://codeberg.org/ziglang/infra/src/branch/master/forgejo-runner.md

For Linux, we generally prefer Debian, but the RISC-V machines in particular run Ubuntu - specifically, the Bianbu variant that SpacemiT maintain, as those machines are Milk-V Jupiters.

Author
Contributor

This is becoming a significant problem: the error in the issue description now happens on almost every run on this particular machine.

Owner

There's a stack of changes in this area that are merged and unreleased, or in-progress:

  • use git reset --hard instead of pull and checkout -- #1156
  • if you pin to a specific sha, no fetch needed at all -- #1160
  • use git worktrees so that separate refs don't require separate fetches -- #1162

I'm hoping to finish #1162 (today + time for a review), and then push a new runner release.

I wouldn't mind doing a new release with just the first two changes if you want to see whether those have any impact; #1156 will probably improve the situation because effectively two fetches are being performed that get collapsed into one, and, if you switched to specific sha pinning, #1160 would make it entirely local during checkout.

One further change that could be added: allow the fetch operation to fail with only a warning, regardless of the error, as long as the requested reference is valid in the current cache.
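
A rough Python sketch of that fallback, to show the intended control flow rather than anything the runner actually does. The names `resolve_ref`, `fetch`, and `cached_refs` are hypothetical; `cached_refs` stands in for whatever the local cache knows about (here a plain mapping from ref name to commit), and `fetch` is assumed to update it in place on success.

```python
import warnings


def resolve_ref(ref, fetch, cached_refs):
    """Try to fetch; on failure, fall back to the cache if it has the ref."""
    try:
        fetch()  # assumed to refresh cached_refs in place on success
    except Exception as err:
        if ref not in cached_refs:
            # Nothing usable locally, so the failure is still fatal.
            raise
        # The ref is already cached: downgrade the failure to a warning.
        warnings.warn(f"fetch failed ({err}); using cached {ref!r}")
    return cached_refs[ref]
```

One trade-off worth noting: for a moving ref such as a branch or tag, the cached commit may be stale when the fetch fails, so the job would run against an older version of the action.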

Author
Contributor

It should be fine to wait for a release that includes #1162; we do still have 3 other RISC-V machines that don't exhibit this problem so we're not completely blocked.

Owner

Efficiency in git operations has been improved with Forgejo Runner v12.0.0's release: #1156 removed an entire pull operation during every action, #1162 removed clones when references change, and #1160 gives you an option to pin your action references and no remote access will be required at all (once it is cached).
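
As an illustration of why pinning avoids remote access, the decision can be sketched like this in Python. This is a simplified model, not the #1160 implementation: `needs_fetch` and `cached_commits` are hypothetical names, and the check assumes 40-character SHA-1 commit ids.

```python
import re

# Assumption: a pinned reference is a full 40-hex-digit SHA-1 commit id.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def needs_fetch(ref, cached_commits):
    """Return False only when ref is a full sha already present in the cache."""
    pinned_and_cached = bool(FULL_SHA.fullmatch(ref)) and ref in cached_commits
    return not pinned_and_cached
```

A tag or branch such as `v4` can move on the remote, so it always needs a fetch in this model; a full sha names exactly one commit, so once that commit is cached there is nothing left to ask the remote.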

I think those are the best fixes that we can do here; if timeouts remain a problem and you don't want to pin sha references, then the next steps might be looking at configuration on data.forgejo.org?

I'm going to tentatively close this issue, but please feel free to reopen it if needed.

Reference
forgejo/runner#1091