bug: jobs stuck at "Set up job" #1391
Examples:
This looks the same, symptom-wise, as #1302 which was believed fixed by #1303 & https://codeberg.org/forgejo/forgejo/pulls/10899 (either one).
(notes from Forgejo Development chat):
Symptom-wise it's similar to #1302, which would lead me to think that the technical events occurring are: the runner has initiated a FetchTask, Forgejo has assigned that task to the runner, and then, for some as-yet-unknown reason, the runner never initiated any work on the task.
Theoretically, a race condition with a fetch timeout could cause this: if FetchTask has completed successfully server-side, assigning tasks to a runner, but the client side has timed out:

```go
reqCtx, cancel := context.WithTimeout(ctx, p.cfg.Runner.FetchTimeout)
defer cancel()
v := tasksVersion.Load()
resp, err := client.FetchTask(reqCtx, connect.NewRequest(&runnerv1.FetchTaskRequest{
	TasksVersion: v,
	TaskCapacity: &availableCapacity,
}))
```

Doesn't seem excessively likely, but it's an interesting possibility.
Broadly, if FetchTask requests were received, committed on Forgejo, and the responses never reached a runner -- that's plausible with a proxy problem of some kind. Making the FetchTask call idempotent would probably be a reasonable change to address a few of these edge cases.
--
Little vaguer -- corruption of the in-memory task map in the runner could cause an error that occurs without any job reporting back to the server:

```go
if _, ok := r.runningTasks.Load(task.Id); ok {
	return fmt.Errorf("task %d is already running", task.Id)
}
```

Probably even less likely.
Other than those I can't see any possibilities from a code analysis perspective.
I've thrown together a concept for how to make FetchTask idempotent: https://codeberg.org/forgejo/forgejo/compare/forgejo...mfenniak:idempotent-FetchTask

- Runner would set `x-runner-request-key` to a unique UUID when making a `FetchTask` request.
- On `FetchTask()`, Server would associate `x-runner-request-key` with all the tasks returned from `FetchTask`.
- On a repeated `FetchTask()`, Server would check `x-runner-request-key` for any existing tasks (from a previous request w/ same UUID), and return those tasks.

I'd just like to have some evidence that this would actually address a problem. 🤔 I can see it theoretically -- the transaction committed on the server doesn't guarantee the response reached the runner. But some real data to support it would make it feel more justified.
Anyone have thoughts to make it simpler, or, to get data to support it?
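For illustration, here is a minimal sketch of the server-side bookkeeping the proposal above describes, using an in-memory map instead of Forgejo's actual storage layer; the `Task` type, `idempotentAssigner`, and `assignedByRequestKey` names are hypothetical and are not taken from the linked branch.

```go
package idempotency

import "sync"

// Task is a stand-in for a task Forgejo assigns to a runner; the real
// implementation would use Forgejo's ActionTask model and database.
type Task struct {
	ID int64
}

// idempotentAssigner remembers which tasks were handed out for a given
// x-runner-request-key, so a retried FetchTask carrying the same key gets
// the same tasks back instead of leaving the originals orphaned.
type idempotentAssigner struct {
	mu                   sync.Mutex
	assignedByRequestKey map[string][]Task
}

// FetchTasks returns the previously assigned tasks when the request key has
// been seen before; otherwise it assigns new tasks and records them under
// the key so that a later retry observes the same result.
func (a *idempotentAssigner) FetchTasks(requestKey string, assign func() []Task) []Task {
	a.mu.Lock()
	defer a.mu.Unlock()

	if tasks, ok := a.assignedByRequestKey[requestKey]; ok {
		// The runner retried, e.g. its previous response was lost in transit:
		// return the original assignment rather than creating a new one.
		return tasks
	}

	tasks := assign()
	if a.assignedByRequestKey == nil {
		a.assignedByRequestKey = make(map[string][]Task)
	}
	a.assignedByRequestKey[requestKey] = tasks
	return tasks
}
```

In the real proposal the association would presumably be persisted alongside the tasks in the database, so it survives restarts and works across multiple Forgejo processes.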
FWIW, in Zig, we've had probably 100+ jobs hit this issue over the past couple of weeks. It's not that it's a frequent occurrence; rather, when it happens, it just seems to affect every pending job. I want to say we've seen 3 or 4 clusters of failures.
If a tentative fix is deployed to Codeberg, let me know and I can update our runners to include the runner fix as well. Within a week or two, I think we should be able to say with some confidence whether the fix took.
I've dug through my personal runner's logs, and found a few log entries that could correlate with this problem, particularly with Codeberg connections. My runner isn't on a heavy-duty workload, but I thought this might give evidence for request failures of the right type, at least.
`read: connection reset by peer` would indicate the runner connected, sent a `FetchTask()` call, and received a TCP RST while reading the response; the server could have actually processed the request. I've got about 2 per day of these:

These errors, I'm not clear on what they mean. A few hundred of these occurred in one day in the past month:
Same here, I'm not clear on what these mean. Only about 1 every 3-5 days:
There are other errors that are clear "502" errors, or failure to establish connection, which I can cleanly bucket as "wouldn't have reached the server".
I'm still skeptical, but I don't hate the idea of doing the idempotent work. It would be maybe a day to polish and test the POC. But the remaining skepticism is that I don't see a cause of a regression here. In the right circumstances errors like this could be related to "Set up job" stuck jobs, but it seems like "Set up job" stuck problems have only been noted since (roughly) Forgejo v14 arrived on Codeberg. 🤔
Two observations without digging into the code:
I see this behavior on our corp forgejo very often on some jobs from a single workflow. It only seems to happen when the runner fetches more than one job at once.
Can we add a flag to temporarily disable fetching multiple jobs?
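A rough sketch of what such a flag might look like in the runner's fetch path; `RunnerOptions.SingleTaskFetch` is a hypothetical option invented here, not something the runner currently exposes.

```go
package runnercfg

// RunnerOptions is a simplified stand-in for the runner's config section;
// SingleTaskFetch is a hypothetical flag, not an existing runner setting.
type RunnerOptions struct {
	Capacity        int
	SingleTaskFetch bool
}

// taskCapacityForFetch computes how many tasks to advertise in a FetchTask
// request. With the hypothetical flag enabled the runner never asks for more
// than one task per poll, which is the workaround suggested above.
func taskCapacityForFetch(opts RunnerOptions, runningTasks int) int64 {
	available := opts.Capacity - runningTasks
	if available < 0 {
		available = 0
	}
	if opts.SingleTaskFetch && available > 1 {
		available = 1
	}
	return int64(available)
}
```

Until such a flag exists, setting a runner's capacity to 1 would have a similar effect, at the cost of serializing all jobs on that runner.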
Infra:
@aahlenst wrote in #1391 (comment):
If a task is assigned to a runner by Forgejo, and the runner never gets it, it ends up looking like this:
Until the zombie task timeout is hit, and it turns into this:
The cause is in this Forgejo code:
- `preStep` (the "Set up job" step) is a fake step injected by Forgejo.
- `firstStep` is the first step of the job; its `.Status` = `StatusFailure`.
- The status of the `preStep` that it is introducing is "success".

This is just a UI representation, as `FullSteps()` is invoked by the UI to create this representation. I can't think of what information Forgejo has to represent this any differently than this... I suppose that if no logs were ever received before the first step (`firstStep.LogIndex == 0`), Forgejo could apply a heuristic to represent it differently.

I think if we were going to tackle an improvement here, I'd be tempted to flag the task by the zombie kill process in some way, and then have the UI clearly represent "This was killed by the Zombie task cleanup." Something like that might be better than a heuristic on the data available.
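As a rough illustration of that heuristic (this is not the actual FullSteps() code; the `stepView` type and the "never started" label are made up for the sketch):

```go
package actionsui

// stepView is a hypothetical, simplified stand-in for the per-step data the
// UI renders; the real representation is built by Forgejo's FullSteps().
type stepView struct {
	Name     string
	Status   string
	LogIndex int64 // index of the first log line attributed to this step
}

// preStepStatus sketches the heuristic discussed above: if the job failed and
// the first real step never produced any logs, the job probably never started
// on a runner, so don't present the injected "Set up job" step as successful.
func preStepStatus(firstStep stepView, jobFailed bool) string {
	if jobFailed && firstStep.LogIndex == 0 {
		return "never started"
	}
	return "success"
}
```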
This is complicated... it's an interesting idea but I think the complexity might overwhelm the scope of the problem. The design questions I would have are:
Hm. Idempotent jobs have run into a technical blocker. On the plus side, it means I've done a good job of covering it with an automated test 🤣, and on the downside...
When an `ActionTask` is created, a token for the task is generated. Forgejo persists the hash of the token, and then returns the real token value to the runner (as part of the `FetchTask` result). This means that when a repeat service call is received, Forgejo can't return the task's token as part of the response because it doesn't have it. To address this, we'd need to do something like...

I think that regenerating the token is reasonable, but I don't love it. 🤔
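To make the constraint concrete, here's a minimal sketch of the store-only-the-hash pattern and the "regenerate on repeat" option; the SHA-256 hashing is illustrative, not necessarily the exact scheme Forgejo uses, and the function names are invented for the sketch.

```go
package tasktoken

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
)

// newTaskToken generates a fresh secret for a task. Only the hash would be
// persisted, so a later request cannot recover the original token from the
// database; the hashing scheme here is illustrative only.
func newTaskToken() (token, tokenHash string, err error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", "", err
	}
	token = hex.EncodeToString(buf)
	sum := sha256.Sum256([]byte(token))
	return token, hex.EncodeToString(sum[:]), nil
}

// reissueTaskToken models the "regenerate" option: on a repeated FetchTask,
// mint a new token, overwrite the stored hash, and return the new token to
// the runner. Any previously issued token for the task stops validating.
func reissueTaskToken(storeHash func(hash string) error) (string, error) {
	token, hash, err := newTaskToken()
	if err != nil {
		return "", err
	}
	if err := storeHash(hash); err != nil {
		return "", err
	}
	return token, nil
}
```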
@mfenniak wrote in #1391 (comment):
This also would have the side-effect that, if an earlier job for some reason was still running, it would likely be stopped because the token is no longer valid?
@mfenniak wrote in #1391 (comment):
Why a hash, not the token itself? In what regard is the token sensitive?
(I assume that we're talking about `runner_request_key` that stores the UUID in the PR you shared above.)

I agree. One option could be to mark jobs as `assigned` (a new state) until the first log messages arrive. As soon as the first log message has arrived, mark the job as `running`. I'm happy to discuss that further, but I don't want to derail this issue.
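For illustration, a tiny sketch of that proposed state split; these status values are hypothetical and don't match Forgejo's actual status enum.

```go
package jobstate

// Status models the proposed split: a job only counts as running once the
// runner has actually reported output back. Hypothetical values for the sketch.
type Status int

const (
	StatusWaiting  Status = iota // queued, not yet handed to a runner
	StatusAssigned               // returned by FetchTask, no logs received yet (the proposed new state)
	StatusRunning                // first log line has arrived from the runner
)

// onFirstLogLine advances an assigned job to running once output arrives.
func onFirstLogLine(s Status) Status {
	if s == StatusAssigned {
		return StatusRunning
	}
	return s
}
```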
@Gusted wrote in #1391 (comment):
The token being regenerated wouldn't have any immediate effect like this -- it would affect git access to a private repo, or API access that uses the token, causing those actions to result in errors. Since those actions often happen at the beginning of an action and then don't occur again, invalidating the token likely wouldn't affect simultaneous invocations.
If there's any way "an earlier job for some reason was still running" then the idempotent system would be a mess, since multiple executions would be reporting state & logs into the same job simultaneously. I don't think there's a way that the proposed system design allows this... and the runner does have an internal safeguard to prevent two jobs of the same identity running in parallel. So I think this risk is well guarded against.
@aahlenst wrote in #1391 (comment):
This is in reference to the existing `FORGEJO_TOKEN` secret (not `runner_request_key`), which allows access to private repositories. It's stored in the DB with a one-way hash so that it can't be recovered by Forgejo, but can be validated on access.

Idempotent changes are merged:
For next steps to try to validate this as a fix... 🤔
@alexrp as you've observed this problem in Zig builds, are the runners that are supposed to receive jobs configured with `capacity > 1`? Just trying to identify if @viceice's idea to "add a flag to temporarily disable fetching multiple jobs" might be counter-indicated by Zig's configuration, or whether it's worth exploring.

@Gusted Since every indication we've seen of this bug has occurred on Codeberg, would you mind evaluating https://codeberg.org/forgejo/forgejo/pulls/11401 for a port to Codeberg?
Capacity on our runners ranges from 1 to 8 depending on machine specs. The only ones that have a capacity of 1 are the `riscv64-linux` boards, though. Here's an instance of this bug occurring on one of those: https://codeberg.org/ziglang/zig/actions/runs/1799/jobs/7/attempt/1

A few more:
Thanks @alexrp. Considering that, I think it's unlikely that the root cause of this problem is the multiple task fetch. Theoretically, something that changed during that development could still be the cause, but if we just disabled multiple fetch with the existing codebase it would behave the same as this `riscv64-linux` runner, and would not likely fix anything. 👍

@mfenniak thanks for your pull request! I've backported it at codeberg.org/Codeberg-Infrastructure/forgejo@0ae17a4856 and will try to deploy it soon-ish.

Neat! I'll tag a v12.7.1 of the runner with the client-side of this, then.
@Gusted wrote in #1391 (comment):
Deployed.
I've deployed runner v12.7.1 to our machines; will be restarting them as they become idle.
All updated and restarted, so now we wait.
I haven't noticed any jobs on forgejo projects running into this issue, except for one private project which is still running 12.7.0 (unpatched).
Note that with the DoS on Codeberg recently, I had to restart a lot of failed jobs. It was kind of a mess and I just didn't have the time to pick through them all to see if any of the failures were caused by this issue. Outside of that, I haven't seen this issue, but I would say it's still worth waiting another week before declaring this fixed.
I'm happy to consider this fixed at this point; haven't seen it since.
Great, thanks for monitoring and confirming. The only instance that the Forgejo contributors have observed is on code.forgejo.org which doesn't have the server-side patch, so it looks fixed to me too. 👍