bug: jobs stuck at "Set up job" #1391

Closed
opened 2026-02-20 02:17:19 +00:00 by mfenniak · 24 comments
Owner
Examples:

  • https://codeberg.org/forgejo/forgejo/actions/runs/139771/jobs/6/attempt/1
  • https://codeberg.org/ziglang/zig/actions/runs/1814/jobs/0/attempt/1
  • https://codeberg.org/forgejo/forgejo/actions/runs/140275/jobs/8/attempt/1 (and other jobs in this run)

This looks the same, symptom-wise, as https://code.forgejo.org/forgejo/runner/issues/1302, which was believed fixed by https://code.forgejo.org/forgejo/runner/pulls/1303 & https://codeberg.org/forgejo/forgejo/pulls/10899 (either one).
Author
Owner

(notes from Forgejo Development chat):

Symptom-wise it's similar to #1302, which would lead me to think that the technical events occurring are: the runner has initiated a FetchTask, Forgejo has assigned that task to the runner, and then (for some as-yet-unknown reason) the runner never initiated any work on the task.

Theoretically, a race condition with a fetch timeout could cause this: if FetchTask has completed successfully server-side, assigning tasks to a runner, but the client side has timed out (https://code.forgejo.org/forgejo/runner/src/commit/6936bb100ddb06b7974a31af42b2b735a29ba876/internal/app/poll/poller.go#L186-L193):

reqCtx, cancel := context.WithTimeout(ctx, p.cfg.Runner.FetchTimeout)
defer cancel()
v := tasksVersion.Load()
resp, err := client.FetchTask(reqCtx, connect.NewRequest(&runnerv1.FetchTaskRequest{
	TasksVersion: v,
	TaskCapacity: &availableCapacity,
}))

Doesn't seem excessively likely, but it's an interesting possibility.

@Gusted: What if Codeberg is down for a few minutes? We certainly had a few of those in the last few days again due to OOMs in the reverse proxy

Broadly, if FetchTask requests were received, committed on Forgejo, and the responses never reached a runner -- that's plausible with a proxy problem of some kind. Making the FetchTask call idempotent would probably be a reasonable change to address a few of these edge cases.

--

A little vaguer -- corruption of the in-memory task map in the runner could cause an error that occurs without any job reporting back to the server (https://code.forgejo.org/forgejo/runner/src/commit/6936bb100ddb06b7974a31af42b2b735a29ba876/internal/app/run/runner.go#L154-L156):

if _, ok := r.runningTasks.Load(task.Id); ok {
	return fmt.Errorf("task %d is already running", task.Id)
}

Probably even less likely.

Other than those I can't see any possibilities from a code analysis perspective.

Author
Owner

I've thrown together a concept for how to make FetchTask idempotent: https://codeberg.org/forgejo/forgejo/compare/forgejo...mfenniak:idempotent-FetchTask

  • Runner would set an HTTP header x-runner-request-key to a unique UUID when making a FetchTask request.
  • Runner would retain the same UUID until it gets a successful response.
  • Runner should have no codepath between "successfully got a response and won't re-request the task" and "guaranteed to provide some task status back to Forgejo". Of course, hard interrupts are unavoidable possibilities (server crash).
  • During FetchTask(), Server would associate x-runner-request-key with all the tasks returned from FetchTask.
  • Before performing FetchTask(), Server would check x-runner-request-key for any existing tasks (from a previous request w/ the same UUID) and return those tasks.

I'd just like to have some evidence that this would actually address a problem. 🤔 I can see it theoretically -- the transaction committed on the server doesn't guarantee the response reached the runner. But some real data to support it would make it feel more justified.

Anyone have thoughts to make it simpler, or, to get data to support it?
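As a concrete illustration, here's a minimal sketch of the runner-side half of this design, assuming the header name above; fetchWithIdempotencyKey, fetchFn, and the retry interval are hypothetical names and values, not the actual runner code:

// Hypothetical sketch: the request key is generated once and reused across
// retries until a FetchTask attempt succeeds; only the next poll gets a
// fresh key. Not the actual runner implementation.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/google/uuid"
)

const requestKeyHeader = "x-runner-request-key"

// fetchFn stands in for the real Connect FetchTask RPC.
type fetchFn func(ctx context.Context, headers map[string]string) error

// fetchWithIdempotencyKey retries with the same UUID until an attempt
// succeeds. If a response was committed server-side but lost in transit,
// the retry carries the same key and the server can replay the same tasks.
func fetchWithIdempotencyKey(ctx context.Context, call fetchFn) error {
	key := uuid.NewString() // stable across retries of this fetch
	for {
		if err := call(ctx, map[string]string{requestKeyHeader: key}); err == nil {
			return nil // success: this key is never reused
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second): // back off, then retry
		}
	}
}

func main() {
	attempts := 0
	_ = fetchWithIdempotencyKey(context.Background(), func(_ context.Context, h map[string]string) error {
		attempts++
		fmt.Printf("attempt %d, %s=%s\n", attempts, requestKeyHeader, h[requestKeyHeader])
		if attempts < 2 {
			return fmt.Errorf("simulated connection reset") // first try "fails"
		}
		return nil
	})
}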

Contributor

FWIW, in Zig, we've had probably 100+ jobs hit this issue over the past couple of weeks. It's not that it's a frequent occurrence; rather, when it happens, it just seems to affect every pending job. I want to say we've seen 3 or 4 clusters of failures.

If a tentative fix is deployed to Codeberg, let me know and I can update our runners to include the runner fix as well. Within a week or two, I think we should be able to say with some confidence whether the fix took.

Author
Owner

I've dug through my personal runner's logs, and found a few log entries that could correlate with this problem, particularly with Codeberg connections. My runner isn't on a heavy-duty workload, but I thought this might give evidence for request failures of the right type, at least.

A read: connection reset by peer error would indicate the runner connected, sent a FetchTask() call, and received a TCP RST while reading the response; the server could have actually processed the request. I've got about 2 of these per day:

Feb 16 15:35:58 anxi forgejo-act-runner-codeberg[1419711]: time="2026-02-16T22:35:58Z" level=error msg="failed to fetch task" error="unavailable: read tcp 10.88.0.59:38744->217.197.84.140:443: read: connection reset by peer"
Feb 16 17:37:28 anxi forgejo-act-runner-codeberg[1419711]: time="2026-02-17T00:37:28Z" level=error msg="failed to fetch task" error="unavailable: read tcp 10.88.0.59:33568->217.197.84.140:443: read: connection reset by peer"
Feb 17 00:10:58 anxi forgejo-act-runner-codeberg[1419711]: time="2026-02-17T07:10:58Z" level=error msg="failed to fetch task" error="unavailable: read tcp 10.88.0.59:56340->217.197.84.140:443: read: connection reset by peer"

I'm not clear on what these errors mean. A few hundred of them occurred in one day in the past month:

Feb 05 16:49:59 anxi forgejo-act-runner-codeberg[17195]: time="2026-02-05T23:49:59Z" level=error msg="failed to fetch task" error="internal: stream error: stream ID 1; INTERNAL_ERROR; received from peer"
Feb 05 16:50:29 anxi forgejo-act-runner-codeberg[17195]: time="2026-02-05T23:50:29Z" level=error msg="failed to fetch task" error="internal: stream error: stream ID 3; INTERNAL_ERROR; received from peer"
Feb 05 16:50:59 anxi forgejo-act-runner-codeberg[17195]: time="2026-02-05T23:50:59Z" level=error msg="failed to fetch task" error="internal: stream error: stream ID 5; INTERNAL_ERROR; received from peer"

Same here, I'm unclear on what these mean. Only about one every 3-5 days:

Feb 05 18:53:58 anxi forgejo-act-runner-codeberg[17195]: time="2026-02-06T01:53:58Z" level=error msg="failed to fetch task" error="unavailable: unexpected EOF"

There are other errors that are clear "502" errors, or failure to establish connection, which I can cleanly bucket as "wouldn't have reached the server".

I'm still skeptical, but I don't hate the idea of doing the idempotent work. It would take maybe a day to polish and test the POC. But my remaining skepticism is that I don't see the cause of a regression here. In the right circumstances, errors like these could be related to stuck "Set up job" jobs, but it seems like the stuck "Set up job" problems have only been noted since (roughly) Forgejo v14 arrived on Codeberg. 🤔

Member

Two observations without digging into the code:

  • I find it very odd that "Set up job" is green. If the problem is indeed that the runner fetches a task but never receives it, then "Set up job" should not be green. If there's a situation that can lead to Forgejo marking it green on its own, this should ideally be fixed first because it would help analyzing this problem.
  • When I attach a debugger to Forgejo while Forgejo Runner fetches a task, Forgejo Runner times out. It never tells Forgejo about it and Forgejo doesn't recognize it on its own. So the job is stuck in "Set up job". It doesn't look like this problem, but it is an indication that Forgejo needs some logic to recognize and handle those situations. For example, if Forgejo Runner does not confirm (with retry built-in) that it has received the job, Forgejo might put it back into the queue. That would make the process more resilient to communication problems on both sides.
Owner

I see this behavior on our corp forgejo very often on some jobs from a single workflow. It only seems to happen when the runner fetches more than one job at once.

Can we add a flag to temporarily disable fetching multiple jobs?

Infra:

  • forgejo in k8s cluster behind traefik ingress
  • runner in other k8s cluster
  • workflow with matrix (~7 jobs)
Author
Owner

@aahlenst wrote in #1391 (comment):

I find it very odd that "Set up job" is green. If the problem is indeed that the runner fetches a task but never receives it, then "Set up job" should not be green. If there's a situation that can lead to Forgejo marking it green on its own, this should ideally be fixed first because it would help analyzing this problem.

If a task is assigned to a runner by Forgejo, and the runner never gets it, it ends up looking like this:

(screenshot attachment: 20260220_105742)

Until the zombie task timeout is hit, and it turns into this:

(screenshot attachment)

The cause is in this Forgejo code (https://codeberg.org/forgejo/forgejo/src/commit/87a06633ea275ab6027acde5c7d9a79fef7d1b70/modules/actions/task_state.go#L38-L62):

  • preStep (the "Set up job" step) is a fake step injected by Forgejo.
  • firstStep is the first step of the job.
  • Because the job was killed by the zombie task timeout, the first step has .Status = StatusFailure.
  • Forgejo assumes that because the first real step of the job has a status, the synthetic preStep it is introducing is "success".
  • There's no code path or logic for it to ever be anything other than waiting or success.

This is just a UI representation, as FullSteps() is invoked by the UI to create it. I can't think of what information Forgejo has that would let it represent this any differently... I suppose that if no logs were ever received before the first step (firstStep.LogIndex == 0), Forgejo could apply a heuristic and represent it differently.
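To make those bullets concrete, here's a paraphrased sketch of the derivation with simplified types; it mirrors the logic described above, not the verbatim task_state.go code:

// Paraphrased sketch of the "Set up job" status derivation; simplified
// types, not the verbatim Forgejo task_state.go.
package main

import "fmt"

type Status int

const (
	StatusWaiting Status = iota
	StatusRunning
	StatusSuccess
	StatusFailure
)

func (s Status) HasRun() bool { return s == StatusSuccess || s == StatusFailure }

type Step struct {
	Status   Status
	LogIndex int64
}

// preStepStatus: if the first real step has any status (running, success,
// or failure), the synthetic preStep is assumed to have succeeded. There
// is no branch that ever yields a failed preStep.
func preStepStatus(firstStep Step) Status {
	if firstStep.Status.HasRun() || firstStep.Status == StatusRunning {
		return StatusSuccess
	}
	return StatusWaiting // the only other possible outcome
}

func main() {
	// Zombie-killed job: the first step was marked failed, so the
	// synthetic "Set up job" step is reported green.
	fmt.Println(preStepStatus(Step{Status: StatusFailure}) == StatusSuccess) // true
}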

I think if we were going to tackle an improvement here, I'd be tempted to flag the task by the zombie kill process in some way, and then have the UI clearly represent "This was killed by the Zombie task cleanup." Something like that might be better than a heuristic on the data available.

... For example, if Forgejo Runner does not confirm (with retry built-in) that it has received the job, Forgejo might put it back into the queue. That would make the process more resilient to communication problems on both sides.

This is complicated... it's an interesting idea but I think the complexity might overwhelm the scope of the problem. The design questions I would have are:

  • Time: How quickly does the runner need to "confirm" (with a second API call) back to Forgejo?
  • State management: Forgejo doesn't have a queue of available tasks; this is stored in the database. Items are "removed" from the queue by assigning them to runners; if they aren't assigned to a runner, they're available to another runner. Implementing logic like this would require provisionally assigning them to a runner (so that they're not assigned to another runner), and then being able to either undo that assignment, or invalidate it in the DB query (eg. by a time-check on an assigned field). A rough sketch of what that could look like follows this list.
  • Backwards compatibility: Existing runners in the wild won't provide any confirmation. If this were implemented as part of the existing UpdateTask/UpdateLog API calls, the runner would have already started the task -- if its reporting API calls don't reach Forgejo within the necessary time, it's already doing the actual work, and then how would Forgejo handle eventually getting task updates from multiple runners? So I think it can't be designed that way; it needs a separate confirmation API call. Then we'd need to update the protobuf API to opt in to a confirmation-based workflow and make both sides smart enough to work with each other.
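For the state-management bullet, here's a rough sketch of a provisional claim with a time-checked assigned field. The action_task table, its columns, and the Postgres-flavored SQL are hypothetical, not Forgejo's actual schema:

// Hypothetical sketch of the provisional-claim idea; illustrative schema
// and SQL only, not Forgejo's implementation.
package taskqueue

import (
	"database/sql"
	"time"
)

// ClaimTask provisionally assigns one queued task to a runner. A claim that
// is never confirmed becomes stale after confirmWindow elapses, and the
// task is offered to other runners again.
func ClaimTask(db *sql.DB, runnerID int64, confirmWindow time.Duration) (int64, error) {
	now := time.Now().Unix()
	cutoff := time.Now().Add(-confirmWindow).Unix()
	var taskID int64
	err := db.QueryRow(`
		UPDATE action_task
		   SET runner_id = $1, claimed_at = $2, confirmed = FALSE
		 WHERE id = (
		       SELECT id FROM action_task
		        WHERE runner_id = 0                           -- never claimed
		           OR (confirmed = FALSE AND claimed_at < $3) -- stale claim
		        ORDER BY id
		        LIMIT 1
		          FOR UPDATE SKIP LOCKED)
		 RETURNING id`,
		runnerID, now, cutoff).Scan(&taskID)
	return taskID, err // sql.ErrNoRows means nothing to claim
}

// ConfirmTask is the separate confirmation API call discussed above: it
// marks the claim as confirmed so the task is no longer re-assignable.
func ConfirmTask(db *sql.DB, taskID, runnerID int64) error {
	_, err := db.Exec(
		`UPDATE action_task SET confirmed = TRUE WHERE id = $1 AND runner_id = $2`,
		taskID, runnerID)
	return err
}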
Author
Owner

Hm. Idempotent jobs have run into a technical blocker. On the plus side, it means I've done a good job of covering it with an automated test 🤣, and on the downside...

When an ActionTask is created, a token for the task is generated. Forgejo persists the hash for the token, and then returns the real token value to the runner (as part of the FetchTask result). This means that when a repeated service call is received, Forgejo can't return the task's token as part of the response, because it doesn't have it. To address this, we'd need to do something like...

  • Regenerate the ActionTask token, invalidating the previous one, on a repeated call. I don't hate this because the API call is still "logically idempotent" in that the return value has the same meaning, capabilities, and outcomes, but, it has the weird effect that the first service call's response is no longer usable. I can't think of a practical reason that's a problem... if you needed to repeat the call, then the first response won't be used. But it feels weird.
  • Change the way that the ActionTask's token is stored so that it can be reconstructed. 👎
  • Use a mechanism for idempotency that isn't rebuilding the response from the database -- for example, using cache storage. One downside is storing secrets in the cache. Another downside is starting to turn the cache from a cache into "state" storage, where its data needs to be semi-persistent for a short period in order to provide functional outcomes.

I think that regenerating the token is reasonable, but don't love it. 🤔
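For reference, a minimal sketch of the store-hash, return-token-once pattern described above; names are assumed, and this is not Forgejo's actual implementation:

// Minimal sketch: the server stores only a one-way hash, hands the
// cleartext token out exactly once, and can later verify but never
// reconstruct it. Not Forgejo's actual implementation.
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

type taskRecord struct {
	TokenHash string // all the server keeps; the cleartext token is gone
}

// newTaskToken generates a token, returning the cleartext (sent to the
// runner once) and the record with the hash (persisted with the task).
func newTaskToken() (token string, rec taskRecord, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", taskRecord{}, err
	}
	token = hex.EncodeToString(buf)
	sum := sha256.Sum256([]byte(token))
	return token, taskRecord{TokenHash: hex.EncodeToString(sum[:])}, nil
}

// verify checks a presented token against the stored hash.
func (r taskRecord) verify(token string) bool {
	sum := sha256.Sum256([]byte(token))
	return subtle.ConstantTimeCompare([]byte(hex.EncodeToString(sum[:])), []byte(r.TokenHash)) == 1
}

func main() {
	token, rec, _ := newTaskToken()
	fmt.Println(rec.verify(token))      // true: validation works
	fmt.Println(rec.TokenHash == token) // false: the token is unrecoverable
	// A repeated FetchTask can only see rec.TokenHash, so the original
	// token cannot be re-sent; it would have to be regenerated.
}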

Owner

@mfenniak wrote in #1391 (comment):

Regenerate the ActionTask token, invalidating the previous one, on a repeated call. I don't hate this because the API call is still "logically idempotent" in that the return value has the same meaning, capabilities, and outcomes, but, it has the weird effect that the first service call's response is no longer usable. I can't think of a practical reason that's a problem... if you needed to repeat the call, then the first response won't be used. But it feels weird.

This would also have the side-effect that, if an earlier job for some reason was still running, it would likely be stopped because its token is no longer valid?

Member

@mfenniak wrote in #1391 (comment):

Forgejo persists the hash for the token, ...

Why a hash, not the token itself? In what regard is the token sensitive?

(I assume that we are talking about runner_request_key, which stores the UUID in the PR you shared above.)

I think if we were going to tackle an improvement here, I'd be tempted to flag the task by the zombie kill process in some way, and then have the UI clearly represent "This was killed by the Zombie task cleanup." Something like that might be better than a heuristic on the data available.

I agree. One option could be to mark jobs as assigned (new state) until the first log messages arrive. As soon as the first log message has arrived, mark the job as running.

This is complicated... it's an interesting idea but I think the complexity might overwhelm the scope of the problem. The design questions I would have are:

I'm happy to discuss that further, but I don't want to derail this issue.

Author
Owner

@Gusted wrote in #1391 (comment):

This also would have the side-effect that, if a earlier job for some reason was still running it would likely be stopped because the token is no longer valid?

The token being regenerated wouldn't have any immediate effect like this -- it would affect git access to a private repo, or API access that uses the token, causing those actions to result in errors. Since those actions often happen at the beginning of an action and then don't occur again, invalidating the token likely wouldn't affect simultaneous invocations.

If there's any way "an earlier job for some reason was still running" then the idempotent system would be a mess, since multiple executions would be reporting state & logs into the same job simultaneously. I don't think there's a way that the proposed system design allows this... and the runner does have an internal safeguard to prevent two jobs of the same identity running in parallel. So I think this risk is well guarded against.

@aahlenst wrote in #1391 (comment):

Why a hash, not the token itself? In what regard is the token sensitive?

(I assume that we talking about runner_request_key that stores the UUID in the PR you shared above.)

This is in reference to the existing FORGEJO_TOKEN secret (which allows access to private repositories), not runner_request_key. It's stored in the DB with a one-way hash so that it can't be recovered by Forgejo, but can be validated on access.

Author
Owner

Idempotent changes are merged:

  • https://code.forgejo.org/forgejo/runner/pulls/1393
  • https://codeberg.org/forgejo/forgejo/pulls/11401

For next steps to try to validate this as a fix... 🤔

@alexrp as you've observed this problem in Zig builds, are the runners that are supposed to receive jobs configured with capacity > 1? Just trying to identify if @viceice's idea to "add a flag to temporary disable fetching multiple jobs" might be counter-indicated by Zig's configuration, or whether it's worth exploring.

@Gusted Since every indication we've seen of this bug has occurred on Codeberg, would you mind evaluating https://codeberg.org/forgejo/forgejo/pulls/11401 for a port to Codeberg?

Contributor

Capacity on our runners ranges from 1 to 8 depending on machine specs. The only ones that have a capacity of 1 are the riscv64-linux boards though. Here's an instance of this bug occurring on one of those: https://codeberg.org/ziglang/zig/actions/runs/1799/jobs/7/attempt/1

Contributor

A few more:

  • https://codeberg.org/ziglang/zig/actions/runs/1798/jobs/6/attempt/1
  • https://codeberg.org/ziglang/zig/actions/runs/1798/jobs/7/attempt/1
  • https://codeberg.org/ziglang/zig/actions/runs/1799/jobs/7/attempt/1
Author
Owner

Thanks @alexrp. Considering that, I think it's unlikely that the root cause of this problem is the multiple-task fetch. Theoretically, something that changed during that development could still be the cause, but if we just disabled multiple fetch with the existing codebase, it would behave the same as this riscv64-linux runner and likely not fix anything. 👍

Owner

@mfenniak thanks for your pull request! I've backported it at codeberg.org/Codeberg-Infrastructure/forgejo@0ae17a4856 and will try to deploy it soon-ish.

Author
Owner

Neat! I'll tag a v12.7.1 of the runner with the client-side of this, then.

Owner

@Gusted wrote in #1391 (comment):

will try to deploy it soon-ish.

Deployed.

Contributor

I've deployed runner v12.7.1 to our machines; will be restarting them as they become idle.

Contributor

All updated and restarted, so now we wait.

Author
Owner

I haven't noticed any jobs on forgejo projects running into this issue, except for one private project which is still running 12.7.0 (unpatched).

Contributor

Note that with the DoS on Codeberg recently, I had to restart a lot of failed jobs. It was kind of a mess and I just didn't have the time to pick through them all to see if any of the failures were caused by this issue. Outside of that, I haven't seen this issue, but I would say it's still worth waiting another week before declaring this fixed.

Contributor

I'm happy to consider this fixed at this point; haven't seen it since.

Author
Owner

Great, thanks for monitoring and confirming. The only instance that the Forgejo contributors have observed is on code.forgejo.org which doesn't have the server-side patch, so it looks fixed to me too. 👍
