bug: TERM on the daemon does not always wait for the jobs to complete #944

Closed
opened 2025-09-06 16:19:12 +00:00 by earl-warren · 2 comments
Contributor

It is yet unclear how to reproduce it. This is to collect evidence.

It is yet unclear how to reproduce it. This is to collect evidence.
Author
Contributor

A workflow was in progress. One job was starting and got interrupted instead of being waited on.

https://code.forgejo.org/xorm/xorm/actions/runs/22/jobs/3

Server side

time="2025-09-06T16:09:01Z" level=info msg="task 183304 repo is xorm/xorm https://code.forgejo.org https://code.forgejo.org"
time="2025-09-06T16:09:30Z" level=info msg="Cleaning up services for job test mysql"
time="2025-09-06T16:09:30Z" level=info msg="Cleaning up network for job test mysql, and network name is: WORKFLOW-42e239ed6f1451289682cbbc9acd1c60"
time="2025-09-06T16:09:31Z" level=info msg="task 183305 repo is xorm/xorm https://code.forgejo.org https://code.forgejo.org"
time="2025-09-06T16:09:46Z" level=info msg="runner: hetzner01-44 shutdown initiated, waiting [runner].shutdown_timeout=3h0m0s for running jobs to complete before shutting down"
time="2025-09-06T16:09:46Z" level=info msg="[poller 0] shutdown"
Terminated

Job logs

2025-09-06T16:09:31.5306090Z hetzner01-44(version:v10.0.1) received task 183305 of job test-mysql8, be triggered by event: pull_request
2025-09-06T16:09:31.5318907Z workflow prepared
2025-09-06T16:09:31.5319424Z 'runs-on' key not defined in test/unit-test
2025-09-06T16:09:31.5320544Z 🚀  Start image=code.forgejo.org/oci/ci:1
2025-09-06T16:09:31.5361461Z The network alias is test_mysql8 (sanitized version of test mysql8)
2025-09-06T16:09:31.5361721Z   🐳  docker pull image=mysql:8.0 platform= username= forcePull=false
2025-09-06T16:09:45.2861116Z   🐳  docker pull image=code.forgejo.org/oci/ci:1 platform= username= forcePull=false
2025-09-06T16:09:45.2883931Z Cleaning up services for job test mysql8
2025-09-06T16:09:45.2888288Z Cleaning up network for job test mysql8, and network name is: WORKFLOW-d3311aebc916fddb1978f58bc9eabbdf
2025-09-06T16:09:45.5517677Z   🐳  docker pull image=mysql:8.0 platform= username= forcePull=false
2025-09-06T16:09:45.5529926Z   🐳  docker create image=mysql:8.0 platform= entrypoint=[] cmd=[] network="WORKFLOW-d3311aebc916fddb1978f58bc9eabbdf"
2025-09-06T16:09:45.5671721Z   🐳  docker run image=mysql:8.0 platform= entrypoint=[] cmd=[] network="WORKFLOW-d3311aebc916fddb1978f58bc9eabbdf"
A workflow was in progress. One job was starting and got interrupted instead of being waited on. https://code.forgejo.org/xorm/xorm/actions/runs/22/jobs/3 Server side ``` time="2025-09-06T16:09:01Z" level=info msg="task 183304 repo is xorm/xorm https://code.forgejo.org https://code.forgejo.org" time="2025-09-06T16:09:30Z" level=info msg="Cleaning up services for job test mysql" time="2025-09-06T16:09:30Z" level=info msg="Cleaning up network for job test mysql, and network name is: WORKFLOW-42e239ed6f1451289682cbbc9acd1c60" time="2025-09-06T16:09:31Z" level=info msg="task 183305 repo is xorm/xorm https://code.forgejo.org https://code.forgejo.org" time="2025-09-06T16:09:46Z" level=info msg="runner: hetzner01-44 shutdown initiated, waiting [runner].shutdown_timeout=3h0m0s for running jobs to complete before shutting down" time="2025-09-06T16:09:46Z" level=info msg="[poller 0] shutdown" Terminated ``` Job logs ``` 2025-09-06T16:09:31.5306090Z hetzner01-44(version:v10.0.1) received task 183305 of job test-mysql8, be triggered by event: pull_request 2025-09-06T16:09:31.5318907Z workflow prepared 2025-09-06T16:09:31.5319424Z 'runs-on' key not defined in test/unit-test 2025-09-06T16:09:31.5320544Z 🚀 Start image=code.forgejo.org/oci/ci:1 2025-09-06T16:09:31.5361461Z The network alias is test_mysql8 (sanitized version of test mysql8) 2025-09-06T16:09:31.5361721Z 🐳 docker pull image=mysql:8.0 platform= username= forcePull=false 2025-09-06T16:09:45.2861116Z 🐳 docker pull image=code.forgejo.org/oci/ci:1 platform= username= forcePull=false 2025-09-06T16:09:45.2883931Z Cleaning up services for job test mysql8 2025-09-06T16:09:45.2888288Z Cleaning up network for job test mysql8, and network name is: WORKFLOW-d3311aebc916fddb1978f58bc9eabbdf 2025-09-06T16:09:45.5517677Z 🐳 docker pull image=mysql:8.0 platform= username= forcePull=false 2025-09-06T16:09:45.5529926Z 🐳 docker create image=mysql:8.0 platform= entrypoint=[] cmd=[] network="WORKFLOW-d3311aebc916fddb1978f58bc9eabbdf" 2025-09-06T16:09:45.5671721Z 🐳 docker run image=mysql:8.0 platform= entrypoint=[] cmd=[] network="WORKFLOW-d3311aebc916fddb1978f58bc9eabbdf" ```
Author
Contributor

Regarding shutdown

#202 is when it was introduced
#870 is the pull request that changed how contexts are built

I think the issue is that the daemon context is canceled by the signal and now it propagates to the jobs. Before that the jobs were ignorant of the daemon context.

Lines 14 to 19 in 26fda6e
func main() {
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()
// run the command
cmd.Execute(ctx)
}

Regarding shutdown https://code.forgejo.org/forgejo/runner/pulls/202 is when it was introduced https://code.forgejo.org/forgejo/runner/pulls/870 is the pull request that changed how contexts are built I think the issue is that the daemon context is canceled by the signal and now it propagates to the jobs. Before that the jobs were ignorant of the daemon context. https://code.forgejo.org/forgejo/runner/src/commit/26fda6e0843a7c723b33b64ba1be404bb92d9cab/main.go#L14-L19
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
forgejo/runner#944
No description provided.