bug: job timeouts issued from the Forgejo instance are not reliable #980
Can you reproduce the bug on the Forgejo test instance?
No
Description
We have observed that when a job is timed out by the Forgejo instance (in this case Codeberg) due to its global job timeout, the job is marked as a failure rather than as a timeout. Worse, the job actually keeps running on the machine despite this (potentially forever, due to #979).
Some context here: https://codeberg.org/actions/meta/issues/46
Relevant run: https://codeberg.org/ziglang/zig/actions/runs/5
Forgejo Version
Codeberg current (v12.x?)
Runner Version
v11.0.0
How are you running Forgejo?
Codeberg
How are you running the Runner?
Built from source with `make build`, but otherwise following the installation guide.
Logs
Debug logging was not enabled on the run linked above; I've kicked off another run with debug logging and will attach logs once it completes.
Workflow file
codeberg.org/ziglang/zig@a16afaf76f/.forgejo/workflows/ci.yaml

When a step of a job runs, it uploads logs to the Forgejo instance:
```go
resp, err := r.client.UpdateLog(r.ctx, connect.NewRequest(&runnerv1.UpdateLogRequest{
	TaskId: r.state.Id,
	Index:  int64(r.logOffset),
	Rows:   rows,
	NoMore: noMore,
}))
if err != nil {
	return err
}
ack := int(resp.Msg.GetAckIndex())
if ack < r.logOffset {
	return fmt.Errorf("submitted logs are lost %d < %d", ack, r.logOffset)
}
```

and the Forgejo server answers with
```proto
message UpdateLogResponse {
	int64 ack_index = 1; // If all lines are received, should be index + length(lines).
}
```

which has no indication that the job was stopped by Forgejo.
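One way to surface the stop during a running step would be for the log acknowledgement itself to carry the task's state. The field below is purely hypothetical, a sketch of the gap rather than anything in the current protocol:

```proto
message UpdateLogResponse {
	int64 ack_index = 1; // If all lines are received, should be index + length(lines).
	// HYPOTHETICAL field, not in the current protocol: would let the runner
	// learn mid-step that the server has stopped or cancelled the task,
	// instead of only finding out when the step completes.
	bool stopped = 2;
}
```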
When the runner completes the step, it uploads the state to the Forgejo instance and is only then notified if the job was cancelled:
```go
if resp.Msg.GetState().GetResult() == runnerv1.Result_RESULT_CANCELLED {
	r.cancel()
}
```

On the Forgejo side, a task runs every 30 minutes looking for jobs that exceed the timeout defined in app.ini by ENDLESS_TASK_TIMEOUT:
```go
func registerStopEndlessTasks() {
	RegisterTaskFatal("stop_endless_tasks", &BaseConfig{
		Enabled:    true,
		RunAtStart: true,
		Schedule:   "@every 30m",
	}, func(ctx context.Context, _ *user_model.User, cfg Config) error {
		return actions_service.StopEndlessTasks(ctx)
	})
}
```