bug: unique random name for networks not working when network is set in config #912
forgejo/runner#912
Can you reproduce the bug on the Forgejo test instance?
No
Description
When the network option is set in the container section of config.yml, the runner is unable to create or connect to a random network for a service (#850).
If I understand correctly, the network option for containers specifies which network the runner and the containers it creates should use. If a new network is created for a service, it is, from my point of view, correct that the workflow can't connect, because connecting would break the network settings from the config.
Forgejo Version
12.0.1
Runner Version
9.1.1
How are you running Forgejo?
I'm running Forgejo with docker-compose and the official Docker image.
How are you running the Runner?
I'm using the official Docker image.
Logs
runner(version:v9.1.1) received task 324490 of job simple-no-container, be triggered by event: push
workflow prepared
🚀 Start image=alpine:3.18
🐳 docker pull image=code.forgejo.org/oci/postgres:15 platform= username= forcePull=false
🐳 docker pull image=alpine:3.18 platform= username= forcePull=false
Cleaning up services for job simple-no-container
Cleaning up network for job simple-no-container, and network name is: WORKFLOW-71d562108ef73a30612c686458001689
🐳 docker pull image=code.forgejo.org/oci/postgres:15 platform= username= forcePull=false
🐳 docker create image=code.forgejo.org/oci/postgres:15 platform= entrypoint=[] cmd=[] network="WORKFLOW-71d562108ef73a30612c686458001689"
🐳 docker run image=code.forgejo.org/oci/postgres:15 platform= entrypoint=[] cmd=[] network="WORKFLOW-71d562108ef73a30612c686458001689"
failed to start container: Error response from daemon: network WORKFLOW-71d562108ef73a30612c686458001689 not found
Workflow file
Could you please share your runner configuration file? Redacted for secrets.
@NickP gentle ping?
I'm experiencing a similar issue at the very least, though the error is slightly different...
Running Forgejo v12.0.1 and Runner v9.1.1, with both run via docker compose (albeit separately).
Config file:
Services require `[container].network` to be empty so that a new network can be temporarily created to connect the container running the workflow with the services.
It would be useful to add a note, as this is not intuitive. I sent a pull request, https://codeberg.org/forgejo/docs/pulls/1411/files, and it would be great if you could review and approve it so it can be merged.
Does that help?
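For readers following along, the setting discussed above could look roughly like this in the runner's config.yml. This is a minimal sketch showing only the `container.network` key; every other key is omitted and nothing here is copied from a participant's actual configuration.

```yaml
# config.yml (runner), fragment only; all other keys omitted
container:
  # Left empty so the runner can create a temporary per-job network
  # that the job container and its service containers share.
  network: ""
```

With a fixed value such as `host` here, the behaviour reported in this issue is that the per-job `WORKFLOW-...` network is not available when the service container starts.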
I discovered as much through my own experimentation, but agree that it's not obvious, especially as the existing examples for operating the runner via docker compose (or otherwise) set `[container].network`. I think adding a note is a good idea regardless, so I've approved the changes.
@NickP is it the same problem you ran into by any chance?
@earl-warren wrote in #912 (comment):
Sorry for my late reply.
I create the configuration using the command option for a service in my docker-compose-file when starting the container.
It is created as follows:
@earl-warren wrote in #912 (comment):
To me, it looks like it's exactly the same issue that I have.
And do you confirm it solves the issue if you do not force a fixed network name with the following?
I use the Kubernetes example (which is more or less the same as the docker compose version) and get the same problem:
failed to start container: Error response from daemon: failed to set up container networking: network WORKFLOW-f838f1238182705858f1b42e79db1bd3 not found
This error has happened since version 9.1.0 of the runner. Before that, every version since 4.0.1 (I update quite often) worked without an issue. My config looks like this:
I don't know if it had something to do with: #850
Drop `sed -i -e "s|network: .*|network: host|" config.yml ;` and it should work.
@viceice Yes, that helps with the problem but also creates a new one for me. The advantage of using the host network in the workflow step was that I could also use the dind container (daemon) to build Docker images with the docker/build-push-action@v6 action in a workflow. Since the steps are now executed in an ephemeral network, that doesn't work anymore.
@metawave wrote in #912 (comment):
It's the `DOCKER_HOST` variable: set the hostname to `RUNNER_NAME` and it should work.
Found a solution using a StatefulSet and a Service, and then using the FQDN of the runner pod as DOCKER_HOST. It is a far more complicated setup than before. If someone has run into the same problem as I have, please contact me, maybe I can help now.
@metawave could you please describe your solution here? 🙏 It will be very helpful to others: this is not easy to figure out 😓
Problem Statement
When running Forgejo runners in Kubernetes with Docker-in-Docker (DinD), workflow step containers need to access the Docker daemon from their parent pod. However, standard Kubernetes deployments don't provide stable, predictable DNS names for individual pods, making it impossible for step containers to reliably connect to the Docker daemon.
The Challenge
The networking challenge stems from how DNS resolution works in this setup:
Solution: StatefulSets with Headless Services
The solution involves converting the runner deployment to a StatefulSet and creating a headless service. This combination provides each pod with a stable, unique DNS name following the pattern:
Implementation
Step 1: Create a Headless Service
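The original manifest for this step is not reproduced in this copy of the thread. As a rough illustration only, a headless Service might look like the sketch below. The Service name `runners` and namespace `forgejo` are inferred from the example FQDN quoted under Key Points; the `app: forgejo-runner` label and the port are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: runners            # inferred from the example FQDN in this thread
  namespace: forgejo       # inferred from the example FQDN in this thread
spec:
  clusterIP: None          # "None" is what makes the Service headless
  selector:
    app: forgejo-runner    # hypothetical label; must match the pod template
  ports:
    - name: docker-tls
      port: 2376           # assumed: the Docker TLS port mentioned later in the thread
```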
Step 2: Convert to StatefulSet
Change your Deployment to a StatefulSet and add the `serviceName`:
Step 3: Update Runner Registration
Modify the init container to inject the DOCKER_HOST with the pod's FQDN:
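The original command is not reproduced in this copy of the thread. One way to picture the result is the runner config after injection, sketched below. It assumes the value ends up under the runner's `runner.envs` map and that the daemon listens on port 2376; both are assumptions, and the hostname is the example FQDN for pod 0 quoted under Key Points.

```yaml
# config.yml (runner), fragment only; a guess at the state after injection
runner:
  envs:
    # Assumed placement and port; the FQDN is per pod, so it has to be
    # written at startup rather than hard-coded for all replicas.
    DOCKER_HOST: tcp://forgejo-runner-0.runners.forgejo.svc.cluster.local:2376
```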
Step 4: Configure TLS SANs
Add the pod's FQDN to the Docker daemon's TLS certificate:
Key Points
StatefulSet with serviceName: Creates predictable DNS names like `forgejo-runner-0.runners.forgejo.svc.cluster.local`
DOCKER_HOST injection: The sed command injects the pod's FQDN into the runner config, ensuring workflow containers know where to find the Docker daemon
TLS certificate SANs: The DOCKER_TLS_SAN environment variable ensures the certificate is valid for both localhost and the Kubernetes DNS name
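The key points above can be sketched as a single StatefulSet fragment. This is an illustration under stated assumptions, not the manifest from the original comment: the names `forgejo-runner`, `runners` and `forgejo` come from the example FQDN, while the `app` label, the `docker:dind` image tag, the `POD_NAME` helper variable and the `DNS:` prefix on the SAN value are assumptions to verify against your own setup.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: forgejo-runner
  namespace: forgejo
spec:
  serviceName: runners            # ties pod DNS names to the headless Service
  replicas: 1
  selector:
    matchLabels:
      app: forgejo-runner         # hypothetical label
  template:
    metadata:
      labels:
        app: forgejo-runner
    spec:
      containers:
        - name: dind
          image: docker:dind      # assumed image
          securityContext:
            privileged: true
          env:
            # Hypothetical helper variable: exposes the pod name
            # (e.g. forgejo-runner-0) via the Downward API.
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # Adds the pod's FQDN to the daemon certificate's SANs.
            # The "DNS:" prefix is an assumption about the expected format.
            - name: DOCKER_TLS_SAN
              value: "DNS:$(POD_NAME).runners.forgejo.svc.cluster.local"
        # Runner container and init container omitted from this sketch.
```

Kubernetes expands `$(POD_NAME)` because `POD_NAME` is defined earlier in the same `env` list, so each replica gets its own SAN without a per-pod manifest.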
Namespace Considerations
If deploying outside the `forgejo` namespace, adjust the FQDN pattern accordingly:
You can also pass the pod IP [1] to the container and use that for dynamic configuration without using DNS. It'll work with a Deployment too.
Using a StatefulSet is a better solution anyway, since it gives you static runner names, so you don't need to re-register on every restart.
[1] https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
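As a brief illustration of the pod IP approach, the Downward API can expose the IP as an environment variable in the container spec. The variable name `POD_IP` is an arbitrary choice for this sketch; how it is then fed into DOCKER_HOST is left to the setup.

```yaml
# Fragment of a container spec (works in a Deployment or a StatefulSet)
env:
  - name: POD_IP                  # hypothetical variable name
    valueFrom:
      fieldRef:
        fieldPath: status.podIP   # the pod's IP address, filled in at runtime
```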
this is my stateful set:
I'm using mirrored / custom images because I need internal certificates 😉
I ran into the same issue creating a new network for a service; removing `network: "host"` fixed that.
Now I'm also unable to connect to the docker-in-docker container for docker workflows. I'm using Docker compose and not familiar with Kubernetes. I tried translating the configs @metawave posted, but wasn't able to get it working again. Is anyone able to help out? 😄
Here's my docker-compose.yml
Here's my runner config.yml
and then my workflow
I've tried using both "tcp://docker:2376" and "tcp://forgejo-runner:2376" for DOCKER_HOST, but each time I get a similar lookup error:
error during connect: Get "https://docker:2376/v1.51/containers/json": dial tcp: lookup docker on 127.0.0.11:53: no such host
Which makes sense because the job container is no longer on the host (forgejo) network. But I'm not sure how to fix it.
@zonrek I think your problem is going to be this: from containers that run within the docker-in-docker container, you won't be able to resolve the DNS name `docker`. Only the host can resolve this name, and only for containers running directly on the host.
You should be able to add an `--add-host` option to containers that are spawned to force this to resolve in the context of the host...
There's a newly released document in the Forgejo documentation about configurations for utilizing Docker from Actions, which contains a detailed and tested configuration, and I'm pulling that missing config piece from that document: https://forgejo.org/docs/latest/admin/actions/docker-access/ Obviously your config isn't exactly the same, but even if this piece doesn't help, perhaps the guide will help you out with any other details.
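As a rough sketch of where such an option could go, the runner's config.yml has a `container.options` key for extra flags passed to job containers. The address used below, Docker's special `host-gateway` value, is an assumption and may not be the right target in every setup; the linked guide is the tested reference.

```yaml
# config.yml (runner), fragment only
container:
  network: ""
  # Assumed value: maps the name "docker" inside each job container.
  # Replace host-gateway with whatever address reaches your Docker daemon.
  options: "--add-host=docker:host-gateway"
```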
@mfenniak Yep that worked perfectly. Thank you!
@zonrek can this issue be closed?
@earl-warren wrote in #912 (comment):
Shouldn't that question be directed at the person that opened the issue?
I stand corrected 😊
@NickP what do you think?
@NickP I'm closing this issue. Please re-open if you feel something still needs to be addressed.