Cannot connect to the Docker daemon at unix:///var/run/docker.sock. #153

Open
opened 2024-02-23 23:22:07 +00:00 by rpoovey · 18 comments

I'm at my wits' end here and hopeful someone can assist.

I'm attempting to set up a runner on a k3s cluster that is installed with the --docker option on ALL the nodes. If my understanding of dind is correct, that shouldn't matter at all, but I'm pointing it out anyway.

I can achieve the following:

  1. All pods in the deployment start and the runner registers in Codeberg.
  2. The runner accepts jobs and I can see logs of the jobs running.
  3. forgejo-runner pod shows that it connects just fine to the daemon (docker:dind) pod and stays running.
  4. I can scale my deployment up to X replicas and get more runners to show in Codeberg.

My issue:

To note: I'm only trying to test building a multiarch Docker container from a Dockerfile in my repo.

When I run an action on my repo, I can see the output in both the forgejo-runner pod and the web console on Codeberg. I can see it pulling all the required container images for the action, and when it gets to the docker/setup-qemu-action@v3 step it fails with the error:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Full output

::group::Docker info
[command]/usr/bin/docker version
Client:
 Version:           24.0.9-1
 API version:       1.43
 Go version:        go1.20.13
 Git commit:        293681613032e6d1a39cc88115847d3984195c24
 Built:             Wed Jan 31 20:53:14 UTC 2024
 OS/Arch:           linux/amd64
 Context:           default
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
::endgroup::
::error::The process '/usr/bin/docker' failed with exit code 1

There was what seemed like a promising solution on a Gitea issue (https://gitea.com/gitea/act_runner/issues/280#issuecomment-752458), but in the end it didn't work for me. I have also read countless other issues/articles with no luck.

My action file

name: ci
on:
  push:
    branches:
      - "modules"
jobs:
  build:
    runs-on: docker 
    # container: catthehacker/ubuntu:act-latest ##THIS IS COMMENTED BECAUSE THIS IS DEFINED IN THE FORGEJO-RUNNER CMD GLOBALLY. BUT UNCOMMENTING DOESNT CHANGE ANYTHING.
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
        with:
          platforms: 'amd64,arm64' ## CAN REMOVE with.platforms AND BUILD FOR ALL BUT DOESNT MATTER
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          #testing
          context: .
          file: ./Dockerfile
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/automation:alpine-codebergactions

My deployment config

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
  generation: 3
  labels:
    App: forgejo-runner
  name: forgejo-runner
  namespace: utilities
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      App: forgejo-runner
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        App: forgejo-runner
      namespace: utilities
    spec:
      automountServiceAccountToken: true
      containers:
      - command:
        - sh
        - -c
        - while ! nc -z localhost 2376 </dev/null; do echo 'waiting for docker daemon...';
          sleep 5; done; forgejo-runner daemon
        env:
        - name: TZ
          value: America/New_York
        - name: DOCKER_HOST
          value: tcp://localhost:2376
        - name: DOCKER_CERT_PATH
          value: /certs/client
        - name: DOCKER_TLS_VERIFY
          value: "1"
        image: code.forgejo.org/forgejo/runner:3.3.0
        imagePullPolicy: IfNotPresent
        name: forgejo-runner
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 100Mi
        securityContext: ## I DONT THINK THIS HELPED AT ALL. I ADDED IT TO THE RUNNER AS A TROUBLESHOOTING STEP
          allowPrivilegeEscalation: true
          privileged: true
          readOnlyRootFilesystem: false
          runAsGroup: 0
          runAsNonRoot: false
          runAsUser: 0 
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /certs
          mountPropagation: None
          name: forgejo-runner-certs00
        - mountPath: /data
          mountPropagation: None
          name: forgejo-runner-data00
      - env:
        - name: DOCKER_TLS_CERTDIR
          value: /certs
        image: docker:23.0.6-dind
        imagePullPolicy: IfNotPresent
        name: daemon
        resources: {}
        securityContext:
          allowPrivilegeEscalation: true
          privileged: true
          readOnlyRootFilesystem: false
          runAsGroup: 0
          runAsNonRoot: false
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /certs
          mountPropagation: None
          name: forgejo-runner-certs00
      dnsPolicy: ClusterFirst
      enableServiceLinks: true
      hostname: forgejo-runner
      initContainers:
      - command:
        - forgejo-runner
        - register
        - --no-interactive
        - --instance
        - $(FORGEJO_INSTANCE_URL)
        - --token
        - $(RUNNER_SECRET)
        - --labels
        - docker:docker://ghcr.io/catthehacker/ubuntu:act-latest
        env:
        - name: RUNNER_SECRET
          valueFrom:
            secretKeyRef:
              key: CODEBERG_TOKEN
              name: codeberg-token
              optional: false
        - name: FORGEJO_INSTANCE_URL
          value: https://codeberg.org
        image: code.forgejo.org/forgejo/runner:3.3.0
        imagePullPolicy: IfNotPresent
        name: forgejo-runner-config-generation
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data
          mountPropagation: None
          name: forgejo-runner-data00
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: forgejo-runner-data00
      - emptyDir: {}
        name: forgejo-runner-certs00

Owner

https://code.forgejo.org/forgejo/runner/src/branch/main/examples/docker-compose

There is a tested example which you could use for inspiration, if not already. It is very easy to get confused when thinking about docker in docker.

You also note that container is not effective / does nothing when trying to use another image. This may simply be an indentation problem and you could verify that from this tested example: https://code.forgejo.org/forgejo/end-to-end/src/branch/main/actions/example-container/.forgejo/workflows/test.yml

Does that help?

Author

There is a tested example which you could use for inspiration, if not already. It is very easy to get confused when thinking about docker in docker.

This example I have read through as well. I think the biggest difference I can see is that if I were to use a compose file outside of my Kubernetes cluster, I would create the containers and link them, then use the DOCKER_HOST environment variable to point at the dind container that exposes that port. I am also doing that in the k3s deployment, but since it's the same deployment the variable is localhost rather than a different pod/container. I could try deploying each container as a separate deployment and exposing a service. That has some security concerns for me though: currently the pods are self-contained, and while the dind pod exposes its Docker API port, it does not do so outside the cluster addresses. I might try the --tls=false option with this proposed setup so I don't have to worry about sharing a volume between the pods as well.
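
A rough sketch of what that split setup could look like (the Service name and the dind label here are hypothetical, and the Service would stay cluster-internal):

apiVersion: v1
kind: Service
metadata:
  name: dind
  namespace: utilities
spec:
  selector:
    App: dind            # assumes the dind pods get their own label in the separate deployment
  ports:
  - name: docker
    port: 2376
    targetPort: 2376

and in the runner deployment the env would point at the Service instead of localhost:

        - name: DOCKER_HOST
          value: tcp://dind.utilities.svc:2376
          # or tcp://dind.utilities.svc:2375 with the DOCKER_TLS_* vars dropped, if the daemon runs with --tls=false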

You also note that container is not effective / does nothing when trying to use another image. This may simply be an indentation problem and you could verify that from this tested example: https://code.forgejo.org/forgejo/end-to-end/src/branch/main/actions/example-container/.forgejo/workflows/test.yml

This is correct. container is not effective, and with or without it I can see in the logs that I'm using catthehacker/ubuntu:act-latest, since I have that defined in the --label option of the forgejo-runner command. I have run a different echo test and that was successful, so I do not believe it's an indentation or image issue. Also, if I don't set the catthehacker label in my forgejo-runner command, I get a different error that the docker executable doesn't exist. So I'm confident I have the right image.

I should note: all the comments in my code blocks were added at the time of posting this issue, to add some color to the code.

Owner

After looking at it in more detail I don't see what's wrong. To debug it I would start by running docker info from a shell, like:

run: |
  docker info  

It will fail, and then I would add commands to investigate more: ps, env, ls -l. IIRC catthehacker/ubuntu:act-latest should provide everything you need to use docker without worrying about anything.
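
For instance, a throwaway debug step could look something like this (a sketch only; the exact commands are up to you):

      - name: Debug docker environment
        run: |
          docker info || true              # expected to fail here; || true lets the rest run
          ps aux                           # is any docker daemon process visible in the job container?
          env | grep -i docker || true     # is DOCKER_HOST (or anything docker-related) set inside the job?
          ls -l /var/run/docker.sock || echo "no docker socket mounted"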

I'm attempting to set up a runner on a k3s cluster that is installed with the --docker option on ALL the nodes.

I would remove that maybe. I don't know about k3s and --docker but if that has the effect of mounting /var/run/docker.sock in all containers, it may interfere with what you're trying to do.

Author

After looking at it in more detail I don't see what's wrong.

Haha, yeah, same. I can pull docker images on the forgejo-runner pod, so I know it has the docker executable, and the checkout step works just fine in the workflow, which I assume uses a container that then runs a git clone... Sooo.

To debug it I would start by running docker info from a shell, like:

run: |
  docker info  

It will fail, and then I would add commands to investigate more: ps, env, ls -l. IIRC catthehacker/ubuntu:act-latest should provide everything you need to use docker without worrying about anything.

Just to clarify, you are referring to doing this in my workflow file? I did do some tweaking to the workflow file to check whether Docker is running, waiting up to 300s, which always fails as well.

name: ci
on:
  push:
    branches:
      - "modules"
jobs:
  build:
    runs-on: docker
    services:
      docker:
        image: catthehacker/ubuntu:act-latest
        options: --privileged
    steps:
      - name: Wait for Docker daemon
        run: |
          timeout=300  # Set a timeout value in seconds
          until docker info; do
            echo "Waiting for Docker daemon to start..."
            sleep 5
            timeout=$((timeout-5))
            if [ $timeout -le 0 ]; then
              echo "Timeout waiting for Docker daemon to start."
              exit 1
            fi
          done          
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
        with:
          platforms: 'amd64,arm64'
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./Dockerfile
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/automation:alpine-codebergactions

I'm attempting to set up a runner on a k3s cluster that is installed with the --docker option on ALL the nodes.

I would remove that maybe. I don't know about k3s and --docker but if that has the effect of mounting /var/run/docker.sock in all containers, it may interfere with what you're trying to do.

This is a supported method of installation. Basically, instead of using containerd on the cluster, it's using Docker (see "Using Docker as the Container Runtime": https://docs.k3s.io/advanced#using-docker-as-the-container-runtime). I don't think this has any bearing, and I can see /var/run/docker.sock on all 7 k3s nodes. I get no error mounting it as a hostPath either. I have been wrong a million times before though.
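
For reference, the hostPath mount I tried looks roughly like this (only the relevant fragment of the runner container and pod spec; the volume name is made up):

        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: host-docker-sock
      volumes:
      - name: host-docker-sock
        hostPath:
          path: /var/run/docker.sock
          type: Socket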


I'd be curious to hear your input on what I have gathered in my troubleshooting today:

  • I have never been able to do "offline registration" with the initContainer. When I do, it errors saying the --secret needs to be in hexadecimal form. If I convert my token to hex, it's too long and I then get an error that it needs to be 40 characters, not 80. I am assuming this is because I am not self-hosting Forgejo, so Codeberg doesn't provide me with a SHA-1 hex value to use as a --secret instead of the --token that the register command uses. Even the kubernetes example's README mentions using offline registration, but the example itself doesn't use it. Probably just didn't edit the README when they imported from this repo.
  • I found a reference in the forgejo-runner docs config section about docker-in-docker. In particular, the container: section has the following (see the code snippet below). And because I cannot do offline registration, the forgejo-runner daemon command in my k3s pod (not the initContainer) is using what I assume is a default config.yml, which has this value set to false.
container:
...
# Whether to use privileged mode or not when launching task containers (privileged mode is required for Docker-in-Docker).
  privileged: false
Owner
jobs:
  build:
    runs-on: docker
    services:
      docker:
        image: catthehacker/ubuntu:act-latest
        options: --privileged
    steps:
      - name: Wait for Docker daemon
        run: |
          timeout=300  # Set a timeout value in seconds
          until docker info; do
            echo "Waiting for Docker daemon to start..."
            sleep 5
            timeout=$((timeout-5))
            if [ $timeout -le 0 ]; then
              echo "Timeout waiting for Docker daemon to start."
              exit 1
            fi
          done                   

This won't work because the steps run in a container and the service named "docker" runs in another container.

jobs:
  build:
    runs-on: docker
    container:
        image: catthehacker/ubuntu:act-latest
    steps:
      - name: Wait for Docker daemon
        run: |
          timeout=300  # Set a timeout value in seconds
          until docker info; do
            echo "Waiting for Docker daemon to start..."
            sleep 5
            timeout=$((timeout-5))
            if [ $timeout -le 0 ]; then
              echo "Timeout waiting for Docker daemon to start."
              exit 1
            fi
          done                   

Is what I had in mind.

Author

Alright, I think I've narrowed it down.

I don't think it has anything to do with the action file or Kubernetes per se. I actually think the runner isn't passing /var/run/docker.sock to the catthehacker/ubuntu:act-latest image. That image by itself does need /var/run/docker.sock bind-mounted as a volume, or it returns the error from my first post (the docker info output). And when the runner spins up that image, it does not pass that volume to it. I see volume mounts as an option in the config and am currently trying to write a sed command to mount that volume and test.
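
The config keys I'm experimenting with would end up roughly like this (a sketch only; whether the runner actually accepts the socket path under valid_volumes is exactly what I'm testing):

container:
  # extra options passed to the job containers
  options: -v /var/run/docker.sock:/var/run/docker.sock
  # volumes (including bind mounts) that jobs are allowed to use
  valid_volumes:
    - /var/run/docker.sock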

Owner

Did it work out?

Author

Update:

I think I have confirmed that it's the runner image not passing /var/run/docker.sock to the nested container, which may be a limitation of Kubernetes or something else entirely.

  1. I set the container.volumes: to - /var/run/docker.sock:/var/run/docker.sock and I am given a message in the logs stating
[ci/build] [/var/run/docker.sock] is not a valid volume, will be ignored

So, obviously that didn't help. I tried to make more adjustments, but always ended up the same. Open to suggestions here.

  2. Then I started to read the GitHub/Gitea Actions docs to get a better understanding. I got to the labels section of the Gitea act-runner docs (https://docs.gitea.com/next/usage/actions/act-runner#labels) and saw the option to run on the host instead of in containers. My thinking here was "Well, it's already in a pod, I don't need nested containers." So I made some adjustments to my action yaml and my forgejo-runner command in Kubernetes. They are as follows:

dockerBuildMultiarch.yaml

name: ci
on:
  push:
    branches:
      - "modules"
jobs:
  build:
    runs-on: docker
    steps:
      - name: Wait for Docker daemon
        run: |
          timeout=300  # Set a timeout value in seconds
          until docker info; do
            echo "Waiting for Docker daemon to start..."
            sleep 5
            timeout=$((timeout-5))
            if [ $timeout -le 0 ]; then
              echo "Timeout waiting for Docker daemon to start."
              exit 1
            fi
          done          
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
        with:
          platforms: 'amd64,arm64'
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./Dockerfile
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/automation:alpine-codebergactions

Kubernetes deployment

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: apps/v1
kind: Deployment
metadata:
...
spec:
  ...
  template:
    ...
    spec:
      automountServiceAccountToken: true
      containers:
      - command:
        - sh
        - -c ##ADDED docker/nodejs via apk/built a docker context/set privileged to true in config.yaml in the next line. 
        - 'while ! nc -z localhost 2376 </dev/null; do echo 'waiting for docker daemon...'; sleep 5; done; apk add docker nodejs; docker context create multiarch; docker buildx create multiarch --use; forgejo-runner generate-config > /data/config.yml; sed -i -e \"s|privileged: .*|privileged: true|\" /data/config.yml; forgejo-runner -c /data/config.yml daemon'
        env:
        ...
        image: code.forgejo.org/forgejo/runner:3.3.0
        imagePullPolicy: IfNotPresent
        name: forgejo-runner
        resources:
          ...
        securityContext:
          privileged: true
        volumeMounts:
          ...
      - env:
        ...
        image: docker:23.0.6-dind
        imagePullPolicy: IfNotPresent
        name: daemon
        resources: {}
        securityContext:
          privileged: true
        volumeMounts:
          ...
      dnsPolicy: ClusterFirst
      enableServiceLinks: true
      hostname: forgejo-runner
      initContainers:
      - command:
        - forgejo-runner
        - register
        - --no-interactive
        - --instance
        - $(FORGEJO_INSTANCE_URL)
        - --token
        - $(RUNNER_SECRET)
        - --labels
        - docker:host ##Set label to docker:host
        env:
          ...
        image: code.forgejo.org/forgejo/runner:3.3.0
        imagePullPolicy: IfNotPresent
        name: forgejo-runner-config-generation
        resources: {}
        volumeMounts:
          ...
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
        ...

Note: I trimmed a lot out of this for brevity. But the key takeaways are

##ADDED docker/nodejs via apk/built a docker context/set privileged to true in config.yaml in the next line. 
        - 'while ! nc -z localhost 2376 </dev/null; do echo 'waiting for docker daemon...'; sleep 5; done; apk add docker nodejs; docker context create multiarch; docker buildx create multiarch --use; forgejo-runner generate-config > /data/config.yml; sed -i -e \"s|privileged: .*|privileged: true|\" /data/config.yml; forgejo-runner -c /data/config.yml daemon'
      initContainers:
      - command:
        - forgejo-runner
        - register
        - --no-interactive
        - --instance
        - $(FORGEJO_INSTANCE_URL)
        - --token
        - $(RUNNER_SECRET)
        - --labels
        - docker:host ##Set label to docker:host

I had to install docker and nodejs via apk on the forgejo-runner pod and create a new docker context. Easy enough to do in my container command. After that, I am successfully building multiarch containers in a k3s cluster. I suppose I could just make a Dockerfile and build my own forgejo-runner image with all this already done. docker.sock should still be passed to the forgejo-runner pod via the daemon pod and the DOCKER_HOST var.
Screenshot 2024-02-25 at 6.18.34 PM.png

This setup might not be ideal for most people, and I am too unfamiliar with how the runner works to know what implications doing it this way might have. I am open to any thoughts/adjustments or testing to get proper dind working with the runner and Kubernetes. But I think this shows that the forgejo-runner pod is not passing docker.sock to the containers it spins up when running the steps. Again, please check my work though.

Owner

That's a very detailed post-mortem ❤️

Bottom line is ... you got it working?

Author

Yeah, that post ended up longer than I thought.

Yes, it's working, but not with the default settings per the kubernetes example (https://code.forgejo.org/forgejo/runner/src/branch/main/examples/kubernetes/dind-docker.yaml).

Owner

@rpoovey great to hear. If you have enough fuel left, would you consider a PR to fix the example so that other people do not fall in the same trap as you?

Author

@earl-warren I could for sure write up how to do this with my workaround, but it is just that: a workaround. Docker-in-Docker still doesn't work. I would love to discuss how to address that, but I am not sure where that issue stems from. My guess is the forgejo-runner container and its Dockerfile.

Owner

Ok, thanks for the clarification. Let's keep this issue open until it can be investigated further. It will require some quality brain time 😄

Contributor

I've been stumbling a lot on this and I think I finally made this work.

  • generate runner config using forgejo-runner generate-config > config.yaml
  • edit the config file this way:
runner:
  ...
  envs:
    DOCKER_HOST: tcp://localhost:2376
    DOCKER_TLS_VERIFY: 1
    DOCKER_CERT_PATH: /certs/client
    ...
  ...

container:
  ...
  options: -v /certs/client:/certs/client
  ...
  valid_volumes:
    - /certs/client
  ...
  • run the daemon using forgejo-runner -c config.yaml daemon

Here it is.

I tested this using this pipeline:

name: Docker Image CI

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:

  build:
    runs-on: ubuntu-22.04

    steps:
    - name: test
      run: df -h;docker info

I also tested those actions without any problems:

  • docker/login-action@v2.1.0
  • docker/build-push-action@v4

Hope it helps!

Edit: My kubernetes deployment for the runner:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: forgejo-runner
  name: forgejo-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: forgejo-runner
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: forgejo-runner
    spec:
      restartPolicy: Always
      volumes:
      - name: docker-certs
        emptyDir: {}
      - name: runner-data
        emptyDir: {}

      initContainers:
      - name: runner-config-generation
        image: code.forgejo.org/forgejo/runner:3.4.0
        command: ["/bin/bash","-c"]
        args: 
          - /bin/forgejo-runner create-runner-file --instance https://XXX.XXX.XXX --secret XXXXXXXXXXXXXXXXXXXXXX --connect;
            sed -i '/^ +"labels"/ N;N; s/null/[\n\t"self-hosted:host",\n\t"ubuntu-latest:docker:\/\/catthehacker\/ubuntu:act-latest",\n\t"ubuntu-22.04:docker:\/\/catthehacker\/ubuntu:act-22.04",\n\t"ubuntu-20.04:docker:\/\/catthehacker\/ubuntu:act-20.04",\n\t"ubuntu-18.04:docker:\/\/catthehacker\/ubuntu:act-20.04"\n]/' .runner;
            cat .runner
        env:
          ...
        volumeMounts:
        - name: runner-data
          mountPath: /data
          
      containers:
      - name: runner
        image: code.forgejo.org/forgejo/runner:3.4.0
        command: ["sh", "-c", "while ! nc -z localhost 2376 </dev/null; do echo 'waiting for docker daemon...'; sleep 5; done; forgejo-runner daemon"]
        env:
        - name: DOCKER_HOST
          value: tcp://localhost:2376
        - name: DOCKER_CERT_PATH
          value: /certs/client
        - name: DOCKER_TLS_VERIFY
          value: "1"
        volumeMounts:
        - name: docker-certs
          mountPath: /certs
        - name: runner-data
          mountPath: /data
        resources:
          requests:
            memory: 500Mi

      - name: daemon
        image: docker:23.0.6-dind
        env:
        - name: DOCKER_TLS_CERTDIR
          value: /certs
        securityContext:
          privileged: true
        volumeMounts:
        - name: docker-certs
          mountPath: /certs

The deployment does not implement the fix explained above, as I manually create and edit the config.yaml directly inside the container after it has been deployed.
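
If I wanted the deployment to carry the fix itself instead of editing the file by hand, one option (a sketch only; I have not verified that a partial config merges cleanly with the defaults, so it may be safer to start from the full generate-config output and add these keys) would be to ship the edited config in a ConfigMap and start the daemon with -c:

apiVersion: v1
kind: ConfigMap
metadata:
  name: forgejo-runner-config
data:
  config.yaml: |
    runner:
      envs:
        DOCKER_HOST: tcp://localhost:2376
        DOCKER_TLS_VERIFY: 1
        DOCKER_CERT_PATH: /certs/client
    container:
      options: -v /certs/client:/certs/client
      valid_volumes:
        - /certs/client

and in the runner container:

        command: ["sh", "-c", "while ! nc -z localhost 2376 </dev/null; do echo 'waiting for docker daemon...'; sleep 5; done; forgejo-runner -c /etc/forgejo-runner/config.yaml daemon"]
        volumeMounts:
        - name: runner-config
          mountPath: /etc/forgejo-runner
        # plus the existing docker-certs and runner-data mounts
      volumes:
      - name: runner-config
        configMap:
          name: forgejo-runner-config
        # plus the existing emptyDir volumes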

Owner

It is very helpful. What would be fantastic is to run this in the CI, in a way similar to:

  • the workflow that runs the docker-compose example: https://code.forgejo.org/forgejo/runner/src/branch/main/.forgejo/workflows/example-docker-compose.yml
  • which is published with a little documentation in the example directory: https://code.forgejo.org/forgejo/runner/src/branch/main/examples/docker-compose

It makes a huge difference for the user to know that it works for real and there is no tiny roadblock (such as the one you encountered yourself) that will make their experience difficult. And when it fails, the developer who introduced a regression knows it right away; it blocks the CI.

Contributor

Thanks for the intel, I'll try to work on a PR in the coming days!

Contributor

I'm actually facing another issue where the runner is not respecting the provided labels :/
You can see it here: https://code.forgejo.org/forgejo/runner/actions/runs/870#jobstep-4-277

It uses node:16-bullseye instead of alpine as specified inside the compose file here (https://code.forgejo.org/forgejo/runner/src/commit/eb89a98c6a401bc0afc6f9a897a82de0dfcb9d8d/examples/docker-compose/compose-forgejo-and-runner.yml#L65):

sed -i -e "s|labels: \[\]|labels: \[\"docker:docker://alpine:3.18\"\]|" config.yml ;

It's probably tied to this existing issue: #149

As I need an image that contains docker, I'm stuck with this. I'll try to dig into this, because I do not have this problem on my Kubernetes-deployed Forgejo instance.

earl-warren added the Kind/Bug label 2024-04-06 13:58:00 +00:00
Owner

Good catch. Could you please create a separate issue for this? I think it simply is an issue that the sed does not do what it is supposed to do. I'll work on resolving that this weekend. 👍
