Feature Request: Pluggable Backend Architecture #107

Open
opened 2026-04-03 16:01:30 +00:00 by eleboucher · 39 comments

How to use this feature request

  • Please describe your first hand experience in a comment to show why you are interested in the same feature.
  • Please don't comment if you have no relevant information to add. It's just extra noise for everyone subscribed to this issue.
  • Subscribe to receive notifications on status change and new comments.

First hand experience

  • I wanted to run Forgejo Actions jobs natively on Kubernetes, without Docker-in-Docker or a separate VM.
  • I implemented a Kubernetes backend for Forgejo Runner as an in-tree addition (https://git.erwanleboucher.dev/eleboucher/runner).
  • The PR was not merged because new backends shouldn't live in-tree. The maintainers suggested a plugin architecture instead.
  • There is no way to add a backend without modifying core runner code today.

Needs and benefits

Forgejo Runner currently supports Docker, Host, and LXC backends, all hardcoded into the runner. There is community interest in additional backends (Kubernetes, Podman, Firecracker, Proxmox, etc.), but each one would require modifying core files and adding backend-specific knowledge to the runner codebase.

A plugin architecture would:

  • Let the community develop and maintain backends independently, without burdening the Forgejo team with review, testing, or support of backend-specific code.
  • Keep the scope of knowledge required to work on Forgejo Runner constrained — runner maintainers would not need to understand K8s, Podman, or any other backend's internals.
  • Allow backends to have their own release cycles, dependencies, and documentation.

Feature Description

Each backend is a separate binary communicating with the runner over gRPC, using something like HashiCorp go-plugin. The runner would launch the plugin binary when needed, and the plugin would implement a defined gRPC interface to manage execution environments.

How it would work:

The runner discovers plugins through config:

plugins:
  k8spod:
    path: /usr/local/bin/forgejo-runner-k8s
    config:
      namespace: default

When a job's runs-on label matches a plugin scheme, the runner launches the corresponding binary. The plugin handles the full environment lifecycle (create, exec, copy files, remove). The runner communicates via a gRPC protocol.
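The lookup from a runs-on label to a configured plugin could look roughly like the sketch below. It assumes labels follow a `name:scheme://argument` layout and that each plugin is keyed by its scheme; `PluginConfig`, `schemeOf` and `pluginFor` are hypothetical names for illustration, not existing runner code.

```go
package main

import (
	"fmt"
	"strings"
)

// PluginConfig mirrors one entry under "plugins:" in the config sketch.
// The field names are assumptions, not the runner's real config types.
type PluginConfig struct {
	Path   string
	Config map[string]string
}

// schemeOf extracts the scheme from a label such as "k8s:k8spod://alpine:3.20".
// The "name:scheme://argument" layout is an assumption made for this sketch.
func schemeOf(label string) (string, bool) {
	_, rest, ok := strings.Cut(label, ":")
	if !ok {
		return "", false
	}
	scheme, _, ok := strings.Cut(rest, "://")
	if !ok || scheme == "" {
		return "", false
	}
	return scheme, true
}

// pluginFor returns the plugin configured for the label's scheme, if any.
func pluginFor(plugins map[string]PluginConfig, label string) (PluginConfig, bool) {
	scheme, ok := schemeOf(label)
	if !ok {
		return PluginConfig{}, false
	}
	p, ok := plugins[scheme]
	return p, ok
}

func main() {
	plugins := map[string]PluginConfig{
		"k8spod": {
			Path:   "/usr/local/bin/forgejo-runner-k8s",
			Config: map[string]string{"namespace": "default"},
		},
	}
	if p, ok := pluginFor(plugins, "k8s:k8spod://alpine:3.20"); ok {
		fmt.Println("launch", p.Path)
	}
	if _, ok := pluginFor(plugins, "ubuntu:docker://node:20"); !ok {
		fmt.Println("docker is built in, no plugin launched")
	}
}
```

Schemes that have no plugin entry simply fall through to the built-in backends, which keeps the plugin system additive.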

gRPC interface sketch:

The current Container interface has 15 methods, but several are Docker-internal (ConnectToNetwork, ReplaceLogWriter, Pull as a separate step). A plugin protocol could be simpler — roughly 9 RPCs:

  • Capabilities — declares what the backend supports (Docker actions? own networking? default paths?). This replaces the scattered IsHostEnv()/IsK8sEnv() identity checks with a single upfront declaration.
  • Create — provisions the environment (image pull folded in). Returns metadata the runner needs (paths, OS info). Service containers are passed here so the plugin handles them in whatever way makes sense for its platform.
  • Start — boots the environment.
  • Exec — runs a command, streams stdout/stderr back.
  • CopyIn / CopyOut — stream tar data for file transfers.
  • UpdateEnv — reads an env file inside the environment, returns parsed variables.
  • IsHealthy — health check.
  • Remove — tears down everything.

On the runner side, a gRPC client adapter would implement ExecutionsEnvironment by translating calls to plugin RPCs. This adapter is the only runner code that knows about go-plugin. Plugin authors write a standalone binary with its own go.mod and dependencies — the runner never imports K8s/Podman/etc. code.

Docker and Host backends would stay built-in. The plugin system is additive.

Approach 2: External orchestration + host mode

Instead of the runner managing execution environments, external tooling creates the environment and runs the runner inside it in host mode.

For example with Kubernetes: a K8s operator watches for pending jobs, creates a pod with the runner binary, the runner starts in host mode, executes the job, and the pod gets cleaned up.

This is compelling for remote infrastructure (OpenStack, Proxmox) where controlling processes from outside a VM is unreliable or impossible — it's much easier to control a job from inside.

Trade-offs: Simpler for the runner (just host mode), natural for remote VMs. But more moving parts in deployment, service containers become the orchestrator's responsibility, and it loses the "configure a label and go" UX.

Both could coexist

  • Docker stays built-in (no change).
  • gRPC plugins for backends where the runner controls the environment (local Podman, K8s from within a cluster).
  • External orchestration for backends where the environment is remote (Proxmox, OpenStack, cloud VMs).

Incremental steps that help either way

Some small refactoring would improve extensibility regardless of which approach is chosen:

  1. Replace IsHostEnv()/IsK8sEnv() identity checks with capability queries — makes adding any backend safer and avoids the "missed a guard" class of bugs.
  2. Open the label scheme list so labels.Parse() and PickPlatform() don't reject unknown schemes.
  3. Decouple NewContainerInput from Docker-specific types (nat.PortSet/nat.PortMap).
  4. Replace GetLXC()/GetK8s() on ExecutionsEnvironment with something generic (e.g. BackendName() string).
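Steps 1 and 4 could be sketched as follows. The idea is that runner code asks what a backend can do instead of asking which backend it is. The `Capabilities` fields and both helper functions are hypothetical and only illustrate the shape of such checks.

```go
package main

import "fmt"

// Capabilities is a hypothetical declaration a backend makes up front.
type Capabilities struct {
	Name                  string // what a generic BackendName() might return
	SupportsDockerActions bool
	ManagesOwnNetwork     bool
}

// needsRunnerNetwork replaces an identity check on the backend type:
// any backend that manages its own networking is skipped, including
// backends the runner has never heard of.
func needsRunnerNetwork(c Capabilities) bool {
	return !c.ManagesOwnNetwork
}

// checkAction rejects container-image actions on backends that cannot run them.
func checkAction(c Capabilities, usesDockerImage bool) error {
	if usesDockerImage && !c.SupportsDockerActions {
		return fmt.Errorf("backend %q cannot run container-image actions", c.Name)
	}
	return nil
}

func main() {
	backends := []Capabilities{
		{Name: "docker", SupportsDockerActions: true},
		{Name: "host", ManagesOwnNetwork: true},
		{Name: "k8spod", SupportsDockerActions: true, ManagesOwnNetwork: true},
	}
	for _, c := range backends {
		fmt.Printf("%s: runner network=%v, docker action error=%v\n",
			c.Name, needsRunnerNetwork(c), checkAction(c, true))
	}
}
```

A new backend only has to fill in the struct; no existing guard needs to learn its name.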

What needs to happen before a feature request is ready to be implemented?

Users can complete the first step (accumulating first hand experience) on their own, even if this feature request did not catch the eye of someone with the necessary skills to implement it. And when it reaches that point, it will stand out and have a much higher chance of being implemented.

  1. A few other users contributed their own first hand experience.
    To fully grasp the scope of a feature request, and to brainstorm possible solutions, a feature request will generally wait until several users have provided their perspective.
    Thumbs-up reactions help gauge popularity, but do not provide the same amount of useful information.
  2. The "Needs and benefits" and "Feature description" are finalized.
    Results from discussions and additional user experiences are incorporated into a final summary to provide a single reference for the developers working on this change.
    This can be done by the author of the issue or anyone else in a followup comment.
  3. The label Stage/Idea is changed to Stage/Ready.
  4. Feature request is created in the repository where the code resides.
    Depending on the feature request it can be in Forgejo or Forgejo runner.
    A copy/paste of the "Needs and benefits" and "Feature description" should be used, with a link to this issue so the developer knows where to find more details if they need to.
Owner

I'm more of a fan of "Approach 1", to start with. I don't find the responsibilities between different software components clear in Approach 2 -- the "orchestration" system would either require 30% of the logic that the runner has today, or, would need to be a version of the runner.

Regarding Approach 1, it's a little hard to foresee what (if anything) will make this difficult. At a high-level, it sounds good to me. Let me see if I can tease out some complexity though...

I think that we'll need cancellation capabilities that are managed by the runner, and that are more fine-grained than Remove. For example, a step can be defined with timeout-minutes, and the next step in the job can be set as if: failure(). In this case, the step, and whatever command it is executing, needs to be cancelled while the job continues to live.

I see that you've included an IsHealthy API in the design -- but as I'm reviewing go-plugin, I see that there's a standard gRPC health checking service that is required for the protocol. Does that serve the same purpose?

The runner will need to be prepared at any time for a situation where the plugin panics, crashes, or receives SIGTERM or SIGKILL. Do you have any thoughts on what that would look like? Less on the technical side of how, but more generally what happens to the task on Forgejo, are there resources that can be cleaned up still, etc.

How do you envision a gRPC interface method like Exec working to stream log data back to the runner?

You've noted that "Docker and Host backends would stay built-in.", and I agree, that makes sense. However, Forgejo Runner will need some capability to test the plugin capability. Perhaps we'd need a "test plugin" that can be used in an end-to-end test, and can trigger some of the more complex interactions to ensure that the runner supports them correctly.

Author

Hey, I'm currently testing a POC of a plugin implementation here: https://git.erwanleboucher.dev/eleboucher/runner-k8s-plugin. I decided to implement the gRPC interface myself and not use go-plugin, because I don't like the way HashiCorp plays with licenses, and because a plain gRPC service is simpler: the plugin is just a standalone server the runner connects to, with no managed process.

I went the full streaming route, where I pipe the logger of the plugin directly to the response of the Exec stream (the gRPC stream is bidirectional). Each Write() on stdout/stderr sends a chunk back to the runner in real-time. Had to add mutex protection since both can write concurrently.

For cancellation — gRPC context cancellation propagates naturally, so when the runner cancels (e.g. timeout-minutes expires), the in-flight Exec stream gets terminated. The plugin can clean up the command while keeping the environment alive for the next step. This is more granular than Remove which tears down everything.

For IsHealthy vs gRPC health checking — IsHealthy in my design checks whether a specific environment (pod, container) is still running, it's per-job. Standard gRPC health would be for the server itself. They serve different purposes, but we could add the standard one alongside.

For plugin crash/SIGKILL — if the plugin dies mid-job, all RPCs fail immediately with gRPC Unavailable. The runner already handles container errors and reports them to Forgejo, so the job would be marked failed. Orphaned resources (pods etc) would need the plugin to clean up on restart or some TTL/garbage collection. Haven't solved that part yet.

For testing — agreed, a test plugin would be useful. I already have adapter tests with a mock gRPC server implementing the full interface. A "dummy" plugin that just runs commands on the host (like the Host backend) would let us test the plugin lifecycle in CI without K8s.

My POC is not perfect yet. As we speak, I still have some issues with passing the data around, and the setup job is much slower than in-tree because of gRPC overhead on file transfers. For the sidecar case, I'm thinking of adding a CopyLocal RPC where the plugin just reads from a shared volume path instead of streaming compressed data over gRPC.

Edit: I managed to get my POC working. The slow file transfer was caused by caching that I hadn't set up correctly for the plugin :D

Member

I think approach 2 is pretty well covered by external tooling. People are already using KEDA, and integration for GARM is being worked on. Some refinements are still necessary. But that's a separate discussion.

There seems to be strong demand for approach 1. It has the added benefit that we can move LXC out of core and hopefully fix some of its problems.

Regarding labels, #105 should help with supporting arbitrary plug-ins.

One thing we have to consider is how plug-ins are going to be discovered and what additional configuration might be necessary within Forgejo Runner.

Member

Hey @adamcharnock, this discussion might be of interest for you because of your Firecracker efforts.

Author

Here is the RPC protocol for the MVP plugin: https://git.erwanleboucher.dev/eleboucher/runner/src/branch/main/act/plugin/proto/v1/plugin.proto. I'm currently testing it in my homelab (https://git.erwanleboucher.dev/eleboucher/homelab/src/branch/main/kubernetes/apps/selfhosted/forgejo/runner/helmrelease.yaml) and trying to get people from my community to use it as well.

Owner

@eleboucher wrote in #107 (comment)...

Just reading through these notes, and it all sounds promising. I think to some extent a POC is needed to understand the problems in depth -- I would be doing the same thing -- but just a note of caution to not get too locked into a solution since we may still have design thoughts that would change direction. 🙂

> I decided to implement the gRPC interface myself and not use go-plugin (because i don't like the way hashicorp play with licenses, and a plain gRPC service is simpler — the plugin is just a standalone server the runner connects to, no managed process).

I'm not easily convinced that this is the right choice. go-plugin solves a lot of problems, and it is broadly used making it a solid base of known functionality. While hashicorp has done dumb licensing things in the past, their open source licensed codebases have lived on in forks quite happily, and the MPL licensed go-plugin would still be usable.

> Each Write() on stdout/stderr sends a chunk back to the runner in real-time. Had to add mutex protection since both can write concurrently.

🤔 I don't understand this note. They're different processes on each side of the gRPC -- why would you need a mutex, which is a single-process concurrency mechanism?

Author

Good point, I've actually reconsidered and implemented go-plugin support as a second transport option (pluginsv2). The plain gRPC approach stays as v1 (plugin is a standalone server the runner connects to), and go-plugin is v2 (runner launches the plugin binary as a subprocess). Both share the same proto interface, and the same plugin binary supports either mode. So users can choose based on their deployment model, sidecar container (v1) or embedded binary (v2), until we settle on one approach.

Both mutexes are within a single process, not cross-process:

  • Plugin side: stdout and stderr writers share one gRPC stream. A command can write to both concurrently, and stream.Send() isn't safe for concurrent calls.
  • Runner side: protects against ReplaceLogWriter() swapping writers while Exec is writing to them.

There's also the Caddy option, where you build the main binary with the plugins injected as direct dependencies by generating a suitable main.go on the fly.

Owner

@whitequark wrote in #107 (comment):

> There's also the Caddy option, where you build the main binary with the plugins injected as direct dependencies by generating a suitable main.go on the fly.

Interesting approach... the core technical details are documented in their main.go which helped me interpret this description.

I can see the advantages of this approach:

  • It's pretty simple to get started.
  • Everything runs in one process, allowing more complex data types and interactions in the API. (eg. context.Context and chan could be parameters/returns, if needed)
  • No new third-party dependencies involved.
  • There's a build-time safety component present here that the gRPC solutions wouldn't have -- the initialization of the module wouldn't build against the runner if the API isn't correct.
  • API evolution is intrinsically the same as any Go module -- if the runner's API changes then it becomes .../v13/... as a module, and the plugins can continue to work against v12 and upgrade to v13 when they're ready to incorporate those changes.
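For readers unfamiliar with the pattern, a compile-in plugin mechanism usually boils down to a registry that plugin packages fill from their init() functions. The sketch below keeps everything in one file so it runs as-is. In a real build the `k8sPod` type would live in the plugin's own module and be pulled in by a blank import in a generated main.go. `Backend`, `Register` and the import path in the comment are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Backend is a placeholder for the in-process plugin API.
type Backend interface{ Name() string }

// registry maps a label scheme to a factory for its backend.
var registry = map[string]func() Backend{}

// Register is what a plugin package would call from its init().
func Register(scheme string, factory func() Backend) {
	if _, dup := registry[scheme]; dup {
		panic("backend registered twice: " + scheme)
	}
	registry[scheme] = factory
}

// In a real build, everything from here to init() would sit in the
// plugin's module, imported as `_ "example.org/runner-k8s"` (made-up path).
type k8sPod struct{}

func (k8sPod) Name() string { return "k8spod" }

func init() {
	Register("k8spod", func() Backend { return k8sPod{} })
}

// schemes lists the registered schemes in a stable order.
func schemes() []string {
	out := make([]string, 0, len(registry))
	for s := range registry {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println("compiled-in backends:", schemes())
	fmt.Println("selected:", registry["k8spod"]().Name())
}
```

If the plugin's type stops satisfying `Backend`, the call to `Register` no longer compiles, which is the build-time safety mentioned in the list above.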

Caddy has a complexity in the use of an external build tool (xcaddy) to support this workflow, but I don't think that complexity is really necessary for us.

gRPC gives a theoretical advantage that plugins could be written in a language other than Go. But I'm not sure if that's a realistic use-case or a need.


From extensively using Caddy, I can say that the utility provided by xcaddy is very marginal in practice and for a tool consumed by a more specialized audience like Forgejo Actions runner it's really not essential; I would expect everyone capable of using the runner to also be capable of using the Go toolchain in a basic manner. (We should of course also document it, if we go for this approach.)

Author

> gRPC gives a theoretical advantage that plugins could be written in a language other than Go. But I'm not sure if that's a realistic use-case or a need.

I guess for me it's also that we have a contract which acts as documentation for how to write a plugin. And we never know, maybe some people will need their plugin to run in a COBOL environment. I'm not a huge fan of playing with a binary like xcaddy or go-plugin, as it can be used maliciously, whereas a simple gRPC client does the trick and limits the angle of attack. You can see in my two implementations that it's pretty straightforward and rather easy to expand.

The thing with xcaddy is that it means people will use third-party-distributed builds of Forgejo Runner. Is that OK?

> There's a build-time safety component present here that the gRPC solutions wouldn't have -- the initialization of the module wouldn't build against the runner if the API isn't correct.

If you don't match the gRPC proto, then it wouldn't compile either. You also import the proto file at a specific version; see https://git.erwanleboucher.dev/eleboucher/runner-k8s-plugin/src/branch/main/k8s_pod.go#L18


> I'm not a huge fan of playing with a binary like xcaddy or the go-plugin as it can be used maliciously, where a simple grpc client does the trick and limit the angle of attack, you can see in my 2 implementation that it's pretty straightforward and rather easy to expand.

In what way would using gRPC here improve security? Could you state a specific threat model please?

Author

Fair point, I should be more precise. My concern is less about a strict threat model and more about attack surface and trust boundaries.
With the Caddy/go-plugin approach, you’re executing arbitrary compiled code directly in the same process (or a subprocess spawned with elevated trust). If a plugin is malicious or compromised, it has full access to the runner’s memory space, environment variables, file descriptors, etc. — there’s no isolation boundary.
With gRPC, the plugin runs as a separate process with its own isolation. The runner communicates with it over a well-defined protocol. A compromised plugin can still do damage, but it’s limited to what you explicitly expose through the proto contract — you don’t get implicit access to the runner’s internals just by being loaded into it.
That said, I’ll admit this is more of a defense-in-depth argument than a hard security boundary — if someone’s running a malicious plugin on their own runner, they already have code execution anyway. So the security delta is real but probably not the strongest argument for gRPC in this context.
The stronger arguments for me remain the explicit contract as documentation and language agnosticism — even if the latter is theoretical for now.


> If a plugin is malicious or compromised, it has full access to the runner's memory space, environment variables, file descriptors, etc. - there's no isolation boundary.

We are talking about plugins for things like "running Firecracker VMs" here, right? In other words, plugins that are already more than capable of extracting job secrets by substituting the code that actually runs in a VM (or the VM image) with something malicious.

Author

Yes you are right. I’m still not a huge fan of the xcaddy solution for the other point above

Member

I want to do #105. Because labels affect back-ends and their configuration, it blends into this one. #105 is only about the user-facing aspects. So I'll dump my current plan here because it affects plug-in registration and plug-in instance creation.

It's very early. Almost no code exists, everything in here is pseudo-code, and sudden U-turns should be expected.

Happy to read comments and answer questions.


There's a Backend:

type Backend interface {
    GetID() ID
    ValidateLabelConfiguration(labelConfig *LabelConfiguration) error
    ValidateLabelString(str string) error
    // Further parameters are still to be determined.
    CreateExecutionEnvironment(config *Configuration, labelConfig *LabelConfiguration, label Label) (ExecutionEnvironment, error)
}

type ID string

Examples of Backend are DockerBackend, LXCBackend, or HostBackend. Backend is the static companion of ExecutionsEnvironment (to be renamed to ExecutionEnvironment, without the S in the middle). Backend instances have a static configuration in the runner configuration:

backend:
  docker: # ID of the Backend
    ... # map[string]any

That static configuration is called BackendConfiguration:

type BackendConfiguration interface {
}

Each Backend has to implement its own BackendConfiguration, for example, DockerConfiguration. In the case of Docker, it corresponds to today's config.Container.

In addition to the BackendConfiguration that applies to the Backend for its entire lifetime, a Backend has to know how to turn a label like debian-latest into a virtual machine or container. As an added complication, debian-latest can mean something when a job comes from example.com and something completely different when it comes from codeberg.org, because Forgejo Runner supports connection-specific label configurations:

type LabelConfiguration interface {
    GetBackend() ID
    GetOptions() map[string]any
}

So far, labels were interpreted and validated by Forgejo Runner. That is no longer feasible with plugins and when there are many Backend-specific options. Therefore, from now on it is up to the Backend to interpret the LabelConfiguration. That's why Backend has a method called ValidateLabelConfiguration().

A label configuration can come in two different shapes:

runner:
  labels:
    freebsd-15: # Now comes the LabelConfiguration
      backend: docker
      backend-options:
        image: ghcr.io/freebsd/freebsd-runtime:15.0
        entrypoint: ["tail", "-f", "/dev/null"]
        init: true
        platform: freebsd/amd64

Or a single string for backwards compatibility and when using CLI options, which will be validated with ValidateLabelString():

freebsd-15:docker://ghcr.io/freebsd/freebsd-runtime:15.0?entrypoint=tail&entrypoint=-f&entrypoint=/dev/null&init=true&platform=freebsd/amd64

As far as I know, there is no standardized way to encode arrays as query parameters. We'll use whatever Go understands. Possible alternative: ParseLabelConfigurationString(str string) (LabelConfiguration, error). Forgejo Runner would then have to store the parsed LabelConfiguration.
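As a rough illustration of "whatever Go understands": the standard `net/url` package collects repeated query keys into a slice, so `entrypoint=tail&entrypoint=-f` comes back as a list. The sketch below assumes the label string has the shape `<name>:<backend>://<image>?<options>`; the `ParsedLabel` type and the `ParseLabelString` function are hypothetical names and not part of the proposal.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ParsedLabel is a hypothetical result type used only for this sketch.
type ParsedLabel struct {
	Name    string              // label name, e.g. "freebsd-15"
	Backend string              // URL scheme, e.g. "docker"
	Image   string              // host plus path of the URL
	Options map[string][]string // every option is kept as a list of values
}

// ParseLabelString splits "<name>:<backend>://<image>?<options>".
// The layout is an assumption derived from the example in the thread.
func ParseLabelString(s string) (ParsedLabel, error) {
	// Everything before the first colon is the label name.
	name, rest, ok := strings.Cut(s, ":")
	if !ok || name == "" {
		return ParsedLabel{}, fmt.Errorf("label %q has no name", s)
	}
	u, err := url.Parse(rest)
	if err != nil {
		return ParsedLabel{}, fmt.Errorf("label %q: %w", s, err)
	}
	if u.Scheme == "" {
		return ParsedLabel{}, fmt.Errorf("label %q has no backend", s)
	}
	return ParsedLabel{
		Name:    name,
		Backend: u.Scheme,
		Image:   u.Host + u.Path,
		Options: u.Query(), // repeated keys become multiple slice entries
	}, nil
}

func main() {
	label, err := ParseLabelString("freebsd-15:docker://ghcr.io/freebsd/freebsd-runtime:15.0?entrypoint=tail&entrypoint=-f&entrypoint=/dev/null&init=true&platform=freebsd/amd64")
	if err != nil {
		panic(err)
	}
	fmt.Println(label.Name, label.Backend, label.Image)
	fmt.Println(label.Options["entrypoint"])
}
```

Note that this keeps every option as a list of strings; turning `init=true` into a boolean would be left to the Backend that interprets the options.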

The runner has to know which Backend instances are available. So, each Backend has to register itself. Unfortunately, we have to read the runner configuration before we can initialize a Backend. We also need something that knows how to initialize a Backend. The runner doesn't know how to do that. An interface method on Backend is only available after the runner has created the Backend. So, we need a Factory that each Backend has to provide.

type Factory interface {
    CreateBackend(config map[string]any) (Backend, error)
}

map[string]any is what comes out of the runner configuration. How do we find the correct Factory?

var FactoryRegistry registry = registry{factories: map[ID]Factory{}}

type registry struct {
    factories map[ID]Factory
}

func (r *registry) Register(id ID, factory Factory) error {
	if _, has := r.factories[id]; has {
		return fmt.Errorf("backend ID is already taken: %q", id)
	}
	r.factories[id] = factory
	return nil
}

We also need something to hold onto all the Backend instances:

var BackendRegistry backendRegistry = backendRegistry{backends: map[ID]Backend{}}

type backendRegistry struct {
    backends map[ID]Backend
}

// Stub: the body is not worked out yet.
func (r *backendRegistry) Register(id ID, backend Backend) error {
    return nil
}

I don't love those global variables. But it's most likely the easiest approach to get started.

When the runner starts up, the various Factory instances register themselves by calling FactoryRegistry.Register(), for example, FactoryRegistry.Register("docker", &DockerFactory{}). They can either do that in an init function or whatever other mechanism we ultimately use for loading plug-ins.
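A minimal sketch of the init-based registration described above. The `Backend` interface is cut down to one method, `DockerFactory` and `dockerBackend` are empty placeholders, and the `Get` method is an assumption: the thread refers to `FactoryRegistry.Get(id)` but does not spell out its signature.

```go
package main

import "fmt"

type ID string

// Backend is reduced to a single method for this sketch.
type Backend interface {
	GetID() ID
}

type Factory interface {
	CreateBackend(config map[string]any) (Backend, error)
}

type registry struct {
	factories map[ID]Factory
}

// Package-level variables are initialized before any init function runs,
// so the map exists by the time backends register themselves.
var FactoryRegistry = registry{factories: map[ID]Factory{}}

func (r *registry) Register(id ID, factory Factory) error {
	if _, has := r.factories[id]; has {
		return fmt.Errorf("backend ID is already taken: %q", id)
	}
	r.factories[id] = factory
	return nil
}

// Get is an assumed counterpart to Register; its shape is not from the thread.
func (r *registry) Get(id ID) (Factory, error) {
	factory, has := r.factories[id]
	if !has {
		return nil, fmt.Errorf("no factory registered for backend %q", id)
	}
	return factory, nil
}

// DockerFactory and dockerBackend are placeholders without real behaviour.
type DockerFactory struct{}

type dockerBackend struct{}

func (dockerBackend) GetID() ID { return "docker" }

func (*DockerFactory) CreateBackend(config map[string]any) (Backend, error) {
	return dockerBackend{}, nil
}

// In a real layout this init would live in the backend's own package and
// run as a side effect of importing that package.
func init() {
	if err := FactoryRegistry.Register("docker", &DockerFactory{}); err != nil {
		panic(err)
	}
}

func main() {
	factory, err := FactoryRegistry.Get("docker")
	if err != nil {
		panic(err)
	}
	backend, err := factory.CreateBackend(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("created backend:", backend.GetID())
}
```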

So, now everything is in place for initializing the runner:

  1. Register all factories.
  2. Read the runner configuration.
  3. Ask Configuration for all Backends that should be enabled. Configuration figures that out by looping over all label configurations and collecting all backend IDs in a set.
  4. For each Backend ID:
    • Obtain the Factory using FactoryRegistry.Get(id).
    • Call Factory.CreateBackend() and pass the contents of config.Backend[id] as argument.
    • Store the resulting Backend instance in BackendRegistry.
  5. For each LabelConfiguration:
    • Invoke GetBackend() on the LabelConfiguration.
    • Ask BackendRegistry for the Backend with the given ID.
    • Invoke ValidateLabelConfiguration() on the Backend.
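Steps 3 to 5 could be sketched roughly as follows. To keep the example short, plain maps stand in for `FactoryRegistry`, `BackendRegistry`, and the runner configuration, and `LabelConfiguration` is passed as an interface value rather than a pointer. `InitBackends`, `staticLabel`, `fakeBackend`, and `fakeFactory` are hypothetical names used only for illustration.

```go
package main

import "fmt"

type ID string

type LabelConfiguration interface {
	GetBackend() ID
	GetOptions() map[string]any
}

type Backend interface {
	GetID() ID
	ValidateLabelConfiguration(labelConfig LabelConfiguration) error
}

type Factory interface {
	CreateBackend(config map[string]any) (Backend, error)
}

// InitBackends creates one Backend per backend ID referenced by a label and
// then lets each Backend validate the labels that point at it.
func InitBackends(
	factories map[ID]Factory,
	backendConfig map[ID]map[string]any,
	labels map[string]LabelConfiguration,
) (map[ID]Backend, error) {
	// Step 3: collect the set of backend IDs used by any label.
	ids := map[ID]struct{}{}
	for _, lc := range labels {
		ids[lc.GetBackend()] = struct{}{}
	}
	// Step 4: create and store one Backend per ID.
	backends := map[ID]Backend{}
	for id := range ids {
		factory, ok := factories[id]
		if !ok {
			return nil, fmt.Errorf("no factory registered for backend %q", id)
		}
		backend, err := factory.CreateBackend(backendConfig[id])
		if err != nil {
			return nil, fmt.Errorf("creating backend %q: %w", id, err)
		}
		backends[id] = backend
	}
	// Step 5: each Backend validates its own label configurations.
	for name, lc := range labels {
		if err := backends[lc.GetBackend()].ValidateLabelConfiguration(lc); err != nil {
			return nil, fmt.Errorf("label %q: %w", name, err)
		}
	}
	return backends, nil
}

// The types below are fakes that exist only to exercise InitBackends.
type staticLabel struct {
	backend ID
	options map[string]any
}

func (l staticLabel) GetBackend() ID             { return l.backend }
func (l staticLabel) GetOptions() map[string]any { return l.options }

type fakeBackend struct{ id ID }

func (b fakeBackend) GetID() ID { return b.id }

func (b fakeBackend) ValidateLabelConfiguration(lc LabelConfiguration) error {
	if _, ok := lc.GetOptions()["image"]; !ok {
		return fmt.Errorf("backend %q requires an image option", b.id)
	}
	return nil
}

type fakeFactory struct{ id ID }

func (f fakeFactory) CreateBackend(map[string]any) (Backend, error) {
	return fakeBackend{id: f.id}, nil
}

func main() {
	factories := map[ID]Factory{"docker": fakeFactory{id: "docker"}}
	labels := map[string]LabelConfiguration{
		"freebsd-15": staticLabel{
			backend: "docker",
			options: map[string]any{"image": "ghcr.io/freebsd/freebsd-runtime:15.0"},
		},
	}
	backends, err := InitBackends(factories, nil, labels)
	fmt.Println(len(backends), err)
}
```

One consequence of this ordering is that a Backend with no label pointing at it is never created, even if a factory for it is registered.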

When a new ExecutionEnvironment is required to run a job, the runner asks the Backend for a new instance by invoking CreateExecutionEnvironment(). The details are hazy because the runner is currently aware of the type of Backend it is talking to and I have yet to figure out how to make that agnostic.

The type Label becomes much simpler and will semantically match the Label in Forgejo, which should lead to less confusion.

All that is supposed to be fully backwards-compatible.

Author

Quick follow-up: I want to push for keeping just the plain gRPC approach (plugin is a standalone server, runner connects to it) and drop the go-plugin path.

Both are MVPs in my fork right now and nothing is production-ready yet; I kept go-plugin around to keep options open while we figured it out. So this isn't ripping out finished work, just picking a direction before we invest more.

The thing that pushed me there is that the plugin system is about whatever backend the community wants to build: Firecracker, Podman, Proxmox, Nomad, weird in-house stuff. Each of those has its own deployment shape. A k8s plugin wants to live in the cluster with its own ServiceAccount, a Firecracker one probably runs as a daemon on the hypervisor host, a Proxmox one needs API access from somewhere. The plain gRPC approach fits all of them because the runner just connects to an address (unix socket, local tcp, remote mTLS, whatever). The go-plugin path forces the plugin to be a child process of the runner with the runner's identity. That works for "binary on a host" but breaks the moment a backend wants its own creds, its own lifecycle, or to be shared across runners.

A standalone gRPC server can hold state across jobs — k8s informers, watch caches, kubeconfig, leader leases, connection pools. With go-plugin the subprocess is killed at the end of each job (Close() runs in cleanUpJobContainer), so every job re-parses the kubeconfig, re-opens the API connections, re-warms whatever caches the plugin has. That's real per-job latency for any backend that talks to a remote control plane.

And on "is it easy to write": my k8s plugin's main.go is ~150 lines. It is a plain gRPC server with a health service that listens on a socket and handles SIGTERM. Anyone who has touched gRPC in Go can ship a working plugin in an afternoon; the proto really does act as the docs. It also means that anyone can implement the proto in Rust, Python, or whatever else.

I'll open a PR to integrate the gRPC plugin.

Member

@eleboucher wrote in #107 (comment):

> A standalone gRPC server can hold state across jobs - k8s informers, watch caches, kubeconfig, leader leases, connection pools. With go-plugin the subprocess is killed at the end of each job (Close() runs in cleanUpJobContainer), so every job re-parses the kubeconfig, re-opens the API connections, re-warms whatever caches the plugin has. That's real per-job latency for any backend that talks to a remote control plane.

To me, it sounds like the problem here is the supposed plug-in abstraction if it doesn't allow the plug-in to hold onto state for the duration of Forgejo Runner's lifetime. That should be redesigned.

(I have no opinion on gRPC or one of the alternatives, at least not yet.)
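One way to picture a plug-in abstraction that keeps state for the runner's lifetime: the backend object is created once and its expensive setup runs only on first use, no matter how many jobs follow. This is a sketch under assumptions and not the runner's current design; the `client` type stands in for things like a parsed kubeconfig or a connection pool, and the string parameters are simplifications.

```go
package main

import (
	"fmt"
	"sync"
)

// client stands in for expensive, reusable state.
type client struct {
	endpoint string
}

// Backend lives as long as the runner process in this sketch.
type Backend struct {
	endpoint string

	once   sync.Once
	client *client
	setups int // counts how often the expensive setup ran; demo only
}

// getClient builds the client on first use and reuses it afterwards.
func (b *Backend) getClient() *client {
	b.once.Do(func() {
		b.setups++
		b.client = &client{endpoint: b.endpoint}
	})
	return b.client
}

// CreateExecutionEnvironment is called once per job. It returns a string
// here only to keep the example small.
func (b *Backend) CreateExecutionEnvironment(job string) string {
	c := b.getClient()
	return fmt.Sprintf("%s on %s", job, c.endpoint)
}

func main() {
	backend := &Backend{endpoint: "https://cluster.example"}
	fmt.Println(backend.CreateExecutionEnvironment("job-1"))
	fmt.Println(backend.CreateExecutionEnvironment("job-2"))
	fmt.Println("setup ran", backend.setups, "time(s)")
}
```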


I'll chime in and say that the xcaddy model is absolutely not ideal, because it forces users to build and manage an image that includes the relevant plugin. Even if you build the image, for Kubernetes you would need to push the modified image to a private registry for your nodes to pull it.

Usually people do not bother and rely on Docker images built by third parties, which can obviously create supply chain issues. With a plugin architecture, the separation of concerns is much better: users run the official Forgejo image and then add the plugins they want, which can talk to the runner over gRPC or an equivalent.

Using a Kubernetes runner is also not something marginal: it's widely used across all forges, from homelabs to enterprises. I think that's worth mentioning anyway.

Thanks!

Author

> To me, it sounds like the problem here is the supposed plug-in abstraction if it doesn't allow the plug-in to hold onto state for the duration of Forgejo Runner's lifetime. That should be redesigned.

By this I meant more the job state. For example, if someone writes a custom plugin for their own company, they will be able to isolate their state, and so on.

I opened the PR here: forgejo/runner#1500

Member

A plug-in interface is a significant commitment. Evolving Forgejo Runner is not easy. It already has to retain compatibility with GitHub Actions, multiple Forgejo versions, and previous versions of Forgejo Runner. A plug-in interface reduces the wiggle room even more because it creates an internal boundary that can only be changed rarely and carefully. Therefore, a lot has to happen before I am willing to consider committing to a plug-in interface:

  • Significant demand has to be demonstrated for a plug-in interface. By that I do not mean "I want plug-ins" or "I want to build plug-ins", but multiple credible endeavours for publishing and maintaining plug-ins. The idea is to avoid building something that makes the development of Forgejo Runner harder without it being used.
  • At least two to three proofs of concept have to exist. At least one of them should use virtual machines and create action and service containers inside virtual machines. That should give us confidence that the plug-in interface offers the right abstractions.
  • There has to be a testing/QA plan. When I change Forgejo Runner, I need to be confident that I do not break plug-ins. Relying on an interface is not sufficient. For example, strings can come in many shapes and forms.
  • There has to be a plan for plug-in configuration, discovery, and loading.

Right now, none of these points has been met. Exposing a plug-in interface is one of the last steps, not the first.

In the meantime, the existing interfaces can be cleaned up and improved. Perhaps other changes can be made to Forgejo Runner that make it easier to maintain forks with alternative back-ends and pave the way for a plug-in interface, like additional tests.

Author

I think a couple of these are actually closer than they look:

On testing: my PR already has act/plugin/testplugin/, a ~500 LOC host-mode reference plugin that implements the full proto with its own tests. Right now it's just a unit-test fixture. I can split that out into its own PR if it's useful to have that signal independently of whether the rest of the big PR lands. If you have something more specific in mind for what a testing plan should cover, I'd rather know upfront than guess at it.

On config/discovery: yeah, the loading story isn't fully written. My read is that it fits naturally on top of the Backend/Factory/LabelConfiguration design from your discussions. A gRPC plugin is just a Factory where CreateBackend dials a remote endpoint. I'd rather write that as a follow-up to your design than propose something that competes with it.

On demand: you're right, but there is also a huge demand for the Kubernetes plugin: https://codeberg.org/forgejo/discussions/issues/66

On the VM POC: only the k8s one exists and I can't realistically build a Firecracker plugin myself in any reasonable timeframe. If that's a hard gate, the most realistic path is probably @adamcharnock or whoever else is interested in that direction. I'm happy to help with proto changes, review, shaping the interface around what a VM backend actually needs, just can't be the one writing it.

I guess what we can try is to move Docker and LXC onto this interface as well, so that it becomes the source of truth and there is only one thing to maintain. Docker and LXC would be first-party plugins that work without having to be specified, and the rest would be community plugins.

Anyway, I'm not trying to rush #1500. I just want to make sure I'm working on the right things while the gates get resolved rather than waiting around. If the most useful thing right now is the incremental cleanup from the original issue (capability queries, opening label schemes, decoupling from Docker types) I can start there.

I'm interested in Firecracker support, but have no hard ETA

Author

@aahlenst I made forgejo/runner#1503 to clean up and improve the interface as discussed

Member

After looking at the interface changes proposed by @eleboucher (thanks again!), I think we have to spend some more time on improving the interface further.

A GitHub Actions workflow comes with certain expectations. runs-on is a VM and everything else runs inside that VM. container, services, and uses are all containers built around the semantics of Docker and compatible container runtimes. A plug-in interface has to preserve those semantics while allowing plug-ins to ignore them to a certain extent. For example, Forgejo Runner already cheats a little by not honouring runs-on when container.image is defined. However, the overall semantics are preserved. On the other hand, LXC violates most of the semantics, which leads to a pretty bad user experience.

Let's take an extreme example: Tart. It creates macOS virtual machines and interacts with them over SSH. If I wanted to create a plug-in for Tart, I would only want to start a Tart VM for runs-on. For everything else, I would have to somehow start containers inside that macOS VM, either using Docker Desktop, Podman Desktop, or Apple's container. That means that the interface has to express "Start an execution environment for a job", "Start a container for a job", "Start a service container", "Start a step container". Otherwise, I cannot preserve the workflow semantics. As far as I can see, ExecutionsEnvironment doesn't provide that information.

We need a better separation of concerns. For example, ConnectToNetwork() doesn't really make sense because not every plug-in has Docker-style networks. Some methods are too narrow. Forgejo Runner doesn't know when it's the right time to pull an image. That's up to the plug-ins.

Member

@eleboucher wrote in #107 (comment):

> I think a couple of these are actually closer than they look:

That is great. I do not want to discourage anybody or discount the existing achievements. My intent is to provide guidance and to manage expectations, including those of bystanders. However, that is my POV. It does not necessarily represent the POV of the other contributors.

In any case, I do not expect that you @eleboucher solve all problems on your own.

My personal preference right now is to concentrate on internal improvements of Forgejo Runner to prepare it for plug-ins. That should ideally happen in small, targeted steps like forgejo/runner#1503. Small steps are vastly easier to digest and make it much easier to provide meaningful feedback.

> I guess what we can try is actually to move docker and LXC to even use this interface so we have this as the source of truth and one thing to maintain, Docker and LXC will be 1st party plugin without the need to specify it and the rest will be community plugin.

My hunch is that we have to move in that direction and that it is a good idea, anyway.

Author

fair points, the gap is real. looking at my proto (act/plugin/proto/v1/plugin.proto in the plugin branch), Create bundles runs-on + job container + services into one RPC, and step-level docker actions aren't modelled at all, just gated by a supports_docker_actions capability. that works for k8s where the pod is everything, but it doesn't fit Tart where the plugin only owns the VM.

so yeah splitting helps:

  • CreateJobEnvironment for runs-on
  • CreateJobContainer for container.image
  • CreateServiceContainer for services:
  • CreateStepContainer for uses: docker:// and docker actions

a k8s plugin implements CreateJobEnvironment and absorbs the rest, a Tart plugin only implements CreateJobEnvironment and the runner falls back to a default docker driver for the others. that's the case that makes the split actually worth it: the existing docker code becomes a default driver any VM plugin can reuse instead of reimplementing it over SSH.

on ConnectToNetwork and image pull: already aligned on the proto side (no ConnectToNetwork RPC, pull folded into Create with a force_pull hint, plugin owns the timing). same treatment fits the in-tree interface.

Agreed on incremental, that's what i'd rather be doing. once #1503 is merged, rough sequence from here:

  1. move pull ownership out of the runner
  2. replace ConnectToNetwork with something model-agnostic
  3. split the create methods along those four lines

Each lands in isolation, and the Backend/Factory machinery from #105 slots in once the interface is in better shape. moving docker and LXC onto the same interface plugins use ends up being the natural endpoint, docker as default driver, LXC as built-in plugin, Tart only overrides the outer layer.

happy to drive any of these, let me know which is most useful next.

Author

actually ConnectToNetwork is dead code

Author

Thinking about Tart, I guess we want to express GitHub's four execution layers: runs-on, container:, services:, and uses: docker://. So I think this interface would suit well:

type Backend interface {
    GetID() ID
    Capabilities() Capabilities
    ValidateLabelConfiguration(*LabelConfiguration) error
    ValidateLabelString(string) error
    // Per-job factory. Replaces today's "set up the job container".
    CreateExecutionEnvironment(*Configuration, *LabelConfiguration, Label, JobEnvironmentSpec) (ExecutionEnvironment, error)
}

// ExecutionEnvironment is the outer per-job context (the runs-on layer).
// Owned by the back-end; outlives the containers spawned within it.
type ExecutionEnvironment interface {
    // The three inner-container factories. Back-ends that don't support a
    // given kind return ErrUnsupported.
    CreateJobContainer(JobContainerSpec) (Container, error)
    CreateServiceContainer(ServiceContainerSpec) (Container, error)
    CreateStepContainer(StepContainerSpec) (Container, error)

    GetActPath() string
    GetRoot() string
    GetRunnerContext(ctx context.Context) map[string]any

    // Tear down everything the environment owns (including any spawned
    // containers).
    Remove(ctx context.Context) error
}

// Container is a single running unit. Same lifecycle regardless of kind.
type Container interface {
    Start(ctx context.Context) error
    Exec(ExecSpec) common.Executor
    Copy(destPath string, files ...*FileEntry) common.Executor
    CopyDir(destPath, srcPath string, useGitIgnore bool) common.Executor
    GetContainerArchive(ctx context.Context, srcPath string) (io.ReadCloser, error)
    UpdateFromEnv(srcPath string, env *map[string]string) common.Executor
    UpdateFromImageEnv(env *map[string]string) common.Executor
    IsHealthy(ctx context.Context) (time.Duration, error)
    Remove(ctx context.Context) error
}

type Capabilities struct {
    ID                        ID
    ProtocolVersion           uint32
    ManagesOwnNetworking      bool
    SupportsContainerOverride bool  // gates workflow's `container:`
    SupportsServiceContainers bool  // gates workflow's `services:`
    SupportsDockerStepActions bool  // gates `uses: docker://` and runs.using: docker
    Extensions                []string
}

Back-ends that don't support a kind return ErrUnsupported. This is gated upfront via Capabilities, so the runner can fail early or ignore it.

  • docker: ExecutionEnvironment creates the network, implements all three inner factories. This impl lives in the runner and is the default driver.
  • host: refuses all three inner factories. Workflows using container: etc. fail upfront.
  • k8s: ExecutionEnvironment is the pod. Inner-job-container is the pod's main container, services are sidecars, step containers refused.
  • tart: boots a VM, dials Docker inside it, then we can use the docker environment.

If we manage to make docker reusable then Tart, Firecracker, anything that boots a VM with Docker inside becomes trivial.

Cross-checked against GitLab Runner and Woodpecker — both refuse nested docker actions on k8s by default; users opt into DinD via a privileged service container. So k8s refusing step containers is consistent with the rest of the industry.

Member

@eleboucher wrote in #107 (comment):

> runner falls back to a default docker driver for the others.

To me, it sounds like you're proposing that a workflow with runs-on: firecracker could get a service container outside the firecracker VM, provided by a Docker-compatible container engine.

I doubt this is a good choice, at least not by default. Either administrators have to be able to disable that (what happens in that case?) or it should lead to errors outright. Otherwise, workflow authors would be able to escape the confinement that runs-on should provide without the runner administrator being able to do anything about it.

For the time being, my preference is to be as predictable and secure as possible. That means that plug-ins either have to implement a capability or it leads to errors, period. There shouldn't be any opt-out, either.

I like your proposal for the interfaces.

> host: refuses all three inner factories. Workflows using container: etc. fail upfront.

I am not sure about that one. host basically imitates the GitHub Actions Runner. That means runs-on is ignored, all three container factories should produce actual containers as it does today.

> tart: boots a VM, dials Docker inside it, then we can use the docker environment.

How do you imagine that to work? With a remote Docker host? That is a very interesting idea 🤔 It would certainly simplify the implementation of plug-ins and be a lot easier than sending Docker commands to a VM. But it raises a lot of questions, especially when plug-ins do not live inside the same process.

Author

> I doubt this is a good choice, at least not by default. Either, administrators have to be able to disable that (what happens in that case?) or it should lead to errors outright. Otherwise, workflow authors would be able to escape the confinement that runs-on should provide without the runner administrator being able to do something against that.

I think i misled with the "runner falls back" phrasing. What i meant is that a plug-in explicitly returns a DockerExecutionEnvironment from its own CreateExecutionEnvironment. The runner never substitutes anything. If a plug-in says "no CreateStepContainer", that's an error, full stop.

That makes the security story straightforward: the plug-in owns the boundary. A Tart plug-in pointing its Docker client at the in-VM socket means every container runs inside the VM. The runner never reaches outside on its own.

> I am not sure about that one. host basically imitates the GitHub Actions Runner. That means runs-on is ignored, all three container factories should produce actual containers as it does today.

You're right, my bad. So host should implement all three inner factories, effectively the same impl as the docker back-end. That is actually a useful data point: host and docker would be able to share the same DockerExecutionEnvironment internally, which is exactly what we want out of pulling docker into a reusable piece.

On tart: yes, a remote or tunneled docker host. concretely, the plug-in boots the VM, opens an SSH tunnel to the docker unix socket inside it (or talks to a TLS-protected docker daemon on the VM's network), and points its docker client at that endpoint. containers spawned via that client run inside the VM.

i think the cleanest way to make "use the docker environment" work cross-process is to flip who owns the docker calls. instead of the plug-in implementing CreateJobContainer / CreateServiceContainer / CreateStepContainer itself and proxying to docker, the plug-in just exposes a docker endpoint and the runner uses it. to keep things simple, we always proxy.

concretely:

  1. runner calls CreateExecutionEnvironment on the plug-in.
  2. plug-in boots the VM, sets up a proxy to the docker daemon inside it, returns the endpoint and any TLS material in CreateExecutionEnvironmentResponse.
  3. runner's built-in docker driver dials that endpoint
  4. CreateJobContainer / CreateServiceContainer / CreateStepContainer never go to the plug-in. they go to the runner's docker code, talking to the in-VM daemon through the proxy
  5. plug-in only sees CreateExecutionEnvironment and RemoveExecutionEnvironment. it boots and tears down the VM and the proxy

This answers the cross-process question because the plug-in doesn't need any docker logic at all, no matter what language it's written in. plug-in authors only need to know how to start a proxy and expose a socket.
It also keeps the security story clean: the plug-in is responsible for what it exposes. as long as the proxy points at the in-VM daemon and not the host's, all containers run inside the VM. the runner connects to whatever the plug-in told it to, never picks an endpoint on its own

we'd express it in capabilities and the proto: a VM-like plug-in returns a delegate_to_docker block in CreateExecutionEnvironmentResponse with the endpoint + TLS material. Other plug-ins (k8s, etc.) don't return that block, and the runner sends them the inner Create RPCs as normal.
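
On the runner side, the routing decision described in the steps above might look like this sketch. All names are hypothetical Go views of the proposed messages (`DockerDelegate` for the delegate_to_docker block, `CreateEnvResponse` for CreateExecutionEnvironmentResponse), TLS material is omitted, and the two creators only return strings instead of talking to anything.

```go
package main

import "fmt"

// DockerDelegate is a hypothetical Go view of the delegate_to_docker block.
type DockerDelegate struct {
	Endpoint string // address of the plug-in's proxy; TLS material omitted
}

// CreateEnvResponse is a hypothetical stand-in for
// CreateExecutionEnvironmentResponse.
type CreateEnvResponse struct {
	EnvironmentID    string
	DelegateToDocker *DockerDelegate // nil for plug-ins such as k8s
}

// ContainerCreator is whoever handles the inner Create calls.
type ContainerCreator interface {
	CreateJobContainer(image string) (string, error)
}

// builtinDocker stands in for the runner's own docker driver.
type builtinDocker struct{ endpoint string }

func (d builtinDocker) CreateJobContainer(image string) (string, error) {
	return fmt.Sprintf("docker(%s) runs %s", d.endpoint, image), nil
}

// pluginRPC stands in for sending the inner Create RPC to the plug-in.
type pluginRPC struct{ envID string }

func (p pluginRPC) CreateJobContainer(image string) (string, error) {
	return fmt.Sprintf("plugin(%s) runs %s", p.envID, image), nil
}

// SelectCreator routes inner Create calls to the runner's docker driver
// when the plug-in delegated, and back to the plug-in otherwise. The
// runner never picks an endpoint on its own.
func SelectCreator(resp CreateEnvResponse) ContainerCreator {
	if resp.DelegateToDocker != nil {
		return builtinDocker{endpoint: resp.DelegateToDocker.Endpoint}
	}
	return pluginRPC{envID: resp.EnvironmentID}
}

func main() {
	vm := CreateEnvResponse{
		EnvironmentID:    "vm-1",
		DelegateToDocker: &DockerDelegate{Endpoint: "unix:///tmp/vm-1.sock"}, // made-up path
	}
	pod := CreateEnvResponse{EnvironmentID: "pod-1"}

	for _, resp := range []CreateEnvResponse{vm, pod} {
		out, _ := SelectCreator(resp).CreateJobContainer("alpine")
		fmt.Println(out)
	}
}
```

There is deliberately no third branch: if a plug-in neither delegates nor implements the inner RPCs, the call should surface as an error rather than fall back to a host daemon.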

Author

The extracted docker logic could look like this: forgejo/runner#1507

Member

@eleboucher wrote in #107 (comment):

> On tart, yes, remote or tunneled docker host.concretely the plug-in would boots the VM, opens an SSH tunnel to the docker unix socket inside it (or talks to a TLS-protected docker daemon on the VM's network), and points its docker client at that endpoint. containers spawned via that client run inside the VM.

That's something we should at least try manually before finalizing the API.

> i think the cleanest way to make "use the docker environment" work cross-process is to flip who owns the docker calls. instead of the plug-in implementing CreateJobContainer / CreateServiceContainer / CreateStepContainer itself and proxying to docker, the plug-in just exposes a docker endpoint and the runner uses it. to keep things simple, we always proxy.

Sounds sensible.

I've seen in your proposal that you're binding the client to the context. I'm not a fan, at least not yet, because it's implicit. GetDockerClient() is called all over the place and when the injected client isn't present, it creates a new instance using different rules. So it looks like a source of subtle bugs to me. Passing it as a function argument would be very explicit and match what we're trying to do: "Use this client to stop container X."

I don't know yet what exactly should be done instead. It's weird that a Client instance is attached to a containerReference and that there are so many free-floating functions. That makes it hard to scope and cache some information. Perhaps there could be some additional type like DockerEndpoint (not a great name) that is bound to a particular Docker socket and provides methods like RunnerArch() that can also be cached. Right now, we're calling GetHostInfo() over and over again and that tanks performance with Podman.
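
One possible shape for such a type, as a rough sketch: `DockerEndpoint` and `RunnerArch()` are the names floated above, while `HostInfo`, `NewDockerEndpoint`, and the injected `fetch` function are assumptions standing in for whatever GetHostInfo() really does.

```go
package main

import (
	"fmt"
	"sync"
)

// HostInfo is a hypothetical subset of what GetHostInfo() returns.
type HostInfo struct{ Architecture string }

// DockerEndpoint is bound to one socket and caches what it learns about it.
type DockerEndpoint struct {
	Socket string
	fetch  func(socket string) (HostInfo, error) // the slow daemon round trip

	once sync.Once
	info HostInfo
	err  error
}

// NewDockerEndpoint binds an endpoint to a socket and a lookup function.
func NewDockerEndpoint(socket string, fetch func(string) (HostInfo, error)) *DockerEndpoint {
	return &DockerEndpoint{Socket: socket, fetch: fetch}
}

// RunnerArch asks the daemon at most once. Note that this sketch also
// caches a failed lookup; retrying would need a different policy.
func (e *DockerEndpoint) RunnerArch() (string, error) {
	e.once.Do(func() { e.info, e.err = e.fetch(e.Socket) })
	return e.info.Architecture, e.err
}

func main() {
	calls := 0
	ep := NewDockerEndpoint("unix:///tmp/podman.sock", func(string) (HostInfo, error) {
		calls++ // counts how often the expensive lookup really runs
		return HostInfo{Architecture: "arm64"}, nil
	})

	for i := 0; i < 3; i++ {
		arch, err := ep.RunnerArch()
		fmt.Println(arch, err)
	}
	fmt.Println("daemon lookups:", calls)
}
```

Scoping the cache to the endpoint value means two sockets never share host info, which matters once a VM plug-in hands out its own endpoint.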

Is there a particular reason for the introduction of the alias dockercontainer? That makes the diff rather verbose.

Member

@whitequark wrote in #107 (comment):

> I'm interested in Firecracker support, but have no hard ETA

What about QEMU instead? Its MicroVM back-end is slower, but QEMU as a whole is way more versatile and is able to host other guests than Linux.

> What about QEMU instead? Its MicroVM back-end is slower, but QEMU as a whole is way more versatile and is able to host other guests than Linux.

I do want non-Linux guests (Windows is the major priority) so I am interested in that. I was initially planning to start with Firecracker to get decent Linux performance and then start doing harder things, but I am quite ignorant of the field all things considered so this may not be a good plan. (Other things have taken priority so I'm still in the "I really need to set up a lab and compare the available solutions" phase of the development.)

Author

@aahlenst I updated my PR with your feedback, and I think we are now in a good place. Could you have another look?

Author

@aahlenst It looks like the plugin PR is next. Or do you want me to handle #105 first?

Member

@eleboucher Oh, that would be terrific if you could start with #105. I don't have the cycles necessary to do it at the moment. ❤️

Author

@aahlenst wrote in #107 (comment):

> @eleboucher Oh, that would be terrific if you could start with #105. I don't have the cycles necessary to do it at the moment. ❤️

Here you go: https://code.forgejo.org/forgejo/runner/pulls/1571/files
