SIGSEGV on forgejo-runner daemon -c /etc/forgejo-runner.yaml #146

Open
opened 2024-01-20 14:19:04 +00:00 by neuhalje · 2 comments

Observed

Running forgejo-runner daemon -c /etc/forgejo-runner.yaml (which points to the default file; same result without -c) on my server crashes every time (and at the same line). It does that for all tried versions, on bare metal, and even in a fully virtualized qemu. I am thinking of hiring an exorcist.

If I run the same command and config on another system it starts up flawlessly.

time="2024-01-20T13:52:02Z" level=info msg="log level changed to debug" func="[initLogging]" file="[daemon.go:158]"
time="2024-01-20T13:52:02Z" level=info msg="Starting runner daemon" func="[func6]" file="[daemon.go:38]"
time="2024-01-20T13:52:02Z" level=debug msg="gc: 2024-01-20 13:52:02.7076325 +0000 UTC m=+0.005817793" func="[gcCache]" file="[handler.go:439]" module=artifactcache
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xbe0cab]
goroutine 1 [running]:
gitea.com/gitea/act_runner/internal/app/cmd.Execute.runDaemon.func6(0xc0000e9c00?, {0xc000239700?, 0x4?, 0xdbbbfc?})
       /srv/internal/app/cmd/daemon.go:110 +0x70b
github.com/spf13/cobra.(*Command).execute(0xc000004c00, {0xc0002396e0, 0x2, 0x2})
       /go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940 +0x87c
github.com/spf13/cobra.(*Command).ExecuteC(0xc000004300)
       /go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5
github.com/spf13/cobra.(*Command).Execute(...)
       /go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992
gitea.com/gitea/act_runner/internal/app/cmd.Execute({0xf30558?, 0xc00021d3c0})
       /srv/internal/app/cmd/cmd.go:84 +0x8f6
main.main()
       /srv/main.go:18 +0x7b

What and where

Versions used

  • all of forgejo-runner 3.0.0 / forgejo-runner 3.10 / forgejo-runner 3.20 / forgejo-runner 3.3.0
  • downloaded as binary or as container

Environments

On this specific server all attempts fail in the same way and 100% of the time.

"Bare metal" (Ubuntu Mantic)

  • running the downloaded version
  • running official container in podman
lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 3700X 8-Core Processor
    CPU family:          23
    Model:               113
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  51%
    CPU max MHz:         4426.1709
    CPU min MHz:         2200.0000
    BogoMIPS:            7186.25
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    4 MiB (8 instances)
  L3:                    32 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec rstack overflow:  Mitigation; safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

hardware virtualized

Tried both Ubuntu Jammy and Mantic. Running plain binary, container in podman, or as container in docker-ce.

sudo qemu-system-amd64 \
    -net nic                                                    \
    -net user                                                   \
    -machine accel=kvm:tcg                                      \
    -cpu host                                                   \
    -m 8192                                                      \
    -nographic                                                  \
    -hda jammy-server-cloudimg-amd64.img                        \
    -smbios type=1,serial=ds='nocloud;s=http://10.1.0.1:8000'

fully virtualized

Running on a fully virtualized qemu with emulated CPU. This was my last try to rule out CPU specifics.

qemu-system-amd64 \
    -net nic                                                    \
    -net user                                                   \
    -machine pc-q35-mantic                                      \
    -cpu kvm64                                                   \
    -m 8192                                                      \
    -nographic                                                  \
    -hda jammy-server-cloudimg-amd64.img                        \
    -smbios type=1,serial=ds='nocloud;s=http://10.1.0.1:8000'
earl-warren added the Kind/Bug label 2024-01-20 18:17:53 +00:00
Owner

> I am thinking of hiring an exorcist

😄

It should gracefully show an error message instead of crashing like that. The root cause is that one of resp.Msg.Runner.Name, resp.Msg.Runner.Version, resp.Msg.Runner.Labels is reached through an invalid address. The code does not verify any of that; it assumes resp is always good.

Since it runs well on other machines, could it be that the network is interfering? These are the very first network packets exchanged between the runner and the server, so it may be worth taking a quick look at what tcpdump or wireshark sees.

I would recommend trying to recompile on the machine to be 100% sure it is not a binary generation problem and rule that out entirely. Note that the binary is static and does not rely on any shared library. But it does rely on the kernel ABI so it is worth a shot.

After recompiling, if that still fails, you will have the option of adding extra verification to get more clues.

I must say it is puzzling.

Author
  1. Where did the empty response come from?
  2. Solution for my situation
  3. Fixing the root cause

Looking into networking was the solution!

The server delivered different data depending on where I connected from. Running curl against / shows this:

  • curl from my laptop gets HTTP 200 with content (cookies, html)
  • curl from the server itself gets HTTP 200 with 0 bytes data

An HTTP 200 response without content is parsed by forgejo-runner in a way that leaves one of these as nil, which then crashes the formatter.

  • resp
  • resp.Msg
  • resp.Msg.Runner
  • resp.Msg.Runner.Name
  • resp.Msg.Runner.Version
  • resp.Msg.Runner.Labels
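To illustrate how a missing link in that chain turns into a panic, and how a guard avoids it, here is a small self-contained sketch. The types `Runner`, `DeclareResponse` and `Response` are hypothetical stand-ins, not the runner's generated types, and it is an assumption that `Name` and `Version` are plain strings (which cannot be nil in Go), so only the pointers leading to them need checking.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the response types; the field layout is an assumption.
type Runner struct {
	Name    string
	Version string
	Labels  []string
}

type DeclareResponse struct{ Runner *Runner }

type Response struct{ Msg *DeclareResponse }

var errEmptyResponse = errors.New("invalid or empty response from forgejo instance")

// describe formats the runner details, but only after checking every
// pointer on the path. Without the guard, a nil Runner would panic with
// "invalid memory address or nil pointer dereference".
func describe(resp *Response) (string, error) {
	if resp == nil || resp.Msg == nil || resp.Msg.Runner == nil {
		return "", errEmptyResponse
	}
	r := resp.Msg.Runner
	return fmt.Sprintf("runner: %s, with version: %s, with labels: %v", r.Name, r.Version, r.Labels), nil
}

func main() {
	// A message with no Runner set, as an empty body might plausibly decode to.
	if _, err := describe(&Response{Msg: &DeclareResponse{}}); err != nil {
		fmt.Println("error:", err)
	}

	ok := &Response{Msg: &DeclareResponse{Runner: &Runner{
		Name:    "r1",
		Version: "v3.3.0",
		Labels:  []string{"docker"},
	}}}
	if s, err := describe(ok); err == nil {
		fmt.Println(s)
	}
}
```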

Where did the empty response come from?

The forgejo instance runs in a container on a server and is exposed via a reverse proxy. In this case the proxy was Caddy.

This Caddy instance had geo blocking (any other filtering would have done the same) configured: It only proxied to the forgejo instance for certain countries.

Caddy has a quite peculiar behavior: If a server is configured (my.example.com) but has no rule on what to do with the requests, it returns an empty HTTP 200 response.

In the cases where forgejo-runner crashed, it accessed the instance either from the host IP (which is in one of the blocked countries - don’t ask) or, when running from a VM, from an RFC 1918 address, which does not belong to any allowed country either. This made Caddy return the empty HTTP 200 response, which made forgejo-runner crash.
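Private addresses are not assigned to any country, so a country allow-list has nothing to match them against. As a rough sketch of how to check whether a client address falls into that category, the helper below (the name `isPrivate` is hypothetical) uses the standard library's `netip.Addr.IsPrivate`, which covers RFC 1918 for IPv4 and RFC 4193 for IPv6. Apart from 10.1.0.1, which appears in the qemu command above, the sample addresses are illustrative only.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isPrivate reports whether s parses as a private address
// (RFC 1918 for IPv4, RFC 4193 for IPv6).
func isPrivate(s string) (bool, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return false, err
	}
	return addr.IsPrivate(), nil
}

func main() {
	for _, s := range []string{"10.1.0.1", "192.168.1.20", "8.8.8.8"} {
		private, err := isPrivate(s)
		if err != nil {
			fmt.Println("cannot parse", s, ":", err)
			continue
		}
		fmt.Printf("%-14s private=%t\n", s, private)
	}
}
```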

Solution for my situation

I added an IP filter to the Caddy configuration that allows the local addresses.

Fixing the root cause

The response object needs to be validated for empty fields. I have no golang experience (and no compiler installed), but I would think of something like this:

diff --git a/internal/app/run/runner.go b/internal/app/run/runner.go
index 0884c50..7948726 100644
--- a/internal/app/run/runner.go
+++ b/internal/app/run/runner.go
@@ -231,8 +231,18 @@ func (r *Runner) run(ctx context.Context, task *runnerv1.Task, reporter *report.
 }
 
 func (r *Runner) Declare(ctx context.Context, labels []string) (*connect.Response[runnerv1.DeclareResponse], error) {
-	return r.client.Declare(ctx, connect.NewRequest(&runnerv1.DeclareRequest{
+	resp, err := r.client.Declare(ctx, connect.NewRequest(&runnerv1.DeclareRequest{
 		Version: ver.Version(),
 		Labels:  labels,
 	}))
+	if err != nil {
+		return nil, err
+	}
+
+	// Name and Version are assumed to be plain strings, which cannot be nil in Go,
+	// so only the pointers on the way to them are checked.
+	if resp == nil || resp.Msg == nil || resp.Msg.Runner == nil {
+		return nil, fmt.Errorf("invalid or empty response from forgejo instance")
+	}
+	return resp, nil
 }
Reference: forgejo/runner#146