
Watch your tunnel from Grafana

Production tunnels are easier to operate when you can see their health at a glance. TestingBot Tunnel ships with a built-in Prometheus-compatible metrics endpoint, so you can scrape it from any monitoring stack you already run.

  • Prometheus
  • Grafana ready
  • Port 8003

Enable the metrics endpoint

The metrics endpoint is enabled by default on http://localhost:8003. Override the port with --metrics-port.

java -jar testingbot-tunnel.jar --metrics-port 9100

If the tunnel host is reachable from the network (for example, on a shared CI runner), protect the endpoint with HTTP Basic Auth:

java -jar testingbot-tunnel.jar --metrics-auth ops:s3cret

To keep credentials out of the process list and shell history, export TESTINGBOT_METRICS_AUTH instead of passing them on the command line. See the security guide.

Available metrics

Tunnel-specific series are prefixed with testingbot_. The endpoint also exposes the standard JVM and process metrics from the Prometheus Java client. The full label set is always documented at the /metrics endpoint itself.

Tunnel state

Metric | Type | Meaning
testingbot_tunnel_up | gauge | 1 when the tunnel is connected, 0 while reconnecting or down. The single most important alerting signal.
testingbot_tunnel_info | info | Static labels with the tunnel build (version, id, name). Use for dashboard headers and version filters.
testingbot_tunnel_uptime_seconds | counter | Seconds since this tunnel process started. Resets on restart, which makes it useful for detecting flapping.
testingbot_tunnel_reconnects_total | counter | Total tunnel reconnects since startup. Sustained increases indicate an unstable upstream link.
testingbot_active_connections | gauge | Number of in-flight client connections through the tunnel right now.
testingbot_tunnel_connect_duration_seconds | histogram | Time taken to establish the tunnel itself (cold-start latency). Exposed as the _bucket, _sum and _count series.

HTTP traffic

Metric | Type | Meaning
testingbot_http_requests_total | counter | HTTP requests proxied since startup. Labels: method, code. Use rate() for throughput and filter on code=~"5.." for errors.
testingbot_http_request_duration_seconds | histogram | End-to-end HTTP latency. Compute p50/p95/p99 with histogram_quantile().
testingbot_https_connect_total | counter | HTTPS CONNECT sessions established. Label: code.
testingbot_https_connect_duration_seconds | histogram | CONNECT handshake latency, suitable for histogram_quantile().
testingbot_https_connect_errors_total | counter | CONNECT errors. Label: reason (TLS, target unreachable, timeout, ...).
testingbot_proxy_bytes_transferred_total | counter | Total bytes proxied (both directions). Use rate() for throughput in B/s.
testingbot_errors_total | counter | Generic proxy errors. Label: name (the error class). Useful for alerting on sudden bursts.

JVM and process metrics

The endpoint also exposes the standard jvm_* and process_* series from the Prometheus Java client. These help you size the tunnel host, detect memory pressure and catch garbage-collection pauses.

Metric | Type | Meaning
jvm_memory_bytes_used | gauge | Bytes of heap and non-heap memory in use. Label: area (heap or nonheap).
jvm_memory_bytes_max | gauge | Maximum bytes available per memory area. Pair with _used to compute headroom.
jvm_memory_pool_bytes_used | gauge | Per-pool memory usage (Eden, Survivor, Old Gen, Metaspace, ...). Label: pool.
jvm_gc_collection_seconds_count | counter | Number of GC collections. Label: gc (collector name).
jvm_gc_collection_seconds_sum | counter | Total time spent in GC. Use rate() to see GC pressure over time.
jvm_threads_current | gauge | Live thread count. Watch for runaway growth.
jvm_threads_daemon | gauge | Daemon thread count.
jvm_threads_peak | gauge | Peak thread count since process start.
jvm_classes_loaded | gauge | Currently loaded classes.
jvm_buffer_pool_used_bytes | gauge | Direct and mapped NIO buffer usage. Label: pool.
process_cpu_seconds_total | counter | Total CPU time consumed by the process. Use rate() for CPU utilisation.
process_resident_memory_bytes | gauge | RSS memory of the tunnel process as reported by the OS.
process_open_fds | gauge | Open file descriptors. Compare with process_max_fds to spot leaks.
process_start_time_seconds | gauge | Unix epoch when the process started. Useful for restart detection.

All names above are the canonical Prometheus Java client names. Open http://localhost:8003/metrics to see the full live list and their # HELP / # TYPE annotations.

Prometheus scrape config

Add a job to your prometheus.yml that targets the tunnel host.

scrape_configs:
  - job_name: testingbot_tunnel
    static_configs:
      - targets: ["tunnel-host.example.com:8003"]
    basic_auth:
      username: ops
      password_file: /etc/prometheus/testingbot_metrics_password

Grafana dashboard

A ready-made Grafana dashboard ships with the tunnel source. Import it once and you have a full Overview / HTTP / Tunnel-health view in seconds.

Download testingbot-tunnel.json from the grafana-dashboard examples folder on GitHub. In Grafana go to Dashboards → New → Import, paste the JSON or upload the file, pick your Prometheus data source, and click Import.

Figure: The bundled dashboard rendered against a live tunnel, showing the Overview and HTTP panels.

The bundled dashboard groups panels into two rows:

  • Overview: Tunnel status (UP/DOWN), build version, active connections, uptime and reconnect counter.
  • HTTP: request rate by status class, HTTPS CONNECT rate, p50/p95/p99 latency and response throughput.

Want a turnkey setup? The repo also includes a docker-compose example that spins up Prometheus + Grafana already wired to scrape a local tunnel and load the dashboard automatically. Useful for local development or a single-machine CI runner.

git clone https://github.com/testingbot/Testingbot-Tunnel.git
cd Testingbot-Tunnel/examples/docker-compose-prometheus-grafana
docker compose up -d
# Grafana on http://localhost:3000  ·  Prometheus on http://localhost:9090

If you prefer to build the dashboard yourself, here are the most useful PromQL queries:

Tunnel up
max(testingbot_tunnel_up)
Active connections
sum(testingbot_active_connections)
2xx request rate
rate(testingbot_http_requests_total{code=~"2.."}[5m])
5xx error rate
rate(testingbot_http_requests_total{code=~"5.."}[5m])
p95 HTTP latency
histogram_quantile(0.95, sum by (le) (rate(testingbot_http_request_duration_seconds_bucket[5m])))
Bytes throughput
rate(testingbot_proxy_bytes_transferred_total[5m])
Reconnects
sum(testingbot_tunnel_reconnects_total)
JVM memory usage by area (heap and non-heap)
sum by (area) (jvm_memory_bytes_used)
Process CPU
rate(process_cpu_seconds_total[5m])
Threads
jvm_threads_current
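
If several dashboards or alerts evaluate the same expressions, the queries above could be precomputed as Prometheus recording rules. The following is a minimal sketch, not something shipped with the tunnel: the group name and the recorded series names are hypothetical (they follow the common level:metric:operations naming convention), and the job label assumes the scrape job is called testingbot_tunnel as in the scrape config shown earlier.

```yaml
# Hypothetical rule file, e.g. testingbot_tunnel.rules.yml, referenced from
# the rule_files section of prometheus.yml.
groups:
  - name: testingbot_tunnel_recording   # group name is an assumption
    rules:
      # Overall proxied request rate per scrape job.
      - record: job:testingbot_http_requests:rate5m
        expr: sum by (job) (rate(testingbot_http_requests_total[5m]))

      # Share of requests answered with a 5xx status code (0 to 1).
      - record: job:testingbot_http_requests_5xx:ratio_rate5m
        expr: |
          sum by (job) (rate(testingbot_http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(testingbot_http_requests_total[5m]))

      # p95 end-to-end HTTP latency in seconds.
      - record: job:testingbot_http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(testingbot_http_request_duration_seconds_bucket[5m]))
          )
```

Running promtool check rules against the file before reloading Prometheus catches syntax errors in the expressions.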

Alert ideas

  • Tunnel restart: fire when testingbot_tunnel_uptime_seconds is below 60 seconds.
  • Error spike: fire when the 5-minute rate of 5xx responses (testingbot_http_requests_total with code=~"5..") or of testingbot_errors_total exceeds your normal baseline.
  • Stalled connections: fire when testingbot_active_connections stays at zero during normal CI hours.
  • Endpoint unreachable: fire on Prometheus up == 0 for the tunnel target.
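
The ideas above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch under stated assumptions rather than a rule set shipped with the tunnel: the alert names, severities, for: durations, the 1 request/s error threshold and the 08:00 to 18:00 UTC window are all placeholders to adapt, and the job label assumes the scrape job is named testingbot_tunnel. Metric names are taken from the tables above.

```yaml
groups:
  - name: testingbot_tunnel_alerts   # group and alert names are assumptions
    rules:
      # The tunnel reports itself as disconnected (0 while reconnecting or down).
      - alert: TestingBotTunnelDown
        expr: max(testingbot_tunnel_up) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: TestingBot Tunnel is not connected

      # The tunnel process started less than 60 seconds ago.
      - alert: TestingBotTunnelRestarted
        expr: testingbot_tunnel_uptime_seconds < 60
        labels:
          severity: warning
        annotations:
          summary: TestingBot Tunnel process restarted

      # 5xx responses above a placeholder threshold of 1 request per second.
      - alert: TestingBotTunnelErrorSpike
        expr: 'sum(rate(testingbot_http_requests_total{code=~"5.."}[5m])) > 1'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Elevated 5xx rate through TestingBot Tunnel

      # No in-flight connections during assumed CI hours (hour() is in UTC).
      - alert: TestingBotTunnelStalled
        expr: (sum(testingbot_active_connections) == 0) and on() (hour() >= 8 < 18)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: No active connections through TestingBot Tunnel

      # Prometheus cannot scrape the metrics endpoint at all.
      - alert: TestingBotTunnelMetricsUnreachable
        expr: 'up{job="testingbot_tunnel"} == 0'
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: TestingBot Tunnel metrics endpoint is unreachable
```

With a scrape interval of a minute or longer, the uptime-based restart rule can miss a restart that falls between two scrapes. An alternative worth considering is changes(process_start_time_seconds[10m]) > 0, which fires whenever the start timestamp changed inside the window.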