CloudWatch Dashboards for n8n Worker Health

n8n queue workers are where automation actually happens. If workers slow down, crash, or lose access to dependencies, the editor may still look fine while real work piles up behind the scenes.

For ECS deployments, CloudWatch dashboards should focus less on generic container uptime and more on whether automation is flowing.

Context

Problem: Worker failures can hide behind a healthy n8n web process.

Approach: Build dashboards around queue depth, worker saturation, execution outcomes, and dependency errors.

Outcome: Operators can see degraded automation before users report missed actions.

Dashboard sections

A useful dashboard should include:

  • ECS service desired count, running count, and pending count.
  • Worker CPU and memory utilization.
  • Task restarts and stopped reasons.
  • Application log error rate.
  • Redis queue waiting and active job counts.
  • Execution success, failure, and duration by workflow.
  • Downstream API rate limit or authentication errors.
  • Deployment markers by release ID.

Container metrics explain capacity. Execution metrics explain whether the capacity is producing useful work.
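As a rough illustration, the sections above could be assembled into a CloudWatch dashboard body like the sketch below. The cluster name, service name, region, and the `N8N/Workers` namespace with its metric names are all assumptions made for this example; n8n does not publish those custom metrics on its own, so something in your platform would have to emit them. The task count metrics assume Container Insights is enabled on the cluster.

```python
import json

# Hypothetical names: replace with your own cluster, service, and region.
CLUSTER = "automation-cluster"
SERVICE = "n8n-worker"
REGION = "us-east-1"
# Assumed custom namespace for queue and execution metrics you publish yourself.
CUSTOM_NS = "N8N/Workers"


def metric_widget(title, metrics, x, y, stat="Average", period=60):
    """Return one CloudWatch dashboard metric widget as a dict."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,
            "stat": stat,
            "period": period,
            "region": REGION,
            "view": "timeSeries",
        },
    }


def build_dashboard_body():
    """Build the JSON dashboard body: capacity widgets first, then flow widgets."""
    ecs_dims = ["ClusterName", CLUSTER, "ServiceName", SERVICE]
    widgets = [
        # Task counts come from Container Insights (must be enabled on the cluster).
        metric_widget("Worker task counts", [
            ["ECS/ContainerInsights", "DesiredTaskCount", *ecs_dims],
            ["ECS/ContainerInsights", "RunningTaskCount", *ecs_dims],
            ["ECS/ContainerInsights", "PendingTaskCount", *ecs_dims],
        ], x=0, y=0),
        metric_widget("Worker CPU and memory", [
            ["AWS/ECS", "CPUUtilization", *ecs_dims],
            ["AWS/ECS", "MemoryUtilization", *ecs_dims],
        ], x=12, y=0),
        # Custom metrics below are assumptions, not built-in n8n or AWS metrics.
        metric_widget("Redis queue depth", [
            [CUSTOM_NS, "QueueWaiting"],
            [CUSTOM_NS, "QueueActive"],
        ], x=0, y=6, stat="Maximum"),
        metric_widget("Execution outcomes", [
            [CUSTOM_NS, "ExecutionSucceeded"],
            [CUSTOM_NS, "ExecutionFailed"],
        ], x=12, y=6, stat="Sum"),
    ]
    return json.dumps({"widgets": widgets})


dashboard_body = build_dashboard_body()
```

The resulting string can be passed to boto3 as `cloudwatch.put_dashboard(DashboardName="n8n-worker-health", DashboardBody=dashboard_body)`, where the dashboard name is again only a placeholder. Placing capacity and flow widgets side by side makes it easier to spot the case where tasks look healthy but the queue keeps growing.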

Log metric filters

Start with a few high-signal log filters:

ERROR
credential
ECONNRESET
rate limit
Redis unavailable
Workflow execution failed

Then replace them with filters on structured log fields as the platform matures. Free-text filters are a starting point, not the final observability model.
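One way to turn the terms above into CloudWatch Logs metric filters is sketched below. The log group name, the namespace, and the naming scheme for filters and metrics are assumptions for illustration. Each term is wrapped in double quotes so that multi-word terms such as `rate limit` match as an exact phrase; unquoted, CloudWatch would match any event containing both words anywhere. Filter patterns are case sensitive, so the terms must match the casing your logs actually use.

```python
import re

# Hypothetical values: adjust to your own log group and metric namespace.
LOG_GROUP = "/ecs/n8n-worker"
NAMESPACE = "N8N/Workers"

# The starting terms listed in this section.
TERMS = [
    "ERROR",
    "credential",
    "ECONNRESET",
    "rate limit",
    "Redis unavailable",
    "Workflow execution failed",
]


def metric_name(term):
    """Turn a filter term into a metric name, e.g. 'rate limit' -> 'RateLimitCount'."""
    words = re.findall(r"[A-Za-z0-9]+", term)
    return "".join(w[0].upper() + w[1:] for w in words) + "Count"


def filter_request(term):
    """Build keyword arguments for the CloudWatch Logs PutMetricFilter API."""
    name = metric_name(term)
    return {
        "logGroupName": LOG_GROUP,
        "filterName": f"n8n-worker-{name}",
        # Double quotes make the term match as an exact phrase.
        "filterPattern": f'"{term}"',
        "metricTransformations": [{
            "metricName": name,
            "metricNamespace": NAMESPACE,
            "metricValue": "1",   # count one per matching log event
            "defaultValue": 0.0,  # report 0 when nothing matches
        }],
    }


filter_requests = [filter_request(term) for term in TERMS]
```

Each entry can then be applied with boto3, for example `boto3.client("logs").put_metric_filter(**request)`. Keeping the definitions as plain dictionaries makes them easy to review and to replace later with JSON field patterns once the logs are structured.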

Alerting strategy

Avoid alerting on every failed workflow. Security automation often handles malformed input, expired indicators, and temporary SaaS errors. Alert on conditions that represent platform degradation:

  • No workers running.
  • Age of the oldest waiting job above threshold.
  • Failure rate spike for a critical workflow.
  • Redis connection failures.
  • Database connection pool exhaustion.
  • Worker restart loop after deployment.

Tie alerts to runbooks that explain what to check first: deployment, Redis, database, downstream API status, and recent workflow changes.
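Two of the conditions above, no workers running and queue age above threshold, might be expressed as CloudWatch alarm definitions along the lines of this sketch. The cluster and service names, the SNS topic ARN, the runbook URL, and the `OldestWaitingJobAgeSeconds` custom metric are all placeholders assumed for the example; the running task count assumes Container Insights is enabled.

```python
# Hypothetical names and placeholders: none of these come from n8n or AWS defaults.
CLUSTER = "automation-cluster"
SERVICE = "n8n-worker"
CUSTOM_NS = "N8N/Workers"
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:n8n-platform-alerts"
RUNBOOK_URL = "https://wiki.example.com/runbooks/n8n-workers"


def no_workers_alarm():
    """Alarm when the worker service has had zero running tasks for 3 minutes."""
    return {
        "AlarmName": "n8n-no-workers-running",
        "AlarmDescription": f"No n8n workers running. Runbook: {RUNBOOK_URL}",
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            {"Name": "ClusterName", "Value": CLUSTER},
            {"Name": "ServiceName", "Value": SERVICE},
        ],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # Missing data is treated as a problem: no datapoints may mean no tasks.
        "TreatMissingData": "breaching",
        "AlarmActions": [ALERT_TOPIC_ARN],
    }


def queue_age_alarm(max_age_seconds=300):
    """Alarm when the oldest waiting job stays older than max_age_seconds for 5 minutes."""
    return {
        "AlarmName": "n8n-queue-age-high",
        "AlarmDescription": f"Jobs are waiting too long. Runbook: {RUNBOOK_URL}",
        "Namespace": CUSTOM_NS,
        # Assumed custom metric that your own publisher would need to emit.
        "MetricName": "OldestWaitingJobAgeSeconds",
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": max_age_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: a silent publisher is itself worth investigating.
        "TreatMissingData": "breaching",
        "AlarmActions": [ALERT_TOPIC_ARN],
    }


alarms = [no_workers_alarm(), queue_age_alarm()]
```

Each definition can be created with `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Putting the runbook link in the alarm description means whoever receives the notification starts from the checklist rather than from the raw metric.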

Blue team context

Worker health is also a security signal. A sudden spike in executions may be normal during an incident, or it may indicate webhook abuse. A new class of outbound errors may mean credentials were revoked, or that an attacker changed workflow behavior. Pair worker dashboards with security logs so operations and detection teams see the same timeline.
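To make the "sudden spike" idea concrete, here is a minimal sketch that compares the latest execution count against a recent baseline. The multiplier and the minimum count are arbitrary assumptions and would need tuning against real traffic. The function only flags candidates for review; it cannot tell incident-driven load from webhook abuse.

```python
from statistics import median


def is_execution_spike(baseline_counts, current_count, factor=3.0, min_count=50):
    """Return True when current_count is far above the recent baseline.

    baseline_counts: execution counts from earlier intervals (e.g. per 5 minutes).
    factor: how many times the median counts as a spike (assumed value).
    min_count: ignore small absolute numbers to avoid noise on quiet workflows.
    """
    if not baseline_counts:
        # Without history there is nothing to compare against.
        return False
    typical = median(baseline_counts)
    # max(typical, 1) keeps a zero baseline from making every execution a spike.
    return current_count >= min_count and current_count > factor * max(typical, 1)


# Example: a webhook-triggered workflow that normally runs about 10 times per interval.
recent = [10, 12, 9, 11]
flagged = is_execution_spike(recent, 80)
```

A flagged interval is a prompt to open the security logs for the same time window, for instance to check which source addresses hit the webhook, rather than a conclusion on its own.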

Takeaways

CloudWatch dashboards for n8n should show the health of automation, not just the health of containers. Track queue movement, execution quality, dependency errors, and release markers together.

This post is licensed under CC BY 4.0 by the author.