Article URL: https://statusdude.com/blog/zero-downtime-docker-compose
Comments URL: https://news.ycombinator.com/item?id=48665130

There's a mass delusion in the industry that you need Kubernetes to run a serious production service. You don't. At StatusDude, we serve thousands of monitoring checks per minute, run multi-region workers, and deploy multiple times a day, all with Docker Compose and HAProxy. Zero dropped requests. Zero downtime. No etcd to babysit at 3 AM.

But we didn't start with HAProxy. We started with Traefik. That lasted about four hours.

Traefik is the popular choice for Docker-based setups. It auto-discovers services via Docker labels, has a slick dashboard, and the docs make it look effortless. We set up two backend replicas with Traefik labels, ran a rolling deploy, and watched everything fall apart.

Our first deploy strategy was to run a backend_new service alongside the existing backend during the transition. Both had the same Traefik routing labels: same Host rule, same service definition. Makes sense, right? You want both old and new to serve traffic during the cutover.

Traefik disagreed. Its Docker provider treats each Compose service as a separate configuration source. Two services with the same labels? "Service defined multiple times." 404 on every request. No fallback, no merge, just a flat refusal to route anything.

We reworked the approach to use docker compose up --scale backend=4 instead of a separate service. That avoided the label conflict. But it uncovered the next problem.

The rolling deploy strategy: scale up to 4 replicas (2 old + 2 new), then scale back down to 2 (keeping only the new ones). Simple enough. Except Traefik's internal routing table didn't update fast enough. We'd scale down from 4 to 2, and Traefik would keep routing to containers that were in the process of shutting down. 502s on every other request. The routing state lagged behind Docker's reality by several seconds, long enough to drop a significant chunk of traffic.
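To make the "every other request" failure rate concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions, not a model of Traefik's internals: it assumes a round-robin proxy, one request per tick, and a routing table that catches up a fixed number of ticks after the scale-down. The function name, container names, and parameters are all made up for illustration.

```python
def simulate(lag_ticks: int, total_ticks: int = 20, scale_down_at: int = 5) -> dict[int, int]:
    """Count 200s and 502s for a round-robin proxy with a lagging routing table.

    Hypothetical model: one request per tick; the proxy's table is refreshed
    `lag_ticks` ticks after the old containers are stopped.
    """
    old = ["old-1", "old-2"]
    new = ["new-1", "new-2"]

    live = set(old + new)   # state after scale-up: 2 old + 2 new replicas
    table = sorted(live)    # the proxy's view of the backends
    refresh_at = None       # tick at which the proxy learns about the scale-down
    results = {200: 0, 502: 0}

    for tick in range(total_ticks):
        if tick == scale_down_at:
            live = set(new)                 # old containers are stopped
            refresh_at = tick + lag_ticks   # the proxy finds out later

        if refresh_at is not None and tick >= refresh_at:
            table = sorted(live)            # routing table catches up
            refresh_at = None

        target = table[tick % len(table)]   # plain round robin
        status = 200 if target in live else 502
        results[status] += 1

    return results


if __name__ == "__main__":
    print("no lag:     ", simulate(lag_ticks=0))
    print("6-tick lag: ", simulate(lag_ticks=6))
```

With no lag, every request lands on a live container. With a lag, the stale table still lists four backends while only two are alive, so about half of the requests in that window hit a stopped container, which matches the pattern of 502s described above.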