PRINCIPLES

We monitor everything, and we tell you first

Grafana stack deployed from day one. You hear about incidents from us, not from your customers. Postmortems within five business days.

Every system we operate has the Grafana stack deployed from day one. We instrument logs, metrics, and traces before we ship.

This is the most boring principle on this list. It is also the one that pays for itself every quarter.

You hear about incidents from us, not from your customers

This is the operating standard. If your checkout starts erroring, we should be paging ourselves before the first customer email lands. The alternative — your ops lead messaging us "hey is the site down?" — means we missed the window where we could have fixed it quietly.

What you get on an urgent incident

For every urgent incident, you get:

Acknowledgement within 1 hour, 24/7. This is the same SLA we publish on async and fast.
A status update every 2 hours until resolved. Written, in the incident channel, so anyone on your team can read the history.
A written postmortem within 5 business days. Including root cause, what we did to mitigate, and what we are changing so it does not happen again.

We do not hide incidents

There is a temptation — especially for vendors trying to look reliable — to downplay incidents, batch them into "minor issues," or fix them quietly without telling the client. We do the opposite. Every incident with customer impact gets surfaced, written down, and shared.

This is partly principle and partly self-interest. If we fix an issue without telling you, we miss the chance to align on whether the underlying architecture decision was right. The postmortem is where we earn or lose trust.

We do not blame upstream providers

If AWS goes down and our architecture did not tolerate it, that is on us.

"The cloud was down" is not an incident excuse. It is a design constraint we should have planned for. Multi-region failover, queue durability, retry-with-backoff — these are the things our pricing pays for. Blaming AWS for an outage we could have absorbed is admitting we did not do the work.

This does not mean every system is multi-region. It means that for every system, we have made an explicit decision about how it behaves when a dependency fails — and that decision is documented somewhere you can read.

WHAT NEXT

Sounds like a fit? Start a conversation.

We respond within one business day. If we are not the right team, we will tell you on the first call.

Talk to us See pricing