Monitor Temporal Cloud
Temporal Cloud metrics help you monitor production deployments. This page covers best practices for monitoring Temporal Cloud.
Monitor availability issues
When you see a sudden drop in Worker resource utilization, verify whether Temporal Cloud's API is showing increased latency and error rates.
Reference Metrics
- temporal_cloud_v0_service_latency_bucket
- temporal_cloud_v0_service_latency_sum
- temporal_cloud_v0_service_latency_count

This metric measures latency for the SignalWithStartWorkflowExecution, SignalWorkflowExecution, and StartWorkflowExecution operations.
These operations are mission critical and never throttled.
This metric is a good indicator of your lowest possible latency.
Prometheus Query for this Metric
P99 service latency (histogram):
histogram_quantile(0.99, sum(rate(temporal_cloud_v0_service_latency_bucket[$__rate_interval])) by (temporal_namespace, operation, le))
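If you also want an average latency alongside the tail percentile, the same sum/count pattern used for replication lag below applies. This is a minimal sketch, assuming the temporal_cloud_v0_service_latency_sum and temporal_cloud_v0_service_latency_count series listed above are scraped in your setup:
Average service latency:
sum(rate(temporal_cloud_v0_service_latency_sum[$__rate_interval])) by (temporal_namespace, operation)
/
sum(rate(temporal_cloud_v0_service_latency_count[$__rate_interval])) by (temporal_namespace, operation)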
Monitor Temporal Service errors
Check for Temporal Service gRPC API errors. Note that Service API errors do not map directly to the guarantees described in the Temporal Cloud SLA.
Reference Metrics
- temporal_cloud_v0_frontend_service_request_count
- temporal_cloud_v0_frontend_service_error_count
Prometheus Query for this Metric
Measure your daily average success rate (the fraction of requests that did not return an error), computed over 10-minute windows:
avg_over_time((
(
(
sum(increase(temporal_cloud_v0_frontend_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
-
sum(increase(temporal_cloud_v0_frontend_service_error_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
)
/
sum(increase(temporal_cloud_v0_frontend_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
)
or vector(1)
)[1d:10m])
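To see which operations contribute the errors, the same two metrics can be broken down per operation. A sketch following the pattern above:
Per-operation error rate:
sum(rate(temporal_cloud_v0_frontend_service_error_count{temporal_namespace=~"$namespace"}[$__rate_interval])) by (operation)
/
sum(rate(temporal_cloud_v0_frontend_service_request_count{temporal_namespace=~"$namespace"}[$__rate_interval])) by (operation)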
Detect Activity and Workflow failures
The metrics temporal_activity_execution_failed and temporal_cloud_v0_workflow_failed_count together provide failure detection for Temporal applications. These metrics work in tandem to give you both granular component-level visibility and high-level workflow health insights.
Activity failure cascade
If you are not using infinite Retry Policies, Activity failures can lead to Workflow failures:
Activity Failure --> Retry Logic --> More Activity Failures --> Workflow Decision --> Potential Workflow Failure
Activity failures are often recoverable and expected, while Workflow failures represent terminal states that require immediate attention. A spike in Activity failures may precede Workflow failures. In general, Temporal recommends designing Workflows to always succeed. If an Activity fails more times than its Retry Policy allows, we suggest having the Workflow handle the failure and notify a human who can take corrective action or at least be aware of the error.
Ratio-based monitoring
Failure conversion rate
Monitor the ratio of workflow failures to activity failures:
workflow_failure_rate = temporal_cloud_v0_workflow_failed_count / temporal_activity_execution_failed
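A rough PromQL sketch of this ratio. Note that temporal_cloud_v0_workflow_failed_count comes from the Temporal Cloud metrics endpoint while temporal_activity_execution_failed is an SDK metric, so both must be scraped into the same Prometheus, and SDK metric names and prefixes vary by SDK and exporter configuration:
sum(rate(temporal_cloud_v0_workflow_failed_count{temporal_namespace=~"$namespace"}[$__rate_interval]))
/
sum(rate(temporal_activity_execution_failed[$__rate_interval]))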
What to watch for:
- High ratio (greater than 0.1): Poor error handling; Activity failures are propagating into Workflow failures
- Low ratio (less than 0.01): Good resilience; Activities fail but Workflows recover
- Sudden spikes: May indicate systemic issues
Activity success rate
activity_success_rate = (total_activities - temporal_activity_execution_failed) / total_activities
Target: >95% for most applications. A lower success rate can be a sign of system trouble.
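A corresponding PromQL sketch, assuming your SDK exports the temporal_activity_execution_latency histogram and that its _count series is a reasonable proxy for total Activity executions (metric names, prefixes, and exactly which executions are counted vary by SDK and exporter configuration):
(
sum(rate(temporal_activity_execution_latency_count[$__rate_interval]))
-
sum(rate(temporal_activity_execution_failed[$__rate_interval]))
)
/
sum(rate(temporal_activity_execution_latency_count[$__rate_interval]))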
Monitor replication lag for Namespaces with High Availability features
Replication lag refers to the transmission delay of Workflow updates and history events from the primary Namespace to the replica. Always check the replication lag metric before initiating a failover. A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress.
Who owns the replication lag? Temporal owns replication lag.
What guarantees are available? There is no SLA for replication lag. Temporal recommends that customers do not trigger failovers except for testing or emergency situations. The High Availability feature's four-nines (99.99%) SLA means that Temporal handles failovers and ensures high availability. Temporal also monitors replication lag. Customers who decide to trigger failovers should look at this metric before moving forward.
If the lag is high, what should you do? We don't expect users to fail over. Please contact Temporal support if you feel you have a pressing need.
Where can you read more? See operations and metrics for Namespaces with High Availability features.
Reference Metrics
- temporal_cloud_v0_replication_lag_bucket
- temporal_cloud_v0_replication_lag_sum
- temporal_cloud_v0_replication_lag_count
Prometheus Query for this Metric
P99 replication lag (histogram):
histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le))
Average replication lag:
sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace)
/
sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace)
Monitor usage and detect resource exhaustion from Namespace RPS and APS rate limits
The Cloud metric temporal_cloud_v0_resource_exhausted_error_count is the primary indicator of Cloud-side throttling, signaling that Namespace limits are being hit and ResourceExhausted gRPC errors are occurring. This generally does not break Workflow processing because of how resources are prioritized. In fact, some workloads run with a high rate of resource exhaustion errors because they are not latency sensitive. Being APS- or RPS-constrained can slow down throughput and is a good indicator that you should request additional capacity.
To identify specifically whether RPS or APS limits are being hit, this metric can be filtered using the resource_exhausted_cause label, which shows values like ApsLimit or RpsLimit. This also helps identify the specific operation that was throttled (for example, polling or responding to Activity Tasks).
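For example, to see which limit is being hit per Namespace, a sketch (you can additionally group by the operation label if it is present in your setup):
sum(rate(temporal_cloud_v0_resource_exhausted_error_count{temporal_namespace=~"$namespace"}[$__rate_interval])) by (temporal_namespace, resource_exhausted_cause)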
Related useful information:
- Namespace Limits (APS is visible in the Namespace UI)
- temporal_cloud_v0_total_action_count: Useful for tracking the overall action rate (APS); see the query sketches after this list
- temporal_cloud_v0_frontend_service_request_count: Useful for tracking the request rate (RPS)
- SDK metric long_request_failure with cause resource_exhausted
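Sketches for tracking these rates, using the same labels as the queries above (adjust the filters to your environment):
Approximate actions per second (APS):
sum(rate(temporal_cloud_v0_total_action_count{temporal_namespace=~"$namespace"}[$__rate_interval])) by (temporal_namespace)
Approximate requests per second (RPS):
sum(rate(temporal_cloud_v0_frontend_service_request_count{temporal_namespace=~"$namespace"}[$__rate_interval])) by (temporal_namespace)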