ClickHouse queries for building dashboards and alerts

SigNoz gives you the ability to write powerful ClickHouse queries to plot charts. You can run SQL queries supported by ClickHouse to extract immense value out of your distributed tracing data or logs data.

Info

The distributed tables in ClickHouse have been named by prefixing distributed_ to existing single shard table names. If you want to use ClickHouse queries in dashboard or alerts, you should use the distributed table names. Eg, signoz_index_v3 now corresponds to the table of a single shard. To query all the shards, query against distributed_signoz_index_v3.


Sharing a few examples of some queries which might be helpful in building dashboards.

GroupBy a tag/attribute in distributed tracing data

SELECT toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS interval, 
attributes_string['http.method'] AS method,
toFloat64(avg(durationNano)) AS value 
FROM signoz_traces.distributed_signoz_index_v3
WHERE attributes_string['http.method']!=''
AND timestamp > now() - INTERVAL 30 MINUTE
AND ts_bucket_start >= toUInt64(toUnixTimestamp(now() - toIntervalMinute(30))) - 1800
GROUP BY (method, interval) order by (method, interval) ASC;

Show count of each customer_id which is present as attribute of a span event

WITH arrayFilter(x -> JSONExtractString(x, 'name')='Getting customer', events) AS filteredEvents
SELECT toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS interval, toFloat64(count()) AS count, 
arrayJoin(arrayMap(x -> JSONExtractString(JSONExtractString(x, 'attributeMap'), 'customer_id'), filteredEvents)) AS resultArray 
FROM signoz_traces.distributed_signoz_index_v3
WHERE not empty(filteredEvents) 
AND timestamp > toUnixTimestamp(now() - INTERVAL 30 MINUTE) 
GROUP BY (resultArray, interval) order by (resultArray, interval) ASC;

Avg latency between 2 spans of interest (part of the trace tree)

SELECT
    interval,
    round(avg(time_diff), 2) AS result
FROM
(
    SELECT
        interval,
        traceID,
        if(startTime1 != 0, if(startTime2 != 0, (toUnixTimestamp64Nano(startTime2) - toUnixTimestamp64Nano(startTime1)) / 1000000, nan), nan) AS time_diff
    FROM
    (
        SELECT
            toStartOfInterval(timestamp, toIntervalMinute(1)) AS interval,
            traceID,
            minIf(timestamp, if(resource_string_service$$name='driver', if(name = '/driver.DriverService/FindNearest', if((resources_string['component']) = 'gRPC', true, false), false), false)) AS startTime1,
            minIf(timestamp, if(resource_string_service$$name='route', if(name = 'HTTP GET /route', true, false), false)) AS startTime2
        FROM signoz_traces.distributed_signoz_index_v3
        WHERE (timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}) AND (ts_bucket_start BETWEEN {{.start_timestamp}} - 1800 AND {{.end_timestamp}}) AND (resource_string_service$$name IN ('driver', 'route'))
        GROUP BY (interval, traceID)
        ORDER BY (interval, traceID) ASC
    )
)
WHERE isNaN(time_diff) = 0
GROUP BY interval
ORDER BY interval ASC;

Show sum of values of customer_id which is present as attribute of a span event

WITH arrayFilter(x -> JSONExtractString(x, 'name')='Getting customer', events) AS filteredEvents
SELECT toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS interval, toFloat64(sum(toInt32(resultArray))) AS sum, 
arrayJoin(arrayMap(x -> JSONExtractString(JSONExtractString(x, 'attributeMap'), 'customer_id'), filteredEvents)) AS resultArray 
FROM signoz_traces.distributed_signoz_index_v3
WHERE not empty(filteredEvents) and timestamp > toUnixTimestamp(now() - INTERVAL 30 MINUTE) 
AND ts_bucket_start >= toUInt64(toUnixTimestamp(now() - toIntervalMinute(30))) - 1800
GROUP BY (resultArray, interval) order by (resultArray, interval) ASC;

Plotting a chart on 100ms interval

Plot a chart of 1 minute showing count of spans in 100ms interval of service frontend with duration > 50ms

SELECT fromUnixTimestamp64Milli(intDiv( toUnixTimestamp64Milli ( timestamp ), 100) * 100) AS interval, 
toFloat64(count()) AS count 
FROM (
 SELECT timestamp 
 FROM signoz_traces.distributed_signoz_index_v3
 WHERE resource_string_service$$name='frontend' 
 AND durationNano>=50*exp10(6) 
 AND timestamp > now() - INTERVAL 1 MINUTE AND ts_bucket_start >= toUInt64(toUnixTimestamp(now() - toIntervalMinute(1))) - 1800)
 GROUP BY interval 
 ORDER BY interval ASC;

Show count of loglines per minute

SELECT toStartOfInterval(fromUnixTimestamp64Nano(timestamp), INTERVAL 1 MINUTE) AS interval, 
toFloat64(count()) AS value 
FROM signoz_logs.distributed_logs_v2
WHERE timestamp > toUnixTimestamp64Nano(now64() - INTERVAL 30 MINUTE) 
AND ts_bucket_start >= toUInt64(toUnixTimestamp(now() - toIntervalMinute(30))) - 1800
GROUP BY interval 
ORDER BY interval ASC;

While writing queries for logs table, if you want to use an attribute/resource in your query you will have to reference it in the following format <type>_<dataType>_value[indexOf(<type>_<dataType>_key, <keyname>)]

where type can be attributes/resources , dataType can be int64/float64/string and keyname is the name of the key.

Eg: If your keyname is status of dataType string and type attribute, it needs to be referenced as attributes_string_value[indexOf(attributes_string_key, 'status')]

📝 Note

In the above example, if status is an selected field, then it can be referenced as attribute_string_status

Building alert queries with ClickHouse data

The most common use case for alerts is to send notifications when a threshold condition is met. For example, conditions like CPU usage exceeds 70% (CPU Usage > 70) or failures in a service exceed your expectations (service error count > 10).

Every alert is composed of a threshold, date range (e.g. last 5 mins) and match condition (e.g. at least once). The rule engine is responsible for evaluating these conditions and triggering notifications. SigNoz allows you a complete flexibility in choosing all three conditions to meet your business requirements.

The evaluation is also based on the source or metric data, the left-hand side of the threshold condition - consider CPU Usage (LHS) > 70 (RHS).

The query editor (ClickHouse tab in the alert form) allows you to define a query to fetch the metric (source) data from ClickHouse and evaluate if the threshold condition occurs in it. The query format requires use of following reserved columns and parameters which simplify the syntax and support dynamic parts.

Reserved Aliases (for use in SELECT, GROUP BY only): value, res, result, interval

Reserved Parameters (for use in WHERE condition only):

  • {{.start_datetime}} {{.end_datetime}}
  • {{.start_timestamp}} {{.end_timestamp}}
  • {{.start_timestamp_ms}} {{.end_timestamp_ms}}
  • {{.start_timestamp_nano}} {{.end_timestamp_nano}}

Let's explore sample queries to use ClickHouse data in alert evaluation.

The rule engine is always looking for special column aliases. When the query returns data, the rule engine looks into the result for a column with alias value or result and compares the vector of all those values with the threshold (defined in alert form).

For example, the following query retrieves not found errors of a service. When this query returns data, the rule engine will compare value from each row with the threshold set in alert form.

SELECT count() AS value  
FROM signoz_traces.distributed_signoz_error_index_v2
WHERE (serviceName='api-service’) 
AND (exceptionType='Not found’);

If the same query is written without the column alias value or result, the rule engine will ignore the metric and fail to arrive at a threshold match.

Second reserved column alias is interval. It is useful when you want to apply threshold check over n time intervals.

SELECT count() AS value, toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS interval 
FROM signoz_traces.distributed_signoz_error_index_v2
WHERE (serviceName = 'api-service') 
AND (exceptionType = 'Not found') 
GROUP BY interval

The rule engine is also capable of dynamically applying date ranges to your SELECT query. When you choose last 15 mins option in the alert conditions section, the rule engine applies timestamp values in the timeframe condition below.

SELECT count() AS value, toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS interval 
FROM signoz_traces.distributed_signoz_error_index_v2
WHERE timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}} 
AND (serviceName = 'api-service') 
AND (exceptionType = 'Not found') 
GROUP BY interval

The reserved bind variables {{.start_datetime}} and {{.end_datetime}} would allow rule engine to dynamically include date range query when the alert query runs.

Was this page helpful?