# Monitoring

## Prerequisites
Some of the tools from the manual deployment documentation are required to interact with monitoring services.
## Grafana / Prometheus

Note: this monitoring stack is currently only brought up when deploying services with tanka.
By default, an instance of Grafana and an instance of Prometheus are deployed along with the core DSS services; this combination allows you to view (in Grafana) CRDB metrics (collected by Prometheus). To view Grafana, first ensure that the appropriate cluster context is selected (`kubectl config current-context`). Then, run the following command:
```shell
kubectl get pod | grep grafana | awk '{print $1}' | xargs -I {} kubectl port-forward {} 3000
```
While that command is running, open a browser and navigate to http://localhost:3000. The default username is `admin` with a default password of `admin`. Click the magnifying glass on the left side to select a dashboard to view.
## Prometheus Federation (Multi-Cluster Monitoring)

The DSS can use Prometheus to gather metrics from the binaries deployed with this project by scraping formatted metrics from an application's endpoint. Prometheus Federation enables you to monitor multiple DSS clusters that you operate, unifying all the metrics into a single Prometheus instance on which you can build Grafana dashboards. Enabling Prometheus Federation is optional. To enable it, you need to do two things:
- Externally expose the Prometheus service of the DSS clusters.
- Deploy a “Global Prometheus” instance to unify metrics.
### Externally Exposing Prometheus
You will need to change the values in the `prometheus` fields in your metadata tuples:

- Set `expose_external` to `true`
- [Optional] Supply a static external IP address in `IP`
- [Highly recommended] Supply whitelists of IP blocks in CIDR form; leaving an empty list means everyone can publicly access your metrics
- Then run `tk apply ...` to deploy the changes on your DSS clusters
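As an illustration, the relevant fields might look like the snippet below in a metadata tuple. The exact field names (in particular `whitelist_ip_blocks`) and the addresses are assumptions for illustration; check your deployment's metadata schema for the authoritative names.

```jsonnet
prometheus: {
  // Expose the Prometheus service outside the cluster
  expose_external: true,
  // [Optional] static external IP address (example address)
  IP: '192.0.2.10',
  // [Highly recommended] CIDR blocks allowed to reach the metrics endpoint;
  // an empty list means everyone can publicly access your metrics
  whitelist_ip_blocks: ['203.0.113.0/24'],
},
```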
### Deploy a "Global Prometheus" instance

- Follow the guide to deploy Prometheus: https://prometheus.io/docs/introduction/first_steps/
- The scrape rules for this global instance scrape each cluster's Prometheus `/federate` endpoint and are rather simple; please look at the example configuration.
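A minimal federation scrape configuration for the global instance might look like the following. The target addresses and match rule are examples; see the Prometheus federation documentation for details.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job!=""}'   # pull all job-labeled series from each cluster
    static_configs:
      - targets:        # externally exposed Prometheus instances (examples)
          - 'dss-cluster-1.example.com:9090'
          - 'dss-cluster-2.example.com:9090'
```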
## Health checks
This section describes various monitoring activities a USS may perform to verify various characteristics of their DSS instance and its pool. In general, they rely on a DSS operator’s monitoring infrastructure querying particular endpoints, evaluating the results of those queries, and producing alerts under certain conditions. Not all checks listed below are fully implemented in the current InterUSS implementation.
One or more of the procedures below could be combined into a single, more accessible endpoint in monitoring middleware.
### `/healthy` check

#### Summary

Checks whether a DSS instance is responsive to HTTPS requests.

#### Procedure

For each expected DSS instance in the pool, query `/healthy`.

#### Alert criteria

- Any query failed or returned a code other than 200
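A minimal sketch of this procedure, assuming a plain Python monitoring job (the instance URLs and use of `urllib` are illustrative; adapt to your monitoring stack):

```python
# Sketch: poll /healthy on each expected DSS instance and collect alerts.
from urllib.request import urlopen
from urllib.error import URLError


def evaluate_healthy(results):
    """Return alert messages given {instance: status_code_or_None} results.

    None means the query itself failed; any code other than 200 alerts.
    """
    alerts = []
    for instance, status in results.items():
        if status is None:
            alerts.append(f"{instance}: query failed")
        elif status != 200:
            alerts.append(f"{instance}: returned {status}")
    return alerts


def poll_pool(base_urls):
    """Query /healthy on each instance; record None on any failure."""
    results = {}
    for url in base_urls:
        try:
            with urlopen(url + "/healthy", timeout=5) as resp:
                results[url] = resp.status
        except URLError:
            results[url] = None
    return results
```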
### Normal usage metrics

#### Summary

Checks whether normal calls to the USS's DSS instance generally succeed.

#### Procedure

The USS notifies its monitoring system whenever a normal ASTM-API call to its DSS instance fails due to an error indicating a failed service, such as a timeout, 5xx, 405, 408, 418, 451, and possibly others.

#### Alert criteria

- Number of failures per time period crosses a threshold
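The classification and threshold logic above can be sketched as follows; the 5% threshold is an assumed example value, not a recommendation from this document:

```python
# Sketch: decide whether a DSS call outcome counts as a failed-service event.
# The specific status codes come from the list above; a timeout is modeled
# as a status of None.
FAILED_SERVICE_CODES = {405, 408, 418, 451}


def is_failed_service(status):
    """True if the outcome should be reported to the monitoring system."""
    if status is None:  # request timed out or connection failed
        return True
    return status in FAILED_SERVICE_CODES or 500 <= status <= 599


def should_alert(failure_count, total_count, threshold=0.05):
    """Alert when the failure rate over a time period crosses the threshold."""
    return total_count > 0 and failure_count / total_count > threshold
```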
### DAR identity check

#### Summary

Checks whether a set of DSS instances indicate that they are using the same DSS Airspace Representation (DAR).

#### Procedure, Option 1

For each expected DSS instance in the pool, query `/aux/v1/pool` and collect `dar_id`.

#### Alert criteria, Option 1

- Any query failed
- Any collected `dar_id` value is different from any other collected `dar_id` value
#### Procedure, Option 2

Prior to ongoing operations, exchange the expected DAR ID for the environment among all DSS operators.

On an ongoing basis, query `/aux/v1/pool` on the DSS operator's DSS instance and collect `dar_id`.

#### Alert criteria, Option 2

- Query failed
- Collected `dar_id` differs from the expected DAR ID for the environment
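Assuming the collected `dar_id` values are gathered into a simple mapping (query failures recorded as `None`), the alert criteria for both options can be sketched as:

```python
# Sketch: evaluate the DAR identity alert criteria.
# dar_ids maps each DSS instance to the dar_id collected from /aux/v1/pool,
# or None when the query failed.


def dar_alerts(dar_ids, expected=None):
    alerts = []
    for instance, dar_id in dar_ids.items():
        if dar_id is None:
            alerts.append(f"{instance}: query failed")
    collected = {v for v in dar_ids.values() if v is not None}
    if expected is None:
        # Option 1: all collected values must agree with each other.
        if len(collected) > 1:
            alerts.append(f"dar_id mismatch across pool: {sorted(collected)}")
    else:
        # Option 2: every collected value must match the pre-agreed DAR ID.
        for instance, dar_id in dar_ids.items():
            if dar_id is not None and dar_id != expected:
                alerts.append(f"{instance}: dar_id {dar_id} != expected {expected}")
    return alerts
```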
### Per-USS heartbeat check

Note: the implementation of this functionality is not yet complete.

#### Summary

Checks whether all DSS instance operators have recently verified their ability to synchronize data to another DSS instance operator.

#### Procedure

DSS instance operators agree to all configure their monitoring and alerting systems to execute this procedure, with an agreed-upon maximum time interval:

Assert a new heartbeat for the DSS operator's DSS instance via `PUT /aux/v1/pool/dss_instances/heartbeat`, which returns the list of `dss_instances` including each one's `most_recent_heartbeat`.
#### Alert criteria

- `PUT` query fails
- Any expected DSS instance in the pool does not have an entry in `dss_instances`
- The current time is past any DSS instance's `next_heartbeat_expected_before`
- The difference between `next_heartbeat_expected_before` and `timestamp` is larger than the agreed-upon maximum time interval for any DSS instance
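Since this functionality is not yet complete, the exact response shape is an assumption; under that assumption (and modeling timestamps as plain seconds for brevity), the alert criteria above can be sketched as:

```python
# Sketch: evaluate the heartbeat alert criteria from the PUT response.
# Field names (dss_instances, most_recent_heartbeat, timestamp,
# next_heartbeat_expected_before) follow the text above; the overall response
# shape is an assumption.


def heartbeat_alerts(response, expected_instances, now, max_interval):
    alerts = []
    if response is None:  # the PUT query itself failed
        return ["PUT query failed"]
    by_id = {d["id"]: d for d in response["dss_instances"]}
    for instance_id in expected_instances:
        entry = by_id.get(instance_id)
        if entry is None:
            alerts.append(f"{instance_id}: missing from dss_instances")
            continue
        hb = entry["most_recent_heartbeat"]
        if now > hb["next_heartbeat_expected_before"]:
            alerts.append(f"{instance_id}: heartbeat overdue")
        if hb["next_heartbeat_expected_before"] - hb["timestamp"] > max_interval:
            alerts.append(f"{instance_id}: heartbeat interval too long")
    return alerts
```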
### Nonce exchange check

Note: none of this functionality has been implemented yet.

#### Summary

Definitively checks whether pool data written into one DSS instance can be read from another DSS instance.

#### Implementation

This check would involve establishing the ability to read and write (client USS ID, DSS instance writer ID, nonce value) triplets in a database table describing pool information.

#### Procedure

- For each expected DSS instance in the pool, write a nonce value
- For each expected DSS instance in the pool, read all (DSS instance writer ID, nonce value) pairs written by the DSS instance operator's client USS ID

#### Alert criteria

- Any query failed
- The nonce value written to DSS instance *i* does not match the nonce value that DSS instance *j* reports was written to DSS instance *i* by the DSS instance operator
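Since none of this is implemented yet, the data shapes below are assumptions; the consistency comparison itself can be sketched as:

```python
# Sketch: evaluate nonce-exchange consistency.
# written[i] is the nonce this operator's client wrote to DSS instance i;
# reported[j][i] is the nonce that DSS instance j reports was written to
# instance i by this operator (the whole read is None if the query failed).


def nonce_alerts(written, reported):
    alerts = []
    for j, pairs in reported.items():
        if pairs is None:
            alerts.append(f"read from {j} failed")
            continue
        for i, nonce in written.items():
            if pairs.get(i) != nonce:
                alerts.append(f"{j} reports {pairs.get(i)!r} for {i}, expected {nonce!r}")
    return alerts
```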
### DSS entity injection check

#### Summary

Actual DSS entities (subscriptions, operational intents) are manipulated in a geographically-isolated test area.

#### Procedure

Run `uss_qualifier` with a suitable configuration.

A suitable configuration would cause DSS entities to be created, read, updated, and deleted within an isolated geographical test area, likely via a subset of the dss `all_tests` automated test suite with `uss_qualifier` possessing USS-level credentials.

#### Alert criteria

- Tested requirements artifact does not indicate Pass
### Database metrics check

#### Summary

Certain metrics exposed by the underlying database software are monitored.

#### Procedure

Each USS queries metrics of the underlying database software (CRDB, YugabyteDB) using their database node(s).

#### Alert criteria

- Any Raft quorum unavailability
- Resource usage within threshold of ceiling for a resource (e.g., 90% of storage/memory/CPU on a node in use)
- Any SQL failures
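These criteria can be sketched as a simple evaluation over scraped metric values. The metric names below are generic placeholders, not the database's actual exported names; map them from your database's metrics (e.g., CRDB's unavailable-range and SQL error counters).

```python
# Sketch: evaluate the database-metrics alert criteria over placeholder
# metric names; real deployments map these from CRDB/YugabyteDB exports.


def db_metric_alerts(metrics, usage_ceiling=0.9):
    alerts = []
    if metrics.get("unavailable_ranges", 0) > 0:
        alerts.append("Raft quorum unavailable for some ranges")
    for resource in ("storage", "memory", "cpu"):
        usage = metrics.get(f"{resource}_usage_fraction", 0.0)
        if usage >= usage_ceiling:
            alerts.append(f"{resource} usage at {usage:.0%}")
    if metrics.get("sql_failures", 0) > 0:
        alerts.append("SQL failures observed")
    return alerts
```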
## Failure detection capability

This section summarizes the preceding health checks and their ability to detect failures.

### Potential failures

This list of failures and potential causes is not exhaustive in either respect.
1. DSS instance is not accepting incoming HTTPS requests
   - Deployment not complete
   - HTTP(S) ingress/certificates/routing/etc not configured correctly
   - DNS not configured correctly
2. Database components of DSS instance are non-functional
   - Database container not deployed correctly
   - Database functionality failing
   - Database software not behaving as expected
   - Connectivity (e.g., username/password) between core-service and database not configured correctly
   - System-range quorum of database nodes not met
   - Trusted certificates for the pool not exchanged or configured correctly
3. USS initializes a stand-alone DSS instance or connects to a different pool rather than joining the intended pool
   - Database initialization parameter not set properly during deployment + nodes to join omitted
   - Nodes to join + trusted certificates specified incorrectly
4. USS shared the wrong base URL for their DSS instance with other pool participants
   - I.e., USS deployed and uses a fully-functional DSS instance at https://dss_q.uss.example.com connected to the DSS pool for environment Q, but indicates to other USSs that the DSS instance for environment Q is located at https://dss_r.uss.example.com (another fully-functional DSS instance connected to a different pool)
   - Note: the likelihood of this failure could be reduced to negligible if DSS base URLs were included in #1140
5. DSS instance can interact with the database, but cannot read from/write to any tables
   - DSS instance operator executed InterUSS-unsupported manual commands directly to the database to change the access rights of database users used by DSS instances
6. DSS instance can read from and write to the pool table, but cannot read from/write to SCD/RID tables
   - DSS instance operator executed InterUSS-unsupported manual commands directly to the database to change the access rights of database users used by DSS instances on a per-table basis
   - SCD/RID tables not initialized
   - SCD/RID tables corrupt or not at the appropriate schema version
7. The DSS instance connected to the pool is not used by the USS in the pool's environment
   - USS specified the wrong DSS base URL in the rest of their system in the pool environment
     - E.g., the DSS instance at https://dss_x.uss.example.com is fully functional, connects to the DSS pool for environment X and is the base URL the USS shares with other USSs, but the USS specifies https://dss_y.uss.example.com as the DSS instance for the rest of their system to use in environment X
   - USS did not configure their system to use features (e.g., ASTM F3548-21 strategic coordination) requiring a DSS in the rest of their system in the pool environment
8. DSS instance is working, but another part of the owning USS's system has failed
   - USS deploys their DSS instance differently than/separately from the rest of their system, and the rest-of-system deployment failed while the DSS instance deployment is unaffected
   - A component in the rest of the USS's system failed
9. Database software indicates success to the core-service client, but does not correctly synchronize data to other DSS instances
   - There is a critical bug in the database software (this would seem to be a product problem rather than a configuration problem)
10. Aux API works but SCD/RID API does not work or is disabled
    - DSS instance configuration does not enable SCD/RID APIs as needed
    - SCD/RID endpoint routing does not work (though other routing does work)
11. Database nodes are unavailable such that quorum is not met for certain ranges
    - Database node container(s) run out of disk space
    - Database node container(s) are shut down due to resource shortage
    - System maintenance conducted improperly (for instance, multiple USSs bring down nodes contributing to the same range for maintenance simultaneously)
12. Everything is working properly, but the system lacks the capacity to handle the volume of traffic
### Check detection capabilities

Column numbers refer to the potential failures listed above.

| Check (+readiness) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🚀 `/healthy` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 🛠️ Normal usage metrics | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | 🔶 | ❌ | ✅ | 🔶 | 🔶 |
| ✅ DAR identity | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | 🔶 | 🔶↓↓ |
| 🚧 Per-USS heartbeat | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🔶 | 🔶 | ✅ | ❌ | 🔶 | 🔶↓ |
| 🚧 Nonce exchange | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | 🔶 | 🔶↓ |
| 🚀 DSS entity injection | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | 🔶 | 🔶 |
| 🛠️ Database metrics | ❌ | 🔶 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | 🔶 | ❌ | ✅ | ✅ |
#### Legend

| Readiness | Meaning | Failure detection | Meaning |
| --- | --- | --- | --- |
| 🚀 | Released | ✅ | Detects failure |
| ✅ | Complete (not yet released) | 🔶 | May detect failure |
| 🚧 | Not complete | 🔶↓ | Might rarely detect failure |
| 🛠️ | Requires user involvement | 🔶↓↓ | Might very rarely detect failure |
| | | ❌ | Does not detect failure |