# Monitoring

## Prerequisites
Some of the tools from the manual deployment documentation are required to interact with monitoring services.
## Grafana / Prometheus

Note: this monitoring stack is currently only brought up when deploying services with tanka.
By default, an instance of Grafana and an instance of Prometheus are deployed along with the core DSS services; this combination allows you to view (in Grafana) CRDB metrics (collected by Prometheus). To view Grafana, first ensure that the appropriate cluster context is selected (`kubectl config current-context`). Then, run the following command:
```shell
kubectl get pod | grep grafana | awk '{print $1}' | xargs -I {} kubectl port-forward {} 3000
```
While that command is running, open a browser and navigate to http://localhost:3000. The default username is `admin` with a default password of `admin`. Click the magnifying glass on the left side to select a dashboard to view.
## Prometheus Federation (Multi-Cluster Monitoring)

The DSS can use Prometheus to gather metrics from the binaries deployed with this project by scraping formatted metrics from an application's endpoint. Prometheus Federation enables you to monitor multiple DSS clusters that you operate, unifying all the metrics into a single Prometheus instance on which you can build Grafana dashboards. Enabling Prometheus Federation is optional. To enable it, you need to do two things:
- Externally expose the Prometheus service of the DSS clusters.
- Deploy a “Global Prometheus” instance to unify metrics.
### Externally Exposing Prometheus
You will need to change the values in the `prometheus` fields in your metadata tuples:

- Set `expose_external` to `true`
- [Optional] Supply a static external IP address in `IP`
- [Highly recommended] Supply whitelists of IP blocks in CIDR form; leaving an empty list means everyone can publicly access your metrics
- Then run `tk apply ...` to deploy the changes on your DSS clusters
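As an illustration, the relevant fields might look like the snippet below in a metadata tuple. The exact field names (in particular `whitelist_ip_blocks`) and the addresses are assumptions for illustration; check your deployment's metadata schema for the authoritative names.

```jsonnet
prometheus: {
  // Expose the Prometheus service outside the cluster
  expose_external: true,
  // [Optional] static external IP address (example address)
  IP: '192.0.2.10',
  // [Highly recommended] CIDR blocks allowed to reach the metrics endpoint;
  // an empty list means everyone can publicly access your metrics
  whitelist_ip_blocks: ['203.0.113.0/24'],
},
```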
### Deploy a "Global Prometheus" instance

- Follow the guide to deploy Prometheus: https://prometheus.io/docs/introduction/first_steps/
- The scrape rules for this global instance scrape each cluster's Prometheus `/federate` endpoint and are rather simple; please look at the example configuration.
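A minimal federation scrape configuration for the global instance might look like the following. The target addresses and match rule are examples; see the Prometheus federation documentation for details.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job!=""}'   # pull all job-labeled series from each cluster
    static_configs:
      - targets:        # externally exposed Prometheus instances (examples)
          - 'dss-cluster-1.example.com:9090'
          - 'dss-cluster-2.example.com:9090'
```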
## Health checks
This section describes various monitoring activities a USS may perform to verify various characteristics of their DSS instance and its pool. In general, they rely on a DSS operator’s monitoring infrastructure querying particular endpoints, evaluating the results of those queries, and producing alerts under certain conditions. Not all checks listed below are fully implemented in the current InterUSS implementation.
One or more of the procedures below could be combined into a single, more accessible endpoint in monitoring middleware.
### `/healthy` check

#### Summary

Checks whether a DSS instance is responsive to HTTPS requests.

#### Procedure

For each expected DSS instance in the pool, query `/healthy`.

#### Alert criteria

- Any query failed or returned a code other than 200
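A minimal sketch of this procedure, assuming a plain Python monitoring job (the instance URLs and use of `urllib` are illustrative; adapt to your monitoring stack):

```python
# Sketch: poll /healthy on each expected DSS instance and collect alerts.
from urllib.request import urlopen
from urllib.error import URLError


def evaluate_healthy(results):
    """Return alert messages given {instance: status_code_or_None} results.

    None means the query itself failed; any code other than 200 alerts.
    """
    alerts = []
    for instance, status in results.items():
        if status is None:
            alerts.append(f"{instance}: query failed")
        elif status != 200:
            alerts.append(f"{instance}: returned {status}")
    return alerts


def poll_pool(base_urls):
    """Query /healthy on each instance; record None on any failure."""
    results = {}
    for url in base_urls:
        try:
            with urlopen(url + "/healthy", timeout=5) as resp:
                results[url] = resp.status
        except URLError:
            results[url] = None
    return results
```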
### Normal usage metrics

#### Summary

Checks whether normal calls to the USS's DSS instance generally succeed.

#### Procedure

The USS notifies its monitoring system whenever a normal ASTM-API call to its DSS instance fails due to an error indicating a failed service, such as a timeout, 5xx, 405, 408, 418, 451, and possibly others.

#### Alert criteria

- Number of failures per time period crosses a threshold
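The classification and threshold logic above can be sketched as follows; the 5% threshold is an assumed example value, not a recommendation from this document:

```python
# Sketch: decide whether a DSS call outcome counts as a failed-service event.
# The specific status codes come from the list above; a timeout is modeled
# as a status of None.
FAILED_SERVICE_CODES = {405, 408, 418, 451}


def is_failed_service(status):
    """True if the outcome should be reported to the monitoring system."""
    if status is None:  # request timed out or connection failed
        return True
    return status in FAILED_SERVICE_CODES or 500 <= status <= 599


def should_alert(failure_count, total_count, threshold=0.05):
    """Alert when the failure rate over a time period crosses the threshold."""
    return total_count > 0 and failure_count / total_count > threshold
```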
### DAR identity check

#### Summary

Checks whether a set of DSS instances indicate that they are using the same DSS Airspace Representation (DAR).

#### Procedure, Option 1

For each expected DSS instance in the pool, query `/aux/v1/pool` and collect `dar_id`.

#### Alert criteria, Option 1

- Any query failed
- Any collected `dar_id` value is different from any other collected `dar_id` value
#### Procedure, Option 2

Prior to ongoing operations, exchange the expected DAR ID for the environment among all DSS operators.

On an ongoing basis, query `/aux/v1/pool` on the DSS operator's DSS instance and collect `dar_id`.

#### Alert criteria, Option 2

- Query failed
- Collected `dar_id` differs from the expected DAR ID for the environment
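Assuming the collected `dar_id` values are gathered into a simple mapping (query failures recorded as `None`), the alert criteria for both options can be sketched as:

```python
# Sketch: evaluate the DAR identity alert criteria.
# dar_ids maps each DSS instance to the dar_id collected from /aux/v1/pool,
# or None when the query failed.


def dar_alerts(dar_ids, expected=None):
    alerts = []
    for instance, dar_id in dar_ids.items():
        if dar_id is None:
            alerts.append(f"{instance}: query failed")
    collected = {v for v in dar_ids.values() if v is not None}
    if expected is None:
        # Option 1: all collected values must agree with each other.
        if len(collected) > 1:
            alerts.append(f"dar_id mismatch across pool: {sorted(collected)}")
    else:
        # Option 2: every collected value must match the pre-agreed DAR ID.
        for instance, dar_id in dar_ids.items():
            if dar_id is not None and dar_id != expected:
                alerts.append(f"{instance}: dar_id {dar_id} != expected {expected}")
    return alerts
```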
### Per-USS heartbeat check

Note: the implementation of this functionality is not yet complete.

#### Summary

Checks whether all DSS instance operators have recently verified their ability to synchronize data to another DSS instance operator.

#### Procedure

DSS instance operators agree to all configure their monitoring and alerting systems to execute this procedure, with an agreed-upon maximum time interval:

Assert a new heartbeat for the DSS operator's DSS instance via `PUT /aux/v1/pool/dss_instances/heartbeat`, which returns the list of `dss_instances` including each one's `most_recent_heartbeat`.
#### Alert criteria

- `PUT` query fails
- Any expected DSS instance in the pool does not have an entry in `dss_instances`
- The current time is past any DSS instance's `next_heartbeat_expected_before`
- The difference between `next_heartbeat_expected_before` and `timestamp` is larger than the agreed-upon maximum time interval for any DSS instance
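Since this functionality is not yet complete, the exact response shape is an assumption; under that assumption (and modeling timestamps as plain seconds for brevity), the alert criteria above can be sketched as:

```python
# Sketch: evaluate the heartbeat alert criteria from the PUT response.
# Field names (dss_instances, most_recent_heartbeat, timestamp,
# next_heartbeat_expected_before) follow the text above; the overall response
# shape is an assumption.


def heartbeat_alerts(response, expected_instances, now, max_interval):
    alerts = []
    if response is None:  # the PUT query itself failed
        return ["PUT query failed"]
    by_id = {d["id"]: d for d in response["dss_instances"]}
    for instance_id in expected_instances:
        entry = by_id.get(instance_id)
        if entry is None:
            alerts.append(f"{instance_id}: missing from dss_instances")
            continue
        hb = entry["most_recent_heartbeat"]
        if now > hb["next_heartbeat_expected_before"]:
            alerts.append(f"{instance_id}: heartbeat overdue")
        if hb["next_heartbeat_expected_before"] - hb["timestamp"] > max_interval:
            alerts.append(f"{instance_id}: heartbeat interval too long")
    return alerts
```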
### Nonce exchange check

Note: none of this functionality has been implemented yet.

#### Summary

Definitively checks whether pool data written into one DSS instance can be read from another DSS instance.

#### Implementation

This check would involve establishing the ability to read and write (client USS ID, DSS instance writer ID, nonce value) triplets in a database table describing pool information.

#### Procedure

- For each expected DSS instance in the pool, write a nonce value
- For each expected DSS instance in the pool, read all (DSS instance writer ID, nonce value) pairs written by the DSS instance operator's client USS ID

#### Alert criteria

- Any query failed
- The nonce value written to DSS instance *i* does not match the nonce value that DSS instance *j* reports was written to DSS instance *i* by the DSS instance operator
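Since none of this is implemented yet, the data shapes below are assumptions; the consistency comparison itself can be sketched as:

```python
# Sketch: evaluate nonce-exchange consistency.
# written[i] is the nonce this operator's client wrote to DSS instance i;
# reported[j][i] is the nonce that DSS instance j reports was written to
# instance i by this operator (the whole read is None if the query failed).


def nonce_alerts(written, reported):
    alerts = []
    for j, pairs in reported.items():
        if pairs is None:
            alerts.append(f"read from {j} failed")
            continue
        for i, nonce in written.items():
            if pairs.get(i) != nonce:
                alerts.append(f"{j} reports {pairs.get(i)!r} for {i}, expected {nonce!r}")
    return alerts
```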
### DSS entity injection check

#### Summary

Actual DSS entities (subscriptions, operational intents) are manipulated in a geographically-isolated test area.

#### Procedure

Run `uss_qualifier` with a suitable configuration.

A suitable configuration would cause DSS entities to be created, read, updated, and deleted within an isolated geographical test area, likely via a subset of the dss `all_tests` automated test suite with `uss_qualifier` possessing USS-level credentials.

#### Alert criteria

- Tested requirements artifact does not indicate Pass
### Database metrics check

#### Summary

Certain metrics exposed by the underlying database software are monitored.

#### Procedure

Each USS queries metrics of the underlying database software (CRDB, YugabyteDB) using their database node(s).

#### Alert criteria

- Any Raft quorum unavailability
- Resource usage within threshold of ceiling for a resource (e.g., 90% of storage/memory/CPU on a node in use)
- Any SQL failures
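These criteria can be sketched as a simple evaluation over scraped metric values. The metric names below are generic placeholders, not the database's actual exported names; map them from your database's metrics (e.g., CRDB's unavailable-range and SQL error counters).

```python
# Sketch: evaluate the database-metrics alert criteria over placeholder
# metric names; real deployments map these from CRDB/YugabyteDB exports.


def db_metric_alerts(metrics, usage_ceiling=0.9):
    alerts = []
    if metrics.get("unavailable_ranges", 0) > 0:
        alerts.append("Raft quorum unavailable for some ranges")
    for resource in ("storage", "memory", "cpu"):
        usage = metrics.get(f"{resource}_usage_fraction", 0.0)
        if usage >= usage_ceiling:
            alerts.append(f"{resource} usage at {usage:.0%}")
    if metrics.get("sql_failures", 0) > 0:
        alerts.append("SQL failures observed")
    return alerts
```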
## Failure detection capability

This section summarizes the preceding health checks and their ability to detect failures.

### Potential failures

This list of failures and potential causes is not exhaustive in either respect.
1. DSS instance is not accepting incoming HTTPS requests
   - Deployment not complete
   - HTTP(S) ingress/certificates/routing/etc not configured correctly
   - DNS not configured correctly
2. Database components of DSS instance are non-functional
   - Database container not deployed correctly
   - Database functionality failing
   - Database software not behaving as expected
   - Connectivity (e.g., username/password) between core-service and database not configured correctly
   - System-range quorum of database nodes not met
   - Trusted certificates for the pool not exchanged or configured correctly
3. USS initializes a stand-alone DSS instance or connects to a different pool rather than joining the intended pool
   - Database initialization parameter not set properly during deployment + nodes to join omitted
   - Nodes to join + trusted certificates specified incorrectly
4. USS shared the wrong base URL for their DSS instance with other pool participants
   - I.e., USS deployed and uses a fully-functional DSS instance at https://dss_q.uss.example.com connected to the DSS pool for environment Q, but indicates to other USSs that the DSS instance for environment Q is located at https://dss_r.uss.example.com (another fully-functional DSS instance connected to a different pool)
   - Note: the likelihood of this failure could be reduced to negligible if DSS base URLs were included in #1140
5. DSS instance can interact with the database, but cannot read from/write to any tables
   - DSS instance operator executed InterUSS-unsupported manual commands directly to the database to change the access rights of database users used by DSS instances
6. DSS instance can read from and write to the pool table, but cannot read from/write to SCD/RID tables
   - DSS instance operator executed InterUSS-unsupported manual commands directly to the database to change the access rights of database users used by DSS instances on a per-table basis
   - SCD/RID tables not initialized
   - SCD/RID tables corrupt or not at the appropriate schema version
7. The DSS instance connected to the pool is not used by the USS in the pool's environment
   - USS specified the wrong DSS base URL in the rest of their system in the pool environment
     - E.g., the DSS instance at https://dss_x.uss.example.com is fully functional, connects to the DSS pool for environment X and is the base URL the USS shares with other USSs, but the USS specifies https://dss_y.uss.example.com as the DSS instance for the rest of their system to use in environment X
   - USS did not configure their system to use features (e.g., ASTM F3548-21 strategic coordination) requiring a DSS in the rest of their system in the pool environment
8. DSS instance is working, but another part of the owning USS's system has failed
   - USS deploys their DSS instance differently than/separately from the rest of their system, and the rest-of-system deployment failed while the DSS instance deployment is unaffected
   - A component in the rest of the USS's system failed
9. Database software indicates success to the core-service client, but does not correctly synchronize data to other DSS instances
   - There is a critical bug in the database software (this would seem to be a product problem rather than a configuration problem)
10. Aux API works but SCD/RID API does not work or is disabled
    - DSS instance configuration does not enable SCD/RID APIs as needed
    - SCD/RID endpoint routing does not work (though other routing does work)
11. Database nodes are unavailable such that quorum is not met for certain ranges
    - Database node container(s) run out of disk space
    - Database node container(s) are shut down due to resource shortage
    - System maintenance conducted improperly (for instance, multiple USSs bring down nodes contributing to the same range for maintenance simultaneously)
12. Everything is working properly, but the system lacks the capacity to handle the volume of traffic
### Check detection capabilities

Column numbers refer to the potential failures listed above.

| Check (+readiness) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🚀 `/healthy` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 🛠️ Normal usage metrics | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | 🔶 | ❌ | ✅ | 🔶 | 🔶 |
| ✅ DAR identity | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | 🔶 | 🔶↓↓ |
| 🚧 Per-USS heartbeat | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🔶 | 🔶 | ✅ | ❌ | 🔶 | 🔶↓ |
| 🚧 Nonce exchange | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | 🔶 | 🔶↓ |
| 🚀 DSS entity injection | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | 🔶 | 🔶 |
| 🛠️ Database metrics | ❌ | 🔶 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | 🔶 | ❌ | ✅ | ✅ |
#### Legend

| Readiness | Meaning | Failure detection | Meaning |
| --- | --- | --- | --- |
| 🚀 | Released | ✅ | Detects failure |
| ✅ | Complete (not yet released) | 🔶 | May detect failure |
| 🚧 | Not complete | 🔶↓ | Might rarely detect failure |
| 🛠️ | Requires user involvement | 🔶↓↓ | Might very rarely detect failure |
| | | ❌ | Does not detect failure |