dss

InterUSS Platform's implementation of the ASTM DSS concept for RID and flight coordination.

Monitoring

Prerequisites

Some of the tools from the manual deployment documentation are required to interact with monitoring services.

Grafana / Prometheus

Note: this monitoring stack is currently brought up only when deploying services with tanka.

By default, Grafana and Prometheus instances are deployed along with the core DSS services; together they let you view CRDB metrics (collected by Prometheus) in Grafana. To view Grafana, first ensure that the appropriate cluster context is selected (kubectl config current-context). Then run the following command:

kubectl get pod | grep grafana | awk '{print $1}' | xargs -I {} kubectl port-forward {} 3000

While that command is running, open a browser and navigate to http://localhost:3000. The default username is admin with a default password of admin. Click the magnifying glass on the left side to select a dashboard to view.

Prometheus Federation (Multi Cluster Monitoring)

The DSS can use Prometheus to gather metrics from the binaries deployed with this project by scraping formatted metrics from an application’s endpoint. Prometheus Federation lets you monitor multiple DSS clusters that you operate by unifying all their metrics in a single Prometheus instance, on top of which you can build Grafana dashboards. Enabling Prometheus Federation is optional. To enable it, you need to do two things:

  1. Externally expose the Prometheus service of the DSS clusters.
  2. Deploy a “Global Prometheus” instance to unify metrics.

Externally Exposing Prometheus

You will need to change the values in the prometheus fields in your metadata tuples:

  1. Set expose_external to true.
  2. [Optional] Supply a static external IP address for IP.
  3. [Highly Recommended] Supply whitelists of IP blocks in CIDR form; leaving the list empty means everyone can publicly access your metrics.
  4. Run tk apply ... to deploy the changes on your DSS clusters.

Deploy “Global Prometheus” instance

  1. Follow the guide to deploy Prometheus: https://prometheus.io/docs/introduction/first_steps/
  2. The scrape rules for this global instance are rather simple: they scrape the /federate endpoint of the other Prometheus instances. Please look at the example configuration.
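
Before pointing a global instance at a cluster, it can help to confirm that the cluster's exposed /federate endpoint answers at all. The following is a minimal sketch that only builds the request URL; Prometheus requires at least one match[] selector on /federate. The host name and the selector are placeholders for illustration, not values taken from this project.

```python
from urllib.parse import urlencode


def federate_url(prometheus_base_url: str, matchers: list[str]) -> str:
    """Build a /federate URL selecting series that match any given selector."""
    if not matchers:
        raise ValueError("Prometheus requires at least one match[] selector")
    # Each selector becomes its own match[] query parameter.
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{prometheus_base_url.rstrip('/')}/federate?{query}"


# Hypothetical address of one DSS cluster's externally exposed Prometheus.
print(federate_url("http://prometheus.dss-1.example.com:9090", ['{job=~".+"}']))
```

Fetching the printed URL with a browser or curl should return metrics in the Prometheus text format if the service is exposed and your IP is in the whitelist.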

Health checks

This section describes various monitoring activities a USS may perform to verify various characteristics of their DSS instance and its pool. In general, they rely on a DSS operator’s monitoring infrastructure querying particular endpoints, evaluating the results of those queries, and producing alerts under certain conditions. Not all checks listed below are fully implemented in the current InterUSS implementation.

One or more procedures below could be implemented into a single, more-accessible endpoint in monitoring middleware.

/healthy check

Summary

Checks whether a DSS instance is responsive to HTTPS requests.

Procedure

For each expected DSS instance in the pool, query /healthy
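
As a rough illustration, this procedure could be scripted along the following lines. The sketch assumes that an instance counts as responsive only when /healthy answers with HTTP 200, and that any connection-level error counts as unresponsive; both are assumptions of this example. The function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as e:
        return e.code  # non-2xx responses still carry a status code


def unhealthy_instances(
    base_urls: Iterable[str],
    get_status: Callable[[str], int] = http_status,
) -> list[str]:
    """Return the base URLs whose /healthy endpoint did not answer 200."""
    failed = []
    for base_url in base_urls:
        try:
            ok = get_status(f"{base_url.rstrip('/')}/healthy") == 200
        except OSError:  # DNS failure, refused connection, timeout, TLS error
            ok = False
        if not ok:
            failed.append(base_url)
    return failed
```

The get_status parameter exists so that the check can be exercised against canned responses without network access.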

Alert criteria

Normal usage metrics

Summary

Checks whether normal calls to USS’s DSS instance generally succeed.

Procedure

The USS notifies its monitoring system whenever a normal ASTM-API call to its DSS instance fails with an error indicating a failed service, such as a timeout or an HTTP 5xx, 405, 408, 418, or 451 response, and possibly others.
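
The classification described above might look like the following sketch. It covers only the outcomes named in the text (timeouts, 5xx, 405, 408, 418, 451); representing a timeout as None and summarizing a window of calls as a failure rate are assumptions of this example.

```python
from typing import Optional

# Status codes the text lists as indicating a failed service, beyond all 5xx.
SERVICE_FAILURE_CODES = frozenset({405, 408, 418, 451})


def indicates_failed_service(status: Optional[int]) -> bool:
    """Classify the outcome of one DSS call; None stands for a timeout."""
    if status is None:
        return True
    return 500 <= status <= 599 or status in SERVICE_FAILURE_CODES


def failure_rate(statuses: list[Optional[int]]) -> float:
    """Fraction of calls in a window that indicate a failed service."""
    if not statuses:
        return 0.0
    return sum(indicates_failed_service(s) for s in statuses) / len(statuses)
```

Responses such as 400, 403, or 409 are deliberately not counted, since they usually reflect the request rather than the health of the DSS instance.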

Alert criteria

DAR identity check

Summary

Checks whether a set of DSS instances indicate that they are using the same DSS Airspace Representation (DAR).

Procedure, Option 1

For each expected DSS instance in the pool, query /aux/v1/pool and collect dar_id

Alert criteria, Option 1

Procedure, Option 2

Prior to ongoing operations, exchange the expected DAR ID for the environment among all DSS operators.

On an ongoing basis, query /aux/v1/pool on DSS operator’s DSS instance and collect dar_id
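
Both options come down to comparing collected dar_id values. The sketch below assumes that the /aux/v1/pool response is a JSON object with dar_id as a top-level field; that layout and the function names are assumptions of this example.

```python
import json


def dar_id_from_pool_response(body: str) -> str:
    """Extract dar_id, assuming it is a top-level field of the JSON body."""
    return json.loads(body)["dar_id"]


def dar_ids_agree(dar_ids_by_instance: dict[str, str]) -> bool:
    """Option 1: every queried instance reports the same dar_id."""
    return len(set(dar_ids_by_instance.values())) <= 1


def matches_expected_dar(dar_id: str, expected_dar_id: str) -> bool:
    """Option 2: the operator's own instance reports the agreed dar_id."""
    return dar_id == expected_dar_id
```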

Alert criteria, Option 2

Per-USS heartbeat check

Note: the implementation of this functionality is not yet complete.

Summary

Checks whether all DSS instance operators have recently verified their ability to synchronize data to another DSS instance operator.

Procedure

All DSS instance operators agree to configure their monitoring and alerting systems to execute this procedure at an agreed-upon maximum time interval:

Assert a new heartbeat for the DSS operator’s DSS instance via PUT /aux/v1/pool/dss_instances/heartbeat, which returns the list of dss_instances including each one’s most_recent_heartbeat
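
Once the returned list is available, the evaluation step could be sketched as below. The example assumes the caller has already reduced the response to a mapping from instance ID to a timezone-aware timestamp (or None when no heartbeat has been recorded); the exact layout of most_recent_heartbeat is not specified here, so that parsing is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_instances(
    most_recent_heartbeats: dict[str, Optional[datetime]],
    max_interval: timedelta,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return instance IDs with no heartbeat within max_interval before now."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        instance_id
        for instance_id, heartbeat in most_recent_heartbeats.items()
        if heartbeat is None or now - heartbeat > max_interval
    )
```

The max_interval argument corresponds to the interval the operators agreed upon, possibly with some margin for scheduling jitter.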

Alert criteria

Nonce exchange check

Note: none of this functionality has been implemented yet.

Summary

Definitively checks whether pool data written into one DSS instance can be read from another DSS instance.

Implementation

This check would involve establishing the ability to read and write (client USS ID, DSS instance writer ID, nonce value) triplets in a database table describing pool information.

Procedure

  1. For each expected DSS instance in the pool, write a nonce value
  2. For each expected DSS instance in the pool, read all (DSS instance writer ID, nonce value) pairs written by the DSS instance operator’s client USS ID
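
Because none of this is implemented, the following is purely a hypothetical sketch of how the two steps could be evaluated. It models only the bookkeeping: which nonce was written through which instance, and which (DSS instance writer ID, nonce value) pairs each instance returned when read. How the values would actually be written and read is not defined here.

```python
import secrets

Pair = tuple[str, str]  # (DSS instance writer ID, nonce value)


def new_nonces(instance_ids: list[str]) -> dict[str, str]:
    """Step 1: choose a fresh nonce to write through each DSS instance."""
    return {instance_id: secrets.token_hex(16) for instance_id in instance_ids}


def unreadable_pairs(
    written: dict[str, str],
    read_by_instance: dict[str, set[Pair]],
) -> dict[str, set[Pair]]:
    """Step 2: for each instance read from, list written pairs it did not return."""
    expected = set(written.items())
    missing = {reader: expected - pairs for reader, pairs in read_by_instance.items()}
    # Keep only the readers that are actually missing something.
    return {reader: pairs for reader, pairs in missing.items() if pairs}
```

An empty result means every nonce written through any instance was readable from every instance that was queried.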

Alert criteria

DSS entity injection check

Summary

Actual DSS entities (subscriptions, operational intents) are manipulated in a geographically-isolated test area.

Procedure

Run uss_qualifier with a suitable configuration.

The suitable configuration would cause DSS entities to be created, read, updated, and deleted within an isolated geographical test area, likely via a subset of the dss all_tests automated test suite with uss_qualifier possessing USS-level credentials.

Alert criteria

Database metrics check

Summary

Certain metrics exposed by the underlying database software are monitored.

Procedure

Each USS queries metrics of underlying database software (CRDB, YugabyteDB) using their database node(s).
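
One possible way to evaluate such metrics, assuming the database node exposes them in the Prometheus text exposition format, is sketched below. The parser is deliberately minimal and assumes samples carry no trailing timestamp. The metric names used with it and the limits chosen are up to the operator and should be checked against the database's own metric documentation.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Assumes samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        parts = line.rsplit(None, 1)  # the value is the last whitespace-separated token
        if len(parts) == 2:
            samples[parts[0]] = float(parts[1])
    return samples


def breached(samples: dict[str, float], max_allowed: dict[str, float]) -> dict[str, float]:
    """Return the series whose metric name has a limit and whose value exceeds it."""
    result = {}
    for series, value in samples.items():
        name = series.split("{", 1)[0]  # drop the label set, if any
        if name in max_allowed and value > max_allowed[name]:
            result[series] = value
    return result
```

For example, an operator might pass a limit of 0 for a metric that counts unavailable ranges, so that any nonzero value is reported.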

Alert criteria

Failure detection capability

This section summarizes the preceding health checks and their ability to detect failures.

Potential failures

This list of failures and potential causes is not exhaustive in either respect.

  1. DSS instance is not accepting incoming HTTPS requests
    1. Deployment not complete
    2. HTTP(S) ingress/certificates/routing/etc not configured correctly
    3. DNS not configured correctly
  2. Database components of DSS instance are non-functional
    1. Database container not deployed correctly
    2. Database functionality failing
    3. Database software not behaving as expected
    4. Connectivity (e.g., username/password) between core-service and database not configured correctly
    5. System-range quorum of database nodes not met
    6. Trusted certificates for the pool not exchanged or configured correctly
  3. USS initializes a stand-alone DSS instance or connects to a different pool rather than joining the intended pool
    1. Database initialization parameter not set properly during deployment + nodes to join omitted
    2. Nodes to join + trusted certificates specified incorrectly
  4. USS shared the wrong base URL for their DSS instance with other pool participants
    1. I.e., USS deployed and uses fully-functional DSS instance at https://dss_q.uss.example.com connected to the DSS pool for environment Q, but indicates to other USSs that the DSS instance for environment Q is located at https://dss_r.uss.example.com (another fully-functional DSS instance connected to a different pool)
    2. Note: the likelihood of this failure could be reduced to negligible if DSS base URLs were included in #1140
  5. DSS instance can interact with the database, but cannot read from/write to any tables
    1. DSS instance operator executed InterUSS-unsupported manual commands directly to the database to change the access rights of database users used by DSS instances
  6. DSS instance can read from and write to pool table, but cannot read from/write to SCD/RID tables
    1. DSS instance operator executed InterUSS-unsupported manual commands directly to the database to change the access rights of database users used by DSS instances on a per-table basis
    2. SCD/RID tables not initialized
    3. SCD/RID tables corrupt or not at appropriate schema version
  7. The DSS instance connected to the pool is not used by the USS in the pool’s environment
    1. USS specified the wrong DSS base URL in the rest of their system in the pool environment
      1. E.g., DSS instance at https://dss_x.uss.example.com is fully functional, connects to the DSS pool for environment X and is the base URL USS shares with other USSs, but the USS specifies https://dss_y.uss.example.com as the DSS instance for the rest of their system to use in environment X
    2. USS did not configure their system to use features (e.g., ASTM F3548-21 strategic coordination) requiring a DSS in the rest of their system in the pool environment
  8. DSS instance is working, but another part of the owning USS’s system has failed
    1. USS deploys their DSS instance differently than/separately from the rest of their system, and the rest-of-system deployment failed while the DSS instance deployment is unaffected
    2. A component in the rest of the USS’s system failed
  9. Database software indicates success to the core-service client, but does not correctly synchronize data to other DSS instances
    1. There is a critical bug in the database software (this would seem to be a product problem rather than a configuration problem)
  10. Aux API works but SCD/RID API does not work or is disabled
    1. DSS instance configuration does not enable SCD/RID APIs as needed
    2. SCD/RID endpoint routing does not work (though other routing does work)
  11. Database nodes are unavailable such that quorum is not met for certain ranges
    1. Database node container(s) run out of disk space
    2. Database node container(s) are shut down due to resource shortage
    3. System maintenance conducted improperly (for instance, multiple USSs bring down nodes contributing to the same range for maintenance simultaneously)
  12. Everything is working properly, but the system lacks the capacity to handle the volume of traffic

Check detection capabilities

Check (+readiness) Failure
1 2 3 4 5 6 7 8 9 10 11 12
🚀 /healthy
🛠️ Normal usage metrics 🔶 🔶 🔶
✅ DAR identity 🔶 🔶↓↓
🚧 Per-USS heartbeat 🔶 🔶 🔶 🔶↓
🚧 Nonce exchange 🔶 🔶↓
🚀 DSS entity injection 🔶 🔶
🛠️ Database metrics 🔶 🔶

Legend

Readiness

🚀 Released
Complete (not yet released)
🚧 Not complete
🛠️ Requires user involvement

Failure detection

Detects failure
🔶 May detect failure
🔶↓ Might rarely detect failure
🔶↓↓ Might very rarely detect failure
Does not detect failure