Deploying a DSS instance
Deployment options
This document describes how to deploy a production-style DSS instance to interoperate with other DSS instances in a DSS pool.
To run a local DSS instance for testing, evaluation, or development, see dev/standalone_instance.md.
To create a local DSS instance with a multi-node CRDB cluster, see dev/mutli_node_local_dss.md.
To create or join a pool consisting of multiple interoperable DSS instances, see information on pooling.
Glossary
- DSS Region - A region in which a single, unified airspace representation is presented by one or more interoperable DSS instances, each instance typically operated by a separate organization. A specific environment (for example, “production” or “staging”) in a particular DSS Region is called a “pool”.
- DSS instance - a single logical replica in a DSS pool.
Preface
This doc describes a procedure for deploying the DSS and its dependencies (namely CockroachDB) via Kubernetes. The use of Kubernetes is not a requirement, and a DSS instance can join a CRDB cluster constituting a DSS pool as long as it meets the CockroachDB requirements below.
Prerequisites
Download and install the following tools to your workstation:
- If deploying on Google Cloud, install Google Cloud SDK
  - Confirm successful installation with `gcloud version`
  - Run `gcloud init` to set up a connection to your account.
  - `kubectl` can be installed from `gcloud` instead of via the method below.
- Install kubectl to interact with Kubernetes
  - Confirm successful installation with `kubectl version --client` (should succeed from any working directory).
  - Note that kubectl can alternatively be installed via the Google Cloud SDK `gcloud` shell if using Google Cloud.
- Install tanka
  - On Linux, after downloading the binary per instructions, run `sudo chmod +x /usr/local/bin/tk`
  - Confirm successful installation with `tk --version`
- Install Docker.
  - Confirm successful installation with `docker --version`
- Install CockroachDB to generate CockroachDB certificates.
  - These instructions assume CockroachDB Core.
  - You may need to run `sudo chmod +x /usr/local/bin/cockroach` after completing the installation instructions.
  - Confirm successful installation with `cockroach version`
- If developing the DSS codebase, install Golang
  - Confirm successful installation with `go version`
- Optionally install Jsonnet if editing the jsonnet templates.
Docker images
The application logic of the DSS is located in core-service, which is provided as a Docker image that is built locally and then pushed to a Docker registry of your choice. All major cloud providers have a Docker registry service, or you can set up your own.
To use the prebuilt InterUSS Docker images (without building them yourself), use `docker.io/interuss/dss` for `VAR_DOCKER_IMAGE_NAME`.
To build these images (and, optionally, push them to a Docker registry):
- Set the environment variable `DOCKER_URL` to your Docker registry URL endpoint.
  - For Google Cloud, `DOCKER_URL` should be set similarly to as described here, like `gcr.io/your-project-id` (do not include the image name; it will be appended by the build script).
  - For Amazon Web Services, `DOCKER_URL` should be set similarly to as described here, like `${aws_account_id}.dkr.ecr.${region}.amazonaws.com/` (do not include the image name; it will be appended by the build script).
- Ensure you are logged into your Docker registry service.
  - For Google Cloud, these are the recommended instructions (`gcloud auth configure-docker`). Ensure that appropriate permissions are enabled.
  - For Amazon Web Services, create a private repository by following the instructions here, then log in as described here.
- Use the `build.sh` script in this directory to build and push an image tagged with the current date and git commit hash.
- Note the VAR_* value printed at the end of the script.
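As a rough illustration of why `DOCKER_URL` must not include the image name, the sketch below composes a full image name in the shape of the example shown later in this document (`gcr.io/your-project-id/dss:2020-07-01-46cae72cf`). The function `compose_image_name` is a hypothetical helper, not part of the DSS repository, and the actual logic in `build.sh` may differ.

```shell
# Hypothetical helper, not part of the DSS repository.
# Usage: compose_image_name DOCKER_URL BUILD_DATE COMMIT
compose_image_name() {
  local docker_url="${1%/}"  # tolerate a trailing slash, as in the AWS example
  local build_date="$2"
  local commit="$3"
  # The image name ("dss") and the tag are appended to the registry URL.
  printf '%s/dss:%s-%s\n' "$docker_url" "$build_date" "$commit"
}
```

For example, `compose_image_name "$DOCKER_URL" "$(date +%Y-%m-%d)" "$(git rev-parse --short HEAD)"` would print a candidate name for the current checkout; the value printed by `build.sh` remains the authoritative one.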
Access to private repository
See the description of `VAR_DOCKER_IMAGE_PULL_SECRET` below to configure authentication.
Deploying a DSS instance via Kubernetes
This section discusses deploying a Kubernetes service, although you can deploy a DSS instance however you like as long as it meets the CockroachDB requirements above. You can do this on any supported cloud provider or even on your own infrastructure. Consult the Kubernetes documentation for your chosen provider.
If you can augment this documentation with specifics for another cloud provider, a PR to that effect would be greatly appreciated.
- Create a new Kubernetes cluster. We recommend a new cluster for each DSS instance. A reasonable cluster name might be `dss-us-prod-e4a` (where `e4a` is a zone identifier abbreviation), `dss-ca-staging`, `dss-mx-integration-sae1a`, etc. The name of this cluster will be combined with other information by Kubernetes to generate a longer cluster context ID.
  - On Google Cloud, the recommended procedure to create a cluster is:
    - In Google Cloud Platform, go to the Kubernetes Engine page and under Clusters click Create cluster.
    - Name the cluster appropriately; e.g., `dss-us-prod`
    - Select Zonal and a compute-zone appropriate to your geography
    - For the “default-pool” node pool:
      - Enter 3 for number of nodes.
      - In the “Nodes” bullet under “default-pool”, select N2 series and n2-standard-4 for machine type.
    - In the “Networking” bullet under “Clusters”, ensure “Enable VPC-native traffic” is checked.
- Make sure the correct cluster context is selected by printing the context name to the console: `kubectl config current-context`
  - Record this value and use it for `$CLUSTER_CONTEXT` below; perhaps: `export CLUSTER_CONTEXT=$(kubectl config current-context)`
  - On Google Cloud, first configure kubectl to interact with the cluster created above with these instructions. Specifically:

        gcloud config set project your-project-id
        gcloud config set compute/zone your-compute-zone
        gcloud container clusters get-credentials your-cluster-name

- Ensure the desired namespace is selected; the recommended namespace is simply `default` with one cluster per DSS instance. Print the current namespaces with `kubectl get namespace`. Use the current namespace as the value for `$NAMESPACE` below; perhaps use an environment variable for convenience: `export NAMESPACE=<your namespace>`.

  It may be useful to create a `login.sh` file with content like that shown below and `source login.sh` when working with this cluster.

  GCP:

      #!/bin/bash
      # Identify the cluster and the GCP region it resides in.
      export CLUSTER_NAME=<your cluster name>
      export REGION=<GCP region in which your cluster resides>
      # Point gcloud and kubectl at the cluster.
      gcloud config set project <your GCP project name>
      gcloud config set compute/zone $REGION-a
      gcloud container clusters get-credentials $CLUSTER_NAME
      # Values used throughout the rest of these instructions.
      export CLUSTER_CONTEXT=$(kubectl config current-context)
      export NAMESPACE=default
      export DOCKER_URL=docker.io/interuss
      echo "Current CLUSTER_CONTEXT is $CLUSTER_CONTEXT"

- Create static IP addresses: one for the Core Service ingress, and one for each CockroachDB node if you want to be able to interact with other DSS instances.
  - If using Google Cloud, the Core Service ingress needs to be created as a “Global” IP address, but the CRDB ingresses as “Regional” IP addresses. IPv4 is recommended as IPv6 has not yet been tested. Follow these instructions to reserve the static IP addresses. Specifically (replacing CLUSTER_NAME as appropriate since static IP addresses are defined at the project level rather than the cluster level), e.g.:

        gcloud compute addresses create ${CLUSTER_NAME}-backend --global --ip-version IPV4
        gcloud compute addresses create ${CLUSTER_NAME}-crdb-0 --region $REGION
        gcloud compute addresses create ${CLUSTER_NAME}-crdb-1 --region $REGION
        gcloud compute addresses create ${CLUSTER_NAME}-crdb-2 --region $REGION

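If the cluster has more than three CockroachDB nodes, the same pattern extends to one regional address per node. The sketch below only prints the `gcloud` commands for review rather than running them; `print_address_commands` is a hypothetical helper name, and the address names simply follow the `-backend` and `-crdb-N` pattern used above.

```shell
# Hypothetical helper: print (do not run) the gcloud commands that reserve
# one global address for the Core Service ingress and one regional address
# per CockroachDB node.
# Usage: print_address_commands CLUSTER_NAME REGION NODE_COUNT
print_address_commands() {
  local cluster_name="$1"
  local region="$2"
  local node_count="$3"
  echo "gcloud compute addresses create ${cluster_name}-backend --global --ip-version IPV4"
  local i=0
  while [ "$i" -lt "$node_count" ]; do
    echo "gcloud compute addresses create ${cluster_name}-crdb-${i} --region ${region}"
    i=$((i + 1))
  done
}
```

Running `print_address_commands "$CLUSTER_NAME" "$REGION" 3` reproduces the four commands listed above, which can then be pasted into a terminal once they look right.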
- Link static IP addresses to DNS entries.
  - Your CockroachDB nodes should have a common hostname suffix; e.g., `*.db.interuss.com`. Recommended naming is `0.db.yourdeployment.yourdomain.com`, `1.db.yourdeployment.yourdomain.com`, etc.
  - If using Google Cloud, see these instructions to create DNS entries for the static IP addresses created above. To list the IP addresses, use `gcloud compute addresses list`.
- Use the `make-certs.py` script to create certificates for the CockroachDB nodes in this DSS instance:

      ./make-certs.py --cluster $CLUSTER_CONTEXT --namespace $NAMESPACE [--node-address <ADDRESS> <ADDRESS> <ADDRESS> ...] [--ca-cert-to-join <CA_CERT_FILE>]

  - `$CLUSTER_CONTEXT` is the name of the cluster (see the `kubectl config current-context` step above).
  - `$NAMESPACE` is the namespace for this DSS instance (see the namespace step above).
  - Each `ADDRESS` is the DNS entry for a CockroachDB node that will use the certificates generated by this command. This is usually just the nodes constituting this DSS instance, though if you maintain multiple DSS instances in a single pool, the separate instances may share certificates. Note that `--node-address` must include all the hostnames and/or IP addresses that other CockroachDB nodes will use to connect to your nodes (the nodes using these certificates). Wildcard notation is supported, so you can use `*.<subdomain>.<domain>.com`. If following the recommendations above, use a single ADDRESS similar to `*.db.yourdeployment.yourdomain.com`. The ADDRESS entries should be separated by spaces.
  - If you are pooling with existing DSS instance(s), you need their CA public cert (ca.crt), which will be concatenated with yours. Set `--ca-cert-to-join` to a `ca.crt` file. Reach out to existing operators to request their public cert. If not joining an existing pool, omit this argument.
  - Note: If you are creating multiple DSS instances at once and joining them together, you likely want to copy the nth instance’s `ca.crt` into the rest of the instances, such that ca.crt is the same across all instances.
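One practical detail when using a wildcard ADDRESS: an unquoted `*` may be expanded by the shell as a glob before `make-certs.py` sees it. The sketch below assembles a command line with the wildcard quoted and prints it for review; `make_certs_command` is a hypothetical helper, and only the flags shown in the usage line above are assumed to exist.

```shell
# Hypothetical helper: print a make-certs.py command line for a single
# (possibly wildcard) node address. The address is wrapped in single quotes
# so the shell passes it through literally instead of expanding it.
# Usage: make_certs_command CLUSTER_CONTEXT NAMESPACE NODE_ADDRESS [CA_CERT_FILE]
make_certs_command() {
  local cluster_context="$1"
  local namespace="$2"
  local node_address="$3"
  local ca_cert="${4:-}"
  local cmd="./make-certs.py --cluster ${cluster_context} --namespace ${namespace} --node-address '${node_address}'"
  # Only add --ca-cert-to-join when joining an existing pool.
  if [ -n "$ca_cert" ]; then
    cmd="${cmd} --ca-cert-to-join ${ca_cert}"
  fi
  printf '%s\n' "$cmd"
}
```

For instance, `make_certs_command "$CLUSTER_CONTEXT" "$NAMESPACE" '*.db.yourdeployment.yourdomain.com'` prints a command for a new pool, and passing a fourth argument such as the path to another operator's `ca.crt` adds `--ca-cert-to-join`.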
- If joining an existing DSS pool, share ca.crt with the DSS instance(s) you are trying to join, and have them apply the new ca.crt, which now contains both your instance’s and the original instance’s public certs, to enable secure bi-directional communication. Each original DSS instance, upon receipt of the combined ca.crt from the joining instance, should perform the actions below. While they are performing those actions, you may continue with the instructions.
  - Overwrite its existing ca.crt with the new ca.crt provided by the DSS instance joining the pool.
  - Upload the new ca.crt to its cluster using `./apply-certs.sh $CLUSTER_CONTEXT $NAMESPACE`
  - Restart its CockroachDB pods to recognize the updated ca.crt: `kubectl rollout restart statefulset/cockroachdb --namespace $NAMESPACE`
  - Inform you when its CockroachDB pods have finished restarting (typically around 10 minutes)
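For operators of an existing instance, the upload and restart actions above could be scripted roughly as follows. This is a sketch under assumptions: the `run` wrapper, the `DRY_RUN` variable, and the `apply_new_ca` function are hypothetical names, and the final `kubectl rollout status` call is an addition (not in the instructions above) that waits for the rolling restart to finish.

```shell
# run: echo a command, and execute it unless DRY_RUN is set to 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Hypothetical helper: apply a newly received combined ca.crt and restart
# the CockroachDB pods. Assumes ca.crt has already been overwritten.
# Usage: apply_new_ca CLUSTER_CONTEXT NAMESPACE
apply_new_ca() {
  local cluster_context="$1"
  local namespace="$2"
  run ./apply-certs.sh "$cluster_context" "$namespace" || return 1
  run kubectl rollout restart statefulset/cockroachdb --namespace "$namespace" || return 1
  # Blocks until the rolling restart completes or fails.
  run kubectl rollout status statefulset/cockroachdb --namespace "$namespace"
}
```

Calling `DRY_RUN=1 apply_new_ca "$CLUSTER_CONTEXT" "$NAMESPACE"` prints the three commands without touching the cluster, which is a convenient way to check them before a real run.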
- Ensure the Docker images are built according to the instructions in the previous section.
- From this working directory, `cp -r ../deploy/services/tanka/examples/minimum/* workspace/$CLUSTER_CONTEXT`. Note that the `workspace/$CLUSTER_CONTEXT` folder should have already been created by the `make-certs.py` script. Replace the imports at the top of `main.jsonnet` to correctly locate the files:

      local dss = import '../../../deploy/services/tanka/dss.libsonnet';
      local metadataBase = import '../../../deploy/services/tanka/metadata_base.libsonnet';

- If providing a .pem file directly as the public key to validate incoming access tokens, copy it to dss/build/jwt-public-certs. Public key specification by JWKS is preferred; if using the JWKS approach to specify the public key, skip this step.
- Edit `workspace/$CLUSTER_CONTEXT/main.jsonnet` and replace all `VAR_*` instances with appropriate values:
  - `VAR_NAMESPACE`: Same `$NAMESPACE` used in the `make-certs.py` (and `apply-certs.sh`) scripts.
  - `VAR_CLUSTER_CONTEXT`: Same `$CLUSTER_CONTEXT` used in the `make-certs.py` and `apply-certs.sh` scripts.
  - `VAR_ENABLE_SCD`: Set this boolean true to enable strategic conflict detection functionality (currently an R&D project tracking an initial draft of the upcoming ASTM standard).
  - `VAR_CRDB_DOCKER_IMAGE_NAME`: Docker image of the CockroachDB pods. Until DSS v0.16, the recommended CockroachDB image name is `cockroachdb/cockroach:v21.2.7`. From DSS v0.17, the recommended CockroachDB version is `cockroachdb/cockroach:v24.1.3`.
  - `VAR_CRDB_HOSTNAME_SUFFIX`: The domain name suffix shared by all of your CockroachDB nodes. For instance, if your CRDB nodes were addressable at `0.db.example.com`, `1.db.example.com`, and `2.db.example.com`, then VAR_CRDB_HOSTNAME_SUFFIX would be `db.example.com`.
  - `VAR_CRDB_LOCALITY`: Unique name for your DSS instance. Currently, we recommend “_ ", and the `=` character is not allowed. However, any unique (among all other participating DSS instances) value is acceptable.
  - `VAR_CRDB_NODE_IPn`: IP address (numeric) of nth CRDB node (add more entries if you have more than 3 CRDB nodes). Example: `1.1.1.1`
  - `VAR_SHOULD_INIT`: Set to `false` if joining an existing pool, `true` if creating the first DSS instance for a pool. When set `true`, this can initialize the data directories on your cluster, and prevent you from joining an existing pool.
  - `VAR_EXTERNAL_CRDB_NODEn`: Fully-qualified domain name of existing CRDB nodes if you are joining an existing pool. If more than three are available, add additional entries. If not joining an existing pool, comment out this `JoinExisting:` line.
    - You should supply a minimum of 3 seed nodes to every CockroachDB node. These 3 nodes should be the same for every node (i.e., every node points to node 0, 1, and 2). For external DSS instances you should point to a minimum of 3, or you can use a load-balanced hostname or IP address of other DSS instances. You should do this for every DSS instance in the pool, including newly joined instances. See CockroachDB’s note on the join flag.
  - `VAR_STORAGE_CLASS`: Kubernetes Storage Class to use for CockroachDB and Prometheus volumes. You can check your cluster’s possible values with `kubectl get storageclass`. If you’re not sure, each cloud provider has some default storage classes that should work:
    - Google Cloud: `standard`
    - Azure: `default`
    - AWS: `gp2`
  - `VAR_INGRESS_NAME`: If using Google Kubernetes Engine, set this to the name of the core-service static IP address created above (e.g., `CLUSTER_NAME-backend`).
  - `VAR_DOCKER_IMAGE_NAME`: Full name of the Docker image built in the section above. `build.sh` prints this name as the last thing it does when run with `DOCKER_URL` set. It should look something like `gcr.io/your-project-id/dss:2020-07-01-46cae72cf` if you built the image yourself, or `docker.io/interuss/dss` if using the InterUSS image without `build.sh`.
    - Note that `VAR_DOCKER_IMAGE_NAME` is used in two places.
  - `VAR_DOCKER_IMAGE_PULL_SECRET`: Secret name of the credentials to access the image registry. If the image specified in VAR_DOCKER_IMAGE_NAME does not require authentication to be pulled, then do not populate this instance and do not uncomment the line containing it. You can use the following command to store the credentials as a Kubernetes secret:

        kubectl create secret -n VAR_NAMESPACE docker-registry VAR_DOCKER_IMAGE_PULL_SECRET \
          --docker-server=DOCKER_REGISTRY_SERVER \
          --docker-username=DOCKER_USER \
          --docker-password=DOCKER_PASSWORD \
          --docker-email=DOCKER_EMAIL

    For a Docker Hub private repository, use `docker.io` as `DOCKER_REGISTRY_SERVER` and an access token as `DOCKER_PASSWORD`.
  - `VAR_APP_HOSTNAME`: Fully-qualified domain name of your Core Service ingress endpoint. For example, `dss.example.com`.
  - `VAR_PUBLIC_KEY_PEM_PATH`: If providing a .pem file directly as the public key to validate incoming access tokens, specify the name of this .pem file here as `/jwt-public-certs/YOUR-KEY-NAME.pem`, replacing YOUR-KEY-NAME as appropriate. For instance, if using the provided `us-demo.pem`, use the path `/jwt-public-certs/us-demo.pem`. Note that your .pem file must have been copied into `jwt-public-certs` in an earlier step, or mounted at runtime using a volume.
    - If providing an access token public key via JWKS, provide a blank string for this parameter.
  - `VAR_JWKS_ENDPOINT`: If providing the access token public key via JWKS, specify the JWKS endpoint here. Example: `https://auth.example.com/.well-known/jwks.json`
    - If providing a .pem file directly as the public key to validate incoming access tokens, provide a blank string for this parameter.
  - `VAR_JWKS_KEY_ID`: If providing the access token public key via JWKS, specify the `kid` (key ID) of the appropriate key in the JWKS file referenced above.
    - If providing a .pem file directly as the public key to validate incoming access tokens, provide a blank string for this parameter.
  - If you are only turning up a single DSS instance for development, you may optionally change `single_cluster` to `true`.
  - `VAR_SSL_POLICY`: When deploying on Google Cloud, an SSL policy can be applied to the DSS Ingress. This can be used to secure the TLS connection. Follow the instructions to create the Global SSL Policy and replace the VAR_SSL_POLICY variable with its name. The `RESTRICTED` profile is recommended. Leave it empty if not applicable.
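After editing, it can help to check that no `VAR_*` placeholder was missed. The sketch below lists placeholders that remain on non-comment lines of a file; `remaining_placeholders` is a hypothetical helper, and it assumes that intentionally unused placeholders (such as an unneeded `VAR_DOCKER_IMAGE_PULL_SECRET`) sit on lines commented out with `//` or `#`.

```shell
# Hypothetical helper: list the distinct VAR_* placeholders still present on
# lines that are not commented out with // or #.
# Usage: remaining_placeholders FILE
remaining_placeholders() {
  grep -v -E '^[[:space:]]*(//|#)' "$1" \
    | grep -o -E 'VAR_[A-Z0-9_]+' \
    | sort -u
}
```

For example, `remaining_placeholders "workspace/$CLUSTER_CONTEXT/main.jsonnet"` prints nothing once every active placeholder has been replaced. Block comments (`/* ... */`) are not handled by this simple filter.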
- Edit `workspace/$CLUSTER_CONTEXT/spec.json` and replace all `VAR_*` instances with appropriate values:
  - `VAR_API_SERVER`: Determine this value with the command:

        echo $(kubectl config view -o jsonpath="{.clusters[?(@.name==\"$CLUSTER_CONTEXT\")].cluster.server}")

    - Note that `$CLUSTER_CONTEXT` should be replaced with your actual `CLUSTER_CONTEXT` value prior to executing the above command if you have not defined a `CLUSTER_CONTEXT` environment variable.
  - `VAR_NAMESPACE`: See previous section.
- Use the `apply-certs.sh` script to create secrets on the Kubernetes cluster containing the certificates and keys generated earlier by `make-certs.py`: `./apply-certs.sh $CLUSTER_CONTEXT $NAMESPACE`
- Run `tk apply workspace/$CLUSTER_CONTEXT` to apply it to the cluster.
  - If you are joining an existing pool, do not execute this command until the existing DSS instances all confirm that their CockroachDB pods have finished their rolling restarts.
- Wait for services to initialize. Verify that basic services are functioning by navigating to https://your-domain.example.com/healthy.
  - On Google Cloud, the highest-latency operation is provisioning of the HTTPS certificate, which generally takes 10-45 minutes. To track this progress:
    - Go to the “Services & Ingress” left-side tab from the Kubernetes Engine page.
    - Click on the `https-ingress` item (filter by just the cluster of interest if you have multiple clusters in your project).
    - Under the “Ingress” section for Details, click on the link corresponding with “Load balancer”.
    - Under Frontend for Details, the Certificate column for HTTPS protocol will have an icon next to it which will change to a green checkmark when provisioning is complete.
    - Click on the certificate link to see provisioning progress.
    - If everything indicates OK and you still receive a cipher mismatch error message when attempting to visit /healthy, wait an additional 5 minutes before attempting to troubleshoot further.
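Because certificate provisioning can take a while, the `/healthy` check can be polled from a terminal instead of a browser. The sketch below is a generic retry loop; `retry` is a hypothetical helper, and the attempt count and delay in the usage note are arbitrary illustrative values.

```shell
# Hypothetical helper: run a command until it succeeds.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local n=1
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt ${n}/${attempts} failed" >&2
    # Sleep between attempts, but not after the last one.
    [ "$n" -lt "$attempts" ] && sleep "$delay"
    n=$((n + 1))
  done
  return 1
}
```

A possible invocation is `retry 60 60 curl -fsS https://your-domain.example.com/healthy`, which tries once a minute for up to an hour; `curl -f` makes HTTP error responses count as failures.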
- If joining an existing pool, share your CRDB node addresses with the operators of the existing DSS instances. They will add these node addresses to JoinExisting where `VAR_CRDB_EXTERNAL_NODEn` is indicated in the minimum example, and then update their deployment: `tk apply workspace/$CLUSTER_CONTEXT`
Pooling
See the pooling documentation.
Tools
Grafana / Prometheus
By default, an instance of Grafana and Prometheus are deployed along with the
core DSS services; this combination allows you to view (Grafana) CRDB metrics
(collected by Prometheus). To view Grafana, first ensure that the appropriate
cluster context is selected (`kubectl config current-context`). Then, run the
following command:
```shell
kubectl get pod | grep grafana | awk '{print $1}' | xargs -I {} kubectl port-forward {} 3000
```

While that command is running, open a browser and navigate to
[http://localhost:3000](http://localhost:3000). The default username is `admin`
with a default password of `admin`. Click the magnifying glass on the left side
to select a dashboard to view.

### Istio

Istio has been removed from the standard deployment. See this [discussion](https://lists.interussplatform.org/g/dss/message/47) for more details.

### Prometheus Federation (Multi Cluster Monitoring)

The DSS uses [Prometheus](https://prometheus.io/docs/introduction/overview/) to
gather metrics from the binaries deployed with this project, by scraping
formatted metrics from an application's endpoint.
[Prometheus Federation](https://prometheus.io/docs/prometheus/latest/federation/)
enables you to easily monitor multiple clusters of the DSS that you operate,
unifying all the metrics into a single Prometheus instance for which you can
build Grafana dashboards. Enabling Prometheus Federation is optional. To enable
it, you need to do two things:

1. Externally expose the Prometheus service of the DSS clusters.
2. Deploy a "Global Prometheus" instance to unify metrics.

#### Externally Exposing Prometheus

You will need to change the values in the `prometheus` fields in your metadata tuples:

1. Set `expose_external` to `true`.
2. [Optional] Supply a static external IP address to `IP`.
3. [Highly Recommended] Supply whitelists of [IP Blocks in CIDR form](https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing); leaving the list empty means anyone can publicly access your metrics.
4. Then run `tk apply ...` to deploy the changes on your DSS clusters.

#### Deploy "Global Prometheus" instance

1. Follow the guide to deploy Prometheus: https://prometheus.io/docs/introduction/first_steps/
2. The scrape rules for this global instance scrape the `/federate` endpoint of the other Prometheus instances and are rather simple; see the [example configuration](https://prometheus.io/docs/prometheus/latest/federation/#configuring-federation).

## Troubleshooting

### Check if the CockroachDB service is exposed

Unless specified otherwise in a deployment configuration, CockroachDB
communicates on port 26257. To check whether this port is open from Mac or
Linux, e.g.: `nc -zvw3 0.db.dss.your-region.your-domain.com 26257`. Or, search
for a "port checker" web page/app. Port 26257 will be open on a working
CockroachDB node.

A standard TLS diagnostic may also be run on this hostname:port combination and
all results should be valid except Trust. Certificates are signed by
"Cockroach CA", which is not a generally-trusted CA, but this is OK.

### Accessing a CockroachDB SQL terminal

To interact with the CockroachDB database directly via SQL terminal:

```
kubectl \
  --context $CLUSTER_CONTEXT exec --namespace $NAMESPACE -it \
  cockroachdb-0 -- \
  ./cockroach sql --certs-dir=cockroach-certs/
```
Using the CockroachDB web UI
The CockroachDB web UI is not exposed publicly, but you can forward a port to your local machine using kubectl:
Create a user account
Pick a username and create an account:
Access the CockroachDB SQL terminal, then create a user with the SQL command:
root@:26257/rid> CREATE USER foo WITH PASSWORD 'foobar';
Access the web UI
kubectl -n $NAMESPACE port-forward cockroachdb-0 8080
Then go to https://localhost:8080. You’ll have to ignore the HTTPS certificate warning.
Upgrading Database Schemas
All schema-related files are in the `db_schemas` directory. Any changes you
wish to make to the database schema should be done in their respective database
folders. The files are applied in sequential numeric steps from the current
version M to the desired version N.
For the first-ever run during the CRDB cluster initialization, the db-manager will run once to bootstrap and bring the database up to date. To upgrade existing clusters you will need to:
If performing this operation on the original cluster
- Update the `desired_xyz_db_version` field in `main.jsonnet`
- Delete the existing db-manager job in your k8s cluster
- Redeploy the newly configured db-manager with `tk apply -t job/<xyz-schema-manager>`. It should automatically upgrade or downgrade your database schema to your desired version.
If performing this operation on any other cluster
- Create `workspace/${CLUSTER_CONTEXT}_schema_manager` in this (build) directory.
- From this (build) working directory, `cp -r ../deploy/services/tanka/examples/schema_manager/* workspace/${CLUSTER_CONTEXT}_schema_manager`.
- Edit `workspace/${CLUSTER_CONTEXT}_schema_manager/main.jsonnet` and replace all `VAR_*` instances with appropriate values where applicable as explained in the above section.
- Run `tk apply workspace/${CLUSTER_CONTEXT}_schema_manager`
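When the schema manager workspace path is typed in a shell, the variable name needs braces: `$CLUSTER_CONTEXT_schema_manager` is read as a single variable named `CLUSTER_CONTEXT_schema_manager`, which is normally unset. The sketch below builds the path unambiguously; `schema_manager_workspace` is a hypothetical helper name.

```shell
# Hypothetical helper: build the schema manager workspace path for a given
# cluster context without relying on shell variable-name parsing.
# Usage: schema_manager_workspace CLUSTER_CONTEXT
schema_manager_workspace() {
  printf 'workspace/%s_schema_manager\n' "$1"
}

# Equivalent inline form, with braces delimiting the variable name:
#   echo "workspace/${CLUSTER_CONTEXT}_schema_manager"
```

For example, `mkdir -p "$(schema_manager_workspace "$CLUSTER_CONTEXT")"` creates the directory used in the steps above.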
Garbage collector job
Since commit c789b2b (Aug 25, 2020), the DSS enables automatic garbage collection of records by tracking which DSS instance is responsible for garbage collection of each record. Expired records added by a DSS deployment running earlier code must be removed manually.
The garbage collector job runs every 30 minutes and deletes records in the RID tables whose end time is more than 30 minutes before the current time. If a run takes longer than 30 minutes (that is, the previous job is still running), the next run is skipped until the previous job completes.
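The expiry rule described above can be illustrated with a small predicate. This is only a sketch of the rule as stated here, not the garbage collector's actual implementation; `is_expired` is a hypothetical helper, times are assumed to be Unix epoch seconds, and treating a record at exactly 30 minutes as not yet expired is an assumption.

```shell
# Hypothetical helper: succeed when a record's end time is more than
# 30 minutes (1800 seconds) before the given current time.
# Usage: is_expired END_EPOCH NOW_EPOCH
is_expired() {
  local end_epoch="$1"
  local now_epoch="$2"
  [ "$end_epoch" -lt $((now_epoch - 1800)) ]
}
```

For example, `is_expired "$record_end" "$(date +%s)"` (with `record_end` holding a record's end time) succeeds only for records that ended more than half an hour ago.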