How PaaSTA Interacts with SmartStack
PaaSTA uses SmartStack configuration to influence the deployment and monitoring of services. This document assumes some prior knowledge about SmartStack; see http://nerds.airbnb.com/smartstack-service-discovery-cloud/.
How SmartStack Settings Influence Deployment
In SmartStack, a service can be configured to be discovered at a particular latency zone. At Yelp, these latency zones are given names; from smallest to largest they are habitat, region, and superregion.
The discover: key in smartstack.yaml sets the scope at which individual tasks can be discovered and load-balanced to. For example, a setting of discover: superregion would discover one large pool of tasks for use, regardless of which region or habitat they are in. E.g. a service in habitat A will make requests of a service in any of habitats A-F. This is great for availability (six habitats to try) but may introduce latency if communications are slow between region 1 and region 2.
As another example, a setting of discover: habitat would make it so only the tasks in a particular habitat are discoverable by the clients in that habitat. E.g. a service in habitat A will make requests of a service only in habitats A-C. This is great for latency (it only talks to habitats that are topographically "nearby") but reduces availability, since only three habitats can be reached.
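Put concretely, this trade-off is selected in the service's smartstack.yaml. A minimal sketch (the port number and instance name are illustrative):

```yaml
# smartstack.yaml
main:
  proxy_port: 20001
  # Widest scope: any task in the superregion can serve any client in it.
  # Use "habitat" instead to trade availability for lower latency.
  discover: superregion
```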
What Would Happen if PaaSTA Were Not Aware of SmartStack
PaaSTA uses Marathon to deploy long-running services. At Yelp, PaaSTA clusters are deployed at the superregion level. This means that a service could potentially be deployed on any available host in that superregion that has resources to run it. If PaaSTA were unaware of the SmartStack discover: settings, Marathon would naively deploy tasks in a potentially "unbalanced" manner:
With the naive approach, there is a total of six tasks for the superregion, but four landed in region 1 and two landed in region 2. If the discover setting were set to habitat, there would be habitats without tasks available to serve anything, likely causing an outage.
In a world with configurable SmartStack discovery settings, the deployment system (Marathon) must be aware of these and deploy accordingly.
What A SmartStack-Aware Deployment Looks Like
By taking advantage of the Marathon Constraint Language, specifically the GROUP_BY operator, Marathon can deploy tasks in such a way as to ensure a balanced number of tasks in each latency zone.
Example: Balanced deployment to every habitat
For example, if the SmartStack setting were discover: habitat [1], Marathon could enforce the constraint ["habitat", "GROUP_BY"], which asks Marathon to distribute tasks evenly between the habitats [2]:
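Marathon accepts such constraints as part of the app definition. A hypothetical fragment of what PaaSTA might submit for this service (the app id and instance count are made up; shown as YAML, though Marathon's REST API takes JSON):

```yaml
id: /service.main
instances: 6
constraints:
  - ["habitat", "GROUP_BY"]   # spread tasks evenly across habitats
```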
Example: Deployment balanced to each region
Similarly, if the discover setting were set to region, the equivalent Marathon constraint would ensure an equal number of tasks distributed to each region.
Even though there are some habitats in this diagram that lack the service, the discover: region setting allows clients to utilize any process as long as it is in the local region. The Marathon constraint of ["region", "GROUP_BY"] ensures that tasks are distributed equally over the regions, in this case three in each.
[1] Technically PaaSTA should be using the smallest value of the advertise setting, tracked in PAASTA-1253.
[2] Currently the instances: count represents the total number of instances in the cluster. Eventually with PAASTA-1254 the instance count will be a per-discovery-location setting, meaning there will always be an equal number of instances per location. (With instances: 6, a discovery: habitat setting, and three habitats, the total task count would be 18, 6 in each habitat.)
How SmartStack Settings Influence Monitoring
If a service is in SmartStack, PaaSTA uses the same discover setting referenced above to decide how the service should be monitored. When a service author sets a particular setting, say discover: region, it implies that the system should enforce availability of that service in every region. If there are regions that lack tasks to serve that service, then PaaSTA should alert.
Example: Checking Each Habitat When discover: habitat
If SmartStack is configured to discover: habitat, PaaSTA configures Marathon to balance tasks to each habitat. But what if it is unable to do that?
In this case, there are no tasks in habitat F. This is a problem because discover: habitat implies that any clients in habitat F will not be able to find the service. It is down in habitat F.
To detect and alert on this, PaaSTA uses the discover setting to decide which unique locations to look at (e.g. habitat). PaaSTA iterates over each unique location (e.g. habitats A-F) and inspects the replication levels in each location. It finds that there is at least one habitat with too few instances (habitat F, which has 0 out of 1) and alerts.
The output of the alert or paasta status looks something like this:
Smartstack:
habitatA - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatB - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatC - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatD - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatE - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatF - Critical - in haproxy with (0/1) total backends UP in this namespace.
In this case the service authors have a few actions they can take:
- Increase the total instance count to have more tasks per habitat. (In this example, each habitat contains a single point of failure!)
- Change the discover setting to region to increase availability at the cost of latency.
- Investigate why tasks can't run in habitat F. (Lack of resources? Improper configs? Missing service dependencies?)
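For example, the second option is a one-line change in smartstack.yaml (the port number is illustrative):

```yaml
# smartstack.yaml
main:
  proxy_port: 20001
  discover: region   # was "habitat"; clients may now use any task in their region
```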
Example: Checking Each Region When discover: region
If SmartStack is configured to discover: region, PaaSTA configures Marathon to balance tasks to each region. But what if Marathon is unable to launch all of the tasks, while some tasks are still running in each region?
The output of the alert or paasta status looks something like this:
Smartstack:
region1 - Healthy - in haproxy with (3/3) total backends UP in this namespace.
region2 - Warning - in haproxy with (2/3) total backends UP in this namespace.
Assuming a threshold of 50%, an alert would not be sent to the team in this case: 2 of the 3 expected backends (67%) are still up in region 2, which is above the threshold.
Even if some habitats do not have tasks for this service, discover: region ensures that clients can be satisfied by tasks in the same region, if not by tasks in the same habitat.
The Relationship Between Nerve “namespaces” and PaaSTA “instances”
Example: One-to-one Mapping
SmartStack’s Nerve component can be configured to advertise different components of the same service on different ports. In PaaSTA we call these “Nerve namespaces”. Here is an example of a service with two namespaces:
#smartstack.yaml
main:
  proxy_port: 20001
api:
  proxy_port: 20002
The corresponding Marathon configuration in PaaSTA might look like this:
#marathon.yaml
main:
  instances: 10
  cmd: myserver.py
api:
  instances: 3
  cmd: apiserver.py
In this way, a service can provide two different "modes" while using the same codebase. Here there is a one-to-one mapping between the "Nerve namespaces" and the "PaaSTA instances". By default, PaaSTA advertises a service under the Nerve namespace with the same name as the instance.
Example: Canary
However, there are situations where you would like to pool instances together under the same Nerve namespace. Consider this example:
#smartstack.yaml
main:
  proxy_port: 20001
#marathon.yaml
main:
  instances: 10
  cmd: myserver.py
canary:
  instances: 1
  registrations: ['service.main']
  cmd: myserver.py --experiment
With this example, the canary instance gets advertised under the main Nerve namespace, which gives an effective pool of 11 instances listening behind port 20001.
This allows the canary instance, which may be configured differently than the main instance, to participate in the normal “main” pool.
Example: Sharding
Sharding is another use case for using alternative namespaces:
#smartstack.yaml
main:
  proxy_port: 20001
#marathon.yaml
shard1:
  instances: 10
  registrations: ['service.main']
shard2:
  instances: 10
  registrations: ['service.main']
shard3:
  instances: 10
  registrations: ['service.main']
These shards all end up being load-balanced in the same “main” pool. More complex YAML definitions can take advantage of YAML’s anchors and references to reduce duplication.
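As a sketch of that last point, one shard can carry a YAML anchor that the others reference, assuming the config loader resolves standard YAML aliases:

```yaml
# marathon.yaml
shard1: &shard_defaults
  instances: 10
  registrations: ['service.main']
shard2: *shard_defaults   # identical settings, no duplication
shard3: *shard_defaults
```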
Addendum: Non-Smartstack Monitoring
If a service is not in SmartStack, then our monitoring requirements are greatly simplified. PaaSTA simply looks at the number of tasks that are running and compares it to the requested task count. If the running task count is under the configured percentage threshold (default 50%), an alert is sent. The distribution of tasks among latency zones (habitats, regions, etc.) is not taken into account.