Effectively Monitor Your Micro-Service Architectures

Zohar Arad. © 2014

About Me

Why do we need monitoring?

Real Time Visibility

Real-Time and Historical Analysis

Automatic Reaction

Monitoring is about knowledge, control and reaction.
We monitor stuff so we can meet our SLA without losing sleep.

From Monolithic to Service-Oriented Architectures


An imaginary application for restaurant recommendations


When monitoring a monolithic application, we usually watch a limited number of places for breakage


Now, let's imagine we break up our monolithic application into specialized services

wheretoe.at - V2

What Changed from a monitoring perspective?

                        Monolithic                  Service-Oriented
  Node Types            Few                         Many
  Log Sources           Few & Similar               Many & Different
  Monitored Endpoints   App / DB / OS               Apps / APIs / DBs / OS
  Actionable Data Size  Small / Medium              Medium / Large
  Data Location         Relatively Close Together   Dispersed

Let's Recap

We're trying to find problems / sore points by finding anomalies in log data.

In monolithic setups, we usually work with smaller data sizes, from a limited number of types and sources.

Let's Recap

When moving to service-oriented setups, our log data sources grow in number and variance.

Therefore, our challenge lies in dealing effectively and efficiently with larger volumes of data from varied sources.

Back at wheretoe.at H.Q.

Let's see how we can monitor our new Web application and social login services.

Monitoring Strategy

Look at Nginx access logs and find HTTP requests with status 200, 404, 500


Anomalous HTTP statuses signify breakage that needs our attention

Say hello to

Kibana in a nutshell

Kibana is a visual interface for interacting with large amounts of aggregated log data.

Kibana's working assumption is that log data analysis is a good way to find problems

Kibana Mechanics

Setup at wheretoe.at

LogStash - Hello World

LogStash Example - Input

        input {
            file {
                type => "nginx_access"
                path => ["/var/log/nginx/**"]
                exclude => ["*.gz", "error.*"]
                discover_interval => 10
            }
        }
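Between input and output, a filter turns raw log lines into structured fields. A minimal sketch using Grok's stock `COMBINEDAPACHELOG` pattern (Nginx's default combined log format matches it, and the HTTP status we want to graph lands in the `response` field):

```
        filter {
            if [type] == "nginx_access" {
                grok {
                    match => ["message", "%{COMBINEDAPACHELOG}"]
                }
            }
        }
```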

LogStash Example - Output

        output {
            elasticsearch {
                host => "es01.wheretoe.at"
                port => "9300"
            }
        }

A short while later...

Nginx -> LogStash -> Elasticsearch

HTTP Response Codes Graph

Let's Recap

As our architecture evolves into small distributed services, our ability to find problems in each service becomes more limited.

Our challenge is finding anomalies in large, variable amounts of data and doing something with our findings.

Let's Recap

Problem            Solution
Data Variance      LogStash (Grok)
Data Aggregation   LogStash
Data Size          Elasticsearch
Data Querying      Elasticsearch
Visualization      Kibana

When should I use Kibana?

So far so good?

2am...

Boss: Zohar?
Me: Yes?
Boss: Users can't login to our system...
Me: Wasting an hour only to find out the disk is full on login02.wheretoe.at

Kibana helps me find problems, not deal with them!

Back at wheretoe.at H.Q.

Let's see how we can monitor our login service KPI.

Monitoring Strategy

Check OS health (disk, load, memory) and the number of active logins, and send an SMS if things go wrong

* kept to a minimum for the sake of simplicity

What are we trying to do?

We want to define a set of parameters that describe the operational norm, check those parameters periodically and handle abnormalities.

For example: high CPU combined with a low number of logged-in users indicates a problem in our users DB server.
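That rule can be sketched as a tiny Sensu-style check. Everything here is hypothetical — the `login_check` helper, the thresholds and the sample inputs are illustrative; Sensu itself only cares about the plugin's output and exit code (0 = OK, 1 = WARNING, 2 = CRITICAL):

```ruby
# Hypothetical Sensu-style check: high CPU load combined with few
# logged-in users suggests trouble in the users DB server.
# Returns [exit_code, message]; 0 = OK, 1 = WARNING, 2 = CRITICAL.
def login_check(load, logins, load_threshold: 5.0, login_threshold: 10)
  if load > load_threshold && logins < login_threshold
    [2, "CRITICAL: load #{load} with only #{logins} active logins"]
  elsif load > load_threshold
    [1, "WARNING: load #{load}"]
  else
    [0, "OK: load #{load}, #{logins} active logins"]
  end
end

# In a real plugin the inputs come from the node itself, e.g.:
#   load   = File.read('/proc/loadavg').split.first.to_f  # Linux only
#   logins = a query against the users DB / session store
status, message = login_check(8.2, 3)
puts message  # => CRITICAL: load 8.2 with only 3 active logins
# exit status   <- a real plugin would exit with this code
```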

Say hello to

Sensu in a nutshell

Sensu Mechanics

Setup at wheretoe.at

Sensu Example - Load Check

        "check_load": {
          "command": "/etc/sensu/plugins/system/check-load.rb -c 5,10,15 -w 3,7,10",
          "interval": 60,
          "subscribers": ["common"],
          "handlers": ["sms", "graphite"]

Sensu Example - SMS Handler

        "sms": {
          "command": "/etc/sensu/handlers/notifications/pagerduty.rb",
          "severities": [ "warning","critical" ],
          "type": "pipe"

Let's Recap

Our challenge is to identify abnormal events in the ongoing operation of our system components and react to them in real-time.

Let's Recap

  1. With Sensu we can identify abnormalities by running a variety of checks at specified intervals
  2. Check results are piped to designated handlers that decide how to react to breakage

Let's Recap

We can check anything we like (it's just a Ruby script)

We can handle results in multiple ways based on criteria (e.g. always send to Graphite, but only SMS when critical)
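A handler is equally small. Sensu pipes the full event as JSON into the handler's STDIN; the sketch below factors the decision into a plain function so it is easy to follow — the event shape (`client`/`check` keys, numeric `status`) follows Sensu's event format, but the SMS call itself is a placeholder:

```ruby
require 'json'

# Decide whether an event is worth an SMS.
# Sensu pipes the full event as JSON into a handler's STDIN;
# here the decision is factored out so the logic is easy to test.
def handle_event(json)
  event = JSON.parse(json)
  status = event['check']['status'] # 0 = OK, 1 = WARNING, 2 = CRITICAL
  if status == 2
    # Placeholder: call your SMS gateway here instead of building a string.
    "SMS: #{event['client']['name']}: #{event['check']['output']}"
  else
    nil # ignore non-critical events
  end
end

# A real handler would end with: puts handle_event(STDIN.read)
sample = '{"client":{"name":"login02.wheretoe.at"},' \
         '"check":{"status":2,"output":"disk full"}}'
puts handle_event(sample)  # => SMS: login02.wheretoe.at: disk full
```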

Why should I use Sensu?

Putting it all Together


Some Links...

Thank You!

Presentation: http://goo.gl/68A2bj

E: zohar@zohararad.com | Github: @zohararad | Twitter: @zohararad