Zohar Arad. © 2014
Monitoring is about knowledge, control and reaction.
We monitor stuff so we can meet our SLA without losing sleep.
An imaginary application for restaurant recommendations
When monitoring a monolithic application, we usually watch a limited number of places for breakage.
Now, let's imagine we break up our monolithic application into specialized services.
| | Monolithic | Service-Oriented |
| --- | --- | --- |
| Node Types | Few | Many |
| Log Sources | Few & Similar | Many & Different |
| Monitored Endpoints | App / DB / OS | Apps / APIs / DBs / OS |
| Actionable Data Size | Small / Medium | Medium / Large |
| Data Location | Relatively Close Together | Dispersed |
We're trying to find problems and sore points by spotting anomalies in log data.
In monolithic setups, we usually work with smaller data sizes, from a limited number of types and sources.
When moving to service-oriented setups, our log data sources grow in number and variance.
Our challenge, therefore, lies in dealing with larger volumes of data, from more varied sources, effectively and efficiently.
Let's see how we can monitor our new Web application and social login services.
Look at the Nginx access logs and find HTTP requests with statuses 200, 404 and 500.
Anomalous HTTP statuses signify breakage that needs our attention.
Kibana is a visual interface for interacting with large amounts of aggregated log data.
Kibana's working assumption is that log data analysis is a good way to find problems.
```
input {
  # Tail everything under the Nginx log directory, skipping rotated
  # archives and error logs; rescan for new files every 10 seconds.
  file {
    type => "nginx_access"
    path => ["/var/log/nginx/**"]
    exclude => ["*.gz", "error.*"]
    discover_interval => 10
  }
}

output {
  # Ship events to our Elasticsearch node.
  elasticsearch {
    host => "es01.wheretoe.at"
    port => "9300"
  }
}
```
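The config above ships raw log lines as-is. To tame data variance, a filter section (a sketch, not from the original config; in a real config file it sits between input and output) could parse each line into structured fields. Nginx's default "combined" access-log format matches the stock COMBINEDAPACHELOG grok pattern:

```
filter {
  if [type] == "nginx_access" {
    grok {
      # Extract structured fields (clientip, verb, request,
      # response, bytes...) from each raw access-log line.
      match => ["message", "%{COMBINEDAPACHELOG}"]
    }
  }
}
```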
Nginx -> LogStash -> Elasticsearch
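Once events flow through this pipeline, Kibana's query bar (Lucene query syntax) can surface the anomalous statuses from earlier. A hypothetical query, with field names assuming the grok filter sketched above:

```
type:"nginx_access" AND (response:500 OR response:404)
```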
As our architecture evolves into small distributed services, our ability to find problems in each service becomes more limited.
Our challenge is to find anomalies in large, variable amounts of data and to act on our findings.
| Problem | Solution |
| --- | --- |
| Data Variance | LogStash (Grok) |
| Data Aggregation | LogStash |
| Data Size | Elasticsearch |
| Data Querying | Elasticsearch |
| Visualization | Kibana |
Boss: Zohar?
Me: Yes?
Boss: Users can't login to our system...
Me: Wasting an hour only to find out the disk is full on login02.wheretoe.at
Let's see how we can monitor our login service's KPIs.
Check OS health (disk, load, memory) and the number of active logins, and send an SMS if things go wrong*
* Kept to a minimum for the sake of simplicity
We want to define a set of parameters that describes the operational norm, check those parameters periodically, and handle abnormalities.
For example: high CPU and a low number of logged-in users indicate a problem in our users DB server.
"check_load": {
"command": "/etc/sensu/plugins/system/check-load.rb -c 5,10,15 -w 3,7,10",
"interval": 60,
"subscribers": ["common"],
"handlers": ["sms", "graphite"]
}
"sms": {
"command": "/etc/sensu/handlers/notifications/pagerduty.rb",
"severities": [ "warning","critical" ],
"type": "pipe"
}
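The check above also routes results to a "graphite" handler that the deck doesn't show. Here's a sketch of what it could look like, assuming a Graphite carbon listener at graphite.wheretoe.at (a made-up host): Sensu's TCP handler, combined with the built-in only_check_output mutator, forwards raw check output to Graphite's plaintext port.

```json
{
  "handlers": {
    "graphite": {
      "type": "tcp",
      "socket": {
        "host": "graphite.wheretoe.at",
        "port": 2003
      },
      "mutator": "only_check_output"
    }
  }
}
```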
Our challenge is to identify abnormal events in the ongoing operation of our system components and react to them in real-time.
We can check anything we like (it's just a Ruby script; see the sketch below)
We can handle results in multiple ways based on criteria (e.g. always send to Graphite, but only send an SMS when critical)
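Since a check is just a Ruby script, here is a minimal sketch of one that watches the number of active logins. Everything specific in it is an assumption for illustration: the Redis session store, the login01.wheretoe.at host, the active_sessions key and the thresholds. Sensu's contract is simple: print one line of output and exit 0 (OK), 1 (warning), 2 (critical) or 3 (unknown).

```ruby
#!/usr/bin/env ruby
# Hypothetical Sensu check: alert when active logins drop too low.
require 'redis'

WARN_BELOW = 50   # warn when active logins drop below this
CRIT_BELOW = 10   # go critical below this

begin
  # Assumption: active sessions are tracked as members of a Redis set.
  active = Redis.new(host: 'login01.wheretoe.at').scard('active_sessions')
rescue StandardError => e
  puts "CheckActiveLogins UNKNOWN: #{e.message}"
  exit 3
end

if active < CRIT_BELOW
  puts "CheckActiveLogins CRITICAL: only #{active} active logins"
  exit 2
elsif active < WARN_BELOW
  puts "CheckActiveLogins WARNING: #{active} active logins"
  exit 1
else
  puts "CheckActiveLogins OK: #{active} active logins"
  exit 0
end
```

Drop the script under /etc/sensu/plugins/, reference it from a check definition like check_load above, and the sms and graphite handlers take care of the rest.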
Presentation: http://goo.gl/68A2bj
E: zohar@zohararad.com | Github: @zohararad | Twitter: @zohararad