Introduction to Big Data

About Me

Full-Stack Web Developer
Lead-Tech at Quicklizard Ltd.
Freelance consultant & trainer

I am not a big-data scientist, but I am a big-data user.

Our agenda

Demystify the term "Big Data"
Find out what Hadoop is
Explore the realms of batch and real-time big data processing
Explore challenges of size, speed and scale in databases
Skim the surface of big-data technologies
Provide ways into the big-data world

Big Data
Demystified

What is big data?

Big data is a collective term for a set of technologies designed for storage, querying and analysis of extremely large data sets, sources and volumes.

Big data technologies come in where traditional off-the-shelf databases, data warehousing systems and analysis tools fall short.

How did we end up with so much data?

Data Generation: Human (Internal) ↦ Human (Social) ↦ Machine
Data Processing: Single Core ↦ Multi-Core ↦ Cluster / Cloud

An Important Side Note

Big Data technologies are based on the concept of clustering - Many computers working in sync to process chunks of our data.

Not just size

Big data isn't just about data size, but also about data volume, diversity and inter-connectedness.

Big data is

Any attribute of our data that challenges either technological capabilities or business needs, like:

Scaling, moving, storage and retrieval of ever-growing generated data
Processing many small data points in real-time
Analysing diverse semi-structured data from multiple sources
Querying multiple, diverse data sources in real-time

Breathe... Let's recap

Lots of data due to technological capabilities and social paradigms
Not just size! Diversity, volume and inter-connectedness also count
Scale, speed, processing, querying and analysis
Challenges technological capabilities or business needs

Hadoop
The Elephant in the Room

Everyone talks about Hadoop

Hadoop is a powerful platform for batch analysis of large volumes of both structured and unstructured data.
From: Conquering Hadoop with Haskell

Hadoop explained

Hadoop is a horizontally scalable, fault-tolerant, open-source file system and batch-analysis platform capable of processing large amounts of data.

HDFS - Hadoop Distributed File System
M/R - Hadoop Map-Reduce platform

Hadoop explained

HDFS is an ever-growing file system. We can store lots and lots of data on it for later use.

HDFS is used as the underlying platform for other technologies like Hadoop M/R, Apache Mahout or HBase.

Hadoop explained

Imagine we want to look at 30 days worth of access logs to identify site usage patterns at a volume of 30M log entries per day.

Hadoop M/R is a platform that allows us to query HDFS data in parallel for the purpose of batch (offline) data processing and analysis.
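To make the access-log example concrete: 30 days at 30M entries per day is 900M log lines. The sketch below shows the map, shuffle and reduce steps on a handful of lines, in plain Python rather than on Hadoop itself. The log format ("ip path status"), the sample data and all function names are assumptions made for illustration; on a real cluster the framework runs the map and reduce functions in parallel over chunks of HDFS data.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample log lines in the form "<ip> <path> <status>".
LOG_LINES = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /pricing 200",
    "10.0.0.1 /home 200",
    "10.0.0.3 /home 404",
]

def map_phase(line):
    """Map: emit a (path, 1) pair for each log entry."""
    _ip, path, _status = line.split()
    yield path, 1

def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle and reduce sequentially on a single machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

hits_per_page = run_job(LOG_LINES)
print(hits_per_page)  # {'/home': 3, '/pricing': 1}
```

Because each map call looks at one line only and each reduce call looks at one key only, the work can be split across many machines without the functions knowing about each other.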

Why is Hadoop so important?

Scalable and fault-tolerant
Handles massive amounts of data
Truly parallel processing
Data can be semi-structured or unstructured (schemaless)
Serves as a basis for other technologies (HBase, Mahout, Impala, Shark)

Hadoop - Words of caution

Complex
Not for real-time
Choose a distribution (Cloudera, Hortonworks, MapR) for better interoperability
Requires trained DevOps for day-to-day operations

Breathe...

We demystified the term Big Data and took a glimpse at Hadoop. Now what?

How do I really get into the Big Data world?

The world of big data

Batch & Data Science
DBs
Real-Time

Batch Processing
Hadoop M/R

Batch processing of large data sets

We collect data in order to give end-users a better experience in our business domain. This means we have to constantly query our data and divine new insights and relevant information.

The problem is that doing this at very large scale is painful and slow.

How do we do this on Hadoop data?

Batch processing of large data sets

Hadoop gives us the basic tools for large data processing in the form of M/R.
However, Hadoop M/R is pretty annoying to work with directly, as it lacks many of the tools needed for the job (statistical analysis, machine learning etc.).

Source: http://xiaochongzhang.me/blog/?p=338

Hadoop querying and data science tools

Tool	Purpose
Hive	Write SQL-like M/R queries on top of Hadoop
Shark	Hive-compatible, distributed SQL query engine for Hadoop
Pig	Write scripted M/R queries on top of Hadoop
Impala	Real-time SQL-like queries of Hadoop
Mahout	Scalable machine-learning on top of Hadoop M/R

The gentle way in

Hive and Shark are great places to start due to their SQL-like nature
Shark is faster than Hive - less frustration
You need some Hadoop data to work with (consider Avro)
Remember - it's SQL-like, not SQL
Start small, locally and grow to production later
Check out Apache Sqoop for moving processed Hadoop data to your DB

Databases
In the big data world

Databases in the big data world

The Problem: Traditional RDBMS were not designed for storing, indexing and querying growing amounts and volumes of data.

The 3S Challenge:

Size - How much data is written and read
Speed - How fast can we write and read data
Scale - How easily can our DB scale to accommodate more data

The 3S Challenge

There's no single, simple solution to the 3S challenge. Instead, solutions focus on making an informed sacrifice in one area in order to gain in another area.

NoSQL and C.A.P.

NoSQL is a term referring to a family of DBMS that attempt to resolve the 3S challenge by sacrificing one of three areas:

Consistency - All clients have the same view of data
Availability - Each client can always read and write
Partition Tolerance - System works despite physical network failures

NoSQL and C.A.P.

C.A.P. means you have to make an informed choice (and sacrifice)
No single perfect solution
Opt for mixed solutions per use-case
Remember we're talking about read/write volume, not just size

Confused?
Let's take a breath and focus

Source: http://blog.nahurst.com/visual-guide-to-nosql-systems

OK, so where do I go from here?

Identify your needs and limitations
Choose a few candidates
Research & Prototype
Read about NewSQL - VoltDB, InfiniDB, MariaDB, HyperDex, FoundationDB (omitted due to time constraints).

Real-Time
Big Data Now!

Real-Time big data processing

Processing big data in real-time is about data volumes rather than just size. For example, given a rate of 100K ops/sec, how do I do the following in real-time?

Find anomalies in a data stream (spam)
Group check-ins by geo
Identify trending pages / topics
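As a rough illustration of the "trending pages" case, here is a single-process sketch of a sliding-window counter. It is not how Storm or Spark implement it; the class name, the 60-second window and the sample events are all assumptions, and timestamps are assumed to arrive in non-decreasing order. A real deployment would spread this work across many machines to keep up with 100K ops/sec.

```python
from collections import Counter, deque

class TrendingPages:
    """Count page views inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, page) pairs, oldest first
        self.counts = Counter()  # page -> views inside the current window

    def record(self, timestamp, page):
        """Register one page view and drop views that left the window."""
        self.events.append((timestamp, page))
        self.counts[page] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Events at or before (now - window) no longer count as trending.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_page = self.events.popleft()
            self.counts[old_page] -= 1
            if self.counts[old_page] == 0:
                del self.counts[old_page]

    def top(self, n):
        """Return the n most viewed pages in the current window."""
        return self.counts.most_common(n)

tracker = TrendingPages(window_seconds=60)
# Hypothetical stream of (seconds, page) events.
for ts, page in [(0, "/a"), (10, "/b"), (20, "/b"), (70, "/c"), (75, "/b")]:
    tracker.record(ts, page)

print(tracker.top(2))  # [('/b', 2), ('/c', 1)]
```

The point of the sketch is that each event is handled once, as it arrives, and old data is discarded continuously, which is the opposite of collecting everything first and querying it later in a batch job.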

Hadoop isn't for real-time processing

When it comes to data processing and analysis, Hadoop's M/R framework is wonderful for batch (offline) processing.

However, processing, analysing and querying Hadoop data in real-time is quite difficult.

Apache Storm and Apache Spark

Apache Storm and Apache Spark are two frameworks for large-scale, distributed data processing in real-time.

One could say that Storm and Spark are to real-time data processing what Hadoop M/R is to batch data processing.

Apache Storm - Highlights

Runs on the JVM (Clojure / Java mix)
Fully distributed and fault-tolerant
Highly-scalable and extremely fast
Interoperability with popular languages (Scala, Python etc.)
Mature and production ready
Hadoop interoperability via Storm-YARN
Stateless / Non-Persistent (Data brought to processors)

Apache Spark - Highlights

Fully distributed and extremely fast
Write applications in Java, Scala and Python
Perfect for both batch and real-time
Combine Hadoop SQL (Shark), Machine Learning and Data streaming
Native Hadoop interoperability
HDFS, HBase, Cassandra, Flume as data sources
Stateful / Persistent (Processors brought to data)

Storm & Spark - Use Cases

Continuous/Cyclic Computation
Real-time analytics
Machine Learning (e.g. recommendations, personalisation)
Graph Processing (e.g. social networks) - Only Spark
Data Warehouse ETL (Extract, Transform, Load)

Recap

Term	Purpose
Big Data	Collective term for data-processing solutions at scale
Hadoop	Scalable file-system and batch processing platform
Batch Processing	Sifting and analysing data offline / in background
M/R	Parallel, batch data-processing algorithm
3S Challenge	Size, Speed, Scale of DBs
C.A.P	Consistency, Availability, Partition Tolerance
NoSQL	Family of DBMS that grew due to the 3S Challenge
NewSQL	Family of DBMS that provide ACID at scale

Questions?

See presentation on http://goo.gl/NOq2qX

Feel free to drop me a line:
Email: zohar [AT] zohararad.com
Github: zohararad

Thank You!
