What is Big Data?

Big Data is a massive collection of both structured and unstructured data, which is exactly what we looked at in our previous post on Machine learning. The term describes data that is both huge in size and grows exponentially over time. Such an immense amount of information is too large and complex to be handled with traditional data management tools, and it exceeds the capabilities of the common software used to manage and process data efficiently.

Big Data Processing

Due to this escalation of data and the evolution of our computing capacity, the need for more sophisticated big data processing tools is of the utmost importance. In the image below, we can see the evolution of big data processing frameworks over time and the further evolution of their technologies. Defined by the three Vs of velocity, volume and variety, big data sits in a separate category from your average data.


A batch is a collection of data points that have been grouped together within a specific time interval. Batch processing is based on volume, as it consists of processing large blocks of data all at once (hence the name). The exact time when each group is processed can be determined in a number of ways: by a triggered condition (e.g. the group is processed once it contains more than 1 MB of data) or by a scheduled time interval (e.g. collected data is processed every ten minutes).
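To make the two triggers concrete, here is a minimal sketch in plain Python. The `BatchBuffer` class and its parameter names are made up for illustration and do not come from any particular framework; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class BatchBuffer:
    """Hypothetical buffer that collects records until a trigger fires."""
    max_bytes: int = 1_000_000       # size trigger: roughly 1 MB
    max_age_seconds: float = 600.0   # time trigger: ten minutes
    opened_at: float = 0.0           # when the current batch received its first record
    records: list = field(default_factory=list)
    size_bytes: int = 0

    def add(self, record: bytes, now: float) -> bool:
        """Store a record and report whether the batch is due for processing."""
        if not self.records:
            self.opened_at = now
        self.records.append(record)
        self.size_bytes += len(record)
        return self.is_due(now)

    def is_due(self, now: float) -> bool:
        """True when either the size trigger or the time trigger has fired."""
        too_big = self.size_bytes > self.max_bytes
        too_old = bool(self.records) and (now - self.opened_at) >= self.max_age_seconds
        return too_big or too_old

    def flush(self) -> list:
        """Hand over the whole batch at once and start a new, empty one."""
        batch, self.records, self.size_bytes = self.records, [], 0
        return batch
```

A caller would check `is_due` (or the return value of `add`) and then pass the result of `flush()` to whatever job processes the block of data.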

Real time

Real-time data processing deals with continuous, unbounded streams of data. Just as its name suggests, data is processed as soon as it's collected, with continual input, processing and output. This type of big data pipeline is important nowadays because of the urgency of processing and the need for instant knowledge.
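As a rough illustration of "continual input, process and output", the sketch below uses a Python generator that emits a result for every incoming value instead of waiting for a full batch. The function name `running_average` is only an example, not part of any streaming library.

```python
from typing import Iterable, Iterator


def running_average(readings: Iterable[float]) -> Iterator[float]:
    """Emit an updated average as soon as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:       # the input may be unbounded
        count += 1
        total += value
        yield total / count      # one output per event, not per batch
```

Because the generator keeps only a count and a running total, it works just as well on a stream that never ends as on a short list.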


Hybrid processing combines the advantages of batch processing with those of real-time processing. Lambda architecture, for example, combines “batch” and “stream” data processing, seeking the advantages offered by both.
This architecture developed largely with the arrival of big data, and it provides a low-cost solution for complex processing problems.
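One way to picture the Lambda idea is a query that merges a result precomputed in batch with a result built from the events that arrived since. The sketch below is a toy under that assumption; the split into batch, speed and serving layers follows the usual description of Lambda architecture, and the function names are invented for this example.

```python
from collections import Counter


def batch_view(historical_events):
    """Batch layer: recompute counts from the full, already-stored history."""
    return Counter(historical_events)


def speed_view(recent_events):
    """Speed layer: count only the events that arrived after the last batch run."""
    return Counter(recent_events)


def query(page, batch, speed):
    """Serving layer: merge both views so the answer is complete and fresh."""
    return batch[page] + speed[page]
```

For example, with a history of `["home", "home", "about"]` and one recent `"home"` event, `query("home", ...)` combines the two views into a single count.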


Tools of Data Processing 


Apache Storm is a computing system based on a master-slave architecture. It's well suited to processing vast amounts of data within short time frames. Storm continues to be a leader in real-time analytics, as it offers low latency, is fully scalable and is easy to deploy.

A recent example would be that of IoT sensors: through the combination of a set of sensors and an efficient communication network, devices can share information with each other and improve their effectiveness.


  • Integrates with queuing systems such as Kestrel, RabbitMQ/AMQP, Kafka, JMS or Amazon Kinesis and database systems.
  • Fault tolerant, with the ability to redeploy tasks when necessary (this is important because it processes massive amounts of data all the time and should not be interrupted by a minor failure).
  • Programmable in any language, including Ruby, Python, JavaScript or Perl.
  • Reliable: ensures that every tuple is fully processed.
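The reliability point can be pictured as "replay a tuple until it has been acknowledged". The following is a toy sketch of that idea in plain Python; it does not use Storm's API, and `process_reliably`, `handler` and `max_attempts` are hypothetical names chosen for the example.

```python
from collections import deque


def process_reliably(tuples, handler, max_attempts=3):
    """Replay each tuple until the handler succeeds or attempts run out."""
    pending = deque((item, 1) for item in tuples)
    acked, failed = [], []
    while pending:
        item, attempt = pending.popleft()
        try:
            handler(item)
            acked.append(item)                       # fully processed: acknowledge
        except Exception:
            if attempt < max_attempts:
                pending.append((item, attempt + 1))  # put it back for a later replay
            else:
                failed.append(item)                  # give up after max_attempts
    return acked, failed
```

Note that a replayed tuple may be handled more than once, so a handler used this way should tolerate duplicates.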



Apache Spark is an open-source distributed cluster-computing framework. Spark Streaming, an extension of the Spark Core API, handles real-time data processing in a scalable, high-performance and fault-tolerant way. Spark is capable of carrying out analytics and machine learning, among other tasks, for batch or streaming processing, with specific APIs for the following programming languages: Scala, Python, Java, R and SQL.

Spark is a distributed platform for executing complex multi-stage applications, such as machine learning algorithms.


As seen in the image above, a DStream (or discretized stream) is simply an abstraction provided by Spark that represents a sequence of RDDs (resilient distributed datasets) ordered in time, each of which holds the data from a particular interval. Thanks to this, the core can analyze the data without knowing that it is a data flow, since Spark handles the creation and coordination of the RDDs itself. The Spark engine (Core) processes the data using operations such as map, join and window, as well as machine learning or graph algorithms. After processing, the data is stored in a file system to be presented in dashboards.
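The micro-batch idea behind a DStream can be imitated in a few lines of plain Python. This is a sketch only and does not use Spark: each inner list stands in for one RDD, and `discretize` and `process` are hypothetical helper names.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Split (timestamp, value) events into time-ordered micro-batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    # Each list stands in for one RDD; sorting the keys keeps them in time order.
    # Intervals that received no events are simply skipped in this sketch.
    return [batches[index] for index in sorted(batches)]


def process(micro_batches, transform):
    """Apply the same batch-style transformation to every micro-batch."""
    return [[transform(value) for value in batch] for batch in micro_batches]
```

The `process` function never needs to know the data came from a stream, which mirrors the point that the core can treat each piece as ordinary batch data.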

Spark Streaming can consume data from a multitude of sources, including Kafka, Flume and Kinesis.



  • Popular and widely adopted technology.
  • Large community of developers.
  • Libraries for Machine Learning & graphs.
  • Supports watermarks.
  • Allows custom memory management.
  • Event-time processing support.


Apache Flink is a native, low-latency data flow processing engine that provides communication, distribution and fault-tolerance capabilities. It includes a flexible window management system that lets you define three types of windows: by size, by time interval and by number of events. It also offers more advanced options, such as triggers, which release a window for execution when specific conditions are fulfilled, and evictors, which remove items from the window under specific conditions. Flink also provides consistency, because correct results are obtained despite errors.
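To give a feel for these window concepts, here is a small sketch in plain Python. It does not use Flink's API; the function names are invented for the example, and it covers only time-based windows, count-based windows and a simple evictor.

```python
def tumbling_time_windows(events, size_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping time windows."""
    windows = {}
    for timestamp, value in events:
        start = int(timestamp // size_seconds) * size_seconds
        windows.setdefault(start, []).append(value)
    return dict(sorted(windows.items()))  # keyed by window start time


def count_windows(values, size):
    """Close a window every `size` events; a trailing partial window is not emitted."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


def evict(window, keep_if):
    """Evictor: drop the items that do not satisfy `keep_if` from a window."""
    return [value for value in window if keep_if(value)]
```

In this sketch the count window doubles as a simple trigger: a window is only released once it holds the required number of events.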

It was first developed in Java and Scala, can now also be used with Python, R and SQL, and performs best in the following kinds of applications:


  • Responding quickly to computationally complex questions in machine learning, statistics, etc.
  • Cleaning and pre-filtering vast amounts of data.
  • Anomaly detection.
  • Monitoring systems that offer real-time alerts.
  • IoT projects.


Written by: Diego Calvo, check out his blog here: http://www.diegocalvo.es/big-data/