Big Data and ML Pipelines

January 19, 2018

Data Lake, Business Intelligence, Enterprise Data Warehouse, Big Data Pipeline, Online Machine Learning, Lambda Architecture, Streaming, Spark, Kafka, Storm, Flink, Hadoop, Mesos, and the SMACK stack are some of the terms you run into when you start building a data pipeline. As the presentation shows, the Big Data Landscape cannot fit on a single screen. On top of that come the Big Data & Machine Learning offerings AWS has introduced over the past few years, which address many pain points highlighted by the various communities and help you get up and running faster.

The objective of this talk is to give the audience a framework that helps them define their pipeline problems, isolate components, and pick the right tool for each job.

We talk about:

  1. A consistent definition of BIG in big data

  2. The lineage of fundamental tools in the ecosystem

  3. First principles of a big data pipeline, based on the Lambda Architecture (not to be confused with lambda functions) and the Kappa Architecture

  4. Distinguishing between big data and online machine learning pipelines

  5. Technology choices based on first principles, open source solutions and AWS offerings

  6. Demo: Serverless, Managed Big Data Pipeline and real-time dashboard on AWS (orchestrated via Terraform)
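To make the Lambda Architecture from item 3 concrete, here is a minimal Python sketch of its three layers. It is an illustration only and is not taken from the deck: the function names (`batch_view`, `speed_view`, `query`) and the page-view events are hypothetical, and in-memory lists stand in for real storage and streaming systems.

```python
from collections import Counter

# Hypothetical page-view events; a real pipeline would read these from
# durable storage (batch layer) and a message stream (speed layer).

def batch_view(master_events):
    """Batch layer: recompute counts from the full, immutable event log."""
    return Counter(event["page"] for event in master_events)

def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)

def query(page, batch, speed):
    """Serving layer: merge the batch and speed views at query time."""
    return batch[page] + speed[page]

master = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
recent = [{"page": "/home"}]

batch = batch_view(master)
speed = speed_view(recent)
print(query("/home", batch, speed))
```

In a Kappa Architecture there is no separate batch path: the same stream-processing code would rebuild the view by replaying the whole event log.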

This deck was presented at the Vancouver AWS User Group.