Learning PySpark

It is estimated that in 2013 the whole world produced around 4.4 zettabytes of data; that is, 4.4 billion terabytes! By 2020, we (as the human race) are expected to produce ten times that. With data getting larger literally by the second, there is a growing appetite for making sense of it.

In this book, we will guide you through the latest incarnation of Apache Spark using Python. We will show you how to read structured and unstructured data, how to use some fundamental data types available in PySpark, how to build machine learning models, operate on graphs, read streaming data, and deploy your models in the cloud. Each chapter will tackle a different problem, and by the end of the book we hope you will be knowledgeable enough to solve other problems we did not have space to cover here.

Purchasing the book

You can purchase the book on Amazon and Packt.

With this book, you will learn about a wide variety of topics including Apache Spark and the Spark 2.0 architecture; build and interact with Spark DataFrames using Spark SQL; learn how to solve graph and deep learning problems using GraphFrames and TensorFrames respectively; and read, transform, and understand data and use it to train machine learning models with MLlib and ML.

Spark SQL Engine / Catalyst Optimizer

Table of contents:

Understanding Spark
Resilient Distributed Dataset
DataFrames
Preparing Data for Modeling
Introducing MLlib
Introducing the ML Package
GraphFrames
TensorFrames
Polyglot Persistence with Blaze
Structured Streaming
Packaging Spark Applications

The code samples within this book can be found at: https://github.com/drabastomek/learningPySpark.

About us

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 13 years of experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting, gained while working on three continents: Europe, Australia, and North America. While in Australia, Tomasz worked on his PhD in Operations Research with a focus on choice modeling and revenue management applications in the airline industry.

At Microsoft, Tomasz works with big data on a daily basis, solving machine learning problems such as anomaly detection, churn prediction, or pattern recognition using Spark.

Tomasz has also authored the Practical Data Analysis Cookbook published by Packt Publishing in 2016; you can purchase that book on Amazon, Packt and O'Reilly.

Denny Lee is a Principal Program Manager at Microsoft for the Azure DocumentDB team - Microsoft's blazing fast, planet-scale managed document store service. He is a hands-on distributed systems and data sciences engineer with more than 20 years of experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premise and cloud environments.

He has extensive experience building greenfield teams and acting as a turnaround and change catalyst. Prior to joining the Azure DocumentDB team, Denny worked as a Technology Evangelist at Databricks; he has been working with Apache Spark since version 0.5. He was also the Senior Director of Data Sciences Engineering at Concur, and was on the incubation team that built Microsoft's Hadoop on Windows and Azure service (currently known as HDInsight). Denny also has a Master's in Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers for the last fifteen years.

Blogs: tomdrabas.com | dennyglee.com

Blog

March 5, 2018

Learning PySpark videos are up!

In this tutorial, we provide a brief overview of Spark and its stack. This tutorial presents effective, time-saving techniques on how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Apache Spark architecture and how to set up a …


July 25, 2017

A quick primer on TensorFrames at PyData Seattle 2017

The video of Denny Lee's and my workshop during PyData Seattle 2017 is up. Enjoy!

July 6, 2017

PySpark and TensorFrames!

PySpark and TensorFrames (a bridge between Spark and TensorFlow) were the topics of a workshop by Denny Lee and Tom Drabas at PyData Seattle on July 5, 2017. Things covered: neural networks and deep learning, feature learning, feature engineering, an introduction to TensorFlow, and building a multinomial logistic regression and a Convolutional Neural Network to recognize handwritten digits (MNIST) …
