It is estimated that in 2013 the whole world produced around 4.4 zettabytes of data; that is, 4.4 billion terabytes! By 2020, we (as a human race) are expected to produce ten times that. With data getting larger literally by the second there is a growing appetite for making sense out of it.
In this book, we will guide you through the latest incarnation of Apache Spark using Python. We will show you how to read structured and unstructured data, how to use some fundamental data types available in PySpark, how to build machine learning models, operate on graphs, read streaming data and deploy your models in the cloud. Each chapter will tackle different problem and by the end of the book we hope you will be knowledgeable enough to solve other problems we did not have space to cover here.