About the Authors

Jules S. Damji is a senior developer advocate at Databricks and an MLflow contributor. He is a hands-on developer with over 20 years of experience and has worked as a software engineer at leading companies such as Sun Microsystems, Netscape, @Home, Loudcloud/Opsware, Verisign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a B.Sc. and an M.Sc. in computer science and an M.A. in political advocacy and communication from Oregon State University, Cal State, and Johns Hopkins University, respectively.

Brooke Wenig is a machine learning practice lead at Databricks. She leads a team of data scientists who develop large-scale machine learning pipelines for customers, as well as teaching courses on distributed machine learning best practices. Previously, she was a principal data science consultant at Databricks. She holds an M.S. in computer science from UCLA with a focus on distributed machine learning.

Tathagata Das is a staff software engineer at Databricks, an Apache Spark committer, and a member of the Apache Spark Project Management Committee (PMC). He is one of the original developers of Apache Spark, the lead developer of Spark Streaming (DStreams), and is currently one of the core developers of Structured Streaming and Delta Lake. Tathagata holds an M.S. in computer science from UC Berkeley.

Denny Lee is a staff developer advocate at Databricks who has been working with Apache Spark since version 0.6. He is a hands-on distributed systems and data science engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has an M.S. in biomedical informatics from Oregon Health and Sciences University and has architected and implemented powerful data solutions for enterprise healthcare customers.
Features & Highlights
Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.
Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:
Learn Python, SQL, Scala, or Java high-level Structured APIs
Understand Spark operations and SQL Engine
Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
Perform analytics on batch and streaming data using Structured Streaming
Build reliable data pipelines with open source Delta Lake and Spark
Develop machine learning pipelines with MLlib and productionize models using MLflow
Customer Reviews
Rating Breakdown
★★★★★  60% (120)
★★★★  25% (50)
★★★  15% (30)
★★  7% (14)
Most Helpful Reviews
★★★★★
5.0
AFGFKGBTLIDIRUIJ54H2...
✓ Verified Purchase
Best introductory Spark guide as of early-2021
The foreword and preface to this book comment that an update to the first edition, published in 2015, was long overdue. After all, the first edition makes use of Apache Spark 1.3.0, whereas this update makes use of Apache Spark 3.0.0-preview2 (the latest version available at the time of writing). For the most part, I successfully ran all notebook code out of the box using Databricks Runtime 7.6 ML (which includes Apache Spark 3.0.1 and Scala 2.12); the minor issues I hit, along with my resolutions, are explained later in this review. I was, however, able to successfully run all standalone PySpark applications from chapters #2 and #3 out of the box using Apache Spark 3.0.1 and Python 3.7.9. As explained, the approach used here is intended to be conducive to hands-on learning, with a focus on Spark's Structured APIs, so there are a few topics that aren't covered, such as the following: the older low-level Resilient Distributed Dataset (RDD) APIs, GraphX (Spark's API for graphs and graph-parallel computation), how to extend Spark's Catalyst optimizer, how to implement your own catalog, and how to write your own DataSource V2 data sinks and sources.
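For readers curious what one of those standalone applications looks like, here is a minimal sketch runnable with spark-submit against a local Spark 3.x install; the CSV path and column names are illustrative placeholders, not necessarily the book's exact example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    if __name__ == "__main__":
        spark = (SparkSession.builder
                 .appName("StandaloneSketch")
                 .getOrCreate())

        # Read a CSV file into a DataFrame and aggregate with the Structured API.
        df = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("data/counts.csv"))          # placeholder input path

        (df.groupBy("State", "Color")
           .agg(F.sum("Count").alias("Total"))
           .orderBy(F.desc("Total"))
           .show(5))

        spark.stop()

Assuming the input file exists, it can be launched with something like "spark-submit standalone_sketch.py".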
Content is broken down into 12 chapters: (1) "Introduction to Apache Spark: A Unified Analytics Engine", (2) "Downloading Apache Spark and Getting Started", (3) "Apache Spark's Structured APIs", (4) "Spark SQL and DataFrames: Introduction to Built-in Data Sources", (5) "Spark SQL and DataFrames: Interacting with External Data Sources", (6) "Spark SQL and Datasets", (7) "Optimizing and Tuning Spark Applications", (8) "Structured Streaming", (9) "Building Reliable Data Lakes with Apache Spark", (10) "Machine Learning with MLlib", (11) "Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark", and (12) "Epilogue: Apache Spark 3.0". The longest chapter is chapter #8, followed closely by chapters #3, #4, #5, and #10, and the most notebooks are provided for chapters #10 and #11, although this is largely due to individual notebooks being dedicated to a variety of topics.
This book is the fourth of four related books I've worked through, a couple of years after the earlier three: "Spark: The Definitive Guide" (2018), "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" (2017), and "Practical Hive: A Guide to Hadoop's Data Warehouse System" (2016). As I mentioned in an earlier review, if you are new to Apache Spark, these four texts will help point you in the right direction, although keep in mind that the related tech stack is evolving, so you will obviously need to supplement this material with web documentation and developer forums, as well as get hands-on with the tooling. Reading the earlier three books in reverse order of publication date exposed me to more current material sooner rather than later, but this was largely just a coincidence. Now that this new book is available, I recommend working through it first. While I wouldn't discount "Spark: The Definitive Guide", because it provides content not in this new book and I personally think it flows better, use it very judiciously because it was created using the Spark 2.0.1 APIs.
The only notebooks I wasn't able to successfully run out of the box are constrained to chapter #11. In notebooks 11-3 ("Distributed Inference"), 11-5 ("Joblib"), and 11-7 ("Koalas"), FileNotFoundErrors were generated when attempting to use Pandas to read from CSV or Parquet files using "read_csv()" and "read_parquet()", respectively. In taking a look at what the community had to say, I discovered that this is a known issue, so I replaced these Pandas statements with "spark.read.option(...).csv("...")" and "spark.read.option(...).parquet("...")" instead, respectively, subsequently converting to Pandas using "toPandas()". According to the documentation, Pandas 1.0.1 is installed on both the CPU and GPU clusters for the aforementioned Databricks Runtime (the latest non-beta currently available). In notebook 11-3 ("Distributed Inference"), the following PythonException was generated when attempting to execute a "mapInPandas()" statement that uses a mix of numeric data types in the schema argument: "pyarrow.lib.ArrowInvalid: Could not convert 3.0 with type str: tried to convert to double". In the absence of decent community guidance, and because this statement is solely used for display purposes, I simply converted all of these data types to "STRING". According to the documentation, PyArrow 1.0.1 is installed on both the CPU and GPU clusters for the aforementioned Databricks Runtime.
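To make the first workaround concrete, here is a rough sketch of the substitution, assuming a Databricks notebook where "spark" is predefined; the file paths and field names below are placeholders, not the notebooks' actual ones:

    # Read with Spark instead of pandas, then convert back to a pandas DataFrame.
    pdf = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("dbfs:/placeholder/path/data.csv")         # was pandas.read_csv(...)
           .toPandas())

    pdf2 = (spark.read
            .parquet("dbfs:/placeholder/path/data.parquet")  # was pandas.read_parquet(...)
            .toPandas())

    # For the mapInPandas() schema error, casting the offending numeric fields in
    # the schema argument to STRING (display-only) avoided the pyarrow error,
    # e.g. schema="device_id STRING, prediction STRING"  (field names hypothetical).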
I personally got the most value out of chapters #7 and #8. Chapter #7 covers optimizing and tuning Spark for efficiency, caching and persistence of data, Spark joins, and inspecting the Spark UI. Chapter #8 covers the evolution of the Apache Spark stream processing engine, the programming model of Structured Streaming, the fundamentals of a Structured Streaming query, streaming data sources and sinks, data transformations, stateful streaming aggregations, streaming joins, arbitrary stateful computations, and performance tuning. I especially appreciated the sections on the two most common Spark join strategies (the broadcast hash join and the shuffle sort merge join), the Spark UI, stateful streaming aggregations, and streaming joins. Highly recommended for anyone making use of Spark.
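For a flavor of the join-strategy discussion, here is a small self-contained sketch; the table and column names are hypothetical, not the book's example:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("JoinStrategySketch").getOrCreate()

    # Hypothetical tables: a large fact table and a small dimension table.
    orders = spark.createDataFrame([(1, 100), (2, 200), (1, 50)], ["customer_id", "amount"])
    customers = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["customer_id", "name"])

    # Broadcast hash join: the small table is shipped to every executor, so the
    # large table can be joined without shuffling it across the cluster.
    joined = orders.join(broadcast(customers), "customer_id")
    joined.explain()   # the physical plan should show BroadcastHashJoin

    # Without a hint, Spark broadcasts only while the smaller side is under
    # spark.sql.autoBroadcastJoinThreshold (10 MB by default); larger joins fall
    # back to a shuffle sort merge join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))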
12 people found this helpful
★
1.0
AESOOVVDJ5CUDMJVKF3A...
✓ Verified Purchase
Not up to O'Reilly standards
Highly inaccessible. A lot of the Spark setup is glossed over, such that you'll spend most of your time trying to figure out why a simple Python script won't run. Definitely recommend another text.
9 people found this helpful
★★★★★
5.0
AF2AYYHP5N3PF56YKHTA...
✓ Verified Purchase
Great beginner book
I'm a software engineer who knows his way around SQL, mostly running queries/transforms on Postgres and Redshift. The majority of my background is in building and supporting services. Having no background knowledge in Spark, I was looking for a book that explains the fundamental concepts, helps me get up and running, and helps me expand my toolkit for working with "big data".
I was able to follow along in this book fairly easily. Working on a MacBook, I did have to first install Scala, download Spark, enable Spark in IntelliJ, etc. I didn't have trouble with this as it was fairly straightforward. With my environment set up, I found the book presents every code sample in Scala and Python. I worked through the code samples, chapter by chapter, writing Scala in IntelliJ or sometimes writing Scala in the Spark CLI itself.
I did take a detour from the book slightly to learn a bit more about sbt, which is the Scala build tool.
For a beginner such as myself, this book is a godsend, but I do wish the authors had approached some things differently.
In my opinion, some topics are covered in a very "hand-wavy" manner. For example, Chapter 4 discusses managed vs. unmanaged tables. While knowing this difference exists is helpful for the reader, the authors never discuss when you should use a managed table versus an unmanaged one. They could have included that information or pointed the reader to some external source. This part of Chapter 4 then shows sample code for creating a managed table from a CSV file, but it's not clear what I should do with that information. What are the patterns applicable to a managed table vs. an unmanaged table? What are the trade-offs? For a beginner book, I still feel the authors could have written even just one page, which would have added significant value to this section.
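For context, here is a rough sketch of what the two forms look like in practice, assuming a session named "spark" as in the book's shells and notebooks; the path and column names are placeholders, not the book's exact example:

    # Managed table: Spark owns both the metadata and the data;
    # DROP TABLE removes the underlying data files too.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/flights.csv"))            # placeholder CSV path
    df.write.saveAsTable("flights_managed")

    # Unmanaged (external) table: Spark tracks only the metadata;
    # the files stay at the external path and survive a DROP TABLE.
    spark.sql("""
        CREATE TABLE flights_unmanaged (date STRING, delay INT, origin STRING)
        USING csv
        OPTIONS (path '/data/flights.csv')
    """)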
Sometimes the book shares an interesting tidbit using terminology or concepts that the authors haven't really described. I found this very frustrating. For example:
> (Chapter 4, page 92) ... you can create multiple SparkSessions within a single Spark application—this can be handy, for example, in cases where you want to access (and combine) data from two different SparkSessions that don’t share the same Hive metastore configurations.
If you search for mentions of Hive, you'll see the authors briefly mention that Spark uses a Hive metastore to persist table metadata. So are the authors saying I can use one Spark installation and access table metadata from different Hive metastores? Why would I ever want to access only the metadata for different tables? Again, the use case isn't clear.
As a beginner, I found this book very valuable, and I believe it is a great investment.
8 people found this helpful
★
1.0
AEWIA4737EY5N4ZHL7RT...
✓ Verified Purchase
Not good if you just want to use Spark
For a big data novice who has SQL fundamentals and is looking for a beginner's book on Spark, this is NOT the one for you. The authors from Databricks focus on bragging about how well Spark is designed, spend plenty of pages explaining Spark's internal design and API design, and show off how much progress Spark 2 made toward a unified high-level API, instead of teaching you how to use DataFrames and Spark SQL in different applications. They just send you to the documentation! Some of the book's notebooks hosted on GitHub don't even align with the examples in the book when you import them into a Databricks account. In summary, the title of this book is indeed misleading, and it is not beginner-friendly at all. Much lower quality than Learning SQL, 2nd edition. I'd rather recommend the two-year-old Definitive Guide book for beginners. What a waste of three days spent reading this book!
3 people found this helpful
★★★★★
5.0
AFAEHKKLUMI6OAN7LCYE...
✓ Verified Purchase
Well organized and solid information
It was easy to follow the book. The setup of the Spark shell was clearly written, and I found the online instructions for installing Spark locally to be sufficient as well. The book is well organized to delineate the different components of Spark, e.g., intro, Structured APIs, streaming, optimizations, data lakes, and ML deployment options. While ML deployment needs for individual business use cases are highly specific, I found the overview deployment framework provided by the book to be helpful. I also liked that the book uses screenshots of the Spark UI, with arrows pointing into the screenshots to explain the UI, since the UI can be hard to understand. The code samples and the graphics in other sections are useful as well. There's also coverage of how to connect to different apps, like Beeline (which I'd never heard of), Tableau, and Thrift. Overall, the book contains solid information on the inner workings of Spark. I would recommend giving this book a read!
3 people found this helpful
★★★★★
5.0
AEPPL5SEAR224YKLG6CQ...
✓ Verified Purchase
More Databricks-centric
Nice book if you really want to work hands-on without having to worry about the internals of Spark.
2 people found this helpful
★★
2.0
AEBMHBNOSX4FYGWXRFJL...
✓ Verified Purchase
Way too basic
Maybe recommended for people who have never used Spark before, but not for anyone who has any experience with it. Simply too basic, and it covers everything superficially.
1 person found this helpful
★★★★★
4.0
AGR64R33GA7LTCHUGY3Y...
✓ Verified Purchase
Soft Entry into the World of Apache Spark
It's a good read. It was definitely written with the Databricks ecosystem in mind. It could have included information about writing production Spark applications (there's none), which I feel is the biggest missing piece of this book.
1 person found this helpful
★★★★★
5.0
AHBUFZ7FGALFOLEGO3XF...
✓ Verified Purchase
Very good
I really did feel like I learned something. The book covered a lot of ground with good quality examples. And it was well written, even if it was a little dry.
★★★★★
5.0
AFQ4JW6EWZBL5TJ24JEM...
✓ Verified Purchase
Good book for getting started with Spark
It gives good examples in both Scala and Python, and although they aren't always in Python, the Scala is similar (like a Java-Python hybrid). It suggests using Databricks if you want to practice without installing anything on-premises, or, if you prefer, installing Spark using Windows WSL or a Linux virtual machine.