"Big Data on Kubernetes" Book Review
A great intro not only to Kubernetes, but also to the open source foundations of the modern data stack.
In July 2024, Packt released a book by the great Brazilian data professional Neylson Crepalde, former CDO and CTO of A3Data and current GenAI Strategist at AWS.
The book Big Data on Kubernetes is available in physical and digital formats. Its premise is to provide a practical guide to building complete data solutions on Kubernetes while leveraging industry-standard open-source software.
Overall, it's a great book that covers a lot of ground. You can definitely use it as a panoramic overview of the so-called ‘modern data stack.’
BTW, I'm reviewing it based on the Kindle version.
How is the book structured?
The book starts by explaining the concept of containers - foundational knowledge for Kubernetes, which manages containerized applications. The first chapter covers many Docker topics, including images, containers, Dockerfiles, and registries.
Note that this chapter and the others include some code-along projects. For instance, the Docker chapter guides you through building an API and a batch data processing job. You can find all the code on the book's GitHub repo.
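To give a flavor of what the code-along looks like, here's a minimal Dockerfile of the kind you'd build in that chapter (the file names and entry point are my own illustration, not the book's actual code):

```dockerfile
# Illustrative only - not the book's actual Dockerfile.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Entry point is an assumption: a small API served with uvicorn
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```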
Following this introduction, the next two chapters cover the main topics of Kubernetes. This was my favorite part of the book, as I had never worked with this technology and was dead curious to learn more about it.
To be honest, I've always felt left out when I saw the cool kids on LinkedIn and Reddit talking about their `K8s`. Now I can at least understand it lol
In this part of the book, you will learn the core concepts that compose a Kubernetes cluster—the control plane, nodes, etc. It also covers some resources from the main API, such as Pods, Deployments, StatefulSets, Jobs, and Services.
There are also cool topics that help ensure safe deployment, like how to handle secrets and persistent storage.
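For a taste of what those resources look like, here's a minimal Deployment manifest (the name, labels, and image are my own illustration, not from the book):

```yaml
# Illustrative minimal Deployment manifest (not from the book)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-api            # hypothetical name
spec:
  replicas: 2                # run two copies of the pod
  selector:
    matchLabels:
      app: hello-api
  template:
    metadata:
      labels:
        app: hello-api
    spec:
      containers:
        - name: hello-api
          image: hello-api:latest   # hypothetical image
          ports:
            - containerPort: 8000
```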
After the base content—containers and clusters—is established, the book starts Part 2, which is about the ‘Big Data Stack’.
This part is all about introductions and practical “quick starts” on Spark, Airflow, Kafka, Trino, and Elasticsearch. But before getting hands-on, there's an in-depth section highlighting the generic services and routines that make up a data solution - data lakes, data warehouses, data lakehouses, batch and streaming processes, etc.
In these chapters, both newcomers and practitioners will learn something useful, as Neylson covers not only the basics but also important topics to be mindful of when deploying an application. For example, he explains the semantics of exactly-once, at-most-once, and at-least-once delivery for streaming routines.
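To make those semantics concrete, here's a tiny Python sketch of my own (not from the book): at-least-once delivery can produce duplicates, and consumers often deduplicate by message id to get effectively-once processing:

```python
# Toy illustration of delivery semantics (my own sketch, not the book's code).
# At-least-once delivery: if an ack is lost, the broker resends, so the
# consumer may see the same message twice.

def deliver_at_least_once(messages, ack_fails_on_first_try=True):
    """Simulate a broker that redelivers when the ack is lost."""
    delivered = []
    for msg in messages:
        delivered.append(msg)            # first attempt
        if ack_fails_on_first_try:       # ack lost -> broker resends
            delivered.append(msg)        # duplicate arrives

    return delivered

def process_effectively_once(delivered):
    """Consumer-side dedup by message id turns duplicates into one effect."""
    seen, results = set(), []
    for msg_id, payload in delivered:
        if msg_id in seen:
            continue                     # drop the redelivered copy
        seen.add(msg_id)
        results.append(payload)
    return results

msgs = [(1, "a"), (2, "b")]
dup = deliver_at_least_once(msgs)        # each message arrives twice
out = process_effectively_once(dup)      # but each is processed once
```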
Highlights of the Spark chapter
I’m not going deep into how he covers all of the technologies, but I want to highlight a chapter that I really liked: Apache Spark.
I mean, when you read other content on Spark - books, forums or courses - people do tend to follow the chest-thumping and ego-contest-y route of only using RDDs and bragging about their map-reduce-like operations.
Neylson takes a different path: he only uses the DataFrame API, which helps a lot with the readability of the code in the guide and makes it much more approachable to beginners - it's a beginner-friendly book, btw.
Also, Neylson shares some cool Spark snippets - like declaring a schema with plain strings, without needing to `import pyspark.sql.types as t`.
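Here's a quick sketch of that trick (the column names are my own illustration, and the optional part assumes a local Spark install): Spark accepts a DDL-style string in place of `StructType` objects.

```python
# "Schema as a plain string": Spark parses a DDL-style string,
# so no pyspark.sql.types imports are needed.
# Column names/types below are my own illustration, not the book's.

schema_ddl = "id INT, name STRING, score DOUBLE"

try:
    from pyspark.sql import SparkSession  # requires a local Spark install

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [(1, "ana", 9.5)],
        schema=schema_ddl,   # the DDL string replaces StructType([...])
    )
    df.printSchema()
except Exception:
    pass  # no Spark available; the DDL string above is the point
```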
But my favorite part of this chapter was the explanation of the different types of joins and their tradeoffs: sort-merge joins, shuffle-hash joins, broadcast-hash joins, etc.
(BTW, expect to soon see a post on these operations here)
Overall, his take on each tool is very approachable for beginners, but there are also many things that seasoned practitioners can learn from it. You can also expect to learn how to run each of them on Kubernetes, which is a huge plus (and also the main purpose of the book).
There was a cool bonus at the end.
As I mentioned, Neylson works at AWS as a Senior GenAI Strategist. So, he leverages his role to write some interesting closing chapters on GenAI and LLMs using technologies like Amazon Bedrock and Streamlit.
In this part of the book, he differentiates traditional ML from generative AI and goes deep into how LLMs work, as well as the concepts of RAG - retrieval augmented generation, LLM agents and knowledge bases.
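To illustrate the RAG idea, here's a toy Python sketch of my own (not the book's code): retrieve the most relevant snippet from a knowledge base, then use it to augment the prompt that would be sent to an LLM (e.g., via Amazon Bedrock):

```python
# Toy RAG sketch (my own illustration, not the book's code):
# retrieve relevant context, then stuff it into the LLM prompt.

def score(query, doc):
    """Crude relevance: count of shared lowercase words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(query, knowledge_base, top_k=1):
    """Return the top_k most relevant documents."""
    ranked = sorted(knowledge_base, key=lambda doc: score(query, doc),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_docs):
    """Augment the user question with the retrieved context."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

kb = [
    "Kubernetes schedules pods onto nodes.",
    "Spark uses DataFrames for distributed processing.",
]
question = "How does Kubernetes schedule pods?"
prompt = build_prompt(question, retrieve(question, kb))
# 'prompt' would then be sent to the model of your choice
```

A real setup would swap the keyword overlap for embeddings and a vector store, but the retrieve-then-augment shape is the same.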
Then, he goes on to create an entire solution for an actual real-world use case, building a custom knowledge base and deploying a Bedrock-based solution on Streamlit using Kubernetes. Along the way, he also covers more topics on safe deployment.
It was a great surprise at the end of the book!
Verdict
Overall, I did like the book very much.
Neylson's writing is very didactic, and also very spirited and funny. The only part I did not like was the last chapter, which felt somewhat generic and rushed.
It's a very thorough book that not only gives you great insight into Kubernetes but also explains the open-source version of the main services that compose the current standard for data solutions.
If you're tempted to buy it, go for it! You’ll have a great time!
In short: a great read.