Yay, we have design patterns too!
Under The Hood 9 - A review of "Data Engineering Design Patterns", our vade mecum
I've been waiting for the release of Data Engineering Design Patterns for some time. While I waited, I kept thinking, “Do we have design patterns in Data Engineering?”
I mean, the only times I had seen this topic come up were in the context of web development: hexagonal architecture, domain-driven design, model-view-controller, etc.
In my uneducated mind, I thought of design patterns as big project patterns that involved the way you organize your folders, your application logic, your business rules, etc.
Since a typical data engineering repo consists of DAGs, ETL scripts, and reusable modules with common methods like loading and transforming data, I didn't think we'd have much to say on the subject.
So, I had a very different idea of what design patterns are.
Never mind the traditional patterns like “Factory,” “Adapter,” “Bridge,” etc. I was really clueless about what to expect from this book.
Then it came out.
Then I read it.
And the TL;DR of my review is: Data Engineering Design Patterns is the de facto reference handbook for our craft.
A manual for all occasions
The book's subtitle is “Recipes for Solving the Most Common Data Engineering Problems.” So, the scope of the patterns the author presents is not huge, abstract, or repo-reaching, like you'd see with an MVC or DDD pattern.
Instead, he brings forth solutions for specific problems you'd face. Let's see the summary.
The book has a total of 10 chapters, with the following titles:
1. Introducing Data Engineering Design Patterns
2. Data Ingestion Design Patterns
3. Error Management Design Patterns
4. Idempotency Design Patterns
5. Data Value Design Patterns
6. Data Flow Design Patterns
7. Data Security Design Patterns
8. Data Storage Design Patterns
9. Data Quality Design Patterns
10. Data Observability Design Patterns
To give an example of the discussions brought by the book, let's take a look at some Data Ingestion Design Patterns.
In this chapter, the author presents some types of ingestions, such as full loads, incremental loads, and replications, as well as some parallel discussions, such as data compaction, data readiness, and events.
Each of these topics is structured as (a) the problem it solves, (b) how the pattern solves it, and (c) examples with actual code.
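To make the difference between a full load and an incremental load concrete, here is a minimal sketch of my own. It is not taken from the book. The `users` table, the `updated_at` watermark column, and the function names are all hypothetical, and I'm using in-memory SQLite only so the example runs anywhere.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical `users` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    return conn


def full_load(src, dst):
    """Full load: wipe the target and copy the whole source snapshot."""
    rows = src.execute("SELECT id, name, updated_at FROM users").fetchall()
    with dst:  # commits on success, rolls back on error
        dst.execute("DELETE FROM users")
        dst.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    return len(rows)


def incremental_load(src, dst, last_watermark):
    """Incremental load: copy only rows changed after the watermark.

    Returns the new watermark (the highest `updated_at` seen), or the old
    one if nothing changed.
    """
    rows = src.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    with dst:
        # The primary key on `id` turns re-delivered rows into overwrites.
        dst.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", rows)
    return rows[-1][2] if rows else last_watermark
```

The trade-off shows up even in a toy like this: the full load is simple and always consistent but moves everything every time, while the incremental load moves less data but depends on a trustworthy `updated_at` column and will not notice rows deleted at the source.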
The descriptions of the problems are accurate and concrete. Here, Bartosz draws his writing from actual practice, not from abstract scenarios. This is good because you won't feel lost. Sometimes I see Data Engineers online using super abstract terms that don't really convey meaning; rather, they mask it to make things look more difficult than they are. You won't find that in this book.
Also, the code examples are excellent and diverse! The author uses a lot of different software, so you'll see Airflow, Spark, Kafka, various AWS services, and various SQL flavours here.
The funny thing is: if you are a more senior Data Engineer, you can expect to read about techniques you implement daily and never realized had a name.
I myself had a good feeling of being seen and understood, like “Yeah, I had this problem and I tackled it more or less like this. Now, in my mind, I'll call it ‘Static Late Data Integrator’”, lol.
How to read it?
The book is suitable for two types of reads: a read-through and a reference read.
If you're doing a read-through, you may get tired of the repeated problem-and-solution structure, but the value of the content 100% makes up for it.
You can also keep it next to you and do research whenever you need to visit a topic, solve a problem, etc.
There's a term for this type of book: vade mecum, which translates from Latin as “go with me.” Trust me, you will query this book all the time.
I'd like to compare this book with two other technical works: Fundamentals of Data Engineering and Fluent Python.
Fundamentals of Data Engineering, by Joe Reis and Matt Housley, is friendlier to read-throughs. The book covers major concepts that are closely linked and that build on one another around the question of what a Data Engineer's role is.
On the other hand, Fluent Python, by Luciano Ramalho, has more detailed technical discussions of topics that can be understood independently of one another. So this is more of a “reference read”.
The point I'm trying to make is that you can read laws or a book about the Law.
Fundamentals of Data Engineering is more of a book about the Law.
Fluent Python is more like reading the law itself.
Data Engineering Design Patterns is more of a Fluent Python-type book. This is great because this book will last you a lifetime. You'll be coming back to it time and time again. In gaming lingo, it has a high replay value!
Final verdict
Great book, 10/10, get it! This book addresses a significant and overlooked gap in the field.