The progression from data warehouses to big data to data lakes and now to data lakehouses is probably not over. The data warehouse is built to report on business operations with some analysis for predictive modeling (forecasting) or discovery. A limited group of experts has access. Big data was typically housed in a file-based or object-based repository where it waited, uncategorized and raw, for intrepid analysts to mine it for insights. The data lake architecture evolved from the big data repository. Then came the data lakehouse.
Tomer Shiran, Co-founder and CPO of Dremio, titled his keynote at Subsurface 2023 “The Year of the Data Lakehouse.” Tomer took the time to sit down with Cloud Data Insights (CDI) and explain why that’s so. He covered related topics like the importance of the semantic layer, performance optimization, and emerging capabilities in data lakehouses. (See Shiran’s bio below.)
CDI: Your keynote at this year’s Subsurface conference was titled “The Year of the Data Lakehouse.” What kind of inflection point in the lakehouse do you see?
Tomer: Throughout the last decade, we’ve gone through a number of different phases or eras much more rapidly than in the previous decades. We had enterprise data warehouses for several decades, and then we had the whole Hadoop craze and big data. I was very involved in that phase as the VP of product at MapR. Then we had the rise of the public cloud, which came all of a sudden out of nowhere. That led to solutions like Redshift and Snowflake that made cloud data warehouses popular. I think they addressed some ease-of-use issues and had the ability to serve a complete range of data warehousing use cases that data lakes couldn’t.
In just one year, all that has changed with Apache Iceberg as a common table format that the entire ecosystem has gotten behind. Now you can do things in a lake that you couldn’t do before–basically, everything that data warehouses can do. And so, now, for the first time, really, it’s become possible to solve all these use cases with an open data architecture. That’s why we call it the year of the lakehouse.
We at Dremio actually think that a lakehouse is a pretty good category name to describe what we are doing. Yes, at our core, we have a query engine. Yes, we can also connect to other data sources and federate them. An essential feature of a lakehouse is the ability to work with data in object storage and also work across other sources because, in the real world, companies have data in all sorts of places, and they can’t always centralize all of it.
CDI: One of the points you brought up in your keynote was the tension between governance and accessibility. We think of the data warehouse as being a very controlled, inaccessible data storage for which only a few people have the “keys” to unlock it. How are you seeing that tension play out?
Tomer: You can’t have physical or artificial barriers between groups, departments, and projects. Of course, from a governance standpoint and a security standpoint, not everybody can see every piece of data. So you want to have that ability to control things. But that should be driven by business requirements and compliance and things like that, as opposed to physical restrictions like saying that data exists in that system and not in the other. A semantic layer lets you access data regardless of where the data resides or how big it is.
CDI: Tell us about your strategy for increasing access and how Dremio Arctic accomplishes that.
Tomer: Arctic introduces a new idea of sandboxing data for when you’re doing intermediate work, like fixing up a bunch of data or ingesting some new source that you haven’t tested yet and want to make sure that data is correct. Arctic lets you do that work in isolation using the same concepts as in Git and GitHub. You create a branch, ingest the data into the branch, and test the data in the branch. Those concepts are actually really valuable because that data is a work in progress at that point. That’s not really data that you want to share with anybody, but once you’re ready to share it, you run a single command, and boom, you can share it within the branch. The whole idea of the branches lets you work on the side.
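[Editor’s illustration: the branch workflow described above can be sketched with a toy in-memory catalog. This is not the Arctic or Nessie API. The `Catalog` class, its method names, the sample phone numbers, and the idea of publishing by merging into a `main` branch are all assumptions borrowed from Git for the sake of the example.]

```python
class Catalog:
    """Toy catalog: each branch maps table names to immutable row snapshots."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # Copy only the table pointers; the row tuples themselves are shared.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, table, rows):
        # A write replaces this branch's pointer; other branches are untouched.
        self.branches[branch][table] = tuple(rows)

    def read(self, branch, table):
        return self.branches[branch].get(table, ())

    def merge(self, source, target="main"):
        # Publish every table pointer from source into target in one step.
        self.branches[target].update(self.branches[source])


catalog = Catalog()
catalog.write("main", "calls", [("555-0100", 60)])

# Ingest a new, untested batch in isolation on a hypothetical "etl" branch.
catalog.create_branch("etl")
catalog.write("etl", "calls", [("555-0100", 60), ("555-0101", 30)])
main_before_merge = catalog.read("main", "calls")

# Check the data on the branch, then publish it with a single call.
assert all(seconds > 0 for _, seconds in catalog.read("etl", "calls"))
catalog.merge("etl")
main_after_merge = catalog.read("main", "calls")
```

[In this sketch, readers of `main` keep seeing the old single-row snapshot until `merge` runs, which is the isolation property the answer describes.]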
CDI: That’s one way to work with transient data. Branches would also help with rollback in case of errors. It takes the risk out of experimenting.
Tomer: Yes. I think this idea of GitHub for data potentially represents one of the biggest changes ever to data management. Take the data warehouse we have today. Yes, it’s more scalable and easier to use than 20 years ago, but fundamentally it’s still the same thing. It’s a bunch of tables, and I run SQL commands on these tables, right? The model is the exact same model it’s been for decades. But in software development, everything has changed in the last decade. We have things like GitHub for collaboration and much more agility, version control, and source code governance. We have CI/CD, and so many things have evolved. And that hasn’t happened in the world of data. We see Arctic [based on the open-source project Nessie] as being a key part of what’s driving that change going forward.
CDI: That’s a great perspective on using the same disciplines and tools for data and data products as for creating an application.
Tomer: There’s already a lot of familiarity with these development tools. Data engineers are also our source code engineers because they’ve built Python scripts, at a minimum, and they use these tools like GitHub. We don’t have to reinvent everything–we can just apply them to a new space.
CDI: There’s some hesitancy around adopting a semantic layer, mostly around its effect on performance. A workaround is to make the semantic layer cover only part of an organization’s data assets, which defeats the purpose. How do you overcome this issue?
Tomer: If things aren’t fast enough, the way to get performance is to optimize the data. Rather than a table of a billion phone calls, I’m going to aggregate it by phone number. I have stats per phone number instead of all the individual calls. And then, when I query that, it’ll be faster because I have a smaller data set. Another example is joining the tables that people are going to query in advance so that join doesn’t happen for every query. Some people manually optimize the data as they extract it from, let’s say, a data warehouse that’s too slow and maybe overloaded. So they take the data and start caching pieces of it on the BI side, inside of applications like Tableau.
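[Editor’s illustration: a minimal sketch of the pre-aggregation idea in the phone-call example above. The record layout, the sample values, and the function name `aggregate_by_number` are hypothetical.]

```python
from collections import defaultdict

# Hypothetical raw call records: (phone_number, duration_seconds).
calls = [
    ("555-0100", 60),
    ("555-0101", 30),
    ("555-0100", 120),
    ("555-0101", 45),
    ("555-0100", 15),
]


def aggregate_by_number(records):
    """Collapse individual calls into one stats row per phone number."""
    stats = defaultdict(lambda: {"call_count": 0, "total_seconds": 0})
    for number, seconds in records:
        stats[number]["call_count"] += 1
        stats[number]["total_seconds"] += seconds
    return dict(stats)


# Five raw rows become two summary rows; later queries scan the summary.
stats_per_number = aggregate_by_number(calls)
```

[The saving comes from the summary having one row per phone number rather than one row per call, so any query that only needs per-number statistics reads far less data.]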
These approaches make sense to give you more performance, but they create a lot of problems because you’ve now created all these disconnected copies of data which you then have to manage. You also have to make sure that permissions are always up to date on these datasets because they have sensitive data in them, and the permissions don’t automatically travel from the source of the data. In our view, the only real way to solve for performance is to recognize that people need data to be transformed for logical business reasons. Maybe I need a zip code and not an address. Things are misspelled and need to be corrected. Maybe things need to be rationalized in different ways. Logical transformations are always going to be needed in the world, but there’s no reason to tie those to the physical transformation. So we say create a semantic layer for logical transformations.
You still have to worry even more about performance because these logical transformations are happening at query time. Instead, we create what we call materializations, which are basically different aggregations of the data or different sorts of the data that are done behind the scenes where the user doesn’t connect to one of these materializations. In the traditional model, the user connects to pre-joined or pre-aggregated tables, so whenever you want to make changes, you can’t because the user has built their dashboard on that specific table. In our world, the user connects to the logical layer, and as part of the query planner, we automatically use these additional aggregations or sorts of data transparently. Your data team can rewrite the user’s query internally without them knowing about these optimized versions of data, and that gives them sub-second response times.
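[Editor’s illustration: a toy version of the transparent substitution described above, not Dremio’s planner. The caller always queries the logical `calls` view through `plan_and_run`. The function answers from a pre-aggregated copy when that copy covers the requested grouping, and from the raw rows otherwise. All names and data here are hypothetical.]

```python
from collections import defaultdict

# Hypothetical raw rows: (phone_number, region, duration_seconds).
RAW_COLUMNS = ("phone_number", "region")
RAW_CALLS = [
    ("555-0100", "east", 60),
    ("555-0101", "west", 30),
    ("555-0100", "east", 120),
    ("555-0102", "west", 45),
]


def sum_duration(rows, columns, group_by):
    """Sum the last field of each row, grouped by the named columns."""
    idx = [columns.index(c) for c in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


# A materialization built behind the scenes: totals pre-aggregated by region.
MAT_COLUMNS = ("region",)
MAT_ROWS = [
    key + (total,)
    for key, total in sum_duration(RAW_CALLS, RAW_COLUMNS, MAT_COLUMNS).items()
]


def plan_and_run(group_by):
    """Answer a query on the logical view; the caller never names a source."""
    if set(group_by) <= set(MAT_COLUMNS):
        # The materialization covers the request, so scan the smaller copy.
        return "materialization", sum_duration(MAT_ROWS, MAT_COLUMNS, group_by)
    return "raw", sum_duration(RAW_CALLS, RAW_COLUMNS, group_by)


region_source, by_region = plan_and_run(["region"])
number_source, by_number = plan_and_run(["phone_number"])
```

[Because dashboards depend only on the logical view, the materialization in this sketch can be rebuilt, replaced, or dropped without changing any query the user wrote.]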
CDI: How do you pre-select what you’re going to aggregate, merge or join?
Tomer: We have some capabilities in tech preview that automate these based on a company’s previous workloads. We look at query patterns and how data is commonly being aggregated. Based on that, we figure out recommendations that the data engineers can override. For instance, we might not know that a certain dashboard is run by the CEO and is really, really important. This other one is used by interns, and it’s less important. You can’t always know all the business differences behind the scenes, so we give the data team the controls it needs.
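[Editor’s illustration: one simple way such workload-based recommendations with overrides could work is frequency counting over a query log. This is an assumption for illustration only, not a description of Dremio’s feature. The log format and the `pinned` and `excluded` parameters are hypothetical.]

```python
from collections import Counter

# Hypothetical query log: the group-by columns used by each past query.
query_log = [
    ("region",), ("region",), ("phone_number",),
    ("region", "month"), ("region",), ("phone_number",),
]


def recommend(log, limit, pinned=(), excluded=()):
    """Rank groupings by how often they were queried, then apply overrides.

    pinned   -- groupings the data team always wants (e.g. the CEO dashboard)
    excluded -- groupings the data team considers not worth materializing
    """
    ranked = [
        cols for cols, _ in Counter(log).most_common()
        if cols not in pinned and cols not in excluded
    ]
    return (list(pinned) + ranked)[:limit]


automatic = recommend(query_log, limit=2)
overridden = recommend(query_log, limit=2, pinned=[("region", "month")])
```

[The override step reflects the point made above: frequency alone cannot tell that a rarely run dashboard matters more to the business than a frequently run one.]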
CDI: Sounds like AI in terms of pattern detection and predictions. Has Dremio embedded AI?
Tomer: Yes. Our understanding of workloads is getting smarter and smarter over time. That’s key to pre-selecting optimizations in an effective way.
CDI: In addition to Dremio’s commercial track record, it has made major contributions to open-source software. The Apache Arrow project you founded had over 7 million downloads in 2022. How do you explain its popularity?
Tomer: It is one of the most popular open-source projects now. First and foremost, it’s driven by the demand for data, the number of data scientists, and how many people are using data science tools and Python. The PyArrow library is basically used by every data scientist in every Notebook they create and every application they build. Something like Arrow provides very fast data access–that’s why our entire engine is based on it.
In addition to Apache Arrow, we are huge contributors to Iceberg and made sure we did it in a native way. We built a team dedicated to evangelizing it and educating the market on its benefits, and now we see many companies adopting Iceberg in their product.
CDI: Open-source projects sometimes start as explorations and experiments. Is that Dremio’s approach, or were you really trying to solve an actual problem that you needed to complete your product strategy?
Tomer: It was more something that we needed. We were building our query engine, and we knew that to get world-class performance, we needed to have columnar in-memory execution. Initially, the technology was just in our product. However, before we actually launched the first version of our product, we realized that other database companies and other data science tools are going to need something like this. Sure, everybody can build their own, but if we open-sourced that piece of Dremio, then maybe that single format would become a standard, which would also benefit us in many ways by having a large developer community contributing to the project and making it faster and faster and better and better.
CDI: Let’s finish our conversation with a forward-looking question: Where do you see the data lakehouse moving?
Tomer: Well, I’m most excited about the new paradigm of managing data as code and all the benefits and agility that it will bring to companies. I think, as a paradigm, it’s going to be game-changing for so many companies since it just makes it easier to work with data, to collaborate, and to manage data. It also makes it easier to build a data mesh and data products. Lots of benefits to come from that approach.
[Note that Subsurface 2023 sessions are available for on-demand viewing after a no-fee registration]
Bio: Tomer Shiran, Co-Founder and CPO of Dremio, served as Dremio’s CEO for the first 4.5 years, overseeing the development of the company’s core technology and growing the team to 100 employees. Previously, he was the fourth employee and VP of Product of MapR, a Big Data analytics pioneer. Tomer held numerous product management and engineering roles at IBM Research and Microsoft. He is the founder of two websites that have served millions of users and 100K+ paying customers. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion – Israel Institute of Technology and is the author of numerous U.S. patents.
Elisabeth Strenger is a Senior Technology Writer at CDInsights.ai.