<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" > <channel> <title>data warehouse Archives - CDInsights</title> <atom:link href="https://www.clouddatainsights.com/tag/data-warehouse/feed/" rel="self" type="application/rss+xml" /> <link>https://www.clouddatainsights.com/tag/data-warehouse/</link> <description>Trsanform Your Business in a Cloud Data World</description> <lastBuildDate>Wed, 16 Nov 2022 00:05:37 +0000</lastBuildDate> <language>en-US</language> <sy:updatePeriod> hourly </sy:updatePeriod> <sy:updateFrequency> 1 </sy:updateFrequency> <generator>https://wordpress.org/?v=6.6.1</generator> <image> <url>https://www.clouddatainsights.com/wp-content/uploads/2022/05/CDI-Favicon-2-45x45.jpg</url> <title>data warehouse Archives - CDInsights</title> <link>https://www.clouddatainsights.com/tag/data-warehouse/</link> <width>32</width> <height>32</height> </image> <site xmlns="com-wordpress:feed-additions:1">207802051</site> <item> <title>CITY Furniture: A Real Time and Data Virtualization Case Study</title> <link>https://www.clouddatainsights.com/city-furniture-a-real-time-and-data-virtualization-case-study/</link> <comments>https://www.clouddatainsights.com/city-furniture-a-real-time-and-data-virtualization-case-study/#respond</comments> <dc:creator><![CDATA[Elisabeth Strenger]]></dc:creator> <pubDate>Thu, 10 Nov 2022 20:51:17 +0000</pubDate> <category><![CDATA[Cloud Data Platforms]]></category> <category><![CDATA[Cloud Strategy]]></category> <category><![CDATA[Data Architecture]]></category> <category><![CDATA[data warehouse]]></category> <category><![CDATA[Practitioner]]></category> <category><![CDATA[real time]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=2001</guid> <description><![CDATA[Ryan Fattini explains how a data fabric and virtualization help feed one retailer’s growing appetite for data.]]></description> <content:encoded><![CDATA[ <div class="wp-block-uagb-image uagb-block-419f09b1 wp-block-uagb-image--layout-default wp-block-uagb-image--effect-static wp-block-uagb-image--align-center"><figure class="wp-block-uagb-image__figure"><img fetchpriority="high" decoding="async" srcset="https://www.clouddatainsights.com/wp-content/uploads/2022/11/data-virtualization-Depositphotos_221368578_S.jpg " src="https://www.clouddatainsights.com/wp-content/uploads/2022/11/data-virtualization-Depositphotos_221368578_S.jpg" alt="" class="uag-image-2003" width="750" height="666" title=""/><figcaption class="uagb-image-caption"><em>Ryan Fattini explains how a data fabric and virtualization help feed one retailer’s growing appetite for data.</em></figcaption></figure></div> <p></p> <p>Cloud Data Insights (CDI) met with Ryan Fattini, who runs data engineering and data science at CITY Furniture, and Ravi Shankar, SVP and CMO for Denodo, at the Gartner Data and Analytics Summit in August. Ryan told the story of how CITY Furniture took the success of a real-time data system for sales and extended it across multiple departments. The journey begins with a software engineer and an IBM mainframe and ends with a data democratization initiative. 
There are many interesting stops along the way–a streaming layer, an IBM cloud <a href="https://www.investopedia.com/terms/d/data-warehousing.asp" target="_blank" rel="noreferrer noopener">data warehouse</a>, a miscellany of data stores, a data fabric, and data virtualization.</p> <p><strong>CDI: Four years ago, you were a software development engineer, and now you are an expert data professional with considerable influence. What was that transition like?</strong></p> <p><strong>Ryan Fattini:</strong> It started back at the previous company, where I worked as a full-stack engineer. I built out their e-commerce platform and the application layer behind it. What introduced me to data, or at least to solving problems with data, was that we got pulled into marketing reporting. There were questions that needed to be answered. One of our major vendors wanted to know activation patterns around the sales of smartphones. Vendors wanted to know activation rates, what fueled them, and basically what was behind the trends that were coming up. Nobody really had an answer to these questions, so I looked into how you would solve this kind of problem. It turns out the answer was data science. [laughs] We built logistic regression models by taking demographic data against our activation rates in our cities–basically modeling. That was the start of my transition from being an engineer to someone who solves problems with data.</p> <p>After building more models, I realized that the problem with data science isn’t building the models, it’s the engineering components. Six or seven years ago I was hearing about a lot of failures in the industry; data science wasn’t working. Companies were hiring academics who could build models but had no idea how to move them into production. It’s an engineering gap. I realized that most of what we were doing was engineering, not just model building. You can’t do one without the other.</p> <p>Now we have the machine learning engineer, a hybrid role, much like what happened in software development when the roles of back-end and front-end developers merged. When I joined CITY Furniture as a software developer, there was no data team, no data warehouse, and no analysts, so I brought the same data-driven problem solving to CITY. I found a couple of other engineers who were also interested in this kind of thing, and we started picking at problems using a data science approach with our engineering teams. We were going rogue at first but were able to show the company that this was the future and that we’d eventually need to do predictive and prescriptive analysis. When we presented our damage classification model to the CEO, he said, “This is great…but what I really need is to predict retail foot traffic.” So we pivoted to forecasting retail traffic by day and by store.</p> <p><strong>See also:</strong> <a href="https://www.clouddatainsights.com/22-top-cloud-database-vendors/" target="_blank" rel="noreferrer noopener">22 Top Cloud Database Vendors</a></p> <p><strong>CDI: As happened to many other businesses, COVID-19 made forecasting almost impossible. What happened to your retail traffic forecasting model?</strong></p> <p><strong>Ryan Fattini:</strong> The model ended up being critical. For brick-and-mortar retailers, keeping stores staffed to accommodate traffic was extremely difficult. There was no more historical context to forecast on, but there were some underlying things that didn’t change, and that was weekday seasonality. Saturday was always still the busiest, then Wednesday. We plugged the traffic forecasting model into the scheduling system, which helped stabilize the forecasting as we moved through phases of operating by appointment only, then 25% open times, then 50%, until we were fully open.</p>
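<p>To make that concrete, here is a minimal sketch of the kind of weekday-seasonality baseline Fattini describes. It is an illustration only, written in Python with pandas against an assumed file of daily door counts per store, not CITY Furniture’s production model:</p> <pre class="wp-block-code"><code>import pandas as pd

# Historical daily door counts: one row per store per day.
# File and column names are invented for this example.
history = pd.read_csv("daily_traffic.csv", parse_dates=["date"])
history["weekday"] = history["date"].dt.day_name()

# Use only the most recent eight weeks so the overall level can shift quickly
# (appointment-only, 25%, 50% phases), while the weekday shape persists.
cutoff = history["date"].max() - pd.Timedelta(weeks=8)
recent = history[history["date"] >= cutoff]

# Average traffic per store and weekday over that window.
weekday_profile = (
    recent.groupby(["store_id", "weekday"], as_index=False)["traffic"].mean()
          .rename(columns={"traffic": "expected_traffic"})
)

# A naive staffing baseline for next week: repeat each store's weekday profile.
print(weekday_profile.sort_values(["store_id", "expected_traffic"],
                                  ascending=[True, False]))</code></pre> <p>Restricting the window to recent weeks lets the overall level track each reopening phase while the weekday pattern carries the forecast.</p>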
<p>Data science had proven its value to the company, and we now have a dedicated team of data science engineers–another hybrid role.</p> <p><strong>CDI: The typical breakdown between data science and DataOps or data engineering assumes a continuum of workflows and skill sets. That hybrid role could be the key to bridging that disconnect.</strong></p> <p><strong>Ryan Fattini: </strong>We do have two academic data scientists researching potential models, but we also need engineers who can build operational models that are more connected to the business and can be delivered in three months.</p> <p><strong>CDI:</strong> <strong>You’ve set the business and cultural context for us. What can you share about the technology challenges CITY Furniture faced in becoming more data-driven?</strong></p> <p><strong>Ryan Fattini: </strong>The starting point was an IBM mainframe that pulled data from almost a hundred systems. It had been set up in the seventies, so it had data structures built under constraints for maximizing space. Every column had short names, dates were all numeric, and there were some weird data slices. The data warehouse was built when advanced analytics wasn’t even considered. We decided to focus on providing real-time data to the stores–managers could see in real time what was being sold or not sold, and the salespeople could monitor their KPIs and change selling strategies that same day. We did that by adding a streaming layer on the mainframe system that fed into the IBM cloud data warehouse.</p> <p>When other business units saw what real-time data did for the sales department, they wanted some too.</p> <p><strong>CDI: That meant adding more systems to the real-time data warehouse?</strong></p> <p><strong>Ryan Fattini: </strong>Yes, lots of systems, all of them different from the IBM transactional system that we had enabled for streaming data. There were different databases and different data sources. We thought our software engineering approach would work in this case too. We started batching other systems’ databases into our warehouse, but this was clumsy and slow. There had to be a better way. We worked with Gartner consulting, and they brought up virtualization and building out a data fabric that supported it.</p> <p>Connecting to the various data sources when data is virtualized means that you don’t need to move the data, and you’ve solved the data-gravity problem. Our proof of concept included relational and non-relational data sources, some of them on-prem sources that teams use to feed little Excel files. All of it could be addressed by the virtualization data fabric.</p> <p><strong>Ravi Shankar: </strong>Many people are not very aware of the logical form of integration. For 30 years, they have been doing physical integration. The analogy I would give is a patient who is having a cardiac event and needs some medication from the drugstore. You could take a bicycle to the drugstore, pick up the medicine, and bring it back. The patient might not survive. Or you could drive to the drugstore, and within minutes you’re back with the medication. Logical integration is a much faster way of getting to data than using some of the physical ways. 
The modern equivalent is dumping all data into a physical data lake. It’s still not integrated. Physical integration will continue to exist, for example, when moving data into a data warehouse, but the right tool should be used for the right job.</p> <p>Data is increasing, and the variety is increasing. There is a benefit in having the data in a single place where it’s easy to find for business users. But the rate at which data proliferates far exceeds the human ability to pull that into a central place. So we’ve moved from a centralized data warehouse to multiple data warehouses, then came the data lake and the data lakehouse. The cloud service providers would like to have all the data put into the cloud, but even then, we use multiple technologies and multiple clouds. All require some kind of integration.</p> <p><strong>CDI: CITY Furniture has an impressive collection of data sources, probably like many businesses. You’ve broken down that last data storage silo–Excel. Which aspect of the data fabric was key to an effective virtualization strategy?</strong></p> <p><strong>Ryan Fattini: </strong>The data catalog. It lets the software teams be more dynamic with their queries. When they have data, or they need a data solution, or just a data point, instead of having to use a driver and write some crazy query into their application, we give them access to the data fabric through the data catalog, and they write a simple query string.</p> <p><strong>CDI: The data catalog lets them know exactly what data they’re accessing. It’s not just a mysterious set of fields. What happens next?</strong></p> <p><strong>Ryan Fattini: </strong>The data catalog is still in beta testing with key stakeholders like the CFO and the COO. They can search and tag data and find reliable data sources themselves. First, we made data accessible in real time. Second, we used virtualization to connect all our data into one fabric. The third key shift is democratizing the data. The data catalog will allow anyone to find the right data. Also, the data fabric and virtualization layer support the work of distributed data teams while maintaining governance and consistency, since they cannot make changes to the core system. Their changes are logical changes. We will still have a central team managing that. That’s how we scale data access, and the infrastructure it requires, across many teams. Hopefully, no more bottlenecks will arise, since these distributed teams will support their own groups of business users.</p> <p><strong>CDI: A data catalog can support data governance to a certain extent. Have you found that more is needed?</strong></p> <p><strong>Ryan Fattini: </strong>You have to have strong governance in place. In addition to governance, you need standard operating procedures for how you do work. You can’t have six people building tables six different ways, right? And you need a PII (personally identifiable information) strategy before you consider democratizing data.</p> <p><strong>CDI: Thanks so much for sharing the story of CITY Furniture’s move to real-time data and the virtual integration of disparate data systems. 
You’ve laid a solid foundation for the data democratization phase and other transformations that might come after that.</strong></p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img alt='Elisabeth Strenger' src='https://secure.gravatar.com/avatar/d42bdc4339b8a684f54ad42d3ac0accb?s=100&d=mm&r=g' srcset='https://secure.gravatar.com/avatar/d42bdc4339b8a684f54ad42d3ac0accb?s=200&d=mm&r=g 2x' class='avatar avatar-100 photo' height='100' width='100' itemprop="image"/></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/estrenger/" class="vcard author" rel="author"><span class="fn">Elisabeth Strenger</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><p>Elisabeth Strenger is a Senior Technology Writer at <a href="https://www.clouddatainsights.com/">CDInsights.ai</a>.</p> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/city-furniture-a-real-time-and-data-virtualization-case-study/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">2001</post-id> </item> <item> <title>Sound Data Management Practices Can Pay Off Now and Later. Here’s How. </title> <link>https://www.clouddatainsights.com/sound-data-management-practices-can-pay-off-now-and-later-heres-how/</link> <comments>https://www.clouddatainsights.com/sound-data-management-practices-can-pay-off-now-and-later-heres-how/#respond</comments> <dc:creator><![CDATA[Ken Seier]]></dc:creator> <pubDate>Tue, 19 Jul 2022 15:38:35 +0000</pubDate> <category><![CDATA[Governance]]></category> <category><![CDATA[data lake]]></category> <category><![CDATA[data management]]></category> <category><![CDATA[data warehouse]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=1451</guid> <description><![CDATA[Taking a proactive and strategic approach to data management can save time, money, and resources while unlocking even more powerful insights that lead to business outcomes. ]]></description> <content:encoded><![CDATA[<div class="wp-block-image"> <figure class="aligncenter size-full is-resized"><img decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/07/data-management-Depositphotos_63896367_S.jpg" alt="" class="wp-image-1452" width="500" height="334"/><figcaption>Taking a proactive and strategic approach to data management can save time, money, and resources while unlocking even more powerful insights.</figcaption></figure></div> <p>Modern companies know that analytics needs to be at the heart of their decision-making and business strategy. That’s why many have invested in pathways to access vast amounts of data. But now, they are encountering a new challenge: <a href="https://www.datacenterdynamics.com/en/opinions/saving-money-smarter-storage-why-data-management-becoming-business-critical/" target="_blank" rel="noreferrer noopener">Over 70% of the data generated today</a> is no longer structured, easy to manage, find or analyze. This challenge begs the question: What can be done to make this data meaningful to the business?</p> <p>Unfortunately, a “one size fits all” approach to data architecture and management doesn’t exist. 
But here are the different approaches that have historically been used, along with an introduction to the lesser-known, emerging model of data mesh.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/enabling-innovation-with-the-right-cloud-data-architecture/" target="_blank" rel="noreferrer noopener">Enabling Innovation with the Right Cloud Data Architecture </a></p> <h3 class="wp-block-heading"><strong>Breaking Down the Role of Traditional Data Architecture</strong></h3> <p><strong>Data Lakes</strong>: A <a href="https://www.techtarget.com/searchdatamanagement/definition/data-lake#:~:text=A%20data%20lake%20is%20a,in%20files%20or%20object%20storage." target="_blank" rel="noreferrer noopener">data lake</a> is a giant, central storage repository that holds a vast amount of raw data in its native format until it is needed. Data lakes help data scientists and analysts who are tasked with determining whether raw data from an organization’s datasets can be turned into actionable insights. Through its flat architecture, the data lake provides more flexibility and lower-cost storage and usage, so data from the business systems can be replicated into a single repository.</p> <p>Data lakes can be beneficial for industries like oil and gas that accumulate large, complex data sets. On average, an oil company generates 1.5 terabytes of Internet of Things data daily. By leveraging data lakes for exploration, this industry can optimize directional drilling, lower operating expenses, improve safety, and stay compliant with regulatory requirements.</p> <p><strong>Data Warehouses</strong>: Once data scientists or analysts find value in the various datasets within these data lakes, that refined data and intelligence can be brought into a <a href="https://www.gartner.com/en/information-technology/glossary/data-warehouse" target="_blank" rel="noreferrer noopener">data warehouse</a>. A data warehouse is a structured storage architecture used to hold cleansed and transformed data from various sources for historical reporting and large-scale decision support.</p> <p>Data warehouses tend to be quite large and central to business success. They require significant engineering and operational effort to build and maintain. They can be platformed in on-premises systems, cloud deployments, or data-warehouse-as-a-service offerings.</p> <p>The refined data and intelligence housed in a data warehouse are commonly aggregated and shaped to be more “business-friendly” to inform better reuse and decision-making for enterprises. This approach can be helpful for organizations that need to make repeatable business decisions and drive operational efficiencies.</p>
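<p>As a rough illustration of how the two layers relate (a sketch only, using pandas with invented file, column, and table names, and SQLite standing in for a warehouse), raw exports can land in low-cost storage exactly as they arrive, while only the cleansed, business-friendly slice is loaded into a warehouse table:</p> <pre class="wp-block-code"><code>from pathlib import Path
import sqlite3

import pandas as pd

# Lake: keep the raw export as-is, in cheap columnar files partitioned by ingest date.
raw = pd.read_json("pos_export_2022-11-01.json", lines=True)
lake_dir = Path("lake/sales/ingest_date=2022-11-01")
lake_dir.mkdir(parents=True, exist_ok=True)
raw.to_parquet(lake_dir / "raw.parquet", index=False)

# Warehouse: load only the cleansed, aggregated slice that reporting needs.
clean = (
    raw.dropna(subset=["order_id", "store_id", "amount"])
       .assign(order_date=lambda df: pd.to_datetime(df["order_ts"]).dt.date)
       .groupby(["store_id", "order_date"], as_index=False)["amount"].sum()
       .rename(columns={"amount": "daily_sales"})
)
with sqlite3.connect("warehouse.db") as conn:  # stand-in for a real warehouse
    clean.to_sql("daily_store_sales", conn, if_exists="append", index=False)</code></pre> <p>For instance, Walmart used its data warehouses to test inventory management methods in its U.S. and Canadian stores. 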
As a result, Walmart could make more informed decisions to <a href="https://corporate.walmart.com/newsroom/2010/02/23/walmart-canada-to-open-35-to-40-supercentres-in-2010#:~:text=Walmart%20Canada%20to%20Open%2035%20to%2040%20Supercentres%20in%202010" target="_blank" rel="noreferrer noopener">open new locations in Canada</a> and <a href="https://www.usatoday.com/story/money/business/2016/01/15/list-of-walmart-stores-closing/78852898/" target="_blank" rel="noreferrer noopener">close certain U.S. stores</a> in an effort to accommodate its customers’ needs.</p> <p><strong>Operational Data Store (ODS):</strong> Because the data warehouse is massive and has many moving parts, it can be difficult to update that data frequently to support fast-moving decisions.</p> <p>An <a href="https://www.techtarget.com/searchoracle/definition/operational-data-store" target="_blank" rel="noreferrer noopener">ODS</a> integrates and transforms the minimal cross-system data required to provide real-time decision support. This separates the high-compute transformations for fast intelligence from the large, regular needs of the data warehouse. Data, decisions, and alerts from the ODS are often moved into the data warehouse or data lake for archival use. </p> <p>Any organization managing minute-to-minute decisioning — patient care, manufacturing lines, or energy management are great examples — could benefit from an ODS approach.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/" target="_blank" rel="noreferrer noopener">How the Data ‘Lakehouse’ Might Usurp the Warehouse and the Lake</a></p> <h3 class="wp-block-heading"><strong>How Data Mesh Fits</strong></h3> <p>Alongside these three classic modes of data storage, <a href="https://www.datanami.com/2022/01/21/data-meshes-set-to-spread-in-2022/" target="_blank" rel="noreferrer noopener">the data mesh concept</a> is a growing architectural design principle that allows data scientists and analysts to examine data anywhere in a system: across the data lake, data warehouse, and ODS, as well as source systems. This creates a “virtual data hub” that enables robust, enterprise-wide exploration without the cost and overhead of replicating data of unknown value into the data lake.</p> <p>Once value is discovered, engineering effort is used to pipe the data and intelligence into the data lake, data warehouse, and/or operational data store for consumption.</p> <p>Numerous industries can benefit from data mesh. Many financial services companies are grappling with how to modernize outdated technology. The systems they typically use have been in place for 50+ years, and any attempt at updating slows down system processes and incurs risk.</p> <p>Healthcare is another industry that greatly benefits from the data mesh concept. It allows providers to navigate security positions tailored to HIPAA while also using patient data to improve the care experience and overall outcomes.</p> <h3 class="wp-block-heading"><strong>Key Considerations when Migrating to Data Management Models</strong></h3> <p>While the value proposition for any of these data management models is compelling, it’s important to recognize that organizations can face several challenges with migration.</p> <p>There’s a sizable investment for these infrastructures (think large on-prem appliances like Netezza). 
Additionally, the amount of labor and skill required to maintain a system is very different compared to what it takes to build it, so organizations must adapt accordingly.</p> <p>In order to lay the foundation for well-governed data management, companies must:</p> <ul class="wp-block-list"><li><strong>Understand the long-term strategy</strong>: If a company knows what long-term analytics success looks like, it’s possible to bring data into the decision-making process and provide data scientists and analysts with the tools they need to be successful.</li></ul> <ul class="wp-block-list"><li><strong>Be nimble and flexible</strong>: Ideally, organizations should put flexibility front and center when migrating to a data management solution. This way, the data architect’s designs will meet the organization’s needs today and grow with it based on future needs.</li></ul> <ul class="wp-block-list"><li><strong>Invest in the right people and skillsets</strong>: Because these data architectures are relatively new, there is often a knowledge gap. While the data warehouse has been around for 50+ years, newer systems like data lake or data mesh are often less understood. Business and technology leaders must be in lockstep when it comes to making critical investments and having the right expertise in place to drive these solutions. In today’s talent-constrained environment, often bringing in a third-party solution provider can make sense for both implementation and day-to-day management.</li></ul> <p>There’s a business case to be made for investing in and implementing a well-governed, measured approach to data management<a>. Taking a proactive and strategic approach to data management </a>can save time, money, and resources while unlocking even more powerful insights that lead to business outcomes.</p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img alt='Ken Seier' src='https://secure.gravatar.com/avatar/de6a16a9d49a5f42d9db3b4f500388bb?s=100&d=mm&r=g' srcset='https://secure.gravatar.com/avatar/de6a16a9d49a5f42d9db3b4f500388bb?s=200&d=mm&r=g 2x' class='avatar avatar-100 photo' height='100' width='100' itemprop="image"/></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/ken-seier/" class="vcard author" rel="author"><span class="fn">Ken Seier</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><p><em>Ken Seier is chief architect of data and artificial intelligence at </em><a href="https://www.insight.com/"><em>Insight Enterprises</em></a><em>, a Fortune 500 solutions integrator helping organizations accelerate their digital journey to modernize their business and maximize the value of technology. 
Ken and his team have been responsible for billions of dollars of revenue and savings through responsible analytics initiatives and innovation.</em></p> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/sound-data-management-practices-can-pay-off-now-and-later-heres-how/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">1451</post-id> </item> <item> <title>How the Data ‘Lakehouse’ Might Usurp the Warehouse and the Lake</title> <link>https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/</link> <comments>https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/#respond</comments> <dc:creator><![CDATA[Joel Hans]]></dc:creator> <pubDate>Fri, 08 Jul 2022 13:59:10 +0000</pubDate> <category><![CDATA[Cloud Data Platforms]]></category> <category><![CDATA[Data Architecture]]></category> <category><![CDATA[data lake]]></category> <category><![CDATA[data warehouse]]></category> <category><![CDATA[lakehouse]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=1415</guid> <description><![CDATA[Lakehouses combine the benefits of warehouses and lakes so organizations can use their massive quantities of unstructured data with the speed and reliability of a warehouse.]]></description> <content:encoded><![CDATA[<div class="wp-block-image"> <figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/07/lake-house-Depositphotos_1343399_S.jpg" alt="" class="wp-image-1416" width="750" height="497" srcset="https://www.clouddatainsights.com/wp-content/uploads/2022/07/lake-house-Depositphotos_1343399_S.jpg 1000w, https://www.clouddatainsights.com/wp-content/uploads/2022/07/lake-house-Depositphotos_1343399_S-300x199.jpg 300w, https://www.clouddatainsights.com/wp-content/uploads/2022/07/lake-house-Depositphotos_1343399_S-768x509.jpg 768w" sizes="(max-width: 750px) 100vw, 750px" /><figcaption>Lakehouses combine the benefits of warehouses and lakes so organizations can use their massive quantities of unstructured data with the speed and reliability of a warehouse.</figcaption></figure></div> <p>Let’s assume that you’re well-off enough to have an entire lake in your possession. What do you do next? Build a lakehouse, of course.</p> <p>You know, an open architecture for managing your organization’s data that combines the scale of data lakes and the ACID-friendly queries of data warehouses on a single, flexible, and cost-effective platform.</p> <p>We’re talking about a platform to handle the vast quantities of an organization’s data here, not your second (or third) house where you store your pontoon boat and only visit two weekends every year.</p> <p><strong>See also:</strong> <a href="https://www.clouddatainsights.com/governance-in-the-age-of-cloud-databases/" target="_blank" rel="noreferrer noopener">Governance in the Age of Cloud Databases</a></p> <p>The data lakehouse is a growing market segment, with companies like Dremio, Databricks, and Onehouse already elbowing for the best cloud implementation of open frameworks like <a href="https://hudi.apache.org/" target="_blank" rel="noreferrer noopener">Apache Hudi</a>, <a href="https://iceberg.apache.org/" target="_blank" rel="noreferrer noopener">Apache Iceberg</a>, and <a href="https://delta.io/" target="_blank" rel="noreferrer noopener">Delta Lake</a>. 
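</p> <p>In practice, “lakehouse” mostly means one of those open table formats layered over object storage, with a query engine on top. Here is a minimal sketch of what that looks like, using Delta Lake with PySpark (Hudi and Iceberg expose similar write paths); the package version, path, and columns are illustrative assumptions, not a recommended setup:</p> <pre class="wp-block-code"><code>from pyspark.sql import SparkSession

# A local Delta Lake table: ACID transactions over plain Parquet files plus a log.
# Assumes pyspark is installed; delta-spark is pulled in via spark.jars.packages.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame(
    [(1, "store-12", 499.00), (2, "store-07", 129.50)],
    ["order_id", "store_id", "amount"],
)

# Writes are transactional; concurrent readers see a consistent snapshot.
orders.write.format("delta").mode("append").save("/tmp/lakehouse/orders")

# The same files stay queryable like a warehouse table.
spark.read.format("delta").load("/tmp/lakehouse/orders").show()</code></pre> <p>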
But before jumping straight into the supposed benefits of the lakehouse, let’s talk about how the industry got here, to a new product category, just as it seemed like data lakes were catching on.</p> <p>Years ago, the data warehouse was the standard for business intelligence and analytics. Organizations stored their structured data in an ACID-compliant environment, which refers to the atomicity, consistency, isolation, and durability of the warehouse’s data. For all the benefits they created in terms of data quality and driving business analytics, they were costly, and their inflexibility tended to create silos.</p> <p>The data lake was developed as an answer to these problems. As a central, “flat” repository of all raw structured <em>and</em> unstructured data in object form, the data lake was designed to make data more accessible to more employees without the risk of siloing. Data lakes tend to run cheaper than warehouses since most public clouds support the object storage model.</p> <p>But many organizations, especially those at the leading edge of data storage and analysis, started to notice problems with data warehouses and lakes, even after trying to solve their individual cons by combining them into a single management and analysis infrastructure.</p> <p>Back in 2014, Uber was struggling with their data warehouse, <a href="https://techcrunch.com/2022/02/02/with-8m-seed-onehouse-builds-open-source-data-lake-house-eyes-managed-service/" target="_blank" rel="noreferrer noopener">according to Vinoth Chandar</a>, who managed the company’s data team at the time. They realized that different business units had different “versions” of the company’s data. Some analyses included the most recent updates, while others didn’t, which meant their people made critical decisions based on false or outdated assumptions.</p> <p>Uber’s engineers started building a custom Hadoop infrastructure around their warehouse, effectively combining their data warehouse with a data lake, to help different teams run analytics and make decisions based on the data they were paying handsomely to collect and store. Internally, they called this project “Hoodie.”</p> <p>In parallel with Uber, developers from Netflix, Apple, and Salesforce started working on a different open-source framework for democratizing the enormous volume of data they were all collecting about their customers. With both warehouses and lakes, these companies often needed to copy data to other systems to help their employees run analytics in comfortable, ACID-compliant environments where they didn’t have to worry about affecting durability. They were being overrun with complexity.</p> <p>They started building what’s now called <a href="https://iceberg.apache.org/">Iceberg</a>, an open-source format for big data analytics that lets multiple engines work on the same tables, at the same time, with the “reliability and simplicity of SQL tables.”</p> <p>Developers behind both projects eventually released them into open source, following a trend long-established in Silicon Valley tech giants. Back in 2011, Yahoo spun Hadoop out into its own company, and in 2014, LinkedIn did the same with Kafka. 
Both Hoodie—now called Hudi—and Iceberg are part of the <a href="https://www.apache.org/" target="_blank" rel="noreferrer noopener">Apache Software Foundation</a>, where they’re maintained and built by a global network of volunteer contributors.</p> <p>Hudi is now supported on AWS, Google Cloud, and Microsoft Azure and is used by companies like Disney, Twitter, Walmart, and more.</p> <p>They’re also now the foundation of the data lakehouse industry. When deployed into production against new or existing data sets, these tools let organizations store all their structured and unstructured data on low-cost storage, just like data lakes do. They also bring in the data structure and management features of warehouses, like ACID-compliant transactions and simpler query development.</p> <p>By combining the benefits of warehouses and lakes, the lakehouse lets organizations utilize their massive quantities of unstructured data with the speed and reliability of a warehouse. That’s a new foundation for data democratization—an organization’s entire workforce, from developers to marketers to salespeople, running business and machine learning (ML) analytics on large quantities of data stored in a single, stable place.</p> <p>The lakehouse’s pitch is compelling, which is why the market is heating up fast. Back in February, Onehouse <a href="https://techcrunch.com/2022/02/02/with-8m-seed-onehouse-builds-open-source-data-lake-house-eyes-managed-service/" target="_blank" rel="noreferrer noopener">netted an $8 million seed round</a> to build an open-source data lakehouse based on Hudi with a managed service in the offing. Earlier this year, <a href="https://techcrunch.com/2022/01/25/dremio-raises-160m-series-e-for-its-data-lake-platform/" target="_blank" rel="noreferrer noopener">Dremio raised $160 million in its Series E</a> to extend its product, partially based on Iceberg. The company recently made the <a href="https://venturebeat.com/2022/03/02/dremio-launches-free-data-lakehouse-service-for-enterprises/" target="_blank" rel="noreferrer noopener">free edition of its cloud service</a> generally available for enterprises. <a href="https://databricks.com/product/data-lakehouse" target="_blank" rel="noreferrer noopener">Databricks</a>, which also maintains its own open-source <a href="https://delta.io/" target="_blank" rel="noreferrer noopener">Delta Lake architecture</a>, claims more than 450 partners and multicloud support.</p> <p>But, like all lakehouses, there’s the hype cycle and price tag to account for, which likely locks out small- or mid-sized companies for the time being. In the meantime, they’ll have to settle for a swim in the lake.</p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/06/joel_hans.jpg" width="100" height="100" alt="" itemprop="image"></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/joel-hans/" class="vcard author" rel="author"><span class="fn">Joel Hans</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><p>Joel Hans is a copywriter and technical content creator for open source, B2B, and SaaS companies at <a href="https://commitcopy.com/">Commit Copy</a>, bringing experience in infrastructure monitoring, time-series databases, blockchain, streaming analytics, and more. 
Find him on Twitter <a href="https://twitter.com/joelhans">@joelhans</a>.</p> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">1415</post-id> </item> </channel> </rss>