<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" > <channel> <title>data lake Archives - CDInsights</title> <atom:link href="https://www.clouddatainsights.com/tag/data-lake/feed/" rel="self" type="application/rss+xml" /> <link>https://www.clouddatainsights.com/tag/data-lake/</link> <description>Transform Your Business in a Cloud Data World</description> <lastBuildDate>Sun, 03 Mar 2024 17:55:53 +0000</lastBuildDate> <language>en-US</language> <sy:updatePeriod> hourly </sy:updatePeriod> <sy:updateFrequency> 1 </sy:updateFrequency> <generator>https://wordpress.org/?v=6.6.1</generator> <image> <url>https://www.clouddatainsights.com/wp-content/uploads/2022/05/CDI-Favicon-2-45x45.jpg</url> <title>data lake Archives - CDInsights</title> <link>https://www.clouddatainsights.com/tag/data-lake/</link> <width>32</width> <height>32</height> </image> <site xmlns="com-wordpress:feed-additions:1">207802051</site> <item> <title>Planning a Data Lake? 
Prepare for These 7 Challenges</title> <link>https://www.clouddatainsights.com/planning-a-data-lake-prepare-for-these-7-challenges/</link> <comments>https://www.clouddatainsights.com/planning-a-data-lake-prepare-for-these-7-challenges/#respond</comments> <dc:creator><![CDATA[Kausik Chaudhuri]]></dc:creator> <pubDate>Fri, 01 Mar 2024 17:52:00 +0000</pubDate> <category><![CDATA[Cloud Strategy]]></category> <category><![CDATA[Data Architecture]]></category> <category><![CDATA[data lake]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=5050</guid> <description><![CDATA[Without addressing challenges like the need for cybersecurity protections and data quality controls, enterprises may struggle to derive full value from data lakes.]]></description> <content:encoded><![CDATA[<div class="wp-block-image"> <figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="1000" height="771" src="https://www.clouddatainsights.com/wp-content/uploads/2024/03/Data-lake-Depositphotos_214678082_S.jpg" alt="" class="wp-image-5052" srcset="https://www.clouddatainsights.com/wp-content/uploads/2024/03/Data-lake-Depositphotos_214678082_S.jpg 1000w, https://www.clouddatainsights.com/wp-content/uploads/2024/03/Data-lake-Depositphotos_214678082_S-300x231.jpg 300w, https://www.clouddatainsights.com/wp-content/uploads/2024/03/Data-lake-Depositphotos_214678082_S-768x592.jpg 768w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>Without addressing these seven challenges, enterprises may struggle to derive full value from data lakes.</em></figcaption></figure></div> <p>“Build a data lake!” has become one of the standard points of advice for organizations with large amounts of data to store. 
As data lakes offer a convenient, centralized location that can house data of all kinds, they often seem like an obvious solution for businesses that need to share disparate types of data with multiple stakeholders.</p> <p>They can be, but only when they’re optimally designed and managed. Data lakes can also present significant challenges, which are critical to understand before committing your company’s information to a data lake.</p> <p>Before diving into the challenges, let’s briefly define data lakes.</p> <p>A data lake is a centralized repository for storing data of all types and at any scale. Its core purpose is to allow organizations to take the disparate data assets they own – such as various databases, documents, media files, and so on – and house them in a central place where anyone who needs to access them can easily do so.</p> <p>This is what data lakes are meant to do, in theory. In practice, several challenges may hinder the effectiveness of data lakes.</p> <h3 class="wp-block-heading">Data lake challenges</h3> <p>Here’s a look at seven key challenges that organizations need to address to get the most out of data lake architectures.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/maximizing-the-value-of-your-data-lake/" target="_blank" rel="noreferrer noopener">Maximizing the Value of Your Data Lake</a></p> <p><strong>1) Cybersecurity risks</strong></p> <p>When users populate all their data in a single location without managing security features, the data is often at risk of manipulation by threat actors. A data breach targeting the data lake can mean that external users gain access to the data assets the business manages. 
Unless you implement strict cybersecurity controls, your data lake becomes a prime target for attack.</p> <p><strong>2) Compliance challenges</strong></p> <p>Storing data in a central location simplifies compliance in the sense that you know where your data resides, though it also creates compliance challenges. If you store many different types of data in your lake, different assets may be subject to different compliance standards. Data that contains personally identifiable information (PII), for instance, must be managed differently in some ways than other types of data to comply with laws like DPA, GDPR, or HIPAA.</p> <p>While a data lake won’t prevent you from applying granular security controls to different data assets, it doesn’t make it easier, either – and it can make it more difficult if your security and compliance tools are not capable of applying different policies to different data assets within a centralized repository.</p> <p><strong>3) Data integration headaches</strong></p> <p>Placing your data into a central location to create a data lake is one thing, but connecting it to various applications and the workforce that needs access is another. Until you develop the necessary data integrations – and unless you keep them up to date – your data lake will deliver little value.</p> <p>Building data integrations takes time, effort, and expertise, and users sometimes underestimate how difficult it is to create successful data integrations. Be sure to prioritize data integration strategy as part of your overall process.</p> <p><strong>4) Data performance risks</strong></p> <p>While data lakes can theoretically accommodate any volume of data, in practice, performance often suffers as they scale up. 
The more data you have in your lake, the more difficult it is to ensure that the data moves quickly, that you can run fast queries on data assets, and so on.</p> <p>Addressing these risks requires careful attention to the infrastructure that hosts your data lake, which needs to scale as data scales to ensure adequate performance. Optimizing the way data is stored also helps maintain performance.</p> <p><strong>5) Single point of failure</strong></p> <p>Placing your data in a data lake means creating a single point of failure. If the infrastructure that hosts your lake fails, your data becomes unavailable.</p> <p>Backups and replication can help in this regard. However, they’re only a partial solution because backup data may not be coordinated with production data, and both options add cost. Plus, it takes time to restore data from backups, especially if you lack a well-designed data recovery plan and the right tools to implement it.</p> <p><strong>6) Data quality challenges</strong></p> <p>Keeping on top of data quality can be challenging when you have many different data types stored in a data lake. To optimize data performance and infrastructure utilization, you’ll want to perform tasks like data deduplication. Remember that the vast scale of a data lake, combined with the constantly changing nature of data inside, makes this cumbersome if you lack proper data quality tools and processes.</p> <p><strong>7) Management challenges</strong></p> <p>Data lakes are a unique type of <a href="https://www.rtinsights.com/data-fabric-vs-data-mesh-key-differences-and-similarities/" target="_blank" rel="noreferrer noopener">data architecture</a>. They’re different from databases, file systems, object storage systems, and other approaches to storing information.</p> <p>As a result, data engineers who don’t have experience with data lakes may struggle to design and manage them optimally. 
Not every organization has a data team on hand that’s ready to make the most of a data lake. Enterprises should ensure that their IT workforce is adept at both legacy systems and new technologies.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/7-data-lake-best-practices-for-effective-data-management/" target="_blank" rel="noreferrer noopener">7 Data Lake Best Practices for Effective Data Management</a></p> <h3 class="wp-block-heading">Conclusion: Getting More from Data Lakes</h3> <p>Data lakes can be a great way to consolidate vast amounts of data and make it easily accessible, but only if they are carefully planned, implemented, and managed. Without addressing challenges like the need for cybersecurity protections and data quality controls and addressing risks like the possibility that your data lake infrastructure could fail, enterprises may struggle to derive full value from data lakes.</p> <p><strong>The bottom line:</strong> By all means, build a data lake if your business has determined that it’s the best way to store data. But you can’t just dump your data into a data lake and call it a day. Hard work is needed to navigate the many challenges described above that can undercut the value of data lakes.</p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2024/02/Kausik-Chaudhuri.jpg" width="100" height="100" alt="" itemprop="image"></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/kausik-chaudhuri/" class="vcard author" rel="author"><span class="fn">Kausik Chaudhuri</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><p>Kausik Chaudhuri is the Chief Innovation Officer at <strong><a href="http://www.lemongrasscloud.com/">Lemongrass</a></strong>. 
Kausik is a thought leader known for designing, deploying, migrating, and running complex technical solutions for mission-critical enterprise applications, including SAP. At Lemongrass, he is responsible for Platform and Enterprise Architecture, Product Management Capability in alignment with Sales and Product teams, and platform enablement of the Delivery Service Team.</p> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/planning-a-data-lake-prepare-for-these-7-challenges/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">5050</post-id> </item> <item> <title>7 Data Lake Best Practices for Effective Data Management</title> <link>https://www.clouddatainsights.com/7-data-lake-best-practices-for-effective-data-management/</link> <comments>https://www.clouddatainsights.com/7-data-lake-best-practices-for-effective-data-management/#respond</comments> <dc:creator><![CDATA[Dave Armlin]]></dc:creator> <pubDate>Sat, 17 Jun 2023 12:45:22 +0000</pubDate> <category><![CDATA[Cloud Data Platforms]]></category> <category><![CDATA[data lake]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=3394</guid> <description><![CDATA[Follow these best practices for data lake management to ensure your organization can make the most of your investment. 
]]></description> <content:encoded><![CDATA[<div class="wp-block-image"> <figure class="aligncenter size-full is-resized"><img decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2023/06/data-lake-Depositphotos_203903418_S.jpg" alt="" class="wp-image-3396" width="750" height="569" srcset="https://www.clouddatainsights.com/wp-content/uploads/2023/06/data-lake-Depositphotos_203903418_S.jpg 1000w, https://www.clouddatainsights.com/wp-content/uploads/2023/06/data-lake-Depositphotos_203903418_S-300x227.jpg 300w, https://www.clouddatainsights.com/wp-content/uploads/2023/06/data-lake-Depositphotos_203903418_S-768x582.jpg 768w" sizes="(max-width: 750px) 100vw, 750px" /><figcaption class="wp-element-caption"><em>Follow these best practices for data lake management to ensure your organization can make the most of your investment.</em></figcaption></figure></div> <p>Data lakes are rapidly becoming one of the most popular ways for organizations to store and manage their data. By storing data in a central location, data lakes allow organizations to access, analyze, and gain insights from their data more easily. However, without proper management and implementation, data lakes can quickly become unmanageable and difficult to work with. In this article, we will discuss some key data lake best practices to make sure your data management is optimized from the start.</p> <h2 class="wp-block-heading"><a></a>Best Practices for Data Lake Success</h2> <h3 class="wp-block-heading"><a></a>1. Plan for Your Data Lake</h3> <p>Before you begin implementing your data lake, it’s important to plan ahead. This means understanding the types of data you will be storing and how you will be accessing and analyzing that data. You should also consider how you will be securing your data and ensuring compliance with any relevant regulations. 
Additionally, you will want to think about how you will be scaling your data lake as your organization grows.</p> <h3 class="wp-block-heading"><a></a>2. Choose the Right Tools</h3> <p>There are many tools available for building data lakes, including Amazon S3, Google Cloud Platform, Azure, and Snowflake. It’s important to choose the right tool for your needs based on factors such as your data volume, processing needs, and budget. You may also want to consider using a data lake platform that includes built-in tools for data management, such as data cataloging, indexing, and search.</p> <h3 class="wp-block-heading"><a></a>3. Optimize Your Data Lake for Performance</h3> <p>One of the biggest challenges with data lakes is ensuring fast query performance. To optimize your data lake for performance, you can use techniques such as partitioning, indexing, and caching. Partitioning involves dividing your data into smaller, more manageable segments, which can speed up queries by limiting the amount of data that needs to be scanned. Indexing involves creating indexes on your data that allow for faster searches. Caching involves storing frequently accessed data in memory, which can significantly improve query performance.</p> <h3 class="wp-block-heading"><a></a>4. Use a Data Catalog</h3> <p>A data catalog is a tool that allows you to organize and manage your data lake, making it easier to discover, access, and analyze your data. A good data catalog should allow you to search for data by keywords, tags, and other metadata and should provide information about the quality, lineage, and usage of your data. By using a data catalog, you can make your data lake more accessible and user-friendly, which can help drive the adoption and usage of your data.</p> <h3 class="wp-block-heading"><a></a>5. Ensure Data Quality and Governance</h3> <p>One of the biggest risks with data lakes is the potential for poor data quality and governance. 
To ensure that your data is accurate, consistent, and trustworthy, you should establish processes for data quality control, data lineage, and <a href="https://www.clouddatainsights.com/the-data-governance-solutions-landscape-is-evolving/" target="_blank" rel="noreferrer noopener">data governance</a>. This includes establishing data validation rules, tracking data lineage, and defining policies for data access, retention, and deletion.</p> <h3 class="wp-block-heading"><a></a>6. Implement Security and Compliance Measures</h3> <p>Security and compliance are critical considerations for any data lake implementation. To ensure the security of your data, you should implement measures such as encryption, access controls, and audit trails. You should also ensure compliance with relevant regulations such as GDPR, HIPAA, and <a href="https://oag.ca.gov/privacy/ccpa" target="_blank" rel="noreferrer noopener">CCPA</a>. This may involve establishing policies for data retention, deletion, and sharing, as well as conducting regular security audits and assessments.</p> <h3 class="wp-block-heading"><a></a>7. Monitor and Optimize Your Data Lake</h3> <p>Once your data lake is up and running, it’s important to monitor and optimize its performance. This involves regularly analyzing query performance, resource utilization, and data growth and making adjustments as needed. 
You may also want to consider using tools such as <a href="https://www.clouddatainsights.com/explore-the-mutual-advantages-of-generative-ai-and-the-cloud/" target="_blank" rel="noreferrer noopener">machine learning</a> and predictive analytics to identify patterns and optimize your data lake over time.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/data-gravity-a-comprehensive-guide/" target="_blank" rel="noreferrer noopener">Data Gravity: A Comprehensive Guide</a></p> <h2 class="wp-block-heading">Conclusion</h2> <p>Implementing a data lake can provide many benefits for organizations, including improved data accessibility, analysis, and insights. However, without proper management and implementation, data lakes can quickly become unmanageable and difficult to work with, not to mention, very costly! Follow these best practices for data lake management to ensure your organization can make the most of your investment.</p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2023/06/Armlin.jpeg" width="100" height="100" alt="" itemprop="image"></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/dave-armlin/" class="vcard author" rel="author"><span class="fn">Dave Armlin</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><p><strong><a href="https://www.linkedin.com/in/darmlin/">Dave Armlin</a></strong> is the VP Customer Success of <a href="https://www.chaossearch.io/blog/data-lake-best-practices"><strong>ChaosSearch</strong></a>. In this role, he works closely with new customers to ensure successful deployments, as well as with established customers to help streamline integrating new workloads into the ChaosSearch platform. 
Dave has extensive experience in big data and customer success from prior roles at Hubspot, Deep Information Sciences, Verizon, and more. Dave loves technology and balances his addiction to coffee with quality time with his wife, daughter, and son as they attack whatever sport is in season. He holds a Bachelor of Science in Computer Science from Northeastern University.</p> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/7-data-lake-best-practices-for-effective-data-management/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">3394</post-id> </item> <item> <title>Maximizing the Value of Your Data Lake</title> <link>https://www.clouddatainsights.com/maximizing-the-value-of-your-data-lake/</link> <comments>https://www.clouddatainsights.com/maximizing-the-value-of-your-data-lake/#respond</comments> <dc:creator><![CDATA[Brendan Newlon]]></dc:creator> <pubDate>Tue, 20 Sep 2022 23:28:40 +0000</pubDate> <category><![CDATA[Cloud Data Platforms]]></category> <category><![CDATA[Data Architecture]]></category> <category><![CDATA[data lake]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=1832</guid> <description><![CDATA[Organizations are adopting modern data management approaches, such as semantic-based knowledge graphs, to connect data across the enterprise and accelerate the value from their data lake investments.]]></description> <content:encoded><![CDATA[<div class="wp-block-image"> <figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/09/data-lake-Depositphotos_215033950_S-1.jpg" alt="" class="wp-image-1834" width="750" height="625" srcset="https://www.clouddatainsights.com/wp-content/uploads/2022/09/data-lake-Depositphotos_215033950_S-1.jpg 1000w, 
https://www.clouddatainsights.com/wp-content/uploads/2022/09/data-lake-Depositphotos_215033950_S-1-300x250.jpg 300w, https://www.clouddatainsights.com/wp-content/uploads/2022/09/data-lake-Depositphotos_215033950_S-1-768x640.jpg 768w" sizes="(max-width: 750px) 100vw, 750px" /><figcaption>Organizations are adopting modern data management approaches, such as semantic-based knowledge graphs, to connect data across the enterprise and accelerate the value from their data lake investments.</figcaption></figure></div> <p>Data lakes can store a variety of data types and rapidly handle huge volumes of data, which has led to their widespread adoption. <a href="https://www.gartner.com/en/information-technology/glossary/data-lake" target="_blank" rel="noreferrer noopener">Gartner defines a data lake</a> as a collection of storage instances of various data assets that are stored in a near-exact, or even exact, copy of the source format of the originating data stores. So, data lakes hold enormous promise in supporting modern enterprise data architectures. Implementations continue to be successful in uniting enterprise data physically; however, they can fall short in delivering returns for business users. This is because the bulk of the data within the data lake is unconnected and stored in its native form, requiring businesses to spend considerable time and money to prepare it for analysis.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/what-is-a-data-lakehouse/" target="_blank" rel="noreferrer noopener">What is a Data Lakehouse? </a></p> <p>When used in conjunction with data lakes, data lakehouses, an approach that combines elements of the data warehouse with those of the data lake, help organizations co-locate data from across the organization using cost-effective approaches for storage. 
They also provide the opportunity to leverage the data at the computational layer to capitalize on the benefits of AI and reduce the need to maintain expensive and brittle ETL pipelines against traditional structured and costly on-prem data warehouses. However, while data lakes address the data access problem, they have yet to democratize access so that non-technical users can self-serve and collaborate to generate the rapid insights needed to keep pace with consumer preferences and changing business dynamics.</p> <p><strong>See also: </strong><a href="https://www.rtinsights.com/the-role-of-knowledge-graphs-in-cloud-data-integration/" target="_blank" rel="noreferrer noopener">The Role of Knowledge Graphs in Cloud Data Integration</a></p> <p>In the past, organizations linked BI tools to their data lake, but this resulted in other issues, such as higher latency, reduced collaboration and reuse, and the inability to leverage data across domains to provide context. These storage solutions also hindered the ability to conduct self-service through data exploration in support of enriching analytics and inferring new insights.</p> <p>To resolve those challenges, organizations are adopting modern data management approaches such as enterprise knowledge graphs to connect data across the enterprise and accelerate the value from their data lake investments. 
By connecting enterprise data with business semantics, knowledge graphs reduce the cost of data integration and help generate powerful insights into complex business challenges, all while enabling more agile data operations.</p> <p><em>Read the rest of this article on <strong><a href="https://www.rtinsights.com/maximizing-the-value-of-your-data-lake/" target="_blank" rel="noreferrer noopener">RTInsights</a></strong>.</em></p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/09/Brandon-Newlon-Stardog.jpg" width="100" height="100" alt="" itemprop="image"></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/brendan-newlon/" class="vcard author" rel="author"><span class="fn">Brendan Newlon</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><p>Brendan Newlon is a Solutions Architect at <strong><a href="http://www.stardog.com/">Stardog</a></strong>, the leading Enterprise Knowledge Graph (EKG) platform provider. 
For more information, visit <a href="http://www.stardog.com/">www.stardog.com</a> or follow them @StardogHQ.</p> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/maximizing-the-value-of-your-data-lake/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">1832</post-id> </item> <item> <title>Are Data Lakehouses the Panacea, Or Is There Something Better?</title> <link>https://www.clouddatainsights.com/are-data-lakehouses-the-panacea-or-is-there-something-better/</link> <comments>https://www.clouddatainsights.com/are-data-lakehouses-the-panacea-or-is-there-something-better/#respond</comments> <dc:creator><![CDATA[Lewis Carr]]></dc:creator> <pubDate>Wed, 14 Sep 2022 23:48:09 +0000</pubDate> <category><![CDATA[Cloud Data Platforms]]></category> <category><![CDATA[data lake]]></category> <category><![CDATA[data lakehouse]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=1799</guid> <description><![CDATA[While data lakehouses solve some issues, they are not a universal remedy. 
They really are the next generation of data lakes, incorporating some features and functionality found in data warehouses but with an eye toward data science.]]></description> <content:encoded><![CDATA[<div class="wp-block-image"> <figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/09/cloud-database-Depositphotos_8280934_S.jpg" alt="" class="wp-image-1858" width="750" height="551" srcset="https://www.clouddatainsights.com/wp-content/uploads/2022/09/cloud-database-Depositphotos_8280934_S.jpg 1000w, https://www.clouddatainsights.com/wp-content/uploads/2022/09/cloud-database-Depositphotos_8280934_S-300x220.jpg 300w, https://www.clouddatainsights.com/wp-content/uploads/2022/09/cloud-database-Depositphotos_8280934_S-768x564.jpg 768w" sizes="(max-width: 750px) 100vw, 750px" /><figcaption><strong>While data lakehouses solve some issues, they are not a universal remedy. They really are the next generation of data lakes, incorporating some features and functionality found in data warehouses but with an eye toward data science.</strong></figcaption></figure></div> <p>The technology world is full of innovations that take useful aspects of two separate technologies and create a whole new category of products. Clock radios, fax machines, and smartphones stand as popular combinations that changed the lives of many. </p> <p>“Data lakehouses” have been pitched as one of the newest examples of this type of innovation. Backers describe it as a cross between a big, hard-to-access data lake and a costly, limited-functionality data warehouse. 
They say that data lakehouses combine the best features of data lakes and data warehouses: the flexibility and relatively low cost of a data lake, coupled with the ease of access and support for enterprise analytics capabilities found in data warehouses.</p> <p>It’s a reasonable argument based on the needs in the marketplace and the shortcomings displayed in the age of unstructured (or semi-structured) data. But are data lakehouses really poised to become the market drivers proponents say they will be? Or are they just another passing fad that’s making noise today but will be replaced by a new, more targeted innovation tomorrow?</p> <p>The answer will impact the strategies of large numbers of enterprises looking for solutions to manage data in a variety of formats, including those that could potentially be analyzed by artificial intelligence (AI) and machine learning (ML) tools, such as text, images, video, and audio.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/what-is-a-data-lakehouse/" target="_blank" rel="noreferrer noopener">What is a Data Lakehouse?</a></p> <h3 class="wp-block-heading">It’s a bird! It’s a plane! It’s …</h3> <p>Today’s rapidly expanding data landscape is being served not only by data lakes and data warehouses but also by data hubs and analytics hubs (the functionality of these two platforms is generally nonexistent in data warehouses or lakes). What are all of these mechanisms? And how do they relate to each other?</p> <p>Let’s start with a data lake. A <strong>data lake</strong> is the upstream location where all of the organization’s data flows. Data lives there in its raw state – either unstructured or structured, in image files, PDFs, databases, and other formats. 
Data lakes can typically ingest and manage almost any type of data, and as exemplified by Hadoop (historically the most popular type of data lake) and, more recently, object stores like S3, ADLS, and Google Cloud Storage, they provide tools for enriching, querying, and analyzing the data they hold.</p> <p>Data lakes have historically been used to explore new ways of mining, combining, and analyzing data that was thrown out or not used as part of day-to-day business processes. In other words, they were applied either to operational data that is no longer in service or to data that may be considered in the future for operational use but is nonetheless currently in exploratory mode.</p> <p><strong>See also: </strong><a href="https://www.rtinsights.com/okay-your-data-is-in-the-cloud-now-what/" target="_blank" rel="noreferrer noopener">Okay, Your Data Is in The Cloud. Now What?</a></p> <p>A <a href="https://www.investopedia.com/terms/d/data-warehousing.asp" target="_blank" rel="noreferrer noopener"><strong>data warehouse</strong></a> tends to support long-standing datasets that represent fundamental, core data that runs the business: customer records, supply chain bills of materials, and so forth. Most of this data is highly structured but increasingly has semi-structured elements, incrementally built over time from multiple downstream data source silos. Changes to how the data is used can be time-consuming – not because of the data itself but because of the intricacies of how, where, and by whom it’s being used. New datasets – possibly after exploratory phases of work in the data lake – are made available for more regular and routine analytics in the data warehouse, provided it can accommodate the size and structure of that data.</p> <p>Data warehouses are increasingly incorporating data streams and advanced analytics on both historical batch and real-time data streams. 
In general, data warehouses also differ from data lakes in that they require some sort of data hub technology to prepare the data for ingestion.</p> <p>But how do hubs come into play? A <strong>data hub</strong> is a gateway through which virtual or physical data can be merged, transformed, and enriched for passage to another destination. That destination might be an application or a database or some other kind of repository (such as a data lake or data warehouse) either for use by applications as a part of their ongoing business/operational process or by an analytics platform as a feedback loop on the process – automated or human decision support, exception handling, etc. </p> <p><em>Read the rest of this article on <strong><a href="https://www.rtinsights.com/are-data-lakehouses-the-panacea-weve-been-waiting-for-or-is-there-something-better/" target="_blank" rel="noreferrer noopener">RTInsights</a></strong>.</em></p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/05/Lewis-Carr-headshot-150x150-1.jpg" width="100" height="100" alt="" itemprop="image"></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/lewis-carr/" class="vcard author" rel="author"><span class="fn">Lewis Carr</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><div class="author-info"> <div class="author-description"> <p>Lewis Carr is Senior Director of Product Marketing at <strong><a href="https://www.actian.com/">Actian</a></strong>. In his role, Lewis leads product management, marketing and solutions strategies and execution. Lewis has extensive experience in Cloud, Big Data Analytics, IoT, Mobility and Security, as well as a background in original content development and diverse team management. 
He is an individual contributor and manager in engineering, pre-sales, business development, and most areas of marketing targeted at Enterprise, Government, OEM, and embedded marketplaces.</p> </div> </div> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/are-data-lakehouses-the-panacea-or-is-there-something-better/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">1799</post-id> </item> <item> <title>What is a Data Lakehouse? </title> <link>https://www.clouddatainsights.com/what-is-a-data-lakehouse/</link> <comments>https://www.clouddatainsights.com/what-is-a-data-lakehouse/#respond</comments> <dc:creator><![CDATA[David Curry]]></dc:creator> <pubDate>Fri, 05 Aug 2022 02:24:22 +0000</pubDate> <category><![CDATA[Cloud Data Platforms]]></category> <category><![CDATA[Data Architecture]]></category> <category><![CDATA[data lake]]></category> <category><![CDATA[lakehouse]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=1500</guid> <description><![CDATA[ A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses.]]></description> <content:encoded><![CDATA[ <figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/08/lakehouse-Depositphotos_92364538_S-1.jpg" alt="" class="wp-image-1502" width="750" height="498" srcset="https://www.clouddatainsights.com/wp-content/uploads/2022/08/lakehouse-Depositphotos_92364538_S-1.jpg 1000w, https://www.clouddatainsights.com/wp-content/uploads/2022/08/lakehouse-Depositphotos_92364538_S-1-300x199.jpg 300w, https://www.clouddatainsights.com/wp-content/uploads/2022/08/lakehouse-Depositphotos_92364538_S-1-768x510.jpg 768w" sizes="(max-width: 750px) 100vw, 750px" /><figcaption> A <em>lakehouse</em> is a new, open architecture that combines the best 
elements of <em>data</em> lakes and <em>data</em> warehouses.</figcaption></figure> <p>A data lakehouse might be the next step in data storage and processing, combining the best of data warehouse and data lake architecture into a new system that is built for the next decade of technological development. </p> <p>When data lakes were first introduced by <a href="https://www.forbes.com/sites/danwoods/2015/01/26/james-dixon-imagines-a-data-lake-that-matters/?sh=4cd7fb8d4fdb" target="_blank" rel="noreferrer noopener">Pentaho CTO James Dixon</a>, experts in the field were split: some saw potential value in lakes as a fix for some of the issues with standard data warehouse solutions, while others saw simply a marketing term for a set of products built around the Hadoop system. </p> <p>Some also took issue with the potential for data silos, caused by a data lake’s ability to store and process all types of data, whether structured, semi-structured, or unstructured. That concern was warranted, with an entire industry springing up over the last decade to accommodate the huge influx of unstructured data. </p> <p>Data lakes have improved in value and sophistication over the past few years, which some consider a comeback for the architecture. Others perceive that data lakes have evolved into what Databricks and Snowflake, each claiming to have coined the term, call data lakehouses.
 </p> <p><strong>See also:</strong> <a href="https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/" target="_blank" rel="noreferrer noopener">How the Data ‘Lakehouse’ Might Usurp the Warehouse and the Lake</a></p> <p>“The lakehouse is a new data management architecture that radically simplifies enterprise data infrastructure and accelerates innovation in an age when machine learning is poised to disrupt every industry,” <a href="https://databricks.com/discover/pages/the-rise-of-the-lakehouse-paradigm" target="_blank" rel="noreferrer noopener">said Ali Ghodsi</a>, CEO of Databricks. “In the past most of the data that went into a company’s products or decision making was structured data from operational systems, whereas today, many products incorporate AI in the form of computer vision and speech models, text mining, and others. Why use a lakehouse instead of a data lake for AI? A lakehouse gives you data versioning, governance, security and ACID properties that are needed even for unstructured data.”</p> <p>Databricks’ overview of the topic illustrates how data lakehouse architecture embeds a metadata and governance layer during data processing. This means that data from a diverse set of sources can be processed and stored in a unitary system, which improves accessibility for everyone in an organization. </p> <p>Accessibility is important, as it is one of the key issues with previous-generation data storage and processing solutions. With a data lakehouse, different departments in an organization can get access to datasets without having to go through the engineering department, which can improve productivity and enable deeper analysis of the data. </p> <p>Another benefit of the data lakehouse is additional security, as organizations can limit access to documents without the worry of additional copies being made.
This level of control, down to the column or row level, is very difficult to achieve once data is offloaded to a data warehouse or stored in multiple areas.</p> <p>With an open unitary system, organizations can also connect third-party analytics, visualization, and other tools directly to the data source, which can enable businesses to see analysis and visualization as close to real-time as possible. </p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/05/curry-150x150-1.webp" width="100" height="100" alt="" itemprop="image"></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/david-curry/" class="vcard author" rel="author"><span class="fn">David Curry</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><div class="author-info"> <div class="author-description"> <p>David is a technology writer with several years’ experience covering all aspects of IoT, from technology to networks to security.</p> </div> </div> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/what-is-a-data-lakehouse/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">1500</post-id> </item> <item> <title>Sound Data Management Practices Can Pay Off Now and Later. Here’s How.
</title> <link>https://www.clouddatainsights.com/sound-data-management-practices-can-pay-off-now-and-later-heres-how/</link> <comments>https://www.clouddatainsights.com/sound-data-management-practices-can-pay-off-now-and-later-heres-how/#respond</comments> <dc:creator><![CDATA[Ken Seier]]></dc:creator> <pubDate>Tue, 19 Jul 2022 15:38:35 +0000</pubDate> <category><![CDATA[Governance]]></category> <category><![CDATA[data lake]]></category> <category><![CDATA[data management]]></category> <category><![CDATA[data warehouse]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=1451</guid> <description><![CDATA[Taking a proactive and strategic approach to data management can save time, money, and resources while unlocking even more powerful insights that lead to business outcomes. ]]></description> <content:encoded><![CDATA[<div class="wp-block-image"> <figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/07/data-management-Depositphotos_63896367_S.jpg" alt="" class="wp-image-1452" width="500" height="334"/><figcaption>Taking a proactive and strategic approach to data management can save time, money, and resources while unlocking even more powerful insights.</figcaption></figure></div> <p>Modern companies know that analytics needs to be at the heart of their decision-making and business strategy. That’s why many have invested in pathways to access vast amounts of data. But now, they are encountering a new challenge: <a href="https://www.datacenterdynamics.com/en/opinions/saving-money-smarter-storage-why-data-management-becoming-business-critical/" target="_blank" rel="noreferrer noopener">Over 70% of the data generated today</a> is no longer structured, easy to manage, find or analyze. 
This challenge raises the question: What can be done to make this data meaningful to the business?</p> <p>Unfortunately, a “one size fits all” approach to data architecture and management doesn’t exist. Below are the approaches that have historically been used, along with the lesser-known, emerging model of data mesh.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/enabling-innovation-with-the-right-cloud-data-architecture/" target="_blank" rel="noreferrer noopener">Enabling Innovation with the Right Cloud Data Architecture </a></p> <h3 class="wp-block-heading"><strong>Breaking Down the Role of Traditional Data Architecture</strong></h3> <p><strong>Data Lakes</strong>: A <a href="https://www.techtarget.com/searchdatamanagement/definition/data-lake#:~:text=A%20data%20lake%20is%20a,in%20files%20or%20object%20storage." target="_blank" rel="noreferrer noopener">data lake</a> is a giant, central storage repository that holds a vast amount of raw data in its native format until it is needed. Data lakes help data scientists and analysts who are tasked with determining whether raw data from an organization’s datasets can be turned into actionable insights. Through its flat architecture, the data lake provides more flexibility, storage, and usage at a lower cost, so that the data from the business systems can be replicated into a single repository.</p> <p>Data lakes can be beneficial for industries like oil and gas that accumulate large, complex data sets. On average, an oil company generates 1.5 terabytes of Internet of Things data daily.
By leveraging data lakes for exploration, this industry can optimize directional drilling, lower operating expenses, improve safety, and stay compliant with regulatory requirements.</p> <p><strong>Data Warehouses</strong>: Once data scientists or analysts find value in the various datasets within these data lakes, that refined data and intelligence can be brought into a <a href="https://www.gartner.com/en/information-technology/glossary/data-warehouse" target="_blank" rel="noreferrer noopener">data warehouse</a>. A data warehouse is a structured storage architecture used to hold cleansed and transformed data from various sources for historical reporting and large-scale decision support.</p> <p>Data warehouses tend to be quite large and central to business success. They require significant engineering and operational effort to build and maintain. They can be platformed in on-premises systems, cloud deployments, or data-warehouse-as-a-service offerings.</p> <p>The refined data and intelligence housed in a data warehouse are commonly aggregated and shaped to be more “business-friendly” to inform better reuse and decision-making for enterprises. This approach can be helpful for organizations that need to make repeatable business decisions and drive operational efficiencies.</p> <p>For instance, Walmart used its data warehouses to test inventory management methods in its U.S. and Canadian stores. As a result, Walmart could make more informed decisions to <a href="https://corporate.walmart.com/newsroom/2010/02/23/walmart-canada-to-open-35-to-40-supercentres-in-2010#:~:text=Walmart%20Canada%20to%20Open%2035%20to%2040%20Supercentres%20in%202010" target="_blank" rel="noreferrer noopener">open new locations in Canada</a> and <a href="https://www.usatoday.com/story/money/business/2016/01/15/list-of-walmart-stores-closing/78852898/" target="_blank" rel="noreferrer noopener">close certain U.S. 
stores</a> in an effort to accommodate its customers’ needs.</p> <p><strong>Operational Data Store (ODS):</strong> Because the data warehouse is massive and has many moving parts, it can be difficult to update that data frequently to support fast-moving decisions.</p> <p>An <a href="https://www.techtarget.com/searchoracle/definition/operational-data-store" target="_blank" rel="noreferrer noopener">ODS</a> integrates and transforms the minimal cross-system data required to provide real-time decision support. This separates the high-compute transformations for fast intelligence from the large, regular needs of the data warehouse. Data, decisions, and alerts from the ODS are often moved into the data warehouse or data lake for archival use. </p> <p>Any organization managing minute-to-minute decisioning — patient care, manufacturing lines, or energy management are great examples — could benefit from an ODS approach.</p> <p><strong>See also: </strong><a href="https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/" target="_blank" rel="noreferrer noopener">How the Data ‘Lakehouse’ Might Usurp the Warehouse and the Lake</a></p> <h3 class="wp-block-heading"><strong>How Data Mesh Fits</strong></h3> <p>Alongside these three classic modes of data storage, <a href="https://www.datanami.com/2022/01/21/data-meshes-set-to-spread-in-2022/" target="_blank" rel="noreferrer noopener">the data mesh concept</a> is a growing architectural design principle that allows data scientists and analysts to examine data anywhere in a system: across the data lake, data warehouse, and ODS, as well as source systems. 
This creates a “virtual data hub” that enables robust, enterprise-wide exploration without the cost and overhead of replicating data of unknown value into the data lake.</p> <p>Once value is discovered, engineering effort is used to pipe the data and intelligence into the data lake, data warehouse, and/or operational data store for consumption.</p> <p>Numerous industries can benefit from data mesh. Many financial services companies are grappling with how to modernize outdated technology. The systems they typically use have been in place for 50+ years, and any attempt at updating slows down system processes and incurs risk.</p> <p>Healthcare is another industry that greatly benefits from the data mesh concept. It allows providers to navigate security positions tailored to HIPAA while also using patient data to improve the care experience and overall outcomes.</p> <h3 class="wp-block-heading"><strong>Key Considerations when Migrating to Data Management Models</strong></h3> <p>While the value proposition for any of these data management models is compelling, it’s important to recognize that organizations can face several challenges with migration.</p> <p>There’s a sizable investment for these infrastructures (think large on-prem appliances like Netezza). 
Additionally, the amount of labor and skill required to maintain a system is very different compared to what it takes to build it, so organizations must adapt accordingly.</p> <p>In order to lay the foundation for well-governed data management, companies must:</p> <ul class="wp-block-list"><li><strong>Understand the long-term strategy</strong>: If a company knows what long-term analytics success looks like, it’s possible to bring data into the decision-making process and provide data scientists and analysts with the tools they need to be successful.</li></ul> <ul class="wp-block-list"><li><strong>Be nimble and flexible</strong>: Ideally, organizations should put flexibility front and center when migrating to a data management solution. This way, the data architect’s designs will meet the organization’s needs today and grow with it based on future needs.</li></ul> <ul class="wp-block-list"><li><strong>Invest in the right people and skillsets</strong>: Because these data architectures are relatively new, there is often a knowledge gap. While the data warehouse has been around for 50+ years, newer systems like data lake or data mesh are often less understood. Business and technology leaders must be in lockstep when it comes to making critical investments and having the right expertise in place to drive these solutions. In today’s talent-constrained environment, often bringing in a third-party solution provider can make sense for both implementation and day-to-day management.</li></ul> <p>There’s a business case to be made for investing in and implementing a well-governed, measured approach to data management<a>. 
Taking a proactive and strategic approach to data management </a>can save time, money, and resources while unlocking even more powerful insights that lead to business outcomes.</p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img alt='Ken Seier' src='https://secure.gravatar.com/avatar/de6a16a9d49a5f42d9db3b4f500388bb?s=100&d=mm&r=g' srcset='https://secure.gravatar.com/avatar/de6a16a9d49a5f42d9db3b4f500388bb?s=200&d=mm&r=g 2x' class='avatar avatar-100 photo' height='100' width='100' itemprop="image"/></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/ken-seier/" class="vcard author" rel="author"><span class="fn">Ken Seier</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><p><em>Ken Seier is chief architect of data and artificial intelligence at </em><a href="https://www.insight.com/"><em>Insight Enterprises</em></a><em>, a Fortune 500 solutions integrator helping organizations accelerate their digital journey to modernize their business and maximize the value of technology. 
Ken and his team have been responsible for billions of dollars of revenue and savings through responsible analytics initiatives and innovation.</em></p> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/sound-data-management-practices-can-pay-off-now-and-later-heres-how/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">1451</post-id> </item> <item> <title>How the Data ‘Lakehouse’ Might Usurp the Warehouse and the Lake</title> <link>https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/</link> <comments>https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/#respond</comments> <dc:creator><![CDATA[Joel Hans]]></dc:creator> <pubDate>Fri, 08 Jul 2022 13:59:10 +0000</pubDate> <category><![CDATA[Cloud Data Platforms]]></category> <category><![CDATA[Data Architecture]]></category> <category><![CDATA[data lake]]></category> <category><![CDATA[data warehouse]]></category> <category><![CDATA[lakehouse]]></category> <guid isPermaLink="false">https://www.clouddatainsights.com/?p=1415</guid> <description><![CDATA[Lakehouses combine the benefits of warehouses and lakes so organizations can use their massive quantities of unstructured data with the speed and reliability of a warehouse.]]></description> <content:encoded><![CDATA[<div class="wp-block-image"> <figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/07/lake-house-Depositphotos_1343399_S.jpg" alt="" class="wp-image-1416" width="750" height="497" srcset="https://www.clouddatainsights.com/wp-content/uploads/2022/07/lake-house-Depositphotos_1343399_S.jpg 1000w, https://www.clouddatainsights.com/wp-content/uploads/2022/07/lake-house-Depositphotos_1343399_S-300x199.jpg 300w, 
https://www.clouddatainsights.com/wp-content/uploads/2022/07/lake-house-Depositphotos_1343399_S-768x509.jpg 768w" sizes="(max-width: 750px) 100vw, 750px" /><figcaption>Lakehouses combine the benefits of warehouses and lakes so organizations can use their massive quantities of unstructured data with the speed and reliability of a warehouse.</figcaption></figure></div> <p>Let’s assume that you’re well-off enough to have an entire lake in your possession. What do you do next? Build a lakehouse, of course.</p> <p>You know, an open architecture for managing your organization’s data that combines the scale of data lakes and the ACID-friendly queries of data warehouses on a single, flexible, and cost-effective platform.</p> <p>We’re talking about a platform to handle the vast quantities of an organization’s data here, not your second (or third) house where you store your pontoon boat and only visit two weekends every year.</p> <p><strong>See also:</strong> <a href="https://www.clouddatainsights.com/governance-in-the-age-of-cloud-databases/" target="_blank" rel="noreferrer noopener">Governance in the Age of Cloud Databases</a></p> <p>The data lakehouse is a growing market segment, with companies like Dremio, Databricks, and Onehouse already elbowing for the best cloud implementation of open frameworks like <a href="https://hudi.apache.org/" target="_blank" rel="noreferrer noopener">Apache Hudi</a>, <a href="https://iceberg.apache.org/" target="_blank" rel="noreferrer noopener">Apache Iceberg</a>, and <a href="https://delta.io/" target="_blank" rel="noreferrer noopener">Delta Lake</a>. But before jumping straight into the supposed benefits of the lakehouse, let’s talk about how the industry got here, to a new product category, just as it seemed like data lakes were catching on.</p> <p>Years ago, the data warehouse was the standard for business intelligence and analytics. 
Organizations stored their structured data in an ACID-compliant environment, which refers to the atomicity, consistency, isolation, and durability of the warehouse’s data. For all the benefits they created in terms of data quality and driving business analytics, they were costly, and their inflexibility tended to create silos.</p> <p>The data lake was developed as an answer to these problems. As a central, “flat” repository of all raw structured <em>and</em> unstructured data in object form, the data lake was designed to make data more accessible to more employees without the risk of siloing. Data lakes tend to run cheaper than warehouses since most public clouds support the object storage model.</p> <p>But many organizations, especially those at the leading edge of data storage and analysis, started to notice problems with data warehouses and lakes, even after trying to solve their individual cons by combining them into a single management and analysis infrastructure.</p> <p>Back in 2014, Uber was struggling with their data warehouse, <a href="https://techcrunch.com/2022/02/02/with-8m-seed-onehouse-builds-open-source-data-lake-house-eyes-managed-service/" target="_blank" rel="noreferrer noopener">according to Vinoth Chandar</a>, who managed the company’s data team at the time. They realized that different business units had different “versions” of the company’s data. Some analyses included the most recent updates, while others didn’t, which meant their people made critical decisions based on false or outdated assumptions.</p> <p>Uber’s engineers started building a custom Hadoop infrastructure around their warehouse, effectively combining their data warehouse with a data lake, to help different teams run analytics and make decisions based on the data they were paying handsomely to collect and store. 
Internally, they called this project “Hoodie.”</p> <p>In parallel with Uber, developers from Netflix, Apple, and Salesforce started working on a different open-source framework for democratizing the enormous volume of data they were all collecting about their customers. With both warehouses and lakes, these companies often needed to copy data to other systems to help their employees run analytics in comfortable, ACID-compliant environments where they didn’t have to worry about affecting durability. They were being overrun with complexity.</p> <p>They started building what’s now called <a href="https://iceberg.apache.org/">Iceberg</a>, an open-source format for big data analytics that lets multiple engines work on the same tables, at the same time, with the “reliability and simplicity of SQL tables.”</p> <p>Developers behind both projects eventually released them into open source, following a trend long established among Silicon Valley tech giants. Back in 2011, Yahoo spun Hadoop out into its own company, and in 2014, LinkedIn did the same with Kafka. Both Hoodie (now called Hudi) and Iceberg are part of the <a href="https://www.apache.org/" target="_blank" rel="noreferrer noopener">Apache Software Foundation</a>, where they’re maintained and built by a global network of volunteer contributors.</p> <p>Hudi is now supported on AWS, Google Cloud, and Microsoft Azure and is used by companies like Disney, Twitter, Walmart, and more.</p> <p>Hudi and Iceberg are also now the foundation of the data lakehouse industry. When deployed into production against new or existing data sets, these tools let organizations store all their structured and unstructured data on low-cost storage, just like data lakes do.
They also add the data structure and management features of warehouses, like ACID-compliant transactions and simpler query development.</p> <p>By combining the benefits of warehouses and lakes, the lakehouse lets organizations utilize their massive quantities of unstructured data with the speed and reliability of a warehouse. That’s a new foundation for data democratization: an organization’s entire workforce, from developers to marketers to salespeople, running business and machine learning (ML) analytics on large quantities of data stored in a single, stable place.</p> <p>The lakehouse’s pitch is compelling, which is why the market is heating up fast. Back in February, Onehouse <a href="https://techcrunch.com/2022/02/02/with-8m-seed-onehouse-builds-open-source-data-lake-house-eyes-managed-service/" target="_blank" rel="noreferrer noopener">netted an $8 million seed round</a> to build an open-source data lakehouse based on Hudi with a managed service in the offing. Earlier this year, <a href="https://techcrunch.com/2022/01/25/dremio-raises-160m-series-e-for-its-data-lake-platform/" target="_blank" rel="noreferrer noopener">Dremio raised $160 million in its Series E</a> to extend its product, partially based on Iceberg. The company recently made the <a href="https://venturebeat.com/2022/03/02/dremio-launches-free-data-lakehouse-service-for-enterprises/" target="_blank" rel="noreferrer noopener">free edition of its cloud service</a> generally available for enterprises. <a href="https://databricks.com/product/data-lakehouse" target="_blank" rel="noreferrer noopener">Databricks</a>, which also maintains its own open-source <a href="https://delta.io/" target="_blank" rel="noreferrer noopener">Delta Lake architecture</a>, claims more than 450 partners and multicloud support.</p> <p>But, like all lakehouses, there’s the hype cycle and price tag to account for, which likely locks out small- or mid-sized companies for the time being.
In the meantime, they’ll have to settle for a swim in the lake.</p> <div class="saboxplugin-wrap" itemtype="http://schema.org/Person" itemscope itemprop="author"><div class="saboxplugin-tab"><div class="saboxplugin-gravatar"><img loading="lazy" decoding="async" src="https://www.clouddatainsights.com/wp-content/uploads/2022/06/joel_hans.jpg" width="100" height="100" alt="" itemprop="image"></div><div class="saboxplugin-authorname"><a href="https://www.clouddatainsights.com/author/joel-hans/" class="vcard author" rel="author"><span class="fn">Joel Hans</span></a></div><div class="saboxplugin-desc"><div itemprop="description"><p>Joel Hans is a copywriter and technical content creator for open source, B2B, and SaaS companies at <a href="https://commitcopy.com/">Commit Copy</a>, bringing experience in infrastructure monitoring, time-series databases, blockchain, streaming analytics, and more. Find him on Twitter <a href="https://twitter.com/joelhans">@joelhans</a>.</p> </div></div><div class="clearfix"></div></div></div>]]></content:encoded> <wfw:commentRss>https://www.clouddatainsights.com/how-the-data-lakehouse-might-usurp-the-warehouse-and-the-lake/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <post-id xmlns="com-wordpress:feed-additions:1">1415</post-id> </item> </channel> </rss>