Given the dominance of Amazon Web Services in the cloud marketplace, many look to its annual re:Invent conference to get a glimpse of what’s to come. The company did not disappoint. In his keynote address this week, AWS CEO Adam Selipsky touted a zero-ETL future, introducing a new integration between the Amazon Redshift data warehouse service and the Amazon Aurora relational database service.
The details: The company announced two new integrations that make it easier for businesses to connect and analyze data across data stores without having to move data between services. Specifically, the new capabilities enable businesses to analyze Amazon Aurora data with Amazon Redshift in near real time, eliminating the need to extract, transform, and load (ETL) data between services. Businesses can also now run Apache Spark applications on Amazon Redshift data using AWS analytics and machine learning (ML) services (e.g., Amazon EMR, AWS Glue, and Amazon SageMaker).
“The new capabilities help us move customers toward a zero-ETL future on AWS, reducing the need to manually move or transform data between services,” said Swami Sivasubramanian, vice president of Databases, Analytics, and Machine Learning at AWS. “By eliminating ETL and other data movement tasks for our customers, we are freeing them to focus on analyzing data and driving new insights for their business.”
When making the announcement, the company noted that many organizations are seeking to get the maximum value out of their vast data resources. To help, AWS provides a range of purpose-built tools like Amazon Aurora to store transactional data in MySQL- and PostgreSQL-compatible relational databases, and Amazon Redshift to run high-performance data warehousing and analytics workloads on petabytes of data.
But to truly maximize the value of data, businesses need these tools to work together seamlessly. In general, they don’t. That is why AWS has invested in zero-ETL capabilities like Amazon Aurora ML and Amazon Redshift ML, which let customers take advantage of Amazon SageMaker for ML-powered use cases without moving data between services.
Additionally, AWS is offering seamless data ingestion from AWS streaming services (e.g., Amazon Kinesis and Amazon MSK) into a wide range of AWS data stores, such as Amazon Simple Storage Service (Amazon S3) and Amazon OpenSearch Service, so businesses can analyze data as soon as it is available.
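To make that ingestion path concrete, here is a minimal sketch (in Python with boto3) of pushing an event into an existing Amazon Kinesis Data Firehose delivery stream that has already been configured to deliver into Amazon S3 or Amazon OpenSearch Service. The stream name and record contents are placeholders, not anything AWS announced.

```python
import json
import boto3

# Hypothetical example: write one event into an existing Kinesis Data Firehose
# delivery stream. The stream itself (created separately) handles delivery into
# Amazon S3 or Amazon OpenSearch Service, where the data can be analyzed.
firehose = boto3.client("firehose")

event = {"order_id": "12345", "amount": 99.95, "status": "confirmed"}

firehose.put_record(
    DeliveryStreamName="orders-ingest-stream",  # placeholder stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```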
The new announcements build on the existing integrations across AWS’s database and analytics portfolio, making it faster, easier, and more cost-effective for businesses to access and analyze data across data stores on AWS.
See also: Amazon Web Services AI and ML Offerings: An Overview
Zero-ETL: Running petabyte-scale analytics on transactional data in near real time
One aspect that the new capabilities address is helping businesses get insights in near real time. Specifically, AWS noted that the requirement for near real-time insights on transactional data (e.g., purchases, reservations, and financial trades) is growing. Why? Businesses want to better understand core business drivers and develop strategies to increase sales, reduce costs, and gain a competitive advantage.
To accomplish this, many businesses use a three-part solution to analyze their transactional data—a relational database to store data, a data warehouse to perform analytics, and a data pipeline to ETL data between the relational database and the data warehouse. Anyone who has done this type of work knows that the data pipelines can be costly to build and challenging to manage, requiring developers to write custom code and constantly manage the infrastructure to ensure it scales to meet demand.
AWS noted that some companies maintain entire teams just to facilitate this process. Additionally, it can take days before data is ready for analysis, and intermittent data transfer errors can delay access to time-sensitive insights even further, leading to missed business opportunities.
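For illustration only, the kind of hand-rolled pipeline step the zero-ETL approach aims to eliminate might look something like the following Python sketch: extract recent rows from an Aurora MySQL table, stage them as CSV in Amazon S3, and issue a Redshift COPY. Every host, bucket, role, and table name here is a placeholder, and real pipelines add scheduling, retries, and schema handling on top of this.

```python
import csv
import io
import boto3
import pymysql  # assumed client library for the Aurora MySQL source

# Illustrative sketch of a hand-rolled ETL step: extract, stage in S3, load
# into Redshift. All connection details and names are placeholders.
s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

# 1. Extract recent rows from the transactional database.
conn = pymysql.connect(host="aurora-cluster.example.internal",
                       user="etl_user", password="example-password",
                       database="sales")
with conn.cursor() as cur:
    cur.execute("SELECT order_id, amount, status FROM orders "
                "WHERE updated_at > NOW() - INTERVAL 1 HOUR")
    rows = cur.fetchall()

# 2. Stage the extract as CSV in S3.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
s3.put_object(Bucket="example-etl-staging", Key="orders/latest.csv",
              Body=buf.getvalue().encode("utf-8"))

# 3. Load into Redshift with COPY (asynchronous; poll describe_statement for status).
redshift_data.execute_statement(
    WorkgroupName="example-workgroup",
    Database="analytics",
    Sql="COPY orders FROM 's3://example-etl-staging/orders/latest.csv' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy' CSV;",
)
```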
With Amazon Aurora zero-ETL integration with Amazon Redshift, transactional data is automatically and continuously replicated seconds after it is written into Amazon Aurora and seamlessly made available in Amazon Redshift. Once data is available in Amazon Redshift, businesses can start analyzing it immediately and apply advanced features like data sharing and Amazon Redshift ML to get holistic and predictive insights.
Additionally, businesses can replicate data from multiple Amazon Aurora database clusters into a single Amazon Redshift instance to derive insights across several applications. As such, businesses can use Amazon Aurora to support their transactional database needs and Amazon Redshift to power their analysis without building or maintaining complex data pipelines.
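Once the integration is in place, the replicated tables can be queried like any other Redshift data, for example via the Redshift Data API. The sketch below assumes a hypothetical Redshift Serverless workgroup and database; the workgroup, database, and table names are placeholders, and the replicated schema simply mirrors whatever exists in Aurora.

```python
import time
import boto3

# A minimal sketch: run an aggregate query over Aurora data that the zero-ETL
# integration has replicated into Redshift. All names are placeholders.
client = boto3.client("redshift-data")

resp = client.execute_statement(
    WorkgroupName="example-workgroup",
    Database="analytics",
    Sql="SELECT status, COUNT(*) AS orders, SUM(amount) AS revenue "
        "FROM sales.orders GROUP BY status;",
)

# The Data API is asynchronous: poll until the statement finishes, then fetch rows.
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED" and desc.get("HasResultSet"):
    for record in client.get_statement_result(Id=resp["Id"])["Records"]:
        print(record)
```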
New integration with Apache Spark
Another area of focus with the new capabilities announced at the conference is to help developers more easily use Apache Spark to support a broad range of analytics and ML applications.
To that end, the company noted that businesses often want to analyze Amazon Redshift data directly from a variety of services. This requires them to go through the complex, time-consuming process of finding, testing, and certifying a third-party connector to help read and write the data between their environment and Amazon Redshift. Even after they have found a connector, businesses must manage intermediate data-staging locations, such as Amazon S3, to read and write data from and to Amazon Redshift. All of these challenges increase operational complexity and make it difficult for businesses to use Apache Spark to its full extent.
Now, AWS supports Apache Spark on Amazon EMR, AWS Glue, and Amazon SageMaker with a fully compatible, AWS-optimized runtime. Why is this important? Amazon Redshift integration for Apache Spark makes it easier for developers to build and run Apache Spark applications on data in Amazon Redshift using AWS-supported analytics and ML services.
Amazon Redshift integration for Apache Spark eliminates the cumbersome and error-prone process associated with third-party connectors. Developers can begin running queries on Amazon Redshift data from Apache Spark-based applications within seconds using popular language frameworks (e.g., Java, Python, R, and Scala). Intermediate data-staging locations are managed automatically, eliminating the need for customers to configure and manage these in application code.
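The following PySpark sketch illustrates the pattern, assuming a runtime such as Amazon EMR or AWS Glue that ships the Redshift integration. The connector format name, JDBC URL, IAM role, and table are placeholders, and the exact options and packaging vary by runtime and version.

```python
from pyspark.sql import SparkSession

# Illustrative sketch of reading Amazon Redshift data from a Spark application
# on a runtime that includes the Amazon Redshift integration for Apache Spark.
spark = SparkSession.builder.appName("redshift-spark-example").getOrCreate()

orders = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://example-cluster.abc123.us-east-1"
                   ".redshift.amazonaws.com:5439/analytics")
    .option("dbtable", "sales.orders")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/example-redshift-spark")
    # Depending on the runtime, the intermediate staging location may be managed
    # for you; otherwise it can be set explicitly, e.g.:
    # .option("tempdir", "s3://example-spark-staging/redshift/")
    .load()
)

# Standard Spark transformations and actions work on the resulting DataFrame.
orders.groupBy("status").count().show()
```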
Salvatore Salamone is a physicist by training who has been writing about science and information technology for more than 30 years. During that time, he has been a senior or executive editor at many industry-leading publications including High Technology, Network World, Byte Magazine, Data Communications, LAN Times, InternetWeek, Bio-IT World, and Lightwave, The Journal of Fiber Optics. He also is the author of three business technology books.