AWS Big Data Blog: Official Big Data Blog of Amazon Web Services
- Building serverless event streaming applications with Amazon MSK and AWS Lambda by Tarun Rai Madan on June 26, 2025 at 5:48 pm
In this post, we describe how you can simplify your event-driven application architecture using AWS Lambda with Amazon MSK. We demonstrate how to configure Lambda as a consumer for Kafka topics, including a cross-account setup and how to optimize price and performance for these applications.
- Enhance data ingestion performance in Amazon Redshift with concurrent inserts by Raghu Kuppala on June 26, 2025 at 5:39 pm
Amazon Redshift employs columnar storage for database tables, reducing overall disk I/O requirements. This storage method significantly improves analytic query performance by minimizing data read during queries. This post showcases the key improvements in Amazon Redshift concurrent data ingestion operations.
- Introducing AWS Glue Data Catalog usage metrics for API usage by David Zhang on June 26, 2025 at 5:23 pm
We’re excited to announce AWS Glue Data Catalog usage metrics, a new feature that provides native integration with Amazon CloudWatch. In this post, we provide a step-by-step walkthrough that shows how to access these metrics and set up meaningful alarms.
- Amazon OpenSearch Service 101: Create your first search application with OpenSearch by Sriharsha Subramanya Begolli on June 25, 2025 at 5:03 pm
In this post, we walk you through building a search application using Amazon OpenSearch Service. Whether you’re a developer new to search or looking to understand OpenSearch fundamentals, this hands-on post shows you how to build a search application from scratch: starting with the initial setup, diving into core components such as indexing, querying, and result presentation, and culminating in the execution of your first search query.
- Implement secure hybrid and multicloud log ingestion with Amazon OpenSearch Ingestion by Xiaoxue Xu on June 25, 2025 at 4:44 pm
In this post, we demonstrate how to configure Fluent Bit, a fast and flexible log processor and router supported by various operating systems, to securely send logs from any environment to OpenSearch Ingestion using IAM Roles Anywhere.
- Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker by Jose Romero on June 24, 2025 at 8:15 pm
This post walks you through how to use the OpenLineage-compatible API of SageMaker or Amazon DataZone to push data lineage events programmatically from tools supporting the OpenLineage standard like dbt, Apache Airflow, and Apache Spark.
- How Skroutz handles real-time schema evolution in Amazon Redshift with Debezium by Konstantina Mavrodimitraki on June 23, 2025 at 6:34 pm
Skroutz chose Amazon Redshift to promote data democratization, empowering teams across the organization with seamless access to data, enabling faster insights and more informed decision-making. In this post, we share how we handled real-time schema evolution in Amazon Redshift with Debezium.
- Stream data from Amazon MSK to Apache Iceberg tables in Amazon S3 and Amazon S3 Tables using Amazon Data Firehose by Pratik Patel on June 20, 2025 at 9:20 pm
In this post, we walk through two solutions that demonstrate how to stream data from your Amazon MSK provisioned cluster to Iceberg-based data lakes in Amazon S3 using Amazon Data Firehose.
- Secure access to a cross-account Amazon MSK cluster from Amazon MSK Connect using IAM authentication by Venkata Sai Mahesh Swargam on June 19, 2025 at 8:26 pm
In this post, we demonstrate a use case where you might need to use an MSK cluster in one AWS account, but MSK Connect is located in a separate account. We demonstrate how to implement IAM authentication after establishing network connectivity. IAM provides enhanced security measures, making sure your systems are protected against unauthorized access.
- Build a multi-Region analytics solution with Amazon Redshift, Amazon S3, and Amazon QuickSight by Donatas Kuchalskis on June 19, 2025 at 8:20 pm
This post explores how to effectively architect a solution that addresses this specific challenge: enabling comprehensive analytics capabilities for global teams while making sure that your data remains in the AWS Regions required by your compliance framework. We use a variety of AWS services, including Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon QuickSight.
- RocksDB 101: Optimizing stateful streaming in Apache Spark with Amazon EMR and AWS Glue by Melody Yang on June 18, 2025 at 7:28 pm
This post explores RocksDB’s key features and demonstrates its implementation using Spark on Amazon EMR and AWS Glue, providing you with the knowledge you need to scale your real-time data processing capabilities.
- Reduce time to access your transactional data for analytical processing using the power of Amazon SageMaker Lakehouse and zero-ETL by Avijit Goswami on June 16, 2025 at 7:25 pm
In this post, we demonstrate how you can bring transactional data from AWS OLTP data stores like Amazon Relational Database Service (Amazon RDS) and Amazon Aurora into Amazon Redshift using zero-ETL integrations to SageMaker Lakehouse Federated Catalog (Bring your own Amazon Redshift into SageMaker Lakehouse). With this integration, you can seamlessly onboard changed data from OLTP systems to a unified lakehouse and expose it to analytical applications for consumption using Apache Iceberg APIs from the new SageMaker Unified Studio.
- Enhance security and performance with TLS 1.3 and Perfect Forward Secrecy on Amazon OpenSearch Service by Shubham Kumar on June 12, 2025 at 2:56 pm
Amazon OpenSearch Service recently introduced a new Transport Layer Security (TLS) policy Policy-Min-TLS-1-2-PFS-2023-10, which supports the latest TLS 1.3 protocol and TLS 1.2 with Perfect Forward Secrecy (PFS) cipher suites. This new policy improves security and enhances OpenSearch performance. In this post, we discuss the benefits of this new policy and how to enable it using the AWS Command Line Interface (AWS CLI).
- How Nexthink built real-time alerts with Amazon Managed Service for Apache Flink by Nikos Tragaras, Raphaël Afanyan on June 12, 2025 at 12:14 pm
In this post, we describe Nexthink’s journey as they implemented a new real-time alerting system using Amazon Managed Service for Apache Flink. We explore the architecture, the rationale behind key technology choices, and the Amazon Web Services (AWS) services that enabled a scalable and efficient solution.
- Designing centralized and distributed network connectivity patterns for Amazon OpenSearch Serverless by Ankush Goyal on June 10, 2025 at 3:43 pm
As organizations scale their use of OpenSearch Serverless, understanding network architecture and DNS management becomes increasingly important. This post covers advanced deployment scenarios focused on centralized and distributed access patterns—specifically, how enterprises can simplify network connectivity across multiple AWS accounts and extend access to on-premises environments for their OpenSearch Serverless deployments.
- Simplify real-time analytics with zero-ETL from Amazon DynamoDB to Amazon SageMaker Lakehouse by Narayani Ambashta on June 6, 2025 at 4:46 pm
At AWS re:Invent 2024, we introduced a no-code zero-ETL integration between Amazon DynamoDB and Amazon SageMaker Lakehouse, simplifying how organizations handle data analytics and AI workflows. In this post, we share how to set up this zero-ETL integration from DynamoDB to your SageMaker Lakehouse environment.
- Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0 by Aarthi Srinivasan on June 5, 2025 at 4:45 pm
In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the SQL dialect to the view for Athena, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.
- Embracing event driven architecture to enhance resilience of data solutions built on Amazon SageMaker by Dhrubajyoti Mukherjee on June 5, 2025 at 4:43 pm
This post provides guidance on how you can use event-driven architecture to enhance the resiliency of data solutions built on the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI. SageMaker is a managed service with high availability and durability.
- Introducing managed query results for Amazon Athena by Guy Bachar on June 3, 2025 at 8:40 pm
We’re thrilled to introduce managed query results, a new Athena feature that automatically stores, secures, and manages the lifecycle of query result data for you at no additional cost. In this post, we demonstrate how to get started with managed query results and, by removing the undifferentiated effort spent on query result management, how Athena helps you get insights from your data in fewer steps than before.
- Centralize Apache Spark observability on Amazon EMR on EKS with external Spark History Server by Sri Potluri on June 3, 2025 at 4:20 pm
This post demonstrates how to centralize Apache Spark observability using Spark History Server (SHS) on EMR on EKS. We showcase how to enhance SHS with performance monitoring tools, with a pattern applicable to many monitoring solutions such as SparkMeasure and DataFlint.