AWS Big Data Blog

Official Big Data Blog of Amazon Web Services

  • Amazon Redshift announces history mode for zero-ETL integrations to simplify historical data tracking and analysis
    by Raks Khare on February 18, 2025 at 9:13 pm

    This post explores a brief history of zero-ETL, its importance for customers, and introduces an exciting new feature: history mode for Amazon Aurora PostgreSQL-Compatible Edition, Amazon Aurora MySQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon DynamoDB zero-ETL integrations with Amazon Redshift.

  • Streamline AWS WAF log analysis with Apache Iceberg and Amazon Data Firehose
    by Charishma Makineni on February 18, 2025 at 9:12 pm

    In this post, we demonstrate how to build a scalable AWS WAF log analysis solution using Firehose and Apache Iceberg. Firehose simplifies the entire process—from log ingestion to storage—by allowing you to configure a delivery stream that delivers AWS WAF logs directly to Apache Iceberg tables in Amazon S3. The solution requires no infrastructure setup, and you pay only for the data you process.
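
The delivery stream described above can be sketched with boto3. The ARNs, names, and nested parameter shapes below are illustrative assumptions; verify them against the `create_delivery_stream` reference for your SDK version before use.

```python
# Sketch: parameters for an Amazon Data Firehose stream that delivers
# records to an Apache Iceberg table in Amazon S3 via the AWS Glue catalog.
# All ARNs and names are placeholders.

def iceberg_stream_params(stream_name, role_arn, catalog_arn, bucket_arn,
                          database, table):
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",
        "IcebergDestinationConfiguration": {
            "RoleARN": role_arn,
            "CatalogConfiguration": {"CatalogARN": catalog_arn},
            "DestinationTableConfigurationList": [
                {"DestinationDatabaseName": database,
                 "DestinationTableName": table}
            ],
            # Firehose still needs an S3 location for error output
            "S3Configuration": {"RoleARN": role_arn, "BucketARN": bucket_arn},
        },
    }

params = iceberg_stream_params(
    "waf-logs-to-iceberg",
    "arn:aws:iam::123456789012:role/firehose-iceberg-role",
    "arn:aws:glue:us-east-1:123456789012:catalog",
    "arn:aws:s3:::my-waf-log-bucket",
    "waf_analytics", "waf_logs",
)
# import boto3
# boto3.client("firehose").create_delivery_stream(**params)
print(params["DeliveryStreamName"])
```

With AWS WAF logging pointed at this stream, records land in the Iceberg table with no intermediate infrastructure to manage.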

  • Migrate from Standard brokers to Express brokers in Amazon MSK using Amazon MSK Replicator
    by Subham Rakshit on February 13, 2025 at 10:09 pm

    Creating a new cluster with Express brokers is straightforward, as described in Amazon MSK Express brokers. However, Express brokers offer a different user experience and a different shared responsibility boundary, so you can't enable them on an existing cluster; if you have an existing MSK cluster, you need to migrate to a new Express-based cluster. In this post, we discuss how to plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. You can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster comprising Express brokers.
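
The replication step can be sketched with boto3. Cluster ARNs, subnets, and the exact parameter shapes here are placeholder assumptions modeled on the `kafka.create_replicator` API; check your SDK's reference before relying on them.

```python
# Sketch: parameters for an MSK Replicator that copies data and metadata
# from an existing Standard-broker cluster to a new Express-broker cluster.
# All ARNs and IDs are placeholders.

def replicator_params(source_arn, target_arn, role_arn, subnets, security_groups):
    vpc = {"SubnetIds": subnets, "SecurityGroupIds": security_groups}
    return {
        "ReplicatorName": "standard-to-express",
        "ServiceExecutionRoleArn": role_arn,
        "KafkaClusters": [
            {"AmazonMskCluster": {"MskClusterArn": source_arn}, "VpcConfig": vpc},
            {"AmazonMskCluster": {"MskClusterArn": target_arn}, "VpcConfig": vpc},
        ],
        "ReplicationInfoList": [{
            "SourceKafkaClusterArn": source_arn,
            "TargetKafkaClusterArn": target_arn,
            "TargetCompressionType": "NONE",
            "TopicReplication": {"TopicsToReplicate": [".*"]},
            "ConsumerGroupReplication": {"ConsumerGroupsToReplicate": [".*"]},
        }],
    }

params = replicator_params(
    "arn:aws:kafka:us-east-1:123456789012:cluster/standard/abc",
    "arn:aws:kafka:us-east-1:123456789012:cluster/express/def",
    "arn:aws:iam::123456789012:role/msk-replicator-role",
    ["subnet-1a", "subnet-1b", "subnet-1c"],
    ["sg-0123"],
)
# import boto3
# boto3.client("kafka").create_replicator(**params)
print(params["ReplicatorName"])
```

Once replication catches up, producers and consumers can be cut over to the Express cluster.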

  • Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI
    by Lakshmi Nair on February 13, 2025 at 10:07 pm

    In this post, we discuss the foundational building blocks of SageMaker Unified Studio and how, by abstracting complex technical implementations behind user-friendly interfaces, organizations can maintain standardized governance while enabling efficient resource management across business units. This approach ensures consistency in infrastructure deployment while providing the flexibility needed for diverse business requirements.

  • Amazon Redshift Serverless adds higher base capacity of up to 1024 RPUs
    by Ricardo Serafim on February 10, 2025 at 6:40 pm

    In this post, we explore the new higher base capacity of 1024 RPUs in Redshift Serverless, which doubles the previous maximum of 512 RPUs. This enhancement helps you get high performance from workloads with highly complex queries and write-intensive operations, including concurrent data ingestion and transformation tasks that require high throughput and low latency.
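
Raising a workgroup's base capacity is a single API call. The workgroup name below is a placeholder, and the boto3 call is shown commented out so the sketch stays self-contained; the documented minimum of 8 RPUs is used as the lower bound.

```python
# Sketch: raise the base capacity of a Redshift Serverless workgroup
# to the new 1024 RPU maximum using boto3's redshift-serverless client.

MAX_BASE_CAPACITY_RPUS = 1024  # new maximum described in this post
MIN_BASE_CAPACITY_RPUS = 8     # documented minimum

def base_capacity_update(workgroup_name, rpus=MAX_BASE_CAPACITY_RPUS):
    if not MIN_BASE_CAPACITY_RPUS <= rpus <= MAX_BASE_CAPACITY_RPUS:
        raise ValueError(
            f"base capacity must be between {MIN_BASE_CAPACITY_RPUS} "
            f"and {MAX_BASE_CAPACITY_RPUS} RPUs"
        )
    return {"workgroupName": workgroup_name, "baseCapacity": rpus}

params = base_capacity_update("analytics-wg")
# import boto3
# boto3.client("redshift-serverless").update_workgroup(**params)
print(params)
```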

  • Use DeepSeek with Amazon OpenSearch Service vector database and Amazon SageMaker
    by Jon Handler on February 7, 2025 at 9:21 pm

    OpenSearch Service provides rich capabilities for RAG use cases, as well as vector embedding-powered semantic search. You can use the flexible connector framework and search flow pipelines in OpenSearch to connect to models hosted by DeepSeek, Cohere, and OpenAI, as well as models hosted on Amazon Bedrock and SageMaker. In this post, we build a connection to DeepSeek’s text generation model, supporting a RAG workflow to generate text responses to user queries.
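
A connector of the kind described above is created by POSTing a JSON body to the ml-commons connector API. The endpoint URL, IAM role, and `request_body` template below are placeholder assumptions modeled on the published connector blueprint format, not the post's actual configuration.

```python
# Sketch: a request body for the OpenSearch ml-commons connector API
# (POST /_plugins/_ml/connectors/_create) pointing at a SageMaker-hosted
# text generation model. All ARNs, URLs, and templates are placeholders.
import json

connector_body = {
    "name": "deepseek-text-generation",
    "description": "Connector to a DeepSeek model hosted on SageMaker",
    "version": "1",
    "protocol": "aws_sigv4",  # SigV4-signed calls to an AWS endpoint
    "parameters": {"region": "us-east-1", "service_name": "sagemaker"},
    "credential": {
        "roleArn": "arn:aws:iam::123456789012:role/opensearch-sagemaker-role"
    },
    "actions": [{
        "action_type": "predict",
        "method": "POST",
        "url": ("https://runtime.sagemaker.us-east-1.amazonaws.com"
                "/endpoints/deepseek-demo/invocations"),
        "headers": {"content-type": "application/json"},
        "request_body": '{"inputs": "${parameters.inputs}"}',
    }],
}
# POST this body to <domain-endpoint>/_plugins/_ml/connectors/_create,
# then register and deploy a model that references the connector ID.
print(json.dumps(connector_body, indent=2)[:80])
```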

  • Handle errors in Apache Flink applications on AWS
    by Alexis Tekin on February 6, 2025 at 3:16 pm

    This post discusses strategies for handling errors in Apache Flink applications. However, the general principles discussed here apply to stream processing applications at large.

  • How Open Universities Australia modernized their data platform and significantly reduced their ETL costs with AWS Cloud Development Kit and AWS Step Functions
    by Michael Davies on January 30, 2025 at 1:14 pm

    At Open Universities Australia (OUA), we empower students to explore a vast array of degrees from renowned Australian universities, all delivered through online learning. In this post, we show you how we used AWS services to replace our existing third-party ETL tool, improving the team’s productivity and producing a significant reduction in our ETL operational costs.

  • Hybrid big data analytics with Amazon EMR on AWS Outposts
    by Shoukat Ghouse on January 29, 2025 at 9:20 pm

    In this post, we dive into the transformative features of EMR on Outposts, showcasing its flexibility as a native hybrid data analytics service that allows seamless data access and processing both on premises and in the cloud.

  • How MuleSoft achieved cloud excellence through an event-driven Amazon Redshift lakehouse architecture
    by Sean Zou on January 28, 2025 at 4:42 pm

    In our previous thought leadership blog post, Why a Cloud Operating Model, we defined a COE Framework, showed why MuleSoft implemented it, and described the benefits they received from it. In this post, we dive into the technical implementation, describing how MuleSoft used Amazon EventBridge, Amazon Redshift, Amazon Redshift Spectrum, Amazon S3, and AWS Glue to implement it.

  • OpenSearch Vector Engine is now disk-optimized for low cost, accurate vector search
    by Dylan Tong on January 24, 2025 at 8:21 pm

    OpenSearch Vector Engine can now run vector search at a third of the cost on OpenSearch 2.17+ domains. You can now configure k-NN (vector) indexes to run in disk mode, optimizing them for memory-constrained environments and enabling low-cost, accurate vector search that responds in the low hundreds of milliseconds. Disk mode provides an economical alternative to memory mode when you don’t need near single-digit millisecond latency. In this post, you’ll learn about the benefits of this new feature, the underlying mechanics, customer success stories, and how to get started.
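
Disk mode is set per vector field in the index mapping. A minimal sketch follows; the field name and dimension are placeholders, and other options (such as compression level) are covered in the OpenSearch k-NN documentation.

```python
# Sketch: an OpenSearch 2.17+ index mapping that puts a k-NN vector
# field in on-disk mode, trading some latency for much lower memory use.
import json

def disk_mode_index_body(field="embedding", dimension=768):
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "mode": "on_disk",  # disk-optimized instead of in-memory
                }
            }
        },
    }

body = disk_mode_index_body()
# PUT this body to https://<domain-endpoint>/<index-name> to create the index
print(json.dumps(body, indent=2))
```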

  • Access Apache Iceberg tables in Amazon S3 from Databricks using AWS Glue Iceberg REST Catalog in Amazon SageMaker Lakehouse
    by Srividya Parthasarathy on January 23, 2025 at 4:59 pm

    In this post, we show you how Databricks on AWS general purpose compute can integrate with the AWS Glue Iceberg REST Catalog for metadata access and use Lake Formation for data access. To keep the setup in this post straightforward, the Glue Iceberg REST Catalog and Databricks cluster share the same AWS account.

  • Generate vector embeddings for your data using AWS Lambda as a processor for Amazon OpenSearch Ingestion
    by Jagadish Kumar on January 21, 2025 at 6:08 pm

    In this post, we demonstrate how to use OpenSearch Ingestion’s Lambda processor to generate embeddings for your source data and ingest them into an OpenSearch Serverless vector collection. This solution uses the flexibility of OpenSearch Ingestion pipelines with a Lambda processor to dynamically generate embeddings.
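
The Lambda function behind such a processor receives a batch of documents and returns the transformed batch. The event shape (a plain list of documents) and the stub `embed` function below are assumptions for illustration; a real handler would call an embedding model, for example via Amazon Bedrock.

```python
# Sketch of a Lambda handler used as an OpenSearch Ingestion processor:
# it attaches an embedding to each document in the batch and returns
# the batch unchanged otherwise. The embed() function is a placeholder.

def embed(text):
    # Placeholder: stands in for a call to a real embedding model endpoint.
    return [float(len(text) % 7), 0.0, 1.0]

def lambda_handler(event, context):
    # Normalize to a list so single-document invocations also work.
    documents = event if isinstance(event, list) else [event]
    for doc in documents:
        doc["embedding"] = embed(doc.get("text", ""))
    return documents

out = lambda_handler([{"text": "hello world"}, {"text": "vector search"}], None)
print(out)
```

The pipeline then sinks the enriched documents into the vector collection, where the `embedding` field maps to a k-NN vector field.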

  • Automate topic provisioning and configuration using Terraform with Amazon MSK
    by Vijay Kardile on January 16, 2025 at 4:45 pm

    In this post, we address common challenges associated with manual MSK topic configuration management and present a robust Terraform-based solution. This solution supports both provisioned and serverless MSK clusters.

  • How EUROGATE established a data mesh architecture using Amazon DataZone
    by Dr. Leonard Heilig on January 15, 2025 at 5:37 pm

    In this post, we show you how EUROGATE uses AWS services, including Amazon DataZone, to make data discoverable by data consumers across different business units so that they can innovate faster. Two use cases illustrate how this can be applied for business intelligence (BI) and data science applications, using AWS services such as Amazon Redshift and Amazon SageMaker.

  • Juicebox recruits Amazon OpenSearch Service’s vector database for improved talent search
    by Ishan Gupta on January 14, 2025 at 6:41 pm

    Juicebox is an AI-powered talent sourcing search engine, using advanced natural language models to help recruiters identify the best candidates from a vast dataset of over 800 million profiles. At the core of this functionality is Amazon OpenSearch Service, which provides the backbone for Juicebox’s powerful search infrastructure, enabling a seamless combination of traditional full-text search methods with modern, cutting-edge semantic search capabilities. In this post, we share how Juicebox uses OpenSearch Service for improved search.

  • Batch data ingestion into Amazon OpenSearch Service using AWS Glue
    by Ravikiran Rao on January 13, 2025 at 8:50 pm

    This post showcases how to use Spark on AWS Glue to seamlessly ingest data into OpenSearch Service. We cover batch ingestion methods, share practical examples, and discuss best practices to help you build optimized and scalable data pipelines on AWS.
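
Writing from a Glue Spark job to OpenSearch Service typically goes through the opensearch-hadoop (opensearch-spark) connector. The option names below follow that connector's conventions, and the endpoint, index name, and batch size are placeholder assumptions.

```python
# Sketch: connector options a Glue Spark job might use to write a
# DataFrame to an OpenSearch Service domain. Values are placeholders.

def opensearch_write_options(endpoint, index):
    return {
        "opensearch.nodes": endpoint,
        "opensearch.port": "443",
        "opensearch.net.ssl": "true",
        "opensearch.nodes.wan.only": "true",      # needed for managed domains
        "opensearch.resource": index,             # target index
        "opensearch.batch.size.entries": "1000",  # tune for throughput
    }

opts = opensearch_write_options(
    "vpc-demo-domain.us-east-1.es.amazonaws.com", "web-logs"
)
# In the Glue job, with a Spark DataFrame `df`:
# df.write.format("org.opensearch.spark.sql").options(**opts) \
#     .mode("append").save()
print(opts["opensearch.resource"])
```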

  • Build a high-performance quant research platform with Apache Iceberg
    by Guy Bachar on January 9, 2025 at 8:55 pm

    In our previous post Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg, we showed how to use Apache Iceberg in the context of strategy backtesting. In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Iceberg. Our experiments are based on real-world historical full order book data, provided by our partner CryptoStruct, and compare the trade-offs between these choices, focusing on performance, cost, and quant developer productivity.

  • Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques
    by Aruna Govindaraju on January 9, 2025 at 6:16 pm

    This blog post introduces a new disk-based vector search approach that allows efficient querying of vectors stored on disk without loading them entirely into memory. By implementing these quantization methods, organizations can achieve compression ratios of up to 64x, enabling cost-effective scaling of vector databases for large-scale AI and machine learning applications.

  • Use CI/CD best practices to automate Amazon OpenSearch Service cluster management operations
    by Camille BIRBES on January 7, 2025 at 5:51 pm

    This post explores how to automate Amazon OpenSearch Service cluster management using CI/CD best practices. It presents two options: the Terraform OpenSearch provider and the Evolution library. The solution demonstrates how to use AWS CDK, Lambda, and CodeBuild to implement automated index template creation and management. By applying these techniques, organizations can improve the consistency, reliability, and efficiency of their OpenSearch operations.
