AWS Big Data Blog: Official Big Data Blog of Amazon Web Services
- The next generation of Amazon OpenSearch Serverless: Built from the ground up for agents by Sohaib Katariwala on May 28, 2026 at 6:24 pm
Today, we are announcing a ground-up re-architecture of Amazon OpenSearch Serverless that delivers up to 20 times faster autoscaling, scale to zero, and up to 60% lower cost than provisioning clusters for peak load. Amazon OpenSearch Service is a fully managed, open source retrieval engine that unifies vector, lexical, hybrid, and agentic search, delivering low-latency, accurate and relevant results. Amazon OpenSearch Serverless is an automatically scaled deployment option. The new architecture decouples compute from storage. The service provisions infrastructure in seconds instead of minutes, and scales compute all the way to zero when your application is idle. In this post, we walk through the new architecture, what it means for your applications, and how to get started with a hands-on tutorial.
- How Buildkite Operates Test Analytics at Massive Scale with Amazon MSK and Amazon Managed Service for Apache Flink by James Hill on May 27, 2026 at 6:22 pm
In this post, we explore how Buildkite uses Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink to power Test Engine’s streaming-first analytics architecture at scale.
- How Zynga scaled multi-warehouse data governance with Amazon Redshift federated permissions by Johan Eklund, Matthew Wongkee, Noelia Tardón on May 27, 2026 at 6:19 pm
In this post, we walk through how Zynga adopted Amazon Redshift federated permissions and AWS IAM Identity Center to enforce consistent, tiered data access across provisioned and serverless Amazon Redshift environments without building custom synchronization pipelines.
- Automate data discovery and centralized management with AWS Glue Data Catalog by Ramakrishna Natarajan on May 26, 2026 at 5:59 pm
In this post, we show you how to tackle data discovery, classification, and governance across your databases, data warehouses, and object storage to regain visibility and control over your data landscape.
- How Amazon is moving to integrate catalogs to improve data discovery with Amazon SageMaker by Pradeep Misra on May 22, 2026 at 8:56 pm
Enterprises face challenges when teams create data assets outside of central data catalogs: it adds overhead for discovery and limits collaboration. Amazon’s Business Data Technologies (BDT) team built an enterprise data catalog, Andes, for sharing datasets under well-defined policies. However, teams created catalogs of local datasets and other non-tabular assets, such as dashboards and metrics, outside Andes, which made it difficult to discover all assets in one place. In this post, we share how Amazon.com is working to integrate catalogs by extending the enterprise data catalog Andes with Amazon SageMaker.
- Automate deployment of data and AI applications with Amazon SageMaker Unified Studio CI/CD CLI by Saurabh Bhutyani on May 21, 2026 at 7:13 pm
The CI/CD CLI for Amazon SageMaker Unified Studio (aws-smus-cicd-cli) is an open source command line tool that automates deployment of multi-service data and AI applications across pipeline stages. Data teams define their application once in a YAML manifest, DevOps teams deploy with a single command, and the CLI handles configuration substitution, dependency ordering, and resource provisioning automatically. In this post, we walk through how the CI/CD CLI works, show you how to deploy a real application across environments, and demonstrate how it fits into your existing CI/CD workflows.
- A systematic approach to benchmarking SQL processing engines on AWS by Anubhav Awasthi on May 19, 2026 at 3:44 pm
Selecting the right SQL processing solution for large-scale data analytics is a critical decision for organizations. As data volumes grow exponentially, the technology landscape has evolved to offer diverse options for processing and analyzing this information efficiently. This post presents a systematic framework for evaluating and benchmarking SQL processing engines on AWS, using Apache JMeter to conduct practical performance testing at scale.
- Build petabyte-scale synthetic test data with Amazon EMR on EC2 by Anubhav Awasthi on May 19, 2026 at 3:42 pm
As data volumes grow from terabytes to petabytes, the architecture for generating synthetic data must evolve to meet increasing demands for scale, performance, and data quality. In this post, we show how you can build a scalable synthetic data generation solution using Amazon EMR, Apache Spark, and the Faker library.
- Meet Amazon Redshift RG – AWS Graviton-based instances with an integrated data lake query engine delivering up to 2.4x better performance at 30% lower price than RA3 by Ankit Sahu on May 19, 2026 at 3:38 pm
On May 12, 2026, we announced the general availability of Amazon Redshift RG instances, powered by AWS Graviton processors. RG instances are up to 2.2x as fast for data warehouse workloads and up to 2.4x as fast for data lake workloads, all at 30% lower price per vCPU compared to RA3 instances. RG instances support all data lake formats supported by RA3 and eliminate Amazon Redshift Spectrum’s per-TB scanning charges. RG instances feature a custom-built integrated vectorized query engine, making them a more performant and cost-effective foundation for unified analytics. We are launching with two instance sizes: rg.xlarge and rg.4xlarge, with additional sizes coming later this year.
- OpenSearch Agent Skills bring built-in intelligence to your agentic IDE by Bobby Mohammed on May 18, 2026 at 7:15 pm
Today, we’re launching OpenSearch Agent Skills, a repository of open, composable skills that bring built-in intelligence to developer workflows with OpenSearch, directly inside your favorite agentic IDE. By embedding OpenSearch expertise into the developer’s existing workflow, Agent Skills reduce setup time, eliminate unnecessary tool-hopping, and let teams focus on building rather than configuring.
- How Smartsheet built Real-time Dynamic Filtering on Apache Flink reducing $40K/month in messaging costs by Emre Kartoglu on May 18, 2026 at 6:59 pm
In this post, you learn how Smartsheet built a Real-time Dynamic Filtering (RDF) system on Amazon Managed Service for Apache Flink, cutting messaging costs by over $40,000 per month and improving live collaboration latency by 1.8x.
- Optimize Amazon S3 Tables queries with Amazon Redshift by Tom Romano on May 14, 2026 at 4:58 pm
This is the third post in our S3 Tables and Amazon Redshift series. The first post covered getting started with querying Apache Iceberg tables, and the second post walked through enterprise-scale governance and access controls. In this post, you address those performance and usability gaps with three different approaches.
- Securing client confidentiality at scale: Automated data discovery and governed analytics for legal workloads by Rohan Kamat on May 13, 2026 at 3:57 pm
In this post, we show you a reference architecture that automates sensitive data discovery across legal document repositories on Amazon Web Services (AWS), demonstrate how to capture structured findings as a compliance dataset, and guide you through building a governed analytics workspace that maintains your security boundaries. You walk away with a practical model for building security and analytics into the same lifecycle, without moving documents outside their system of record.
- Streamlined monitoring and debugging for Amazon EMR on EC2 by Parul Saxena on May 12, 2026 at 3:59 pm
In this post, we walk you through five key enhancements: Amazon CloudWatch Logs integration, step-level Amazon Simple Storage Service (Amazon S3) logging controls, expanded console UIs for YARN and Tez, Amazon EMR step to YARN application ID mapping, and enhanced custom metrics with updated documentation.
- Detect and resolve HBase inconsistencies faster with AI on Amazon EMR by Yu-Ting Su on May 12, 2026 at 3:56 pm
In this post, we show you how to build an AI-powered troubleshooting solution using Amazon OpenSearch Service vector search and intelligent analysis. This solution reduces HBase inconsistency resolution from hours to minutes and root cause identification from days to hours through natural language queries over operational data. It makes HBase troubleshooting accessible across teams and reduces dependency on specialized expertise.
- How to use streamlined permissions for Amazon S3 Tables and Iceberg materialized views by Srividya Parthasarathy on May 11, 2026 at 6:59 pm
In this post, we walk through how to set up and manage S3 Tables in the AWS Glue Data Catalog, create and query Iceberg materialized views, and configure access controls that work across your analytics stack with IAM-based authorization.
- Improve DynamoDB analytics with AWS Glue zero-ETL schema and partition controls by Raju Ansari on May 11, 2026 at 6:51 pm
In this post, you learn how to replicate Amazon DynamoDB data to Apache Iceberg tables in Amazon S3 through a zero-ETL integration. We walk through the challenges that the DynamoDB nested, schema-flexible data model introduces for analytics workloads, and show you how to configure schema unnesting and data partitioning for a sample product catalog table. We also cover how to query the replicated data in Amazon Athena using standard SQL.
- How to build a cross-Region resilience for Amazon OpenSearch Service with Amazon MSK by Sriharsha Subramanya Begolli on May 11, 2026 at 6:46 pm
In this post, we outline the solution that provides cross-Region resiliency without needing to reestablish relationships during a fail-back, using an active-active replication model with Amazon OpenSearch Ingestion (OSI) and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This solution applies to both OpenSearch Service managed clusters and Amazon OpenSearch Serverless collections. We use Amazon OpenSearch Serverless as an example for the configurations in this post.
- How to consolidate cross-Region S3 data into OpenSearch by David Venable on May 8, 2026 at 1:37 pm
We’re happy to announce that Amazon OpenSearch Ingestion pipelines can now read from S3 buckets in different Regions to ingest and consolidate data into a single OpenSearch Service domain or collection. In this post, I’ll show you how to use the new cross-Region support to ingest data from S3 buckets across multiple AWS Regions into a single OpenSearch Service domain or collection.
- Enable real-time mainframe analytics with Precisely Connect and Amazon S3 by Supreet Padhi, Rochelle Grubbs on May 8, 2026 at 1:29 pm
In this post, we discuss how you can use Precisely Connect to enable real-time, direct replication of mainframe data to Amazon Simple Storage Service (Amazon S3), and how your organization can extend this foundation using Amazon S3 Tables for advanced analytics.






















