AWS Big Data Blog: Official Big Data Blog of Amazon Web Services
- How Octus achieved 85% infrastructure cost reduction with zero downtime migration to Amazon OpenSearch Service by Vaibhav Sabharwal on November 26, 2025 at 7:38 pm
This post highlights how Octus migrated its Elasticsearch workloads running on Elastic Cloud to Amazon OpenSearch Service. The journey traces Octus’s shift from managing multiple systems to adopting a cost-efficient solution powered by OpenSearch Service.
- Getting started with Apache Iceberg write support in Amazon Redshift by Sanket Hase on November 26, 2025 at 7:34 pm
In this post, we show how you can use Amazon Redshift to write data directly to Apache Iceberg tables stored in Amazon S3 and S3 Tables for seamless integration between your data warehouse and data lake while maintaining ACID compliance.
- Orchestrating data processing tasks with a serverless visual workflow in Amazon SageMaker Unified Studio by Suba Palanisamy on November 25, 2025 at 11:08 pm
In this post, we show how to use the new visual workflow experience in SageMaker Unified Studio IAM-based domains to orchestrate an end-to-end machine learning workflow. The workflow ingests weather data, applies transformations, and generates predictions—all through a single, intuitive interface, without writing any orchestration code.
- Save up to 24% on Amazon Redshift Serverless compute costs with Reservations by Satesh Sonti on November 24, 2025 at 10:14 pm
In this post, you learn how Amazon Redshift Serverless Reservations can help you lower your data warehouse costs. We explore ways to determine the optimal number of RPUs to reserve, review example scenarios, and discuss important considerations when purchasing these reservations.
- Introducing Cluster Insights: Unified monitoring dashboard for Amazon OpenSearch Service clusters by Siddhant Gupta on November 21, 2025 at 4:38 pm
This post guides you through setting up and using Cluster Insights, including key features and metrics. By the end, you’ll understand how to use Cluster Insights to recognize and address performance and resiliency issues within your OpenSearch Service clusters.
- Enforce business glossary classification rules in Amazon SageMaker Catalog by Ramesh H Singh on November 20, 2025 at 6:39 pm
Amazon SageMaker Catalog now supports metadata enforcement rules for glossary terms classification (tagging) at the asset level. With this capability, administrators can require that assets include specific business terms or classifications. Data producers must apply required glossary terms or classifications before an asset can be published. In this post, we show how to enforce business glossary classification rules in SageMaker Catalog.
- Enhanced data discovery in Amazon SageMaker Catalog with custom metadata forms and rich text documentation by Ramesh H Singh on November 20, 2025 at 6:35 pm
Amazon SageMaker Catalog now supports custom metadata forms and rich text descriptions at the column level, extending existing curation capabilities for business names, descriptions, and glossary term classifications. Column-level context is essential for understanding and trusting data. This release helps organizations improve data discoverability, collaboration, and governance by letting metadata stewards document columns using structured and formatted information that aligns with internal standards. In this post, we show how to enhance data discovery in SageMaker Catalog with custom metadata forms and rich text documentation at the schema level.
- Getting started with Amazon S3 Tables in Amazon SageMaker Unified Studio by David Pasha on November 19, 2025 at 11:26 pm
In this post, you learn how to integrate SageMaker Unified Studio with S3 Tables and query your data using Amazon Athena, Amazon Redshift, or Apache Spark in EMR and AWS Glue.
- Cross-account lakehouse governance with Amazon S3 Tables and SageMaker Catalog by Sneha Rao on November 18, 2025 at 11:01 pm
In this post, we walk you through a practical solution for secure, efficient cross-account data sharing and analysis. You’ll learn how to set up cross-account access to S3 Tables using federated catalogs in Amazon SageMaker, perform unified queries across accounts with Amazon Athena in Amazon SageMaker Unified Studio, and implement fine-grained access controls at the column level using AWS Lake Formation.
- Introducing Amazon MWAA Serverless by John Jackson on November 17, 2025 at 10:22 pm
Today, AWS announced Amazon Managed Workflows for Apache Airflow (MWAA) Serverless. This is a new deployment option for MWAA that eliminates the operational overhead of managing Apache Airflow environments while optimizing costs through serverless scaling. In this post, we demonstrate how to use MWAA Serverless to build and deploy scalable workflow automation solutions.
- Your guide to AWS Analytics at AWS re:Invent 2025 by Sonu Kumar Singh on November 13, 2025 at 8:06 pm
It’s that time of year again — AWS re:Invent is here! At re:Invent, bold ideas come to life. Get a front-row seat to hear inspiring stories from AWS experts, customers, and leaders as they explore today’s most impactful topics, from data analytics to AI. For all the data enthusiasts and professionals, we’ve curated a comprehensive
- How Yelp modernized its data infrastructure with a streaming lakehouse on AWS by Umesh Dangat on November 13, 2025 at 6:07 pm
This is a guest post by Umesh Dangat, Senior Principal Engineer for Distributed Services and Systems at Yelp, and Toby Cole, Principal Engineer for Data Processing at Yelp, in partnership with AWS. Yelp processes massive amounts of user data daily: over 300 million business reviews, 100,000 photo uploads, and countless check-ins. Maintaining sub-minute data freshness with
- Introducing the Amazon OpenSearch Lens for the AWS Well-Architected Framework by Muslim Abu-Taha on November 12, 2025 at 1:07 am
In this post, we show you how to use the Amazon OpenSearch Service Lens to evaluate your OpenSearch Service workloads against architectural best practices.
- Amazon MSK Express brokers now support Intelligent Rebalancing for 180 times faster operation performance by Swapna Bandla on November 10, 2025 at 11:15 pm
Effective today, all new Amazon Managed Streaming for Apache Kafka (Amazon MSK) Provisioned clusters with Express brokers support Intelligent Rebalancing at no additional cost. In this post, we introduce the Intelligent Rebalancing feature and show an example of how it improves operation performance.
- Analyzing Amazon EC2 Spot instance interruptions by using event-driven architecture by Shekhar Shrinivasan on November 10, 2025 at 10:05 pm
In this post, you’ll learn how to build a comprehensive monitoring solution for Spot Instance interruptions step by step. You’ll gain practical experience designing an event-driven pipeline, implementing data processing workflows, and creating dashboards that help you track interruption trends, optimize Auto Scaling group (ASG) configurations, and improve the resilience of your Spot Instance workloads.
- Enhanced search with match highlights and explanations in Amazon SageMaker by Ramesh H Singh on November 4, 2025 at 10:57 pm
Amazon SageMaker now enhances search results in Amazon SageMaker Unified Studio with additional context that improves transparency and interpretability. The capability introduces inline highlighting for matched terms and an explanation panel that details where and how each match occurred across metadata fields such as name, description, glossary, and schema. In this post, we demonstrate how to use enhanced search in Amazon SageMaker.
- Amazon Kinesis Data Streams launches On-demand Advantage for instant throughput increases and streaming at scale by Pratik Patel on November 3, 2025 at 10:00 pm
Today, AWS announced the new Amazon Kinesis Data Streams On-demand Advantage mode, which includes warm throughput capability and an updated pricing structure. With this feature you can enable instant scaling for traffic surges while optimizing costs for consistent streaming workloads. In this post, we explore this new feature, including key use cases, configuration options, pricing considerations, and best practices for optimal performance.
- Scaling data governance with Amazon DataZone: Covestro success story by Jörg Janssen on November 3, 2025 at 9:02 pm
In this post, we show you how Covestro transformed its data architecture by implementing Amazon DataZone and AWS Serverless Data Lake Framework, transitioning from a centralized data lake to a data mesh architecture. The implementation enabled streamlined data access, better data quality, and stronger governance at scale, achieving a 70% reduction in time-to-market for over 1,000 data pipelines.
- Use trusted identity propagation for Apache Spark interactive sessions in Amazon SageMaker Unified Studio by Aarthi Srinivasan on October 31, 2025 at 8:55 pm
In this post, we provide step-by-step instructions to set up Amazon EMR on EC2, EMR Serverless, and AWS Glue within SageMaker Unified Studio, enabled with trusted identity propagation. We use the setup to illustrate how different IAM Identity Center users can run their Spark sessions, using each compute setup, within the same project in SageMaker Unified Studio. We show how each user sees only the tables, or parts of tables, that they’re granted access to in Lake Formation.
- Amazon Kinesis Data Streams now supports 10x larger record sizes: Simplifying real-time data processing by Sumant Nemmani on October 28, 2025 at 7:23 pm
Today, AWS announced that Amazon Kinesis Data Streams now supports record sizes up to 10 MiB, a tenfold increase from the previous limit. In this post, we explore Amazon Kinesis Data Streams large record support, including key use cases, configuration of maximum record sizes, throttling considerations, and best practices for optimal performance.