AWS Big Data Blog: Official Big Data Blog of Amazon Web Services
- Guide to adopting Amazon SageMaker Unified Studio from ATPCO’s Journey by Mitesh Patel on August 18, 2025 at 7:03 pm
ATPCO is the backbone of modern airline retailing, helping airlines and third-party channels deliver the right offers to customers at the right time. ATPCO addressed data governance challenges using Amazon DataZone. SageMaker Unified Studio, built on the same architecture as Amazon DataZone, offers additional capabilities, so users can complete various tasks such as building data pipelines using AWS Glue and Amazon EMR, or conducting analyses using Amazon Athena and Amazon Redshift query editor across diverse datasets, all within a single, unified environment. In this post, we walk you through the challenges ATPCO addresses for their business using SageMaker Unified Studio.
- Achieve low-latency data processing with Amazon EMR on AWS Local Zones by Gagan Brahmi on August 18, 2025 at 6:56 pm
By deploying Amazon EMR on AWS Local Zones, organizations can achieve single-digit millisecond latency data processing for applications while maintaining data residency compliance. This post demonstrates how to use AWS Local Zones to deploy EMR clusters closer to your users, enabling millisecond-level response times.
- Transform your data to Amazon S3 Tables with Amazon Athena by Pathik Shah on August 15, 2025 at 8:25 pm
This post demonstrates how Amazon Athena CREATE TABLE AS SELECT (CTAS) simplifies the data transformation process through a practical example: migrating an existing Parquet dataset into Amazon S3 Tables.
- Export JMX metrics from Kafka connectors in Amazon Managed Streaming for Apache Kafka Connect with a custom plugin by Jaydev Nath on August 15, 2025 at 3:51 pm
In this post, we demonstrate how you can export the JMX metrics for a Debezium connector when used with Amazon MSK Connect.
- Cluster manager communication simplified with Remote Publication by Himshikha Gupta on August 14, 2025 at 3:37 pm
Amazon OpenSearch Service has taken a significant leap forward in scalability and performance with the introduction of support for 1,000-node OpenSearch Service domains capable of handling 500,000 shards with OpenSearch Service version 2.17. This post explains cluster state publication, Remote Publication, and their benefits in improving durability, scalability, and availability.
- Enhance Amazon EMR observability with automated incident mitigation using Amazon Bedrock and Amazon Managed Grafana by Yu-Ting Su on August 14, 2025 at 3:25 pm
In this post, we demonstrate how to integrate real-time monitoring with AI-powered remediation suggestions, combining Amazon Managed Grafana for visualization, Amazon Bedrock for intelligent response recommendations, and AWS Systems Manager for automated remediation actions on Amazon Web Services (AWS).
- Build data pipelines with dbt in Amazon Redshift using Amazon MWAA and Cosmos by Cindy Li on August 13, 2025 at 8:12 pm
In this post, we explore a streamlined, configuration-driven approach to orchestrate dbt Core jobs using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Cosmos, an open source package. These jobs run transformations on Amazon Redshift. With this setup, teams can collaborate effectively while maintaining data quality, operational efficiency, and observability.
- The Amazon SageMaker lakehouse architecture now automates optimization configuration of Apache Iceberg tables on Amazon S3 by Tomohiro Tanaka on August 8, 2025 at 9:40 pm
The Amazon SageMaker lakehouse architecture now automates optimization of Iceberg tables stored in Amazon S3 with catalog-level configuration, optimizing storage in your Iceberg tables and improving query performance. This post demonstrates an end-to-end flow for enabling the catalog-level table optimization setting.
- Boosting search relevance: Automatic semantic enrichment in Amazon OpenSearch Serverless by Jon Handler on August 6, 2025 at 4:47 pm
In this post, we show how automatic semantic enrichment removes friction and makes the implementation of semantic search for text data seamless, with step-by-step instructions to enhance your search functionality.
- Create an OpenSearch dashboard with Amazon OpenSearch Service by Smita Singh on August 5, 2025 at 4:32 pm
This post demonstrates how to harness OpenSearch Dashboards to analyze logs visually and interactively. With this solution, IT administrators, developers, and DevOps engineers can create custom dashboards to monitor system behavior, detect anomalies early, and troubleshoot issues faster through interactive charts and graphs.
- Build a multi-tenant healthcare system with Amazon OpenSearch Service by Ezat Karimi on August 5, 2025 at 4:23 pm
In this post, we address common multi-tenancy challenges and provide actionable solutions for security, tenant isolation, workload management, and cost optimization across diverse healthcare tenants.
- Integrate scientific data management and analytics with the next generation of Amazon SageMaker, Part 1 by Nadeem Bulsara on August 5, 2025 at 4:19 pm
In this blog post, AWS introduces a solution to a common challenge in scientific research – the inefficient management of fragmented scientific data. The post demonstrates how the next generation of Amazon SageMaker, through its Unified Studio and Catalog features, helps scientists streamline their workflow by integrating data management and analytics capabilities.
- Develop and deploy a generative AI application using Amazon SageMaker Unified Studio by Amit Maindola on August 4, 2025 at 5:17 pm
In this post, we demonstrate how to use Amazon Bedrock Flows in SageMaker Unified Studio to build a sophisticated generative AI application for financial analysis and investment decision-making.
- Near real-time streaming analytics on protobuf with Amazon Redshift by Konstantinos Tzouvanas on August 4, 2025 at 5:06 pm
In this post, we explore an end-to-end analytics workload for streaming protobuf data. We show how to ingest these data streams with Amazon Redshift Streaming Ingestion and how to deserialize and process them using AWS Lambda functions, so that the incoming streams are immediately available for querying and analytical processing on Amazon Redshift.
- Amazon Redshift out-of-the-box performance innovations for data lake queries by Martin Milenkoski on July 31, 2025 at 2:05 pm
In this post, we first briefly review how planner statistics are collected and what impact they have on queries. Then, we discuss Amazon Redshift features that deliver optimal plans on Iceberg tables and Parquet data even when statistics are missing. Finally, we review some example queries that now execute faster because of these latest Amazon Redshift innovations.
- Optimize traffic costs of Amazon MSK consumers on Amazon EKS with rack awareness by Austin Groeneveld on July 30, 2025 at 4:35 pm
In this post, we walk you through a solution for implementing rack awareness in consumer applications that are dynamically deployed across multiple Availability Zones using Amazon EKS.
- Automate data lineage in Amazon SageMaker using AWS Glue Crawlers supported data sources by Mohit Dawar on July 30, 2025 at 4:33 pm
In this post, we explore the real-world impact of data lineage through the lens of an ecommerce company striving to boost its bottom line. To illustrate this practical application, we walk you through how you can use the prebuilt integration between SageMaker Catalog and AWS Glue crawlers to automatically capture lineage for data assets stored in Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Accelerate your data quality journey for lakehouse architecture with Amazon SageMaker, Apache Iceberg on AWS, Amazon S3 tables, and AWS Glue Data Quality by Brody Pearman on July 28, 2025 at 6:09 pm
This post explores how you can use AWS Glue Data Quality to maintain data quality of S3 Tables and Apache Iceberg tables on general purpose S3 buckets. We’ll discuss strategies for verifying the quality of published data and how these integrated technologies can be used to implement effective data quality workflows.
- Build an analytics pipeline that is resilient to Avro schema changes using Amazon Athena by Mohammad Sabeel on July 25, 2025 at 4:33 pm
This post demonstrates how to build a solution by combining Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue Data Catalog for schema management, and Amazon Athena for one-time querying. We’ll focus specifically on handling Avro-formatted data in partitioned S3 buckets, where schemas can change frequently, while still providing consistent query capabilities across all data regardless of schema version.
- Secure generative SQL with Amazon Q by Gregory Knowles on July 25, 2025 at 4:25 pm
In this post, we discuss the design of generative SQL, the security controls in place when you use it, and how it works in both Amazon SageMaker Unified Studio and Amazon Redshift Query Editor v2.