AWS Big Data Blog Official Big Data Blog of Amazon Web Services
- Compaction support for Avro and ORC file formats in Apache Iceberg tables in Amazon S3 by Angel Conde Manjon on July 16, 2025 at 12:25 am
In this post, we explore how Amazon S3 Tables has expanded its automatic compaction capabilities to include Avro and ORC file formats for Apache Iceberg tables, alongside the previously supported Parquet format. Through performance testing with over 20 billion events, the capability demonstrates significant query performance improvements ranging from 12% to 40% when using compacted tables compared to non-compacted tables across different file formats.
- Introducing Jobs in Amazon SageMaker by Chiho Sugimoto on July 15, 2025 at 7:10 pm
This post demonstrates how the new jobs experience works in SageMaker Unified Studio.
- Orchestrate data processing jobs, querybooks, and notebooks using visual workflow experience in Amazon SageMaker by Naohisa Takahashi on July 15, 2025 at 5:42 pm
Today, we are excited to launch a new visual workflows builder in SageMaker Unified Studio. With the new visual workflow experience, you don’t need to code the Python DAGs manually. Instead, you can visually define the orchestration workflow in SageMaker Unified Studio, and the visual definition is automatically converted to a Python DAG definition that is supported in Airflow. This post demonstrates the new visual workflow experience in SageMaker Unified Studio.
- Revenue NSW modernises analytics with AWS, enabling unified and scalable data management, processing, and access by Saeed Barghi on July 15, 2025 at 12:04 pm
Revenue NSW, Australia’s principal revenue management agency, successfully modernized its analytics infrastructure using AWS services. In this blog post, we show how the organization transformed its on-premises data environment into a unified, scalable cloud-based solution using Amazon Redshift, AWS Database Migration Service, Amazon AppFlow, and AWS Glue.
- Harnessing the Power of Nested Materialized Views and exploring Cascading Refresh by Ritesh Sinha on July 11, 2025 at 3:46 pm
In this post, we explore how to maximize Amazon Redshift query performance through nested materialized views and implementing cascading refresh strategies. We demonstrate how to create materialized views based on other materialized views, enabling a hierarchical structure of precomputed results that significantly enhances query performance and data processing efficiency, particularly useful for reusing precomputed joins with different aggregate options.
- Realizing ocean data democratization: Furuno Electric’s initiatives using Amazon DataZone by Akira Mikami on July 10, 2025 at 9:56 pm
In this post, we explore how Furuno Electric built a comprehensive data management foundation using Amazon DataZone and other AWS services to transform from a traditional manufacturing company to a data-driven business.
- Geospatial data lakes with Amazon Redshift by Jeremy Spell on July 10, 2025 at 9:52 pm
In this post, we review how to set up Redshift Serverless to use geospatial data contained within a data lake to enhance maps in ArcGIS Pro. This technique helps builders and GIS analysts use available datasets in data lakes and transform them in Amazon Redshift to further enrich the data before presenting it on a map.
- Develop and monitor a Spark application using existing data in Amazon S3 with Amazon SageMaker Unified Studio by Amit Maindola on July 9, 2025 at 7:31 pm
In this post, we demonstrate how to develop and monitor a Spark application on existing data in Amazon S3 using SageMaker Unified Studio. The solution addresses key challenges organizations face in managing big data analytics workloads through an integrated development environment where data teams can develop, test, and refine Spark applications while leveraging EMR Serverless for dynamic resource allocation and built-in monitoring tools.
- Perform per-project cost allocation in Amazon SageMaker Unified Studio by Enrique Salgado Hernández on July 9, 2025 at 3:27 pm
Amazon SageMaker Unified Studio enables per-project cost allocation through resource tagging, allowing organizations to track and manage costs across different projects and domains effectively. This post demonstrates how to implement cost tracking using AWS Billing and Cost Management tools, including Cost Explorer and Data Exports, to help finance and business analysts follow FinOps best practices for controlling cloud infrastructure costs.
- Near real-time baggage operational insights for airlines using Amazon Kinesis Data Streams by Subhash Sharma on July 8, 2025 at 8:29 pm
This post explores a framework developed by IBM to modernize baggage analytics using AWS managed services like Amazon Kinesis Data Streams, DynamoDB Streams, and other AWS services within a serverless architecture. The solution enables near real-time baggage operational insights for airlines, delivering cost savings, enhanced scalability, and improved performance while providing better security and operational efficiency to meet evolving airline needs.
- Overcome your Kafka Connect challenges with Amazon Data Firehose by Swapna Bandla on July 7, 2025 at 2:26 pm
We’re happy to announce a new feature in the Amazon Data Firehose integration with Amazon MSK. You can now configure the Firehose stream to start reading your MSK topic from either the earliest position on the topic or a custom timestamp. In this post, part of a series, we focus on managed data delivery from Kafka to your data lake.
- How Stifel built a modern data platform using AWS Glue and an event-driven domain architecture by Amit Maindola on July 7, 2025 at 2:22 pm
In this post, we show you how Stifel implemented a modern data platform using AWS services and open data standards, building an event-driven architecture for domain data products while centralizing the metadata to facilitate discovery and sharing of data products.
- Build conversational AI search with Amazon OpenSearch Service by Bharav Patel on July 3, 2025 at 4:45 pm
Amazon OpenSearch Service is a versatile search and analytics tool. In this post, we explore conversational search, its architecture, and various ways to implement it.
- Enhance stability with dedicated cluster manager nodes using Amazon OpenSearch Service by Chinmayi Narasimhadevara on July 3, 2025 at 4:09 pm
In this post, we show how deploying dedicated cluster manager nodes enhances the stability and reliability of your OpenSearch Service domain.
- Kaltura reduces observability operational costs by 60% with Amazon OpenSearch Service by Ido Ziv on July 3, 2025 at 1:59 pm
In this post, we share how Kaltura transformed its observability strategy and technological stack by migrating from a software as a service (SaaS) logging solution to Amazon OpenSearch Service—achieving higher log retention, a 60% reduction in cost, and a centralized platform that empowers multiple teams with real-time insights.
- Introducing GenAI-powered business description recommendations for custom assets in Amazon SageMaker Catalog by Ramesh H Singh on July 1, 2025 at 10:24 pm
Amazon SageMaker Catalog now supports generative AI-powered recommendations for business descriptions, including table summaries, use cases, and column-level descriptions for custom structured assets registered programmatically. In this post, we demonstrate how to generate AI recommendations for business descriptions for custom structured assets in SageMaker Catalog.
- Amazon Redshift Python user-defined functions will reach end of support after June 30, 2026 by Raks Khare on June 30, 2025 at 6:54 pm
The Amazon Redshift integration with AWS Lambda provides the capability to create Amazon Redshift Lambda user-defined functions (UDFs). Because Lambda UDFs provide significant advantages in integration, flexibility, scalability, and security, we will be ending support for Python UDFs in Amazon Redshift. In this post, we walk you through how to migrate your existing Python UDFs to Lambda UDFs, set up monitoring and cost evaluations, and review key considerations for a smooth transition.
- Enforce table level access control on data lake tables using AWS Glue 5.0 with AWS Lake Formation by Layth Yassin on June 30, 2025 at 4:34 pm
In this post, we show you how to enforce full-table access (FTA) control on AWS Glue 5.0 through Lake Formation permissions.
- Building serverless event streaming applications with Amazon MSK and AWS Lambda by Tarun Rai Madan on June 26, 2025 at 5:48 pm
In this post, we describe how you can simplify your event-driven application architecture using AWS Lambda with Amazon MSK. We demonstrate how to configure Lambda as a consumer for Kafka topics, including a cross-account setup and how to optimize price and performance for these applications.
- Enhance data ingestion performance in Amazon Redshift with concurrent inserts by Raghu Kuppala on June 26, 2025 at 5:39 pm
Amazon Redshift employs columnar storage for database tables, reducing overall disk I/O requirements. This storage method significantly improves analytic query performance by minimizing data read during queries. This post showcases the key improvements in Amazon Redshift concurrent data ingestion operations.
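
For readers following the Python UDF end-of-support item above, here is a minimal sketch of what a Lambda-side replacement for a simple scalar Python UDF might look like. It is not taken from the post. It assumes the batch interface as commonly documented for Redshift Lambda UDFs: rows arrive as a list of argument lists under `arguments`, and the function returns a JSON string with `success`, `results`, and optionally `error_msg`. The add-two-numbers logic is hypothetical; verify the exact payload fields against the current Amazon Redshift documentation before relying on it.

```python
import json


def lambda_handler(event, context):
    """Hypothetical Lambda UDF: adds two numeric arguments for each input row.

    Assumption: Redshift batches rows and sends them as
    event["arguments"] == [[a1, b1], [a2, b2], ...].
    """
    try:
        rows = event["arguments"]
        # Propagate NULLs (None) the way a SQL scalar function typically would.
        results = [
            None if a is None or b is None else a + b
            for a, b in rows
        ]
        # One result per input row, in the same order as the input.
        return json.dumps(
            {"success": True, "num_records": len(results), "results": results}
        )
    except Exception as exc:
        # Report the failure back to the caller instead of raising.
        return json.dumps({"success": False, "error_msg": str(exc)})
```

Because the handler processes a whole batch per invocation, the per-row logic of an existing Python UDF would usually move inside the list comprehension (or a helper it calls), while the surrounding envelope stays the same.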