AWS Big Data Blog: Official Big Data Blog of Amazon Web Services
- Navigating architectural choices for a lakehouse using Amazon SageMaker by Lakshmi Nair on January 12, 2026 at 8:46 pm
Over time, several distinct lakehouse approaches have emerged. In this post, we show you how to evaluate and choose the right lakehouse pattern for your needs. A lakehouse architecture isn’t about choosing between a data lake and a data warehouse. Instead, it’s an approach to interoperability where both frameworks coexist and serve different purposes within a unified data architecture. By understanding fundamental storage patterns, implementing effective catalog strategies, and using native storage capabilities, you can build scalable, high-performance data architectures that support both your current analytics needs and future innovation.
- Access Databricks Unity Catalog data using catalog federation in the AWS Glue Data Catalog by Srividya Parthasarathy on January 12, 2026 at 8:37 pm
AWS has launched the catalog federation capability, enabling direct access to Apache Iceberg tables managed in Databricks Unity Catalog through the AWS Glue Data Catalog. With this integration, you can discover and query Unity Catalog data in Iceberg format using an Iceberg REST API endpoint, while maintaining granular access controls through AWS Lake Formation. In this post, we demonstrate how to set up catalog federation between the Glue Data Catalog and Databricks Unity Catalog, enabling data querying using AWS analytics services.
- Use Amazon SageMaker custom tags for project resource governance and cost tracking by David Victoria on January 9, 2026 at 1:04 am
Amazon SageMaker announced a new feature that you can use to add custom tags to resources created through an Amazon SageMaker Unified Studio project. This helps you enforce tagging standards that conform to your organization’s service control policies (SCPs) and helps enable cost tracking reporting practices on resources created across the organization. In this post, we look at use cases for custom tags and how to use the AWS Command Line Interface (AWS CLI) to add tags to project resources.
- Create AWS Glue Data Catalog views using cross-account definer roles by Aarthi Srinivasan on January 8, 2026 at 10:45 pm
In this post, we demonstrate how to use cross-account IAM definer roles with AWS Glue Data Catalog views. We show how data owner accounts can create and manage views in a central governance account while maintaining security and control over their data assets.
- AWS analytics at re:Invent 2025: Unifying Data, AI, and governance at scale by Larry Weber on January 7, 2026 at 10:44 pm
re:Invent 2025 showcased the bold Amazon Web Services (AWS) vision for the future of analytics, one where data warehouses, data lakes, and AI development converge into a seamless, open, intelligent platform, with Apache Iceberg compatibility at its core. Across over 18 major announcements spanning three weeks, AWS demonstrated how organizations can break down data silos,
- Amazon EMR Serverless eliminates local storage provisioning, reducing data processing costs by up to 20% by Karthik Prabhakar on January 6, 2026 at 10:45 pm
In this post, you’ll learn how Amazon EMR Serverless eliminates the need to configure local disk storage for Apache Spark workloads through a new serverless storage capability. We explain how this feature automatically handles shuffle operations, reduces data processing costs by up to 20%, prevents job failures from disk capacity constraints, and enables elastic scaling by decoupling storage from compute.
- Building scalable AWS Lake Formation governed data lakes with dbt and Amazon Managed Workflows for Apache Airflow by Abhilasha Agarwal on January 6, 2026 at 10:37 pm
Organizations often struggle with building scalable and maintainable data lakes—especially when handling complex data transformations, enforcing data quality, and monitoring compliance with established governance. Traditional approaches typically involve custom scripts and disparate tools, which can increase operational overhead and complicate access control. A scalable, integrated approach is needed to simplify these processes, improve data reliability,
- Simplify multi-warehouse data governance with Amazon Redshift federated permissions by Satesh Sonti on January 5, 2026 at 9:20 pm
Amazon Redshift federated permissions simplify permissions management across multiple Redshift warehouses. In this post, we show you how to define data permissions one time and automatically enforce them across warehouses in your AWS account, removing the need to re-create security policies in each warehouse.
- Simplified management of Amazon MSK with natural language using Kiro CLI and Amazon MSK MCP Server by Kalyan Janaki on December 24, 2025 at 5:55 pm
In this post, we demonstrate how Kiro CLI and the MSK MCP server can streamline your Kafka management. Through practical examples and demonstrations, we show you how to use these tools to perform common administrative tasks efficiently while maintaining robust security and reliability.
- Unifying governance and metadata across Amazon SageMaker Unified Studio and Atlan by Karan Singh Thakur, Satabrata Paul on December 22, 2025 at 6:17 pm
In this post, we show you how to unify governance and metadata across Amazon SageMaker Unified Studio and Atlan through a comprehensive bidirectional integration. You’ll learn how to deploy the necessary AWS infrastructure, configure secure connections, and set up automated synchronization to maintain consistent metadata across both platforms.
- Modernize Apache Spark workflows using Spark Connect on Amazon EMR on Amazon EC2 by Philippe Wanner on December 18, 2025 at 9:29 pm
In this post, we demonstrate how to implement Apache Spark Connect on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) to build decoupled data processing applications. We show how to set up and configure Spark Connect securely, so you can develop and test Spark applications locally while executing them on remote Amazon EMR clusters.
- How Taxbit achieved cost savings and faster processing times using Amazon S3 Tables by Larry Christensen on December 18, 2025 at 9:27 pm
In this post, we discuss how Taxbit partnered with Amazon Web Services (AWS) to streamline their crypto tax analytics solution using Amazon S3 Tables, achieving 82% cost savings and five times faster processing times.
- Create and update Apache Iceberg tables with partitions in the AWS Glue Data Catalog using the AWS SDK and AWS CloudFormation by Aarthi Srinivasan on December 18, 2025 at 9:22 pm
In this post, we show how to create and update Iceberg tables with partitions in the Data Catalog using the AWS SDK and AWS CloudFormation.
- Power data ingestion into Splunk using Amazon Data Firehose by Tarik Makota on December 17, 2025 at 6:52 pm
With Kinesis Data Firehose, customers can use a fully managed, reliable, and scalable data streaming solution to Splunk. In this post, we tell you a bit more about the Kinesis Data Firehose and Splunk integration. We also show you how to ingest large amounts of data into Splunk using Kinesis Data Firehose.
- Best practices for querying Apache Iceberg data with Amazon Redshift by Anusha Challa on December 17, 2025 at 5:15 pm
In this post, we discuss the best practices that you can follow while querying Apache Iceberg data with Amazon Redshift.
- IPv6 addressing with Amazon Redshift by Srini Ponnada on December 17, 2025 at 4:51 pm
As we witness the gradual transition from IPv4 to IPv6, AWS continues to expand its support for dual-stack networking across its service portfolio. In this post, we show how you can migrate your Amazon Redshift Serverless workgroup from IPv4-only to dual-stack mode, so you can make your data warehouse future ready.
- Reference guide for building a self-service analytics solution with Amazon SageMaker by Navnit Shukla on December 16, 2025 at 9:47 pm
In this post, we show how to use Amazon SageMaker Catalog to publish data from multiple sources, including Amazon S3, Amazon Redshift, and Snowflake. This approach enables self-service access while ensuring robust data governance and metadata management.
- Introducing the Apache Spark troubleshooting agent for Amazon EMR and AWS Glue by Jake Zych on December 16, 2025 at 2:02 am
In this post, we show you how the Apache Spark troubleshooting agent helps analyze Apache Spark issues by providing detailed root causes and actionable recommendations. You’ll learn how to streamline your troubleshooting workflow by integrating this agent with your existing monitoring solutions across Amazon EMR and AWS Glue.
- Introducing Apache Spark upgrade agent for Amazon EMR by Keerthi Chadalavada on December 16, 2025 at 1:04 am
In this post, you learn how to assess your existing Amazon EMR Spark applications, use the Spark upgrade agent directly from the Kiro IDE, upgrade a sample e-commerce order analytics Spark application project (including build configs, source code, tests, and data quality validation), and review code changes before rolling them out through your CI/CD pipeline.
- Accelerate Apache Hive read and write on Amazon EMR using enhanced S3A by Ramesh Kandasamy on December 15, 2025 at 9:55 pm
In this post, we demonstrate how Apache Hive on Amazon EMR 7.10 delivers significant performance improvements for both read and write operations on Amazon S3.