AWS Big Data Blog

Official Big Data Blog of Amazon Web Services

  • Amazon Redshift announces history mode for zero-ETL integrations to simplify historical data tracking and analysis
    by Raks Khare on February 18, 2025 at 9:13 pm

    This post explores a brief history of zero-ETL, its importance for customers, and introduces an exciting new feature: history mode for Amazon Aurora PostgreSQL-Compatible Edition, Amazon Aurora MySQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon DynamoDB zero-ETL integrations with Amazon Redshift.

  • Streamline AWS WAF log analysis with Apache Iceberg and Amazon Data Firehose
    by Charishma Makineni on February 18, 2025 at 9:12 pm

    In this post, we demonstrate how to build a scalable AWS WAF log analysis solution using Firehose and Apache Iceberg. Firehose simplifies the entire process—from log ingestion to storage—by allowing you to configure a delivery stream that delivers AWS WAF logs directly to Apache Iceberg tables in Amazon S3. The solution requires no infrastructure setup, and you pay only for the data you process.
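
The delivery stream described above can be sketched with boto3. The ARNs, names, and nested parameter shapes below are illustrative assumptions; verify them against the `create_delivery_stream` reference for your SDK version before use.

```python
# Sketch: parameters for an Amazon Data Firehose stream that delivers
# records to an Apache Iceberg table in Amazon S3 via the AWS Glue catalog.
# All ARNs and names are placeholders.

def iceberg_stream_params(stream_name, role_arn, catalog_arn, bucket_arn,
                          database, table):
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",
        "IcebergDestinationConfiguration": {
            "RoleARN": role_arn,
            "CatalogConfiguration": {"CatalogARN": catalog_arn},
            "DestinationTableConfigurationList": [
                {"DestinationDatabaseName": database,
                 "DestinationTableName": table}
            ],
            # Firehose still needs an S3 location for error output
            "S3Configuration": {"RoleARN": role_arn, "BucketARN": bucket_arn},
        },
    }

params = iceberg_stream_params(
    "waf-logs-to-iceberg",
    "arn:aws:iam::123456789012:role/firehose-iceberg-role",
    "arn:aws:glue:us-east-1:123456789012:catalog",
    "arn:aws:s3:::my-waf-log-bucket",
    "waf_analytics", "waf_logs",
)
# import boto3
# boto3.client("firehose").create_delivery_stream(**params)
print(params["DeliveryStreamName"])
```

With AWS WAF logging pointed at this stream, records land in the Iceberg table with no intermediate infrastructure to manage.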

  • Migrate from Standard brokers to Express brokers in Amazon MSK using Amazon MSK Replicator
    by Subham Rakshit on February 13, 2025 at 10:09 pm

    Creating a new cluster with Express brokers is straightforward, as described in Amazon MSK Express brokers. However, Express brokers offer a different user experience and a different shared responsibility boundary, so you can't enable them on an existing cluster; if you have an existing MSK cluster, you need to migrate to a new Express-based cluster. In this post, we discuss how to plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. You can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster comprising Express brokers.
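
The replication step can be sketched with boto3. Cluster ARNs, subnets, and the exact parameter shapes here are placeholder assumptions modeled on the `kafka.create_replicator` API; check your SDK's reference before relying on them.

```python
# Sketch: parameters for an MSK Replicator that copies data and metadata
# from an existing Standard-broker cluster to a new Express-broker cluster.
# All ARNs and IDs are placeholders.

def replicator_params(source_arn, target_arn, role_arn, subnets, security_groups):
    vpc = {"SubnetIds": subnets, "SecurityGroupIds": security_groups}
    return {
        "ReplicatorName": "standard-to-express",
        "ServiceExecutionRoleArn": role_arn,
        "KafkaClusters": [
            {"AmazonMskCluster": {"MskClusterArn": source_arn}, "VpcConfig": vpc},
            {"AmazonMskCluster": {"MskClusterArn": target_arn}, "VpcConfig": vpc},
        ],
        "ReplicationInfoList": [{
            "SourceKafkaClusterArn": source_arn,
            "TargetKafkaClusterArn": target_arn,
            "TargetCompressionType": "NONE",
            "TopicReplication": {"TopicsToReplicate": [".*"]},
            "ConsumerGroupReplication": {"ConsumerGroupsToReplicate": [".*"]},
        }],
    }

params = replicator_params(
    "arn:aws:kafka:us-east-1:123456789012:cluster/standard/abc",
    "arn:aws:kafka:us-east-1:123456789012:cluster/express/def",
    "arn:aws:iam::123456789012:role/msk-replicator-role",
    ["subnet-1a", "subnet-1b", "subnet-1c"],
    ["sg-0123"],
)
# import boto3
# boto3.client("kafka").create_replicator(**params)
print(params["ReplicatorName"])
```

Once replication catches up, producers and consumers can be cut over to the Express cluster.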

  • Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI
    by Lakshmi Nair on February 13, 2025 at 10:07 pm

    In this post, we discuss the foundational building blocks of SageMaker Unified Studio and how, by abstracting complex technical implementations behind user-friendly interfaces, organizations can maintain standardized governance while enabling efficient resource management across business units. This approach ensures consistency in infrastructure deployment while providing the flexibility needed for diverse business requirements.

  • Amazon Redshift Serverless adds higher base capacity of up to 1024 RPUs
    by Ricardo Serafim on February 10, 2025 at 6:40 pm

    In this post, we explore the new higher base capacity of 1024 RPUs in Redshift Serverless, which doubles the previous maximum of 512 RPUs. This enhancement helps you get high performance from workloads with highly complex queries and write-intensive operations, including concurrent data ingestion and transformation tasks that require high throughput and low latency.
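
Raising a workgroup's base capacity is a single API call. The workgroup name below is a placeholder, and the boto3 call is shown commented out so the sketch stays self-contained; the documented minimum of 8 RPUs is used as the lower bound.

```python
# Sketch: raise the base capacity of a Redshift Serverless workgroup
# to the new 1024 RPU maximum using boto3's redshift-serverless client.

MAX_BASE_CAPACITY_RPUS = 1024  # new maximum described in this post
MIN_BASE_CAPACITY_RPUS = 8     # documented minimum

def base_capacity_update(workgroup_name, rpus=MAX_BASE_CAPACITY_RPUS):
    if not MIN_BASE_CAPACITY_RPUS <= rpus <= MAX_BASE_CAPACITY_RPUS:
        raise ValueError(
            f"base capacity must be between {MIN_BASE_CAPACITY_RPUS} "
            f"and {MAX_BASE_CAPACITY_RPUS} RPUs"
        )
    return {"workgroupName": workgroup_name, "baseCapacity": rpus}

params = base_capacity_update("analytics-wg")
# import boto3
# boto3.client("redshift-serverless").update_workgroup(**params)
print(params)
```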

  • Use DeepSeek with Amazon OpenSearch Service vector database and Amazon SageMaker
    by Jon Handler on February 7, 2025 at 9:21 pm

    OpenSearch Service provides rich capabilities for RAG use cases, as well as vector embedding-powered semantic search. You can use the flexible connector framework and search flow pipelines in OpenSearch to connect to models hosted by DeepSeek, Cohere, and OpenAI, as well as models hosted on Amazon Bedrock and SageMaker. In this post, we build a connection to DeepSeek’s text generation model, supporting a RAG workflow to generate text responses to user queries.
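
A connector of the kind described above is created by POSTing a JSON body to the ml-commons connector API. The endpoint URL, IAM role, and `request_body` template below are placeholder assumptions modeled on the published connector blueprint format, not the post's actual configuration.

```python
# Sketch: a request body for the OpenSearch ml-commons connector API
# (POST /_plugins/_ml/connectors/_create) pointing at a SageMaker-hosted
# text generation model. All ARNs, URLs, and templates are placeholders.
import json

connector_body = {
    "name": "deepseek-text-generation",
    "description": "Connector to a DeepSeek model hosted on SageMaker",
    "version": "1",
    "protocol": "aws_sigv4",  # SigV4-signed calls to an AWS endpoint
    "parameters": {"region": "us-east-1", "service_name": "sagemaker"},
    "credential": {
        "roleArn": "arn:aws:iam::123456789012:role/opensearch-sagemaker-role"
    },
    "actions": [{
        "action_type": "predict",
        "method": "POST",
        "url": ("https://runtime.sagemaker.us-east-1.amazonaws.com"
                "/endpoints/deepseek-demo/invocations"),
        "headers": {"content-type": "application/json"},
        "request_body": '{"inputs": "${parameters.inputs}"}',
    }],
}
# POST this body to <domain-endpoint>/_plugins/_ml/connectors/_create,
# then register and deploy a model that references the connector ID.
print(json.dumps(connector_body, indent=2)[:80])
```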

  • Handle errors in Apache Flink applications on AWS
    by Alexis Tekin on February 6, 2025 at 3:16 pm

    This post discusses strategies for handling errors in Apache Flink applications. However, the general principles discussed here apply to stream processing applications at large.

  • How Open Universities Australia modernized their data platform and significantly reduced their ETL costs with AWS Cloud Development Kit and AWS Step Functions
    by Michael Davies on January 30, 2025 at 1:14 pm

    At Open Universities Australia (OUA), we empower students to explore a vast array of degrees from renowned Australian universities, all delivered through online learning. In this post, we show you how we used AWS services to replace our existing third-party ETL tool, improving the team’s productivity and producing a significant reduction in our ETL operational costs.

  • Hybrid big data analytics with Amazon EMR on AWS Outposts
    by Shoukat Ghouse on January 29, 2025 at 9:20 pm

    In this post, we dive into the transformative features of EMR on Outposts, showcasing its flexibility as a native hybrid data analytics service that allows seamless data access and processing both on premises and in the cloud.

  • How MuleSoft achieved cloud excellence through an event-driven Amazon Redshift lakehouse architecture
    by Sean Zou on January 28, 2025 at 4:42 pm

    In our previous thought leadership blog post, Why a Cloud Operating Model, we defined a COE Framework, showed why MuleSoft implemented it, and described the benefits they received from it. In this post, we dive into the technical implementation, describing how MuleSoft used Amazon EventBridge, Amazon Redshift, Amazon Redshift Spectrum, Amazon S3, and AWS Glue to implement it.

  • OpenSearch Vector Engine is now disk-optimized for low cost, accurate vector search
    by Dylan Tong on January 24, 2025 at 8:21 pm

    OpenSearch Vector Engine can now run vector search at a third of the cost on OpenSearch 2.17+ domains. You can now configure k-NN (vector) indexes to run in disk mode, optimizing them for memory-constrained environments and enabling low-cost, accurate vector search that responds in the low hundreds of milliseconds. Disk mode provides an economical alternative to memory mode when you don’t need near single-digit millisecond latency. In this post, you’ll learn about the benefits of this new feature, the underlying mechanics, customer success stories, and how to get started.
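
Disk mode is set per vector field in the index mapping. A minimal sketch follows; the field name and dimension are placeholders, and other options (such as compression level) are covered in the OpenSearch k-NN documentation.

```python
# Sketch: an OpenSearch 2.17+ index mapping that puts a k-NN vector
# field in on-disk mode, trading some latency for much lower memory use.
import json

def disk_mode_index_body(field="embedding", dimension=768):
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "mode": "on_disk",  # disk-optimized instead of in-memory
                }
            }
        },
    }

body = disk_mode_index_body()
# PUT this body to https://<domain-endpoint>/<index-name> to create the index
print(json.dumps(body, indent=2))
```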

  • Access Apache Iceberg tables in Amazon S3 from Databricks using AWS Glue Iceberg REST Catalog in Amazon SageMaker Lakehouse
    by Srividya Parthasarathy on January 23, 2025 at 4:59 pm

    In this post, we show you how Databricks on AWS general purpose compute can integrate with the AWS Glue Iceberg REST Catalog for metadata access and use Lake Formation for data access. To keep the setup in this post straightforward, the Glue Iceberg REST Catalog and Databricks cluster share the same AWS account.

  • Generate vector embeddings for your data using AWS Lambda as a processor for Amazon OpenSearch Ingestion
    by Jagadish Kumar on January 21, 2025 at 6:08 pm

    In this post, we demonstrate how to use OpenSearch Ingestion’s Lambda processor to generate embeddings for your source data and ingest them into an OpenSearch Serverless vector collection. This solution uses the flexibility of OpenSearch Ingestion pipelines with a Lambda processor to dynamically generate embeddings.
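
The Lambda function behind such a processor receives a batch of documents and returns the transformed batch. The event shape (a plain list of documents) and the stub `embed` function below are assumptions for illustration; a real handler would call an embedding model, for example via Amazon Bedrock.

```python
# Sketch of a Lambda handler used as an OpenSearch Ingestion processor:
# it attaches an embedding to each document in the batch and returns
# the batch unchanged otherwise. The embed() function is a placeholder.

def embed(text):
    # Placeholder: stands in for a call to a real embedding model endpoint.
    return [float(len(text) % 7), 0.0, 1.0]

def lambda_handler(event, context):
    # Normalize to a list so single-document invocations also work.
    documents = event if isinstance(event, list) else [event]
    for doc in documents:
        doc["embedding"] = embed(doc.get("text", ""))
    return documents

out = lambda_handler([{"text": "hello world"}, {"text": "vector search"}], None)
print(out)
```

The pipeline then sinks the enriched documents into the vector collection, where the `embedding` field maps to a k-NN vector field.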

  • Automate topic provisioning and configuration using Terraform with Amazon MSK
    by Vijay Kardile on January 16, 2025 at 4:45 pm

    In this post, we address common challenges associated with manual MSK topic configuration management and present a robust Terraform-based solution. This solution supports both provisioned and serverless MSK clusters.

  • How EUROGATE established a data mesh architecture using Amazon DataZone
    by Dr. Leonard Heilig on January 15, 2025 at 5:37 pm

    In this post, we show you how EUROGATE uses AWS services, including Amazon DataZone, to make data discoverable by data consumers across different business units so that they can innovate faster. Two use cases illustrate how this can be applied for business intelligence (BI) and data science applications, using AWS services such as Amazon Redshift and Amazon SageMaker.

  • Juicebox recruits Amazon OpenSearch Service’s vector database for improved talent search
    by Ishan Gupta on January 14, 2025 at 6:41 pm

    Juicebox is an AI-powered talent sourcing search engine, using advanced natural language models to help recruiters identify the best candidates from a vast dataset of over 800 million profiles. At the core of this functionality is Amazon OpenSearch Service, which provides the backbone for Juicebox’s powerful search infrastructure, enabling a seamless combination of traditional full-text search methods with modern, cutting-edge semantic search capabilities. In this post, we share how Juicebox uses OpenSearch Service for improved search.

  • Batch data ingestion into Amazon OpenSearch Service using AWS Glue
    by Ravikiran Rao on January 13, 2025 at 8:50 pm

    This post showcases how to use Spark on AWS Glue to seamlessly ingest data into OpenSearch Service. We cover batch ingestion methods, share practical examples, and discuss best practices to help you build optimized and scalable data pipelines on AWS.
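
Writing from a Glue Spark job to OpenSearch Service typically goes through the opensearch-hadoop (opensearch-spark) connector. The option names below follow that connector's conventions, and the endpoint, index name, and batch size are placeholder assumptions.

```python
# Sketch: connector options a Glue Spark job might use to write a
# DataFrame to an OpenSearch Service domain. Values are placeholders.

def opensearch_write_options(endpoint, index):
    return {
        "opensearch.nodes": endpoint,
        "opensearch.port": "443",
        "opensearch.net.ssl": "true",
        "opensearch.nodes.wan.only": "true",      # needed for managed domains
        "opensearch.resource": index,             # target index
        "opensearch.batch.size.entries": "1000",  # tune for throughput
    }

opts = opensearch_write_options(
    "vpc-demo-domain.us-east-1.es.amazonaws.com", "web-logs"
)
# In the Glue job, with a Spark DataFrame `df`:
# df.write.format("org.opensearch.spark.sql").options(**opts) \
#     .mode("append").save()
print(opts["opensearch.resource"])
```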

  • Build a high-performance quant research platform with Apache Iceberg
    by Guy Bachar on January 9, 2025 at 8:55 pm

    In our previous post Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg, we showed how to use Apache Iceberg in the context of strategy backtesting. In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Iceberg. Our experiments are based on real-world historical full order book data, provided by our partner CryptoStruct, and compare the trade-offs between these choices, focusing on performance, cost, and quant developer productivity.

  • Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques
    by Aruna Govindaraju on January 9, 2025 at 6:16 pm

    This blog post introduces a new disk-based vector search approach that allows efficient querying of vectors stored on disk without loading them entirely into memory. By implementing these quantization methods, organizations can achieve compression ratios of up to 64x, enabling cost-effective scaling of vector databases for large-scale AI and machine learning applications.

  • Use CI/CD best practices to automate Amazon OpenSearch Service cluster management operations
    by Camille BIRBES on January 7, 2025 at 5:51 pm

    This post explores how to automate Amazon OpenSearch Service cluster management using CI/CD best practices. It presents two options: the Terraform OpenSearch provider and the Evolution library. The solution demonstrates how to use AWS CDK, Lambda, and CodeBuild to implement automated index template creation and management. By applying these techniques, organizations can improve the consistency, reliability, and efficiency of their OpenSearch operations.
