Efficient Data Processing in Python: Batch vs Streaming Pipelines Explained
Every data pipeline makes a fundamental choice before any code is written: does it process data in chunks on a schedule, or does it process data continuously as it arrives? This choice — batch versus streaming — shapes the architecture of everything downstream. The tools you use, the guarantees you can make about data freshness, the complexity of your error handling, and the infrastructure you need to run it all follow directly from this decision. Getting it wrong is expensive. Teams that build streaming pipelines when batch would have sufficed end up maintaining complex infrastructure for a problem that didn't require it. Teams that build batch pipelines when their use case demands real-time processing discover the gap at the worst possible moment — when a stakeholder asks why the dashboard is six hours out of date. In this article, you'll learn what batch and streaming pipelines actually are, how they differ in terms of architecture and tradeoffs, and how to implement both patterns in Python. By the end, you'll have a clear framework for choosing the right approach for any data engineering problem you solve. To follow along comfortably, make sure you have: Practice writing Python functions and working with modules Familiarity with pandas DataFrames and basic data manipulation A general understanding of what ETL pipelines do — extract, transform, load Prerequisites What Is a Batch Pipeline? Implementing a Batch Pipeline in Python When Batch Works Well What Is a Streaming Pipeline? Implementing a Streaming Pipeline in Python When Streaming Works Well The Key Differences at a Glance Choosing Between Batch and Streaming The Hybrid Pattern: Lambda and Kappa Architectures A batch pipeline processes a bounded, finite collection of records together — a file, a database snapshot, a day's worth of transactions. It runs on a schedule, say, hourly, nightly, weekly, reads all the data for that period, transforms it, and writes the result somewhere. Then it stops and waits until the next run. The mental model is simple: collect, then process. Nothing happens between runs. In a retail ETL context, a typical batch pipeline might look like this: At midnight, extract all orders placed in the last 24 hours from the transactional database Join with the product catalogue and customer dimension tables Compute daily revenue aggregates by region and product category Load the results into the data warehouse for reporting The pipeline runs, finishes, and produces a complete, consistent snapshot of yesterday's business. By the time analysts arrive in the morning, the warehouse is up to date. A batch pipeline in its simplest form is a Python script with three clearly separated stages: extract, transform, load. Let's walk through what this code is doing: The three functions are deliberately kept separate. This separation — extract, transform, load — makes each stage independently testable, replaceable, and debuggable. If the transform logic changes, you don't need to modify the extract or load code. Batch pipelines are the right choice when: Data freshness requirements are measured in hours, not seconds.A daily sales report doesn't need to be updated every minute. A weekly marketing attribution model certainly doesn't. You're processing large historical datasets.Backfilling two years of transaction history into a new data warehouse is inherently a batch job — the data exists, it's bounded, and you want to process it as efficiently as possible in one run. Consistency matters more than latency.Batch pipelines produce complete, point-in-time snapshots. Every row in the output was computed from the same input state. This consistency is valuable for financial reporting, regulatory compliance, and any downstream process that requires a stable, reproducible dataset. A streaming pipeline processes data continuously, record by record or in small micro-batches, as it arrives. There is no "end" to the dataset — the pipeline runs indefinitely, consuming events from a source like a message queue, a Kafka topic, or a webhook, and processing each one as it comes in. The mental model is: process as you collect. The pipeline is always running. In the same retail ETL context, a streaming pipeline might handle order events as they're placed: An order is placed on the website and an event is published to a message queue The streaming pipeline consumes the event within milliseconds It validates, enriches, and routes the event to downstream systems The fraud detection service, the inventory system, and the real-time dashboard all receive updated information immediately The difference from batch is fundamental: the data isn't sitting in a file waiting to be processed. It's flowing, and the pipeline has to keep up. Python's generator functions are the natural building block for streaming pipelines. A generator produces values one at a time and pauses between yields — which maps directly onto the idea of processing records as they arrive without loading everything into memory. Here's what's happening: Streaming pipelines are the right choice when: Data freshness is measured in seconds or milliseconds.Fraud detection, real-time inventory updates, live dashboards, and alerting systems all require data to be processed immediately — a batch job running every hour would make them useless. The data volume is too large to accumulate.High-frequency IoT sensor data, clickstream events, and financial tick data can generate millions of records per hour. Accumulating all of that before processing is often impractical – you'd need enormous storage and the processing job would take too long to be useful. You need to react, not just report.Streaming pipelines can trigger downstream actions — send a notification, block a transaction, update a recommendation — in response to individual events. Batch pipelines can only report on what already happened. Here is an overview of the differences between batch and stream processing we've discussed thus far: Okay, all of this info is great. But howdo you choose between batch and stream processing? The decision comes down to three questions: How fresh does the data need to be?If stakeholders can tolerate results that are hours old, batch is simpler and more cost-effective. If they need results within seconds, streaming is unavoidable. How complex is your processing logic?Batch jobs can join across large datasets, run expensive aggregations, and apply complex business logic without worrying about latency. Streaming pipelines must process each event quickly, which constrains how much work you can do per record. What's your operational capacity?Streaming infrastructure — Kafka clusters, Flink or Spark Streaming jobs, dead-letter queues, exactly-once delivery guarantees — is significantly more complex to operate than a scheduled Python script. If your team is small or your use case doesn't demand real-time results, that complexity is cost without benefit. Start with batch. It's simpler to build, simpler to test, simpler to debug, and simpler to maintain. Move to streaming when a specific, concrete requirement — not a hypothetical future one — makes batch insufficient. Most data problems are batch problems, and the ones that genuinely require streaming are usually obvious when you run into them. And as you might have guessed, you may need to combine them for some data processing systems. Which is why hybrid approaches exist. In practice, many production data systems use both patterns together. The two most common hybrid architectures are: Lambda and Kappa architecture. Lambda architectureruns a batch layer and a streaming layer in parallel. The batch layer processes complete historical data and produces accurate, consistent results on a delay. The streaming layer processes live data and produces approximate results immediately. Downstream consumers merge both outputs — using the streaming result for freshness and the batch result for correctness. The tradeoff is operational complexity: you're maintaining two separate processing codebases that must produce semantically equivalent results. Kappa architecturesimplifies this by using only a streaming layer, but with the ability to replay historical data through the same pipeline when you need batch-style reprocessing. This works well when your streaming framework like Apache Kafka and Apache Flink supports log retention and replay. You get one codebase, one set of logic, and the ability to reprocess history when your pipeline changes. Neither architecture is universally better. Lambda is more common in organizations that adopted batch processing first and added streaming incrementally. Kappa is more common in systems designed with streaming as the primary pattern. Batch and streaming are tools with different tradeoffs, each suited to a different class of problems. Batch pipelines excel at consistency, simplicity, and bulk throughput. Streaming pipelines excel at latency, reactivity, and continuous processing. Understanding both patterns at the architectural level — before reaching for specific frameworks like Apache Spark, Kafka, or Flink — gives you the judgment to choose the right one and explain that choice clearly. The frameworks implement these patterns, while the judgment about which pattern fits your problem is yours to make first.Prerequisites
What Is a Batch Pipeline?
Implementing a Batch Pipeline in Python
import pandas as pdfrom datetime import datetime, timedeltadef extract(filepath: str) -> pd.DataFrame: """Load raw orders from a daily export file.""" df = pd.read_csv(filepath, parse_dates=["order_timestamp"]) return dfdef transform(df: pd.DataFrame) -> pd.DataFrame: """Clean and aggregate orders into daily revenue by region.""" # Filter to completed orders only df = df[df["status"] == "completed"].copy() # Extract date from timestamp for grouping df["order_date"] = df["order_timestamp"].dt.date # Aggregate: total revenue and order count per region per day summary = ( df.groupby(["order_date", "region"]) .agg( total_revenue=("order_value_gbp", "sum"), order_count=("order_id", "count"), avg_order_value=("order_value_gbp", "mean"), ) .reset_index() ) return summarydef load(df: pd.DataFrame, output_path: str) -> None: """Write the aggregated result to the warehouse (here, a CSV).""" df.to_csv(output_path, index=False) print(f"Loaded { len(df)} rows to { output_path}")# Run the pipelineraw = extract("orders_2024_06_01.csv")aggregated = transform(raw)load(aggregated, "warehouse/daily_revenue_2024_06_01.csv")extractreads a CSV file representing a daily order export. The parse_datesargument tells pandas to interpret the order_timestampcolumn as a datetime object rather than a plain string — this matters for the date extraction step in transform.transformdoes two things: it filters out any orders that didn't complete (returns, cancellations), and then groups the remaining orders by date and region to produce revenue aggregates. The .agg()call computes three metrics per group in a single pass.loadwrites the result to a destination — in production this would be a database insert or a cloud storage upload, but the pattern is the same regardless.When Batch Works Well
What Is a Streaming Pipeline?
Implementing a Streaming Pipeline in Python
import jsonimport timefrom typing import Generator, Dictdef event_source(filepath: str) -> Generator[Dict, None, None]: """ Simulate a stream of order events from a file. In production, this would consume from Kafka or a message queue. """ with open(filepath, "r") as f: for line in f: event = json.loads(line.strip()) yield event time.sleep(0.01) # simulate arrival delay between eventsdef validate(event: Dict) -> bool: """Check that the event has the required fields and valid values.""" required_fields = ["order_id", "customer_id", "order_value_gbp", "region"] if not all(field in event for field in required_fields): return False if event["order_value_gbp"] <= 0: return False return Truedef enrich(event: Dict) -> Dict: """Add derived fields to the event before routing downstream.""" event["processed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S") event["value_tier"] = ( "high" if event["order_value_gbp"] >= 500 else "mid" if event["order_value_gbp"] >= 100 else "low" ) return eventdef run_streaming_pipeline(source_file: str) -> None: """Process each event as it arrives from the source.""" processed = 0 skipped = 0 for raw_event in event_source(source_file): if not validate(raw_event): skipped += 1 continue enriched_event = enrich(raw_event) # In production: publish to downstream topic or write to sink print(f"[{ enriched_event['processed_at']}] " f"Order { enriched_event['order_id']} | " f"£{ enriched_event['order_value_gbp']:.2f} | " f"tier={ enriched_event['value_tier']}") processed += 1 print(f"\nDone. Processed: { processed} | Skipped: { skipped}")run_streaming_pipeline("order_events.jsonl")event_sourceis a generator function — note the yieldkeyword instead of return. Each call to yield eventpauses the function and hands one event to the caller. The pipeline processes that event before the generator resumes and fetches the next one. This means only one event is in memory at a time, regardless of how large the stream is. The time.sleep(0.01)simulates the real-world delay between events arriving from a message queue.validatechecks each event for required fields and valid values before doing anything else with it. In a streaming context, bad events are super common — network issues, upstream bugs, and schema changes all produce malformed records. Validating early and skipping invalid events is far safer than letting them propagate into downstream systems.enrichadds derived fields to the event. This can be a processing timestamp and a value tier classification. In production, this step might also join against a lookup table, call an external API, or apply a model prediction.run_streaming_pipelineties it together. The forloop over event_sourceconsumes events one at a time, processes each through the validate → enrich → routestages, and keeps a running count of processed and skipped events.When Streaming Works Well
The Key Differences at a Glance
DIMENSION BATCH STREAMING Data model Bounded, finite dataset Unbounded, continuous flow Processing trigger Schedule (time or event) Arrival of each record Latency Minutes to hours Milliseconds to seconds Throughput High (optimized for bulk processing) Lower per-record overhead Complexity Lower Higher State management Stateless per run Often stateful across events Error handling Retry the whole job Per-event dead-letter queues Consistency Strong (point-in-time snapshot) Eventually consistent Best for Reporting, ML training, backfills Alerting, real-time features, event routing Choosing Between Batch and Streaming
The Hybrid Pattern: Lambda and Kappa Architectures
Conclusion
相关推荐
-
The Java Handbook – Learn Java Programming for Beginners
-
How to Optimize Enterprise Knowledge Graphs for Scalable Digital Product Platforms
-
The REST API Handbook – How to Build, Test, Consume, and Document REST APIs
-
Key Technical Design Decisions for Building an Educational App with LLMs
-
Grid Computing Platform
-
Learn JavaScript for Beginners – JS Basics Handbook
- 最近发表
-
- Prediction Model Tools
- Open Source Tools Every STEM Student Should Know About
- How to Build a PostgreSQL
- Key Technical Design Decisions for Building an Educational App with LLMs
- Statistical Analysis Software
- How to Build a Browser
- How to Build a Browser
- CSS Transform Handbook – Complete Guide to CSS Transform Functions and Properties
- Root Cause Analysis Platform
- How to Write Clean Code – Tips and Best Practices (Full Handbook)
- 随机阅读
-
- Nikhil Adithyan
- Open Source Tools Every STEM Student Should Know About
- The REST API Handbook – How to Build, Test, Consume, and Document REST APIs
- Learn JavaScript for Beginners – JS Basics Handbook
- The Software Architecture Handbook
- Database Version Control with Liquibase and Spring Boot
- Learn TypeScript – A Handbook for Developers
- The REST API Handbook – How to Build, Test, Consume, and Document REST APIs
- Risk Mitigation Planning
- Key Technical Design Decisions for Building an Educational App with LLMs
- The REST API Handbook – How to Build, Test, Consume, and Document REST APIs
- How to Choose the Best Stock Market API for FinTech Projects and AI Agents
- Machine Learning
- The GraphQL API Handbook – How to Build, Test, Consume and Document GraphQL APIs
- Learn JavaScript for Beginners – JS Basics Handbook
- Command Line for Beginners – How to Use the Terminal Like a Pro [Full Handbook]
- Diagram Software for Documentation
- CSS Transform Handbook – Complete Guide to CSS Transform Functions and Properties
- How to Choose the Best Stock Market API for FinTech Projects and AI Agents
- Database Version Control with Liquibase and Spring Boot
- 搜索
-