
AWS S3 Collector versus Amazon S3 Source (SQS) - Which to Pick?

October 28, 2025

Josh Rice

You’ve got logs landing in S3: maybe CloudTrail, maybe Snowflake, maybe CrowdStrike FDR. The question that inevitably comes up is:

 

“Should we use an S3 Collector or an Amazon S3 Source (via SQS) to get them into Cribl Stream?”

 

It’s a great question, and like many things in the security and observability space, the answer depends on how you want to work with your data. Let’s break it down.

 

What’s an S3 Collector Source?

 

Think of the S3 Collector Source as a batch retrieval engine. It’s perfect when you need to reach back in time and grab a chunk of data from S3 — say, “get me three hours of logs from noon yesterday for host==abcd.” It doesn’t care about real-time ingestion or message queues. You tell it what timeframe and what filters you want, and it goes and gets them, either on demand or on a schedule.

 

Ideal when you:

  • Need to rehydrate historical logs (for example, for a retroactive investigation)
  • Want to replay old data for validation, benchmarking, or pipeline testing
  • Work with data that’s organized neatly by date or prefix (e.g., YYYY/MM/DD/hh/mm/...)
  • Don’t need continuous ingestion, just periodic snapshots
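
If you want to picture what that batch model amounts to on the AWS side, here’s a minimal boto3 sketch (not Cribl’s implementation; the bucket name is a placeholder) that answers the “three hours of logs from noon yesterday” request by listing objects under hourly date prefixes:

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "example-log-archive"  # hypothetical bucket name

def list_window(start: datetime, hours: int) -> list[str]:
    """List object keys for each hourly prefix in the window,
    assuming the YYYY/MM/DD/hh/... key layout described above."""
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for h in range(hours):
        prefix = (start + timedelta(hours=h)).strftime("%Y/%m/%d/%H/")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
            keys += [obj["Key"] for obj in page.get("Contents", [])]
    return keys

# "Three hours of logs from noon yesterday"
noon_yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).replace(
    hour=12, minute=0, second=0, microsecond=0
)
for key in list_window(noon_yesterday, hours=3):
    print(key)  # apply finer filters (e.g., host==abcd) before fetching
```

Notice that the time range maps directly onto prefixes, which is why this approach works best when the bucket is organized by date.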

 

What’s the Amazon S3 Source (with SQS)?

 

The Amazon S3 Source takes a more event-driven approach. It uses SQS notifications to track new objects as they’re written to S3 and then ingests those files automatically. In other words: it’s a “pull as you go” model — continuous, near-real-time ingestion without having to define time ranges or run manual jobs.

 

Ideal when you:

  • Want to ingest new data continuously as it lands in S3
  • Are collecting logs from many sources that drop files at unpredictable intervals
  • Don’t want to waste time or compute scanning the bucket for already-processed data
  • Need high efficiency and low duplication risk
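
For a concrete picture of the pattern this Source automates, here’s a rough boto3 sketch (again, not Cribl’s code; the queue URL and the process() handler are placeholders): long-poll the queue, read each S3 notification, fetch the named object, then delete the message so the same file is never picked up twice.

```python
import json
from urllib.parse import unquote_plus

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/log-events"  # hypothetical

def process(data: bytes) -> None:
    ...  # stand-in for handing the file to your pipeline

while True:
    # Long-poll the queue for S3 event notifications
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Keys in S3 notifications arrive URL-encoded
            key = unquote_plus(record["s3"]["object"]["key"])
            obj = s3.get_object(Bucket=bucket, Key=key)
            process(obj["Body"].read())
        # Deleting the message is what prevents re-processing
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

No bucket scanning, no time ranges: the queue tells the consumer exactly which new files exist.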

 

 

| | S3 Collector Source | Amazon S3 Source (SQS) |
| --- | --- | --- |
| Ingestion Model | Batch / scheduled | Continuous / event-driven |
| Best For | Historical replays, backfills | Real-time ingestion |
| Data Targeting | Time-range or prefix-based | File-specific via SQS messages |
| Performance | Scans defined ranges (may be slower) | Efficiently picks up only new files |
| Duplication Risk | Possible if schedules overlap | Low (SQS messages are only created for new files) |
| Setup Complexity | Simpler to start (no queue needed) | Slightly more setup (SQS policy + subscription) |
| Recovery Flexibility | Easy to replay old data | Harder to backfill historical ranges |
| Use Cases | Bulk ingestion / reingestion, forensics, testing | Streaming, operational monitoring, continuous logs |
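
A note on the “slightly more setup” row: most of it is wiring the bucket’s event notifications to the queue on the AWS side. Roughly, and with placeholder names and ARNs, that wiring looks like this:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names; the queue's access policy must also allow the
# S3 service principal to send messages before this call will succeed.
s3.put_bucket_notification_configuration(
    Bucket="example-log-archive",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:log-events",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```

Once the notifications are flowing, the Source is simply pointed at that queue.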

 

Summary

 

Both options achieve the same goal: getting your data out of S3 and into Cribl Stream. The real difference comes down to how you work and what your priorities are. If your team regularly needs to go back in time, replay logs for investigations, or validate data flows after a configuration change, the S3 Collector is ideal. It gives you the control to pull exactly what you need, when you need it.

 

If your focus is on keeping data flowing in real time, the Amazon S3 Source with SQS is built for that continuous pace. It automatically keeps up as new logs land in S3, without the manual work of defining time ranges or worrying about overlap. It is a set-and-forget option that keeps data fresh and consistent across your environment.

 

In many cases, the best answer is not one or the other, but both. Use the S3 Collector when you need to look back or recover lost data, and use the S3 Source with SQS when you need nonstop ingestion. Cribl Stream gives you the flexibility to mix and match depending on your workload so you can build a data pipeline that fits your team instead of forcing your team to fit the pipeline.