You’ve got logs landing in S3: maybe CloudTrail, maybe Snowflake, maybe CrowdStrike FDR. The question that inevitably comes up is:
“Should we use an S3 Collector or an Amazon S3 Source (via SQS) to get them into Cribl Stream?”
It’s a great question, and like many things in the security and observability space, the answer depends on how you want to work with your data. Let’s break it down.
What’s an S3 Collector Source?
Think of the S3 Collector Source as a batch retrieval engine. It’s perfect when you need to reach back in time and grab a chunk of data from S3, say, “get me three hours of logs from noon yesterday for host==abcd.” It doesn’t care about real-time ingestion or message queues. You tell it what timeframe and what filters you want, and it goes and gets the data, either on demand or on a schedule (a sketch of what this looks like under the hood follows the list below).
Ideal when you:
- Need to rehydrate historical logs (for example, for a retroactive investigation)
- Want to replay old data for validation, benchmarking, or pipeline testing
- Work with data that’s organized neatly by date or prefix (e.g., YYYY/MM/DD/hh/mm/...)
- Don’t need continuous ingestion, just periodic snapshots
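To make the batch model concrete, here is a minimal boto3 sketch of what “grab a time window from S3” amounts to under the hood. The bucket name, prefix layout, and host filter are illustrative assumptions; in practice, the S3 Collector’s path and filter settings do this work for you.

```python
# A minimal sketch of batch retrieval from S3 (hypothetical names throughout).
import boto3

s3 = boto3.client("s3")

BUCKET = "example-log-archive"       # hypothetical bucket
PREFIX = "logs/2024/05/14/12/"       # noon yesterday, in a YYYY/MM/DD/hh layout

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Filter to one host, assuming the host appears in the object key
        # (e.g., .../host=abcd/events.json).
        if "host=abcd" in obj["Key"]:
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            # ...hand the raw bytes to your parser or pipeline here
```

Nothing here watches for new files. You point it at a range and it fetches exactly that range, which is the whole appeal for replays and investigations.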
What’s the Amazon S3 Source (with SQS)?
The Amazon S3 Source takes a more event-driven approach. It uses SQS notifications to track new objects as they’re written to S3 and then ingests those files automatically. In other words, it’s a “pull as you go” model: continuous, near-real-time ingestion without having to define time ranges or run manual jobs (a minimal sketch of the polling loop follows the list below).
Ideal when you:
- Want to ingest new data continuously as it lands in S3
- Are collecting logs from many sources that drop files at unpredictable intervals
- Don’t want to waste time or compute scanning the bucket for already-processed data
- Need high efficiency and low duplication risk
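For contrast, here is a minimal sketch of the event-driven loop, assuming the bucket already publishes ObjectCreated notifications to an SQS queue (that wiring is shown after the comparison table). The queue URL is a placeholder, and the Amazon S3 Source runs this loop for you, with batching, retries, and visibility-timeout handling on top.

```python
# A minimal sketch of event-driven ingestion: poll SQS for S3 notifications,
# fetch each new object, then delete the message (hypothetical queue URL).
import json
import urllib.parse

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20  # long poll
    )
    for msg in resp.get("Messages", []):
        for record in json.loads(msg["Body"]).get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # S3 URL-encodes keys in notifications, so decode before fetching.
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            # ...process the new file here
        # Deleting the message marks the file as handled so it isn't picked
        # up again; no bucket scanning is needed at any point.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```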
| | S3 Collector Source | Amazon S3 Source (SQS) |
| --- | --- | --- |
| Ingestion Model | Batch / scheduled | Continuous / event-driven |
| Best For | Historical replays, backfills | Real-time ingestion |
| Data Targeting | Time-range or prefix based | File-specific via SQS messages |
| Performance | Scans defined ranges (may be slower) | Efficiently picks up only new files |
| Duplication Risk | Possible if schedules overlap | Low; SQS messages are only created for new files |
| Setup Complexity | Simpler to start (no queue needed) | Slightly more setup (SQS policy + subscription) |
| Recovery Flexibility | Easy to replay old data | Harder to backfill historical ranges |
| Use Cases | Bulk ingestion / reingestion, forensics, testing | Streaming, operational monitoring, continuous logs |
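The “SQS policy + subscription” setup the table mentions is a one-time wiring job. A hedged sketch of what it involves, with placeholder names and ARNs, looks like this:

```python
# One-time wiring: allow the bucket to publish to the queue, then subscribe
# the queue to ObjectCreated events (all names and ARNs are placeholders).
import json

import boto3

BUCKET = "example-log-archive"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:s3-events"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

# 1) Queue policy: let S3 (and only this bucket) send messages.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": QUEUE_ARN,
        "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{BUCKET}"}},
    }],
}
sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes={"Policy": json.dumps(policy)})

# 2) Bucket notification: fire an SQS message for every new object.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [
            {"QueueArn": QUEUE_ARN, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)
```

Once this is in place, the source only ever sees messages for new files, which is where the efficiency and low duplication risk in the table come from.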
Summary
Both options achieve the same goal: getting your data out of S3 and into Cribl Stream. The real difference comes down to how you work and what your priorities are. If your team regularly needs to go back in time, replay logs for investigations, or validate data flows after a configuration change, the S3 Collector is ideal. It gives you the control to pull exactly what you need, when you need it.
If your focus is on keeping data flowing in real time, the Amazon S3 Source with SQS is built for that continuous pace. It automatically keeps up as new logs land in S3, without the manual work of defining time ranges or worrying about overlap. It is a set-and-forget option that keeps data fresh and consistent across your environment.
In many cases, the best answer is not one or the other, but both. Use the S3 Collector when you need to look back or recover lost data, and use the S3 Source with SQS when you need nonstop ingestion. Cribl Stream gives you the flexibility to mix and match depending on your workload so you can build a data pipeline that fits your team instead of forcing your team to fit the pipeline.
