You’ve got logs landing in S3: maybe CloudTrail, maybe Snowflake, maybe CrowdStrike FDR. The question that inevitably comes up is:
“Should we use an S3 Collector or an Amazon S3 Source (via SQS) to get them into Cribl Stream?”
It’s a great question, and like many things in the security and observability space, the answer depends on how you want to work with your data. Let’s break it down.
What’s an S3 Collector Source?
Think of the S3 Collector Source as a batch retrieval engine. It’s perfect when you need to reach back in time and grab a chunk of data from S3, say, “get me three hours of logs from noon yesterday for host==abcd.” It doesn’t care about real-time ingestion or message queues. You tell it what timeframe and what filters you want, and it goes and gets the data, either on demand or on a schedule (a sketch of what this looks like under the hood follows the list below).
Ideal when you:
- Need to rehydrate historical logs (for example, for a retroactive investigation)
- Want to replay old data for validation, benchmarking, or pipeline testing
- Work with data that’s organized neatly by date or prefix (e.g., YYYY/MM/DD/hh/mm/...)
- Don’t need continuous ingestion, just periodic snapshots
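To make the batch model concrete, here is a minimal boto3 sketch of what “grab a time window from S3” amounts to under the hood. The bucket name, prefix layout, and host filter are illustrative assumptions; in practice, the S3 Collector’s path and filter settings do this work for you.

```python
# A minimal sketch of batch retrieval from S3 (hypothetical names throughout).
import boto3

s3 = boto3.client("s3")

BUCKET = "example-log-archive"       # hypothetical bucket
PREFIX = "logs/2024/05/14/12/"       # noon yesterday, in a YYYY/MM/DD/hh layout

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Filter to one host, assuming the host appears in the object key
        # (e.g., .../host=abcd/events.json).
        if "host=abcd" in obj["Key"]:
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            # ...hand the raw bytes to your parser or pipeline here
```

Nothing here watches for new files. You point it at a range and it fetches exactly that range, which is the whole appeal for replays and investigations.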
What’s the Amazon S3 Source (with SQS)?
The Amazon S3 Source takes a more event-driven approach. It uses SQS notifications to track new objects as they’re written to S3 and then ingests those files automatically. In other words, it’s a “pull as you go” model: continuous, near-real-time ingestion without having to define time ranges or run manual jobs (a minimal sketch of the polling loop follows the list below).
Ideal when you:
- Want to ingest new data continuously as it lands in S3
- Are collecting logs from many sources that drop files at unpredictable intervals
- Don’t want to waste time or compute scanning the bucket for already-processed data
- Need high efficiency and low duplication risk
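For contrast, here is a minimal sketch of the event-driven loop, assuming the bucket already publishes ObjectCreated notifications to an SQS queue (that wiring is shown after the comparison table). The queue URL is a placeholder, and the Amazon S3 Source runs this loop for you, with batching, retries, and visibility-timeout handling on top.

```python
# A minimal sketch of event-driven ingestion: poll SQS for S3 notifications,
# fetch each new object, then delete the message (hypothetical queue URL).
import json
import urllib.parse

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20  # long poll
    )
    for msg in resp.get("Messages", []):
        for record in json.loads(msg["Body"]).get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # S3 URL-encodes keys in notifications, so decode before fetching.
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            # ...process the new file here
        # Deleting the message marks the file as handled so it isn't picked
        # up again; no bucket scanning is needed at any point.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```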
| | S3 Collector Source | Amazon S3 Source (SQS) |
| --- | --- | --- |
| Ingestion Model | Batch / scheduled | Continuous / event-driven |
| Best For | Historical replays, backfills | Real-time ingestion |
| Data Targeting | Time-range or prefix based | File-specific via SQS messages |
| Performance | Scans defined ranges (may be slower) | Efficiently picks up only new files |
| Duplication Risk | Possible if schedules overlap | Low; SQS messages are only created for new files |
| Setup Complexity | Simpler to start (no queue needed) | Slightly more setup (SQS policy + subscription) |
| Recovery Flexibility | Easy to replay old data | Harder to backfill historical ranges |
| Use Cases | Bulk ingestion / reingestion, forensics, testing | Streaming, operational monitoring, continuous logs |
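The “SQS policy + subscription” setup the table mentions is a one-time wiring job. A hedged sketch of what it involves, with placeholder names and ARNs, looks like this:

```python
# One-time wiring: allow the bucket to publish to the queue, then subscribe
# the queue to ObjectCreated events (all names and ARNs are placeholders).
import json

import boto3

BUCKET = "example-log-archive"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:s3-events"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

# 1) Queue policy: let S3 (and only this bucket) send messages.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": QUEUE_ARN,
        "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{BUCKET}"}},
    }],
}
sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes={"Policy": json.dumps(policy)})

# 2) Bucket notification: fire an SQS message for every new object.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [
            {"QueueArn": QUEUE_ARN, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)
```

Once this is in place, the source only ever sees messages for new files, which is where the efficiency and low duplication risk in the table come from.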
Summary
Both options achieve the same goal: getting your data out of S3 and into Cribl Stream. The real difference comes down to how you work and what your priorities are. If your team regularly needs to go back in time, replay logs for investigations, or validate data flows after a configuration change, the S3 Collector is ideal. It gives you the control to pull exactly what you need, when you need it.
If your focus is on keeping data flowing in real time, the Amazon S3 Source with SQS is built for that continuous pace. It automatically keeps up as new logs land in S3, without the manual work of defining time ranges or worrying about overlap. It is a set-and-forget option that keeps data fresh and consistent across your environment.
In many cases, the best answer is not one or the other, but both. Use the S3 Collector when you need to look back or recover lost data, and use the S3 Source with SQS when you need nonstop ingestion. Cribl Stream gives you the flexibility to mix and match depending on your workload so you can build a data pipeline that fits your team instead of forcing your team to fit the pipeline.
