I believe that, as a Source, you bring in logs from GCP much like you do with S3. With S3, you use SQS so Cribl can identify which logs are new and where they are located; Cribl then picks them up based on what SQS provides. Google Pub/Sub is similar in that it's GCP's messaging system, telling Cribl which logs are new and what it needs to pick up.
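For context, here is a minimal Python sketch of that S3 + SQS notification pattern in general terms, not Cribl's internal code; the queue URL and download destination are hypothetical:

```python
# Minimal sketch of the "SQS tells you which objects are new" pattern.
# Assumes the bucket publishes s3:ObjectCreated:* event notifications to the queue.
# Queue URL and local path are hypothetical placeholders.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-logs"  # hypothetical

def poll_once():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # Each S3 event notification names the bucket and key of the new object,
        # so the consumer never has to list or diff the bucket itself.
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            s3.download_file(bucket, key, "/tmp/" + key.replace("/", "_"))
        # Deleting the message marks the object as picked up.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    poll_once()
```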
Looking at the GCS Collector, there is no setting for Google Pub/Sub, so I'm still curious how that works. The only things we would configure are the bucket name and a path.
Sorry, I misunderstood. I didn't realize you were talking about the Collector; I thought you were talking about the Google Cloud Pub/Sub Source. For the Collector, it works very much like picking up logs from an S3 bucket with a Collector.
For me, I usually format my path with dates (e.g., archive/2022/04/12/13/).
This formatting allows you to use time-based tokens in your Collector:
archive/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}/
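As a rough illustration (not Cribl's actual token engine), here is how that tokenized path expands into concrete hourly prefixes for a given collection window; the window boundaries below are hypothetical:

```python
# Sketch: expand a strftime-style tokenized path into the concrete hourly
# prefixes a run would list. Not Cribl's implementation; window is hypothetical.
from datetime import datetime, timedelta

TEMPLATE = "archive/%Y/%m/%d/%H/"  # mirrors archive/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}/

def hourly_prefixes(earliest: datetime, latest: datetime):
    t = earliest.replace(minute=0, second=0, microsecond=0)
    while t < latest:
        yield t.strftime(TEMPLATE)
        t += timedelta(hours=1)

# A run covering 2022-04-12 12:00-14:00 would only need to list these prefixes:
for prefix in hourly_prefixes(datetime(2022, 4, 12, 12), datetime(2022, 4, 12, 14)):
    print(prefix)  # archive/2022/04/12/12/, archive/2022/04/12/13/
```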
When you run your Collector, you can schedule it with cron expressions and timeframes so that Cribl doesn't necessarily need to track what has already been read; you just ensure each run picks up where the last one ended.
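To make "pick up where the last one ended" concrete, here is a hedged sketch of contiguous, non-overlapping run windows driven by the schedule interval; the hourly cadence is an assumption, not a Cribl default:

```python
# Sketch: each scheduled run covers exactly the interval since the previous run,
# so no per-file state is needed, only the window boundaries.
from datetime import datetime, timedelta

INTERVAL = timedelta(hours=1)  # assumed cadence

def window_for(run_time: datetime):
    # Align to the interval so consecutive runs are contiguous and never overlap.
    latest = run_time.replace(minute=0, second=0, microsecond=0)
    earliest = latest - INTERVAL
    return earliest, latest

start_a, end_a = window_for(datetime(2022, 4, 12, 13, 5))  # run at 13:05 -> 12:00-13:00
start_b, end_b = window_for(datetime(2022, 4, 12, 14, 5))  # run at 14:05 -> 13:00-14:00
assert end_a == start_b  # windows butt up against each other: nothing re-read, nothing skipped
```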
I hope that makes sense.
Further reference documentation here: Google Cloud Storage | Cribl Docs
It does; however, I'm planning to pull from a location where I don't have the ability to change the output path format for the keys in the storage location. So, while I can appreciate how such a method would help going forward, it still doesn't answer how Cribl internally keeps track of files it has already downloaded so it doesn't download the same files again.