Question

File size upper limit when collecting from File System?

  • March 11, 2025

I am seeing some irregularities with collecting large files from a filesystem.

We batch-process files anywhere from 100 MB to 100 GB in size, and I am currently seeing an issue with the larger files.

To troubleshoot, I created a collector that reads the data in and writes it directly back to disk. No other ETL is done on the data.
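As a sanity check outside of Stream, I can also read the file end-to-end over the NFS mount and count bytes and events myself. A minimal sketch (hypothetical path, and assuming newline-delimited events matching my event breaker):

```python
#!/usr/bin/env python3
"""Read a file end-to-end over the NFS mount and count bytes/events.

Hypothetical path; assumes one event per newline, matching the event breaker.
"""
import os

SRC = "/mnt/nfs/batch/input/large_file.log"  # hypothetical source path

expected = os.stat(SRC).st_size
read_bytes = 0
events = 0

with open(SRC, "rb") as f:
    for chunk in iter(lambda: f.read(64 * 1024 * 1024), b""):
        read_bytes += len(chunk)
        events += chunk.count(b"\n")

print(f"st_size:    {expected:,} bytes")
print(f"bytes read: {read_bytes:,} bytes")
print(f"events:     {events:,} (newline count)")
if read_bytes != expected:
    print("short read: fewer bytes returned than stat() reported")
```

If the byte count read over NFS ever disagrees with what stat() reports, that would point at the mount rather than at Stream.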

Result of one collection:

Result of collecting the same file from the same location, using the same event breakers, to the same destination:

My current environment is 1 leader, 1 worker, and the file is being picked up from an NFS mount.

As you can see from the screenshots, only 3-4 million of the ~24 million events in the file are being collected. The destination is writing about 5-6 GB to disk from the original ~38 GB.
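A quick back-of-the-envelope check with the midpoints of those numbers shows the two ratios line up:

```python
# Midpoints of the numbers reported above.
events_collected, events_total = 3.5e6, 24e6  # 3-4 million of ~24 million
bytes_written, bytes_total = 5.5e9, 38e9      # 5-6 GB of ~38 GB

print(f"event ratio: {events_collected / events_total:.1%}")  # -> 14.6%
print(f"byte ratio:  {bytes_written / bytes_total:.1%}")      # -> 14.5%
```

Both ratios land around 14-15%, which suggests the collection is stopping partway through the stream rather than dropping scattered events.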

There are no errors that I see in the job log, and I can't find any setting regarding worker process limits or job limits that would affect this.

6 replies

What version of Stream are you on?

Can you show your Collector settings?

Turning on debug for the collector could provide more information.


Might be worth opening a ticket. I think we will need to take a closer look at your logs/configurations through a diag.


  • Author
  • Inspiring
  • March 11, 2025

Sounds good. I'll open a case with a link to this thread.


After doing that you can look at the logs for the job by going to Monitoring → Job Inspector.

[screenshot: Monitoring → Job Inspector]

  • Author
  • Inspiring
  • March 11, 2025

Stream version: 3.4.1

Most of the collector settings are default. I have added my event breakers and set a custom field for routing specifically back out to the filesystem.

[screenshot: collector settings]

I am going to run a collection with debug on now.


  • Author
  • Inspiring
  • March 11, 2025

The majority of the debug logs are…

"message: failed to pop task reason: no task in the queue"
and
"message: skipping metrics flush on pristine metrics"

Nothing sticks out to me as bad or breaking. No tasks in the queue makes sense because it's one file, so there is only one task.
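To make sure I am not missing a rare entry among the noise, something like this can tally the job log by message (assuming the NDJSON log format Cribl writes, with a hypothetical log path):

```python
#!/usr/bin/env python3
"""Tally a job log by (level, message) to surface low-frequency entries.

Assumes NDJSON lines with "level" and "message" fields; hypothetical path.
"""
import json
from collections import Counter

LOG = "/opt/cribl/log/job/job.log"  # hypothetical job log path

counts = Counter()
with open(LOG, encoding="utf-8") as f:
    for line in f:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or garbled lines
        counts[(rec.get("level", "?"), rec.get("message", "?"))] += 1

# Rare messages are often the interesting ones, so sort ascending.
for (level, message), n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:>8}  {level:<6} {message}")
```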

Also, this third run captured a different number of events again.

[screenshot: third run results]
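Since each run captures a different amount, a quick way to quantify the run-to-run variance is to count events in each run's output (hypothetical destination layout with one directory per run, again assuming newline-delimited events):

```python
#!/usr/bin/env python3
"""Count events and bytes per run directory to quantify variance.

Hypothetical destination layout: one subdirectory of output files per run.
Assumes newline-delimited events, matching the event breaker.
"""
from pathlib import Path

DEST = Path("/mnt/nfs/batch/output")  # hypothetical destination root

for run_dir in sorted(p for p in DEST.iterdir() if p.is_dir()):
    events = 0
    size = 0
    for f in run_dir.rglob("*"):
        if not f.is_file():
            continue
        size += f.stat().st_size
        with f.open("rb") as fh:
            for chunk in iter(lambda: fh.read(64 * 1024 * 1024), b""):
                events += chunk.count(b"\n")
    print(f"{run_dir.name}: {events:,} events, {size:,} bytes")
```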