I am seeing some irregularities when collecting large files from a filesystem.
We batch-process files anywhere from 100 MB to 100 GB in size, and I am currently seeing an issue with the larger files.
To troubleshoot, I created a collector that reads the data in and writes it directly back to disk. No other ETL is done on the data.
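To double-check the collector's numbers outside of Cribl, something like the following sketch mimics the same read-in/write-back pass and counts newline-delimited events on both sides. The paths and the sample-file setup here are hypothetical stand-ins; point the two functions at the real source file on the NFS mount and at the collected output instead.

```python
import os
import tempfile

def stream_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in fixed-size chunks; returns bytes copied."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
    return copied

def count_events(path, chunk_size=1024 * 1024):
    """Count newline-delimited events without loading the file into memory."""
    events = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            events += chunk.count(b"\n")
    return events

# Demo on a small synthetic file; replace these paths with the real
# source file (on the NFS mount) and the collected destination file.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "source.log")
dst = os.path.join(tmp, "collected.log")
with open(src, "wb") as f:
    for i in range(100_000):
        f.write(b"event %d\n" % i)

copied = stream_copy(src, dst)
print(count_events(src), count_events(dst), copied == os.path.getsize(src))
```

If the standalone copy preserves all events but the Cribl collection does not, that isolates the loss to the collection job rather than the file itself.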
Result of one collection:
Result of collecting the same file from the same location, using the same event breakers and writing to the same destination:
My current environment is 1 leader, 1 worker, and the file is being picked up from an NFS mount.
As you can see from the screenshots, only 3-4 million of the ~24 million events in the file are being collected, and the destination is writing only about 5-6 GB to disk out of the original ~38 GB.
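One thing worth noting about those numbers: the event shortfall and the byte shortfall land at roughly the same fraction, which looks more like the file being cut off at a consistent point than like events being dropped at random. A quick arithmetic check (using the midpoints of the observed ranges):

```python
# Midpoints of the observed figures from the collection job.
events_collected = 3.5e6   # observed 3-4 million events
events_total     = 24e6    # ~24 million events in the file
bytes_collected  = 5.5e9   # observed 5-6 GB written
bytes_total      = 38e9    # ~38 GB original file

event_fraction = events_collected / events_total
byte_fraction  = bytes_collected / bytes_total
print(f"events: {event_fraction:.1%}, bytes: {byte_fraction:.1%}")
```

Both fractions come out around 14-15%, consistent with truncation at a fixed offset or time limit rather than sampling-style loss.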
There are no errors that I can see in the job logs, and I can't find any setting for worker process limits or job limits that would affect this.
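Since the file comes over NFS, it may also be worth ruling out the mount itself (e.g. a short read) before digging further into job settings. A streaming checksum of the file as read over NFS, compared against a known-good local copy (or against a second read of the same file), would show whether the full content is even reachable. The demo paths below are synthetic placeholders:

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so even a 100 GB file fits in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a small synthetic file; in practice, hash the file on the NFS
# mount and a known-good copy and compare the digests. A mismatch would
# point at the mount or transport, not at the collection job.
tmp = tempfile.mkdtemp()
sample = os.path.join(tmp, "sample.log")
with open(sample, "wb") as f:
    f.write(b"some event data\n" * 1000)
print(sha256_of(sample))
```

If the digests match across reads, the NFS layer is probably fine and the limit is somewhere in the collection job itself.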