Objective
Recover orphaned files left in a staging directory after a Worker Process (WP) crash (e.g., OOM) by rebuilding the state file so Stream re-processes them on the next startup.
Environment
- Cribl Stream 4.x
- File-based output destinations using staging directories (ADX, Azure Blob, S3, etc.)
Procedure
1. Identify the orphaned files
Locate the staging directory for the affected output and confirm orphaned .tmp files exist:
STAGE_DIR="$CRIBL_HOME/state/outputs/staging/<output_id>"
find "$STAGE_DIR" \( -name "*.tmp" -o -name "*.tmp.agg.*" \)
Note the current worker process ID from the existing state file name (e.g., CriblOpenFiles.0.json → worker ID is 0).
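If the directory holds more than one state file, the IDs can be read straight from the file names. The following is a minimal sketch, not a Cribl-provided tool: the function name list_worker_ids is hypothetical, and the only thing taken from this article is the CriblOpenFiles.<N>.json naming pattern.

```shell
#!/bin/bash
# Print the worker ID embedded in each CriblOpenFiles.<N>.json file name.
# Hypothetical helper; only the file-name pattern comes from this article.
list_worker_ids() {
  local dir="$1" f base id
  for f in "$dir"/CriblOpenFiles.*.json; do
    [ -e "$f" ] || continue          # the glob matched nothing
    base="${f##*/}"                  # strip the directory part
    id="${base#CriblOpenFiles.}"     # drop the fixed prefix
    id="${id%.json}"                 # drop the extension, leaving <N>
    echo "$id"
  done
}
```

Calling list_worker_ids "$STAGE_DIR" prints one ID per line. Backup copies ending in .bak are not matched by the glob, so they are ignored.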
2. Stop Stream
$CRIBL_HOME/bin/cribl stop
3. Rebuild the state file
Back up the existing state file for safekeeping:
cd "$STAGE_DIR"
mv CriblOpenFiles.0.json CriblOpenFiles.0.json.bak
Review the backup so you know what the normal structure looks like. The format is:
{
  "files": [
    {
      "file": "/opt/cribl/state/outputs/staging/<output_id>/path/to/file.tmp",
      "state": "closing",
      "retryNum": 0
    }
  ]
}
Create a new state file that lists every orphaned file in the staging directory and sets each entry's state to "closing". You can build it manually or use this script:
#!/bin/bash
WORKER_ID="0"  # worker process ID from your active state file name
STAGE_DIR="/opt/cribl/state/outputs/staging/<output_id>"
STATE_FILE="${STAGE_DIR}/CriblOpenFiles.${WORKER_ID}.json"
# Open the JSON document, then append one entry per staged file.
echo '{"files":[' > "$STATE_FILE"
first=true
# IFS= and -r keep leading spaces and backslashes in file names intact.
find "$STAGE_DIR" \( -name "*.tmp" -o -name "*.tmp.agg.*" \) | while IFS= read -r f; do
  if [ "$first" = true ]; then first=false; else echo ',' >> "$STATE_FILE"; fi
  echo "{\"file\":\"$f\",\"state\":\"closing\",\"retryNum\":0}" >> "$STATE_FILE"
done
echo ']}' >> "$STATE_FILE"
Replace <output_id> with the actual output ID (the subdirectory name under staging/).
Setting state: "closing" tells the WP these files are ready to upload. On startup, it transitions them to dead-closing and begins processing after a 30-second delay.
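Before starting Stream, it can help to confirm that the rebuilt state file lists as many entries as there are staged files on disk. The sketch below is one way to do that with standard tools. The function names are hypothetical, and the entry count assumes each entry's "file" key sits on its own line, which is how the script in step 3 writes it. File names containing newlines would skew both counts.

```shell
#!/bin/bash
# Count staged files using the same find expression as the rebuild script.
count_staged_files() {
  find "$1" \( -name "*.tmp" -o -name "*.tmp.agg.*" \) | wc -l | tr -d ' '
}

# Count entries in the state file. Assumes one "file" key per line.
count_state_entries() {
  grep -c '"file":' "$1" || true     # grep exits 1 on zero matches; still prints 0
}

# Compare the two counts; returns non-zero on a mismatch or a missing state file.
check_state_file() {
  local stage_dir="$1" state_file="$2" on_disk listed
  if [ ! -f "$state_file" ]; then
    echo "state file not found: $state_file" >&2
    return 1
  fi
  on_disk="$(count_staged_files "$stage_dir")"
  listed="$(count_state_entries "$state_file")"
  if [ "$on_disk" -eq "$listed" ]; then
    echo "OK: $listed entries match $on_disk staged files"
  else
    echo "MISMATCH: $listed entries vs $on_disk staged files" >&2
    return 1
  fi
}
```

For example, check_state_file "$STAGE_DIR" "$STAGE_DIR/CriblOpenFiles.0.json" reports OK only when the counts agree. It does not verify that the file is valid JSON or that each listed path exists.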
4. Start Stream
$CRIBL_HOME/bin/cribl start
5. Validate
Confirm the files were sent:
- The staging directory: orphaned files should be removed after successful upload.
- Logs for upload activity:
tail -f $CRIBL_HOME/log/cribl.log | grep -E "(moveToFinal|upload|ingest|dead-)"
- The state file: a successful run will remove processed entries from the files array.
- Files that exceed 20 retries will be moved to $STAGE_DIR/dead-letter/ — investigate these separately.
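The directory checks above can be combined into a single status line that is easy to re-run while the upload drains. This is a rough sketch under stated assumptions: the function name recovery_status is hypothetical, the dead-letter/ location is the one given in this article, and the staging path must be passed without a trailing slash so the prune pattern matches.

```shell
#!/bin/bash
# Print how many staged files remain and how many sit in dead-letter/.
recovery_status() {
  local stage_dir="$1" remaining dead
  # Skip dead-letter/ so files parked there are not counted as still pending.
  remaining="$(find "$stage_dir" -path "$stage_dir/dead-letter" -prune -o \
      \( -name "*.tmp" -o -name "*.tmp.agg.*" \) -print | wc -l | tr -d ' ')"
  dead=0
  if [ -d "$stage_dir/dead-letter" ]; then
    dead="$(find "$stage_dir/dead-letter" -type f | wc -l | tr -d ' ')"
  fi
  echo "remaining=$remaining dead_letter=$dead"
}
```

Running it in a loop, for example while sleep 10; do recovery_status "$STAGE_DIR"; done, should show the remaining count falling toward zero. A growing dead_letter count points to files that need separate investigation.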
Last Validated
- Stream 4.15
Additional Information
- State file worker ID: The worker ID in the state file name must match the running WP's ID. Check $CRIBL_HOME/log/cribl.log if unsure.
- Stale aggregation claims: If files in the .aggregate/ subdirectory have .agg.{N} suffixes from a dead worker, strip the suffix before rebuilding the state file:
find "$STAGE_DIR/.aggregate" -name "*.agg.*" | while IFS= read -r f; do
  mv "$f" "$(echo "$f" | sed 's/\.agg\.[0-9]*$//')"
done
- Same-node recovery: This procedure works on the same healthy worker node. No file copies or additional nodes are needed if the instance is not being replaced.
- Instance at risk of replacement: If the node may be replaced before recovery, copy the staged files to persistent storage (NFS) first and perform the recovery on a stable node.
- Why files become orphaned: Stream discovers staged files exclusively through the state file — not by scanning the directory. When a WP crashes before flushing state, the replacement process has no record of the in-flight files.
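Because stripping the .agg.{N} suffix renames files in place, a dry run before the real rename may be worthwhile. The sketch below only prints what would happen. The function name preview_agg_renames is hypothetical, and flagging a rename whose target already exists is an added precaution, not documented Stream behavior.

```shell
#!/bin/bash
# Dry run: print "old -> new" for each stale .agg.<N> file without renaming.
preview_agg_renames() {
  local agg_dir="$1" f target
  [ -d "$agg_dir" ] || return 0      # nothing to preview
  find "$agg_dir" -name "*.agg.*" | while IFS= read -r f; do
    target="$(printf '%s\n' "$f" | sed 's/\.agg\.[0-9]*$//')"
    # A suffix that is not purely numeric is left unchanged by sed; skip it.
    if [ "$target" = "$f" ]; then continue; fi
    if [ -e "$target" ]; then
      echo "SKIP (target exists): $f"
    else
      echo "$f -> $target"
    fi
  done
}
```

For example, preview_agg_renames "$STAGE_DIR/.aggregate" lists each planned rename on its own line. Any SKIP line marks a file whose stripped name is already taken, which a plain mv would overwrite.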
