Sequencing Packs and Pipelines
When setting up Cribl Stream and Cribl Edge, you are given a lot of choice about how you process data. This can shift conversations from “Can I do XYZ?” (the answer is so often yes) to “Where and how should I do XYZ?”. In this post I’m going to discuss what options are available, best practices to consider, and my advice on what not to do.
Where can you make changes?
When logging into Stream you’ll notice you can manipulate data flows with Packs and Pipelines in three different places*: at the pre-processing level attached to a source, at the main routing step, or at the post-processing stage attached to a destination. This is explained in more detail in our Event Processing Order documentation.
Key link: https://docs.cribl.io/stream/event-processing-order/
* - Already I’ve simplified things here. With a Cribl Edge -> Cribl Stream arrangement we have a choice of six locations, as each product grants three options. Additionally, for simplicity I’m not going to discuss the Chain function, which allows connecting Pipelines or Packs together. You really can do some sophisticated processing if you wish!
Reference: https://docs.cribl.io/stream/chain-function/
What are some best practices to think about?
Cribl Stream was not built with a “this is the one way to run your data flows” mindset, so I can’t give you a single instruction on what to do. Instead I’m going to list a series of key considerations; if these apply to your circumstances, use them. Otherwise you can remix or ignore them. In general, pipelines at the routing stage are the easiest to get started with, so I’ll focus on the cases beyond that.
✅Metadata additions are best done at pre-processing level
Metadata labelling is best done as soon as possible with a simple field addition. Doing it early limits the amount of conditional logic needed later in your data flow, and allows for precise updates and easier control.
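As a sketch of what this can look like, here’s what a minimal pre-processing pipeline export might contain. The field name and value are hypothetical, and the exact export shape can differ between versions:

```yaml
# Hypothetical pre-processing pipeline attached to a source.
# Adds a static metadata field early so later routes can filter on it,
# instead of re-deriving the same conditional logic downstream.
functions:
  - id: eval
    filter: "true"            # apply to every event from this source
    conf:
      add:
        - name: data_center   # hypothetical field name
          value: "'emea-1'"   # value is a JS expression, hence the inner quotes
```

Because the field exists before routing, a route filter can simply test `data_center=='emea-1'` rather than repeating host-matching logic in several places.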
✅Unrolls and event breaking are best done as soon as possible
By the time events reach your routing stage, ideally they are all consistently formatted, roughly shaped, and tagged. Ensuring that all events are neatly divided early will make your event management far easier. The best place to break is with an Event Breaker on the source; the second best is the pre-processing stage.
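For example, if a source delivers a JSON array of records bundled into a single event, an Unroll-style function early in the flow can split it into one event per record before routing ever sees it. This is a sketch only, with assumed option names; check the function documentation for your version:

```yaml
# Hypothetical pipeline snippet: split one event carrying an array
# of records into individual events before routing.
functions:
  - id: unroll
    filter: "true"
    conf:
      srcExpr: _raw     # expression yielding the array to unroll (assumed option name)
      dstField: _raw    # each element becomes its own event's _raw (assumed option name)
```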
✅Aggregation and enrichment is best done in Stream rather than Edge
Any memory-intensive operations are best done centrally rather than distributing lookup files to all Edge Nodes. Additionally, aggregation is best performed in a single location to allow for more predictable control.
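As an illustration, an enrichment step in Stream might use a Lookup function against a centrally managed CSV. The file and field names here are hypothetical, and the option names are an assumption based on typical pipeline exports:

```yaml
# Hypothetical enrichment in a Stream pipeline: the lookup file lives
# on the Stream Worker Nodes, so it never has to be shipped to Edge Nodes.
functions:
  - id: lookup
    filter: "true"
    conf:
      matchMode: exact
      file: asset_owners.csv        # hypothetical lookup file
      inFields:
        - eventField: host          # match the event's host...
          lookupField: hostname     # ...against the CSV's hostname column
      outFields:
        - lookupField: owner        # copy the CSV's owner column...
          eventField: asset_owner   # ...into the event
```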
✅Destination specific logic is best done at a destination
If there’s a simple destination-specific schema change or field transformation, this is usually best done at the post-processing stage. This could involve metadata addition or event serialization.
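For instance, a post-processing pipeline on a destination might reserialize events into JSON just before delivery. Again, this is a sketch with assumed option names and a hypothetical field list:

```yaml
# Hypothetical post-processing pipeline attached to one destination:
# serialize selected fields into JSON in _raw for that consumer only,
# leaving every other destination's view of the data untouched.
functions:
  - id: serialize
    filter: "true"
    conf:
      type: json
      fields: ["host", "source", "asset_owner"]  # hypothetical field list
      dstField: _raw
```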
What should you ideally avoid?
There are a number of choices you can make that may make your life more challenging. Here’s a list of things to consider.
❗Repeating logic is unwise and will make debugging more challenging
If you are remapping fields, it is unwise to split that remapping/renaming across multiple locations: for example, renaming some fields at pre-processing, some in routing, and some at post-processing. Keeping data-shaping logic together will make things far easier to test and debug with samples.
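To make this concrete, the healthier pattern is a single Rename (or Eval) function holding all the mappings in one place. The field names below are hypothetical:

```yaml
# Hypothetical example: all field renames live in one function in one
# pipeline, so running a sample through it shows the complete mapping
# in a single capture rather than across three stages.
functions:
  - id: rename
    filter: "true"
    conf:
      rename:
        - currentName: src_ip
          newName: source_ip
        - currentName: dst_ip
          newName: destination_ip
```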
❗Don’t duplicate data flows from Edge to Stream
For bandwidth and load reasons, if you are doing an Edge-to-Stream transfer it’s best to split data out once it reaches Stream, rather than running two very similar event flows from Edge to Stream in parallel.
❗Manage stale comments, descriptions and outdated pipelines
Over time your comments will age and lose relevance. When this happens, refreshing and refactoring is important. It’s also possible that conditional functions, or even entire pipelines, are no longer used; when that happens, I’d recommend tidying up to simplify things. Don’t fall into pipeline_v3_final-style naming conventions! Simplicity is best.
Concluding notes
I hope these tips have got you thinking about how best to manage your Pipelines and Packs. If you ever want a second view, do ask a community member, a colleague, or reach out to your Cribl team. I find that having another person try to understand your thinking can be fantastic for spotting an easier solution.
