
Machine data is expanding beyond its original stronghold in security and compliance to become a vital part of how operations are managed. One of the biggest new applications is an area called Operational Intelligence.
One of the scary things I’ve learned recently is how many application outages are due to errors in config files! Any guesses as to what that number looks like in your organization? Remember, anything from an error in an Access Control List (ACL), to a mis-mapped router, to a faulty database configuration can result in an application outage. If you are the impatient type, jump to the end of the article, where I share our estimate.
Many people talk about log data, but application logs often tell you only that an error has occurred. Tracing that error to its root cause remains a mystical and magical process. One reason is that log data is often stored in different systems, so seeing logs from all systems in context is a giant workflow and aggregation problem. It is so rare to see logs from different systems side by side that it is difficult to build a knowledge base of what impacts what. That’s why we recommend centralized collection and management of machine data. Once centralized, it can be farmed out to separate best-of-breed products for security, compliance, and even infrastructure monitoring and management.
Another challenge is that knowledge is often accrued by individuals but never institutionalized. One solution is to correlate config file changes with their impact on application availability and performance. This gives organizations a great way to head off deployment and config file-related problems. If you run a fast-moving DevOps environment, monitoring config file updates during rollouts should be one of your first validity checks. Hopefully, your logging solution has a great File Integrity Monitoring (FIM) feature that can track changes not just at the file level, but also for each individual line.
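To make the file-level versus line-level distinction concrete, here is a minimal Python sketch. It is not how any particular FIM product works; the function names `file_digest` and `line_changes` and the sample config values are my own assumptions for illustration. A hash tells you that a file changed, while a diff tells you which lines changed.

```python
import difflib
import hashlib


def file_digest(text: str) -> str:
    """File-level check: a SHA-256 digest that changes if any byte changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def line_changes(old: str, new: str) -> list[str]:
    """Line-level check: return only the removed ('- ') and added ('+ ') lines."""
    delta = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in delta if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical config snapshots taken before and after a rollout.
    before = "host=dev-db01\nport=5432\n"
    after = "host=prod-db01\nport=5432\n"

    if file_digest(before) != file_digest(after):
        for change in line_changes(before, after):
            print(change)
```

Logging each changed line with a timestamp alongside your application logs is what makes it possible to correlate a specific edit with a later drop in availability or performance.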
Why am I so bullish today on monitoring config files? Because our estimate is that 50% of all run-time errors with our software are the result of configuration file errors! For example, one of the most frequent mistakes our customers make is simply forgetting to change the server names in config files from the dev/test instance to the production instance. What are some of the config file mistakes you have seen that have brought applications or infrastructure to their knees?
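If you want to catch the dev/test server name mistake before it reaches production, a pre-deployment check can be as simple as the Python sketch below. It assumes a hypothetical naming convention in which non-production hosts contain `dev`, `test`, or `staging` as a distinct token, and that comment lines start with `#`. Adjust both to match your own environment.

```python
import re

# Assumed convention (hypothetical): non-production host names contain one of
# these tokens, not embedded in a longer word such as "device" or "latest".
NON_PROD_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])(dev|test|staging)(?![A-Za-z])", re.IGNORECASE
)


def find_non_prod_references(config_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that appear to name a non-production host."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comments (assumes '#' starts a comment).
        if not stripped or stripped.startswith("#"):
            continue
        if NON_PROD_PATTERN.search(stripped):
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    # Hypothetical production config with one leftover dev server name.
    sample = "db_host = dev-db01.example.com\napi_host = api.example.com\n"
    for number, line in find_non_prod_references(sample):
        print(f"line {number}: {line}")
```

A check like this will produce false positives if your production names happen to contain those tokens, so treat a hit as a prompt for review rather than proof of an error.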