Indexing vs. Normalization of logs

Recently a friend asked me whether he should normalize or index logs for faster reporting.  My response was that it depends on who is interpreting the reports.  I suggested he use indexing if the logs come from a single application and his users understand the log message format.  However, if the logs come from many different applications or have complicated message formats, then normalization is the quickest and easiest method.  When both are available, using indexed and normalized log data together is usually the best approach.

First, a review of what it means to index log data

An index of log data is similar to the index in the back of a book.  A separate, smaller collection of words (the index) is created with pointers to the full text.  Log data tends to be voluminous and repetitive, which makes for efficient indexes (a few words, each with many pointers, is the most efficient case).  To optimize search, engineers use full-text indexing tools such as Lucene or IT search applications like Splunk to index the data, especially when there are gigabytes or terabytes of logs to search.
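
The book-index analogy can be sketched in a few lines of Python.  This is only an illustration of the idea, not how Lucene or Splunk work internally; the function names and the sample messages are made up for the example.

```python
import re
from collections import defaultdict

def build_index(lines):
    """Map each lowercase word to the set of line numbers that contain it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            index[word].add(line_no)
    return index

def search(index, query):
    """Return the sorted line numbers that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Intersect the pointer sets: a line must appear under every query word.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical sample messages, invented for this sketch.
logs = [
    "sshd: failed login for user alice from 10.0.0.5",
    "sshd: accepted login for user bob from 10.0.0.6",
    "sshd: failed login for user carol from 10.0.0.7",
]
index = build_index(logs)
print(search(index, "failed login"))  # [0, 2]
```

Because the messages are repetitive, the word "login" is stored once and simply points at all three lines, which is why repetitive log data indexes so compactly.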

For users to get what they want from an index search, they must know how the applications or devices creating the log data function and be able to understand the contents of the log messages.  If they are searching for a specific term, they need to know its exact spelling and how it is used in the application’s or device’s log messages.  For example, it is important to know the difference between login and logon.

Another key to using indexed log data successfully is for users to understand what is “indexed” and what is not.  Users should know which delimiters separate indexed words, how they are applied, and what counts as an indexable word.  Is an email address a full word or two separate words separated by the @ sign?  Is a domain name one word or multiple words separated by periods (or both)?
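
To see why the delimiter question matters, here is a small Python sketch of two tokenizing choices.  Both functions and the sample message are assumptions for illustration; a real indexing tool has its own rules, which is exactly what its users need to find out.

```python
import re

def tokenize_whole(text):
    """Split on whitespace only, so addresses and hostnames stay whole."""
    return text.lower().split()

def tokenize_split(text):
    """Split on every non-alphanumeric character, including '@' and '.'."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical message, invented for this sketch.
message = "Mail rejected for bob@example.com"

whole = tokenize_whole(message)
split = tokenize_split(message)

print(whole)  # ['mail', 'rejected', 'for', 'bob@example.com']
print(split)  # ['mail', 'rejected', 'for', 'bob', 'example', 'com']
```

With the first choice a search for "example" finds nothing, and with the second a search for the full address only works if the search tool splits the query the same way.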

It also is important for users to know exactly what words are used in the messages they are searching.  For example, say you want to run a report for failed logins.  This is fairly easy and works well in a homogeneous environment, such as all UNIX servers or all Web application servers.  All you have to do is search on ‘failed login’ and the index returns all the messages that contain both the words ‘failed’ and ‘login’.  However, when I tried this on my lab systems (a combination of UNIX and Windows devices), I didn’t get all the records I expected.  My indexed search showed nine failed logins.  After looking at the raw log data I found that some of my failed login messages didn’t contain the exact words ‘failed’ and ‘login’.  One of the messages I was looking for said “login failure”, and my Windows logs said “Failed… Logon”.  It became obvious that users have to know the log content very well before they search indexed data.
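
The miss is easy to reproduce with a small Python sketch.  The three messages are invented stand-ins built from the phrases quoted above, not my actual lab logs, and `matches_all` is a hypothetical helper that mimics an exact-word search.

```python
import re

def matches_all(message, query):
    """True if every query word appears as a whole word in the message."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return all(term in words for term in query.lower().split())

# Hypothetical messages, one per wording variant.
logs = [
    "unix1 sshd: failed login for user alice",
    "unix2 app: login failure for user bob",
    "win1 Security: Failed ... Logon",
]

found = [m for m in logs if matches_all(m, "failed login")]
missed = [m for m in logs if m not in found]

print(len(found), "found,", len(missed), "missed")
```

Only the first message contains both exact words.  "failure" is not "failed" and "logon" is not "login", so the other two never show up in the report.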

What does it mean to ‘normalize’ log data?

Normalizing log data is the process of grouping similar messages under a common meaning. This is typically done with software rules that interpret and summarize similar log messages. For example, different messages that contain the phrases ‘failed logins’, ‘authorization failures’, and ‘logon failures’ can all be interpreted as failed logins, which lets users search or report on one key phrase without knowing the specific wording of each log message.
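
A rough Python sketch of what such rules might look like follows.  The patterns, the `failed_login` label, and the sample messages are all assumptions made up for illustration; commercial products ship far larger and more careful rule sets.

```python
import re
from collections import Counter

# Hypothetical rules: each pattern maps a wording variant to one event name.
RULES = [
    (re.compile(r"\bfailed\b.*\blog[io]n\b", re.IGNORECASE), "failed_login"),
    (re.compile(r"\blog[io]n failures?\b", re.IGNORECASE), "failed_login"),
    (re.compile(r"\bauthorization failures?\b", re.IGNORECASE), "failed_login"),
]

def normalize(message):
    """Return the event name of the first matching rule, else 'unclassified'."""
    for pattern, event in RULES:
        if pattern.search(message):
            return event
    return "unclassified"

# Hypothetical messages from a mixed environment.
logs = [
    "unix1 sshd: failed login for user alice",
    "unix2 app: login failure for user bob",
    "win1 Security: Failed ... Logon",
    "fw1: authorization failure for user dave",
    "unix1 sshd: accepted login for user bob",
]

counts = Counter(normalize(m) for m in logs)
print(counts["failed_login"], counts["unclassified"])  # 4 1
```

A report can now ask for `failed_login` once and pick up all four variants, at the cost of someone having written and maintained the rules.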

One problem with normalized data is that someone has to write the software to normalize it.  Organizations can do this themselves or use a third-party product like LogLogic or ArcSight, whose teams of developers constantly write rules to normalize typical log messages from various devices.  The vendors focus on the most common types of devices and provide a development kit for end users to create their own normalization for in-house applications.  This also means that a third party is deciding what the log messages mean so that reports are easier for users to read.  For example, does a ‘Denied’ message from a Cisco PIX firewall mean the same thing as a ‘Rejected’ message from a CheckPoint firewall?

In a typical large enterprise configuration I can see a use for both types of log reports.  The app team could use an indexed search report because they are very familiar with the specific application they are running and the exact message contents on which they want to report.  The IT Operations team is responsible for a much more diverse group of messages and is more reliant on ‘normalization’ for their reports to be useful, and therefore should stick with a product that normalizes the log data for all the tools that they need to manage.

There are good reasons to use both indexed and normalized log reports.  Several of the vendors I have worked with recently are working toward combining indexing and normalization to make it easier for their customers to get to the data they are searching for and/or reporting on.  For now, users need to understand the difference and the details about when and why to index vs. normalize their log data.

Full disclosure: While I was at LogLogic we developed both indexed and normalized search & reporting so customers could choose which they wanted to use.  I have seen other vendors, like ArcSight, provide this choice.  But most log analysis tools typically use one method or the other.

2 Responses to Indexing vs. Normalization of logs

  1. Thanks for mentioning Splunk. It should be noted (in case you weren’t aware), Splunk indexes the full text of any message, organizes it by the time at which it occurred–making its indexing far more useful than Lucene–and then uniquely does the “normalization” at search time. Splunk calls this “search-time field extraction”. The best answer to your question of normalize vs. index is both. Indexing provides a far faster and more broad search than traditional schema/database solutions–but to answer the full question, you really have to have the fields and structure at some point in time. Splunk just saves the user massive amounts of time and offers insane flexibility by “first eat, then organize”

    Michael Wilde
    Splunk Ninja
    http://splunkninja.com
