Parsing nginx log files

The Python group at OASIS has been busy making Plone do backflips!  We’ve hooked up several ZEO instances to nginx and Varnish, which is cool stuff.  After the grunt work of digging through config files, we faced the problem of how to test our setup, and none of us wanted to do end-user testing for thirty big sites.  What to do, what to do…

Well, as always, we found a solution.

We have nginx logging set up and about 10,000 log entries per site.  All we needed was software that used the log entries to programmatically replicate user actions.  The first step, however, was to write some code that could take each log entry, parse it and check it for consistency.  That’s where RegTemplate fits in.  RegTemplate takes a $-delimited string and a dictionary of regular expressions whose keys are the template identifiers.  Each identifier is replaced by its corresponding regular expression, forming the pattern for each log entry.  RegTemplate also has a method called ‘verify’ which takes a log entry as an argument, matches it against the constructed regular expression, then compares the matched text to the original log entry.  Cool, huh?
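
To make the substitution step concrete, here is a minimal sketch of the idea in Python 3 (the class further down is Python 2).  The helper name build_pattern is made up for illustration and is not part of RegTemplate.  It also uses re.escape in place of a hand-written list of special characters, which is a swap for brevity rather than what the class does.

```python
import re
from string import Template


def build_pattern(template, regex_dict):
    """Turn a $-delimited template into a regex with one named group per identifier.

    Hypothetical helper, not part of RegTemplate.
    """
    # Escape regex metacharacters in the literal text, but leave '$' alone
    # so string.Template can still find the placeholders.
    escaped = ''.join(ch if ch == '$' else re.escape(ch) for ch in template)
    # Wrap each expression in a named group keyed by its identifier.
    groups = {name: '(?P<%s>%s)' % (name, rx) for name, rx in regex_dict.items()}
    return Template(escaped).substitute(groups)


pattern = build_pattern(
    '$ip [$date] $request',
    {
        'ip': r'(([0-9]+\.?)+)',
        'date': r'([0-9]{2})/([A-Za-z]+)/([0-9]{4})',
        'request': r'(GET|POST)',
    },
)
m = re.match(pattern, '152.2.103.87 [22/Apr/2009] GET')
print(m.group('ip'), m.group('date'), m.group('request'))
```

Each identifier ends up as a named group, so the pieces of a log entry can be pulled out by name once the entry matches.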

Below is a sample use case for a simplified log entry:

if __name__ == '__main__':
   sample = '152.2.103.87 [22/Apr/2009] GET'

   print 'Sample log entry: ', sample, '\n'

   passesExact    = '152.2.103.87 [22/Apr/2009] GET'
   passesNotExact = '152.2.103.87 [22/Apr/2009] GET \\ '
   template       = '$ip [$date] $request'

   ip      = r'(([0-9]+\.?)+)'
   date    = r'([0-9]{2})/([A-Za-z]+)/([0-9]{4})'
   request = '(GET|POST)'
   dct     = {'ip':ip, 'date':date, 'request':request}

   print 'dictionary of regex substitutions: ', dct, '\n'

   rt = RegTemplate(template)
   print 'Template before compilation is: ', rt.template
   print 'Pattern after compilation is: ',   rt.compile(dct)._RegTemplate__pattern
   print 'Template after compilation is: ',  rt.template

   print '\n'

   print '"%s" passes exact match: '       % passesExact, rt.verify(passesExact, exact=True)
   print '"%s" passes exact match: '       % passesNotExact, rt.verify(passesNotExact, exact=True)
   print '"%s" passes approximate match: ' % passesNotExact, rt.verify(passesNotExact, exact=False)

   print '\n'

   match = rt.match(passesNotExact)
   print 'pulled ip from named group, value retrieved is: ',      match.group('ip')
   print 'pulled date from named group, value retrieved is: ',    match.group('date')
   print 'pulled request from named group, value retrieved is: ', match.group('request')
   print 'named groups are: ', rt.namedGroups(passesNotExact)

And here is the code:

import re
import string

class RegTemplate(string.Template):
   """
   Takes a $-delimited string and a dictionary of regular expressions
   whose keys are the template identifiers.  Each identifier is replaced by its
   corresponding regular expression, forming the pattern for each log-entry.
   RegTemplate has a method called 'verify' which takes a log-entry as an argument
   and processes it through the constructed regular expression, then compares the
   result to the original log entry.
   """

   # a literal space in a pattern matches a literal space, so spaces need
   # no escaping.  other whitespace in the template (newlines, carriage
   # returns, etc.) is not handled, so Your Mileage May Vary there.
   # also, not every regex metacharacter is listed below (for example
   # '*', '|' and the backslash are missing), so use caution.
   specials = [ '[' , ']' , ',' , '.', '\'' , '"' , '{', '}' , '?' , '-' , '+' , '(' , ')', '^', '&', '$']

   def __init__(self, template):
     super(RegTemplate, self).__init__(template)
     self.isCompiled = False # prevents repeat compilation
     self.regex_dict = {}

   def compile(self, regex_dict):
     """
     Replaces each identifier with its corresponding regular expression.
     Additionally, turns each expression into a named group, indexed
     by its key in regex_dict.
     """        
     if not self.isCompiled:
       self.isCompiled = True
       for k, v in regex_dict.iteritems():
         self.regex_dict[k] = '(?P<%s>%s)' % (k, v)
       for s in RegTemplate.specials:
         # delimiter is normally $ sign, if this is not the case
         # the $ sign must be escaped while the delimiter is
         # left untouched. this feature may be removed in future versions
         if s == self.delimiter: continue
         # escape special characters
         self.template = self.template.replace(s, '\\' + s)
       # name-mangled (private) to avoid overriding string.Template.pattern
       self.__pattern = self.substitute(**self.regex_dict)
     return self

   def namedGroups(self, entry):
     """
     Returns a dictionary mapping each identifier to the text its
     named group matched in entry.
     """
     named_groups = {}
     if self.isCompiled:
       result = self.match(entry)  # match once, reuse for every group
       for name in self.regex_dict.keys():
         named_groups[name] = result.group(name)
       return named_groups
     else:
       raise RuntimeError('RegTemplate object must be compiled before iterating')

   def match(self, entry, pos=0, endpos=None):
     """
     Works much like re.compile(pattern).match( ... ), except that
     pos and endpos simply slice the entry before matching.
     """
     if self.isCompiled:
       return re.match(self.__pattern, entry[pos:endpos])
     else:
       raise RuntimeError('RegTemplate object must be compiled before matching')

   def search(self, entry, pos=0, endpos=None):
     """
     Works much like re.compile(pattern).search( ... ), except that
     pos and endpos simply slice the entry before searching.
     """
     if self.isCompiled:
       return re.search(self.__pattern, entry[pos:endpos])
     else:
       raise RuntimeError('RegTemplate object must be compiled before searching')

   def verify(self, entry, exact=True):
     """
     If exact is True, the whole entry must equal the matched text.
     Otherwise it is enough for the pattern to match at the start of
     the entry; anything after the match is ignored.
     """
     if self.isCompiled:
       result = re.match(self.__pattern, entry)
     else:
       raise RuntimeError('RegTemplate object must be compiled before verification')
     if result is not None:
       if exact:
         tv = (entry == result.group())
       else:
         tv = result.group() in entry           
     else:
       tv = False
     return tv
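
For checking a whole batch of entries, the same exact-versus-approximate idea can be sketched in Python 3 roughly as follows.  This is an illustration under assumptions, not part of RegTemplate: the ENTRY pattern and the check_entries helper are hypothetical, the pattern is written out by hand for the simplified 'ip [date] request' format, and re.fullmatch stands in for the "entry equals matched text" comparison (close to it, though not guaranteed identical for every pattern).

```python
import re

# Hypothetical hand-written pattern for the simplified entry format.
ENTRY = re.compile(
    r'(?P<ip>(?:[0-9]+\.?)+) '
    r'\[(?P<date>[0-9]{2}/[A-Za-z]+/[0-9]{4})\] '
    r'(?P<request>GET|POST)'
)


def check_entries(lines, exact=True):
    """Split log lines into (parsed, rejected) lists."""
    parsed, rejected = [], []
    for line in lines:
        entry = line.rstrip('\n')
        # fullmatch requires the whole entry to fit the pattern;
        # match only requires the entry to start with it.
        m = ENTRY.fullmatch(entry) if exact else ENTRY.match(entry)
        if m is None:
            rejected.append(entry)
        else:
            parsed.append(m.groupdict())
    return parsed, rejected


lines = [
    '152.2.103.87 [22/Apr/2009] GET\n',
    '152.2.103.87 [22/Apr/2009] GET \\ ',
    'garbage',
]
parsed, rejected = check_entries(lines, exact=True)
print(len(parsed), 'parsed,', len(rejected), 'rejected')
```

Collecting the rejected lines instead of stopping at the first failure makes it easier to see which entries the pattern does not cover yet.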