Parsing nginx log files
The Python group at OASIS has been busy making Plone do backflips! We’ve hooked up several ZEO instances to Nginx and Varnish, which is cool stuff. After the grunt work of digging through config files, we faced the problem of how to test our setup, and none of us wanted to do end-user testing for thirty big sites. What to do, what to do…
Well, as always, we found a solution.
We have Nginx logging set up and about 10,000 log entries per site. All we needed was software that used the log entries to programmatically replicate user actions. The first step, however, was to write some code that could take each log entry, parse it and check it for consistency. That’s where RegTemplate fits in. RegTemplate takes a $-delimited template string and a dictionary of regular expressions whose keys are the template identifiers. When the template is compiled, each identifier is replaced by its corresponding regular expression, wrapped in a named group, and the result is the pattern for a log entry. RegTemplate also has a method called ‘verify’, which takes a log entry as an argument, runs it through the constructed regular expression, and then compares the matched text to the original log entry. Cool, huh?
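To make the substitution step concrete, here is a minimal standalone sketch of the same idea. It is not the RegTemplate class itself: build_pattern is a hypothetical helper name, it uses re.split and re.escape instead of string.Template, and the field regexes are simplified versions of the ones in the sample, so treat it as an illustration under those assumptions.

```python
import re


def build_pattern(template, regexes):
    """Turn each $name placeholder into a named group and escape the rest.

    Hypothetical helper for illustration; raises KeyError if the template
    uses an identifier that is missing from `regexes`.
    """
    # Splitting on a capturing group alternates literal text and identifiers:
    # '$ip [$date] $request' -> ['', 'ip', ' [', 'date', '] ', 'request', '']
    parts = re.split(r'\$([A-Za-z_][A-Za-z0-9_]*)', template)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2:
            # Odd positions hold identifiers: wrap their regex in a named group.
            pieces.append('(?P<%s>%s)' % (part, regexes[part]))
        else:
            # Even positions hold literal text: escape regex metacharacters.
            pieces.append(re.escape(part))
    return re.compile(''.join(pieces))


# Simplified field regexes (assumptions, not the exact ones from the post).
regexes = {
    'ip': r'(?:[0-9]+\.?)+',
    'date': r'[0-9]{2}/[A-Za-z]+/[0-9]{4}',
    'request': r'GET|POST',
}
pattern = build_pattern('$ip [$date] $request', regexes)

entry = '152.2.103.87 [22/Apr/2009] GET'
match = pattern.match(entry)
fields = match.groupdict()

# "Exact" verification: the matched text must equal the whole entry.
is_exact = match.group() == entry

print(fields['ip'])       # 152.2.103.87
print(fields['date'])     # 22/Apr/2009
print(fields['request'])  # GET
print(is_exact)           # True
```

Escaping the literal pieces with re.escape avoids maintaining a hand-written list of special characters, at the cost of no longer reusing string.Template's own substitution machinery.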
Below is a sample use case for a simplified log entry:
if __name__ == '__main__':
    # Single-argument print(...) calls behave the same on Python 2 and 3.
    sample = '152.2.103.87 [22/Apr/2009] GET'
    print('Sample log entry: %s\n' % sample)

    passesExact = '152.2.103.87 [22/Apr/2009] GET'
    passesNotExact = '152.2.103.87 [22/Apr/2009] GET \\ '

    # Raw strings keep the regex backslashes intact.
    template = '$ip [$date] $request'
    ip = r'(([0-9]+\.?)+)'
    date = r'([0-9]{2})\/([A-Za-z]+)\/([0-9]{4})'
    request = r'(GET|POST)'
    dct = {'ip': ip, 'date': date, 'request': request}
    print('dictionary of regex substitutions: %s\n' % (dct,))

    rt = RegTemplate(template)
    print('Template before compilation is: %s' % rt.template)
    # compile() returns self; the pattern is stored in a name-mangled attribute.
    print('Pattern after compilation is: %s' % rt.compile(dct)._RegTemplate__pattern)
    print('Template after compilation is: %s' % rt.template)
    print('\n')

    print('"%s" passes exact match: %s' % (passesExact, rt.verify(passesExact, exact=True)))
    print('"%s" passes exact match: %s' % (passesNotExact, rt.verify(passesNotExact, exact=True)))
    print('"%s" passes approximate match: %s' % (passesNotExact, rt.verify(passesNotExact, exact=False)))
    print('\n')

    match = rt.match(passesNotExact)
    print('pulled ip from named group, value retrieved is: %s' % match.group('ip'))
    print('pulled date from named group, value retrieved is: %s' % match.group('date'))
    print('pulled request from named group, value retrieved is: %s' % match.group('request'))
    print('named groups are: %s' % (rt.namedGroups(passesNotExact),))
And here is the code:
import re
import string


class RegTemplate(string.Template):
    """
    Takes a $ delimited string and a dictionary of regular expressions
    whose keys are the template identifiers. Each identifier is replaced by its
    corresponding regular expression, forming the pattern for each log entry.
    RegTemplate has a method called 'verify' which takes a log entry as an argument
    and processes it through the constructed regular expression, then compares the
    result to the original log entry.
    """
    # A literal space in a pattern matches a literal space, so spaces need
    # no escaping. This is not true for all whitespace, so Your Mileage May
    # Vary for templates containing newlines, carriage returns, etc. Also,
    # not every regex metacharacter is listed here (for example '*' and '|'
    # are missing), so use caution.
    specials = ['[', ']', ',', '.', '\'', '"', '{', '}', '?', '-', '+', '(', ')', '^', '&', '$']

    def __init__(self, template):
        super(RegTemplate, self).__init__(template)
        self.isCompiled = False  # prevents repeat compilation
        self.regex_dict = {}

    def compile(self, regex_dict):
        """
        Replaces each identifier with its corresponding regular expression.
        Additionally, turns each expression into a named group, indexed
        by its key in regex_dict. Returns self so calls can be chained.
        """
        if not self.isCompiled:
            self.isCompiled = True
            # items() works on both Python 2 and Python 3
            for k, v in regex_dict.items():
                self.regex_dict[k] = '(?P<%s>%s)' % (k, v)
            for s in RegTemplate.specials:
                # delimiter is normally $ sign, if this is not the case
                # the $ sign must be escaped while the delimiter is
                # left untouched. this feature may be removed in future versions
                if s == self.delimiter:
                    continue
                # escape special characters in the literal parts of the template
                self.template = self.template.replace(s, '\\' + s)
            # make private to prevent overriding string.Template.pattern
            self.__pattern = self.substitute(**self.regex_dict)
        return self

    def namedGroups(self, entry):
        """
        Returns a dictionary mapping each template identifier to the text
        it matched in entry.
        """
        if not self.isCompiled:
            # string exceptions are not valid Python; raise a real exception
            raise RuntimeError('RegTemplate object must be compiled before iterating')
        named_groups = {}
        for name in self.regex_dict.keys():
            named_groups[name] = self.match(entry).group(name)
        return named_groups

    def match(self, entry, pos=0, endpos=None):
        """
        Works very similar to re.compile(pattern).match( ... )
        """
        if not self.isCompiled:
            raise RuntimeError('RegTemplate object must be compiled before matching')
        return re.match(self.__pattern, entry[pos:endpos])

    def search(self, entry, pos=0, endpos=None):
        """
        Works very similar to re.compile(pattern).search( ... )
        """
        if not self.isCompiled:
            raise RuntimeError('RegTemplate object must be compiled before searching')
        return re.search(self.__pattern, entry[pos:endpos])

    def verify(self, entry, exact=True):
        """
        If exact is set to True, the matched text must equal the whole
        entry; otherwise the method only checks for containment, so a
        match anchored at the start of entry is enough.
        """
        if not self.isCompiled:
            raise RuntimeError('RegTemplate object must be compiled before verification')
        result = re.match(self.__pattern, entry)
        if result is not None:
            if exact:
                tv = (entry == result.group())
            else:
                tv = result.group() in entry
        else:
            tv = False
        return tv
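The sample above uses a simplified entry. As a rough sketch of how the same named-group approach might look for full entries, the block below writes a pattern by hand for nginx's predefined "combined" log format and sorts lines into parsed and rejected. The field regexes, the check_entries helper and the sample lines are all assumptions made up for illustration, not part of our setup, and a custom log_format would need a different pattern.

```python
import re

# nginx's predefined "combined" format is:
#   $remote_addr - $remote_user [$time_local] "$request" $status
#   $body_bytes_sent "$http_referer" "$http_user_agent"
# The per-field regexes below are deliberately loose assumptions.
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def check_entries(lines):
    """Return (parsed, rejected): field dicts for lines that match the
    pattern in full, and the raw text of lines that do not."""
    parsed, rejected = [], []
    for line in lines:
        line = line.rstrip('\n')
        # fullmatch plays the role of an exact verify: no trailing junk allowed
        m = COMBINED.fullmatch(line)
        if m is None:
            rejected.append(line)
        else:
            parsed.append(m.groupdict())
    return parsed, rejected


# Made-up entries for illustration only.
sample_lines = [
    '203.0.113.7 - - [22/Apr/2009:10:15:32 -0400] '
    '"GET /news HTTP/1.1" 200 5120 "-" "Mozilla/5.0"\n',
    'this line is not a log entry\n',
]

parsed, rejected = check_entries(sample_lines)
print(parsed[0]['request'])  # GET /news HTTP/1.1
print(parsed[0]['status'])   # 200
print(rejected)              # ['this line is not a log entry']
```

Because check_entries accepts any iterable of strings, an open log file can be passed in directly in place of the list.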






