Today will be a gentle introduction to using regular expressions. One of the most important things you want to do when you're working with text files is pull out specific pieces of information and store them in a way that you can work with later on. Regular expressions will help you pick out the pieces of data that you need and python will provide you with a structure to use that information in a programmatic fashion.
Many people will tell you regular expressions are ph33r. It is true that regular expressions can be really dense in terms of how much thought is put into them on a per character basis. However, I think that by starting with basic cases and working our way up no one should feel intimidated by regex.
A couple of caveats: This is not an exhaustive guide to regex or python but it will give you an idea of the constructs and tools you have in hand to build on.
Pattern Matching
We will start off in the interpreter because it gives us an interface where we can quickly run test-cases and get basic help for modules and functions that we want to use.
First things first, let's have a look at the interpreter, import re (which is the python module that gives access to regex) and have a look at getting basic help.
root@Trident:~/Desktop# python
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
# We import "re" which is the python module that deals with regular expressions.
>>> import re
>>>
# "dir()" allows us to list what kind of functions we can call.
>>> dir(re)
['DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'S', 'Scanner', 'T', 'TEMPLATE',
'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__doc__', '__file__', '__name__',
'__package__', '__version__', '_alphanum', '_cache', '_cache_repl', '_compile', '_compile_repl',
'_expand', '_pattern_type', '_pickle', '_subx', 'compile', 'copy_reg', 'error', 'escape', 'findall',
'finditer', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'sys',
'template']
>>>
>>> dir(re.search)
['__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__doc__',
'__format__', '__get__', '__getattribute__', '__globals__', '__hash__', '__init__', '__module__',
'__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
'__subclasshook__', 'func_closure', 'func_code', 'func_defaults', 'func_dict', 'func_doc', 'func_globals',
'func_name']
>>>
# "help()" will also give us a short help for a specific function.
>>> help(re.search)
Help on function search in module re:
search(pattern, string, flags=0)
Scan through string looking for a match to the pattern, returning
a match object, or None if no match was found.
(END)
Ok, time to take our first steps with regular expressions. The first things we will look at are "." (dot), "\" and "r". To make our life easier we will define a small function straight in the interpreter. At the end I will provide a python script which has all the test cases in it so you can review them.
Definition:
. (dot) – Matches any character except newline.
\ (backslash) – Allows certain characters to be interpreted literally.
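r' ' - Prepending the pattern with "r" makes it a raw string, so python passes backslashes through to the regex engine untouched instead of treating them as its own escape sequences. We don't really need to think about this too much, we will just add it to all our patterns.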
>>>
# This small "find" function takes two arguments: "pat" = regex pattern and "text" = string we're
# searching. The function prints the match or "Not Found!".
>>> def find(pat, text):
...     match = re.search(pat, text)
...     if match:
...         print match.group()
...     else:
...         print 'Not Found!'
...
>>>
# In this case "...y" matches "eeey".
>>> find(r'...y', 'Say heeey')
eeey
>>>
# In this case "Not Found!" because the match needs to be exact.
>>> find(r'...ys', 'Say heeey')
Not Found!
>>>
# This matches "eeey" and "vwxy" but only returns the first match.
>>> find(r'...y', 'Say heeey or even better vwxy')
eeey
>>>
# This matches "vwxy".
>>> find(r'v..y', 'Say heeey or even better vwxy')
vwxy
>>>
# If we actually want to match "." (dot) we need to escape it with a "\" (backslash).
>>> find(r'.\..y', 'Say hee.ey')
e.ey
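A quick aside on that "r" prefix, since we glossed over it. The snippet below is only a sketch to show the difference between a normal and a raw string. Note that "\b" (a word boundary) is not something this tutorial covers, I am borrowing it purely because it makes the problem visible. I have also written print() with parentheses so the snippet runs on Python 3 as well as the 2.7 interpreter used here.

```python
import re

# A normal string literal: Python turns "\n" into a single newline character.
normal = '\n'
# A raw string literal: the backslash and the "n" are kept as two characters.
raw = r'\n'

print(len(normal))
print(len(raw))

# "\b" is where this really bites: in a normal string it is a backspace
# character, in a raw string the regex engine receives \b (word boundary).
text = 'say hey'
without_raw = re.search('\bhey', text)
with_raw = re.search(r'\bhey', text)

print(without_raw)
print(with_raw.group())
```

The first search quietly finds nothing because it is looking for a backspace character followed by "hey", which is exactly the kind of bug the "r" prefix saves us from.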
So far so good, time to throw some more regex into the mix. The next things we will look at are "\w", "\d", "\s" and the "+" operator.
Definition:
\w – Matches any "word" character. The name is a bit misleading: it covers letters, digits and the underscore.
\d – Matches any digit.
\s & \S – \s matches white-space and \S matches any non whitespace character.
+ Operator - Appending "+" to a regex operator indicates one or more of that operator type (eg: \w+ will match word characters until it hits a non-word character).
>>>
# Matches ":" and three word characters.
>>> find(r':\w\w\w', 'list :cat dog fish!')
:cat
>>>
# Matches ":" and three digits.
>>> find(r':\d\d\d', 'bla :123xxx')
:123
>>>
# This matches "bob baker": a "b", two "word" characters, one whitespace and five more "word" characters.
# This example is total overkill but serves its purpose.
>>> find(r'b\w\w\s\w\w\w\w\w', 'this is bob baker')
bob baker
>>>
# Here the "+" operator matches one or more whitespace characters.
>>> find(r'\d\s+\d\s+\d', '1 2 3')
1 2 3
>>>
# Similarly the "+" operator helps simplify our previous pattern by matching an arbitrary amount of "word"
# characters.
>>> find(r'b\w+\s\w+', 'this is bob baker')
bob baker
>>>
# As a final example the pattern matches ":" and an arbitrary amount of "word" characters.
>>> find(r':\w+', 'bla bla :sweet ow snap')
:sweet
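The "+" operator has a sibling, "*", which means zero or more instead of one or more. The sketch below puts the two side by side and also shows "\S+". The find() helper here is a slight variation of ours that returns the match instead of printing it (that change is my own, purely to make the results easy to compare), and print() is written with parentheses so it also runs on Python 3.

```python
import re

def find(pat, text):
    """Return the first match of pat in text, or None if there is none."""
    match = re.search(pat, text)
    return match.group() if match else None

# "+" needs at least one digit after the colon, so a bare ":" does not match.
plus_result = find(r':\d+', 'bla :xxx')
# "*" is happy with zero digits, so it matches just the ":".
star_result = find(r':\d*', 'bla :xxx')
# With digits present, "+" grabs as many of them as it can.
greedy_result = find(r':\d+', 'bla :123xxx')
# "\S+" keeps going until it hits the first whitespace character.
non_space_result = find(r':\S+', 'bla bla :kitten&a=123&test hello')

print(plus_result)
print(star_result)
print(greedy_result)
print(non_space_result)
```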
Ok we can now pick out simple patterns, next we will be looking at [ ] and ( ).
Definition:
[ ] – [ ] will allow us to specify a set of characters which we can match without having to know the order.
( ) - Putting ( ) in our pattern allows us to pick out a piece or pieces of the pattern that we are interested in.
>>>
# In this case we're matching "http" plus an arbitrary amount of colons, forward slashes, "word"
# characters and dots. The main thing to take notice of here is that the characters can be matched in
# any order and that the "." (dot) in this case is interpreted literally as a dot.
>>> find(r'http[:/\w.]+', '<a href="http://sample_url.here">Click here!</a>')
http://sample_url.here
>>>
# Using similar techniques as before this pattern picks out email addresses.
>>> find(r'[\w.]+@[\w.]+', 'junk total.ph33r@offsec.com snap @ ')
total.ph33r@offsec.com
>>>
# On the back of our previous example we are using ( ) parentheses to pick out the first and second part of
# the email address.
>>> x = re.search(r'([\w.]+)@([\w.]+)', 'junk total.ph33r@offsec.com snap @ ')
>>>
# We can print the entire match or the individual parts we picked out.
>>> print x.group()
total.ph33r@offsec.com
>>> print x.group(1)
total.ph33r
>>> print x.group(2)
offsec.com
>>>
# So far we have been using re.search which returns only the first match but we can also use re.findall
# which will return all the matches.
>>> x = re.findall(r'[\w.]+@[\w.]+', 'bla total.ph33r@offsec.com snap @ foo@bar.com ')
>>>
# Printing "x" returns a list of matches.
>>> print x
['total.ph33r@offsec.com', 'foo@bar.com']
>>>
# We are now picking out parts of the pattern over multiple matches.
>>> x = re.findall(r'([\w.]+)@([\w.]+)', 'bla total.ph33r@offsec.com snap @ foo@bar.com ')
>>>
# This returns a list of tuples.
>>> print x
[('total.ph33r', 'offsec.com'), ('foo', 'bar.com')]
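Once you have that list of tuples you will usually want to loop over it. The sketch below does that, and also shows re.finditer (it is in the dir(re) listing above), which hands back match objects rather than plain strings. The variable names are just for illustration, and print() is written with parentheses so it also runs on Python 3.

```python
import re

text = 'bla total.ph33r@offsec.com snap @ foo@bar.com '
pattern = r'([\w.]+)@([\w.]+)'

# findall gives plain tuples, which unpack nicely in a for loop.
pairs = []
for user, host in re.findall(pattern, text):
    pairs.append((user, host))
    print(user + ' at ' + host)

# finditer gives match objects instead, so group() and start() stay available.
positions = [(m.group(), m.start()) for m in re.finditer(pattern, text)]
print(positions)
```

Use findall when you only need the text, and finditer when you also care where in the string each match was found.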
We can do some decent pattern matching now. There is one last thing I want to introduce, regex assertions.
Definition:
Positive lookahead a(?=b) – Check if "a" is followed by "b".
Negative lookahead a(?!b) – Check if "a" is not followed by "b".
Positive lookbehind (?<=a)b – Check if "b" is preceded by "a".
Negative lookbehind (?<!a)b – Check if "b" is not preceded by "a".
Lookaround (?=.*a) - Check if "a" appears somewhere ahead of the current position; this is simply a positive lookahead combined with ".*".
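The thing to remember about assertions is that they only check, they never become part of the match. Here is a small sketch of that using a shortened version of the weekday string from the script further down. The variable names are mine, and print() is written with parentheses so it also runs on Python 3.

```python
import re

week_d = 'monday tuesday wednesday'

# Positive lookahead: "monday " must be followed by "tuesday",
# but "tuesday" is only checked, it is not part of the match.
ahead = re.search(r'monday\s(?=tuesday)', week_d)
print(repr(ahead.group()))

# Negative lookahead: fails here because "monday " IS followed by "tuesday".
not_ahead = re.search(r'monday\s(?!tuesday)', week_d)
print(not_ahead)

# Positive lookbehind: match "tuesday" only when "monday " sits right before it.
behind = re.search(r'(?<=monday\s)tuesday', week_d)
print(behind.group())

# Negative lookbehind: "wednesday" matches because "monday " is not before it.
not_behind = re.search(r'(?<!monday\s)wednesday', week_d)
print(not_behind.group())
```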
The regex assertion test-cases each have a separate if statement so it wouldn't be practical to do them in the interpreter. I have created a python script below that contains all the test-cases we have seen so far and includes the regex assertions.
#!/usr/bin/python
##############################################
# . (dot) finds any character except newline
# \w any word character (letters, digits, underscore)
# \d any digit
# \s whitespace
# \S all non-whitespace
# + one or more of that
# * zero or more of that
# r'pat' send pattern in raw format (no formatting)
# [] define set of allowed characters (inside [] . == raw dot)
# () pick out the parts that we care about
# (?= string) Positive lookahead
# (?! string) Negative lookahead
# (?<= string) Positive lookbehind
# (?<! string) Negative lookbehind
# (?=.* string) Is the string present
##############################################
import re
def find(pat, text):
    match = re.search(pat, text)
    if match:
        print match.group()
    else:
        print 'Not found!'

def main():
    print "Using our find function:"
    print "--------------------------------------------------"
    find('...y', 'Say heeey') #finds 'eeey'
    find('...ys', 'Say heeey') #prints 'Not found!'
    find('...y', 'Say heeey or even better vwxy') #finds 'eeey'
    find('v..y', 'Say heeey or even better vwxy') #finds 'vwxy'
    find('..yz', 'Say heeey or even better vwxyz') #finds 'wxyz'
    find(r'.\..y', 'Say hee.ey') #finds 'e.ey'
    find(r':\w\w\w', 'list :cat dog fish!') #finds ':cat'
    find(r':\d\d\d', 'bla :123xxx') #finds ':123'
    find(r'b\w\w\s\w\w\w\w\w', 'this is bob baker') #finds 'bob baker'
    find(r'\d\s+\d\s+\d', '1 2 3') #finds '1 2 3'
    find(r'b\w+\s\w+', 'this is bob baker') #finds 'bob baker'
    find(r':\w+', 'bla bla :sweet ow snap') #finds ':sweet'
    find(r':\S+', 'bla bla :kitten&a=123&test hello') #finds ':kitten&a=123&test'
    find(r'http[:/\w.]+', '<a href="http://sample_url.here">Click here!</a>') #finds 'http://sample_url.here'
    find(r'[\w.]+@[\w.]+', 'junk total.ph33r@offsec.com snap @ ') #finds 'total.ph33r@offsec.com'
    print ""
    print "Using re.search:"
    print "--------------------------------------------------"
    x = re.search(r'([\w.]+)@([\w.]+)', 'junk total.ph33r@offsec.com snap @ ') # 'total.ph33r' and 'offsec.com'
    print x.group() #returns 'total.ph33r@offsec.com'
    print x.group(1) #returns 'total.ph33r'
    print x.group(2) #returns 'offsec.com'
    print ""
    print "Using re.findall:"
    print "--------------------------------------------------"
    x = re.findall(r'[\w.]+@[\w.]+', 'bla total.ph33r@offsec.com snap @ foo@bar.com ') #returns list
    print x # returns ['total.ph33r@offsec.com', 'foo@bar.com']
    x = re.findall(r'([\w.]+)@([\w.]+)', 'bla total.ph33r@offsec.com snap @ foo@bar.com ') #returns list of tuples
    print x # returns [('total.ph33r', 'offsec.com'), ('foo', 'bar.com')]
    print ""
    print "Regex Assertions:"
    print "--------------------------------------------------"
    week_d = 'monday tuesday wednesday thursday friday saturday sunday'
    # Monday is followed by Tuesday
    expression = re.compile(r'monday\s(?=tuesday)')
    if expression.search(week_d):
        print 'Positive lookahead!'
    # Monday is not followed by Friday
    expression = re.compile(r'monday\s(?!friday)')
    if expression.search(week_d):
        print 'Negative lookahead!'
    # Thursday is preceded by Wednesday
    expression = re.compile(r'(?<=wednesday)\sthursday')
    if expression.search(week_d):
        print 'Positive lookbehind!'
    # Thursday is not preceded by Tuesday
    expression = re.compile(r'(?<!tuesday)\sthursday')
    if expression.search(week_d):
        print 'Negative lookbehind!'
    # String contains Sunday
    expression = re.compile(r'(?=.*sunday)')
    if expression.search(week_d):
        print 'There is a Sunday!'
    print ""
    print "Extended options dir(re):"
    print "--------------------------------------------------"
    print dir(re)
    # eg => re.findall(r'h\w+', 'Hey hola Hello', re.IGNORECASE) # ['Hey', 'hola', 'Hello']

main()
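The last comment in the script hints at the flags that show up in dir(re). As a rough sketch of how they are used, here is that same re.IGNORECASE example next to a version without the flag and a compiled version. The variable names are mine, and print() is written with parentheses so it also runs on Python 3.

```python
import re

text = 'Hey hola Hello'

# Without a flag the lowercase "h" only matches lowercase text.
case_sensitive = re.findall(r'h\w+', text)
# re.IGNORECASE (short form: re.I) makes "h" match "H" as well.
case_insensitive = re.findall(r'h\w+', text, re.IGNORECASE)

# Flags can also be baked into a compiled pattern and reused.
h_words = re.compile(r'h\w+', re.I)
compiled_result = h_words.findall(text)

print(case_sensitive)
print(case_insensitive)
print(compiled_result)
```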
If you have text files or strings that adhere to specific patterns, regular expressions are great tools to sift through that data and quickly pull out the bits that you need. With python on the back-end you can easily use that data in a programmatic fashion to accomplish your goals.
Just keep the following quote by Jamie Zawinski in mind:
Some people, when confronted with a problem, think
"I know, I'll use regular expressions." Now they have two problems.
Regex is powerful but don't shoot yourself in the foot, use it when it suits your needs!!
Usage / Practice
I want to finish off this tutorial by presenting you with a very simple use case that is nonetheless quite useful and will allow you to practice regex (what more can you ask for).
If you have ever done any (serious) Return Oriented Programming (ROP) you will know that there usually aren’t any "canned" solutions out there. You will most likely end up spending a lot of time sifting through gadgets to find what you are looking for.
In my experience the most useful tool out there to get a basic raw list of ROP gadgets is mona (a debugger plug-in by Corelan). Even so, if you are building custom ROP chains and need to find very specific instructions in a 20k list there will definitely be moments that you think the world is ending hehe. ROP gadgets follow specific patterns but more often than not they can't efficiently be matched with simple searches. This results in tons of useless results and migraines. Fortunately for us regex allows us to do much better pattern matching.
The following few lines of python should save you a lot of time and aspirin tablets. The script takes two command line arguments, a text file and a regex pattern. A couple of things to take note of: (1) I've pre-formatted this to work with mona output files and (2) since gadgets are usually less complex the shorter they are, sorting the matches by length works wonders.
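#!/usr/bin/python
import sys, re

def regex(rop, pattern):
    # 'rU' opens the file with universal newline support (Python 2).
    with open(rop, 'rU') as gadget:
        text = re.findall(r'.+' + pattern + r'.+RETN', gadget.read())
    # Shorter gadgets are usually simpler, so print those first.
    for x in sorted(text, key=len):
        print x

def main():
    # Usage: ./regROP.py <mona output file> <regex pattern>
    regex(sys.argv[1], sys.argv[2])

main()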
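If you want to poke at the idea without a mona output file at hand, the sketch below runs the same findall-and-sort logic against a few gadget lines held in a string. The lines are copied from the sample output further down; the search_gadgets name and the SAMPLE variable are made up for this illustration, and print() is written with parentheses so it also runs on Python 3.

```python
import re

# A few gadget lines copied from the sample output in this post.
SAMPLE = """0x7c3413a9 : # XCHG EAX,ESP # MOV EAX,DWORD PTR DS:[EAX] # PUSH EAX # RETN
0x7c348b05 : # XCHG EAX,ESP # RETN
0x7c352007 : # ADD ESP,0C # RETN
0x7c36c652 : # XCHG EAX,EDI # PUSH ESP # STD # DEC ECX # RETN
"""

def search_gadgets(text, pattern):
    """Return each line that contains pattern before its final RETN, shortest first."""
    # "." never matches a newline, so every match stays on a single line.
    matches = re.findall(r'.+' + pattern + r'.+RETN', text)
    return sorted(matches, key=len)

xchg = search_gadgets(SAMPLE, r'XCHG')
add_esp = search_gadgets(SAMPLE, r'ADD ESP,..\s#')

for line in xchg:
    print(line)
for line in add_esp:
    print(line)
```

Because "." stops at the end of a line, the pattern can never spill over from one gadget into the next, which is what makes reading the whole file in one go safe.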
I have included a sample ROP-list below for testing purposes and a couple of searches so you can get an idea of how the script works.
MSVCR71.dll - raw_rop.txt
Keep in mind that these are just simple searches, but you can use the full range of regular expressions to find even the rarest instructions. I have also limited the output below to just a few results, but most searches return tons of matches.
root@Trident:~/Desktop# ./regROP.py raw_rop.txt 'PUSH ESP.+POP'
0x7c372f4f : # PUSH ESP # AND AL,10 # MOV DWORD PTR DS:[EDX],ECX # POP ESI # RETN
0x7c34969e : # PUSH ESP # MOV AL,BYTE PTR DS:[C68B7C37] # POP ESI # POP EBX # RETN
0x7c37591f : # PUSH ESP # ADD CH,BL # INC EBP # OR AL,59 # POP ECX # POP EBP # RETN
[...Snip...]
root@Trident:~/Desktop# ./regROP.py raw_rop.txt 'ADD ESP,..\s#'
0x7c352007 : # ADD ESP,0C # RETN
0x7c352041 : # ADD ESP,0C # RETN
0x7c35f9a0 : # ADD ESP,2C # RETN
0x7c35207e : # ADD ESP,0C # RETN
0x7c3520bd : # ADD ESP,0C # RETN
0x7c3440be : # ADD ESP,14 # RETN
[...Snip...]
root@Trident:~/Desktop# ./regROP.py raw_rop.txt 'MOV\s\w+,DWORD PTR DS:\[EAX\]'
0x7c3530ea : # MOV EAX,DWORD PTR DS:[EAX] # RETN
0x7c3413aa : # MOV EAX,DWORD PTR DS:[EAX] # PUSH EAX # RETN
0x7c35a000 : # MOV EAX,DWORD PTR DS:[EAX] # ADD EAX,ECX # RETN
0x7c359fff : # POP ESI # MOV EAX,DWORD PTR DS:[EAX] # ADD EAX,ECX # RETN
[...Snip...]
root@Trident:~/Desktop# ./regROP.py raw_rop.txt 'XCHG'
0x7c348b05 : # XCHG EAX,ESP # RETN
0x7c36c652 : # XCHG EAX,EDI # PUSH ESP # STD # DEC ECX # RETN
0x7c341cae : # XCHG EAX,ESP # PUSH ES # ADD BYTE PTR DS:[EAX],AL # RETN
0x7c3413a9 : # XCHG EAX,ESP # MOV EAX,DWORD PTR DS:[EAX] # PUSH EAX # RETN
0x7c342643 : # XCHG EAX,ESP # POP EDI # ADD BYTE PTR DS:[EAX],AL # POP ECX # RETN
[...Snip...]
root@Trident:~/Desktop# ./regROP.py raw_rop.txt 'KERNEL'
0x7c355c63 : # ADC EAX,<&KERNEL32.Beep> # RETN
0x7c355c54 : # ADC EAX,<&KERNEL32.Sleep> # RETN
0x7c341a0f : # ADC EAX,<&KERNEL32.TlsAlloc> # RETN
0x7c35f575 : # ADC EAX,<&KERNEL32.LoadLibraryA> # RETN
0x7c355c61 : # OR BH,BH # ADC EAX,<&KERNEL32.Beep> # RETN
0x7c34ade7 : # ADC EAX,<&KERNEL32.HeapFree> # POP ESI # RETN
0x7c341a0d : # ADD BH,BH # ADC EAX,<&KERNEL32.TlsAlloc> # RETN
[...Snip...]
root@Trident:~/Desktop# ./regROP.py raw_rop.txt 'POP.+POP.+POP'
0x7c374011 : # POP ECX # POP ESI # POP EBP # RETN
0x7c37606c : # POP EDI # POP ESI # POP EBX # RETN
0x7c3660e2 : # POP EDI # POP ESI # POP EBX # RETN
0x7c3761a5 : # POP EDI # POP ESI # POP EBX # RETN
0x7c342301 : # POP EDI # POP ESI # POP EBP # RETN
0x7c35437b : # POP EDI # POP ESI # POP EBP # RETN
0x7c350389 : # POP ESI # POP EBX # POP EBP # RETN
[...Snip...]
root@Trident:~/Desktop# ./regROP.py raw_rop.txt 'XOR\sEAX,EAX.+INC\sEAX'
0x7c364045 : # XOR EAX,EAX # INC EAX # RETN
0x7c364071 : # XOR EAX,EAX # INC EAX # RETN
0x7c358077 : # XOR EAX,EAX # INC EAX # RETN
0x7c36409d : # XOR EAX,EAX # INC EAX # RETN
0x7c3480c1 : # XOR EAX,EAX # INC EAX # RETN
0x7c3640e7 : # XOR EAX,EAX # INC EAX # RETN
0x7c34810b : # XOR EAX,EAX # INC EAX # RETN
0x7c354146 : # XOR EAX,EAX # INC EAX # RETN
[...Snip...]