Using Regular Expressions in Python

1. Finding the First Match

1.1 A Simple Example

import re

text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged.
# The if-statement after search() tests whether it succeeded.
if match:
    print('found', match.group())  ## 'found word:cat'
else:
    print('did not find')

1.2 Single Characters

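a, X, 9, < -- ordinary characters match themselves.
. -- matches any single character except the newline \n.
\w -- matches any single word character [a-zA-Z0-9_] (with Python 3 str patterns, Unicode letters and digits match too).
\W -- matches any single non-word character.
\b -- the boundary between a word character and a non-word character. A zero-width assertion that matches only at the start or end of a word.
\B -- the opposite of \b. Another zero-width assertion, which matches only where the current position is not at a word boundary.
\s -- matches any single whitespace character [ \n\r\t\f\v].
\S -- matches any single non-whitespace character.
\t, \n, \r, \f, \v -- tab, newline, carriage return, form feed, vertical tab.
\d -- matches any single digit [0-9].
\D -- matches any single non-digit character.
^ = start, $ = end -- match the start and the end of the string, respectively.
\ -- removes the special meaning of a metacharacter, so that the character to its right is matched literally (for example, \. matches a literal dot).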
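
A few of the classes above can be tried out in a short sketch. The sample string and variable names are made up for illustration.

```python
import re

text = 'Room 42, floor 7'

# \d matches a single digit; findall returns every non-overlapping match.
digits = re.findall(r'\d', text)
print(digits)

# \w+ matches runs of word characters (letters, digits, underscore).
words = re.findall(r'\w+', text)
print(words)

# \s matches one whitespace character; here it is used as a split point.
parts = re.split(r'\s', text)
print(parts)

# \b anchors the match at word boundaries, so 'floor' matches only as a whole word.
whole_word = re.search(r'\bfloor\b', text)
print(whole_word.group())
```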

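Note: in a Python string literal, \b is the backspace character (ASCII value 8). Without a raw string, Python converts \b into a backspace before the regex engine ever sees it, and the pattern will not match the way you intended.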
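
The difference between a raw and a non-raw pattern can be checked directly. This is a minimal sketch; the sample text and the names plain and raw are assumptions chosen for the demonstration.

```python
import re

text = 'a cat sat'

# Without the r prefix, '\b' is the backspace character (code point 8).
plain = '\bcat\b'
# With the r prefix, the backslash reaches the regex engine untouched.
raw = r'\bcat\b'

print(len(plain), len(raw))          # the raw string is two characters longer
print(ord(plain[0]))                 # code point of the first character
print(re.search(plain, text))        # no backspace in the text, so no match
print(re.search(raw, text).group())  # word-boundary match on 'cat'
```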

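1.3 Square Brackets

[] -- denotes a set of characters that may appear at this position; any of the characters above can go inside. Note that within square brackets:
- . is just the ordinary character .;
- - at the start or end is the ordinary character -; elsewhere it denotes a range;
- ^ at the start means "not": it matches any character that is not in the set.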
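
The three bracket rules can be seen in a small sketch. The sample string is invented for illustration.

```python
import re

text = 'a-b.c d_e x9'

# Inside brackets the dot is literal, and a trailing hyphen is literal too.
literals = re.findall(r'[.-]', text)
print(literals)

# A hyphen between two characters denotes a range.
in_range = re.findall(r'[a-c]', text)
print(in_range)

# A leading ^ negates the set: everything except lowercase letters and spaces.
negated = re.findall(r'[^a-z ]', text)
print(negated)
```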

1.4 Repetition

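+ -- the item to its left occurs one or more times.
* -- the item to its left occurs zero or more times.
? -- the item to its left occurs zero or one time.
{m,n} -- at least m repetitions (default 0) and at most n repetitions (default: no upper bound). In other words, {0,} is equivalent to *, {1,} is equivalent to +, and {0,1} is the same as ?.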

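Matching follows two rules: 1. the search finds the leftmost substring that satisfies the pattern; 2. within that match, + and * consume as much as possible (they are greedy).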
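
The two rules and the repetition operators can be illustrated as follows. This is a sketch with made-up sample strings and variable names.

```python
import re

# Leftmost: the search settles on the first position where a match is possible,
# even though a longer run of i's appears later in the string.
leftmost = re.search(r'i+', 'piigiiii').group()
print(leftmost)

# Greedy: once a starting point is found, \d+ takes as many digits as it can.
greedy = re.search(r'\d+', 'abc12345xyz').group()
print(greedy)

# {m,n} bounds the repetition count; here at most three digits are consumed.
bounded = re.search(r'\d{2,3}', 'abc12345xyz').group()
print(bounded)

# ? makes the preceding item optional.
optional = re.findall(r'colou?r', 'color colour')
print(optional)
```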

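1.5 Parentheses

() -- adding parentheses to a pattern does not change what it matches, but it groups the matched substrings in order so that each can be retrieved separately.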

import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

2. Finding All Matches

2.1 A Simple Example

import re

## Suppose we have a text with many email addresses
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)  ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)

2.2 Groups

import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(pairs)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in pairs:
    print(pair[0])  ## username
    print(pair[1])  ## host

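Note: placing ?: right after the opening parenthesis, as in (?:...), makes the parentheses non-capturing, so they are not treated as a group.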
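
A short sketch of how a non-capturing group changes the result of re.findall(), reusing the sample text from this section. The variable name users is an assumption for illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Only the username is captured; (?:...) groups the host without capturing it.
# With a single capturing group, findall returns plain strings, not tuples.
users = re.findall(r'([\w.-]+)@(?:[\w.-]+)', text)
print(users)
```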

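3. Advanced

3.1 Extra Arguments

Functions in the re module accept an extra argument (a flag) that changes their matching behavior, for example: re.search(pat, str, re.IGNORECASE). The flags:

IGNORECASE -- ignore case. (By default, matching is case-sensitive.)
DOTALL -- allow . to match the newline \n. (By default, . matches any character except the newline \n.)
MULTILINE -- in a long string that contains several lines, allow ^ and $ to match the start and end of each line. (By default, ^ and $ match only the start and end of the whole string.)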
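
The effect of each flag can be seen in a minimal sketch. The three-line sample text and the variable names are invented for illustration.

```python
import re

text = 'First line\nsecond LINE\nthird line'

# IGNORECASE: 'line' also matches 'LINE'.
any_case = re.findall(r'line', text, re.IGNORECASE)
print(any_case)

# DOTALL: '.' now crosses newlines, so the pattern can span several lines.
with_dotall = re.search(r'First.*third', text, re.DOTALL)
without_dotall = re.search(r'First.*third', text)
print(with_dotall is not None, without_dotall is None)

# MULTILINE: ^ matches at the start of every line, not just the string start.
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
print(line_starts)

# Flags can be combined with the bitwise OR operator.
full_lines = re.findall(r'^\w+ line$', text, re.IGNORECASE | re.MULTILINE)
print(full_lines)
```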

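3.2 Shortest Match

By default, * matches as much as it can. For example, matching the pattern <.*> against the string <b>foo</b> and <i>so on</i> yields the whole string <b>foo</b> and <i>so on</i>. Adding a ? after * or + makes it stop as early as possible, which yields <b>. A more common way to achieve the same result is <[^>]*>.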
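
The three variants can be compared side by side. This is a sketch that assumes a tagged sample string; the variable names are illustrative.

```python
import re

# Assumed sample input: text with simple tags in it.
text = '<b>foo</b> and <i>so on</i>'

# Greedy: .* runs to the last '>' in the string, giving one big match.
greedy = re.findall(r'<.*>', text)
print(greedy)

# Non-greedy: .*? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r'<.*?>', text)
print(lazy)

# Negated set: [^>]* cannot run past a '>', so it also yields one match per tag.
negated = re.findall(r'<[^>]*>', text)
print(negated)
```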

3.3 Substitution

Example:

import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, text) -- returns a new string with all replacements made,
## \1 is group(1), \2 is group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', text))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
