
Regular Expressions and the Python re Module

By 钱魏Way

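Regular expressions come up constantly when scraping data. If you are not familiar with Python's re module, its many functions are easy to mix up, so let's review the module together.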

Before digging into the module, let's see what the official documentation says. Run:

import re
help(re)

Help output:

Help on module re:
NAME
re - Support for regular expressions (RE).
FILE
c:\python27\lib\re.py
DESCRIPTION
This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.
Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.
The special characters are:
"."      Matches any character except a newline.
"^"      Matches the start of the string.
"$"      Matches the end of the string or just before the newline at
the end of the string.
"*"      Matches 0 or more (greedy) repetitions of the preceding RE.
Greedy means that it will match as many repetitions as possible.
"+"      Matches 1 or more (greedy) repetitions of the preceding RE.
"?"      Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n}    Matches from m to n repetitions of the preceding RE.
{m,n}?   Non-greedy version of the above.
"\\"     Either escapes special characters or signals a special sequence.
[]       Indicates a set of characters.
A "^" as the first character indicates a complementing set.
"|"      A|B, creates an RE that will match either A or B.
(...)    Matches the RE inside the parentheses.
The contents can be retrieved or matched later in the string.
(?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
(?:...)  Non-grouping version of regular parentheses.
(?P<name>...) The substring matched by the group is accessible by name.
(?P=name)     Matches the text matched earlier by the group named name.
(?#...)  A comment; ignored.
(?=...)  Matches if ... matches next, but doesn't consume the string.
(?!...)  Matches if ... doesn't match next.
(?<=...) Matches if preceded by ... (must be fixed length).
(?<!...) Matches if not preceded by ... (must be fixed length).
(?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
the (optional) no pattern otherwise.
The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
\number  Matches the contents of the group of the same number.
\A       Matches only at the start of the string.
\Z       Matches only at the end of the string.
\b       Matches the empty string, but only at the start or end of a word.
\B       Matches the empty string, but not at the start or end of a word.
\d       Matches any decimal digit; equivalent to the set [0-9].
\D       Matches any non-digit character; equivalent to the set [^0-9].
\s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
\S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
\w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
With LOCALE, it will match the set [0-9_] plus characters defined
as letters for the current locale.
\W       Matches the complement of \w.
\\       Matches a literal backslash.
This module exports the following functions:
match    Match a regular expression pattern to the beginning of a string.
search   Search a string for the presence of a pattern.
sub      Substitute occurrences of a pattern found in a string.
subn     Same as sub, but also return the number of substitutions made.
split    Split a string by the occurrences of a pattern.
findall  Find all occurrences of a pattern in a string.
finditer Return an iterator yielding a match object for each match.
compile  Compile a pattern into a RegexObject.
purge    Clear the regular expression cache.
escape   Backslash all non-alphanumerics in a string.
Some of the functions in this module takes flags as optional parameters:
I  IGNORECASE  Perform case-insensitive matching.
L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
M  MULTILINE   "^" matches the beginning of lines (after a newline)
as well as the string.
"$" matches the end of lines (before a newline) as well
as the end of the string.
S  DOTALL      "." matches any character at all, including the newline.
X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
U  UNICODE     Make \w, \W, \b, \B, dependent on the Unicode locale.
This module also defines an exception 'error'.
CLASSES
exceptions.Exception(exceptions.BaseException)
sre_constants.error
class error(exceptions.Exception)
|  Method resolution order:
|      error
|      exceptions.Exception
|      exceptions.BaseException
|      __builtin__.object
|  
|  Data descriptors defined here:
|  
|  __weakref__
|      list of weak references to the object (if defined)
|  
|  ----------------------------------------------------------------------
|  Methods inherited from exceptions.Exception:
|  
|  __init__(...)
|      x.__init__(...) initializes x; see help(type(x)) for signature
|  
|  ----------------------------------------------------------------------
|  Data and other attributes inherited from exceptions.Exception:
|  
|  __new__ = <built-in method __new__ of type object>
|      T.__new__(S, ...) -> a new object with type S, a subtype of T
|  
|  ----------------------------------------------------------------------
|  Methods inherited from exceptions.BaseException:
|  
|  __delattr__(...)
|      x.__delattr__('name') <==> del x.name
|  
|  __getattribute__(...)
|      x.__getattribute__('name') <==> x.name
|  
|  __getitem__(...)
|      x.__getitem__(y) <==> x[y]
|  
|  __getslice__(...)
|      x.__getslice__(i, j) <==> x[i:j]
|      
|      Use of negative indices is not supported.
|  
|  __reduce__(...)
|  
|  __repr__(...)
|      x.__repr__() <==> repr(x)
|  
|  __setattr__(...)
|      x.__setattr__('name', value) <==> x.name = value
|  
|  __setstate__(...)
|  
|  __str__(...)
|      x.__str__() <==> str(x)
|  
|  __unicode__(...)
|  
|  ----------------------------------------------------------------------
|  Data descriptors inherited from exceptions.BaseException:
|  
|  __dict__
|  
|  args
|  
|  message
FUNCTIONS
compile(pattern, flags=0)
Compile a regular expression pattern, returning a pattern object.
escape(pattern)
Escape all non-alphanumeric characters in pattern.
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
finditer(pattern, string, flags=0)
Return an iterator over all non-overlapping matches in the
string.  For each match, the iterator returns a match object.
Empty matches are included in the result.
match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning
a match object, or None if no match was found.
purge()
Clear the regular expression cache
search(pattern, string, flags=0)
Scan through string looking for a match to the pattern, returning
a match object, or None if no match was found.
split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the match object and must return
a replacement string to be used.
subn(pattern, repl, string, count=0, flags=0)
Return a 2-tuple containing (new_string, number).
new_string is the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in the source
string by the replacement repl.  number is the number of
substitutions that were made. repl can be either a string or a
callable; if a string, backslash escapes in it are processed.
If it is a callable, it's passed the match object and must
return a replacement string to be used.
template(pattern, flags=0)
Compile a template pattern, returning a pattern object
DATA
DOTALL = 16
I = 2
IGNORECASE = 2
L = 4
LOCALE = 4
M = 8
MULTILINE = 8
S = 16
U = 32
UNICODE = 32
VERBOSE = 64
X = 64
__all__ = ['match', 'search', 'sub', 'subn', 'split', 'findall', 'comp...
__version__ = '2.2.1'
VERSION
2.2.1

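Introduction to Regular Expressions

A regular expression is a kind of logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic to apply to strings. Regular expressions are a very powerful tool for matching strings. Other programming languages have them too, and Python is no exception; with regular expressions, extracting the content we want from a fetched page becomes straightforward.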

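The help output above lists the special elements of the pattern syntax. If you pass optional flags along with a pattern, the meaning of some of these elements changes.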

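Greedy and Non-Greedy Quantifiers

Regular expressions are typically used to find matching strings in text. In Python, quantifiers are greedy by default and always try to match as many characters as possible; for extraction we usually want the non-greedy form. Before explaining the concept, here is an example. Suppose we want to find anchor tags in a piece of HTML: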

import re
html = 'Hello <a href="https://www.biaodainfu.com">biaodianfu</a>'
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

Output:

['<a href="https://www.biaodainfu.com">biaodianfu</a>']

Now let's change the input and add a second anchor tag:

import re
html = 'Hello <a href="https://www.biaodainfu.com">biaodianfu</a> | Hello <a href="https://www.google.com">Google</a>'
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

Output:

['<a href="https://www.biaodainfu.com">biaodianfu</a> | Hello <a href="https://www.google.com">Google</a>']

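That is not what we wanted. This time the pattern matched from the first opening tag to the last closing tag, including everything in between, producing one match instead of two separate ones. This happens because matching is greedy by default.

In greedy mode, quantifiers such as * and + match as many characters as possible. Appending a question mark (as in .*?) makes them non-greedy:

import re
html = 'Hello <a href="https://www.biaodainfu.com">biaodianfu</a> | Hello <a href="https://www.google.com">Google</a>'
m = re.findall(r'<a.*?>.*?</a>', html)
if m:
    print(m)

Output: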

['<a href="https://www.biaodainfu.com">biaodianfu</a>', '<a href="https://www.google.com">Google</a>']

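The Backslash Problem

As in most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes. To match a literal "\" in text, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the string literal first turns each pair into a single backslash, and the regular expression engine then turns the resulting two backslashes into one literal backslash.

Python's raw strings solve this neatly: the regular expression in this example can be written as r"\\". Likewise, "\\d", which matches a digit, can be written as r"\d".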
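To make the two levels of escaping concrete, here is a minimal sketch. The sample path and the variable names are made up for illustration; it only checks that the four-backslash literal and the raw-string literal describe the same pattern.

```python
import re

text = 'C:\\temp'            # the actual string is C:\temp, with one backslash

plain = '\\\\'               # ordinary literal: four typed, two in the string
raw = r'\\'                  # raw literal: two typed, two in the string

same_pattern = (plain == raw)            # both hold the regex \\
found_plain = re.findall(plain, text)    # each hit is a single backslash
found_raw = re.findall(raw, text)
digits = re.findall(r'\d', 'a1b2')       # r'\d' is the same pattern as '\\d'

print(same_pattern, found_plain, found_raw, digits)
```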

Common Functions in the Python re Module

re.compile(pattern, flags=0)

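A Pattern object (_sre.SRE_Pattern in Python 2) is a compiled regular expression; its methods are used to match and search text. Pattern cannot be instantiated directly and must be created with re.compile().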

pattern = re.compile(r'hello')
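As a rough illustration of what a compiled pattern offers, the sketch below calls a few of its methods directly. The sample strings are invented for the example; the methods mirror the module-level functions of the same names.

```python
import re

pattern = re.compile(r'hello')

at_start = pattern.match('hello world')      # matches only at the beginning
anywhere = pattern.search('say hello')       # scans the whole string
all_hits = pattern.findall('hello, hello')   # list of every match
no_match = pattern.match('say hello')        # None: 'hello' is not at the start

print(at_start.group(), anywhere.span(), all_hits, no_match)
```

Compiling once and reusing the object is mostly a readability choice when the same pattern is applied in several places.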

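The flags argument of re.compile sets the matching mode, which modifies how the regular expression behaves. Each flag in the re module has two names: a full name such as IGNORECASE and a one-letter abbreviation such as I. Multiple flags are combined with bitwise OR; for example, re.I | re.M sets both I and M. The available values are:

  • I (IGNORECASE): case-insensitive matching; character classes and literal strings ignore case when matching letters.
  • L (LOCALE): makes the predefined classes \w, \W, \b, \B, \s and \S depend on the current locale. (Rarely used.)
  • M (MULTILINE): multi-line mode; changes the behavior of '^' and '$'.
  • S (DOTALL): makes '.' match any character, including a newline.
  • X (VERBOSE): verbose mode; the regular expression may span multiple lines, whitespace is ignored, and comments may be added.
  • U (UNICODE): makes \w, \W, \b, \B, \d, \D, \s and \S depend on Unicode character properties.

A flag can also be given as its integer value; to apply several flags at once, add the values of the distinct flags together, which gives the same result as OR-ing them.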

  • I = IGNORECASE = 2
  • L = LOCALE = 4
  • M = MULTILINE = 8
  • S = DOTALL = 16
  • U = UNICODE = 32
  • X = VERBOSE = 64
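The following sketch shows the equivalence between OR-ing flags and adding their integer values. The sample text is invented, and the literal 10 is simply 2 + 8 from the list above.

```python
import re

text = 'Hello\nhello\nHELLO'

combined = re.I | re.M                   # bitwise OR of two flags
as_number = int(combined)                # 2 | 8
by_addition = int(re.I) + int(re.M)      # same value, because the flags differ

hits_or = re.findall(r'^hello$', text, flags=combined)
hits_num = re.findall(r'^hello$', text, flags=10)
hits_none = re.findall(r'^hello$', text)  # no flags: nothing matches

print(as_number, by_addition, hits_or, hits_num, hits_none)
```

Bitwise OR is the safer habit: adding the same flag twice would produce a different value, while OR-ing it twice would not.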

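Details:

  • L: Locales are a feature of the C library intended to help write programs that take language differences into account. For example, if you are processing French text, you would like to use \w+ to match words, but for 8-bit strings \w matches only the character class [A-Za-z0-9_]; it does not match "é" or "ç". If your system is configured appropriately and the locale is set to French, certain C functions will tell the program that "é" should also be considered a letter. Setting the LOCALE flag when compiling a regular expression produces a compiled object that uses these C functions for \w; this is slower, but lets \w+ match French words as you would expect.
  • M: Normally "^" matches only at the start of the string, and $ matches only at the end of the string and just before a newline (if any) at the end of the string. With this flag set, "^" matches at the start of the string and at the start of each line within it. Similarly, $ matches at the end of the string and at the end of each line.
  • X: This flag lets you write regular expressions in a more flexible format that is easier to read. When it is set, whitespace in the RE string is ignored unless it is inside a character class or preceded by a backslash, so you can organize and indent the RE more clearly. It also lets you put comments in the RE, which the engine ignores; a comment starts with "#", provided the "#" is not inside a character class or preceded by a backslash.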
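The M and X flags described above can be sketched as follows. The log text, the date string and the variable names are made up for illustration.

```python
import re

log = 'first line\nsecond line\nthird line'

# Without MULTILINE, ^ anchors only at the very start of the string.
default_starts = re.findall(r'^\w+', log)
# With MULTILINE, ^ also anchors right after every newline.
multiline_starts = re.findall(r'^\w+', log, flags=re.M)

# VERBOSE lets the pattern be spread over several lines and commented.
date_pattern = re.compile(r'''
    (\d{4})   # year
    -         # literal separator
    (\d{2})   # month
    -
    (\d{2})   # day
''', re.X)
date_parts = date_pattern.search('released on 2024-01-15').groups()

print(default_starts, multiline_starts, date_parts)
```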

re.template(pattern, flags=0)

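Compiles a pattern in "template" form? I have never used it and could not find more detailed documentation. (It is undocumented, and recent Python 3 releases have removed it.)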

re.escape(pattern)

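A utility function that escapes every character in a string that could be interpreted as a regular expression operator. Use it when the string is long and contains many special characters and you do not want to type a pile of backslashes, or when the string comes from the user (for example, read with raw_input, or input in Python 3) and will be used as part of a regular expression.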

import re
print(re.escape('www.biaodianfu.com'))

Output:

www\.biaodianfu\.com
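For the user-input case, a small sketch under invented data: user_text stands in for whatever the user typed, and the comparison shows why escaping matters when that text contains an operator such as "+".

```python
import re

user_text = '1+1=2'                       # hypothetical user input
sentence = 'Everyone knows that 1+1=2.'

escaped = re.escape(user_text)            # "+" is now treated literally
literal_hit = re.search(escaped, sentence)
# Unescaped, "1+" means "one or more 1s", so the literal text is not found.
unescaped_hit = re.search(user_text, sentence)

print(literal_hit.group(), unescaped_hit)
```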

re.purge()

Clears the regular expression cache.

re.search(pattern, string, flags=0)

re.search scans the string for the pattern and returns as soon as it finds the first match. It returns an _sre.SRE_Match object, or None if nothing in the string matches.

import re
pattern = re.compile(r'Hello')
result1 = re.search(pattern,'Hello World')
result2 = re.search(pattern,'Hello World, World Hello!')
print(result1)
print(result2)

Output:

<_sre.SRE_Match object at 0x027AFA30>
<_sre.SRE_Match object at 0x027FDDB0>

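How do we get at the contents of an _sre.SRE_Match?

Match Objects

A Match object is the result of one match and carries a lot of information about it, available through the read-only attributes and methods that Match provides.

Attributes:

  • string: the text that was matched against.
  • re: the Pattern object used for the match.
  • pos: the index in the text where the regular expression started searching.
  • endpos: the index in the text where the regular expression stopped searching.
  • lastindex: the number of the last captured group (its group index, not a position in the text), or None if no group was captured.
  • lastgroup: the name of the last captured group, or None if that group has no name or no group was captured.

Methods:

  • group([group1, …]): returns the string captured by one or more groups; with several arguments the result is a tuple. Groups can be referred to by number or by name; number 0 stands for the whole matched substring, and calling group() with no arguments returns group(0). A group that captured nothing returns None; a group that captured several times returns the last substring it captured.
  • groups([default]): returns the strings captured by all groups as a tuple, equivalent to group(1, 2, …, last). default is the value used for groups that captured nothing; it defaults to None.
  • groupdict([default]): returns a dictionary whose keys are the names of the named groups and whose values are the substrings they captured; unnamed groups are not included. default has the same meaning as above.
  • start([group]): returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
  • end([group]): returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
  • span([group]): returns (start(group), end(group)).
  • expand(template): substitutes the captured groups into template and returns the result. template may refer to groups with \id, \g<id> or \g<name>; \0 cannot be used for the whole match (write \g<0> instead). \id and \g<id> are equivalent, but \10 is treated as group 10; to express \1 followed by the character '0', you must write \g<1>0.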
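The attributes and methods listed above can be tried out on one small match. This is only a sketch: the e-mail-like sample text and the group names user and domain are made up for the example.

```python
import re

text = 'mail: alice@example.com'
m = re.search(r'(?P<user>\w+)@(?P<domain>\w+)\.com', text)

whole = m.group()                    # same as m.group(0)
first = m.group(1)                   # by number
by_name = m.group('domain')          # by name
several = m.group(1, 2)              # tuple of several groups
all_groups = m.groups()
named = m.groupdict()

whole_span = (m.start(), m.end())    # equals m.span()
domain_span = m.span('domain')

last_index = m.lastindex             # number of the last captured group
last_name = m.lastgroup              # its name
search_range = (m.pos, m.endpos)     # the whole string was searched

swapped = m.expand(r'\g<domain>: \1')
bracketed = m.expand(r'[\g<0>]')     # \g<0> is the whole match

print(whole, several, named, whole_span, domain_span, swapped, bracketed)
```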

re.match(pattern, string, flags=0)

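Checks whether the beginning of the string matches the regular expression. Returns an _sre.SRE_Match object, or None if it does not match. match is very similar to search; the difference is that match() only checks whether the RE matches at the start of the string, whereas search() scans the whole string for a match.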

import re
pattern = re.compile(r'Hello')
result1 = re.match(pattern, 'Hello')
result2 = re.match(pattern, 'Hello World')
result3 = re.match(pattern, 'World Hello')
if result1:
    print(result1.group())
else:
    print('Match 1 failed!')
if result2:
    print(result2.group())
else:
    print('Match 2 failed!')
if result3:
    print(result3.group())
else:
    print('Match 3 failed!')

Output:

Hello
Hello
Match 3 failed!

re.findall(pattern, string, flags=0)

Finds all substrings that the RE matches and returns them as a list, in left-to-right order. If there is no match, an empty list is returned.

import re
pattern = re.compile(r'\d+')
print(re.findall(pattern,'one1two2three3four4'))

Output:

['1', '2', '3', '4']
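As the help text earlier notes, the shape of the result changes when the pattern contains groups. A short sketch with an invented key=value string:

```python
import re

text = 'a=1, b=22'

whole = re.findall(r'\w+=\d+', text)       # no groups: whole matches
keys = re.findall(r'(\w+)=\d+', text)      # one group: that group's text
pairs = re.findall(r'(\w+)=(\d+)', text)   # several groups: list of tuples

print(whole, keys, pairs)
```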

re.finditer(pattern, string, flags=0)

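Finds all substrings that the RE matches and returns them as an iterator that yields _sre.SRE_Match objects in left-to-right order. If there is no match, the iterator yields nothing.

import re
pattern = re.compile(r'\d+')
results = re.finditer(pattern, 'one1two2three3four4')
for result in results:
    print(result)

Output: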

<_sre.SRE_Match object at 0x0336FA30>
<_sre.SRE_Match object at 0x033BDDB0>
<_sre.SRE_Match object at 0x0336FA30>
<_sre.SRE_Match object at 0x033BDDB0>
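Printing the match objects themselves is not very informative. A minimal sketch of pulling the text and position out of each one, using the same sample string:

```python
import re

found = [(m.group(), m.span())
         for m in re.finditer(r'\d+', 'one1two2three3four4')]

# With no match the iterator is simply empty.
empty = list(re.finditer(r'\d+', 'no digits here'))

print(found, empty)
```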

re.split(pattern, string, maxsplit=0, flags=0)

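Splits a string by a regular expression. If the regular expression is wrapped in parentheses, the matched separators are also included in the returned list. maxsplit is the maximum number of splits: maxsplit=1 splits once, and the default of 0 means no limit.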

import re
pattern = re.compile(r'\d+')
print(re.split(pattern,'one1two2three3four4'))

Output:

['one', 'two', 'three', 'four', '']
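The parenthesized form and maxsplit mentioned in the description can be sketched on the same sample string:

```python
import re

text = 'one1two2three3four4'

with_separators = re.split(r'(\d+)', text)      # captured separators are kept
split_once = re.split(r'\d+', text, maxsplit=1)  # stop after the first split

print(with_separators, split_once)
```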

re.sub(pattern, repl, string, count=0, flags=0)

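Finds all substrings that the RE matches and replaces them with a different string. The optional argument count is the maximum number of replacements to make; it must be a non-negative integer, and the default of 0 replaces every match. If there is no match, the string is returned unchanged.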
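A brief sketch of the common ways to call re.sub, with invented sample strings: a plain replacement, a limited count, a backreference in repl, and a callable repl as described in the help text.

```python
import re

text = 'one1two2three3'

replace_all = re.sub(r'\d+', '#', text)
replace_two = re.sub(r'\d+', '#', text, count=2)          # at most two replacements
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')  # backreferences in repl
# A callable receives each match object and returns the replacement string.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1b22')
untouched = re.sub(r'\d+', '#', 'no digits')              # no match: unchanged

print(replace_all, replace_two, swapped, doubled, untouched)
```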

re.subn(pattern, repl, string, count=0, flags=0)

Works the same as re.sub, but returns a 2-tuple containing the new string and the number of substitutions made.
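A minimal sketch of the returned tuple, using an invented sample string:

```python
import re

new_text, count = re.subn(r'\d+', '#', 'one1two2three3')
same_text, zero = re.subn(r'\d+', '#', 'no digits')   # nothing to replace

print(new_text, count, same_text, zero)
```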
