一,正则表达式

  正则表达式是对字符串操作的一种逻辑公式,我们一般使用正则表达式对字符串进行匹配和过滤,使用正则的优缺点,我们可以去http://tool.chinaz.com/regex/进行测试。

SRE实战 互联网时代守护先锋,助力企业售后服务体系运筹帷幄!一键直达领取阿里云限量特价优惠。

  优点:灵活,功能性强,逻辑性强

  缺点:上手难,一旦上手,使用起来很方便

  正则表达式由普通字符和元字符组成,普通字符包含大小写字母,数字,在匹配普通字符的时候我们直接写就好,比如‘abc’匹配的就是‘abc’。元字符才是正则表达式的灵魂。

  1,字符组:字符组很简单,用[]括起来,在[]中出现的内容会被匹配,例如[abc]匹配a或b或c,如果字符组中的内容过多还可以使用-,例如[a-z]匹配a到z之间的所有字母,[0-9]匹配所有阿拉伯数字

  2,简单元字符

  正则表达式,re模块 Python 第1张

  正则表达式,re模块 Python 第2张

  3,量词

  正则表达式,re模块 Python 第3张

  4,惰性匹配和贪婪匹配

  正则表达式,re模块 Python 第4张

  正则表达式,re模块 Python 第5张

  5,分组

  正则表达式,re模块 Python 第6张

  6,转义

  正则表达式,re模块 Python 第7张

二,re模块

  正则表达式,re模块 Python 第8张

  正则表达式,re模块 Python 第9张

  正则表达式,re模块 Python 第10张

  正则表达式,re模块 Python 第11张

  正则表达式,re模块 Python 第12张

 三,实例一,用re和urllib爬电影下载地址,我爬的是电影天堂小片网的电影(中间一大段if,elif只是排除编码有问题电影) 

from urllib.request import urlopen
import re
for i in range(1,302):
if i ==1:
url='https://www.dy2018.com/html/gndy/dyzz/index.html'
else:
url='https://www.dy2018.com/html/gndy/dyzz/index_%s.html'%i
print(url)
f1 = open('move.txt', mode='a', encoding='utf-8')
f1.write(url + '\n')
f1.close()
content=urlopen(url).read().decode('gbk')
if i==51:
obj = re.compile(r'<td height="20" style="padding-left:3px">&nbsp;</td>.*?<td height="26">.*?a href="/(?P<adress>.*?)".*?">(?P<name>.*?)</a>', re.S)
else:
obj = re.compile(r'<td height="26">.*?a href="/(?P<adress>.*?)".*?">(?P<name>.*?)</', re.S)
ss=obj.finditer(content)
for el in ss:
if el.group('adress')=='i/91900.html' or el.group('adress')=='html/gndy/dyzz/20130913/91826.html' :
continue
elif el.group('adress')=='html/gndy/dyzz/20130131/41375.html'or el.group('adress')=='html/gndy/dyzz/20130903/91759.html':
continue
elif el.group('adress')=='html/gndy/dyzz/20130908/91797.html'or el.group('adress')=='html/gndy/dyzz/20120623/38354.html':
continue
elif el.group('adress')=='html/gndy/dyzz/20110219/30940.html'or el.group('adress')=='html/gndy/dyzz/20110112/30354.html':
continue
elif el.group('adress')=='html/gndy/dyzz/20110109/30300.html'or el.group('adress')=='html/gndy/dyzz/20110107/30267.html':
continue
elif el.group('adress')=='html/gndy/dyzz/20101231/30152.html'or el.group('adress')=='html/gndy/dyzz/20101231/30150.html':
continue
elif el.group('adress')=='html/gndy/dyzz/20101221/29988.html'or el.group('adress')=='html/gndy/dyzz/20101215/29886.html':
continue
elif el.group('adress')=='html/gndy/dyzz/20101209/29780.html'or el.group('adress')=='html/gndy/dyzz/20101208/29768.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20101101/29127.html'or el.group('adress') == 'html/gndy/dyzz/20101027/29031.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20101019/28883.html'or el.group('adress') == 'html/gndy/dyzz/20100919/28321.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20100909/28122.html'or el.group('adress') == 'html/gndy/dyzz/20100828/27899.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20100820/27742.html'or el.group('adress') == 'html/gndy/dyzz/20100726/27308.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20100620/26683.html'or el.group('adress') == 'html/gndy/dyzz/20100613/26564.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20100428/25809.html'or el.group('adress') == 'html/gndy/dyzz/20100116/24050.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20100110/23933.html'or el.group('adress') == 'html/gndy/dyzz/20100103/23790.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20091104/22691.html'or el.group('adress') == 'html/gndy/dyzz/20090423/18292.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20090308/17370.html'or el.group('adress') == 'html/gndy/dyzz/20090303/17285.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20090308/17377.html'or el.group('adress') == 'html/gndy/dyzz/20090113/16496.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20081221/16079.html'or el.group('adress') == 'html/gndy/dyzz/20081215/15964.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20081008/14637.html'or el.group('adress') == 'html/gndy/dyzz/20080916/14200.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20080905/13985.html'or el.group('adress') == 'html/gndy/dyzz/20080119/9339.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20071213/8253.html'or el.group('adress') == 'html/gndy/dyzz/20071125/7635.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20071025/6879.html'or el.group('adress') == 'html/gndy/dyzz/20070902/5424.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20070824/5143.html'or el.group('adress') == 'html/gndy/dyzz/20070823/5109.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20070823/5088.html'or el.group('adress') == 'html/gndy/dyzz/20070806/4621.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20070805/4617.html'or el.group('adress') == 'html/gndy/dyzz/20070805/4611.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20070727/4333.html'or el.group('adress') == 'html/gndy/dyzz/20070726/4301.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20070716/4023.html'or el.group('adress') == 'html/gndy/dyzz/20070712/3862.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20070617/3072.html'or el.group('adress') == 'html/gndy/dyzz/20070615/3019.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20070527/2418.html'or el.group('adress') == 'html/gndy/dyzz/20070425/1672.html':
continue
elif el.group('adress') == 'html/gndy/dyzz/20070416/1478.html'or el.group('adress') == 'html/gndy/dyzz/20070412/1399.html':
continue
d1={}
d1['name']=el.group('name')
url1='https://www.dy2018.com/'+el.group('adress')
content1 = urlopen(url1).read().decode('gbk')
obj1=re.compile(r'下载地址.*?<a href="(?P<adress1>.*?)">',re.S)
ss1=obj1.search(content1)
d1['adress']=ss1.group('adress1')
f1=open('move.txt',mode='a',encoding='utf-8')
f1.write(str(d1)+'\n')
f1.close()
四,用re和urllib爬维密图片
import re
from urllib.request import urlopen,urlretrieve
for n in range(1,5):
if n ==1:
url2 = 'https://stock.tuchong.com/topic?topicId=49282'
else:
url2='https://stock.tuchong.com/topic?topicId=49282&page=%s&count=100'%n
content=urlopen(url2).read().decode('utf-8')
obj=re.compile(r'"imageId":"(?P<bianhao>.*?)"',re.S)
ss=obj.finditer(content)
i=(n-1)*100+1
for el in ss:
num=el.group('bianhao')
if num=='525558981780570122'or num=='525569719200382978':
continue
elif num=='525589673617915918'or num=='525563551629443081':
continue
elif num == '525571411419332618'or num == '525585163906711562':
continue
elif num == '525554334627135503'or num == '525554274497593347':
continue
elif num == '525555872228048910'or num == '525555889405689865':
continue
elif num == '525566472207728651'or num == '525569650482741255':
continue
elif num == '525571420010053640'or num == '525571196669132802':
continue
elif num == '525571239618281484'or num == '525575964084666381':
continue
elif num == '525577347064528914'or num == '525580594059804675':
continue
elif num == '525592946387451916'or num == '525602292234190873':
continue
elif num == '525551036092645412'or num == '525552685360087057':
continue
elif num == '525554300266217477'or num == '525555657477062662':
continue
elif num == '525557349694570506'or num == '525557418413522955':
continue
elif num == '525575757924401164'or num == '525585146722910211':
continue
elif num == '525591194036862980'or num == '525594604244828160':
continue
elif num == '525619248767172615'or num == '525546397526392859':
continue
elif num == '525547952305733639'or num == '525549618751864836':
continue
elif num == '525552539333034011'or num == '525560631049584650':
continue
elif num == '525561910952460294'or num == '525577252573020167':
continue
elif num == '525577390015905802'or num == '525578919019806730':
continue
elif num == '525589845421064206'or num == '525600746043736066':
continue
elif num == '525600746046750729'or num == '525600539888320533':
continue
elif num == '525636059267465227'or num == '525551070454611970':
continue
elif num == '525551173533827166'or num == '525552693952643113':
continue
elif num == '525557461366341633'or num == '525557169308041219':
continue
elif num == '525558818575613969'or num == '525559024730243078':
continue
elif num == '525560596692074509'or num == '525563345468391437':
continue
elif num == '525575809462435842'or num == '525578901843083269':
continue
elif num == '525580405077442575'or num == '525582148835344441':
continue
elif num == '525585146721337349'or num == '525549532855926790':
continue
elif num == '525552711133822979'or num == '525560338990235661':
continue
elif num == '525565097818193935'or num == '525566489386811399':
continue
elif num == '525569684840644622'or num == '525571239620640780':
continue
elif num == '525571316927692806'or num == '525572725675917316':
continue
elif num == '525573017733693460'or num == '525574254687682562':
continue
elif num == '525574538153689102'or num == '525575869593550849':
continue
elif num == '525594295007182858'or num == '525596004400234497':
continue
elif num == '525608305186177028'or num == '525650120992096265':
continue
elif num == '525564900245504013':
continue
s='http://p3.pstatp.com/weili/ms/%s.webp'%num
print(s)
ss=urlretrieve(s,'维密\%s.jpg'%i)
i += 1
 

 

扫码关注我们
微信号:SRE实战
拒绝背锅 运筹帷幄