Python Web Scraping and Data Visualization

June 16, 2021 · 26 comments · 9.67k views · 14 likes · 云淡风轻

Project repository: https://github.com/haohaizhi/51job_spiders

Data Mining

Required packages

import urllib.request
import xlwt
import re
import urllib.parse
import time

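Go to the 51job (前程无忧) website and search for a position. I use "大数据" (big data) as the example keyword.

Open the browser's developer tools.
The Request Headers panel shows what the browser sends when it visits the site; with these values the script can imitate a browser.
Doing so also lowers the risk of the site banning your IP, although 51job generally does not ban IPs.

Simulating the browser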

header={
    'Host':'search.51job.com',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
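
A headers dictionary only has an effect once it is attached to the request. The following is a minimal sketch using the `header` values above; `build_request` is a hypothetical helper name chosen for this illustration, and nothing is sent over the network.

```python
import urllib.request

header = {
    'Host': 'search.51job.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}

def build_request(url, headers):
    """Return a Request object that carries the given headers (hypothetical helper)."""
    return urllib.request.Request(url, headers=headers)

req = build_request('https://search.51job.com/', header)
# urllib stores header names in capitalized form, hence 'User-agent'
print(req.get_header('User-agent'))
```

Passing `req` to `urllib.request.urlopen` sends these headers. Calling `urlopen` with a bare URL string sends urllib's default User-Agent instead.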

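All of the basic fields shown in a job listing can be scraped.

Note

To make the scraper interactive, I wrote a function that takes the job title you want to know about and fetches the matching listings.

It works like a search engine such as Baidu: type the keyword for the position you want, and that is what gets scraped.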

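def getfront(page,item):       # page is the page number; item is the keyword typed by the user (see below)
     result = urllib.parse.quote(item)                  # percent-encode (URL-encode) the keyword
     ur1 = result+',2,'+ str(page)+'.html'
     ur2 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
     res = ur2+ur1                                      # assemble the full URL
     req = urllib.request.Request(res, headers=header)  # attach the browser-like headers defined above
     a = urllib.request.urlopen(req)
     html = a.read().decode('gbk')          # read the page source and decode it from GBK into a str
     return html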

def getInformation(html):
    reg = re.compile(r'class="t1 ">.*? <a target="_blank" title="(.*?)" href="(.*?)".*? <span class="t2"><a target="_blank" title="(.*?)" href="(.*?)".*?<span class="t3">(.*?)</span>.*?<span class="t4">(.*?)</span>.*?<span class="t5">(.*?)</span>.*?',re.S)  # re.S lets "." also match newlines
    items=re.findall(reg,html)
    return items
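
To see what `getfront` builds without touching the network, the URL assembly can be sketched on its own. `build_search_url` is a hypothetical name used only here; the path prefix is the one from the code above.

```python
import urllib.parse

BASE = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'

def build_search_url(keyword, page):
    """Assemble the search URL for a keyword and page number (hypothetical helper)."""
    encoded = urllib.parse.quote(keyword)   # UTF-8 percent-encoding, e.g. '大' -> '%E5%A4%A7'
    return BASE + encoded + ',2,' + str(page) + '.html'

print(build_search_url('大数据', 1))
```

Note that `quote` produces percent-encoded UTF-8 bytes rather than a plain hexadecimal number, which is why each Chinese character becomes three `%XX` groups.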

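Besides the fields visible on the page, I also captured the URL behind each job link and each company link.
I will not go into them here; they come up again further below.
Next the data has to be stored. I use Excel here; it is a little cumbersome, but the result is clear and easy to inspect.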

# Create a new workbook
excel1 = xlwt.Workbook()
# Add a worksheet; cell_overwrite_ok=True allows a cell to be written more than once
sheet1 = excel1.add_sheet('Job', cell_overwrite_ok=True)
sheet1.write(0, 0, '序号')
sheet1.write(0, 1, '职位')
sheet1.write(0, 2, '公司名称')
sheet1.write(0, 3, '公司地点')
sheet1.write(0, 4, '公司性质')
sheet1.write(0, 5, '薪资')
sheet1.write(0, 6, '学历要求')
sheet1.write(0, 7, '工作经验')
sheet1.write(0, 8, '公司规模')
sheet1.write(0, 9, '公司类型')
sheet1.write(0, 10,'公司福利')
sheet1.write(0, 11,'发布时间')

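The crawling code is below. A nested loop handles paging (outer loop) and writes one row per posting (inner loop).
To collect a large amount of data I crawl pages 1 through 999; while debugging, a few pages are enough.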

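number = 1
item = input()               # search keyword, e.g. 大数据
for j in range(1,1000):   # adjust the page range as needed
    try:
        print("正在爬取第"+str(j)+"页数据...")
        html = getfront(j,item)      # fetch the source of result page j
        for i in getInformation(html):
            try:
                url1 = i[1]          # URL of the job detail page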
                res1 = urllib.request.urlopen(url1).read().decode('gbk')
                company = re.findall(re.compile(r'<div class="com_tag">.*?<p class="at" title="(.*?)"><span class="i_flag">.*?<p class="at" title="(.*?)">.*?<p class="at" title="(.*?)">.*?',re.S),res1)
                job_need = re.findall(re.compile(r'<p class="msg ltype".*?>.*?  <span>|</span>  (.*?)  <span>|</span>  (.*?)  <span>|</span>  .*?</p>',re.S),res1)
                welfare = re.findall(re.compile(r'<span class="sp4">(.*?)</span>',re.S),res1)
                print(i[0],i[2],i[4],i[5],company[0][0],job_need[2][0],job_need[1][0],company[0][1],company[0][2],welfare,i[6])
                sheet1.write(number,0,number)
                sheet1.write(number,1,i[0])
                sheet1.write(number,2,i[2])
                sheet1.write(number,3,i[4])
                sheet1.write(number,4,company[0][0])
                sheet1.write(number,5,i[5])
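                # The "|" characters in the job_need pattern are not escaped, so they act as regex alternation
                # and findall returns one tuple per matched piece: [1][0] is experience, [2][0] is education
                sheet1.write(number,6,job_need[2][0])   # education (学历要求)
                sheet1.write(number,7,job_need[1][0])   # experience (工作经验)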
                sheet1.write(number,8,company[0][1])
                sheet1.write(number,9,company[0][2])
                sheet1.write(number,10,("  ".join(str(i) for i in welfare)))
                sheet1.write(number,11,i[6])
                number+=1
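                excel1.save("51job.xls")   # save after every row so that rows already crawled survive an interruption
                time.sleep(0.3) # pause between requests so that heavy crawling is not mistaken for an attack and the IP banned
            except Exception as e:
                print("skipped one posting:", e)   # report the error instead of silently ignoring it
    except Exception as e:
        print("page", j, "failed:", e)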

The result is an Excel file, 51job.xls, with one row per job posting.

Data Cleaning

First, open the file

#coding:utf-8
import pandas as pd
import re
# The xlrd package must also be installed (pandas uses it to read .xls files)

data = pd.read_excel(r'51job.xls',sheet_name='Job')
result = pd.DataFrame(data)

Cleaning approach:
1. Rows that contain any missing value (NaN) are dropped entirely

a = result.dropna(axis=0,how='any')
pd.set_option('display.max_rows',None)     # display all rows without truncation

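2. Irrelevant positions (many of the returned positions have nothing to do with big data)

b = u'数据'
number = 1
li = a['职位']
for i in li.index:                 # iterate over index labels; range(len(li)) would miss rows after dropna()
    try:
        if b in li[i]:
            #print(number,li[i])
            number+=1
        else:
            a = a.drop(i,axis=0)   # drop the whole row
    except TypeError:              # the cell is not a string
        pass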

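3. Misaligned fields elsewhere, for example a headcount ("招N人", hiring N people) appearing in the education column

b2= u'人'
li2 = a['学历要求']
for i in li2.index:                # index labels again, because rows were dropped above
    try:
        if b2 in li2[i]:
            #print(i,li2[i])
            a = a.drop(i,axis=0)
    except TypeError:              # the cell is not a string
        pass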
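
Steps 1 to 3 can also be written with boolean masks, which removes the index bookkeeping. This is a minimal sketch on invented rows, assuming the same column names as the spreadsheet; it is an alternative formulation, not the code used in the project.

```python
import pandas as pd

# Invented sample rows that mimic the scraped sheet
df = pd.DataFrame({
    '职位': ['大数据开发工程师', '销售代表', '数据分析师', None],
    '学历要求': ['本科', '大专', '招2人', '本科'],
})

df = df.dropna(how='any')                       # step 1: drop rows with missing values
df = df[df['职位'].str.contains('数据')]         # step 2: keep titles that contain 数据
df = df[~df['学历要求'].str.contains('人')]      # step 3: drop rows whose education cell holds a headcount
print(df)
```

Only the first sample row survives: the second has an unrelated title, the third has a headcount in the education column, and the fourth has a missing title.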

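4. Normalize the salary units
The salary column mixes units (万/年, 千/月, 万/月), so everything is converted to 万/月 (ten thousand yuan per month)

b3 =u'万/年'
b4 =u'千/月'
li3 = a['薪资']
# the commented-out print calls are only for debugging
for i in li3.index:                # index labels, because rows were dropped above
    try:
        if b3 in li3[i]:
            x = re.findall(r'\d*\.?\d+',li3[i])
            #print(x)
            min_ = format(float(x[0])/12,'.2f')              # convert to float and keep two decimals
            max_ = format(float(x[1])/12,'.2f')
            a.loc[i,'薪资'] = min_+'-'+max_+u'万/月'          # strings are immutable, so write the result back into the DataFrame
        elif b4 in li3[i]:
            x = re.findall(r'\d*\.?\d+',li3[i])
            #print(x)
            #input()
            min_ = format(float(x[0])/10,'.2f')
            max_ = format(float(x[1])/10,'.2f')
            a.loc[i,'薪资'] = min_+'-'+max_+u'万/月'
        print(i,a.loc[i,'薪资'])

    except (TypeError, IndexError):    # non-string cell, or fewer than two numbers in the text
        pass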
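
The unit conversion is easier to check as a small pure function. This is a sketch under the same assumptions as the loop above (a low-high range followed by 万/年 or 千/月); `normalize_salary` is a hypothetical name.

```python
import re

def normalize_salary(text):
    """Convert '10-20万/年' or '6-8千/月' to 万/月; return other strings unchanged."""
    numbers = re.findall(r'\d*\.?\d+', text)
    if len(numbers) < 2:
        return text                      # no low-high range to convert
    low, high = float(numbers[0]), float(numbers[1])
    if '万/年' in text:
        divisor = 12                     # yearly -> monthly
    elif '千/月' in text:
        divisor = 10                     # thousands -> ten-thousands
    else:
        return text
    return format(low / divisor, '.2f') + '-' + format(high / divisor, '.2f') + '万/月'

print(normalize_salary('10-20万/年'))
print(normalize_salary('6-8千/月'))
```

With pandas, such a function could be applied to the whole column with `a['薪资'].map(normalize_salary)`, assuming every cell is a string.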

Save the result to another Excel file

a.to_excel('51job2.xlsx', sheet_name='Job', index=False)

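This is only a brief introduction to some data-cleaning ideas; it does not mean that these steps are all the cleaning you will ever need.
Some job pages do not use the standard 51job layout but are custom pages built by the company itself, and those also produce bad rows easily.
Once you have the basic approach, though, none of this is hard to clean up.

Data Visualization

Visualization is a crucial step: if you only scrape the data and never visualize it, much of its value goes unused.
Charts make the data more intuitive and easier to analyze.
One could even argue that visualization is the most important part of data mining.

As before, start with the packages the code needs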

# -*- coding: utf-8 -*-
import pandas as pd
import re
from pyecharts import Funnel,Pie,Geo   # pyecharts v0.5.x API; in v1.x these classes are imported from pyecharts.charts
import matplotlib.pyplot as plt

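If the package cannot be found or installation fails, you can install it from source:

https://github.com/pyecharts/pyecharts

In addition, geographic charts, heat maps, and similar charts require the map packages to be installed, for example the world map, China map, and city map packages.

Now for the main part.
As before, open the file first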

file = pd.read_excel(r'51job2.xlsx',sheet_name='Job')   # the file name saved by the cleaning step (reading .xlsx requires openpyxl)
f = pd.DataFrame(file)
pd.set_option('display.max_rows',None)

1. Build separate lists for the salary (薪资), work experience (工作经验), education (学历要求), and company location (公司地点) columns

add = f['公司地点']
sly = f['薪资']
edu = f['学历要求']
exp = f['工作经验']
address =[]
salary = []
education = []
experience = []
for i in range(0,len(f)):
    try:
        a = add[i].split('-')
        address.append(a[0])
        #print(address[i])
        s = re.findall(r'\d*\.?\d+',sly[i])
        s1= float(s[0])
        s2 =float(s[1])
        salary.append([s1,s2])
        #print(salary[i])
        education.append(edu[i])
        #print(education[i])
        experience.append(exp[i])
        #print(experience[i])
    except:
       pass

2. Use matplotlib to plot experience vs. salary and education vs. salary

min_s=[]                            # list of minimum salaries
max_s=[]                            # list of maximum salaries
for i in range(0,len(experience)):
    min_s.append(salary[i][0])
    max_s.append(salary[i][1])      # index 1 is the upper bound of the range

my_df = pd.DataFrame({'experience':experience, 'min_salay' : min_s, 'max_salay' : max_s})             # relate work experience to salary
data1 = my_df.groupby('experience').mean()['min_salay'].plot(kind='line')
plt.show()
my_df2 = pd.DataFrame({'education':education, 'min_salay' : min_s, 'max_salay' : max_s})              # relate education to salary
data2 = my_df2.groupby('education').mean()['min_salay'].plot(kind='line')
plt.show()
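
What `groupby(...).mean()` computes here can be spelled out in plain Python. The sketch below uses invented numbers and only the standard library; it illustrates the aggregation and is not a replacement for the pandas code.

```python
from collections import defaultdict
from statistics import mean

# Invented sample: (experience label, minimum monthly salary in 万)
rows = [('1年经验', 0.6), ('1年经验', 0.8), ('3-4年经验', 1.2), ('3-4年经验', 1.6)]

groups = defaultdict(list)
for label, min_salary in rows:
    groups[label].append(min_salary)      # collect the salaries that share a label

averages = {label: mean(values) for label, values in groups.items()}
print(averages)
```

Because the labels are text categories that pandas sorts as strings, the x-axis of the line plot is not necessarily in order of seniority; a bar chart may be easier to read.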

3. Donut chart of education requirements

def get_edu(items):                # parameter renamed so that it no longer shadows the built-in list
    education2 = {}
    for i in set(items):
        education2[i] = items.count(i)
    return education2
dir1 = get_edu(education)
# print(dir1)

attr= dir1.keys()
value = dir1.values()
pie = Pie("学历要求")
pie.add("", attr, value, center=[50, 50], is_random=False, radius=[30, 75], rosetype='radius',
        is_legend_show=False, is_label_show=True,legend_orient='vertical')
pie.render('学历要求玫瑰图.html')
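
The counting helper above does the same job as `collections.Counter` from the standard library. A small sketch on an invented list, producing the `[label, count]` pairs that the pyecharts v1.x code later in this post passes to `add`:

```python
from collections import Counter

education = ['本科', '大专', '本科', '硕士', '本科']    # invented sample
counts = Counter(education)                            # label -> number of occurrences
pairs = [list(p) for p in counts.most_common()]        # [[label, count], ...], largest count first
print(pairs)
```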

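4. Geographic distribution of big-data job demand by city

def get_address(items):
    address2 = {}
    for i in set(items):
        address2[i] = items.count(i)
    address2.pop('异地招聘', None)   # 异地招聘 (recruiting for another location) is not a city; None avoids a KeyError when the key is absent
    # Some place names may be invalid or missing from the map package and can be removed by hand;
    # the names below used to raise errors, but the package seems to have been updated since
    #address2.pop('山东')
    #address2.pop('怒江')
    #address2.pop('池州')
    return address2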
dir2 = get_address(address)
#print(dir2)

geo = Geo("大数据人才需求分布图", title_color="#2E2E2E",
          title_text_size=24,title_top=20,title_pos="center", width=1300,height=600)
attr2 = dir2.keys()
value2 = dir2.values()
geo.add("",attr2, value2, type="effectScatter", is_random=True, visual_range=[0, 1000], maptype='china',symbol_size=8, effect_scale=5, is_visualmap=True)
geo.render('大数据城市需求分布图.html')

5. Funnel chart of work experience requirements

def get_experience(items):
    experience2 = {}
    for i in set(items):
         experience2[i] = items.count(i)
    return experience2
dir3 = get_experience(experience)
#print(dir3)

attr3= dir3.keys()
value3 = dir3.values()
funnel = Funnel("工作经验漏斗图",title_pos='center')
funnel.add("", attr3, value3,is_label_show=True,label_pos="inside", label_text_color="#fff",legend_orient='vertical',legend_pos='left')
funnel.render('工作经验要求漏斗图.html')

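pyecharts offers many more chart types; they are left for you to explore.

For problems with the code, see the latest feedback below.

Feedback

Some readers reported garbled output, most likely because the website changed its rules. I updated the code and improved a few places. If the crawl stops midway, it is probably a network problem or a blocked request; running the code again usually helps.

The full code: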

# -*- coding:utf-8 -*-
import urllib.request
import xlwt
import re
import urllib.parse
import time
header={
    'Host':'search.51job.com',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
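def getfront(page,item):       # page is the page number; item is the search keyword
     result = urllib.parse.quote(item)                  # percent-encode (URL-encode) the keyword
     ur1 = result+',2,'+ str(page)+'.html'
     ur2 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
     res = ur2+ur1                                      # assemble the full URL
     req = urllib.request.Request(res, headers=header)  # attach the browser-like headers defined above
     a = urllib.request.urlopen(req)
     html = a.read().decode('gbk')          # read the page source and decode it from GBK into a str
     return html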
def getInformation(html):
    reg = re.compile(r'class="t1 ">.*? <a target="_blank" title="(.*?)" href="(.*?)".*? <span class="t2"><a target="_blank" title="(.*?)" href="(.*?)".*?<span class="t3">(.*?)</span>.*?<span class="t4">(.*?)</span>.*?<span class="t5">(.*?)</span>.*?',re.S)  # re.S lets "." also match newlines
    items=re.findall(reg,html)
    return items
# Create a new workbook
excel1 = xlwt.Workbook()
# Add a worksheet; cell_overwrite_ok=True allows a cell to be written more than once
sheet1 = excel1.add_sheet('Job', cell_overwrite_ok=True)
sheet1.write(0, 0, '序号')
sheet1.write(0, 1, '职位')
sheet1.write(0, 2, '公司名称')
sheet1.write(0, 3, '公司地点')
sheet1.write(0, 4, '公司性质')
sheet1.write(0, 5, '薪资')
sheet1.write(0, 6, '学历要求')
sheet1.write(0, 7, '工作经验')
sheet1.write(0, 8, '公司规模')
sheet1.write(0, 9, '公司类型')
sheet1.write(0, 10,'公司福利')
sheet1.write(0, 11,'发布时间')
number = 1
item = input()
for j in range(1,10000):   # adjust the page range as needed
    try:
        print("正在爬取第"+str(j)+"页数据...")
        html = getfront(j,item)      # fetch the source of result page j
        for i in getInformation(html):
            try:
                url1 = i[1]          # URL of the job detail page
                res1 = urllib.request.urlopen(url1).read().decode('gbk')
                company = re.findall(re.compile(r'<div class="com_tag">.*?<p class="at" title="(.*?)"><span class="i_flag">.*?<p class="at" title="(.*?)">.*?<p class="at" title="(.*?)">.*?',re.S),res1)
                job_need = re.findall(re.compile(r'<p class="msg ltype".*?>.*?  <span>|</span>  (.*?)  <span>|</span>  (.*?)  <span>|</span>  .*?</p>',re.S),res1)
                welfare = re.findall(re.compile(r'<span class="sp4">(.*?)</span>',re.S),res1)
                print(i[0],i[2],i[4],i[5],company[0][0],job_need[2][0],job_need[1][0],company[0][1],company[0][2],welfare,i[6])
                sheet1.write(number,0,number)
                sheet1.write(number,1,i[0])
                sheet1.write(number,2,i[2])
                sheet1.write(number,3,i[4])
                sheet1.write(number,4,company[0][0])
                sheet1.write(number,5,i[5])
                sheet1.write(number,6,job_need[2][0])
                sheet1.write(number,7,job_need[1][0])
                sheet1.write(number,8,company[0][1])
                sheet1.write(number,9,company[0][2])
                sheet1.write(number,10,("  ".join(str(i) for i in welfare)))
                sheet1.write(number,11,i[6])
                number+=1
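                excel1.save("51job.xls")   # save after every row so that rows already crawled survive an interruption
                time.sleep(0.3) # pause between requests so that heavy crawling is not mistaken for an attack and the IP banned
            except Exception as e:
                print("skipped one posting:", e)   # report the error instead of silently ignoring it
    except Exception as e:
        print("page", j, "failed:", e)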

#coding:utf-8
import pandas as pd
import re

data = pd.read_excel(r'51job.xls',sheet_name='Job')
result = pd.DataFrame(data)

a = result.dropna(axis=0,how='any')
pd.set_option('display.max_rows',None)     # display all rows without truncation

b = u'数据'
number = 1
li = a['职位']
for i in li.index:                 # iterate over index labels; range(len(li)) would miss rows after dropna()
    try:
        if b in li[i]:
            #print(number,li[i])
            number+=1
        else:
            a = a.drop(i,axis=0)  # drop the whole row
    except TypeError:              # the cell is not a string
        pass

b2 = '人'
li2 = a['学历要求']
for i in li2.index:
    try:
        if b2 in li2[i]:
            # print(i,li2[i])
            a = a.drop(i, axis=0)
    except TypeError:
        pass

b3 =u'万/年'
b4 =u'千/月'
li3 = a['薪资']
# the commented-out print calls are only for debugging
for i in li3.index:
    try:
        if b3 in li3[i]:
            x = re.findall(r'\d*\.?\d+',li3[i])
            #print(x)
            min_ = format(float(x[0])/12,'.2f')              # convert to float and keep two decimals
            max_ = format(float(x[1])/12,'.2f')
            a.loc[i,'薪资'] = min_+'-'+max_+u'万/月'          # strings are immutable, so write the result back into the DataFrame
        elif b4 in li3[i]:
            x = re.findall(r'\d*\.?\d+',li3[i])
            #print(x)
            #input()
            min_ = format(float(x[0])/10,'.2f')
            max_ = format(float(x[1])/10,'.2f')
            a.loc[i,'薪资'] = min_+'-'+max_+u'万/月'
        print(i,a.loc[i,'薪资'])

    except (TypeError, IndexError):    # non-string cell, or fewer than two numbers in the text
        pass
a.to_excel('51job2.xls', sheet_name='Job', index=False)
#############################################################################################
import pandas as pd
import re
from pyecharts import Funnel,Pie,Geo
import matplotlib.pyplot as plt

file = pd.read_excel(r'51job2.xls',sheet_name='Job')
f = pd.DataFrame(file)
pd.set_option('display.max_rows',None)

add = f['公司地点']
sly = f['薪资']
edu = f['学历要求']
exp = f['工作经验']
address =[]
salary = []
education = []
experience = []
for i in range(0,len(f)):
    try:
        a = add[i].split('-')
        address.append(a[0])
        #print(address[i])
        s = re.findall(r'\d*\.?\d+',sly[i])
        s1= float(s[0])
        s2 =float(s[1])
        salary.append([s1,s2])
        #print(salary[i])
        education.append(edu[i])
        #print(education[i])
        experience.append(exp[i])
        #print(experience[i])
    except:
       pass

min_s=[]                            # list of minimum salaries
max_s=[]                            # list of maximum salaries
for i in range(0,len(experience)):
    min_s.append(salary[i][0])
    max_s.append(salary[i][1])      # index 1 is the upper bound of the range
# If matplotlib cannot display Chinese text, use the following settings.
plt.rcParams['font.sans-serif'] = ['KaiTi'] # set the default font
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from rendering as a box in saved figures

my_df = pd.DataFrame({'experience':experience, 'min_salay' : min_s, 'max_salay' : max_s})             # relate work experience to salary
data1 = my_df.groupby('experience').mean()['min_salay'].plot(kind='line')
plt.show()
my_df2 = pd.DataFrame({'education':education, 'min_salay' : min_s, 'max_salay' : max_s})              # relate education to salary
data2 = my_df2.groupby('education').mean()['min_salay'].plot(kind='line')
plt.show()

def get_edu(list):
    education2 = {}
    for i in set(list):
        education2[i] = list.count(i)
    return education2
dir1 = get_edu(education)
# print(dir1)

attr= dir1.keys()
value = dir1.values()
pie = Pie("学历要求")
pie.add("", attr, value, center=[50, 50], is_random=False, radius=[30, 75], rosetype='radius',
        is_legend_show=False, is_label_show=True,legend_orient='vertical')
pie.render('学历要求玫瑰图.html')

def get_address(list):
    address2 = {}
    for i in set(list):
        address2[i] = list.count(i)
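    address2.pop('异地招聘', None)   # not a city; None avoids a KeyError when the key is absent
    # Some place names may be invalid or missing from the map package and can be removed by hand; the names below used to raise errors, but the package seems to have been updated since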
    #address2.pop('山东')
    #address2.pop('怒江')
    #address2.pop('池州')
    return address2
dir2 = get_address(address)
#print(dir2)

geo = Geo("大数据人才需求分布图", title_color="#2E2E2E",
          title_text_size=24,title_top=20,title_pos="center", width=1300,height=600)
attr2 = dir2.keys()
value2 = dir2.values()
geo.add("",attr2, value2, type="effectScatter", is_random=True, visual_range=[0, 1000], maptype='china',symbol_size=8, effect_scale=5, is_visualmap=True)
geo.render('大数据城市需求分布图.html')

def get_experience(list):
    experience2 = {}
    for i in set(list):
         experience2[i] = list.count(i)
    return experience2
dir3 = get_experience(experience)
#print(dir3)

attr3= dir3.keys()
value3 = dir3.values()
funnel = Funnel("工作经验漏斗图",title_pos='center')
funnel.add("", attr3, value3,is_label_show=True,label_pos="inside", label_text_color="#fff",legend_orient='vertical',legend_pos='left')
funnel.render('工作经验要求漏斗图.html')

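The generated HTML files are best opened with Google Chrome. If nothing happens when you click one, locate the file in its folder and open it from there.

Recently quite a few people have said that the crawler returns nothing. I took a look, and it is not a serious problem: the page source has changed. The listing data used to be written directly in the HTML, but it is now embedded as JavaScript data, so the old regular expressions no longer match. I spent some time updating the code. Some fields were dropped and others added, but this does not affect the visualization later on.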

# -*- coding:utf-8 -*-
import urllib.request
import xlwt
import re
import urllib.parse
import time
header={
    'Host':'search.51job.com',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
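def getfront(page,item):       # page is the page number; item is the search keyword
     result = urllib.parse.quote(item)                  # percent-encode (URL-encode) the keyword
     ur1 = result+',2,'+ str(page)+'.html'
     ur2 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
     res = ur2+ur1    # assemble the full URL
     req = urllib.request.Request(res, headers=header)  # attach the browser-like headers defined above
     a = urllib.request.urlopen(req)
     html = a.read().decode('gbk')      # read the page source and decode it from GBK into a str
     html = html.replace('\\','')       # strip the backslashes used for escaping
     html = html.replace('[', '')
     html = html.replace(']', '')
     #print(html)
     return html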

def getInformation(html):
    reg = re.compile(r'\{"type":"engine_search_result","jt":"0".*?"job_href":"(.*?)","job_name":"(.*?)".*?"company_href":"(.*?)","company_name":"(.*?)","providesalary_text":"(.*?)".*?"updatedate":"(.*?)".*?,'
                     r'"companytype_text":"(.*?)".*?"jobwelf":"(.*?)".*?"attribute_text":"(.*?)","(.*?)","(.*?)","(.*?)","companysize_text":"(.*?)","companyind_text":"(.*?)","adid":""},',re.S)  # re.S lets "." also match newlines
    items=re.findall(reg,html)
    print(items)
    return items

# Create a new workbook
excel1 = xlwt.Workbook()
# Add a worksheet; cell_overwrite_ok=True allows a cell to be written more than once
sheet1 = excel1.add_sheet('Job', cell_overwrite_ok=True)
sheet1.write(0, 0, '序号')
sheet1.write(0, 1, '职位')
sheet1.write(0, 2, '公司名称')
sheet1.write(0, 3, '公司地点')
sheet1.write(0, 4, '公司性质')
sheet1.write(0, 5, '薪资')
sheet1.write(0, 6, '学历要求')
sheet1.write(0, 7, '工作经验')
sheet1.write(0, 8, '公司规模')
#sheet1.write(0, 9, '公司类型')
sheet1.write(0, 9,'公司福利')
sheet1.write(0, 10,'发布时间')
number = 1
item = input()

for j in range(1,10000):   # adjust the page range as needed
    try:
        print("正在爬取第"+str(j)+"页数据...")
        html = getfront(j,item)      # fetch the source of result page j
        for i in getInformation(html):
            try:
                # Indices follow the capture groups of the new pattern in getInformation:
                # 1 job name, 3 company name, 4 salary, 5 update date, 6 company type,
                # 7 welfare, 8 location, 9 experience, 10 education, 12 company size
                sheet1.write(number,0,number)
                sheet1.write(number,1,i[1])
                sheet1.write(number,2,i[3])
                sheet1.write(number,3,i[8])
                sheet1.write(number,4,i[6])
                sheet1.write(number,5,i[4])
                sheet1.write(number,6,i[10])
                sheet1.write(number,7,i[9])
                sheet1.write(number,8,i[12])
                sheet1.write(number,9,i[7])
                sheet1.write(number,10,i[5])
                number+=1
                excel1.save("51job.xls")   # save after every row so that rows already crawled survive an interruption
                time.sleep(0.3) # pause between requests so that heavy crawling is not mistaken for an attack and the IP banned
            except Exception as e:
                print("skipped one posting:", e)
    except Exception as e:
        print("page", j, "failed:", e)
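
Because the listing data is now embedded as JavaScript data, another option is to parse it as JSON instead of matching it with a regular expression. The sketch below runs on an invented payload: the field names are taken from the pattern in `getInformation`, but the surrounding structure (a top-level `engine_search_result` list), the sample values, and the `parse_jobs` name are assumptions that would have to be checked against the real page source.

```python
import json

# Invented payload; inspect the real page to confirm the actual layout
payload = '''
{"engine_search_result": [
  {"job_href": "https://jobs.51job.com/example.html",
   "job_name": "大数据开发工程师",
   "company_name": "示例公司",
   "providesalary_text": "1-1.5万/月",
   "attribute_text": ["上海", "3-4年经验", "本科", "招2人"]}
]}
'''

def parse_jobs(text):
    """Return (name, company, salary, attributes) tuples from the JSON text."""
    data = json.loads(text)
    return [
        (job['job_name'], job['company_name'], job['providesalary_text'], job['attribute_text'])
        for job in data['engine_search_result']
    ]

for row in parse_jobs(payload):
    print(row)
```

Parsing keeps `attribute_text` as a list, so a posting with fewer attributes would not shift the other fields, which is harder to guarantee with a pattern that expects exactly four quoted values.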

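Feedback 2

The crawl gets stuck

This is most likely caused by the network, and I do not have a particularly good fix; when it happens, the only option is to crawl again. The rows crawled so far have already been written to the Excel file, so when you re-run the program you can set the starting page to the page where it got stuck and change the sheet name. Keep in mind that the script creates a fresh workbook on every run, so also save to a different file name if the earlier rows in 51job.xls should be kept. The cleaning script then reads the new sheet: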
data = pd.read_excel(r'51job.xls',sheet_name='Job2')
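
One way to reduce stalls is to give every request a timeout and retry a few times before giving up. This is a sketch, not the project's code: `fetch_with_retry` is a hypothetical helper, and the fetch function is passed in so that the retry logic can be tried without network access. In the crawler, the fetch function could call `urllib.request.urlopen(url, timeout=10)`, since `urlopen` accepts a `timeout` argument.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on failure wait `delay` seconds and retry, up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError as error:          # URLError and socket timeouts are OSError subclasses
            last_error = error
            print('attempt', attempt, 'failed:', error)
            time.sleep(delay)
    raise last_error

# Demonstration with a fake fetch function that fails twice and then succeeds
calls = {'n': 0}

def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError('simulated timeout')
    return '<html>ok</html>'

result = fetch_with_retry(flaky_fetch, 'https://search.51job.com/', delay=0)
print(result)
```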

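data.to_excel('1.xls') raises an error when saving

The pandas message reads: the xlwt package is no longer maintained, and the xlwt engine will be removed in a future version of pandas. It is the only engine in pandas that supports writing the xls format. Install openpyxl and write to an xlsx file instead. You can set the option io.excel.xls.writer to 'xlwt' to silence this warning; the option itself is deprecated and also raises a warning, but it can be set globally to suppress the message.

Changing data.to_excel('1.xls') to data.to_excel('1.xlsx') fixes it.

pyecharts errors

The project originally depended on pyecharts v0.5.x, which is incompatible with the newer v1.x,
so the code has been modified to work with the latest pyecharts.
Reference (official example gallery): https://github.com/pyecharts/pyecharts-gallery

The full code: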


# -*- coding:utf-8 -*-
import urllib.request
import xlwt
import re
import urllib.parse
import time
header={
    'Host':'search.51job.com',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
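def getfront(page,item):       # page is the page number; item is the search keyword
     result = urllib.parse.quote(item)                  # percent-encode (URL-encode) the keyword
     ur1 = result+',2,'+ str(page)+'.html'
     ur2 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
     res = ur2+ur1    # assemble the full URL
     req = urllib.request.Request(res, headers=header)  # attach the browser-like headers defined above
     a = urllib.request.urlopen(req)
     html = a.read().decode('gbk')      # read the page source and decode it from GBK into a str
     html = html.replace('\\','')       # strip the backslashes used for escaping
     html = html.replace('[', '')
     html = html.replace(']', '')
     #print(html)
     return html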

def getInformation(html):
    reg = re.compile(r'\{"type":"engine_search_result","jt":"0".*?"job_href":"(.*?)","job_name":"(.*?)".*?"company_href":"(.*?)","company_name":"(.*?)","providesalary_text":"(.*?)".*?"updatedate":"(.*?)".*?,'
                     r'"companytype_text":"(.*?)".*?"jobwelf":"(.*?)".*?"attribute_text":"(.*?)","(.*?)","(.*?)","(.*?)","companysize_text":"(.*?)","companyind_text":"(.*?)","adid":""},',re.S)  # re.S lets "." also match newlines
    items=re.findall(reg,html)
    print(items)
    return items

# Create a new workbook
excel1 = xlwt.Workbook()
# Add a worksheet; cell_overwrite_ok=True allows a cell to be written more than once
sheet1 = excel1.add_sheet('Job', cell_overwrite_ok=True)
sheet1.write(0, 0, '序号')
sheet1.write(0, 1, '职位')
sheet1.write(0, 2, '公司名称')
sheet1.write(0, 3, '公司地点')
sheet1.write(0, 4, '公司性质')
sheet1.write(0, 5, '薪资')
sheet1.write(0, 6, '学历要求')
sheet1.write(0, 7, '工作经验')
sheet1.write(0, 8, '公司规模')
#sheet1.write(0, 9, '公司类型')
sheet1.write(0, 9,'公司福利')
sheet1.write(0, 10,'发布时间')
number = 1
item = input()

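for j in range(1,100):   # adjust the page range as needed
    try:
        print("正在爬取第"+str(j)+"页数据...")
        html = getfront(j,item)      # fetch the source of result page j
        for i in getInformation(html):
            try:
                sheet1.write(number,0,number)
                sheet1.write(number,1,i[1])
                sheet1.write(number,2,i[3])
                sheet1.write(number,3,i[8])
                sheet1.write(number,4,i[6])
                sheet1.write(number,5,i[4])
                sheet1.write(number,6,i[10])
                sheet1.write(number,7,i[9])
                sheet1.write(number,8,i[12])
                #sheet1.write(number,9,i[7])
                sheet1.write(number,9,i[7])
                sheet1.write(number,10,i[5])
                number+=1
                excel1.save("51job.xls")   # save after every row so that rows already crawled survive an interruption
                time.sleep(0.3) # pause between requests so that heavy crawling is not mistaken for an attack and the IP banned
            except Exception as e:
                print("skipped one posting:", e)   # report the error instead of silently ignoring it
    except Exception as e:
        print("page", j, "failed:", e)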


#coding:utf-8
import pandas as pd
import re

data = pd.read_excel(r'51job.xls',sheet_name='Job')
result = pd.DataFrame(data)

a = result.dropna(axis=0,how='any')
pd.set_option('display.max_rows',None)     # display all rows without truncation

b = u'数据'
number = 1
li = a['职位']
for i in li.index:                 # iterate over index labels; range(len(li)) would miss rows after dropna()
    try:
        if b in li[i]:
            #print(number,li[i])
            number+=1
        else:
            a = a.drop(i,axis=0)  # drop the whole row
    except TypeError:              # the cell is not a string
        pass

b2 = '人'
li2 = a['学历要求']
for i in li2.index:
    try:
        if b2 in li2[i]:
            # print(i,li2[i])
            a = a.drop(i, axis=0)
    except TypeError:
        pass

b3 =u'万/年'
b4 =u'千/月'
li3 = a['薪资']
# the commented-out print calls are only for debugging
for i in li3.index:
    try:
        if b3 in li3[i]:
            x = re.findall(r'\d*\.?\d+',li3[i])
            #print(x)
            min_ = format(float(x[0])/12,'.2f')              # convert to float and keep two decimals
            max_ = format(float(x[1])/12,'.2f')
            a.loc[i,'薪资'] = min_+'-'+max_+u'万/月'          # strings are immutable, so write the result back into the DataFrame
        elif b4 in li3[i]:
            x = re.findall(r'\d*\.?\d+',li3[i])
            #print(x)
            #input()
            min_ = format(float(x[0])/10,'.2f')
            max_ = format(float(x[1])/10,'.2f')
            a.loc[i,'薪资'] = min_+'-'+max_+u'万/月'
        print(i,a.loc[i,'薪资'])

    except (TypeError, IndexError):    # non-string cell, or fewer than two numbers in the text
        pass
a.to_excel('51job2.xlsx', sheet_name='Job', index=False)
#############################################################################################
import pandas as pd
import re
from pyecharts.charts import Funnel,Pie,Geo
import matplotlib.pyplot as plt
from pyecharts import options as opts
from pyecharts.datasets import register_url


file = pd.read_excel(r'51job2.xlsx',sheet_name='Job')   # the file name saved by the cleaning step above (reading .xlsx requires openpyxl)
f = pd.DataFrame(file)
pd.set_option('display.max_rows',None)

add = f['公司地点']
sly = f['薪资']
edu = f['学历要求']
exp = f['工作经验']
address =[]
salary = []
education = []
experience = []
for i in range(0,len(f)):
    try:
        a = add[i].split('-')
        address.append(a[0])
        #print(address[i])
        s = re.findall(r'\d*\.?\d+',sly[i])
        s1= float(s[0])
        s2 =float(s[1])
        salary.append([s1,s2])
        #print(salary[i])
        education.append(edu[i])
        #print(education[i])
        experience.append(exp[i])
        #print(experience[i])
    except:
       pass

min_s=[]                            # list of minimum salaries
max_s=[]                            # list of maximum salaries
for i in range(0,len(experience)):
    min_s.append(salary[i][0])
    max_s.append(salary[i][1])      # index 1 is the upper bound of the range
# If matplotlib cannot display Chinese text, use the following settings.
plt.rcParams['font.sans-serif'] = ['KaiTi'] # set the default font
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from rendering as a box in saved figures

my_df = pd.DataFrame({'experience':experience, 'min_salay' : min_s, 'max_salay' : max_s})             # relate work experience to salary
data1 = my_df.groupby('experience').mean()['min_salay'].plot(kind='line')
plt.show()
my_df2 = pd.DataFrame({'education':education, 'min_salay' : min_s, 'max_salay' : max_s})              # relate education to salary
data2 = my_df2.groupby('education').mean()['min_salay'].plot(kind='line')
plt.show()

def get_edu(list):
    education2 = {}
    for i in set(list):
        education2[i] = list.count(i)
    return education2
dir1 = get_edu(education)
# print(dir1)

attr= dir1.keys()
value = dir1.values()

# Old pyecharts (v0.5.x)
# pie = Pie("学历要求")
# pie.add("", attr, value, center=[50, 50], is_random=False, radius=[30, 75], rosetype='radius',
#         is_legend_show=False, is_label_show=True,legend_orient='vertical')
# pie.render('学历要求玫瑰图.html')

# New pyecharts (v1.x)
c = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(attr, value)],
        radius=["40%", "75%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Pie-Radius"),
        legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_left="2%"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    .render("学历要求玫瑰图.html")
)

def get_address(list):
    address2 = {}
    for i in set(list):
        address2[i] = list.count(i)
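    address2.pop('异地招聘', None)   # not a city; None avoids a KeyError when the key is absent
    # Some place names may be invalid or missing from the map package and can be removed by hand; the names below used to raise errors, but the package seems to have been updated since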
    #address2.pop('山东')
    #address2.pop('怒江')
    #address2.pop('池州')
    return address2
dir2 = get_address(address)
#print(dir2)
attr2 = dir2.keys()
value2 = dir2.values()

# Old pyecharts (v0.5.x)
# geo = Geo("大数据人才需求分布图", title_color="#2E2E2E",
#           title_text_size=24,title_top=20,title_pos="center", width=1300,height=600)

# geo.add("",attr2, value2, type="effectScatter", is_random=True, visual_range=[0, 1000], maptype='china',symbol_size=8, effect_scale=5, is_visualmap=True)
# geo.render('大数据城市需求分布图.html')

# New pyecharts (v1.x)
c = (
    Geo()
    .add_schema(maptype="china")
    .add("geo", [list(z) for z in zip(attr2, value2)])
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        visualmap_opts=opts.VisualMapOpts(), title_opts=opts.TitleOpts(title="Geo-基本示例")
    )
    .render("大数据城市需求分布图.html")
)

def get_experience(list):
    experience2 = {}
    for i in set(list):
         experience2[i] = list.count(i)
    return experience2
dir3 = get_experience(experience)
#print(dir3)

attr3= dir3.keys()
value3 = dir3.values()

# Old pyecharts (v0.5.x)
# funnel = Funnel("工作经验漏斗图",title_pos='center')
# funnel.add("", attr3, value3,is_label_show=True,label_pos="inside", label_text_color="#fff",legend_orient='vertical',legend_pos='left')
# funnel.render('工作经验要求漏斗图.html')

# New pyecharts (v1.x)
c = (
    Funnel()
    .add(
        "",
        [list(z) for z in zip(attr3, value3)],
        label_opts=opts.LabelOpts(position="inside"),
    )
    .set_global_opts(title_opts=opts.TitleOpts(title="Funnel-Label（inside)"))
    .render("工作经验要求漏斗图.html")
)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

invoker said:

December 14, 2021, 11:12 PM

It looks like the URL rules have changed, so the scrape no longer works. Is it because an extra %25 now appears in the percent-encoded keyword?
1. 云淡风轻 said:
  
  December 15, 2021, 2:12 PM
  
  That should not be the case. I pulled the code and ran it yesterday specifically to check, and it scraped the data normally.
橘子 said:

December 14, 2021, 2:17 PM

I just tried it and it seems nothing can be scraped any more. I already changed the header and it still does not work.
1. 云淡风轻 said:
  
  December 14, 2021, 2:43 PM
  
  Open the page in a browser and check whether it loads normally or is being blocked by an anti-scraping mechanism.
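  1. 橘子 said:
    
    December 15, 2021, 3:31 PM
    
    I tested it and the page can be fetched, but in getInformation(html) the resulting items is empty, so the code never enters the try block that follows. Why is that? Did the site change its rules? Could you please take a look?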
    1. 云淡风轻 said:
      
      December 15, 2021, 3:41 PM
      
      Add me on QQ (2584456944) and we can talk there; screenshots will make it clearer.
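正函数 said:

November 7, 2021, 11:53 AM

Much respect. Many websites have anti-scraping mechanisms, which frustrates me: I cannot find the page source, so I cannot get the data I want. I learned a lot here, thank you very much.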
frytea said:

September 2, 2021, 3:44 PM

This post seems to be really popular. Impressive.
1. 云淡风轻 said:
  
  September 2, 2021, 6:34 PM
  
  Naturally.
2. 云淡风轻 said:
  
  September 2, 2021, 6:34 PM
  
  60,000 views on CSDN.
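苏渊 said:

August 30, 2021, 3:57 PM

I came here for this post on scraping and visualization, but ended up getting interested in the personal blog itself. Could you write a post on how to build a personal blog? I would definitely read it!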
1. 云淡风轻 said:
  
  August 30, 2021, 5:33 PM
  
  Thanks for the kind words. I am considering a post on setting up a personal blog and will publish it later.
123 said:

August 16, 2021, 1:26 AM

I would like to ask about 51job_view 2.py. When I run it, it shows File "C:\Users\13300\Desktop\51job_spiders-master\51job_view2.py", line 6, in data = pd.read_excel(r'51job.xls',sheet_name='Job'), and there are several more errors like this below it. Please help; I will buy you a milk tea. Thanks.
阿呆叔叔 said:

August 12, 2021, 11:08 AM

Why is the scrape result empty? 前端开发 正在爬取第1页数据... [] 正在爬取第2页数据... [] 正在爬取第3页数据... []
1. 阿呆叔叔 said:
  
  August 12, 2021, 11:22 AM
  
  And I had already updated the header as well.
2. 云淡风轻 said:
  
  August 12, 2021, 11:26 AM
  Are you running the code from GitHub?
  1. 阿呆叔叔 said:
    
    August 12, 2021, 12:45 PM
    
    Yes. I also ran the version from the blog, and that does not work either.
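    1. 云淡风轻 said:
      
      August 12, 2021, 12:58 PM
      
      The program runs normally, so there is no runtime problem; empty results need debugging. Uncomment some of the commented-out print statements earlier in the code and check whether the data looks abnormal.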
      1. 阿呆叔叔 said:
        
        August 12, 2021, 1:07 PM
        
        OK, I will keep working on it myself. Thanks, and sorry for the trouble.
3. qq_45700048 said:
  
  March 17, 2022, 5:41 PM
  
  Could we connect on QQ to talk about this problem?
修安心 said:

July 25, 2021, 11:51 AM

Why does it never enter the inner loop? The console only shows the three characters 大数据, and then nothing happens.
修安心 said:

July 25, 2021, 11:28 AM

For the browser-simulation part, should I use my own browser's version, or will any version do?
1. 云淡风轻 said:
  
  July 26, 2021, 8:36 AM
  
  Use your own.
小雪 said:

July 3, 2021, 10:45 AM

Hello, the scraper still does nothing when I run it. Could you take a look?
1. 莫问 said:
  
  July 15, 2021, 11:21 AM
  
  After starting the script you have to type the position to scrape, for example "大数据".

Hong's Blog
