Python Data Visualization: Building a Job Analysis Tool That Generates Job Analysis Reports

2021-09-27 13:01:38

Introduction

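Why do job analysis? Job analysis is the foundation and core of human resource development and management. It is the basis for HR planning, recruitment, training, compensation, performance evaluation, appraisal, and incentives. In addition, from the analysis of different positions we can present a data report for each position visually.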

First, let's look at what the finished analysis looks like:

Now let's walk through how this small tool is built.

1. Core feature design

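Overall, the job analyzer takes a job keyword and a number of records as input, crawls the matching postings automatically, and shows them in a table. It then analyzes the company types and sizes, the education requirements, and the salary distribution of those postings.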

Breaking the requirements down gives roughly the following core features:

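Display job data in a table

Crawl postings automatically from the job keyword and record count entered
Clean the raw crawled postings
Read and display the cleaned postings

Analyze salaries

Plot a line chart of work experience against the corresponding average salary
Summarize the salary ranges and show a salary histogram, to give a picture of what the position pays

Analyze the hiring companies

Show the distribution of company types, such as private, joint venture, listed, and foreign
Show the distribution of company sizes, such as fewer than 50 people, 50-150, and 150-500
Count the education requirements of the postings

Export the analysis

Preview the charts in a popup window and save them to a file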

With the core features settled, we start with the GUI design.

2. GUI design and implementation

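Based on these features, we can first sketch a simple UE layout and then build it with a GUI library. tkinter is used here, mainly because it is simple and convenient.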

Based on the UI design, the GUI code is as follows:

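from tkinter import *
import tkinter as tk
from tkinter import ttk

# Create the main window
root = Tk()
root.title('職位查詢分析資料平臺 -- Dragon少年')
# Fix the window size
root.minsize(1380, 730)
root.resizable(False, False)
# Screen width
sw = root.winfo_screenwidth()
# Screen height
sh = root.winfo_screenheight()
ww = 1380
wh = 730
# Offsets that centre the window on the screen
x = (sw-ww) / 2
y = (sh-wh) / 2
root.geometry("%dx%d+%d+%d" %(ww,wh,x,y))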
frame_left_top = Frame(width=1050, height=400)
frame_right_top = Frame(width=320, height=400)
# Table area
tree = ttk.Treeview(frame_left_top, show="headings", height=18,
                    columns=("n", "a", "b", "c", "d", "e", "f", "g", "h"))
vbar = ttk.Scrollbar(frame_left_top, orient=VERTICAL, command=tree.yview)
# Link the table to the vertical scrollbar
tree.configure(yscrollcommand=vbar.set)
# Column widths and headings
tree.column("n", width=60, anchor="center")
tree.column("a", width=180, anchor="center")
tree.column("b", width=200, anchor="center")
tree.column("c", width=100, anchor="center")
tree.column("d", width=100, anchor="center")
tree.column("e", width=80, anchor="center")
tree.column("f", width=100, anchor="center")
tree.column("g", width=90, anchor="center")
tree.column("h", width=90, anchor="center")
tree.heading("n", text="序號")
tree.heading("a", text="崗位名稱")
tree.heading("b", text="公司名稱")
tree.heading("c", text="公司型別")
tree.heading("d", text="公司規模")
tree.heading("e", text="學歷")
tree.heading("f", text="工作經驗")
tree.heading("g", text="最低工資（k）")
tree.heading("h", text="最高工資（k）")
tree.grid(row=0, column=0, sticky=NSEW)
vbar.grid(row=0, column=1, sticky=NS)
# Position the two top frames
frame_left_top.grid(row=0, column=0, padx=4, pady=5)
frame_right_top.grid(row=0, column=1, padx=2, pady=2)
frame_left_top.grid_propagate(0)
frame_right_top.grid_propagate(0)
type_str=StringVar()
# Panel that shows the company-type statistics
habits = tk.LabelFrame(root, text="公司型別", padx=10, pady=4 )  # 10 px horizontal and 4 px vertical padding
habits.place(x=1035, y=170)
habits_Window = Label(habits, textvariable=type_str, width=30, height=10,  font=('楷體', 12))
habits_Window.grid()
size_str=StringVar()
# Panel that shows the company-size statistics
company_size = tk.LabelFrame(root, text="公司規模", padx=10, pady=4 )  # 10 px horizontal and 4 px vertical padding
company_size.place(x=1035, y=370)
company_size_Window = Label(company_size, textvariable=size_str, width=30, height=8,  font=('楷體', 12))
company_size_Window.grid()
edu_str=StringVar()
# Panel that shows the education-requirement statistics
company_edu = tk.LabelFrame(root, text="學歷要求", padx=10, pady=4 )  # 10 px horizontal and 4 px vertical padding
company_edu.place(x=1035, y=540)
company_edu_Window = Label(company_edu, textvariable=edu_str, width=30, height=8,  font=('楷體', 12))
company_edu_Window.grid()
# Open file (disabled)
# right_top_button = Button(frame_right_top, text="開啟檔案", command=lambda :openFile(), font=('楷體', 12))
# place() returns None, so each widget is created first and placed on the next line
input_name = Label(frame_right_top, text='崗位關鍵字:', font=('楷體', 12))
input_name.place(x=0, y=10)
label = StringVar()
entry = Entry(frame_right_top, bg='#ffffff', width=20, textvariable=label, font=('楷體', 12))
entry.place(x=120, y=10)
input_num = Label(frame_right_top, text='資料條數:', font=('楷體', 12))
input_num.place(x=0, y=50)
label_num = StringVar()
entry_num = Entry(frame_right_top, bg='#ffffff', width=15, textvariable=label_num, font=('楷體', 12))
entry_num.place(x=80, y=50)
btn_search = Button(frame_right_top, text="查詢輸出", command=lambda :openFile(label, label_num), font=('楷體', 12))
btn_search.place(x=210, y=50)
right_pic_button = Button(frame_right_top, text="工作經驗對應薪資圖", command=lambda: show_plot(), font=('楷體', 12))
right_pic_button.place(x=0, y=90)
right_hist_button = Button(frame_right_top, text="工資分佈圖", command=lambda: show_hist(), font=('楷體', 12))
right_hist_button.place(x=180, y=90)
right_data_button = Button(frame_right_top, text="資料分析", command=lambda: show_data(), font=('楷體', 12))
right_data_button.place(x=0, y=130)

This creates the widgets of the main window, mainly Label (text), Entry (text input box), Treeview (table), and LabelFrame (widget container).

The result looks like this:
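
The window-centering arithmetic used for the main window can be isolated into a small helper. The sketch below is only an illustration: `center_geometry` is a hypothetical name that does not appear in the original tool, and it uses integer division so that the offsets are whole pixels.

```python
def center_geometry(screen_w, screen_h, win_w, win_h):
    """Return a Tk geometry string that centres a window on the screen."""
    x = (screen_w - win_w) // 2
    y = (screen_h - win_h) // 2
    return "%dx%d+%d+%d" % (win_w, win_h, x, y)


# Example: the 1380x730 main window on a 1920x1080 screen
print(center_geometry(1920, 1080, 1380, 730))
```

With Tk it could be called as `root.geometry(center_geometry(root.winfo_screenwidth(), root.winfo_screenheight(), 1380, 730))`.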

3. Implementation

With the features defined and the GUI laid out, we can start implementing the logic.

3.1 Job data crawler

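The job data comes from 51job, and the crawl is wrapped in a function. Per the core feature requirements, this function takes the job keyword and the number of records as parameters, crawls automatically, and saves the data.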

First, after analyzing the 51job search page, we fetch one list page. The code is as follows:

import requests

# Fetch one list page
def geturl(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    # Request the page at url and keep its source as text
    page_text = requests.get(url=url, headers=headers).text
    # print(page_text)
    return page_text
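
The URL handed to `geturl` embeds the search keyword, which the main function of this article percent-encodes twice before putting it into the path. The sketch below shows that step on its own. `build_search_url` is a hypothetical helper, the URL layout is copied from the article's main function, and the site's URL scheme may have changed since the article was written.

```python
from urllib import parse

# Assumption: same URL layout as the search URL used in the article's main()
BASE = "https://search.51job.com/list/080000,000000,0000,00,9,99,"


def build_search_url(keyword, page):
    """Build the list-page URL for a keyword and a 1-based page number."""
    encoded = parse.quote(parse.quote(keyword))  # percent-encode twice
    return BASE + encoded + ",2," + str(page) + ".html"


print(build_search_url("python", 1))
print(build_search_url("設計師", 2))
```

After the second pass every `%` produced by the first pass becomes `%25`, which is why non-ASCII keywords look so long in the final URL.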

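Next, after analyzing the data in the page, we extract every posting on the list page. The fields mainly include job title, company name, salary, company type, requirements, and benefits. The code is as follows: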

import re

# Extract every posting on a list page
def get_data(page_text):
    # Regular expressions for each field
    job_href = '"job_href":"(.*?)"'              # posting link
    job_name = '"job_name":"(.*?)"'              # job title
    com_href = '"company_href":"(.*?)"'          # company link
    com_name = '"company_name":"(.*?)"'          # company name
    salary = '"providesalary_text":"(.*?)"'     # salary
    company_type = '"companytype_text":"(.*?)"'  # company type
    attribute = r'"attribute_text":\[(.*?)\]'    # requirements (literal brackets escaped)
    work_area = '"workarea_text":"(.*?)"'        # work location
    company_size = '"companysize_text":"(.*?)"'  # company size
    company_ind = '"companyind_text":"(.*?)"'    # main business
    job_welf = '"jobwelf":"(.*?)"'               # benefits
    # Arguments: pattern, text to search, flags; re.S lets "." match newlines too
    jobName_list = re.findall(job_name, page_text, re.S)
    comName_list = re.findall(com_name, page_text, re.S)
    salary_list = re.findall(salary, page_text, re.S)
    companytype_list = re.findall(company_type, page_text, re.S)
    attribute_list = re.findall(attribute, page_text, re.S)
    workarea_list = re.findall(work_area, page_text, re.S)
    companysize_list = re.findall(company_size, page_text, re.S)
    companyind_list = re.findall(company_ind, page_text, re.S)
    jobwelf_list = re.findall(job_welf, page_text, re.S)
    all_list = [jobName_list, comName_list, salary_list,
                companytype_list, attribute_list, workarea_list, companysize_list,
                companyind_list, jobwelf_list]
    return all_list
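
To see how these patterns behave, it helps to run two of them on a small string. The fragment below is made up for illustration and only imitates the shape of the JSON embedded in a list page. Note that the square brackets around the requirements list need a backslash: unescaped, `[(.*?)]` is read as a character class rather than as a capture group between literal brackets.

```python
import re

# Hypothetical fragment shaped like the JSON embedded in a list page
sample = ('{"job_name":"UI設計師","company_name":"某公司",'
          '"attribute_text":["杭州","3-4年經驗","本科"]}')

job_name = re.findall(r'"job_name":"(.*?)"', sample, re.S)
# Escaped brackets match the literal [ and ] around the list
attribute = re.findall(r'"attribute_text":\[(.*?)\]', sample, re.S)
# Without the escapes the pattern finds nothing in this sample
unescaped = re.findall('"attribute_text":[(.*?)]', sample, re.S)

print(job_name)
print(attribute)
print(unescaped)
```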

Finally, the postings are saved to a CSV file so that they can be cleaned later. The main code is as follows:

import csv
from urllib import parse

# Main function
def main(kw, num):
    # Percent-encode the keyword twice
    # kw = input("請輸入你要搜尋的崗位關鍵字：")
    keyword = parse.quote(parse.quote(kw))
    page_num = 0
    col = ["崗位名稱", "公司名稱", "薪資水平", "公司型別", "招聘條件", "工作地點", '公司規模', '主要業務', '福利待遇']
    csv_file = open("51job.csv", "w+", encoding='utf-8', newline='')
    try:
        writer = csv.writer(csv_file)
        writer.writerow(col)
        for i1 in range(0, num):  # crawl the first num pages
            page_num += 1
            url = "https://search.51job.com/list/080000,000000,0000,00,9,99," + keyword + ",2," + str(
                page_num) + ".html"
            page_text = geturl(url)
            all_list = get_data(page_text)
            if len(all_list[0]) == 0:
                print('沒有搜尋到職位資訊')
                break
            else:
                print('正在爬取第%d頁' % page_num)
                save_data(all_list, writer, search_num=len(all_list[0]))
    finally:
        csv_file.close()

3.2 Data preprocessing

The crawler has given us the raw data. Next we need to clean it: remove outliers and missing values, and convert it into the format we need.

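First we tidy up the work location. The crawled data defaults to Zhejiang Province, so we map each work location to one of Zhejiang's prefecture-level cities. The code is as follows: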

import pandas as pd

job = pd.read_csv("51job.csv", encoding='utf-8')
df = pd.DataFrame(job)
# Number of postings per prefecture-level city of Zhejiang
dict_city = {'杭州': 0, '湖州': 0, '紹興': 0, '寧波': 0, '嘉興': 0, '麗水': 0, '台州': 0, '溫州': 0, '金華': 0, '衢州': 0, '舟山': 0}
city = df.loc[:, "工作地點"]
print(city.shape[0])
for i in range(city.shape[0]):
    for k in dict_city:
        # str() guards against missing values, which pandas reads as float NaN
        if k in str(city[i]):
            dict_city[k] = dict_city[k] + 1
            df.loc[i, "城市"] = k
            break
# print(list(dict_city.keys()))
# print(list(dict_city.values()))
# Rows that matched none of the cities are labelled "其他" (other)
if "城市" not in df.columns:
    df["城市"] = None
df["城市"] = df["城市"].fillna("其他")
# print(df["城市"].value_counts())
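
The matching rule in this loop (first city whose name occurs in the location string, otherwise "其他") can also be written as a standalone function, which is easier to test. The sketch below uses plain Python; `match_city` and the sample locations are made up for illustration.

```python
from collections import Counter

ZHEJIANG_CITIES = ['杭州', '湖州', '紹興', '寧波', '嘉興', '麗水',
                   '台州', '溫州', '金華', '衢州', '舟山']


def match_city(work_area, cities=ZHEJIANG_CITIES, default="其他"):
    """Return the first city whose name occurs in work_area, else default."""
    for name in cities:
        if name in str(work_area):
            return name
    return default


areas = ["杭州-西湖區", "寧波", "異地招聘", "杭州"]
labels = [match_city(a) for a in areas]
print(labels)
print(Counter(labels))
```

With pandas, the same function could be applied to the whole column in one step, for example `df["城市"] = df["工作地點"].apply(match_city)`.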

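Next we convert the salaries in the raw data to a fixed format, thousand yuan per month (k/month), and save the cleaned salary data separately. The main code is as follows: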

def get_salary(salary):
    number = re.compile(r'(\d*\.?\d+)')  # an integer or decimal number
    if '-' in salary:  # ranges such as 1-2萬/月 or 10-20萬/年
        low_salary = float(number.findall(salary)[0])
        high_salary = float(number.findall(salary)[1])
        if '萬' in salary and '年' in salary:  # normalise to thousand yuan per month
            low_salary = round(low_salary / 12 * 10, 1)
            high_salary = round(high_salary / 12 * 10, 1)
        elif '萬' in salary and '月' in salary:
            low_salary = low_salary * 10
            high_salary = high_salary * 10
    else:  # values such as 20萬以上/年 or 100元/天: no range, so only a minimum is given
        low_salary = float(number.findall(salary)[0])
        if '萬' in salary and '年' in salary:  # normalise to thousand yuan per month
            low_salary = round(low_salary / 12 * 10, 1)
        elif '萬' in salary and '月' in salary:
            low_salary = low_salary * 10
        elif '元' in salary and '天' in salary:
            low_salary = round(low_salary / 1000 * 21, 1)   # 21 working days per month
        elif '元' in salary and '小時' in salary:
            low_salary = round(low_salary / 1000 * 8 * 21, 1)   # 8 hours a day, 21 working days per month
        high_salary = low_salary
    return low_salary, high_salary

job = pd.read_csv("51job_pre.csv", encoding='utf-8')
job_df = job.drop("福利待遇", axis=1)
job_df = job_df.dropna(axis=0, how="any")
for index, row in job_df.iterrows():
    salary = row["薪資水平"]
    if salary:  # if the salary field is not empty, compute the minimum and maximum
        getsalary = get_salary(salary)
        low_salary = getsalary[0]
        high_salary = getsalary[1]
    else:
        low_salary = high_salary = "0"
    job_df.loc[index, "最低工資（k）"] = low_salary
    job_df.loc[index, "最高工資（k）"] = high_salary
    job_df.loc[index, "平均工資（k）"] = round((float(low_salary) + float(high_salary)) / 2, 1)
job_df.to_csv("./51job_pre2.csv", index=False)
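
The unit conversion is plain arithmetic: 1 萬 is 10 thousand, a year is 12 months, and a month is taken as 21 working days of 8 hours. The sketch below restates it in a more compact form so the numbers can be checked by hand. `to_k_per_month` and `normalize_salary` are hypothetical names, and unlike the function in the article this version applies the same unit rules to ranges and to single values.

```python
import re

NUMBER = re.compile(r'\d*\.?\d+')  # an integer or decimal number


def to_k_per_month(value, text):
    """Convert one figure to thousand yuan per month, based on the unit in text."""
    if '萬' in text and '年' in text:
        return round(value / 12 * 10, 1)
    if '萬' in text and '月' in text:
        return round(value * 10, 1)
    if '元' in text and '天' in text:
        return round(value / 1000 * 21, 1)        # 21 working days per month
    if '元' in text and '小時' in text:
        return round(value / 1000 * 8 * 21, 1)    # 8 hours a day
    return value                                   # assumed to be 千/月 already


def normalize_salary(text):
    """Return (low, high) in thousand yuan per month."""
    figures = [float(n) for n in NUMBER.findall(text)]
    low = to_k_per_month(figures[0], text)
    is_range = '-' in text and len(figures) > 1
    high = to_k_per_month(figures[1], text) if is_range else low
    return low, high


for s in ["1-2萬/月", "10-20萬/年", "6-8千/月", "100元/天"]:
    print(s, normalize_salary(s))
```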

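With the salaries cleaned, we can see that the requirements column contains the education and work experience we need, so we extract them into separate columns. The code is as follows: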

job_df.to_csv("./51job_pre2.csv", index=False)
job_df = pd.read_csv("51job_pre2.csv", encoding='utf-8')
job_df["學歷"] = job_df["招聘條件"].apply(lambda x: re.findall("本科|大專|高中|中專|碩士|博士|初中及以下", x))
job_df["工作經驗"] = job_df["招聘條件"].apply(lambda x: re.findall(r',".*經驗"|,"在校生/應屆生"', x))
# func and func2 are helper functions that are not listed in this article
job_df["學歷"] = job_df["學歷"].apply(func)
job_df["工作經驗"] = job_df["工作經驗"].apply(func2)
# Columns: 薪資水平,公司型別,招聘條件,工作地點,公司規模,主要業務,城市,最低工資（k）,最高工資（k）,平均工資（k）
job_df = job_df.drop(["薪資水平", "招聘條件", "工作地點", "主要業務", "城市"], axis=1)
job_df = job_df.dropna(axis=0, how="any")
job_df.to_csv("./51job_analysis.csv", index=False)
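
The helpers `func` and `func2` are not shown in the article. Since `re.findall` returns a list, they presumably reduce each list to a single value. The sketch below is one possible reading, with hypothetical names and a made-up requirements string. It returns `None` when nothing matched, so that a later `dropna` would remove the row.

```python
import re


def first_or_none(matches):
    """Hypothetical stand-in for func: the first regex match, or None."""
    return matches[0] if matches else None


def clean_experience(matches):
    """Hypothetical stand-in for func2: drop the surrounding comma and quotes."""
    if not matches:
        return None
    return matches[0].strip(',"')


# Made-up requirements string in the shape captured from attribute_text
attribute = '"杭州","3-4年經驗","本科","招2人"'
edu = first_or_none(re.findall("本科|大專|高中|中專|碩士|博士|初中及以下", attribute))
exp = clean_experience(re.findall(r',".*經驗"|,"在校生/應屆生"', attribute))
print(edu, exp)
```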

With the raw data cleaned, we can go on to write the data display part of the GUI.

3.3 Displaying the job data

We now have the data we need. First we read the saved table and show the postings in the table widget. The main code is as follows:

def read_csv_define(csv_path):
    global job_df
    # Clear any rows left over from a previous query
    for item in tree.get_children():
        tree.delete(item)
    job_df = pd.read_csv(csv_path, encoding='utf-8')
    columns = ["崗位名稱", "公司名稱", "公司型別", "公司規模",
               "學歷", "工作經驗", "最低工資（k）", "最高工資（k）"]
    # Insert one table row per posting, numbered from 1
    for i in range(job_df.shape[0]):
        row = job_df.iloc[i]
        tree.insert("", "end", values=(i + 1, *[row[c] for c in columns]))

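def openFile(label, label_num):
    sname = label.get()
    num_str = label_num.get()
    num = int(num_str)
    # rep and pre are presumably the crawler and preprocessing modules (imports not shown);
    # the crawler gets num/50 pages, rounded down
    rep.main(sname, int(num/50))
    pre.main()
    Filepath = "./51job_analysis.csv"
    # Open a file-selection dialog (disabled)
    # Filepath = filedialog.askopenfilename(filetypes=[('表格', '*.xls;*.csv')])  # filter by file extension
    # print(os.path.split(Filepath))
    (filepath, tempfilename) = os.path.split(Filepath)  # the right-hand side is a tuple
    try:
        # Continue only when a path is set
        if Filepath:
            # Load the table into the global DataFrame now, so that later
            # queries do not have to wait for the file to be read.
            # .xls and .csv files need different readers, so the file
            # extension is split off and checked first.
            (filename, extension) = os.path.splitext(tempfilename)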
            if extension == '.xls' or extension == '.XLS':
                read_xls(Filepath)
            elif extension == '.csv':
                read_csv_define(Filepath)
        else:
            print('未選擇任何檔案!')
            # exit_program()
    except Exception as e:
        global job_df
        job_df = pd.DataFrame()
        tkinter.messagebox.showwarning('警告', '檔案讀取異常，請檢查！')
        print("ex:", e)
    finally:
        size_str.set("")
        type_str.set("")
        edu_str.set("")
        canvas_spice.get_tk_widget().destroy()
        canvas_spice_hist.get_tk_widget().destroy()

The result looks like this:
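
One detail of `openFile` is worth a closer look: the page count is computed as `int(num/50)`, which rounds down, so a request for fewer than 50 records asks the crawler for zero pages. The sketch below rounds up instead and validates the input. `pages_needed` is a hypothetical helper, and 50 postings per page is an assumption taken from the division in the article.

```python
import math

PER_PAGE = 50  # assumption: postings per results page, as implied by num/50


def pages_needed(num_str, per_page=PER_PAGE):
    """Parse the requested record count and return how many pages to crawl."""
    num = int(num_str.strip())      # raises ValueError for non-numeric input
    if num <= 0:
        raise ValueError("record count must be positive")
    return math.ceil(num / per_page)


print(pages_needed("30"))
print(pages_needed("120"))
```

Catching the `ValueError` in the button handler would make it possible to show a warning dialog instead of failing on empty or non-numeric input.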

3.4 Salary charts

Next we compute the average salary for each level of work experience and draw a thumbnail line chart. The code is as follows:

def show_plot():
    global canvas_spice
    canvas_spice.get_tk_widget().destroy()
    # Figure and canvas
    fig, ax = plt.subplots(figsize=(5, 3.2), dpi=100)  # figure size in inches
    canvas_spice = FigureCanvasTkAgg(fig, root)
    canvas_spice.get_tk_widget().place(x=5, y=400)  # position in the main window
    work_experience = round(job_df.groupby(by='工作經驗')['平均工資（k）'].mean(), 2)
    experience_list = ["無需經驗", "1年經驗", "2年經驗", "3-4年經驗", "5-7年經驗"]
    # reindex() gives NaN for a category with no postings instead of raising KeyError
    experience_val = work_experience.reindex(experience_list).tolist()
    x = range(len(experience_list))
    y = experience_val
    plt.plot(x, y)
    plt.xticks(x, experience_list, fontsize=6)
    plt.grid(True, linestyle="--", alpha=0.5)
    plt.xlabel("工作經驗", fontsize=8)
    plt.ylabel("平均工資（k）", fontsize=8)
    plt.title("工作經驗對應薪資折線圖", fontsize=8)
    canvas_spice.draw()
    canvas_spice.get_tk_widget().bind("<Double-Button-1>", xFunc1)
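
What the `groupby(...).mean()` step computes can be shown without pandas. The sketch below groups (experience, average salary) pairs by hand and looks up the five fixed x-axis categories. A category with no postings gets `None`, which is the case to watch for because the crawled data is not guaranteed to contain every category. The function name and sample rows are made up for illustration.

```python
from collections import defaultdict


def mean_salary_by_experience(rows, order):
    """rows: (experience, average_salary_k) pairs; order: x-axis categories."""
    groups = defaultdict(list)
    for experience, salary in rows:
        groups[experience].append(salary)
    # None marks a category without postings instead of raising KeyError
    return [round(sum(groups[k]) / len(groups[k]), 2) if k in groups else None
            for k in order]


rows = [("1年經驗", 6.0), ("1年經驗", 8.0), ("3-4年經驗", 12.5)]
order = ["無需經驗", "1年經驗", "2年經驗", "3-4年經驗", "5-7年經驗"]
print(mean_salary_by_experience(rows, order))
```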

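From the average salaries in the postings we draw a salary histogram, which shows how the average salary for the position is distributed. The code is as follows: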

def show_hist():
    global canvas_spice_hist
    canvas_spice_hist.get_tk_widget().destroy()
    # Figure and canvas
    fig_hist, ax_hist = plt.subplots(figsize=(5, 3.2), dpi=100)  # figure size in inches
    canvas_spice_hist = FigureCanvasTkAgg(fig_hist, root)
    canvas_spice_hist.get_tk_widget().place(x=520, y=400)  # position in the main window
    plt.hist(job_df["平均工資（k）"].values, bins=10)
    # Maximum and minimum of the average salary
    max_ = job_df["平均工資（k）"].max()
    min_ = job_df["平均工資（k）"].min()
    # Put one tick on each of the 11 bin edges
    plt.xticks(np.linspace(min_, max_, num=11),fontsize=7)
    # Add a grid
    plt.grid()
    plt.xlabel("平均工資（k）", fontsize=8)
    plt.ylabel("崗位數量", fontsize=8)
    plt.title("工資分佈直方圖", fontsize=8)
    canvas_spice_hist.draw()
    canvas_spice_hist.get_tk_widget().bind("<Double-Button-1>", xFunc2)

The result looks like this:
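
The histogram uses 10 bins but 11 ticks because 10 equal-width bins have 11 edges, and `np.linspace(min_, max_, num=11)` places one tick on each edge. The plain-Python sketch below illustrates the idea with made-up salaries. It is not how matplotlib or NumPy implement it internally.

```python
def bin_edges(values, bins=10):
    """Return bins + 1 equally spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins)] + [hi]


def histogram(values, bins=10):
    """Count values per bin; the last bin also includes the maximum."""
    edges = bin_edges(values, bins)
    lo, hi = edges[0], edges[-1]
    counts = [0] * bins
    for v in values:
        index = bins - 1 if v == hi else int((v - lo) / (hi - lo) * bins)
        counts[index] += 1
    return edges, counts


salaries = [5, 7.5, 10, 12, 25]          # made-up average salaries in k
edges, counts = histogram(salaries, bins=4)
print(edges)
print(counts)
```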

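Besides the main window, we also want to preview a chart in a popup once it has been drawn, so a window for viewing the image is needed as well. That part is covered in the preview-and-save section later on.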

3.5 Company statistics

We can also analyze and display company size, company type, and the education required by the postings:

def show_data():
    # Company type: one "name----count" line per category
    company = job_df.loc[:, "公司型別"].value_counts()
    content = ""
    for name, count in company.items():
        content += name + '----' + str(count) + '\n'
    type_str.set(content)
    # Company size
    company = job_df.loc[:, "公司規模"].value_counts()
    size_content = ""
    for name, count in company.items():
        size_content += name + '----' + str(count) + '\n'
    size_str.set(size_content)
    # Education requirement
    education = job_df.loc[:, "學歷"].value_counts()
    edu_content = ""
    for name, count in education.items():
        edu_content += name + '----' + str(count) + '\n'
    edu_str.set(edu_content)

The result looks like this:
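
The three panels all follow the same pattern: count how often each category occurs and print one "name----count" line per category, most frequent first. A minimal sketch of that pattern with the standard library follows. `summarize` and the sample values are made up for illustration.

```python
from collections import Counter


def summarize(values):
    """Return 'name----count' lines, most frequent category first."""
    counts = Counter(values)
    return "".join(f"{name}----{count}\n" for name, count in counts.most_common())


print(summarize(["民營公司", "外資", "民營公司", "上市公司", "民營公司", "外資"]))
```

The line break must be the escape sequence `'\n'`; a plain `'n'` would run all categories together on one line.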

3.6 Preview and save

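As mentioned earlier, we want a finished chart to pop up for preview and saving. This is implemented by binding a left-button double-click event to the canvas. The popup window in turn binds a right-click event, and the image is passed to the handler through an event adaptor.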
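
The event adaptor idea can be tried in isolation, without opening a window. Tk calls a bound handler with a single event argument, so any extra data has to be captured in advance. In the sketch below the handler and the event value are stand-ins made up for illustration.

```python
def handler_adaptor(fun, **kwds):
    """Wrap fun so it can be bound as a one-argument event callback."""
    return lambda event, fun=fun, kwds=kwds: fun(event, **kwds)


def save_image(event, img):
    # Stand-in for a real handler; it only reports what it received
    return "event=%s img=%s" % (event, img)


callback = handler_adaptor(save_image, img="hist.png")
print(callback("<Button-3>"))   # Tk would pass an Event object here
```

`functools.partial(save_image, img="hist.png")` from the standard library would have the same effect.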

def handlerAdaptor(fun, **kwds):
    return lambda event, fun= fun, kwds= kwds:fun(event, **kwds)

def xFunc2(event):
    top = Toplevel()
    top.title('影象匯出')
    top.minsize(700, 450)
    top.resizable(False, False)
    # Screen width
    sw = root.winfo_screenwidth()
    # Screen height
    sh = root.winfo_screenheight()
    ww = 700
    wh = 450
    x = (sw - ww) / 2
    y = (sh - wh) / 2
    top.geometry("%dx%d+%d+%d" % (ww, wh, x, y))
    top.transient(root)
    top.grab_set()
    # Figure and canvas
    fig, ax = plt.subplots(figsize=(7, 4.5), dpi=100)  # figure size in inches
    canvas_spice = FigureCanvasTkAgg(fig, top)
    canvas_spice.get_tk_widget().place(x=1, y=1)  # position in the popup window
    plt.hist(job_df["平均工資（k）"].values, bins=10)
    # Maximum and minimum of the average salary
    max_ = job_df["平均工資（k）"].max()
    min_ = job_df["平均工資（k）"].min()
    # Put one tick on each of the 11 bin edges
    plt.xticks(np.linspace(min_, max_, num=11), fontsize=10)
    # Add a grid
    plt.grid()
    plt.xlabel("平均工資（k）", fontsize=12)
    plt.ylabel("崗位數量", fontsize=12)
    plt.title("工資分佈直方圖", fontsize=15)
    plt.savefig("hist.png")
    canvas_spice.draw()
    canvas_spice.get_tk_widget().pack()
    img_pl = Image.open("hist.png").copy()
    os.remove("hist.png")
    canvas_spice.get_tk_widget().bind("<Button-3>", handlerAdaptor(saveimg, img=img_pl))

The result looks like this:

The previewed image can then be saved:

def saveimg(event, img):
    Filepath = filedialog.asksaveasfilename(filetypes=[('影象', '*.png;')])
    if Filepath:
        if not Filepath.endswith(('.png')):
            Filepath += '.png'
        # Save the image
        img.save(Filepath, 'png')
        tkinter.messagebox.showinfo("提示", "儲存成功！")
        HWND = win32gui.GetFocus()  # handle of the window that currently has focus
        win32gui.PostMessage(HWND, win32con.WM_CLOSE, 0, 0)

That completes the code for our homemade job analyzer.

Finally, here is an analysis report for designer positions, generated with one click: