Python Web Scraping for Beginners: Analyzing Rating Trends in the Douban Movie Top 250

Python Web Scraping for Beginners: How to Scrape Douban Movie Top 250 Rating Data and Analyze Its Trends

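The Douban Movie Top 250 list is a long-standing reference for film fans. Scraping it with Python and charting the results is a practical way to learn the basics of web scraping, and it also shows how the ratings are distributed and how they vary with rank. This article walks through the whole task from scratch.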

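I. Preparation

Before starting, make sure Python is installed along with the required libraries: requests (sending HTTP requests), BeautifulSoup (parsing HTML), pandas (data processing), and matplotlib (data visualization).

Install the dependencies: pip install requests beautifulsoup4 pandas matplotlib
Work out the URL pattern of the Douban Movie Top 250 pages: https://movie.douban.com/top250?start=0, where start is the offset of the first film on the page (0, 25, 50, and so on up to 225)
Inspect the page structure and locate the key fields such as film title, rating, and rank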
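
The URL pattern can be made concrete in a few lines. This is a minimal sketch, assuming the same pattern holds for all ten pages; the `page_urls` helper and the `BASE_URL` constant are hypothetical names introduced here for illustration.

```python
BASE_URL = "https://movie.douban.com/top250"

def page_urls(total=250, page_size=25):
    """Build one URL per list page; `start` is the offset of the first film."""
    return [f"{BASE_URL}?start={offset}" for offset in range(0, total, page_size)]

urls = page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```

Opening the first and last URL in a browser is a quick way to confirm the pattern before writing any scraping code.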

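II. Writing the Crawler Code

1. Sending the request and fetching the page

Use the requests library to fetch the page content. Add a User-Agent header so that the request resembles a normal browser visit: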

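import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

def get_page(url):
    """Return the page HTML, or None if the status code is not 200."""
    # The timeout keeps a stalled connection from hanging the script
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    return None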
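
Because `get_page` returns `None` for any non-200 status, a transient failure silently drops a whole page. One possible way to soften this is a small retry wrapper. The sketch below is an assumption rather than part of the tutorial's code: `fetch_with_retry` is a hypothetical helper that accepts any fetch function with the same contract as `get_page` (HTML text or `None`), which also lets it run here without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times; return the first non-None result."""
    for attempt in range(1, retries + 1):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer after each failure
    return None

# Offline demonstration: a fake fetcher that fails twice, then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    return "<html>ok</html>" if len(calls) >= 3 else None

result = fetch_with_retry(flaky_fetch, "https://movie.douban.com/top250?start=0", delay=0)
print(result, len(calls))
```

With the real function this would be called as `fetch_with_retry(get_page, url)`.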

2. Parsing the page data

Use BeautifulSoup to parse the HTML and extract each film's details:

def parse_page(html):
    """Yield one dict (rank, title, rating) per film on the page."""
    soup = BeautifulSoup(html, 'html.parser')
    # Each film sits in its own <div class="item">
    items = soup.find_all('div', class_='item')
    for item in items:
        rank = item.find('em').text
        title = item.find('span', class_='title').text  # first title span only
        rating = item.find('span', class_='rating_num').text
        yield {
            'rank': int(rank),  # numeric, so it plots on a continuous axis
            'title': title,
            'rating': float(rating)
        }
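
Text extracted from HTML is always a string, so the numeric fields have to be converted before any analysis. A minimal sketch of that normalization step, using a hypothetical `to_record` helper and made-up input strings rather than scraped values:

```python
def to_record(rank_text, title_text, rating_text):
    """Strip stray whitespace and convert the numeric fields to int and float."""
    return {
        "rank": int(rank_text.strip()),
        "title": title_text.strip(),
        "rating": float(rating_text.strip()),
    }

record = to_record(" 1 ", "Example Film\n", "9.7")
print(record)
```

Keeping the conversion in one place means a malformed value raises `ValueError` at parse time instead of surfacing later as a confusing plot.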

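3. Looping over all pages

The Top 250 spans 10 pages of 25 films each. Loop over the start offsets, build each URL, and collect the data from every page: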

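def main():
    """Scrape all 10 pages and return a list of film dicts."""
    all_data = []
    # start takes the values 0, 25, 50, ..., 225
    for start in range(0, 250, 25):
        url = f'https://movie.douban.com/top250?start={start}'
        html = get_page(url)
        if html:  # skip pages that failed to download
            all_data.extend(parse_page(html))
    return all_data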
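
The loop above requests all ten pages back to back and gives no sign when a page is skipped. A gentler variant pauses between requests and records which offsets failed. This is a sketch under assumptions: `crawl`, its `delay` parameter, and the injected `fetch` and `parse` callables are hypothetical stand-ins for `get_page` and `parse_page`, used so the logic can run offline.

```python
import time

def crawl(fetch, parse, total=250, page_size=25, delay=1.0):
    """Fetch every page in order, pausing between requests.

    Returns a tuple (records, failed_offsets).
    """
    records, failed = [], []
    for start in range(0, total, page_size):
        url = f"https://movie.douban.com/top250?start={start}"
        html = fetch(url)
        if html is None:
            failed.append(start)   # remember the page so it can be retried
        else:
            records.extend(parse(html))
        time.sleep(delay)          # be polite to the server
    return records, failed

# Offline demonstration with fake fetch and parse functions.
def fake_fetch(url):
    return None if url.endswith("?start=50") else url

def fake_parse(html):
    yield {"source": html}

records, failed = crawl(fake_fetch, fake_parse, delay=0)
print(len(records), failed)
```

With the real functions this would be called as `crawl(get_page, parse_page)`.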

III. Data Cleaning and Storage

Save the scraped data to a CSV file for later analysis:

import pandas as pd

data = main()
df = pd.DataFrame(data)
# utf-8-sig adds a BOM so that Excel displays the Chinese titles correctly
df.to_csv('douban_top250.csv', index=False, encoding='utf-8-sig')
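
The `utf-8-sig` encoding writes a byte order mark (BOM) at the start of the file, which helps spreadsheet programs such as Excel recognize the file as UTF-8. The sketch below shows the effect with the standard `csv` module instead of pandas; the row is made-up sample data and the temporary file path exists only for illustration.

```python
import csv
import os
import tempfile

rows = [{"rank": 1, "title": "示例电影", "rating": 9.7}]  # made-up sample row

path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["rank", "title", "rating"])
    writer.writeheader()
    writer.writerows(rows)

# The first three bytes of the file are the UTF-8 BOM.
with open(path, "rb") as f:
    first_bytes = f.read(3)
print(first_bytes)

# Reading with the same encoding strips the BOM again.
with open(path, newline="", encoding="utf-8-sig") as f:
    loaded = list(csv.DictReader(f))
print(loaded[0]["title"])
```

Note that values read back from a CSV are strings again, so numeric columns need converting once more if the file is loaded without pandas.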

IV. Data Analysis and Visualization

1. Rating distribution

Use matplotlib to draw a histogram of the ratings:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
# 20 bins across the rating range; black edges separate the bars
plt.hist(df['rating'], bins=20, edgecolor='black')
plt.title('Douban Top 250 Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Number of films')
plt.show()
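
A histogram is simply a count of values per bin, so the same tally can be sketched in plain Python to sanity-check what the chart should show. The ratings below are made-up sample values, not scraped results, and using one bin per distinct rating is an assumption that suits one-decimal scores.

```python
from collections import Counter

sample_ratings = [9.7, 9.6, 9.5, 9.5, 9.4, 9.4, 9.4, 9.3]  # made-up values

counts = Counter(sample_ratings)
for rating in sorted(counts, reverse=True):
    # One '#' per film with this rating
    print(f"{rating:.1f} {'#' * counts[rating]}")
```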

2. Rating trend

Plot rating against rank to see how the score changes down the list:

plt.figure(figsize=(12, 6))
# Cast rank to int so the x-axis is numeric rather than categorical
plt.plot(df['rank'].astype(int), df['rating'], marker='o')
plt.title('Douban Top 250 Rating Trend by Rank')
plt.xlabel('Rank')
plt.ylabel('Rating')
plt.grid(True)
plt.show()
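
A line through 250 individual points can be noisy, so one way to read the trend is to average the rating over blocks of consecutive ranks. A minimal sketch, assuming blocks of 25 ranks (one list page each); `block_means` is a hypothetical helper and the input is synthetic data generated for illustration, not scraped ratings.

```python
from statistics import mean

def block_means(ratings, block_size=25):
    """Average ratings over consecutive blocks; input must be ordered by rank."""
    return [
        round(mean(ratings[i:i + block_size]), 2)
        for i in range(0, len(ratings), block_size)
    ]

# Synthetic input: 50 films, the first 25 rated 9.5 and the next 25 rated 9.0.
synthetic = [9.5] * 25 + [9.0] * 25
print(block_means(synthetic))
```

With the scraped data this could be called as `block_means(df.sort_values('rank')['rating'].tolist())`, giving ten values that are easier to compare than the raw line.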