
2026-05-25 07:22:38 +0800 CST

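Scrapling in Depth: An Adaptive Scraping Framework That Keeps Your Crawlers Alive on the Modern Web (2026 Complete Guide)

53.6K stars and a regular on GitHub Trending. Scrapling is not yet another BeautifulSoup wrapper; it rethinks the traditional scraping workflow end to end: adaptive element location, built-in anti-bot evasion, AI integration, and a Spider framework in one package. This article goes from architecture to hands-on practice so you can get a solid command of the framework.

1. Why You Need Scrapling

If you have written scrapers before, these situations will sound familiar:

A scraper that ran fine overnight breaks the next day because the site was redesigned and every selector fails
Cloudflare Turnstile challenges block your requests
Dynamically rendered pages need Selenium, but Selenium is slow and heavy
Scrapy means scaffolding a whole project, which feels like overkill for simple jobs
An AI agent that receives a web page has to clean the HTML itself, burning tokens

How do traditional approaches cope? BeautifulSoup with hard-coded selectors breaks as soon as the site changes; Scrapy needs scaffolding that only pays off in large projects; Selenium browser automation is costly in both memory and time; hand-rolled anti-bot workarounds lock you into an endless arms race with risk-control systems.

Scrapling's core idea: websites change, but your scraper should not stop working.

It bundles answers to these problems into a single toolkit:

Capability	Traditional approach	Scrapling
Static pages	requests + BeautifulSoup	`Fetcher.get()`
Dynamic rendering	Selenium/Playwright	`PlaywrightFetcher` / `StealthyFetcher`
Anti-bot evasion	Manually configured proxies and UA pools	Built-in StealthyFetcher + Camoufox
Site redesigns	Rewrite the selectors	Adaptive element location (Adaptive Scraping)
Large-scale crawling	Scrapy	Built-in Spider framework
AI integration	Manual HTML cleaning	MCP Server + structured output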

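2. Architecture Overview: Scrapling's Three Layers

Scrapling's architecture divides cleanly into three layers:

┌────────────────────────────────────────────────────────────────┐
│              Spider framework (scheduling layer)               │
│  Concurrency control / dedup / pause-resume / priority queue   │
├────────────────────────────────────────────────────────────────┤
│                Fetcher family (fetching layer)                 │
│  Fetcher  │  PlaywrightFetcher     │  StealthyFetcher          │
│  (HTTP)   │  (browser automation)  │  (stealth + anti-bot)     │
├────────────────────────────────────────────────────────────────┤
│                 Parsing layer (AdaptiveParser)                 │
│  CSS/XPath selectors  │  adaptive matching  │  AI output       │
└────────────────────────────────────────────────────────────────┘

2.1 The Fetching Layer: How the Three Fetchers Divide the Work

Scrapling provides three fetchers that cover scenarios from simple to complex:

Fetcher: plain HTTP requests, the fastest and lightest option. It resembles requests, but the response it returns can be parsed directly: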

from scrapling.fetchers import Fetcher

page = Fetcher.get("https://quotes.toscrape.com/")
for quote in page.css(".quote"):
    text = quote.css(".text::text").get()
    author = quote.css(".author::text").get()
    print(f"{author}: {text}")

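PlaywrightFetcher: Playwright-based browser automation, for pages rendered with JavaScript:

from scrapling.fetchers import PlaywrightFetcher

# Install the browsers first:
# pip install "scrapling[fetchers]"
# scrapling install

page = PlaywrightFetcher.fetch("https://spa-example.com/")
# Wait for a specific element to load
page = PlaywrightFetcher.fetch(
    "https://spa-example.com/",
    wait_selector=".content-loaded",
    timeout=10000,  # milliseconds
)

StealthyFetcher: a stealth browser with anti-bot evasion built in. It is based on Camoufox (a stealth-oriented Firefox build) and is designed to get past mainstream bot protection such as Cloudflare Turnstile, DataDome, and PerimeterX: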

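from scrapling.fetchers import StealthyFetcher

# Handles the Cloudflare check automatically
page = StealthyFetcher.fetch("https://protected-site.com/")
# Built in: fingerprint masking, behavior simulation, proxy rotation

Choosing among the three fetchers is straightforward:

Can a plain HTTP request return the data you need?
  ├─ Yes → Fetcher
  └─ No → Is JavaScript rendering required?
              ├─ Yes → Is the site protected against bots?
              │        ├─ Yes → StealthyFetcher
              │        └─ No → PlaywrightFetcher
              │
              └─ Not sure → Try Fetcher first, then escalate on failure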
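
The last branch of the tree (start with the lightest fetcher and escalate only on failure) can be sketched without any Scrapling calls. Everything named here is hypothetical and for illustration only: `fetch_with_escalation`, the `ladder` list, and the stand-in fetchers `plain_http` and `browser`. In a real project each ladder entry would wrap one of the three fetchers.

```python
from typing import Callable, Optional, Sequence, Tuple

# A "fetcher" in this sketch is any callable: url -> HTML string (or None on failure).
FetchFn = Callable[[str], Optional[str]]


def fetch_with_escalation(
    url: str,
    ladder: Sequence[Tuple[str, FetchFn]],
    looks_complete: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try fetchers from cheapest to heaviest; return (name, html) of the first usable result."""
    for name, fetch in ladder:
        try:
            html = fetch(url)
        except Exception:
            html = None  # any error means "move on to the next tier"
        if html and looks_complete(html):
            return name, html
    return None, None


# Stand-in fetchers so the sketch runs offline (assumptions, not real network calls).
def plain_http(url: str) -> str:
    # Simulates a single-page app whose content is missing without JavaScript.
    return "<html><body><div id='app'></div></body></html>"


def browser(url: str) -> str:
    # Simulates the same page after JavaScript rendering.
    return "<html><body><div id='app'><h1>Loaded</h1></div></body></html>"


ladder = [("http", plain_http), ("browser", browser)]
name, html = fetch_with_escalation(
    "https://example.com/", ladder, looks_complete=lambda h: "<h1>" in h
)
print(name)  # browser
```

The `looks_complete` check is the important design choice: an HTTP 200 with an empty application shell still counts as a failure, which is what triggers the escalation.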

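2.2 The Parsing Layer: Familiar Like Scrapy, but More Resilient to Change

Scrapling's selector syntax is almost identical to Scrapy/PyQuery, so there is very little to learn if you already know either:

# CSS selectors
page.css("h1::text").get()           # first match
page.css("a::attr(href)").getall()   # all matches

# XPath
page.xpath("//h1/text()").get()
page.xpath("//a/@href").getall()

# Chaining
page.css(".product-list").css(".item::text").getall()

The key difference is that Scrapling's parser has a "memory". When you locate an element with css() or xpath(), Scrapling records a fingerprint of that element's characteristics (tag name, attributes, relative position, text pattern, and so on). When the page is later redesigned and the selector stops matching, Scrapling uses those characteristics to locate the element again.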

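2.3 Adaptive Scraping: Scrapling's Killer Feature

This is the core capability that sets Scrapling apart from other scraping frameworks.

Scenario: you scrape a price with the selector .price-box .amount, and after a redesign the class name becomes price-value. A traditional scraper either raises an error or silently returns nothing. What does Scrapling do?

from scrapling.fetchers import Fetcher

# First crawl: the selector matches normally
page = Fetcher.get("https://shop.example.com/product/123")
price_element = page.css(".price-box .amount")
# Scrapling records this element's "fingerprint"

# After the redesign, the selector no longer matches
page = Fetcher.get("https://shop.example.com/product/123")
# Traditional approach: nothing is found
price_element_old = page.css(".price-box .amount")  # no match

# Adaptive approach: Scrapling re-matches based on the fingerprint
price_element = page.find_by_signature(
    original_selector=".price-box .amount",
    # matched using the previously recorded characteristics
)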

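The core algorithm behind adaptive matching:

Feature extraction: when an element is first located, extract features along several dimensions: tag type, attribute combination, sibling relationships, parent container structure, and text content pattern
Similarity scoring: after a redesign, compute a similarity score against the original features for every candidate element
Threshold filtering: accept a match only if it exceeds the confidence threshold, to avoid false matches
Incremental learning: each successful match updates the fingerprint, so matching becomes more accurate with use

# Finer-grained adaptive control
from scrapling.core import AdaptiveParser
from scrapling.fetchers import Fetcher

parser = AdaptiveParser(
    similarity_threshold=0.7,  # similarity threshold; higher is stricter
    learning_rate=0.3,          # how strongly each match updates the fingerprint
    max_candidates=50,          # maximum number of candidate elements
)

# A complete fetch with adaptive matching
url = "https://shop.example.com/product/123"
page = Fetcher.get(url)
elements = parser.adaptive_find(
    page,
    signature={
        "tag": "span",
        "parent_tag": "div",
        "attributes": {"class": "price"},
        "text_pattern": r"\$\d+\.\d{2}",
        "sibling_count": 3,
    }
)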
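
To make the scoring and threshold steps concrete, here is a minimal, library-independent sketch. It is not Scrapling's implementation: the `similarity` and `best_match` functions, the equal weighting of the checks, and the `classes` field are assumptions chosen for illustration, and candidates are plain dictionaries rather than parsed elements.

```python
import re


def similarity(signature: dict, candidate: dict) -> float:
    """Score a candidate against a stored signature, from 0.0 (no match) to 1.0."""
    checks = [
        signature["tag"] == candidate.get("tag"),
        signature["parent_tag"] == candidate.get("parent_tag"),
        bool(re.fullmatch(signature["text_pattern"], candidate.get("text", ""))),
    ]
    # Class names are compared with Jaccard overlap so a partial rename still counts.
    sig_classes = set(signature["classes"])
    cand_classes = set(candidate.get("classes", []))
    union = sig_classes | cand_classes
    class_overlap = len(sig_classes & cand_classes) / len(union) if union else 1.0
    # Equal weights: three boolean checks plus the class overlap.
    return (sum(checks) + class_overlap) / (len(checks) + 1)


def best_match(signature: dict, candidates: list, threshold: float = 0.7):
    """Return the highest-scoring candidate, or None if it falls below the threshold."""
    scored = [(similarity(signature, c), c) for c in candidates]
    score, winner = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return winner if score >= threshold else None


signature = {
    "tag": "span",
    "parent_tag": "div",
    "classes": ["amount"],
    "text_pattern": r"\$\d+\.\d{2}",
}
candidates = [
    # The price after the redesign: class renamed, everything else unchanged.
    {"tag": "span", "parent_tag": "div", "classes": ["price-value"], "text": "$19.99"},
    # An unrelated element that happens to share the tag.
    {"tag": "span", "parent_tag": "li", "classes": ["nav-item"], "text": "Home"},
]

match = best_match(signature, candidates)
print(match["text"] if match else None)  # $19.99
```

The renamed price element scores 0.75 (tag, parent, and text pattern agree; the class does not), which clears the 0.7 threshold, while the navigation item scores 0.25 and is rejected.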

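3. Hands-On: Building a Complete Scraper Project from Scratch

3.1 Environment Setup

# Basic install
pip install scrapling

# Full install (including browser automation)
pip install "scrapling[all]"

# Install the browser engines
scrapling install

# Verify the installation
python -c "from scrapling.fetchers import Fetcher; print(Fetcher.get('https://httpbin.org/get').status)"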

3.2 Example 1: E-Commerce Price Monitoring (Adaptive + Scheduled)

"""E-commerce price monitoring: adaptive scraping with Scrapling"""
import json
import re
import time
from scrapling.fetchers import Fetcher

class PriceMonitor:
    """Product price monitor with adaptive element location"""

    def __init__(self, url, selector_cache_file="selector_cache.json"):
        self.url = url
        self.cache_file = selector_cache_file
        self.signatures = self._load_signatures()

    def _load_signatures(self):
        """Load previously saved element signatures"""
        try:
            with open(self.cache_file, "r") as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save_signatures(self):
        with open(self.cache_file, "w") as f:
            json.dump(self.signatures, f, indent=2)

    def extract_price(self, page):
        """Extract the price, falling back to adaptive matching after a redesign"""
        # Try the known selectors first
        selectors = [
            ".price-box .amount",
            "[data-price]",
            ".product-price",
            "span.price",
        ]

        for sel in selectors:
            element = page.css(sel)
            if element:
                # Record the signature of the selector that worked
                self.signatures["price"] = {
                    "selector": sel,
                    "tag": element[0].tag,
                    "text_sample": element[0].text[:50] if element[0].text else "",
                }
                self._save_signatures()
                return self._parse_price(element[0].text)

        # Every selector failed: fall back to adaptive matching
        if "price" in self.signatures:
            element = page.find_by_signature(
                original_selector=self.signatures["price"]["selector"],
            )
            if element:
                return self._parse_price(element.text)

        return None

    def _parse_price(self, text):
        """Parse the numeric price out of a text string"""
        if not text:
            return None
        # Strip thousands separators, then take the first number
        match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
        return float(match.group()) if match else None

    def run(self, interval=3600):
        """Poll the page on a fixed interval (seconds)"""
        while True:
            page = Fetcher.get(self.url)
            price = self.extract_price(page)
            print(f"[{time.strftime('%Y-%m-%d %H:%M')}] price: ¥{price}")
            time.sleep(interval)

# Usage
monitor = PriceMonitor("https://shop.example.com/product/123")
monitor.run()

3.3 Example 2: A News Crawler That Gets Past Cloudflare Protection

"""A practical approach to sites behind Cloudflare Turnstile"""
from scrapling.fetchers import StealthyFetcher
import json
import time

class CloudflareBypassSpider:
    """Crawler for a news site protected by Cloudflare"""

    def __init__(self, base_url):
        self.base_url = base_url
        self.visited = set()
        self.results = []

    def fetch_with_retry(self, url, max_retries=3):
        """Stealth fetch with retries"""
        for attempt in range(max_retries):
            try:
                page = StealthyFetcher.fetch(
                    url,
                    headless=True,           # headless mode
                    disable_resources=True,   # skip images/CSS/fonts for speed
                    timeout=30000,            # 30-second timeout
                )
                # Check whether we are still on the Cloudflare challenge page
                if "challenge" in page.css("title::text").get(default="").lower():
                    print(f"  still on the challenge page, retry {attempt+1}/{max_retries}")
                    time.sleep(5 * (attempt + 1))  # linear backoff: 5s, 10s, 15s
                    continue
                return page
            except Exception as e:
                print(f"  request failed: {e}, retry {attempt+1}/{max_retries}")
                time.sleep(3)
        return None

    def parse_article(self, page):
        """Parse the article content"""
        article = {
            "title": page.css("h1::text").get(),
            "author": page.css(".author-name::text").get(),
            "content": "\n".join(
                p.text for p in page.css(".article-content p")
            ),
            "date": page.css("time::attr(datetime)").get(),
            "tags": page.css(".tag-list a::text").getall(),
        }
        # Drop empty fields
        return {k: v for k, v in article.items() if v}

    def crawl(self, start_path="/news"):
        page = self.fetch_with_retry(f"{self.base_url}{start_path}")
        if not page:
            return []

        # Extract the article links
        article_links = page.css(".article-list a::attr(href)").getall()

        for link in article_links:
            full_url = f"{self.base_url}{link}" if link.startswith("/") else link
            if full_url in self.visited:
                continue
            self.visited.add(full_url)

            article_page = self.fetch_with_retry(full_url)
            if article_page:
                article = self.parse_article(article_page)
                article["url"] = full_url
                self.results.append(article)
                print(f"  fetched: {article.get('title', 'untitled')}")
                time.sleep(2)  # be polite to the server

        return self.results

# Usage
spider = CloudflareBypassSpider("https://news.example.com")
articles = spider.crawl()
print(f"Fetched {len(articles)} articles in total")
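
The retry loop above sleeps for `5 * (attempt + 1)` seconds, which grows linearly. If you want true exponential backoff, the schedule could look like the sketch below. The `backoff_delays` helper and its `base`, `cap`, and `jitter` parameters are hypothetical names for illustration; they are not part of Scrapling.

```python
import random


def backoff_delays(max_retries, base=5.0, cap=60.0, jitter=0.0, rng=random.random):
    """Return the sleep time (seconds) before each retry, doubling every attempt.

    jitter=0.25 adds up to 25% random extra delay, which helps avoid many
    workers retrying in lockstep; jitter=0.0 keeps the schedule deterministic.
    """
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))  # 5, 10, 20, ... capped at `cap`
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays


linear = [5 * (attempt + 1) for attempt in range(4)]
exponential = backoff_delays(4)

print(linear)       # [5, 10, 15, 20]
print(exponential)  # [5.0, 10.0, 20.0, 40.0]
```

To use it, replace the fixed `time.sleep(...)` call with `time.sleep(delays[attempt])`, where `delays = backoff_delays(max_retries, jitter=0.25)` is computed once before the loop.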

3.4 Example 3: The Spider Framework for Large-Scale Structured Crawling

Scrapling includes a Spider framework that targets Scrapy's core capabilities with a more compact API:

"""Scrapling Spider framework: a large-scale crawl"""
from scrapling.spiders import Spider, Request

class ECommerceSpider(Spider):
    """Site-wide e-commerce crawler"""

    name = "ecommerce"

    # Concurrency and scheduling settings
    custom_settings = {
        "concurrent_requests": 8,       # number of concurrent requests
        "download_delay": 1.0,          # delay between requests (seconds)
        "retry_times": 3,              # number of retries
        "retry_http_codes": [429, 500, 502, 503],
        "user_agent_rotation": True,   # rotate the User-Agent
    }

    def start_requests(self):
        """Generate the initial requests"""
        for category in ["electronics", "books", "clothing"]:
            yield Request(
                f"https://shop.example.com/{category}",
                callback=self.parse_category,
                meta={"category": category},
            )

    def parse_category(self, response):
        """Parse a category page and extract the product links"""
        category = response.meta["category"]

        # Extract the product links
        products = response.css(".product-card a::attr(href)").getall()
        for url in products:
            yield Request(
                url,
                callback=self.parse_product,
                meta={"category": category},
            )

        # Pagination
        next_page = response.css(".pagination .next::attr(href)").get()
        if next_page:
            yield Request(
                next_page,
                callback=self.parse_category,
                meta={"category": category},
            )

    def parse_product(self, response):
        """Parse a product detail page"""
        yield {
            "url": response.url,
            "category": response.meta["category"],
            "title": response.css("h1.product-title::text").get(),
            "price": response.css("[data-price]::text").get(),
            "description": response.css(".description::text").get(),
            "specs": {
                row.css("th::text").get(): row.css("td::text").get()
                for row in response.css(".spec-table tr")
            },
            "images": response.css(".gallery img::attr(src)").getall(),
            "reviews_count": response.css(".review-count::text").re_first(r'\d+'),
        }

# Run
if __name__ == "__main__":
    spider = ECommerceSpider()
    spider.run()

Core features of the Spider framework:

Feature	Description
Concurrency control	Asynchronous scheduling with configurable concurrency and request delay
Automatic deduplication	Request deduplication based on URL fingerprints, so pages are not fetched twice
Pause/resume	Checkpointed crawling; progress survives an interruption
Priority queue	Important pages are fetched first (for example, detail pages before list pages)
Middleware	Pluggable request/response middleware for custom processing
Pipelines	Data-processing pipelines for cleaning, validation, and storage
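
Two rows of the table, URL-fingerprint deduplication and the priority queue, are easy to picture with a standard-library sketch. This is not the Spider framework's code; the `url_fingerprint` function and the `Frontier` class are hypothetical names, and the canonicalization rules (lower-cased host, sorted query string, dropped fragment) are assumptions about what a reasonable fingerprint might ignore.

```python
import hashlib
import heapq
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_fingerprint(url: str) -> str:
    """Hash a canonical form of the URL so trivially different spellings collide."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))  # parameter order is ignored
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )  # the empty last field drops the #fragment
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class Frontier:
    """A crawl queue that skips seen URLs and pops the highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal priorities pop in insertion order

    def push(self, url: str, priority: int = 0) -> bool:
        fp = url_fingerprint(url)
        if fp in self._seen:
            return False  # duplicate request, not queued
        self._seen.add(fp)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


frontier = Frontier()
frontier.push("https://shop.example.com/list?page=1&sort=asc", priority=1)
frontier.push("https://shop.example.com/product/42", priority=10)
# Same page as the first URL: different host case, parameter order, and fragment.
was_added = frontier.push("https://SHOP.example.com/list?sort=asc&page=1#top", priority=1)

order = [frontier.pop() for _ in range(len(frontier))]
print(was_added)  # False
print(order)      # the product page comes out before the list page
```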

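4. Advanced: Scrapling and AI Working Together

4.1 MCP Server: Letting AI Agents Consume Web Data Efficiently

Scrapling ships with an MCP (Model Context Protocol) Server, another feature that distinguishes it from traditional scraping frameworks. When an AI agent fetches raw HTML directly, the page often carries a lot of irrelevant content (navigation bars, ads, footers), which consumes many tokens and lowers comprehension accuracy.

The MCP Server workflow:

The AI agent issues a page-fetch request
    ↓
The MCP Server receives it and fetches the page with Scrapling
    ↓
Automatic cleaning: navigation, ads, footers, and other noise are removed
    ↓
Structured output: clean text plus metadata is returned
    ↓
The AI agent works on a clean context, saving tokens and improving accuracy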
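
The cleaning step in the middle of this workflow can be illustrated with Python's standard `html.parser`. This is only a sketch of the idea, not what the MCP Server does internally: the `NOISE_TAGS` set, the `MainTextExtractor` class, and the `clean_html` function are hypothetical, and real pages usually need stronger heuristics than a fixed tag list.

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as noise (an assumption for this sketch).
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect visible text, skipping everything nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0  # > 0 while inside at least one noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._noise_depth == 0:
            self.chunks.append(text)


def clean_html(html: str) -> str:
    """Return the page's main text, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


raw = (
    "<html><body>"
    "<nav><a href='/'>Home</a><a href='/about'>About</a></nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Copyright 2026</footer>"
    "<script>track()</script>"
    "</body></html>"
)
cleaned = clean_html(raw)
print(cleaned)
# Title
# Body text.
```

Here the cleaned text is a small fraction of the raw markup, which is where the token savings for the agent come from.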

Start the MCP Server:

# Install the MCP dependencies
pip install "scrapling[mcp]"

# Start the server
scrapling-mcp --port 8080

Configure it in Claude Code or another AI agent:

{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling-mcp",
      "args": ["--port", "8080"]
    }
  }
}

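4.2 AI-Assisted Selector Generation

Scrapling can also have an AI model generate selectors for you, which helps when you face an unfamiliar page and are not sure how to write them:

from scrapling.ai import AISelector
from scrapling.fetchers import Fetcher

# Configure the model used to generate selectors
selector = AISelector(
    model="gpt-4o",
    api_key="your-key",
)

# Describe what you want to extract and let the AI generate the selectors
result = selector.generate(
    url="https://shop.example.com/product/123",
    description="Extract the product name, price, and number of reviews",
)

print(result.selectors)
# Output looks something like:
# {
#   "name": "h1.product-title",
#   "price": "[data-price]",
#   "reviews": ".review-count"
# }

# Use the generated selectors directly
page = Fetcher.get("https://shop.example.com/product/123")
name = page.css(result.selectors["name"] + "::text").get()
price = page.css(result.selectors["price"] + "::text").get()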

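4.3 A Structured Data Extraction Pipeline

Combine adaptive parsing with AI-ready output to build an end-to-end extraction pipeline:

"""Structured data extraction pipeline: from web page to structured JSON"""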
from scrapling.fetchers import Fetcher
from scrapling.core import AdaptiveParser
from dataclasses import dataclass, asdict
import json

@dataclass
class JobListing:
    title: str
    company: str
    location: str
    salary: str
    tags: list
    url: str

class JobExtractor:
    def __init__(self):
        self.parser = AdaptiveParser(similarity_threshold=0.65)
        self.results = []
    
    def extract(self, url):
        page = Fetcher.get(url)
        cards = page.css(".job-card")
        
        for card in cards:
            job = JobListing(
                title=card.css(".job-title::text").get(default=""),
                company=card.css(".company-name::text").get(default=""),
                location=card.css(".location::text").get(default=""),
                salary=card.css(".salary::text").get(default="negotiable"),
                tags=card.css(".tag::text").getall(),
                url=card.css("a::attr(href)").get(default=""),
            )
            self.results.append(asdict(job))
        
        return self.results
    
    def save(self, filename="jobs.json"):
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(self.results, f, ensure_ascii=False, indent=2)

extractor = JobExtractor()
extractor.extract("https://jobs.example.com/python")
extractor.save()

5. Performance Optimization

5.1 Fetcher-Level Optimizations

from scrapling.fetchers import Fetcher, PlaywrightFetcher

# 1. Disable unnecessary resource loading (PlaywrightFetcher / StealthyFetcher)
page = PlaywrightFetcher.fetch(
    url,
    disable_resources=True,   # do not load images, CSS, or fonts
    # Equivalent in Playwright terms to calling
    # route.abort() for requests of type image, stylesheet, and font
)

# 2. Connection reuse with a session
session = Fetcher.session()  # reuses TCP connections
for url in url_list:
    page = session.get(url)
    # process the page...

# 3. Precompiled selectors
from scrapling.core import SelectorCompiler

compiled = SelectorCompiler.compile(".product-card .price::text")
# Reuse across many pages with the same structure to avoid re-parsing the selector
for page in pages:
    price = compiled.extract(page)

5.2 Tuning Concurrency in the Spider Framework

class OptimizedSpider(Spider):
    custom_settings = {
        # 1. Match concurrency to what the target site can handle
        "concurrent_requests": 4,      # conservative: small sites
        # "concurrent_requests": 16,   # aggressive: large sites behind a CDN

        # 2. Smart delay: stay fast on 2xx, back off on 429/5xx
        "download_delay": 0.5,
        "auto_throttle": {
            "enabled": True,
            "target_concurrency": 4.0,
            "max_delay": 30.0,
        },

        # 3. Memory: release processed responses promptly
        "max_cached_responses": 1000,

        # 4. Checkpointed crawling
        "job_storage": "sqlite:///crawl_state.db",
        "persist": True,              # resume after an interruption
    }

    # 5. Custom middleware for request-level tuning
    custom_middleware = [
        "myproject.middleware.ProxyRotationMiddleware",
        "myproject.middleware.CookieMiddleware",
    ]
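
The "stay fast on 2xx, back off on 429/5xx" behavior in item 2 can be sketched as a small feedback loop. This is an illustration of the idea only, not how Scrapling's throttling is implemented: the `AdaptiveDelay` class and its `backoff_factor` and `recovery_factor` parameters are hypothetical, while `base_delay` and `max_delay` mirror the `download_delay` and `max_delay` settings shown above.

```python
class AdaptiveDelay:
    """Adjust the per-request delay from the status code of the last response."""

    def __init__(self, base_delay=0.5, max_delay=30.0,
                 backoff_factor=2.0, recovery_factor=0.9):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.backoff_factor = backoff_factor    # multiply the delay on 429/5xx
        self.recovery_factor = recovery_factor  # shrink the delay on 2xx
        self.delay = base_delay

    def update(self, status: int) -> float:
        if status == 429 or status >= 500:
            # The server is pushing back: slow down, but never beyond max_delay.
            self.delay = min(self.max_delay, self.delay * self.backoff_factor)
        elif 200 <= status < 300:
            # Healthy response: drift back toward the base delay.
            self.delay = max(self.base_delay, self.delay * self.recovery_factor)
        # Other statuses (3xx, 4xx except 429) leave the delay unchanged.
        return self.delay


throttle = AdaptiveDelay()
history = [throttle.update(status) for status in (200, 429, 503, 200)]
print(history)  # [0.5, 1.0, 2.0, 1.8]
```

Backing off multiplicatively and recovering slowly is a deliberate asymmetry: one rate-limit response is a stronger signal than one success.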

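5.3 Anti-Bot Evasion Strategies in Depth

StealthyFetcher does far more than "swap in a different browser":

"""Anti-bot evasion strategies in detail"""
from scrapling.fetchers import StealthyFetcher

# Option 1: proxy rotation
page = StealthyFetcher.fetch(
    url,
    proxy="http://proxy-pool.example.com:8080",
    proxy_rotation=True,       # rotate automatically
)

# Option 2: browser fingerprint masking
page = StealthyFetcher.fetch(
    url,
    # Camoufox automatically randomizes these fingerprints:
    # - WebGL renderer information
    # - Canvas fingerprint
    # - AudioContext fingerprint
    # - Navigator properties (platform, hardwareConcurrency, etc.)
    # - Screen resolution
    # - Time zone and language
    fingerprint_mode="random",  # a random fingerprint for every request
)

# Option 3: human behavior simulation
page = StealthyFetcher.fetch(
    url,
    humanize=True,             # simulate human browsing behavior
    # Includes random mouse movement, page scrolling, and random pauses
    # Waits until the page "looks natural" before acting
)

# Option 4: several layers combined
page = StealthyFetcher.fetch(
    url,
    proxy="socks5://proxy:1080",
    fingerprint_mode="random",
    humanize=True,
    disable_resources=True,    # skip images (faster, smaller detection surface)
    wait_selector=".content",  # wait for the key content to load
    timeout=60000,
)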

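How the evasion works:

The dimensions Cloudflare Turnstile inspects, and how Scrapling responds to each:

Detection dimension	What Turnstile checks	How Scrapling gets past it
TLS fingerprint	JA3/JA4 hash	Camoufox uses the Firefox TLS stack, so the fingerprint matches a real browser
HTTP/2 fingerprint	SETTINGS frames, priorities	Native browser behavior rather than a Python HTTP client such as httpx
JavaScript environment	navigator, screen, and similar objects	Randomized inside Camoufox itself, not through injected JavaScript
Behavior analysis	Mouse trajectories, scroll patterns	humanize mode simulates real interaction
IP reputation	Whether the IP is on a blocklist	Proxy rotation plus residential proxy support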

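5.4 Memory and CPU Optimization

"""Resource optimization strategies for large-scale crawls"""

# 1. Stream results instead of accumulating them all before storing
from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider

class StreamSpider(Spider):
    def parse_product(self, response):
        item = {
            "title": response.css("h1::text").get(),
            # ...
        }
        # Write through the pipeline right away rather than appending to an in-memory list
        self.pipeline.process(item)
        yield item  # a generator yields items one at a time instead of buffering them

# 2. Cap the cache size
class LeanSpider(Spider):
    custom_settings = {
        "max_cached_responses": 500,   # limit the response cache
        "dedup_method": "bloom",        # Bloom-filter dedup with a fixed memory footprint
    }

# 3. Selective parsing: only process the part you need
page = Fetcher.get(url, parse_only=".main-content")
# Skips navigation, sidebars, and footers; only the main content area is parsed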
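
The "bloom" dedup option above refers to a Bloom filter, which trades a small false-positive rate for a memory footprint that does not grow with the number of URLs. The sketch below shows the data structure itself using only the standard library; the `BloomFilter` class, its default sizes, and the double-hashing scheme are assumptions for illustration, not Scrapling's internals.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)  # about 128 KiB at the default size

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never 0
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("https://shop.example.com/product/1")

print("https://shop.example.com/product/1" in seen)  # True
print("https://shop.example.com/product/2" in seen)  # False
```

A false positive means a never-visited URL is skipped, so size the filter generously for crawls where missing a page matters; a plain `set` is simpler when the URL count is small.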

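6. Scrapling vs Scrapy vs BeautifulSoup: Choosing a Tool

Dimension	Scrapling	Scrapy	BeautifulSoup
Learning curve	Low, intuitive API	Medium to high, you need to understand the whole framework	Very low, usable in five minutes
Static pages	✅ Fetcher	✅ built-in downloader	✅ with requests
Dynamic rendering	✅ PlaywrightFetcher	⚠️ needs scrapy-playwright	❌ needs Selenium
Anti-bot evasion	✅ StealthyFetcher	⚠️ needs scrapy-fake-useragent and similar add-ons	❌ fully manual
Adaptive location	✅ core capability	❌ none	❌ none
Large-scale crawling	✅ Spider framework	✅ core strength	❌ not suitable
Pause/resume	✅ built in	✅ via JOBDIR	❌ none
AI integration	✅ MCP Server	❌ none	❌ none
Ecosystem/plugins	🔄 growing	✅ very rich	✅ simple and sufficient
Deployment	Docker / CLI	Scrapyd / Docker	Plain script

Recommendations:

Small scripts and one-off scrapes: BeautifulSoup is still the quickest choice
Large production crawlers with complex scheduling: Scrapy's ecosystem and middleware system remain the most mature
Anti-bot evasion, dynamic rendering, adaptive location: Scrapling is the framework most worth investing in for 2026
AI agent scenarios: of the three, only Scrapling provides an MCP Server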

7. Production Deployment

7.1 Docker Deployment

# Dockerfile
FROM python:3.12-slim

# Install Camoufox dependencies
RUN apt-get update && apt-get install -y \
    libgtk-3-0 libdbus-glib-1-2 libxt6 libasound2 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir "scrapling[all]"

COPY . .
# Install the browsers
RUN scrapling install

CMD ["python", "spider.py"]

# Build and run
docker build -t scrapling-spider .
docker run -d \
  -e PROXY_URL=http://proxy:8080 \
  -v $(pwd)/data:/app/output \
  scrapling-spider

7.2 Scheduled Task Integration

"""Scheduled crawling with Celery"""
from celery import Celery
from scrapling.fetchers import StealthyFetcher

app = Celery("crawler", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3)
def crawl_product(self, url):
    """Fetch product information asynchronously"""
    try:
        page = StealthyFetcher.fetch(url, headless=True)
        return {
            "title": page.css("h1::text").get(),
            "price": page.css("[data-price]::text").get(),
            "url": url,
        }
    except Exception as exc:
        # Retry after 60 seconds, up to max_retries times
        raise self.retry(exc=exc, countdown=60)

# Periodic task configuration
app.conf.beat_schedule = {
    "daily-price-check": {
        "task": "crawler.crawl_product",
        "schedule": 86400,  # once a day, in seconds
        "args": ("https://shop.example.com/product/123",),
    },
}

7.3 Monitoring and Alerting

"""Scraper health monitoring"""
import logging
import time
from scrapling.spiders import Spider

class MonitoredSpider(Spider):
    name = "monitored"

    # Metrics
    stats = {
        "total_requests": 0,
        "success_count": 0,
        "failed_count": 0,
        "adaptive_matches": 0,
        "start_time": None,
    }

    def start_requests(self):
        self.stats["start_time"] = time.time()
        yield from super().start_requests()

    def parse(self, response):
        self.stats["total_requests"] += 1
        if response.status == 200:
            self.stats["success_count"] += 1
        else:
            self.stats["failed_count"] += 1
        # ... normal parsing logic

    def closed(self, reason):
        duration = time.time() - self.stats["start_time"]
        success_rate = self.stats["success_count"] / max(self.stats["total_requests"], 1)

        logging.info(f"""
        Crawl report:
        - Total requests: {self.stats['total_requests']}
        - Success rate: {success_rate:.1%}
        - Adaptive matches: {self.stats['adaptive_matches']}
        - Duration: {duration:.1f}s
        - Reason: {reason}
        """)

        # Alert when the success rate drops below 80%
        if success_rate < 0.8:
            # alert() stands in for your own notification hook (email, chat webhook, ...)
            self.alert(f"Scraper success rate dropped to {success_rate:.1%}")

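8. FAQ and Best Practices

Q1: Can adaptive matching produce false matches?

Yes. The similarity threshold is the main tuning knob. Recommendations for production:

Start with a threshold of 0.7 and raise it gradually
Use a higher threshold (0.85+) for critical fields such as price and title
Constrain the match with a text_pattern
Add validation logic on critical paths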
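
The last three recommendations can be combined into a single guard that runs on every adaptively matched price. This is a minimal sketch: the `validate_price` function, the `PRICE_RE` pattern, and the `max_price` bound are hypothetical choices for illustration, and `score` stands for whatever similarity score your matching step reports.

```python
import re
from typing import Optional

# Optional currency symbol, digits with optional thousands separators, up to 2 decimals.
PRICE_RE = re.compile(r"[¥$€]?\s*(\d[\d,]*(?:\.\d{1,2})?)")


def validate_price(text: str, score: float,
                   min_score: float = 0.85,
                   max_price: float = 1_000_000.0) -> Optional[float]:
    """Accept an adaptively matched price only if every check passes."""
    if score < min_score:
        return None  # critical field: require a high-confidence match
    match = PRICE_RE.fullmatch(text.strip())
    if not match:
        return None  # the text does not look like a price at all
    value = float(match.group(1).replace(",", ""))
    # Sanity bound: reject zero and implausibly large values.
    return value if 0 < value <= max_price else None


print(validate_price("$1,299.00", score=0.90))      # 1299.0
print(validate_price("Free shipping", score=0.95))  # None
print(validate_price("$19.99", score=0.70))         # None
```

Returning `None` rather than a best guess keeps a bad match from silently entering your data; the caller can log it and alert instead.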

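Q2: How does StealthyFetcher differ from using Playwright's incognito mode directly?

The essential difference is how deep the fingerprint masking goes. Playwright's incognito mode merely skips extensions and clears cookies; the TLS fingerprint, HTTP/2 behavior, and JavaScript environment still expose automation signals. StealthyFetcher is built on Camoufox, a stealth build modified at the Firefox source level, so its fingerprint is meant to look like that of a real user.

Q3: Can Scrapling replace Scrapy?

Not entirely. Scrapy's middleware ecosystem, signal system, and Item Pipeline design have more than ten years of production use behind them. Scrapling's Spider framework covers Scrapy's core functionality, but in extreme scenarios (thousands of requests per second, complex deduplication strategies, distributed crawling) Scrapy is still the more robust choice.

Recommendation: try Scrapling first for new projects. Existing Scrapy projects do not need to migrate, but they can bring in Scrapling components for specific modules such as anti-bot evasion and adaptive location.

Q4: How do I deal with CAPTCHAs?

StealthyFetcher can get past most Cloudflare Turnstile and DataDome challenges. Scrapling currently has no built-in solver for reCAPTCHA/hCaptcha, so consider the following:

Try StealthyFetcher with residential proxies first; many sites do not trigger a CAPTCHA when the visitor "looks like a real person"
Combine it with a third-party CAPTCHA service (2captcha, anticaptcha)
For sites that require login, inject cookies

"""Cookie injection for pages behind a login wall"""
from scrapling.fetchers import Fetcher, StealthyFetcher

# Option 1: inject cookies directly
page = Fetcher.get(
    "https://member-site.com/dashboard",
    cookies={
        "session_id": "xxx",
        "auth_token": "yyy",
    }
)

# Option 2: extract cookies from a browser session, then inject them
# Log in with StealthyFetcher first
login_page = StealthyFetcher.fetch(
    "https://member-site.com/login",
    humanize=True,
)
# Fill in the form, submit...
cookies = login_page.cookies  # cookies after login

# Later requests use the lightweight Fetcher plus the cookies
page = Fetcher.get(
    "https://member-site.com/data",
    cookies=cookies,
)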

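9. Looking Ahead

Scrapling is iterating quickly. Roadmap directions worth watching:

Distributed support: the Spider framework currently runs on a single machine, and distributed scheduling is planned
More anti-bot capabilities: deeper evasion for PerimeterX and Akamai Bot Manager
AI selector repair: when adaptive matching fails, call an LLM to understand the page structure and regenerate the selectors
Visual debugging tools: a Chrome extension that previews selector matches in real time
Interoperability with Scrapy: using Scrapling's fetchers and adaptive matching as Scrapy middleware

10. Summary

Scrapling occupies a distinctive position in the 2026 scraping ecosystem:

Adaptive location addresses a long-standing maintenance pain point: a site redesign no longer has to mean rewriting code
The three-fetcher family gives simple and complex scraping scenarios a single entry point
StealthyFetcher, with Camoufox's deep masking, is among the strongest open-source options for anti-bot evasion
The MCP Server meets a real need in the age of AI agents: web data no longer has to be cleaned by hand
The Spider framework is less mature than Scrapy, but it is sufficient for roughly 80% of scraping scenarios

If you still need to write scrapers in 2026, Scrapling is worth an afternoon of your time. It will not replace Scrapy in every scenario, but it turns many problems that used to require stitching several tools together into a single import.

This article is based on the latest version of Scrapling at the time of writing. Project: https://github.com/D4Vinci/Scrapling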