python – 标点符

uv: The Next-Generation Python Package Manager

钱魏Way — Sun, 29 Mar 2026 11:59:36 +0000

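In the Python ecosystem, the speed and reliability of dependency tooling directly affect developer experience and delivery time. pip is the official standard, but on large projects its dependency resolution speed and environment reproducibility often fall short. uv, written in Rust by the Astral team (the creators of Ruff), has become an increasingly popular alternative since 2025 thanks to its speed and modern design. This article covers uv's core features, walks through its usage in detail, and compares it with pip, conda, and other mainstream tools along several dimensions.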

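What is uv, and why choose it?

uv is a high-performance Python package and project manager. It aims to replace a toolchain of pip, venv, pip-tools, and similar tools with a single, fast, all-in-one solution. Its headline advantage is speed: the official benchmarks report dependency resolution and installation 10 to 100 times faster than pip. Comparisons with conda are less well documented; speedups of a similar or larger order are often cited, but they depend heavily on the workload. These gains come from the Rust implementation, parallel downloads, a global cache, and an efficient dependency resolver based on the PubGrub algorithm.

Beyond speed, uv's design is attractive in several ways:

All-in-one management: virtual environment creation, dependency installation and locking, Python version management, project initialization, and script execution live in one tool, much like Cargo in the Rust ecosystem.
pip compatibility: the uv pip interface covers the common pip workflows and requirements.txt files, so most projects can migrate by replacing pip with uv pip. Compatibility is high but not total, so check any edge cases you rely on.
Deterministic builds: a cross-platform uv.lock file lets development, test, and production environments reproduce the same dependency set, which addresses the "works on my machine" problem.
Lightweight: virtual environments reuse the base interpreter through symlinks and take only about 10 MB of disk space, which suits containers and CI/CD.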

Installing uv and getting started

Installing uv

uv ships as a standalone binary, so installation is simple and requires neither Python nor Rust.

# Linux/macOS (recommended):
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows (PowerShell):
irm https://astral.sh/uv/install.ps1 | iex
# Via pipx (if pipx is already installed):
pipx install uv

After installation, restart the terminal or run source ~/.profile, then verify with uv --version.

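Creating and activating a virtual environment

Create a virtual environment with uv venv, which is much faster than python -m venv.

# Create the default .venv environment in the current directory
uv venv

# Specify an environment name and Python version
uv venv myproject-env --python 3.11
# Activate the default environment (Linux/macOS shown);
# for the named environment, run: source myproject-env/bin/activate
source .venv/bin/activate

Once activated, the shell prompt changes and all subsequent commands run inside the isolated environment.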

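Dependency management: install, compile, sync

uv offers the pip-compatible uv pip commands as well as more modern native commands such as uv add.

Installing packages: use uv pip install, or uv add for projects managed through pyproject.toml.

# Install a single package (pip-compatible)
uv pip install requests
# Native command; also updates pyproject.toml
uv add requests
# Add a development dependency
uv add pytest --dev

Compiling and locking dependencies (the key step): this is the core of reproducible builds. First create a requirements.in file listing the project's direct dependencies.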

# requirements.in
flask
pandas>=2.0

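Then run uv pip compile to generate a fully pinned requirements.txt:

uv pip compile requirements.in -o requirements.txt

The generated requirements.txt pins the exact version of every direct and transitive dependency. Pass --generate-hashes to record package hashes as well.

Syncing the environment: uv pip sync makes the virtual environment match the lock file exactly. It installs missing packages, replaces packages whose versions differ, and uninstalls anything the file does not list, which keeps the environment clean.

uv pip sync requirements.txt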

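Project management and running code

uv can manage the whole project lifecycle.

Initializing a project: uv init generates pyproject.toml, .python-version, and other starter files.
Running scripts: uv run executes a command inside the project's virtual environment, with no manual activation.

uv run python script.py
uv run pytest

Python version management: uv can install and manage multiple Python interpreters, much like pyenv.

uv python install 3.11
uv python pin 3.11  # Pin the Python version for the current project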

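uv compared with other package managers

To make the choice clearer, the table below compares uv with the two most widely used tools, pip and conda.

Dimension	uv	pip	conda
Core role	High-performance Python package and project manager	Python's official package installer	Cross-language environment and package manager
Implementation language	Rust	Python	Python/C++
Install speed	Very fast; 10-100x faster than pip in official benchmarks	Moderate; largely sequential downloads and resolution	Slower; relies on a heavyweight SAT solver
Dependency resolution	PubGrub algorithm; deterministic, with clear conflict messages	Backtracking resolver (since pip 20.3); can be slow on complex conflicts	SAT solver; good at complex, environment-wide constraints
Virtual environments	Built in and lightweight (uv venv)	Needs venv or virtualenv	Built in; each environment includes a Python interpreter
Lock file support	✅ Native, cross-platform uv.lock	❌ Needs pip-tools or pip freeze	✅ environment.yml (not a true lock file)
Non-Python dependencies	❌ Installs Python packages only (binary wheels included); system libraries are out of scope	❌ System libraries must be handled manually	✅ Supported (e.g. CUDA, MKL, R libraries)
Python version management	✅ Built in (uv python)	❌ Needs pyenv or similar	✅ Built in
Best suited for	Pure-Python projects, web development, CI/CD, speed-sensitive workflows	Simple scripts, compatibility with legacy projects	Data science, machine learning, cross-language projects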

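Summary and recommendations:

Prefer uv: if the project is a pure-Python application (a Django or FastAPI web service, tooling scripts) and you care about fast dependency installation, lightweight environments, and consistent environments across a team, uv is a strong choice. It can shorten build times considerably in CI/CD pipelines.
Prefer conda: if the project involves scientific computing or machine learning and needs non-Python dependencies such as CUDA or MKL, or mixes Python, R, and C++, conda remains hard to replace.
Stay with pip: for very simple scripts, or for legacy projects that need strict compatibility with the official PyPA tooling and do not want a new tool.
Mixed strategy: an increasingly common practice is to install non-Python dependencies (such as CUDA drivers) at the system level with apt or brew and manage Python packages and environments with uv at the project level. For data science projects, conda can install the low-level numerical libraries while uv pip manages the pure-Python packages, balancing capability and speed.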

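Outlook

uv is developing quickly. It has been featured on the ThoughtWorks Technology Radar and adopted by well-known projects such as Dify. Astral continues to develop it actively, and project-management features comparable to Poetry's keep being added. If adoption keeps growing as it has since 2025, uv may well become the standard toolchain for pure-Python projects.

Conclusion: for most Python developers, trying uv now and gradually bringing it into the toolchain is a forward-looking investment. It delivers an immediate efficiency gain, and its modern design helps you build projects that are more reliable and easier to collaborate on.

References:

GitHub – astral-sh/uv: An extremely fast Python package and project manager, written in Rust.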

dbm: Data Persistence in the Python Standard Library

钱魏Way — Mon, 02 Mar 2026 12:59:52 +0000

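Introduction to dbm

Python's dbm module implements a simple key-value database. It is modeled on dbm (Database Manager), the database library from Unix. The background is as follows:

Origins of dbm

History:
- dbm was originally written by Ken Thompson and released by AT&T in 1979 for Unix. It was designed as a simple, efficient way to store and retrieve key-value data, a need that arises in many applications such as configuration storage and small databases.
- dbm is a hash-based storage library built for fast lookups and updates.
Characteristics:
- dbm stores data in disk files and supports fast lookup by key.
- It suits applications that need to persist simple key-value pairs, especially when memory is limited or the features of a relational database are unnecessary.

The dbm module in Python

Implementation:
- The standard library's dbm module provides an interface to Unix-style dbm databases.
- It is an abstraction layer over several backends, such as dbm.gnu, dbm.ndbm, and dbm.dumb, which differ in performance and features.
Use cases:
- dbm suits applications that store simple key-value pairs and do not need complex transactions or relational features.
- Typical uses include simple caches, configuration storage, and session persistence.
Compatibility:
- The available backend differs by platform: some builds provide GNU dbm (dbm.gnu), others the ndbm interface (dbm.ndbm), and the pure-Python dbm.dumb is always available as a fallback. The names bsddb and dumbdbm belong to Python 2; Berkeley DB support was removed from the standard library in Python 3.

In short, dbm gives you a dictionary-like way to store key-value pairs on disk, with several submodules providing the underlying database implementations.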

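Main features

Key-value storage: data is stored as key-value pairs.
Persistence: data is persisted in files on disk.
Simple interface: data is manipulated through a dictionary-like interface.

Submodules

The submodules of dbm implement different underlying database interfaces:

dbm.dumb: a simple pure-Python database, suitable for small applications.
dbm.gnu: backed by the GNU gdbm library, with better performance and more features.
dbm.ndbm: backed by the ndbm library, typically found on Unix systems.
dbm.sqlite3: backed by SQLite (added in Python 3.13). It exposes the same key-value interface as the other backends, not SQL queries.

How it works

Keys and values: keys and values are always stored as bytes. A str is encoded implicitly when stored, but lookups always return bytes, so decode values when you need text.
File format: the on-disk format depends on the backend library behind the submodule in use.

Use cases

Simple key-value storage: caches, configuration data, and similar simple key-value data.
Small databases: for small applications and embedded systems, dbm is a simple and efficient database solution.

Caveats

Data types: keys and values are stored as bytes. A str key or value is encoded automatically on write; anything else (an int, for example) must be converted first, and values read back must be decoded explicitly.
Concurrency: dbm databases have limited support for concurrent access and are not suited to highly concurrent workloads.
Compatibility: backends differ in features and performance, and their file formats are not interchangeable. Choose the backend according to your needs.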
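The str and bytes behaviour described above can be seen in a minimal sketch. It uses dbm.dumb so that it runs on any platform; the temporary directory, the file name demo, and the keys are illustrative assumptions.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo")  # hypothetical database name

    with dbm.dumb.open(path, "c") as db:
        db["name"] = "uv"        # str key and value are encoded on write
        db[b"count"] = b"3"      # bytes are stored as-is
        # Non-string values must be converted by hand before storing
        db["count"] = str(int(db["count"].decode()) + 1)

    with dbm.dumb.open(path, "r") as db:
        name = db["name"]                   # always comes back as bytes
        count = int(db[b"count"].decode())  # decode, then convert
        keys = sorted(db.keys())            # keys are bytes as well

print(name, count, keys)
```

The same pattern applies to the other backends; only the module used to open the file changes.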

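Using dbm

Whichever backend is used, the dbm API is similar and consists mainly of the following functions and methods:

The dbm.open() function

dbm.open() opens a database file.

Parameters:

file: the path of the database file. Whether a missing file is created depends on flag.
flag: optional open mode:
- 'r': open an existing database read-only (default).
- 'w': open an existing database for reading and writing.
- 'c': open for reading and writing, creating the database if it does not exist.
- 'n': always create a new, empty database, open for reading and writing.
mode: optional Unix file mode used when the database has to be created (default 0o666, modified by the umask). dbm.open() has no format parameter; the backend is chosen automatically.

Return value: a dictionary-like object that can be used much like a dict.

Example: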

import dbm

# Open a database file, creating it if needed
with dbm.open('mydb', 'c') as db:
    db['key1'] = 'value1'
    db['key2'] = 'value2'

# Read the data back
with dbm.open('mydb', 'r') as db:
    print(db['key1'])  # Output: b'value1'
    print(db['key2'])  # Output: b'value2'

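Using a specific backend

dbm.open() does not accept a format argument; passing one raises TypeError. For an existing file it detects the backend automatically, and for a new file it uses the first backend available. To force a particular implementation, call the open() function of that submodule:

# GNU dbm (only if Python was built with gdbm support)
try:
    import dbm.gnu
    with dbm.gnu.open('example_gdbm.db', 'c') as db:
        db[b'key'] = b'value'
except ImportError:
    print('dbm.gnu is not available')

# ndbm (only if the platform provides it)
try:
    import dbm.ndbm
    with dbm.ndbm.open('example_ndbm', 'c') as db:
        db[b'key'] = b'value'
except ImportError:
    print('dbm.ndbm is not available')

# SQLite backend (Python 3.13+)
try:
    import dbm.sqlite3
    with dbm.sqlite3.open('example_sqlite.db', 'c') as db:
        db[b'key'] = b'value'
except ImportError:
    print('dbm.sqlite3 is not available')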
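Backend detection can be sketched as follows. The example creates a file with dbm.dumb, asks dbm.whichdb() which backend wrote it, and then reopens it through the generic dbm.open(). The file name cache and the keys are illustrative assumptions.

```python
import dbm
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache")  # hypothetical database name

    # Create the database with an explicitly chosen backend
    with dbm.dumb.open(path, "c") as db:
        db["k"] = "v"

    # Ask dbm which backend wrote the file
    backend = dbm.whichdb(path)

    # The generic dbm.open() picks the matching backend for existing files
    with dbm.open(path, "r") as db:
        value = db.get("k")
        missing = db.get("nope", b"default")

print(backend, value, missing)
```

Because detection relies on the files already on disk, it only helps with existing databases; for new files the explicit submodule call is the way to control the backend.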

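pprint: Pretty-Printing in Python

钱魏Way — Sun, 08 Feb 2026 08:33:58 +0000

pprint (Pretty-Printer) is a standard-library module for printing complex data structures in a readable form. It is especially useful for deeply nested or large dictionaries, lists, and tuples. Compared with plain print(), it formats the output automatically so that it is easier to read.

Main features

Automatic formatting: handles nested data structures automatically
Smart line breaks: wraps lines according to a width limit
Key sorting: sorts dictionary keys by default (configurable)
Depth control: can limit the nesting depth that is shown
Recursion safety: handles circular references safely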
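A few of these features can be seen in a short sketch that uses pprint.pformat() so the results can be inspected as strings. The config dictionary is an illustrative assumption.

```python
import pprint

config = {"b": [1, 2], "a": {"nested": {"deep": {"deeper": 1}}}}

sorted_text = pprint.pformat(config)                     # keys sorted by default
ordered_text = pprint.pformat(config, sort_dicts=False)  # insertion order kept
shallow_text = pprint.pformat(config, depth=2)           # deeper levels elided

print(sorted_text)   # {'a': {'nested': {'deep': {'deeper': 1}}}, 'b': [1, 2]}
print(ordered_text)  # {'b': [1, 2], 'a': {'nested': {'deep': {'deeper': 1}}}}
print(shallow_text)  # {'a': {'nested': {...}}, 'b': [1, 2]}
```

Everything fits within the default width of 80 here, so each result stays on one line; wrapping only starts once the representation is wider than the limit.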

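Comparison with related tools

Tool	Strengths	Weaknesses	Best suited for
pprint	Supports all Python types, handles circular references, highly configurable	Relatively slow	Debugging complex Python data structures
json.dumps	Standard JSON output, language-independent, fast	JSON-compatible types only; raises an error on circular references	Serialization, API responses
yaml.dump	Very readable, supports complex structures	Needs a third-party dependency	Configuration files, documentation
print	Simple and direct, fastest	No formatting, poor readability	Simple output, quick debugging

Basic comparison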

import pprint
import json

# 原始数据
data = {
    'users': [
        {'id': 2, 'name': 'Bob', 'roles': ['user']},
        {'id': 1, 'name': 'Alice', 'roles': ['admin', 'user']}
    ],
    'metadata': {'created': '2024-01-01', 'version': 1.0}
}

# 1. Plain print
print("plain print:")
print(data)
# Output: a single line, hard to read

# 2. json.dumps
print("\njson.dumps:")
print(json.dumps(data, indent=2, sort_keys=True))
# Output: formatted, but limited to JSON-compatible types

# 3. pprint
print("\npprint:")
pprint.pprint(data)
# Output: formatted, and works with any Python type

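The pprint() function

pprint.pprint(
    object,           # object to print
    stream=None,      # output stream, defaults to sys.stdout
    indent=1,         # spaces added per nesting level
    width=80,         # maximum characters per line
    depth=None,       # maximum nesting depth
    *,
    compact=False,    # compact mode
    sort_dicts=True,  # sort dictionary keys
    underscore_numbers=False  # Python 3.10+: group digits with underscores
)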

The pformat() function

Returns the formatted string instead of printing it.

# Returns a string, suitable for storing or further processing
formatted = pprint.pformat(data, indent=2, width=60)
print("formatted string:")
print(formatted)

# Useful for logging
import logging
logger = logging.getLogger(__name__)
logger.debug("data:\n%s", pprint.pformat(data))

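The PrettyPrinter class

Creates a printer instance with custom settings.

# Create a printer with a custom configuration
pp = pprint.PrettyPrinter(
    indent=4,
    width=100,
    depth=3,
    compact=True,
    sort_dicts=False
)

# Reuse the same configuration (data is the dict defined earlier)
pp.pprint(data)
pp.pprint(data['users'])

# Get the formatted string
text = pp.pformat(data)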

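Parameters in detail

indent: indentation

data = {'a': [1, 2, 3], 'b': {'x': 10, 'y': 20}}

# At the default width of 80 this dict fits on one line,
# so a small width is used here to force wrapping
pprint.pprint(data, indent=2, width=20)  # 2-space indent
# Output:
# { 'a': [1, 2, 3],
#   'b': { 'x': 10,
#          'y': 20}}

pprint.pprint(data, indent=4, width=20)  # 4-space indent

width: line width
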
long_list = list(range(20))

# Small width: wraps often
pprint.pprint(long_list, width=30)
# Output:
# [0,
#  1,
#  2,
#  ...]

# Large width: avoids wrapping where possible
pprint.pprint(long_list, width=200)

depth: depth limit

nested = {
    'level1': {
        'level2': {
            'level3': {
                'level4': 'deep'
            }
        }
    }
}

# Limit the depth
pprint.pprint(nested, depth=2)
# Output:
# {'level1': {'level2': {...}}}

compact: compact mode

# Controls how long sequences are laid out
long_sequence = [f"item_{i}" for i in range(20)]

# compact=False (default)
pprint.pprint(long_sequence, width=50, compact=False)
# One element per line

# compact=True
pprint.pprint(long_sequence, width=50, compact=True)
# As many elements per line as fit within the width

sort_dicts: dictionary ordering

unsorted_dict = {'z': 3, 'a': 1, 'm': 2}

# Sorted (default)
pprint.pprint(unsorted_dict, sort_dicts=True)
# Output: {'a': 1, 'm': 2, 'z': 3}

# Keep insertion order
pprint.pprint(unsorted_dict, sort_dicts=False)
# Output: {'z': 3, 'a': 1, 'm': 2}

Handling special data types

# 1. Sets
data_set = {1, 2, 3, 4, 5}
pprint.pprint(data_set)  # elements are sorted when the set has to wrap; a short set is printed with repr()

# 2. Named tuples
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
p = Point(10, 20)
pprint.pprint(p)  # keeps the namedtuple representation

# 3. Custom objects
class CustomClass:
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return f"CustomClass({self.value})"

obj = CustomClass(42)
pprint.pprint(obj)  # uses __repr__

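Handling recursive references

# Build a recursive data structure
recursive_list = [1, 2, 3]
recursive_list.append(recursive_list)  # add a reference to itself

# pprint detects the cycle instead of recursing forever
pprint.pprint(recursive_list, depth=3)
# Output: [1, 2, 3, <Recursion on list with id=...>]

# Check for recursion
print("recursive:", pprint.isrecursive(recursive_list))  # True
print("readable:", pprint.isreadable(recursive_list))    # False

Custom formatting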

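class CustomPrettyPrinter(pprint.PrettyPrinter):
    """Custom printer"""

    def format(self, obj, context, maxlevels, level):
        """Override the formatting hook; returns (text, readable, recursive)"""
        if isinstance(obj, set):
            # Custom set format. Sort by repr because some element types
            # (complex numbers, for example) cannot be ordered with <
            parts = [self.format(item, context, maxlevels, level + 1)[0]
                     for item in sorted(obj, key=repr)]
            # The result is not valid Python source, so readable is False
            return ', '.join(parts), False, False
        elif isinstance(obj, complex):
            # Custom complex format; ":+" keeps the sign of the imaginary part
            return f"{obj.real}{obj.imag:+}j", True, False
        return super().format(obj, context, maxlevels, level)

# Use the custom printer
custom_printer = CustomPrettyPrinter(width=50)
custom_printer.pprint({1+2j, 3+4j, 5+6j})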

Positive-Unlabeled Learning (PU Learning)

钱魏Way — Sat, 15 Nov 2025 07:47:09 +0000

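What is PU Learning?

PU Learning stands for Positive-Unlabeled Learning. It is a special machine learning setting within semi-supervised learning.

Unlike traditional supervised learning, where data carries explicit positive and negative labels, PU Learning works with a dataset that contains only two kinds of samples:

Positive samples: samples known to belong to the target class. Examples: customers confirmed to have bought a product, patients diagnosed by a doctor, emails confirmed as spam by manual review.
Unlabeled samples: samples whose class is unknown. This set is a mixture that contains both undiscovered positives and true negatives. Examples: all visitors to a website (potential customers and non-customers), all patients who were screened (sick and healthy), all emails in a mailbox (spam and legitimate).

Core idea: learn a classifier from the known positives and a mixed unlabeled set so that it can later separate new positives from negatives accurately.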

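Introductory examples

These examples give an intuitive feel for the problem PU Learning solves.

Example 1: finding a rare ore

Scenario: you are a geologist who has found several gold veins (positives) in a region. The mountain range contains thousands of survey points (unlabeled samples), and your task is to find other points likely to contain gold.
Problem: you cannot label every point without known gold as negative, because gold may simply not have been found there yet. If an undiscovered rich deposit is wrongly labeled "no ore" and used for training, the model learns to ignore the real signal.
PU approach: use the geological features of the known veins (soil composition, magnetic field strength) to find survey points whose features are completely different from the known veins. These can serve as reliable negatives (an ordinary limestone area, for instance). Then train a model on the positives and the reliable negatives and use it to predict the remaining points.

Example 2: recommender systems

Scenario: on an e-commerce platform, items a user clicked or bought are clear positives.
Problem: items the user saw but did not click are not true negatives. The user may not have noticed them, may not need them right now, or may buy them later. Treating every unclicked item as a negative overstates the range of items the user dislikes and makes recommendations too conservative.
PU approach: treat unclicked items as unlabeled. With PU learning, the model can learn to separate items the user truly dislikes (negatives) from items the user may be interested in (hidden positives) among the impressions that were not clicked.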

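Why is PU Learning hard? (The core problem)

Applying a standard classification method directly runs into serious problems:

Label bias: if all unlabeled samples are simply treated as negatives, the classifier is trained on wrong labels. The unlabeled set contains many hidden positives, and learning them as negatives misleads the model and can put the decision boundary in entirely the wrong place.
Distorted data distribution: standard classifiers usually assume that the positives and negatives in the training data are an unbiased sample of the true distribution. In PU learning, the labeled positives are only part of all positives, and the set used as "negatives" is contaminated by the positives hidden in it, so neither class is an unbiased sample and the basic assumption of traditional algorithms is violated.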
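Typical applications of PU Learning

PU Learning is very useful in practice because complete negative labels are often hard or expensive to obtain.

Information retrieval and recommender systems:
- Positives: items the user clicked, bought, or viewed for a long time.
- Unlabeled: all other items the user saw but did not interact with.
- Goal: find items the user is likely to enjoy (hidden positives) among the unlabeled samples and recommend them.
Anomaly detection:
- Positives: confirmed fraudulent transactions, network attacks, equipment failures.
- Unlabeled: all remaining records that nobody has audited. Most of them are normal, but some are anomalies that have not been detected yet.
- Goal: normal behaviour is too varied to enumerate and anomalies are rare, so the aim is to surface the undetected anomalies hidden in the unlabeled data.
Bioinformatics:
- Positives: genes known to be associated with a disease.
- Unlabeled: all other genes in the human genome (most are unrelated, but some may be undiscovered disease genes).
- Goal: predict new disease-related genes.
Medical diagnosis:
- Positives: patients confirmed by a gold-standard test such as a biopsy.
- Unlabeled: everyone who was screened but not diagnosed (a mix of false negatives and healthy people).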

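Main approaches to PU Learning

Researchers have proposed many ways to solve the PU learning problem. They fall into three broad groups:

Approach 1: the two-step strategy

This is the most intuitive and popular approach. It accepts that the unlabeled data is a mixture and tries to identify reliable negatives within it.

Step 1: identify reliable negatives

Use the known positives and the unlabeled data to find samples that are very likely to be negative.
Common techniques:
- Spy technique: randomly take a small share of the known positives (10%, for example), disguise them as unlabeled samples, and add them to the unlabeled set. These samples are called spies.
- Then train a learning algorithm (naive Bayes or an SVM, for example) on the remaining positives (90%) against the whole unlabeled set, spies included. The spies are true positives that the algorithm treats as unlabeled, so the scores they receive show how hidden positives behave. Unlabeled samples that score lower than (nearly) all spies are very likely reliable negatives.
- Other methods: use clustering (k-means) or anomaly detection (Isolation Forest) to find unlabeled samples whose distribution differs strongly from the known positives.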
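Step 2: iterative learning

Once a set of reliable negatives has been found, the task becomes a cleaner supervised learning problem: known positives plus reliable negatives.
Any classification algorithm (logistic regression, decision trees, SVMs, and so on) can be trained on this dataset to obtain an initial classifier.
That classifier then predicts the remaining unlabeled data, and samples predicted as negative with high confidence are added to the negative set.
The process repeats, gradually enlarging the training set, until the model converges.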
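Approach 2: class-prior correction

The core idea is to modify the loss function of a standard classification algorithm to compensate for the missing negative labels.

Key insight: the expected loss of an unlabeled sample can be viewed as a weighted average of its loss as a positive and its loss as a negative, where the weight is the probability that it is a positive.
Key steps:
- Estimate the class prior, that is, the proportion of positives π = P(y=1). This is challenging but solvable, and dedicated estimation algorithms exist.
- Correct the loss function: using the estimated π, adjust the standard binary classification loss mathematically so that an algorithm given only positive and unlabeled samples optimizes towards parameters close to those it would reach with complete labels.
Advantages: this approach is usually more principled and elegant, and it can reuse many existing efficient algorithms with only the loss function changed.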
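One widely used form of such a correction is the unbiased PU risk estimator. The derivation below is a sketch under the assumption that the unlabeled set is drawn from the overall data distribution; the symbols $p_{+}$, $p_{-}$, $p_{u}$, $\ell$, and $f$ are notation introduced here, not taken from the text above.

```latex
% Unlabeled data follow the marginal, a mixture of both classes:
%   p_u(x) = \pi \, p_{+}(x) + (1 - \pi) \, p_{-}(x)
% so the negative-class term of the usual risk can be rewritten as
\begin{aligned}
(1-\pi)\,\mathbb{E}_{x \sim p_{-}}\big[\ell(f(x), -1)\big]
  &= \mathbb{E}_{x \sim p_{u}}\big[\ell(f(x), -1)\big]
   - \pi\,\mathbb{E}_{x \sim p_{+}}\big[\ell(f(x), -1)\big], \\[4pt]
R_{\mathrm{PU}}(f)
  &= \pi\,\mathbb{E}_{x \sim p_{+}}\big[\ell(f(x), +1)\big]
   + \mathbb{E}_{x \sim p_{u}}\big[\ell(f(x), -1)\big]
   - \pi\,\mathbb{E}_{x \sim p_{+}}\big[\ell(f(x), -1)\big].
\end{aligned}
```

Every expectation on the right-hand side is taken over positive or unlabeled data only, so the risk can be estimated without a single negative label once π is known.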

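Approach 3: probabilistic labels

This approach treats unlabeled samples as carrying noisy labels in probabilistic form.

Core idea: assume each unlabeled sample x is positive with probability P(y=1 | x) and negative with probability 1 - P(y=1 | x), and include these probabilities as weights in the objective function during training.
Common algorithm: Expectation-Maximization (EM) is often used in this framework.
- E-step: using the current model parameters, compute for each unlabeled sample the expected probability of being positive and of being negative.
- M-step: using the known positives and the probability-weighted unlabeled samples, update the model parameters to maximize the likelihood.
- Repeat the E-step and M-step until convergence.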

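Using the pulearn library

pulearn is a Python library dedicated to positive-unlabeled learning. It provides several algorithms for classification problems in which only positive (P) samples and a large number of unlabeled (U) samples are available. The library implements three mainstream PU learning strategies, and you can choose among them according to the problem and the data.

The Elkanoto method

The Elkanoto method is one of the most classic and practical PU learning methods. It was proposed by Charles Elkan and Keith Noto in their 2008 paper "Learning classifiers from only positive and unlabeled data". It uses probability estimation to solve classification when only positive and unlabeled samples are available.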

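Core of the method: hold-out estimation and probability correction

The key observation is that, if the labeled positives are a random sample of all positives, a classifier trained to separate labeled from unlabeled samples differs from the true classifier only by a constant factor. That factor can be estimated from held-out positives and then used to correct the predicted probabilities. The steps and formulas are:

Training a labeled-versus-unlabeled classifier
- First, set aside a random subset of the labeled positive set (P) as a hold-out set. The remaining positives and the whole unlabeled set (U) are used to train a classifier whose goal is to separate labeled positives from unlabeled data. The held-out positives play a role similar to the spies of the two-step strategy, but they are left out of training rather than mixed into U.
- The classifier therefore learns the probability that a sample is labeled, g(x) = P(s=1 | x), where s=1 means the sample carries a positive label.
Probability correction
- The true class probability P(y=1 | x) is obtained by correcting g(x); no second classifier is needed. Elkan and Noto show that P(y=1 | x) = P(s=1 | x) / c, where c = P(s=1 | y=1) is the probability that a positive is labeled. Under the random-labeling assumption c does not depend on x, and it can be estimated as the average of g(x) over the held-out positives.

Algorithm steps and implementation

In pulearn, ElkanotoPuClassifier wraps this procedure.

Input: the labeled positive set P, the unlabeled set U, and a base classifier (an SVM or logistic regression, for example).
Steps:
- Randomly hold out a fraction of P (controlled by the hold_out_ratio parameter).
- Train the classifier on the remaining positives and U, which gives g(x) = P(s=1 | x) for any sample x.
- Estimate the label frequency c as the average of g(x) over the held-out positives.
- At prediction time, divide the classifier's output by c to obtain P(y=1 | x). Because the corrected value can exceed 1, it is common to use min(1, P(s=1 | x) / c).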
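The correction step itself is small enough to sketch in pure Python. The function names and the scores below are hypothetical; in practice the scores would come from the predict_proba output of the labeled-versus-unlabeled classifier.

```python
def estimate_label_frequency(holdout_scores):
    """Estimate c = P(s=1 | y=1) as the mean score on held-out positives."""
    return sum(holdout_scores) / len(holdout_scores)


def correct_probabilities(scores, c):
    """Convert P(s=1 | x) into P(y=1 | x) = P(s=1 | x) / c, capped at 1."""
    return [min(1.0, g / c) for g in scores]


# Hypothetical classifier scores g(x) = P(s=1 | x)
holdout_scores = [0.30, 0.20, 0.25, 0.25]  # held-out labeled positives
new_scores = [0.05, 0.125, 0.25, 0.40]     # samples to classify

c = estimate_label_frequency(holdout_scores)
corrected = correct_probabilities(new_scores, c)

print(f"c = {c:.2f}")
print([round(p, 3) for p in corrected])
```

A raw score of 0.25 already corresponds to a near-certain positive here, because only about a quarter of all positives carry a label. Thresholding the raw scores at 0.5 would miss almost every hidden positive.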

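Key parameters and caveats

A few points deserve particular attention when using the Elkanoto method:

The hold_out_ratio parameter: controls the fraction of labeled positives held out to estimate c, and therefore directly affects how accurate that estimate is. If the fraction is too small, the estimate of c may be unreliable; if it is too large, fewer positives remain for training the classifier. It usually has to be tuned experimentally, for example with a grid search.
Choice of base classifier: the base classifier (an SVM, logistic regression, and so on) must be able to output probability estimates. The performance of the algorithm depends heavily on how well this classifier performs.
Stability of the label frequency c: the estimate of c is critical. In practice, strategies such as repeating the random hold-out several times and averaging the results are used to make c more stable.

Strengths and limitations

Knowing the strengths and weaknesses of the Elkanoto method helps you make a suitable choice in practice.

Strengths:

Solid theory: the method rests on probability theory and its logic is clear.
Relatively simple to implement: libraries such as pulearn make it easy to apply.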
无需
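Good results: it works well in many settings, especially when the labeled positives and the positives hidden in the unlabeled set share similar features.

Limitations:

Parameter sensitivity: the choice of hold_out_ratio and similar parameters has a large effect on the results.
Dependence on the base classifier: the quality of its probability estimates directly affects the whole procedure.
The random-labeling assumption: the method assumes that the labeled positives follow the same distribution as the positives hidden in U. If this does not hold, results suffer.
Computational cost: holding out positives leaves less data for training, and stabilizing c usually means fitting the classifier several times.

Code example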

# 第一步：导入必要的库
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import xgboost as xgb

# Plotly 相关库
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio

# 设置Plotly默认主题
pio.templates.default = "plotly_white"

# # 第二步：加载和探索数据
# print("=== 步骤1: 数据加载与探索 ===")
# url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
# data = pd.read_csv(url, header=None)
data = pd.read_csv("data/data_banknote_authentication.txt",header=None)
# 分配列名
data.columns = ['F1','F2','F3','F4','Target']
# 打印数据基本信息
print("数据形状:", data.shape)
print("\n前5行数据:")
print(data.head())

print("\n目标变量分布:")
print(data['Target'].value_counts())

print("\n数据描述性统计:")
print(data.describe())

# 第三步：创建基线模型（全监督学习）
print("\n=== 步骤2: 创建基线模型（全监督学习）===")

# 定义特征和目标
features = ['F1','F2','F3','F4']

# 划分训练集和测试集
x_data = data[features]
y_data = data['Target']
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=7)

# 训练XGBoost分类器作为基线
model = xgb.XGBClassifier(random_state=42)
model.fit(x_train, y_train)

# 定义评估函数
def evaluate_results(y_test, y_predict, y_prob=None):
    print('分类结果:')
    f1 = f1_score(y_test, y_predict)
    print("f1: %.2f%%" % (f1 * 100.0))
    roc = roc_auc_score(y_test, y_predict)
    print("ROC AUC: %.2f%%" % (roc * 100.0))
    rec = recall_score(y_test, y_predict, average='binary')
    print("召回率: %.2f%%" % (rec * 100.0))
    prc = precision_score(y_test, y_predict, average='binary')
    print("精确率: %.2f%%" % (prc * 100.0))
    acc = accuracy_score(y_test, y_predict)
    print("准确率: %.2f%%" % (acc * 100.0))

    if y_prob is not None:
        roc_auc_prob = roc_auc_score(y_test, y_prob)
        print("ROC AUC (概率): %.2f%%" % (roc_auc_prob * 100.0))

    print("-" * 50)
    return {'f1': f1, 'roc_auc': roc, 'recall': rec, 'precision': prc, 'accuracy': acc}

# 在测试集上评估基线模型
print("基线模型性能 (全监督学习):")
y_predict_baseline = model.predict(x_test)
y_prob_baseline = model.predict_proba(x_test)[:, 1]
baseline_metrics = evaluate_results(y_test, y_predict_baseline, y_prob_baseline)

# 第四步：创建PU学习场景
print("\n=== 步骤3: 创建PU学习场景 ===")

# 创建数据副本用于PU学习
mod_data = data.copy()

# 提取正例的索引（Target=1）
index_pos = mod_data[mod_data['Target']==1].sample(frac=0.25, random_state=42).index

# 创建PU_Target列，初始化为"Unlabeled"
mod_data['PU_Target'] = "Unlabeled"

# 仅标记25%的正例为"Positive"（其余保持为"Unlabeled"）
mod_data.loc[index_pos,'PU_Target'] = 'Positive'

# 打印Target和PU_Target的交叉表
print("Target和PU_Target的交叉表:")
print(pd.crosstab(mod_data['Target'], mod_data['PU_Target'], margins=True))
print("\n我们的目标是从1220个未标记样本中识别出458个隐藏的正例")

# 第五步：实现PU估计器（Elkanoto方法）
print("\n=== 步骤4: 实现PU估计器（Elkanoto方法）===")

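def fit_PU_estimator_AM(X, y, hold_out_ratio, estimator, random_state=None):
    """
    PU estimator following the Elkanoto method

    Parameters:
    - X: feature data
    - y: labels (1=Positive, 0=Unlabeled)
    - hold_out_ratio: fraction of labeled positives to hold out
    - estimator: base classifier
    - random_state: seed for the hold-out split; None (the default) draws a
      different hold-out set on every call
    """
    # Select the samples labeled as positive
    X_labeled_pos = X[y == 1]
    # Randomly hold out part of the labeled positives to estimate P(s=1|y=1)
    X_hold_out = X_labeled_pos.sample(frac=hold_out_ratio, random_state=random_state)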

    # 提取非保留样本的索引
    idx_non_hold = list(set(X.index) - set(X_hold_out.index))

    # 从X和y中移除保留的样本
    X_non_hold = X.loc[idx_non_hold]
    y_non_hold = y.loc[idx_non_hold]

    # 在非保留样本上训练估计器
    estimator.fit(X_non_hold, y_non_hold)

    # 使用估计器预测保留的正例集，估计P(s=1|y=1)
    hold_out_predictions = estimator.predict_proba(X_hold_out)[:, 1]

    # 计算平均概率
    prob_s1y1 = hold_out_predictions.mean()
    return estimator, prob_s1y1

def predict_PU_prob_AM(X, estimator, prob_s1y1):
    """
    使用训练好的PU估计器进行预测

    参数:
    - X: 特征数据
    - estimator: 训练好的估计器
    - prob_s1y1: P(s=1|y=1)的估计值
    """
    predicted_s = estimator.predict_proba(X)[:, 1]  # P(s=1|X)
    return predicted_s / prob_s1y1  # P(y=1|X) = P(s=1|X) / P(s=1|y=1)

# 准备PU学习的数据
y_pu = mod_data['PU_Target'].map({'Unlabeled': 0, 'Positive': 1}).astype('int')

# 第六步：执行PU学习
print("\n=== 步骤5: 执行PU学习 ===")

# 初始化变量
predicted = np.zeros(len(mod_data))
learning_iterations = 101  # 减少迭代次数以加快演示速度

# 执行多次迭代学习
report = []
for index in range(learning_iterations):
    # 每次迭代使用不同的保留样本，因此pu_estimator和probs1y1会不同
    pu_estimator, probs1y1 = fit_PU_estimator_AM(
        X=mod_data[features],
        y=y_pu,
        hold_out_ratio=0.25,
        estimator=xgb.XGBClassifier(random_state=index)  # 每次使用不同的随机种子
    )

    predicted_index = predict_PU_prob_AM(mod_data[features], pu_estimator, probs1y1)

    # 由于预测的概率可能不在[0,1]范围内，进行缩放
    predicted_index_scaled = MinMaxScaler().fit_transform(
        predicted_index.reshape(-1, 1)
    ).reshape(-1)

    predicted += predicted_index_scaled

    # 每20次迭代打印一次进度
    if index % 20 == 0:
        print(f'学习迭代: {index}/{learning_iterations} => P(s=1|y=1)={probs1y1:.4f}')

# 计算平均概率
mod_data['y_pos_pred_proba'] = predicted / learning_iterations

# 第七步：分析PU学习结果
print("\n=== 步骤6: 分析PU学习结果 ===")

# 查看不同组的预测概率中位数
prob_comparison = pd.pivot_table(
    mod_data,
    index='Target',
    columns='PU_Target',
    values='y_pos_pred_proba',
    aggfunc='median'
)
print("不同组的预测概率中位数:")
print(prob_comparison)

# 第八步：评估PU学习性能
print("\n=== 步骤7: 评估PU学习性能 ===")

# 在不同阈值下评估性能
thresholds = np.linspace(0.1, 0.9, 50)
performance_report = []

for thre in thresholds:
    y_pred_pu = (mod_data['y_pos_pred_proba'] > thre).astype(int)
    p = precision_score(mod_data['Target'], y_pred_pu)
    r = recall_score(mod_data['Target'], y_pred_pu)
    f = f1_score(mod_data['Target'], y_pred_pu)
    a = accuracy_score(mod_data['Target'], y_pred_pu)
    performance_report.append([thre, p, r, f, a])

performance_df = pd.DataFrame(
    performance_report,
    columns=['threshold', 'precision', 'recall', 'f1', 'accuracy']
)

# 找到最佳F1分数对应的阈值
best_f1_idx = performance_df['f1'].idxmax()
best_threshold = performance_df.loc[best_f1_idx, 'threshold']
best_f1 = performance_df.loc[best_f1_idx, 'f1']

print(f"最佳阈值: {best_threshold:.4f}, 最佳F1分数: {best_f1:.4f}")

# 使用最佳阈值进行最终预测
y_pred_pu_best = (mod_data['y_pos_pred_proba'] > best_threshold).astype(int)
print("PU学习模型性能 (使用最佳阈值):")
pu_metrics = evaluate_results(mod_data['Target'], y_pred_pu_best, mod_data['y_pos_pred_proba'])

# 第九步：使用Plotly可视化结果
print("\n=== 步骤8: 使用Plotly可视化结果 ===")

# 1. 特征分布图
feature_fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[f'特征 {feature} 分布' for feature in features]
)

for i, feature in enumerate(features):
    row = i // 2 + 1
    col = i % 2 + 1

    # 正例分布
    pos_data = data[data['Target'] == 1][feature]
    feature_fig.add_trace(
        go.Histogram(x=pos_data, name='正例', opacity=0.7, marker_color='red'),
        row=row, col=col
    )

    # 负例分布
    neg_data = data[data['Target'] == 0][feature]
    feature_fig.add_trace(
        go.Histogram(x=neg_data, name='负例', opacity=0.7, marker_color='blue'),
        row=row, col=col
    )

feature_fig.update_layout(
    title_text="特征分布 by Target",
    height=600,
    showlegend=True
)
feature_fig.show()

# 2. PU学习预测概率分布
pu_prob_fig = go.Figure()

# 标记为正例的样本概率分布
positive_data = mod_data[mod_data['PU_Target'] == 'Positive']['y_pos_pred_proba']
pu_prob_fig.add_trace(go.Histogram(
    x=positive_data,
    name='标记正例',
    opacity=0.7,
    marker_color='red'
))

# 未标记样本的概率分布
unlabeled_data = mod_data[mod_data['PU_Target'] == 'Unlabeled']['y_pos_pred_proba']
pu_prob_fig.add_trace(go.Histogram(
    x=unlabeled_data,
    name='未标记样本',
    opacity=0.7,
    marker_color='blue'
))

pu_prob_fig.update_layout(
    title='PU学习预测概率分布',
    xaxis_title='预测概率',
    yaxis_title='频数',
    bargap=0.1,
    height=400
)
pu_prob_fig.show()

# 3. 性能指标随阈值变化
metrics_fig = go.Figure()

metrics_fig.add_trace(go.Scatter(
    x=performance_df['threshold'],
    y=performance_df['precision'],
    mode='lines',
    name='精确率',
    line=dict(width=3)
))

metrics_fig.add_trace(go.Scatter(
    x=performance_df['threshold'],
    y=performance_df['recall'],
    mode='lines',
    name='召回率',
    line=dict(width=3)
))

metrics_fig.add_trace(go.Scatter(
    x=performance_df['threshold'],
    y=performance_df['f1'],
    mode='lines',
    name='F1分数',
    line=dict(width=3)
))

# 添加最佳阈值标记
metrics_fig.add_vline(
    x=best_threshold,
    line_width=2,
    line_dash="dash",
    line_color="red",
    annotation_text=f"最佳阈值: {best_threshold:.3f}"
)

metrics_fig.update_layout(
    title='性能指标随阈值变化',
    xaxis_title='阈值',
    yaxis_title='分数',
    height=500
)
metrics_fig.show()

# 4. 方法比较
methods = ['基线模型 (全监督)', 'PU学习模型']
f1_scores = [baseline_metrics['f1'], pu_metrics['f1']]
recall_scores = [baseline_metrics['recall'], pu_metrics['recall']]
precision_scores = [baseline_metrics['precision'], pu_metrics['precision']]

comparison_fig = go.Figure(data=[
    go.Bar(name='F1分数', x=methods, y=f1_scores, marker_color='lightblue'),
    go.Bar(name='召回率', x=methods, y=recall_scores, marker_color='lightgreen'),
    go.Bar(name='精确率', x=methods, y=precision_scores, marker_color='lightsalmon')
])

comparison_fig.update_layout(
    title='方法比较',
    xaxis_title='方法',
    yaxis_title='分数',
    barmode='group',
    height=500
)
comparison_fig.show()

# 5. 隐藏正例识别情况
true_positives = mod_data[(mod_data['Target'] == 1) & (mod_data['PU_Target'] == 'Unlabeled')]
identified_positives = true_positives[true_positives['y_pos_pred_proba'] > best_threshold]

identification_rate = len(identified_positives)/len(true_positives)*100

# 创建识别情况饼图
identification_fig = go.Figure(data=[go.Pie(
    labels=['被正确识别的隐藏正例', '未被识别的隐藏正例'],
    values=[len(identified_positives), len(true_positives) - len(identified_positives)],
    marker_colors=['lightgreen', 'lightcoral']
)])

identification_fig.update_layout(
    title=f'隐藏正例识别情况 (识别率: {identification_rate:.2f}%)',
    height=400
)
identification_fig.show()

# 第十步：总结与洞察
print("\n=== 步骤9: 总结与洞察 ===")
print("PU学习关键洞察:")
print("1. PU学习仅使用部分标记的正例和大量未标记样本进行训练")
print("2. 通过迭代估计P(s=1|y=1)，我们可以估计P(y=1|X) = P(s=1|X) / P(s=1|y=1)")
print("3. 在这个例子中，我们成功从1220个未标记样本中识别出了隐藏的正例")
print("4. PU学习在真实场景中非常有用，当获取负例标签困难或成本高时")

print(f"\n性能比较总结:")
print(f"基线模型 (全监督) F1分数: {baseline_metrics['f1']:.4f}")
print(f"PU学习模型 F1分数: {pu_metrics['f1']:.4f}")

print(f"\n隐藏的正例识别情况:")
print(f"总隐藏正例数: {len(true_positives)}")
print(f"被正确识别的隐藏正例数: {len(identified_positives)}")
print(f"隐藏正例识别率: {identification_rate:.2f}%")

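The weighted Elkanoto method

The weighted Elkanoto method described here is an extension of the classic Elkanoto PU learning method. It introduces sample weights into the learning process, with the aim of coping better with class imbalance and noisy data.

Background and motivation

Limitations of the classic Elkanoto method

The classic Elkanoto method has a solid theoretical basis, but it has some limitations in practice:

Sensitivity to noise: all samples are treated equally, so noisy samples degrade the model
Class imbalance: when positives are far fewer than unlabeled samples, the model tends to favour the negative class
Differences in sample importance: samples contribute unequally to what the model learns

Advantages of weighting

The weighted Elkanoto method addresses these problems by assigning different weights to different samples:

Reducing the impact of noise: likely noisy samples receive lower weights
Handling class imbalance: weights rebalance the influence of positive and unlabeled samples
Focusing on hard samples: samples near the decision boundary receive higher weights

Principle and mathematical basis

Core idea

The core idea of the weighted Elkanoto method is that different training samples should contribute differently to the loss function. By assigning weights, the model can concentrate on the samples that matter most for learning the decision boundary.

Formulas

In the classic Elkanoto method, the key probability relation is: P(y=1|x) = P(s=1|x) / P(s=1|y=1)

The weighted version introduces a weight vector w with one weight w_i per sample.

The weighted loss can then be written as: L(w) = Σ_i w_i · ℓ(f(x_i), y_i)

where ℓ is the base loss function (cross-entropy, for example) and f is the classifier.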
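The weighted loss can be written out directly. The sketch below uses binary cross-entropy as the base loss and returns the weighted sum, matching the formula above; the function name and the example numbers are illustrative assumptions.

```python
import math


def weighted_log_loss(probs, labels, weights, eps=1e-12):
    """Weighted sum of per-sample binary cross-entropy losses."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * sample_loss
    return total


probs = [0.9, 0.2, 0.6]   # predicted P(s=1 | x)
labels = [1, 0, 0]        # 1 = labeled positive, 0 = unlabeled

uniform = weighted_log_loss(probs, labels, [1.0, 1.0, 1.0])
# Down-weight the third sample, for example because it is suspected noise
downweighted = weighted_log_loss(probs, labels, [1.0, 1.0, 0.1])

print(f"uniform weights: {uniform:.4f}")
print(f"third sample down-weighted: {downweighted:.4f}")
```

With uniform weights the function reduces to the ordinary cross-entropy sum, which is why uniform weighting brings the method back to the classic Elkanoto setup.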

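Algorithm steps in detail

Weight assignment strategies

The key question in the weighted Elkanoto method is how to assign sample weights. Common strategies include:

Confidence-based weights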

# 基于模型预测置信度分配权重
def confidence_based_weights(predictions, alpha=0.5):
    """
    基于预测置信度分配权重
    predictions: 模型预测概率
    alpha: 平滑参数
    """
    confidence = np.abs(predictions - 0.5) * 2  # 转换为[0,1]区间
    weights = alpha + (1 - alpha) * confidence
    return weights

Distance-based weights

def distance_based_weights(X, positive_centroid, beta=1.0):
    """
    基于与正例中心的距离分配权重
    positive_centroid: 正例样本的中心点
    beta: 距离缩放参数
    """
    distances = np.linalg.norm(X - positive_centroid, axis=1)
    max_distance = np.max(distances)
    weights = 1 - (distances / max_distance) * beta
    return np.clip(weights, 0.1, 1.0)  # 确保权重在合理范围内

The weighted Elkanoto algorithm

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_X_y, check_array

class WeightedElkanotoPUClassifier(BaseEstimator, ClassifierMixin):
    """
    加权 Elkanoto PU 分类器实现
    """
    
    def __init__(self, base_estimator=None, hold_out_ratio=0.3, 
                 weight_strategy='confidence', n_iter=10, random_state=42):
        self.base_estimator = base_estimator
        self.hold_out_ratio = hold_out_ratio
        self.weight_strategy = weight_strategy
        self.n_iter = n_iter
        self.random_state = random_state
        self.prob_s1y1_ = None
        self.estimator_ = None
        self.weights_ = None
        
    def _calculate_weights(self, X, y, predictions=None):
        """计算样本权重"""
        if self.weight_strategy == 'uniform':
            # 均匀权重（退化为经典Elkanoto）
            return np.ones(len(X))
            
        elif self.weight_strategy == 'confidence':
            # 基于置信度的权重
            if predictions is None:
                return np.ones(len(X))
            return confidence_based_weights(predictions)
            
        elif self.weight_strategy == 'distance':
            # 基于距离的权重
            positive_indices = np.where(y == 1)[0]
            if len(positive_indices) == 0:
                return np.ones(len(X))
            positive_centroid = np.mean(X[positive_indices], axis=0)
            return distance_based_weights(X, positive_centroid)
            
        else:
            raise ValueError(f"未知的权重策略: {self.weight_strategy}")
    
    def fit(self, X, y):
        """训练加权PU分类器"""
        X, y = check_X_y(X, y)
        np.random.seed(self.random_state)
        
        # 分离正例和未标记样本
        positive_indices = np.where(y == 1)[0]
        unlabeled_indices = np.where(y == 0)[0]
        
        if len(positive_indices) == 0:
            raise ValueError("训练数据中必须包含正例样本")
        
        # 迭代优化过程
        best_estimator = None
        best_prob_s1y1 = 0
        best_weights = None
        min_loss = float('inf')
        
        for iteration in range(self.n_iter):
            # 1. 随机选择保留的正例样本
            n_hold_out = max(1, int(len(positive_indices) * self.hold_out_ratio))
            hold_out_positives = np.random.choice(
                positive_indices, size=n_hold_out, replace=False
            )
            
            # 训练样本：剩余正例 + 所有未标记样本
            train_indices = np.setdiff1d(
                np.arange(len(X)), 
                hold_out_positives
            )
            
            X_train = X[train_indices]
            y_train = y[train_indices]
            
            # 2. 初始训练（第一轮使用均匀权重）
            if iteration == 0:
                sample_weights = np.ones(len(X_train))
            else:
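                # Compute weights from the previous iteration's model. The local
                # name estimator still refers to it; self.estimator_ is only
                # assigned after the loop and is None at this point
                predictions = estimator.predict_proba(X_train)[:, 1]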
                sample_weights = self._calculate_weights(X_train, y_train, predictions)
            
            # 训练基础分类器
            estimator = clone(self.base_estimator)
            estimator.fit(X_train, y_train, sample_weight=sample_weights)
            
            # 3. 估计 P(s=1|y=1)
            X_hold_out = X[hold_out_positives]
            hold_out_probs = estimator.predict_proba(X_hold_out)[:, 1]
            prob_s1y1 = np.mean(hold_out_probs)
            
            # 4. 计算损失（加权交叉熵）
            train_probs = estimator.predict_proba(X_train)[:, 1]
            loss = -np.mean(sample_weights * (
                y_train * np.log(train_probs + 1e-10) + 
                (1 - y_train) * np.log(1 - train_probs + 1e-10)
            ))
            
            # 5. 选择最佳模型
            if loss < min_loss:
                min_loss = loss
                best_estimator = estimator
                best_prob_s1y1 = prob_s1y1
                best_weights = sample_weights
        
        self.estimator_ = best_estimator
        self.prob_s1y1_ = best_prob_s1y1
        self.weights_ = best_weights
        
        return self
    
    def predict_proba(self, X):
        """预测概率"""
        X = check_array(X)
        prob_s1x = self.estimator_.predict_proba(X)[:, 1]
        prob_y1x = prob_s1x / self.prob_s1y1_
        # 确保概率在[0,1]范围内
        prob_y1x = np.clip(prob_y1x, 0, 1)
        prob_y0x = 1 - prob_y1x
        return np.vstack([prob_y0x, prob_y1x]).T
    
    def predict(self, X, threshold=0.5):
        """预测类别"""
        proba = self.predict_proba(X)
        return (proba[:, 1] > threshold).astype(int)

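Comparing weight strategies

Uniform weighting

Strategy: all samples have equal weight
Effect: equivalent to the classic Elkanoto method
Best suited for: relatively clean data with little noise

Confidence-based weighting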

def advanced_confidence_weights(predictions, y, alpha=0.3, beta=2.0):
    """
    高级置信度权重策略
    alpha: 基础权重
    beta: 置信度放大系数
    """
    # 基础置信度
    base_confidence = np.abs(predictions - 0.5) * 2
    
    # 为正例和预测困难的样本分配更高权重
    difficulty_weights = 1 + beta * (1 - base_confidence)
    
    # 结合基础权重
    weights = alpha + (1 - alpha) * base_confidence * difficulty_weights
    
    # 为正例样本额外加权
    positive_mask = (y == 1)
    weights[positive_mask] *= 1.5  # 正例权重提升50%
    
    return np.clip(weights, 0.1, 3.0)  # 限制权重范围

Distance-based weighting

def mahalanobis_distance_weights(X, y, positive_indices):
    """
    基于马氏距离的权重分配
    """
    from scipy.spatial.distance import mahalanobis
    from sklearn.covariance import LedoitWolf
    
    if len(positive_indices) < 2:
        return np.ones(len(X))
    
    # 计算正例的均值和协方差
    X_positive = X[positive_indices]
    centroid = np.mean(X_positive, axis=0)
    
    # 使用Ledoit-Wolf估计器计算稳健的协方差矩阵
    cov_estimator = LedoitWolf().fit(X_positive)
    cov_matrix = cov_estimator.covariance_
    
    try:
        # 计算马氏距离
        inv_cov = np.linalg.pinv(cov_matrix)
        distances = np.array([mahalanobis(x, centroid, inv_cov) 
                            for x in X])
        
        # 将距离转换为权重（距离越小，权重越大）
        max_dist = np.max(distances)
        weights = 1 - (distances / max_dist)
        
    except np.linalg.LinAlgError:
        # 如果协方差矩阵奇异，使用欧氏距离
        distances = np.linalg.norm(X - centroid, axis=1)
        max_dist = np.max(distances)
        weights = 1 - (distances / max_dist)
    
    return np.clip(weights, 0.1, 1.0)

Practical Example

Complete usage example

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Generate example data
def create_pu_data(n_samples=1000, positive_ratio=0.3, labeled_ratio=0.4, noise_ratio=0.1,
                   random_state=42):
    """Create a noisy PU dataset; returns (X_noisy, y_true, y_pu)."""
    rng = np.random.RandomState(random_state)  # local RNG so the data is reproducible
    X, y_true = make_classification(
        n_samples=n_samples, 
        n_features=20,
        n_informative=10,
        n_redundant=5,
        n_clusters_per_class=1,
        weights=[1 - positive_ratio, positive_ratio],  # class balance
        flip_y=noise_ratio,  # label noise
        random_state=random_state
    )
    
    # Build PU labels: 1 = labeled positive, 0 = unlabeled
    y_pu = np.zeros_like(y_true)
    positive_indices = np.where(y_true == 1)[0]
    
    # Randomly select a fraction of the positives to be labeled
    n_labeled = int(len(positive_indices) * labeled_ratio)
    labeled_indices = rng.choice(positive_indices, n_labeled, replace=False)
    y_pu[labeled_indices] = 1
    
    # Add feature noise to about 10% of the samples (simulated measurement error)
    noise_mask = rng.random_sample(len(X)) < 0.1
    X_noisy = X.copy()
    X_noisy[noise_mask] += rng.normal(0, 0.5, X[noise_mask].shape)
    
    return X_noisy, y_true, y_pu

# 比较不同权重策略
def compare_weight_strategies():
    """比较不同权重策略的性能"""
    X, y_true, y_pu = create_pu_data()
    
    # Split once so that the PU labels and the true labels stay aligned
    X_train, X_test, y_pu_train, y_pu_test, _, y_true_test = train_test_split(
        X, y_pu, y_true, test_size=0.3, random_state=42, stratify=y_pu
    )
    
    strategies = ['uniform', 'confidence', 'distance']
    results = {}
    
    for strategy in strategies:
        print(f"\n=== 测试权重策略: {strategy} ===")
        
        # 创建加权Elkanoto分类器
        base_rf = RandomForestClassifier(n_estimators=100, random_state=42)
        pu_classifier = WeightedElkanotoPUClassifier(
            base_estimator=base_rf,
            hold_out_ratio=0.3,
            weight_strategy=strategy,
            n_iter=5,
            random_state=42
        )
        
        # 训练模型
        pu_classifier.fit(X_train, y_pu_train)
        
        # 预测
        y_prob = pu_classifier.predict_proba(X_test)[:, 1]
        y_pred = (y_prob > 0.5).astype(int)
        
        # 评估
        ap_score = average_precision_score(y_true_test, y_prob)
        precision, recall, _ = precision_recall_curve(y_true_test, y_prob)
        
        results[strategy] = {
            'ap_score': ap_score,
            'precision': precision,
            'recall': recall,
            'prob_s1y1': pu_classifier.prob_s1y1_,
            'classifier': pu_classifier
        }
        
        print(f"P(s=1|y=1)估计值: {pu_classifier.prob_s1y1_:.4f}")
        print(f"平均精度 (AP): {ap_score:.4f}")
    
    return results, y_true_test

# 可视化比较结果
def plot_comparison(results, y_true):
    """可视化不同权重策略的比较结果"""
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            'Precision-recall curves',
            'Average precision',
            'Estimated P(s=1|y=1)',
            ''  # the fourth panel is left empty; no trace is added for it
        )
    )
    
    # 1. 精确率-召回率曲线
    for strategy, result in results.items():
        fig.add_trace(
            go.Scatter(
                x=result['recall'], 
                y=result['precision'],
                mode='lines',
                name=f'{strategy} (AP={result["ap_score"]:.3f})',
                line=dict(width=3)
            ),
            row=1, col=1
        )
    
    # 2. 平均精度比较
    strategies = list(results.keys())
    ap_scores = [results[s]['ap_score'] for s in strategies]
    
    fig.add_trace(
        go.Bar(x=strategies, y=ap_scores, 
               marker_color=['blue', 'red', 'green']),
        row=1, col=2
    )
    
    # 3. P(s=1|y=1)估计值比较
    prob_s1y1_values = [results[s]['prob_s1y1'] for s in strategies]
    fig.add_trace(
        go.Bar(x=strategies, y=prob_s1y1_values,
               marker_color=['lightblue', 'lightcoral', 'lightgreen']),
        row=2, col=1
    )
    
    # 更新布局
    fig.update_layout(
        height=800,
        title_text="加权Elkanoto方法不同权重策略比较",
        showlegend=True
    )
    
    fig.update_xaxes(title_text="召回率", row=1, col=1)
    fig.update_yaxes(title_text="精确率", row=1, col=1)
    fig.update_xaxes(title_text="策略", row=1, col=2)
    fig.update_yaxes(title_text="平均精度", row=1, col=2)
    fig.update_xaxes(title_text="策略", row=2, col=1)
    fig.update_yaxes(title_text="P(s=1|y=1)估计值", row=2, col=1)
    
    fig.show()

# 运行比较
results, y_true_test = compare_weight_strategies()
plot_comparison(results, y_true_test)

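Advantages and Applicable Scenarios

Main advantages

More robust: less sensitive to noise and outliers
Handles imbalanced data: sample weights compensate for class imbalance
Faster convergence: focusing on informative samples speeds up training
Flexible: supports several weighting strategies for different scenarios

Applicable scenarios

Noisy data: the training data contains a substantial amount of label noise
Severe class imbalance: labeled positives are far fewer than unlabeled samples
Heterogeneous data: different sample groups follow different feature distributions
Online learning: weights need to be updated incrementally

Parameter tuning suggestions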

# Suggested search ranges for the key parameters
optimal_params = {
    'hold_out_ratio': [0.2, 0.3, 0.4],  # hold-out fraction
    'weight_strategy': ['uniform', 'confidence', 'distance'],
    'n_iter': [5, 10, 15],  # number of iterations
}

# Suggested starting points by dataset size
size_recommendations = {
    'small_dataset': {'n_iter': 5, 'hold_out_ratio': 0.2},
    'medium_dataset': {'n_iter': 10, 'hold_out_ratio': 0.3},
    'large_dataset': {'n_iter': 15, 'hold_out_ratio': 0.4},
}
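
A search grid like the one above still has to be expanded into individual candidate configurations before anything can be trained. A minimal sketch using only the standard library follows; the helper name `expand_grid` is an assumption, and the grid is repeated so the snippet runs on its own.

```python
from itertools import product

# Same search space as in the tuning suggestions
optimal_params = {
    'hold_out_ratio': [0.2, 0.3, 0.4],
    'weight_strategy': ['uniform', 'confidence', 'distance'],
    'n_iter': [5, 10, 15],
}


def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(optimal_params))
print(len(candidates))  # 3 * 3 * 3 combinations
print(candidates[0])
```

Each candidate dict could then be unpacked into the classifier constructor and scored on a validation split.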

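Bagging-based PU Learning

Bagging-based PU learning is an ensemble method that combines multiple weak classifiers to improve performance on PU learning tasks. It is particularly well suited to the uncertainty and noise inherent in PU learning.

Background and Motivation

Challenges of PU learning

PU learning faces two main challenges:

Label uncertainty: the unlabeled set contains both positives and negatives
Sample selection bias: the labeled positives may not be representative of all positives

Advantages of Bagging

Bagging (Bootstrap Aggregating) addresses these challenges in the following ways:

Variance reduction: combining multiple models lowers the risk of overfitting
Handling uncertainty: each model sees a different random subset of the unlabeled data, so errors caused by hidden positives tend to average out
Robustness: the ensemble is more stable in the presence of noise and outliers

Principles and Theoretical Basis

Core idea

The core idea of Bagging-based PU learning is to create multiple training subsets by repeated bootstrap sampling, train one base classifier on each subset, and combine their predictions by voting or averaging.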
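
The sampling step described above can be sketched without any machine-learning library. The snippet below is a simplified illustration under stated assumptions: the function name `draw_pu_bag` is hypothetical, labels use 1 for labeled positives and 0 for unlabeled samples, and each bag draws as many unlabeled samples as there are positives.

```python
import random


def draw_pu_bag(labels, rng):
    """Draw one training bag from PU labels (1 = labeled positive, 0 = unlabeled).

    Returns (indices, targets): positives are bootstrapped, and an equally
    sized random draw of unlabeled samples is treated as provisional negatives.
    """
    positives = [i for i, s in enumerate(labels) if s == 1]
    unlabeled = [i for i, s in enumerate(labels) if s == 0]
    pos_draw = rng.choices(positives, k=len(positives))  # with replacement
    neg_draw = rng.choices(unlabeled, k=len(positives))  # with replacement
    return pos_draw + neg_draw, [1] * len(pos_draw) + [0] * len(neg_draw)


labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rng = random.Random(0)
bags = [draw_pu_bag(labels, rng) for _ in range(3)]
for indices, targets in bags:
    print(indices, targets)
```

Because every bag draws a different subset of the unlabeled samples, a hidden positive is treated as a negative in only some of the bags, which is what lets the averaged prediction recover from it.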

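Mathematical formulation

Consider a PU dataset $D = (X_P, X_U)$, where $X_P$ is the set of labeled positives and $X_U$ is the set of unlabeled samples.

The goal of Bagging PU learning is to learn a function $f: X \rightarrow [0,1]$ of the form $f(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$

where $B$ is the number of base classifiers and $f_b$ is the $b$-th base classifier.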
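
As a small worked instance of the averaging formula, the sketch below computes the ensemble score for a single sample. The function name `bagged_score` and the four base scores are made up for illustration.

```python
def bagged_score(base_scores):
    """Average the per-classifier scores f_b(x) for one sample x."""
    if not base_scores:
        raise ValueError("at least one base classifier score is required")
    return sum(base_scores) / len(base_scores)


# Hypothetical outputs f_1(x)..f_4(x) of B = 4 base classifiers for one sample
scores = [0.9, 0.7, 0.8, 0.4]
print(bagged_score(scores))
```

One dissenting classifier (0.4) lowers the ensemble score but does not flip the decision at a 0.5 threshold, which is the variance-reduction effect the method relies on.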

Algorithm Walkthrough

Basic Bagging PU algorithm

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils import resample
from sklearn.utils.validation import check_X_y, check_array

class BaggingPUClassifier(BaseEstimator, ClassifierMixin):
    """
    基于 Bagging 的 PU 分类器
    """
    
    def __init__(self, base_estimator=None, n_estimators=10, 
                 max_samples=1.0, max_features=1.0, 
                 bootstrap=True, random_state=None):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.max_features = max_features
        self.bootstrap = bootstrap
        self.random_state = random_state
        self.estimators_ = []
        self.feature_indices_ = []
        
    def _create_training_set(self, X, y, random_state):
        """
        创建 PU 学习的训练集
        关键：从未标记样本中抽取样本作为负例
        """
        # 分离正例和未标记样本
        positive_indices = np.where(y == 1)[0]
        unlabeled_indices = np.where(y == 0)[0]
        
        # 正例采样
        n_pos_samples = len(positive_indices)
        if self.bootstrap:
            pos_sample_indices = resample(
                positive_indices, 
                n_samples=n_pos_samples, 
                replace=True, 
                random_state=random_state
            )
        else:
            pos_sample_indices = positive_indices
        
        # 从未标记样本中抽取负例
        n_neg_samples = int(len(pos_sample_indices) * self.max_samples)
        neg_sample_indices = resample(
            unlabeled_indices,
            n_samples=n_neg_samples,
            replace=True,
            random_state=random_state
        )
        
        # 合并正例和负例样本索引
        sample_indices = np.concatenate([pos_sample_indices, neg_sample_indices])
        
        # 特征采样
        n_features = X.shape[1]
        n_selected_features = int(n_features * self.max_features)
        feature_indices = np.random.choice(
            n_features, 
            size=n_selected_features, 
            replace=False
        )
        
        return sample_indices, feature_indices
    
    def fit(self, X, y):
        """
        训练 Bagging PU 分类器
        """
        X, y = check_X_y(X, y)
        np.random.seed(self.random_state)
        
        self.estimators_ = []
        self.feature_indices_ = []
        
        for i in range(self.n_estimators):
            # Per-estimator RNG for reproducibility ("is not None" keeps random_state=0 valid)
            random_state = np.random.RandomState(self.random_state + i) if self.random_state is not None else None
            
            # 创建训练子集
            sample_indices, feature_indices = self._create_training_set(X, y, random_state)
            
            # 提取子集数据
            X_subset = X[sample_indices][:, feature_indices]
            y_subset = y[sample_indices]
            
            # 创建并训练基分类器
            estimator = clone(self.base_estimator)
            estimator.fit(X_subset, y_subset)
            
            # 保存模型和特征索引
            self.estimators_.append(estimator)
            self.feature_indices_.append(feature_indices)
            
        return self
    
    def predict_proba(self, X):
        """
        预测概率
        """
        X = check_array(X)
        n_samples = X.shape[0]
        
        # 收集所有基分类器的预测
        all_probas = []
        for estimator, feature_indices in zip(self.estimators_, self.feature_indices_):
            X_sub = X[:, feature_indices]
            probas = estimator.predict_proba(X_sub)
            all_probas.append(probas)
        
        # 平均概率
        avg_proba = np.mean(all_probas, axis=0)
        return avg_proba
    
    def predict(self, X, threshold=0.5):
        """
        预测类别
        """
        proba = self.predict_proba(X)
        return (proba[:, 1] > threshold).astype(int)

Improved Bagging PU algorithm

class ImprovedBaggingPUClassifier(BaggingPUClassifier):
    """
    改进的 Bagging PU 分类器
    添加了权重和置信度评估
    """
    
    def __init__(self, base_estimator=None, n_estimators=10, 
                 max_samples=1.0, max_features=1.0, 
                 bootstrap=True, random_state=None,
                 confidence_threshold=0.7, use_weighting=True):
        super().__init__(base_estimator, n_estimators, max_samples, 
                        max_features, bootstrap, random_state)
        self.confidence_threshold = confidence_threshold
        self.use_weighting = use_weighting
        self.estimator_weights_ = []
    
    def fit(self, X, y):
        """
        训练改进的 Bagging PU 分类器
        """
        X, y = check_X_y(X, y)
        np.random.seed(self.random_state)
        
        self.estimators_ = []
        self.feature_indices_ = []
        self.estimator_weights_ = []
        
        for i in range(self.n_estimators):
            random_state = np.random.RandomState(self.random_state + i) if self.random_state is not None else None
            
            # 创建训练子集
            sample_indices, feature_indices = self._create_training_set(X, y, random_state)
            X_subset = X[sample_indices][:, feature_indices]
            y_subset = y[sample_indices]
            
            # 训练基分类器
            estimator = clone(self.base_estimator)
            estimator.fit(X_subset, y_subset)
            
            # 计算基分类器的权重（基于在标记数据上的性能）
            if self.use_weighting:
                # 使用标记的正例评估性能
                labeled_pos_indices = np.where(y == 1)[0]
                if len(labeled_pos_indices) > 0:
                    X_labeled_pos = X[labeled_pos_indices][:, feature_indices]
                    y_labeled_pos = y[labeled_pos_indices]
                    
                    # 计算在标记正例上的准确率作为权重
                    accuracy = estimator.score(X_labeled_pos, y_labeled_pos)
                    weight = max(0.1, accuracy)  # 确保权重不为0
                else:
                    weight = 1.0
            else:
                weight = 1.0
                
            # 保存模型、特征索引和权重
            self.estimators_.append(estimator)
            self.feature_indices_.append(feature_indices)
            self.estimator_weights_.append(weight)
            
        return self
    
    def predict_proba(self, X):
        """
        加权平均概率预测
        """
        X = check_array(X)
        n_samples = X.shape[0]
        
        # 收集所有基分类器的预测
        all_probas = []
        for i, (estimator, feature_indices) in enumerate(zip(self.estimators_, self.feature_indices_)):
            X_sub = X[:, feature_indices]
            probas = estimator.predict_proba(X_sub)
            
            # 应用权重
            if self.use_weighting:
                weight = self.estimator_weights_[i]
                # 对概率进行加权
                weighted_probas = probas * weight
                all_probas.append(weighted_probas)
            else:
                all_probas.append(probas)
        
        # 加权平均概率
        if self.use_weighting:
            total_weight = sum(self.estimator_weights_)
            avg_proba = np.sum(all_probas, axis=0) / total_weight
        else:
            avg_proba = np.mean(all_probas, axis=0)
            
        return avg_proba
    
    def predict_with_confidence(self, X):
        """
        返回预测结果和置信度
        """
        proba = self.predict_proba(X)
        predictions = (proba[:, 1] > 0.5).astype(int)
        
        # 计算置信度（基于概率与决策边界的距离）
        confidence = np.abs(proba[:, 1] - 0.5) * 2
        
        return predictions, confidence
    
    def get_estimator_performance(self):
        """
        获取各个基分类器的性能评估
        """
        performances = []
        for i, weight in enumerate(self.estimator_weights_):
            performances.append({
                'estimator_index': i,
                'weight': weight,
                'n_features': len(self.feature_indices_[i])
            })
        return performances
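
The weighted combination and the confidence measure used by the improved classifier reduce to two short formulas. The following pure-Python sketch evaluates them for one sample; the function names and the example scores and weights are assumptions chosen for illustration.

```python
def weighted_bagged_score(base_scores, weights):
    """Weighted mean of base classifier scores: sum(w_b * f_b) / sum(w_b)."""
    if len(base_scores) != len(weights):
        raise ValueError("base_scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, base_scores)) / total_weight


def confidence(score):
    """Distance from the 0.5 decision boundary, rescaled to [0, 1]."""
    return abs(score - 0.5) * 2


# Hypothetical scores and weights for three base classifiers
scores = [0.9, 0.6, 0.2]
weights = [1.0, 0.5, 0.1]
combined = weighted_bagged_score(scores, weights)
print(round(combined, 4), round(confidence(combined), 4))
```

The low-weight classifier that outputs 0.2 barely moves the result, whereas in an unweighted mean it would count as much as the other two.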

Key Techniques and Optimization Strategies

Sampling strategies

def advanced_sampling_strategy(X, y, sampling_method='balanced', 
                               positive_weight=1.5, random_state=None):
    """
    高级采样策略
    """
    positive_indices = np.where(y == 1)[0]
    unlabeled_indices = np.where(y == 0)[0]
    
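    if sampling_method == 'balanced':
        # Balanced: as many provisional negatives as positives
        n_pos = len(positive_indices)
        n_neg = min(len(unlabeled_indices), n_pos)
        
    elif sampling_method == 'weighted':
        # Weighted: the negative sample size is positive_weight times the positive count
        n_pos = len(positive_indices)
        n_neg = int(n_pos * positive_weight)
        
    elif sampling_method == 'proportional':
        # Proportional: keep the unlabeled-to-positive ratio of the full dataset
        total_size = len(positive_indices) + len(unlabeled_indices)
        pos_ratio = len(positive_indices) / total_size
        n_pos = len(positive_indices)
        n_neg = int(n_pos * (1 - pos_ratio) / pos_ratio)
    
    else:
        # Without this branch an unknown name would fail later with a NameError
        raise ValueError(f"Unknown sampling_method: {sampling_method!r}")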
    
    # 执行采样
    pos_samples = resample(
        positive_indices, 
        n_samples=n_pos, 
        replace=True, 
        random_state=random_state
    )
    
    neg_samples = resample(
        unlabeled_indices,
        n_samples=min(n_neg, len(unlabeled_indices)),
        replace=True,
        random_state=random_state
    )
    
    return np.concatenate([pos_samples, neg_samples])

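# Integrate the strategy into the Bagging PU classifier
class AdvancedBaggingPUClassifier(ImprovedBaggingPUClassifier):
    def __init__(self, base_estimator=None, n_estimators=10, 
                 max_samples=1.0, max_features=1.0, 
                 bootstrap=True, random_state=None,
                 confidence_threshold=0.7, use_weighting=True,
                 sampling_method='balanced', positive_weight=1.5):
        # Explicit parameters (no **kwargs) so that get_params/clone see all of them
        super().__init__(base_estimator, n_estimators, max_samples, 
                        max_features, bootstrap, random_state,
                        confidence_threshold, use_weighting)
        self.sampling_method = sampling_method
        self.positive_weight = positive_weight
    
    def _create_training_set(self, X, y, random_state):
        # Use the advanced sampling strategy for the row indices
        sample_indices = advanced_sampling_strategy(
            X, y, self.sampling_method, self.positive_weight, random_state
        )
        
        # Feature sampling. np.random.choice has no random_state argument,
        # so draw from the RandomState passed in (or the global RNG if None).
        rng = random_state if random_state is not None else np.random
        n_features = X.shape[1]
        n_selected_features = int(n_features * self.max_features)
        feature_indices = rng.choice(
            n_features, 
            size=n_selected_features, 
            replace=False
        )
        
        return sample_indices, feature_indices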
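
The three sampling methods differ only in how many unlabeled samples they draw as provisional negatives. The sketch below isolates that arithmetic in pure Python; `negative_sample_size` is a hypothetical helper, and the counts (250 labeled positives, 750 unlabeled samples) are illustrative.

```python
def negative_sample_size(n_pos, n_unlabeled, method, positive_weight=1.5):
    """Number of unlabeled samples drawn as provisional negatives."""
    if method == 'balanced':
        n_neg = n_pos
    elif method == 'weighted':
        n_neg = int(n_pos * positive_weight)
    elif method == 'proportional':
        pos_ratio = n_pos / (n_pos + n_unlabeled)
        n_neg = int(n_pos * (1 - pos_ratio) / pos_ratio)
    else:
        raise ValueError(f"unknown method: {method!r}")
    # The sampler never requests more than the number of unlabeled samples
    return min(n_neg, n_unlabeled)


for method in ('balanced', 'weighted', 'proportional'):
    print(method, negative_sample_size(250, 750, method))
```

Note that 'proportional' reproduces the original class ratio, so on heavily imbalanced data each bag is dominated by unlabeled samples.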

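Diversity enhancement

def enhance_diversity(estimators, feature_indices, X, y):
    """
    Score each base classifier by its average disagreement with the others.
    """
    # Disagreement is measured on the labeled positives only
    labeled_indices = np.where(y == 1)[0]
    n_estimators = len(estimators)
    if n_estimators < 2 or len(labeled_indices) == 0:
        # Nothing to compare against (also avoids dividing by zero below)
        return [0.0] * n_estimators
    
    X_labeled = X[labeled_indices]
    # Predict once per classifier instead of once per pair
    predictions = [
        est.predict(X_labeled[:, feats])
        for est, feats in zip(estimators, feature_indices)
    ]
    
    diversities = []
    for i in range(n_estimators):
        # Sum of the disagreement rates between classifier i and every other classifier
        diversity_score = sum(
            np.mean(predictions[i] != predictions[j])
            for j in range(n_estimators) if j != i
        )
        diversities.append(diversity_score / (n_estimators - 1))
    
    return diversities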

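class DiversityEnhancedBaggingPU(AdvancedBaggingPUClassifier):
    """
    Diversity-enhanced Bagging PU classifier
    """
    
    def __init__(self, base_estimator=None, n_estimators=10, 
                 max_samples=1.0, max_features=1.0, 
                 bootstrap=True, random_state=None,
                 confidence_threshold=0.7, use_weighting=True,
                 sampling_method='balanced', positive_weight=1.5,
                 diversity_weight=0.3):
        # Explicit parameters (no **kwargs) so that get_params/clone see all of them
        super().__init__(
            base_estimator, n_estimators,
            max_samples=max_samples, max_features=max_features,
            bootstrap=bootstrap, random_state=random_state,
            confidence_threshold=confidence_threshold, use_weighting=use_weighting,
            sampling_method=sampling_method, positive_weight=positive_weight,
        )
        self.diversity_weight = diversity_weight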
    
    def fit(self, X, y):
        # 首先进行标准训练
        super().fit(X, y)
        
        # 计算多样性并调整权重
        diversities = enhance_diversity(self.estimators_, self.feature_indices_, X, y)
        
        # 结合准确率和多样性调整权重
        for i in range(len(self.estimators_)):
            accuracy_based_weight = self.estimator_weights_[i]
            diversity_based_weight = diversities[i]
            
            # 组合权重
            combined_weight = (1 - self.diversity_weight) * accuracy_based_weight + \
                             self.diversity_weight * diversity_based_weight
            
            self.estimator_weights_[i] = combined_weight
        
        return self
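
The diversity score and the weight blending can be traced on a toy case. The sketch below reimplements both in pure Python on hand-written predictions; the function names and the three prediction vectors are assumptions made for illustration.

```python
def disagreement(pred_a, pred_b):
    """Fraction of samples on which two classifiers predict different labels."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def diversity_scores(all_predictions):
    """Mean disagreement of each classifier with every other classifier."""
    n = len(all_predictions)
    return [
        sum(disagreement(all_predictions[i], all_predictions[j])
            for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def combine(accuracy_weight, diversity, diversity_weight=0.3):
    """Convex combination of the accuracy-based weight and the diversity score."""
    return (1 - diversity_weight) * accuracy_weight + diversity_weight * diversity


# Hypothetical predictions of three classifiers on four labeled positives
preds = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 0, 1]]
scores = diversity_scores(preds)
print([round(s, 3) for s in scores])
print(round(combine(1.0, scores[0]), 3))
```

Since disagreement rates are usually well below 1, blending them in lowers every weight somewhat; what matters for the final prediction is the relative ordering of the weights, because they are normalized by their sum.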

Practical Application and Evaluation

Complete application example

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score, classification_report
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def demonstrate_bagging_pu():
    """
    演示基于 Bagging 的 PU 学习
    """
    # 1. 生成 PU 数据
    print("=== 生成 PU 数据 ===")
    X, y_true = make_classification(
        n_samples=2000, n_features=20, n_informative=10,
        n_redundant=5, n_clusters_per_class=1,
        flip_y=0.05, random_state=42
    )
    
    # 创建 PU 标签（仅标记部分正例）
    y_pu = np.zeros_like(y_true)
    positive_indices = np.where(y_true == 1)[0]
    labeled_ratio = 0.3  # 仅标记30%的正例
    n_labeled = int(len(positive_indices) * labeled_ratio)
    labeled_indices = np.random.choice(positive_indices, n_labeled, replace=False)
    y_pu[labeled_indices] = 1
    
    print(f"数据统计:")
    print(f"- 总样本数: {len(y_true)}")
    print(f"- 真实正例数: {sum(y_true)}")
    print(f"- 标记的正例数: {sum(y_pu)}")
    print(f"- 未标记样本中的隐藏正例数: {sum(y_true) - sum(y_pu)}")
    
    # 2. Train/test split (a single call keeps the PU labels and the true labels aligned)
    X_train, X_test, y_pu_train, y_pu_test, _, y_true_test = train_test_split(
        X, y_pu, y_true, test_size=0.3, random_state=42, stratify=y_pu
    )
    
    # 3. 比较不同方法
    print("\n=== 方法比较 ===")
    
    # 方法1: 标准分类器（错误地将未标记样本视为负例）
    print("\n--- 方法1: 标准分类器（朴素方法） ---")
    naive_rf = RandomForestClassifier(n_estimators=100, random_state=42)
    naive_rf.fit(X_train, y_pu_train)
    y_pred_naive = naive_rf.predict(X_test)
    y_prob_naive = naive_rf.predict_proba(X_test)[:, 1]
    ap_naive = average_precision_score(y_true_test, y_prob_naive)
    print(f"平均精度 (AP): {ap_naive:.4f}")
    
    # 方法2: 基本 Bagging PU
    print("\n--- 方法2: 基本 Bagging PU ---")
    base_svm = SVC(C=1.0, kernel='rbf', probability=True, random_state=42)
    bagging_pu_basic = BaggingPUClassifier(
        base_estimator=base_svm,
        n_estimators=20,
        max_samples=0.8,
        max_features=0.8,
        random_state=42
    )
    bagging_pu_basic.fit(X_train, y_pu_train)
    y_prob_basic = bagging_pu_basic.predict_proba(X_test)[:, 1]
    ap_basic = average_precision_score(y_true_test, y_prob_basic)
    print(f"平均精度 (AP): {ap_basic:.4f}")
    
    # 方法3: 改进的 Bagging PU
    print("\n--- 方法3: 改进的 Bagging PU ---")
    bagging_pu_improved = ImprovedBaggingPUClassifier(
        base_estimator=RandomForestClassifier(n_estimators=50, random_state=42),
        n_estimators=30,
        max_samples=0.8,
        max_features=0.7,
        use_weighting=True,
        random_state=42
    )
    bagging_pu_improved.fit(X_train, y_pu_train)
    y_prob_improved = bagging_pu_improved.predict_proba(X_test)[:, 1]
    ap_improved = average_precision_score(y_true_test, y_prob_improved)
    print(f"平均精度 (AP): {ap_improved:.4f}")
    
    # 方法4: 多样性增强的 Bagging PU
    print("\n--- 方法4: 多样性增强的 Bagging PU ---")
    bagging_pu_diverse = DiversityEnhancedBaggingPU(
        base_estimator=RandomForestClassifier(n_estimators=50, random_state=42),
        n_estimators=30,
        sampling_method='balanced',
        diversity_weight=0.3,
        random_state=42
    )
    bagging_pu_diverse.fit(X_train, y_pu_train)
    y_prob_diverse = bagging_pu_diverse.predict_proba(X_test)[:, 1]
    ap_diverse = average_precision_score(y_true_test, y_prob_diverse)
    print(f"平均精度 (AP): {ap_diverse:.4f}")
    
    # 4. 可视化比较结果
    print("\n=== 结果可视化 ===")
    visualize_comparison(
        y_true_test, 
        [y_prob_naive, y_prob_basic, y_prob_improved, y_prob_diverse],
        ['朴素方法', '基本BaggingPU', '改进BaggingPU', '多样性BaggingPU'],
        [ap_naive, ap_basic, ap_improved, ap_diverse]
    )
    
    return {
        'naive': {'prob': y_prob_naive, 'ap': ap_naive},
        'basic': {'prob': y_prob_basic, 'ap': ap_basic},
        'improved': {'prob': y_prob_improved, 'ap': ap_improved},
        'diverse': {'prob': y_prob_diverse, 'ap': ap_diverse},
        'y_true': y_true_test
    }

def visualize_comparison(y_true, probabilities, method_names, ap_scores):
    """
    可视化比较结果
    """
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            'Precision-recall curves',
            'Average precision',
            'Predicted probability distribution',
            'Improvement over the naive method'
        )
    )
    
    # 1. 精确率-召回率曲线
    colors = ['red', 'blue', 'green', 'purple']
    for i, (y_prob, name, color) in enumerate(zip(probabilities, method_names, colors)):
        precision, recall, _ = precision_recall_curve(y_true, y_prob)
        fig.add_trace(
            go.Scatter(x=recall, y=precision, mode='lines', 
                      name=f'{name} (AP={ap_scores[i]:.3f})',
                      line=dict(color=color, width=2)),
            row=1, col=1
        )
    
    # 2. 平均精度比较
    fig.add_trace(
        go.Bar(x=method_names, y=ap_scores, 
               marker_color=colors,
               text=[f'{score:.3f}' for score in ap_scores],
               textposition='auto'),
        row=1, col=2
    )
    
    # 3. 预测概率分布（以多样性方法为例）
    y_prob_diverse = probabilities[3]
    fig.add_trace(
        go.Histogram(x=y_prob_diverse, nbinsx=30, name='预测概率分布',
                    marker_color='lightblue', opacity=0.7),
        row=2, col=1
    )
    
    # 添加决策阈值线
    fig.add_vline(x=0.5, line_dash="dash", line_color="red", row=2, col=1)
    
    # 4. 方法性能提升百分比
    improvement = [(ap_scores[i] - ap_scores[0]) / ap_scores[0] * 100 
                  for i in range(1, len(ap_scores))]
    fig.add_trace(
        go.Bar(x=method_names[1:], y=improvement,
               marker_color=colors[1:],
               text=[f'{imp:.1f}%' for imp in improvement],
               textposition='auto'),
        row=2, col=2
    )
    
    fig.update_layout(
        height=800,
        title_text="基于Bagging的PU学习方法比较",
        showlegend=True
    )
    
    fig.update_xaxes(title_text="召回率", row=1, col=1)
    fig.update_yaxes(title_text="精确率", row=1, col=1)
    fig.update_xaxes(title_text="方法", row=1, col=2)
    fig.update_yaxes(title_text="平均精度", row=1, col=2)
    fig.update_xaxes(title_text="预测概率", row=2, col=1)
    fig.update_yaxes(title_text="频数", row=2, col=1)
    fig.update_xaxes(title_text="方法", row=2, col=2)
    fig.update_yaxes(title_text="相对于朴素方法的提升 (%)", row=2, col=2)
    
    fig.show()

# 运行演示
results = demonstrate_bagging_pu()

Parameter Tuning and Model Selection

def parameter_tuning_example():
    """
    参数调优示例
    """
    # 生成数据
    X, y_true = make_classification(n_samples=1500, n_features=15, random_state=42)
    y_pu = create_pu_labels(y_true, labeled_ratio=0.3)
    
    # 参数网格
    param_grid = {
        'n_estimators': [10, 20, 30, 50],
        'max_samples': [0.5, 0.7, 0.8, 1.0],
        'max_features': [0.5, 0.7, 0.8, 1.0],
        'sampling_method': ['balanced', 'weighted', 'proportional']
    }
    
    best_score = 0
    best_params = {}
    
    # 简化的网格搜索（实际中应使用交叉验证）
    for n_est in param_grid['n_estimators']:
        for max_samp in param_grid['max_samples']:
            for max_feat in param_grid['max_features']:
                for samp_method in param_grid['sampling_method']:
                    
                    # 创建分类器
                    classifier = AdvancedBaggingPUClassifier(
                        base_estimator=RandomForestClassifier(n_estimators=50),
                        n_estimators=n_est,
                        max_samples=max_samp,
                        max_features=max_feat,
                        sampling_method=samp_method,
                        random_state=42
                    )
                    
                    # 训练和评估（简化版）
                    X_train, X_test, y_pu_train, y_pu_test = train_test_split(
                        X, y_pu, test_size=0.3, random_state=42
                    )
                    _, _, _, y_true_test = train_test_split(
                        X, y_true, test_size=0.3, random_state=42
                    )
                    
                    classifier.fit(X_train, y_pu_train)
                    y_prob = classifier.predict_proba(X_test)[:, 1]
                    ap_score = average_precision_score(y_true_test, y_prob)
                    
                    if ap_score > best_score:
                        best_score = ap_score
                        best_params = {
                            'n_estimators': n_est,
                            'max_samples': max_samp,
                            'max_features': max_feat,
                            'sampling_method': samp_method
                        }
    
    print(f"最佳参数: {best_params}")
    print(f"最佳平均精度: {best_score:.4f}")
    
    return best_params, best_score

def create_pu_labels(y_true, labeled_ratio=0.3):
    """
    创建PU标签
    """
    y_pu = np.zeros_like(y_true)
    positive_indices = np.where(y_true == 1)[0]
    n_labeled = int(len(positive_indices) * labeled_ratio)
    labeled_indices = np.random.choice(positive_indices, n_labeled, replace=False)
    y_pu[labeled_indices] = 1
    return y_pu

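Application Scenarios and Best Practices

Applicable scenarios

Bagging-based PU learning is particularly suitable in the following situations:

Noisy data: the unlabeled samples contain a large amount of noise
Few labeled positives: the number of labeled positives is limited
Heterogeneous data: the positive samples come from different distributions
Stability matters: predictions need to be stable and reliable

Best-practice recommendations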

class PracticalBaggingPUGuide:
    """
    实际应用指南
    """
    
    @staticmethod
    def recommend_parameters(data_size, positive_ratio, problem_type):
        """
        根据问题特性推荐参数
        """
        recommendations = {}
        
        if data_size < 1000:
            recommendations.update({
                'n_estimators': 10,
                'max_samples': 0.9,
                'max_features': 0.8
            })
        elif data_size < 10000:
            recommendations.update({
                'n_estimators': 20,
                'max_samples': 0.8,
                'max_features': 0.7
            })
        else:
            recommendations.update({
                'n_estimators': 30,
                'max_samples': 0.7,
                'max_features': 0.6
            })
        
        if positive_ratio < 0.1:
            recommendations['sampling_method'] = 'weighted'
            recommendations['positive_weight'] = 2.0
        else:
            recommendations['sampling_method'] = 'balanced'
        
        if problem_type == 'high_noise':
            recommendations.update({
                'n_estimators': 50,
                'use_weighting': True,
                'diversity_weight': 0.4
            })
        
        return recommendations
    
    @staticmethod
    def select_base_estimator(problem_characteristics):
        """
        根据问题特性选择基分类器
        """
        if problem_characteristics['n_features'] > 100:
            # 高维数据：使用线性模型或特征选择能力强的模型
            from sklearn.linear_model import LogisticRegression
            return LogisticRegression(penalty='l1', solver='liblinear')
        
        elif problem_characteristics['n_samples'] < 1000:
            # 小样本：使用简单模型避免过拟合
            from sklearn.svm import SVC
            return SVC(kernel='linear', probability=True)
        
        else:
            # 一般情况：使用随机森林
            from sklearn.ensemble import RandomForestClassifier
            return RandomForestClassifier(n_estimators=100)
    
    @staticmethod
    def evaluate_model_robustness(classifier, X, y, n_trials=10):
        """
        评估模型鲁棒性
        """
        ap_scores = []
        
        for i in range(n_trials):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.3, random_state=i
            )
            
            # 重新训练（使用不同的随机种子）
            trial_classifier = clone(classifier)
            trial_classifier.set_params(random_state=i)
            trial_classifier.fit(X_train, y_train)
            
            # 评估
            y_prob = trial_classifier.predict_proba(X_test)[:, 1]
            ap_score = average_precision_score(y_test, y_prob)
            ap_scores.append(ap_score)
        
        robustness_score = 1 - (np.std(ap_scores) / np.mean(ap_scores))
        return {
            'mean_ap': np.mean(ap_scores),
            'std_ap': np.std(ap_scores),
            'robustness': robustness_score
        }
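
The robustness value returned above is one minus the coefficient of variation of the per-trial scores. A standard-library sketch of that calculation follows; the function name `robustness` and both score lists are hypothetical.

```python
from statistics import mean, pstdev


def robustness(ap_scores):
    """1 - coefficient of variation; values near 1 mean stable scores."""
    avg = mean(ap_scores)
    if avg == 0:
        raise ValueError("mean score is zero; robustness is undefined")
    # pstdev is the population standard deviation, matching np.std's default
    return 1 - pstdev(ap_scores) / avg


# Hypothetical average-precision scores from five repeated trials
stable = [0.80, 0.82, 0.81, 0.79, 0.78]
unstable = [0.95, 0.60, 0.85, 0.55, 0.75]
print(round(robustness(stable), 3), round(robustness(unstable), 3))
```

The measure rewards consistency rather than accuracy, so it is best read alongside the mean score: a model that is consistently poor also scores close to 1.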

News article extraction tool: newspaper

钱魏Way — Sat, 15 Nov 2025 07:34:36 +0000

Newspaper3k

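Newspaper3k is a Python library dedicated to scraping news articles and extracting their content. It was developed by Lucas Ou-Yang, takes its inspiration from the simplicity of the Requests library, and uses lxml under the hood for fast parsing.

Core features

Article content extraction
- Automatically extracts the title, body text, authors, and publication date
- Extracts the top image and all related images
- Detects video links
- Supports multiple languages (10+)
Natural language processing
- Keyword extraction
- Automatic summarization
- Automatic language detection
Batch processing
- Can build an entire news source
- Multi-threaded download framework
- Automatically identifies category URLs

Basic usage example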

import newspaper  # needed for newspaper.build below
from newspaper import Article

# Process a single article
url = 'http://example.com/article'
article = Article(url)
article.download()
article.parse()

print(article.title)
print(article.authors)
print(article.text)

# NLP processing (keywords and summary)
article.nlp()
print(article.keywords)
print(article.summary)

# Process an entire news source
cnn_paper = newspaper.build('http://cnn.com')
for article in cnn_paper.articles:
    print(article.url)

Newspaper4k

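Newspaper4k is an actively maintained fork of Newspaper3k, maintained by AndyTheFactory. Because the original project has gone without updates for a long time, Newspaper4k aims to keep developing and improving the library.

Improvements and new features

Feature enhancements
- Supports more languages (80+)
- Google News integration
- Command-line interface (CLI)
- Multiple output formats (JSON, CSV, text)
Performance
- Improved article extraction algorithm
- Better language detection
- Enhanced multi-threaded processing
Developer experience
- Better documentation
- Type hints
- Stricter code quality checks

Basic usage example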

import newspaper

# Process a single article
article = newspaper.article('https://example.com/article')
print(article.authors)
print(article.publish_date)
print(article.text)

# Command-line usage
# python -m newspaper --url="URL" --language=en --output-format=json

# Google News integration
from newspaper.google_news import GoogleNewsSource
source = GoogleNewsSource(country="US", period="7d", max_results=10)
source.build(top_news=True)
source.download_articles()

Performance comparison

According to the benchmark data published by the project:

Version	BLEU score	F1 score
Newspaper3k 0.2.8	0.8660	0.9100
Newspaper4k 0.9.3	0.9531	0.9460

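Key differences

Feature	Newspaper3k	Newspaper4k
Maintenance status	No longer updated	Actively maintained
Python version	Python 3	Python 3.8+
Language support	10+	80+
Google News	Not supported	Supported
Command-line tool	No	Yes
Extraction quality	Baseline	Markedly improved
Documentation	Average	Better
Type hints	No	Yes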

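Recommendation

For new projects, Newspaper4k is strongly recommended because it:

Is actively maintained and receives bug fixes
Performs better
Offers more features
Provides a better developer experience

Existing projects that use Newspaper3k can consider migrating to Newspaper4k: the API remains largely compatible, so the migration cost is low.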


The Python @property Decorator Explained

钱魏Way — Tue, 25 Feb 2025 12:49:58 +0000

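Introduction to the @property decorator

In Python, the @property decorator is an elegant tool for managing attributes. It lets a method be accessed like an attribute (without calling it with ()), so that logic such as validation or on-the-fly computation can run whenever the attribute is accessed.

What @property is for

Hiding implementation details: expose attribute-style access while encapsulating complex logic internally.
Controlling attribute access: run custom logic (such as validation) when an attribute is read, set, or deleted.
Compatibility: change the internal implementation without breaking existing callers.

Basic usage: defining the getter, setter, and deleter

Defining the getter

Use the @property decorator to turn a method into a getter:

class Person:
    def __init__(self, name):
        self._name = name  # the actual stored value (conventionally prefixed with an underscore)

    @property
    def name(self):
        print("Getting name")
        return self._name

# Usage
p = Person("Alice")
print(p.name)  # prints "Getting name", then "Alice"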

Defining the setter

Use the @<property name>.setter decorator to define what happens on assignment:

class Person:
    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        if not isinstance(value, str):
            raise ValueError("Name must be a string!")
        print("Setting name")
        self._name = value

# Usage
p = Person("Alice")
p.name = "Bob"  # assignment succeeds
p.name = 123  # raises ValueError

Defining the deleter

Use the @<property name>.deleter decorator to define what happens on deletion:

class Person:
    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return self._name

    @name.deleter
    def name(self):
        print("Deleting name")
        del self._name

# Usage
p = Person("Alice")
del p.name  # prints: Deleting name

Typical use cases

Data validation

Prevent invalid values from being assigned:

class Circle:
    def __init__(self, radius):
        self.radius = radius  # goes through the setter, so the check also runs here

    @property
    def radius(self):
        return self._radius

    @radius.setter
    def radius(self, value):
        if value <= 0:
            raise ValueError("Radius must be positive")
        self._radius = value

Computed attributes

The attribute's value is derived from other attributes:

class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    @property
    def area(self):
        return self.width * self.height  # computed on every access

# Usage
rect = Rectangle(3, 4)
print(rect.area)  # prints: 12
rect.width = 5
print(rect.area)  # prints: 20 (reflects the new width)

Access control

Make an attribute read-only:

class User:
    def __init__(self, user_id):
        self._user_id = user_id

    @property
    def user_id(self):
        return self._user_id  # no setter, so it cannot be reassigned

# Usage
user = User(1001)
user.user_id = 2002  # raises AttributeError (the exact message depends on the Python version)

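How it works

What a property is: @property creates a property object that manages the attribute's get, set, and delete functions.
Descriptor protocol: property implements the descriptor protocol and intercepts attribute operations through __get__, __set__, and __delete__.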
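
The descriptor behaviour described above can be observed directly. The following is a minimal sketch, assuming a `Person` class like the one used earlier in this article; `fget`, `fset`, and `__get__` are standard attributes of the built-in `property` type.

```python
class Person:
    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return self._name


p = Person("Alice")

# The class attribute is a property object, not a plain method
prop = Person.__dict__["name"]
print(type(prop).__name__)      # property

# p.name is routed through the descriptor's __get__, which calls fget
print(prop.__get__(p, Person))  # Alice
print(prop.fget(p))             # Alice

# No setter was defined, so fset is None and assignment fails
print(prop.fset is None)        # True
try:
    p.name = "Bob"
except AttributeError as exc:
    print("AttributeError:", exc)
```

Looking the property up in `Person.__dict__` avoids triggering the descriptor, which is why the property object itself is returned there.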

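Caveats

Avoid side effects: a getter should not modify the object's state where possible.
Performance: consider caching the result of a property that is expensive to compute and read often.
Naming conflicts: the property must not share its name with the instance variable that stores the data (for example, store the data in _name and expose it as name).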
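
For the caching advice above, one option in the standard library is `functools.cached_property` (Python 3.8+). This is a minimal sketch; the `Report` class and its `compute_calls` counter are hypothetical names used only to make the caching visible.

```python
from functools import cached_property


class Report:
    def __init__(self, values):
        self._values = list(values)
        self.compute_calls = 0  # counts how often the expensive code ran

    @cached_property
    def total(self):
        # Runs only on first access; the result is then stored on the instance
        self.compute_calls += 1
        return sum(self._values)


report = Report([1, 2, 3])
first = report.total
second = report.total
print(first, second, report.compute_calls)

# Deleting the attribute clears the cache, so the next access recomputes
del report.total
third = report.total
print(third, report.compute_calls)
```

The trade-off is that a cached value does not follow later changes to `_values`, so the cache has to be cleared explicitly whenever the underlying data changes.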

Python Type Hints Explained

钱魏Way — Mon, 24 Feb 2025 04:17:00 +0000

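What type hints are

Type hints were introduced in Python 3.5 (through PEP 484). They let developers annotate variables, function parameters, and return values with the expected types. They do not change runtime behavior, but static checkers such as mypy can use them to catch type errors early, which improves robustness and maintainability.

Definition: a type hint is syntax for annotating variables, function parameters, return values, and so on with their expected types, making the intent of the code explicit.
Benefits:
- Improves readability and maintainability.
- Lets static type checkers (such as `mypy`) find potential errors.
- Improves code hints and autocompletion in IDEs.
Note: Python remains a dynamically typed language; type hints do not affect runtime behavior.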
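
The note above can be checked in a few lines. This is an illustrative sketch: the interpreter accepts arguments that contradict the annotations, while a static checker such as mypy would reject the same call.

```python
def add(a: int, b: int) -> int:
    return a + b


# The interpreter does not check annotations: two strings are simply concatenated
result = add("1", "2")
print(result)  # 12

# Annotations are stored as metadata that tools can inspect
print(add.__annotations__)
```
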

Basic syntax

Variable annotations

Annotate a variable with `: Type`:

name: str = "Alice"
age: int = 30
scores: list[float] = [90.5, 85.0]

Function parameters and return values

Parameter: `name: type`
Return value: `-> return type`

def add(a: int, b: int) -> int:
    return a + b

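Container types

Use the generic types from the `typing` module (from Python 3.9 on, the built-in containers can be subscripted directly, for example list[int]):

from typing import List, Dict, Tuple

# List whose elements are integers
numbers: List[int] = [1, 2, 3]
# Dict with string keys and float values
prices: Dict[str, float] = {"apple": 4.5, "banana": 2.0}
# Tuple with a fixed length and a fixed type per position
point: Tuple[float, float, str] = (3.5, 4.2, "coordinates")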

Classes and objects

Annotating instance attributes and method return values:

class User:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age: int = age  # instance attribute annotation

    def get_info(self) -> str:
        return f"{self.name}, {self.age}"

Advanced type hints

Union types

A value may be one of several types (Python 3.10+ also accepts the shorthand str | int):

from typing import Union


def parse_input(value: Union[str, int]) -> None:
    pass

Optional types

Equivalent to `Union[T, None]`; the value may be `None`:

from typing import Optional


def find_user(user_id: int) -> Optional["User"]:
    # May return a User object or None
    pass

Type aliases

Give a complex type a shorter name:

from typing import List, Tuple

Coordinates = List[Tuple[float, float]]

def draw(points: Coordinates) -> None:
    pass

Generics

Define your own generic classes or functions:

from typing import TypeVar, Generic

T = TypeVar('T')


class Box(Generic[T]):
    def __init__(self, item: T) -> None:
        self.item = item

Literal types

Restrict a value to specific literals (Python 3.8+):

from typing import Literal


def set_mode(mode: Literal["read", "write"]) -> None:
    pass

Callable types

Annotate a parameter that takes a function:

from typing import Callable


def on_click(callback: Callable[[int, str], None]) -> None:
    pass

Protocols and structural subtyping (duck typing)

Protocol (Python 3.8+) defines an interface that a type can satisfy without inheriting from it:

from typing import Protocol

class Flyer(Protocol):
    def fly(self) -> str: ...

class Bird:
    def fly(self) -> str:
        return "Flapping wings"

def takeoff(obj: Flyer) -> None:
    print(obj.fly())

takeoff(Bird())  # valid, because Bird implements fly
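
Protocols are checked statically by default. If a runtime check is also wanted, `typing.runtime_checkable` allows `isinstance()` against a protocol. This is a minimal sketch; the `Rock` class is a hypothetical counterexample.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Flyer(Protocol):
    def fly(self) -> str: ...


class Bird:
    def fly(self) -> str:
        return "Flapping wings"


class Rock:
    pass


# isinstance() only checks that a member named fly exists, not its signature
print(isinstance(Bird(), Flyer))  # True
print(isinstance(Rock(), Flyer))  # False
```

Because only the presence of the member is checked, this is weaker than what a static checker verifies and is best kept for defensive checks at system boundaries.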

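Generics

What are generics?

Generics are a programming technique for writing code that works with many data types while staying type-safe. With generics you can define parameterized types (such as List[T]), where T is a type variable that stands for the type chosen at the point of use. The main goals are:

Code reuse: avoid writing the same logic again for each type.
Type constraints: make the types being operated on explicit and reduce runtime errors.
Static checking: work with tools such as mypy to find type mismatches early.

Generics in Python

Python supports generics through the typing module (Python 3.5+) and through built-in syntax (Python 3.12+).

Basic syntax

Define a type variable with TypeVar.

from typing import TypeVar, Generic

T = TypeVar('T')  # define the type variable T

Generic class: inherit from Generic[T].

class Box(Generic[T]):
    def __init__(self, item: T) -> None:
        self.item = item

    def get_item(self) -> T:
        return self.item

# Usage
int_box = Box(10)  # Box[int]
str_box = Box("hello")  # Box[str]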

Generic function: use the type variable directly.

from typing import TypeVar

T = TypeVar('T')

def first_element(items: list[T]) -> T:
    return items[0]

print(first_element([1, 2, 3]))  # inferred as int
print(first_element(["a", "b", "c"]))  # inferred as str

New syntax in Python 3.12+

Python 3.12 introduced a more compact generic syntax (PEP 695) that needs neither TypeVar nor Generic:

class Box[T]:
    def __init__(self, item: T) -> None:
        self.item = item

    def get_item(self) -> T:
        return self.item

def first_element[T](items: list[T]) -> T:
    return items[0]

Main use cases for generics

Container classes (such as lists and dicts)

Define a container that can hold any type, as long as all elements share that type:

from typing import Generic, TypeVar, Iterable

T = TypeVar('T')

class LinkedList(Generic[T]):
    def __init__(self, items: Iterable[T]) -> None:
        self.items = list(items)

    def append(self, item: T) -> None:
        self.items.append(item)

# Usage
int_list = LinkedList([1, 2, 3])  # LinkedList[int]
str_list = LinkedList(["a", "b", "c"])  # LinkedList[str]

General-purpose algorithms

Write algorithms that do not depend on a concrete type:

from typing import TypeVar, Sequence

T = TypeVar('T')

def max_element(seq: Sequence[T]) -> T:
    return max(seq)  # assumes the elements are comparable

print(max_element([3, 1, 4]))  # 4
print(max_element(["z", "a", "b"]))  # "z"

API design

Design interfaces that can handle multiple types:

from typing import TypeVar, Callable

Input = TypeVar('Input')
Output = TypeVar('Output')

def transform_data(
    data: Input,
    converter: Callable[[Input], Output]
) -> Output:
    return converter(data)

# Usage
result_str = transform_data(100, lambda x: str(x))  # Output is str
result_int = transform_data("123", int)  # Output is int

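Advanced generic features

Type bounds and constraints

Restrict which types a type variable may take. A bound sets an upper limit (any subtype is accepted); constraints list the exact types allowed. Note that typing has no Number class, and numbers.Number works poorly as a bound with static checkers, so the numeric example uses constraints instead:

from collections.abc import Sized
from typing import TypeVar

S = TypeVar('S', bound=Sized)  # upper bound: any type that defines __len__

def longer(a: S, b: S) -> S:
    return a if len(a) >= len(b) else b

N = TypeVar('N', int, float)  # constraints: exactly int or exactly float

def add(a: N, b: N) -> N:
    return a + b

add(1, 2)  # valid
add(3.14, 2.5)  # valid
add("a", "b")  # rejected by the type checker
longer(1, 2)  # rejected by the type checker: int is not Sized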

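Covariance and contravariance

These control how subtyping between type arguments carries over to the generic type:

Covariant (covariant=True): a generic of the subclass can stand in for a generic of the parent class (for example, Sequence[Child] can be used as Sequence[Parent]; list is invariant because it is mutable, so list[Child] is not a list[Parent]).
Contravariant (contravariant=True): a generic of the parent class can stand in for a generic of the subclass (less common; typical of types that only consume values, such as callback parameters).

from typing import TypeVar, Generic

class Animal: ...
class Dog(Animal): ...

T_co = TypeVar('T_co', covariant=True)

class Cage(Generic[T_co]):
    def get_animal(self) -> T_co: ...

def get_animal_from_cage(cage: Cage[Animal]) -> Animal:
    return cage.get_animal()

dog_cage: Cage[Dog] = Cage()
get_animal_from_cage(dog_cage)  # valid (covariance lets Cage[Dog] be used as Cage[Animal])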
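
The contravariant case can be sketched in the same style. The `Feeder` class and `feed_dog` function below are hypothetical names chosen for illustration; the type variable is additionally bounded by `Animal` so that the method body can call `name()`.

```python
from typing import Generic, TypeVar


class Animal:
    def name(self) -> str:
        return "animal"


class Dog(Animal):
    def name(self) -> str:
        return "dog"


# Contravariant type variables may only appear in input (parameter) positions
T_contra = TypeVar("T_contra", bound=Animal, contravariant=True)


class Feeder(Generic[T_contra]):
    """Consumes values of type T_contra and never returns them."""

    def feed(self, animal: T_contra) -> str:
        return f"fed {animal.name()}"


def feed_dog(feeder: Feeder[Dog], dog: Dog) -> str:
    return feeder.feed(dog)


animal_feeder: Feeder[Animal] = Feeder()
# Accepted by a type checker: a Feeder[Animal] can stand in for a Feeder[Dog]
message = feed_dog(animal_feeder, Dog())
print(message)  # fed dog
```

The direction is the reverse of the `Cage` example: something that can feed any animal can certainly feed a dog, whereas a dog-only feeder could not be used where a general animal feeder is required.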

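Caveats

Runtime type erasure: generic type arguments are not enforced at runtime (they are used for static checking only).
Avoid over-generalizing: use generics only where type safety is needed, otherwise they just add complexity.
Dynamic typing: Python's dynamic features can bypass generic constraints, so rely on the tool chain to check them.
Performance: generics do not affect runtime performance, but complex type checks can lengthen static analysis.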
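
The first caveat can be demonstrated with a short sketch (Python 3.9+). The `Box` class mirrors the earlier example; `get_origin` and `get_args` are standard helpers in `typing`.

```python
from typing import Generic, TypeVar, get_args, get_origin

T = TypeVar("T")


class Box(Generic[T]):
    def __init__(self, item: T) -> None:
        self.item = item


# The type argument is not enforced: Box[int] happily stores a string
box = Box[int]("not an int")
print(type(box) is Box)  # True, the instance's class is plain Box

# Parameterized generics cannot be used with isinstance()
try:
    isinstance([1, 2], list[int])
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True

# The alias object itself still carries its parameters for introspection
print(get_origin(list[int]), get_args(list[int]))
```

Libraries that validate data at runtime build on this kind of introspection, but the interpreter itself never performs the check.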

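Type checking tools

mypy

Install: `pip install mypy`
Run: `mypy your_script.py`
Configure: add the following to `pyproject.toml`:

[tool.mypy]
ignore_missing_imports = true
strict = true

IDE support

VSCode: install the `Python` extension and `Pylance`.
PyCharm: built-in type checking with autocompletion.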

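Best practices

Adopt gradually: add type hints first to key modules (public interfaces, complex logic).
Avoid overusing Any: be as specific as possible, otherwise checking loses its value.
Compatibility:
- Use # type: ignore to silence errors in legacy code temporarily.
- Use the typing module to stay compatible across Python versions.
Documentation: combine with sphinx or pdoc to generate typed documentation.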

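Common problems and solutions

Circular references

When an annotation refers to a class that is not fully defined yet (including the class itself), write the type as a string forward reference:

class Node:
    def __init__(self, parent: "Node") -> None:
        self.parent = parent

Handling dynamic types

Type assertion:

from typing import Any, cast

value: Any = get_data()  # get_data() stands for any function that returns Any
str_value: str = value  # unsafe: Any is assignable to str without any check
safe_str = cast(str, value)  # tells the type checker the type explicitly (no runtime check)

Type guards:

from typing import Union

def process(data: Union[int, str]) -> None:
    if isinstance(data, int):
        print(data + 1)
    else:
        print(data.upper())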

Python Coding Standards: A Consolidated Guide

钱魏Way — Sun, 23 Feb 2025 09:53:53 +0000

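The following is a consolidated set of Python coding standards that combines PEP 8, best practices, and common pitfalls. It is suitable for both team collaboration and personal projects: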

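Code layout and formatting

Indentation

Rule: use 4 spaces per indentation level (do not use tabs).
Continuation lines: align vertically or use a hanging indent (continuation lines are indented one extra level).

# Correct: aligned with the opening bracket
result = some_function(arg1, arg2,
                       arg3, arg4)
# Correct: hanging indent (one extra level)
result = some_function(
    arg1, arg2,
    arg3, arg4
)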

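In Python coding standards, vertical alignment and hanging indents are the two ways of indenting wrapped lines. Both aim to make multi-line code easier to read.

Vertical alignment

Definition: when a line is too long and has to wrap, the continuation lines are aligned with the start of the wrapped elements (just after the opening parenthesis, bracket, or brace). The tidy visual layout emphasizes the structure of the code.
When to use: function calls, list/dict/tuple literals, conditions, and other multi-line expressions.
Pros: the structure is easy to see, and the nesting of arguments or elements is obvious.
Cons: when the opening bracket sits far to the right, continuation lines are indented deeply and use up line width.

Hanging indent

Definition: when a line is too long and has to wrap, the continuation lines are indented one level more than the parent line (usually 4 spaces). Nothing follows the opening bracket on the first line; the elements start on the next line.
When to use: function definitions and calls, dicts, lists, and other cases where the parent line is long.
Pros: keeps the first line short and the hierarchy clear, which suits complex structures and makes it easier to stay within PEP 8's line length limit (79 characters).
Cons: the first line can look empty, and the style takes some getting used to.

What PEP 8 says

- PEP 8 accepts both styles. Hanging indents are often preferred for function definitions and calls because they keep lines short and are what auto-formatters such as black produce.
- When vertical alignment helps: if the code benefits from showing an alignment relationship, aligning with the opening bracket can be more readable.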

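Line length

Rule: keep lines to 79 characters (the PEP 8 recommendation); many teams relax this to 88-100 characters.
Wrapping strategy:
- Break before binary operators.
- Continue lines inside parentheses (preferred) or with a backslash \.

# Correct: break before the operator
total = (value1 + value2
         + value3 - value4)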

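Whitespace and operators

Put a space on each side of an operator, but none immediately inside parentheses:

# Correct
x = 1 + 2
if a > 5 and b < 10:
    ...
# Wrong
x=1+2

Put a space after commas and semicolons, but not before them:

# Correct
arr = [1, 2, 3]
# Wrong
arr=[1,2,3]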

Blank lines

Between top-level functions and classes: 2 blank lines.
Inside a function: 1 blank line between logical blocks.

def func1():
    pass


def func2():
    pass

Imports

Import order: standard library, then third-party libraries, then local modules, with a blank line between groups.
No wildcards: do not use from module import *.

import os
import sys

from django.http import HttpResponse

from myapp.utils import helper

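Naming conventions

Basic rules

Variables and functions: lowercase with underscores (snake_case), such as calculate_total.
Classes: capitalized words (CamelCase), such as ClassName.
Constants: uppercase with underscores, such as MAX_LENGTH.
Private members: a single leading underscore, _private_var; a double leading underscore, __mangled_name, triggers name mangling.

Names to avoid

Single-letter variables (except in loops or lambdas).
Names that shadow built-ins (such as list and str) or are easily confused (such as l and O).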

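Code style

Expressions and statements

Avoid extraneous whitespace:

# Wrong
func(arg1, arg2 )
# Correct
func(arg1, arg2)

Chained comparisons: if 0 < x < 10 is allowed.
Simplify conditions: avoid redundant code.

# Redundant
if x == True:
    ...
# Correct
if x:
    ...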

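Functions and methods

Keep functions under 50 lines and focused on a single task (one function does one thing).
Default parameter values: avoid mutable objects (such as lists and dicts).

# Wrong
def func(a=[]):
    ...
# Correct
def func(a=None):
    if a is None:  # "a or []" would also replace an empty list passed by the caller
        a = []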

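In Python, using a mutable object (such as a list or dict) as a default parameter value can cause unexpected side effects. A detailed explanation and an example follow.

Root cause: a default value is evaluated once, when the function is defined, and then stored; it is not recreated on each call. If the default is a mutable object (such as `[]` or `{}`), every call that does not pass that argument explicitly shares the same object.

Example:

def append_to_list(value, lst=[]):
    lst.append(value)
    return lst

print(append_to_list(1))  # prints [1]
print(append_to_list(2))  # prints [1, 2] ❌ although [2] was expected

Why does this happen?

On the first call, `append_to_list(1)`, `lst` is the default empty list `[]`, which becomes `[1]`.
On the second call, `append_to_list(2)`, `lst` still refers to the list modified by the first call, `[1]`, so the result is `[1, 2]`.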
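
The shared default can be inspected directly, because Python stores default values on the function object in `__defaults__`. A short sketch using the same function:

```python
def append_to_list(value, lst=[]):
    lst.append(value)
    return lst


# The default list is created once and stored on the function object
print(append_to_list.__defaults__)  # ([],)

first = append_to_list(1)
second = append_to_list(2)

# Both calls returned the very same list object
print(first is second)              # True
print(append_to_list.__defaults__)  # ([1, 2],)
```
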

How to avoid it

Use an immutable default (such as `None`) and create the mutable object inside the function:

Fixed code:

def append_to_list(value, lst=None):
    if lst is None:
        lst = []  # a new list is created on each call
    lst.append(value)
    return lst

print(append_to_list(1))  # prints [1]
print(append_to_list(2))  # prints [2] ✅

Type annotations: type hints are recommended (Python 3.5+).

def greet(name: str) -> str:
    return f"Hello, {name}"

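Class design

Give each class a clear responsibility and avoid "God classes".
Use the `@property` decorator to manage attribute access.
Prefer composition over inheritance.

Single Responsibility Principle (SRP)

Definition: a class should have only one reason to change (that is, a class does one thing).
Core idea: split functionality into independent modules, each with one clearly defined responsibility.

Comparison:

# Poor design: one class handles authentication, storage, and logging.
# A typical God class
class UserManager:
    def login(self, username, password): ...  # authentication
    def save_to_db(self, user_data): ...  # data storage
    def write_log(self, message): ...  # logging

# Better design: responsibilities split across classes.
class UserAuthenticator:
    def login(self, username, password): ...  # authentication only

class UserRepository:
    def save(self, user_data): ...  # storage only

class Logger:
    def log(self, message): ...  # logging only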

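How to tell whether a responsibility is clear

Does the class name describe what it does? (For example, `UserAuthenticator` rather than `UserManager`.)
Can you describe the class's responsibility in one sentence? (For example: "This class verifies user identity.")
When one feature changes, do several classes have to change?

What is a God class?

Definition

God class: a class that takes on too many responsibilities, has a large number of methods and attributes, and may even manipulate other classes' data directly.
Typical symptoms:
- A very large amount of code (thousands of lines).
- Direct dependencies on many external modules or data sources.
- Many unrelated methods (for example, network requests, database operations, and business logic in one class).

Why God classes are harmful:

High coupling: one change ripples everywhere and can cause errors in several places.
Hard to test: the dependencies are complex, so unit tests struggle to cover every scenario.
Poor readability: new team members need a long time to understand what the class does.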

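How to avoid God classes

Split by function: break a large class into small ones, each responsible for one thing.

class OrderCreator: ...  # creates orders
class InventoryUpdater: ...  # updates inventory
class PaymentProcessor: ...  # processes payments
class NotificationSender: ...  # sends notifications

Prefer composition over inheritance:

Inheritance: tends to bloat the parent class (for example, an `Animal` class that holds the methods of every animal).
Composition: injects collaborating objects, so behavior can be extended flexibly.

# Poor: inheritance forces every bird to carry both methods
class Bird(Animal):
    def fly(self): ...  # birds can fly
    def swim(self): ...  # but some birds cannot swim

# Better: separate the abilities through composition
class Bird:
    def __init__(self, fly_behavior, swim_behavior):
        self.fly = fly_behavior
        self.swim = swim_behavior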
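
To show how the composed `Bird` might be used, here is a minimal sketch. The behavior functions (`fly_with_wings`, `cannot_fly`, `paddle`, `cannot_swim`) are hypothetical and are passed in as plain callables.

```python
def fly_with_wings():
    return "flying"


def cannot_fly():
    return "cannot fly"


def paddle():
    return "swimming"


def cannot_swim():
    return "cannot swim"


class Bird:
    def __init__(self, fly_behavior, swim_behavior):
        # Behaviors are injected as callables instead of being inherited
        self.fly = fly_behavior
        self.swim = swim_behavior


sparrow = Bird(fly_with_wings, cannot_swim)
penguin = Bird(cannot_fly, paddle)
print(sparrow.fly(), "/", sparrow.swim())
print(penguin.fly(), "/", penguin.swim())
```

Adding a new kind of bird then means choosing a different combination of behaviors rather than adding another subclass.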

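Dependency injection: pass external services (such as a database or network client) in through the constructor or method parameters instead of creating them inside the class.

# Poor: the class creates its database dependency itself
class UserRepository:
    def __init__(self):
        self.db = MySQLClient()  # tight coupling


# Better: decoupled through dependency injection
class UserRepository:
    def __init__(self, db_client):  # any database client can be passed in
        self.db = db_client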
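
One practical benefit of injection is that tests can substitute an in-memory double for the real client. The sketch below assumes a hypothetical client interface with a single `insert` method; `InMemoryClient` and `save` are illustrative names, not part of any real database library.

```python
class InMemoryClient:
    """Hypothetical stand-in for a real database client."""

    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)


class UserRepository:
    def __init__(self, db_client):
        self.db = db_client

    def save(self, user_data):
        # The repository only relies on the client's insert() method
        self.db.insert(user_data)


fake_db = InMemoryClient()
repo = UserRepository(fake_db)
repo.save({"name": "Alice"})
print(fake_db.rows)  # [{'name': 'Alice'}]
```
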

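Exception handling

Basic principles

Do not catch every exception (`except:` ❌); name the exception types explicitly.
Use exception handling for code that is expected to fail occasionally, not for control flow.
Custom exceptions should inherit from the `Exception` base class.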
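
As an illustration of the last principle, the sketch below defines a custom exception and catches only that type. `InsufficientFundsError` and `withdraw` are hypothetical names.

```python
class InsufficientFundsError(Exception):
    """Raised when a withdrawal exceeds the available balance."""

    def __init__(self, balance, amount):
        super().__init__(f"cannot withdraw {amount}: balance is only {balance}")
        self.balance = balance
        self.amount = amount


def withdraw(balance, amount):
    if amount > balance:
        raise InsufficientFundsError(balance, amount)
    return balance - amount


try:
    withdraw(100, 150)
except InsufficientFundsError as exc:
    # Only the expected failure is caught; any other error propagates
    message = str(exc)
    print(message)
```

Keeping the balance and amount as attributes lets callers react to the failure programmatically without having to parse the message.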

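Name the exception type: avoid a bare except:.

import logging

logger = logging.getLogger(__name__)

# Wrong
try:
    ...
except:
    pass

# Correct
try:
    ...
except ValueError as e:
    logger.error(e)

Exception messages: provide a clear error message.

raise ValueError("Invalid value: expected int, got str")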

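Comments and documentation

Code comments

Inline comments: leave two spaces after the code before `#`, and keep the comment short.
Block comments: use them to explain complex logic; each line starts with `#`.
Avoid meaningless comments (such as `# assign to x`).

Docstrings

Modules, functions, and classes: use triple double quotes """...""" and follow the Google or NumPy style.

def calculate_sum(a: int, b: int) -> int:
    """Add two numbers.

    Args:
        a (int): First number.
        b (int): Second number.

    Returns:
        int: Sum of a and b.
    """
    return a + b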

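Google-style docstrings

Characteristics:

Concise and intuitive: simple syntax with an emphasis on natural-language descriptions.
Structured sections: content is grouped under labels such as `Args`, `Returns`, and `Raises`.
Optional type annotations: a parameter's type can be written in the description or next to the parameter name.
Widely applicable: well suited to small and medium projects or to writing documentation quickly.

Example:

def add(a: int, b: int) -> int:
    """Return the sum of two integers.

    Args:
        a (int): The first addend.
        b (int): The second addend.

    Returns:
        int: The sum of the two numbers.

    Raises:
        ValueError: If an input is not an integer.

    Examples:
        >>> add(2, 3)
        5
    """
    if not isinstance(a, int) or not isinstance(b, int):
        raise ValueError("inputs must be integers")
    return a + b

Main sections:

`Args`: parameter descriptions (types may be included).
`Returns`: description of the return value.
`Raises`: exceptions that may be raised.
`Examples`: usage examples (may contain doctests).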
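
Examples written in this form can be executed with the standard `doctest` module, which keeps the documentation honest. This is a minimal sketch that runs the examples of a single docstring programmatically; the second example (`add(-1, 1)`) is added here only for illustration.

```python
import doctest


def add(a: int, b: int) -> int:
    """Return the sum of two integers.

    Examples:
        >>> add(2, 3)
        5
        >>> add(-1, 1)
        0
    """
    return a + b


# Parse the examples out of the docstring and run them
parser = doctest.DocTestParser()
docstring_test = parser.get_doctest(add.__doc__, {"add": add}, "add", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(docstring_test)
print(results.failed, results.attempted)
```

In a real module, `doctest.testmod()` or `python -m doctest your_module.py` runs every docstring example in one go.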

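NumPy-style docstrings

Characteristics:

Detailed and strict: rigid sections, suited to complex functions or libraries.
Mandatory types: the types of parameters and return values must be stated.
Underlined section headers: each section title is underlined with dashes (for example `----------`).
Preferred in scientific computing: common in data science and numerical projects (such as NumPy and Pandas).

Example:

def divide(dividend: float, divisor: float) -> float:
    """Divide one number by another.

    Parameters
    ----------
    dividend : float
        The number to be divided.
    divisor : float
        The number to divide by; must be non-zero.

    Returns
    -------
    float
        The result of the division.

    Raises
    ------
    ZeroDivisionError
        If the divisor is zero.

    Examples
    --------
    >>> divide(10.0, 2.0)
    5.0
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must not be zero")
    return dividend / divisor

Main sections:

`Parameters`: parameter descriptions (types are mandatory).
`Returns`: description of the return value.
`Raises`: description of exceptions.
`See Also`: related functions or classes.
`Notes`: additional remarks.
`Examples`: usage examples.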

TODO comments

Mark work that is still to be done.

# TODO: Implement error handling here.

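Best practices and caveats

Readability

Avoid magic code: explicit is better than implicit.

# Wrong: magic number
if status == 2:
    ...
# Correct: use a named constant
STATUS_COMPLETED = 2
if status == STATUS_COMPLETED:
    ...

Single responsibility: a function does one thing.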

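Performance

Avoid inefficient operations

Build strings with `join` rather than repeated `+`.
Prefer a list comprehension to a `for` loop when building a list.
Reduce access to global variables (local variables are faster).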
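
The first two points can be sketched as follows. The data is made up for illustration, and no timings are claimed here; measure with `timeit` on real workloads before optimizing.

```python
words = ["alpha", "beta", "gamma"]

# Repeated += creates a new string object on every iteration
slow = ""
for word in words:
    slow += word + ","
slow = slow.rstrip(",")

# str.join builds the result in a single pass
fast = ",".join(words)
print(fast == slow)  # True

# A list comprehension replaces an explicit append loop
squares = [n * n for n in range(5)]
print(squares)  # [0, 1, 4, 9, 16]
```
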

Efficient data structures

For frequent lookups use a `dict` or a `set` (O(1) on average).
For large data sets prefer generators (`yield`).
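
A small sketch of both recommendations. The sizes are arbitrary, and `squares` is a hypothetical generator function.

```python
import sys

ids_list = list(range(100_000))
ids_set = set(ids_list)

# `in` scans a list element by element, but does a hash lookup in a set
print(99_999 in ids_list, 99_999 in ids_set)  # True True


def squares(limit):
    # Values are produced one at a time instead of being stored up front
    for n in range(limit):
        yield n * n


lazy = squares(1_000_000)
eager_size = sys.getsizeof([n * n for n in range(1_000)])
lazy_size = sys.getsizeof(lazy)
print(lazy_size < eager_size)  # True: the generator holds no elements

total = sum(squares(10))
print(total)  # 285
```

Both membership tests return the same answer; the difference is the amount of work done per lookup, which only matters once the collection is large or the lookup is repeated often.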

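Compatibility and versions

Python 2/3 compatibility: state the runtime the code targets (Python 2 is no longer maintained).
Dependencies: manage packages with requirements.txt or pyproject.toml.

Recommended tools

Linting:
- flake8: general checks (PEP 8 plus code complexity).
- pylint: stricter static analysis.
Automatic formatting:
- black: enforces a uniform format with no configuration.
- isort: sorts import statements automatically.
Type checking:
- mypy: static type checking.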

Pygwalker, a Data Visualization Tool for Python

钱魏Way — Sat, 21 Dec 2024 01:52:59 +0000

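Pygwalker (Python binding for GraphicWalker) is a data visualization tool for Python that helps data scientists and analysts explore and understand data in a more interactive and intuitive way. As the Python binding of GraphicWalker, it offers a Tableau-like experience and lets users create interactive charts quickly in Jupyter Notebook or other Python environments.

Core features

Interactive visualization:
- Pygwalker provides an interactive way to create and explore visualizations, and lets users adjust chart parameters dynamically.
- Charts are built and modified by drag and drop, which supports fast visual analysis.
Multiple chart types:
- Supports bar charts, line charts, scatter plots, pie charts, heat maps, and more.
- Offers flexible chart options for different presentation needs.
Integration with the Python environment:
- Works seamlessly in Jupyter Notebook and integrates well with Python data analysis libraries such as Pandas.
- Creates visualizations directly from a Pandas DataFrame, which simplifies the path from data processing to visualization.
Data exploration:
- Provides filtering, sorting, and aggregation to help users dig into a data set.
- Supports multidimensional analysis, so users can switch between dimensions and compare data across them.
User-friendly interface:
- Provides an intuitive interface, so complex visualizations can be created without writing complex code.
- Chart styles and configuration can be adjusted through simple interactions.
Open source and extensibility:
- As an open-source project, its code can be inspected and modified as needed.
- Its extensible architecture allows custom chart types and features to be developed.

Use cases

Data exploration and analysis:
- Helps data scientists and analysts explore and visualize data quickly.
- Suits the early analysis phase, where visualization reveals patterns and anomalies in the data.
Reporting and presentation:
- Suits interactive data reports and presentation material that communicate insights more effectively.
- Produces high-quality chart output for use in presentations and documents.
Education and teaching:
- Useful for teaching data visualization and analysis, helping students understand data through practice.
- Simple and easy to use, which suits classroom demonstrations and coursework.
Enterprise data analysis:
- Helps business users explore and visualize business data and supports data-driven decisions.
- Offers flexible data manipulation and presentation options for enterprise analysis needs.

References: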

Kanaries/pygwalker: PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis (github.com)

Ray, a Distributed Computing Framework

钱魏Way — Sun, 15 Dec 2024 02:25:52 +0000

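Introduction to Ray

Ray is an open-source distributed computing framework designed for machine learning and AI applications. It offers a flexible and efficient way to build and run distributed applications, particularly where large-scale parallel computation is needed. At its core is a general-purpose distributed execution engine that supports both stateless and stateful computation.

Core features

Simplified distributed computing:
- Ray makes distributed applications easy to write; its Python API gives transparent access to distributed compute resources.
- Functions and classes can be executed in a distributed fashion, so existing single-machine code can be extended to a cluster with little effort.
Automated resource management:
- Ray schedules resources and balances load automatically to make good use of the available compute.
- It manages and schedules several resource types (such as CPUs and GPUs) to fit different workloads.
Elasticity and fault tolerance:
- Ray's architecture supports automatic task retries and failure recovery, which keeps applications highly available.
- Compute resources can be scaled up and down dynamically as the workload changes.
High performance:
- Shared memory and zero-copy techniques give Ray efficient data transfer and computation.
- It supports large-scale parallel computation and low-latency task scheduling.
Ecosystem and integration:
- Ray ships with a rich set of libraries and tools, such as Ray Tune (hyperparameter tuning), Ray Serve (model serving), and Ray RLlib (reinforcement learning).
- It integrates with mainstream machine learning frameworks (such as TensorFlow and PyTorch) and supports complex machine learning workflows.

Use cases

Machine learning and deep learning:
- Supports distributed training, hyperparameter tuning, and model serving for large-scale machine learning applications.
- Integrates with frameworks such as TensorFlow and PyTorch for complex model training and deployment.
Reinforcement learning:
- Ray RLlib is a scalable reinforcement learning library that supports many algorithms and environments.
- Suits reinforcement learning tasks that need large-scale parallel simulation and training.
Data processing and analysis:
- Supports distributed data processing and analysis for large data sets.
- Integrates with data libraries such as Pandas and NumPy, which simplifies analysis workflows.
High-performance computing (HPC):
- Suits data-intensive applications that need large-scale parallelism and high performance.
- Supports HPC scenarios such as scientific computing, simulation, and modeling.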

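Ray's architecture

Ray aims to provide simple interfaces for developing and running large-scale distributed applications, and it is particularly suited to compute-heavy work such as machine learning, reinforcement learning, and data processing. Its architecture emphasizes flexibility, scalability, and high performance.

The main architectural components and their roles are as follows:

Core components

Ray Core
- Task and actor model: Ray provides a parallel computing model based on tasks and actors. A task is a stateless function call; an actor is a stateful unit of computation.
- Scheduler: schedules and assigns tasks across the cluster so that they run efficiently and resources are used well.
- Resource management: tracks and manages the cluster's compute resources, with support for dynamic scaling and load balancing.
Ray Cluster
- Node types:
  - Head Node: the cluster's main node; it manages cluster state, schedules tasks, and coordinates communication between nodes.
  - Worker Nodes: the nodes that do the actual work; they run the tasks and actors assigned to them.
Ray Dashboard
- A web interface for monitoring and managing a Ray cluster. Users can inspect cluster state, resource usage, task execution, and more.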

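Data management

Object store: Ray uses a shared-memory object store to pass data between tasks. The object store supports zero-copy data sharing, which reduces transfer overhead.
Object references: Ray uses object references to track and manage data objects, so tasks can access and share data efficiently.

Extensions and integration

Ray libraries: Ray provides a set of libraries for different use cases:
- Ray Tune: large-scale hyperparameter optimization.
- Ray RLlib: distributed reinforcement learning.
- Ray Serve: large-scale model serving and online inference.
Integration with other tools: Ray can be combined with common machine learning frameworks (such as TensorFlow and PyTorch) and data processing tools (such as Apache Spark) to extend its capabilities.

Scalability and elasticity

Horizontal scaling: Ray supports adding and removing worker nodes dynamically to match changing compute demand.
Failure recovery: Ray has built-in failure recovery and can automatically reschedule failed tasks, which keeps computation reliable and continuous.

References: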