
pickle: a tool for persisting Python objects

钱魏Way

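Python has a serialization mechanism called pickle. It converts a Python object into a byte stream and back again (protocol 0 produces a printable, text-like representation; the later protocols produce a compact binary one). In other words, pickle can store Python objects and restore them later.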

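  • Serialization (pickling): turning an in-memory object into a form that can be stored or transmitted. Once serialized, the object can be written to disk or sent to another machine.
  • Deserialization (unpickling): the reverse, reading the serialized data back into memory and rebuilding the object.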

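In machine learning we often need to save a trained model so that it can be loaded directly at prediction time instead of being retrained, which saves a great deal of time. Python's pickle module handles this well: it serializes an object, saves it to disk, and reads it back when needed. Most Python objects can be pickled; the exceptions are mainly objects tied to operating-system resources, such as open files, sockets, and locks.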
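Objects that wrap operating-system resources are the main exception to what pickle can handle. A minimal sketch, assuming Python 3; the dictionary `model` is a hypothetical stand-in for a trained model, not a real one:

```python
import pickle
import threading

# A hypothetical stand-in for a trained model: plain data pickles fine.
model = {"weights": [0.1, 0.2, 0.3], "bias": 0.5}
restored_model = pickle.loads(pickle.dumps(model))

# A lock wraps an operating-system resource and cannot be pickled.
try:
    pickle.dumps(threading.Lock())
    lock_error = None
except TypeError as exc:
    lock_error = exc

print(restored_model == model)
print(type(lock_error).__name__)
```

If an object holds such a resource as an attribute, the usual approach is to leave it out of the pickled state and recreate it after loading.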

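Python 2 has two modules for object serialization, pickle and cPickle. cPickle is implemented in C and pickle in pure Python, so cPickle reads and writes faster. The usual pattern is to try importing cPickle first and fall back to pickle if that fails.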

try:
    import cPickle as pickle
except ImportError:
    import pickle

In Python 3 this import pattern is no longer needed:

A common pattern in Python 2.x is to have one version of a module implemented in pure Python, with an optional accelerated version implemented as a C extension; for example, pickle and cPickle. This places the burden of importing the accelerated version and falling back on the pure Python version on each user of these modules. In Python 3.0, the accelerated versions are considered implementation details of the pure Python versions. Users should always import the standard version, which attempts to import the accelerated version and falls back to the pure Python version. The pickle / cPickle pair received this treatment. The profile module is on the list for 3.1. The StringIO module has been turned into a class in the io module. https://docs.python.org/3.1/whatsnew/3.0.html#library-changes

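The pickle module provides the following 4 functions:

  • dumps(): serializes a Python object and returns the result as a bytes object
  • loads(): reads the given bytes and converts them back into a Python object
  • dump(): serializes a Python object and writes the result to a file
  • load(): reads serialized data from a file and returns the object

These 4 functions fall into two groups: dumps and loads convert between Python objects and bytes in memory, while dump and load do the same through a file.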
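The two groups can be compared side by side. The following is a small sketch, assuming Python 3; `io.BytesIO` stands in for a real file opened in binary mode, and the `record` dictionary is made up for illustration:

```python
import io
import pickle

record = {"name": "Maria", "scores": [90, 85], "tags": ("a", "b")}

# In memory: dumps() returns a bytes object, loads() rebuilds the object.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

# Through a file: dump() writes to any object with a binary write() method,
# load() reads from any object with binary read() and readline() methods.
buffer = io.BytesIO()  # stands in for a file opened with 'wb'
pickle.dump(record, buffer)
buffer.seek(0)  # rewind before reading back
restored_from_file = pickle.load(buffer)

print(type(blob).__name__)
print(restored == record, restored_from_file == record)
```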

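Usage is similar to JSON serialization and deserialization, but there are some differences:

  • JSON is a text format; pickle produces a binary byte stream
  • JSON is human-readable; pickle is not
  • JSON is widely used outside Python; pickle is specific to Python
  • JSON can only dump a subset of Python's built-in types; pickle can store almost any object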
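The last difference is easy to see with values that have no JSON equivalent. A short sketch, assuming Python 3 and only the standard library; the sample `data` is invented for illustration:

```python
import datetime
import json
import pickle

# Sets, dates and bytes have no direct JSON representation.
data = {
    "ids": {1, 2, 3},
    "when": datetime.date(2020, 1, 1),
    "raw": b"\x00\x01",
}

try:
    json.dumps(data)
    json_ok = True
except TypeError:
    # json refuses types it does not know how to encode
    json_ok = False

# pickle round-trips the same structure with the types preserved.
restored = pickle.loads(pickle.dumps(data))

print(json_ok)
print(restored == data)
```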

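Using pickle

The pickle module provides two constants:

Constant    Description
pickle.HIGHEST_PROTOCOL    An integer, the highest protocol version available. It can be passed as the protocol argument to dump() and dumps().
pickle.DEFAULT_PROTOCOL    An integer, the default protocol used for pickling. It may be lower than the highest protocol.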

Functions provided by the pickle module:

  • dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
  • dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
  • load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
  • loads(data, /, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)

The optional protocol argument accepts the following versions:

  • Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
  • Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
  • Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
  • Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7.
  • Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.
  • Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.

Passing a negative value such as -1 selects the highest available protocol version.
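To get a feel for the protocol argument, the sketch below pickles the same list with every protocol the running interpreter supports and compares the sizes. It assumes Python 3; the exact byte counts depend on the data and the Python version, so none are quoted here:

```python
import pickle

data = list(range(1000))

# Size of the pickled output for each protocol available in this interpreter.
sizes = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    sizes[proto] = len(pickle.dumps(data, protocol=proto))

# A negative protocol is treated as HIGHEST_PROTOCOL.
same_as_highest = (
    pickle.dumps(data, protocol=-1)
    == pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
)

print("default:", pickle.DEFAULT_PROTOCOL, "highest:", pickle.HIGHEST_PROTOCOL)
for proto, size in sizes.items():
    print("protocol", proto, "->", size, "bytes")
print(same_as_highest)
```

The text-like protocol 0 is noticeably larger than the binary ones for data like this, which is one reason to leave the default alone unless the file must be read by an older Python.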

Example code:

import pickle

# take objects list, dictionary and class
mylist = ['pink', 'green', 'blue', 'red']
mydict = {'a': 23, 'b': 17, 'c': 9}


class Student:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def display_info(self):
        return ("Student name is {name} & is {age} years old".format(name=self.name, age=self.age))


# object created for student
myobj = Student('Maria', 18)

# pickling
# each object is written as a byte stream to a file opened in binary mode;
# the with statement closes the file as soon as the dump finishes
with open('mylist.pkl', 'wb') as f:
    pickle.dump(mylist, f)
with open('mydict.pkl', 'wb') as f:
    pickle.dump(mydict, f)
with open('myobj.pkl', 'wb') as f:
    pickle.dump(myobj, f)

# delete objects
del mylist
del mydict
del myobj

# unpickling
with open('mylist.pkl', 'rb') as f:
    mylist = pickle.load(f)
with open('mydict.pkl', 'rb') as f:
    mydict = pickle.load(f)
with open('myobj.pkl', 'rb') as f:
    myobj = pickle.load(f)

# printing objects and their types
print('list object: ', mylist, type(mylist))
print('dictionary object: ', mydict, type(mydict))
print('student info: ', myobj.display_info())

Output:
list object:  ['pink', 'green', 'blue', 'red'] <class 'list'>
dictionary object:  {'a': 23, 'b': 17, 'c': 9} <class 'dict'>
student info:  Student name is Maria & is 18 years old

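In Python 3 the file must be opened in binary mode ('wb' for writing, 'rb' for reading): pickle writes bytes, so dumping to a file opened in text mode raises a TypeError. Under Python 2 with protocol 0, text mode may have appeared to work, but binary mode is the safe choice everywhere.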
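The effect of the file mode can be checked directly. A small sketch, assuming Python 3; the temporary directory and the file name `data.pkl` are arbitrary choices for the demonstration:

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.pkl")

# Text mode: the file expects str, but pickle writes bytes.
try:
    with open(path, "w") as f:
        pickle.dump([1, 2, 3], f)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# Binary mode works for both writing and reading.
with open(path, "wb") as f:
    pickle.dump([1, 2, 3], f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(type(text_mode_error).__name__)
print(loaded)
```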

pickleDB

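pickleDB describes itself as a lightweight and simple key-value store, built on Python's simplejson module and inspired by redis. Despite the name, it is not clear what it has to do with pickle: as the traceback further down shows, it reads its data file as JSON.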

A pickleDB example:

>>> import pickledb

>>> db = pickledb.load('test.db', False)

>>> db.set('key', 'value')

>>> db.get('key')
'value'

>>> db.dump()
True

Can pickleDB be used together with pickle? Test code:

# -*- coding:utf-8 -*-
import pickledb
import pickle
import json

dataList = [[1, 1, 'yes'],
            [1, 1, 'yes'],
            [1, 0, 'no'],
            [0, 1, 'no'],
            [0, 1, 'no']]
dataDic = {0: [1, 2, 3, 4],
           1: ('a', 'b'),
           2: {'c': 'yes', 'd': 'no'}}

p1 = pickle.dumps(dataList)
print(pickle.loads(p1))
p2 = pickle.dumps(dataDic)
print(pickle.loads(p2))

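db = pickledb.load('example.db', False)  # load the database from a file; created automatically if it does not exist
db.set('p1', p1)  # set: set the string value of a key
db.set('p2', p2)  # set: set the string value of a key

print(pickle.loads(db.get('p1')))  # get: get the value of a key
print(pickle.loads(db.get('p2')))  # get: get the value of a key

db.dump()  # save the in-memory database to example.db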

It fails with the following error:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-6-201838b0e5b8> in <module>
     18 print(pickle.loads(p2))
     19 
---> 20 db = pickledb.load('example.db', False)  # load the database from a file; created automatically if it does not exist
     21 db.set('p1', p1)  # set: set the string value of a key
     22 db.set('p2', p2)  # set: set the string value of a key

/opt/conda/lib/python3.7/site-packages/pickledb.py in load(location, auto_dump, sig)
     41 def load(location, auto_dump, sig=True):
     42     '''Return a pickledb object. location is the path to the json file.'''
---> 43     return PickleDB(location, auto_dump, sig)
     44 
     45 

/opt/conda/lib/python3.7/site-packages/pickledb.py in __init__(self, location, auto_dump, sig)
     52         If the file does not exist it will be created on the first update.
     53         '''
---> 54         self.load(location, auto_dump)
     55         self.dthread = None
     56         if sig:

/opt/conda/lib/python3.7/site-packages/pickledb.py in load(self, location, auto_dump)
     83         self.auto_dump = auto_dump
     84         if os.path.exists(location):
---> 85             self._loaddb()
     86         else:
     87             self.db = {}

/opt/conda/lib/python3.7/site-packages/pickledb.py in _loaddb(self)
    100     def _loaddb(self):
    101         '''Load or reload the json info from the file'''
--> 102         self.db = json.load(open(self.loco, 'rt'))
    103 
    104     def _autodumpdb(self):

/opt/conda/lib/python3.7/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,
--> 296         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    297 
    298 

/opt/conda/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346             parse_int is None and parse_float is None and
    347             parse_constant is None and object_pairs_hook is None and not kw):
--> 348         return _default_decoder.decode(s)
    349     if cls is None:
    350         cls = JSONDecoder

/opt/conda/lib/python3.7/json/decoder.py in decode(self, s, _w)
    335 
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

/opt/conda/lib/python3.7/json/decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

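At first glance pickledb does not seem to work together with pickle, but the traceback points elsewhere: the exception is raised inside pickledb.load(), before any pickled data is stored, because the existing example.db could not be parsed as JSON ("Expecting value: line 1 column 1 (char 0)" is what json reports for an empty or non-JSON file). That said, since pickledb keeps its data as JSON, the bytes returned by pickle.dumps() would probably have to be encoded as text (for example with base64) before they could be saved.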
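One way to put pickled data into a JSON-backed store is to encode the bytes as text first. The sketch below is an assumption-laden illustration rather than pickledb's own API: it uses the standard json module and a plain dictionary as a stand-in for the store, and base64 as the text encoding.

```python
import base64
import json
import pickle

data_list = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]
blob = pickle.dumps(data_list)

# Raw bytes cannot be written as JSON.
try:
    json.dumps({"p1": blob})
    direct_ok = True
except TypeError:
    direct_ok = False

# Encode the bytes as ASCII text, which JSON can hold.
encoded = base64.b64encode(blob).decode("ascii")
store_text = json.dumps({"p1": encoded})  # stand-in for what a JSON store saves

# Reading back: parse the JSON, decode base64, then unpickle.
loaded = json.loads(store_text)
restored = pickle.loads(base64.b64decode(loaded["p1"]))

print(direct_ok)
print(restored == data_list)
```

With a real JSON-backed key-value store, the `encoded` string would be the value passed to its set method and the result of its get method would be decoded the same way.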

shelve

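shelve is a simple data storage scheme, similar to a key-value database, that makes it easy to save Python objects; internally it serializes data with pickle. Its main entry point is the open() function, which opens the given file (a persistent dictionary) and returns a shelf object. A shelf is a persistent, dictionary-like object. Its values can be any Python object that the pickle module can handle, which includes most class instances, recursive data types, and objects containing many shared sub-objects. The keys are ordinary strings.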

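open(filename, flag='c', protocol=None, writeback=False)

  • flag specifies how the data file is opened:
    • 'r' opens an existing data file read-only
    • 'w' opens an existing data file for reading and writing
    • 'c' opens a data file for reading and writing, creating it if it does not exist
    • 'n' always creates a new, empty data file, opened for reading and writing
  • protocol is the pickle protocol version used to serialize the data; the default was pickle protocol 3 up to Python 3.9 and is pickle.DEFAULT_PROTOCOL from Python 3.10 on;
  • writeback controls whether write-back is enabled. When it is True, every entry accessed is cached in memory and written back when the shelf is synced or closed, so in-place changes to mutable values are persisted.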

Usage example:

# -*- coding:utf-8 -*-
import shelve

with shelve.open('student.db') as db:
    db['name'] = 'Tom'
    db['age'] = 19
    db['hobby'] = ['basketball', 'watching movies', 'playing guitar']
    db['other_info'] = {'sno': 1, 'addr': 'xxxx'}

# read the data back
with shelve.open('student.db') as db:
    for key, value in db.items():
        print(key, ': ', value)
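The writeback parameter matters as soon as a stored value is modified in place. A minimal sketch, assuming Python 3; the temporary path `demo_shelf` and the sample values are arbitrary:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo_shelf")

# Without writeback, db['hobby'] returns a fresh copy each time,
# so appending to it changes only that copy and is not persisted.
with shelve.open(path) as db:
    db["hobby"] = ["basketball"]
    db["hobby"].append("guitar")

with shelve.open(path) as db:
    without_writeback = db["hobby"]

# With writeback=True, accessed entries are cached and written back on close.
with shelve.open(path, writeback=True) as db:
    db["hobby"].append("guitar")

with shelve.open(path) as db:
    with_writeback = db["hobby"]

print(without_writeback)
print(with_writeback)
```

Without writeback, the alternative is to read the value into a variable, modify it, and assign it back to the key. The cache used by writeback=True costs memory and makes closing slower, so the explicit reassignment is often preferable for large shelves.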
