spaCy is a Python natural language processing toolkit that appeared in mid-2014 and bills itself as "Industrial-Strength Natural Language Processing in Python". spaCy makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented NLTK and makes it genuinely practical for production use.
Main features:
- Tokenization
- Named entity recognition
- Multi-language support (claims to support 53 languages)
- 23 statistical models for 11 languages
- Pre-trained word vectors
- High performance
- Easy deep learning integration
- Part-of-speech tagging
- Dependency parsing
- Syntax-driven sentence segmentation
- Built-in visualizers for syntax and named entities
- Convenient string-to-hash mapping (see the sketch after this list)
- Export to numpy data arrays
- Efficient binary serialization
- Easy model packaging and deployment
- Robust, rigorously evaluated accuracy
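A few of these features, namely the string-to-hash mapping, the numpy export, and the built-in visualizer, are easy to try out directly. Below is a minimal sketch, assuming the English en_core_web_sm model (installed in the next section) is available; the example sentence is arbitrary:

```python
import spacy
from spacy import displacy
from spacy.attrs import ORTH, POS

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")

# String-to-hash mapping: every string is stored once and addressed by a 64-bit hash
apple_hash = nlp.vocab.strings["Apple"]
print(apple_hash, nlp.vocab.strings[apple_hash])

# Export token attributes to a numpy array (one row per token, one column per attribute)
print(doc.to_array([ORTH, POS]))

# Built-in visualizer: render the named entities as HTML
html = displacy.render(doc, style="ent", page=True)
print(html[:200])
```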
Installing spaCy
First install the package with pip install spacy, then download the datasets and models.
Model downloads: https://spacy.io/models
For example, to install the English model, run: python -m spacy download en_core_web_sm
When using spaCy, load the corresponding model:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
```
Since there is no official Chinese model, installing one takes a little more work.
Unofficial Chinese models: https://github.com/howl-anderson/Chinese_models_for_SpaCy
After downloading, run: pip install ./zh_core_web_sm-2.0.5.tar.gz
Once installed, run:
```python
import spacy

nlp = spacy.load("zh_core_web_sm")
```
This raises the following error:
```
Traceback (most recent call last):
  File "D:/CodeHub/NLP/test_new.py", line 7, in <module>
    nlp = spacy.load('zh_core_web_sm')
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 164, in load_model
    return load_model_from_package(name, **overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 185, in load_model_from_package
    return cls.load(**overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\zh_core_web_sm\__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 228, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 211, in load_model_from_path
    return nlp.from_disk(model_path)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\language.py", line 941, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 654, in from_disk
    reader(path / key)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\language.py", line 936, in <lambda>
    p, exclude=["vocab"]
  File "pipes.pyx", line 661, in spacy.pipeline.pipes.Tagger.from_disk
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 654, in from_disk
    reader(path / key)
  File "pipes.pyx", line 641, in spacy.pipeline.pipes.Tagger.from_disk.load_model
  File "pipes.pyx", line 643, in spacy.pipeline.pipes.Tagger.from_disk.load_model
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 376, in from_bytes
    copy_array(dest, param[b"value"])
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\util.py", line 145, in copy_array
    dst[:] = src
ValueError: could not broadcast input array from shape (128) into shape (96)
```
The initial guess was a version mismatch between spaCy and the model, so spaCy was reinstalled to match: pip install spacy==2.0.5
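A quicker way to confirm such a mismatch, before reinstalling versions by trial and error, is to compare the running spaCy version with the version range recorded in the model package's own meta.json. The following is a minimal sketch, assuming the model installed above is importable as zh_core_web_sm:

```python
import json
from pathlib import Path

import spacy
import zh_core_web_sm  # the model package installed from the tarball above

# Model packages ship a meta.json next to their __init__.py that records
# which spaCy versions the model was built for; a mismatch with the installed
# spaCy is what produces broadcast/shape errors like the one above.
meta_path = Path(zh_core_web_sm.__file__).parent / "meta.json"
meta = json.loads(meta_path.read_text(encoding="utf-8"))

print("installed spaCy :", spacy.__version__)
print("model version   :", meta.get("version"))
print("model requires  :", meta.get("spacy_version"))
```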
After reinstalling, the model loads correctly, but running the code still fails with the following error:
```
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\QWD312~1.TCE\AppData\Local\Temp\jieba.cache
Loading model cost 0.785 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "D:/CodeHub/NLP/test_new.py", line 6, in <module>
    doc = nlp("王小明在北京的清华大学读书")
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\language.py", line 333, in __call__
    doc = proc(doc)
  File "pipeline.pyx", line 390, in spacy.pipeline.Tagger.__call__
  File "pipeline.pyx", line 402, in spacy.pipeline.Tagger.predict
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
    return self.predict(x)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 55, in predict
    X = layer(X)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
    return self.predict(x)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 293, in predict
    X = layer(layer.ops.flatten(seqs_in, pad=pad))
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
    return self.predict(x)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 55, in predict
    X = layer(X)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
    return self.predict(x)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 125, in predict
    y, _ = self.begin_update(X)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 374, in uniqued_fwd
    Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 61, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\layernorm.py", line 51, in begin_update
    X, backprop_child = self.child.begin_update(X, drop=0.)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\maxout.py", line 69, in begin_update
    output__boc = self.ops.batch_dot(X__bi, W)
  File "ops.pyx", line 338, in thinc.neural.ops.NumpyOps.batch_dot
  File "<__array_function__ internals>", line 6, in dot
ValueError: shapes (7,512) and (640,384) not aligned: 512 (dim 1) != 640 (dim 0)
```
Suspecting it was still a version problem, versions were tried one by one; after reinstalling spaCy 2.0.16 the code finally runs:
```python
# -*- encoding:utf-8 -*-
import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("王小明在北京的清华大学读书")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, token.has_vector,
          token.ent_iob_, token.ent_type_, token.vector_norm, token.is_oov)

spacy.displacy.serve(doc)
```
Output:

```
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\QWD312~1.TCE\AppData\Local\Temp\jieba.cache
Loading model cost 0.730 seconds.
Prefix dict has been built succesfully.
王小明 王小明 X NNP nsubj xxx True False True B PERSON 14.44006 False
在 在 X VV acl x True True True O 9.84207 False
北京 北京 X NNP det xx True False True B GPE 18.310038 False
的 的 X DEC case:dec x True True True O 10.005628 False
清华大学 清华大学 X NNP obj xxxx True False True B ORG 21.960636 False
读书 读书 X VV ROOT xx True False True O 22.59519 False

Serving on port 5000...
Using the 'dep' visualizer
```
Using spaCy
A usage example:
```python
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

text = "Rami Eid is studying at Stony Brook University in New York"
doc = nlp(text)

# Tokenization and part-of-speech tagging
for token in doc:
    print(token, token.pos_, token.pos)

# Named entity recognition (NER)
for ent in doc.ents:
    print(ent, ent.label_, ent.label)

# Noun chunk extraction
for np in doc.noun_chunks:
    print(np)

# Dependency relations
for token in doc:
    print(token.text, token.dep_, token.head)

# Text similarity
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(similarity)
```
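One caveat about the similarity call at the end: en_core_web_sm does not ship real word vectors, so its similarity scores are only rough approximations (spaCy prints a warning to that effect). A model with vectors gives more meaningful scores; the following is a small sketch, assuming en_core_web_md has been downloaded with python -m spacy download en_core_web_md:

```python
import spacy

# The medium English model includes pre-trained word vectors,
# which doc.similarity() uses when they are available.
nlp = spacy.load("en_core_web_md")

doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
print(doc1.similarity(doc2))  # cosine similarity of the averaged word vectors
```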
The entity labels used by spaCy and their meanings:
| Label | Description |
| --- | --- |
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including "%". |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | "first", "second", etc. |
| CARDINAL | Numerals that do not fall under another type. |
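These descriptions are also available programmatically via spacy.explain, which is handy when an unfamiliar label turns up in doc.ents or token.dep_:

```python
import spacy

# spacy.explain looks up a short description for entity labels,
# POS tags and dependency labels in spaCy's built-in glossary.
print(spacy.explain("GPE"))    # Countries, cities, states
print(spacy.explain("NORP"))   # Nationalities or religious or political groups
print(spacy.explain("nsubj"))  # nominal subject
```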