Previous articles introduced Google's word-vector tool Word2Vec, Facebook's word-vector tool FastText, and Stanford's word-vector tool GloVe, mostly from the theoretical side. This article covers how to actually use these tools, and compares the results each one produces on the same training data.
Word2Vec Word Vector Training and Usage
Word2Vec training was already covered in detail in the earlier article on training word2vec on Chinese Wikipedia, so it will not be repeated here. The basic flow is to compile the tool and then train:
```bash
git clone https://github.com/tmikolov/word2vec.git
cd word2vec
make
./word2vec -train "../data/output.txt" -output "../data/word2vec.model" -cbow 1 -size 300 -window 8 \
    -negative 25 -hs 0 -sample 1e-4 -threads 32 -binary 1 -iter 15
```
Using the Word2vec Model
```python
import gensim.models.keyedvectors as word2vec

word2vec_model = word2vec.KeyedVectors.load_word2vec_format(
    'data/word2vec.model', binary=True, unicode_errors='ignore')
print(word2vec_model.most_similar('性价比'))
```
The unicode_errors='ignore' parameter is there mainly to work around the following error:
```
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data
```
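The error occurs when a word string in the model file is cut off in the middle of a multi-byte UTF-8 sequence. A small standalone sketch (not from the model file itself) that reproduces the same error and shows what errors='ignore' does:

```python
# Truncate a multi-byte UTF-8 string mid-character to provoke the same error.
data = '性价比'.encode('utf-8')[:-1]  # drop the last byte of the final character

try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)  # ... unexpected end of data

# errors='ignore' simply drops the incomplete trailing bytes.
print(data.decode('utf-8', errors='ignore'))  # -> 性价
```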
The most_similar query above returns:
```
[('性比价', 0.8243662118911743), ('性价', 0.7430107593536377), ('性价比特', 0.6778421401977539), ('信价', 0.5147293210029602), ('CP 值', 0.5129910707473755), ('价比', 0.5119792819023132), ('92241473', 0.5006518363952637), ('物有所值', 0.4925231635570526), ('档次', 0.4839213192462921), ('性價', 0.4788089692592621)]
```
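Besides most_similar, the loaded KeyedVectors object supports a few other handy queries. A minimal sketch, reusing the word2vec_model loaded above (the second query word is taken from the output list, so it is known to be in the vocabulary):

```python
# Cosine similarity between two specific words.
print(word2vec_model.similarity('性价比', '物有所值'))

# The raw 300-dimensional vector for a word (first 10 components shown).
print(word2vec_model['性价比'][:10])
```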
FastText Word Vector Training and Usage
Likewise, the official tool needs to be compiled and installed before training:
```bash
# Command-line tool
git clone https://github.com/facebookresearch/fastText.git
cd fastText && make

# Python package
git clone https://github.com/facebookresearch/fastText.git
cd fastText
python setup.py install
```
Before using it, let's first look at the example training script word-vector-example.sh:
```bash
#!/usr/bin/env bash
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#

RESULTDIR=result
DATADIR=data

mkdir -p "${RESULTDIR}"
mkdir -p "${DATADIR}"

if [ ! -f "${DATADIR}/fil9" ]
then
  wget -c http://mattmahoney.net/dc/enwik9.zip -P "${DATADIR}"
  unzip "${DATADIR}/enwik9.zip" -d "${DATADIR}"
  perl wikifil.pl "${DATADIR}/enwik9" > "${DATADIR}"/fil9
fi

if [ ! -f "${DATADIR}/rw/rw.txt" ]
then
  wget -c https://nlp.stanford.edu/~lmthang/morphoNLM/rw.zip -P "${DATADIR}"
  unzip "${DATADIR}/rw.zip" -d "${DATADIR}"
fi

make

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

cut -f 1,2 "${DATADIR}"/rw/rw.txt | awk '{print tolower($0)}' | tr '\t' '\n' > "${DATADIR}"/queries.txt

cat "${DATADIR}"/queries.txt | ./fasttext print-word-vectors "${RESULTDIR}"/fil9.bin > "${RESULTDIR}"/vectors.txt

python eval.py -m "${RESULTDIR}"/vectors.txt -d "${DATADIR}"/rw/rw.txt
```
The core command is:
```bash
./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100
```
The training parameters mean the following:
```
$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path (required)
  -output             output file path (required)

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal word count, default 1 (word-representation modes skipgram and cbow use a default -minCount of 5)
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         word n-gram setting, default 1
  -bucket             number of buckets [2000000]
  -minn               min length of char n-grams, default 0
  -maxn               max length of char n-grams, default 0
  -t                  sampling threshold, default 0.0001
  -label              label prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate, default 0.1
  -lrUpdateRate       rate of updates for the learning rate, default 100
  -dim                dimension of word vectors, default 100
  -ws                 size of the context window, default 5
  -epoch              number of epochs, default 5
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax}, default softmax
  -thread             number of threads, default 12
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
```
Reference: https://fasttext.cc/docs/en/options.html
The final training command used here is:
./fasttext skipgram -input "../data/output.txt" -output "../data/fasttext.model" -lr 0.01 -dim 300 -bucket 2000000 -thread 32
Using FastText Word Vectors
```python
from gensim.models import FastText

# Note: load_fasttext_format is the gensim 3.x API; newer gensim versions
# replace it with gensim.models.fasttext.load_facebook_model.
fasttext_model = FastText.load_fasttext_format('data/fasttext.model')
print(fasttext_model.most_similar('性价比'))
```
Output:
```
[('性价比👌', 0.9393969178199768), ('性价比市', 0.932769775390625), ('性价比比', 0.9304042458534241), ('性价此', 0.9251571297645569), ('性价比底', 0.9238805174827576), ('x性价比', 0.9228106737136841), ('无性价比', 0.9195789694786072), ('性价比髙', 0.9189218878746033), ('w性价比', 0.9176821112632751), ('性价比赞', 0.9165310263633728)]
```
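One practical consequence of fastText's character n-grams is that it can produce a vector even for a word it never saw during training. A minimal sketch, reusing fasttext_model from above (the query word is a made-up example, and the snippet assumes the gensim 3.x API):

```python
oov_word = '超高性价比'  # hypothetical word, presumably absent from the corpus

# The word has no entry in the trained vocabulary...
print(oov_word in fasttext_model.wv.vocab)

# ...but a vector can still be composed from its character n-grams.
print(fasttext_model.wv[oov_word][:10])
```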
GloVe Word Vector Training and Usage
There are several ways to train GloVe vectors; here we use the official C implementation. Before training, first download the source code and compile it:
```bash
git clone http://github.com/stanfordnlp/glove
cd glove && make
```
After compiling, a build directory is created under the glove directory by default, containing the four tools needed for training:
```
build/
|-- cooccur
|-- glove
|-- shuffle
`-- vocab_count
```
Before explaining how to use these tools, let's look at the example training script demo.sh:
```bash
#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

make
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi

CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo
echo "$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE

if [ "$CORPUS" = 'text8' ]; then
  if [ "$1" = 'matlab' ]; then
    matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
  elif [ "$1" = 'octave' ]; then
    octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
  else
    echo "$ python eval/python/evaluate.py"
    python eval/python/evaluate.py
  fi
fi
```
As the example shows, training consists of four steps, one per tool, run in the order vocab_count -> cooccur -> shuffle -> glove:
```bash
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
```
What each step does:

- vocab_count: counts word frequencies in the corpus $CORPUS (note: Chinese corpora must be segmented into words first) and writes $VOCAB_FILE, one "word frequency" pair per line. -min-count 5 discards words occurring fewer than 5 times; -verbose 2 controls how much progress information is printed (0 means silent).
- cooccur: gathers word co-occurrence statistics from the corpus into $COOCCURRENCE_FILE, a non-text binary file. -memory 4.0 caps the memory for the bigram_table buffer, -vocab-file points to the file produced in the previous step, -verbose 2 is as above, and -window-size sets the context window size.
- shuffle: shuffles the records of $COOCCURRENCE_FILE and writes $COOCCURRENCE_SHUF_FILE.
- glove: trains the model and writes the word vector files. -save-file, -threads, -input-file and -vocab-file should be self-explanatory; -iter is the number of iterations, -vector-size the vector dimensionality, and -binary the output format (0: save as text files; 1: save as binary; 2: both). A quick sanity check on the text output follows this list.
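Before converting the text output for gensim, it can be worth verifying that it has the expected "word v1 v2 ... vN" layout. A minimal sketch, assuming the demo's -save-file vectors and -binary 2 settings (adjust the path to your own output):

```python
# Peek at the first line of the GloVe text output: one word followed by
# its vector components, separated by spaces.
with open('vectors.txt', encoding='utf-8') as f:
    fields = f.readline().split()

print(fields[0], len(fields) - 1)  # the word and its dimensionality
```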
Using GloVe Word Vectors
Before GloVe vectors can be used in gensim, they first have to be converted to the word2vec format:
```python
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

glove_input_file = 'data/glove.model'
word2vec_output_file = 'data/glove2word2vec.model'

# Convert the GloVe text format to word2vec text format
# (the conversion adds the vocabulary-size/dimension header line).
glove2word2vec(glove_input_file, word2vec_output_file)

glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
print(glove_model.most_similar('性价比'))
```
Output:
```
[('高', 0.8684605360031128), ('Ok喇??', 0.8263183832168579), ('超高', 0.8215326070785522), ('性价', 0.7322962880134583), ('价位', 0.7196157574653625), ('价格', 0.7166442275047302), ('实惠', 0.7093995809555054), ('总体', 0.6866426467895508), ('Q性', 0.6845536828041077), ('总之', 0.6692114472389221)]
```
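With all three models loaded in the same session (word2vec_model, fasttext_model, and glove_model from the sections above), a side-by-side comparison is straightforward. A small sketch, assuming the gensim 3.x APIs used throughout this article:

```python
# Compare the top-5 neighbours of the same query word across the three models.
for name, model in [('word2vec', word2vec_model),
                    ('fasttext', fasttext_model),
                    ('glove', glove_model)]:
    neighbours = [word for word, score in model.most_similar('性价比', topn=5)]
    print(name, neighbours)
```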
Summary
On the same dataset, GloVe and FastText trained fairly quickly, while Word2Vec took noticeably longer. Judging from the results, GloVe's nearest neighbors look somewhat odd: rather than being semantically related, they seem to mostly reflect co-occurrence.
Other references: