
Word Vectors in Practice: Word2Vec, FastText, GloVe

钱魏Way

Earlier articles introduced Google's word vector tool Word2Vec, Facebook's FastText, and Stanford's GloVe in detail, mostly at the level of their underlying principles. This post focuses on how to actually use these tools, and compares the results they produce on the same training data.

Training and Using Word2Vec Word Vectors

Word2Vec training was already covered in detail in the earlier post 使用 word2vec 训练中文维基百科 (training word2vec on Chinese Wikipedia), so it will not be repeated here. The main workflow is to compile the tool and then train:

git clone https://github.com/tmikolov/word2vec.git
cd word2vec
make
./word2vec -train "../data/output.txt" -output "../data/word2vec.model" -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 32 -binary 1 -iter 15

Using the Word2Vec Model

from gensim.models import KeyedVectors
# Load the binary vectors produced by the training run above
word2vec_model = KeyedVectors.load_word2vec_format('data/word2vec.model', binary=True, unicode_errors='ignore')
print(word2vec_model.most_similar('性价比'))

The unicode_errors='ignore' argument is mainly there to work around this error:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

Output:

[('性比价', 0.8243662118911743), ('性价', 0.7430107593536377), ('性价比特', 0.6778421401977539), ('信价', 0.5147293210029602), ('CP 值', 0.5129910707473755), ('价比', 0.5119792819023132), ('92241473', 0.5006518363952637), ('物有所值', 0.4925231635570526), ('档次', 0.4839213192462921), ('性價', 0.4788089692592621)]
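Under the hood, most_similar simply ranks the vocabulary by cosine similarity against the query vector. A minimal sketch of the computation with toy 4-dimensional vectors (the words and values here are made up for illustration; real vectors are 300-dimensional):

```python
import numpy as np

# Toy 4-dimensional "embeddings" for illustration only
vocab = {"性价比": np.array([0.9, 0.1, 0.0, 0.2]),
         "实惠": np.array([0.8, 0.2, 0.1, 0.1]),
         "天气": np.array([0.0, 0.9, 0.8, 0.0])}

def most_similar(query, topn=2):
    # Cosine similarity = dot product of unit-normalized vectors
    q = vocab[query] / np.linalg.norm(vocab[query])
    scores = {w: float(np.dot(q, v / np.linalg.norm(v)))
              for w, v in vocab.items() if w != query}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar("性价比"))
```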

Training and Using FastText Word Vectors

As with Word2Vec, the official tool must be compiled and installed before training:

# Command-line tool
git clone https://github.com/facebookresearch/fastText.git
cd fastText && make

# Python package
git clone https://github.com/facebookresearch/fastText.git
cd fastText
python setup.py install

Before using it, let's first look at the example training script word-vector-example.sh:

#!/usr/bin/env bash
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#

RESULTDIR=result
DATADIR=data

mkdir -p "${RESULTDIR}"
mkdir -p "${DATADIR}"

if [ ! -f "${DATADIR}/fil9" ]
then
  wget -c http://mattmahoney.net/dc/enwik9.zip -P "${DATADIR}"
  unzip "${DATADIR}/enwik9.zip" -d "${DATADIR}"
  perl wikifil.pl "${DATADIR}/enwik9" > "${DATADIR}"/fil9
fi

if [ ! -f "${DATADIR}/rw/rw.txt" ]
then
  wget -c https://nlp.stanford.edu/~lmthang/morphoNLM/rw.zip -P "${DATADIR}"
  unzip "${DATADIR}/rw.zip" -d "${DATADIR}"
fi
make

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

cut -f 1,2 "${DATADIR}"/rw/rw.txt | awk '{print tolower($0)}' | tr '\t' '\n' > "${DATADIR}"/queries.txt

cat "${DATADIR}"/queries.txt | ./fasttext print-word-vectors "${RESULTDIR}"/fil9.bin > "${RESULTDIR}"/vectors.txt

python eval.py -m "${RESULTDIR}"/vectors.txt -d "${DATADIR}"/rw/rw.txt

The core command is:

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

The training parameters mean the following:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
-input training file path (required)
-output output file path (required)

The following arguments are optional:
-verbose verbosity level [2]

The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences, default 1 (the word-representation modes skipgram and cbow use a default -minCount of 5)
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word n-grams, default 1
-bucket number of buckets [2000000]
-minn min length of character n-grams, default 0
-maxn max length of character n-grams, default 0
-t sampling threshold, default 0.0001
-label label prefix [__label__]

The following arguments for training are optional:
-lr learning rate, default 0.1
-lrUpdateRate rate of updates for the learning rate, default 100
-dim dimensionality of the trained word vectors, default 100
-ws size of the context window, default 5
-epoch number of epochs, default 5
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax}, default softmax
-thread number of threads, default 12
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]

The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]

Reference: https://fasttext.cc/docs/en/options.html

The final training command used here is:

./fasttext skipgram -input "../data/output.txt" -output "../data/fasttext.model" -lr 0.01 -dim 300 -bucket 2000000 -thread 32

Using FastText Word Vectors

from gensim.models import FastText
# load_fasttext_format is deprecated in newer gensim releases; there,
# use gensim.models.fasttext.load_facebook_model('data/fasttext.model.bin')
fasttext_model = FastText.load_fasttext_format('data/fasttext.model')
print(fasttext_model.most_similar('性价比'))

Output:

[('性价比👌', 0.9393969178199768), ('性价比市', 0.932769775390625), ('性价比比', 0.9304042458534241), ('性价此', 0.9251571297645569), ('性价比底', 0.9238805174827576), ('x性价比', 0.9228106737136841), ('无性价比', 0.9195789694786072), ('性价比髙', 0.9189218878746033), ('w性价比', 0.9176821112632751), ('性价比赞', 0.9165310263633728)]
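The near-duplicate neighbors above are a direct consequence of fastText's subword model: each word vector is the sum of vectors for its character n-grams (controlled by -minn/-maxn), so strings that share most of their characters end up close together. A sketch of the n-gram extraction, with the boundary markers `<` and `>` that fastText adds around each word:

```python
def char_ngrams(word, minn=3, maxn=6):
    # fastText wraps each word in boundary markers before extracting n-grams
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("性价比"))
```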

Training and Using GloVe Word Vectors

There are several ways to train GloVe; this post uses the official C implementation. Before training, download the source and compile it:

git clone http://github.com/stanfordnlp/glove
cd glove && make

After compilation, a build directory is created under the glove directory by default, containing the four tools needed for training:

build/
|-- cooccur
|-- glove
|-- shuffle
`-- vocab_count

Before describing how to use these tools, let's look at the example training script demo.sh:

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

make
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi

CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo
echo "$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
  if [ "$1" = 'matlab' ]; then
    matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
  elif [ "$1" = 'octave' ]; then
    octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
  else
    echo "$ python eval/python/evaluate.py"
    python eval/python/evaluate.py
  fi
fi

As the script shows, training consists of four steps, one per tool, in the order vocab_count → cooccur → shuffle → glove:

$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE

What each step does:

  • vocab_count: counts word frequencies in the corpus $CORPUS (note: Chinese corpora must be word-segmented first) and writes them to $VOCAB_FILE, one "word frequency" pair per line. -min-count 5 discards words appearing fewer than 5 times; -verbose 2 controls how much progress output is printed (0 for silent).
  • cooccur: builds word co-occurrence statistics from the corpus and writes them to $COOCCURRENCE_FILE in a non-text binary format. -memory 4.0 sets the memory budget for the bigram_table buffer; -vocab-file points to the file produced in the previous step; -verbose 2 as above; -window-size sets the context window size.
  • shuffle: shuffles the records in $COOCCURRENCE_FILE and writes $COOCCURRENCE_SHUF_FILE.
  • glove: trains the model and writes the word vector files. -save-file, -threads, -input-file and -vocab-file are self-explanatory; -iter is the number of iterations, -vector-size the vector dimensionality, and -binary the output format (0: save as text files; 1: save as binary; 2: both).
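Among these flags, -x-max controls GloVe's loss weighting function, f(x) = (x / x_max)^α for x < x_max and 1 otherwise (α = 0.75), which down-weights rare, noisy co-occurrence counts. A quick sketch:

```python
def glove_weight(x, x_max=10.0, alpha=0.75):
    # Rare pairs get a fractional weight; counts above x_max are capped at 1.0
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(2), glove_weight(10), glove_weight(500))
```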

Using GloVe Word Vectors

To use GloVe vectors with gensim, they first have to be converted to word2vec format:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

glove_input_file = 'data/glove.model'
word2vec_output_file = 'data/glove2word2vec.model'
# Convert the GloVe text format to word2vec format (adds a header line)
glove2word2vec(glove_input_file, word2vec_output_file)
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
print(glove_model.most_similar('性价比'))

Output:

[('高', 0.8684605360031128), ('Ok喇??', 0.8263183832168579), ('超高', 0.8215326070785522), ('性价', 0.7322962880134583), ('价位', 0.7196157574653625), ('价格', 0.7166442275047302), ('实惠', 0.7093995809555054), ('总体', 0.6866426467895508), ('Q性', 0.6845536828041077), ('总之', 0.6692114472389221)]

Summary

On the same data, GloVe and FastText train quickly, while Word2Vec takes considerably longer. Judging from the results, GloVe's neighbors look odd: they seem to reflect co-occurrence more than semantic similarity.

