
Word Vectors in Practice: Word2Vec, FastText, GloVe

钱魏Way

Earlier articles introduced Google's word vector tool Word2Vec, Facebook's FastText, and Stanford's GloVe in detail, mostly at the level of their underlying principles. This post focuses on how to actually use these tools, and compares the results they produce on the same training data.

Training and Using Word2Vec Word Vectors

Word2Vec training was already covered in detail in the earlier post 使用 word2vec 训练中文维基百科 (training word2vec on Chinese Wikipedia), so it will not be repeated here. The main workflow is to compile the tool and then train:

git clone https://github.com/tmikolov/word2vec.git
cd word2vec
make
./word2vec -train "../data/output.txt" -output "../data/word2vec.model" -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 32 -binary 1 -iter 15

Using the Word2Vec Model

from gensim.models import KeyedVectors
# Load the binary vectors produced by the training run above
word2vec_model = KeyedVectors.load_word2vec_format('data/word2vec.model', binary=True, unicode_errors='ignore')
print(word2vec_model.most_similar('性价比'))

The unicode_errors='ignore' argument is mainly there to work around this error:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

Output:

[('性比价', 0.8243662118911743), ('性价', 0.7430107593536377), ('性价比特', 0.6778421401977539), ('信价', 0.5147293210029602), ('CP 值', 0.5129910707473755), ('价比', 0.5119792819023132), ('92241473', 0.5006518363952637), ('物有所值', 0.4925231635570526), ('档次', 0.4839213192462921), ('性價', 0.4788089692592621)]
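Under the hood, most_similar simply ranks the vocabulary by cosine similarity against the query vector. A minimal sketch of the computation with toy 4-dimensional vectors (the words and values here are made up for illustration; real vectors are 300-dimensional):

```python
import numpy as np

# Toy 4-dimensional "embeddings" for illustration only
vocab = {"性价比": np.array([0.9, 0.1, 0.0, 0.2]),
         "实惠": np.array([0.8, 0.2, 0.1, 0.1]),
         "天气": np.array([0.0, 0.9, 0.8, 0.0])}

def most_similar(query, topn=2):
    # Cosine similarity = dot product of unit-normalized vectors
    q = vocab[query] / np.linalg.norm(vocab[query])
    scores = {w: float(np.dot(q, v / np.linalg.norm(v)))
              for w, v in vocab.items() if w != query}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar("性价比"))
```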

Training and Using FastText Word Vectors

As with Word2Vec, the official tool must be compiled and installed before training:

# Command-line tool
git clone https://github.com/facebookresearch/fastText.git
cd fastText && make

# Python package
git clone https://github.com/facebookresearch/fastText.git
cd fastText
python setup.py install

Before using it, let's first look at the example training script word-vector-example.sh:

#!/usr/bin/env bash
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#

RESULTDIR=result
DATADIR=data

mkdir -p "${RESULTDIR}"
mkdir -p "${DATADIR}"

if [ ! -f "${DATADIR}/fil9" ]
then
  wget -c http://mattmahoney.net/dc/enwik9.zip -P "${DATADIR}"
  unzip "${DATADIR}/enwik9.zip" -d "${DATADIR}"
  perl wikifil.pl "${DATADIR}/enwik9" > "${DATADIR}"/fil9
fi

if [ ! -f "${DATADIR}/rw/rw.txt" ]
then
  wget -c https://nlp.stanford.edu/~lmthang/morphoNLM/rw.zip -P "${DATADIR}"
  unzip "${DATADIR}/rw.zip" -d "${DATADIR}"
fi
make

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

cut -f 1,2 "${DATADIR}"/rw/rw.txt | awk '{print tolower($0)}' | tr '\t' '\n' > "${DATADIR}"/queries.txt

cat "${DATADIR}"/queries.txt | ./fasttext print-word-vectors "${RESULTDIR}"/fil9.bin > "${RESULTDIR}"/vectors.txt

python eval.py -m "${RESULTDIR}"/vectors.txt -d "${DATADIR}"/rw/rw.txt

The core command is:

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

The training parameters mean the following:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
-input training file path (required)
-output output file path (required)

The following arguments are optional:
-verbose verbosity level [2]

The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences, default 1 (the word-representation modes skipgram and cbow use a default -minCount of 5)
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word n-grams, default 1
-bucket number of buckets [2000000]
-minn min length of character n-grams, default 0
-maxn max length of character n-grams, default 0
-t sampling threshold, default 0.0001
-label label prefix [__label__]

The following arguments for training are optional:
-lr learning rate, default 0.1
-lrUpdateRate rate of updates for the learning rate, default 100
-dim dimensionality of the trained word vectors, default 100
-ws size of the context window, default 5
-epoch number of epochs, default 5
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax}, default softmax
-thread number of threads, default 12
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]

The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]

Reference: https://fasttext.cc/docs/en/options.html

The final training command used here is:

./fasttext skipgram -input "../data/output.txt" -output "../data/fasttext.model" -lr 0.01 -dim 300 -bucket 2000000 -thread 32

Using FastText Word Vectors

from gensim.models import FastText
# load_fasttext_format is deprecated in newer gensim releases; there,
# use gensim.models.fasttext.load_facebook_model('data/fasttext.model.bin')
fasttext_model = FastText.load_fasttext_format('data/fasttext.model')
print(fasttext_model.most_similar('性价比'))

Output:

[('性价比👌', 0.9393969178199768), ('性价比市', 0.932769775390625), ('性价比比', 0.9304042458534241), ('性价此', 0.9251571297645569), ('性价比底', 0.9238805174827576), ('x性价比', 0.9228106737136841), ('无性价比', 0.9195789694786072), ('性价比髙', 0.9189218878746033), ('w性价比', 0.9176821112632751), ('性价比赞', 0.9165310263633728)]
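The near-duplicate neighbors above are a direct consequence of fastText's subword model: each word vector is the sum of vectors for its character n-grams (controlled by -minn/-maxn), so strings that share most of their characters end up close together. A sketch of the n-gram extraction, with the boundary markers `<` and `>` that fastText adds around each word:

```python
def char_ngrams(word, minn=3, maxn=6):
    # fastText wraps each word in boundary markers before extracting n-grams
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("性价比"))
```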

Training and Using GloVe Word Vectors

There are several ways to train GloVe; this post uses the official C implementation. Before training, download the source and compile it:

git clone http://github.com/stanfordnlp/glove
cd glove && make

After compilation, a build directory is created under the glove directory by default, containing the four tools needed for training:

build/
|-- cooccur
|-- glove
|-- shuffle
`-- vocab_count

Before describing how to use these tools, let's look at the example training script demo.sh:

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

make
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi

CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo
echo "$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
  if [ "$1" = 'matlab' ]; then
    matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
  elif [ "$1" = 'octave' ]; then
    octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
  else
    echo "$ python eval/python/evaluate.py"
    python eval/python/evaluate.py
  fi
fi

As the script shows, training consists of four steps, one per tool, in the order vocab_count → cooccur → shuffle → glove:

$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE

What each step does:

  • vocab_count: counts word frequencies in the corpus $CORPUS (note: Chinese corpora must be word-segmented first) and writes them to $VOCAB_FILE, one "word frequency" pair per line. -min-count 5 discards words appearing fewer than 5 times; -verbose 2 controls how much progress output is printed (0 for silent).
  • cooccur: builds word co-occurrence statistics from the corpus and writes them to $COOCCURRENCE_FILE in a non-text binary format. -memory 4.0 sets the memory budget for the bigram_table buffer; -vocab-file points to the file produced in the previous step; -verbose 2 as above; -window-size sets the context window size.
  • shuffle: shuffles the records in $COOCCURRENCE_FILE and writes $COOCCURRENCE_SHUF_FILE.
  • glove: trains the model and writes the word vector files. -save-file, -threads, -input-file and -vocab-file are self-explanatory; -iter is the number of iterations, -vector-size the vector dimensionality, and -binary the output format (0: save as text files; 1: save as binary; 2: both).
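Among these flags, -x-max controls GloVe's loss weighting function, f(x) = (x / x_max)^α for x < x_max and 1 otherwise (α = 0.75), which down-weights rare, noisy co-occurrence counts. A quick sketch:

```python
def glove_weight(x, x_max=10.0, alpha=0.75):
    # Rare pairs get a fractional weight; counts above x_max are capped at 1.0
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(2), glove_weight(10), glove_weight(500))
```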

Using GloVe Word Vectors

To use GloVe vectors with gensim, they first have to be converted to word2vec format:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

glove_input_file = 'data/glove.model'
word2vec_output_file = 'data/glove2word2vec.model'
# Convert the GloVe text format to word2vec format (adds a header line)
glove2word2vec(glove_input_file, word2vec_output_file)
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
print(glove_model.most_similar('性价比'))

Output:

[('高', 0.8684605360031128), ('Ok喇??', 0.8263183832168579), ('超高', 0.8215326070785522), ('性价', 0.7322962880134583), ('价位', 0.7196157574653625), ('价格', 0.7166442275047302), ('实惠', 0.7093995809555054), ('总体', 0.6866426467895508), ('Q性', 0.6845536828041077), ('总之', 0.6692114472389221)]

Summary

On the same data, GloVe and FastText train quickly, while Word2Vec takes considerably longer. Judging from the results, GloVe's neighbors look odd: they seem to reflect co-occurrence more than semantic similarity.

