
Word Vectors in Practice: Word2Vec, FastText, GloVe

钱魏Way

Earlier articles introduced Google's word-vector tool Word2Vec, Facebook's word-vector tool FastText, and Stanford's word-vector tool GloVe in detail, mainly from the theoretical side. This post focuses on how to actually use these tools, and compares the results they produce on the same training data.

Training and Using Word2Vec Word Vectors

Training Word2Vec vectors was already covered in detail in the earlier post on training word2vec on the Chinese Wikipedia corpus, so it will not be repeated here. The main flow is to compile the tool and then train:

git clone https://github.com/tmikolov/word2vec.git
cd word2vec
make
./word2vec -train "../data/output.txt" -output "../data/word2vec.model" -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 32 -binary 1 -iter 15

Using the Word2Vec model

import gensim.models.keyedvectors as word2vec
word2vec_model = word2vec.KeyedVectors.load_word2vec_format('data/word2vec.model', binary=True, unicode_errors='ignore')
print(word2vec_model.most_similar('性价比'))

The unicode_errors='ignore' argument is mainly there to work around this error:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

Output:

[('性比价', 0.8243662118911743), ('性价', 0.7430107593536377), ('性价比特', 0.6778421401977539), ('信价', 0.5147293210029602), ('CP值', 0.5129910707473755), ('价比', 0.5119792819023132), ('92241473', 0.5006518363952637), ('物有所值', 0.4925231635570526), ('档次', 0.4839213192462921), ('性價', 0.4788089692592621)]
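For reference, most_similar ranks candidate words by the cosine similarity between their vectors and the query vector. A minimal sketch of that computation (toy vectors, not real embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; most_similar ranks by this score
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors, purely for illustration
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, so similarity is ~1.0
print(cosine_similarity(a, b))
```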

Training and Using FastText Word Vectors

As before, the official tool needs to be compiled and installed before training:

# Command-line tool
git clone https://github.com/facebookresearch/fastText.git
cd fastText && make

# Python package
git clone https://github.com/facebookresearch/fastText.git
cd fastText
python setup.py install

Before using it, let's first look at the example training script word-vector-example.sh:

#!/usr/bin/env bash
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#

RESULTDIR=result
DATADIR=data

mkdir -p "${RESULTDIR}"
mkdir -p "${DATADIR}"

if [ ! -f "${DATADIR}/fil9" ]
then
  wget -c http://mattmahoney.net/dc/enwik9.zip -P "${DATADIR}"
  unzip "${DATADIR}/enwik9.zip" -d "${DATADIR}"
  perl wikifil.pl "${DATADIR}/enwik9" > "${DATADIR}"/fil9
fi

if [ ! -f "${DATADIR}/rw/rw.txt" ]
then
  wget -c https://nlp.stanford.edu/~lmthang/morphoNLM/rw.zip -P "${DATADIR}"
  unzip "${DATADIR}/rw.zip" -d "${DATADIR}"
fi
make

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

cut -f 1,2 "${DATADIR}"/rw/rw.txt | awk '{print tolower($0)}' | tr '\t' '\n' > "${DATADIR}"/queries.txt

cat "${DATADIR}"/queries.txt | ./fasttext print-word-vectors "${RESULTDIR}"/fil9.bin > "${RESULTDIR}"/vectors.txt

python eval.py -m "${RESULTDIR}"/vectors.txt -d "${DATADIR}"/rw/rw.txt

The core command in it is:

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

The training parameters mean the following:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path (required)
  -output             output file path (required)

  The following arguments are optional:
  -verbose            verbosity level [2]

  The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1] (word-representation modes skipgram and cbow use a default -minCount of 5)
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word n-grams [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char n-gram [0]
  -maxn               max length of char n-gram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

  The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

  The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Reference: https://fasttext.cc/docs/en/options.html

The final training command used here is:

./fasttext skipgram -input "../data/output.txt" -output "../data/fasttext.model" -lr 0.01 -dim 300 -bucket 2000000 -thread 32

Using FastText word vectors

from gensim.models import FastText
fasttext_model = FastText.load_fasttext_format('data/fasttext.model')
print(fasttext_model.most_similar('性价比'))

Output:

[('性价比👌', 0.9393969178199768), ('性价比市', 0.932769775390625), ('性价比比', 0.9304042458534241), ('性价此', 0.9251571297645569), ('性价比底', 0.9238805174827576), ('x性价比', 0.9228106737136841), ('无性价比', 0.9195789694786072), ('性价比髙', 0.9189218878746033), ('w性价比', 0.9176821112632751), ('性价比赞', 0.9165310263633728)]
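The neighbors above are dominated by surface-level variants of the query, which reflects how FastText builds vectors: each word is represented by its character n-grams (between -minn and -maxn characters long) in addition to the word itself. A minimal sketch of that n-gram extraction:

```python
def char_ngrams(word, minn=3, maxn=6):
    # FastText wraps the word in '<' and '>' before extracting character n-grams
    wrapped = '<' + word + '>'
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams('where', minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a word's vector is built from these pieces, strings sharing many n-grams with the query (like the variants above) score high, and even out-of-vocabulary words can still be assigned a vector.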

Training and Using GloVe Word Vectors

There are several ways to train GloVe vectors; here we use the official C implementation. First download the source and compile it:

git clone http://github.com/stanfordnlp/glove
cd glove && make

After compiling, a build directory is created under the glove directory by default, containing the four tools needed for training:

build/
|-- cooccur
|-- glove
|-- shuffle
`-- vocab_count

Before describing how to use these tools, let's look at the example training script demo.sh:

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

make
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi

CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
   if [ "$1" = 'matlab' ]; then
       matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2 
   elif [ "$1" = 'octave' ]; then
       octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
   else
       echo "$ python eval/python/evaluate.py"
       python eval/python/evaluate.py
   fi
fi

As the script shows, training consists of four steps, one per tool, run in the order vocab_count -> cooccur -> shuffle -> glove:

$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE

What each step does:

  • vocab_count: counts word frequencies in the corpus $CORPUS (note: a Chinese corpus must be segmented into words first) and writes them to $VOCAB_FILE, one word-frequency pair per line. -min-count 5 discards words occurring fewer than 5 times; -verbose 2 controls how much progress information is printed (0 means silent).
  • cooccur: computes word co-occurrence statistics from the corpus and writes them to $COOCCURRENCE_FILE in a non-text binary format. -memory 4.0 sets the memory budget for the bigram_table buffer, -vocab-file points to the file produced in the previous step, -verbose 2 is as above, and -window-size sets the context window size.
  • shuffle: shuffles $COOCCURRENCE_FILE and writes the result to $COOCCURRENCE_SHUF_FILE.
  • glove: trains the model and writes out the word-vector file. -save-file, -threads, -input-file and -vocab-file are self-explanatory; -iter is the number of iterations, -vector-size the vector dimensionality, and -binary controls the output format (0: save as text; 1: save as binary; 2: both).
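When saved as text, the resulting vectors file is simply one word per line followed by its vector components, separated by spaces (with no header line, unlike the word2vec text format). Loading it directly is straightforward; a minimal sketch, with a hypothetical path:

```python
import numpy as np

def load_glove_text(path):
    # GloVe text output: "<word> <v1> <v2> ..." per line, no header line
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Hypothetical usage:
# vecs = load_glove_text('vectors.txt')
# print(vecs['the'].shape)
```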

Using GloVe word vectors

To use GloVe vectors in gensim, they must first be converted to the word2vec format:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

glove_input_file = 'data/glove.model'
word2vec_output_file = 'data/glove2word2vec.model'
# Convert: prepends the "<vocab size> <dimensions>" header line word2vec expects
glove2word2vec(glove_input_file, word2vec_output_file)
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
print(glove_model.most_similar('性价比'))

Output:

[('高', 0.8684605360031128), ('Ok喇??', 0.8263183832168579), ('超高', 0.8215326070785522), ('性价', 0.7322962880134583), ('价位', 0.7196157574653625), ('价格', 0.7166442275047302), ('实惠', 0.7093995809555054), ('总体', 0.6866426467895508), ('Q性', 0.6845536828041077), ('总之', 0.6692114472389221)]

Summary

On the same data, GloVe and FastText trained noticeably faster, while Word2Vec took considerably longer. Judging from the results, the GloVe neighbors look somewhat odd: they seem to capture co-occurrence more than semantic similarity.

