NLP in Practice: Extracting Impressions from User Reviews

钱魏Way · 214 views

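I have recently been analyzing user reviews, with the main goal of extracting the overall impression users express in them. Many implementations of this already exist, so I borrowed the approach from open-source code found online. The basic idea is POS tagging followed by regular-expression extraction.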

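  • POS tagging: I mainly use the Stanford NLP tools, because their POS tag set is comparatively fine-grained. There are drawbacks, though: word segmentation accuracy is not high, especially for place names.
  • Regex extraction: pull "noun" + "adjective" structures out of each sentence.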

Impression extraction for a single review
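
A minimal sketch of the "noun + adjective" extraction, written under several assumptions: the tagger returns (word, tag) pairs, the tags follow the Penn Chinese Treebank set used by the Stanford Chinese models (NN for nouns, AD for adverbs, VA for predicative adjectives), and adverbs between the noun and the adjective are dropped. The function name extract_impressions and the exact pattern are illustrative only and may differ from the code actually used.

```python
import re

# One noun, any number of adverbs, then one predicative adjective,
# matched over a "word/TAG " string. (?<!\S) pins the match to a token start.
PATTERN = re.compile(r"(?<!\S)(\S+)/NN (?:\S+/AD )*(\S+)/VA ")


def extract_impressions(tagged):
    """Return noun+adjective phrases from a list of (word, tag) pairs."""
    tagged_text = "".join(f"{word}/{tag} " for word, tag in tagged)
    return [noun + adjective for noun, adjective in PATTERN.findall(tagged_text)]


# Hand-tagged toy input standing in for real tagger output
tagged = [("房间", "NN"), ("很", "AD"), ("干净", "VA"), ("，", "PU"),
          ("服务", "NN"), ("好", "VA")]
print(extract_impressions(tagged))
```

Serializing the tags into one string keeps the pattern readable and makes it easy to add further structures later, such as adjective + noun.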

Aggregating multiple reviews for a single product

For a single product, review impressions are commonly presented as a tag cloud or word cloud:

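from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# short_result: list of impression phrases extracted from one product's reviews
counter_all = Counter(short_result).most_common()
# A font with CJK glyphs is required, otherwise Chinese text renders as empty boxes
wordcloud = WordCloud(font_path="data/FZYingXueJW.TTF", background_color="white", width=800, height=600)
wordcloud.generate_from_frequencies(dict(counter_all))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()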

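Another approach is to present only the core phrases (those mentioned most often). As the word cloud above shows, however, many phrases are duplicates or near-synonyms. The fix is to cluster the short phrases using Word2Vec vectors and keep only the most frequently mentioned phrase in each cluster.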

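Word2Vec itself works at the word level and does not directly support clustering short texts; the workaround is to average the word vectors of a phrase into a single sentence vector: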

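# -*- coding: utf-8 -*-
from collections import Counter

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Pre-trained word vectors in word2vec text format
tencent_model = KeyedVectors.load_word2vec_format('dict_1000000.txt', binary=False)


def get_sentence_vector(sentence, model, size=200):
    """Average the vectors of the nouns in `sentence`; returns a list of `size` floats."""
    vec = np.zeros(size)
    count = 0
    # `nlp` is the Stanford POS tagger client used for extraction; it must exist before this call
    word_list = nlp.pos_tag(sentence)
    for word, tag in word_list:
        if word == "酒店":  # skip the review subject ("hotel") so it does not skew the vector
            continue
        if tag == "NN":
            try:
                vec += model[word]
                count += 1
            except KeyError:  # word is not in the vocabulary
                continue
    if count != 0:
        vec /= count
    return vec.tolist()  # all zeros when no noun had a vector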


if __name__ == "__main__":
    # Placeholder: fill with the "noun + adjective" phrases extracted from one hotel's reviews
    short_result = []
    counter = Counter(short_result).most_common(100)
    # Only cluster when there are at least 50 distinct phrases
    if len(counter) >= 50:
        hotel_impression = {
            "hotel_id": hotel_id,  # hotel_id is expected to come from the surrounding pipeline
        }
        train_data = []
        for impression, count in counter:
            vector = get_sentence_vector(impression, tencent_model, size=200)
            impression_dict = {
                "impression": impression,
                "count": count,
            }
            for i, value in enumerate(vector):
                impression_dict["v_" + str(i)] = value
            train_data.append(impression_dict)
        df = pd.DataFrame(train_data)
        clf = KMeans(n_clusters=10)
        clf.fit(df.iloc[:, 2:])  # columns 2 onward are the vector components v_0 ... v_199
        df['labels'] = clf.labels_
        # Keep only the most frequent phrase of each cluster
        df_result = df.sort_values('count', ascending=False).drop_duplicates(['labels'])[
            ["impression", 'count', 'labels']]
        # Drop phrases that were mentioned only once
        hotel_impression["impression"] = df_result[df_result['count'] != 1]["impression"].tolist()
        print(hotel_impression)
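
The two key steps above, averaging word vectors and keeping the most frequent phrase per cluster, can be checked on toy data without the large embedding file. The sketch below is a plain-Python illustration, not the pipeline itself: the helper names, the 2-dimensional toy embeddings, and the hand-assigned cluster labels are all made up for the example, and KMeans would normally supply the labels.

```python
def average_vectors(words, embeddings):
    """Mean of the vectors of the words found in `embeddings`; None if none are found."""
    found = [embeddings[w] for w in words if w in embeddings]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


def top_phrase_per_cluster(phrases, counts, labels):
    """Keep the most frequent phrase of every cluster, most frequent first."""
    best = {}
    for phrase, count, label in zip(phrases, counts, labels):
        if label not in best or count > best[label][1]:
            best[label] = (phrase, count)
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked]


# Toy 2-dimensional embeddings; unknown words are simply skipped
toy_embeddings = {"房间": [1.0, 0.0], "卫生": [0.0, 1.0]}
print(average_vectors(["房间", "卫生", "未知词"], toy_embeddings))

# Two near-synonymous phrases share cluster 0, so only the more frequent one survives
print(top_phrase_per_cluster(["房间干净", "房间整洁", "服务好"], [12, 5, 8], [0, 0, 1]))
```

The second helper mirrors the sort_values plus drop_duplicates combination in the pandas code: within each label only the highest count is kept.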

Other improvements:

Code for removing emoji:

import re


def filter_emoj(text):
    """Replace each run of emoji in `text` with a single space."""
    try:
        # Wide (UCS-4) build; every Python 3.3+ interpreter takes this branch
        myre = re.compile(u'['
                          u'\U0001F300-\U0001F64F'
                          u'\U0001F680-\U0001F6FF'
                          u'\u2600-\u2B55'
                          u'\u23cf'
                          u'\u23e9'
                          u'\u231a'
                          u'\u3030'
                          u'\ufe0f'
                          u"\U0001F600-\U0001F64F"  # emoticons
                          u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                          u'\U00010000-\U0010ffff'  # all supplementary planes, including rare CJK ideographs
                          u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                          u'\U00002702-\U000027B0]+',
                          re.UNICODE)
    except re.error:
        # Narrow (UCS-2) build of Python 2: astral code points are matched as surrogate pairs
        myre = re.compile(u'('
                          u'\ud83c[\udf00-\udfff]|'
                          u'\ud83d[\udc00-\ude4f]|'
                          u'\uD83D[\uDE80-\uDEFF]|'
                          u"(\ud83d[\ude00-\ude4f])|"  # emoticon
                          u'[\u2600-\u2B55]|'
                          u'[\u23cf]|'
                          u'\ud83e\udd18|'  # U+1F918; it does not fit in a 4-digit \u escape
                          u'[\u23e9]|'
                          u'[\u231a]|'
                          u'[\u3030]|'
                          u'[\ufe0f]|'
                          u'\uD83D[\uDE00-\uDE4F]|'
                          u'\uD83C[\uDDE0-\uDDFF]|'
                          u'[\u2702-\u27B0]|'
                          u'\uD83D[\uDC00-\uDDFF])+',
                          re.UNICODE)
    return myre.sub(' ', text)
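
On Python 3 only the first branch ever runs, and its \U00010000-\U0010ffff range removes every supplementary-plane character, which includes rare Chinese characters from CJK Extension B (U+20000 onward). A narrower Python 3 variant might look like the sketch below. The name strip_emoji and the chosen ranges are assumptions for illustration; the ranges cover common emoji blocks, not the full Unicode emoji list.

```python
import re

# Common emoji blocks only; supplementary-plane CJK ideographs are left alone
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F6FF"  # pictographs, emoticons, transport symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\ufe0f"                 # variation selector that follows many emoji
    "]+"
)


def strip_emoji(text):
    """Replace each emoji run with a space, then trim the ends."""
    return EMOJI_RE.sub(" ", text).strip()


print(strip_emoji("房间很干净😀👍"))
```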
