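I have recently been analyzing user reviews, mainly to extract the overall impression that reviewers have of the thing being reviewed. There are probably plenty of similar implementations already, so I borrowed ideas from open-source code online. The basic approach is: POS tagging + regex extraction.
- POS tagging: I mainly chose Stanford's NLP tools, because their POS tag set is relatively fine-grained. They have drawbacks too, for example word segmentation accuracy is not high, especially on place names.
- Regex extraction: this pulls "noun" + "adjective" structures out of each sentence.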
Impression extraction for a single review
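The POS tagging + regex idea above can be sketched roughly as follows. This is a minimal illustration, not the original extraction code: it assumes the tagger returns (word, tag) pairs using Penn Chinese Treebank style tags (NN for nouns, AD for adverbs, VA for predicative adjectives), and the names `TAG_CODES`, `IMPRESSION_PATTERN` and `extract_impressions` are made up for this example. Each tag is mapped to one letter so that a regex match position lines up with a token position.

```python
import re

# One-letter codes for the tags the pattern cares about; any other tag becomes "O".
# Assumed tag names: NN = noun, AD = adverb, VA = predicative adjective.
TAG_CODES = {"NN": "N", "AD": "D", "VA": "A"}

# A noun, then optional adverbs, then an adjective.
IMPRESSION_PATTERN = re.compile(r"ND*A")


def extract_impressions(tagged_tokens):
    """Return noun + adjective phrases from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    impressions = []
    for match in IMPRESSION_PATTERN.finditer(codes):
        # One code letter per token, so match offsets are token offsets.
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        impressions.append("".join(words))
    return impressions


# Hand-tagged sample standing in for real tagger output.
sample = [("房间", "NN"), ("很", "AD"), ("干净", "VA"), (",", "PU"),
          ("服务", "NN"), ("不错", "VA")]
print(extract_impressions(sample))  # ['房间很干净', '服务不错']
```

Adverbs are kept in the extracted phrase on purpose: negations such as "不" are usually tagged as adverbs, so dropping them could flip the meaning of an impression.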
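Aggregated presentation of multiple reviews for a single product
For the review impressions of a single product, a common way to present them is a tag cloud or word cloud: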
    from collections import Counter

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # short_result: the list of impression phrases extracted from the reviews
    counter_all = Counter(short_result).most_common()

    # font_path must point to a font that contains Chinese glyphs
    wordcloud = WordCloud(font_path="data/FZYingXueJW.TTF", background_color="white",
                          width=800, height=600)
    wordcloud.generate_from_frequencies(dict(counter_all))

    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
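Another option is to extract and present only the core phrases (the ones mentioned most often). As the word cloud above shows, however, many phrases are duplicates or mean the same thing. The fix is to cluster the short texts using Word2Vec vectors and keep only the most frequently mentioned phrase in each cluster.
Word2Vec itself does not cluster short texts directly; the workaround is to average the word vectors into a sentence vector: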
    # -*- coding: utf-8 -*-
    from collections import Counter

    import numpy as np
    import pandas as pd
    from gensim.models import KeyedVectors
    from sklearn.cluster import KMeans

    tencent_model = KeyedVectors.load_word2vec_format('dict_1000000.txt', binary=False)


    def get_sentence_vector(sentence, model, size=200):
        """Average the vectors of the nouns in `sentence`."""
        vec = np.zeros(size).reshape((1, size))
        count = 0.
        # `nlp` is the Stanford POS tagger set up earlier (initialization not shown here)
        word_list = nlp.pos_tag(sentence)
        for word in word_list:
            if word[0] == "酒店":  # skip the review subject itself so it does not add noise
                continue
            if word[1] == "NN":
                try:
                    vec += model[word[0]]
                    count += 1.
                except KeyError:  # word is not in the embedding vocabulary
                    continue
        if count != 0:
            vec /= count
        return vec.tolist()[0]


    if __name__ == "__main__":
        # Fill with the impression phrases extracted from one hotel's reviews
        short_result = []
        counter = Counter(short_result).most_common(100)
        if len(counter) < 50:
            # too few distinct impressions to cluster
            pass
        else:
            hotel_impression = {
                "hotel_id": hotel_id,  # hotel_id comes from the surrounding pipeline (not shown)
            }
            train_data = []
            for impression, count in counter:
                vector = get_sentence_vector(impression, tencent_model, size=200)
                impression_dict = {
                    "impression": impression,
                    "count": count,
                }
                for i in range(len(vector)):
                    impression_dict["v_" + str(i)] = vector[i]
                train_data.append(impression_dict)
            df = pd.DataFrame(train_data)

            # Cluster on the vector columns only (the first two columns are impression and count)
            clf = KMeans(n_clusters=10)
            clf.fit(df.iloc[:, 2:])
            df['labels'] = clf.labels_

            # Keep the most frequent impression of each cluster
            df_result = df.sort_values('count', ascending=False).drop_duplicates(['labels'])[
                ["impression", 'count', 'labels']]

            # Drop impressions that were mentioned only once
            impression_list = []
            for imp in df_result[df_result['count'] != 1]["impression"]:
                impression_list.append(imp)
            hotel_impression["impression"] = impression_list
            print(hotel_impression)
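The script above needs a large embedding file and a running tagger, so here is a dependency-free sketch of the two steps around the clustering: averaging word vectors, and keeping the top phrase per cluster. It is only an illustration under stated assumptions: the toy embeddings, the hard-coded cluster labels and the helper names `average_vector` and `top_per_cluster` are all invented for this example and do not come from the original code.

```python
def average_vector(words, embeddings, size):
    """Average the vectors of the words found in `embeddings`; all zeros if none are found."""
    total = [0.0] * size
    found = 0
    for word in words:
        vector = embeddings.get(word)
        if vector is None:
            continue  # out-of-vocabulary words are skipped
        total = [t + v for t, v in zip(total, vector)]
        found += 1
    if found == 0:
        return total
    return [t / found for t in total]


def top_per_cluster(impressions, labels):
    """Keep the highest-count (impression, count) pair for each cluster label."""
    best = {}
    for (impression, count), label in zip(impressions, labels):
        if label not in best or count > best[label][1]:
            best[label] = (impression, count)
    # Most frequent representatives first
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional embeddings (made up for illustration)
toy_embeddings = {"房间": [1.0, 0.0], "卫生": [0.0, 1.0]}
print(average_vector(["房间", "卫生", "未知词"], toy_embeddings, size=2))  # [0.5, 0.5]

impressions = [("房间干净", 30), ("卫生不错", 12), ("服务热情", 20), ("服务周到", 8)]
labels = [0, 0, 1, 1]  # cluster ids, as a clustering step might assign them
print(top_per_cluster(impressions, labels))  # [('房间干净', 30), ('服务热情', 20)]
```

Keeping one representative per cluster is what removes near-duplicate phrases: "房间干净" and "卫生不错" land in the same cluster here, so only the more frequent one survives.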
Other improvements:
- Convert all Traditional Chinese to Simplified Chinese before processing
- Remove emoji from the reviews
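For the Traditional-to-Simplified step, the idea can be shown with a character-level translation table. This is only a toy sketch: the mapping below covers a handful of characters chosen for the example, and `T2S_TABLE` and `to_simplified` are hypothetical names. In practice a dedicated conversion tool with a complete table (OpenCC is one such project) would likely be a better fit, since some characters need context to convert correctly.

```python
# Toy mapping for illustration only; real text needs a complete conversion table.
T2S_TABLE = str.maketrans({
    "這": "这",
    "間": "间",
    "務": "务",
    "點": "点",
    "評": "评",
})


def to_simplified(text):
    """Replace each Traditional character found in the toy table with its Simplified form."""
    return text.translate(T2S_TABLE)


print(to_simplified("這間酒店的服務很好"))  # 这间酒店的服务很好
```

Normalizing the script first means that the same impression written in Traditional and in Simplified characters is counted as one phrase rather than two.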
Code for removing emoji:
    import re


    def filter_emoj(text):
        """Replace emoji and related symbols in `text` with a space."""
        try:
            # Wide (UCS-4) build: code points above U+FFFF can be written directly
            myre = re.compile(u'['
                              u'\U0001F300-\U0001F64F'
                              u'\U0001F680-\U0001F6FF'
                              u'\u2600-\u2B55'
                              u'\u23cf'
                              u'\u23e9'
                              u'\u231a'
                              u'\u3030'
                              u'\ufe0f'
                              u"\U0001F600-\U0001F64F"  # emoticons
                              u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                              u'\U00010000-\U0010ffff'  # every supplementary-plane character
                              u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                              u'\U00002702-\U000027B0]+',
                              re.UNICODE)
        except re.error:
            # Narrow (UCS-2) build: emoji have to be matched as surrogate pairs
            myre = re.compile(u'('
                              u'\ud83c[\udf00-\udfff]|'
                              u'\ud83d[\udc00-\ude4f]|'
                              u'\uD83D[\uDE80-\uDEFF]|'
                              u"(\ud83d[\ude00-\ude4f])|"  # emoticons
                              u'[\u2600-\u2B55]|'
                              u'[\u23cf]|'
                              u'\ud83e\udd18|'  # U+1F918 written as a surrogate pair
                              u'[\u23e9]|'
                              u'[\u231a]|'
                              u'[\u3030]|'
                              u'[\ufe0f]|'
                              u'\uD83D[\uDE00-\uDE4F]|'
                              u'\uD83C[\uDDE0-\uDDFF]|'
                              u'[\u2702-\u27B0]|'
                              u'\uD83D[\uDC00-\uDDFF])+',
                              re.UNICODE)
        return myre.sub(' ', text)