文章内容如有错误或排版问题,请提交反馈,非常感谢!
最近在做点评做分析,主要目的是为了提取用户对点评的整体印象。类似的实现应该已经有很多了,于是从网上的开源代码中借鉴了思路。主要使用思路为:词性标注+正则提取。
- 词性标准,主要选择的是斯坦福的NLP工具。原因是词性标注集相对更加细致。但是也存在一些缺点,比如分词的准确率并不高。特别是遇到一些地名的时候。
- 正则提取,主要提取的是从语句中提取“名词”+“形容词”的结构。
针对单条点评的印象抽取
针对单个产品,多个点评汇总呈现
针对单品的点评印象,常见的常见方式是采用标签云或词云的方式:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
counter_all = Counter(short_result).most_common()
wordcloud = WordCloud(font_path="data/FZYingXueJW.TTF", background_color="white", width=800, height=600)
wordcloud.generate_from_frequencies(dict(counter_all))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

另外一种方式是只抽取核心词(评价最多的词)进行呈现。但是从上图可知,中间存在很多重复或意义相同的词。解决方案使用Word2Vec进行短文本的聚类,只获取每个类别下评价最多的词。
Word2Vec本身不支持对短文本进行聚类,解决方案是句向量平均:
#-*- coding: utf-8 -*-
import pandas as pd
from collections import Counter
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans
tencent_model = KeyedVectors.load_word2vec_format('dict_1000000.txt', binary=False)
def get_sentence_vector(sentence, model, size=200):
vec = np.zeros(size).reshape((1, size))
count = 0.
word_list = nlp.pos_tag(sentence)
for word in word_list:
if word[0] == "酒店": #去除点评主题,防止产生干扰
pass
if word[1] == "NN":
try:
vec += model[word[0]]
count += 1.
except KeyError:
continue
if count != 0:
vec /= count
return vec.tolist()[0]
if __name__ == "__main__":
short_result = []
counter = Counter(short_result).most_common(100)
if len(counter) < 50:
pass
else:
hotel_impression = {
"hotel_id": hotel_id,
}
train_data = []
for impression, count in counter:
vector = get_sentence_vector(impression, tencent_model, size=200)
impression_dict = {
"impression": impression,
"count": count,
}
for i in range(len(vector)):
impression_dict["v_"+str(i)] = vector[i]
train_data.append(impression_dict)
df = pd.DataFrame(train_data)
clf = KMeans(n_clusters=10)
clf.fit(df.iloc[:, 2:])
df['labels'] = clf.labels_
df_result = df.sort_values('count', ascending=False).drop_duplicates(['labels'])[
["impression", 'count', 'labels']]
impression_list = []
for imp in df_result[df_result['count'] != 1]["impression"]:
impression_list.append(imp)
hotel_impression["impression"] = impression_list
print(hotel_impression)
其他改进:
- 处理前将所有繁体转为简体
- 去除点评中的emoji表情
去除emoji表情代码:
import re
def filter_emoj(text):
import re
try:
# Wide UCS-4 build
myre = re.compile(u'['
u'\U0001F300-\U0001F64F'
u'\U0001F680-\U0001F6FF'
u'\u2600-\u2B55'
u'\u23cf'
u'\u23e9'
u'\u231a'
u'\u3030'
u'\ufe0f'
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u'\U00010000-\U0010ffff'
u'\U0001F1E0-\U0001F1FF' # flags (iOS)
u'\U00002702-\U000027B0]+',
re.UNICODE)
except re.error:
# Narrow UCS-2 build
myre = re.compile(u'('
u'\ud83c[\udf00-\udfff]|'
u'\ud83d[\udc00-\ude4f]|'
u'\uD83D[\uDE80-\uDEFF]|'
u"(\ud83d[\ude00-\ude4f])|" # emoticon
u'[\u2600-\u2B55]|'
u'[\u23cf]|'
u'[\u1f918]|'
u'[\u23e9]|'
u'[\u231a]|'
u'[\u3030]|'
u'[\ufe0f]|'
u'\uD83D[\uDE00-\uDE4F]|'
u'\uD83C[\uDDE0-\uDDFF]|'
u'[\u2702-\u27B0]|'
u'\uD83D[\uDC00-\uDDFF])+',
re.UNICODE)
return myre.sub('', text)



