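[Paper replication] Policy analysis: high-frequency word extraction + co-occurrence matrix + dissimilarity matrix + text clustering + dendrogram visualization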

2024-06-15 22:12 | Source: compiled from the web

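References

Main reference: Cao Haijun, Hou Tiantian. Attention shifts and logical evolution of urban grid management in China: a co-word and cluster analysis of central policy texts, 2005-2021 [J]. Journal of Nantong University (Social Sciences Edition), 2022, 38(02)

Secondary references:
Zhang Yong'an, Geng Zhe, Wang Yanni. A systematic classification of China's regional science and technology innovation policies: a study based on Zhongguancun data [J]. Journal of Systems Science, 2016, 24(02)
Zheng Shiming, Peng Rui, Gao Canyu. The logic and prospects of environmental policy change in China: based on co-word and cluster analysis [J]. Journal of Jishou University (Social Sciences Edition), 2019, 40(02)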

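Why replicate this paper

I am a second-year undergraduate. While working on the social practice project for our "Mao Zedong Thought" course (毛概), I wanted to build some research skills along the way, so I tried to include a data analysis component in the project design. Our group chose text analysis of policy documents.

After reading a lot of literature, we found that most studies code policy texts with NVivo 12. I could not get the software installed (the download failed and it needed a patch I could not find), so I gave up on it.

If anyone can share in the comments how to download and use this software successfully, I would be very grateful!

Reading further, I found that some papers analyze policy with a pipeline of high-frequency word extraction + co-occurrence matrix + dissimilarity matrix + text clustering + dendrogram visualization. That looked feasible, so we adopted it.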

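Method described in the literature

As described in the paper:

First, the study uses the content analysis software ROSTCM to extract high-frequency words from each policy text, then screens and corrects them against the content of the policy document to settle on 3 to 7 subject terms that represent that text.

Second, BICOMB is used to build a co-occurrence matrix of the high-frequency subject terms for each period. To offset the effect of large differences in magnitude among the co-occurrence counts, the co-word matrix is then standardized with the Ochiai coefficient, which yields a dissimilarity matrix.

Finally, the dissimilarity matrix is imported into SPSS 22.0 for cluster analysis, producing a clustering map of the high-frequency words, and the resulting word clusters are named. These visualizations are used to analyze the overall features of each period and to show where the central government's attention to urban grid management was directed.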
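To make the normalization step concrete, here is a minimal sketch of the Ochiai calculation on a made-up three-word example. The words and counts are invented for illustration. The sketch also assumes that the diagonal of the co-occurrence matrix holds the number of documents containing each word, which the paper excerpt above does not spell out.

```python
import math

# Hypothetical toy data: cell (i, j) is the number of documents in which
# words i and j appear together. The diagonal is assumed to hold the number
# of documents that contain the word at all.
words = ["保护", "传承", "名录"]
co_occurrence = [
    [4, 2, 1],
    [2, 4, 0],
    [1, 0, 1],
]


def ochiai_dissimilarity(matrix):
    """Return 1 - Ochiai similarity for every cell of a square co-occurrence matrix."""
    n = len(matrix)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denominator = math.sqrt(matrix[i][i] * matrix[j][j])
            # Guard against words that never occur (zero on the diagonal)
            similarity = matrix[i][j] / denominator if denominator else 0.0
            result[i][j] = 1.0 - similarity
    return result


dissimilarity = ochiai_dissimilarity(co_occurrence)
for word, row in zip(words, dissimilarity):
    print(word, [round(value, 2) for value in row])
```

Words that always appear together end up with a dissimilarity of 0, and words that never share a document end up at 1, which is the scale the clustering step works on.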

I did not want to switch between that many programs, so I did everything in Python.
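As a rough idea of what the SPSS clustering step looks like in Python, the sketch below runs hierarchical clustering with SciPy on a small invented dissimilarity matrix and then cuts the tree into clusters. The four labels, the values, the average-linkage method, and the two-cluster cut are all assumptions for illustration; the paper excerpt does not say which settings were used in SPSS.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Hypothetical dissimilarity matrix for four words (symmetric, zero diagonal)
labels = ["A", "B", "C", "D"]
dissimilarity = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

# linkage() takes the condensed (upper-triangle) form of a distance matrix
condensed = squareform(dissimilarity)
Z = hierarchy.linkage(condensed, method="average")

# Cut the tree into two clusters and collect the members of each
cluster_ids = hierarchy.fcluster(Z, t=2, criterion="maxclust")
clusters = {}
for label, cluster_id in zip(labels, cluster_ids):
    clusters.setdefault(int(cluster_id), []).append(label)
print(clusters)
```

Listing the members of each cluster this way can help when it comes to naming the word clusters, since the dendrogram alone does not give you the groups as data.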

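Data

Our social practice project studies policies related to Lingnan culture, so we took as our research object the cultural heritage protection policy texts issued by local governments in Guangdong Province, the Guangxi Zhuang Autonomous Region, and Hainan Province. The data come from the PKULaw (北大法宝) laws and regulations database.

The search criteria were as follows. First, we searched with keywords such as “广东省文化遗产”, “广西文化遗产”, and “海南省文化遗产”. Second, the policy types included local regulations, local normative documents, and local working documents. Finally, considering the legal force of the texts, we excluded informal policy texts such as meeting notices and official replies. Following these steps and principles, we selected 84 relevant policy texts from 2005-2023: 34 from Guangdong, 27 from Guangxi, and 23 from Hainan.

On this basis, and in light of how the policies are distributed over time, we divide the evolution of cultural heritage protection policy in the three Lingnan provinces into three periods: an exploration and promotion period (2005-2010), an expansion and application period (2011-2016), and a standardized development period (2017-2023).

Because there are three periods, the txt files for each period go into their own folder. I named the folders “1”, “2”, and “3”.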
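If you sort the files by hand it is easy to misplace one near a period boundary, so a small helper like the one below can make the rule explicit. This is only a sketch: the function name is hypothetical, and it assumes you already know each document's year of issue.

```python
# Period boundaries as stated in the text; folder names "1", "2", "3"
# follow the layout described above. The helper itself is hypothetical.
PERIODS = [
    ("1", 2005, 2010),  # exploration and promotion period
    ("2", 2011, 2016),  # expansion and application period
    ("3", 2017, 2023),  # standardized development period
]


def period_folder(year: int) -> str:
    """Return the folder name for a policy issued in the given year."""
    for folder, start, end in PERIODS:
        if start <= year <= end:
            return folder
    raise ValueError(f"year {year} is outside the 2005-2023 study window")


for year in (2005, 2010, 2011, 2023):
    print(year, "->", period_folder(year))
```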

Note: the code below should run once you change the file paths to your own.

1. High-frequency word extraction

    # Step 1
    import os
    import glob
    import re
    from collections import Counter

    import jieba
    import pandas as pd

    # Folders holding the txt files for each period
    folder_paths = ["E:\\大学课程相关\\大二下学期\\1 毛概\\1",
                    "E:\\大学课程相关\\大二下学期\\1 毛概\\2",
                    "E:\\大学课程相关\\大二下学期\\1 毛概\\3"]

    # Load the stopword list
    stopwords_file = "stopwords.txt"
    with open(stopwords_file, 'r', encoding='utf-8') as f:
        stopwords = set(f.read().splitlines())

    # Add domain-specific terms to jieba's dictionary
    specialized_words = ['非物质文化遗产', '政策文件', '项目', '名录', '保护', '省级', '县级']
    for word in specialized_words:
        jieba.add_word(word)

    def extract_top_words(folder_path):
        """Return the 30 most frequent words across all txt files in a folder."""
        word_counter = Counter()
        for file_path in glob.glob(os.path.join(folder_path, '*.txt')):
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
            # Drop digits, Latin letters and punctuation; keep only Chinese characters
            text = re.sub(r'[^\u4e00-\u9fa5]+', '', text)
            # Segment the text and remove stopwords
            words = [word for word in jieba.lcut(text) if word not in stopwords]
            word_counter.update(words)
        return word_counter.most_common(30)

    # Extract the top words for each folder and save them to separate Excel files
    for folder_path in folder_paths:
        folder_name = os.path.basename(folder_path)
        top_words = extract_top_words(folder_path)
        result_df = pd.DataFrame(top_words, columns=['Top Word', 'Frequency'])
        output_file = f"E:\\大学课程相关\\大二下学期\\1 毛概\\高频词统计结果_{folder_name}.xlsx"
        result_df.to_excel(output_file, index=False)
        print(f"Top words for folder {folder_name} saved to:", output_file)

2. Building the co-occurrence matrix

    # Step 2
    import os
    import glob
    import re
    from itertools import combinations

    import jieba
    import pandas as pd

    # Base folder: holds the period folders "1", "2", "3" and the step 1 output
    output_folder = r"E:\大学课程相关\大二下学期\1 毛概"
    periods = ["1", "2", "3"]

    # Same domain-specific terms as step 1, so the segmentation matches
    for word in ['非物质文化遗产', '政策文件', '项目', '名录', '保护', '省级', '县级']:
        jieba.add_word(word)

    for period in periods:
        # Read the high-frequency words produced by step 1
        df = pd.read_excel(os.path.join(output_folder, f"高频词统计结果_{period}.xlsx"))
        words = df['Top Word'].tolist()

        # Cell (a, b) counts the documents that contain both a and b;
        # the diagonal counts the documents that contain the word at all
        co_occurrence_matrix = pd.DataFrame(0, index=words, columns=words)

        for file_path in glob.glob(os.path.join(output_folder, period, '*.txt')):
            with open(file_path, 'r', encoding='utf-8') as file:
                text = re.sub(r'[^\u4e00-\u9fa5]+', '', file.read())
            tokens = set(jieba.lcut(text))
            present = [word for word in words if word in tokens]
            for word in present:
                co_occurrence_matrix.at[word, word] += 1
            for word1, word2 in combinations(present, 2):
                co_occurrence_matrix.at[word1, word2] += 1
                co_occurrence_matrix.at[word2, word1] += 1

        output_file = os.path.join(output_folder, f"共现矩阵结果_{period}.xlsx")
        co_occurrence_matrix.to_excel(output_file)
        print(f"Co-occurrence matrix saved to: {output_file}")

3. Building the dissimilarity matrix (standardization)

    # Step 3
    import os

    import numpy as np
    import pandas as pd

    output_folder = r"E:\大学课程相关\大二下学期\1 毛概"
    periods = ["1", "2", "3"]

    for period in periods:
        # Read the co-occurrence matrix produced by step 2
        co_occurrence_matrix = pd.read_excel(
            os.path.join(output_folder, f"共现矩阵结果_{period}.xlsx"), index_col=0)
        values = co_occurrence_matrix.to_numpy().astype(float)

        # Ochiai coefficient: co-occurrence / sqrt(count of word 1 * count of word 2),
        # where the per-word counts sit on the diagonal
        counts = np.diag(values)
        oc_matrix = values / np.sqrt(np.outer(counts, counts))

        # Dissimilarity = 1 - Ochiai similarity (the diagonal becomes 0)
        dissimilarity_matrix = pd.DataFrame(
            1 - oc_matrix,
            index=co_occurrence_matrix.index,
            columns=co_occurrence_matrix.columns)

        output_file = os.path.join(output_folder, f"相异矩阵结果_{period}.xlsx")
        dissimilarity_matrix.to_excel(output_file)
        print("Dissimilarity matrix saved to:", output_file)

4. Text clustering + visualization

    # Step 4
    import matplotlib.pyplot as plt
    import pandas as pd
    from scipy.cluster import hierarchy
    from scipy.spatial.distance import squareform

    # Use a font that can render the Chinese word labels
    plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']

    # Dissimilarity matrices produced by step 3
    file_paths = [
        r"E:\大学课程相关\大二下学期\1 毛概\相异矩阵结果_1.xlsx",
        r"E:\大学课程相关\大二下学期\1 毛概\相异矩阵结果_2.xlsx",
        r"E:\大学课程相关\大二下学期\1 毛概\相异矩阵结果_3.xlsx"
    ]

    for i, file_path in enumerate(file_paths, start=1):
        df = pd.read_excel(file_path, index_col=0)

        # linkage() expects a condensed distance vector; a square matrix passed
        # directly would be treated as one feature vector per row
        condensed = squareform(df.to_numpy(), checks=False)
        Z = hierarchy.linkage(condensed, method='average')

        # Draw the dendrogram
        plt.figure(figsize=(12, 10))
        hierarchy.dendrogram(Z, labels=df.index.tolist(), orientation='left', leaf_font_size=8)
        plt.xlabel('Dissimilarity', fontsize=12)
        plt.ylabel('Sample', fontsize=12)
        plt.title(f'Dendrogram {i}', fontsize=14)
        plt.grid(True)
        plt.show()

5. Further exploration

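After finishing this, we wanted to see whether other approaches would also work, so we tried a few, such as LDA and K-means for topic-word extraction and clustering. The results did not look good in either case. So the papers had a point in choosing high-frequency words after all (doge).

Here is that code as well: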

(1) LDA

    # Import the required libraries
    import os
    import glob
    import re

    import jieba
    from gensim import corpora
    from gensim.models import LdaModel

    # Folders holding the txt files for each period
    folder_paths = ["E:\\大学课程相关\\大二下学期\\1 毛概\\1",
                    "E:\\大学课程相关\\大二下学期\\1 毛概\\2",
                    "E:\\大学课程相关\\大二下学期\\1 毛概\\3"]

    # Load the stopword list
    stopwords_file = "stopwords.txt"
    with open(stopwords_file, 'r', encoding='utf-8') as f:
        stopwords = set(f.read().splitlines())

    # Read every text file, segment it, and remove stopwords
    texts = []
    for folder_path in folder_paths:
        for file_path in glob.glob(os.path.join(folder_path, '*.txt')):
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
            text = re.sub(r'[^\u4e00-\u9fa5]+', '', text)  # keep only Chinese characters
            words = [word for word in jieba.lcut(text) if word not in stopwords]
            texts.append(words)

    # Build the dictionary and the bag-of-words corpus
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Run LDA topic modeling. The single-process LdaModel is used here because
    # LdaMulticore needs an `if __name__ == "__main__":` guard when run as a script on Windows
    num_topics = 5  # number of topics
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)

    # Print the word distribution of each topic
    for idx, topic in lda_model.print_topics(-1):
        print("Topic {}: {}".format(idx, topic))

    # Extract the top 10 keywords of each topic
    topics_words = lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=False)
    for topic_num, word_probs in topics_words:
        words = [word for word, _ in word_probs]
        print("Keywords for topic {}: {}".format(topic_num, words))

(2) K-means

    # K-means
    import re

    import jieba
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Read the Excel file of high-frequency words
    excel_file = "E:\\大学课程相关\\大二下学期\\1 毛概\\高频词统计结果.xlsx"
    data = pd.read_excel(excel_file)

    # The "texts" being clustered here are the high-frequency words themselves
    texts = data['Top Word'].tolist()

    # Preprocessing: segmentation and stopword removal
    stopwords_file = "stopwords.txt"
    with open(stopwords_file, 'r', encoding='utf-8') as f:
        stopwords = set(f.read().splitlines())

    def text_preprocessing(text):
        text = re.sub(r'[^\u4e00-\u9fa5]+', ' ', str(text))  # keep only Chinese characters
        words = [word for word in jieba.lcut(text) if word.strip() and word not in stopwords]
        return " ".join(words)

    preprocessed_texts = [text_preprocessing(text) for text in texts]

    # Vectorize with TF-IDF; the custom token_pattern keeps single-character
    # words, which the default pattern would silently drop
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    X = vectorizer.fit_transform(preprocessed_texts)

    # Cluster with K-means
    num_clusters = 3  # number of clusters
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    kmeans.fit(X)

    # Attach the cluster label to each row
    data['Cluster'] = kmeans.labels_

    # Print the top 10 keywords of each cluster (largest centroid weights first)
    feature_names = vectorizer.get_feature_names_out()
    for i, cluster_center in enumerate(kmeans.cluster_centers_):
        top_keywords_idx = cluster_center.argsort()[-10:][::-1]
        top_keywords = [feature_names[idx] for idx in top_keywords_idx]
        print("Keywords for cluster {}: {}".format(i, top_keywords))

    # Save the result to an Excel file
    output_file = "E:\\大学课程相关\\大二下学期\\1 毛概\\文档聚类结果.xlsx"
    data.to_excel(output_file, index=False)
    print("Clustering results saved to:", output_file)
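One thing that might explain part of the poor K-means results is that the number of clusters was simply fixed at 3. A common way to check that choice is the silhouette score. The sketch below shows the idea on synthetic 2-D points rather than on the policy data; the data, the candidate range, and the helper name are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated groups of 2-D points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([center + rng.normal(scale=0.3, size=(20, 2)) for center in centers])


def best_k_by_silhouette(X, candidates):
    """Fit K-means for each candidate k and return the k with the highest silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


best_k, scores = best_k_by_silhouette(X, range(2, 7))
print("best k:", best_k)
for k, score in scores.items():
    print(k, round(float(score), 3))
```

On real TF-IDF vectors the scores are usually much lower and flatter than on toy data like this, so treat the result as a hint rather than a verdict.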

Criticism and corrections are welcome in the comments! Let's discuss and improve together (handshake).
