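2024, Vol. 57, No. 09, pp. 106-112
Design of information classification and coding for the tobacco industry based on hierarchical mutual information clustering
Foundation: Key R&D Project of China National Tobacco Corporation, "Research on Integrated Innovation of New-Generation Information Technology and Cyberspace and Informatization Governance" (110202102049)
Email: jiangtao@tobacco.gov.cn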
DOI: 10.16135/j.issn1002-0861.2024.0600
Abstract:

To meet the information classification and coding needs of the national integrated platform for tobacco production, operation and management, information systems were decomposed into three types of digital objects: "process", "entity" and "service". Based on the actual business of the tobacco industry, a hierarchical mutual information clustering (HMIC) algorithm was proposed. Text data were processed with natural language processing, the mutual information between digital objects was calculated at each classification level, and a hierarchical clustering algorithm was used to cluster the digital objects. This yielded an information classification for the tobacco industry, on which information coding was then based. HMIC was compared with commonly used clustering algorithms. The results showed that: (1) HMIC gave the best information classification performance; its overall information entropy was about 8.2% lower than that of a clustering algorithm using Euclidean distance, and about 2.5% lower than that of a clustering algorithm using only the mutual information matrix. (2) Studying classification and coding from the perspective of information content distinguishes the differences between categories better and improves the usability of the classification and codes. The technique can support the full life-cycle construction of information system projects.
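The pipeline described in the abstract (text features, mutual information between digital objects, hierarchical clustering, then coding) can be illustrated with a minimal sketch. It is not the HMIC algorithm itself, whose details the abstract does not give. Everything below is an assumption made for illustration: the six toy objects and their term sets, the binary term-presence vectors, the normalization of mutual information by the geometric mean of the two entropies, the average-linkage merge rule, and the four-digit code layout (two digits for the category, two for the item).

```python
import math
from itertools import combinations

# Hypothetical toy data: each digital object is described by a set of terms.
# In practice the terms would come from word segmentation of system documents.
OBJECTS = {
    "leaf purchase process": {"leaf", "purchase", "contract", "farmer"},
    "leaf grading process": {"leaf", "grading", "farmer", "purchase"},
    "cigarette order entity": {"cigarette", "order", "retailer", "sales"},
    "retail order service": {"order", "retailer", "sales", "delivery"},
    "equipment maintenance process": {"equipment", "maintenance", "workshop", "repair"},
    "equipment ledger entity": {"equipment", "workshop", "repair", "ledger"},
}


def binary_vectors(objects):
    """Map each object to a 0/1 term-presence vector over the shared vocabulary."""
    vocab = sorted(set().union(*objects.values()))
    return {name: [1 if t in terms else 0 for t in vocab]
            for name, terms in objects.items()}


def mutual_information(x, y):
    """Mutual information (bits) between two equal-length binary vectors."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b) / n
            p_a = sum(1 for xi in x if xi == a) / n
            p_b = sum(1 for yi in y if yi == b) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def mi_distance(x, y):
    """1 - normalized MI; MI(v, v) equals the entropy of v, used to normalize."""
    hx, hy = mutual_information(x, x), mutual_information(y, y)
    if hx == 0 or hy == 0:  # constant vector carries no information
        return 1.0
    return 1.0 - mutual_information(x, y) / math.sqrt(hx * hy)


def average_linkage(names, dist, k):
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[n] for n in names]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            total = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
            d = total / (len(clusters[i]) * len(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def assign_codes(clusters):
    """Hypothetical code layout: 2-digit category number + 2-digit item number."""
    return {name: f"{ci:02d}{ii:02d}"
            for ci, members in enumerate(clusters, start=1)
            for ii, name in enumerate(members, start=1)}


vectors = binary_vectors(OBJECTS)
dist = {a: {b: mi_distance(vectors[a], vectors[b]) for b in vectors} for a in vectors}
clusters = average_linkage(list(OBJECTS), dist, k=3)
codes = assign_codes(clusters)

for name, code in codes.items():
    print(code, name)
```

One caveat on the design: mutual information also rewards anti-correlated vectors, so this toy setup relies on a vocabulary that is sparse relative to each object's term set. A real implementation would more likely use an established clustering library and a feature representation chosen for the corpus.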

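Basic information:

CLC number: TP399; TS4

Citation:

[1] WANG Yibo, PAN Wei, ZHANG Haitao, et al. Design of information classification and coding for the tobacco industry based on hierarchical mutual information clustering[J]. Tobacco Science & Technology, 2024, 57(09): 106-112. DOI: 10.16135/j.issn1002-0861.2024.0600.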
