CN103473255A

CN103473255A - A data clustering method, system and data processing equipment

Info

Publication number: CN103473255A
Application number: CN2013102234517A
Authority: CN
Inventors: 曹付元; 黄哲学; 梁吉业
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2013-06-06
Filing date: 2013-06-06
Publication date: 2013-12-25

Abstract

The invention is applicable to the field of data processing and provides a data clustering method, a data clustering system, and data processing equipment. The method comprises the steps of: inputting a data set composed of n objects with block data characteristics that need to be clustered, together with an expected number of categories k; selecting k block data objects from the data set as initial class centers; calculating the distance from each object to the initial class centers; assigning each block data object to its nearest center according to the calculated distances, forming k disjoint classes; calculating the center of each class as the new class center; and repeating the assignment step and the class center calculation step until the algorithm converges, yielding the partition of the data set. The invention can process data with block characteristics directly, without compressing the block data, which avoids loss of information, and the resulting clustering is better than that obtained after compressing the block data.

Description

A data clustering method, system and data processing equipment

Technical field

The invention belongs to the field of data processing, and in particular relates to a data clustering method, system, and data processing equipment.

Background art

With the rapid development of automatic data generation and collection technology, many fields now produce massive amounts of data recording the details of people's behavior, which makes behavior pattern mining possible. Data describing the behavior of the observed objects share a common feature: the behavior of each object is characterized by a collection of multiple records. We call the data set that records an object's behavioral characteristics a block of data. For example, a customer's purchasing or calling behavior is reflected in that customer's purchase details or call details over a period of time. In-depth mining of block data helps us analyze and predict customer behavior. However, current machine learning algorithms cannot process block data directly; the data must first be converted into standard data, so potential behavioral features present in the data may be overlooked.

Summary of the invention

The purpose of the present invention is to provide a data clustering method, system, and data processing equipment, aiming to solve the problem in the prior art that current machine learning algorithms cannot process block data directly and must first convert it into standard data, so that potential behavioral features present in the data may be overlooked.

The present invention is realized as follows. A data clustering method comprises the following steps:

inputting a data set composed of n objects with block data characteristics that need to be clustered, and the expected number of categories k;

selecting k block data objects from the data set as initial class centers;

calculating the distance from each object to the initial class centers;

assigning each block data object to its nearest center according to the calculated distances, forming k disjoint classes;

calculating the center of each class as the new class center;

repeating the step of assigning each block data object to its nearest center according to the calculated distances to form k disjoint classes, and the step of calculating the center of each class as the new class center, until the algorithm converges, obtaining the partition of the data set.

Another object of the present invention is to provide a data clustering system, the system comprising:

an input module, used to input a data set composed of n objects with block data characteristics that need to be clustered, and the expected number of categories k;

a selection module, used to select k block data objects from the data set as initial class centers;

a distance calculation module, used to calculate the distance from each object to the initial class centers;

an assignment module, used to assign each block data object to its nearest center according to the calculated distances, forming k disjoint classes;

a class center calculation module, used to calculate the center of each class as the new class center;

a loop control module, used to control the repeated execution of the steps of assigning objects and calculating class centers until the algorithm converges, obtaining the partition of the data set.

Another object of the present invention is to provide data processing equipment comprising the data clustering system described above.

In the present invention, the data set is divided into different categories through an iterative process so that the criterion function for evaluating clustering performance is optimized. First, k (the expected number of categories) block data objects are randomly selected from the data set as the initial class centers. Then, using the distance defined between block data, the distance from each block object in the data set to the initial class centers is calculated, and each block object is assigned to its nearest center, forming k classes. The center of each class is calculated by the inclusion-exclusion principle and taken as the new class center. The steps of assigning objects and calculating class centers are repeated until the algorithm converges. The embodiment of the present invention can rapidly cluster block data, which exists widely in the real world, and is an efficient and practical partitional clustering method. It can process data with block characteristics directly, without compressing the block data, which avoids loss of information, and the resulting clustering is better than that obtained after compressing the block data. In addition, the embodiment of the present invention can also handle large-scale data.

Description of drawings

FIG. 1 is a schematic flowchart of the data clustering method provided by an embodiment of the present invention.

FIG. 2 shows the clustering results for 34 cities provided by the embodiment of the present invention.

FIG. 3 is a schematic structural diagram of the data clustering system provided by an embodiment of the present invention.

Detailed description

To make the purpose, technical solution, and beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

Please refer to FIG. 1, which shows the implementation flow of the data clustering method provided by the embodiment of the present invention. It includes the following steps:

In step S101, a data set composed of n objects with block data characteristics that need to be clustered is input, together with the expected number of categories k.

In the embodiment of the present invention, the data set to be clustered is assumed to be $X = \{x_{1}, x_{2}, \ldots, x_{n}\}$, where $x_{i} = \begin{bmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,m} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,m} \\ \vdots & \vdots & & \vdots \\ x_{i,r,1} & x_{i,r,2} & \cdots & x_{i,r,m} \end{bmatrix}$ is the i-th object, described by m attributes and r detail records. We call $x_{i}$ a block data object. k is the expected number of categories.
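The block data object defined above can be pictured as a small table of r rows and m columns. The following is a minimal sketch of one possible in-memory representation, not the patent's own data structure: the type aliases `Record` and `Block`, the helper `attribute_domains`, and the sample customer are all assumptions introduced here for illustration.

```python
from typing import Hashable, List, Sequence, Set

Record = Sequence[Hashable]   # one detail record: m attribute values
Block = List[Record]          # one block data object: its detail records


def attribute_domains(block: Block) -> List[Set[Hashable]]:
    """For each of the m attributes, collect the set of values seen in the block."""
    if not block:
        return []
    m = len(block[0])
    if any(len(record) != m for record in block):
        raise ValueError("every record in a block needs the same m attributes")
    return [{record[j] for record in block} for j in range(m)]


# Hypothetical customer: r = 3 purchase records, m = 2 attributes (item, time of day).
customer: Block = [("milk", "morning"), ("bread", "morning"), ("milk", "evening")]
print(attribute_domains(customer))
```

Collecting the values per attribute gives the "domain values" that the later distance and class center steps refer to. The number of records may differ from object to object, as in the weather example where one city has 364 records and the others 365.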

In step S102, k block data objects are selected from the data set as initial class centers.

In the embodiment of the present invention, the step of selecting k block data objects from the data set X as the initial class centers $c_{1}, c_{2}, \ldots, c_{k}$ is specifically: randomly selecting k block objects from the data set X as the initial class centers.

In step S103, the distance from each object to the initial class centers is calculated.

In step S104, each block data object is assigned to its nearest center according to the calculated distances, forming k disjoint classes.

In the embodiment of the present invention, the distance between objects depends on the differences between their attribute values. The distance between block data objects is measured by the formula

where x and y denote two block data objects, $A_{i}$ and $B_{i}$ denote the domain values of the two objects under the i-th attribute, and m is the number of features or attributes describing an object.
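The distance formula itself is not reproduced in this text, so the sketch below cannot be the patented measure. Purely as an illustration of a distance built from the per-attribute domain values $A_{i}$ and $B_{i}$, it assumes a Jaccard dissimilarity $1 - |A_{i} \cap B_{i}| / |A_{i} \cup B_{i}|$ summed over the m attributes. That choice, and the names `domains` and `block_distance`, are assumptions made here.

```python
from typing import Hashable, List, Sequence, Set

Block = Sequence[Sequence[Hashable]]


def domains(block: Block, m: int) -> List[Set[Hashable]]:
    """Set of values each of the m attributes takes within one block object."""
    return [{record[i] for record in block} for i in range(m)]


def block_distance(x: Block, y: Block, m: int) -> float:
    """ASSUMED measure: sum over attributes of the Jaccard dissimilarity of A_i and B_i."""
    total = 0.0
    for A_i, B_i in zip(domains(x, m), domains(y, m)):
        union = A_i | B_i
        if union:  # two empty domains contribute nothing
            total += 1.0 - len(A_i & B_i) / len(union)
    return total


x = [("a", 1), ("b", 1)]
y = [("a", 2), ("b", 1)]
print(block_distance(x, y, m=2))  # attribute 0 matches fully, attribute 1 half overlaps
```

Under this assumed measure the distance is 0 when two objects have identical domains on every attribute and m when they share no value on any attribute.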

In step S105, the center of each class is calculated as the new class center.

In the embodiment of the present invention, the mean number of detail records over all objects in the class is first calculated and taken as the number of records r that the class center should contain. Then, for each dimension, the frequency with which each element of the domain values appears across the different objects in the class is counted. If the number of domain values is greater than r, the r values with the highest frequency are selected as the representatives of that dimension; otherwise, domain values are taken repeatedly in order of decreasing frequency until r values have been taken. Repeating these steps for every dimension yields m columns of representatives, which constitute the class center of the class.
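The class center procedure above can be sketched as follows. This is one reading of the description, under stated assumptions: the mean record count is rounded to an integer with Python's `round`, a value's frequency is the number of objects in the class whose domain contains it, and ties are broken by the value's `repr`. The function name `class_center` is hypothetical.

```python
from collections import Counter
from itertools import cycle, islice
from typing import Hashable, List, Sequence, Tuple

Block = Sequence[Sequence[Hashable]]


def class_center(blocks: Sequence[Block], m: int) -> List[Tuple[Hashable, ...]]:
    """Build a center of r records x m attributes from the objects in one class."""
    if not blocks:
        raise ValueError("a class needs at least one object")
    # r: mean number of detail records per object (rounding is an assumption).
    r = max(1, round(sum(len(b) for b in blocks) / len(blocks)))
    columns = []
    for j in range(m):
        freq = Counter()
        for block in blocks:
            # Count each value once per object in which it appears.
            freq.update({record[j] for record in block})
        # Highest frequency first; repr() tie-break is an assumption.
        ranked = sorted(freq, key=lambda v: (-freq[v], repr(v)))
        # More than r values: keep the top r. Fewer: cycle through them until r are taken.
        columns.append(list(islice(cycle(ranked), r)))
    return [tuple(col[i] for col in columns) for i in range(r)]


members = [[("a", "x"), ("b", "x")], [("a", "y"), ("c", "x")]]
print(class_center(members, m=2))  # [('a', 'x'), ('b', 'y')]
```

Using `islice(cycle(ranked), r)` covers both branches of the description in one expression: with more than r distinct values it stops after the r most frequent, and with fewer it wraps around in frequency order.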

In step S106, steps S104 and S105 are repeated until the algorithm converges, obtaining the partition of the data set.

In the embodiment of the present invention, the distance between the previous and the new class centers is calculated; if this distance is less than a given threshold, the algorithm ends.
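Steps S102 to S106 together form an assign-and-update loop. The sketch below shows that loop in a generic form, with the distance and class center computations passed in as functions so that it stays self-contained. The function name `cluster`, the `max_iter` safeguard, the handling of empty classes, and the use of the largest single center shift as the convergence measure are assumptions, not details given in the source.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def cluster(
    objects: Sequence[T],
    k: int,
    distance: Callable[[T, T], float],
    center: Callable[[List[T]], T],
    threshold: float = 1e-9,
    max_iter: int = 100,          # safeguard, not part of the described method
    seed: Optional[int] = None,
) -> Tuple[List[int], List[T]]:
    rng = random.Random(seed)
    centers = rng.sample(list(objects), k)           # S102: random initial centers
    labels: List[int] = []
    for _ in range(max_iter):
        # S103/S104: assign every object to its nearest center.
        labels = [min(range(k), key=lambda c: distance(obj, centers[c])) for obj in objects]
        # S105: recompute each center; keep the old one if a class is empty (assumption).
        new_centers = []
        for c in range(k):
            members = [obj for obj, lab in zip(objects, labels) if lab == c]
            new_centers.append(center(members) if members else centers[c])
        # S106: stop once no center has moved by the threshold or more.
        shift = max(distance(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    return labels, centers


# Toy usage with plain numbers standing in for block data objects.
points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
labels, centers = cluster(points, 2, lambda a, b: abs(a - b), lambda ms: sum(ms) / len(ms), seed=0)
print(labels, centers)
```

For block data, `distance` and `center` would be replaced by the block distance and the frequency-based class center described in steps S104 and S105.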

The specific steps of an example implementation of the method provided by the embodiment of the present invention are described in detail below:

1) We downloaded 2011 weather data for 34 provincial capital cities in China (including Hong Kong and Macau) from http://www.wunderground.com/. Shanghai has 364 days of data and every other city has 365 days, so one year of data for each city is a typical block of data. For convenience, we selected 16 attributes with no missing values to describe the weather data. Because the attributes are numerical, we used uniform quantization to discretize the numerical data into 30 categorical values.

2) Assuming that the expected number of categories is 2, the two cities Taiyuan and Wuhan are selected as the initial class centers.

3) Use the defined distance formula to calculate the distance from each city to Taiyuan and to Wuhan, and assign each block data object to its nearest center.

4) Calculate the class center of each class.

5) Judge whether the distance between the new class centers and the previous class centers (the initial class centers in the first iteration) is less than a given threshold.

6) If it is, end; otherwise go to step 3), until the algorithm converges.

7) The clustering results are shown in FIG. 2, where circles and five-pointed stars mark the two classes, and triangles mark cities with no 2011 weather data.
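Step 1) discretizes each numerical attribute into 30 categorical values by uniform quantization, without stating how the bin edges were chosen. A minimal sketch, assuming equal-width bins spanning the observed minimum and maximum of the attribute (the function name `uniform_quantize` and the sample temperatures are hypothetical):

```python
from typing import List, Sequence


def uniform_quantize(values: Sequence[float], levels: int = 30) -> List[int]:
    """Map each value to one of `levels` equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)      # a constant attribute falls into a single bin
    width = (hi - lo) / levels
    # The maximum lands exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), levels - 1) for v in values]


daily_max_temp = [0.0, 15.0, 30.0]    # hypothetical readings
print(uniform_quantize(daily_max_temp))  # [0, 15, 29]
```

After quantization every attribute takes at most 30 distinct category labels, which is what lets each city's year of records be compared through sets of domain values.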

Please refer to FIG. 3, which shows the structure of the data clustering system provided by the embodiment of the present invention. For ease of description, only the parts related to the embodiment are shown. The data clustering system includes an input module 101, a selection module 102, a distance calculation module 103, an assignment module 104, a class center calculation module 105, and a loop control module 106. The data clustering system may be a software unit, a hardware unit, or a combined software and hardware unit built into the data processing equipment.

The input module 101 is used to input a data set composed of n objects with block data characteristics that need to be clustered, and the expected number of categories k.

The selection module 102 is used to select k block data objects from the data set as initial class centers.

In the embodiment of the present invention, the selection module 102 is specifically used to randomly select k block objects from the data set X as the initial class centers.

The distance calculation module 103 is used to calculate the distance from each object to the initial class centers.

The assignment module 104 is used to assign each block data object to its nearest center according to the calculated distances, forming k disjoint classes.

The class center calculation module 105 is used to calculate the center of each class as the new class center.

The loop control module 106 is used to control the repeated execution of the steps of assigning objects and calculating class centers until the algorithm converges, obtaining the partition of the data set.

In summary, the embodiment of the present invention divides the data set into different categories through an iterative process so that the criterion function for evaluating clustering performance is optimized. First, k (the expected number of categories) block data objects are randomly selected from the data set as the initial class centers. Then, using the distance defined between block data, the distance from each block object in the data set to the initial class centers is calculated, and each block object is assigned to its nearest center, forming k classes. The center of each class is calculated by the inclusion-exclusion principle and taken as the new class center. The steps of assigning objects and calculating class centers are repeated until the algorithm converges. The embodiment of the present invention can rapidly cluster block data, which exists widely in the real world, and is an efficient and practical partitional clustering method. It can process data with block characteristics directly, without compressing the block data, which avoids loss of information, and the resulting clustering is better than that obtained after compressing the block data. In addition, the embodiment of the present invention can also handle large-scale data.

Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data clustering method, characterized in that said method comprises the following steps:

inputting a data set composed of n objects with block data characteristics that need to be clustered, and the expected number of categories k;

selecting k block data objects from the data set as initial class centers;

calculating the distance from each object to the initial class centers;

assigning each block data object to its nearest center according to the calculated distances, forming k disjoint classes;

calculating the center of each class as the new class center;

repeating the step of assigning each block data object to its nearest center according to the calculated distances to form k disjoint classes, and the step of calculating the center of each class as the new class center, until the algorithm converges, obtaining the partition of the data set.

2. The method according to claim 1, wherein the data set to be clustered is assumed to be $X = \{x_{1}, x_{2}, \ldots, x_{n}\}$, where

$x_{i} = \begin{bmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,m} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,m} \\ \vdots & \vdots & & \vdots \\ x_{i,r,1} & x_{i,r,2} & \cdots & x_{i,r,m} \end{bmatrix}$

is the i-th object, described by m attributes and r detail records, and $x_{i}$ is called a block data object; k is the expected number of categories.

3. The method according to claim 1, wherein the step of selecting k block data objects from the data set as the initial class center is specifically: randomly selecting k block objects from the data set X as the initial class center.

4. The method according to claim 1, wherein the distance between the objects depends on the difference between the object attribute values, and the distance between the block data objects adopts the formula

5. The method according to claim 1, characterized in that the mean number of detail records over all objects in a class is first calculated and taken as the number of records r to be included in the class center of that class; then, for each dimension, the frequency with which each element of the domain values appears across the different objects in the class is counted; if the number of domain values is greater than r, the r values with the highest frequency are selected as the representatives of that dimension; otherwise, domain values are taken repeatedly in order of decreasing frequency until r values have been taken; and the above steps are repeated to obtain m columns of representatives, which constitute the class center of the class.

6. A data clustering system, characterized in that the system comprises:

The input module is used to input a data set composed of n objects with block data characteristics that need to be clustered and an expected number of categories k;

A selection module is used to select k block data objects from the data set as initial class centers;

A distance calculation module, used to calculate the distance from each object to the initial class center;

An assignment module, configured to assign each block data object to its nearest center according to the calculated distance, forming k disjoint classes;

The class center calculation module is used to calculate the center of each class as a new class center;

The loop control module is used to control the repeated execution of the steps of allocating objects and calculating the class center until the algorithm converges and obtains the division result of the data set.

7. The system according to claim 6, wherein the selection module is specifically configured to randomly select k block objects from the data set X as initial class centers.

8. Data processing equipment comprising a system as claimed in claim 6 or claim 7.