WO2008022341A2 - Procédé d'ensemencement pour regroupement par nuées dynamiques et autres algorithmes de groupement - Google Patents

Procédé d'ensemencement pour regroupement par nuées dynamiques et autres algorithmes de groupement Download PDF

Info

Publication number
WO2008022341A2
WO2008022341A2 PCT/US2007/076258 US2007076258W WO2008022341A2 WO 2008022341 A2 WO2008022341 A2 WO 2008022341A2 US 2007076258 W US2007076258 W US 2007076258W WO 2008022341 A2 WO2008022341 A2 WO 2008022341A2
Authority
WO
WIPO (PCT)
Prior art keywords
seeds
selecting
seed
data points
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2007/076258
Other languages
English (en)
Other versions
WO2008022341A3 (fr
Inventor
Rafail Ostrovsky
Yuval Rabani
Leonard J. Schulman
Chaitanya Swamy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technion Research and Development Foundation Ltd
California Institute of Technology
University of California Berkeley
University of California San Diego UCSD
Original Assignee
Technion Research and Development Foundation Ltd
California Institute of Technology
University of California Berkeley
University of California San Diego UCSD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Technion Research and Development Foundation Ltd, California Institute of Technology, University of California Berkeley, University of California San Diego UCSD filed Critical Technion Research and Development Foundation Ltd
Publication of WO2008022341A2 publication Critical patent/WO2008022341A2/fr
Publication of WO2008022341A3 publication Critical patent/WO2008022341A3/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Definitions

  • Clustering is performed in many applications, including the areas of pattern recognition, image analysis, data mining, search engines, bioinformatics, fraud detection, and the like, in order to associate data into groups for further analysis, including detecting items that are similar and also items that are dissimilar.
  • clustering it is assumed that there are a number of data items, whatever the items may represent, and the data items are to be organized in some fashion.
  • each data item is transformed into an analyzable form that may be best understood as a single point in a multi-dimensional space, and then clusters of points, sometimes called clouds or groups of points, are identified within such space. That is, given enough points, distinct groups of points are identified, where each group has a number of points in relatively close proximity such that the group is a clustering of such points.
  • the number of points derived from a data set can be relatively large, perhaps even on the order of millions, tens of millions, and perhaps even hundreds of millions. Accordingly, identifying clusters within a space of such data points is not a trivial matter.
  • One method for doing so includes the use of Lloyd-type fc-means clustering.
  • Finding the center of each cluster is not a trivial matter. Accordingly, Lloyd- type k- means clustering may be employed to do so. Briefly, in classic LIo yd- type fc-means clustering, and again presuming two clusters merely for the sake of explanation, a potential center for each cluster is defined, based on some set of parameters, or even randomly, where each such potential center is in effect a seed to initialize an iterative process. In such iterative process, some number of points closest to each seed are located, and for the located points for each seed, the center of such located points is found in a known manner, and the seed is relocated to such found center. The iterative process is repeated a number of times until some terminating event or condition occurs.
  • Such event or condition can for example be that the movement of each seed effectively settles down to a more-or-less stationary location or settled center, or can be a fixed number of iterations. Of course, it may be that at least some of the seeds never settle down to a settled center, in which case the iterations may be stopped after some predetermined number of iterations. In such a case, it may be that the seeds were initially poorly chosen, in which case new seeds may be picked and the process repeated.
  • ⁇ -clustering that more accurately define an initial seed for each cluster prior to performing the iterative process of such Lloyd-type ⁇ -clustering. More particularly, a need exists for such systems and methods that employ a procedure to at least roughly identify the initial location of each true cluster based on distances between at least a sampling of all the points in the data set.
  • a method of seeding that is performed with regard to performing clustering on a data set of data items, each of which has been transformed into a point in a multi-dimensional space.
  • a number of clusters are defined within the space, and a center of each cluster is initially located by selecting an initial seed as a potential center for each cluster.
  • An iterative process is then performed to move each seed to what eventually becomes the located center of a cluster.
  • a first initial seed for a first cluster is selected by:
  • the selection may, in some embodiments, be performed randomly with a uniform probability of picking any point, or in other embodiments may be based on an assigned probability distribution with, for example, the probability of picking a point D 1 as ⁇ ⁇ being:
  • the first initial seed may be selected as the centroid of the data points or the centroid of a sampling of the data points.
  • an initial seed for each successive cluster may be selected by: (i) receiving an integer k between 2 and n - 1 as the number of seeds that are to be produced; and
  • more than one initial seed may be selected before undertaking the iterative process described above for determining remaining seeds.
  • an abundance of seeds may be selected and later culled according to processes such as described below.
  • FIG. 1 is a block diagram of an example of a computing environment within which various embodiments of the present invention may be implemented;
  • FIG. 2 is a conceptual diagram showing a space within which a number of data points from a data set are located;
  • FIG. 3 is a conceptual diagram showing a space within which a number of data points from a data set are located and a partition of the points into clusters;
  • Fig. 1 is set forth herein as an exemplary computing environment in which various embodiments of the present invention may be implemented.
  • the computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality. Numerous other general purpose or special purpose computing system environments or configurations may be used. Examples of well known computing systems, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers (PCs), server computers, handheld or laptop devices, multi-processor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
  • PCs personal computers
  • server computers handheld or laptop devices
  • multi-processor systems microprocessor-based systems
  • network PCs network PCs
  • minicomputers minicomputers
  • mainframe computers mainframe computers
  • embedded systems distributed computing environments that include any of the above systems or devices, and the like.
  • Computer-executable instructions such as program modules executed by a computer may be used.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium.
  • program modules and other data may be located in both local and remote computer storage media including memory storage devices.
  • an exemplary system for implementing aspects described herein includes a computing device, such as computing device 100.
  • computing device 100 typically includes at least one processing unit 102 and memory 104.
  • memory 104 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two.
  • RAM random access memory
  • ROM read-only memory
  • FIG. 1 This most basic configuration is illustrated in Fig. 1 by dashed line 106.
  • Computing device 100 may have additional features/functionality.
  • computing device 100 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in Fig.
  • Computing device 100 typically includes or is provided with a variety of computer-readable media.
  • Computer readable media can be any available media that can be accessed by computing device 100 and includes both volatile and non-volatile media, removable and non-removable media.
  • Computer readable media may comprise computer storage media and communication media.
  • Computer storage media includes volatile and non-volatile, removable and nonremovable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Memory 104, removable storage 108, and non-removable storage 110 are all examples of computer storage media.
  • Computer storage media includes, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD- ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by computing device 100. Any such computer storage media may be part of computing device 100.
  • Computing device 100 may also contain communications connection(s) 112 that allow the device to communicate with other devices.
  • Each such communications connection 112 is an example of communication media.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media.
  • RF radio frequency
  • computer readable media as used herein includes both storage media and communication media.
  • Computing device 100 may also have input device(s) 114 such as keyboard, mouse, pen, voice input device, touch input device, etc.
  • Output device(s) 116 such as a display, speakers, printer, etc. may also be included. All these devices are generally known to the relevant public and therefore need not be discussed in any detail herein except as provided.
  • computing device 100 may be one of a plurality of computing devices 100 inter-connected by a network 118, as is shown in Fig. 1.
  • the network 118 may be any appropriate network
  • each computing device 100 may be connected thereto by way of a connection 112 in any appropriate manner
  • each computing device 100 may communicate with one or more of the other computing devices 100 in the network 118 in any appropriate manner.
  • the network 118 may be a wired or wireless network within an organization or home or the like, and may include a direct or indirect coupling to an external network such as the Internet or the like.
  • the computing device In the case of program code execution on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • One or more programs may implement or utilize the processes described in connection with the presently disclosed subject matter, e.g., through the use of an application- program interface (API), reusable controls, or the like.
  • API application- program interface
  • Such programs may be implemented in a high level procedural or object oriented programming language to communicate with a computer system.
  • the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.
  • exemplary embodiments may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network 118 or a distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices in a network 118. Such devices might include personal computers, network servers, and handheld devices, for example.
  • clustering is performed in many applications in order to associate data into groups for further analysis, including detecting items that are similar and also items that are dissimilar.
  • clustering it is assumed that there are a number of data items, whatever the items may represent, and the data items are to be organized in some fashion.
  • fraud detection such as that which may be performed by a bank or other financial institution to find transactions that are improper and which should be prevented.
  • patterns of proper transactions are identified, and transactions that fall outside of or are dissimilar to such proper transactions are identified and then scrutinized for possible identification as being fraudulent.
  • clustering may be performed with regard to any other type of data set without departing from the spirit and scope of the present invention.
  • each data item is transformed into an analyzable form that may be best understood as a single point in a multi-dimensional space.
  • Fig. 2 is an example of a collection of points 10 and a set of two clusters 12, 14 delineated by the dotted curves.
  • a goal of a clustering algorithm is to identify within a collection of points 10 a set of clusters 12, 14 of points.
  • the set of clusters may partition the collection of points 10 into a set of distinct clusters 12, 14. Note that by its nature the drawing of Fig. 2 is two-dimensional. Nevertheless, it is to be understood that the objects shown in the two-dimensional drawing of Fig. 2 can readily be extended to a greater number of dimensions.
  • Distinct clusters 12, 14 are identified, where each cluster 12, 14 has a number of points in relatively close proximity to each other. As seen in Fig. 2, it may be that there are some points that are relatively close to each other, in which case such points and the data items represented by them are relatively similar in some sense. Likewise, it may be that there are points 10 that are relatively far from each other, in which case such points and the data items represented by them are relatively dissimilar.
  • Points that are relatively closer to the center of a cluster 12, 14 tend to be more closely identifiable as being part of such cluster 12, 14, while points that are relatively farther from the center of a cluster 12, 14 tend to be less identifiable as being part of such cluster 12, 14.
  • points that are relatively far from the center of any cluster that represents legitimate transactions may be more closely scrutinized as being at least potentially fraudulent.
  • Lloyd-type fc-means clustering is used to find a center of each cluster, such that each point can then be associated with a particular cluster based on distance within such space between the point and the center of the cluster.
  • each point is associated with the cluster whose center is nearest the point. Finding distances within high-dimensional space and associating a point with a particular cluster is known or should be apparent to the relevant public and therefore need not be set forth herein in detail. Accordingly, any method for finding such distance within such space and associating such point with a particular cluster may be employed without departing from the spirit and scope of the present invention.
  • classic Lloyd-type fc-means clustering involves selecting an initial collection of seeds as potential centers for each cluster. The initial selection may be based on some set of parameters, or may even be performed randomly. Various techniques for this initial selection are described below. An iterative process is then performed wherein the following steps are repeated: each point is assigned to a cluster associated with the potential cluster center nearest the point and then the centroid of all points (or some sampling of points) in each cluster is computed and a new cluster center for that cluster is located at the centroid of that cluster. The iterative process is repeated until some termination condition is met. For example, the termination condition might be that the cluster centers move less than a predetermined amount from one iteration to the next, or that some predetermined number of iterations is performed.
  • Fig. 3 provides an example of an initial stage of a Lloyd-type k- means clustering algorithm for two clusters.
  • a collection of points 30 is provided.
  • Initial seeds ⁇ i and ⁇ 2 have been chosen and are designated as initial potential centers for the two clusters 32, 34. Techniques for selected the initial seeds are discussed below.
  • Each point in the collection of points 30 is then associated with the cluster whose center is closest to the point.
  • points nearer to ⁇ i than ⁇ 2 i.e., points above line 36
  • points nearer to ⁇ 2 than ⁇ 1? i.e., points below line 36
  • the centroid of the points in the each cluster 32 34 is computed and new cluster centers, V 1 , V 2 respectively, are located at those centroids.
  • each point in the collection of points 30 is once again associated with a cluster whose center is closest to the point.
  • points nearer to V 1 than V 2 i.e., points above and to the right of line 38
  • points nearer to V 2 than V 1 i.e., points below and to the left of line 38
  • the process is repeated, determining new cluster centers and perforce new clusters, until some termination condition is met.
  • Such Lloyd-type ⁇ -clustering is generally expected to work well if the initial seeds are well-chosen. However, if, for example, multiple seeds are chosen within one 'true' cluster (i.e., a cluster that ought to be found) or if multiple seeds iteratively migrate to one true cluster, the process can result in multiple cluster centers within the one true cluster and no cluster centers in another true cluster, and the points in the true clusters may end up being assigned to the clusters in some false and perhaps even playful fashion. [0038] Clustering in general is a venerable statistical problem of both practical and theoretical interest. There is presently a wide and unsatisfactory gap between the practical and theoretical clustering literatures.
  • a near-optimal ⁇ -clustering can be achieved by a partition into fewer than k clusters, then that smaller value of k should be used to cluster the data. If near-optimal ⁇ -clusterings are hard to find only when they provide ambiguous classification or marginal benefit (i.e., in the absence of a meaningful k- clustering), then such hardness should not be viewed as an acceptable obstacle to algorithm development. Instead, the performance criteria should be revised.
  • each center is associated with all data in its Voronoi region, and is located at the center of mass of this data.
  • Lloyd's method which consists of the following two phases: (a) 'Seed' the process with some initial centers (the literature contains many competing suggestions of how to do this); (b) Iterate the following Lloyd step until the clustering is 'good enough': cluster all the data in the Voronoi region of a center together, and then move the center to the centroid of its cluster.
  • the objective is to partition X into k clusters X ⁇ ..., X k and assign each point in every cluster X 1 to a common center c ; e R d , so as to minimize the fc-means cost, defined as ⁇ f_ j ⁇ xe% Il JC - c, II 2 ,
  • cluster X 1 simply consists of all points x e X for which c, is the nearest center, breaking ties arbitrarily.
  • clustering X 1 the best centers corresponding to this clustering are obtained by setting c, to be the center of mass (centroid) of
  • c is the centroid of cluster X 1 , and each point in X 1 has c, as its nearest center.
  • Theorem 3.4 The above algorithm returns a clustering of cost at most —
  • both techniques proceed in two stages.
  • the first stage is a seeding stage, which performs the bulk of the work and guarantees that at the end of this stage there are k seed centers positioned at nearly the right locations.
  • each optimal center has a (distinct) initial center located in close proximity.
  • This is precisely the leverage that we obtain from the fc-means separation condition (as in the 2-means case).
  • We consider a natural generalization of the sampling procedure used for the 2-means case and show that this picks the k initial centers from the cores of the optimal clusters. This sampling procedure runs in linear time but it succeeds with probability that is exponentially small in k.
  • Running time The sampling procedure consists of k iterations, each of which takes ⁇ (nd) time. This is because after sampling a new point C 1+1 , we can update the quantity min ⁇ 1 !+1 ⁇ ll JC - c Il for each point x in ⁇ (d) time. So the overall running time is Oinkd) .
  • Analysis. Let ⁇ 2 « p ⁇ 1 be a parameter that we will set later. As in the 2-
  • X° ut [ x e X 1 : Il x - C 1 Il 2 ⁇ ⁇ - ⁇ . Note that X ⁇ c X 1 '
  • N h- — - — -r where 0 ⁇ ⁇ ⁇ 1 is a desired error tolerance.
  • Lemma 4.3 Suppose that we have sampled i points ⁇ Jc 1 ,..., X 1 ] from X. Let X 1 ,..., X m be all the clusters whose outer cores contain some sampled point x ⁇ . Then
  • the initial abundant collection of cluster centers may be determined by any of the methods described herein, or by a hybrid of two or more of the methods. Some of the initial cluster centers may be selected by random methods and some may be selected by deterministic methods. Some may be chose with respect to previously selected cluster centers and some may be chosen independently of previously selected cluster centers.
  • centers may be deleted or merged, for example by the techniques described below, until the desired number of cluster centers remain.
  • a 2 k (•) cost of the sampled instance is much smaller than its (•) cost, and that the optimal centers for the sampled instance lie near the optimal centers for X.
  • Lemma 4.10 For each center c ; , there is a center C 1 such that .
  • Ball- ⁇ > means step. Let B ; be the points of X in a ball of radius U 1 I 3 around C 1 , and C 1 be the centroid of B ; . Return C 1 ,..., c ⁇ k as the final centers.
  • a linear time constant-factor approximation technique may be accomplished by : 1 - Execute the seeding procedure above to obtain k initial centers C 1 ,..., c k . 2 - Run the ball-fc- means step to obtain the final centers.
  • the running time is O(nkd+k 3 d).
  • Lemma 4.10 we know that with probability 1 - O(y/ ⁇ ) , for each c ; , there is a distinct center C 1 such that Il C 1 - C 1 Il ⁇ D 1 /10. Combined with Lemma 4.11, this yields the following theorem.
  • Theorem 4.13 If A 2 J (X) ⁇ ⁇ 2 for a small enough ⁇ , the above algorithm
  • the PTAS combines the sampling procedure with the centroid estimation step: 1 - Use the procedure above to pick k initial centers c 1 ,..., c k . 2 - Run the centroid estimation procedure to obtain the final centers.
  • cost(jc;,..., X k ) denote the cost of clustering X around the centers xi,...,x t in R d .
  • R(x) we use R(x) to denote the Voronoi region of point x (the centers will be clear from the context).
  • S 1 ® S 2 (S 1 XS 2 ) (J (S 2 XS 1 ) denote the symmetric difference of S 1 and S 2 .
  • Theorem 5.2 Let ⁇ ⁇ 1 / 3. Suppose that for every ⁇ -clustering X 1 ,..., X k of X of cost at most ⁇ 2 A 2 k (X) ,
  • the various embodiments of the present invention are directed in particular toward data analysis and methods of clustering data points into clusters that represent groups of interest.
  • One method of performing clustering is the so-called fc-means clustering heuristic.
  • fc-means clustering heuristic tends to be sensitive to the way initial cluster centers or seeds are chosen, which is significant given that some methods for randomly selecting seeds can lead to results of variable quality.
  • a procedure for choosing initial seeds to find centers of clusters is provided which leads to more effective clustering procedures, both in terms of the quality of results and running time.
  • the seeding procedure and clustering procedure are performed with regard to a data set of items, each of which must be in an analyzable form that may be best understood as a single point in a multi-dimensional space. Accordingly, such data set is procured, obtained, or otherwise provided, as a collection of points 500. As should be understood, such procurement and transformation may be performed in any appropriate manner without departing from the spirit and scope of the present invention.
  • the seeding procedure receives as an input an integer k between 2 and n- ⁇ which represents the number of seeds that are to be produced for the clustering procedure.
  • the seeding procedure will output k points out of the n input points that will serve as seeds.
  • the first seeds may be determined by selecting one or more points at random from the collection of data points. The random selection may be made from any probability distribution assigned to the data points. In some embodiments the first seeds are selected uniformly at random from the data points.
  • the center of gravity C of the n points may be calculated, for example by taking
  • this center of gravity C is used as the first seed.
  • the first seed may be chosen by the following procedure:
  • Such distance C 1 0 may be calculated in any appropriate manner without departing from the spirit and scope of the teachings herein.
  • the distance may be calculated as the squared Euclidean distance between such point (ii)
  • An average squared distance is then calculated as:
  • a point ⁇ x from among the n points is selected at random as the first seed with the probability of picking any point D ;
  • each distance may be calculated in any appropriate manner, such as the
  • steps 504 and 506 are repeated until all of the required seeds ⁇ x , ..., ⁇ k have been selected. Thereafter, ⁇ x , ..., ⁇ k as the selected seeds are then supplied (step 510) to a clustering procedure, and the clustering procedure iteratively moves the seeds within the space until a termination condition is reached and cluster centers are defined.
  • more than one seed may be selected by techniques such as those described in connection with “Picking the first seed” above before proceeding to the techniques of "Picking subsequent seeds.”
  • an abundant number of seeds may be initially selected. That is, some number of seeds greater than that to be used by a clustering algorithm are initially selected. The collection of seeds is then culled, such as for example by the deletion / merger techniques described above.
  • the initial abundant collection of seeds may be chosen by any suitable method, such as those described above for example, or by a combination of methods. For example, some number of seeds may be chosen by the procedure described in connection with Fig. 5 above and then the collection of seeds may be augmented by selecting additional seeds at random. As another example, some number of seeds may be picked randomly, and then the collection of seeds augmented by the procedure described in connection with steps 504 to 508 of Fig. 5.
  • the seeds that are selected by the techniques disclosed herein will be communicated to a clustering algorithm for subsequent determination of a desirable clustering of given data.
  • a Lloyd-type iteration algorithm may take the selected seeds as a starting point for an iterative process that determines a clustering for the data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Techniques de détermination d'un ensemble de semences utilisables dans un algorithme de groupement. Après la sélection d'une ou plusieurs semences par un parmi divers procédés, des semences supplémentaires sont sélectionnées par sélection aléatoire d'un point parmi un ensemble de points de données, la probabilité de sélection d'un point donné étant proportionnelle à la distance entre le point donné et la plus proche semence sélectionnée auparavant. Un nombre abondant de semences peuvent être sélectionnées au départ, et l'ensemble de semences peut être filtré par un processus itératif.
PCT/US2007/076258 2006-08-18 2007-08-18 Procédé d'ensemencement pour regroupement par nuées dynamiques et autres algorithmes de groupement Ceased WO2008022341A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US82290306P 2006-08-18 2006-08-18
US60/822,903 2006-08-18

Publications (2)

Publication Number Publication Date
WO2008022341A2 true WO2008022341A2 (fr) 2008-02-21
WO2008022341A3 WO2008022341A3 (fr) 2008-12-31

Family

ID=39083194

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/076258 Ceased WO2008022341A2 (fr) 2006-08-18 2007-08-18 Procédé d'ensemencement pour regroupement par nuées dynamiques et autres algorithmes de groupement

Country Status (1)

Country Link
WO (1) WO2008022341A2 (fr)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014143729A1 (fr) * 2013-03-15 2014-09-18 Affinnova, Inc. Procédé et appareil destinés à une optimisation évolutionniste interactive de concepts
US9111298B2 (en) 2011-03-08 2015-08-18 Affinova, Inc. System and method for concept development
US9208132B2 (en) 2011-03-08 2015-12-08 The Nielsen Company (Us), Llc System and method for concept development with content aware text editor
US9311383B1 (en) 2012-01-13 2016-04-12 The Nielsen Company (Us), Llc Optimal solution identification system and method
US9785995B2 (en) 2013-03-15 2017-10-10 The Nielsen Company (Us), Llc Method and apparatus for interactive evolutionary algorithms with respondent directed breeding
WO2017181660A1 (fr) * 2016-04-21 2017-10-26 华为技术有限公司 Procédé et dispositif de regroupement de données basé sur un algorithme des k-moyennes
JP2018195089A (ja) * 2017-05-17 2018-12-06 富士通株式会社 クラスタリング方法、クラスタリングプログラム、および情報処理装置
US10354263B2 (en) 2011-04-07 2019-07-16 The Nielsen Company (Us), Llc Methods and apparatus to model consumer choice sourcing
CN111125469A (zh) * 2019-12-09 2020-05-08 重庆邮电大学 一种社交网络的用户聚类方法、装置以及计算机设备
US11657417B2 (en) 2015-04-02 2023-05-23 Nielsen Consumer Llc Methods and apparatus to identify affinity between segment attributes and product characteristics
CN116539643A (zh) * 2023-03-16 2023-08-04 南京京烁雷达科技有限公司 一种使用雷达识别煤岩数据的方法及系统
US20230316099A1 (en) * 2022-03-29 2023-10-05 Microsoft Technology Licensing, Llc System and method for identifying and resolving performance issues of automated components
CN117876412A (zh) * 2024-03-12 2024-04-12 江西求是高等研究院 三维重建的背景分离方法、系统、可读存储介质及计算机

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004501358A (ja) * 2000-05-11 2004-01-15 ベクトン・ディキンソン・アンド・カンパニー 最適な境界を有する平滑化された多角形を使用して散布図中のクラスタを識別するシステム
US6973212B2 (en) * 2000-09-01 2005-12-06 Siemens Corporate Research, Inc. Graph cuts for binary segmentation of n-dimensional images from object and background seeds
KR100442834B1 (ko) * 2002-07-19 2004-08-02 삼성전자주식회사 얼굴/유사얼굴 영상으로 학습된 패턴 분류기를 이용한얼굴 검출 방법 및 시스템

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9218614B2 (en) 2011-03-08 2015-12-22 The Nielsen Company (Us), Llc System and method for concept development
US9262776B2 (en) 2011-03-08 2016-02-16 The Nielsen Company (Us), Llc System and method for concept development
US9111298B2 (en) 2011-03-08 2015-08-18 Affinova, Inc. System and method for concept development
US9208132B2 (en) 2011-03-08 2015-12-08 The Nielsen Company (Us), Llc System and method for concept development with content aware text editor
US9208515B2 (en) 2011-03-08 2015-12-08 Affinnova, Inc. System and method for concept development
US10354263B2 (en) 2011-04-07 2019-07-16 The Nielsen Company (Us), Llc Methods and apparatus to model consumer choice sourcing
US11842358B2 (en) 2011-04-07 2023-12-12 Nielsen Consumer Llc Methods and apparatus to model consumer choice sourcing
US11037179B2 (en) 2011-04-07 2021-06-15 Nielsen Consumer Llc Methods and apparatus to model consumer choice sourcing
US9311383B1 (en) 2012-01-13 2016-04-12 The Nielsen Company (Us), Llc Optimal solution identification system and method
US11574354B2 (en) 2013-03-15 2023-02-07 Nielsen Consumer Llc Methods and apparatus for interactive evolutionary algorithms with respondent directed breeding
US9785995B2 (en) 2013-03-15 2017-10-10 The Nielsen Company (Us), Llc Method and apparatus for interactive evolutionary algorithms with respondent directed breeding
US9799041B2 (en) 2013-03-15 2017-10-24 The Nielsen Company (Us), Llc Method and apparatus for interactive evolutionary optimization of concepts
US20140344013A1 (en) * 2013-03-15 2014-11-20 Affinnova, Inc. Method and apparatus for interactive evolutionary optimization of concepts
WO2014143729A1 (fr) * 2013-03-15 2014-09-18 Affinnova, Inc. Procédé et appareil destinés à une optimisation évolutionniste interactive de concepts
US10839445B2 (en) 2013-03-15 2020-11-17 The Nielsen Company (Us), Llc Method and apparatus for interactive evolutionary algorithms with respondent directed breeding
US11195223B2 (en) 2013-03-15 2021-12-07 Nielsen Consumer Llc Methods and apparatus for interactive evolutionary algorithms with respondent directed breeding
US11657417B2 (en) 2015-04-02 2023-05-23 Nielsen Consumer Llc Methods and apparatus to identify affinity between segment attributes and product characteristics
WO2017181660A1 (fr) * 2016-04-21 2017-10-26 华为技术有限公司 Procédé et dispositif de regroupement de données basé sur un algorithme des k-moyennes
JP2018195089A (ja) * 2017-05-17 2018-12-06 富士通株式会社 クラスタリング方法、クラスタリングプログラム、および情報処理装置
CN111125469B (zh) * 2019-12-09 2022-06-10 重庆邮电大学 一种社交网络的用户聚类方法、装置以及计算机设备
CN111125469A (zh) * 2019-12-09 2020-05-08 重庆邮电大学 一种社交网络的用户聚类方法、装置以及计算机设备
US20230316099A1 (en) * 2022-03-29 2023-10-05 Microsoft Technology Licensing, Llc System and method for identifying and resolving performance issues of automated components
US12056620B2 (en) * 2022-03-29 2024-08-06 Microsoft Technology Licensing, Llc System and method for identifying and resolving performance issues of automated components
CN116539643A (zh) * 2023-03-16 2023-08-04 南京京烁雷达科技有限公司 一种使用雷达识别煤岩数据的方法及系统
CN116539643B (zh) * 2023-03-16 2023-11-17 南京京烁雷达科技有限公司 一种使用雷达识别煤岩数据的方法及系统
CN117876412A (zh) * 2024-03-12 2024-04-12 江西求是高等研究院 三维重建的背景分离方法、系统、可读存储介质及计算机
CN117876412B (zh) * 2024-03-12 2024-05-24 江西求是高等研究院 三维重建的背景分离方法、系统、可读存储介质及计算机

Also Published As

Publication number Publication date
WO2008022341A3 (fr) 2008-12-31

Similar Documents

Publication Publication Date Title
WO2008022341A2 (fr) Procédé d'ensemencement pour regroupement par nuées dynamiques et autres algorithmes de groupement
Krissinel et al. Common subgraph isomorphism detection by backtracking search
Ostrovsky et al. The effectiveness of Lloyd-type methods for the k-means problem
Holzmann et al. A minimized automaton representation of reachable states
US10157239B2 (en) Finding common neighbors between two nodes in a graph
Anagolum et al. Élivágar: Efficient quantum circuit search for classification
US20240259183A1 (en) Similarity hashing of binary file feature sets for clustering and malicious detection
Bhattacharyya et al. Efficiently summarising event sequences with rich interleaving patterns
US9361403B2 (en) Efficiently counting triangles in a graph
Tseng et al. Parallel batch-dynamic minimum spanning forest and the efficiency of dynamic agglomerative graph clustering
CN114398069A (zh) 一种基于交叉指纹分析的公共组件库精确版本识别方法及系统
US11550910B2 (en) Creating generic rules in a high dimensional sparse feature space using negative feedback
Andoni et al. Approximate nearest neighbors beyond space partitions
Siyari et al. Lexis: An optimization framework for discovering the hierarchical structure of sequential data
Matsumoto et al. Accelerating exact similarity search on cpu-gpu systems
CN117716350A (zh) 链接Bloom过滤器以估计数据集中具有低频的键的数量
Naidan et al. Non-metric space library manual
Mageirakos et al. Cracking vector search indexes
Pandey et al. Lazy merge sort: An improvement over merge sort
Carstensen et al. An efficient PSO-based evolutionary model for closed high-utility itemset mining: S. Carstensen, JC-W. Lin
Stirling et al. F3M: fast focused function merging
Yarmolik et al. Controlled random tests
Fu et al. Financial time series indexing based on low resolution clustering
Christiani et al. Confirmation sampling for exact nearest neighbor search
Chembu et al. Scalable and Globally Optimal Generalized L₁ K-center Clustering via Constraint Generation in Mixed Integer Linear Programming

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07814241

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 07814241

Country of ref document: EP

Kind code of ref document: A2