WO2024254894A1 - 一种基于单机的大规模图数据处理系统 - Google Patents

一种基于单机的大规模图数据处理系统 Download PDF

Info

Publication number
WO2024254894A1
WO2024254894A1 PCT/CN2023/101407 CN2023101407W WO2024254894A1 WO 2024254894 A1 WO2024254894 A1 WO 2024254894A1 CN 2023101407 W CN2023101407 W CN 2023101407W WO 2024254894 A1 WO2024254894 A1 WO 2024254894A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
module
sub
graph
subgraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/101407
Other languages
English (en)
French (fr)
Inventor
朱筱可
刘阳
刘书昊
樊文飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Computing Sciences
Original Assignee
Shenzhen Institute of Computing Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Computing Sciences filed Critical Shenzhen Institute of Computing Sciences
Publication of WO2024254894A1 publication Critical patent/WO2024254894A1/zh
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/24569Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data processing technology, and in particular to a large-scale graph data processing system based on a single machine.
  • graph data has become a highly valued topic in the field of data science and engineering, as it is easy to abstract entities and relationships in the real world. It has been widely used in many fields such as social network analysis, recommendation systems, financial fraud detection, and drug discovery. At the same time, graph data has high flexibility, and many problems that were originally modeled using matrices, relationships, or other data structures can also be converted to graph data processing, further highlighting the importance of graph data.
  • the scale of abstract graph data generated or collected by computer systems is growing rapidly. This growth in magnitude has posed an extremely sharp challenge to the large-scale data storage, analysis, and mining capabilities of modern computer systems.
  • a large-scale graph data processing system based on a single machine to solve the above problem includes:
  • a large-scale graph data processing system based on a single machine, comprising a data loading module, a data calculation module, a data release module, a storage management module and a disk;
  • the disk stores large-scale graph data consisting of a plurality of subgraphs;
  • the storage management module stores state information corresponding to each of the subgraphs; in an initial state, the state of the subgraph is active;
  • the data loading module is used for acquiring the sub-graph in the active state from the disk, and transmitting the sub-graph to the data calculation module;
  • the data calculation module is used to update the subgraph and transmit the message generated by the update to the storage management module;
  • the data calculation module is further used to transmit the sub-graph to the data release module;
  • the data release module is used to write the sub-graph into the disk
  • the storage management module When the subgraph is written to the disk, the storage management module is used to set the state of the subgraph to converged.
  • the data calculation module is further used to write the sub-graph into the disk.
  • the data releasing module is further used to transmit the sub-graph to the data loading module.
  • the storage management module is further used to set the state of the subgraph that has received the message to active.
  • the data calculation module is also used to aggregate all the sub-graphs to obtain updated large-scale graph data.
  • the storage management module includes a message storage unit and a state management unit; the state management unit stores the state information;
  • the data calculation module is used to transmit the updated message to the message storage unit
  • the state management unit is used to update the subgraph received the message Set the status to Active.
  • the data calculation module includes an aggregation calculation unit
  • the aggregation calculation unit is used to aggregate all the subgraphs to obtain updated large-scale graph data.
  • the storage management module is further used to set the state of the subgraph to waiting for calculation.
  • the storage management module is further used to set the state of the subgraph to being calculated.
  • the storage management module is further used to set the state of the sub-graph to being released.
  • the present application provides a solution for applying a sub-graph-centric computing model to a single machine system and establishing a pipeline processing architecture, specifically: "A large-scale graph data processing system based on a single machine, comprising a data loading module, a data computing module, a data release module, a storage management module and a disk; the disk stores large-scale graph data consisting of a number of sub-graphs; the storage management module stores status information corresponding to each of the sub-graphs; in an initial state, the sub-graph is active; The data loading module is used to obtain the subgraph with active status from the disk and transmit the subgraph to the data calculation module; the data calculation module is used to update the subgraph and transmit the message generated by the update to the storage management module; when the updated subgraph has changes, the data calculation module is also used to transmit the subgraph to the data release module; when the subgraph is
  • the architecture can overlap data I/O and CPU operations, thereby reducing the I/O cost of the traditional vertex-centric computing model while improving CPU utilization and promoting sequential access to the disk.
  • the architecture uses shared memory data structures for message passing and efficient synchronization, which can separate computing from memory management and scheduling. out, thus providing new opportunities for optimization.
  • FIG1 is a schematic diagram of the connected component calculation process on the vertex center model and the subgraph center model
  • FIG2 is a schematic diagram of a processing architecture of a large-scale graph data processing system provided by an embodiment of the present application
  • FIG3 is a schematic diagram of a state management and optimization strategy for a large-scale graph data processing system provided in an embodiment of the present application.
  • the vertex-centric computing model will undoubtedly bring additional communication overhead or I/O overhead due to the transmission of messages between vertices.
  • the subgraph-centric computing model (which allows information in the calculation process to be freely transmitted within the subgraph) has significantly fewer computing steps than the vertex-centric computing model.
  • the traditional subgraph-centric computing model is designed for multi-machine systems, and no work has yet introduced the subgraph-centric computing model into a single-machine environment. Therefore, some issues remain unclear. For example, when the communication cost is potentially converted into I/O cost, can the introduction of the subgraph-centric computing model systematically reduce the I/O cost of the out-of-core graph system and improve multi-core parallelism?
  • the traditional subgraph-centric computing model requires a finer-grained partitioning of the graph to improve parallelism, but this The cost is more redundant control information, such as the mapping of global vertex IDs to local vertex IDs. This problem can be solved in a distributed environment by allocating enough memory to each computing node, but in a single-machine multi-core environment, finer-grained graph partitioning will occupy the already precious memory resources.
  • auxiliary storage such as hard
  • a large-scale graph data processing system based on a single machine, comprising a data loading module, a data calculation module, a data release module, a storage management module and a disk;
  • the disk stores large-scale graph data consisting of a plurality of subgraphs;
  • the storage management module stores state information corresponding to each of the subgraphs; in an initial state, the state of the subgraph is active;
  • the data loading module is used for acquiring the sub-graph in the active state from the disk, and transmitting the sub-graph to the data calculation module;
  • the data calculation module is used to update the subgraph and transmit the message generated by the update to the storage management module;
  • the data calculation module is further used to transmit the sub-graph to the data release module;
  • the data release module is used to write the sub-graph into the disk
  • the storage management module When the subgraph is written to the disk, the storage management module is used to set the state of the subgraph to converged.
  • the present application compared with the existing large-scale graph processing system based on a single machine, In order to solve the problem of high pin or I/O overhead, the present application applies the subgraph-centric computing model to a single-machine system and establishes a pipeline processing architecture.
  • the pipeline processing architecture uses the subgraph ⁇ F0, F1, F2, F3, ..., Fn-1 ⁇ of graph G as the minimum input and output unit, and updates the large graph G iteratively with a pipeline.
  • the architecture decomposes the out-of-core processing of the subgraph Fi into three consecutive stages: reading Fi into memory, calculating and updating Fi, and if necessary, writing the updated Fi back to external memory.
  • These stages are completed through three modules, namely, a data loading module, a data calculation module, and a data release module.
  • these modules work asynchronously through two task queues "input queue” and "output queue”.
  • the pipeline processing architecture effectively overlaps subgraph I/O and CPU operations, performs calculations on memory subgraphs while loading suspended subgraphs from disk, which can reduce the I/O cost of traditional vertex-centric computing models, while improving CPU utilization by reducing idle waits and enabling continuous access to disks; in addition, the architecture uses shared memory data structures for message passing and efficient synchronization, which can separate computing from memory management and scheduling, thereby providing new opportunities for optimization.
  • the system adopts APIs based on a hybrid computing model.
  • the APIs adopt a unified PIE+ interface, which integrates vertex-centric and subgraph-centric programming models. Users can not only parallelize sequential graph algorithms under the subgraph-centric computing model to simplify parallel programming (inter-subgraph parallelism), but also further explore the parallelism within the subgraph based on the vertex-centric computing model through new interfaces.
  • the hybrid model supports both inter-subgraph parallelism of the "subgraph-centric computing model” and intra-subgraph parallelism of the "vertex-centric computing model”. Under limited memory, it can better utilize multi-core resources and avoid fragmentation of the input graph; in addition, it provides a unified interface from which users can choose the interface that best suits their applications and graphics.
  • the system further includes a scheduler.
  • the scheduler is used to track and allocate threads in a thread pool, where each thread corresponds to a physical CPU core. It decides to allocate physical threads to virtual worker threads to perform (parallel) calculations on subgraphs. It also actively adjusts to support two levels of parallelism: when threads are available, the scheduler allocates them to new computing units by consuming the "input queue" to speed up the parallelism between subgraphs, or through the running worker cores. To improve the parallelism within a subgraph.
  • the storage management module includes a message storage unit; the data calculation module is used to transmit the updated messages to the message storage unit.
  • the message storage unit is used to realize message synchronization between parallel computing units.
  • the message storage unit is implemented as a data structure in memory. In order to improve space efficiency, it can be implemented as a compact variable-length array. It should be noted that the space complexity of the message storage unit is closely related to the partitioning strategy. The more boundary vertices/edges there are, the larger the space consumed by the message storage unit. Compared with the message passing strategy of a multi-machine system, the use of a message storage unit is more efficient in a shared memory environment.
  • the storage management module further includes a state management unit; the state management unit stores the state information and can update the state information at a specific time.
  • the state management unit is used to maintain a state machine to model the state of the subgraph.
  • the state management unit is implemented as a lightweight data structure, and each subgraph only maintains a few states, and the memory space occupied can be ignored.
  • the state management unit uses the state management unit to record the state information of the subgraph to record the state information of the subgraph to record the state information of the subgraph. It only requires a tag list M to help track the message exchange between computing units, and a lightweight state machine to model the work progress of each computing unit. Specifically, the state management unit constructs a tag list M, and each subgraph corresponds to a tag to indicate whether it has received any messages in this round of iteration. If a subgraph has at least one pending update to be extracted from the message storage unit, its corresponding M[i] is true, otherwise, M[i] is false. In actual operation, a finite state machine can be used to model the progress of each subgraph, and the flag M[i] can be used to trigger the state transition of the subgraph.
  • the states of the subgraphs include “active”, “waiting for calculation”, “calculating”, “releasing” and “converging”. At any time, the subgraph is in one of the five states, where the first two states indicate that the subgraph is on the disk, and the remaining states indicate that the subgraph is in the memory.
  • the initial state of each subgraph is "active", which means that the subgraph is waiting for the data loading module to load it into the memory; when the subgraph is obtained by the data loading module, the state management unit is used to set the state of the subgraph to "waiting for calculation", which means It means that the subgraph has been resident in the memory and the subgraph is waiting to be assigned a processing core; when the subgraph is transmitted to the data computing module, the state management unit is used to set the state of the subgraph to "computing", which means that the subgraph is being processed by the processing core; when the subgraph is transmitted to the data release module (that is, the subgraph generates a message that needs to be sent to other subgraphs in the current round of update), the state management unit is used to set the state of the subgraph to "releasing"; when the subgraph is written to the disk, the state management unit is used to set the state of the subgraph to "converging"; when the current round of update ends, the state management unit is also used to set the state of the subgraph participating in the
  • the system can skip certain states in a round of computation without affecting correctness, that is, it can take some "shortcuts" in state transitions and reduce unnecessary computation and I/O.
  • the storage management module when the current round of updates is completed, the storage management module is used to set the state of the subgraph that received the message to "active" ("shortcut A").
  • the state of the subgraph with a "convergence” state needs to be reset to "active”. If M[i] corresponding to a subgraph is true at this time, the subgraph can be kept in the "convergence” state so that it does not participate in the next round of updates. This allows the processing of subgraphs that do not need to be updated to be skipped without affecting the correctness of the program.
  • "Shortcut A” is often used when the input graph is not well connected and a subgraph is "isolated”, and effectively reduces I/O costs.
  • the data release module is used to transfer the subgraph to the data loading module ("shortcut B"). After all the subgraphs have completed the current round of updates, a new round of updates will begin. If a subgraph is still in the "releasing" state, that is, it has not been completely saved to the disk, the state of the subgraph can be directly set to "waiting for calculation", thereby starting a new round of updates to the subgraph without going through the disk. "Shortcut B" can be used at the end of each round and effectively reduces I/O costs.
  • the data calculation module when there is no change in the updated sub-graph, the data calculation module is used to write the sub-graph to the disk ("shortcut C"). There is no change compared to before calculation. You can skip the "releasing" state and directly set it to “converging", which can effectively reduce redundant disk writes.
  • the data calculation module includes an aggregation calculation unit.
  • the aggregation calculation unit is used to call a preset aggregation function to aggregate all the sub-graphs to obtain updated large-scale graph data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请提供了一种基于单机的大规模图数据处理系统,包括数据加载模块、数据计算模块、数据释放模块、存储管理模块和磁盘;数据加载模块用于从磁盘获取状态为活跃的子图,并将子图传输至数据计算模块;数据计算模块用于对子图进行更新,并将更新产生的消息传输至存储管理模块;数据计算模块还用于将子图传输至数据释放模块;数据释放模块用于将子图写入磁盘;当子图被写入磁盘时,存储管理模块用于将子图的状态设置为收敛。通过将基于子图中心的计算模型应用到单机系统,并且建立一套独特的流水线处理架构,该架构能够重叠数据I/O和CPU操作,从而降低传统的顶点中心计算模型的I/O成本同时提高CPU利用率,并促进对磁盘的顺序访问。

Description

一种基于单机的大规模图数据处理系统 技术领域
本申请涉及数据处理技术领域,特别是一种基于单机的大规模图数据处理系统。
背景技术
近年来,由于图数据易于抽象真实世界中的实体与关系,它已经成为数据科学和工程领域备受重视的话题,并被广泛应用于社交网络分析、推荐系统、金融欺诈检测、药物发现等多个领域,同时,图数据具有很高的灵活性,许多原本使用矩阵、关系或其他数据结构进行建模的问题也可以转换为图数据处理,进一步凸显了图数据的重要性。随着社交媒体和移动互联网应用的增强,计算机系统产生或收集的抽象图数据规模正在飞速增长,这种量级上的增长对现代计算机系统的大规模数据存储、分析和挖掘能力提出了极其尖锐的挑战。
传统的大规模图计算系统使用数据划分的并行化方法,即整合多台计算机资源以完成图计算任务。尽管这些计算系统在大图处理领域扮演着重要角色,但由于高昂的维护和构建成本,只有少数拥有大规模计算机集群的公司能够进行大规模图计算,此外,分布式计算系统通常基于一个假设,即使用更多的计算节点会减少计算时间,但实际上这一假设并不总是成立,增加计算节点可能会导致更大的通信代价,从而无法显著提升系统性能。
针对大规模图分析的实际需求和资源受限的使用场景,一系列基于单机的大规模图处理系统被提出。这些系统利用外存作为内存拓展来处理大图,并采用基于顶点中心的计算模型(该模型将计算过程中的信息局限在节点之间传递),以提升数据局部性,简化用户使用负担。尽管基于顶点中心的计算模型简单易懂,但也存在通信开销或I/O开销较高的问题。
发明内容
鉴于上述问题,提出了本申请以便提供克服所述问题或者至少部分地解 决所述问题的一种基于单机的大规模图数据处理系统,包括:
一种基于单机的大规模图数据处理系统,包括数据加载模块、数据计算模块、数据释放模块、存储管理模块和磁盘;所述磁盘存储有由若干子图构成的大规模图数据;所述存储管理模块存储有与每一所述子图对应的状态信息;初始状态下,所述子图的状态为活跃;
所述数据加载模块用于从所述磁盘获取状态为活跃的所述子图,并将所述子图传输至所述数据计算模块;
所述数据计算模块用于对所述子图进行更新,并将更新产生的消息传输至所述存储管理模块;
当更新后的所述子图存在改变时,所述数据计算模块还用于将所述子图传输至所述数据释放模块;
当所述子图非当前轮更新中的最后一个时,所述数据释放模块用于将所述子图写入所述磁盘;
当所述子图被写入所述磁盘时,所述存储管理模块用于将所述子图的状态设置为收敛。
优选的,当更新后的所述子图不存在改变时,所述数据计算模块还用于将所述子图写入所述磁盘。
优选的,当所述子图为当前轮更新中的最后一个时,所述数据释放模块还用于将所述子图传输至所述数据加载模块。
优选的,当当前轮更新结束时,所述存储管理模块还用于将接收到消息的所述子图的状态设置为活跃。
优选的,当当前轮更新结束且所述存储管理模块中不存在消息缓存时,所述数据计算模块还用于对全部所述子图进行聚合,得到更新后的大规模图数据。
优选的,所述存储管理模块包括消息存储单元和状态管理单元;所述状态管理单元存储有所述状态信息;
所述数据计算模块用于将更新产生的消息传输至所述消息存储单元;
当当前轮更新结束时,所述状态管理单元用于将接收到消息的所述子图 的状态设置为活跃。
优选的,所述数据计算模块包括聚合计算单元;
当当前轮更新结束且所述消息存储单元内不存在消息缓存时,所述聚合计算单元用于对全部所述子图进行聚合,得到更新后的大规模图数据。
优选的,当所述子图被所述数据加载模块获取时,所述存储管理模块还用于将所述子图的状态设置为等待计算。
优选的,当所述子图被传输至所述数据计算模块时,所述存储管理模块还用于将所述子图的状态设置为正在计算。
优选的,当所述子图被传输至所述数据释放模块时,所述存储管理模块还用于将所述子图的状态设置为释放中。
本申请具有以下优点:
在本申请的实施例中,相对于现有基于单机的大规模图处理系统通信开销或I/O开销较高的问题,本申请提供了将基于子图中心的计算模型应用到单机系统并建立一套流水线处理架构的解决方案,具体为:“一种基于单机的大规模图数据处理系统,包括数据加载模块、数据计算模块、数据释放模块、存储管理模块和磁盘;所述磁盘存储有由若干子图构成的大规模图数据;所述存储管理模块存储有与每一所述子图对应的状态信息;初始状态下,所述子图的状态为活跃;所述数据加载模块用于从所述磁盘获取状态为活跃的所述子图,并将所述子图传输至所述数据计算模块;所述数据计算模块用于对所述子图进行更新,并将更新产生的消息传输至所述存储管理模块;当更新后的所述子图存在改变时,所述数据计算模块还用于将所述子图传输至所述数据释放模块;当所述子图非当前轮更新中的最后一个时,所述数据释放模块用于将所述子图写入所述磁盘;当所述子图被写入所述磁盘时,所述存储管理模块用于将所述子图的状态设置为收敛”。通过将基于子图中心的计算模型应用到单机系统,并且建立一套独特的流水线处理架构,该架构能够重叠数据I/O和CPU操作,从而降低传统的顶点中心计算模型的I/O成本同时提高CPU利用率,并促进对磁盘的顺序访问,此外,该架构采用共享内存数据结构进行消息传递和高效同步,能够将计算从内存管理和调度中分离 出来,从而为优化提供了新的机会。
附图说明
为了更清楚地说明本申请的技术方案,下面将对本申请的描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是顶点中心模型和子图中心模型上的连通分量计算过程示意图;
图2是本申请一实施例提供的一种大规模图数据处理系统的处理架构示意图;
图3是本申请一实施例提供的一种大规模图数据处理系统的状态管理及优化策略示意图。
具体实施方式
为使本申请的所述目的、特征和优点能够更加明显易懂,下面结合附图和具体实施方式对本申请做进一步详细的说明。显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
发明人通过分析现有技术发现,基于顶点中心的计算模型无疑会因为顶点之间消息的传递带来额外的通信开销或I/O开销,如图1所示,在输入图G上完成连通分量的计算,采用基于子图中心的计算模型(该模型允许计算过程中的信息在子图内部自由传递)比采用基于顶点中心的计算模型明显拥有更少的计算步骤。
但是,传统的基于子图中心的计算模型是为多机系统设计的,还没有工作将基于子图中心的计算模型引入单机环境中,因此有一些问题尚未明确,例如:当通信代价潜在地转换为I/O代价之后,引入基于子图中心的计算模型,能否系统地降低核外图系统的I/O成本并提高多核并行性?传统的基于子图中心的计算模型要求对图执行更细粒度的划分来提升并行度,但是这样 做的代价是更多冗余的控制信息,比如全局顶点ID到本地顶点ID的映射,这一问题在分布式环境下可以通过为每台计算节点分配足够多的内存解决,但是在单机多核环境下更细粒度的图划分会占据本就珍贵的内存资源。
发明人认为,将基于子图中心的计算模型扩展到单机系统将面临以下改进需求:当输入图超过内存容量时,单机系统需要借助辅助存储器(如硬盘、SSD等)作为内存扩展进行计算,因此需要合理管理图在内存和磁盘间的调度;传统的基于子图中心的计算模型通过计算机网络在计算单元之间传递消息以进行同步,但在共享内存的情况下,单机系统的同步逻辑发生了改变,因此,需要更有效地实现消息同步;基于子图中心的计算模型仅利用数据分区并行性,这可能导致在内存容量有限的情况下,CPU内核利用率不足或图碎片过多,因此,需要平衡考虑子图间的并行计算与子图内的并行计算;由于单机系统共享内存架构,工作迁移在单个机器的核心之间的成本很低,因此,需要采用灵活的资源调度来提高系统性能。
本实施例中,提供一种基于单机的大规模图数据处理系统,包括数据加载模块、数据计算模块、数据释放模块、存储管理模块和磁盘;所述磁盘存储有由若干子图构成的大规模图数据;所述存储管理模块存储有与每一所述子图对应的状态信息;初始状态下,所述子图的状态为活跃;
所述数据加载模块用于从所述磁盘获取状态为活跃的所述子图,并将所述子图传输至所述数据计算模块;
所述数据计算模块用于对所述子图进行更新,并将更新产生的消息传输至所述存储管理模块;
当更新后的所述子图存在改变时,所述数据计算模块还用于将所述子图传输至所述数据释放模块;
当所述子图非当前轮更新中的最后一个时,所述数据释放模块用于将所述子图写入所述磁盘;
当所述子图被写入所述磁盘时,所述存储管理模块用于将所述子图的状态设置为收敛。
在本申请的实施例中,相对于现有基于单机的大规模图处理系统通信开 销或I/O开销较高的问题,本申请将基于子图中心的计算模型应用到单机系统,并建立了一套流水线处理架构。参照图2,给定一个大图G(大图G最初存储在磁盘上),所述流水线处理架构以图G的子图{F0,F1,F2,F3,...,Fn-1}作为最小输入输出单元,并以流水线迭代地对大图G进行更新,具体来说,该架构将子图Fi的核外处理分解为三个连续阶段:将Fi读入内存,计算并更新Fi,以及如果需要,将更新后的Fi写回外存,通过三个模块,即数据加载模块、数据计算模块和数据释放模块完成这些阶段,这些模块在流水线处理架构中,通过两个任务队列“输入队列”和“输出队列”异步工作。
所述流水线处理架构有效地重叠了子图I/O和CPU操作,在内存子图上进行计算,同时从磁盘加载挂起的子图,可以降低传统的顶点中心计算模型的I/O成本,同时通过减少空闲等待提高CPU的利用率,并且能够连续访问磁盘;此外,该架构采用共享内存数据结构进行消息传递和高效同步,能够将计算从内存管理和调度中分离出来,从而为优化提供了新的机会。
下面,将对本示例性实施例中一种基于单机的大规模图数据处理系统做进一步地说明。
本实施例中,所述系统采用基于混合计算模型的APIs。该APIs采用统一的PIE+接口,这一套接口集成了顶点中心和子图中心的编程模型,用户不仅可以在基于子图中心的计算模型下并行化顺序图算法以简化并行编程(子图间并行),还可以通过新的接口进一步探索基于顶点中心计算模型的子图内部的并行性。需要说明的是,所述混合模型同时支持“子图中心计算模型”的子图间并行和“顶点中心的计算模型”的子图内并行,在有限内存下,可以更好地利用多核资源,避免输入图的碎片化;此外,它还提供了一个统一的界面,用户可以从中选择最适合他们的应用程序和图形的界面。
本实施例中,所述系统还包括调度器。所述调度器用于在线程池中跟踪并分配线程,其中每个线程对应于一个物理CPU核心,它决定将物理线程分配给虚拟工作线程,以便在子图上执行(并行)计算,它还进行主动调整以支持两级并行:当线程可用时,所述调度器通过消耗“输入队列”来将其分配给新的计算单元,以加快子图间的并行性,或通过正在运行的工作核心 来改善子图内的并行性。
本实施例中,所述存储管理模块包括消息存储单元;所述数据计算模块用于将更新产生的消息传输至所述消息存储单元。所述消息存储单元用于实现并行计算单元之间的消息同步。具体的,所述消息存储单元被实现为一种内存中的数据结构,为了提高空间效率,可以将其实现为一个紧凑的可变长度数组。需要说明的是,所述消息存储单元的空间复杂度与划分策略密切相关,如果有越多的边界顶点/边,那么所述消息存储单元消耗的空间就越大。与多机系统的消息传递策略相比,采用消息存储单元在共享内存环境中工作效率更高。
本实施例中,所述存储管理模块还包括状态管理单元;所述状态管理单元存储有所述状态信息,并可以在特定时刻更新所述状态信息。所述状态管理单元用于维护一个状态机来建模子图的状态。具体的,所述状态管理单元被实现为一个轻量级的数据结构,每个子图只维护几个状态,占用的内存空间可以忽略不计。
采用所述状态管理单元记录子图的状态信息是一种低成本的收敛检测方法,只需要使用一个标记列表M来帮助跟踪计算单元之间的消息交换,以及一个轻量级状态机来对每个计算单元的工作进度建模。具体的,所述状态管理单元构建了一个标记列表M,每个子图对应一个标记,用于指示其是否在该轮迭代接收到有任何消息,如果一个子图至少有一个挂起的更新要从所述消息存储单元中提取,则其对应的M[i]为真,否则,M[i]为假。实际操作时,可以采用有限状态机对每个子图的进度进行建模,并使用标志M[i]来触发子图的状态转换。
如图3所示,所述子图的状态包括“活跃”、“等待计算”、“正在计算”、“释放中”和“收敛”五种,在任何时刻,所述子图都处于五种状态中的一种,其中,前两种状态表示所述子图在磁盘上,剩下的状态表示所述子图在内存中。每个所述子图的初始状态都是“活跃”,意味着所述子图正在等待所述数据加载模块将其载入到内存;当所述子图被所述数据加载模块获取时,所述状态管理单元用于将所述子图的状态设置为“等待计算”,意 味着所述子图已经驻留在内存中,所述子图等待被分配处理核心;当所述子图被传输至所述数据计算模块时,所述状态管理单元用于将所述子图的状态设置为“正在计算”,意味着所述子图正在被处理核心计算处理;当所述子图被传输至所述数据释放模块(也即所述子图在当前轮更新产生了需要发送给其他所述子图的消息)时,所述状态管理单元用于将所述子图的状态设置为“释放中”;当所述子图被写入所述磁盘时,所述状态管理单元用于将所述子图的状态设置为“收敛”;当当前轮更新结束时,所述状态管理单元还用于将参与下一轮更新的子图的状态设置为活跃,以便这部分子图开始下一轮的更新;当且仅当当前轮更新结束且所述消息存储单元中不存在消息缓存时,整个系统停止更新。
在一定条件下,所述系统可以跳过一轮计算中的某些状态而不影响正确性,也就是说,可以在状态转换中采取一些“捷径”,并减少不必要的计算和I/O。
如图3所示,本实施例中,当当前轮更新结束时,所述存储管理模块用于将接收到消息的所述子图的状态设置为“活跃”(“捷径A”)。为了开始新一轮增量计算,状态为“收敛”的所述子图的状态需要被重置为“活跃”,如果此时一个子图对应的M[i]为真,可以让该子图保持在“收敛”状态,使其不参与下一轮的更新,由此可以全面跳过对无需更新的子图的处理,而不影响程序的正确性。“捷径A”经常在输入图没有被很好地连接,某一子图是“孤立的”的情况下被利用,并有效减少了I/O成本。
本实施例中,当所述子图为当前轮更新中的最后一个时,所述数据释放模块用于将所述子图传输至所述数据加载模块(“捷径B”)。在全部所述子图完成当前轮更新后会开始新一轮的更新,如果一个子图仍然处于“释放中”状态,也就是说还没有完全保存到磁盘上,可以直接将该子图的状态设置为“等待计算”,由此可以在不经过磁盘的情况下启动对该子图的新一轮更新。“捷径B”可以在每一轮结束时被利用,并有效减少了I/O成本。
本实施例中,当更新后的所述子图不存在改变时,所述数据计算模块用于将所述子图写入所述磁盘(“捷径C”)。当一个子图更新完成时,如果 其相比于计算前并没有任何改变,可以跳过“释放中”状态,直接将其设置为“收敛”,由此可以有效减少冗余的磁盘写入。
本实施例中,所述数据计算模块包括聚合计算单元,当当前轮更新结束且所述消息存储单元中不存在消息缓存时,所述聚合计算单元用于调用预设的聚合函数对全部所述子图进行聚合,得到更新后的大规模图数据。
尽管已描述了本申请实施例的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请实施例范围的所有变更和修改。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。
以上对本申请所提供的一种基于单机的大规模图数据处理系统,进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (10)

  1. 一种基于单机的大规模图数据处理系统,其特征在于,包括数据加载模块、数据计算模块、数据释放模块、存储管理模块和磁盘;所述磁盘存储有由若干子图构成的大规模图数据;所述存储管理模块存储有与每一所述子图对应的状态信息;初始状态下,所述子图的状态为活跃;
    所述数据加载模块用于从所述磁盘获取状态为活跃的所述子图,并将所述子图传输至所述数据计算模块;
    所述数据计算模块用于对所述子图进行更新,并将更新产生的消息传输至所述存储管理模块;
    当更新后的所述子图存在改变时,所述数据计算模块还用于将所述子图传输至所述数据释放模块;
    当所述子图非当前轮更新中的最后一个时,所述数据释放模块用于将所述子图写入所述磁盘;
    当所述子图被写入所述磁盘时,所述存储管理模块用于将所述子图的状态设置为收敛。
  2. 根据权利要求1所述的系统,其特征在于,当更新后的所述子图不存在改变时,所述数据计算模块还用于将所述子图写入所述磁盘。
  3. 根据权利要求1所述的系统,其特征在于,当所述子图为当前轮更新中的最后一个时,所述数据释放模块还用于将所述子图传输至所述数据加载模块。
  4. 根据权利要求1所述的系统,其特征在于,当当前轮更新结束时,所述存储管理模块还用于将接收到消息的所述子图的状态设置为活跃。
  5. 根据权利要求1所述的系统,其特征在于,当当前轮更新结束且所述存储管理模块中不存在消息缓存时,所述数据计算模块还用于对全部所述子图进行聚合,得到更新后的大规模图数据。
  6. 根据权利要求1所述的系统,其特征在于,所述存储管理模块包括消息存储单元和状态管理单元;所述状态管理单元存储有所述状态信息;
    所述数据计算模块用于将更新产生的消息传输至所述消息存储单元;
    当当前轮更新结束时,所述状态管理单元用于将接收到消息的所述子图 的状态设置为活跃。
  7. 根据权利要求6所述的系统,其特征在于,所述数据计算模块包括聚合计算单元;
    当当前轮更新结束且所述消息存储单元内不存在消息缓存时,所述聚合计算单元用于对全部所述子图进行聚合,得到更新后的大规模图数据。
  8. 根据权利要求1所述的系统,其特征在于,当所述子图被所述数据加载模块获取时,所述存储管理模块还用于将所述子图的状态设置为等待计算。
  9. 根据权利要求1所述的系统,其特征在于,当所述子图被传输至所述数据计算模块时,所述存储管理模块还用于将所述子图的状态设置为正在计算。
  10. 根据权利要求1所述的系统,其特征在于,当所述子图被传输至所述数据释放模块时,所述存储管理模块还用于将所述子图的状态设置为释放中。
PCT/CN2023/101407 2023-06-12 2023-06-20 一种基于单机的大规模图数据处理系统 Ceased WO2024254894A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310695465.2 2023-06-12
CN202310695465.2A CN116680296B (zh) 2023-06-12 2023-06-12 一种基于单机的大规模图数据处理系统

Publications (1)

Publication Number Publication Date
WO2024254894A1 true WO2024254894A1 (zh) 2024-12-19

Family

ID=87790518

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101407 Ceased WO2024254894A1 (zh) 2023-06-12 2023-06-20 一种基于单机的大规模图数据处理系统

Country Status (2)

Country Link
CN (1) CN116680296B (zh)
WO (1) WO2024254894A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308327A (zh) * 2018-09-19 2019-02-05 浙江天猫技术有限公司 基于子图模型兼容点中心模型的图计算方法装置介质设备
US20200249998A1 (en) * 2019-02-01 2020-08-06 Alibaba Group Holding Limited Scheduling computation graph heterogeneous computer system
CN111859027A (zh) * 2019-04-24 2020-10-30 华为技术有限公司 图计算方法及装置
CN112988064A (zh) * 2021-02-09 2021-06-18 华中科技大学 一种面向并发多任务的磁盘图处理方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2003287815A1 (en) * 2002-12-02 2004-06-23 Mount Sinai Hospital Methods and products for representing and analyzing complexes of biological molecules
CN112352234B (zh) * 2018-06-15 2024-03-08 华为云计算技术有限公司 用于处理并发属性图查询的系统
WO2020019313A1 (zh) * 2018-07-27 2020-01-30 浙江天猫技术有限公司 一种图数据的更新方法、系统、计算机可读存储介质及设备
CN111241353B (zh) * 2020-01-16 2023-08-22 支付宝(杭州)信息技术有限公司 一种图数据的分区方法、装置以及设备
CN113392280B (zh) * 2021-06-10 2023-08-04 东北大学 一种面向跨区域的多主模型分布式图计算方法
CN113434702A (zh) * 2021-07-27 2021-09-24 支付宝(杭州)信息技术有限公司 一种用于图计算的自适应控制方法和系统
CN114756483A (zh) * 2022-03-31 2022-07-15 深圳清华大学研究院 基于核间存储访问的子图分段优化方法及应用

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308327A (zh) * 2018-09-19 2019-02-05 浙江天猫技术有限公司 基于子图模型兼容点中心模型的图计算方法装置介质设备
US20200249998A1 (en) * 2019-02-01 2020-08-06 Alibaba Group Holding Limited Scheduling computation graph heterogeneous computer system
CN111859027A (zh) * 2019-04-24 2020-10-30 华为技术有限公司 图计算方法及装置
CN112988064A (zh) * 2021-02-09 2021-06-18 华中科技大学 一种面向并发多任务的磁盘图处理方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MEI ZHENJIE: "Research on Optimization Method of Graph-oriented Computation", CHINA MASTER’S THESES FULL-TEXT DATABASE, 24 May 2017 (2017-05-24), XP093247168 *

Also Published As

Publication number Publication date
CN116680296B (zh) 2026-01-13
CN116680296A (zh) 2023-09-01

Similar Documents

Publication Publication Date Title
CN110704360B (zh) 一种基于异构fpga数据流的图计算优化方法
CN101950282B (zh) 一种多处理器系统及其同步引擎
CN105653204A (zh) 一种基于磁盘的分布式图计算方法
CN103455371B (zh) 用于优化的管芯内小节点间消息通信的方法和系统
US20110265093A1 (en) Computer System and Program Product
CN111190735A (zh) 一种基于Linux的片上CPU/GPU流水化计算方法及计算机系统
WO2023274278A1 (zh) 一种资源调度的方法、装置及计算节点
CN116719646A (zh) 热点数据处理方法、装置、电子装置和存储介质
CN107168795A (zh) 基于cpu‑gpu异构复合式并行计算框架的密码子偏差系数模型方法
CN106095552A (zh) 一种基于i/o去重的多任务图处理方法及系统
CN109144749A (zh) 一种使用处理器实现多处理器间通信的方法
CN116340024B (zh) 仿真模型组件进程间的数据共享方法、计算机设备及介质
CN117539598A (zh) 任务处理方法、装置、电子设备及存储介质
CN116841952A (zh) 核间通信系统、方法、装置、设备、芯片及可读存储介质
CN107528871A (zh) 存储系统中的数据分析
CN114792186A (zh) 生产排产仿真方法及其装置
WO2024254894A1 (zh) 一种基于单机的大规模图数据处理系统
CN118277490A (zh) 数据处理系统、数据同步方法、电子设备和存储介质
CN114741166B (zh) 一种分布式任务的处理方法、分布式系统及第一设备
CN119807082A (zh) 一种基于Stream write指令的GPU写存储系统及其方法
CN116820713A (zh) 流水控制方法、加速器和电子设备
Singh Communication Coroutines For Parallel Program Using DW26010 Many Core Processor
CN110515729B (zh) 基于图形处理器的图计算节点向量负载平衡方法及装置
CN108958904A (zh) 嵌入式多核中央处理器的轻量级操作系统的驱动程序框架
CN119861974B (zh) 基于数据流核心架构的任务处理方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23941127

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE