WO2026009207A2 - Algorithme de multiplication de matrice avec complexité temporelle de o(n2), extensible pour résoudre des systèmes d'équations linéaires et pouvant être mis en œuvre sur des architectures de traitement parallèles - Google Patents

Algorithme de multiplication de matrice avec complexité temporelle de o(n2), extensible pour résoudre des systèmes d'équations linéaires et pouvant être mis en œuvre sur des architectures de traitement parallèles

Info

Publication number
WO2026009207A2
WO2026009207A2 PCT/IB2025/058508 IB2025058508W WO2026009207A2 WO 2026009207 A2 WO2026009207 A2 WO 2026009207A2 IB 2025058508 W IB2025058508 W IB 2025058508W WO 2026009207 A2 WO2026009207 A2 WO 2026009207A2
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
binary
matrices
layers
counting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/IB2025/058508
Other languages
English (en)
Other versions
WO2026009207A3 (fr
Inventor
Danial SADEGHI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of WO2026009207A2 publication Critical patent/WO2026009207A2/fr
Publication of WO2026009207A3 publication Critical patent/WO2026009207A3/fr
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention pertains to: - Numerical computation, - Optimization of matrix algorithms, and - Parallel processing systems.
  • This invention provides: - A simplified, stable matrix multiplication algorithm with O(n2) time complexity - Elimination of recursive subdivisions and associated computational overhead - Native compatibility with parallel processing architectures (GPU/FPGA) - Memory-optimized operations without numerical instability issues
  • Computational Efficiency O(n2) time complexity for matrix multiplication - Utilizes: • Multi-layer bitwise representation • Vectorized AND + popcount operations • Kronecker weighting system 2. Hardware Optimization - Native GPU/FPGA compatibility - Reduced memory requirements - Parallel processing support 3. Performance Advantages - Maintained numerical precision - Scalable for large matrices - Lower resource utilization vs. Strassen/Coppersmith-Winograd.
  • FIG. 1 shows a flowchart of the Binary Fusion matrix multiplication algorithm, illustrating the overall process from the input matrices to the reconstructed output.
  • FIG. 1 shows a schematic diagram of the complete workflow of the Binary Fusion algorithm, where each stage of the process is represented with a symbolic block and a brief description.
  • FIG. 1 shows a parallelization diagram of the Binary Fusion algorithm, in which the input matrices are represented as binary layers processed concurrently by GPU threads, with subsequent operations performed in FPGA cores using logical AND and popcount functions.
  • This invention introduces a novel, highly parallelizable computational method for performing matrix multiplication. Instead of relying on conventional numerical operations such as floating-point multiplication and addition, the proposed method employs a layered binary representation of input elements and executes counting–logical operations with structured weighting.
  • All elements of the input matrices are first mapped to a fixed binary precision. This mapping is performed by applying an appropriate offset, enabling negative or fractional numbers to be transformed into the domain of non-negative integer binary values.
  • Step 2 Binary Layered Representation
  • Each normalized element of the input matrices is decomposed into a fixed-length vector of bits (e.g. 8-bit or 16-bit).
  • each input matrix (e.g. A) is transformed into a unique three-dimensional structure of size , where m and n are the original matrix dimensions and L is the number of layers.
  • Step 3 Structural Organization for Parallel Processing
  • the binary layers of the input matrices are structurally rearranged so that vertical and horizontal alignment patterns correspond to their computational positions. Simultaneously, two independent weight vectors v and w are generated according to the ordering of the binary layers.
  • Step 4 Counting–Logical Multiplication of Binary Data
  • counting operations are applied to the logical product of binary matrices, producing the Counting–Logical matrix (CL). This is performed vectorially and in parallel, using the logical AND operator—where only the combination of two “1” bits produces an output of 1, and all other combinations yield 0.
  • This structure enables hierarchical combination of binary layers.
  • the CL matrix itself is a block matrix formed from cross-combinations of meaningful binary layers. (See .)
  • the output matrix is reconstructed by structured multiplication of the weight matrices on both sides of the CL matrix.
  • the algorithm maintains a time complexity of in this stage.
  • Step 7 Offset Reversal (Dependent on Step 1)
  • Step 1 If an initial offset or fractional scaling factor was applied in Step 1, offset reversal is carried out:
  • the offset is removed, which may involve subtracting offset - induced matrix terms - such as products of the input matrices with unit matrices (depending on multiplication side) and products of two unit matrices.
  • the final result is exactly equivalent to the conventional matrix product of the original matrices—without having used floating-point operations at any stage. This eliminates rounding and numerical instability errors.
  • the algorithm can be implemented with ease in environments such as MATLAB, Python, and the C language family.
  • Example1 the proposed algorithm can be applied in the field of real-time image processing to perform matrix multiplication between pixel matrices.
  • pixel colors are represented as non-negative integers in the range of 0 to 255. Therefore, no offsetting stage for converting negative or fractional values is required, and the input values are directly transformed into layered binary representations.
  • Each color channel of the image (such as R, G, or B) is processed as an independent matrix and decomposed into a three-dimensional structure of binary layers.
  • the counting–logical operations between binary layers are executed in parallel by logic units based on FPGA (or equivalent logical modules). These operations consist of counting the positions where active bits (1s) in two matrices interact, thereby producing a counting matrix. Subsequently, weight vectors corresponding to layer positions are expanded into two weight matrices by applying Kronecker products of identity matrices to the vectors, which are then multiplied on both sides of the counting matrix. As a result, the output matrix representing the weighted multiplication of the counting blocks is reconstructed. This output is finally transformed back into displayable color values, replacing the original pixel values.
  • FPGA or equivalent logical modules
  • matrix multiplication between two images can represent various outcomes depending on the intended purpose. For example, when two images or two channels of the same scene are multiplied, the result may be a composite image reflecting the overlap of visual information.
  • one matrix represents a filter or convolution kernel
  • its multiplication with the input image applies that filter, such as edge detection, blurring, or edge enhancement.
  • This embodiment significantly reduces the consumption of classical computational resources (multiplications and additions) while enabling real-time image processing in resource-constrained environments such as unmanned aerial vehicles (UAVs), surveillance systems, and smart cameras.
  • UAVs unmanned aerial vehicles
  • Example 2 In this example, the invention is used to accelerate neural network matrix operations:
  • Weight and input matrices in a neural network layer are converted into layered binary representations.
  • Parallel bitwise counting reduces the computation needed for matrix multiplications in fully connected or convolution layers.
  • Output matrices are reassembled into standard numerical values for activation functions.
  • Example 3 Here, the invention demonstrates performance on large-scale matrices (e.g., 10,000 ⁇ 10,000):
  • Decomposition and parallel bitwise counting allow the algorithm to scale efficiently without recursion overhead. Memory usage is optimized compared to classical approaches.
  • Example 4 ( Multi-Domain Application). The invention is applied across domains to show versatility:
  • Image processing, linear algebra, and neural network operations are combined in a single workflow.
  • Layered binary representations are reused across different matrices for different computations.
  • an integrated system can simultaneously handle real-time video analysis, numerical modeling, and AI inference without requiring three separate computational frameworks.
  • cross-domain workflows become both faster and more memory-efficient.
  • This versatility suggests that the invention can act as a unifying computational backbone in heterogeneous applications, reducing development time and improving cross-compatibility between fields.
  • the industrial applications of the present invention include, but are not limited to, the following domains:
  • High-speed image processing systems including real-time analysis for surveillance drones, smart cameras, and autonomous vehicles.
  • Cryptography-resistant and encrypted data reconstruction systems providing high-performance computation while maintaining security requirements.
  • Real-time analytical systems capable of processing streaming data for monitoring, diagnostics, or predictive analysis.
  • Machine learning and artificial intelligence implementations both in software and hardware-accelerated environments.
  • AI accelerators and hardware optimization systems for improved performance in specialized computing units.
  • This invention enables significant reductions in computational overhead while maintaining high numerical accuracy, making it directly applicable across a wide range of industrial, scientific, and technological fields.
  • NVIDIA NVIDIA.
  • CUDA C++ Programming Guide 2025.
  • EP3861467B1 Matrix architecture for machine learning; relies on compression and repetition reduction, not layer-wise numeric reconstruction.
  • TW202020598A Multi-architecture matrix processing; emphasizes parallel scheduling rather than layer-based modeling.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

La présente invention concerne un algorithme de multiplication de matrice parallèle permettant une complexité temporelle O(n²) par : 1. Technique de fusion binaire : Représentation de données par bits multicouche. Opérations de comptage-logique vectorisées (ET + comptage de population). Reconstruction pondérée basée sur Kronecker. 2. Avantages essentiels : Élimine les erreurs d'arrondi à virgule flottante. Réduit les exigences de bande passante de mémoire. Conception agnostique du point de vue matériel (GPU/FPGA/ASIC compatible). 3. Applications : Accélération d'apprentissage automatique. Optimisation de systèmes cryptographiques. Calcul scientifique à grande échelle. L'algorithme transforme des opérations numériques en processus parallèles en terme de bit tout en maintenant une entière précision. Par rapport aux procédés classiques (par exemple, Strassen), il présente une extensibilité supérieure pour de grandes matrices sans décomposition récursive.
PCT/IB2025/058508 2025-05-27 2025-08-24 Algorithme de multiplication de matrice avec complexité temporelle de o(n2), extensible pour résoudre des systèmes d'équations linéaires et pouvant être mis en œuvre sur des architectures de traitement parallèles Pending WO2026009207A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IR14043001926 2025-05-27
IR140450140003001926 2025-05-27

Publications (2)

Publication Number Publication Date
WO2026009207A2 true WO2026009207A2 (fr) 2026-01-08
WO2026009207A3 WO2026009207A3 (fr) 2026-04-09

Family

ID=98319335

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2025/058508 Pending WO2026009207A2 (fr) 2025-05-27 2025-08-24 Algorithme de multiplication de matrice avec complexité temporelle de o(n2), extensible pour résoudre des systèmes d'équations linéaires et pouvant être mis en œuvre sur des architectures de traitement parallèles

Country Status (1)

Country Link
WO (1) WO2026009207A2 (fr)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5170370A (en) 1989-11-17 1992-12-08 Cray Research, Inc. Vector bit-matrix multiply functional unit
US10489479B1 (en) 2016-09-12 2019-11-26 Habana Labs Ltd. Matrix multiplication engine
US10534838B2 (en) 2017-09-29 2020-01-14 Intel Corporation Bit matrix multiplication
US10621269B2 (en) 2017-05-17 2020-04-14 Google Llc Performing matrix multiplication in hardware
TW202020598A (zh) 2018-05-15 2020-06-01 美商萊特美特股份有限公司 光子處理系統及方法
US11194886B2 (en) 2019-05-09 2021-12-07 Applied Materials, Inc. Bit-ordered binary-weighted multiplier-accumulator
US11334648B2 (en) 2017-12-29 2022-05-17 Huawei Technologies Co., Ltd. Matrix multiplier
US20230289398A1 (en) 2022-03-10 2023-09-14 Nvidia Corporation Efficient Matrix Multiply and Add with a Group of Warps
EP3861467B1 (fr) 2018-11-19 2025-05-28 Microsoft Technology Licensing, LLC Entrées planifiées de codage par compression pour des calculs matriciels

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5170370A (en) 1989-11-17 1992-12-08 Cray Research, Inc. Vector bit-matrix multiply functional unit
US10489479B1 (en) 2016-09-12 2019-11-26 Habana Labs Ltd. Matrix multiplication engine
US10621269B2 (en) 2017-05-17 2020-04-14 Google Llc Performing matrix multiplication in hardware
US10534838B2 (en) 2017-09-29 2020-01-14 Intel Corporation Bit matrix multiplication
US11334648B2 (en) 2017-12-29 2022-05-17 Huawei Technologies Co., Ltd. Matrix multiplier
TW202020598A (zh) 2018-05-15 2020-06-01 美商萊特美特股份有限公司 光子處理系統及方法
EP3861467B1 (fr) 2018-11-19 2025-05-28 Microsoft Technology Licensing, LLC Entrées planifiées de codage par compression pour des calculs matriciels
US11194886B2 (en) 2019-05-09 2021-12-07 Applied Materials, Inc. Bit-ordered binary-weighted multiplier-accumulator
US20230289398A1 (en) 2022-03-10 2023-09-14 Nvidia Corporation Efficient Matrix Multiply and Add with a Group of Warps

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ALOM, M. ET AL.: "Efficient GPU-Accelerated Matrix Multiplication", IEEE ACCESS, 2021
CHEN, Y. ET AL.: "Layered Binary Matrix Techniques for Neural Networks.", NEURAL COMPUTING & APPLICATIONS, 2022
DONGARRA, J. ET AL.: "High-Performance Matrix Multiplication: Algorithms and Architectures", JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2019
LI, H. ET AL.: "Bit-level Matrix Decomposition for High-performance Computing.", ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2021
PATTERSON, D.HENNESSY, J: "Computer Organization and Design", 2013
VOLKOV, V.DEMMEL, J.: "Benchmarking GPUs for Matrix Computations.", ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008
ZHANG, X. ET AL.: "Parallel Matrix Multiplication on FPGA Architectures", IEEE TRANSACTIONS ON COMPUTERS, 2020

Also Published As

Publication number Publication date
WO2026009207A3 (fr) 2026-04-09

Similar Documents

Publication Publication Date Title
US10909418B2 (en) Neural network method and apparatus
Kannan et al. MPI-FAUN: An MPI-based framework for alternating-updating nonnegative matrix factorization
Ahmadi-Asl et al. Randomized algorithms for fast computation of low rank tensor ring model
DE102021127803A1 (de) Einbindung einer ternären matrix in ein neuronales netz
CN118229632B (zh) 显示屏缺陷检测方法、模型训练方法、装置、设备及介质
DE102022129634A1 (de) Durchführen von simulationen unter verwendung maschinellen lernens
Jiabao et al. The application of SJ-MSD adder to mean value filtering processing
Zhang et al. Expanding the edge: Enabling efficient winograd cnn inference with deep reuse on edge device
Saha Leveraging cloud computing to overcome the computational challenges of GAN training
CN115480919B (zh) 卷积优化运算方法、装置、计算机设备及存储介质
Bai et al. Tt-rnnpool3d: A tensor-based high-order rnnpool model for mobile edge consumer applications
WO2026009207A2 (fr) Algorithme de multiplication de matrice avec complexité temporelle de o(n2), extensible pour résoudre des systèmes d'équations linéaires et pouvant être mis en œuvre sur des architectures de traitement parallèles
CN119227825B (zh) 一种基于双比特量子门分类的量子电路指令集转换方法
CN114912570A (zh) 加速神经网络模型优化的方法、装置、设备及可读介质
Su et al. Removing rows and columns of tokens in vision transformer enables faster dense prediction without retraining
Wang et al. Cpac-conv: Cp-decomposition to approximately compress convolutional layers in deep learning
Exman et al. Linear Software Models: Modularity Analysis by the Laplacian Matrix.
CN111143762A (zh) 一种张量数据分解方法及系统
Fiorini Deep learning and deep thinking: New application framework by CICT
Zheng Data frugality and computational efficiency in deep learning
Boureima et al. Distributed Out-of-core NMF of Dense and Sparse Data on CPU/GPU Architectures with Automatic Model Selection for Exascale Data
Cai et al. GPU-accelerated restricted boltzmann machine for collaborative filtering
Li et al. CUSNTF: A scalable sparse non-negative tensor factorization model for large-scale industrial applications on multi-GPU
Wang et al. An efficient architecture for floating-point eigenvalue decomposition
Ahi et al. LLMs and LVMs for agentic AI: a GPU-accelerated multimodal system architecture for RAG-grounded, explainable, and adaptive intelligence