WO2019116106A2 - Image processor and methods for processing an image - Google Patents
Image processor and methods for processing an image
- Publication number
- WO2019116106A2 (PCT/IB2018/001597)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- benes network
- benes
- network
- inputs
- configuration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Definitions
- DAS: camera-based driver assistance systems
- LDW: lane departure warning
- AHC: Automatic High-beam Control
- FCW: forward collision warning
- any combination of any unit, device, and/or component disclosed in any figure and/or in the specification may be provided.
- Non-limiting examples of such units include a gather unit, an image processor and the like.
- Any combination of the methods and/or method steps of originally filed claims 1-17, 18-19, 20-21, 25-42, 53, 75, 97, 98, 99 and 109-114 may be provided.
- a method of calculating warp results may include executing, for each target pixel out of a group of target pixels, a warp calculation process that may include receiving, by a first group of processing units of an array of processing units, a first weight and a second weight associated with the target pixel; receiving, by a second group of processing units of the array, values of neighboring source pixels associated with the target pixel; calculating, by the second group, a warp result based on the values of the neighboring source pixels and the pair of weights; and providing the warp result to a memory module.
- the calculating of the warp result may include relaying values of some of the neighboring source pixels between processing units of the second group.
- the calculating of the warp result may include relaying intermediate results calculated by the second group and values of some of the neighboring source pixels between processing units of the second group.
- the calculating of the warp result may include calculating, by a first processing unit of the second group, a first difference between a first pair of neighboring source pixels and a second difference between a second pair of neighboring source pixels; providing the first difference to a second processing unit of the second group; and providing the second difference to a third processing unit of the second group.
- the calculating of the warp result may further include calculating, by a fourth processing unit of the second group, a first modified weight in response to the first weight; providing the first modified weight from the fourth processing unit to the second processing unit of the second group; and calculating, by the second processing unit of the second group, a first intermediate result based on the first difference, a first neighboring source pixel and the first modified weight.
- the calculating of the warp result may further include providing the second difference from the third processing unit of the second group to a sixth processing unit of the second group; providing a second neighboring source pixel from a fifth processing unit of the second group to the sixth processing unit of the second group; and calculating, by the sixth processing unit of the second group, a second intermediate result based on the second difference, the second neighboring source pixel and the first modified weight.
- the calculating of the warp result may further include providing the second intermediate result from the sixth processing unit of the second group to a seventh processing unit of the second group; providing the first intermediate result from the second processing unit of the second group to the seventh processing unit of the second group; and calculating, by the seventh processing unit of the second group, a third intermediate result based on the first and second intermediate results.
- the calculating of the warp result may further include providing the third intermediate result from the seventh processing unit of the second group to an eighth processing unit of the second group; providing the second intermediate result from the sixth processing unit of the second group to a ninth processing unit of the second group; providing the second intermediate result from the ninth processing unit of the second group to the eighth processing unit of the second group; providing the second modified weight from the third processing unit of the second group to the eighth processing unit of the second group; and calculating the warp result, by the eighth processing unit of the second group, based upon the second and third intermediate results and the second modified weight.
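As a non-limiting illustration, the chain of differences, modified weights and intermediate results described above may be sketched in Python. The sketch assumes that the warp is a bilinear interpolation over four neighboring source pixels, which the text does not state explicitly. The function name, the pixel names, and the forms chosen for the two modified weights (the first weight unchanged, and one minus the second weight) are hypothetical, and the sketch runs sequentially rather than across separate processing units.

```python
def warp_result(p00, p01, p10, p11, first_weight, second_weight):
    """Hypothetical sketch of one warp calculation for a single target pixel.

    p00, p01: first pair of neighboring source pixels (assumed upper row).
    p10, p11: second pair of neighboring source pixels (assumed lower row).
    Both weights are assumed to lie in the range [0, 1].
    """
    # First and second differences between the two pairs of source pixels.
    first_difference = p01 - p00
    second_difference = p11 - p10

    # Modified weights; these particular forms are assumptions of the sketch.
    first_modified_weight = first_weight
    second_modified_weight = 1.0 - second_weight

    # First and second intermediate results: interpolation along each row.
    first_intermediate = p00 + first_modified_weight * first_difference
    second_intermediate = p10 + first_modified_weight * second_difference

    # Third intermediate result, based on the first and second ones.
    third_intermediate = first_intermediate - second_intermediate

    # Warp result from the second and third intermediate results
    # and the second modified weight.
    return second_intermediate + second_modified_weight * third_intermediate


if __name__ == "__main__":
    print(warp_result(10, 20, 30, 40, 0.5, 0.5))
```

With both weights at zero the sketch returns the first pixel of the first pair, and with both weights at one it returns the second pixel of the second pair, which is the behavior expected of a bilinear interpolation.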
- the method may include executing, by the array, multiple warp computing processes associated with a subgroup of target pixels in parallel.
- the method may include fetching, from a gather unit, neighboring source pixels associated with each target pixel of the subgroup of pixels in parallel; wherein the gather unit may include a set-associative cache and may be arranged to access a memory module that may include multiple independently accessible memory banks.
- the method may include receiving for each target pixel of the subgroup of pixels, first and second warp parameters; wherein the first and second warp parameters may include the first and second weights and location information indicative of a location of the neighboring source pixels associated with the target pixel.
- the method may include providing, to the gather unit, the location information for each target pixel of the subgroup of pixels.
- the method may include converting, by the gather unit, the location information to addresses of the neighboring source pixels.
- the method may include calculating, by a third group of processing units of the array, and for each target pixel of the subgroup of pixels, first and second warp parameters; wherein the first and second warp parameters may include the first and second weights and location information indicative of a location of the neighboring source pixels associated with the target pixel.
- the method may include sending, from the third group to the first group, the first and second weights.
- the method may include providing, to the gather unit, the location information for each target pixel of the subgroup of pixels.
- the method may include converting, by the gather unit, the location information to addresses of the neighboring source pixels.
- a method for calculating warp results may include concurrently receiving, by a first group of processing units of an array of processing units, and for each target pixel of a subgroup of pixels, a first weight and a second weight; concurrently providing, to a gather unit, for each target pixel out of the subgroup of pixels, location information indicative of a location of the neighboring source pixels associated with the target pixel; concurrently receiving, by the array and from the gather unit, neighboring source pixels associated with each target pixel out of a subgroup of pixels; wherein different groups of the array receive neighboring source pixels associated with different target pixels of the subgroup of pixels; and concurrently calculating, by the different groups of the array, warp results related to the different target pixels.
- the method may include receiving or calculating, for each target pixel of the subgroup of pixels, first and second warp parameters; wherein the first and second warp parameters may include the first and second weights and location information indicative of a location of the neighboring source pixels associated with the target pixel.
- a method for calculating warp results may include repeating, for each subgroup of target pixels out of a group of target pixels, the steps of receiving, by an array of processing units, neighboring source pixels associated with each target pixel of the subgroup of target pixels; and calculating, by the array, warp results for target pixels from the subgroup of target pixels; wherein the calculating may include calculating intermediate results and relaying at least some of the intermediate results between processing units of the array.
- Each processing unit of the array may be directly coupled to a set of processing units of the array and may be indirectly coupled to another set of processing units of the array.
- the terms “processing units” and “data processors” may be used in an interchangeable manner.
- an image processor that may be configured to calculate warp results
- the image processor may be configured to execute, for each target pixel out of a group of target pixels, a warp calculation process that may include receiving, by a first group of processing units of an array of processing units of the image processor, a first weight and a second weight associated with the target pixel; receiving, by a second group of processing units of the array, values of neighboring source pixels associated with the target pixel; calculating, by the second group, a warp result based on the values of the neighboring source pixels and the pair of weights; and providing the warp result to a memory module.
- an image processor may be configured to calculate warp results
- the image processor may include an array of processing units that may be configured to concurrently receive, by a first group of processing units of the array, and for each target pixel of a subgroup of pixels, a first weight and a second weight; concurrently provide, to a gather unit of the image processor, for each target pixel out of the subgroup of pixels, location information indicative of a location of the neighboring source pixels associated with the target pixel; concurrently receive, by the array and from the gather unit, neighboring source pixels associated with each target pixel out of a subgroup of pixels; wherein different groups of the array receive neighboring source pixels associated with different target pixels of the subgroup of pixels; and concurrently calculate, by the different groups of the array, warp results related to the different target pixels.
- an image processor that may be configured to calculate warp results
- the image processor may be configured to repeat, for each subgroup of target pixels out of a group of target pixels, the steps of receiving, by an array of processing units of the image processor, neighboring source pixels associated with each target pixel of the subgroup of target pixels; and calculating, by the array, warp results for target pixels from the subgroup of target pixels; wherein the calculating may include calculating intermediate results and relaying at least some of the intermediate results between processing units of the array.
- a method for calculating disparity may include calculating, by a first group of data processors of an array of data processors, a set of sums of absolute differences (SADs); wherein the set of SADs may be associated with a source pixel and a subgroup of target pixels; wherein each SAD may be calculated based on previously calculated SADs and based on currently calculated absolute difference between another source pixel and a target pixel that belongs to the subgroup of target pixels; and determining, by a second group of data processors of the array, a best matching target pixel out of the subgroup of target pixels in response to values of the set of SADs.
- SADs: sums of absolute differences
- a given SAD of the set of SADs reflects absolute differences between a given rectangular source pixel array and a given rectangular target pixel array; wherein the previously calculated SADs may include (a) a first previously calculated SAD that reflects absolute differences between (i) a rectangular source pixel array that differs from the given rectangular source pixel array by a first source pixel column and by a second source pixel column, and (ii) a rectangular target pixel array that differs from the given rectangular target pixel array by a first target pixel column and by a second target pixel column; and (b) a second previously calculated SAD that reflects absolute differences between the first source pixel column and the first target pixel column.
- the other source pixel may be a lowest source pixel of the second source pixel column and the target pixel that belongs to the subgroup of target pixels may be a lowest target pixel of the second target pixel column.
- the method may include calculating the given SAD by calculating an intermediate result by subtracting, from the first previously calculated SAD, (a) the second previously calculated SAD and (b) an absolute difference between (i) a target pixel that may be positioned on top of the second target pixel column and (ii) a source pixel that may be positioned on top of the second source pixel column; and adding to the intermediate result an absolute difference between the lowest target pixel of the second target pixel column and the lowest source pixel of the second source pixel column.
- the method may include storing in the array of data processors, for the given SAD, the first previously calculated SAD, the second previously calculated SAD, the target pixel that may be positioned on top of the second target pixel column and the source pixel that may be positioned on top of the second source pixel column.
- the calculating of the given SAD may be preceded by fetching the lowest target pixel of the second target pixel column and the lowest source pixel of the second source pixel column.
- the subgroup of target pixels may include target pixels that may be sequentially stored in a memory module; wherein the calculating of the set of SADs may be preceded by fetching the subgroup of target pixels from the memory module.
- the fetching of the subgroup of target pixels from the memory module may be executed by a gather unit that may include a content addressable memory cache.
- the subgroup of target pixels may belong to a group of target pixels that may include multiple subgroups of target pixels; wherein the method may include repeating, for each subgroup of target pixels, the steps of calculating, by the first group of processing units, a set of SADs for each subgroup of target pixels; and finding, by the second group of data processors of the array, a best matching target pixel out of the group of target pixels in response to values of the sets of SADs of every subgroup of target pixels.
- the method may include calculating, by a first group of data processors of an array of data processors, multiple sets of SADs that may be associated with a plurality of source pixels and multiple subgroups of target pixels; wherein each SAD of the multiple sets of SADs may be calculated based on previously calculated SADs and on a currently calculated absolute difference; and finding, by a second group of data processors of the array and for each source pixel, a best matching target pixel in response to values of SADs that may be associated with the source pixel.
- the multiple sets of SADs may include sub-sets of SADs; each sub-set of SADs may be associated with the plurality of source pixels and a plurality of subgroups of target pixels of the multiple subgroups of target pixels.
- the plurality of source pixels may belong to a column of the rectangular array of pixels and may be adjacent to each other.
- the calculating of the multiple sets of SADs may include calculating, in parallel, SADs of different sub-sets of SADs.
- the method may include calculating, in sequential manner, SADs that belong to the same sub-set of SADs.
- the plurality of source pixels may be a pair of source pixels.
- the plurality of source pixels may be four source pixels.
- the different sub-sets of SADs may be calculated by different first subgroups of data processors of the array of data processors.
- the method may include calculating, in sequential manner, SADs that belong to the same sub-set of SADs; and sequentially fetching to the array of data processors target pixels related to the different SADs of the same sub-set of SADs.
- an image processor may include an array of data processors and may be configured to calculate disparity by calculating, by a first group of data processors of the array of data processors, a set of sums of absolute differences (SADs); wherein the set of SADs may be associated with a source pixel and a subgroup of target pixels; wherein each SAD may be calculated based on previously calculated SADs and based on currently calculated absolute difference between another source pixel and a target pixel that belongs to the subgroup of target pixels; and determining, by a second group of data processors of the array, a best matching target pixel out of the subgroup of target pixels in response to values of the set of SADs.
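As a non-limiting illustration, the selection of a best matching target pixel from a set of SADs may be sketched in Python. The sketch computes every SAD directly over rectangular pixel windows and does not reproduce the incremental update based on previously calculated SADs. All function and parameter names are hypothetical, and the choice of the lowest SAD as the best match is an assumption that the text implies but does not spell out.

```python
def window_sad(source, target, row, source_col, target_col, height, width):
    """Sum of absolute differences between two rectangular pixel windows.

    Both windows start at the same row (an assumption of this sketch) and
    at their own columns; images are lists of rows of integer pixel values.
    """
    return sum(
        abs(source[row + i][source_col + j] - target[row + i][target_col + j])
        for i in range(height)
        for j in range(width)
    )


def best_matching_target(source, target, row, col, candidate_cols, height, width):
    """Return (best candidate column, list of SADs) for one source pixel.

    candidate_cols plays the role of the subgroup of target pixels;
    the candidate with the lowest SAD is taken as the best match.
    """
    sads = [
        window_sad(source, target, row, col, c, height, width)
        for c in candidate_cols
    ]
    best_index = min(range(len(sads)), key=sads.__getitem__)
    return candidate_cols[best_index], sads


if __name__ == "__main__":
    source = [[1, 2, 3, 4], [5, 6, 7, 8]]
    target = [[0, 1, 2, 3], [0, 5, 6, 7]]  # source shifted right by one column
    best_col, sads = best_matching_target(source, target, 0, 1, [0, 1, 2], 2, 2)
    print(best_col, sads)
```

In the usage example the target image is the source image shifted by one column, so the best match for the window at source column 1 is found at target column 2, which corresponds to a disparity of one pixel.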
- a gather unit may include an input interface that may be arranged to receive multiple requests for retrieving multiple requested data units; a cache memory that may include multiple entries and may be configured to store multiple tags and multiple cached data units; wherein each tag may be associated with a cached data unit and may be indicative of a group of memory cells of a memory module that differs from the cache memory and stores the cached data unit; an array of comparators that may be arranged to concurrently compare between the multiple tags and multiple requested memory group addresses to provide comparison results; wherein each requested memory group address may be indicative of a group of memory cells of the memory module that stores a requested data unit of the multiple requested data units; a contention evaluation unit; a controller that may be arranged to (a) classify, based on the comparison results, the multiple requested data units into cached data units that may be stored in the cache memory and uncached data units; and (b) send to the contention evaluation unit information about cached and uncached data units; wherein the contention
- the array of comparators may be arranged to concurrently compare between the multiple tags and multiple requested memory group addresses during a single gather unit clock cycle; and wherein the contention evaluation unit may be arranged to check the occurrence of the at least one contention during a single gather unit clock cycle.
- the contention evaluation unit may be arranged to re-check an occurrence of at least one contention in response to new tags of the cache memory.
- the gather unit may be arranged to operate in a pipelined manner; wherein a duration of each phase of the pipeline may be one gather unit clock cycle.
- the cache memory may be a fully associative memory cache.
- the gather unit may include an address converter that may be arranged to convert location information included in the multiple requests to the multiple requested memory group addresses.
- the multiple requested data units may belong to an array of data units; wherein the location information includes coordinates of the multiple requested data units within the array of data units.
- the contention evaluation unit may include multiple groups of nodes; wherein each group of nodes may be arranged to evaluate a contention between the multiple requested memory group addresses and a tag of the multiple tags.
- a method for responding to multiple requests for retrieving multiple requested data units may include receiving, by an input interface of a gather unit, the multiple requests for retrieving multiple requested data units; storing, by a cache memory that may include multiple entries, multiple tags and multiple cached data units; wherein each tag may be associated with a cached data unit and may be indicative of a group of memory cells of a memory module that differs from the cache memory and stores the cached data unit; concurrently comparing, by an array of comparators, between the multiple tags and multiple requested memory group addresses to provide comparison results; wherein each requested memory group address may be indicative of a group of memory cells of the memory module that stores a requested data unit of the multiple requested data units; classifying, by a controller, based on the comparison results, the multiple requested data units into cached data units that may be stored in the cache memory and uncached data units; and sending to the contention evaluation unit information about cached and uncached data units; checking, by the contention evaluation unit, an
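As a non-limiting illustration, the address conversion, tag comparison and classification performed by the gather unit may be sketched in Python. Several points are assumptions of the sketch and are not taken from the text: the row-major address formula, the fixed group size, the mapping of a memory group address to a memory bank by a modulo operation, and the definition of a contention as two or more distinct uncached groups that fall in the same bank. All names are hypothetical, and the sketch runs sequentially, whereas the text describes concurrent comparison within a single clock cycle.

```python
def to_group_address(x, y, row_stride, group_size):
    """Convert coordinates within an array of data units to a memory group
    address (assumed row-major layout with fixed-size groups of cells)."""
    return (y * row_stride + x) // group_size


def classify_requests(tags, requested_groups):
    """Compare every requested group address with every cache tag and split
    the requests into cached (tag match) and uncached (no match)."""
    cached = [g for g in requested_groups if g in tags]
    uncached = [g for g in requested_groups if g not in tags]
    return cached, uncached


def find_contentions(uncached_groups, num_banks):
    """Report banks that would have to serve more than one distinct uncached
    group (the assumed meaning of a contention in this sketch)."""
    groups_by_bank = {}
    for group in set(uncached_groups):
        groups_by_bank.setdefault(group % num_banks, []).append(group)
    return {
        bank: sorted(groups)
        for bank, groups in groups_by_bank.items()
        if len(groups) > 1
    }


if __name__ == "__main__":
    coordinates = [(0, 0), (5, 0), (1, 1), (3, 2)]
    groups = [to_group_address(x, y, 16, 4) for x, y in coordinates]
    cached, uncached = classify_requests({1}, groups)
    print(groups, cached, uncached, find_contentions(uncached, 4))
```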
- a processing module may include an array of data processors; wherein each data processor unit out of multiple data processors of the array of data processors may be directly coupled to some data processors of the array of data processors, may be indirectly coupled to some other data processors of the array of data processors, and may include a relay channel for relaying data between relay ports of the data processor.
- the relay channel of each data processor of the multiple data processors may exhibit substantially zero latency.
- Each data processor of the multiple data processors may include a core; wherein the core may include an arithmetic logic unit and a memory resource; wherein cores of the multiple data processors may be coupled to each other by a configurable network.
- Each data processor of the multiple data processors may include multiple data flow components of the configurable network.
- Each data processor of the multiple data processors may include a first non-relay input port that may be directly coupled to a first set of neighbors.
- the first set of neighbors may be formed by data processors that may be located within a distance less than four data processors from the data processor.
- the first non-relay input port of the data processor may be directly coupled to relay ports of data processors of the first set of neighbors.
- the data processor further may include a second non-relay input port that may be directly coupled to non-relay ports of data processors of the first set of neighbors.
- the first non-relay input port of the data processor may be directly coupled to non-relay ports of data processors of the first set of neighbors.
- the first set of neighbors may be formed by eight data processors.
- a first relay port of each data processor of the multiple data processors may be directly coupled to a second set of neighbors.
- the second set of neighbors differs from the first set of neighbors.
- the second set of neighbors may include a data processing unit that may be more distant from the data processor than any of the data processors that belong to the first set of neighbors.
- the array of data processors may include, in addition to the multiple data processors, at least one other data processor.
- the data processors of the array of data processors may be arranged in rows and columns.
- Some data processors of each row may be coupled to each other in a cyclic manner.
- Data processors of each row may be controlled by a shared microcontroller.
- Each data processor of the multiple data processors may include configuration instruction registers; wherein the configuration instruction registers may be arranged to receive configuration instructions during a configuration process and to store the configuration instructions; wherein data processors of a given row may be controlled by a given shared microcontroller; wherein each data processor of the given row may be arranged to receive, from the given shared microcontroller, selection information for selecting a selected configuration instruction and to configure, under a certain condition, the data processor to operate according to the selected configuration instruction.
- the certain condition may be fulfilled when the data processor is arranged to respond to the selection information; wherein the certain condition may not be fulfilled when the data processor is arranged to ignore the selection information.
- Each data processor of the multiple data processors may include a controller, an arithmetic logic unit, a register file and configuration instruction registers; wherein the configuration instruction registers may be arranged to receive configuration instructions during a configuration process and to store the configuration instructions; wherein the controller may be arranged to receive selection information for selecting a selected configuration instruction and to configure the data processor to operate according to the selected configuration instruction.
- Each data processor of the multiple data processors may include up to three configuration instruction registers.
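As a non-limiting illustration, the selection of a stored configuration instruction by a shared microcontroller may be sketched in Python. The limit of three configuration instruction registers is taken from the text. The class and function names, the representation of instructions as strings, the two-bit range of the selection value, and the treatment of a selection value that has no matching register (the data processor keeps its current configuration) are assumptions of the sketch.

```python
class DataProcessor:
    """Hypothetical model of a data processor with configuration registers."""

    MAX_CONFIG_REGISTERS = 3  # "up to three configuration instruction registers"

    def __init__(self, responds_to_selection=True):
        self.config_registers = []
        self.responds_to_selection = responds_to_selection
        self.active_instruction = None

    def load_configuration(self, instructions):
        """Configuration process: store the configuration instructions."""
        if len(instructions) > self.MAX_CONFIG_REGISTERS:
            raise ValueError("too many configuration instructions")
        self.config_registers = list(instructions)

    def select(self, selection):
        """Apply selection information, assumed here to be a two-bit value."""
        if not 0 <= selection <= 3:
            raise ValueError("selection information must fit in two bits")
        # The "certain condition": the processor responds to the selection.
        if self.responds_to_selection and selection < len(self.config_registers):
            self.active_instruction = self.config_registers[selection]


def broadcast_selection(row, selection):
    """A shared microcontroller sends the same selection to a whole row."""
    for processor in row:
        processor.select(selection)


if __name__ == "__main__":
    row = [DataProcessor(), DataProcessor(responds_to_selection=False), DataProcessor()]
    for processor in row:
        processor.load_configuration(["add", "sub", "mul"])
    broadcast_selection(row, 1)
    print([p.active_instruction for p in row])
```

Because only a short selection value is sent at run time, the full configuration instructions need to be transferred only once, during the configuration process.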
- a method for operating a processing module that may include an array of data processors; wherein the operating may include processing data by data processors of the array; wherein each data processor unit out of multiple data processors of the array of data processors may be directly coupled to some data processors of the array of data processors, may be indirectly coupled to some other data processors of the array of data processors, and relaying, using one or more relay channels of one or more data processors, data between relay ports of the data processor.
- an image processor may include an array of data processors, first microcontrollers, a buffering unit and a second microcontroller; wherein data processors of the array may be arranged to receive, during a data processor configuration process, data processor configuration instructions; wherein the buffering unit may be arranged to receive, during a buffering unit configuration process, buffering unit configuration instructions; wherein the first microcontrollers may be arranged to control an operation of the data processors by providing data processor selection information to data processors; wherein the data processors may be arranged to select, in response to the data processor selection information, selected data processor configuration instructions, and to perform one or more data processing operations according to the selected data processor configuration instructions; wherein the second microcontroller may be arranged to control an operation of the buffering unit by providing buffering unit selection information to the buffering unit; wherein the buffering unit may be arranged to select, in response to at least a portion of the buffering unit selection information, a selected buffering unit configuration
- the data processors of the array may be arranged in groups of data processors; wherein different groups of data processors may be controlled by different first microprocessors.
- a group of data processors may be a row of data processors.
- Data processors of a same group of data processors receive in parallel the same data processor selection information.
- the buffering unit may include multiple groups of memory resources; wherein different groups of memory resources may be coupled to different groups of data processors.
- the image processor may include second microcontrollers; wherein different second microcontrollers may be arranged to control different groups of memory resources.
- the different groups of memory resources may be different groups of shift registers.
- the different groups of shift registers may be coupled to multiple groups of buffers that may be arranged to receive data from a memory module.
- the multiple groups of buffers may be not controlled by the second microcontrollers.
- the buffering unit selection information selects connectivity between the multiple groups of memory resources and the multiple groups of data processors.
- Each data processor may include an arithmetic logical unit and data flow components; wherein the data processor configuration instruction defines an opcode of the arithmetic logical unit and defines a flow of data to the arithmetic logic unit via the data flow components.
- the image processor may further include a memory module that may include multiple memory banks; wherein the buffering unit may be arranged to retrieve data from the memory module and to send the data to the array of data processors.
- the first microcontrollers share a program memory.
- Each first microcontroller may include control registers that store a first instruction address, a number of header instructions and a number of loop instructions.
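As a non-limiting illustration, the three control registers may be sketched in Python as driving a simple instruction sequence. The text names only the registers. The interpretation used here, in which the header instructions execute once and the loop instructions that follow them repeat, is an assumption, as are the function name, the consecutive address layout and the `iterations` parameter.

```python
def instruction_addresses(first_address, num_header, num_loop, iterations):
    """Hypothetical sequence of program memory addresses visited by a
    microcontroller: header instructions once, then the loop body repeated."""
    # Header instructions are assumed to start at the first instruction address.
    addresses = list(range(first_address, first_address + num_header))
    # Loop instructions are assumed to follow the header directly.
    loop_start = first_address + num_header
    for _ in range(iterations):
        addresses.extend(range(loop_start, loop_start + num_loop))
    return addresses


if __name__ == "__main__":
    print(instruction_addresses(100, 2, 3, 2))
```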
- the image processor may include a memory module that may be coupled to the buffering unit; wherein the memory module may include a store buffer, load store units and multiple memory banks; wherein the store buffer may be controlled by a third microcontroller.
- the store buffer may be arranged to receive, during a store buffer configuration process, store buffer configuration instructions; wherein the third microcontroller may be arranged to control an operation of the store buffer by providing store buffer selection information to the store buffer.
- the image processor may include store buffers that may be controlled by third microprocessors.
- an image processor may include multiple configurable circuits and multiple microcontrollers; wherein the multiple configurable circuits may include memory circuits and multiple data processors; wherein each configurable circuit may be arranged to store up to a limited amount of configuration instructions; wherein the multiple microcontrollers may be arranged to control the multiple configurable circuits by repetitively providing to the multiple configurable circuits selection information for selecting by each configurable circuit a selected configuration instruction out of the limited amount of configuration instructions.
- the multiple configurable circuits may include a memory module that may include multiple memory banks; and a buffering unit for exchanging data between the memory module and the data processors.
- a size of the selection information does not exceed two bits.
- a method for configuring an image processor may include multiple configurable circuits and multiple microcontrollers; wherein the multiple configurable circuits may include memory circuits and multiple data processors; wherein the method may include storing, in each configurable circuit, up to a limited amount of configuration instructions; controlling, by the multiple microcontrollers, the multiple configurable circuits by repetitively providing to the multiple configurable circuits selection information for selecting by each configurable circuit a selected configuration instruction out of the limited amount of configuration instructions.
- a method for operating an image processor may include providing an image processor that may include an array of data processors; a memory module that may include multiple memory banks; a buffering unit; a gather unit; and multiple microcontrollers; controlling the array of data processors by the multiple
- microprocessors a part of the memory module and the buffering unit; retrieving, by die buffering unit data from the memory module; sending, by the buffering unit, the data to the array of data processors; receiving by the gather unit multiple requests for retrieving multiple requested data units from the memory module; sending by the gather unit to the array of data processors the multiple requested data units.
- a method for configuring an image processor may include an array of data processors, first microcontrollers, a buffering unit and a second microcontroller; wherein the method may include providing, to data processors of the array, during a data processor configuration process data processor configuration instructions; providing to the buffering unit, during a buffering unit configuration process, buffering unit configuration instructions; controlling by the first microcontrollers an operation of the data processors by providing data processor selection information to data processors; selecting by the data processors, in response to the data processor selection information, selected data processor configuration instructions, and performing one or more data processing operation according to the selected data processor configuration instructions; controlling by the second microcontroller an operation of the buffering unit by providing buffering unit selection information to the buffering unit; selecting, by the buffering unit, in response to at least a portion of the buffering unit selection information, a selected buffering unit configuration instruction, and to performing one or more buffering unit operations according to a selected buffering unit configuration instruction; wherein a size of
- a non-transitory computer readable medium that stores instructions for responding to multiple requests for retrieving multiple requested data units that once executed by a gather unit result in the execution of the steps of : receiving, by an input interface of the gather unit, the multiple requests for retrieving multiple requested data units; storing, by a cache memory that comprises multiple entries, multiple tags and multiple cached data units; wherein each tag is associated with a cached data unit and is indicative of a group of memory cells of a memory module that differs from the cache memory and stores the cached data unit; concurrently comparing, by an array of comparators between the multiple tags and multiple requested memory group addresses to provide comparison results; wherein each requested memory group address is indicative of a group of memory cells of the memory module that stores a requested data unit of the multiple requested data units; classifying by a controller, based on the comparison results, the multiple requested data units to cached data units that are stored in the cache memory and uncached data units; and sending to the contention evaluation unit information about cached
- a non-transitory computer readable medium that stores instructions for operating a processing module that once executed by the processing module result in the execution of the steps of: processing data by data processors of an array of data processors of the processing module; wherein each data processor unit out of multiple data processors of the array of data processors is directly coupled to some data processors of the array of data processors, is indirectly coupled to some other data processors of the array of data processors, and relaying, using one or more relay channels of one or more data processors, data between relay ports of the data processor
- a non-transitory computer readable medium that stores instructions for configuring an image processor that comprises multiple configurable circuits and multiple microcontrollers; wherein the multiple configurable circuits comprise memory circuits and multiple data processors, wherein the instructions once executed by the image processor result in the execution of the steps of: storing, in each configurable circuit, up to a limited amount of configuration instructions; controlling, by the multiple microcontrollers, the multiple configurable circuits by repetitively providing to the multiple configurable circuits selection information for selecting by each configurable circuit a selected configuration instruction out of the limited amount of configuration instructions.
- a non-transitory computer readable medium that stores instructions for operating an image processor that comprises an array of data processors; a memory module that comprises multiple memory banks; a buffering unit; a gather unit; and multiple microcontrollers; wherein an execution of the instructions by the image processor results in the execution of the steps of: sending, by the buffering unit, the data to the array of data processors; receiving by the gather unit multiple requests for retrieving multiple requested data units from the memory module; sending by the gather unit to the array of data processors the multiple requested data units.
- a non-transitory computer readable medium that stores instructions for configuring an image processor that comprises an array of data processors, first microcontrollers, a buffering unit and a second microcontroller; wherein the multiple configurable circuits comprise memory circuits and multiple data processors, wherein the instructions once executed by the image processor result in the execution of the steps of: providing, to data processors of the array, during a data processor configuration process, data processor configuration instructions; providing to the buffering unit, during a buffering unit configuration process, buffering unit configuration instructions; controlling by the first microcontrollers an operation of the data processors by providing data processor selection information to data processors; selecting by the data processors, in response to the data processor selection information, selected data processor configuration instructions, and performing one or more data processing operation according to the selected data processor configuration instructions; controlling by the second microcontroller an operation of the buffering unit by providing buffering unit selection information to the buffering unit; selecting, by the buffering unit, in response to at least a portion of the buffering unit selection information, a selected buffering
- FIG. 1 illustrates a system according to an embodiment of the invention
- FIG. 2 illustrates an image processor according to an embodiment of the invention
- FIG. 3 illustrates an image processor according to an embodiment of the invention
- FIG. 4 illustrates a portion of an image processor according to an embodiment of the invention
- FIG. 5 illustrates a clock tree according to an embodiment of the invention
- FIG. 6 illustrates a memory module according to an embodiment of the invention
- FIG. 7 illustrates a mapping between LSUs of the memory module and memory banks of the memory module according to an embodiment of the invention
- FIG. 8 illustrates a store buffer according to an embodiment of the invention
- FIGs. 9 and 10 illustrate instructions Row, Sel according to an embodiment of the invention
- FIG. 11 illustrates a buffering unit according to an embodiment of the invention
- FIG. 12 illustrates a gather unit according to an embodiment of the invention
- FIG. 13 is a timing diagram that illustrates a process that includes address conversion, cache hit/miss, contention and outputting of information.
- FIG. 14 illustrates a contention evaluation unit according to an embodiment of the invention
- FIGs. 15 and 16 illustrate a data processing unit according to an embodiment of the invention
- FIG. 17 illustrates a warp calculation method according to an embodiment of the invention
- FIGs. 18 and 19 illustrate an array of data processors that perform warp calculations according to an embodiment of the invention
- FIG. 20 illustrates warp parameters that are outputted from various data processor according to an embodiment of the invention
- FIG. 21 illustrates a group of data processors that perform warp calculations according to an embodiment of the invention
- FIG. 22 illustrates a warp calculation method according to an embodiment of the invention
- FIG. 23 illustrates a group of processing units according to an embodiment of the invention
- FIG. 24 illustrates a first subgroup of source pixels, a first subgroup of target pixels, a second subgroup of source pixels and a second subgroup of target pixels
- FIG. 25 illustrates a subgroup SG(B) of source pixels having a center pixel SB
- FIG. 26 illustrates a corresponding subgroup TG(B) of target pixels (not shown) having a center pixel TB
- FIG. 27 illustrates a method according to an embodiment of the invention
- FIG. 28 illustrates eight source pixels and thirty two target pixels that are processed by the DPA according to an embodiment of the invention
- FIG. 29 illustrates an array of source pixels according to an embodiment of the invention
- FIG. 30 illustrates an array of target pixels according to an embodiment of the invention
- FIG. 31 illustrates multiple groups of data processors DPUs according to an embodiment of the invention
- FIG. 32 illustrates eight groups of data processors DPUs - each group includes four DPUs according to an embodiment of the invention
- FIG. 33 illustrates a warp calculation method according to an embodiment of the invention
- FIG. 34 illustrates an image processor according to an embodiment of the invention
- FIG. 35 illustrates a portion of an image processor according to an embodiment of the invention
- FIG. 36 illustrates a buffering unit according to an embodiment of the invention
- FIG. 37 illustrates a data processing unit (DPU) according to an embodiment of the invention
- FIG. 38 illustrates a data processing unit (DPU) according to an embodiment of the invention
- FIG. 39 illustrates two DPUs and Benes network according to an embodiment of the invention.
- FIG. 40 illustrates an example of a Benes network according to an embodiment of the invention
- FIG. 41 illustrates a configuration unit according to an embodiment of the invention
- FIG. 42 illustrates a coupling between an intermediate layer of first Benes network portion and a set of multiplexers according to an embodiment of the invention
- FIG. 43 illustrates an example of calculations of addresses of switches according to an embodiment of the invention
- FIG. 44 illustrates an example of calculations of addresses of switches according to an embodiment of the invention
- FIG. 46 illustrates an example of calculations of masks of addresses of switches according to an embodiment of the invention
- FIG. 47 illustrates an example of calculations of masks of addresses of switches according to an embodiment of the invention
- FIG. 48 illustrates a method for configuration according to an embodiment of the invention
- FIG. 49 illustrates a Benes network according to an embodiment of the invention
- FIG. 50 illustrates method for determining a configuration of a Benes network according to an embodiment of the invention
- FIG. 51 illustrates method for configuring a Benes network according to an embodiment of the invention.
- FIG. 52 illustrates a non-uniform Benes network according to an embodiment of the invention.
- any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
- any method steps of originally filed claims 1-17, 18, 19, 20-21, 25-42 and 97-99 may be executed by a system.
- the system in this sense may be an image processor, a gather unit or any component of the image processor.
- Any reference in the specification to a system and any other component should be applied mutatis mutandis to a method that may be executed by the memory device and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the memory device.
- a method and/or method steps executed by the image processor described in any one of claims 44-52 there may be provided a method and/or method steps executed by the image processor described in any one of claims 76-93.
- Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
- a pixel may be a picture element obtained by a camera, or may be a processed picture element.
- the terms "row" and "line" are used in an interchangeable manner.
- a car is used as a non-limiting example of a vehicle.
- Figure 1 illustrates a system 90 according to an embodiment of the invention.
- System 90 may be a DAS, a part of an autonomous car control module, and the like.
- the system 90 may be installed in car 10. At least some of the components of the system 90 are within the vehicle.
- System 90 may include first camera 81, first processor 83, storage unit 85, man machine interface 86 and image processor 100. These components may be coupled to each other via bus or network 82 or by any other arrangement.
- the system 90 may include additional cameras and/or additional processors and/or additional image processors.
- First processor 83 may determine which task should be executed by the image processor 100 and instruct the image processor 100 to operate accordingly.
- image processor 100 may be a part of the first processor 83 and may be a part of any other system.
- the man machine interface 86 may include a display, a speaker, one or more light emitting diodes, a microphone or any other type of man machine interface.
- the man machine interface may communicate with a mobile device of the driver of the car, with the multimedia systems of the car, and the like.
- Figure 2 illustrates image processor 100 according to an embodiment of the invention.
- Master port 101 and slave port 103 provide an interface between image processor 100 and any other component of system 90.
- Image processor 100 includes:
- Direct memory access (DMA) 102 for accessing an external memory resource such as storage unit 85.
- a controller such as but not limited to scalar unit 104.
- BU control unit 490.
- Busses 132, 133, 134, 135, 136 and 137.
- Image processor 100 also includes multiple microcontrollers. For brevity of explanation these microcontrollers are illustrated in figure 4.
- DMA 102 is coupled to multiplexers 112, 114 and 120.
- Scalar unit 104 is coupled to buffer 118 and to multiplexer 112.
- Buffer 118 is coupled to multiplexer 116.
- Buffer 110 is coupled to multiplexers 112, 116 and 114.
- Multiplexer 112 is coupled to SU program memory 106.
- Multiplexer 114 is coupled to SU data memory 108.
- Memory unit 200 is coupled to gather unit 300 via (unidirectional) bus 132, is coupled via (unidirectional) bus 134 to buffering unit 400 and is coupled via (unidirectional) bus 133 to DPA 500.
- Gather unit 300 is coupled via (unidirectional) bus 135 to buffering unit 400 and via (unidirectional) bus 137 to DPA 500.
- Buffering unit 400 is coupled to DPA 500 via (unidirectional) bus 136.
- the units of the image processor may be coupled to each other by other buses, by additional or fewer buses, by interconnects and/or networks, by busses of other widths and directionality, and the like.
- Scalar unit 104 may control the execution of tasks by other components of the image processor 100.
- Scalar unit 104 may receive instructions (for example from first processor 83 of figure 1) which tasks to execute and may fetch the relevant instructions from SU program memory 106.
- the scalar unit 104 may determine which programs will be executed by microcontrollers within SB control unit 290, BU control unit 490 and DPA control unit 590.
- the programs executed by the microcontrollers of SB control unit 290 control store buffers (not shown) of memory module 200.
- the programs executed by the microcontrollers of BU control unit 490 control the buffering unit 400.
- the programs executed by the microcontrollers of DPA control unit 590 control data processing units of DPA 500.
- Any of said microcontrollers may control any module or unit by providing short selection information (for example 2-3 bits, less than a byte or any number of bits that is smaller than the number of bits of the selected configuration instruction) for selecting the configuration instructions already stored in the controlled module or unit. This reduces traffic and allows fast configuration changes, as a configuration change may only require selecting between different configuration registers already stored in the relevant units or modules.
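- The selection mechanism described above can be sketched in software as follows. This is a minimal illustration only, not the circuit of the embodiment: the class `ConfigurableUnit`, the helper `selection_bits` and the instruction values are hypothetical names and data introduced for this example.

```python
class ConfigurableUnit:
    """Hypothetical unit holding a few preloaded configuration instructions."""

    def __init__(self, stored_instructions):
        # Full-width instructions are written once, ahead of time
        # (for example over a configuration bus).
        self.stored = list(stored_instructions)
        self.active = None

    def select(self, sel):
        # Only the short index 'sel' is sent by the microcontroller.
        self.active = self.stored[sel]
        return self.active


def selection_bits(num_stored):
    """Number of bits needed to index num_stored preloaded instructions."""
    return max(1, (num_stored - 1).bit_length())


# Illustrative 32-bit instruction values (made up for the example).
unit = ConfigurableUnit([0x1234ABCD, 0x0F0F0F0F, 0xDEADBEEF, 0x00000001])
picked = unit.select(2)
bits = selection_bits(len(unit.stored))
```

With four stored 32-bit instructions, a configuration change costs a 2-bit selection instead of a 32-bit instruction transfer.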
- control units and their allocation between components of the image processor may differ from those illustrated in figure 2.
- Memory module 200 is the highest level memory resource of image processor 100.
- Buffering unit 400 and gather unit 300 are lower level memory resources of the image processor 100 and may be configured to fetch data from the memory module 200 and provide the data to the DPA 500.
- DPA 500 may send data directly to memory module 200.
- DPA 500 includes multiple data processors and is arranged to perform computational tasks such as but not limited to image processing algorithms.
- image processing algorithms include a warp algorithm, disparity, and the like.
- Gather unit 300 includes a cache memory. Gather unit 300 is configured to receive from DPA 500 requests to fetch multiple data units (such as pixels) and to fetch the requested pixels from the cache memory or from the memory unit.
- the gather unit 300 may operate in a pipelined manner and have a limited number (for example three) of pipeline stages of a very low latency, for example one (or less than five or ten) clock cycles. As indicated below, the gather unit may also fetch data units in additional modes while using an address generator of the memory module to fetch information.
- Buffering unit 400 is configured to act as a buffer of data between the memory module 200 and the DPA 500.
- the buffering unit 400 may be arranged to provide data in parallel to multiple data processors of the DPA 500.
- Configuration bus 130 is coupled to DMA 102, memory module 200, gather unit 300, buffering unit 400 and DPA 500.
- DPA 500 exhibits an architecture that may support parallel and pipeline implementation. It exhibits a flexible connectivity that enables connecting almost every data processing unit (DPU) to every DPU.
- Figure 3 illustrates image processor 100 according to an embodiment of the invention.
- Figure 3 provides non-limiting examples of the width of various buses and of the content of the memory module 200, gather unit 300, buffering unit 400 and DPA 500.
- DPA 500 may include six rows by sixteen columns of data processing units (DPUs) 510(0,0) - 510(5,15).
- Configuration bus 130 is 32 byte wide.
- Bus 132 is 8 x 64 byte wide.
- Bus 134 is 6 x 128 byte wide.
- Bus 135 is 2 x 128 byte wide.
- Bus 137 is 2 x 16 x 16 byte wide.
- Bus 133 is 6 x 2 x 16 x 16 byte wide.
- Bus 136 is 2 x 16 x 16 byte wide.
- Memory module 200 is illustrated as including address generators, 6 load store units, 16 multi-port memory interfaces, and 16 independently accessible memory banks of 8 byte lines.
- Gather unit 300 includes a cache memory that includes 18 registers of 8 bytes each.
- the buffering unit 400 includes six rows by 4 columns of 16 byte registers, and 6 rows by 16 columns of 2:1 multiplexers.
- Figure 4 illustrates a portion of image processor 100 according to an embodiment of the invention.
- SB control unit 290 may include SB program memory 293 and SB microcontrollers 291 and 292.
- the SB program memory 293 stores instructions to be executed by SB microcontrollers 291 and 292.
- SB microcontrollers 291 and 292 may be fed (through configuration bus 130 and/or by scalar unit 104) by information (stored in configuration registers 298) that indicates which instructions (out of the instructions stored in SB program memory 293) to execute.
- BU control unit 490 may include BU program memory 497, configuration registers 498 and BU microcontrollers 491-496.
- the BU program memory 497 stores instructions to be executed by BU microcontrollers 491-496.
- BU microcontrollers 491-496 may be fed (through configuration bus 130 and/or by scalar unit 104) by information (stored in configuration registers 498) that indicates which instructions (out of the instructions stored in BU program memory 497) to execute.
- DPA control unit 590 may include DPA program memory 597, configuration registers 598 and DPA microcontrollers 591-596.
- the DPA program memory 597 stores instructions to be executed by DPA microcontrollers 591-596.
- DPA microcontrollers 591-596 may be fed (through configuration bus 130 and/or by scalar unit 104) by information that indicates which instructions (out of the instructions stored in DPA program memory 597) to execute.
- the microcontrollers may be grouped in other manners. For example, there may be one, two, three or more than three microcontroller groups.
- Figure 5 illustrates a clock tree according to an embodiment of the invention.
- An input clock signal 2131 is fed to scalar unit 104.
- Scalar unit 104 sends clk_mem 2132 to memory banks 610-625 of memory module 200 and clk 2133 to buffering unit 400, gather unit 300 and load store units (LSUs) 630-635 of memory module 200.
- Clk 2133 is converted to dpa_clk 2134 which is sent to DPA 500.
- Figure 6 illustrates memory module 200 according to an embodiment of the invention.
- Figure 7 illustrates mapping between LSUs of the memory module and memory banks of the memory module according to an embodiment of the invention.
- Memory module 200 includes sixteen independently accessible memory banks M0-M15 610-625, six load store units LSU0-LSU5 630-635, six address generators AG0-AG5 640-645 and two store buffers 650 and 660.
- Memory banks M0-M15 610-625 are eight byte wide (have lines of 64 bits each) and include 1K lines to provide a total memory size of 96KB.
- Each memory bank may include (or may be coupled to) a multi-port memory interface for arbitrating between requests that are sent to the memory bank.
- the multi-port memory interface has to arbitrate between access requests that appear on these four inputs.
- The multi-port memory interface may apply any arbitration scheme. For example it may apply priority based arbitration.
- Each LSU can select one out of 6 addresses from the address generators, and is connected to 4 memory banks, and may access 16 bytes (from 2 memory banks) per access, such that, the 6 LSUs can access 12 of the 16 memory banks at a time.
- Figure 7 illustrates the mapping between different values of a control signal SysMemMap and the mapping between LSUs 630-635 and memory banks M0-M15 610-625.
- FIG. 6 illustrates that memory module 200 outputs data units to the gather unit via eight eight-byte wide busses (part of bus 132) and outputs data units to the buffering unit via six sixteen-byte wide busses (part of bus 134).
- Each address generator of AG0-AG5 640-645 may implement a four dimensional (4D) iterator by using the following variables and registers:
- Baddr defines the Base address in the memory bank.
- 'W' Direction - variable wDepth defines the distance in bytes of one step in the W direction.
- Variable wCount defines the W counter max value. When reaching this value, the zCounter is incremented, and the wCounter is cleared.
- 'Z' Direction - variable zArea defines the distance in bytes of one step in the Z direction.
- Variable zCount defines the Z counter max value. When reaching this value, the X counter is incremented, and the Z counter is cleared.
- 'X' Direction - variable xStep defines the step (can be 1, 2, 4, 8 or 16 bytes).
- Variable xCount defines the X counter max value before the next 'Y' step.
- 'Y' Direction - variable stride defines the distance in bytes between the start of consecutive 'lines'.
- Variable yCount defines the Y counter max value.
- variables are stored in registers that may be configured through the configuration bus 130.
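- The nesting of the four counters listed above can be sketched as follows. This is a sketch under stated assumptions rather than the address generator of the embodiment: the function name `iterate_4d` is hypothetical, each count is assumed to be the number of steps taken in its direction, and the address is assumed to be the base address plus the sum of the four per-direction offsets.

```python
from itertools import product


def iterate_4d(baddr, w_depth, w_count, z_area, z_count,
               x_step, x_count, stride, y_count):
    """Yield byte addresses of an assumed 4D iteration (W innermost, Y outermost)."""
    # product() varies its last range fastest, so W wraps first, then Z,
    # then X, and finally Y, matching the counter chaining in the text.
    for y, x, z, w in product(range(y_count), range(x_count),
                              range(z_count), range(w_count)):
        yield baddr + y * stride + x * x_step + z * z_area + w * w_depth


# Two 'lines' that are 64 bytes apart, four 8-byte steps per line.
addrs = list(iterate_4d(baddr=0x100, w_depth=1, w_count=1,
                        z_area=0, z_count=1,
                        x_step=8, x_count=4, stride=64, y_count=2))
```

Setting a count to 1 collapses the corresponding dimension, so the same iterator covers 1D, 2D and 3D access patterns.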
- the store may also write at sizes that differ from 2, 4, 8 and 16 Bytes.
- Each LSU can perform Load/Store operations from/to 2 memory banks (16 bytes) out of the 4 memory banks connected to it.
- the memory banks accessed may depend on a selected mapping (see, for example figure 7) and the address.
- the data to be stored is prepared in one of store buffers 650 and 660 described below.
- Each LSU may select the address that is generated from one of the 6 address generators AG0-AG5.
- the data read from a memory bank is stored in a buffer (not shown) of the load store unit and then transferred (via bus 134) to the buffering unit 400.
- This buffer helps avoiding stalls due to contentions on memory banks.
- the data to be stored in a memory bank is prepared in one of the store buffers 650 and 660.
- Each store buffer may request to write between one and four 16-bytes words to one of the LSUs.
- Each LSU can hence get up to 8 simultaneous requests, and grants them one after the other in a predefined order: 1) Store Buffer 0 - Word 0, 2) Store Buffer 0 - Word 1, ..., 5) Store Buffer 1 - Word 0, ..., 8) Store Buffer 1 - Word 3.
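- The predefined grant order may be illustrated by the following sketch. The function name `grant_order` and the representation of a request as a (store buffer, word) pair are assumptions made for this example; the fixed priority list itself follows the order stated above, with all words of Store Buffer 0 served before those of Store Buffer 1.

```python
def grant_order(requests):
    """Return pending (store_buffer, word) requests in fixed-priority order.

    'requests' is a set of (store_buffer, word) pairs pending at one LSU.
    """
    # Priority list: Store Buffer 0 Words 0..3, then Store Buffer 1 Words 0..3.
    priority = [(sb, word) for sb in range(2) for word in range(4)]
    return [req for req in priority if req in requests]


order = grant_order({(1, 0), (0, 3), (0, 1)})
```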
- the store buffer may ignore (not send to a memory bank) or process (send to a memory bank) data when the store buffer is configured to operate in a conditional store mode.
- the store buffer may, when configured to operate in a scatter mode, treat a part of a data unit received by it as an address associated with the storage of the remainder of the data unit.
- Store buffers 650 and 660 are controlled by store buffer microcontrollers 291 and 292.
- each one of store buffers 650 and 660 receives (and stores) three configuration instructions (sb_instr[1] - sb_instr[3]).
- the configuration instructions (also referred to as store buffer configuration instructions) of the different store buffer may differ from each other or may be the same.
- each store buffer microcontroller receives the addresses of the instructions to be executed by each store buffer microcontroller.
- First and last PC indicate the first and last instructions to be read from the store buffer program memory 293.
- the location of the program memory for each store buffer microcontroller is also defined:
- the store buffer microcontroller instruction may be an execute instruction or a do loop instruction. They have the following formats:
- sel Store Buffer Instruction Select. Triggers store when non-zero
- Figure 8 illustrates a store buffer 660 according to an embodiment of the invention.
- Store buffer 660 has four multiplexers 661-664, four buffers Word0 - Word3 671-674 and four demultiplexers 681-684.
- Buffer Word0 671 is coupled between multiplexer 661 and demultiplexer 681.
- Buffer Word1 672 is coupled between multiplexer 662 and demultiplexer 682.
- Buffer Word2 673 is coupled between multiplexer 663 and demultiplexer 683.
- Buffer Word3 674 is coupled between multiplexer 664 and demultiplexer 684.
- Each one of multiplexers 661-664 has four inputs for receiving different lines of bus 133 and is controlled by control signal Row, Sel.
- Each one of demultiplexers 681-684 has six outputs for providing data to either one of LSU0-LSU5 and is controlled by control signal En, LSU.
- the store buffer configuration instruction controls the operation of the store buffer and even generates commands Row, Sel and En, LSU.
- Figure 11 illustrates a buffering unit 400 according to an embodiment of the invention.
- Buffering unit 400 includes read buffers (RB) that are collectively denoted 402, register file (RF) 404, buffering unit inner network 408, multiplexer control circuits 471-476, output multiplexers 406, BU configuration registers 401(1)-401(5), each storing two configuration instructions, history configuration buffer 405 and BU read buffer configuration register 403.
- the BU microcontroller may select, for each line, which configuration instruction to read (out of the two configuration instructions stored in each BU configuration register out of 401(1)-401(5)).
- There are six lines of multiplexers and they include multiplexers 491(0)-491(15) and 491’(0)-491’(15), multiplexers 492(0)-492(15) and 492’(0)-492’(15), multiplexers 493(0)-493(15) and 493’(0)-493’(15), multiplexers 494(0)-494(15) and 494’(0)-494’(15), and multiplexers 495(0)-495(15) and 495’(0)-495’(15).
- Buffering unit inner network 408 couples read buffers 402 to register file 404
- a first row of four read buffers 415, 416, 417 and 417 is coupled (via buffering unit inner network 408) to a first row of four registers R3, R2, R1 and R0 413, 412, 411 and 410.
- a second row of four read buffers 425, 426, 427 and 427 is coupled (via buffering unit inner network 408) to a second row of four registers R3, R2, R1 and R0 423, 422, 421 and 420.
- a third row of four read buffers 435, 436, 437 and 437 is coupled (via buffering unit inner network 408) to a third row of four registers R3, R2, R1 and R0 433, 432, 431 and 430.
- a fourth row of four read buffers 445, 446, 447 and 447 is coupled (via buffering unit inner network 408) to a fourth row of four registers R3, R2, R1 and R0 443, 442, 441 and 440.
- the different lines of multiplexers are also controlled by DPU microcontrollers. Especially each DPU microcontroller of 591-596 controls a corresponding line of DPUs and sends control instructions (MuxCtl) to corresponding multiplexer lines (via multiplexers control circuits 471-476). Each multiplexer control circuit stores the last (for example sixteen) MuxCtl instructions (instruction history) and history configuration buffer 405 stores selection information for determining which MuxCtl instruction to send to the line of multiplexers.
- Multiplexer control circuit 471 controls the first line of multiplexers and includes FIFO 481(1) for storing MuxCtl instructions sent from DPU microcontroller 591 and includes control multiplexer 481(2) to select which stored MuxCtl instruction to fetch from FIFO 481(1) and send to the first line of multiplexers that includes multiplexers 491(0)-491(15) and 491’(0)-491’(15).
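- The instruction history kept by a multiplexer control circuit can be sketched as a bounded FIFO with a selectable read position. This is an illustrative model only: the class name `MuxControlCircuit`, the depth of sixteen taken from the example above, and the convention that selection value 0 returns the most recent MuxCtl instruction are assumptions.

```python
from collections import deque


class MuxControlCircuit:
    """Hypothetical model of a MuxCtl history FIFO plus a control multiplexer."""

    def __init__(self, depth=16):
        # A bounded deque discards the oldest entry once 'depth' is reached.
        self.history = deque(maxlen=depth)

    def push(self, mux_ctl):
        # Newest instruction is kept at index 0 (assumed convention).
        self.history.appendleft(mux_ctl)

    def select(self, history_sel):
        # The selection value plays the role of the control multiplexer input.
        return self.history[history_sel]


circuit = MuxControlCircuit()
for instruction in range(20):      # instructions 0..19, newest is 19
    circuit.push(instruction)
newest = circuit.select(0)
oldest_kept = circuit.select(15)
```

A four-bit selection value is enough to address any of the sixteen retained instructions.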
- Multiplexer control circuit 476 controls the sixth line of
- the register file may be controlled by the BU microcontrollers.
- the operations executed by the register file may include (a) shifts from most to least significant bytes, where the leap of the shift is a power of 2 bytes (or any other value), and (b) load of one or two registers from the read buffers.
- Content from the register file can be manipulated. For example, the content may be interleaved and/or interlaced.
- the buffering configuration map includes the addresses of the configuration buffers for storing configuration instructions for the buffering unit and storing indications of which commands are to be fetched by the buffering unit microcontrollers (BUuC0 - BUuC5).
- the latter are termed First and Last PC for BU pairs of instructions (each thirty two bit MuxConfig instruction includes two separate buffering unit configuration instructions):
- BU microcontrollers include bits [0:8] for loop control or instruction repetition and include bits [9:13] that include values that control the execution of instructions. This is true for both register file commands and do loop commands.
- RFL stands for Register File Line and RBL for Read Buffer Line.
- RRR[r/r1:r0] stands for Register (RFL or RBL) number "r" (16 bytes) / from register r1 to r0.
- RRR[r][b/b1:b0] stands for Register (RFL or RBL) number "r" byte b / from byte b1 to b0.
- BU read buffer configuration register (also referred to as RBSrcCnf) 403 specifies for each RB line from which LSU to load.
- the configuration instruction stored in BU read buffer configuration register 403 has the following format:
- 'Read buffer' refers to a line of read buffers.
- the four bits per read buffer line may have the following meaning: 0-7: No Load, 8: LSU0, 9: LSU1, 10: LSU2, 11: LSU3, 12: LSU4, 13: LSU5, 14: GU0, 15: GU1 (GU1 swap: the 8 bytes load is of d8..d15,d0..d7 instead of d0..d15, or the GU last 8 short outputs in short mode).
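- The four-bit source codes listed above may be decoded as in the following sketch. The function names `decode_rb_source` and `decode_register` are hypothetical, and the packing of the six per-line fields (line i occupying bits 4i to 4i+3 of the register value) is an assumption made for illustration, since the register format is not reproduced here.

```python
def decode_rb_source(code):
    """Map a 4-bit read buffer source code to a source name."""
    if not 0 <= code <= 15:
        raise ValueError("source code must fit in four bits")
    if code <= 7:
        return "No Load"
    if code <= 13:
        return f"LSU{code - 8}"    # 8..13 select LSU0..LSU5
    return f"GU{code - 14}"        # 14, 15 select GU0, GU1


def decode_register(value, lines=6):
    """Decode one 4-bit field per read buffer line (assumed packing)."""
    return [decode_rb_source((value >> (4 * i)) & 0xF) for i in range(lines)]


sources = decode_register(0xFE0D98)
```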
- the MuxCtl operates the Muxes (selects the register of the register file and the port of the DPU) in the following way:
- the selection of history may be done by reading the content of history configuration buffer 405 that stores four bits of selection information for each line of multiplexers.
- Figure 12 illustrates a gather unit 300 according to an embodiment of the invention.
- Gather unit 300 includes input buffer 301, address converter 302, cache memory 303, address to tags comparator 304, contention evaluation unit 306, controller 307, memory interface 308, iterator 310 and configuration register 311.
- Gather unit 300 is configured to gather up to 16 byte or short pixels from eight memory banks, MB0..MB7 or MB8..MB15 depending on bit gu_Ctrl[15] of configuration register 311, through a fully associative cache memory (CAM) 303 that includes sixteen 8-byte registers.
- CAM full associative cache memory
- the pixel addresses or coordinates are generated or received from the array according to the mode gu_Ctrl[3:0] of configuration register 311.
- the gather unit 300 may access memory banks of memory module 200 using addresses (duplets) that indicate the memory bank number and line within the memory bank.
- the gather unit may receive or generate, instead of addresses, X and Y coordinates that represent the location of the requested pixels in an image.
- the gather unit includes an address converter 302 for converting X,Y coordinates to addresses.
- Iterator 310 may operate in one of two modes - (a) only internal iterator and (b) using an address generator of the memory module.
- the iterator 310 may generate sixteen addresses using the following control parameters (that are stored in the configuration register 311):
- xCount: the maximum step counter before a stride. The stride is performed from the previous stride or from the base coordinates.
- the iterator feeds AddBase to the address generator and the address generator uses this address to perform iterations.
- the iterator mode may be useful, for example, during disparity calculation where the gather unit may retrieve data units from source and target images - especially pixels that are proximate to each other.
- Another mode of operation includes receiving addresses of requested data units (the addresses may be X,Y coordinates to be converted by address converter 302 or memory addresses such as duplets), checking if the requested data units are stored in the cache memory and, if not, fetching the data units from the memory module 200.
- a further mode of operation includes receiving addresses of requested data units, translating the addresses to additional requested addresses and then fetching the content of the additional requested addresses from the cache memory 303 or from the memory unit. This mode of operation may be useful, for example, during warp calculation, wherein the gather unit may receive an address of a pixel and obtain the pixel and a few other neighboring data units.
- Cache memory 303 stores tags that are duplets and these tags are used to determine whether the requested data unit is within the cache memory 303.
- the duplets are also used to detect contention - when at the same cycle multiple requested data units reside in different lines of the same memory bank.
- Figure 13 is a timing diagram that illustrates a process that includes address conversion, cache hit/miss, contention and outputting of information.
- the coordinates (X,Y) are accepted or generated in cycle 0 and converted into duplets in cycle 1. Bank accesses are computed in this same cycle for performing the access in cycle 2. If several coordinates address the same bank at different addresses (as is the case in this example), there is a contention: the corresponding pixel fetch is delayed to the next cycle, and a stall is asserted in cycle 1. The coordinates causing contention are re-treated in cycle 1 and memory banks accessed in cycle 2. The latency between the coordinates and the pixels is 5 cycles plus the number of stall cycles. In an extreme case, accessing 16 pixels can cause 15 stall cycles. For a warp operation, since the pixels accessed are close to each other, 16 pixels are fetched in 1.0-1.4 cycles on average, depending on the type of the warp.
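- The relation between contention and stall cycles described above can be sketched in Python. This is a simplified model, not the circuit itself: it assumes that each memory bank serves one line per cycle, that requests to the same line of the same bank are served together, and that all requests miss the cache. The function name `stall_cycles` is hypothetical.

```python
from collections import defaultdict


def stall_cycles(duplets):
    """Estimate stall cycles for one batch of requests.

    duplets: iterable of (bank, line) pairs, one per requested pixel.
    Assumption: a bank can deliver a single line per cycle, so a bank
    that must deliver n distinct lines adds n - 1 stall cycles.
    """
    lines_per_bank = defaultdict(set)
    for bank, line in duplets:
        lines_per_bank[bank].add(line)
    if not lines_per_bank:
        return 0
    return max(len(lines) for lines in lines_per_bank.values()) - 1
```

- Under this model, sixteen requests to sixteen different lines of one bank give 15 stall cycles, which matches the extreme case mentioned above, while sixteen requests spread over eight banks with one line per bank give none.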
- input buffer 301 is coupled to address converter 302.
- Addresses to tags comparator 304 receives inputs from cache memory 303 and from address converter 302 (if address conversion is required) or from input buffer 301.
- the addresses to tags comparator 304 sends output signals indicative of the comparisons (such as cache miss, cache hit- and if so where the hit occurred) to controller 307 and to contention evaluation unit 306 and memory interface 308.
- Iterator 310 is coupled to input buffer 301 and memory interface 308.
- An input interface (such as input buffer 301) is arranged to receive multiple requests for retrieving multiple requested data units.
- Cache memory 303 includes entries (such as sixteen entries or lines) that store multiple tags (each tag may be a duplet) and multiple cached data units.
- Each tag is associated with a cached data unit and is indicative of a group of memory cells (such as a line) of a memory module (such as a memory bank) that differs from the cache memory and stores the cached data unit.
- Addresses to tags comparator 304 includes an array of comparators that is arranged to concurrently compare between the multiple tags and multiple requested memory group addresses to provide comparison results.
- Address to tags comparator includes K x J nodes - to cover each pair of tag and requested memory bank address.
- the (k,j)’th node 304(k,j) compares the k’th requested address to the j’th tag.
- Controller 307 may be arranged to (a) classify, based on the comparison results, the multiple requested data units into cached data units that are stored in the cache memory 303 and uncached data units (not stored in the cache memory 303); and (b) send to the contention evaluation unit 306, when there is at least one uncached data unit, information about it.
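- The K x J comparison and the cached/uncached classification can be sketched as follows. This is an illustrative software model under the assumption that tags and requested addresses are (bank, line) duplets compared for equality; the hardware performs all comparisons concurrently, whereas the sketch simply loops. The names `compare_tags` and `classify` are hypothetical.

```python
def compare_tags(requested, tags):
    """Return a K x J matrix; entry [k][j] is True when request k equals tag j."""
    return [[req == tag for tag in tags] for req in requested]


def classify(requested, tags):
    """Split request indexes into cached (some tag matches) and uncached."""
    hits = compare_tags(requested, tags)
    cached = [k for k, row in enumerate(hits) if any(row)]
    uncached = [k for k, row in enumerate(hits) if not any(row)]
    return cached, uncached
```

- Only the uncached indexes would then be passed on for contention evaluation.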
- the contention evaluation unit 306 is arranged to check an occurrence of at least one contention.
- the memory interface 308 is arranged to request any uncached data unit from the memory module in a contention free manner.
- the contention evaluation unit may include multiple groups of nodes. An example is provided in figure 14.
- the number of groups of nodes is the maximal number of memory banks that may be accessed concurrently by the gather unit (for example- eight).
- Each group of nodes is arranged to evaluate a contention related to a single memory bank. For example, in figure 14 there are eight groups of nodes - for comparing between sixteen requested addresses and lines of eight memory banks.
- the nodes of the first group of nodes (306(0,0)-306(0,15)) are serially connected to each other.
- the leftmost node 306(0,0) of the first group of nodes receives as input signals (used, bank0, line) and also signals (valid0, Address0).
- Address0 is the first address of the sixteen requested addresses (of data units) and valid0 indicates whether the first requested address is valid or not - whether the first requested address refers to a cached data unit (invalid) or to an uncached data unit (valid).
- Input signals (used, bank0, line) indicate whether bank0 is used and the line that was requested by the first node of the group of nodes. The input signals (used, bank0, line) that are fed to leftmost node 306(0,0) indicate that bank0 is not used.
- If the leftmost node 306(0,0) determines that a previously unused bank (not currently associated with any uncached data unit) should be used (for retrieval of a valid data unit address associated with the node), then the node changes signals (used, bank0, line) to indicate that the bank is used - and also updates "line" to the line which is requested by the node.
- If any node (out of nodes 306(0,1)-306(0,15)) of the first group of nodes receives a valid address that refers to Bank0, then that node compares the line of the requested address with the line indicated by (used, Bank0, line). If the line values do not match, then the node outputs a contention signal.
- each group may include K nodes.
- each node of the group is arranged to (a) receive an access request indication (such as signals (used, bank, line)) that is indicative of whether any previous node of the group is requesting to access the memory bank that is identified by duplet (Valid, Address) and (b) update the access request indication to indicate whether the group is requesting access to its corresponding memory bank or not.
- an access request indication such as signals (used, bank, line)
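- The serial chain of nodes that evaluates contention for a single bank can be modeled with a short Python sketch. It assumes each request is a pair (valid, (bank, line)) and that the (used, line) state is passed from node to node in order; `evaluate_bank` is a hypothetical name, and the sketch is a behavioral approximation rather than the actual logic of figure 14.

```python
def evaluate_bank(bank, requests):
    """Walk one group of serially connected nodes for a given bank.

    requests: list of (valid, (req_bank, req_line)), one per node, in order.
    Returns (used, line, contention), where contention[k] is True when
    node k asks for a different line of a bank that is already in use.
    """
    used, line = False, None      # state fed to the leftmost node: bank unused
    contention = []
    for valid, (req_bank, req_line) in requests:
        conflict = False
        if valid and req_bank == bank:
            if not used:
                used, line = True, req_line   # first requester claims the bank
            elif req_line != line:
                conflict = True               # same bank, different line
        contention.append(conflict)
    return used, line, contention
```

- Running one such evaluation per bank (eight in the example of figure 14) gives the full set of contention signals for sixteen requests.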
- DPA 500 includes ninety six DPUs that are arranged in six lines (rows), where sixteen DPUs are included in each line.
- Each line of DPUs may be controlled by a separate DPA microcontroller.
- FIGs. 15 and 16 illustrate a DPU 510 according to an embodiment of the invention.
- DPU 510 includes:
- Register file 550 that includes sixteen registers 550(0)-550(15).
- Each input multiplexer is coupled to an input port of DPU 510 and may be coupled to other DPUs, to gather unit 300, to memory module 200, to buffering unit 400 or to an output port of the DPU.
- MuxH 529 also includes inputs that are coupled to bus 581.
- Bus 581 is also coupled to the outputs of MuxIn0 570, MuxIn1 571, MuxIn2 572 and MuxIn3 573.
- Registers RegA 531, RegB 532, RegCl 533 and RegCh 534 are connected between input multiplexers MuxA 561, MuxB 562, MuxCl 563, MuxCh 564 (respectively - one register for each multiplexer) and ALU 540 and feed the ALU with data.
- Using two groups of multiplexers, where one group of multiplexers may receive outputs of the other group, increases the number of sources that can provide data to ALU 540.
- the output of ALU 540 is coupled to an input of the register file 550.
- the first register Reg0 550(0) of the register file is also connected as an input to ALU 540.
- register file 550 is coupled to output multiplexers MuxD 534 and MuxE 535.
- the outputs of output multiplexers MuxD 534, MuxE 535 are coupled to output ports D (521) and E (522) respectively and (via MuxG’ 528) to output port G 523 and flip-flop 566.
- Register RegH 539 is connected between MuxH 529 and MuxG 527.
- MuxF 526 is directly connected to port F 522 and to flip-flop 565 thereby providing a low latency relay channel.
- MuxG 527 is coupled to MuxG’ 528.
- MuxIn0, MuxIn1, MuxIn2 and MuxIn3 may implement a short routing to other DPUs: (a) MuxIn0 and MuxIn1 get input from D outputs of 8 DPUs, (b) MuxIn2 and MuxIn3 get input from E outputs of the same 8 DPUs.
- Other input multiplexers MuxA, MuxB, MuxCl, MuxCh and MuxG may implement the following routing:
- MuxA may get its input from the buffering unit, and from MuxIn0..MuxIn3.
- Each one of MuxB, MuxCl and MuxCh may get its input from the buffering unit, MuxIn0..MuxIn3 and from an internal register of the register file (for example R14 or R15 of the register file).
- the DPU 510 and other DPUs in the same row are controlled by a shared row DPA microcontroller, which generates a stream of selection information for selecting between configuration instructions stored in the DPU (see configuration register 511).
- the configuration register (also referred to as dpu_Ctrl) 511 may store the following content:
- each DPU of DPA 500 is directly coupled to some DPUs of the PMA and is indirectly coupled (coupled via one or more intermediate DPUs) to some other data processors of the array of data processors.
- Each DPU has a relay channel (between ports F and G) for relaying data between relay ports (port F and port G) of the DPU. This simplifies the connections and reduces connectivity while providing enough connectivity and flexibility to perform image processing tasks in a highly efficient manner.
- each DPU includes a core.
- the core includes ALU 540 and memory resources such as register file 550.
- the cores of the multiple DPUs are coupled to each other by a configurable network.
- the configurable network includes data flow components such as multiplexers MuxA-MuxCh, MuxD-MuxE, MuxIn0-MuxIn3, MuxF-MuxH and MuxG’. These data flow components may be included (as illustrated in figure 16) within the DPU but may, at least in part, be positioned outside the DPUs.
- the DPUs may include non-relay input ports that are directly coupled to a first set of neighbors.
- the non-relay input ports may include input ports A, B, Cl, Ch, In0, In1, In2 and In3.
- Their connectivity to the first set of neighbors is listed in the example below.
- the first set of neighbors may include, for example eight neighbors.
- the first set of neighbors is formed by DPUs that are located within a distance (cyclic distance) less than four DPUs from the DPU. The distances as well as directions are cyclic.
- MuxIn0 is coupled to D ports of DPUs that are: (a) in the same row but one column to the left (D(0,-1)), (b) in the same row but one column to the right (D(0,+1)), (c) in the same column but one row above (D(-1,0)),
- a first non-relay input port of the data processor may be directly coupled to relay ports of data processors of the first set of neighbors. See, for example port A, which is directly coupled to the F port of the same DPU and to the F port of the DPU of the same row but one column to the left (F/Fd( 0,-1)).
- a first relay port (such as ports G and F) may be directly coupled to a second set of neighbors.
- input mux F (coupled to port F) may be coupled to the G output and delayed G output (Gd) of G ports of the DPUs that are (a) one row below and at the same column (G/Gd(+1,0)), (b) two rows below and at the same column (G/Gd(+2,0)), (c) three rows below and at the same column (G/Gd(+3,0)), (d) same row but one column to the left (G/Gd(0,+1)), (e) same row but two columns to the left (G/Gd(0,+2)), (f) same row but four columns to the left (G/Gd(0,+4)), and (g) same row but eight columns to the left (G/Gd(0,+8)).
- Configuration registers 511(0)-511(3) may store up to four configuration instructions.
- a configuration instruction may be 64 bits long and may be read by one or two read operations.
- the configuration instruction controls the multiplexer selects (A, B, C, D, E, F and G) and the register file 550 shifts:
- the fields may be applied at different timings:
- ALU control is delayed by one clock cycle.
- Post shift: shift left by 0, 1, 2 or 3 bits
- vectorial: 0: regular - 1: vectorial
- mode_b: 0: Unsigned - 1: Signed
- wrReg1: write Reg1 from ALU Out High.
- Each one of MuxIn0-MuxIn3 is connected to multiple buses, as listed below:
- Input multiplexers MuxA, MuxB, MuxC, MuxF, output multiplexers MuxD, MuxE, output F, delayed output Fd, output G, and delayed output Gd provide the connectivity listed in this paragraph.
- the notation X(N,M) means output X of DPU(row+N % 6, col+M % 16).
- the DPUs of each row are connected to each other in a cyclic manner and the DPUs of each column are connected to each other in a cyclic manner.
- multiplexers have multiple (such as sixteen) inputs and the following list provides the connections to each of these inputs.
- A[0] - A[15] are the sixteen inputs of MuxA.
- %6 means a modulo 6 operation and %16 means modulo 16 operation.
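- The cyclic X(N,M) notation can be illustrated with a minimal Python sketch, assuming the six-row by sixteen-column array described above. The helper name `neighbor` is hypothetical.

```python
ROWS, COLS = 6, 16  # DPA dimensions stated in the text


def neighbor(row, col, n, m):
    """Return the (row, col) of the DPU addressed by X(N,M) from DPU(row, col).

    Rows and columns wrap around cyclically. Python's % always returns a
    non-negative result for a positive modulus, so negative offsets wrap
    correctly; in languages such as C the remainder may be negative and
    needs an extra correction.
    """
    return ((row + n) % ROWS, (col + m) % COLS)
```

- For instance, D(-1,-1) seen from DPU(0,0) refers to the D output of DPU(5,15), because both the row and the column wrap around.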
- R14 and R15 are the last two registers of the register file.
- Inputs (A, B, Cl, Ch, G) register configuration specifications (configuration bits that are stored in the configuration register of the DPU and are used to control the various components of the DPU).
- the following example lists the values of various bits included in the configuration instruction of the DPU.
- the notation X(N,M) means output X of DPU(row+N % 6, col+M % 16).
- Ch[19] = D(-1,-1)
- Ch[28] = E( 1,+1)
- Ch[29] = E(-2, 0)
- Inputs (A, B, Cl, Ch, G) have muxes to (MuxIn0,..,MuxIn3) and two additional inputs. In order to save configuration bits, a dynamic resource allocation of (MuxIn0,..,MuxIn3) to the D(N,M) and E(N,M) configuration bits of (A, B, Cl, Ch, G) can be used.
- the allocation can be done as follows: if 1 or 2 of the Input (A, B, Cl, Ch, G) configurations are of D(N,M) form, the first input allocates MuxIn0 and the second (if it exists) allocates MuxIn1.
- the inputs are denoted by inputD1 and inputD2 and the configurations are denoted by D1(N,M) and D2(N,M) respectively.
- the controls of the MuxIn0 and MuxIn1 muxes can be according to D1(N,M) and D2(N,M) respectively.
- the same dynamic allocation can be applied if 1 or 2 of the Input (A, B, Cl, Ch, G) configurations are of E(N,M) form, with MuxIn2 and MuxIn3.
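- The dynamic allocation just described can be sketched in Python. The data representation is an assumption made for illustration: each input configuration is a pair (kind, offset), where kind is "D" or "E" for the D(N,M) and E(N,M) forms and anything else for other sources. The function name `allocate_muxin` and the choice to raise an error on a third request of the same form are likewise hypothetical.

```python
def allocate_muxin(config):
    """Assign MuxIn0/MuxIn1 to D-form inputs and MuxIn2/MuxIn3 to E-form inputs.

    config: dict mapping an input name ("A", "B", "Cl", "Ch", "G") to
            (kind, (n, m)); kind "D" or "E" requests a shared MuxIn.
    Returns a dict mapping each such input to (mux name, (n, m)).
    """
    order = ["A", "B", "Cl", "Ch", "G"]          # scan order of the inputs
    pools = {"D": ["MuxIn0", "MuxIn1"], "E": ["MuxIn2", "MuxIn3"]}
    allocation = {}
    for name in order:
        if name not in config:
            continue
        kind, offset = config[name]
        if kind in pools:
            if not pools[kind]:
                raise ValueError(f"more than two {kind}(N,M) inputs requested")
            allocation[name] = (pools[kind].pop(0), offset)
    return allocation
```

- The control of each allocated MuxIn then follows the (N,M) offset of the input that claimed it, which is why the offset is returned together with the mux name.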
- the C exponent bias argument in conversions can be treated as a 7-bit signed integer (the other 25 bits of C can be ignored and the sign bit extended).
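- A minimal Python sketch of this interpretation follows, assuming the seven bits are the least significant bits of C and use two's complement; the source does not state which bits are used, and `exponent_bias` is a hypothetical name.

```python
def exponent_bias(c: int) -> int:
    """Interpret the low 7 bits of a 32-bit word as a signed integer."""
    low = c & 0x7F                 # keep 7 bits, ignore the other 25
    return low - 0x80 if low & 0x40 else low   # extend the sign bit (bit 6)
```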
- the DPUs may be configured one at a time (each DPU has a unique unicast address) or may be configured in a broadcast mode- there are addresses that may reflect the row and/or column of the DPUs that share the row and/or column and this allows to broadcast the configuration information.
- Data Processing Array Broadcast Mapping (programming configuration registers of multiple DPUs concurrently)
- any image processing algorithm may be executed by the image processor in an iterative manner.
- Results regarding some pixels are processed by the DPA 500. Some of the results may be stored in the DPA for a certain period of time and then sent to the memory module. The certain period of time is usually set based on the size of the memory resources of the PMA and the amount of source or target pixels that are processed by the DPA during a certain task. Once these results are needed again they may be fetched from the memory module. For example, when the DPA 500 performs calculations regarding certain source pixels of a source image, these results may be stored for a certain period (for example when performing calculations relating to adjacent source pixels) and then sent to the memory. When the results are further required they may be fetched from the memory module.
- Warp calculation may be applied for various reasons. For example, to compensate for image acquisition imparities.
- the warp calculation is applied for each target pixel (a pixel of a target image) out of a group of target pixels.
- the group of target pixels may include the entire target image or a part of the target image.
- the target image is virtually segmented to multiple windows and each window is a group of target pixels.
- the warp calculation may receive or may calculate a corresponding group of source pixels.
- Source pixels of the corresponding group of source pixels are processed during the warp calculation.
- the selection of the corresponding group of source pixels is usually fed to the PMA and may depend, for example, on the desired warp function.
- the warped value of a target pixel is calculated by applying weights (Wx, Wy) on neighboring source pixels associated with the target pixel.
- the weights and coordinates (x,y) of at least one of the neighboring source pixels are defined in warp parameters (X’, Y’).
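- As an illustration of applying the weights (Wx, Wy) to neighboring source pixels, the following Python sketch uses bilinear interpolation over a 2 x 2 neighborhood. Bilinear interpolation is an assumption made here for illustration; the text above does not fix the interpolation formula, and the function and parameter names are hypothetical.

```python
def warp_pixel(p00, p01, p10, p11, wx, wy):
    """Blend four neighboring source pixels with fractional weights.

    p00, p01: upper-left and upper-right neighbors
    p10, p11: lower-left and lower-right neighbors
    wx, wy:   horizontal and vertical weights, assumed to lie in [0, 1]
    """
    top = p00 + (p01 - p00) * wx          # interpolate along the upper row
    bottom = p10 + (p11 - p10) * wx       # interpolate along the lower row
    return top + (bottom - top) * wy      # interpolate between the two rows
```

- Each step has the form base + difference * weight, so it maps naturally onto a multiply-accumulate operation per DPU.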
- Figure 17 illustrates method 1700 according to an embodiment of the invention.
- Method 1700 may start by step 1710 of selecting a target pixel out of a group of target pixels.
- the selected target pixel will be referred to as "the target pixel".
- Step 1710 may be followed by step 1720 of executing, for each target pixel out of a group of target pixels, a warp calculation process that includes:
- the warp parameters may include first and second weights (Wx, Wy) and coordinates (x,y) of a given source pixel that should be processed during the warp calculation.
- the first and second weights are received by first group of processing units (DPUs) of the array of processing units (DPA).
- the gather unit may, in various operational modes, receive 4 coordinates and convert them to sixteen source pixels - four groups of neighboring source pixels.
- 3) Receiving (1723), by the second group of processing units, neighboring source pixels associated with the target pixel.
- Steps 1721, 1722, 1723 and 1724 may be executed in a pipelined manner.
- the first group of processing units is denoted 505 and may include the four leftmost DPUs of the first upper rows of DPA 500.
- the second group of processing units is denoted 501 and may include the two rightmost columns of DPA 500.
- Step 1720 is followed by step 1730 of checking if the warp was calculated for all target pixels of the group. If not, another target pixel is selected (step 1710); if so, the warp calculation ends.
- Step 1726 may include relaying values of some of the neighboring source pixels between processing units of the second group.
- FIGs. 18 and 19 illustrate that the output signal (X for group 504) of DPU(0,4) is sent to DPU(0,15) and is then relayed to DPU(1,15). It should be noted that in FIGs. 18 and 19 the PMA calculates warp functions for four pixels in parallel:
- DPU(0,3), DPU(1,3) and the DPUs of group 501 are involved in calculating the warp of a first pixel.
- DPU(0,2), DPU(1,2) and the DPUs of group 502 are involved in calculating the warp of a second pixel.
- DPU(0,1), DPU(1,1) and the DPUs of group 503 are involved in calculating the warp of a third pixel.
- DPU(0,0), DPU(1,0) and the DPUs of group 504 are involved in calculating the warp of a fourth pixel.
- Step 1726 may include relaying intermediate results calculated by the second group and values of some of the neighboring source pixels between processing units of the second group.
- FIGs. 21 and 22 illustrate a warp calculation executed by DPUs 510(0,15)-510(3,15) and DPUs 510(0,14)-510(5,14) of group 501 according to an embodiment of the invention.
- the warp calculation of figure 21 includes the following steps (some of which are executed in parallel to each other). Steps 1751 - 1762 are also illustrated in figure 22.
- Warp_result = Var2*Wx’+Var1.
- DPU 510(5,14) may receive pixels P0, P1, P2 and P3 from the gather unit.
- groups 501, 502, 503 and 504 receive sixteen pixels (in parallel) from the gather unit.
- the DPA 500 also receives (for example - from the gather unit) the warp parameters X’, Y’ related to each pixel.
- the warp parameters for each pixel may be calculated by DPUs of the DPA- for example when the warp parameters may be represented by a mathematical formula such as a polynomial.
- Figure 23 illustrates a group of DPUs 507 that calculate X’ and Y’ and these calculated X’ and Y’ may be fed to groups 505 and 506.
- FIGs.18-22 illustrate only non-limiting grouping schemes.
- the warp calculations can be executed by groups of DPUs of other shapes and size.
- Disparity calculation aims to find for a source pixel the best matching target pixel.
- the search may be executed for all source pixels in a source image and for all target pixels of a target image - but this is not necessarily so and the disparity may be applied only on some source pixels of the source image and/or on some target pixels of the target image.
- the disparity calculation does not compare just a single source pixel to a single target pixel but compares a subgroup of source pixels to a subgroup of target pixels.
- the comparison may include calculating a function such as a sum of absolute differences (SAD) between source pixels and corresponding target pixels.
- SAD sum of absolute differences
- the source pixel may be positioned at the center of the source pixels subgroup and the target pixel may be positioned at the center of the subgroup of target pixels. Other positions of the source and/or target pixels may be used.
- the subgroup of source pixels and the subgroup of target pixels may be rectangular shaped (or may have any other shapes) and may include N rows and N columns, whereas N may be an odd positive integer that may exceed three.
- Figure 24 illustrates a first subgroup 1001 of 5 x 5 source pixels S(1,1)-S(5,5), a first subgroup 1002 of 5 x 5 target pixels T(1,1)-T(5,5), a second subgroup 1003 of 5 x 5 source pixels S(1,2)-S(5,6) and a second subgroup 1004 of 5 x 5 target pixels T(1,2)-T(5,6).
- Source pixels S(3,3) and S(3,4) are in the center of first subgroup 1001 and second subgroup 1003 of source pixels.
- Target pixels T(3,3) and T(3,4) are in the center of first subgroup 1002 and second subgroup 1004 of target pixels.
- $\mathrm{SAD}(S(3,3),T(3,3)) = \sum_{i,j} |S(i,j)-T(i,j)|$ for indexes i and j between 1 and 5.
- $\mathrm{SAD}(S(3,4),T(3,4)) = \sum_{i,j} |S(i,j)-T(i,j)|$ for index i between 2 and 6 and for index j between 1 and 5.
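- The two sums above can be checked with a direct Python sketch of a sum of absolute differences over a square window. The function name `sad`, the zero-based indexing and the use of nested lists for images are assumptions made for illustration.

```python
def sad(source, target, top, left, size=5):
    """Sum of absolute differences over a size x size window.

    source, target: 2D lists indexed as [row][column] (zero-based)
    top, left:      upper-left corner of the window in both images
    """
    return sum(
        abs(source[r][c] - target[r][c])
        for r in range(top, top + size)
        for c in range(left, left + size)
    )
```

- Moving the window by one pixel changes only one row or one column of terms, which is the property the incremental computation of the following paragraphs relies on.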
- Figure 25 illustrates a subgroup SG(B) of source pixels having a center pixel SB.
- Figure 26 illustrates a corresponding subgroup TG(B) of target pixels (not shown) having a center pixel TB.
- SAD was calculated for the source pixels of rows that are above the row of SB and for pixels that are positioned to the left of SB and at the same row.
- Pixel SA is the center of subgroup SG(A) and is the left neighbor of pixel SB.
- Target pixel TA is the left neighbor of pixel TB and is the center of subgroup TG(A).
- Pixel SC is the center of subgroup SG(C) and is the upper neighbor of pixel SB.
- Target pixel TC is the upper neighbor of pixel TB and is the center of subgroup TG(C).
- the leftmost column of SG(A) is denoted 1110.
- the rightmost column of SG(C) is denoted 1114.
- the current rightmost column of SG(B) is denoted 1115.
- the rightmost lowest pixel of SG(B) (also referred to as new source pixel NSP) is denoted 1116.
- the old pixel (belongs to SG(C)) (also referred to as old source pixel OSP) that is on top of the current rightmost column of SG(B) is denoted 1112.
- the leftmost column of TG(A) is denoted 1110’.
- the rightmost column of TG(C) is denoted 1114’.
- the current rightmost column of TG(B) is denoted 1115’.
- the rightmost lowest pixel of TG(B) (also referred to as new target pixel NTP) is denoted 1116’.
- the old pixel (belongs to TG(C)) (also referred to as old target pixel OTP) that is on top of the current rightmost column of TG(B) is denoted 1112’.
- Figure 27 illustrates method 2600 according to an embodiment of the invention.
- Method 2600 may start by step 2610 of selecting a source pixel and selecting a subgroup of target pixels.
- the subgroup of target pixels may be the entire target image or a part of the target image.
- Step 2610 may be followed by step 2620 of calculating, by a first group of data processors of an array of data processors, a set of sums of absolute differences (SADs).
- the set of SADs is associated with the source pixel and a subgroup of target pixels that includes the target pixel selected in step 2610. Different SADs of the set are calculated in relation to the (same) source pixel and to different target pixels of the subgroup of target pixels.
- the subgroup of target pixels may include target pixels that are sequentially stored in a memory module.
- the calculating of the set of SADs is preceded by fetching the subgroup of target pixels from the memory module.
- the fetching of the subgroup of target pixels from the memory module is executed by a gather unit that comprises a content addressable memory cache.
- Each SAD is calculated based on previously calculated SADs and on a currently calculated absolute difference between other source pixels and other target pixels that belong to the subgroup of target pixels.
- Figure 25 provides an example of such a computation.
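- The reuse of previously calculated SADs can be sketched in Python as two incremental updates. This is one reading of the description, stated as an assumption: the SAD of a window is obtained from the SAD of its left neighbor by removing the leftmost column and adding the new rightmost column, and the SAD of that new column is obtained from the column directly above it by removing the old top pixel and adding the new bottom pixel. All function names are hypothetical.

```python
def abs_diff(s, t, r, c):
    """Absolute difference of one source/target pixel pair."""
    return abs(s[r][c] - t[r][c])


def column_sad(s, t, top, col, n):
    """SAD of an n-pixel column starting at row `top` (direct computation)."""
    return sum(abs_diff(s, t, r, col) for r in range(top, top + n))


def window_sad(s, t, top, left, n):
    """SAD of an n x n window (direct computation, used as a reference)."""
    return sum(column_sad(s, t, top, c, n) for c in range(left, left + n))


def slide_right(prev_sad, s, t, top, left, n):
    """SAD of the window at (top, left) from the SAD of the window at (top, left - 1)."""
    leaving = column_sad(s, t, top, left - 1, n)        # leftmost column of the previous window
    entering = column_sad(s, t, top, left + n - 1, n)   # new rightmost column
    return prev_sad - leaving + entering


def slide_column_down(prev_col_sad, s, t, top, col, n):
    """Column SAD starting at row `top` from the column SAD starting at row top - 1."""
    old_pixel = abs_diff(s, t, top - 1, col)        # pixel that leaves at the top
    new_pixel = abs_diff(s, t, top + n - 1, col)    # pixel that enters at the bottom
    return prev_col_sad - old_pixel + new_pixel
```

- In this sketch `slide_right` recomputes both columns for clarity; in a pipelined implementation the column SADs would be kept in registers and updated with `slide_column_down`, so each new window costs only one new absolute difference plus a few additions and subtractions.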
- Step 2620 may be followed by step 2630 of finding, by a second group of data processors of the array, a best matching target pixel out of the subgroup of target pixels in response to values of the set of SADs.
- Steps 2620 and 2630 may include storing in the array of data processors the calculated results - SADs of an entire rectangular array of pixels, SADs of columns, and the like. It should be noted that the depth of the register file of each DPU may be long enough to store the SAD of the rightmost column of the previous rectangular array. For example, if there are 15 columns in SG(A) then the depth of the register file 550 of the DPU should be at least fifteen.
- the first previously calculated SAD may reflect absolute differences between (i) a rectangular source pixel array that differs from the given rectangular source pixel array by a first source pixel column and by a second source pixel column, and (ii) a rectangular target pixel array that differs from the given rectangular target pixel array by a first target pixel column and by a second target pixel column.
- SAD SA,TA
- the second previously calculated SAD may reflect absolute differences between the first source pixel column and the first target pixel column. For example
- Step 2620 may include:
- finding the best matching target pixel may involve an iterative process: multiple repetitions of steps 2610, 2620 and 2630 may be performed for different subgroups of pixels, and by comparing the results of these multiple iterations the best matching target pixel of the group of target pixels may be found.
- the array of processing units may perform multiple disparity calculations (for different source pixels and/or for different target pixels) in parallel.
- Figure 28 illustrates eight source pixels and thirty two target pixels that are processed by the DPA according to an embodiment of the invention.
- Figure 29 illustrates an array of source pixels according to an embodiment of the invention.
- Figure 30 illustrates an array of target pixels according to an embodiment of the invention.
- SADs related to source pixels (SP0, SP1, SP2 and SP3) and (SP’0, SP’1, SP’2 and SP’3), to 4 x 8 target pixels (including a leftmost column of TP0, TP1, TP2 and TP3) and another 4 x 8 target pixels (including a leftmost column of TP’0, TP’1, TP’2 and TP’3) are calculated.
- Source pixels SP0, SP1, SP2 and SP3 belong to the same column and their SADs are calculated in a pipelined manner:
- In parallel to the calculation of the SADs of source pixels SP0, SP1, SP2 and SP3, the PMA also calculates the SADs of SP’0, SP’1, SP’2 and SP’3. SP’0, SP’1, SP’2 and SP’3 belong to the same column and their SADs are calculated in a pipelined manner:
- the DPA 500 may calculate the SADs of each source pixel and multiple other target pixels in parallel.
- the calculation of SADs for SP0 may include calculating SADs for SP0 and each one of TP0, Ts1P0, Ts2P0, Ts3P0, Ts4P0, Ts5P0, Ts6P0, Ts7P0.
- Figure 29 illustrates four new source pixels NS0, NS1, NS2 and NS3 (for calculating the SADs related to SP0, SP1, SP2 and SP3 and only one target pixel column).
- Figure 30 illustrates thirty two new target pixels: 1) New target pixels for calculating SADs for SP0 and eight different target pixels - NT0, Ns1T0, Ns2T0, Ns3T0, Ns4T0, Ns5T0, Ns6T0, Ns7T0.
- Figure 31 illustrates eight groups 1131, 1132, 1133, 1134, 1135, 1136, 1137 and 1138 of DPUs - each group includes four DPUs.
- Each group of 1131, 1132, 1133 and 1134 calculates the SAD for pixels SP0, SP1, SP2 and SP3 - but for different target pixels (TP0, TP2, TP3 and TP4).
- Each group of 1135, 1136, 1137 and 1138 calculates the SAD for pixels SP’0, SP’1, SP’2 and SP’3 - but for different target pixels (TP0, TP2, TP3 and TP4).
- Group 1140 performs minimum operations on the SADs calculated by groups 1131-1138.
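- The minimum operation that selects the best matching target pixel can be sketched in Python. The sketch assumes the SADs of one source pixel are available as a mapping from candidate target pixel labels to SAD values, and that ties are resolved in favor of the candidate listed first; both are assumptions, and `best_match` is a hypothetical name.

```python
def best_match(sads):
    """Return (candidate, sad) for the smallest SAD.

    sads: dict mapping a candidate target pixel label to its SAD value.
    min() keeps the first minimal item, so ties go to the earliest candidate.
    """
    return min(sads.items(), key=lambda item: item[1])
```

- The offset between the source pixel and the selected candidate is then the disparity for that source pixel.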
- method 2600 may include calculating, by a first group of data processors of an array of data processors, multiple sets of SADs that are associated with a plurality of source pixels and multiple subgroups of target pixels; wherein each SAD of the multiple sets of SADs is calculated based on previously calculated SADs and on a currently calculated absolute difference; and finding, by a second group of data processors of the array and for each source pixel, a best matching target pixel in response to values of SADs that are associated with the source pixel.
- the multiple sets of SADs may include sub-sets of SADs, each sub-set of SADs is associated with the plurality of source pixels and a plurality of subgroups of target pixels of the multiple subgroups of target pixels. For example, groups 1131-1138 calculate different sub-sets of SADs.
- the plurality of source pixels may belong to a column of the rectangular array of pixels and are adjacent to each other.
- Calculating the multiple sets of SADs may include calculating, in parallel, SADs of different sub-sets of SADs.
- Calculating may include calculating, in a sequential manner, SADs that belong to the same sub-set of SADs.
- These status and configuration buffers 109 include PMA control status register, PMA halt enable control register and PMA halt on event status register.
- the control registers may allow, for example, the scalar unit to determine a predefined period of operation for the image processor. Additionally or alternatively, the scalar unit may halt the image processor (without changing the state of the PMA), program the program processor, send control signals to the program processor and resume the operation of the image processor from the same point (except for changes introduced by the scalar unit).
- p_RstCtl defines which features are reset upon resuming.
- A simple counter that counts with the DPU clock (it does not count during p_stall).
- The counter is preset through configuration. Whenever the counter reaches zero, it raises an event signal to the Scalar Unit. This counter is readable through the configuration bus.
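The counter behaviour described above can be sketched as a small Python model. This is a hedged illustration only: the class name `EventCounter`, the `tick` and `read` methods, and the assumption that the counter counts down from its preset value are not taken from the source, which only states that the counter is preset, counts with the DPU clock, does not count during p_stall, and raises an event when it is zero.

```python
class EventCounter:
    """Hypothetical model of a preset counter that raises an event at zero."""

    def __init__(self, preset):
        # Preset value, assumed to be written through configuration.
        self.value = preset

    def tick(self, p_stall=False):
        """Advance one DPU clock cycle; return True while the event is raised."""
        # The counter does not count while p_stall is asserted.
        if not p_stall and self.value > 0:
            self.value -= 1  # assumption: the counter counts down
        # The event is signalled whenever the counter is zero.
        return self.value == 0

    def read(self):
        """Read the counter, standing in for a configuration-bus read."""
        return self.value


if __name__ == "__main__":
    counter = EventCounter(preset=2)
    for stall in (True, False, False):
        print(counter.tick(p_stall=stall), counter.read())
```

A stalled cycle leaves the value unchanged, so the event is delayed by exactly the number of stalled cycles.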
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multi Processors (AREA)
- Image Processing (AREA)
Abstract
The present invention may relate to a non-uniform Benes network that may include: a first Benes network portion having a first number (k) of first inputs and k first outputs; a second Benes network portion having a second number (j) of second inputs and j second outputs, j being smaller than k; and a set of multiplexers that are coupled between a set of switches of an intermediate layer of the first Benes network portion and a first layer of the second Benes network portion.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/841,333 | 2017-12-14 | ||
| US15/841,333 US11178072B2 (en) | 2015-06-10 | 2017-12-14 | Image processor and methods for processing an image |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2019116106A2 (fr) | 2019-06-20 |
| WO2019116106A3 (fr) | 2019-10-03 |
Family
ID=66820765
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2018/001597 Ceased WO2019116106A2 (fr) | 2017-12-14 | 2018-12-12 | Image processor and methods for processing an image |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2019116106A2 (fr) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11178072B2 (en) | 2015-06-10 | 2021-11-16 | Mobileye Vision Technologies Ltd. | Image processor and methods for processing an image |
| WO2024079525A1 (fr) * | 2022-10-13 | 2024-04-18 | Mobileye Vision Technologies Ltd. | Benes network connected processing entity array routing |
| US12499078B2 (en) | 2015-06-10 | 2025-12-16 | Mobileye Vision Technologies Ltd. | Image processing array with multi-path relay channel for relaying data between relay ports of data processors |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2051387A1 (fr) * | 2007-10-15 | 2009-04-22 | CoreOptics, Inc., c/o The Corporation Trust Center | Receiver, interleaving and deinterleaving circuit and method |
| EP2974025B1 (fr) * | 2013-03-15 | 2018-10-31 | The Regents of The University of California | Network architectures for boundless hierarchical interconnects |
| US9077338B1 (en) * | 2014-05-20 | 2015-07-07 | Altera Corporation | Method and circuit for scalable cross point switching using 3-D die stacking |
| US11178072B2 (en) * | 2015-06-10 | 2021-11-16 | Mobileye Vision Technologies Ltd. | Image processor and methods for processing an image |
2018
- 2018-12-12: WO PCT/IB2018/001597, published as WO2019116106A2 (fr), not active (Ceased)
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11178072B2 (en) | 2015-06-10 | 2021-11-16 | Mobileye Vision Technologies Ltd. | Image processor and methods for processing an image |
| US12499078B2 (en) | 2015-06-10 | 2025-12-16 | Mobileye Vision Technologies Ltd. | Image processing array with multi-path relay channel for relaying data between relay ports of data processors |
| WO2024079525A1 (fr) * | 2022-10-13 | 2024-04-18 | Mobileye Vision Technologies Ltd. | Benes network connected processing entity array routing |
| GB2639461A (en) * | 2022-10-13 | 2025-09-24 | Mobileye Vision Technologies Ltd | Benes network connected processing entity array routing |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2019116106A3 (fr) | 2019-10-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12499078B2 (en) | | Image processing array with multi-path relay channel for relaying data between relay ports of data processors |
| US12045614B2 (en) | | Streaming engine with cache-like stream data storage and lifetime tracking |
| US11994949B2 (en) | | Streaming engine with error detection, correction and restart |
| US10789175B2 (en) | | Caching policy in a multicore system on a chip (SOC) |
| US7987322B2 (en) | | Snoop request management in a data processing system |
| KR101744031B1 (ko) | | Read and write masks update instruction for vectorization of recursive computations over independent data |
| US11068164B2 (en) | | Streaming engine with fetch ahead hysteresis |
| US20140181466A1 (en) | | Processors having fully-connected interconnects shared by vector conflict instructions and permute instructions |
| JP2017107579A (ja) | | Vector move instruction controlled by read and write masks |
| CN115827065B (zh) | | Streaming engine using early and late addresses and loop count registers to track architectural state |
| WO2019116106A2 (fr) | | Image processor and methods for processing an image |
| CN112463218A (zh) | | Instruction issue control method and circuit, data processing method and circuit |
| US20220070116A1 (en) | | Image processor and methods for processing an image |
| Tudruj et al. | | A globally-interconnected modular CMP system with communication on the fly |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18889136; Country of ref document: EP; Kind code of ref document: A2 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 18889136; Country of ref document: EP; Kind code of ref document: A2 |