WO2025200931A1 - Method, apparatus, and medium for visual data processing - Google Patents

Method, apparatus, and medium for visual data processing

Info

Publication number
WO2025200931A1
Authority
WO
WIPO (PCT)
Prior art keywords
visual data
partitions
substream
codestream
transform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2025/079822
Other languages
English (en)
Inventor
Zhaobin Zhang
Semih Esenlik
Ye-Kui Wang
Yaojun Wu
Jizheng Xu
Kai Zhang
Li Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
ByteDance Inc
Original Assignee
Douyin Vision Co Ltd
ByteDance Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Co Ltd, ByteDance Inc

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding

Definitions

  • Neural networks originated from interdisciplinary research in neuroscience and mathematics. They have shown strong capabilities in non-linear transform and classification. Neural network-based image/video compression technology has made significant progress during the past half decade. It is reported that the latest neural network-based image compression algorithm achieves rate-distortion (R-D) performance comparable to Versatile Video Coding (VVC). With the performance of neural image compression continually improving, neural network-based video compression has become an actively developing research area. However, the coding flexibility of neural network-based image/video coding is generally expected to be further improved.
  • Embodiments of the present disclosure provide a solution for visual data processing.
  • a method for visual data processing comprises: performing a conversion between visual data and a codestream of the visual data with a neural network (NN)-based model, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions, and in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, each of the plurality of partitions corresponds to an integer number of processing units of the transform process.
  • NN neural network
  • respective inputs of the plurality of processes of the NN-based model are partitioned into a plurality of partitions based on a same partitioning scheme. Moreover, if residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, each of the plurality of partitions corresponds to an integer number of processing units of the transform process.
  • the proposed method can better support coding a part of the visual data independently of the rest of the visual data. Thereby, the coding flexibility can be improved.
  • an apparatus for visual data processing comprises a processor and a non-transitory memory with instructions thereon.
  • a non-transitory computer-readable storage medium stores instructions that cause a processor to perform a method in accordance with the first aspect of the present disclosure.
  • the non-transitory computer-readable recording medium stores a codestream of visual data which is generated by a method performed by an apparatus for visual data processing.
  • the method comprises: performing a conversion from visual data to the codestream with a neural network (NN)-based model, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions, and in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, each of
  • a method for storing a codestream of visual data comprises: performing a conversion from visual data to the codestream with a neural network (NN)-based model; and storing the codestream in a non-transitory computer-readable recording medium, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions, and in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, each of the plurality of partitions corresponds to an integer number of
  • Fig. 1B is a schematic diagram illustrating an example transform coding scheme
  • Fig. 3 is a schematic diagram illustrating an example autoencoder implementing a hyperprior model
  • Fig. 4 is a schematic diagram illustrating an example combined model configured to jointly optimize a context model along with a hyperprior and the autoencoder;
  • Fig. 5 illustrates an example encoding process
  • Fig. 6 illustrates an example decoding process
  • Fig. 7 illustrates an example decoding process according to some embodiments of the present disclosure
  • Fig. 8 illustrates an example learning-based image codec architecture
  • Fig. 9 illustrates an example synthesis transform for learning based image coding
  • Fig. 10 illustrates an example leaky Rectified Linear Unit (ReLU) activation function
  • Fig. 11 illustrates an example ReLU activation function
  • Fig. 12 illustrates latent tiles in synthesis transform
  • Fig. 13 illustrates a bitstream layout
  • Fig. 14 illustrates an example decoder structure
  • Fig. 15 illustrates an example hyper scale decoder
  • Fig. 16 illustrates an example hyper decoder
  • Fig. 17 illustrates a diagram of an example multistage context modelling (MCM) structure
  • Fig. 18 illustrates an example implementation of primary component guided adaptive up-sampling filter
  • Fig. 19 illustrates an example bitstream structure
  • Fig. 20 illustrates different types of split regions in accordance with embodiments of the present disclosure
  • Fig. 21 illustrates overlapping applied to the region boundary in accordance with embodiments of the present disclosure
  • Fig. 22 illustrates shifted regions in accordance with embodiments of the present disclosure
  • Fig. 23 illustrates an independent region-based coding in accordance with embodiments of the present disclosure
  • Fig. 24 illustrates a flag used to control the region split mechanism in accordance with embodiments of the present disclosure
  • Fig. 25 illustrates two example grids in accordance with embodiments of the present disclosure
  • Fig. 26 illustrates latent tiles in synthesis transform
  • Fig. 27 illustrates an example of hyper decoder
  • Fig. 28 illustrates an example of applying overlapping on partitions in accordance with embodiments of the present disclosure
  • Fig. 29 illustrates an example of applying padding on partitions in accordance with embodiments of the present disclosure
  • Fig. 30 illustrates an example of applying both padding and overlapping on partitions in accordance with embodiments of the present disclosure
  • Fig. 31 illustrates an example codestream structure in accordance with embodiments of the present disclosure
  • Fig. 32 illustrates an example coding structure in accordance with embodiments of the present disclosure
  • Fig. 34 illustrates an example coding structure in accordance with embodiments of the present disclosure
  • Fig. 36 illustrates that modules involved tiling may be split into two parts based on
  • Fig. 38 illustrates a schematic diagram of independent case tile
  • Fig. 39 illustrates an example scenario of modules applied with tiling
  • Fig. 40 illustrates another example scenario of modules applied with tiling
  • Fig. 41 illustrates an example bitstream structure variation
  • Fig. 42 illustrates another example bitstream structure variation
  • Fig. 43 illustrates an example scenario in which Q-stream is also split based on the example shown in Fig. 39;
  • Fig. 45 illustrates a further example bitstream structure variation
  • Fig. 46 illustrates a flowchart of a method for visual data processing in accordance with embodiments of the present disclosure
  • Fig. 47 illustrates a schematic diagram of collocated tiling in accordance with embodiments of the present disclosure
  • Fig. 48 illustrates a schematic diagram of hierarchical tiling in accordance with embodiments of the present disclosure.
  • Fig. 49 illustrates a block diagram of a computing device in which various embodiments of the present disclosure can be implemented.
  • references in the present disclosure to “one embodiment,” “an embodiment,” “an example embodiment,” and the like indicate that the embodiment described may include a particular feature, structure, or characteristic, but it is not necessary that every embodiment includes the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an example embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • first and second etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and similarly, a second element could be termed a first element, without departing from the scope of example embodiments.
  • the term “and/or” includes any and all combinations of one or more of the listed terms.
  • Fig. 1A is a block diagram that illustrates an example visual data coding system 100 that may utilize the techniques of this disclosure.
  • the visual data coding system 100 may include a source device 110 and a destination device 120.
  • the source device 110 can be also referred to as a visual data encoding device, and the destination device 120 can be also referred to as a visual data decoding device.
  • the source device 110 can be configured to generate encoded visual data and the destination device 120 can be configured to decode the encoded visual data generated by the source device 110.
  • the source device 110 may include a visual data source 112, a visual data encoder 114, and an input/output (I/O) interface 116.
  • I/O input/output
  • the visual data source 112 may include a source such as a visual data capture device.
  • Examples of the visual data capture device include, but are not limited to, an interface to receive visual data from a visual data provider, a computer graphics system for generating visual data, and/or a combination thereof.
  • the factorized entropy model is used to decode the quantized latents for luma and chroma, i.e., and in Fig. 7. 2.
  • the probability parameters (e.g. variance) generated by the second network are used to generate a quantized residual latent by performing the arithmetic decoding process.
  • the quantized residual latent is inversely gained with the inverse gain unit (iGain) as shown in orange color in Fig. 7.
  • the outputs of the inverse gain units are denoted as and for luma and chroma components, respectively. 4.
  • the following steps are performed in a loop until all elements of are obtained: a.
  • Fig. 9 illustrates an example synthesis transform for learning based image coding.
  • the example synthesis transform above includes a sequence of 4 convolutions with up-sampling with a stride of 2.
  • the synthesis transform sub-net is depicted in Fig. 9.
  • the size of the tensor in different parts of the synthesis transform before the cropping layer is shown in the diagram in Fig. 9.
  • the scale factor might be 2 for example, wherein the secondary component is downsampled by a factor of 2.
  • the operation of the cropping layers is controlled by the output size H, W. In one example, if H and W are both equal to 16, then the cropping layers do not perform any cropping. On the other hand, if H and W are both equal to 17, then all 4 cropping layers perform cropping.
  • $\mathrm{bitshift}(x, n) = x \cdot 2^{n}$
  • $\mathrm{bitshift}(x, n) = \mathrm{floor}(x \cdot 2^{n})$
  • $\mathrm{bitshift}(x, n) = x \mathbin{//} 2^{n}$.
  • the output of the bitshift operation is an integer value.
  • the floor () function might be added to the definition.
  • Floor (x) is equal to the largest integer less than or equal to x.
  • bits shifted into the most significant bits (MSBs) as a result of the right shift have a value equal to the MSB of x prior to the shift operation.
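  • As a quick check of the floor and right-shift definitions above, the following Python sketch compares a floor-based shift with an arithmetic shift. Python's built-in `>>` on integers sign-extends, so it is used here as a stand-in for the arithmetic right shift; the helper name `bitshift_right` is illustrative and not part of the disclosure.

```python
def bitshift_right(x: int, n: int) -> int:
    """Floor-based right shift: floor(x / 2**n) for a non-negative shift n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Integer floor division rounds toward negative infinity, i.e. it is Floor().
    return x // (1 << n)


# The floor-based definition agrees with the sign-extending shift,
# including for negative inputs, where truncation toward zero would differ.
for x in (13, -13, 0, 255, -256):
    for n in range(5):
        assert bitshift_right(x, n) == x >> n
```

  • Note that for x = -13 and n = 1 both forms give -7, whereas truncating division toward zero would give -6.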
  • MSBs most significant bits
  • x << y: Arithmetic left shift of a two's complement integer representation of x by y binary digits. This function is defined only for non-negative integer values of y.
  • Bits shifted into the least significant bits (LSBs) as a result of the left shift have a value equal to 0.
  • 2.10 Convolution operation
  • The convolution is a fairly simple operation at heart: you start with a kernel, which is simply a small matrix of weights.
  • padding is performed by replication. A different padding mode can be specified (for example, padding by zeros).
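  • To make the kernel and replication-padding description concrete, here is a minimal pure-Python sketch of a same-size 2D convolution. It is an illustration only: the function name, the odd-kernel assumption, and the use of the cross-correlation form (no kernel flip, as is common in neural network layers) are assumptions rather than details taken from the disclosure.

```python
def conv2d_replicate(image, kernel):
    """Same-size 2D convolution (cross-correlation form) with replicate padding.

    `image` and `kernel` are lists of rows; the kernel is assumed to have odd
    height and width so that it has a well-defined centre.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    # Replicate padding: coordinates outside the image are
                    # clamped to the nearest edge sample.
                    yy = min(max(y + i - ph, 0), h - 1)
                    xx = min(max(x + j - pw, 0), w - 1)
                    acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
result = conv2d_replicate(image, box)
```

  • Switching to zero padding would only change how out-of-range coordinates are handled: instead of clamping, those terms would contribute nothing to the sum.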
  • Crop($H_{in}$, $W_{in}$, $d$, $s_d$)
  • $H_{in}$, $W_{in}$ are the height and width of the tensor – output to Synthesis transform
  • $s_d$ is the stride of the preceding transposed convolution
  • $d$ is the depth of the convolution layer in the deep learnable reconstruction process.
  • Latent space tile decoding can be enabled for each component independently by flags signalled in the picture header (section 9.3). If tile_enable_Luma or tile_enable_Chroma is true, then the corresponding component is decoded using the latent tiling process illustrated in Fig. 12. The number and location of tiles are determined by the tile size $S_{tile}$ (equal to tile_size_Luma for the primary component and tile_size_Chroma for the secondary component) and the tile overlap $m_{tile}$ (tile_overlap_Luma for the primary component and tile_overlap_Chroma for the secondary component) signalled in the picture header (section 9.3).
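  • The exact tile derivation is given by the figure and is not reproduced in this text, so the following is only a sketch under a stated assumption: core tiles of `tile_size` samples are laid out from the origin, and each tile's input is extended by `overlap` samples on both sides, clipped to the tensor. The function name `tile_spans` is hypothetical.

```python
def tile_spans(length: int, tile_size: int, overlap: int):
    """Return [(core_span, extended_span), ...] along one dimension.

    Assumed layout (not confirmed by the source): non-overlapping core tiles
    of `tile_size` samples, each extended by `overlap` samples per side and
    clipped to [0, length).
    """
    if tile_size <= 0 or overlap < 0:
        raise ValueError("tile_size must be positive and overlap non-negative")
    spans = []
    for start in range(0, length, tile_size):
        end = min(start + tile_size, length)
        extended = (max(0, start - overlap), min(length, end + overlap))
        spans.append(((start, end), extended))
    return spans


spans = tile_spans(length=10, tile_size=4, overlap=1)
```

  • Applying the same function to the height and to the width gives a 2D tile grid; the per-component values would come from tile_size_Luma / tile_overlap_Luma or their chroma counterparts.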
  • picture_header_size is the number of bytes in the picture header excluding the first two-byte marker
  • img_width plus 64 specifies width of an input picture (from 64 to 65600)
  • img_height plus 64 specifies height of the input picture (from 64 to 65600)
  • bit_depth is the bit depth of the output picture (“0” corresponds to 8 and “1” corresponds to 10)
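  • The offset and table semantics of these fields can be sketched as follows. The bit-level layout of the picture header is not described here, so the sketch starts from already-parsed integer values; the `PictureFormat` type and the function name are hypothetical.

```python
from dataclasses import dataclass

# Mapping stated in the text: code 0 -> 8 bits, code 1 -> 10 bits.
BIT_DEPTH_TABLE = {0: 8, 1: 10}


@dataclass(frozen=True)
class PictureFormat:
    width: int
    height: int
    bit_depth: int


def interpret_size_fields(img_width: int, img_height: int, bit_depth: int) -> PictureFormat:
    """Turn raw header values into picture dimensions and bit depth."""
    if img_width < 0 or img_height < 0:
        raise ValueError("coded size fields must be non-negative")
    if bit_depth not in BIT_DEPTH_TABLE:
        raise ValueError("bit_depth code must be 0 or 1")
    # "img_width plus 64" / "img_height plus 64" give the actual dimensions.
    return PictureFormat(img_width + 64, img_height + 64, BIT_DEPTH_TABLE[bit_depth])


fmt = interpret_size_fields(img_width=1856, img_height=1016, bit_depth=1)
```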
  • bit_s_ver is a one-bit value which defines s_ver, which is used to align the coding subsampling mode of the secondary component and the subsampling mode in the output picture format in the vertical direction, as defined in Table 3. Usage of s_ver is described in section 7.6. If bit_c_ver is not present in the bit-stream, then s_ver is equal to 1.
  • c_ver controls the internal subsampling mode of the secondary component in the vertical direction, as defined in Table 3. Usage of s_ver is described in section 7.6. If bit_c_ver is not present in the bit-stream, then s_hor is equal to 1.
  • independent_beta_uv is a flag (false/true) which indicates whether the rate control parameters (β) for the primary and secondary components are the same.
  • beta_displacement_log_y –parameter indicating ratio between rate control parameter beta selected by encoder for primary component and one used in the model training.
  • $\mathrm{betaDisplacementLogY} = \mathrm{beta\_displacement\_log\_y} - 2^{11}$
  • beta_displacement_log_uv indicating ratio between rate control parameter beta selected by encoder for secondary component and one used in the model training.
  • opIdx is an identifier for the operation point: 0 means the “base” and 1 means the “high” operation point.
  • tile_enable_Luma and tile_enable_Chroma are enable flags for tiling of primary and secondary components.
  • tile_size_Luma and tile_size_Chroma are the tile sizes for the primary and secondary components.
  • tile_overlap_Luma and tile_overlap_Chroma are the sizes of the tile overlapping areas for the primary and secondary components.
  • cube_group_flag is a 1-bit unsigned integer.
  • Quality map information decoder
  • The inputs of this process are:
    • Codestream for quality_map;
    • Indicators for sigma tables which are decoded in the picture header:
    • quality_map_entropy_index_y for the primary component.
    • quality_map_entropy_index_uv for the secondary component.
  • The outputs of this process are:
    • quality_map_delta_Y is an array of size with information used for deriving the scaling factor (section 12.2) for the primary component residual tensor.
    • quality_map_delta_UV is an array of size with information used for deriving the scaling factor (section 12.2) for the secondary component residual tensor.
  • q_primary is a 1D array of size $h_{4Y} \times w_{4Y}$
  • q_secondary is a 1D array of size $h_{4UV} \times w_{4UV}$
  • the output of this process is r_primary, a one-dimensional array $\{s\}$ of residual tensor elements, which is an input of the decoder skip process (section 13.3.4) for the primary component.
  • sizes $C_p$, $w_{4Y}$, $h_{4Y}$ are defined in Table 2.
  • Hyper scale decoder
  • The inputs of the hyper scale decoder are:
    • reconstructed hyper tensor,
    • sizes of input/output tensor $H_{in}$, $W_{in}$,
    • operation point indicator opIdx,
    • model parameters for the Hyper Scale Decoder Net defined by the pair (modelIdx, opIdx); all multiplier parameters in those models are 8-bit integers.
  • the output of the hyper scale decoder is the standard deviation logarithm tensor $I_\sigma[C, h_4, w_4]$ with integer values in the range $0 \le I_\sigma \le ((N_\sigma - 1) \ll \mathrm{sigmaPrecision})$, where $N_\sigma$ and sigmaPrecision are defined in section 0.
  • the cropping layer (stride 4, depth 5) ensures the size of the output tensor is $[C, h_4, w_4]$.
  • the process concludes with an abs operation.
  • Models with weights are stored in the electronic attachment, in the format specified in section 4.8, and are located at <oper_point>/model_<MID>/<COMP>/hyper_scale_decoder.onnx, where <oper_point> is “base” or “high”, <MID> is an integer from 0 to 5, and <COMP> is “primary” or “secondary”.
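  • The path template above can be expanded mechanically. The sketch below simply enumerates the combinations allowed by the stated value ranges; the function name is hypothetical, and the sketch makes no claim about how the attachment is actually accessed.

```python
OPER_POINTS = ("base", "high")
COMPONENTS = ("primary", "secondary")
MODEL_IDS = range(0, 6)  # <MID> is an integer from 0 to 5


def hyper_scale_decoder_path(oper_point: str, mid: int, comp: str) -> str:
    """Build <oper_point>/model_<MID>/<COMP>/hyper_scale_decoder.onnx."""
    if oper_point not in OPER_POINTS:
        raise ValueError(f"unknown operation point: {oper_point!r}")
    if mid not in MODEL_IDS:
        raise ValueError(f"model id out of range: {mid!r}")
    if comp not in COMPONENTS:
        raise ValueError(f"unknown component: {comp!r}")
    return f"{oper_point}/model_{mid}/{comp}/hyper_scale_decoder.onnx"


all_paths = [
    hyper_scale_decoder_path(op, mid, comp)
    for op in OPER_POINTS
    for mid in MODEL_IDS
    for comp in COMPONENTS
]
```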
  • (11.2) Hyper Decoder
  • The learning-based hyper decoder consists of two independent pipelines with identical neural network architecture, except for input size and number of channels.
  • The inputs of this process are:
    • reconstructed hyper latent tensor,
    • model parameters for the Hyper Decoder Net defined by (modelIdx),
    • operation point indicator opIdx.
  • The hyper decoder process is depicted in Fig. 16. The hyper decoder starts with a stride-1 convolution with kernel size 1×1, followed by an inverse convolution (stride 2, kernel size 4×4), a cropping layer (depth 6) and a leaky rectified linear unit.
  • The next step is a stride-1 convolution with kernel size 3×3, followed by an inverse convolution (stride 2, kernel size 4×4), a cropping layer (depth 5) and a leaky rectified linear unit. The number of channels is kept unchanged up to this point (equal to the number of channels C of the input tensor).
  • The hyper decoder concludes with a stride-1 convolution with kernel size 3×3, which increases the number of channels to 2C for the high operation point and keeps the number of channels unchanged for the base operation point, followed by a leaky rectified linear unit.
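  • The layer sequence just described can be written down as data and used to trace nominal tensor shapes. This is a sketch under simplifying assumptions that are not stated in the source: stride-1 convolutions are taken to preserve spatial size, each stride-2 inverse convolution is taken to double it, and the trimming done by the cropping layers is ignored. All names are illustrative.

```python
def hyper_decoder_layers(op_point: str):
    """Return (kind, kernel, stride, channel_multiplier) tuples in order."""
    if op_point not in ("base", "high"):
        raise ValueError("op_point must be 'base' or 'high'")
    last_mult = 2 if op_point == "high" else 1  # 2C for high, C for base
    return [
        ("conv", 1, 1, 1),
        ("deconv", 4, 2, 1),   # followed by crop (depth 6) and leaky ReLU
        ("conv", 3, 1, 1),
        ("deconv", 4, 2, 1),   # followed by crop (depth 5) and leaky ReLU
        ("conv", 3, 1, last_mult),  # followed by leaky ReLU
    ]


def trace_shape(c: int, h: int, w: int, op_point: str):
    """Nominal (channels, height, width) after the whole hyper decoder."""
    for kind, _kernel, stride, mult in hyper_decoder_layers(op_point):
        if kind == "deconv":
            h, w = h * stride, w * stride
        c *= mult
    return c, h, w


high_shape = trace_shape(128, 4, 6, "high")
base_shape = trace_shape(128, 4, 6, "base")
```

  • Under these assumptions the two stride-2 stages give a nominal 4× spatial upsampling; the cropping layers would then trim the result to the exact target size.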
  • Models with weights are stored in the electronic attachment, in the format specified in section 4.8, and are located at <oper_point>/model_<MID>/<COMP>/hyper_decoder.
  • Decoder side SKIP operation
  • At the decoder, the inputs of the skip mode process are:
    • 1D array s[num_res_elements] after decoding by me-tANS (section 9.5.2) from the “stream-y”,
    • mask_skip$[C, h_4, w_4]$.
  • the output of this process is the residual tensor
  • the output of the lossless decoding process is a 1D array $\{s_k\}$, whose size is equal to the total number of “1”s in the mask_skip$[C, h_4, w_4]$ tensor.
  • the mask_skip$[C, h_4, w_4]$ tensor determines which samples of the residual tensor are included in the bitstream. All of the other samples of the quantized residual tensor are inferred to be equal to zero.
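  • The skip-mode reconstruction amounts to scattering the decoded 1D array into the positions where the mask is 1 and filling the rest with zeros. The sketch below assumes the decoded values are consumed in raster order over (channel, row, column); that ordering and the function name are assumptions, not details confirmed by the text.

```python
def scatter_residuals(s, mask_skip):
    """Rebuild a [C][h][w] residual tensor from the 1D array `s`.

    `mask_skip` is a nested list of 0/1 flags with shape [C][h][w]. Positions
    with 1 take the next decoded value; positions with 0 are inferred as zero.
    """
    ones = sum(m for plane in mask_skip for row in plane for m in row)
    if ones != len(s):
        raise ValueError("len(s) must equal the number of 1s in mask_skip")
    values = iter(s)
    residual = []
    for plane in mask_skip:
        out_plane = []
        for row in plane:
            out_plane.append([next(values) if m else 0 for m in row])
        residual.append(out_plane)
    return residual


mask = [[[1, 0], [0, 1]],
        [[0, 0], [1, 0]]]
decoded = [5, -2, 7]
residual = scatter_residuals(decoded, mask)
```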
  • Inter channel correlation information filter tiling and output selection process
  • The ICCI filter processes the input in tiles using the same method described in Section 8.4.
  • the tile size icci_tile_size and tile overlap icci_tile_overlap are the same for the primary (Luma) and secondary (Chroma) components and are signalled in the picture header (Section 9.3).
  • the ratio r between input domain size and output tensor size is fixed to 1 for each component.
  • the minimum tile size is limited to 176 for the stable computation of the model selection explained next.
  • ICCI filters are included into the design.
  • If icci_model_idx is 0, the ICCI processing is bypassed.
  • (14.2) Adaptive upsampler
  • This section details the primary component guided adaptive upsampler process. This process enhances the secondary components (colour information planes) of the image utilising information from the primary component.
  • LEF general process
  • The LEF process receives the primary variance tensor and the reconstructed primary as inputs.
  • the output of LEF process is edge enhanced modified
  • the following ordered steps are performed:
    • The picture header parsing process is invoked as described in Section 9.3 to obtain targetBppIdx,
    • chIdx is set equal to reference_channel_idx_table[targetBppIdx],
  • $i = 0..H_Y - 1$, $j = 0..$
  • the weights tensor kernel is defined as follows:
  • the luma_edge_filter_intensity_list is as follows:
  • the luma_edge_filter_thr_list_table is defined as follows:
  • 3. Problems
  • The following limitations are present in the existing design:
  • 1) As shown in Fig. 19, the coded bits of the Q-stream, Z-stream, R_y-stream, and R_UV-stream in the bitstream follow a sequential order. Therefore, even though the synthesis transform has a tiling-based mechanism, the whole bitstream must be decoded to generate a regionally decoded image, which introduces additional decoding complexity and potential inconvenience in real applications.
  • 2) Although JPEG AI supports splitting the latents into multiple partitions in the synthesis transform part, the multiple partitions have to be decoded in a sequential order, since each subsequently decoded area uses reference pixels from the previously decoded areas. Decoding only a specified region is not supported. Use cases for decoding only a certain region include omnidirectional video-based applications, multi-page document sharing on social media, etc.
  • 3) Currently, it is possible to partition a picture into multiple regions and at the same time enable the correct decoding of a subset of each region independently from other regions, by having tiles be subsets of regions and aligning the tiling between luma and chroma. However, a high-level indication of this regional access capability, e.g., in the picture header, is lacking.
  • a subset (such as a subpicture, a tile, a slice, or a region) of samples in a picture can be reconstructed independently from samples in the picture outside the subset, with an NN-based decoder such as a JPEG-AI decoder.
  • the bitstream is structured on a region basis instead of a picture basis.
  • the picture may be split into multiple regions, either vertically or horizontally or both vertically and horizontally.
  • Fig. 20 illustrates possible split types.
  • the number of horizontal (row) regions and the number of vertical (column) regions may be indicated by two flags/signals that are signaled in the bitstream.
  • a signal/flag may be used to indicate if the current region is an independently decoded region, i.e., the region can be independently decoded without relying on other regions.
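  • As an illustration of how a signalled number of row and column regions could be turned into region rectangles, the sketch below splits each picture dimension as evenly as possible. The even-split rule and the function names are assumptions made for this example; the disclosure leaves the derivation open and also allows explicit signalling of sizes and locations.

```python
def split_extent(length: int, parts: int):
    """Split [0, length) into `parts` contiguous spans of near-equal size."""
    if parts < 1 or parts > length:
        raise ValueError("parts must be between 1 and length")
    base, extra = divmod(length, parts)
    spans, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder
        spans.append((start, start + size))
        start += size
    return spans


def derive_regions(width: int, height: int, num_cols: int, num_rows: int):
    """Return regions as (x0, y0, x1, y1) in raster order, end-exclusive."""
    return [
        (x0, y0, x1, y1)
        for (y0, y1) in split_extent(height, num_rows)
        for (x0, x1) in split_extent(width, num_cols)
    ]


regions = derive_regions(width=10, height=4, num_cols=3, num_rows=2)
```

  • A per-region "independently decoded" flag could then be carried alongside each rectangle; minimum or maximum region size restrictions would be checked after the split.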
  • An algorithm may be designed to derive the region sizes and the specific starting and ending location of each region given the input picture size. In the algorithm, additional restrictions may apply, such as the allowed minimum or maximum region size.
    • a. Alternatively, the size and/or location information may be explicitly indicated in the bitstream.
  • c. A starting code/signal may be signaled in the bitstream to indicate the start of a region.
  • 3) To alleviate region boundary artifacts, overlapping is applied to dependent regions as shown in Fig. 21.
  • the overlapping mechanism may be used for hyper tensor, residual latents and synthesis transform.
  • the number of overlapped pixels could be.
  • Hyper decoder and hyper scale decoder:
    • One sample overlap on each side at input.
    • Four samples discarded on each side at output.
  • MCM:
    • Eight samples overlap on each side at input.
    • Eight samples discarded on each side at output.
  • Synthesis transforms:
    • Two samples overlap on each side at input.
    • Thirty-two or sixteen samples discarded on each side at output.
  • 4) As an alternative to 3), shifted regions may be used as illustrated in Fig. 22.
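  • The per-module overlap and discard counts listed for the overlapping mechanism can be captured in a small lookup table. The sketch below works along one dimension and assumes that overlap is only added, and output only discarded, on sides that have a neighbouring region; that rule, the module keys, and the function names are assumptions for illustration.

```python
# module -> (input overlap per side, allowed output discards per side)
OVERLAP_RULES = {
    "hyper_decoder": (1, (4,)),
    "hyper_scale_decoder": (1, (4,)),
    "mcm": (8, (8,)),
    "synthesis_transform": (2, (32, 16)),
}


def extend_input(start: int, end: int, length: int, module: str):
    """Extend [start, end) by the module's input overlap, clipped to [0, length)."""
    overlap, _ = OVERLAP_RULES[module]
    return max(0, start - overlap), min(length, end + overlap)


def trim_output(out_len: int, module: str, discard=None,
                left_neighbor: bool = True, right_neighbor: bool = True):
    """Return the [lo, hi) range of output samples kept after discarding."""
    _, allowed = OVERLAP_RULES[module]
    discard = allowed[0] if discard is None else discard
    if discard not in allowed:
        raise ValueError(f"discard {discard} not listed for {module}")
    lo = discard if left_neighbor else 0
    hi = out_len - (discard if right_neighbor else 0)
    return lo, hi


ext = extend_input(8, 16, 32, "hyper_decoder")   # 8 samples become 10
kept = trim_output(40, "hyper_decoder")          # assuming 4x upsampling: 10 -> 40
```

  • In this example the 8 core input samples map to 32 kept output samples, which is consistent with a 4× upsampling in the hyper decoder; that ratio is inferred from the 1-in/4-out counts rather than stated explicitly.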
  • the tiles are a shifted version for the synthesis transform of the original tiles in the JPEG AI software.
  • the structure makes the pipelining possible, making a faster decoder.
  • a design is made to allow switching between splitting and not splitting the coded data of z-hat (i.e., $\hat{z}$).
  • Not splitting z-hat data allows the z-hat data for the entire picture to be used even when other types of data are available to the decoder for only certain regions. This allows the end user to access a lower quality representation of the entire picture plus a high-quality representation of certain regions.
  • splitting z-hat data is good for spatial random access use cases other than 360/VR, as clean spatial random access can be achieved without having to transmit and decode z-hat data that is not for the target region(s).
  • an indication, e.g., a 1-bit flag, is signalled to specify whether the coded data of z-hat is split or not.
  • a syntax element (a.k.a. an indication) disclosed above may be binarized as a flag, a fixed length code, an EG (x) code, a unary code, a truncated unary code, a truncated binary code, etc. It can be signed or unsigned.
  • a syntax element representing a coding tool or a coding method may not be signalled and implicitly determined to be unused, if the coding tool or the coding method is regarded as not applicable or cannot be used.
  • a syntax element disclosed above may be coded with at least one context model. Or it may be bypass coded.
  • a syntax element disclosed above may be signaled in a conditional way.
  • a syntax element disclosed above may be signaled at block level/sequence level/group of pictures level/picture level/slice level/tile group level.
  • f) Whether to and/or how to apply the disclosed methods above may be signalled at block level/sequence level/group of pictures level/picture level/slice level/tile group level.
  • g) Whether to and/or how to apply the disclosed methods above may be dependent on coded information, such as colour format, colour component, slice/picture type.
  • the proposed methods may be applied to other image/video compression solutions with NN-based coding tools involved. 5.
  • Embodiment 1
  • The region-based decoding is implemented with the overlapping region mechanism as illustrated in Fig. 13.
  • the high-level framework is shown in Fig. 23. The picture is split into multiple regions either horizontally, or vertically, or both horizontally and vertically.
  • a flag/code associated with each region in the bitstream indicates whether the region is an independently decoded region.
  • the overlapping region is only applied to the dependent regions.
  • the number of overlapped pixels may vary for hyper latent, residual latent, MCM and synthesis transforms.
  • 5.2. Embodiment 2
  • Fig. 24 illustrates a flag used to control the region split mechanism, either overlapped region (as shown in Fig. 21) or shifted region (as shown in Fig. 22).
  • a flag/signaled code is used to control region split mechanism selection.
  • An example may be implemented as follows.
  • Synthesis transform is implemented using the overlap region.
  • Entropy decoding part is implemented using the overlap region.
  • Alternatively, the entropy decoding part is implemented using the shifted region.
  • a switch control is used to select the region partitioning.
  • At least two different grids are used in encoding or decoding of an image.
    • Based on the outcome of a decision, either the first grid or the second grid is used.
    • The decision might be based on an indication obtained from the bitstream.
    • The indication might be a flag.
    • The indication might be a profile indicator.
    • The indication might be derived from a picture size or a model index.
  • the first grid and the second grid might be as depicted in Fig. 25:
    • The first grid and the second grid might comprise different sets of samples.
    • The sizes of the blocks in grid 1 and grid 2 might be different.
    • Grid 1 and grid 2 might comprise the same number of blocks.
    • The size of at least one block of grid 1 and grid 2 might be different.
    • The grids might have overlapping areas. In Fig. 25, grid 1 and grid 2 are depicted as having no overlapping area. However, according to some example embodiments of the present disclosure, the blocks of the grids might have an overlapping area. In other words, block1 and block2 might have a small subset of samples shared by the two.
    • Grid 1 and grid 2 might be used in an entropy decoding process.
  • the output of the entropy decoding process might correspond to each block of the grid.
  • the entropy decoding process might be applied N times.
  • Output of the entropy decoding process might correspond to each block of the grid.
  • the grid 1 and grid 2 might be used in a hyper decoding or hyper scale decoding or a sample prediction (latent sample prediction, latent sample reconstruction or multi-stage context) process.
  • the output of the process might correspond to each block of the grid.
  • the input of the process might correspond to each block of the grid.
  • If there are N blocks in a grid, the process might be applied N times. The output of the process might correspond to each block of the grid.
  • If there are N blocks in a grid, the process might be applied N times. The input of the process might correspond to each block of the grid.
  • the first set of {grid 1 and grid 2} might be used in determining the input of a process (e.g. the process might be an entropy decoding process, a hyper decoding process, a hyper scale decoding process, a sample prediction or sample reconstruction process).
  • the second set of {grid 1 and grid 2} might be used in determining the output of the process.
  • the first set of {grid 1 and grid 2} might be used in a first process and the second set of {grid 1 and grid 2} might be used in a second process.
  • the grids might be constructed as follows:
    • The top-left block of the second grid might be bigger in vertical and horizontal dimension than the top-left block of the first grid.
    • The bottom-right block of the second grid might be smaller in vertical and horizontal dimension than the top-left block of the first grid.
    • The top row of blocks in the second grid might be bigger in vertical dimension than the top row of blocks of the first grid.
    • The bottom row of blocks in the second grid might be smaller in vertical dimension than the bottom row of blocks of the first grid.
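One way to picture the two grids along a single dimension is sketched below. It is only an illustration under stated assumptions: the helper names `grid_edges` and `select_grid`, the `shift` parameter, and the convention that flag 0 selects grid 1 are hypothetical and are not taken from the disclosure.

```python
def grid_edges(length, block, first_block=None):
    """Return block boundaries [0, ..., length] along one dimension.

    `first_block` (hypothetical parameter) overrides the size of the first
    block; the last block takes whatever remains, so it may be smaller.
    """
    first = block if first_block is None else first_block
    edges = [0]
    pos = min(first, length)
    while pos < length:
        edges.append(pos)
        pos += block
    edges.append(length)
    return edges


def select_grid(flag, length, block, shift):
    """Pick grid 1 (flag == 0) or the shifted grid 2 (flag == 1).

    The flag-to-grid mapping is an assumption made for this sketch.
    """
    if flag == 0:
        return grid_edges(length, block)
    return grid_edges(length, block, first_block=block + shift)


# Grid 1 has equal blocks; grid 2 has a bigger first block and a smaller last one.
grid1 = select_grid(0, 16, 4, 2)
grid2 = select_grid(1, 16, 4, 2)
```

In this toy setting both grids contain the same number of blocks, the first block of grid 2 is larger than that of grid 1, and the last block of grid 2 is smaller, which matches the construction listed above for one dimension.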
  • the processes of a decoding operation might be divided into two sets.
  • the first set of processes might comprise at least one of:
    • an entropy decoding/encoding process,
    • a hyper decoding/encoding process,
    • a hyper scale decoding/encoding process,
    • a sample prediction or sample reconstruction process.
  • the second set of processes might comprise a synthesis/analysis transform (e.g. an inverse transform operation that performs a transform between latent samples and picture samples) .
  • at least 2 grids might be used in the processing of the first set of processes.
  • either a first grid or a second grid is used by a process belonging to the first set of processes.
  • the determination might be performed based on an indication obtained from the bitstream.
  • Only one grid is used in the processing of the second set of processes. In other words, no determination is performed, and only one grid is used in the second set of processes.
  • the decoding/encoding operation comprises the first set and second set of processes.
  • the first grid might be used in parallel processing.
  • if the first grid is used in a process, that process might be parallelized.
  • samples of a first block of the first grid can be obtained in parallel to the samples of a second block.
  • the second grid might be used in sequential processing.
  • samples of a second block of the second grid cannot be obtained in parallel to the samples of a first block.
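The contrast between the two grids can be sketched as follows. This is a toy illustration only: `decode_block` is a stand-in that doubles each sample, and the running `carry` that links consecutive blocks in the sequential case is a hypothetical dependency invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_block(block):
    """Stand-in for a per-block process; it simply doubles each sample."""
    return [2 * s for s in block]


def process_parallel(blocks):
    """Blocks of the first grid are independent, so they may run concurrently."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decode_block, blocks))


def process_sequential(blocks):
    """Blocks of the second grid are handled in order.

    Each block depends on state left by the previous block (here a
    hypothetical running offset), so it cannot start before that block ends.
    """
    out, carry = [], 0
    for block in blocks:
        decoded = [2 * s + carry for s in block]
        carry = decoded[-1] if decoded else carry
        out.append(decoded)
    return out
```

Because `process_sequential` threads state from one block to the next, its loop cannot be replaced by the thread pool without changing the result.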
  • the second grid might be used in pipelined processing or pipelining.
  • 5.3. Embodiment 3: In another example implementation, the overlapping amount of the regions and tiles (e.g. partitions) might be determined with an indication obtained from the bitstream.
  • overlap_for_independent_regions equal to 0 specifies that the overlap amount is set equal to zero for a tile boundary if the tile boundary coincides with an independent region (either the tile is inside the independent region, or the neighboring region is an independent region).
  • overlap_for_independent_regions equal to 1 specifies that the overlap amounts specified by tile_overlap_Luma and tile_overlap_Chroma are used for all synthesis transform tiles.
  • a new indication, e.g. overlap_for_independent_regions
  • the corresponding component is decoded using the latent tiling process illustrated in Figure 26.
  • the number and location of tiles are determined by the tile sizes S_tileHor and S_tileVer (equal to tile_size_Luma_hor and tile_size_Luma_ver for the primary component, and tile_size_Chroma_hor and tile_size_Chroma_ver for the secondary component, respectively) and the tile overlap m_tile (tile_overlap_Luma for the primary component, tile_overlap_Chroma for the secondary component) signalled in the picture header (section 9.3).
  • the latent tiling process uses the following quantities, each defined both in the signal domain and in the latent space: the tile start in the vertical dimension; the tile size in the vertical dimension; the tile start in the horizontal dimension; the tile size in the horizontal dimension.
  • the overlap amount might be set equal to zero for the current region or part of the current region or at a boundary of the current region (e.g. the boundary between the current and the neighbor region) .
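The semantics of the indication can be sketched as a small decision function. The function name `tile_overlap`, its argument names, and the behaviour for a boundary that does not touch an independent region when the flag is 0 (the signalled overlap is kept) are assumptions made for illustration.

```python
def tile_overlap(overlap_for_independent_regions,
                 boundary_on_independent_region,
                 signalled_overlap):
    """Return the overlap to use at one tile boundary.

    Flag 0: the overlap is zero where the boundary coincides with an
    independent region. Flag 1: the signalled overlap is used everywhere.
    """
    if overlap_for_independent_regions == 0 and boundary_on_independent_region:
        return 0
    return signalled_overlap
```

Setting the overlap to zero at such a boundary means that no samples from outside the independent region are needed to reconstruct it.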
  • The model with weights is stored in the electronic attachment, in the format specified in section 4.8, and is located in <oper_point>/model_<MID>/<COMP>/hyper_decoder.
  • padding might be applied at the boundaries of a partition.
  • an extension might be applied around the partition boundaries.
  • overlapping partitions are applied (e.g. samples from a neighbor partition might be used) in the decoding process. This is exemplified in Fig. 28.
  • an extension might be applied around the partition boundaries.
  • Padding is applied at the extension region (e.g. samples from a current partition are used) in the decoding process. This is exemplified in Fig. 29.
  • an extension might be applied around the partition boundaries. Both padding and overlapping might be applied at the extension region. This is exemplified in Fig. 30.
  • the padding or overlapping might be applied to different parts of the extension:
    • Corner samples of the extension might use padding.
    • Samples that are not at the corner might apply overlapping.
    • If a sample is in the extension section of a first partition, and at the same time it is inside the core part of a second partition:
      a. If the second partition is an independent partition, padding might be applied.
      b. If the second partition is a dependent partition, overlapping might be applied.
      c. If the first partition is an independent partition, padding might be applied.
      d. If the first partition is a dependent partition, overlapping might be applied.
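The difference between padding and overlapping at the extension can be sketched in one dimension. The function `extend_partition` and its parameters are hypothetical, and replicating the nearest sample of the current partition is only one possible padding rule, assumed here for simplicity.

```python
def extend_partition(samples, start, size, ext, neighbor_independent):
    """Return samples[start:start + size] with `ext` extra samples per side.

    Overlapping: extension samples are copied from the neighbor partition.
    Padding: extension samples repeat the nearest sample of the current
    partition (an assumed padding rule). Requires size >= 1.
    """
    core = samples[start:start + size]
    if neighbor_independent:
        # Independent neighbor: do not read its samples, pad instead.
        left = [core[0]] * ext
        right = [core[-1]] * ext
    else:
        lo = max(start - ext, 0)
        hi = min(start + size + ext, len(samples))
        left = samples[lo:start]
        right = samples[start + size:hi]
        # At a picture boundary there is no neighbor, so fall back to padding.
        left = [core[0]] * (ext - len(left)) + left
        right = right + [core[-1]] * (ext - len(right))
    return left + core + right


row = list(range(10))
overlapped = extend_partition(row, 4, 3, 2, neighbor_independent=False)
padded = extend_partition(row, 4, 3, 2, neighbor_independent=True)
```

Both variants return a block of the same size, so the process that consumes the extended partition does not need to know which rule produced the extension.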
  • A partition might be a region, a tile, a synthesis transform tile, etc.
  • 5.6. Embodiment 6: The following might be the application of the decoding modules based on some example embodiments of the present disclosure.
  • the codestream structure might be as shown in Fig. 31.
  • the following modules might apply tiling as exemplified in Fig.
  • Embodiment 8: Based on some embodiments of the present disclosure, there may be two schemes to implement tiling: collocated tiling and hierarchical tiling, as shown in Fig. 35. In collocated tiling, the tile size and location are aligned across the whole processing pipeline, i.e., from to the final reconstructed image patch. In this manner, it is convenient to support regional accessibility: if only a certain spatial area is needed, only the corresponding part of the bitstream is decoded and fed into the following modules. Alternatively, hierarchical tiling can also be used.
  • the Q-stream is located after Z-stream.
  • the dashed box-enclosed parts may not be present.
  • 5.11. Embodiment 11: As an alternative to Embodiment 10, the Hyper Scale Decoder may also be applied with tiles. Adding the Hyper Scale Decoder to Fig. 39 leads to the example shown in Fig. 43. Adding the Hyper Scale Decoder to Fig. 40 leads to the example shown in Fig. 44. a. In one example as shown in Fig. 43, r-stream, Q-stream, synthesis transforms, MCM, Hyper Decoder and filters are applied with tiling. i.
  • Fig. 46 illustrates a flowchart of a method 4600 for visual data processing in accordance with some embodiments of the present disclosure.
  • a conversion between the visual data and a codestream of the visual data is performed with a neural network (NN) -based model.
  • NN neural network
  • a codestream may comprise a sequence of bits.
  • the codestream may further comprise associated codes which are used as markers.
  • the codestream may also be referred to as a bitstream.
  • the conversion may include encoding the visual data into the codestream. Additionally or alternatively, the conversion may include decoding the visual data from the codestream.
  • the decoding model shown in Fig. 6 may be employed for decoding the visual data from the bitstream.
  • the NN-based model comprises a transform process and a plurality of processes different from the transform process.
  • the transform process is configured to perform a transform between a latent representation of the visual data and the visual data.
  • a transform process configured to perform a transform from the visual data to a latent representation of the visual data may also be referred to as an analysis transform
  • a transform process configured to perform a transform from a latent representation of the visual data to the visual data may also be referred to as a synthesis transform.
  • the plurality of processes may comprise a hyper decoding process (a.k.a., a hyper decoder) and a latent tensor reconstruction process.
  • the input of the hyper decoding process may comprise reconstructed hyper tensor and model parameters for hyper decoder network
  • the output of the hyper decoding process may comprise re-shuffled explicit prediction tensor, which is the part of prediction tensor derived from explicitly signaled information.
  • the latent tensor reconstruction process may be implemented by invoking a multistage context modelling (MCM) process.
  • MCM multistage context modelling
  • the input of the MCM process may comprise a reconstructed residual tensor (e.g., a tensor of reconstructed residual samples) , and a re-shuffled explicit prediction tensor which is the output of the above-mentioned hyper decoding process, and the output of the MCM process may comprise a reconstructed latent tensor (e.g., a tensor of reconstructed latent samples) .
  • Respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme.
  • a partitioning scheme may indicate at least one of the following: the number of vertical splits, or the number of horizontal splits.
  • a tensor of size W × H may be split into subtensors of size (W/N) × H, where W represents a width of the tensor, H represents a height of the tensor, and N represents the number of vertical splits.
  • a tensor of size W × H may be split into subtensors of size W × (H/N), where W represents a width of the tensor, H represents a height of the tensor, and N represents the number of horizontal splits.
  • a tensor of size W × H may be split into subtensors of size (W/N) × (H/M), where W represents a width of the tensor, H represents a height of the tensor, N represents the number of vertical splits, and M represents the number of horizontal splits.
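The split of a W × H tensor into (W/N) × (H/M) subtensors can be sketched as a list of rectangles. The function name `split_tensor` and the `(x, y, width, height)` tuple layout are assumptions for illustration. Integer boundaries are used so that the sketch also runs when W or H is not an exact multiple of the split count.

```python
def split_tensor(width, height, n_vertical, m_horizontal):
    """Return (x, y, w, h) rectangles of the subtensors in raster order.

    n_vertical plays the role of N (columns of width about W/N) and
    m_horizontal plays the role of M (rows of height about H/M).
    """
    xs = [i * width // n_vertical for i in range(n_vertical + 1)]
    ys = [j * height // m_horizontal for j in range(m_horizontal + 1)]
    return [(xs[i], ys[j], xs[i + 1] - xs[i], ys[j + 1] - ys[j])
            for j in range(m_horizontal)
            for i in range(n_vertical)]


# An 8 x 6 tensor with N = 2 and M = 3 gives six 4 x 2 subtensors.
rects = split_tensor(8, 6, 2, 3)
```

Setting `m_horizontal` to 1 reproduces the (W/N) × H case, and setting `n_vertical` to 1 reproduces the W × (H/N) case.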
  • a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions. If residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition (i.e., residual samples for different partitions are encapsulated into different substreams of the codestream separately) , each of the plurality of partitions corresponds to an integer number of processing units of the transform process. This will be described with reference to FIG. 47.
  • Fig. 47 illustrates a schematic diagram of collocated tiling in accordance with embodiments of the present disclosure.
  • the tensor 4710 may represent an input of a first processing process (such as the hyper decoding process or the like) .
  • the tensor 4710 may be partitioned into 4 partitions (i.e., 4 subtensors) , one of which is the partition 4711.
  • the 4 partitions of the tensor 4710 may not overlap with each other, as shown in Fig. 47.
  • Alternatively, the 4 partitions of the tensor 4710 may overlap with each other.
  • the tensor 4720 may represent an input of a second processing process (such as the latent tensor reconstruction or the like) .
  • the tensor 4720 may be an output of the first processing process.
  • the tensor 4720 may be partitioned into 4 partitions (i.e., 4 subtensors) according to the same partitioning scheme as the tensor 4710.
  • the 4 partitions of the tensor 4720 may not overlap with each other, as shown in Fig. 47.
  • Alternatively, the 4 partitions of the tensor 4720 may overlap with each other.
  • the partition 4721 corresponds to the partition 4711 and may be generated from the partition 4711.
  • the tensor 4730 may represent an input of a third processing process (such as the synthesis transform or the like) . In some embodiments, the tensor 4730 may be an output of the second processing process.
  • the tensor 4730 may be partitioned into 4 partitions (i.e., 4 subtensors) according to the same partitioning scheme as the tensor 4720. In some embodiments, the 4 partitions of the tensor 4730 may not overlap with each other, as shown in Fig. 47. In some alternative embodiments, the 4 partitions of the tensor 4730 may overlap with each other.
  • the partition 4731 corresponds to the partition 4721 and may be generated from the partition 4721.
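Collocated tiling can be sketched as one partitioning scheme shared by every stage, so that a single partition can be carried through the whole chain on its own. The helpers `split_2d` and `run_collocated` are hypothetical, and the toy stages keep the tensor size unchanged, which the real processes need not do.

```python
def split_2d(tensor, rows, cols):
    """Split a 2-D list into rows x cols non-overlapping partitions."""
    h, w = len(tensor), len(tensor[0])
    ys = [i * h // rows for i in range(rows + 1)]
    xs = [j * w // cols for j in range(cols + 1)]
    return {
        (i, j): [row[xs[j]:xs[j + 1]] for row in tensor[ys[i]:ys[i + 1]]]
        for i in range(rows) for j in range(cols)
    }


def run_collocated(first_input, stages, rows, cols, wanted=None):
    """Apply every stage partition by partition, using one shared scheme.

    If `wanted` is given, only those partitions are processed, which mimics
    regional access: the remaining partitions are never touched.
    """
    parts = split_2d(first_input, rows, cols)
    keys = list(parts) if wanted is None else wanted
    out = {}
    for key in keys:
        data = parts[key]
        for stage in stages:
            data = stage(data)
        out[key] = data
    return out


# Toy stand-ins for the three processes of Fig. 47.
add_one = lambda p: [[v + 1 for v in row] for row in p]
double = lambda p: [[2 * v for v in row] for row in p]

tensor = [[4 * r + c for c in range(4)] for r in range(4)]
region = run_collocated(tensor, [add_one, double], 2, 2, wanted=[(0, 1)])
```

Here only the top-right partition is processed, and each stage output for that partition is generated from the corresponding partition of the previous stage, in the way the partitions 4711, 4721 and 4731 correspond to each other.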
  • a size of one of the integer number of processing units may be allowed to be different from a size of a further one of the integer number of processing units.
  • the processing unit (s) at right boundary and/or the bottom boundary may be smaller than the rest of processing units.
  • a size of one of the integer number of processing units may be determined based on a predetermined algorithm. Alternatively, the size of one of the integer number of processing units may be signaled in the codestream.
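A simple predetermined rule for the processing unit sizes is sketched below: fixed-size units are laid out in raster order and the units at the right and bottom boundaries take the remainder. The function names and the rule itself are assumptions for illustration, not the algorithm of the disclosure.

```python
def unit_sizes(partition_size, unit_size):
    """Sizes of the processing units covering one dimension of a partition."""
    full, rest = divmod(partition_size, unit_size)
    return [unit_size] * full + ([rest] if rest else [])


def processing_units(part_w, part_h, unit_w, unit_h):
    """All (width, height) units of a partition, in raster order."""
    return [(w, h)
            for h in unit_sizes(part_h, unit_h)
            for w in unit_sizes(part_w, unit_w)]


# A 10 x 4 partition with 4 x 4 units: three units, the rightmost is narrower.
units = processing_units(10, 4, 4, 4)
```

The partition is always covered by a whole number of units, and only the boundary units differ in size from the rest.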
  • an input of the transform process may be partitioned into a plurality of partitions based on the first partitioning scheme. That is, respective inputs of the entire processing pipeline (comprising the hyper decoder process, the latent tensor reconstruction process, and the synthesis transform at the decoder side) may be partitioned based on the same partitioning scheme.
  • an input of the transform process may be partitioned based on a second partitioning scheme that is allowed to be different from the first partitioning scheme. That is, respective inputs of the entire processing pipeline (comprising the hyper decoder process, the latent tensor reconstruction process, and the synthesis transform at the decoder side) may be partitioned into a plurality of partitions based on different partitioning schemes. This is also referred to as a hierarchical tiling scheme, details of which will be described with reference to Fig. 48.
  • Fig. 48 illustrates a schematic diagram of hierarchical tiling in accordance with embodiments of the present disclosure.
  • the tensor 4810 may represent an input of a first processing process (such as the hyper decoding process or the like) .
  • the tensor 4810 may be partitioned into 4 partitions (i.e., 4 subtensors) , one of which is the partition 4811.
  • the 4 partitions of the tensor 4810 may not overlap with each other, as shown in Fig. 48.
  • Alternatively, the 4 partitions of the tensor 4810 may overlap with each other.
  • the tensor 4830-1 and the tensor 4830-2 may represent an input of a third processing process (such as the synthesis transform or the like) .
  • the tensor 4830 may be an output of the second processing process, and at the output stage of the second processing process, the tensor 4830 may still follow the same partitioning scheme as the tensors 4811 and 4821. Then, the tensor 4830 is repartitioned according to a partitioning scheme different from that of the tensors 4811 and 4821, before being input to the third processing process.
  • the tensor 4830 before the repartition is also referred to as the tensor 4830-1
  • the tensor 4830 after the repartition is also referred to as the tensor 4830-2.
  • the tensor 4830-1 may be partitioned into 4 partitions (i.e., 4 subtensors) according to the same partitioning scheme as the tensor 4820.
  • the 4 partitions of the tensor 4830-1 may not overlap with each other, as shown in Fig. 48.
  • Alternatively, the 4 partitions of the tensor 4830-1 may overlap with each other.
  • the partition 4831 corresponds to the partition 4821 and may be generated from the partition 4821.
  • the tensor 4830-2 is partitioned into 9 partitions (i.e., 9 subtensors) .
  • the 9 partitions of the tensor 4830-2 may not overlap with each other, as shown in Fig. 48.
  • Alternatively, the 9 partitions of the tensor 4830-2 may overlap with each other.
  • each of the 9 partitions of the tensor 4830-2 may be further partitioned into several sub-partitions, each of which may be used as a processing unit of the third processing process.
  • the partition 4832 may be partitioned into an integer number (such as 2, 4, 6 or the like) of sub-partitions, and the third processing process may be applied on each of the sub-partitions.
  • the partition 4832 may be considered as corresponding to the integer number of processing units of the third processing process.
  • the tensor 4840 may represent the reconstructed visual data, which may be an output of the third processing process (such as the synthesis transform or the like) .
  • the tensor 4840 may be partitioned into 9 partitions (i.e., 9 subtensors) according to the same partitioning scheme as the tensor 4830-2.
  • the 9 partitions of the tensor 4840 may not overlap with each other, as shown in Fig. 48.
  • Alternatively, the 9 partitions of the tensor 4840 may overlap with each other.
  • the partition 4842 corresponds to the partition 4832 and may be generated from the partition 4832.
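The repartitioning step of hierarchical tiling can be sketched as merging the partitions produced under the first scheme and splitting the result again under a second scheme. The helpers `split_2d`, `merge_2d` and `repartition` are hypothetical, and the 6 × 6 toy tensor is chosen only so that both the 2 × 2 and the 3 × 3 scheme divide it evenly.

```python
def split_2d(tensor, rows, cols):
    """Split a 2-D list into a rows x cols grid of non-overlapping partitions."""
    h, w = len(tensor), len(tensor[0])
    ys = [i * h // rows for i in range(rows + 1)]
    xs = [j * w // cols for j in range(cols + 1)]
    return [[[r[xs[j]:xs[j + 1]] for r in tensor[ys[i]:ys[i + 1]]]
             for j in range(cols)]
            for i in range(rows)]


def merge_2d(grid):
    """Stitch a grid of partitions back into one 2-D list."""
    out = []
    for row_of_parts in grid:
        for r in range(len(row_of_parts[0])):
            out.append([v for part in row_of_parts for v in part[r]])
    return out


def repartition(grid, rows, cols):
    """Move from one partitioning scheme to another."""
    return split_2d(merge_2d(grid), rows, cols)


tensor = [[6 * r + c for c in range(6)] for r in range(6)]
first_scheme = split_2d(tensor, 2, 2)            # 4 partitions, as for 4830-1
second_scheme = repartition(first_scheme, 3, 3)  # 9 partitions, as for 4830-2
units = split_2d(second_scheme[1][1], 1, 2)      # sub-partitions of one partition
```

Each partition of the second scheme is then split into a whole number of sub-partitions, each of which can serve as a processing unit of the third process.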
  • the plurality of partitions do not overlap.
  • An example for this case is shown in Fig. 38.
  • the plurality of partitions may be allowed to overlap. An example for this case is shown in Fig. 37.
  • each of the plurality of partitions may be a tile used in the plurality of processes, and each of the integer number of processing units may be a tile used in the transform process.
  • the input of the hyper decoding process and the input of the latent tensor reconstruction process may be split into a plurality of tiles (a.k.a., regions) .
  • the input of the synthesis transform may also be split into the plurality of regions, and each region is further split into a plurality of tiles for the synthesis transform (a.k.a., synthesis tiles). That is, a tile used for the hyper decoding process and the latent tensor reconstruction process is different from a synthesis tile used for the synthesis transform, and a tile used for the hyper decoding process and the latent tensor reconstruction process may correspond to a plurality of synthesis tiles.
  • the NN-based model may comprise a first set of filtering processes following the transform process, and each of the first set of filtering processes may be performed with a corresponding tiling process.
  • the first set of filtering processes may comprise an adaptive linear filtering process, an inter channel correlation information filtering process, a non-linear chroma enhancement filtering process, and/or the like.
  • the NN-based model may comprise a second set of filtering processes following the transform process, and each of the second set of filtering processes may be performed without a tiling process.
  • the second set of filtering processes may comprise a luma edge filtering process, and/or the like.
  • the tiling process may be applied to partition an input of the filtering process, a parameter of the filtering process, and/or the like.
  • the codestream may comprise a first substream and a second substream following the first substream.
  • the first substream may comprise quality map information for the visual data, and may also be referred to as a Q-stream.
  • the second substream may comprise a hyper tensor for the visual data, and may also be referred to as a Z-stream. An example is shown in Fig. 41.
  • the hyper tensor carries information regarding a latent domain prediction for a latent representation of the visual data and/or an entropy parameter for residual of the latent representation.
  • an analysis transform may generate latent representation of the visual data.
  • This latent representation is further compressed, e.g. by using a hyper encoder, to obtain a hyper tensor z, which carries information about latent domain prediction and entropy parameters for residual.
  • the hyper tensor z may be further quantized to obtain a quantized hyper tensor z-hat (i.e., $\hat{z}$).
  • the quantized hyper tensor may also be referred to as a hyper tensor for short. Therefore, the above-mentioned Z-stream may comprise encoded data of the unquantized hyper tensor z, and/or encoded data of the quantized hyper tensor $\hat{z}$.
  • the scope of the present disclosure is not limited in this respect.
  • the codestream may comprise a first substream and a second substream preceding the first substream.
  • the first substream may comprise quality map information for the visual data, and may also be referred to as a Q-stream.
  • the second substream may comprise a hyper tensor for the visual data, and may also be referred to as a Z-stream. An example is shown in Fig. 42.
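The two substream orders can be sketched with a toy container. The byte layout used here (a one-byte tag followed by a four-byte big-endian length and the payload) is entirely hypothetical and is not the codestream syntax of the disclosure; it only shows that the same two substreams can be written with the Q-stream first or with the Z-stream first.

```python
import struct


def pack_codestream(q_stream, z_stream, q_first=True):
    """Concatenate the Q-stream and Z-stream in the chosen order.

    The tag + length framing is an assumption made for this sketch.
    """
    order = [(b"Q", q_stream), (b"Z", z_stream)]
    if not q_first:
        order.reverse()
    return b"".join(tag + struct.pack(">I", len(payload)) + payload
                    for tag, payload in order)


def unpack_codestream(data):
    """Recover the (tag, payload) pairs in the order they were written."""
    out, pos = [], 0
    while pos < len(data):
        tag = data[pos:pos + 1]
        (length,) = struct.unpack_from(">I", data, pos + 1)
        out.append((tag, data[pos + 5:pos + 5 + length]))
        pos += 5 + length
    return out


q_then_z = unpack_codestream(pack_codestream(b"qq", b"zzz", q_first=True))
z_then_q = unpack_codestream(pack_codestream(b"qq", b"zzz", q_first=False))
```

With explicit framing of this kind a reader can locate either substream regardless of the order, which is why both arrangements can be supported by the same parsing loop.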
  • an input of a hyper scale decoding process of the NN-based model may also be partitioned, e.g., based on the first partitioning scheme.
  • the input of the hyper scale decoding process may comprise reconstructed hyper tensor
  • the output of the hyper scale decoding process may comprise standard deviation logarithm tensor.
  • the encoded data of quality map information for the visual data may be split for the plurality of partitions.
  • the split encoded data for each of the plurality of partitions may be in a substream of the codestream that corresponds to the partition. An example is shown in Fig. 45.
  • the solutions in accordance with some embodiments of the present disclosure can advantageously improve coding efficiency and coding flexibility.
  • a non-transitory computer-readable recording medium stores a codestream of visual data which is generated by a method performed by an apparatus for visual data processing.
  • the method comprises: performing a conversion from visual data to the codestream with a neural network (NN) -based model, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions, and in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to
  • a method for storing codestream of visual data comprises: performing a conversion from visual data to the codestream with a neural network (NN) -based model; and storing the codestream in a non-transitory computer-readable recording medium, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions, and in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, each of the plurality of partitions corresponds to an
  • a method for visual data processing comprising: performing a conversion between visual data and a codestream of the visual data with a neural network (NN) -based model, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions, and in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, each of the plurality of partitions corresponds to an integer number of processing units of the transform process.
  • Clause 3 The method of any of clauses 1-2, wherein in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, an input of the transform process is partitioned into a plurality of partitions based on the first partitioning scheme.
  • Clause 7 The method of any of clauses 1-6, wherein the first partitioning scheme indicates at least one of the following: the number of vertical splits, or the number of horizontal splits.
  • Clause 9 The method of any of clauses 1-8, wherein each of the plurality of partitions is a tile used in the plurality of processes, and each of the integer number of processing units is a tile used in the transform process.
  • Clause 11 The method of clause 10, wherein the set of filtering processes comprises at least one of the following: an adaptive linear filtering process, an inter channel correlation information filtering process, or a non-linear chroma enhancement filtering process.
  • Clause 13 The method of clause 12, wherein the second set of filtering processes comprises a luma edge filtering process.
  • Clause 14 The method of any of clauses 1-13, wherein the codestream comprises a first substream and a second substream following the first substream, the first substream comprises quality map information for the visual data, and the second substream comprises a hyper tensor for the visual data.
  • Clause 16 The method of any of clauses 1-15, wherein a size of one of the integer number of processing units is determined based on a predetermined algorithm.
  • Clause 20 The method of clause 19, wherein the split encoded data for each of the plurality of partitions is in a substream of the codestream that corresponds to the partition.
  • Clause 21 The method of any of clauses 1-20, wherein the visual data comprise a picture of a video, or an image.
  • Clause 22 The method of any of clauses 1-21, wherein the conversion includes encoding the visual data into the codestream.
  • Clause 23 The method of any of clauses 1-21, wherein the conversion includes decoding the visual data from the codestream.
  • Clause 24 An apparatus for visual data processing comprising a processor and a non-transitory memory with instructions thereon, wherein the instructions upon execution by the processor, cause the processor to perform a method in accordance with any of clauses 1-23.
  • Clause 25 A non-transitory computer-readable storage medium storing instructions that cause a processor to perform a method in accordance with any of clauses 1-23.
  • a non-transitory computer-readable recording medium storing a codestream of visual data which is generated by a method performed by an apparatus for visual data processing, wherein the method comprises: performing a conversion from visual data to the codestream with a neural network (NN) -based model, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions, and in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, each of the plurality of partitions corresponds to an integer number of
  • a method for storing a codestream of visual data comprising: performing a conversion from visual data to the codestream with a neural network (NN) -based model; and storing the codestream in a non-transitory computer-readable recording medium, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions, and in accordance with a determination that residual samples for each of the plurality of partitions are in a substream of the codestream that corresponds to the partition, each of the plurality of partitions corresponds to an integer number of processing units of the transform process
  • the computing device 4900 includes a general-purpose computing device 4900.
  • the computing device 4900 may at least comprise one or more processors or processing units 4910, a memory 4920, a storage unit 4930, one or more communication units 4940, one or more input devices 4950, and one or more output devices 4960.
  • the computing device 4900 may further include additional detachable/non-detachable, volatile/non-volatile memory medium.
  • additional detachable/non-detachable, volatile/non-volatile memory medium may be provided.
  • a magnetic disk drive for reading from and/or writing into a detachable and non-volatile magnetic disk
  • an optical disk drive for reading from and/or writing into a detachable non-volatile optical disk.
  • each drive may be connected to a bus (not shown) via one or more visual data medium interfaces.
  • the input device 4950 may receive an encoded bitstream as the input 4970.
  • the encoded bitstream may be processed, for example, by the visual data coding module 4925, to generate decoded visual data.
  • the decoded visual data may be provided via the output device 4960 as the output 4980.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present disclosure provide a solution for visual data processing. A method for visual data processing is proposed. The method comprises: performing a conversion between visual data and a codestream of the visual data with a neural network (NN)-based model, wherein the NN-based model comprises a transform process and a plurality of processes different from the transform process, the transform process is configured to perform a transform between a latent representation of the visual data and the visual data, and respective inputs of the plurality of processes are partitioned into a plurality of partitions based on a same first partitioning scheme, and wherein, based on the first partitioning scheme, a set of residual samples associated with the visual data is partitioned into a plurality of subsets of residual samples that correspond to the plurality of partitions.
PCT/CN2025/079822 2024-03-26 2025-02-28 Method, apparatus, and medium for visual data processing Pending WO2025200931A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2024083927 2024-03-26
CNPCT/CN2024/083927 2024-03-26

Publications (1)

Publication Number Publication Date
WO2025200931A1 (fr) 2025-10-02

Family

ID=97216830

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2025/079822 Pending WO2025200931A1 (fr) 2024-03-26 2025-02-28 Method, apparatus, and medium for visual data processing

Country Status (1)

Country Link
WO (1) WO2025200931A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230396801A1 (en) * 2020-11-04 2023-12-07 Vid Scale, Inc. Learned video compression framework for multiple machine tasks
US20240078414A1 (en) * 2021-06-09 2024-03-07 Huawei Technologies Co., Ltd. Parallelized context modelling using information shared between patches
WO2024020112A1 (fr) * 2022-07-19 2024-01-25 Bytedance Inc. Neural network-based adaptive image and video compression method with variable rate
WO2024020403A1 (fr) * 2022-07-22 2024-01-25 Bytedance Inc. Method, apparatus, and medium for visual data processing
WO2024056219A1 (fr) * 2022-09-15 2024-03-21 Nokia Technologies Oy Method, apparatus and computer program product for video encoding and video decoding

Similar Documents

Publication Publication Date Title
US20260019577A1 (en) Method, apparatus, and medium for visual data processing
WO2025072500A1 (fr) Method, apparatus, and medium for visual data processing
WO2024140849A1 (fr) Method, apparatus, and medium for visual data processing
CN121488473A (zh) Method, apparatus, and medium for visual data processing
CN120814229A (zh) Method, apparatus, and medium for visual data processing
US20250373827A1 (en) Method, apparatus, and medium for visual data processing
WO2025198937A1 (fr) Method, apparatus, and medium for visual data processing
WO2024169958A1 (fr) Method, apparatus, and medium for visual data processing
WO2024149392A1 (fr) Method, apparatus, and medium for visual data processing
WO2025200931A1 (fr) Method, apparatus, and medium for visual data processing
WO2025157163A1 (fr) Method, apparatus, and medium for visual data processing
WO2025149063A1 (fr) Method, apparatus, and medium for visual data processing
WO2025146073A1 (fr) Method, apparatus, and medium for visual data processing
WO2025131046A1 (fr) Method, apparatus, and medium for visual data processing
WO2025082522A1 (fr) Method, apparatus, and medium for visual data processing
WO2025082523A1 (fr) Method, apparatus, and medium for visual data processing
WO2025077746A1 (fr) Method, apparatus, and medium for visual data processing
WO2025077744A1 (fr) Method, apparatus, and medium for visual data processing
WO2025077742A1 (fr) Method, apparatus, and medium for visual data processing
WO2024193710A1 (fr) Method, apparatus, and medium for visual data processing
WO2025044947A1 (fr) Method, apparatus, and medium for visual data processing
WO2025087230A1 (fr) Method, apparatus, and medium for visual data processing
WO2025002424A1 (fr) Method, apparatus, and medium for visual data processing
WO2024193708A1 (fr) Method, apparatus, and medium for visual data processing
WO2025153016A1 (fr) Method, apparatus, and medium for visual data processing

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 25778483

Country of ref document: EP

Kind code of ref document: A1