EP4655943A1 - Multi-residual autoencoder for image and video compression - Google Patents

Multi-residual autoencoder for image and video compression

Info

Publication number: EP4655943A1
Authority: EP; European Patent Office
Prior art keywords: video; transform stages; encoding; decoding; image
Prior art date: 2023-01-25
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Pending

Application number

EP24709893.2A

Inventor

Fabien Racape

Syed Mateen UL HAQ

Hyomin CHOI

Wei Jiang

Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)

InterDigital VC Holdings Inc

Original Assignee

InterDigital VC Holdings Inc

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2023-01-25

Filing date

2024-01-24

Publication date

2025-12-03

2024-01-24 Application filed by InterDigital VC Holdings Inc

2025-12-03 Publication of EP4655943A1

Status: Pending

Classifications

- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
- H04N19/33—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
- H04N19/36—Scalability techniques involving formatting the layers as a function of picture distortion after decoding, e.g. signal-to-noise [SNR] scalability

Definitions

At least one of the present embodiments generally relates to a method or an apparatus in the context of the compression of images and videos using Neural Network (NN)-based tools.
one objective of the described embodiments is encoding video content at the highest quality possible, at sequence or subsequence level, in the context of end-to-end NN-based video compression.
a method comprising steps for performing one or more cascaded transform stages on a portion of a video image; performing an operation on said transformed portion of the video image and a prediction of a respective output of the one or more transform stages to produce a residual for each of the one or more transform stages; quantizing the residuals; and, encoding the quantized residuals to generate a tensor output for each of the one or more cascaded transform stages as a bitstream
a method comprising steps for decoding one or more portions of data; performing an operation on the decoded one or more portions of data and predictions from one or more previous cascaded transform stages; and transforming the sum of the additions of the one or more cascaded transform stages to generate an image.
an apparatus comprising a processor.
the processor can be configured to implement the general aspects by executing any of the described methods.
a device comprising an apparatus according to any of the decoding embodiments; and at least one of (i) an antenna configured to receive a signal, the signal including the video block, (ii) a band limiter configured to limit the received signal to a band of frequencies that includes the video block, or (iii) a display configured to display an output representative of a video block.
a non-transitory computer readable medium containing data content generated according to any of the described encoding embodiments or variants.
a signal comprising video data generated according to any of the described encoding embodiments or variants.
a bitstream is formatted to include data content generated according to any of the described encoding embodiments or variants.
a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the described decoding embodiments or variants.
a non-transitory computer readable medium containing data content comprising instructions to perform any of the encoding or decoding methods.
Figure 1 illustrates a fully factorized-prior-based auto-encoder for image compression.
Figure 2 illustrates training of a fully factorized prior model.
Figure 3 illustrates an auto-encoder with a hyperprior architecture.
Figure 4 illustrates one embodiment of a proposed architecture for image compression.
Figure 5 illustrates one embodiment of a proposed architecture for image decompression.
Figure 6 illustrates one embodiment of a proposed method for including additional residual transforms.
Figure 7 illustrates one embodiment of a proposed method with concatenation operations and additional processing for residual tensors.
Figure 8 illustrates one embodiment of a method for encoding video using the described embodiments.
Figure 9 illustrates one embodiment of a method for decoding video using the described embodiments.
Figure 10 illustrates one embodiment of an apparatus for encoding or decoding using the described embodiments.
Figure 11 illustrates a standard, generic video compression scheme.
Figure 12 illustrates a standard, generic video decompression scheme.
Figure 13 illustrates a processor-based system for encoding/decoding under the general described aspects.
Encoder methods are used to enhance or speed up an encoder of an existing codec. In that case, there is no normative change; an existing standard can be used.
End-to-end NN-based codecs.
the traditional compression scheme including prediction, transform, quantization and entropy coding is completely disrupted.
This invention corresponds to the third case, as the proposed approach details the architecture of a compression model that can be trained end-to-end.
the input x to the encoder part of the network can consist of an image or frame of a video, a part of an image, a tensor representing a group of images, or a tensor representing a part (crop) of a group of images.
the input can have one or multiple components, e.g.: monochrome, RGB or YCbCr components.
the input x is fed into the encoder network g_a, also known as the analysis transform.
g_a is usually a sequence of convolutional layers with activation functions.
the convolutions can include a mechanism to spatially down-sample the input, for instance selecting a convolution with stride of 2 in both vertical and horizontal directions would result in an output having half the size of the input in both dimensions.
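The effect of the stride on the spatial size can be checked with the usual output-size formula for a convolution. The sketch below is illustrative only: the kernel size of 5 and padding of 2 are assumed values, not taken from the source, chosen so that a stride of 2 halves each dimension exactly.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a strided convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical layer: 5x5 kernel, stride 2, padding 2, applied to a 256x256 input.
height = conv_output_size(256, kernel=5, stride=2, padding=2)
width = conv_output_size(256, kernel=5, stride=2, padding=2)
print(height, width)  # 128 128
```

Applying the same layer N times in cascade divides each spatial dimension by 2^N, which is why the latent holds far fewer spatial positions than the input.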
the output of a convolution is a tensor of shape CxHxW, where H and W are the spatial height and width, respectively, and C corresponds to an adjustable number of channels.
the first convolution of g_a takes as input a tensor of 3 channels, which correspond to the color components.
This encoder network can be seen as a learned transform, that is lossy as there are generally fewer elements in the output latent tensor than the source CxHxW input.
the output of the analysis, mostly in the form of a 3-way array referred to as a 3-D tensor, is called a latent representation or a tensor of latent variables.
a set of latent variables constructs a latent space, which is also frequently used in the context of neural network-based end-to-end compression.
the output y = g_a(x) is quantized, resulting in a tensor ŷ which is then entropy coded into a binary stream (bitstream) for storage or transmission.
the bitstream is entropy decoded (ED) to obtain ŷ.
the performance of a compression system is measured as a tradeoff between the number of bits needed to transmit versus the quality of the decoded content.
a compression model can be trained using a loss following the Lagrangian form:
R represents the rate or bitrate and D the distortion of the decoded content.
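The equation itself did not survive in this text. A commonly used Lagrangian form that is consistent with the definitions of R and D given here is sketched below; the multiplier λ and its placement on the distortion term are assumptions, not the source's exact notation.

```latex
\mathcal{L} = R + \lambda \, D
```

Under this convention, a larger λ penalizes distortion more strongly and so pushes the trained model toward higher quality at a higher bitrate.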
Figure 1 and Figure 2 autoencoders in the actual inference configuration that consists of an encoder producing a bitstream that is then transmitted and decoded.
the entropy of ŷ with respect to the learned probability model p_ŷ, as depicted in Figure 3.
GANs Generative Adversarial Networks
the entropy encoder and decoder rely on a simple fully factorized prior, as depicted in Figure 1.
This method usually considers separate trained entropy models per channel of the latent.
the spatial correlations in the latent y are not considered, as each of the samples is encoded using the same distribution, i.e., assuming that they are independent and identically distributed (i.i.d.).
y is not i.i.d., and more recent approaches have tackled this specific issue.
Figure 3 depicts a popular approach called auto-encoder with a hyperprior [2]: the model now includes the additional convolutional sequences h_a and h_s, which output the learned distribution parameters, i.e., the scales, or the means and scales, of Gaussian or Laplace distributions for each element of the latent y.
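To make the fully factorized assumption concrete, the sketch below estimates the code length of a quantized latent when every sample of a channel is coded with the same per-channel probability table. The function name, the toy latent, and the probability tables are all hypothetical; a real codec would learn the distributions and use an arithmetic coder.

```python
import math

def factorized_rate_bits(latent, channel_pmfs):
    """Estimated code length, in bits, under one probability table per channel.

    latent: list of channels, each a flat list of quantized integer symbols.
    channel_pmfs: list of dicts mapping symbol -> probability, one per channel.
    """
    bits = 0.0
    for symbols, pmf in zip(latent, channel_pmfs):
        for s in symbols:
            # Every sample in a channel shares the same distribution (i.i.d. assumption).
            bits += -math.log2(pmf[s])
    return bits

latent = [[0, 0, 1, -1], [2, 2, 2, 2]]                   # toy latent with two channels
pmfs = [{0: 0.5, 1: 0.25, -1: 0.25}, {2: 0.5, 3: 0.5}]   # hypothetical learned tables
total_bits = factorized_rate_bits(latent, pmfs)
print(total_bits)  # 10.0
```

Because the table ignores the position of a sample, any spatial structure in the latent is left unexploited, which is the limitation that the hyperprior addresses.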
the quantized tensor output by h_a needs to be encoded and transmitted as side information for the decoder to decode y.
transmitting that tensor z using a fully factorized approach does not cost much overhead, as z corresponds to y further downscaled by h_a, and the efficiency of using tailored Gaussians for each element of y dramatically outweighs the burden of transmitting the light z.
the architecture of the encoder and the decoder does not adapt to the content: the number of layers in {g_a, g_s}, in particular the number of down-sampling operations in g_a and of up-sampling operations in g_s, is fixed and was determined before designing and training the system. Since convolutions and down-sampling/up-sampling are lossy operations, it may be beneficial to let the encoder transmit some information before some of these transformations turn the input images into lower-resolution compressible tensors.
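The benefit of per-element distribution parameters can be illustrated with a Gaussian entropy model. A common choice, assumed here and not stated in the source, is to take the probability of an integer symbol as the Gaussian mass over the unit-width bin around it. The function names are hypothetical.

```python
import math

def normal_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian with the given mean and scale."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def symbol_probability(y_hat, mean, scale):
    """Probability mass of integer y_hat: Gaussian integrated over [y_hat - 0.5, y_hat + 0.5]."""
    return normal_cdf(y_hat + 0.5, mean, scale) - normal_cdf(y_hat - 0.5, mean, scale)

def symbol_bits(y_hat, mean, scale):
    """Ideal code length in bits for one quantized latent element."""
    return -math.log2(symbol_probability(y_hat, mean, scale))

# The same symbol costs fewer bits when the predicted scale is tight around it.
bits_wide = symbol_bits(0, mean=0.0, scale=4.0)
bits_narrow = symbol_bits(0, mean=0.0, scale=0.5)
print(bits_narrow < bits_wide)  # True
```

When the predicted scale matches the local statistics of the latent, the savings on y can exceed the cost of sending the side information z.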
This invention proposes a novel auto-encoder architecture as depicted in Figure 4, which can be trained end-to-end to optimize the transmission of relevant details through the different sub-bitstreams.
the encoder can be decomposed into sub-analysis transforms {g_a^0, g_a^1, ..., g_a^N}, which can correspond to any sub-block of existing convolution-based operations.
each g_a^l can consist of a single convolution and an activation function such as a ReLU (Rectified Linear Unit) or the popular Generalized Divisive Normalization (GDN).
the total chain can for instance correspond to the one described in Figure 1 , producing the quantized latent y 2 .
the decoder also contains the corresponding synthesis operations {g_s^0, g_s^1, ..., g_s^N}.
transforms and residual branches can be numbered in reverse order, i.e., g^0 corresponds to the last branch at the lowest resolution and each residual branch index is incremented with the higher resolutions.
This residual will be quantized and encoded using any entropy bottleneck model.
the invention does not limit to the example presented in Figure 4, where we represent the blocks of a fully factorized prior-based model for simplicity.
the “Encoder” part depicted in Figure 4 represents the encoder, coupled with a decoder.
the encoder needs to include the relevant synthesis functions g_s^l to derive the residuals, like in traditional encoders which include most decoding operations to simulate the decoding for prediction and residual extraction. Only g could be part of the decoder only.
the number of elements N+1 of g_a^l, g_s^l is part of the architecture design, as is the case for the number of convolutions or residual blocks, for instance, in other autoencoders.
the decoder simply parses and decodes the different sub-bitstreams it receives, as shown in Figure 5. Contrary to the encoder, which needs to include the whole system to derive the residuals, the decoder is provided exactly the information it needs to decode the different tensors. This requires an adequate bitstream syntax, which will be detailed in Section 5.
the extracted residuals are directly fed to a corresponding quantization and entropy bottleneck block.
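The overall data flow can be sketched with a toy one-dimensional example. This is a sketch under strong assumptions: pair averaging and sample repetition stand in for the learned analysis and synthesis stages, rounding stands in for the quantization and entropy bottleneck, and all function names are hypothetical. It only illustrates how residuals are extracted against predictions that the decoder can reproduce.

```python
def analysis(t):
    """Toy stand-in for one analysis stage: average neighbouring pairs (halves the length)."""
    return [(t[i] + t[i + 1]) / 2 for i in range(0, len(t), 2)]

def synthesis(t):
    """Toy stand-in for one synthesis stage: repeat each sample (doubles the length)."""
    return [v for v in t for _ in range(2)]

def quantize(t):
    return [round(v) for v in t]

def encode(x, num_stages):
    # Cascaded analysis: levels[-1] is the lowest-resolution tensor.
    levels, t = [], x
    for _ in range(num_stages):
        t = analysis(t)
        levels.append(t)
    base = quantize(levels[-1])              # base latent
    recon, residuals = base, []
    for y in reversed(levels[:-1]):          # from low to high resolution
        pred = synthesis(recon)              # prediction the decoder can also compute
        res = quantize([a - b for a, b in zip(y, pred)])
        residuals.append(res)
        recon = [p + r for p, r in zip(pred, res)]
    return base, residuals

def decode(base, residuals):
    recon = base
    for res in residuals:
        recon = [p + r for p, r in zip(synthesis(recon), res)]
    return synthesis(recon)                  # final synthesis back to input resolution

x = [10, 12, 20, 22, 30, 34, 40, 44]
base, residuals = encode(x, num_stages=3)
image = decode(base, residuals)
print(base, residuals, image)
```

The encoder runs the synthesis stages itself so that each residual is computed against exactly the reconstruction the decoder will have, while the decoder only adds decoded residuals to its own predictions.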
Additional convolutions can transform a tensor of size C_i x H_i x W_i into a tensor of size C_j x H_i x W_i with, for instance, j > i.
Figure 6 depicts such a system, where transforms g_r^l are inserted to further reshape and process the residual tensors so that their compressibility and the added quality/resolution refinements improve the rate-distortion performance of the overall bitstream.
g_r^l can be as simple as 1x1 2D convolutions used to change the number of channels, but they can also include advanced mechanisms like those in g_a^l. Note, however, that any down-sampling operation would be somewhat redundant with the other (next) residual branches, which already operate at lower spatial resolution.
the extraction of residuals is performed by subtracting the reconstructed tensor ŷ^l, which can be retrieved at the decoder, from the current tensor y^l at the encoder.
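A 1x1 convolution mixes channels at each spatial position without changing the height or width. The minimal sketch below uses nested lists and hand-picked weights; the function name and the values are illustrative, and a trained residual transform would learn its weights.

```python
def conv1x1(tensor, weights):
    """Apply a 1x1 convolution (no bias) to a tensor stored as [C_in][H][W].

    weights is a [C_out][C_in] matrix; the spatial size H x W is unchanged.
    """
    c_in = len(tensor)
    height, width = len(tensor[0]), len(tensor[0][0])
    assert all(len(row) == c_in for row in weights), "weights must be C_out x C_in"
    return [
        [[sum(w[c] * tensor[c][i][j] for c in range(c_in)) for j in range(width)]
         for i in range(height)]
        for w in weights
    ]

residual = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]   # 2 x 2 x 2 tensor
weights = [[1, 1], [1, -1], [0, 2]]                    # maps 2 channels to 3
out = conv1x1(residual, weights)                       # 3 x 2 x 2 tensor
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 2
```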
subtracting might not lead to the most compressible content.
in this section, as in Section 5.1, it is proposed to utilize the flexibility of convolution operations to extract the relevant information to transmit at each resolution level.
the subtraction can be replaced by a concatenation along the channel axis, as described in Figure 7.
the resulting tensor after concatenation can have a size of 2C_i x H_i x W_i if y^l and ŷ^l have the same number of channels C_i.
a block g_r^l of at least one convolution can be inserted after the concatenation to transform that tensor into a latent tensor with the desired number of channels C_k. For instance, it might be relevant to transmit fewer channels than the base latent tensor, i.e., the one that has the lowest resolution, since at higher resolutions the aim is just to include additional information in the bitstream, but at a controlled cost in terms of bitrate.
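The concatenation variant can be sketched as follows, with nested lists standing in for tensors. The weights are hand-picked so that the 1x1 convolution reproduces a plain subtraction, which shows that concatenation followed by a convolution includes subtraction as a special case; a trained block would be free to learn something else. All names and values are hypothetical.

```python
def shape(tensor):
    """Return (C, H, W) for a tensor stored as nested lists [C][H][W]."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))

def concat_channels(current, reconstructed):
    """Concatenate two [C][H][W] tensors along the channel axis."""
    assert shape(current)[1:] == shape(reconstructed)[1:], "spatial sizes must match"
    return current + reconstructed

def reduce_channels(tensor, weights):
    """Stand-in for the residual block: a 1x1 convolution producing len(weights) channels."""
    c, h, w = shape(tensor)
    return [[[sum(row[k] * tensor[k][i][j] for k in range(c)) for j in range(w)]
             for i in range(h)] for row in weights]

y = [[[5, 6]], [[7, 8]]]        # current tensor, 2 x 1 x 2
y_hat = [[[4, 6]], [[7, 9]]]    # reconstructed tensor, 2 x 1 x 2
stacked = concat_channels(y, y_hat)   # 4 x 1 x 2, i.e. twice the channels
# Hand-picked weights that reproduce y - y_hat channel by channel.
latent = reduce_channels(stacked, [[1, 0, -1, 0], [0, 1, 0, -1]])
print(shape(stacked), latent)
```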
the same type of operation can be performed, where the decoded residual is concatenated with the incoming tensor y.
the transforms g_s^l need to be adapted to take the correct number of input channels.
the different sub-bitstreams need to be indexed to be parsed by the decoder. Each would need a layer syntax header which contains its ID, or layer index, as well as the information required for the decoder to start the decoding process.
residual_layer_id shall be the same for all VCL NAL units of a coded tensor.
each tensor to decode needs to be either indicated at this layer level, or it could be derived by the decoder from either higher-level syntax or from previously decoded layers.
Each layer header can then include the horizontal and vertical sizes of the tensor, as well as the number of channels, expressed as non-negative integers, e.g., coded as 16-bit values. This enables a decoder to independently decode each layer. If this overhead is too high for very low bitrate residual layers, these parameters can instead be derived from the layer index and the transmitted size of the original content via higher-level syntax.
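A layer header of this kind could be serialized as in the sketch below. The byte layout is purely hypothetical: the text only states that the sizes and channel count can be coded as 16-bit values and that a layer index is carried, so the 1-byte index, the field order, and the big-endian byte order are assumptions made for illustration.

```python
import struct

# Hypothetical layout: 1-byte layer id, then three unsigned 16-bit fields, big-endian.
HEADER_FORMAT = ">BHHH"

def pack_layer_header(layer_id, width, height, channels):
    """Serialize the layer index and tensor dimensions into 7 bytes."""
    return struct.pack(HEADER_FORMAT, layer_id, width, height, channels)

def unpack_layer_header(data):
    """Parse the header from the start of a sub-bitstream."""
    size = struct.calcsize(HEADER_FORMAT)
    layer_id, width, height, channels = struct.unpack(HEADER_FORMAT, data[:size])
    return {"residual_layer_id": layer_id, "width": width,
            "height": height, "channels": channels}

header = pack_layer_header(layer_id=1, width=64, height=32, channels=192)
parsed = unpack_layer_header(header + b"payload")
print(len(header), parsed)
```

With such a header, a decoder can allocate the tensor for a layer before touching its entropy-coded payload, at a fixed cost of a few bytes per layer.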
One embodiment of a method 800 for encoding video data is shown in Figure 8.
the method commences at Start block 801 and proceeds to block 810 for performing one or more cascaded transform stages on a portion of a video image.
Control proceeds from block 810 to block 820 for performing an operation on the transformed portion of the video image and a prediction of a respective output of the one or more transform stages to produce a residual for each of the one or more transform stages.
Control proceeds from block 820 to block 830 for quantizing the residuals.
Control proceeds from block 830 to block 840 for encoding the quantized residuals to generate a tensor output for each of the one or more cascaded transform stages as a bitstream.
One embodiment of a method 900 for decoding video data is shown in Figure 9.
the method commences at Start block 901 and proceeds to block 910 for decoding one or more portions of data.
Control proceeds from block 910 to block 920 for performing an operation on the decoded one or more portions of data and predictions from one or more previous cascaded transform stages.
Control proceeds from block 920 to block 930 for transforming the sum of the additions of the one or more cascaded transform stages to generate an image.
Figure 10 shows one embodiment of an apparatus 1000 for compressing, encoding or decoding video using the aforementioned methods.
the apparatus comprises Processor 1010 and can be interconnected to a memory 1020 through at least one port. Both Processor 1010 and memory 1020 can also have one or more additional interconnections to external connections.
Processor 1010 is also configured to either insert or receive information in a bitstream and to either compress, encode, or decode using the aforementioned methods.
the embodiments described here include a variety of aspects, including tools, features, embodiments, models, approaches, etc. Many of these aspects are described with specificity and, at least to show the individual characteristics, are often described in a manner that may sound limiting. However, this is for purposes of clarity in description, and does not limit the application or scope of those aspects. Indeed, all of the different aspects can be combined and interchanged to provide further aspects. Moreover, the aspects can be combined and interchanged with aspects described in earlier filings as well.
the terms “reconstructed” and “decoded” may be used interchangeably, the terms “pixel” and “sample” may be used interchangeably, the terms “image,” “picture” and “frame” may be used interchangeably.
the term “reconstructed” is used at the encoder side while “decoded” or “reconstructed” is used at the decoder side.
Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., such as, for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an ordering to the modified operations unless specifically required. So, in this example, the first decoding need not be performed before the second decoding, and may occur, for example, before, during, or in an overlapping time period with the second decoding.
Figure 11 illustrates an encoder 100. Variations of this encoder 100 are contemplated, but the encoder 100 is described below for purposes of clarity without describing all expected variations.
the video sequence may go through pre-encoding processing (101), for example, applying a color transform to the input color picture (e.g., conversion from RGB 4:4:4 to YCbCr 4:2:0), or performing a remapping of the input picture components in order to get a signal distribution more resilient to compression (for instance using a histogram equalization of one of the color components).
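As an illustration of such a color transform, the sketch below converts one RGB pixel to YCbCr and subsamples a chroma plane to 4:2:0. The source does not specify a conversion matrix; the full-range BT.601 coefficients used in JFIF are assumed here, and averaging each 2x2 block is only one common way to subsample chroma.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601 / JFIF coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Round and clamp to the 8-bit range.
    return tuple(min(255, max(0, round(v))) for v in (y, cb, cr))

def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (even dimensions assumed)."""
    return [[(plane[i][j] + plane[i][j + 1] + plane[i + 1][j] + plane[i + 1][j + 1]) / 4
             for j in range(0, len(plane[0]), 2)]
            for i in range(0, len(plane), 2)]

print(rgb_to_ycbcr(255, 255, 255))     # (255, 128, 128)
print(subsample_420([[1, 3], [5, 7]])) # [[4.0]]
```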
Metadata can be associated with the pre-processing and attached to the bitstream.
a picture is encoded by the encoder elements as described below.
the picture to be encoded is partitioned (102) and processed in units of, for example, CUs. Each unit is encoded using, for example, either an intra or inter mode.
When a unit is encoded in an intra mode, it performs intra prediction (160). In an inter mode, motion estimation (175) and compensation (170) are performed.
the encoder decides (105) which one of the intra mode or inter mode to use for encoding the unit, and indicates the intra/inter decision by, for example, a prediction mode flag. Prediction residuals are calculated, for example, by subtracting (110) the predicted block from the original image block.
the prediction residuals are then transformed (125) and quantized (130).
the quantized transform coefficients, as well as motion vectors and other syntax elements, are entropy coded (145) to output a bitstream.
the encoder can skip the transform and apply quantization directly to the non-transformed residual signal.
the encoder can bypass both transform and quantization, i.e., the residual is coded directly without the application of the transform or quantization processes.
the encoder decodes an encoded block to provide a reference for further predictions.
the quantized transform coefficients are de-quantized (140) and inverse transformed (150) to decode prediction residuals.
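The residual loop of this generic scheme can be sketched for the transform-skip case mentioned above, where quantization is applied directly to the prediction residual. The uniform quantizer, the step size, and the 2x2 block are illustrative assumptions; a real codec would normally apply a transform and use a standard-defined quantizer.

```python
def residual_block(original, predicted):
    """Prediction residual: original block minus predicted block."""
    return [[o - p for o, p in zip(orow, prow)] for orow, prow in zip(original, predicted)]

def quantize(block, step):
    return [[round(v / step) for v in row] for row in block]

def dequantize(levels, step):
    return [[v * step for v in row] for row in levels]

def reconstruct(predicted, decoded_residual):
    """Reconstruction used as reference for further predictions."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(predicted, decoded_residual)]

original = [[104, 92], [101, 107]]
predicted = [[100, 100], [100, 100]]
step = 4  # hypothetical quantization step size

levels = quantize(residual_block(original, predicted), step)
reconstructed = reconstruct(predicted, dequantize(levels, step))
print(levels)         # [[1, -2], [0, 2]]
print(reconstructed)  # [[104, 92], [100, 108]]
```

The encoder keeps the reconstructed block rather than the original as its reference, so that its predictions match those of the decoder despite the quantization loss.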
In-loop filters (165) are applied to the reconstructed picture to perform, for example, deblocking/SAO (Sample Adaptive Offset) filtering to reduce encoding artifacts.
the filtered image is stored at a reference picture buffer (180).
the input of the decoder includes a video bitstream, which can be generated by video encoder 100.
the bitstream is first entropy decoded (230) to obtain transform coefficients, motion vectors, and other coded information.
the picture partition information indicates how the picture is partitioned.
the decoder may therefore divide (235) the picture according to the decoded picture partitioning information.
the transform coefficients are de-quantized (240) and inverse transformed (250) to decode the prediction residuals.
An image block is reconstructed by combining (255) the decoded prediction residuals and the predicted block.
the predicted block can be obtained (270) from intra prediction (260) or motion-compensated prediction (i.e., inter prediction) (275).
In-loop filters (265) are applied to the reconstructed image.
the filtered image is stored at a reference picture buffer (280).
the decoded picture can further go through post-decoding processing (285), for example, an inverse color transform (e.g., conversion from YCbCr 4:2:0 to RGB 4:4:4) or an inverse remapping performing the inverse of the remapping process performed in the pre-encoding processing (101).
post-decoding processing can use metadata derived in the pre-encoding processing and signaled in the bitstream.
FIG. 13 illustrates a block diagram of an example of a system in which various aspects and embodiments are implemented.
System 1000 can be embodied as a device including the various components described below and is configured to perform one or more of the aspects described in this document. Examples of such devices include, but are not limited to, various electronic devices such as personal computers, laptop computers, smartphones, tablet computers, digital multimedia set top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers.
Elements of system 1000, singly or in combination can be embodied in a single integrated circuit (IC), multiple ICs, and/or discrete components.
the processing and encoder/decoder elements of system 1000 are distributed across multiple ICs and/or discrete components.
system 1000 is communicatively coupled to one or more other systems, or other electronic devices, via, for example, a communications bus or through dedicated input and/or output ports.
system 1000 is configured to implement one or more of the aspects described in this document.
the system 1000 includes at least one processor 1010 configured to execute instructions loaded therein for implementing, for example, the various aspects described in this document.
Processor 1010 can include embedded memory, input output interface, and various other circuitries as known in the art.
the system 1000 includes at least one memory 1020 (e.g., a volatile memory device, and/or a non-volatile memory device).
System 1000 includes a storage device 1040, which can include non-volatile memory and/or volatile memory, including, but not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash, magnetic disk drive, and/or optical disk drive.
the storage device 1040 can include an internal storage device, an attached storage device (including detachable and non-detachable storage devices), and/or a network accessible storage device, as non-limiting examples.
System 1000 includes an encoder/decoder module 1030 configured, for example, to process data to provide an encoded video or decoded video, and the encoder/decoder module 1030 can include its own processor and memory.
the encoder/decoder module 1030 represents module(s) that can be included in a device to perform the encoding and/or decoding functions. As is known, a device can include one or both of the encoding and decoding modules. Additionally, encoder/decoder module 1030 can be implemented as a separate element of system 1000 or can be incorporated within processor 1010 as a combination of hardware and software as known to those skilled in the art.
Program code to be loaded onto processor 1010 or encoder/decoder 1030 to perform the various aspects described in this document can be stored in storage device 1040 and subsequently loaded onto memory 1020 for execution by processor 1010.
processor 1010, memory 1020, storage device 1040, and encoder/decoder module 1030 can store one or more of various items during the performance of the processes described in this document.
Such stored items can include, but are not limited to, the input video, the decoded video or portions of the decoded video, the bitstream, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic.
memory inside of the processor 1010 and/or the encoder/decoder module 1030 is used to store instructions and to provide working memory for processing that is needed during encoding or decoding.
a memory external to the processing device (for example, the processing device can be either the processor 1010 or the encoder/decoder module 1030) is used for one or more of these functions.
the external memory can be the memory 1020 and/or the storage device 1040, for example, a dynamic volatile memory and/or a non-volatile flash memory.
an external non-volatile flash memory is used to store the operating system of, for example, a television.
a fast external dynamic volatile memory such as a RAM is used as working memory for video coding and decoding operations, such as for MPEG-2 (MPEG refers to the Moving Picture Experts Group, MPEG-2 is also referred to as ISO/IEC 13818, and 13818-1 is also known as H.222, and 13818-2 is also known as H.262), HEVC (HEVC refers to High Efficiency Video Coding, also known as H.265 and MPEG-H Part 2), or VVC (Versatile Video Coding, a new standard being developed by JVET, the Joint Video Experts Team).
the input to the elements of system 1000 can be provided through various input devices as indicated in block 1130.
Such input devices include, but are not limited to, (i) a radio frequency (RF) portion that receives an RF signal transmitted, for example, over the air by a broadcaster, (ii) a Component (COMP) input terminal (or a set of COMP input terminals), (iii) a Universal Serial Bus (USB) input terminal, and/or (iv) a High Definition Multimedia Interface (HDMI) input terminal.
the input devices of block 1130 have associated respective input processing elements as known in the art.
the RF portion can be associated with elements suitable for (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a band of frequencies), (ii) downconverting the selected signal, (iii) band-limiting again to a narrower band of frequencies to select (for example) a signal frequency band which can be referred to as a channel in certain embodiments, (iv) demodulating the downconverted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select the desired stream of data packets.
the RF portion of various embodiments includes one or more elements to perform these functions, for example, frequency selectors, signal selectors, bandlimiters, channel selectors, filters, downconverters, demodulators, error correctors, and demultiplexers.
the RF portion can include a tuner that performs various of these functions, including, for example, downconverting the received signal to a lower frequency (for example, an intermediate frequency or a near-baseband frequency) or to baseband.
the RF portion and its associated input processing element receives an RF signal transmitted over a wired (for example, cable) medium, and performs frequency selection by filtering, downconverting, and filtering again to a desired frequency band.
Adding elements can include inserting elements in between existing elements, such as, for example, inserting amplifiers and an analog-to-digital converter.
the RF portion includes an antenna.
USB and/or HDMI terminals can include respective interface processors for connecting system 1000 to other electronic devices across USB and/or HDMI connections.
various aspects of input processing for example, Reed-Solomon error correction
aspects of USB or HDMI interface processing can be implemented within separate interface ICs or within processor 1010 as necessary.
the demodulated, error corrected, and demultiplexed stream is provided to various processing elements, including, for example, processor 1010, and encoder/decoder 1030 operating in combination with the memory and storage elements to process the datastream as necessary for presentation on an output device.
Various elements of system 1000 can be provided within an integrated housing. Within the integrated housing, the various elements can be interconnected and transmit data therebetween using a suitable connection arrangement, for example, an internal bus as known in the art, including the Inter-IC (I2C) bus, wiring, and printed circuit boards.
the system 1000 includes communication interface 1050 that enables communication with other devices via communication channel 1060.
the communication interface 1050 can include, but is not limited to, a transceiver configured to transmit and to receive data over communication channel 1060.
the communication interface 1050 can include, but is not limited to, a modem or network card and the communication channel 1060 can be implemented, for example, within a wired and/or a wireless medium.
Wi-Fi Wireless Fidelity
IEEE 802.11 IEEE refers to the Institute of Electrical and Electronics Engineers
the Wi-Fi signal of these embodiments is received over the communications channel 1060 and the communications interface 1050 which are adapted for Wi-Fi communications.
the communications channel 1060 of these embodiments is typically connected to an access point or router that provides access to external networks including the Internet for allowing streaming applications and other over-the-top communications.
Other embodiments provide streamed data to the system 1000 using a set-top box that delivers the data over the HDMI connection of the input block 1130.
Still other embodiments provide streamed data to the system 1000 using the RF connection of the input block 1130.
various embodiments provide data in a non-streaming manner.
various embodiments use wireless networks other than Wi-Fi, for example a cellular network or a Bluetooth network.
the system 1000 can provide an output signal to various output devices, including a display 1100, speakers 1110, and other peripheral devices 1120.
the display 1100 of various embodiments includes one or more of, for example, a touchscreen display, an organic light-emitting diode (OLED) display, a curved display, and/or a foldable display.
the display 1100 can be for a television, a tablet, a laptop, a cell phone (mobile phone), or another device.
the display 1100 can also be integrated with other components (for example, as in a smart phone), or separate (for example, an external monitor for a laptop).
the other peripheral devices 1120 include, in various examples of embodiments, one or more of a stand-alone digital video disc (or digital versatile disc) (DVD, for both terms), a disk player, a stereo system, and/or a lighting system.
Various embodiments use one or more peripheral devices 1120 that provide a function based on the output of the system 1000. For example, a disk player performs the function of playing the output of the system 1000.
control signals are communicated between the system 1000 and the display 1100, speakers 1110, or other peripheral devices 1120 using signaling such as AV.Link, Consumer Electronics Control (CEC), or other communications protocols that enable device-to-device control with or without user intervention.
the output devices can be communicatively coupled to system 1000 via dedicated connections through respective interfaces 1070, 1080, and 1090. Alternatively, the output devices can be connected to system 1000 using the communications channel 1060 via the communications interface 1050.
the display 1100 and speakers 1110 can be integrated in a single unit with the other components of system 1000 in an electronic device such as, for example, a television.
the display interface 1070 includes a display driver, such as, for example, a timing controller (T Con) chip.
the display 1100 and speaker 1110 can alternatively be separate from one or more of the other components, for example, if the RF portion of input 1130 is part of a separate set-top box.
the output signal can be provided via dedicated output connections, including, for example, HDMI ports, USB ports, or COMP outputs.
the embodiments can be carried out by computer software implemented by the processor 1010 or by hardware, or by a combination of hardware and software. As a non-limiting example, the embodiments can be implemented by one or more integrated circuits.
the memory 1020 can be of any type appropriate to the technical environment and can be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples.
the processor 1010 can be of any type appropriate to the technical environment, and can encompass one or more of microprocessors, general purpose computers, special purpose computers, and processors based on a multi-core architecture, as non-limiting examples.
Decoding can encompass all or part of the processes performed, for example, on a received encoded sequence to produce a final output suitable for display.
processes include one or more of the processes typically performed by a decoder, for example, entropy decoding, inverse quantization, inverse transformation, and differential decoding.
processes also, or alternatively, include processes performed by a decoder of various implementations described in this application.
decoding refers only to entropy decoding
decoding refers only to differential decoding
decoding refers to a combination of entropy decoding and differential decoding.
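As a rough illustration of two of the decoder-side processes named above, the following sketch chains inverse quantization and differential decoding. It is a toy under stated assumptions: the function names, the step size, and the input levels are hypothetical, and the entropy decoding stage is assumed to have already produced the integer levels.

```python
from itertools import accumulate

def inverse_quantize(levels, step):
    """Map integer quantization levels back to reconstructed values."""
    return [level * step for level in levels]

def differential_decode(deltas):
    """Undo differential coding: each sample is the running sum of the deltas."""
    return list(accumulate(deltas))

# Hypothetical levels, standing in for the output of an entropy decoder.
levels = [5, 1, -2, 0, 3]
step = 4  # assumed quantization step size

deltas = inverse_quantize(levels, step)
samples = differential_decode(deltas)
print(samples)
```

An inverse transformation would normally sit between these two steps; it is left out here to keep the sketch short.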
encoding can encompass all or part of the processes performed, for example, on an input video sequence to produce an encoded bitstream.
processes include one or more of the processes typically performed by an encoder, for example, partitioning, differential encoding, transformation, quantization, and entropy encoding.
processes also, or alternatively, include processes performed by an encoder of various implementations described in this application.
encoding refers only to entropy encoding
encoding refers only to differential encoding
encoding refers to a combination of differential encoding and entropy encoding.
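The encoder-side processes can be sketched in the same toy fashion. The example below applies differential encoding, quantization, and a stand-in entropy coder. All names and values are hypothetical; `zlib` is used only as a convenient lossless coder and is not the entropy coder of any embodiment. The differences are quantized in an open loop, which is a simplification: a practical encoder would typically predict from reconstructed samples to avoid drift.

```python
import zlib

def differential_encode(samples):
    """Replace each sample by its difference from the previous sample."""
    deltas, previous = [], 0
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas

def quantize(values, step):
    """Uniform quantization to integer levels (open loop, for illustration)."""
    return [round(value / step) for value in values]

def entropy_encode(levels):
    """Stand-in entropy coder: pack levels in -128..127 as bytes, then deflate."""
    raw = bytes(level & 0xFF for level in levels)
    return zlib.compress(raw)

samples = [20, 24, 16, 16, 28]  # hypothetical input samples
levels = quantize(differential_encode(samples), step=4)
bitstream = entropy_encode(levels)
print(levels, len(bitstream))
```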
syntax elements used herein are descriptive terms. As such, they do not preclude the use of other syntax element names.
Various embodiments may refer to parametric models or rate distortion optimization.
the balance or trade-off between the rate and distortion is usually considered, often given the constraints of computational complexity. It can be measured through a Rate Distortion Optimization (RDO) metric, or through Least Mean Square (LMS), Mean of Absolute Errors (MAE), or other such measurements.
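For concreteness, two of the distortion measurements mentioned above can be computed as follows. This is a minimal sketch that assumes the least-mean-square measurement refers to the mean of squared sample errors; the function names and sample values are illustrative only.

```python
def mean_squared_error(original, reconstructed):
    """Average of squared sample differences."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def mean_absolute_error(original, reconstructed):
    """Average of absolute sample differences (MAE)."""
    return sum(abs(o - r) for o, r in zip(original, reconstructed)) / len(original)

original = [10, 20, 30, 40]        # hypothetical source samples
reconstructed = [12, 18, 30, 44]   # hypothetical decoded samples

print(mean_squared_error(original, reconstructed))
print(mean_absolute_error(original, reconstructed))
```

Squaring penalizes the single large error (4) more heavily than the absolute measure does, which is one reason the two metrics can rank reconstructions differently.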
Rate distortion optimization is usually formulated as minimizing a rate distortion function, which is a weighted sum of the rate and of the distortion. There are different approaches to solve the rate distortion optimization problem.
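The weighted sum described above is commonly written in the following form. The symbols are conventional notation assumed here rather than taken from the application: D is the distortion, R the rate, λ a non-negative weight, and M the set of encoding options under consideration.

```latex
J(m) = D(m) + \lambda \, R(m), \qquad \lambda \ge 0,
\qquad
\hat{m} = \arg\min_{m \in \mathcal{M}} J(m)
```

A larger λ favors options with a lower rate; a smaller λ favors options with lower distortion.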
the approaches may be based on an extensive testing of all encoding options, including all considered modes or coding parameter values, with a complete evaluation of their coding cost and related distortion of the reconstructed signal after coding and decoding.
Faster approaches may also be used, to save encoding complexity, in particular with computation of an approximated distortion based on the prediction or the prediction residual signal, not the reconstructed one.
A mix of these two approaches can also be used, such as by using an approximated distortion for only some of the possible encoding options, and a complete distortion for other encoding options.
Other approaches only evaluate a subset of the possible encoding options. More generally, many approaches employ any of a variety of techniques to perform the optimization, but the optimization is not necessarily a complete evaluation of both the coding cost and related distortion.
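The difference between the exhaustive and the faster approaches can be sketched as follows. This is an illustration under assumptions: the cost is taken to be distortion plus a weight `lam` times rate, and the candidate names, rates, and distortions are invented values, not measurements from any codec.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical coding mode label
    rate_bits: float          # coding cost of this option
    distortion: float         # measured on the reconstructed signal
    approx_distortion: float  # cheap estimate, e.g. from the prediction residual

def rd_cost(rate, distortion, lam):
    """Weighted sum of distortion and rate."""
    return distortion + lam * rate

def exhaustive_search(candidates, lam):
    """Evaluate every option with its complete distortion."""
    return min(candidates, key=lambda c: rd_cost(c.rate_bits, c.distortion, lam))

def fast_search(candidates, lam):
    """Evaluate every option with the approximated distortion only."""
    return min(candidates, key=lambda c: rd_cost(c.rate_bits, c.approx_distortion, lam))

candidates = [
    Candidate("intra", rate_bits=120, distortion=10.0, approx_distortion=14.0),
    Candidate("inter", rate_bits=60, distortion=18.0, approx_distortion=17.0),
    Candidate("skip", rate_bits=5, distortion=40.0, approx_distortion=45.0),
]

print(exhaustive_search(candidates, lam=0.2).name)
print(fast_search(candidates, lam=0.2).name)
print(exhaustive_search(candidates, lam=1.0).name)
```

With these invented numbers both searches agree at the lower weight, and the choice shifts to the cheapest option once the rate is weighted more heavily. In general the fast search can pick a different option whenever the approximation misranks the candidates.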
the implementations and aspects described herein can be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed can also be implemented in other forms (for example, an apparatus or program).
An apparatus can be implemented in, for example, appropriate hardware, software, and firmware.
the methods can be implemented in, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
references to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, mean that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment.
the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well as any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.
Determining the information can include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Accessing the information can include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information.
this application may refer to “receiving” various pieces of information.
Receiving is, as with “accessing”, intended to be a broad term.
Receiving the information can include one or more of, for example, accessing the information, or retrieving the information (for example, from memory).
“receiving” is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B).
with three listed options (A, B, and C), such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C).
This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
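The selections covered by such phrasing are the non-empty subsets of the listed options, which can be enumerated mechanically. The helper name below is hypothetical and the sketch is only meant to illustrate the counting.

```python
from itertools import combinations

def selections(options):
    """Every non-empty selection of the listed options, smallest first."""
    return [
        combo
        for size in range(1, len(options) + 1)
        for combo in combinations(options, size)
    ]

for combo in selections(["A", "B", "C"]):
    print(" and ".join(combo))
```

For n listed items there are 2^n - 1 such selections: three for two items and seven for three items.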
the word “signal” refers to, among other things, indicating something to a corresponding decoder.
the encoder signals a particular one of a plurality of transforms, coding modes or flags.
the same transform, parameter, or mode is used at both the encoder side and the decoder side.
an encoder can transmit (explicit signaling) a particular parameter to the decoder so that the decoder can use the same particular parameter.
signaling can be used without transmitting (implicit signaling) to simply allow the decoder to know and select the particular parameter.
signaling can be accomplished in a variety of ways. For example, one or more syntax elements, flags, and so forth are used to signal information to a corresponding decoder in various embodiments. While the preceding relates to the verb form of the word “signal”, the word “signal” can also be used herein as a noun.
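The distinction between explicit and implicit signaling can be sketched as follows. Everything here is a hypothetical illustration: the transform names, the `transform_idx` syntax element, and the block-width rule are assumptions chosen for the example and do not appear in the application.

```python
TRANSFORMS = ("dct", "dst", "identity")  # hypothetical set known to both sides

def encode_explicit(transform):
    """Explicit signaling: the encoder writes the choice into the bitstream."""
    return {"transform_idx": TRANSFORMS.index(transform)}

def decode_explicit(header):
    """The decoder reads the transmitted syntax element."""
    return TRANSFORMS[header["transform_idx"]]

def derive_implicit(block_width):
    """Implicit signaling: both sides apply the same rule to information
    they already share, so nothing extra is transmitted."""
    return "dst" if block_width <= 4 else "dct"

header = encode_explicit("identity")
print(decode_explicit(header))
print(derive_implicit(4), derive_implicit(16))
```

Explicit signaling costs bits but leaves the encoder free to choose; implicit signaling costs nothing in the bitstream but restricts the choice to what the shared rule yields.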
implementations can produce a variety of signals formatted to carry information that can be, for example, stored or transmitted.
the information can include, for example, instructions for performing a method, or data produced by one of the described implementations.
a signal can be formatted to carry the bitstream of a described embodiment.
Such a signal can be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal.
the formatting can include, for example, encoding a data stream and modulating a carrier with the encoded data stream.
the information that the signal carries can be, for example, analog or digital information.
the signal can be transmitted over a variety of different wired or wireless links, as is known.
the signal can be stored on a processor-readable medium.
At least one embodiment comprises encoding and decoding of video information using neural networks.
At least one embodiment comprises performing a series of one or more transformations to generate residuals.
At least one embodiment comprises decoding tensor data in one or more cascaded stages, such as from a bitstream, to generate residuals and performing operations to generate at least a portion of an image from the one or more residuals.
At least one embodiment comprises the above encoding and decoding using concatenations or additions.
At least one embodiment comprises a bitstream or signal that includes one or more of the described syntax elements, or variations thereof.
At least one embodiment comprises creating and/or transmitting and/or receiving and/or decoding according to any of the embodiments described.
At least one embodiment comprises parsing video data or a bitstream to determine an operating point of a codec.
At least one embodiment comprises a method, process, apparatus, medium storing instructions, medium storing data, or signal according to any of the embodiments described.
At least one embodiment comprises inserting in the signaling syntax elements that enable the decoder to determine decoding information in a manner corresponding to that used by an encoder.
At least one embodiment comprises creating and/or transmitting and/or receiving and/or decoding a bitstream or signal that includes one or more of the described syntax elements, or variations thereof.
At least one embodiment comprises a TV, set-top box, cell phone, tablet, or other electronic device that performs transform method(s) according to any of the embodiments described.
At least one embodiment comprises a TV, set-top box, cell phone, tablet, or other electronic device that selects, bandlimits, or tunes (e.g., using a tuner) a channel to receive a signal including an encoded image, and performs transform method(s) according to any of the embodiments described.
At least one embodiment comprises a TV, set-top box, cell phone, tablet, or other electronic device that receives (e.g., using an antenna) a signal over the air that includes an encoded image, and performs transform method(s).
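The cascaded-residual idea in the embodiments above (a series of transformations that each generate a residual, and a decoder that combines the stage outputs by addition) can be illustrated with a deliberately simple stand-in. In this sketch each stage is a uniform scalar quantizer with a hypothetical step size; the embodiments use neural network transform stages instead, and only the addition variant is shown, not concatenation.

```python
def encode_stages(signal, steps):
    """Code the signal in cascaded stages; each stage codes the residual
    left over by the stages before it."""
    residual = list(signal)
    codes = []
    for step in steps:
        stage_codes = [round(value / step) for value in residual]
        residual = [value - code * step for value, code in zip(residual, stage_codes)]
        codes.append(stage_codes)
    return codes

def decode_stages(codes, steps):
    """Reconstruct by adding the contributions of the available stages."""
    output = [0] * len(codes[0])
    for stage_codes, step in zip(codes, steps):
        output = [acc + code * step for acc, code in zip(output, stage_codes)]
    return output

signal = [37, -5, 100, 8]   # hypothetical sample values
steps = (16, 4, 1)          # hypothetical coarse-to-fine stage step sizes

codes = encode_stages(signal, steps)
print(decode_stages(codes[:1], steps[:1]))  # first stage only
print(decode_stages(codes, steps))          # all stages
```

Decoding fewer stages gives a coarser reconstruction, and each additional stage adds a refinement. This is meant only to show the cascade structure, not the behavior of any particular embodiment.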

Landscapes

Engineering & Computer Science (AREA)
Multimedia (AREA)
Signal Processing (AREA)
Compression Or Coding Systems Of Tv Signals (AREA)

EP24709893.2A 2023-01-25 2024-01-24 Multi-residual autoencoder for image and video compression Pending EP4655943A1 (de)

Applications Claiming Priority (2)

Application Number	Priority Date	Filing Date	Title
US202363440979P	2023-01-25	2023-01-25
PCT/US2024/012755 WO2024158896A1 (en)	2023-01-25	2024-01-24	Multi-residual autoencoder for image and video compression

Publications (1)

Publication Number	Publication Date
EP4655943A1 (de)	2025-12-03

Family

ID=90361820

Family Applications (1)

Application Number	Priority Date	Filing Date	Title
EP24709893.2A Pending EP4655943A1 (de)	2023-01-25	2024-01-24	Multi-residual autoencoder for image and video compression

Country Status (3)

Country	Link
EP (1)	EP4655943A1 (de)
CN (1)	CN120584487A (de)
WO (1)	WO2024158896A1 (de)

2024
- 2024-01-24 EP EP24709893.2A (EP4655943A1): Pending
- 2024-01-24 CN CN202480008902.1A (CN120584487A): Pending
- 2024-01-24 WO PCT/US2024/012755 (WO2024158896A1): Ceased

Also Published As

Publication number	Publication date
CN120584487A (zh)	2025-09-02
WO2024158896A1 (en)	2024-08-02

Legal Events

Date	Code	Title	Description
2024-03-15	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: UNKNOWN
2024-08-03	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE
2025-10-31	PUAI	Public reference made under article 153(3) epc to a published international application that has entered the european phase	Free format text: ORIGINAL CODE: 0009012
2025-10-31	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE
2025-12-03	17P	Request for examination filed	Effective date: 20250728
2025-12-03	AK	Designated contracting states	Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR
2026-04-29	DAV	Request for validation of the european patent (deleted)
2026-04-29	DAX	Request for extension of the european patent (deleted)

Publication	Publication Date	Title
US20230396801A1 (en)	2023-12-07	Learned video compression framework for multiple machine tasks
US12537979B2 (en)	2026-01-27	Method and an apparatus for encoding/decoding images and videos using artificial neural network based tools
US20230298219A1 (en)	2023-09-21	A method and an apparatus for updating a deep neural network-based image or video decoder
US12587669B2 (en)	2026-03-24	Motion flow coding for deep learning based YUV video compression
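JP2025535086A (ja)	2025-10-22	Image and video compression using learned dictionaries of implicit neural representations
JP2024510433A (ja)	2024-03-07	Temporal structure-based conditional convolutional neural networks for video compression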
WO2022098731A1 (en)	2022-05-12	Learned video compression and connectors for multiple machine tasks
EP4637139A1 (de)	2025-10-22	Method and apparatus for image enhancement based on residual coding using an invertible deep network
US20260087680A1 (en)	2026-03-26	Reinforcement learning-based rate control for end-to-end neural network based video compression
EP4655943A1 (de)	2025-12-03	Multi-residual autoencoder for image and video compression
EP4675498A1 (de)	2026-01-07	Video-specific dictionary learning for implicit neural compression
EP4730794A1 (de)	2026-04-22	Hyperprior for latent implicit neural representation
EP4730264A1 (de)	2026-04-22	Dictionary learning for implicit neural compression
EP4664881A1 (de)	2025-12-17	Efficient compression of a coding tree unit based on implicit neural representation with a neural network coding standard
WO2025162700A1 (en)	2025-08-07	Multi-definition implicit neural representation video encoding
WO2025114337A1 (en)	2025-06-05	Downscaling ratio prediction for reference picture resampling
EP4612907A1 (de)	2025-09-10	Entropy adaptation for deep feature compression using flexible networks
WO2025162696A1 (en)	2025-08-07	Residual-based progressive growing inr for image and video coding
WO2025162699A1 (en)	2025-08-07	Semantic implicit neural representation for video compression
WO2026078009A1 (en)	2026-04-16	Optimization of multi-layer rate distortion decision