EP4600950A1 - Encoder and decoder - Google Patents

Encoder and decoder

Info

Publication number
EP4600950A1
Authority
EP
European Patent Office
Prior art keywords
encoder
data stream
style
decoder
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP24156124.0A
Other languages
German (de)
English (en)
Inventor
Andreas BRENDEL
Kishan GUPTA
Nicola PIA
Guillaume Fuchs
Suraj Pandey
Markus Multrus
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Original Assignee
Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2024-02-06
Filing date
2024-02-06
Publication date
2025-08-13
Application filed by Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Priority to EP24156124.0A
Priority to PCT/EP2025/052897 (WO2025168600A1)
Publication of EP4600950A1
Current legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser

Definitions

  • Embodiments of the present invention refer to an encoder for encoding an input audio signal and a decoder for decoding the encoded input audio signal.
  • the coding performed by the encoder and the decoder compresses the input data.
  • Further embodiments refer to the corresponding methods and to a computer program.
  • Preferred embodiments are in a field of neural audio codec using speaker embeddings.
  • State-of-the-art (SOTA) neural speech coding achieves good quality of the reconstructed speech at bitrates as low as 3.2 kbps.
  • SOTA neural speech coders [1,2,3] are typically trained in an end-to-end fashion: an input audio signal is encoded by an encoding neural network, the output of the encoder (the so-called latent) is quantized, and the resulting quantized latent is decoded by a decoding neural network (see Fig. 1).
  • the encoder runs on the transmitter side and provides quantized latents (or rather indices identifying the quantized latents used), which are transmitted to the receiver side, where the decoder reproduces the input speech signal.
  • Fig. 1 shows an exemplary codec having an encoder 10 that receives an input audio signal IS and encodes it to obtain the encoded input audio signal Q.
  • This encoded input audio signal Q can be decoded by use of the decoder 20.
  • the decoder 20 is configured to provide a reconstructed input audio signal IS' based on the signal Q.
  • the transmitted data stream Q must carry long-term information, like speaker identity, speaker style (called style in the following), etc., as well as short-term information, like phonetic content (called content in the following).
  • This way of coding has drawbacks with regard to the coding efficiency. Therefore, there is a need for an improved approach.
  • An objective of the present invention is to provide a concept for coding speech having high coding efficiency.
  • An embodiment provides an encoder for encoding an input audio signal like a speech signal.
  • the encoder comprises a style encoder and a content encoder.
  • the style encoder is configured to encode style information of the input audio signal to obtain a first (latent) data stream.
  • the content encoder is configured to encode content information to obtain a second (latent) data stream.
  • Embodiments of the present invention are based on the finding that it is advantageous to disentangle content and style information.
  • the content information which may comprise a phonetic content mainly comprises short term information
  • the style information which may comprise an information on the speaker identity or speaking style mainly carries a long term information.
  • the encoder is subdivided into two encoders acting in parallel, namely a style encoder and a content encoder. In this way, it is advantageously possible to increase the coding efficiency.
  • Another advantage could lie in the transmission of the two data streams. Since the information is disentangled and split into two independent streams, they can be transmitted over different transmission channels (with different data rates, transmission intervals, and power consumption) and/or with unequal error protection (UEP).
  • UEP unequal error protection
  • the first and the second (latent) data stream are output and/or transmitted separately.
  • the first and second (latent) data stream may be quantized separately before transmitting.
  • the first data stream is quantized and/or transmitted and/or stored disjointly from the second data stream. Consequently, the two streams may be transmitted separately on different transmission channels.
  • the first (latent) data stream which is output by the style encoder comprises (mainly) long term information, especially speaker identity information or speaking style information.
  • the second (latent) data stream (output by the content encoder) may comprise short-term information, especially phonetic content. Due to this, it is advantageously possible to separate not only the encoding but also the transmission, so that the two data streams can be encoded with separate entities. Due to the long-term and short-term character of the first and second (latent) data streams, respectively, the first (latent) data stream may comprise information which is sent at a much lower frequency than the second (latent) data stream.
  • the first (latent) data stream may comprise data blocks that are sent at low frequency, e.g., below 10 times per second as the coded style information is slowly time varying.
  • the second (latent) data stream may be composed of data blocks that have to be sent more frequently, e.g., 10-100 times per second, to account for the short-term nature of the coded signal content. This means that the content information is sent more frequently than the style information.
  • the above-defined encoder is especially advantageous for speech coding. Therefore the input audio signal may be a speech signal or may comprise a speech record.
  • an input audio signal may be a speech signal of a person, the person being characterized by the first (latent) data stream with regard to a characteristic speech style, e.g. male speech style or female speech style or another speech style.
  • a characteristic speech style e.g. male speech style or female speech style or another speech style.
  • Such speech style information is highly characteristic of a speech style and describes the speech behavior of the person whose speech is to be coded very well, yet in general terms.
  • the style encoder and the content encoder use neural codecs like NESC or other neural network codecs for encoding. Such neural codecs are advantageously trained or trainable using machine learning techniques.
  • the style encoder and/or the content encoder are trained or trainable using machine learning techniques (artificial intelligence algorithms or neural network approaches).
  • the style encoder is configured to employ layers of a recurrent neural network or layers representing long-term information, like for example a dilated convolutional neural network or a transformer.
  • the content encoder is configured to use normalization layers enforcing temporal whiteness or input perturbations or vocal tract length perturbations on the input audio signal waveform to emphasize short-term characteristics.
  • the disentanglement of content and style also allows for separate manipulation of both representations and corresponding decoding, e.g., for speaker anonymization.
  • the encoder can just use the content encoder configured to encode content information to obtain a second data stream. In this way, the content can be transmitted without transmitting the identity sensitive style information.
  • a default style information is used so that the content can be reproduced in an anonymized manner.
  • the decoder may comprise a conditioning entity and a content decoder.
  • the conditioning entity receives a first (latent) data stream of the encoded input audio signal (cf. above discussion) and is configured to output a control signal based on the first (latent) data stream.
  • the first (latent) data stream comprises an encoded style information.
  • the content decoder is configured to decode the second (latent) data stream of the encoded input audio signal comprising content information (cf. above discussion).
  • the content decoder is controlled and/or adapted by the control signal of the conditioning entity.
  • the content decoder is configured to obtain a reconstructed signal of the encoded input audio signal based on the first and second (latent) data stream.
  • the reconstructed signal is output by the content decoder taking into account the control signal of the conditioning entity. This control information is based on the first (latent) data stream comprising the encoded style information.
  • the content decoder may be configured to use a default control signal or initial control signal before the conditioning entity provides the control signal based on the first data stream.
  • the default control signal may be stored in the decoder.
  • the default control signal and/or the initial control signal represents an average style.
  • different default control signals or different initial control signals may be used.
  • an average style information for a male speaker and/or an average style information for a female speaker may be provided so as to start the decoding based on the default control signal/initial control signal.
  • the default control signal or initial control signal represents an average style selected out of the plurality of average styles, wherein the selection is taken based on an external information.
  • the decoder may comprise a conditioning entity and a content decoder.
  • the conditioning entity outputs a control signal being a default control signal which comprises a style information.
  • the content decoder is configured to decode the data stream (comparable to the second data stream) of the encoded input audio signal comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
  • the reconstructed speech may be anonymized, i.e., the reconstructed speech signal does not allow for identifying the speaker as the style was changed (speaker privacy can be achieved according to this aspect).
  • the default control signal representing a default style information may be selected out of a plurality of default styles. This may be done by use of a control signal.
  • Another embodiment provides a decoder for decoding an encoded input audio signal comprising a conditioning entity receiving the encoded input audio signal and configured to output a control signal based on the encoded input audio signal.
  • the encoded input audio signal comprises an encoded style information.
  • the decoder further comprises a content decoder configured to decode the encoded input audio signal further comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
  • the content decoder may be configured to use a default control signal or an initial control signal before the conditioning entity provides the control signal based on external information or the first (latent) data stream.
  • the default control signal or initial control signal represents a given style selected out of a plurality of available styles, wherein the selection is taken based on an external information or control.
  • a deliberate anonymization or a desired voice conversion could be achieved.
  • the decoder is configured to decode an encoded input audio signal comprising a conditioning entity and a content decoder.
  • the conditioning entity is configured to output a control signal representing a style information.
  • the default control signal or initial control signal is used as control signal and thus, represents a given style selected out of one or more available styles.
  • the content decoder is configured to decode the encoded input audio signal comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
  • the default control signal or initial control signal represents, e.g., an average style and/or average style for a male speaker and/or average style for a female speaker.
  • the default control signal or initial control signal may according to further embodiments represent an average style selected out of a plurality of (average) styles, wherein the selection is taken based on an external information. Note it could also be a set of other candidate styles not necessarily an average of other styles.
  • An embodiment provides a decoder for decoding an encoded input audio signal.
  • the decoder comprises a conditioning entity and a content decoder.
  • the conditioning entity is configured to receive an encoded style information derived from the encoded input audio signal and to perform at least a learned affine transform of an encoded content information derived from the encoded input audio signal to obtain a transformed encoded content information.
  • the decoder is configured to further decode the transformed encoded content information.
  • Embodiments of the present invention are based on the finding that it is advantageous to disentangle content and style information.
  • the content information which may comprise a phonetic content mainly comprises short term information
  • the style information which may comprise an information on the speaker identity or speaking style mainly carries a long term information.
  • the decoder comprises two entities, the conditioning entity and the content decoder, so as to enable the decoding of the two different kinds of information, namely the short-term information and the long-term information.
  • the long-term information, also referred to as style information, is obtained using the conditioning entity, while the short-term information, i.e., the content information, is decoded by the content decoder taking into account the previously processed long-term information/style information.
  • An embodiment provides a method for decoding an encoded input audio signal. The method comprising:
  • Another embodiment provides a method for decoding an encoded input audio signal, and the method comprising:
  • providing a manipulated or a default control signal may be used not only to cover the time span before the first encoded style information is received but also for anonymizing the speaker (here, the style information may not be transmitted, thereby keeping the identifying information of the speaker included in the audio signal private).
  • An embodiment provides a method for decoding an encoded audio signal comprising a conditioning entity configured to output a control signal being a default control signal and representing a style information; a content decoder configured to decode a data stream of the encoded input audio signal comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
  • the method may further comprise setting a default/manipulated style on the decoder side and not transmitting the style information for not disclosing identifying information about the speaker in the transmitted signals (encoder side).
  • the encoder 10' receives the input audio signal IS, e.g., an audio signal comprising speech.
  • the encoder 10' performs encoding so as to obtain two output data streams Q1 and Q2. These two audio data streams Q1 and Q2 are transmitted, e.g., separately to the decoder 20'.
  • the decoder 20' performs a reconstruction so as to obtain the reconstructed signal IS'.
  • the encoder 10' comprises two encoding entities, namely a style encoder 10s and a content encoder 10c. Each of the two encoders 10s and 10c receives the input audio signal IS and performs an encoding of a portion of the input audio signal.
  • the encoder 10s outputs the coded audio signal Q1, while the encoder 10c outputs the coded audio signal Q2.
  • the signal Q1 is processed by a conditioning/PreNet 20c, while the encoded audio stream Q2 is decoded by the content decoder 20d.
  • the content decoder 20d has an input for Q2 and a control signal generated by 20c.
  • the decoder uses the two signals so as to output the reconstructed signal IS'.
  • the style encoder 10s is configured to encode style information of the input audio signal to obtain a first (latent) data stream Q1.
  • the style information may comprise long-term information, like speaker identity or speaking style.
  • the style information may be dependent on the gender (male/female speakers) or dependent on the language.
  • the style information may have an influence on the general frequency-dependent power distribution of the voice and/or on the general sound. Due to the encoding, the signal Q1 comprises the specific coded style information.
  • the content encoder 10c is configured to encode content information, i.e., short term information, like phonetic information.
  • embodiments of present invention are more efficient than the SOTA approaches to neural speech coding [1, 2, 3].
  • the encoder 10 (as well as the decoder 20) may be realized by neural codec NESC [1].
  • NESC neural codec
  • any neural network that transforms a potentially preprocessed audio signal into a compact representation can serve as an encoder for a neural codec.
  • a neural codec may be characterized by a preprocessing of the audio signal and a transformation of the preprocessed audio signal so as to obtain a compact representation of same.
  • convolutional neural network blocks that successively downsample the input audio signal are used, together with recurrent neural network blocks or transformers that model additional temporal context.
  • A subsequently applied quantizer would also be part of this encoder; it transforms the continuous output of the neural network into a discrete representation that can potentially be entropy-coded and transmitted.
  • a formatter for preparing the data stream, etc. may be part of it.
  • NESC is an example for a neural codec.
  • NESC may, for example, consist of the following blocks:
  • the input waveform is transformed into a two-dimensional representation by a rolling window frontend (the frontend being the first part of the encoder, which receives the input audio signal).
  • the result is processed by a convolutional layer within each frame, recurrent layers along different frames and again a convolutional layer operating within each frame. Together, this is termed dual path convolutional recurrent neural network (DPCRNN).
  • DPCRNN dual path convolutional recurrent neural network
  • the signal is further processed by several convolutional neural network blocks.
  • the output of these blocks is quantized, e.g., by residual vector quantization (as in NESC) or by scalar quantization.
  • the decoder 20' as already shown by Fig. 2 will be discussed taking reference to Fig. 3b showing the decoder 20d as central component.
  • the decoder 20d receives Q2 and outputs IS'. Furthermore, the decoder 20d has another input for the control signal c obtained using the conditioning PreNet 20c, also referred to as conditioning entity.
  • the quantized representation may, according to further embodiments, be preprocessed by recurrent neural network blocks (so as to implement an example neural codec based on NESC).
  • the result is decoded by Streamwise-StyleMelGAN (SSMGAN).
  • SSMGAN consists of a sequence of convolutional layers that successively upsample the input audio signal, where the input audio signal is used to condition each upsampling stage with a temporal adaptive DE-normalization (TADE) layer.
  • TADE temporal adaptive DE-normalization
  • the final signal is synthesized with a Pseudo-Quadrature Mirror Filter bank (PQMF).
  • PQMF Pseudo-Quadrature Mirror Filter bank
  • the neural codec used at the decoder side may comprise Streamwise-StyleMelGAN, e.g., consisting of a sequence of convolutional layers that successively upsample the input audio signal.
  • the decoder may comprise an SSMGAN decoder including the TADE layers and the dequantization, i.e., the mapping of indices describing the discrete representation (or of a stream of indices) to a latent waveform (representation) that serves as the input of the remaining part of the decoder.
  • different network architectures may be used.
  • the encoder comprises a style encoder and/or a content encoder which comprises the following elements:
  • the content decoder comprises a styling element of the content information from the first data stream or a processed version of it, the styling element being controlled by the control signal (C).
  • a NESC decoder (SSMGAN)
  • the content decoder comprises at least a styling element to obtain a reconstructed signal (IS') of the encoded input audio signal based on the first (Q1) and second data stream (Q2).
  • styling elements/TADEs are used elsewhere in the decoder, e.g. using SSMGAN as content decoder.
  • the content encoder 10c captures short time relations within the input audio signal (e.g., phonetic information), while the style encoder captures more global information (e.g., the identity of the speaker or the speaking style).
  • the content encoder produces latents at a much higher frequency (e.g., 10-100Hz) relative to the style encoder (less than 10Hz).
  • a much higher frequency e.g., 10-100Hz
  • the style encoder less than 10Hz
  • both encoders may share some of their layers.
  • layers capable of representing long-term information, such as RNNs, may be employed to enforce that the style encoder 10s learns long-term information. To avoid leaking of style information to the content encoder, techniques such as normalization layers enforcing temporal whiteness or input perturbations like vocal tract length perturbations may be used.
  • the two latents Q1 and Q2 comprise different information. It should be noted that the latent Q1 (output of the style encoder 10s) is transmitted less often than the latent Q2 (output of the content encoder 10c). The reason is that style information is a long-term and/or global feature, while content refers to short-term information.
  • the decoder 20 comprises the two entities 20c and 20d.
  • 20c receives the latent Q1 comprising the style information and performs a processing so as to obtain a control signal.
  • This entity 20c is also referred to as conditioning PreNet and is configured to prepare a control signal to specialize the decoder 20d towards a specific style. Consequently, the control signal output by the entity 20c serves as additional input for the decoder 20d. This can be achieved by learning affine transforms of intermediate representations in the decoder 20d.
  • the decoder 20d receives the latent Q2 and performs a decoding so as to obtain the reconstructed signal IS'.
  • the signal is obtained based on Q2 taking into account the control signal provided by 20c. The specific style information is received through the control signal, so that the decoder 20d can perform the decoding based not only on the information included in Q2 but also on the information included in Q1.
  • initialization techniques may be used according to embodiments. For example, decoding with an average style or with an average style for male/female speakers may be used during the initialization phase. In other words, the decoder 20d uses an initial control signal representing the average style or, e.g., the average style of a female speaker. To select the (average) style out of a plurality of (average) styles, information may be used that is, e.g., extracted from or included in Q1.
  • a control signal can be provided imprinting an anonymous identity to the content decoded by use of the decoder 20d.
  • the conditioning entity 20c may be configured to receive the encoded style information derived from the encoded input audio signal, e.g., via the data stream Q1, and to perform at least a learnable affine transform of the encoded content information derived from the encoded input audio signal. The result is a transformed encoded content information.
  • This content information is forwarded as a kind of control signal C to the content decoder 20d which performs a further decoding.
  • the further decoding may take into account the data stream Q2.
  • the decoding performed by 20d has the purpose to obtain IS'.
  • the data stream Q2 may suffice as the input for 20c and 20d. The reason is that long-term information can be derived from Q2 as well.
  • the content decoder 20d may comprise at least one learnable layer.
  • a convolutional layer or temporal adaptive DE-normalization layer TADE
  • the content decoder may comprise at least one temporal adaptive DE normalization residual block.
  • the proposed method achieves better quality of the reconstructed speech at very low bitrates and/or low complexity than comparable neural speech coders that do not leverage separate encoders for content and style. Especially the generalization ability of the trained codec may be significantly increased due to the proposed extra encoder.
  • both encoders can be realized in a computationally very efficient way such that an extra encoder does not pose significant extra computational burden.
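
The two-rate transmission described above (style blocks sent fewer than 10 times per second, content blocks sent 10-100 times per second) can be illustrated with a small scheduling and bitrate calculation. This is only a sketch: the class `StreamConfig`, the helper names, the block rates of 2 and 50 per second, and the block sizes of 64 and 40 bits are assumptions chosen for illustration and do not come from the application.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical per-stream settings; the values are not from the source."""
    name: str
    blocks_per_second: int  # how often a data block is sent
    bits_per_block: int     # payload size of one quantized latent block


def bitrate_bps(cfg: StreamConfig) -> int:
    """Bitrate contributed by one stream, in bits per second."""
    return cfg.blocks_per_second * cfg.bits_per_block


def schedule(style: StreamConfig, content: StreamConfig,
             seconds: int) -> List[Tuple[float, str]]:
    """Return (timestamp, stream name) pairs for every block sent in `seconds`."""
    events = []
    for cfg in (style, content):
        for k in range(cfg.blocks_per_second * seconds):
            events.append((k / cfg.blocks_per_second, cfg.name))
    # Sort by time; at equal timestamps "Q1" sorts before "Q2".
    return sorted(events)


# Assumed example rates: style stream Q1 is slow, content stream Q2 is fast.
Q1 = StreamConfig("Q1", blocks_per_second=2, bits_per_block=64)
Q2 = StreamConfig("Q2", blocks_per_second=50, bits_per_block=40)

events = schedule(Q1, Q2, seconds=1)
total_bps = bitrate_bps(Q1) + bitrate_bps(Q2)
print(f"{len(events)} blocks in one second, {total_bps} bit/s in total")
```

With these assumed numbers the style stream contributes only a small share of the total rate, which is the intuition behind sending it less often or over a separate channel.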
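
The conditioning of the content decoder 20d by a learned affine transform of its intermediate representations can be sketched as a scale-and-shift operation driven by the style latent. The code below is a simplified stand-in, not the TADE layer of SSMGAN: the function names, the plain dense projections, the normalization over channels, and all weight values are assumptions for illustration (in a trained codec the weights would be learned).

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> Vector:
    """Dense layer y = W x + b, standing in for a learned projection."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def normalize(x: Sequence[float], eps: float = 1e-5) -> Vector:
    """Zero-mean, unit-variance normalization across the channels of one frame."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def condition(hidden: Sequence[float], style: Sequence[float],
              w_gamma: Matrix, b_gamma: Sequence[float],
              w_beta: Matrix, b_beta: Sequence[float]) -> Vector:
    """Affine modulation: gamma(style) * normalize(hidden) + beta(style)."""
    gamma = linear(w_gamma, b_gamma, style)  # per-channel scale from the style
    beta = linear(w_beta, b_beta, style)     # per-channel shift from the style
    return [g * h + b for g, h, b in zip(gamma, normalize(hidden), beta)]


# Toy example with two channels and a two-dimensional style vector.
hidden = [1.0, 3.0]   # intermediate representation derived from Q2
style = [1.0, 0.0]    # control signal derived from Q1
out = condition(hidden, style,
                w_gamma=[[2.0, 0.0], [0.0, 0.0]], b_gamma=[0.0, 1.0],
                w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.5, -0.5])
print(out)
```

Because the content path is normalized before modulation, the scale and shift supplied by the style path decide the final statistics of each channel, which is one way a control signal can specialize a decoder towards a style.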
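
Residual vector quantization, mentioned above as one option for quantizing the encoder output, can be sketched in a few lines: each stage quantizes the residual left by the previous stage, and only the chosen indices are transmitted. The codebooks below are tiny hand-picked placeholders, and the function names are assumptions; a real codec would learn its codebooks during training.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codeword with the smallest squared Euclidean distance to x."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage codes the remaining residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Dequantize by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Placeholder codebooks: a coarse first stage and a finer second stage.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],
    [(0.0, 0.0), (0.25, -0.25), (-0.25, 0.25)],
]
latent = (1.2, 0.8)
indices = rvq_encode(latent, codebooks)
reconstruction = rvq_decode(indices, codebooks)
print(indices, reconstruction)
```

The decoder side of this sketch corresponds to the dequantization step described above, which maps a stream of indices back to a latent representation.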
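
The use of a default or initial control signal, both before the first style block arrives and for deliberate anonymization, can be sketched as a simple selection rule on the decoder side. Everything named here is hypothetical: the table `DEFAULT_STYLES`, its placeholder vectors, the `prenet` stand-in for the conditioning PreNet 20c, and the `hint` argument standing in for the external information mentioned above.

```python
from typing import Dict, List, Optional

ControlSignal = List[float]

# Hypothetical defaults stored in the decoder; the vectors are placeholders,
# not learned average styles.
DEFAULT_STYLES: Dict[str, ControlSignal] = {
    "average": [0.0, 0.0],
    "average_male": [-1.0, 0.5],
    "average_female": [1.0, 0.5],
}


def prenet(q1: List[float]) -> ControlSignal:
    """Stand-in for the conditioning PreNet 20c (a fixed scaling, for illustration)."""
    return [2.0 * v for v in q1]


def select_control(q1: Optional[List[float]], hint: str = "average",
                   anonymize: bool = False) -> ControlSignal:
    """Choose the control signal that conditions the content decoder.

    A stored default is used when no style latent has been received yet
    (q1 is None) or when the speaker should be anonymized; otherwise the
    control signal is derived from the received style latent.
    """
    if anonymize or q1 is None:
        return DEFAULT_STYLES.get(hint, DEFAULT_STYLES["average"])
    return prenet(q1)


print(select_control(None))                        # start-up: average style
print(select_control(None, "average_female"))      # start-up with external hint
print(select_control([0.5, 1.0]))                  # style latent received
print(select_control([0.5, 1.0], anonymize=True))  # style deliberately replaced
```

In the anonymizing case the style latent need not be transmitted at all, so the identifying information never leaves the encoder side.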

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
EP24156124.0A 2024-02-06 2024-02-06 Encoder and decoder Pending EP4600950A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP24156124.0A EP4600950A1 (fr) 2024-02-06 2024-02-06 Encoder and decoder
PCT/EP2025/052897 WO2025168600A1 (fr) 2024-02-06 2025-02-05 Encoder and decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP24156124.0A EP4600950A1 (fr) 2024-02-06 2024-02-06 Encoder and decoder

Publications (1)

Publication Number Publication Date
EP4600950A1 (fr) 2025-08-13

Family

ID=89853630

Family Applications (1)

Application Number Title Priority Date Filing Date
EP24156124.0A Pending EP4600950A1 (fr) 2024-02-06 2024-02-06 Encoder and decoder

Country Status (2)

Country Link
EP (1) EP4600950A1 (fr)
WO (1) WO2025168600A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200365166A1 (en) * 2019-05-14 2020-11-19 International Business Machines Corporation High-quality non-parallel many-to-many voice conversion

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DEFOSSEZ ALEXANDRE ET AL., HIGH FIDELITY NEURAL AUDIO COMPRESSION, Retrieved from the Internet <URL:https://arxiv.org/abs/2210.13438>
PIA NICOLA ET AL., NESC: ROBUST NEURAL END-2-END SPEECH CODING WITH GANS, Retrieved from the Internet <URL:https://arxiv.org/abs/2207.03282>
POLYAK ET AL., SPEECH RESYNTHESIS FROM DISCRETE DISENTANGLED SELF-SUPERVISED REPRESENTATIONS, 2021, Retrieved from the Internet <URL:https://arxiv.org/pdf/2104.00355.pdf>
XUE JIANG ET AL: "Disentangled Feature Learning for Real-Time Neural Speech Coding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 February 2023 (2023-02-25), XP091446243 *
ZEGHIDOUR NEIL ET AL., SOUNDSTREAM: AN END-TO-END NEURAL AUDIO CODEC, Retrieved from the Internet <URL:https://arxiv.org/abs/2107.03312>

Also Published As

Publication number Publication date
WO2025168600A1 (fr) 2025-08-14

Similar Documents

Publication Publication Date Title
US12573414B2 (en) Speech signal encoding and decoding methods and apparatuses, electronic device, and storage medium
JP6951536B2 (ja) Speech encoding device and method
CN100583241C (zh) Audio encoding device, audio decoding device, audio encoding method, and audio decoding method
Zhen et al. Cascaded cross-module residual learning towards lightweight end-to-end speech coding
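JP2001202097A (ja) Coded binary audio processing method
CN119252268B (zh) Audio decoding and encoding methods, apparatus, electronic device, and storage medium
JPWO2007088853A1 (ja) Speech encoding device, speech decoding device, speech encoding system, speech encoding method, and speech decoding method
CN118136030A (zh) Audio processing method, apparatus, storage medium, and electronic device
CN113314132A (zh) Audio object encoding method, decoding method, and apparatus for use in an interactive audio system
KR20070083957A (ko) Vector conversion device and vector conversion method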
Yao et al. Variational speech waveform compression to catalyze semantic communications
JPH11242499A (ja) Speech encoding/decoding method and method for separating components of a speech signal
US20230197091A1 (en) Signal transformation based on unique key-based network guidance and conditioning
Valin et al. DRED: Deep REDundancy coding of speech using a rate-distortion-optimized variational autoencoder
EP4600950A1 (fr) Encoder and decoder
EP4600951A1 (fr) Disentangled audio encoding and decoding with style control
EP4600952A1 (fr) Decoder
CN115410585A (zh) Audio data encoding and decoding method, related apparatus, and computer-readable storage medium
KR102353050B1 (ko) Signal reconstruction method and device in stereo signal encoding
EP4189680B9 (fr) Neural-network-based key generation for key-guided neural-network-based audio signal transformation
CN121214951B (zh) Ultra-low-bitrate speech coding and decoding system based on preserving text semantic information
Lim et al. Perceptual Neural Audio Coding With Modified Discrete Cosine Transform
WO2025035955A1 (fr) Speech signal decoding method and apparatus, and electronic device
KR20260053218A (ko) Speech signal decoding method and apparatus, and electronic device
WO2024051955A1 (fr) Decoder and decoding method for discontinuous transmission of parametrically coded independent streams with metadata

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR