EP4600951A1 - Disentangled audio coding and decoding with style control - Google Patents
Disentangled audio coding and decoding with style control
Info
- Publication number
- EP4600951A1 (application EP24156121.6A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- style
- encoder
- decoder
- content
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
Definitions
- Embodiments of the present invention refer to an encoder for encoding the input audio signal and a decoder for decoding the encoded input audio signal.
- the coding performed by the encoder and the decoder enables the input data to be compressed.
- Further embodiments refer to the corresponding methods and to a computer program.
- Preferred embodiments are in the field of neural audio codecs using speaker embeddings.
- State-of-the-art (SOTA) neural speech coding achieves good quality of the reconstructed speech at bitrates as low as 3.2 kbps.
- SOTA neural speech coders [1,2,3] are typically trained in an end-to-end fashion, where an input audio signal is encoded by an encoding neural network, the output of the encoder (the so-called latent) is quantized and the resulting quantized latent is decoded by a decoding neural network (see Fig. 1 ).
- the encoder runs on transmitter side providing quantized latents (or rather identifying indices of the used quantized latents) which are transmitted to the receiver side where the decoder reproduces the input speech signal.
- Fig. 1 shows exemplarily a codec having an encoder 10 receiving an input audio signal IS so as to encode the input audio signal to obtain the encoded input audio signal Q.
- This encoded input audio signal Q can be decoded by use of the decoder 20.
- the decoder 20 is configured to provide based on the signal Q a reconstructed input audio signal IS'.
- the transmitted data stream Q must carry long-term information, like speaker identity, speaker style (called style in the following), etc., as well as short-term information, like phonetic content (called content in the following).
- This way of coding has drawbacks with regard to the coding efficiency. Therefore, there is a need for an improved approach.
- An objective of the present invention is to provide a concept for coding speech having high coding efficiency and/or additional functionality.
- An embodiment provides an encoder for encoding an input audio signal like a speech signal.
- the encoder comprises a style encoder and a content encoder.
- the style encoder is configured to encode style information of the input audio signal to obtain a first (latent) data stream.
- the content encoder is configured to encode content information to obtain a second (latent) data stream.
- Embodiments of the present invention are based on the finding that it is advantageous to disentangle content and style information.
- the content information, which may comprise phonetic content, mainly comprises short-term information.
- the style information, which may comprise information on the speaker identity or speaking style, mainly carries long-term information.
- the encoder is subdivided into two encoders acting in parallel, namely a style encoder and a content encoder. In this way, it is advantageously possible to increase the coding efficiency.
- Another advantage could lie in the transmission of the two data streams. Since information is disentangled and splits into two independent streams, they can be transmitted over different transmission channels (with different data rate, transmission intervals, and power consumption) and/or with unequal error protection (UEP).
- UEP unequal error protection
- the first and the second (latent) data stream are output and/or transmitted separately.
- the first and second (latent) data stream may be quantized separately before transmitting.
- the first data stream is quantized and/or transmitted and/or stored disjointly from the second data stream. Consequently, the two streams may be transmitted separately on different transmission channels.
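As a rough illustration of unequal error protection over two separately transmitted streams, the following Python sketch sends one stream with a 3x repetition code and the other unprotected. The repetition code, the decision to give the style stream the stronger protection, and the names `protect` and `recover` are assumptions made for this example; the text above does not prescribe a particular scheme.

```python
def protect(bits, repeat):
    """Repeat every bit `repeat` times (repeat=1 means no added protection)."""
    return [b for b in bits for _ in range(repeat)]


def recover(coded, repeat):
    """Majority-vote each group of `repeat` received bits (use an odd `repeat`)."""
    decoded = []
    for i in range(0, len(coded), repeat):
        group = coded[i:i + repeat]
        decoded.append(1 if sum(group) * 2 > len(group) else 0)
    return decoded


# Placeholder payloads standing in for the quantized latents Q1 and Q2.
style_bits = [1, 0, 1, 1]
content_bits = [0, 1, 1, 0, 1, 0]

style_tx = protect(style_bits, 3)      # assumption: style gets strong protection
content_tx = protect(content_bits, 1)  # assumption: content is sent unprotected

# Simulate a single bit error on the style channel.
style_rx = list(style_tx)
style_rx[4] ^= 1

style_decoded = recover(style_rx, 3)
content_decoded = recover(content_tx, 1)
```

With these toy values the single flipped bit in the style stream is corrected by the majority vote, while the content stream passes through unchanged at its original size.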
- the first (latent) data stream which is output by the style encoder comprises (mainly) long term information, especially speaker identity information or speaking style information.
- the second (latent) data stream (output by the content encoder) may comprise short-term information, especially phonetic content. Due to this, it is advantageously possible to separate not only the encoding but also the transmission, so that the two data streams can be encoded with separate entities. Due to the long-term and short-term character of the first and second (latent) data streams, the first (latent) data stream may carry information that is sent much less frequently than the second (latent) data stream.
- the first (latent) data stream may comprise data blocks that are sent at low frequency, e.g., below 10 times per second as the coded style information is slowly time varying.
- the second (latent) data stream may be composed of data blocks that have to be sent more frequently, e.g., 10-100 times per second, to account for the short-term nature of the coded signal content. This means that the content information is sent more frequently than the style information.
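To make the effect of the two update rates concrete, the sketch below computes the bitrate of each stream and builds a one-second send schedule. The block sizes and the specific rates (2 style blocks and 50 content blocks per second) are hypothetical values picked to fall inside the ranges stated above; they are not taken from the source.

```python
def stream_bitrate(blocks_per_second, bits_per_block):
    """Bitrate in bits per second for a stream of fixed-size blocks."""
    return blocks_per_second * bits_per_block


def schedule(duration_s, style_hz, content_hz):
    """Return a time-ordered list of (time_in_seconds, stream_name) send events."""
    events = [(k / style_hz, "style") for k in range(int(duration_s * style_hz))]
    events += [(k / content_hz, "content") for k in range(int(duration_s * content_hz))]
    return sorted(events)


# Hypothetical figures, for illustration only.
STYLE_RATE_HZ = 2      # below 10 blocks per second
CONTENT_RATE_HZ = 50   # within 10-100 blocks per second
STYLE_BITS = 80        # assumed bits per style block
CONTENT_BITS = 60      # assumed bits per content block

style_bps = stream_bitrate(STYLE_RATE_HZ, STYLE_BITS)
content_bps = stream_bitrate(CONTENT_RATE_HZ, CONTENT_BITS)
total_bps = style_bps + content_bps

events = schedule(1, STYLE_RATE_HZ, CONTENT_RATE_HZ)
style_events = [e for e in events if e[1] == "style"]
```

Under these assumed numbers the style stream contributes only a small fraction of the total rate, which is the intuition behind sending it less often.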
- the above-defined encoder is especially advantageous for speech coding. Therefore the input audio signal may be a speech signal or may comprise a speech record.
- an input audio signal may be a speech signal of a person, the person being characterized by the first (latent) data stream with regard to a characteristic speech style, e.g. male speech style or female speech style or another speech style.
- Such speech style information is highly characteristic of a speech style and describes, very well but in general terms, the speech behavior of the person whose speech is to be coded.
- the style encoder and the content encoder use neural codecs like NESC or other neural network codecs for encoding. Such neural codecs are advantageously trained or trainable using machine learning techniques.
- the style encoder and/or the content encoder are trained or trainable using machine learning techniques (artificial intelligence algorithms or neural network approaches).
- the style encoder is configured to employ layers of a recurrent neural network or layers representing long-term information, like for example a dilated convolutional neural network or a transformer.
- the content encoder is configured to use normalization layers enforcing temporal whiteness or input perturbations or vocal tract length perturbations on the input audio signal waveform for emphasizing short-term characteristics.
- the disentanglement of content and style also allows for separate manipulation of both representations and corresponding decoding, e.g., for speaker anonymization.
- the encoder can just use the content encoder configured to encode content information to obtain a second data stream. In this way, the content can be transmitted without transmitting the identity sensitive style information.
- on the encoder side, it is possible to manipulate or delete the style information and/or the first data stream.
- the corresponding style encoder is just switched off.
- a default style information may be used so that the content can be reproduced in an anonymized manner.
- an embodiment provides a decoder comprising a conditioning entity and a content decoder.
- the conditioning entity outputs a control signal being a default control signal which comprises a style information.
- the content decoder is configured to decode the data stream (comparable to the second data stream) of the encoded input audio signal comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
- the reconstructed speech may be anonymized, i.e., the reconstructed speech signal does not allow for identifying the speaker as the style was changed (speaker privacy can be achieved according to this aspect).
- the default control signal may represent an anonymized style information.
- the default control signal representing a default style information may be selected out of a plurality of default styles (e.g. male or female style). This may be done by use of a control signal.
- the control signal can be sent by the transmitter or encoder so as to reproduce the content having a specific style (voice conversion).
- manipulating the style information comprises setting the style information to a default style information or selecting the style information out of a plurality of available style informations.
- manipulating the style information comprises removing the style information or the first data stream before transmission.
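The encoder-side options described above (keep the style, replace it with a default, or remove the first data stream before transmission) can be sketched as follows. The table `DEFAULT_STYLES`, its placeholder values, the mode names, and the payload keys `"Q1"` and `"Q2"` are all hypothetical and serve only to illustrate the control flow.

```python
# Hypothetical default style latents; the numbers are placeholders.
DEFAULT_STYLES = {
    "neutral": [0.0, 0.0, 0.0],
    "style_a": [0.5, -0.2, 0.1],
}


def manipulate_style(style_latent, mode="keep", default_name="neutral"):
    """Return the style latent to transmit, or None if it is to be removed."""
    if mode == "keep":
        return list(style_latent)
    if mode == "default":
        # Replace the speaker's style with one selected from the available defaults.
        return list(DEFAULT_STYLES[default_name])
    if mode == "remove":
        # Drop the first data stream entirely before transmission.
        return None
    raise ValueError(f"unknown mode: {mode}")


def build_transmission(style_latent, content_latents, mode="keep", default_name="neutral"):
    """Assemble the streams to send; Q1 is omitted when the style is removed."""
    payload = {"Q2": [list(c) for c in content_latents]}
    q1 = manipulate_style(style_latent, mode, default_name)
    if q1 is not None:
        payload["Q1"] = q1
    return payload


speaker_style = [0.9, 0.3, -0.7]
content = [[1, 2], [3, 4]]

kept = build_transmission(speaker_style, content, mode="keep")
anonymized = build_transmission(speaker_style, content, mode="default")
content_only = build_transmission(speaker_style, content, mode="remove")
```

In the `"remove"` case only the content stream leaves the encoder, so no speaker-specific style latent is transmitted.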
- an encoder or decoder having the privacy feature may comprise the same optional features (regarding the codec used, etc.) which have been discussed or are discussed in the context of the other decoders/encoders.
- the decoder may comprise a conditioning entity and a content decoder.
- the conditioning entity receives a first (latent) data stream of the encoded input audio signal (cf. above discussion) and is configured to output a control signal based on the first (latent) data stream.
- the first (latent) data stream comprises an encoded style information.
- the content decoder is configured to decode the second (latent) data stream of the encoded input audio signal comprising content information (cf. above discussion).
- the content decoder is controlled and/or adapted by the control signal of the conditioning entity.
- the content decoder is configured to obtain a reconstructed signal of the encoded input audio signal based on the first and second (latent) data stream.
- the reconstructed signal is output by the content decoder taking into account the control signal of the conditioning entity.
- This control information is based on the first (latent) data stream comprising the encoded style information.
- the content decoder may be configured to use a default control signal or initial control signal before the conditioning entity provides the control signal based on the first data stream.
- the default control signal may be stored in the decoder.
- the default control signal and/or the initial control signal represents an average style.
- different default control signals or different initial control signals may be used.
- an average style information for a male speaker and/or an average style information for a female speaker may be provided so as to start the decoding based on the default control signal/initial control signal.
- the default control signal or initial control signal represents an average style selected out of the plurality of average styles, wherein the selection is taken based on an external information.
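The start-up behavior described above, where decoding begins with a stored default control signal until the conditioning entity has received style data, might look like the following sketch. The class name, the identity mapping used in place of a real conditioning network, and the stored style vectors are assumptions for illustration.

```python
class ConditioningEntity:
    """Toy conditioning entity with a stored default control signal."""

    def __init__(self, default_styles, selection="average"):
        # `selection` models the external information that picks one default.
        self._control = list(default_styles[selection])
        self.using_default = True

    def update(self, q1):
        """Derive a new control signal from a received style block Q1.

        Assumption: an identity mapping stands in for the conditioning network.
        """
        self._control = list(q1)
        self.using_default = False

    def control_signal(self):
        return list(self._control)


# Hypothetical stored defaults; the numbers are placeholders.
STORED_DEFAULTS = {
    "average": [0.0, 0.0],
    "average_male": [-0.3, 0.1],
    "average_female": [0.3, -0.1],
}

entity = ConditioningEntity(STORED_DEFAULTS, selection="average_female")
initial = entity.control_signal()      # used before any Q1 has arrived
was_default = entity.using_default

entity.update([0.7, 0.2])              # first style block received
adapted = entity.control_signal()
```

A content decoder would query `control_signal()` for every frame, so it can start producing output immediately and switch styles once the first style block arrives.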
- Another embodiment provides a decoder for decoding an encoded input audio signal comprising a conditioning entity receiving the encoded input audio signal and configured to output a control signal based on the encoded input audio signal.
- the encoded input audio signal comprises an encoded style information.
- the decoder further comprises a content decoder configured to decode the encoded input audio signal further comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
- the content decoder may be configured to use a default control signal or an initial control signal before the conditioning entity provides the control signal based on external information or the first (latent) data stream.
- the default control signal or initial control signal represents a given style selected out of a plurality of available styles, wherein the selection is taken based on an external information or control.
- in this way, a deliberate anonymization or a desired voice conversion could be achieved.
- the default control signal or initial control signal represents, e.g., an average style and/or average style for a male speaker and/or average style for a female speaker.
- the default control signal or initial control signal may according to further embodiments represent an average style selected out of a plurality of (average) styles, wherein the selection is taken based on an external information. Note it could also be a set of other candidate styles not necessarily an average of other styles.
- An embodiment provides a decoder for decoding an encoded input audio signal.
- the decoder comprises a conditioning entity and a content decoder.
- the conditioning entity is configured to receive an encoded style information derived from the encoded input audio signal and to perform at least a learned affine transform of an encoded content information derived from the encoded input audio signal to obtain a transformed encoded content information.
- the decoder is configured to further decode the transformed encoded content information.
- Embodiments of the present invention are based on the finding that it is advantageous to disentangle content and style information.
- the content information, which may comprise phonetic content, mainly comprises short-term information.
- the style information, which may comprise information on the speaker identity or speaking style, mainly carries long-term information.
- the decoder comprises two entities, the conditioning entity and the content decoder, so as to enable the decoding of the two different kinds of information, namely the short-term information and the long-term information.
- the long-term information also referred to as style information, is obtained using the conditioning entity while the short-term information, i.e., the content information, is decoded by the content decoder taking into account the previously processed long-term information/style information.
- the conditioning may be performed by a TADE layer (one of developed network building blocks based on StyleMelGAN and NESC as described below) or a conditioning network comprising one or several TADE layers.
- An embodiment provides a method for decoding an encoded input audio signal. The method comprising:
- the default control signal represents an anonymized style information.
- the method may further comprise setting a default/manipulated style on the decoder side and not transmitting the style information for not disclosing identifying information about the speaker in the transmitted signals (encoder side).
- the encoder 10' receives the input audio signal IS, e.g., an audio signal comprising speech.
- the encoder 10' performs encoding so as to obtain two output data streams Q1 and Q2. These two audio data streams Q1 and Q2 are transmitted, e.g., separately to the decoder 20'.
- the decoder 20' performs a reconstruction so as to obtain the reconstructed signal IS'.
- the encoder 10' comprises two encoding entities, namely a style encoder 10s and a content encoder 10c. Each of the two encoders 10s and 10c receives the input audio signal IS and performs an encoding of a portion of the input audio signal.
- the encoder 10s outputs the coded audio signal Q1, while the encoder 10c outputs the coded audio signal Q2.
- the signal Q1 is processed by a conditioning/PreNet 20c, while the encoded audio stream Q2 is decoded by the content decoder 20d.
- the content decoder 20d has an input for Q2 and a control signal generated by 20c.
- the decoder uses the two signals so as to output the reconstructed signal IS'.
- the style encoder 10s is configured to encode style information of the input audio signal to obtain a first (latent) data stream Q1.
- the style information may comprise long-term information, like speaker identity or speaking style.
- the style information may be dependent on the gender (male/female speakers) or dependent on the language.
- the style information may have an influence on the general frequency-dependent power distribution of the voice and/or on the general sound. Due to the encoding, the signal Q1 comprises the specific coded style information.
- the content encoder 10c is configured to encode content information, i.e., short term information, like phonetic information.
- embodiments of present invention are more efficient than the SOTA approaches to neural speech coding [1, 2, 3].
- Part of this encoder would be also a subsequently applied quantizer transforming that continuous output of the neural network to a discrete representation that can be potentially entropy-coded and transmitted.
- a formatter for preparing the data stream, etc. may be part of it.
- NESC is an example for a neural codec.
- NESC may, for example, consist of the following blocks:
- the input waveform is transformed into a two-dimensional representation by a rolling window frontend (The frontend is the first part (receiving the input audio signal) of the encoder).
- the result is processed by a convolutional layer within each frame, recurrent layers along different frames and again a convolutional layer operating within each frame. Together, this is termed dual path convolutional recurrent neural network (DPCRNN).
- DPCRNN dual path convolutional recurrent neural network
- the signal is further processed by several convolutional neural network blocks.
- the output of these blocks is quantized, e.g., by residual vector quantization (as in NESC) or by scalar quantization.
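Residual vector quantization, named above as one quantization option, can be sketched in a few lines: each stage quantizes what the previous stage left over, and only the codebook indices need to be transmitted. The two tiny hand-written codebooks below are assumptions for illustration; a real codec would learn much larger codebooks during training.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected entries of all stages."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks for a 2-dimensional latent.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],       # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],   # refinement stage
]

latent = [1.2, 0.9]
indices = rvq_encode(latent, codebooks)        # what would be transmitted
reconstructed = rvq_decode(indices, codebooks)
```

For this toy latent, both stages select entry 1, and the decoder reconstructs `[1.25, 1.0]` from the two indices.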
- the decoder 20' as already shown by Fig. 2 will be discussed taking reference to Fig. 3b showing the decoder 20d as central component.
- the decoder 20d receives Q2 and outputs IS'. Furthermore, the decoder 20d has another input for the control signal c obtained using the condition PreNet 20c, also referred to as condition entity.
- the quantized representation may according to further embodiments be preprocessed by recurrent neural network blocks (so as to implement an example neural codec based on NESC).
- the result is decoded by Streamwise-StyleMelGAN (SSMGAN).
- SSMGAN consists of a sequence of convolutional layers that subsequently upsample the input audio signal, where the input audio signal is used to condition each upsampling stage with a temporal adaptive DE-normalization (TADE) layer.
- TADE temporal adaptive DE-normalization
- the final signal is synthesized with a Pseudo-Quadrature Mirror Filter bank (PQMF).
- PQMF Pseudo-Quadrature Mirror Filter bank
- the content encoder 10c captures short time relations within the input audio signal (e.g., phonetic information), while the style encoder captures more global information (e.g., the identity of the speaker or the speaking style).
- the content encoder produces latents at a much higher frequency (e.g., 10-100Hz) relative to the style encoder (less than 10Hz).
- both encoders may share some of their layers.
- layers capable of representing long-term information, such as RNNs, may be employed to enforce that the style encoder 10s learns long-term information. To avoid leaking of style information to the content encoder, techniques such as normalization layers enforcing temporal whiteness or input perturbations like vocal tract length perturbations may be used.
- the two latents Q1 and Q2 comprise different information. It should be noted that the latent Q1 (output of the style encoder) is transmitted less often than the latent Q2 (cf. output of the content encoder 10c). The background is that style information is a long-term and/or global feature, while content refers to short-term information.
- the conditioning entity 20c may be configured to receive the encoded style information derived from the encoded input audio signal, e.g., via the data stream Q1, and to perform at least a learnable affine transform of the encoded content information derived from the encoded input audio signal. The result is a transformed encoded content information.
- This content information is forwarded as a kind of control signal C to the content decoder 20d which performs a further decoding.
- the further decoding may take into account the data stream Q2.
- the decoding performed by 20d has the purpose to obtain IS'.
- the data stream Q2 alone may suffice as the input for both 20c and 20d. The background is that long-term information can be derived from Q2 as well.
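The learnable affine transform performed by the conditioning entity can be pictured as a scale and shift of normalized content features, with the scale and shift derived from the style latent. The sketch below is a simplified, time-invariant stand-in written under stated assumptions: the weight matrices are hand-picked rather than learned, the per-channel normalization is one possible choice, and the code does not reproduce the TADE layer as used in StyleMelGAN or NESC.

```python
import math


def linear(weights, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def normalize_over_time(frames):
    """Zero-mean, unit-variance normalization of each channel along the time axis."""
    n_channels = len(frames[0])
    out = [[0.0] * n_channels for _ in frames]
    for ch in range(n_channels):
        column = [frame[ch] for frame in frames]
        mean = sum(column) / len(column)
        var = sum((v - mean) ** 2 for v in column) / len(column)
        std = math.sqrt(var + 1e-8)
        for t, v in enumerate(column):
            out[t][ch] = (v - mean) / std
    return out


def condition(content_frames, style, w_gamma, b_gamma, w_beta, b_beta):
    """Apply a style-dependent affine transform to normalized content frames."""
    gamma = linear(w_gamma, b_gamma, style)   # per-channel scale from the style
    beta = linear(w_beta, b_beta, style)      # per-channel shift from the style
    normed = normalize_over_time(content_frames)
    return [[g * v + b for g, b, v in zip(gamma, beta, frame)] for frame in normed]


# Hypothetical toy inputs: two content frames with two channels, one style vector.
content_frames = [[1.0, 10.0], [3.0, 30.0]]
style = [1.0, 0.0]
transformed = condition(
    content_frames, style,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[5.0, -5.0],
)
```

Changing `style` changes `gamma` and `beta` and therefore the transformed content, while the normalization step removes the original per-channel statistics of the content features.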
- embodiments of the present invention may be implemented as an apparatus, i.e., as decoder and encoder, or may be implemented as the corresponding methods for encoding and decoding.
- Another embodiment refers to the training of the encoder and/or decoder or of the entities of the encoder and/or decoder.
- An embodiment provides a concept for training with at least two encoders, wherein each encoder is specialized to encode certain types of information, e.g., speaking style and content. This means that the training of the encoder 10s may be performed separately and/or specialized to its encoder type.
- the training of the encoder 10c may also be performed separately and/or specialized separately.
- speaking style information may be used, e.g., for different speaking styles, like a female or male speaking style, etc.
- content information may be used for the training of the encoder 10c.
- Another embodiment provides a technique for avoiding style leakage into the content encoder. This may, for example, be achieved by differentiating the input audio signal, e.g., the speech or audio input audio signal, with respect to long-term and/or short-term information.
- Another embodiment provides a system implementation with initialization techniques for reproducing realistic speech with some default style information. This means that during an initial phase, the decoder 20d performs the decoding using initial/default style information as long as the conditioning entity 20c does not provide the control signal C. As already discussed above, by selecting/using a default style information or random style information not obtained from Q1, the privacy of the speaker belonging to the input audio signal can be maintained while the content is transmitted.
- Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
- Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, some, one or more of the most important method steps may be executed by such an apparatus.
- the inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
- embodiments of the invention can be implemented in hardware or in software.
- the implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
- Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
- embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer.
- the program code may for example be stored on a machine readable carrier.
- Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
- an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
- a further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
- the data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
- a further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
- a further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24156121.6A EP4600951A1 (fr) | 2024-02-06 | 2024-02-06 | Disentangled audio coding and decoding with style control |
| PCT/EP2025/052896 WO2025168599A1 (fr) | 2024-02-06 | 2025-02-05 | Encoder and decoder |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24156121.6A EP4600951A1 (fr) | 2024-02-06 | 2024-02-06 | Disentangled audio coding and decoding with style control |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| EP4600951A1 (fr) | 2025-08-13 |
Family
ID=89853371
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24156121.6A Pending EP4600951A1 (fr) | 2024-02-06 | 2024-02-06 | Disentangled audio coding and decoding with style control |
Country Status (2)
| Country | Link |
|---|---|
| EP (1) | EP4600951A1 (fr) |
| WO (1) | WO2025168599A1 (fr) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023175197A1 (fr) * | 2022-03-18 | 2023-09-21 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Vocoder techniques |
- 2024-02-06: EP application EP24156121.6A, patent EP4600951A1 (fr), active, Pending
- 2025-02-05: WO application PCT/EP2025/052896, patent WO2025168599A1 (fr), active, Pending
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023175197A1 (fr) * | 2022-03-18 | 2023-09-21 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Vocoder techniques |
Non-Patent Citations (8)
| Title |
|---|
| DEFOSSEZ ALEXANDRE ET AL., HIGH FIDELITY NEURAL AUDIO COMPRESSION, Retrieved from the Internet <URL:https://arxiv.org/abs/2210.13438> |
| DIMITRIOS STOIDIS ET AL: "Protecting gender and identity with disentangled speech representations", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 June 2021 (2021-06-16), XP081980784 * |
| EBBERS JANEK ET AL: "Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations", ICASSP 2021 - 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 6 June 2021 (2021-06-06), pages 3860 - 3864, XP033954627, DOI: 10.1109/ICASSP39728.2021.9414487 * |
| PIA NICOLA ET AL., NESC: ROBUST NEURAL END-2-END SPEECH CODING WITH GANS, Retrieved from the Internet <URL:https://arxiv.org/abs/2207.03282> |
| PIERRE CHAMPION ET AL: "Are disentangled representations all you need to build speaker anonymization systems?", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 13 January 2023 (2023-01-13), XP091417113 * |
| POLYAK ET AL., SPEECH RESYNTHESIS FROM DISCRETE DISENTANGLED SELF-SUPERVISED REPRESENTATIONS, 2021, Retrieved from the Internet <URL:https://arxiv.org/pdf/2104.00355.pdf> |
| XUE JIANG ET AL: "Disentangled Feature Learning for Real-Time Neural Speech Coding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 February 2023 (2023-02-25), XP091446243 * |
| ZEGHIDOUR NEIL ET AL., SOUNDSTREAM: AN END-TO-END NEURAL AUDIO CODEC, Retrieved from the Internet <URL:https://arxiv.org/abs/2107.03312> |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025168599A1 (fr) | 2025-08-14 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP6951536B2 (ja) | | Speech encoding apparatus and method |
| CN100583241C (zh) | | Audio encoding device, audio decoding device, audio encoding method, and audio decoding method |
| JP4506039B2 (ja) | | Encoding apparatus and method, decoding apparatus and method, and encoding program and decoding program |
| Zhen et al. | | Cascaded cross-module residual learning towards lightweight end-to-end speech coding |
| KR101162275B1 (ko) | | Method and apparatus for processing an audio signal |
| JP2001202097A (ja) | | Coded binary audio processing method |
| CN119252268B (zh) | | Audio decoding and encoding methods, apparatus, electronic device, and storage medium |
| JPWO2007088853A1 (ja) | | Speech encoding device, speech decoding device, speech encoding system, speech encoding method, and speech decoding method |
| CN119698656A (zh) | | Vocoder techniques |
| CN118136030A (zh) | | Audio processing method, apparatus, storage medium, and electronic device |
| CN113314132A (zh) | | Audio object encoding method, decoding method, and apparatus for use in an interactive audio system |
| JP4216364B2 (ja) | | Speech encoding/decoding method and speech signal component separation method |
| Yao et al. | | Variational speech waveform compression to catalyze semantic communications |
| Valin et al. | | DRED: Deep REDundancy coding of speech using a rate-distortion-optimized variational autoencoder |
| EP4600951A1 (fr) | | Disentangled audio coding and decoding with style control |
| EP4600950A1 (fr) | | Encoder and decoder |
| KR102839239B1 (ko) | | Signal transformation based on unique key-based network guidance and conditioning |
| EP4600952A1 (fr) | | Decoder |
| KR102353050B1 (ko) | | Signal reconstruction method and device in stereo signal encoding |
| EP4189680B9 (fr) | | Neural network-based key generation for key-guided neural network-based audio signal transformation |
| CN121214951B (zh) | | Ultra-low-bitrate speech codec system based on text semantic information fidelity |
| Lim et al. | | Perceptual Neural Audio Coding With Modified Discrete Cosine Transform |
| WO2025035955A1 (fr) | | Speech signal decoding method and apparatus, and electronic device |
| WO2024051955A1 (fr) | | Decoder and decoding method for discontinuous transmission of parametrically coded independent streams with metadata |
| WO2025237010A1 (fr) | | Audio communication method, audio conversion method, apparatus, electronic device, computer-readable storage medium, and computer program product |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED |
| | AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR |