EP4600951A1 - Disentangled audio coding and decoding with style control - Google Patents

Disentangled audio coding and decoding with style control

Info

Publication number: EP4600951A1
Authority: EP; European Patent Office
Prior art keywords: style; encoder; decoder; content; information
Prior art date: 2024-02-06
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Pending

Application number

EP24156121.6A

Other languages

German (de)

English (en)

Inventor

Andreas BRENDEL

Kishan GUPTA

Nicola PIA

Guillaume Fuchs

Suraj Pandey

Markus Multrus

Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)

Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV

Original Assignee

Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2024-02-06

Filing date

2024-02-06

Publication date

2025-08-13

2024-02-06 Application filed by Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV

2024-02-06 Priority to EP24156121.6A (EP4600951A1)

2025-02-05 Priority to PCT/EP2025/052896 (WO2025168599A1)

2025-08-13 Publication of EP4600951A1

Status: Pending

Classifications

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Definitions

Embodiments of the present invention refer to an encoder for encoding the input audio signal and a decoder for decoding the encoded input audio signal.
The coding performed by the encoder and the decoder enables the input data to be compressed.
Further embodiments refer to the corresponding methods and to a computer program.
Preferred embodiments are in the field of neural audio codecs using speaker embeddings.
State-of-the-art (SOTA) neural speech coding achieves good quality of the reconstructed speech at bitrates as low as 3.2 kbps.
SOTA neural speech coders [1,2,3] are typically trained in an end-to-end fashion: an input audio signal is encoded by an encoding neural network, the output of the encoder (the so-called latent) is quantized, and the resulting quantized latent is decoded by a decoding neural network (see Fig. 1).
The encoder runs on the transmitter side and provides quantized latents (or rather the indices identifying the quantized latents used), which are transmitted to the receiver side, where the decoder reproduces the input speech signal.
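The quantize-and-transmit-indices step described above can be illustrated with a minimal sketch. The two-dimensional codebook, the latent values and the function names are assumptions made for illustration only; they are not taken from the cited codecs, where the latents come from trained neural networks.

```python
import math

# Hypothetical codebook known to both transmitter and receiver.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def quantize(latent):
    """Transmitter side: index of the codebook entry nearest to the latent."""
    return min(range(len(CODEBOOK)), key=lambda i: math.dist(latent, CODEBOOK[i]))


def dequantize(index):
    """Receiver side: look up the quantized latent for a received index."""
    return CODEBOOK[index]


def bits_per_index():
    """Bits needed to identify one codebook entry."""
    return math.ceil(math.log2(len(CODEBOOK)))


# Stand-ins for encoder outputs (one latent per frame).
latents = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.95)]
indices = [quantize(z) for z in latents]      # only these are transmitted
received = [dequantize(i) for i in indices]   # input to the decoding network
```

Only the indices cross the channel, so the cost per frame in this sketch is `bits_per_index()` bits regardless of the precision of the original latent.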
Fig. 1 shows exemplarily a codec having an encoder 10 receiving an input audio signal IS so as to encode the input audio signal to obtain the encoded input audio signal Q.
This encoded input audio signal Q can be decoded by use of the decoder 20.
The decoder 20 is configured to provide, based on the signal Q, a reconstructed input audio signal IS'.
The transmitted data stream Q must carry long-term information, such as speaker identity and speaker style (called style in the following), as well as short-term information, such as phonetic content (called content in the following).
This way of coding has drawbacks with regard to the coding efficiency. Therefore, there is a need for an improved approach.
An objective of the present invention is to provide a concept for coding speech having high coding efficiency and/or additional functionality.
An embodiment provides an encoder for encoding an input audio signal like a speech signal.
The encoder comprises a style encoder and a content encoder.
The style encoder is configured to encode style information of the input audio signal to obtain a first (latent) data stream.
The content encoder is configured to encode content information to obtain a second (latent) data stream.
Embodiments of the present invention are based on the finding that it is advantageous to disentangle content and style information.
The content information, which may comprise phonetic content, mainly comprises short-term information.
The style information, which may comprise information on the speaker identity or speaking style, mainly carries long-term information.
The encoder is subdivided into two encoders acting in parallel, namely a style encoder and a content encoder. In this way, it is advantageously possible to increase the coding efficiency.
Another advantage could lie in the transmission of the two data streams. Since the information is disentangled and split into two independent streams, they can be transmitted over different transmission channels (with different data rates, transmission intervals, and power consumption) and/or with unequal error protection (UEP).
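As a rough illustration of unequal error protection, the following sketch protects one stream with a simple repetition code and leaves the other unprotected. The choice of a repetition code, the decision to protect the style stream more strongly, and all bit patterns are assumptions for illustration; the text does not specify a protection scheme or which stream receives more protection.

```python
def repeat_encode(bits, factor):
    """Repeat every bit `factor` times (factor 1 means no protection)."""
    return [b for b in bits for _ in range(factor)]


def repeat_decode(coded, factor):
    """Majority vote over each group of `factor` received bits."""
    return [
        1 if 2 * sum(coded[i:i + factor]) > factor else 0
        for i in range(0, len(coded), factor)
    ]


style_bits = [1, 0, 1, 1]            # hypothetical first data stream
content_bits = [0, 1, 1, 0, 1, 0]    # hypothetical second data stream

protected_style = repeat_encode(style_bits, 3)      # stronger protection
protected_content = repeat_encode(content_bits, 1)  # weaker protection

# Simulate a single bit error on the style channel.
corrupted = list(protected_style)
corrupted[4] ^= 1
recovered_style = repeat_decode(corrupted, 3)
```

Because the two streams are independent, each can be given its own protection level without affecting the other.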
The first and the second (latent) data stream are output and/or transmitted separately.
The first and second (latent) data stream may be quantized separately before transmission.
The first data stream is quantized and/or transmitted and/or stored disjointly from the second data stream. Consequently, the two streams may be transmitted separately on different transmission channels.
The first (latent) data stream, which is output by the style encoder, comprises (mainly) long-term information, especially speaker identity information or speaking style information.
The second (latent) data stream (output by the content encoder) may comprise short-term information, especially phonetic content. Due to this, it is advantageously possible to separate not only the encoding but also the transmission, so that the two data streams can be encoded by separate entities. Due to the long-term and short-term character of the first and second (latent) data streams, the first (latent) data stream may comprise information which is sent at a much lower frequency than the second (latent) data stream.
The first (latent) data stream may comprise data blocks that are sent at a low frequency, e.g., fewer than 10 times per second, as the coded style information varies slowly over time.
The second (latent) data stream may be composed of data blocks that have to be sent more frequently, e.g., 10-100 times per second, to account for the short-term nature of the coded signal content. This means that the content information is sent more frequently than the style information.
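The effect of the different block rates on the overall bitrate can be worked through with a short calculation. The block rates are chosen inside the ranges named above; the block sizes in bits are purely hypothetical and are not stated in the text.

```python
def stream_bitrate(blocks_per_second, bits_per_block):
    """Bitrate of one stream in bits per second."""
    return blocks_per_second * bits_per_block


# Assumed example values (not from the source).
STYLE_BLOCKS_PER_SECOND = 2      # "below 10 times per second"
CONTENT_BLOCKS_PER_SECOND = 50   # within "10-100 times per second"
STYLE_BITS_PER_BLOCK = 64
CONTENT_BITS_PER_BLOCK = 60

style_bps = stream_bitrate(STYLE_BLOCKS_PER_SECOND, STYLE_BITS_PER_BLOCK)
content_bps = stream_bitrate(CONTENT_BLOCKS_PER_SECOND, CONTENT_BITS_PER_BLOCK)
total_bps = style_bps + content_bps
style_share = style_bps / total_bps
```

Under these assumed numbers the style stream contributes 128 bit/s and the content stream 3000 bit/s, so the slowly varying style information accounts for only a few percent of the total.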
The above-defined encoder is especially advantageous for speech coding. Therefore, the input audio signal may be a speech signal or may comprise a speech recording.
An input audio signal may be a speech signal of a person, the person being characterized by the first (latent) data stream with regard to a characteristic speech style, e.g., a male speech style, a female speech style or another speech style.
Such speech style information is highly characteristic of a speech style and describes, accurately but in general terms, the speech behavior of the person whose speech is to be coded.
The style encoder and the content encoder use neural codecs like NESC or other neural network codecs for encoding. Such neural codecs are advantageously trained or trainable using machine learning techniques.
The style encoder and/or the content encoder are trained or trainable using machine learning techniques (artificial intelligence algorithms or neural network approaches).
The style encoder is configured to employ layers of a recurrent neural network or other layers capable of representing long-term information, for example a dilated convolutional neural network or a transformer.
The content encoder is configured to use normalization layers enforcing temporal whiteness, or input perturbations, or vocal tract length perturbations on the input audio signal waveform, for emphasizing short-term characteristics.
The disentanglement of content and style also allows for separate manipulation of both representations and corresponding decoding, e.g., for speaker anonymization.
The encoder can use just the content encoder, configured to encode content information to obtain a second data stream. In this way, the content can be transmitted without transmitting the identity-sensitive style information.
On the encoder side, it is possible to manipulate or delete the style information and/or the first data stream.
The corresponding style encoder is simply switched off.
A default style information may be used so that the content can be reproduced in an anonymized manner.
An embodiment provides a decoder comprising a conditioning entity and a content decoder.
The conditioning entity outputs a control signal, namely a default control signal which comprises style information.
The content decoder is configured to decode the data stream (comparable to the second data stream) of the encoded input audio signal comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
The reconstructed speech may be anonymized, i.e., the reconstructed speech signal does not allow the speaker to be identified because the style was changed (speaker privacy can be achieved according to this aspect).
The default control signal may represent an anonymized style information.
The default control signal representing a default style information may be selected out of a plurality of default styles (e.g., a male or a female style). This may be done by use of a control signal.
The control signal can be sent by the transmitter or encoder so as to reproduce the content with a specific style (voice conversion).
Manipulating the style information comprises setting the style information to a default style information or selecting the style information out of a plurality of available style informations.
Manipulating the style information comprises removing the style information or the first data stream before transmission.
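The three manipulation options named above (removing the style information, setting it to a default, or selecting it from several available styles) can be sketched as follows. The frame layout, the preset table and all values are hypothetical placeholders; a real system would carry quantized latents rather than small tuples.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Style = Tuple[float, ...]

# Hypothetical table of available styles; values are placeholders.
PRESET_STYLES = {
    "default": (0.0, 0.0, 0.0),
    "preset_a": (0.4, -0.1, 0.2),
    "preset_b": (-0.3, 0.5, 0.0),
}


@dataclass(frozen=True)
class EncodedFrame:
    content: Tuple[int, ...]   # second data stream (content)
    style: Optional[Style]     # first data stream (style), None if withheld


def manipulate_style(frame, mode, preset="default"):
    """Return a copy of the frame with its style removed or replaced."""
    if mode == "remove":
        return EncodedFrame(frame.content, None)
    if mode == "default":
        return EncodedFrame(frame.content, PRESET_STYLES["default"])
    if mode == "select":
        return EncodedFrame(frame.content, PRESET_STYLES[preset])
    raise ValueError(f"unknown mode: {mode}")


original = EncodedFrame(content=(7, 2, 5), style=(0.9, 0.8, -0.7))
anonymized = manipulate_style(original, "remove")
converted = manipulate_style(original, "select", preset="preset_a")
```

In every mode the content stream is passed through unchanged, which reflects the point of the disentanglement: the style can be altered or withheld without touching the content.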
An encoder or decoder having the privacy feature may comprise the same optional features (regarding the codec used, etc.) which have been discussed or are discussed in the context of the other decoders/encoders.
The decoder may comprise a conditioning entity and a content decoder.
The conditioning entity receives a first (latent) data stream of the encoded input audio signal (cf. above discussion) and is configured to output a control signal based on the first (latent) data stream.
The first (latent) data stream comprises an encoded style information.
The content decoder is configured to decode the second (latent) data stream of the encoded input audio signal comprising content information (cf. above discussion).
The content decoder is controlled and/or adapted by the control signal of the conditioning entity.
The content decoder is configured to obtain a reconstructed signal of the encoded input audio signal based on the first and second (latent) data stream.
The reconstructed signal is output by the content decoder taking into account the control signal of the conditioning entity.
This control information is based on the first (latent) data stream comprising the encoded style information.
The content decoder may be configured to use a default control signal or initial control signal before the conditioning entity provides the control signal based on the first data stream.
The default control signal may be stored in the decoder.
The default control signal and/or the initial control signal represents an average style.
Different default control signals or different initial control signals may be used.
An average style information for a male speaker and/or an average style information for a female speaker may be provided so as to start the decoding based on the default control signal/initial control signal.
The default control signal or initial control signal represents an average style selected out of the plurality of average styles, wherein the selection is made based on an external information.
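The start-up behavior described above, where a stored default control signal is used until style information from the first data stream becomes available, can be sketched as a small state holder. The class name, the default table and the pass-through of the style latent are assumptions for illustration; an actual conditioning entity would derive the control signal with a trained network.

```python
class ConditioningEntity:
    """Holds the control signal used by the content decoder."""

    def __init__(self, default_styles, selection="average"):
        # Start from a stored default, chosen by external information.
        self._control = default_styles[selection]
        self.initialized_from_stream = False

    def receive_style(self, style_latent):
        # Assumption: the latent is passed through unchanged; a real
        # conditioning network would transform it first.
        self._control = tuple(style_latent)
        self.initialized_from_stream = True

    @property
    def control(self):
        return self._control


# Hypothetical stored defaults (placeholder values).
DEFAULTS = {
    "average": (0.0, 0.0),
    "average_male": (-0.2, 0.1),
    "average_female": (0.2, -0.1),
}

cond = ConditioningEntity(DEFAULTS, selection="average_female")
before = cond.control          # default used during the initial phase
cond.receive_style([0.7, 0.3])
after = cond.control           # style derived from the first data stream
```

If `receive_style` is never called, the decoder keeps running on the default, which corresponds to the anonymized operation discussed earlier.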
Another embodiment provides a decoder for decoding an encoded input audio signal comprising a conditioning entity receiving the encoded input audio signal and configured to output a control signal based on the encoded input audio signal.
The encoded input audio signal comprises an encoded style information.
The decoder further comprises a content decoder configured to decode the encoded input audio signal further comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
The content decoder may be configured to use a default control signal or an initial control signal before the conditioning entity provides the control signal based on external information or the first (latent) data stream.
The default control signal or initial control signal represents a given style selected out of a plurality of available styles, wherein the selection is made based on an external information or control.
A deliberate anonymization or a desired voice conversion could be achieved.
The default control signal or initial control signal represents, e.g., an average style and/or an average style for a male speaker and/or an average style for a female speaker.
According to further embodiments, the default control signal or initial control signal may represent an average style selected out of a plurality of (average) styles, wherein the selection is made based on an external information. Note that it could also be a set of other candidate styles, not necessarily an average of other styles.
An embodiment provides a decoder for decoding an encoded input audio signal.
The decoder comprises a conditioning entity and a content decoder.
The conditioning entity is configured to receive an encoded style information derived from the encoded input audio signal and to perform at least a learned affine transform of an encoded content information derived from the encoded input audio signal to obtain a transformed encoded content information.
The decoder is configured to further decode the transformed encoded content information.
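A style-conditioned affine transform of the encoded content can be sketched as an element-wise scale and shift whose parameters are computed from the style vector. The linear mappings, their weights and the vector sizes below are hypothetical; in a trained system these parameters would be learned, and the exact form of the transform is not specified in the text.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for small lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def affine_condition(content, style, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift the content vector with style-dependent parameters."""
    gamma = [g + b for g, b in zip(matvec(w_gamma, style), b_gamma)]
    beta = [g + b for g, b in zip(matvec(w_beta, style), b_beta)]
    return [g * c + b for g, c, b in zip(gamma, content, beta)]


# Placeholder "learned" parameters (assumed values).
W_GAMMA = [[1.0, 0.0], [0.0, 1.0]]
B_GAMMA = [0.0, 0.0]
W_BETA = [[0.0, 0.0], [0.0, 0.0]]
B_BETA = [0.5, -0.5]

style = [1.0, 2.0]      # encoded style information
content = [3.0, 4.0]    # encoded content information
transformed = affine_condition(content, style, W_GAMMA, B_GAMMA, W_BETA, B_BETA)
```

With these placeholder parameters the scale equals the style vector and the shift is constant, giving `[3.5, 7.5]`; the transformed vector is what would then be passed on for further decoding.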
Embodiments of the present invention are based on the finding that it is advantageous to disentangle content and style information.
The content information, which may comprise phonetic content, mainly comprises short-term information.
The style information, which may comprise information on the speaker identity or speaking style, mainly carries long-term information.
The decoder comprises two entities, the conditioning entity and the content decoder, so as to enable the decoding of the two different kinds of information, namely the short-term information and the long-term information.
The long-term information, also referred to as style information, is obtained using the conditioning entity, while the short-term information, i.e., the content information, is decoded by the content decoder taking into account the previously processed long-term information/style information.
The conditioning may be performed by a TADE layer (one of the developed network building blocks based on StyleMelGAN and NESC, as described below) or by a conditioning network comprising one or several TADE layers.
An embodiment provides a method for decoding an encoded input audio signal. The method comprising:
The default control signal represents an anonymized style information.
The method may further comprise setting a default/manipulated style on the decoder side and not transmitting the style information, so that no identifying information about the speaker is disclosed in the transmitted signals (encoder side).
The encoder 10' receives the input audio signal IS, e.g., an audio signal comprising speech.
The encoder 10' performs encoding so as to obtain two output data streams Q1 and Q2. These two audio data streams Q1 and Q2 are transmitted, e.g., separately, to the decoder 20'.
The decoder 20' performs a reconstruction so as to obtain the reconstructed signal IS'.
The encoder 10' comprises two encoding entities, namely a style encoder 10s and a content encoder 10c. Each of the two encoders 10s and 10c receives the input audio signal IS and performs an encoding of a portion of the input audio signal.
The encoder 10s outputs the coded audio signal Q1, while the encoder 10c outputs the coded audio signal Q2.
The signal Q1 is processed by a conditioning/PreNet 20c, while the encoded audio stream Q2 is decoded by the content decoder 20d.
The content decoder 20d has an input for Q2 and an input for a control signal generated by 20c.
The decoder uses the two signals so as to output the reconstructed signal IS'.
The style encoder 10s is configured to encode style information of the input audio signal to obtain a first (latent) data stream Q1.
The style information may comprise long-term information, such as a speaking style.
The style information may be dependent on the gender (male/female speakers) or dependent on the language.
The style information may have an influence on the general frequency-dependent power distribution of the voice and/or on the general sound. Due to the encoding, the signal Q1 comprises the specific coded style information.
The content encoder 10c is configured to encode content information, i.e., short-term information, like phonetic information.
Embodiments of the present invention are more efficient than the SOTA approaches to neural speech coding [1, 2, 3].
Part of this encoder would also be a subsequently applied quantizer transforming the continuous output of the neural network into a discrete representation that can potentially be entropy-coded and transmitted.
A formatter for preparing the data stream, etc. may also be part of it.
NESC is an example for a neural codec.
NESC may, for example, consist of the following blocks:
The input waveform is transformed into a two-dimensional representation by a rolling window frontend (the frontend being the first part of the encoder, which receives the input audio signal).
The result is processed by a convolutional layer within each frame, recurrent layers across frames, and again a convolutional layer operating within each frame. Together, this is termed a dual path convolutional recurrent neural network (DPCRNN).
The signal is further processed by several convolutional neural network blocks.
The output of these blocks is quantized, e.g., by residual vector quantization (as in NESC) or by scalar quantization.
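Residual vector quantization, mentioned above as one quantization option, can be sketched in a few lines: each stage quantizes what the previous stages left over. The two small codebooks and the input vector are hypothetical; trained codecs use learned codebooks of much larger size and dimension.

```python
import math


def nearest(codebook, vector):
    """Index of the codebook entry closest to the vector."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage, each stage coding the remaining residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries of all stages."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical two-stage codebooks (coarse, then fine).
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.25, -0.25), (-0.25, 0.25)],
]

indices = rvq_encode((1.2, 0.8), CODEBOOKS)
reconstruction = rvq_decode(indices, CODEBOOKS)
```

The first stage picks the coarse entry `(1.0, 1.0)` and the second stage refines the residual, so the reconstruction `[1.25, 0.75]` is closer to the input than the coarse stage alone.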
The decoder 20', as already shown in Fig. 2, will be discussed with reference to Fig. 3b, which shows the decoder 20d as the central component.
The decoder 20d receives Q2 and outputs IS'. Furthermore, the decoder 20d has another input for the control signal C obtained using the conditioning PreNet 20c, also referred to as the conditioning entity.
According to further embodiments, the quantized representation may be preprocessed by recurrent neural network blocks (so as to implement an example neural codec based on NESC).
The result is decoded by Streamwise-StyleMelGAN (SSMGAN).
SSMGAN consists of a sequence of convolutional layers that successively upsample the input audio signal, where the input audio signal is used to condition each upsampling stage with a Temporal Adaptive DE-normalization (TADE) layer.
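The idea behind a de-normalization layer of this kind can be sketched as follows, under the simplifying assumption that the layer normalizes each channel over time and then applies a time-varying scale and shift derived from the conditioning signal. Here the scale (`gamma`) and shift (`beta`) are supplied directly, whereas in a trained network they would be produced by learned layers; this is a simplified reading and not the layer used in the cited codecs.

```python
import math


def normalize(channel, eps=1e-5):
    """Zero-mean, unit-variance normalization of one channel over time."""
    mean = sum(channel) / len(channel)
    var = sum((x - mean) ** 2 for x in channel) / len(channel)
    return [(x - mean) / math.sqrt(var + eps) for x in channel]


def tade_like(activation, gamma, beta):
    """Normalize each channel, then modulate it per time step.

    activation, gamma, beta: lists of channels, each a list over time.
    """
    out = []
    for channel, g, b in zip(activation, gamma, beta):
        normed = normalize(channel)
        out.append([gi * ni + bi for gi, ni, bi in zip(g, normed, b)])
    return out


activation = [[1.0, 2.0, 3.0]]          # one channel, three time steps
identity = tade_like(activation, [[1.0, 1.0, 1.0]], [[0.0, 0.0, 0.0]])
overridden = tade_like(activation, [[0.0, 0.0, 0.0]], [[0.1, 0.2, 0.3]])
```

With unit scale and zero shift the output is just the normalized activation; with zero scale the output is fully determined by the conditioning-derived shift, which shows how strongly the conditioning can steer each upsampling stage.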
The final signal is synthesized with a Pseudo-Quadrature Mirror Filter bank (PQMF).
The content encoder 10c captures short-time relations within the input audio signal (e.g., phonetic information), while the style encoder 10s captures more global information (e.g., the identity of the speaker or the speaking style).
The content encoder produces latents at a much higher frequency (e.g., 10-100 Hz) than the style encoder (less than 10 Hz).
Both encoders may share some of their layers.
Layers capable of representing long-term information, such as RNNs, may be employed to enforce that the style encoder 10s learns long-term information. To avoid leakage of style information into the content encoder, techniques such as normalization layers enforcing temporal whiteness or input perturbations like vocal tract length perturbations may be used.
The two latents Q1 and Q2 comprise different information. It should be noted that the latent Q1 (output of the style encoder) is transmitted less often than the latent Q2 (cf. output of the content encoder 10c). The background is that style information is a long-term and/or global feature, while content refers to short-time information.
The conditioning entity 20c may be configured to receive the encoded style information derived from the encoded input audio signal, e.g., via the data stream Q1, and to perform at least a learnable affine transform of the encoded content information derived from the encoded input audio signal. The result is a transformed encoded content information.
This transformed content information is forwarded as a kind of control signal C to the content decoder 20d, which performs a further decoding.
The further decoding may take into account the data stream Q2.
The decoding performed by 20d has the purpose of obtaining IS'.
The data stream Q2 may suffice as the input for 20c and 20d. The background is that long-term information can be derived from Q2 as well.
Embodiments of the present invention may be implemented as an apparatus, i.e., as decoder and encoder, or may be implemented as the corresponding methods for encoding and decoding.
Another embodiment refers to the training of the encoder and/or decoder or of the entities of the encoder and/or decoder.
An embodiment provides a concept for training with at least two encoders, wherein each encoder is specialized to encode certain types of information, e.g., speaking style and content. This means that the training of the encoder 10s may be performed separately and/or specialized to its encoder type.
The training of the encoder 10c may also be performed separately and/or specialized.
Speaking style information may be used for the training of the encoder 10s, e.g., for different speaking styles, like a female or male speaking style, etc.
Content information may be used for the training of the encoder 10c.
Another embodiment provides a technique for avoiding style leakage into the content encoder. This may, for example, be achieved by differentiating the input audio signal, e.g., the speech or audio input signal, with respect to long-term and/or short-term information.
Another embodiment provides an implementation of a system with initialization techniques for being able to reproduce realistic speech with some default style information. This means that during an initial phase, the decoder 20d performs the decoding using initial/default style information as long as the conditioning entity 20c does not provide the control signal C. As already discussed above, by selecting/using a default style information or random style information not obtained from Q1, the privacy of the speaker of the input audio signal can be maintained while the content is transmitted.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Embodiments of the invention can be implemented in hardware or in software.
The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer.
The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
An embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.

Landscapes

Engineering & Computer Science (AREA)
Computational Linguistics (AREA)
Signal Processing (AREA)
Health & Medical Sciences (AREA)
Audiology, Speech & Language Pathology (AREA)
Human Computer Interaction (AREA)
Physics & Mathematics (AREA)
Acoustics & Sound (AREA)
Multimedia (AREA)
Compression, Expansion, Code Conversion, And Decoders (AREA)

EP24156121.6A 2024-02-06 2024-02-06 Disentangled audio coding and decoding with style control Pending EP4600951A1 (fr)

Priority Applications (2)

Application Number	Priority Date	Filing Date	Title
EP24156121.6A EP4600951A1 (fr)	2024-02-06	2024-02-06	Disentangled audio coding and decoding with style control
PCT/EP2025/052896 WO2025168599A1 (fr)	2024-02-06	2025-02-05	Encoder and decoder

Applications Claiming Priority (1)

Application Number	Priority Date	Filing Date	Title
EP24156121.6A EP4600951A1 (fr)	2024-02-06	2024-02-06	Disentangled audio coding and decoding with style control

Publications (1)

Publication Number	Publication Date
EP4600951A1 (fr)	2025-08-13

Family

ID=89853371

Family Applications (1)

Application Number	Priority Date	Filing Date	Title
EP24156121.6A Pending EP4600951A1 (fr)	2024-02-06	2024-02-06	Disentangled audio coding and decoding with style control

Country Status (2)

Country	Link
EP (1)	EP4600951A1 (fr)
WO (1)	WO2025168599A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
WO2023175197A1 (fr) *	2022-03-18	2023-09-21	Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.	Vocoder techniques

2024
- 2024-02-06 EP EP24156121.6A patent/EP4600951A1/fr active Pending
2025
- 2025-02-05 WO PCT/EP2025/052896 patent/WO2025168599A1/fr active Pending

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
DEFOSSEZ ALEXANDRE ET AL., HIGH FIDELITY NEURAL AUDIO COMPRESSION, Retrieved from the Internet <URL:https://arxiv.org/abs/2210.13438>
DIMITRIOS STOIDIS ET AL: "Protecting gender and identity with disentangled speech representations", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 June 2021 (2021-06-16), XP081980784 *
EBBERS JANEK ET AL: "Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations", ICASSP 2021 - 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 6 June 2021 (2021-06-06), pages 3860 - 3864, XP033954627, DOI: 10.1109/ICASSP39728.2021.9414487 *
PIA NICOLA ET AL., NESC: ROBUST NEURAL END-2-END SPEECH CODING WITH GANS, Retrieved from the Internet <URL:https://arxiv.org/abs/2207.03282>
PIERRE CHAMPION ET AL: "Are disentangled representations all you need to build speaker anonymization systems?", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 13 January 2023 (2023-01-13), XP091417113 *
POLYAK ET AL., SPEECH RESYNTHESIS FROM DISCRETE DISENTANGLED SELF-SUPERVISED REPRESENTATIONS, 2021, Retrieved from the Internet <URL:https://arxiv.org/pdf/2104.00355.pdf>
XUE JIANG ET AL: "Disentangled Feature Learning for Real-Time Neural Speech Coding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 February 2023 (2023-02-25), XP091446243 *
ZEGHIDOUR NEIL ET AL., SOUNDSTREAM: AN END-TO-END NEURAL AUDIO CODEC, Retrieved from the Internet <URL:https://arxiv.org/abs/2107.03312>

Also Published As

Publication number	Publication date
WO2025168599A1 (fr)	2025-08-14

Legal Events

Date	Code	Title	Description
2025-07-11	PUAI	Public reference made under article 153(3) epc to a published international application that has entered the european phase	Free format text: ORIGINAL CODE: 0009012