EP4600950A1 - Encoder and decoder - Google Patents

Encoder and decoder

Info

Publication number: EP4600950A1
Authority: EP; European Patent Office
Prior art keywords: encoder; data stream; style; decoder; content
Prior art date: 2024-02-06
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Pending

Application number

EP24156124.0A

Other languages

German (de)

English (en)

Inventor

Andreas BRENDEL

Kishan GUPTA

Nicola PIA

Guillaume Fuchs

Suraj Pandey

Markus Multrus

Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)

Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV

Original Assignee

Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2024-02-06

Filing date

2024-02-06

Publication date

2025-08-13

2024-02-06 Application filed by Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV

2024-02-06 Priority to EP24156124.0A

2025-02-05 Priority to PCT/EP2025/052897

2025-08-13 Publication of EP4600950A1

Status: Pending

Classifications

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser

Definitions

Embodiments of the present invention refer to an encoder for encoding an input audio signal and a decoder for decoding the encoded input audio signal.
The coding performed by the encoder and the decoder allows the input data to be compressed.
Further embodiments refer to the corresponding methods and to a computer program.
Preferred embodiments are in the field of neural audio codecs using speaker embeddings.
State-of-the-art (SOTA) neural speech coding achieves good quality of the reconstructed speech at bitrates as low as 3.2 kbps.
SOTA neural speech coders [1,2,3] are typically trained in an end-to-end fashion, where an input audio signal is encoded by an encoding neural network, the output of the encoder (the so-called latent) is quantized, and the resulting quantized latent is decoded by a decoding neural network (see Fig. 1).
The encoder runs on the transmitter side and provides quantized latents (or rather indices identifying the quantized latents used), which are transmitted to the receiver side, where the decoder reproduces the input speech signal.
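The encode, quantize, transmit-indices, decode chain described above can be illustrated with a minimal sketch. It is not the codec of the cited works: the "networks" are replaced by trivial stand-in functions, and the codebook values, frame length, and all function names are assumptions chosen only for illustration.

```python
# Toy end-to-end codec: encoder -> quantizer (indices) -> decoder.
# The encode/decode functions are stand-ins for trained neural networks.

CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical scalar codebook


def encode(frame):
    """Stand-in for the encoding network: one latent per frame (its mean)."""
    return sum(frame) / len(frame)


def quantize(latent):
    """Return the index of the codebook entry nearest to the latent."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - latent))


def decode(index, frame_len):
    """Stand-in for the decoding network: look up the latent and expand it."""
    return [CODEBOOK[index]] * frame_len


signal = [0.4, 0.6, 0.5, 0.5, -0.9, -1.0, -1.1, -1.0]
frame_len = 4
frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]

# Only these indices would be sent from the transmitter to the receiver.
indices = [quantize(encode(frame)) for frame in frames]

# Receiver side: rebuild a signal of the original length from the indices.
reconstruction = [s for i in indices for s in decode(i, frame_len)]
```

Only the list `indices` crosses the channel in this sketch; the receiver needs the same codebook to map each index back to a latent.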
Fig. 1 shows an exemplary codec having an encoder 10 that receives an input audio signal IS and encodes it to obtain the encoded input audio signal Q.
This encoded input audio signal Q can be decoded by use of the decoder 20.
The decoder 20 is configured to provide, based on the signal Q, a reconstructed input audio signal IS'.
The transmitted data stream Q must carry long-term information, like speaker identity, speaker style (called style in the following), etc., as well as short-term information, like phonetic content (called content in the following).
This way of coding has drawbacks with regard to coding efficiency. Therefore, there is a need for an improved approach.
An objective of the present invention is to provide a concept for coding speech having high coding efficiency.
An embodiment provides an encoder for encoding an input audio signal, such as a speech signal.
The encoder comprises a style encoder and a content encoder.
The style encoder is configured to encode style information of the input audio signal to obtain a first (latent) data stream.
The content encoder is configured to encode content information to obtain a second (latent) data stream.
Embodiments of the present invention are based on the finding that it is advantageous to disentangle content and style information.
The content information, which may comprise phonetic content, mainly comprises short-term information.
The style information, which may comprise information on the speaker identity or speaking style, mainly carries long-term information.
The encoder is subdivided into two encoders acting in parallel, namely a style encoder and a content encoder. In this way, it is advantageously possible to increase the coding efficiency.
Another advantage could lie in the transmission of the two data streams. Since the information is disentangled and split into two independent streams, they can be transmitted over different transmission channels (with different data rates, transmission intervals, and power consumption) and/or with unequal error protection (UEP).
The first and the second (latent) data stream are output and/or transmitted separately.
The first and second (latent) data stream may be quantized separately before transmission.
The first data stream is quantized and/or transmitted and/or stored disjointly from the second data stream. Consequently, the two streams may be transmitted separately on different transmission channels.
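As a minimal sketch of unequal error protection on two separately transmitted streams, the example below applies a simple repetition code with majority voting. The source does not state which stream receives the stronger protection or which code is used; protecting the style stream more strongly, the repetition factor, and the bit patterns are all assumptions made for illustration.

```python
# Unequal error protection (UEP) sketch using a repetition code.
# Assumption: the rarely sent style stream gets 3x repetition,
# while the content stream is sent unprotected.

def protect(bits, repeat):
    """Repeat every bit `repeat` times."""
    return [b for b in bits for _ in range(repeat)]


def recover(received, repeat):
    """Majority-vote each group of `repeat` received bits."""
    decoded = []
    for i in range(0, len(received), repeat):
        group = received[i:i + repeat]
        decoded.append(1 if sum(group) * 2 > len(group) else 0)
    return decoded


style_bits = [1, 0, 1, 1]            # hypothetical payload of the first stream
content_bits = [0, 1, 1, 0, 1, 0]    # hypothetical payload of the second stream

tx_style = protect(style_bits, repeat=3)
tx_content = protect(content_bits, repeat=1)

# Simulate one bit error on each channel.
rx_style = list(tx_style)
rx_style[4] ^= 1
rx_content = list(tx_content)
rx_content[2] ^= 1

style_ok = recover(rx_style, repeat=3) == style_bits
content_ok = recover(rx_content, repeat=1) == content_bits
```

In this sketch the single error in the protected stream is corrected, while the same error in the unprotected stream is not.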
The first (latent) data stream, which is output by the style encoder, comprises (mainly) long-term information, especially speaker identity information or speaking style information.
The second (latent) data stream (output by the content encoder) may comprise short-term information, especially phonetic content. Due to this, it is advantageously possible to separate not only the encoding but also the transmission, so that the two data streams can be encoded with separate entities. Due to the long-term and short-term character of the first and second (latent) data stream, the first (latent) data stream may comprise information that is sent at a much lower frequency than the second (latent) data stream.
The first (latent) data stream may comprise data blocks that are sent at a low frequency, e.g., fewer than 10 times per second, as the coded style information varies slowly over time.
The second (latent) data stream may be composed of data blocks that have to be sent more frequently, e.g., 10-100 times per second, to account for the short-term nature of the coded signal content. This means that the content information is sent more frequently than the style information.
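The effect of the different block rates on the overall bitrate can be made concrete with a short calculation. The block rates below lie within the ranges named above, but the bits per block are purely hypothetical values, not figures from the source.

```python
# Bitrate arithmetic for two streams with different block rates.
# All numeric values are illustrative assumptions.

def stream_bitrate(blocks_per_second, bits_per_block):
    """Bitrate in bits per second of a stream of fixed-size blocks."""
    return blocks_per_second * bits_per_block


style_bps = stream_bitrate(blocks_per_second=2, bits_per_block=64)
content_bps = stream_bitrate(blocks_per_second=50, bits_per_block=40)
total_bps = style_bps + content_bps

# Fraction of the total bitrate spent on style information.
style_share = style_bps / total_bps
```

Under these assumed numbers the style stream costs 128 bit/s against 2000 bit/s for the content stream, so sending style rarely keeps its share of the total small.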
The above-defined encoder is especially advantageous for speech coding. Therefore, the input audio signal may be a speech signal or may comprise a speech recording.
An input audio signal may be a speech signal of a person, the person being characterized by the first (latent) data stream with regard to a characteristic speech style, e.g., a male speech style, a female speech style, or another speech style.
Such speech style information is highly characteristic of a speech style and describes, accurately but in general terms, the speech behavior of the person whose speech is to be coded.
The style encoder and the content encoder use neural codecs like NESC or other neural network codecs for encoding. Such neural codecs are advantageously trained or trainable using machine learning techniques.
The style encoder and/or the content encoder are trained or trainable using machine learning techniques (artificial intelligence algorithms or neural network approaches).
The style encoder is configured to employ layers of a recurrent neural network or layers representing long-term information, for example a dilated convolutional neural network or a transformer.
The content encoder is configured to use normalization layers enforcing temporal whiteness, or input perturbations or vocal tract length perturbations on the input audio signal waveform, to emphasize short-term characteristics.
The disentanglement of content and style also allows for separate manipulation of both representations and corresponding decoding, e.g., for speaker anonymization.
The encoder can use just the content encoder, configured to encode content information to obtain a second data stream. In this way, the content can be transmitted without transmitting the identity-sensitive style information.
Default style information is used so that the content can be reproduced in an anonymized manner.
The decoder may comprise a conditioning entity and a content decoder.
The conditioning entity receives a first (latent) data stream of the encoded input audio signal (cf. above discussion) and is configured to output a control signal based on the first (latent) data stream.
The first (latent) data stream comprises encoded style information.
The content decoder is configured to decode the second (latent) data stream of the encoded input audio signal comprising content information (cf. above discussion).
The content decoder is controlled and/or adapted by the control signal of the conditioning entity.
The content decoder is configured to obtain a reconstructed signal of the encoded input audio signal based on the first and second (latent) data stream.
The reconstructed signal is output by the content decoder taking into account the control signal of the conditioning entity. This control signal is based on the first (latent) data stream comprising the encoded style information.
The content decoder may be configured to use a default control signal or initial control signal before the conditioning entity provides the control signal based on the first data stream.
The default control signal may be stored in the decoder.
The default control signal and/or the initial control signal represents an average style.
Different default control signals or different initial control signals may be used.
Average style information for a male speaker and/or average style information for a female speaker may be provided so as to start the decoding based on the default control signal/initial control signal.
The default control signal or initial control signal represents an average style selected out of a plurality of average styles, wherein the selection is made based on external information.
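The fallback to a stored default control signal until the first style information arrives might be sketched as follows. The class name, the style vectors, and the dictionary keys are hypothetical; a real conditioning entity would derive the control signal from the first data stream with a trained network rather than pass it through unchanged.

```python
# Sketch of a conditioning entity with a stored default control signal.
# The style vectors below are made-up placeholders, not learned values.

class ConditioningEntity:
    DEFAULT_STYLES = {
        "average": [0.0, 0.0],
        "male": [-0.5, 0.2],
        "female": [0.5, -0.2],
    }

    def __init__(self, default="average"):
        # `default` plays the role of the external selection information.
        self._default = self.DEFAULT_STYLES[default]
        self._style = None

    def receive_style(self, q1):
        """Store style information decoded from the first data stream."""
        self._style = list(q1)

    def control_signal(self):
        """Use the default until style information has been received."""
        return self._style if self._style is not None else self._default


entity = ConditioningEntity(default="female")
before = entity.control_signal()      # default style is used
entity.receive_style([0.3, 0.1])      # first style block arrives
after = entity.control_signal()       # received style replaces the default
```

If `receive_style` is never called, decoding continues with the selected default, which corresponds to the anonymized reproduction described above.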
The decoder may comprise a conditioning entity and a content decoder.
The conditioning entity outputs a control signal that is a default control signal comprising style information.
The content decoder is configured to decode the data stream (comparable to the second data stream) of the encoded input audio signal comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
The reconstructed speech may be anonymized, i.e., the reconstructed speech signal does not allow the speaker to be identified, as the style was changed (speaker privacy can be achieved according to this aspect).
The default control signal representing default style information may be selected out of a plurality of default styles. This may be done by use of a control signal.
Another embodiment provides a decoder for decoding an encoded input audio signal, the decoder comprising a conditioning entity that receives the encoded input audio signal and is configured to output a control signal based on the encoded input audio signal.
The encoded input audio signal comprises encoded style information.
The decoder further comprises a content decoder configured to decode the encoded input audio signal, which further comprises content information, wherein the content decoder is controlled and/or adapted by the control signal.
The content decoder may be configured to use a default control signal or an initial control signal before the conditioning entity provides the control signal based on external information or the first (latent) data stream.
The default control signal or initial control signal represents a given style selected out of a plurality of available styles, wherein the selection is made based on external information or control.
A deliberate anonymization or a desired voice conversion could thereby be achieved.
The decoder is configured to decode an encoded input audio signal and comprises a conditioning entity and a content decoder.
The conditioning entity is configured to output a control signal representing style information.
The default control signal or initial control signal is used as the control signal and thus represents a given style selected out of one or more available styles.
The content decoder is configured to decode the encoded input audio signal comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
The default control signal or initial control signal represents, e.g., an average style and/or an average style for a male speaker and/or an average style for a female speaker.
The default control signal or initial control signal may, according to further embodiments, represent an average style selected out of a plurality of (average) styles, wherein the selection is made based on external information. Note that the selection could also be made from a set of other candidate styles, which are not necessarily averages of other styles.
An embodiment provides a decoder for decoding an encoded input audio signal.
The decoder comprises a conditioning entity and a content decoder.
The conditioning entity is configured to receive encoded style information derived from the encoded input audio signal and to perform at least a learned affine transform of encoded content information derived from the encoded input audio signal to obtain transformed encoded content information.
The decoder is configured to further decode the transformed encoded content information.
Embodiments of the present invention are based on the finding that it is advantageous to disentangle content and style information.
The content information, which may comprise phonetic content, mainly comprises short-term information.
The style information, which may comprise information on the speaker identity or speaking style, mainly carries long-term information.
The decoder comprises two entities, the conditioning entity and the content decoder, so as to enable the decoding of the two different kinds of information, namely the short-term information and the long-term information.
The long-term information, also referred to as style information, is obtained using the conditioning entity, while the short-term information, i.e., the content information, is decoded by the content decoder taking into account the previously processed long-term information/style information.
An embodiment provides a method for decoding an encoded input audio signal. The method comprising:
Another embodiment provides a method for decoding an encoded input audio signal, and the method comprising:
Providing a manipulated or a default control signal may be used not only to cover the time span before the first encoded style information is received but also to anonymize the speaker (here, the style information may not be transmitted, thereby keeping the identifying information of the speaker included in the audio signal private).
An embodiment provides a method for decoding an encoded audio signal using a conditioning entity configured to output a control signal that is a default control signal and represents style information, and a content decoder configured to decode a data stream of the encoded input audio signal comprising content information, wherein the content decoder is controlled and/or adapted by the control signal.
The method may further comprise setting a default/manipulated style on the decoder side and not transmitting the style information, so as not to disclose identifying information about the speaker in the transmitted signals (encoder side).
The encoder 10' receives the input audio signal IS, e.g., an audio signal comprising speech.
The encoder 10' performs encoding so as to obtain two output data streams Q1 and Q2. These two data streams Q1 and Q2 are transmitted, e.g., separately, to the decoder 20'.
The decoder 20' performs a reconstruction so as to obtain the reconstructed signal IS'.
The encoder 10' comprises two encoding entities, namely a style encoder 10s and a content encoder 10c. Each of the two encoders 10s and 10c receives the input audio signal IS and performs an encoding of a portion of the input audio signal.
The encoder 10s outputs the coded audio signal Q1, while the encoder 10c outputs the coded audio signal Q2.
The signal Q1 is processed by a conditioning entity/PreNet 20c, while the encoded audio stream Q2 is decoded by the content decoder 20d.
The content decoder 20d has an input for Q2 and an input for a control signal generated by 20c.
The decoder uses the two signals so as to output the reconstructed signal IS'.
The style encoder 10s is configured to encode style information of the input audio signal to obtain a first (latent) data stream Q1.
The style information may comprise long-term information, such as speaker identity or speaking style.
The style information may depend on the gender (male/female speakers) or on the language.
The style information may have an influence on the general frequency-dependent power distribution of the voice and/or on the general sound. Due to the encoding, the signal Q1 comprises the specific encoded style information.
The content encoder 10c is configured to encode content information, i.e., short-term information, like phonetic information.
Embodiments of the present invention are more efficient than the SOTA approaches to neural speech coding [1, 2, 3].
The encoder 10 (as well as the decoder 20) may be realized by the neural codec NESC [1].
Any neural network that transforms a potentially preprocessed audio signal into a compact representation can serve as an encoder for a neural codec.
A neural codec may be characterized by a preprocessing of the audio signal and a transformation of the preprocessed audio signal so as to obtain a compact representation of the same.
Convolutional neural network blocks that successively downsample the input audio signal may be used, together with recurrent neural network blocks or transformers that model additional temporal context.
Part of this encoder would also be a subsequently applied quantizer that transforms the continuous output of the neural network into a discrete representation that can potentially be entropy-coded and transmitted.
A formatter for preparing the data stream, etc. may also be part of it.
NESC is an example of a neural codec.
NESC may, for example, consist of the following blocks:
The input waveform is transformed into a two-dimensional representation by a rolling window frontend (the frontend is the first part of the encoder, which receives the input audio signal).
The result is processed by a convolutional layer within each frame, recurrent layers across frames, and again a convolutional layer operating within each frame. Together, this is termed a dual path convolutional recurrent neural network (DPCRNN).
The signal is further processed by several convolutional neural network blocks.
The output of these blocks is quantized, e.g., by residual vector quantization (as in NESC) or by scalar quantization.
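The quantization step at the end of the blocks listed above can be illustrated with a generic residual vector quantization sketch. This is not the NESC quantizer: the two small codebooks are fixed by hand rather than learned, and the function names are assumptions. Each stage quantizes what the previous stages left unexplained, and only the per-stage indices need to be transmitted.

```python
# Generic residual vector quantization (RVQ) sketch with two stages.
# Codebooks are hand-picked for illustration; real codebooks are learned.

def nearest(codebook, vec):
    """Index of the codebook vector with the smallest squared distance."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize `vec` stage by stage, passing the residual to the next stage."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entry of every stage to rebuild the latent."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


codebooks = [
    [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]],          # coarse stage
    [[0.25, 0.25], [-0.25, -0.25], [0.25, -0.25], [-0.25, 0.25]],  # fine stage
]

latent = [0.8, -1.2]
indices = rvq_encode(latent, codebooks)
approximation = rvq_decode(indices, codebooks)
```

With these codebooks, each latent costs two indices of two bits each, and adding further stages would refine the approximation at the price of more bits.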
The decoder 20', as already shown in Fig. 2, will be discussed with reference to Fig. 3b, which shows the decoder 20d as the central component.
The decoder 20d receives Q2 and outputs IS'. Furthermore, the decoder 20d has another input for the control signal c obtained using the conditioning PreNet 20c, also referred to as the conditioning entity.
The quantized representation may, according to further embodiments, be preprocessed by recurrent neural network blocks (so as to implement an example neural codec based on NESC).
The result is decoded by Streamwise-StyleMelGAN (SSMGAN).
SSMGAN consists of a sequence of convolutional layers that successively upsample the input audio signal, where the input audio signal is used to condition each upsampling stage with a temporal adaptive DE-normalization (TADE) layer.
The final signal is synthesized with a Pseudo-Quadrature Mirror Filter bank (PQMF).
The neural codec used at the decoder side may comprise Streamwise-StyleMelGAN, e.g., consisting of a sequence of convolutional layers that successively upsample the input audio signal.
The decoder may comprise an SSMGAN decoder including the TADE layers and the dequantization, i.e., the mapping of indices describing the discrete representation (or a stream of indices) to a latent waveform (representation) that serves as the input of the remaining part of the decoder.
Different network architectures may be used.
The encoder comprises a style encoder and/or a content encoder which comprises the following elements:
The content decoder comprises a styling element of the content information from the first data stream or a processed version of it, the styling element being controlled by the control signal (C).
The content decoder comprises at least a styling element to obtain a reconstructed signal (IS') of the encoded input audio signal based on the first (Q1) and second data stream (Q2).
Styling elements/TADEs are used elsewhere in the decoder, e.g., when using SSMGAN as the content decoder.
The content encoder 10c captures short-term relations within the input audio signal (e.g., phonetic information), while the style encoder captures more global information (e.g., the identity of the speaker or the speaking style).
The content encoder produces latents at a much higher frequency (e.g., 10-100 Hz) than the style encoder (less than 10 Hz).
Both encoders may share some of their layers.
Layers capable of representing long-term information, such as RNNs, may be employed to enforce that the style encoder 10s learns long-term information. To avoid leaking of style information to the content encoder, techniques such as normalization layers enforcing temporal whiteness or input perturbations like vocal tract length perturbations may be used.
The two latents Q1 and Q2 comprise different information. It should be noted that the latent Q1 (output of the style encoder) is transmitted less often than the latent Q2 (cf. output of the content encoder 10c). The background is that style information is a long-term and/or global feature, while content refers to short-term information.
The decoder 20' comprises the two entities 20c and 20d.
The entity 20c receives the latent Q1 comprising the style information and processes it so as to obtain a control signal.
This entity 20c is also referred to as conditioning PreNet and is configured to prepare a control signal to specialize the decoder 20d towards a specific style. Consequently, the control signal output by the entity 20c serves as an additional input for the decoder 20d. This can be achieved by learning affine transforms of intermediate representations in the decoder 20d.
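One way to picture an affine transform of an intermediate decoder representation is a per-channel scale and shift whose values are computed from the style latent. The sketch below assumes this form; the parameter names, the single linear layer standing in for the PreNet, and all numeric values are hypothetical, and in a trained system the weights would be learned.

```python
# Sketch: style-dependent affine transform of an intermediate representation.
# hidden[t][c] is channel c of frame t; gamma and beta are per-channel values.

def linear(weights, bias, x):
    """Plain matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def conditioning_prenet(style, params):
    """Map the style latent to a scale (gamma) and a shift (beta) per channel."""
    gamma = linear(params["w_gamma"], params["b_gamma"], style)
    beta = linear(params["w_beta"], params["b_beta"], style)
    return gamma, beta


def apply_affine(hidden, gamma, beta):
    """Apply gamma * h + beta to every frame of the intermediate representation."""
    return [[g * h + b for h, g, b in zip(frame, gamma, beta)]
            for frame in hidden]


# Hypothetical parameters; in practice these would result from training.
params = {
    "w_gamma": [[1.0, 0.0], [0.0, 0.5]],
    "b_gamma": [0.0, 0.0],
    "w_beta": [[0.0, 0.0], [1.0, 0.0]],
    "b_beta": [0.5, 0.0],
}

style = [1.0, 2.0]                    # stands in for the latent Q1
hidden = [[2.0, 4.0], [0.0, -2.0]]    # stands in for a decoder activation

gamma, beta = conditioning_prenet(style, params)
styled = apply_affine(hidden, gamma, beta)
```

Because the same scale and shift are applied to every frame, the style acts as a slowly varying, global modifier, while the frame-to-frame variation still comes from the content latent.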
The decoder 20d receives the latent Q2 and performs a decoding so as to obtain the reconstructed signal IS'.
The signal is obtained based on Q2, taking into account the control signal provided by 20c. Through the control signal, the specific style information is received, so that the decoder 20d can perform the decoding based not only on the information included in Q2, but also on the information included in Q1.
Initialization techniques may be used according to embodiments. For example, decoding with an average style or with an average style for male/female speakers may be used during the initialization phase. In other words, the decoder 20d uses an initial control signal representing the average style or an average style of a female speaker. To select the (average) style out of a plurality of (average) styles, information may be used that is, e.g., extracted from or included in Q1.
A control signal can be provided that imprints an anonymous identity on the content decoded by the decoder 20d.
The conditioning entity 20c may be configured to receive the encoded style information derived from the encoded input audio signal, e.g., via the data stream Q1, and to perform at least a learnable affine transform of the encoded content information derived from the encoded input audio signal. The result is transformed encoded content information.
This transformed content information is forwarded as a kind of control signal C to the content decoder 20d, which performs a further decoding.
The further decoding may take into account the data stream Q2.
The decoding performed by 20d serves to obtain IS'.
The data stream Q2 may suffice as the input for 20c and 20d. The background is that long-term information can be derived from Q2 as well.
The content decoder 20d may comprise at least one learnable layer, e.g., a convolutional layer or a temporal adaptive DE-normalization (TADE) layer.
The content decoder may comprise at least one temporal adaptive DE-normalization residual block.
The proposed method achieves better quality of the reconstructed speech at very low bitrates and/or low complexity than comparable neural speech coders that do not leverage separate encoders for content and style. In particular, the generalization ability of the trained codec may be significantly increased due to the proposed extra encoder.
Both encoders can be realized in a computationally very efficient way, such that the extra encoder does not pose a significant extra computational burden.

Landscapes

Engineering & Computer Science (AREA)
Computational Linguistics (AREA)
Health & Medical Sciences (AREA)
Audiology, Speech & Language Pathology (AREA)
Human Computer Interaction (AREA)
Physics & Mathematics (AREA)
Acoustics & Sound (AREA)
Multimedia (AREA)
Compression, Expansion, Code Conversion, And Decoders (AREA)

EP24156124.0A 2024-02-06 2024-02-06 Encoder and decoder Pending EP4600950A1 (fr)

Priority Applications (2)

Application Number	Priority Date	Filing Date	Title
EP24156124.0A EP4600950A1 (fr)	2024-02-06	2024-02-06	Encoder and decoder
PCT/EP2025/052897 WO2025168600A1 (fr)	2024-02-06	2025-02-05	Encoder and decoder

Applications Claiming Priority (1)

Application Number	Priority Date	Filing Date	Title
EP24156124.0A EP4600950A1 (fr)	2024-02-06	2024-02-06	Encoder and decoder

Publications (1)

Publication Number	Publication Date
EP4600950A1 (fr)	2025-08-13

Family

ID=89853630

Family Applications (1)

Application Number	Title	Priority Date	Filing Date
EP24156124.0A Pending EP4600950A1 (fr)	Encoder and decoder	2024-02-06	2024-02-06

Country Status (2)

Country	Link
EP (1)	EP4600950A1 (fr)
WO (1)	WO2025168600A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
US20200365166A1 (en) *	2019-05-14	2020-11-19	International Business Machines Corporation	High-quality non-parallel many-to-many voice conversion

2024
- 2024-02-06 EP EP24156124.0A patent/EP4600950A1/fr active Pending
2025
- 2025-02-05 WO PCT/EP2025/052897 patent/WO2025168600A1/fr active Pending

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DEFOSSEZ ALEXANDRE ET AL., HIGH FIDELITY NEURAL AUDIO COMPRESSION, Retrieved from the Internet <URL:https://arxiv.org/abs/2210.13438>
PIA NICOLA ET AL., NESC: ROBUST NEURAL END-2-END SPEECH CODING WITH GANS, Retrieved from the Internet <URL:https://arxiv.org/abs/2207.03282>
POLYAK ET AL., SPEECH RESYNTHESIS FROM DISCRETE DISENTANGLED SELF-SUPERVISED REPRESENTATIONS, 2021, Retrieved from the Internet <URL:https://arxiv.org/pdf/2104.00355.pdf>
XUE JIANG ET AL: "Disentangled Feature Learning for Real-Time Neural Speech Coding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 February 2023 (2023-02-25), XP091446243 *
ZEGHIDOUR NEIL ET AL., SOUNDSTREAM: AN END-TO-END NEURAL AUDIO CODEC, Retrieved from the Internet <URL:https://arxiv.org/abs/2107.03312>

Also Published As

Publication number	Publication date
WO2025168600A1 (fr)	2025-08-14

Legal Events

Date	Code	Title	Description
2025-07-11	PUAI	Public reference made under article 153(3) epc to a published international application that has entered the european phase	Free format text: ORIGINAL CODE: 0009012