WO2010086020A1 - Audio signal quality prediction - Google Patents

Audio signal quality prediction

Info

Publication number
WO2010086020A1
WO2010086020A1 PCT/EP2009/051054 EP2009051054W WO2010086020A1 WO 2010086020 A1 WO2010086020 A1 WO 2010086020A1 EP 2009051054 W EP2009051054 W EP 2009051054W WO 2010086020 A1 WO2010086020 A1 WO 2010086020A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
quality
spectral
blocks
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2009/051054
Other languages
French (fr)
Inventor
Volodya Grancharov
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
Telefonaktiebolaget LM Ericsson AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget LM Ericsson AB
Priority to US13/146,426 (US20120020484A1)
Priority to EP09778994A (EP2392003B1)
Priority to JP2011546623A (JP5204904B2)
Priority to PCT/EP2009/051054 (WO2010086020A1)
Publication of WO2010086020A1
Anticipated expiration
Legal status: Ceased (current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2236 Quality of speech transmission monitoring
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals

Definitions

  • the present invention relates to a method and an apparatus for predicting the quality of an audio signal after transmission through a communication system, using a reference signal corresponding to an input signal to the communication system, and a processed signal corresponding to an output signal from said communication system.
  • the objective quality of an audio/speech signal after transmission through a system can be predicted e.g. by using PESQ (Perceptual Evaluation of Speech Quality) or PEAQ (Perceptual Evaluation of Audio Quality), which are both examples of conventional intrusive, i.e. double-ended, methods for audio quality prediction.
  • An intrusive method uses both the original signal input to a system and the distorted output signal, which are forwarded to an audio signal quality predicting apparatus.
  • An intrusive audio signal quality predicting apparatus predicts the quality of an audio signal after transmission through a network by comparing a reference signal input to the system with the processed (distorted) signal output, and it is effective across a range of networks, including PSTN, mobile, and VoIP.
  • the PESQ takes into account e.g. coding distortions, errors, packet loss, delay, variable delay and filtering, and measures the effects of distortions such as noise, delay, and front-end clipping, in order to provide a single Mean Opinion Score (MOS) as a quality measure.
  • MOS: Mean Opinion Score
  • a reference signal, i.e. an input signal to an audio transmission system
  • a processed signal, i.e. a distorted output of the system
  • the terminal arranged to perform the prediction is normally connected to two different points of the system, one point for insertion of the reference signal and one for receiving the processed signal.
  • a possible connection point is e.g. a mobile phone, a Media Gateway, or a VoIP Gateway.
  • FIG. 2 is a block diagram illustrating a conventional apparatus 25 for estimating the quality of an audio signal, e.g. a speech signal, after transmission through a communication system 21, from a reference signal and a processed signal.
  • a synchronization in time of the reference signal and the processed signal is performed by a time aligning device 22, an extraction of the features in the signals related to quality variations is performed by a feature extracting device 23, and a quality estimation is produced by combining the extracted features in the quality predicting device 24.
  • the synchronization in time, i.e. the time-alignment, between the reference signal and the processed signal in the time aligning device 22 in figure 2 is required because a delay is typically introduced in the processed signal, e.g. by a VoIP system, by a low-bitrate parametric coder, by unsynchronized clocks, and by changes in the sampling rate. Even though the human perception of the audio quality is normally unaffected by small delays, the signals have to be synchronized before the extraction of the features, in order to obtain an objective estimation of the audio signal quality.
  • the feature extracting device 23 in figure 2 performs an extraction of the features in both signals, and figure 1 illustrates a conventional feature extraction scheme from a reference signal 11 and a processed signal 12.
  • Vectors with spectral information are extracted from both signals on block basis, and the distance between the vectors is a measure of the local distortion.
  • a sequence of typically 8-12 s of the reference signal and of the processed signal is segmented into short blocks, each block having a length of typically 20-40 ms.
  • the waveform of each signal block is transformed to the frequency domain, and the frequency domain blocks are, in turn, transformed to the power spectrum.
  • the frequency domain vector may be converted to the perceptual domain, through frequency warping of the Hertz scale to Bark or Mel scales, followed by a compression to obtain loudness density.
  • the local distortion D_n 16 is calculated in 15 as the distance between the frequency representation 13 of the reference signal and the frequency representation 14 of the processed signal, related to e.g. the excitation pattern and the loudness density, the calculation described e.g. according to equation (1) below:
  • the index r indicates a reference signal
  • the index p indicates the processed signal
  • the index n indicates a particular block.
  • the function f in equation (1) performs an aggregation over the frequency bins w, and calculates a vector distance, which may include an L_p norm and/or sign difference.
  • a signal quality value, Q, is determined from a calculated aggregation, e.g. by an L_p norm, of the per-block distortions, D_n, according to equation (2) below:
  • the audio signal quality indicated by the quality value, Q, is inversely proportional to the aggregated distortion, D.
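The conventional scheme described above can be sketched in a few lines of Python. Equations (1) and (2) are not reproduced in this text, so the choices below are assumptions made for illustration only: a naive DFT for the power spectrum, a Euclidean distance as the function f, and a block-count-normalised L_p aggregation. The function names are hypothetical.

```python
import cmath
import math


def power_spectrum(block):
    """Naive DFT power spectrum of one real-valued block (bins 0..N/2)."""
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(block))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def segment(signal, block_len):
    """Split a signal into consecutive non-overlapping blocks, dropping the tail."""
    return [signal[i:i + block_len]
            for i in range(0, len(signal) - block_len + 1, block_len)]


def block_distortion(ref_block, proc_block):
    """Assumed form of f: Euclidean distance between the two power spectra."""
    pr = power_spectrum(ref_block)
    pp = power_spectrum(proc_block)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pr, pp)))


def aggregate(distortions, p=2.0):
    """L_p-style aggregation of per-block distortions, normalised by block count."""
    return (sum(d ** p for d in distortions) / len(distortions)) ** (1.0 / p)


def conventional_distortion(reference, processed, block_len):
    """Aggregated distortion D over blocks taken at identical positions."""
    ref_blocks = segment(reference, block_len)
    proc_blocks = segment(processed, block_len)
    return aggregate([block_distortion(r, q)
                      for r, q in zip(ref_blocks, proc_blocks)])
```

Because the blocks are compared bin by bin at fixed positions, any residual misalignment between the two signals feeds directly into every per-block distance.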
  • the above-described conventional quality estimating device 25 has several drawbacks.
  • One drawback is that it is very sensitive to errors in the time-alignment between the reference signal and the processed signal, and the calculated difference between the two power spectrum vectors, as illustrated in Fig. 1, will have a large error if the spectrum vectors are not perfectly synchronized in time. Since the processed signal could be heavily distorted due to e.g. a low bitrate codec, an error in the time-alignment presents a problem in objective audio signal quality estimation using a reference and a processed signal.
  • the object of the present invention is to address the problem outlined above, and this object and others are achieved by the method and the arrangement according to the appended independent claims, and by the embodiments according to the dependent claims.
  • the invention provides a method for predicting the quality of an audio signal after transmission through a communication system.
  • the method uses a reference signal corresponding to an input signal to the communication system, and a processed signal corresponding to an output signal from said communication system.
  • the method comprises the steps of: - Segmenting the reference signal and the processed signal into at least two first blocks having a pre-determined length;
  • the quality indicated by the determined first quality value may be inversely proportional to the minimum aggregated value of the distortions, and the number of parameters may be equal to three.
  • One of said spectral parameters may represent a spectral flatness, which indicates the resonant structure of the power spectrum, one of the spectral parameters may represent the normalized transition rate of RMSE, which indicates the rate of signal energy change, and one of said spectral parameters may represent the spectral centroid, which indicates the frequency around which the signal power is concentrated.
  • the method may comprise the further steps of:
  • the second quality value may be inversely proportional to the aggregated value of the distortions.
  • a total quality value of the audio signal may be determined by combining the first quality value with the second quality value, e.g. by an addition with different weight.
  • the calculation of said second parameters may comprise a determination of the mean, the variance, or the skew of the spectral parameters calculated for the first blocks contained in the second blocks.
  • the invention provides an apparatus for predicting the quality of an audio signal transmitted through a communication system by using a reference signal corresponding to an input signal to said communication system, and a processed signal corresponding to a distorted output signal from the communication system.
  • the apparatus comprises signal segmenting means for segmenting the reference signal and the processed signal into at least two first blocks having a pre-determined length; spectral parameter calculating means for calculating at least two spectral parameters for each of said first blocks, each spectral parameter representing a different spectral property of the signal; distortion calculating means for calculating the distortion between each spectral parameter of the reference signal and the corresponding spectral parameter of the processed signal, for each of the first blocks; aggregation calculating means for calculating an aggregated value of said calculated distortions at a number of different time-displacements between the reference signal and the processed signal, and first quality determining means for determining a first quality value of the audio signal from a minimum aggregated value of the distortions at an optimal time-displacement.
  • the apparatus may further comprise means for determining a second quality value, said means comprising second segmenting means for segmenting the reference signal and the processed signal into at least one second block, each second block containing a pre-determined number of said first blocks; second parameter calculating means for calculating a second parameter from each of the spectral parameters calculated for each of the first blocks contained in the second blocks; second distortion calculating means for calculating a distortion between each second parameter of the reference signal and the corresponding second parameter of the processed signal for each of the second blocks, at said optimal time-displacement; and second quality determining means for determining a second quality value from an aggregated value of the calculated distortions.
  • the apparatus may be arranged to be connected to two points of the communication system, one for insertion of the reference signal and one for receiving the distorted processed signal.
  • Figure 1 illustrates a conventional feature extraction scheme for a reference signal and a processed signal
  • FIG. 2 illustrates a conventional apparatus for predicting the quality of an audio signal
  • FIG. 3 illustrates a parameter extracting scheme according to an exemplary embodiment of the present invention
  • FIG. 5 is a flow diagram illustrating a method for predicting the quality of an audio signal, according to a first exemplary embodiment of this invention
  • Figure 6 is a flow diagram illustrating the additional steps of predicting the quality of an audio signal, according to a second exemplary embodiment of this invention.
  • Figure 7 illustrates an apparatus for predicting the quality of an audio signal, according to a first exemplary embodiment of this invention.
  • DETAILED DESCRIPTION
  • the predicted quality of an audio signal transmitted through a system is based on the distortion between a small number of spectral parameters representing the signal spectrum of the distorted processed signal, and the same spectral parameters representing the signal spectrum of the input reference signal. Further, the time synchronization between the reference signal and the processed signal is performed jointly with the calculation of the distortion. Thereby, the quality prediction is less sensitive to synchronization errors, and the distortions can be calculated on different time scales. More specifically, a sequence of the reference signal, i.e. a signal input to a communication system, and of the processed signal, i.e. the output signal from the communication system, are each segmented into a number of small-scale first blocks having a pre-determined length, typically 20-40 ms, and the length of the signal sequences is typically 8-12 s.
  • the signal waveform can be transformed into a frequency domain, and expressed as a power spectrum.
  • Two or more, and typically three, different spectral parameters representing different spectral properties of the signals are calculated for each block of the reference signal and of the processed signal.
  • the number of spectral parameters should be low, and significantly lower than the number of frequency bins, but may be more than three, e.g. four or five.
  • the distortion of the processed signal is determined by calculating the difference between each spectral parameter of each of the first blocks in the sequence of the processed signal and the same spectral parameter in the corresponding block of the reference signal.
  • a local distortion, D_n, is determined for each block from these differences, and the local distortions are aggregated.
  • D_n: a local distortion
  • a smaller value of the aggregated local per-block distortion indicates that the transmission through the communication system will cause less distortion of an audio signal, i.e. a higher quality can be predicted.
  • a value of the quality is determined from the aggregated local distortion, such that the quality indicated by a predicted quality value is inversely proportional to the size of the aggregated local distortion.
  • the synchronization in time between the reference signal and the processed signal is performed jointly with the calculation of the aggregation of the distortions, by calculating each local distortion, and the aggregation of the local distortion, at a number of different time-displacements, m, between the reference signal and the processed signal.
  • an optimal time-displacement could be determined by selecting the minimum of the calculated aggregated local distortions, and determining the quality value from this minimum of the aggregated distortions.
  • Figure 3 is a block diagram illustrating the calculation of a local distortion for a first block with index n, according to an exemplary embodiment of this invention.
  • a sequence of the reference signal 11 and of the processed signal 12 are both segmented into a number of first blocks, and the signal waveform of first block n of the reference signal is transformed into a power spectrum 13 in the frequency domain, and the signal waveform of block n of the processed signal is transformed into a power spectrum 14 in the frequency domain.
  • three spectral parameters 31 are calculated for first block n in the reference signal, and the same spectral parameters 32 are calculated for the block in the processed signal.
  • the spectral parameters are derived directly from the signal waveform, without transforming the signal waveform to a power spectrum.
  • the difference 33 between each of the spectral parameters is calculated, and the local distortion 34, D_n, is determined for block n from these differences.
  • Figure 4 illustrates an audio quality predicting apparatus 42, according to the basic idea of this invention, of an audio signal transmitted through a communication system 21.
  • a suitable low number of different spectral parameters e.g. three spectral parameters, are calculated from the spectral properties of the blocks of the reference signal and of the processed signal by a parameter extracting device 23, and the synchronization in time and an aggregation of calculated local distortions are performed jointly in a time-aligning and quality predicting device 41, providing a value of the quality, Q, at the output.
  • every first block, having a length of e.g. 20 ms, of the reference signal and the processed signal is described with at least two, but preferably three, different spectral parameters, in contrast to a conventional frequency representation description, according to which such a block could be described with e.g. 128 components.
  • suitable spectral parameters for describing each block comprise the spectral flatness, the normalized transition rate of RMSE, and the spectral centroid.
  • the spectral parameter representing the spectral flatness of the block measures the amount of resonant structure in the power spectrum, e.g. according to equation (3) below, and a deviation in this parameter is related to coding distortions and an additive background noise.
  • the spectral parameter representing the normalized transition rate of RMSE indicates the rate of the signal energy change, e.g. according to equation (4) below, and a deviation in this parameter is related to e.g. gain errors and signal mutes.
  • the spectral parameter representing the spectral centroid indicates the frequency around which most of the signal energy is concentrated, e.g. according to equation (5) below, and a deviation in this parameter is related to a loss of bandwidth and an additive background noise. Since the spectral centroid is related to the spectrum tilt, the spectral centroid can be approximated as the coefficient in the first-order linear-prediction analysis.
  • the above-described exemplary parameters represent meaningful dimensions of a block of an audio signal, such as the resonant structure, the perceived brightness, and the energy changes, and the parametric representation is easy to associate with a particular distortion. Further, the spectral parameters are robust to errors in time-alignment and formant displacement, since they do not require that the frequency bins of the reference signal and the processed signal are perfectly positioned.
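As an illustration of how three such per-block parameters could be computed, the sketch below uses common textbook definitions. Equations (3) to (5) are not reproduced in this text, so every formula here is an assumption: flatness as the ratio of geometric to arithmetic mean of the power spectrum, the centroid as the power-weighted mean bin index, and the transition rate as the normalised change in RMS energy between consecutive blocks. The function names and the `EPS` guard are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0) and division by zero


def spectral_flatness(power):
    """Geometric mean over arithmetic mean of the power spectrum (near 1 = flat)."""
    n = len(power)
    log_mean = sum(math.log(p + EPS) for p in power) / n
    arith_mean = sum(power) / n
    return math.exp(log_mean) / (arith_mean + EPS)


def spectral_centroid(power):
    """Power-weighted mean of the frequency bin index w."""
    total = sum(power)
    return sum(w * p for w, p in enumerate(power)) / (total + EPS)


def rms(block):
    """Root mean square amplitude of one time-domain block."""
    return math.sqrt(sum(x * x for x in block) / len(block))


def transition_rate(prev_block, block):
    """Assumed form: normalised change in RMS energy, in the range -1..1."""
    e_prev, e_cur = rms(prev_block), rms(block)
    return (e_cur - e_prev) / (e_cur + e_prev + EPS)
```

Each block is then summarised by three numbers instead of a full spectrum vector, which is what makes the comparison tolerant of small shifts in the position of individual frequency bins.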
  • the local distortion, D_n, for a first block with index n, which is calculated from the differences between each spectral parameter of the block in the processed signal and the same spectral parameter in the corresponding block in the reference signal, can be expressed e.g. according to the equation (6) below:
  • the synchronization in time of the processed signal and the reference signal is performed jointly with the calculation of the aggregation of the local distortions, D_n, by calculating each local distortion, as well as the aggregation of the local distortion, at a number of different time-displacements, m, between the reference signal and the processed signal.
  • an optimal time-displacement can be determined by selecting the minimum of the calculated aggregated local distortions, and determining the quality value from this minimum of the distortions.
  • the quality is predicted from the minimum aggregated value of the local distortions, at an optimal time-displacement, at which the processed signal is time-aligned with the reference signal.
  • the predicted quality is indicated by a selected suitable quality value.
  • the quality indicated by the quality value is inversely proportional to the aggregated local distortions, since a comparatively small distortion of the audio signal means that the predicted quality of the audio signal is comparatively high.
  • the optimal time displacement M can be calculated e.g. according to equation (9):
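Independently of the exact form of equation (9), which is not reproduced in this text, the joint search over time-displacements can be sketched as follows. The sketch assumes that each first block has already been reduced to a tuple of spectral parameters, that the displacement m is counted in whole blocks, and that the local distortion is a sum of absolute parameter differences. All three are illustrative assumptions, and the function names are hypothetical.

```python
def local_distortion(ref_params, proc_params):
    """Assumed form of D_n: sum of absolute differences between parameters."""
    return sum(abs(r - p) for r, p in zip(ref_params, proc_params))


def aggregated_distortion(ref, proc, m, p=2.0):
    """L_p-style mean of local distortions with proc displaced by m blocks."""
    pairs = [(ref[n], proc[n + m])
             for n in range(len(ref)) if 0 <= n + m < len(proc)]
    if not pairs:
        return float("inf")  # no overlap at this displacement
    total = sum(local_distortion(r, q) ** p for r, q in pairs)
    return (total / len(pairs)) ** (1.0 / p)


def best_displacement(ref, proc, max_shift):
    """Return (M, D_min): the displacement with the smallest aggregated distortion."""
    scores = {m: aggregated_distortion(ref, proc, m)
              for m in range(-max_shift, max_shift + 1)}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

The minimum found by `best_displacement` serves both purposes at once: its position gives the alignment and its value is the distortion from which the quality value is derived, so no separate waveform synchronization step is needed.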
  • FIG. 5 is a flow diagram illustrating a method for predicting the quality of an audio signal, according to a first exemplary embodiment of this invention.
  • the reference signal and the processed signal are segmented into a number of first blocks having a length of e.g. 20-40 ms, and in step 52, e.g. three different spectral parameters are calculated for each of the first blocks in the processed signal and in the reference signal.
  • the number of spectral parameters is at least two, and suitable spectral parameters are e.g. the spectral flatness, the spectral centroid and the normalized transition rate of RMSE, as described above.
  • in step 53, the local distortion, D_n, is calculated for each of the first blocks from the difference between each spectral parameter in the block of the processed output signal and in the corresponding block of the input reference signal, in order to determine the distortion of the audio signal during the transmission through the communication system.
  • in step 54, the processed signal is synchronized in time with the reference signal by a calculation of an aggregated value, e.g. as an L_p norm, of the local distortions in each block at different time-displacements, m, between the processed signal and the reference signal.
  • the predicted first quality value is determined, in step 55, from the minimum of the aggregated local distortions, at the optimal time-displacement.
  • the spectral parameters and the local distortion are calculated for fixed small-scale blocks, e.g. with a length of 20 ms.
  • the distortions can be obtained at a larger scale, as well, through calculating second parameters as statistic values from the calculated spectral parameters of the first blocks located within a larger-scale second block.
  • said second parameters are obtained by calculating e.g. the mean, the variance, the skew, or a certain quintile from the spectral parameters calculated for the first blocks located within the larger-scale second block.
  • the second parameters indicated in equations (10), (11) and (12) below are obtained for the larger-scale second block with index B of the reference signal, the larger-scale second block containing a pre-determined number of small-scale first blocks:
  • the corresponding second parameters are also obtained for the processed signal.
  • the local distortion, D_B, for this large-scale second block B is calculated from the difference between the second parameters in this larger-scale second block in the processed signal and the corresponding larger-scale second block in the reference signal, e.g. according to the equation (13) below:
  • the total quality of an audio signal sequence having a length of e.g. between 8 and 12 seconds is predicted from the combination of the D_n and D_B distortions.
  • D_n always describes the local distortion in the small-scale first blocks, which have a fixed length.
  • a larger-scale second block indicated by index B, has a length corresponding to at least two of the first blocks, i.e. a length between two small-scale blocks and the total length of the signal sequence.
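A minimal sketch of the larger-scale step, assuming the second parameters are the mean, variance and skewness of each spectral parameter over the first blocks in a second block, and that the large-scale distortion is a sum of absolute differences between those statistics. Equations (10) to (13) are not reproduced in this text, so both choices and all function names are assumptions for illustration.

```python
def block_statistics(values):
    """Mean, variance and skewness of one spectral parameter over a second block."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        skew = 0.0  # skewness is undefined for a constant series; use 0
    else:
        skew = sum((v - mean) ** 3 for v in values) / (n * var ** 1.5)
    return mean, var, skew


def second_parameters(params, blocks_per_second_block):
    """Group per-first-block parameter tuples and summarise each parameter."""
    step = blocks_per_second_block
    out = []
    for start in range(0, len(params) - step + 1, step):
        group = params[start:start + step]
        # zip(*group) transposes: one series of values per spectral parameter
        out.append([block_statistics(list(col)) for col in zip(*group)])
    return out


def large_scale_distortion(ref_stats, proc_stats):
    """Assumed form of D_B: sum of absolute differences between statistics."""
    return sum(abs(a - b)
               for r_par, p_par in zip(ref_stats, proc_stats)
               for a, b in zip(r_par, p_par))
```

The per-first-block parameters are reused here, so the larger-scale distortion adds only a few statistics per second block on top of the small-scale computation.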
  • the total quality is predicted as a linear combination of quality predictions determined from the distortions at different resolutions, i.e. the small-scale local distortions and the larger-scale distortions are aggregated independently. Accordingly, a first quality value, Q_1, is determined from an aggregation of the small-scale local distortions, D_n, and a second quality value, Q_2, is determined from an aggregation of the large-scale distortions, D_B. Thereafter, the first quality value Q_1 and the second quality value Q_2 are combined to form the total quality value Q_tot, e.g. according to equation (14) below:
  • the first quality value and the second quality value are added with the same weight.
  • the first quality value and the second quality value are added with different weights, and the different weights are indicated by k_1 ≠ k_2 in (14) above.
  • the second quality value predicted from larger-scale blocks with index B could be given a higher weight in the predicted total quality value when a specific distortion is detected, since some distortions are more easily described with larger-scale parameters, e.g. additive background noise, bandwidth limitations and the energy loss in larger signal segments. Therefore, it may be advantageous to give the second large-scale quality value a higher weight in the total quality value, and in this case k_1 < k_2 in equation (14) above.
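The weighted combination can be sketched as below. The linear form follows the description of a linear combination with weights k_1 and k_2. The mapping from an aggregated distortion to a quality value is a purely hypothetical choice that only preserves the stated property that quality falls as distortion grows. The default weights are likewise illustrative.

```python
def quality_from_distortion(distortion, scale=1.0):
    """Hypothetical mapping: quality decreases monotonically with distortion."""
    return scale / (1.0 + distortion)


def total_quality(q1, q2, k1=0.5, k2=0.5):
    """Linear combination of the small-scale (q1) and large-scale (q2) values."""
    return k1 * q1 + k2 * q2
```

For example, `total_quality(q1, q2, k1=0.25, k2=0.75)` would emphasise the large-scale value, in line with the case where distortions such as background noise or bandwidth limitation dominate.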
  • Figure 6 is a flow diagram illustrating the additional steps of predicting a second, larger-scale quality of an audio signal, according to a second exemplary embodiment of this invention, which is performed after the steps illustrated in figure 5.
  • in step 61, the sequences of the processed signal and of the reference signal are segmented into one or more second blocks, each of which contains two or more of the small-scale first blocks.
  • in step 62, a second parameter is calculated statistically from each of the spectral parameters of the first blocks contained in the larger-scale second block in the processed signal and in the reference signal, at the optimal time-displacement.
  • in step 63, the difference is calculated between each second parameter of the block in the processed signal and the same second parameter in the corresponding block of the reference signal, and a local distortion, D_B, is calculated for each of the second blocks, e.g. according to equation (13) above.
  • in step 64, a second larger-scale quality value, Q_2, is predicted from the aggregated local distortion, and the quality indicated by the selected second quality value is inversely proportional to the aggregated local distortions D.
  • the spectral features can be extracted from the reference signal and from the processed signals without performing any synchronization.
  • the synchronization can be performed jointly with the determination of the aggregated distortions.
  • the invention achieves a low-complexity perceptual time-alignment, which is superior to conventional waveform synchronization, as well as enabling a prediction of the distortion at different time resolution, i.e. different scales, thus improving the accuracy and flexibility of the quality prediction.
  • Figure 7 illustrates an apparatus 42 for predicting the quality of an audio signal, according to a first exemplary embodiment.
  • the apparatus comprises signal segmenting means 71 for segmenting a sequence of the reference signal and of the processed signal into a number of first blocks having a length of 20-40 ms.
  • the apparatus comprises spectral parameter calculating means 72 for calculating e.g. three different spectral parameters for each of the first blocks, each spectral parameter representing a different spectral property of the block.
  • the difference between each spectral parameter in each block of the processed signal and the spectral parameter in the corresponding block of the reference signal is calculated by the distortion calculating means, 73, and a local distortion D_n is calculated for each of the first blocks, based on these differences.
  • the local distortions in the blocks of the sequences are aggregated by the aggregation calculating means, 74, e.g. as an L_p norm, and a first quality value is predicted by the first quality predicting means, 75, such that the quality indicated by the first quality value is inversely proportional to the aggregated local distortions.
  • the means illustrated in FIG. 7 may be implemented by physical or logical entities using software functioning in conjunction with a programmed microprocessor or general purpose computer, and/or using an application specific integrated circuit (ASIC).
  • ASIC application specific integrated circuit
  • the apparatus is further provided with means for determining a second quality value, which is calculated at a larger scale.
  • These means comprise the following:
  • Second segmenting means for segmenting the reference signal and the processed signal into one or more second blocks, each second block being larger than said first blocks, and each second block containing a pre-determined number, i.e. two or more, of the first blocks;
  • Second parameter calculating means for calculating a second parameter from each of the spectral parameters calculated for each of the first small-scale blocks contained in a second, larger-scale block;
  • Second distortion calculating means for calculating a distortion between each second parameter of the reference signal and the corresponding second parameter of the processed signal, at the optimal time-displacement M between the processed signal and the reference signal, and determining a local distortion for each second block;
  • Second quality determining means for determining a second quality value from an aggregated value of the calculated local distortions.
  • the apparatus comprises means for determining a total quality of the audio signal, by combining the first quality value with the second quality value, e.g. with different weight.
  • the apparatus is arranged to be connected to two different points of the communication system, one for insertion of the reference signal and one for receiving the distorted processed signal.
  • a possible connection point is e.g. a mobile phone, a Media Gateway, or a VoIP Gateway.
  • RMSE - Root Mean Squared Error
  • VoIP - Voice Over Internet Protocol
  • n - block index for the first blocks, i.e. the 20-40 ms small-scale blocks
  • B - block index for the second, larger-scale blocks, each containing two or more of the first, smaller-scale blocks
  • N - the number of blocks in the signal sequence
  • w - frequency bin index, inside one block
  • r - parameter associated with the reference signal
  • p - parameter associated with the processed signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephone Function (AREA)

Abstract

Method and apparatus for predicting the quality of an audio signal after transmission through a communication system (21), the method using a reference signal (11) corresponding to an input signal to the communication system, and a processed signal (12) corresponding to an output signal from said communication system. The signals are segmented into blocks, and e.g. three spectral parameters are calculated for each block in the processed and in the reference signal. Thereafter, the quality of the audio signal is predicted from the distortion between these parameters.

Description

Audio signal quality prediction
TECHNICAL FIELD
The present invention relates to a method and an apparatus for predicting the quality of an audio signal after transmission through a communication system, using a reference signal corresponding to an input signal to the communication system, and a processed signal corresponding to an output signal from said communication system.
BACKGROUND
In a mobile communication system, as well as in e.g. a VoIP system, it is important to be able to predict the quality of a speech signal after the speech signal has passed through the system. The objective quality of an audio/speech signal after transmission through a system can be predicted e.g. by using PESQ (Perceptual Evaluation of Speech Quality) or PEAQ (Perceptual Evaluation of Audio Quality), both of which are examples of conventional intrusive, i.e. double-ended, methods for audio quality prediction. An intrusive method uses both the original signal input to a system and the distorted output signal, which are forwarded to an audio signal quality predicting apparatus. An intrusive audio signal quality predicting apparatus predicts the quality of an audio signal after transmission through a network by comparing a reference signal input to the system with the processed (distorted) signal output, and it is effective across a range of networks, including PSTN, mobile, and VoIP. The PESQ takes into account e.g. coding distortions, errors, packet loss, delay, variable delay and filtering, and measures the effects of distortions such as noise, delay, and front-end clipping, in order to provide a single Mean Opinion Score (MOS) as a quality measure. Thus, a reference signal, i.e. an input signal to an audio transmission system, and a processed signal, i.e. a distorted output of the system, may be used for predicting the quality of an audio signal transmitted through said system.
In order to perform an intrusive, double-ended, audio signal quality prediction, the terminal arranged to perform the prediction is normally connected to two different points of the system, one point for insertion of the reference signal and one for receiving the processed signal. A possible connection point is e.g. a mobile phone, a Media Gateway, or a VoIP Gateway.
Figure 2 is a block diagram illustrating a conventional apparatus 25 for estimating the quality of an audio signal, e.g. a speech signal, after transmission through a communication system 21, from a reference signal and a processed signal. A synchronization in time of the reference signal and the processed signal is performed by a time aligning device 22, an extraction of the features in the signals related to quality variations is performed by a feature extracting device 23, and a quality estimation is produced by combining the extracted features in the quality predicting device 24.
The synchronization in time, i.e. the time-alignment, between the reference signal and the processed signal in the time aligning device 22 in figure 2 is required due to the fact that a delay is typically introduced in the processed signal, e.g. by a VoIP system, by a low-bitrate parametric coder, by unsynchronized clocks, and by changes in the sampling rate. Even though the human perception of the audio quality normally is unaffected by small delays, the signals have to be synchronized before the extraction of the features, in order to obtain an objective estimation of the audio signal quality.
The feature extracting device 23 in figure 2 performs an extraction of the features in both signals, and figure 1 illustrates a conventional feature extraction scheme from a reference signal 11 and a processed signal 12. Vectors with spectral information are extracted from both signals on a block basis, and the distance between the vectors is a measure of the local distortion. In the feature extraction, a sequence of typically 8-12 sec of the reference signal and of the processed signal is segmented into short blocks, each block having a length of typically 20-40 ms. The waveform of each signal block is transformed to the frequency domain, and the frequency domain blocks are, in turn, transformed to the power spectrum. Further, the frequency domain vector may be converted to the perceptual domain, through frequency warping of the Hertz scale to the Bark or Mel scale, followed by a compression to obtain loudness density. Thereafter, the local distortion Dn 16, at a block with index n, is calculated in 15 as the distance between the frequency representation 13 of the reference signal and the frequency representation 14 of the processed signal, related to e.g. the excitation pattern and the loudness density, the calculation described e.g. according to equation (1) below:
$$D_n = f\left(P_n^r(\omega) - P_n^p(\omega)\right) \qquad (1)$$
Hereinafter, the index r indicates a reference signal, the index p indicates the processed signal, and the index n indicates a particular block.
The function f in equation (1) performs an aggregation over the frequency bins ω, and calculates a vector distance, which may include an Lp norm and/or sign difference.
In the quality predicting device 24 in figure 2, a signal quality value, Q, is determined from a calculated aggregation, e.g. by an Lp norm, of the per-block distortions, Dn, according to equation (2) below:
[Equation (2): reproduced as an image in the original publication]
Since a lower distortion leads to a higher quality, the audio signal quality value indicated by the quality value, Q, is inversely proportional to the aggregated distortion, D.
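By way of illustration only, the conventional scheme of equations (1) and (2) may be sketched as follows. The sketch assumes that the function f is a plain Lp distance over the frequency bins, that the aggregation is a mean-normalized Lp norm, and that the quality mapping is 1/(1+D); these choices and all function names are hypothetical and are not taken from the publication.

```python
def block_distortion(p_ref, p_proc, p=2.0):
    """Lp distance between the power spectra of one reference/processed block pair."""
    return sum(abs(r - q) ** p for r, q in zip(p_ref, p_proc)) ** (1.0 / p)


def aggregate(distortions, p=2.0):
    """Lp-style aggregation of per-block distortions (mean form is an assumption)."""
    return (sum(d ** p for d in distortions) / len(distortions)) ** (1.0 / p)


def quality(distortions, p=2.0):
    """Hypothetical mapping: quality falls as the aggregated distortion grows."""
    return 1.0 / (1.0 + aggregate(distortions, p))
```

Because each spectral bin of the reference block is compared with the same bin of the processed block, a small misalignment between the two blocks changes every term of the sum, which is the sensitivity discussed in the drawbacks of the conventional device.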
However, the above-described conventional quality estimating device 25 has several drawbacks. One drawback is that it is very sensitive to errors in the time-alignment between the reference signal and the processed signal, and the calculated difference between the two power spectrum vectors, as illustrated in Fig. 1, will have a large error if the spectrum vectors are not perfectly synchronized in time. Since the processed signal could be heavily distorted due to e.g. a low bitrate codec, an error in the time-alignment presents a problem in objective audio signal quality estimation using a reference and a processed signal.
Further, even though the human auditory system compensates for moderate differences in pitch and timbre, the subtraction of the two spectrum vectors is not able to capture these natural speech variations. An additional drawback is that since the speech signal is quasi-stationary, the spectral characteristics can be extracted only on a short-time basis, e.g. up to 40 ms. However, it may be desirable to calculate the distortion with a different resolution, using a larger signal segment, e.g. with a length of 300 ms, which is not possible using this conventional quality estimation device.
SUMMARY
The object of the present invention is to address the problem outlined above, and this object and others are achieved by the method and the arrangement according to the appended independent claims, and by the embodiments according to the dependent claims.
According to one aspect, the invention provides a method for predicting the quality of an audio signal after transmission through a communication system. The method uses a reference signal corresponding to an input signal to the communication system, and a processed signal corresponding to an output signal from said communication system. The method comprises the steps of:
- Segmenting the reference signal and the processed signal into at least two first blocks having a pre-determined length;
- Calculating a number of different spectral parameters representing spectral properties of the signal for each of said first blocks, the number of spectral parameters being at least two;
- For each of the first blocks, calculating a distortion between each calculated spectral parameter of the reference signal and the corresponding calculated spectral parameter of the processed signal;
- Calculating an aggregated value of said distortions for a number of different time-displacements between the reference signal and the processed signal;
- Determining a first quality value of the audio signal from a minimum aggregated value of the distortions at an optimal time-displacement.
The quality indicated by the determined first quality value may be inversely proportional to the minimum aggregated value of the distortions, and the number of parameters may be equal to three.
One of said spectral parameters may represent a spectral flatness, which indicates the resonant structure of the power spectrum, one of the spectral parameters may represent the normalized transition rate of RMSE, which indicates the rate of signal energy change, and one of said spectral parameters may represent the spectral centroid, which indicates the frequency around which the signal power is concentrated.
The method may comprise the further steps of:
- Segmenting the reference signal and the processed signal into at least one second block, each second block containing a pre-determined number of said first blocks;
- For each of the second blocks, calculating a second parameter from each of the spectral parameters calculated for each of the first blocks contained in the second block, and calculating a distortion between each second parameter of the reference signal and the corresponding second parameter of the processed signal, at said optimal time displacement;
- Determining a second quality value from an aggregated value of the calculated distortions.
The second quality value may be inversely proportional to the aggregated value of the distortions.
Further, a total quality value of the audio signal may be determined by combining the first quality value with the second quality value, e.g. by an addition with different weight.
The calculation of said second parameters may comprise a determination of the means, the variance, or the skew of the spectral parameters calculated for the first blocks contained in the second blocks.
According to a second aspect, the invention provides an apparatus for predicting the quality of an audio signal transmitted through a communication system by using a reference signal corresponding to an input signal to said communication system, and a processed signal corresponding to a distorted output signal from the communication system. The apparatus comprises signal segmenting means for segmenting the reference signal and the processed signal into at least two first blocks having a pre-determined length; spectral parameter calculating means for calculating at least two spectral parameters for each of said first blocks, each spectral parameter representing a different spectral property of the signal; distortion calculating means for calculating the distortion between each spectral parameter of the reference signal and the corresponding spectral parameter of the processed signal, for each of the first blocks; aggregation calculating means for calculating an aggregated value of said calculated distortions at a number of different time-displacements between the reference signal and the processed signal, and first quality determining means for determining a first quality value of the audio signal from a minimum aggregated value of the distortions at an optimal time-displacement.
The apparatus may further comprise means for determining a second quality value, said means comprising second segmenting means for segmenting the reference signal and the processed signal into at least one second block, each second block containing a pre-determined number of said first blocks; second parameter calculating means for calculating a second parameter from each of the spectral parameters calculated for each of the first blocks contained in the second blocks; second distortion calculating means for calculating a distortion between each second parameter of the reference signal and the corresponding second parameter of the processed signal for each of the second blocks, at said optimal time-displacement; and second quality determining means for determining a second quality value from an aggregated value of the calculated distortions.
The apparatus may be arranged to be connected to two points of the communication system, one for insertion of the reference signal and one for receiving the distorted processed signal.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will now be described in more detail, and with reference to the accompanying drawings, in which:
- Figure 1 illustrates a conventional feature extraction scheme for a reference signal and a processed signal;
- Figure 2 illustrates a conventional apparatus for predicting the quality of an audio signal;
- Figure 3 illustrates a parameter extracting scheme according to an exemplary embodiment of the present invention;
- Figure 4 illustrates audio signal quality prediction, according to the basic idea of this invention;
- Figure 5 is a flow diagram illustrating a method for predicting the quality of an audio signal, according to a first exemplary embodiment of this invention;
- Figure 6 is a flow diagram illustrating the additional steps of predicting the quality of an audio signal, according to a second exemplary embodiment of this invention;
- Figure 7 illustrates an apparatus for predicting the quality of an audio signal, according to a first exemplary embodiment of this invention.
DETAILED DESCRIPTION
In the following description, the invention will be described in more detail with reference to certain embodiments and to accompanying drawings. For purposes of explanation and not limitation, specific details are set forth, such as particular scenarios, techniques, etc., in order to provide a thorough understanding of the present invention. However, it is apparent to one skilled in the art that the present invention may be practised in other embodiments that depart from these specific details.
Moreover, those skilled in the art will appreciate that the functions and means explained herein below may be implemented using software functioning in conjunction with a programmed microprocessor or general purpose computer, and/or using an application specific integrated circuit (ASIC) . It will also be appreciated that while the current invention is primarily described in the form of methods and devices, the invention may also be embodied in a computer program product as well as in a system comprising a computer processor and a memory coupled to the processor, wherein the memory is encoded with one or more programs that may perform the functions disclosed herein.
According to the basic concept of this invention, the predicted quality of an audio signal transmitted through a system is based on the distortion between a small number of spectral parameters representing the signal spectrum of the distorted processed signal, and the same spectral parameters representing the signal spectrum of the input reference signal. Further, the time synchronization between the reference signal and the processed signal is performed jointly with the calculation of the distortion. Thereby, the quality prediction is less sensitive to synchronization errors, and the distortions can be calculated on different time scales. More specifically, a sequence of the reference signal, i.e. a signal input to a communication system, and a sequence of the processed signal, i.e. the output signal from the communication system, are each segmented into a number of small-scale first blocks having a pre-determined length, typically 20-40 msec, and the length of the signal sequences is typically 8-12 sec. Optionally, the signal waveform can be transformed into the frequency domain, and expressed as a power spectrum.
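The segmentation into first blocks and the optional transformation to a power spectrum may, purely as an example, be sketched as below. Non-overlapping blocks, the absence of a window function, and the naive DFT are simplifying assumptions, and the function names are hypothetical; at an assumed sampling rate of 8 kHz, a 20 ms block corresponds to block_len = 160 samples.

```python
import cmath


def segment(signal, block_len):
    """Split a sample sequence into non-overlapping first blocks.

    An incomplete block at the end of the sequence is dropped (assumption).
    """
    last_start = len(signal) - block_len
    return [signal[i:i + block_len] for i in range(0, last_start + 1, block_len)]


def power_spectrum(block):
    """Naive DFT power spectrum for bins 0..N/2 of one block (no windowing)."""
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(block))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum
```

A practical implementation would normally use an FFT routine and a tapered window; the direct summation above is chosen only to keep the sketch free of external dependencies.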
Two or more, and typically three, different spectral parameters representing different spectral properties of the signals are calculated for each block of the reference signal and of the processed signal. The number of spectral parameters should be low, and significantly lower than the number of frequency bins, but may obviously be more than three, such as e.g. four or five.
Thereafter, the distortion of the processed signal is determined by calculating the difference between each spectral parameter of each of the first blocks in the sequence of the processed signal and the same spectral parameter in the corresponding block of the reference signal. Next, a local distortion, Dn, is determined for each block from these differences, and the local distortions are aggregated. A smaller value of the aggregated local per-block distortion indicates that the transmission through the communication system will cause less distortion of an audio signal, i.e. a higher quality can be predicted. Accordingly, a value of the quality is determined from the aggregated local distortion, such that the quality indicated by a predicted quality value is inversely proportional to the size of the aggregated local distortion.
Further, the synchronization in time between the reference signal and the processed signal is performed jointly with the calculation of the aggregation of the distortions, by calculating each local distortion, and the aggregation of the local distortion, at a number of different time-displacements, m, between the reference signal and the processed signal. Thereby, an optimal time-displacement could be determined by selecting the minimum of the calculated aggregated local distortions, and determining the quality value from this minimum of the aggregated distortions.
Figure 3 is a block diagram illustrating the calculation of a local distortion for a first block with index n, according to an exemplary embodiment of this invention. A sequence of the reference signal 11 and of the processed signal 12 are both segmented into a number of first blocks, and the signal waveform of first block n of the reference signal is transformed into a power spectrum 13 in the frequency domain, and the signal waveform of block n of the processed signal is transformed into a power spectrum 14 in the frequency domain. Thereafter, three spectral parameters 31 are calculated for first block n in the reference signal, and the same spectral parameters 32 are calculated for the block in the processed signal. However, according to an alternative embodiment, the spectral parameters are derived directly from the signal waveform, without transforming the signal waveform to a power spectrum. Further, the difference 33 between each of the spectral parameters is calculated, and the local distortion 34, Dn, is determined for block n from these differences.
Figure 4 illustrates an audio quality predicting apparatus 42, according to the basic idea of this invention, of an audio signal transmitted through a communication system 21. A suitable low number of different spectral parameters, e.g. three spectral parameters, are calculated from the spectral properties of the blocks of the reference signal and of the processed signal by a parameter extracting device 23, and the synchronization in time and an aggregation of calculated local distortions are performed jointly in a time-aligning and quality predicting device 41, providing a value of the quality, Q, at the output.
According to this invention, every first block, having a length of e.g. 20 ms, of the reference signal and of the processed signal is described with at least two, but preferably three, different spectral parameters, in contrast to a conventional frequency representation, according to which such a block could be described with e.g. 128 components. According to an exemplary embodiment of this invention, suitable spectral parameters for describing each block comprise the spectral flatness, the normalized transition rate of RMSE, and the spectral centroid.
The spectral parameter representing the spectral flatness of the block measures the amount of resonant structure in the power spectrum, e.g. according to equation (3) below, and a deviation in this parameter is related to coding distortions and an additive background noise.
[Equation (3): reproduced as an image in the original publication]
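Equation (3) is not available in this text. As a hedged illustration, one widely used definition of spectral flatness is the ratio of the geometric mean to the arithmetic mean of the power spectrum; whether the publication uses exactly this form is an assumption, and the function name and the eps floor are hypothetical.

```python
import math


def spectral_flatness(power, eps=1e-12):
    """Geometric mean over arithmetic mean of a power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate a strongly resonant (peaky) spectrum. The eps floor avoids
    log(0) for empty bins and is an implementation assumption.
    """
    vals = [max(v, eps) for v in power]
    geometric = math.exp(sum(math.log(v) for v in vals) / len(vals))
    arithmetic = sum(vals) / len(vals)
    return geometric / arithmetic
```

Under this definition, additive background noise tends to fill the valleys between spectral peaks and thereby raises the flatness, which is consistent with the deviation described above.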
The spectral parameter representing the normalized transition rate of RMSE indicates the rate of the signal energy change, e.g. according to equation (4) below, and a deviation in this parameter is related to e.g. gain errors and signal mutes.
$$\frac{E_n - E_{n-1}}{E_n + E_{n-1}} \qquad (4)$$
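A minimal sketch of the normalized transition rate is given below, assuming that the energy term of a block is its root-mean-square sample value and that the first block, which has no predecessor, is assigned a rate of zero. Both assumptions, the eps guard, and the function names are hypothetical.

```python
import math


def rms(block):
    """Root-mean-square level of one block of samples."""
    return math.sqrt(sum(x * x for x in block) / len(block))


def transition_rates(blocks, eps=1e-12):
    """Normalized block-to-block change of the RMS level, in [-1, 1].

    eps guards against division by zero for two consecutive silent blocks.
    """
    levels = [rms(b) for b in blocks]
    rates = [0.0]  # assumption: no transition is defined for the first block
    for prev, cur in zip(levels, levels[1:]):
        rates.append((cur - prev) / (cur + prev + eps))
    return rates
```

With this normalization, a signal mute drives the rate towards -1 and an onset from silence drives it towards +1, irrespective of the absolute signal level.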
The spectral parameter representing the spectral centroid indicates the frequency around which most of the signal energy is concentrated, e.g. according to equation (5) below, and a deviation in this parameter is related to a loss of bandwidth and an additive background noise. Since the spectral centroid is related to the spectrum tilt, the spectral centroid can be approximated as the coefficient in the first-order linear-prediction analysis.
[Equation (5): reproduced as an image in the original publication]
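Equation (5) is not available in this text. For illustration, a common definition of the spectral centroid is the power-weighted mean frequency of the block; the sketch below assumes this definition, bins spaced uniformly from 0 Hz to half of an assumed 8 kHz sampling rate, and a hypothetical function name.

```python
def spectral_centroid(power, sample_rate=8000.0):
    """Power-weighted mean frequency (Hz) of a one-sided power spectrum.

    Bins are assumed to be uniformly spaced from 0 Hz to sample_rate / 2.
    Returns 0.0 for a silent block or a degenerate spectrum (assumption).
    """
    n_bins = len(power)
    total = sum(power)
    if n_bins < 2 or total <= 0.0:
        return 0.0
    step = (sample_rate / 2.0) / (n_bins - 1)
    return sum(k * step * p for k, p in enumerate(power)) / total
```

A loss of bandwidth removes power from the upper bins and therefore lowers the centroid, which matches the deviation described above.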
The above-described exemplary parameters, and in particular the spectral flatness and the normalized transition rate of the RMSE, represent meaningful dimensions of a block of an audio signal, such as the resonant structure, the perceived brightness, and the energy changes, and the parametric representation is easy to associate with a particular distortion. Further, the spectral parameters are robust to errors in time-alignment and formant displacement, since they do not require that the frequency bins of the reference signal and the processed signal are perfectly positioned.
The local distortion, Dn, for a first block with index n, which is calculated from the differences between each spectral parameter of the block in the processed signal and the same spectral parameter in the corresponding block in the reference signal, can be expressed e.g. according to the equation (6) below:
$$D_n = g\left(\Phi_n^r - \Phi_n^p,\; C_n^r - C_n^p,\; E_n^r - E_n^p\right) \qquad (6)$$
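The mapping from the three parameter differences to a single local distortion is not specified further in this text. One possible choice, shown here only as an assumption, is a weighted sum of absolute differences; the weights and the function name are hypothetical.

```python
def local_distortion(ref_params, proc_params, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of absolute parameter differences for one block.

    ref_params and proc_params are (flatness, centroid, transition_rate)
    tuples. The weights would in practice have to compensate for the very
    different numeric ranges of the three parameters (assumption).
    """
    return sum(w * abs(r - q)
               for w, r, q in zip(weights, ref_params, proc_params))
```

Only three numbers per block are compared, rather than e.g. 128 spectral components, so a slight displacement of a formant changes the distortion gradually instead of affecting every bin.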
According to a first embodiment of this invention, the synchronization in time of the processed signal and the reference signal is performed jointly with the calculation of the aggregation of the local distortions, Dn, by calculating each local distortion, as well as the aggregation of the local distortion, at a number of different time-displacements, m, between the reference signal and the processed signal. Thereby, an optimal time-displacement can be determined by selecting the minimum of the calculated aggregated local distortions, and determining the quality value from this minimum of the distortions.
The calculation of the local distortion for first block n, at time displacement m can be expressed e.g. by the equation (7) below:
$$D_{n,m} = g\left(\Phi_n^r - \Phi_{n+m}^p,\; C_n^r - C_{n+m}^p,\; E_n^r - E_{n+m}^p\right) \qquad (7)$$
Thereafter, the local distortions are aggregated at different m, e.g. as an Lp norm according to equation (8) :
[Equation (8): reproduced as an image in the original publication]
The quality is predicted from the minimum aggregated value of the local distortions, at an optimal time-displacement, at which the processed signal is time-aligned with the reference signal. According to an embodiment of this invention, the predicted quality is indicated by a selected suitable quality value. The quality indicated by the quality value is inversely proportional to the aggregated local distortions, since a comparatively small distortion of the audio signal means that the predicted quality of the audio signal is comparatively high.
The optimal time displacement, m*, can be calculated e.g. according to equation (9):
[Equation (9): reproduced as an image in the original publication]
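The joint time-alignment and aggregation can be sketched, under stated assumptions, as an exhaustive search over candidate displacements m, keeping the displacement with the smallest aggregated distortion. The per-block distance (sum of absolute differences), the normalization by the number of overlapping blocks, the search range, and all names are hypothetical choices.

```python
import math


def local_distortion(ref_params, proc_params):
    """Sum of absolute differences between two parameter tuples (assumption)."""
    return sum(abs(r - q) for r, q in zip(ref_params, proc_params))


def aggregated_distortion(ref_seq, proc_seq, m, p=2.0):
    """Lp-style aggregate of D(n, m) over all n for which block n+m exists."""
    pairs = [(ref_seq[n], proc_seq[n + m])
             for n in range(len(ref_seq)) if 0 <= n + m < len(proc_seq)]
    if not pairs:
        return math.inf
    total = sum(local_distortion(r, q) ** p for r, q in pairs)
    # Dividing by the overlap size avoids favouring shifts with few blocks.
    return (total / len(pairs)) ** (1.0 / p)


def best_displacement(ref_seq, proc_seq, max_shift, p=2.0):
    """Return (m_star, minimum aggregated distortion) over -max_shift..max_shift."""
    scores = {m: aggregated_distortion(ref_seq, proc_seq, m, p)
              for m in range(-max_shift, max_shift + 1)}
    m_star = min(scores, key=scores.get)
    return m_star, scores[m_star]
```

For example, if the processed parameter sequence equals the reference sequence delayed by two blocks, the search returns a displacement of 2 with zero aggregated distortion. The search operates on a few parameters per block rather than on waveforms, which keeps its cost low.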
Figure 5 is a flow diagram illustrating a method for predicting the quality of an audio signal, according to a first exemplary embodiment of this invention. In step 51, the reference signal and the processed signal are segmented into a number of first blocks having a length of e.g. 20-40 ms, and in step 52, e.g. three different spectral parameters are calculated for each of the first blocks in the processed signal and in the reference signal. The spectral parameters are at least two, and suitable spectral parameters are e.g. the spectral flatness, the spectral centroid and the normalized transition rate of RMSE, as described above. In step 53, the local distortion, Dn, is calculated for each of the first blocks from the difference between each spectral parameter in the block of the processed output signal and in the corresponding block of the input reference signal, in order to determine the distortion of the audio signal during the transmission through the communication system. Next, in step 54, the processed signal is synchronized in time with the reference signal by a calculation of an aggregated value, e.g. as an Lp norm, of the local distortions in each block at different time-displacements, m, between the processed signal and the reference signal. The predicted first quality value is determined, in step 55, from the minimum of the aggregated local distortions, at the optimal time-displacement, m*, between the processed signal and the reference signal.
In the prediction of the first quality value, as illustrated in figure 5, the spectral parameters and the local distortion are calculated for fixed small-scale blocks, e.g. with a length of 20 ms. However, according to a second embodiment of this invention, the distortions can be obtained at a larger scale, as well, through calculating second parameters as statistic values from the calculated spectral parameters of the first blocks located within a larger-scale second block.
Thus, according to a second embodiment of this invention, said second parameters are obtained by calculating e.g. the mean, the variance, the skew, or a certain quintile of the spectral parameters calculated for the first blocks located within the larger-scale second block. Thereby, the second parameters indicated in equation (10), (11) and (12) below are obtained for the larger-scale second block with index B of the reference signal, the larger-scale second block containing a pre-determined number of small-scale first blocks:
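$$\{\Phi_{n-B}, \ldots, \Phi_{n+B}\} \rightarrow \Phi_B \qquad (10)$$
$$\{C_{n-B}, \ldots, C_{n+B}\} \rightarrow C_B \qquad (11)$$
$$\{E_{n-B}, \ldots, E_{n+B}\} \rightarrow E_B \qquad (12)$$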
Obviously, the corresponding second parameters are also obtained for the processed signal. The local distortion, DB, for this large-scale second block B is calculated from the difference between the second parameters in this larger-scale second block in the processed signal and the corresponding larger-scale second block in the reference signal, e.g. according to the equation (13) below:
$$D_B = g\left(\Phi_B^r - \Phi_B^p,\; C_B^r - C_B^p,\; E_B^r - E_B^p\right) \qquad (13)$$
According to a further embodiment of this invention, the total quality of an audio signal sequence having a length of e.g. between 8 and 12 seconds is predicted from the combination of Dn and DB distortions. Dn always describes the local distortion in the small-scale first blocks, which have a fixed length.
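As a non-limiting illustration of the larger-scale calculation, the sketch below groups the per-block parameter tuples into second blocks of k first blocks, reduces each parameter with a statistic (the mean by default), and compares reference and processed second blocks at a given displacement. Non-overlapping second blocks, the absolute-difference distance, and all names are assumptions.

```python
from statistics import mean


def second_params(first_params, stat=mean):
    """Reduce the first-block tuples inside one second block, per parameter."""
    return tuple(stat(column) for column in zip(*first_params))


def large_scale_distortions(ref_seq, proc_seq, k, m_star=0, stat=mean):
    """Distortion D_B for each second block of k first blocks.

    The processed sequence is read at offset m_star, i.e. at the displacement
    found during the small-scale alignment. Second blocks that would fall
    outside the processed sequence are skipped (assumption).
    """
    distortions = []
    for start in range(0, len(ref_seq) - k + 1, k):
        lo = start + m_star
        if lo < 0 or lo + k > len(proc_seq):
            continue
        ref_b = second_params(ref_seq[start:start + k], stat)
        proc_b = second_params(proc_seq[lo:lo + k], stat)
        distortions.append(sum(abs(r - q) for r, q in zip(ref_b, proc_b)))
    return distortions
```

Passing statistics.pvariance as stat yields variance-based second parameters instead of means; a skew or quantile statistic can be supplied in the same way as any callable that maps a sequence of numbers to a single number.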
However, a larger-scale second block, indicated by index B, has a length corresponding to at least two of the first blocks, i.e. a length between two small-scale blocks and the total length of the signal sequence.
The total quality is predicted as a linear combination between quality predictions determined from the distortions with different resolution, i.e. the small-scale local distortions and the larger-scale distortions are aggregated independently. Accordingly, a first quality value, Q1, is determined from an aggregation of the small-scale local distortions, Dn, and a second quality value, Q2, is determined from an aggregation of the large-scale distortions, DB. Thereafter, the first quality value Q1 and the second quality value Q2 are combined to form the total quality value Qtot, e.g. according to equation (14) below:
$$Q_{tot} = k_1 Q_1 + k_2 Q_2 \qquad (14)$$
If k1 = k2 in equation (14), the first quality value and the second quality value are added with the same weight. However, according to a further embodiment, the first quality value and the second quality value are added with different weight, and the different weights are indicated by k1 ≠ k2 in (14) above.
For example, the second quality value predicted from larger-scale blocks with index B could be given a higher weight in the predicted total quality value when a specific distortion is detected, since some distortions are more easily described with larger-scale parameters, such as e.g. additive background noise, bandwidth limitations and the energy loss in larger signal segments. Therefore, it may be advantageous to give the second large-scale quality value a higher weight in the total quality value, and in this case k1 < k2 in equation (14) above.
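The weighted combination of the two quality values can be sketched as below. The default weights and the function name are hypothetical, and no particular values for the weights are given in this text.

```python
def total_quality(q1, q2, k1=0.5, k2=0.5):
    """Linear combination of the small-scale and large-scale quality values."""
    return k1 * q1 + k2 * q2
```

For instance, with q1 = 4.0 and q2 = 2.0, equal weights of 0.5 give 3.0, whereas k1 = 0.25 and k2 = 0.75 give 2.5, so that the large-scale value dominates the total when, for example, background noise has been detected.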
Figure 6 is a flow diagram illustrating the additional steps of predicting a second, larger-scale quality of an audio signal, according to a second exemplary embodiment of this invention, which is performed after the steps illustrated in figure 5. In step 61, the sequence of the processed signal and of the reference signal are segmented into one or more second blocks, of which each of the second blocks contains two or more of the small-scale first blocks. In step 62, a second parameter is calculated statistically from each of the spectral parameters of the first blocks contained in the larger-scale second block in the processed signal and in the reference signal, at the optimal time displacement, m*, and the second parameters are calculated e.g. as the mean, variance or median value of the first parameters. Thereafter, in step 63, the difference is calculated between each second parameter of the block in the processed signal and the same second parameter in the corresponding block of the reference signal, and a local distortion, DB, is calculated for each of the second blocks, e.g. according to equation (13) above. Next, in step 64, a second larger-scale quality value, Q2, is predicted from the aggregated local distortion, and the quality indicated by the selected second quality value is inversely proportional to the aggregated local distortions D.
According to this invention, the spectral features can be extracted from the reference signal and from the processed signals without performing any synchronization. Instead, the synchronization can be performed jointly with the determination of the aggregated distortions. Thereby, the invention achieves a low-complexity perceptual time-alignment, which is superior to conventional waveform synchronization, as well as enabling a prediction of the distortion at different time resolution, i.e. different scales, thus improving the accuracy and flexibility of the quality prediction.
Figure 7 is an apparatus 42 for predicting the quality of an audio signal, according to a first exemplary embodiment. The apparatus comprises signal segmenting means 71 for segmenting a sequence of the reference signal and of the processed signal into a number of first blocks having a length of 20-40 ms. Further, the apparatus comprises spectral parameter calculating means 72 for calculating e.g. three different spectral parameters for each of the first blocks, each spectral parameter representing a different spectral property of the block. The difference between each spectral parameter in each block of the processed signal and the spectral parameter in the corresponding block of the reference signal is calculated by the distortion calculating means, 73, and a local distortion Dn is calculated for each of the first blocks, based on these differences. The local distortions in the blocks of the sequences are aggregated by the aggregation calculating means, 74, e.g. as an Lp-norm, and a first quality value is predicted by the first quality predicting means, 75, such that the quality indicated by the first quality value is inversely proportional to the aggregated local distortions.
It should be noted that the means illustrated in figure 7 may be implemented by physical or logical entities using software functioning in conjunction with a programmed microprocessor or general purpose computer, and/or using an application specific integrated circuit (ASIC) .
According to a second exemplary embodiment, the apparatus is further provided with means for determining a second quality value, which is calculated at a larger scale. These means comprise the following:
- Second segmenting means for segmenting the reference signal and the processed signal into one or more second blocks, each second block being larger than said first blocks, and each second block containing a pre-determined number, i.e. two or more, of the first blocks;
- Second parameter calculating means for calculating a second parameter from each of the spectral parameters calculated for each of the first small-scale blocks contained in a second, larger-scale block;
- Second distortion calculating means for calculating a distortion between each second parameter of the reference signal and the corresponding second parameter of the processed signal, at the optimal time-displacement M between the processed signal and the reference signal, and determining a local distortion for each second block;
- Second quality determining means for determining a second quality value from an aggregated value of the calculated local distortions.
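The second parameter calculating means listed above can be sketched as follows, taking the mean, the variance and the skew named in claim 11 as the second parameters. The sketch groups the per-block values of one spectral parameter into second blocks of a fixed number of first blocks and computes the three statistics for each group. The function names, the use of the population variance, and the convention of returning a skew of zero for a constant group are assumptions.

```python
from typing import List, Sequence, Tuple


def block_statistics(values: Sequence[float]) -> Tuple[float, float, float]:
    """Mean, population variance and skew of one parameter within a second block."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        skew = 0.0  # assumed convention for a constant group
    else:
        skew = sum((v - mean) ** 3 for v in values) / n / var ** 1.5
    return mean, var, skew


def second_scale_parameters(track: List[float],
                            first_blocks_per_second_block: int
                            ) -> List[Tuple[float, float, float]]:
    """Statistics per second block B for one spectral-parameter track over n."""
    k = first_blocks_per_second_block
    return [block_statistics(track[start:start + k])
            for start in range(0, len(track) - k + 1, k)]
```

The same function would be applied to the reference and the processed parameter tracks, the latter read at the optimal time-displacement, before the second-scale distortions are calculated.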
According to a further exemplary embodiment, the apparatus comprises means for determining a total quality of the audio signal, by combining the first quality value with the second quality value, e.g. with different weight.
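A minimal sketch of this combination, assuming a literal inverse-proportional mapping from aggregated distortion to quality and a weighted addition of the two quality values, is given below. The proportionality constant `scale`, the guard `floor` against division by zero, and the default weights are hypothetical values chosen for illustration only.

```python
def quality_from_distortion(aggregated: float, scale: float = 1.0,
                            floor: float = 1e-9) -> float:
    """Quality value inversely proportional to the aggregated distortion."""
    return scale / max(aggregated, floor)


def total_quality(q_first: float, q_second: float,
                  w_first: float = 0.75, w_second: float = 0.25) -> float:
    """Weighted addition of the first (small-scale) and second (large-scale) quality."""
    return w_first * q_first + w_second * q_second
```

In practice the weights would presumably be fitted against subjective quality scores; the text does not state how they are chosen.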
According to a still further embodiment, the apparatus is arranged to be connected to two different points of the communication system, one for insertion of the reference signal and one for receiving the distorted processed signal. A possible connection point is e.g. a mobile phone, a Media Gateway, or a VoIP Gateway.
Further, the above mentioned and described embodiments are only given as examples and should not be limiting to the present invention. Other solutions, uses, objectives, and functions within the scope of the invention as claimed in the accompanying patent claims should be apparent for the person skilled in the art.
ABBREVIATIONS
RMSE - Root Mean Squared Error
VoIP - Voice Over Internet Protocol
n - block index for the first blocks, i.e. the 20-40 ms small-scale blocks
B - block index for the second larger-scale blocks, each containing two or more of the first smaller-scale blocks
N - the number of blocks in the signal sequence
w - frequency bin index, inside one block
r - parameter associated with the reference signal
p - parameter associated with the processed signal

Claims

1. A method of predicting the quality of an audio signal after transmission through a communication system, the method using a reference signal corresponding to an input signal to the communication system, and a processed signal corresponding to an output signal from said communication system, characterized by the following steps:
- Segmenting (51) the reference signal and the processed signal into at least two first blocks having a pre-determined length;
- Calculating (52) a number of different spectral parameters representing spectral properties of the signal for each of said first blocks, the number of spectral parameters being at least two;
- For each of said first blocks, calculating (53) a distortion between each calculated spectral parameter of the reference signal and the corresponding calculated spectral parameter of the processed signal;
- Calculating (54) an aggregated value of said distortions for a number of different time-displacements between the reference signal and the processed signal;
- Determining (55) a first quality value of the audio signal from a minimum aggregated value of the distortions at an optimal time-displacement.
2. A method according to claim 1, wherein the quality indicated by the determined first quality value is inversely proportional to the minimum aggregated value of the distortions.
3. A method according to claim 1 or 2, wherein the number of spectral parameters is equal to three.
4. A method according to any of the preceding claims, wherein one of said spectral parameters represents a spectral flatness, which indicates the resonant structure of the power spectrum.
5. A method according to any of the preceding claims, wherein one of said spectral parameters represents the normalized transition rate of RMSE, which indicates the rate of signal energy change.
6. A method according to any of the preceding claims, wherein one of said spectral parameters represents the spectral centroid, which indicates the frequency around which the signal power is concentrated.
7. A method according to any of the preceding claims, the method comprising the further steps of:
- Segmenting (61) the reference signal and the processed signal into at least one second block, each second block containing a pre-determined number of the first blocks;
- For each of the second blocks, calculating (62) a second parameter from each of the spectral parameters calculated for each of the first blocks contained in the second block, and calculating (63) a distortion between each second parameter of the reference signal and the corresponding second parameter of the processed signal, at said optimal time displacement;
- Determining (64) a second quality value from an aggregated value of the calculated distortions.
8. A method according to claim 7, wherein a determined second quality value is inversely proportional to the aggregated value of the distortions.
9. A method according to claim 7 or 8, comprising the further step of determining a total quality value of the audio signal by combining a determined first quality value with a determined second quality value.
10. A method according to claim 9, wherein the values of the first quality and the second quality are combined by addition with different weight.
11. A method according to any of the claims 7 - 10, wherein the calculation of said second parameters comprises determining the means, the variance, or the skew of the spectral parameters calculated for the first blocks contained in the second blocks.
12. An apparatus (42) for predicting the quality of an audio signal transmitted through a communication system by using a reference signal (11) corresponding to an input signal to said communication system, and a processed signal (12) corresponding to a distorted output signal from the communication system, the apparatus characterized in that it comprises:
- Signal segmenting means (71) for segmenting the reference signal and the processed signal into at least two first blocks having a pre-determined length;
- Parameter calculating means (72) for calculating at least two spectral parameters for each of the first blocks, each spectral parameter representing a different spectral property of the signal;
- Distortion calculating means (73) for calculating the distortion between each spectral parameter of the reference signal and the corresponding spectral parameter of the processed signal, for each of the first blocks;
- Aggregation calculating means (74) for calculating an aggregated value of said calculated distortions at a number of different time-displacements between the reference signal and the processed signal;
- First quality determining means (75) for determining a first quality value of the audio signal from a minimum aggregated value of the distortions at an optimal time-displacement.
13. An apparatus according to claim 12, wherein the quality indicated by the determined first quality value is inversely proportional to said minimum aggregated value of the distortions.
14. An apparatus according to claim 12 or 13, wherein the number of spectral parameters is equal to three.
15. An apparatus according to any of the claims 12 - 14, wherein one of said spectral parameters represents the spectral flatness, which indicates the resonant structure of the power spectrum.
16. An apparatus according to any of the claims 12 - 15, wherein one of said spectral parameters represents the normalized transition rate of RMSE, which indicates the rate of the signal energy change.
17. An apparatus according to any of the claims 12 - 16, wherein one of said spectral parameters represents the spectral centroid, which indicates the frequency around which the signal power is concentrated.
18. An apparatus according to any of the claims 12 - 17, further comprising means for determining a second quality value, the means characterized by:
- Second segmenting means for segmenting the reference signal (11) and the processed signal (12) into at least one second block, each second block containing a pre-determined number of the first blocks;
- Second parameter calculating means for calculating a second parameter from each of the spectral parameters calculated for each of the first blocks contained in the second blocks;
- Second distortion calculating means for calculating a distortion between each second parameter of the reference signal and the corresponding second parameter of the processed signal for each block, at said optimal time-displacement;
- Second quality determining means for determining a second quality value from an aggregated value of the calculated distortions.
19. An apparatus according to claim 18, wherein a determined second quality value is inversely proportional to the aggregated value of the distortions.
20. An apparatus according to claim 18 or 19, further comprising quality determining means for determining a total quality of the audio signal by combining the first quality value with the second quality value.
21. An apparatus according to claim 20, wherein the first quality value and the second quality value are combined by an addition with different weight.
22. An apparatus according to any of the claims 18 - 21, wherein the calculation of the second parameters comprises determining the means, the variance, or the skew of the spectral parameters calculated for the first blocks contained in a second block.
23. An apparatus according to any of the claims 11 - 22, wherein the apparatus is arranged to be connected to two points of the communication system, one for insertion of the reference signal and one for receiving the distorted processed signal.
PCT/EP2009/051054 2009-01-30 2009-01-30 Audio signal quality prediction Ceased WO2010086020A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US13/146,426 US20120020484A1 (en) 2009-01-30 2009-01-30 Audio Signal Quality Prediction
EP09778994A EP2392003B1 (en) 2009-01-30 2009-01-30 Audio signal quality prediction
JP2011546623A JP5204904B2 (en) 2009-01-30 2009-01-30 Audio signal quality prediction
PCT/EP2009/051054 WO2010086020A1 (en) 2009-01-30 2009-01-30 Audio signal quality prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2009/051054 WO2010086020A1 (en) 2009-01-30 2009-01-30 Audio signal quality prediction

Publications (1)

Publication Number Publication Date
WO2010086020A1 (en) 2010-08-05

Family

ID=41136699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2009/051054 Ceased WO2010086020A1 (en) 2009-01-30 2009-01-30 Audio signal quality prediction

Country Status (4)

Country Link
US (1) US20120020484A1 (en)
EP (1) EP2392003B1 (en)
JP (1) JP5204904B2 (en)
WO (1) WO2010086020A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011010962A1 (en) * 2009-07-24 2011-01-27 Telefonaktiebolaget L M Ericsson (Publ) Method, computer, computer program and computer program product for speech quality estimation
US8493202B1 (en) 2010-03-22 2013-07-23 Alarm.Com Alarm signaling technology
RU2584009C2 (en) * 2011-09-29 2016-05-20 Долби Интернешнл Аб Detection of high quality in frequency modulated stereo radio signals
US9830905B2 (en) 2013-06-26 2017-11-28 Qualcomm Incorporated Systems and methods for feature extraction
US11888919B2 (en) * 2013-11-20 2024-01-30 International Business Machines Corporation Determining quality of experience for communication sessions
US9325838B2 (en) * 2014-07-22 2016-04-26 International Business Machines Corporation Monitoring voice over internet protocol (VoIP) quality during an ongoing call
WO2017127367A1 (en) * 2016-01-19 2017-07-27 Dolby Laboratories Licensing Corporation Testing device capture performance for multiple speakers
EP3483878A1 (en) * 2017-11-10 2019-05-15 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio decoder supporting a set of different loss concealment tools
JP7212925B2 (en) * 2018-10-30 2023-01-26 国立大学法人九州大学 Speech transmission environment evaluation system and sensory stimulus presentation device
CN120998235B (en) * 2025-10-23 2026-02-03 深圳联康测控有限公司 Audio quality automatic scoring method and system combining MFCC and time domain statistical characteristics

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000022803A1 (en) * 1998-10-08 2000-04-20 British Telecommunications Public Limited Company Measurement of speech signal quality

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09331391A (en) * 1996-06-12 1997-12-22 Nippon Telegr & Teleph Corp <Ntt> Call quality objective estimation device
US6201176B1 (en) * 1998-05-07 2001-03-13 Canon Kabushiki Kaisha System and method for querying a music database
FR2835125B1 (en) * 2002-01-24 2004-06-18 Telediffusion De France Tdf METHOD FOR EVALUATING A DIGITAL AUDIO SIGNAL
JP3809164B2 (en) * 2002-12-25 2006-08-16 日本電信電話株式会社 Comprehensive call quality estimation method and apparatus, program for executing the method, and recording medium therefor
EP1620811A1 (en) * 2003-04-24 2006-02-01 Koninklijke Philips Electronics N.V. Parameterized temporal feature analysis
JP4341586B2 (en) * 2005-06-08 2009-10-07 Kddi株式会社 Call quality objective evaluation server, method and program
JP2007013674A (en) * 2005-06-30 2007-01-18 Ntt Docomo Inc Total call quality evaluation apparatus and total call quality evaluation method
US7933427B2 (en) * 2006-06-27 2011-04-26 Motorola Solutions, Inc. Method and system for equal acoustics porting
JP4597919B2 (en) * 2006-07-03 2010-12-15 日本電信電話株式会社 Acoustic signal feature extraction method, extraction device, extraction program, recording medium recording the program, acoustic signal search method, search device, search program using the features, and recording medium recording the program

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BEERENDS J G ET AL: "PERCEPTUAL EVALUATION OF SPEECH QUALITY (PESQ) THE NEW ITU STANDARD FOR END-TO-END SPEECH QUALITY ASSESSMENT PART II-PSYCHOACOUSTIC MODEL", JOURNAL OF THE AUDIO ENGINEERING SOCIETY, AUDIO ENGINEERING SOCIETY, NEW YORK, NY, US, vol. 50, no. 10, 1 October 2002 (2002-10-01), pages 765 - 778, XP001245918, ISSN: 1549-4950 *
RIX A W ET AL: "PERCEPTUAL EVALUATION OF SPEECH QUALITY (PESQ) THE NEW ITU STANDARD FOR END-TO-END SPEECH QUALITY ASSESSMENT PART 1-TIME-DELAY COMPENSATION", JOURNAL OF THE AUDIO ENGINEERING SOCIETY, AUDIO ENGINEERING SOCIETY, NEW YORK, NY, US, vol. 50, no. 10, 1 October 2002 (2002-10-01), pages 755 - 764, XP001245917, ISSN: 1549-4950 *
RIX A W ET AL: "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs", 2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. PROCEEDINGS. (ICASSP). SALT LAKE CITY, UT, MAY 7 - 11, 2001; [IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)], NEW YORK, NY : IEEE, US, vol. 2, 7 May 2001 (2001-05-07), pages 749 - 752, XP010803764 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014505393A (en) * 2010-12-07 2014-02-27 Empire Technology Development LLC Audio fingerprint difference for measuring quality of experience between devices
US8989395B2 (en) 2010-12-07 2015-03-24 Empire Technology Development Llc Audio fingerprint differences for end-to-end quality of experience measurement
US9218820B2 (en) 2010-12-07 2015-12-22 Empire Technology Development Llc Audio fingerprint differences for end-to-end quality of experience measurement
US12033646B2 (en) 2017-11-10 2024-07-09 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Analysis/synthesis windowing function for modulated lapped transformation

Also Published As

Publication number Publication date
EP2392003A1 (en) 2011-12-07
JP5204904B2 (en) 2013-06-05
EP2392003B1 (en) 2013-01-02
US20120020484A1 (en) 2012-01-26
JP2012516591A (en) 2012-07-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 09778994; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2011546623; Country of ref document: JP)
WWE Wipo information: entry into national phase (Ref document number: 13146426; Country of ref document: US)
WWE Wipo information: entry into national phase (Ref document number: 2009778994; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)