JPH0330159B2

Info

Publication number: JPH0330159B2
Application number: JP62136377A
Authority: JP
Priority date: 1987-05-29
Filing date: 1987-05-29
Publication date: 1991-04-26
Also published as: JPS63300296A

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a speech recognition method. More specifically, it provides a novel method that achieves a high recognition rate by using vector field patterns, together with a device for carrying out the method.

[Prior art]

In speech recognition, a standard pattern is generally prepared in advance for each word to be recognized by extracting features from that word. A feature pattern extracted in the same way from the speech input for recognition is matched against the multiple standard patterns, the standard pattern with the highest similarity is found, and the word associated with that standard pattern is judged to be the word that was input.

Conventionally, the feature pattern has been the spatiotemporal pattern of a scalar field itself, obtained by analyzing the speech signal, with time on the horizontal axis and a spatial axis on the vertical axis. The typical example is the spectrum, whose spatial axis is frequency. Other spatiotemporal patterns in use include the cepstrum, whose spatial axis is quefrency, PARCOR coefficients, LSP coefficients, and the vocal tract area function.
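The matching step described above can be sketched as a nearest-pattern search. This is only an illustration: the Euclidean distance, the function names `distance` and `recognize`, and the toy 2x2 patterns are assumptions, since the text does not specify a similarity measure at this point.

```python
import math

def distance(a, b):
    """Euclidean distance between two equally sized 2-D patterns (lists of rows)."""
    return math.sqrt(sum((p - q) ** 2
                         for row_a, row_b in zip(a, b)
                         for p, q in zip(row_a, row_b)))

def recognize(feature, standards):
    """Return the word whose standard pattern is closest to `feature`.

    `standards` maps each word to a list of one or more standard patterns.
    """
    best_word, best_dist = None, float("inf")
    for word, patterns in standards.items():
        for pattern in patterns:
            d = distance(feature, pattern)
            if d < best_dist:
                best_word, best_dist = word, d
    return best_word

# Toy standard patterns (hypothetical data, one pattern per word).
standards = {
    "yes": [[[1.0, 0.0], [0.0, 1.0]]],
    "no": [[[0.0, 1.0], [1.0, 0.0]]],
}
result = recognize([[0.9, 0.1], [0.2, 0.8]], standards)
print(result)  # yes
```

Because `standards` holds a list of patterns per word, the same function also covers the case where several standard patterns are prepared for one word.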

Another problem to be solved in speech recognition is handling many speakers or unspecified speakers. The usual remedy has been to prepare many standard patterns for each word in order to raise the recognition rate. In addition, speaking rate can vary even for the same speaker, and the DP matching method, which can absorb variation along the time axis, was developed to cope with such cases.
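As a rough illustration of how DP matching absorbs differences in speaking rate, the following sketch computes a dynamic-programming alignment cost between two frame sequences. The city-block local cost, the unconstrained three-way step pattern, and the name `dp_matching` are assumptions chosen for brevity, not details taken from the text.

```python
def dp_matching(a, b):
    """Dynamic-programming alignment cost between two sequences of frames.

    Each frame is a list of floats; the local cost is the city-block distance.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum(abs(p - q) for p, q in zip(a[i - 1], b[j - 1]))
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# The same contour spoken at half speed aligns with zero cost.
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
print(dp_matching(slow, fast))  # 0.0
```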

[Problem that the invention seeks to solve]

Conventional methods that use the spatiotemporal pattern of the scalar field itself as the feature do not always achieve a sufficient recognition rate for large vocabularies or unspecified speakers. Preparing many standard patterns for each word as described above, or using the DP matching method, has not been a real solution. As a result, the practical application of speech recognition systems for unspecified speakers and large vocabularies has stalled.

One of the present inventors therefore proposed, in Japanese Unexamined Patent Publication No. 60-59394, a technique in which the spectrum, a scalar field forming a time-frequency spatiotemporal pattern, is spatially differentiated to obtain a spectral vector field pattern, and this pattern is used as the feature. The present application takes that technique a step further from an engineering standpoint. Its object is to provide an improved speech recognition method, and a device for carrying it out, in which the computation is simple and fast enough for practical use and a higher recognition rate is obtained.

[Means for solving problems]

The speech recognition method according to the present invention is a method that extracts a feature pattern from a speech signal input for recognition and matches the feature pattern against standard patterns to identify the input speech, and is characterized by: analyzing the speech signal to obtain a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis; spatially differentiating that spatiotemporal pattern to convert it into a vector field pattern having a magnitude and a direction at each grid point of the space; quantizing the direction parameter of each vector of the vector field pattern into N values (N: an integer); separating the vectors by quantized value and creating N direction-specific two-dimensional patterns in which the value at each grid point is the magnitude of the vector; and using these direction-specific two-dimensional patterns as the feature pattern.
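The conversion into direction-specific two-dimensional patterns can be sketched as follows. Several choices here are assumptions made for illustration: central differences stand in for the spatial differentiation, only interior grid points are processed, the N directions are equal-width angle bins starting at 0 radians, and the name `directional_patterns` is hypothetical.

```python
import math

def directional_patterns(F, n_dirs=8):
    """Split the gradient field of F[t][x] into n_dirs direction-specific patterns.

    Returns a list of n_dirs patterns, each the same size as F. A grid point
    holds the vector magnitude in the pattern of its quantized direction and
    0.0 in all other patterns.
    """
    T, L = len(F), len(F[0])
    patterns = [[[0.0] * L for _ in range(T)] for _ in range(n_dirs)]
    width = 2 * math.pi / n_dirs
    for t in range(1, T - 1):
        for x in range(1, L - 1):
            dt = (F[t + 1][x] - F[t - 1][x]) / 2.0  # derivative along time
            dx = (F[t][x + 1] - F[t][x - 1]) / 2.0  # derivative along the spatial axis
            r = math.hypot(dt, dx)                  # vector magnitude
            if r == 0.0:
                continue                            # flat point: no direction
            theta = math.atan2(dx, dt) % (2 * math.pi)
            k = int(theta / width) % n_dirs         # quantized direction
            patterns[k][t][x] = r
    return patterns

# A pattern that rises linearly in time has all its vectors in direction 0.
ramp = [[float(t)] * 4 for t in range(4)]
pats = directional_patterns(ramp, n_dirs=8)
print(pats[0][1][1])  # 1.0
```

Each output pattern contains only real, non-negative magnitudes, so it can be matched like an ordinary scalar pattern.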

[Effect]

The input speech signal is converted from the spatiotemporal pattern of a scalar field, defined by the time and spatial axes, into a plurality of direction-specific two-dimensional patterns: the direction parameter of each vector is quantized, and the vectors are separated by quantized direction. Recognition is performed with these patterns as the feature pattern. Because the patterns are built from the spatial derivative of the spatiotemporal pattern, that is, from information on spatiotemporal change, they represent the phonological character of speech well and are little affected by speaker variation and the like. Quantizing the direction parameter also absorbs variation in the vector field. Furthermore, the complex-number arithmetic that would be unavoidable if the vector field pattern itself were used as the feature pattern becomes unnecessary, which simplifies the computation.

[Example]

The present invention is described in detail below with reference to drawings showing its embodiments.

FIG. 1 is a block diagram showing the configuration of a device for carrying out the method of the present invention. In this embodiment the analysis unit performs spectral analysis of the speech signal, and the spectrum, whose spatial axis is the frequency axis, is used as the spatiotemporal pattern of the scalar field.

Speech for creating standard patterns, or speech to be recognized, is input through a speech input unit 1 consisting of a speech detector such as a microphone and an A/D converter. The resulting speech signal is fed to an analysis unit 2 made up of bandpass filters of multiple channels (for example, 10 to 30), each with a different passband, connected in parallel. The analysis yields a spatiotemporal pattern, which a word-section extraction unit 3 divides into words, the unit of recognition, and passes to a feature extraction unit 4. Any conventionally known word-section extraction unit may be used as unit 3.

In the following description, the analysis unit that divides the speech signal into frequency bands is the bandpass filter bank described above, but a fast Fourier transformer may be used instead.
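As a hedged sketch of the Fourier-transform alternative, the code below computes one frame of channel energies by grouping power-spectrum bins into bands. A plain DFT is used so the example stays self-contained; a real system would use an FFT, which yields the same spectrum. The equal-width bands, the frame length, and the name `band_energies` are assumptions.

```python
import cmath
import math

def band_energies(frame, n_channels):
    """Split one frame's power spectrum into n_channels equal-width bands.

    Only bins below the Nyquist frequency are used; bins left over when the
    bin count is not a multiple of n_channels are dropped.
    """
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        s = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        power.append(abs(s) ** 2)
    width = half // n_channels
    return [sum(power[c * width:(c + 1) * width]) for c in range(n_channels)]

# A sine at DFT bin 5 of a 32-sample frame falls in channel 1 (bins 4 to 7).
frame = [math.sin(2 * math.pi * 5 * i / 32) for i in range(32)]
energies = band_energies(frame, 4)
print(energies.index(max(energies)))  # 1
```

Stacking the output of `band_energies` for successive frames gives a time-by-channel array of the kind the feature extraction stage takes as input.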

The method of the present invention is characterized by the feature extraction unit described next. The input to feature extraction unit 4 is a spatiotemporal pattern with time on the horizontal axis and frequency on the vertical axis. The spatiotemporal pattern shown in FIG. 2, cut out by word-section extraction unit 3, is expressed as a function of (t, x), where t is the index of the sampling time and x is the channel number of the bandpass filter or a number identifying the frequency band (1 ≤ t ≤ T, 1 ≤ x ≤ L).

The output of word-section extraction unit 3 is input to normalization unit 41 of feature extraction unit 4, which linearly normalizes the time axis. This absorbs, to some extent, differences in word length and in the duration of the input speech. The time axis is converted from T frames to M frames (for example, about 16 to 23 frames). Specifically, when M ≤ T, the normalized spatiotemporal pattern F(t, x) is obtained by equation (1) below.

F(t, x) = _(T/M)
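Equation (1) is not fully legible in this copy, so the following is only a sketch of one plausible linear time normalization for the case M ≤ T, not a reconstruction of the patent's formula. It assumes that each of the M output frames is the average of a block of roughly T/M consecutive input frames; the name `normalize_time` is hypothetical.

```python
def normalize_time(f, M):
    """Linearly compress a T-frame pattern f[t][x] to M frames (requires M <= T).

    Output frame m averages the input frames with index in
    [m*T//M, (m+1)*T//M). This averaging rule is an assumption.
    """
    T, L = len(f), len(f[0])
    out = []
    for m in range(M):
        start = (m * T) // M
        stop = max(((m + 1) * T) // M, start + 1)  # always at least one frame
        block = f[start:stop]
        out.append([sum(row[x] for row in block) / len(block) for x in range(L)])
    return out

# Four input frames compressed to two: each output frame averages two inputs.
f = [[0.0], [2.0], [4.0], [6.0]]
normalized = normalize_time(f, 2)
print(normalized)  # [[1.0], [5.0]]
```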

Claims

[Claims]

1. A speech recognition method in which a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis is obtained from a speech signal input for recognition, and a feature pattern based on the spatiotemporal pattern is matched against standard patterns to identify the input speech, characterized in that: the speech signal is analyzed to obtain the spatiotemporal pattern of the scalar field; the spatiotemporal pattern is spatially differentiated to convert it into a vector field pattern having a magnitude and a direction at each grid point of the space; the direction parameter of each vector of the vector field pattern is quantized into N values (N: an integer); the vectors are separated by quantized value to create N direction-specific two-dimensional patterns in which the value at each grid point is the magnitude of the vector; and the direction-specific two-dimensional patterns are used as the feature pattern.

2. A speech recognition device that obtains a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis from a speech signal input for recognition, and matches a feature pattern based on the spatiotemporal pattern against standard patterns to identify the input speech, characterized by comprising: an analysis unit that analyzes the input speech signal to obtain the spatiotemporal pattern of the scalar field; a normalization unit that normalizes the spatiotemporal pattern with respect to time; a vector field pattern extraction unit that spatially differentiates the normalized spatiotemporal pattern to extract a vector field pattern having a magnitude and a direction at each grid point of the space; and a direction-specific two-dimensional pattern creation unit that quantizes the direction parameter of each vector of the vector field pattern into N values (N: an integer), separates the vectors by quantized value, and creates N direction-specific two-dimensional patterns in which the value at each grid point is the magnitude of the vector; wherein the output of the direction-specific two-dimensional pattern creation unit is used as the feature pattern.

3. A speech recognition device that obtains a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis from a speech signal input for recognition, and matches a feature pattern based on the spatiotemporal pattern against standard patterns to identify the input speech, characterized by comprising: an analysis unit that analyzes the input speech signal to obtain the spatiotemporal pattern of the scalar field; a vector field pattern extraction unit that, for each group of a plurality of frames in the time-axis direction output sequentially from the analysis unit, spatially differentiates the spatiotemporal pattern to extract vector field patterns sequentially; a first direction-specific two-dimensional pattern creation unit that quantizes the direction parameter of each vector of the sequentially extracted vector field patterns into N values (N: an integer), separates the vectors by quantized value, and sequentially creates N direction-specific two-dimensional patterns in which the value at each grid point is the magnitude of the vector; and a second direction-specific two-dimensional pattern creation unit that obtains the average of the direction-specific two-dimensional patterns of the plurality of frames created sequentially by the first direction-specific two-dimensional pattern creation unit, or selects one pattern from among those two-dimensional patterns, and thereby sequentially creates direction-specific two-dimensional patterns for one frame; wherein the output of the second direction-specific two-dimensional pattern creation unit is used as the feature pattern.