JPH07319497A

JPH07319497A - Speech synthesizer

Info

Publication number: JPH07319497A
Application number: JP6108761A
Authority: JP
Inventors: Keiji Hayashi (林 慶士); Noriya Murakami (村上 憲也)
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Group Corp
Priority date: 1994-05-23
Filing date: 1994-05-23
Publication date: 1995-12-08

Abstract

(57) [Abstract]
[Purpose] To provide a speech synthesizer that reduces the amount of natural speech data that must be stored for speech synthesis, and thereby shortens processing time, while preserving the naturalness of the speech.
[Constitution] A preprocessing unit 12 divides an input character string into phoneme units, and a selection reference parameter setting unit 13 sets the prosody parameters (selection reference parameters) that serve as the criterion for selecting the optimum segment. A segment selection unit 14 searches, via a segment file management unit 15, for a waveform segment corresponding to each phoneme unit. If one exists, it selects and extracts the waveform segment whose error from the selection reference parameters is smallest. If none exists, it regards the monosyllable corresponding to that phoneme unit as the optimum segment and extracts it from a monosyllable file. A segment transformation unit 16 modifies the prosody parameters of the selected waveform segment or monosyllable, and a segment connection unit then joins the results in order to obtain synthesized speech corresponding to the input character string.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesizer, and more particularly to a speech synthesizer that generates synthesized speech from an input character string using, as synthesis parameters, waveform segments cut out from natural speech and monosyllables whose prosody parameters are known.

[0002]

[Prior Art] FIG. 6 shows an example functional block diagram of a typical conventional speech synthesizer, in which 61 is an input terminal, 62 is a preprocessing unit, 63 is a selection reference parameter setting unit, 64 is a segment selection unit, 651 is a segment parameter table, 652 is a segment file, 66 is a segment transformation unit, 67 is a segment connection unit, and 68 is an output terminal.

[0003] In a speech synthesizer of this configuration, an input character string consisting of phoneme symbols and accent symbols is entered at the input terminal 61 and then divided into phoneme units by the preprocessing unit 62. From the divided phoneme units and the accent symbols, the selection reference parameter setting unit 63 sets the selection reference parameters (average pitch frequency, pitch slope, duration, and average power) and outputs them to the segment selection unit 64. The selection reference parameters are prosody parameters used as the criterion for selecting waveform segments, which are the synthesis parameters.

[0004] The segment file 652 stores a plurality of waveform segments cut out from natural speech, such as novels or essays, and the segment parameter table 651 stores the prosody parameters of each waveform segment. The segment selection unit 64 searches the segment parameter table 651 on the basis of the selection reference parameters and selects the optimum segment for each phoneme concatenation from the segment file 652. The segment transformation unit 66 modifies each optimum segment selected by the segment selection unit 64, as needed, so that it matches the selection reference parameters. The segment connection unit 67 connects the modified segments in accordance with the input character string and outputs the result through the output terminal 68.

[0005]

[Problems to Be Solved by the Invention] In the speech synthesis processing described above, if no waveform segment corresponding to a phoneme unit exists in the segment file 652, no speech is generated for that phoneme unit. To obtain natural synthesized speech, therefore, at least one waveform segment must be prepared for the vocabulary of every character string to be synthesized. However, when synthesizing speech for a largely unrestricted vocabulary, it is difficult to prepare all the corresponding waveform segments in the segment file 652 in advance and to analyze their prosody parameters, and this has limited the achievable improvement in speech quality. New waveform segments can also be added to the segment file 652 afterwards, but in either case a very large number of waveform segments must be collected, which requires a great deal of labor.

[0006] Moreover, storing a large number of waveform segments in the segment file 652 as described above inevitably increases the file size, and the time required for synthesis processing grows in proportion to that file size.

[0007] An object of the present invention is to solve the above problems and to provide a speech synthesizer that reduces the amount of natural speech data required for speech synthesis, thereby shortening processing time, while still producing natural synthesized speech.

[0008]

[Means for Solving the Problems] The speech synthesizer provided by the present invention comprises at least: a preprocessing unit that divides an input character string corresponding to speech into phoneme units; segment information storage means that stores a plurality of waveform segments cut out from natural speech together with the prosody parameters of each waveform segment; monosyllable information storage means that stores a plurality of monosyllables together with the prosody parameters of each monosyllable; segment selection means that searches whether a waveform segment corresponding to each divided phoneme unit exists in the segment information storage means and, when one exists, selects and extracts the waveform segment whose prosody parameters have the smallest error from prosody parameters determined in advance as the segment selection criterion and, when none exists, selects and extracts the monosyllable corresponding to that phoneme unit from the monosyllable information storage means; and a segment connection unit that connects the extracted waveform segments or monosyllables in the order of the input character string to generate synthesized speech. In other words, two kinds of parameters, waveform segments and monosyllables, are used together as composite synthesis parameters.

[0009] In another configuration provided by the present invention, the above speech synthesizer further comprises a segment transformation unit that modifies the prosody parameters of the waveform segment or monosyllable selected and extracted by the segment selection means so as to bring the error from the prosody parameters determined as the segment selection criterion close to zero, and the waveform segment or monosyllable with the modified prosody parameters is passed to the segment connection unit.

[0010] The error is preferably the sum of the squares of the values obtained by dividing the difference between each prosody parameter determined as the selection criterion and the corresponding extracted prosody parameter by the variation width of that prosody parameter.
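The preferred error measure of paragraph [0010] can be written out as a formula. The symbols below are not in the original and are introduced only for illustration: P_k^ref is the k-th prosody parameter set as the selection criterion, P_k is the corresponding parameter of a candidate, W_k is the variation width of that parameter, and K is the number of prosody parameters (four in the embodiment: average pitch frequency, pitch slope, duration, and average power).

```latex
E = \sum_{k=1}^{K} \left( \frac{P_k^{\mathrm{ref}} - P_k}{W_k} \right)^{2}
```

Dividing by W_k makes each term dimensionless, so that parameters measured on different scales (for example a frequency and a duration) contribute comparably to the total.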

[0011]

[Operation] In the speech synthesizer of the present invention, the preprocessing unit divides the input character string into phoneme units, and the prosody parameters that serve as the criterion for selecting a waveform segment for each phoneme unit (the selection reference parameters) are set. The selection reference parameters are, for example, the desired average pitch frequency, pitch slope, duration, and average power. The segment selection means first searches whether a waveform segment corresponding to each phoneme unit exists in the segment information storage means. If one exists, it is selected and extracted from the segment information storage means. If several exist, then, for example, the squared error between the selection reference parameters and the prosody parameters of each waveform segment is calculated, and the waveform segment with the smallest squared error is selected and extracted. If no waveform segment exists, the monosyllable corresponding to that phoneme unit is regarded as the optimum segment and is selected and extracted from the monosyllable information storage means. The segment connection means joins the waveform segments or monosyllables selected in this way, phoneme unit by phoneme unit, in accordance with the input character string to generate synthesized speech.

[0012] In the speech synthesizer of the other configuration of the present invention, the prosody parameters of the waveform segment or monosyllable selected and extracted by the segment selection means are modified so that the error from the prosody parameters determined as the segment selection criterion approaches zero. The waveform segment or monosyllable with the modified prosody parameters is then passed to the segment connection unit. The prosody of the synthesized speech is adjusted in this way.

[0013]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a functional block diagram of a speech synthesizer according to an embodiment of the present invention, in which 11 is an input terminal, 12 is a preprocessing unit, 13 is a selection reference parameter setting unit, 14 is a segment selection unit, 15 is a segment file management unit, 16 is a segment transformation unit, 17 is a segment connection unit, and 18 is an output terminal. The segment selection unit 14 and the segment file management unit 15 together constitute the segment selection means of the present invention. The functions of the units other than the segment file management unit 15 are basically the same as those of the conventional apparatus shown in FIG. 6, so their detailed description is omitted.

[0014] FIG. 2 shows a configuration example of the segment file management unit 15 of this embodiment. The segment file management unit 15 consists of a phoneme information management table 22, a segment file 241, a segment parameter table 231, a monosyllable parameter table 232, a monosyllable file 242, and a file control unit (CPU) that is not shown. The CPU, the segment file 241, and the segment parameter table 231 together constitute the segment information storage means of the present invention; likewise, the CPU, the monosyllable file 242, and the monosyllable parameter table 232 together constitute the monosyllable information storage means of the present invention.

[0015] For each phoneme unit, the segment file 241 stores a plurality of segments cut out in advance from words and sentences, taking the phonetic environment and other factors into account. The monosyllable file 242, by contrast, stores monosyllables as they are, such as the syllables of the Japanese fifty-sound (gojūon) table and vowel sequences. The waveform segments and monosyllables stored in the two files 241 and 242 are each used as the composite synthesis parameters described later. The phoneme information management table 22 records, for example as numerical data, the number of waveform segments in the segment file 241 that belong to each phoneme unit. The segment parameter table 231 stores the prosody parameters of each waveform segment, and the monosyllable parameter table 232 stores the prosody parameters of each monosyllable. Examples of the contents of the tables 231 and 232 are shown in FIG. 3 and FIG. 4.

[0016] FIG. 3 shows an example of the contents of the segment parameter table 231 for /ka/, in which 30 is the table name, 31 is the file number, 32 is the segment extraction environment, 33 is the average pitch frequency, 34 is the pitch slope, 35 is the duration, and 36 is the average power. The table name 30 serves, for example, as an index to the segment extraction environment 32. "#ka-" denotes an unaccented /ka/ at the beginning of natural speech (a word or sentence). The file number 31 is a file identification code within the segment file 241 and is what the segment selection unit 14 selects. The segment extraction environment 32 indicates the words from which the waveform segment can be extracted. The illustrated example shows that the waveform segment "#ka-" can be extracted from words such as "#kakaeru#". The prosody parameters 33 to 36 for each segment extraction environment are values analyzed when the waveform was cut out and are fixed numerical data.

[0017] FIG. 4 shows an example of the contents of the monosyllable parameter table 232, in which 41 is the file number, 42 is the monosyllable, 43 is the average pitch frequency, 44 is the pitch slope, 45 is the duration, and 46 is the average power. The file number 41 is a file identification code within the monosyllable file 242 and, like the file number 31, is what the segment selection unit 14 selects. The monosyllable 42 corresponds to a monosyllable stored in the monosyllable file 242. The illustrated example shows that 110 monosyllables, such as a, i, ..., wa, are stored in the monosyllable file 242. The more kinds of monosyllable there are, the better. The prosody parameters 43 to 46 for each monosyllable are fixed numerical data analyzed in advance.
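As an illustration of how the tables of FIG. 2 to FIG. 4 relate to one another, the following is a minimal Python sketch rather than the patent's own data format. The class name `ProsodyRow`, the field names, the units, every prosody value, the file number 742, the word "#kaku#", and the pairing of file number 741 with "#kakaeru#" are assumptions made for the example; only the column meanings and the file number 110 for the monosyllable di (mentioned later in the description) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyRow:
    """One row of a parameter table (FIG. 3 or FIG. 4); field names are hypothetical."""
    file_number: int      # file identification code (31 or 41)
    label: str            # segment extraction environment (32) or monosyllable (42)
    avg_pitch_hz: float   # average pitch frequency (33 / 43), unit assumed
    pitch_slope: float    # pitch slope (34 / 44)
    duration_ms: float    # duration (35 / 45), unit assumed
    avg_power: float      # average power (36 / 46)


# Segment parameter table 231: rows grouped by phoneme unit (values invented).
segment_table = {
    "ka": [
        ProsodyRow(741, "#kakaeru#", 120.0, -0.5, 150.0, 60.0),
        ProsodyRow(742, "#kaku#", 180.0, 0.2, 200.0, 55.0),
    ],
}

# Monosyllable parameter table 232: one row per monosyllable (values invented).
monosyllable_table = {
    "di": ProsodyRow(110, "di", 130.0, 0.0, 180.0, 58.0),
}

# Phoneme information management table 22: number of waveform segments per unit.
phoneme_info_table = {unit: len(rows) for unit, rows in segment_table.items()}


def segment_count(unit: str) -> int:
    """Return how many waveform segments exist for a phoneme unit (0 if none)."""
    return phoneme_info_table.get(unit, 0)
```

Keeping the count in a separate table, as the description does, lets the selector decide between the two files without scanning the segment parameter table itself.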

[0018] Next, the speech synthesis processing performed by the speech synthesizer configured as above is described. An input character string consisting of phoneme symbols and accent symbols is entered at the input terminal 11 and then divided into phoneme units by the preprocessing unit 12. In this embodiment, the phoneme units include not only units such as /a/ and /ka/ but also, for the sake of smooth segment connection, vowel sequences such as /ai/ and compound syllables such as /aN/. The selection reference parameter setting unit 13 sets the prosody parameters that serve as the criterion for selecting the optimum waveform segment (optimum segment) for each divided phoneme unit, that is, the selection reference parameters. This embodiment uses all of the average pitch frequency, pitch slope, duration, and average power mentioned above as selection reference parameters, but the set may be changed as appropriate to suit the prosodic conditions of the synthesized speech.

[0019] The segment selection unit 14 selects the optimum segment for each phoneme unit using the selection reference parameters and the prosody parameters of the waveform segments in the segment file 241 or of the monosyllables in the monosyllable file 242. The detailed operation of the segment selection unit 14 is described with reference to FIG. 5. FIG. 5 is a flowchart showing the operating principle of the segment selection unit 14 of the present invention, in which S denotes a processing step.

[0020] First, the number of waveform segments for each phoneme unit is obtained by referring to the phoneme information management table 22 (S31). The selection processing is then determined from that number as follows (S32). If the number of waveform segments is 0, that is, if none exists, the monosyllable corresponding to the phoneme unit is regarded as the optimum segment and selected (S331). For example, if no waveform segment exists for the phoneme unit /di/, the monosyllable parameter table 232 is searched and the monosyllable di with file number 110 (see FIG. 4) is automatically regarded as the optimum segment. If the number of waveform segments is not 0, the optimum segment is selected using the selection reference parameters and the prosody parameters of the waveform segments (S332). If there is exactly one waveform segment, it is taken as the optimum segment. If there are two or more, any selection method may be used; however, calculating the squared error between the selection reference parameters and each set of prosody parameters and taking the segment with the smallest squared error as the optimum segment has the advantage of yielding a segment whose prosody parameters are closer to the selection reference parameters. The file corresponding to the optimum segment selected in this way and its prosody parameters are output to the segment transformation unit (S34).
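Steps S31 to S34 can be sketched in Python as follows. This is an illustrative reading of the flowchart under stated assumptions, not the patent's implementation: the function names, the tuple layout (average pitch, pitch slope, duration, average power), the variation widths, and all prosody values are hypothetical, and the error measure is the width-normalized sum of squares that the description names as one preferred option.

```python
def squared_error(target, candidate, widths):
    """Sum of squared differences, each scaled by that parameter's variation width."""
    return sum(((t - c) / w) ** 2 for t, c, w in zip(target, candidate, widths))


def select_optimum(unit, target, counts, segments, monosyllables, widths):
    """Return (kind, file_number, prosody) for one phoneme unit."""
    count = counts.get(unit, 0)                  # S31: consult table 22
    if count == 0:                               # S32 -> S331: no waveform segment
        file_number, prosody = monosyllables[unit]
        return "monosyllable", file_number, prosody
    # S32 -> S332: with one candidate, min() simply returns it.
    file_number, prosody = min(
        segments[unit], key=lambda row: squared_error(target, row[1], widths)
    )
    return "segment", file_number, prosody       # S34: hand over to transformation


# Hypothetical tables: each entry is (file_number, prosody_tuple).
segments = {"ka": [(741, (120.0, -0.5, 150.0, 60.0)), (742, (180.0, 0.2, 200.0, 55.0))]}
monosyllables = {"di": (110, (130.0, 0.0, 180.0, 58.0))}
counts = {"ka": 2}
widths = (100.0, 2.0, 200.0, 40.0)
target = (125.0, -0.4, 160.0, 58.0)

print(select_optimum("ka", target, counts, segments, monosyllables, widths)[:2])
print(select_optimum("di", target, counts, segments, monosyllables, widths)[:2])
```

With these invented values, /ka/ resolves to the waveform segment in file 741 and /di/ falls back to the monosyllable in file 110. The sketch raises `KeyError` if a unit has neither a waveform segment nor a monosyllable; the description does not say how that case is handled.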

[0021] For example, if the waveform segment with file number 0741 in FIG. 3 is chosen as the optimum segment, the file corresponding to that file number is extracted from the segment file 241 and output together with its prosody parameters 33 to 36. Similarly, if the monosyllable di with file number 110 in FIG. 4 is regarded as the optimum segment, the file corresponding to that file number is extracted from the monosyllable file 242 and output together with its prosody parameters 43 to 46.

[0022] The segment transformation unit 16 modifies the prosody parameters of the file received from the segment selection unit 14 so that they approach the selection reference parameters. When the file contains a waveform segment, a conventional general method can be used for this modification. When it contains a monosyllable, the modification consists, for example, of thinning out or interpolating so that the duration of the monosyllable approaches the duration given by the corresponding selection reference parameter. The other prosody parameters can likewise be modified by pitch control, frequency control, and the like. The prosody parameters of monosyllables are modified in this way because simply extracting the relevant monosyllables from the monosyllable file 242 and connecting them does not improve the naturalness of the synthesized speech, and also because it makes the monosyllable a more complete substitute for a waveform segment.
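The duration adjustment by thinning or interpolation mentioned above might look like the following Python sketch. It is a deliberately crude illustration: the description does not say whether the operation works on samples, pitch periods, or analysis frames, so the function `fit_length` (a hypothetical name) treats the unit as a generic sequence of frames, drops frames to shorten it, and repeats frames as a simple stand-in for interpolation when lengthening it.

```python
def fit_length(frames, target_len):
    """Resize a frame sequence to target_len frames.

    Shortening drops evenly spaced frames (thinning); lengthening repeats
    frames, which is a nearest-neighbour substitute for true interpolation.
    """
    n = len(frames)
    if n == 0 or target_len <= 0:
        return []
    # Map each output position back to an evenly spaced source index.
    return [frames[i * n // target_len] for i in range(target_len)]


shortened = fit_length([0, 1, 2, 3, 4, 5], 3)   # keeps every second frame
lengthened = fit_length([0, 1, 2], 6)           # repeats each frame twice
```

Applied to raw waveform samples this would also shift the pitch, so a practical system would more plausibly operate on pitch periods or frames; that choice is an assumption of this sketch, not something the source states.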

[0023] The segment connection unit 17 generates synthesized speech by joining the waveform segments or monosyllables modified by the segment transformation unit 16, phoneme unit by phoneme unit, in accordance with the input character string. Through this series of processes, natural synthesized speech corresponding to the input character string, with no gaps, is output at the output terminal 18.

[0024] Thus, according to this embodiment, even when no waveform segment exists for a divided phoneme unit, speech can still be synthesized by selecting the appropriate corresponding monosyllable. Unlike the conventional approach, there is therefore no need to prepare waveform segments for every phoneme unit in advance, and speech can be synthesized even from a small amount of natural speech data.

[0025]

[Effects of the Invention] As is clear from the above description, the speech synthesizer of the present invention uses two kinds of parameters, waveform segments and monosyllables, as composite synthesis parameters. When a waveform segment corresponding to a phoneme unit exists in the segment information storage means, it selects and extracts the waveform segment whose prosody parameters have the smallest error from the prosody parameters determined in advance as the segment selection criterion; when none exists, it selects and extracts the monosyllable corresponding to that phoneme unit from the monosyllable information storage means. The extracted waveform segments or monosyllables are then connected in the order of the input character string. Consequently, even when the vocabulary to be synthesized grows, natural synthesized speech with no gaps can be generated using only the natural speech data prepared in advance, that is, the waveform segments, the monosyllables, and their prosody parameters, without adding new waveform segments for the new vocabulary. Furthermore, because no additional natural speech data is needed, the file size can be kept small. This makes it possible to realize a simple, low-cost speech synthesizer in which synthesis processing time is shortened and high-quality synthesized speech is generated within a fixed time.

[0026] The speech synthesizer according to the other configuration of the present invention is provided with a segment transformation unit that modifies the prosody parameters of the waveform segment or monosyllable selected by the segment selection unit so that they approach the prosody parameters serving as the segment selection criterion, and generates synthesized speech by connecting the modified waveform segments and monosyllables. Synthesized speech with the desired prosody can therefore be generated at high speed.

[Brief description of drawings]

FIG. 1 is a functional block diagram of a speech synthesizer according to an embodiment of the present invention.

FIG. 2 is a configuration diagram of the segment file management unit in the speech synthesizer of this embodiment.

FIG. 3 is an explanatory diagram of the contents of the segment parameter table that forms part of the segment file management unit.

FIG. 4 is an explanatory diagram of the contents of the monosyllable parameter table that forms part of the segment file management unit.

FIG. 5 is a flowchart showing the operation of the segment selection unit in the speech synthesizer of this embodiment.

FIG. 6 is a functional block diagram of a conventional speech synthesizer.

[Explanation of symbols]

12: preprocessing unit
13: selection reference parameter setting unit
14: segment selection unit
15: segment file management unit
16: segment transformation unit
17: segment connection unit
22: phoneme information management table
231: segment parameter table
232: monosyllable parameter table
241: segment file
242: monosyllable file

Claims

1. A speech synthesizer comprising at least: a preprocessing unit that divides an input character string corresponding to speech into phoneme units; segment information storage means that stores a plurality of waveform segments cut out from natural speech together with the prosody parameters of each waveform segment; monosyllable information storage means that stores a plurality of monosyllables together with the prosody parameters of each monosyllable; segment selection means that searches whether a waveform segment corresponding to each divided phoneme unit exists in the segment information storage means and, when one exists, selects and extracts the waveform segment whose prosody parameters have the smallest error from prosody parameters determined in advance as a segment selection criterion and, when none exists, selects and extracts the monosyllable corresponding to that phoneme unit from the monosyllable information storage means; and a segment connection unit that connects the extracted waveform segments or monosyllables in the order of the input character string to generate synthesized speech.

2. The speech synthesizer according to claim 1, further comprising a segment transformation unit that modifies the prosody parameters of the waveform segment or monosyllable selected and extracted by the segment selection means so as to bring the error from the prosody parameters determined as the segment selection criterion close to zero, wherein the waveform segment or monosyllable with the modified prosody parameters is passed to the segment connection unit.

3. The speech synthesizer according to claim 1, wherein the error is the sum of the squares of the values obtained by dividing the difference between each prosody parameter determined as the selection criterion and the corresponding extracted prosody parameter by the variation width of that prosody parameter.