TW512308B

TW512308B - Real-time lip dynamic simulation method with the voice as the driving mechanism

Info

Publication number: TW512308B
Application number: TW90111865A
Authority: TW
Inventors: Shiue-Wu Wang
Original assignee: Inst Information Industry
Priority date: 2001-05-17
Filing date: 2001-05-17
Publication date: 2002-12-01

Abstract

The present invention provides a voice-driven method for real-time dynamic lip simulation, which uses a Gaussian mixture model and vector quantization as the basis for grouping the voice and the lip size. In the training stage, synchronized voice and lip data are obtained from video, and the voice is divided into consecutive, overlapping frames. Each frame is converted into a plurality of cepstrum parameters, and two parameters, width and height, are extracted from the lip region; together they compose one vector. The resulting series of vectors is grouped by vector quantization, a Gaussian mixture model is used to describe each group, and the best description is found with the expectation-maximization algorithm. In the stage of mapping voice to lip size, the voice is likewise divided into consecutive, overlapping frames, each frame is converted into a plurality of cepstrum parameters, and the probability of those parameters appearing in each group is computed. These probabilities and the lip size corresponding to each group are then combined by weighted average to obtain the lip size for each voice frame.

Description

[Field of the Invention]

The present invention relates to the technical field of mouth shape simulation, and in particular to a real-time dynamic simulation method of mouth shape with sound as the driving mechanism.

[Background of the Invention]

With the development of computer technology, matching variously styled mouth shapes to speech, in both 3D and 2D applications, has become an indispensable part of audiovisual entertainment such as today's movies and computer games. In these applications, however, the mouth shape of a character is generally matched to the sound by manual adjustment, and producing 30 seconds of mouth shapes by hand takes about 1.5 hours, which is extremely time-consuming and inefficient. Even where speech recognition is provided to determine the corresponding mouth shape, the sound is first converted into the corresponding text, and the mouth shape is then simulated according to the mouth size associated with that text. Such simulation is limited to a single language, for example pure Chinese or pure English, and cannot handle mixed Chinese and English. Animations or films produced with these conventional mouth shape simulation methods therefore usually consume a great deal of manpower and time, and improvement is needed.

In view of this, and in the spirit of active invention, the inventor sought a "real-time dynamic simulation method of mouth shape with sound as the driving mechanism" that can solve the above problems, and after repeated research and experiment completed this novel and progressive invention.

[Summary of the Invention]

The object of the present invention is to provide a real-time dynamic simulation method of mouth shape with sound as the driving mechanism, which achieves real-time synchronous dynamic simulation without requiring complex speech recognition technology and which breaks the single-language limitation.

To achieve the aforementioned object, the real-time dynamic simulation method of the present invention mainly comprises the following steps:

(A) dividing the sound of the input audio-visual information into a plurality of consecutive, overlapping frames;

(B) converting each frame into a plurality of cepstrum parameters and obtaining the two parameters, width and height, of the mouth shape within each frame, wherein the cepstrum parameters and the width and height parameters corresponding to each frame compose an audio-visual vector;

(C) using vector quantization to divide the audio-visual vectors into a plurality of groups, so that audio-visual vectors with similar energy and mouth size fall in the same group;

(D) using a Gaussian mixture model as the basis of representation for each group; and

(E) for each group, setting initial values according to the result of the vector quantization, and using the expectation-maximization algorithm to obtain the parameter values of the optimal Gaussian mixture model of each group, for use in simulating the mouth shape from sound.

Because the present invention is novel in design, is industrially applicable, and provides improved effects, a patent application is filed in accordance with the law.

So that the examiners may further understand the structure, features, and purpose of the present invention, the drawings and a detailed description of the preferred embodiment are provided below.

[Brief Description of the Drawings]

FIG. 1 is a flowchart of the training stage of the real-time dynamic simulation method of mouth shape with sound as the driving mechanism according to the present invention.

FIG. 2 is a schematic diagram of the composition of the training parameters to be obtained in the method of the present invention.

FIG. 3 is a flowchart of the simulation stage of the method of the present invention.

[Detailed Description of the Preferred Embodiment]

To explain the real-time dynamic simulation method of mouth shape with sound as the driving mechanism according to the present invention, refer first to FIG. 1, which shows a flowchart of the method in the training stage.
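As an illustration of steps (A) and (B) above, the following minimal Python sketch splits a sampled signal into consecutive, overlapping frames and computes a few cepstrum coefficients per frame. The function names, the frame length, and the hop size are hypothetical, and the real cepstrum (inverse DFT of the log magnitude spectrum) is only one possible choice, since the text does not specify which cepstrum variant is used.

```python
import cmath
import math


def split_frames(samples, frame_len, hop):
    """Split samples into consecutive frames of frame_len samples.

    Frames overlap whenever hop < frame_len. Trailing samples that do not
    fill a whole frame are dropped. Both sizes are illustrative assumptions.
    """
    if not 0 < hop <= frame_len:
        raise ValueError("hop must be between 1 and frame_len")
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


def real_cepstrum(frame, n_coeffs):
    """Return the first n_coeffs real-cepstrum coefficients of one frame.

    Uses a naive O(n^2) DFT so that the sketch needs no third-party library.
    """
    n = len(frame)
    spectrum = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
                for k in range(n)]
    # A small floor avoids log(0) on silent frames (illustrative choice).
    log_mag = [math.log(max(abs(x), 1e-12)) for x in spectrum]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n_coeffs)]
```

For example, `split_frames(samples, 16, 8)` yields frames that overlap by half, and `real_cepstrum(frame, 13)` gives 13 values that could be joined with the mouth width and height to form one audio-visual vector.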
In the training stage, a camera records a trainer reciting several passages of text designed in advance, in order to obtain the training parameters; refer also to FIG. 2, which shows the composition of the training parameters to be obtained. First, the sound of the input audio-visual information (Video & Audio) is divided into a plurality of consecutive, overlapping frames (step S11), and feature extraction converts each frame into a plurality of (for example, 13) cepstrum coefficients, denoted a (step S12). For each frame, a lip-tracking program obtains the two parameters, width and height, of the mouth shape within that frame, denoted v (step S13). For each frame, these 15 parameters compose an audio-visual feature vector (step S14) that represents the frame.

After a series of audio-visual vectors has been obtained, vector quantization divides them into N groups (step S15), so that audio-visual vectors with similar energy and mouth size fall in the same group; each group then has a converged center vector and covariance matrix. In step S16, a Gaussian mixture model (GMM) is used as the basis of representation for each group; that is, the GMM represents the probability distribution of the audio-visual vectors. A GMM is a weighted sum of K Gaussian functions, given by

$$p(o) = \sum_{i=1}^{K} w_i \, g[\mu_i, \Sigma_i](o)$$

where w_i is the mixture weight and g[μ_i, Σ_i](o) is a Gaussian function with mean μ_i and covariance matrix Σ_i:

$$g[\mu_i, \Sigma_i](o) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (o - \mu_i)^{T} \Sigma_i^{-1} (o - \mu_i) \right\}$$

with d the dimension of o.

In step S17, for each group i, the result of the vector quantization supplies the initial settings: the center vector is taken as the initial mean μ_i, the converged covariance matrix is taken as the covariance matrix Σ_i of group i, and the ratio of the number of audio-visual vectors in group i to the number of all audio-visual vectors is taken as the initial mixture weight w_i. Starting from these initial settings, the expectation-maximization algorithm is used to obtain the parameter values w_i, μ_i, and Σ_i of the optimal Gaussian mixture model of each group.

Referring to FIG. 3, in the simulation stage the voice of the subject is first divided into a plurality of consecutive, overlapping frames, and each frame is converted into a plurality of (for example, 13) cepstrum parameters (step S32), that is, the sound feature vector a. Step S33 then takes a weighted average, according to the probability of a appearing in each group, to obtain the current mouth size.
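The initial settings of step S17 and the weighted sum of Gaussian functions can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the groups are assumed to be already produced by vector quantization, each group's center is taken to be its mean, only diagonal covariances are used (a simplification of the covariance matrix in the text), and the expectation-maximization refinement itself is not shown.

```python
import math


def init_from_clusters(clusters):
    """Derive initial GMM settings from vector-quantization groups.

    clusters: list of groups, each a non-empty list of equal-length vectors.
    Returns (weights, means, variances) with one Gaussian per group.
    """
    total = sum(len(group) for group in clusters)
    weights, means, variances = [], [], []
    for group in clusters:
        dim = len(group[0])
        mean = [sum(x[d] for x in group) / len(group) for d in range(dim)]
        # Diagonal covariance only (assumption); the floor keeps it invertible.
        var = [max(sum((x[d] - mean[d]) ** 2 for x in group) / len(group), 1e-6)
               for d in range(dim)]
        weights.append(len(group) / total)  # share of all vectors in this group
        means.append(mean)
        variances.append(var)
    return weights, means, variances


def gaussian(x, mean, var):
    """Diagonal-covariance Gaussian density evaluated at x."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)


def gmm_density(x, weights, means, variances):
    """Weighted sum of the K Gaussian functions evaluated at x."""
    return sum(w * gaussian(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

The returned triple would serve as the starting point handed to an expectation-maximization routine; using one Gaussian per group corresponds to the case in which the number of groups equals the number of Gaussian functions.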

To speed up the solution, N may be set equal to K; that is, the number of groups in the vector quantization is set equal to the number of Gaussian functions used to represent the GMM. The mouth size is then solved by the following formula:

$$\hat{v} = \sum_{i=1}^{K} \frac{p_i(a)}{p(a)} \, v_i, \qquad p(a) = \sum_{i=1}^{K} p_i(a)$$

where p_i(a) is the probability of the sound feature vector a appearing in group i, v_i is the mouth size corresponding to group i, and \hat{v} is the resulting mouth size.

As the above description shows, the method of the present invention uses grouping, with the Gaussian mixture model and vector quantization as the basis for a unified grouping of sound and mouth shape. When a speaking voice is input, the mouth size corresponding to that sound is calculated from the probability of the sound falling in each group. With this mouth size, real-time dynamic simulation of sound and mouth shape can be performed for a styled mouth. Mouth shape simulation is therefore achieved without complex speech recognition, the single-language limitation is broken, and real-time synchronous dynamic simulation is achieved.

In summary, the present invention differs markedly from the prior art in its purpose, means, and effects, and represents a breakthrough in the design of mouth shape simulation; the examiners are respectfully requested to grant a patent. It should be noted that the above embodiments are given only as examples for convenience of explanation; the scope of rights claimed by the present invention shall be as stated in the claims and is not limited to the above embodiments.
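The weighted-average step of the simulation stage can be sketched as below. This is a hedged illustration only: the function and parameter names are hypothetical, each group is assumed to carry a diagonal Gaussian over the audio features together with one representative (width, height) pair, and the fallback for numerically vanishing probabilities is an added safeguard that the text does not describe.

```python
import math


def audio_likelihood(a, mean, var):
    """Diagonal Gaussian density of the audio features a within one group."""
    log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                for x, m, v in zip(a, mean, var))
    return math.exp(log_p)


def estimate_mouth_size(a, weights, audio_means, audio_vars, mouth_sizes):
    """Probability-weighted average of the per-group (width, height) pairs."""
    scores = [w * audio_likelihood(a, m, v)
              for w, m, v in zip(weights, audio_means, audio_vars)]
    total = sum(scores)
    if total == 0.0:
        # Every density underflowed: fall back to the mixture weights.
        scores, total = list(weights), sum(weights)
    posteriors = [s / total for s in scores]
    width = sum(p * size[0] for p, size in zip(posteriors, mouth_sizes))
    height = sum(p * size[1] for p, size in zip(posteriors, mouth_sizes))
    return width, height
```

Calling `estimate_mouth_size` once per incoming frame gives a (width, height) pair that can drive the displayed mouth; a frame whose features lie exactly midway between two equally weighted groups receives the midpoint of their mouth sizes.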


Claims

1. A real-time dynamic simulation method of mouth shape with sound as the driving mechanism, mainly comprising the following steps:
(A) dividing the sound of input audio-visual information into a plurality of consecutive, overlapping frames;
(B) converting each frame into a plurality of cepstrum parameters and obtaining the two parameters, width and height, of the mouth shape within each frame, wherein the cepstrum parameters and the width and height parameters corresponding to each frame compose an audio-visual vector;
(C) using vector quantization to divide the audio-visual vectors into a plurality of groups, so that audio-visual vectors with similar energy and mouth size fall in the same group;
(D) using a Gaussian mixture model as the basis of representation for each group; and
(E) for each group, setting initial values according to the result of the vector quantization, and using the expectation-maximization algorithm to obtain the parameter values of the optimal Gaussian mixture model of each group.

2. The method of claim 1, further comprising the following steps:
(F) dividing the voice of a subject into a plurality of consecutive, overlapping frames, and converting each frame into a sound feature vector having a plurality of cepstrum parameters; and
(G) taking a weighted average, according to the probability of the sound feature vector appearing in each group, to obtain the mouth size corresponding to the subject's voice.

3. The method of claim 1, wherein in step (B) feature analysis is used to convert each frame into a plurality of cepstrum parameters.

4. The method of claim 1, wherein in step (B) a lip-tracking program is used to obtain the width and height parameters of the mouth shape.

5. The method of claim 1, wherein each group has a center vector and a covariance matrix.

6. The method of claim 1, wherein in step (D) the Gaussian mixture model is used to represent the probability distribution of the audio-visual vectors.

7. The method of claim 6, wherein the Gaussian mixture model is a weighted sum of K Gaussian functions, expressed by the following formula:

$$p(o) = \sum_{i=1}^{K} w_i \, g[\mu_i, \Sigma_i](o)$$

where w_i is the mixture weight and g[μ_i, Σ_i](o) is a Gaussian function with mean μ_i and covariance matrix Σ_i, expressed as:

$$g[\mu_i, \Sigma_i](o) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (o - \mu_i)^{T} \Sigma_i^{-1} (o - \mu_i) \right\}$$

with d the dimension of o.

8. The method of claim 7, wherein in step (E), for each group i, the center vector is taken as the initial mean μ_i, the converged covariance matrix is taken as the covariance matrix Σ_i of group i, and the ratio of the number of audio-visual vectors in group i to the number of all audio-visual vectors is taken as the initial mixture weight w_i, these serving as the initial settings for obtaining the parameter values w_i, μ_i, and Σ_i of the optimal Gaussian mixture model of each group.

9. The method of claim 8, wherein the number of groups for vector quantization is set equal to the number of Gaussian functions used in the mixture model, and the mouth size is solved according to the following formula:

$$\hat{v} = \sum_{i=1}^{K} \frac{p_i(a)}{p(a)} \, v_i, \qquad p(a) = \sum_{i=1}^{K} p_i(a)$$

where a denotes the cepstrum parameters, v denotes the width and height parameters of the mouth shape, p_i(a) is the probability of a appearing in group i, v_i is the mouth size corresponding to group i, and \hat{v} is the mouth size.