512308 A7 B7

V. Description of the Invention

[Field of the Invention]

The present invention relates to the technical field of mouth shape simulation, and in particular to a real-time dynamic mouth shape simulation method that uses sound as the driving mechanism.

[Background of the Invention]

With the development of computer technology, matching the mouth shape of a character to its speech, in both 3D and 2D applications, has become indispensable in audiovisual entertainment such as today's movies and computer games. In these applications, however, the mouth shape is generally matched to the sound by manual adjustment, and producing 30 seconds of mouth shapes by hand takes about 1.5 hours, which is time-consuming and inefficient. Even where speech recognition is used to determine the corresponding mouth shape, the sound is first converted into the corresponding text, and the mouth shape is then simulated according to the mouth size associated with that text. This approach is restricted to a single language, for example pure Chinese or pure English, and cannot handle mixed Chinese and English. Animations or films produced with these conventional mouth shape simulation methods therefore usually consume a great deal of manpower and time, and improvement is needed.
In view of this, and in the spirit of active invention, the inventor sought a "sound-driven real-time dynamic mouth shape simulation method" that can solve the above problems, and after repeated research and experiment completed this novel and progressive invention.
[Summary of the Invention]

The object of the present invention is to provide a sound-driven real-time dynamic mouth shape simulation method that achieves real-time synchronized dynamic simulation without speech recognition technology and is not restricted to a single language.
To achieve the foregoing object, the sound-driven real-time dynamic mouth shape simulation method of the present invention mainly comprises the following steps: (A) dividing the sound of the input audio-visual information into a plurality of consecutive, overlapping frames;
(B) converting each frame into a plurality of cepstrum coefficients and obtaining two parameters, the width and the height of the mouth shape, for that frame; (C) forming an audio-visual vector from the cepstrum coefficients and the mouth width and height parameters of each frame, and dividing the audio-visual vectors into a plurality of groups by vector quantization, so that audio-visual vectors with similar energy and mouth size fall in the same group; (D) using a Gaussian mixture model as the basis for representing each group; and (E) for each group, setting initial values according to the result of the vector quantization and using the expectation-maximization algorithm to obtain the parameter values of the optimal Gaussian mixture model of each group, for use in simulating the input sound.

Because the present invention is novel in design, is industrially applicable, and provides improved effects, a patent application is filed in accordance with the law.

To enable the examiners to further understand the structure, features, and purpose of the present invention, the drawings and a detailed description of a preferred embodiment are provided as follows:

[Brief Description of the Drawings]

Fig. 1 is a flowchart of the training stage of the sound-driven real-time dynamic mouth shape simulation method of the present invention.
Fig. 2 is a schematic diagram of the combination of training parameters obtained by the sound-driven real-time dynamic mouth shape simulation method of the present invention.
Fig. 3 is a flowchart of the simulation stage of the sound-driven real-time dynamic mouth shape simulation method of the present invention.

[Detailed Description of the Preferred Embodiment]

To describe the sound-driven real-time dynamic mouth shape simulation method of the present invention, refer first to Fig. 1, which shows the flowchart of the method in the training stage.
In the training stage, a camera records a trainer reciting several passages of text designed in advance, in order to obtain the training parameters; refer also to Fig. 2, which shows the combination of training parameters to be obtained. First, the sound of the input audio-visual information (Video & Audio) is divided into a plurality of consecutive, overlapping frames (step S11). Feature extraction converts each frame into a plurality of (for example, 13) cepstrum coefficients, denoted a (step S12). For each frame, a lip-tracking program also provides two parameters, the width and the height of the mouth shape in that frame, denoted v (step S13). For each frame, these 15 parameters form an audio-visual feature vector (step S14), which represents the frame.

After a series of audio-visual vectors has been obtained, vector quantization divides them into N groups (step S15), so that audio-visual vectors with similar energy and mouth size fall in the same group. Each group has a converged center vector and covariance matrix.
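Steps S11 to S14 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the frame length, and the hop size are hypothetical, and the real cepstrum (inverse DFT of the log-magnitude spectrum, computed here with a naive DFT) is an assumption, since the text does not say which cepstral analysis is used.

```python
import cmath
import math


def split_frames(samples, frame_len, hop):
    """Step S11: consecutive frames; hop < frame_len makes them overlap."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def real_cepstrum(frame, n_coeffs=13):
    """Step S12 (assumed variant): first n_coeffs real-cepstrum coefficients."""
    n = len(frame)
    # Log-magnitude spectrum from a naive O(n^2) DFT.
    log_mag = []
    for k in range(n):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        log_mag.append(math.log(abs(s) + 1e-12))  # small offset avoids log(0)
    # The log-magnitude spectrum of a real signal is symmetric, so its
    # inverse DFT is real and reduces to a cosine sum.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(n_coeffs)]


def audio_visual_vector(frame, mouth_width, mouth_height):
    """Step S14: 13 audio parameters followed by 2 mouth parameters."""
    return real_cepstrum(frame) + [mouth_width, mouth_height]
```

The mouth width and height would come from the lip-tracking program of step S13, which is outside the scope of this sketch.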
Step S16 uses a Gaussian mixture model (GMM) as the basis for representing each group; that is, the GMM represents the probability distribution of the audio-visual vectors. A GMM is a weighted sum of K Gaussian functions, given by

$$p(x) = \sum_{i=1}^{K} w_i \, N(x; \mu_i, \Sigma_i)$$

where $w_i$ is the mixture weight and $N(x; \mu_i, \Sigma_i)$ is the Gaussian function with mean $\mu_i$ and covariance matrix $\Sigma_i$:

$$N(x; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right)$$

with $d$ the dimension of $x$.

In step S17, for each group i, the result of the vector quantization supplies the initial values: the center vector is taken as the initial mean $\mu_i$, the converged covariance matrix as the covariance matrix $\Sigma_i$ of group i, and the ratio of the number of audio-visual vectors in group i to the total number of audio-visual vectors as the initial mixture weight $w_i$. With these initial values, the expectation-maximization algorithm yields the parameter values $w_i$, $\mu_i$, and $\Sigma_i$ of the optimal Gaussian mixture model.

In the simulation stage, shown in Fig. 3, the voice of the subject is first divided into frames, and each frame is converted into a plurality of (for example, 13) cepstrum coefficients (step S32), that is, the audio feature vector a. Step S33 then takes a weighted average according to the probability that a occurs in each group, to obtain the current mouth size.
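The initialization of step S17 and one iteration of expectation-maximization can be sketched as below. The sketch makes two simplifying assumptions that are not in the text: covariances are diagonal (the text uses full covariance matrices), and every vector quantization group is non-empty. All function names are hypothetical.

```python
import math


def init_from_vq(vectors, labels, n_groups):
    """Step S17: initial weights, means, and diagonal variances per group."""
    dim = len(vectors[0])
    weights, means, variances = [], [], []
    for g in range(n_groups):
        members = [v for v, lab in zip(vectors, labels) if lab == g]
        # Mixture weight: share of all vectors that fall in this group.
        weights.append(len(members) / len(vectors))
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        var = [sum((m[d] - mean[d]) ** 2 for m in members) / len(members)
               + 1e-6 for d in range(dim)]  # floor keeps variances positive
        means.append(mean)
        variances.append(var)
    return weights, means, variances


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * s) + (xi - m) ** 2 / s
                      for xi, m, s in zip(x, mean, var))


def em_step(vectors, weights, means, variances):
    """One expectation-maximization iteration for a diagonal GMM."""
    k, dim = len(weights), len(vectors[0])
    # E-step: responsibility of each Gaussian for each vector.
    resp = []
    for x in vectors:
        logs = [math.log(weights[i]) + log_gauss(x, means[i], variances[i])
                for i in range(k)]
        top = max(logs)  # subtract the maximum for numerical stability
        probs = [math.exp(v - top) for v in logs]
        total = sum(probs)
        resp.append([p / total for p in probs])
    # M-step: re-estimate the parameters from the responsibilities.
    new_w, new_m, new_v = [], [], []
    for i in range(k):
        n_i = sum(r[i] for r in resp)
        mean = [sum(r[i] * x[d] for r, x in zip(resp, vectors)) / n_i
                for d in range(dim)]
        var = [sum(r[i] * (x[d] - mean[d]) ** 2
                   for r, x in zip(resp, vectors)) / n_i + 1e-6
               for d in range(dim)]
        new_w.append(n_i / len(vectors))
        new_m.append(mean)
        new_v.append(var)
    return new_w, new_m, new_v
```

In practice `em_step` would be repeated until the parameters stop changing; starting from the vector quantization result gives the iteration a reasonable initial point.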
In addition, to speed up the computation, N can be set equal to K; that is, the number of vector quantization groups is set equal to the number of Gaussian functions used to represent the GMM. The mouth size is then obtained from

$$\hat{v} = \sum_{i=1}^{K} \frac{w_i \, N_i(a)}{P_a(a)} \, v_i$$

where $N_i(a)$ is the probability density of $a$ under the i-th Gaussian function, $P_a(a) = \sum_{i=1}^{K} w_i \, N_i(a)$, and $v_i = \int v \, N(x; \mu_i, \Sigma_i) \, dv$.

As can be seen from the above description, the method of the present invention uses grouping, combining the Gaussian mixture model with vector quantization to group sound and mouth size statistically. On the basis of these groups, when a speech sound is input, the mouth size corresponding to that sound can be computed from the probability that the sound falls in each group. With the computed mouth size, the mouth of a character can be animated in real time in synchronization with the sound. Mouth shape simulation is therefore achieved without complex speech recognition technology, the restriction to a single language is removed, and real-time synchronized dynamic simulation is attained.

In summary, the present invention differs markedly from the prior art in its object, means, and effect, and represents a breakthrough in the design of mouth shape simulation. The examiners are respectfully requested to examine it and grant a patent at an early date. It should be noted that the embodiments above are given only for convenience of description; the scope of rights claimed is defined by the claims and is not limited to the above embodiments.
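The weighted average of step S33 can be sketched as follows. This is an illustration under stated assumptions rather than the patent's exact computation: covariances are diagonal, so the mouth size contributed by each Gaussian is simply the visual part of its mean, and the function name and the `n_audio` parameter (the number of audio components at the front of each mean vector) are hypothetical.

```python
import math


def estimate_mouth(a, weights, means, variances, n_audio):
    """Average the per-group mouth means, weighted by P(group | a)."""
    # Log of w_i times the density of a under the audio part of Gaussian i.
    logs = []
    for w, mean, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, s in zip(a, mean[:n_audio], var[:n_audio]):
            ll += -0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
        logs.append(ll)
    # Normalize to posterior probabilities; dividing by the sum plays the
    # role of the total probability of a in the denominator.
    top = max(logs)
    post = [math.exp(v - top) for v in logs]
    total = sum(post)
    post = [p / total for p in post]
    # Weighted average of the visual components (width, height) of the means.
    n_visual = len(means[0]) - n_audio
    return [sum(p * mean[n_audio + d] for p, mean in zip(post, means))
            for d in range(n_visual)]
```

Because only the audio vector is needed at simulation time, the estimate can be computed frame by frame as the sound arrives, which is what makes the real-time use described above plausible.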