RS49875B

RS49875B - SYSTEM AND METHOD FOR HANDS-FREE VOICE COMMUNICATION USING A MICROPHONE ARRAY

Info

Publication number: RS49875B
Application number: RSP-2006/0551A
Authority: RS
Inventors: dr. Zoran Šarić; dr. Slobodan Jovičić; dr. Vladimir Kovačević; dr. Nikola Teslić; dr. Dragan Kukolj
Original assignee: Micronasnit,
Priority date: 2006-10-04
Filing date: 2006-10-04
Publication date: 2008-08-07
Also published as: WO2008041878A2; WO2008041878A3; RS20060551A

Abstract

A system for hands-free voice communication using a microphone array, comprising a digital TV receiver that enables full-duplex audio and video communication, characterized in that the digital TV receiver (100) has stereo audio playback (102) for reproducing stereo TV programs and the mono incoming voice signal in videophone communication; has a built-in movable video camera (104) for recording the speaker in the room; and reproduces, on a portion of its screen, an image of the interlocutor at the far end (105); and comprising a microphone system (103) built into the TV receiver (100), whose purpose is to record the speech of the near-end speaker as well as other ambient sounds, to locate the speaker in the room, and to control the video camera (104).

Description

TECHNICAL FIELD TO WHICH THE INVENTION RELATES

The invention belongs to the field of acoustic signal processing, and more specifically to methods of acoustic echo cancellation, spatial selection and localization of a speaker in a reverberant acoustic environment, and noise suppression using a microphone array.

TECHNICAL PROBLEM

Hands-free, full-duplex voice communication systems are used in many applications, such as videophone systems, teleconferencing systems, speakerphones in rooms or cars, and human-computer voice interaction. Hands-free voice communication implies that the speaker is located in an acoustic environment at some distance from the interface elements of the system, namely the microphone and the loudspeaker. These conditions give rise to several technical problems that must be solved to keep the quality of communication at an acceptable level.

The main problem is acoustic echo, which arises when part of the acoustic energy from the loudspeaker is transferred to the microphone, so that the interlocutor at the far end hears their own voice as a disturbance. Conventionally, the echo is cancelled by an adaptive filter that estimates the transfer function of the acoustic path between the loudspeaker and the microphone, so that the filter output approximates the acoustic echo signal. Subtracting the filter output from the microphone signal cancels the acoustic echo. However, echo cancellation cannot be ideal because of system nonlinearities and the non-stationarity of the acoustic environment, so a residual echo signal remains. At the same time, the basic requirement remains that the near-end speech signal must not be distorted by the echo suppression procedure.
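The conventional adaptive-filter approach described above can be illustrated with a minimal sketch. It assumes a normalized LMS (NLMS) update in the time domain, a single loudspeaker and a single microphone, and a short hypothetical echo path; the function names, the filter length and the step size are illustrative choices and are not taken from the patent.

```python
import random

def nlms_echo_canceller(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Return (residual, weights) for a single-channel NLMS echo canceller."""
    w = [0.0] * taps            # adaptive estimate of the echo path
    x = [0.0] * taps            # most recent far-end samples, newest first
    residual = []
    for far, d in zip(far_end, mic):
        x = [far] + x[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x))
        e = d - echo_estimate   # microphone signal minus estimated echo
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual, w

def simulate(n=4000, seed=0):
    """Build a far-end signal and the echo it produces (hypothetical path)."""
    rng = random.Random(seed)
    far_end = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    path = [0.6, -0.3, 0.1]     # assumed loudspeaker-to-microphone response
    mic = [sum(h * far_end[i - k] for k, h in enumerate(path) if i - k >= 0)
           for i in range(n)]
    return far_end, mic, path
```

In this idealized setting (linear, stationary path, no near-end speech) the weights converge to the assumed path and the residual vanishes; the nonlinearities and non-stationarity mentioned above are exactly what prevents this in practice.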

Acoustic disturbances in the environment can differ in nature and cause. They can be stationary or non-stationary (for example, computer noise or noise in a car) and can originate from multiple sources located at different positions in the space occupied by the speaker. In addition, enclosed spaces (offices, halls, car cabins) exhibit reverberation, which manifests itself as a diffuse disturbance. Since the speaker is usually located in such an environment, the speaker must be separated from the other sources of interference so that only the speaker is recorded. Conventionally, this problem is solved with a microphone array consisting of several microphones placed at a small distance from one another. A suitable configuration of microphones yields a system with a directional sensitivity characteristic. Such a microphone system has a sufficiently narrow directivity pattern that, within the room, it can record only the selected speaker while suppressing interference sources at other positions, thereby achieving a gain in the ratio of the selected speaker to the interference. The size of this gain depends on the directivity pattern of the microphone array (the width of the main lobe), the level of the side lobes, the separability of the speaker and the interference sources (they must not be too close together), the amount of reverberation, the non-stationarity of all signal sources, and so on.
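The directional sensitivity of a microphone array can be illustrated with a minimal sketch of a fixed delay-and-sum beamformer. It assumes a uniform linear array, a far-field plane wave, angles measured from broadside, and illustrative values for the number of microphones, spacing and frequency; none of these values come from the patent.

```python
import cmath
import math

def array_response(theta_deg, steer_deg, n_mics=8, spacing_m=0.04,
                   freq_hz=2000.0, c=343.0):
    """Normalized magnitude response of a delay-and-sum uniform linear array.

    theta_deg is the arrival angle of a plane wave, steer_deg the look
    direction; both are measured from broadside. The result is 1.0 in the
    look direction and smaller elsewhere.
    """
    k = 2.0 * math.pi * freq_hz / c          # wavenumber
    theta = math.radians(theta_deg)
    steer = math.radians(steer_deg)
    total = 0j
    for m in range(n_mics):
        # Phase left on microphone m after the steering delay is applied.
        phase = k * m * spacing_m * (math.sin(theta) - math.sin(steer))
        total += cmath.exp(1j * phase)
    return abs(total) / n_mics
```

Evaluating `array_response` over a range of arrival angles traces the main lobe and the side lobes; widening the array or raising the frequency narrows the main lobe, which is the dependence on directivity described above.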

Determining the direction in which the selected speaker is located and steering the directivity pattern of the microphone array toward that speaker is an important problem in hands-free communication systems. Direction-finding procedures are very sensitive to all disturbances present in the environment, and in particular to the non-stationarity of the selected speaker (when the speaker moves around the room) and to the presence of several speakers talking at the same time (the cocktail-party effect). Determining the direction of the current speaker relative to the microphone array in the horizontal plane is very important in videophone and teleconferencing systems, because it provides the coordinates needed to control the video camera.

Recording speech in an acoustic environment always involves the problem of additive stationary and/or non-stationary noise, as well as residual noise introduced by the acoustic signal processing. These noises degrade the quality of the recorded speech signal and, if sufficiently intense, can impair its intelligibility. Many noise suppression algorithms exist, each optimized for particular types of noise, but the requirement is always to improve the signal-to-noise ratio without introducing distortions into the speech signal that would further impair its intelligibility.

Variable ambient conditions, and in particular the variable distance between the speaker and the microphone array, require automatic gain control so that the speaker's voice level is as stable and pleasant as possible for the listener at the far end of the telecommunication channel. Automatic gain control in full-duplex systems requires additional information from the near-end voice activity detector, the far-end voice activity detector, and the acoustic echo suppressor.

It follows from the above that the technical problems in building a hands-free, full-duplex voice communication system and applying it in videophone and/or teleconferencing systems are very complex and require an integrated approach to optimizing the solution, especially considering that the system must operate in real time on a commercial digital signal processor (DSP) platform.

STATE OF THE ART

High-quality speech recording in the presence of acoustic interference and room reverberation is a complex problem. When the spectrum of the useful speech signal overlaps with the spectra of the interference, single-channel processing methods cannot achieve a significant improvement in speech quality. The development of digital signal processing and the availability of sufficient DSP computing power have opened the way for multi-microphone acoustic signal processing. The advantage of microphone arrays over single-channel processing is their ability to adapt their spatial reception characteristic (directivity pattern) to the current spatial arrangement of the selected speaker and the interference sources, achieving maximum suppression of the interference while emphasizing the selected speaker. The basic problems encountered in applying microphone arrays (M.S. Brandstein, D.B. Ward (Eds.), Microphone Arrays: Signal Processing Techniques and Applications, Springer, Berlin, 2001; Y. Huang, J. Benesty, Audio Signal Processing for Next-Generation Multimedia Communication Systems, Kluwer Academic Publishers, 2004) are: the exact location of the selected speaker is unknown; the number and spatial arrangement of the interference sources are unknown; the useful source and the interference undergo multiple reflections off the walls of the room; and both the interference sources and the selected speaker are non-stationary.

When a microphone array is used in videophone or teleconferencing systems operating in full duplex, the number of problems increases. The biggest problem is acoustic echo, followed by the need for automatic gain control (AGC) in the transmitting part of the system and the possible onset of system instability, known as howling (acoustic feedback). An additional problem considered in this patent is the TV program signal, which appears as an additive acoustic echo at the input of the microphone array.

The large number of problems listed above has produced a wide variety of patented solutions that address either individual problems or several problems jointly. For example: U.S. published patent application 2006/0153360 A1, filed on September 2, 2005, entitled "Speech signal processing with combined noise reduction and echo compensation", provides an integrated echo suppressor and noise suppressor; U.S. patent 7,035,415 B2, filed on May 15, 2001, entitled "Method and device for acoustic echo cancellation combined with adaptive beamforming", provides an integrated echo suppressor and a solution for forming the directional characteristic of a microphone array; EP published patent application 1 633 121 A1, filed on September 3, 2004, entitled "Speech signal processing with combined adaptive noise reduction and adaptive echo compensation", provides an integrated residual echo suppressor and noise suppressor; EP published patent application 1 571 875 A2, filed on February 23, 2005, entitled "A system and method for beamforming using a microphone array", addresses only the formation of the directional characteristic of a microphone array; EP published patent application 1 581 026 A1, filed on March 17, 2004, entitled "Method for detecting and reducing noise from a microphone array", addresses only noise suppression in a microphone array; and EP published patent application 1 286 175 A2, filed on August 1, 2002, entitled "Robust talker localization in reverberant environment", addresses only speaker localization in a reverberant room.

The integrated solution to all of these problems, presented in this patent, combines the strengths of the individual signal processing methods for each problem, solves them jointly in the frequency domain so as to make efficient use of computing resources, and provides real-time, high-quality hands-free voice communication in videophone and/or teleconferencing systems.

DISCLOSURE OF THE ESSENCE OF THE INVENTION

The subject of this invention is a system for hands-free voice communication in videophone or teleconferencing applications that uses a microphone array and complex acoustic signal processing to ensure the quality and intelligibility of the speech signal in a complex acoustic environment, and in which many of the shortcomings listed above are eliminated individually or jointly.

The system that is the subject of the invention transmits speech, using digital television as the transmission medium. A microphone array and loudspeakers, which are integral parts of the TV receiver, are used to record and reproduce the speech signal, respectively. Since the applications are videophone or teleconferencing, a digital camera and the digital TV receiver are used to record and reproduce the image, respectively.

The essence of the invention lies in the specific processing of the speech signal recorded in the acoustic environment of the room containing the system and the speaker. To record a speaker located at some distance (up to several meters) from the TV receiver, the system uses a microphone array of N microphones. The microphone array picks up all signals in the room: the useful signal, which is the direct wave travelling from the speaker to the microphones, and interference signals of various kinds. The interference signals are: acoustic echo as a direct sound wave from the loudspeakers reproducing the voice of the interlocutor at the far end of the communication channel; acoustic echo as a direct sound wave from the loudspeakers reproducing the stereo TV program; direct waves from one or more noise sources or other sources of interference in the room; and all reflected waves (room echo) originating from all sound sources, including the speaker, as a result of room reverberation. It should be emphasized that the sound sources in the room can be stationary or, as is most often the case, non-stationary, both in their characteristics and in their location in the room (moving sound sources).

Different kinds of interference require different techniques for their elimination, and the essence of the invention lies in the optimal design of algorithms that eliminate the interference as far as possible and ensure the best quality of the speech signal transmitted to the interlocutor at the far end of the communication channel.

The signals from the microphone array are processed in digital form in the DSP, entirely in the frequency domain. This domain offers advantages in processing speed and in the number of arithmetic operations, which is very important for a DSP operating in real time. To suppress the acoustic echo, the loudspeaker signals must also be fed into the DSP.

Several complex algorithms run on the DSP: an algorithm for suppressing the acoustic echo signal (AEC - Acoustic Echo Cancelling); an algorithm for processing the microphone signals to form an adaptive directivity pattern of the microphone array (ABF - Adaptive Beam Forming); an algorithm for estimating the direction of arrival of the useful signal (DOA - Direction of Arrival), that is, for locating the speaker in the room; an algorithm for suppressing stationary and non-stationary noise and residual echo (NR - Noise Reduction); and an algorithm for automatic gain control (AGC - Automatic Gain Control) to compensate for the varying distance between the speaker and the microphone array. In addition to these basic algorithms, several others run on the DSP, such as a voice activity detector (VAD - Voice Activated Detector) at the near end, a VAD at the far end, a detector of simultaneous speech activity at both ends (DTD - Double Talk Detector), additional filtering for noise reduction (PF - Post Filtering), etc. The goal of all these algorithms is maximum reduction of all interference with minimum degradation of the speech signal, thereby ensuring maximum quality of the transmitted speech signal.

A specific aspect of the invention is the adaptive suppression of acoustic echo by adaptive filters that model the transfer characteristic of the acoustic path from the loudspeakers to the microphones. The transfer characteristic is complex because it involves paths from 2 (stereo) loudspeakers to N microphones in the microphone array, which is why each microphone signal is filtered by its own adaptive filter. The operation of the adaptive filters is controlled by the detector of speech activity at both ends.
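One common way to let speech activity control an adaptive echo filter is to freeze adaptation during double talk. The sketch below assumes the classic Geigel detector as a stand-in; the patent text does not say which detector is used, and the window length, the threshold and the function names are illustrative assumptions.

```python
def geigel_double_talk(far_end, mic, window=64, threshold=0.5):
    """Per-sample double-talk flags using the Geigel comparison.

    Double talk is declared when the microphone sample exceeds `threshold`
    times the largest far-end magnitude seen in the last `window` samples,
    i.e. when the microphone is louder than echo alone would explain.
    """
    flags = []
    for n, d in enumerate(mic):
        start = max(0, n - window + 1)
        recent_peak = max(abs(v) for v in far_end[start:n + 1])
        flags.append(abs(d) > threshold * recent_peak)
    return flags

def adaptation_allowed(far_active, double_talk):
    """Adapt the echo filter only while the far end alone is talking."""
    return far_active and not double_talk
```

With N microphones, the same gating decision would typically be applied to each of the N per-microphone filters, so that near-end speech does not pull any of them away from the echo path.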

A further specific feature of the invention is the adaptive directivity pattern of the microphone array, which enables spatial filtering, that is, the selection of the direction in which the speaker is located, so that the useful signal is maximally amplified relative to signals from other directions, which are attenuated. The directional characteristic of the microphone array is obtained by adaptive weighting and summation of the microphone signals, which provides a stable directivity index in the frequency domain and greater robustness of the hands-free voice communication system in a reverberant acoustic environment.

Determining the direction of arrival of the direct acoustic wave from the speaker is a further specific feature of the invention. In a hands-free voice communication system this function is needed to steer the directional characteristic of the microphone array in azimuth, and it can also be used to control the video camera. It operates on the microphone signals after acoustic echo suppression. After the generalized cross-correlation of the microphone signals and their phase transforms are determined, the direction of arrival of the speaker's direct acoustic wave is estimated. This function is under the direct control of the voice activity detector.
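Direction-of-arrival estimation from a generalized cross-correlation with phase transform (GCC-PHAT) can be sketched for a single microphone pair as follows. This is a minimal illustration that assumes a far-field source, a direct DFT for clarity rather than speed, and hypothetical function names; the patent does not specify these details.

```python
import cmath
import math

def dft(x):
    """Direct O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def gcc_phat_delay(sig_a, sig_b, max_lag):
    """Lag in samples by which sig_b trails sig_a, estimated with GCC-PHAT."""
    n = len(sig_a)
    spec_a, spec_b = dft(sig_a), dft(sig_b)
    whitened = []
    for a, b in zip(spec_a, spec_b):
        cross = b * a.conjugate()
        mag = abs(cross)
        # Phase transform: keep only the phase of the cross-spectrum.
        whitened.append(cross / mag if mag > 1e-12 else 0j)
    best_lag, best_val = 0, -float("inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(whitened[k] * cmath.exp(2j * math.pi * k * lag / n)
                  for k in range(n)).real
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def lag_to_azimuth_deg(lag, fs_hz, spacing_m, c=343.0):
    """Convert an inter-microphone lag to an angle from broadside."""
    s = c * lag / (fs_hz * spacing_m)
    s = max(-1.0, min(1.0, s))   # guard against lags beyond the physical limit
    return math.degrees(math.asin(s))
```

A positive lag means the wave reaches the first microphone before the second. In a reverberant room the whitening step sharpens the correlation peak of the direct wave, which is the usual reason for choosing the phase transform.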

A further feature of the invention is the procedure for adaptive suppression of stationary and non-stationary noise. The procedure is based on a non-linear compressor of the estimated noise, which is determined in several sub-bands. Two noise estimates are used, giving a suppression result optimized for the characteristics of the speech signal. This is done because adaptive noise suppression must not degrade the speech signal. The filtering is completed by an adaptive Wiener post-filter.

Another aspect of the invention is automatic gain control of the speech signal before it is transmitted to the remote interlocutor. This is an important component of the hands-free speech communication system. The system compensates for differences in speech level that stem from the individual characteristics of the speaker and from the speaker's distance from the microphone array. The solution distinguishes whether the speaker is active or whether the useful signal contains a pause, residual echo, an acoustic disturbance, or the far-end speech signal; for this purpose it uses several pieces of information already detected elsewhere in the system. The analysis of the possible scenarios must be reliable, because otherwise the useful speech signal may be undesirably attenuated.

The inventiveness lies in the improvement of each of the features listed above, and also in the integration of all the algorithms into a single whole that operates stably and with good quality. The algorithmic procedures are optimized by sharing common resources.

These and other aspects, features and benefits of the invention will become more apparent from the detailed description of the invention, the patent claims and the accompanying drawings.

BRIEF DESCRIPTION OF THE FIGURES AND DRAWINGS

Figure 1 - shows the elements of the system for hands-free video-telephone communication using a microphone array and digital television.

Figure 2 - shows the ambient conditions in which the system for hands-free video-telephone communication using a microphone array is used.

Figure 3 - shows a block diagram of the audio signal processing subsystem within the system for hands-free video-telephone communication; it contains a microphone array with an adaptive directivity pattern (SD-BF), a speaker localization block (DOA), an echo cancellation block (AEC), a noise reduction block (NR) and an automatic gain control block (AGC).

Figure 4 - shows a block diagram of acoustic echo cancellation (AEC).

Figure 5 - shows a block diagram of the adaptive determination of the direction of the near-end speaker in the horizontal plane (DOA-azimuth).

Figure 6 - shows a block diagram of spatial filtering (SD-BF).

Figure 7 - shows a block diagram of noise reduction (NR).

Figure 8 - shows a block diagram of automatic gain control (AGC).

DETAILED DESCRIPTION OF THE INVENTION

This invention describes a system and a method of acoustic signal processing for hands-free speech communication using a microphone array.

Figure 1 shows the elements of the system for hands-free video-telephone communication using a microphone array and digital television. The digital television set 100, which the user normally uses to watch TV programs, serves in this system as a video monitor for video communication with the interlocutor and as an audio terminal for audio communication. When a call is received over the communication channel 101 and a connection with the interlocutor is established, the television set 100 is used as a multimedia interface: the interlocutor is heard through the loudspeakers 102, and the interlocutor's image is displayed on part 105 of the screen of the television set 100. At the same time, at the far end of the communication channel, the interlocutor sees on a similar TV receiver the near-end speaker, who is captured by the camera 104 and the microphone array 103. The camera 104 is movable and is steered using coordinates obtained by processing the microphone signals from the microphone array 103.

The analog signals from the microphones of the microphone array 103 are amplified by the amplifier 106 and, together with the stereo signals of the loudspeakers 102, fed to the acquisition module 107, where they are digitized and passed to the DSP 108 for further processing. The near-end speech signal processed by the DSP 108 is transmitted over the communication channel 101 to the interlocutor at the far end. Processing the acoustic signals in the DSP 108 also yields the spatial coordinates of the speaker in the room in which the system is located; the DSP 108 uses them to steer the movable camera 104 toward the speaker. In this way fully hands-free audio and video communication between two interlocutors is achieved over the digital television system.

Figure 2 schematically shows the ambient conditions in which the system for hands-free video-telephone communication using a microphone array is used; only the part of the system concerned with acoustic signal processing is shown. Room 201 contains the system for hands-free video-telephone communication, the speaker 202 and a noise source 203, as is usual in any acoustic environment. Through the loudspeakers 102 of the stereo audio system of the digital television, the speaker 202 listens to the incoming speech signal 204 of the far-end interlocutor, usually as a mono signal. The sound in room 201 is captured by the microphone array 103, which consists of N microphones. After complex processing of the microphone signals in block 207, the speech signal of the speaker 202 is transmitted through block 208 to the remote interlocutor as a mono signal.

The ambient conditions for speech communication in room 201 are very complex. During hands-free video-telephone communication there are at least three sound sources in room 201: the stereo loudspeakers 102, which reproduce the speech of the remote interlocutor and the TV program, the speaker 202, and at least one noise source 203. There may be several noise sources in the room: computer noise, air-conditioning noise, street noise entering through the windows, noise from neighboring rooms, building vibrations, or another speaker, several speakers, a music source, and so on. The acoustic scene in the room is therefore very complex. The microphone array 103, acting as a sensor system, picks up all sounds in the room: the direct sound waves from each source as well as all reflections from the walls and from other objects in the room. For example, the loudspeaker 102 reaches the microphone array 103 with a direct wave 209 and many reflected waves, of which only one, 210, is shown in Figure 2; the speaker 202 produces a direct wave 211 and, among others, two reflected waves 212a and 212b; the noise source 203 produces a direct wave 213 and, among others, a reflected wave 214.

Of all the sounds picked up by the microphone array, only the direct wave 211 from the speaker 202 is the useful signal; all the others are interference. The strongest interference is the acoustic echo 209 coming from the loudspeaker 102. All other reflections together make up the reverberation of the room. The task of the audio signal processing block 207 is to suppress the acoustic echo, to separate the useful signal 211 from all other interference, to suppress the reverberation, and to suppress the direct signals of the interference sources, of which there may be more than one. A further task of block 207 is to track adaptively the non-stationarity of the acoustic scene in the room, whether the speaker moves, or occupies a different position from one conversation to the next, or the noise sources move, are non-stationary or change their characteristics. The solutions applied in this invention are described individually below.

Figure 3 shows a block diagram of the complete audio signal processing procedure within the system for hands-free video-telephone communication using a microphone array. All microphone signals 103, M1 to M5, and the stereo loudspeaker signals 102, Zv-L and Zv-D, are digitized in the acquisition block 107 (Figure 1) and converted to the frequency domain by the fast Fourier transform (FFT) 301, giving the signals $x_1$ to $x_7$. The microphone array in this embodiment contains 5 microphones, but a larger number can be used if a particular application requires it. In block 302 the acoustic echo is suppressed in all signals $x_1$ to $x_5$, using the signals $x_6$ and $x_7$ as references. The echo-suppressed signals $S_{AEC1}$ to $S_{AEC5}$ are used in block 304 to determine the direction of arrival (DOA) of the direct sound wave from the current speaker in the horizontal plane (azimuth $\theta_a$), which makes it possible to track the speaker in the room. Based on the estimated angle $\theta_a$, block 303 optimizes the weighting coefficients of the signals $x_1$ to $x_5$ so as to form a horizontal directivity pattern of the microphone array with maximum reception in the direction $\theta_a$. The reception pattern formed in block 303 is superdirective, meaning that its directivity index is higher than that of a pattern obtained by delay compensation and summation of the microphone signals alone.

Block 303 compensates the mutual time delays of the acoustic signals travelling from the speaker to the individual microphones. Controlling this delay with the DOA signal ($\theta_a$) from block 304 makes it possible to steer the directivity pattern of the microphone array in azimuth. Block 303 also forms the directivity pattern of the microphone array, SD-BF (Superdirective Beamformer). This pattern has a main lobe that is sufficiently narrow and pointed in the desired direction, while the side lobes are considerably weaker. This gives the microphone array the ability to filter spatially, that is, to separate sound sources in the horizontal plane. A directivity pattern formed in this way is important both for attenuating lateral interference relative to the useful signal and for reducing the effect of room reverberation. The directivity pattern is formed by weighting the microphone signals and summing them into a single output signal.
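The weighting-and-summation step can be illustrated with its simplest special case, a delay-and-sum beamformer for a single frequency bin. This is only a sketch of the baseline against which the superdirective pattern is compared, not the weight optimization of block 303. The uniform linear geometry, the 5 cm spacing, the speed of sound, the far-field assumption and all function names are assumptions made for the example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def steering_vector(theta_deg, freq_hz, n_mics=5, spacing_m=0.05):
    """Phase factors of a far-field plane wave arriving from azimuth theta
    (measured from broadside) at a uniform linear array (assumed geometry)."""
    theta = math.radians(theta_deg)
    vec = []
    for m in range(n_mics):
        delay = m * spacing_m * math.sin(theta) / SPEED_OF_SOUND
        vec.append(cmath.exp(-2j * math.pi * freq_hz * delay))
    return vec


def delay_and_sum(bin_values, theta_deg, freq_hz, spacing_m=0.05):
    """Compensate the inter-microphone delays for direction theta and average."""
    n = len(bin_values)
    d = steering_vector(theta_deg, freq_hz, n, spacing_m)
    return sum(x * dm.conjugate() for x, dm in zip(bin_values, d)) / n


def array_gain(look_deg, source_deg, freq_hz, n_mics=5, spacing_m=0.05):
    """Magnitude response toward source_deg when the array is steered to look_deg."""
    x = steering_vector(source_deg, freq_hz, n_mics, spacing_m)
    return abs(delay_and_sum(x, look_deg, freq_hz, spacing_m))
```

With these assumed values, a source in the look direction passes with unit gain, while a source well off axis is attenuated; a superdirective design would choose the weights so that this attenuation is larger than delay-and-sum alone can provide.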

The signal at the output of block 303 contains the useful speech signal and an interference signal consisting of the residual left after acoustic echo suppression, attenuated ambient noise and attenuated reverberation. This signal enters the NR (Noise Reduction) block 305, where the interference is suppressed further. The suppression process is adaptive because the interference is non-stationary. An important requirement on the NR block is that noise suppression must not affect the quality of the speech signal.

The final signal-processing block of the system for hands-free speech communication in video-telephone or teleconferencing applications is block 306, the automatic gain control (AGC) of the processed speech signal. This block uses several pieces of information from the whole system that are needed to identify the conditions in which the speech signal may occur and in which its amplitude has to be corrected accordingly. In this way an approximately constant level of the transmitted speech signal can be maintained regardless of the current speaker's distance from the microphone array, which improves its quality at the far end of the communication channel.
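The idea of a gain that adapts only under the right conditions can be sketched as follows. This is a minimal illustration, not the AGC of block 306: a single boolean flag stands in for the several pieces of system information mentioned above, and the target level, smoothing rate and gain limit are assumed values.

```python
import math


def agc_step(frame, gain, near_speech_active,
             target_rms=0.1, rate=0.1, max_gain=20.0):
    """Return (scaled_frame, new_gain). All parameters are illustrative."""
    if near_speech_active and frame:
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms > 1e-9:
            desired = min(target_rms / rms, max_gain)
            # Smooth the gain so that it does not jump between frames.
            gain += rate * (desired - gain)
    # During pauses, residual echo or far-end speech the gain is held.
    return [gain * s for s in frame], gain
```

Holding the gain while the flag is false is what keeps pauses, residual echo and far-end speech from pulling the level of the near-end speaker up or down.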

At the output of the system, the result of the signal processing is transformed from the frequency domain back to the time domain by the inverse FFT in block 307. The estimated near-end speech signal $\hat{s}$ is transmitted over the channel to the remote interlocutor.

Figure 4 shows the block diagram of the acoustic echo canceller (AEC) 302, which consists of two basic blocks: block 401, which comprises 5 adaptive NLMS (Normalized Least Mean Square) algorithms, and block 402, whose basic function is double-talk detection (DTD), that is, detecting simultaneous speech activity of the near-end and far-end speakers.

The NLMS algorithms, NLMS1 to NLMS5, process the microphone signals $x_1$ to $x_5$ and pass the processed signals $S_{AEC1}$ to $S_{AEC5}$ on to blocks 303, 304 and 306 (Figure 3). The function of the NLMS algorithms is to suppress the echo in each microphone signal. This is made possible by the reference signals from the loudspeakers 102 and the control signals from the DTD detector 402. The NLMS algorithm models the transfer function of the acoustic path from each loudspeaker 102 to each microphone 103; for example, NLMS1 models the transfer functions from loudspeaker Zv-L to microphone M1 and from loudspeaker Zv-D to microphone M1, and so on. Passing the loudspeaker signals through the NLMS filters yields a replica of the signals that reach the microphones along the acoustic path, and subtracting the replica from the microphone signal suppresses the echo at the output of the NLMS algorithms. To improve echo suppression, DFT coefficients from previous processing blocks are used, as in the RLS1 AEC algorithm (RLS, Recursive Least Squares) described below. Because the NLMS algorithm requires considerably less computation time than the RLS algorithm, the NLMS algorithms use DFT coefficients from the previous 5 processing blocks.
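The replica-and-subtract principle can be sketched for one frequency bin as a complex NLMS filter whose regressor holds the loudspeaker DFT coefficient of the current block and of the 5 previous blocks. This is a simplified illustration under stated assumptions: it uses a single loudspeaker reference (the text describes two, Zv-L and Zv-D), and the class name, step size and regularization constant are hypothetical.

```python
class BinNlms:
    """NLMS echo canceller for one frequency bin (illustrative sketch)."""

    def __init__(self, n_frames=6, step=0.5, eps=1e-8):
        self.w = [0j] * n_frames      # filter weights across blocks
        self.u = [0j] * n_frames      # current and previous reference bins
        self.step = step              # assumed step size
        self.eps = eps                # assumed regularization constant

    def process(self, ref_bin, mic_bin, adapt=True):
        # Shift the newest loudspeaker coefficient into the regressor.
        self.u = [ref_bin] + self.u[:-1]
        echo_est = sum(w * u for w, u in zip(self.w, self.u))
        err = mic_bin - echo_est          # echo-suppressed output
        if adapt:                         # frozen while double talk is flagged
            norm = sum(abs(u) ** 2 for u in self.u) + self.eps
            self.w = [w + self.step * err * u.conjugate() / norm
                      for w, u in zip(self.w, self.u)]
        return err
```

The `adapt` flag plays the role of the control signal from the DTD detector: when it is false the filter still subtracts its echo estimate but its weights are left untouched.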

Block 403, labeled RLS1 AEC, is the key algorithmic part of the double-talk detection procedure of block 402. RLS1 AEC performs coarse suppression of the acoustic interference in the signal of microphone M1 using the RLS algorithm. The RLS algorithm converges quickly, which gives a good estimate of the speech signal as well as of the additive echo component. Because the DFT window of 1024 points is not long enough to achieve maximum echo suppression in a highly reverberant room, the DFT coefficients from the 3 previous processing blocks are appended to the regression vector. This brings a double benefit: echo suppression is maximized, and the signal delay through the system does not grow, because the DFT order remains unchanged.
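A textbook exponentially weighted RLS update for one frequency bin, with a regression vector made of the current reference coefficient and the 3 previous ones, might look like the sketch below. It is not the patented RLS1 AEC block: the single loudspeaker reference, the forgetting factor, the initialization constant and the class name are all assumptions chosen for illustration.

```python
class BinRls:
    """Exponentially weighted RLS for one frequency bin (illustrative sketch)."""

    def __init__(self, n_frames=4, forgetting=0.99, delta=100.0):
        n = n_frames
        self.n = n
        self.lam = forgetting         # assumed forgetting factor
        self.w = [0j] * n
        self.u = [0j] * n
        # P approximates the inverse correlation matrix of the regressor.
        self.P = [[(delta if i == j else 0.0) + 0j for j in range(n)]
                  for i in range(n)]

    def process(self, ref_bin, mic_bin):
        n, P = self.n, self.P
        self.u = [ref_bin] + self.u[:-1]
        u = self.u
        Pu = [sum(P[i][j] * u[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(u[i].conjugate() * Pu[i] for i in range(n))
        k = [p / denom for p in Pu]                            # gain vector
        y = sum(w.conjugate() * x for w, x in zip(self.w, u))  # echo estimate
        e = mic_bin - y                                        # near-end estimate
        self.w = [w + ki * e.conjugate() for w, ki in zip(self.w, k)]
        uP = [sum(u[i].conjugate() * P[i][j] for i in range(n))
              for j in range(n)]
        self.P = [[(P[i][j] - k[i] * uP[j]) / self.lam for j in range(n)]
                  for i in range(n)]
        return e, y
```

Extending the regressor with coefficients from earlier blocks lets the filter cover a longer echo tail without enlarging the DFT, at the cost of updating an n-by-n matrix per bin, which is why RLS is considerably more expensive than NLMS.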

The RLS1 AEC block outputs two signals, $e$ and $y$. The first signal, $e$, is an estimate of the near-end speaker's speech at microphone M1. The second signal, $y$, is an estimate of the additive echo component in the signal of microphone M1. Both signals are used for double-talk detection, which is implemented in block 402, labeled DTD. The signal from the DTD detector controls the NLMS algorithms by preventing the adaptation of NLMS1 to NLMS5 during double talk, when the operation of the adaptive algorithms would otherwise be disturbed. In block 405 the powers of the loudspeaker signals are averaged according to relation (1).

Recursive averaging is applied to both signals, $y$ and $P_{ref}$, yielding the averaged power of the echo signal at microphone M1 (2) and of the loudspeaker signals that produce the echo (3).

The ratio of these two powers is estimated by the quantity $C_s$:

This quantity is used to scale the powers of the loudspeaker signals for the soft decision made in block 408. In this block the absence of the near-end speaker in the microphone signal is determined by a soft decision defined by a relation in which $a_f$ is a frequency-dependent constant that deliberately favors permission to adapt at higher frequencies, where the signal powers are lower and the NLMS algorithms are therefore less likely to diverge. The quantity X is the minimum ratio of echo power to near-end speech power for which the soft decision is a positive number. In block 409 the control signal Dui is limited; besides the NLMS algorithms, this signal is also fed to the DOA-azimuth block.

Figure 5 shows a block diagram of the solution for determining the azimuth 304, that is, the direction of arrival (DOA-azimuth) of the direct sound wave from the active speaker. The inputs to this block are the channel signals $S_{AEC1}$ to $S_{AEC5}$ from the AEC block, and the output is the estimate of the arrival angle $\theta_a$. The algorithm is based on cross-correlation analysis of the input signals $S_{AEC1}$ to $S_{AEC5}$ in block 501, whose output gives estimates of four cross-correlation functions, $G_{12}(t, f)$ to $G_{15}(t, f)$, obtained by recursive averaging according to the relation

The constants $\alpha_+$ and $\alpha_-$ are chosen to satisfy the inequality $0.5 < \alpha_+ < \alpha_- < 1$, which favors the influence of the terms $X_1(t, f) X_k^*(t, f)$ with the largest modulus.
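One plausible reading of such an averaging rule, for a single bin, is sketched below. The exact relation is not reproduced in this text, so the form used here is an assumption: the smaller constant is applied when the new cross term has a larger modulus than the running estimate, so that strong terms are absorbed quickly, and the larger constant is applied otherwise. The function name and the default values are hypothetical.

```python
def update_cross_spectrum(g_prev, x1, xk, alpha_plus=0.6, alpha_minus=0.9):
    """One recursive-averaging step for a single bin.

    Assumed form: G <- a*G + (1 - a)*X1*conj(Xk), with a = alpha_plus when
    the new term has a larger modulus than the running estimate and
    a = alpha_minus otherwise.
    """
    if not 0.5 < alpha_plus < alpha_minus < 1.0:
        raise ValueError("constants must satisfy 0.5 < a+ < a- < 1")
    term = x1 * xk.conjugate()
    a = alpha_plus if abs(term) > abs(g_prev) else alpha_minus
    return a * g_prev + (1.0 - a) * term
```

Under this assumed rule the estimate rises quickly toward a strong term and decays slowly afterwards, which is one way in which terms with the largest modulus can dominate the average.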

Block 502, labeled PHAT, implements the generalized cross-correlation that the literature commonly calls the phase transform. Normalizing the cross-correlation by its own modulus discards the information about signal power and keeps only the phase, which carries the relative time delay of the signal. The relative time delay of the sound wave between two microphones is estimated by taking the inverse FFT of G_1k(t, f) and locating its maximum.
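The phase-transform step can be illustrated with a minimal NumPy sketch. It is not the patented implementation: it operates on two whole time-domain signals rather than on recursively averaged per-frame spectra, and the function name `gcc_phat_delay`, the zero-padding and the `eps` guard are assumptions made only for this example.

```python
import numpy as np


def gcc_phat_delay(x1, x2, fs, eps=1e-12):
    """Estimate the delay of x2 relative to x1, in seconds, with GCC-PHAT."""
    n = len(x1) + len(x2)                 # zero-pad so the correlation is linear, not circular
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    G = X2 * np.conj(X1)                  # cross-spectrum of the two channels
    G /= np.abs(G) + eps                  # PHAT: divide by the modulus, keep the phase only
    cc = np.fft.irfft(G, n=n)             # generalized cross-correlation
    max_shift = n // 2
    # Reorder so that the lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift  # position of the maximum, in samples
    return lag / fs
```

A positive result means that `x2` lags `x1`. In the system described here the same peak search would be applied to the weighted and time-smoothed cross-correlation functions rather than to a single frame.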

Because speech has a formant structure, the frequency bins do not all carry the same power, so the bins with the highest power must be selected and used to determine the cross-correlation function. To this end, block 503 computes the instantaneous power of each channel signal and the mean power over all channels, P(t, f). Block 504 determines the weighting function W(t, f), which favors bins in which the instantaneous signal power is rising. The reason for this choice is that a segment with a sharp rise in power contains a larger share of the direct wave than a segment with falling power, where wave reflections (room reverberation) dominate.

Block 505 computes the mean power of the microphone signals averaged over time and frequency. The bins are first averaged over frequency with a non-causal first-order IIR filter (zero phase delay is obtained by filtering twice, forward and backward). Averaging over time uses a nonlinear first-order IIR filter with two averaging coefficients, one for rising and one for falling signal power. This nonlinear filter is described by the relations:

[equation not reproduced]
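The two smoothing stages of block 505 can be sketched as follows. The exact relations are not available in this text, so the recursion below is only one common form of a first-order recursive average; the function names, the coefficient names `a_rise` and `a_fall`, and the rule for choosing between them are assumptions.

```python
def zero_phase_smooth(values, a):
    """Smooth across frequency bins with a forward and a backward first-order IIR pass.

    Running the same filter in both directions cancels its phase delay.
    """
    forward, state = [], values[0]
    for v in values:
        state = a * state + (1.0 - a) * v
        forward.append(state)
    backward, state = [], forward[-1]
    for v in reversed(forward):
        state = a * state + (1.0 - a) * v
        backward.append(state)
    return backward[::-1]


def asymmetric_smooth(power, a_rise, a_fall):
    """Smooth over time with one coefficient for rising and one for falling power.

    Assumed rule: use a_rise when the new sample exceeds the current state.
    """
    out, state = [], power[0]
    for p in power:
        a = a_rise if p > state else a_fall
        state = a * state + (1.0 - a) * p
        out.append(state)
    return out
```

With `a_rise` smaller than `a_fall`, the smoothed power follows onsets quickly and decays slowly, which is one way to obtain a threshold that emphasizes high-power bins.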

The time- and frequency-averaged power is used to define the decision threshold for selecting the highest-power bins in block 506. Multiplying the binary output of block 506 by the weighting vector W(t, f) yields the filter function that weights the phase-transform bins in block 502. The phase-transformed cross-correlation functions are additionally smoothed in time with an IIR filter to reduce the variance of the correlation-function estimates. This is described by the relation:

[equation not reproduced]

In addition to the bin selection performed by the filter function, bins outside the band of interest are rejected a priori. Block 507 defines the bands that are a priori of no interest, and these are discarded before the inverse FFT (FFT⁻¹). Block 509 time-aligns the cross-correlation functions, which are then averaged; block 510 finds the maximum of their mean, whose abscissa is the estimated time delay τ_d. Block 511 converts the time delay τ_d into the incidence angle θ_d of the active speaker's wave.
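The conversion performed in block 511 is not spelled out in this text. A minimal sketch for a single microphone pair, assuming a far-field plane wave, an angle measured from the array broadside and a speed of sound of 343 m/s, could look like this (`delay_to_azimuth` and its parameters are hypothetical names):

```python
import math


def delay_to_azimuth(tau, mic_distance, c=343.0):
    """Convert an inter-microphone delay (s) into an arrival angle (degrees).

    Assumes the far-field model tau = mic_distance * sin(theta) / c.
    """
    s = c * tau / mic_distance
    s = max(-1.0, min(1.0, s))   # clip: estimation noise can push |s| slightly above 1
    return math.degrees(math.asin(s))
```

For two microphones 5 cm apart, a delay of about 73 µs corresponds to an angle of roughly 30 degrees under these assumptions.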

Estimating the direction of arrival is meaningful only while the near-end speaker is active; when the speaker is inactive, the estimate obtained during the speaker's most recent activity is taken as the valid one. Near-end speaker activity is detected from: (a) the mean power of the microphone signals, from block 513; (b) the output D of the double-talk detector, block 402, Figure 4; and (c) the signal S_BF from block 303 (SD-BF, Figure 3). Block 512 uses this information to decide whether the near-end speaker is active. If the decision is that the direction estimate is valid, that is, the near-end speaker is active, the current direction estimate is passed to the output of DOA block 304; otherwise the last valid direction estimate is passed.

Figure 6 shows the block diagram of the procedure that forms the superdirective spatial filter 303 of Figure 3. An adaptive algorithm for suppressing acoustic interference suffers from self-cancellation of the useful signal when it is used in a reverberant room, so a superdirective spatial filter 601 with fixed coefficients is often used instead. A superdirective spatial filter provides a higher directivity index than a conventional spatial filter, which consists only of delay compensation and summation. The procedure for obtaining weighting coefficients that give the filter its superdirective characteristic is described below.

A reverberant room is usually modeled with a diffuse noise field, which assumes that noise arrives from all directions with approximately equal intensity. For this noise-field model, the coherence between two microphones can be shown to be a real number equal to

Γ_ij(f) = sin(2πf d_ij / c) / (2πf d_ij / c)

where f is the frequency, d_ij is the distance between microphones i and j, and c is the speed of sound. The coherences Γ_ij(f) of the microphone pairs form the coherence matrix Γ. Using this coherence matrix, the coefficients of the superdirective microphone array are determined in block 602 according to a relation in which C_θ is the steering vector toward the selected speaker, whose direction is defined by the azimuth θ. This vector is determined in block 603 according to a relation in which d is the distance between two adjacent microphones. At the output of block 303, the estimate S_BF of the current speaker's speech is obtained from the relation:

[equation not reproduced]
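The design steps of blocks 602 and 603 can be sketched with NumPy. The relations used below are the standard textbook forms (minimum-variance distortionless-response weights w = Γ⁻¹C_θ / (C_θᴴΓ⁻¹C_θ) and a far-field steering vector for a uniform linear array); whether the patent uses exactly these forms cannot be confirmed from this text. The diagonal loading term, the sign convention of the steering vector, the 5 cm spacing in the usage note and all function names are assumptions.

```python
import numpy as np


def diffuse_coherence(freq, positions, c=343.0):
    """Coherence matrix of a diffuse noise field: sin(2*pi*f*d/c) / (2*pi*f*d/c)."""
    dist = np.abs(positions[:, None] - positions[None, :])
    return np.sinc(2.0 * freq * dist / c)   # np.sinc(x) = sin(pi*x) / (pi*x)


def steering_vector(freq, positions, azimuth_rad, c=343.0):
    """Far-field steering vector of a linear array (assumed sign convention)."""
    delays = positions * np.sin(azimuth_rad) / c
    return np.exp(-2j * np.pi * freq * delays)


def superdirective_weights(freq, positions, azimuth_rad, loading=1e-2, c=343.0):
    """Fixed weights that minimize diffuse-noise output with unit gain toward azimuth."""
    # Diagonal loading (an assumption) keeps the matrix well conditioned at low frequencies.
    gamma = diffuse_coherence(freq, positions, c) + loading * np.eye(len(positions))
    steer = steering_vector(freq, positions, azimuth_rad, c)
    num = np.linalg.solve(gamma, steer)      # Gamma^-1 C
    return num / (steer.conj() @ num)        # divide by C^H Gamma^-1 C
```

For a five-microphone array, `positions = 0.05 * np.arange(5)` gives one weight vector per frequency bin, and the beamformer output for that bin would then be `np.vdot(w, mic_spectra)`, that is, wᴴ applied to the five channel spectra.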

Figure 7 shows the noise suppression block 305, labeled NR. The signal S_BF is the input to block 305; it contains the estimated speech signal together with residual interference from acoustic echo, acoustic disturbances in the room and room reverberation. S_BF enters block 701, labeled FWF⁻¹, which performs an IFFT, then applies an additional window to the time-domain segment to cut off its ends "softly", and finally returns the segment to the frequency domain with an FFT. The purpose of this operation is as follows. The preceding processing stages extend the equivalent time-domain signal to the limits of the DFT window. Applying a further operation, Wiener filtering, extends the segment still more and causes cyclic overlap at its ends, which creates impulsive interference heard as steady "crackling". The FWF⁻¹ procedure completely removes this problem without introducing any additional signal distortion.
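The IFFT, window, FFT sequence of block 701 can be sketched in a few lines of NumPy. The window actually used is not given in this text; the raised-cosine ramps, the `taper_len` parameter and the name `soft_truncate` are assumptions, and the sketch expects a one-sided spectrum of an even-length frame.

```python
import numpy as np


def soft_truncate(spectrum, taper_len):
    """IFFT, taper both ends of the time segment, then FFT back."""
    frame = np.fft.irfft(spectrum)                 # back to the time domain
    window = np.ones(len(frame))
    # Raised-cosine ramp from 0 up to almost 1 (assumed window shape).
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(taper_len) / taper_len))
    window[:taper_len] = ramp                      # fade in at the start
    window[-taper_len:] = ramp[::-1]               # fade out at the end
    return np.fft.rfft(frame * window)             # return to the frequency domain
```

Energy that has spread or wrapped around to the ends of the segment is attenuated, while content in the middle of the segment passes unchanged.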

The next two blocks, 702 and 703, estimate the noise from the minimum of the input signal power. Instantaneous adaptation to the power minimum does not give good results, because in some blocks the DFT coefficients have extremely low power, which corrupts the preceding noise power estimate. The noise estimation is therefore implemented in three processing blocks: block 702 produces a slow noise power estimate N_slow, block 703 a fast noise power estimate N_fast, and block 704 derives the current noise power estimate N from N_slow and N_fast by means of a nonlinear transformation.

The fast and slow noise power estimates use the same procedure, recursive averaging with a first-order IIR filter that has different adaptation factors for rising and falling output values:

[equation not reproduced]

where the constants α_slow+, α_slow-, α_fast+ and α_fast- satisfy the relation:

[equation not reproduced]

The fast and slow noise estimates are combined in block 704, labeled as a nonlinear compressor. The final noise level estimate is obtained from the relation:

[equation not reproduced]

where the parameter α (0.25 < α < 0.5) sets the degree of compression of the noise-estimate dynamics and the parameter β defines the overestimation of the noise power. The purpose of the nonlinear transformation is as follows. When N_fast > N_slow, using the fast estimate alone would also suppress the speech signal excessively, which is why the dynamics of the noise estimate are compressed. When N_fast < N_slow, compression is not applied, so that the noise estimate falls as quickly as possible. This prevents parts of phonemes at the ends of words from being cut off when the signal power drops quickly and the high previous value of the slow estimator cannot follow the change.
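The behavior described for block 704 can be illustrated with a hypothetical compressor. The patent's own relation is not available in this text, so the power-law form below is purely an assumption chosen to match the description: the excess of the fast estimate over the slow one is compressed with exponent α, no compression is applied when the fast estimate is lower, and β scales the result upward.

```python
def combine_noise_estimates(n_fast, n_slow, alpha=0.4, beta=1.5):
    """Combine fast and slow noise power estimates (hypothetical form, not the patent's).

    alpha: compression exponent, described in the text as 0.25 < alpha < 0.5.
    beta:  overestimation factor applied to the result.
    """
    if n_fast > n_slow > 0.0:
        # Compress how far the fast estimate rises above the slow one.
        return beta * n_slow * (n_fast / n_slow) ** alpha
    # Fast estimate is lower: follow it directly so the estimate drops quickly.
    return beta * n_fast
```

In a per-band implementation, `alpha` and `beta` would be looked up from a table indexed by frequency band.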

Because the ratio of useful speech to noise is considerably worse at high frequencies, a set of parameters α and β is defined for four characteristic frequency bands, (0-2000 Hz), (2000-2500 Hz), (2500-3500 Hz) and (3500-5012 Hz), according to the expected signal-to-noise ratio. This parameter set is stored in block 705.

Block 706 performs Wiener filtering with the transfer function:

[equation not reproduced]

where the constant β_W overestimates the initial noise power estimate in order to balance maximum noise suppression against minimum degradation of the useful speech signal. In the time domain the transfer function H_W can have an unacceptably long impulse response, which produces distortion at the boundaries of the DFT blocks; the impulse response is therefore "softly" shortened with the FWF⁻¹ procedure described above. Finally, block 707 additionally filters the estimated output speech signal ŝ to reject spectral components outside the speech band, which can arise in the preceding processing stages and can affect the operation of the AGC block.
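For orientation, a common textbook form of a per-bin Wiener gain with an overestimated noise power is sketched below. The transfer function of block 706 itself is not available in this text; the subtraction form, the gain floor and the names `wiener_gain`, `beta_w` and `floor` are assumptions.

```python
def wiener_gain(signal_power, noise_power, beta_w=1.5, floor=0.05):
    """Per-bin gain 1 - beta_w * N / P, limited below by a small floor.

    beta_w > 1 overestimates the noise power; the floor (an assumption)
    keeps the gain from becoming zero or negative.
    """
    if signal_power <= 0.0:
        return floor
    gain = 1.0 - beta_w * noise_power / signal_power
    return max(gain, floor)
```

A larger `beta_w` suppresses more noise at the cost of more speech attenuation in bins with a low signal-to-noise ratio, which is the trade-off the text describes.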

Figure 8 shows the automatic gain control (AGC) block for the system output signal, block 306. The tasks of the AGC block are: (1) to amplify weak speech signals and attenuate excessively strong ones according to a predefined dynamic-range compression characteristic; (2) to reduce the gain in segments of the input signal that contain only echo, stationary noise or an interfering competing speaker, so that these disturbances are sufficiently attenuated; and (3) to attenuate segments of the input signal in which the useful speech signal and interference are present at the same time, while preserving speech intelligibility.

The input to block 306 is the output signal of the NR block (block 305, Figure 3), which passes through a dynamic-range compressor with an adaptive slope of the compression characteristic, block 801. The output of block 801 is the signal S_AGC, which then passes through block 307, Figure 3, where the inverse Fourier transform FFT⁻¹ converts it from the frequency domain to the time domain; the result is sent to the far-end speaker through the digital television channel as the final estimate ŝ of the speech signal.

The gain of the speech signal is controlled in block 801 according to the relation:

[equation not reproduced]

where A_agc is the gain of the AGC block; P_nom is the nominal power of the output signal; α is a constant that limits the maximum gain to A_agc,max = √(1/α) (for α = 0.001, A_agc,max = 31.6, which corresponds to 30 dB); P_in = P_d + P_n + P_echo, where P_d is the power of the useful speech signal, P_n the power of the diffuse ambient noise and P_echo the power of the unsuppressed echo signal; and SLOPE is a quantity that expresses the degree of dynamic-range compression and is a composite function of the peak power of the useful speech signal. Block 802 computes SLOPE by analyzing the trajectory of the peak power of the useful speech signal and tracking its convexity and growth trend.

Block 803 computes the peak power of the useful speech signal according to the relations:

[equation not reproduced]

where α_d is a constant with a value close to 1.

Block 804 estimates the power of the unsuppressed echo according to the relation:

[equation not reproduced]

where α_echo is the suppression constant applied to the echo signal y from block 402, Figure 4.

Block 805 estimates the diffuse noise power P_n as the difference between the mean power of the input signals S_AEC1 to S_AEC5 of block 303, Figure 3, and the power of the output signal S_BF of block 303.

Applying the relation for A_agc directly with a fixed value of SLOPE does not give good results, because it treats residual interference and the useful signal alike. When only interference is present, it is amplified, which is undesirable. The following cases therefore have to be detected and distinguished: (a) a pause in the useful speech signal, (b) residual echo present, and (c) a competing speaker or acoustic disturbance present. When any of these cases is detected, SLOPE is set to 1, which prevents the interference from being amplified.

A pause in the useful speech signal differs from speech in its stationarity. A speech signal, however weak, is non-stationary in time, whereas a pause contains slowly varying ambient noise. The linear trend of the signal power, normalized by the power, is a good indicator of non-stationarity. It is complemented by an indicator of the convexity of the power trajectory, which is negative at a local maximum.
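The two indicators can be sketched by fitting low-order polynomials to a short window of the power trajectory. The window length, the use of least-squares fits and the name `trend_and_convexity` are assumptions; the text does not say how the trend and the convexity are computed.

```python
import numpy as np


def trend_and_convexity(power):
    """Return (normalized linear trend, normalized curvature) of a power window."""
    p = np.asarray(power, dtype=float)
    t = np.arange(len(p)) - (len(p) - 1) / 2.0   # time axis centered on the window
    slope = np.polyfit(t, p, 1)[0]               # linear trend of the power
    curvature = np.polyfit(t, p, 2)[0]           # quadratic term: negative at a local maximum
    mean_power = p.mean()
    return slope / mean_power, curvature / mean_power
```

A nearly constant window, as in a pause, gives values close to zero for both indicators, whereas a speech onset gives a clearly positive trend and a peak gives a negative curvature.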

This invention describes a procedure for processing acoustic and speech signals in a free speech communication system that operates in full duplex. The invention relates to free speech communication in a digital television system, but it can equally be applied to other communication systems such as video-telephone systems, teleconferencing systems, speakerphones in a room or a car, human-computer voice communication, and so on. What is specific to the solution in this invention is its integration into a standard digital TV receiver and its optimization for use in medium-sized rooms (acoustic environments) with a reverberation time of up to 600 ms.

The procedures and techniques for processing acoustic and speech signals in this invention can be generalized to N microphones in the microphone array for multi-channel recording and to M loudspeakers for multi-channel playback.

The procedures and techniques for processing acoustic and speech signals in this invention are controlled by a number of parameters that allow the solution to be optimized for different applications.

The procedures and techniques for processing acoustic and speech signals in this invention can be implemented in various ways. For example, they can be implemented in hardware, in software, or in a combination of the two. A hardware implementation can use application-specific integrated circuits (ASIC), digital signal processors (DSP), programmable logic devices (PLD or FPGA) and other electronic circuits designed to perform the functions described in this invention.

The procedures and techniques for processing acoustic and speech signals in this invention can also be implemented in software, either as a whole or as modules that perform individual functions described in this invention. The program code can be stored in memory units and executed by processors such as those in a PC, PDA, DSP, etc.

The details of this invention described here enable any person skilled in the art to implement its generic principles in other free speech communication systems without departing from the scope of this invention.

Claims

1. A system for free speech communication using a microphone array containing a digital TV receiver that enables audio and video communication in full duplex, characterized by the fact that the digital TV receiver (100) has a stereo audio reproduction (102) for reproducing stereo TV programs and a mono incoming speech signal in videotelephone communication, which has a built-in moving video camera (104) for recording speakers in the room and which on part of its screen reproduces the image of the speaker from the far end (105); which contains a microphone system (103) built into the TV receiver (100) whose purpose is to record the speaker's speech at the near end as well as other ambient sounds and whose purpose is to locate the speaker in the room and control the video camera (104).

2. The system according to claim 1, characterized in that its audio transmission part (207) and (208) enables the suppression of the acoustic echo (209) generated by the speakers of the TV receiver (102), enables the suppression of ambient interference (213) and reverberation (210), (212) and (214), enables the location of speakers in the room, enables adaptive control of the signal level in the transmission and provides coordinates for controlling the video camera.

3. The system according to claim 2, characterized in that it contains a microphone array (103) of more than 2 microphones that provide microphone signals for further parallel processing, a module for adaptive acoustic echo suppression (AEC) (302) consisting of a set of adaptive filters, a module for estimating the incoming direction of the speaker's direct sound wave (DOA) (304) and managing the directionality characteristic of the microphone array, a module for forming the directionality characteristic of the microphone array with an optimized ratio of main and side lobes (SB-CBF) (303), a module for adaptive suppression of all residual interference signals (NR) (305) and a module for automatic system gain control (AGC) (306).

4. The system according to claim 3, characterized in that it contains a set of microphones (103) located in a horizontal plane at equal distance from each other and mounted on the upper edge of the digital TV receiver (100).

5. The system according to claim 4, characterized in that it suppresses the acoustic echo (209) which is generated by the stereo speakers (102) and consists of the stereo audio TV signal (205) and the mono speech signal originating from the remote speaker (204).

6. The system according to claim 5, characterized in that the echo suppression unit (302) and the ambient interference suppression unit (305) work even in low signal-to-noise ratio conditions.

7. A system according to any of the preceding claims, characterized by the fact that it enables adaptive locating and tracking of speakers in space by azimuth.

8. The system according to claim 7, characterized in that it enables adaptive determination of spatial coordinates for controlling the video camera.

9. The system according to claim 4, characterized in that its microphone array forms a narrow directional characteristic that enables spatial filtering and separation of the current speaker from other sources of interference in the room.

10. The system according to claim 9, characterized by the fact that its microphone array forms a narrow directional characteristic that enables suppression of echoes due to reflections in the room, i.e. reverberation signals.

11. A system according to any of the preceding claims, characterized by the fact that, through automatic system gain control, it maintains the average level of the transmitted speech signal within acceptable limits of normal speech dynamics, regardless of the distance and position of the speaker in relation to the microphone array.

12. A method for free speech communication using a microphone array, characterized by parallel processing of the microphone signals from the microphone array, whereby acoustic echo in the microphone signals is adaptively suppressed, the direction of arrival of the direct sound wave of the near-end speaker is estimated, a superdirective directional characteristic of the microphone array is formed and steered in azimuth, all interference signals present in the microphone signals are suppressed, and the level of the transmitted voice signal is automatically maintained.

13. The method according to claim 12, characterized by the fact that the complete processing of all audio signals is performed in the frequency domain.

14. The method according to claim 12, characterized by the fact that the adaptive suppression of the acoustic echo is performed individually for each microphone signal and that the suppression includes both signals coming from the stereo speakers.

15. The method according to claim 14, characterized in that the adaptive acoustic echo suppression is performed for each microphone signal by means of NLMS algorithms controlled by a detector of speech activity at both ends (double-talk detector, DTD).

16. The method according to claim 14, characterized by the fact that the NLMS algorithms are controlled by means of a near-end speech activity detector implemented within the DTD and based on the RLS adaptive algorithm under specific conditions of the continuous presence of a TV audio program signal, which, in addition to speech, also contains a music signal.

17. The method according to claim 12, characterized in that the estimation of the incoming direction of the direct sound wave from the current speaker is performed on the basis of cross-correlation analysis of microphone signals after acoustic echo suppression.

18. The method according to claim 17, characterized in that the estimation of the incoming direction of the direct sound wave from the current speaker is performed under the control of the near-end voice activity detector (VAD).

19. The method according to claim 12, characterized by the fact that the directional characteristic of the microphone array is formed in the SB-CBF module as a superdirective characteristic based on the principle of weighting and summing the microphone signals after acoustic echo suppression and adaptive control according to azimuth.

20. The method according to claim 19, characterized in that the coefficients of the superdirective microphone array are determined using the coherence functions of the pairs of microphone signals and the direction vector to the direction of the selected speaker, defined by the azimuth angle.

21. The method according to claim 12, characterized in that the residual noise suppression function is realized by an adaptive Wiener filter.

22. The method according to claim 21, characterized in that the estimation of the residual noise in the noise suppressor is optimized according to the characteristics of the speech signal and realized on the basis of a non-linear compressor of the dynamics of the estimated noise, parametrically controlled and frequency dependent.

23. The method according to any of claims 12 to 22, characterized in that the module for automatic system gain control is based on a dynamics compressor with an adaptive slope of the compression characteristic.

24. The method according to claim 23, characterized in that the speech signal dynamics compressor is controlled by means of a residual acoustic echo presence detector, a speech signal pause detector, and a competing speaker and acoustic disturbance detector.
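
Illustrative note (not part of the claims): the per-microphone NLMS echo suppression named in claim 15 can be sketched as follows. This is a minimal single-channel sketch under stated assumptions, not the implementation described in the patent. The function name `nlms_echo_cancel`, the parameters `taps`, `mu`, `eps`, and the boolean `adapt` mask (a stand-in for the DTD decision that freezes adaptation during double talk) are all hypothetical. The sketch works in the time domain with one loudspeaker reference, whereas the claims describe frequency-domain processing and a stereo reference.

```python
def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8, adapt=None):
    """Suppress the far-end echo in one microphone signal with NLMS.

    far_end: loudspeaker reference samples.
    mic:     microphone samples containing the echo of far_end.
    adapt:   optional list of booleans; False freezes the filter update
             (hypothetical stand-in for a double-talk detector decision).
    Returns (echo-suppressed samples, final filter coefficients).
    """
    w = [0.0] * taps   # adaptive estimate of the echo path
    x = [0.0] * taps   # most recent reference samples, newest first
    out = []
    for n, (ref, d) in enumerate(zip(far_end, mic)):
        x = [ref] + x[:-1]                          # shift in the new sample
        y = sum(wi * xi for wi, xi in zip(w, x))    # estimated echo
        e = d - y                                   # residual after suppression
        if adapt is None or adapt[n]:
            # Normalized step: divide by the reference energy in the window.
            norm = sum(xi * xi for xi in x) + eps
            g = mu * e / norm
            w = [wi + g * xi for wi, xi in zip(w, x)]
        out.append(e)
    return out, w
```

In a system of the kind claimed, one such filter would run for each microphone signal, and the residual outputs would then feed the direction estimation and beamforming stages.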