WO2018068636A1 - Voice signal detection method and apparatus - Google Patents

Voice signal detection method and apparatus

Info

Publication number
WO2018068636A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
short
energy
signal
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/103489
Other languages
English (en)
French (fr)
Inventor
焦雷
官砚楚
曾晓东
林锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
First worldwide family litigation filed litigation Critical https://patents.darts-ip.com/?family=59176496&utm_source=google_patent&utm_medium=platform_link&utm_campaign=public_patent_search&patent=WO2018068636(A1) "Global patent litigation dataset” by Darts-ip is licensed under a Creative Commons Attribution 4.0 International License.
Application filed by Alibaba Group Holding Ltd
Priority to EP17860814.7A (EP3528251B1)
Priority to SG11201903320XA (SG11201903320XA)
Priority to PH1/2019/500784A (PH12019500784B1)
Priority to JP2019520035A (JP6859499B2)
Priority to MYPI2019001999A (MY201634A)
Priority to KR1020197013519A (KR102214888B1)
Publication of WO2018068636A1
Priority to US16/380,609 (US10706874B2)

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L2025/783Detection of presence or absence of voice signals based on threshold decision

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a voice signal detecting method and apparatus.
  • For a user to send a voice message without clicking a button, the smart device needs to record continuously or at a preset period and judge whether the acquired audio signal contains a voice signal. If it does, the voice signal is extracted, processed, and sent, which completes the transmission of the voice message.
  • In the prior art, voice signal detection methods such as the dual-threshold method, a detection method based on the autocorrelation maximum, or a detection method based on the wavelet transform are generally used to detect whether the acquired audio signal contains a voice signal.
  • However, these methods generally obtain the frequency characteristics of the audio signal through complex calculations such as the Fourier transform and then determine from those characteristics whether a voice signal is present. They must process large amounts of buffered data, so memory usage is high, the amount of computation is large, processing is slow, and power consumption is high.
  • The embodiments of the present application provide a voice signal detection method and apparatus to solve the problem that prior-art voice signal detection methods are slow and consume considerable resources.
  • a method for detecting a voice signal comprising:
  • a voice signal is detected in the audio signal according to the energy of each short-term energy frame.
  • a voice signal detecting device comprising:
  • a dividing module dividing the audio signal into a plurality of short-term energy frames according to a frequency of the preset voice signal
  • the detecting module detects whether the audio signal includes a voice signal according to the energy of each short-term energy frame.
  • the voice signal detection method used in the embodiment of the present application does not need to perform complex calculation such as Fourier transform.
  • The voice signal detection method provided by the embodiments of the present application can solve the problem that prior-art voice signal detection methods are slow and consume a large amount of resources.
  • FIG. 1 is a specific flowchart of a method for detecting a voice signal according to an embodiment of the present application
  • FIG. 2 is a specific flowchart of another method for detecting a voice signal according to an embodiment of the present application
  • FIG. 3 is a diagram showing an audio signal display of a preset duration according to an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a voice signal detecting apparatus according to an embodiment of the present disclosure.
  • the embodiment of the present application provides a voice signal detection method.
  • The method may be executed by, but is not limited to, a user terminal such as a mobile phone, a tablet computer, or a personal computer (PC), by an application (APP) running on such a terminal, or by a device such as a server.
  • The specific flow of the method is shown in FIG. 1 and includes the following steps:
  • Step 101 Acquire an audio signal.
  • The audio signal may be one collected by the APP through an audio collection device, or one received by the APP, such as an audio signal transmitted by another APP or device; the embodiments of the present application do not limit this. After the APP acquires the audio signal, it can save the audio signal locally.
  • the present application also does not impose any limitation on the sampling rate, duration, format or channel corresponding to the above audio signal.
  • The APP can be of any type, such as a chat APP or a payment APP, as long as it can acquire the audio signal and can detect a voice signal in the acquired audio signal by using the voice signal detection method provided by the embodiments of the present application.
  • Step 102 Divide the audio signal into a plurality of short-term energy frames according to a frequency of the preset voice signal.
  • the short-term energy frame described above is actually a part of the audio signal in the audio signal acquired in step 101.
  • The period of the preset voice signal may be determined from the frequency of the preset voice signal, and the audio signal acquired in step 101 is then divided into short-term energy frames whose duration equals the determined period. For example, if the period of the preset voice signal is 0.01 s, the audio signal acquired in step 101 may be divided, according to its duration, into a number of short-term energy frames each lasting 0.01 s. Note that the audio signal acquired in step 101 may, depending on the actual situation, be divided into as few as two short-term energy frames according to the frequency of the preset voice signal. For convenience, the rest of this description takes the division of an audio signal into multiple short-term energy frames as an example.
  • When the audio signal in step 101 is collected by the APP itself through the audio collection device, collection generally converts what is actually an analog signal into a digital signal at a certain sampling rate, that is, into an audio signal in pulse code modulation (PCM) format. The audio signal can therefore also be divided into a plurality of short-term energy frames according to the sampling rate of the audio signal and the frequency of the preset voice signal.
  • Specifically, the ratio m of the sampling rate of the audio signal to the frequency of the preset voice signal may be determined, and every m sampling points of the collected digital audio signal are then grouped into one short-term energy frame. If m is a positive integer, the audio signal may be divided into the maximum number of short-term energy frames according to m; if m is not a positive integer, m is first converted to a positive integer according to the rounding principle, and the audio signal is then divided into the maximum number of short-term energy frames.
  • If sampling points remain after the division, they may be discarded, or they may be treated as one more short-term energy frame for subsequent processing.
  • Here m indicates the number of sampling points of the audio signal acquired in step 101 that fall within one period of the preset voice signal.
  • For example, suppose the frequency of the preset voice signal is 82 Hz, the duration of the audio signal acquired in step 101 is 1 s, and the sampling rate is 16000 Hz. Then m = 16000/82 ≈ 195.1, which is not a positive integer, so 195.1 is converted to the positive integer 195 according to the rounding principle. From the duration and the sampling rate, the audio signal contains 16000 sampling points. The audio signal can therefore be divided into 82 short-term energy frames of 195 sampling points each, and the remaining 10 sampling points are discarded.
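The sample-count division described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_into_frames` and the `keep_remainder` flag are hypothetical, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for the unspecified "rounding principle".

```python
def split_into_frames(samples, sample_rate_hz, preset_freq_hz, keep_remainder=False):
    """Split samples into short-term energy frames of m samples each.

    m = round(sample_rate / preset_freq) is the number of samples that fall
    within one period of the preset voice signal (assumed rounding rule).
    Returns (frames, number_of_leftover_samples).
    """
    m = round(sample_rate_hz / preset_freq_hz)
    if m < 1:
        raise ValueError("frame length must be at least one sample")
    full = len(samples) // m                      # maximum number of full frames
    frames = [samples[k * m:(k + 1) * m] for k in range(full)]
    remainder = samples[full * m:]
    if keep_remainder and len(remainder) > 0:
        frames.append(remainder)                  # optional shorter last frame
    return frames, len(remainder)


# Worked example from the text: 1 s of audio at 16000 Hz, preset frequency 82 Hz.
samples = [0] * 16000
frames, leftover = split_into_frames(samples, 16000, 82)
print(len(frames), len(frames[0]), leftover)
```

With the numbers from the text this gives 82 frames of 195 samples and 10 leftover samples; passing `keep_remainder=True` keeps those 10 samples as an 83rd, shorter frame instead of discarding them.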
  • When the audio signal acquired in step 101 is one received from another APP or device, it may likewise be divided into a plurality of short-term energy frames by any of the above methods.
  • Note that the received audio signal may not be in PCM format. To divide it into short-term energy frames according to its sampling rate and the frequency of the preset voice signal, the received audio signal must first be converted to PCM format, and its sampling rate must be identified when it is received. The sampling rate can be identified by prior-art methods, which are not described here.
  • Step 103: Determine the energy of each short-term energy frame.
  • For an audio signal in PCM format, the energy of a short-term energy frame may be determined from the amplitude of the audio signal at each sampling point in the frame. Specifically, the energy of each sampling point is determined from the amplitude of the audio signal at that point, these energies are added, and the resulting sum is taken as the energy of the short-term energy frame.
  • The following formula can be used to determine the energy of a short-term energy frame: where i is the i-th sampling point of the audio signal; n is the number of sampling points contained in the short-term energy frame; and A_i[t] is the amplitude of the audio signal at the i-th sampling point. The amplitude in a short-term energy frame ranges from -32768 to 32767.
  • The amplitude obtained when the audio signal is collected may be divided by 32768 to give the normalized amplitude of the short-term energy frame, which then ranges from -1 to 1.
  • When the audio signal is not in PCM format, a function of amplitude over time can be determined from the amplitude at each moment of the short-term energy frame, and the square of that function is integrated; the result of the integration is the energy of the short-term energy frame.
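The energy formula itself is not reproduced in this text. A form consistent with the surrounding description (per-sample energies added up for PCM, the square of the amplitude function integrated otherwise) is the sum of squared amplitudes, E = Σ_{i=1..n} (A_i[t])². The sketch below assumes that form and 16-bit PCM input; the function names are hypothetical.

```python
def normalize_pcm16(raw_samples):
    """Map 16-bit PCM amplitudes (-32768..32767) to roughly -1..1."""
    return [s / 32768 for s in raw_samples]


def frame_energy(frame):
    """Energy of one short-term energy frame, ASSUMED to be the sum of
    squared amplitudes; the source text does not show the formula."""
    return sum(a * a for a in frame)


raw = [-32768, 0, 16384]
normalized = normalize_pcm16(raw)      # [-1.0, 0.0, 0.5]
print(frame_energy(normalized))        # 1.0 + 0.0 + 0.25 = 1.25
```

Under this assumption a 195-sample frame of normalized amplitudes has an energy between 0 and 195, which is the scale on which a threshold would be chosen.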
  • Step 104 Detect whether a voice signal is included in the audio signal according to energy of each short-term energy frame.
  • the following two methods may be used to determine whether a voice signal is detected in the audio signal:
  • Method 1: Determine the ratio of the number of short-term energy frames whose energy is greater than a preset threshold to the total number of short-term energy frames (hereinafter the high-energy frame ratio), and judge whether this ratio is greater than a preset ratio. If so, it is determined that the audio signal contains a voice signal; if not, it is determined that no voice signal is detected in the audio signal.
  • The preset threshold and the preset ratio may be set according to actual needs. For example, the preset threshold may be set to 2 and the preset ratio to 20%; when the high-energy frame ratio is greater than 20%, it is determined that the audio signal contains a voice signal.
  • Method 1 works because in real life there is always some noise in the environment when people talk, and noise generally has lower energy than speech. So if an audio signal contains short-term energy frames whose energy is higher than the preset threshold, and those frames make up a certain proportion of the audio signal, the audio signal may be considered to contain a voice signal.
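Method 1 reduces to counting frames above a threshold. The following is a minimal sketch that takes a list of per-frame energies as input; the function names are hypothetical, and the defaults (threshold 2, ratio 20%) are the example values from the text rather than recommended settings.

```python
def high_energy_ratio(energies, threshold):
    """Fraction of frames whose energy is strictly greater than threshold."""
    if not energies:
        return 0.0
    return sum(1 for e in energies if e > threshold) / len(energies)


def detect_method1(energies, threshold=2.0, preset_ratio=0.20):
    """Method 1: voice is present if the high-energy frame ratio
    exceeds the preset ratio."""
    return high_energy_ratio(energies, threshold) > preset_ratio


quiet = [0.1] * 8 + [3.0] * 2    # 20% high-energy frames: not greater than 20%
speech = [0.1] * 7 + [3.0] * 3   # 30% high-energy frames
print(detect_method1(quiet), detect_method1(speech))
```

Both comparisons are strict, following the wording "greater than" in the text, so a signal with exactly 20% high-energy frames is not classified as containing voice.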
  • Method 2: To make the final detection result more accurate, the high-energy frame ratio is determined as in Method 1 and compared with the preset ratio. If it is not greater than the preset ratio, it is determined that no voice signal is detected in the audio signal. If it is greater, then when the short-term energy frames whose energy is greater than the preset threshold include at least N consecutive short-term energy frames, it is determined that the audio signal contains a voice signal; when they do not include at least N consecutive short-term energy frames, it is determined that no voice signal is detected in the audio signal.
  • N can be any positive integer. In the embodiments of the present application, N can be set to 10.
  • Compared with Method 1, Method 2 adds one more condition for deciding whether the audio signal contains a voice signal: whether the short-term energy frames whose energy is greater than the preset threshold include at least N consecutive frames. This effectively reduces the influence of noise. In real life, noise has lower energy than human speech and is random, so Method 2 can exclude audio signals that contain excessive noise and reduce the influence of noise in the external environment.
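Method 2 can be sketched by adding a run-length check to the ratio test. As before, this is an illustration under assumptions: the input is a list of per-frame energies, the function names are hypothetical, and the defaults are the example values from the text (threshold 2, ratio 20%, N = 10).

```python
def longest_run_above(energies, threshold):
    """Length of the longest run of consecutive frames above threshold."""
    longest = current = 0
    for e in energies:
        current = current + 1 if e > threshold else 0
        longest = max(longest, current)
    return longest


def detect_method2(energies, threshold=2.0, preset_ratio=0.20, n_consecutive=10):
    """Method 2: the ratio test of Method 1, plus a requirement of at
    least n_consecutive high-energy frames in a row."""
    if not energies:
        return False
    high = sum(1 for e in energies if e > threshold)
    if high / len(energies) <= preset_ratio:
        return False
    return longest_run_above(energies, threshold) >= n_consecutive


sustained = [0.1] * 20 + [3.0] * 12 + [0.1] * 8   # 30% high, run of 12
scattered = [3.0, 0.1] * 20                       # 50% high, longest run 1
print(detect_method2(sustained), detect_method2(scattered))
```

The `scattered` input passes the ratio test but fails the run-length test, which is the case the extra condition is meant to reject: loud but random bursts rather than continuous speech.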
  • the above-mentioned voice signal detecting method can be applied to detecting a mono audio signal, a two-channel audio signal, or a multi-channel audio signal.
  • An audio signal collected through one channel is a mono audio signal, one collected through two channels is a two-channel audio signal, and one collected through multiple channels is a multi-channel audio signal.
  • For a mono audio signal, the operations in steps 101 to 104 can be performed directly on the audio signal, and the detection result is used as the final detection result.
  • For a two-channel or multi-channel audio signal, the audio signal of each channel may be detected according to the operations in steps 101 to 104, and whether the acquired audio signal contains a voice signal is then judged from the detection results of all channels.
  • Specifically, the audio signal of each channel is processed according to steps 101 to 104. If no channel's audio signal is detected to contain a voice signal, it is determined that the audio signal acquired in step 101 does not contain a voice signal. If the audio signal of at least one channel is detected to contain a voice signal, it is determined that the audio signal acquired in step 101 contains a voice signal.
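The per-channel rule amounts to a logical OR over channels. The sketch below assumes interleaved samples (channel 0, channel 1, channel 0, ...), which is a common PCM layout but is not specified in the text, and takes the single-channel detector as a caller-supplied function; all names are hypothetical.

```python
def deinterleave(samples, num_channels):
    """Split interleaved samples into one list per channel
    (interleaved layout is an assumption, not stated in the source)."""
    return [samples[c::num_channels] for c in range(num_channels)]


def detect_any_channel(channels, detect_channel):
    """Voice is present if at least one channel is detected to contain it;
    absent only if every channel is detected not to contain it."""
    return any(detect_channel(ch) for ch in channels)


# Toy single-channel detector for illustration only: any sample above 0.5.
def toy_detector(channel):
    return any(abs(s) > 0.5 for s in channel)


stereo = [0.0, 0.9, 0.1, 0.8, 0.0, 0.7]          # left is quiet, right is loud
left, right = deinterleave(stereo, 2)
print(detect_any_channel([left, right], toy_detector))
```

In practice `detect_channel` would run steps 101 to 104 on one channel; the toy detector here only makes the example self-contained.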
  • The frequency of the preset voice signal mentioned in step 102 may be the frequency of any voice; this application does not limit it. In practice, different preset frequencies may be set for different audio signals acquired in step 101 according to the actual situation. Note that the preset frequency may be that of any voice, such as a soprano or a bass, as long as the resulting short-term energy frames satisfy the following condition: the duration of a short-term energy frame is not less than the period of the audio signal acquired in step 101.
  • The frequency of the preset voice signal can be set to the minimum vocal frequency, that is, 82 Hz. Since the period is the reciprocal of the frequency, if the frequency of the preset voice signal is the minimum vocal frequency, the period of the preset voice signal is the maximum vocal period. Therefore, whatever the period of the audio signal acquired in step 101, the duration of a short-term energy frame is not less than that period.
  • The duration of a short-term energy frame must be not less than the period of the audio signal acquired in step 101 because the detection method provided by the embodiments of the present application detects whether the audio signal contains a voice signal based on the characteristics of human speech: compared with noise, human speech has higher energy and is more stable and more continuous. If the duration of a short-term energy frame is smaller than the period of the audio signal acquired in step 101, the waveform corresponding to the short-term energy frame does not contain a complete period, and the duration of the short-term energy frame is relatively short.
  • the duration of the audio signal acquired in step 101 should be greater than the maximum period of a human voice.
  • The voice signal detection method provided by the embodiments of the present application is particularly suitable for a scenario in which a chat APP sends a voice message without any click by the user. The method is described in detail below for this scenario. In this scenario, the specific flow of the method is shown in FIG. 2 and includes the following steps:
  • Step 201: Collect an audio signal in real time.
  • In this scenario, the sending of a voice message can be completed without any click operation.
  • Recording of the external environment can start and continue without interruption, and the audio signal is collected in real time to avoid missing the user's words.
  • The audio signal can be saved locally in real time.
  • the app stops recording.
  • Step 202 The audio signal of the preset duration is intercepted from the collected audio signal in real time.
  • the APP can intercept the audio signal of the preset duration in the audio signal collected in step 201 in real time, and perform subsequent detection on the audio signal of the preset duration.
  • The audio signal of the preset duration intercepted this time is referred to below as the current audio signal, and the audio signal of the preset duration intercepted the previous time is referred to as the previously acquired audio signal.
  • Step 203 The audio signal of the preset duration is divided into a plurality of short-term energy frames according to the frequency of the preset voice signal.
  • Step 204 determining the energy of each short-term energy frame.
  • Step 205 Detect whether a voice signal is included in the audio signal of the preset duration according to the energy of each short-term energy frame.
  • If the current audio signal is detected to contain a voice signal, it may be determined whether the previously acquired audio signal contains a voice signal. If the previously acquired audio signal does not contain a voice signal, the starting point of the current audio signal may be determined as the starting point of the voice signal; if the previously acquired audio signal does contain a voice signal, the starting point of the current audio signal is not the starting point of the voice signal.
  • If the current audio signal is detected not to contain a voice signal, it may likewise be determined whether the previously acquired audio signal contains a voice signal. If it does, the end point of the previously acquired audio signal may be determined as the end point of the voice signal; if it does not, neither the end point of the current audio signal nor that of the previously acquired audio signal is the end point of the voice signal.
  • For example, let A, B, C, and D be four adjacent audio signals of the preset duration, where A and D do not contain a voice signal and B and C do. The starting point of B can then be determined as the starting point of the voice signal, and the end point of C as the end point of the voice signal.
  • In practice, the current audio signal may happen to be just the beginning or the end of a sentence spoken by the user and so contain only a small part of the voice signal. In that case the APP may incorrectly determine that the audio signal does not contain a voice signal. To keep such a misjudgment from cutting off part of the user's speech, after detecting that the current audio signal contains a voice signal, it may be determined whether the previously acquired audio signal contains a voice signal; if it does not, the starting point of the previously acquired audio signal can be determined as the starting point of the voice signal.
  • Likewise, after detecting that the current audio signal does not contain a voice signal, it may be determined whether the previously acquired audio signal contains a voice signal; if it does, the end point of the current audio signal may be determined as the end point of the voice signal.
  • In the example above, the starting point of A can then be determined as the starting point of the voice signal, and the end point of D as the end point of the voice signal.
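Both endpoint rules can be expressed over a list of per-window detection results. The sketch below is one possible reading, with hypothetical names: `window_flags[i]` is True when window i was detected to contain voice, and `extend=True` selects the more conservative rule that also includes the non-voice window on each side.

```python
def voice_segments(window_flags, extend=False):
    """Return (first_window, last_window) index pairs, inclusive.

    extend=False: the segment starts at the first voice window and ends at
                  the last voice window (B..C in the A, B, C, D example).
    extend=True:  the segment also includes the neighbouring non-voice
                  window on each side (A..D in the example).
    """
    segments = []
    start = None
    for i, has_voice in enumerate(window_flags):
        prev = window_flags[i - 1] if i > 0 else False
        if has_voice and not prev:
            start = max(i - 1, 0) if extend else i
        elif not has_voice and prev:
            segments.append((start, i if extend else i - 1))
            start = None
    if start is not None:                 # voice still running at the end
        segments.append((start, len(window_flags) - 1))
    return segments


flags = [False, True, True, False]        # windows A, B, C, D
print(voice_segments(flags))              # B..C
print(voice_segments(flags, extend=True)) # A..D
```

With `extend=True`, two utterances separated by a single non-voice window produce segments that share that window; whether to merge them is a design choice the text does not address.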
  • The audio signal may be sent to a voice recognition device, which performs voice processing on the audio signal to obtain a voice result and then sends the audio signal to a subsequent processing device, which ultimately transmits the audio signal as a voice message.
  • The APP may send all the audio signals between the determined starting point and end point of the voice signal to the voice recognition device, and then send the voice recognition device an audio termination signal informing it that the user's current utterance is complete, so that the voice recognition device sends these audio signals together to the subsequent processing device and the audio signals are finally sent out as a voice message.
  • After the current audio signal is acquired, a sub-signal of a preset time period is intercepted from the previously acquired audio signal, and the current audio signal and the intercepted sub-signal are spliced. The resulting audio signal (hereinafter the spliced audio signal) is used as the acquired audio signal, and subsequent voice signal detection is performed on the spliced audio signal.
  • The sub-signal can be spliced before the current audio signal. The preset time period can be the tail period of the previously acquired audio signal, and its duration may be any length of time.
  • The duration of the preset time period may be set to no more than the product of the duration of the spliced audio signal and the preset ratio.
  • If the spliced audio signal is detected to contain a voice signal, it may be determined whether the previously obtained spliced audio signal contains a voice signal; if it does not, the starting point of the current spliced audio signal may be taken as the starting point of the voice signal. If the spliced audio signal is detected not to contain a voice signal, it may be determined whether the previously obtained spliced audio signal contains a voice signal; if it does, the end point of the current spliced audio signal may be taken as the end point of the voice signal.
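The splicing step can be sketched as prepending the tail of the previous window to the current one, together with a check of the duration limit stated above. The names are hypothetical and lengths are measured in samples, which assumes both windows share one sampling rate.

```python
def splice_with_tail(previous_window, current_window, tail_len):
    """Prepend the last tail_len samples of the previous window to the
    current window and return the spliced audio signal."""
    tail = previous_window[-tail_len:] if tail_len > 0 else []
    return list(tail) + list(current_window)


def tail_within_limit(tail_len, current_len, preset_ratio):
    """Check the stated limit: tail duration is no more than the spliced
    duration multiplied by the preset ratio."""
    return tail_len <= preset_ratio * (tail_len + current_len)


previous = [1, 2, 3, 4, 5]
current = [6, 7, 8]
print(splice_with_tail(previous, current, 2))     # [4, 5, 6, 7, 8]
print(tail_within_limit(1000, 16000, 0.20))       # 1000 <= 3400
```

One plausible reading of the limit, not stated in the text, is that it keeps the carried-over tail from exceeding the preset ratio on its own, so the tail alone cannot cause the spliced signal to be classified as voice.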
  • The APP may record without interruption, or it may record periodically.
  • the voice signal detecting method provided by the embodiment of the present application can also be implemented by a voice signal detecting device.
  • The specific structure of the device is shown in FIG. 4 and mainly includes the following modules:
  • Obtaining module 41 acquiring an audio signal
  • the dividing module 42 divides the audio signal into a plurality of short-term energy frames according to a frequency of the preset voice signal
  • a determining module 43 determining the energy of each short-term energy frame
  • the detecting module 44 detects whether the audio signal includes a voice signal according to the energy of each short-term energy frame.
  • The acquiring module 41 acquires a current audio signal, intercepts a sub-signal of a preset time period from the previously acquired audio signal, and splices the current audio signal and the intercepted sub-signal to form the acquired audio signal.
  • The dividing module 42 determines the period of the preset voice signal according to the frequency of the preset voice signal and divides the audio signal into a plurality of short-term energy frames of the same duration according to the determined period.
  • The detecting module 44 determines the ratio of the number of short-term energy frames whose energy is greater than a preset threshold to the total number of short-term energy frames;
  • when the short-term energy frames whose energy is greater than the preset threshold do not include at least N consecutive short-term energy frames, it is determined that no voice signal is detected in the audio signal.
  • the voice signal detection method used in the embodiment of the present application does not need to perform complex calculation such as Fourier transform.
  • The voice signal detection method provided by the embodiments of the present application can solve the problem that prior-art voice signal detection methods are slow and consume a large amount of resources.
  • The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media include both persistent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • As defined here, computer readable media do not include transitory computer readable media, such as modulated data signals and carrier waves.
  • embodiments of the present application can be provided as a method, system, or computer program product.
  • the present application can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment in combination of software and hardware.
  • the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Circuits Of Receivers In General (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Electric Clocks (AREA)
  • Time-Division Multiplex Systems (AREA)

Abstract

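A voice signal detection method and apparatus, used to solve the problems of slow processing speed and high resource consumption in prior-art voice signal detection methods. The method comprises: obtaining an audio signal (101); dividing the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal (102); determining the energy of each short-time energy frame (103); and detecting, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal (104).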

Description

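Voice signal detection method and apparatus
Technical Field
The present application relates to the field of computer technologies, and in particular to a voice signal detection method and apparatus.
Background
In daily life, people often use smart devices (for example, smartphones and tablet computers) to send voice messages. However, when sending a voice message with a smart device, a user usually has to tap a start or end button on the screen before the message can be sent, and these tap operations cause the user considerable inconvenience.
If the user is to send a voice message without tapping any button, the smart device needs to record continuously or at a preset interval and determine whether the obtained audio signal contains a voice signal. If it does, the voice signal is extracted, processed further, and sent, which completes the sending of the voice message.
In the prior art, voice signal detection methods such as the dual-threshold method, detection based on the autocorrelation maximum, and detection based on the wavelet transform are generally used to detect whether an obtained audio signal contains a voice signal. However, these methods generally obtain frequency characteristics of the audio information through complex computations such as the Fourier transform and then determine from those characteristics whether a voice signal is present. They need to process a large amount of buffered data, occupy a large amount of memory, involve heavy computation, process slowly, and consume considerable power.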
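Summary
Embodiments of the present application provide a voice signal detection method and apparatus, to solve the problems of slow processing speed and high resource consumption in prior-art voice signal detection methods.
The embodiments of the present application adopt the following technical solutions:
A voice signal detection method, the method comprising:
obtaining an audio signal;
dividing the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal;
determining the energy of each short-time energy frame; and
detecting, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal.
A voice signal detection apparatus, the apparatus comprising:
an obtaining module, configured to obtain an audio signal;
a division module, configured to divide the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal;
a determining module, configured to determine the energy of each short-time energy frame; and
a detection module, configured to detect, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal.
At least one of the above technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:
Compared with prior-art detection methods that determine whether an audio signal contains a voice signal through complex computations such as the Fourier transform, the voice signal detection method adopted in the embodiments of the present application requires no such computations. The obtained audio signal is divided into a plurality of short-time energy frames according to the frequency of a preset voice signal, the energy of each short-time energy frame is determined, and whether the obtained audio signal contains a voice signal is detected from those energies. The voice signal detection method provided in the embodiments of the present application can therefore solve the problems of slow processing speed and high resource consumption in prior-art voice signal detection methods.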
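Brief Description of the Drawings
The accompanying drawings described here are provided for a further understanding of the present application and constitute a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an undue limitation on it. In the drawings:
FIG. 1 is a detailed flowchart of a voice signal detection method according to an embodiment of the present application;
FIG. 2 is a detailed flowchart of another voice signal detection method according to an embodiment of the present application;
FIG. 3 is a diagram showing audio signals of a preset duration according to an embodiment of the present application;
FIG. 4 is a detailed schematic structural diagram of a voice signal detection apparatus according to an embodiment of the present application.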
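Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Clearly, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The technical solutions provided in the embodiments of the present application are described in detail below with reference to the drawings.
To solve the problems of slow processing speed and high resource consumption in prior-art voice signal detection methods, an embodiment of the present application provides a voice signal detection method.
The method may be executed by, but is not limited to, a user terminal such as a mobile phone, a tablet computer, or a personal computer (PC), by an application (APP) running on such a user terminal, or by a device such as a server.
For ease of description, the implementation of the method is described below using an example in which the method is executed by an APP. It can be understood that this is only an illustrative example and should not be construed as a limitation on the method.
The flow of the method is shown in FIG. 1 and includes the following steps:
Step 101: Obtain an audio signal.
The audio signal may be one collected by the APP through an audio collection device, or one received by the APP, for example an audio signal transmitted by another APP or device. The embodiments of the present application impose no limitation on this. After obtaining the audio signal, the APP may store it locally.
The present application likewise imposes no limitation on the sampling rate, duration, format, or channels of the audio signal.
The APP may be of any type, such as a chat APP or a payment APP, provided that it can obtain an audio signal and can apply the voice signal detection method provided in the embodiments of the present application to that signal.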
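Step 102: Divide the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal.
A short-time energy frame is in fact a portion of the audio signal obtained in step 101.
Specifically, the period of the preset voice signal may be determined from its frequency, and the audio signal obtained in step 101 may then be divided, according to that period, into a plurality of short-time energy frames whose durations all equal the period. For example, if the period of the preset voice signal is 0.01 s, the audio signal obtained in step 101 may be divided, according to its duration, into several short-time energy frames of 0.01 s each. It should be noted that, depending on the actual situation, the audio signal obtained in step 101 may also be divided into at least two short-time energy frames according to the frequency of the preset voice signal. For ease of description, the remainder of the embodiments uses division into a plurality of short-time energy frames as the example.
In addition, when the APP itself collects the audio signal through an audio collection device in step 101, collection generally converts the audio signal, which is actually an analog signal, into a digital signal at a certain sampling rate, that is, an audio signal in pulse code modulation (PCM) format. The audio signal may therefore also be divided into a plurality of short-time energy frames according to its sampling rate and the frequency of the preset voice signal.
Specifically, the ratio m of the sampling rate of the audio signal to the frequency of the preset voice signal may be determined, and every m sampling points of the collected digital audio signal are then grouped into one short-time energy frame. If m is a positive integer, the audio signal is divided into the maximum number of short-time energy frames according to m. If m is not a positive integer, it is first rounded to the nearest positive integer (rounding half up), and the audio signal is then divided into the maximum number of short-time energy frames according to the rounded value. It should be specially noted that if the number of sampling points in the audio signal obtained in step 101 is not an integer multiple of m, then after the audio signal is divided into the maximum number of short-time energy frames, the remaining sampling points may either be discarded or be treated as one more short-time energy frame for subsequent processing. Here m represents the number of sampling points that the audio signal obtained in step 101 contains within one period of the preset voice signal.
For example, if the frequency of the preset voice signal is 82 Hz, and the audio signal obtained in step 101 has a duration of 1 s and a sampling rate of 16000 Hz, then m = 16000/82 ≈ 195.1. Because m is not a positive integer, 195.1 is rounded to 195. From the duration and sampling rate, the audio signal contains 16000 sampling points. Because 16000 is not an integer multiple of 195, the audio signal can be divided into 82 short-time energy frames and the remaining 10 sampling points discarded. Each short-time energy frame contains 195 sampling points.
When the audio signal obtained in step 101 is one received from another APP or device, either of the above methods may be used to divide it into a plurality of short-time energy frames. It should be specially noted that such an audio signal may not be in PCM format. If the short-time energy frames are to be divided according to the sampling rate of the audio signal and the frequency of the preset voice signal, the received audio signal must first be converted into PCM format. In addition, the sampling rate of the received audio signal must be identified. Any prior-art method may be used to identify the sampling rate, and the details are not repeated here.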
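The sample-count division described in step 102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `split_into_frames` and the `keep_remainder` flag are hypothetical, and the samples are assumed to be a plain Python list of PCM values.

```python
def split_into_frames(samples, sample_rate, preset_freq=82.0, keep_remainder=False):
    """Split PCM samples into short-time energy frames of m samples each.

    m is the number of samples in one period of the preset voice signal,
    i.e. sample_rate / preset_freq, rounded half up to an integer.
    """
    m = int(sample_rate / preset_freq + 0.5)  # round half up, not banker's rounding
    if m <= 0:
        raise ValueError("sample_rate / preset_freq must round to at least 1")
    full = len(samples) // m  # maximum number of complete frames
    frames = [samples[k * m:(k + 1) * m] for k in range(full)]
    remainder = samples[full * m:]
    if keep_remainder and len(remainder) > 0:
        frames.append(remainder)  # leftover samples kept as a final, shorter frame
    return frames


# The worked example from the text: 1 s of audio at 16000 Hz, preset frequency 82 Hz.
frames = split_into_frames([0] * 16000, sample_rate=16000)
print(len(frames), len(frames[0]))  # prints: 82 195
```

With `keep_remainder=True` the same input yields 83 frames, the last one holding the 10 leftover samples, which corresponds to the second option the text allows.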
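Step 103: Determine the energy of each short-time energy frame.
In the embodiments of the present application, when the PCM audio signal is divided by the above method into several short-time energy frames that are also in PCM format, the energy of a short-time energy frame can be determined from the amplitude of the audio signal at each sampling point in the frame. Specifically, the energy of each sampling point may be determined from the amplitude of the audio signal at that point, these energies are then added, and the resulting sum is taken as the energy of the short-time energy frame.
For example, the following formula may be used to determine the energy of a short-time energy frame: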
Figure PCTCN2017103489-appb-000001
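where i denotes the i-th sampling point of the audio signal; n is the number of sampling points contained in the short-time energy frame; and Ai[t] is the amplitude of the audio signal at the i-th sampling point. The amplitude of a short-time energy frame ranges from -32768 to 32767.
In addition, to simplify computation and save resources, the amplitude obtained when the audio signal is collected may be divided by 32768, and the result used as the normalized amplitude of the short-time energy frame. The normalized amplitude then ranges from -1 to 1.
If a short-time energy frame is not in PCM format, a function describing the amplitude may be determined from the amplitude of the frame at each moment, and the square of that function integrated. The result of the integration is the energy of the short-time energy frame.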
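The formula image is not reproduced in this text, so the exact expression is not available here. The sketch below assumes the simplest form consistent with the surrounding description (the energy of a sampling point is its squared amplitude, and the frame energy is the sum over the n sampling points). The function name `frame_energy` and the `normalize` flag are hypothetical.

```python
def frame_energy(frame, normalize=True):
    """Sum of squared amplitudes over one short-time energy frame.

    Assumption: per-sample energy is the squared amplitude; the source only
    says per-sample energies are determined from amplitudes and then added.
    With normalize=True, 16-bit PCM amplitudes (-32768..32767) are divided
    by 32768 first, giving normalized amplitudes in the range -1..1.
    """
    scale = 32768.0 if normalize else 1.0
    return sum((a / scale) ** 2 for a in frame)


print(frame_energy([16384, -16384]))          # two samples at half scale
print(frame_energy([3, 4], normalize=False))  # raw amplitudes
```

Whether the preset threshold used later is meant for raw or normalized amplitudes is not stated explicitly in the text, so any threshold chosen has to match the scaling used here.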
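Step 104: Detect, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal.
Specifically, either of the following two methods may be used to determine whether a voice signal is detected in the audio signal:
Method 1: Determine the ratio of the number of short-time energy frames whose energy is greater than a preset threshold to the total number of short-time energy frames (hereinafter the high-energy frame ratio), and determine whether this ratio is greater than a preset ratio. If it is, it is determined that a voice signal is detected in the audio signal. If it is not, it is determined that no voice signal is detected in the audio signal.
The preset threshold and the preset ratio may be set according to actual needs. In the embodiments of the present application, the preset threshold may be set to 2 and the preset ratio to 20%. If the high-energy frame ratio is greater than 20%, it is determined that a voice signal is detected in the audio signal; otherwise, it is determined that no voice signal is detected.
Method 1 can be used for this purpose because, in real life, some noise is always present in the environment when people speak, and noise generally has lower energy than speech. If an audio signal contains short-time energy frames whose energy exceeds the preset threshold, and those frames make up a certain proportion of the signal, the audio signal can be considered to contain a voice signal.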
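Method 1 reduces to counting frames above the threshold and comparing a ratio. The following is a minimal sketch; the function names are hypothetical, and the defaults (threshold 2, preset ratio 20%) are the example values given in the text.

```python
def high_energy_ratio(energies, threshold=2.0):
    """Fraction of short-time energy frames whose energy exceeds the threshold."""
    if not energies:
        return 0.0
    return sum(1 for e in energies if e > threshold) / len(energies)


def contains_voice_method1(energies, threshold=2.0, preset_ratio=0.2):
    """Method 1: voice is detected when the high-energy frame ratio exceeds the preset ratio."""
    return high_energy_ratio(energies, threshold) > preset_ratio


# 3 of 10 frames are above the threshold, so the ratio 0.3 exceeds 0.2.
print(contains_voice_method1([3.0, 3.0, 3.0] + [0.0] * 7))
```

Note that the comparison is strict, following the wording "greater than the preset ratio": a signal with exactly 20% high-energy frames is not reported as containing voice.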
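Method 2: To make the final detection result more accurate, the high-energy frame ratio may be determined as in Method 1 and compared with the preset ratio. If the ratio is not greater than the preset ratio, it is determined that no voice signal is detected in the audio signal. If it is greater, then a voice signal is determined to be detected when at least N consecutive short-time energy frames exist among the frames whose energy is greater than the preset threshold, and no voice signal is determined to be detected when no such N consecutive frames exist. N may be any positive integer. In the embodiments of the present application, N may be set to 10.
In other words, Method 2 adds one condition to Method 1 for deciding whether the audio signal contains a voice signal: whether at least N consecutive short-time energy frames exist among the frames whose energy is greater than the preset threshold. This effectively reduces the influence of noise. Because noise in real life has lower energy than human speech and is random, Method 2 can effectively rule out audio signals that consist largely of noise and reduce the influence of noise in the external environment.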
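Method 2 can be sketched by adding a run-length check to the ratio test. The names below are hypothetical, and "at least N consecutive frames" is read here as "the longest run of adjacent above-threshold frames is at least N", which is an interpretation of the text.

```python
def longest_run_above(energies, threshold):
    """Length of the longest run of adjacent frames with energy above the threshold."""
    longest = current = 0
    for e in energies:
        current = current + 1 if e > threshold else 0
        longest = max(longest, current)
    return longest


def contains_voice_method2(energies, threshold=2.0, preset_ratio=0.2, n=10):
    """Method 2: the ratio test of Method 1, plus at least n consecutive high-energy frames."""
    if not energies:
        return False
    high = sum(1 for e in energies if e > threshold)
    if high / len(energies) <= preset_ratio:
        return False  # fails the Method 1 condition
    return longest_run_above(energies, threshold) >= n


bursty = [3.0, 0.0] * 20                            # half the frames are loud, but never adjacent
sustained = [0.0] * 20 + [3.0] * 12 + [0.0] * 8     # 30% loud, in one run of 12
print(contains_voice_method2(bursty), contains_voice_method2(sustained))
```

The bursty input passes the ratio test but has no run longer than one frame, which is the kind of random, non-continuous signal the extra condition is meant to reject.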
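It should be specially noted that the voice signal detection method provided in the embodiments of the present application is applicable to mono, two-channel, and multi-channel audio signals. An audio signal collected through one channel is a mono audio signal, one collected through two channels is a two-channel audio signal, and one collected through more than two channels is a multi-channel audio signal.
When the method shown in FIG. 1 is used to detect two-channel and multi-channel audio signals, the operations of steps 101 to 104 may be performed separately on the audio signal of each channel, and whether the obtained audio signal contains a voice signal is finally decided from the detection results for all channels.
Specifically, if the audio signal obtained in step 101 is a mono audio signal, the operations of steps 101 to 104 are performed on it directly, and the detection result is taken as the final result.
If the audio signal obtained in step 101 is not mono but two-channel or multi-channel, the audio signal of each channel is processed separately according to steps 101 to 104. If no channel is detected to contain a voice signal, it is determined that the audio signal obtained in step 101 does not contain a voice signal. If at least one channel is detected to contain a voice signal, it is determined that the audio signal obtained in step 101 contains a voice signal.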
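The per-channel rule above is a logical OR over the channels. In the sketch below, `contains_voice_multichannel` is a hypothetical name, `detect` stands for whatever single-channel detector is used (steps 101 to 104), and the peak-based `toy_detect` is only a stand-in so the example runs on its own.

```python
def contains_voice_multichannel(channels, detect):
    """Return True if at least one channel is detected as containing voice.

    channels: list of per-channel sample sequences (one entry for mono audio).
    detect:   single-channel detector, a callable taking samples and returning bool.
    """
    return any(detect(channel) for channel in channels)


def toy_detect(channel):
    # Stand-in detector for illustration only, not the method from the text.
    return max(abs(x) for x in channel) > 100


print(contains_voice_multichannel([[0, 1], [0, 500]], toy_detect))
print(contains_voice_multichannel([[0, 1], [2, 3]], toy_detect))
```

Because `any` stops at the first channel that reports voice, the remaining channels need not be processed once one positive result is found.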
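In addition, the frequency of the preset voice signal mentioned in step 102 may be the frequency of any voice, and the present application imposes no limitation on it. In practice, different preset voice signal frequencies may be set for different audio signals obtained in step 101, depending on the actual situation. It should be specially noted that, whichever voice signal the preset frequency corresponds to, for example the frequency of a soprano or of a bass, it suffices that the resulting short-time energy frames satisfy the following condition: the duration of a short-time energy frame is not less than the period of the audio signal obtained in step 101. To achieve good detection results while saving resources and increasing processing speed as far as possible, the frequency of the preset voice signal may be set in the embodiments of the present application to the minimum human voice frequency, namely 82 Hz. Because the period is the reciprocal of the frequency, if the preset frequency is the minimum human voice frequency, the period of the preset voice signal is the maximum human voice period. Therefore, whatever the period of the audio signal obtained in step 101, the duration of a short-time energy frame is not less than that period.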
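The reciprocal argument above can be written out as a short worked equation. The symbols f_min, T_max, f, and T are introduced here for illustration and are not notation from the source.

```latex
T_{\max} = \frac{1}{f_{\min}} = \frac{1}{82\ \text{Hz}} \approx 0.0122\ \text{s},
\qquad
f \ge f_{\min} \;\Longrightarrow\; T = \frac{1}{f} \le T_{\max}.
```

So a frame lasting one period of the 82 Hz preset signal, about 12.2 ms, is at least as long as one period of any voice at or above that frequency. At a sampling rate of 16000 Hz this corresponds to the roughly 195 sampling points per frame used in the earlier example.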
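It should be specially noted that the reason the duration of each short-time energy frame must be not less than the period of the audio signal obtained in step 101 is that the detection method provided in the embodiments of the present application detects whether an audio signal contains a voice signal based on the characteristics of human speech. Compared with noise, human speech has higher energy and is more stable and continuous. If the duration of a short-time energy frame is less than the period of the audio signal obtained in step 101, the waveform of the frame does not contain one complete period, and the frame is relatively short. In that case, even if the high-energy frame ratio is greater than the preset ratio and at least N consecutive frames exist among the frames whose energy exceeds the preset threshold, this only shows that the audio signal contains a sound signal; it does not show that the sound signal is a voice signal. Therefore, in the embodiments of the present application, the duration of the audio signal obtained in step 101 should be greater than one maximum human voice period.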
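In addition, the voice signal detection method provided in the embodiments of the present application is particularly suitable for the scenario in which a chat APP sends voice messages without the user performing any tap operation. The method is described in detail below for this scenario. The flow of the method in this scenario is shown in FIG. 2 and includes the following steps:
Step 201: Collect an audio signal in real time.
If the user wants the chat APP to send voice messages without any tap operation once it is opened, then after the user opens the APP, the APP may start recording the external environment continuously and collecting the audio signal in real time, so as to avoid missing anything the user says. After the audio signal is collected, it may be stored locally in real time. When the user closes the APP, the APP stops recording.
Step 202: Intercept, in real time, an audio signal of a preset duration from the collected audio signal.
If the APP records continuously but does not detect voice signals in real time, the timeliness of voice messages is poor. The APP may therefore intercept, in real time, an audio signal of a preset duration from the audio signal collected in step 201 and perform subsequent detection on it.
The audio signal of the preset duration that is currently intercepted may be called the current audio signal, and the one intercepted the previous time may be called the previously obtained audio signal.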
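Step 202 amounts to cutting a continuous recording into fixed-length segments as samples arrive. The generator below is a minimal sketch under the assumption that the recorder delivers samples in blocks of arbitrary size; `fixed_length_segments` and its parameters are hypothetical names.

```python
def fixed_length_segments(stream, segment_len):
    """Yield consecutive segments of exactly segment_len samples from a block stream.

    stream:      iterable of sample blocks, as a recorder callback might deliver them.
    segment_len: preset duration expressed in samples (duration * sampling rate).
    Samples that do not yet fill a whole segment stay buffered until more arrive.
    """
    buffer = []
    for block in stream:
        buffer.extend(block)
        while len(buffer) >= segment_len:
            yield buffer[:segment_len]
            buffer = buffer[segment_len:]


blocks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(list(fixed_length_segments(blocks, 4)))  # prints: [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Each yielded segment plays the role of the "current audio signal", and the segment yielded just before it is the "previously obtained audio signal".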
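Step 203: Divide the audio signal of the preset duration into a plurality of short-time energy frames according to the frequency of the preset voice signal.
Step 204: Determine the energy of each short-time energy frame.
Step 205: Detect, according to the energy of each short-time energy frame, whether the audio signal of the preset duration contains a voice signal.
If the current audio signal is detected to contain a voice signal, it is then determined whether the previously obtained audio signal contains a voice signal. If the previously obtained audio signal does not contain a voice signal, the start point of the current audio signal may be determined as the start point of the voice signal. If the previously obtained audio signal does contain a voice signal, the start point of the current audio signal is not the start point of the voice signal.
If the current audio signal is detected not to contain a voice signal, it is then determined whether the previously obtained audio signal contains a voice signal. If it does, the end point of the previously obtained audio signal may be determined as the end point of the voice signal. If it does not, neither the end point of the current audio signal nor that of the previously obtained audio signal is the end point of the voice signal.
For example, as shown in FIG. 3, A, B, C, and D are four adjacent audio signals of the preset duration. A and D do not contain a voice signal, whereas B and C do. The start point of B may then be determined as the start point of the voice signal, and the end point of C as its end point.
Sometimes the current audio signal happens to be the beginning or end of a sentence and contains relatively little of the voice signal. In this case the APP may wrongly judge that the audio signal does not contain a voice signal. To avoid missing part of what the user says because of such a misjudgment, after the current audio signal is detected to contain a voice signal, it may be determined whether the previously obtained audio signal contains a voice signal, and if it does not, the start point of the previously obtained audio signal may be determined as the start point of the voice signal. Likewise, after the current audio signal is detected not to contain a voice signal, it may be determined whether the previously obtained audio signal contains a voice signal, and if it does, the end point of the current audio signal may be determined as the end point of the voice signal. In the example above, the start point of A would be determined as the start point of the voice signal, and the end point of D as its end point.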
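Both endpoint rules can be expressed as a scan over the per-segment detection results. The sketch below is illustrative: `find_voice_spans` and the `padded` flag are hypothetical names, and spans are reported as inclusive (first segment, last segment) index pairs rather than sample positions.

```python
def find_voice_spans(flags, padded=False):
    """Locate voice spans from per-segment detection results.

    flags[k] is True when segment k was detected as containing voice.
    padded=False: a span starts at the first voiced segment and ends at the last one.
    padded=True:  a span starts one segment earlier and ends one segment later,
                  to tolerate a misjudged sentence beginning or ending.
    """
    spans = []
    start = None
    previous = False
    for k, current in enumerate(flags):
        if current and not previous:
            start = k - 1 if (padded and k > 0) else k
        elif previous and not current:
            end = k if padded else k - 1
            spans.append((start, end))
            start = None
        previous = current
    if start is not None:
        spans.append((start, len(flags) - 1))  # voice still ongoing at the last segment
    return spans


# Segments A, B, C, D from the FIG. 3 example: only B and C contain voice.
flags = [False, True, True, False]
print(find_voice_spans(flags))               # prints: [(1, 2)]
print(find_voice_spans(flags, padded=True))  # prints: [(0, 3)]
```

The first result corresponds to "start of B, end of C", and the padded result to "start of A, end of D".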
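After the APP detects that the current audio signal contains a voice signal, it may send the audio signal to a speech recognition apparatus so that the apparatus can perform speech processing on it and obtain a speech result. The speech recognition apparatus then sends the audio signal to a subsequent processing apparatus, and the audio signal is finally sent out as a voice message. To ensure that the user's speech in the voice message is a complete sentence, the APP may, after sending all audio signals between the determined start point and end point of the voice signal to the speech recognition apparatus, send it an audio termination signal to indicate that the sentence the user is currently speaking has ended. The speech recognition apparatus then sends these audio signals together to the subsequent processing apparatus, and they are finally sent out as a voice message.
In addition, to avoid misjudgments as far as possible, after the current audio signal is obtained, a sub-signal of a preset time period may be intercepted from the previously obtained audio signal, and the current audio signal and the intercepted sub-signal may be spliced together to serve as the obtained audio signal (hereinafter the spliced audio signal). Subsequent voice signal detection is then performed on the spliced audio signal.
The sub-signal may be spliced in front of the current audio signal. The preset time period may be the tail period of the previously obtained audio signal, and its duration may be any duration. To make the final detection result more accurate, in the embodiments of the present application the duration of the preset time period may be set to be no greater than the product of the duration of the spliced audio signal and the preset ratio.
After the spliced audio signal is detected to contain a voice signal, it may be determined whether the previously obtained spliced audio signal contains a voice signal. If it does not, the start point of the current spliced audio signal may be taken as the start point of the voice signal. After the spliced audio signal is detected not to contain a voice signal, it may likewise be determined whether the previously obtained spliced audio signal contains a voice signal. If it does, the end point of the current spliced audio signal may be taken as the end point of the voice signal.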
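The splicing step and its length constraint can be sketched as below. The function names are hypothetical. The bound follows from the stated condition: if the tail has t samples, the current signal c samples, and the preset ratio is r, then t ≤ (t + c)·r, which rearranges to t ≤ c·r / (1 − r).

```python
def max_tail_length(current_len, preset_ratio=0.2):
    """Largest tail length t (in samples) satisfying t <= (t + current_len) * preset_ratio."""
    if not 0.0 <= preset_ratio < 1.0:
        raise ValueError("preset_ratio must be in [0, 1)")
    return int(current_len * preset_ratio / (1.0 - preset_ratio))


def splice_with_previous_tail(previous, current, tail_len):
    """Prepend the last tail_len samples of the previous signal to the current signal."""
    tail = previous[-tail_len:] if tail_len > 0 else []
    return list(tail) + list(current)


print(splice_with_previous_tail([1, 2, 3, 4, 5], [6, 7], 2))  # prints: [4, 5, 6, 7]
print(max_tail_length(300, preset_ratio=0.25))                # prints: 100
```

One reading of why the tail is capped this way: if the tail alone cannot reach the preset ratio of the spliced signal, speech carried over from the previous segment cannot by itself make the spliced signal pass the high-energy ratio test. The source states only that the cap makes the result more accurate, so this rationale is an interpretation.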
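In the embodiments of the present application, besides recording continuously, the APP may also record periodically. The embodiments of the present application impose no limitation on this.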
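The voice signal detection method provided in the embodiments of the present application may also be implemented by a voice signal detection apparatus. A schematic structural diagram of the apparatus is shown in FIG. 4, and the apparatus mainly includes the following modules:
an obtaining module 41, configured to obtain an audio signal;
a division module 42, configured to divide the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal;
a determining module 43, configured to determine the energy of each short-time energy frame; and
a detection module 44, configured to detect, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal.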
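The four modules map naturally onto four methods of one object. The class below is only an illustrative sketch of that structure, not the apparatus of FIG. 4: the class and method names are hypothetical, the energy is assumed to be the sum of squared normalized amplitudes, and detection uses the ratio test of Method 1 with the example values from the text.

```python
class VoiceSignalDetector:
    """Illustrative pipeline mirroring modules 41 to 44."""

    def __init__(self, sample_rate, preset_freq=82.0, threshold=2.0, preset_ratio=0.2):
        self.frame_len = int(sample_rate / preset_freq + 0.5)  # samples per frame, rounded half up
        self.threshold = threshold
        self.preset_ratio = preset_ratio

    def obtain(self, source):
        # Obtaining module: here simply materializes the samples.
        return list(source)

    def divide(self, samples):
        # Division module: complete frames only, leftover samples are discarded.
        count = len(samples) // self.frame_len
        return [samples[k * self.frame_len:(k + 1) * self.frame_len] for k in range(count)]

    def determine(self, frames):
        # Determining module: assumed energy = sum of squared normalized amplitudes.
        return [sum((a / 32768.0) ** 2 for a in frame) for frame in frames]

    def detect(self, energies):
        # Detection module: high-energy frame ratio compared with the preset ratio.
        if not energies:
            return False
        high = sum(1 for e in energies if e > self.threshold)
        return high / len(energies) > self.preset_ratio

    def run(self, source):
        return self.detect(self.determine(self.divide(self.obtain(source))))


detector = VoiceSignalDetector(sample_rate=16000)
print(detector.run([0] * 16000), detector.run([16384] * 16000))  # prints: False True
```

Keeping the stages as separate methods makes it straightforward to swap in the splicing variant of the obtaining module or the consecutive-frame variant of the detection module described in the embodiments that follow.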
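In one implementation, the obtaining module 41 obtains a current audio signal; intercepts a sub-signal of a preset time period from the previously obtained audio signal;
and splices the current audio signal and the intercepted sub-signal to serve as the obtained audio signal.
In one implementation, the division module 42 determines the period of the preset voice signal according to the frequency of the preset voice signal;
and divides the audio signal, according to the determined period, into a plurality of short-time energy frames whose durations all equal the period.
In one implementation, the detection module 44 determines the ratio of the number of short-time energy frames whose energy is greater than a preset threshold to the total number of short-time energy frames;
determines whether the ratio is greater than a preset ratio;
if it is, determines that a voice signal is detected in the audio signal;
if it is not, determines that no voice signal is detected in the audio signal.
In one implementation, the detection module 44 determines the ratio of the number of short-time energy frames whose energy is greater than a preset threshold to the total number of short-time energy frames;
determines whether the ratio is greater than a preset ratio;
if it is not, determines that no voice signal is detected in the audio signal;
if it is, determines that a voice signal is detected in the audio signal when at least N consecutive short-time energy frames exist among the short-time energy frames whose energy is greater than the preset threshold, and determines that no voice signal is detected in the audio signal when no such N consecutive short-time energy frames exist.
Compared with prior-art detection methods that determine whether an audio signal contains a voice signal through complex computations such as the Fourier transform, the voice signal detection method adopted in the embodiments of the present application requires no such computations. The obtained audio signal is divided into a plurality of short-time energy frames according to the frequency of a preset voice signal, the energy of each short-time energy frame is determined, and whether the obtained audio signal contains a voice signal is detected from those energies. The voice signal detection method provided in the embodiments of the present application can therefore solve the problems of slow processing speed and high resource consumption in prior-art voice signal detection methods.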
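The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, product, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, product, or device that includes the element.
A person skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The above descriptions are merely embodiments of the present application and are not intended to limit it. For a person skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.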

Claims (10)

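  1. A voice signal detection method, characterized in that the method comprises:
    obtaining an audio signal;
    dividing the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal;
    determining the energy of each short-time energy frame; and
    detecting, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal.
  2. The method according to claim 1, characterized in that obtaining an audio signal specifically comprises:
    obtaining a current audio signal;
    intercepting a sub-signal of a preset time period from the previously obtained audio signal; and
    splicing the current audio signal and the intercepted sub-signal to serve as the obtained audio signal.
  3. The method according to claim 1, characterized in that dividing the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal specifically comprises:
    determining the period of the preset voice signal according to the frequency of the preset voice signal; and
    dividing the audio signal, according to the determined period, into a plurality of short-time energy frames whose durations all equal the period.
  4. The method according to claim 1, characterized in that detecting, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal specifically comprises:
    determining the ratio of the number of short-time energy frames whose energy is greater than a preset threshold to the total number of short-time energy frames;
    determining whether the ratio is greater than a preset ratio;
    if it is, determining that a voice signal is detected in the audio signal;
    if it is not, determining that no voice signal is detected in the audio signal.
  5. The method according to claim 1, characterized in that detecting, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal specifically comprises:
    determining the ratio of the number of short-time energy frames whose energy is greater than a preset threshold to the total number of short-time energy frames;
    determining whether the ratio is greater than a preset ratio;
    if it is not, determining that no voice signal is detected in the audio signal;
    if it is, determining that a voice signal is detected in the audio signal when at least N consecutive short-time energy frames exist among the short-time energy frames whose energy is greater than the preset threshold, and determining that no voice signal is detected in the audio signal when no such N consecutive short-time energy frames exist.
  6. A voice signal detection apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain an audio signal;
    a division module, configured to divide the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal;
    a determining module, configured to determine the energy of each short-time energy frame; and
    a detection module, configured to detect, according to the energy of each short-time energy frame, whether the audio signal contains a voice signal.
  7. The apparatus according to claim 1, characterized in that the obtaining module:
    obtains a current audio signal;
    intercepts a sub-signal of a preset time period from the previously obtained audio signal; and
    splices the current audio signal and the intercepted sub-signal to serve as the obtained audio signal.
  8. The apparatus according to claim 1, characterized in that the division module determines the period of the preset voice signal according to the frequency of the preset voice signal;
    and divides the audio signal, according to the determined period, into a plurality of short-time energy frames whose durations all equal the period.
  9. The apparatus according to claim 1, characterized in that the detection module determines the ratio of the number of short-time energy frames whose energy is greater than a preset threshold to the total number of short-time energy frames;
    determines whether the ratio is greater than a preset ratio;
    if it is, determines that a voice signal is detected in the audio signal;
    if it is not, determines that no voice signal is detected in the audio signal.
  10. The apparatus according to claim 1, characterized in that the detection module determines the ratio of the number of short-time energy frames whose energy is greater than a preset threshold to the total number of short-time energy frames;
    determines whether the ratio is greater than a preset ratio;
    if it is not, determines that no voice signal is detected in the audio signal;
    if it is, determines that a voice signal is detected in the audio signal when at least N consecutive short-time energy frames exist among the short-time energy frames whose energy is greater than the preset threshold, and determines that no voice signal is detected in the audio signal when no such N consecutive short-time energy frames exist.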
PCT/CN2017/103489 2016-10-12 2017-09-26 Voice signal detection method and apparatus Ceased WO2018068636A1 (zh)

Priority Applications (7)

Application Number Priority Date Filing Date Title
EP17860814.7A EP3528251B1 (en) 2016-10-12 2017-09-26 Method and device for detecting audio signal
SG11201903320XA SG11201903320XA (en) 2016-10-12 2017-09-26 Voice signal detection method and apparatus
PH1/2019/500784A PH12019500784B1 (en) 2016-10-12 2017-09-26 Voice signal detection method and apparatus
JP2019520035A JP6859499B2 (ja) 2016-10-12 2017-09-26 Voice signal detection method and apparatus
MYPI2019001999A MY201634A (en) 2016-10-12 2017-09-26 Voice signal detection method and apparatus
KR1020197013519A KR102214888B1 (ko) 2016-10-12 2017-09-26 Method and device for detecting audio signal
US16/380,609 US10706874B2 (en) 2016-10-12 2019-04-10 Voice signal detection method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610890946.9 2016-10-12
CN201610890946.9A CN106887241A (zh) 2016-10-12 2016-10-12 Voice signal detection method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/380,609 Continuation US10706874B2 (en) 2016-10-12 2019-04-10 Voice signal detection method and apparatus

Publications (1)

Publication Number Publication Date
WO2018068636A1 true WO2018068636A1 (zh) 2018-04-19

Family

ID=59176496

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/103489 Ceased WO2018068636A1 (zh) 2016-10-12 2017-09-26 Voice signal detection method and apparatus

Country Status (10)

Country Link
US (1) US10706874B2 (zh)
EP (1) EP3528251B1 (zh)
JP (2) JP6859499B2 (zh)
KR (1) KR102214888B1 (zh)
CN (1) CN106887241A (zh)
MY (1) MY201634A (zh)
PH (1) PH12019500784B1 (zh)
SG (1) SG11201903320XA (zh)
TW (1) TWI654601B (zh)
WO (1) WO2018068636A1 (zh)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
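CN106887241A (zh) 2016-10-12 2017-06-23 阿里巴巴集团控股有限公司 Voice signal detection method and apparatus
CN107957918B (zh) * 2016-10-14 2019-05-10 腾讯科技(深圳)有限公司 Data recovery method and apparatus
CN108257616A (zh) * 2017-12-05 2018-07-06 苏州车萝卜汽车电子科技有限公司 Method and apparatus for detecting human-machine dialogue
CN108305639B (zh) * 2018-05-11 2021-03-09 南京邮电大学 Speech emotion recognition method, computer-readable storage medium, and terminal
CN108682432B (zh) * 2018-05-11 2021-03-16 南京邮电大学 Speech emotion recognition apparatus
CN108847217A (zh) * 2018-05-31 2018-11-20 平安科技(深圳)有限公司 Speech segmentation method and apparatus, computer device, and storage medium
CN109545193B (zh) * 2018-12-18 2023-03-14 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN110225444A (zh) * 2019-06-14 2019-09-10 四川长虹电器股份有限公司 Fault detection method for a microphone array system and detection system thereof
CN111724783B (zh) 2020-06-24 2023-10-17 北京小米移动软件有限公司 Wake-up method and apparatus for a smart device, smart device, and medium
CN113270118B (zh) * 2021-05-14 2024-02-13 杭州网易智企科技有限公司 Voice activity detection method and apparatus, storage medium, and electronic device
CN116612775A (zh) * 2022-02-09 2023-08-18 宸芯科技股份有限公司 Noise elimination method and apparatus, electronic device, and medium
CN114792530B (zh) * 2022-04-26 2025-07-04 美的集团(上海)有限公司 Speech data processing method and apparatus, electronic device, and storage medium
CN114898774B (zh) * 2022-05-06 2025-06-13 钉钉(中国)信息技术有限公司 Method and apparatus for detecting audio dropouts
CN116863947A (zh) * 2023-07-27 2023-10-10 海纳科德(湖北)科技有限公司 Method and system for recognizing emotion from pet vocal signals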

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
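CN101494049A (zh) * 2009-03-11 2009-07-29 北京邮电大学 Method for extracting audio feature parameters for use in an audio monitoring system
CN101625860A (zh) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Adaptive background noise adjustment method in voice endpoint detection
CN103198838A (zh) * 2013-03-29 2013-07-10 苏州皓泰视频技术有限公司 Abnormal sound monitoring method and monitoring apparatus for embedded systems
CN103544961A (zh) * 2012-07-10 2014-01-29 中兴通讯股份有限公司 Voice signal processing method and apparatus
CN103646649A (zh) * 2013-12-30 2014-03-19 中国科学院自动化研究所 Efficient voice detection method
CN106887241A (zh) * 2016-10-12 2017-06-23 阿里巴巴集团控股有限公司 Voice signal detection method and apparatus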

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3297346B2 (ja) * 1997-04-30 2002-07-02 沖電気工業株式会社 Voice detection device
TW333610B (en) 1997-10-16 1998-06-11 Winbond Electronics Corp The phonetic detecting apparatus and its detecting method
US6480823B1 (en) 1998-03-24 2002-11-12 Matsushita Electric Industrial Co., Ltd. Speech detection for noisy conditions
JP3266124B2 (ja) * 1999-01-07 2002-03-18 ヤマハ株式会社 Device for detecting similar waveforms in an analog signal and device for time-axis expansion and compression of the same signal
KR100463657B1 (ko) * 2002-11-30 2004-12-29 삼성전자주식회사 Apparatus and method for detecting voice sections
US7715447B2 (en) 2003-12-23 2010-05-11 Intel Corporation Method and system for tone detection
JP5459220B2 (ja) 2008-11-27 2014-04-02 日本電気株式会社 Utterance voice detection device
ES2371619B1 (es) 2009-10-08 2012-08-08 Telefónica, S.A. Method for detecting voice segments
CN104485118A (zh) * 2009-10-19 2015-04-01 瑞典爱立信有限公司 Detector and method for voice activity detection
KR101666521B1 (ko) * 2010-01-08 2016-10-14 삼성전자 주식회사 Method and apparatus for detecting the pitch period of an input signal
US20130090926A1 (en) 2011-09-16 2013-04-11 Qualcomm Incorporated Mobile device context information using speech detection
CN102568457A (zh) * 2011-12-23 2012-07-11 深圳市万兴软件有限公司 Music synthesis method and apparatus based on humming input
US9351089B1 (en) * 2012-03-14 2016-05-24 Amazon Technologies, Inc. Audio tap detection
JP5772739B2 (ja) * 2012-06-21 2015-09-02 ヤマハ株式会社 Voice processing device
EP2891151B1 (en) * 2012-08-31 2016-08-24 Telefonaktiebolaget LM Ericsson (publ) Method and device for voice activity detection
CN103117067B (zh) * 2013-01-19 2015-07-15 渤海大学 Voice endpoint detection method under low signal-to-noise ratio
CN103177722B (zh) * 2013-03-08 2016-04-20 北京理工大学 Song retrieval method based on timbre similarity
CN103247293B (zh) * 2013-05-14 2015-04-08 中国科学院自动化研究所 Encoding and decoding method for voice data
WO2014194273A2 (en) * 2013-05-30 2014-12-04 Eisner, Mark Systems and methods for enhancing targeted audibility
US9502028B2 (en) 2013-10-18 2016-11-22 Knowles Electronics, Llc Acoustic activity detection apparatus and method
CN104916288B (zh) 2014-03-14 2019-01-18 深圳Tcl新技术有限公司 Method and apparatus for emphasizing human voice in audio
CN104934032B (zh) * 2014-03-17 2019-04-05 华为技术有限公司 Method and apparatus for processing a voice signal according to frequency-domain energy
US9406313B2 (en) * 2014-03-21 2016-08-02 Intel Corporation Adaptive microphone sampling rate techniques
CN106328168B (zh) * 2016-08-30 2019-10-18 成都普创通信技术股份有限公司 Voice signal similarity detection method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
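CN101625860A (zh) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Adaptive background noise adjustment method in voice endpoint detection
CN101494049A (zh) * 2009-03-11 2009-07-29 北京邮电大学 Method for extracting audio feature parameters for use in an audio monitoring system
CN103544961A (zh) * 2012-07-10 2014-01-29 中兴通讯股份有限公司 Voice signal processing method and apparatus
CN103198838A (zh) * 2013-03-29 2013-07-10 苏州皓泰视频技术有限公司 Abnormal sound monitoring method and monitoring apparatus for embedded systems
CN103646649A (zh) * 2013-12-30 2014-03-19 中国科学院自动化研究所 Efficient voice detection method
CN106887241A (zh) * 2016-10-12 2017-06-23 阿里巴巴集团控股有限公司 Voice signal detection method and apparatus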

Also Published As

Publication number Publication date
JP6999012B2 (ja) 2022-01-18
EP3528251A4 (en) 2019-08-21
EP3528251A1 (en) 2019-08-21
US10706874B2 (en) 2020-07-07
EP3528251B1 (en) 2022-02-23
TW201814692A (zh) 2018-04-16
PH12019500784A1 (en) 2019-11-11
KR20190061076A (ko) 2019-06-04
TWI654601B (zh) 2019-03-21
JP2019535039A (ja) 2019-12-05
US20190237097A1 (en) 2019-08-01
SG11201903320XA (en) 2019-05-30
PH12019500784B1 (en) 2024-02-28
JP6859499B2 (ja) 2021-04-14
CN106887241A (zh) 2017-06-23
KR102214888B1 (ko) 2021-02-15
JP2021071729A (ja) 2021-05-06
MY201634A (en) 2024-03-06

Similar Documents

Publication Publication Date Title
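WO2018068636A1 (zh) Voice signal detection method and apparatus
US11670325B2 (en) Voice activity detection using a soft decision mechanism
CN108766418B (zh) Voice endpoint recognition method, apparatus, and device
WO2019101123A1 (zh) Voice activity detection method, related apparatus, and device
US20190304449A1 (en) Method, apparatus and storage medium for wake-up processing of application
CN108986822A (zh) Speech recognition method and apparatus, electronic device, and non-transitory computer storage medium
CN109065044A (zh) Wake-up word recognition method and apparatus, electronic device, and computer-readable storage medium
CN108877779B (zh) Method and apparatus for detecting the end point of speech
CN111667843B (zh) Voice wake-up method and system for a terminal device, electronic device, and storage medium
CN109767784B (zh) Snore recognition method and apparatus, storage medium, and processor
CN109741753A (zh) Voice interaction method and apparatus, terminal, and server
EP4447046B1 (en) Cascade architecture for noise-robust keyword spotting
US20250252969A1 (en) Audio processing method and apparatus, storage medium, and electronic device
CN110085264B (zh) Voice signal detection method, apparatus, device, and storage medium
CN108093356B (zh) Howling detection method and apparatus
US11790931B2 (en) Voice activity detection using zero crossing detection
CN112542157B (zh) Speech processing method and apparatus, electronic device, and computer-readable storage medium
US20220130405A1 (en) Low Complexity Voice Activity Detection Algorithm
CN113436641A (zh) Method, device, and medium for detecting music transition time points
CN113936678A (zh) Target speech detection method and apparatus, device, and storage medium
CN113129904B (zh) Voiceprint determination method, apparatus, system, device, and storage medium
CN111883159B (zh) Speech processing method and apparatus
HK1237986A1 (zh) Voice signal detection method and apparatus
HK1237986A (zh) Voice signal detection method and apparatus
TW201828285A (zh) Audio recognition method and system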

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 17860814; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 2019520035; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 20197013519; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase. Ref document number: 2017860814; Country of ref document: EP; Effective date: 20190513