HK1237986A - Voice signal detection method and apparatus

Info

Publication number: HK1237986A
Application number: HK17111899.0A
Authority: HK
Inventors: 焦雷; 官砚楚; 曾晓东; 林锋
Original assignee: 阿里巴巴集团控股有限公司
Filing date: 2017-11-16
Publication date: 2018-04-20

Description

Voice signal detection method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for detecting a voice signal.

Background

In real life, people often use smart devices (e.g., smartphones, tablets, etc.) to send voice messages. However, when people use the smart device to send voice messages, people often need to click a start button or an end button in a screen of the smart device to finish sending the voice messages, and the click operations cause inconvenience to users.

If the user is to send a voice message without clicking a button, the smart device needs to record continuously, or according to a preset period, and judge whether the acquired audio signal contains a voice signal. If it does, the voice signal is extracted and processed, and the voice message is then sent out, which completes the sending of the voice message.

In the prior art, a speech signal detection method such as a double-threshold method, an autocorrelation maximum-based detection method, or a wavelet transform-based detection method is generally adopted to detect whether an acquired audio signal includes a speech signal. However, these methods generally obtain the frequency characteristics of the audio information through complex calculations such as the Fourier transform, and then determine from those frequency characteristics whether the audio information includes a voice signal. This requires computation over a large amount of buffered data, so these methods have the disadvantages of high memory occupation, a large amount of calculation, low processing speed, and high power consumption.

Disclosure of Invention

The embodiment of the application provides a voice signal detection method and a voice signal detection device, which are used for solving the problems that a voice signal detection method in the prior art is low in processing speed and consumes more resources.

The embodiment of the application adopts the following technical scheme:

a method of speech signal detection, the method comprising:

acquiring an audio signal;

dividing the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal;

determining the energy of each short-time energy frame;

and detecting whether the audio signal contains a voice signal or not according to the energy of each short-time energy frame.

A speech signal detection apparatus, the apparatus comprising:

the acquisition module acquires an audio signal;

the dividing module is used for dividing the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal;

the determining module is used for determining the energy of each short-time energy frame;

and the detection module detects whether the audio signal contains a voice signal according to the energy of each short-time energy frame.

The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:

compared with the detection method for determining whether the audio signal comprises the voice signal through complex calculation such as Fourier transform in the prior art, the voice signal detection method adopted by the embodiment of the application does not need to perform complex calculation such as Fourier transform, the obtained audio signal is divided into a plurality of short-time energy frames according to the frequency of the preset voice signal, the energy of each short-time energy frame is further determined, and whether the obtained audio signal comprises the voice signal can be detected according to the energy of each short-time energy frame. Therefore, the voice signal detection method provided by the embodiment of the application can solve the problems that the voice signal detection method in the prior art is low in processing speed and consumes more resources.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a detailed flowchart of a voice signal detection method according to an embodiment of the present application;

fig. 2 is a detailed flowchart of another speech signal detection method according to an embodiment of the present application;

fig. 3 is a schematic diagram of audio signals with a preset duration according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of a speech signal detection apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

In order to solve the problems of a low processing speed and high resource consumption of a voice signal detection method in the prior art, an embodiment of the present application provides a voice signal detection method.

The execution main body of the method may be, but is not limited to, a user terminal such as a mobile phone, a tablet computer, or a Personal Computer (PC), or an Application (APP) running on the user terminal, or may also be a device such as a server.

For convenience of description, the following description will be made of an embodiment of the method taking the execution subject of the method as APP. It is understood that the execution of the method by APP is merely an exemplary illustration and should not be construed as a limitation of the method.

The specific flow diagram of the method is shown in fig. 1, and the method comprises the following steps:

step 101, an audio signal is acquired.

The audio signal may be an audio signal acquired by the APP through the audio acquisition device, or may also be an audio signal received by the APP, for example, the audio signal may be an audio signal transmitted by another APP or a device, which is not limited in this embodiment of the present application. After the APP acquires the audio signal, it may save the audio signal locally.

The present application does not limit the sampling rate, duration, format or sound channel corresponding to the audio signal.

The APP may be any type of APP, such as a chat APP or a payment APP, as long as the APP can acquire an audio signal, and the voice signal detection method provided by the embodiment of the present application may be used to detect the voice signal of the acquired audio signal.

Step 102, dividing the audio signal into a plurality of short-time energy frames according to the frequency of a preset voice signal.

The short-time energy frame is actually a part of the audio signal acquired in step 101.

Specifically, the period of the preset voice signal may be determined according to the frequency of the preset voice signal, and according to the determined period, the audio signal acquired in step 101 is divided into a plurality of short-time energy frames, each of which has a duration equal to that period. For example, assuming that the period of the preset speech signal is 0.01 s, the audio signal obtained in step 101 may be divided, according to its duration, into a plurality of short-time energy frames each having a duration of 0.01 s. It should be noted that, depending on the actual situation, the audio signal obtained in step 101 may also be divided into as few as two short-time energy frames according to the frequency of the preset speech signal. For convenience of subsequent description, the embodiments of the present application are described hereinafter by taking the division of an audio signal into a plurality of short-time energy frames as an example.

In addition, when the APP collects an audio signal through the audio collection device in step 101, the collected audio signal is generally an analog signal that has been sampled into a digital signal at a certain sampling rate, that is, an audio signal in Pulse Code Modulation (PCM) format. The audio signal can therefore also be divided into a plurality of short-time energy frames according to the sampling rate of the audio signal and the frequency of the preset voice signal.

Specifically, a ratio m of the sampling rate of the audio signal to the frequency of a preset speech signal can be determined, and then each m sampling points in the acquired digital audio signal are divided into a short-time energy frame according to the ratio m. If m is a positive integer, the audio signal can be divided into a maximum number of short-time energy frames according to m; if m is not a positive integer, the audio signal may be divided into a maximum number of short-time energy frames according to m converted to a positive integer according to a rounding principle. It should be particularly noted that, if the number of sampling points included in the audio signal acquired in step 101 is not an integer multiple of m, after the audio signal is divided into the maximum number of short-time energy frames, the remaining sampling points may be discarded, or the remaining sampling points may also be used as one short-time energy frame for subsequent processing. Wherein, the m is used to indicate the number of sampling points included in the audio signal acquired in step 101 in a period of a preset voice signal.

For example, if the frequency of the preset speech signal is 82 Hz, the duration of the audio signal acquired in step 101 is 1 s, and the sampling rate is 16000 Hz, then m = 16000/82 ≈ 195.1. Since m is not a positive integer, 195.1 is rounded to the positive integer 195. From the duration and the sampling rate of the audio signal, it can be determined that the audio signal includes 16000 sampling points. Since 16000 is not an integral multiple of 195, the audio signal is divided into 82 short-time energy frames of 195 sampling points each (82 × 195 = 15990), and the remaining 10 sampling points can be discarded.
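The sample-based division described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `split_into_frames` and the `keep_remainder` flag are hypothetical, and Python's built-in `round` is assumed to be an acceptable stand-in for the "rounding principle".

```python
def split_into_frames(samples, sample_rate_hz, preset_freq_hz, keep_remainder=False):
    """Split PCM samples into short-time energy frames of m samples each.

    m is the ratio of the sampling rate to the preset voice frequency,
    rounded to a positive integer. Note: round() resolves exact .5 ties
    to the nearest even integer, which is an assumption of this sketch.
    """
    m = max(1, round(sample_rate_hz / preset_freq_hz))
    full_frames = len(samples) // m
    frames = [samples[k * m:(k + 1) * m] for k in range(full_frames)]
    remainder = samples[full_frames * m:]
    # Leftover samples are either discarded or kept as one shorter frame.
    if keep_remainder and remainder:
        frames.append(remainder)
    return frames, m


# 1 s of audio at 16000 Hz with a preset voice frequency of 82 Hz.
frames, m = split_into_frames([0] * 16000, 16000, 82)
print(m, len(frames), 16000 - m * len(frames))  # 195 82 10
```

Passing `keep_remainder=True` corresponds to the alternative in which the leftover sampling points are processed as one additional short-time energy frame.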

When the audio signal acquired in step 101 is a received audio signal transmitted by another APP or device, the audio signal may be divided into a plurality of short-time energy frames by using any of the methods described above. It should be noted that the format of such an audio signal may not be the PCM format. If the short-time energy frames are to be divided according to the sampling rate of the audio signal and the frequency of the preset voice signal, the received audio signal needs to be converted into an audio signal in the PCM format. In addition, the sampling rate of the audio signal needs to be identified when the audio signal is received; the sampling rate can be identified by methods in the prior art, which are not repeated here.

Step 103, determining the energy of each short-time energy frame.

In the embodiment of the present application, when the audio signal in the PCM format is divided into a plurality of short-time energy frames also in the PCM format by using the above method, the energy of the short-time energy frame may be determined according to the amplitude of the audio signal corresponding to each sampling point in the short-time energy frame. Specifically, the energy of each sampling point in the short-time energy frame may be determined according to the amplitude of the audio signal corresponding to each sampling point, and then the energies are added, and the sum of the finally obtained energies is used as the energy of the short-time energy frame.

For example, the energy of the short-time energy frame may be determined using the following formula:

$$E = \sum_{i=1}^{n} \left(A_i[t]\right)^2$$

wherein E is the energy of the short-time energy frame; i represents the ith sampling point of the audio signal; n is the number of sampling points contained in the short-time energy frame; and A_i[t] is the amplitude of the audio signal corresponding to the ith sampling point, wherein the amplitude ranges from -32768 to 32767.

In addition, in the embodiment of the application, in order to simplify the calculation and save resources, the amplitude obtained when the audio signal is acquired can be divided by 32768 to obtain the normalized amplitude of the short-time energy frame, so that the normalized amplitude ranges from -1 to 1.
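A minimal sketch of the energy calculation for one PCM frame is given below. It assumes that the energy of a sampling point is the square of its amplitude, which is consistent with the squared-amplitude integral described for non-PCM signals; the function name `frame_energy` and the `normalize` flag are hypothetical.

```python
def frame_energy(frame, normalize=True):
    """Sum the per-sample energies of one short-time energy frame.

    Assumption of this sketch: per-sample energy is the squared amplitude.
    With normalize=True, each amplitude in [-32768, 32767] is first divided
    by 32768 so that it lies in [-1, 1).
    """
    scale = 32768.0 if normalize else 1.0
    return sum((amplitude / scale) ** 2 for amplitude in frame)


# Four samples at half of full scale: 4 * 0.5**2 = 1.0
print(frame_energy([16384, -16384, 16384, -16384]))  # 1.0
```

The value of any energy threshold depends on whether normalized or raw amplitudes are used, so the same choice should be applied consistently to every frame.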

If the format of the short-time energy frame is not the PCM format, a function for calculating the amplitude can be determined according to the amplitude of the short-time energy frame at each moment, integration is performed on the square of the function, and the finally obtained integration result is the energy of the short-time energy frame.
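Written out, the integral form might look as follows. The symbols are introduced here only for illustration: f(t) denotes the amplitude function, t_0 the start time of the short-time energy frame, and T its duration.

```latex
E = \int_{t_0}^{t_0 + T} f(t)^2 \, \mathrm{d}t
```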

Step 104, detecting whether the audio signal contains a voice signal according to the energy of each short-time energy frame.

Specifically, either of the following two methods may be adopted to detect whether the audio signal contains a speech signal:

Method 1: determining the ratio of the number of short-time energy frames with energy greater than a preset threshold to the total number of short-time energy frames (hereinafter referred to as the high-energy frame ratio), and judging whether the determined high-energy frame ratio is greater than a preset ratio. If yes, determining that the audio signal contains a voice signal; if not, determining that the audio signal is not detected to contain a voice signal.

In the embodiment of the present application, the preset threshold may be set to 2, the preset ratio may be set to 20%, and if the high-energy frame ratio is greater than 20%, it is determined that the audio signal is detected to include a speech signal; otherwise, it is determined that the voice signal is not detected to be included in the audio signal.

In the embodiment of the present application, method 1 can be used to determine whether a speech signal is detected in an audio signal because, in real life, there is always more or less noise in the external environment when a person speaks, and this noise is generally lower in energy than the person's speech. If a segment of audio signal contains short-time energy frames with energy higher than the preset threshold, and these short-time energy frames account for a certain proportion of the segment, the audio signal can be considered to include a speech signal.
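Method 1 can be sketched as below, using the example values from this embodiment (threshold 2, preset ratio 20%) as defaults. The function names are hypothetical, and treating an empty list of frames as "no speech" is an assumption of the sketch.

```python
def high_energy_ratio(energies, threshold=2.0):
    """Fraction of short-time energy frames whose energy exceeds the threshold."""
    if not energies:
        return 0.0  # assumption: no frames means no high-energy frames
    return sum(1 for e in energies if e > threshold) / len(energies)


def contains_speech_method1(energies, threshold=2.0, preset_ratio=0.20):
    """Method 1: speech is detected when the high-energy frame ratio exceeds the preset ratio."""
    return high_energy_ratio(energies, threshold) > preset_ratio


# 3 of 10 frames are above the threshold: ratio 0.3 > 0.2
print(contains_speech_method1([3.0] * 3 + [0.1] * 7))  # True
```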

Method 2: in order to make the final detection result more accurate, the high-energy frame ratio may be determined as in method 1, and it is judged whether the determined high-energy frame ratio is greater than the preset ratio. If not, it is determined that the audio signal is not detected to contain a speech signal. If yes, it is further judged whether at least N consecutive short-time energy frames exist among the short-time energy frames with energy greater than the preset threshold: if they exist, it is determined that the audio signal contains a speech signal; if they do not exist, it is determined that the audio signal is not detected to contain a speech signal. N can be any positive integer. In the embodiment of the present application, N may be set to 10.

That is, method 2 adds to method 1 a further condition for determining whether the audio signal includes a speech signal: whether at least N consecutive short-time energy frames exist among the short-time energy frames with energy greater than the preset threshold. This effectively reduces the influence of noise. In real life, noise is lower in energy than human speech and occurs randomly, so method 2 can effectively exclude audio signals that consist largely of noise, reduce the influence of noise in the external environment, and achieve a noise-reduction effect.
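A possible sketch of method 2 follows, again with the example values of this embodiment as defaults (threshold 2, preset ratio 20%, N = 10). The names `longest_run_above` and `contains_speech_method2` are hypothetical.

```python
def longest_run_above(energies, threshold):
    """Length of the longest run of consecutive frames with energy above the threshold."""
    longest = current = 0
    for e in energies:
        current = current + 1 if e > threshold else 0
        longest = max(longest, current)
    return longest


def contains_speech_method2(energies, threshold=2.0, preset_ratio=0.20, n_consecutive=10):
    """Method 2: the ratio test of method 1 plus at least N consecutive high-energy frames."""
    if not energies:
        return False  # assumption: no frames means no speech
    high = sum(1 for e in energies if e > threshold)
    if high / len(energies) <= preset_ratio:
        return False
    return longest_run_above(energies, threshold) >= n_consecutive


speech_like = [0.1] * 20 + [3.0] * 12 + [0.1] * 20   # one sustained burst
scattered = [3.0, 0.1] * 20                          # many isolated spikes
print(contains_speech_method2(speech_like), contains_speech_method2(scattered))  # True False
```

The second input passes the ratio test (half of its frames are above the threshold) but is rejected because no two high-energy frames are adjacent, which is the random-noise case method 2 is meant to exclude.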

It should be particularly noted that the voice signal detection method provided in the embodiment of the present application may be applied to detect a mono audio signal, a dual-channel audio signal, a multi-channel audio signal, or the like. Wherein, the audio signal collected by one sound channel is a single sound channel audio signal; the audio signal collected through two channels is a two-channel audio signal, and the audio signal collected through a plurality of channels is a multi-channel audio signal.

When the method shown in fig. 1 is used to detect a two-channel audio signal and a multi-channel audio signal, the obtained audio signal of each channel can be detected according to the operations mentioned in steps 101 to 104, and finally, whether the obtained audio signal contains a voice signal is determined according to the detection result of the audio signal of each channel.

Specifically, if the audio signal acquired in step 101 is a mono audio signal, the operations mentioned in steps 101 to 104 may be directly performed on the audio signal, and the detection result is used as the final detection result.

If the audio signal obtained in step 101 is not a mono audio signal but a dual-channel or multi-channel audio signal, the audio signals of each channel are processed according to the operations in steps 101 to 104. If it is detected that the audio signal of each channel does not include a speech signal, it is determined that the audio signal acquired in step 101 does not include a speech signal. If it is detected that the audio signal of at least one channel includes a voice signal, it is determined that the audio signal acquired in step 101 includes a voice signal.
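The per-channel combination rule can be sketched as follows. The function `any_channel_has_speech` is hypothetical, and `loud_enough` is only a stand-in detector for demonstration; in practice the per-channel detector would perform steps 101 to 104.

```python
def any_channel_has_speech(channels, detect_channel):
    """Return True if the detector finds speech in at least one channel.

    channels is a list of per-channel sample sequences; detect_channel is
    any callable that maps one channel's samples to True or False.
    """
    return any(detect_channel(samples) for samples in channels)


def loud_enough(samples):
    """Stand-in detector for this example only: any sample magnitude above 1000."""
    return max((abs(s) for s in samples), default=0) > 1000


stereo = [[0, 5, -3, 2], [10, 4000, -3500, 20]]
print(any_channel_has_speech(stereo, loud_enough))  # True
```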

In addition, the frequency of the preset voice signal mentioned in step 102 may be the frequency of any voice, which is not limited in this application. In practical applications, different preset voice signal frequencies may be set for different audio signals acquired in step 101 according to actual situations. It should be noted that, regardless of which frequency is chosen for the preset voice signal, for example, the frequency of a female treble or of a male bass, the short-time energy frames finally obtained should satisfy the following condition: the duration corresponding to the short-time energy frame is not less than the period corresponding to the audio signal acquired in step 101. In order to achieve a better detection effect while saving resources and increasing the processing speed as much as possible, in the embodiment of the present application, the frequency of the preset voice signal may be set to the minimum human voice frequency, that is, 82 Hz. Because the period is the reciprocal of the frequency, if the frequency of the preset voice signal is the minimum human voice frequency, the period of the preset voice signal is the maximum human voice period. Therefore, whatever the period of the audio signal acquired in step 101, the duration corresponding to the short-time energy frame is not smaller than the period of the acquired audio signal.
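As a worked instance of this reciprocal relationship (the symbols T, f, f_min, and T_max are introduced here only for illustration):

```latex
T_{\max} = \frac{1}{f_{\min}} = \frac{1}{82\,\mathrm{Hz}} \approx 12.2\,\mathrm{ms},
\qquad
f \ge f_{\min} \;\Longrightarrow\; T = \frac{1}{f} \le T_{\max}
```

A short-time energy frame lasting one period of the 82 Hz preset signal, about 12.2 ms, is therefore at least as long as one period of any voice component at or above 82 Hz.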

It should be noted that, in the embodiment of the present application, the duration corresponding to the short-time energy frame is not less than the period of the audio signal acquired in step 101, because the detection method provided in the embodiment of the present application detects whether the audio signal includes a speech signal based on the characteristics of the human speech. Human speech is higher in energy, more stable and continuous than noise. If the duration corresponding to the short-time energy frame is less than the period of the audio signal acquired in step 101, a waveform of a complete period does not exist in the waveform corresponding to the short-time energy frame, and the duration of the short-time energy frame is relatively short. In this case, even if the high-energy frame rate is greater than the preset rate and at least N consecutive short-time energy frames exist in the short-time energy frames with energy greater than the preset threshold, it can only be indicated that the audio signal includes a sound signal, but it cannot be indicated that the sound signal is a speech signal. Therefore, in this embodiment of the present application, the duration of the audio signal acquired in step 101 should be greater than a maximum period of human voice.

In addition, the voice signal detection method provided by the embodiment of the application is particularly suitable for an application scenario that the chat APP can finish the sending of the voice message without any click operation of the user. Then, the following describes the speech signal detection method provided in the embodiment of the present application in detail for this scenario. In this scenario, a specific flow diagram of the method is shown in fig. 2, and includes the following steps:

step 201, audio signals are collected in real time.

If the user wants the chat APP to complete the sending of voice messages without any click operation once the APP has been opened, then, after the user opens the APP, the APP can start recording the external environment continuously and collect the audio signal in real time, so as to avoid missing what the user says as far as possible. In addition, the audio signal may be saved locally in real time after it is acquired. When the user closes the APP, the APP stops recording.

Step 202, intercepting the audio signal with preset duration from the acquired audio signal in real time.

If the APP records all the time, but does not detect the voice signal in real time, the timeliness of the voice message is poor. Therefore, the APP can intercept the audio signal with the preset duration in the audio signal collected in step 201 in real time, and perform subsequent detection on the audio signal with the preset duration.

The currently intercepted audio signal with the preset duration may be referred to as a current audio signal, and the last intercepted audio signal with the preset duration may be referred to as a last acquired audio signal.

Step 203, dividing the audio signal with preset duration into a plurality of short-time energy frames according to the frequency of the preset voice signal.

At step 204, the energy of each short-time energy frame is determined.

Step 205, detecting whether the audio signal with the preset duration contains a voice signal according to the energy of each short-time energy frame.

If the current audio signal is detected to contain the voice signal, judging whether the audio signal obtained last time contains the voice signal, and if the audio signal obtained last time does not contain the voice signal, determining the starting point of the current audio signal as the starting point of the voice signal; if the audio signal acquired last time is judged to contain the voice signal, the starting point of the current audio signal is not the starting point of the voice signal.

If the current audio signal is detected not to contain the voice signal, judging whether the audio signal obtained last time contains the voice signal, and if the audio signal obtained last time contains the voice signal, determining the end point of the audio signal obtained last time as the end point of the voice signal; if the audio signal acquired last time does not contain the voice signal, the end point of the current audio signal or the audio signal acquired last time is not the end point of the voice signal.

For example, as shown in fig. 3, A, B, C, and D are four adjacent audio signals of the preset duration, where A and D do not include a speech signal and B and C include a speech signal. In this case, the start point of B may be determined as the start point of the speech signal, and the end point of C may be determined as the end point of the speech signal.

Sometimes, the current audio signal is just the beginning or end of a user speech, and the audio signal contains less speech signals, in which case, the APP may erroneously determine that the audio signal does not contain speech signals. In order to avoid missing what the user said due to erroneous determination as much as possible, after detecting that the current audio signal includes a voice signal, it may be determined whether the audio signal acquired last time includes the voice signal, and if it is determined that the audio signal acquired last time does not include the voice signal, the starting point of the audio signal acquired last time may be determined as the starting point of the voice signal. In addition, after it is detected that the current audio signal does not include a speech signal, it may be determined whether the audio signal acquired last time includes a speech signal, and if it is determined that the audio signal acquired last time includes a speech signal, the end point of the current audio signal may be determined as the end point of the speech signal. Following the above example, the start point of a may be determined as the start point of the voice signal, and the end point of D may be determined as the end point of the voice signal.
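The two boundary rules described above, the strict rule and the widened rule that guards against missed beginnings and endings, can be sketched together. The function `speech_spans` and its `widen` flag are hypothetical; segments are identified by their index in the order in which they were intercepted.

```python
def speech_spans(flags, widen=False):
    """Return (start_segment, end_segment) index pairs, both inclusive.

    flags[k] is True when segment k was detected as containing speech.
    A span is reported only once a following non-speech segment is seen.
    """
    spans = []
    start = None
    previous = False
    for k, current in enumerate(flags):
        if current and not previous:
            # Speech begins: strict rule starts at this segment,
            # widened rule starts at the segment before it.
            start = max(k - 1, 0) if widen else k
        elif previous and not current:
            # Speech ends: strict rule ends at the previous segment,
            # widened rule ends at this (non-speech) segment.
            spans.append((start, k if widen else k - 1))
        previous = current
    return spans


names = ["A", "B", "C", "D"]
flags = [False, True, True, False]
for widen in (False, True):
    for start, end in speech_spans(flags, widen):
        print(widen, names[start], names[end])
# False B C
# True A D
```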

After the APP detects that the current audio signal contains a voice signal, it may send the audio signal to a voice recognition apparatus, which performs voice processing on the audio signal to obtain a voice result and then sends the audio signal to a subsequent processing apparatus; finally, the audio signal is sent out in the form of a voice message. To ensure that the words spoken by the user in the sent voice message form complete sentences, the APP may, after sending all audio signals between the determined start point and end point of the voice signal to the voice recognition apparatus, send an audio termination signal to the voice recognition apparatus to indicate that the user has finished speaking. The voice recognition apparatus then sends these audio signals together to the subsequent processing apparatus, and they are finally sent out in the form of a voice message.

In addition, in order to avoid the occurrence of the erroneous judgment as much as possible, after the current audio signal is acquired, a sub-signal of a preset time period may be intercepted from the audio signal acquired last time, the current audio signal and the intercepted sub-signal may be spliced to serve as the acquired audio signal (hereinafter referred to as a spliced audio signal), and the subsequent voice signal may be detected with respect to the spliced audio signal.

Wherein the sub-signal may be spliced before the current audio signal. The preset time period may be a tail time period of the last acquired audio signal, and a duration corresponding to the time period may be any duration. In order to make the final detection result more accurate, in the embodiment of the present application, the duration corresponding to the preset time period may be set to be not greater than the product of the duration corresponding to the spliced audio signal and the preset ratio.

If the spliced audio signal is detected to contain the voice signal, whether the spliced audio signal obtained last time contains the voice signal or not can be judged, and if the spliced audio signal obtained last time does not contain the voice signal, the starting point of the spliced audio signal can be used as the starting point of the voice signal. If the spliced audio signal is detected not to contain the voice signal, whether the spliced audio signal obtained last time contains the voice signal or not can be judged, and if the spliced audio signal obtained last time contains the voice signal, the end point of the spliced audio signal can be used as the end point of the voice signal.
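The splicing step can be sketched as follows. Both function names are hypothetical, lengths are counted in sampling points rather than seconds, and the 20% default simply reuses the example preset ratio of this embodiment.

```python
def splice_with_tail(previous_segment, current_segment, tail_len):
    """Prepend the last tail_len samples of the previous segment to the current one."""
    tail = list(previous_segment[-tail_len:]) if tail_len > 0 else []
    return tail + list(current_segment)


def tail_within_limit(tail_len, current_len, preset_ratio=0.20):
    """Check that the tail is not longer than preset_ratio times the spliced length."""
    return tail_len <= preset_ratio * (tail_len + current_len)


previous = list(range(100))        # stand-in for the last acquired audio signal
current = list(range(100, 200))    # stand-in for the current audio signal
spliced = splice_with_tail(previous, current, tail_len=10)
print(len(spliced), spliced[0], spliced[10])  # 110 90 100
```

The spliced list would then be passed through the same framing, energy, and detection steps as an ordinary audio signal of the preset duration.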

In the embodiment of the present application, the APP may also record periodically instead of recording continuously, which is not limited in the embodiment of the present application.

The voice signal detection method provided in the embodiment of the present application can also be implemented by a voice signal detection apparatus. A specific structural diagram of the apparatus is shown in fig. 4, and the apparatus mainly includes the following modules:

an obtaining module 41 for obtaining an audio signal;

a dividing module 42, configured to divide the audio signal into a plurality of short-time energy frames according to a frequency of a preset speech signal;

a determining module 43 for determining the energy of each short-time energy frame;

and the detection module 44 detects whether the audio signal contains a speech signal according to the energy of each short-time energy frame.

In one embodiment, the obtaining module 41 obtains the current audio signal; intercepting a sub-signal of a preset time period from the last acquired audio signal;

and splicing the current audio signal and the intercepted sub-signal to obtain an obtained audio signal.

In one embodiment, the dividing module 42 determines a period of a preset voice signal according to a frequency of the preset voice signal;

and according to the determined period, dividing the audio signal into a plurality of short-time energy frames with corresponding time durations all of the period.

In one embodiment, the detection module 44 determines a ratio of the number of short-time energy frames with energy greater than a preset threshold to the total number of all short-time energy frames;

judging whether the ratio is larger than a preset ratio or not;

if yes, determining that the audio signal contains a voice signal;

if not, determining that the audio signal is not detected to contain the voice signal.

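In one embodiment, the detection module 44 determines a ratio of the number of short-time energy frames with energy greater than a preset threshold to the total number of all short-time energy frames;

judging whether the ratio is larger than a preset ratio or not;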

if not, determining that the audio signal is not detected to contain a voice signal;

if yes, when at least N continuous short-time energy frames exist in the short-time energy frames with the energy larger than the preset threshold, the fact that the voice signal is contained in the audio signal is determined to be detected, and when at least N continuous short-time energy frames do not exist in the short-time energy frames with the energy larger than the preset threshold, the fact that the voice signal is not contained in the audio signal is determined to be not detected.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for detecting a speech signal, the method comprising:

acquiring an audio signal;

determining the energy of each short-time energy frame;

2. The method of claim 1, wherein acquiring the audio signal comprises:

acquiring a current audio signal;

intercepting a sub-signal of a preset time period from the last acquired audio signal;

3. The method of claim 1, wherein dividing the audio signal into a plurality of short-time energy frames according to a frequency of a predetermined speech signal comprises:

determining the period of a preset voice signal according to the frequency of the preset voice signal;

4. The method of claim 1, wherein detecting whether the audio signal contains a speech signal according to the energy of each short-time energy frame comprises:

determining the ratio of the number of the short-time energy frames with the energy larger than a preset threshold value to the total number of all the short-time energy frames;

judging whether the ratio is larger than a preset ratio or not;

if yes, determining that the audio signal contains a voice signal;

5. The method of claim 1, wherein detecting whether the audio signal contains a speech signal according to the energy of each short-time energy frame comprises:

judging whether the ratio is larger than a preset ratio or not;

6. A speech signal detection apparatus, characterized in that the apparatus comprises:

the acquisition module acquires an audio signal;

7. The apparatus of claim 6, wherein the acquisition module:

acquiring a current audio signal;

8. The apparatus of claim 6, wherein the dividing module determines a period of a preset voice signal according to a frequency of the preset voice signal;

9. The apparatus of claim 6, wherein the detection module determines a ratio of a number of short-time energy frames with energy greater than a preset threshold to a total number of all short-time energy frames;

judging whether the ratio is larger than a preset ratio or not;

if yes, determining that the audio signal contains a voice signal;

10. The apparatus of claim 6, wherein the detection module determines a ratio of a number of short-time energy frames with energy greater than a preset threshold to a total number of all short-time energy frames;

judging whether the ratio is larger than a preset ratio or not;