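WO2019019256A1 - Electronic device, identity verification method and system, and computer-readable storage medium - Google Patents
Electronic device, identity verification method and system, and computer-readable storage medium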
- Publication number
- WO2019019256A1 (PCT/CN2017/100055)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- feature
- feature units
- neural network
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
- G06F21/32—User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/10—Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
Definitions
- The present invention relates to the field of communications technologies, and in particular to an electronic device, an identity verification method and system, and a computer-readable storage medium.
- Voiceprint recognition is an identity authentication technology that uses computer simulation to discriminate the target speech. It can be widely used in fields such as the Internet, banking systems, and public security and justice.
- The traditional voiceprint recognition scheme models the speaker's recording with a universal background model built on a Gaussian mixture model, analyzes the differences, extracts voiceprint features from those differences, and scores them with a similarity measure to give the recognition result.
- This scheme has a low recognition error rate and works well for long recordings (for example, recordings of 30 seconds or longer). For the short recordings that are common across business scenarios (for example, recordings shorter than 30 seconds), however, the limited number of parameters means the universal background model framework cannot model the subtle differences in the recording well, so short-utterance recognition performs poorly and the recognition error rate is high.
- The present invention provides an electronic device including a memory and a processor coupled to the memory, wherein the memory stores an identity verification system operable on the processor. When the identity verification system is executed by the processor, the following steps are implemented:
- S4: input the plurality of sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
- The present invention also provides an identity verification method, which includes:
- framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
- S4: input the plurality of sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
- The present invention also provides a computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to implement the following steps:
- S4: input the plurality of sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
- The present invention also provides an identity verification system stored in a memory and executable by at least one processor to implement the following steps:
- S4: input the plurality of sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
- The beneficial effects of the invention are as follows: the present invention first frames the current voice data into a plurality of speech frames, extracts a preset type of acoustic feature from each speech frame using a predetermined filter, and generates a plurality of observation feature units corresponding to the current voice data from the extracted acoustic features.
- The observation feature units are then paired pairwise with the pre-stored observation feature units to obtain a plurality of sets of paired observation feature units, which are input into the pre-trained identity verification model of a preset type.
- The output identity verification result is obtained to verify the identity of the target user.
- When the present invention verifies identity from the short recordings that occur in multiple service scenarios, the short recording is framed, its acoustic features are extracted and transformed into observation feature units, and the paired observation feature units are finally input into the identity verification model. Short-utterance recognition performs better, and the recognition error rate can be reduced.
- FIG. 1 is a schematic diagram of an optional application environment according to various embodiments of the present invention.
- FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present invention.
- Descriptions such as “first” and “second” in the present invention are for the purpose of description only, and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features.
- Thus, features defined with “first” and “second” may explicitly or implicitly include at least one such feature.
- The technical solutions of the various embodiments may be combined with each other, but only on the basis that those skilled in the art can realize the combination. When a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist, and it is not within the scope of protection claimed by the present invention.
- Referring to FIG. 1, it is a schematic diagram of the application environment of a preferred embodiment of the identity verification method of the present invention.
- The application environment includes an electronic device 1 and a terminal device 2.
- The electronic device 1 can exchange data with the terminal device 2 through a suitable technology such as a network or near field communication.
- The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device, and that can collect the user's voice data with a voice collection device (for example, a microphone). Examples include mobile devices such as personal computers, tablets, smart phones, personal digital assistants (PDAs), game consoles, Internet Protocol Television (IPTV), smart wearable devices, and navigation devices, as well as fixed terminals such as digital TVs, desktop computers, notebooks, and servers.
- The electronic device 1 is an apparatus capable of automatically performing numerical calculation and/or information processing in accordance with instructions that are set or stored in advance.
- The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
- The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to each other through a system bus. It should be noted that FIG. 1 only shows the electronic device 1 with components 11-13; not all of the illustrated components are required, and more or fewer components may be implemented instead.
- The storage device 11 includes an internal memory and at least one type of readable storage medium.
- The internal memory provides a cache for the operation of the electronic device 1.
- The readable storage medium can be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, or an optical disk.
- In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 1.
- the readable storage medium of the storage device 11 is generally used to store an operating system installed in the electronic device 1 and various types of application software, such as program codes of the system for identity verification in an embodiment of the present invention. Further, the storage device 11 can also be used to temporarily store various types of data that have been output or are to be output.
- the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
- the processor 12 is typically used to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with the terminal device 2.
- The processor 12 is configured to run the program code or process the data stored in the memory 11, for example to run the identity verification system.
- the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
- the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2, and establish a data transmission channel and a communication connection between the electronic device 1 and one or more terminal devices 2.
- The identity verification system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11. The at least one computer-readable instruction can be executed by the processor 12 to implement the methods of the various embodiments of the present application, and can be divided into different logic modules according to the functions implemented by its parts.
- When executed by the processor 12, the identity verification system implements the following steps:
- The user is required to record in a plurality of service scenarios; the current voice data received during recording is a segment of recorded data, and that segment is a short utterance.
- When recording, try to prevent environmental noise and interference from the voice acquisition equipment.
- The recording device should be kept at an appropriate distance from the user, and recording devices with large distortion should be avoided.
- Mains power is preferred, with the current kept stable; a sensor should be used when recording telephone calls.
- The voice data can be denoised before framing to further reduce interference.
- The preset framing parameters are, for example, one frame every 25 milliseconds with a frame shift of 10 milliseconds; after framing, each segment of recorded data yields multiple speech frames.
- the embodiment does not limit the framing processing manner described above, and other framing parameters may be used for framing processing, which are all within the protection scope of the embodiment.
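- The framing step with the example parameters (25 millisecond frames, 10 millisecond frame shift) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `frame_signal`, the 16 kHz sample rate, and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D sequence of samples into overlapping frames.

    frame_ms and shift_ms follow the example parameters in the text;
    a trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and frame shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of silence at an assumed 16 kHz sample rate.
signal = [0.0] * 16000
frames = frame_signal(signal, 16000)
print(len(frames), len(frames[0]))
```

- With these values each frame holds 400 samples and consecutive frames overlap by 240 samples, which is why a short recording still yields many speech frames.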
- The predetermined filter is preferably a Mel filter.
- The acoustic feature is a voiceprint feature.
- Voiceprint features come in various types, such as wide-band voiceprints, narrow-band voiceprints, and amplitude voiceprints.
- In this embodiment, the voiceprint feature is preferably the Mel frequency cepstral coefficients (MFCC).
- A feature data matrix is composed from the Mel frequency cepstral coefficients; specifically, one feature data matrix is composed from the Mel frequency cepstral coefficients of each segment of recorded data.
- The feature data matrices corresponding to the multiple segments of recorded data are the plurality of observation feature units corresponding to the current voice data.
- S4: input the plurality of sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
- A plurality of observation feature units of the user are stored in advance. After the observation feature units corresponding to the current voice data are generated, they are paired pairwise with the pre-stored observation feature units to obtain multiple sets of paired observation feature units.
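- The pairing step can be sketched as below. The text only says that the generated units are paired pairwise with the pre-stored ones; this sketch assumes that means every generated unit is paired with every pre-stored unit (a Cartesian product), and the name `pair_units` is hypothetical.

```python
from itertools import product


def pair_units(current_units, enrolled_units):
    """Pair every current observation feature unit with every pre-stored one.

    Assumption of this sketch: "pairwise pairing" is read as the full
    Cartesian product of the two collections.
    """
    return list(product(current_units, enrolled_units))


# Placeholder labels stand in for the (20, N) feature matrices.
current = ["cur_0", "cur_1"]
enrolled = ["enr_0", "enr_1", "enr_2"]
pairs = pair_units(current, enrolled)
print(len(pairs))
```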
- The preset type identity verification model is preferably a deep convolutional neural network model consisting of one input layer, four convolution layers, one pooling layer, two fully connected layers, one normalization layer, and one classification layer.
- the detailed structure of the deep convolutional neural network model is as shown in Table 1 above:
- the Layer Name column indicates the name of each layer
- Input indicates the input layer
- Conv indicates the convolution layer
- Conv1 indicates the first convolution layer
- Mean_std_pooling indicates the pooling layer
- Full connected indicates the fully connected layer
- Normalize Wrap indicates the normalization layer, and Scoring indicates the classification layer.
- Batch Size indicates the number of observation feature units of the input of the current layer
- Kernel Size indicates the scale of the current layer convolution kernel (for example, the Kernel Size can be equal to 3, indicating that the scale of the convolution kernel is 3x3)
- Stride Size represents the moving step size of the convolution kernel, that is, the distance moved to the next convolution position after one convolution.
- Filter size refers to the number of channels output by each layer; for example, the input voice has 1 channel at the Input layer (that is, the original data) and has 512 channels after passing through the Input layer.
- The input layer samples the input observation feature units.
- A Conv layer with a 1*1 convolution kernel can scale the input and combine features.
- The Normalize Wrap layer normalizes the variance of the input.
- The Scoring layer trains the user's intra-class relationship matrix U and the user's inter-class relationship matrix V, both of dimension 300*300.
- The output identity verification result is obtained; it is either verification passed or verification failed, thereby verifying the identity of the target user.
- This embodiment first frames the current voice data into a plurality of speech frames, extracts a preset type of acoustic feature from each speech frame using a predetermined filter, and generates a plurality of observation feature units corresponding to the current voice data from the extracted acoustic features.
- The observation feature units are paired pairwise with the pre-stored observation feature units to obtain a plurality of sets of paired observation feature units, which are input into the pre-trained identity verification model of a preset type.
- The output identity verification result is obtained to verify the identity of the target user.
- When identity is verified from the short recordings that occur in multiple service scenarios, the short recording is framed, its acoustic features are extracted and transformed into observation feature units, and the paired observation feature units are finally input into the identity verification model. Short-utterance recognition performs better, and the recognition error rate can be reduced.
- the step of extracting a predetermined type of acoustic features in each speech frame by using a predetermined filter comprises:
- a cepstrum analysis is performed on the Mel spectrum to obtain a Mel frequency cepstral coefficient MFCC, and the Mel frequency cepstral coefficient MFCC is used as an acoustic feature of the speech frame.
- Each frame of data is treated as a stationary signal. Because the Mel spectral features are later obtained through a Fourier expansion, the Gibbs effect can occur: when a periodic function with discontinuities (such as a rectangular pulse) is expanded as a Fourier series and synthesized from a finite number of terms, the peak in the synthesized waveform moves closer to the discontinuity of the original signal as more terms are used, and with many terms the peak value tends to a constant of approximately 9% of the total jump. To avoid the Gibbs effect, the speech frame is windowed to reduce the discontinuity of the signal at the beginning and end of the frame.
- The cepstrum analysis is, for example, taking the logarithm and performing the inverse transform.
- The inverse transform is generally implemented by the discrete cosine transform (DCT), and the second to thirteenth coefficients after the DCT are taken as the Mel frequency cepstral coefficients (MFCC).
- The MFCC of a frame is the voiceprint feature of that frame's speech data, and the MFCCs of all frames constitute a feature data matrix, which is the acoustic feature of the speech frames.
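- The per-frame chain described above (windowing, Fourier transform, Mel filtering, logarithm, DCT, then keeping the second to thirteenth coefficients) can be sketched with the Python standard library alone. Several details are assumptions of this sketch and not stated in the text: the Hamming window, 26 triangular Mel filters, the 16 kHz sample rate, the small floor applied before the logarithm, and every function name. The naive DFT is used only to keep the example dependency-free.

```python
import math


def hamming(n):
    # Assumed window type; the text only says the frame is windowed.
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    # Naive DFT over the non-negative frequency bins.
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale from 0 Hz to Nyquist.
    high = hz_to_mel(sample_rate / 2.0)
    mel_pts = [high * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_pts]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank


def dct2(values, n_out):
    # Unnormalized DCT-II, returning the first n_out coefficients.
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc_frame(frame, sample_rate, n_filters=26):
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    spec = power_spectrum(windowed)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Floor the filter energies so the logarithm is always defined.
    log_mel = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10)) for row in bank]
    # Keep the 2nd to 13th DCT coefficients (indices 1..12).
    return dct2(log_mel, 13)[1:13]


# A 25 ms frame of a 440 Hz tone at an assumed 16 kHz sample rate.
frame = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(400)]
coeffs = mfcc_frame(frame, 16000)
print(len(coeffs))
```

- In practice a fast Fourier transform and a tested feature-extraction library would replace the hand-written DFT and filterbank; the sketch only makes the order of operations explicit.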
- the step of generating a plurality of observed feature units corresponding to the current voice data according to the extracted acoustic features comprises:
- the deep convolutional neural network model uses an identification function for identity verification, and the identification function includes:
- Obj is the objective function of the deep convolutional neural network model.
- The probability that the deep convolutional neural network model gives the correct discrimination is increased until convergence, thereby verifying the identity of the target.
- x is the user feature obtained at the normalization layer for one observation feature unit of a pair.
- y is the user feature obtained at the normalization layer for the other observation feature unit of the pair.
- K is a constant.
- P(x, y) is the probability that a set of paired observation feature units belongs to the same user.
- L(x, y) is the similarity computed for a set of paired observation feature units.
- U is the intra-class relationship matrix of the user.
- V is the relationship matrix between user classes
- b is the offset amount
- T is the matrix transposition.
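- The formulas of the identification function are not reproduced in this text. One form that is consistent with the symbols defined above is given below; it is an assumption borrowed from common end-to-end speaker verification practice (a logistic function of a bilinear similarity score with a cross-entropy style objective), and the exact form and sign convention used by the patent are not confirmed here. The sets "same" and "diff" denote pairs from the same user and from different users.

```latex
\mathrm{Obj} = -\sum_{(x,y)\in \mathrm{same}} \ln P(x,y) \;-\; K \sum_{(x,y)\in \mathrm{diff}} \ln\bigl(1 - P(x,y)\bigr)

P(x,y) = \frac{1}{1 + e^{-L(x,y)}}

L(x,y) = x^{T} U y - x^{T} V x - y^{T} V y + b
```

- Under this assumed form, a larger L(x, y) pushes P(x, y) towards 1 for same-user pairs, and K weights the contribution of different-user pairs.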
- FIG. 2 is a schematic flowchart of an embodiment of an authentication method according to an embodiment of the present invention.
- the method for authenticating includes the following steps:
- Step S1: after receiving the current voice data of the target user whose identity is to be verified, perform framing on the current voice data according to preset framing parameters to obtain multiple speech frames;
- The user is required to record in a plurality of service scenarios; the current voice data received during recording is a segment of recorded data, and that segment is a short utterance.
- When recording, try to prevent environmental noise and interference from the voice acquisition equipment.
- The recording device should be kept at an appropriate distance from the user, and recording devices with large distortion should be avoided.
- Mains power is preferred, with the current kept stable; a sensor should be used when recording telephone calls.
- The voice data can be denoised before framing to further reduce interference.
- The preset framing parameters are, for example, one frame every 25 milliseconds with a frame shift of 10 milliseconds; after framing, each segment of recorded data yields multiple speech frames.
- the embodiment does not limit the framing processing manner described above, and other framing parameters may be used for framing processing, which are all within the protection scope of the embodiment.
- Step S2: extract a preset type of acoustic feature from each speech frame using a predetermined filter, and generate a plurality of observation feature units corresponding to the current voice data from the extracted acoustic features;
- The predetermined filter is preferably a Mel filter.
- The acoustic feature is a voiceprint feature.
- Voiceprint features come in various types, such as wide-band voiceprints, narrow-band voiceprints, and amplitude voiceprints.
- In this embodiment, the voiceprint feature is preferably the Mel frequency cepstral coefficients (MFCC).
- A feature data matrix is composed from the Mel frequency cepstral coefficients; specifically, one feature data matrix is composed from the Mel frequency cepstral coefficients of each segment of recorded data.
- The feature data matrices corresponding to the multiple segments of recorded data are the plurality of observation feature units corresponding to the current voice data.
- Step S3: pair each of the observation feature units with the pre-stored observation feature units to obtain a plurality of sets of paired observation feature units;
- Step S4: input the plurality of sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
- A plurality of observation feature units of the user are stored in advance. After the observation feature units corresponding to the current voice data are generated, they are paired pairwise with the pre-stored observation feature units to obtain multiple sets of paired observation feature units.
- The preset type identity verification model is preferably a deep convolutional neural network model consisting of one input layer, four convolution layers, one pooling layer, two fully connected layers, one normalization layer, and one classification layer.
- the detailed structure of the deep convolutional neural network model is as shown in Table 1 above, and details are not described herein again.
- the Layer Name column indicates the name of each layer
- Input indicates the input layer
- Conv indicates the convolution layer
- Conv1 indicates the first convolution layer
- Mean_std_pooling indicates the pooling layer
- Full connected indicates the fully connected layer
- Normalize Wrap indicates the normalization layer, and Scoring indicates the classification layer.
- Batch Size indicates the number of observation feature units input to the current layer.
- Kernel Size indicates the scale of the convolution kernel of the current layer (for example, a Kernel Size of 3 indicates a 3x3 convolution kernel).
- Stride Size indicates the moving step size of the convolution kernel, that is, the distance moved to the next convolution position after one convolution.
- Filter size refers to the number of channels output by each layer; for example, the input voice has 1 channel at the Input layer (that is, the original data) and has 512 channels after passing through the Input layer.
- The input layer samples the input observation feature units.
- A Conv layer with a 1*1 convolution kernel can scale the input and combine features.
- The Normalize Wrap layer normalizes the variance of the input.
- The Scoring layer trains the user's intra-class relationship matrix U and the user's inter-class relationship matrix V, both of dimension 300*300.
- The output identity verification result is obtained; it is either verification passed or verification failed, thereby verifying the identity of the target user.
- the step of extracting the acoustic characteristics of the preset type in each of the speech frames by using the predetermined filter in the above step S2 comprises:
- a cepstrum analysis is performed on the Mel spectrum to obtain a Mel frequency cepstral coefficient MFCC, and the Mel frequency cepstral coefficient MFCC is used as an acoustic feature of the speech frame.
- Each frame of data is treated as a stationary signal. Because the Mel spectral features are later obtained through a Fourier expansion, the Gibbs effect can occur: when a periodic function with discontinuities (such as a rectangular pulse) is expanded as a Fourier series and synthesized from a finite number of terms, the peak in the synthesized waveform moves closer to the discontinuity of the original signal as more terms are used, and with many terms the peak value tends to a constant of approximately 9% of the total jump. To avoid the Gibbs effect, the speech frame is windowed to reduce the discontinuity of the signal at the beginning and end of the frame.
- The cepstrum analysis is, for example, taking the logarithm and performing the inverse transform.
- The inverse transform is generally implemented by the discrete cosine transform (DCT), and the second to thirteenth coefficients after the DCT are taken as the Mel frequency cepstral coefficients (MFCC).
- The MFCC of a frame is the voiceprint feature of that frame's speech data, and the MFCCs of all frames constitute a feature data matrix, which is the acoustic feature of the speech frames.
- The step of generating, according to the extracted acoustic features, the plurality of observation feature units corresponding to the current voice data in the foregoing step S2 comprises: taking all the speech frames of each segment of recorded data in the current voice data as one speech frame set, and splicing the 20-dimensional Mel frequency cepstral coefficients (MFCC, i.e., the acoustic features) of each speech frame in the set in order of framing time to generate an observation feature unit as a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
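- Splicing per-frame features into a (20, N) observation feature unit amounts to stacking the N time-ordered 20-dimensional vectors as columns. The sketch below uses plain lists; the name `build_observation_unit` and the placeholder feature values are hypothetical.

```python
def build_observation_unit(frame_features):
    """Stack N per-frame feature vectors, ordered by framing time, as columns.

    Returns a matrix with one row per feature dimension and one column per
    frame, i.e. (20, N) when every frame carries 20 coefficients.
    """
    if not frame_features:
        raise ValueError("at least one speech frame is required")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frames must have the same feature dimension")
    return [list(row) for row in zip(*frame_features)]


# Five frames of placeholder 20-dimensional features; value = frame * 100 + dim.
features = [[float(t * 100 + d) for d in range(20)] for t in range(5)]
unit = build_observation_unit(features)
print(len(unit), len(unit[0]))
```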
- the deep convolutional neural network model uses an identification function for identity verification, and the identification function includes:
- Obj is the objective function of the deep convolutional neural network model.
- The probability that the deep convolutional neural network model gives the correct discrimination is increased until convergence, thereby verifying the identity of the target.
- x is the user feature obtained at the normalization layer for one observation feature unit of a pair.
- y is the user feature obtained at the normalization layer for the other observation feature unit of the pair.
- K is a constant.
- P(x, y) is the probability that a set of paired observation feature units belongs to the same user.
- L(x, y) is the similarity computed for a set of paired observation feature units.
- U is the intra-class relationship matrix of the user.
- V is the relationship matrix between user classes
- b is the offset amount
- T is the matrix transposition.
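The similarity $L(x, y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$ given in the claims can be evaluated directly once x, y, U, V and b are known. The sketch below uses plain Python lists, and the helper names are illustrative assumptions. The expressions for Obj and P(x, y) are not reproduced in this text, so they are deliberately not sketched.

```python
def mat_vec(matrix, vec):
    # Matrix-vector product for a matrix given as a list of rows.
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def similarity(x, y, U, V, b):
    # L(x, y) = x^T U y - x^T V x - y^T V y + b
    return (dot(x, mat_vec(U, y))
            - dot(x, mat_vec(V, x))
            - dot(y, mat_vec(V, y))
            + b)
```

In a trained model U, V and b would be learned parameters and x, y the normalized user features of the two paired observation feature units; here they are simply passed in as arguments.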
- before step S4, the method further includes:
- obtaining a first preset number of voice pairs of the same user, for example, collecting 1000 users with 1000 voice pairs per user, each voice pair consisting of two utterances by the same user with different pronunciation content;
- obtaining a second preset number of voice pairs of different users, for example, collecting 1000 users and pairing different users two by two, each pair of users yielding one voice pair from utterances of the same pronunciation content;
- performing framing processing on the voices in each voice pair according to the preset framing parameters (for example, a frame length of 25 milliseconds and a frame shift of 10 milliseconds) to obtain a plurality of speech frames corresponding to each voice pair;
- extracting a preset type of acoustic feature (e.g., 20-dimensional Mel frequency cepstral coefficient (MFCC) spectral features) from each speech frame using a predetermined filter (e.g., a Mel filter), and generating a plurality of observation feature units for each voice pair from the extracted acoustic features, that is, forming a plurality of feature data matrices from the Mel frequency cepstral coefficients, each feature data matrix being one observation feature unit;
- pairing, two by two, the observation feature units corresponding to two voices belonging to the same user and to two voices belonging to different users, to obtain multiple groups of paired observation feature units;
- dividing the voice pairs into a training set of a first percentage (e.g., 70%) and a verification set of a second percentage (e.g., 20%), the sum of the first percentage and the second percentage being less than or equal to 1;
- training the deep convolutional neural network model with each group of observation feature units of the voice pairs in the training set, and verifying the accuracy of the trained deep convolutional neural network model with the verification set after the training is completed;
- if the accuracy is greater than a preset threshold (e.g., 98.5%), the training ends and the trained deep convolutional neural network model is used as the deep convolutional neural network model in step S4; or, if the accuracy is less than or equal to the preset threshold, the number of voice pairs used for training is increased and the above steps are re-executed to retrain until the accuracy of the trained deep convolutional neural network model is greater than the preset threshold.
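The training procedure above (pairing, splitting into training and verification sets, and retraining with more data until the accuracy threshold is exceeded) can be outlined as below. This is a hedged sketch: `build_pairs`, `split_pairs`, `train_until_accurate`, the callables `load_pairs`, `train_fn` and `eval_fn`, the doubling growth factor and the round limit are all hypothetical, and the pairing ignores the pronunciation-content constraints described in the source. Only the 70%/20% split and the 98.5% threshold come from the text.

```python
import random
from itertools import combinations


def build_pairs(units_by_user):
    # Returns (unit_a, unit_b, label) triples: 1 = same user, 0 = different users.
    pairs = []
    users = sorted(units_by_user)
    for user in users:
        for a, b in combinations(units_by_user[user], 2):
            pairs.append((a, b, 1))
    for u1, u2 in combinations(users, 2):
        for a in units_by_user[u1]:
            for b in units_by_user[u2]:
                pairs.append((a, b, 0))
    return pairs


def split_pairs(pairs, train_frac=0.7, val_frac=0.2, seed=0):
    # The two fractions may sum to less than 1; the remainder is left unused.
    if train_frac + val_frac > 1:
        raise ValueError("fractions must sum to at most 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]


def train_until_accurate(load_pairs, train_fn, eval_fn, threshold=0.985,
                         n_pairs=1000, growth=2, max_rounds=10):
    # load_pairs(n) -> list of n pairs; train_fn(train_set) -> model;
    # eval_fn(model, val_set) -> accuracy in [0, 1]. All three are placeholders.
    for _ in range(max_rounds):
        train_set, val_set = split_pairs(load_pairs(n_pairs))
        model = train_fn(train_set)
        if eval_fn(model, val_set) > threshold:
            return model
        n_pairs *= growth  # assumed policy: double the amount of training data
    raise RuntimeError("accuracy threshold not reached")
```

The round limit is a practical safeguard added for the sketch; the source simply repeats training until the accuracy exceeds the preset threshold.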
- the present invention also provides a computer readable storage medium on which an identity verification system is stored, the steps of the identity verification method described above being implemented when the identity verification system is executed by a processor.
- the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented in hardware, but in many cases the former is the better implementation.
- Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and including a number of instructions for causing a terminal device (which may be a cell phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the various embodiments of the present invention.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Security & Cryptography (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computer Hardware Design (AREA)
- Signal Processing (AREA)
- Game Theory and Decision Science (AREA)
- Business, Economics & Management (AREA)
- Telephonic Communication Services (AREA)
- Collating Specific Patterns (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
- User Interface Of Digital Computer (AREA)
- Testing And Monitoring For Control Systems (AREA)
Claims (20)
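- An electronic device, characterized in that the electronic device comprises a memory and a processor connected to the memory, the memory storing an identity verification system executable on the processor, the identity verification system, when executed by the processor, implementing the following steps: S1, after receiving current voice data of a target user whose identity is to be verified, performing framing processing on the current voice data according to preset framing parameters to obtain a plurality of speech frames; S2, extracting a preset type of acoustic feature from each speech frame using a predetermined filter, and generating a plurality of observation feature units corresponding to the current voice data according to the extracted acoustic features; S3, pairing each observation feature unit with a pre-stored observation feature unit, two by two, to obtain multiple groups of paired observation feature units; S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
- The electronic device according to claim 1, characterized in that the predetermined filter is a Mel filter, and the step of extracting a preset type of acoustic feature from each speech frame using the predetermined filter comprises: performing windowing processing on the speech frame; performing a Fourier transform on each windowed frame to obtain a corresponding spectrum; inputting the spectrum into the Mel filter to output a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain Mel frequency cepstral coefficients (MFCC), the MFCC being used as the acoustic feature of the speech frame.
- The electronic device according to claim 2, characterized in that the step of generating a plurality of observation feature units corresponding to the current voice data according to the extracted acoustic features comprises: taking the speech frames of each recording in the current voice data as a speech frame set, and splicing the 20-dimensional MFCC of each speech frame in the speech frame set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
- The electronic device according to claim 3, characterized in that the pre-trained identity verification model of the preset type is a deep convolutional neural network model, the deep convolutional neural network model performs identity verification using an identification function, and the identification function comprises: $L(x, y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$; where Obj is the objective function of the deep convolutional neural network model, x is the user feature obtained at the normalization layer from one observation feature unit of a group of observation feature units, y is the user feature obtained at the normalization layer from the other observation feature unit of the group, K is a constant, P(x, y) is the probability that a group of observation feature units belong to the same user, L(x, y) is the similarity of a group of observation feature units, U is the intra-class relationship matrix of users, V is the inter-class relationship matrix of users, b is the bias, and T denotes matrix transposition.
- The identity verification method according to claim 4, characterized in that, before step S4, the method further comprises: obtaining a first preset number of voice pairs of the same user and a second preset number of voice pairs of different users, and performing framing processing on the voice in each voice pair according to the preset framing parameters to obtain a plurality of speech frames corresponding to each voice pair; extracting a preset type of acoustic feature from each speech frame using the predetermined filter, and generating a plurality of observation feature units for each voice pair according to the extracted acoustic features; pairing, two by two, the observation feature units corresponding to two voices belonging to the same user and to different users, to obtain multiple groups of paired observation feature units; dividing the voice pairs into a training set of a first percentage and a verification set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1; training the deep convolutional neural network model with each group of observation feature units of each voice pair in the training set, and, after the training is completed, verifying the accuracy of the trained deep convolutional neural network model with the verification set; if the accuracy is greater than a preset threshold, ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training so as to retrain.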
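- An identity verification method, characterized in that the identity verification method comprises: S1, after receiving current voice data of a target user whose identity is to be verified, performing framing processing on the current voice data according to preset framing parameters to obtain a plurality of speech frames; S2, extracting a preset type of acoustic feature from each speech frame using a predetermined filter, and generating a plurality of observation feature units corresponding to the current voice data according to the extracted acoustic features; S3, pairing each observation feature unit with a pre-stored observation feature unit, two by two, to obtain multiple groups of paired observation feature units; S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
- The identity verification method according to claim 6, characterized in that the predetermined filter is a Mel filter, and the step of extracting a preset type of acoustic feature from each speech frame using the predetermined filter comprises: performing windowing processing on the speech frame; performing a Fourier transform on each windowed frame to obtain a corresponding spectrum; inputting the spectrum into the Mel filter to output a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain Mel frequency cepstral coefficients (MFCC), the MFCC being used as the acoustic feature of the speech frame.
- The identity verification method according to claim 7, characterized in that the step of generating a plurality of observation feature units corresponding to the current voice data according to the extracted acoustic features comprises: taking the speech frames of each recording in the current voice data as a speech frame set, and splicing the 20-dimensional MFCC of each speech frame in the speech frame set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
- The identity verification method according to claim 8, characterized in that the pre-trained identity verification model of the preset type is a deep convolutional neural network model, the deep convolutional neural network model performs identity verification using an identification function, and the identification function comprises: $L(x, y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$; where Obj is the objective function of the deep convolutional neural network model, x is the user feature obtained at the normalization layer from one observation feature unit of a group of observation feature units, y is the user feature obtained at the normalization layer from the other observation feature unit of the group, K is a constant, P(x, y) is the probability that a group of observation feature units belong to the same user, L(x, y) is the similarity of a group of observation feature units, U is the intra-class relationship matrix of users, V is the inter-class relationship matrix of users, b is the bias, and T denotes matrix transposition.
- The identity verification method according to claim 9, characterized in that, before step S4, the method further comprises: obtaining a first preset number of voice pairs of the same user and a second preset number of voice pairs of different users, and performing framing processing on the voice in each voice pair according to the preset framing parameters to obtain a plurality of speech frames corresponding to each voice pair; extracting a preset type of acoustic feature from each speech frame using the predetermined filter, and generating a plurality of observation feature units for each voice pair according to the extracted acoustic features; pairing, two by two, the observation feature units corresponding to two voices belonging to the same user and to different users, to obtain multiple groups of paired observation feature units; dividing the voice pairs into a training set of a first percentage and a verification set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1; training the deep convolutional neural network model with each group of observation feature units of each voice pair in the training set, and, after the training is completed, verifying the accuracy of the trained deep convolutional neural network model with the verification set; if the accuracy is greater than a preset threshold, ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training so as to retrain.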
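- A computer readable storage medium, characterized in that an identity verification system is stored on the computer readable storage medium, the identity verification system being executable by at least one processor to implement the following steps: S1, after receiving current voice data of a target user whose identity is to be verified, performing framing processing on the current voice data according to preset framing parameters to obtain a plurality of speech frames; S2, extracting a preset type of acoustic feature from each speech frame using a predetermined filter, and generating a plurality of observation feature units corresponding to the current voice data according to the extracted acoustic features; S3, pairing each observation feature unit with a pre-stored observation feature unit, two by two, to obtain multiple groups of paired observation feature units; S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
- The computer readable storage medium according to claim 11, characterized in that the predetermined filter is a Mel filter, and the step of extracting a preset type of acoustic feature from each speech frame using the predetermined filter comprises: performing windowing processing on the speech frame; performing a Fourier transform on each windowed frame to obtain a corresponding spectrum; inputting the spectrum into the Mel filter to output a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain Mel frequency cepstral coefficients (MFCC), the MFCC being used as the acoustic feature of the speech frame.
- The computer readable storage medium according to claim 12, characterized in that the step of generating a plurality of observation feature units corresponding to the current voice data according to the extracted acoustic features comprises: taking the speech frames of each recording in the current voice data as a speech frame set, and splicing the 20-dimensional MFCC of each speech frame in the speech frame set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
- The computer readable storage medium according to claim 13, characterized in that the pre-trained identity verification model of the preset type is a deep convolutional neural network model, the deep convolutional neural network model performs identity verification using an identification function, and the identification function comprises: $L(x, y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$; where Obj is the objective function of the deep convolutional neural network model, x is the user feature obtained at the normalization layer from one observation feature unit of a group of observation feature units, y is the user feature obtained at the normalization layer from the other observation feature unit of the group, K is a constant, P(x, y) is the probability that a group of observation feature units belong to the same user, L(x, y) is the similarity of a group of observation feature units, U is the intra-class relationship matrix of users, V is the inter-class relationship matrix of users, b is the bias, and T denotes matrix transposition.
- The computer readable storage medium according to claim 14, characterized in that, before step S4, the steps further comprise: obtaining a first preset number of voice pairs of the same user and a second preset number of voice pairs of different users, and performing framing processing on the voice in each voice pair according to the preset framing parameters to obtain a plurality of speech frames corresponding to each voice pair; extracting a preset type of acoustic feature from each speech frame using the predetermined filter, and generating a plurality of observation feature units for each voice pair according to the extracted acoustic features; pairing, two by two, the observation feature units corresponding to two voices belonging to the same user and to different users, to obtain multiple groups of paired observation feature units; dividing the voice pairs into a training set of a first percentage and a verification set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1; training the deep convolutional neural network model with each group of observation feature units of each voice pair in the training set, and, after the training is completed, verifying the accuracy of the trained deep convolutional neural network model with the verification set; if the accuracy is greater than a preset threshold, ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training so as to retrain.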
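- An identity verification system, characterized in that the identity verification system is stored in a memory and is executable by at least one processor to implement the following steps: S1, after receiving current voice data of a target user whose identity is to be verified, performing framing processing on the current voice data according to preset framing parameters to obtain a plurality of speech frames; S2, extracting a preset type of acoustic feature from each speech frame using a predetermined filter, and generating a plurality of observation feature units corresponding to the current voice data according to the extracted acoustic features; S3, pairing each observation feature unit with a pre-stored observation feature unit, two by two, to obtain multiple groups of paired observation feature units; S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
- The identity verification system according to claim 16, characterized in that the predetermined filter is a Mel filter, and the step of extracting a preset type of acoustic feature from each speech frame using the predetermined filter comprises: performing windowing processing on the speech frame; performing a Fourier transform on each windowed frame to obtain a corresponding spectrum; inputting the spectrum into the Mel filter to output a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain Mel frequency cepstral coefficients (MFCC), the MFCC being used as the acoustic feature of the speech frame.
- The identity verification system according to claim 17, characterized in that the step of generating a plurality of observation feature units corresponding to the current voice data according to the extracted acoustic features comprises: taking the speech frames of each recording in the current voice data as a speech frame set, and splicing the 20-dimensional MFCC of each speech frame in the speech frame set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
- The identity verification system according to claim 18, characterized in that the pre-trained identity verification model of the preset type is a deep convolutional neural network model, the deep convolutional neural network model performs identity verification using an identification function, and the identification function comprises: $L(x, y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$; where Obj is the objective function of the deep convolutional neural network model, x is the user feature obtained at the normalization layer from one observation feature unit of a group of observation feature units, y is the user feature obtained at the normalization layer from the other observation feature unit of the group, K is a constant, P(x, y) is the probability that a group of observation feature units belong to the same user, L(x, y) is the similarity of a group of observation feature units, U is the intra-class relationship matrix of users, V is the inter-class relationship matrix of users, b is the bias, and T denotes matrix transposition.
- The identity verification system according to claim 19, characterized in that, before step S4, the steps further comprise: obtaining a first preset number of voice pairs of the same user and a second preset number of voice pairs of different users, and performing framing processing on the voice in each voice pair according to the preset framing parameters to obtain a plurality of speech frames corresponding to each voice pair; extracting a preset type of acoustic feature from each speech frame using the predetermined filter, and generating a plurality of observation feature units for each voice pair according to the extracted acoustic features; pairing, two by two, the observation feature units corresponding to two voices belonging to the same user and to different users, to obtain multiple groups of paired observation feature units; dividing the voice pairs into a training set of a first percentage and a verification set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1; training the deep convolutional neural network model with each group of observation feature units of each voice pair in the training set, and, after the training is completed, verifying the accuracy of the trained deep convolutional neural network model with the verification set; if the accuracy is greater than a preset threshold, ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training so as to retrain.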
Priority Applications (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| SG11201901766YA SG11201901766YA (en) | 2017-07-25 | 2017-08-31 | Electronic device, method and system of identity verification and computer readable storage medium |
| US16/084,233 US11068571B2 (en) | 2017-07-25 | 2017-08-31 | Electronic device, method and system of identity verification and computer readable storage medium |
| AU2017404565A AU2017404565B2 (en) | 2017-07-25 | 2017-08-31 | Electronic device, method and system of identity verification and computer readable storage medium |
| EP17897212.1A EP3460793B1 (en) | 2017-07-25 | 2017-08-31 | Electronic apparatus, identity verification method and system, and computer-readable storage medium |
| KR1020187017523A KR102159217B1 (ko) | 2017-07-25 | 2017-08-31 | 전자장치, 신분 검증 방법, 시스템 및 컴퓨터 판독 가능한 저장매체 |
| JP2018534079A JP6621536B2 (ja) | 2017-07-25 | 2017-08-31 | 電子装置、身元認証方法、システム及びコンピュータ読み取り可能な記憶媒体 |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710614649.6A CN107527620B (zh) | 2017-07-25 | 2017-07-25 | 电子装置、身份验证的方法及计算机可读存储介质 |
| CN201710614649.6 | 2017-07-25 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019019256A1 true WO2019019256A1 (zh) | 2019-01-31 |
Family ID: 60680120
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2017/100055 Ceased WO2019019256A1 (zh) | 2017-07-25 | 2017-08-31 | 电子装置、身份验证的方法、系统及计算机可读存储介质 |
Country Status (8)
| Country | Link |
|---|---|
| US (1) | US11068571B2 (zh) |
| EP (1) | EP3460793B1 (zh) |
| JP (1) | JP6621536B2 (zh) |
| KR (1) | KR102159217B1 (zh) |
| CN (1) | CN107527620B (zh) |
| AU (1) | AU2017404565B2 (zh) |
| SG (1) | SG11201901766YA (zh) |
| WO (1) | WO2019019256A1 (zh) |
Families Citing this family (23)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108417217B (zh) * | 2018-01-11 | 2021-07-13 | 思必驰科技股份有限公司 | 说话人识别网络模型训练方法、说话人识别方法及系统 |
| CN108154371A (zh) * | 2018-01-12 | 2018-06-12 | 平安科技(深圳)有限公司 | 电子装置、身份验证的方法及存储介质 |
| CN108564954B (zh) * | 2018-03-19 | 2020-01-10 | 平安科技(深圳)有限公司 | 深度神经网络模型、电子装置、身份验证方法和存储介质 |
| CN108564688A (zh) * | 2018-03-21 | 2018-09-21 | 阿里巴巴集团控股有限公司 | 身份验证的方法及装置和电子设备 |
| CN108877775B (zh) * | 2018-06-04 | 2023-03-31 | 平安科技(深圳)有限公司 | 语音数据处理方法、装置、计算机设备及存储介质 |
| CN109448746B (zh) * | 2018-09-28 | 2020-03-24 | 百度在线网络技术(北京)有限公司 | 语音降噪方法及装置 |
| CN109473105A (zh) * | 2018-10-26 | 2019-03-15 | 平安科技(深圳)有限公司 | 与文本无关的声纹验证方法、装置和计算机设备 |
| CN109346086A (zh) * | 2018-10-26 | 2019-02-15 | 平安科技(深圳)有限公司 | 声纹识别方法、装置、计算机设备和计算机可读存储介质 |
| CN109147818A (zh) * | 2018-10-30 | 2019-01-04 | Oppo广东移动通信有限公司 | 声学特征提取方法、装置、存储介质及终端设备 |
| CN109686382A (zh) * | 2018-12-29 | 2019-04-26 | 平安科技(深圳)有限公司 | 一种说话人聚类方法和装置 |
| CN109448726A (zh) * | 2019-01-14 | 2019-03-08 | 李庆湧 | 一种语音控制准确率的调整方法及系统 |
| CN110010133A (zh) * | 2019-03-06 | 2019-07-12 | 平安科技(深圳)有限公司 | 基于短文本的声纹检测方法、装置、设备及存储介质 |
| CN111798857A (zh) * | 2019-04-08 | 2020-10-20 | 北京嘀嘀无限科技发展有限公司 | 一种信息识别方法、装置、电子设备及存储介质 |
| CN110289004B (zh) * | 2019-06-18 | 2021-09-07 | 暨南大学 | 一种基于深度学习的人工合成声纹检测系统及方法 |
| CN110570873B (zh) * | 2019-09-12 | 2022-08-05 | Oppo广东移动通信有限公司 | 声纹唤醒方法、装置、计算机设备以及存储介质 |
| CN110556126B (zh) * | 2019-09-16 | 2024-01-05 | 平安科技(深圳)有限公司 | 语音识别方法、装置以及计算机设备 |
| CN110570871A (zh) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | 一种基于TristouNet的声纹识别方法、装置及设备 |
| CN114547568B (zh) * | 2022-02-09 | 2026-03-27 | 支付宝(杭州)数字服务技术有限公司 | 一种基于语音的身份验证方法、装置及设备 |
| CN115223569B (zh) * | 2022-06-02 | 2025-02-28 | 康佳集团股份有限公司 | 基于深度神经网络的说话人验证方法、终端及存储介质 |
| CN115424608A (zh) * | 2022-08-25 | 2022-12-02 | 深圳大学 | 一种基于机器学习的身份验证的方法和系统 |
| CN116094725B (zh) * | 2022-12-30 | 2024-12-24 | 中国人民解放军网络空间部队信息工程大学 | 基于sm2算法的声纹认证保护方法及系统 |
| CN116567150B (zh) * | 2023-07-11 | 2023-09-08 | 山东凌晓通信科技有限公司 | 一种会议室防窃听偷录的方法及系统 |
| US12211487B1 (en) * | 2024-01-18 | 2025-01-28 | Morgan Stanley Services Group Inc. | Systems and methods for accessible websites and/or applications for people with disabilities |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103391201A (zh) * | 2013-08-05 | 2013-11-13 | 公安部第三研究所 | 基于声纹识别实现智能卡身份验证的系统及方法 |
| CN104700018A (zh) * | 2015-03-31 | 2015-06-10 | 江苏祥和电子科技有限公司 | 一种用于智能机器人的识别方法 |
| CN107068154A (zh) * | 2017-03-13 | 2017-08-18 | 平安科技(深圳)有限公司 | 基于声纹识别的身份验证的方法及系统 |
Family Cites Families (21)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4032711A (en) | 1975-12-31 | 1977-06-28 | Bell Telephone Laboratories, Incorporated | Speaker recognition arrangement |
| US5583961A (en) * | 1993-03-25 | 1996-12-10 | British Telecommunications Public Limited Company | Speaker recognition using spectral coefficients normalized with respect to unequal frequency bands |
| GB2355834A (en) * | 1999-10-29 | 2001-05-02 | Nokia Mobile Phones Ltd | Speech recognition |
| US8645137B2 (en) | 2000-03-16 | 2014-02-04 | Apple Inc. | Fast, language-independent method for user authentication by voice |
| CN101465123B (zh) * | 2007-12-20 | 2011-07-06 | 株式会社东芝 | 说话人认证的验证方法和装置以及说话人认证系统 |
| US8554562B2 (en) * | 2009-11-15 | 2013-10-08 | Nuance Communications, Inc. | Method and system for speaker diarization |
| US20160293167A1 (en) * | 2013-10-10 | 2016-10-06 | Google Inc. | Speaker recognition using neural networks |
| CN104008751A (zh) * | 2014-06-18 | 2014-08-27 | 周婷婷 | 一种基于bp神经网络的说话人识别方法 |
| US10580401B2 (en) | 2015-01-27 | 2020-03-03 | Google Llc | Sub-matrix input for neural network layers |
| JP6280068B2 (ja) * | 2015-03-09 | 2018-02-14 | 日本電信電話株式会社 | パラメータ学習装置、話者認識装置、パラメータ学習方法、話者認識方法、およびプログラム |
| JP6616182B2 (ja) * | 2015-12-25 | 2019-12-04 | 綜合警備保障株式会社 | 話者認識装置、判別値生成方法及びプログラム |
| CN105788592A (zh) * | 2016-04-28 | 2016-07-20 | 乐视控股(北京)有限公司 | 一种音频分类方法及装置 |
| CN105869644A (zh) * | 2016-05-25 | 2016-08-17 | 百度在线网络技术(北京)有限公司 | 基于深度学习的声纹认证方法和装置 |
| AU2017294791B2 (en) * | 2016-07-11 | 2021-06-03 | Ftr, Ltd. | Method and system for automatically diarising a sound recording |
| CN106448684A (zh) * | 2016-11-16 | 2017-02-22 | 北京大学深圳研究生院 | 基于深度置信网络特征矢量的信道鲁棒声纹识别系统 |
| CN106710599A (zh) * | 2016-12-02 | 2017-05-24 | 深圳撒哈拉数据科技有限公司 | 一种基于深度神经网络的特定声源检测方法与系统 |
| US10546575B2 (en) * | 2016-12-14 | 2020-01-28 | International Business Machines Corporation | Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier |
| CN107610707B (zh) * | 2016-12-15 | 2018-08-31 | 平安科技(深圳)有限公司 | 一种声纹识别方法及装置 |
| CN106847292B (zh) * | 2017-02-16 | 2018-06-19 | 平安科技(深圳)有限公司 | 声纹识别方法及装置 |
| US10637898B2 (en) * | 2017-05-24 | 2020-04-28 | AffectLayer, Inc. | Automatic speaker identification in calls |
| US11289098B2 (en) * | 2019-03-08 | 2022-03-29 | Samsung Electronics Co., Ltd. | Method and apparatus with speaker recognition registration |
2017
- 2017-07-25 CN CN201710614649.6A patent/CN107527620B/zh active Active
- 2017-08-31 US US16/084,233 patent/US11068571B2/en active Active
- 2017-08-31 JP JP2018534079A patent/JP6621536B2/ja active Active
- 2017-08-31 AU AU2017404565A patent/AU2017404565B2/en not_active Ceased
- 2017-08-31 EP EP17897212.1A patent/EP3460793B1/en active Active
- 2017-08-31 SG SG11201901766YA patent/SG11201901766YA/en unknown
- 2017-08-31 KR KR1020187017523A patent/KR102159217B1/ko active Active
- 2017-08-31 WO PCT/CN2017/100055 patent/WO2019019256A1/zh not_active Ceased
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP3460793A4 * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110459209A (zh) * | 2019-08-20 | 2019-11-15 | 深圳追一科技有限公司 | 语音识别方法、装置、设备及存储介质 |
| US12205147B2 (en) | 2019-08-29 | 2025-01-21 | Tencent Technology (Shenzhen) Company Limited | Feature processing method and apparatus for artificial intelligence recommendation model, electronic device, and storage medium |
| CN111191754A (zh) * | 2019-12-30 | 2020-05-22 | 秒针信息技术有限公司 | 语音采集方法、装置、电子设备及存储介质 |
| CN111191754B (zh) * | 2019-12-30 | 2023-10-27 | 秒针信息技术有限公司 | 语音采集方法、装置、电子设备及存储介质 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3460793B1 (en) | 2023-04-05 |
| EP3460793A1 (en) | 2019-03-27 |
| US20210097159A1 (en) | 2021-04-01 |
| AU2017404565A1 (en) | 2019-02-14 |
| JP6621536B2 (ja) | 2019-12-18 |
| CN107527620B (zh) | 2019-03-26 |
| US11068571B2 (en) | 2021-07-20 |
| SG11201901766YA (en) | 2019-04-29 |
| CN107527620A (zh) | 2017-12-29 |
| KR102159217B1 (ko) | 2020-09-24 |
| AU2017404565B2 (en) | 2020-01-02 |
| JP2019531492A (ja) | 2019-10-31 |
| EP3460793A4 (en) | 2020-04-01 |
| KR20190022432A (ko) | 2019-03-06 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107527620B (zh) | 电子装置、身份验证的方法及计算机可读存储介质 | |
| TWI641965B (zh) | 基於聲紋識別的身份驗證的方法及系統 | |
| CN108564954B (zh) | 深度神经网络模型、电子装置、身份验证方法和存储介质 | |
| WO2020181824A1 (zh) | 声纹识别方法、装置、设备以及计算机可读存储介质 | |
| WO2021051572A1 (zh) | 语音识别方法、装置以及计算机设备 | |
| WO2019100606A1 (zh) | 电子装置、基于声纹的身份验证方法、系统及存储介质 | |
| CN108564955B (zh) | 电子装置、身份验证方法和计算机可读存储介质 | |
| WO2018149077A1 (zh) | 声纹识别方法、装置、存储介质和后台服务器 | |
| WO2017215558A1 (zh) | 一种声纹识别方法和装置 | |
| CN108630208B (zh) | 服务器、基于声纹的身份验证方法及存储介质 | |
| WO2019136912A1 (zh) | 电子装置、身份验证的方法、系统及存储介质 | |
| CN113223536A (zh) | 声纹识别方法、装置及终端设备 | |
| WO2021042537A1 (zh) | 语音识别认证方法及系统 | |
| WO2020034628A1 (zh) | 口音识别方法、装置、计算机装置及存储介质 | |
| WO2019136911A1 (zh) | 更新声纹数据的语音识别方法、终端装置及存储介质 | |
| CN114141254A (zh) | 声纹信号的更新方法及其装置、电子设备及存储介质 | |
| WO2019196305A1 (zh) | 电子装置、身份验证的方法及存储介质 | |
| WO2020140609A1 (zh) | 一种语音识别方法、设备及计算机可读存储介质 | |
| CN108650266B (zh) | 服务器、声纹验证的方法及存储介质 | |
| CN110797033A (zh) | 基于人工智能的声音识别方法、及其相关设备 | |
| WO2021128847A1 (zh) | 终端交互方法、装置、计算机设备及存储介质 | |
| WO2019179033A1 (zh) | 说话人认证方法、服务器及计算机可读存储介质 | |
| CN115101054A (zh) | 基于热词图的语音识别方法、装置、设备及存储介质 | |
| CN110853652A (zh) | 身份识别方法、装置及计算机可读存储介质 | |
| CN115641853B (zh) | 声纹识别方法及装置、计算机可读存储介质、终端 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 20187017523; Country of ref document: KR; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2018534079; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2017404565; Country of ref document: AU; Date of ref document: 20170831; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2017897212; Country of ref document: EP; Effective date: 20180829 |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17897212; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
