WO2019019256A1 - Electronic device, identity verification method and system, and computer-readable storage medium - Google Patents

Electronic device, identity verification method and system, and computer-readable storage medium

Info

Publication number
WO2019019256A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
feature
feature units
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/100055
Other languages
English (en)
French (fr)
Inventor
王健宗
郭卉
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to SG11201901766YA
Priority to US16/084,233 (US11068571B2)
Priority to AU2017404565A (AU2017404565B2)
Priority to EP17897212.1A (EP3460793B1)
Priority to KR1020187017523A (KR102159217B1)
Priority to JP2018534079A (JP6621536B2)
Publication of WO2019019256A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/10 Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems

Definitions

  • The present invention relates to the field of communications technologies, and in particular to an electronic device, an identity verification method and system, and a computer-readable storage medium.
  • Voiceprint recognition is an identity authentication technology that uses computer simulation to discriminate the target speech. It can be widely used in fields such as the Internet, banking systems, public security and justice.
  • The traditional voiceprint recognition scheme models the speaker's recording with a universal background model built on a Gaussian mixture model, analyzes the differences, extracts voiceprint features from those differences, and scores them with a similarity measure to give the recognition result.
  • This scheme has a low recognition error rate and works well for long recordings (for example, recordings of 30 seconds or longer). For the short recordings that are common across business scenarios (for example, recordings shorter than 30 seconds), however, the limited number of parameters means that the universal background model framework cannot model the subtle differences in the recording well, so short-speech recognition performs poorly and the recognition error rate is high.
  • The present invention provides an electronic device including a memory and a processor coupled to the memory, wherein the memory stores an identity verification system operable on the processor. When the identity verification system is executed by the processor, the following steps are implemented:
  • S4: Input the multiple sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
  • The present invention also provides an identity verification method, which includes:
  • the current voice data is framed according to preset framing parameters to obtain a plurality of speech frames;
  • S4: Input the multiple sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
  • The present invention also provides a computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to implement the following steps:
  • S4: Input the multiple sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
  • The present invention also provides an identity verification system, stored in a memory and executable by at least one processor to implement the following steps:
  • S4: Input the multiple sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
  • The beneficial effects of the present invention are as follows: the current voice data is first framed to obtain a plurality of speech frames; acoustic features of a preset type are extracted from each speech frame using a predetermined filter; a plurality of observation feature units corresponding to the current voice data are generated from the extracted acoustic features; each observation feature unit is paired with the pre-stored observation feature units to obtain multiple sets of paired observation feature units; and the multiple sets of paired observation feature units are input into the pre-trained identity verification model of the preset type to obtain the output identity verification result, which is used to verify the identity of the target user.
  • When short recordings arising in multiple business scenarios need identity verification, the present invention frames the short recording, extracts acoustic features, converts them into observation feature units, and finally inputs the paired observation feature units into the identity verification model. This gives better short-speech recognition performance and can reduce the recognition error rate.
  • FIG. 1 is a schematic diagram of an optional application environment according to various embodiments of the present invention.
  • FIG. 2 is a schematic flowchart diagram of an embodiment of an authentication method according to an embodiment of the present invention.
  • Descriptions involving “first”, “second” and the like in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to.
  • Features defined with “first” or “second” may therefore explicitly or implicitly include at least one such feature.
  • The technical solutions of the various embodiments may be combined with each other, but only on the basis that those skilled in the art can implement the combination; when a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and not to fall within the scope of protection claimed by the present invention.
  • Referring to FIG. 1, it is a schematic diagram of an application environment of a preferred embodiment of the identity verification method of the present invention.
  • the application environment diagram includes an electronic device 1 and a terminal device 2.
  • the electronic device 1 can perform data interaction with the terminal device 2 through a suitable technology such as a network or a near field communication technology.
  • The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device, and that can collect the user's voice data with a voice collection device (for example, a microphone). Examples include mobile devices such as personal computers, tablets, smart phones, personal digital assistants (PDAs), game consoles, Internet Protocol Television (IPTV), smart wearable devices and navigation devices, and fixed terminals such as digital TVs, desktop computers, notebooks and servers.
  • The electronic device 1 is an apparatus capable of automatically performing numerical calculation and/or information processing in accordance with instructions that are set or stored in advance.
  • The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
  • The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to each other through a system bus. It should be noted that FIG. 1 only shows the electronic device 1 with the components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • The storage device 11 includes a memory and at least one type of readable storage medium.
  • The memory provides a cache for the operation of the electronic device 1;
  • the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, or an optical disk.
  • The readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card with which the electronic device 1 is equipped.
  • The readable storage medium of the storage device 11 is generally used to store the operating system and the various types of application software installed in the electronic device 1, such as the program code of the identity verification system in an embodiment of the present invention. The storage device 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • The processor 12 is typically used to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with the terminal device 2.
  • The processor 12 is configured to run program code or process data stored in the memory 11, for example to run the identity verification system.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2, and establish a data transmission channel and a communication connection between the electronic device 1 and one or more terminal devices 2.
  • The identity verification system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11, the at least one computer-readable instruction being executable by the processor 12 to implement the methods of the various embodiments of the present application; the at least one computer-readable instruction can be divided into different logic modules according to the functions implemented by its parts.
  • When executed by the processor 12, the identity verification system implements the following steps:
  • The user needs to record in multiple business scenarios; the current voice data received during recording is one segment of recorded data, and that segment is short speech.
  • When recording, try to prevent interference from environmental noise and from the voice acquisition equipment.
  • The recording device should be kept at an appropriate distance from the user, and recording devices with large distortion should be avoided.
  • The power supply is preferably mains power with a stable current; a sensor should be used when recording telephone calls.
  • The voice data can be denoised before framing to further reduce interference.
  • The preset framing parameters are, for example, one frame every 25 milliseconds with a frame shift of 10 milliseconds; after framing, each segment of recorded data yields multiple speech frames.
  • This embodiment does not limit the framing method described above; other framing parameters may also be used, and all of them fall within the protection scope of this embodiment.
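The framing parameters above can be illustrated with a minimal sketch. It assumes the recording is already available as a list of samples at a known sample rate; the function name `frame_signal` and the 16 kHz rate in the example are illustrative assumptions and do not come from the patent text.

```python
def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D sample sequence into overlapping frames.

    frame_ms and shift_ms default to the example values in the text
    (25 ms frames, 10 ms frame shift). Trailing samples that do not
    fill a whole frame are dropped (an assumption of this sketch).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and frame shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of silent dummy audio at an assumed 16 kHz sample rate.
frames = frame_signal([0.0] * 16000, 16000)
```

With these values each frame holds 400 samples and consecutive frames overlap by 240 samples, which is why a short recording still yields many speech frames.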
  • The predetermined filter is preferably a Mel filter.
  • The acoustic feature is a voiceprint feature.
  • Voiceprint features come in various types, such as wide-band voiceprints, narrow-band voiceprints and amplitude voiceprints; in this embodiment, the voiceprint feature is preferably the Mel Frequency Cepstral Coefficient (MFCC).
  • The feature data matrix is composed from the Mel frequency cepstral coefficients; specifically, one feature data matrix is composed from the Mel frequency cepstral coefficients of each segment of recorded data.
  • The feature data matrices corresponding to multiple segments of recorded data are the plurality of observation feature units corresponding to the current voice data.
  • S4: Input the multiple sets of paired observation feature units into a pre-trained identity verification model of a preset type, and obtain the output identity verification result to verify the identity of the target user.
  • A plurality of observation feature units of the user are stored in advance. After the plurality of observation feature units corresponding to the current voice data are generated, they are paired one by one with the pre-stored observation feature units to obtain multiple sets of paired observation feature units.
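A minimal sketch of the pairing step follows. The text does not spell out the exact pairing rule, so the sketch assumes that every observation feature unit generated from the current voice data is paired with every pre-stored unit of the user; `pair_units` is a hypothetical helper name.

```python
from itertools import product


def pair_units(current_units, enrolled_units):
    """Return every (current, enrolled) pair of observation feature units.

    Assumption: pairing is the full Cartesian product of the two lists.
    """
    return list(product(current_units, enrolled_units))


# Placeholder strings stand in for the real feature matrices.
pairs = pair_units(["cur_1", "cur_2"], ["stored_1", "stored_2", "stored_3"])
```

Under this assumption, two current units and three stored units give six paired sets to feed into the verification model.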
  • The identity verification model of the preset type is preferably a deep convolutional neural network model, which consists of one input layer, four convolution layers, one pooling layer, two fully connected layers, one normalization layer, and one classification layer.
  • The detailed structure of the deep convolutional neural network model is as shown in Table 1.
  • The Layer Name column indicates the name of each layer.
  • Input indicates the input layer.
  • Conv indicates a convolution layer.
  • Conv1 indicates the first convolution layer.
  • Mean_std_pooling indicates the pooling layer.
  • Full connected indicates a fully connected layer.
  • Normalize Wrap indicates the normalization layer, and Scoring indicates the classification layer.
  • Batch Size indicates the number of observation feature units input to the current layer.
  • Kernel Size indicates the scale of the convolution kernel of the current layer (for example, a Kernel Size of 3 means that the convolution kernel is 3x3).
  • Stride Size indicates the moving step of the convolution kernel, that is, the distance moved to the next convolution position after one convolution.
  • Filter size refers to the number of output channels of each layer; for example, the input voice has 1 channel at the Input layer (that is, the original data) and has 512 channels after passing through the Input layer.
  • The Input layer samples the input observation feature units.
  • The 1*1 convolution kernel of the Conv layer can scale the input and combine features.
  • The Normalize Wrap layer normalizes the variance of the input.
  • The Scoring layer trains the user's intra-class relationship matrix U and the user's inter-class relationship matrix V, both of dimension 300*300.
  • The output identity verification result is then obtained; it is either verification passed or verification failed, and it is used to verify the identity of the target user.
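To make the Kernel Size and Stride Size columns concrete, the following sketch computes how many positions a convolution kernel visits along one axis. The formula is the standard one for convolutions; the zero-padding default and the function name `conv_output_length` are assumptions of this sketch, since the text does not state the padding used.

```python
def conv_output_length(input_length, kernel_size, stride, padding=0):
    """Number of kernel positions along one axis of a convolution."""
    if kernel_size <= 0 or stride <= 0:
        raise ValueError("kernel_size and stride must be positive")
    padded = input_length + 2 * padding
    if padded < kernel_size:
        raise ValueError("input is shorter than the kernel")
    # The kernel starts at offset 0 and moves by `stride` each step.
    return (padded - kernel_size) // stride + 1


# Example: 100 frames, kernel size 3, with strides 1 and 2.
length_stride_1 = conv_output_length(100, 3, 1)
length_stride_2 = conv_output_length(100, 3, 2)
```

A kernel size of 1 with stride 1 leaves the length unchanged, which matches the description of the 1*1 Conv layer as acting only on scaling and feature combination.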
  • Compared with the prior art, this embodiment first frames the current voice data to obtain a plurality of speech frames, extracts acoustic features of a preset type from each speech frame using a predetermined filter, and generates a plurality of observation feature units corresponding to the current voice data from the extracted acoustic features. Each observation feature unit is paired with the pre-stored observation feature units to obtain multiple sets of paired observation feature units, and the multiple sets of paired observation feature units are input into the pre-trained identity verification model of the preset type to obtain the output identity verification result, which is used to verify the identity of the target user.
  • When short recordings arising in multiple business scenarios need identity verification, the short recording is framed, acoustic features are extracted and converted into observation feature units, and the paired observation feature units are finally input into the identity verification model. This gives better short-speech recognition performance and can reduce the recognition error rate.
  • the step of extracting a predetermined type of acoustic features in each speech frame by using a predetermined filter comprises:
  • a cepstrum analysis is performed on the Mel spectrum to obtain a Mel frequency cepstral coefficient MFCC, and the Mel frequency cepstral coefficient MFCC is used as an acoustic feature of the speech frame.
  • Each frame of data is treated as a stationary signal. Because the Mel spectral features are later obtained through a Fourier expansion, the Gibbs effect can occur: when a periodic function with discontinuities (such as a rectangular pulse) is expanded as a Fourier series and a finite number of terms is used for synthesis, the more terms are used, the closer the peak in the synthesized waveform lies to the discontinuity of the original signal. When the number of terms is large, the peak value tends to a constant, approximately equal to 9% of the total jump. To suppress the Gibbs effect, the speech frame needs to be windowed to reduce the discontinuity of the signal at the beginning and end of the speech frame.
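The windowing step can be sketched as follows. The text only says that the frame is windowed; the Hamming window used here is an assumption (it is a common choice in speech processing), and both function names are illustrative.

```python
import math


def hamming_window(n):
    """Hamming window of length n (assumed window type, not from the text)."""
    if n <= 0:
        raise ValueError("window length must be positive")
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [s * w for s, w in zip(frame, window)]


# A 400-sample frame of ones makes the taper at both ends visible.
window = hamming_window(400)
windowed = apply_window([1.0] * 400, window)
```

The window attenuates the first and last samples of the frame to about 8% of their value, so the frame no longer starts and ends with an abrupt jump.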
  • The cepstrum analysis consists, for example, of taking the logarithm and performing an inverse transform.
  • The inverse transform is generally implemented by the discrete cosine transform (DCT), and the 2nd to 13th coefficients after the DCT are taken as the Mel frequency cepstral coefficients (MFCC).
  • The MFCC is the voiceprint feature of this frame of speech data, and the MFCCs of all frames form a feature data matrix, which is the acoustic feature of the speech frames.
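The cepstrum analysis described above (logarithm, then DCT, then keeping the 2nd to 13th coefficients) can be sketched as below. The sketch assumes the Mel filter bank energies of one frame are already available; the unnormalized DCT-II, the 26-filter example, the energy floor, and the function names are assumptions made for illustration.

```python
import math


def dct_ii(values):
    """Unnormalized type-II discrete cosine transform of a sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def mfcc_from_mel_energies(mel_energies, first=1, last=13):
    """Log of the Mel energies, DCT, then keep the 2nd..13th coefficients.

    With zero-based indexing the 2nd to 13th coefficients are indexes 1..12.
    A small floor avoids taking the logarithm of zero (sketch assumption).
    """
    log_energies = [math.log(max(e, 1e-10)) for e in mel_energies]
    cepstrum = dct_ii(log_energies)
    return cepstrum[first:last]


# Dummy energies for an assumed bank of 26 Mel filters.
coeffs = mfcc_from_mel_energies([1.0 + i for i in range(26)])
```

Dropping the first coefficient removes the term that only reflects the overall log energy of the frame, which is why a perfectly flat Mel spectrum gives coefficients that are all zero.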
  • the step of generating a plurality of observed feature units corresponding to the current voice data according to the extracted acoustic features comprises:
  • The deep convolutional neural network model uses an identification function for identity verification, and the identification function includes:
  • Obj is the objective function of the deep convolutional neural network model.
  • The probability that the deep convolutional neural network model gives the correct discrimination is increased until convergence, thereby verifying the identity of the target.
  • x is the user feature obtained at the normalization layer from one observation feature unit of a pair.
  • y is the user feature obtained at the normalization layer from the other observation feature unit of the pair.
  • K is a constant.
  • P(x, y) is the probability that a pair of observation feature units belongs to the same user.
  • L(x, y) is the computed similarity of a pair of observation feature units.
  • U is the intra-class relationship matrix of the user.
  • V is the inter-class relationship matrix of the users.
  • b is the bias.
  • T denotes matrix transposition.
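The formulas of the identification function are not reproduced in this text, so the following is only a sketch under an assumed form that is consistent with the listed symbols: a bilinear similarity L(x, y) built from U, V and b, a logistic probability P(x, y), and a negative log-likelihood objective Obj in which the different-user term is weighted by K. The exact expressions, the toy 2-dimensional matrices, and all function names are assumptions.

```python
import math


def bilinear(a, m, b):
    """Return a^T m b for vectors a, b and a square matrix m (nested lists)."""
    return sum(a[i] * m[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))


def similarity(x, y, U, V, b):
    """Assumed L(x, y) = x^T U y - x^T V x - y^T V y + b."""
    return bilinear(x, U, y) - bilinear(x, V, x) - bilinear(y, V, y) + b


def same_user_probability(x, y, U, V, b):
    """Assumed P(x, y): logistic function of the similarity L(x, y)."""
    return 1.0 / (1.0 + math.exp(-similarity(x, y, U, V, b)))


def objective(same_pairs, diff_pairs, U, V, b, K):
    """Assumed Obj: negative log-likelihood over same and different pairs."""
    same_term = sum(math.log(same_user_probability(x, y, U, V, b))
                    for x, y in same_pairs)
    diff_term = sum(math.log(1.0 - same_user_probability(x, y, U, V, b))
                    for x, y in diff_pairs)
    return -same_term - K * diff_term


# Toy 2-D example; in the text U and V are 300*300.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.0, 0.0], [0.0, 0.0]]
p_same = same_user_probability([1.0, 0.0], [1.0, 0.0], U, V, 0.0)
p_diff = same_user_probability([1.0, 0.0], [-1.0, 0.0], U, V, 0.0)
```

In this toy setting, features that point in the same direction score above 0.5 and opposite features score below it; a decision threshold on P(x, y) would then give the pass or fail result.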
  • FIG. 2 is a schematic flowchart of an embodiment of an authentication method according to an embodiment of the present invention.
  • the method for authenticating includes the following steps:
  • Step S1: after receiving the current voice data of the target user whose identity is to be verified, frame the current voice data according to preset framing parameters to obtain multiple speech frames;
  • The user needs to record in multiple business scenarios; the current voice data received during recording is one segment of recorded data, and that segment is short speech.
  • When recording, try to prevent interference from environmental noise and from the voice acquisition equipment.
  • The recording device should be kept at an appropriate distance from the user, and recording devices with large distortion should be avoided.
  • The power supply is preferably mains power with a stable current; a sensor should be used when recording telephone calls.
  • The voice data can be denoised before framing to further reduce interference.
  • The preset framing parameters are, for example, one frame every 25 milliseconds with a frame shift of 10 milliseconds; after framing, each segment of recorded data yields multiple speech frames.
  • This embodiment does not limit the framing method described above; other framing parameters may also be used, and all of them fall within the protection scope of this embodiment.
  • Step S2: extract acoustic features of a preset type from each speech frame using a predetermined filter, and generate a plurality of observation feature units corresponding to the current voice data from the extracted acoustic features;
  • The predetermined filter is preferably a Mel filter.
  • The acoustic feature is a voiceprint feature.
  • Voiceprint features come in various types, such as wide-band voiceprints, narrow-band voiceprints and amplitude voiceprints; in this embodiment, the voiceprint feature is preferably the Mel Frequency Cepstral Coefficient (MFCC).
  • The feature data matrix is composed from the Mel frequency cepstral coefficients; specifically, one feature data matrix is composed from the Mel frequency cepstral coefficients of each segment of recorded data.
  • The feature data matrices corresponding to multiple segments of recorded data are the plurality of observation feature units corresponding to the current voice data.
  • Step S3: pair each observation feature unit with the pre-stored observation feature units to obtain multiple sets of paired observation feature units;
  • Step S4: input the multiple sets of paired observation feature units into the pre-trained identity verification model of the preset type, and obtain the output identity verification result to verify the identity of the target user.
  • A plurality of observation feature units of the user are stored in advance. After the plurality of observation feature units corresponding to the current voice data are generated, they are paired one by one with the pre-stored observation feature units to obtain multiple sets of paired observation feature units.
  • The identity verification model of the preset type is preferably a deep convolutional neural network model, which consists of one input layer, four convolution layers, one pooling layer, two fully connected layers, one normalization layer, and one classification layer.
  • The detailed structure of the deep convolutional neural network model is as shown in Table 1 above, and details are not described herein again.
  • The Layer Name column indicates the name of each layer.
  • Input indicates the input layer.
  • Conv indicates a convolution layer.
  • Conv1 indicates the first convolution layer.
  • Mean_std_pooling indicates the pooling layer.
  • Full connected indicates a fully connected layer.
  • Normalize Wrap indicates the normalization layer, and Scoring indicates the classification layer.
  • Batch Size indicates the number of observation feature units input to the current layer.
  • Kernel Size indicates the scale of the convolution kernel of the current layer (for example, a Kernel Size of 3 means that the convolution kernel is 3x3).
  • Stride Size indicates the moving step of the convolution kernel, that is, the distance moved to the next convolution position after one convolution.
  • Filter size refers to the number of output channels of each layer; for example, the input voice has 1 channel at the Input layer (that is, the original data) and has 512 channels after passing through the Input layer.
  • The Input layer samples the input observation feature units.
  • The 1*1 convolution kernel of the Conv layer can scale the input and combine features.
  • The Normalize Wrap layer normalizes the variance of the input.
  • The Scoring layer trains the user's intra-class relationship matrix U and the user's inter-class relationship matrix V, both of dimension 300*300.
  • The output identity verification result is then obtained; it is either verification passed or verification failed, and it is used to verify the identity of the target user.
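The text names the pooling layer Mean_std_pooling without describing it further. One common reading, assumed here, is that each channel is reduced over time to its mean and standard deviation, which turns a variable-length sequence into a fixed-length vector. The function name and the layout of `feature_map` are illustrative assumptions.

```python
import math


def mean_std_pooling(feature_map):
    """Pool each channel over time into its mean and standard deviation.

    feature_map: list of channels, each a non-empty list of values over time.
    Returns all channel means followed by all channel standard deviations
    (population standard deviation, an assumption of this sketch).
    """
    means, stds = [], []
    for channel in feature_map:
        n = len(channel)
        if n == 0:
            raise ValueError("each channel needs at least one value")
        mean = sum(channel) / n
        variance = sum((v - mean) ** 2 for v in channel) / n
        means.append(mean)
        stds.append(math.sqrt(variance))
    return means + stds


# Two channels with two time steps each.
pooled = mean_std_pooling([[1.0, 3.0], [2.0, 2.0]])
```

Under this reading the output length is twice the number of channels regardless of how many frames the recording has, which suits short recordings of varying length.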
  • the step of extracting the acoustic characteristics of the preset type in each of the speech frames by using the predetermined filter in the above step S2 comprises:
  • a cepstrum analysis is performed on the Mel spectrum to obtain a Mel frequency cepstral coefficient MFCC, and the Mel frequency cepstral coefficient MFCC is used as an acoustic feature of the speech frame.
  • Each frame of data is treated as a stationary signal. Because the Mel spectral features are later obtained through a Fourier expansion, the Gibbs effect can occur: when a periodic function with discontinuities (such as a rectangular pulse) is expanded as a Fourier series and a finite number of terms is used for synthesis, the more terms are used, the closer the peak in the synthesized waveform lies to the discontinuity of the original signal. When the number of terms is large, the peak value tends to a constant, approximately equal to 9% of the total jump. To suppress the Gibbs effect, the speech frame needs to be windowed to reduce the discontinuity of the signal at the beginning and end of the speech frame.
  • The cepstrum analysis consists, for example, of taking the logarithm and performing an inverse transform.
  • The inverse transform is generally implemented by the discrete cosine transform (DCT), and the 2nd to 13th coefficients after the DCT are taken as the Mel frequency cepstral coefficients (MFCC).
  • The MFCC is the voiceprint feature of this frame of speech data, and the MFCCs of all frames form a feature data matrix, which is the acoustic feature of the speech frames.
  • the step of generating, according to the extracted acoustic features, the plurality of observed feature units corresponding to the current voice data in the foregoing step S2 comprises: using the current voice All the speech frames in each recorded data in the data are a set of speech frames, and the 20-dimensional Meyer frequency cepstral coefficient MFCC (ie, acoustic features) of each speech frame in the set of speech frames is corresponding to the corresponding speech frame.
  • the framing time is sequentially spliced to generate an observation feature unit of the corresponding (20, N)-dimensional matrix, and the N is the total number of frames of the voice frame set.
  • the deep convolutional neural network model uses a recognition function for identity verification, and the recognition function includes:
  • Obj is the objective function of the deep convolutional neural network model.
  • by maximizing the objective function, the probability that the deep convolutional neural network model makes the correct decision increases until convergence, whereby the identity of the target user is verified.
  • x is the user feature obtained at the normalization layer for one observation feature unit of a pair
  • y is the user feature obtained at the normalization layer for the other observation feature unit of the pair.
  • K is a constant
  • P(x, y) is the probability that a pair of observation feature units belongs to the same user
  • L(x, y) is the similarity of a pair of observation feature units
  • U is the user's intra-class relationship matrix.
  • V is the user's inter-class relationship matrix
  • b is the bias
  • T denotes matrix transposition.
  • before step S4, the method further includes:
  • obtaining a first preset number of voice pairs from the same user; for example, obtaining 1000 users with 1000 voice pairs per user, each voice pair consisting of two utterances by the same user with two different spoken contents;
  • obtaining a second preset number of voice pairs from different users; for example, obtaining 1000 users and pairing the users two by two, each pair of users yielding one voice pair for one identical spoken content.
  • framing the utterances in each voice pair according to the preset framing parameters (for example, framing every 25 milliseconds with a frame shift of 10 milliseconds) to obtain a plurality of speech frames for each voice pair.
  • extracting acoustic features of a preset type (e.g., 20-dimensional Mel frequency cepstral coefficient (MFCC) spectral features) from each speech frame using a predetermined filter (e.g., a Mel filter), and generating a plurality of observation feature units for each voice pair from the extracted acoustic features, that is, forming a plurality of feature data matrices from the Mel frequency cepstral coefficients, each feature data matrix being an observation feature unit;
  • pairing the observation feature units corresponding to two utterances belonging to the same user and to different users, to obtain multiple groups of paired observation feature units;
  • dividing the voice pairs into a training set of a first percentage (e.g., 70%) and a validation set of a second percentage (e.g., 20%), the sum of the first percentage and the second percentage being less than or equal to 1;
  • training the deep convolutional neural network model with the groups of observation feature units of the voice pairs in the training set and, after training is complete, verifying the accuracy of the trained deep convolutional neural network model with the validation set;
  • if the accuracy is greater than a preset threshold (e.g., 98.5%), training ends and the trained deep convolutional neural network model is used as the deep convolutional neural network model in step S4; if the accuracy is less than or equal to the preset threshold, the number of voice pairs used for training is increased and the above steps are re-executed to retrain until the accuracy of the trained deep convolutional neural network model is greater than the preset threshold.
  • the present invention also provides a computer-readable storage medium storing an identity verification system which, when executed by a processor, implements the steps of the identity verification method described above.
  • the methods of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation.
  • based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Signal Processing (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Telephonic Communication Services (AREA)
  • Collating Specific Patterns (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

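An electronic device, an identity verification method and system, and a computer-readable storage medium. The electronic device comprises a memory and a processor; the memory stores an identity verification system which, when executed by the processor, implements the following: after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames; extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data; pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units; and inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type to verify the identity of the target user. The invention can reduce the error rate of short-speech recognition.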

Description

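Electronic device, identity verification method and system, and computer-readable storage medium
This application claims priority to Chinese patent application No. 201710614649.6, filed on July 25, 2017 and entitled "Electronic device, identity verification method and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of communication technologies, and in particular to an electronic device, an identity verification method and system, and a computer-readable storage medium.
Background
Voiceprint recognition is an identity authentication technology that uses computer simulation to discriminate a target voice, and it can be widely applied in fields such as the Internet, banking systems, and public security and justice. At present, conventional voiceprint recognition schemes model the speaker's recordings with a universal background model based on Gaussian mixture models, perform difference analysis, extract voiceprint features according to the differences, score them with a similarity measure, and output a recognition result. Such schemes have a low error rate and work well for long recordings (for example, recordings of 30 seconds or more). However, for the short recordings (for example, recordings shorter than 30 seconds) that are common across business scenarios, the parameters are limited and the universal background model framework cannot model the subtle differences in the recordings well, so short-speech recognition performs poorly and the recognition error rate is high.
Summary of the Invention
The object of the present invention is to provide an electronic device, an identity verification method and system, and a computer-readable storage medium, with the aim of reducing the error rate of short-speech recognition.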
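To achieve the above object, the present invention provides an electronic device comprising a memory and a processor connected to the memory, the memory storing an identity verification system that can run on the processor, and the identity verification system, when executed by the processor, implementing the following steps:
S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.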
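To achieve the above object, the present invention further provides an identity verification method, the identity verification method comprising:
S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.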
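The present invention further provides a computer-readable storage medium storing an identity verification system, the identity verification system being executable by at least one processor to implement the following steps:
S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.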
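The present invention further provides an identity verification system, the identity verification system being stored in a memory and executable by at least one processor to implement the following steps:
S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.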
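The beneficial effects of the present invention are as follows. The invention first frames the current voice data to obtain a plurality of speech frames, extracts acoustic features of a preset type from each speech frame using a predetermined filter, and generates, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data. It then pairs each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units, inputs these groups into the identity verification model of a preset type, and obtains the output identity verification result to verify the identity of the target user. When authenticating the short recordings that occur in many business scenarios, the invention frames the short recording, extracts acoustic features, converts them into observation feature units, and finally feeds the paired observation feature units into the identity verification model. This performs well on short speech and can reduce the recognition error rate.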
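Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional application environment for the embodiments of the present invention;
FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present invention.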
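Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
Note that descriptions involving "first", "second" and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only where a person of ordinary skill in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, that combination should be regarded as non-existent and outside the scope of protection claimed by the invention.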
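Refer to FIG. 1, which is a schematic diagram of the application environment of a preferred embodiment of the identity verification method of the present invention. The application environment includes an electronic device 1 and a terminal device 2. The electronic device 1 can exchange data with the terminal device 2 through a suitable technology such as a network or near-field communication.
The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user via a keyboard, mouse, remote control, touchpad, voice-control device or the like, and that can collect the user's voice data with a voice collection device (such as a microphone). Examples include mobile devices such as personal computers, tablets, smartphones, personal digital assistants (PDA), game consoles, Internet Protocol Television (IPTV), smart wearable devices and navigation devices, and fixed terminals such as digital TVs, desktop computers, notebooks and servers.
The electronic device 1 is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions. The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud-computing-based cloud composed of a large number of hosts or network servers, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
In this embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 that are communicatively connected to one another via a system bus. Note that FIG. 1 only shows the electronic device 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk or an optical disc. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; in other embodiments, the non-volatile storage medium may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the application software installed on the electronic device 1, such as the program code of the identity verification system in an embodiment of the present invention. The memory 11 may also be used to temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 12 is generally used to control the overall operation of the electronic device 1, for example to perform control and processing related to data exchange or communication with the terminal device 2. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the identity verification system.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2 and to establish a data transmission channel and a communication connection between them.
The identity verification system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11. The at least one computer-readable instruction can be executed by the processor 12 to implement the methods of the embodiments of this application, and can be divided into different logical modules according to the functions implemented by its parts.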
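In one embodiment, the identity verification system, when executed by the processor 12, implements the following steps:
S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
In this embodiment, users need to make recordings in a variety of business scenarios. The current voice data received during recording consists of separate recording segments, and each segment is a short utterance.
During recording, interference from environmental noise and from the voice collection device should be avoided as far as possible. The recording device should be kept at a suitable distance from the user, devices with high distortion should be avoided, mains power is preferred, and the current should be kept stable; a sensor should be used when recording telephone calls. Before framing, the voice data may be denoised to further reduce interference.
When each recording segment in the current voice data is framed according to the preset framing parameters, the parameters are, for example, one frame every 25 milliseconds with a frame shift of 10 milliseconds, and each recording segment yields a plurality of speech frames after framing. This embodiment is of course not limited to this framing scheme; framing with other parameters may be used and also falls within the scope of protection of this embodiment.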
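The framing step can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it reads the example parameters as a 25 ms frame length advanced by a 10 ms shift, and the 16 kHz sampling rate and the function name `frame_signal` are assumptions introduced here.

```python
def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a sample sequence into overlapping frames.

    frame_ms / shift_ms follow the example framing parameters in the text;
    trailing samples that do not fill a whole frame are dropped (an assumption).
    """
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of silence at an assumed 16 kHz sampling rate.
frames = frame_signal([0.0] * 16000, sample_rate=16000)
print(len(frames), len(frames[0]))
```

With these assumed values each frame holds 400 samples and consecutive frames overlap by 240 samples.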
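S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
In this embodiment, the predetermined filter is preferably a Mel filter, and the acoustic features are voiceprint features. Voiceprint features come in several types, such as wideband voiceprints, narrowband voiceprints and amplitude voiceprints; the voiceprint features of this embodiment are preferably Mel frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC).
When the plurality of observation feature units corresponding to the current voice data are generated from the acoustic features, feature data matrices are formed from the Mel frequency cepstral coefficients. Specifically, one feature data matrix is formed from the Mel frequency cepstral coefficients of each recording segment, and the feature data matrices of the multiple recording segments are the plurality of observation feature units corresponding to the current voice data.
S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
In this embodiment, the observation feature units of a relatively large number of users are stored in advance. After the plurality of observation feature units corresponding to the current voice data are generated, they are paired with the pre-stored observation feature units, and the pairing yields multiple groups of observation feature units.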
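The pairing in step S3 can be illustrated as follows. This sketch assumes that every current unit is paired with every stored unit (a Cartesian product); the text does not say whether the stored units are restricted to the claimed identity, and the name `pair_units` is hypothetical.

```python
from itertools import product


def pair_units(current_units, stored_units):
    """Pair each current observation feature unit with each stored one.

    Assumption: an exhaustive pairing is used; each returned tuple is one
    group of paired observation feature units.
    """
    return list(product(current_units, stored_units))


# Placeholder labels stand in for the (20, N) feature matrices.
pairs = pair_units(["cur_1", "cur_2"], ["stored_1", "stored_2", "stored_3"])
print(len(pairs))
```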
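The identity verification model of a preset type is preferably a deep convolutional neural network model consisting of 1 input layer, 4 convolutional layers, 1 pooling layer, 2 fully connected layers, 1 normalization layer and 1 classification layer. The detailed structure of the deep convolutional neural network model is shown in Table 1 below:
Figure PCTCN2017100055-appb-000001
Table 1
The Layer Name column gives the name of each layer: Input denotes the input layer, Conv denotes a convolutional layer, Conv1 denotes the first convolutional layer, Mean_std_pooling denotes the pooling layer, Full connected denotes a fully connected layer, Normalize Wrap denotes the normalization layer, and Scoring denotes the classification layer. Batch Size is the number of observation feature units input to the current layer. Kernel Size is the scale of the convolution kernel of the current layer (for example, a Kernel Size of 3 means the kernel is 3x3). Stride Size is the stride of the convolution kernel, that is, the distance moved to the next convolution position after one convolution is completed. Filter size is the number of output channels of each layer; for example, the input speech at the Input layer has 1 channel (the raw data), and after the Input layer the number of channels becomes 512. Specifically, the input layer samples the input observation feature units; the 1*1 convolution kernel of the Conv layers can scale the input and combine features; the Normalize Wrap layer applies variance normalization to its input; and the Scoring layer trains the user's intra-class relationship matrix U and the user's inter-class relationship matrix V, both of dimension 300*300.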
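Table 1 is only available as an image, so the exact behaviour of the Mean_std_pooling layer is not stated in the text. The sketch below assumes the common reading of the name: the layer collapses the time axis by concatenating the per-channel mean and standard deviation, which turns a variable number of frames into a fixed-length vector. The function name and the use of the population standard deviation are assumptions.

```python
import math


def mean_std_pooling(frames):
    """Pool T per-frame channel vectors into one [means..., stds...] vector.

    Assumption: population standard deviation (divide by T) is used.
    """
    t = len(frames)
    channels = len(frames[0])
    means = [sum(f[c] for f in frames) / t for c in range(channels)]
    stds = [
        math.sqrt(sum((f[c] - means[c]) ** 2 for f in frames) / t)
        for c in range(channels)
    ]
    return means + stds


# Two frames, two channels: the output length is 2 * channels regardless of T.
pooled = mean_std_pooling([[1.0, 2.0], [3.0, 2.0]])
print(pooled)
```

Under this reading, recordings of different lengths all reach the fully connected layers with the same dimensionality.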
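In this embodiment, after the multiple groups of paired observation feature units are input into the deep convolutional neural network model, the output identity verification result is obtained. The result is either verification passed or verification failed, and is used to verify the identity of the target user.
Compared with the prior art, this embodiment first frames the current voice data to obtain a plurality of speech frames, extracts acoustic features of a preset type from each speech frame using a predetermined filter, and generates, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data. It then pairs each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units, inputs these groups into the identity verification model of a preset type, and obtains the output identity verification result to verify the identity of the target user. When authenticating the short recordings that occur in many business scenarios, this embodiment frames the short recording, extracts acoustic features, converts them into observation feature units, and finally feeds the paired observation feature units into the identity verification model. This performs well on short speech and can reduce the recognition error rate.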
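In a preferred embodiment, on the basis of the embodiment of FIG. 1 above, the step of extracting acoustic features of a preset type from each speech frame using the predetermined filter comprises:
windowing the speech frame;
performing a Fourier transform on each windowed frame to obtain the corresponding spectrum;
inputting the spectrum into the Mel filter to output the Mel spectrum;
performing cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCC), and using the MFCC as the acoustic feature of the speech frame.
After the voice data is framed, each frame of data is treated as a stationary signal. Because each term of a Fourier expansion is needed later to obtain the Mel spectral features, the Gibbs effect occurs: when a periodic function with discontinuities (such as a rectangular pulse) is expanded as a Fourier series and a finite number of terms is used for synthesis, the more terms are used, the closer the peak in the synthesized waveform moves to the discontinuity of the original signal; when the number of terms is large, the peak value tends to a constant, approximately equal to 9% of the total jump. To mitigate the Gibbs effect, the speech frame is windowed to reduce the discontinuity of the signal at the start and end of the frame.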
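The 9% figure can be checked numerically. The sketch below is an illustration added here, not part of the patent: it sums the first terms of the Fourier series of a unit square wave (values ±1, so the total jump is 2) and measures how far the peak next to the discontinuity overshoots, as a fraction of the jump. The function names are hypothetical.

```python
import math


def square_wave_partial_sum(x, n_terms):
    """Partial Fourier sum of a square wave with values -1 / +1."""
    return (4.0 / math.pi) * sum(
        math.sin((2 * k + 1) * x) / (2 * k + 1) for k in range(n_terms)
    )


def overshoot_fraction(n_terms, grid=2000):
    """Peak overshoot on (0, pi/2] as a fraction of the total jump (2)."""
    peak = max(
        square_wave_partial_sum(math.pi / 2 * i / grid, n_terms)
        for i in range(1, grid + 1)
    )
    return (peak - 1.0) / 2.0


for n in (10, 50, 200):
    print(n, round(overshoot_fraction(n), 4))
```

Adding terms moves the peak closer to the discontinuity but does not shrink it, which is why tapering the frame edges with a window is used instead of relying on more Fourier terms.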
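The cepstral analysis consists, for example, of taking the logarithm and applying an inverse transform. The inverse transform is generally implemented with the discrete cosine transform (DCT), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients. The MFCC is the voiceprint feature of that frame of speech data; the MFCCs of all frames form a feature data matrix, which is the acoustic feature of the speech frames.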
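The four sub-steps (window, Fourier transform, Mel filter, cepstral analysis) can be sketched for a single frame using only the standard library. Several choices here are assumptions not given in the text: a Hamming window, 26 triangular Mel filters, a 16 kHz sampling rate, and a naive DFT that is only practical for a short demo frame. The slice bounds `first` and `last` are parameters because the text takes the 2nd to 13th coefficients here and refers to 20-dimensional MFCCs elsewhere.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..n/2 (O(n^2), demo only)."""
    n = len(frame)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(frame))) ** 2 / n
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * hz / sample_rate) for hz in edges_hz]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, n_filters=26, first=1, last=13):
    n = len(frame)
    # 1. Windowing (Hamming window assumed).
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    windowed = [s * w for s, w in zip(frame, window)]
    # 2. Fourier transform -> power spectrum.
    spectrum = power_spectrum(windowed)
    # 3. Mel filter -> Mel spectrum; 4a. logarithm (floored to avoid log(0)).
    log_mel = [
        math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-10))
        for row in mel_filterbank(n_filters, n, sample_rate)
    ]
    # 4b. DCT-II as the inverse transform; keep coefficients first..last-1.
    cepstrum = [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n_filters)
            for i, v in enumerate(log_mel))
        for k in range(last)
    ]
    return cepstrum[first:last]


# One 25 ms frame of a 440 Hz tone at an assumed 16 kHz sampling rate.
frame = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(400)]
coeffs = mfcc(frame, 16000)
print(len(coeffs))
```

In practice a signal-processing library with an FFT would replace the naive DFT; the structure of the pipeline stays the same.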
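In a preferred embodiment, on the basis of the above embodiment, the step of generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data comprises:
taking all speech frames in each recording of the current voice data as a speech frame set, and splicing the 20-dimensional Mel frequency cepstral coefficients (MFCC, i.e. the acoustic features) of each speech frame in the set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.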
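Building one observation feature unit amounts to stacking the per-frame vectors as columns. The following sketch assumes the per-frame MFCC vectors are already sorted by framing time; the function name `build_observation_unit` is hypothetical.

```python
def build_observation_unit(frame_features):
    """Stack N per-frame feature vectors (time-ordered) into a (dims, N) matrix.

    With 20-dimensional MFCCs this gives the (20, N) matrix described above:
    row d holds coefficient d across all frames, column n holds frame n.
    """
    dims = len(frame_features[0])
    if any(len(f) != dims for f in frame_features):
        raise ValueError("all frames must have the same feature dimension")
    return [[f[d] for f in frame_features] for d in range(dims)]


# Three dummy frames; the value encodes (frame index, coefficient index).
dummy_frames = [[float(n * 100 + d) for d in range(20)] for n in range(3)]
unit = build_observation_unit(dummy_frames)
print(len(unit), len(unit[0]))
```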
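In a preferred embodiment, on the basis of the above embodiment, the deep convolutional neural network model uses a recognition function for identity verification, and the recognition function comprises:
Figure PCTCN2017100055-appb-000002
Figure PCTCN2017100055-appb-000003
$L(x,y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$;
Here Obj is the objective function of the deep convolutional neural network model. By maximizing this objective function, the probability that the deep convolutional neural network model makes the correct decision increases until convergence, whereby the identity of the target user is verified. x is the user feature obtained at the normalization layer for one observation feature unit of a pair, y is the user feature obtained at the normalization layer for the other observation feature unit of the pair, K is a constant, P(x,y) is the probability that a pair of observation feature units belongs to the same user, L(x,y) is the similarity of a pair of observation feature units, U is the user's intra-class relationship matrix, V is the user's inter-class relationship matrix, b is the bias, and T denotes matrix transposition.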
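The similarity L(x,y) is given explicitly in the text and can be computed directly. The formulas for Obj and P(x,y) are only present as images, so the sketch below makes a loudly labeled assumption: P(x,y) is taken to be the logistic sigmoid of L(x,y). The tiny 2-dimensional matrices are illustrative stand-ins for the 300*300 matrices U and V, and all function names are hypothetical.

```python
import math


def bilinear(a, m, b):
    """Compute a^T M b for list-based vectors and a list-of-rows matrix."""
    return sum(a[i] * m[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def similarity(x, y, u, v, bias):
    """L(x, y) = x^T U y - x^T V x - y^T V y + b, as written in the text."""
    return bilinear(x, u, y) - bilinear(x, v, x) - bilinear(y, v, y) + bias


def same_user_probability(x, y, u, v, bias):
    # ASSUMPTION: a logistic sigmoid maps L to a probability; the source
    # formula for P(x, y) is an image and is not reproduced here.
    return 1.0 / (1.0 + math.exp(-similarity(x, y, u, v, bias)))


U = [[1.0, 0.0], [0.0, 1.0]]      # toy stand-in for the 300*300 matrix U
V = [[0.25, 0.0], [0.0, 0.25]]    # toy stand-in for the 300*300 matrix V

same = similarity([1.0, 0.0], [1.0, 0.0], U, V, 0.0)
diff = similarity([1.0, 0.0], [0.0, 1.0], U, V, 0.0)
print(same, diff)
```

With these toy values, aligned features score higher than orthogonal ones, so a threshold on P(x,y) would separate the two cases.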
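As shown in FIG. 2, which is a schematic flowchart of an embodiment of the identity verification method of the present invention, the identity verification method comprises the following steps:
Step S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
In this embodiment, users need to make recordings in a variety of business scenarios. The current voice data received during recording consists of separate recording segments, and each segment is a short utterance.
During recording, interference from environmental noise and from the voice collection device should be avoided as far as possible. The recording device should be kept at a suitable distance from the user, devices with high distortion should be avoided, mains power is preferred, and the current should be kept stable; a sensor should be used when recording telephone calls. Before framing, the voice data may be denoised to further reduce interference.
When each recording segment in the current voice data is framed according to the preset framing parameters, the parameters are, for example, one frame every 25 milliseconds with a frame shift of 10 milliseconds, and each recording segment yields a plurality of speech frames after framing. This embodiment is of course not limited to this framing scheme; framing with other parameters may be used and also falls within the scope of protection of this embodiment.
Step S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
In this embodiment, the predetermined filter is preferably a Mel filter, and the acoustic features are voiceprint features. Voiceprint features come in several types, such as wideband voiceprints, narrowband voiceprints and amplitude voiceprints; the voiceprint features of this embodiment are preferably Mel frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC).
When the plurality of observation feature units corresponding to the current voice data are generated from the acoustic features, feature data matrices are formed from the Mel frequency cepstral coefficients. Specifically, one feature data matrix is formed from the Mel frequency cepstral coefficients of each recording segment, and the feature data matrices of the multiple recording segments are the plurality of observation feature units corresponding to the current voice data.
Step S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
Step S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
In this embodiment, the observation feature units of a relatively large number of users are stored in advance. After the plurality of observation feature units corresponding to the current voice data are generated, they are paired with the pre-stored observation feature units, and the pairing yields multiple groups of observation feature units.
The identity verification model of a preset type is preferably a deep convolutional neural network model consisting of 1 input layer, 4 convolutional layers, 1 pooling layer, 2 fully connected layers, 1 normalization layer and 1 classification layer. The detailed structure of the deep convolutional neural network model is shown in Table 1 above and is not repeated here.
The Layer Name column gives the name of each layer: Input denotes the input layer, Conv denotes a convolutional layer, Conv1 denotes the first convolutional layer, Mean_std_pooling denotes the pooling layer, Full connected denotes a fully connected layer, Normalize Wrap denotes the normalization layer, and Scoring denotes the classification layer. Batch Size is the number of observation feature units input to the current layer. Kernel Size is the scale of the convolution kernel of the current layer (for example, a Kernel Size of 3 means the kernel is 3x3). Stride Size is the stride of the convolution kernel, that is, the distance moved to the next convolution position after one convolution is completed. Filter size is the number of output channels of each layer; for example, the input speech at the Input layer has 1 channel (the raw data), and after the Input layer the number of channels becomes 512. Specifically, the input layer samples the input observation feature units; the 1*1 convolution kernel of the Conv layers can scale the input and combine features; the Normalize Wrap layer applies variance normalization to its input; and the Scoring layer trains the user's intra-class relationship matrix U and the user's inter-class relationship matrix V, both of dimension 300*300.
In this embodiment, after the multiple groups of paired observation feature units are input into the deep convolutional neural network model, the output identity verification result is obtained. The result is either verification passed or verification failed, and is used to verify the identity of the target user.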
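In a preferred embodiment, on the basis of the embodiment of FIG. 2 above, the step in step S2 of extracting acoustic features of a preset type from each speech frame using the predetermined filter comprises:
windowing the speech frame;
performing a Fourier transform on each windowed frame to obtain the corresponding spectrum;
inputting the spectrum into the Mel filter to output the Mel spectrum;
performing cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCC), and using the MFCC as the acoustic feature of the speech frame.
After the voice data is framed, each frame of data is treated as a stationary signal. Because each term of a Fourier expansion is needed later to obtain the Mel spectral features, the Gibbs effect occurs: when a periodic function with discontinuities (such as a rectangular pulse) is expanded as a Fourier series and a finite number of terms is used for synthesis, the more terms are used, the closer the peak in the synthesized waveform moves to the discontinuity of the original signal; when the number of terms is large, the peak value tends to a constant, approximately equal to 9% of the total jump. To mitigate the Gibbs effect, the speech frame is windowed to reduce the discontinuity of the signal at the start and end of the frame.
The cepstral analysis consists, for example, of taking the logarithm and applying an inverse transform. The inverse transform is generally implemented with the discrete cosine transform (DCT), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients. The MFCC is the voiceprint feature of that frame of speech data; the MFCCs of all frames form a feature data matrix, which is the acoustic feature of the speech frames.
In a preferred embodiment, on the basis of the above embodiment, the step in step S2 of generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data comprises: taking all speech frames in each recording of the current voice data as a speech frame set, and splicing the 20-dimensional Mel frequency cepstral coefficients (MFCC, i.e. the acoustic features) of each speech frame in the set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.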
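In a preferred embodiment, on the basis of the above embodiment, the deep convolutional neural network model uses a recognition function for identity verification, and the recognition function comprises:
Figure PCTCN2017100055-appb-000004
Figure PCTCN2017100055-appb-000005
$L(x,y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$;
Here Obj is the objective function of the deep convolutional neural network model. By maximizing this objective function, the probability that the deep convolutional neural network model makes the correct decision increases until convergence, whereby the identity of the target user is verified. x is the user feature obtained at the normalization layer for one observation feature unit of a pair, y is the user feature obtained at the normalization layer for the other observation feature unit of the pair, K is a constant, P(x,y) is the probability that a pair of observation feature units belongs to the same user, L(x,y) is the similarity of a pair of observation feature units, U is the user's intra-class relationship matrix, V is the user's inter-class relationship matrix, b is the bias, and T denotes matrix transposition.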
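In a preferred embodiment, on the basis of the above embodiment, the method further comprises, before step S4:
obtaining a first preset number of voice pairs from the same user, for example obtaining 1000 users with 1000 voice pairs per user, each voice pair consisting of two utterances by the same user with two different spoken contents; obtaining a second preset number of voice pairs from different users, for example obtaining 1000 users and pairing the users two by two, each pair of users yielding one voice pair for one identical spoken content; and framing the utterances in each voice pair according to the preset framing parameters, for example one frame every 25 milliseconds with a frame shift of 10 milliseconds, to obtain a plurality of speech frames for each voice pair;
extracting acoustic features of a preset type (for example, 20-dimensional Mel frequency cepstral coefficient (MFCC) spectral features) from each speech frame using a predetermined filter (for example, a Mel filter), and generating a plurality of observation feature units for each voice pair from the extracted acoustic features, that is, forming a plurality of feature data matrices from the Mel frequency cepstral coefficients, each feature data matrix being an observation feature unit;
pairing the observation feature units corresponding to two utterances belonging to the same user and to different users, to obtain multiple groups of paired observation feature units;
dividing the voice pairs into a training set of a first percentage (for example, 70%) and a validation set of a second percentage (for example, 20%), the sum of the first percentage and the second percentage being less than or equal to 1;
training the deep convolutional neural network model with the groups of observation feature units of the voice pairs in the training set and, after training is complete, verifying the accuracy of the trained deep convolutional neural network model with the validation set;
if the accuracy is greater than a preset threshold (for example, 98.5%), ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4; or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training and re-executing the above steps to retrain, until the accuracy of the trained deep convolutional neural network model is greater than the preset threshold.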
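The split-train-validate-retry procedure above can be sketched as a small control loop. The model itself is out of scope here: `get_pairs`, `train_fn` and `eval_fn` are hypothetical callables standing in for data collection, network training and validation, and the growth factor and round limit are assumptions added so the loop terminates.

```python
def split_pairs(pairs, train_pct=70, val_pct=20):
    """Split voice pairs into training and validation sets by percentage."""
    if train_pct + val_pct > 100:
        raise ValueError("percentages must sum to at most 100")
    n_train = len(pairs) * train_pct // 100
    n_val = len(pairs) * val_pct // 100
    return pairs[:n_train], pairs[n_train:n_train + n_val]


def train_until_accurate(get_pairs, train_fn, eval_fn, n_pairs,
                         threshold=0.985, growth=2, max_rounds=5):
    """Retrain with more voice pairs until validation accuracy > threshold."""
    for _ in range(max_rounds):
        train_set, val_set = split_pairs(get_pairs(n_pairs))
        model = train_fn(train_set)
        if eval_fn(model, val_set) > threshold:
            return model, n_pairs
        n_pairs *= growth  # accuracy too low: add training pairs and retry
    raise RuntimeError("accuracy threshold not reached")


# Toy stand-ins: the "model" is just the training-set size, and accuracy
# passes the threshold once at least 140 training pairs were used.
model, used = train_until_accurate(
    get_pairs=lambda n: list(range(n)),
    train_fn=lambda train_set: len(train_set),
    eval_fn=lambda m, val_set: 0.99 if m >= 140 else 0.90,
    n_pairs=100,
)
print(model, used)
```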
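The present invention further provides a computer-readable storage medium storing an identity verification system which, when executed by a processor, implements the steps of the identity verification method described above.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, a person skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structural or procedural transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.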

Claims (20)

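  1. An electronic device, characterized in that the electronic device comprises a memory and a processor connected to the memory, the memory storing an identity verification system that can run on the processor, and the identity verification system, when executed by the processor, implementing the following steps:
    S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
    S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
    S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
    S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
  2. The electronic device according to claim 1, characterized in that the predetermined filter is a Mel filter, and the step of extracting acoustic features of a preset type from each speech frame using the predetermined filter comprises:
    windowing the speech frame;
    performing a Fourier transform on each windowed frame to obtain the corresponding spectrum;
    inputting the spectrum into the Mel filter to output the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCC), and using the MFCC as the acoustic feature of the speech frame.
  3. The electronic device according to claim 2, characterized in that the step of generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data comprises:
    taking the speech frames in each recording of the current voice data as a speech frame set, and splicing the 20-dimensional MFCC of each speech frame in the set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
  4. The electronic device according to claim 3, characterized in that the pre-trained identity verification model of a preset type is a deep convolutional neural network model, the deep convolutional neural network model uses a recognition function for identity verification, and the recognition function comprises:
    Figure PCTCN2017100055-appb-100001
    Figure PCTCN2017100055-appb-100002
    $L(x,y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$;
    where Obj is the objective function of the deep convolutional neural network model, x is the user feature obtained at the normalization layer for one observation feature unit of a pair, y is the user feature obtained at the normalization layer for the other observation feature unit of the pair, K is a constant, P(x,y) is the probability that a pair of observation feature units belongs to the same user, L(x,y) is the similarity of a pair of observation feature units, U is the user's intra-class relationship matrix, V is the user's inter-class relationship matrix, b is the bias, and T denotes matrix transposition.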
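  5. The electronic device according to claim 4, characterized in that the following is further included before step S4:
    obtaining a first preset number of voice pairs from the same user and a second preset number of voice pairs from different users, and framing the utterances in each voice pair according to the preset framing parameters to obtain a plurality of speech frames for each voice pair;
    extracting acoustic features of a preset type from each speech frame using the predetermined filter, and generating a plurality of observation feature units for each voice pair from the extracted acoustic features;
    pairing the observation feature units corresponding to two utterances belonging to the same user and to different users, to obtain multiple groups of paired observation feature units;
    dividing the voice pairs into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;
    training the deep convolutional neural network model with the groups of observation feature units of the voice pairs in the training set and, after training is complete, verifying the accuracy of the trained deep convolutional neural network model with the validation set;
    if the accuracy is greater than a preset threshold, ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4; or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training so as to retrain.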
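  6. An identity verification method, characterized in that the identity verification method comprises:
    S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
    S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
    S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
    S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
  7. The identity verification method according to claim 6, characterized in that the predetermined filter is a Mel filter, and the step of extracting acoustic features of a preset type from each speech frame using the predetermined filter comprises:
    windowing the speech frame;
    performing a Fourier transform on each windowed frame to obtain the corresponding spectrum;
    inputting the spectrum into the Mel filter to output the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCC), and using the MFCC as the acoustic feature of the speech frame.
  8. The identity verification method according to claim 7, characterized in that the step of generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data comprises:
    taking the speech frames in each recording of the current voice data as a speech frame set, and splicing the 20-dimensional MFCC of each speech frame in the set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
  9. The identity verification method according to claim 8, characterized in that the pre-trained identity verification model of a preset type is a deep convolutional neural network model, the deep convolutional neural network model uses a recognition function for identity verification, and the recognition function comprises:
    Figure PCTCN2017100055-appb-100003
    Figure PCTCN2017100055-appb-100004
    $L(x,y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$;
    where Obj is the objective function of the deep convolutional neural network model, x is the user feature obtained at the normalization layer for one observation feature unit of a pair, y is the user feature obtained at the normalization layer for the other observation feature unit of the pair, K is a constant, P(x,y) is the probability that a pair of observation feature units belongs to the same user, L(x,y) is the similarity of a pair of observation feature units, U is the user's intra-class relationship matrix, V is the user's inter-class relationship matrix, b is the bias, and T denotes matrix transposition.
  10. The identity verification method according to claim 9, characterized in that the method further comprises, before step S4:
    obtaining a first preset number of voice pairs from the same user and a second preset number of voice pairs from different users, and framing the utterances in each voice pair according to the preset framing parameters to obtain a plurality of speech frames for each voice pair;
    extracting acoustic features of a preset type from each speech frame using the predetermined filter, and generating a plurality of observation feature units for each voice pair from the extracted acoustic features;
    pairing the observation feature units corresponding to two utterances belonging to the same user and to different users, to obtain multiple groups of paired observation feature units;
    dividing the voice pairs into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;
    training the deep convolutional neural network model with the groups of observation feature units of the voice pairs in the training set and, after training is complete, verifying the accuracy of the trained deep convolutional neural network model with the validation set;
    if the accuracy is greater than a preset threshold, ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4; or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training so as to retrain.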
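  11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an identity verification system, the identity verification system being executable by at least one processor to implement the following steps:
    S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
    S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
    S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
    S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
  12. The computer-readable storage medium according to claim 11, characterized in that the predetermined filter is a Mel filter, and the step of extracting acoustic features of a preset type from each speech frame using the predetermined filter comprises:
    windowing the speech frame;
    performing a Fourier transform on each windowed frame to obtain the corresponding spectrum;
    inputting the spectrum into the Mel filter to output the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCC), and using the MFCC as the acoustic feature of the speech frame.
  13. The computer-readable storage medium according to claim 12, characterized in that the step of generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data comprises:
    taking the speech frames in each recording of the current voice data as a speech frame set, and splicing the 20-dimensional MFCC of each speech frame in the set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
  14. The computer-readable storage medium according to claim 13, characterized in that the pre-trained identity verification model of a preset type is a deep convolutional neural network model, the deep convolutional neural network model uses a recognition function for identity verification, and the recognition function comprises:
    Figure PCTCN2017100055-appb-100005
    Figure PCTCN2017100055-appb-100006
    $L(x,y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$;
    where Obj is the objective function of the deep convolutional neural network model, x is the user feature obtained at the normalization layer for one observation feature unit of a pair, y is the user feature obtained at the normalization layer for the other observation feature unit of the pair, K is a constant, P(x,y) is the probability that a pair of observation feature units belongs to the same user, L(x,y) is the similarity of a pair of observation feature units, U is the user's intra-class relationship matrix, V is the user's inter-class relationship matrix, b is the bias, and T denotes matrix transposition.
  15. The computer-readable storage medium according to claim 14, characterized in that the following is further included before step S4:
    obtaining a first preset number of voice pairs from the same user and a second preset number of voice pairs from different users, and framing the utterances in each voice pair according to the preset framing parameters to obtain a plurality of speech frames for each voice pair;
    extracting acoustic features of a preset type from each speech frame using the predetermined filter, and generating a plurality of observation feature units for each voice pair from the extracted acoustic features;
    pairing the observation feature units corresponding to two utterances belonging to the same user and to different users, to obtain multiple groups of paired observation feature units;
    dividing the voice pairs into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;
    training the deep convolutional neural network model with the groups of observation feature units of the voice pairs in the training set and, after training is complete, verifying the accuracy of the trained deep convolutional neural network model with the validation set;
    if the accuracy is greater than a preset threshold, ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4; or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training so as to retrain.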
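  16. An identity verification system, characterized in that the identity verification system is stored in a memory and is executable by at least one processor to implement the following steps:
    S1, after receiving current voice data of a target user whose identity is to be verified, framing the current voice data according to preset framing parameters to obtain a plurality of speech frames;
    S2, extracting acoustic features of a preset type from each speech frame using a predetermined filter, and generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data;
    S3, pairing each observation feature unit with the pre-stored observation feature units to obtain multiple groups of paired observation feature units;
    S4, inputting the multiple groups of paired observation feature units into a pre-trained identity verification model of a preset type, and obtaining the output identity verification result, so as to verify the identity of the target user.
  17. The identity verification system according to claim 16, characterized in that the predetermined filter is a Mel filter, and the step of extracting acoustic features of a preset type from each speech frame using the predetermined filter comprises:
    windowing the speech frame;
    performing a Fourier transform on each windowed frame to obtain the corresponding spectrum;
    inputting the spectrum into the Mel filter to output the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCC), and using the MFCC as the acoustic feature of the speech frame.
  18. The identity verification system according to claim 17, characterized in that the step of generating, from the extracted acoustic features, a plurality of observation feature units corresponding to the current voice data comprises:
    taking the speech frames in each recording of the current voice data as a speech frame set, and splicing the 20-dimensional MFCC of each speech frame in the set in the order of the framing times of the corresponding speech frames, to generate an observation feature unit in the form of a (20, N)-dimensional matrix, where N is the total number of frames in the speech frame set.
  19. The identity verification system according to claim 18, characterized in that the pre-trained identity verification model of a preset type is a deep convolutional neural network model, the deep convolutional neural network model uses a recognition function for identity verification, and the recognition function comprises:
    Figure PCTCN2017100055-appb-100007
    Figure PCTCN2017100055-appb-100008
    $L(x,y) = x^{T}Uy - x^{T}Vx - y^{T}Vy + b$;
    where Obj is the objective function of the deep convolutional neural network model, x is the user feature obtained at the normalization layer for one observation feature unit of a pair, y is the user feature obtained at the normalization layer for the other observation feature unit of the pair, K is a constant, P(x,y) is the probability that a pair of observation feature units belongs to the same user, L(x,y) is the similarity of a pair of observation feature units, U is the user's intra-class relationship matrix, V is the user's inter-class relationship matrix, b is the bias, and T denotes matrix transposition.
  20. The identity verification system according to claim 19, characterized in that the following is further included before step S4:
    obtaining a first preset number of voice pairs from the same user and a second preset number of voice pairs from different users, and framing the utterances in each voice pair according to the preset framing parameters to obtain a plurality of speech frames for each voice pair;
    extracting acoustic features of a preset type from each speech frame using the predetermined filter, and generating a plurality of observation feature units for each voice pair from the extracted acoustic features;
    pairing the observation feature units corresponding to two utterances belonging to the same user and to different users, to obtain multiple groups of paired observation feature units;
    dividing the voice pairs into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 1;
    training the deep convolutional neural network model with the groups of observation feature units of the voice pairs in the training set and, after training is complete, verifying the accuracy of the trained deep convolutional neural network model with the validation set;
    if the accuracy is greater than a preset threshold, ending the training and using the trained deep convolutional neural network model as the deep convolutional neural network model in step S4; or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice pairs used for training so as to retrain.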
PCT/CN2017/100055 2017-07-25 2017-08-31 电子装置、身份验证的方法、系统及计算机可读存储介质 Ceased WO2019019256A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
SG11201901766YA SG11201901766YA (en) 2017-07-25 2017-08-31 Electronic device, method and system of identity verification and computer readable storage medium
US16/084,233 US11068571B2 (en) 2017-07-25 2017-08-31 Electronic device, method and system of identity verification and computer readable storage medium
AU2017404565A AU2017404565B2 (en) 2017-07-25 2017-08-31 Electronic device, method and system of identity verification and computer readable storage medium
EP17897212.1A EP3460793B1 (en) 2017-07-25 2017-08-31 Electronic apparatus, identity verification method and system, and computer-readable storage medium
KR1020187017523A KR102159217B1 (ko) 2017-07-25 2017-08-31 전자장치, 신분 검증 방법, 시스템 및 컴퓨터 판독 가능한 저장매체
JP2018534079A JP6621536B2 (ja) 2017-07-25 2017-08-31 電子装置、身元認証方法、システム及びコンピュータ読み取り可能な記憶媒体

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710614649.6A CN107527620B (zh) 2017-07-25 2017-07-25 电子装置、身份验证的方法及计算机可读存储介质
CN201710614649.6 2017-07-25

Publications (1)

Publication Number Publication Date
WO2019019256A1 true WO2019019256A1 (zh) 2019-01-31

Family

ID=60680120

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/100055 Ceased WO2019019256A1 (zh) 2017-07-25 2017-08-31 电子装置、身份验证的方法、系统及计算机可读存储介质

Country Status (8)

Country Link
US (1) US11068571B2 (zh)
EP (1) EP3460793B1 (zh)
JP (1) JP6621536B2 (zh)
KR (1) KR102159217B1 (zh)
CN (1) CN107527620B (zh)
AU (1) AU2017404565B2 (zh)
SG (1) SG11201901766YA (zh)
WO (1) WO2019019256A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110459209A (zh) * 2019-08-20 2019-11-15 深圳追一科技有限公司 语音识别方法、装置、设备及存储介质
CN111191754A (zh) * 2019-12-30 2020-05-22 秒针信息技术有限公司 语音采集方法、装置、电子设备及存储介质
US12205147B2 (en) 2019-08-29 2025-01-21 Tencent Technology (Shenzhen) Company Limited Feature processing method and apparatus for artificial intelligence recommendation model, electronic device, and storage medium

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417217B (zh) * 2018-01-11 2021-07-13 思必驰科技股份有限公司 说话人识别网络模型训练方法、说话人识别方法及系统
CN108154371A (zh) * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 电子装置、身份验证的方法及存储介质
CN108564954B (zh) * 2018-03-19 2020-01-10 平安科技(深圳)有限公司 深度神经网络模型、电子装置、身份验证方法和存储介质
CN108564688A (zh) * 2018-03-21 2018-09-21 阿里巴巴集团控股有限公司 身份验证的方法及装置和电子设备
CN108877775B (zh) * 2018-06-04 2023-03-31 平安科技(深圳)有限公司 语音数据处理方法、装置、计算机设备及存储介质
CN109448746B (zh) * 2018-09-28 2020-03-24 百度在线网络技术(北京)有限公司 语音降噪方法及装置
CN109473105A (zh) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 与文本无关的声纹验证方法、装置和计算机设备
CN109346086A (zh) * 2018-10-26 2019-02-15 平安科技(深圳)有限公司 声纹识别方法、装置、计算机设备和计算机可读存储介质
CN109147818A (zh) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 声学特征提取方法、装置、存储介质及终端设备
CN109686382A (zh) * 2018-12-29 2019-04-26 平安科技(深圳)有限公司 一种说话人聚类方法和装置
CN109448726A (zh) * 2019-01-14 2019-03-08 李庆湧 一种语音控制准确率的调整方法及系统
CN110010133A (zh) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 基于短文本的声纹检测方法、装置、设备及存储介质
CN111798857A (zh) * 2019-04-08 2020-10-20 北京嘀嘀无限科技发展有限公司 一种信息识别方法、装置、电子设备及存储介质
CN110289004B (zh) * 2019-06-18 2021-09-07 暨南大学 一种基于深度学习的人工合成声纹检测系统及方法
CN110570873B (zh) * 2019-09-12 2022-08-05 Oppo广东移动通信有限公司 声纹唤醒方法、装置、计算机设备以及存储介质
CN110556126B (zh) * 2019-09-16 2024-01-05 平安科技(深圳)有限公司 语音识别方法、装置以及计算机设备
CN110570871A (zh) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 一种基于TristouNet的声纹识别方法、装置及设备
CN114547568B (zh) * 2022-02-09 2026-03-27 支付宝(杭州)数字服务技术有限公司 一种基于语音的身份验证方法、装置及设备
CN115223569B (zh) * 2022-06-02 2025-02-28 康佳集团股份有限公司 基于深度神经网络的说话人验证方法、终端及存储介质
CN115424608A (zh) * 2022-08-25 2022-12-02 深圳大学 一种基于机器学习的身份验证的方法和系统
CN116094725B (zh) * 2022-12-30 2024-12-24 中国人民解放军网络空间部队信息工程大学 基于sm2算法的声纹认证保护方法及系统
CN116567150B (zh) * 2023-07-11 2023-09-08 山东凌晓通信科技有限公司 一种会议室防窃听偷录的方法及系统
US12211487B1 (en) * 2024-01-18 2025-01-28 Morgan Stanley Services Group Inc. Systems and methods for accessible websites and/or applications for people with disabilities

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103391201A (zh) * 2013-08-05 2013-11-13 公安部第三研究所 基于声纹识别实现智能卡身份验证的系统及方法
CN104700018A (zh) * 2015-03-31 2015-06-10 江苏祥和电子科技有限公司 一种用于智能机器人的识别方法
CN107068154A (zh) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 基于声纹识别的身份验证的方法及系统

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4032711A (en) 1975-12-31 1977-06-28 Bell Telephone Laboratories, Incorporated Speaker recognition arrangement
US5583961A (en) * 1993-03-25 1996-12-10 British Telecommunications Public Limited Company Speaker recognition using spectral coefficients normalized with respect to unequal frequency bands
GB2355834A (en) * 1999-10-29 2001-05-02 Nokia Mobile Phones Ltd Speech recognition
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
CN101465123B (zh) * 2007-12-20 2011-07-06 株式会社东芝 说话人认证的验证方法和装置以及说话人认证系统
US8554562B2 (en) * 2009-11-15 2013-10-08 Nuance Communications, Inc. Method and system for speaker diarization
US20160293167A1 (en) * 2013-10-10 2016-10-06 Google Inc. Speaker recognition using neural networks
CN104008751A (zh) * 2014-06-18 2014-08-27 周婷婷 一种基于bp神经网络的说话人识别方法
US10580401B2 (en) 2015-01-27 2020-03-03 Google Llc Sub-matrix input for neural network layers
JP6280068B2 (ja) * 2015-03-09 2018-02-14 日本電信電話株式会社 パラメータ学習装置、話者認識装置、パラメータ学習方法、話者認識方法、およびプログラム
JP6616182B2 (ja) * 2015-12-25 2019-12-04 綜合警備保障株式会社 話者認識装置、判別値生成方法及びプログラム
CN105788592A (zh) * 2016-04-28 2016-07-20 乐视控股(北京)有限公司 一种音频分类方法及装置
CN105869644A (zh) * 2016-05-25 2016-08-17 百度在线网络技术(北京)有限公司 基于深度学习的声纹认证方法和装置
AU2017294791B2 (en) * 2016-07-11 2021-06-03 Ftr, Ltd. Method and system for automatically diarising a sound recording
CN106448684A (zh) * 2016-11-16 2017-02-22 北京大学深圳研究生院 基于深度置信网络特征矢量的信道鲁棒声纹识别系统
CN106710599A (zh) * 2016-12-02 2017-05-24 深圳撒哈拉数据科技有限公司 一种基于深度神经网络的特定声源检测方法与系统
US10546575B2 (en) * 2016-12-14 2020-01-28 International Business Machines Corporation Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
CN107610707B (zh) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 一种声纹识别方法及装置
CN106847292B (zh) * 2017-02-16 2018-06-19 平安科技(深圳)有限公司 声纹识别方法及装置
US10637898B2 (en) * 2017-05-24 2020-04-28 AffectLayer, Inc. Automatic speaker identification in calls
US11289098B2 (en) * 2019-03-08 2022-03-29 Samsung Electronics Co., Ltd. Method and apparatus with speaker recognition registration

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103391201A (zh) * 2013-08-05 2013-11-13 公安部第三研究所 基于声纹识别实现智能卡身份验证的系统及方法
CN104700018A (zh) * 2015-03-31 2015-06-10 江苏祥和电子科技有限公司 一种用于智能机器人的识别方法
CN107068154A (zh) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 基于声纹识别的身份验证的方法及系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3460793A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110459209A (zh) * 2019-08-20 2019-11-15 深圳追一科技有限公司 语音识别方法、装置、设备及存储介质
US12205147B2 (en) 2019-08-29 2025-01-21 Tencent Technology (Shenzhen) Company Limited Feature processing method and apparatus for artificial intelligence recommendation model, electronic device, and storage medium
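CN111191754A (zh) * 2019-12-30 2020-05-22 秒针信息技术有限公司 Voice collection method, apparatus, electronic device, and storage medium
CN111191754B (zh) * 2019-12-30 2023-10-27 秒针信息技术有限公司 Voice collection method, apparatus, electronic device, and storage medium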

Also Published As

Publication number Publication date
EP3460793B1 (en) 2023-04-05
EP3460793A1 (en) 2019-03-27
US20210097159A1 (en) 2021-04-01
AU2017404565A1 (en) 2019-02-14
JP6621536B2 (ja) 2019-12-18
CN107527620B (zh) 2019-03-26
US11068571B2 (en) 2021-07-20
SG11201901766YA (en) 2019-04-29
CN107527620A (zh) 2017-12-29
KR102159217B1 (ko) 2020-09-24
AU2017404565B2 (en) 2020-01-02
JP2019531492A (ja) 2019-10-31
EP3460793A4 (en) 2020-04-01
KR20190022432A (ko) 2019-03-06

Similar Documents

Publication Publication Date Title
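CN107527620B (zh) Electronic device, identity verification method, and computer-readable storage medium
TWI641965B (zh) Method and system for identity verification based on voiceprint recognition
CN108564954B (zh) Deep neural network model, electronic device, identity verification method, and storage medium
WO2020181824A1 (zh) Voiceprint recognition method, apparatus, device, and computer-readable storage medium
WO2021051572A1 (zh) Speech recognition method, apparatus, and computer device
WO2019100606A1 (zh) Electronic device, voiceprint-based identity verification method, system, and storage medium
CN108564955B (zh) Electronic device, identity verification method, and computer-readable storage medium
WO2018149077A1 (zh) Voiceprint recognition method, apparatus, storage medium, and backend server
WO2017215558A1 (zh) Voiceprint recognition method and apparatus
CN108630208B (zh) Server, voiceprint-based identity verification method, and storage medium
WO2019136912A1 (zh) Electronic device, identity verification method, system, and storage medium
CN113223536A (zh) Voiceprint recognition method, apparatus, and terminal device
WO2021042537A1 (zh) Speech recognition authentication method and system
WO2020034628A1 (zh) Accent recognition method, apparatus, computer apparatus, and storage medium
WO2019136911A1 (zh) Speech recognition method for updating voiceprint data, terminal device, and storage medium
CN114141254A (zh) Voiceprint signal updating method and apparatus, electronic device, and storage medium
WO2019196305A1 (zh) Electronic device, identity verification method, and storage medium
WO2020140609A1 (zh) Speech recognition method, device, and computer-readable storage medium
CN108650266B (zh) Server, voiceprint verification method, and storage medium
CN110797033A (zh) Artificial intelligence-based voice recognition method and related device
WO2021128847A1 (zh) Terminal interaction method, apparatus, computer device, and storage medium
WO2019179033A1 (zh) Speaker authentication method, server, and computer-readable storage medium
CN115101054A (zh) Hot-word-graph-based speech recognition method, apparatus, device, and storage medium
CN110853652A (zh) Identity recognition method, apparatus, and computer-readable storage medium
CN115641853B (zh) Voiceprint recognition method and apparatus, computer-readable storage medium, and terminal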

Legal Events

Date Code Title Description
ENP Entry into the national phase Ref document number: 20187017523; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase Ref document number: 2018534079; Country of ref document: JP; Kind code of ref document: A
ENP Entry into the national phase Ref document number: 2017404565; Country of ref document: AU; Date of ref document: 20170831; Kind code of ref document: A
ENP Entry into the national phase Ref document number: 2017897212; Country of ref document: EP; Effective date: 20180829
121 EP: the EPO has been informed by WIPO that EP was designated in this application Ref document number: 17897212; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase Ref country code: DE