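WO2024051481A1 - Audio processing method, apparatus, device, readable storage medium, and program product - Google Patents
Audio processing method, apparatus, device, readable storage medium, and program product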
- Publication number
- WO2024051481A1 (PCT/CN2023/114040)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- similarity
- similarity matrix
- audio
- matrix
- row
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/65—Clustering; Classification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Definitions
- the embodiments of the present application relate to the field of computer technology, and in particular to an audio processing method, device, equipment, readable storage medium, and program product.
- audio object clustering is an audio processing method. Audio object clustering is to determine the number of audio objects included in multiple audio segments and the audio segments corresponding to each audio object.
- Multiple audio clips are obtained, the voiceprint vector corresponding to each audio clip is determined, and the number of audio objects present in the multiple audio clips is determined based on the similarity between the voiceprint vectors.
- Based on the number of audio objects, the multiple audio clips are clustered to obtain the audio clips corresponding to each audio object.
- Embodiments of the present application provide an audio processing method, device, equipment, readable storage medium, and program product, which can improve the accuracy of audio object clustering.
- embodiments of the present application provide an audio processing method, which is executed by a computer device.
- the method includes:
- the initial similarity matrix is adjusted to obtain a reference similarity matrix, and the dynamic threshold is used to adjust the similarity difference between different similarities;
- the plurality of audio segments are clustered to obtain audio segments resulting from the sounds of each audio object.
- an audio processing device which includes:
- a determination module configured to determine voiceprint vectors corresponding to multiple audio segments, where the voiceprint vectors are used to represent voiceprint features corresponding to the audio segments;
- the determination module is also configured to determine an initial similarity matrix based on the voiceprint vectors corresponding to each audio segment, where the initial similarity matrix includes the similarity between the voiceprint vectors corresponding to any two audio segments;
- an adjustment module configured to adjust the initial similarity matrix according to the dynamic threshold corresponding to each row in the initial similarity matrix to obtain a reference similarity matrix, where the dynamic threshold is used to adjust the similarity difference between different similarities;
- the determination module is also configured to determine the number of audio objects present in the plurality of audio segments based on the reference similarity matrix;
- a clustering module is configured to cluster the plurality of audio clips according to the number of the audio objects to obtain the audio clips produced by the sound produced by each audio object.
- the determination module is further configured to, for any row in the initial similarity matrix, sort the similarities in that row that fall within a preset similarity range in a first order to obtain a first sorting result; determine the similarity difference between every two adjacent similarities in the first sorting result to obtain multiple similarity differences; determine, among the multiple similarity differences, the similarity difference that meets a first requirement; and determine the dynamic threshold corresponding to that row based on the similarity difference that meets the first requirement.
- the adjustment module is configured to adjust, among the similarities included in the k-th row of the initial similarity matrix, each similarity that is smaller than the dynamic threshold corresponding to the k-th row to a first value, and obtain the reference similarity matrix based on the adjustment results of all rows, where k is a positive integer; or to multiply, among the similarities included in the k-th row of the initial similarity matrix, each similarity that is smaller than the dynamic threshold corresponding to the k-th row by a second value, and obtain the reference similarity matrix based on the adjustment results of all rows.
- the determination module is configured to process the reference similarity matrix according to multiple reference parameters to obtain a similarity matrix corresponding to each reference parameter, and determine the number of audio objects present in the multiple audio clips according to the multiple reference parameters and the similarity matrix corresponding to each reference parameter.
- the determination module is configured to, for any one of the multiple reference parameters: numerically adjust the reference similarity matrix according to that reference parameter to obtain a first similarity matrix, where the numerical adjustment is used to simplify the reference similarity matrix; symmetrize the first similarity matrix to obtain a second similarity matrix, in which the similarity in row i and column j is the same as the similarity in row j and column i, and i and j are positive integers not greater than the number of the multiple audio clips; diffuse the second similarity matrix in rows and columns to obtain a third similarity matrix, where the row-column diffusion is used to generate boundaries between multiple audio objects; proportionally adjust the third similarity matrix to obtain a fourth similarity matrix, where the proportional adjustment is used to bring the similarities included in each row of the third similarity matrix within the same range; and symmetrize the fourth similarity matrix to obtain the similarity matrix corresponding to that reference parameter.
- the determination module is configured to, among the multiple similarities included in each row of the reference similarity matrix, adjust the similarities other than those that meet a third requirement of the reference parameter to a third value to obtain the first similarity matrix; or to multiply the similarities other than those that meet the third requirement of the reference parameter by a fourth value to obtain the first similarity matrix.
- the determination module is configured to determine the transposed matrix corresponding to the first similarity matrix; add the similarities located at the same positions in the first similarity matrix and its transposed matrix to obtain a similarity matrix to be adjusted; and halve the multiple similarities included in the similarity matrix to be adjusted to obtain the second similarity matrix.
- the determination module is configured to take the larger of the similarity in the i-th row and j-th column of the first similarity matrix and the similarity in the j-th row and i-th column of the first similarity matrix, and use it as both the similarity in the i-th row and j-th column and the similarity in the j-th row and i-th column of the second similarity matrix, to obtain the second similarity matrix.
- the determination module is configured to determine the transposed matrix corresponding to the second similarity matrix, and determine the third similarity matrix according to the second similarity matrix and its transposed matrix, where the similarity located in the m-th row and n-th column of the third similarity matrix is determined based on the similarities located in the m-th row of the second similarity matrix and the similarities located in the n-th column of the transposed matrix, and m and n are positive integers not greater than the number of the multiple audio segments.
- the determination module is configured to determine the maximum similarity corresponding to each row according to the multiple similarities included in each row of the third similarity matrix, and divide the multiple similarities included in each row by the maximum similarity corresponding to that row to obtain the fourth similarity matrix.
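The five matrix operations described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the "third requirement" is assumed to mean "among the p largest similarities in the row", the third value is assumed to be 0, and the row-column diffusion is assumed to be the matrix product of the second similarity matrix with its transpose. All function names and the sample matrix are hypothetical.

```python
def keep_top_p(matrix, p, third_value=0.0):
    """Numerical adjustment (assumed): keep the p largest entries per row."""
    pruned = []
    for row in matrix:
        order = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        keep = set(order[:p])
        pruned.append([s if j in keep else third_value
                       for j, s in enumerate(row)])
    return pruned


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def symmetrize_mean(matrix):
    """Add the matrix to its transpose and halve every entry."""
    t = transpose(matrix)
    return [[(a + b) / 2 for a, b in zip(r, rt)] for r, rt in zip(matrix, t)]


def symmetrize_max(matrix):
    """Alternative: take the larger of entries (i, j) and (j, i)."""
    t = transpose(matrix)
    return [[max(a, b) for a, b in zip(r, rt)] for r, rt in zip(matrix, t)]


def diffuse(matrix):
    """Diffusion (assumed to be Y times Y-transpose): entry (m, n) is the
    dot product of row m of Y with column n of the transpose (row n of Y)."""
    return [[sum(a * b for a, b in zip(rm, rn)) for rn in matrix]
            for rm in matrix]


def row_max_normalize(matrix):
    """Divide each row by its maximum (assumes every row maximum is > 0)."""
    normalized = []
    for row in matrix:
        top = max(row)
        normalized.append([s / top for s in row])
    return normalized


def refine(reference, p):
    first = keep_top_p(reference, p)
    second = symmetrize_mean(first)
    third = diffuse(second)
    fourth = row_max_normalize(third)
    return symmetrize_mean(fourth)


# Hypothetical 3x3 reference similarity matrix, for illustration only.
reference = [[1.0, 0.9, 0.1],
             [0.9, 1.0, 0.2],
             [0.1, 0.2, 1.0]]
refined = refine(reference, p=2)
```

Swapping `symmetrize_mean` for `symmetrize_max` gives the second symmetrization variant described above.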
- the determination module is configured to determine, according to the similarity matrix corresponding to each reference parameter, the proportion value corresponding to each reference parameter, where the proportion value is used to indicate the number of similarities retained in the similarity matrix corresponding to the reference parameter; and determine, according to the proportion value corresponding to each reference parameter, the number of audio objects present in the plurality of audio segments.
- the determination module is configured to perform a Laplace transform on the similarity matrix corresponding to any one of the multiple reference parameters to obtain the Laplacian matrix corresponding to that reference parameter; perform singular value decomposition on the Laplacian matrix to obtain multiple reference eigenvalues; determine, among the multiple reference eigenvalues, a second eigenvalue and a first number of first eigenvalues, where the second eigenvalue is the maximum of the multiple reference eigenvalues and the first eigenvalues are the reference eigenvalues that satisfy a second requirement after the multiple reference eigenvalues are sorted in a second order; determine the difference between every two adjacent first eigenvalues among the first number of first eigenvalues to obtain multiple eigenvalue differences; normalize a first eigenvalue difference according to the second eigenvalue to obtain a normalized eigenvalue difference, where the first eigenvalue difference is the largest of the multiple eigenvalue differences; and determine the proportion value corresponding to that reference parameter according to the normalized eigenvalue difference and the reference parameter.
- the determination module is configured to determine a first parameter among the plurality of reference parameters according to the proportion value corresponding to each reference parameter, where the first parameter is the reference parameter with the smallest corresponding proportion value; determine the multiple eigenvalue differences corresponding to the first parameter; and call a first function to process the multiple eigenvalue differences corresponding to the first parameter to obtain the number of audio objects present in the multiple audio clips.
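The selection of the first parameter and of the number of audio objects can be sketched as below. The text does not give the formulas, so this sketch makes several assumptions that are not from the source: the second order is ascending, the second requirement selects the `first_number` smallest eigenvalues, the proportion value is the reference parameter divided by the normalized largest eigenvalue difference, and the first function returns the position of the largest difference (counted from 1). The eigenvalues are hypothetical inputs; computing them from a Laplacian matrix is outside this sketch.

```python
def eigenvalue_differences(eigenvalues, first_number):
    """Differences between adjacent eigenvalues among the smallest ones."""
    smallest = sorted(eigenvalues)[:first_number]
    return [smallest[i + 1] - smallest[i] for i in range(len(smallest) - 1)]


def proportion_value(eigenvalues, p, first_number, eps=1e-10):
    """Assumed form: p divided by the normalized largest difference."""
    gaps = eigenvalue_differences(eigenvalues, first_number)
    normalized_gap = max(gaps) / (max(eigenvalues) + eps)
    return p / (normalized_gap + eps)


def count_from_differences(gaps):
    """Assumed 'first function': 1-based position of the largest gap."""
    return max(range(len(gaps)), key=gaps.__getitem__) + 1


def select_parameter_and_count(eigenvalues_by_p, first_number):
    """Pick the parameter with the smallest proportion value, then count."""
    best_p = min(
        eigenvalues_by_p,
        key=lambda p: proportion_value(eigenvalues_by_p[p], p, first_number),
    )
    gaps = eigenvalue_differences(eigenvalues_by_p[best_p], first_number)
    return best_p, count_from_differences(gaps)


# Hypothetical eigenvalues for two reference parameters (p = 2 and p = 3).
eigenvalues_by_p = {
    2: [0.0, 0.1, 0.9, 1.0, 1.2],
    3: [0.0, 0.3, 0.6, 1.0, 1.5],
}
best_p, object_count = select_parameter_and_count(eigenvalues_by_p, first_number=4)
```

With these made-up inputs, p = 2 has one dominant gap after the second eigenvalue, so it yields the smaller proportion value and an estimate of two audio objects.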
- the clustering module is configured to perform singular value decomposition on the similarity matrix corresponding to the first parameter to obtain multiple decomposition eigenvalues; determine, among the multiple decomposition eigenvalues, as many decomposition eigenvalues as there are audio objects; determine the eigenvectors corresponding to those decomposition eigenvalues and generate a decomposition matrix, where the number of rows of the decomposition matrix is the number of audio objects and the number of columns is the number of audio segments; determine, according to the decomposition matrix, the feature vectors corresponding to the multiple audio segments, where each feature vector is used to indicate the corresponding audio segment; and cluster the multiple audio segments according to the number of audio objects and the feature vectors corresponding to the multiple audio segments to obtain the audio segments produced by the sound of each audio object.
- embodiments of the present application provide a computer device.
- the computer device includes a processor and a memory. At least one program code is stored in the memory, and the at least one program code is loaded and executed by the processor so that the computer device implements any of the above audio processing methods.
- a computer-readable storage medium is also provided. At least one program code is stored in the computer-readable storage medium, and the at least one program code is loaded and executed by a processor so that a computer implements any of the above audio processing methods.
- a computer program or computer program product is also provided. At least one computer instruction is stored in the computer program or computer program product, and the at least one computer instruction is loaded and executed by a processor so that a computer implements any of the above audio processing methods.
- The initial similarity matrix is adjusted to obtain the reference similarity matrix. This adjustment brings the similarities of the voiceprint vectors of audio clips of the same audio object closer together and reduces the similarities of the voiceprint vectors of audio clips of different audio objects, which makes the number of audio objects determined based on the reference similarity matrix more accurate.
- The multiple audio clips are then clustered based on this more accurate number of audio objects to obtain the audio clips corresponding to each audio object. As a result, the audio clips determined for each audio object are more accurate, audio object clustering is more accurate, and the audio processing effect for the audio clips improves.
- Figure 1 is a schematic diagram of the implementation environment of an audio processing method provided by an embodiment of the present application.
- Figure 2 is a flow chart of an audio processing method provided by an embodiment of the present application.
- Figure 3 is a schematic diagram of a similarity matrix determination process provided by an embodiment of the present application.
- Figure 4 is a flow chart of another audio processing method provided by an embodiment of the present application.
- Figure 5 is a schematic structural diagram of an audio processing device provided by an embodiment of the present application.
- Figure 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
- Figure 7 is a schematic structural diagram of a server provided by an embodiment of the present application.
- the audio processing method provided by the embodiment of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, smart transportation, assisted driving, games, etc.
- Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
- Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and many other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
- Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence.
- Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.
- Artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, driverless vehicles, autonomous driving, drones, robots, smart medical care, smart customer service, Internet of Vehicles, and smart transportation.
- Figure 1 is a schematic diagram of an implementation environment of an audio processing method provided by an embodiment of the present application. As shown in Figure 1, the implementation environment includes: a terminal device 101 and a server 102.
- the audio processing method provided by the embodiment of the present application can be executed by the terminal device 101, or by the server 102, or by both the terminal device 101 and the server 102. This is not limited by the embodiment of the present application.
- the server 102 undertakes the main calculation work and the terminal device 101 undertakes the secondary calculation work; or the server 102 undertakes the secondary calculation work and the terminal device 101 undertakes the main computing work; alternatively, the server 102 and the terminal device 101 adopt a distributed computing architecture for collaborative computing.
- the terminal device 101 can be any electronic product that can perform human-computer interaction with the user through one or more methods such as keyboard, touch pad, touch screen, remote control, voice interaction or handwriting device.
- Terminal devices 101 include but are not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, etc.
- the server 102 is a single server, a server cluster composed of multiple servers, a cloud computing platform, a virtualization center, or any node in a blockchain system, which is not limited in the embodiments of the present application.
- the server 102 communicates with the terminal device 101 through a wired network or a wireless network.
- the server 102 has a data receiving function, a data processing function and a data sending function.
- the server 102 may also have other functions, which are not limited in the embodiments of this application.
- terminal equipment 101 and server 102 are only examples.
- the embodiment of the present application provides an audio processing method, which is executed by a computer device.
- the method can be applied to the implementation environment shown in Figure 1.
- The computer device can be the terminal device 101 in Figure 1 or the server 102 in Figure 1, which is not limited in the embodiments of the present application. Taking the flow chart of an audio processing method provided by the embodiment of the present application shown in Figure 2 as an example, the method includes the following steps 201 to 205.
- In step 201, voiceprint vectors corresponding to multiple audio segments are determined, and the voiceprint vectors are used to indicate voiceprint features corresponding to the audio segments.
- multiple audio segments need to be obtained first.
- the multiple audio segments are at least two audio segments, and each audio segment corresponds to an audio object.
- the audio object is the object that produces the sound in the audio clip, and the audio objects corresponding to different audio clips may be the same or different.
- the embodiment of the present application does not limit the acquisition process of multiple audio clips. For example, multiple candidate segments are stored in the storage space of the computer device, and multiple audio segments are obtained from the multiple candidate segments.
- multiple voice segments may be used as audio segments to be processed, or a part of the voice segments may be selected as audio segments to be processed.
- the song data is obtained as voice data, and the voice data is divided into multiple voice segments as audio segments according to the lyrics segmentation.
- the corresponding durations of multiple audio clips may be the same or different, and this is not limited in the embodiments of the present application.
- the corresponding duration of multiple audio clips is 2 seconds.
- the corresponding duration of some audio clips among the multiple audio clips is 2 seconds, and the corresponding duration of some audio clips is 5 seconds.
- the voiceprint vector corresponding to each audio segment is determined.
- the features corresponding to each audio segment may be the MFCC (Mel-scale Frequency Cepstral Coefficients) corresponding to each audio segment, the Mel-spectrum features corresponding to each audio segment, or other features, which is not limited in the embodiments of the present application.
- the process of determining the voiceprint vector corresponding to each audio segment according to the features corresponding to each audio segment includes: inputting the features corresponding to the audio segment into a voiceprint extraction model, and using the result output by the voiceprint extraction model as the voiceprint vector corresponding to the audio segment.
- the voiceprint extraction model can be any model, which is not limited in the embodiments of this application.
- the voiceprint extraction model can be a CLDNN (Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks) model, an X-vector model (the mainstream baseline model framework in the field of voiceprint recognition) based on a TDNN (Time Delay Neural Network), or ECAPA-TDNN (a model that extracts global features of speech).
- In step 202, an initial similarity matrix is determined based on the voiceprint vectors corresponding to each audio segment.
- the initial similarity matrix includes the similarity between the voiceprint vectors corresponding to any two audio segments.
- After the voiceprint vector corresponding to each audio segment is determined in step 201, the similarity between the voiceprint vectors corresponding to any two audio segments is determined based on those voiceprint vectors to obtain the initial similarity matrix.
- the similarity between the voiceprint vectors corresponding to any two audio segments is determined according to the following formula (1).
- $a_{i,j} = d(v_i, v_j), \quad i, j \in [1, N]$ (1)
- $a_{i,j}$ is the similarity between the voiceprint vector corresponding to the i-th audio segment and the voiceprint vector corresponding to the j-th audio segment
- $d(v_i, v_j)$ is the distance between $v_i$ and $v_j$
- $v_i$ is the voiceprint vector corresponding to the i-th audio clip
- $v_j$ is the voiceprint vector corresponding to the j-th audio clip
- $N$ is the total number of audio clips.
- the cosine similarity distance between the voiceprint vectors corresponding to any two audio clips can be used as the similarity between the voiceprint vectors corresponding to any two audio clips.
- the similarity between the voiceprint vectors corresponding to any two audio clips can also be determined in other ways, which is not limited in the embodiments of the present application.
- the number of rows of the initial similarity matrix is the number of multiple audio clips, and the number of columns is the number of multiple audio clips.
- the initial similarity matrix is a symmetric matrix, that is, the similarity located in the i-th row and j-th column of the initial similarity matrix is the same as the similarity located in the j-th row and i-th column.
- i and j are positive integers not greater than the number of multiple audio clips.
- The higher the similarity between the voiceprint vectors corresponding to any two audio clips, the higher the possibility that the audio objects corresponding to the two audio clips are the same audio object.
- Conversely, the lower the similarity between the voiceprint vectors corresponding to any two audio clips, the lower the possibility that the audio objects corresponding to the two audio clips are the same audio object.
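- For example, when there are five audio clips, the similarity between the voiceprint vectors of any two audio clips is determined according to the above formula (1), and the initial similarity matrix determined from these similarities is a 5×5 matrix.
- The initial similarity matrix has the following form:
- $\begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} & a_{1,5} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} & a_{2,5} \\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} & a_{3,5} \\ a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4} & a_{4,5} \\ a_{5,1} & a_{5,2} & a_{5,3} & a_{5,4} & a_{5,5} \end{bmatrix}$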
- $a_{1,1}$ represents the similarity between the voiceprint vector corresponding to the first audio clip and itself
- $a_{1,2}$ represents the similarity between the voiceprint vector corresponding to the first audio clip and the voiceprint vector corresponding to the second audio clip.
- The other elements in the initial similarity matrix have meanings similar to those of $a_{1,1}$ and $a_{1,2}$ and are not described in detail here.
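As a minimal sketch of how the initial similarity matrix could be built, the following Python code uses cosine similarity as the function d in formula (1), which the text names as one option. The function names and the two-dimensional sample voiceprint vectors are hypothetical; real voiceprint vectors would come from a voiceprint extraction model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def initial_similarity_matrix(voiceprints):
    """N x N matrix whose entry (i, j) is d(v_i, v_j)."""
    n = len(voiceprints)
    return [[cosine_similarity(voiceprints[i], voiceprints[j])
             for j in range(n)]
            for i in range(n)]


# Hypothetical voiceprint vectors for three audio clips.
voiceprints = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
A = initial_similarity_matrix(voiceprints)
```

The resulting matrix is square with one row and one column per audio clip, and it is symmetric because cosine similarity does not depend on the order of its arguments.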
- In step 203, the initial similarity matrix is adjusted according to the dynamic threshold corresponding to each row in the initial similarity matrix to obtain a reference similarity matrix.
- the dynamic threshold is used to adjust the similarity difference between different similarities.
- the dynamic threshold is used to bring the difference between the similarities of the voiceprint vectors of the audio clips of the same audio object closer, and/or to make the difference between the similarities of the voiceprint vectors of the audio clips of different audio objects farther away.
- Bringing the difference between the similarities of the voiceprint vectors of the audio clips of the same audio object closer refers to reducing the difference between a first similarity and a second similarity.
- The first similarity is the similarity between the voiceprint vector of a first audio segment and the voiceprint vector of a second audio segment, and the second similarity is the similarity between the voiceprint vector of the first audio segment and the voiceprint vector of a third audio segment.
- The first audio segment, the second audio segment, and the third audio segment correspond to the same audio object.
- Making the difference between the similarities of the voiceprint vectors of the audio clips of different audio objects farther away refers to increasing the difference between the first similarity and a third similarity.
- The third similarity is the similarity between the voiceprint vector of the first audio segment and the voiceprint vector of a fourth audio segment, where the first audio segment and the fourth audio segment correspond to different audio objects.
- Before the initial similarity matrix is adjusted according to the dynamic threshold corresponding to each row, the dynamic threshold corresponding to each row in the initial similarity matrix needs to be determined.
- This process includes: for any row in the initial similarity matrix, sorting, in a first order, the similarities in that row that fall within a preset similarity range to obtain a first sorting result; determining the similarity difference between every two adjacent similarities in the first sorting result to obtain multiple similarity differences, where the number of similarity differences is smaller than the number of similarities included in the row; determining, among the multiple similarity differences, the similarity difference that meets a first requirement; and determining the dynamic threshold corresponding to the row based on the similarity difference that meets the first requirement.
- the minuend corresponding to the similarity difference that meets the first requirement is used as the dynamic threshold corresponding to any row.
- the first order may be an order from small to large, or may be an order from large to small, which is not limited in the embodiment of the present application.
- the preset similarity range is set based on experience or adjusted according to the implementation environment, and is not limited in the embodiments of the present application. For example, the preset similarity range is [-1, 1].
- the similarity difference that satisfies the first requirement among the multiple similarity differences refers to the largest similarity difference among the multiple similarity differences.
- For example, the first order is from small to large, the preset similarity range is [-1, 1], and the similarities included in a row of the initial similarity matrix are 1, -0.3, 0.7, 0.5, and 0.9.
- Sorting the similarities within the preset similarity range from small to large gives the first sorting result: -0.3, 0.5, 0.7, 0.9, 1.
- The similarity differences between adjacent similarities are 0.8, 0.2, 0.2, and 0.1, so the largest similarity difference is 0.8. The minuend corresponding to 0.8, namely 0.5 (since 0.5 - (-0.3) = 0.8), is therefore used as the dynamic threshold corresponding to the row. In some embodiments, when the largest similarity difference corresponds to multiple minuends, any one of the minuends is determined as the dynamic threshold.
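The worked example above can be reproduced with a short Python sketch. It assumes the first order is ascending and that the first requirement selects the largest similarity difference, as in the example; the function name `dynamic_threshold` is hypothetical. When several differences tie for the largest, this sketch simply picks the first one, which is one of the permitted choices.

```python
def dynamic_threshold(row, low=-1.0, high=1.0):
    """Return the minuend of the largest gap between adjacent similarities.

    Only similarities inside the preset range [low, high] are considered.
    """
    kept = sorted(s for s in row if low <= s <= high)
    if len(kept) < 2:
        raise ValueError("need at least two similarities within the range")
    # gaps[i] = kept[i + 1] - kept[i], so kept[i + 1] is the minuend.
    gaps = [kept[i + 1] - kept[i] for i in range(len(kept) - 1)]
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    return kept[largest + 1]


threshold = dynamic_threshold([1, -0.3, 0.7, 0.5, 0.9])
print(threshold)  # 0.5
```

Because the threshold is derived from each row's own distribution of similarities, different rows generally receive different thresholds.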
- After the similarity difference between every two adjacent similarities within the preset similarity range is determined, a similarity difference vector can also be determined from the multiple similarity differences. The similarity difference vector includes the multiple similarity differences.
- The following formula (2) is the similarity difference vector:
- $gap_q = [a'_{q,2} - a'_{q,1},\ a'_{q,3} - a'_{q,2},\ \ldots,\ a'_{q,N} - a'_{q,N-1}]$ (2)
- where $gap_q$ is the similarity difference vector corresponding to any row q, and $a'_{q,1}$, $a'_{q,2}$, $a'_{q,3}$, $a'_{q,N-1}$ and $a'_{q,N}$ are the similarities ranked first, second, third, second-to-last and last, respectively, in the first sorting result sorted from small to large.
- Implementation method one: among the similarities included in the k-th row of the initial similarity matrix, the similarities smaller than the dynamic threshold corresponding to the k-th row are adjusted to a first value, and the reference similarity matrix is obtained based on the adjustment results of all rows, where k is a positive integer.
- The first value is set based on experience or adjusted according to the implementation environment, which is not limited in the embodiments of the present application.
- For example, the first value is 0.
- For example, the dynamic thresholds corresponding to the first to fifth rows of the initial similarity matrix are 0.5, 0.6, 0.7, 0.2 and 0.9, respectively. Then, among the similarities included in each row of the initial similarity matrix, the similarities smaller than the dynamic threshold corresponding to that row are adjusted to 0 to obtain the reference similarity matrix.
- Implementation method two: among the similarities included in the k-th row of the initial similarity matrix, the similarities smaller than the dynamic threshold corresponding to the k-th row are multiplied by a second value, and the reference similarity matrix is obtained based on the adjustment results of all rows.
- The second value is set based on experience or adjusted according to the implementation environment, which is not limited in the embodiments of the present application.
- For example, the second value is 0.01.
- For example, the dynamic thresholds corresponding to the first to fifth rows of the initial similarity matrix are 0.5, 0.6, 0.7, 0.2 and 0.9, respectively. Then, among the similarities included in each row of the initial similarity matrix, the similarities smaller than the dynamic threshold corresponding to that row are multiplied by 0.01 to obtain the reference similarity matrix.
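- The two implementation methods can be sketched as one helper. This is an illustrative sketch under stated assumptions: the name `apply_dynamic_thresholds` and its parameters are hypothetical, and only the example row and its threshold of 0.5 are taken from the text, since the full example matrices are not reproduced here.

```python
def apply_dynamic_thresholds(matrix, thresholds, second_value=None, first_value=0.0):
    """Adjust every similarity that is smaller than its row's dynamic threshold.

    second_value=None -> implementation method one: replace with first_value.
    second_value=0.01 -> implementation method two: multiply by second_value.
    """
    reference = []
    for row, threshold in zip(matrix, thresholds):
        adjusted = []
        for s in row:
            if s >= threshold:
                adjusted.append(s)                 # kept unchanged
            elif second_value is None:
                adjusted.append(first_value)       # method one
            else:
                adjusted.append(s * second_value)  # method two
        reference.append(adjusted)
    return reference


# Only this row and its threshold (0.5) come from the text's example.
rows = [[1, -0.3, 0.7, 0.5, 0.9]]
method_one = apply_dynamic_thresholds(rows, [0.5])
method_two = apply_dynamic_thresholds(rows, [0.5], second_value=0.01)
print(method_one)  # [[1, 0.0, 0.7, 0.5, 0.9]]
```

- With method two the suppressed similarity -0.3 becomes approximately -0.003 instead of 0, so a small trace of the original value is retained.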
- The distance between the first similarity and the second similarity in the reference similarity matrix is smaller than the distance between the first similarity and the second similarity in the initial similarity matrix, and the distance between the first similarity and the third similarity in the reference similarity matrix is greater than the distance between the first similarity and the third similarity in the initial similarity matrix. In this way, the similarities between voiceprint vectors of audio clips of the same audio object are brought closer together, and the similarities between voiceprint vectors of audio clips of different audio objects are pushed further apart.
- In step 204, the number of audio objects that produce the multiple audio clips is determined according to the reference similarity matrix.
- The process of determining the number of audio objects present in the multiple audio clips based on the reference similarity matrix includes: processing the reference similarity matrix according to multiple reference parameters to obtain the similarity matrix corresponding to each reference parameter; and determining the number of audio objects present in the multiple audio clips based on the multiple reference parameters and the similarity matrix corresponding to each reference parameter.
- the reference parameters are set based on experience or adjusted according to the implementation environment, which is not limited in the embodiments of the present application. The number of reference parameters is not limited in this application.
- When the reference similarity matrix is processed according to the multiple reference parameters, the process of obtaining the similarity matrix is similar for each reference parameter.
- Therefore, only the process of determining the similarity matrix corresponding to any one of the multiple reference parameters is described as an example. The process includes the following steps 1 to 5.
- Step 1: according to the reference parameter, perform numerical adjustment on the reference similarity matrix to obtain a first similarity matrix.
- The numerical adjustment is used to simplify the reference similarity matrix.
- According to the reference parameter, there are the following two ways to numerically adjust the reference similarity matrix to obtain the first similarity matrix.
- Method 1: for the multiple similarities included in each row of the reference similarity matrix, adjust the similarities other than those that meet the third requirement for the reference parameter to a third value, to obtain the first similarity matrix.
- The third value is set based on experience or adjusted according to the implementation environment, which is not limited in the embodiments of the present application.
- For example, the third value is 0.
- The similarities that meet the third requirement for the reference parameter are the largest similarities in a row, and their number equals the reference parameter. That is, in each row of the reference similarity matrix, as many similarities as the reference parameter indicates are left unadjusted, and these are the largest similarities in that row.
- The similarities in each row are sorted, and in the sorting result corresponding to each row the similarities other than the leading ones, whose number equals the reference parameter, are adjusted to the third value to obtain the first similarity matrix.
- For example, the reference parameter is 3 and the third value is 0. According to the reference parameter, it is determined from the reference similarity matrix that the three similarities meeting the third requirement are 1, 0.9 and 0.7 in the first row; 1, 0.6 and -0.003 in the second row; 1, 0.8 and 0.7 in the third row; 1, 0.8 and 0.5 in the fourth row; and 1, 0.9 and 0.006 in the fifth row. The remaining similarities in the reference similarity matrix are adjusted to 0 to obtain the first similarity matrix.
- Method 2: among the multiple similarities included in each row of the reference similarity matrix, multiply the similarities other than those that meet the third requirement for the reference parameter by a fourth value, to obtain the first similarity matrix.
- The fourth value is set based on experience or adjusted according to the implementation environment, which is not limited in the embodiments of the present application.
- For example, the fourth value is 0.01.
- That is, in each row of the reference similarity matrix, as many similarities as the reference parameter indicates are left unchanged, these being the largest similarities in that row, and all other similarities in the row are multiplied by the fourth value.
- For example, the reference parameter is 3 and the fourth value is 0.01. According to the reference parameter, it is determined from the reference similarity matrix that the three similarities meeting the third requirement are 1, 0.9 and 0.7 in the first row; 1, 0.6 and -0.003 in the second row; 1, 0.8 and 0.7 in the third row; 1, 0.8 and 0.5 in the fourth row; and 1, 0.9 and 0.006 in the fifth row. The remaining similarities in the reference similarity matrix are multiplied by 0.01 to obtain the first similarity matrix.
- Either of the above methods can be selected to numerically adjust the reference similarity matrix to obtain the first similarity matrix, which is not limited in the embodiments of the present application.
- The numerical adjustment can be written as $B = \mathrm{Threshold}(A, p)$, where B is the first similarity matrix, A is the reference similarity matrix, p is the reference parameter, and Threshold is the numerical adjustment function.
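- The numerical adjustment of step 1 can be sketched as follows. This is only an illustration under assumptions: the helper name `keep_top_p` and its parameters are hypothetical, and the single example row is a reference-matrix row consistent with the values quoted above for the first row (1, 0.9 and 0.7 retained).

```python
def keep_top_p(matrix, p, third_value=0.0, fourth_value=None):
    """Keep the p largest similarities of each row and adjust the rest.

    fourth_value=None -> method 1: set the other similarities to third_value.
    fourth_value=0.01 -> method 2: multiply the other similarities by it.
    """
    adjusted = []
    for row in matrix:
        # Column indices of the p largest similarities in this row.
        order = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        top = set(order[:p])
        new_row = []
        for j, s in enumerate(row):
            if j in top:
                new_row.append(s)
            elif fourth_value is None:
                new_row.append(third_value)
            else:
                new_row.append(s * fourth_value)
        adjusted.append(new_row)
    return adjusted


reference = [[1, -0.003, 0.7, 0.5, 0.9]]  # hypothetical one-row input
first_m1 = keep_top_p(reference, p=3)
first_m2 = keep_top_p(reference, p=3, fourth_value=0.01)
print(first_m1)  # [[1, 0.0, 0.7, 0.0, 0.9]]
```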
- Step 2: perform symmetry processing on the first similarity matrix to obtain a second similarity matrix.
- In the second similarity matrix, the similarity located in the i-th row and j-th column is the same as the similarity located in the j-th row and i-th column, where i and j are positive integers not greater than the number of audio clips.
- The first similarity matrix obtained after dynamic threshold processing and numerical adjustment may be an asymmetric matrix. However, the similarity between the voiceprint vector of the i-th audio clip and the voiceprint vector of the j-th audio clip is the same as the similarity between the voiceprint vector of the j-th audio clip and the voiceprint vector of the i-th audio clip, that is, the similarity located in the i-th row and j-th column should be the same as the similarity located in the j-th row and i-th column.
- Therefore, the first similarity matrix needs to be symmetrized so that the similarity in the i-th row and j-th column is the same as the similarity in the j-th row and i-th column.
- There are at least the following two ways to perform symmetry processing on the first similarity matrix to obtain the second similarity matrix.
- Method 1: determine the transposed matrix corresponding to the first similarity matrix; add the similarities located at the same positions in the first similarity matrix and its transposed matrix to obtain a similarity matrix to be adjusted; and halve each of the similarities included in the similarity matrix to be adjusted to obtain the second similarity matrix.
- This can be written as $C = \frac{1}{2}(B + B^{T})$, where C is the second similarity matrix, B is the first similarity matrix, and $B^{T}$ is the transposed matrix corresponding to the first similarity matrix.
- For example, given the first similarity matrix and its transposed matrix, the similarities at the same positions in the two matrices are added to obtain the similarity matrix to be adjusted, and each similarity in the similarity matrix to be adjusted is halved to obtain the second similarity matrix.
- Method 1 thus determines the second similarity matrix by averaging the first similarity matrix and its transposed matrix.
- Method 2: determine the larger of the similarity located in the i-th row and j-th column of the first similarity matrix and the similarity located in the j-th row and i-th column of the first similarity matrix, and use this larger similarity as both the similarity in the i-th row and j-th column and the similarity in the j-th row and i-th column of the second similarity matrix, to obtain the second similarity matrix.
- This can be written as $a'_{i,j} = \max(a_{ij}, a_{ji})$, where $a'_{i,j}$ is the similarity located in the i-th row and j-th column of the second similarity matrix, $a_{ij}$ is the similarity located in the i-th row and j-th column of the first similarity matrix, and $a_{ji}$ is the similarity located in the j-th row and i-th column of the first similarity matrix.
- Method 2 thus determines the second similarity matrix by taking the element-wise maximum of the first similarity matrix and its transposed matrix.
- any of the above methods can be selected to perform symmetry processing on the first similarity matrix to obtain the second similarity matrix, which is not limited in the embodiments of the present application.
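- Both symmetrization methods are short enough to sketch directly. The function names and the 3x3 matrix `B` below are hypothetical and chosen only to show the effect; they are not taken from the example matrices of the application.

```python
def symmetrize_mean(b):
    """Method 1: average the matrix with its transpose."""
    n = len(b)
    return [[(b[i][j] + b[j][i]) / 2 for j in range(n)] for i in range(n)]


def symmetrize_max(b):
    """Method 2: take the larger of the (i, j) and (j, i) entries."""
    n = len(b)
    return [[max(b[i][j], b[j][i]) for j in range(n)] for i in range(n)]


# Hypothetical asymmetric first similarity matrix.
B = [[1.0, 0.0, 0.7],
     [0.6, 1.0, 0.0],
     [0.7, 0.8, 1.0]]
C_mean = symmetrize_mean(B)
C_max = symmetrize_max(B)
print(C_max)  # [[1.0, 0.6, 0.7], [0.6, 1.0, 0.8], [0.7, 0.8, 1.0]]
```

- Averaging halves a similarity that survived in only one direction (0.6 becomes about 0.3), whereas taking the maximum keeps it at full strength.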
- Step 3: perform row-column diffusion on the second similarity matrix to obtain a third similarity matrix.
- The third similarity matrix is used to generate boundaries between multiple audio objects.
- The process of performing row-column diffusion on the second similarity matrix to obtain the third similarity matrix includes: determining the transposed matrix corresponding to the second similarity matrix, and determining the third similarity matrix based on the second similarity matrix and its transposed matrix.
- The similarity located in the m-th row and n-th column of the third similarity matrix is determined from the similarities in the m-th row of the second similarity matrix and the similarities in the n-th column of the transposed matrix corresponding to the second similarity matrix, where m and n are positive integers not greater than the number of audio clips.
- Specifically, the similarities in the m-th row of the second similarity matrix and the similarities in the n-th column of the transposed matrix are multiplied element by element and the products are summed, and the result is used as the similarity located in the m-th row and n-th column of the third similarity matrix.
- the similarities located in the m-th row of the second similarity matrix are 1, 0, 0.7, 0.5, and 0.9 respectively, and the similarities located in the n-th column of the corresponding transposed matrix of the second similarity matrix are 1 respectively.
- This can be written as $D = CC^{T}$, where D is the third similarity matrix, C is the second similarity matrix, and $C^{T}$ is the transposed matrix corresponding to the second similarity matrix.
- For example, given the second similarity matrix and its transposed matrix, the third similarity matrix is obtained as their matrix product.
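- Row-column diffusion is a matrix product of the second similarity matrix with its own transpose, which can be sketched as below. The function name and the 3x3 matrix `C` are hypothetical; the five-element row is the m-th row quoted in the text.

```python
def row_column_diffusion(c):
    """Return D with D[m][n] = sum over k of C[m][k] * C[n][k], i.e. C times C-transpose."""
    size = len(c)
    return [[sum(c[m][k] * c[n][k] for k in range(size)) for n in range(size)]
            for m in range(size)]


# Hypothetical symmetric second similarity matrix.
C = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
D = row_column_diffusion(C)
print(D)  # [[1.25, 1.0, 0.0], [1.0, 1.25, 0.0], [0.0, 0.0, 1.0]]

# A diagonal entry of D is the dot product of a row with itself.
row = [1, 0, 0.7, 0.5, 0.9]
diagonal_entry = sum(x * x for x in row)  # approximately 2.55
```

- Clips 1 and 2 share neighbours, so their mutual entry grows from 0.5 to 1.0, while clip 3 stays isolated. This is how diffusion sharpens the boundaries between audio objects.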
- Step 4: proportionally adjust the third similarity matrix to obtain a fourth similarity matrix.
- The proportional adjustment is used to bring the similarities included in each row of the third similarity matrix into the same range.
- The process of proportionally adjusting the third similarity matrix to obtain the fourth similarity matrix includes: determining the maximum similarity corresponding to each row according to the multiple similarities included in that row of the third similarity matrix; and dividing the multiple similarities included in each row of the third similarity matrix by the maximum similarity corresponding to that row to obtain the fourth similarity matrix.
- The third similarity matrix is proportionally adjusted according to the following formula (7) to obtain the fourth similarity matrix:
- $a''_{ij} = a_{ij} / a_{ik}$ (7)
- where $a''_{ij}$ is the similarity located in the i-th row and j-th column of the fourth similarity matrix, $a_{ij}$ is the similarity located in the i-th row and j-th column of the third similarity matrix, $a_{ik}$ is the maximum similarity corresponding to the i-th row of the third similarity matrix, and k is the column in which the maximum similarity of the i-th row of the third similarity matrix is located.
- For example, in the third similarity matrix the maximum similarity corresponding to the first row is 2.55, that of the second row is 1.36, that of the third row is 2.13, that of the fourth row is 1.95, and that of the fifth row is 2.17. Each row of the third similarity matrix is divided by its maximum similarity to obtain the fourth similarity matrix.
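- The proportional adjustment divides each row by its own maximum, as in the following sketch. The function name and the 2x2 input are hypothetical, and the sketch assumes every row has a positive maximum.

```python
def normalize_rows_by_max(d):
    """Divide every similarity by the maximum similarity of its row."""
    normalized = []
    for row in d:
        row_max = max(row)  # assumed to be positive
        normalized.append([s / row_max for s in row])
    return normalized


# Hypothetical symmetric third similarity matrix.
D = [[2.0, 1.0],
     [1.0, 4.0]]
E = normalize_rows_by_max(D)
print(E)  # [[1.0, 0.5], [0.25, 1.0]]
```

- The input is symmetric but the output is not (0.5 versus 0.25), because each row is scaled by a different maximum. This is why step 5 symmetrizes again.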
- Step 5: perform symmetry processing on the fourth similarity matrix to obtain the similarity matrix corresponding to the reference parameter.
- The process of symmetrizing the fourth similarity matrix to obtain the similarity matrix corresponding to the reference parameter is similar to the above process of symmetrizing the first similarity matrix to obtain the second similarity matrix, and is not repeated here.
- Through the above steps 1 to 5, the similarity matrix corresponding to each reference parameter is determined.
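- Steps 1 to 5 can be chained into a single function. The sketch below uses NumPy and makes several assumptions that the text leaves open: it uses method 1 with a third value of 0 in step 1 and the averaging method in steps 2 and 5, and the function name and the 3x3 input are hypothetical.

```python
import numpy as np


def similarity_matrix_for_parameter(reference, p):
    """Apply steps 1 to 5 for one reference parameter p (illustrative sketch)."""
    A = np.asarray(reference, dtype=float)
    # Step 1: keep the p largest similarities of each row, set the rest to 0.
    top = np.argsort(A, axis=1)[:, -p:]
    B = np.zeros_like(A)
    np.put_along_axis(B, top, np.take_along_axis(A, top, axis=1), axis=1)
    # Step 2: symmetrize by averaging with the transpose.
    C = (B + B.T) / 2
    # Step 3: row-column diffusion.
    D = C @ C.T
    # Step 4: divide each row by its maximum similarity.
    E = D / D.max(axis=1, keepdims=True)
    # Step 5: symmetrize again.
    return (E + E.T) / 2


# Hypothetical reference similarity matrix for three audio clips.
A = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
S = similarity_matrix_for_parameter(A, p=2)
```

- Running the function once per reference parameter yields the family of similarity matrices that the following paragraphs compare.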
- Figure 3 is a schematic diagram of a similarity matrix determination process provided by an embodiment of the present application.
- (1) in Figure 3 is the initial similarity matrix
- (2) in Figure 3 is the reference similarity matrix
- (3) in Figure 3 is the first similarity matrix
- (4) in Figure 3 is the third similarity matrix
- (5) in Figure 3 is the fourth similarity matrix
- (6) in Figure 3 is the similarity matrix corresponding to the reference parameters.
- In (1) of Figure 3, both the horizontal axis and the vertical axis are the number of audio clips. The higher the brightness of an area in Figure 3, the higher the similarity between the voiceprint vectors of the two corresponding audio clips.
- The process of determining the number of audio objects present in the multiple audio clips based on the multiple reference parameters and the similarity matrix corresponding to each reference parameter includes: determining, based on the multiple reference parameters and the similarity matrix corresponding to each reference parameter, the proportion value corresponding to each reference parameter, where the proportion value is used to indicate the number of similarities retained in the similarity matrix corresponding to the reference parameter; and determining the number of audio objects present in the multiple audio clips according to the proportion value corresponding to each reference parameter.
- The process of determining the proportion value corresponding to each reference parameter includes: for any reference parameter among the multiple reference parameters, performing a Laplace transformation on the similarity matrix corresponding to that reference parameter to obtain the Laplacian matrix corresponding to that reference parameter; performing singular value decomposition on the Laplacian matrix to obtain multiple reference eigenvalues; and determining, among the multiple reference eigenvalues, a second eigenvalue and a first number of first eigenvalues, where the second eigenvalue is the maximum of the multiple reference eigenvalues and the first eigenvalues are the reference eigenvalues that satisfy the second requirement after the multiple reference eigenvalues are sorted in a second order.
- The eigenvalue differences between adjacent first eigenvalues are then determined to obtain multiple eigenvalue differences; a first eigenvalue difference is normalized using the second eigenvalue to obtain a normalized eigenvalue difference, where the first eigenvalue difference is the largest of the multiple eigenvalue differences; and the proportion value corresponding to the reference parameter is determined according to the normalized eigenvalue difference and the reference parameter.
- the first number is set based on experience or adjusted according to the implementation environment, which is not limited in the embodiments of the present application. For example, the first quantity is 3.
- the second order may be an order from small to large, or may be an order from large to small, which is not limited in the embodiment of the present application.
- For example, the first eigenvalues are the first number of reference eigenvalues obtained after sorting the multiple reference eigenvalues from small to large.
- Alternatively, the first eigenvalues are the first number of reference eigenvalues obtained after sorting the multiple reference eigenvalues from large to small.
- After the eigenvalue difference between every two adjacent first eigenvalues is determined, an eigenvalue difference vector can also be determined from the multiple eigenvalue differences.
- The eigenvalue difference vector includes the multiple eigenvalue differences.
- The following formula (8) is the eigenvalue difference vector corresponding to any reference parameter:
- $e_p = [\lambda_{p,2} - \lambda_{p,1},\ \lambda_{p,3} - \lambda_{p,2},\ \ldots,\ \lambda_{p,Y} - \lambda_{p,Y-1}]$ (8)
- where $e_p$ is the eigenvalue difference vector corresponding to the reference parameter; $\lambda_{p,1}$, $\lambda_{p,2}$, $\lambda_{p,3}$, $\lambda_{p,Y-1}$ and $\lambda_{p,Y}$ are the reference eigenvalues ranked first, second, third, (Y-1)-th and Y-th, respectively, after the multiple reference eigenvalues are sorted from small to large; and Y is the first number.
- The eigenvalue difference is normalized according to the following formula (9) to obtain the normalized eigenvalue difference:
- $g_p = \max(e_p) / (\lambda_{\max} + \epsilon)$ (9)
- where $g_p$ is the normalized eigenvalue difference, $\max(e_p)$ is the first eigenvalue difference, $\lambda_{\max}$ is the second eigenvalue, and $\epsilon$ is the normalization parameter. For example, the value of $\epsilon$ is $1 \times 10^{-10}$.
- the proportion value corresponding to any reference parameter is determined according to the following formula (10).
- r(p) is the proportion value corresponding to any reference parameter
- p is any reference parameter
- g p is the eigenvalue difference after normalization.
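- The proportion value can be sketched with NumPy as follows. Several points are assumptions rather than statements of the application: the Laplace transformation is taken to be the unnormalized graph Laplacian (degree matrix minus similarity matrix), the normalization is taken to be a division by the largest eigenvalue plus a small constant, and the proportion value is taken to be the reference parameter divided by the normalized eigenvalue difference. The function names and the input matrix are hypothetical.

```python
import numpy as np


def eigen_gaps(similarity, first_number):
    """Return the gaps between the smallest eigenvalues and the largest eigenvalue."""
    S = np.asarray(similarity, dtype=float)
    # Assumption: unnormalized graph Laplacian, L = degree matrix - S.
    laplacian = np.diag(S.sum(axis=1)) - S
    # Singular values of the Laplacian, sorted from small to large.
    values = np.sort(np.linalg.svd(laplacian, compute_uv=False))
    gaps = np.diff(values[:first_number])
    return gaps, values[-1]


def proportion_value(similarity, p, first_number=3, eps=1e-10):
    """Return r(p) under the assumed forms g_p = max(e_p) / (lambda_max + eps), r(p) = p / g_p."""
    gaps, lambda_max = eigen_gaps(similarity, first_number)
    g_p = gaps.max() / (lambda_max + eps)
    return p / g_p


# Hypothetical similarity matrix with two clearly separated groups of clips.
S = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
gaps, lambda_max = eigen_gaps(S, first_number=3)
r = proportion_value(S, p=2)
```

- For this input the Laplacian has eigenvalues 0, 0, 2 and 2, so the gaps among the three smallest are approximately 0 and 2, and a large normalized gap gives a small proportion value.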
- The process of determining the number of audio objects present in the multiple audio clips according to the proportion value corresponding to each reference parameter includes: determining a first parameter among the multiple reference parameters according to the proportion value corresponding to each reference parameter, where the first parameter is the reference parameter with the smallest corresponding proportion value; determining the multiple eigenvalue differences corresponding to the first parameter; and calling a first function to process the multiple eigenvalue differences corresponding to the first parameter to obtain the number of audio objects present in the multiple audio clips.
- That is, the reference parameter corresponding to the smallest proportion value is used as the first parameter.
- The process of determining the multiple eigenvalue differences corresponding to the first parameter is: determining the similarity matrix corresponding to the first parameter; performing a Laplace transformation on this similarity matrix to obtain the Laplacian matrix corresponding to the first parameter; performing singular value decomposition on this Laplacian matrix to obtain multiple reference eigenvalues; sorting the smallest first number of reference eigenvalues among the multiple reference eigenvalues; and using the differences between every two adjacent reference eigenvalues in the sorted order as the multiple eigenvalue differences corresponding to the first parameter.
- The process of calling the first function to process the multiple eigenvalue differences corresponding to the first parameter to obtain the number of audio objects present in the multiple audio clips includes: combining the multiple eigenvalue differences corresponding to the first parameter into the eigenvalue difference vector corresponding to the first parameter, and calling the first function to process this vector to obtain the number of audio objects present in the multiple audio clips.
- The multiple eigenvalue differences corresponding to the first parameter are processed according to the following formula (11) to obtain the number of audio objects present in the multiple audio clips:
- $M = \arg\max(e_p)$ (11)
- where M is the number of audio objects present in the multiple audio clips, arg max() is the first function, and $e_p$, with p taken as the first parameter, is the eigenvalue difference vector corresponding to the first parameter, that is, the vector composed of the multiple eigenvalue differences corresponding to the first parameter.
- For example, the similarity matrix corresponding to the first parameter is matrix Q. A Laplace transformation is performed on matrix Q to obtain the Laplacian matrix P corresponding to the first parameter, and singular value decomposition is performed on matrix P to obtain multiple reference eigenvalues a, b, c, d, e and f. Sorting the reference eigenvalues from small to large gives (b, c, a, e, f, d). With a first number of 3, the three smallest reference eigenvalues are b, c and a, and the differences between adjacent ones, c-b and a-c, are the multiple eigenvalue differences corresponding to the first parameter. The eigenvalue difference vector is therefore [c-b, a-c].
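- Selecting the first parameter and counting the audio objects can be sketched as follows. The function names and all numeric values are hypothetical, and the arg max is assumed to use 1-based positions, so that a largest gap at position i yields i audio objects.

```python
def select_first_parameter(proportion_values):
    """Return the reference parameter whose proportion value is smallest."""
    return min(proportion_values, key=proportion_values.get)


def count_audio_objects(reference_eigenvalues, first_number):
    """Return the 1-based position of the largest gap among the smallest eigenvalues."""
    smallest = sorted(reference_eigenvalues)[:first_number]
    gaps = [b - a for a, b in zip(smallest, smallest[1:])]
    # Assumption: arg max counts positions from 1.
    return max(range(len(gaps)), key=gaps.__getitem__) + 1


# Hypothetical proportion values for the reference parameters 2, 3 and 4.
first_parameter = select_first_parameter({2: 3.1, 3: 1.2, 4: 2.0})

# Hypothetical reference eigenvalues for the first parameter.
eigenvalues = [1.5, 0.0, 0.02, 1.7, 0.01, 1.6]
M = count_audio_objects(eigenvalues, first_number=5)
print(first_parameter, M)  # 3 3
```

- Three eigenvalues close to zero followed by a jump to 1.5 indicate three well-separated groups of clips, hence three audio objects.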
- In step 205, the multiple audio clips are clustered according to the number of audio objects to obtain the audio clips produced by each audio object.
- The process of clustering the multiple audio clips according to the number of audio objects to obtain the audio clips corresponding to each audio object includes: performing singular value decomposition on the similarity matrix corresponding to the first parameter to obtain multiple decomposition eigenvalues; determining, among the multiple decomposition eigenvalues, as many decomposition eigenvalues as there are audio objects; determining the eigenvector corresponding to each of these decomposition eigenvalues; and generating a decomposition matrix from these eigenvectors, where the number of rows of the decomposition matrix is the number of audio objects and the number of columns is the number of audio clips.
- The process further includes: determining, according to the decomposition matrix, the feature vector corresponding to each of the multiple audio clips, where a feature vector is used to indicate the corresponding audio clip; and clustering the multiple audio clips according to the number of audio objects and the feature vectors corresponding to the multiple audio clips to obtain the audio clips corresponding to each audio object.
- The determined decomposition eigenvalues are the smallest decomposition eigenvalues, and their number equals the number of audio objects.
- For example, when the number of audio objects is 3, the three smallest decomposition eigenvalues are determined among the multiple decomposition eigenvalues, and the eigenvectors corresponding to these three decomposition eigenvalues are determined.
- With five audio clips, the eigenvector corresponding to each decomposition eigenvalue is a 1*5 vector, and the three 1*5 eigenvectors form a 3*5 decomposition matrix.
- The first column of the decomposition matrix is used as the feature vector corresponding to the first audio clip, the second column as the feature vector corresponding to the second audio clip, the third column as the feature vector corresponding to the third audio clip, the fourth column as the feature vector corresponding to the fourth audio clip, and the fifth column as the feature vector corresponding to the fifth audio clip.
- For example, the eigenvectors corresponding to the three decomposition eigenvalues are $[x_1, x_2, x_3, x_4, x_5]$, $[y_1, y_2, y_3, y_4, y_5]$ and $[z_1, z_2, z_3, z_4, z_5]$. The decomposition matrix composed of these eigenvectors is then $\begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \\ y_1 & y_2 & y_3 & y_4 & y_5 \\ z_1 & z_2 & z_3 & z_4 & z_5 \end{bmatrix}$. Therefore, $[x_1, y_1, z_1]$ is used as the feature vector corresponding to the first audio clip, $[x_2, y_2, z_2]$ as the feature vector corresponding to the second audio clip, $[x_3, y_3, z_3]$ as the feature vector corresponding to the third audio clip, $[x_4, y_4, z_4]$ as the feature vector corresponding to the fourth audio clip, and $[x_5, y_5, z_5]$ as the feature vector corresponding to the fifth audio clip.
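- Reading the per-clip feature vectors off the decomposition matrix amounts to taking its columns, as in this small sketch. The function name is hypothetical, and strings stand in for the symbolic entries used in the example.

```python
def clip_feature_vectors(decomposition_matrix):
    """Return one feature vector per audio clip, i.e. one per matrix column."""
    return [list(column) for column in zip(*decomposition_matrix)]


# 3*5 decomposition matrix: one row per eigenvector, one column per audio clip.
decomposition = [["x1", "x2", "x3", "x4", "x5"],
                 ["y1", "y2", "y3", "y4", "y5"],
                 ["z1", "z2", "z3", "z4", "z5"]]
vectors = clip_feature_vectors(decomposition)
print(vectors[0])  # ['x1', 'y1', 'z1']
```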
- The multiple audio clips are clustered with the K-means clustering algorithm to obtain the audio clips corresponding to each audio object, where the value of K is the number of audio objects.
- other clustering algorithms can also be used to cluster multiple audio clips, which is not limited in the embodiments of the present application.
- For example, there are five audio clips to be processed, namely audio clip 1, audio clip 2, audio clip 3, audio clip 4 and audio clip 5.
- The number of audio objects present in the audio clips to be processed is 3.
- After clustering, the audio clips corresponding to audio object 1 are audio clip 1 and audio clip 3, the audio clip corresponding to audio object 2 is audio clip 5, and the audio clips corresponding to audio object 3 are audio clip 2 and audio clip 4.
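- The clustering step can be illustrated with a very small K-means written in plain Python. This is a sketch only: the function name, the naive choice of the first K points as initial centers, and the five feature vectors are all hypothetical, and a production system would more likely rely on a library implementation.

```python
def kmeans(points, k, iterations=20):
    """Tiny Lloyd-style K-means; returns one cluster label per point."""
    centers = [list(p) for p in points[:k]]  # naive initialization (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[i] = distances.index(min(distances))
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical feature vectors for audio clips 1 to 5.
features = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.9, 0.1, 0.0],
            [0.1, 0.9, 0.0],
            [0.0, 0.0, 1.0]]
labels = kmeans(features, k=3)
print(labels)  # [0, 1, 0, 1, 2]
```

- Clips 1 and 3 share a label, clips 2 and 4 share a label, and clip 5 is alone, which mirrors the grouping in the example above. The label numbers themselves are arbitrary.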
- Alternatively, the multiple audio clips can be clustered according to the number of audio objects and the voiceprint vectors corresponding to the multiple audio clips to obtain the audio clips corresponding to each audio object.
- This process is similar to the above process of clustering the multiple audio clips according to the number of audio objects and the feature vectors corresponding to the multiple audio clips, and is not repeated here.
- The audio processing method provided by the embodiments of the present application can be applied in the field of games to determine whether the same game account (or the same smart device) is used by several users.
- Audio clips of the users are collected while the game account (or the smart device) is in use, and the audio processing method provided by the embodiments of the present application is called to determine the voiceprint vector corresponding to each audio clip and, according to these voiceprint vectors, the number of audio objects present in the audio clips and the audio clips corresponding to each audio object, so as to learn how many users use the game account (or the smart device).
- The above method adjusts the initial similarity matrix according to the dynamic threshold corresponding to each row of the initial similarity matrix to obtain the reference similarity matrix.
- This brings the similarities between voiceprint vectors of audio clips of the same audio object closer together and pushes the similarities between voiceprint vectors of audio clips of different audio objects further apart, so that the number of audio objects determined based on the reference similarity matrix is more accurate.
- The multiple audio clips are then clustered based on this more accurate number of audio objects to obtain the audio clips corresponding to each audio object, so that the determined audio clips corresponding to each audio object are more accurate, the accuracy of audio object clustering is higher, and the audio processing effect on the audio clips is improved.
- The method provided in this embodiment sorts the similarities in each row of the initial similarity matrix and determines the dynamic threshold from the differences between adjacent similarities in the sorted order. Adjusting the initial similarity matrix with this dynamic threshold draws the similarities of voiceprints belonging to the same audio object closer together and pushes the similarities of voiceprints of different audio objects further apart, which improves the accuracy of determining the number of audio objects.
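- The row-wise threshold selection described above can be sketched as follows. This is a minimal illustration, not the patent's definitive procedure: the function name `dynamic_threshold`, the preset range bounds `low` and `high`, the reading of the "first requirement" as the largest adjacent difference, and the choice of the similarity just above that gap as the threshold are all assumptions.

```python
def dynamic_threshold(row, low=0.0, high=1.0):
    """Return a per-row threshold placed at the largest gap between sorted similarities."""
    # Keep only similarities inside the preset range (assumed bounds).
    kept = sorted(s for s in row if low <= s <= high)
    if len(kept) < 2:
        # Too few values to form a gap; fall back to the lower bound (assumption).
        return low
    # Differences between adjacent similarities in the sorted order (cf. Formula (2)).
    gaps = [kept[i + 1] - kept[i] for i in range(len(kept) - 1)]
    # Assumed "first requirement": the largest adjacent difference.
    k = max(range(len(gaps)), key=gaps.__getitem__)
    # Assumed rule: the similarity just above the largest gap becomes the threshold.
    return kept[k + 1]


row = [1.0, 0.92, 0.88, 0.31, 0.25, 0.9]
threshold = dynamic_threshold(row)
print(threshold)
```

With this example row, the largest gap lies between 0.31 and 0.88, so the two low similarities fall below the returned threshold and the four high ones do not.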
- The method provided in this embodiment sets the similarities in the initial similarity matrix that fall below the dynamic threshold to a first value, which improves the efficiency of adjusting the initial similarity matrix.
- The method provided in this embodiment multiplies the similarities in the initial similarity matrix that fall below the dynamic threshold by a second value, which improves the flexibility and accuracy of the resulting reference similarity matrix.
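- The two adjustment options (setting low similarities to a first value, or multiplying them by a second value) can be sketched as follows. The function name `adjust_rows` and the default values are hypothetical; the source does not state concrete values for the first or second value.

```python
def adjust_rows(matrix, thresholds, first_value=0.0, second_value=None):
    """Suppress similarities that are below their row's dynamic threshold.

    If second_value is None, low similarities are set to first_value;
    otherwise they are multiplied by second_value.
    """
    adjusted = []
    for row, thr in zip(matrix, thresholds):
        new_row = []
        for s in row:
            if s < thr:
                s = first_value if second_value is None else s * second_value
            new_row.append(s)
        adjusted.append(new_row)
    return adjusted


A = [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.4],
     [0.2, 0.4, 1.0]]
thresholds = [0.9, 0.9, 1.0]  # one (assumed) dynamic threshold per row
hard = adjust_rows(A, thresholds)                    # set to the first value
soft = adjust_rows(A, thresholds, second_value=0.5)  # scale by the second value
print(hard)
print(soft)
```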
- The method provided in this embodiment generates multiple reference parameters and processes the reference similarity matrix with each of them to obtain the similarity matrix corresponding to each reference parameter, which improves the accuracy of the similarity matrices and of the number of audio objects determined from them.
- the method provided in this embodiment adjusts the reference similarity matrix when determining the similarity matrix corresponding to the reference parameter. After the adjustment, since the adjusted first similarity matrix may be asymmetric, the first similarity matrix is symmetrized, and then column diffusion and proportion adjustment are performed, which improves the accuracy of determining the similarity matrix.
- When determining the first similarity matrix, the method provided in this embodiment keeps, in each row, a number of similarities equal to the reference parameter that satisfy the third requirement and numerically adjusts the other similarities; it then determines the second similarity matrix from the first similarity matrix and its transposed matrix, which improves the accuracy of the similarity matrix corresponding to the reference parameter.
- Figure 4 is a flow chart of another audio processing method provided by an embodiment of the present application. As shown in Figure 4, the method includes the following steps 401 to 415.
- step 201 this process has been described in step 201 above, and will not be described again here.
- the signal preprocessing method for the audio clip includes at least one of segmentation, noise reduction, sampling, quantization and other processing methods.
- The voiceprint extraction model can be a CLDNN model, a TDNN-based x-vector model, or ECAPA-TDNN.
- step 202 this process has been described in step 202 above, and will not be described again here.
- The similarity between the voiceprint vectors corresponding to any two audio clips is determined, and the initial similarity matrix is constructed from these similarities. The number of rows and the number of columns of the initial similarity matrix both equal the number of audio clips, so the matrix records the similarity between the voiceprint vectors of every pair of audio clips.
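- Constructing the initial similarity matrix (cf. Formula (1)) can be sketched as follows. The source does not fix the similarity function d in this passage; cosine similarity is used here as an assumption, and the function names are hypothetical. The voiceprint vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors (assumed choice of d)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def initial_similarity_matrix(vectors):
    """Build the N x N matrix whose entry (i, j) is d(v_i, v_j)."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


# Three toy voiceprint vectors; the first and third point in the same direction.
voiceprints = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
A = initial_similarity_matrix(voiceprints)
for row in A:
    print(row)
```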
- step 203 this process has been described in step 203 above, and will not be described again here.
- The dynamic threshold is used to draw the similarities between voiceprint features of audio clips of the same speaking object closer together, or to push the similarities between voiceprint features of audio clips of different speaking objects further apart.
- For the multiple similarities included in each row of the reference similarity matrix, the similarities other than the p similarities that meet the third requirement, where p is the value of the reference parameter, are adjusted to a third value to obtain the first similarity matrix; or, among the multiple similarities included in the reference similarity matrix, the similarities other than the p similarities that meet the third requirement are multiplied by a fourth value to obtain the first similarity matrix.
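- The numerical adjustment for one reference parameter (cf. Formula (3), B = Threshold(A, p)) can be sketched as follows. Reading the "third requirement" as "the p largest similarities in the row" is an assumption, as are the function name `keep_top_p` and the default third value of 0.

```python
def keep_top_p(matrix, p, third_value=0.0, fourth_value=None):
    """Keep the p largest similarities in each row and suppress the rest.

    If fourth_value is None, the other similarities are set to third_value;
    otherwise they are multiplied by fourth_value.
    """
    result = []
    for row in matrix:
        # Indices of the p largest similarities (assumed "third requirement").
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        top = set(order[:p])
        new_row = []
        for j, s in enumerate(row):
            if j not in top:
                s = third_value if fourth_value is None else s * fourth_value
            new_row.append(s)
        result.append(new_row)
    return result


R = [[1.0, 0.8, 0.3],
     [0.6, 1.0, 0.7],
     [0.3, 0.2, 1.0]]
B = keep_top_p(R, p=2)
print(B)
```

In this example B[0][1] is kept while B[1][0] is suppressed, which shows why the first similarity matrix can be asymmetric.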
- This process has been described in step 204 above and will not be described again here. Because the first similarity matrix obtained by adjusting the reference similarity matrix may be asymmetric, the first similarity matrix is symmetrized to obtain the second similarity matrix.
- this process has been described in step 204 above, and will not be described again here.
- Row-column diffusion is performed on the second similarity matrix using its transposed matrix to obtain the third similarity matrix.
- As in step 204, the maximum similarity corresponding to each row is determined from the multiple similarities included in that row of the third similarity matrix; the similarities in each row of the third similarity matrix are then divided by the maximum similarity of that row to obtain the fourth similarity matrix.
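- The proportional adjustment by each row's maximum can be sketched as follows. The function name is hypothetical, and the guard for a row whose maximum is not positive is an assumption that the source does not discuss.

```python
def normalize_rows_by_max(matrix):
    """Divide every similarity in a row by that row's maximum similarity."""
    normalized = []
    for row in matrix:
        row_max = max(row)
        # Guard against a non-positive maximum (assumption, not in the source).
        normalized.append([s / row_max if row_max > 0 else 0.0 for s in row])
    return normalized


D = [[2.0, 1.0, 0.0],
     [1.0, 4.0, 2.0],
     [0.0, 2.0, 2.0]]
E = normalize_rows_by_max(D)
print(E)
```

Here E[0][1] is 0.5 while E[1][0] is 0.25: a symmetric input can become asymmetric after row-wise scaling, which is why a further symmetrization step follows.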
- As in step 204, because the fourth similarity matrix obtained by proportionally adjusting the third similarity matrix may be asymmetric, the fourth similarity matrix is symmetrized to obtain the similarity matrix corresponding to the reference parameter.
- The proportional value indicates the number of similarities retained in the similarity matrix corresponding to the reference parameter.
- The smaller the proportional value, the fewer similarities are retained in the similarity matrix corresponding to the reference parameter and the more accurate the subsequently determined number of audio objects; conversely, the larger the proportional value, the more similarities are retained and the less accurate the subsequently determined number of audio objects.
- step 204 this process has been described in step 204 above, and will not be described again here.
- step 204 this process has been described in step 204 above, and will not be described again here.
- step 205 this process has been described in step 205 above, and will not be described again here.
- step 205 this process has been described in step 205 above, and will not be described again here.
- Figure 5 shows a schematic structural diagram of an audio processing device provided by an embodiment of the present application. As shown in Figure 5, the device includes:
- the determination module 501 is used to determine the voiceprint vectors corresponding to multiple audio clips, and the voiceprint vector is used to indicate the voiceprint characteristics corresponding to the audio clips;
- the determination module 501 is also used to determine an initial similarity matrix based on the voiceprint vector corresponding to each audio clip.
- the initial similarity matrix includes the similarity between the voiceprint vectors corresponding to any two audio clips;
- the adjustment module 502 is used to adjust the initial similarity matrix according to the dynamic threshold corresponding to each row in the initial similarity matrix to obtain a reference similarity matrix, where the dynamic threshold is used to adjust the similarity differences between different similarities;
- the determination module 501 is also configured to determine, according to the reference similarity matrix, the number of audio objects whose sounds produced the plurality of audio segments;
- the clustering module 503 is configured to cluster the plurality of audio segments according to the number of the audio objects to obtain the audio segments produced by each audio object.
- The determination module 501 is also configured to, for any row in the initial similarity matrix, sort the similarities in that row that fall within the preset similarity range in the first order to obtain a first sorting result; determine the similarity difference between every two adjacent similarities in the first sorting result to obtain multiple similarity differences; determine, among the multiple similarity differences, the similarity difference that satisfies the first requirement; and determine the dynamic threshold corresponding to that row based on the similarity difference that satisfies the first requirement.
- The adjustment module 502 is configured to adjust, among the similarities included in the k-th row of the initial similarity matrix, the similarities smaller than the dynamic threshold corresponding to the k-th row to a first value, and obtain the reference similarity matrix based on the adjustment results of each row, where k is a positive integer; or to multiply, among the similarities included in the k-th row of the initial similarity matrix, the similarities smaller than the dynamic threshold corresponding to the k-th row by a second value, and obtain the reference similarity matrix based on the adjustment results of each row.
- The determination module 501 is used to process the reference similarity matrix according to multiple reference parameters to obtain the similarity matrix corresponding to each reference parameter, and to determine the number of audio objects present in the multiple audio clips according to the multiple reference parameters and the similarity matrix corresponding to each reference parameter.
- The determination module 501 is configured to, for any reference parameter among the multiple reference parameters: numerically adjust the reference similarity matrix according to that reference parameter to obtain a first similarity matrix, the numerical adjustment being used to simplify the reference similarity matrix; symmetrize the first similarity matrix to obtain a second similarity matrix, in which the similarity in the i-th row and j-th column equals the similarity in the j-th row and i-th column, i and j being positive integers not greater than the number of audio clips; perform row-column diffusion on the second similarity matrix to obtain a third similarity matrix, which is used to generate the boundaries between multiple audio objects; proportionally adjust the third similarity matrix to obtain a fourth similarity matrix, the proportional adjustment bringing the similarities in each row of the third similarity matrix into the same range; and symmetrize the fourth similarity matrix to obtain the similarity matrix corresponding to that reference parameter.
- The determination module 501 is configured to, for the multiple similarities included in each row of the reference similarity matrix, adjust the similarities other than the p similarities that meet the third requirement, where p is the value of the reference parameter, to a third value to obtain the first similarity matrix; or to multiply, among the multiple similarities included in the reference similarity matrix, the similarities other than the p similarities that meet the third requirement by a fourth value to obtain the first similarity matrix.
- The determination module 501 is used to determine the transposed matrix corresponding to the first similarity matrix; add the similarities located at the same positions in the first similarity matrix and its transposed matrix to obtain a similarity matrix to be adjusted; and halve the similarities included in the similarity matrix to be adjusted to obtain the second similarity matrix.
- The determination module 501 is used to determine the larger of the similarity located in the i-th row and j-th column of the first similarity matrix and the similarity located in the j-th row and i-th column of the first similarity matrix, and to use that larger similarity as the similarity located in both the i-th row, j-th column and the j-th row, i-th column of the second similarity matrix, thereby obtaining the second similarity matrix.
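- The two symmetrization options described above (averaging a matrix with its transpose, or taking the element-wise maximum as in Formula (5)) can be sketched as follows. The function names are hypothetical.

```python
def symmetrize_average(matrix):
    """Add the matrix to its transpose position by position, then halve."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)]
            for i in range(n)]


def symmetrize_max(matrix):
    """Use max(a_ij, a_ji) for both positions (i, j) and (j, i)."""
    n = len(matrix)
    return [[max(matrix[i][j], matrix[j][i]) for j in range(n)]
            for i in range(n)]


B = [[1.0, 0.8, 0.0],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 1.0]]
avg = symmetrize_average(B)
mx = symmetrize_max(B)
print(avg)
print(mx)
```

The maximum variant preserves a strong one-directional similarity in full, while the averaging variant halves it when the opposite entry has been suppressed.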
- The determination module 501 is used to determine the transposed matrix corresponding to the second similarity matrix, and to determine the third similarity matrix based on the second similarity matrix and its transposed matrix. The similarity in the m-th row and n-th column of the third similarity matrix is determined from the similarities in the m-th row of the second similarity matrix and the similarities in the n-th column of the transposed matrix, where m and n are positive integers not greater than the number of audio clips.
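- Row-column diffusion can be sketched as the matrix product of the second similarity matrix with its transpose, following Formula (6), D = CC^T. Interpreting "determined from the m-th row and the n-th column" as a dot product is an assumption based on that formula, and the function names are hypothetical.

```python
def transpose(matrix):
    """Return the transposed matrix as a list of lists."""
    return [list(col) for col in zip(*matrix)]


def row_column_diffusion(c):
    """Compute D = C * C^T: entry (m, n) is row m of C dotted with column n of C^T."""
    ct = transpose(c)
    size = len(c)
    return [[sum(c[m][k] * ct[k][n] for k in range(size)) for n in range(size)]
            for m in range(size)]


C = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
D = row_column_diffusion(C)
print(D)
```

Clips 1 and 2 share neighbours and reinforce each other, while clip 3 stays isolated, which sharpens the boundary between audio objects.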
- The determination module 501 is configured to determine the maximum similarity corresponding to each row according to the multiple similarities included in that row of the third similarity matrix, and to divide the similarities included in each row of the third similarity matrix by the maximum similarity corresponding to that row to obtain the fourth similarity matrix.
- The determination module 501 is configured to determine the proportional value corresponding to each reference parameter according to the multiple reference parameters and the similarity matrix corresponding to each reference parameter, the proportional value indicating the number of similarities retained in the similarity matrix corresponding to the reference parameter; and to determine the number of audio objects present in the multiple audio clips based on the proportional values corresponding to the reference parameters.
- The determination module 501 is configured to, for any reference parameter among the multiple reference parameters: perform a Laplacian transformation on the similarity matrix corresponding to that reference parameter to obtain the corresponding Laplacian matrix; perform singular value decomposition on the Laplacian matrix to obtain multiple reference eigenvalues; determine, among the multiple reference eigenvalues, a second eigenvalue and a first number of first eigenvalues, where the second eigenvalue is the maximum of the multiple reference eigenvalues and the first eigenvalues are the reference eigenvalues that satisfy the second requirement after the multiple reference eigenvalues are sorted in the second order; determine the difference between every two adjacent first eigenvalues among the first number of first eigenvalues to obtain multiple eigenvalue differences; normalize the first eigenvalue difference, which is the largest of the multiple eigenvalue differences, by the second eigenvalue to obtain a normalized eigenvalue difference; and determine the proportional value corresponding to that reference parameter according to the normalized eigenvalue difference and the reference parameter.
- the determination module 501 is configured to determine a first parameter among multiple reference parameters according to the proportional value corresponding to each reference parameter.
- the first parameter is the one with the smallest corresponding proportional value among the multiple reference parameters.
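- The eigenvalue-gap computation and the choice of the first parameter can be sketched as follows, assuming the reference eigenvalues of each Laplacian matrix have already been computed. Several details are assumptions rather than statements from the source: ascending order as the "second order", the smallest eigenvalues as the "first eigenvalues", and the formula proportional value = p / normalized gap. The function name, `eps` and the toy eigenvalues are hypothetical.

```python
def proportional_value(eigenvalues, p, first_number, eps=1e-10):
    """Proportional value for reference parameter p from Laplacian eigenvalues."""
    if first_number < 2:
        raise ValueError("need at least two first eigenvalues to form a gap")
    ordered = sorted(eigenvalues)      # assumed "second order": ascending
    first = ordered[:first_number]     # assumed "first eigenvalues": the smallest
    lambda_max = ordered[-1]           # second eigenvalue: the overall maximum
    # Differences between adjacent first eigenvalues (cf. Formula (8)).
    gaps = [first[i + 1] - first[i] for i in range(len(first) - 1)]
    # Largest gap, normalized by the maximum eigenvalue.
    normalized_gap = max(gaps) / (lambda_max + eps)
    # Assumed combination rule: a large gap for a small p gives a small value.
    return p / (normalized_gap + eps)


# Toy eigenvalues per reference parameter (illustrative numbers only).
eigs_by_p = {2: [0.0, 0.1, 2.0, 2.2, 4.0],
             3: [0.0, 0.5, 1.0, 1.6, 4.0]}
values = {p: proportional_value(e, p, first_number=4) for p, e in eigs_by_p.items()}
first_parameter = min(values, key=values.get)  # smallest proportional value
print(values, first_parameter)
```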
- The clustering module 503 is used to perform singular value decomposition on the similarity matrix corresponding to the first parameter to obtain multiple decomposition eigenvalues; determine, among the multiple decomposition eigenvalues, a number of decomposition eigenvalues equal to the number of audio objects; determine the eigenvectors corresponding to those decomposition eigenvalues and generate a decomposition matrix whose number of rows is the number of audio objects and whose number of columns is the number of audio segments; determine, according to the decomposition matrix, the feature vector corresponding to each of the multiple audio segments, the feature vector being used to represent the corresponding audio segment; and cluster the multiple audio segments according to the number of audio objects and the feature vectors to obtain the audio segments produced by each audio object.
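- The final clustering step can be sketched as follows, assuming the decomposition matrix (one row per audio object, one column per audio segment) is already available. The source does not name the clustering algorithm in this passage; k-means with a deterministic farthest-point initialization is used here as an assumption, and all names and numbers are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    # Deterministic farthest-point initialization (assumption, for reproducibility).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda pt: min(sq_dist(pt, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(pt, centers[c])) for pt in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels


# Toy decomposition matrix: 2 audio objects (rows) x 4 audio segments (columns).
decomposition = [[0.9, 0.8, 0.1, 0.0],
                 [0.1, 0.0, 0.9, 1.0]]
# The feature vector of segment n is column n of the decomposition matrix.
features = [list(col) for col in zip(*decomposition)]
labels = kmeans(features, k=len(decomposition))
print(labels)
```

Segments that share a label are attributed to the same audio object.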
- The above device adjusts the initial similarity matrix according to the dynamic threshold corresponding to each row of the initial similarity matrix to obtain the reference similarity matrix. The adjustment draws the similarities between voiceprint vectors of audio clips of the same audio object closer together and pushes the similarities between voiceprint vectors of audio clips of different audio objects further apart, so the number of audio objects determined from the reference similarity matrix is more accurate. Clustering the multiple audio clips with this more accurate number then yields more accurate audio clips for each audio object, which improves the accuracy of audio object clustering and the audio processing effect.
- FIG. 6 shows a structural block diagram of a terminal device 600 provided by an exemplary embodiment of the present application.
- The terminal device 600 can be a portable mobile terminal, such as a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, or a desktop computer.
- the terminal device 600 may also be called a user device, a portable terminal, a laptop terminal, a desktop terminal, and other names.
- the terminal device 600 includes: a processor 601 and a memory 602.
- the processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc.
- the processor 601 can adopt at least one hardware form among DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), and PLA (Programmable Logic Array, programmable logic array).
- the processor 601 may also include a main processor and a co-processor.
- The main processor, also called the CPU (Central Processing Unit), is used to process data in the wake-up state; the co-processor is a low-power processor used to process data in the standby state.
- the processor 601 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is responsible for rendering and drawing content that needs to be displayed on the display screen.
- the processor 601 may also include an AI (Artificial Intelligence, artificial intelligence) processor, which is used to process computing operations related to machine learning.
- Memory 602 may include one or more computer-readable storage media, which may be non-transitory. Memory 602 may also include high-speed random access memory, and non-volatile memory, such as one or more disk storage devices, flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 602 is used to store at least one instruction, and the at least one instruction is used to be executed by the processor 601 to implement the audio processing method provided by the embodiment of the present application.
- the terminal device 600 optionally further includes: a peripheral device interface 603 and at least one peripheral device.
- the processor 601, the memory 602 and the peripheral device interface 603 may be connected through a bus or a signal line.
- Each peripheral device can be connected to the peripheral device interface 603 through a bus, a signal line or a circuit board.
- the peripheral device includes: at least one of a radio frequency circuit 604, a display screen 605, a camera assembly 606, an audio circuit 607 and a power supply 608.
- the peripheral device interface 603 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 601 and the memory 602 .
- In some embodiments, the processor 601, the memory 602 and the peripheral device interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602 and the peripheral device interface 603 can be implemented on separate chips or circuit boards, which is not limited in this embodiment.
- the radio frequency circuit 604 is used to receive and transmit RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
- Radio frequency circuitry 604 communicates with communication networks and other communication devices through electromagnetic signals.
- the radio frequency circuit 604 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
- the radio frequency circuit 604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and the like.
- the radio frequency circuit 604 can communicate with other terminal devices through at least one wireless communication protocol.
- the wireless communication protocol includes but is not limited to: World Wide Web, metropolitan area network, intranet, mobile communication networks of all generations (2G, 3G, 4G and 5G), wireless LAN and/or WiFi (Wireless Fidelity, Wireless Fidelity) network.
- the radio frequency circuit 604 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
- the display screen 605 is used to display UI (User Interface, user interface).
- the UI can include graphics, text, icons, videos, and any combination thereof.
- When the display screen 605 is a touch display screen, the display screen 605 can also collect touch signals on or above its surface.
- the touch signal can be input to the processor 601 as a control signal for processing.
- the display screen 605 can also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
- The display screen 605 may be a flexible display screen disposed on a curved surface or folding surface of the terminal device 600. The display screen 605 can even be set in a non-rectangular irregular shape, that is, a special-shaped screen.
- the display screen 605 can be made of LCD (Liquid Crystal Display, liquid crystal display), OLED (Organic Light-Emitting Diode, organic light-emitting diode) and other materials.
- the camera assembly 606 is used to capture images or videos.
- the camera assembly 606 includes a front camera and a rear camera.
- the front camera is arranged on the front panel of the terminal device 600 and the rear camera is arranged on the back of the terminal device 600 .
- There are at least two rear cameras, each of which is one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be combined to realize the background blur function.
- camera assembly 606 may also include a flash.
- the flash can be a single color temperature flash or a dual color temperature flash. Dual color temperature flash refers to a combination of warm light flash and cold light flash, which can be used for light compensation under different color temperatures.
- Audio circuitry 607 may include a microphone and speakers.
- the microphone is used to collect sound waves from the user and the environment, and convert the sound waves into electrical signals that are input to the processor 601 for processing, or to the radio frequency circuit 604 to implement voice communication. For the purpose of stereo collection or noise reduction, there may be multiple microphones, which are respectively arranged at different parts of the terminal device 600 .
- the microphone can also be an array microphone or an omnidirectional collection microphone.
- the speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves.
- the loudspeaker can be a traditional membrane loudspeaker or a piezoelectric ceramic loudspeaker.
- audio circuitry 607 may also include a headphone jack.
- the power supply 608 is used to power various components in the terminal device 600 .
- Power source 608 may be AC, DC, disposable batteries, or rechargeable batteries.
- the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. Wired rechargeable batteries are batteries that are charged through wired lines, and wireless rechargeable batteries are batteries that are charged through wireless coils.
- the rechargeable battery can also be used to support fast charging technology.
- the terminal device 600 further includes one or more sensors 609.
- the one or more sensors 609 include, but are not limited to: an acceleration sensor 610, a gyroscope sensor 611, a pressure sensor 612, an optical sensor 613, and a proximity sensor 614.
- the acceleration sensor 610 can detect the acceleration on three coordinate axes of the coordinate system established by the terminal device 600 .
- the acceleration sensor 610 can be used to detect the components of gravity acceleration on three coordinate axes.
- The processor 601 can control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravity acceleration signal collected by the acceleration sensor 610.
- the acceleration sensor 610 can also be used to collect game or user motion data.
- the gyro sensor 611 can detect the body direction and rotation angle of the terminal device 600, and the gyro sensor 611 can cooperate with the acceleration sensor 610 to collect the user's 3D movements on the terminal device 600. Based on the data collected by the gyro sensor 611, the processor 601 can implement the following functions: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
- the pressure sensor 612 may be provided on the side frame of the terminal device 600 and/or on the lower layer of the display screen 605 .
- the processor 601 performs left and right hand recognition or quick operation based on the grip signal collected by the pressure sensor 612.
- the processor 601 controls the operability controls on the UI interface according to the user's pressure operation on the display screen 605.
- the operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
- the optical sensor 613 is used to collect ambient light intensity.
- the processor 601 can control the display brightness of the display screen 605 according to the ambient light intensity collected by the optical sensor 613 . Specifically, when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is decreased.
- the processor 601 can also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 613.
- The proximity sensor 614, also called a distance sensor, is usually provided on the front panel of the terminal device 600.
- the proximity sensor 614 is used to collect the distance between the user and the front of the terminal device 600 .
- When the proximity sensor 614 detects that the distance between the user and the front of the terminal device 600 gradually decreases, the processor 601 controls the display screen 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 614 detects that the distance between the user and the front of the terminal device 600 gradually increases, the processor 601 controls the display screen 605 to switch from the screen-off state to the screen-on state.
- The structure shown in FIG. 6 does not constitute a limitation on the terminal device 600, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
- FIG. 7 is a schematic structural diagram of a server provided by an embodiment of the present application.
- The server 700 may vary greatly in configuration or performance, and may include one or more central processing units (CPUs) 701 and one or more memories 702, where at least one program code is stored in the one or more memories 702 and is loaded and executed by the one or more processors 701 to implement the audio processing method provided by the above method embodiments.
- the server 700 may also have components such as wired or wireless network interfaces, keyboards, and input and output interfaces for input and output.
- the server 700 may also include other components for implementing device functions, which will not be described again here.
- A computer-readable storage medium is also provided. At least one program code is stored in the storage medium, and the at least one program code is loaded and executed by a processor to cause a computer to implement any of the above audio processing methods.
- The above computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
- A computer program or computer program product is also provided. At least one computer instruction is stored in the computer program or computer program product, and the at least one computer instruction is loaded and executed by a processor to cause a computer to implement any of the above audio processing methods.
- The information (including but not limited to user equipment information and user personal information), data (including but not limited to data used for analysis, stored data, and displayed data) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
- the audio clips involved in this application were obtained with full authorization.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
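$a_{i,j} = d(v_i, v_j), \quad i, j \in [1, N]$    Formula (1)
$\mathrm{gap}_q = [a'_{q,2} - a'_{q,1},\ a'_{q,3} - a'_{q,2},\ \ldots,\ a'_{q,N} - a'_{q,N-1}]$    Formula (2)
$B = \mathrm{Threshold}(A, p)$    Formula (3)
$a'_{ij} = \max(a_{ij}, a_{ji})$    Formula (5)
$D = C C^{T}$    Formula (6)
$e_p = [\lambda_{p,2} - \lambda_{p,1},\ \lambda_{p,3} - \lambda_{p,2},\ \ldots,\ \lambda_{p,Y} - \lambda_{p,Y-1}]$    Formula (8)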
Claims (20)
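- An audio processing method, performed by a computer device, the method comprising: determining voiceprint vectors respectively corresponding to a plurality of audio segments, the voiceprint vector being used to represent the voiceprint feature corresponding to the audio segment; determining an initial similarity matrix according to the voiceprint vector corresponding to each audio segment, the initial similarity matrix comprising the similarity between the voiceprint vectors corresponding to any two audio segments; adjusting the initial similarity matrix according to the dynamic threshold corresponding to each row of the initial similarity matrix to obtain a reference similarity matrix, the dynamic threshold being used to adjust the similarity differences between different similarities; determining, according to the reference similarity matrix, the number of audio objects whose sounds produced the plurality of audio segments; and clustering the plurality of audio segments according to the number of audio objects to obtain the audio segments produced by each audio object.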
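- The method according to claim 1, wherein before the adjusting the initial similarity matrix according to the dynamic threshold corresponding to each row of the initial similarity matrix to obtain the reference similarity matrix, the method further comprises: for any row of the initial similarity matrix, sorting, in a first order, the similarities in that row that fall within a preset similarity range to obtain a first sorting result; determining the similarity difference between each two adjacent similarities in the first sorting result to obtain a plurality of similarity differences; determining, among the plurality of similarity differences, a similarity difference that satisfies a first requirement; and determining the dynamic threshold corresponding to that row according to the similarity difference that satisfies the first requirement.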
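- The method according to claim 1 or 2, wherein the adjusting the initial similarity matrix according to the dynamic threshold corresponding to each row of the initial similarity matrix to obtain the reference similarity matrix comprises: adjusting, among the similarities in the k-th row of the initial similarity matrix, each similarity smaller than the dynamic threshold corresponding to the k-th row to a first value, and obtaining the reference similarity matrix based on the adjustment results of the rows, k being a positive integer; or multiplying, among the similarities in the k-th row of the initial similarity matrix, each similarity smaller than the dynamic threshold corresponding to the k-th row by a second value, and obtaining the reference similarity matrix based on the adjustment results of the rows.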
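- The method according to any one of claims 1 to 3, wherein the determining, according to the reference similarity matrix, the number of audio objects whose utterances produced the plurality of audio clips comprises: processing the reference similarity matrix according to a plurality of reference parameters to obtain a similarity matrix corresponding to each reference parameter; and determining the number of audio objects present in the plurality of audio clips according to the plurality of reference parameters and the similarity matrices corresponding to the reference parameters.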
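- The method according to claim 4, wherein the processing the reference similarity matrix according to the plurality of reference parameters to obtain the similarity matrix corresponding to each reference parameter comprises: for any reference parameter among the plurality of reference parameters, performing numerical adjustment on the reference similarity matrix according to that reference parameter to obtain a first similarity matrix, the numerical adjustment being used to simplify the reference similarity matrix; symmetrizing the first similarity matrix to obtain a second similarity matrix, in which the similarity in row i, column j is the same as the similarity in row j, column i, i and j being positive integers not greater than the number of the plurality of audio clips; performing row-column diffusion on the second similarity matrix to obtain a third similarity matrix, the third similarity matrix being used to generate boundaries between a plurality of audio objects; performing proportional adjustment on the third similarity matrix to obtain a fourth similarity matrix, the proportional adjustment being used to bring the similarities in each row of the third similarity matrix into the same range; and symmetrizing the fourth similarity matrix to obtain the similarity matrix corresponding to that reference parameter.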
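- The method according to claim 5, wherein the performing numerical adjustment on the reference similarity matrix according to that reference parameter to obtain the first similarity matrix comprises: for the plurality of similarities in each row of the reference similarity matrix, adjusting to a third value the similarities other than the similarities that satisfy a third requirement, the number of which equals that reference parameter, to obtain the first similarity matrix; or multiplying by a fourth value, among the plurality of similarities in the reference similarity matrix, the similarities other than the similarities that satisfy the third requirement, the number of which equals that reference parameter, to obtain the first similarity matrix.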
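- The method according to claim 5 or 6, wherein the symmetrizing the first similarity matrix to obtain the second similarity matrix comprises: determining the transpose of the first similarity matrix; adding the similarities at the same positions in the first similarity matrix and its transpose to obtain a to-be-adjusted similarity matrix; and halving the plurality of similarities in the to-be-adjusted similarity matrix to obtain the second similarity matrix.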
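- The method according to claim 5 or 6, wherein the symmetrizing the first similarity matrix to obtain the second similarity matrix comprises: determining the larger of the similarity in row i, column j of the first similarity matrix and the similarity in row j, column i of the first similarity matrix, and using that larger similarity as the similarity in row i, column j and in row j, column i of the second similarity matrix, to obtain the second similarity matrix.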
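- The method according to any one of claims 5 to 8, wherein the performing row-column diffusion on the second similarity matrix to obtain the third similarity matrix comprises: determining the transpose of the second similarity matrix; and determining the third similarity matrix according to the second similarity matrix and its transpose, the similarity in row m, column n of the third similarity matrix being determined based on the similarities in row m of the second similarity matrix and the similarities in column n of the transpose of the second similarity matrix, m and n being positive integers not greater than the number of the plurality of audio clips.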
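- The method according to any one of claims 5 to 9, wherein the performing proportional adjustment on the third similarity matrix to obtain the fourth similarity matrix comprises: determining the maximum similarity of each row according to the plurality of similarities in each row of the third similarity matrix; and dividing the plurality of similarities in each row of the third similarity matrix by the maximum similarity of that row to obtain the fourth similarity matrix.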
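- The method according to any one of claims 4 to 10, wherein the determining the number of audio objects present in the plurality of audio clips according to the plurality of reference parameters and the similarity matrices corresponding to the reference parameters comprises: determining a ratio value corresponding to each reference parameter according to the plurality of reference parameters and the similarity matrices corresponding to the reference parameters, the ratio value indicating the number of similarities retained in the similarity matrix corresponding to the reference parameter; and determining the number of audio objects present in the plurality of audio clips according to the ratio values corresponding to the reference parameters.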
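- The method according to claim 11, wherein the determining the ratio value corresponding to each reference parameter according to the plurality of reference parameters and the similarity matrices corresponding to the reference parameters comprises: for any reference parameter among the plurality of reference parameters, performing a Laplacian transformation on the similarity matrix corresponding to that reference parameter to obtain a Laplacian matrix corresponding to that reference parameter; performing singular value decomposition on the Laplacian matrix to obtain a plurality of reference eigenvalues; determining, among the plurality of reference eigenvalues, a second eigenvalue and a first number of first eigenvalues, the second eigenvalue being the maximum of the plurality of reference eigenvalues, and the first eigenvalues being the reference eigenvalues that satisfy a second requirement after the plurality of reference eigenvalues are sorted in a second order; determining the difference between each two adjacent first eigenvalues among the first number of first eigenvalues to obtain a plurality of eigenvalue differences; normalizing a first eigenvalue difference according to the second eigenvalue to obtain a normalized eigenvalue difference, the first eigenvalue difference being the largest of the plurality of eigenvalue differences; and determining the ratio value corresponding to that reference parameter according to the normalized eigenvalue difference and that reference parameter.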
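- The method according to claim 11 or 12, wherein the determining the number of audio objects present in the plurality of audio clips according to the ratio values corresponding to the reference parameters comprises: determining a first parameter among the plurality of reference parameters according to the ratio values corresponding to the reference parameters, the first parameter being the reference parameter with the smallest ratio value among the plurality of reference parameters; determining the plurality of eigenvalue differences corresponding to the first parameter; and invoking a first function to process the plurality of eigenvalue differences corresponding to the first parameter to obtain the number of audio objects present in the plurality of audio clips.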
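- The method according to claim 13, wherein the clustering the plurality of audio clips according to the number of audio objects to obtain the audio clips uttered by each audio object comprises: performing singular value decomposition on the similarity matrix corresponding to the first parameter to obtain a plurality of decomposition eigenvalues; determining, among the plurality of decomposition eigenvalues, the decomposition eigenvalues corresponding to the number of audio objects; determining the eigenvectors respectively corresponding to those decomposition eigenvalues, whose count equals the number of audio objects, and generating a decomposition matrix, the number of rows of the decomposition matrix being the number of audio objects and the number of columns being the number of audio clips; determining, according to the decomposition matrix, feature vectors respectively corresponding to the plurality of audio clips, each feature vector being used to indicate the corresponding audio clip; and clustering the plurality of audio clips according to the number of audio objects and the feature vectors respectively corresponding to the plurality of audio clips, to obtain the audio clips uttered by each audio object.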
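- An audio processing apparatus, the apparatus comprising: a determining module, configured to determine voiceprint vectors respectively corresponding to a plurality of audio clips, each voiceprint vector representing the voiceprint feature of the corresponding audio clip; the determining module being further configured to determine an initial similarity matrix according to the voiceprint vectors corresponding to the audio clips, the initial similarity matrix comprising the similarity between the voiceprint vectors corresponding to any two audio clips; an adjustment module, configured to adjust the initial similarity matrix according to a dynamic threshold corresponding to each row of the initial similarity matrix to obtain a reference similarity matrix, the dynamic threshold being used to adjust the similarity differences between different similarities; the determining module being further configured to determine, according to the reference similarity matrix, the number of audio objects whose utterances produced the plurality of audio clips; and a clustering module, configured to cluster the plurality of audio clips according to the number of audio objects to obtain the audio clips uttered by each audio object.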
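- The apparatus according to claim 15, wherein the determining module is further configured to: for any row of the initial similarity matrix, sort, in a first order, the similarities in that row that fall within a preset similarity range to obtain a first sorting result; determine the similarity difference between each two adjacent similarities in the first sorting result to obtain a plurality of similarity differences; determine, among the plurality of similarity differences, a similarity difference that satisfies a first requirement; and determine the dynamic threshold corresponding to that row according to the similarity difference that satisfies the first requirement.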
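- The apparatus according to claim 15 or 16, wherein the adjustment module is configured to: adjust, among the similarities in the k-th row of the initial similarity matrix, each similarity smaller than the dynamic threshold corresponding to the k-th row to a first value, and obtain the reference similarity matrix based on the adjustment results of the rows, k being a positive integer; or multiply, among the similarities in the k-th row of the initial similarity matrix, each similarity smaller than the dynamic threshold corresponding to the k-th row by a second value, and obtain the reference similarity matrix based on the adjustment results of the rows.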
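- A computer device, comprising a processor and a memory, the memory storing at least one piece of program code, the at least one piece of program code being loaded and executed by the processor to cause the computer device to implement the audio processing method according to any one of claims 1 to 14.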
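- A computer-readable storage medium, storing at least one piece of program code, the at least one piece of program code being loaded and executed by a processor to cause a computer to implement the audio processing method according to any one of claims 1 to 14.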
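- A computer program product, comprising a computer program or instructions which, when executed by a processor, implement the audio processing method according to any one of claims 1 to 14.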
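The per-parameter matrix processing recited in the method claims (numerical adjustment, symmetrization, row-column diffusion, proportional adjustment, second symmetrization) and the eigenvalue-gap step can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the third value is taken to be 0, the "third requirement" is read as "the largest entries of the row", the first symmetrization uses the element-wise maximum and the second uses the average, all eigenvalues are used for the gaps, the ratio value is assumed to be the reference parameter divided by the normalized gap, and the object count is assumed to be the position of the largest gap. Eigenvalues are passed in directly, so no decomposition routine is shown. All names are hypothetical.

```python
def keep_top_p(matrix, p, third_value=0.0):
    """Numerical adjustment: keep the p largest entries per row (assumed reading)."""
    pruned = []
    for row in matrix:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:p])
        pruned.append([a if j in keep else third_value for j, a in enumerate(row)])
    return pruned


def symmetrize_max(m):
    """Symmetrization by element-wise maximum of the matrix and its transpose."""
    n = len(m)
    return [[max(m[i][j], m[j][i]) for j in range(n)] for i in range(n)]


def symmetrize_mean(m):
    """Symmetrization by adding the transpose and halving."""
    n = len(m)
    return [[(m[i][j] + m[j][i]) / 2 for j in range(n)] for i in range(n)]


def diffuse(c):
    """Row-column diffusion D = C C^T: entry (r, s) is the dot product of rows r and s."""
    size = len(c)
    return [[sum(c[r][k] * c[s][k] for k in range(size)) for s in range(size)]
            for r in range(size)]


def row_max_normalize(d):
    """Proportional adjustment: divide each row by its maximum (assumed positive)."""
    normalized = []
    for row in d:
        top = max(row)
        normalized.append([x / top for x in row])
    return normalized


def refine(matrix, p):
    """Chain the five steps for one reference parameter p."""
    first = keep_top_p(matrix, p)
    second = symmetrize_max(first)
    third = diffuse(second)
    fourth = row_max_normalize(third)
    return symmetrize_mean(fourth)


def max_gap_and_count(eigenvalues):
    """Largest adjacent gap, normalized by the largest eigenvalue, and assumed count.

    Assumes at least two eigenvalues and a positive largest eigenvalue.
    """
    ordered = sorted(eigenvalues)
    gaps = [ordered[k + 1] - ordered[k] for k in range(len(ordered) - 1)]
    k = max(range(len(gaps)), key=lambda idx: gaps[idx])
    return gaps[k] / ordered[-1], k + 1


def ratio_value(p, normalized_gap):
    """Assumed form of the ratio value: reference parameter over normalized gap."""
    return p / normalized_gap


A = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
refined = refine(A, p=2)
gap, count = max_gap_and_count([0.0, 0.01, 1.5, 1.6])
```

Under these assumptions, a caller would run `refine` once per candidate reference parameter, compute `ratio_value` for each, and take the count associated with the parameter whose ratio value is smallest.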
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23862178.3A EP4586105A4 (en) | 2022-09-07 | 2023-08-21 | AUDIO PROCESSING METHOD AND APPARATUS, DEVICE, READABLE STORAGE MEDIA AND PROGRAMMED PRODUCT |
| US18/583,688 US20240242722A1 (en) | 2022-09-07 | 2024-02-21 | Audio processing method and apparatus, device, readable storage medium, and program product |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211088204.6 | 2022-09-07 | ||
| CN202211088204.6A CN115168643B (zh) | 2022-09-07 | 2022-09-07 | Audio processing method, apparatus, device, and computer-readable storage medium |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/583,688 Continuation US20240242722A1 (en) | 2022-09-07 | 2024-02-21 | Audio processing method and apparatus, device, readable storage medium, and program product |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024051481A1 (zh) | 2024-03-14 |
Family
ID=83480765
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/114040 Ceased WO2024051481A1 (zh) | 2022-09-07 | 2023-08-21 | Audio processing method, apparatus, device, readable storage medium, and program product |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20240242722A1 (zh) |
| EP (1) | EP4586105A4 (zh) |
| CN (1) | CN115168643B (zh) |
| WO (1) | WO2024051481A1 (zh) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
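| CN115168643B (zh) * | 2022-09-07 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Audio processing method, apparatus, device, and computer-readable storage medium |
| CN115602178A (zh) * | 2022-10-18 | 2023-01-13 | 锐迪科微电子科技(上海)有限公司(Cn) | Voiceprint recognition method and apparatus, computer-readable storage medium, and terminal |
| CN115861889A (zh) * | 2022-12-14 | 2023-03-28 | 微梦创科网络科技(中国)有限公司 | Method for determining video similarity, training method, apparatus, and storage medium |
| CN116522240B (zh) * | 2023-04-27 | 2026-03-24 | 电子科技大学 | Adaptive-threshold-based open-set method for identifying individual radiation sources |
| CN119964596B (zh) * | 2025-01-17 | 2025-11-21 | 湖南大学 | Speaker diarization method based on audio-visual fusion clustering |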
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
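| WO2021072893A1 (zh) * | 2019-10-18 | 2021-04-22 | 平安科技(深圳)有限公司 | Voiceprint clustering method and apparatus, processing device, and computer storage medium |
| CN113327628A (zh) * | 2021-05-27 | 2021-08-31 | 北京字节跳动网络技术有限公司 | Audio processing method and apparatus, readable medium, and electronic device |
| CN114446284A (zh) * | 2022-02-10 | 2022-05-06 | 上海喜马拉雅科技有限公司 | Speaker diarization method and apparatus, computer device, and readable storage medium |
| CN114822558A (zh) * | 2022-04-15 | 2022-07-29 | 马上消费金融股份有限公司 | Voiceprint recognition method and apparatus, electronic device, and storage medium |
| CN115168643A (zh) * | 2022-09-07 | 2022-10-11 | 腾讯科技(深圳)有限公司 | Audio processing method, apparatus, device, and computer-readable storage medium |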
Family Cites Families (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
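| CN108040032A (zh) * | 2017-11-02 | 2018-05-15 | 阿里巴巴集团控股有限公司 | Voiceprint authentication method, account registration method, and apparatus |
| CN108908377B (zh) * | 2018-07-06 | 2020-06-23 | 达闼科技(北京)有限公司 | Speaker recognition method, apparatus, and robot |
| CN109360572B (zh) * | 2018-11-13 | 2022-03-11 | 平安科技(深圳)有限公司 | Call separation method and apparatus, computer device, and storage medium |
| CN110337030B (zh) * | 2019-08-08 | 2020-08-11 | 腾讯科技(深圳)有限公司 | Video playback method and apparatus, terminal, and computer-readable storage medium |
| CN111866607B (zh) * | 2020-07-30 | 2022-03-11 | 腾讯科技(深圳)有限公司 | Video clip localization method and apparatus, computer device, and storage medium |
| CN112133319B (zh) * | 2020-08-31 | 2024-09-06 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio generation method, apparatus, device, and storage medium |
| CN111933115B (zh) * | 2020-10-12 | 2021-02-09 | 腾讯科技(深圳)有限公司 | Speech recognition method, apparatus, device, and storage medium |
| CN114792522B (zh) * | 2021-01-26 | 2026-01-02 | 阿里巴巴集团控股有限公司 | Audio signal processing, meeting recording and presentation methods, device, system, and medium |
| CN113724739B (zh) * | 2021-09-01 | 2024-06-11 | 腾讯音乐娱乐科技(深圳)有限公司 | Method for retrieving audio and training an acoustic model, terminal, and storage medium |
| CN114512135B (zh) * | 2022-01-17 | 2025-02-07 | 马上消费金融股份有限公司 | Voiceprint clustering method, voiceprint recognition method, apparatus, and electronic device |
| CN114937462A (zh) * | 2022-05-17 | 2022-08-23 | 国网黑龙江省电力有限公司佳木斯供电公司 | Fault detection method for high-voltage circuit breakers based on intelligent voiceprint diagnosis |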
- 2022-09-07: CN application CN202211088204.6A, publication CN115168643B (zh), status Active
- 2023-08-21: EP application EP23862178.3A, publication EP4586105A4 (en), status Pending
- 2023-08-21: WO application PCT/CN2023/114040, publication WO2024051481A1 (zh), status Ceased
- 2024-02-21: US application US18/583,688, publication US20240242722A1 (en), status Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4586105A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115168643B (zh) | 2023-04-07 |
| CN115168643A (zh) | 2022-10-11 |
| US20240242722A1 (en) | 2024-07-18 |
| EP4586105A1 (en) | 2025-07-16 |
| EP4586105A4 (en) | 2025-10-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
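| WO2024051481A1 (zh) | | Audio processing method, apparatus, device, readable storage medium, and program product |
| CN109299315B (zh) | | Multimedia resource classification method and apparatus, computer device, and storage medium |
| CN110544272B (zh) | | Face tracking method and apparatus, computer device, and storage medium |
| CN110162604B (zh) | | Sentence generation method, apparatus, device, and storage medium |
| CN110807325B (zh) | | Predicate recognition method, apparatus, and storage medium |
| CN113763933B (zh) | | Speech recognition method, training method for speech recognition model, apparatus, and device |
| JP7324838B2 (ja) | | Encoding method, and apparatus, device, and computer program therefor |
| CN111324699B (zh) | | Semantic matching method and apparatus, electronic device, and storage medium |
| CN111581958A (zh) | | Dialogue state determination method and apparatus, computer device, and storage medium |
| CN113569042B (zh) | | Text information classification method and apparatus, computer device, and storage medium |
| CN111753498B (zh) | | Text processing method, apparatus, device, and storage medium |
| CN114996515A (zh) | | Training method for video feature extraction model, text generation method, and apparatus |
| CN114462580A (zh) | | Training method for text recognition model, text recognition method, apparatus, and device |
| CN114299306B (zh) | | Method for obtaining image retrieval model, image retrieval method, apparatus, and device |
| CN117011571A (zh) | | Training method, apparatus, and device for image classification model |
| CN111737415B (zh) | | Entity relation extraction method, method for obtaining entity relation learning model, and device |
| CN114328815A (zh) | | Processing method and apparatus for text mapping model, computer device, and storage medium |
| CN114281937A (zh) | | Training method for nested entity recognition model, nested entity recognition method, and apparatus |
| CN111597823A (zh) | | Head word extraction method, apparatus, device, and storage medium |
| HK40076008B (zh) | | Audio processing method, apparatus, device, and computer-readable storage medium |
| HK40076008A (zh) | | Audio processing method, apparatus, device, and computer-readable storage medium |
| CN114462540B (zh) | | Training method for clustering model, clustering method, apparatus, device, and storage medium |
| CN116959057A (zh) | | Facial image processing method, apparatus, device, and storage medium |
| HK40067587B (zh) | | Training method for feature extraction model, image processing method, apparatus, and device |
| HK40037362B (zh) | | Method and apparatus for determining hypernym-hyponym relations between words, computer device, and medium |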
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23862178 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023862178 Country of ref document: EP |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| ENP | Entry into the national phase |
Ref document number: 2023862178 Country of ref document: EP Effective date: 20250407 |
|
| WWP | Wipo information: published in national office |
Ref document number: 2023862178 Country of ref document: EP |