WO2024200071A1 - Apparatuses and methods for controlling a sound playback of a headphone - Google Patents

Apparatuses and methods for controlling a sound playback of a headphone

Info

Publication number
WO2024200071A1
Authority
WO
WIPO (PCT)
Prior art keywords
headphone
user
machine learning
controller
learning processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2024/057153
Other languages
English (en)
Inventor
Lev Markhasin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Europe BV United Kingdom Branch
Sony Semiconductor Solutions Corp
Original Assignee
Sony Europe BV United Kingdom Branch
Sony Semiconductor Solutions Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Europe BV United Kingdom Branch, Sony Semiconductor Solutions Corp
Publication of WO2024200071A1
Legal status: Ceased

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1041 Mechanical or electronic switches, or control elements
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00 Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • G10K11/178 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
    • G10K11/1787 General system configurations
    • G10K11/17885 General system configurations additionally using a desired external signal, e.g. pass-through audio such as music or speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/81 Detection of presence or absence of voice signals for discriminating voice from music
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1083 Reduction of ambient noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K2210/00 Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
    • G10K2210/10 Applications
    • G10K2210/108 Communication systems, e.g. where useful sound is kept and noise is cancelled
    • G10K2210/1081 Earphones, e.g. for telephones, ear protectors or headsets
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Definitions

  • The present disclosure generally relates to headphones and, more particularly, to apparatuses and methods for controlling a sound playback of a headphone depending on a detected sound of vocal expression of a user of the headphone.
  • Modern headphones may recognize that a user starts to speak so that the playback can be stopped.
  • Some headphones may even deactivate noise cancelling and activate a mode in which surrounding sounds are enhanced. This may enable the user to have a conversation without stopping music or taking off the headset. Often users make sounds by humming or whistling along to a melody or singing along to a song. When the headphones stop playback, the user cannot continue enjoying the music. A similar situation arises when the user wants to practice singing or beatboxing to a playback played via the headphones, and the playback stops as soon as the user starts to sing. Some headphones might offer to deactivate the automatic “playback off” functionality, but this may be annoying to the user, who may wish to do as little as possible to switch between modes.
  • The present disclosure provides a headphone which comprises at least one microphone configured to capture a vocal expression of a user of the headphone.
  • The headphone further comprises a trained machine learning processor configured to classify the user’s captured vocal expression into one of a plurality of different categories of vocal expressions.
  • The headphone also comprises a controller configured to control sound playback of the headphone based on the classification result of the machine learning processor.
  • The present disclosure provides a method for controlling a sound playback of a headphone.
  • The method includes capturing a vocal expression of a user of the headphone via at least one built-in microphone, classifying, by a built-in trained machine learning processor, the user’s captured vocal expression into one of a plurality of different categories of vocal expressions, and controlling the sound playback of the headphone based on the classification result of the machine learning processor.
  • The present disclosure proposes headphones with an extended functionality where the headphones not only detect sounds that the user makes but also recognize the type/category of sound.
  • The proposed headphones may then detect when the user sings along or hums and therefore do not stop playback or deactivate noise cancelling.
  • Fig. 1 schematically illustrates a headphone in accordance with the present disclosure
  • Fig. 2 shows a flowchart of a method for controlling a sound playback of a headphone according to a first embodiment
  • Fig. 3 shows a flowchart of a method for controlling a sound playback of a headphone according to a second embodiment
  • Fig. 4 shows a flowchart of a method for controlling a sound playback of a headphone according to a further embodiment.
  • Fig. 1 schematically illustrates a headphone 100 in accordance with the present disclosure.
  • The headphone 100, which may also be referred to as an earphone or headset, is a personal audio device which may be worn over the ears or in the ear canal, depending on the implementation.
  • Headphone 100 is shown as an over-ear headphone, which is a type of headphone that is designed to completely cover the ears.
  • Ear cups 102 of over-ear headphone 100 may be large and padded, and may be connected to a headband 104 that sits over the top of the head.
  • Over-ear headphones may provide a high level of sound isolation, as the ear cups 102 create a seal around the ears that blocks out external noise.
  • Headphone 100 may also be implemented as an in-ear headphone, which may also be referred to as earbuds or in-ear monitors.
  • An in-ear headphone is a type of headphone that is designed to fit directly into the ear canal. Unlike over-ear headphones, in-ear headphones are much smaller and more compact, and they do not have a headband. In-ear headphones are typically made up of a small driver or speaker that sits inside the ear canal. In general, headphones can come in different shapes and sizes, and some models include features such as noise isolation or noise cancellation, which can help block out external sounds for a more immersive listening experience. They are commonly used for listening to music, making phone calls, and for other forms of audio entertainment. Headphone 100 may be a wireless or a wired headphone.
  • The headphone 100 in accordance with the present disclosure comprises one or more microphones 110 which are configured to capture a vocal expression of a user of the headphone 100.
  • Vocal expression refers to the various ways in which a person may communicate emotions, feelings, and attitudes through their voice.
  • Vocal expressions may include speaking, singing, humming, whistling, beatboxing, etc.
  • The microphone 110 may be built into the headphone 100 to capture audio or sound waves from the user's voice.
  • The microphone 110 can be located in different places depending on the design of the headphone 100.
  • The microphone 110 may be located at the ear cups 102 or on a cable that connects to an audio source (e.g., smartphone, tablet, etc.), or it may be located on an adjustable boom arm.
  • Headphone 100 further comprises a trained machine learning processor 120 which is configured to classify the user’s captured vocal expression into one of a plurality of different categories of vocal expressions.
  • The trained machine learning processor 120 implements a machine learning algorithm that takes the user’s vocal expression or sound as input and analyzes it.
  • An audio sample captured by microphone 110 may be input into the machine learning processor 120, and the machine learning processor 120 may output a probability score for each category indicating the likelihood that the audio sample belongs to that category of vocal expression. The category with the highest probability score may be chosen as the classification for the audio sample.
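The selection rule described above (one probability score per category, highest score wins) can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the category list, the function names `softmax` and `classify`, and the use of a softmax to turn raw model scores into probabilities are assumptions made for the example.

```python
import math

# Hypothetical category labels, taken from the examples named in the text.
CATEGORIES = ["speaking", "humming", "singing", "whistling", "beatboxing"]

def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, categories=CATEGORIES):
    """Return (best_category, probabilities) for one audio sample."""
    if len(scores) != len(categories):
        raise ValueError("one score per category is required")
    probs = softmax(scores)
    # The category with the highest probability score is the classification.
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], dict(zip(categories, probs))
```

Calling `classify([0.2, 2.5, 1.1, -0.3, -1.0])` selects the second category because it carries the largest score; the raw scores here are made-up values standing in for the output of a trained model.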
  • The machine learning processor 120 may be configured to classify the user’s captured vocal expression into one of speaking, humming, singing, whistling, beatboxing, etc.
  • Several machine learning algorithms can be used in machine learning processor 120 for classifying the user’s captured vocal expression into different categories, including:
  • Convolutional Neural Networks (CNNs): CNNs are a type of deep neural network that is well-suited for processing images and other types of multidimensional data, such as sound waves. CNNs are commonly used for sound classification tasks, such as identifying musical genres or detecting anomalies in audio signals.
  • Recurrent Neural Networks (RNNs): RNNs are a type of neural network that can be used for processing sequential data, such as audio signals. RNNs are commonly used for speech recognition and other types of natural language processing tasks that involve working with audio data.
  • Support Vector Machines (SVMs): SVMs are a type of supervised learning algorithm that can be used for classification tasks. SVMs are often used for sound classification tasks, such as detecting specific sound events or identifying the source of a sound.
  • Random Forests: Random forests are an ensemble learning method that combines multiple decision trees to make more accurate predictions. Random forests are often used for audio classification tasks, such as detecting environmental sounds or identifying bird songs.
  • Gaussian Mixture Models (GMMs): GMMs are a type of statistical model that can be used for audio classification tasks. GMMs are often used for speaker recognition and other types of audio signal processing tasks that involve working with complex audio data.
  • The trained machine learning processor 120 may be configured for humming and/or singing voice detection based on a Bidirectional Long Short-Term Memory (BLSTM) Recurrent Neural Network (RNN).
  • This classifier is able to take past and future temporal context into account to decide on the presence/absence of singing voice, thus using the inherent sequential aspect of a short-term feature extraction in a piece of sound/music.
  • The BLSTM-RNN may contain several hidden layers, so it is able to extract a simple representation fitted to the task from low-level features.
  • Such a BLSTM for singing voice detection is described in Simon Leglaive et al.: “Singing Voice Detection with Deep Recurrent Neural Networks”.
  • The machine learning (ML) processor 120 can be trained using a technique called supervised learning.
  • In supervised learning, the ML algorithm is trained on a labeled dataset, where each sound is associated with a specific class or label.
  • The process for training a sound classification algorithm may involve the following steps:
  • Data collection: Collect a large dataset of sound recordings, with each recording labeled with the correct class or label.
  • Feature extraction: Extract relevant features from each sound recording, such as frequency, amplitude, and duration.
  • Data preprocessing: Normalize the extracted features and prepare the data for training by splitting it into training and validation sets.
  • Model selection: Select an appropriate machine learning model, such as a convolutional neural network (CNN) or a support vector machine (SVM), that is capable of classifying sounds based on their features.
  • Training: Train the model using the labeled sound dataset and the extracted features.
  • Validation: Evaluate the performance of the trained model on the validation set to ensure that it is accurately classifying sounds.
  • Testing: Test the performance of the trained model on a new dataset of sound recordings to evaluate its ability to generalize to new data.
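The training steps listed above can be walked through in a toy form. The sketch below is illustrative only and rests on several assumptions: the data is synthetic (a pure tone stands in for "singing", low-level noise for "speaking"), the two features (RMS amplitude and zero-crossing rate) are chosen for simplicity, and a nearest-centroid classifier is swapped in for the CNN or SVM named in the text so that the example runs with the Python standard library alone. The final testing step on a separate dataset is left out.

```python
import math
import random

def extract_features(samples):
    """Feature extraction: RMS amplitude and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return [rms, crossings / (len(samples) - 1)]

def normalize(rows):
    """Preprocessing: scale each feature column to zero mean, unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def train_centroids(features, labels):
    """Training: average the feature vectors of each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, f):
    """Assign the class whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda y: math.dist(centroids[y], f))

def synth(label, rng, n=400):
    """Synthetic placeholder recordings (assumption, not real audio)."""
    if label == "singing":
        freq = rng.uniform(0.01, 0.03)  # cycles per sample
        return [math.sin(2 * math.pi * freq * t) for t in range(n)]
    return [rng.uniform(-0.3, 0.3) for _ in range(n)]

def run_pipeline(seed=0):
    rng = random.Random(seed)
    labels = ["singing", "speaking"] * 40              # data collection
    feats = normalize([extract_features(synth(y, rng)) for y in labels])
    data = list(zip(feats, labels))
    rng.shuffle(data)
    train, val = data[:60], data[60:]                  # train/validation split
    model = train_centroids(*zip(*train))              # training
    correct = sum(predict(model, f) == y for f, y in val)
    return correct / len(val)                          # validation accuracy
```

`run_pipeline()` returns the fraction of validation recordings that were classified correctly. With real recordings, the synthetic generator would be replaced by labeled audio and the centroid model by whichever model the model-selection step picks.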
  • The headphone 100 further comprises a controller (control logic) 130 which is configured to automatically control sound playback of the headphone 100 based on the classification result of the machine learning processor 120, i.e., based on the determined category of the user’s vocal expression. Sound or music playback of headphone 100 may be automatically controlled differently depending on whether speaking, humming, whistling, beatboxing, or singing was detected as the user’s captured vocal expression by the trained machine learning processor 120.
  • The controller 130 may be a control processor built into the headphone 100. Here, “automatically” may be understood as without manual interaction of the user.
  • The headphone’s controller 130 may be configured to automatically restrict the headphone’s sound or music playback in case the classification result of the machine learning processor 120 indicates “speaking” as the detected category of vocal expression. For example, the sound or music playback may automatically be continued with decreased volume in case it is detected that the user is speaking. As another example, the sound or music playback may automatically be muted, stopped, or interrupted in case it is detected that the user is speaking.
  • The controller 130 may be configured to automatically deactivate an active noise cancelling (ANC) function of the headphone 100.
  • Headphone 100 may comprise and use additional microphones to detect external noise, and then generate a sound wave that is 180° out of phase with the external noise. When the two sound waves meet, they cancel each other out, effectively reducing or eliminating the background noise. The result is a quieter environment that allows the user to hear audio more clearly, even at lower volume levels.
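The anti-phase principle behind ANC can be illustrated numerically. The sketch below is an idealized assumption-laden example: the noise is a made-up 100 Hz hum sampled at 8 kHz, the anti-noise is obtained by simple sign inversion, and real-world effects such as processing latency and the acoustic path are ignored. The second case assumes a 10 % gain error to show why cancellation is usually incomplete in practice.

```python
import math

def anti_phase(noise):
    """Return the noise shifted by 180 degrees, i.e. sign-inverted."""
    return [-x for x in noise]

def residual(noise, anti_noise):
    """Superpose both waves sample by sample, as happens at the ear."""
    return [a + b for a, b in zip(noise, anti_noise)]

def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

# Hypothetical external noise: a 100 Hz hum sampled at 8 kHz.
noise = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(800)]

# Ideal case: the inverted wave cancels the noise completely.
perfect = residual(noise, anti_phase(noise))

# Assumed 10 % gain error in the anti-noise: some noise remains.
imperfect = residual(noise, [0.9 * x for x in anti_phase(noise)])
```

Comparing `rms(perfect)` and `rms(imperfect)` with `rms(noise)` shows complete cancellation in the ideal case and a residual of about one tenth of the original level under the assumed gain error.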
  • The controller 130 may be configured to automatically enhance an active hear-through function of the headphone 100.
  • An active hear-through function of headphone 100 may automatically be activated in case the classification result of the machine learning processor 120 indicates “speaking” as the category of vocal expression.
  • Hear-through refers to a feature that allows the user to hear ambient sounds from their surroundings while still wearing the headphones 100. This feature is also sometimes referred to as ambient sound mode, transparency mode, or pass-through mode. Hear-through may be achieved by using external microphones on the headphones 100 to pick up sound from the user's environment, and then playing that sound through the headphones 100. This allows the user to hear important sounds like traffic, announcements, or conversations.
  • In case the classification result of the machine learning processor 120 indicates humming and/or singing, the headphone’s controller 130 may be configured to continue the headphone’s audio or music playback. In some implementations, it is conceivable that the headphone’s audio or music playback is continued with automatically decreased volume such that the user can better listen to his/her humming or singing. Additionally or alternatively, the controller 130 may be configured to automatically deactivate the headphone’s ANC function such that the user can better listen to his/her humming or singing. Additionally or alternatively, the controller 130 may be configured to automatically enhance a hear-through function of the headphone 100 such that the user can better listen to his/her humming or singing.
  • Fig. 2: A user 200 wearing headphone 100 utters sound (vocal expression) 202.
  • This vocal expression 202 of the user 200 is captured by built-in microphone 110 and input to built-in machine learning processor 120.
  • Machine learning processor 120 classifies captured vocal expression 202 into one of two categories: i) speaking, ii) humming or singing along (could also include whistling and/or beatboxing). If the category is i) speaking, controller 130 stops playback at 230. Optionally, controller 130 deactivates ANC and enhances surrounding sounds such that user 200 can better listen to conversation partners. If the category is ii) humming or singing along, controller 130 continues playback at 240. Optionally, controller 130 deactivates ANC and/or enhances surrounding sounds such that user 200 can better listen to him-/herself.
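The two-branch decision of Fig. 2 can be written as a small control routine. The sketch below is one possible reading of the flow, not the controller of the disclosure: the `HeadphoneState` fields, the `MELODIC` set, and the `adjust_ambient` switch (modelling the "optionally" in the text) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class HeadphoneState:
    """Hypothetical snapshot of the settings the controller may change."""
    playing: bool = True
    anc_on: bool = True
    hear_through_on: bool = False

# Category ii) of Fig. 2; whistling and beatboxing are optional members.
MELODIC = {"humming", "singing", "whistling", "beatboxing"}

def control_playback(category, state, adjust_ambient=True):
    """Update the state according to the classified vocal expression."""
    if category == "speaking":
        state.playing = False      # step 230: stop playback
    elif category in MELODIC:
        state.playing = True       # step 240: continue playback
    else:
        raise ValueError(f"unknown category: {category!r}")
    if adjust_ambient:
        # Optional in both branches: deactivate ANC, enhance surroundings.
        state.anc_on = False
        state.hear_through_on = True
    return state
```

For instance, `control_playback("speaking", HeadphoneState())` yields a state with playback stopped, ANC off, and hear-through on, while `control_playback("humming", HeadphoneState(), adjust_ambient=False)` leaves every setting untouched.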
  • Controller 130 or associated processing circuitry may be configured to automatically merge the user’s vocal expression (user’s captured voice) into the headphone’s sound or audio playback (e.g., music) in case the classification result of the machine learning processor 120 indicates a melodic vocal expression such as humming and/or singing. In this way, the user can better hear him-/herself humming or singing.
  • Audio mixing involves combining multiple audio signals (e.g., the user’s voice and the music playback) into a single output signal (merged playback signal), which can be played in real-time.
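A minimal sketch of audio mixing, assuming both signals are available as equally long sequences of floating-point samples in the range [-1.0, 1.0]. The function name `mix`, the default gains, and the hard clipping are choices made for this example; a real-time implementation would process short blocks of samples as they arrive.

```python
def mix(voice, music, voice_gain=0.6, music_gain=0.6):
    """Combine two sample sequences into one merged playback signal."""
    if len(voice) != len(music):
        raise ValueError("signals must have the same length")
    mixed = []
    for v, m in zip(voice, music):
        s = voice_gain * v + music_gain * m   # weighted sum of both sources
        mixed.append(max(-1.0, min(1.0, s)))  # hard clip to the valid range
    return mixed
```

Lowering `music_gain` relative to `voice_gain` would make the user's own voice more prominent in the merged signal.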
  • The controller 130 may be configured to activate an ANC function of the headphone 100 such that the merged sound is not disturbed by background noise and the user can better listen to him-/herself.
  • The controller 130 or associated processing circuitry may be configured to separate the user’s captured vocal expression from surrounding or background sounds (singing voice separation) and merge the user’s separated vocal expression into the (music) playback.
  • Singing voice separation may be more challenging than speech separation because singing involves a wider range of pitch and tone variations.
  • An example approach to singing voice separation may be to use a variant of a neural network called a deep recurrent neural network (DRNN) or a long short-term memory (LSTM) network. These networks can capture the temporal dynamics of singing or humming by modeling long-term dependencies in the input signal.
  • The neural network may be trained using a dataset of mixed audio signals and their corresponding target singing voice signals. Once trained, the neural network can be used to separate the user's singing voice from background sounds in real-time.
  • The person skilled in the art will appreciate that the accuracy of singing voice separation depends on the complexity of the surrounding sound and the quality of the input audio signal captured by microphone 110.
  • Another example approach is DBnet, described in Ali Aroudi et al.: “DBnet: DOA-Driven Beamforming Network for end-to-end farfield sound source separation”, where DOA stands for direction of arrival.
  • DOA estimation may require more than one microphone 110 in the headphone 100.
  • DBnet may be implemented as a convolutional-recurrent structure and trained using loss functions that are based on the distances between the separated speech signals and the target speech signals, without a need for the ground-truth DOAs of speakers.
  • End-to-end extensions of DBnet may be used which incorporate post masking networks.
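To make concrete why direction-of-arrival estimation needs more than one microphone, the following sketch estimates the arrival-time difference between two microphone signals by cross-correlation and converts it into an angle. This is a classical textbook technique swapped in for illustration; it is not DBnet, which learns the estimation end to end. The function names, the far-field assumption, the microphone spacing, and the speed of sound of 343 m/s are assumptions of this example.

```python
import math

def estimate_delay(mic_a, mic_b, max_lag):
    """Return the lag (in samples) at which mic_b best matches mic_a.

    A positive lag means the sound reached mic_b later than mic_a.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(mic_a)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += mic_a[i] * mic_b[j]  # cross-correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_to_angle(lag, sample_rate, mic_distance, speed_of_sound=343.0):
    """Convert a sample delay into a far-field arrival angle in degrees."""
    x = lag * speed_of_sound / (sample_rate * mic_distance)
    # Clamp against rounding so that asin stays within its domain.
    return math.degrees(math.asin(max(-1.0, min(1.0, x))))
```

With a single microphone there is no second signal to correlate against, so no delay and hence no direction can be derived in this way.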
  • The separation of the user’s sound from surrounding signals is illustrated in Fig. 3.
  • The user 200 wearing headphone 100 utters sound (vocal expression) 202.
  • This vocal expression 202 is captured by one, preferably a plurality of, built-in microphones 110 and input to built-in machine learning processor 120.
  • Machine learning processor 120 classifies captured vocal expression 202 into one of two categories: i) speaking, ii) melodic vocal expression such as humming or singing along. If the category is i) speaking, controller 130 stops playback at 230. Optionally, controller 130 deactivates ANC and enhances surrounding sounds such that user 200 can better listen to conversation partners.
  • If the category is ii) melodic vocal expression, controller 130 causes the separation of the user’s vocal expression (voice) 202 from surrounding sounds at 340. As explained above, this may be performed via an additional machine learning network (not shown), for example.
  • Controller 130 causes the playback to continue by merging or mixing the separated vocal expression 202 into the playback (e.g., music).
  • Controller 130 need not deactivate ANC and/or enhance surrounding sounds. In other words, ANC may be kept activated.
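The Fig. 3 flow for one block of samples can be sketched as below. The example is hedged in several ways: `process_frame` and `toy_separator` are hypothetical names, stopped playback is represented as silence, the "optional" ANC deactivation in the speaking branch is always applied, and the separator is a simple noise gate that merely stands in for the trained separation network mentioned in the text.

```python
def process_frame(category, mic_frame, music_frame, separate_voice, voice_gain=0.5):
    """Return (output_frame, anc_on) for one block of samples."""
    if category == "speaking":
        # Step 230: playback stops; here ANC is also switched off.
        return [0.0] * len(music_frame), False
    # Step 340: isolate the user's voice from surrounding sounds.
    voice = separate_voice(mic_frame)
    # Merge the separated voice into the playback, clipped to [-1.0, 1.0].
    mixed = [max(-1.0, min(1.0, m + voice_gain * v))
             for m, v in zip(music_frame, voice)]
    # ANC may be kept activated in the melodic branch.
    return mixed, True

def toy_separator(mic_frame, threshold=0.1):
    """Placeholder only: a noise gate instead of a trained network."""
    return [x if abs(x) >= threshold else 0.0 for x in mic_frame]
```

Passing the separator as a parameter keeps the control flow independent of the separation method, so a neural separator could replace `toy_separator` without changing `process_frame`.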
  • Fig. 4 shows a flowchart of a method for controlling a sound playback of a headphone 100.
  • Method 400 includes capturing 410 a vocal expression of a user of the headphone 100 via at least one built-in microphone 110.
  • Method 400 further includes classifying, by a built-in trained machine learning processor 120, the user’s captured vocal expression into one of a plurality of different vocal expressions (e.g., speaking and non-speaking).
  • Method 400 further includes controlling the sound playback of the headphone based on the classification result of the machine learning processor 120.
  • Embodiments of the present disclosure provide an opportunity to the user to hum or sing along while listening to music without having the headphone’s playback stopped.
  • An example (e.g., example 1) relates to a headphone comprising at least one microphone which is configured to capture a vocal expression of a user of the headphone, a trained machine learning processor configured to classify the user’s captured vocal expression into one of a plurality of different vocal expressions, and a controller configured to control sound playback of the headphone based on the classification result of the machine learning processor.
  • The machine learning processor is configured to classify the user’s captured vocal expression into one of speaking, humming, and singing.
  • The controller is configured to restrict (e.g., stop) the sound playback in case the classification result of the machine learning processor indicates speaking as the vocal expression.
  • The controller is configured to deactivate an (active) noise cancelling function of the headphone.
  • The controller is configured to enhance an (active) hear-through function of the headphone.
  • The controller is configured to continue the sound playback in case the classification result of the machine learning processor indicates humming and/or singing as the vocal expression.
  • The controller is configured to deactivate an (active) noise cancelling function of the headphone.
  • The controller is configured to enhance an (active) hear-through function of the headphone.
  • The controller is configured to merge or mix the user’s vocal expression into the sound playback of the headphone in case the classification result of the machine learning processor indicates humming and/or singing as the vocal expression.
  • The controller is configured to activate an (active) noise cancelling function of the headphone.
  • The controller is configured to separate the user’s vocal expression from background sounds and merge or mix the user’s separated vocal expression into the sound playback.
  • An example (e.g., example 12) relates to a method for controlling a sound playback of a headphone, the method comprising capturing a vocal expression of a user of the headphone via at least one built-in microphone, classifying, by a built-in trained machine learning processor, the user’s captured vocal expression into one of a plurality of different vocal expressions, and controlling the sound playback of the headphone based on the classification result of the machine learning processor.
  • Examples may further be or relate to a (computer) program including a program code to execute one or more of the above methods when the program is executed on a computer, processor or other programmable hardware component.
  • Steps, operations or processes of different ones of the methods described above may also be executed by programmed computers, processors or other programmable hardware components.
  • Examples may also cover program storage devices, such as digital data storage media, which are machine-, processor- or computer-readable and encode and/or contain machine-executable, processor executable or computer-executable programs and instructions.
  • Program storage devices may include or be digital storage devices, magnetic storage media such as magnetic disks and magnetic tapes, hard disk drives, or optically readable digital data storage media, for example.
  • Other examples may also include computers, processors, control units, (field) programmable logic arrays ((F)PLAs), (field) programmable gate arrays ((F)PGAs), graphics processor units (GPU), application-specific integrated circuits (ASICs), integrated circuits (ICs) or system-on-a-chip (SoCs) systems programmed to execute the steps of the methods described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The present disclosure relates to a headphone comprising at least one microphone configured to capture a vocal expression of a user of the headphone, a machine learning processor configured to classify the user’s captured vocal expression into one of a plurality of different vocal expressions, and a controller configured to control sound playback of the headphone based on the classification result of the machine learning processor. The controller may be configured to continue the sound playback in case the classification result of the machine learning processor indicates humming and/or singing as the vocal expression, and to stop the sound playback in case the classification result of the machine learning processor indicates speaking as the vocal expression.
PCT/EP2024/057153 2023-03-24 2024-03-18 Apparatuses and methods for controlling a sound playback of a headphone Ceased WO2024200071A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP23163974.1 2023-03-24
EP23163974 2023-03-24

Publications (1)

Publication Number Publication Date
WO2024200071A1 (fr) 2024-10-03

Family

ID=85727163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/057153 Ceased WO2024200071A1 (fr) 2023-03-24 2024-03-18 Apparatuses and methods for controlling a sound playback of a headphone

Country Status (1)

Country Link
WO (1) WO2024200071A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150195641A1 (en) * 2014-01-06 2015-07-09 Harman International Industries, Inc. System and method for user controllable auditory environment customization
US20170318374A1 (en) * 2016-05-02 2017-11-02 Microsoft Technology Licensing, Llc Headset, an apparatus and a method with automatic selective voice pass-through
US20220303688A1 (en) * 2020-10-14 2022-09-22 Google Llc Activity Detection On Devices With Multi-Modal Sensing
US20220366932A1 (en) * 2021-01-14 2022-11-17 Cirrus Logic International Semiconductor Ltd. Methods and apparatus for detecting singing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ali Aroudi et al., “DBnet: DOA-Driven Beamforming Network for end-to-end farfield sound source separation”
Simon Leglaive et al., “Singing Voice Detection with Deep Recurrent Neural Networks”

Similar Documents

Publication Publication Date Title
CN114556972B (zh) System and method for assisting selective hearing
US11611840B2 (en) Three-dimensional audio systems
KR102639491B1 (ko) Personalized real-time audio processing
JP7799679B2 (ja) Systems and methods for headphone equalization and room adaptation for binaural playback in augmented reality
KR102803661B1 (ko) Filtering other speakers' voices from calls and audio messages
EP4004906A1 (fr) Per-epoch data augmentation for training acoustic models
US10536786B1 (en) Augmented environmental awareness system
EP4173310A2 (fr) Systems, apparatus, and methods for acoustic transparency
US11228828B2 (en) Alerting users to events
CN118972776A (zh) Three-dimensional audio system
US11935557B2 (en) Techniques for detecting and processing domain-specific terminology
WO2024200071A1 (fr) Apparatuses and methods for controlling a sound playback of a headphone
CN116320144B (zh) Audio playback method, electronic device, and readable storage medium
Cano et al. Selective hearing: A machine listening perspective
CN111696566B (zh) Voice processing method, apparatus, and medium
US20250048041A1 (en) Processing audio signals from unknown entities
US12587798B2 (en) Headphones with sound-enhancement and integrated self-administered hearing test
CN111696564B (zh) Voice processing method, apparatus, and medium
WO2026075674A1 (fr) Classification-based selective pass-through architecture for wearable audio components with active noise cancellation
Tran et al. Personalized Mixed Reality Audio Using Audio Classification Using Machine Learning
WO2025090963A1 (fr) Target audio source signal generation including enrollment examples and preserving directionality
Björnsson Amplified Speech in Live Theatre, What should it Sound Like?

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 24711564; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 24711564; Country of ref document: EP; Kind code of ref document: A1