{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Voice Activity Detection\n\n\nPlot the VAD in signals and remove silences.\n\nCurrently soundpy has two base functions to complete voice-activity-detection.\n\n1) `soundpy.dsp.sound_index`\n--------------------------------\n\nThis function is used in:\n\n`soundpy.feats.get_stft_clipped`, `soundpy.feats.get_samples_clipped`, \nand `soundpy.feats.plot_vad`\n\n\nThis form of VAD uses the energy in the signal to identify when sounds start and \nend, relative to the beginning and end of the entire sample. \n(It does not identify silences between sounds, as of yet.)\n\nStrength\n~~~~~~~~\n\nThis is quite reliable across noise and speaker variety, especially when combined with \nthe Wiener filter. It also catches a significant portion of the speech signal\nthat is identified.\n\nWeakness \n~~~~~~~~\n\nThis is less sensitive to certain speech sounds such as fricatives (s, f, h, etc.), causing it to miss speech activity consisting primarily of these sounds.\n\n\n2) `soundpy.dsp.vad`\n------------------------\n\nThis function is used in:\n\n`soundpy.feats.get_vad_stft`, `soundpy.feats.get_vad_samples`,\nand `soundpy.feats.plot_vad`\n\nThis function (pulling from research) utilizes energy, frequency, and spectral flatness, \nwhich makes it less finicky when it comes to speech sounds (fricative vs plosive speech sounds). 
\nHowever, it is sometimes not sensitive enough to detect speech at all, and \nwhen it does, it tends to capture less of the speech signal.\n\nStrength \n~~~~~~~~\n\nThis examines speech / sound activity throughout the signal, not just where it starts and ends.\nIt is also more sensitive to a variety of speech sounds, not just those with high energy.\n\nWeakness\n~~~~~~~~\n\nWith certain speakers or background sounds, the VAD becomes more or less sensitive in ways that are \ndifficult to predict.\n\nNote\n---- \n\nThese may be used together and / or with a Wiener filter to balance out the strengths and \nweaknesses of each. One can also apply an `extend_window_ms` parameter to broaden\nthe identified VAD sections.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os, sys\nimport inspect\ncurrentdir = os.path.dirname(os.path.abspath(\n inspect.getfile(inspect.currentframe())))\nparentdir = os.path.dirname(currentdir)\nparparentdir = os.path.dirname(parentdir)\npackagedir = os.path.dirname(parparentdir)\nsys.path.insert(0, packagedir)\n\nimport soundpy as sp \nimport numpy as np\nimport IPython.display as ipd\n\npackage_dir = '../../../'\nos.chdir(package_dir)\nsp_dir = package_dir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load sample speech audio\n------------------------\nWe will look at how these two options handle two different speech samples. \nThe speech samples will be combined but separated by a silence. 
\nThey will also be altered with white noise.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Python\"\n~~~~~~~~\nNote: this file is available in the soundpy repo.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# VAD and filtering work best with high sample rates \nsr = 48000\npython = '{}audiodata/python.wav'.format(sp_dir)\ny_p, sr = sp.loadsound(python, sr=sr)\nipd.Audio(y_p, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"six\"\n~~~~~\nThis is a sample file from the speech commands dataset \n(Attribution 4.0 International (CC BY 4.0))\ndataset: https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html\nlicense: https://creativecommons.org/licenses/by/4.0/\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This audio contains two fricative sounds, 's' and 'x',\nwhich, as we will see, cause issues as noise increases.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "six = '{}audiodata/six.wav'.format(sp_dir)\ny_six, sr = sp.loadsound(six, sr = sr)\nipd.Audio(y_six, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine the speech samples and add noise\n----------------------------------------\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine speech signals with silence between\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nThis is to show the strengths and weaknesses of both VAD techniques.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "p_silence = np.zeros(len(y_p))\ny_p_long, snr_none = sp.dsp.add_backgroundsound(y_p, p_silence,\n sr = sr, \n snr = None,\n pad_mainsound_sec = 1,\n total_len_sec = 3,\n random_seed = 40)\ny_six_long, snr_none = sp.dsp.add_backgroundsound(y_six, p_silence,\n sr = sr, \n snr = 
None,\n pad_mainsound_sec = 1,\n total_len_sec = 3,\n random_seed = 40)\ny = np.concatenate((y_six_long, y_p_long))\nsp.feats.plot(y, sr=sr, feature_type = 'signal', subprocess=True)\nipd.Audio(y, rate=sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate white noise \n~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "white_noise = sp.dsp.generate_noise(len(y), random_seed = 40)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Speech and Noise SNR 20\n-----------------------\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_snr20, snr20 = sp.dsp.add_backgroundsound(\n y, white_noise, sr=sr, snr = 20,random_seed = 40)\n# round the measured snr:\nsnr20 = int(round(snr20))\nsnr20" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(y_snr20, sr = sr, feature_type = 'signal', \n title = 'Speech SNR {}'.format(snr20), subprocess=True)\nipd.Audio(y_snr20,rate=sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Speech and Noise SNR 5\n----------------------\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_snr05, snr05 = sp.dsp.add_backgroundsound(\n y, white_noise, sr=sr, snr = 5, random_seed = 40)\n# round the measured snr:\nsnr05 = int(round(snr05))\nsnr05" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(y_snr05, sr = sr, feature_type = 'signal', \n title = 'Speech SNR {}'.format(snr05), subprocess=True)\nipd.Audio(y_snr05,rate=sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot Voice Activity\n-------------------\nNOTE: If no VAD, yellow dots are placed at the bottom. 
\nIf VAD is detected, yellow dots are placed at the top.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set window size\n~~~~~~~~~~~~~~~\nFor increased frequency definition, a longer window is suggested.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "win_size_ms = 50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set percent overlap\n~~~~~~~~~~~~~~~~~~~\nPercent overlap is how much each consecutive window (of size `win_size_ms`) overlaps with the next.\nThese VAD functions can be reliably used with a `percent_overlap` of 0 or 0.5.\nVAD does not require overlapping windows; however, performance\ntends to be better with 0.5.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "percent_overlap = 0.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set background noise reference\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nTo measure the background noise in the signal, set the amount \nof audio at the beginning of the signal, in milliseconds, to use. 
Currently, this is \nonly relevant for `soundpy.dsp.vad`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "use_beg_ms = 120" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "VAD (SNR 20)\n------------\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1: \n~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cut off beginning and ending silences\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.feats.plot_vad(y_snr20, sr=sr, beg_end_clipped = True,\n percent_overlap = percent_overlap, \n win_size_ms = win_size_ms)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clipped_samples, vad_matrix = sp.feats.get_samples_clipped(y_snr20, sr=sr, percent_overlap = percent_overlap, \n win_size_ms = win_size_ms)\nsp.feats.plot(clipped_samples, sr=sr, feature_type = 'signal', subprocess=True)\nipd.Audio(clipped_samples, rate= sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2:\n~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check VAD through entire signal\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.feats.plot_vad(y_snr20, sr=sr, beg_end_clipped = False,\n percent_overlap = percent_overlap, \n win_size_ms = win_size_ms)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vad_samples, vad_matrix = sp.feats.get_vad_samples(\n y_snr20, sr=sr, use_beg_ms = use_beg_ms, \n percent_overlap = percent_overlap, win_size_ms = win_size_ms)\nsp.feats.plot(vad_samples, sr=sr, feature_type = 'signal', subprocess=True)\nipd.Audio(vad_samples, rate = sr)" ] }, 
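{ "cell_type": "markdown", "metadata": {}, "source": [ "Aside: the `extend_window_ms` parameter mentioned earlier broadens the sections identified as speech.\nConceptually, this amounts to dilating a frame-level boolean VAD mask. Below is a minimal sketch of \nthat idea (not soundpy's actual implementation; the function name `extend_vad_mask` is hypothetical):\n\n```python\nimport numpy as np\n\ndef extend_vad_mask(vad_mask, extend_frames):\n    # Mark `extend_frames` extra frames on each side of every detected frame.\n    extended = vad_mask.copy()\n    for i in np.flatnonzero(vad_mask):\n        start = max(0, i - extend_frames)\n        end = min(len(vad_mask), i + extend_frames + 1)\n        extended[start:end] = True\n    return extended\n\n# Extend a toy 7-frame mask by 1 frame on each side:\nmask = np.array([False, False, True, True, False, False, False])\nextend_vad_mask(mask, 1)\n```\n\nWith `win_size_ms = 50` and `percent_overlap = 0.5`, consecutive frames are about 25 ms apart,\nso `extend_window_ms = 300` corresponds to roughly 12 extra frames on each side.\n" ] }, 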
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's extend the window of VAD\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.feats.plot_vad(y_snr20, sr=sr, beg_end_clipped = False,\n extend_window_ms = 300, use_beg_ms = use_beg_ms, \n percent_overlap = percent_overlap, win_size_ms = win_size_ms)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vad_samples, vad_matrix = sp.feats.get_vad_samples(\n y_snr20, sr=sr, use_beg_ms = use_beg_ms, extend_window_ms = 300,\n percent_overlap = percent_overlap, win_size_ms = win_size_ms)\nsp.feats.plot(vad_samples, sr=sr, feature_type = 'signal', subprocess=True)\nipd.Audio(vad_samples, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "VAD (SNR 5)\n-----------\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 1: \n~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cut off beginning and ending silences\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.feats.plot_vad(y_snr05, sr=sr, beg_end_clipped = True, \n percent_overlap = percent_overlap, \n win_size_ms = win_size_ms)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clipped_samples, vad_matrix = sp.feats.get_samples_clipped(y_snr05, sr=sr, percent_overlap = percent_overlap, \n win_size_ms = win_size_ms)\nsp.feats.plot(clipped_samples, sr=sr, feature_type = 'signal', subprocess=True)\nipd.Audio(clipped_samples, rate= sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Improves with Wiener filter and padding?\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"collapsed": false }, "outputs": [], "source": [ "y_snr05_wf, sr = sp.filtersignal(\n y_snr05, sr=sr, apply_postfilter = True)\n\nsp.feats.plot_vad(y_snr05_wf, sr=sr, beg_end_clipped = True,\n percent_overlap = percent_overlap, \n win_size_ms = win_size_ms, extend_window_ms = 300)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clipped_samples, vad_matrix = sp.feats.get_samples_clipped(\n y_snr05_wf, sr=sr, percent_overlap = percent_overlap, \n win_size_ms = win_size_ms, extend_window_ms = 300)\nsp.feats.plot(clipped_samples, sr=sr, feature_type = 'signal', subprocess=True)\nipd.Audio(clipped_samples, rate= sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Option 2:\n~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check VAD through entire signal\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.feats.plot_vad(y_snr05, sr=sr, beg_end_clipped = False,\n percent_overlap = percent_overlap, win_size_ms = win_size_ms)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vad_samples, vad_matrix = sp.feats.get_vad_samples(\n y_snr05, sr=sr, use_beg_ms = use_beg_ms, \n percent_overlap = percent_overlap, win_size_ms = win_size_ms)\nsp.feats.plot(vad_samples, sr=sr, feature_type = 'signal', subprocess=True)\nipd.Audio(vad_samples, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's extend the window of VAD\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.feats.plot_vad(y_snr05, sr=sr, beg_end_clipped = False,\n extend_window_ms = 300, use_beg_ms = use_beg_ms, \n percent_overlap = percent_overlap, win_size_ms = win_size_ms)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vad_samples, vad_matrix = sp.feats.get_vad_samples(\n y_snr05, sr=sr, use_beg_ms = use_beg_ms, extend_window_ms = 300,\n percent_overlap = percent_overlap, win_size_ms = win_size_ms)\nsp.feats.plot(vad_samples, sr=sr, feature_type = 'signal', subprocess=True)\nipd.Audio(vad_samples, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Sum\n------\nWe can see from the above examples that the first option (clipping beginning \nand ending silences) works pretty well at higher SNRs and with filtering. \nIt identified pretty well when the speech began and ended.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second option (VAD throughout the signal) was perhaps better able \nto identify the existence of speech despite noise (without filtering); \nhowever, it only recognized a very small portion of it. \n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Despite these functions being a work in progress, I have found them \nto be quite useful when working with audio data for deep learning and \nother sound related projects.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 0 }