{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Augment Speech and Sound for Machine and Deep Learning\n\n\nAugment audio to expand datasets and train resilient models.\n\nTo see how SoundPy implements this, see the module `soundpy.augment`.\n\n\nNote:\n~~~~~\nConsider what type of sound you are working with when performing augmentation. Not all speech and non-speech sounds should be handled the same way. For example, you may want to augment speech differently if you are training a speech recognition model versus an emotion recognition model. Additionally, not all non-speech sounds behave the same way: compare stationary sounds (white noise) with non-stationary sounds (car horn).\n\nIn sum, awareness of how your sound data behave and which features of the sound are relevant for training models are important factors for sound data augmentation. 
\n\nBelow are a few augmentation techniques I have seen implemented in sound research; this is by no means a complete list.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import soundpy as sp\nimport IPython.display as ipd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Augmenting Speech\n^^^^^^^^^^^^^^^^^\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the path for accessing the audio data.\nNote: the speech and sound samples come with the soundpy repo.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp_dir = '../../../'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Speech sample:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "speech = '{}audiodata/python.wav'.format(sp_dir)\nspeech = sp.utils.string2pathlib(speech)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hear and see speech\n~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sr = 44100\nf, sr = sp.loadsound(speech, sr=sr)\nipd.Audio(f,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(f, sr=sr, feature_type='stft', title='Female Speech: \"Python\"', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change Speed\n~~~~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's increase the speed by 15%:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fast = sp.augment.speed_increase(f, sr=sr, perc = 0.15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, 
"outputs": [], "source": [ "ipd.Audio(fast,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(fast, sr = sr, feature_type = 'stft', \n title = 'Female speech: 15% faster',\n subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's decrease the speed by 15%:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "slow = sp.augment.speed_decrease(f, sr = sr, perc = 0.15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(slow, rate = sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(slow, sr = sr, feature_type = 'stft', \n title = 'Female speech: 15% slower', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add Noise\n~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add white noise at a signal-to-noise ratio (SNR) of 10:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "noisy = sp.augment.add_white_noise(f, sr=sr, snr = 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(noisy,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(noisy, sr=sr, feature_type='stft', \n title='Speech with white noise: 10 SNR', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Harmonic Distortion\n~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hd = sp.augment.harmonic_distortion(f, sr=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, 
"outputs": [], "source": [ "ipd.Audio(hd,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(hd, sr=sr, feature_type='stft', \n title='Speech with harmonic distortion', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pitch Shift\n~~~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pitch shift increase (2 semitones):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "psi = sp.augment.pitch_increase(f, sr=sr, num_semitones = 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(psi,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(psi, sr=sr, feature_type='stft', \n title='Speech with pitch shift increase', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pitch shift decrease (2 semitones):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "psd = sp.augment.pitch_decrease(f, sr=sr, num_semitones = 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(psd,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(psd, sr=sr, feature_type='stft', \n title='Speech with pitch shift decrease', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vocal Tract Length Perturbation\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nNote: this is still experimental.\n\nVocal tract length perturbation (by a factor between 0.8 and 1.2):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": 
false }, "outputs": [], "source": [ "vtlp_stft, a = sp.augment.vtlp(f, sr=sr, win_size_ms = 50,\n percent_overlap = 0.5,\n random_seed = 41)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to listen to this, we need to turn the stft into \nsamples:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vtlp_y = sp.feats.feats2audio(vtlp_stft, sr = sr,\n feature_type = 'stft',\n win_size_ms = 50,\n percent_overlap = 0.5)\nipd.Audio(vtlp_y,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.feats.plot(vtlp_stft, sr=sr, feature_type='stft', \n title='VTLP (factor {})'.format(a), subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vocal tract length perturbation (by factor 0.8 to 1.2)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vtlp_stft, a = sp.augment.vtlp(f, sr=sr, win_size_ms = 50,\n percent_overlap = 0.5,\n random_seed = 43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to listen to this, we need to turn the stft into \nsamples:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vtlp_y = sp.feats.feats2audio(vtlp_stft, sr = sr,\n feature_type = 'stft',\n win_size_ms = 50,\n percent_overlap = 0.5)\nipd.Audio(vtlp_y,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.feats.plot(vtlp_stft, sr=sr, feature_type='stft', \n title='VTLP (factor {})'.format(a), subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Augmenting non-speech signals\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Car horn 
sample:\nhonk = '{}audiodata/car_horn.wav'.format(sp_dir)\nhonk = sp.utils.string2pathlib(honk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hear and see sound signal \n~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "h, sr = sp.loadsound(honk, sr=sr)\nipd.Audio(h,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(h, sr=sr, feature_type='stft', \n title='Car Horn', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change Speed\n~~~~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's increase the speed by 15%:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fast = sp.augment.speed_increase(h, sr=sr, perc = 0.15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(fast,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(fast, sr=sr, feature_type='stft', \n title='Car horn: 15% faster', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's decrease the speed by 15%:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "slow = sp.augment.speed_decrease(h, sr=sr, perc = 0.15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(slow,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(slow, sr=sr, feature_type='stft', \n title='Car horn: 15% slower', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Add Noise\n~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add white noise \n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "h_noisy = sp.augment.add_white_noise(h, sr=sr, snr = 10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(h_noisy,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(h_noisy, sr=sr, feature_type='stft', \n title='Car horn with white noise (10 SNR)', \n subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Harmonic Distortion\n~~~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hd = sp.augment.harmonic_distortion(h, sr=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(hd,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(hd, sr=sr, feature_type='stft', \n title='Car horn with harmonic distortion', \n subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pitch Shift\n~~~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pitch shift increase\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "psi = sp.augment.pitch_increase(h, sr=sr, num_semitones = 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(psi,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(psi, sr=sr, feature_type='stft', \n title='Car horn 
with pitch shift increase', \n subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pitch shift decrease\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "psd = sp.augment.pitch_decrease(h, sr=sr, num_semitones = 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(psd,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(psd, sr=sr, feature_type='stft', \n title='Car horn with pitch shift decrease', \n subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time Shift\n~~~~~~~~~~\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll apply a random shift to the sound\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "h_shift = sp.augment.time_shift(h, sr=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(h_shift,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(h_shift, sr=sr, feature_type='stft', \n title='Car horn: time shifted', \n subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shuffle the Sound\n~~~~~~~~~~~~~~~~~\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "h_shuffle = sp.augment.shufflesound(h, sr=sr,\n num_subsections = 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(h_shuffle,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(h_shuffle, sr=sr, 
feature_type='stft', \n title='Car horn: shuffled', subprocess=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just for kicks, let's do the same to speech and see how \nthat influences the signal:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "f_shuffle = sp.augment.shufflesound(f, sr=sr,\n num_subsections = 5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ipd.Audio(f_shuffle,rate=sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sp.plotsound(f_shuffle, sr=sr, feature_type='stft', \n title='Speech: shuffled', subprocess=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 0 }