Augment Speech and Sound for Machine and Deep Learning

Augment audio to expand datasets and train resilient models.

To see how SoundPy implements this, see the module soundpy.augment.

Note:

Consider what type of sound you are working with before augmenting it. Not all speech and non-speech sounds should be handled the same way. For example, you may want to augment speech differently depending on whether you are training a speech recognition model or an emotion recognition model. Non-speech sounds also differ in behavior: compare a stationary sound (white noise) with a non-stationary one (a car horn).

In sum, how your sound data behave and which features of the sound are relevant for training are important factors in sound data augmentation.
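
To make the stationary versus non-stationary distinction concrete, here is a minimal sketch that uses NumPy only, not SoundPy. It compares how much the short-time energy varies from frame to frame for white noise and for a synthetic tone burst that stands in for something like a car horn. The helper name `frame_energy_variation`, the frame length, and both test signals are assumptions made for illustration.

```python
import numpy as np

def frame_energy_variation(samples, frame_len=1024):
    """Return the coefficient of variation of per-frame energy.

    Values near 0 suggest a stationary signal; larger values suggest
    that the energy changes over time (non-stationary).
    """
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return float(np.std(energy) / (np.mean(energy) + 1e-12))

rng = np.random.default_rng(0)
sr_demo = 16000                      # hypothetical sample rate for the demo
t = np.arange(sr_demo) / sr_demo     # one second of time stamps

white = rng.normal(0.0, 0.1, sr_demo)            # stationary: white noise
burst = np.sin(2 * np.pi * 440 * t) * (t > 0.5)  # silent, then a tone

cv_white = frame_energy_variation(white)
cv_burst = frame_energy_variation(burst)
print('white noise:', cv_white, 'tone burst:', cv_burst)
```

The white noise should score much lower than the burst. A measure like this is only a rough check, but it can help decide whether augmentations such as shuffling or time shifting are likely to change what a model hears.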

Below are a few augmentation techniques I have seen implemented in sound research; this is in no way a complete list of augmentation techniques.

import soundpy as sp
import IPython.display as ipd

Augmenting Speech

Designate the path for accessing the audio data. Note: the speech and sound samples come with the soundpy repo.

sp_dir = '../../../'

Speech sample:

speech = '{}audiodata/python.wav'.format(sp_dir)
speech = sp.utils.string2pathlib(speech)

Hear and see speech

sr = 44100
f, sr = sp.loadsound(speech, sr=sr)
ipd.Audio(f,rate=sr)


sp.plotsound(f, sr=sr, feature_type='stft', title='Female Speech: "Python"', subprocess=True)
Female Speech: "Python"

Out:

/home/airos/Projects/github/a-n-rose/Python-Sound-Tool/soundpy/feats.py:117: UserWarning: Due to matplotlib using AGG backend, cannot display plot. Therefore, the plot will be saved here: current working directory
  warnings.warn(msg)

Change Speed

Let’s increase the speed by 15%:

fast = sp.augment.speed_increase(f, sr=sr, perc = 0.15)
ipd.Audio(fast,rate=sr)


sp.plotsound(fast, sr = sr, feature_type = 'stft',
               title = 'Female speech: 15%  faster',
               subprocess=True)
Female speech: 15%  faster

Let’s decrease the speed by 15%:

slow = sp.augment.speed_decrease(f, sr = sr, perc = 0.15)
ipd.Audio(slow, rate = sr)


sp.plotsound(slow, sr = sr, feature_type = 'stft',
               title = 'Speech: 15%  slower', subprocess=True)
Speech: 15%  slower
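
To give a feel for what a speed change does to the samples, here is a minimal NumPy sketch. It is not how `sp.augment.speed_increase` or `sp.augment.speed_decrease` are implemented; it simply resamples the signal with linear interpolation, which also shifts the pitch (a proper time stretch would keep the pitch). The function name `change_speed_by_resampling` and the test tone are assumptions for illustration.

```python
import numpy as np

def change_speed_by_resampling(samples, perc=0.15, faster=True):
    """Naively change playback speed by linear interpolation.

    Note: unlike a time stretch, this also raises or lowers the pitch.
    """
    rate = 1.0 + perc if faster else 1.0 - perc
    n_out = int(round(len(samples) / rate))
    # Positions in the original signal at which new samples are read.
    positions = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(positions, np.arange(len(samples)), samples)

sr_demo = 16000                       # hypothetical sample rate
t = np.arange(sr_demo) / sr_demo
tone = np.sin(2 * np.pi * 220 * t)    # one second of a 220 Hz tone

fast_tone = change_speed_by_resampling(tone, perc=0.15, faster=True)
slow_tone = change_speed_by_resampling(tone, perc=0.15, faster=False)
print(len(fast_tone), len(tone), len(slow_tone))
```

Played back at the same sample rate, the faster version is shorter and the slower version is longer than the original.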

Add Noise

Add white noise at a signal-to-noise ratio (SNR) of 10:

noisy = sp.augment.add_white_noise(f, sr=sr, snr = 10)

Out:

/home/airos/Projects/github/a-n-rose/Python-Sound-Tool/soundpy/dsp.py:769: UserWarning:
Warning: `soundpy.dsp.clip_at_zero` found no samples close to zero. Clipping was not applied.

  warnings.warn(msg)

ipd.Audio(noisy,rate=sr)


sp.plotsound(noisy, sr=sr, feature_type='stft',
               title='Speech with white noise: 10 SNR', subprocess=True)
Speech with white noise: 10 SNR
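
The idea behind adding noise at a given SNR can be sketched in a few lines of NumPy. This is a rough illustration, not SoundPy's implementation, and it assumes the SNR is expressed in decibels, which the example above does not state. The function name `add_white_noise_at_snr` is hypothetical.

```python
import numpy as np

def add_white_noise_at_snr(samples, snr_db=10, seed=0):
    """Mix in Gaussian white noise scaled to a target SNR (assumed dB)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 1.0, len(samples))
    signal_power = np.mean(samples ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the noise power so that
    # 10 * log10(signal_power / noise_power) equals snr_db.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return samples + noise, noise

sr_demo = 16000                       # hypothetical sample rate
t = np.arange(sr_demo) / sr_demo
tone = np.sin(2 * np.pi * 220 * t)

noisy_tone, noise = add_white_noise_at_snr(tone, snr_db=10)
measured_snr = 10 * np.log10(np.mean(tone ** 2) / np.mean(noise ** 2))
print(measured_snr)
```

A lower SNR means louder noise relative to the signal, so sweeping the SNR over a range is a simple way to generate several augmented versions of one recording.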

Harmonic Distortion

hd = sp.augment.harmonic_distortion(f, sr=sr)
ipd.Audio(hd,rate=sr)


sp.plotsound(hd, sr=sr, feature_type='stft',
               title='Speech with harmonic distortion', subprocess=True)
Speech with harmonic distortion
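
For intuition about what harmonic distortion means, here is a minimal sketch using a generic `tanh` waveshaper in NumPy. I am not claiming this is what `sp.augment.harmonic_distortion` does internally; it only shows that passing a pure tone through a nonlinearity creates energy at multiples of the original frequency. The name `soft_clip` and the `drive` parameter are assumptions for illustration.

```python
import numpy as np

def soft_clip(samples, drive=5.0):
    """Apply a tanh waveshaper; a larger drive distorts more strongly."""
    return np.tanh(drive * samples) / np.tanh(drive)

sr_demo = 8000                        # hypothetical sample rate
t = np.arange(sr_demo) / sr_demo      # exactly 1 s, so FFT bin k is k Hz
tone = np.sin(2 * np.pi * 200 * t)    # pure 200 Hz tone
distorted = soft_clip(tone)

spectrum_clean = np.abs(np.fft.rfft(tone))
spectrum_dist = np.abs(np.fft.rfft(distorted))

# Energy at the third harmonic (600 Hz) before and after distortion.
third_harmonic_clean = spectrum_clean[600]
third_harmonic_dist = spectrum_dist[600]
print(third_harmonic_clean, third_harmonic_dist)
```

The clean tone has essentially nothing at 600 Hz, while the distorted tone has a clear peak there.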

Pitch Shift

Pitch shift increase

psi = sp.augment.pitch_increase(f, sr=sr, num_semitones = 2)
ipd.Audio(psi,rate=sr)


sp.plotsound(psi, sr=sr, feature_type='stft',
               title='Speech with pitch shift increase', subprocess=True)
Speech with pitch shift increase

Pitch shift decrease

psd = sp.augment.pitch_decrease(f, sr=sr, num_semitones = 2)
ipd.Audio(psd,rate=sr)


sp.plotsound(psd, sr=sr, feature_type='stft',
               title='Speech with pitch shift decrease', subprocess=True)
Speech with pitch shift decrease
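
As a quick aside on what `num_semitones` means: in equal temperament, one semitone multiplies frequency by the twelfth root of two. The small sketch below is plain Python and independent of SoundPy; the helper name `semitones_to_ratio` is hypothetical.

```python
def semitones_to_ratio(num_semitones):
    """Frequency ratio for a shift of num_semitones (equal temperament)."""
    return 2.0 ** (num_semitones / 12.0)

up = semitones_to_ratio(2)       # ratio for shifting up two semitones
down = semitones_to_ratio(-2)    # ratio for shifting down two semitones

# Shifting A4 (440 Hz) up two semitones lands near B4.
shifted_a4 = 440.0 * up
print(up, down, shifted_a4)
```

So a shift of two semitones changes every frequency by roughly 12%, up or down.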

Vocal Tract Length Perturbation

Note: this is still experimental.

vtlp_stft, a = sp.augment.vtlp(f, sr=sr, win_size_ms = 50,
                                 percent_overlap = 0.5,
                                 random_seed = 41)

To listen to this, we need to convert the STFT back into samples:

vtlp_y = sp.feats.feats2audio(vtlp_stft, sr = sr,
                                feature_type = 'stft',
                                win_size_ms = 50,
                                percent_overlap = 0.5)
ipd.Audio(vtlp_y,rate=sr)


sp.feats.plot(vtlp_stft, sr=sr, feature_type='stft',
               title='VTLP (factor {})'.format(a), subprocess=True)
VTLP (factor 0.8)

Vocal tract length perturbation with a different random seed (the factor ranges from 0.8 to 1.2):

vtlp_stft, a = sp.augment.vtlp(f, sr=sr, win_size_ms = 50,
                                 percent_overlap = 0.5,
                                 random_seed = 43)

Again, to listen to this, we need to convert the STFT back into samples:

vtlp_y = sp.feats.feats2audio(vtlp_stft, sr = sr,
                                feature_type = 'stft',
                                win_size_ms = 50,
                                percent_overlap = 0.5)
ipd.Audio(vtlp_y,rate=sr)


sp.feats.plot(vtlp_stft, sr=sr, feature_type='stft',
               title='VTLP (factor {})'.format(a), subprocess=True)
VTLP (factor 1.2)
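
To show what a warping factor does to the frequency axis, here is a sketch of a piecewise-linear frequency warp of the kind commonly described for VTLP. Whether `sp.augment.vtlp` uses exactly this warp is an assumption on my part, and so are the function name `vtlp_warp_frequencies` and the boundary frequency `f_hi=4800.0`.

```python
import numpy as np

def vtlp_warp_frequencies(freqs, alpha, sr, f_hi=4800.0):
    """Warp frequencies by factor alpha, keeping 0 Hz and Nyquist fixed.

    Below a boundary frequency, frequencies are scaled by alpha. Above
    it, a second linear segment brings the warp back to Nyquist.
    f_hi is an assumed boundary parameter.
    """
    freqs = np.asarray(freqs, dtype=float)
    nyquist = sr / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    low = freqs * alpha
    slope = (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary)
    high = nyquist - slope * (nyquist - freqs)
    return np.where(freqs <= boundary, low, high)

sr_demo = 44100
freqs = np.linspace(0, sr_demo / 2, 6)
warped_low = vtlp_warp_frequencies(freqs, alpha=0.8, sr=sr_demo)
warped_high = vtlp_warp_frequencies(freqs, alpha=1.2, sr=sr_demo)
print(freqs)
print(warped_low)
print(warped_high)
```

A factor below 1 pulls the lower frequencies down and a factor above 1 pushes them up, which loosely mimics a longer or shorter vocal tract.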

Augmenting non-speech signals

# Car horn sample:
honk = '{}audiodata/car_horn.wav'.format(sp_dir)
honk = sp.utils.string2pathlib(honk)

Hear and see sound signal

h, sr = sp.loadsound(honk, sr=sr)
ipd.Audio(h,rate=sr)


sp.plotsound(h, sr=sr, feature_type='stft',
               title='Car Horn', subprocess=True)
Car Horn

Change Speed

Let’s increase the speed by 15%:

fast = sp.augment.speed_increase(h, sr=sr, perc = 0.15)
ipd.Audio(fast,rate=sr)


sp.plotsound(fast, sr=sr, feature_type='stft',
               title='Car horn: 15%  faster', subprocess=True)
Car horn: 15%  faster

Let’s decrease the speed by 15%:

slow = sp.augment.speed_decrease(h, sr=sr, perc = 0.15)
ipd.Audio(slow,rate=sr)


sp.plotsound(slow, sr=sr, feature_type='stft',
               title='Car horn: 15%  slower', subprocess=True)
Car horn: 15%  slower

Add Noise

Add white noise

h_noisy = sp.augment.add_white_noise(h, sr=sr, snr = 10)
ipd.Audio(h_noisy,rate=sr)


sp.plotsound(h_noisy, sr=sr, feature_type='stft',
               title='Car horn with white noise (10 SNR)',
               subprocess=True)
Car horn with white noise (10 SNR)

Harmonic Distortion

hd = sp.augment.harmonic_distortion(h, sr=sr)
ipd.Audio(hd,rate=sr)


sp.plotsound(hd, sr=sr, feature_type='stft',
               title='Car horn with harmonic distortion',
               subprocess=True)
Car horn with harmonic distortion

Pitch Shift

Pitch shift increase

psi = sp.augment.pitch_increase(h, sr=sr, num_semitones = 2)
ipd.Audio(psi,rate=sr)


sp.plotsound(psi, sr=sr, feature_type='stft',
               title='Car horn with pitch shift increase',
               subprocess=True)
Car horn with pitch shift increase

Pitch shift decrease

psd = sp.augment.pitch_decrease(h, sr=sr, num_semitones = 2)
ipd.Audio(psd,rate=sr)


sp.plotsound(psd, sr=sr, feature_type='stft',
               title='Car horn with pitch shift decrease',
               subprocess=True)
Car horn with pitch shift decrease

Time Shift

We’ll apply a random shift to the sound:

h_shift = sp.augment.time_shift(h, sr=sr)
ipd.Audio(h_shift,rate=sr)


sp.plotsound(h_shift, sr=sr, feature_type='stft',
               title='Car horn: time shifted',
               subprocess=True)
Car horn: time shifted
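
One simple way to think about a time shift is as a circular shift: samples pushed off one end wrap around to the other. The NumPy sketch below illustrates that idea only; `sp.augment.time_shift` may handle the seam differently. The name `random_circular_shift` is hypothetical.

```python
import numpy as np

def random_circular_shift(samples, seed=0):
    """Rotate the signal by a random number of samples."""
    rng = np.random.default_rng(seed)
    shift = int(rng.integers(1, len(samples)))  # 1 .. len(samples) - 1
    return np.roll(samples, shift), shift

ramp = np.arange(10)                  # stand-in for audio samples
shifted, shift = random_circular_shift(ramp, seed=0)
print(shift, shifted)
```

No samples are lost; only their position in time changes, which suits sounds where the onset time carries no label information.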

Shuffle the Sound

h_shuffle = sp.augment.shufflesound(h, sr=sr,
                                      num_subsections = 5)
ipd.Audio(h_shuffle,rate=sr)


sp.plotsound(h_shuffle, sr=sr, feature_type='stft',
               title='Car horn: shuffled', subprocess=True)
Car horn: shuffled

Just for kicks, let’s do the same to speech and see how that influences the signal:

f_shuffle = sp.augment.shufflesound(f, sr=sr,
                                      num_subsections = 5)
ipd.Audio(f_shuffle,rate=sr)


sp.plotsound(f_shuffle, sr=sr, feature_type='stft',
               title='Speech: shuffled', subprocess=True)
Speech: shuffled
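
The shuffling idea can be sketched with NumPy alone: cut the signal into sections and put them back in a random order. This is a rough illustration rather than SoundPy's implementation, and the name `shuffle_sections` is hypothetical.

```python
import numpy as np

def shuffle_sections(samples, num_subsections=5, seed=0):
    """Split the signal into sections and reorder them at random."""
    rng = np.random.default_rng(seed)
    sections = np.array_split(samples, num_subsections)
    order = rng.permutation(num_subsections)
    shuffled = np.concatenate([sections[i] for i in order])
    return shuffled, order

ramp = np.arange(20)                  # stand-in for audio samples
shuffled, order = shuffle_sections(ramp, num_subsections=5)
print(order)
print(shuffled)
```

Each section stays intact internally, which is why shuffling can work for texture-like sounds but scrambles speech, where the order of the sections carries the meaning.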

