SoundPy Functionality v0.1.0a3

Built-In Functionality (non-Deep Learning)

The soundpy.builtin module includes higher-level functions that pull from several other functions to complete fairly complex tasks, such as dataset formatting, filtering signals, and extracting features for neural networks.

soundpy.builtin.filtersignal(audiofile, sr=None, noise_file=None, filter_type='wiener', filter_scale=1, apply_postfilter=False, duration_noise_ms=120, real_signal=False, phase_radians=True, num_bands=None, visualize=False, visualize_every_n_windows=50, max_vol=0.4, min_vol=0.15, save2wav=False, output_filename=None, overwrite=False, use_scipy=False, remove_dc=True, control_vol=False, **kwargs)[source]

Apply Wiener or band spectral subtraction filter to signal using noise.

The noise can be provided as a separate file or as samples, or it can be taken from the beginning of the provided audio. How much noise is measured is set via the parameter duration_noise_ms.

Parameters
  • audiofile (str, np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – Filename or the audio data of the signal to be filtered.

  • sr (int) – The sample rate of the audio. If audiofile is type np.ndarray, sr is required. (default None)

  • noise_file (str, tuple, optional) – Path to either noise audiofile or .npy file containing average power spectrum values. If tuple, must include samples and sr. If None, the beginning of the audiofile will be used for noise data. (default None)

  • filter_type (str) – Type of filter to apply. Options ‘wiener’ or ‘band_specsub’.

  • filter_scale (int or float) – The scale at which the filter should be applied. This value will be multiplied to the noise levels thereby increasing or decreasing the filter strength. (default 1)

  • apply_postfilter (bool) – Whether or not the post filter should be applied. The post filter reduces musical noise (i.e. distortion) in the signal as a byproduct of filtering.

  • duration_noise_ms (int or float) – The amount of noise, in milliseconds, to which Welch’s method is applied; in other words, how much of the noise to use when approximating the average noise power spectrum. (default 120)

  • real_signal (bool) – If True, only half of the (mirrored) fast Fourier transform will be used during filtering. For the filtered audio there is no difference; the difference is visible in the plots, however, if you are interested. (default False)

  • phase_radians (bool) – Relevant for band spectral subtraction: whether phase should be calculated in radians or as complex values / power spectrum. (default True)

  • num_bands (int) – Relevant for band spectral subtraction: the number of bands to section frequencies into. By grouping sections of frequencies during spectral subtraction filtering, musical noise or distortion should be reduced. (defaults to 6)

  • visualize (bool) – If True, plots of the windows and filtered signal will be made. (default False)

  • visualize_every_n_windows (int) – If visualize is set to True, this controls how often plots are made: every 50 windows, for example. (default 50)

  • max_vol (int or float) – The maximum volume level of the filtered signal. This is useful if you know you do not want the signal to be louder than a certain value; ears are important. Note: matching the volume to the original signal has not yet been implemented. (default 0.4)

  • min_vol (int or float) – The minimum volume level of the filtered signal. (default 0.15)

  • save2wav (bool) – If True, the filtered signal will be saved as a .wav file. (default False)

  • output_filename (str, pathlib.PosixPath, optional) – The path and name under which the filtered signal is to be saved. If no filename is provided, one will be generated based on the current date. (default None)

  • overwrite (bool) – If True and an audiofile by the same name exists, that file will be overwritten.

  • use_scipy (bool) – If False, audiofiles will be loaded using librosa; otherwise, scipy.io.wavfile will be used. (default False)

  • remove_dc (bool) – If True, the DC bias (‘direct current’ bias) will be removed. In other words, the mean amplitude will be made to equal 0. (default True)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.filters.WienerFilter or soundpy.filters.BandSubtraction (depending on filter_type).

Returns

  • enhanced_signal (np.ndarray [size = (num_samples, )]) – The enhanced signal in raw sample form. Stereo audio has not yet been tested.

  • sr (int) – The sample rate of the enhanced/ filtered signal.

References

Kamath, S. and Loizou, P. (2002). A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing.

Kamath, S. and Loizou, P. (2006). mband.m MATLAB code from the book:

Loizou, P. (2013). Speech Enhancement: Theory and Practice. CRC Press.
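
Examples

A minimal usage sketch; the audio path is hypothetical, and the noise estimate is taken from the first 120 ms of the file, per the defaults above:

>>> from soundpy.builtin import filtersignal
>>> filtered, sr = filtersignal('noisy_speech.wav', filter_type='wiener',
...                             apply_postfilter=True)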

soundpy.builtin.dataset_logger(audiofile_dir=None, recursive=True)[source]

Logs the name, format, bit depth, sample rate, duration, and number of channels of audio files.

Parameters
  • audiofile_dir (str or pathlib.PosixPath) – The directory where audiofiles of interest are. If no directory provided, the current working directory will be used.

  • recursive (bool) – If True, all audiofiles will be analyzed, also in nested directories. Otherwise, only the audio files in the immediate directory will be analyzed. (default True)

Returns

audiofile_dict – A nested dictionary holding, for each audio file in the directory/ies, a dictionary of its format information.

Return type

dict

Examples

>>> audio_info = dataset_logger()
>>> # look at three audio files:
>>> count = 0
>>> for key, value in audio_info.items():
...     for k, v in value.items():
...         print(k, ' : ', v)
...     count += 1
...     print()
...     if count > 2:
...         break
audio  :  audiodata/dogbark_2channels.wav
sr  :  48000
num_channels  :  2
dur_sec  :  0.389
format_type  :  WAV
bitdepth  :  PCM_16

audio  :  audiodata/python_traffic_pf.wav
sr  :  48000
num_channels  :  1
dur_sec  :  1.86
format_type  :  WAV
bitdepth  :  DOUBLE

audio  :  audiodata/259672__nooc__this-is-not-right.wav
sr  :  44100
num_channels  :  1
dur_sec  :  2.48453514739229
format_type  :  WAV
bitdepth  :  PCM_16

See also

soundfile.available_subtypes

The subtypes available with the package SoundFile

soundfile.available_formats

The formats available with the package SoundFile

soundpy.builtin.dataset_formatter(audiodirectory=None, recursive=False, new_dir=None, sr=None, dur_sec=None, zeropad=False, format='WAV', bitdepth=None, overwrite=False, mono=False)[source]

Formats all audio files in a directory to set parameters.

The audiofiles formatted can be limited to the specific directory or be extended to the subfolders of that directory.

Parameters
  • audiodirectory (str or pathlib.PosixPath) – The directory where audio files live. If no directory provided, the current working directory will be used.

  • recursive (bool) – If False, only audiofiles limited to the specific directory will be formatted. If True, audio files in nested directories will also be formatted. (default False)

  • new_dir (str or pathlib.PosixPath) – The audiofiles will be saved with the same structure in this directory. If None, a default directory name with time stamp will be generated.

  • sr (int) – The desired sample rate to assign to the audio files. If None, the original sample rate will be maintained.

  • dur_sec (int) – The desired length in seconds the audio files should be limited to. If zeropad is set to True, the samples will be zeropadded to match this length if they are too short. If None, no limitation will be applied.

  • zeropad (bool) – If True, samples will be zeropadded to match dur_sec. (default False)

  • format (str) – The format to save the audio data in. (default ‘WAV’)

  • bitdepth (int, str) – The desired bitdepth. If int, 16 or 32 are possible. Defaults to ‘PCM_16’.

  • overwrite (bool) – If True and new_dir is None, the audio data will be reformatted in the original directory and saved over any existing filenames. (default False)

  • mono (bool) – If True, the audio will be limited to a single channel. Note: not much has been tested for stereo sound and soundpy. (default False)

Returns

directory – The directory where the formatted audio files are located.

Return type

pathlib.PosixPath

See also

soundpy.files.collect_audiofiles

Collects audiofiles from a given directory.

soundpy.files.conversion_formats

The available formats for converting audio data.

soundfile.available_subtypes

The subtypes or bitdepth possible for soundfile
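
Examples

A minimal sketch with hypothetical directory names, resampling all audio in a directory to 16 kHz mono WAV:

>>> from soundpy.builtin import dataset_formatter
>>> formatted_dir = dataset_formatter(audiodirectory='audiodata/',
...                                   new_dir='audiodata_formatted/',
...                                   sr=16000, format='WAV', mono=True)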

soundpy.builtin.create_denoise_data(cleandata_dir, noisedata_dir, trainingdata_dir, limit=None, snr_levels=None, pad_mainsound_sec=None, random_seed=None, overwrite=False, **kwargs)[source]

Applies noise to clean audio; saves clean and noisy audio to trainingdata_dir.

Parameters
  • cleandata_dir (str, pathlib.PosixPath) – Name of folder containing clean audio data for autoencoder. E.g. ‘clean_speech’

  • noisedata_dir (str, pathlib.PosixPath) – Name of folder containing noise to add to clean data. E.g. ‘noise’

  • trainingdata_dir (str, pathlib.PosixPath) – Directory to save newly created train, validation, and test data

  • limit (int, optional) – Limit in number of audiofiles used for training data

  • snr_levels (list of ints, optional) – List of varying signal-to-noise ratios to apply to noise levels. (default None)

  • pad_mainsound_sec (int, float, optional) – Amount in seconds the main sound should be padded. In other words, in seconds how long the background sound should play before the clean / main / target audio starts. The same amount of noise will be appended at the end. (default None)

  • random_seed (int) – A value to allow random order of audiofiles to be predictable. (default None). If None, the order of audiofiles will not be predictable.

  • overwrite (bool) – If True, a new dataset will be created regardless of whether or not a matching directory already exists. (default False)

  • **kwargs (additional keyword arguments) – The keyword arguments for soundpy.files.loadsound

See also

soundpy.files.loadsound

Loads audiofiles.

soundpy.dsp.add_backgroundsound

Add background sound / noise to signal at a determined signal-to-noise ratio.
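
Examples

A minimal sketch with hypothetical directory names; each clean file would be mixed with noise at one of the listed SNR levels:

>>> from soundpy.builtin import create_denoise_data
>>> create_denoise_data('clean_speech/', 'noise/', 'train_data/',
...                     limit=100, snr_levels=[0, 5, 10])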

soundpy.builtin.envclassifier_feats(data_dir, data_features_dir=None, perc_train=0.8, ignore_label_marker=None, **kwargs)[source]

Environment Classifier: feature extraction of scene audio into train, val, & test datasets.

Saves extracted feature datasets (train, val, test datasets) as well as feature extraction settings in the directory data_features_dir.

Parameters
  • data_dir (str or pathlib.PosixPath) – The directory with scene subfolders (e.g. ‘air_conditioner’, ‘traffic’) that contain audio files belonging to that scene (e.g. ‘air_conditioner/ac1.wav’, ‘air_conditioner/ac2.wav’, ‘traffic/t1.wav’).

  • data_features_dir (str or pathlib.PosixPath, optional) – The directory where feature extraction related to the dataset will be stored. Within this directory, a unique subfolder will be created each time features are extracted. This allows several versions of extracted features on the same dataset without overwriting files.

  • perc_train (float) – The amount of data to be set aside for train data. The rest will be divided into validation and test datasets.

  • ignore_label_marker (str) – A string to look for in the labels if the “label” should not be included. For example, ‘__’ to ignore a subdirectory titled “__noise” or “not__label”.

  • kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.save_features_datasets and soundpy.feats.get_feats.

Returns

feat_extraction_dir – The pathway to where all feature extraction files can be found, including datasets.

Return type

pathlib.PosixPath

See also

soundpy.feats.get_feats

Extract features from audio file or audio data.

soundpy.feats.save_features_datasets

Preparation of acoustic features in train, validation and test datasets.
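
Examples

A minimal sketch; 'sound_classes/' is a hypothetical directory whose subfolders (e.g. 'traffic/', 'park/') name the classes:

>>> from soundpy.builtin import envclassifier_feats
>>> feat_dir = envclassifier_feats('sound_classes/', perc_train=0.8)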

soundpy.builtin.denoiser_feats(data_clean_dir, data_noisy_dir, data_features_dir=None, limit=None, perc_train=0.8, **kwargs)[source]

Autoencoder Denoiser: feature extraction of clean & noisy audio into train, val, & test datasets.

Saves extracted feature datasets (train, val, test datasets) as well as feature extraction settings in the directory data_features_dir.

Parameters
  • data_clean_dir (str or pathlib.PosixPath) – The directory with clean audio files.

  • data_noisy_dir (str or pathlib.PosixPath) – The directory with noisy audio files. These should be the same as the clean audio, except noise has been added.

  • data_features_dir (str or pathlib.PosixPath, optional) – The directory where feature extraction related to the dataset will be stored. Within this directory, a unique subfolder will be created each time features are extracted. This allows several versions of extracted features on the same dataset without overwriting files.

  • limit (int, optional) – The limit of audio files for feature extraction. (default None)

  • kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.save_features_datasets and soundpy.feats.get_feats.

Returns

feat_extraction_dir – The pathway to where all feature extraction files can be found, including datasets.

Return type

pathlib.PosixPath

See also

soundpy.datasets.create_denoise_data

Applies noise at specified SNR levels to clean audio files.

soundpy.feats.get_feats

Extract features from audio file or audio data.

soundpy.feats.save_features_datasets

Preparation of acoustic features in train, validation and test datasets.
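
Examples

A minimal sketch with hypothetical directories of paired clean and noisy audio, e.g. as produced by soundpy.builtin.create_denoise_data:

>>> from soundpy.builtin import denoiser_feats
>>> feat_dir = denoiser_feats('train_data/clean/', 'train_data/noisy/',
...                           limit=100)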

Built-In Functionality (Deep Learning)

The soundpy.models.builtin module includes example functions that train neural networks on sound data.

soundpy.models.builtin.denoiser_train(feature_extraction_dir, model_name='model_autoencoder_denoise', feature_type=None, use_generator=True, normalize=True, patience=10, **kwargs)[source]

Collects training features and train autoencoder denoiser.

Parameters
  • feature_extraction_dir (str or pathlib.PosixPath) – Directory where extracted feature files are located (format .npy).

  • model_name (str) – The name for the model. This can be quite generic as the date up to the millisecond will be added to ensure a unique name for each trained model. (default ‘model_autoencoder_denoise’)

  • feature_type (str, optional) – The type of features that will be used to train the model. This is only for the purposes of naming the model. If set to None, it will not be included in the model name.

  • use_generator (bool) – If True, a generator will be used to feed training data to the model. Otherwise the entire training data will be used to train the model all at once. (default True)

  • normalize (bool) – If True, the data will be normalized before feeding to the model. (default True)

  • patience (int) – Number of epochs to train without improvement before early stopping. (default 10)

  • **kwargs (additional keyword arguments) – The keyword arguments for keras.fit(). Note, the keyword arguments differ for validation data, so be sure to use the correct keyword arguments depending on whether or not you use the generator.

Returns

model_dir – The directory where the model and associated files can be found.

Return type

pathlib.PosixPath

See also

soundpy.datasets.separate_train_val_test_files

Generates paths lists for train, validation, and test files. Useful for noisy vs clean datasets and also for multiple training files.

soundpy.models.generator

The generator function that feeds data to the model.

soundpy.models.modelsetup.setup_callbacks

The function that sets up callbacks (e.g. logging, save best model, early stopping, etc.)

soundpy.models.template_models.autoencoder_denoise

Template model architecture for basic autoencoder denoiser.
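
Examples

A minimal sketch, assuming feat_dir points to a directory produced by soundpy.builtin.denoiser_feats; epochs is passed through to keras.fit():

>>> from soundpy.models.builtin import denoiser_train
>>> model_dir = denoiser_train(feat_dir, feature_type='fbank', epochs=10)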

soundpy.models.builtin.envclassifier_train(feature_extraction_dir, model_name='model_cnn_classifier', feature_type=None, use_generator=True, normalize=True, patience=15, add_tensor_last=True, num_layers=3, **kwargs)[source]

Collects training features and trains cnn environment classifier.

This model may be applied to any speech and label scenario, for example, male vs female speech, clinical vs healthy speech, simple speech / word recognition, as well as noise / scene / environment classification.

Parameters
  • feature_extraction_dir (str or pathlib.PosixPath) – Directory where extracted feature files are located (format .npy).

  • model_name (str) – The name for the model. This can be quite generic as the date up to the millisecond will be added to ensure a unique name for each trained model. (default ‘model_cnn_classifier’)

  • feature_type (str, optional) – The type of features that will be used to train the model. This is only for the purposes of naming the model. If set to None, it will not be included in the model name.

  • use_generator (bool) – If True, a generator will be used to feed training data to the model. Otherwise the entire training data will be used to train the model all at once. (default True)

  • normalize (bool) – If True, the data will be normalized before feeding to the model. (default True)

  • patience (int) – Number of epochs to train without improvement before early stopping. (default 15)

  • num_layers (int) – The number of convolutional neural network layers desired. (default 3)

  • **kwargs (additional keyword arguments) – The keyword arguments for keras.fit(). Note, the keyword arguments differ for validation data, so be sure to use the correct keyword arguments depending on whether or not you use the generator.

Returns

model_dir – The directory where the model and associated files can be found.

Return type

pathlib.PosixPath

See also

soundpy.datasets.separate_train_val_test_files

Generates paths lists for train, validation, and test files. Useful for noisy vs clean datasets and also for multiple training files.

soundpy.models.generator

The generator function that feeds data to the model.

soundpy.models.modelsetup.setup_callbacks

The function that sets up callbacks (e.g. logging, save best model, early stopping, etc.)

soundpy.models.template_models.cnn_classifier

Template model architecture for a low-computational CNN sound classifier.
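
Examples

A minimal sketch, assuming feat_dir points to a directory produced by soundpy.builtin.envclassifier_feats; epochs is passed through to keras.fit():

>>> from soundpy.models.builtin import envclassifier_train
>>> model_dir = envclassifier_train(feat_dir, feature_type='fbank', epochs=10)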

soundpy.models.builtin.denoiser_run(model, new_audio, feat_settings_dict, remove_dc=True)[source]

Implements a pre-trained denoiser

Parameters
  • model (str or pathlib.PosixPath) – The path to the denoising model.

  • new_audio (str, pathlib.PosixPath, or np.ndarray) – The path to the noisy audiofile.

  • feat_settings_dict (dict) – Dictionary containing necessary settings for how the features were extracted for training the model. Expected keys: ‘feature_type’, ‘win_size_ms’, ‘percent_overlap’, ‘sr’, ‘window’, ‘frames_per_sample’, ‘input_shape’, ‘desired_shape’, ‘dur_sec’, ‘num_feats’.

Returns

  • cleaned_audio (np.ndarray [shape = (num_samples, )]) – The cleaned audio samples ready for playing or saving as audio file.

  • sr (int) – The sample rate of cleaned_audio.

See also

soundpy.feats.get_feats

How features are extracted.

soundpy.feats.feats2audio

How features are transformed back into audio samples.
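
Examples

A minimal sketch; the model path and audio path are hypothetical, and feat_settings stands for a dictionary with the keys listed above:

>>> from soundpy.models.builtin import denoiser_run
>>> cleaned, sr = denoiser_run('model_dir/model.h5', 'noisy_speech.wav',
...                            feat_settings)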

soundpy.models.builtin.envclassifier_run(model, new_audio, feat_settings_dict, dict_decode)[source]

Implements a pre-trained convnet classifier on new_audio.

Parameters
  • model (str, pathlib.PosixPath) – The pathway to the pre-trained model.

  • new_audio (str, pathlib.PosixPath) – The pathway to the audio file to be classified.

  • feat_settings_dict (dict) – Dictionary containing necessary settings for feature extraction, such as sample rate, feature type, etc.

  • dict_decode (dict) – Dictionary containing encoded labels as keys and string labels as values, for example {0: ‘office’, 1: ‘traffic’, 2: ‘park’}.

Returns

  • label (int) – The encoded label applied to the new_audio.

  • label_string (str) – The string label applied to the new_audio.

  • strength (float) – The confidence of the model’s assignment. For example, 0.99 would be very confident, 0.51 would not be very confident.
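
Examples

A minimal sketch with hypothetical paths; feat_settings stands for the feature settings dictionary described above:

>>> from soundpy.models.builtin import envclassifier_run
>>> label, label_string, strength = envclassifier_run(
...     'model_dir/model.h5', 'new_sound.wav', feat_settings,
...     dict_decode={0: 'office', 1: 'traffic', 2: 'park'})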

soundpy.models.builtin.collect_classifier_settings(feature_extraction_dir)[source]

Collects relevant information for some models from files in the feature directory.

These relevant files have been generated in soundpy.models.builtin.envclassifier_train.

Parameters

feature_extraction_dir (str, pathlib.PosixPath) – The directory where extracted files are located, including .npy and .csv log files.

Returns

  • datasets (NamedTuple) – A named tuple containing train, val, and test data

  • num_labels (int) – The number of labels used for the data.

  • feat_shape (tuple) – The initial shape of the features when they were extracted, i.e. before labels or a context window were applied.

  • num_feats (int) – The number of features used to train the pre-trained model.

  • feature_type (str) – The feature_type used to train the pre-trained model. For example, ‘fbank’, ‘mfcc’, ‘stft’, ‘signal’, ‘powspec’.

See also

soundpy.models.builtin.envclassifier_train

The built-in functionality for training a simple scene/environment/speech classifier; it generates the files this function expects.

soundpy.models.builtin.cnnlstm_train(feature_extraction_dir, model_name='model_cnnlstm_classifier', use_generator=True, normalize=True, patience=15, timesteps=10, context_window=5, frames_per_sample=None, colorscale=1, total_training_sessions=None, add_tensor_last=False, **kwargs)[source]

Example implementation of a Convnet+LSTM model for speech recognition.

Note: improvements must still be made, for example with the context_window. However, this still may be useful as an example of a simple CNN and LSTM model.

Parameters
  • feature_extraction_dir (str, pathlib.PosixPath) – The directory where feature data will be saved.

  • model_name (str) – The name of the model. (default ‘model_cnnlstm_classifier’)

  • use_generator (bool) – If True, data will be fed to the model via generator. This parameter will likely be removed and set as a default. (default True)

  • normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)

  • patience (int) – The number of epochs to allow with no improvement in either val accuracy or loss. (default 15)

  • timesteps (int) – The frames dedicated to each subsection of each sample. This allows the long-short term memory model to process each subsection consecutively.

  • context_window (int) – The number of frames surrounding a central frame that make up sound context. Note: this needs improvement and further exploration.

  • frames_per_sample (int) – Serves basically the same role as context_window currently does: frames_per_sample equals context_window * 2 + 1. This parameter will likely be removed in future versions.

  • colorscale (int) – The colorscale relevant for the convolutional neural network. (default 1)

  • total_training_sessions (int) – Option to limit number of audiofiles used for training, if use_generator is set to False. This parameter will likely be removed in future versions. But as this is just an example model, the low priority may result in this parameter living forever.

  • add_tensor_last (bool) – No longer used in the code. Irrelevant.

  • kwargs (additional keyword arguments.) – Keyword arguments for keras.model.fit.

Returns

  • model_dir (pathlib.PosixPath) – The directory where model and log files are saved.

  • history (tf.keras.callbacks.History) – Contains model training and validation accuracy and loss throughout training.

References

Kim, Myungjong & Cao, Beiming & An, Kwanghoon & Wang, Jun. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. 10.21437/interspeech.2018-2250.

soundpy.models.builtin.resnet50_train(feature_extraction_dir, model_name='model_resnet50_classifier', use_generator=True, normalize=True, patience=15, colorscale=3, total_training_sessions=None, **kwargs)[source]

Continue training a pre-trained resnet50 model for speech recognition or other sound classification.

Parameters
  • feature_extraction_dir (str or pathlib.PosixPath) – The directory where feature extraction files will be saved.

  • model_name (str) – The name for the model. (default ‘model_resnet50_classifier’)

  • use_generator (bool) – If True, data will be fed to the model via generator. This parameter will likely be removed and set as a default. (default True)

  • normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)

  • patience (int) – The number of epochs to allow with no improvement in either val accuracy or loss. (default 15)

  • colorscale (int) – The colorscale relevant for the convolutional neural network. (default 3)

  • total_training_sessions (int) – Option to limit number of audiofiles used for training, if use_generator is set to False. This parameter will likely be removed in future versions. But as this is just an example model, the low priority may result in this parameter living forever.

Returns

  • model_dir (pathlib.PosixPath) – The directory where model and log files are saved.

  • history (tf.keras.callbacks.History) – Contains model training and validation accuracy and loss throughout training.

soundpy.models.builtin.envclassifier_extract_train(model_name='env_classifier', augment_dict=None, audiodata_path=None, features_dir=None, save_new_files_dir=None, labeled_data=True, ignore_label_marker=None, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=None, visualize=False, vis_every_n_items=50, label_silence=False, val_data=None, test_data=None, append_model_dir=False, **kwargs)[source]

Extract and augment features during training of a scene/environment/speech classifier

Parameters
  • model_name (str) – Name of the model. No extension (will save as .h5 file) (default ‘env_classifier’)

  • augment_dict (dict, optional) – Dictionary containing augmentation names as keys (e.g. ‘add_white_noise’; see soundpy.augment.list_augmentations) and True or False as values. If a value is True, that augmentation is implemented at random each epoch. (default None)

  • audiodata_path (str, pathlib.PosixPath) – Where audio data can be found, if no features_dir with previously extracted and prepared files is provided. (default None)

  • features_dir (str, pathlib.PosixPath) – The feature directory where previously extracted validation and test data are located, as well as the relevant log files.

  • save_new_files_dir (str, pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, will be set in a unique directory within the current working directory. (default None)

  • labeled_data (bool) – Useful in determining the shape of the data. If True, a label column is expected at the end of the feature data. Note: this may be removed in future versions.

  • ignore_label_marker (str) – When collecting labels from subdirectory names, this allows a subfolder name to be ignored. For example, if ignore_label_marker is set as ‘__’, the folder name ‘__test__’ will not be included as a label while a folder name ‘dog_barking’ will.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.

soundpy.models.builtin.cnnlstm_extract_train(model_name='cnnlstm_classifier', dataset_dict=None, num_labels=None, augment_dict=None, audiodata_path=None, save_new_files_dir=None, labeled_data=True, ignore_label_marker=None, context_window=5, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=None, visualize=False, vis_every_n_items=50, label_silence=False, **kwargs)[source]

Extract and augment features during training of a scene/environment/speech classifier

Parameters
  • model_name (str) – Name of the model. No extension (will save as .h5 file)

  • dataset_dict (dict, optional) – A dictionary including datasets as keys, and audio file lists (with or without labels) as values. If None, will be created based on audiodata_path. (default None)

  • augment_dict (dict, optional) – Dictionary containing augmentation names as keys (e.g. ‘add_white_noise’; see soundpy.augment.list_augmentations) and True or False as values. If a value is True, that augmentation is implemented at random each epoch. (default None)

  • audiodata_path (str, pathlib.PosixPath) – Where audio data can be found, if no dataset_dict provided. (default None)

  • save_new_files_dir (str, pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, will be set in a unique directory within the current working directory. (default None)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.

soundpy.models.builtin.denoiser_extract_train(model_name='denoiser', augment_dict=None, audiodata_clean_path=None, audiodata_noisy_path=None, features_dir=None, save_new_files_dir=None, labeled_data=False, ignore_label_marker=None, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=20, visualize=False, vis_every_n_items=50, label_silence=False, val_data=None, test_data=None, append_model_dir=False, **kwargs)[source]

Extract and augment features during training of a denoising autoencoder

Parameters
  • model_name (str) – Name of the model. No extension (will save as .h5 file) (default ‘denoiser’)

  • augment_dict (dict, optional) – Dictionary containing augmentation names as keys (e.g. ‘add_white_noise’; see soundpy.augment.list_augmentations) and True or False as values. If a value is True, that augmentation is implemented at random each epoch. (default None)

  • audiodata_clean_path, audiodata_noisy_path (str, pathlib.PosixPath) – Where the clean and noisy audio data can be found, if no features_dir with previously extracted and prepared files is provided. (default None)

  • features_dir (str, pathlib.PosixPath) – The feature directory where previously extracted validation and test data are located, as well as the relevant log files.

  • save_new_files_dir (str, pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, will be set in a unique directory within the current working directory. (default None)

  • labeled_data (bool) – Useful in determining the shape of the data. If True, a label column is expected at the end of the feature data. Note: this may be removed in future versions.

  • ignore_label_marker (str) – When collecting labels from subdirectory names, this allows a subfolder name to be ignored. For example, if ignore_label_marker is set as ‘__’, the folder name ‘__test__’ will not be included as a label while a folder name ‘dog_barking’ will.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.

Augment audio data

The augment module includes functions related to augmenting audio data. These functions draw on implementations from published research.

Other resources for augmentation (not included in soundpy functionality):

Ma, E. (2019). NLP Augmentation. https://github.com/makcedward/nlpaug

Park, D. S., Chan, W., Zhang, Y., Chiu, C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Google Brain. arxiv.org/pdf/1904.08779.pdf

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084:

1. Signal speed scaling by a random number in [0.8, 1.2] (SpeedupFactoryRange).
2. Pitch shift by a random number in [−2, 2] semitones (SemitoneShiftRange).
3. Volume increase/decrease by a random number in [−3, 3] dB (VolumeGainRange).
4. Addition of random noise in the range [0, 10] dB (SNR).
5. Time shift in the range [−0.005, 0.005] seconds (TimeShiftRange).

soundpy.augment.speed_increase(sound, sr, perc=0.15, **kwargs)[source]

Acoustic augmentation of speech.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio Augmentation for Speech Recognition. Interspeech.

W. Verhelst and M. Roelands, “An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, April 1993, pp. 554–557.

soundpy.augment.speed_decrease(sound, sr, perc=0.15, **kwargs)[source]

Acoustic augmentation of speech.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

soundpy.augment.time_shift(sound, sr, random_seed=None, **kwargs)[source]

Acoustic augmentation of sound (probably not for speech).

Applies a random shift to the sound by dividing it into 2 sections and switching them.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

soundpy.augment.shufflesound(sound, sr, num_subsections=2, random_seed=None, **kwargs)[source]

Acoustic augmentation of noise or background sounds.

This separates the sound into num_subsections and pseudorandomizes the order.

References

Inoue, T., Vinayavekhin, P., Wang, S., Wood, D., Munawar, A., Ko, B. J., Greco, N., & Tachibana, R. (2019). Shuffling and mixing data augmentation for environmental sound classification. Detection and Classification of Acoustic Scenes and Events 2019. 25-26 October 2019, New York, NY, USA

soundpy.augment.add_white_noise(sound, sr, noise_level=0.01, snr=10, random_seed=None, **kwargs)[source]

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
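
Examples

A minimal sketch with a hypothetical audio file; the return value is assumed to be the augmented samples:

>>> import soundpy as sp
>>> sound, sr = sp.loadsound('speech.wav', sr=16000)
>>> noisy = sp.augment.add_white_noise(sound, sr, snr=10, random_seed=40)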

soundpy.augment.harmonic_distortion(sound, sr, **kwargs)[source]

Applies the sine function five times.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

soundpy.augment.pitch_increase(sound, sr, num_semitones=2, **kwargs)[source]

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

soundpy.augment.pitch_decrease(sound, sr, num_semitones=2, **kwargs)[source]

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

soundpy.augment.vtlp(sound, sr, a=(0.8, 1.2), random_seed=None, oversize_factor=16, win_size_ms=50, percent_overlap=0.5, bilinear_warp=True, real_signal=True, fft_bins=1024, window='hann', zeropad=True, expected_shape=None, visualize=False)[source]

Applies vocal tract length perturbations directly to the (oversized) DFT windows.

References

Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria.

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

soundpy.augment.get_augmentation_dict()[source]

Returns dictionary with augmentation options as keys and values set to False.

Examples

>>> import soundpy as sp
>>> ad = sp.augment.get_augmentation_dict()
>>> ad
{'speed_increase': False,
'speed_decrease': False,
'time_shift': False,
'shufflesound': False,
'add_white_noise': False,
'harmonic_distortion': False,
'pitch_increase': False,
'pitch_decrease': False,
'vtlp': False}
>>> # to set augmentation to True:
>>> ad['add_white_noise'] = True
>>> ad
{'speed_increase': False,
'speed_decrease': False,
'time_shift': False,
'shufflesound': False,
'add_white_noise': True,
'harmonic_distortion': False,
'pitch_increase': False,
'pitch_decrease': False,
'vtlp': False}
soundpy.augment.list_augmentations()[source]

Lists available augmentations.

Examples

>>> import soundpy as sp
>>> print(sp.augment.list_augmentations())
Available augmentations:
        speed_increase
        speed_decrease
        time_shift
        shufflesound
        add_white_noise
        harmonic_distortion
        pitch_increase
        pitch_decrease
        vtlp
soundpy.augment.get_augmentation_settings_dict(augmentation)[source]

Returns default settings of base function for augmentation.

Parameters

augmentation (str) – The augmentation of interest.

Returns

aug_defaults – A dictionary with the base augmentation function parameters as keys and default values as values.

Return type

dict

Examples

>>> import soundpy as sp
>>> d = sp.augment.get_augmentation_settings_dict('speed_decrease')
>>> d
{'perc': 0.15}
>>> # can use this dictionary to apply different values for augmentation
>>> d['perc'] = 0.1
>>> d
{'perc': 0.1}
>>> # to build a dictionary with several settings:
>>> many_settings_dict = {}
>>> many_settings_dict['add_white_noise'] = sp.augment.get_augmentation_settings_dict('add_white_noise')
>>> many_settings_dict['pitch_increase'] = sp.augment.get_augmentation_settings_dict('pitch_increase')
>>> many_settings_dict
{'add_white_noise': {'noise_level': 0.01, 'snr': 10, 'random_seed': None},
'pitch_increase': {'num_semitones': 2}}
>>> # change 'snr' default values to list of several values
>>> # this would apply white noise at either 10, 15, or 20 SNR, at random
>>> many_settings_dict['add_white_noise']['snr'] = [10, 15, 20]
>>> # change number of semitones pitch increase is applied
>>> many_settings_dict['pitch_increase']['num_semitones'] = 1
>>> many_settings_dict
{'add_white_noise': {'noise_level': 0.01,
'snr': [10, 15, 20],
'random_seed': None},
'pitch_increase': {'num_semitones': 1}}
Raises

ValueError – If augmentation does not match available augmentations.

See also

soundpy.models.dataprep.augment_features

The above dictionary example many_settings_dict can be applied under the parameter augment_settings_dict to apply augmentation settings when augmenting data, for example, within a generator function. See soundpy.models.dataprep.GeneratorFeatExtraction.

Working with audio files

The files module contains functions related to handling audio data files, for example loading audio files, saving audio files, and examining and reformatting audio files.

soundpy.files.loadsound(filename, sr=None, mono=True, dur_sec=None, remove_dc=True, use_scipy=False)[source]

Loads sound file with scipy.io.wavfile.read or librosa.load (default librosa)

Parameters
  • filename (str) – The filename of the sound to be loaded

  • sr (int, optional) – The desired sample rate of the audio samples. If None, the sample rate of the audio file will be used.

  • mono (bool) – If True, the samples will be loaded in mono sound. If False, if the samples are in stereo, they will be loaded in stereo sound.

  • dur_sec (int, float, optional) – The length in seconds of the audio signal.

  • remove_dc (bool) – If True, the mean is subtracted from the signal. This has proven very helpful when working with audio data. (default True)

  • use_scipy (bool) – If False, librosa will be used to load the audiofile. If True, scipy.io.wavfile and/or soundfile will be used. If the sound file is not compatible with scipy.io.wavfile.read, this function converts the file to .wav format and/or changes the bit depth to be compatible. (default False)

Returns

  • data (np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – The normalized (between -1 and 1) sample data returned according to the specified settings.

  • sr (int) – The sample rate of the loaded samples.

See also

soundpy.files.prep4scipywavfile

Prepares audio file for scipy.io.wavfile.read.

soundpy.files.convert_audiofile

Converts audio file to .wav format.

soundpy.files.newbitdepth

Converts audio file to specified bitdepth.

soundpy.dsp.resample_audio

Resamples audio data to a specified sample rate.

soundpy.files.list_possibleformats

Lists the possible formats to load with soundpy.loadsound

librosa.load

The package used to load sound data by default. See librosa.

scipy.io.wavfile.read

The package used to load sound if use_scipy is set to True. See scipy.

soundpy.dsp.remove_dc_bias

Removes the ‘direct current’ bias from the signal.
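
Examples

A minimal sketch with a hypothetical file, resampling to 16 kHz and limiting the signal to 3 seconds:

>>> import soundpy as sp
>>> data, sr = sp.loadsound('sound.wav', sr=16000, dur_sec=3)
>>> sr
16000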

soundpy.files.savesound(audiofile_name, signal_values, sr, remove_dc=True, overwrite=False, use_scipy=False, **kwargs)[source]

Saves the signal as an audio file at the designated path.

Parameters
  • audiofile_name (str or pathlib.PosixPath) – The path and name under which the audio is to be saved (.wav format).

  • signal_values (ndarray) – The values of the real signal to be saved.

  • sr (int) – The sample rate of the audio samples.

  • remove_dc (bool) – If True, the mean is subtracted from the signal. (default True)

  • overwrite (bool) – If True, audio with the same name will be overwritten. (default False)

  • use_scipy (bool) – If True, scipy.io.wavfile.write will be used; however, it can only save .wav files. Otherwise soundfile.write will be used, which can save audio in more formats.

  • **kwargs (additional keyword arguments) – The keyword arguments for soundfile.write: https://pysoundfile.readthedocs.io/en/latest/index.html?highlight=write#soundfile.write

Returns

audiofile_name – The new audiofile name

Return type

pathlib.PosixPath

See also

scipy.io.wavfile.write

soundpy.files.conversion_formats

Lists the possible formats to save audio files if use_scipy is False.

soundpy.dsp.remove_dc_bias

Removes the ‘direct current’ bias from the signal.
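
Examples

A minimal sketch saving a generated 440 Hz tone under a hypothetical filename:

>>> import numpy as np
>>> import soundpy as sp
>>> sr = 16000
>>> tone = 0.3 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 second, 440 Hz
>>> saved_path = sp.files.savesound('tone.wav', tone, sr, overwrite=True)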

soundpy.files.get_file_format(audiofile)[source]

Uses soundfile to get the file format.

soundpy.files.list_possibleformats(use_scipy=False)[source]
soundpy.files.list_audioformats()[source]
soundpy.files.audiofiles_present(directory, recursive=False)[source]

Checks to see if audio files are present.

Parameters
  • directory (str or pathlib.PosixPath) – The directory to look for audio.

  • recursive (bool) – If True, all nested directories will be checked as well. (default False)

Returns

True if audio is present; otherwise False.

Return type

bool

soundpy.files.collect_audiofiles(directory, hidden_files=False, wav_only=False, recursive=False)[source]

Collects all audio files within a given directory.

This includes the option to include hidden_files in the collection.

Parameters
  • directory (str or pathlib.PosixPath) – The path to where desired files are located.

  • hidden_files (bool) – If True, hidden files will be included. If False, they won’t. (default False)

  • wav_only (bool) – If True, only .wav files will be included. Otherwise, no limit on file type.

Returns

paths_list – Sorted list of file pathways.

Return type

list of pathlib.PosixPath objects
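
Examples

A minimal sketch with a hypothetical directory:

>>> import soundpy as sp
>>> paths = sp.files.collect_audiofiles('audiodata/', recursive=True)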

soundpy.files.collect_zipfiles(directory, hidden_files=False, ext='tgz', recursive=False)[source]

Collects all zipfiles within a given directory.

This includes the option to include hidden_files in the collection.

Parameters
  • directory (str or pathlib.PosixPath) – The path to where desired files are located.

  • hidden_files (bool) – If True, hidden files will be included. If False, they won’t. (default False)

  • ext (str) – The extension of the zipfiles to collect. (default ‘tgz’)

Returns

paths_list – Sorted list of file pathways.

Return type

list of pathlib.PosixPath objects

soundpy.files.ensure_only_audiofiles(audiolist)[source]
soundpy.files.prep4scipywavfile(filename, overwrite=False)[source]

Takes a sound file and saves it in a format compatible with scipy.io.wavfile

Parameters

filename (str) – Filename of the soundfile to load with scipy.io.wavfile

Returns

filename – Filename of the soundfile compatible with scipy.io.wavfile

Return type

str

soundpy.files.conversion_formats()[source]

Lists the formats available for conversion.

soundpy uses soundfile to convert files; therefore, whatever available formats soundfile has will be listed here.

Examples

>>> conversion_formats()
{'AIFF': 'AIFF (Apple/SGI)',
'AU': 'AU (Sun/NeXT)',
'AVR': 'AVR (Audio Visual Research)',
'CAF': 'CAF (Apple Core Audio File)',
'FLAC': 'FLAC (Free Lossless Audio Codec)',
'HTK': 'HTK (HMM Tool Kit)',
'SVX': 'IFF (Amiga IFF/SVX8/SV16)',
'MAT4': 'MAT4 (GNU Octave 2.0 / Matlab 4.2)',
'MAT5': 'MAT5 (GNU Octave 2.1 / Matlab 5.0)',
'MPC2K': 'MPC (Akai MPC 2k)',
'OGG': 'OGG (OGG Container format)',
'PAF': 'PAF (Ensoniq PARIS)',
'PVF': 'PVF (Portable Voice Format)',
'RAW': 'RAW (header-less)',
'RF64': 'RF64 (RIFF 64)',
'SD2': 'SD2 (Sound Designer II)',
'SDS': 'SDS (Midi Sample Dump Standard)',
'IRCAM': 'SF (Berkeley/IRCAM/CARL)',
'VOC': 'VOC (Creative Labs)',
'W64': 'W64 (SoundFoundry WAVE 64)',
'WAV': 'WAV (Microsoft)',
'NIST': 'WAV (NIST Sphere)',
'WAVEX': 'WAVEX (Microsoft)',
'WVE': 'WVE (Psion Series 3)',
'XI': 'XI (FastTracker 2)'}
soundpy.files.convert_audiofile(filename, format_type=None, sr=None, new_dir=False, overwrite=False, use_scipy=False, **kwargs)[source]

Converts and saves an audio file in a specified format (default .wav) in the same or a new directory.

Parameters
  • filename (str or pathlib.PosixPath) – The filename of the audiofile to be converted to .wav type

  • format_type (str) – The format to convert the audio file to. See soundpy.files.conversion_formats. (defaults to ‘wav’)

  • new_dir (str, pathlib.PosixPath, optional) – If False, the converted files will be saved in same directory as originals. If a path is provided, the converted files will be saved there. If no such directory exists, one will be created.

  • sr (int, optional) – The sample rate to be applied to the signal. If none supplied, the sample rate of the original file will be used.

  • **kwargs (additional keyword arguments) – The keyword arguments for soundfile.write: https://pysoundfile.readthedocs.io/en/latest/index.html?highlight=write#soundfile.write

Returns

f_wavfile – The filename / path where the audio file is saved.

Return type

pathlib.PosixPath

Examples

>>> audiofile = './example/audio.wav'
>>> # in same directory
>>> audiofile_flac = sp.files.convert_audiofile(audiofile, format_type='flac')
>>> audiofile_flac
PosixPath('example/audio.flac')
>>> # in new directory
>>> audiofile_flac = sp.files.convert_audiofile(audiofile, format_type='flac',
...                                             new_dir='./examples2/')
>>> audiofile_flac
PosixPath('examples2/audio.flac')
>>> # can establish desired conversion format in `new_dir`
>>> audiofile_ogg = sp.files.convert_audiofile(audiofile,
...                                            new_dir='./examples2/audio.ogg')
>>> audiofile_ogg
PosixPath('examples2/audio.ogg')

See also

soundpy.files.conversion_formats

Lists the possible formats to convert audio files.

soundpy.files.list_possibleformats

Lists the possible formats to load with soundpy.loadsound

soundpy.files.replace_ext(filename, extension)[source]

Adds or replaces an extension in the filename

Parameters
  • filename (str or pathlib.PosixPath) – Filename with the missing or incorrect extension

  • extension (str) – The correct extension for the given filename.

Returns

file_newext – The filename with the new extension

Return type

pathlib.PosixPath

soundpy.files.match_ext(filename1, filename2)[source]

Matches the file extensions.

If both have extensions, default set to that of filename1.

soundpy.files.newbitdepth(wave, bitdepth=16, newname=None, overwrite=False)[source]

Convert bitdepth to 16 or 32, to ensure compatibility with scipy.io.wavfile

Scipy.io.wavfile is easily used online, for example in Jupyter notebooks.

soundpy.files.adjustname(filename, adjustment=None)[source]

Adjusts filename.

Parameters
  • filename (str) – The filename to be adjusted

  • adjustment (str, optional) – The adjustment to add to the filename. If None, the string ‘_adj’ will be added.

Returns

fname – The adjusted filename with the original extension

Return type

str

Examples

>>> adjustname('happy.md')
'happy_adj.md'
>>> adjustname('happy.md', '_not_sad')
'happy_not_sad.md'
soundpy.files.extract(tar_url, extract_path='.')[source]
soundpy.files.delete_dir_contents(directory, remove_dir=False)[source]

https://stackoverflow.com/a/28834214

soundpy.files.matching_filenames(list1, list_of_lists)[source]
soundpy.files.remove_contaminated_files(list1, contaminated_files)[source]

Organizing datasets

The datasets module contains functions related to organizing datasets.

soundpy.datasets.create_encodedlabel2audio_dict(dict_encodelabels, paths_list, limit=None, seed=40)[source]

Creates dictionary with audio labels as keys and filename lists as values.

If no label is found in the filename path, the label is not included in the returned dictionary: labels are only included if corresponding paths are present.

Parameters
  • dict_encodelabels (dict) – Dictionary containing the labels as keys and their encoded values as values.

  • paths_list (set, list) – List containing pathlib.PosixPath objects (i.e. paths) of all audio files; the audio files are expected to reside in directories whose names match their audio class.

  • limit (int, optional) – An integer limiting the number of audiofiles per class. This may be useful if one wants to ensure a balanced dataset. (default None)

  • seed (int, optional) – The seed for pseudorandomizing the wavfiles, if a limit is requested. If seed is set to None, the randomized order of the limited wavfiles cannot be repeated. (default 40)

Returns

  • label_waves_dict (OrderedDict) – A dictionary with encoded audio labels as keys with values being the audio files corresponding to that label

Examples

>>> from pathlib import Path
>>> labels = dict([('vacuum',2),('fridge',0),('wind',1)])
>>> paths = [Path('data/audio/vacuum/vacuum1.wav'),
...         Path('data/audio/fridge/fridge1.wav'),
...         Path('data/audio/vacuum/vacuum2.wav'),
...         Path('data/audio/wind/wind1.wav')]
>>> label_waves_dict = create_encodedlabel2audio_dict(labels, paths)
>>> label_waves_dict
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav'), PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> #to set a limit on number of audiofiles per class:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=40)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> #change the limited pathways chosen:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=10)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
soundpy.datasets.create_dicts_labelsencoded(labels_class, add_extra_label=False, extra_label='silence')[source]

Encodes audio class labels and saves in dictionaries.

The labels are alphabetized and encoded under their index. If add_extra_label, the extra_label is added as the last entry in the dictionary. This is useful if performing voice activity detection and one wants to label non-voice-activated sections as silent rather than with some speech label.

Parameters
  • labels_class (set, list) – Set or list containing the labels of all audio classes.

  • add_extra_label (bool) – If True, extra_label added to dictionary.

  • extra_label (str) – The extra label to add. (default ‘silence’).

Returns

  • dict_label2int (dict) – Dictionary where the keys are the string labels and the values are the encoded integers

  • dict_int2label (dict) – Dictionary where the keys are the encoded integers and the values are the string labels

Examples

>>> labels = {'wind','air_conditioner','fridge'}
>>> label2int, int2label = create_dicts_labelsencoded(labels)
>>> label2int
{'air_conditioner': 0, 'fridge': 1, 'wind': 2}
>>> int2label
{0: 'air_conditioner', 1: 'fridge', 2: 'wind'}
soundpy.datasets.waves2dataset(audiolist, perc_train=0.8, seed=40, train=True, val=True, test=True)[source]

Organizes audio files list into train, validation and test datasets.

If only two datasets (or one) are to be prepared, they will be assigned to train and val, or simply to train, respectively. The remaining ‘datasets’ will remain empty.

Parameters
  • audiolist (list) – List containing paths to audio files

  • perc_train (float, int) – Percentage of data to be in the training dataset (default 0.8)

  • seed (int, None, optional) – Set seed for the generation of pseudorandom train, validation, and test datsets. Useful for reproducing results. (default 40)

  • train (bool) – If True, assumed the training data will be prepared. (default True)

  • val (bool) – If True, assumed validation data will be prepared. (default True)

  • test (bool) – If True, assumed test data will be prepared. (default True)

Returns

  • train_waves (list) – List of audio files for the training dataset

  • val_waves (list) – List of audio files for the validation dataset

  • test_waves (list) – List of audio files for the test dataset

Examples

>>> #Using a list of numbers instead of filenames
>>> audiolist = [1,2,3,4,5,6,7,8,9,10]
>>> #default settings:
>>> waves2dataset(audiolist)
([5, 4, 9, 2, 3, 10, 1, 6], [8], [7])
>>> #perc_train set to 50% instead of 80%:
>>> waves2dataset(audiolist, perc_train=50)
([5, 4, 9, 2, 3, 10], [1, 6], [8, 7])
>>> #change seed number
>>> waves2dataset(audiolist, seed=0)
([7, 1, 2, 5, 6, 9, 10, 8], [4], [3])
soundpy.datasets.audio2datasets(audiodata, perc_train=0.8, limit=None, seed=None, audio_only=True, **kwargs)[source]

Organizes all audio in audio class directories into datasets (randomized).

The validation and test datasets are halved between what isn’t train data. For example, if perc_train is 0.8, validation data will be 0.1 and test data will be 0.1.

Parameters
  • audiodata (str, pathlib.PosixPath, dict, list, or set) – If the data has multiple labels, the path to the dictionary where audio class labels and the paths of all audio files belonging to each class are or will be stored. The dictionary with the labels and their encoded values can also be supplied directly here. If the data does not have labels, a list or set of audiofiles can be provided to be placed in train, val, and test datasets.

  • seed (int, optional) – A value to allow random order of audiofiles to be predictable. (default None). If None, the order of audiofiles will not be predictable.

  • audio_only (bool) – If audio files are expected (e.g. extensions of .wav, .flac etc.) or not. If True, list will be checked to contain only audio files. Otherwise not. (default True)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.datasets.waves2dataset

Returns

dataset_audio – Named tuple including three lists / datasets of audiofiles or label-audiofile pairs: the train, validation, and test lists, respectively. The label-audiofile pairs are saved as tuples within the lists and contain the encoded label integer (e.g. 0 instead of ‘air_conditioner’) and the audio paths associated to that class and dataset.

Return type

tuple

Raises

ValueError – If perc_train is set too high for the amount of data or there are simply too few data. Specifically, if the percentage of train data cannot be upheld while also ensuring the validation and test datasets have more than 1 sample.
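
Examples

A minimal sketch; 'data/audio/' is a hypothetical directory with class subfolders. Because a named tuple of three lists is returned, it can be unpacked directly:

>>> from soundpy.datasets import audio2datasets
>>> train, val, test = audio2datasets('data/audio/', perc_train=0.8, seed=40)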

soundpy.datasets.separate_train_val_test_files(list_of_files)[source]

Checks that file(s) exist, then sorts file(s) into train, val, test lists.

If ‘nois’ or ‘clean’ are in the filenames, two paths lists per dataset will be generated; otherwise just one. These paths lists are useful if there are multiple training files available for training a model (e.g. for large datasets).

Parameters

list_of_files (list, str, or pathlib.PosixPath) – The feature files (format: .npy) for training a model.

Returns

(train_paths_list, val_paths_list, test_paths_list) – Tuple comprised of paths lists to train, validation, and test data files. If noisy and clean data files found, each tuple item will be a tuple comprised of two lists: a noisy file paths list and a clean file paths list.

Return type

tuple

Examples

>>> features_files = ['train1.npy', 'train2.npy', 'val.npy', 'test.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train
[PosixPath('train1.npy'), PosixPath('train2.npy')]
>>> datasets.val
[PosixPath('val.npy')]
>>> datasets.test
[PosixPath('test.npy')]
>>> # try with noisy and clean data
>>> features_files = ['train_noisy.npy', 'train_clean.npy', 'val_noisy.npy', 'val_clean.npy', 'test_noisy.npy', 'test_clean.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train.noisy
[PosixPath('train_noisy.npy')]
>>> datasets.train.clean
[PosixPath('train_clean.npy')]
>>> datasets.val.noisy
[PosixPath('val_noisy.npy')]
>>> datasets.val.clean
[PosixPath('val_clean.npy')]
>>> datasets.test.noisy
[PosixPath('test_noisy.npy')]
>>> datasets.test.clean
[PosixPath('test_clean.npy')]
soundpy.datasets.section_data(dataset_dict, dataset_paths_dict, divide_factor=None)[source]

Expects keys of these two dictionaries to match

Examples

>>> import pathlib
>>> # train is longer than val and test
>>> d = {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
...      'val': [1, 2, 3, 4, 5],
...      'test': [1, 2, 3, 4, 5]}
>>> # dictionary: paths to where extracted data will be saved
>>> dp = {'train': pathlib.PosixPath('train_data.npy'),
...       'val': pathlib.PosixPath('val_data.npy'),
...       'test': pathlib.PosixPath('test_data.npy')}
>>> d2, dp2 = section_data(d, dp, divide_factor = 3)
>>> # val and test not touched (too small)
>>> d2
{'train__1': [1, 2, 3, 4, 5], 'train__2': [6, 7, 8, 9, 10], 'train__3': [11, 12, 13, 14, 15], 'val': [1, 2, 3, 4, 5], 'test': [1, 2, 3, 4, 5]}
>>> dp2
{'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'val': PosixPath('val_data.npy'), 'test': PosixPath('test_data.npy')}
>>> # repeat: now val and test are as long as the train sections
>>> # default divide_factor is 2
>>> d3, dp3 = section_data(d2, dp2)
>>> d3
{'train__1': [1, 2], 'train__2': [3, 4, 5], 'train__3': [6, 7], 'train__4': [8, 9, 10], 'train__5': [11, 12], 'train__6': [13, 14, 15], 'val__1': [1, 2], 'val__2': [3, 4, 5], 'test__1': [1, 2], 'test__2': [3, 4, 5]}
>>> dp3
{'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'train__4': PosixPath('train_data__4.npy'), 'train__5': PosixPath('train_data__5.npy'), 'train__6': PosixPath('train_data__6.npy'), 'val__1': PosixPath('val_data__1.npy'), 'val__2': PosixPath('val_data__2.npy'), 'test__1': PosixPath('test_data__1.npy'), 'test__2': PosixPath('test_data__2.npy')}

Working with signals

The dsp module contains functions pertaining to the actual generation, manipulation, and analysis of sound, ranging from generating sounds to calculating signal-to-noise ratio.

soundpy.dsp.generate_sound(freq=200, amplitude=0.4, sr=8000, dur_sec=0.25)[source]

Generates a sound signal with the provided parameters. Signal begins at 0.

Parameters
  • freq (int, float) – The frequency in Hz the signal should have (default 200 Hz). This pertains to the number of oscillations per second.

  • amplitude (int, float) – The parameter controlling how much energy the signal should have. (default 0.4)

  • sr (int) – The sampling rate of the signal, or how many samples make up the signal per second. (default 8000)

  • dur_sec (int, float) – The duration in seconds the generated signal should have. (default 0.25)

Returns

  • sound_samples (np.ndarray [size = (num_samples,)]) – The samples of the generated sound

  • sr (int) – The sample rate of the generated signal

Examples

>>> sound, sr = generate_sound(freq=5, amplitude=0.5, sr=5, dur_sec=1)
>>> sound
array([ 0.000000e+00,  5.000000e-01,  3.061617e-16, -5.000000e-01, -6.123234e-16])
>>> sr
5
soundpy.dsp.get_time_points(dur_sec, sr)[source]

Get evenly spaced time points from zero to dur_sec.

The time points align with the provided sample rate, making it easy to plot a signal with a time line in seconds.

Parameters
  • dur_sec (int, float) – The amount of time in seconds

  • sr (int) – The sample rate relevant for the signal

Returns

time

Return type

np.ndarray [size = (num_time_points,)]

Examples

>>> # 50 milliseconds at sample rate of 100 (100 samples per second)
>>> x = get_time_points(0.05,100)
>>> x.shape
(5,)
>>> x
array([0.    , 0.0125, 0.025 , 0.0375, 0.05  ])
soundpy.dsp.generate_noise(num_samples, amplitude=0.025, random_seed=None)[source]

Generates noise to be of a certain amplitude and number of samples.

Useful for adding noise to another signal of length num_samples.

Parameters
  • num_samples (int) – The number of total samples making up the noise signal.

  • amplitude (float) – Allows the noise signal to be louder or quieter. (default 0.025)

  • random_seed (int, optional) – Useful for repeating ‘random’ noise samples.

Examples

>>> noise = generate_noise(5, random_seed = 0)
>>> noise
array([0.04410131, 0.01000393, 0.02446845, 0.05602233, 0.04668895])
soundpy.dsp.set_signal_length(samples, numsamps)[source]

Sets audio signal to be a certain length. Zeropads if too short.

Useful for setting signals to be a certain length, regardless of how long the audio signal is.

Parameters
  • samples (np.ndarray [size = (num_samples, num_channels), or (num_samples,)]) – The array of sample data to be zero padded.

  • numsamps (int) – The desired number of samples.

Returns

data – Copy of samples zeropadded or limited to numsamps.

Return type

np.ndarray [size = (numsamps, num_channels), or (numsamps,)]

Examples

>>> import numpy as np
>>> input_samples = np.array([1,2,3,4,5])
>>> output_samples = set_signal_length(input_samples, numsamps = 8)
>>> output_samples
array([1, 2, 3, 4, 5, 0, 0, 0])
>>> output_samples = set_signal_length(input_samples, numsamps = 4)
>>> output_samples
array([1, 2, 3, 4])
soundpy.dsp.scalesound(data, max_val=1, min_val=None)[source]

Scales the input array to range between min_val and max_val.

Parameters
  • data (np.ndarray [size = (num_samples,) or (num_samples, num_channels)]) – Original samples

  • max_val (int, float) – The maximum value the dataset is to range from (default 1)

  • min_val (int, float, optional) – The minimum value the dataset is to range from. If set to None, will be set to the opposite of max_val. E.g. if max_val is set to 0.8, min_val will be set to -0.8. (default None)

Returns

samples – Copy of original data, scaled to the min and max values.

Return type

np.ndarray [size = (num_samples,) or (num_samples, num_channels)]

Examples

>>> import numpy as np
>>> np.random.seed(0)
>>> input_samples = np.random.random_sample((5,))
>>> input_samples
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ])
>>> input_samples.max()
0.7151893663724195
>>> input_samples.min()
0.4236547993389047
>>> # default setting: between -1 and 1
>>> output_samples = scalesound(input_samples)
>>> output_samples
array([-0.14138   ,  1.        ,  0.22872961, -0.16834299, -1.        ])
>>> output_samples.max()
1.0
>>> output_samples.min()
-1.0
>>> # range between -100 and 100
>>> output_samples = scalesound(input_samples, max_val = 100, min_val = -100)
>>> output_samples
array([ -14.13800026,  100.        ,   22.87296052,  -16.83429866, -100.        ])
>>> output_samples.max()
100.0
>>> output_samples.min()
-100.0
soundpy.dsp.shape_samps_channels(data)[source]

Returns data in shape (num_samps, num_channels)

Parameters

data (np.ndarray [size= (num_samples,) or (num_samples, num_channels), or (num_channels, num_samples)]) – The data that needs to be checked for correct format

Returns

data

Return type

np.ndarray [size = (num_samples,) or (num_samples, num_channels)]

soundpy.dsp.resample_audio(samples, sr_original, sr_desired)[source]

Allows audio samples to be resampled to desired sample rate.

Parameters
  • samples (np.ndarray [size = (num_samples,)]) – The samples to be resampled.

  • sr_original (int) – The original sample rate of the samples.

  • sr_desired (int) – The desired sample rate of the samples.

Returns

  • resampled (np.ndarray [size = (num_samples_resampled,)]) – The resampled samples.

  • sr_desired (int) – The newly applied sample rate

Examples

>>> import numpy as np
>>> # example samples from 50 millisecond signal with sr 100 and frequency 10
>>> input_samples = np.array([0.00e+00, 2.82842712e-01, 4.000e-01, 2.82842712e-01, 4.89858720e-17])
>>> # we want to resample to 80 instead of 100 (for this example's sake)
>>> output_samples, sr = resample_audio(input_samples, sr_original = 100, sr_desired = 80)
>>> output_samples
array([-2.22044605e-17, 3.35408001e-01, 3.72022523e-01, 6.51178161e-02])
soundpy.dsp.stereo2mono(data)[source]

If sound data has multiple channels, reduces to first channel

Parameters

data (numpy.ndarray) – The series of sound samples, with 1+ columns/channels

Returns

data_mono – The series of sound samples, with first column

Return type

numpy.ndarray

Examples

>>> import numpy as np
>>> data = np.linspace(0,20)
>>> data_2channel = data.reshape(25,2)
>>> data_2channel[:5]
array([[0.        , 0.40816327],
       [0.81632653, 1.2244898 ],
       [1.63265306, 2.04081633],
       [2.44897959, 2.85714286],
       [3.26530612, 3.67346939]])
>>> data_mono = stereo2mono(data_2channel)
>>> data_mono[:5]
array([0.        , 0.81632653, 1.63265306, 2.44897959, 3.26530612])
soundpy.dsp.add_backgroundsound(audio_main, audio_background, sr, snr=None, pad_mainsound_sec=None, total_len_sec=None, wrap=False, stationary_noise=True, random_seed=None, extend_window_ms=0, remove_dc=False, mirror_sound=False, clip_at_zero=True, **kwargs)[source]

Adds a sound (i.e. background noise) to a target signal. Stereo sound should work.

If the sample rates of the two audio samples do not match, the sample rate of audio_main will be applied. (i.e. the audio_background will be resampled). If you have issues with clicks at the beginning or end of signals, see soundpy.dsp.clip_at_zero.

Parameters
  • audio_main (str, pathlib.PosixPath, or np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – Sound file of the main sound (will not be modified; only delayed if specified). If not a path or string, should be data samples corresponding to the provided sample rate.

  • audio_background (str, pathlib.PosixPath, or np.ndarray [size=(num_samples,)]) – Sound file of the background sound (will be modified / repeated to match or extend the length indicated). If not of type pathlib.PosixPath or string, should be data samples corresponding to the provided sample rate.

  • sr (int) – The sample rate of sounds to be added together. Note: sr of 44100 or higher is suggested.

  • snr (int, float, list, tuple) – The signal-to-noise ratio of the target and background signals. Note: this is an approximation and needs further testing and development to be used as an official measurement of snr. If no SNR is provided, the signals will be added together as-is. (default None)

  • pad_mainsound_sec (int or float, optional) – Length of time in seconds the background sound will pad the main sound. For example, if pad_mainsound_sec is set to 1, one second of the audio_background will be played before audio_main starts as well as after the main audio stops. (default None)

  • total_len_sec (int or float, optional) – Total length of combined sound in seconds. If none, the sound will end after the (padded) target sound ends (default None).

  • wrap (bool) – If False, the random selection of sound will be limited to end by the end of the audio file. If True, the random selection will wrap to the beginning of the audio file if it extends beyond the end of the audio file. (default False)

  • stationary_noise (bool) – If False, soundpy.feats.get_vad_stft will be applied to noise to get energy of the active noise in the signal. Otherwise energy will be collected via soundpy.dsp.get_stft. (default True)

  • random_seed (int) – If provided, the ‘random’ section of noise will be chosen using this seed. (default None)

  • extend_window_ms (int or float) – The number of milliseconds by which detected voice activity should be extended. This might be useful to ensure a sufficient amount of activity is calculated. (default 0)

  • remove_dc (bool) – If the dc bias should be removed. This aids in the removal of clicks. See soundpy.dsp.remove_dc_bias. (default False)

  • **kwargs (additional keyword arguments) – The keyword arguments for soundpy.files.loadsound

Returns

  • combined (numpy.ndarray [shape=(num_samples) or (num_samples, num_channels)]) – The samples of the sounds added together

  • snr (int, float) – The updated signal-to-noise ratio. Due to the non-stationary state of speech and sound in general, this value is only an approximation.

References

Yi Hu and Philipos C. Loizou, original authors

Copyright (c) 2006 by Philipos C. Loizou

SIP-Lab/CNN-VAD/GitHub Repo

Copyright (c) 2019 Signal and Image Processing Lab MIT License

See also

soundpy.files.loadsound

Loads audiofiles.

soundpy.dsp.snr_adjustnoiselevel

Calculates how much to adjust noise signal to achieve SNR.

soundpy.feats.get_vad_stft

Returns stft matrix of only voice active regions

soundpy.feats.get_stft

Returns stft matrix of entire signal
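
For orientation, a minimal usage sketch (the file names are hypothetical; parameters as described above):

>>> import soundpy as sp
>>> # add cafe noise to speech at an approximate SNR of 10 dB, with
>>> # 1 second of noise-only padding before and after the main sound
>>> combined, snr_approx = sp.dsp.add_backgroundsound(
...     'speech.wav', 'cafe_noise.wav', sr=44100, snr=10,
...     pad_mainsound_sec=1)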

soundpy.dsp.hz_to_mel(freq)[source]

Converts frequency to Mel scale

Parameters

freq (int, float, or array-like of ints / floats) – The frequency/ies to convert to Mel scale.

Returns

mel – The frequency/ies in Mel scale.

Return type

int or float or array of ints / floats

References

https://en.wikipedia.org/wiki/Mel_scale#Formula

Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html

soundpy.dsp.mel_to_hz(mel)[source]

Converts Mel item or list to frequency/ies.

Parameters

mel (int, float, or list of ints / floats) – Mel item(s) to be converted to Hz.

Returns

freq – The converted frequency/ies

Return type

int, float, or list of ints / floats

References

https://en.wikipedia.org/wiki/Mel_scale#Formula

Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
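
Both conversions follow the standard formula from the cited references; a minimal sketch in plain NumPy:

>>> import numpy as np
>>> freq = 1000.0
>>> mel = 2595 * np.log10(1 + freq / 700.0)       # Hz -> Mel
>>> freq_back = 700 * (10 ** (mel / 2595.0) - 1)  # Mel -> Hz (the inverse)
>>> round(freq_back)
1000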

soundpy.dsp.fbank_filters(fmin, fmax, num_filters)[source]

Calculates the mel filterbanks given a min and max frequency and num_filters.

Parameters
  • fmin (int, float) – Minimum frequency relevant in signal.

  • fmax (int, float) – Maximum frequency relevant in signal.

  • num_filters (int) – The number of evenly spaced filters (according to mel scale) between the fmin and fmax frequencies.

Returns

mel_points – An array of floats containing evenly spaced filters (according to mel scale).

Return type

np.ndarray [size=(num_filters,)]

References

Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
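
Based on the description above, the mel points can be sketched as values evenly spaced between fmin and fmax on the mel scale (a sketch, not necessarily the library's exact implementation):

>>> import numpy as np
>>> fmin, fmax, num_filters = 0, 8000, 6
>>> mel_min = 2595 * np.log10(1 + fmin / 700.0)
>>> mel_max = 2595 * np.log10(1 + fmax / 700.0)
>>> mel_points = np.linspace(mel_min, mel_max, num_filters)
>>> mel_points.shape
(6,)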

soundpy.dsp.sinosoidal_liftering(mfccs, cep_lifter=22)[source]

Reduces influence of higher coefficients; found useful in automatic speech recognition.

Parameters
  • mfccs (np.ndarray [shape=(num_samples, num_mfcc)]) – The matrix containing mel-frequency cepstral coefficients.

  • cep_lifter (int) – The amount to apply sinosoidal_liftering. (default 22)

References

Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
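
The lift itself follows the cited Fayek (2016) tutorial: a sinusoidal curve is multiplied onto the coefficients. A minimal sketch:

>>> import numpy as np
>>> mfccs = np.ones((3, 13))   # stand-in matrix of coefficients
>>> cep_lifter = 22
>>> n = np.arange(mfccs.shape[1])
>>> lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n / cep_lifter)
>>> mfccs_liftered = mfccs * lift   # scale each frame's coefficients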

soundpy.dsp.index_at_zero(samples, num_dec_places=2)[source]

Finds indices of start and end of utterance, given amplitude strength.

Parameters
  • samples (numpy.ndarray [size= (num_samples,) or (num_samples, num_channels)]) – The samples to index where the zeros surrounding speech are located.

  • num_dec_places (int) – The number of decimal places to which the lowest value in samples should be rounded. (default 2)

Returns

  • f_0 (int) – The index of the last occurring zero, right before speech or sound begins.

  • l_0 (int) – The index of the first occurring zero, after speech ends.

Examples

>>> import numpy as np
>>> signal = np.array([-1, 0, 1, 2, 3, 2, 1, 0, -1, -2, -3, -2, -1, 0, 1])
>>> zero_1, zero_2 = index_at_zero(signal)
>>> # +1 to include zero_2 in signal
>>> signal[zero_1:zero_2+1]
[ 0  1  2  3  2  1  0 -1 -2 -3 -2 -1  0]
>>> # does not assume a zero precedes any sample
>>> signal = np.array([1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1])
>>> zero_1, zero_2 = index_at_zero(signal)
>>> signal[zero_1:zero_2+1]
[ 0 -1 -2 -1  0]
soundpy.dsp.clip_at_zero(samples, samp_win=None, neg2pos=True, **kwargs)[source]

Clips the signal at samples close to zero.

The samples where clipping occurs cross the zero line from negative to positive. This clipping process allows for a smoother transition of audio, especially when concatenating audio.

Parameters
  • samples (np.ndarray [shape = (num_samples, ) or (num_samples, num_channels)]) – The array containing sample data. Should work on stereo sound.

  • start_with_zero (bool) – If True, the returned array will begin with 0 (or close to 0). Otherwise the array will end with 0.

  • neg2pos (bool) – If True, the returned array will begin with positive values and end with negative values. Otherwise, the array will be returned with the first zeros detected, regardless of surrounding positive or negative values.

  • samp_win (int, optional) – The window of samples to apply when clipping at zero crossings. The zero crossings adjacent to the main signal will be used. This is useful to remove already existing clicks within the signal, often found at the beginning and / or end of signals.

  • kwargs (additional keyword arguments) – Keyword arguments for soundpy.dsp.index_at_zero.

Warning

A warning is raised if only one zero is found.

Examples

>>> sig = np.array([-2,-1,0,1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1,0])
>>> clip_at_zero(sig) # defaults
[ 0  1  2  1  0 -1 -2 -1  0]
>>> # finds first and last instance of zeros, regardless of surrounding
>>> # negative or positive values in signal
>>> clip_at_zero(sig, neg2pos = False)
[ 0  1  2  1  0 -1 -2 -1  0  1  2  1  0]
>>> # avoid clicks at start of signal
>>> sig = np.array([0,-10,-20,-1,0,1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1,0])
>>> clip_at_zero(sig, samp_win = 5)
[ 0  1  2  1  0 -1 -2 -1  0]
soundpy.dsp.remove_dc_bias(samples, samp_win=None)[source]

Removes DC bias by subtracting mean from sample data.

Seems to work best without samp_win.

# TODO add moving average?

Parameters
  • samples (np.ndarray [shape=(samples, num_channels) or (samples)]) – The sample data to center around zero. This works on both mono and stereo data.

  • samp_win (int, optional) – Apply subtraction of mean at windows - experimental. (default None)

Returns

samps – The samples with zero mean.

Return type

np.ndarray [shape=(samples, num_channels) or (samples)]

References

Lyons, Richard. (2011). Understanding Digital Signal Processing (3rd Edition).

soundpy.dsp.apply_num_channels(sound_data, num_channels)[source]

Ensures data has indicated num_channels.

To increase number of channels, the first column will be duplicated. To limit channels, channels will simply be removed.

Parameters
  • sound_data (np.ndarray [size= (num_samples,) or (num_samples, num_channels)]) – The data to adjust the number of channels

  • num_channels (int) – The number of channels desired

Returns

data

Return type

np.ndarray [size = (num_samples, num_channels)]

Examples

>>> import numpy as np
>>> data = np.array([1, 1, 1, 1])
>>> data_3d = apply_num_channels(data, 3)
>>> data_3d
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])
>>> data_2d = apply_num_channels(data_3d, 2)
>>> data_2d
array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1]])
soundpy.dsp.apply_sample_length(data, target_len, mirror_sound=False, clip_at_zero=True)[source]

Extends a sound by repeating it until it reaches target_len. If target_len is shorter than the length of data, data will be shortened to the specified target_len.

This is perhaps useful when working with repetitive or stationary sounds.

Parameters
  • data (np.ndarray [size = (num_samples,) or (num_samples, num_channels)]) – The data to be checked or extended in length. If shape (num_channels, num_samples), the data will be reshaped to (num_samples, num_channels).

  • target_len (int) – The length of samples the input data should be.

Returns

new_data

Return type

np.ndarray [size=(target_len, ) or (target_len, num_channels)]

Examples

>>> import numpy as np
>>> data = np.array([1,2,3,4])
>>> sp.dsp.apply_sample_length(data, 12)
array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])
>>> # two channels
>>> data = np.zeros((3,2))
>>> data[:,0] = np.array([0,1,2])
>>> data[:,1] = np.array([1,2,3])
>>> data
array([[0., 1.],
       [1., 2.],
       [2., 3.]])
>>> sp.dsp.apply_sample_length(data,5)
array([[0., 1.],
       [1., 2.],
       [2., 3.],
       [0., 1.],
       [1., 2.]])
soundpy.dsp.zeropad_sound(data, target_len, sr, delay_sec=None)[source]

If the sound data needs to be a certain length, zero pad it.

Parameters
  • data (numpy.ndarray [size = (num_samples,) or (num_samples, num_channels)]) – The sound data that needs zero padding. Shape (len(data),).

  • target_len (int) – The number of samples the data should have

  • sr (int) – The samplerate of the data

  • delay_sec (int, float, optional) – If the data should be zero padded also at the beginning. (default None)

Returns

signal_zeropadded – The data zero padded.

Return type

numpy.ndarray [size = (target_len,) or (target_len, num_channels)]

Examples

>>> import numpy as np
>>> x = np.array([1,2,3,4])
>>> # with 1 second delay (with sr of 4, that makes 4 sample delay)
>>> x_zeropadded = zeropad_sound(x, target_len=10, sr=4, delay_sec=1)
>>> x_zeropadded
array([0., 0., 0., 0., 1., 2., 3., 4., 0., 0.])
>>> # without delay
>>> x_zeropadded = zeropad_sound(x, target_len=10, sr=4)
>>> x_zeropadded
array([1., 2., 3., 4., 0., 0., 0., 0., 0., 0.])
>>> # if signal is longer than desired length:
>>> x_zeropadded = zeropad_sound(x, target_len=3, sr=4)
UserWarning: The signal cannot be zeropadded and will instead be truncated as length of `data` is 4 and `target_len` is 3.
>>> x_zeropadded
array([1, 2, 3])
soundpy.dsp.get_num_channels(data)[source]
soundpy.dsp.combine_sounds(file1, file2, match2shortest=True, time_delay_sec=None, total_dur_sec=None)[source]

Combines sounds

Parameters
  • file1 (str) – One of two files to be added together

  • file2 (str) – Second of two files to be added together

  • match2shortest (bool) – If the lengths of the addition should be limited by the shorter sound. (default True)

  • time_delay_sec (int, float, optional) – The amount of time in seconds before the sounds are added together. The longer sound will play for this period of time before the shorter sound is added to it. (default None)

  • total_dur_sec (int, float, optional) – The total duration in seconds of the combined sounds. (default None)

Returns

  • added_sound (numpy.ndarray) – The sound samples of the two soundfiles added together

  • sr1 (int) – The sample rate of the original signals and added sound

soundpy.dsp.calc_frame_length(dur_frame_millisec, sr)[source]

Calculates the number of samples necessary for each frame

Parameters
  • dur_frame_millisec (int or float) – time in milliseconds each frame should be

  • sr (int) – sampling rate of the samples to be framed

Returns

frame_length – the number of samples necessary to fill a frame

Return type

int

Examples

>>> calc_frame_length(dur_frame_millisec=20, sr=1000)
20
>>> calc_frame_length(dur_frame_millisec=20, sr=48000)
960
>>> calc_frame_length(dur_frame_millisec=25.5, sr=22500)
573
soundpy.dsp.calc_num_overlap_samples(samples_per_frame, percent_overlap)[source]

Calculate the number of samples that constitute the overlap of frames

Parameters
  • samples_per_frame (int) – the number of samples in each window / frame

  • percent_overlap (int, float) – either an integer between 0 and 100 or a decimal between 0.0 and 1.0 indicating the amount of overlap of windows / frames

Returns

num_overlap_samples – the number of samples in the overlap

Return type

int

Examples

>>> calc_num_overlap_samples(samples_per_frame=100,percent_overlap=0.10)
10
>>> calc_num_overlap_samples(samples_per_frame=100,percent_overlap=10)
10
>>> calc_num_overlap_samples(samples_per_frame=960,percent_overlap=0.5)
480
>>> calc_num_overlap_samples(samples_per_frame=960,percent_overlap=75)
720
soundpy.dsp.calc_num_subframes(tot_samples, frame_length, overlap_samples, zeropad=False)[source]

Assigns total frames needed to process entire noise or target series

This function calculates the number of full frames that can be created given the total number of samples, the number of samples in each frame, and the number of overlapping samples.

Parameters
  • tot_samples (int) – total number of samples in the entire series

  • frame_length (int) – total number of samples in each frame / processing window

  • overlap_samples (int) – number of samples in overlap between frames

  • zeropad (bool, optional) – If False, number of subframes limited to full frames. If True, number of subframes extended to zeropad the last partial frame. (default False)

Returns

subframes – The number of subframes necessary to fully process the audio samples at given frame_length, overlap_samples, and zeropad.

Return type

int

Examples

>>> calc_num_subframes(30,10,5)
5
>>> calc_num_subframes(30,20,5)
3
soundpy.dsp.create_window(window_type, frame_length)[source]

Creates window according to set window type and frame length

The Hamming window tapers edges to around 0.08 while the Hann window tapers edges to 0.0. Both are commonly used in noise filtering.

Parameters
  • window_type (str) – type of window to be applied (default ‘hamming’)

  • frame_length (int) – the number of samples each window / frame should contain

Returns

window – a window of length frame_length

Return type

ndarray

Examples

>>> #create Hamming window
>>> hamm_win = create_window('hamming', frame_length=5)
>>> hamm_win
array([0.08, 0.54, 1.  , 0.54, 0.08])
>>> #create Hann window
>>> hann_win = create_window('hann',frame_length=5)
>>> hann_win
array([0. , 0.5, 1. , 0.5, 0. ])
soundpy.dsp.apply_window(samples, window, zeropad=False)[source]

Applies a predefined window to a section of samples. Mono and stereo sound are both handled.

The samples must be the same length as the window.

Parameters
  • samples (ndarray [shape=(num_samples,) or (num_samples, num_channels)]) – series of samples with the length of input window

  • window (ndarray [shape=(num_samples,) or (num_samples, num_channels)]) – window to be applied to the signal. If the window does not match the number of channels of the sample data, the window's first channel will be copied to fill the missing channels.

Returns

samples_win – series with tapered sides according to the window provided

Return type

ndarray

Examples

>>> import numpy as np
>>> input_signal = np.array([ 0.        ,  0.36371897, -0.302721,
...                         -0.1117662 ,  0.3957433 ])
>>> window_hamming = np.array([0.08, 0.54, 1.  , 0.54, 0.08])
>>> apply_window(input_signal, window_hamming)
array([ 0.        ,  0.19640824, -0.302721  , -0.06035375,  0.03165946])
>>> window_hann = np.array([0. , 0.5, 1. , 0.5, 0. ])
>>> apply_window(input_signal, window_hann)
array([ 0.        ,  0.18185948, -0.302721  , -0.0558831 ,  0.        ])
soundpy.dsp.add_channels(samples, channels_total)[source]

Copies columns of samples to create additional channels.

Parameters
  • samples (np.ndarray [shape=(num_samples) or (num_samples,num_channels)]) – The samples to add channels to.

  • channels_total (int) – The total number of channels desired. For example, if samples already has 2 channels and you want it to have 3, set channels_total to 3.

Returns

x – A copy of samples with desired number of channels.

Return type

np.ndarray [shape = (num_samples, channels_total)]

Examples

>>> import numpy as np
>>> samps_mono = np.array([1,2,3,4,5])
>>> samps_stereo2 = add_channels(samps_mono, 2)
>>> samps_stereo2
array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 4],
       [5, 5]])
>>> samps_stereo5 = add_channels(samps_stereo2, 5)
>>> samps_stereo5
array([[1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5]])

Warning

Warns if channels_total is less than or equal to the number of channels already present in samples; no channels are added in that case.

soundpy.dsp.average_channels(data)[source]

Averages all channels in a stereo signal into one channel.

Parameters

data (np.ndarray [size=(num_samples, num_channels)]) – The stereo data to average out. If mono data supplied, mono data is returned unchanged.

Returns

data averaged – Copy of data averaged into one channel.

Return type

np.ndarray [size=(num_samples)]

Examples

>>> import numpy as np
>>> input_samples1 = np.array([1,2,3,4,5])
>>> input_samples2 = np.array([1,1,3,3,5])
>>> input_2channels = np.vstack((input_samples1, input_samples2)).T
>>> input_averaged = average_channels(input_2channels)
>>> input_averaged
array([1. , 1.5, 3. , 3.5, 5. ])
soundpy.dsp.calc_fft(signal_section, real_signal=None, fft_bins=None, **kwargs)[source]

Calculates the fast Fourier transform of a time series. Should work with stereo signals.

The length of the signal_section determines the number of frequency bins analyzed if fft_bins not set. Therefore, if there are higher frequencies in the signal, the length of the signal_section should be long enough to accommodate those frequencies.

Frequency bins with energy levels around zero denote frequencies not prevalent in the signal; frequency bins with prevalent energy levels indicate which frequencies are in the signal, as well as their amplitudes.

Parameters
  • signal_section (ndarray [shape = (num_samples) or (num_samples, num_channels)]) – the series that the fft will be applied to. If stereo sound, will return a FFT for each channel.

  • real_signal (bool) – If True, only half of the fft will be returned (the fft is mirrored). Otherwise the full fft will be returned.

  • kwargs (additional keyword arguments) – keyword arguments for numpy.fft.fft or numpy.fft.rfft

Returns

fft_vals – the series transformed into the frequency domain with the same shape as the input series

Return type

ndarray [shape=(num_fft_bins), or (num_fft_bins, num_channels), dtype=np.complex_]
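
Since the keyword arguments pass through to numpy.fft.fft or numpy.fft.rfft, the core of the transform can be sketched directly in NumPy:

>>> import numpy as np
>>> signal_section = np.sin(2 * np.pi * 5 * np.arange(16) / 16)  # one frame
>>> fft_vals = np.fft.fft(signal_section)    # full, mirrored spectrum
>>> fft_half = np.fft.rfft(signal_section)   # analogue of real_signal=True
>>> fft_vals.shape, fft_half.shape
((16,), (9,))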

soundpy.dsp.ismono(data)[source]
soundpy.dsp.calc_power(fft_vals)[source]

Calculates the power of fft values

Parameters

fft_vals (ndarray (complex or floats)) – the fft values of a windowed section of a series

Returns

power_spec – the squared absolute value of the input fft values

Return type

ndarray

Example

>>> import numpy as np
>>> matrix = np.array([[1,1,1],[2j,2j,2j],[-3,-3,-3]],
...                     dtype=np.complex_)
>>> calc_power(matrix)
array([[0.33333333, 0.33333333, 0.33333333],
       [1.33333333, 1.33333333, 1.33333333],
       [3.        , 3.        , 3.        ]])
soundpy.dsp.calc_average_power(matrix, num_iters)[source]

Divides matrix values by the number of times power values were added.

This function assumes the power values of n-number of series were calculated and added. It divides the values in the input matrix by n, i.e. ‘num_iters’.

Parameters
  • matrix (ndarray) – a collection of floats or ints representing the sum of power values across several series sets

  • num_iters (int) – an integer denoting the number of times power values were added to the input matrix

Returns

matrix – the averaged input matrix

Return type

ndarray

Examples

>>> matrix = np.array([[6,6,6],[3,3,3],[1,1,1]])
>>> ave_matrix = calc_average_power(matrix, 3)
>>> ave_matrix
array([[2.        , 2.        , 2.        ],
       [1.        , 1.        , 1.        ],
       [0.33333333, 0.33333333, 0.33333333]])
soundpy.dsp.calc_phase(fft_matrix, radians=False)[source]

Calculates phase from complex fft values.

Parameters
  • fft_matrix (np.ndarray [shape=(num_frames, num_features), dtype=complex]) – matrix with fft values

  • radians (boolean) – If False, complex values are returned; if True, radians are returned. (default False)

Returns

phase – Phase values for fft_matrix. If radians is set to False, dtype = complex. If radians is set to True, dtype = float.

Return type

np.ndarray [shape=(num_frames, num_features)]

Examples

>>> import numpy as np
>>> frame_length = 10
>>> time = np.arange(0, 10, 0.1)
>>> signal = np.sin(time)[:frame_length]
>>> fft_vals = np.fft.fft(signal)
>>> phase = calc_phase(fft_vals, radians=False)
>>> phase[:2]
array([ 1.        +0.j        , -0.37872566+0.92550898j])
>>> phase = calc_phase(fft_vals, radians=True)
>>> phase[:2]
array([0.        , 1.95921533])
soundpy.dsp.reconstruct_whole_spectrum(band_reduced_noise_matrix, n_fft=None)[source]

Reconstruct whole spectrum by mirroring complex conjugate of data.

Parameters
  • band_reduced_noise_matrix (np.ndarray [size=(n_fft,), dtype=np.float or np.complex_]) – Matrix with either power or fft values of the left part of the fft. The whole fft can be provided; however the right values will be overwritten by a mirrored left side.

  • n_fft (int, optional) – If None, n_fft set to length of band_reduced_noise_matrix. n_fft defines the size of the mirrored vector.

Returns

output_matrix – Mirrored vector of input data.

Return type

np.ndarray [size = (n_fft,), dtype=np.float or np.complex_]

Examples

>>> x = np.array([3.,2.,1.,0.])
>>> # double the size of x
>>> x_rec = sp.dsp.reconstruct_whole_spectrum(x, n_fft=int(len(x)*2))
>>> x_rec
array([3., 2., 1., 0., 0., 1., 2., 3.])
>>> # overwrite right side of data
>>> x = np.array([3.,2.,1.,0.,0.,2.,3.,5.])
>>> x_rec = sp.dsp.reconstruct_whole_spectrum(x, n_fft=len(x))
>>> x_rec
array([3., 2., 1., 0., 0., 1., 2., 3.])
soundpy.dsp.apply_original_phase(spectrum, phase)[source]

Multiplies phase with the magnitude or power spectrum

Parameters
  • spectrum (np.ndarray [shape=(n,), dtype=np.float or np.complex]) – Magnitude or power spectrum

  • phase (np.ndarray [shape=(n,), dtype=np.float or np.complex]) – Phase to be applied to spectrum

Returns

spectrum_complex

Return type

np.ndarray [shape=(n,), dtype = np.complex]

soundpy.dsp.calc_posteri_snr(target_power_spec, noise_power_spec)[source]

Calculates the (a posteriori) signal-to-noise ratio of the current frame

Parameters
  • target_power_spec (ndarray) – matrix with power values of the target signal

  • noise_power_spec (ndarray) – matrix with power values of the noise signal

Returns

posteri_snr – matrix containing the signal to noise ratio

Return type

ndarray

Examples

>>> sig_power = np.array([6,6,6,6])
>>> noise_power = np.array([2,2,2,2])
>>> calc_posteri_snr(sig_power, noise_power)
array([3., 3., 3., 3.])
soundpy.dsp.get_max_index(matrix)[source]

If not np.ndarray, expects real sample data.

soundpy.dsp.get_local_target_high_power(target_samples, sr, local_size_ms=25, min_power_percent=0.25)[source]
soundpy.dsp.get_vad_snr(target_samples, noise_samples, sr, extend_window_ms=0)[source]

Approximates the signal-to-noise ratio from two sets of power spectra

Note: this is a simple implementation and should not be used for official/exact measurement of snr.

Parameters
  • target_samples (np.ndarray [size = (num_samples, )]) – The samples of the main / speech signal. Only frames with higher levels of energy will be used to calculate SNR.

  • noise_samples (np.ndarray [size = (num_samples, )]) – The samples of background noise. Expects only noise, no speech. Must be the same sample rate as the target_samples

  • sr (int) – The sample rate for the audio samples.

  • local_size_ms (int or float) – The length in milliseconds to calculate level of SNR. (default 25)

  • min_power_percent (float) – The minimum percentage of energy / power the target samples should have. This is to look at only sections with speech or other signal of interest and not periods of silence. Value should be between 0 and 1. (default 0.25)

References

http://www1.icsi.berkeley.edu/Speech/faq/speechSNR.html

Gomolka, Ryszard. (2017). Re: How to measure signal-to-noise ratio (SNR) in real time?. Retrieved from: https://www.researchgate.net/post/How_to_measure_signal-to-noise_ratio_SNR_in_real_time/586a880f217e2060b65a8853/citation/download.

https://www.who.int/occupational_health/publications/noise1.pdf

soundpy.dsp.snr_adjustnoiselevel(target_samples, noise_samples, sr, snr)[source]

Computes scale factor to adjust noise samples to achieve snr.

From script addnoise_asl_nseg.m: This function adds noise to a file at a specified SNR level. It uses the active speech level to compute the speech energy. The active speech level is computed as per ITU-T P.56 standard.

soundpy Note: this functionality was pulled from the MATLAB script: addnoise_asl_nseg.m at this GitHub repo: https://github.com/SIP-Lab/CNN-VAD/blob/master/Training%20Code/Functions/addnoise_asl_nseg.m

I do not understand all that went into calculating the scale factor and therefore do not explain anything further than the original script.

Parameters
  • target_samples (np.ndarray [size = (num_samples,)]) – The audio samples of the target / clean signal.

  • noise_samples (np.ndarray [size = (num_samples,)]) – The audio samples of the noise signal.

  • sr (int) – The sample rate of both target_samples and noise_samples

  • snr (int) – The desired signal-to-noise ratio of the target and noise audio signals.

Returns

scale_factor – The factor to which noise samples should be multiplied before being added to target samples to achieve SNR.

Return type

int, float

References

Yi Hu and Philipos C. Loizouoriginal authors

Copyright (c) 2006 by Philipos C. Loizou

SIP-Lab/CNN-VAD/GitHub Repo

Copyright (c) 2019 Signal and Image Processing Lab MIT License

ITU-T (1993). Objective measurement of active speech level. ITU-T Recommendation P. 56
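
A minimal usage sketch based on the return value described above, with stand-in random arrays in place of real speech and noise:

>>> import numpy as np
>>> import soundpy as sp
>>> sr = 16000
>>> target_samples = np.random.uniform(-0.5, 0.5, sr)  # 1 second stand-in 'speech'
>>> noise_samples = np.random.uniform(-0.1, 0.1, sr)   # 1 second stand-in noise
>>> scale_factor = sp.dsp.snr_adjustnoiselevel(target_samples, noise_samples, sr, snr=5)
>>> noisy = target_samples + scale_factor * noise_samples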

soundpy.dsp.asl_P56(samples, sr, bitdepth=16, smooth_factor=0.03, hangover=0.2, margin_db=15.9)[source]

Computes the active speech level according to ITU-T P.56 standard.

Note: I don’t personally understand the functionality behind this function and therefore do not offer the best documentation as of yet.

Parameters
  • samples (np.ndarray [size = (num_samples, )]) – The audio samples, for example speech samples.

  • sr (int) – The sample rate of samples.

  • bitdepth (int) – The bitdepth of audio. Expects 16. (default 16)

  • smooth_factor (float) – Time smoothing factor. (default 0.03)

  • hangover (float) – Hangover. Thank goodness not the kind I’m familiar with. (default 0.2)

  • margin_db (int, float) – Margin decibels… (default 15.9)

Returns

  • asl_ms (float) – The active speech level ms energy

  • asl (float) – The active factor

  • c0 (float) – Active speech level threshold

References

ITU-T (1993). Objective measurement of active speech level. ITU-T Recommendation P. 56

TODO handle bitdepth variation - what if not 16? TODO improve documentation

soundpy.dsp.bin_interp(upcount, lwcount, upthr, lwthr, Margin, tol)[source]
soundpy.dsp.calc_posteri_prime(posteri_snr)[source]

Calculates the posteri prime

Parameters

posteri_snr (ndarray) – The signal-to-noise ratio of the noisy signal, frame by frame.

Returns

posteri_prime – The primed posteri_snr, calculated according to the reference paper.

Return type

ndarray

References

Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632.
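
In the cited paper the prime is essentially the a posteriori SNR minus one, floored at zero (half-wave rectified); a sketch, not necessarily the library's exact code:

>>> import numpy as np
>>> posteri_snr = np.array([3., 0.5, 1., 2.])
>>> posteri_prime = np.maximum(posteri_snr - 1, 0)
>>> posteri_prime
array([2., 0., 0., 1.])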

soundpy.dsp.calc_prior_snr(snr, snr_prime, smooth_factor=0.98, first_iter=None, gain=None)[source]

Estimates the signal-to-noise ratio of the previous frame

Depending on the first_iter argument, the prior snr is calculated according to different algorithms. If first_iter is None, prior snr is calculated according to Scalart and Filho (1996); if first_iter is True or False, snr prior is calculated according to Loizou (2013).

Parameters
  • snr (ndarray) – The sound-to-noise ratio of target vs noise power/energy levels.

  • snr_prime (ndarray) – The prime of the snr (see Scalart & Filho (1996))

  • smooth_factor (float) – The value applied to smooth the signal. (default 0.98)

  • first_iter (None, True, False) – If None, snr prior values are estimated the same way regardless of whether it is the first iteration or not (Scalart & Filho (1996)). If True, snr prior values are estimated without gain (Loizou 2013). If False, snr prior values are estimated with gain (Loizou 2013). (default None)

  • gain (None, ndarray) – If None, gain will not be used. Otherwise, gain is a previously calculated value from the previous frame. (default None)

Returns

prior_snr – Estimation of signal-to-noise ratio of the previous frame of target signal.

Return type

ndarray

References

C Loizou, P. (2013). Speech Enhancement: Theory and Practice.

Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632.
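
The default (first_iter=None) case corresponds to the decision-directed estimate of Scalart & Filho (1996): a weighted sum of the previous frame's primed SNR and the rectified current SNR. A sketch under that assumption:

>>> import numpy as np
>>> snr = np.array([3., 0.5, 2.])        # current a posteriori SNR
>>> snr_prime = np.array([2., 0., 1.])   # primed SNR from the previous frame
>>> smooth_factor = 0.98
>>> prior_snr = smooth_factor * snr_prime + (1 - smooth_factor) * np.maximum(snr - 1, 0)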

soundpy.dsp.calc_gain(prior_snr)[source]

Calculates the gain (i.e. attenuation) values to reduce noise.

Parameters

prior_snr (ndarray) – The prior signal-to-noise ratio estimation

Returns

gain – An array of attenuation values to be applied to the signal (stft) array at the current frame.

Return type

ndarray

References

C Loizou, P. (2013). Speech Enhancement: Theory and Practice.

Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632.
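
With a prior SNR estimate in hand, the Wiener gain from the cited references takes the familiar form prior / (1 + prior); a sketch, with the library possibly differing in details:

>>> import numpy as np
>>> prior_snr = np.array([3., 1., 0.25])
>>> gain = prior_snr / (1 + prior_snr)   # attenuation values between 0 and 1
>>> gain
array([0.75, 0.5 , 0.2 ])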

soundpy.dsp.apply_gain_fft(fft_vals, gain)[source]

Reduces noise by applying gain values to the stft / fft array of the target signal

Parameters
  • fft_vals (ndarray(complex)) – Matrix containing complex values (i.e. stft values) of target signal

  • gain (ndarray(real)) – Matrix containing calculated attenuation values to apply to ‘fft_vals’

Returns

enhanced_fft – Matrix with attenuated noise in target (stft) values

Return type

ndarray(complex)

soundpy.dsp.postfilter(original_powerspec, noisereduced_powerspec, gain, threshold=0.4, scale=10)[source]

Apply filter that reduces the musical noise resulting from the other filter.

If it is estimated that speech (or target signal) is present, reduced filtering is applied.

References

T. Esch and P. Vary, “Efficient musical noise suppression for speech enhancement system,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, 2009.

soundpy.dsp.calc_ifft(signal_section, real_signal=None, norm=False)[source]

Calculates the inverse fft of a series of fft values

The real values of the ifft can be used to be saved as an audiofile

Parameters
  • signal_section (ndarray [shape=(num_freq_bins,)) – The frame of fft values to apply the inverse fft to

  • num_fft (int, optional) – The number of total fft values applied when calculating the original fft. If not given, length of signal_section is used.

  • norm (bool) – Whether or not the ifft should apply ‘ortho’ normalization (default False)

Returns

ifft_vals – The inverse Fourier transform of filtered audio data

Return type

ndarray(complex)

soundpy.dsp.control_volume(samples, max_limit)[source]

Keeps max volume of samples to within a specified range.

Parameters
  • samples (ndarray) – series of audio samples

  • max_limit (float) – maximum boundary of the maximum value of the audio samples

Returns

samples – samples with volume adjusted (if need be).

Return type

np.ndarray

Examples

>>> import numpy as np
>>> #low volume example: increase volume to desired window
>>> x = np.array([-0.03, 0.04, -0.05, 0.02])
>>> x = control_volume(x, max_limit=0.25)
>>> x
array([-0.13888889,  0.25      , -0.25      ,  0.13888889])
>>> #high volume example: decrease volume to desired window
>>> y = np.array([-0.3, 0.4, -0.5, 0.2])
>>> y = control_volume(y, max_limit=0.15)
>>> y
array([-0.08333333,  0.15      , -0.15      ,  0.08333333])
soundpy.dsp.calc_power_ratio(original_powerspec, noisereduced_powerspec)[source]

Calc. the ratio of original vs noise reduced power spectrum.

soundpy.dsp.calc_noise_frame_len(SNR_decision, threshold, scale)[source]

Calc. window length for calculating moving average.

Note: lower SNRs require larger window.

soundpy.dsp.calc_linear_impulse(noise_frame_len, num_freq_bins)[source]

Calc. the post filter coefficients to be applied to gain values.

soundpy.dsp.adjust_volume(samples, vol_range)[source]
soundpy.dsp.spread_volumes(samples, vol_list=[0.1, 0.3, 0.5])[source]

Returns samples with a range of volumes.

This may be useful in applying to training data (transforming data).

Parameters
  • samples (ndarray) – Series belonging to acoustic signal.

  • vol_list (list) – List of floats or ints representing the volumes the samples are to be oriented towards. (default [0.1,0.3,0.5])

Returns

volrange_dict – Tuple containing the samples at the various volume levels.

Return type

tuple

soundpy.dsp.create_empty_matrix(shape, complex_vals=False)[source]

Allows creation of a matrix filled with real or complex zeros.

In digital signal processing, complex numbers are common; it is important to note that if complex_vals=False and complex values are inserted into the matrix, the imaginary part will be removed.

Parameters
  • shape (tuple or int) – tuple or int indicating the shape or length of desired matrix or vector, respectively

  • complex_vals (bool) – indicator of whether or not the matrix will receive real or complex values (default False)

Returns

matrix – a matrix filled with real or complex zeros

Return type

ndarray

Examples

>>> matrix = create_empty_matrix((3,4))
>>> matrix
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
>>> matrix_complex = create_empty_matrix((3,4),complex_vals=True)
>>> matrix_complex
array([[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]])
>>> vector = create_empty_matrix(5,)
>>> vector
array([0., 0., 0., 0., 0.])
soundpy.dsp.overlap_add(enhanced_matrix, frame_length, overlap, complex_vals=False)[source]

Overlaps and adds windowed sections together to form 1D signal.

Parameters
  • enhanced_matrix (np.ndarray [shape=(frame_length, num_frames), dtype=float]) – Matrix with enhanced values

  • frame_length (int) – Number of samples per frame

  • overlap (int) – Number of samples that overlap

Returns

new_signal – Length equals (frame_length - overlap) * enhanced_matrix.shape[1] + overlap

Return type

np.ndarray [shape=(frame_length,), dtype=float]

Examples

>>> import numpy as np
>>> enhanced_matrix = np.ones((4, 4))
>>> frame_length = 4
>>> overlap = 1
>>> sig = overlap_add(enhanced_matrix, frame_length, overlap)
>>> sig
array([1., 1., 1., 2., 1., 1., 2., 1., 1., 2., 1., 1., 1.])
soundpy.dsp.random_selection_samples(samples, len_section_samps, wrap=False, random_seed=None, axis=0)[source]

Selects a section of samples, starting at random.

Parameters
  • samples (np.ndarray [shape = (num_samples, )]) – The array of sample data

  • len_section_samps (int) – How many samples should be randomly selected

  • wrap (bool) – If False, the selected noise will not be wrapped from end to beginning; if True, the random selection may take a sound sample that wraps from the end to the beginning. See examples below. (default False)

  • random_seed (int, optional) – If replicated randomization desired. (default None)

Examples

>>> import numpy as np
>>> # no wrap:
>>> x = np.array([1,2,3,4,5,6,7,8,9,10])
>>> n = sp.dsp.random_selection_samples(x, len_section_samps = 7,
...                                     wrap = False, random_seed = 40)
>>> n
array([3, 4, 5, 6, 7, 8, 9])
>>> # with wrap:
>>> n = sp.dsp.random_selection_samples(x, len_section_samps = 7,
...                                     wrap = True, random_seed = 40)
>>> n
array([ 7,  8,  9, 10,  1,  2,  3])
soundpy.dsp.get_pitch(sound, sr=16000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', **kwargs)[source]

Approximates pitch by collecting dominant frequencies of signal.

soundpy.dsp.get_mean_freq(sound, sr=16000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', percent_vad=0.75)[source]

Takes the mean of dominant frequencies of voice activated regions in a signal.

Note: Silences discarded.

The average fundamental frequency for a male voice is 125Hz; for a female voice it’s 200Hz; and for a child’s voice, 300Hz. (Russell, J., 2020)

References

Russell, James (2020) The Human Voice and the Frequency Range. Retrieved from: https://blog.accusonus.com/pro-audio-production/human-voice-frequency-range/

soundpy.dsp.vad(sound, sr, win_size_ms=50, percent_overlap=0, real_signal=False, fft_bins=None, window='hann', energy_thresh=40, freq_thresh=185, sfm_thresh=5, min_energy=None, min_freq=None, min_sfm=None, use_beg_ms=120)[source]

Warning: this VAD works best with sample rates above 44100 Hz.

Parameters
  • energy_thresh (int, float) – The minimum amount of energy for speech detection.

  • freq_thresh (int, float) – The maximum frequency threshold.

  • sfm_thresh (int, float) – The spectral flatness measure threshold.

References

Moattar, M. H. and Homayounpour, M. M., “A simple but efficient real-time Voice Activity Detection algorithm,” 2009 17th European Signal Processing Conference, Glasgow, 2009, pp. 2549-2553.

soundpy.dsp.suspended_energy(speech_energy, speech_energy_mean, row, start)[source]
soundpy.dsp.sound_index(speech_energy, speech_energy_mean, start=True)[source]

Identifies the index of where speech or energy starts or ends.

soundpy.dsp.get_energy(stft)[source]
soundpy.dsp.get_energy_mean(rms_energy)[source]
soundpy.dsp.spectral_flatness_measure(spectrum)[source]
soundpy.dsp.get_dom_freq(power_values)[source]

If real_signal (i.e. half fft bins), might mess up values.

soundpy.dsp.short_term_energy(signal_windowed)[source]

Expects signal to be scaled (-1, 1) as well as windowed.

References

http://vlab.amrita.edu/?sub=3&brch=164&sim=857&cnt=1
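
Per the reference, short-term energy amounts to the sum of the squared (windowed) samples in a frame; a minimal sketch:

>>> import numpy as np
>>> frame = np.array([0.0, 0.5, 1.0, 0.5, 0.0])  # already scaled and windowed
>>> float(np.sum(frame ** 2))
1.5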

soundpy.dsp.bilinear_warp(fft_value, alpha)[source]

Subfunction for vocal tract length perturbation.

References

Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria.

soundpy.dsp.piecewise_linear_warp(fft_value, alpha, max_freq)[source]

Subfunction for vocal tract length perturbation.

References

Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria.

soundpy.dsp.f0_approximation(sound, sr, low_freq=50, high_freq=300, **kwargs)[source]

Approximates fundamental frequency.

Limits the stft of voice active sections to frequencies between low_freq and high_freq and takes the mean of the dominant frequencies within that range. Defaults are set at 50 and 300, as most human speech frequencies occur between 85 and 255 Hz.

References

https://en.wikipedia.org/wiki/Voice_frequency

Filters: Wiener and Band Spectral Subtraction

The filters module covers functions related to filtering noise out of a target signal.

class soundpy.filters.FilterSettings(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]

Bases: object

Basic settings for filter related classes to inherit from.

frame_dur

Time in milliseconds of each audio frame window. (default 20)

Type

int, float

sr

Desired sampling rate of audio; audio will be resampled to match if it has a different sampling rate. (default 48000)

Type

int

frame_length

Number of audio samples in each frame: frame_dur multiplied with sr, divided by 1000. (default 960)

Type

int

percent_overlap

Percentage of overlap between frames.

Type

float

overlap_length

Number of overlapping audio samples between subsequent frames: frame_length multiplied by percent_overlap, floored. (default 480)

Type

int

window_type

Type of window applied to audio frames: hann vs hamming (default ‘hamming’)

Type

str

num_fft_bins

The number of frequency bins used when calculating the fft. Currently the frame_length is used to set num_fft_bins.

Type

int

zeropad

If False, only full frames of audio data are processed. If True, the last partial frame will be zeropadded. (default False)

Type

bool, optional

Methods

get_window()

Returns window according to window_type and frame_length.

get_window()[source]

Returns window according to attributes window_type and frame_length

class soundpy.filters.Filter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]

Bases: soundpy.filters.FilterSettings

Interactive class to explore Wiener filter settings on audio signals.

These class methods implement research-based algorithms with low computational cost, aimed at noise reduction on mobile phones.

beta

Value applied in Wiener filter that smooths the application of ‘gain’; default set according to previous research. (default 0.98)

Type

float

first_iter

Keeps track if first_iter is relevant in filtering. If True, filtering has just started, and calculations made for filtering cannot use information from previous frames; if False, calculations for filtering use information from previous frames; if None, no difference is applied when processing the 1st vs subsequent frames. (default None)

Type

bool, optional

target_subframes

The number of total subsections within the total number of samples belonging to the target signal (i.e. audiofile being filtered). Until target_subframes is calculated, it is set to None. (default None)

Type

int, None

noise_subframes

The number of total subsections within the total number of samples belonging to the noise signal. If noise power spectrum is used, this doesn’t need to be calculated. Until noise_subframes is calculated, it is set to None. (default None)

Type

int, None

gain

Once calculated, the attenuation values to be applied to the fft for noise reduction. Until calculated, None. (default None)

Type

ndarray, None

max_vol

The maximum volume allowed for the filtered signal. (default 0.4)

Type

float, int

Methods

check_volume(samples)

ensures volume of filtered signal is within the bounds of the original

get_samples(audiofile[, dur_sec])

Load signal and save original volume

get_window()

Returns window according to window_type and frame_length.

set_num_subframes(len_samples[, is_noise, …])

Sets the number of target or noise subframes available for processing

set_volume(samples[, max_vol, min_vol])

Records and limits the maximum amplitude of original samples.

get_samples(audiofile, dur_sec=None)[source]

Load signal and save original volume

Parameters
  • audiofile (str) – Path and name of audiofile to be loaded

  • dur_sec (int, float optional) – Max length of time in seconds (default None)

Returns

samples – Array containing signal amplitude values in time domain

Return type

ndarray

set_volume(samples, max_vol=0.4, min_vol=0.15)[source]

Records and limits the maximum amplitude of original samples.

This enables the output wave to be within a range of volume that does not go below or too far above the original maximum amplitude of the signal.

Parameters
  • samples (ndarray) – The original samples of a signal (1 dimensional), of any length

  • max_vol (float) – The maximum volume level. If a signal has values higher than this number, the signal is curtailed to remain at and below this number.

  • min_vol (float) – The minimum volume level. If a signal has only values lower than this number, the signal is amplified to be at this number and below.

Returns

Return type

None

set_num_subframes(len_samples, is_noise=False, zeropad=False)[source]

Sets the number of target or noise subframes available for processing

Parameters
  • len_samples (int) – The total number of samples in a given signal

  • is_noise (bool) – If False, subframe number saved under self.target_subframes, otherwise self.noise_subframes (default False)

  • zeropad (bool) – If False, number of frames limited to full frames. If True, last frame is zeropadded.

Returns

Return type

None

check_volume(samples)[source]

ensures volume of filtered signal is within the bounds of the original

class soundpy.filters.WienerFilter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]

Bases: soundpy.filters.Filter

Methods

check_volume(samples)

ensures volume of filtered signal is within the bounds of the original

get_samples(audiofile[, dur_sec])

Load signal and save original volume

get_window()

Returns window acc.

set_num_subframes(len_samples[, is_noise, …])

Sets the number of target or noise subframes available for processing

set_volume(samples[, max_vol, min_vol])

Records and limits the maximum amplitude of original samples.

apply_postfilter

apply_wienerfilter

apply_wienerfilter(frame_index, target_fft, target_power_frame, noise_power)[source]
apply_postfilter(enhanced_fft, target_fft, target_power_frame)[source]
class soundpy.filters.BandSubtraction(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]

Bases: soundpy.filters.Filter

Methods

calc_oversub_factor()

Calculate over subtraction factor used in the cited paper.

calc_relevant_band(target_powspec)

Calculates band with highest energy levels.

check_volume(samples)

ensures volume of filtered signal is within the bounds of the original

get_samples(audiofile[, dur_sec])

Load signal and save original volume

get_window()

Returns window acc.

set_num_subframes(len_samples[, is_noise, …])

Sets the number of target or noise subframes available for processing

set_volume(samples[, max_vol, min_vol])

Records and limits the maximum amplitude of original samples.

setup_bands()

Provides starting and ending frequncy bins/indices for each band.

update_posteri_bands(target_powspec, …)

Updates SNR of each set of bands.

apply_bandspecsub

apply_floor

apply_postfilter

sub_noise

apply_bandspecsub(target_power, target_phase, noise_power)[source]
setup_bands()[source]

Provides starting and ending frequncy bins/indices for each band.

Parameters

self (class) – Contains variables num_bands (if None, set to 6) and frame_length

Returns

Sets the class variables band_start_freq and band_end_freq.

Return type

None

Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # Default is set to 6 bands:
>>> fil = sp.BandSubtraction()
>>> fil.setup_bands()
>>> fil.band_start_freq
array([  0.,  80., 160., 240., 320., 400.])
>>> fil.band_end_freq
array([ 80., 160., 240., 320., 400., 480.])
>>> # change default settings
>>> fil = sp.BandSubtraction(num_bands=5)
>>> fil.setup_bands()
>>> fil.band_start_freq
array([  0.,  96., 192., 288., 384.])
>>> fil.band_end_freq
array([ 96., 192., 288., 384., 480.])
update_posteri_bands(target_powspec, noise_powspec)[source]

Updates SNR of each set of bands.

MATLAB code from speech enhancement book uses power, puts it into magnitude (via square root), then puts it back into power..? And uses some sort of ‘norm’ function… which I think is actually just the sum. Original equation can be found in the paper below. page 117 from book?

paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise.

I am using power for the time being.

Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
calc_oversub_factor()[source]

Calculate over subtraction factor used in the cited paper.

Uses decibel SNR values calculated in update_posteri_bands()

paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method ofr enhancing speech corrupted by colored noise.

Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> a = fil.calc_oversub_factor()
>>> a
array([4.28678354, 4.75      , 4.75      , 4.75      ])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
>>> a = fil.calc_oversub_factor()
>>> a
array([4., 4., 4., 4.])
calc_relevant_band(target_powspec)[source]

Calculates band with highest energy levels.

Parameters
  • self (class instance) – Contains class variables band_start_freq and band_end_freq.

  • target_powerspec (np.ndarray) – Power spectrum of the target signal.

Returns

  • rel_band_index (int) – Index for which band contains the most energy.

  • band_energy_matrix (np.ndarray [size=(num_bands, ), dtype=np.float]) – Power levels of each band.

Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for this example (default is 6):
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and with frequency 25
>>> time = np.arange(0, 10, 0.01)
>>> full_circle = 2 * np.pi
>>> freq = 25
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
2
>>> # and with frequency 50
>>> freq = 50
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
3
apply_floor(sub_band, original_band, floor=0.002, book=True)[source]
sub_noise(target_powspec, noise_powspec, oversub_factor, speech=True)[source]
apply_postfilter(enhanced_fft, target_fft, target_power_frame, noise_power)[source]
class soundpy.filters.FilterSettings(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]

Bases: object

Basic settings for filter related classes to inherit from.

frame_dur

Time in milliseconds of each audio frame window. (default 20)

Type

int, float

sr

Desired sampling rate of audio; audio will be resampled to match if audio has other sampling rate. (default 48000)

Type

int

frame_length

Number of audio samples in each frame: frame_dur multiplied with sr, divided by 1000. (default 960)

Type

int

percent_overlap

Percentage of overlap between frames.

Type

float

overlap_length

Number of overlapping audio samples between subsequent frames: frame_length multiplied by percent_overlap, floored. (default 480)

Type

int

window_type

Type of window applied to audio frames: hann vs hamming (default ‘hamming’)

Type

str

num_fft_bins

The number of frequency bins used when calculating the fft. Currently the frame_length is used to set num_fft_bins.

Type

int

zeropad

If False, only full frames of audio data are processed. If True, the last partial frame will be zeropadded. (default False)

Type

bool, optional

Methods

get_window()

Returns window according to attributes window_type and frame_length.

__init__(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]

Initialize self. See help(type(self)) for accurate signature.

get_window()[source]

Returns window according to attributes window_type and frame_length.
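
Examples

The frame arithmetic described above can be checked directly. A minimal sketch, assuming the documented defaults (frame_dur of 20 ms, sr of 48000, percent_overlap of 0.5):

>>> # minimal sketch of the documented frame arithmetic
>>> frame_dur, sr, percent_overlap = 20, 48000, 0.5
>>> frame_length = int(frame_dur * sr / 1000)
>>> frame_length
960
>>> overlap_length = int(frame_length * percent_overlap)
>>> overlap_length
480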

class soundpy.filters.Filter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]

Bases: soundpy.filters.FilterSettings

Interactive class to explore Wiener filter settings on audio signals.

These class methods implement research based algorithms with low computational cost, aimed for noise reduction via mobile phone.

beta

Value applied in Wiener filter that smooths the application of ‘gain’; default set according to previous research. (default 0.98)

Type

float

first_iter

Keeps track if first_iter is relevant in filtering. If True, filtering has just started, and calculations made for filtering cannot use information from previous frames; if False, calculations for filtering use information from previous frames; if None, no difference is applied when processing the 1st vs subsequent frames. (default None)

Type

bool, optional

target_subframes

The number of total subsections within the total number of samples belonging to the target signal (i.e. audiofile being filtered). Until target_subframes is calculated, it is set to None. (default None)

Type

int, None

noise_subframes

The number of total subsections within the total number of samples belonging to the noise signal. If noise power spectrum is used, this doesn’t need to be calculated. Until noise_subframes is calculated, it is set to None. (default None)

Type

int, None

gain

Once calculated, the attenuation values to be applied to the fft for noise reduction. Until calculated, None. (default None)

Type

ndarray, None

max_vol

The maximum volume allowed for the filtered signal. (default 0.4)

Type

float, int

Methods

check_volume(samples)

ensures volume of filtered signal is within the bounds of the original

get_samples(audiofile[, dur_sec])

Load signal and save original volume

get_window()

Returns window according to attributes window_type and frame_length.

set_num_subframes(len_samples[, is_noise, …])

Sets the number of target or noise subframes available for processing

set_volume(samples[, max_vol, min_vol])

Records and limits the maximum amplitude of original samples.

__init__(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]

Initialize self. See help(type(self)) for accurate signature.

get_samples(audiofile, dur_sec=None)[source]

Load signal and save original volume

Parameters
  • audiofile (str) – Path and name of audiofile to be loaded

  • dur_sec (int or float, optional) – Max length of time in seconds (default None)

Returns

samples – Array containing signal amplitude values in time domain

Return type

ndarray

set_volume(samples, max_vol=0.4, min_vol=0.15)[source]

Records and limits the maximum amplitude of original samples.

This enables the output wave to stay within a volume range that neither falls below nor rises too far above the original maximum amplitude of the signal.

Parameters
  • samples (ndarray) – The original samples of a signal (1 dimensional), of any length

  • max_vol (float) – The maximum volume level. If a signal has values higher than this number, the signal is curtailed to remain at and below this number.

  • min_vol (float) – The minimum volume level. If a signal has only values lower than this number, the signal is amplified to be at this number and below.

Returns

Return type

None

set_num_subframes(len_samples, is_noise=False, zeropad=False)[source]

Sets the number of target or noise subframes available for processing

Parameters
  • len_samples (int) – The total number of samples in a given signal

  • is_noise (bool) – If False, subframe number saved under self.target_subframes, otherwise self.noise_subframes (default False)

  • zeropad (bool) – If False, number of frames limited to full frames. If True, last frame is zeropadded.

Returns

Return type

None

check_volume(samples)[source]

Ensures the volume of the filtered signal is within the bounds of the original.

class soundpy.filters.WienerFilter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]

Bases: soundpy.filters.Filter

Methods

check_volume(samples)

ensures volume of filtered signal is within the bounds of the original

get_samples(audiofile[, dur_sec])

Load signal and save original volume

get_window()

Returns window according to attributes window_type and frame_length.

set_num_subframes(len_samples[, is_noise, …])

Sets the number of target or noise subframes available for processing

set_volume(samples[, max_vol, min_vol])

Records and limits the maximum amplitude of original samples.

apply_postfilter

apply_wienerfilter

__init__(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]

Initialize self. See help(type(self)) for accurate signature.

apply_wienerfilter(frame_index, target_fft, target_power_frame, noise_power)[source]
apply_postfilter(enhanced_fft, target_fft, target_power_frame)[source]
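
Examples

A rough sketch of how the inherited helpers fit together. This assumes WienerFilter is exposed at the package level (as BandSubtraction is in the examples below) and uses ‘speech.wav’ as a stand-in for a real audio file:

>>> import soundpy as sp
>>> fil = sp.WienerFilter(win_size_ms=20, percent_overlap=0.5, sr=48000)
>>> samples = fil.get_samples('speech.wav', dur_sec=3)  # hypothetical file
>>> fil.set_num_subframes(len(samples), is_noise=False, zeropad=True)
>>> window = fil.get_window()  # according to window_type and frame_length
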
class soundpy.filters.BandSubtraction(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]

Bases: soundpy.filters.Filter

Methods

calc_oversub_factor()

Calculate over subtraction factor used in the cited paper.

calc_relevant_band(target_powspec)

Calculates band with highest energy levels.

check_volume(samples)

ensures volume of filtered signal is within the bounds of the original

get_samples(audiofile[, dur_sec])

Load signal and save original volume

get_window()

Returns window according to attributes window_type and frame_length.

set_num_subframes(len_samples[, is_noise, …])

Sets the number of target or noise subframes available for processing

set_volume(samples[, max_vol, min_vol])

Records and limits the maximum amplitude of original samples.

setup_bands()

Provides starting and ending frequency bins/indices for each band.

update_posteri_bands(target_powspec, …)

Updates SNR of each set of bands.

apply_bandspecsub

apply_floor

apply_postfilter

sub_noise

__init__(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]

Initialize self. See help(type(self)) for accurate signature.

apply_bandspecsub(target_power, target_phase, noise_power)[source]
setup_bands()[source]

Provides starting and ending frequency bins/indices for each band.

Parameters

self (class) – Contains variables num_bands (if None, set to 6) and frame_length

Returns

Sets the class variables band_start_freq and band_end_freq.

Return type

None

Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # Default is set to 6 bands:
>>> fil = sp.BandSubtraction()
>>> fil.setup_bands()
>>> fil.band_start_freq
array([  0.,  80., 160., 240., 320., 400.])
>>> fil.band_end_freq
array([ 80., 160., 240., 320., 400., 480.])
>>> # change default settings
>>> fil = sp.BandSubtraction(num_bands=5)
>>> fil.setup_bands()
>>> fil.band_start_freq
array([  0.,  96., 192., 288., 384.])
>>> fil.band_end_freq
array([ 96., 192., 288., 384., 480.])
update_posteri_bands(target_powspec, noise_powspec)[source]

Updates SNR of each set of bands.

Note: the reference MATLAB code from the speech enhancement book uses power, converts it into magnitude (via square root), then puts it back into power, and applies a ‘norm’ function that appears to simply be a sum. The original equation can be found in the paper below (and around page 117 of the book).

paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise.

This implementation uses power values for the time being.

Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
calc_oversub_factor()[source]

Calculate over subtraction factor used in the cited paper.

Uses decibel SNR values calculated in update_posteri_bands()

paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise.

Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> a = fil.calc_oversub_factor()
>>> a
array([4.28678354, 4.75      , 4.75      , 4.75      ])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
>>> a = fil.calc_oversub_factor()
>>> a
array([4., 4., 4., 4.])
calc_relevant_band(target_powspec)[source]

Calculates band with highest energy levels.

Parameters
  • self (class instance) – Contains class variables band_start_freq and band_end_freq.

  • target_powspec (np.ndarray) – Power spectrum of the target signal.

Returns

  • rel_band_index (int) – Index for which band contains the most energy.

  • band_energy_matrix (np.ndarray [size=(num_bands, ), dtype=np.float]) – Power levels of each band.

Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for this example (default is 6):
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with frequency 25
>>> time = np.arange(0, 10, 0.01)
>>> full_circle = 2 * np.pi
>>> freq = 25
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
2
>>> # and with frequency 50
>>> freq = 50
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
3
apply_floor(sub_band, original_band, floor=0.002, book=True)[source]
sub_noise(target_powspec, noise_powspec, oversub_factor, speech=True)[source]
apply_postfilter(enhanced_fft, target_fft, target_power_frame, noise_power)[source]

Extract and manipulate audio features

Feats module includes functions related to converting audio sample data to features for analysis, filtering, machine learning, or visualization.

soundpy.feats.plot(feature_matrix, feature_type, save_pic=False, name4pic=None, energy_scale='power_to_db', title=None, sr=None, win_size_ms=None, percent_overlap=None, x_label=None, y_label=None, subprocess=False, overwrite=False)[source]

Visualize feature extraction; frames on x axis, features on y axis. Uses librosa to scale the data if a scale is applied.

Note: can only take multiple channels if feature_type is ‘signal’. For other feature types, the plot will not work as expected.

Parameters
  • feature_matrix (np.ndarray [shape=(num_samples,), (num_samples, num_channels), or (num_features, num_frames), dtype=np.float].) – Matrix of features. If the features are not of type ‘signal’ and the shape is 1 D, one dimension will be added to be plotted with a colormesh.

  • feature_type (str) – Options: ‘signal’, ‘stft’, ‘mfcc’, or ‘fbank’ features, or whatever the user would like to name the feature set. signal: the 1-D samples of sound; STFT: short-time Fourier transform; MFCC: mel frequency cepstral coefficients; FBANK: mel-log filterbank energies. (default ‘fbank’)

  • save_pic (bool) – True to save image as .png; False to just plot it. If subprocess is True, save_pic will automatically be set to True.

  • name4pic (str, optional) – If save_pic set to True, the name the image should be saved under.

  • energy_scale (str, optional) – If features need to be adjusted, e.g. from power to decibels. Default is ‘power_to_db’.

  • title (str, optional) – The title for the graph. If None, feature_type is used.

  • sr (int, optional) – Useful in plotting the time for features.

  • win_size_ms (int, float, optional) – Useful in plotting the time for features in the frequency domain (e.g. STFT, FBANK, MFCC features)

  • percent_overlap (int, float, optional) – Useful in plotting the time for features in the frequency domain (e.g. STFT, FBANK, MFCC features)

  • x_label (str, optional) – The label to be applied to the x axis.

  • y_label (str, optional) – The label to be applied to the y axis.

  • subprocess (bool) – If subprocess is True, matplotlib will use backend ‘Agg’, which only allows plots to be saved. If subprocess is False, the default backend ‘TkAgg’ will be used, which allows plots to be generated live as well as saved. The ‘Agg’ backend is useful if one wants to visualize sound while a main process is being performed, for example, while a model is being trained. (default False)

  • overwrite (bool) – If False, if .png file already exists under given name, a date tag will be added to the .png filename to avoid overwriting the file. (default False)

Returns

Return type

None
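
Examples

A minimal sketch plotting raw sample data; the random samples below are a stand-in for real audio:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> samples = np.random.random(48000) - 0.5  # one second of noise at 48000 Hz
>>> sp.feats.plot(samples, feature_type='signal', sr=48000, title='Random noise')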

soundpy.feats.plotsound(audiodata, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, fft_bins=None, num_filters=40, num_mfcc=40, sr=None, save_pic=False, name4pic=None, energy_scale='power_to_db', mono=None, real_signal=False, **kwargs)[source]

Visualize feature extraction depending on set parameters.

Stereo sound can be graphed. If feature_type is ‘signal’, all channels will be graphed on same plot. Otherwise, each channel will be plotted separately.

Parameters
  • audiodata (str, numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.

  • feature_type (str) – Options: ‘signal’, ‘mfcc’, or ‘fbank’ features. MFCC: mel frequency cepstral coefficients; FBANK: mel-log filterbank energies (default ‘fbank’)

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 20)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • num_filters (int) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. (default 40)

  • num_mfcc (int) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. Note: it is not possible to choose only 2-13 or 13-40; if num_mfcc is set to 40, all 40 coefficients will be included. (default 40).

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)

  • mono (bool, optional) – When loading an audiofile, True will limit number of channels to one; False will allow more channels to be loaded. (default None, which results in mono channel loading.)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.plot
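
Examples

A minimal sketch, where ‘speech.wav’ is a stand-in for a real audio file:

>>> import soundpy as sp
>>> sp.feats.plotsound('speech.wav', feature_type='fbank', win_size_ms=20,
...                    percent_overlap=0.5, num_filters=40)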

soundpy.feats.get_feats(sound, sr=None, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, window='hann', fft_bins=None, num_filters=None, num_mfcc=None, remove_first_coefficient=False, sinosoidal_liftering=False, dur_sec=None, mono=None, rate_of_change=False, rate_of_acceleration=False, subtract_mean=False, real_signal=True, fmin=None, fmax=None, zeropad=True)[source]

Collects raw signal data, stft, fbank, or mfcc features.

Parameters
  • sound (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)

  • feature_type (str) – Options include ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, or ‘mfcc’ data (default ‘fbank’). signal: energy/amplitude measurements along time; STFT: short-time Fourier transform; powspec: power spectrum (absolute value of STFT, squared); FBANK: mel-log filterbank energies; MFCC: mel frequency cepstral coefficients.

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 20)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • window (str or np.ndarray [size (n_fft, )]) – The window function to be applied to each window. (Default ‘hann’)

  • fft_bins (int) – Number of frequency bins to apply in fast Fourier transform. (default None)

  • num_filters (int) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. If None, will be set to 40. (default None)

  • num_mfcc (int) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. If None, will be set to num_filters or 40. (default None).

  • dur_sec (float, optional) – Time in seconds to limit in loading a signal. (default None)

  • mono (bool, optional) – For loading an audiofile, True will result in only one channel of data being loaded; False will allow additional channels be loaded. (default None, which results in mono channel data)

  • rate_of_change (bool) – If True, the first derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.

  • rate_of_acceleration (bool) – If True, the second derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.

  • subtract_mean (bool) – If True, the mean of each feature column will be subtracted from each row. This is applicable for all feature types except ‘signal’.

Returns

feats – Feature data. If feature_type is ‘signal’, returns a tuple containing samples and sampling rate. If feature_type is of another type, returns np.ndarray with shape (num_frames, num_filters/features)

Return type

tuple (num_samples, sr) or np.ndarray [size (num_frames, num_filters) dtype=np.float or np.complex]
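
Examples

A minimal sketch using random samples as a stand-in for audio; per the return type above, ‘mfcc’ features should come back as an np.ndarray of shape (num_frames, num_mfcc):

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> sr = 16000
>>> samples = np.random.random(sr * 2) - 0.5  # two seconds of 'audio'
>>> feats = sp.feats.get_feats(samples, sr=sr, feature_type='mfcc',
...                            num_mfcc=13, win_size_ms=20, percent_overlap=0.5)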

soundpy.feats.load_feat_settings(feat_settings_dict)[source]

Loads feature settings into a named tuple. Sets defaults if not present. TODO: test with previous versions.

soundpy.feats.get_stft(sound, sr=22050, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', zeropad=True, **kwargs)[source]

Returns short-time Fourier transform matrix.

This function allows more flexibility in number of fft_bins and real_signal settings. Additionally, this does not require the package librosa, making it a bit easier to manipulate if desired. For an example, see soundpy.augment.vtlp.

Parameters
  • sound (np.ndarray [shape=(num_samples,) or (num_samples, num_channels)], str, or pathlib.PosixPath) – If type np.ndarray, expect raw samples in mono or stereo sound. If type str or pathlib.PosixPath, expect pathway to audio file.

  • sr (int) – The sample rate of sound.

  • win_size_ms (int, float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • real_signal (bool) – If True, only half the FFT spectrum will be used; there should really be no difference as the FFT is symmetrical. If anything, setting real_signal to True may speed up functionality / make functions more efficient.

  • fft_bins (int) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound.

Returns

stft_matrix

Return type

np.ndarray[size=(num_frames, fft_bins)]
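
Examples

A minimal sketch on a generated tone; per the return type above, stft should have shape (num_frames, fft_bins):

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 22050
>>> time = np.arange(sr) / sr  # one second
>>> samples = np.sin(2 * np.pi * 440 * time)  # 440 Hz tone
>>> stft = sp.feats.get_stft(samples, sr=sr, win_size_ms=50,
...                          percent_overlap=0.5, fft_bins=1024)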

soundpy.feats.get_fbank(sound, sr, num_filters, fmin=None, fmax=None, fft_bins=None, **kwargs)[source]

Extract mel-filterbank energy features from audio.

Parameters
  • sound (np.ndarray [size=(num_samples,) or (num_samples, num_features)], str, or pathlib.PosixPath) – Sound in raw samples, a power spectrum, or a short-time-fourier-transform. If type string or pathlib.PosixPath, expect pathway to audio file.

  • sr (int) – The sample rate of sound.

  • num_filters (int) – The number of mel-filters to use when extracting mel-filterbank energies.

  • fmin (int or float, optional) – The minimum frequency of interest. If None, will be set to 0. (default None)

  • fmax (int or float, optional) – The maximum frequency of interest. If None, will be set to half of sr. (default None)

  • fft_bins (int, optional) – The number of frequency bins / fast Fourier transform bins used in calculating the fast Fourier transform. If None, set depending on type of parameter sound. If sound is a raw signal or audio pathway, fft_bins will be set to 1024; if sound is a STFT or power spectrum, fft_bins will be set to 2 * length of sound feature column, or 2 * sound.shape[1].

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_stft.

Returns

fbank – The extracted mel-filterbank energies. The number of samples depends on the parameters applied in soundpy.feats.get_stft.

Return type

np.ndarray [shape=(num_samples, num_filters)]

References

Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
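
Examples

A minimal sketch extracting filterbank energies, with random samples standing in for real audio:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> sr = 16000
>>> samples = np.random.random(sr) - 0.5  # one second of 'audio'
>>> fbank = sp.feats.get_fbank(samples, sr=sr, num_filters=40)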

soundpy.feats.get_mfcc(sound, sr, num_mfcc, remove_first_coefficient=False, sinosoidal_liftering=False, **kwargs)[source]

Extracts mel-frequency cepstral coefficients from audio.

Parameters
  • sound (np.ndarray [size=(num_samples,) or (num_samples, num_features)] or str or pathlib.PosixPath) – If sound is a np.ndarray, expected as raw samples, a power spectrum or a short-time Fourier transform. If string or pathlib.PosixPath, should be the pathway to the audio file.

  • sr (int) – The sample rate of the sound.

  • num_mfcc (int) – The number of mel-frequency cepstral coefficients

  • remove_first_coefficient (bool) – If True, the first coefficient, representing amplitude or volume of signal, is removed. Found to sometimes improve automatic speech recognition. (default False)

  • sinosoidal_liftering (bool) – If True, reduces the influence of higher coefficients, which has been found to aid in handling background noise in automatic speech recognition. (default False)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_fbank()

References

Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
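
Examples

A minimal sketch, again with random samples standing in for real audio:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> sr = 16000
>>> samples = np.random.random(sr) - 0.5
>>> mfcc = sp.feats.get_mfcc(samples, sr=sr, num_mfcc=13,
...                          sinosoidal_liftering=True)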

soundpy.feats.get_vad_stft(sound, sr=48000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, zeropad=True, **kwargs)[source]

Returns STFT matrix and VAD matrix. STFT matrix contains only VAD sections.

Parameters
  • sound (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • real_signal (bool) – If True, only half the FFT spectrum will be used; there should really be no difference as the FFT is symmetrical. If anything, setting real_signal to True may speed up functionality / make functions more efficient.

  • fft_bins (int) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • use_beg_ms (int) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.

  • extend_window_ms (int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.

  • energy_thresh (int) – The threshold to set for measuring energy for VAD in the signal. (default 40)

  • freq_thresh (int) – The threshold to set for measuring frequency for VAD in the signal. (default 185)

  • sfm_thresh (int) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound

Returns

  • stft_matrix (np.ndarray [size=(num_frames_vad, fft_bins//2+1), dtype=np.complex_]) – The STFT matrix frames of where voice activity has been detected.

  • vad_matrix_extwin (np.ndarray [size=(num_frames,)]) – A vector containing indices of the full STFT matrix for frames of where voice activity was detected or not.
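
Examples

A minimal sketch, where ‘speech.wav’ is a stand-in for a real audio file:

>>> import soundpy as sp
>>> stft_vad, vad_matrix = sp.feats.get_vad_stft('speech.wav', sr=48000,
...                                              use_beg_ms=120, extend_window_ms=100)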

soundpy.feats.get_stft_clipped(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]

Returns STFT matrix and VAD matrix with beginning and ending silence removed.

Parameters
  • samples (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • extend_window_ms (int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound.

Returns

  • stft_speech (np.ndarray [size (num_frames_clipped, fft_bins//2+1)]) – The STFT of the samples with beginning and ending silences clipped.

  • vad_matrix (np.ndarray [size (num_frames, )]) – A vector with zeros and ones indicating which indices of the full STFT have voice activity or not.

soundpy.feats.get_vad_samples(sound, sr=None, win_size_ms=50, percent_overlap=0.5, use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, window='hann', zeropad=True, **kwargs)[source]

Returns samples and VAD matrix. Only samples where voice activity was detected are returned.

Parameters
  • sound (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • use_beg_ms (int) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.

  • extend_window_ms (int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.

  • energy_thresh (int) – The threshold to set for measuring energy for VAD in the signal. (default 40)

  • freq_thresh (int) – The threshold to set for measuring frequency for VAD in the signal. (default 185)

  • sfm_thresh (int) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound

Returns

  • samples_matrix (np.ndarray [size = (num_samples_vad, )]) – The samples of where voice activity was detected.

  • vad_matrix_extwin (np.ndarray [size = (num_frames, )]) – A vector of zeros and ones indicating the frames / windows of the samples that either had voice activity or not.
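
Examples

A minimal sketch, where ‘speech.wav’ is a stand-in for a real audio file:

>>> import soundpy as sp
>>> vad_samples, vad_matrix = sp.feats.get_vad_samples('speech.wav', sr=48000,
...                                                    extend_window_ms=100)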

soundpy.feats.get_samples_clipped(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]

Returns samples and VAD matrix with beginning and ending silence removed.

Parameters
  • samples (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • extend_window_ms (int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired. (default 0)

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound.

Returns

  • samples_clipped (np.ndarray [size = (num_samples_clipped, )]) – The samples with beginning and ending silences clipped.

  • vad_matrix (np.ndarray [size = (num_frames, )]) – A vector with zeros and ones indicating which frames of the full signal have voice activity or not.

soundpy.feats.normalize(data, max_val=None, min_val=None)[source]

Normalizes data to be between 0 and 1. Should not be applied to raw sample data.

This is useful if you have predetermined max and min values you want to normalize new data with. Is helpful in training models on sound features (not raw samples).

Parameters
  • data (np.ndarray [size=(num_features,) or (num_frames,num_features)]) – Data to be normalized.

  • max_val (int or float, optional) – Predetermined maximum value. If None, will use max value from data.

  • min_val (int or float, optional) – Predetermined minimum value. If None, will use min value from data.

Returns

normed_data

Return type

np.ndarray [size = (num_features,) or (num_frames, num_features)]

Examples

>>> # using the min and max of a previous dataset:
>>> import numpy as np
>>> np.random.seed(0)
>>> input_samples = np.random.random_sample((5,))
>>> input_samples
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ])
>>> np.random.seed(40)
>>> previous_samples = np.random.random_sample((5,))
>>> previous_samples
array([0.40768703, 0.05536604, 0.78853488, 0.28730518, 0.45035059])
>>> max_prev = np.max(previous_samples)
>>> min_prev = np.min(previous_samples)
>>> output_samples = normalize(input_samples, min_val = min_prev, max_val = max_prev)
>>> output_samples
array([0.67303388, 0.89996095, 0.74661839, 0.66767314, 0.50232462])
soundpy.feats.plot_dom_freq(sound, energy_scale='power_to_db', title=None, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]

Plots the approximate dominant frequency over a STFT plot of a signal.

If sound has multiple channels, the dominant frequency for each channel is plotted in its own plot.

Parameters
  • sound (np.ndarray [shape=(num_samples,) or (num_samples, num_channels)]) – The sound to plot the dominant frequency of.

  • energy_scale (str) – The scale of energy for the plot. If in the frequency spectrum, the data is likely in power and needs to be converted to decibels. (default ‘power_to_db’)

  • title (str) – The title for the plot. (default None)

  • **kwargs (additional keyword arguments) – Keyword arguments used in both soundpy.feats.get_stft and soundpy.dsp.get_pitch.

Returns

Return type

None

soundpy.feats.plot_vad(sound, energy_scale='power_to_db', title='Voice Activity', use_beg_ms=120, extend_window_ms=0, beg_end_clipped=True, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]

Plots where voice (sound) activity detected on power spectrum.

This either plots immediately or saves the plot at name4pic. If sound has multiple channels, the VAD for each channel is plotted in its own plot.

Parameters
  • sound (np.ndarray [shape=(num_samples,) or (num_samples, num_channels)]) – The sound to plot the VAD of.

  • energy_scale (str) – If plotting STFT or power spectrum, will plot it in decibels. (default ‘power_to_db’)

  • title (str) – The title of the plot (default ‘Voice Activity’)

  • use_beg_ms (int) – The amount of noise to use at the beginning of the signal to measure VAD. This is only applied if beg_end_clipped is set to False.

  • extend_window_ms (int) – The number of milliseconds VAD should be padded. This is useful if one wants to encompass more speech if the VAD is not including all the speech / desired sound. However, this may capture more noise. (default 0)

  • beg_end_clipped (bool) – If True, just the silences at the beginning and end of the sample will be cut off. If False, VAD will be checked throughout the sample, not just at the beginning and end. NOTE: Both options have strengths and weaknesses. Checking VAD throughout the entire signal (i.e. when beg_end_clipped is set to False) is sometimes unreliable, failing to recognize speech in speech-filled samples; and when set to True, some speech sounds tend to get ignored (‘s’, ‘x’ and other fricatives).

  • save_pic (bool) – If True, the plot will be saved rather than plotted immediately.

  • name4pic (str) – The full pathway and filename to save the picture (as .png file). A file extension is expected. (default None)

  • overwrite (bool) – If False, a date tag will be added to name4pic if name4pic already exists. (default False)

  • **kwargs (keyword arguments) – Additional keyword arguments for soundpy.feats.get_speech_stft or soundpy.dsp.vad.

Returns

Return type

None

soundpy.feats.get_change_acceleration_rate(spectro_data)[source]

Gets first and second derivatives of spectral data.

This is useful particularly for speech recognition.

Parameters

spectro_data (np.ndarray [shape = (num_samples, num_features)]) – The spectral data from which the first and second derivatives are calculated.

Returns

  • delta (np.ndarray [shape = (num_samples, num_features)]) – The first order derivative of spectral data. Reflects rate of change in signal.

  • delta_delta (np.ndarray [shape = (num_samples, num_features)]) – The second order derivative of spectral data. Reflects rate of acceleration in signal.
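
Examples

A minimal sketch on random spectral data, standing in for e.g. 100 frames of 40 filterbank energies:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> spectro_data = np.random.random((100, 40))
>>> delta, delta_delta = sp.feats.get_change_acceleration_rate(spectro_data)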

soundpy.feats.get_mfcc_fbank(samples, feature_type='mfcc', sr=48000, win_size_ms=20, percent_overlap=0.5, num_filters=40, num_mfcc=40, fft_bins=None, window_function=None, zeropad=True, **kwargs)[source]

Collects fbank or mfcc features via python-speech-features (rather than librosa).

soundpy.feats.zeropad_features(feats, desired_shape, complex_vals=False)[source]

Applies zeropadding to a copy of feats.

soundpy.feats.reduce_num_features(feats, desired_shape)[source]

Limits the number of features of a copy of feats.

This is useful if you want the features to be a certain size, for training models for example.

soundpy.feats.adjust_shape(data, desired_shape, change_dims=None, complex_vals=None)[source]
soundpy.feats.reduce_dim(matrix, axis=0)[source]
soundpy.feats.featshape_new_subframe(feature_matrix_shape, new_frame_size, zeropad=True, axis=0, include_dim_size_1=False)[source]

Subdivides features from (num_frames, num_feats) to (num_subframes, new_frame_size, num_feats)

Parameters
  • feature_matrix_shape (tuple [size=(num_frames, num_features)]) – Feature matrix shape to be subdivided. Can be multidimensional.

  • new_frame_size (int) – The number of frames each new subframe should contain along axis.

  • zeropad (bool) – If True, frames that don’t completely fill a new_frame_size will be zeropadded. Otherwise, those frames will be discarded. (default True)

  • axis (int) – The axis where the new_frame_size should be applied. (default 0)

Returns

new_shape

Return type

tuple [size=(num_subframes, new_frame_size, num_feats)]
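
Examples

A minimal sketch; given the return shape documented above, a (7, 40) feature matrix sectioned with a new_frame_size of 3 and zeropadding should yield (3, 3, 40):

>>> import soundpy as sp
>>> new_shape = sp.feats.featshape_new_subframe((7, 40), new_frame_size=3,
...                                             zeropad=True)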

soundpy.feats.apply_new_subframe(feature_matrix, new_frame_size, zeropad=True, axis=0)[source]

Reshapes feature_matrix to allow for new_frame_size.

Note: Dimensions of feature_matrix must be at least 2 and can be up to 5, returning a matrix with one additional dimension.

Parameters
  • feature_matrix (np.ndarray [size(num_frames, num_features) ]) – Expects minimum 2D, maximum 5D matrix.

  • new_frame_size (int) – The number of frames each new subframe should contain along axis.

  • axis (int) – The axis to apply the new_frame_size. (default 0)

  • zeropad (bool) – If True, the feature_matrix will be zeropadded to include frames that do not fill entire frame_size, given the new_frame_size. If False, feature_matrix will not include the last zeropadded frame. (default True)

Returns

feats_reshaped – The feature_matrix with axis subdivided into 2 dimensions: the number of subframes and the new_frame_size.

Return type

np.ndarray [size(num_subframes, new_frame_size, num_features)]

Raises

ValueError – If the number of dimensions of feature_matrix is below 2 or exceeds 5.

Examples

>>> import numpy as np
>>> matrix = np.arange(24).reshape(3,4,2)
>>> # apply new_frame_size to dimension of length 4 (i.e. axis 1)
>>> matrix_zp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1)
>>> matrix_zp.shape
(3, 2, 3, 2)
>>> matrix_zp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 0,  0],
         [ 0,  0]]],


       [[[ 8,  9],
         [10, 11],
         [12, 13]],

        [[14, 15],
         [ 0,  0],
         [ 0,  0]]],


       [[[16, 17],
         [18, 19],
         [20, 21]],

        [[22, 23],
         [ 0,  0],
         [ 0,  0]]]])

>>> matrix_nozp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1,
...                                    zeropad=False)
>>> matrix_nozp.shape
(3, 1, 3, 2)
>>> matrix_nozp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]]],


       [[[ 8,  9],
         [10, 11],
         [12, 13]]],


       [[[16, 17],
         [18, 19],
         [20, 21]]]])

soundpy.feats.check_percent_overlap(percent_overlap)[source]

Ensures percent_overlap is between 0 and 1.

soundpy.feats.separate_dependent_var(matrix)[source]

Separates matrix into features and labels. Expects 3D array.

Assumes the last column of the last dimension of the matrix constitutes the dependent variable (labels), and all other columns the independent variables (features). Additionally, it is assumed that for each block of data, only one label is needed; therefore, just the first label is taken for each block.

Parameters

matrix (numpy.ndarray [size = (num_samples, num_frames, num_features)]) – The matrix holds the numerical data to separate. num_features is expected to be at least 2.

Returns

  • X (numpy.ndarray [size = (num_samples, num_frames, num_features -1)]) – A matrix holding the (assumed) independent variables

  • y (numpy.ndarray, numpy.int64, numpy.float64 [size = (num_samples,)]) – A vector holding the labels assigned to the independent variables. If only one value in array, just the value inside is returned

Examples

>>> import numpy as np
>>> #vector
>>> separate_dependent_var(np.array([1,2,3,4]))
(array([1, 2, 3]), 4)
>>> #simple matrix
>>> matrix = np.arange(4).reshape(2,2)
>>> matrix
array([[0, 1],
       [2, 3]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[0],
       [2]])
>>> y
1
>>> #more complex matrix
>>> matrix = np.arange(20).reshape((2,2,5))
>>> matrix
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[[ 0,  1,  2,  3],
        [ 5,  6,  7,  8]],

       [[10, 11, 12, 13],
        [15, 16, 17, 18]]])
>>> y
array([ 4, 14])
soundpy.feats.add_tensor(matrix)[source]

Adds tensor / dimension to input ndarray (e.g. features).

Keras requires an extra dimension at some layers, which represents the ‘tensor’ encapsulating the data.

For further clarification, take the example below. The input matrix has shape (2,3,4). Think of it as 2 different events, each having 3 sets of measurements, with each of those having 4 features. So, let’s measure differences between 2 cities at 3 different times of day. Let’s take measurements at 08:00, 14:00, and 19:00 in… Magic City and Never-ever Town. We’ll measure 1) temperature, 2) wind speed, 3) light level, and 4) noise level.

As best understood, putting our measurements into a matrix with an added dimension/tensor highlights the separate measurements, telling the algorithm: yes, these are 4 features from the same city, BUT they occur at different times. Or it’s just how Keras set up the code :P

Parameters

matrix (numpy.ndarray) – The matrix holds the numerical data to add a dimension to.

Returns

matrix – The matrix with an additional dimension.

Return type

numpy.ndarray

Examples

>>> import numpy as np
>>> matrix = np.arange(24).reshape((2,3,4))
>>> matrix.shape
(2, 3, 4)
>>> matrix
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
>>> matrix_2 = add_tensor(matrix)
>>> matrix_2.shape
(2, 3, 4, 1)
>>> matrix_2
array([[[[ 0],
         [ 1],
         [ 2],
         [ 3]],

        [[ 4],
         [ 5],
         [ 6],
         [ 7]],

        [[ 8],
         [ 9],
         [10],
         [11]]],


       [[[12],
         [13],
         [14],
         [15]],

        [[16],
         [17],
         [18],
         [19]],

        [[20],
         [21],
         [22],
         [23]]]])
soundpy.feats.scale_X_y(matrix, is_train=True, scalars=None)[source]

Separates and scales data into X and y arrays. Adds dimension for keras.

Assumes the last column of the last dimension is the y or label data.

Parameters
  • matrix (np.ndarray [size = (num_samples, num_frames, num_features)]) – Matrix with X and y data

  • is_train (bool) – Relevant for the scalars parameter. If the data is training data (i.e. True), the scalars will be created. If the data is test data (i.e. False), the function expects scalars to be provided. (default True)

  • scalars (dict, optional) – Dictionary with scalars to be applied to non-training data.

Returns

  • X (np.ndarray [size = (num_samples, num_frames, num_features-1, 1)]) – Scaled features with extra dimension

  • y (np.ndarray [size = (num_samples, 1, 1)]) – Scaled independent variable with extra dimension

  • scalars (dict) – The scalars either created or previously loaded.
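
Examples

A rough sketch of creating scalars on training data and reusing them on test data; the random matrices stand in for real feature data, with labels assumed in the last column:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> train = np.random.random((10, 5, 7))
>>> X_train, y_train, scalars = sp.feats.scale_X_y(train, is_train=True)
>>> test = np.random.random((4, 5, 7))
>>> X_test, y_test, scalars = sp.feats.scale_X_y(test, is_train=False,
...                                              scalars=scalars)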

soundpy.feats.list_available_features()[source]
soundpy.feats.get_feature_matrix_shape(sr=None, dur_sec=None, feature_type=None, win_size_ms=None, percent_overlap=None, fft_bins=None, num_mfcc=None, num_filters=None, rate_of_change=False, rate_of_acceleration=False, context_window=None, frames_per_sample=None, zeropad=True, labeled_data=False, remove_first_coefficient=False, real_signal=False, **kwargs)[source]

Returns expected shapes of feature matrix depending on several parameters.

Parameters
  • sr (int) – Sample rate of the audio to be extracted.

  • dur_sec (int, float) – The number of seconds of audio feature extraction will be applied to.

  • feature_type (str) – Accepted features include ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, ‘mfcc’. Which feature_type applied will influence the resulting shape of the feature matrix shape.

  • win_size_ms (int or float) – The size of the window the audio signal should be broken into. If feature_type is set to ‘signal’, this is irrelevant. Otherwise will raise TypeError if set to None.

  • percent_overlap (float) – The amount of overlap between windows. If set to 0.5, the number of overlapping samples will be half the number of samples that make up win_size_ms.

  • fft_bins (int) – The number of frequency bins to use when calculating the fast Fourier transform. If None, the calculated frame_length will be used.

  • num_mfcc (int) – If extracting ‘mfcc’ features, the total number of coefficients expected.

  • num_filters (int) – If extracting ‘fbank’ features, the total number of mel-filters to be applied.

  • rate_of_change (bool) – If True, the first delta will be concatenated to features extracted.

  • rate_of_acceleration (bool) – If True, the second delta will be concatenated to features extracted.

  • context_window (int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts.

  • frames_per_sample (int) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions and available in generator functions as context_window. frames_per_sample equals 2 * context_window + 1. See soundpy.models.dataprep.Generator

  • zeropad (bool) – If True, windows and frames will be zeropadded to avoid losing any sample data.

  • labeled_data (bool) – If True, a label will be added to the output shape of features.

  • remove_first_coefficient (bool) – If True, the first mfcc coefficient will not be included in feature matrix.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats. These may not be used in this function as they may not influence the size of the feature matrix.

Returns

  • feature_matrix_base (tuple) – The base shape of the feature matrix. This is the shape that should result from extracting the features for each audio file.

  • feature_matrix_model (tuple) – The shape relevant to training models. For example, one including space for a context window and label.
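
Examples

A sketch of how this might be called; the parameter values are illustrative, not defaults:

>>> import soundpy as sp
>>> base_shape, model_shape = sp.feats.get_feature_matrix_shape(
...     sr=16000, dur_sec=1, feature_type='fbank',
...     win_size_ms=25, percent_overlap=0.5, num_filters=40)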

soundpy.feats.visualize_feat_extraction(feats, iteration=None, dataset=None, label=None, datadir=None, subsections=False, **kwargs)[source]

Saves plots of features during feature extraction or training of models.

Parameters
  • feats (np.ndarray [shape=(num_samples,) or (num_samples, num_frames) or (num_frames, num_features) or (num_subsections, num_frames, num_features)]) – The extracted features can be raw signal data, stft, fbank, powspec, mfcc data, either as a single plot or subsectioned into batches / subframes.

  • iteration (int, optional) – The iteration of the audio getting extracted; e.g. the 10th training item.

  • dataset (str, optional) – The identifying string (for example ‘train’, ‘val’, or ‘test’, but this can be anything).

  • label (str, int, optional) – The label of the audio file. Used in titles and filenames.

  • datadir (str, pathlib.PosixPath, optional) – The directory where related data is located. An ‘image’ directory will be created within this datadir where the saved plots will be stored. If None, will be created in current working directory.

  • subsections (bool, optional) – Whether to subsection raw ‘signal’ data into frames. For other features, this is easier to identify via the shape of feats.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats

Returns

Return type

None
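
Examples

A sketch, assuming features were first extracted with soundpy.feats.get_feats; the filename and plot directory are placeholders:

>>> import soundpy as sp
>>> feats = sp.feats.get_feats('sound.wav', feature_type='fbank')
>>> # save a plot of the 10th 'train' item, labeled 2, inside ./feature_plots/
>>> sp.feats.visualize_feat_extraction(feats, iteration=10, dataset='train',
...     label=2, datadir='./feature_plots/')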

soundpy.feats.save_features_datasets(datasets_dict, datasets_path2save_dict, context_window=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, log_settings=True, decode_dict=None, random_seed=None, **kwargs)[source]

Extracts and saves audio features, sectioned into datasets, to indicated locations.

If MemoryError, the provided dataset dicts will be adjusted to allow data to be subsectioned.

Parameters
  • datasets_dict (dict) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {‘train’:[‘1.wav’, ‘2.wav’, ‘3.wav’], ‘val’: [‘4.wav’], ‘test’:[‘5.wav’]} for unlabeled data or {‘train’:[(0, ‘1.wav’), (1, ‘2.wav’), (0, ‘3.wav’)], ‘val’: [(1, ‘4.wav’)], ‘test’:[(0, ‘5.wav’)]} for labeled data.

  • datasets_path2save_dict (dict) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}

  • context_window (int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts, to feed to a long short-term memory network (LSTM), for example. (Can avoid this by simply reshaping data later.)

  • frames_per_sample (int) – The deprecated keyword argument for segmenting audio into smaller parts. Will be removed in future versions. This equals 2 * context_window + 1.

  • labeled_data (bool) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.

  • subsection_data (bool) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.

  • divide_factor (int, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)

  • visualize (bool) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)

  • vis_every_n_frames (int) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)

  • log_settings (bool) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)

  • decode_dict (dict, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.

Returns

  • datasets_dict (dict) – The final dataset dictionary used in feature extraction. The datasets may have been subdivided.

  • datasets_path2save_dict (dict) – The final dataset feature pathway dict. The pathways will have been adjusted if the datasets have been subdivided.

See also

soundpy.feats.get_feats

Extract features from audio file or audio data.
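
Examples

A sketch of the expected workflow; the file lists and save paths are hypothetical, and the feature settings are passed on to soundpy.feats.get_feats:

>>> import soundpy as sp
>>> datasets_dict = {'train': ['1.wav', '2.wav'], 'val': ['3.wav'], 'test': ['4.wav']}
>>> path2save_dict = {'train': './data/train.npy', 'val': './data/val.npy',
...     'test': './data/test.npy'}
>>> datasets_dict, path2save_dict = sp.feats.save_features_datasets(
...     datasets_dict, path2save_dict, feature_type='fbank', sr=16000,
...     dur_sec=1, win_size_ms=25, percent_overlap=0.5)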

soundpy.feats.save_features_datasets_zipfiles(datasets_dict, datasets_path2save_dict, extract_dir, dur_sec, feature_type='fbank', num_feats=None, sr=22050, win_size_ms=20, percent_overlap=0.5, n_fft=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, use_librosa=True, center=True, mode='reflect', log_settings=True, decode_dict=None, audiofile_lim=10, **kwargs)[source]

Extracts and saves audio features, sectioned into datasets, to indicated locations.

If MemoryError, the provided dataset dicts will be adjusted to allow data to be subsectioned.

Parameters
  • datasets_dict (dict) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {‘train’:[‘1.wav’, ‘2.wav’, ‘3.wav’], ‘val’: [‘4.wav’], ‘test’:[‘5.wav’]} for unlabeled data or {‘train’:[(0, ‘1.wav’), (1, ‘2.wav’), (0, ‘3.wav’)], ‘val’: [(1, ‘4.wav’)], ‘test’:[(0, ‘5.wav’)]} for labeled data.

  • datasets_path2save_dict (dict) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}

  • feature_type (str) – String including only one of the following: ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, or ‘mfcc’; additional text is allowed, e.g. ‘mfcc_noisy’ or ‘stft_train’. ‘signal’ currently only supports mono channel data (TODO: test for stereo). ‘powspec’ and ‘stft’ are basically the same; ‘powspec’ is the ‘stft’ except without complex values and squared.

  • sr (int) – The sample rate the audio data should be loaded with.

  • n_fft (int) – The number of frequency bins used for the Fast Fourier Transform (fft)

  • dur_sec (int or float) – The desired duration of the audio data in seconds. This is used to calculate the size of the feature data and is therefore necessary, as audiofiles tend to differ in length. If audiofiles are longer or shorter, they will be cut or zeropadded respectively.

  • num_feats (int) – The number of mfcc coefficients (mfcc), mel filters (fbank), or frequency bins (stft).

  • win_size_ms (int) – The desired window size in milliseconds to process audio samples.

  • percent_overlap (float) – The amount audio samples should overlap as each window is processed.

  • frames_per_sample (int, optional) – If you want to section each audio file feature data into smaller frames. This might be useful for speech related contexts. (Can avoid this by simply reshaping data later)

  • labeled_data (bool) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.

  • subsection_data (bool) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.

  • divide_factor (int, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)

  • visualize (bool) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)

  • vis_every_n_frames (int) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)

  • use_librosa (bool) – If True, librosa is used to load and extract features. As of now, no other option is available (TODO: add other options); note that some elements of this function are unique to using librosa. (default True)

  • center (bool) – Relevant for librosa and feature extraction. (default True)

  • mode (str) – Relevant for librosa and feature extraction. (default ‘reflect’)

  • log_settings (bool) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)

  • decode_dict (dict, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.

Returns

  • datasets_dict (dict) – The final dataset dictionary used in feature extraction. The datasets may have been subdivided.

  • datasets_path2save_dict (dict) – The final dataset feature pathway dict. The pathways will have been adjusted if the datasets have been subdivided.

See also

soundpy.feats.get_feats

Extract features from audio file or audio data.

soundpy.feats.prep_new_audiofeats(feats, desired_shape, input_shape)[source]

Prepares new audio data to feed to a pre-trained model.

Parameters
  • feats (np.ndarray [shape = (num_frames, num_features)]) – The features to prepare for feeding to a model.

  • desired_shape (tuple) – The expected number of samples necessary to fulfill the expected input_shape for the model. The feats will be zeropadded or limited to match this desired_shape.

  • input_shape (tuple) – The input_shape the model expects a single sample of data to be.

Returns

feats_reshaped – The features reshaped to what the model expects.

Return type

np.ndarray [shape = (input_shape)]
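
Examples

A sketch with hypothetical shapes, assuming a model that expects single samples of shape (10, 40, 1):

>>> import numpy as np
>>> import soundpy as sp
>>> feats = np.random.rand(8, 40)  # e.g. 8 frames, 40 features
>>> # zeropad to 10 frames and add the channel dimension the model expects
>>> feats_ready = sp.feats.prep_new_audiofeats(feats, desired_shape=(10, 40),
...     input_shape=(10, 40, 1))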

soundpy.feats.feats2audio(feats, feature_type, sr, win_size_ms, percent_overlap, phase=None)[source]

Prepares features into audio playable format.

Parameters
  • feats (np.ndarray [shape = (num_frames, num_feats)]) – If the features are a signal, [size = (batch_size * num_frames * num_features, 1)]. Otherwise [size = (batch_size * num_frames, num_features)].

  • feature_type (str) – Either ‘stft’, ‘fbank’, ‘signal’, or ‘mfcc’. For the ‘signal’ feature, only mono channel is supported.

  • sr (int) – Sampling rate that the features were extracted with.

  • win_size_ms (int) – The window size in milliseconds the features were extracted with.

  • percent_overlap (float) – The percent overlap between windows.

  • phase (np.ndarray [shape = (num_frames, num_feats)], optional) – The original phase information of the reconstructed signal.

Returns

y – The reconstructed signal in samples.

Return type

np.ndarray [shape = (num_samples, )]
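
Examples

A sketch of a feature-to-audio round trip; the filename and settings are placeholders, and it assumes ‘stft’ features (which retain phase information) extracted via soundpy.feats.get_feats:

>>> import soundpy as sp
>>> feats = sp.feats.get_feats('sound.wav', sr=22050, feature_type='stft',
...     win_size_ms=20, percent_overlap=0.5)
>>> y = sp.feats.feats2audio(feats, feature_type='stft', sr=22050,
...     win_size_ms=20, percent_overlap=0.5)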

soundpy.feats.grayscale2color(image_matrix, colorscale=3)[source]

Expects grayscale image. Copies first channel into additional channels.

This is useful for pre-trained models that require features to have rgb channels, not grayscale. Assumes the last channel is the colorscale column.
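
Examples

A sketch with a hypothetical grayscale feature matrix, its single channel as the last dimension:

>>> import numpy as np
>>> import soundpy as sp
>>> gray = np.random.rand(79, 40, 1)
>>> color = sp.feats.grayscale2color(gray, colorscale=3)
>>> # the single channel is copied into 3 channels: expect shape (79, 40, 3)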

Template deep neural networks

The models.template_models module contains functions for building (ideally research-based) models.

soundpy.models.template_models.adjust_layers_cnn(**kwargs)[source]

Reduces layers of CNN until the model can be built.

If the number of filters for ‘mfcc’ or ‘fbank’ is in the lower range (i.e. 13 or so), this causes issues with the default settings of the cnn architecture. The architecture was built with at least 40 filters being applied during feature extraction. To deal with this problem, the number of CNN layers is reduced.

Parameters

**kwargs (Keyword arguments) – Keyword arguments for soundpy.models.template_models.cnn_classifier

Returns

settings – Updated dictionary with relevant settings for model.

Return type

dict

References

https://github.com/pgys/NoIze

soundpy.models.template_models.cnn_classifier(feature_maps=[40, 20, 10], kernel_size=[(3, 3), (3, 3), (3, 3)], strides=2, activation_layer='relu', activation_output='softmax', input_shape=(79, 40, 1), num_labels=3, dense_hidden_units=100, dropout=0.25)[source]

Build a single or multilayer convolutional neural network.

Parameters
  • feature_maps (int or list) – The filter or feature map applied to the data. One feature map per convolutional neural layer required. For example, a list of length 3 will result in a three-layer convolutional neural network.

  • kernel_size (tuple or list of tuples) – Must match the number of feature_maps. The size of each corresponding feature map.

  • strides (int) –

  • activation_layer (str) – (default ‘relu’)

  • activation_output (str) – (default ‘softmax’)

  • input_shape (tuple) – The shape of the input

  • dense_hidden_units (int, optional) –

  • dropout (float, optional) – Reduces overfitting

Returns

  • model (tf.keras.Model) – Model ready to be compiled.

  • settings (dict) – Dictionary with relevant settings for model.

Warning

If the number of features is not compatible with the number of layers, a warning is raised and the layers are adjusted. E.g. for a lower number of MFCC features, this will likely be applied if the number of layers is greater than 1.

References

A. Sehgal and N. Kehtarnavaz, “A Convolutional Neural Network Smartphone App for Real-Time Voice Activity Detection,” in IEEE Access, vol. 6, pp. 9017-9026, 2018.
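
Examples

A sketch of building and compiling the model; the compile settings are illustrative Keras choices, not prescribed by soundpy:

>>> import soundpy as sp
>>> model, settings = sp.models.template_models.cnn_classifier(
...     input_shape=(79, 40, 1), num_labels=3)
>>> model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
...     metrics=['accuracy'])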

soundpy.models.template_models.autoencoder_denoise(input_shape, kernel_size=(3, 3), max_norm_value=2.0, activation_function_layer='relu', activation_function_output='sigmoid', padding='same', kernel_initializer='he_uniform')[source]

Build a simple autoencoder denoiser.

Parameters
  • input_shape (tuple) – Shape of the input data.

  • max_norm_value (int or float) –

Returns

autoencoder – Model ready to be compiled

Return type

tf.keras.Model

References

Versloot, Christian (2019, December 19). Creating a Signal Noise Removal Autoencoder with Keras. MachineCurve. https://www.machinecurve.com
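
Examples

A sketch; the input shape and compile settings are illustrative:

>>> import soundpy as sp
>>> autoencoder = sp.models.template_models.autoencoder_denoise(
...     input_shape=(79, 40, 1))
>>> autoencoder.compile(optimizer='adam', loss='binary_crossentropy')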

soundpy.models.template_models.resnet50_classifier(input_shape, num_labels, activation='softmax', final_layer_name='features')[source]

Simple image classifier built on top of a pretrained ResNet50 model.

References

Revay, S. & Teschke, M. (2019). Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals. arXiv:1905.04348 [cs.SD]

soundpy.models.template_models.cnnlstm_classifier(num_labels, input_shape, lstm_cells, feature_map_filters=32, kernel_size=(8, 4), pool_size=(3, 3), dense_hidden_units=60, activation_layer='relu', activation_output='softmax', dropout=0.25)[source]

Model architecture inspired by the paper below.

References

Kim, Myungjong & Cao, Beiming & An, Kwanghoon & Wang, Jun. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. 10.21437/interspeech.2018-2250.

Additional model setup (e.g. Early Stopping)

The models.modelsetup module contains functionality for preparing to train a model.

soundpy.models.modelsetup.setup_layers(num_features, num_layers, kernel_shape=(3, 3), max_feature_map=64)[source]

Sets up feature_maps and kernels for 1 or more layered convolutional neural networks.

Parameters
  • num_features (int) – The number of features used to train the model. This will be used to set the number of feature_maps for each layer.

  • num_layers (int) – The number of layers desired

  • kernel_shape (tuple or int) – The shape of the desired kernel

  • max_feature_map (int) – The maximum size of feature map / filter. This depends on the system and is relevant for processing higher definition features, such as STFT features. If this is set too large given memory constraints, training may be ‘killed’.

Returns

  • feature_maps (list) – List of feature maps or filters that will be applied to each layer of the network.

  • kernels (list) – List of kernels that will be applied to each layer of the network. Matches length of feature_maps

Warning

A UserWarning is raised if num_features is larger than max_feature_map. The num_features is usually used to set the first feature map, but if too large, it will be reduced to below max_feature_map.
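
Examples

A sketch of deriving settings for a hypothetical three-layer network trained on 40 features:

>>> import soundpy as sp
>>> feature_maps, kernels = sp.models.modelsetup.setup_layers(
...     num_features=40, num_layers=3, kernel_shape=(3, 3), max_feature_map=64)
>>> # feature_maps and kernels can then be passed on to a model builder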

soundpy.models.modelsetup.setup_callbacks(early_stop=True, patience=15, log=True, log_filename=None, append=True, save_bestmodel=True, best_modelname=None, monitor='val_loss', verbose=1, save_best_only=True, mode='min', tensorboard=True, write_images=False, x_test=None, y_test=None, batch_size=None, embedded_layer_name=None)[source]

Easy set up of early stopping, model logging, and saving best model.

Parameters
  • early_stop (bool) – Whether or not the model should stop if training is not improving (default True)

  • patience (int) – The number of epochs the model should complete without improvement before stopping training. (default 15)

  • log (bool) – If true, the accuracy, loss, and (if possible) the val_accuracy and val_loss for each epoch will be saved in a .csv file. (default True)

  • log_filename (str or pathlib.PosixPath, optional) – The filename where the logging information will be stored. If None, the date will be used as a unique .csv filename in a subfolder ‘model_logs’ in the local directory.

  • save_bestmodel (bool) – If True, the best performing model will be saved.

  • best_modelname (str or pathlib.PosixPath) – The name to save the best model version under. If None, the date will be used to create a unique .h5 filename and it will be saved in a subfolder ‘best_models’ in the local directory.

  • monitor (str) – The metric to be used to measure model performance. (default ‘val_loss’)

  • verbose (bool) – If True, the state of the model will be printed. (default True)

  • save_best_only (bool) – If True, the best performing model will overwrite any previously saved ‘best model’.

  • mode (str) – If monitor is set to ‘val_loss’, this should be set to ‘min’. If monitor is set to ‘val_acc’, this should be set to ‘max’. If mode is set to ‘auto’, the direction will be inferred. (default ‘min’)

  • tensorboard (bool) – If True, logs for TensorBoard will be made.

Returns

callbacks – The callbacks ready to be applied to Keras model training.

Return type

list of Keras callback objects
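
Examples

A sketch; the filenames are placeholders:

>>> import soundpy as sp
>>> callbacks = sp.models.modelsetup.setup_callbacks(early_stop=True,
...     patience=10, log_filename='model_log.csv',
...     best_modelname='best_model.h5')
>>> # model.fit(..., callbacks=callbacks)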

Feeding large datasets to models

The models.dataprep module covers functionality for feeding features to models.

class soundpy.models.dataprep.Generator(data_matrix1, data_matrix2=None, timestep=None, axis_timestep=0, normalize=True, apply_log=False, context_window=None, axis_context_window=-2, labeled_data=False, gray2color=False, zeropad=True, desired_input_shape=None, combine_axes_0_1=False)[source]

Bases: object

Methods

generator()

Shapes, norms, and feeds data depending on labeled or non-labeled data.

__init__(data_matrix1, data_matrix2=None, timestep=None, axis_timestep=0, normalize=True, apply_log=False, context_window=None, axis_context_window=-2, labeled_data=False, gray2color=False, zeropad=True, desired_input_shape=None, combine_axes_0_1=False)[source]

This generator pulls data out in sections (i.e. batch sizes). Prepared for 3 dimensional data.

Note: Keras adds a dimension to input to represent the “Tensor” that handles the input. This means that sometimes you have to add a shape of (1,) to the shape of the data.

Parameters
  • data_matrix1 (np.ndarray [size=(num_samples, batch_size, num_frames, num_features) or (num_samples, num_frames, num_features+label_column)]) – The training data. This can contain the feature and label data or just the input feature data.

  • data_matrix2 (np.ndarray [size = data_matrix1.shape], optional) – Either label data for data_matrix1 or, for example, the clean version of data_matrix1 if training an autoencoder. (default None)

  • normalize (bool) – If False, the data has already been normalized and won’t be normalized by the generator. (default True)

  • apply_log (bool) – If True, log will be applied to the data.

  • timestep (int) – The number of frames to constitute a timestep.

  • axis_timestep (int) – The axis to apply the timestep to. (default 0)

  • context_window (int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occuring in the signal, e.g. to break up the image of sound into smaller parts.

  • axis_context_window (int) – The axis to apply the context_window to, if context_window is not None. Ideally this should be the axis preceding the feature column. (default -2)

  • zeropad (bool) – If features should be zeropadded in reshaping functions.

  • desired_input_shape (int or tuple, optional) – The desired number of features or shape of data to feed a neural network. If type int, only the last column of features will be adjusted (zeropadded or limited). If tuple, the entire data shape will be adjusted (all columns). If the int or shape is larger than that of the data provided, data will be zeropadded. If the int or shape is smaller, the data will be restricted. (default None)

generator()[source]

Shapes, norms, and feeds data depending on labeled or non-labeled data.
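
Examples

A sketch of feeding labeled 3-dimensional data; the toy data shape is hypothetical (40 features plus a label column per frame):

>>> import numpy as np
>>> import soundpy as sp
>>> data = np.random.rand(20, 10, 41)  # 20 samples, 10 frames, 40 feats + label
>>> gen = sp.models.dataprep.Generator(data, labeled_data=True,
...     desired_input_shape=(10, 40, 1))
>>> # model.fit(gen.generator(), steps_per_epoch=len(data), ...)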


class soundpy.models.dataprep.GeneratorFeatExtraction(datalist, datalist2=None, model_name=None, normalize=True, apply_log=False, randomize=True, random_seed=None, desired_input_shape=None, timestep=None, axis_timestep=0, context_window=None, axis_context_window=-2, batch_size=1, gray2color=False, visualize=False, vis_every_n_items=50, visuals_dir=None, decode_dict=None, dataset='train', augment_dict=None, label_silence=False, vad_start_end=False, **kwargs)[source]

Bases: soundpy.models.dataprep.Generator

Methods

generator()

Extracts features and feeds them to model according to desired_input_shape.

generator()[source]

Extracts features and feeds them to model according to desired_input_shape.

soundpy.models.dataprep.check4na(numpyarray)[source]
soundpy.models.dataprep.randomize_augs(aug_dict, random_seed=None)[source]

Creates a copy of the dict and randomly chooses which augmentations are applied.

A random seed can be applied for the number of augmentations applied and for shuffling the order of possible augmentations.

soundpy.models.dataprep.augment_features(sound, sr, add_white_noise=False, snr=[5, 10, 20], speed_increase=False, speed_decrease=False, speed_perc=0.15, time_shift=False, shufflesound=False, num_subsections=3, harmonic_distortion=False, pitch_increase=False, pitch_decrease=False, num_semitones=2, vtlp=False, bilinear_warp=True, augment_settings_dict=None, random_seed=None)[source]

Randomly applies augmentations to audio. If no augment_settings_dict is provided, defaults are applied.

soundpy.models.dataprep.get_input_shape(kwargs_get_feats, labeled_data=False, frames_per_sample=None, use_librosa=True, mode='reflect')[source]
soundpy.models.dataprep.make_gen_callable(_gen)[source]

Prepares Python generator for tf.data.Dataset.from_generator

Bug fix: Python generators fail to work in TensorFlow 2.2.0+.

Parameters

_gen (generator) – The generator function to feed to a deep neural network.

Returns

  • x (np.ndarray [shape=(batch_size, num_frames, num_features, 1)]) – The feature data

  • y (np.ndarray [shape=(1,1)]) – The label for the feature data.

References

Shu, Nicolas (2020) https://stackoverflow.com/a/62186572 CC BY-SA 4.0
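
Examples

A sketch following the referenced Stack Overflow pattern; the data shape and output types are assumptions:

>>> import numpy as np
>>> import tensorflow as tf
>>> import soundpy as sp
>>> data = np.random.rand(20, 10, 41)
>>> gen = sp.models.dataprep.Generator(data, labeled_data=True)
>>> dataset = tf.data.Dataset.from_generator(
...     sp.models.dataprep.make_gen_callable(gen.generator()),
...     output_types=(tf.float32, tf.float32))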

Other useful non-specific functionality

The utils module covers functions that are useful for soundpy but are not directly related to sound data.

soundpy.utils.path_or_samples(input_value)[source]

Checks whether input_value is a path or sample data. Does not check path validity.

This is useful for functions that take both pathways to audio as well as pre-loaded audio data.

Parameters

input_value (str, pathlib.PosixPath, tuple [size = ((samples,), sr)], or np.ndarray [size = (samples,)]) – The pathway or audio sample data to check.

Returns

‘path’ or ‘samples’

Return type

str

Examples

>>> import numpy as np
>>> # create some example samples and sample rate
>>> samples = np.array([1,2,3,2,1,0])
>>> sr = 5
>>> path_or_samples( (samples, sr) )
'samples'
>>> # expects both audio samples and sr
>>> path_or_samples(samples)
TypeError: The input for `path_or_samples` expected a str, pathlib.PosixPath, or tuple with samples and sample rate, not type <class 'numpy.ndarray'>
>>> # create example string pathway
>>> path_or_samples('my_audio.wav')
'path'
>>> # create pathlib.PosixPath object
>>> import pathlib
>>> path_or_samples(pathlib.Path('my_audio.wav'))
'path'
soundpy.utils.get_default_args(func)[source]

References

stackoverflow answer by mgilson: link: https://stackoverflow.com/a/12627202 license: https://creativecommons.org/licenses/by-sa/3.0/

soundpy.utils.match_dtype(array1, array2)[source]

Match the dtype of the first array to that of the second.

Parameters
  • array1 (np.ndarray) – The numpy array with the datatype to be adjusted and returned.

  • array2 (np.ndarray) – The numpy array with the original or desired datatype.

Returns

array1 – The array1 with the dtype of array2

Return type

np.ndarray
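
Examples

A minimal sketch:

>>> import numpy as np
>>> import soundpy as sp
>>> a = np.array([1, 2, 3])          # e.g. int64
>>> b = np.array([0.1, 0.2, 0.3])    # e.g. float64
>>> a_matched = sp.utils.match_dtype(a, b)  # expect a with the dtype of b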

soundpy.utils.get_date()[source]

Get a string containing month, day, hour, minute, second and millisecond.

This is useful for creating a unique filename.

Parameters

None

Returns

time_str – A string containing the date and time.

Return type

str

Examples

>>> date = get_date()
>>> date
'6m18d1h16m32s295ms'
soundpy.utils.check_dir(directory, make=True, append=True)[source]

Checks if directory exists and creates it if indicated.

Parameters
  • directory (str or pathlib.PosixPath) – The directory of interest

  • make (bool) – Whether or not the directory should be created or just checked to ensure it exists. (default True)

  • append (bool) – If True, if a directory with the same name exists, new items will be saved into the old directory. Otherwise, an error will be raised. (default True)

Returns

directory – If a directory could be created or confirmed to exist, the directory path will be returned. Otherwise Errors will be raised.

Return type

pathlib.PosixPath
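
Examples

A sketch; the directory name is a placeholder:

>>> import soundpy as sp
>>> # returns the path as a pathlib object, creating './output/' if needed
>>> output_dir = sp.utils.check_dir('./output/', make=True)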

soundpy.utils.create_nested_dirs(directory)[source]

Creates directory even if several parent directories don’t exist.

Parameters

directory (str, pathlib.PosixPath) – The directory to be created.

Returns

directory – If successful, the directory path that has been created.

Return type

pathlib.PosixPath

Examples

>>> # First an unsuccessful creation of nested directory
>>> import os
>>> new_dir = './testdir/testdir/testdir/'
>>> os.mkdir(new_dir)
FileNotFoundError: [Errno 2] No such file or directory: './testdir/testdir/testdir/'
>>> # try again with create_nested_dirs()
>>> directory = create_nested_dirs(new_dir)
>>> directory
PosixPath('testdir/testdir/testdir')
soundpy.utils.string2pathlib(pathway_string)[source]

Turns string path into pathlib.PosixPath object.

This is useful when working with pathways from varying operating systems. Windows, Linux, and Mac have different ways of organizing pathways, and pathlib turns strings from these different systems into a pathlib object that can be understood by the software regardless of the system. (At least I hope so.)

Parameters

pathway_string (str or pathlib.PosixPath) – The pathway to be turned into a pathlib object, if need be.

Returns

pathway_string – The pathway as a pathlib object.

Return type

pathlib.PosixPath

Examples

>>> pathway = 'folder/way2go.txt'
>>> pathlib_pathway = string2pathlib(pathway)
>>> pathlib_pathway
PosixPath('folder/way2go.txt')
soundpy.utils.restore_dictvalue(value_string)[source]

Takes dict value and converts it to its original type.

When loading a dictionary from a .csv file, the values are strings. This function handles integers, floats, tuples, and some strings. It has also been adapted to handle a list of audio files or a list of pathlib.PosixPath objects.

Warning: no extensive testing has been completed for this function. It might not handle all value types as expected.

Parameters

value_string (str) – The dictionary value that was converted into a string object.

Returns

value_original_type – The value converted back to its original type.

Return type

list, int, tuple, string, float, etc.

Raises

ValueError – If passed a nested list of pathlib.PosixPath objects.

Examples

>>> input_string = "[PosixPath('data/audio/vacuum/vacuum1.wav')]"
>>> type(input_string)
<class 'str'>
>>> typelist = restore_dictvalue(input_string)
>>> typelist
[PosixPath('data/audio/vacuum/vacuum1.wav')]
>>> type(typelist)
<class 'list'>
>>> # Get type of the object
>>> type(typelist[0])
pathlib.PosixPath
>>> # Example with a list of tuples, i.e. label and audio file pairs:
>>> input_string = "[(2, PosixPath('data/audio/vacuum/vacuum1.wav')), '+        '(1, PosixPath('data/audio/vacuum/vacuum2.wav'))]"
>>> labelaudio_pairs = string2list(input_string)
>>> labelaudio_pairs
[(2, PosixPath('data/audio/vacuum/vacuum1.wav')),
(1, PosixPath('data/audio/vacuum/vacuum2.wav'))]
>>> type(labelaudio_pairs)
list
>>> type(labelaudio_pairs[0])
tuple
>>> type(labelaudio_pairs[0][0])
int
>>> type(labelaudio_pairs[0][1])
pathlib.PosixPath
soundpy.utils.adjust_time_units(time_sec)[source]

Turns seconds into relevant time units.

This is useful if measuring time of a process and that process takes longer than a couple minutes.

Parameters

time_sec (int, float) – The amount of time measured in seconds.

Returns

  • total_time (int, float) – The total amount of time

  • units (str) – The units of total_time: ‘seconds’, ‘minutes’, or ‘hours’.

Examples

>>> adjust_time_units(5)
(5, 'seconds')
>>> adjust_time_units(500)
(8.333333333333334, 'minutes')
>>> adjust_time_units(5000)
(1.3888888888888888, 'hours')
soundpy.utils.print_progress(iteration, total_iterations, task=None)[source]

Prints the status of a process based on iteration number.

Assumes the iteration starts at 0 rather than 1.

Parameters
  • iteration (int) – The iteration of the current process.

  • total_iterations (int) – The total iterations to be completed.

  • task (str, optional) – The relevant task of the process.

Returns

Return type

None. The progress is printed to sys.stdout and flushed.

Examples

>>> print_progress(4, 10)
50% through current task
>>> print_progress(4, 10, task = 'testing')
50% through testing
soundpy.utils.check_extraction_variables(sr=None, feature_type=None, win_size_ms=None, percent_overlap=None)[source]

Checks to ensure extraction variables are compatible.

Parameters
  • sr (int) – The sample rate of audio.

  • feature_type (str) – The type of feature to be extracted: ‘fbank’, ‘stft’, ‘powspec’, ‘mfcc’, ‘signal’.

  • win_size_ms (int, float) – The window size to process audio samples.

  • percent_overlap (int, float) – The percent windows should overlap.

Returns

Return type

None

Raises

ValueError – If any of the parameters are not compatible.

Examples

>>> check_extraction_variables(sr=48000, feature_type='signal', win_size_ms=25, percent_overlap=0.5)
>>> check_extraction_variables(sr='48000', feature_type='sig', win_size_ms='25', percent_overlap='0.5')
ValueError: Sampling rate (sr) must be of type int, not 48000 of type <class 'str'>.
soundpy.utils.check_noisy_clean_match(noisyfilename, cleanfilename)[source]

Checks if the clean filename is inside of the noisy filename.

This may be helpful to check that two audiofile datasets (a noisy and clean dataset) are aligned.

soundpy.utils.audiofile_length_match(filename1, filename2)[source]

Checks that two audiofiles have the same length.

This may be useful if you have clean and noisy audiofiles that should be the same length.

Parameters
  • filename1 (str or pathlib.PosixPath) – The first audiofile to compare.

  • filename2 (str or pathlib.PosixPath) – The second audiofile to compare.

Returns

True if the lengths match, False if not.

Return type

bool

Warning

UserWarning

If the sample rates of the audio files do not match.

UserWarning

If the lengths of the files do not match.

soundpy.utils.save_dict(filename, dict2save, overwrite=False)[source]

Saves dictionary as csv file to indicated path and filename.

Ensures pathlib objects are turned into strings. Warning: not thoroughly tested.

Parameters
  • filename (str) – The path and name to save the dictionary under. If ‘.csv’ extension is not given, it is added.

  • dict2save (dict) – The dictionary that is to be saved

  • overwrite (bool, optional) – Whether or not the saved dictionary should overwrite a preexisting file (default False)

Returns

path – The path where the dictionary was saved

Return type

pathlib.PosixPath

soundpy.utils.load_dict(csv_path)[source]

Loads a dictionary from csv file. Expands csv limit if too large.

Increasing the csv limit helps if loading dictionaries with very long audio file path lists. For example, see the soundpy.datasets.audio2datasets function.
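
Examples

A sketch of a save / load round trip with save_dict; the filename is a placeholder. Note that values loaded from .csv come back as strings, which soundpy.utils.restore_dictvalue can help convert:

>>> import soundpy as sp
>>> settings = {'sr': 16000, 'feature_type': 'fbank'}
>>> path = sp.utils.save_dict('settings.csv', settings, overwrite=True)
>>> loaded = sp.utils.load_dict(path)
>>> # values are loaded as strings, e.g. loaded['sr'] == '16000'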

Customized Errors

The soundpy.exceptions module includes customized errors.

soundpy.exceptions.notsufficientdata_error(numtrain, numval, numtest, expected_numtrain)[source]
soundpy.exceptions.numfeatures_incompatible_templatemodel()[source]