SoundPy Functionality v0.1.0a3
Built-In Functionality (non-Deep Learning)
The soundpy.builtin module includes higher-level functions that combine several other functions to complete fairly complex tasks, such as dataset formatting, filtering signals, and extracting features for neural networks.
soundpy.builtin.filtersignal(audiofile, sr=None, noise_file=None, filter_type='wiener', filter_scale=1, apply_postfilter=False, duration_noise_ms=120, real_signal=False, phase_radians=True, num_bands=None, visualize=False, visualize_every_n_windows=50, max_vol=0.4, min_vol=0.15, save2wav=False, output_filename=None, overwrite=False, use_scipy=False, remove_dc=True, control_vol=False, **kwargs)

Applies a Wiener or band spectral subtraction filter to a signal using noise. The noise can be provided as a separate file or as samples, or it can be taken from the beginning of the provided audio. How much noise is measured can be set with the parameter duration_noise_ms.

Parameters
- audiofile (str or np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – Filename or the audio data of the signal to be filtered.
- sr (int) – The sample rate of the audio. If audiofile is of type np.ndarray, sr is required. (default None)
- noise_file (str or tuple, optional) – Path to either a noise audiofile or a .npy file containing average power spectrum values. If tuple, must include samples and sr. If None, the beginning of the audiofile will be used for noise data. (default None)
- filter_type (str) – Type of filter to apply. Options: 'wiener' or 'band_specsub'.
- filter_scale (int or float) – The scale at which the filter should be applied. This value is multiplied with the noise levels, thereby increasing or decreasing the filter strength. (default 1)
- apply_postfilter (bool) – Whether or not the post filter should be applied. The post filter reduces musical noise (i.e. distortion) produced in the signal as a byproduct of filtering.
- duration_noise_ms (int or float) – The amount of time in milliseconds of noise to apply Welch's method to; in other words, how much of the noise to use when approximating the average noise power spectrum.
- real_signal (bool) – If True, only half of the (mirrored) fast Fourier transform will be used during filtering. For audio, this makes no difference; it is visible in the plots, however, if you are interested. (default False)
- phase_radians (bool) – Relevant for band spectral subtraction: whether phase should be calculated in radians or in complex values / power spectrum. (default True)
- num_bands (int) – Relevant for band spectral subtraction: the number of bands to section frequencies into. By grouping sections of frequencies during spectral subtraction filtering, musical noise or distortion should be reduced. (defaults to 6)
- visualize (bool) – If True, plots of the windows and the filtered signal will be made. (default False)
- visualize_every_n_windows (int) – If visualize is set to True, this controls how often plots are made: every 50 windows, for example. (default 50)
- max_vol (int or float) – The maximum volume level of the filtered signal. This is useful if you know you do not want the signal to be louder than a certain value. Ears are important. (default 0.4) TODO: improve on matching volume to the original signal, at least using objective measures.
- min_vol (int or float) – The minimum volume level of the filtered signal. (default 0.15) TODO: improve on matching volume to the original signal.
- save2wav (bool) – If True, the filtered signal will be saved as a .wav file.
- output_filename (str or pathlib.PosixPath, optional) – Path and name the filtered signal is to be saved under. If no filename is provided, the file will be saved under the date. (default None)
- overwrite (bool) – If True and an audiofile by the same name exists, that file will be overwritten.
- use_scipy (bool) – If False, audiofiles will be loaded using librosa; otherwise, scipy.io.wavfile. (default False)
- remove_dc (bool) – If True, the DC bias ('direct current' bias) will be removed; in other words, the mean amplitude will be made to equal 0.
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.filters.WienerFilter or soundpy.filters.BandSubtraction (depending on filter_type).

Returns
- enhanced_signal (np.ndarray [size=(num_samples,)]) – The enhanced signal in raw sample form. Stereo audio has not yet been tested.
- sr (int) – The sample rate of the enhanced / filtered signal.

References

Kamath, S. and Loizou, P. (2002). A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing.

Kamath, S. and Loizou, P. (2006). mband.m MATLAB code from the book: Loizou, P. (2013). Speech Enhancement: Theory and Practice.
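Example – a minimal usage sketch (the audio file path and output name are hypothetical; keyword values are illustrative, not recommended settings):

>>> import soundpy as sp
>>> # noise is taken from the first 120 ms of the file itself
>>> enhanced, sr = sp.builtin.filtersignal('noisy_speech.wav',
...                                        filter_type='wiener',
...                                        apply_postfilter=True,
...                                        save2wav=True,
...                                        output_filename='speech_filtered.wav')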
soundpy.builtin.dataset_logger(audiofile_dir=None, recursive=True)

Logs the name, format, bitdepth, sample rate (sr), duration, and number of channels of audiofiles.

Parameters
- audiofile_dir (str or pathlib.PosixPath) – The directory where the audiofiles of interest are. If no directory is provided, the current working directory will be used.
- recursive (bool) – If True, all audiofiles will be analyzed, including those in nested directories. Otherwise, only the audiofiles in the immediate directory will be analyzed. (default True)

Returns
- audiofile_dict (dict) – Dictionary within a dictionary, holding the formats of the audiofiles in the directory/ies.

Examples

>>> audio_info = dataset_logger()
>>> # look at three audio files:
>>> count = 0
>>> for key, value in audio_info.items():
...     for k, v in value.items():
...         print(k, ' : ', v)
...     count += 1
...     print()
...     if count > 2:
...         break
audio : audiodata/dogbark_2channels.wav
sr : 48000
num_channels : 2
dur_sec : 0.389
format_type : WAV
bitdepth : PCM_16

audio : audiodata/python_traffic_pf.wav
sr : 48000
num_channels : 1
dur_sec : 1.86
format_type : WAV
bitdepth : DOUBLE

audio : audiodata/259672__nooc__this-is-not-right.wav
sr : 44100
num_channels : 1
dur_sec : 2.48453514739229
format_type : WAV
bitdepth : PCM_16

See also
- soundfile.available_subtypes: The subtypes available with the package SoundFile.
- soundfile.available_formats: The formats available with the package SoundFile.
 
soundpy.builtin.dataset_formatter(audiodirectory=None, recursive=False, new_dir=None, sr=None, dur_sec=None, zeropad=False, format='WAV', bitdepth=None, overwrite=False, mono=False)

Formats all audio files in a directory to set parameters. The audiofiles formatted can be limited to the specific directory or extended to the subfolders of that directory.

Parameters
- audiodirectory (str or pathlib.PosixPath) – The directory where the audio files live. If no directory is provided, the current working directory will be used.
- recursive (bool) – If False, only the audiofiles in the specific directory will be formatted. If True, audiofiles in nested directories will also be formatted. (default False)
- new_dir (str or pathlib.PosixPath) – The audiofiles will be saved with the same structure in this directory. If None, a default directory name with a timestamp will be generated.
- sr (int) – The desired sample rate to assign to the audio files. If None, the original sample rate will be maintained.
- dur_sec (int) – The desired length in seconds the audio files should be limited to. If zeropad is set to True, the samples will be zeropadded to match this length if they are too short. If None, no limitation will be applied.
- zeropad (bool) – If True, samples will be zeropadded to match dur_sec. (default False)
- format (str) – The format to save the audio data in. (default 'WAV')
- bitdepth (int or str) – The desired bitdepth. If int, 16 or 32 are possible. Defaults to 'PCM_16'.
- overwrite (bool) – If True and new_dir is None, the audio data will be reformatted in the original directory and saved over any existing filenames. (default False)
- mono (bool) – If True, the audio will be limited to a single channel. Note: not much has been tested for stereo sound and soundpy. (default False)

Returns
- directory – The directory where the formatted audio files are located.

See also
- soundpy.files.collect_audiofiles: Collects audiofiles from a given directory.
- soundpy.files.conversion_formats: The available formats for converting audio data.
- soundfile.available_subtypes: The subtypes or bitdepths possible for soundfile.
 
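Example – a minimal usage sketch (directory names are hypothetical):

>>> import soundpy as sp
>>> # resample and trim/zeropad all audio in a folder to 1 second at 16 kHz
>>> formatted_dir = sp.builtin.dataset_formatter(audiodirectory='raw_audio',
...                                              new_dir='formatted_audio',
...                                              sr=16000, dur_sec=1,
...                                              zeropad=True, mono=True)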
soundpy.builtin.create_denoise_data(cleandata_dir, noisedata_dir, trainingdata_dir, limit=None, snr_levels=None, pad_mainsound_sec=None, random_seed=None, overwrite=False, **kwargs)

Applies noise to clean audio; saves the clean and noisy audio to trainingdata_dir.

Parameters
- cleandata_dir (str or pathlib.PosixPath) – Name of the folder containing clean audio data for the autoencoder, e.g. 'clean_speech'.
- noisedata_dir (str or pathlib.PosixPath) – Name of the folder containing noise to add to the clean data, e.g. 'noise'.
- trainingdata_dir (str or pathlib.PosixPath) – Directory to save the newly created train, validation, and test data.
- limit (int, optional) – Limit on the number of audiofiles used for training data.
- snr_levels (list of ints, optional) – List of varying signal-to-noise ratios to apply to the noise levels. (default None)
- pad_mainsound_sec (int or float, optional) – Amount in seconds the main sound should be padded; in other words, how long in seconds the background sound should play before the clean / main / target audio starts. The same amount of noise will be appended at the end. (default None)
- random_seed (int) – A value to make the random order of audiofiles predictable. If None, the order of audiofiles will not be predictable. (default None)
- overwrite (bool) – If True, a new dataset will be created regardless of whether or not a matching directory already exists. (default False)
- **kwargs (additional keyword arguments) – The keyword arguments for soundpy.files.loadsound.

Returns
- saveinput_path (pathlib.PosixPath) – Path to where the noisy audio files are located.
- saveoutput_path (pathlib.PosixPath) – Path to where the clean audio files are located.

See also
- soundpy.files.loadsound: Loads audiofiles.
- soundpy.dsp.add_backgroundsound: Adds background sound / noise to a signal at a determined signal-to-noise ratio.
 
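Example – a minimal usage sketch (directory names are hypothetical):

>>> import soundpy as sp
>>> noisy_dir, clean_dir = sp.builtin.create_denoise_data(
...     cleandata_dir='clean_speech',
...     noisedata_dir='noise',
...     trainingdata_dir='denoise_data',
...     snr_levels=[0, 5, 10],
...     random_seed=40)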
soundpy.builtin.envclassifier_feats(data_dir, data_features_dir=None, perc_train=0.8, ignore_label_marker=None, **kwargs)

Environment classifier: feature extraction of scene audio into train, val, and test datasets. Saves the extracted feature datasets (train, val, test) as well as the feature extraction settings in the directory data_features_dir.

Parameters
- data_dir (str or pathlib.PosixPath) – The directory with scene subfolders (e.g. 'air_conditioner', 'traffic') that contain audio files belonging to that scene (e.g. 'air_conditioner/ac1.wav', 'air_conditioner/ac2.wav', 'traffic/t1.wav').
- data_features_dir (str or pathlib.PosixPath, optional) – The directory where feature extraction related to the dataset will be stored. Within this directory, a unique subfolder will be created each time features are extracted. This allows several versions of extracted features on the same dataset without overwriting files.
- perc_train (float) – The fraction of data to be set aside as train data. The rest will be divided into validation and test datasets.
- ignore_label_marker (str) – A string to look for in labels that should not be included, for example '__' to ignore a subdirectory titled '__noise' or 'not__label'.
- kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.save_features_datasets and soundpy.feats.get_feats.

Returns
- feat_extraction_dir – The path to where all feature extraction files, including the datasets, can be found.

See also
- soundpy.feats.get_feats: Extracts features from an audio file or audio data.
- soundpy.feats.save_features_datasets: Prepares acoustic features in train, validation, and test datasets.
 
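Example – a minimal usage sketch (directory names are hypothetical; feature_type is assumed to pass through **kwargs to soundpy.feats.get_feats):

>>> import soundpy as sp
>>> feats_dir = sp.builtin.envclassifier_feats(
...     data_dir='scene_audio',
...     data_features_dir='features',
...     perc_train=0.8,
...     feature_type='fbank')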
soundpy.builtin.denoiser_feats(data_clean_dir, data_noisy_dir, data_features_dir=None, limit=None, perc_train=0.8, **kwargs)

Autoencoder denoiser: feature extraction of clean and noisy audio into train, val, and test datasets. Saves the extracted feature datasets (train, val, test) as well as the feature extraction settings in the directory data_features_dir.

Parameters
- data_clean_dir (str or pathlib.PosixPath) – The directory with clean audio files.
- data_noisy_dir (str or pathlib.PosixPath) – The directory with noisy audio files. These should be the same as the clean audio, except that noise has been added.
- data_features_dir (str or pathlib.PosixPath, optional) – The directory where feature extraction related to the dataset will be stored. Within this directory, a unique subfolder will be created each time features are extracted. This allows several versions of extracted features on the same dataset without overwriting files.
- limit (int, optional) – The limit on the number of audio files for feature extraction. (default None)
- kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.save_features_datasets and soundpy.feats.get_feats.

Returns
- feat_extraction_dir – The path to where all feature extraction files, including the datasets, can be found.

See also
- soundpy.datasets.create_denoise_data: Applies noise at specified SNR levels to clean audio files.
- soundpy.feats.get_feats: Extracts features from an audio file or audio data.
- soundpy.feats.save_features_datasets: Prepares acoustic features in train, validation, and test datasets.
 
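Example – a minimal usage sketch, continuing from the create_denoise_data example above (the clean/noisy subdirectory names are hypothetical):

>>> import soundpy as sp
>>> feats_dir = sp.builtin.denoiser_feats(
...     data_clean_dir='denoise_data/clean',
...     data_noisy_dir='denoise_data/noisy',
...     data_features_dir='denoise_features')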
Built-In Functionality (Deep Learning)
The soundpy.models.builtin module includes example functions that train neural networks on sound data.
soundpy.models.builtin.denoiser_train(feature_extraction_dir, model_name='model_autoencoder_denoise', feature_type=None, use_generator=True, normalize=True, patience=10, **kwargs)

Collects training features and trains an autoencoder denoiser.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – Directory where the extracted feature files (format .npy) are located.
- model_name (str) – The name for the model. This can be quite generic, as the date down to the millisecond will be added to ensure a unique name for each trained model. (default 'model_autoencoder_denoise')
- feature_type (str, optional) – The type of features that will be used to train the model. This is only for the purposes of naming the model. If set to None, it will not be included in the model name.
- use_generator (bool) – If True, a generator will be used to feed training data to the model. Otherwise the entire training dataset will be used to train the model all at once. (default True)
- normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
- patience (int) – Number of epochs to train without improvement before early stopping.
- **kwargs (additional keyword arguments) – The keyword arguments for keras.fit(). Note: the keyword arguments differ for validation data, so be sure to use the correct keyword arguments depending on whether you use the generator or not. TODO: add link to keras.fit().

Returns
- model_dir – The directory where the model and associated files can be found.

See also
- soundpy.datasets.separate_train_val_test_files: Generates path lists for train, validation, and test files. Useful for noisy vs. clean datasets and also for multiple training files.
- soundpy.models.generator: The generator function that feeds data to the model.
- soundpy.models.modelsetup.setup_callbacks: The function that sets up callbacks (e.g. logging, save best model, early stopping, etc.).
- soundpy.models.template_models.autoencoder_denoise: Template model architecture for a basic autoencoder denoiser.
 
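Example – a minimal usage sketch, continuing from the denoiser_feats example (epochs is assumed to pass through **kwargs to keras fit):

>>> import soundpy as sp
>>> model_dir = sp.models.builtin.denoiser_train(
...     feature_extraction_dir=feats_dir,
...     feature_type='stft',
...     epochs=10)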
soundpy.models.builtin.envclassifier_train(feature_extraction_dir, model_name='model_cnn_classifier', feature_type=None, use_generator=True, normalize=True, patience=15, add_tensor_last=True, num_layers=3, **kwargs)

Collects training features and trains a CNN environment classifier. This model may be applied to any speech-and-label scenario, for example male vs. female speech, clinical vs. healthy speech, simple speech / word recognition, as well as noise / scene / environment classification.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – Directory where the extracted feature files (format .npy) are located.
- model_name (str) – The name for the model. This can be quite generic, as the date down to the millisecond will be added to ensure a unique name for each trained model. (default 'model_cnn_classifier')
- feature_type (str, optional) – The type of features that will be used to train the model. This is only for the purposes of naming the model. If set to None, it will not be included in the model name.
- use_generator (bool) – If True, a generator will be used to feed training data to the model. Otherwise the entire training dataset will be used to train the model all at once. (default True)
- normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
- patience (int) – Number of epochs to train without improvement before early stopping.
- num_layers (int) – The number of convolutional neural network layers desired. (default 3)
- **kwargs (additional keyword arguments) – The keyword arguments for keras.fit(). Note: the keyword arguments differ for validation data, so be sure to use the correct keyword arguments depending on whether you use the generator or not. TODO: add link to keras.fit().

Returns
- model_dir – The directory where the model and associated files can be found.

See also
- soundpy.datasets.separate_train_val_test_files: Generates path lists for train, validation, and test files. Useful for noisy vs. clean datasets and also for multiple training files.
- soundpy.models.generator: The generator function that feeds data to the model.
- soundpy.models.modelsetup.setup_callbacks: The function that sets up callbacks (e.g. logging, save best model, early stopping, etc.).
- soundpy.models.template_models.cnn_classifier: Template model architecture for a low-computational CNN sound classifier.
 
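Example – a minimal usage sketch, continuing from the envclassifier_feats example (epochs is assumed to pass through **kwargs to keras fit):

>>> import soundpy as sp
>>> model_dir = sp.models.builtin.envclassifier_train(
...     feature_extraction_dir=feats_dir,
...     model_name='model_cnn_classifier',
...     epochs=10)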
soundpy.models.builtin.denoiser_run(model, new_audio, feat_settings_dict, remove_dc=True)

Implements a pre-trained denoiser.

Parameters
- model (str or pathlib.PosixPath) – The path to the denoising model.
- new_audio (str, pathlib.PosixPath, or np.ndarray) – The path to the noisy audiofile.
- feat_settings_dict (dict) – Dictionary containing the necessary settings for how the features were extracted for training the model. Expected keys: 'feature_type', 'win_size_ms', 'percent_overlap', 'sr', 'window', 'frames_per_sample', 'input_shape', 'desired_shape', 'dur_sec', 'num_feats'.

Returns
- cleaned_audio (np.ndarray [shape=(num_samples,)]) – The cleaned audio samples, ready for playing or saving as an audio file.
- sr (int) – The sample rate of cleaned_audio.

See also
- soundpy.feats.get_feats: How features are extracted.
- soundpy.feats.feats2audio: How features are transformed back into audio samples.
 
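Example – a minimal usage sketch (the model path, audio path, and settings are hypothetical; in practice feat_settings_dict is built from the settings logged during feature extraction):

>>> import soundpy as sp
>>> cleaned, sr = sp.models.builtin.denoiser_run(
...     model='model_dir/model_autoencoder_denoise.h5',
...     new_audio='noisy_speech.wav',
...     feat_settings_dict=feat_settings)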
soundpy.models.builtin.envclassifier_run(model, new_audio, feat_settings_dict, dict_decode)

Implements a convnet model with new_audio.

Parameters
- model (str or pathlib.PosixPath) – The path to the pre-trained model.
- new_audio (str or pathlib.PosixPath) – The path to the audio file to be classified.
- feat_settings_dict (dict) – Dictionary containing the necessary settings for feature extraction, such as sample rate, feature type, etc.
- dict_decode (dict) – Dictionary containing encoded labels as keys and string labels as values, for example {0: 'office', 1: 'traffic', 2: 'park'}.

Returns
 
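Example – a minimal usage sketch (paths and dictionaries are hypothetical; the return values are not documented above, so none are captured here):

>>> import soundpy as sp
>>> dict_decode = {0: 'office', 1: 'traffic', 2: 'park'}
>>> sp.models.builtin.envclassifier_run(
...     model='model_dir/model_cnn_classifier.h5',
...     new_audio='unknown_scene.wav',
...     feat_settings_dict=feat_settings,
...     dict_decode=dict_decode)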
soundpy.models.builtin.collect_classifier_settings(feature_extraction_dir)

Collects relevant information for some models from files in the feature directory. These relevant files were generated by soundpy.models.builtin.envclassifier_train.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – The directory where the extracted files, including the .npy and .csv log files, are located.

Returns
- datasets (NamedTuple) – A named tuple containing train, val, and test data.
- num_labels (int) – The number of labels used for the data.
- feat_shape (tuple) – The initial shape of the features when they were extracted, i.e. before labels or a context window were applied.
- num_feats (int) – The number of features used to train the pre-trained model.
- feature_type (str) – The feature_type used to train the pre-trained model, for example 'fbank', 'mfcc', 'stft', 'signal', 'powspec'.

See also
- soundpy.models.builtin.envclassifier_train: The built-in functionality for training a simple scene/environment/speech classifier; it generates the files expected by this function.
 
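Example – a minimal usage sketch (the directory is hypothetical; the five return values are unpacked in the order documented above):

>>> import soundpy as sp
>>> datasets, num_labels, feat_shape, num_feats, feature_type = \
...     sp.models.builtin.collect_classifier_settings('features_dir')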
soundpy.models.builtin.cnnlstm_train(feature_extraction_dir, model_name='model_cnnlstm_classifier', use_generator=True, normalize=True, patience=15, timesteps=10, context_window=5, frames_per_sample=None, colorscale=1, total_training_sessions=None, add_tensor_last=False, **kwargs)

Example implementation of a convnet+LSTM model for speech recognition. Note: improvements must still be made, for example with the context_window; however, this may still be useful as an example of a simple CNN and LSTM model.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – The directory where feature data will be saved.
- model_name (str) – The name of the model. (default 'model_cnnlstm_classifier')
- use_generator (bool) – If True, data will be fed to the model via a generator. This parameter will likely be removed and set as a default. (default True)
- normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
- patience (int) – The number of epochs to allow with no improvement in either validation accuracy or loss. (default 15)
- timesteps (int) – The frames dedicated to each subsection of each sample. This allows the long short-term memory model to process each subsection consecutively.
- context_window (int) – The number of frames surrounding a central frame that make up the sound context. Note: this needs improvement and further exploration.
- frames_per_sample (int) – Currently serves basically the same role as context_window: frames_per_sample equals context_window * 2 + 1. This parameter will likely be removed in future versions.
- colorscale (int) – The colorscale relevant for the convolutional neural network. (default 1)
- total_training_sessions (int) – Option to limit the number of audiofiles used for training if use_generator is set to False. This parameter will likely be removed in future versions, but as this is just an example model, the low priority may result in this parameter living forever.
- add_tensor_last (bool) – No longer used in the code; irrelevant.
- kwargs (additional keyword arguments) – Keyword arguments for keras.model.fit.

Returns
- model_dir (pathlib.PosixPath) – The directory where the model and log files are saved.
- history (tf.keras.callbacks.History) – Contains the model's training and validation accuracy and loss throughout training.

References

Kim, Myungjong & Cao, Beiming & An, Kwanghoon & Wang, Jun. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. 10.21437/interspeech.2018-2250.
soundpy.models.builtin.resnet50_train(feature_extraction_dir, model_name='model_resnet50_classifier', use_generator=True, normalize=True, patience=15, colorscale=3, total_training_sessions=None, **kwargs)

Continues training a pre-trained ResNet50 model for speech recognition or other sound classification.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – The directory where feature extraction files will be saved.
- model_name (str) – The name for the model. (default 'model_resnet50_classifier')
- use_generator (bool) – If True, data will be fed to the model via a generator. This parameter will likely be removed and set as a default. (default True)
- normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
- patience (int) – The number of epochs to allow with no improvement in either validation accuracy or loss. (default 15)
- colorscale (int) – The colorscale relevant for the convolutional neural network. (default 3)
- total_training_sessions (int) – Option to limit the number of audiofiles used for training if use_generator is set to False. This parameter will likely be removed in future versions, but as this is just an example model, the low priority may result in this parameter living forever.

Returns
- model_dir (pathlib.PosixPath) – The directory where the model and log files are saved.
- history (tf.keras.callbacks.History) – Contains the model's training and validation accuracy and loss throughout training.
 
 
soundpy.models.builtin.envclassifier_extract_train(model_name='env_classifier', augment_dict=None, audiodata_path=None, features_dir=None, save_new_files_dir=None, labeled_data=True, ignore_label_marker=None, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=None, visualize=False, vis_every_n_items=50, label_silence=False, val_data=None, test_data=None, append_model_dir=False, **kwargs)

Extracts and augments features during the training of a scene/environment/speech classifier.

Parameters
- model_name (str) – Name of the model, without extension (it will be saved as an .h5 file). (default 'env_classifier')
- augment_dict (dict, optional) – Dictionary containing augmentation keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and corresponding True or False values. If a value is True, that augmentation gets implemented at random each epoch. (default None)
- audiodata_path (str or pathlib.PosixPath) – Where the audio data can be found, if there is no features_dir where previously extracted and prepared files are located. (default None)
- features_dir (str or pathlib.PosixPath) – The feature directory where previously extracted validation and test data are located, as well as the relevant log files.
- save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
- labeled_data (bool) – Useful for determining the shape of the data. If True, a label column is expected at the end of the feature columns of the feature data. Note: this may be removed in future versions.
- ignore_label_marker (str) – When collecting labels from subdirectory names, this allows a subfolder name to be ignored. For example, if ignore_label_marker is set to '__', the folder name '__test__' will not be included as a label, while the folder name 'dog_barking' will.
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
 
 
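Example – a minimal sketch combining get_augmentation_dict with this function (the audio directory is hypothetical; epochs is a documented parameter of this function):

>>> import soundpy as sp
>>> augment_dict = sp.augment.get_augmentation_dict()
>>> augment_dict['add_white_noise'] = True
>>> sp.models.builtin.envclassifier_extract_train(
...     model_name='env_classifier',
...     audiodata_path='scene_audio',
...     augment_dict=augment_dict,
...     epochs=5)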
soundpy.models.builtin.cnnlstm_extract_train(model_name='cnnlstm_classifier', dataset_dict=None, num_labels=None, augment_dict=None, audiodata_path=None, save_new_files_dir=None, labeled_data=True, ignore_label_marker=None, context_window=5, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=None, visualize=False, vis_every_n_items=50, label_silence=False, **kwargs)

Extracts and augments features during the training of a scene/environment/speech classifier.

Parameters
- model_name (str) – Name of the model, without extension (it will be saved as an .h5 file).
- dataset_dict (dict, optional) – A dictionary with datasets as keys and audio file lists (with or without labels) as values. If None, it will be created based on audiodata_path. (default None)
- augment_dict (dict, optional) – Dictionary containing augmentation keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and corresponding True or False values. If a value is True, that augmentation gets implemented at random each epoch. (default None)
- audiodata_path (str or pathlib.PosixPath) – Where the audio data can be found, if no dataset_dict is provided. (default None)
- save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
 
 
soundpy.models.builtin.denoiser_extract_train(model_name='denoiser', augment_dict=None, audiodata_clean_path=None, audiodata_noisy_path=None, features_dir=None, save_new_files_dir=None, labeled_data=False, ignore_label_marker=None, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=20, visualize=False, vis_every_n_items=50, label_silence=False, val_data=None, test_data=None, append_model_dir=False, **kwargs)

Extracts and augments features during the training of a denoiser.

Parameters
- model_name (str) – Name of the model, without extension (it will be saved as an .h5 file). (default 'denoiser')
- augment_dict (dict, optional) – Dictionary containing augmentation keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and corresponding True or False values. If a value is True, that augmentation gets implemented at random each epoch. (default None)
- audiodata_clean_path (str or pathlib.PosixPath) – Where the clean audio data can be found, if there is no features_dir where previously extracted and prepared files are located. (default None)
- audiodata_noisy_path (str or pathlib.PosixPath) – Where the noisy audio data can be found, if there is no features_dir where previously extracted and prepared files are located. (default None)
- features_dir (str or pathlib.PosixPath) – The feature directory where previously extracted validation and test data are located, as well as the relevant log files.
- save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
- labeled_data (bool) – Useful for determining the shape of the data. If True, a label column is expected at the end of the feature columns of the feature data. Note: this may be removed in future versions.
- ignore_label_marker (str) – When collecting labels from subdirectory names, this allows a subfolder name to be ignored. For example, if ignore_label_marker is set to '__', the folder name '__test__' will not be included as a label, while the folder name 'dog_barking' will.
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
 
 
Augment audio data
The augment module includes functions related to augmenting audio data. These functions draw on implementations from published research.
Other resources for augmentation (not included in soundpy functionality):
Ma, E. (2019). NLP Augmentation. https://github.com/makcedward/nlpaug
Park, D. S., Chan, W., Zhang, Y., Chiu, C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Google Brain. arxiv.org/pdf/1904.08779.pdf
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084:
1. Signal speed scaling by a random number in [0.8, 1.2] (SpeedupFactoryRange).
2. Pitch shift by a random number in [-2, 2] semitones (SemitoneShiftRange).
3. Volume increase/decrease by a random number in [-3, 3] dB (VolumeGainRange).
4. Addition of random noise in the range [0, 10] dB (SNR).
5. Time shift in the range [-0.005, 0.005] seconds (TimeShiftRange).
soundpy.augment.speed_increase(sound, sr, perc=0.15, **kwargs)

Acoustic augmentation of speech.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio Augmentation for Speech Recognition. Interspeech.

Verhelst, W. and Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, April 1993, pp. 554-557.
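Example – a minimal sketch (the audio path is hypothetical; the return value is assumed here to be the augmented samples, which is not documented above):

>>> import soundpy as sp
>>> sound, sr = sp.loadsound('speech.wav')
>>> faster = sp.augment.speed_increase(sound, sr, perc=0.15)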
soundpy.augment.speed_decrease(sound, sr, perc=0.15, **kwargs)

Acoustic augmentation of speech.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.time_shift(sound, sr, random_seed=None, **kwargs)

Acoustic augmentation of sound (probably not for speech). Applies a random shift to the sound by dividing it into 2 sections and switching them.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.shufflesound(sound, sr, num_subsections=2, random_seed=None, **kwargs)

Acoustic augmentation of noise or background sounds. This separates the sound into num_subsections and pseudorandomizes their order.

References

Inoue, T., Vinayavekhin, P., Wang, S., Wood, D., Munawar, A., Ko, B. J., Greco, N., & Tachibana, R. (2019). Shuffling and mixing data augmentation for environmental sound classification. Detection and Classification of Acoustic Scenes and Events 2019, 25-26 October 2019, New York, NY, USA.
soundpy.augment.add_white_noise(sound, sr, noise_level=0.01, snr=10, random_seed=None, **kwargs)

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.harmonic_distortion(sound, sr, **kwargs)

Applies the sin function five times.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.pitch_increase(sound, sr, num_semitones=2, **kwargs)

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.pitch_decrease(sound, sr, num_semitones=2, **kwargs)

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.vtlp(sound, sr, a=(0.8, 1.2), random_seed=None, oversize_factor=16, win_size_ms=50, percent_overlap=0.5, bilinear_warp=True, real_signal=True, fft_bins=1024, window='hann', zeropad=True, expected_shape=None, visualize=False)

Applies vocal tract length perturbation (VTLP) directly to the (oversized) DFT windows.

References

Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria.

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.get_augmentation_dict()

Returns a dictionary with augmentation options as keys and values set to False.

Examples

>>> import soundpy as sp
>>> ad = sp.augment.get_augmentation_dict()
>>> ad
{'speed_increase': False, 'speed_decrease': False, 'time_shift': False, 'shufflesound': False, 'add_white_noise': False, 'harmonic_distortion': False, 'pitch_increase': False, 'pitch_decrease': False, 'vtlp': False}
>>> # to set an augmentation to True:
>>> ad['add_white_noise'] = True
>>> ad
{'speed_increase': False, 'speed_decrease': False, 'time_shift': False, 'shufflesound': False, 'add_white_noise': True, 'harmonic_distortion': False, 'pitch_increase': False, 'pitch_decrease': False, 'vtlp': False}
soundpy.augment.list_augmentations()

Lists the available augmentations.

Examples

>>> import soundpy as sp
>>> print(sp.augment.list_augmentations())
Available augmentations:
speed_increase
speed_decrease
time_shift
shufflesound
add_white_noise
harmonic_distortion
pitch_increase
pitch_decrease
vtlp
soundpy.augment.get_augmentation_settings_dict(augmentation)

Returns the default settings of the base function for an augmentation.

Parameters
- augmentation (str) – The augmentation of interest.

Returns
- aug_defaults (dict) – A dictionary with the base augmentation function's parameters as keys and their default values as values.

Examples

>>> import soundpy as sp
>>> d = sp.augment.get_augmentation_settings_dict('speed_decrease')
>>> d
{'perc': 0.15}
>>> # can use this dictionary to apply different values for augmentation
>>> d['perc'] = 0.1
>>> d
{'perc': 0.1}
>>> # to build a dictionary with several settings:
>>> many_settings_dict = {}
>>> many_settings_dict['add_white_noise'] = sp.augment.get_augmentation_settings_dict('add_white_noise')
>>> many_settings_dict['pitch_increase'] = sp.augment.get_augmentation_settings_dict('pitch_increase')
>>> many_settings_dict
{'add_white_noise': {'noise_level': 0.01, 'snr': 10, 'random_seed': None}, 'pitch_increase': {'num_semitones': 2}}
>>> # change the 'snr' default value to a list of several values
>>> # this would apply white noise at either 10, 15, or 20 SNR, at random
>>> many_settings_dict['add_white_noise']['snr'] = [10, 15, 20]
>>> # change the number of semitones by which pitch increase is applied
>>> many_settings_dict['pitch_increase']['num_semitones'] = 1
>>> many_settings_dict
{'add_white_noise': {'noise_level': 0.01, 'snr': [10, 15, 20], 'random_seed': None}, 'pitch_increase': {'num_semitones': 1}}

Raises
- ValueError – If augmentation does not match the available augmentations.

See also
- soundpy.models.dataprep.augment_features: The dictionary example many_settings_dict above can be supplied under the parameter augment_settings_dict to apply augmentation settings when augmenting data, for example within a generator function. See soundpy.models.dataprep.GeneratorFeatExtraction.
 
Working with audio files
The files module contains functions related to handling audio data files, for example loading audio files, saving audio files, and examining and reformatting audio files.
soundpy.files.loadsound(filename, sr=None, mono=True, dur_sec=None, remove_dc=True, use_scipy=False)

Loads a sound file with scipy.io.wavfile.read or librosa.load (default librosa).

Parameters
- filename (str) – The filename of the sound to be loaded.
- sr (int, optional) – The desired sample rate of the audio samples. If None, the sample rate of the audio file will be used.
- mono (bool) – If True, the samples will be loaded as mono sound. If False and the samples are in stereo, they will be loaded as stereo sound.
- dur_sec (int or float, optional) – The length in seconds of the audio signal.
- remove_dc (bool) – If True, the mean is subtracted from the signal. This has shown to be very helpful when working with audio data. (default True)
- use_scipy (bool) – If False, librosa will be used to load the audiofile. If True, scipy.io.wavfile and/or soundfile will be used. If the sound file is not compatible with scipy.io.wavfile.read, this function converts the file to .wav format and/or changes the bit depth to be compatible. (default False)

Returns
- data (np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – The normalized (between -1 and 1) sample data, returned according to the specified settings.
- sr (int) – The sample rate of the loaded samples.

See also
- soundpy.files.prep4scipywavfile: Prepares an audio file for scipy.io.wavfile.read.
- soundpy.files.convert_audiofile: Converts an audio file to .wav format.
- soundpy.files.newbitdepth: Converts an audio file to a specified bitdepth.
- soundpy.dsp.resample_audio: Resamples audio data to a specified sample rate.
- soundpy.files.list_possibleformats: Lists the possible formats to load with soundpy.loadsound.
- librosa.load: The package used to load sound data by default. See librosa.
- scipy.io.wavfile.read: The package used to load sound if use_scipy is set to True. See scipy.
- soundpy.dsp.remove_dc_bias: Removes the 'direct current' bias from the signal.

Todo
Make librosa data and scipy.io.wavfile data more similar: https://stackoverflow.com/questions/54482346/reading-a-wav-file-with-scipy-and-librosa-in-python
 
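Example – a minimal usage sketch (the audio path is hypothetical):

>>> import soundpy as sp
>>> data, sr = sp.loadsound('speech.wav', sr=16000, dur_sec=2)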
soundpy.files.savesound(audiofile_name, signal_values, sr, remove_dc=True, overwrite=False, use_scipy=False, **kwargs)

Saves the audio samples at the designated path.

Parameters
- audiofile_name (str or pathlib.PosixPath) – Path and name the audio is to be saved under (.wav format).
- signal_values (ndarray) – The values of the real signal to be saved.
- sr (int) – The sample rate of the audio samples.
- remove_dc (bool) – If True, the mean is subtracted from the signal. (default True)
- overwrite (bool) – If True, audio with the same name will be overwritten. (default False)
- use_scipy (bool) – If True, scipy.io.wavfile.write will be used; however, file conversion is limited: it can only save .wav files. Otherwise soundfile.write will be used, which can save audio in more audio formats.
- **kwargs (additional keyword arguments) – The keyword arguments for soundfile.write: https://pysoundfile.readthedocs.io/en/latest/index.html?highlight=write#soundfile.write

Returns
- audiofile_name – The new audiofile name.

See also
- scipy.io.wavfile.write
- soundpy.files.conversion_formats: Lists the possible formats to save audio files in if use_scipy is False.
- soundpy.dsp.remove_dc_bias: Removes the 'direct current' bias from the signal.
 
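Example – a minimal sketch pairing loadsound and savesound (paths are hypothetical):

>>> import soundpy as sp
>>> data, sr = sp.loadsound('speech.wav')
>>> saved = sp.files.savesound('speech_copy.wav', data, sr, overwrite=True)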
soundpy.files.audiofiles_present(directory, recursive=False)

Checks to see if audio files are present.

Parameters
- directory (str or pathlib.PosixPath) – The directory to look for audio in.
- recursive (bool) – If True, all nested directories will be checked as well. (default False)

Returns
- bool – True if audio is present; otherwise False.
 
soundpy.files.collect_audiofiles(directory, hidden_files=False, wav_only=False, recursive=False)

Collects all files within a given directory. This includes the option to include hidden files in the collection.

Parameters
- directory (str or pathlib.PosixPath) – The path to where the desired files are located.
- hidden_files (bool) – If True, hidden files will be included; if False, they won't. (default False)
- wav_only (bool) – If True, only .wav files will be included; otherwise, there is no limit on file type.
- recursive (bool) – If True, audiofiles in nested directories will also be collected. (default False)

Returns
- paths_list (list of pathlib.PosixPath objects) – Sorted list of file paths.
 
soundpy.files.collect_zipfiles(directory, hidden_files=False, ext='tgz', recursive=False)

Collects all zipfiles within a given directory. This includes the option to include hidden files in the collection.

Parameters
- directory (str or pathlib.PosixPath) – The path to where the desired files are located.
- hidden_files (bool) – If True, hidden files will be included; if False, they won't. (default False)
- ext (str) – The extension of the zip files to collect. (default 'tgz')

Returns
- paths_list (list of pathlib.PosixPath objects) – Sorted list of file paths.
 
soundpy.files.prep4scipywavfile(filename, overwrite=False)

Takes a soundfile and saves it in a format compatible with scipy.io.wavfile.
soundpy.files.conversion_formats()

Lists the formats available for conversion. soundpy uses soundfile to convert files; therefore, whatever formats soundfile has available will be listed here.

Examples

>>> conversion_formats()
{'AIFF': 'AIFF (Apple/SGI)', 'AU': 'AU (Sun/NeXT)', 'AVR': 'AVR (Audio Visual Research)', 'CAF': 'CAF (Apple Core Audio File)', 'FLAC': 'FLAC (Free Lossless Audio Codec)', 'HTK': 'HTK (HMM Tool Kit)', 'SVX': 'IFF (Amiga IFF/SVX8/SV16)', 'MAT4': 'MAT4 (GNU Octave 2.0 / Matlab 4.2)', 'MAT5': 'MAT5 (GNU Octave 2.1 / Matlab 5.0)', 'MPC2K': 'MPC (Akai MPC 2k)', 'OGG': 'OGG (OGG Container format)', 'PAF': 'PAF (Ensoniq PARIS)', 'PVF': 'PVF (Portable Voice Format)', 'RAW': 'RAW (header-less)', 'RF64': 'RF64 (RIFF 64)', 'SD2': 'SD2 (Sound Designer II)', 'SDS': 'SDS (Midi Sample Dump Standard)', 'IRCAM': 'SF (Berkeley/IRCAM/CARL)', 'VOC': 'VOC (Creative Labs)', 'W64': 'W64 (SoundFoundry WAVE 64)', 'WAV': 'WAV (Microsoft)', 'NIST': 'WAV (NIST Sphere)', 'WAVEX': 'WAVEX (Microsoft)', 'WVE': 'WVE (Psion Series 3)', 'XI': 'XI (FastTracker 2)'}
soundpy.files.convert_audiofile(filename, format_type=None, sr=None, new_dir=False, overwrite=False, use_scipy=False, **kwargs)

Converts and saves a soundfile in the given format (default .wav), in the same or a new directory.

Parameters
- filename (str or pathlib.PosixPath) – The filename of the audiofile to be converted.
- format_type (str) – The format to convert the audio file to. See soundpy.files.conversion_formats. (defaults to 'wav')
- new_dir (str or pathlib.PosixPath, optional) – If False, the converted files will be saved in the same directory as the originals. If a path is provided, the converted files will be saved there. If no such directory exists, one will be created.
- sr (int, optional) – The sample rate to be applied to the signal. If none is supplied, the sample rate of the original file will be used.
- **kwargs (additional keyword arguments) – The keyword arguments for soundfile.write: https://pysoundfile.readthedocs.io/en/latest/index.html?highlight=write#soundfile.write

Returns
- f_wavfile – The filename / path where the audio file is saved.

Examples

>>> audiofile = './example/audio.wav'
>>> # in same directory
>>> audiofile_flac = sp.files.convert_audiofile(audiofile, format_type='flac')
>>> audiofile_flac
PosixPath('example/audio.flac')
>>> # in new directory
>>> audiofile_flac = sp.files.convert_audiofile(audiofile, format_type='flac', new_dir='./examples2/')
>>> audiofile_flac
PosixPath('examples2/audio.flac')
>>> # can establish desired conversion format in `new_dir`
>>> audiofile_ogg = sp.files.convert_audiofile(audiofile, new_dir='./examples2/audio.ogg')
>>> audiofile_ogg
PosixPath('examples2/audio.ogg')

See also
- soundpy.files.conversion_formats: Lists the possible formats to convert audio files to.
- soundpy.files.list_possibleformats: Lists the possible formats to load with soundpy.loadsound.
 
soundpy.files.replace_ext(filename, extension)

Adds or replaces an extension in the filename.

Parameters
- filename (str or pathlib.PosixPath) – Filename with the missing or incorrect extension.
- extension (str) – The correct extension for the given filename.

Returns
- file_newext – The filename with the new extension.
 
soundpy.files.match_ext(filename1, filename2)

Matches the file extensions. If both have extensions, the default is set to that of filename1.
soundpy.files.newbitdepth(wave, bitdepth=16, newname=None, overwrite=False)

Converts the bitdepth to 16 or 32, to ensure compatibility with scipy.io.wavfile. Scipy.io.wavfile is easily used online, for example in Jupyter notebooks.
soundpy.files.adjustname(filename, adjustment=None)

Adjusts a filename, keeping the original extension.

Parameters
- filename (str) – The filename to adjust.
- adjustment (str, optional) – The adjustment to add to the filename. If None, '_adj' will be added.

Returns
- fname – The adjusted filename with the original extension.

Examples

>>> adjustname('happy.md')
'happy_adj.md'
>>> adjustname('happy.md', '_not_sad')
'happy_not_sad.md'
Organizing datasets
The datasets module contains functions related to organizing datasets.
soundpy.datasets.create_encodedlabel2audio_dict(dict_encodelabels, paths_list, limit=None, seed=40)

Creates a dictionary with audio labels as keys and filename lists as values. If no label is found in a filename path, that label is not included in the returned dictionary: labels are only included if corresponding paths are present.

Parameters
- dict_encodelabels (dict) – Dictionary containing the labels as keys and their encoded values as values.
- paths_list (set or list) – List containing pathlib.PosixPath objects (i.e. paths) of all audio files; the audio files are expected to reside in directories with names matching their audio class.
- limit (int, optional) – A limit on the number of audiofiles for each class. This may be useful if one wants to ensure a balanced dataset. (default None)
- seed (int, optional) – The seed for pseudorandomizing the wavfiles, if a limit is requested. If seed is set to None, the randomized order of the limited wavfiles cannot be repeated. (default 40)

Returns
- label_waves_dict (OrderedDict) – A dictionary with encoded audio labels as keys and, as values, the audio files corresponding to each label. TODO: update.

Examples

>>> from pathlib import Path
>>> labels = dict([('vacuum',2),('fridge',0),('wind',1)])
>>> paths = [Path('data/audio/vacuum/vacuum1.wav'),
...          Path('data/audio/fridge/fridge1.wav'),
...          Path('data/audio/vacuum/vacuum2.wav'),
...          Path('data/audio/wind/wind1.wav')]
>>> label_waves_dict = create_encodedlabel2audio_dict(labels, paths)
>>> label_waves_dict
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav'), PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> # to set a limit on number of audiofiles per class:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=40)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> # change the limited pathways chosen:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=10)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
soundpy.datasets.create_dicts_labelsencoded(labels_class, add_extra_label=False, extra_label='silence')

Encodes audio class labels and saves them in dictionaries. The labels are alphabetized and encoded under their index. If add_extra_label is True, the extra_label is added as the last entry in the dictionary. This is useful if performing voice activity detection and wanting to label non-voice-activated sections as silent rather than as some speech label.

Parameters
- labels_class (set or list) – The set of audio class labels to encode.
- add_extra_label (bool) – If True, extra_label is added as the last entry in the dictionary. (default False)
- extra_label (str) – The extra label to add if add_extra_label is True. (default 'silence')

Returns
- dict_label2int (dict) – Dictionary with labels as keys and their encoded integers as values.
- dict_int2label (dict) – Dictionary with encoded integers as keys and labels as values.

Examples

>>> labels = {'wind','air_conditioner','fridge'}
>>> label2int, int2label = create_dicts_labelsencoded(labels)
>>> label2int
{'air_conditioner': 0, 'fridge': 1, 'wind': 2}
>>> int2label
{0: 'air_conditioner', 1: 'fridge', 2: 'wind'}
soundpy.datasets.waves2dataset(audiolist, perc_train=0.8, seed=40, train=True, val=True, test=True)

Organizes an audio file list into train, validation, and test datasets. If only two or one dataset is to be prepared, they will be assigned to train and val, or simply to train, respectively. The remaining 'datasets' will remain empty.

Parameters
- audiolist (list) – List containing paths to audio files.
- perc_train (float or int) – Percentage of data to be in the training dataset. (default 0.8)
- seed (int or None, optional) – Seed for the generation of the pseudorandom train, validation, and test datasets. Useful for reproducing results. (default 40)
- train (bool) – If True, it is assumed training data will be prepared. (default True)
- val (bool) – If True, it is assumed validation data will be prepared. (default True)
- test (bool) – If True, it is assumed test data will be prepared. (default True)

Returns

Examples

>>> # using a list of numbers instead of filenames
>>> audiolist = [1,2,3,4,5,6,7,8,9,10]
>>> # default settings:
>>> waves2dataset(audiolist)
([5, 4, 9, 2, 3, 10, 1, 6], [8], [7])
>>> # perc_train set to 50% instead of 80%:
>>> waves2dataset(audiolist, perc_train=50)
([5, 4, 9, 2, 3, 10], [1, 6], [8, 7])
>>> # change seed number
>>> waves2dataset(audiolist, seed=0)
([7, 1, 2, 5, 6, 9, 10, 8], [4], [3])
soundpy.datasets.audio2datasets(audiodata, perc_train=0.8, limit=None, seed=None, audio_only=True, **kwargs)

Organizes all audio in audio class directories into (randomized) datasets. The validation and test datasets are halved between what isn't train data; for example, if perc_train is 0.8, validation data will be 0.1 and test data will be 0.1.

Parameters
- audiodata (str, pathlib.PosixPath, dict, list, or set) – If the data has multiple labels, the path to the dictionary where audio class labels and the paths of all audio files belonging to each class are or will be stored. The dictionary with the labels and their encoded values can also be directly supplied here. If the data does not have labels, a list or set of audiofiles can be provided to be placed into train, val, and test datasets.
- seed (int, optional) – A value to make the random order of audiofiles predictable. If None, the order of audiofiles will not be predictable. (default None)
- audio_only (bool) – Whether audio files (e.g. with extensions .wav, .flac, etc.) are expected. If True, the list will be checked to contain only audio files; otherwise not. (default True)
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.datasets.waves2dataset.

Returns
- dataset_audio – Named tuple including three lists / datasets of audiofiles or label-audiofile pairs: the train, validation, and test lists, respectively. The label-audiofile pairs are saved as tuples within the lists and contain the encoded label integer (e.g. 0 instead of 'air_conditioner') and the audio paths associated with that class and dataset.

Raises
- ValueError – If perc_train is set too high for the amount of data, or if there are simply too few data; specifically, if the percentage of train data cannot be upheld while also ensuring the validation and test datasets have more than 1 sample.
 
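Example – a minimal usage sketch (the directory is hypothetical; the returned named tuple unpacks into the train, validation, and test lists):

>>> import soundpy as sp
>>> train, val, test = sp.datasets.audio2datasets('scene_audio', perc_train=0.8, seed=40)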
soundpy.datasets.separate_train_val_test_files(list_of_files)

Checks that the file(s) exist, then sorts the file(s) into train, val, and test lists. If 'nois' or 'clean' are in the filenames, two path lists per dataset will be generated; otherwise just one. This path list is useful if there are multiple training files available for training a model (e.g. for large datasets).

Parameters
- list_of_files (list, str, or pathlib.PosixPath) – The feature files (format: .npy) for training a model.

Returns
- (train_paths_list, val_paths_list, test_paths_list) – Tuple comprised of path lists to train, validation, and test data files. If noisy and clean data files are found, each tuple item will be a tuple comprised of two lists: a noisy file paths list and a clean file paths list.

Examples

>>> features_files = ['train1.npy', 'train2.npy', 'val.npy', 'test.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train
[PosixPath('train1.npy'), PosixPath('train2.npy')]
>>> datasets.val
[PosixPath('val.npy')]
>>> datasets.test
[PosixPath('test.npy')]
>>> # try with noisy and clean data
>>> features_files = ['train_noisy.npy', 'train_clean.npy', 'val_noisy.npy', 'val_clean.npy', 'test_noisy.npy', 'test_clean.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train.noisy
[PosixPath('train_noisy.npy')]
>>> datasets.train.clean
[PosixPath('train_clean.npy')]
>>> datasets.val.noisy
[PosixPath('val_noisy.npy')]
>>> datasets.val.clean
[PosixPath('val_clean.npy')]
>>> datasets.test.noisy
[PosixPath('test_noisy.npy')]
>>> datasets.test.clean
[PosixPath('test_clean.npy')]
- 
soundpy.datasets.section_data(dataset_dict, dataset_paths_dict, divide_factor=None)[source]¶
- Expects keys of these two dictionaries to match - Examples - >>> import pathlib >>> # train is longer than val and test >>> d = {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'val': [1, 2, 3, 4, 5], 'test': [1, 2, 3, 4, 5]} >>> # dictionary: paths to where extracted data will be saved >>> dp = {'train': pathlib.PosixPath('train_data.npy'), 'val': pathlib.PosixPath('val_data.npy'), 'test': pathlib.PosixPath('test_data.npy')} >>> d2, dp2 = section_data(d, dp, divide_factor = 3) >>> # val and train not touched (too small) >>> d2 {'train__1': [1, 2, 3, 4, 5], 'train__2': [6, 7, 8, 9, 10], 'train__3': [11, 12, 13, 14, 15], 'val': [1, 2, 3, 4, 5], 'test': [1, 2, 3, 4, 5]} >>> dp2 {'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'val': PosixPath('val_data.npy'), 'test': PosixPath('test_data.npy')} >>> # repeat: now val and test as long as train >>> # default divide_factor is 2 >>> d3, dp3 = section_data(d2, dp2) >>> d3 {'train__1': [1, 2], 'train__2': [3, 4, 5], 'train__3': [6, 7], 'train__4': [8, 9, 10], 'train__5': [11, 12], 'train__6': [13, 14, 15], 'val__1': [1, 2], 'val__2': [3, 4, 5], 'test__1': [1, 2], 'test__2': [3, 4, 5]} >>> dp3 {'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'train__4': PosixPath('train_data__4.npy'), 'train__5': PosixPath('train_data__5.npy'), 'train__6': PosixPath('train_data__6.npy'), 'val__1': PosixPath('val_data__1.npy'), 'val__2': PosixPath('val_data__2.npy'), 'test__1': PosixPath('test_data__1.npy'), 'test__2': PosixPath('test_data__2.npy')} 
Working with signals¶
The dsp module contains functions pertaining to the actual generation, manipulation, and analysis of sound. This ranges from generating sounds to calculating signal-to-noise ratio.
- 
soundpy.dsp.generate_sound(freq=200, amplitude=0.4, sr=8000, dur_sec=0.25)[source]¶
- Generates a sound signal with the provided parameters. Signal begins at 0. - Parameters
- freq ( - int,- float) – The frequency in Hz the signal should have (default 200 Hz). This pertains to the number of oscillations per second.
- amplitude ( - int,- float) – The parameter controlling how much energy the signal should have. (default 0.4)
- sr ( - int) – The sampling rate of the signal, or how many samples make up the signal per second. (default 8000)
- dur_sec ( - int,- float) – The duration of the generated signal in seconds. (default 0.25)
 
- Returns
- sound_samples ( - np.ndarray [size = (num_samples,)]) – The samples of the generated sound
- sr ( - int) – The sample rate of the generated signal
 
 - Examples - >>> sound, sr = generate_sound(freq=5, amplitude=0.5, sr=5, dur_sec=1) >>> sound array([ 0.000000e+00, 5.000000e-01, 3.061617e-16, -5.000000e-01, -6.123234e-16]) >>> sr 5 
- 
soundpy.dsp.get_time_points(dur_sec, sr)[source]¶
- Get evenly spaced time points from zero to dur_sec. - The time points align with the provided sample rate, making it easy to plot a signal with a time line in seconds. - Parameters
- dur_sec ( - int,- float) – The duration in seconds the time points should span.
- sr ( - int) – The sample rate the time points should align with.
- Returns
- time 
- Return type
- np.ndarray [size = (num_time_points,)]
 - Examples - >>> # 50 milliseconds at sample rate of 100 (100 samples per second) >>> x = get_time_points(0.05,100) >>> x.shape (5,) >>> x array([0. , 0.0125, 0.025 , 0.0375, 0.05 ]) 
- 
soundpy.dsp.generate_noise(num_samples, amplitude=0.025, random_seed=None)[source]¶
- Generates noise of a certain amplitude and number of samples. - Useful for adding noise to another signal of length num_samples. - Parameters
- num_samples ( - int) – The number of noise samples to generate.
- amplitude ( - int,- float) – The scale of the generated noise. (default 0.025)
- random_seed ( - int, optional) – A seed to make the generated noise reproducible. (default None)
 - Examples - >>> noise = generate_noise(5, random_seed = 0) >>> noise array([0.04410131, 0.01000393, 0.02446845, 0.05602233, 0.04668895]) 
- 
soundpy.dsp.set_signal_length(samples, numsamps)[source]¶
- Sets audio signal to be a certain length. Zeropads if too short; truncates if too long. - Useful for setting signals to a consistent length, regardless of how long the audio signal is. - Parameters
- samples ( - np.ndarray [size = (num_samples,- num_channels), or- (num_samples,)]) – The array of sample data to be zero padded.
- numsamps ( - int) – The desired number of samples.
 
- Returns
- data – Copy of samples zeropadded or limited to numsamps. 
- Return type
- np.ndarray [size = (numsamps,- num_channels), or- (numsamps,)]
 - Examples - >>> import numpy as np >>> input_samples = np.array([1,2,3,4,5]) >>> output_samples = set_signal_length(input_samples, numsamps = 8) >>> output_samples array([1, 2, 3, 4, 5, 0, 0, 0]) >>> output_samples = set_signal_length(input_samples, numsamps = 4) >>> output_samples array([1, 2, 3, 4]) 
- 
soundpy.dsp.scalesound(data, max_val=1, min_val=None)[source]¶
- Scales the input array to range between min_val and max_val. - Parameters
- data ( - np.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]) – Original samples
- max_val ( - int,- float) – The maximum value the dataset is to range from (default 1)
- min_val ( - int,- float, optional) – The minimum value the dataset is to range from. If set to None, will be set to the opposite of max_val. E.g. if max_val is set to 0.8, min_val will be set to -0.8. (default None)
 
- Returns
- samples – Copy of original data, scaled to the min and max values. 
- Return type
- np.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]
 - Examples - >>> import numpy as np >>> np.random.seed(0) >>> input_samples = np.random.random_sample((5,)) >>> input_samples array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ]) >>> input_samples.max() 0.7151893663724195 >>> input_samples.min() 0.4236547993389047 >>> # default setting: between -1 and 1 >>> output_samples = scalesound(input_samples) >>> output_samples array([-0.14138, 1., 0.22872961, -0.16834299, -1.]) >>> output_samples.max() 1.0 >>> output_samples.min() -1.0 >>> # range between -100 and 100 >>> output_samples = scalesound(input_samples, max_val = 100, min_val = -100) >>> output_samples array([-14.13800026, 100., 22.87296052, -16.83429866, -100.]) >>> output_samples.max() 100.0 >>> output_samples.min() -100.0 
- 
soundpy.dsp.shape_samps_channels(data)[source]¶
- Returns data in shape (num_samps, num_channels) - Parameters
- data ( - np.ndarray [size= (num_samples,)or- (num_samples,- num_channels), or- (num_channels,- num_samples)]) – The data that needs to be checked for correct format
- Returns
- data 
- Return type
- np.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]
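Example: a hedged sketch of the expected behavior (assuming channel-major input gets transposed so channels sit on the last axis):

>>> import numpy as np
>>> import soundpy as sp
>>> data = np.zeros((2, 100))   # (num_channels, num_samples)
>>> sp.dsp.shape_samps_channels(data).shape
(100, 2)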
 
- 
soundpy.dsp.resample_audio(samples, sr_original, sr_desired)[source]¶
- Allows audio samples to be resampled to desired sample rate. - Parameters
- Returns
- resampled ( - np.ndarray [size = (num_samples_resampled,)]) – The resampled samples.
- sr_desired ( - int) – The newly applied sample rate
 
 - Examples - >>> import numpy as np >>> # example samples from 5 millisecond signal with sr 100 and frequency 10 >>> input_samples = np.array([0.00e+00, 2.82842712e-01, 4.000e-01, 2.82842712e-01, 4.89858720e-17]) >>> # we want to resample to 80 instead of 100 (for this example's sake) >>> output_samples, sr = resample_audio(input_samples, sr_original = 100, sr_desired = 80) >>> output_samples array([-2.22044605e-17, 3.35408001e-01, 3.72022523e-01, 6.51178161e-02]) 
- 
soundpy.dsp.stereo2mono(data)[source]¶
- If sound data has multiple channels, reduces to first channel - Parameters
- data ( - numpy.ndarray) – The series of sound samples, with 1+ columns/channels
- Returns
- data_mono – The series of sound samples, with first column 
- Return type
- numpy.ndarray
 - Examples - >>> import numpy as np >>> data = np.linspace(0,20) >>> data_2channel = data.reshape(25,2) >>> data_2channel[:5] array([[0. , 0.40816327], [0.81632653, 1.2244898 ], [1.63265306, 2.04081633], [2.44897959, 2.85714286], [3.26530612, 3.67346939]]) >>> data_mono = stereo2mono(data_2channel) >>> data_mono[:5] array([0. , 0.81632653, 1.63265306, 2.44897959, 3.26530612]) 
- 
soundpy.dsp.add_backgroundsound(audio_main, audio_background, sr, snr=None, pad_mainsound_sec=None, total_len_sec=None, wrap=False, stationary_noise=True, random_seed=None, extend_window_ms=0, remove_dc=False, mirror_sound=False, clip_at_zero=True, **kwargs)[source]¶
- Adds a sound (i.e. background noise) to a target signal. Stereo sound should work. - If the sample rates of the two audio samples do not match, the sample rate of audio_main will be applied. (i.e. the audio_background will be resampled). If you have issues with clicks at the beginning or end of signals, see - soundpy.dsp.clip_at_zero.- Parameters
- audio_main ( - str,- pathlib.PosixPath, or- np.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – Sound file of the main sound (will not be modified; only delayed if specified). If not path or string, should be data samples corresponding to the provided sample rate.
- audio_background ( - str,- pathlib.PosixPath, or- np.ndarray [size=(num_samples,)]) – Sound file of the background sound (will be modified / repeated to match or extend the length indicated). If not of type pathlib.PosixPath or string, should be data samples corresponding to the provided sample rate.
- sr ( - int) – The sample rate of sounds to be added together. Note: sr of 44100 or higher is suggested.
- snr ( - int,- float,- list,- tuple) – The signal-to-noise ratio (SNR) of the target and background signals. Note: this is an approximation and needs further testing and development to be used as an official measurement of snr. If no SNR provided, signals will be added together as-is. (default None)
- pad_mainsound_sec ( - intor- float, optional) – Length of time in seconds the background sound will pad the main sound. For example, if pad_mainsound_sec is set to 1, one second of the audio_background will be played before audio_main starts as well as after the main audio stops. (default None)
- total_len_sec ( - intor- float, optional) – Total length of combined sound in seconds. If None, the sound will end after the (padded) target sound ends (default None).
- wrap ( - bool) – If False, the random selection of sound will be limited to end by the end of the audio file. If True, the random selection will wrap to the beginning of the audio file if it extends beyond the end of the audio file. (default False)
- stationary_noise ( - bool) – If False,- soundpy.feats.get_vad_stftwill be applied to noise to get energy of the active noise in the signal. Otherwise energy will be collected via soundpy.feats.get_stft. (default True)
- random_seed ( - int) – If provided, the ‘random’ section of noise will be chosen using this seed. (default None)
- extend_window_ms ( - intor- float) – The number of milliseconds by which the detected voice activity should be padded. This might be useful to ensure a sufficient amount of activity is calculated. (default 0)
- remove_dc ( - bool) – If the dc bias should be removed. This aids in the removal of clicks. See- soundpy.dsp.remove_dc_bias. (default False)
- **kwargs ( - additional keyword arguments) – The keyword arguments for soundpy.files.loadsound
 
- Returns
 - References - Yi Hu and Philipos C. Loizou, original authors
- Copyright (c) 2006 by Philipos C. Loizou 
- SIP-Lab/CNN-VAD/GitHub Repo
- Copyright (c) 2019 Signal and Image Processing Lab MIT License 
 - See also - soundpy.files.loadsound
- Loads audiofiles. 
- soundpy.dsp.snr_adjustnoiselevel
- Calculates how much to adjust noise signal to achieve SNR. 
- soundpy.feats.get_vad_stft
- Returns stft matrix of only voice active regions 
- soundpy.feats.get_stft
- Returns stft matrix of entire signal 
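Example: a hedged usage sketch ('speech.wav' and 'cafe.wav' are hypothetical files; since the Returns section above is empty in these docs, the result is left unpacked — check the source for the exact return values):

>>> import soundpy as sp
>>> # add cafe noise at roughly 15 dB SNR, padding the speech with 1 second of noise
>>> result = sp.dsp.add_backgroundsound('speech.wav', 'cafe.wav', sr=44100,
...                                     snr=15, pad_mainsound_sec=1, total_len_sec=5)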
 
- 
soundpy.dsp.hz_to_mel(freq)[source]¶
- Converts frequency to Mel scale - Parameters
- freq ( - intor- floator- array likeof- ints / floats) – The frequency/ies to convert to Mel scale.
- Returns
- mel – The frequency/ies in Mel scale. 
- Return type
 - References - https://en.wikipedia.org/wiki/Mel_scale#Formula - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
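The cited references use the formula mel = 2595 * log10(1 + freq / 700); a quick sketch of that formula (assuming, but not confirming, that the library uses these standard constants):

>>> import numpy as np
>>> round(2595 * np.log10(1 + 440 / 700), 1)   # 440 Hz in mel
549.6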
- 
soundpy.dsp.mel_to_hz(mel)[source]¶
- Converts Mel item or list to frequency/ies. - Parameters
- mel ( - int,- float, or- listof- ints / floats) – Mel item(s) to be converted to Hz.
- Returns
- freq – The converted frequency/ies 
- Return type
 - References - https://en.wikipedia.org/wiki/Mel_scale#Formula - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
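The inverse of the formula above, per the same references (again assuming the standard constants): freq = 700 * (10**(mel / 2595) - 1). For example:

>>> round(700 * (10 ** (549.64 / 2595) - 1), 1)   # back to Hz
440.0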
- 
soundpy.dsp.fbank_filters(fmin, fmax, num_filters)[source]¶
- Calculates the mel filterbanks given a min and max frequency and num_filters. - Parameters
- Returns
- mel_points – An array of floats containing evenly spaced filters (according to mel scale). 
- Return type
- np.ndarray [size=(num_filters,)]
 - References - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
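A sketch of the standard construction from the cited tutorial — points spaced evenly in mel space between the mel equivalents of fmin and fmax (assumed to approximate, not define, the library's implementation):

>>> import numpy as np
>>> fmin, fmax, num_filters = 0, 8000, 10
>>> mel_min = 2595 * np.log10(1 + fmin / 700)
>>> mel_max = 2595 * np.log10(1 + fmax / 700)
>>> mel_points = np.linspace(mel_min, mel_max, num_filters)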
- 
soundpy.dsp.sinosoidal_liftering(mfccs, cep_lifter=22)[source]¶
- Reduces influence of higher coefficients; found useful in automatic speech rec. - Parameters
- mfccs ( - np.ndarray [shape=(num_samples,- num_mfcc)]) – The matrix containing mel-frequency cepstral coefficients.
- cep_lifter ( - int) – The amount to apply- sinosoidal_liftering. (default 22)
 
 - References - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
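The cited tutorial implements this as a sinusoidal lift across the coefficient axis; a sketch of that approach (the assumption here is that the library follows the tutorial):

>>> import numpy as np
>>> mfccs = np.random.random((100, 13))          # (num_samples, num_mfcc)
>>> cep_lifter = 22
>>> n = np.arange(mfccs.shape[1])
>>> lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n / cep_lifter)
>>> mfccs_liftered = mfccs * lift                # lift applied to each frame's coefficients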
- 
soundpy.dsp.index_at_zero(samples, num_dec_places=2)[source]¶
- Finds indices of start and end of utterance, given amplitude strength. - Parameters
- samples ( - numpy.ndarray [size= (num_samples,)or- (num_samples,- num_channels)]) – The samples to index where the zeros surrounding speech are located.
- num_dec_places ( - int) – The number of decimal places to which the lowest value in samples should be rounded. (default 2)
 
- Returns
 - Examples - >>> signal = np.array([-1, 0, 1, 2, 3, 2, 1, 0, -1, -2, -3, -2, -1, 0, 1]) >>> zero_1, zero_2 = index_at_zero(signal) >>> # +1 to include zero_2 in signal >>> signal[zero_1:zero_2+1] [ 0 1 2 3 2 1 0 -1 -2 -3 -2 -1 0] >>> # does not assume a zero precedes any sample >>> signal = np.array([1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1]) >>> zero_1, zero_2 = index_at_zero(signal) >>> signal[zero_1:zero_2+1] [ 0 -1 -2 -1 0] 
- 
soundpy.dsp.clip_at_zero(samples, samp_win=None, neg2pos=True, **kwargs)[source]¶
- Clips the signal at samples close to zero. - Clipping occurs at samples where the signal crosses the zero line from negative to positive. This allows for a smoother transition of audio, especially when concatenating audio. - Parameters
- samples ( - np.ndarray [shape = (num_samples,- )or- (num_samples,- num_channels)]) – The array containing sample data. Should work on stereo sound.
- start_with_zero ( - bool) – If True, the returned array will begin with 0 (or close to 0). Otherwise the array will end with 0.
- neg2pos ( - bool) – If True, the returned array will begin with positive values and end with negative values. Otherwise, the array will be returned with the first zeros detected, regardless of surrounding positive or negative values.
- samp_win ( - int, optional) – The window of samples to apply when clipping at zero crossings. The zero crossings adjacent to the main signal will be used. This is useful to remove already existing clicks within the signal, often found at the beginning and / or end of signals.
- kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.dsp.index_at_zero.
 
 - Warning - If only one zero found. - Examples - >>> sig = np.array([-2,-1,0,1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1,0]) >>> clip_at_zero(sig) # defaults [ 0 1 2 1 0 -1 -2 -1 0] >>> # finds first and last instance of zeros, regardless of surrounding >>> # negative or positive values in signal >>> clip_at_zero(sig, neg2pos = False) [ 0 1 2 1 0 -1 -2 -1 0 1 2 1 0] >>> # avoid clicks at start of signal >>> sig = np.array([0,-10,-20,-1,0,1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1,0]) >>> clip_at_zero(sig, samp_win = 5) [ 0 1 2 1 0 -1 -2 -1 0] 
- 
soundpy.dsp.remove_dc_bias(samples, samp_win=None)[source]¶
- Removes DC bias by subtracting mean from sample data. - Seems to work best without samp_win. - # TODO add moving average? - Parameters
- samples ( - np.ndarray [shape=(samples,- num_channels)or- (samples)]) – The sample data to center around zero. This works on both mono and stereo data.
- samp_win ( - int, optional) – Apply subtraction of mean at windows - experimental. (default None)
 
- Returns
- samps – The samples with zero mean. 
- Return type
- np.ndarray [shape=(samples,- num_channels)or- (samples)]
 - References - Lyons, Richard. (2011). Understanding Digital Signal Processing (3rd Edition). 
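The core operation is mean subtraction; a quick sketch of the idea in plain NumPy (not the library call itself):

>>> import numpy as np
>>> samples = np.array([0.5, 1.5, 0.5, 1.5])   # dc bias of 1.0
>>> samples - samples.mean()                   # centered around zero
array([-0.5,  0.5, -0.5,  0.5])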
- 
soundpy.dsp.apply_num_channels(sound_data, num_channels)[source]¶
- Ensures data has indicated num_channels. - To increase number of channels, the first column will be duplicated. To limit channels, channels will simply be removed. - Parameters
- sound_data ( - np.ndarray [size= (num_samples,)or- (num_samples,- num_channels)]) – The data to adjust the number of channels
- num_channels ( - int) – The number of channels desired
 
- Returns
- data 
- Return type
- np.ndarray [size = (num_samples,- num_channels)]
 - Examples - >>> import numpy as np >>> data = np.array([1, 1, 1, 1]) >>> data_3d = apply_num_channels(data, 3) >>> data_3d array([[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]) >>> data_2d = apply_num_channels(data_3d, 2) >>> data_2d array([[1, 1], [1, 1], [1, 1], [1, 1]]) 
- 
soundpy.dsp.apply_sample_length(data, target_len, mirror_sound=False, clip_at_zero=True)[source]¶
- Extends a sound by repeating it until it reaches target_len. If target_len is shorter than the length of data, data will be shortened to the specified target_len. - This is perhaps useful when working with repetitive or stationary sounds. - Parameters
- data ( - np.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]) – The data to be checked or extended in length. If shape (num_channels, num_samples), the data will be reshaped to (num_samples, num_channels).
- target_len ( - int) – The length of samples the input data should be.
 
- Returns
- new_data 
- Return type
- np.ndarray [size=(target_len,- )or- (target_len,- num_channels)]
 - Examples - >>> import numpy as np >>> data = np.array([1,2,3,4]) >>> sp.dsp.apply_sample_length(data, 12) array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]) >>> # two channels >>> data = np.zeros((3,2)) >>> data[:,0] = np.array([0,1,2]) >>> data[:,1] = np.array([1,2,3]) >>> data array([[0., 1.], [1., 2.], [2., 3.]]) >>> sp.dsp.apply_sample_length(data,5) array([[0., 1.], [1., 2.], [2., 3.], [0., 1.], [1., 2.]]) 
- 
soundpy.dsp.zeropad_sound(data, target_len, sr, delay_sec=None)[source]¶
- If the sound data needs to be a certain length, zero pad it. - Parameters
- data ( - numpy.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]) – The sound data that needs zero padding.
- target_len ( - int) – The number of samples the data should have
- sr ( - int) – The samplerate of the data
- delay_sec ( - int,- float, optional) – If the data should be zero padded also at the beginning. (default None)
 
- Returns
- signal_zeropadded – The data zero padded. 
- Return type
- numpy.ndarray [size = (target_len,)or- (target_len,- num_channels)]
 - Examples - >>> import numpy as np >>> x = np.array([1,2,3,4]) >>> # with 1 second delay (with sr of 4, that makes 4 sample delay) >>> x_zeropadded = zeropad_sound(x, target_len=10, sr=4, delay_sec=1) >>> x_zeropadded array([0., 0., 0., 0., 1., 2., 3., 4., 0., 0.]) >>> # without delay >>> x_zeropadded = zeropad_sound(x, target_len=10, sr=4) >>> x_zeropadded array([1., 2., 3., 4., 0., 0., 0., 0., 0., 0.]) >>> # if signal is longer than desired length: >>> x_zeropadded = zeropad_sound(x, target_len=3, sr=4) UserWarning: The signal cannot be zeropadded and will instead be truncated as length of `data` is 4 and `target_len` is 3. len(data), target_len)) >>> x_zeropadded array([1, 2, 3]) 
- 
soundpy.dsp.combine_sounds(file1, file2, match2shortest=True, time_delay_sec=None, total_dur_sec=None)[source]¶
- Combines sounds - Parameters
- file1 ( - str) – One of two files to be added together
- file2 ( - str) – Second of two files to be added together
- match2shortest ( - bool) – If the lengths of the addition should be limited by the shorter sound. (default True)
- time_delay_sec ( - int,- float, optional) – The amount of time in seconds before the sounds are added together. The longer sound will play for this period of time before the shorter sound is added to it. (default None)
- total_dur_sec ( - int,- float, optional) – The total duration in seconds of the combined sounds. (default None)
 
- Returns
- added_sound ( - numpy.ndarray) – The sound samples of the two soundfiles added together
- sr1 ( - int) – The sample rate of the original signals and added sound
 
 
- 
soundpy.dsp.calc_frame_length(dur_frame_millisec, sr)[source]¶
- Calculates the number of samples necessary for each frame - Parameters
- Returns
- frame_length – the number of samples necessary to fill a frame 
- Return type
 - Examples - >>> calc_frame_length(dur_frame_millisec=20, sr=1000) 20 >>> calc_frame_length(dur_frame_millisec=20, sr=48000) 960 >>> calc_frame_length(dur_frame_millisec=25.5, sr=22500) 573 
- 
soundpy.dsp.calc_num_overlap_samples(samples_per_frame, percent_overlap)[source]¶
- Calculate the number of samples that constitute the overlap of frames - Parameters
- samples_per_frame ( - int) – The number of samples in each frame.
- percent_overlap ( - int,- float) – The percentage of overlap between frames; may be given as a fraction (e.g. 0.5) or as a percentage (e.g. 50), as in the examples below.
- Returns
- num_overlap_samples – the number of samples in the overlap 
- Return type
 - Examples - >>> calc_num_overlap_samples(samples_per_frame=100,percent_overlap=0.10) 10 >>> calc_num_overlap_samples(samples_per_frame=100,percent_overlap=10) 10 >>> calc_num_overlap_samples(samples_per_frame=960,percent_overlap=0.5) 480 >>> calc_num_overlap_samples(samples_per_frame=960,percent_overlap=75) 720 
- 
soundpy.dsp.calc_num_subframes(tot_samples, frame_length, overlap_samples, zeropad=False)[source]¶
- Assigns total frames needed to process entire noise or target series - This function calculates the number of full frames that can be created given the total number of samples, the number of samples in each frame, and the number of overlapping samples. - Parameters
- tot_samples ( - int) – total number of samples in the entire series
- frame_length ( - int) – total number of samples in each frame / processing window
- overlap_samples ( - int) – number of samples in overlap between frames
- zeropad ( - bool, optional) – If False, number of subframes limited to full frames. If True, number of subframes extended to zeropad the last partial frame. (default False)
 
- Returns
- subframes – The number of subframes necessary to fully process the audio samples at given frame_length, overlap_samples, and zeropad. 
- Return type
 - Examples - >>> calc_num_subframes(30,10,5) 5 >>> calc_num_subframes(30,20,5) 3 
- 
soundpy.dsp.create_window(window_type, frame_length)[source]¶
- Creates window according to set window type and frame length. - The Hamming window tapers edges to around 0.08 while the Hann window tapers edges to 0.0. Both are commonly used in noise filtering. - Parameters
- window_type ( - str) – type of window to be applied (default ‘hamming’)
- frame_length ( - int) – the number of samples the window should contain
- Returns
- window – a window fitted to the specified frame_length 
- Return type
- ndarray
 - Examples - >>> #create Hamming window >>> hamm_win = create_window('hamming', frame_length=5) >>> hamm_win array([0.08, 0.54, 1. , 0.54, 0.08]) >>> #create Hann window >>> hann_win = create_window('hann',frame_length=5) >>> hann_win array([0. , 0.5, 1. , 0.5, 0. ]) 
- 
soundpy.dsp.apply_window(samples, window, zeropad=False)[source]¶
- Applies predefined window to a section of samples. Works on mono or stereo sound. - The samples must be the same length as the window. - Parameters
- samples ( - ndarray [shape=(num_samples,)or- (num_samples,- num_channels)]) – series of samples with the length of input window
- window ( - ndarray [shape=(num_samples,)or- (num_samples,- num_channels)]) – window to be applied to the signal. If the window does not match the number of channels in the sample data, the window’s first channel will be repeated to fill the missing channels.
 
- Returns
- samples_win – series with tapered sides according to the window provided 
- Return type
- ndarray
 - Examples - >>> import numpy as np >>> input_signal = np.array([ 0. , 0.36371897, -0.302721, ... -0.1117662 , 0.3957433 ]) >>> window_hamming = np.array([0.08, 0.54, 1. , 0.54, 0.08]) >>> apply_window(input_signal, window_hamming) array([ 0. , 0.19640824, -0.302721 , -0.06035375, 0.03165946]) >>> window_hann = np.array([0. , 0.5, 1. , 0.5, 0. ]) >>> apply_window(input_signal, window_hann) array([ 0. , 0.18185948, -0.302721 , -0.0558831 , 0. ]) 
- 
soundpy.dsp.add_channels(samples, channels_total)[source]¶
- Copies columns of samples to create additional channels. - Parameters
- samples ( - np.ndarray [shape=(num_samples)or- (num_samples,num_channels)]) – The samples to add channels to.
- channels_total ( - int) – The total number of channels desired. For example, if samples already has 2 channels and you want it to have 3, set channels_total to 3.
 
- Returns
- x – A copy of samples with desired number of channels. 
- Return type
- np.ndarray [shape = (num_samples,- channels_total)]
 - Examples - >>> import numpy as np >>> samps_mono = np.array([1,2,3,4,5]) >>> samps_stereo2 = add_channels(samps_mono, 2) >>> samps_stereo2 array([[1, 1], ... [2, 2], ... [3, 3], ... [4, 4], ... [5, 5]]) >>> samps_stereo5 = add_channels(samps_stereo2, 5) >>> samps_stereo5 array([[1, 1, 1, 1, 1], ... [2, 2, 2, 2, 2], ... [3, 3, 3, 3, 3], ... [4, 4, 4, 4, 4], ... [5, 5, 5, 5, 5]]) - Warning - If channels_total is less than or equal to the number of channels already present in samples. No channels are added in those cases. 
- 
soundpy.dsp.average_channels(data)[source]¶
- Averages all channels in a stereo signal into one channel. - Parameters
- data ( - np.ndarray [size=(num_samples,- num_channels)]) – The stereo data to average out. If mono data supplied, mono data is returned unchanged.
- Returns
- data averaged – Copy of data averaged into one channel. 
- Return type
- np.ndarray [size=(num_samples)]
 - Examples - >>> import numpy as np >>> input_samples1 = np.array([1,2,3,4,5]) >>> input_samples2 = np.array([1,1,3,3,5]) >>> input_2channels = np.vstack((input_samples1, input_samples2)).T >>> input_averaged = average_channels(input_2channels) >>> input_averaged array([1. , 1.5, 3. , 3.5, 5. ]) 
- 
soundpy.dsp.calc_fft(signal_section, real_signal=None, fft_bins=None, **kwargs)[source]¶
- Calculates the fast Fourier transform of a time series. Should work with stereo signals. - The length of the signal_section determines the number of frequency bins analyzed if fft_bins not set. Therefore, if there are higher frequencies in the signal, the length of the signal_section should be long enough to accommodate those frequencies. - The frequency bins with energy levels at around zero denote frequencies not prevalent in the signal; the frequency bins with prevalent energy levels relate to the frequencies, and their amplitudes, that are present in the signal. - Parameters
- signal_section ( - ndarray [shape = (num_samples)or- (num_samples,- num_channels)]) – the series that the fft will be applied to. If stereo sound, will return a FFT for each channel.
- real_signal ( - bool) – If True, only half of the fft will be returned (the fft is mirrored). Otherwise the full fft will be returned.
- kwargs ( - additional keyword arguments) – keyword arguments for numpy.fft.fft or numpy.fft.rfft
 
- Returns
- fft_vals – the series transformed into the frequency domain with the same shape as the input series 
- Return type
- ndarray [shape=(num_fft_bins), or- (num_fft_bins,- num_channels),- dtype=np.complex_]
 
- 
soundpy.dsp.calc_power(fft_vals)[source]¶
- Calculates the power of fft values - Parameters
- fft_vals ( - ndarray (complexor- floats)) – the fft values of a windowed section of a series
- Returns
- power_spec – the squared absolute value of the input fft values, normalized by the length of the input (note the division by 3 in the example below) 
- Return type
- ndarray
 - Example - >>> import numpy as np >>> matrix = np.array([[1,1,1],[2j,2j,2j],[-3,-3,-3]], ... dtype=np.complex_) >>> calc_power(matrix) array([[0.33333333, 0.33333333, 0.33333333], [1.33333333, 1.33333333, 1.33333333], [3. , 3. , 3. ]]) 
- 
soundpy.dsp.calc_average_power(matrix, num_iters)[source]¶
- Divides matrix values by the number of times power values were added. - This function assumes the power values of n-number of series were calculated and added. It divides the values in the input matrix by n, i.e. ‘num_iters’. - Parameters
- matrix ( - ndarray) – a collection of floats or ints representing the sum of power values across several series sets
- num_iters ( - int) – an integer denoting the number of times power values were added to the input matrix
 
- Returns
- matrix – the averaged input matrix 
- Return type
- ndarray
 - Examples - >>> matrix = np.array([[6,6,6],[3,3,3],[1,1,1]]) >>> ave_matrix = calc_average_power(matrix, 3) >>> ave_matrix array([[2. , 2. , 2. ], [1. , 1. , 1. ], [0.33333333, 0.33333333, 0.33333333]]) 
- 
soundpy.dsp.calc_phase(fft_matrix, radians=False)[source]¶
- Calculates phase from complex fft values. - Parameters
- fft_matrix ( - np.ndarray [shape=(num_frames,- num_features),- dtype=complex]) – matrix with fft values
- radians ( - boolean) – If False, complex values are returned; if True, radians are returned. (default False)
 
- Returns
- phase – Phase values for fft_matrix. If radians is set to False, dtype = complex. If radians is set to True, dtype = float. 
- Return type
- np.ndarray [shape=(num_frames,- num_features)]
 - Examples - >>> import numpy as np >>> frame_length = 10 >>> time = np.arange(0, 10, 0.1) >>> signal = np.sin(time)[:frame_length] >>> fft_vals = np.fft.fft(signal) >>> phase = calc_phase(fft_vals, radians=False) >>> phase[:2] array([ 1. +0.j , -0.37872566+0.92550898j]) >>> phase = calc_phase(fft_vals, radians=True) >>> phase[:2] array([0. , 1.95921533]) 
- 
soundpy.dsp.reconstruct_whole_spectrum(band_reduced_noise_matrix, n_fft=None)[source]¶
- Reconstruct whole spectrum by mirroring complex conjugate of data. - Parameters
- band_reduced_noise_matrix ( - np.ndarray [size=(n_fft,),- dtype=np.floator- np.complex_]) – Matrix with either power or fft values of the left part of the fft. The whole fft can be provided; however the right values will be overwritten by a mirrored left side.
- n_fft ( - int, optional) – If None, n_fft set to length of band_reduced_noise_matrix. n_fft defines the size of the mirrored vector.
 
- Returns
- output_matrix – Mirrored vector of input data. 
- Return type
- np.ndarray [size = (n_fft,),- dtype=np.floator- np.complex_]
 - Examples - >>> x = np.array([3.,2.,1.,0.]) >>> # double the size of x >>> x_rec = sp.dsp.reconstruct_whole_spectrum(x, n_fft=int(len(x)*2)) >>> x_rec array([3., 2., 1., 0., 0., 1., 2., 3.]) >>> # overwrite right side of data >>> x = np.array([3.,2.,1.,0.,0.,2.,3.,5.]) >>> x_rec = sp.dsp.reconstruct_whole_spectrum(x, n_fft=len(x)) >>> x_rec array([3., 2., 1., 0., 0., 1., 2., 3.]) 
- 
soundpy.dsp.apply_original_phase(spectrum, phase)[source]¶
- Multiplies the phase with the magnitude or power spectrum - Parameters
- spectrum ( - np.ndarray [shape=(n,),- dtype=np.floator- np.complex]) – Magnitude or power spectrum
- phase ( - np.ndarray [shape=(n,),- dtype=np.floator- np.complex]) – Phase to be applied to spectrum
 
- Returns
- spectrum_complex 
- Return type
- np.ndarray [shape=(n,),- dtype = np.complex]
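A sketch of the underlying idea in plain NumPy: multiplying a magnitude spectrum with the complex phase (as from calc_phase with radians=False) recovers the complex spectrum. This illustrates the concept, not the library's exact implementation:

>>> import numpy as np
>>> fft_vals = np.fft.fft(np.sin(np.arange(10)))
>>> magnitude = np.abs(fft_vals)
>>> phase = fft_vals / np.abs(fft_vals)   # complex phase values
>>> np.allclose(magnitude * phase, fft_vals)
True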
 
- 
soundpy.dsp.calc_posteri_snr(target_power_spec, noise_power_spec)[source]¶
- Calculates and signal to noise ratio of current frame - Parameters
- target_power_spec ( - ndarray) – matrix with power values of the target signal
- noise_power_spec ( - ndarray) – matrix with power values of the noise signal
 
- Returns
- posteri_snr – matrix containing the signal to noise ratio 
- Return type
- ndarray
 - Examples - >>> sig_power = np.array([6,6,6,6]) >>> noise_power = np.array([2,2,2,2]) >>> calc_posteri_snr(sig_power, noise_power) array([3., 3., 3., 3.]) 
- 
soundpy.dsp.get_local_target_high_power(target_samples, sr, local_size_ms=25, min_power_percent=0.25)[source]¶
- 
soundpy.dsp.get_vad_snr(target_samples, noise_samples, sr, extend_window_ms=0)[source]¶
- Approximates the signal-to-noise ratio of two sets of power spectra - Note: this is a simple implementation and should not be used for official/exact measurement of snr. - Parameters
- target_samples ( - np.ndarray [size = (num_samples,- )]) – The samples of the main / speech signal. Only frames with higher levels of energy will be used to calculate SNR.
- noise_samples ( - np.ndarray [size = (num_samples,- )]) – The samples of background noise. Expects only noise, no speech. Must be the same sample rate as the target_samples
- sr ( - int) – The sample rate for the audio samples.
- local_size_ms ( - intor- float) – The length in milliseconds to calculate level of SNR. (default 25)
- min_power_percent ( - float) – The minimum percentage of energy / power the target samples should have. This is to look at only sections with speech or other signal of interest and not periods of silence. Value should be between 0 and 1. (default 0.25)
 
 - References - http://www1.icsi.berkeley.edu/Speech/faq/speechSNR.html - Gomolka, Ryszard. (2017). Re: How to measure signal-to-noise ratio (SNR) in real time?. Retrieved from: https://www.researchgate.net/post/How_to_measure_signal-to-noise_ratio_SNR_in_real_time/586a880f217e2060b65a8853/citation/download. - https://www.who.int/occupational_health/publications/noise1.pdf 
- 
soundpy.dsp.snr_adjustnoiselevel(target_samples, noise_samples, sr, snr)[source]¶
- Computes scale factor to adjust noise samples to achieve snr. - From script addnoise_asl_nseg.m: This function adds noise to a file at a specified SNR level. It uses the active speech level to compute the speech energy. The active speech level is computed as per ITU-T P.56 standard. - soundpy Note: this functionality was pulled from the MATLAB script: addnoise_asl_nseg.m at this GitHub repo: https://github.com/SIP-Lab/CNN-VAD/blob/master/Training%20Code/Functions/addnoise_asl_nseg.m - I do not understand all that went on to calculate the scale factor and therefore do not explain anything further than the original script. - Parameters
- target_samples ( - np.ndarray [size = (num_samples,)]) – The audio samples of the target / clean signal.
- noise_samples ( - np.ndarray [size = (num_samples,)]) – The audio samples of the noise signal.
- sr ( - int) – The sample rate of both target_samples and noise_samples
- snr ( - int) – The desired signal-to-noise ratio of the target and noise audio signals.
 
- Returns
- scale_factor – The factor by which noise samples should be multiplied before being added to target samples to achieve the desired SNR. 
- Return type
 - References - Yi Hu and Philipos C. Loizou, original authors
- Copyright (c) 2006 by Philipos C. Loizou 
- SIP-Lab/CNN-VAD/GitHub Repo
- Copyright (c) 2019 Signal and Image Processing Lab MIT License 
 - ITU-T (1993). Objective measurement of active speech level. ITU-T Recommendation P. 56 - See also 
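Example: a hedged usage sketch (a synthetic tone stands in for speech here; whether a pure tone satisfies the P.56 active-speech computation is untested):

>>> import numpy as np
>>> import soundpy as sp
>>> sr = 16000
>>> clean = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)   # 1 second, 200 Hz tone
>>> np.random.seed(0)
>>> noise = np.random.normal(0, 0.1, sr)
>>> factor = sp.dsp.snr_adjustnoiselevel(clean, noise, sr, snr=10)
>>> noisy = clean + factor * noise   # noise scaled toward the requested SNR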
- 
soundpy.dsp.asl_P56(samples, sr, bitdepth=16, smooth_factor=0.03, hangover=0.2, margin_db=15.9)[source]¶
- Computes the active speech level according to ITU-T P.56 standard. - Note: I don’t personally understand the functionality behind this function and therefore do not offer the best documentation as of yet. - Parameters
- samples ( - np.ndarray [size = (num_samples,- )]) – The audio samples, for example speech samples.
- sr ( - int) – The sample rate of samples.
- bitdepth ( - int) – The bitdepth of audio. Expects 16. (default 16)
- smooth_factor ( - float) – Time smoothing factor. (default 0.03)
- hangover ( - float) – Hangover. Thank goodness not the kind I’m familiar with. (default 0.2)
 
- Returns
 - References - ITU-T (1993). Objective measurement of active speech level. ITU-T Recommendation P. 56 - TODO handle bitdepth variation - what if not 16? TODO improve documentation 
- 
soundpy.dsp.calc_posteri_prime(posteri_snr)[source]¶
- Calculates the posteri prime - Parameters
- posteri_snr ( - ndarray) – The signal-to-noise ratio of the noisy signal, frame by frame.
- Returns
- posteri_prime – The primed posteri_snr, calculated according to the reference paper. 
- Return type
- ndarray
 - References - Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632. 
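In the cited paper, the 'prime' is the half-wave rectified a posteriori SNR minus one; a sketch of that definition (assumed, not confirmed, to match the library exactly):

>>> import numpy as np
>>> posteri_snr = np.array([3., 0.5, 1.])
>>> np.maximum(posteri_snr - 1, 0)   # negative values floored at zero
array([2., 0., 0.])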
- 
soundpy.dsp.calc_prior_snr(snr, snr_prime, smooth_factor=0.98, first_iter=None, gain=None)[source]¶
- Estimates the signal-to-noise ratio of the previous frame - Depending on the first_iter argument, the prior snr is calculated according to different algorithms. If first_iter is None, prior snr is calculated according to Scalart and Filho (1996); if first_iter is True or False, snr prior is calculated according to Loizou (2013). - Parameters
- snr ( - ndarray) – The signal-to-noise ratio of target vs noise power/energy levels.
- snr_prime ( - ndarray) – The prime of the snr (see Scalart & Filho (1996))
- smooth_factor ( - float) – The value applied to smooth the signal. (default 0.98)
- first_iter ( - None,- True,- False) – If None, snr prior values are estimated the same, no matter if it is the first iteration or not (Scalart & Filho (1996)) If True, snr prior values are estimated without gain (Loizou 2013) If False, snr prior values are estimated with gain (Loizou 2013) (default None)
- gain ( - None,- ndarray) – If None, gain will not be used. If gain, it is a previously calculated value from the previous frame. (default None)
 
- Returns
- prior_snr – Estimation of signal-to-noise ratio of the previous frame of target signal. 
- Return type
- ndarray
 - References - C Loizou, P. (2013). Speech Enhancement: Theory and Practice. - Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632. 
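A sketch of the decision-directed estimate these references describe, with gain and snr taken from the previous frame (how exactly first_iter toggles the gain term in the library is an assumption here):

>>> import numpy as np
>>> alpha = 0.98                        # smooth_factor
>>> snr = np.array([3., 3., 3.])        # previous frame's posteri snr
>>> snr_prime = np.array([2., 2., 2.])
>>> gain = np.array([0.75, 0.75, 0.75]) # previous frame's gain
>>> alpha * (gain ** 2) * snr + (1 - alpha) * snr_prime
array([1.69375, 1.69375, 1.69375])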
- 
soundpy.dsp.calc_gain(prior_snr)[source]¶
- Calculates the gain (i.e. attenuation) values to reduce noise. - Parameters
- prior_snr ( - ndarray) – The prior signal-to-noise ratio estimation
- Returns
- gain – An array of attenuation values to be applied to the signal (stft) array at the current frame. 
- Return type
- ndarray
 - References - C Loizou, P. (2013). Speech Enhancement: Theory and Practice. - Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632. 
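With the Wiener approach in these references, the gain follows prior_snr / (1 + prior_snr); a sketch of that relation (assumed, not confirmed, to be the library's exact formula):

>>> import numpy as np
>>> prior_snr = np.array([3., 1., 0.])
>>> prior_snr / (1 + prior_snr)   # high snr -> gain near 1; low snr -> strong attenuation
array([0.75, 0.5 , 0.  ])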
- 
soundpy.dsp.apply_gain_fft(fft_vals, gain)[source]¶
- Reduces noise by applying gain values to the stft / fft array of the target signal - Parameters
- fft_vals ( - ndarray(complex)) – Matrix containing complex values (i.e. stft values) of target signal
- gain ( - ndarray(real)) – Matrix containing calculated attenuation values to apply to ‘fft_vals’
 
- Returns
- enhanced_fft – Matrix with attenuated noise in target (stft) values 
- Return type
- ndarray(complex)
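Applying the gain amounts to an element-wise multiplication of the complex fft/stft values; a minimal sketch of the idea:

>>> import numpy as np
>>> fft_vals = np.array([1 + 1j, 2 + 0j, 0 - 3j])
>>> gain = np.array([0.5, 1.0, 0.1])
>>> enhanced_fft = fft_vals * gain   # array([0.5+0.5j, 2.+0.j, 0.-0.3j])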
 
- 
soundpy.dsp.postfilter(original_powerspec, noisereduced_powerspec, gain, threshold=0.4, scale=10)[source]¶
- Apply filter that reduces musical noise resulting from another filter. - If it is estimated that speech (or target signal) is present, reduced filtering is applied. - References - T. Esch and P. Vary, “Efficient musical noise suppression for speech enhancement system,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, 2009. 
- 
soundpy.dsp.calc_ifft(signal_section, real_signal=None, norm=False)[source]¶
- Calculates the inverse fft of a series of fft values - The real values of the ifft can be saved as an audiofile - Parameters
- signal_section ( - ndarray [shape=(num_freq_bins,)]) – The frame of fft values to apply the inverse fft to
- num_fft ( - int, optional) – The number of total fft values applied when calculating the original fft. If not given, length of signal_section is used.
- norm ( - bool) – Whether or not the ifft should apply ‘ortho’ normalization (default False)
 
- Returns
- ifft_vals – The inverse Fourier transform of filtered audio data 
- Return type
- ndarray(complex)
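The underlying round trip in plain NumPy — the real part of the ifft of fft values recovers the time-domain samples:

>>> import numpy as np
>>> sig = np.array([1., 2., 3., 4.])
>>> np.allclose(np.fft.ifft(np.fft.fft(sig)).real, sig)
True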
 
- 
soundpy.dsp.control_volume(samples, max_limit)[source]¶
- Keeps max volume of samples to within a specified range. - Parameters
- samples ( - ndarray) – series of audio samples
- max_limit ( - float) – maximum boundary of the maximum value of the audio samples
 
- Returns
- samples – samples with volume adjusted (if need be). 
- Return type
- np.ndarray
 - Examples - >>> import numpy as np >>> #low volume example: increase volume to desired window >>> x = np.array([-0.03, 0.04, -0.05, 0.02]) >>> x = control_volume(x, max_limit=0.25) >>> x array([-0.13888889, 0.25 , -0.25 , 0.13888889]) >>> #high volume example: decrease volume to desired window >>> y = np.array([-0.3, 0.4, -0.5, 0.2]) >>> y = control_volume(y, max_limit=0.15) >>> y array([-0.08333333, 0.15 , -0.15 , 0.08333333]) 
- 
soundpy.dsp.calc_power_ratio(original_powerspec, noisereduced_powerspec)[source]¶
- Calc. the ratio of original vs noise reduced power spectrum. 
- 
soundpy.dsp.calc_noise_frame_len(SNR_decision, threshold, scale)[source]¶
- Calc. window length for calculating moving average. - Note: lower SNRs require larger window. 
- 
soundpy.dsp.calc_linear_impulse(noise_frame_len, num_freq_bins)[source]¶
- Calc. the post filter coefficients to be applied to gain values. 
- 
soundpy.dsp.spread_volumes(samples, vol_list=[0.1, 0.3, 0.5])[source]¶
- Returns samples with a range of volumes. - This may be useful for augmenting (transforming) training data. - Parameters
- samples ( - ndarray) – Series belonging to acoustic signal.
- vol_list ( - list) – List of floats or ints representing the volumes the samples are to be oriented towards. (default [0.1,0.3,0.5])
 
- Returns
- volrange_dict – Dictionary containing copies of the samples set at the various volumes in vol_list. 
- Return type
 
- 
soundpy.dsp.create_empty_matrix(shape, complex_vals=False)[source]¶
- Allows creation of a matrix filled with real or complex zeros. - In digital signal processing, complex numbers are common; it is important to note that if complex_vals=False and complex values are inserted into the matrix, the imaginary part will be removed. - Parameters
- Returns
- matrix – a matrix filled with real or complex zeros 
- Return type
- ndarray
 - Examples - >>> matrix = create_empty_matrix((3,4)) >>> matrix array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]) >>> matrix_complex = create_empty_matrix((3,4),complex_vals=True) >>> matrix_complex array([[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j], [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j], [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]]) >>> vector = create_empty_matrix(5,) >>> vector array([0., 0., 0., 0., 0.]) 
- 
soundpy.dsp.overlap_add(enhanced_matrix, frame_length, overlap, complex_vals=False)[source]¶
- Overlaps and adds windowed sections together to form 1D signal. - Parameters
- Returns
- new_signal – Length equals (frame_length - overlap) * enhanced_matrix.shape[1] + overlap 
- Return type
- np.ndarray [shape=(new_signal_length,),- dtype=float]
 - Examples - >>> import numpy as np >>> enhanced_matrix = np.ones((4, 4)) >>> frame_length = 4 >>> overlap = 1 >>> sig = overlap_add(enhanced_matrix, frame_length, overlap) >>> sig array([1., 1., 1., 2., 1., 1., 2., 1., 1., 2., 1., 1., 1.]) 
- 
soundpy.dsp.random_selection_samples(samples, len_section_samps, wrap=False, random_seed=None, axis=0)[source]¶
- Selects a section of samples, starting at random. - Parameters
- samples ( - np.ndarray [shape = (num_samples,- )]) – The array of sample data
- len_section_samps ( - int) – How many samples should be randomly selected
- wrap ( - bool) – If False, the selected section will not be wrapped from end to beginning; if True, the random selection may take a sound sample that wraps from the end to the beginning. See examples below. (default False)
- random_seed ( - int, optional) – If replicated randomization desired. (default None)
 
 - Examples - >>> import numpy as np >>> # no wrap: >>> x = np.array([1,2,3,4,5,6,7,8,9,10]) >>> n = sp.dsp.random_selection_samples(x, len_section_samps = 7, ... wrap = False, random_seed = 40) >>> n array([3, 4, 5, 6, 7, 8, 9]) >>> # with wrap: >>> n = sp.dsp.random_selection_samples(x, len_section_samps = 7, ... wrap = True, random_seed = 40) >>> n array([ 7, 8, 9, 10, 1, 2, 3]) 
- 
soundpy.dsp.get_pitch(sound, sr=16000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', **kwargs)[source]¶
- Approximates pitch by collecting dominant frequencies of signal. 
- 
soundpy.dsp.get_mean_freq(sound, sr=16000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', percent_vad=0.75)[source]¶
- Takes the mean of dominant frequencies of voice activated regions in a signal. - Note: Silences discarded. - The average fundamental frequency for a male voice is 125Hz; for a female voice it’s 200Hz; and for a child’s voice, 300Hz. (Russell, J., 2020) - References - Russell, James (2020) The Human Voice and the Frequency Range. Retrieved from: https://blog.accusonus.com/pro-audio-production/human-voice-frequency-range/ 
- 
soundpy.dsp.vad(sound, sr, win_size_ms=50, percent_overlap=0, real_signal=False, fft_bins=None, window='hann', energy_thresh=40, freq_thresh=185, sfm_thresh=5, min_energy=None, min_freq=None, min_sfm=None, use_beg_ms=120)[source]¶
- Warning: this VAD works best with sample rates above 44100 Hz. - Parameters
 - References - Moattar and M. M. Homayounpour, “A simple but efficient real-time Voice Activity Detection algorithm,” 2009 17th European Signal Processing Conference, Glasgow, 2009, pp. 2549-2553. 
 
 
- 
soundpy.dsp.sound_index(speech_energy, speech_energy_mean, start=True)[source]¶
- Identifies the index of where speech or energy starts or ends. 
- 
soundpy.dsp.get_dom_freq(power_values)[source]¶
- If real_signal (i.e. half fft bins), might mess up values. 
- 
soundpy.dsp.short_term_energy(signal_windowed)[source]¶
- Expects - signalto be scaled (-1, 1) as well as windowed.- References 
- 
soundpy.dsp.bilinear_warp(fft_value, alpha)[source]¶
- Subfunction for vocal tract length perturbation. - See also - References - Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria. 
- 
soundpy.dsp.piecewise_linear_warp(fft_value, alpha, max_freq)[source]¶
- Subfunction for vocal tract length perturbation. - See also - References - Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria. 
- 
soundpy.dsp.f0_approximation(sound, sr, low_freq=50, high_freq=300, **kwargs)[source]¶
- Approximates fundamental frequency. - Limits the stft of voice active sections to frequencies between low_freq and high_freq and takes the mean of the dominant frequencies within that range. Defaults are set at 50 and 300 as most human speech frequencies occur between 85 and 255 Hz. - References 
Filters: Wiener and Band Spectral Subtraction¶
Filters module covers functions related to the filtering out of noise of a target signal.
- 
class soundpy.filters.FilterSettings(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]¶
- Bases: - object- Basic settings for filter related classes to inherit from. - 
sr¶
- Desired sampling rate of audio; audio will be resampled to match if audio has other sampling rate. (default 48000) - Type
 
 - 
frame_length¶
- Number of audio samples in each frame: frame_dur multiplied with sr, divided by 1000. (default 960) - Type
 
 - 
overlap_length¶
- Number of overlapping audio samples between subsequent frames: frame_length multiplied by percent_overlap, floored. (default 480) - Type
 
 - 
num_fft_bins¶
- The number of frequency bins used when calculating the fft. Currently the frame_length is used to set num_fft_bins. - Type
 
 - 
zeropad¶
- If False, only full frames of audio data are processed. If True, the last partial frame will be zeropadded. (default False) - Type
- bool, optional
 
 - Methods - get_window()- Returns window according to set window type and frame length. 
- 
- 
class soundpy.filters.Filter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶
- Bases: - soundpy.filters.FilterSettings- Interactive class to explore Wiener filter settings on audio signals. - These class methods implement research-based algorithms with low computational cost, aimed at noise reduction via mobile phone. - 
beta¶
- Value applied in Wiener filter that smooths the application of ‘gain’; default set according to previous research. (default 0.98) - Type
 
 - 
first_iter¶
- Keeps track if first_iter is relevant in filtering. If True, filtering has just started, and calculations made for filtering cannot use information from previous frames; if False, calculations for filtering use information from previous frames; if None, no difference is applied when processing the 1st vs subsequent frames. (default None) - Type
- bool, optional
 
 - 
target_subframes¶
- The number of total subsections within the total number of samples belonging to the target signal (i.e. audiofile being filtered). Until target_subframes is calculated, it is set to None. (default None) 
 - 
noise_subframes¶
- The number of total subsections within the total number of samples belonging to the noise signal. If noise power spectrum is used, this doesn’t need to be calculated. Until noise_subframes is calculated, it is set to None. (default None) 
 - 
gain¶
- Once calculated, the attenuation values to be applied to the fft for noise reduction. Until calculated, None. (default None) - Type
- ndarray,- None
 
 - Methods - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window according to set window type and frame length. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - 
set_volume(samples, max_vol=0.4, min_vol=0.15)[source]¶
- Records and limits the maximum amplitude of original samples. - This enables the output wave to be within a range of volume that does not go below or too far above the original maximum amplitude of the signal. - Parameters
- samples ( - ndarray) – The original samples of a signal (1 dimensional), of any length
- max_vol ( - float) – The maximum volume level. If a signal has values higher than this number, the signal is curtailed to remain at and below this number.
- min_vol ( - float) – The minimum volume level. If a signal has only values lower than this number, the signal is amplified to be at this number and below.
 
- Returns
- Return type
 
 
- 
- 
class soundpy.filters.WienerFilter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]¶
- Bases: - soundpy.filters.Filter- Methods - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window according to set window type and frame length. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - apply_postfilter - apply_wienerfilter 
- 
class soundpy.filters.BandSubtraction(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶
- Bases: - soundpy.filters.Filter- Methods - Calculate over subtraction factor used in the cited paper. - calc_relevant_band(target_powspec)- Calculates band with highest energy levels. - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window according to set window type and frame length. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - Provides starting and ending frequency bins/indices for each band. - update_posteri_bands(target_powspec, …)- Updates SNR of each set of bands. - apply_bandspecsub - apply_floor - apply_postfilter - sub_noise - 
setup_bands()[source]¶
- Provides starting and ending frequency bins/indices for each band. - Parameters
- self ( - class) – Contains variables num_bands (if None, set to 6) and frame_length
- Returns
- Sets the class variables band_start_freq and band_end_freq. 
- Return type
 - Examples - >>> import soundpy as sp >>> import numpy as np >>> # Default is set to 6 bands: >>> fil = sp.BandSubtraction() >>> fil.setup_bands() >>> fil.band_start_freq array([ 0., 80., 160., 240., 320., 400.]) >>> fil.band_end_freq array([ 80., 160., 240., 320., 400., 480.]) >>> # change default settings >>> fil = sp.BandSubtraction(num_bands=5) >>> fil.setup_bands() >>> fil.band_start_freq array([ 0., 96., 192., 288., 384.]) >>> fil.band_end_freq array([ 96., 192., 288., 384., 480.]) 
 - 
update_posteri_bands(target_powspec, noise_powspec)[source]¶
- Updates SNR of each set of bands. - MATLAB code from speech enhancement book uses power, puts it into magnitude (via square root), then puts it back into power..? And uses some sort of ‘norm’ function… which I think is actually just the sum. Original equation can be found in the paper below. page 117 from book? - paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. - I am using power for the time being. - Examples - >>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for space: >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with and without noise >>> time = np.arange(0, 10, 0.01) >>> signal = np.sin(time)[:fil.frame_length] >>> np.random.seed(0) >>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960) >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2 >>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy) >>> fil.snr_bands array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895]) >>> # compare with no noise in signal: >>> fil.update_posteri_bands(powerspec_clean, powerspec_clean) >>> fil.snr_bands array([0., 0., 0., 0.]) 
 - 
calc_oversub_factor()[source]¶
- Calculate over subtraction factor used in the cited paper. - Uses decibel SNR values calculated in update_posteri_bands() - paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. - Examples - >>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for space: >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with and without noise >>> time = np.arange(0, 10, 0.01) >>> signal = np.sin(time)[:fil.frame_length] >>> np.random.seed(0) >>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960) >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2 >>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy) >>> fil.snr_bands array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895]) >>> a = fil.calc_oversub_factor() >>> a array([4.28678354, 4.75 , 4.75 , 4.75 ]) >>> # compare with no noise in signal: >>> fil.update_posteri_bands(powerspec_clean, powerspec_clean) >>> fil.snr_bands array([0., 0., 0., 0.]) >>> a = fil.calc_oversub_factor() >>> a array([4., 4., 4., 4.]) 
 - 
calc_relevant_band(target_powspec)[source]¶
- Calculates band with highest energy levels. - Parameters
- self ( - class instance) – Contains class variables band_start_freq and band_end_freq.
- target_powerspec ( - np.ndarray) – Power spectrum of the target signal.
 
- Returns
- rel_band_index ( - int) – Index for which band contains the most energy.
- band_energy_matrix ( - np.ndarray [size=(num_bands,- ),- dtype=np.float]) – Power levels of each band.
 
 - Examples - >>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for this example (default is 6): >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with and with frequency 25 >>> time = np.arange(0, 10, 0.01) >>> full_circle = 2 * np.pi >>> freq = 25 >>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length] >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean) >>> rel_band_index 2 >>> # and with frequency 50 >>> freq = 50 >>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length] >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean) >>> rel_band_index 3 
 
- 
- 
class soundpy.filters.FilterSettings(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]¶
- Bases: - object- Basic settings for filter related classes to inherit from. - 
sr¶
- Desired sampling rate of audio; audio will be resampled to match if audio has other sampling rate. (default 48000) - Type
 
 - 
frame_length¶
- Number of audio samples in each frame: frame_dur multiplied with sr, divided by 1000. (default 960) - Type
 
 - 
overlap_length¶
- Number of overlapping audio samples between subsequent frames: frame_length multiplied by percent_overlap, floored. (default 480) - Type
 
 - 
num_fft_bins¶
- The number of frequency bins used when calculating the fft. Currently the frame_length is used to set num_fft_bins. - Type
 
 - 
zeropad¶
- If False, only full frames of audio data are processed. If True, the last partial frame will be zeropadded. (default False) - Type
- bool, optional
 
 - Methods - Returns window acc. 
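The attribute arithmetic can be checked directly. A minimal sketch, assuming the defaults above (sr of 48000, the 20 ms window implied by frame_length 960, and percent_overlap of 0.5):

>>> sr = 48000
>>> win_size_ms = 20          # assumed window duration implied by frame_length 960
>>> percent_overlap = 0.5
>>> frame_length = win_size_ms * sr // 1000
>>> frame_length
960
>>> overlap_length = int(frame_length * percent_overlap)
>>> overlap_length
480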
- 
- 
class soundpy.filters.Filter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶
- Bases: - soundpy.filters.FilterSettings- Interactive class to explore Wiener filter settings on audio signals. - These class methods implement research-based algorithms with low computational cost, aimed at noise reduction on mobile devices. - 
beta¶
- Value applied in Wiener filter that smooths the application of ‘gain’; default set according to previous research. (default 0.98) - Type
 
 - 
first_iter¶
- Tracks whether the current frame is the first to be filtered. If True, filtering has just started and calculations cannot use information from previous frames; if False, calculations use information from previous frames; if None, the first frame is processed no differently from subsequent frames. (default None) - Type
- bool, optional
 
 - 
target_subframes¶
- The number of total subsections within the total number of samples belonging to the target signal (i.e. audiofile being filtered). Until target_subframes is calculated, it is set to None. (default None) 
 - 
noise_subframes¶
- The number of total subsections within the total number of samples belonging to the noise signal. If noise power spectrum is used, this doesn’t need to be calculated. Until noise_subframes is calculated, it is set to None. (default None) 
 - 
gain¶
- Once calculated, the attenuation values to be applied to the fft for noise reduction. Until calculated, None. (default None) - Type
- ndarray,- None
 
 - Methods - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window acc. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - 
__init__(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶
- Initialize self. See help(type(self)) for accurate signature. 
 - 
set_volume(samples, max_vol=0.4, min_vol=0.15)[source]¶
- Records and limits the maximum amplitude of original samples. - This keeps the output wave within a volume range that neither falls below nor rises too far above the original maximum amplitude of the signal. - Parameters
- samples ( - ndarray) – The original samples of a signal (1 dimensional), of any length
- max_vol ( - float) – The maximum volume level. If a signal has values higher than this number, the signal is curtailed to remain at and below this number.
- min_vol ( - float) – The minimum volume level. If a signal has only values lower than this number, the signal is amplified to be at this number and below.
 
- Returns
- Return type
 
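To make the clamping behavior concrete, here is a rough sketch of the logic described above; this is a hypothetical helper for illustration, not the method's actual implementation:

>>> import numpy as np
>>> def clamp_volume(samples, max_vol=0.4, min_vol=0.15):
...     # hypothetical illustration of the documented behavior
...     peak = np.max(np.abs(samples))
...     if peak == 0:
...         return samples
...     if peak > max_vol:
...         samples = samples * (max_vol / peak)   # curtail loud signals
...     elif peak < min_vol:
...         samples = samples * (min_vol / peak)   # amplify quiet signals
...     return samples
>>> loud = np.array([0.0, 0.9, -0.8])
>>> quieter = clamp_volume(loud)        # peak scaled down to max_vol (0.4)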
 
- 
- 
class soundpy.filters.WienerFilter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]¶
- Bases: - soundpy.filters.Filter- Methods - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window acc. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - apply_postfilter - apply_wienerfilter 
- 
class soundpy.filters.BandSubtraction(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶
- Bases: - soundpy.filters.Filter- Methods - Calculate over-subtraction factor used in the cited paper. - calc_relevant_band(target_powspec)- Calculates band with highest energy levels. - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window acc. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - Provides starting and ending frequency bins/indices for each band. - update_posteri_bands(target_powspec, …)- Updates SNR of each set of bands. - apply_bandspecsub - apply_floor - apply_postfilter - sub_noise - 
__init__(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶
- Initialize self. See help(type(self)) for accurate signature. 
 - 
setup_bands()[source]¶
- Provides starting and ending frequency bins/indices for each band. - Parameters
- self ( - class) – Contains variables num_bands (if None, set to 6) and frame_length
- Returns
- Sets the class variables band_start_freq and band_end_freq. 
- Return type
- Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # Default is set to 6 bands:
>>> fil = sp.BandSubtraction()
>>> fil.setup_bands()
>>> fil.band_start_freq
array([ 0., 80., 160., 240., 320., 400.])
>>> fil.band_end_freq
array([ 80., 160., 240., 320., 400., 480.])
>>> # change default settings
>>> fil = sp.BandSubtraction(num_bands=5)
>>> fil.setup_bands()
>>> fil.band_start_freq
array([ 0., 96., 192., 288., 384.])
>>> fil.band_end_freq
array([ 96., 192., 288., 384., 480.])
 - 
update_posteri_bands(target_powspec, noise_powspec)[source]¶
- Updates SNR of each set of bands. - Note: the MATLAB code from the referenced speech enhancement book converts power to magnitude (via square root) and then back into power, and applies a 'norm' function that appears to simply be the sum. The original equation can be found in the paper below (around page 117 of the book). This implementation uses power throughout. - paper: Kamath, S. D. & Loizou, P. C. (2002), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. - Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal), np.mean(signal)+0.3, 960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
 - 
calc_oversub_factor()[source]¶
- Calculate over-subtraction factor used in the cited paper. - Uses decibel SNR values calculated in update_posteri_bands(). - paper: Kamath, S. D. & Loizou, P. C. (2002), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. - Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal), np.mean(signal)+0.3, 960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> a = fil.calc_oversub_factor()
>>> a
array([4.28678354, 4.75      , 4.75      , 4.75      ])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
>>> a = fil.calc_oversub_factor()
>>> a
array([4., 4., 4., 4.])
 - 
calc_relevant_band(target_powspec)[source]¶
- Calculates band with highest energy levels. - Parameters
- self ( - class instance) – Contains class variables band_start_freq and band_end_freq.
- target_powerspec ( - np.ndarray) – Power spectrum of the target signal.
 
- Returns
- rel_band_index ( - int) – Index for which band contains the most energy.
- band_energy_matrix ( - np.ndarray [size=(num_bands,- ),- dtype=np.float]) – Power levels of each band.
 
- Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for this example (default is 6):
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with frequency 25
>>> time = np.arange(0, 10, 0.01)
>>> full_circle = 2 * np.pi
>>> freq = 25
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
2
>>> # and with frequency 50
>>> freq = 50
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
3
 
- 
Extract and manipulate audio features¶
Feats module includes functions related to converting audio sample data to features for analysis, filtering, machine learning, or visualization.
- 
soundpy.feats.plot(feature_matrix, feature_type, save_pic=False, name4pic=None, energy_scale='power_to_db', title=None, sr=None, win_size_ms=None, percent_overlap=None, x_label=None, y_label=None, subprocess=False, overwrite=False)[source]¶
- Visualize feature extraction; frames on the x axis, features on the y axis. Uses librosa to scale the data if an energy scale is applied. - Note: multiple channels are only supported if feature_type is 'signal'. For other feature types, the plot will not work as expected. - Parameters
- feature_matrix ( - np.ndarray [shape=(num_samples,) or (num_samples, num_channels), or (num_features, num_frames), dtype=np.float]) – Matrix of features. If the features are not of type 'signal' and the shape is 1D, a dimension will be added so the features can be plotted with a colormesh.
- feature_type ( - str) – Options: 'signal', 'stft', 'mfcc', or 'fbank' features, or whatever the user would like to name the feature set. signal: the 1D samples of sound; stft: short-time Fourier transform; mfcc: mel frequency cepstral coefficients; fbank: mel-log filterbank energies. (default 'fbank')
- save_pic ( - bool) – True to save image as .png; False to just plot it. If- subprocessis True, save_pic will automatically be set to True.
- name4pic ( - str, optional) – If save_pic set to True, the name the image should be saved under.
- energy_scale ( - str, optional) – If features need to be adjusted, e.g. from power to decibels. Default is ‘power_to_db’.
- title ( - str, optional) – The title for the graph. If None, feature_type is used.
- sr ( - int, optional) – Useful in plotting the time for features.
- win_size_ms ( - int,- float, optional) – Useful in plotting the time for features in the frequency domain (e.g. STFT, FBANK, MFCC features)
- percent_overlap ( - int,- float, optional) – Useful in plotting the time for features in the frequency domain (e.g. STFT, FBANK, MFCC features)
- x_label ( - str, optional) – The label to be applied to the x axis.
- y_label ( - str, optional) – The label to be applied to the y axis.
- subprocess ( - bool) – If- subprocessis True, matplotlib will use backend ‘Agg’, which only allows plots to be saved. If- subprocessis False, the default backend ‘TkAgg’ will be used, which allows plots to be generated live as well as saved. The ‘Agg’ backend is useful if one wants to visualize sound while a main process is being performed, for example, while a model is being trained. (default False)
- overwrite ( - bool) – If False, if .png file already exists under given name, a date tag will be added to the .png filename to avoid overwriting the file. (default False)
 
- Returns
- Return type
 
- 
soundpy.feats.plotsound(audiodata, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, fft_bins=None, num_filters=40, num_mfcc=40, sr=None, save_pic=False, name4pic=None, energy_scale='power_to_db', mono=None, real_signal=False, **kwargs)[source]¶
- Visualize feature extraction depending on the set parameters. - Stereo sound can be graphed. If feature_type is 'signal', all channels will be graphed on the same plot; otherwise, each channel will be plotted separately. - Parameters
- audiodata ( - str,- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.
- feature_type ( - str) – Options: ‘signal’, ‘mfcc’, or ‘fbank’ features. MFCC: mel frequency cepstral coefficients; FBANK: mel-log filterbank energies (default ‘fbank’)
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 20)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- num_filters ( - int) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. (default 40)
- num_mfcc ( - int) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. Note: it is not possible to choose only 2-13 or 13-40; if num_mfcc is set to 40, all 40 coefficients will be included. (default 40).
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)
- mono ( - bool, optional) – When loading an audiofile, True will limit number of channels to one; False will allow more channels to be loaded. (default None, which results in mono channel loading.)
- **kwargs ( - additional keyword arguments) – Keyword arguments for soundpy.feats.plot
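A minimal usage sketch with a generated tone; it assumes plotsound is exposed at the package level, as other functions are in the examples throughout these docs:

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 16000
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 440 * time)       # hypothetical 440 Hz test tone
>>> sp.plotsound(signal, sr=sr, feature_type='signal')
>>> sp.plotsound(signal, sr=sr, feature_type='fbank', num_filters=40)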
 
 
- 
soundpy.feats.get_feats(sound, sr=None, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, window='hann', fft_bins=None, num_filters=None, num_mfcc=None, remove_first_coefficient=False, sinosoidal_liftering=False, dur_sec=None, mono=None, rate_of_change=False, rate_of_acceleration=False, subtract_mean=False, real_signal=True, fmin=None, fmax=None, zeropad=True)[source]¶
- Collects raw signal data, stft, fbank, or mfcc features. - Parameters
- sound ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)
- feature_type ( - str) – Options include 'signal', 'stft', 'powspec', 'fbank', or 'mfcc' data (default 'fbank'). signal: energy/amplitude measurements along time; stft: short-time Fourier transform; powspec: power spectrum (absolute value of stft, squared); fbank: mel-log filterbank energies; mfcc: mel frequency cepstral coefficients.
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 20)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- window ( - stror- np.ndarray [size (n_fft,- )]) – The window function to be applied to each window. (Default ‘hann’)
- fft_bins ( - int) – Number of frequency bins to apply in fast Fourier transform. (default None)
- num_filters ( - int) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. If None, will be set to 40. (default None)
- num_mfcc ( - int) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. If None, will be set to num_filters or 40. (default None).
- dur_sec ( - float, optional) – Time in seconds to limit in loading a signal. (default None)
- mono ( - bool, optional) – For loading an audiofile, True will result in only one channel of data being loaded; False will allow additional channels be loaded. (default None, which results in mono channel data)
- rate_of_change ( - bool) – If True, the first derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.
- rate_of_acceleration ( - bool) – If True, the second derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.
- subtract_mean ( - bool) – If True, the mean of each feature column will be subtracted from each row. This is applicable for all feature types except ‘signal’.
 
- Returns
- feats – Feature data. If feature_type is ‘signal’, returns a tuple containing samples and sampling rate. If feature_type is of another type, returns np.ndarray with shape (num_frames, num_filters/features) 
- Return type
- tuple (num_samples,- sr)or- np.ndarray [size (num_frames,- num_filters) dtype=np.floator- np.complex]
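For orientation, a minimal usage sketch; a generated tone stands in for a real file, and the shape comment follows from the defaults above:

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 16000
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 200 * time)       # 1 second test tone
>>> feats = sp.feats.get_feats(signal, sr=sr, feature_type='fbank',
...                            win_size_ms=20, percent_overlap=0.5)
>>> feats.shape      # roughly (num_frames, 40) with the default 40 filters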
 
- 
soundpy.feats.load_feat_settings(feat_settings_dict)[source]¶
- Loads feature settings into a named tuple, setting defaults where values are not present. TODO: test with previous version 
- 
soundpy.feats.get_stft(sound, sr=22050, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', zeropad=True, **kwargs)[source]¶
- Returns short-time Fourier transform matrix. - This function allows more flexibility in number of fft_bins and real_signal settings. Additionally, this does not require the package librosa, making it a bit easier to manipulate if desired. For an example, see - soundpy.augment.vtlp.- Parameters
- sound ( - np.ndarray [shape=(num_samples,)or- (num_samples,- num_channels)],- str, or- pathlib.PosixPath) – If type np.ndarray, expect raw samples in mono or stereo sound. If type str or pathlib.PosixPath, expect pathway to audio file.
- sr ( - int) – The sample rate of sound.
- win_size_ms ( - int,- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- real_signal ( - bool) – If True, only half of the FFT spectrum will be used; since the FFT of a real signal is symmetrical, there should be no difference in content, and setting real_signal to True may make processing more efficient.
- fft_bins ( - int) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound.
 
- Returns
- stft_matrix 
- Return type
- np.ndarray[size=(num_frames,- fft_bins)]
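A brief sketch of typical usage on generated samples; the shape follows the Returns section above:

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 22050
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 440 * time)
>>> stft_matrix = sp.feats.get_stft(signal, sr=sr, win_size_ms=50,
...                                 percent_overlap=0.5, fft_bins=1024)
>>> stft_matrix.shape      # (num_frames, fft_bins), complex-valued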
 
- 
soundpy.feats.get_fbank(sound, sr, num_filters, fmin=None, fmax=None, fft_bins=None, **kwargs)[source]¶
- Extract mel-filterbank energy features from audio. - Parameters
- sound ( - np.ndarray [size=(num_samples,)or- (num_samples,- num_features)],- str, or- pathlib.PosixPath) – Sound in raw samples, a power spectrum, or a short-time-fourier-transform. If type string or pathlib.PosixPath, expect pathway to audio file.
- sr ( - int) – The sample rate of sound.
- num_filters ( - int) – The number of mel-filters to use when extracting mel-filterbank energies.
- fmin ( - intor- float, optional) – The minimum frequency of interest. If None, will be set to 0. (default None)
- fmax ( - intor- float, optional) – The maximum frequency of interest. If None, will be set to half of sr. (default None)
- fft_bins ( - int, optional) – The number of frequency bins / fast Fourier transform bins used in calculating the fast Fourier transform. If None, set depending on type of parameter sound. If sound is a raw signal or audio pathway, fft_bins will be set to 1024; if sound is a STFT or power spectrum, fft_bins will be set to 2 * length of sound feature column, or 2 * sound.shape[1].
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_stft.
 
- Returns
- fbank – The mel-filterbank energies extracted. The number of samples depends on the parameters applied in - soundpy.feats.get_stft.
- Return type
- np.ndarray [shape=(num_samples,- num_filters)]
 - References - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
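A minimal usage sketch, passing raw samples so the STFT is computed internally (generated tone, illustrative only):

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 16000
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 300 * time)
>>> fbank = sp.feats.get_fbank(signal, sr=sr, num_filters=40)
>>> fbank.shape      # (num_frames, 40)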
- 
soundpy.feats.get_mfcc(sound, sr, num_mfcc, remove_first_coefficient=False, sinosoidal_liftering=False, **kwargs)[source]¶
- Extracts mel-frequency cepstral coefficients from audio. - Parameters
- sound ( - np.ndarray [size=(num_samples,)or- (num_samples,- num_features)]or- stror- pathlib.PosixPath) – If sound is a np.ndarray, expected as raw samples, a power spectrum or a short-time Fourier transform. If string or pathlib.PosixPath, should be the pathway to the audio file.
- sr ( - int) – The sample rate of the sound.
- num_mfcc ( - int) – The number of mel-frequency cepstral coefficients
- remove_first_coefficient ( - bool) – If True, the first coefficient, representing amplitude or volume of signal, is removed. Found to sometimes improve automatic speech recognition. (default False)
- sinosoidal_liftering ( - bool) – If True, reduces influence of higher coefficients, found to aid in handling noise in background in automatic speech recognition. (default False)
- **kwargs ( - additional keyword arguments) – Keyword arguments for soundpy.feats.get_fbank()
 
 - References - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
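A minimal usage sketch mirroring the one for get_fbank above (generated tone, illustrative only):

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 16000
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 300 * time)
>>> mfcc = sp.feats.get_mfcc(signal, sr=sr, num_mfcc=13)
>>> mfcc.shape      # (num_frames, 13)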
- 
soundpy.feats.get_vad_stft(sound, sr=48000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, zeropad=True, **kwargs)[source]¶
- Returns STFT matrix and VAD matrix. STFT matrix contains only VAD sections. - Parameters
- sound ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default 48000)
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- real_signal ( - bool) – If True, only half of the FFT spectrum will be used; since the FFT of a real signal is symmetrical, there should be no difference in content, and setting real_signal to True may make processing more efficient.
- fft_bins ( - int) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- use_beg_ms ( - int) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.
- extend_window_ms ( - int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.
- energy_thresh ( - int) – The threshold to set for measuring energy for VAD in the signal. (default 40)
- freq_thresh ( - int) – The threshold to set for measuring frequency for VAD in the signal. (default 185)
- sfm_thresh ( - int) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound
 
- Returns
- stft_matrix ( - np.ndarray [size=(num_frames_vad,- fft_bins//2+1),- dtype=np.complex_]) – The STFT matrix frames of where voice activity has been detected.
- vad_matrix_extwin ( - np.ndarray [size=(num_frames,)]) – A vector indicating, for each frame of the full STFT matrix, whether voice activity was detected.
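To sketch the intent, a tone embedded in near-silence stands in for speech below; whether frames are flagged as active depends on the thresholds above, so treat the outcome as illustrative:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> sr = 48000
>>> quiet = np.random.normal(0, 0.01, sr)         # low-level background noise
>>> time = np.arange(0, 1, 1/sr)
>>> tone = np.sin(2 * np.pi * 300 * time) + np.random.normal(0, 0.01, sr)
>>> sound = np.concatenate([quiet, tone, quiet])
>>> stft_vad, vad_matrix = sp.feats.get_vad_stft(sound, sr=sr, use_beg_ms=120)
>>> # stft_vad keeps only the frames marked active in vad_matrix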
 
 
- 
soundpy.feats.get_stft_clipped(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]¶
- Returns STFT matrix and VAD matrix with beginning and ending silence removed. - Parameters
- samples ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- extend_window_ms ( - int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound.
 
- Returns
- stft_speech ( - np.ndarray [size (num_frames_clipped, - fft_bins//2+1)]) – The STFT of the samples with beginning and ending silences clipped.
- vad_matrix ( - np.ndarray [size (num_frames, - )]) – A vector with zeros and ones indicating which frames of the full STFT contain voice activity.
 
 
- 
soundpy.feats.get_vad_samples(sound, sr=None, win_size_ms=50, percent_overlap=0.5, use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, window='hann', zeropad=True, **kwargs)[source]¶
- Returns samples and VAD matrix. Only samples where voice activity was detected are returned. - Parameters
- sound ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- use_beg_ms ( - int) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.
- extend_window_ms ( - int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.
- energy_thresh ( - int) – The threshold to set for measuring energy for VAD in the signal. (default 40)
- freq_thresh ( - int) – The threshold to set for measuring frequency for VAD in the signal. (default 185)
- sfm_thresh ( - int) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound
 
- Returns
- samples_matrix ( - np.ndarray [size = (num_samples_vad,- )]) – The samples of where voice activity was detected.
- vad_matrix_extwin ( - np.ndarray [size = (num_frames,- )]) – A vector of zeros and ones indicating the frames / windows of the samples that either had voice activity or not.
 
 
- 
soundpy.feats.get_samples_clipped(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]¶
- Returns samples and VAD matrix with beginning and ending silence removed. - Parameters
- samples ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- extend_window_ms ( - int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired. (default 0)
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound.
 
- Returns
- samples_clipped ( - np.ndarray [size (num_samples_clipped, - )]) – The samples with beginning and ending silences clipped.
- vad_matrix ( - np.ndarray [size (num_frames, - )]) – A vector with zeros and ones indicating which frames of the full STFT contain voice activity.
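A usage sketch along the same lines as get_vad_stft above; the tone stands in for speech:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> sr = 48000
>>> quiet = np.random.normal(0, 0.01, sr // 2)    # low-level background noise
>>> time = np.arange(0, 1, 1/sr)
>>> tone = np.sin(2 * np.pi * 300 * time) + np.random.normal(0, 0.01, sr)
>>> sound = np.concatenate([quiet, tone, quiet])
>>> clipped, vad_matrix = sp.feats.get_samples_clipped(sound, sr=sr)
>>> len(clipped) <= len(sound)       # leading / trailing silence removed
True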
 
 
- 
soundpy.feats.normalize(data, max_val=None, min_val=None)[source]¶
- Normalizes data to be between 0 and 1. Should not be applied to raw sample data. - This is useful if you have predetermined max and min values with which to normalize new data, for example when training models on sound features (not raw samples). - Parameters
- data ( - np.ndarray [size=(num_features,)or- (num_frames,num_features)]) – Data to be normalized.
- max_val ( - intor- float, optional) – Predetermined maximum value. If None, will use max value from data.
- min_val ( - intor- float, optional) – Predetermined minimum value. If None, will use min value from data.
 
- Returns
- normed_data 
- Return type
- np.ndarray [size = (num_features,)or- (num_frames,- num_features)]
- Examples

>>> # using the min and max of a previous dataset:
>>> import numpy as np
>>> np.random.seed(0)
>>> input_samples = np.random.random_sample((5,))
>>> input_samples
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ])
>>> np.random.seed(40)
>>> previous_samples = np.random.random_sample((5,))
>>> previous_samples
array([0.40768703, 0.05536604, 0.78853488, 0.28730518, 0.45035059])
>>> max_prev = np.max(previous_samples)
>>> min_prev = np.min(previous_samples)
>>> output_samples = normalize(input_samples, min_val = min_prev, max_val = max_prev)
>>> output_samples
array([0.67303388, 0.89996095, 0.74661839, 0.66767314, 0.50232462])
- 
soundpy.feats.plot_dom_freq(sound, energy_scale='power_to_db', title=None, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]¶
- Plots the approximate dominant frequency over a STFT plot of a signal. - If sound has multiple channels, the VAD for each channel is plotted in its own plot. - Parameters
- sound ( - np.ndarray [shape=(num_samples,)or- (num_samples,- num_channels)]) – The sound to plot the dominant frequency of.
- energy_scale ( - str) – The scale of energy for the plot. Features in the frequency domain are likely in a power scale and need to be converted to decibels. (default 'power_to_db')
- title ( - str) – The title for the plot. (default None)
- **kwargs ( - additional keyword arguments) – Keyword arguments used in both- soundpy.feats.get_stftand- soundpy.dsp.get_pitch.
 
- Returns
- Return type
 
- 
soundpy.feats.plot_vad(sound, energy_scale='power_to_db', title='Voice Activity', use_beg_ms=120, extend_window_ms=0, beg_end_clipped=True, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]¶
- Plots where voice (sound) activity detected on power spectrum. - This either plots immediately or saves the plot at name4pic. If sound has multiple channels, the VAD for each channel is plotted in its own plot. - Parameters
- sound ( - np.ndarray [shape=(num_samples,)or- (num_samples,- num_channels)]) – The sound to plot the VAD of.
- energy_scale ( - str) – If plotting STFT or power spectrum, will plot it in decibels. (default ‘power_to_db’)
- title ( - str) – The title of the plot (default ‘Voice Activity’)
- use_beg_ms ( - int) – The amount of noise to use at the beginning of the signal to measure VAD. This is only applied if beg_end_clipped is set to False.
- extend_window_ms ( - int) – The number of milliseconds VAD should be padded. This is useful if one wants to encompass more speech if the VAD is not including all the speech / desired sound. However, this may capture more noise. (default 0)
- beg_end_clipped ( - bool) – If True, just the silences at the beginning and end of the sample will be cut off. If False, VAD will be checked throughout the sample, not only at the beginning and end. NOTE: Both options have strengths and weaknesses. Checking VAD across the entire signal (i.e. beg_end_clipped set to False) is sometimes unreliable, failing to recognize speech in speech-filled samples; when set to True, some speech sounds tend to get ignored ('s', 'x' and other fricatives).
- save_pic ( - bool) – If True, the plot will be saved rather than plotted immediately.
- name4pic ( - str) – The full pathway and filename to save the picture (as .png file). A file extension is expected. (default None)
- overwrite ( - bool) – If False, a date tag will be added to name4pic if name4pic already exists. (default False)
- **kwargs ( - keyword arguments) – Additional keyword arguments for soundpy.feats.get_speech_stft or- soundpy.dsp.vad.
 
- Returns
- Return type
 
- 
soundpy.feats.get_change_acceleration_rate(spectro_data)[source]¶
- Gets first and second derivatives of spectral data. - This is useful particularly for speech recognition. - Parameters
- spectro_data ( - np.ndarray [shape = (num_samples,- num_features)]) –
- Returns
- delta ( - np.ndarray [shape = (num_samples,- num_features)]) – The first order derivative of spectral data. Reflects rate of change in signal.
- delta_delta ( - np.ndarray [shape = (num_samples,- num_features)]) – The second order derivative of spectral data. Reflects rate of acceleration in signal.
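As a sketch of what these derivatives look like in practice, computed here with librosa.feature.delta; whether this matches the internal computation exactly is an assumption:

>>> import numpy as np
>>> import librosa
>>> spectro_data = np.random.random((100, 40))    # (num_samples, num_features)
>>> delta = librosa.feature.delta(spectro_data, axis=0)                 # rate of change
>>> delta_delta = librosa.feature.delta(spectro_data, order=2, axis=0)  # acceleration
>>> delta.shape == delta_delta.shape == spectro_data.shape
True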
 
 
- 
soundpy.feats.get_mfcc_fbank(samples, feature_type='mfcc', sr=48000, win_size_ms=20, percent_overlap=0.5, num_filters=40, num_mfcc=40, fft_bins=None, window_function=None, zeropad=True, **kwargs)[source]¶
- Collects fbank or mfcc features via python-speech-features (rather than librosa). 
- 
soundpy.feats.zeropad_features(feats, desired_shape, complex_vals=False)[source]¶
- Applies zeropadding to a copy of feats. 
- 
soundpy.feats.reduce_num_features(feats, desired_shape)[source]¶
- Limits number features of a copy of feats. - This is useful if you want the features to be a certain size, for training models for example. 
- 
soundpy.feats.featshape_new_subframe(feature_matrix_shape, new_frame_size, zeropad=True, axis=0, include_dim_size_1=False)[source]¶
- Subdivides features from (num_frames, num_feats) to (new_frame_size, num_frames, num_feats) - Parameters
- feature_matrix_shape ( - tuple [size=(num_frames,- num_features)]) – Feature matrix shape to be subdivided. Can be multidimensional.
- new_frame_size ( - int) – The number of subframes to section axis into.
- zeropad ( - bool) – If True, frames that don’t completely fill a new_frame_size will be zeropadded. Otherwise, those frames will be discarded. (default True)
- axis ( - int) – The axis where the new_frame_size should be applied. (default 0)
 
- Returns
- new_shape 
- Return type
- tuple [size=(num_subframes,- new_frame_size,- num_feats)]
 
- 
soundpy.feats.apply_new_subframe(feature_matrix, new_frame_size, zeropad=True, axis=0)[source]¶
- Reshapes feature_matrix to allow for new_frame_size. - Note: Dimensions of feature_matrix must be at least 2 and can be up to 5, returning a matrix with one additional dimension. - Parameters
- feature_matrix ( - np.ndarray [size(num_frames,- num_features) ]) – Expects minimum 2D, maximum 5D matrix.
- new_frame_size ( - int) – The number of subframes to section axis into.
- axis ( - int) – The axis to apply the new_frame_size. (default 0)
- zeropad ( - bool) – If True, the feature_matrix will be zeropadded to include frames that do not fill entire frame_size, given the new_frame_size. If False, feature_matrix will not include the last zeropadded frame. (default True)
 
- Returns
- feats_reshaped – The feature_matrix with the chosen axis subdivided into two dimensions: the number of subframes and new_frame_size. 
- Return type
- np.ndarray [size(num_subframes,- new_frame_size,- num_features)]
- Raises
- ValueError if number of dimensions of feature_matrix is below 2 or exceeds 5. – 
- Examples

>>> import numpy as np
>>> matrix = np.arange(24).reshape(3,4,2)
>>> # apply new_frame_size to dimension of length 4 (i.e. axis 1)
>>> matrix_zp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1)
>>> matrix_zp.shape
(3, 2, 3, 2)
>>> matrix_zp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 0,  0],
         [ 0,  0]]],


       [[[ 8,  9],
         [10, 11],
         [12, 13]],

        [[14, 15],
         [ 0,  0],
         [ 0,  0]]],


       [[[16, 17],
         [18, 19],
         [20, 21]],

        [[22, 23],
         [ 0,  0],
         [ 0,  0]]]])
>>> matrix_nozp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1,
...                                  zeropad=False)
>>> matrix_nozp.shape
(3, 1, 3, 2)
>>> matrix_nozp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]]],


       [[[ 8,  9],
         [10, 11],
         [12, 13]]],


       [[[16, 17],
         [18, 19],
         [20, 21]]]])
 
- 
soundpy.feats.check_percent_overlap(percent_overlap)[source]¶
- Ensures percent_overlap is between 0 and 1. 
- 
soundpy.feats.separate_dependent_var(matrix)[source]¶
- Separates matrix into features and labels. Expects 3D array. - Assumes the last column of the last dimension of the matrix constitutes the dependent variable (labels), and all other columns the independent variables (features). Additionally, it is assumed that for each block of data, only one label is needed; therefore, just the first label is taken for each block. - Parameters
- matrix ( - numpy.ndarray [size = (num_samples,- num_frames,- num_features)]) – The matrix holds the numerical data to separate. num_features is expected to be at least 2.
- Returns
- X ( - numpy.ndarray [size = (num_samples,- num_frames,- num_features -1)]) – A matrix holding the (assumed) independent variables
- y ( - numpy.ndarray,- numpy.int64,- numpy.float64 [size = (num_samples,)]) – A vector holding the labels assigned to the independent variables. If only one value in array, just the value inside is returned
 
- Examples

>>> import numpy as np
>>> # vector
>>> separate_dependent_var(np.array([1,2,3,4]))
(array([1, 2, 3]), 4)
>>> # simple matrix
>>> matrix = np.arange(4).reshape(2,2)
>>> matrix
array([[0, 1],
       [2, 3]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[0],
       [2]])
>>> y
1
>>> # more complex matrix
>>> matrix = np.arange(20).reshape((2,2,5))
>>> matrix
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[[ 0,  1,  2,  3],
        [ 5,  6,  7,  8]],

       [[10, 11, 12, 13],
        [15, 16, 17, 18]]])
>>> y
array([ 4, 14])
- 
soundpy.feats.add_tensor(matrix)[source]¶
- Adds a tensor / dimension to the input ndarray (e.g. features). - Keras requires an extra dimension at some layers, which represents the 'tensor' encapsulating the data. - To clarify with the example below: the input matrix has shape (2,3,4). Think of it as 2 different events, each having 3 sets of measurements, with each of those having 4 features. For instance, measure differences between 2 cities at 3 different times of day (08:00, 14:00, and 19:00), recording 1) temperature, 2) wind speed, 3) light level, and 4) noise level. - Putting the measurements into a matrix with an added dimension/tensor highlights the separate measurements for the algorithm: these are 4 features from the same city, but they occur at different times. - Parameters
- matrix ( - numpy.ndarray) – The matrix holds the numerical data to add a dimension to.
- Returns
- matrix – The matrix with an additional dimension. 
- Return type
- numpy.ndarray
- Examples

>>> import numpy as np
>>> matrix = np.arange(24).reshape((2,3,4))
>>> matrix.shape
(2, 3, 4)
>>> matrix
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
>>> matrix_2 = add_tensor(matrix)
>>> matrix_2.shape
(2, 3, 4, 1)
>>> matrix_2
array([[[[ 0],
         [ 1],
         [ 2],
         [ 3]],

        [[ 4],
         [ 5],
         [ 6],
         [ 7]],

        [[ 8],
         [ 9],
         [10],
         [11]]],


       [[[12],
         [13],
         [14],
         [15]],

        [[16],
         [17],
         [18],
         [19]],

        [[20],
         [21],
         [22],
         [23]]]])
- 
soundpy.feats.scale_X_y(matrix, is_train=True, scalars=None)[source]¶
- Separates and scales data into X and y arrays. Adds dimension for keras. - Assumes the last column of the last dimension is the y or label data. - Parameters
- matrix ( - np.ndarray [size = (num_samples,- num_frames,- num_features)]) – Matrix with X and y data
- is_train ( - bool) – Relevant for the scalars parameter. If the data is training data (i.e. True), the scalars will be created. If the data is test data (i.e. False), the function expects scalars to be provided. (default True)
- scalars ( - dict, optional) – Dictionary with scalars to be applied to non-training data.
 
- Returns
- X ( - np.ndarray [size = (num_samples, - num_frames, - num_features-1, - 1)]) – Scaled features with extra dimension
- y ( - np.ndarray [size = (num_samples,- 1,- 1)]) – Scaled independent variable with extra dimension
- scalars ( - dict) – The scalars either created or previously loaded.
 
 
- 
soundpy.feats.get_feature_matrix_shape(sr=None, dur_sec=None, feature_type=None, win_size_ms=None, percent_overlap=None, fft_bins=None, num_mfcc=None, num_filters=None, rate_of_change=False, rate_of_acceleration=False, context_window=None, frames_per_sample=None, zeropad=True, labeled_data=False, remove_first_coefficient=False, real_signal=False, **kwargs)[source]¶
- Returns expected shapes of feature matrix depending on several parameters. - Parameters
- sr ( - int) – Sample rate of the audio to be extracted.
- dur_sec ( - int, - float) – The number of seconds of audio that feature extraction will be applied to.
- feature_type ( - str) – Accepted features include ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, ‘mfcc’. Which feature_type applied will influence the resulting shape of the feature matrix shape.
- win_size_ms ( - intor- float) – The size of the window the audio signal should be broken into. If feature_type is set to ‘signal’, this is irrelevant. Otherwise will raise TypeError if set to None.
- percent_overlap ( - float) – The amount of overlap between windows. If set to 0.5, the number of overlapping samples will be half the number of samples that make up win_size_ms.
- fft_bins ( - int) – The number of frequency bins to use when calculating the fast Fourier transform. If None, the calculated frame_length will be used.
- num_mfcc ( - int) – If extracting ‘mfcc’ features, the total number of coefficients expected.
- num_filters ( - int) – If extracting ‘fbank’ features, the total number of mel-filters to be applied.
- rate_of_change ( - bool) – If True, the first delta will be concatenated to features extracted.
- rate_of_acceleration ( - bool) – If True, the second delta will be concatenated to features extracted.
- context_window ( - int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts.
- frames_per_sample ( - int) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions; this functionality is available in generator functions as context_window. frames_per_sample equals 2 * context_window + 1. See - soundpy.models.dataprep.Generator
- zeropad ( - bool) – If True, windows and frames will be zeropadded to avoid losing any sample data.
- labeled_data ( - bool) – If True, a label will be added to the output shape of features.
- remove_first_coefficient ( - bool) – If True, the first mfcc coefficient will not be included in feature matrix.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_feats. These may not be used in this function as they may not influence the size of the feature matrix.
 
- Returns
 
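A rough sketch of a call; the exact structure of the returned shape information should be treated as an assumption:

>>> import soundpy as sp
>>> shape_info = sp.feats.get_feature_matrix_shape(sr=22050, dur_sec=1,
...     feature_type='fbank', win_size_ms=20, percent_overlap=0.5,
...     num_filters=40)
>>> # with a 20 ms window at 50% overlap over 1 second of audio, expect
>>> # on the order of 100 frames by 40 filters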
- 
soundpy.feats.visualize_feat_extraction(feats, iteration=None, dataset=None, label=None, datadir=None, subsections=False, **kwargs)[source]¶
- Saves plots of features during feature extraction or training of models. - Parameters
- feats ( - np.ndarray [shape=(num_samples,)or- (num_samples,- num_frames)or (num_frames,- num_features)or- (num_subsections,- num_frames,- num_features)]) – The extracted features can be raw signal data, stft, fbank, powspec, mfcc data, either as a single plot or subsectioned into batches / subframes.
- iteration ( - int, optional) – The iteration of the audio getting extracted; e.g. the 10th training item.
- dataset ( - str, optional) – The identifying string (for example ‘train’ , ‘val’, or ‘test’, but this can be anything).
- label ( - str,- int, optional) – The label of the audio file. Used in titles and filenames.
- datadir ( - str,- pathlib.PosixPath, optional) – The directory where related data is located. An ‘image’ directory will be created within this datadir where the saved plots will be stored. If None, will be created in current working directory.
- subsections ( - bool, optional) – To subsection raw ‘signal’ data into frames. For other features, this is easier to identify via the shape of- feats.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_feats
 
- Returns
- Return type
 
- 
soundpy.feats.save_features_datasets(datasets_dict, datasets_path2save_dict, context_window=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, log_settings=True, decode_dict=None, random_seed=None, **kwargs)[source]¶
- Extracts and saves audio features, sectioned into datasets, to indicated locations. - If MemoryError, the provided dataset dicts will be adjusted to allow data to be subsectioned. - Parameters
- datasets_dict ( - dict) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {'train':['1.wav', '2.wav', '3.wav'], 'val': ['4.wav'], 'test':['5.wav']} for unlabeled data or {'train':[(0, '1.wav'), (1, '2.wav'), (0, '3.wav')], 'val': [(1, '4.wav')], 'test':[(0, '5.wav')]} for labeled data.
- datasets_path2save_dict ( - dict) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}
- context_window ( - int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts, to feed to a long short-term memory network (LSTM), for example. (Can avoid this by simply reshaping data later).
- frames_per_sample ( - int) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions. This equals 2 * context_window + 1
- labeled_data ( - bool) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.
- subsection_data ( - bool) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.
- divide_factor ( - int, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)
- visualize ( - bool) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)
- vis_every_n_frames ( - int) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)
- log_settings ( - bool) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)
- decode_dict ( - dict, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_feats.
 
- Returns
 - See also - soundpy.feats.get_feats
- Extract features from audio file or audio data. 
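A sketch of the expected plumbing, reusing the hypothetical file names from the datasets_dict example above; feature settings pass through **kwargs to soundpy.feats.get_feats:

>>> import soundpy as sp
>>> datasets_dict = {'train': ['1.wav', '2.wav', '3.wav'],
...                  'val': ['4.wav'],
...                  'test': ['5.wav']}
>>> datasets_path2save_dict = {'train': './data/train.npy',
...                            'val': './data/val.npy',
...                            'test': './data/test.npy'}
>>> sp.feats.save_features_datasets(datasets_dict, datasets_path2save_dict,
...     feature_type='fbank', sr=22050, dur_sec=1, win_size_ms=20,
...     percent_overlap=0.5)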
 
- 
soundpy.feats.save_features_datasets_zipfiles(datasets_dict, datasets_path2save_dict, extract_dir, dur_sec, feature_type='fbank', num_feats=None, sr=22050, win_size_ms=20, percent_overlap=0.5, n_fft=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, use_librosa=True, center=True, mode='reflect', log_settings=True, decode_dict=None, audiofile_lim=10, **kwargs)[source]¶
- Extracts and saves audio features, sectioned into datasets, to indicated locations. - If MemoryError, the provided dataset dicts will be adjusted to allow data to be subsectioned. - Parameters
- datasets_dict ( - dict) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {'train':['1.wav', '2.wav', '3.wav'], 'val': ['4.wav'], 'test':['5.wav']} for unlabeled data or {'train':[(0, '1.wav'), (1, '2.wav'), (0, '3.wav')], 'val': [(1, '4.wav')], 'test':[(0, '5.wav')]} for labeled data.
- datasets_path2save_dict ( - dict) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}
- feature_type ( - str) – String including only one of the following: 'signal', 'stft', 'powspec', 'fbank', or 'mfcc'; e.g. 'mfcc_noisy' or 'stft_train'. 'signal' currently only supports mono channel data (TODO: test for stereo). 'powspec' and 'stft' are basically the same; 'powspec' is the 'stft' without complex values and squared.
- sr ( - int) – The sample rate the audio data should be loaded with.
- n_fft ( - int) – The number of frequency bins used for the Fast Fourier Transform (fft)
- dur_sec ( - intor- float) – The desired duration of the audio data. This is used to calculate the size of the feature data and is therefore necessary, as audiofiles tend to differ in length. If audiofiles are longer or shorter, they will be cut or zeropadded respectively.
- num_feats ( - int) – The number of mfcc coefficients (mfcc), mel filters (fbank), or frequency bins (stft).
- win_size_ms ( - int) – The desired window size in milliseconds to process audio samples.
- percent_overlap ( - float) – The amount audio samples should overlap as each window is processed.
- frames_per_sample ( - int, optional) – If you want to section each audio file's feature data into smaller frames. This might be useful for speech-related contexts. (This can be avoided by simply reshaping the data later.)
- labeled_data ( - bool) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.
- subsection_data ( - bool) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.
- divide_factor ( - int, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)
- visualize ( - bool) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)
- vis_every_n_frames ( - int) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)
- use_librosa ( - bool) – If True, librosa is used to load and extract features. As of now, no other option is available (TODO: add other options); some elements of this function are specific to librosa. (default True)
- center ( - bool) – Relevant for librosa and feature extraction. (default True)
- mode ( - str) – Relevant for librosa and feature extraction. (default ‘reflect’)
- log_settings ( - bool) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)
- decode_dict ( - dict, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_feats.
 
- Returns
 - See also - soundpy.feats.get_feats
- Extract features from audio file or audio data. 
 
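For example, a minimal sketch (the audiofiles and paths below are placeholders, following the datasets_dict example above):
>>> import soundpy
>>> datasets_dict = {'train': ['1.wav', '2.wav'], 'val': ['3.wav'], 'test': ['4.wav']}
>>> datasets_path2save_dict = {'train': './data/train.npy', 'val': './data/val.npy', 'test': './data/test.npy'}
>>> soundpy.feats.save_features_datasets_zipfiles(datasets_dict, datasets_path2save_dict, extract_dir='./data/extracted/', dur_sec=1, feature_type='fbank', num_feats=40, sr=22050)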
- 
soundpy.feats.prep_new_audiofeats(feats, desired_shape, input_shape)[source]¶
- Prepares new audio data to feed to a pre-trained model. - Parameters
- feats ( - np.ndarray [shape = (num_frames,- num_features)]) – The features to prepare for feeding to a model.
- desired_shape ( - tuple) – The expected number of samples necessary to fulfill the expected input_shape for the model. The- featswill be zeropadded or limited to match this desired_shape.
- input_shape ( - tuple) – The input_shape the model expects a single sample of data to be.
 
- Returns
- feats_reshaped – The features reshaped to what the model expects. 
- Return type
- np.ndarray [shape = (input_shape)]
 
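For example, a minimal sketch (the shapes are illustrative, chosen for a hypothetical model):
>>> import numpy as np
>>> import soundpy
>>> # hypothetical feature matrix: 50 frames x 40 features
>>> feats = np.random.rand(50, 40)
>>> # zeropad / limit the features to fit a model expecting (79, 40, 1)
>>> feats_ready = soundpy.feats.prep_new_audiofeats(feats, desired_shape=(79, 40), input_shape=(79, 40, 1))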
- 
soundpy.feats.feats2audio(feats, feature_type, sr, win_size_ms, percent_overlap, phase=None)[source]¶
- Prepares features into audio playable format. - Parameters
- feats ( - np.ndarray [shape = (num_frames,- num_feats)]) – If the features are a signal, [size = (batch_size * num_frames * num_features, 1)]. Otherwise [size = (batch_size * num_frames, num_features)].
- feature_type ( - str) – Either ‘stft’, ‘fbank’, ‘signal’, or ‘mfcc’. For the ‘signal’ feature, only mono channel is supported.
- sr ( - int) – Sampling rate that the features were extracted with
- win_size_ms ( - int) – The window size in milliseconds the features were extracted with
- percent_overlap ( - float) – The percent overlap between windows.
- phase ( - np.ndarray [shape = (num_frames, - num_feats)], optional) – The phase information of the original signal, used for reconstruction.
 
- Returns
- y – The reconstructed signal in samples. 
- Return type
- np.ndarray [shape = (num_samples,- )]
 
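For example, a sketch using stand-in data simply to show the call (in practice feats and phase would come from feature extraction, e.g. soundpy.feats.get_feats):
>>> import numpy as np
>>> import soundpy
>>> # stand-in complex STFT matrix: 20 frames x 221 frequency bins
>>> feats = np.random.rand(20, 221) + 1j * np.random.rand(20, 221)
>>> y = soundpy.feats.feats2audio(feats, feature_type='stft', sr=22050, win_size_ms=20, percent_overlap=0.5)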
Template deep neural networks¶
The models.template_models module contains functions for building (ideally research-based) models.
- 
soundpy.models.template_models.adjust_layers_cnn(**kwargs)[source]¶
- Reduces the number of CNN layers until the model can be built. - If the number of filters for ‘mfcc’ or ‘fbank’ features is in the lower range (e.g. 13 or so), the default settings of the CNN architecture cause issues: the architecture was built assuming at least 40 filters are applied during feature extraction. To deal with this problem, the number of CNN layers is reduced. - Parameters
- **kwargs ( - Keyword arguments) – Keyword arguments for soundpy.models.template_models.cnn_classifier
- Returns
- settings – Updated dictionary with relevant settings for model. 
- Return type
- dict
 - References 
- 
soundpy.models.template_models.cnn_classifier(feature_maps=[40, 20, 10], kernel_size=[(3, 3), (3, 3), (3, 3)], strides=2, activation_layer='relu', activation_output='softmax', input_shape=(79, 40, 1), num_labels=3, dense_hidden_units=100, dropout=0.25)[source]¶
- Build a single or multilayer convolutional neural network. - Parameters
- feature_maps ( - intor- list) – The filter or feature map applied to the data. One feature map per convolutional neural layer required. For example, a list of length 3 will result in a three-layer convolutional neural network.
- kernel_size ( - tuple or - list of - tuples) – Must match the number of feature_maps: the kernel size for each corresponding convolutional layer.
- strides ( - int) –
- activation_layer ( - str) – (default ‘relu’)
- activation_output ( - str) – (default ‘softmax’)
- input_shape ( - tuple) – The shape of the input
- dense_hidden_units ( - int, optional) –
- dropout ( - float, optional) – The dropout rate applied to reduce overfitting. (default 0.25)
 
- Returns
- model ( - tf.keras.Model) – Model ready to be compiled.
- settings ( - dict) – Dictionary with relevant settings for model.
 
 - Warning - If the number of features is not compatible with the number of layers, a warning is raised and the layers are adjusted. E.g. for a lower number of MFCC features this will likely be applied if the number of layers is greater than 1. - References - A. Sehgal and N. Kehtarnavaz, “A Convolutional Neural Network Smartphone App for Real-Time Voice Activity Detection,” in IEEE Access, vol. 6, pp. 9017-9026, 2018. 
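For example, building and compiling the default three-layer classifier (the optimizer and loss below are common choices, not prescribed by soundpy):
>>> import soundpy
>>> model, settings = soundpy.models.template_models.cnn_classifier(input_shape=(79, 40, 1), num_labels=3)
>>> # the model is returned uncompiled
>>> model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])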
- 
soundpy.models.template_models.autoencoder_denoise(input_shape, kernel_size=(3, 3), max_norm_value=2.0, activation_function_layer='relu', activation_function_output='sigmoid', padding='same', kernel_initializer='he_uniform')[source]¶
- Build a simple autoencoder denoiser. - Parameters
- Returns
- autoencoder – Model ready to be compiled 
- Return type
- tf.keras.Model
 - References - Versloot, Christian (2019, December 19). Creating a Signal Noise Removal Autoencoder with Keras. MachineCurve. https://www.machinecurve.com 
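For example (the input shape is illustrative; mean squared error is a typical loss for denoising, not prescribed by soundpy):
>>> import soundpy
>>> denoiser = soundpy.models.template_models.autoencoder_denoise(input_shape=(10, 11, 1))
>>> denoiser.compile(optimizer='adam', loss='mse')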
- 
soundpy.models.template_models.resnet50_classifier(input_shape, num_labels, activation='softmax', final_layer_name='features')[source]¶
- Simple image classifier built on top of a pretrained ResNet50 model. - References - Revay, S. & Teschke, M. (2019). Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals. arXiv:1905.04348 [cs.SD] 
- 
soundpy.models.template_models.cnnlstm_classifier(num_labels, input_shape, lstm_cells, feature_map_filters=32, kernel_size=(8, 4), pool_size=(3, 3), dense_hidden_units=60, activation_layer='relu', activation_output='softmax', dropout=0.25)[source]¶
- Model architecture inspired by the paper below. - References - Kim, Myungjong & Cao, Beiming & An, Kwanghoon & Wang, Jun. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. 10.21437/interspeech.2018-2250. 
Additional model setup (e.g. Early Stopping)¶
The models.modelsetup module contains functionality for preparing to train a model.
- 
soundpy.models.modelsetup.setup_layers(num_features, num_layers, kernel_shape=(3, 3), max_feature_map=64)[source]¶
- Sets up feature_maps and kernels for 1 or more layered convolutional neural networks. - Parameters
- num_features ( - int) – The number of features used to train the model. This will be used to set the number of feature_maps for each layer.
- num_layers ( - int) – The number of layers desired
- kernel_shape ( - tupleor- int) – The shape of the desired kernel
- max_feature_map ( - int) – The maximum size of feature map / filter. This depends on the system and is relevant for processing higher definition features, such as STFT features. If this is set too large given memory constraints, training may be ‘killed’.
 
- Returns
 - Warning - Raised if num_features is larger than max_feature_map: num_features is usually used to set the first feature map but, if too large, will be reduced to below max_feature_map. 
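A sketch of how this might pair with cnn_classifier; note the two return values are an assumption here (feature maps and kernel sizes per layer):
>>> import soundpy
>>> feature_maps, kernel_size = soundpy.models.modelsetup.setup_layers(num_features=40, num_layers=3, kernel_shape=(3, 3), max_feature_map=64)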
- 
soundpy.models.modelsetup.setup_callbacks(early_stop=True, patience=15, log=True, log_filename=None, append=True, save_bestmodel=True, best_modelname=None, monitor='val_loss', verbose=1, save_best_only=True, mode='min', tensorboard=True, write_images=False, x_test=None, y_test=None, batch_size=None, embedded_layer_name=None)[source]¶
- Easy set up of early stopping, model logging, and saving best model. - Parameters
- early_stop ( - bool) – Whether or not training should stop early if the model is not improving (default True)
- patience ( - int) – The number of epochs the model should complete without improvement before stopping training. (default 15)
- log ( - bool) – If true, the accuracy, loss, and (if possible) the val_accuracy and val_loss for each epoch will be saved in a .csv file. (default True)
- log_filename ( - str or - pathlib.PosixPath, optional) – The filename where the logging information will be stored. If None, the date will be used as a unique .csv filename in a subfolder ‘model_logs’ in the local directory.
- save_bestmodel ( - bool) – If True, the best performing model will be saved.
- best_modelname ( - stror- pathlib.PosixPath) – The name to save the best model version under. If None, the date will be used to create a unique .h5 filename and it will be saved in a subfolder ‘best_models’ in the local directory.
- monitor ( - str) – The metric to be used to measure model performance. (default ‘val_loss’)
- verbose ( - bool) – If True, the state of the model will be printed. (default True)
- save_best_only ( - bool) – If True, the best performing model will overwrite any previously saved ‘best model’.
- mode ( - str) – If monitor is set to ‘val_loss’, this should be set to ‘min’. If monitor is set to ‘val_acc’, this should be set to ‘max’. If mode is set to ‘auto’, the direction will be inferred. (default ‘min’)
- tensorboard ( - bool) – If True, logs for TensorBoard will be made.
 
- Returns
- callbacks – The callbacks ready to be applied to Keras model training. 
- Return type
- list of tf.keras.callbacks.Callback objects
 
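For example (tensorboard is switched off here to keep the sketch minimal):
>>> import soundpy
>>> callbacks = soundpy.models.modelsetup.setup_callbacks(early_stop=True, patience=15, monitor='val_loss', mode='min', tensorboard=False)
>>> # the callbacks can then be passed to Keras training, e.g.:
>>> # model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=callbacks)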
Feeding large datasets to models¶
The models.dataprep module covers functionality for feeding features to models.
- 
class soundpy.models.dataprep.Generator(data_matrix1, data_matrix2=None, timestep=None, axis_timestep=0, normalize=True, apply_log=False, context_window=None, axis_context_window=-2, labeled_data=False, gray2color=False, zeropad=True, desired_input_shape=None, combine_axes_0_1=False)[source]¶
- Bases: - object- Methods - Shapes, normalizes, and feeds data depending on labeled or non-labeled data. - 
__init__(data_matrix1, data_matrix2=None, timestep=None, axis_timestep=0, normalize=True, apply_log=False, context_window=None, axis_context_window=-2, labeled_data=False, gray2color=False, zeropad=True, desired_input_shape=None, combine_axes_0_1=False)[source]¶
- This generator pulls data out in sections (i.e. batch sizes). Prepared for 3-dimensional data. - Note: Keras adds a dimension to the input to represent the “Tensor” that handles the input. This means that sometimes you have to add a shape of (1,) to the shape of the data. - Parameters
- data_matrix1 ( - np.ndarray [size=(num_samples,- batch_size,- num_frames,- num_features)or- (num_samples,- num_frames,- num_features+label_column)]) – The training data. This can contain the feature and label data or just the input feature data.
- data_matrix2 ( - np.ndarray [size = (num_samples, ) or data_matrix1.shape], optional) – Either label data for data_matrix1 or, for example, the clean version of data_matrix1 if training an autoencoder. (default None)
- normalize ( - bool) – If False, the data has already been normalized and won’t be normalized by the generator. (default True)
- apply_log ( - bool) – If True, log will be applied to the data.
- timestep ( - int) – The number of frames to constitute a timestep.
- axis_timestep ( - int) – The axis to apply the timestep to. (default 0)
- context_window ( - int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts.
- axis_context_window ( - int) – The axis to apply the context_window to, if context_window is not None. Ideally this should be the axis preceding the feature column. (default -2)
- zeropad ( - bool) – Whether or not features should be zeropadded in reshaping functions. (default True)
- desired_input_shape ( - intor- tuple, optional) – The desired number of features or shape of data to feed a neural network. If type int, only the last column of features will be adjusted (zeropadded or limited). If tuple, the entire data shape will be adjusted (all columns). If the int or shape is larger than that of the data provided, data will be zeropadded. If the int or shape is smaller, the data will be restricted. (default None)
 
 
 
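An illustrative sketch with random labeled data (the batch-yielding method is assumed here to be Generator.generator()):
>>> import numpy as np
>>> import soundpy
>>> # 10 samples, each 1 batch of 79 frames x 40 features, plus labels
>>> x_train = np.random.rand(10, 1, 79, 40)
>>> y_train = np.random.randint(0, 3, size=(10, 1))
>>> train_gen = soundpy.models.dataprep.Generator(data_matrix1=x_train, data_matrix2=y_train, labeled_data=True, desired_input_shape=(79, 40, 1))
>>> # assumed batch-yielding method:
>>> # model.fit(train_gen.generator(), steps_per_epoch=len(x_train))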
- 
class soundpy.models.dataprep.GeneratorFeatExtraction(datalist, datalist2=None, model_name=None, normalize=True, apply_log=False, randomize=True, random_seed=None, desired_input_shape=None, timestep=None, axis_timestep=0, context_window=None, axis_context_window=-2, batch_size=1, gray2color=False, visualize=False, vis_every_n_items=50, visuals_dir=None, decode_dict=None, dataset='train', augment_dict=None, label_silence=False, vad_start_end=False, **kwargs)[source]¶
- Bases: - soundpy.models.dataprep.Generator- Methods - Extracts features and feeds them to model according to desired_input_shape. 
- 
soundpy.models.dataprep.randomize_augs(aug_dict, random_seed=None)[source]¶
- Creates a copy of the dict and randomly chooses which augmentations to apply. - A random seed can be applied both for the number of augmentations applied and for shuffling the order of possible augmentations. 
- 
soundpy.models.dataprep.augment_features(sound, sr, add_white_noise=False, snr=[5, 10, 20], speed_increase=False, speed_decrease=False, speed_perc=0.15, time_shift=False, shufflesound=False, num_subsections=3, harmonic_distortion=False, pitch_increase=False, pitch_decrease=False, num_semitones=2, vtlp=False, bilinear_warp=True, augment_settings_dict=None, random_seed=None)[source]¶
- Randomly applies augmentations to audio. If no augment_settings_dict is provided, defaults are applied. 
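An illustrative sketch (the filename is a placeholder, and the return value is assumed here to be the augmented samples):
>>> import soundpy
>>> y, sr = soundpy.loadsound('speech.wav', sr=16000)
>>> y_aug = soundpy.models.dataprep.augment_features(y, sr, add_white_noise=True, snr=[5, 10, 20], pitch_increase=True, num_semitones=2, random_seed=40)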
- 
soundpy.models.dataprep.get_input_shape(kwargs_get_feats, labeled_data=False, frames_per_sample=None, use_librosa=True, mode='reflect')[source]¶
- 
soundpy.models.dataprep.make_gen_callable(_gen)[source]¶
- Prepares a Python generator for tf.data.Dataset.from_generator. - This works around a bug where Python generators fail to work in TensorFlow 2.2.0+. - Parameters
- _gen ( - generator) – The generator function to feed to a deep neural network.
- Returns
- x ( - np.ndarray [shape=(batch_size,- num_frames,- num_features,- 1)]) – The feature data
- y ( - np.ndarray [shape=(1,1)]) – The label for the feature data.
 
 - References - Shu, Nicolas (2020) https://stackoverflow.com/a/62186572 CC BY-SA 4.0 
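A sketch of the intended use with tf.data (the toy generator and its output shapes and types are illustrative):
>>> import numpy as np
>>> import tensorflow as tf
>>> import soundpy
>>> def toy_gen():
...     for _ in range(10):
...         yield (np.random.rand(1, 79, 40, 1).astype(np.float32),
...            np.random.randint(0, 3, size=(1, 1)))
>>> gen_callable = soundpy.models.dataprep.make_gen_callable(toy_gen())
>>> dataset = tf.data.Dataset.from_generator(gen_callable, output_types=(tf.float32, tf.int64))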
Other useful non-specific functionality¶
Utils module covers functions that are useful for soundpy but are not directly related to sound data.
- 
soundpy.utils.path_or_samples(input_value)[source]¶
- Checks whether input_value is a path or sample data. Does not check path validity. - This is useful for functions that take both pathways to audio as well as pre-loaded audio data. - Parameters
- input_value ( - str,- pathlib.PosixPath, or- tuple [size= ( (samples,),- sr)]or- np.ndarray [size = (samples,- )]) –
- Returns
- ‘path’ or ‘samples’ 
- Return type
- str
 - Examples
>>> import numpy as np
>>> # create some example samples and sample rate
>>> samples = np.array([1,2,3,2,1,0])
>>> sr = 5
>>> path_or_samples( (samples, sr) )
'samples'
>>> # expects both audio samples and sr
>>> path_or_samples(samples)
TypeError: The input for `path_or_samples` expected a str, pathlib.PosixPath, or tuple with samples and sample rate, not type <class 'numpy.ndarray'>
>>> # create example string pathway
>>> path_or_samples('my_audio.wav')
'path'
>>> # create pathlib.PosixPath object
>>> import pathlib
>>> path_or_samples(pathlib.Path('my_audio.wav'))
'path' 
- 
soundpy.utils.get_default_args(func)[source]¶
- Returns a dictionary of the default keyword arguments of a function. - References - stackoverflow answer by mgilson: link: https://stackoverflow.com/a/12627202 license: https://creativecommons.org/licenses/by-sa/3.0/ 
- 
soundpy.utils.match_dtype(array1, array2)[source]¶
- Match the dtype of the second array to the first. - Parameters
- array1 ( - np.ndarray) – The numpy array whose dtype is to be adjusted; this array is returned.
- array2 ( - np.ndarray) – The numpy array with the original or desired dtype.
 
- Returns
- array1 – The array1 with the dtype of array2 
- Return type
- np.ndarray
 
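For example, casting a float array to the dtype of an int array:
>>> import numpy as np
>>> import soundpy
>>> a = np.array([1., 2., 3.])
>>> b = np.array([1, 2, 3], dtype=np.int16)
>>> soundpy.utils.match_dtype(a, b).dtype
dtype('int16')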
- 
soundpy.utils.get_date()[source]¶
- Get a string containing month, day, hour, minute, second and millisecond. - This is useful for creating a unique filename. - Parameters
- None – 
- Returns
- time_str – A string containing the date and time. 
- Return type
- str
 - Examples - >>> date = get_date() >>> date '6m18d1h16m32s295ms' 
- 
soundpy.utils.check_dir(directory, make=True, append=True)[source]¶
- Checks if directory exists and creates it if indicated. - Parameters
- directory ( - stror- pathlib.PosixPath) – The directory of interest
- make ( - bool) – Whether or not the directory should be created or just checked to ensure it exists. (default True)
- append ( - bool) – If True and a directory with the same name already exists, new items will be saved into the existing directory. Otherwise, an error will be raised. (default True)
 
- Returns
- directory – If a directory could be created or confirmed to exist, the directory path will be returned. Otherwise an error will be raised. 
- Return type
 
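For example, ensuring a logging directory exists before saving files into it:
>>> import soundpy
>>> # creates './model_logs/' if it does not yet exist
>>> log_dir = soundpy.utils.check_dir('./model_logs/', make=True)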
- 
soundpy.utils.create_nested_dirs(directory)[source]¶
- Creates directory even if several parent directories don’t exist. - Parameters
- directory ( - str,- pathlib.PosixPath) – The directory to be created.
- Returns
- directory – If successful, the directory path that has been created. 
- Return type
- pathlib.PosixPath
 - Examples
>>> # First an unsuccessful creation of a nested directory
>>> import os
>>> new_dir = './testdir/testdir/testdir/'
>>> os.mkdir(new_dir)
FileNotFoundError: [Errno 2] No such file or directory: './testdir/testdir/testdir/'
>>> # try again with create_nested_dirs()
>>> directory = create_nested_dirs(new_dir)
>>> directory
PosixPath('testdir/testdir/testdir') 
- 
soundpy.utils.string2pathlib(pathway_string)[source]¶
- Turns a string path into a pathlib.PosixPath object. - This is useful when working with pathways from varying operating systems: Windows, Linux, and Mac organize pathways differently, and pathlib turns strings from these different systems into a pathlib object the software can work with regardless of the operating system. - Parameters
- pathway_string ( - stror- pathlib.PosixPath) – The pathway to be turned into a pathlib object, if need be.
- Returns
- pathway_string – The pathway as a pathlib object. 
- Return type
- pathlib.PosixPath
 - Examples - >>> pathway = 'folder/way2go.txt' >>> pathlib_pathway = string2pathlib(pathway) >>> pathlib_pathway PosixPath('folder/way2go.txt') 
- 
soundpy.utils.restore_dictvalue(value_string)[source]¶
- Takes a dict value and converts it back to its original type. - When loading a dictionary from a .csv file, the values are strings. This function handles integers, floats, tuples, and some strings. It has also been suited to handle a list of audiofiles or a list of pathlib.PosixPath objects. - Warning: no extensive testing has been completed for this function. It might not handle all value types as expected. - Parameters
- value_string ( - str) – The dictionary value that was converted into a string object.
- Returns
- value_original_type – The value converted back to its original type. 
- Return type
 - Raises
- ValueError – If passed a nested list of pathlib.PosixPath objects.
 - Examples
>>> input_string = "[PosixPath('data/audio/vacuum/vacuum1.wav')]"
>>> type(input_string)
<class 'str'>
>>> typelist = restore_dictvalue(input_string)
>>> typelist
[PosixPath('data/audio/vacuum/vacuum1.wav')]
>>> type(typelist)
<class 'list'>
>>> # Get type of the object
>>> type(typelist[0])
pathlib.PosixPath
>>> # Example with a list of tuples, i.e. label and audio file pairs:
>>> input_string = "[(2, PosixPath('data/audio/vacuum/vacuum1.wav')), (1, PosixPath('data/audio/vacuum/vacuum2.wav'))]"
>>> labelaudio_pairs = restore_dictvalue(input_string)
>>> labelaudio_pairs
[(2, PosixPath('data/audio/vacuum/vacuum1.wav')), (1, PosixPath('data/audio/vacuum/vacuum2.wav'))]
>>> type(labelaudio_pairs)
list
>>> type(labelaudio_pairs[0])
tuple
>>> type(labelaudio_pairs[0][0])
int
>>> type(labelaudio_pairs[0][1])
pathlib.PosixPath 
- 
soundpy.utils.adjust_time_units(time_sec)[source]¶
- Turns seconds into relevant time units. - This is useful if measuring time of a process and that process takes longer than a couple minutes. - Examples - >>> adjust_time_units(5) (5, 'seconds') >>> adjust_time_units(500) (8.333333333333334, 'minutes') >>> adjust_time_units(5000) (1.3888888888888888, 'hours') 
- 
soundpy.utils.print_progress(iteration, total_iterations, task=None)[source]¶
- Prints the status of a process based on iteration number. - Assumes the iteration starts at 0 rather than 1. - Parameters
- Returns
- Return type
- None; progress is written to sys.stdout and flushed. 
 - Examples - >>> print_progress(4, 10) 50% through current task >>> print_progress(4, 10, task = 'testing') 50% through testing 
- 
soundpy.utils.check_extraction_variables(sr=None, feature_type=None, win_size_ms=None, percent_overlap=None)[source]¶
- Checks to ensure extraction variables are compatible. - Parameters
- Returns
- Return type
- Raises
- ValueError – If any of the Parameters aren’t compatible. 
 - Examples
>>> check_extraction_variables(sr=48000, feature_type='signal', win_size_ms=25, percent_overlap=0.5)
>>> check_extraction_variables(sr='48000', feature_type='sig', win_size_ms='25', percent_overlap='0.5')
ValueError: Sampling rate (sr) must be of type int, not 48000 of type <class 'str'>. 
- 
soundpy.utils.check_noisy_clean_match(noisyfilename, cleanfilename)[source]¶
- Checks if the clean filename is inside of the noisy filename. - This may be helpful to check that two audiofile datasets (a noisy and clean dataset) are aligned. 
- 
soundpy.utils.audiofile_length_match(filename1, filename2)[source]¶
- Checks that two audiofiles have the same length. - This may be useful if you have clean and noisy audiofiles that should be the same length. - Parameters
- filename1 ( - stror- pathlib.PosixPath) – The path to first audio file.
- filename2 ( - stror- pathlib.PosixPath) – The path to second audio file.
 
- Returns
- True if the audiofile lengths match, False if not. 
- Return type
- bool
 - Warning - UserWarning
- If the sample rates of the audio files don’t match. 
- UserWarning
- If the lengths of the files don’t match.