SoundPy Functionality v0.1.0a3
Built-In Functionality (non-Deep Learning)
The soundpy.builtin module includes higher-level functions that combine several other functions to complete fairly complex tasks, such as dataset formatting, filtering signals, and extracting features for neural networks.
soundpy.builtin.filtersignal(audiofile, sr=None, noise_file=None, filter_type='wiener', filter_scale=1, apply_postfilter=False, duration_noise_ms=120, real_signal=False, phase_radians=True, num_bands=None, visualize=False, visualize_every_n_windows=50, max_vol=0.4, min_vol=0.15, save2wav=False, output_filename=None, overwrite=False, use_scipy=False, remove_dc=True, control_vol=False, **kwargs)

Applies a Wiener or band spectral subtraction filter to a signal using noise. The noise can be provided as a separate file or as samples, or it can be taken from the beginning of the provided audio. How much noise is measured can be set with the parameter duration_noise_ms.

Parameters
- audiofile (str or np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – Filename or the audio data of the signal to be filtered.
- sr (int) – The sample rate of the audio. If audiofile is of type np.ndarray, sr is required. (default None)
- noise_file (str or tuple, optional) – Path to either a noise audiofile or a .npy file containing average power spectrum values. If tuple, must include samples and sr. If None, the beginning of the audiofile will be used for noise data. (default None)
- filter_type (str) – Type of filter to apply. Options: 'wiener' or 'band_specsub'.
- filter_scale (int or float) – The scale at which the filter should be applied. This value is multiplied with the noise levels, thereby increasing or decreasing the filter strength. (default 1)
- apply_postfilter (bool) – Whether or not the post filter should be applied. The post filter reduces musical noise (i.e. distortion) produced in the signal as a byproduct of filtering.
- duration_noise_ms (int or float) – The amount of time in milliseconds of noise to apply Welch's method to; in other words, how much of the noise to use when approximating the average noise power spectrum.
- real_signal (bool) – If True, only half of the (mirrored) fast Fourier transform will be used during filtering. For audio, this makes no difference; it is visible in the plots, however, if you are interested. (default False)
- phase_radians (bool) – Relevant for band spectral subtraction: whether phase should be calculated in radians or in complex values / power spectrum. (default True)
- num_bands (int) – Relevant for band spectral subtraction: the number of bands to section frequencies into. By grouping sections of frequencies during spectral subtraction filtering, musical noise or distortion should be reduced. (defaults to 6)
- visualize (bool) – If True, plots of the windows and the filtered signal will be made. (default False)
- visualize_every_n_windows (int) – If visualize is set to True, this controls how often plots are made: every 50 windows, for example. (default 50)
- max_vol (int or float) – The maximum volume level of the filtered signal. This is useful if you know you do not want the signal to be louder than a certain value. Ears are important. (default 0.4) TODO: improve on matching volume to the original signal, at least using objective measures.
- min_vol (int or float) – The minimum volume level of the filtered signal. (default 0.15) TODO: improve on matching volume to the original signal.
- save2wav (bool) – If True, the filtered signal will be saved as a .wav file.
- output_filename (str or pathlib.PosixPath, optional) – Path and name the filtered signal is to be saved under. If no filename is provided, the file will be saved under the date. (default None)
- overwrite (bool) – If True and an audiofile by the same name exists, that file will be overwritten.
- use_scipy (bool) – If False, audiofiles will be loaded using librosa; otherwise, scipy.io.wavfile. (default False)
- remove_dc (bool) – If True, the DC bias ('direct current' bias) will be removed; in other words, the mean amplitude will be made to equal 0.
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.filters.WienerFilter or soundpy.filters.BandSubtraction (depending on filter_type).

Returns
- enhanced_signal (np.ndarray [size=(num_samples,)]) – The enhanced signal in raw sample form. Stereo audio has not yet been tested.
- sr (int) – The sample rate of the enhanced / filtered signal.

References

Kamath, S. and Loizou, P. (2002). A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing.

Kamath, S. and Loizou, P. (2006). mband.m MATLAB code from the book: Loizou, P. (2013). Speech Enhancement: Theory and Practice.
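Example – a minimal usage sketch (the audio file path and output name are hypothetical; keyword values are illustrative, not recommended settings):

>>> import soundpy as sp
>>> # noise is taken from the first 120 ms of the file itself
>>> enhanced, sr = sp.builtin.filtersignal('noisy_speech.wav',
...                                        filter_type='wiener',
...                                        apply_postfilter=True,
...                                        save2wav=True,
...                                        output_filename='speech_filtered.wav')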
soundpy.builtin.dataset_logger(audiofile_dir=None, recursive=True)

Logs the name, format, bitdepth, sample rate (sr), duration, and number of channels of audiofiles.

Parameters
- audiofile_dir (str or pathlib.PosixPath) – The directory where the audiofiles of interest are. If no directory is provided, the current working directory will be used.
- recursive (bool) – If True, all audiofiles will be analyzed, including those in nested directories. Otherwise, only the audiofiles in the immediate directory will be analyzed. (default True)

Returns
- audiofile_dict (dict) – Dictionary within a dictionary, holding the formats of the audiofiles in the directory/ies.

Examples

>>> audio_info = dataset_logger()
>>> # look at three audio files:
>>> count = 0
>>> for key, value in audio_info.items():
...     for k, v in value.items():
...         print(k, ' : ', v)
...     count += 1
...     print()
...     if count > 2:
...         break
audio : audiodata/dogbark_2channels.wav
sr : 48000
num_channels : 2
dur_sec : 0.389
format_type : WAV
bitdepth : PCM_16

audio : audiodata/python_traffic_pf.wav
sr : 48000
num_channels : 1
dur_sec : 1.86
format_type : WAV
bitdepth : DOUBLE

audio : audiodata/259672__nooc__this-is-not-right.wav
sr : 44100
num_channels : 1
dur_sec : 2.48453514739229
format_type : WAV
bitdepth : PCM_16

See also
- soundfile.available_subtypes: The subtypes available with the package SoundFile.
- soundfile.available_formats: The formats available with the package SoundFile.
 
soundpy.builtin.dataset_formatter(audiodirectory=None, recursive=False, new_dir=None, sr=None, dur_sec=None, zeropad=False, format='WAV', bitdepth=None, overwrite=False, mono=False)

Formats all audio files in a directory to set parameters. The audiofiles formatted can be limited to the specific directory or extended to the subfolders of that directory.

Parameters
- audiodirectory (str or pathlib.PosixPath) – The directory where the audio files live. If no directory is provided, the current working directory will be used.
- recursive (bool) – If False, only the audiofiles in the specific directory will be formatted. If True, audiofiles in nested directories will also be formatted. (default False)
- new_dir (str or pathlib.PosixPath) – The audiofiles will be saved with the same structure in this directory. If None, a default directory name with a timestamp will be generated.
- sr (int) – The desired sample rate to assign to the audio files. If None, the original sample rate will be maintained.
- dur_sec (int) – The desired length in seconds the audio files should be limited to. If zeropad is set to True, the samples will be zeropadded to match this length if they are too short. If None, no limitation will be applied.
- zeropad (bool) – If True, samples will be zeropadded to match dur_sec. (default False)
- format (str) – The format to save the audio data in. (default 'WAV')
- bitdepth (int or str) – The desired bitdepth. If int, 16 or 32 are possible. Defaults to 'PCM_16'.
- overwrite (bool) – If True and new_dir is None, the audio data will be reformatted in the original directory and saved over any existing filenames. (default False)
- mono (bool) – If True, the audio will be limited to a single channel. Note: not much has been tested for stereo sound and soundpy. (default False)

Returns
- directory – The directory where the formatted audio files are located.

See also
- soundpy.files.collect_audiofiles: Collects audiofiles from a given directory.
- soundpy.files.conversion_formats: The available formats for converting audio data.
- soundfile.available_subtypes: The subtypes or bitdepths possible for soundfile.
 
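Example – a minimal usage sketch (directory names are hypothetical):

>>> import soundpy as sp
>>> # resample and trim/zeropad all audio in a folder to 1 second at 16 kHz
>>> formatted_dir = sp.builtin.dataset_formatter(audiodirectory='raw_audio',
...                                              new_dir='formatted_audio',
...                                              sr=16000, dur_sec=1,
...                                              zeropad=True, mono=True)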
soundpy.builtin.create_denoise_data(cleandata_dir, noisedata_dir, trainingdata_dir, limit=None, snr_levels=None, pad_mainsound_sec=None, random_seed=None, overwrite=False, **kwargs)

Applies noise to clean audio; saves the clean and noisy audio to trainingdata_dir.

Parameters
- cleandata_dir (str or pathlib.PosixPath) – Name of the folder containing clean audio data for the autoencoder, e.g. 'clean_speech'.
- noisedata_dir (str or pathlib.PosixPath) – Name of the folder containing noise to add to the clean data, e.g. 'noise'.
- trainingdata_dir (str or pathlib.PosixPath) – Directory to save the newly created train, validation, and test data.
- limit (int, optional) – Limit on the number of audiofiles used for training data.
- snr_levels (list of ints, optional) – List of varying signal-to-noise ratios to apply to the noise levels. (default None)
- pad_mainsound_sec (int or float, optional) – Amount in seconds the main sound should be padded; in other words, how long in seconds the background sound should play before the clean / main / target audio starts. The same amount of noise will be appended at the end. (default None)
- random_seed (int) – A value to make the random order of audiofiles predictable. If None, the order of audiofiles will not be predictable. (default None)
- overwrite (bool) – If True, a new dataset will be created regardless of whether or not a matching directory already exists. (default False)
- **kwargs (additional keyword arguments) – The keyword arguments for soundpy.files.loadsound.

Returns
- saveinput_path (pathlib.PosixPath) – Path to where the noisy audio files are located.
- saveoutput_path (pathlib.PosixPath) – Path to where the clean audio files are located.

See also
- soundpy.files.loadsound: Loads audiofiles.
- soundpy.dsp.add_backgroundsound: Adds background sound / noise to a signal at a determined signal-to-noise ratio.
 
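Example – a minimal usage sketch (directory names are hypothetical):

>>> import soundpy as sp
>>> noisy_dir, clean_dir = sp.builtin.create_denoise_data(
...     cleandata_dir='clean_speech',
...     noisedata_dir='noise',
...     trainingdata_dir='denoise_data',
...     snr_levels=[0, 5, 10],
...     random_seed=40)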
soundpy.builtin.envclassifier_feats(data_dir, data_features_dir=None, perc_train=0.8, ignore_label_marker=None, **kwargs)

Environment classifier: feature extraction of scene audio into train, val, and test datasets. Saves the extracted feature datasets (train, val, test) as well as the feature extraction settings in the directory data_features_dir.

Parameters
- data_dir (str or pathlib.PosixPath) – The directory with scene subfolders (e.g. 'air_conditioner', 'traffic') that contain audio files belonging to that scene (e.g. 'air_conditioner/ac1.wav', 'air_conditioner/ac2.wav', 'traffic/t1.wav').
- data_features_dir (str or pathlib.PosixPath, optional) – The directory where feature extraction related to the dataset will be stored. Within this directory, a unique subfolder will be created each time features are extracted. This allows several versions of extracted features on the same dataset without overwriting files.
- perc_train (float) – The fraction of data to be set aside as train data. The rest will be divided into validation and test datasets.
- ignore_label_marker (str) – A string to look for in labels that should not be included, for example '__' to ignore a subdirectory titled '__noise' or 'not__label'.
- kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.save_features_datasets and soundpy.feats.get_feats.

Returns
- feat_extraction_dir – The path to where all feature extraction files, including the datasets, can be found.

See also
- soundpy.feats.get_feats: Extracts features from an audio file or audio data.
- soundpy.feats.save_features_datasets: Prepares acoustic features in train, validation, and test datasets.
 
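Example – a minimal usage sketch (directory names are hypothetical; feature_type is assumed to pass through **kwargs to soundpy.feats.get_feats):

>>> import soundpy as sp
>>> feats_dir = sp.builtin.envclassifier_feats(
...     data_dir='scene_audio',
...     data_features_dir='features',
...     perc_train=0.8,
...     feature_type='fbank')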
soundpy.builtin.denoiser_feats(data_clean_dir, data_noisy_dir, data_features_dir=None, limit=None, perc_train=0.8, **kwargs)

Autoencoder denoiser: feature extraction of clean and noisy audio into train, val, and test datasets. Saves the extracted feature datasets (train, val, test) as well as the feature extraction settings in the directory data_features_dir.

Parameters
- data_clean_dir (str or pathlib.PosixPath) – The directory with clean audio files.
- data_noisy_dir (str or pathlib.PosixPath) – The directory with noisy audio files. These should be the same as the clean audio, except that noise has been added.
- data_features_dir (str or pathlib.PosixPath, optional) – The directory where feature extraction related to the dataset will be stored. Within this directory, a unique subfolder will be created each time features are extracted. This allows several versions of extracted features on the same dataset without overwriting files.
- limit (int, optional) – The limit on the number of audio files for feature extraction. (default None)
- kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.save_features_datasets and soundpy.feats.get_feats.

Returns
- feat_extraction_dir – The path to where all feature extraction files, including the datasets, can be found.

See also
- soundpy.datasets.create_denoise_data: Applies noise at specified SNR levels to clean audio files.
- soundpy.feats.get_feats: Extracts features from an audio file or audio data.
- soundpy.feats.save_features_datasets: Prepares acoustic features in train, validation, and test datasets.
 
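Example – a minimal usage sketch, continuing from the create_denoise_data example above (the clean/noisy subdirectory names are hypothetical):

>>> import soundpy as sp
>>> feats_dir = sp.builtin.denoiser_feats(
...     data_clean_dir='denoise_data/clean',
...     data_noisy_dir='denoise_data/noisy',
...     data_features_dir='denoise_features')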
Built-In Functionality (Deep Learning)
The soundpy.models.builtin module includes example functions that train neural networks on sound data.
soundpy.models.builtin.denoiser_train(feature_extraction_dir, model_name='model_autoencoder_denoise', feature_type=None, use_generator=True, normalize=True, patience=10, **kwargs)

Collects training features and trains an autoencoder denoiser.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – Directory where the extracted feature files (format .npy) are located.
- model_name (str) – The name for the model. This can be quite generic, as the date down to the millisecond will be added to ensure a unique name for each trained model. (default 'model_autoencoder_denoise')
- feature_type (str, optional) – The type of features that will be used to train the model. This is only for the purposes of naming the model. If set to None, it will not be included in the model name.
- use_generator (bool) – If True, a generator will be used to feed training data to the model. Otherwise the entire training dataset will be used to train the model all at once. (default True)
- normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
- patience (int) – Number of epochs to train without improvement before early stopping.
- **kwargs (additional keyword arguments) – The keyword arguments for keras.fit(). Note: the keyword arguments differ for validation data, so be sure to use the correct keyword arguments depending on whether you use the generator or not. TODO: add link to keras.fit().

Returns
- model_dir – The directory where the model and associated files can be found.

See also
- soundpy.datasets.separate_train_val_test_files: Generates path lists for train, validation, and test files. Useful for noisy vs. clean datasets and also for multiple training files.
- soundpy.models.generator: The generator function that feeds data to the model.
- soundpy.models.modelsetup.setup_callbacks: The function that sets up callbacks (e.g. logging, save best model, early stopping, etc.).
- soundpy.models.template_models.autoencoder_denoise: Template model architecture for a basic autoencoder denoiser.
 
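Example – a minimal usage sketch, continuing from the denoiser_feats example (epochs is assumed to pass through **kwargs to keras fit):

>>> import soundpy as sp
>>> model_dir = sp.models.builtin.denoiser_train(
...     feature_extraction_dir=feats_dir,
...     feature_type='stft',
...     epochs=10)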
soundpy.models.builtin.envclassifier_train(feature_extraction_dir, model_name='model_cnn_classifier', feature_type=None, use_generator=True, normalize=True, patience=15, add_tensor_last=True, num_layers=3, **kwargs)

Collects training features and trains a CNN environment classifier. This model may be applied to any speech-and-label scenario, for example male vs. female speech, clinical vs. healthy speech, simple speech / word recognition, as well as noise / scene / environment classification.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – Directory where the extracted feature files (format .npy) are located.
- model_name (str) – The name for the model. This can be quite generic, as the date down to the millisecond will be added to ensure a unique name for each trained model. (default 'model_cnn_classifier')
- feature_type (str, optional) – The type of features that will be used to train the model. This is only for the purposes of naming the model. If set to None, it will not be included in the model name.
- use_generator (bool) – If True, a generator will be used to feed training data to the model. Otherwise the entire training dataset will be used to train the model all at once. (default True)
- normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
- patience (int) – Number of epochs to train without improvement before early stopping.
- num_layers (int) – The number of convolutional neural network layers desired. (default 3)
- **kwargs (additional keyword arguments) – The keyword arguments for keras.fit(). Note: the keyword arguments differ for validation data, so be sure to use the correct keyword arguments depending on whether you use the generator or not. TODO: add link to keras.fit().

Returns
- model_dir – The directory where the model and associated files can be found.

See also
- soundpy.datasets.separate_train_val_test_files: Generates path lists for train, validation, and test files. Useful for noisy vs. clean datasets and also for multiple training files.
- soundpy.models.generator: The generator function that feeds data to the model.
- soundpy.models.modelsetup.setup_callbacks: The function that sets up callbacks (e.g. logging, save best model, early stopping, etc.).
- soundpy.models.template_models.cnn_classifier: Template model architecture for a low-computational CNN sound classifier.
 
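Example – a minimal usage sketch, continuing from the envclassifier_feats example (epochs is assumed to pass through **kwargs to keras fit):

>>> import soundpy as sp
>>> model_dir = sp.models.builtin.envclassifier_train(
...     feature_extraction_dir=feats_dir,
...     model_name='model_cnn_classifier',
...     epochs=10)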
soundpy.models.builtin.denoiser_run(model, new_audio, feat_settings_dict, remove_dc=True)

Implements a pre-trained denoiser.

Parameters
- model (str or pathlib.PosixPath) – The path to the denoising model.
- new_audio (str, pathlib.PosixPath, or np.ndarray) – The path to the noisy audiofile.
- feat_settings_dict (dict) – Dictionary containing the necessary settings for how the features were extracted for training the model. Expected keys: 'feature_type', 'win_size_ms', 'percent_overlap', 'sr', 'window', 'frames_per_sample', 'input_shape', 'desired_shape', 'dur_sec', 'num_feats'.

Returns
- cleaned_audio (np.ndarray [shape=(num_samples,)]) – The cleaned audio samples, ready for playing or saving as an audio file.
- sr (int) – The sample rate of cleaned_audio.

See also
- soundpy.feats.get_feats: How features are extracted.
- soundpy.feats.feats2audio: How features are transformed back into audio samples.
 
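Example – a minimal usage sketch (the model path, audio path, and settings are hypothetical; in practice feat_settings_dict is built from the settings logged during feature extraction):

>>> import soundpy as sp
>>> cleaned, sr = sp.models.builtin.denoiser_run(
...     model='model_dir/model_autoencoder_denoise.h5',
...     new_audio='noisy_speech.wav',
...     feat_settings_dict=feat_settings)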
soundpy.models.builtin.envclassifier_run(model, new_audio, feat_settings_dict, dict_decode)

Implements a convnet model with new_audio.

Parameters
- model (str or pathlib.PosixPath) – The path to the pre-trained model.
- new_audio (str or pathlib.PosixPath) – The path to the audio file to be classified.
- feat_settings_dict (dict) – Dictionary containing the necessary settings for feature extraction, such as sample rate, feature type, etc.
- dict_decode (dict) – Dictionary containing encoded labels as keys and string labels as values, for example {0: 'office', 1: 'traffic', 2: 'park'}.

Returns
 
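Example – a minimal usage sketch (paths and dictionaries are hypothetical; the return values are not documented above, so none are captured here):

>>> import soundpy as sp
>>> dict_decode = {0: 'office', 1: 'traffic', 2: 'park'}
>>> sp.models.builtin.envclassifier_run(
...     model='model_dir/model_cnn_classifier.h5',
...     new_audio='unknown_scene.wav',
...     feat_settings_dict=feat_settings,
...     dict_decode=dict_decode)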
soundpy.models.builtin.collect_classifier_settings(feature_extraction_dir)

Collects relevant information for some models from files in the feature directory. These relevant files were generated by soundpy.models.builtin.envclassifier_train.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – The directory where the extracted files, including the .npy and .csv log files, are located.

Returns
- datasets (NamedTuple) – A named tuple containing train, val, and test data.
- num_labels (int) – The number of labels used for the data.
- feat_shape (tuple) – The initial shape of the features when they were extracted, i.e. before labels or a context window were applied.
- num_feats (int) – The number of features used to train the pre-trained model.
- feature_type (str) – The feature_type used to train the pre-trained model, for example 'fbank', 'mfcc', 'stft', 'signal', 'powspec'.

See also
- soundpy.models.builtin.envclassifier_train: The built-in functionality for training a simple scene/environment/speech classifier; it generates the files expected by this function.
 
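Example – a minimal usage sketch (the directory is hypothetical; the five return values are unpacked in the order documented above):

>>> import soundpy as sp
>>> datasets, num_labels, feat_shape, num_feats, feature_type = \
...     sp.models.builtin.collect_classifier_settings('features_dir')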
soundpy.models.builtin.cnnlstm_train(feature_extraction_dir, model_name='model_cnnlstm_classifier', use_generator=True, normalize=True, patience=15, timesteps=10, context_window=5, frames_per_sample=None, colorscale=1, total_training_sessions=None, add_tensor_last=False, **kwargs)

Example implementation of a convnet+LSTM model for speech recognition. Note: improvements must still be made, for example with the context_window; however, this may still be useful as an example of a simple CNN and LSTM model.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – The directory where feature data will be saved.
- model_name (str) – The name of the model. (default 'model_cnnlstm_classifier')
- use_generator (bool) – If True, data will be fed to the model via a generator. This parameter will likely be removed and set as a default. (default True)
- normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
- patience (int) – The number of epochs to allow with no improvement in either validation accuracy or loss. (default 15)
- timesteps (int) – The frames dedicated to each subsection of each sample. This allows the long short-term memory model to process each subsection consecutively.
- context_window (int) – The number of frames surrounding a central frame that make up the sound context. Note: this needs improvement and further exploration.
- frames_per_sample (int) – Currently serves basically the same role as context_window: frames_per_sample equals context_window * 2 + 1. This parameter will likely be removed in future versions.
- colorscale (int) – The colorscale relevant for the convolutional neural network. (default 1)
- total_training_sessions (int) – Option to limit the number of audiofiles used for training if use_generator is set to False. This parameter will likely be removed in future versions, but as this is just an example model, the low priority may result in this parameter living forever.
- add_tensor_last (bool) – No longer used in the code; irrelevant.
- kwargs (additional keyword arguments) – Keyword arguments for keras.model.fit.

Returns
- model_dir (pathlib.PosixPath) – The directory where the model and log files are saved.
- history (tf.keras.callbacks.History) – Contains the model's training and validation accuracy and loss throughout training.

References

Kim, Myungjong & Cao, Beiming & An, Kwanghoon & Wang, Jun. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. 10.21437/interspeech.2018-2250.
soundpy.models.builtin.resnet50_train(feature_extraction_dir, model_name='model_resnet50_classifier', use_generator=True, normalize=True, patience=15, colorscale=3, total_training_sessions=None, **kwargs)

Continues training a pre-trained ResNet50 model for speech recognition or other sound classification.

Parameters
- feature_extraction_dir (str or pathlib.PosixPath) – The directory where feature extraction files will be saved.
- model_name (str) – The name for the model. (default 'model_resnet50_classifier')
- use_generator (bool) – If True, data will be fed to the model via a generator. This parameter will likely be removed and set as a default. (default True)
- normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
- patience (int) – The number of epochs to allow with no improvement in either validation accuracy or loss. (default 15)
- colorscale (int) – The colorscale relevant for the convolutional neural network. (default 3)
- total_training_sessions (int) – Option to limit the number of audiofiles used for training if use_generator is set to False. This parameter will likely be removed in future versions, but as this is just an example model, the low priority may result in this parameter living forever.

Returns
- model_dir (pathlib.PosixPath) – The directory where the model and log files are saved.
- history (tf.keras.callbacks.History) – Contains the model's training and validation accuracy and loss throughout training.
 
 
soundpy.models.builtin.envclassifier_extract_train(model_name='env_classifier', augment_dict=None, audiodata_path=None, features_dir=None, save_new_files_dir=None, labeled_data=True, ignore_label_marker=None, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=None, visualize=False, vis_every_n_items=50, label_silence=False, val_data=None, test_data=None, append_model_dir=False, **kwargs)

Extracts and augments features during the training of a scene/environment/speech classifier.

Parameters
- model_name (str) – Name of the model, without extension (it will be saved as an .h5 file). (default 'env_classifier')
- augment_dict (dict, optional) – Dictionary containing augmentation keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and corresponding True or False values. If a value is True, that augmentation gets implemented at random each epoch. (default None)
- audiodata_path (str or pathlib.PosixPath) – Where the audio data can be found, if there is no features_dir where previously extracted and prepared files are located. (default None)
- features_dir (str or pathlib.PosixPath) – The feature directory where previously extracted validation and test data are located, as well as the relevant log files.
- save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
- labeled_data (bool) – Useful for determining the shape of the data. If True, a label column is expected at the end of the feature columns of the feature data. Note: this may be removed in future versions.
- ignore_label_marker (str) – When collecting labels from subdirectory names, this allows a subfolder name to be ignored. For example, if ignore_label_marker is set to '__', the folder name '__test__' will not be included as a label, while the folder name 'dog_barking' will.
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
 
 
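Example – a minimal sketch combining get_augmentation_dict with this function (the audio directory is hypothetical; epochs is a documented parameter of this function):

>>> import soundpy as sp
>>> augment_dict = sp.augment.get_augmentation_dict()
>>> augment_dict['add_white_noise'] = True
>>> sp.models.builtin.envclassifier_extract_train(
...     model_name='env_classifier',
...     audiodata_path='scene_audio',
...     augment_dict=augment_dict,
...     epochs=5)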
soundpy.models.builtin.cnnlstm_extract_train(model_name='cnnlstm_classifier', dataset_dict=None, num_labels=None, augment_dict=None, audiodata_path=None, save_new_files_dir=None, labeled_data=True, ignore_label_marker=None, context_window=5, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=None, visualize=False, vis_every_n_items=50, label_silence=False, **kwargs)

Extracts and augments features during the training of a scene/environment/speech classifier.

Parameters
- model_name (str) – Name of the model, without extension (it will be saved as an .h5 file).
- dataset_dict (dict, optional) – A dictionary with datasets as keys and audio file lists (with or without labels) as values. If None, it will be created based on audiodata_path. (default None)
- augment_dict (dict, optional) – Dictionary containing augmentation keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and corresponding True or False values. If a value is True, that augmentation gets implemented at random each epoch. (default None)
- audiodata_path (str or pathlib.PosixPath) – Where the audio data can be found, if no dataset_dict is provided. (default None)
- save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
 
 
soundpy.models.builtin.denoiser_extract_train(model_name='denoiser', augment_dict=None, audiodata_clean_path=None, audiodata_noisy_path=None, features_dir=None, save_new_files_dir=None, labeled_data=False, ignore_label_marker=None, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=20, visualize=False, vis_every_n_items=50, label_silence=False, val_data=None, test_data=None, append_model_dir=False, **kwargs)

Extracts and augments features during the training of a denoiser.

Parameters
- model_name (str) – Name of the model, without extension (it will be saved as an .h5 file). (default 'denoiser')
- augment_dict (dict, optional) – Dictionary containing augmentation keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and corresponding True or False values. If a value is True, that augmentation gets implemented at random each epoch. (default None)
- audiodata_clean_path (str or pathlib.PosixPath) – Where the clean audio data can be found, if there is no features_dir where previously extracted and prepared files are located. (default None)
- audiodata_noisy_path (str or pathlib.PosixPath) – Where the noisy audio data can be found, if there is no features_dir where previously extracted and prepared files are located. (default None)
- features_dir (str or pathlib.PosixPath) – The feature directory where previously extracted validation and test data are located, as well as the relevant log files.
- save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
- labeled_data (bool) – Useful for determining the shape of the data. If True, a label column is expected at the end of the feature columns of the feature data. Note: this may be removed in future versions.
- ignore_label_marker (str) – When collecting labels from subdirectory names, this allows a subfolder name to be ignored. For example, if ignore_label_marker is set to '__', the folder name '__test__' will not be included as a label, while the folder name 'dog_barking' will.
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
 
 
Augment audio data
The augment module includes functions related to augmenting audio data. These functions draw on implementations from published research.
Other resources for augmentation (not included in soundpy functionality):
Ma, E. (2019). NLP Augmentation. https://github.com/makcedward/nlpaug
Park, D. S., Chan, W., Zhang, Y., Chiu, C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Google Brain. arxiv.org/pdf/1904.08779.pdf
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084:
1. Signal speed scaling by a random number in [0.8, 1.2] (SpeedupFactoryRange).
2. Pitch shift by a random number in [-2, 2] semitones (SemitoneShiftRange).
3. Volume increase/decrease by a random number in [-3, 3] dB (VolumeGainRange).
4. Addition of random noise in the range [0, 10] dB (SNR).
5. Time shift in the range [-0.005, 0.005] seconds (TimeShiftRange).
soundpy.augment.speed_increase(sound, sr, perc=0.15, **kwargs)

Acoustic augmentation of speech.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084

Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio Augmentation for Speech Recognition. Interspeech.

Verhelst, W. and Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, April 1993, pp. 554-557.
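Example – a minimal sketch (the audio path is hypothetical; the return value is assumed here to be the augmented samples, which is not documented above):

>>> import soundpy as sp
>>> sound, sr = sp.loadsound('speech.wav')
>>> faster = sp.augment.speed_increase(sound, sr, perc=0.15)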
soundpy.augment.speed_decrease(sound, sr, perc=0.15, **kwargs)

Acoustic augmentation of speech.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.time_shift(sound, sr, random_seed=None, **kwargs)

Acoustic augmentation of sound (probably not for speech). Applies a random shift to the sound by dividing it into 2 sections and switching them.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.shufflesound(sound, sr, num_subsections=2, random_seed=None, **kwargs)

Acoustic augmentation of noise or background sounds. This separates the sound into num_subsections and pseudorandomizes their order.

References

Inoue, T., Vinayavekhin, P., Wang, S., Wood, D., Munawar, A., Ko, B. J., Greco, N., & Tachibana, R. (2019). Shuffling and mixing data augmentation for environmental sound classification. Detection and Classification of Acoustic Scenes and Events 2019, 25-26 October 2019, New York, NY, USA.
soundpy.augment.add_white_noise(sound, sr, noise_level=0.01, snr=10, random_seed=None, **kwargs)

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.harmonic_distortion(sound, sr, **kwargs)

Applies the sin function five times.

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.pitch_increase(sound, sr, num_semitones=2, **kwargs)

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.pitch_decrease(sound, sr, num_semitones=2, **kwargs)

References

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.vtlp(sound, sr, a=(0.8, 1.2), random_seed=None, oversize_factor=16, win_size_ms=50, percent_overlap=0.5, bilinear_warp=True, real_signal=True, fft_bins=1024, window='hann', zeropad=True, expected_shape=None, visualize=False)

Applies vocal tract length perturbation (VTLP) directly to the (oversized) DFT windows.

References

Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria.

Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
soundpy.augment.get_augmentation_dict()

Returns a dictionary with augmentation options as keys and values set to False.

Examples

>>> import soundpy as sp
>>> ad = sp.augment.get_augmentation_dict()
>>> ad
{'speed_increase': False, 'speed_decrease': False, 'time_shift': False, 'shufflesound': False, 'add_white_noise': False, 'harmonic_distortion': False, 'pitch_increase': False, 'pitch_decrease': False, 'vtlp': False}
>>> # to set an augmentation to True:
>>> ad['add_white_noise'] = True
>>> ad
{'speed_increase': False, 'speed_decrease': False, 'time_shift': False, 'shufflesound': False, 'add_white_noise': True, 'harmonic_distortion': False, 'pitch_increase': False, 'pitch_decrease': False, 'vtlp': False}
soundpy.augment.list_augmentations()

Lists the available augmentations.

Examples

>>> import soundpy as sp
>>> print(sp.augment.list_augmentations())
Available augmentations:
speed_increase
speed_decrease
time_shift
shufflesound
add_white_noise
harmonic_distortion
pitch_increase
pitch_decrease
vtlp
soundpy.augment.get_augmentation_settings_dict(augmentation)

Returns the default settings of the base function for an augmentation.

Parameters
- augmentation (str) – The augmentation of interest.

Returns
- aug_defaults (dict) – A dictionary with the base augmentation function's parameters as keys and their default values as values.

Examples

>>> import soundpy as sp
>>> d = sp.augment.get_augmentation_settings_dict('speed_decrease')
>>> d
{'perc': 0.15}
>>> # can use this dictionary to apply different values for augmentation
>>> d['perc'] = 0.1
>>> d
{'perc': 0.1}
>>> # to build a dictionary with several settings:
>>> many_settings_dict = {}
>>> many_settings_dict['add_white_noise'] = sp.augment.get_augmentation_settings_dict('add_white_noise')
>>> many_settings_dict['pitch_increase'] = sp.augment.get_augmentation_settings_dict('pitch_increase')
>>> many_settings_dict
{'add_white_noise': {'noise_level': 0.01, 'snr': 10, 'random_seed': None}, 'pitch_increase': {'num_semitones': 2}}
>>> # change the 'snr' default value to a list of several values
>>> # this would apply white noise at either 10, 15, or 20 SNR, at random
>>> many_settings_dict['add_white_noise']['snr'] = [10, 15, 20]
>>> # change the number of semitones by which pitch increase is applied
>>> many_settings_dict['pitch_increase']['num_semitones'] = 1
>>> many_settings_dict
{'add_white_noise': {'noise_level': 0.01, 'snr': [10, 15, 20], 'random_seed': None}, 'pitch_increase': {'num_semitones': 1}}

Raises
- ValueError – If augmentation does not match the available augmentations.

See also
- soundpy.models.dataprep.augment_features: The dictionary example many_settings_dict above can be supplied under the parameter augment_settings_dict to apply augmentation settings when augmenting data, for example within a generator function. See soundpy.models.dataprep.GeneratorFeatExtraction.
 
Working with audio files
The files module contains functions related to handling audio data files, for example loading audio files, saving audio files, and examining and reformatting audio files.
soundpy.files.loadsound(filename, sr=None, mono=True, dur_sec=None, remove_dc=True, use_scipy=False)

Loads a sound file with scipy.io.wavfile.read or librosa.load (default librosa).

Parameters
- filename (str) – The filename of the sound to be loaded.
- sr (int, optional) – The desired sample rate of the audio samples. If None, the sample rate of the audio file will be used.
- mono (bool) – If True, the samples will be loaded as mono sound. If False and the samples are in stereo, they will be loaded as stereo sound.
- dur_sec (int or float, optional) – The length in seconds of the audio signal.
- remove_dc (bool) – If True, the mean is subtracted from the signal. This has shown to be very helpful when working with audio data. (default True)
- use_scipy (bool) – If False, librosa will be used to load the audiofile. If True, scipy.io.wavfile and/or soundfile will be used. If the sound file is not compatible with scipy.io.wavfile.read, this function converts the file to .wav format and/or changes the bit depth to be compatible. (default False)

Returns
- data (np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – The normalized (between -1 and 1) sample data, returned according to the specified settings.
- sr (int) – The sample rate of the loaded samples.

See also
- soundpy.files.prep4scipywavfile: Prepares an audio file for scipy.io.wavfile.read.
- soundpy.files.convert_audiofile: Converts an audio file to .wav format.
- soundpy.files.newbitdepth: Converts an audio file to a specified bitdepth.
- soundpy.dsp.resample_audio: Resamples audio data to a specified sample rate.
- soundpy.files.list_possibleformats: Lists the possible formats to load with soundpy.loadsound.
- librosa.load: The package used to load sound data by default. See librosa.
- scipy.io.wavfile.read: The package used to load sound if use_scipy is set to True. See scipy.
- soundpy.dsp.remove_dc_bias: Removes the 'direct current' bias from the signal.

Todo
Make librosa data and scipy.io.wavfile data more similar: https://stackoverflow.com/questions/54482346/reading-a-wav-file-with-scipy-and-librosa-in-python
 
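Example – a minimal usage sketch (the audio path is hypothetical):

>>> import soundpy as sp
>>> data, sr = sp.loadsound('speech.wav', sr=16000, dur_sec=2)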
soundpy.files.savesound(audiofile_name, signal_values, sr, remove_dc=True, overwrite=False, use_scipy=False, **kwargs)

Saves the audio samples at the designated path.

Parameters
- audiofile_name (str or pathlib.PosixPath) – Path and name the audio is to be saved under (.wav format).
- signal_values (ndarray) – The values of the real signal to be saved.
- sr (int) – The sample rate of the audio samples.
- remove_dc (bool) – If True, the mean is subtracted from the signal. (default True)
- overwrite (bool) – If True, audio with the same name will be overwritten. (default False)
- use_scipy (bool) – If True, scipy.io.wavfile.write will be used; however, file conversion is limited: it can only save .wav files. Otherwise soundfile.write will be used, which can save audio in more audio formats.
- **kwargs (additional keyword arguments) – The keyword arguments for soundfile.write: https://pysoundfile.readthedocs.io/en/latest/index.html?highlight=write#soundfile.write

Returns
- audiofile_name – The new audiofile name.

See also
- scipy.io.wavfile.write
- soundpy.files.conversion_formats: Lists the possible formats to save audio files in if use_scipy is False.
- soundpy.dsp.remove_dc_bias: Removes the 'direct current' bias from the signal.
 
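Example – a minimal sketch pairing loadsound and savesound (paths are hypothetical):

>>> import soundpy as sp
>>> data, sr = sp.loadsound('speech.wav')
>>> saved = sp.files.savesound('speech_copy.wav', data, sr, overwrite=True)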
soundpy.files.audiofiles_present(directory, recursive=False)

Checks to see if audio files are present.

Parameters
- directory (str or pathlib.PosixPath) – The directory to look for audio in.
- recursive (bool) – If True, all nested directories will be checked as well. (default False)

Returns
- bool – True if audio is present; otherwise False.
 
soundpy.files.collect_audiofiles(directory, hidden_files=False, wav_only=False, recursive=False)

Collects all files within a given directory. This includes the option to include hidden files in the collection.

Parameters
- directory (str or pathlib.PosixPath) – The path to where the desired files are located.
- hidden_files (bool) – If True, hidden files will be included; if False, they won't. (default False)
- wav_only (bool) – If True, only .wav files will be included; otherwise, there is no limit on file type.
- recursive (bool) – If True, audiofiles in nested directories will also be collected. (default False)

Returns
- paths_list (list of pathlib.PosixPath objects) – Sorted list of file paths.
 
soundpy.files.collect_zipfiles(directory, hidden_files=False, ext='tgz', recursive=False)

Collects all zipfiles within a given directory. This includes the option to include hidden files in the collection.

Parameters
- directory (str or pathlib.PosixPath) – The path to where the desired files are located.
- hidden_files (bool) – If True, hidden files will be included; if False, they won't. (default False)
- ext (str) – The extension of the zip files to collect. (default 'tgz')

Returns
- paths_list (list of pathlib.PosixPath objects) – Sorted list of file paths.
 
soundpy.files.prep4scipywavfile(filename, overwrite=False)

Takes a soundfile and saves it in a format compatible with scipy.io.wavfile.
soundpy.files.conversion_formats()

Lists the formats available for conversion. soundpy uses soundfile to convert files; therefore, whatever formats soundfile has available will be listed here.

Examples

>>> conversion_formats()
{'AIFF': 'AIFF (Apple/SGI)', 'AU': 'AU (Sun/NeXT)', 'AVR': 'AVR (Audio Visual Research)', 'CAF': 'CAF (Apple Core Audio File)', 'FLAC': 'FLAC (Free Lossless Audio Codec)', 'HTK': 'HTK (HMM Tool Kit)', 'SVX': 'IFF (Amiga IFF/SVX8/SV16)', 'MAT4': 'MAT4 (GNU Octave 2.0 / Matlab 4.2)', 'MAT5': 'MAT5 (GNU Octave 2.1 / Matlab 5.0)', 'MPC2K': 'MPC (Akai MPC 2k)', 'OGG': 'OGG (OGG Container format)', 'PAF': 'PAF (Ensoniq PARIS)', 'PVF': 'PVF (Portable Voice Format)', 'RAW': 'RAW (header-less)', 'RF64': 'RF64 (RIFF 64)', 'SD2': 'SD2 (Sound Designer II)', 'SDS': 'SDS (Midi Sample Dump Standard)', 'IRCAM': 'SF (Berkeley/IRCAM/CARL)', 'VOC': 'VOC (Creative Labs)', 'W64': 'W64 (SoundFoundry WAVE 64)', 'WAV': 'WAV (Microsoft)', 'NIST': 'WAV (NIST Sphere)', 'WAVEX': 'WAVEX (Microsoft)', 'WVE': 'WVE (Psion Series 3)', 'XI': 'XI (FastTracker 2)'}
soundpy.files.convert_audiofile(filename, format_type=None, sr=None, new_dir=False, overwrite=False, use_scipy=False, **kwargs)

Converts and saves a soundfile in the given format (default .wav), in the same or a new directory.

Parameters
- filename (str or pathlib.PosixPath) – The filename of the audiofile to be converted.
- format_type (str) – The format to convert the audio file to. See soundpy.files.conversion_formats. (defaults to 'wav')
- new_dir (str or pathlib.PosixPath, optional) – If False, the converted files will be saved in the same directory as the originals. If a path is provided, the converted files will be saved there. If no such directory exists, one will be created.
- sr (int, optional) – The sample rate to be applied to the signal. If none is supplied, the sample rate of the original file will be used.
- **kwargs (additional keyword arguments) – The keyword arguments for soundfile.write: https://pysoundfile.readthedocs.io/en/latest/index.html?highlight=write#soundfile.write

Returns
- f_wavfile – The filename / path where the audio file is saved.

Examples

>>> audiofile = './example/audio.wav'
>>> # in same directory
>>> audiofile_flac = sp.files.convert_audiofile(audiofile, format_type='flac')
>>> audiofile_flac
PosixPath('example/audio.flac')
>>> # in new directory
>>> audiofile_flac = sp.files.convert_audiofile(audiofile, format_type='flac', new_dir='./examples2/')
>>> audiofile_flac
PosixPath('examples2/audio.flac')
>>> # can establish desired conversion format in `new_dir`
>>> audiofile_ogg = sp.files.convert_audiofile(audiofile, new_dir='./examples2/audio.ogg')
>>> audiofile_ogg
PosixPath('examples2/audio.ogg')

See also
- soundpy.files.conversion_formats: Lists the possible formats to convert audio files to.
- soundpy.files.list_possibleformats: Lists the possible formats to load with soundpy.loadsound.
 
soundpy.files.replace_ext(filename, extension)

Adds or replaces an extension in the filename.

Parameters
- filename (str or pathlib.PosixPath) – Filename with the missing or incorrect extension.
- extension (str) – The correct extension for the given filename.

Returns
- file_newext – The filename with the new extension.
 
soundpy.files.match_ext(filename1, filename2)

Matches the file extensions. If both have extensions, the default is set to that of filename1.
soundpy.files.newbitdepth(wave, bitdepth=16, newname=None, overwrite=False)

Converts the bitdepth to 16 or 32, to ensure compatibility with scipy.io.wavfile. Scipy.io.wavfile is easily used online, for example in Jupyter notebooks.
soundpy.files.adjustname(filename, adjustment=None)

Adjusts a filename, keeping the original extension.

Parameters
- filename (str) – The filename to adjust.
- adjustment (str, optional) – The adjustment to add to the filename. If None, '_adj' will be added.

Returns
- fname – The adjusted filename with the original extension.

Examples

>>> adjustname('happy.md')
'happy_adj.md'
>>> adjustname('happy.md', '_not_sad')
'happy_not_sad.md'
Organizing datasets
The datasets module contains functions related to organizing datasets.
soundpy.datasets.create_encodedlabel2audio_dict(dict_encodelabels, paths_list, limit=None, seed=40)

Creates a dictionary with audio labels as keys and filename lists as values. If no label is found in a filename path, that label is not included in the returned dictionary: labels are only included if corresponding paths are present.

Parameters
- dict_encodelabels (dict) – Dictionary containing the labels as keys and their encoded values as values.
- paths_list (set or list) – List containing pathlib.PosixPath objects (i.e. paths) of all audio files; the audio files are expected to reside in directories with names matching their audio class.
- limit (int, optional) – A limit on the number of audiofiles for each class. This may be useful if one wants to ensure a balanced dataset. (default None)
- seed (int, optional) – The seed for pseudorandomizing the wavfiles, if a limit is requested. If seed is set to None, the randomized order of the limited wavfiles cannot be repeated. (default 40)

Returns
- label_waves_dict (OrderedDict) – A dictionary with encoded audio labels as keys and, as values, the audio files corresponding to each label. TODO: update.

Examples

>>> from pathlib import Path
>>> labels = dict([('vacuum',2),('fridge',0),('wind',1)])
>>> paths = [Path('data/audio/vacuum/vacuum1.wav'),
...          Path('data/audio/fridge/fridge1.wav'),
...          Path('data/audio/vacuum/vacuum2.wav'),
...          Path('data/audio/wind/wind1.wav')]
>>> label_waves_dict = create_encodedlabel2audio_dict(labels, paths)
>>> label_waves_dict
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav'), PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> # to set a limit on number of audiofiles per class:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=40)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> # change the limited pathways chosen:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=10)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
soundpy.datasets.create_dicts_labelsencoded(labels_class, add_extra_label=False, extra_label='silence')

Encodes audio class labels and saves them in dictionaries. The labels are alphabetized and encoded under their index. If add_extra_label is True, the extra_label is added as the last entry in the dictionary. This is useful if performing voice activity detection and wanting to label non-voice-activated sections as silent rather than as some speech label.

Parameters
- labels_class (set or list) – The set of audio class labels to encode.
- add_extra_label (bool) – If True, extra_label is added as the last entry in the dictionary. (default False)
- extra_label (str) – The extra label to add if add_extra_label is True. (default 'silence')

Returns
- dict_label2int (dict) – Dictionary with labels as keys and their encoded integers as values.
- dict_int2label (dict) – Dictionary with encoded integers as keys and labels as values.

Examples

>>> labels = {'wind','air_conditioner','fridge'}
>>> label2int, int2label = create_dicts_labelsencoded(labels)
>>> label2int
{'air_conditioner': 0, 'fridge': 1, 'wind': 2}
>>> int2label
{0: 'air_conditioner', 1: 'fridge', 2: 'wind'}
soundpy.datasets.waves2dataset(audiolist, perc_train=0.8, seed=40, train=True, val=True, test=True)

Organizes an audio file list into train, validation, and test datasets. If only two or one dataset is to be prepared, they will be assigned to train and val, or simply to train, respectively. The remaining 'datasets' will remain empty.

Parameters
- audiolist (list) – List containing paths to audio files.
- perc_train (float or int) – Percentage of data to be in the training dataset. (default 0.8)
- seed (int or None, optional) – Seed for the generation of the pseudorandom train, validation, and test datasets. Useful for reproducing results. (default 40)
- train (bool) – If True, it is assumed training data will be prepared. (default True)
- val (bool) – If True, it is assumed validation data will be prepared. (default True)
- test (bool) – If True, it is assumed test data will be prepared. (default True)

Returns

Examples

>>> # using a list of numbers instead of filenames
>>> audiolist = [1,2,3,4,5,6,7,8,9,10]
>>> # default settings:
>>> waves2dataset(audiolist)
([5, 4, 9, 2, 3, 10, 1, 6], [8], [7])
>>> # perc_train set to 50% instead of 80%:
>>> waves2dataset(audiolist, perc_train=50)
([5, 4, 9, 2, 3, 10], [1, 6], [8, 7])
>>> # change seed number
>>> waves2dataset(audiolist, seed=0)
([7, 1, 2, 5, 6, 9, 10, 8], [4], [3])
soundpy.datasets.audio2datasets(audiodata, perc_train=0.8, limit=None, seed=None, audio_only=True, **kwargs)

Organizes all audio in audio class directories into (randomized) datasets. The validation and test datasets are halved between what isn't train data; for example, if perc_train is 0.8, validation data will be 0.1 and test data will be 0.1.

Parameters
- audiodata (str, pathlib.PosixPath, dict, list, or set) – If the data has multiple labels, the path to the dictionary where audio class labels and the paths of all audio files belonging to each class are or will be stored. The dictionary with the labels and their encoded values can also be directly supplied here. If the data does not have labels, a list or set of audiofiles can be provided to be placed into train, val, and test datasets.
- seed (int, optional) – A value to make the random order of audiofiles predictable. If None, the order of audiofiles will not be predictable. (default None)
- audio_only (bool) – Whether audio files (e.g. with extensions .wav, .flac, etc.) are expected. If True, the list will be checked to contain only audio files; otherwise not. (default True)
- **kwargs (additional keyword arguments) – Keyword arguments for soundpy.datasets.waves2dataset.

Returns
- dataset_audio – Named tuple including three lists / datasets of audiofiles or label-audiofile pairs: the train, validation, and test lists, respectively. The label-audiofile pairs are saved as tuples within the lists and contain the encoded label integer (e.g. 0 instead of 'air_conditioner') and the audio paths associated with that class and dataset.

Raises
- ValueError – If perc_train is set too high for the amount of data, or if there are simply too few data; specifically, if the percentage of train data cannot be upheld while also ensuring the validation and test datasets have more than 1 sample.
 
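Example – a minimal usage sketch (the directory is hypothetical; the returned named tuple unpacks into the train, validation, and test lists):

>>> import soundpy as sp
>>> train, val, test = sp.datasets.audio2datasets('scene_audio', perc_train=0.8, seed=40)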
soundpy.datasets.separate_train_val_test_files(list_of_files)

Checks that the file(s) exist, then sorts the file(s) into train, val, and test lists. If 'nois' or 'clean' are in the filenames, two path lists per dataset will be generated; otherwise just one. This path list is useful if there are multiple training files available for training a model (e.g. for large datasets).

Parameters
- list_of_files (list, str, or pathlib.PosixPath) – The feature files (format: .npy) for training a model.

Returns
- (train_paths_list, val_paths_list, test_paths_list) – Tuple comprised of path lists to train, validation, and test data files. If noisy and clean data files are found, each tuple item will be a tuple comprised of two lists: a noisy file paths list and a clean file paths list.

Examples

>>> features_files = ['train1.npy', 'train2.npy', 'val.npy', 'test.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train
[PosixPath('train1.npy'), PosixPath('train2.npy')]
>>> datasets.val
[PosixPath('val.npy')]
>>> datasets.test
[PosixPath('test.npy')]
>>> # try with noisy and clean data
>>> features_files = ['train_noisy.npy', 'train_clean.npy', 'val_noisy.npy', 'val_clean.npy', 'test_noisy.npy', 'test_clean.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train.noisy
[PosixPath('train_noisy.npy')]
>>> datasets.train.clean
[PosixPath('train_clean.npy')]
>>> datasets.val.noisy
[PosixPath('val_noisy.npy')]
>>> datasets.val.clean
[PosixPath('val_clean.npy')]
>>> datasets.test.noisy
[PosixPath('test_noisy.npy')]
>>> datasets.test.clean
[PosixPath('test_clean.npy')]
- 
soundpy.datasets.section_data(dataset_dict, dataset_paths_dict, divide_factor=None)[source]¶
- Expects keys of these two dictionaries to match - Examples - >>> import pathlib >>> # train is longer than val and test >>> d = {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'val': [1, 2, 3, 4, 5], 'test': [1, 2, 3, 4, 5]} >>> # dictionary: paths to where extracted data will be saved >>> dp = {'train': pathlib.PosixPath('train_data.npy'), 'val': pathlib.PosixPath('val_data.npy'), 'test': pathlib.PosixPath('test_data.npy')} >>> d2, dp2 = section_data(d, dp, divide_factor = 3) >>> # val and train not touched (too small) >>> d2 {'train__1': [1, 2, 3, 4, 5], 'train__2': [6, 7, 8, 9, 10], 'train__3': [11, 12, 13, 14, 15], 'val': [1, 2, 3, 4, 5], 'test': [1, 2, 3, 4, 5]} >>> dp2 {'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'val': PosixPath('val_data.npy'), 'test': PosixPath('test_data.npy')} >>> # repeat: now val and test as long as train >>> # default divide_factor is 2 >>> d3, dp3 = section_data(d2, dp2) >>> d3 {'train__1': [1, 2], 'train__2': [3, 4, 5], 'train__3': [6, 7], 'train__4': [8, 9, 10], 'train__5': [11, 12], 'train__6': [13, 14, 15], 'val__1': [1, 2], 'val__2': [3, 4, 5], 'test__1': [1, 2], 'test__2': [3, 4, 5]} >>> dp3 {'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'train__4': PosixPath('train_data__4.npy'), 'train__5': PosixPath('train_data__5.npy'), 'train__6': PosixPath('train_data__6.npy'), 'val__1': PosixPath('val_data__1.npy'), 'val__2': PosixPath('val_data__2.npy'), 'test__1': PosixPath('test_data__1.npy'), 'test__2': PosixPath('test_data__2.npy')} 
Working with signals¶
The dsp module contains functions pertaining to the actual generation, manipulation, and analysis of sound. This ranges from generating sounds to calculating signal-to-noise ratio.
- 
soundpy.dsp.generate_sound(freq=200, amplitude=0.4, sr=8000, dur_sec=0.25)[source]¶
- Generates a sound signal with the provided parameters. Signal begins at 0. - Parameters
- freq ( - int,- float) – The frequency in Hz the signal should have (default 200 Hz). This pertains to the number of oscillations per second.
- amplitude ( - int,- float) – The parameter controlling how much energy the signal should have. (default 0.4)
- sr ( - int) – The sampling rate of the signal, or how many samples make up the signal per second. (default 8000)
- dur_sec ( - int,- float) – The duration of the generated signal in seconds. (default 0.25)
 
- Returns
- sound_samples ( - np.ndarray [size = (num_samples,)]) – The samples of the generated sound
- sr ( - int) – The sample rate of the generated signal
 
 - Examples - >>> sound, sr = generate_sound(freq=5, amplitude=0.5, sr=5, dur_sec=1) >>> sound array([ 0.000000e+00, 5.000000e-01, 3.061617e-16, -5.000000e-01, -6.123234e-16]) >>> sr 5 
- 
soundpy.dsp.get_time_points(dur_sec, sr)[source]¶
- Get evenly spaced time points from zero to dur_sec. - The time points align with the provided sample rate, making it easy to plot a signal with a time line in seconds. - Parameters
- dur_sec ( - int,- float) – The duration in seconds the time points should span.
- sr ( - int) – The sample rate the time points should align with.
- Returns
- time 
- Return type
- np.ndarray [size = (num_time_points,)]
 - Examples - >>> # 50 milliseconds at sample rate of 100 (100 samples per second) >>> x = get_time_points(0.05,100) >>> x.shape (5,) >>> x array([0. , 0.0125, 0.025 , 0.0375, 0.05 ]) 
- 
soundpy.dsp.generate_noise(num_samples, amplitude=0.025, random_seed=None)[source]¶
- Generates noise of a certain amplitude and number of samples. - Useful for adding noise to another signal of length num_samples. - Parameters
- num_samples ( - int) – The number of noise samples to generate.
- amplitude ( - int,- float) – The scale of the generated noise. (default 0.025)
- random_seed ( - int, optional) – A seed to make the generated noise reproducible. (default None)
 - Examples - >>> noise = generate_noise(5, random_seed = 0) >>> noise array([0.04410131, 0.01000393, 0.02446845, 0.05602233, 0.04668895]) 
- 
soundpy.dsp.set_signal_length(samples, numsamps)[source]¶
- Sets audio signal to be a certain length. Zeropads if too short; truncates if too long. - Useful for setting signals to a consistent length, regardless of how long the audio signal is. - Parameters
- samples ( - np.ndarray [size = (num_samples,- num_channels), or- (num_samples,)]) – The array of sample data to be zero padded.
- numsamps ( - int) – The desired number of samples.
 
- Returns
- data – Copy of samples zeropadded or limited to numsamps. 
- Return type
- np.ndarray [size = (numsamps,- num_channels), or- (numsamps,)]
 - Examples - >>> import numpy as np >>> input_samples = np.array([1,2,3,4,5]) >>> output_samples = set_signal_length(input_samples, numsamps = 8) >>> output_samples array([1, 2, 3, 4, 5, 0, 0, 0]) >>> output_samples = set_signal_length(input_samples, numsamps = 4) >>> output_samples array([1, 2, 3, 4]) 
- 
soundpy.dsp.scalesound(data, max_val=1, min_val=None)[source]¶
- Scales the input array to range between min_val and max_val. - Parameters
- data ( - np.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]) – Original samples
- max_val ( - int,- float) – The maximum value the dataset is to range from (default 1)
- min_val ( - int,- float, optional) – The minimum value the dataset is to range from. If set to None, will be set to the opposite of max_val. E.g. if max_val is set to 0.8, min_val will be set to -0.8. (default None)
 
- Returns
- samples – Copy of original data, scaled to the min and max values. 
- Return type
- np.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]
 - Examples - >>> import numpy as np >>> np.random.seed(0) >>> input_samples = np.random.random_sample((5,)) >>> input_samples array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ]) >>> input_samples.max() 0.7151893663724195 >>> input_samples.min() 0.4236547993389047 >>> # default setting: between -1 and 1 >>> output_samples = scalesound(input_samples) >>> output_samples array([-0.14138, 1., 0.22872961, -0.16834299, -1.]) >>> output_samples.max() 1.0 >>> output_samples.min() -1.0 >>> # range between -100 and 100 >>> output_samples = scalesound(input_samples, max_val = 100, min_val = -100) >>> output_samples array([-14.13800026, 100., 22.87296052, -16.83429866, -100.]) >>> output_samples.max() 100.0 >>> output_samples.min() -100.0 
- 
soundpy.dsp.shape_samps_channels(data)[source]¶
- Returns data in shape (num_samps, num_channels) - Parameters
- data ( - np.ndarray [size= (num_samples,)or- (num_samples,- num_channels), or- (num_channels,- num_samples)]) – The data that needs to be checked for correct format
- Returns
- data 
- Return type
- np.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]
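Example: a hedged sketch of the expected behavior (assuming channel-major input gets transposed so channels sit on the last axis):

>>> import numpy as np
>>> import soundpy as sp
>>> data = np.zeros((2, 100))   # (num_channels, num_samples)
>>> sp.dsp.shape_samps_channels(data).shape
(100, 2)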
 
- 
soundpy.dsp.resample_audio(samples, sr_original, sr_desired)[source]¶
- Allows audio samples to be resampled to desired sample rate. - Parameters
- Returns
- resampled ( - np.ndarray [size = (num_samples_resampled,)]) – The resampled samples.
- sr_desired ( - int) – The newly applied sample rate
 
 - Examples - >>> import numpy as np >>> # example samples from 5 millisecond signal with sr 100 and frequency 10 >>> input_samples = np.array([0.00e+00, 2.82842712e-01, 4.000e-01, 2.82842712e-01, 4.89858720e-17]) >>> # we want to resample to 80 instead of 100 (for this example's sake) >>> output_samples, sr = resample_audio(input_samples, sr_original = 100, sr_desired = 80) >>> output_samples array([-2.22044605e-17, 3.35408001e-01, 3.72022523e-01, 6.51178161e-02]) 
- 
soundpy.dsp.stereo2mono(data)[source]¶
- If sound data has multiple channels, reduces to first channel - Parameters
- data ( - numpy.ndarray) – The series of sound samples, with 1+ columns/channels
- Returns
- data_mono – The series of sound samples, with first column 
- Return type
- numpy.ndarray
 - Examples - >>> import numpy as np >>> data = np.linspace(0,20) >>> data_2channel = data.reshape(25,2) >>> data_2channel[:5] array([[0. , 0.40816327], [0.81632653, 1.2244898 ], [1.63265306, 2.04081633], [2.44897959, 2.85714286], [3.26530612, 3.67346939]]) >>> data_mono = stereo2mono(data_2channel) >>> data_mono[:5] array([0. , 0.81632653, 1.63265306, 2.44897959, 3.26530612]) 
- 
soundpy.dsp.add_backgroundsound(audio_main, audio_background, sr, snr=None, pad_mainsound_sec=None, total_len_sec=None, wrap=False, stationary_noise=True, random_seed=None, extend_window_ms=0, remove_dc=False, mirror_sound=False, clip_at_zero=True, **kwargs)[source]¶
- Adds a sound (i.e. background noise) to a target signal. Stereo sound should work. - If the sample rates of the two audio samples do not match, the sample rate of audio_main will be applied. (i.e. the audio_background will be resampled). If you have issues with clicks at the beginning or end of signals, see - soundpy.dsp.clip_at_zero.- Parameters
- audio_main ( - str,- pathlib.PosixPath, or- np.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – Sound file of the main sound (will not be modified; only delayed if specified). If not path or string, should be data samples corresponding to the provided sample rate.
- audio_background ( - str,- pathlib.PosixPath, or- np.ndarray [size=(num_samples,)]) – Sound file of the background sound (will be modified / repeated to match or extend the length indicated). If not of type pathlib.PosixPath or string, should be data samples corresponding to the provided sample rate.
- sr ( - int) – The sample rate of sounds to be added together. Note: sr of 44100 or higher is suggested.
- snr ( - int,- float,- list,- tuple) – The signal-to-noise ratio (SNR) of the target and background signals. Note: this is an approximation and needs further testing and development to be used as an official measurement of snr. If no SNR provided, signals will be added together as-is. (default None)
- pad_mainsound_sec ( - intor- float, optional) – Length of time in seconds the background sound will pad the main sound. For example, if pad_mainsound_sec is set to 1, one second of the audio_background will be played before audio_main starts as well as after the main audio stops. (default None)
- total_len_sec ( - intor- float, optional) – Total length of combined sound in seconds. If None, the sound will end after the (padded) target sound ends (default None).
- wrap ( - bool) – If False, the random selection of sound will be limited to end by the end of the audio file. If True, the random selection will wrap to the beginning of the audio file if it extends beyond the end of the audio file. (default False)
- stationary_noise ( - bool) – If False,- soundpy.feats.get_vad_stftwill be applied to noise to get energy of the active noise in the signal. Otherwise energy will be collected via soundpy.feats.get_stft. (default True)
- random_seed ( - int) – If provided, the ‘random’ section of noise will be chosen using this seed. (default None)
- extend_window_ms ( - intor- float) – The number of milliseconds by which the detected voice activity should be padded. This might be useful to ensure a sufficient amount of activity is calculated. (default 0)
- remove_dc ( - bool) – If the dc bias should be removed. This aids in the removal of clicks. See- soundpy.dsp.remove_dc_bias. (default False)
- **kwargs ( - additional keyword arguments) – The keyword arguments for soundpy.files.loadsound
 
- Returns
 - References - Yi Hu and Philipos C. Loizou, original authors
- Copyright (c) 2006 by Philipos C. Loizou 
- SIP-Lab/CNN-VAD/GitHub Repo
- Copyright (c) 2019 Signal and Image Processing Lab MIT License 
 - See also - soundpy.files.loadsound
- Loads audiofiles. 
- soundpy.dsp.snr_adjustnoiselevel
- Calculates how much to adjust noise signal to achieve SNR. 
- soundpy.feats.get_vad_stft
- Returns stft matrix of only voice active regions 
- soundpy.feats.get_stft
- Returns stft matrix of entire signal 
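Example: a hedged usage sketch ('speech.wav' and 'cafe.wav' are hypothetical files; since the Returns section above is empty in these docs, the result is left unpacked — check the source for the exact return values):

>>> import soundpy as sp
>>> # add cafe noise at roughly 15 dB SNR, padding the speech with 1 second of noise
>>> result = sp.dsp.add_backgroundsound('speech.wav', 'cafe.wav', sr=44100,
...                                     snr=15, pad_mainsound_sec=1, total_len_sec=5)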
 
- 
soundpy.dsp.hz_to_mel(freq)[source]¶
- Converts frequency to Mel scale - Parameters
- freq ( - intor- floator- array likeof- ints / floats) – The frequency/ies to convert to Mel scale.
- Returns
- mel – The frequency/ies in Mel scale. 
- Return type
 - References - https://en.wikipedia.org/wiki/Mel_scale#Formula - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
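The cited references use the formula mel = 2595 * log10(1 + freq / 700); a quick sketch of that formula (assuming, but not confirming, that the library uses these standard constants):

>>> import numpy as np
>>> round(2595 * np.log10(1 + 440 / 700), 1)   # 440 Hz in mel
549.6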
- 
soundpy.dsp.mel_to_hz(mel)[source]¶
- Converts Mel item or list to frequency/ies. - Parameters
- mel ( - int,- float, or- listof- ints / floats) – Mel item(s) to be converted to Hz.
- Returns
- freq – The converted frequency/ies 
- Return type
 - References - https://en.wikipedia.org/wiki/Mel_scale#Formula - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
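The inverse of the formula above, per the same references (again assuming the standard constants): freq = 700 * (10**(mel / 2595) - 1). For example:

>>> round(700 * (10 ** (549.64 / 2595) - 1), 1)   # back to Hz
440.0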
- 
soundpy.dsp.fbank_filters(fmin, fmax, num_filters)[source]¶
- Calculates the mel filterbanks given a min and max frequency and num_filters. - Parameters
- Returns
- mel_points – An array of floats containing evenly spaced filters (according to mel scale). 
- Return type
- np.ndarray [size=(num_filters,)]
 - References - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
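A sketch of the standard construction from the cited tutorial — points spaced evenly in mel space between the mel equivalents of fmin and fmax (assumed to approximate, not define, the library's implementation):

>>> import numpy as np
>>> fmin, fmax, num_filters = 0, 8000, 10
>>> mel_min = 2595 * np.log10(1 + fmin / 700)
>>> mel_max = 2595 * np.log10(1 + fmax / 700)
>>> mel_points = np.linspace(mel_min, mel_max, num_filters)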
- 
soundpy.dsp.sinosoidal_liftering(mfccs, cep_lifter=22)[source]¶
- Reduces influence of higher coefficients; found useful in automatic speech rec. - Parameters
- mfccs ( - np.ndarray [shape=(num_samples,- num_mfcc)]) – The matrix containing mel-frequency cepstral coefficients.
- cep_lifter ( - int) – The amount to apply- sinosoidal_liftering. (default 22)
 
 - References - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
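The cited tutorial implements this as a sinusoidal lift across the coefficient axis; a sketch of that approach (the assumption here is that the library follows the tutorial):

>>> import numpy as np
>>> mfccs = np.random.random((100, 13))          # (num_samples, num_mfcc)
>>> cep_lifter = 22
>>> n = np.arange(mfccs.shape[1])
>>> lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n / cep_lifter)
>>> mfccs_liftered = mfccs * lift                # lift applied to each frame's coefficients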
- 
soundpy.dsp.index_at_zero(samples, num_dec_places=2)[source]¶
- Finds indices of start and end of utterance, given amplitude strength. - Parameters
- samples ( - numpy.ndarray [size= (num_samples,)or- (num_samples,- num_channels)]) – The samples to index where the zeros surrounding speech are located.
- num_dec_places ( - int) – The number of decimal places to which the lowest value in samples should be rounded. (default 2)
 
- Returns
 - Examples - >>> signal = np.array([-1, 0, 1, 2, 3, 2, 1, 0, -1, -2, -3, -2, -1, 0, 1]) >>> zero_1, zero_2 = index_at_zero(signal) >>> # +1 to include zero_2 in signal >>> signal[zero_1:zero_2+1] [ 0 1 2 3 2 1 0 -1 -2 -3 -2 -1 0] >>> # does not assume a zero precedes any sample >>> signal = np.array([1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1]) >>> zero_1, zero_2 = index_at_zero(signal) >>> signal[zero_1:zero_2+1] [ 0 -1 -2 -1 0] 
- 
soundpy.dsp.clip_at_zero(samples, samp_win=None, neg2pos=True, **kwargs)[source]¶
- Clips the signal at samples close to zero. - Clipping occurs at samples where the signal crosses the zero line from negative to positive. This allows for a smoother transition of audio, especially when concatenating audio. - Parameters
- samples ( - np.ndarray [shape = (num_samples,- )or- (num_samples,- num_channels)]) – The array containing sample data. Should work on stereo sound.
- start_with_zero ( - bool) – If True, the returned array will begin with 0 (or close to 0). Otherwise the array will end with 0.
- neg2pos ( - bool) – If True, the returned array will begin with positive values and end with negative values. Otherwise, the array will be returned with the first zeros detected, regardless of surrounding positive or negative values.
- samp_win ( - int, optional) – The window of samples to apply when clipping at zero crossings. The zero crossings adjacent to the main signal will be used. This is useful to remove already existing clicks within the signal, often found at the beginning and / or end of signals.
- kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.dsp.index_at_zero.
 
 - Warning - If only one zero found. - Examples - >>> sig = np.array([-2,-1,0,1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1,0]) >>> clip_at_zero(sig) # defaults [ 0 1 2 1 0 -1 -2 -1 0] >>> # finds first and last instance of zeros, regardless of surrounding >>> # negative or positive values in signal >>> clip_at_zero(sig, neg2pos = False) [ 0 1 2 1 0 -1 -2 -1 0 1 2 1 0] >>> # avoid clicks at start of signal >>> sig = np.array([0,-10,-20,-1,0,1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1,0]) >>> clip_at_zero(sig, samp_win = 5) [ 0 1 2 1 0 -1 -2 -1 0] 
- 
soundpy.dsp.remove_dc_bias(samples, samp_win=None)[source]¶
- Removes DC bias by subtracting mean from sample data. - Seems to work best without samp_win. - # TODO add moving average? - Parameters
- samples ( - np.ndarray [shape=(samples,- num_channels)or- (samples)]) – The sample data to center around zero. This works on both mono and stereo data.
- samp_win ( - int, optional) – Apply subtraction of mean at windows - experimental. (default None)
 
- Returns
- samps – The samples with zero mean. 
- Return type
- np.ndarray [shape=(samples,- num_channels)or- (samples)]
 - References - Lyons, Richard. (2011). Understanding Digital Signal Processing (3rd Edition). 
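The core operation is mean subtraction; a quick sketch of the idea in plain NumPy (not the library call itself):

>>> import numpy as np
>>> samples = np.array([0.5, 1.5, 0.5, 1.5])   # dc bias of 1.0
>>> samples - samples.mean()                   # centered around zero
array([-0.5,  0.5, -0.5,  0.5])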
- 
soundpy.dsp.apply_num_channels(sound_data, num_channels)[source]¶
- Ensures data has indicated num_channels. - To increase number of channels, the first column will be duplicated. To limit channels, channels will simply be removed. - Parameters
- sound_data ( - np.ndarray [size= (num_samples,)or- (num_samples,- num_channels)]) – The data to adjust the number of channels
- num_channels ( - int) – The number of channels desired
 
- Returns
- data 
- Return type
- np.ndarray [size = (num_samples,- num_channels)]
 - Examples - >>> import numpy as np >>> data = np.array([1, 1, 1, 1]) >>> data_3d = apply_num_channels(data, 3) >>> data_3d array([[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]) >>> data_2d = apply_num_channels(data_3d, 2) >>> data_2d array([[1, 1], [1, 1], [1, 1], [1, 1]]) 
- 
soundpy.dsp.apply_sample_length(data, target_len, mirror_sound=False, clip_at_zero=True)[source]¶
- Extends a sound by repeating it until it reaches target_len. If target_len is shorter than the length of data, data will be shortened to the specified target_len. - This is perhaps useful when working with repetitive or stationary sounds. - Parameters
- data ( - np.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]) – The data to be checked or extended in length. If shape (num_channels, num_samples), the data will be reshaped to (num_samples, num_channels).
- target_len ( - int) – The length of samples the input data should be.
 
- Returns
- new_data 
- Return type
- np.ndarray [size=(target_len,- )or- (target_len,- num_channels)]
 - Examples - >>> import numpy as np >>> data = np.array([1,2,3,4]) >>> sp.dsp.apply_sample_length(data, 12) array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]) >>> # two channels >>> data = np.zeros((3,2)) >>> data[:,0] = np.array([0,1,2]) >>> data[:,1] = np.array([1,2,3]) >>> data array([[0., 1.], [1., 2.], [2., 3.]]) >>> sp.dsp.apply_sample_length(data,5) array([[0., 1.], [1., 2.], [2., 3.], [0., 1.], [1., 2.]]) 
- 
soundpy.dsp.zeropad_sound(data, target_len, sr, delay_sec=None)[source]¶
- If the sound data needs to be a certain length, zero pad it. - Parameters
- data ( - numpy.ndarray [size = (num_samples,)or- (num_samples,- num_channels)]) – The sound data that needs zero padding.
- target_len ( - int) – The number of samples the data should have
- sr ( - int) – The samplerate of the data
- delay_sec ( - int,- float, optional) – If the data should be zero padded also at the beginning. (default None)
 
- Returns
- signal_zeropadded – The data zero padded. 
- Return type
- numpy.ndarray [size = (target_len,)or- (target_len,- num_channels)]
 - Examples - >>> import numpy as np >>> x = np.array([1,2,3,4]) >>> # with 1 second delay (with sr of 4, that makes 4 sample delay) >>> x_zeropadded = zeropad_sound(x, target_len=10, sr=4, delay_sec=1) >>> x_zeropadded array([0., 0., 0., 0., 1., 2., 3., 4., 0., 0.]) >>> # without delay >>> x_zeropadded = zeropad_sound(x, target_len=10, sr=4) >>> x_zeropadded array([1., 2., 3., 4., 0., 0., 0., 0., 0., 0.]) >>> # if signal is longer than desired length: >>> x_zeropadded = zeropad_sound(x, target_len=3, sr=4) UserWarning: The signal cannot be zeropadded and will instead be truncated as length of `data` is 4 and `target_len` is 3. len(data), target_len)) >>> x_zeropadded array([1, 2, 3]) 
- 
soundpy.dsp.combine_sounds(file1, file2, match2shortest=True, time_delay_sec=None, total_dur_sec=None)[source]¶
- Combines sounds - Parameters
- file1 ( - str) – One of two files to be added together
- file2 ( - str) – Second of two files to be added together
- match2shortest ( - bool) – If the lengths of the addition should be limited by the shorter sound. (default True)
- time_delay_sec ( - int,- float, optional) – The amount of time in seconds before the sounds are added together. The longer sound will play for this period of time before the shorter sound is added to it. (default None)
- total_dur_sec ( - int,- float, optional) – The total duration in seconds of the combined sounds. (default None)
 
- Returns
- added_sound ( - numpy.ndarray) – The sound samples of the two soundfiles added together
- sr1 ( - int) – The sample rate of the original signals and added sound
 
 
- 
soundpy.dsp.calc_frame_length(dur_frame_millisec, sr)[source]¶
- Calculates the number of samples necessary for each frame - Parameters
- Returns
- frame_length – the number of samples necessary to fill a frame 
- Return type
 - Examples - >>> calc_frame_length(dur_frame_millisec=20, sr=1000) 20 >>> calc_frame_length(dur_frame_millisec=20, sr=48000) 960 >>> calc_frame_length(dur_frame_millisec=25.5, sr=22500) 573 
- 
soundpy.dsp.calc_num_overlap_samples(samples_per_frame, percent_overlap)[source]¶
- Calculate the number of samples that constitute the overlap of frames - Parameters
- samples_per_frame ( - int) – The number of samples in each frame.
- percent_overlap ( - int,- float) – The percentage of overlap between frames; may be given as a fraction (e.g. 0.5) or as a percentage (e.g. 50), as in the examples below.
- Returns
- num_overlap_samples – the number of samples in the overlap 
- Return type
 - Examples - >>> calc_num_overlap_samples(samples_per_frame=100,percent_overlap=0.10) 10 >>> calc_num_overlap_samples(samples_per_frame=100,percent_overlap=10) 10 >>> calc_num_overlap_samples(samples_per_frame=960,percent_overlap=0.5) 480 >>> calc_num_overlap_samples(samples_per_frame=960,percent_overlap=75) 720 
- 
soundpy.dsp.calc_num_subframes(tot_samples, frame_length, overlap_samples, zeropad=False)[source]¶
- Assigns total frames needed to process entire noise or target series - This function calculates the number of full frames that can be created given the total number of samples, the number of samples in each frame, and the number of overlapping samples. - Parameters
- tot_samples ( - int) – total number of samples in the entire series
- frame_length ( - int) – total number of samples in each frame / processing window
- overlap_samples ( - int) – number of samples in overlap between frames
- zeropad ( - bool, optional) – If False, number of subframes limited to full frames. If True, number of subframes extended to zeropad the last partial frame. (default False)
 
- Returns
- subframes – The number of subframes necessary to fully process the audio samples at given frame_length, overlap_samples, and zeropad. 
- Return type
 - Examples - >>> calc_num_subframes(30,10,5) 5 >>> calc_num_subframes(30,20,5) 3 
- 
soundpy.dsp.create_window(window_type, frame_length)[source]¶
- Creates window according to set window type and frame length. - The Hamming window tapers edges to around 0.08 while the Hann window tapers edges to 0.0. Both are commonly used in noise filtering. - Parameters
- window_type ( - str) – type of window to be applied (default ‘hamming’)
- frame_length ( - int) – the number of samples the window should contain
- Returns
- window – a window fitted to the specified frame_length 
- Return type
- ndarray
 - Examples - >>> #create Hamming window >>> hamm_win = create_window('hamming', frame_length=5) >>> hamm_win array([0.08, 0.54, 1. , 0.54, 0.08]) >>> #create Hann window >>> hann_win = create_window('hann',frame_length=5) >>> hann_win array([0. , 0.5, 1. , 0.5, 0. ]) 
- 
soundpy.dsp.apply_window(samples, window, zeropad=False)[source]¶
- Applies predefined window to a section of samples. Works on mono or stereo sound. - The samples must be the same length as the window. - Parameters
- samples ( - ndarray [shape=(num_samples,)or- (num_samples,- num_channels)]) – series of samples with the length of input window
- window ( - ndarray [shape=(num_samples,)or- (num_samples,- num_channels)]) – window to be applied to the signal. If the window does not match the number of channels in the sample data, the window’s first channel will be repeated to fill the missing channels.
 
- Returns
- samples_win – series with tapered sides according to the window provided 
- Return type
- ndarray
 - Examples - >>> import numpy as np >>> input_signal = np.array([ 0. , 0.36371897, -0.302721, ... -0.1117662 , 0.3957433 ]) >>> window_hamming = np.array([0.08, 0.54, 1. , 0.54, 0.08]) >>> apply_window(input_signal, window_hamming) array([ 0. , 0.19640824, -0.302721 , -0.06035375, 0.03165946]) >>> window_hann = np.array([0. , 0.5, 1. , 0.5, 0. ]) >>> apply_window(input_signal, window_hann) array([ 0. , 0.18185948, -0.302721 , -0.0558831 , 0. ]) 
- 
soundpy.dsp.add_channels(samples, channels_total)[source]¶
- Copies columns of samples to create additional channels. - Parameters
- samples ( - np.ndarray [shape=(num_samples)or- (num_samples,num_channels)]) – The samples to add channels to.
- channels_total ( - int) – The total number of channels desired. For example, if samples already has 2 channels and you want it to have 3, set channels_total to 3.
 
- Returns
- x – A copy of samples with desired number of channels. 
- Return type
- np.ndarray [shape = (num_samples,- channels_total)]
 - Examples - >>> import numpy as np >>> samps_mono = np.array([1,2,3,4,5]) >>> samps_stereo2 = add_channels(samps_mono, 2) >>> samps_stereo2 array([[1, 1], ... [2, 2], ... [3, 3], ... [4, 4], ... [5, 5]]) >>> samps_stereo5 = add_channels(samps_stereo2, 5) >>> samps_stereo5 array([[1, 1, 1, 1, 1], ... [2, 2, 2, 2, 2], ... [3, 3, 3, 3, 3], ... [4, 4, 4, 4, 4], ... [5, 5, 5, 5, 5]]) - Warning - If channels_total is less than or equal to the number of channels already present in samples. No channels are added in those cases. 
- 
soundpy.dsp.average_channels(data)[source]¶
- Averages all channels in a stereo signal into one channel. - Parameters
- data ( - np.ndarray [size=(num_samples,- num_channels)]) – The stereo data to average out. If mono data supplied, mono data is returned unchanged.
- Returns
- data averaged – Copy of data averaged into one channel. 
- Return type
- np.ndarray [size=(num_samples)]
 - Examples - >>> import numpy as np >>> input_samples1 = np.array([1,2,3,4,5]) >>> input_samples2 = np.array([1,1,3,3,5]) >>> input_2channels = np.vstack((input_samples1, input_samples2)).T >>> input_averaged = average_channels(input_2channels) >>> input_averaged array([1. , 1.5, 3. , 3.5, 5. ]) 
- 
soundpy.dsp.calc_fft(signal_section, real_signal=None, fft_bins=None, **kwargs)[source]¶
- Calculates the fast Fourier transform of a time series. Should work with stereo signals. - The length of the signal_section determines the number of frequency bins analyzed if fft_bins not set. Therefore, if there are higher frequencies in the signal, the length of the signal_section should be long enough to accommodate those frequencies. - The frequency bins with energy levels at around zero denote frequencies not prevalent in the signal; the frequency bins with prevalent energy levels relate to the frequencies, and their amplitudes, that are present in the signal. - Parameters
- signal_section ( - ndarray [shape = (num_samples)or- (num_samples,- num_channels)]) – the series that the fft will be applied to. If stereo sound, will return a FFT for each channel.
- real_signal ( - bool) – If True, only half of the fft will be returned (the fft is mirrored). Otherwise the full fft will be returned.
- kwargs ( - additional keyword arguments) – keyword arguments for numpy.fft.fft or numpy.fft.rfft
 
- Returns
- fft_vals – the series transformed into the frequency domain with the same shape as the input series 
- Return type
- ndarray [shape=(num_fft_bins), or- (num_fft_bins,- num_channels),- dtype=np.complex_]
 
- 
soundpy.dsp.calc_power(fft_vals)[source]¶
- Calculates the power of fft values - Parameters
- fft_vals ( - ndarray (complexor- floats)) – the fft values of a windowed section of a series
- Returns
- power_spec – the squared absolute value of the input fft values, normalized by the length of the input (note the division by 3 in the example below) 
- Return type
- ndarray
 - Example - >>> import numpy as np >>> matrix = np.array([[1,1,1],[2j,2j,2j],[-3,-3,-3]], ... dtype=np.complex_) >>> calc_power(matrix) array([[0.33333333, 0.33333333, 0.33333333], [1.33333333, 1.33333333, 1.33333333], [3. , 3. , 3. ]]) 
- 
soundpy.dsp.calc_average_power(matrix, num_iters)[source]¶
- Divides matrix values by the number of times power values were added. - This function assumes the power values of n-number of series were calculated and added. It divides the values in the input matrix by n, i.e. ‘num_iters’. - Parameters
- matrix ( - ndarray) – a collection of floats or ints representing the sum of power values across several series sets
- num_iters ( - int) – an integer denoting the number of times power values were added to the input matrix
 
- Returns
- matrix – the averaged input matrix 
- Return type
- ndarray
 - Examples - >>> matrix = np.array([[6,6,6],[3,3,3],[1,1,1]]) >>> ave_matrix = calc_average_power(matrix, 3) >>> ave_matrix array([[2. , 2. , 2. ], [1. , 1. , 1. ], [0.33333333, 0.33333333, 0.33333333]]) 
- 
soundpy.dsp.calc_phase(fft_matrix, radians=False)[source]¶
- Calculates phase from complex fft values. - Parameters
- fft_matrix ( - np.ndarray [shape=(num_frames,- num_features),- dtype=complex]) – matrix with fft values
- radians ( - boolean) – If False, complex values are returned; if True, radians are returned. (default False)
 
- Returns
- phase – Phase values for fft_matrix. If radians is set to False, dtype = complex. If radians is set to True, dtype = float. 
- Return type
- np.ndarray [shape=(num_frames,- num_features)]
 - Examples - >>> import numpy as np >>> frame_length = 10 >>> time = np.arange(0, 10, 0.1) >>> signal = np.sin(time)[:frame_length] >>> fft_vals = np.fft.fft(signal) >>> phase = calc_phase(fft_vals, radians=False) >>> phase[:2] array([ 1. +0.j , -0.37872566+0.92550898j]) >>> phase = calc_phase(fft_vals, radians=True) >>> phase[:2] array([0. , 1.95921533]) 
- 
soundpy.dsp.reconstruct_whole_spectrum(band_reduced_noise_matrix, n_fft=None)[source]¶
- Reconstruct whole spectrum by mirroring complex conjugate of data. - Parameters
- band_reduced_noise_matrix ( - np.ndarray [size=(n_fft,),- dtype=np.floator- np.complex_]) – Matrix with either power or fft values of the left part of the fft. The whole fft can be provided; however the right values will be overwritten by a mirrored left side.
- n_fft ( - int, optional) – If None, n_fft set to length of band_reduced_noise_matrix. n_fft defines the size of the mirrored vector.
 
- Returns
- output_matrix – Mirrored vector of input data. 
- Return type
- np.ndarray [size = (n_fft,),- dtype=np.floator- np.complex_]
 - Examples - >>> x = np.array([3.,2.,1.,0.]) >>> # double the size of x >>> x_rec = sp.dsp.reconstruct_whole_spectrum(x, n_fft=int(len(x)*2)) >>> x_rec array([3., 2., 1., 0., 0., 1., 2., 3.]) >>> # overwrite right side of data >>> x = np.array([3.,2.,1.,0.,0.,2.,3.,5.]) >>> x_rec = sp.dsp.reconstruct_whole_spectrum(x, n_fft=len(x)) >>> x_rec array([3., 2., 1., 0., 0., 1., 2., 3.]) 
- 
soundpy.dsp.apply_original_phase(spectrum, phase)[source]¶
- Multiplies the phase with the magnitude or power spectrum - Parameters
- spectrum ( - np.ndarray [shape=(n,),- dtype=np.floator- np.complex]) – Magnitude or power spectrum
- phase ( - np.ndarray [shape=(n,),- dtype=np.floator- np.complex]) – Phase to be applied to spectrum
 
- Returns
- spectrum_complex 
- Return type
- np.ndarray [shape=(n,),- dtype = np.complex]
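A sketch of the underlying idea in plain NumPy: multiplying a magnitude spectrum with the complex phase (as from calc_phase with radians=False) recovers the complex spectrum. This illustrates the concept, not the library's exact implementation:

>>> import numpy as np
>>> fft_vals = np.fft.fft(np.sin(np.arange(10)))
>>> magnitude = np.abs(fft_vals)
>>> phase = fft_vals / np.abs(fft_vals)   # complex phase values
>>> np.allclose(magnitude * phase, fft_vals)
True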
 
- 
soundpy.dsp.calc_posteri_snr(target_power_spec, noise_power_spec)[source]¶
- Calculates and signal to noise ratio of current frame - Parameters
- target_power_spec ( - ndarray) – matrix with power values of the target signal
- noise_power_spec ( - ndarray) – matrix with power values of the noise signal
 
- Returns
- posteri_snr – matrix containing the signal to noise ratio 
- Return type
- ndarray
 - Examples - >>> sig_power = np.array([6,6,6,6]) >>> noise_power = np.array([2,2,2,2]) >>> calc_posteri_snr(sig_power, noise_power) array([3., 3., 3., 3.]) 
- 
soundpy.dsp.get_local_target_high_power(target_samples, sr, local_size_ms=25, min_power_percent=0.25)[source]¶
- 
soundpy.dsp.get_vad_snr(target_samples, noise_samples, sr, extend_window_ms=0)[source]¶
- Approximates the signal-to-noise ratio of two sets of power spectra - Note: this is a simple implementation and should not be used for official/exact measurement of snr. - Parameters
- target_samples ( - np.ndarray [size = (num_samples,- )]) – The samples of the main / speech signal. Only frames with higher levels of energy will be used to calculate SNR.
- noise_samples ( - np.ndarray [size = (num_samples,- )]) – The samples of background noise. Expects only noise, no speech. Must be the same sample rate as the target_samples
- sr ( - int) – The sample rate for the audio samples.
- local_size_ms ( - intor- float) – The length in milliseconds to calculate level of SNR. (default 25)
- min_power_percent ( - float) – The minimum percentage of energy / power the target samples should have. This is to look at only sections with speech or other signal of interest and not periods of silence. Value should be between 0 and 1. (default 0.25)
 
 - References - http://www1.icsi.berkeley.edu/Speech/faq/speechSNR.html - Gomolka, Ryszard. (2017). Re: How to measure signal-to-noise ratio (SNR) in real time?. Retrieved from: https://www.researchgate.net/post/How_to_measure_signal-to-noise_ratio_SNR_in_real_time/586a880f217e2060b65a8853/citation/download. - https://www.who.int/occupational_health/publications/noise1.pdf 
- 
soundpy.dsp.snr_adjustnoiselevel(target_samples, noise_samples, sr, snr)[source]¶
- Computes scale factor to adjust noise samples to achieve snr. - From script addnoise_asl_nseg.m: This function adds noise to a file at a specified SNR level. It uses the active speech level to compute the speech energy. The active speech level is computed as per ITU-T P.56 standard. - soundpy Note: this functionality was pulled from the MATLAB script: addnoise_asl_nseg.m at this GitHub repo: https://github.com/SIP-Lab/CNN-VAD/blob/master/Training%20Code/Functions/addnoise_asl_nseg.m - I do not understand all that went on to calculate the scale factor and therefore do not explain anything further than the original script. - Parameters
- target_samples ( - np.ndarray [size = (num_samples,)]) – The audio samples of the target / clean signal.
- noise_samples ( - np.ndarray [size = (num_samples,)]) – The audio samples of the noise signal.
- sr ( - int) – The sample rate of both target_samples and noise_samples
- snr ( - int) – The desired signal-to-noise ratio of the target and noise audio signals.
 
- Returns
- scale_factor – The factor by which noise samples should be multiplied before being added to target samples to achieve the desired SNR. 
- Return type
 - References - Yi Hu and Philipos C. Loizou, original authors
- Copyright (c) 2006 by Philipos C. Loizou 
- SIP-Lab/CNN-VAD/GitHub Repo
- Copyright (c) 2019 Signal and Image Processing Lab MIT License 
 - ITU-T (1993). Objective measurement of active speech level. ITU-T Recommendation P. 56 - See also 
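Example: a hedged usage sketch (a synthetic tone stands in for speech here; whether a pure tone satisfies the P.56 active-speech computation is untested):

>>> import numpy as np
>>> import soundpy as sp
>>> sr = 16000
>>> clean = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)   # 1 second, 200 Hz tone
>>> np.random.seed(0)
>>> noise = np.random.normal(0, 0.1, sr)
>>> factor = sp.dsp.snr_adjustnoiselevel(clean, noise, sr, snr=10)
>>> noisy = clean + factor * noise   # noise scaled toward the requested SNR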
- 
soundpy.dsp.asl_P56(samples, sr, bitdepth=16, smooth_factor=0.03, hangover=0.2, margin_db=15.9)[source]¶
- Computes the active speech level according to ITU-T P.56 standard. - Note: I don’t personally understand the functionality behind this function and therefore do not offer the best documentation as of yet. - Parameters
- samples ( - np.ndarray [size = (num_samples,- )]) – The audio samples, for example speech samples.
- sr ( - int) – The sample rate of samples.
- bitdepth ( - int) – The bitdepth of audio. Expects 16. (default 16)
- smooth_factor ( - float) – Time smoothing factor. (default 0.03)
- hangover ( - float) – Hangover. Thank goodness not the kind I’m familiar with. (default 0.2)
 
- Returns
 - References - ITU-T (1993). Objective measurement of active speech level. ITU-T Recommendation P. 56 - TODO handle bitdepth variation - what if not 16? TODO improve documentation 
- 
soundpy.dsp.calc_posteri_prime(posteri_snr)[source]¶
- Calculates the posteri prime - Parameters
- posteri_snr ( - ndarray) – The signal-to-noise ratio of the noisy signal, frame by frame.
- Returns
- posteri_prime – The primed posteri_snr, calculated according to the reference paper. 
- Return type
- ndarray
 - References - Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632. 
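In the cited paper, the 'prime' is the half-wave rectified a posteriori SNR minus one; a sketch of that definition (assumed, not confirmed, to match the library exactly):

>>> import numpy as np
>>> posteri_snr = np.array([3., 0.5, 1.])
>>> np.maximum(posteri_snr - 1, 0)   # negative values floored at zero
array([2., 0., 0.])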
- 
soundpy.dsp.calc_prior_snr(snr, snr_prime, smooth_factor=0.98, first_iter=None, gain=None)[source]¶
- Estimates the signal-to-noise ratio of the previous frame - Depending on the first_iter argument, the prior snr is calculated according to different algorithms. If first_iter is None, prior snr is calculated according to Scalart and Filho (1996); if first_iter is True or False, snr prior is calculated according to Loizou (2013). - Parameters
- snr ( - ndarray) – The signal-to-noise ratio of target vs noise power/energy levels.
- snr_prime ( - ndarray) – The prime of the snr (see Scalart & Filho (1996))
- smooth_factor ( - float) – The value applied to smooth the signal. (default 0.98)
- first_iter ( - None,- True,- False) – If None, snr prior values are estimated the same, no matter if it is the first iteration or not (Scalart & Filho (1996)) If True, snr prior values are estimated without gain (Loizou 2013) If False, snr prior values are estimated with gain (Loizou 2013) (default None)
- gain ( - None,- ndarray) – If None, gain will not be used. If gain, it is a previously calculated value from the previous frame. (default None)
 
- Returns
- prior_snr – Estimation of signal-to-noise ratio of the previous frame of target signal. 
- Return type
- ndarray
 - References - C Loizou, P. (2013). Speech Enhancement: Theory and Practice. - Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632. 
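A sketch of the decision-directed estimate these references describe, with gain and snr taken from the previous frame (how exactly first_iter toggles the gain term in the library is an assumption here):

>>> import numpy as np
>>> alpha = 0.98                        # smooth_factor
>>> snr = np.array([3., 3., 3.])        # previous frame's posteri snr
>>> snr_prime = np.array([2., 2., 2.])
>>> gain = np.array([0.75, 0.75, 0.75]) # previous frame's gain
>>> alpha * (gain ** 2) * snr + (1 - alpha) * snr_prime
array([1.69375, 1.69375, 1.69375])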
- 
soundpy.dsp.calc_gain(prior_snr)[source]¶
- Calculates the gain (i.e. attenuation) values to reduce noise. - Parameters
- prior_snr ( - ndarray) – The prior signal-to-noise ratio estimation
- Returns
- gain – An array of attenuation values to be applied to the signal (stft) array at the current frame. 
- Return type
- ndarray
 - References - C Loizou, P. (2013). Speech Enhancement: Theory and Practice. - Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632. 
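With the Wiener approach in these references, the gain follows prior_snr / (1 + prior_snr); a sketch of that relation (assumed, not confirmed, to be the library's exact formula):

>>> import numpy as np
>>> prior_snr = np.array([3., 1., 0.])
>>> prior_snr / (1 + prior_snr)   # high snr -> gain near 1; low snr -> strong attenuation
array([0.75, 0.5 , 0.  ])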
- 
soundpy.dsp.apply_gain_fft(fft_vals, gain)[source]¶
- Reduces noise by applying gain values to the stft / fft array of the target signal - Parameters
- fft_vals ( - ndarray(complex)) – Matrix containing complex values (i.e. stft values) of target signal
- gain ( - ndarray(real)) – Matrix containing calculated attenuation values to apply to ‘fft_vals’
 
- Returns
- enhanced_fft – Matrix with attenuated noise in target (stft) values 
- Return type
- ndarray(complex)
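Applying the gain amounts to an element-wise multiplication of the complex fft/stft values; a minimal sketch of the idea:

>>> import numpy as np
>>> fft_vals = np.array([1 + 1j, 2 + 0j, 0 - 3j])
>>> gain = np.array([0.5, 1.0, 0.1])
>>> enhanced_fft = fft_vals * gain   # array([0.5+0.5j, 2.+0.j, 0.-0.3j])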
 
- 
soundpy.dsp.postfilter(original_powerspec, noisereduced_powerspec, gain, threshold=0.4, scale=10)[source]¶
- Apply filter that reduces musical noise resulting from another filter. - If it is estimated that speech (or target signal) is present, reduced filtering is applied. - References - T. Esch and P. Vary, “Efficient musical noise suppression for speech enhancement system,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, 2009. 
- 
soundpy.dsp.calc_ifft(signal_section, real_signal=None, norm=False)[source]¶
- Calculates the inverse fft of a series of fft values - The real values of the ifft can be saved as an audiofile - Parameters
- signal_section ( - ndarray [shape=(num_freq_bins,)]) – The frame of fft values to apply the inverse fft to
- num_fft ( - int, optional) – The number of total fft values applied when calculating the original fft. If not given, length of signal_section is used.
- norm ( - bool) – Whether or not the ifft should apply ‘ortho’ normalization (default False)
 
- Returns
- ifft_vals – The inverse Fourier transform of filtered audio data 
- Return type
- ndarray(complex)
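The underlying round trip in plain NumPy — the real part of the ifft of fft values recovers the time-domain samples:

>>> import numpy as np
>>> sig = np.array([1., 2., 3., 4.])
>>> np.allclose(np.fft.ifft(np.fft.fft(sig)).real, sig)
True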
 
- 
soundpy.dsp.control_volume(samples, max_limit)[source]¶
- Keeps max volume of samples to within a specified range. - Parameters
- samples ( - ndarray) – series of audio samples
- max_limit ( - float) – maximum boundary of the maximum value of the audio samples
 
- Returns
- samples – samples with volume adjusted (if need be). 
- Return type
- np.ndarray
 - Examples - >>> import numpy as np >>> #low volume example: increase volume to desired window >>> x = np.array([-0.03, 0.04, -0.05, 0.02]) >>> x = control_volume(x, max_limit=0.25) >>> x array([-0.13888889, 0.25 , -0.25 , 0.13888889]) >>> #high volume example: decrease volume to desired window >>> y = np.array([-0.3, 0.4, -0.5, 0.2]) >>> y = control_volume(y, max_limit=0.15) >>> y array([-0.08333333, 0.15 , -0.15 , 0.08333333]) 
- 
soundpy.dsp.calc_power_ratio(original_powerspec, noisereduced_powerspec)[source]¶
- Calc. the ratio of original vs noise reduced power spectrum. 
- 
soundpy.dsp.calc_noise_frame_len(SNR_decision, threshold, scale)[source]¶
- Calc. window length for calculating moving average. - Note: lower SNRs require larger window. 
- 
soundpy.dsp.calc_linear_impulse(noise_frame_len, num_freq_bins)[source]¶
- Calc. the post filter coefficients to be applied to gain values. 
- 
soundpy.dsp.spread_volumes(samples, vol_list=[0.1, 0.3, 0.5])[source]¶
- Returns samples with a range of volumes. - This may be useful for augmenting (transforming) training data. - Parameters
- samples ( - ndarray) – Series belonging to acoustic signal.
- vol_list ( - list) – List of floats or ints representing the volumes the samples are to be oriented towards. (default [0.1,0.3,0.5])
 
- Returns
- volrange_dict – Dictionary containing copies of the samples set at the various volumes in vol_list. 
- Return type
 
- 
soundpy.dsp.create_empty_matrix(shape, complex_vals=False)[source]¶
- Allows creation of a matrix filled with real or complex zeros. - In digital signal processing, complex numbers are common; it is important to note that if complex_vals=False and complex values are inserted into the matrix, the imaginary part will be removed. - Parameters
- Returns
- matrix – a matrix filled with real or complex zeros 
- Return type
- ndarray
 - Examples - >>> matrix = create_empty_matrix((3,4)) >>> matrix array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]) >>> matrix_complex = create_empty_matrix((3,4),complex_vals=True) >>> matrix_complex array([[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j], [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j], [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]]) >>> vector = create_empty_matrix(5,) >>> vector array([0., 0., 0., 0., 0.]) 
- 
soundpy.dsp.overlap_add(enhanced_matrix, frame_length, overlap, complex_vals=False)[source]¶
- Overlaps and adds windowed sections together to form 1D signal. - Parameters
- Returns
- new_signal – Length equals (frame_length - overlap) * enhanced_matrix.shape[1] + overlap 
- Return type
- np.ndarray [shape=(new_signal_length,),- dtype=float]
 - Examples - >>> import numpy as np >>> enhanced_matrix = np.ones((4, 4)) >>> frame_length = 4 >>> overlap = 1 >>> sig = overlap_add(enhanced_matrix, frame_length, overlap) >>> sig array([1., 1., 1., 2., 1., 1., 2., 1., 1., 2., 1., 1., 1.]) 
- 
soundpy.dsp.random_selection_samples(samples, len_section_samps, wrap=False, random_seed=None, axis=0)[source]¶
- Selects a section of samples, starting at random. - Parameters
- samples ( - np.ndarray [shape = (num_samples,- )]) – The array of sample data
- len_section_samps ( - int) – How many samples should be randomly selected
- wrap ( - bool) – If False, the selected section will not be wrapped from end to beginning; if True, the random selection may take a sound sample that wraps from the end to the beginning. See examples below. (default False)
- random_seed ( - int, optional) – If replicated randomization desired. (default None)
 
 - Examples - >>> import numpy as np >>> # no wrap: >>> x = np.array([1,2,3,4,5,6,7,8,9,10]) >>> n = sp.dsp.random_selection_samples(x, len_section_samps = 7, ... wrap = False, random_seed = 40) >>> n array([3, 4, 5, 6, 7, 8, 9]) >>> # with wrap: >>> n = sp.dsp.random_selection_samples(x, len_section_samps = 7, ... wrap = True, random_seed = 40) >>> n array([ 7, 8, 9, 10, 1, 2, 3]) 
- 
soundpy.dsp.get_pitch(sound, sr=16000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', **kwargs)[source]¶
- Approximates pitch by collecting dominant frequencies of signal. 
- 
soundpy.dsp.get_mean_freq(sound, sr=16000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', percent_vad=0.75)[source]¶
- Takes the mean of dominant frequencies of voice activated regions in a signal. - Note: Silences discarded. - The average fundamental frequency for a male voice is 125Hz; for a female voice it’s 200Hz; and for a child’s voice, 300Hz. (Russell, J., 2020) - References - Russell, James (2020) The Human Voice and the Frequency Range. Retrieved from: https://blog.accusonus.com/pro-audio-production/human-voice-frequency-range/ 
- 
soundpy.dsp.vad(sound, sr, win_size_ms=50, percent_overlap=0, real_signal=False, fft_bins=None, window='hann', energy_thresh=40, freq_thresh=185, sfm_thresh=5, min_energy=None, min_freq=None, min_sfm=None, use_beg_ms=120)[source]¶
- Warning: this VAD works best with sample rates above 44100 Hz. - Parameters
 - References - Moattar and M. M. Homayounpour, “A simple but efficient real-time Voice Activity Detection algorithm,” 2009 17th European Signal Processing Conference, Glasgow, 2009, pp. 2549-2553. 
 
 
- 
soundpy.dsp.sound_index(speech_energy, speech_energy_mean, start=True)[source]¶
- Identifies the index of where speech or energy starts or ends. 
- 
soundpy.dsp.get_dom_freq(power_values)[source]¶
- If real_signal (i.e. half fft bins), might mess up values. 
- 
soundpy.dsp.short_term_energy(signal_windowed)[source]¶
- Expects - signalto be scaled (-1, 1) as well as windowed.- References 
- 
soundpy.dsp.bilinear_warp(fft_value, alpha)[source]¶
- Subfunction for vocal tract length perturbation. - See also - References - Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria. 
- 
soundpy.dsp.piecewise_linear_warp(fft_value, alpha, max_freq)[source]¶
- Subfunction for vocal tract length perturbation. - See also - References - Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria. 
- 
soundpy.dsp.f0_approximation(sound, sr, low_freq=50, high_freq=300, **kwargs)[source]¶
- Approximates fundamental frequency. - Limits the stft of voice active sections to frequencies between low_freq and high_freq and takes the mean of the dominant frequencies within that range. Defaults are set at 50 and 300 as most human speech frequencies occur between 85 and 255 Hz. - References 
Filters: Wiener and Band Spectral Subtraction¶
Filters module covers functions related to the filtering out of noise of a target signal.
- 
class soundpy.filters.FilterSettings(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]¶
- Bases: - object- Basic settings for filter related classes to inherit from. - 
sr¶
- Desired sampling rate of audio; audio will be resampled to match if audio has other sampling rate. (default 48000) - Type
 
 - 
frame_length¶
- Number of audio samples in each frame: frame_dur multiplied with sr, divided by 1000. (default 960) - Type
 
 - 
overlap_length¶
- Number of overlapping audio samples between subsequent frames: frame_length multiplied by percent_overlap, floored. (default 480) - Type
 
 - 
num_fft_bins¶
- The number of frequency bins used when calculating the fft. Currently the frame_length is used to set num_fft_bins. - Type
 
 - 
zeropad¶
- If False, only full frames of audio data are processed. If True, the last partial frame will be zeropadded. (default False) - Type
- bool, optional
 
 - Methods - get_window()- Returns window according to set window type and frame length. 
- 
- 
class soundpy.filters.Filter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶
- Bases: - soundpy.filters.FilterSettings- Interactive class to explore Wiener filter settings on audio signals. - These class methods implement research-based algorithms with low computational cost, aimed at noise reduction via mobile phone. - 
beta¶
- Value applied in Wiener filter that smooths the application of ‘gain’; default set according to previous research. (default 0.98) - Type
 
 - 
first_iter¶
- Keeps track if first_iter is relevant in filtering. If True, filtering has just started, and calculations made for filtering cannot use information from previous frames; if False, calculations for filtering use information from previous frames; if None, no difference is applied when processing the 1st vs subsequent frames. (default None) - Type
- bool, optional
 
 - 
target_subframes¶
- The number of total subsections within the total number of samples belonging to the target signal (i.e. audiofile being filtered). Until target_subframes is calculated, it is set to None. (default None) 
 - 
noise_subframes¶
- The number of total subsections within the total number of samples belonging to the noise signal. If noise power spectrum is used, this doesn’t need to be calculated. Until noise_subframes is calculated, it is set to None. (default None) 
 - 
gain¶
- Once calculated, the attenuation values to be applied to the fft for noise reduction. Until calculated, None. (default None) - Type
- ndarray,- None
 
 - Methods - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window according to set window type and frame length. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - 
set_volume(samples, max_vol=0.4, min_vol=0.15)[source]¶
- Records and limits the maximum amplitude of original samples. - This enables the output wave to be within a range of volume that does not go below or too far above the original maximum amplitude of the signal. - Parameters
- samples ( - ndarray) – The original samples of a signal (1 dimensional), of any length
- max_vol ( - float) – The maximum volume level. If a signal has values higher than this number, the signal is curtailed to remain at and below this number.
- min_vol ( - float) – The minimum volume level. If a signal has only values lower than this number, the signal is amplified to be at this number and below.
 
- Returns
- Return type
 
 
- 
- 
class soundpy.filters.WienerFilter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]¶
- Bases: - soundpy.filters.Filter- Methods - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window according to set window type and frame length. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - apply_postfilter - apply_wienerfilter 
- 
class soundpy.filters.BandSubtraction(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶
- Bases: - soundpy.filters.Filter- Methods - Calculate over subtraction factor used in the cited paper. - calc_relevant_band(target_powspec)- Calculates band with highest energy levels. - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window according to set window type and frame length. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - Provides starting and ending frequency bins/indices for each band. - update_posteri_bands(target_powspec, …)- Updates SNR of each set of bands. - apply_bandspecsub - apply_floor - apply_postfilter - sub_noise - 
setup_bands()[source]¶
- Provides starting and ending frequency bins/indices for each band. - Parameters
- self ( - class) – Contains variables num_bands (if None, set to 6) and frame_length
- Returns
- Sets the class variables band_start_freq and band_end_freq. 
- Return type
 - Examples - >>> import soundpy as sp >>> import numpy as np >>> # Default is set to 6 bands: >>> fil = sp.BandSubtraction() >>> fil.setup_bands() >>> fil.band_start_freq array([ 0., 80., 160., 240., 320., 400.]) >>> fil.band_end_freq array([ 80., 160., 240., 320., 400., 480.]) >>> # change default settings >>> fil = sp.BandSubtraction(num_bands=5) >>> fil.setup_bands() >>> fil.band_start_freq array([ 0., 96., 192., 288., 384.]) >>> fil.band_end_freq array([ 96., 192., 288., 384., 480.]) 
 - 
update_posteri_bands(target_powspec, noise_powspec)[source]¶
- Updates SNR of each set of bands. - MATLAB code from speech enhancement book uses power, puts it into magnitude (via square root), then puts it back into power..? And uses some sort of ‘norm’ function… which I think is actually just the sum. Original equation can be found in the paper below. page 117 from book? - paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. - I am using power for the time being. - Examples - >>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for space: >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with and without noise >>> time = np.arange(0, 10, 0.01) >>> signal = np.sin(time)[:fil.frame_length] >>> np.random.seed(0) >>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960) >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2 >>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy) >>> fil.snr_bands array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895]) >>> # compare with no noise in signal: >>> fil.update_posteri_bands(powerspec_clean, powerspec_clean) >>> fil.snr_bands array([0., 0., 0., 0.]) 
 - 
calc_oversub_factor()[source]¶
- Calculate over subtraction factor used in the cited paper. - Uses decibel SNR values calculated in update_posteri_bands() - paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. - Examples - >>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for space: >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with and without noise >>> time = np.arange(0, 10, 0.01) >>> signal = np.sin(time)[:fil.frame_length] >>> np.random.seed(0) >>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960) >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2 >>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy) >>> fil.snr_bands array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895]) >>> a = fil.calc_oversub_factor() >>> a array([4.28678354, 4.75 , 4.75 , 4.75 ]) >>> # compare with no noise in signal: >>> fil.update_posteri_bands(powerspec_clean, powerspec_clean) >>> fil.snr_bands array([0., 0., 0., 0.]) >>> a = fil.calc_oversub_factor() >>> a array([4., 4., 4., 4.]) 
 - 
calc_relevant_band(target_powspec)[source]¶
- Calculates band with highest energy levels. - Parameters
- self ( - class instance) – Contains class variables band_start_freq and band_end_freq.
- target_powerspec ( - np.ndarray) – Power spectrum of the target signal.
 
- Returns
- rel_band_index ( - int) – Index for which band contains the most energy.
- band_energy_matrix ( - np.ndarray [size=(num_bands,- ),- dtype=np.float]) – Power levels of each band.
 
 - Examples - >>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for this example (default is 6): >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with and with frequency 25 >>> time = np.arange(0, 10, 0.01) >>> full_circle = 2 * np.pi >>> freq = 25 >>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length] >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean) >>> rel_band_index 2 >>> # and with frequency 50 >>> freq = 50 >>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length] >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean) >>> rel_band_index 3 
 
- 
- 
class soundpy.filters.FilterSettings(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]¶
- Bases: - object- Basic settings for filter related classes to inherit from. - 
sr¶
- Desired sampling rate of audio; audio will be resampled to match if audio has other sampling rate. (default 48000) - Type
 
 - 
frame_length¶
- Number of audio samples in each frame: frame_dur multiplied with sr, divided by 1000. (default 960) - Type
 
 - 
overlap_length¶
- Number of overlapping audio samples between subsequent frames: frame_length multiplied by percent_overlap, floored. (default 480) - Type
 
 - 
num_fft_bins¶
- The number of frequency bins used when calculating the fft. Currently the frame_length is used to set num_fft_bins. - Type
 
 - 
zeropad¶
- If False, only full frames of audio data are processed. If True, the last partial frame will be zeropadded. (default False) - Type
- bool, optional
 
 - Methods - Returns window acc. 
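The attribute arithmetic can be checked directly. A minimal sketch, assuming the defaults above (sr of 48000, the 20 ms window implied by frame_length 960, and percent_overlap of 0.5):

>>> sr = 48000
>>> win_size_ms = 20          # assumed window duration implied by frame_length 960
>>> percent_overlap = 0.5
>>> frame_length = win_size_ms * sr // 1000
>>> frame_length
960
>>> overlap_length = int(frame_length * percent_overlap)
>>> overlap_length
480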
- 
- 
class soundpy.filters.Filter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶
- Bases: - soundpy.filters.FilterSettings- Interactive class to explore Wiener filter settings on audio signals. - These class methods implement research-based algorithms with low computational cost, aimed at noise reduction on mobile devices. - 
beta¶
- Value applied in Wiener filter that smooths the application of ‘gain’; default set according to previous research. (default 0.98) - Type
 
 - 
first_iter¶
- Tracks whether the current frame is the first to be filtered. If True, filtering has just started and calculations cannot use information from previous frames; if False, calculations use information from previous frames; if None, the first frame is processed no differently from subsequent frames. (default None) - Type
- bool, optional
 
 - 
target_subframes¶
- The number of total subsections within the total number of samples belonging to the target signal (i.e. audiofile being filtered). Until target_subframes is calculated, it is set to None. (default None) 
 - 
noise_subframes¶
- The number of total subsections within the total number of samples belonging to the noise signal. If noise power spectrum is used, this doesn’t need to be calculated. Until noise_subframes is calculated, it is set to None. (default None) 
 - 
gain¶
- Once calculated, the attenuation values to be applied to the fft for noise reduction. Until calculated, None. (default None) - Type
- ndarray,- None
 
 - Methods - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window acc. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - 
__init__(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶
- Initialize self. See help(type(self)) for accurate signature. 
 - 
set_volume(samples, max_vol=0.4, min_vol=0.15)[source]¶
- Records and limits the maximum amplitude of original samples. - This keeps the output wave within a volume range that neither falls below nor rises too far above the original maximum amplitude of the signal. - Parameters
- samples ( - ndarray) – The original samples of a signal (1 dimensional), of any length
- max_vol ( - float) – The maximum volume level. If a signal has values higher than this number, the signal is curtailed to remain at and below this number.
- min_vol ( - float) – The minimum volume level. If a signal has only values lower than this number, the signal is amplified to be at this number and below.
 
- Returns
- Return type
 
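To make the clamping behavior concrete, here is a rough sketch of the logic described above; this is a hypothetical helper for illustration, not the method's actual implementation:

>>> import numpy as np
>>> def clamp_volume(samples, max_vol=0.4, min_vol=0.15):
...     # hypothetical illustration of the documented behavior
...     peak = np.max(np.abs(samples))
...     if peak == 0:
...         return samples
...     if peak > max_vol:
...         samples = samples * (max_vol / peak)   # curtail loud signals
...     elif peak < min_vol:
...         samples = samples * (min_vol / peak)   # amplify quiet signals
...     return samples
>>> loud = np.array([0.0, 0.9, -0.8])
>>> quieter = clamp_volume(loud)        # peak scaled down to max_vol (0.4)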
 
- 
- 
class soundpy.filters.WienerFilter(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]¶
- Bases: - soundpy.filters.Filter- Methods - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window acc. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - apply_postfilter - apply_wienerfilter 
- 
class soundpy.filters.BandSubtraction(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶
- Bases: - soundpy.filters.Filter- Methods - Calculate over-subtraction factor used in the cited paper. - calc_relevant_band(target_powspec)- Calculates band with highest energy levels. - check_volume(samples)- ensures volume of filtered signal is within the bounds of the original - get_samples(audiofile[, dur_sec])- Load signal and save original volume - get_window()- Returns window acc. - set_num_subframes(len_samples[, is_noise, …])- Sets the number of target or noise subframes available for processing - set_volume(samples[, max_vol, min_vol])- Records and limits the maximum amplitude of original samples. - Provides starting and ending frequency bins/indices for each band. - update_posteri_bands(target_powspec, …)- Updates SNR of each set of bands. - apply_bandspecsub - apply_floor - apply_postfilter - sub_noise - 
__init__(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶
- Initialize self. See help(type(self)) for accurate signature. 
 - 
setup_bands()[source]¶
- Provides starting and ending frequency bins/indices for each band. - Parameters
- self ( - class) – Contains variables num_bands (if None, set to 6) and frame_length
- Returns
- Sets the class variables band_start_freq and band_end_freq. 
- Return type
- Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # Default is set to 6 bands:
>>> fil = sp.BandSubtraction()
>>> fil.setup_bands()
>>> fil.band_start_freq
array([ 0., 80., 160., 240., 320., 400.])
>>> fil.band_end_freq
array([ 80., 160., 240., 320., 400., 480.])
>>> # change default settings
>>> fil = sp.BandSubtraction(num_bands=5)
>>> fil.setup_bands()
>>> fil.band_start_freq
array([ 0., 96., 192., 288., 384.])
>>> fil.band_end_freq
array([ 96., 192., 288., 384., 480.])
 - 
update_posteri_bands(target_powspec, noise_powspec)[source]¶
- Updates SNR of each set of bands. - Note: the MATLAB code from the referenced speech enhancement book converts power to magnitude (via square root) and then back into power, and applies a 'norm' function that appears to simply be the sum. The original equation can be found in the paper below (around page 117 of the book). This implementation uses power throughout. - paper: Kamath, S. D. & Loizou, P. C. (2002), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. - Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal), np.mean(signal)+0.3, 960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
 - 
calc_oversub_factor()[source]¶
- Calculate over-subtraction factor used in the cited paper. - Uses decibel SNR values calculated in update_posteri_bands(). - paper: Kamath, S. D. & Loizou, P. C. (2002), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. - Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal), np.mean(signal)+0.3, 960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> a = fil.calc_oversub_factor()
>>> a
array([4.28678354, 4.75      , 4.75      , 4.75      ])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
>>> a = fil.calc_oversub_factor()
>>> a
array([4., 4., 4., 4.])
 - 
calc_relevant_band(target_powspec)[source]¶
- Calculates band with highest energy levels. - Parameters
- self ( - class instance) – Contains class variables band_start_freq and band_end_freq.
- target_powerspec ( - np.ndarray) – Power spectrum of the target signal.
 
- Returns
- rel_band_index ( - int) – Index for which band contains the most energy.
- band_energy_matrix ( - np.ndarray [size=(num_bands,- ),- dtype=np.float]) – Power levels of each band.
 
- Examples

>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for this example (default is 6):
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with frequency 25
>>> time = np.arange(0, 10, 0.01)
>>> full_circle = 2 * np.pi
>>> freq = 25
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
2
>>> # and with frequency 50
>>> freq = 50
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
3
 
- 
Extract and manipulate audio features¶
Feats module includes functions related to converting audio sample data to features for analysis, filtering, machine learning, or visualization.
- 
soundpy.feats.plot(feature_matrix, feature_type, save_pic=False, name4pic=None, energy_scale='power_to_db', title=None, sr=None, win_size_ms=None, percent_overlap=None, x_label=None, y_label=None, subprocess=False, overwrite=False)[source]¶
- Visualize feature extraction; frames on the x axis, features on the y axis. Uses librosa to scale the data if an energy scale is applied. - Note: multiple channels are only supported if feature_type is 'signal'. For other feature types, the plot will not work as expected. - Parameters
- feature_matrix ( - np.ndarray [shape=(num_samples,) or (num_samples, num_channels), or (num_features, num_frames), dtype=np.float]) – Matrix of features. If the features are not of type 'signal' and the shape is 1D, a dimension will be added so the features can be plotted with a colormesh.
- feature_type ( - str) – Options: 'signal', 'stft', 'mfcc', or 'fbank' features, or whatever the user would like to name the feature set. signal: the 1D samples of sound; stft: short-time Fourier transform; mfcc: mel frequency cepstral coefficients; fbank: mel-log filterbank energies. (default 'fbank')
- save_pic ( - bool) – True to save image as .png; False to just plot it. If- subprocessis True, save_pic will automatically be set to True.
- name4pic ( - str, optional) – If save_pic set to True, the name the image should be saved under.
- energy_scale ( - str, optional) – If features need to be adjusted, e.g. from power to decibels. Default is ‘power_to_db’.
- title ( - str, optional) – The title for the graph. If None, feature_type is used.
- sr ( - int, optional) – Useful in plotting the time for features.
- win_size_ms ( - int,- float, optional) – Useful in plotting the time for features in the frequency domain (e.g. STFT, FBANK, MFCC features)
- percent_overlap ( - int,- float, optional) – Useful in plotting the time for features in the frequency domain (e.g. STFT, FBANK, MFCC features)
- x_label ( - str, optional) – The label to be applied to the x axis.
- y_label ( - str, optional) – The label to be applied to the y axis.
- subprocess ( - bool) – If- subprocessis True, matplotlib will use backend ‘Agg’, which only allows plots to be saved. If- subprocessis False, the default backend ‘TkAgg’ will be used, which allows plots to be generated live as well as saved. The ‘Agg’ backend is useful if one wants to visualize sound while a main process is being performed, for example, while a model is being trained. (default False)
- overwrite ( - bool) – If False, if .png file already exists under given name, a date tag will be added to the .png filename to avoid overwriting the file. (default False)
 
- Returns
- Return type
 
- 
soundpy.feats.plotsound(audiodata, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, fft_bins=None, num_filters=40, num_mfcc=40, sr=None, save_pic=False, name4pic=None, energy_scale='power_to_db', mono=None, real_signal=False, **kwargs)[source]¶
- Visualize feature extraction depending on the set parameters. - Stereo sound can be graphed. If feature_type is 'signal', all channels will be graphed on the same plot; otherwise, each channel will be plotted separately. - Parameters
- audiodata ( - str,- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.
- feature_type ( - str) – Options: ‘signal’, ‘mfcc’, or ‘fbank’ features. MFCC: mel frequency cepstral coefficients; FBANK: mel-log filterbank energies (default ‘fbank’)
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 20)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- num_filters ( - int) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. (default 40)
- num_mfcc ( - int) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. Note: it is not possible to choose only 2-13 or 13-40; if num_mfcc is set to 40, all 40 coefficients will be included. (default 40).
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)
- mono ( - bool, optional) – When loading an audiofile, True will limit number of channels to one; False will allow more channels to be loaded. (default None, which results in mono channel loading.)
- **kwargs ( - additional keyword arguments) – Keyword arguments for soundpy.feats.plot
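A minimal usage sketch with a generated tone; it assumes plotsound is exposed at the package level, as other functions are in the examples throughout these docs:

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 16000
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 440 * time)       # hypothetical 440 Hz test tone
>>> sp.plotsound(signal, sr=sr, feature_type='signal')
>>> sp.plotsound(signal, sr=sr, feature_type='fbank', num_filters=40)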
 
 
- 
soundpy.feats.get_feats(sound, sr=None, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, window='hann', fft_bins=None, num_filters=None, num_mfcc=None, remove_first_coefficient=False, sinosoidal_liftering=False, dur_sec=None, mono=None, rate_of_change=False, rate_of_acceleration=False, subtract_mean=False, real_signal=True, fmin=None, fmax=None, zeropad=True)[source]¶
- Collects raw signal data, stft, fbank, or mfcc features. - Parameters
- sound ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)
- feature_type ( - str) – Options include 'signal', 'stft', 'powspec', 'fbank', or 'mfcc' data (default 'fbank'). signal: energy/amplitude measurements along time; stft: short-time Fourier transform; powspec: power spectrum (absolute value of stft, squared); fbank: mel-log filterbank energies; mfcc: mel frequency cepstral coefficients.
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 20)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- window ( - stror- np.ndarray [size (n_fft,- )]) – The window function to be applied to each window. (Default ‘hann’)
- fft_bins ( - int) – Number of frequency bins to apply in fast Fourier transform. (default None)
- num_filters ( - int) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. If None, will be set to 40. (default None)
- num_mfcc ( - int) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. If None, will be set to num_filters or 40. (default None).
- dur_sec ( - float, optional) – Time in seconds to limit in loading a signal. (default None)
- mono ( - bool, optional) – For loading an audiofile, True will result in only one channel of data being loaded; False will allow additional channels be loaded. (default None, which results in mono channel data)
- rate_of_change ( - bool) – If True, the first derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.
- rate_of_acceleration ( - bool) – If True, the second derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.
- subtract_mean ( - bool) – If True, the mean of each feature column will be subtracted from each row. This is applicable for all feature types except ‘signal’.
 
- Returns
- feats – Feature data. If feature_type is ‘signal’, returns a tuple containing samples and sampling rate. If feature_type is of another type, returns np.ndarray with shape (num_frames, num_filters/features) 
- Return type
- tuple (num_samples,- sr)or- np.ndarray [size (num_frames,- num_filters) dtype=np.floator- np.complex]
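For orientation, a minimal usage sketch; a generated tone stands in for a real file, and the shape comment follows from the defaults above:

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 16000
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 200 * time)       # 1 second test tone
>>> feats = sp.feats.get_feats(signal, sr=sr, feature_type='fbank',
...                            win_size_ms=20, percent_overlap=0.5)
>>> feats.shape      # roughly (num_frames, 40) with the default 40 filters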
 
- 
soundpy.feats.load_feat_settings(feat_settings_dict)[source]¶
- Loads feature settings into a named tuple, setting defaults where values are not present. TODO: test with previous version 
- 
soundpy.feats.get_stft(sound, sr=22050, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', zeropad=True, **kwargs)[source]¶
- Returns short-time Fourier transform matrix. - This function allows more flexibility in number of fft_bins and real_signal settings. Additionally, this does not require the package librosa, making it a bit easier to manipulate if desired. For an example, see - soundpy.augment.vtlp.- Parameters
- sound ( - np.ndarray [shape=(num_samples,)or- (num_samples,- num_channels)],- str, or- pathlib.PosixPath) – If type np.ndarray, expect raw samples in mono or stereo sound. If type str or pathlib.PosixPath, expect pathway to audio file.
- sr ( - int) – The sample rate of sound.
- win_size_ms ( - int,- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- real_signal ( - bool) – If True, only half of the FFT spectrum will be used; since the FFT of a real signal is symmetrical, there should be no difference in content, and setting real_signal to True may make processing more efficient.
- fft_bins ( - int) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound.
 
- Returns
- stft_matrix 
- Return type
- np.ndarray[size=(num_frames,- fft_bins)]
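A brief sketch of typical usage on generated samples; the shape follows the Returns section above:

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 22050
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 440 * time)
>>> stft_matrix = sp.feats.get_stft(signal, sr=sr, win_size_ms=50,
...                                 percent_overlap=0.5, fft_bins=1024)
>>> stft_matrix.shape      # (num_frames, fft_bins), complex-valued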
 
- 
soundpy.feats.get_fbank(sound, sr, num_filters, fmin=None, fmax=None, fft_bins=None, **kwargs)[source]¶
- Extract mel-filterbank energy features from audio. - Parameters
- sound ( - np.ndarray [size=(num_samples,)or- (num_samples,- num_features)],- str, or- pathlib.PosixPath) – Sound in raw samples, a power spectrum, or a short-time-fourier-transform. If type string or pathlib.PosixPath, expect pathway to audio file.
- sr ( - int) – The sample rate of sound.
- num_filters ( - int) – The number of mel-filters to use when extracting mel-filterbank energies.
- fmin ( - intor- float, optional) – The minimum frequency of interest. If None, will be set to 0. (default None)
- fmax ( - intor- float, optional) – The maximum frequency of interest. If None, will be set to half of sr. (default None)
- fft_bins ( - int, optional) – The number of frequency bins / fast Fourier transform bins used in calculating the fast Fourier transform. If None, set depending on type of parameter sound. If sound is a raw signal or audio pathway, fft_bins will be set to 1024; if sound is a STFT or power spectrum, fft_bins will be set to 2 * length of sound feature column, or 2 * sound.shape[1].
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_stft.
 
- Returns
- fbank – The mel-filterbank energies extracted. The number of samples depends on the parameters applied in - soundpy.feats.get_stft.
- Return type
- np.ndarray [shape=(num_samples,- num_filters)]
 - References - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
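A minimal usage sketch, passing raw samples so the STFT is computed internally (generated tone, illustrative only):

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 16000
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 300 * time)
>>> fbank = sp.feats.get_fbank(signal, sr=sr, num_filters=40)
>>> fbank.shape      # (num_frames, 40)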
- 
soundpy.feats.get_mfcc(sound, sr, num_mfcc, remove_first_coefficient=False, sinosoidal_liftering=False, **kwargs)[source]¶
- Extracts mel-frequency cepstral coefficients from audio. - Parameters
- sound ( - np.ndarray [size=(num_samples,)or- (num_samples,- num_features)]or- stror- pathlib.PosixPath) – If sound is a np.ndarray, expected as raw samples, a power spectrum or a short-time Fourier transform. If string or pathlib.PosixPath, should be the pathway to the audio file.
- sr ( - int) – The sample rate of the sound.
- num_mfcc ( - int) – The number of mel-frequency cepstral coefficients
- remove_first_coefficient ( - bool) – If True, the first coefficient, representing amplitude or volume of signal, is removed. Found to sometimes improve automatic speech recognition. (default False)
- sinosoidal_liftering ( - bool) – If True, reduces influence of higher coefficients, found to aid in handling noise in background in automatic speech recognition. (default False)
- **kwargs ( - additional keyword arguments) – Keyword arguments for soundpy.feats.get_fbank()
 
 - References - Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html 
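A minimal usage sketch mirroring the one for get_fbank above (generated tone, illustrative only):

>>> import soundpy as sp
>>> import numpy as np
>>> sr = 16000
>>> time = np.arange(0, 1, 1/sr)
>>> signal = np.sin(2 * np.pi * 300 * time)
>>> mfcc = sp.feats.get_mfcc(signal, sr=sr, num_mfcc=13)
>>> mfcc.shape      # (num_frames, 13)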
- 
soundpy.feats.get_vad_stft(sound, sr=48000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, zeropad=True, **kwargs)[source]¶
- Returns STFT matrix and VAD matrix. STFT matrix contains only VAD sections. - Parameters
- sound ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default 48000)
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- real_signal ( - bool) – If True, only half of the FFT spectrum will be used; since the FFT of a real signal is symmetrical, there should be no difference in content, and setting real_signal to True may make processing more efficient.
- fft_bins ( - int) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- use_beg_ms ( - int) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.
- extend_window_ms ( - int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.
- energy_thresh ( - int) – The threshold to set for measuring energy for VAD in the signal. (default 40)
- freq_thresh ( - int) – The threshold to set for measuring frequency for VAD in the signal. (default 185)
- sfm_thresh ( - int) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound
 
- Returns
- stft_matrix ( - np.ndarray [size=(num_frames_vad,- fft_bins//2+1),- dtype=np.complex_]) – The STFT matrix frames of where voice activity has been detected.
- vad_matrix_extwin ( - np.ndarray [size=(num_frames,)]) – A vector indicating, for each frame of the full STFT matrix, whether voice activity was detected.
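To sketch the intent, a tone embedded in near-silence stands in for speech below; whether frames are flagged as active depends on the thresholds above, so treat the outcome as illustrative:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> sr = 48000
>>> quiet = np.random.normal(0, 0.01, sr)         # low-level background noise
>>> time = np.arange(0, 1, 1/sr)
>>> tone = np.sin(2 * np.pi * 300 * time) + np.random.normal(0, 0.01, sr)
>>> sound = np.concatenate([quiet, tone, quiet])
>>> stft_vad, vad_matrix = sp.feats.get_vad_stft(sound, sr=sr, use_beg_ms=120)
>>> # stft_vad keeps only the frames marked active in vad_matrix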
 
 
- 
soundpy.feats.get_stft_clipped(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]¶
- Returns STFT matrix and VAD matrix with beginning and ending silence removed. - Parameters
- samples ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- extend_window_ms ( - int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound.
 
- Returns
- stft_speech ( - np.ndarray [size (num_frames_clipped, - fft_bins//2+1)]) – The STFT of the samples with beginning and ending silences clipped.
- vad_matrix ( - np.ndarray [size (num_frames, - )]) – A vector with zeros and ones indicating which frames of the full STFT contain voice activity.
 
 
- 
soundpy.feats.get_vad_samples(sound, sr=None, win_size_ms=50, percent_overlap=0.5, use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, window='hann', zeropad=True, **kwargs)[source]¶
- Returns samples and VAD matrix. Only samples where voice activity was detected are returned. - Parameters
- sound ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- use_beg_ms ( - int) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.
- extend_window_ms ( - int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.
- energy_thresh ( - int) – The threshold to set for measuring energy for VAD in the signal. (default 40)
- freq_thresh ( - int) – The threshold to set for measuring frequency for VAD in the signal. (default 185)
- sfm_thresh ( - int) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound
 
- Returns
- samples_matrix ( - np.ndarray [size = (num_samples_vad,- )]) – The samples of where voice activity was detected.
- vad_matrix_extwin ( - np.ndarray [size = (num_frames,- )]) – A vector of zeros and ones indicating the frames / windows of the samples that either had voice activity or not.
 
 
- 
soundpy.feats.get_samples_clipped(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]¶
- Returns samples and VAD matrix with beginning and ending silence removed. - Parameters
- samples ( - stror- numpy.ndarray [size=(num_samples,)or- (num_samples,- num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.
- sr ( - int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.
- win_size_ms ( - intor- float) – Window length in milliseconds for Fourier transform to be applied (default 50)
- percent_overlap ( - intor- float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.
- extend_window_ms ( - int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired. (default 0)
- window ( - str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)
- zeropad ( - bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.files.loadsound.
 
- Returns
- samples_clipped ( - np.ndarray [size (num_samples_clipped, - )]) – The samples with beginning and ending silences clipped.
- vad_matrix ( - np.ndarray [size (num_frames, - )]) – A vector with zeros and ones indicating which frames of the full STFT contain voice activity.
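A usage sketch along the same lines as get_vad_stft above; the tone stands in for speech:

>>> import soundpy as sp
>>> import numpy as np
>>> np.random.seed(0)
>>> sr = 48000
>>> quiet = np.random.normal(0, 0.01, sr // 2)    # low-level background noise
>>> time = np.arange(0, 1, 1/sr)
>>> tone = np.sin(2 * np.pi * 300 * time) + np.random.normal(0, 0.01, sr)
>>> sound = np.concatenate([quiet, tone, quiet])
>>> clipped, vad_matrix = sp.feats.get_samples_clipped(sound, sr=sr)
>>> len(clipped) <= len(sound)       # leading / trailing silence removed
True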
 
 
- 
soundpy.feats.normalize(data, max_val=None, min_val=None)[source]¶
- Normalizes data to be between 0 and 1. Should not be applied to raw sample data. - This is useful if you have predetermined max and min values with which to normalize new data, for example when training models on sound features (not raw samples). - Parameters
- data ( - np.ndarray [size=(num_features,)or- (num_frames,num_features)]) – Data to be normalized.
- max_val ( - intor- float, optional) – Predetermined maximum value. If None, will use max value from data.
- min_val ( - intor- float, optional) – Predetermined minimum value. If None, will use min value from data.
 
- Returns
- normed_data 
- Return type
- np.ndarray [size = (num_features,)or- (num_frames,- num_features)]
- Examples

>>> # using the min and max of a previous dataset:
>>> import numpy as np
>>> np.random.seed(0)
>>> input_samples = np.random.random_sample((5,))
>>> input_samples
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ])
>>> np.random.seed(40)
>>> previous_samples = np.random.random_sample((5,))
>>> previous_samples
array([0.40768703, 0.05536604, 0.78853488, 0.28730518, 0.45035059])
>>> max_prev = np.max(previous_samples)
>>> min_prev = np.min(previous_samples)
>>> output_samples = normalize(input_samples, min_val = min_prev, max_val = max_prev)
>>> output_samples
array([0.67303388, 0.89996095, 0.74661839, 0.66767314, 0.50232462])
- 
soundpy.feats.plot_dom_freq(sound, energy_scale='power_to_db', title=None, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]¶
- Plots the approximate dominant frequency over a STFT plot of a signal. - If sound has multiple channels, the VAD for each channel is plotted in its own plot. - Parameters
- sound ( - np.ndarray [shape=(num_samples,)or- (num_samples,- num_channels)]) – The sound to plot the dominant frequency of.
- energy_scale ( - str) – The scale of energy for the plot. Features in the frequency domain are likely in a power scale and need to be converted to decibels. (default 'power_to_db')
- title ( - str) – The title for the plot. (default None)
- **kwargs ( - additional keyword arguments) – Keyword arguments used in both- soundpy.feats.get_stftand- soundpy.dsp.get_pitch.
 
- Returns
- Return type
 
- 
soundpy.feats.plot_vad(sound, energy_scale='power_to_db', title='Voice Activity', use_beg_ms=120, extend_window_ms=0, beg_end_clipped=True, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]¶
- Plots where voice (sound) activity detected on power spectrum. - This either plots immediately or saves the plot at name4pic. If sound has multiple channels, the VAD for each channel is plotted in its own plot. - Parameters
- sound ( - np.ndarray [shape=(num_samples,)or- (num_samples,- num_channels)]) – The sound to plot the VAD of.
- energy_scale ( - str) – If plotting STFT or power spectrum, will plot it in decibels. (default ‘power_to_db’)
- title ( - str) – The title of the plot (default ‘Voice Activity’)
- use_beg_ms ( - int) – The amount of noise to use at the beginning of the signal to measure VAD. This is only applied if beg_end_clipped is set to False.
- extend_window_ms ( - int) – The number of milliseconds VAD should be padded. This is useful if one wants to encompass more speech if the VAD is not including all the speech / desired sound. However, this may capture more noise. (default 0)
- beg_end_clipped ( - bool) – If True, just the silences at the beginning and end of the sample will be cut off. If False, VAD will be checked throughout the sample, not only at the beginning and end. NOTE: Both options have strengths and weaknesses. Checking VAD across the entire signal (i.e. beg_end_clipped set to False) is sometimes unreliable, failing to recognize speech in speech-filled samples; when set to True, some speech sounds tend to get ignored ('s', 'x' and other fricatives).
- save_pic ( - bool) – If True, the plot will be saved rather than plotted immediately.
- name4pic ( - str) – The full pathway and filename to save the picture (as .png file). A file extension is expected. (default None)
- overwrite ( - bool) – If False, a date tag will be added to name4pic if name4pic already exists. (default False)
- **kwargs ( - keyword arguments) – Additional keyword arguments for soundpy.feats.get_speech_stft or- soundpy.dsp.vad.
 
- Returns
- Return type
 
- 
soundpy.feats.get_change_acceleration_rate(spectro_data)[source]¶
- Gets first and second derivatives of spectral data. - This is useful particularly for speech recognition. - Parameters
- spectro_data ( - np.ndarray [shape = (num_samples,- num_features)]) –
- Returns
- delta ( - np.ndarray [shape = (num_samples,- num_features)]) – The first order derivative of spectral data. Reflects rate of change in signal.
- delta_delta ( - np.ndarray [shape = (num_samples,- num_features)]) – The second order derivative of spectral data. Reflects rate of acceleration in signal.
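As a sketch of what these derivatives look like in practice, computed here with librosa.feature.delta; whether this matches the internal computation exactly is an assumption:

>>> import numpy as np
>>> import librosa
>>> spectro_data = np.random.random((100, 40))    # (num_samples, num_features)
>>> delta = librosa.feature.delta(spectro_data, axis=0)                 # rate of change
>>> delta_delta = librosa.feature.delta(spectro_data, order=2, axis=0)  # acceleration
>>> delta.shape == delta_delta.shape == spectro_data.shape
True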
 
 
- 
soundpy.feats.get_mfcc_fbank(samples, feature_type='mfcc', sr=48000, win_size_ms=20, percent_overlap=0.5, num_filters=40, num_mfcc=40, fft_bins=None, window_function=None, zeropad=True, **kwargs)[source]¶
- Collects fbank or mfcc features via python-speech-features (rather than librosa). 
- 
soundpy.feats.zeropad_features(feats, desired_shape, complex_vals=False)[source]¶
- Applies zeropadding to a copy of feats. 
- 
soundpy.feats.reduce_num_features(feats, desired_shape)[source]¶
- Limits number features of a copy of feats. - This is useful if you want the features to be a certain size, for training models for example. 
- 
soundpy.feats.featshape_new_subframe(feature_matrix_shape, new_frame_size, zeropad=True, axis=0, include_dim_size_1=False)[source]¶
- Subdivides features from (num_frames, num_feats) to (new_frame_size, num_frames, num_feats) - Parameters
- feature_matrix_shape ( - tuple [size=(num_frames,- num_features)]) – Feature matrix shape to be subdivided. Can be multidimensional.
- new_frame_size ( - int) – The number of subframes to section axis into.
- zeropad ( - bool) – If True, frames that don’t completely fill a new_frame_size will be zeropadded. Otherwise, those frames will be discarded. (default True)
- axis ( - int) – The axis where the new_frame_size should be applied. (default 0)
 
- Returns
- new_shape 
- Return type
- tuple [size=(num_subframes,- new_frame_size,- num_feats)]
 
- 
soundpy.feats.apply_new_subframe(feature_matrix, new_frame_size, zeropad=True, axis=0)[source]¶
- Reshapes feature_matrix to allow for new_frame_size. - Note: Dimensions of feature_matrix must be at least 2 and can be up to 5, returning a matrix with one additional dimension. - Parameters
- feature_matrix ( - np.ndarray [size(num_frames,- num_features) ]) – Expects minimum 2D, maximum 5D matrix.
- new_frame_size ( - int) – The number of subframes to section axis into.
- axis ( - int) – The axis to apply the new_frame_size. (default 0)
- zeropad ( - bool) – If True, the feature_matrix will be zeropadded to include frames that do not fill entire frame_size, given the new_frame_size. If False, feature_matrix will not include the last zeropadded frame. (default True)
 
- Returns
- feats_reshaped – The feature_matrix with the chosen axis subdivided into two dimensions: the number of subframes and new_frame_size. 
- Return type
- np.ndarray [size(num_subframes,- new_frame_size,- num_features)]
- Raises
- ValueError if number of dimensions of feature_matrix is below 2 or exceeds 5. – 
- Examples

>>> import numpy as np
>>> matrix = np.arange(24).reshape(3,4,2)
>>> # apply new_frame_size to dimension of length 4 (i.e. axis 1)
>>> matrix_zp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1)
>>> matrix_zp.shape
(3, 2, 3, 2)
>>> matrix_zp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 0,  0],
         [ 0,  0]]],


       [[[ 8,  9],
         [10, 11],
         [12, 13]],

        [[14, 15],
         [ 0,  0],
         [ 0,  0]]],


       [[[16, 17],
         [18, 19],
         [20, 21]],

        [[22, 23],
         [ 0,  0],
         [ 0,  0]]]])
>>> matrix_nozp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1,
...                                  zeropad=False)
>>> matrix_nozp.shape
(3, 1, 3, 2)
>>> matrix_nozp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]]],


       [[[ 8,  9],
         [10, 11],
         [12, 13]]],


       [[[16, 17],
         [18, 19],
         [20, 21]]]])
 
- 
soundpy.feats.check_percent_overlap(percent_overlap)[source]¶
- Ensures percent_overlap is between 0 and 1. 
- 
soundpy.feats.separate_dependent_var(matrix)[source]¶
- Separates matrix into features and labels. Expects 3D array. - Assumes the last column of the last dimension of the matrix constitutes the dependent variable (labels), and all other columns the independent variables (features). Additionally, it is assumed that for each block of data, only one label is needed; therefore, just the first label is taken for each block. - Parameters
- matrix ( - numpy.ndarray [size = (num_samples,- num_frames,- num_features)]) – The matrix holds the numerical data to separate. num_features is expected to be at least 2.
- Returns
- X ( - numpy.ndarray [size = (num_samples,- num_frames,- num_features -1)]) – A matrix holding the (assumed) independent variables
- y ( - numpy.ndarray,- numpy.int64,- numpy.float64 [size = (num_samples,)]) – A vector holding the labels assigned to the independent variables. If only one value in array, just the value inside is returned
 
- Examples

>>> import numpy as np
>>> # vector
>>> separate_dependent_var(np.array([1,2,3,4]))
(array([1, 2, 3]), 4)
>>> # simple matrix
>>> matrix = np.arange(4).reshape(2,2)
>>> matrix
array([[0, 1],
       [2, 3]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[0],
       [2]])
>>> y
1
>>> # more complex matrix
>>> matrix = np.arange(20).reshape((2,2,5))
>>> matrix
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[[ 0,  1,  2,  3],
        [ 5,  6,  7,  8]],

       [[10, 11, 12, 13],
        [15, 16, 17, 18]]])
>>> y
array([ 4, 14])
- 
soundpy.feats.add_tensor(matrix)[source]¶
- Adds a tensor / dimension to the input ndarray (e.g. features). - Keras requires an extra dimension at some layers, which represents the 'tensor' encapsulating the data. - To clarify with the example below: the input matrix has shape (2,3,4). Think of it as 2 different events, each having 3 sets of measurements, with each of those having 4 features. For instance, measure differences between 2 cities at 3 different times of day (08:00, 14:00, and 19:00), recording 1) temperature, 2) wind speed, 3) light level, and 4) noise level. - Putting the measurements into a matrix with an added dimension/tensor highlights the separate measurements for the algorithm: these are 4 features from the same city, but they occur at different times. - Parameters
- matrix ( - numpy.ndarray) – The matrix holds the numerical data to add a dimension to.
- Returns
- matrix – The matrix with an additional dimension. 
- Return type
- numpy.ndarray
- Examples

>>> import numpy as np
>>> matrix = np.arange(24).reshape((2,3,4))
>>> matrix.shape
(2, 3, 4)
>>> matrix
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
>>> matrix_2 = add_tensor(matrix)
>>> matrix_2.shape
(2, 3, 4, 1)
>>> matrix_2
array([[[[ 0],
         [ 1],
         [ 2],
         [ 3]],

        [[ 4],
         [ 5],
         [ 6],
         [ 7]],

        [[ 8],
         [ 9],
         [10],
         [11]]],


       [[[12],
         [13],
         [14],
         [15]],

        [[16],
         [17],
         [18],
         [19]],

        [[20],
         [21],
         [22],
         [23]]]])
- 
soundpy.feats.scale_X_y(matrix, is_train=True, scalars=None)[source]¶
- Separates and scales data into X and y arrays. Adds dimension for keras. - Assumes the last column of the last dimension is the y or label data. - Parameters
- matrix ( - np.ndarray [size = (num_samples,- num_frames,- num_features)]) – Matrix with X and y data
- is_train ( - bool) – Relevant for the scalars parameter. If the data is training data (i.e. True), the scalars will be created. If the data is test data (i.e. False), the function expects scalars to be provided. (default True)
- scalars ( - dict, optional) – Dictionary with scalars to be applied to non-training data.
 
- Returns
- X ( - np.ndarray [size = (num_samples, - num_frames, - num_features-1, - 1)]) – Scaled features with extra dimension
- y ( - np.ndarray [size = (num_samples,- 1,- 1)]) – Scaled independent variable with extra dimension
- scalars ( - dict) – The scalars either created or previously loaded.
 
 
- 
soundpy.feats.get_feature_matrix_shape(sr=None, dur_sec=None, feature_type=None, win_size_ms=None, percent_overlap=None, fft_bins=None, num_mfcc=None, num_filters=None, rate_of_change=False, rate_of_acceleration=False, context_window=None, frames_per_sample=None, zeropad=True, labeled_data=False, remove_first_coefficient=False, real_signal=False, **kwargs)[source]¶
- Returns expected shapes of feature matrix depending on several parameters. - Parameters
- sr ( - int) – Sample rate of the audio to be extracted.
- dur_sec ( - int, - float) – The number of seconds of audio that feature extraction will be applied to.
- feature_type ( - str) – Accepted features include ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, ‘mfcc’. Which feature_type applied will influence the resulting shape of the feature matrix shape.
- win_size_ms ( - intor- float) – The size of the window the audio signal should be broken into. If feature_type is set to ‘signal’, this is irrelevant. Otherwise will raise TypeError if set to None.
- percent_overlap ( - float) – The amount of overlap between windows. If set to 0.5, the number of overlapping samples will be half the number of samples that make up win_size_ms.
- fft_bins ( - int) – The number of frequency bins to use when calculating the fast Fourier transform. If None, the calculated frame_length will be used.
- num_mfcc ( - int) – If extracting ‘mfcc’ features, the total number of coefficients expected.
- num_filters ( - int) – If extracting ‘fbank’ features, the total number of mel-filters to be applied.
- rate_of_change ( - bool) – If True, the first delta will be concatenated to features extracted.
- rate_of_acceleration ( - bool) – If True, the second delta will be concatenated to features extracted.
- context_window ( - int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts.
- frames_per_sample ( - int) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions; this functionality is available in generator functions as context_window. frames_per_sample equals 2 * context_window + 1. See - soundpy.models.dataprep.Generator
- zeropad ( - bool) – If True, windows and frames will be zeropadded to avoid losing any sample data.
- labeled_data ( - bool) – If True, a label will be added to the output shape of features.
- remove_first_coefficient ( - bool) – If True, the first mfcc coefficient will not be included in feature matrix.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_feats. These may not be used in this function as they may not influence the size of the feature matrix.
 
- Returns
 
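A rough sketch of a call; the exact structure of the returned shape information should be treated as an assumption:

>>> import soundpy as sp
>>> shape_info = sp.feats.get_feature_matrix_shape(sr=22050, dur_sec=1,
...     feature_type='fbank', win_size_ms=20, percent_overlap=0.5,
...     num_filters=40)
>>> # with a 20 ms window at 50% overlap over 1 second of audio, expect
>>> # on the order of 100 frames by 40 filters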
- 
soundpy.feats.visualize_feat_extraction(feats, iteration=None, dataset=None, label=None, datadir=None, subsections=False, **kwargs)[source]¶
- Saves plots of features during feature extraction or training of models. - Parameters
- feats ( - np.ndarray [shape=(num_samples,)or- (num_samples,- num_frames)or (num_frames,- num_features)or- (num_subsections,- num_frames,- num_features)]) – The extracted features can be raw signal data, stft, fbank, powspec, mfcc data, either as a single plot or subsectioned into batches / subframes.
- iteration ( - int, optional) – The iteration of the audio getting extracted; e.g. the 10th training item.
- dataset ( - str, optional) – The identifying string (for example ‘train’ , ‘val’, or ‘test’, but this can be anything).
- label ( - str,- int, optional) – The label of the audio file. Used in titles and filenames.
- datadir ( - str,- pathlib.PosixPath, optional) – The directory where related data is located. An ‘image’ directory will be created within this datadir where the saved plots will be stored. If None, will be created in current working directory.
- subsections ( - bool, optional) – To subsection raw ‘signal’ data into frames. For other features, this is easier to identify via the shape of- feats.
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_feats
 
- Returns
- Return type
 
- 
soundpy.feats.save_features_datasets(datasets_dict, datasets_path2save_dict, context_window=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, log_settings=True, decode_dict=None, random_seed=None, **kwargs)[source]¶
- Extracts and saves audio features, sectioned into datasets, to indicated locations. - If MemoryError, the provided dataset dicts will be adjusted to allow data to be subsectioned. - Parameters
- datasets_dict ( - dict) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {'train':['1.wav', '2.wav', '3.wav'], 'val': ['4.wav'], 'test':['5.wav']} for unlabeled data or {'train':[(0, '1.wav'), (1, '2.wav'), (0, '3.wav')], 'val': [(1, '4.wav')], 'test':[(0, '5.wav')]} for labeled data.
- datasets_path2save_dict ( - dict) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}
- context_window ( - int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts, to feed to a long short-term memory network (LSTM), for example. (Can avoid this by simply reshaping data later).
- frames_per_sample ( - int) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions. This equals 2 * context_window + 1
- labeled_data ( - bool) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.
- subsection_data ( - bool) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.
- divide_factor ( - int, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)
- visualize ( - bool) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)
- vis_every_n_frames ( - int) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)
- log_settings ( - bool) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)
- decode_dict ( - dict, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_feats.
 
- Returns
 - See also - soundpy.feats.get_feats
- Extract features from audio file or audio data. 
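A sketch of the expected plumbing, reusing the hypothetical file names from the datasets_dict example above; feature settings pass through **kwargs to soundpy.feats.get_feats:

>>> import soundpy as sp
>>> datasets_dict = {'train': ['1.wav', '2.wav', '3.wav'],
...                  'val': ['4.wav'],
...                  'test': ['5.wav']}
>>> datasets_path2save_dict = {'train': './data/train.npy',
...                            'val': './data/val.npy',
...                            'test': './data/test.npy'}
>>> sp.feats.save_features_datasets(datasets_dict, datasets_path2save_dict,
...     feature_type='fbank', sr=22050, dur_sec=1, win_size_ms=20,
...     percent_overlap=0.5)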
 
- 
soundpy.feats.save_features_datasets_zipfiles(datasets_dict, datasets_path2save_dict, extract_dir, dur_sec, feature_type='fbank', num_feats=None, sr=22050, win_size_ms=20, percent_overlap=0.5, n_fft=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, use_librosa=True, center=True, mode='reflect', log_settings=True, decode_dict=None, audiofile_lim=10, **kwargs)[source]¶
- Extracts and saves audio features, sectioned into datasets, to indicated locations. - If MemoryError, the provided dataset dicts will be adjusted to allow data to be subsectioned. - Parameters
- datasets_dict ( - dict) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {'train':['1.wav', '2.wav', '3.wav'], 'val': ['4.wav'], 'test':['5.wav']} for unlabeled data or {'train':[(0, '1.wav'), (1, '2.wav'), (0, '3.wav')], 'val': [(1, '4.wav')], 'test':[(0, '5.wav')]} for labeled data.
- datasets_path2save_dict ( - dict) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}
- feature_type ( - str) – String including only one of the following: 'signal', 'stft', 'powspec', 'fbank', or 'mfcc'; e.g. 'mfcc_noisy' or 'stft_train'. 'signal' currently only supports mono channel data (TODO: test for stereo). 'powspec' and 'stft' are basically the same; 'powspec' is the 'stft' without complex values and squared.
- sr ( - int) – The sample rate the audio data should be loaded with.
- n_fft ( - int) – The number of frequency bins used for the Fast Fourier Transform (fft)
- dur_sec ( - intor- float) – The desired duration of the audio data. This is used to calculate the size of the feature data and is therefore necessary, as audiofiles tend to differ in length. If audiofiles are longer or shorter, they will be cut or zeropadded respectively.
- num_feats ( - int) – The number of mfcc coefficients (mfcc), mel filters (fbank), or frequency bins (stft).
- win_size_ms ( - int) – The desired window size in milliseconds to process audio samples.
- percent_overlap ( - float) – The amount audio samples should overlap as each window is processed.
- frames_per_sample ( - int, optional) – If you want to section each audio file's feature data into smaller frames. This might be useful for speech-related contexts. (This can be avoided by simply reshaping the data later.)
- labeled_data ( - bool) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.
- subsection_data ( - bool) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.
- divide_factor ( - int, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)
- visualize ( - bool) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)
- vis_every_n_frames ( - int) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)
- use_librosa ( - bool) – If True, librosa is used to load and extract features. As of now, no other option is available (TODO: add other options); some elements of this function are specific to librosa. (default True)
- center ( - bool) – Relevant for librosa and feature extraction. (default True)
- mode ( - str) – Relevant for librosa and feature extraction. (default ‘reflect’)
- log_settings ( - bool) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)
- decode_dict ( - dict, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)
- **kwargs ( - additional keyword arguments) – Keyword arguments for- soundpy.feats.get_feats.
 
- Returns
 - See also - soundpy.feats.get_feats
- Extract features from audio file or audio data. 
 
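For example, a minimal sketch (the audiofiles and paths below are placeholders, following the datasets_dict example above):
>>> import soundpy
>>> datasets_dict = {'train': ['1.wav', '2.wav'], 'val': ['3.wav'], 'test': ['4.wav']}
>>> datasets_path2save_dict = {'train': './data/train.npy', 'val': './data/val.npy', 'test': './data/test.npy'}
>>> soundpy.feats.save_features_datasets_zipfiles(datasets_dict, datasets_path2save_dict, extract_dir='./data/extracted/', dur_sec=1, feature_type='fbank', num_feats=40, sr=22050)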
- 
soundpy.feats.prep_new_audiofeats(feats, desired_shape, input_shape)[source]¶
- Prepares new audio data to feed to a pre-trained model. - Parameters
- feats ( - np.ndarray [shape = (num_frames,- num_features)]) – The features to prepare for feeding to a model.
- desired_shape ( - tuple) – The expected number of samples necessary to fulfill the expected input_shape for the model. The- featswill be zeropadded or limited to match this desired_shape.
- input_shape ( - tuple) – The input_shape the model expects a single sample of data to be.
 
- Returns
- feats_reshaped – The features reshaped to what the model expects. 
- Return type
- np.ndarray [shape = (input_shape)]
 
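For example, a minimal sketch (the shapes are illustrative, chosen for a hypothetical model):
>>> import numpy as np
>>> import soundpy
>>> # hypothetical feature matrix: 50 frames x 40 features
>>> feats = np.random.rand(50, 40)
>>> # zeropad / limit the features to fit a model expecting (79, 40, 1)
>>> feats_ready = soundpy.feats.prep_new_audiofeats(feats, desired_shape=(79, 40), input_shape=(79, 40, 1))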
- 
soundpy.feats.feats2audio(feats, feature_type, sr, win_size_ms, percent_overlap, phase=None)[source]¶
- Prepares features into audio playable format. - Parameters
- feats ( - np.ndarray [shape = (num_frames,- num_feats)]) – If the features are a signal, [size = (batch_size * num_frames * num_features, 1)]. Otherwise [size = (batch_size * num_frames, num_features)].
- feature_type ( - str) – Either ‘stft’, ‘fbank’, ‘signal’, or ‘mfcc’. For the ‘signal’ feature, only mono channel is supported.
- sr ( - int) – Sampling rate that the features were extracted with
- win_size_ms ( - int) – The window size in milliseconds the features were extracted with
- percent_overlap ( - float) – The percent overlap between windows.
- phase ( - np.ndarray [shape = (num_frames, - num_feats)], optional) – The phase information of the original signal, used for reconstruction.
 
- Returns
- y – The reconstructed signal in samples. 
- Return type
- np.ndarray [shape = (num_samples,- )]
 
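For example, a sketch using stand-in data simply to show the call (in practice feats and phase would come from feature extraction, e.g. soundpy.feats.get_feats):
>>> import numpy as np
>>> import soundpy
>>> # stand-in complex STFT matrix: 20 frames x 221 frequency bins
>>> feats = np.random.rand(20, 221) + 1j * np.random.rand(20, 221)
>>> y = soundpy.feats.feats2audio(feats, feature_type='stft', sr=22050, win_size_ms=20, percent_overlap=0.5)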
Template deep neural networks¶
The models.template_models module contains functions for building (ideally research-based) models.
- 
soundpy.models.template_models.adjust_layers_cnn(**kwargs)[source]¶
- Reduces the number of CNN layers until the model can be built. - If the number of filters for ‘mfcc’ or ‘fbank’ features is in the lower range (e.g. 13 or so), the default settings of the CNN architecture cause issues: the architecture was built assuming at least 40 filters are applied during feature extraction. To deal with this problem, the number of CNN layers is reduced. - Parameters
- **kwargs ( - Keyword arguments) – Keyword arguments for soundpy.models.template_models.cnn_classifier
- Returns
- settings – Updated dictionary with relevant settings for model. 
- Return type
- dict
 - References 
- 
soundpy.models.template_models.cnn_classifier(feature_maps=[40, 20, 10], kernel_size=[(3, 3), (3, 3), (3, 3)], strides=2, activation_layer='relu', activation_output='softmax', input_shape=(79, 40, 1), num_labels=3, dense_hidden_units=100, dropout=0.25)[source]¶
- Build a single or multilayer convolutional neural network. - Parameters
- feature_maps ( - intor- list) – The filter or feature map applied to the data. One feature map per convolutional neural layer required. For example, a list of length 3 will result in a three-layer convolutional neural network.
- kernel_size ( - tuple or - list of - tuples) – Must match the number of feature_maps: the kernel size for each corresponding convolutional layer.
- strides ( - int) –
- activation_layer ( - str) – (default ‘relu’)
- activation_output ( - str) – (default ‘softmax’)
- input_shape ( - tuple) – The shape of the input
- dense_hidden_units ( - int, optional) –
- dropout ( - float, optional) – The dropout rate applied to reduce overfitting. (default 0.25)
 
- Returns
- model ( - tf.keras.Model) – Model ready to be compiled.
- settings ( - dict) – Dictionary with relevant settings for model.
 
 - Warning - If the number of features is not compatible with the number of layers, a warning is raised and the layers are adjusted. E.g. for a lower number of MFCC features this will likely be applied if the number of layers is greater than 1. - References - A. Sehgal and N. Kehtarnavaz, “A Convolutional Neural Network Smartphone App for Real-Time Voice Activity Detection,” in IEEE Access, vol. 6, pp. 9017-9026, 2018. 
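For example, building and compiling the default three-layer classifier (the optimizer and loss below are common choices, not prescribed by soundpy):
>>> import soundpy
>>> model, settings = soundpy.models.template_models.cnn_classifier(input_shape=(79, 40, 1), num_labels=3)
>>> # the model is returned uncompiled
>>> model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])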
- 
soundpy.models.template_models.autoencoder_denoise(input_shape, kernel_size=(3, 3), max_norm_value=2.0, activation_function_layer='relu', activation_function_output='sigmoid', padding='same', kernel_initializer='he_uniform')[source]¶
- Build a simple autoencoder denoiser. - Parameters
- Returns
- autoencoder – Model ready to be compiled 
- Return type
- tf.keras.Model
 - References - Versloot, Christian (2019, December 19). Creating a Signal Noise Removal Autoencoder with Keras. MachineCurve. https://www.machinecurve.com 
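For example (the input shape is illustrative; mean squared error is a typical loss for denoising, not prescribed by soundpy):
>>> import soundpy
>>> denoiser = soundpy.models.template_models.autoencoder_denoise(input_shape=(10, 11, 1))
>>> denoiser.compile(optimizer='adam', loss='mse')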
- 
soundpy.models.template_models.resnet50_classifier(input_shape, num_labels, activation='softmax', final_layer_name='features')[source]¶
- Simple image classifier built on top of a pretrained ResNet50 model. - References - Revay, S. & Teschke, M. (2019). Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals. arXiv:1905.04348 [cs.SD] 
- 
soundpy.models.template_models.cnnlstm_classifier(num_labels, input_shape, lstm_cells, feature_map_filters=32, kernel_size=(8, 4), pool_size=(3, 3), dense_hidden_units=60, activation_layer='relu', activation_output='softmax', dropout=0.25)[source]¶
- Model architecture inspired by the paper below. - References - Kim, Myungjong & Cao, Beiming & An, Kwanghoon & Wang, Jun. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. 10.21437/interspeech.2018-2250. 
Additional model setup (e.g. Early Stopping)¶
The models.modelsetup module contains functionality for preparing to train a model.
- 
soundpy.models.modelsetup.setup_layers(num_features, num_layers, kernel_shape=(3, 3), max_feature_map=64)[source]¶
- Sets up feature_maps and kernels for 1 or more layered convolutional neural networks. - Parameters
- num_features ( - int) – The number of features used to train the model. This will be used to set the number of feature_maps for each layer.
- num_layers ( - int) – The number of layers desired
- kernel_shape ( - tupleor- int) – The shape of the desired kernel
- max_feature_map ( - int) – The maximum size of feature map / filter. This depends on the system and is relevant for processing higher definition features, such as STFT features. If this is set too large given memory constraints, training may be ‘killed’.
 
- Returns
 - Warning - Raised if num_features is larger than max_feature_map: num_features is usually used to set the first feature map but, if too large, will be reduced to below max_feature_map. 
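A sketch of how this might pair with cnn_classifier; note the two return values are an assumption here (feature maps and kernel sizes per layer):
>>> import soundpy
>>> feature_maps, kernel_size = soundpy.models.modelsetup.setup_layers(num_features=40, num_layers=3, kernel_shape=(3, 3), max_feature_map=64)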
- 
soundpy.models.modelsetup.setup_callbacks(early_stop=True, patience=15, log=True, log_filename=None, append=True, save_bestmodel=True, best_modelname=None, monitor='val_loss', verbose=1, save_best_only=True, mode='min', tensorboard=True, write_images=False, x_test=None, y_test=None, batch_size=None, embedded_layer_name=None)[source]¶
- Easy set up of early stopping, model logging, and saving best model. - Parameters
- early_stop ( - bool) – Whether or not training should stop early if the model is not improving (default True)
- patience ( - int) – The number of epochs the model should complete without improvement before stopping training. (default 15)
- log ( - bool) – If true, the accuracy, loss, and (if possible) the val_accuracy and val_loss for each epoch will be saved in a .csv file. (default True)
- log_filename ( - str or - pathlib.PosixPath, optional) – The filename where the logging information will be stored. If None, the date will be used as a unique .csv filename in a subfolder ‘model_logs’ in the local directory.
- save_bestmodel ( - bool) – If True, the best performing model will be saved.
- best_modelname ( - stror- pathlib.PosixPath) – The name to save the best model version under. If None, the date will be used to create a unique .h5 filename and it will be saved in a subfolder ‘best_models’ in the local directory.
- monitor ( - str) – The metric to be used to measure model performance. (default ‘val_loss’)
- verbose ( - bool) – If True, the state of the model will be printed. (default True)
- save_best_only ( - bool) – If True, the best performing model will overwrite any previously saved ‘best model’.
- mode ( - str) – If monitor is set to ‘val_loss’, this should be set to ‘min’. If monitor is set to ‘val_acc’, this should be set to ‘max’. If mode is set to ‘auto’, the direction will be inferred. (default ‘min’)
- tensorboard ( - bool) – If True, logs for TensorBoard will be made.
 
- Returns
- callbacks – The callbacks ready to be applied to Keras model training. 
- Return type
- list of tf.keras.callbacks.Callback objects
 
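For example (tensorboard is switched off here to keep the sketch minimal):
>>> import soundpy
>>> callbacks = soundpy.models.modelsetup.setup_callbacks(early_stop=True, patience=15, monitor='val_loss', mode='min', tensorboard=False)
>>> # the callbacks can then be passed to Keras training, e.g.:
>>> # model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=callbacks)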
Feeding large datasets to models¶
The models.dataprep module covers functionality for feeding features to models.
- 
class soundpy.models.dataprep.Generator(data_matrix1, data_matrix2=None, timestep=None, axis_timestep=0, normalize=True, apply_log=False, context_window=None, axis_context_window=-2, labeled_data=False, gray2color=False, zeropad=True, desired_input_shape=None, combine_axes_0_1=False)[source]¶
- Bases: - object- Methods - Shapes, normalizes, and feeds data depending on labeled or non-labeled data. - 
__init__(data_matrix1, data_matrix2=None, timestep=None, axis_timestep=0, normalize=True, apply_log=False, context_window=None, axis_context_window=-2, labeled_data=False, gray2color=False, zeropad=True, desired_input_shape=None, combine_axes_0_1=False)[source]¶
- This generator pulls data out in sections (i.e. batch sizes). Prepared for 3-dimensional data. - Note: Keras adds a dimension to the input to represent the “Tensor” that handles the input. This means that sometimes you have to add a shape of (1,) to the shape of the data. - Parameters
- data_matrix1 ( - np.ndarray [size=(num_samples,- batch_size,- num_frames,- num_features)or- (num_samples,- num_frames,- num_features+label_column)]) – The training data. This can contain the feature and label data or just the input feature data.
- data_matrix2 ( - np.ndarray [size = (num_samples, ) or data_matrix1.shape], optional) – Either label data for data_matrix1 or, for example, the clean version of data_matrix1 if training an autoencoder. (default None)
- normalize ( - bool) – If False, the data has already been normalized and won’t be normalized by the generator. (default True)
- apply_log ( - bool) – If True, log will be applied to the data.
- timestep ( - int) – The number of frames to constitute a timestep.
- axis_timestep ( - int) – The axis to apply the timestep to. (default 0)
- context_window ( - int) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts.
- axis_context_window ( - int) – The axis to apply the context_window to, if context_window is not None. Ideally this should be the axis preceding the feature column. (default -2)
- zeropad ( - bool) – Whether or not features should be zeropadded in reshaping functions. (default True)
- desired_input_shape ( - intor- tuple, optional) – The desired number of features or shape of data to feed a neural network. If type int, only the last column of features will be adjusted (zeropadded or limited). If tuple, the entire data shape will be adjusted (all columns). If the int or shape is larger than that of the data provided, data will be zeropadded. If the int or shape is smaller, the data will be restricted. (default None)
 
 
 
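An illustrative sketch with random labeled data (the batch-yielding method is assumed here to be Generator.generator()):
>>> import numpy as np
>>> import soundpy
>>> # 10 samples, each 1 batch of 79 frames x 40 features, plus labels
>>> x_train = np.random.rand(10, 1, 79, 40)
>>> y_train = np.random.randint(0, 3, size=(10, 1))
>>> train_gen = soundpy.models.dataprep.Generator(data_matrix1=x_train, data_matrix2=y_train, labeled_data=True, desired_input_shape=(79, 40, 1))
>>> # assumed batch-yielding method:
>>> # model.fit(train_gen.generator(), steps_per_epoch=len(x_train))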
- 
class soundpy.models.dataprep.GeneratorFeatExtraction(datalist, datalist2=None, model_name=None, normalize=True, apply_log=False, randomize=True, random_seed=None, desired_input_shape=None, timestep=None, axis_timestep=0, context_window=None, axis_context_window=-2, batch_size=1, gray2color=False, visualize=False, vis_every_n_items=50, visuals_dir=None, decode_dict=None, dataset='train', augment_dict=None, label_silence=False, vad_start_end=False, **kwargs)[source]¶
- Bases: - soundpy.models.dataprep.Generator- Methods - Extracts features and feeds them to model according to desired_input_shape. 
- 
soundpy.models.dataprep.randomize_augs(aug_dict, random_seed=None)[source]¶
- Creates a copy of the dict and randomly chooses which augmentations to apply. - A random seed can be applied both for the number of augmentations applied and for shuffling the order of possible augmentations. 
- 
soundpy.models.dataprep.augment_features(sound, sr, add_white_noise=False, snr=[5, 10, 20], speed_increase=False, speed_decrease=False, speed_perc=0.15, time_shift=False, shufflesound=False, num_subsections=3, harmonic_distortion=False, pitch_increase=False, pitch_decrease=False, num_semitones=2, vtlp=False, bilinear_warp=True, augment_settings_dict=None, random_seed=None)[source]¶
- Randomly applies augmentations to audio. If no augment_settings_dict is provided, defaults are applied. 
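An illustrative sketch (the filename is a placeholder, and the return value is assumed here to be the augmented samples):
>>> import soundpy
>>> y, sr = soundpy.loadsound('speech.wav', sr=16000)
>>> y_aug = soundpy.models.dataprep.augment_features(y, sr, add_white_noise=True, snr=[5, 10, 20], pitch_increase=True, num_semitones=2, random_seed=40)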
- 
soundpy.models.dataprep.get_input_shape(kwargs_get_feats, labeled_data=False, frames_per_sample=None, use_librosa=True, mode='reflect')[source]¶
- 
soundpy.models.dataprep.make_gen_callable(_gen)[source]¶
- Prepares a Python generator for tf.data.Dataset.from_generator. - This works around a bug where Python generators fail to work in TensorFlow 2.2.0+. - Parameters
- _gen ( - generator) – The generator function to feed to a deep neural network.
- Returns
- x ( - np.ndarray [shape=(batch_size,- num_frames,- num_features,- 1)]) – The feature data
- y ( - np.ndarray [shape=(1,1)]) – The label for the feature data.
 
 - References - Shu, Nicolas (2020) https://stackoverflow.com/a/62186572 CC BY-SA 4.0 
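A sketch of the intended use with tf.data (the toy generator and its output shapes and types are illustrative):
>>> import numpy as np
>>> import tensorflow as tf
>>> import soundpy
>>> def toy_gen():
...     for _ in range(10):
...         yield (np.random.rand(1, 79, 40, 1).astype(np.float32),
...            np.random.randint(0, 3, size=(1, 1)))
>>> gen_callable = soundpy.models.dataprep.make_gen_callable(toy_gen())
>>> dataset = tf.data.Dataset.from_generator(gen_callable, output_types=(tf.float32, tf.int64))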
Other useful non-specific functionality¶
Utils module covers functions that are useful for soundpy but are not directly related to sound data.
- 
soundpy.utils.path_or_samples(input_value)[source]¶
- Checks whether input_value is a path or sample data. Does not check path validity. - This is useful for functions that take both pathways to audio as well as pre-loaded audio data. - Parameters
- input_value ( - str,- pathlib.PosixPath, or- tuple [size= ( (samples,),- sr)]or- np.ndarray [size = (samples,- )]) –
- Returns
- ‘path’ or ‘samples’ 
- Return type
- str
 - Examples
>>> import numpy as np
>>> # create some example samples and sample rate
>>> samples = np.array([1,2,3,2,1,0])
>>> sr = 5
>>> path_or_samples( (samples, sr) )
'samples'
>>> # expects both audio samples and sr
>>> path_or_samples(samples)
TypeError: The input for `path_or_samples` expected a str, pathlib.PosixPath, or tuple with samples and sample rate, not type <class 'numpy.ndarray'>
>>> # create example string pathway
>>> path_or_samples('my_audio.wav')
'path'
>>> # create pathlib.PosixPath object
>>> import pathlib
>>> path_or_samples(pathlib.Path('my_audio.wav'))
'path' 
- 
soundpy.utils.get_default_args(func)[source]¶
- Returns a dictionary of the default keyword arguments of a function. - References - stackoverflow answer by mgilson: link: https://stackoverflow.com/a/12627202 license: https://creativecommons.org/licenses/by-sa/3.0/ 
- 
soundpy.utils.match_dtype(array1, array2)[source]¶
- Match the dtype of the second array to the first. - Parameters
- array1 ( - np.ndarray) – The numpy array whose dtype is to be adjusted; this array is returned.
- array2 ( - np.ndarray) – The numpy array with the original or desired dtype.
 
- Returns
- array1 – The array1 with the dtype of array2 
- Return type
- np.ndarray
 
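For example, casting a float array to the dtype of an int array:
>>> import numpy as np
>>> import soundpy
>>> a = np.array([1., 2., 3.])
>>> b = np.array([1, 2, 3], dtype=np.int16)
>>> soundpy.utils.match_dtype(a, b).dtype
dtype('int16')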
- 
soundpy.utils.get_date()[source]¶
- Get a string containing month, day, hour, minute, second and millisecond. - This is useful for creating a unique filename. - Parameters
- None – 
- Returns
- time_str – A string containing the date and time. 
- Return type
- str
 - Examples - >>> date = get_date() >>> date '6m18d1h16m32s295ms' 
- 
soundpy.utils.check_dir(directory, make=True, append=True)[source]¶
- Checks if directory exists and creates it if indicated. - Parameters
- directory ( - stror- pathlib.PosixPath) – The directory of interest
- make ( - bool) – Whether or not the directory should be created or just checked to ensure it exists. (default True)
- append ( - bool) – If True and a directory with the same name already exists, new items will be saved into the existing directory. Otherwise, an error will be raised. (default True)
 
- Returns
- directory – If a directory could be created or confirmed to exist, the directory path will be returned. Otherwise an error will be raised. 
- Return type
 
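For example, ensuring a logging directory exists before saving files into it:
>>> import soundpy
>>> # creates './model_logs/' if it does not yet exist
>>> log_dir = soundpy.utils.check_dir('./model_logs/', make=True)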
- 
soundpy.utils.create_nested_dirs(directory)[source]¶
- Creates directory even if several parent directories don’t exist. - Parameters
- directory ( - str,- pathlib.PosixPath) – The directory to be created.
- Returns
- directory – If successful, the directory path that has been created. 
- Return type
- pathlib.PosixPath
 - Examples
>>> # First an unsuccessful creation of a nested directory
>>> import os
>>> new_dir = './testdir/testdir/testdir/'
>>> os.mkdir(new_dir)
FileNotFoundError: [Errno 2] No such file or directory: './testdir/testdir/testdir/'
>>> # try again with create_nested_dirs()
>>> directory = create_nested_dirs(new_dir)
>>> directory
PosixPath('testdir/testdir/testdir') 
- 
soundpy.utils.string2pathlib(pathway_string)[source]¶
- Turns a string path into a pathlib.PosixPath object. - This is useful when working with pathways from varying operating systems: Windows, Linux, and Mac organize pathways differently, and pathlib turns strings from these different systems into a pathlib object the software can work with regardless of the operating system. - Parameters
- pathway_string ( - stror- pathlib.PosixPath) – The pathway to be turned into a pathlib object, if need be.
- Returns
- pathway_string – The pathway as a pathlib object. 
- Return type
- pathlib.PosixPath
 - Examples - >>> pathway = 'folder/way2go.txt' >>> pathlib_pathway = string2pathlib(pathway) >>> pathlib_pathway PosixPath('folder/way2go.txt') 
- 
soundpy.utils.restore_dictvalue(value_string)[source]¶
- Takes a dict value and converts it back to its original type. - When loading a dictionary from a .csv file, the values are strings. This function handles integers, floats, tuples, and some strings. It has also been suited to handle a list of audiofiles or a list of pathlib.PosixPath objects. - Warning: no extensive testing has been completed for this function. It might not handle all value types as expected. - Parameters
- value_string ( - str) – The dictionary value that was converted into a string object.
- Returns
- value_original_type – The value converted back to its original type. 
- Return type
 - Raises
- ValueError – If passed a nested list of pathlib.PosixPath objects.
 - Examples
>>> input_string = "[PosixPath('data/audio/vacuum/vacuum1.wav')]"
>>> type(input_string)
<class 'str'>
>>> typelist = restore_dictvalue(input_string)
>>> typelist
[PosixPath('data/audio/vacuum/vacuum1.wav')]
>>> type(typelist)
<class 'list'>
>>> # Get type of the object
>>> type(typelist[0])
pathlib.PosixPath
>>> # Example with a list of tuples, i.e. label and audio file pairs:
>>> input_string = "[(2, PosixPath('data/audio/vacuum/vacuum1.wav')), (1, PosixPath('data/audio/vacuum/vacuum2.wav'))]"
>>> labelaudio_pairs = restore_dictvalue(input_string)
>>> labelaudio_pairs
[(2, PosixPath('data/audio/vacuum/vacuum1.wav')), (1, PosixPath('data/audio/vacuum/vacuum2.wav'))]
>>> type(labelaudio_pairs)
list
>>> type(labelaudio_pairs[0])
tuple
>>> type(labelaudio_pairs[0][0])
int
>>> type(labelaudio_pairs[0][1])
pathlib.PosixPath 
- 
soundpy.utils.adjust_time_units(time_sec)[source]¶
- Turns seconds into relevant time units. - This is useful if measuring time of a process and that process takes longer than a couple minutes. - Examples - >>> adjust_time_units(5) (5, 'seconds') >>> adjust_time_units(500) (8.333333333333334, 'minutes') >>> adjust_time_units(5000) (1.3888888888888888, 'hours') 
- 
soundpy.utils.print_progress(iteration, total_iterations, task=None)[source]¶
- Prints the status of a process based on iteration number. - Assumes the iteration starts at 0 rather than 1. - Parameters
- Returns
- Return type
- None; progress is written to sys.stdout and flushed. 
 - Examples - >>> print_progress(4, 10) 50% through current task >>> print_progress(4, 10, task = 'testing') 50% through testing 
- 
soundpy.utils.check_extraction_variables(sr=None, feature_type=None, win_size_ms=None, percent_overlap=None)[source]¶
- Checks to ensure extraction variables are compatible. - Parameters
- Returns
- Return type
- Raises
- ValueError – If any of the Parameters aren’t compatible. 
 - Examples
>>> check_extraction_variables(sr=48000, feature_type='signal', win_size_ms=25, percent_overlap=0.5)
>>> check_extraction_variables(sr='48000', feature_type='sig', win_size_ms='25', percent_overlap='0.5')
ValueError: Sampling rate (sr) must be of type int, not 48000 of type <class 'str'>. 
- 
soundpy.utils.check_noisy_clean_match(noisyfilename, cleanfilename)[source]¶
- Checks if the clean filename is inside of the noisy filename. - This may be helpful to check that two audiofile datasets (a noisy and clean dataset) are aligned. 
- 
soundpy.utils.audiofile_length_match(filename1, filename2)[source]¶
- Checks that two audiofiles have the same length. - This may be useful if you have clean and noisy audiofiles that should be the same length. - Parameters
- filename1 ( - stror- pathlib.PosixPath) – The path to first audio file.
- filename2 ( - stror- pathlib.PosixPath) – The path to second audio file.
 
- Returns
- True if the audiofile lengths match, False if not. 
- Return type
- bool
 - Warning - UserWarning
- If the sample rates of the audio files don’t match. 
- UserWarning
- If the lengths of the files don’t match.