SoundPy Functionality¶
Built-In Functionality (non-Deep Learning)¶
The soundpy.builtin module includes higher-level functions that combine several lower-level functions to complete fairly complex tasks, such as dataset formatting, filtering signals, and extracting features for neural networks.
- soundpy.builtin.filtersignal(audiofile, sr=None, noise_file=None, filter_type='wiener', filter_scale=1, apply_postfilter=False, duration_noise_ms=120, real_signal=False, phase_radians=True, num_bands=None, visualize=False, visualize_every_n_windows=50, max_vol=0.4, min_vol=0.15, save2wav=False, output_filename=None, overwrite=False, use_scipy=False, remove_dc=True, control_vol=False, **kwargs)[source]¶
Apply a Wiener or band spectral subtraction filter to a signal, using noise.
The noise can be provided as a separate file or as samples, or it can be taken from the beginning of the provided audio. How much of the noise is measured can be set with the parameter duration_noise_ms.
- Parameters
audiofile (str or np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – Filename or the audio data of the signal to be filtered.
sr (int) – The sample rate of the audio. If audiofile is of type np.ndarray, sr is required. (default None)
noise_file (str or tuple, optional) – Path to either a noise audiofile or a .npy file containing average power spectrum values. If tuple, must include samples and sr. If None, the beginning of the audiofile will be used for noise data. (default None)
filter_type (str) – Type of filter to apply. Options: 'wiener' or 'band_specsub'. (default 'wiener')
filter_scale (int or float) – The scale at which the filter should be applied. This value is multiplied with the noise levels, thereby increasing or decreasing the filter strength. (default 1)
apply_postfilter (bool) – Whether or not the post filter should be applied. The post filter reduces musical noise (i.e. distortion) in the signal, a byproduct of filtering. (default False)
duration_noise_ms (int or float) – The amount of time in milliseconds of noise to apply Welch's method to; in other words, how much of the noise to use when approximating the average noise power spectrum. (default 120)
real_signal (bool) – If True, only half of the (mirrored) fast Fourier transform will be used during filtering. For audio, this makes no difference in the result; it is only visible in the plots, if you are interested. (default False)
phase_radians (bool) – Relevant for band spectral subtraction: whether phase should be calculated in radians or as complex values / power spectrum. (default True)
num_bands (int) – Relevant for band spectral subtraction: the number of bands to section frequencies into. By grouping sections of frequencies during spectral subtraction filtering, musical noise or distortion should be reduced. (default 6)
visualize (bool) – If True, plots of the windows and the filtered signal will be made. (default False)
visualize_every_n_windows (int) – If visualize is set to True, this controls how often plots are made: every 50 windows, for example. (default 50)
max_vol (int or float) – The maximum volume level of the filtered signal. This is useful if you know you do not want the signal to be louder than a certain value. Ears are important. (default 0.4) TODO: improve on matching volume to the original signal, at least with objective measures.
min_vol (int or float) – The minimum volume level of the filtered signal. (default 0.15) TODO: improve on matching volume to the original signal.
save2wav (bool) – If True, the filtered signal will be saved as a .wav file. (default False)
output_filename (str or pathlib.PosixPath, optional) – Path and name under which the filtered signal is to be saved. If no filename is provided, it will be saved under the date. (default None)
overwrite (bool) – If True and an audiofile by the same name exists, that file will be overwritten. (default False)
use_scipy (bool) – If False, audiofiles will be loaded using librosa; otherwise, scipy.io.wavfile. (default False)
remove_dc (bool) – If True, the DC bias ('direct current' bias) will be removed; in other words, the mean amplitude will be made to equal 0. (default True)
**kwargs (additional keyword arguments) – Keyword arguments for soundpy.filters.WienerFilter or soundpy.filters.BandSubtraction (depending on filter_type).
- Returns
enhanced_signal (np.ndarray [size = (num_samples,)]) – The enhanced signal in raw sample form. Stereo audio has not yet been tested.
sr (int) – The sample rate of the enhanced / filtered signal.
References
Kamath, S. and Loizou, P. (2002). A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing.
Kamath, S. and Loizou, P. (2006). mband.m MATLAB code from the book: Loizou, P. (2013). Speech Enhancement: Theory and Practice.
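A minimal sketch of how filtersignal might be called; the file paths here are hypothetical, so substitute your own noisy audio and noise samples:

import soundpy as sp

# Wiener filter with post filter, using a separate noise file:
enhanced, sr = sp.builtin.filtersignal('noisy_speech.wav',
                                       noise_file='noise.wav',
                                       filter_type='wiener',
                                       apply_postfilter=True,
                                       save2wav=True,
                                       output_filename='speech_wiener.wav')

# Band spectral subtraction, estimating noise from the first 500 ms of the file itself:
enhanced_bs, sr = sp.builtin.filtersignal('noisy_speech.wav',
                                          filter_type='band_specsub',
                                          duration_noise_ms=500,
                                          num_bands=6)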
- soundpy.builtin.dataset_logger(audiofile_dir=None, recursive=True)[source]¶
Logs the name, format, bitdepth, sample rate, duration, and number of channels of audio files.
- Parameters
audiofile_dir (str or pathlib.PosixPath) – The directory where the audiofiles of interest are. If no directory is provided, the current working directory will be used.
recursive (bool) – If True, all audiofiles will be analyzed, including those in nested directories. Otherwise, only the audio files in the immediate directory will be analyzed. (default True)
- Returns
audiofile_dict – Dictionary within a dictionary, holding the formats of the audiofiles in the directory/ies.
- Return type
dict
Examples
>>> audio_info = dataset_logger()
>>> # look at three audio files:
>>> count = 0
>>> for key, value in audio_info.items():
...:     for k, v in value.items():
...:         print(k, ' : ', v)
...:         count += 1
...:     print()
...:     if count > 2:
...:         break
audio : audiodata/dogbark_2channels.wav
sr : 48000
num_channels : 2
dur_sec : 0.389
format_type : WAV
bitdepth : PCM_16

audio : audiodata/python_traffic_pf.wav
sr : 48000
num_channels : 1
dur_sec : 1.86
format_type : WAV
bitdepth : DOUBLE

audio : audiodata/259672__nooc__this-is-not-right.wav
sr : 44100
num_channels : 1
dur_sec : 2.48453514739229
format_type : WAV
bitdepth : PCM_16
See also
soundfile.available_subtypes
The subtypes available with the package SoundFile
soundfile.available_formats
The formats available with the package SoundFile
- soundpy.builtin.dataset_formatter(audiodirectory=None, recursive=False, new_dir=None, sr=None, dur_sec=None, zeropad=False, format='WAV', bitdepth=None, overwrite=False, mono=False)[source]¶
Formats all audio files in a directory to set parameters.
The formatting can be limited to the audio files in the given directory or extended to those in its subfolders.
- Parameters
audiodirectory (str or pathlib.PosixPath) – The directory where the audio files live. If no directory is provided, the current working directory will be used.
recursive (bool) – If False, only the audio files in the immediate directory will be formatted. If True, audio files in nested directories will also be formatted. (default False)
new_dir (str or pathlib.PosixPath) – The directory where the audiofiles will be saved, with the same structure as the original. If None, a default directory name with a time stamp will be generated.
sr (int) – The desired sample rate to assign to the audio files. If None, the original sample rate will be maintained.
dur_sec (int) – The desired length in seconds the audio files should be limited to. If zeropad is set to True, samples that are too short will be zeropadded to match this length. If None, no limitation will be applied.
zeropad (bool) – If True, samples will be zeropadded to match dur_sec. (default False)
format (str) – The format to save the audio data in. (default 'WAV')
bitdepth (int or str) – The desired bitdepth. If int, 16 or 32 are possible. Defaults to 'PCM_16'.
overwrite (bool) – If True and new_dir is None, the audio data will be reformatted in the original directory, overwriting any existing filenames. (default False)
mono (bool) – If True, the audio will be limited to a single channel. Note: not much has been tested for stereo sound and soundpy. (default False)
- Returns
directory – The directory where the formatted audio files are located.
- Return type
See also
soundpy.files.collect_audiofiles
Collects audiofiles from a given directory.
soundpy.files.conversion_formats
The available formats for converting audio data.
soundfile.available_subtypes
The subtypes or bitdepth possible for soundfile
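As a rough sketch of usage (directory names are hypothetical), the following would resample all audio in a folder tree to 16 kHz mono .wav files of at most 3 seconds:

import soundpy as sp

formatted_dir = sp.builtin.dataset_formatter(audiodirectory='./audiodata/',
                                             recursive=True,
                                             new_dir='./audiodata_formatted/',
                                             sr=16000,
                                             dur_sec=3,
                                             zeropad=True,
                                             format='WAV',
                                             bitdepth=16,
                                             mono=True)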
- soundpy.builtin.create_denoise_data(cleandata_dir, noisedata_dir, trainingdata_dir, limit=None, snr_levels=None, pad_mainsound_sec=None, random_seed=None, overwrite=False, **kwargs)[source]¶
Applies noise to clean audio and saves the clean and noisy audio to trainingdata_dir.
- Parameters
cleandata_dir (str or pathlib.PosixPath) – Name of the folder containing clean audio data for the autoencoder, e.g. 'clean_speech'.
noisedata_dir (str or pathlib.PosixPath) – Name of the folder containing noise to add to the clean data, e.g. 'noise'.
trainingdata_dir (str or pathlib.PosixPath) – Directory to save the newly created train, validation, and test data.
limit (int, optional) – Limit on the number of audiofiles used for training data.
snr_levels (list of ints, optional) – List of varying signal-to-noise ratios to apply to the noise levels. (default None)
pad_mainsound_sec (int or float, optional) – Amount in seconds the main sound should be padded; in other words, how long in seconds the background sound should play before the clean / main / target audio starts. The same amount of noise will be appended at the end. (default None)
random_seed (int) – A value to make the random order of audiofiles predictable. If None, the order of audiofiles will not be predictable. (default None)
overwrite (bool) – If True, a new dataset will be created regardless of whether or not a matching directory already exists. (default False)
**kwargs (additional keyword arguments) – The keyword arguments for soundpy.files.loadsound.
- Returns
saveinput_path (pathlib.PosixPath) – Path to where the noisy audio files are located.
saveoutput_path (pathlib.PosixPath) – Path to where the clean audio files are located.
See also
soundpy.files.loadsound
Loads audiofiles.
soundpy.dsp.add_backgroundsound
Add background sound / noise to signal at a determined signal-to-noise ratio.
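A minimal sketch of usage, assuming hypothetical directories of clean speech and noise audio:

import soundpy as sp

noisy_path, clean_path = sp.builtin.create_denoise_data(
    cleandata_dir='./clean_speech/',   # hypothetical path
    noisedata_dir='./noise/',          # hypothetical path
    trainingdata_dir='./noisyspeech/',
    limit=100,                         # cap the number of files used
    snr_levels=[0, 5, 10],             # add noise at 0, 5, or 10 dB SNR
    random_seed=40)                    # reproducible file order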
- soundpy.builtin.envclassifier_feats(data_dir, data_features_dir=None, perc_train=0.8, ignore_label_marker=None, **kwargs)[source]¶
Environment classifier: feature extraction of scene audio into train, val, and test datasets.
Saves extracted feature datasets (train, val, test datasets) as well as feature extraction settings in the directory data_features_dir.
- Parameters
data_dir (str or pathlib.PosixPath) – The directory with scene subfolders (e.g. 'air_conditioner', 'traffic') that contain audio files belonging to that scene (e.g. 'air_conditioner/ac1.wav', 'air_conditioner/ac2.wav', 'traffic/t1.wav').
data_features_dir (str or pathlib.PosixPath, optional) – The directory where feature extraction files related to the dataset will be stored. Within this directory, a unique subfolder will be created each time features are extracted. This allows several versions of extracted features on the same dataset without overwriting files.
perc_train (float) – The amount of data to be set aside as train data. The rest will be divided into validation and test datasets.
ignore_label_marker (str) – A string to look for in labels that should not be included. For example, '__' to ignore a subdirectory titled '__noise' or 'not__label'.
kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.save_features_datasets and soundpy.feats.get_feats.
- Returns
feat_extraction_dir – The pathway to where all feature extraction files can be found, including datasets.
- Return type
See also
soundpy.feats.get_feats
Extract features from audio file or audio data.
soundpy.feats.save_features_datasets
Preparation of acoustic features in train, validation and test datasets.
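A minimal sketch of usage. The paths are hypothetical, and feature_type is assumed to be one of the keyword arguments accepted by soundpy.feats.get_feats:

import soundpy as sp

# './data_scenes/' is assumed to hold one subfolder per class, e.g. 'air_conditioner', 'traffic'.
feat_extraction_dir = sp.builtin.envclassifier_feats(
    data_dir='./data_scenes/',
    data_features_dir='./features/',
    perc_train=0.8,
    feature_type='fbank')  # assumed to be forwarded to sp.feats.get_feats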
- soundpy.builtin.denoiser_feats(data_clean_dir, data_noisy_dir, data_features_dir=None, limit=None, perc_train=0.8, **kwargs)[source]¶
Autoencoder denoiser: feature extraction of clean and noisy audio into train, val, and test datasets.
Saves extracted feature datasets (train, val, test datasets) as well as feature extraction settings in the directory data_features_dir.
- Parameters
data_clean_dir (str or pathlib.PosixPath) – The directory with clean audio files.
data_noisy_dir (str or pathlib.PosixPath) – The directory with noisy audio files. These should be the same as the clean audio, except that noise has been added.
data_features_dir (str or pathlib.PosixPath, optional) – The directory where feature extraction files related to the dataset will be stored. Within this directory, a unique subfolder will be created each time features are extracted. This allows several versions of extracted features on the same dataset without overwriting files.
limit (int, optional) – The limit on the number of audio files for feature extraction. (default None)
kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.save_features_datasets and soundpy.feats.get_feats.
- Returns
feat_extraction_dir – The pathway to where all feature extraction files can be found, including datasets.
- Return type
See also
soundpy.datasets.create_denoise_data
Applies noise at specified SNR levels to clean audio files.
soundpy.feats.get_feats
Extract features from audio file or audio data.
soundpy.feats.save_features_datasets
Preparation of acoustic features in train, validation and test datasets.
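A minimal sketch of usage, assuming the clean and noisy directories were produced by soundpy.builtin.create_denoise_data (paths hypothetical):

import soundpy as sp

feat_extraction_dir = sp.builtin.denoiser_feats(
    data_clean_dir='./noisyspeech/clean/',
    data_noisy_dir='./noisyspeech/noisy/',
    data_features_dir='./features_denoiser/',
    limit=500,
    perc_train=0.8)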
Built-In Functionality (Deep Learning)¶
The soundpy.models.builtin module includes example functions that train neural networks on sound data.
- soundpy.models.builtin.denoiser_train(feature_extraction_dir, model_name='model_autoencoder_denoise', feature_type=None, use_generator=True, normalize=True, patience=10, **kwargs)[source]¶
Collects training features and trains an autoencoder denoiser.
- Parameters
feature_extraction_dir (str or pathlib.PosixPath) – Directory where the extracted feature files are located (format .npy).
model_name (str) – The name for the model. This can be quite generic, as the date up to the millisecond will be added to ensure a unique name for each trained model. (default 'model_autoencoder_denoise')
feature_type (str, optional) – The type of features that will be used to train the model. This is only for the purposes of naming the model. If set to None, it will not be included in the model name.
use_generator (bool) – If True, a generator will be used to feed training data to the model. Otherwise the entire training data will be used to train the model all at once. (default True)
normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
patience (int) – Number of epochs to train without improvement before early stopping. (default 10)
**kwargs (additional keyword arguments) – The keyword arguments for keras.fit(). Note: the keyword arguments differ for validation data depending on whether or not the generator is used, so be sure to use the correct ones. TODO: add link to keras.fit().
- Returns
model_dir – The directory where the model and associated files can be found.
- Return type
See also
soundpy.datasets.separate_train_val_test_files
Generates paths lists for train, validation, and test files. Useful for noisy vs clean datasets and also for multiple training files.
soundpy.models.generator
The generator function that feeds data to the model.
soundpy.models.modelsetup.setup_callbacks
The function that sets up callbacks (e.g. logging, save best model, early stopping, etc.)
soundpy.models.template_models.autoencoder_denoise
Template model architecture for basic autoencoder denoiser.
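A minimal sketch of usage, assuming feature_extraction_dir points at a directory produced by soundpy.builtin.denoiser_feats, and assuming epochs is forwarded to keras fit() via **kwargs:

import soundpy as sp

model_dir = sp.models.builtin.denoiser_train(
    feature_extraction_dir='./features_denoiser/',  # hypothetical path
    model_name='model_autoencoder_denoise',
    use_generator=True,
    patience=10,
    epochs=5)  # assumed to be passed on to keras fit()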
- soundpy.models.builtin.envclassifier_train(feature_extraction_dir, model_name='model_cnn_classifier', feature_type=None, use_generator=True, normalize=True, patience=15, add_tensor_last=True, num_layers=3, **kwargs)[source]¶
Collects training features and trains a CNN environment classifier.
This model may be applied to any speech and label scenario, for example, male vs female speech, clinical vs healthy speech, simple speech / word recognition, as well as noise / scene / environment classification.
- Parameters
feature_extraction_dir (str or pathlib.PosixPath) – Directory where the extracted feature files are located (format .npy).
model_name (str) – The name for the model. This can be quite generic, as the date up to the millisecond will be added to ensure a unique name for each trained model. (default 'model_cnn_classifier')
feature_type (str, optional) – The type of features that will be used to train the model. This is only for the purposes of naming the model. If set to None, it will not be included in the model name.
use_generator (bool) – If True, a generator will be used to feed training data to the model. Otherwise the entire training data will be used to train the model all at once. (default True)
normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
patience (int) – Number of epochs to train without improvement before early stopping. (default 15)
num_layers (int) – The number of convolutional neural network layers desired. (default 3)
**kwargs (additional keyword arguments) – The keyword arguments for keras.fit(). Note: the keyword arguments differ for validation data depending on whether or not the generator is used, so be sure to use the correct ones. TODO: add link to keras.fit().
- Returns
model_dir – The directory where the model and associated files can be found.
- Return type
See also
soundpy.datasets.separate_train_val_test_files
Generates paths lists for train, validation, and test files. Useful for noisy vs clean datasets and also for multiple training files.
soundpy.models.generator
The generator function that feeds data to the model.
soundpy.models.modelsetup.setup_callbacks
The function that sets up callbacks (e.g. logging, save best model, early stopping, etc.)
soundpy.models.template_models.cnn_classifier
Template model architecture for a low-computational CNN sound classifier.
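Usage mirrors denoiser_train; a sketch assuming the features were extracted with soundpy.builtin.envclassifier_feats (the path is hypothetical, and epochs is assumed forwarded to keras fit()):

import soundpy as sp

model_dir = sp.models.builtin.envclassifier_train(
    feature_extraction_dir='./features/',  # hypothetical path
    model_name='model_cnn_classifier',
    num_layers=3,
    patience=15,
    epochs=10)  # assumed to be passed on to keras fit()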
- soundpy.models.builtin.denoiser_run(model, new_audio, feat_settings_dict, remove_dc=True)[source]¶
Runs a pre-trained denoiser on new audio.
- Parameters
model (str or pathlib.PosixPath) – The path to the denoising model.
new_audio (str, pathlib.PosixPath, or np.ndarray) – The path to the noisy audiofile.
feat_settings_dict (dict) – Dictionary containing the necessary settings for how the features were extracted for training the model. Expected keys: 'feature_type', 'win_size_ms', 'percent_overlap', 'sr', 'window', 'frames_per_sample', 'input_shape', 'desired_shape', 'dur_sec', 'num_feats'.
- Returns
cleaned_audio (np.ndarray [shape = (num_samples,)]) – The cleaned audio samples, ready for playing or saving as an audio file.
sr (int) – The sample rate of cleaned_audio.
See also
soundpy.feats.get_feats
How features are extracted.
soundpy.feats.feats2audio
How features are transformed back into audio samples.
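A minimal sketch of usage. The settings below are placeholders: in practice, feat_settings_dict must reproduce exactly how the training features were extracted (these values are normally read from the log files saved during feature extraction), and the model and audio paths are hypothetical:

import soundpy as sp

feat_settings = dict(feature_type='fbank',
                     win_size_ms=25,
                     percent_overlap=0.5,
                     sr=22050,
                     window='hann',
                     frames_per_sample=11,
                     input_shape=(11, 40, 1),   # placeholder shape
                     desired_shape=(110, 40),   # placeholder shape
                     dur_sec=3,
                     num_feats=40)

cleaned_audio, sr = sp.models.builtin.denoiser_run(
    './model_autoencoder_denoise.h5',  # hypothetical model path
    './noisy_sample.wav',              # hypothetical audio path
    feat_settings)
sp.files.savesound('./cleaned_sample.wav', cleaned_audio, sr)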
- soundpy.models.builtin.envclassifier_run(model, new_audio, feat_settings_dict, dict_decode)[source]¶
Runs a pre-trained convnet classifier on new_audio.
- Parameters
model (str or pathlib.PosixPath) – The path to the pre-trained model.
new_audio (str or pathlib.PosixPath) – The path to the audio file to be classified.
feat_settings_dict (dict) – Dictionary containing the necessary settings for feature extraction, such as sample rate, feature type, etc.
dict_decode (dict) – Dictionary containing encoded labels as keys and string labels as values, for example {0: 'office', 1: 'traffic', 2: 'park'}.
- Returns
- soundpy.models.builtin.collect_classifier_settings(feature_extraction_dir)[source]¶
Collects relevant information for some models from files in the feature directory.
These relevant files were generated by soundpy.models.builtin.envclassifier_train.
- Parameters
feature_extraction_dir (str or pathlib.PosixPath) – The directory where the extracted files are located, including the .npy and .csv log files.
- Returns
datasets (NamedTuple) – A named tuple containing the train, val, and test data.
num_labels (int) – The number of labels used for the data.
feat_shape (tuple) – The initial shape of the features when they were extracted, i.e. before labels or a context window were applied.
num_feats (int) – The number of features used to train the pre-trained model.
feature_type (str) – The feature_type used to train the pre-trained model, for example 'fbank', 'mfcc', 'stft', 'signal', 'powspec'.
See also
soundpy.models.builtin.envclassifier_train
The built-in functionality for training a simple scene/environment/speech classifier; it generates the files this function expects.
- soundpy.models.builtin.cnnlstm_train(feature_extraction_dir, model_name='model_cnnlstm_classifier', use_generator=True, normalize=True, patience=15, timesteps=10, context_window=5, frames_per_sample=None, colorscale=1, total_training_sessions=None, add_tensor_last=False, **kwargs)[source]¶
Example implementation of a convnet + LSTM model for speech recognition.
Note: improvements must still be made, for example with the context_window. However, this may still be useful as an example of a simple CNN and LSTM model.
- Parameters
feature_extraction_dir (str or pathlib.PosixPath) – The directory where the feature data will be saved.
model_name (str) – The name of the model. (default 'model_cnnlstm_classifier')
use_generator (bool) – If True, data will be fed to the model via a generator. This parameter will likely be removed and set as a default. (default True)
normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
patience (int) – The number of epochs to allow with no improvement in either validation accuracy or loss. (default 15)
timesteps (int) – The frames dedicated to each subsection of each sample. This allows the long short-term memory model to process each subsection consecutively.
context_window (int) – The number of frames surrounding a central frame that make up sound context. Note: this needs improvement and further exploration.
frames_per_sample (int) – Serves basically the same role as context_window currently does: frames_per_sample equals context_window * 2 + 1. This parameter will likely be removed in future versions.
colorscale (int) – The colorscale relevant for the convolutional neural network. (default 1)
total_training_sessions (int) – Option to limit the number of audiofiles used for training, if use_generator is set to False. This parameter will likely be removed in future versions; but as this is just an example model, the low priority may result in this parameter living forever.
add_tensor_last (bool) – No longer used in the code. Irrelevant.
kwargs (additional keyword arguments) – Keyword arguments for keras.model.fit.
- Returns
model_dir (pathlib.PosixPath) – The directory where the model and log files are saved.
history (tf.keras.callbacks.History) – Contains the model's training and validation accuracy and loss throughout training.
References
Kim, Myungjong & Cao, Beiming & An, Kwanghoon & Wang, Jun. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. 10.21437/interspeech.2018-2250.
- soundpy.models.builtin.resnet50_train(feature_extraction_dir, model_name='model_resnet50_classifier', use_generator=True, normalize=True, patience=15, colorscale=3, total_training_sessions=None, **kwargs)[source]¶
Continues training a pre-trained ResNet50 model for speech recognition or other sound classification.
- Parameters
feature_extraction_dir (str or pathlib.PosixPath) – The directory where the feature extraction files will be saved.
model_name (str) – The name for the model. (default 'model_resnet50_classifier')
use_generator (bool) – If True, data will be fed to the model via a generator. This parameter will likely be removed and set as a default. (default True)
normalize (bool) – If True, the data will be normalized before being fed to the model. (default True)
patience (int) – The number of epochs to allow with no improvement in either validation accuracy or loss. (default 15)
colorscale (int) – The colorscale relevant for the convolutional neural network. (default 3)
total_training_sessions (int) – Option to limit the number of audiofiles used for training, if use_generator is set to False. This parameter will likely be removed in future versions; but as this is just an example model, the low priority may result in this parameter living forever.
**kwargs (additional keyword arguments) – Keyword arguments for keras.model.fit.
- Returns
model_dir (pathlib.PosixPath) – The directory where the model and log files are saved.
history (tf.keras.callbacks.History) – Contains the model's training and validation accuracy and loss throughout training.
- soundpy.models.builtin.envclassifier_extract_train(model_name='env_classifier', augment_dict=None, audiodata_path=None, features_dir=None, save_new_files_dir=None, labeled_data=True, ignore_label_marker=None, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=None, visualize=False, vis_every_n_items=50, label_silence=False, val_data=None, test_data=None, append_model_dir=False, **kwargs)[source]¶
Extracts and augments features during training of a scene/environment/speech classifier.
- Parameters
model_name (str) – Name of the model, without extension (will save as a .h5 file). (default 'env_classifier')
augment_dict (dict, optional) – Dictionary containing augmentation names as keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and True or False as the corresponding values. If a value is True, that augmentation is implemented at random each epoch. (default None)
audiodata_path (str or pathlib.PosixPath) – Where the audio data can be found, if no features_dir with previously extracted and prepared files is provided. (default None)
features_dir (str or pathlib.PosixPath) – The feature directory where previously extracted validation and test data are located, as well as the relevant log files.
save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
labeled_data (bool) – Useful in determining the shape of the data. If True, a label column is expected at the end of the feature columns of the feature data. Note: this may be removed in future versions.
ignore_label_marker (str) – When collecting labels from subdirectory names, this allows a subfolder name to be ignored. For example, if ignore_label_marker is set to '__', the folder name '__test__' will not be included as a label while the folder name 'dog_barking' will.
**kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
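A minimal sketch of usage, combining this function with soundpy.augment.get_augmentation_dict (the paths are hypothetical):

import soundpy as sp

# Randomly apply white noise and pitch increases each epoch:
augment_dict = sp.augment.get_augmentation_dict()
augment_dict['add_white_noise'] = True
augment_dict['pitch_increase'] = True

sp.models.builtin.envclassifier_extract_train(
    model_name='env_classifier',
    augment_dict=augment_dict,
    audiodata_path='./data_scenes/',      # one subfolder per class
    save_new_files_dir='./training_run/',
    epochs=5,
    patience=15)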
- soundpy.models.builtin.cnnlstm_extract_train(model_name='cnnlstm_classifier', dataset_dict=None, num_labels=None, augment_dict=None, audiodata_path=None, save_new_files_dir=None, labeled_data=True, ignore_label_marker=None, context_window=5, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=None, visualize=False, vis_every_n_items=50, label_silence=False, **kwargs)[source]¶
Extracts and augments features during training of a scene/environment/speech classifier.
- Parameters
model_name (str) – Name of the model, without extension (will save as a .h5 file). (default 'cnnlstm_classifier')
dataset_dict (dict, optional) – A dictionary with datasets as keys and audio file lists (with or without labels) as values. If None, will be created based on audiodata_path. (default None)
augment_dict (dict, optional) – Dictionary containing augmentation names as keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and True or False as the corresponding values. If a value is True, that augmentation is implemented at random each epoch. (default None)
audiodata_path (str or pathlib.PosixPath) – Where the audio data can be found, if no dataset_dict is provided. (default None)
save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
**kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
- soundpy.models.builtin.denoiser_extract_train(model_name='denoiser', augment_dict=None, audiodata_clean_path=None, audiodata_noisy_path=None, features_dir=None, save_new_files_dir=None, labeled_data=False, ignore_label_marker=None, batch_size=10, epochs=5, patience=15, callbacks=None, random_seed=20, visualize=False, vis_every_n_items=50, label_silence=False, val_data=None, test_data=None, append_model_dir=False, **kwargs)[source]¶
Extracts and augments features during training of a denoiser.
- Parameters
model_name (str) – Name of the model, without extension (will save as a .h5 file). (default 'denoiser')
augment_dict (dict, optional) – Dictionary containing augmentation names as keys (e.g. 'add_white_noise'; see soundpy.augment.list_augmentations) and True or False as the corresponding values. If a value is True, that augmentation is implemented at random each epoch. (default None)
audiodata_clean_path, audiodata_noisy_path (str or pathlib.PosixPath) – Where the clean and noisy audio data can be found, if no features_dir with previously extracted and prepared files is provided. (default None)
features_dir (str or pathlib.PosixPath) – The feature directory where previously extracted validation and test data are located, as well as the relevant log files.
save_new_files_dir (str or pathlib.PosixPath) – Where new files (logging, model(s), etc.) will be saved. If None, a unique directory within the current working directory will be used. (default None)
labeled_data (bool) – Useful in determining the shape of the data. If True, a label column is expected at the end of the feature columns of the feature data. Note: this may be removed in future versions.
ignore_label_marker (str) – When collecting labels from subdirectory names, this allows a subfolder name to be ignored. For example, if ignore_label_marker is set to '__', the folder name '__test__' will not be included as a label while the folder name 'dog_barking' will.
**kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.
Augment audio data¶
The augment module includes functions related to augmenting audio data. These functions draw on implementations from published research.
Other resources for augmentation (not included in soundpy functionality):
Ma, E. (2019). NLP Augmentation. https://github.com/makcedward/nlpaug
Park, D. S., Chan, W., Zhang, Y., Chiu, C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Google Brain. arxiv.org/pdf/1904.08779.pdf
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084:
1. Signal speed scaling by a random number in [0.8, 1.2] (SpeedupFactoryRange).
2. Pitch shift by a random number in [−2, 2] semitones (SemitoneShiftRange).
3. Volume increase/decrease by a random number in [−3, 3] dB (VolumeGainRange).
4. Addition of random noise in the range [0, 10] dB SNR (SNR).
5. Time shift in the range [−0.005, 0.005] seconds (TimeShiftRange).
- soundpy.augment.speed_increase(sound, sr, perc=0.15, **kwargs)[source]¶
Acoustic augmentation of speech.
References
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio Augmentation for Speech Recognition. Interspeech.
Verhelst, W. & Roelands, M. (1993). An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, April 1993, pp. 554–557.
- soundpy.augment.speed_decrease(sound, sr, perc=0.15, **kwargs)[source]¶
Acoustic augmentation of speech.
References
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
- soundpy.augment.time_shift(sound, sr, random_seed=None, **kwargs)[source]¶
Acoustic augmentation of sound (probably not suitable for speech).
Applies a random shift to the sound by dividing it into two sections and switching them.
References
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
- soundpy.augment.shufflesound(sound, sr, num_subsections=2, random_seed=None, **kwargs)[source]¶
Acoustic augmentation of noise or background sounds.
This separates the sound into num_subsections and pseudorandomizes their order.
References
Inoue, T., Vinayavekhin, P., Wang, S., Wood, D., Munawar, A., Ko, B. J., Greco, N., & Tachibana, R. (2019). Shuffling and mixing data augmentation for environmental sound classification. Detection and Classification of Acoustic Scenes and Events 2019. 25-26 October 2019, New York, NY, USA
- soundpy.augment.add_white_noise(sound, sr, noise_level=0.01, snr=10, random_seed=None, **kwargs)[source]¶
Adds white noise to the signal.
References
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
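A minimal sketch of usage, assuming the function returns the augmented samples (the audio path is hypothetical):

import soundpy as sp

sound, sr = sp.files.loadsound('example.wav', sr=16000)  # hypothetical file
noisy = sp.augment.add_white_noise(sound, sr, snr=10, random_seed=40)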
- soundpy.augment.harmonic_distortion(sound, sr, **kwargs)[source]¶
Applies the sine function five times.
References
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
- soundpy.augment.pitch_increase(sound, sr, num_semitones=2, **kwargs)[source]¶
Increases the pitch of the signal by num_semitones.
References
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
- soundpy.augment.pitch_decrease(sound, sr, num_semitones=2, **kwargs)[source]¶
Decreases the pitch of the signal by num_semitones.
References
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
- soundpy.augment.vtlp(sound, sr, a=(0.8, 1.2), random_seed=None, oversize_factor=16, win_size_ms=50, percent_overlap=0.5, bilinear_warp=True, real_signal=True, fft_bins=1024, window='hann', zeropad=True, expected_shape=None, visualize=False)[source]¶
Applies vocal tract length perturbation (VTLP) directly to the (oversized) DFT windows.
References
Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria.
Nanni, L., Maguolo, G., & Paci, M. (2020). Data augmentation approaches for improving animal audio classification. Ecological Informatics, 57, 101084. https://doi.org/10.1016/j.ecoinf.2020.101084
- soundpy.augment.get_augmentation_dict()[source]¶
Returns a dictionary with the augmentation options as keys and all values set to False.
Examples
>>> import soundpy as sp
>>> ad = sp.augment.get_augmentation_dict()
>>> ad
{'speed_increase': False, 'speed_decrease': False, 'time_shift': False, 'shufflesound': False, 'add_white_noise': False, 'harmonic_distortion': False, 'pitch_increase': False, 'pitch_decrease': False, 'vtlp': False}
>>> # to set an augmentation to True:
>>> ad['add_white_noise'] = True
>>> ad
{'speed_increase': False, 'speed_decrease': False, 'time_shift': False, 'shufflesound': False, 'add_white_noise': True, 'harmonic_distortion': False, 'pitch_increase': False, 'pitch_decrease': False, 'vtlp': False}
- soundpy.augment.list_augmentations()[source]¶
Lists the available augmentations.
Examples
>>> import soundpy as sp
>>> print(sp.augment.list_augmentations())
Available augmentations:
speed_increase
speed_decrease
time_shift
shufflesound
add_white_noise
harmonic_distortion
pitch_increase
pitch_decrease
vtlp
- soundpy.augment.get_augmentation_settings_dict(augmentation)[source]¶
Returns the default settings of the base function for an augmentation.
- Parameters
augmentation (str) – The augmentation of interest.
- Returns
aug_defaults – A dictionary with the base augmentation function's parameters as keys and their default values as values.
- Return type
dict
Examples
>>> import soundpy as sp
>>> d = sp.augment.get_augmentation_settings_dict('speed_decrease')
>>> d
{'perc': 0.15}
>>> # can use this dictionary to apply different values for augmentation
>>> d['perc'] = 0.1
>>> d
{'perc': 0.1}
>>> # to build a dictionary with several settings:
>>> many_settings_dict = {}
>>> many_settings_dict['add_white_noise'] = sp.augment.get_augmentation_settings_dict('add_white_noise')
>>> many_settings_dict['pitch_increase'] = sp.augment.get_augmentation_settings_dict('pitch_increase')
>>> many_settings_dict
{'add_white_noise': {'noise_level': 0.01, 'snr': 10, 'random_seed': None}, 'pitch_increase': {'num_semitones': 2}}
>>> # change the 'snr' default value to a list of several values
>>> # this would apply white noise at either 10, 15, or 20 SNR, at random
>>> many_settings_dict['add_white_noise']['snr'] = [10, 15, 20]
>>> # change the number of semitones pitch increase is applied
>>> many_settings_dict['pitch_increase']['num_semitones'] = 1
>>> many_settings_dict
{'add_white_noise': {'noise_level': 0.01, 'snr': [10, 15, 20], 'random_seed': None}, 'pitch_increase': {'num_semitones': 1}}
- Raises
ValueError – If augmentation does not match available augmentations.
See also
soundpy.models.dataprep.augment_features
The dictionary example many_settings_dict above can be passed as the parameter augment_settings_dict to apply augmentation settings when augmenting data, for example within a generator function. See soundpy.models.dataprep.GeneratorFeatExtraction.
Working with audio files¶
The files module contains functions related to handling audio data files, for example loading audio files, saving audio files, and examining and reformatting audio files.
- soundpy.files.loadsound(filename, sr=None, mono=True, dur_sec=None, remove_dc=True, use_scipy=False)[source]¶
Loads a sound file with scipy.io.wavfile.read or librosa.load (default librosa).
- Parameters
filename (str) – The filename of the sound to be loaded.
sr (int, optional) – The desired sample rate of the audio samples. If None, the sample rate of the audio file will be used.
mono (bool) – If True, the samples will be loaded as mono sound. If False and the samples are in stereo, they will be loaded in stereo.
dur_sec (int or float, optional) – The length in seconds of the audio signal.
remove_dc (bool) – If True, the mean is subtracted from the signal. This has shown to be very helpful when working with audio data. (default True)
use_scipy (bool) – If False, librosa will be used to load the audiofile. If True, scipy.io.wavfile and/or soundfile will be used. If the sound file is not compatible with scipy.io.wavfile.read, this function converts the file to .wav format and/or changes the bit depth to be compatible. (default False)
- Returns
data (np.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – The normalized (between -1 and 1) sample data, returned according to the specified settings.
sr (int) – The sample rate of the loaded samples.
See also
soundpy.files.prep4scipywavfile
Prepares an audio file for scipy.io.wavfile.read.
soundpy.files.convert_audiofile
Converts an audio file to .wav format.
soundpy.files.newbitdepth
Converts an audio file to a specified bitdepth.
soundpy.dsp.resample_audio
Resamples audio data to a specified sample rate.
soundpy.files.list_possibleformats
Lists the possible formats to load with soundpy.loadsound.
librosa.load
The function used to load sound data by default. See librosa.
scipy.io.wavfile.read
The function used to load sound if use_scipy is set to True. See scipy.
soundpy.dsp.remove_dc_bias
Removes the 'direct current' bias from the signal.
Todo
- Make librosa data and scipy.io.wavfile data more similar
https://stackoverflow.com/questions/54482346/reading-a-wav-file-with-scipy-and-librosa-in-python
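A minimal sketch of usage (the audio path is hypothetical):

import soundpy as sp

# load with librosa (default), resampling to 16 kHz and limiting to 2 seconds:
data, sr = sp.files.loadsound('example.wav', sr=16000, dur_sec=2)

# load with scipy.io.wavfile / soundfile instead, keeping the original sample rate:
data_scipy, sr_orig = sp.files.loadsound('example.wav', use_scipy=True)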
- soundpy.files.savesound(audiofile_name, signal_values, sr, remove_dc=True, overwrite=False, use_scipy=False, **kwargs)[source]¶
Saves the audio samples at the designated path.
- Parameters
audiofile_name (str or pathlib.PosixPath) – Path and name the audio is to be saved under. (.wav format)
signal_values (ndarray) – Values of the real signal to be saved.
sr (int) – Sample rate of the audio samples.
remove_dc (bool) – If True, the mean is subtracted from the signal. (default True)
overwrite (bool) – If True, audio with the same name will be overwritten. (default False)
use_scipy (bool) – If True, scipy.io.wavfile.write will be used; however, file conversion is limited: it can only save .wav files. Otherwise soundfile.write will be used, which can save audio in more formats.
**kwargs (additional keyword arguments) – The keyword arguments for soundfile.write: https://pysoundfile.readthedocs.io/en/latest/index.html?highlight=write#soundfile.write
- Returns
audiofile_name – The new audiofile name
- Return type
See also
scipy.io.wavfile.write
soundpy.files.conversion_formats
Lists the possible formats to save audio files if use_scipy is False.
soundpy.dsp.remove_dc_bias
Removes the ‘direct current’ bias from the signal.
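A minimal sketch of usage, pairing savesound with soundpy.dsp.generate_sound (the output filename is hypothetical):

import soundpy as sp

# generate one second of a 440 Hz tone and save it:
tone, sr = sp.dsp.generate_sound(freq=440, amplitude=0.4, sr=16000, dur_sec=1)
saved_path = sp.files.savesound('tone440.wav', tone, sr, overwrite=True)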
- soundpy.files.audiofiles_present(directory, recursive=False)[source]¶
Checks to see if audio files are present.
- Parameters
directory (str or pathlib.PosixPath) – The directory to look for audio in.
recursive (bool) – If True, all nested directories will be checked as well. (default False)
- Returns
True if audio is present; otherwise False.
- Return type
bool
- soundpy.files.collect_audiofiles(directory, hidden_files=False, wav_only=False, recursive=False)[source]¶
Collects all files within a given directory.
This includes the option to include hidden_files in the collection.
- Parameters
directory (str or pathlib.PosixPath) – The path to where the desired files are located.
hidden_files (bool) – If True, hidden files will be included; if False, they won't. (default False)
wav_only (bool) – If True, only .wav files will be included; otherwise, there is no limit on file type.
recursive (bool) – If True, files in nested directories will be collected as well. (default False)
- Returns
paths_list – Sorted list of file pathways.
- Return type
list of pathlib.PosixPath objects
- soundpy.files.collect_zipfiles(directory, hidden_files=False, ext='tgz', recursive=False)[source]¶
Collects all zipfiles within a given directory.
This includes the option to include hidden_files in the collection.
- Parameters
directory (str or pathlib.PosixPath) – The path to where the desired files are located.
hidden_files (bool) – If True, hidden files will be included; if False, they won't. (default False)
ext (str) – The zipfile extension to collect. (default 'tgz')
- Returns
paths_list – Sorted list of file pathways.
- Return type
list of pathlib.PosixPath objects
- soundpy.files.prep4scipywavfile(filename, overwrite=False)[source]¶
Takes a soundfile and saves it in a format compatible with scipy.io.wavfile.
- soundpy.files.conversion_formats()[source]¶
Lists the formats available for conversion.
soundpy uses soundfile to convert files; therefore, whatever available formats soundfile has will be listed here.
Examples
>>> conversion_formats()
{'AIFF': 'AIFF (Apple/SGI)',
 'AU': 'AU (Sun/NeXT)',
 'AVR': 'AVR (Audio Visual Research)',
 'CAF': 'CAF (Apple Core Audio File)',
 'FLAC': 'FLAC (Free Lossless Audio Codec)',
 'HTK': 'HTK (HMM Tool Kit)',
 'SVX': 'IFF (Amiga IFF/SVX8/SV16)',
 'MAT4': 'MAT4 (GNU Octave 2.0 / Matlab 4.2)',
 'MAT5': 'MAT5 (GNU Octave 2.1 / Matlab 5.0)',
 'MPC2K': 'MPC (Akai MPC 2k)',
 'OGG': 'OGG (OGG Container format)',
 'PAF': 'PAF (Ensoniq PARIS)',
 'PVF': 'PVF (Portable Voice Format)',
 'RAW': 'RAW (header-less)',
 'RF64': 'RF64 (RIFF 64)',
 'SD2': 'SD2 (Sound Designer II)',
 'SDS': 'SDS (Midi Sample Dump Standard)',
 'IRCAM': 'SF (Berkeley/IRCAM/CARL)',
 'VOC': 'VOC (Creative Labs)',
 'W64': 'W64 (SoundFoundry WAVE 64)',
 'WAV': 'WAV (Microsoft)',
 'NIST': 'WAV (NIST Sphere)',
 'WAVEX': 'WAVEX (Microsoft)',
 'WVE': 'WVE (Psion Series 3)',
 'XI': 'XI (FastTracker 2)'}
- soundpy.files.convert_audiofile(filename, format_type=None, sr=None, new_dir=False, overwrite=False, use_scipy=False, **kwargs)[source]¶
Converts a soundfile to a specified format (default .wav) and saves it in the same or a new directory.
- Parameters
filename (str or pathlib.PosixPath) – The filename of the audiofile to be converted.
format_type (str) – The format to convert the audio file to. See soundpy.files.conversion_formats. (defaults to 'wav')
new_dir (str or pathlib.PosixPath, optional) – If False, the converted files will be saved in the same directory as the originals. If a path is provided, the converted files will be saved there; if no such directory exists, one will be created.
sr (int, optional) – The sample rate to be applied to the signal. If none is supplied, the sample rate of the original file will be used.
**kwargs (additional keyword arguments) – The keyword arguments for soundfile.write: https://pysoundfile.readthedocs.io/en/latest/index.html?highlight=write#soundfile.write
- Returns
f_wavfile – The filename / path where the audio file is saved.
- Return type
pathlib.PosixPath
Examples
>>> audiofile = './example/audio.wav'
>>> # in same directory
>>> audiofile_flac = sp.files.convert_audiofile(audiofile, format_type='flac')
>>> audiofile_flac
PosixPath('example/audio.flac')
>>> # in new directory
>>> audiofile_flac = sp.files.convert_audiofile(audiofile, format_type='flac', new_dir='./examples2/')
>>> audiofile_flac
PosixPath('examples2/audio.flac')
>>> # can establish the desired conversion format in `new_dir`
>>> audiofile_ogg = sp.files.convert_audiofile(audiofile, new_dir='./examples2/audio.ogg')
>>> audiofile_ogg
PosixPath('examples2/audio.ogg')
See also
soundpy.files.conversion_formats
Lists the possible formats to convert audio files.
soundpy.files.list_possibleformats
Lists the possible formats to load with soundpy.loadsound
- soundpy.files.replace_ext(filename, extension)[source]¶
Adds or replaces an extension in the filename.
- Parameters
filename (str or pathlib.PosixPath) – Filename with the missing or incorrect extension.
extension (str) – The correct extension for the given filename.
- Returns
file_newext – The filename with the new extension
- Return type
- soundpy.files.match_ext(filename1, filename2)[source]¶
Matches the file extensions.
If both have extensions, the extension of filename1 is used by default.
- soundpy.files.newbitdepth(wave, bitdepth=16, newname=None, overwrite=False)[source]¶
Converts the bitdepth to 16 or 32, to ensure compatibility with scipy.io.wavfile.
scipy.io.wavfile is easily used online, for example in Jupyter notebooks.
- soundpy.files.adjustname(filename, adjustment=None)[source]¶
Adjusts the filename.
- Parameters
- Returns
fname – The adjusted filename with the original extension.
- Return type
str
Examples
>>> adjustname('happy.md')
'happy_adj.md'
>>> adjustname('happy.md', '_not_sad')
'happy_not_sad.md'
Organizing datasets¶
The datasets module contains functions related to organizing datasets.
- soundpy.datasets.create_encodedlabel2audio_dict(dict_encodelabels, paths_list, limit=None, seed=40)[source]¶
Creates a dictionary with audio labels as keys and filename lists as values.
If no label is found in the filename path, the label is not included in the returned dictionary: labels are only included if corresponding paths are present.
- Parameters
dict_encodelabels (dict) – Dictionary containing the labels as keys and their encoded values as values.
paths_list (set or list) – List containing pathlib.PosixPath objects (i.e. paths) of all audio files; the audio files are expected to reside in directories with names matching their audio class.
limit (int, optional) – An integer limit on the number of audiofiles for each class. This may be useful for ensuring a balanced dataset. (default None)
seed (int, optional) – The seed for pseudorandomizing the wavfiles, if a limit is requested. If seed is set to None, the randomized order of the limited wavfiles cannot be repeated. (default 40)
- Returns
label_waves_dict (OrderedDict) – A dictionary with encoded audio labels as keys and, as values, the audio files corresponding to that label. TODO: update.
Examples
>>> from pathlib import Path
>>> labels = dict([('vacuum',2),('fridge',0),('wind',1)])
>>> paths = [Path('data/audio/vacuum/vacuum1.wav'),
...          Path('data/audio/fridge/fridge1.wav'),
...          Path('data/audio/vacuum/vacuum2.wav'),
...          Path('data/audio/wind/wind1.wav')]
>>> label_waves_dict = create_encodedlabel2audio_dict(labels, paths)
>>> label_waves_dict
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav'), PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> # to set a limit on number of audiofiles per class:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=40)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> # change the limited pathways chosen:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=10)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
- soundpy.datasets.create_dicts_labelsencoded(labels_class, add_extra_label=False, extra_label='silence')[source]¶
Encodes audio class labels and saves them in dictionaries.
The labels are alphabetized and encoded by their index. If add_extra_label is True, the extra_label is added as the last entry in the dictionary. This is useful if performing voice activity detection and wanting to label non-voice-activated sections as silent rather than with some speech label.
- Parameters
- Returns
Examples
>>> labels = {'wind','air_conditioner','fridge'}
>>> label2int, int2label = create_dicts_labelsencoded(labels)
>>> label2int
{'air_conditioner': 0, 'fridge': 1, 'wind': 2}
>>> int2label
{0: 'air_conditioner', 1: 'fridge', 2: 'wind'}
- soundpy.datasets.waves2dataset(audiolist, perc_train=0.8, seed=40, train=True, val=True, test=True)[source]¶
Organizes a list of audio files into train, validation, and test datasets.
If only two datasets or one dataset is to be prepared, they will be assigned to train and val, or simply to train, respectively. The remaining 'datasets' will remain empty.
- Parameters
audiolist (list) – List containing paths to audio files.
perc_train (float or int) – Percentage of data to be in the training dataset. (default 0.8)
seed (int or None, optional) – Set a seed for the generation of pseudorandom train, validation, and test datasets. Useful for reproducing results. (default 40)
train (bool) – If True, it is assumed that training data will be prepared. (default True)
val (bool) – If True, it is assumed that validation data will be prepared. (default True)
test (bool) – If True, it is assumed that test data will be prepared. (default True)
- Returns
Examples
>>> # using a list of numbers instead of filenames
>>> audiolist = [1,2,3,4,5,6,7,8,9,10]
>>> # default settings:
>>> waves2dataset(audiolist)
([5, 4, 9, 2, 3, 10, 1, 6], [8], [7])
>>> # perc_train set to 50% instead of 80%:
>>> waves2dataset(audiolist, perc_train=50)
([5, 4, 9, 2, 3, 10], [1, 6], [8, 7])
>>> # change seed number
>>> waves2dataset(audiolist, seed=0)
([7, 1, 2, 5, 6, 9, 10, 8], [4], [3])
- soundpy.datasets.audio2datasets(audiodata, perc_train=0.8, limit=None, seed=None, audio_only=True, **kwargs)[source]¶
Organizes all audio in audio class directories into (randomized) datasets.
The validation and test datasets are halved between what isn’t train data. For example, if perc_train is 0.8, validation data will be 0.1 and test data will be 0.1.
- Parameters
audiodata (str, pathlib.PosixPath, dict, list, or set) – If the data has multiple labels, the path to the dictionary where the audio class labels and the paths of all audio files belonging to each class are or will be stored. The dictionary with the labels and their encoded values can also be supplied directly here. If the data does not have labels, a list or set of audiofiles can be provided to be placed in train, val, and test datasets.
perc_train (float) – The percentage of data to be placed in the training dataset. (default 0.8)
limit (int, optional) – A limit on the number of audiofiles used. (default None)
seed (int, optional) – A value to make the random order of audiofiles predictable. If None, the order of audiofiles will not be predictable. (default None)
audio_only (bool) – Whether audio files are expected (e.g. with extensions .wav, .flac, etc.) or not. If True, the list will be checked to contain only audio files; otherwise not. (default True)
**kwargs (additional keyword arguments) – Keyword arguments for soundpy.datasets.waves2dataset.
- Returns
dataset_audio – Named tuple including three lists / datasets of audiofiles or label-audiofile pairs: the train, validation, and test lists, respectively. The label-audiofile pairs are saved as tuples within the lists and contain the encoded label integer (e.g. 0 instead of ‘air_conditioner’) and the audio paths associated to that class and dataset.
- Return type
- Raises
ValueError – If perc_train is set too high for the amount of data or there are simply too few data. Specifically, if the percentage of train data cannot be upheld while also ensuring the validation and test datasets have more than 1 sample.
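A minimal sketch of usage for unlabeled data, pairing audio2datasets with soundpy.files.collect_audiofiles (the directory is hypothetical, and the three-way unpacking assumes the returned named tuple holds the train, val, and test lists in that order):

import soundpy as sp

paths = sp.files.collect_audiofiles('./data_scenes/', recursive=True)
train, val, test = sp.datasets.audio2datasets(paths, perc_train=0.8, seed=40)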
- soundpy.datasets.separate_train_val_test_files(list_of_files)[source]¶
Checks that the file(s) exist, then sorts them into train, val, and test lists.
If 'nois' or 'clean' are in the filenames, two paths lists per dataset will be generated; otherwise just one. These paths lists are useful if there are multiple training files available for training a model (e.g. for large datasets).
- Parameters
list_of_files (list, str, or pathlib.PosixPath) – The feature files (format: .npy) for training a model.
- Returns
(train_paths_list, val_paths_list, test_paths_list) – Tuple comprised of paths lists to the train, validation, and test data files. If noisy and clean data files are found, each tuple item will itself be a tuple comprised of two lists: a noisy file paths list and a clean file paths list.
- Return type
tuple
Examples
>>> features_files = ['train1.npy', 'train2.npy', 'val.npy', 'test.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train
[PosixPath('train1.npy'), PosixPath('train2.npy')]
>>> datasets.val
[PosixPath('val.npy')]
>>> datasets.test
[PosixPath('test.npy')]
>>> # try with noisy and clean data
>>> features_files = ['train_noisy.npy', 'train_clean.npy', 'val_noisy.npy', 'val_clean.npy', 'test_noisy.npy', 'test_clean.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train.noisy
[PosixPath('train_noisy.npy')]
>>> datasets.train.clean
[PosixPath('train_clean.npy')]
>>> datasets.val.noisy
[PosixPath('val_noisy.npy')]
>>> datasets.val.clean
[PosixPath('val_clean.npy')]
>>> datasets.test.noisy
[PosixPath('test_noisy.npy')]
>>> datasets.test.clean
[PosixPath('test_clean.npy')]
- soundpy.datasets.section_data(dataset_dict, dataset_paths_dict, divide_factor=None)[source]¶
Sections the data into smaller chunks. Expects the keys of these two dictionaries to match.
Examples
>>> import pathlib
>>> # train is longer than val and test
>>> d = {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'val': [1, 2, 3, 4, 5], 'test': [1, 2, 3, 4, 5]}
>>> # dictionary: paths to where extracted data will be saved
>>> dp = {'train': pathlib.PosixPath('train_data.npy'), 'val': pathlib.PosixPath('val_data.npy'), 'test': pathlib.PosixPath('test_data.npy')}
>>> d2, dp2 = section_data(d, dp, divide_factor=3)
>>> # val and test not touched (too small)
>>> d2
{'train__1': [1, 2, 3, 4, 5], 'train__2': [6, 7, 8, 9, 10], 'train__3': [11, 12, 13, 14, 15], 'val': [1, 2, 3, 4, 5], 'test': [1, 2, 3, 4, 5]}
>>> dp2
{'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'val': PosixPath('val_data.npy'), 'test': PosixPath('test_data.npy')}
>>> # repeat: now val and test are as long as train
>>> # default divide_factor is 2
>>> d3, dp3 = section_data(d2, dp2)
>>> d3
{'train__1': [1, 2], 'train__2': [3, 4, 5], 'train__3': [6, 7], 'train__4': [8, 9, 10], 'train__5': [11, 12], 'train__6': [13, 14, 15], 'val__1': [1, 2], 'val__2': [3, 4, 5], 'test__1': [1, 2], 'test__2': [3, 4, 5]}
>>> dp3
{'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'train__4': PosixPath('train_data__4.npy'), 'train__5': PosixPath('train_data__5.npy'), 'train__6': PosixPath('train_data__6.npy'), 'val__1': PosixPath('val_data__1.npy'), 'val__2': PosixPath('val_data__2.npy'), 'test__1': PosixPath('test_data__1.npy'), 'test__2': PosixPath('test_data__2.npy')}
Working with signals¶
The dsp module contains functions pertaining to the actual generation, manipulation, and analysis of sound, ranging from generating sounds to calculating signal-to-noise ratio.
-
soundpy.dsp.
generate_sound
(freq=200, amplitude=0.4, sr=8000, dur_sec=0.25)[source]¶ Generates a sound signal with the provided parameters. Signal begins at 0.
- Parameters
freq (
int
,float
) – The frequency in Hz the signal should have (default 200 Hz). This pertains to the number of oscillations per second.amplitude (
int
,float
) – The parameter controlling how much energy the signal should have. (default 0.4)sr (
int
) – The sampling rate of the signal, or how many samples make up the signal per second. (default 8000)dur_sec (
int
,float
) – The duration in seconds the generated signal should have. (default 0.25)
- Returns
sound_samples (
np.ndarray [size = (num_samples,)]
) – The samples of the generated soundsr (
int
) – The sample rate of the generated signal
Examples
>>> sound, sr = generate_sound(freq=5, amplitude=0.5, sr=5, dur_sec=1) >>> sound array([ 0.000000e+00, 5.000000e-01, 3.061617e-16, -5.000000e-01, -6.123234e-16]) >>> sr 5
-
soundpy.dsp.
get_time_points
(dur_sec, sr)[source]¶ Get evenly spaced time points from zero to length of dur_sec.
The time points align with the provided sample rate, making it easy to plot a signal with a time line in seconds.
- Parameters
- Returns
time
- Return type
np.ndarray [size = (num_time_points,)]
Examples
>>> # 50 milliseconds at sample rate of 100 (100 samples per second) >>> x = get_time_points(0.05,100) >>> x.shape (5,) >>> x array([0. , 0.0125, 0.025 , 0.0375, 0.05 ])
-
soundpy.dsp.
generate_noise
(num_samples, amplitude=0.025, random_seed=None)[source]¶ Generates noise to be of a certain amplitude and number of samples.
Useful for adding noise to another signal of length num_samples.
- Parameters
Examples
>>> noise = generate_noise(5, random_seed = 0) >>> noise array([0.04410131, 0.01000393, 0.02446845, 0.05602233, 0.04668895])
-
soundpy.dsp.
set_signal_length
(samples, numsamps)[source]¶ Sets audio signal to be a certain length. Zero-pads if too short, truncates if too long.
Useful for setting signals to be a certain length, regardless of how long the audio signal is.
- Parameters
samples (
np.ndarray [size = (num_samples
,num_channels)
, or(num_samples,)]
) – The array of sample data to be zero padded.numsamps (
int
) – The desired number of samples.
- Returns
data – Copy of samples zeropadded or limited to numsamps.
- Return type
np.ndarray [size = (numsamps
,num_channels)
, or(numsamps,)]
Examples
>>> import numpy as np >>> input_samples = np.array([1,2,3,4,5]) >>> output_samples = set_signal_length(input_samples, numsamps = 8) >>> output_samples array([1, 2, 3, 4, 5, 0, 0, 0]) >>> output_samples = set_signal_length(input_samples, numsamps = 4) >>> output_samples array([1, 2, 3, 4])
-
soundpy.dsp.
scalesound
(data, max_val=1, min_val=None)[source]¶ Scales the input array to range between min_val and max_val.
- Parameters
data (
np.ndarray [size = (num_samples,)
or(num_samples
,num_channels)]
) – Original samplesmax_val (
int
,float
) – The maximum value the dataset is to range from (default 1)min_val (
int
,float
, optional) – The minimum value the dataset is to range from. If set to None, will be set to the opposite of max_val. E.g. if max_val is set to 0.8, min_val will be set to -0.8. (default None)
- Returns
samples – Copy of original data, scaled to the min and max values.
- Return type
np.ndarray [size = (num_samples,)
or(num_samples
,num_channels)]
Examples
>>> import numpy as np >>> np.random.seed(0) >>> input_samples = np.random.random_sample((5,)) >>> input_samples array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ]) >>> input_samples.max() 0.7151893663724195 >>> input_samples.min() 0.4236547993389047 >>> # default setting: between -1 and 1 >>> output_samples = scalesound(input_samples) >>> output_samples array([-0.14138 ,1., 0.22872961, -0.16834299, -1.]) >>> output_samples.max() 1.0 >>> output_samples.min() -1.0 >>> # range between -100 and 100 >>> output_samples = scalesound(input_samples, max_val = 100, min_val = -100) >>> output_samples array([ -14.13800026,100., 22.87296052,-16.83429866,-100.]) >>> output_samples.max() 100.0 >>> output_samples.min() -100.0
-
soundpy.dsp.
shape_samps_channels
(data)[source]¶ Returns data in shape (num_samps, num_channels)
- Parameters
data (
np.ndarray [size= (num_samples,)
or(num_samples
,num_channels)
, or(num_channels
,num_samples)]
) – The data that needs to be checked for correct format- Returns
data
- Return type
np.ndarray [size = (num_samples,)
or(num_samples
,num_channels)]
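Examples
A minimal sketch of the expected behavior, with hypothetical values (assuming channel-first data is simply transposed):
>>> import numpy as np
>>> data = np.zeros((2, 100))  # shape (num_channels, num_samples)
>>> shape_samps_channels(data).shape
(100, 2)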
-
soundpy.dsp.
resample_audio
(samples, sr_original, sr_desired)[source]¶ Allows audio samples to be resampled to desired sample rate.
- Parameters
- Returns
resampled (
np.ndarray [size = (num_samples_resampled,)]
) – The resampled samples.sr_desired (
int
) – The newly applied sample rate
Examples
>>> import numpy as np >>> # example samples from 5 millisecond signal with sr 100 and frequency 10 >>> input_samples = np.array([0.00e+00, 2.82842712e-01, 4.000e-01, 2.82842712e-01, 4.89858720e-17]) >>> # we want to resample to 80 instead of 100 (for this example's sake) >>> output_samples, sr = resample_audio(input_samples, sr_original = 100, sr_desired = 80) >>> output_samples array([-2.22044605e-17, 3.35408001e-01, 3.72022523e-01, 6.51178161e-02])
-
soundpy.dsp.
stereo2mono
(data)[source]¶ If sound data has multiple channels, reduces to first channel
- Parameters
data (
numpy.ndarray
) – The series of sound samples, with 1+ columns/channels- Returns
data_mono – The series of sound samples, with first column
- Return type
numpy.ndarray
Examples
>>> import numpy as np >>> data = np.linspace(0,20) >>> data_2channel = data.reshape(25,2) >>> data_2channel[:5] array([[0. , 0.40816327], [0.81632653, 1.2244898 ], [1.63265306, 2.04081633], [2.44897959, 2.85714286], [3.26530612, 3.67346939]]) >>> data_mono = stereo2mono(data_2channel) >>> data_mono[:5] array([0. , 0.81632653, 1.63265306, 2.44897959, 3.26530612])
-
soundpy.dsp.
add_backgroundsound
(audio_main, audio_background, sr, snr=None, pad_mainsound_sec=None, total_len_sec=None, wrap=False, stationary_noise=True, random_seed=None, extend_window_ms=0, remove_dc=False, mirror_sound=False, clip_at_zero=True, **kwargs)[source]¶ Adds a sound (i.e. background noise) to a target signal. Stereo sound should work.
If the sample rates of the two audio samples do not match, the sample rate of audio_main will be applied. (i.e. the audio_background will be resampled). If you have issues with clicks at the beginning or end of signals, see
soundpy.dsp.clip_at_zero
.- Parameters
audio_main (
str
,pathlib.PosixPath
, ornp.ndarray [size=(num_samples,)
or(num_samples
,num_channels)]
) – Sound file of the main sound (will not be modified; only delayed if specified). If not a path or string, should be data samples corresponding to the provided sample rate.audio_background (
str
,pathlib.PosixPath
, ornp.ndarray [size=(num_samples,)]
) – Sound file of the background sound (will be modified / repeated to match or extend the length indicated). If not of type pathlib.PosixPath or string, should be data samples corresponding to the provided sample rate.sr (
int
) – The sample rate of sounds to be added together. Note: sr of 44100 or higher is suggested.snr (
int
,float
,list
,tuple
) – The signal-to-noise ratio of the target and background signals. Note: this is an approximation and needs further testing and development to be used as an official measurement of snr. If no SNR provided, signals will be added together as-is. (default None)pad_mainsound_sec (
int
orfloat
, optional) – Length of time in seconds the background sound will pad the main sound. For example, if pad_mainsound_sec is set to 1, one second of the audio_background will be played before audio_main starts as well as after the main audio stops. (default None)total_len_sec (
int
orfloat
, optional) – Total length of combined sound in seconds. If none, the sound will end after the (padded) target sound ends (default None).wrap (
bool
) – If False, the random selection of sound will be limited to end by the end of the audio file. If True, the random selection will wrap to the beginning of the audio file if it extends beyond the end of the audio file. (default False)stationary_noise (
bool
) – If False,soundpy.feats.get_vad_stft
will be applied to noise to get energy of the active noise in the signal. Otherwise energy will be collected via soundpy.dsp.get_stft. (default True)random_seed (
int
) – If provided, the ‘random’ section of noise will be chosen using this seed. (default None)extend_window_ms (
int
orfloat
) – The number of milliseconds the detected voice activity should be padded with. This might be useful to ensure a sufficient amount of activity is calculated. (default 0)remove_dc (
bool
) – Whether the DC bias should be removed. This aids in the removal of clicks. Seesoundpy.dsp.remove_dc_bias
. (default False)**kwargs (
additional keyword arguments
) – The keyword arguments for soundpy.files.loadsound
- Returns
References
- Yi Hu and Philipos C. Loizou, original authors
Copyright (c) 2006 by Philipos C. Loizou
- SIP-Lab/CNN-VAD/GitHub Repo
Copyright (c) 2019 Signal and Image Processing Lab MIT License
See also
soundpy.files.loadsound
Loads audiofiles.
soundpy.dsp.snr_adjustnoiselevel
Calculates how much to adjust noise signal to achieve SNR.
soundpy.feats.get_vad_stft
Returns stft matrix of only voice active regions
soundpy.feats.get_stft
Returns stft matrix of entire signal
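Examples
A usage sketch only: the filenames here are hypothetical, and the return values are assumed to be the combined samples and sample rate.
>>> # add cafe noise to speech at an SNR of 10, with 0.5 sec of noise padding
>>> noisy_speech, sr = add_backgroundsound('speech.wav', 'cafe.wav', sr=44100,
...     snr=10, pad_mainsound_sec=0.5)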
-
soundpy.dsp.
hz_to_mel
(freq)[source]¶ Converts frequency to Mel scale
- Parameters
freq (
int
orfloat
orarray like
ofints / floats
) – The frequency/ies to convert to Mel scale.- Returns
mel – The frequency/ies in Mel scale.
- Return type
References
https://en.wikipedia.org/wiki/Mel_scale#Formula
Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
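Examples
A sketch assuming the common O’Shaughnessy formula, mel = 2595 * log10(1 + freq / 700), under which 1000 Hz maps to roughly 1000 mel:
>>> hz_to_mel(1000)   # approximately 1000.0 under the standard formula
>>> hz_to_mel([200, 400])   # array-like input should also work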
-
soundpy.dsp.
mel_to_hz
(mel)[source]¶ Converts Mel item or list to frequency/ies.
- Parameters
mel (
int
,float
, orlist
ofints / floats
) – Mel item(s) to be converted to Hz.- Returns
freq – The converted frequency/ies
- Return type
References
https://en.wikipedia.org/wiki/Mel_scale#Formula
Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
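Examples
A sketch assuming the standard inverse formula, freq = 700 * (10**(mel / 2595) - 1):
>>> mel_to_hz(1000)   # approximately 1000 Hz under the standard formula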
-
soundpy.dsp.
fbank_filters
(fmin, fmax, num_filters)[source]¶ Calculates the mel filterbanks given a min and max frequency and num_filters.
- Parameters
- Returns
mel_points – An array of floats containing evenly spaced filters (according to mel scale).
- Return type
np.ndarray [size=(num_filters,)]
References
Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
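The underlying idea can be sketched as follows; this is only an illustration of spacing points evenly on the mel scale, not necessarily soundpy’s exact implementation:
>>> import numpy as np
>>> def mel_points_sketch(fmin, fmax, num_filters):
...     mel_min = 2595 * np.log10(1 + fmin / 700)   # hz -> mel
...     mel_max = 2595 * np.log10(1 + fmax / 700)
...     return np.linspace(mel_min, mel_max, num_filters)   # evenly spaced in mel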
-
soundpy.dsp.
sinosoidal_liftering
(mfccs, cep_lifter=22)[source]¶ Reduces influence of higher coefficients; found useful in automatic speech recognition.
- Parameters
mfccs (
np.ndarray [shape=(num_samples
,num_mfcc)]
) – The matrix containing mel-frequency cepstral coefficients.cep_lifter (
int
) – The amount to applysinosoidal_liftering
. (default 22)
References
Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
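A sketch of sinusoidal liftering as described in Fayek (2016); the exact soundpy implementation may differ in details:
>>> import numpy as np
>>> def liftering_sketch(mfccs, cep_lifter=22):
...     num_frames, num_mfcc = mfccs.shape
...     n = np.arange(num_mfcc)
...     lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n / cep_lifter)
...     return mfccs * lift   # de-emphasizes higher-order coefficients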
-
soundpy.dsp.
index_at_zero
(samples, num_dec_places=2)[source]¶ Finds indices of start and end of utterance, given amplitude strength.
- Parameters
samples (
numpy.ndarray [size= (num_samples,)
or(num_samples
,num_channels)]
) – The samples to index where the zeros surrounding speech are located.num_dec_places (
int
) – To the number of decimal places the lowest value in samples should be rounded to. (default 2)
- Returns
Examples
>>> signal = np.array([-1, 0, 1, 2, 3, 2, 1, 0, -1, -2, -3, -2, -1, 0, 1]) >>> zero_1, zero_2 = index_at_zero(signal) >>> # +1 to include zero_2 in signal >>> signal[zero_1:zero_2+1] [ 0 1 2 3 2 1 0 -1 -2 -3 -2 -1 0] >>> # does not assume a zero precedes any sample >>> signal = np.array([1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1]) >>> zero_1, zero_2 = index_at_zero(signal) >>> signal[zero_1:zero_2+1] [ 0 -1 -2 -1 0]
-
soundpy.dsp.
clip_at_zero
(samples, samp_win=None, neg2pos=True, **kwargs)[source]¶ Clips the signal at samples close to zero.
The samples where clipping occurs cross the zero line from negative to positive. This clipping process allows for a smoother transition of audio, especially when concatenating audio.
- Parameters
samples (
np.ndarray [shape = (num_samples
,)
or(num_samples
,num_channels)]
) – The array containing sample data. Should work on stereo sound.start_with_zero (
bool
) – If True, the returned array will begin with 0 (or close to 0). Otherwise the array will end with 0.neg2pos (
bool
) – If True, the returned array will begin with positive values and end with negative values. Otherwise, the array will be returned with the first zeros detected, regardless of surrounding positive or negative values.samp_win (
int
, optional) – The window of samples to apply when clipping at zero crossings. The zero crossings adjacent to the main signal will be used. This is useful to remove already existing clicks within the signal, often found at the beginning and / or end of signals.kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.dsp.index_at_zero
.
Warning
A warning is raised if only one zero is found.
Examples
>>> sig = np.array([-2,-1,0,1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1,0]) >>> clip_at_zero(sig) # defaults [ 0 1 2 1 0 -1 -2 -1 0] >>> # finds first and last instance of zeros, regardless of surrounding >>> # negative or positive values in signal >>> clip_at_zero(sig, neg2pos = False) [ 0 1 2 1 0 -1 -2 -1 0 1 2 1 0] >>> # avoid clicks at start of signal >>> sig = np.array([0,-10,-20,-1,0,1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1,0]) >>> clip_at_zero(sig, samp_win = 5) [ 0 1 2 1 0 -1 -2 -1 0]
-
soundpy.dsp.
remove_dc_bias
(samples, samp_win=None)[source]¶ Removes DC bias by subtracting mean from sample data.
Seems to work best without samp_win.
# TODO add moving average?
- Parameters
samples (
np.ndarray [shape=(samples
,num_channels)
or(samples)]
) – The sample data to center around zero. This works on both mono and stereo data.samp_win (
int
, optional) – Apply subtraction of mean at windows - experimental. (default None)
- Returns
samps – The samples with zero mean.
- Return type
np.ndarray [shape=(samples
,num_channels)
or(samples)]
References
Lyons, Richard. (2011). Understanding Digital Signal Processing (3rd Edition).
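Examples
A minimal sketch assuming simple mean subtraction, as described above:
>>> import numpy as np
>>> x = np.array([1.5, 2.0, 2.5])
>>> remove_dc_bias(x)   # the mean (2.0) is subtracted
array([-0.5,  0. ,  0.5])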
-
soundpy.dsp.
apply_num_channels
(sound_data, num_channels)[source]¶ Ensures data has indicated num_channels.
To increase number of channels, the first column will be duplicated. To limit channels, channels will simply be removed.
- Parameters
sound_data (
np.ndarray [size= (num_samples,)
or(num_samples
,num_channels)]
) – The data to adjust the number of channelsnum_channels (
int
) – The number of channels desired
- Returns
data
- Return type
np.ndarray [size = (num_samples
,num_channels)]
Examples
>>> import numpy as np >>> data = np.array([1, 1, 1, 1]) >>> data_3d = apply_num_channels(data, 3) >>> data_3d array([[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]) >>> data_2d = apply_num_channels(data_3d, 2) >>> data_2d array([[1, 1], [1, 1], [1, 1], [1, 1]])
-
soundpy.dsp.
apply_sample_length
(data, target_len, mirror_sound=False, clip_at_zero=True)[source]¶ Extends a sound by repeating it until it reaches target_len. If target_len is shorter than the length of data, data will be shortened to the specified target_len.
This is perhaps useful when working with repetitive or stationary sounds.
- Parameters
data (
np.ndarray [size = (num_samples,)
or(num_samples
,num_channels)]
) – The data to be checked or extended in length. If shape (num_channels, num_samples), the data will be reshaped to (num_samples, num_channels).target_len (
int
) – The length of samples the input data should be.
- Returns
new_data
- Return type
np.ndarray [size=(target_len
,)
or(target_len
,num_channels)]
Examples
>>> import numpy as np >>> data = np.array([1,2,3,4]) >>> sp.dsp.apply_sample_length(data, 12) array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]) >>> # two channels >>> data = np.zeros((3,2)) >>> data[:,0] = np.array([0,1,2]) >>> data[:,1] = np.array([1,2,3]) >>> data array([[0., 1.], [1., 2.], [2., 3.]]) >>> sp.dsp.apply_sample_length(data,5) array([[0., 1.], [1., 2.], [2., 3.], [0., 1.], [1., 2.]])
-
soundpy.dsp.
zeropad_sound
(data, target_len, sr, delay_sec=None)[source]¶ If the sound data needs to be a certain length, zero pad it.
- Parameters
data (
numpy.ndarray [size = (num_samples,)
or(num_samples
,num_channels)]
) – The sound data that needs zero padding. Shape (len(data),).target_len (
int
) – The number of samples the data should havesr (
int
) – The samplerate of the datadelay_sec (
int
,float
, optional) – If the data should be zero padded also at the beginning. (default None)
- Returns
signal_zeropadded – The data zero padded.
- Return type
numpy.ndarray [size = (target_len,)
or(target_len
,num_channels)]
Examples
>>> import numpy as np >>> x = np.array([1,2,3,4]) >>> # with 1 second delay (with sr of 4, that makes 4 sample delay) >>> x_zeropadded = zeropad_sound(x, target_len=10, sr=4, delay_sec=1) >>> x_zeropadded array([0., 0., 0., 0., 1., 2., 3., 4., 0., 0.]) >>> # without delay >>> x_zeropadded = zeropad_sound(x, target_len=10, sr=4) >>> x_zeropadded array([1., 2., 3., 4., 0., 0., 0., 0., 0., 0.]) >>> # if signal is longer than desired length: >>> x_zeropadded = zeropad_sound(x, target_len=3, sr=4) UserWarning: The signal cannot be zeropadded and will instead be truncated as length of `data` is 4 and `target_len` is 3. len(data), target_len)) >>> x_zeropadded array([1, 2, 3])
-
soundpy.dsp.
combine_sounds
(file1, file2, match2shortest=True, time_delay_sec=None, total_dur_sec=None)[source]¶ Combines sounds
- Parameters
file1 (
str
) – One of two files to be added togetherfile2 (
str
) – Second of two files to be added togethermatch2shortest (
bool
) – If the lengths of the addition should be limited by the shorter sound. (default True)time_delay_sec (
int
,float
, optional) – The amount of time in seconds before the sounds are added together. The longer sound will play for this period of time before the shorter sound is added to it. (default None)total_dur_sec (
int
,float
, optional) – The total duration in seconds of the combined sounds. (default None)
- Returns
added_sound (
numpy.ndarray
) – The sound samples of the two soundfiles added togethersr1 (
int
) – The sample rate of the original signals and added sound
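Examples
A usage sketch only; the filenames here are hypothetical:
>>> added_sound, sr = combine_sounds('sound1.wav', 'sound2.wav',
...     match2shortest=True, time_delay_sec=1)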
-
soundpy.dsp.
calc_frame_length
(dur_frame_millisec, sr)[source]¶ Calculates the number of samples necessary for each frame
- Parameters
- Returns
frame_length – the number of samples necessary to fill a frame
- Return type
Examples
>>> calc_frame_length(dur_frame_millisec=20, sr=1000) 20 >>> calc_frame_length(dur_frame_millisec=20, sr=48000) 960 >>> calc_frame_length(dur_frame_millisec=25.5, sr=22500) 573
-
soundpy.dsp.
calc_num_overlap_samples
(samples_per_frame, percent_overlap)[source]¶ Calculate the number of samples that constitute the overlap of frames
- Parameters
- Returns
num_overlap_samples – the number of samples in the overlap
- Return type
Examples
>>> calc_num_overlap_samples(samples_per_frame=100,percent_overlap=0.10) 10 >>> calc_num_overlap_samples(samples_per_frame=100,percent_overlap=10) 10 >>> calc_num_overlap_samples(samples_per_frame=960,percent_overlap=0.5) 480 >>> calc_num_overlap_samples(samples_per_frame=960,percent_overlap=75) 720
-
soundpy.dsp.
calc_num_subframes
(tot_samples, frame_length, overlap_samples, zeropad=False)[source]¶ Assigns total frames needed to process entire noise or target series
This function calculates the number of full frames that can be created given the total number of samples, the number of samples in each frame, and the number of overlapping samples.
- Parameters
tot_samples (
int
) – total number of samples in the entire seriesframe_length (
int
) – total number of samples in each frame / processing windowoverlap_samples (
int
) – number of samples in overlap between frameszeropad (
bool
, optional) – If False, number of subframes limited to full frames. If True, number of subframes extended to zeropad the last partial frame. (default False)
- Returns
subframes – The number of subframes necessary to fully process the audio samples at given frame_length, overlap_samples, and zeropad.
- Return type
Examples
>>> calc_num_subframes(30,10,5) 5 >>> calc_num_subframes(30,20,5) 3
-
soundpy.dsp.
create_window
(window_type, frame_length)[source]¶ Creates window according to set window type and frame length
The Hamming window tapers edges to around 0.08 while the Hann window tapers edges to 0.0. Both are commonly used in noise filtering.
- Parameters
window_type (
str
) – type of window to be applied (default ‘hamming’)- Returns
window – a window fitted to the class attribute ‘frame_length’
- Return type
ndarray
Examples
>>> #create Hamming window >>> hamm_win = create_window('hamming', frame_length=5) >>> hamm_win array([0.08, 0.54, 1. , 0.54, 0.08]) >>> #create Hann window >>> hann_win = create_window('hann',frame_length=5) >>> hann_win array([0. , 0.5, 1. , 0.5, 0. ])
-
soundpy.dsp.
apply_window
(samples, window, zeropad=False)[source]¶ Applies predefined window to a section of samples. Mono or stereo sound checked.
The length of the samples must be the same length as the window.
- Parameters
samples (
ndarray [shape=(num_samples,)
or(num_samples
,num_channels)]
) – series of samples with the length of input windowwindow (
ndarray [shape=(num_samples,)
or(num_samples
,num_channels)]
) – window to be applied to the signal. If the window does not match the number of channels of the sample data, the window will be expanded to the missing channels by repeating its first channel.
- Returns
samples_win – series with tapered sides according to the window provided
- Return type
ndarray
Examples
>>> import numpy as np >>> input_signal = np.array([ 0. , 0.36371897, -0.302721, ... -0.1117662 , 0.3957433 ]) >>> window_hamming = np.array([0.08, 0.54, 1. , 0.54, 0.08]) >>> apply_window(input_signal, window_hamming) array([ 0. , 0.19640824, -0.302721 , -0.06035375, 0.03165946]) >>> window_hann = np.array([0. , 0.5, 1. , 0.5, 0. ]) >>> apply_window(input_signal, window_hann) array([ 0. , 0.18185948, -0.302721 , -0.0558831 , 0. ])
-
soundpy.dsp.
add_channels
(samples, channels_total)[source]¶ Copies columns of samples to create additional channels.
- Parameters
samples (
np.ndarray [shape=(num_samples)
or(num_samples,num_channels)]
) – The samples to add channels to.channels_total (
int
) – The total number of channels desired. For example, if samples already has 2 channels and you want it to have 3, set channels_total to 3.
- Returns
x – A copy of samples with desired number of channels.
- Return type
np.ndarray [shape = (num_samples
,channels_total)]
Examples
>>> import numpy as np >>> samps_mono = np.array([1,2,3,4,5]) >>> samps_stereo2 = add_channels(samps_mono, 2) >>> samps_stereo2 array([[1, 1], ... [2, 2], ... [3, 3], ... [4, 4], ... [5, 5]]) >>> samps_stereo5 = add_channels(samps_stereo2, 5) >>> samps_stereo5 array([[1, 1, 1, 1, 1], ... [2, 2, 2, 2, 2], ... [3, 3, 3, 3, 3], ... [4, 4, 4, 4, 4], ... [5, 5, 5, 5, 5]])
Warning
No channels are added if channels_total is less than or equal to the number of channels already present in samples.
-
soundpy.dsp.
average_channels
(data)[source]¶ Averages all channels in a stereo signal into one channel.
- Parameters
data (
np.ndarray [size=(num_samples
,num_channels)]
) – The stereo data to average out. If mono data supplied, mono data is returned unchanged.- Returns
data averaged – Copy of data averaged into one channel.
- Return type
np.ndarray [size=(num_samples)]
Examples
>>> import numpy as np >>> input_samples1 = np.array([1,2,3,4,5]) >>> input_samples2 = np.array([1,1,3,3,5]) >>> input_2channels = np.vstack((input_samples1, input_samples2)).T >>> input_averaged = average_channels(input_2channels) >>> input_averaged array([1. , 1.5, 3. , 3.5, 5. ])
-
soundpy.dsp.
calc_fft
(signal_section, real_signal=None, fft_bins=None, **kwargs)[source]¶ Calculates the fast Fourier transform of a time series. Should work with stereo signals.
The length of the signal_section determines the number of frequency bins analyzed if fft_bins not set. Therefore, if there are higher frequencies in the signal, the length of the signal_section should be long enough to accommodate those frequencies.
Frequency bins with energy levels at around zero denote frequencies not prevalent in the signal; frequency bins with prevalent energy levels relate to the frequencies present in the signal, as well as their amplitudes.
- Parameters
signal_section (
ndarray [shape = (num_samples)
or(num_samples
,num_channels)]
) – the series that the fft will be applied to. If stereo sound, will return a FFT for each channel.real_signal (
bool
) – If True, only half of the fft will be returned (the fft is mirrored). Otherwise the full fft will be returned.kwargs (
additional keyword arguments
) – keyword arguments for numpy.fft.fft or numpy.fft.rfft
- Returns
fft_vals – the series transformed into the frequency domain with the same shape as the input series
- Return type
ndarray [shape=(num_fft_bins)
, or(num_fft_bins
,num_channels)
,dtype=np.complex_]
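Examples
A minimal sketch showing only the output shape, which the description above states matches the input; exact values depend on the implementation:
>>> import numpy as np
>>> sig = np.array([1., 0., -1., 0.])
>>> fft_vals = calc_fft(sig)
>>> fft_vals.shape
(4,)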
-
soundpy.dsp.
calc_power
(fft_vals)[source]¶ Calculates the power of fft values
- Parameters
fft_vals (
ndarray (complex
orfloats)
) – the fft values of a windowed section of a series- Returns
power_spec – the squared absolute value of the input fft values
- Return type
ndarray
Example
>>> import numpy as np >>> matrix = np.array([[1,1,1],[2j,2j,2j],[-3,-3,-3]], ... dtype=np.complex_) >>> calc_power(matrix) array([[0.33333333, 0.33333333, 0.33333333], [1.33333333, 1.33333333, 1.33333333], [3. , 3. , 3. ]])
-
soundpy.dsp.
calc_average_power
(matrix, num_iters)[source]¶ Divides matrix values by the number of times power values were added.
This function assumes the power values of n-number of series were calculated and added. It divides the values in the input matrix by n, i.e. ‘num_iters’.
- Parameters
matrix (
ndarray
) – a collection of floats or ints representing the sum of power values across several series setsnum_iters (
int
) – an integer denoting the number of times power values were added to the input matrix
- Returns
matrix – the averaged input matrix
- Return type
ndarray
Examples
>>> matrix = np.array([[6,6,6],[3,3,3],[1,1,1]]) >>> ave_matrix = calc_average_power(matrix, 3) >>> ave_matrix array([[2. , 2. , 2. ], [1. , 1. , 1. ], [0.33333333, 0.33333333, 0.33333333]])
-
soundpy.dsp.
calc_phase
(fft_matrix, radians=False)[source]¶ Calculates phase from complex fft values.
- Parameters
fft_matrix (
np.ndarray [shape=(num_frames
,num_features)
,dtype=complex]
) – matrix with fft valuesradians (
boolean
) – If False, complex values are returned; if True, radians are returned. (default False)
- Returns
phase – Phase values for fft_vals. If radians is set to False, dtype = complex. If radians is set to True, dtype = float.
- Return type
np.ndarray [shape=(num_frames
,num_features)]
Examples
>>> import numpy as np >>> frame_length = 10 >>> time = np.arange(0, 10, 0.1) >>> signal = np.sin(time)[:frame_length] >>> fft_vals = np.fft.fft(signal) >>> phase = calc_phase(fft_vals, radians=False) >>> phase[:2] array([ 1. +0.j , -0.37872566+0.92550898j]) >>> phase = calc_phase(fft_vals, radians=True) >>> phase[:2] array([0. , 1.95921533])
-
soundpy.dsp.
reconstruct_whole_spectrum
(band_reduced_noise_matrix, n_fft=None)[source]¶ Reconstruct whole spectrum by mirroring complex conjugate of data.
- Parameters
band_reduced_noise_matrix (
np.ndarray [size=(n_fft,)
,dtype=np.float
ornp.complex_]
) – Matrix with either power or fft values of the left part of the fft. The whole fft can be provided; however the right values will be overwritten by a mirrored left side.n_fft (
int
, optional) – If None, n_fft set to length of band_reduced_noise_matrix. n_fft defines the size of the mirrored vector.
- Returns
output_matrix – Mirrored vector of input data.
- Return type
np.ndarray [size = (n_fft,)
,dtype=np.float
ornp.complex_]
Examples
>>> x = np.array([3.,2.,1.,0.]) >>> # double the size of x >>> x_rec = sp.dsp.reconstruct_whole_spectrum(x, n_fft=int(len(x)*2)) >>> x_rec array([3., 2., 1., 0., 0., 1., 2., 3.]) >>> # overwrite right side of data >>> x = np.array([3.,2.,1.,0.,0.,2.,3.,5.]) >>> x_rec = sp.dsp.reconstruct_whole_spectrum(x, n_fft=len(x)) >>> x_rec array([3., 2., 1., 0., 0., 1., 2., 3.])
-
soundpy.dsp.
apply_original_phase
(spectrum, phase)[source]¶ Multiplies the phase into the magnitude or power spectrum
- Parameters
spectrum (
np.ndarray [shape=(n,)
,dtype=np.float
ornp.complex]
) – Magnitude or power spectrumphase (
np.ndarray [shape=(n,)
,dtype=np.float
ornp.complex]
) – Phase to be applied to spectrum
- Returns
spectrum_complex
- Return type
np.ndarray [shape=(n,)
,dtype = np.complex]
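Examples
A sketch assuming a magnitude spectrum and complex (unit-phasor) phase values such as those returned by soundpy.dsp.calc_phase with radians=False; multiplying the two should recover the original complex fft:
>>> import numpy as np
>>> sig = np.array([1., 2., 3., 4.])
>>> fft_vals = np.fft.fft(sig)
>>> mag = np.abs(fft_vals)
>>> phase = calc_phase(fft_vals)
>>> np.allclose(apply_original_phase(mag, phase), fft_vals)
True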
-
soundpy.dsp.
calc_posteri_snr
(target_power_spec, noise_power_spec)[source]¶ Calculates the signal-to-noise ratio of the current frame
- Parameters
target_power_spec (
ndarray
) – matrix containing power values of the target signalnoise_power_spec (
ndarray
) – matrix containing power values of the noise signal
- Returns
posteri_snr – matrix containing the signal to noise ratio
- Return type
ndarray
Examples
>>> sig_power = np.array([6,6,6,6]) >>> noise_power = np.array([2,2,2,2]) >>> calc_posteri_snr(sig_power, noise_power) array([3., 3., 3., 3.])
-
soundpy.dsp.
get_local_target_high_power
(target_samples, sr, local_size_ms=25, min_power_percent=0.25)[source]¶
-
soundpy.dsp.
get_vad_snr
(target_samples, noise_samples, sr, extend_window_ms=0)[source]¶ Approximates the signal-to-noise ratio of two sets of power spectra
Note: this is a simple implementation and should not be used for official/exact measurement of snr.
- Parameters
target_samples (
np.ndarray [size = (num_samples
,)]
) – The samples of the main / speech signal. Only frames with higher levels of energy will be used to calculate SNR.noise_samples (
np.ndarray [size = (num_samples
,)]
) – The samples of background noise. Expects only noise, no speech. Must be the same sample rate as the target_samplessr (
int
) – The sample rate for the audio samples.local_size_ms (
int
orfloat
) – The length in milliseconds to calculate level of SNR. (default 25)min_power_percent (
float
) – The minimum percentage of energy / power the target samples should have. This is to look at only sections with speech or other signal of interest and not periods of silence. Value should be between 0 and 1. (default 0.25)
References
http://www1.icsi.berkeley.edu/Speech/faq/speechSNR.html
Gomolka, Ryszard. (2017). Re: How to measure signal-to-noise ratio (SNR) in real time?. Retrieved from: https://www.researchgate.net/post/How_to_measure_signal-to-noise_ratio_SNR_in_real_time/586a880f217e2060b65a8853/citation/download.
https://www.who.int/occupational_health/publications/noise1.pdf
-
soundpy.dsp.
snr_adjustnoiselevel
(target_samples, noise_samples, sr, snr)[source]¶ Computes scale factor to adjust noise samples to achieve snr.
From script addnoise_asl_nseg.m: This function adds noise to a file at a specified SNR level. It uses the active speech level to compute the speech energy. The active speech level is computed as per ITU-T P.56 standard.
soundpy Note: this functionality was pulled from the MATLAB script: addnoise_asl_nseg.m at this GitHub repo: https://github.com/SIP-Lab/CNN-VAD/blob/master/Training%20Code/Functions/addnoise_asl_nseg.m
I do not understand all that went into calculating the scale factor and therefore do not explain anything further than the original script.
- Parameters
target_samples (
np.ndarray [size = (num_samples,)]
) – The audio samples of the target / clean signal.noise_samples (
np.ndarray [size = (num_samples,)]
) – The audio samples of the noise signal.sr (
int
) – The sample rate of both target_samples and noise_samplessnr (
int
) – The desired signal-to-noise ratio of the target and noise audio signals.
- Returns
scale_factor – The factor to which noise samples should be multiplied before being added to target samples to achieve SNR.
- Return type
References
- Yi Hu and Philipos C. Loizou, original authors
Copyright (c) 2006 by Philipos C. Loizou
- SIP-Lab/CNN-VAD/GitHub Repo
Copyright (c) 2019 Signal and Image Processing Lab MIT License
ITU-T (1993). Objective measurement of active speech level. ITU-T Recommendation P. 56
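Examples
A usage sketch with hypothetical sample arrays, based on the return value described above:
>>> scale_factor = snr_adjustnoiselevel(target_samples, noise_samples,
...     sr=16000, snr=5)
>>> noisy_target = target_samples + noise_samples * scale_factor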
See also
-
soundpy.dsp.
asl_P56
(samples, sr, bitdepth=16, smooth_factor=0.03, hangover=0.2, margin_db=15.9)[source]¶ Computes the active speech level according to ITU-T P.56 standard.
Note: I don’t personally understand the functionality behind this function and therefore do not offer the best documentation as of yet.
- Parameters
samples (
np.ndarray [size = (num_samples
,)]
) – The audio samples, for example speech samples.sr (
int
) – The sample rate of samples.bitdepth (
int
) – The bitdepth of audio. Expects 16. (default 16)smooth_factor (
float
) – Time smoothing factor. (default 0.03)hangover (
float
) – Hangover. Thank goodness not the kind I’m familiar with. (default 0.2)
- Returns
References
ITU-T (1993). Objective measurement of active speech level. ITU-T Recommendation P. 56
TODO handle bitdepth variation - what if not 16? TODO improve documentation
-
soundpy.dsp.
calc_posteri_prime
(posteri_snr)[source]¶ Calculates the posteri prime
- Parameters
posteri_snr (
ndarray
) – The signal-to-noise ratio of the noisy signal, frame by frame.- Returns
posteri_prime – The primed posteri_snr, calculated according to the reference paper.
- Return type
ndarray
References
Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632.
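Based on the referenced paper, the primed value can be sketched as the a posteriori SNR minus one, floored at zero; this is a hypothesized illustration, not necessarily soundpy’s exact code:
>>> import numpy as np
>>> def calc_posteri_prime_sketch(posteri_snr):
...     posteri_prime = posteri_snr - 1
...     posteri_prime[posteri_prime < 0] = 0
...     return posteri_prime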
-
soundpy.dsp.
calc_prior_snr
(snr, snr_prime, smooth_factor=0.98, first_iter=None, gain=None)[source]¶ Estimates the signal-to-noise ratio of the previous frame
Depending on the first_iter argument, the prior snr is calculated according to different algorithms. If first_iter is None, prior snr is calculated according to Scalart and Filho (1996); if first_iter is True or False, snr prior is calculated according to Loizou (2013).
- Parameters
snr (
ndarray
) – The signal-to-noise ratio of target vs noise power/energy levels.snr_prime (
ndarray
) – The prime of the snr (see Scalart & Filho (1996))smooth_factor (
float
) – The value applied to smooth the signal. (default 0.98)first_iter (
None
,True
,False
) – If None, snr prior values are estimated the same, no matter if it is the first iteration or not (Scalart & Filho (1996)); if True, snr prior values are estimated without gain (Loizou 2013); if False, snr prior values are estimated with gain (Loizou 2013). (default None)gain (
None
,ndarray
) – If None, gain will not be used. If gain, it is a previously calculated value from the previous frame. (default None)
- Returns
prior_snr – Estimation of signal-to-noise ratio of the previous frame of target signal.
- Return type
ndarray
References
C Loizou, P. (2013). Speech Enhancement: Theory and Practice.
Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632.
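A sketch of the decision-directed estimate, hypothesized from the parameter names (with snr_prime playing the role of max(posteri_snr - 1, 0)); soundpy’s handling of first_iter may differ:
>>> def calc_prior_snr_sketch(snr, snr_prime, smooth_factor=0.98, gain=1.0):
...     # prior = alpha * G**2 * gamma + (1 - alpha) * max(gamma - 1, 0)
...     return smooth_factor * (gain ** 2) * snr + (1 - smooth_factor) * snr_prime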
-
soundpy.dsp.
calc_gain
(prior_snr)[source]¶ Calculates the gain (i.e. attenuation) values to reduce noise.
- Parameters
prior_snr (
ndarray
) – The prior signal-to-noise ratio estimation- Returns
gain – An array of attenuation values to be applied to the signal (stft) array at the current frame.
- Return type
ndarray
References
C Loizou, P. (2013). Speech Enhancement: Theory and Practice.
Scalart, P. and Filho, J. (1996). Speech enhancement based on a priori signal to noise estimation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 629-632.
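A sketch assuming the standard Wiener gain function G = xi / (1 + xi), where xi is the a priori SNR; the exact soundpy formula may differ:
>>> def calc_gain_sketch(prior_snr):
...     return prior_snr / (1.0 + prior_snr)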
-
soundpy.dsp.
apply_gain_fft
(fft_vals, gain)[source]¶ Reduces noise by applying gain values to the stft / fft array of the target signal
- Parameters
fft_vals (
ndarray(complex)
) – Matrix containing complex values (i.e. stft values) of target signalgain (
ndarray(real)
) – Matrix containing calculated attenuation values to apply to ‘fft_vals’
- Returns
enhanced_fft – Matrix with attenuated noise in target (stft) values
- Return type
ndarray(complex)
-
soundpy.dsp.
postfilter
(original_powerspec, noisereduced_powerspec, gain, threshold=0.4, scale=10)[source]¶ Apply filter that reduces musical noise resulting from other filter.
If it is estimated that speech (or target signal) is present, reduced filtering is applied.
References
T. Esch and P. Vary, “Efficient musical noise suppression for speech enhancement system,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, 2009.
-
soundpy.dsp.
calc_ifft
(signal_section, real_signal=None, norm=False)[source]¶ Calculates the inverse fft of a series of fft values
The real values of the ifft can be saved as an audiofile
- Parameters
signal_section (
ndarray [shape=(num_freq_bins,)]
) – The frame of fft values to apply the inverse fft tonum_fft (
int
, optional) – The number of total fft values applied when calculating the original fft. If not given, length of signal_section is used.norm (
bool
) – Whether or not the ifft should apply ‘ortho’ normalization (default False)
- Returns
ifft_vals – The inverse Fourier transform of filtered audio data
- Return type
ndarray(complex)
-
soundpy.dsp.
control_volume
(samples, max_limit)[source]¶ Keeps max volume of samples to within a specified range.
- Parameters
samples (
ndarray
) – series of audio samplesmax_limit (
float
) – maximum boundary of the maximum value of the audio samples
- Returns
samples – samples with volume adjusted (if need be).
- Return type
np.ndarray
Examples
>>> import numpy as np >>> #low volume example: increase volume to desired window >>> x = np.array([-0.03, 0.04, -0.05, 0.02]) >>> x = control_volume(x, max_limit=0.25) >>> x array([-0.13888889, 0.25 , -0.25 , 0.13888889]) >>> #high volume example: decrease volume to desired window >>> y = np.array([-0.3, 0.4, -0.5, 0.2]) >>> y = control_volume(y, max_limit=0.15) >>> y array([-0.08333333, 0.15 , -0.15 , 0.08333333])
-
soundpy.dsp.
calc_power_ratio
(original_powerspec, noisereduced_powerspec)[source]¶ Calc. the ratio of original vs noise reduced power spectrum.
-
soundpy.dsp.
calc_noise_frame_len
(SNR_decision, threshold, scale)[source]¶ Calc. window length for calculating moving average.
Note: lower SNRs require larger window.
-
soundpy.dsp.
calc_linear_impulse
(noise_frame_len, num_freq_bins)[source]¶ Calc. the post filter coefficients to be applied to gain values.
-
soundpy.dsp.
spread_volumes
(samples, vol_list=[0.1, 0.3, 0.5])[source]¶ Returns samples with a range of volumes.
This may be useful in applying to training data (transforming data).
- Parameters
samples (
ndarray
) – Series belonging to acoustic signal.vol_list (
list
) – List of floats or ints representing the volumes the samples are to be oriented towards. (default [0.1,0.3,0.5])
- Returns
volrange_dict – Tuple of values containing samples at the various volumes.
- Return type
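Examples
A sketch of the hypothesized behavior: rescaling the samples so their peak matches each volume in vol_list.
>>> import numpy as np
>>> def spread_volumes_sketch(samples, vol_list=[0.1, 0.3, 0.5]):
...     peak = np.max(np.abs(samples))
...     return tuple(samples * (vol / peak) for vol in vol_list)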
-
soundpy.dsp.
create_empty_matrix
(shape, complex_vals=False)[source]¶ Allows creation of a matrix filled with real or complex zeros.
In digital signal processing, complex numbers are common; it is important to note that if complex_vals=False and complex values are inserted into the matrix, the imaginary part will be removed.
- Parameters
- Returns
matrix – a matrix filled with real or complex zeros
- Return type
ndarray
Examples
>>> matrix = create_empty_matrix((3,4)) >>> matrix array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]) >>> matrix_complex = create_empty_matrix((3,4),complex_vals=True) >>> matrix_complex array([[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j], [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j], [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]]) >>> vector = create_empty_matrix(5,) >>> vector array([0., 0., 0., 0., 0.])
-
soundpy.dsp.
overlap_add
(enhanced_matrix, frame_length, overlap, complex_vals=False)[source]¶ Overlaps and adds windowed sections together to form 1D signal.
- Parameters
- Returns
new_signal – Length equals (frame_length - overlap) * enhanced_matrix.shape[1] + overlap
- Return type
np.ndarray [shape=(frame_length,)
,dtype=float]
Examples
>>> import numpy as np >>> enhanced_matrix = np.ones((4, 4)) >>> frame_length = 4 >>> overlap = 1 >>> sig = overlap_add(enhanced_matrix, frame_length, overlap) >>> sig array([1., 1., 1., 2., 1., 1., 2., 1., 1., 2., 1., 1., 1.])
-
soundpy.dsp.
random_selection_samples
(samples, len_section_samps, wrap=False, random_seed=None, axis=0)[source]¶ Selects a section of samples, starting at random.
- Parameters
samples (
np.ndarray [shape = (num_samples
,)]
) – The array of sample datalen_section_samps (
int
) – How many samples should be randomly selectedwrap (
bool
) – If False, the selected noise will not be wrapped from end to beginning; if True, the random selection may take a sound sample that wraps from the end to the beginning. See examples below. (default False)random_seed (
int
, optional) – If replicated randomization desired. (default None)
Examples
>>> import numpy as np >>> # no wrap: >>> x = np.array([1,2,3,4,5,6,7,8,9,10]) >>> n = sp.dsp.random_selection_samples(x, len_section_samps = 7, ... wrap = False, random_seed = 40) >>> n array([3, 4, 5, 6, 7, 8, 9]) >>> # with wrap: >>> n = sp.dsp.random_selection_samples(x, len_section_samps = 7, ... wrap = True, random_seed = 40) >>> n array([ 7, 8, 9, 10, 1, 2, 3])
-
soundpy.dsp.
get_pitch
(sound, sr=16000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', **kwargs)[source]¶ Approximates pitch by collecting dominant frequencies of signal.
-
soundpy.dsp.
get_mean_freq
(sound, sr=16000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', percent_vad=0.75)[source]¶ Takes the mean of dominant frequencies of voice activated regions in a signal.
Note: silences are discarded.
The average fundamental frequency for a male voice is 125Hz; for a female voice it’s 200Hz; and for a child’s voice, 300Hz. (Russell, J., 2020)
References
Russell, James (2020) The Human Voice and the Frequency Range. Retrieved from: https://blog.accusonus.com/pro-audio-production/human-voice-frequency-range/
-
soundpy.dsp.
vad
(sound, sr, win_size_ms=50, percent_overlap=0, real_signal=False, fft_bins=None, window='hann', energy_thresh=40, freq_thresh=185, sfm_thresh=5, min_energy=None, min_freq=None, min_sfm=None, use_beg_ms=120)[source]¶ Warning: this VAD works best with sample rates above 44100 Hz.
- Parameters
References
M. H. Moattar and M. M. Homayounpour, “A simple but efficient real-time Voice Activity Detection algorithm,” 2009 17th European Signal Processing Conference, Glasgow, 2009, pp. 2549-2553.
-
soundpy.dsp.
sound_index
(speech_energy, speech_energy_mean, start=True)[source]¶ Identifies the index of where speech or energy starts or ends.
-
soundpy.dsp.
get_dom_freq
(power_values)[source]¶ If real_signal (i.e. half fft bins), might mess up values.
-
soundpy.dsp.
short_term_energy
(signal_windowed)[source]¶ Expects
signal
to be scaled (-1, 1) as well as windowed.References
-
soundpy.dsp.
bilinear_warp
(fft_value, alpha)[source]¶ Subfunction for vocal tract length perturbation.
See also
References
Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria.
-
soundpy.dsp.
piecewise_linear_warp
(fft_value, alpha, max_freq)[source]¶ Subfunction for vocal tract length perturbation.
See also
References
Kim, C., Shin, M., Garg, A., & Gowda, D. (2019). Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system. Interspeech. September 15-19, Graz, Austria.
-
soundpy.dsp.
f0_approximation
(sound, sr, low_freq=50, high_freq=300, **kwargs)[source]¶ Approximates fundamental frequency.
Limits the stft of voice-active sections to frequencies between low_freq and high_freq and takes the mean of the dominant frequencies within that range. Defaults are set at 50 and 300, as most human speech frequencies occur between 85 and 255 Hz.
References
Filters: Wiener and Band Spectral Subtraction¶
The filters module covers functions related to filtering noise out of a target signal.
-
class
soundpy.filters.
FilterSettings
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]¶ Bases:
object
Basic settings for filter related classes to inherit from.
-
sr
¶ Desired sampling rate of audio; audio will be resampled to match if it has a different sampling rate. (default 48000)
- Type
-
frame_length
¶ Number of audio samples in each frame: frame_dur multiplied with sr, divided by 1000. (default 960)
- Type
-
overlap_length
¶ Number of overlapping audio samples between subsequent frames: frame_length multiplied by percent_overlap, floored. (default 480)
- Type
-
num_fft_bins
¶ The number of frequency bins used when calculating the fft. Currently the frame_length is used to set num_fft_bins.
- Type
-
zeropad
¶ If False, only full frames of audio data are processed. If True, the last partial frame will be zeropadded. (default False)
- Type
bool
, optional
Methods
Returns window according to set window type.
-
-
class
soundpy.filters.
Filter
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶ Bases:
soundpy.filters.FilterSettings
Interactive class to explore Wiener filter settings on audio signals.
These class methods implement research based algorithms with low computational cost, aimed for noise reduction via mobile phone.
-
beta
¶ Value applied in Wiener filter that smooths the application of ‘gain’; default set according to previous research. (default 0.98)
- Type
-
first_iter
¶ Keeps track if first_iter is relevant in filtering. If True, filtering has just started, and calculations made for filtering cannot use information from previous frames; if False, calculations for filtering use information from previous frames; if None, no difference is applied when processing the 1st vs subsequent frames. (default None)
- Type
bool
, optional
-
target_subframes
¶ The number of total subsections within the total number of samples belonging to the target signal (i.e. audiofile being filtered). Until target_subframes is calculated, it is set to None. (default None)
-
noise_subframes
¶ The number of total subsections within the total number of samples belonging to the noise signal. If noise power spectrum is used, this doesn’t need to be calculated. Until noise_subframes is calculated, it is set to None. (default None)
-
gain
¶ Once calculated, the attenuation values to be applied to the fft for noise reduction. Until calculated, None. (default None)
- Type
ndarray
,None
Methods
check_volume
(samples)ensures volume of filtered signal is within the bounds of the original
get_samples
(audiofile[, dur_sec])Load signal and save original volume
get_window
()Returns window according to set window type.
set_num_subframes
(len_samples[, is_noise, …])Sets the number of target or noise subframes available for processing
set_volume
(samples[, max_vol, min_vol])Records and limits the maximum amplitude of original samples.
-
set_volume
(samples, max_vol=0.4, min_vol=0.15)[source]¶ Records and limits the maximum amplitude of original samples.
This enables the output wave to be within a range of volume that does not go below or too far above the original maximum amplitude of the signal.
- Parameters
samples (
ndarray
) – The original samples of a signal (1 dimensional), of any lengthmax_vol (
float
) – The maximum volume level. If a signal has values higher than this number, the signal is curtailed to remain at and below this number.min_vol (
float
) – The minimum volume level. If a signal has only values lower than this number, the signal is amplified to be at this number and below.
- Returns
- Return type
-
-
class
soundpy.filters.
WienerFilter
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]¶ Bases:
soundpy.filters.Filter
Methods
check_volume
(samples)ensures volume of filtered signal is within the bounds of the original
get_samples
(audiofile[, dur_sec])Load signal and save original volume
get_window
()Returns window according to set window type.
set_num_subframes
(len_samples[, is_noise, …])Sets the number of target or noise subframes available for processing
set_volume
(samples[, max_vol, min_vol])Records and limits the maximum amplitude of original samples.
apply_postfilter
apply_wienerfilter
-
class
soundpy.filters.
BandSubtraction
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶ Bases:
soundpy.filters.Filter
Methods
Calculate over subtraction factor used in the cited paper.
calc_relevant_band
(target_powspec)Calculates band with highest energy levels.
check_volume
(samples)ensures volume of filtered signal is within the bounds of the original
get_samples
(audiofile[, dur_sec])Load signal and save original volume
get_window
()Returns window according to set window type.
set_num_subframes
(len_samples[, is_noise, …])Sets the number of target or noise subframes available for processing
set_volume
(samples[, max_vol, min_vol])Records and limits the maximum amplitude of original samples.
Provides starting and ending frequency bins/indices for each band.
update_posteri_bands
(target_powspec, …)Updates SNR of each set of bands.
apply_bandspecsub
apply_floor
apply_postfilter
sub_noise
-
setup_bands
()[source]¶ Provides starting and ending frequency bins/indices for each band.
- Parameters
self (
class
) – Contains variables num_bands (if None, set to 6) and frame_length- Returns
Sets the class variables band_start_freq and band_end_freq.
- Return type
Examples
>>> import soundpy as sp >>> import numpy as np >>> # Default is set to 6 bands: >>> fil = sp.BandSubtraction() >>> fil.setup_bands() >>> fil.band_start_freq array([ 0., 80., 160., 240., 320., 400.]) >>> fil.band_end_freq array([ 80., 160., 240., 320., 400., 480.]) >>> # change default settings >>> fil = sp.BandSubtraction(num_bands=5) >>> fil.setup_bands() >>> fil.band_start_freq array([ 0., 96., 192., 288., 384.]) >>> fil.band_end_freq array([ 96., 192., 288., 384., 480.])
-
update_posteri_bands
(target_powspec, noise_powspec)[source]¶ Updates SNR of each set of bands.
MATLAB code from the speech enhancement book uses power, puts it into magnitude (via square root), then puts it back into power...? And uses some sort of ‘norm’ function… which I think is actually just the sum. The original equation can be found in the paper below (page 117 of the book?).
paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise.
I am using power for the time being.
Examples
>>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for space: >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with and without noise >>> time = np.arange(0, 10, 0.01) >>> signal = np.sin(time)[:fil.frame_length] >>> np.random.seed(0) >>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960) >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2 >>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy) >>> fil.snr_bands array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895]) >>> # compare with no noise in signal: >>> fil.update_posteri_bands(powerspec_clean, powerspec_clean) >>> fil.snr_bands array([0., 0., 0., 0.])
-
calc_oversub_factor
()[source]¶ Calculate over subtraction factor used in the cited paper.
Uses decibel SNR values calculated in update_posteri_bands()
paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise.
Examples
>>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for space: >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with and without noise >>> time = np.arange(0, 10, 0.01) >>> signal = np.sin(time)[:fil.frame_length] >>> np.random.seed(0) >>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960) >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2 >>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy) >>> fil.snr_bands array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895]) >>> a = fil.calc_oversub_factor() >>> a array([4.28678354, 4.75 , 4.75 , 4.75 ]) >>> # compare with no noise in signal: >>> fil.update_posteri_bands(powerspec_clean, powerspec_clean) >>> fil.snr_bands array([0., 0., 0., 0.]) >>> a = fil.calc_oversub_factor() >>> a array([4., 4., 4., 4.])
-
calc_relevant_band
(target_powspec)[source]¶ Calculates band with highest energy levels.
- Parameters
self (
class instance
) – Contains class variables band_start_freq and band_end_freq.target_powerspec (
np.ndarray
) – Power spectrum of the target signal.
- Returns
rel_band_index (
int
) – Index for which band contains the most energy.band_energy_matrix (
np.ndarray [size=(num_bands
,)
,dtype=np.float]
) – Power levels of each band.
Examples
>>> import soundpy as sp >>> import numpy as np >>> # setting to 4 bands for this example (default is 6): >>> fil = sp.BandSubtraction(num_bands=4) >>> fil.setup_bands() >>> # generate sine signal with frequency 25 >>> time = np.arange(0, 10, 0.01) >>> full_circle = 2 * np.pi >>> freq = 25 >>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length] >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean) >>> rel_band_index 2 >>> # and with frequency 50 >>> freq = 50 >>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length] >>> powerspec_clean = np.abs(np.fft.fft(signal))**2 >>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean) >>> rel_band_index 3
-
-
class
soundpy.filters.
FilterSettings
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, zeropad=None)[source]¶ Bases:
object
Basic settings for filter related classes to inherit from.
-
sr
¶ Desired sampling rate of audio; audio will be resampled to match if audio has other sampling rate. (default 48000)
- Type
-
frame_length
¶ Number of audio samples in each frame: frame_dur multiplied with sr, divided by 1000. (default 960)
- Type
-
overlap_length
¶ Number of overlapping audio samples between subsequent frames: frame_length multiplied by percent_overlap, floored. (default 480)
- Type
-
num_fft_bins
¶ The number of frequency bins used when calculating the fft. Currently the frame_length is used to set num_fft_bins.
- Type
-
zeropad
¶ If False, only full frames of audio data are processed. If True, the last partial frame will be zeropadded. (default False)
- Type
bool
, optional
Methods
Returns window acc.
-
-
class
soundpy.filters.
Filter
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶ Bases:
soundpy.filters.FilterSettings
Interactive class to explore Wiener filter settings on audio signals.
These class methods implement research based algorithms with low computational cost, aimed for noise reduction via mobile phone.
-
beta
¶ Value applied in Wiener filter that smooths the application of ‘gain’; default set according to previous research. (default 0.98)
- Type
-
first_iter
¶ Keeps track if first_iter is relevant in filtering. If True, filtering has just started, and calculations made for filtering cannot use information from previous frames; if False, calculations for filtering use information from previous frames; if None, no difference is applied when processing the 1st vs subsequent frames. (default None)
- Type
bool
, optional
-
target_subframes
¶ The number of total subsections within the total number of samples belonging to the target signal (i.e. audiofile being filtered). Until target_subframes is calculated, it is set to None. (default None)
-
noise_subframes
¶ The number of total subsections within the total number of samples belonging to the noise signal. If noise power spectrum is used, this doesn’t need to be calculated. Until noise_subframes is calculated, it is set to None. (default None)
-
gain
¶ Once calculated, the attenuation values to be applied to the fft for noise reduction. Until calculated, None. (default None)
- Type
ndarray
,None
Methods
check_volume
(samples)ensures volume of filtered signal is within the bounds of the original
get_samples
(audiofile[, dur_sec])Load signal and save original volume
get_window
()Returns window according to set attributes.
set_num_subframes
(len_samples[, is_noise, …])Sets the number of target or noise subframes available for processing
set_volume
(samples[, max_vol, min_vol])Records and limits the maximum amplitude of original samples.
-
__init__
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=None, zeropad=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
set_volume
(samples, max_vol=0.4, min_vol=0.15)[source]¶ Records and limits the maximum amplitude of original samples.
This enables the output wave to be within a range of volume that does not go below or too far above the original maximum amplitude of the signal.
- Parameters
samples (
ndarray
) – The original samples of a signal (1 dimensional), of any lengthmax_vol (
float
) – The maximum volume level. If a signal has values higher than this number, the signal is curtailed to remain at and below this number.min_vol (
float
) – The minimum volume level. If a signal has only values lower than this number, the signal is amplified to be at this number and below.
- Returns
- Return type
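A minimal sketch of the volume limiting described above, assuming simple peak scaling (illustrative, not soundpy's verbatim implementation):

import numpy as np

def limit_volume(samples, max_vol=0.4, min_vol=0.15):
    # Keep the peak amplitude within [min_vol, max_vol].
    peak = np.max(np.abs(samples))
    if peak > max_vol:
        samples = samples * (max_vol / peak)  # curtail overly loud signals
    elif peak < min_vol:
        samples = samples * (min_vol / peak)  # amplify overly quiet signals
    return samples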
-
-
class
soundpy.filters.
WienerFilter
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, smooth_factor=0.98, first_iter=None, zeropad=None)[source]¶ Bases:
soundpy.filters.Filter
Methods
check_volume
(samples)ensures volume of filtered signal is within the bounds of the original
get_samples
(audiofile[, dur_sec])Load signal and save original volume
get_window
()Returns window according to set attributes.
set_num_subframes
(len_samples[, is_noise, …])Sets the number of target or noise subframes available for processing
set_volume
(samples[, max_vol, min_vol])Records and limits the maximum amplitude of original samples.
apply_postfilter
apply_wienerfilter
-
class
soundpy.filters.
BandSubtraction
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶ Bases:
soundpy.filters.Filter
Methods
calc_oversub_factor
()Calculate over subtraction factor used in the cited paper.
calc_relevant_band
(target_powspec)Calculates band with highest energy levels.
check_volume
(samples)ensures volume of filtered signal is within the bounds of the original
get_samples
(audiofile[, dur_sec])Load signal and save original volume
get_window
()Returns window according to set attributes.
set_num_subframes
(len_samples[, is_noise, …])Sets the number of target or noise subframes available for processing
set_volume
(samples[, max_vol, min_vol])Records and limits the maximum amplitude of original samples.
setup_bands
()Provides starting and ending frequency bins/indices for each band.
update_posteri_bands
(target_powspec, …)Updates SNR of each set of bands.
apply_bandspecsub
apply_floor
apply_postfilter
sub_noise
-
__init__
(win_size_ms=None, percent_overlap=None, sr=None, window_type=None, max_vol=0.4, num_bands=6, band_spacing='linear', zeropad=None, smooth_factor=0.98, first_iter=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
setup_bands
()[source]¶ Provides starting and ending frequency bins/indices for each band.
- Parameters
self (
class
) – Contains variables num_bands (if None, set to 6) and frame_length.
- Returns
Sets the class variables band_start_freq and band_end_freq.
- Return type
Examples
>>> import soundpy as sp
>>> import numpy as np
>>> # Default is set to 6 bands:
>>> fil = sp.BandSubtraction()
>>> fil.setup_bands()
>>> fil.band_start_freq
array([  0.,  80., 160., 240., 320., 400.])
>>> fil.band_end_freq
array([ 80., 160., 240., 320., 400., 480.])
>>> # change default settings
>>> fil = sp.BandSubtraction(num_bands=5)
>>> fil.setup_bands()
>>> fil.band_start_freq
array([  0.,  96., 192., 288., 384.])
>>> fil.band_end_freq
array([ 96., 192., 288., 384., 480.])
-
update_posteri_bands
(target_powspec, noise_powspec)[source]¶ Updates SNR of each set of bands.
The MATLAB code from the speech enhancement book uses power, converts it into magnitude (via square root), and then back into power, and applies some sort of ‘norm’ function, which appears to simply be the sum. The original equation can be found in the paper below (page 117 of the book?).
paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise.
I am using power for the time being.
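In other words, the per-band posteriori SNR amounts to comparing summed power in each band. A minimal NumPy sketch under that assumption (names are illustrative, not soundpy's verbatim implementation):

import numpy as np

def band_snr(target_powspec, noise_powspec, band_start_freq, band_end_freq):
    # Decibel SNR per band, using summed power within each band's bins.
    snr_bands = np.zeros(len(band_start_freq))
    for i, (start, end) in enumerate(zip(band_start_freq, band_end_freq)):
        target_power = np.sum(target_powspec[int(start):int(end)])
        noise_power = np.sum(noise_powspec[int(start):int(end)])
        snr_bands[i] = 10 * np.log10(target_power / noise_power)
    return snr_bands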
Examples
>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
-
calc_oversub_factor
()[source]¶ Calculate over subtraction factor used in the cited paper.
Uses decibel SNR values calculated in update_posteri_bands()
paper: Kamath, S. D. & Loizou, P. C. (____), A multi-band spectral subtraction method for enhancing speech corrupted by colored noise.
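Judging from the values in the example below, the factor appears to follow the piecewise over-subtraction rule common in the spectral subtraction literature; this sketch reproduces those values but is an inference, not the verbatim implementation:

import numpy as np

def oversub_factor(snr_bands):
    # 4 - SNR * 3/20 for -5 <= SNR <= 20 dB,
    # clipped to 4.75 below -5 dB and to 1 above 20 dB.
    a = 4.0 - snr_bands * 3.0 / 20.0
    a = np.where(snr_bands < -5.0, 4.75, a)
    a = np.where(snr_bands > 20.0, 1.0, a)
    return a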
Examples
>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for space:
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with and without noise
>>> time = np.arange(0, 10, 0.01)
>>> signal = np.sin(time)[:fil.frame_length]
>>> np.random.seed(0)
>>> noise = np.random.normal(np.mean(signal),np.mean(signal)+0.3,960)
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> powerspec_noisy = np.abs(np.fft.fft(signal + noise))**2
>>> fil.update_posteri_bands(powerspec_clean, powerspec_noisy)
>>> fil.snr_bands
array([ -1.91189028, -39.22078063, -44.16682922, -45.65265895])
>>> a = fil.calc_oversub_factor()
>>> a
array([4.28678354, 4.75      , 4.75      , 4.75      ])
>>> # compare with no noise in signal:
>>> fil.update_posteri_bands(powerspec_clean, powerspec_clean)
>>> fil.snr_bands
array([0., 0., 0., 0.])
>>> a = fil.calc_oversub_factor()
>>> a
array([4., 4., 4., 4.])
-
calc_relevant_band
(target_powspec)[source]¶ Calculates band with highest energy levels.
- Parameters
self (
class instance
) – Contains class variables band_start_freq and band_end_freq.target_powspec (
np.ndarray
) – Power spectrum of the target signal.
- Returns
rel_band_index (
int
) – Index for which band contains the most energy.band_energy_matrix (
np.ndarray [size=(num_bands
,)
,dtype=np.float]
) – Power levels of each band.
Examples
>>> import soundpy as sp
>>> import numpy as np
>>> # setting to 4 bands for this example (default is 6):
>>> fil = sp.BandSubtraction(num_bands=4)
>>> fil.setup_bands()
>>> # generate sine signal with frequency 25
>>> time = np.arange(0, 10, 0.01)
>>> full_circle = 2 * np.pi
>>> freq = 25
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
2
>>> # and with frequency 50
>>> freq = 50
>>> signal = np.sin((freq*full_circle)*time)[:fil.frame_length]
>>> powerspec_clean = np.abs(np.fft.fft(signal))**2
>>> rel_band_index, band_power_energies = fil.calc_relevant_band(powerspec_clean)
>>> rel_band_index
3
-
Extract and manipulate audio features¶
Feats module includes functions related to converting audio sample data to features for analysis, filtering, machine learning, or visualization.
-
soundpy.feats.
plot
(feature_matrix, feature_type, save_pic=False, name4pic=None, energy_scale='power_to_db', title=None, sr=None, win_size_ms=None, percent_overlap=None, x_label=None, y_label=None, subprocess=False, overwrite=False)[source]¶ Visualize feature extraction; frames on x axis, features on y axis. Uses librosa to scale the data if a scale is applied.
Note: can only take multiple channels if feature_type is ‘signal’. For other feature types, the plot will not work as expected.
- Parameters
feature_matrix (
np.ndarray [shape=(num_samples,)
,(num_samples
,num_channels)
, or(num_features
,num_frames)
,dtype=np.float].
) – Matrix of features. If the features are not of type ‘signal’ and the shape is 1 D, one dimension will be added to be plotted with a colormesh.feature_type (
str
) – Options: ‘signal’, ‘stft’, ‘mfcc’, or ‘fbank’ features, or what the user would like to name the feature set. signal: the 1D samples of sound. STFT: short-time Fourier transform. MFCC: mel frequency cepstral coefficients. FBANK: mel-log filterbank energies (default ‘fbank’).save_pic (
bool
) – True to save image as .png; False to just plot it. If subprocess
is True, save_pic will automatically be set to True.name4pic (
str
, optional) – If save_pic set to True, the name the image should be saved under.energy_scale (
str
, optional) – If features need to be adjusted, e.g. from power to decibels. Default is ‘power_to_db’.title (
str
, optional) – The title for the graph. If None, feature_type is used.sr (
int
, optional) – Useful in plotting the time for features.win_size_ms (
int
,float
, optional) – Useful in plotting the time for features in the frequency domain (e.g. STFT, FBANK, MFCC features)percent_overlap (
int
,float
, optional) – Useful in plotting the time for features in the frequency domain (e.g. STFT, FBANK, MFCC features)x_label (
str
, optional) – The label to be applied to the x axis.y_label (
str
, optional) – The label to be applied to the y axis.subprocess (
bool
) – If subprocess
is True, matplotlib will use backend ‘Agg’, which only allows plots to be saved. If subprocess
is False, the default backend ‘TkAgg’ will be used, which allows plots to be generated live as well as saved. The ‘Agg’ backend is useful if one wants to visualize sound while a main process is being performed, for example, while a model is being trained. (default False)overwrite (
bool
) – If False, if .png file already exists under given name, a date tag will be added to the .png filename to avoid overwriting the file. (default False)
- Returns
- Return type
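A typical call might look like the following sketch (the file path and settings are hypothetical):

>>> import soundpy as sp
>>> feats = sp.feats.get_feats('audio.wav', sr=16000, feature_type='fbank')
>>> sp.feats.plot(feats, feature_type='fbank', sr=16000, win_size_ms=20,
...               percent_overlap=0.5, save_pic=True, name4pic='fbank_plot')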
-
soundpy.feats.
plotsound
(audiodata, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, fft_bins=None, num_filters=40, num_mfcc=40, sr=None, save_pic=False, name4pic=None, energy_scale='power_to_db', mono=None, real_signal=False, **kwargs)[source]¶ Visualize feature extraction depending on set parameters.
Stereo sound can be graphed. If feature_type is ‘signal’, all channels will be graphed on same plot. Otherwise, each channel will be plotted separately.
- Parameters
audiodata (
str
,numpy.ndarray [size=(num_samples,)
or(num_samples
,num_channels)]
) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.feature_type (
str
) – Options: ‘signal’, ‘mfcc’, or ‘fbank’ features. MFCC: mel frequency cepstral coefficients; FBANK: mel-log filterbank energies (default ‘fbank’)win_size_ms (
int
orfloat
) – Window length in milliseconds for Fourier transform to be applied (default 20)percent_overlap (
int
orfloat
) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.num_filters (
int
) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. (default 40)num_mfcc (
int
) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. Note: it is not possible to choose only 2-13 or 13-40; if num_mfcc is set to 40, all 40 coefficients will be included. (default 40).sr (
int
, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)mono (
bool
, optional) – When loading an audiofile, True will limit number of channels to one; False will allow more channels to be loaded. (default None, which results in mono channel loading.)**kwargs (
additional keyword arguments
) – Keyword arguments for soundpy.feats.plot
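For example (the file path is hypothetical):

>>> import soundpy as sp
>>> sp.feats.plotsound('audio.wav', feature_type='fbank', win_size_ms=20,
...                    percent_overlap=0.5, num_filters=40)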
-
soundpy.feats.
get_feats
(sound, sr=None, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, window='hann', fft_bins=None, num_filters=None, num_mfcc=None, remove_first_coefficient=False, sinosoidal_liftering=False, dur_sec=None, mono=None, rate_of_change=False, rate_of_acceleration=False, subtract_mean=False, real_signal=True, fmin=None, fmax=None, zeropad=True)[source]¶ Collects raw signal data, stft, fbank, or mfcc features.
- Parameters
sound (
str
ornumpy.ndarray [size=(num_samples,)
or(num_samples
,num_channels)]
) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.sr (
int
, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)feature_type (
str
) – Options include ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, or ‘mfcc’ data (default ‘fbank’). signal: energy/amplitude measurements along time. STFT: short-time Fourier transform. powspec: power spectrum (absolute value of stft, squared). FBANK: mel-log filterbank energies. MFCC: mel frequency cepstral coefficients.win_size_ms (
int
orfloat
) – Window length in milliseconds for Fourier transform to be applied (default 20)percent_overlap (
int
orfloat
) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.window (
str
ornp.ndarray [size (n_fft
,)]
) – The window function to be applied to each window. (Default ‘hann’)fft_bins (
int
) – Number of frequency bins to apply in fast Fourier transform. (default None)num_filters (
int
) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. If None, will be set to 40. (default None)num_mfcc (
int
) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. If None, will be set to num_filters or 40. (default None).dur_sec (
float
, optional) – Time in seconds to limit in loading a signal. (default None)mono (
bool
, optional) – For loading an audiofile, True will result in only one channel of data being loaded; False will allow additional channels be loaded. (default None, which results in mono channel data)rate_of_change (
bool
) – If True, the first derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.rate_of_acceleration (
bool
) – If True, the second derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.subtract_mean (
bool
) – If True, the mean of each feature column will be subtracted from each row. This is applicable for all feature types except ‘signal’.
- Returns
feats – Feature data. If feature_type is ‘signal’, returns a tuple containing samples and sampling rate. If feature_type is of another type, returns np.ndarray with shape (num_frames, num_filters/features)
- Return type
tuple (num_samples
,sr)
ornp.ndarray [size (num_frames
,num_filters) dtype=np.float
ornp.complex]
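A sketch of the two return forms described above (the file path is hypothetical):

>>> import soundpy as sp
>>> # 'signal' returns a tuple of (samples, sr)
>>> samples, sr = sp.feats.get_feats('audio.wav', feature_type='signal')
>>> # other feature types return a 2D array of shape (num_frames, num_features)
>>> mfcc = sp.feats.get_feats(samples, sr=sr, feature_type='mfcc', num_mfcc=13)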
-
soundpy.feats.
load_feat_settings
(feat_settings_dict)[source]¶ Loads feature settings into a named tuple. Sets defaults if not present. TODO: test with previous versions
-
soundpy.feats.
get_stft
(sound, sr=22050, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', zeropad=True, **kwargs)[source]¶ Returns short-time Fourier transform matrix.
This function allows more flexibility in number of fft_bins and real_signal settings. Additionally, this does not require the package librosa, making it a bit easier to manipulate if desired. For an example, see
soundpy.augment.vtlp
.- Parameters
sound (
np.ndarray [shape=(num_samples,)
or(num_samples
,num_channels)]
,str
, orpathlib.PosixPath
) – If type np.ndarray, expect raw samples in mono or stereo sound. If type str or pathlib.PosixPath, expect pathway to audio file.sr (
int
) – The sample rate of sound.win_size_ms (
int
,float
) – Window length in milliseconds for Fourier transform to be applied (default 50)percent_overlap (
int
orfloat
) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.real_signal (
bool
) – If True, only half the FFT spectrum will be used; there should really be no difference as the FFT is symmetrical. If anything, setting real_signal to True may speed up functionality / make functions more efficient.fft_bins (
int
) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)window (
str
) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)zeropad (
bool
) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.files.loadsound
.
- Returns
stft_matrix
- Return type
np.ndarray[size=(num_frames
,fft_bins)]
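As a sketch, the ‘powspec’ feature described elsewhere in this module is simply the squared magnitude of this matrix (the file path is hypothetical):

>>> import numpy as np
>>> import soundpy as sp
>>> stft = sp.feats.get_stft('audio.wav', sr=22050, win_size_ms=50,
...                          percent_overlap=0.5, fft_bins=1024)
>>> powspec = np.abs(stft)**2  # stft without complex values, squared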
-
soundpy.feats.
get_fbank
(sound, sr, num_filters, fmin=None, fmax=None, fft_bins=None, **kwargs)[source]¶ Extract mel-filterbank energy features from audio.
- Parameters
sound (
np.ndarray [size=(num_samples,)
or(num_samples
,num_features)]
,str
, orpathlib.PosixPath
) – Sound in raw samples, a power spectrum, or a short-time-fourier-transform. If type string or pathlib.PosixPath, expect pathway to audio file.sr (
int
) – The sample rate of sound.num_filters (
int
) – The number of mel-filters to use when extracting mel-filterbank energies.fmin (
int
orfloat
, optional) – The minimum frequency of interest. If None, will be set to 0. (default None)fmax (
int
orfloat
, optional) – The maximum frequency of interest. If None, will be set to half of sr. (default None)fft_bins (
int
, optional) – The number of frequency bins / fast Fourier transform bins used in calculating the fast Fourier transform. If None, set depending on type of parameter sound. If sound is a raw signal or audio pathway, fft_bins will be set to 1024; if sound is a STFT or power spectrum, fft_bins will be set to 2 * length of sound feature column, or 2 * sound.shape[1].**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.feats.get_stft
.
- Returns
fbank – The mel-filterbank energies extracted. The number of samples depends on the parameters applied in
soundpy.feats.get_stft
.- Return type
np.ndarray [shape=(num_samples
,num_filters)]
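A usage sketch, assuming soundpy.files.loadsound accepts an sr argument and returns a (samples, sr) tuple (the file path is hypothetical):

>>> import soundpy as sp
>>> samples, sr = sp.files.loadsound('audio.wav', sr=16000)
>>> fbank = sp.feats.get_fbank(samples, sr=sr, num_filters=40)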
References
Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
-
soundpy.feats.
get_mfcc
(sound, sr, num_mfcc, remove_first_coefficient=False, sinosoidal_liftering=False, **kwargs)[source]¶ Extracts mel-frequency cepstral coefficients from audio.
- Parameters
sound (
np.ndarray [size=(num_samples,)
or(num_samples
,num_features)]
orstr
orpathlib.PosixPath
) – If sound is a np.ndarray, expected as raw samples, a power spectrum or a short-time Fourier transform. If string or pathlib.PosixPath, should be the pathway to the audio file.sr (
int
) – The sample rate of the sound.num_mfcc (
int
) – The number of mel-frequency cepstral coefficientsremove_first_coefficient (
bool
) – If True, the first coefficient, representing amplitude or volume of signal, is removed. Found to sometimes improve automatic speech recognition. (default False)sinosoidal_liftering (
bool
) – If True, reduces the influence of higher coefficients, found to aid in handling background noise in automatic speech recognition. (default False)**kwargs (
additional keyword arguments
) – Keyword arguments for soundpy.feats.get_fbank()
References
Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
-
soundpy.feats.
get_vad_stft
(sound, sr=48000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, zeropad=True, **kwargs)[source]¶ Returns STFT matrix and VAD matrix. STFT matrix contains only VAD sections.
- Parameters
sound (
str
ornumpy.ndarray [size=(num_samples,)
or(num_samples
,num_channels)]
) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.sr (
int
, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)win_size_ms (
int
orfloat
) – Window length in milliseconds for Fourier transform to be applied (default 50)percent_overlap (
int
orfloat
) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.real_signal (
bool
) – If True, only half the FFT spectrum will be used; there should really be no difference as the FFT is symmetrical. If anything, setting real_signal to True may speed up functionality / make functions more efficient.fft_bins (
int
) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)window (
str
) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)use_beg_ms (
int
) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.extend_window_ms (
int
) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.energy_thresh (
int
) – The threshold to set for measuring energy for VAD in the signal. (default 40)freq_thresh (
int
) – The threshold to set for measuring frequency for VAD in the signal. (default 185)sfm_thresh (
int
) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)zeropad (
bool
) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.files.loadsound
- Returns
stft_matrix (
np.ndarray [size=(num_frames_vad
,fft_bins//2+1)
,dtype=np.complex_]
) – The STFT matrix frames of where voice activity has been detected.vad_matrix_extwin (
np.ndarray [size=(num_frames,)]
) – A vector containing indices of the full STFT matrix for frames where voice activity was detected or not.
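A usage sketch (the file path and settings are illustrative):

>>> import soundpy as sp
>>> stft_vad, vad_matrix = sp.feats.get_vad_stft('speech.wav', sr=48000,
...                                              win_size_ms=50, use_beg_ms=120,
...                                              extend_window_ms=100)
>>> # stft_vad holds only the frames flagged as voice activity in vad_matrix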
-
soundpy.feats.
get_stft_clipped
(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]¶ Returns STFT matrix and VAD matrix with beginning and ending silence removed.
- Parameters
samples (
str
ornumpy.ndarray [size=(num_samples,)
or(num_samples
,num_channels)]
) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.sr (
int
, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.win_size_ms (
int
orfloat
) – Window length in milliseconds for Fourier transform to be applied (default 50)percent_overlap (
int
orfloat
) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.extend_window_ms (
int
) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.window (
str
) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)zeropad (
bool
) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.files.loadsound
.
- Returns
stft_speech (
np.ndarray [size (num_frames_clipped
,fft_bins//2+1)]
) – The STFT of the samples with beginning and ending silences clipped.vad_matrix (
np.ndarray [size (num_frames
,)]
) – A vector with zeros and ones indicating which indices of the full STFT have voice activity or not.
-
soundpy.feats.
get_vad_samples
(sound, sr=None, win_size_ms=50, percent_overlap=0.5, use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, window='hann', zeropad=True, **kwargs)[source]¶ Returns samples and VAD matrix. Only samples where voice activity was detected are returned.
- Parameters
sound (
str
ornumpy.ndarray [size=(num_samples,)
or(num_samples
,num_channels)]
) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.sr (
int
, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)win_size_ms (
int
orfloat
) – Window length in milliseconds for Fourier transform to be applied (default 50)percent_overlap (
int
orfloat
) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.use_beg_ms (
int
) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.extend_window_ms (
int
) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.energy_thresh (
int
) – The threshold to set for measuring energy for VAD in the signal. (default 40)freq_thresh (
int
) – The threshold to set for measuring frequency for VAD in the signal. (default 185)sfm_thresh (
int
) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)window (
str
) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)zeropad (
bool
) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.files.loadsound
- Returns
samples_matrix (
np.ndarray [size = (num_samples_vad
,)]
) – The samples of where voice activity was detected.vad_matrix_extwin (
np.ndarray [size = (num_frames
,)]
) – A vector of zeros and ones indicating which frames / windows of the samples had voice activity or not.
-
soundpy.feats.
get_samples_clipped
(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]¶ Returns samples and VAD matrix with beginning and ending silence removed.
- Parameters
samples (
str
ornumpy.ndarray [size=(num_samples,)
or(num_samples
,num_channels)]
) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.sr (
int
, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.win_size_ms (
int
orfloat
) – Window length in milliseconds for Fourier transform to be applied (default 50)percent_overlap (
int
orfloat
) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.extend_window_ms (
int
) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired. (default 0)window (
str
) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)zeropad (
bool
) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.files.loadsound
.
- Returns
samples_clipped (
np.ndarray [size (num_samples_clipped
,)]
) – The samples with beginning and ending silences clipped.vad_matrix (
np.ndarray [size (num_frames
,)]
) – A vector with zeros and ones indicating which frames of the full signal have voice activity or not.
-
soundpy.feats.
normalize
(data, max_val=None, min_val=None)[source]¶ Normalizes data to be between 0 and 1. Should not be applied to raw sample data.
This is useful if you have predetermined max and min values you want to normalize new data with. Is helpful in training models on sound features (not raw samples).
- Parameters
data (
np.ndarray [size=(num_features,)
or(num_frames,num_features)]
) – Data to be normalized.max_val (
int
orfloat
, optional) – Predetermined maximum value. If None, will use max value from data.min_val (
int
orfloat
, optional) – Predetermined minimum value. If None, will use min value from data.
- Returns
normed_data
- Return type
np.ndarray [size = (num_features,)
or(num_frames
,num_features)]
Examples
>>> # using the min and max of a previous dataset:
>>> import numpy as np
>>> np.random.seed(0)
>>> input_samples = np.random.random_sample((5,))
>>> input_samples
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ])
>>> np.random.seed(40)
>>> previous_samples = np.random.random_sample((5,))
>>> previous_samples
array([0.40768703, 0.05536604, 0.78853488, 0.28730518, 0.45035059])
>>> max_prev = np.max(previous_samples)
>>> min_prev = np.min(previous_samples)
>>> output_samples = normalize(input_samples, min_val = min_prev, max_val = max_prev)
>>> output_samples
array([0.67303388, 0.89996095, 0.74661839, 0.66767314, 0.50232462])
-
soundpy.feats.
plot_dom_freq
(sound, energy_scale='power_to_db', title=None, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]¶ Plots the approximate dominant frequency over a STFT plot of a signal.
If sound has multiple channels, the dominant frequency for each channel is plotted in its own plot.
- Parameters
sound (
np.ndarray [shape=(num_samples,)
or(num_samples
,num_channels)]
) – The sound to plot the dominant frequency of.energy_scale (
str
) – The scale of energy for the plot. If in the frequency spectrum, the data is likely in power and needs to be converted to decibels. (default ‘power_to_db’)title (
str
) – The title for the plot. (default None)**kwargs (
additional keyword arguments
) – Keyword arguments used in bothsoundpy.feats.get_stft
andsoundpy.dsp.get_pitch
.
- Returns
- Return type
-
soundpy.feats.
plot_vad
(sound, energy_scale='power_to_db', title='Voice Activity', use_beg_ms=120, extend_window_ms=0, beg_end_clipped=True, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]¶ Plots where voice (sound) activity is detected on a power spectrum.
This either plots immediately or saves the plot at name4pic. If sound has multiple channels, the VAD for each channel is plotted in its own plot.
- Parameters
sound (
np.ndarray [shape=(num_samples,)
or(num_samples
,num_channels)]
) – The sound to plot the VAD of.energy_scale (
str
) – If plotting STFT or power spectrum, will plot it in decibels. (default ‘power_to_db’)title (
str
) – The title of the plot (default ‘Voice Activity’)use_beg_ms (
int
) – The amount of noise to use at the beginning of the signal to measure VAD. This is only applied if beg_end_clipped is set to False.extend_window_ms (
int
) – The number of milliseconds VAD should be padded. This is useful if one wants to encompass more speech if the VAD is not including all the speech / desired sound. However, this may capture more noise. (default 0)beg_end_clipped (
bool
) – If True, just the silences at the beginning and end of the sample will be cut off. If False, VAD will be checked throughout the sample, not just the beginning and end. NOTE: Both options have strengths and weaknesses. Sometimes the VAD checking the entire signal is unreliable (i.e. when beg_end_clipped is set to False), not recognizing speech in speech-filled samples. And when set to True, some speech sounds tend to get ignored (‘s’, ‘x’ and other fricatives).save_pic (
bool
) – If True, the plot will be saved rather than plotted immediately.name4pic (
str
) – The full pathway and filename to save the picture (as .png file). A file extension is expected. (default None)overwrite (
bool
) – If False, a date tag will be added to name4pic if name4pic already exists. (default False)**kwargs (
keyword arguments
) – Additional keyword arguments for soundpy.feats.get_speech_stft or soundpy.dsp.vad
.
- Returns
- Return type
-
soundpy.feats.
get_change_acceleration_rate
(spectro_data)[source]¶ Gets first and second derivatives of spectral data.
This is useful particularly for speech recognition.
- Parameters
spectro_data (
np.ndarray [shape = (num_samples
,num_features)]
) –
- Returns
delta (
np.ndarray [shape = (num_samples
,num_features)]
) – The first order derivative of spectral data. Reflects rate of change in signal.delta_delta (
np.ndarray [shape = (num_samples
,num_features)]
) – The second order derivative of spectral data. Reflects rate of acceleration in signal.
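Since the module leans on librosa elsewhere, the derivatives can be sketched with librosa.feature.delta; this is an assumption about the internals, not confirmed here:

import librosa

def change_and_acceleration(spectro_data):
    # First and second order derivatives along the frame axis (axis 0).
    delta = librosa.feature.delta(spectro_data, axis=0)
    delta_delta = librosa.feature.delta(spectro_data, axis=0, order=2)
    return delta, delta_delta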
-
soundpy.feats.
get_mfcc_fbank
(samples, feature_type='mfcc', sr=48000, win_size_ms=20, percent_overlap=0.5, num_filters=40, num_mfcc=40, fft_bins=None, window_function=None, zeropad=True, **kwargs)[source]¶ Collects fbank or mfcc features via python-speech-features (rather than librosa).
-
soundpy.feats.
zeropad_features
(feats, desired_shape, complex_vals=False)[source]¶ Applies zeropadding to a copy of feats.
-
soundpy.feats.
reduce_num_features
(feats, desired_shape)[source]¶ Limits the number of features of a copy of feats.
This is useful if you want the features to be a certain size, for training models for example.
-
soundpy.feats.
featshape_new_subframe
(feature_matrix_shape, new_frame_size, zeropad=True, axis=0, include_dim_size_1=False)[source]¶ Subdivides features from (num_frames, num_feats) to (num_subframes, new_frame_size, num_feats)
- Parameters
feature_matrix_shape (
tuple [size=(num_frames
,num_features)]
) – Feature matrix shape to be subdivided. Can be multidimensional.new_frame_size (
int
) – The number of subframes to section axis into.zeropad (
bool
) – If True, frames that don’t completely fill a new_frame_size will be zeropadded. Otherwise, those frames will be discarded. (default True)axis (
int
) – The axis where the new_frame_size should be applied. (default 0)
- Returns
new_shape
- Return type
tuple [size=(num_subframes
,new_frame_size
,num_feats)]
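As a worked example of the shape arithmetic (values are illustrative): a (7, 5) matrix with new_frame_size=3 and zeropadding yields ceil(7/3) = 3 subframes:

>>> import math
>>> feature_matrix_shape = (7, 5)   # (num_frames, num_features)
>>> new_frame_size = 3
>>> num_subframes = math.ceil(feature_matrix_shape[0] / new_frame_size)
>>> (num_subframes, new_frame_size, feature_matrix_shape[1])
(3, 3, 5)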
-
soundpy.feats.
apply_new_subframe
(feature_matrix, new_frame_size, zeropad=True, axis=0)[source]¶ Reshapes feature_matrix to allow for new_frame_size.
Note: Dimensions of feature_matrix must be at least 2 and can be up to 5, returning a matrix with one additional dimension.
- Parameters
feature_matrix (
np.ndarray [size(num_frames
,num_features) ]
) – Expects minimum 2D, maximum 5D matrix.new_frame_size (
int
) – The number of subframes to section axis into.axis (
int
) – The axis to apply the new_frame_size. (default 0)zeropad (
bool
) – If True, the feature_matrix will be zeropadded to include frames that do not fill entire frame_size, given the new_frame_size. If False, feature_matrix will not include the last zeropadded frame. (default True)
- Returns
feats_reshaped – The feature_matrix with axis subdivided into 2 dimensions: the number of subframes, and new_frame_size.
- Return type
np.ndarray [size(num_subframes
,new_frame_size
,num_features)]
- Raises
ValueError – If the number of dimensions of feature_matrix is below 2 or exceeds 5.
Examples
>>> import numpy as np
>>> matrix = np.arange(24).reshape(3,4,2)
>>> # apply new_frame_size to dimension of length 4 (i.e. axis 1)
>>> matrix_zp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1)
>>> matrix_zp.shape
(3, 2, 3, 2)
>>> matrix_zp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],
        [[ 6,  7],
         [ 0,  0],
         [ 0,  0]]],
       [[[ 8,  9],
         [10, 11],
         [12, 13]],
        [[14, 15],
         [ 0,  0],
         [ 0,  0]]],
       [[[16, 17],
         [18, 19],
         [20, 21]],
        [[22, 23],
         [ 0,  0],
         [ 0,  0]]]])
>>> matrix_nozp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1,
...                                  zeropad=False)
>>> matrix_nozp.shape
(3, 1, 3, 2)
>>> matrix_nozp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]]],
       [[[ 8,  9],
         [10, 11],
         [12, 13]]],
       [[[16, 17],
         [18, 19],
         [20, 21]]]])
-
soundpy.feats.
check_percent_overlap
(percent_overlap)[source]¶ Ensures percent_overlap is between 0 and 1.
-
soundpy.feats.
separate_dependent_var
(matrix)[source]¶ Separates matrix into features and labels. Expects 3D array.
Assumes the last column of the last dimension of the matrix constitutes the dependent variable (labels), and all other columns the independent variables (features). Additionally, it is assumed that for each block of data, only one label is needed; therefore, just the first label is taken for each block.
- Parameters
matrix (
numpy.ndarray [size = (num_samples
,num_frames
,num_features)]
) – The matrix holds the numerical data to separate. num_features is expected to be at least 2.- Returns
X (
numpy.ndarray [size = (num_samples
,num_frames
,num_features -1)]
) – A matrix holding the (assumed) independent variablesy (
numpy.ndarray
,numpy.int64
,numpy.float64 [size = (num_samples,)]
) – A vector holding the labels assigned to the independent variables. If only one value in array, just the value inside is returned
Examples
>>> import numpy as np
>>> # vector
>>> separate_dependent_var(np.array([1,2,3,4]))
(array([1, 2, 3]), 4)
>>> # simple matrix
>>> matrix = np.arange(4).reshape(2,2)
>>> matrix
array([[0, 1],
       [2, 3]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[0],
       [2]])
>>> y
1
>>> # more complex matrix
>>> matrix = np.arange(20).reshape((2,2,5))
>>> matrix
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],
       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[[ 0,  1,  2,  3],
        [ 5,  6,  7,  8]],
       [[10, 11, 12, 13],
        [15, 16, 17, 18]]])
>>> y
array([ 4, 14])
-
soundpy.feats.
add_tensor
(matrix)[source]¶ Adds tensor / dimension to input ndarray (e.g. features).
Keras requires an extra dimension at some layers, which represents the ‘tensor’ encapsulating the data.
To clarify, take the example below. The input matrix has shape (2,3,4). Think of it as 2 different events, each having 3 sets of measurements, with each of those having 4 features. So, let’s measure differences between 2 cities at 3 different times of day. Let’s take measurements at 08:00, 14:00, and 19:00 in… Magic City and Never-ever Town. We’ll measure 1) temperature, 2) wind speed, 3) light level, and 4) noise level.
As I best understand it, putting our measurements into a matrix with an added dimension/tensor highlights the separate measurements, telling the algorithm: yes, these are 4 features from the same city, BUT they occur at different times. Or it’s just how Keras set up the code :P
- Parameters
matrix (
numpy.ndarray
) – The matrix holds the numerical data to add a dimension to.- Returns
matrix – The matrix with an additional dimension.
- Return type
numpy.ndarray
Examples
>>> import numpy as np
>>> matrix = np.arange(24).reshape((2,3,4))
>>> matrix.shape
(2, 3, 4)
>>> matrix
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],
       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
>>> matrix_2 = add_tensor(matrix)
>>> matrix_2.shape
(2, 3, 4, 1)
>>> matrix_2
array([[[[ 0],
         [ 1],
         [ 2],
         [ 3]],
        [[ 4],
         [ 5],
         [ 6],
         [ 7]],
        [[ 8],
         [ 9],
         [10],
         [11]]],
       [[[12],
         [13],
         [14],
         [15]],
        [[16],
         [17],
         [18],
         [19]],
        [[20],
         [21],
         [22],
         [23]]]])
-
soundpy.feats.
scale_X_y
(matrix, is_train=True, scalars=None)[source]¶ Separates and scales data into X and y arrays. Adds dimension for keras.
Assumes the last column of the last dimension is the y or label data.
- Parameters
matrix (
np.ndarray [size = (num_samples
,num_frames
,num_features)]
) – Matrix with X and y datais_train (
bool
) – Relevant for the scalars parameter. If the data is training data (i.e. True), the scalars will be created. If the data is test data (i.e. False), the function expects scalars to be provided. (default True)scalars (
dict
, optional) – Dictionary with scalars to be applied to non-training data.
- Returns
X (
np.ndarray [size = (num_samples
,num_frames
,num_features-1
,1)]
) – Scaled features with extra dimensiony (
np.ndarray [size = (num_samples
,1
,1)]
) – Scaled independent variable with extra dimensionscalars (
dict
) – The scalars either created or previously loaded.
-
soundpy.feats.
get_feature_matrix_shape
(sr=None, dur_sec=None, feature_type=None, win_size_ms=None, percent_overlap=None, fft_bins=None, num_mfcc=None, num_filters=None, rate_of_change=False, rate_of_acceleration=False, context_window=None, frames_per_sample=None, zeropad=True, labeled_data=False, remove_first_coefficient=False, real_signal=False, **kwargs)[source]¶ Returns expected shapes of feature matrix depending on several parameters.
- Parameters
sr (
int
) – Sample rate of the audio to be extracted.dur_sec (
int
,float
) – The number of seconds of audio feature extraction will be applied to.feature_type (
str
) – Accepted features include ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, ‘mfcc’. Which feature_type applied will influence the resulting shape of the feature matrix shape.win_size_ms (
int
orfloat
) – The size of the window the audio signal should be broken into. If feature_type is set to ‘signal’, this is irrelevant. Otherwise will raise TypeError if set to None.percent_overlap (
float
) – The amount of overlap between windows. If set to 0.5, the number of overlapping samples will be half the number of samples that make up win_size_ms.fft_bins (
int
) – The number of frequency bins to use when calculating the fast Fourier transform. If None, the calculated frame_length will be used.num_mfcc (
int
) – If extracting ‘mfcc’ features, the total number of coefficients expected.num_filters (
int
) – If extracting ‘fbank’ features, the total number of mel-filters to be applied.rate_of_change (
bool
) – If True, the first delta will be concatenated to features extracted.rate_of_acceleration (
bool
) – If True, the second delta will be concatenated to features extracted.context_window (
int
) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts.frames_per_sample (
int
) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions and available in generator functions as context_window. frames_per_sample equals 2 * context_window + 1. See soundpy.models.dataprep.Generator
zeropad (
bool
) – If True, windows and frames will be zeropadded to avoid losing any sample data.labeled_data (
bool
) – If True, a label will be added to the output shape of features.remove_first_coefficient (
bool
) – If True, the first mfcc coefficient will not be included in feature matrix.**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.feats.get_feats
. These may not be used in this function as they may not influence the size of the feature matrix.
- Returns
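As a rough illustration of the arithmetic involved (a sketch assuming zeropadding and a hop of frame_length * (1 - percent_overlap); the exact rounding soundpy applies may differ):

>>> import math
>>> sr, dur_sec = 22050, 1
>>> win_size_ms, percent_overlap, num_filters = 20, 0.5, 40
>>> frame_length = win_size_ms * sr // 1000
>>> hop_length = int(frame_length * (1 - percent_overlap))
>>> num_frames = math.ceil(sr * dur_sec / hop_length)
>>> (num_frames, num_filters)  # expected 'fbank' feature matrix shape
(101, 40)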
-
soundpy.feats.
visualize_feat_extraction
(feats, iteration=None, dataset=None, label=None, datadir=None, subsections=False, **kwargs)[source]¶ Saves plots of features during feature extraction or training of models.
- Parameters
feats (
np.ndarray [shape=(num_samples,)
or(num_samples
,num_frames)
or (num_frames,num_features)
or(num_subsections
,num_frames
,num_features)]
) – The extracted features can be raw signal data, stft, fbank, powspec, mfcc data, either as a single plot or subsectioned into batches / subframes.iteration (
int
, optional) – The iteration of the audio getting extracted; e.g. the 10th training item.dataset (
str
, optional) – The identifying string (for example ‘train’ , ‘val’, or ‘test’, but this can be anything).label (
str
,int
, optional) – The label of the audio file. Used in titles and filenames.datadir (
str
,pathlib.PosixPath
, optional) – The directory where related data is located. An ‘image’ directory will be created within this datadir where the saved plots will be stored. If None, will be created in current working directory.subsections (
bool
, optional) – To subsection raw ‘signal’ data into frames. For other features, this is easier to identify via the shape offeats
.**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.feats.get_feats
- Returns
- Return type
-
soundpy.feats.
save_features_datasets
(datasets_dict, datasets_path2save_dict, context_window=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, log_settings=True, decode_dict=None, random_seed=None, **kwargs)[source]¶ Extracts and saves audio features, sectioned into datasets, to indicated locations.
If MemoryError, the provided dataset dicts will be adjusted to allow data to be subsectioned.
- Parameters
datasets_dict (
dict
) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {‘train’:[‘1.wav’, ‘2.wav’, ‘3.wav’], ‘val’: [‘4.wav’], ‘test’:[‘5.wav’]} for unlabeled data or {‘train’:[(0, ‘1.wav’), (1, ‘2.wav’), (0, ‘3.wav’)], ‘val’: [(1, ‘4.wav’)], ‘test’:[(0, ‘5.wav’)]} for labeled data.datasets_path2save_dict (
dict
) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}context_window (
int
) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts to feed to a long short-term memory network (LSTM). (Can avoid this by simply reshaping data later).frames_per_sample (
int
) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions. This equals 2 * context_window + 1.labeled_data (
bool
) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.subsection_data (
bool
) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.divide_factor (
int
, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)visualize (
bool
) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)vis_every_n_frames (
int
) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)log_settings (
bool
) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)decode_dict (
dict
, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.feats.get_feats
.
- Returns
See also
soundpy.feats.get_feats
Extract features from audio file or audio data.
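A usage sketch, reusing the dictionary examples from the parameter descriptions above (feature settings are passed through to soundpy.feats.get_feats):

>>> import soundpy as sp
>>> datasets_dict = {'train': ['1.wav', '2.wav', '3.wav'],
...                  'val': ['4.wav'], 'test': ['5.wav']}
>>> datasets_path2save_dict = {'train': './data/train.npy',
...                            'val': './data/val.npy',
...                            'test': './data/test.npy'}
>>> sp.feats.save_features_datasets(datasets_dict, datasets_path2save_dict,
...                                 feature_type='fbank', sr=22050,
...                                 win_size_ms=20, dur_sec=1)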
-
soundpy.feats.
save_features_datasets_zipfiles
(datasets_dict, datasets_path2save_dict, extract_dir, dur_sec, feature_type='fbank', num_feats=None, sr=22050, win_size_ms=20, percent_overlap=0.5, n_fft=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, use_librosa=True, center=True, mode='reflect', log_settings=True, decode_dict=None, audiofile_lim=10, **kwargs)[source]¶ Extracts and saves audio features, sectioned into datasets, to indicated locations.
If MemoryError, the provided dataset dicts will be adjusted to allow data to be subsectioned.
- Parameters
datasets_dict (
dict
) – Dictionary with keys representing datasets and values the audifiles making up that dataset. E.g. {‘train’:[‘1.wav’, ‘2.wav’, ‘3.wav’], ‘val’: [‘4.wav’], ‘test’:[‘5.wav’]} for unlabled data or {‘train’:[(0, ‘1.wav’), (1, ‘2.wav’), (0, ‘3.wav’)], ‘val’: [(1, ‘4.wav’)], ‘test’:[(0, ‘5.wav’)]} for labeled data.datasets_path2save_dict (
dict
) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}feature_type (
str
) – String including only one of the following: ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, and ‘mfcc’. ‘signal’ currently only supports mono channel data. TODO: test for stereo. ‘powspec’ and ‘stft’ are basically the same; ‘powspec’ is the ‘stft’ except without complex values and squared. E.g. ‘mfcc_noisy’ or ‘stft_train’.sr (
int
) – The sample rate the audio data should be loaded with.n_fft (
int
) – The number of frequency bins used for the Fast Fourier Transform (fft)dur_sec (
int
orfloat
) – The desired duration of how long the audio data should be. This is used to calculate the size of feature data and is therefore necessary, as audiofiles tend to differ in length. If audiofiles are longer or shorter, they will be cut or zeropadded respectively.num_feats (
int
) – The number of mfcc coefficients (mfcc), mel filters (fbank), or frequency bins (stft).win_size_ms (
int
) – The desired window size in milliseconds to process audio samples.percent_overlap (
float
) – The amount audio samples should overlap as each window is processed.frames_per_sample (
int
, optional) – If you want to section each audio file’s feature data into smaller frames. This might be useful for speech-related contexts. (Can avoid this by simply reshaping data later)labeled_data (
bool
) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.subsection_data (
bool
) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.divide_factor (
int
, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)visualize (
bool
) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)vis_every_n_frames (
int
) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)use_librosa (
bool
) – If True, librosa is used to load and extract features. As of now, no other option is available. TODO: add other options. :P I just wanted to be clear that some elements of this function are unique to using librosa. (default True)center (
bool
) – Relevant for librosa and feature extraction. (default True)mode (
str
) – Relevant for librosa and feature extraction. (default ‘reflect’)log_settings (
bool
) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)decode_dict (
dict
, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)**kwargs (
additional keyword arguments
) – Keyword arguments forsoundpy.feats.get_feats
.
- Returns
See also
soundpy.feats.get_feats
Extract features from audio file or audio data.
-
soundpy.feats.
prep_new_audiofeats
(feats, desired_shape, input_shape)[source]¶ Prepares new audio data to feed to a pre-trained model.
- Parameters
feats (
np.ndarray [shape = (num_frames
,num_features)]
) – The features to prepare for feeding to a model.desired_shape (
tuple
) – The expected number of samples necessary to fulfill the expected input_shape for the model. The feats
will be zeropadded or limited to match this desired_shape.input_shape (
tuple
) – The input_shape the model expects a single sample of data to be.
- Returns
feats_reshaped – The features reshaped to what the model expects.
- Return type
np.ndarray [shape = (input_shape
)]
-
soundpy.feats.
feats2audio
(feats, feature_type, sr, win_size_ms, percent_overlap, phase=None)[source]¶ Prepares features into a playable audio format.
- Parameters
feats (
np.ndarray [shape = (num_frames
,num_feats)]
) – If the features are a signal, [size = (batch_size * num_frames * num_features, 1)]. Otherwise [size = (batch_size * num_frames, num_features)].feature_type (
str
) – Either ‘stft’, ‘fbank’, ‘signal’, or ‘mfcc’. For the ‘signal’ feature, only mono channel is supported.sr (
int
) – Sampling rate that the features were extracted withwin_size_ms (
int
) – The window size in milliseconds the features were extracted withpercent_overlap (
float
) – The percent overlap between windows.phase (
np.ndarray [shape = (num_frames
,num_feats)]
, optional) – The original phase information of the reconstructed signal.
- Returns
y – The reconstructed signal in samples.
- Return type
np.ndarray [shape = (num_samples
,)]
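A round-trip sketch: extract an STFT, then reconstruct audio reusing the original phase (whether ‘stft’ features should be passed as magnitudes here is an assumption; the file path is hypothetical):

>>> import numpy as np
>>> import soundpy as sp
>>> stft = sp.feats.get_stft('audio.wav', sr=16000, win_size_ms=20,
...                          percent_overlap=0.5)
>>> y = sp.feats.feats2audio(np.abs(stft), feature_type='stft', sr=16000,
...                          win_size_ms=20, percent_overlap=0.5,
...                          phase=np.angle(stft))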
Template deep neural networks¶
The models.template_models module contains functions for building (ideally research-based) models.
-
soundpy.models.template_models.
adjust_layers_cnn
(**kwargs)[source]¶ Reduces layers of CNN until the model can be built.
If the number of filters for ‘mfcc’ or ‘fbank’ is in the lower range (i.e. 13 or so), this causes issues with the default settings of the cnn architecture. The architecture was built with at least 40 filters being applied during feature extraction. To deal with this problem, the number of CNN layers are reduced.
- Parameters
**kwargs (
Keyword arguments
) – Keyword arguments for soundpy.models.template_models.cnn_classifier
- Returns
settings – Updated dictionary with relevant settings for model.
- Return type
-
soundpy.models.template_models.
cnn_classifier
(feature_maps=[40, 20, 10], kernel_size=[3, 3, 3, 3, 3, 3], strides=2, activation_layer='relu', activation_output='softmax', input_shape=(79, 40, 1), num_labels=3, dense_hidden_units=100, dropout=0.25)[source]¶ Build a single or multilayer convolutional neural network.
- Parameters
feature_maps (
int
orlist
) – The filter or feature map applied to the data. One feature map per convolutional neural layer required. For example, a list of length 3 will result in a three-layer convolutional neural network.kernel_size (
tuple
orlist
oftuples
) – Must match the number of feature_maps. The size of each corresponding feature map.strides (
int
) –activation_layer (
str
) – (default ‘relu’)activation_outpu (
str
) – (default ‘softmax’)input_shape (
tuple
) – The shape of the inputdense_hidden_units (
int
, optional) –dropout (
float
, optional) – Reduces overfitting
- Returns
model (
tf.keras.Model
) – Model ready to be compiled.settings (
dict
) – Dictionary with relevant settings for model.
Warning
If the number of features is not compatible with the number of layers, a warning is raised and the layers adjusted. E.g. for a lower number of MFCC features, this will likely be applied if the number of layers is greater than 1.
References
A. Sehgal and N. Kehtarnavaz, “A Convolutional Neural Network Smartphone App for Real-Time Voice Activity Detection,” in IEEE Access, vol. 6, pp. 9017-9026, 2018.
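A usage sketch (the input_shape and num_labels values are illustrative; compilation settings are the caller’s choice):

>>> from soundpy.models.template_models import cnn_classifier
>>> model, settings = cnn_classifier(input_shape=(79, 40, 1), num_labels=3)
>>> model.compile(optimizer='adam', loss='categorical_crossentropy',
...               metrics=['accuracy'])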
-
soundpy.models.template_models.
autoencoder_denoise
(input_shape, kernel_size=(3, 3), max_norm_value=2.0, activation_function_layer='relu', activation_function_output='sigmoid', padding='same', kernel_initializer='he_uniform')[source]¶ Build a simple autoencoder denoiser.
- Parameters
- Returns
autoencoder – Model ready to be compiled
- Return type
tf.keras.Model
References
Versloot, Christian (2019, December 19). Creating a Signal Noise Removal Autoencoder with Keras. MachineCurve. https://www.machinecurve.com
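Examples
A minimal sketch (the input shape is illustrative, e.g. (num_frames, num_features, 1)):
>>> from soundpy.models.template_models import autoencoder_denoise
>>> denoiser = autoencoder_denoise(input_shape=(79, 40, 1))
>>> denoiser.compile(optimizer='adam', loss='binary_crossentropy')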
-
soundpy.models.template_models.
resnet50_classifier
(input_shape, num_labels, activation='softmax', final_layer_name='features')[source]¶ Simple image classifier built on top of a pretrained ResNet50 model.
References
Revay, S. & Teschke, M. (2019). Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals. arXiv:1905.04348 [cs.SD]
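Examples
A minimal sketch (the three-channel image shape is what ResNet50 expects; pretrained weights are assumed to be downloadable on first use):
>>> from soundpy.models.template_models import resnet50_classifier
>>> model = resnet50_classifier(input_shape=(224, 224, 3), num_labels=3)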
-
soundpy.models.template_models.
cnnlstm_classifier
(num_labels, input_shape, lstm_cells, feature_map_filters=32, kernel_size=(8, 4), pool_size=(3, 3), dense_hidden_units=60, activation_layer='relu', activation_output='softmax', dropout=0.25)[source]¶ Model architecture inspired by the paper below.
References
Kim, Myungjong & Cao, Beiming & An, Kwanghoon & Wang, Jun. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. 10.21437/interspeech.2018-2250.
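Examples
A minimal sketch (shapes are illustrative; input_shape is assumed here to describe time-distributed frames, e.g. (timesteps, num_frames, num_features, 1)):
>>> from soundpy.models.template_models import cnnlstm_classifier
>>> model = cnnlstm_classifier(num_labels=3, input_shape=(5, 19, 40, 1), lstm_cells=40)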
Additional model setup (e.g. Early Stopping)¶
The models.modelsetup module contains functionality for preparing to train a model.
-
soundpy.models.modelsetup.
setup_layers
(num_features, num_layers, kernel_shape=(3, 3), max_feature_map=64)[source]¶ Sets up feature_maps and kernels for convolutional neural networks with one or more layers.
- Parameters
num_features (
int
) – The number of features used to train the model. This will be used to set the number of feature_maps for each layer.num_layers (
int
) – The number of layers desiredkernel_shape (
tuple
orint
) – The shape of the desired kernelmax_feature_map (
int
) – The maximum size of feature map / filter. This depends on the system and is relevant for processing higher definition features, such as STFT features. If this is set too large given memory constraints, training may be ‘killed’.
- Returns
Warning
If num_features is larger than max_feature_map, num_features (which is usually used to set the first feature map) will be reduced to lie below max_feature_map.
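Examples
A minimal sketch (the exact structure of the return value is not documented in this excerpt, so it is left unpacked):
>>> from soundpy.models.modelsetup import setup_layers
>>> layer_settings = setup_layers(num_features=40, num_layers=3)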
-
soundpy.models.modelsetup.
setup_callbacks
(early_stop=True, patience=15, log=True, log_filename=None, append=True, save_bestmodel=True, best_modelname=None, monitor='val_loss', verbose=1, save_best_only=True, mode='min', tensorboard=True, write_images=False, x_test=None, y_test=None, batch_size=None, embedded_layer_name=None)[source]¶ Easy set up of early stopping, model logging, and saving best model.
- Parameters
early_stop (
bool
) – Whether or not the model should stop if training is not improving (default True)patience (
int
) – The number of epochs the model should complete without improvement before stopping training. (default 15)log (
bool
) – If true, the accuracy, loss, and (if possible) the val_accuracy and val_loss for each epoch will be saved in a .csv file. (default True)log_filename (
str
orpathlib.PosixPath
, optional) – The filename where the logging information will be stored. If None, the date will be used as a unique .csv filename in a subfolder ‘model_logs’ in the local directory.save_bestmodel (
bool
) – If True, the best performing model will be saved.best_modelname (
str
orpathlib.PosixPath
) – The name to save the best model version under. If None, the date will be used to create a unique .h5 filename and it will be saved in a subfolder ‘best_models’ in the local directory.monitor (
str
) – The metric to be used to measure model performance. (default ‘val_loss’)verbose (
bool
) – If True, the state of the model will be printed. (default True)save_best_only (
bool
) – If True, the best performing model will overwrite any previously saved ‘best model’.mode (
str
) – If monitor is set to ‘val_loss’, this should be set to ‘min’. If monitor is set to ‘val_acc’, this should be set to ‘max’. If mode is set to ‘auto’, the direction will be inferred. (default ‘min’)tensorboard (
bool
) – If True, logs for TensorBoard will be made.
- Returns
callbacks – The callbacks ready to be applied to Keras model training.
- Return type
list
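Examples
A minimal sketch (model, x_train, and y_train are hypothetical; the returned callbacks are handed to Keras training):
>>> from soundpy.models.modelsetup import setup_callbacks
>>> callbacks = setup_callbacks(patience=10)
>>> # then, e.g.: model.fit(x_train, y_train, epochs=50, callbacks=callbacks)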
Feeding large datasets to models¶
The models.dataprep module covers functionality for feeding features to models.
-
class
soundpy.models.dataprep.
Generator
(data_matrix1, data_matrix2=None, timestep=None, axis_timestep=0, normalize=True, apply_log=False, context_window=None, axis_context_window=-2, labeled_data=False, gray2color=False, zeropad=True, desired_input_shape=None, combine_axes_0_1=False)[source]¶ Bases:
object
Methods
Shapes, norms, and feeds data depending on labeled or non-labeled data.
-
__init__
(data_matrix1, data_matrix2=None, timestep=None, axis_timestep=0, normalize=True, apply_log=False, context_window=None, axis_context_window=-2, labeled_data=False, gray2color=False, zeropad=True, desired_input_shape=None, combine_axes_0_1=False)[source]¶ This generator pulls data out in sections (i.e. batch sizes). Prepared for 3 dimensional data.
Note: Keras adds a dimension to input to represent the “Tensor” that handles the input. This means that sometimes you have to add a shape of (1,) to the shape of the data.
- Parameters
data_matrix1 (
np.ndarray [size=(num_samples
,batch_size
,num_frames
,num_features)
or(num_samples
,num_frames
,num_features+label_column)]
) – The training data. This can contain the feature and label data or just the input feature data.data_matrix2 (
np.ndarray [size = (num_samples,) or data_matrix1.shape], optional) – Either label data for data_matrix1 or, for example, the clean version of data_matrix1 if training an autoencoder. (default None)normalize (
bool
) – If False, the data has already been normalized and won’t be normalized by the generator. (default True)apply_log (
bool
) – If True, log will be applied to the data.timestep (
int
) – The number of frames to constitute a timestep.axis_timestep (
int
) – The axis to apply the timestep to. (default 0)context_window (
int
) – The size of context_window or number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts.axis_context_window (
int
) – The axis along which to apply the context_window, if context_window is not None. Ideally this should be the axis preceding the feature column. (default -2)zeropad (
bool
) – Whether features should be zeropadded in reshaping functions.desired_input_shape (
int
ortuple
, optional) – The desired number of features or shape of data to feed a neural network. If type int, only the last column of features will be adjusted (zeropadded or limited). If tuple, the entire data shape will be adjusted (all columns). If the int or shape is larger than that of the data provided, data will be zeropadded. If the int or shape is smaller, the data will be restricted. (default None)
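Examples
A minimal sketch (random data stands in for real features; with labeled_data=True, the last column of each sample is assumed to hold the label, per the shape documented above):
>>> import numpy as np
>>> from soundpy.models.dataprep import Generator
>>> data = np.random.rand(100, 19, 41)
>>> gen = Generator(data_matrix1=data, labeled_data=True)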
-
class
soundpy.models.dataprep.
GeneratorFeatExtraction
(datalist, datalist2=None, model_name=None, normalize=True, apply_log=False, randomize=True, random_seed=None, desired_input_shape=None, timestep=None, axis_timestep=0, context_window=None, axis_context_window=-2, batch_size=1, gray2color=False, visualize=False, vis_every_n_items=50, visuals_dir=None, decode_dict=None, dataset='train', augment_dict=None, label_silence=False, vad_start_end=False, **kwargs)[source]¶ Bases:
soundpy.models.dataprep.Generator
Methods
Extracts features and feeds them to model according to desired_input_shape.
-
soundpy.models.dataprep.
randomize_augs
(aug_dict, random_seed=None)[source]¶ Creates a copy of the dict and randomly chooses which augmentations are applied.
A random seed can be applied to control the number of augmentations applied and the shuffling order of possible augmentations.
-
soundpy.models.dataprep.
augment_features
(sound, sr, add_white_noise=False, snr=[5, 10, 20], speed_increase=False, speed_decrease=False, speed_perc=0.15, time_shift=False, shufflesound=False, num_subsections=3, harmonic_distortion=False, pitch_increase=False, pitch_decrease=False, num_semitones=2, vtlp=False, bilinear_warp=True, augment_settings_dict=None, random_seed=None)[source]¶ Randomly applies augmentations to audio. If no augment_settings_dict is provided, defaults are applied.
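Examples
A minimal sketch (random noise stands in for real audio samples; the structure of the return value is not documented in this excerpt):
>>> import numpy as np
>>> from soundpy.models.dataprep import augment_features
>>> sr = 16000
>>> samples = np.random.rand(sr * 2) - 0.5
>>> augmented = augment_features(samples, sr, pitch_increase=True, num_semitones=2)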
-
soundpy.models.dataprep.
get_input_shape
(kwargs_get_feats, labeled_data=False, frames_per_sample=None, use_librosa=True, mode='reflect')[source]¶ Determines the expected input shape of extracted features, given the feature extraction keyword arguments.
-
soundpy.models.dataprep.
make_gen_callable
(_gen)[source]¶ Prepares a Python generator for tf.data.Dataset.from_generator.
Bug fix: plain Python generators fail to work in TensorFlow 2.2.0+.
- Parameters
_gen (
generator
) – The generator function to feed to a deep neural network.
- Returns
x (
np.ndarray [shape=(batch_size
,num_frames
,num_features
,1)]
) – The feature data.y (
np.ndarray [shape=(1,1)]
) – The label for the feature data.
References
Shu, Nicolas (2020) https://stackoverflow.com/a/62186572 CC BY-SA 4.0
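Examples
A minimal sketch (a toy Python generator stands in for a soundpy Generator; the yielded shapes follow the documented output above):
>>> import numpy as np
>>> import tensorflow as tf
>>> from soundpy.models.dataprep import make_gen_callable
>>> def toy_gen():
...     for _ in range(8):
...         yield np.random.rand(1, 10, 4, 1), np.zeros((1, 1))
>>> dataset = tf.data.Dataset.from_generator(
...     make_gen_callable(toy_gen()),
...     output_types=(tf.float32, tf.float32))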
Other useful non-specific functionality¶
The utils module covers functions that are useful for soundpy but are not directly related to sound data.
-
soundpy.utils.
path_or_samples
(input_value)[source]¶ Checks whether input_value is a path or sample data. Does not check path validity.
This is useful for functions that take both pathways to audio as well as pre-loaded audio data.
- Parameters
input_value (
str
,pathlib.PosixPath
, or tuple [size = ((samples,), sr)]
or np.ndarray [size = (samples,)]
) –
- Returns
‘path’ or ‘samples’
- Return type
str
Examples
>>> import numpy as np
>>> # create some example samples and sample rate
>>> samples = np.array([1,2,3,2,1,0])
>>> sr = 5
>>> path_or_samples( (samples, sr) )
'samples'
>>> # expects both audio samples and sr
>>> path_or_samples(samples)
TypeError: The input for `path_or_samples` expected a str, pathlib.PosixPath, or tuple with samples and sample rate, not type <class 'numpy.ndarray'>
>>> # create example string pathway
>>> path_or_samples('my_audio.wav')
'path'
>>> # create pathlib.PosixPath object
>>> import pathlib
>>> path_or_samples(pathlib.Path('my_audio.wav'))
'path'
-
soundpy.utils.
get_default_args
(func)[source]¶ Gets the default arguments of a function.
References
stackoverflow answer by mgilson: link: https://stackoverflow.com/a/12627202 license: https://creativecommons.org/licenses/by-sa/3.0/
-
soundpy.utils.
match_dtype
(array1, array2)[source]¶ Match the dtype of the first array to that of the second.
- Parameters
array1 (
np.ndarray
) – The numpy array with the datatype to be adjusted and returned.array2 (
np.ndarray
) – The numpy array with the original or desired datatype.
- Returns
array1 – array1 cast to the dtype of array2.
- Return type
np.ndarray
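Examples
A minimal sketch:
>>> import numpy as np
>>> from soundpy.utils import match_dtype
>>> a = np.array([1, 2, 3])
>>> b = np.array([0.5, 0.25])
>>> match_dtype(a, b).dtype
dtype('float64')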
-
soundpy.utils.
get_date
()[source]¶ Get a string containing month, day, hour, minute, second and millisecond.
This is useful for creating a unique filename.
- Parameters
None –
- Returns
time_str – A string containing the date and time.
- Return type
str
Examples
>>> date = get_date()
>>> date
'6m18d1h16m32s295ms'
-
soundpy.utils.
check_dir
(directory, make=True, append=True)[source]¶ Checks if directory exists and creates it if indicated.
- Parameters
directory (
str
orpathlib.PosixPath
) – The directory of interestmake (
bool
) – Whether or not the directory should be created or just checked to ensure it exists. (default True)append (
bool
) – If True and a directory with the same name exists, new items will be saved into the existing directory. Otherwise, an error will be raised. (default True)
- Returns
directory – If the directory could be created or confirmed to exist, the directory path will be returned. Otherwise an error will be raised.
- Return type
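Examples
A minimal sketch (creates ‘model_logs’ in the current working directory if it does not yet exist):
>>> from soundpy.utils import check_dir
>>> directory = check_dir('./model_logs/', make=True)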
-
soundpy.utils.
create_nested_dirs
(directory)[source]¶ Creates directory even if several parent directories don’t exist.
- Parameters
directory (
str
,pathlib.PosixPath
) – The directory to be created.
- Returns
directory – If successful, the directory path that has been created.
- Return type
pathlib.PosixPath
Examples
>>> # First an unsuccessful creation of a nested directory
>>> import os
>>> new_dir = './testdir/testdir/testdir/'
>>> os.mkdir(new_dir)
FileNotFoundError: [Errno 2] No such file or directory: './testdir/testdir/testdir/'
>>> # try again with create_nested_dirs()
>>> directory = create_nested_dirs(new_dir)
>>> directory
PosixPath('testdir/testdir/testdir')
-
soundpy.utils.
string2pathlib
(pathway_string)[source]¶ Turns string path into pathlib.PosixPath object.
This is useful when working with pathways from varying operating systems. Windows, Linux, and Mac have different ways of organizing pathways and pathlib turns strings from these different versions into a pathlib object that can be understood by the software regardless of the system. (At least I hope so..)
- Parameters
pathway_string (
str
orpathlib.PosixPath
) – The pathway to be turned into a pathlib object, if need be.
- Returns
pathway_string – The pathway as a pathlib object.
- Return type
pathlib.PosixPath
Examples
>>> pathway = 'folder/way2go.txt'
>>> pathlib_pathway = string2pathlib(pathway)
>>> pathlib_pathway
PosixPath('folder/way2go.txt')
-
soundpy.utils.
restore_dictvalue
(value_string)[source]¶ Takes a dict value that was stored as a string and converts it back to its original type.
When loading a dictionary from a .csv file, the values are strings. This function handles integers, floats, tuples, and some strings. It has also been adapted to handle a list of audio files or a list of pathlib.PosixPath objects.
Warning: no extensive testing has been completed for this function. It might not handle all value types as expected.
- Parameters
value_string (
str
) – The dictionary value that was converted into a string object.
- Returns
value_original_type – The value converted back to its original type.
- Return type
- Raises
ValueError – If passed a nested list of pathlib.PosixPath objects.
Examples
>>> input_string = "[PosixPath('data/audio/vacuum/vacuum1.wav')]"
>>> type(input_string)
<class 'str'>
>>> typelist = restore_dictvalue(input_string)
>>> typelist
[PosixPath('data/audio/vacuum/vacuum1.wav')]
>>> type(typelist)
<class 'list'>
>>> # Get type of the object
>>> type(typelist[0])
pathlib.PosixPath
>>> # Example with a list of tuples, i.e. label and audio file pairs:
>>> input_string = "[(2, PosixPath('data/audio/vacuum/vacuum1.wav')), (1, PosixPath('data/audio/vacuum/vacuum2.wav'))]"
>>> labelaudio_pairs = restore_dictvalue(input_string)
>>> labelaudio_pairs
[(2, PosixPath('data/audio/vacuum/vacuum1.wav')), (1, PosixPath('data/audio/vacuum/vacuum2.wav'))]
>>> type(labelaudio_pairs)
list
>>> type(labelaudio_pairs[0])
tuple
>>> type(labelaudio_pairs[0][0])
int
>>> type(labelaudio_pairs[0][1])
pathlib.PosixPath
-
soundpy.utils.
adjust_time_units
(time_sec)[source]¶ Turns seconds into relevant time units.
This is useful if measuring time of a process and that process takes longer than a couple minutes.
Examples
>>> adjust_time_units(5)
(5, 'seconds')
>>> adjust_time_units(500)
(8.333333333333334, 'minutes')
>>> adjust_time_units(5000)
(1.3888888888888888, 'hours')
-
soundpy.utils.
print_progress
(iteration, total_iterations, task=None)[source]¶ Prints the status of a process based on iteration number.
Assumes the iteration starts at 0 rather than 1.
- Parameters
- Returns
None – The progress is printed to sys.stdout and flushed via sys.stdout.flush().
Examples
>>> print_progress(4, 10)
50% through current task
>>> print_progress(4, 10, task = 'testing')
50% through testing
-
soundpy.utils.
check_extraction_variables
(sr=None, feature_type=None, win_size_ms=None, percent_overlap=None)[source]¶ Checks to ensure extraction variables are compatible.
- Parameters
- Returns
- Return type
- Raises
ValueError – If any of the Parameters aren’t compatible.
Examples
>>> check_extraction_variables(sr=48000, feature_type='signal', win_size_ms=25, percent_overlap=0.5)
>>> check_extraction_variables(sr='48000', feature_type='sig', win_size_ms='25', percent_overlap='0.5')
ValueError: Sampling rate (sr) must be of type int, not 48000 of type <class 'str'>.
-
soundpy.utils.
check_noisy_clean_match
(noisyfilename, cleanfilename)[source]¶ Checks if the clean filename is inside of the noisy filename.
This may be helpful to check that two audiofile datasets (a noisy and clean dataset) are aligned.
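Examples
A minimal sketch (hypothetical filenames; a boolean return value is assumed, as it is not documented in this excerpt):
>>> from soundpy.utils import check_noisy_clean_match
>>> check_noisy_clean_match('traffic_sample1.wav', 'sample1.wav')
True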
-
soundpy.utils.
audiofile_length_match
(filename1, filename2)[source]¶ Checks that two audiofiles have the same length.
This may be useful if you have clean and noisy audiofiles that should be the same length.
- Parameters
filename1 (
str
orpathlib.PosixPath
) – The path to first audio file.filename2 (
str
orpathlib.PosixPath
) – The path to second audio file.
- Returns
True if they match, False if not.
- Return type
bool
Warning
- UserWarning
If the sample rates of the audio files don’t match.
- UserWarning
If the lengths of the files don’t match.
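Examples
A usage sketch (hypothetical paths; both files must exist on disk for the check to run):
>>> from soundpy.utils import audiofile_length_match
>>> audiofile_length_match('clean/sample1.wav', 'noisy/sample1.wav')
True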