Extract and manipulate audio features

The feats module includes functions for converting audio sample data into features for analysis, filtering, machine learning, or visualization.

soundpy.feats.plot(feature_matrix, feature_type, save_pic=False, name4pic=None, energy_scale='power_to_db', title=None, sr=None, win_size_ms=None, percent_overlap=None, x_label=None, y_label=None, subprocess=False, overwrite=False)[source]

Visualize feature extraction; frames on the x axis, features on the y axis. Uses librosa to scale the data if a scale is applied.

Note: multiple channels are only supported if feature_type is ‘signal’. For other feature types, the plot will not work as expected.

Parameters
  • feature_matrix (np.ndarray [shape=(num_samples,), (num_samples, num_channels), or (num_features, num_frames), dtype=np.float]) – Matrix of features. If the features are not of type ‘signal’ and the shape is 1D, a dimension will be added so the features can be plotted with a colormesh.

  • feature_type (str) – Options: ‘signal’, ‘stft’, ‘mfcc’, or ‘fbank’ features, or whatever the user would like to name the feature set. signal: the 1D samples of sound; STFT: short-time Fourier transform; MFCC: mel frequency cepstral coefficients; FBANK: mel-log filterbank energies. (default ‘fbank’)

  • save_pic (bool) – True to save image as .png; False to just plot it. If subprocess is True, save_pic will automatically be set to True.

  • name4pic (str, optional) – If save_pic is set to True, the name under which the image should be saved.

  • energy_scale (str, optional) – If features need to be adjusted, e.g. from power to decibels. Default is ‘power_to_db’.

  • title (str, optional) – The title for the graph. If None, feature_type is used.

  • sr (int, optional) – Useful for calculating the time axis when plotting features.

  • win_size_ms (int, float, optional) – Useful for calculating the time axis when plotting features in the frequency domain (e.g. STFT, FBANK, MFCC features).

  • percent_overlap (int, float, optional) – Useful for calculating the time axis when plotting features in the frequency domain (e.g. STFT, FBANK, MFCC features).

  • x_label (str, optional) – The label to be applied to the x axis.

  • y_label (str, optional) – The label to be applied to the y axis.

  • subprocess (bool) – If subprocess is True, matplotlib will use backend ‘Agg’, which only allows plots to be saved. If subprocess is False, the default backend ‘TkAgg’ will be used, which allows plots to be generated live as well as saved. The ‘Agg’ backend is useful if one wants to visualize sound while a main process is being performed, for example, while a model is being trained. (default False)

  • overwrite (bool) – If False, if .png file already exists under given name, a date tag will be added to the .png filename to avoid overwriting the file. (default False)

Returns

Return type

None
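
Examples

A minimal usage sketch; ‘sound.wav’ is a hypothetical path, and the plotting parameters are assumed to match those used for extraction:

>>> import soundpy
>>> stft = soundpy.feats.get_feats('sound.wav', sr=16000, feature_type='stft',
...                                win_size_ms=20, percent_overlap=0.5)
>>> # plot the short-time Fourier transform and save it as a .png
>>> soundpy.feats.plot(stft, feature_type='stft', sr=16000, win_size_ms=20,
...                    percent_overlap=0.5, save_pic=True, name4pic='stft_plot.png')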

soundpy.feats.plotsound(audiodata, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, fft_bins=None, num_filters=40, num_mfcc=40, sr=None, save_pic=False, name4pic=None, energy_scale='power_to_db', mono=None, real_signal=False, **kwargs)[source]

Visualize feature extraction depending on set parameters.

Stereo sound can be graphed. If feature_type is ‘signal’, all channels will be graphed on the same plot; otherwise, each channel is plotted separately.

Parameters
  • audiodata (str, numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.

  • feature_type (str) – Options: ‘signal’, ‘mfcc’, or ‘fbank’ features. MFCC: mel frequency cepstral coefficients; FBANK: mel-log filterbank energies (default ‘fbank’)

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 20)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • num_filters (int) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. (default 40)

  • num_mfcc (int) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. Note: it is not possible to choose only 2-13 or 13-40; if num_mfcc is set to 40, all 40 coefficients will be included. (default 40).

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)

  • mono (bool, optional) – When loading an audiofile, True will limit number of channels to one; False will allow more channels to be loaded. (default None, which results in mono channel loading.)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.plot
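
Examples

A minimal sketch; ‘sound.wav’ is a hypothetical path:

>>> import soundpy
>>> # graph the raw waveform
>>> soundpy.feats.plotsound('sound.wav', feature_type='signal')
>>> # graph mel-log filterbank energies with a 25 ms window
>>> soundpy.feats.plotsound('sound.wav', feature_type='fbank',
...                         win_size_ms=25, num_filters=40)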

soundpy.feats.get_feats(sound, sr=None, feature_type='fbank', win_size_ms=20, percent_overlap=0.5, window='hann', fft_bins=None, num_filters=None, num_mfcc=None, remove_first_coefficient=False, sinosoidal_liftering=False, dur_sec=None, mono=None, rate_of_change=False, rate_of_acceleration=False, subtract_mean=False, real_signal=True, fmin=None, fmax=None, zeropad=True)[source]

Collects raw signal data, stft, fbank, or mfcc features.

Parameters
  • sound (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)

  • feature_type (str) – Options include ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, or ‘mfcc’ data (default ‘fbank’). signal: energy/amplitude measurements along time; STFT: short-time Fourier transform; powspec: power spectrum (absolute value of the STFT, squared); FBANK: mel-log filterbank energies; MFCC: mel frequency cepstral coefficients.

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 20)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • window (str or np.ndarray [size (n_fft, )]) – The window function to be applied to each window. (Default ‘hann’)

  • fft_bins (int) – Number of frequency bins to apply in fast Fourier transform. (default None)

  • num_filters (int) – Number of mel-filters to be used when applying mel-scale. For ‘fbank’ features, 20-128 are common, with 40 being very common. If None, will be set to 40. (default None)

  • num_mfcc (int) – Number of mel frequency cepstral coefficients. First coefficient pertains to loudness; 2-13 frequencies relevant for speech; 13-40 for acoustic environment analysis or non-linguistic information. If None, will be set to num_filters or 40. (default None).

  • dur_sec (float, optional) – Time in seconds to limit in loading a signal. (default None)

  • mono (bool, optional) – For loading an audiofile, True will result in only one channel of data being loaded; False will allow additional channels be loaded. (default None, which results in mono channel data)

  • rate_of_change (bool) – If True, the first derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.

  • rate_of_acceleration (bool) – If True, the second derivative of spectral data will be concatenated to the features. This is applicable for all feature types except ‘signal’.

  • subtract_mean (bool) – If True, the mean of each feature column will be subtracted from each row. This is applicable for all feature types except ‘signal’.

Returns

feats – Feature data. If feature_type is ‘signal’, returns a tuple containing samples and sampling rate. Otherwise, returns an np.ndarray with shape (num_frames, num_filters/features).

Return type

tuple (num_samples, sr) or np.ndarray [size (num_frames, num_filters) dtype=np.float or np.complex]
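
Examples

A minimal sketch; ‘sound.wav’ is a hypothetical path, and the number of resulting frames depends on the length of the file:

>>> import soundpy
>>> # raw samples: returns a tuple of (samples, sr)
>>> samples, sr = soundpy.feats.get_feats('sound.wav', feature_type='signal')
>>> # MFCC features with first and second derivatives concatenated
>>> mfcc = soundpy.feats.get_feats('sound.wav', feature_type='mfcc', num_mfcc=13,
...                                rate_of_change=True, rate_of_acceleration=True)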

soundpy.feats.load_feat_settings(feat_settings_dict)[source]

Loads feature settings into a named tuple, setting defaults where absent. TODO: test with previous version.

soundpy.feats.get_stft(sound, sr=22050, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', zeropad=True, **kwargs)[source]

Returns short-time Fourier transform matrix.

This function allows more flexibility in the number of fft_bins and the real_signal setting. Additionally, it does not require the librosa package, making it a bit easier to manipulate if desired. For an example, see soundpy.augment.vtlp.

Parameters
  • sound (np.ndarray [shape=(num_samples,) or (num_samples, num_channels)], str, or pathlib.PosixPath) – If type np.ndarray, expect raw samples in mono or stereo sound. If type str or pathlib.PosixPath, expect pathway to audio file.

  • sr (int) – The sample rate of sound.

  • win_size_ms (int, float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • real_signal (bool) – If True, only half the FFT spectrum will be used; since the FFT of a real signal is symmetric, there should be no difference in the result. If anything, setting real_signal to True may make computation more efficient.

  • fft_bins (int) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound.

Returns

stft_matrix

Return type

np.ndarray[size=(num_frames, fft_bins)]
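
Examples

A minimal sketch using a generated tone instead of an audio file:

>>> import numpy as np
>>> import soundpy
>>> sr = 16000
>>> # one second of a 440 Hz sine tone
>>> samples = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
>>> stft = soundpy.feats.get_stft(samples, sr=sr, win_size_ms=50,
...                               percent_overlap=0.5, fft_bins=1024)
>>> # rows are frames, columns frequency bins (see Return type above)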

soundpy.feats.get_fbank(sound, sr, num_filters, fmin=None, fmax=None, fft_bins=None, **kwargs)[source]

Extract mel-filterbank energy features from audio.

Parameters
  • sound (np.ndarray [size=(num_samples,) or (num_samples, num_features)], str, or pathlib.PosixPath) – Sound in raw samples, a power spectrum, or a short-time-fourier-transform. If type string or pathlib.PosixPath, expect pathway to audio file.

  • sr (int) – The sample rate of sound.

  • num_filters (int) – The number of mel-filters to use when extracting mel-filterbank energies.

  • fmin (int or float, optional) – The minimum frequency of interest. If None, will be set to 0. (default None)

  • fmax (int or float, optional) – The maximum frequency of interest. If None, will be set to half of sr. (default None)

  • fft_bins (int, optional) – The number of frequency bins / fast Fourier transform bins used in calculating the fast Fourier transform. If None, set depending on type of parameter sound. If sound is a raw signal or audio pathway, fft_bins will be set to 1024; if sound is a STFT or power spectrum, fft_bins will be set to 2 * length of sound feature column, or 2 * sound.shape[1].

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_stft.

Returns

fbank – The mel-filterbank energies extracted. The number of samples depends on the parameters applied in soundpy.feats.get_stft.

Return type

np.ndarray [shape=(num_samples, num_filters)]

References

Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
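
Examples

A minimal sketch; ‘sound.wav’ is a hypothetical path, and keyword arguments such as win_size_ms are passed on to soundpy.feats.get_stft:

>>> import soundpy
>>> fbank = soundpy.feats.get_fbank('sound.wav', sr=16000, num_filters=40,
...                                 win_size_ms=20, percent_overlap=0.5)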

soundpy.feats.get_mfcc(sound, sr, num_mfcc, remove_first_coefficient=False, sinosoidal_liftering=False, **kwargs)[source]

Extracts mel-frequency cepstral coefficients from audio.

Parameters
  • sound (np.ndarray [size=(num_samples,) or (num_samples, num_features)] or str or pathlib.PosixPath) – If sound is a np.ndarray, expected as raw samples, a power spectrum or a short-time Fourier transform. If string or pathlib.PosixPath, should be the pathway to the audio file.

  • sr (int) – The sample rate of the sound.

  • num_mfcc (int) – The number of mel-frequency cepstral coefficients

  • remove_first_coefficient (bool) – If True, the first coefficient, representing amplitude or volume of signal, is removed. Found to sometimes improve automatic speech recognition. (default False)

  • sinosoidal_liftering (bool) – If True, reduces the influence of higher coefficients, which has been found to aid in handling background noise in automatic speech recognition. (default False)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_fbank()

References

Fayek, H. M. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. Retrieved from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
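
Examples

A minimal sketch; ‘speech.wav’ is a hypothetical path, and keyword arguments are passed on to soundpy.feats.get_fbank:

>>> import soundpy
>>> mfcc = soundpy.feats.get_mfcc('speech.wav', sr=16000, num_mfcc=13,
...                               remove_first_coefficient=True,
...                               sinosoidal_liftering=True)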

soundpy.feats.get_vad_stft(sound, sr=48000, win_size_ms=50, percent_overlap=0.5, real_signal=False, fft_bins=1024, window='hann', use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, zeropad=True, **kwargs)[source]

Returns the STFT matrix and VAD matrix. The STFT matrix contains only the sections where voice activity was detected.

Parameters
  • sound (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default 48000)

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • real_signal (bool) – If True, only half the FFT spectrum will be used; since the FFT of a real signal is symmetric, there should be no difference in the result. If anything, setting real_signal to True may make computation more efficient.

  • fft_bins (int) – Number of frequency bins to use when applying fast Fourier Transform. (default 1024)

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • use_beg_ms (int) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.

  • extend_window_ms (int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.

  • energy_thresh (int) – The threshold to set for measuring energy for VAD in the signal. (default 40)

  • freq_thresh (int) – The threshold to set for measuring frequency for VAD in the signal. (default 185)

  • sfm_thresh (int) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound

Returns

  • stft_matrix (np.ndarray [size=(num_frames_vad, fft_bins//2+1), dtype=np.complex_]) – The STFT matrix frames of where voice activity has been detected.

  • vad_matrix_extwin (np.ndarray [size=(num_frames,)]) – A vector of zeros and ones indicating which frames of the full STFT matrix contain voice activity.
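
Examples

A minimal sketch; ‘speech.wav’ is a hypothetical path, with the VAD thresholds left at their defaults:

>>> import soundpy
>>> stft_vad, vad_matrix = soundpy.feats.get_vad_stft('speech.wav', sr=48000,
...                                                   use_beg_ms=120,
...                                                   extend_window_ms=100)
>>> # stft_vad holds only the frames where voice activity was detected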

soundpy.feats.get_stft_clipped(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]

Returns STFT matrix and VAD matrix with beginning and ending silence removed.

Parameters
  • samples (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • extend_window_ms (int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound.

Returns

  • stft_speech (np.ndarray [size=(num_frames_clipped, fft_bins//2+1)]) – The STFT of the samples with beginning and ending silences clipped.

  • vad_matrix (np.ndarray [size=(num_frames,)]) – A vector of zeros and ones indicating which frames of the full STFT contain voice activity.

soundpy.feats.get_vad_samples(sound, sr=None, win_size_ms=50, percent_overlap=0.5, use_beg_ms=120, extend_window_ms=0, energy_thresh=40, freq_thresh=185, sfm_thresh=5, window='hann', zeropad=True, **kwargs)[source]

Returns samples and VAD matrix. Only samples where voice activity was detected are returned.

Parameters
  • sound (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data. Note: in the latter case, sr must be declared.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded. (default None)

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • use_beg_ms (int) – The amount of time in milliseconds to use from beginning of signal to estimate background noise.

  • extend_window_ms (int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired.

  • energy_thresh (int) – The threshold to set for measuring energy for VAD in the signal. (default 40)

  • freq_thresh (int) – The threshold to set for measuring frequency for VAD in the signal. (default 185)

  • sfm_thresh (int) – The threshold to set for measuring spectral flatness for VAD in the signal. (default 5)

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound

Returns

  • samples_matrix (np.ndarray [size = (num_samples_vad, )]) – The samples of where voice activity was detected.

  • vad_matrix_extwin (np.ndarray [size = (num_frames, )]) – A vector of zeros and ones indicating the frames / windows of the samples that either had voice activity or not.
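
Examples

A minimal sketch; ‘speech.wav’ is a hypothetical path:

>>> import soundpy
>>> samples_vad, vad_matrix = soundpy.feats.get_vad_samples('speech.wav',
...                                                         sr=48000)
>>> # samples_vad holds only the samples where voice activity was detected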

soundpy.feats.get_samples_clipped(samples, sr, win_size_ms=50, percent_overlap=0.5, extend_window_ms=0, window='hann', zeropad=True, **kwargs)[source]

Returns samples and VAD matrix with beginning and ending silence removed.

Parameters
  • samples (str or numpy.ndarray [size=(num_samples,) or (num_samples, num_channels)]) – If str, wavfile (must be compatible with scipy.io.wavfile). Otherwise the samples of the sound data.

  • sr (int, optional) – The sample rate of the sound data or the desired sample rate of the wavfile to be loaded.

  • win_size_ms (int or float) – Window length in milliseconds for Fourier transform to be applied (default 50)

  • percent_overlap (int or float) – Amount of overlap between processing windows. For example, if percent_overlap is set at 0.5, the overlap will be half that of win_size_ms. (default 0.5) If an integer is provided, it will be converted to a float between 0 and 1.

  • extend_window_ms (int) – The amount of time in milliseconds to pad or extend the identified VAD segments. This may be useful to include more speech / sound, if desired. (default 0)

  • window (str) – The window function to apply to each window segment. Options are ‘hann’ and ‘hamming’. (default ‘hann’)

  • zeropad (bool) – If True, samples will be zeropadded to fill any partially filled window. If False, the samples constituting the partially filled window will be cut off.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.files.loadsound.

Returns

  • samples_clipped (np.ndarray [size=(num_samples_clipped,)]) – The samples with beginning and ending silences clipped.

  • vad_matrix (np.ndarray [size=(num_frames,)]) – A vector of zeros and ones indicating which frames of the full signal contain voice activity.

soundpy.feats.normalize(data, max_val=None, min_val=None)[source]

Normalizes data to be between 0 and 1. Should not be applied to raw sample data.

This is useful if you have predetermined max and min values with which you want to normalize new data. It is helpful when training models on sound features (not raw samples).

Parameters
  • data (np.ndarray [size=(num_features,) or (num_frames,num_features)]) – Data to be normalized.

  • max_val (int or float, optional) – Predetermined maximum value. If None, will use max value from data.

  • min_val (int or float, optional) – Predetermined minimum value. If None, will use min value from data.

Returns

normed_data

Return type

np.ndarray [size = (num_features,) or (num_frames, num_features)]

Examples

>>> # using the min and max of a previous dataset:
>>> import numpy as np
>>> np.random.seed(0)
>>> input_samples = np.random.random_sample((5,))
>>> input_samples
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ])
>>> np.random.seed(40)
>>> previous_samples = np.random.random_sample((5,))
>>> previous_samples
array([0.40768703, 0.05536604, 0.78853488, 0.28730518, 0.45035059])
>>> max_prev = np.max(previous_samples)
>>> min_prev = np.min(previous_samples)
>>> output_samples = normalize(input_samples, min_val = min_prev, max_val = max_prev)
>>> output_samples
array([0.67303388, 0.89996095, 0.74661839, 0.66767314, 0.50232462])

soundpy.feats.plot_dom_freq(sound, energy_scale='power_to_db', title=None, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]

Plots the approximate dominant frequency over a STFT plot of a signal.

If sound has multiple channels, the dominant frequency of each channel is plotted in its own plot.

Parameters
  • sound (np.ndarray [shape=(num_samples,) or (num_samples, num_channels)]) – The sound to plot the dominant frequency of.

  • energy_scale (str) – The scale of energy for the plot. Features in the frequency domain are likely in power scale and need to be converted to decibels. (default ‘power_to_db’)

  • title (str) – The title for the plot. (default None)

  • **kwargs (additional keyword arguments) – Keyword arguments used in both soundpy.feats.get_stft and soundpy.dsp.get_pitch.

Returns

Return type

None

soundpy.feats.plot_vad(sound, energy_scale='power_to_db', title='Voice Activity', use_beg_ms=120, extend_window_ms=0, beg_end_clipped=True, save_pic=False, name4pic=None, overwrite=False, **kwargs)[source]

Plots where voice (sound) activity detected on power spectrum.

This either plots immediately or saves the plot at name4pic. If sound has multiple channels, the VAD for each channel is plotted in its own plot.

Parameters
  • sound (np.ndarray [shape=(num_samples,) or (num_samples, num_channels)]) – The sound to plot the VAD of.

  • energy_scale (str) – If plotting STFT or power spectrum, will plot it in decibels. (default ‘power_to_db’)

  • title (str) – The title of the plot (default ‘Voice Activity’)

  • use_beg_ms (int) – The amount of time in milliseconds to use from the beginning of the signal to estimate background noise. This is only applied if beg_end_clipped is set to False.

  • extend_window_ms (int) – The number of milliseconds VAD should be padded. This is useful if one wants to encompass more speech if the VAD is not including all the speech / desired sound. However, this may capture more noise. (default 0)

  • beg_end_clipped (bool) – If True, just the silences at the beginning and end of the sample will be cut off. If False, VAD will be checked throughout the sample, not just at the beginning and end. NOTE: Both options have strengths and weaknesses. Checking VAD throughout the entire signal (i.e. when beg_end_clipped is set to False) is sometimes unreliable, failing to recognize speech in speech-filled samples. And when set to True, some speech sounds tend to get ignored (‘s’, ‘x’ and other fricatives).

  • save_pic (bool) – If True, the plot will be saved rather than plotted immediately.

  • name4pic (str) – The full pathway and filename to save the picture (as .png file). A file extension is expected. (default None)

  • overwrite (bool) – If False, a date tag will be added to name4pic if name4pic already exists. (default False)

  • **kwargs (keyword arguments) – Additional keyword arguments for soundpy.feats.get_speech_stft or soundpy.dsp.vad.

Returns

Return type

None
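
Examples

A minimal sketch; ‘speech.wav’ is a hypothetical path, loaded first with soundpy.files.loadsound, and sr is assumed to be accepted among the pass-through keyword arguments:

>>> import soundpy
>>> samples, sr = soundpy.files.loadsound('speech.wav', sr=48000)
>>> soundpy.feats.plot_vad(samples, sr=sr, extend_window_ms=100,
...                        save_pic=True, name4pic='vad_plot.png')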

soundpy.feats.get_change_acceleration_rate(spectro_data)[source]

Gets first and second derivatives of spectral data.

This is useful particularly for speech recognition.

Parameters

spectro_data (np.ndarray [shape = (num_samples, num_features)]) – The spectral data from which to compute the first and second derivatives.

Returns

  • delta (np.ndarray [shape = (num_samples, num_features)]) – The first order derivative of spectral data. Reflects rate of change in signal.

  • delta_delta (np.ndarray [shape = (num_samples, num_features)]) – The second order derivative of spectral data. Reflects rate of acceleration in signal.
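
Examples

A minimal sketch; ‘sound.wav’ is a hypothetical path, and concatenating along the feature axis mirrors what soundpy.feats.get_feats does when rate_of_change is True:

>>> import numpy as np
>>> import soundpy
>>> fbank = soundpy.feats.get_feats('sound.wav', feature_type='fbank')
>>> delta, delta_delta = soundpy.feats.get_change_acceleration_rate(fbank)
>>> feats_full = np.concatenate([fbank, delta, delta_delta], axis=1)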

soundpy.feats.get_mfcc_fbank(samples, feature_type='mfcc', sr=48000, win_size_ms=20, percent_overlap=0.5, num_filters=40, num_mfcc=40, fft_bins=None, window_function=None, zeropad=True, **kwargs)[source]

Collects fbank or mfcc features via python-speech-features (rather than librosa).

soundpy.feats.zeropad_features(feats, desired_shape, complex_vals=False)[source]

Applies zeropadding to a copy of feats.

soundpy.feats.reduce_num_features(feats, desired_shape)[source]

Limits the number of features of a copy of feats.

This is useful if you want the features to be a certain size, for training models for example.

soundpy.feats.adjust_shape(data, desired_shape, change_dims=None, complex_vals=None)[source]
soundpy.feats.reduce_dim(matrix, axis=0)[source]
soundpy.feats.featshape_new_subframe(feature_matrix_shape, new_frame_size, zeropad=True, axis=0, include_dim_size_1=False)[source]

Subdivides features from (num_frames, num_feats) to (num_subframes, new_frame_size, num_feats)

Parameters
  • feature_matrix_shape (tuple [size=(num_frames, num_features)]) – Feature matrix shape to be subdivided. Can be multidimensional.

  • new_frame_size (int) – The number of frames each subframe should contain; axis will be sectioned into subframes of this length.

  • zeropad (bool) – If True, frames that don’t completely fill a new_frame_size will be zeropadded. Otherwise, those frames will be discarded. (default True)

  • axis (int) – The axis where the new_frame_size should be applied. (default 0)

Returns

new_shape

Return type

tuple [size=(num_subframes, new_frame_size, num_feats)]
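
Examples

A minimal sketch; per the return description above, a (num_frames, num_feats) shape is sectioned along axis 0:

>>> import soundpy
>>> new_shape = soundpy.feats.featshape_new_subframe((100, 40),
...                                                  new_frame_size=10)
>>> # expected: (10, 10, 40), i.e. (num_subframes, new_frame_size, num_feats)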

soundpy.feats.apply_new_subframe(feature_matrix, new_frame_size, zeropad=True, axis=0)[source]

Reshapes feature_matrix to allow for new_frame_size.

Note: Dimensions of feature_matrix must be at least 2 and can be up to 5, returning a matrix with one additional dimension.

Parameters
  • feature_matrix (np.ndarray [size(num_frames, num_features) ]) – Expects minimum 2D, maximum 5D matrix.

  • new_frame_size (int) – The number of frames each subframe should contain; axis will be sectioned into subframes of this length.

  • axis (int) – The axis to apply the new_frame_size. (default 0)

  • zeropad (bool) – If True, the feature_matrix will be zeropadded to include frames that do not fill entire frame_size, given the new_frame_size. If False, feature_matrix will not include the last zeropadded frame. (default True)

Returns

feats_reshaped – The feature_matrix returned with axis subdivided into 2 dimensions: the number of subframes and the new_frame_size.

Return type

np.ndarray [size(num_subframes, new_frame_size, num_features)]

Raises

ValueError – If the number of dimensions of feature_matrix is below 2 or exceeds 5.

Examples

>>> import numpy as np
>>> matrix = np.arange(24).reshape(3,4,2)
>>> # apply new_frame_size to dimension of length 4 (i.e. axis 1)
>>> matrix_zp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1)
>>> matrix_zp.shape
(3, 2, 3, 2)
>>> matrix_zp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 0,  0],
         [ 0,  0]]],


       [[[ 8,  9],
         [10, 11],
         [12, 13]],

        [[14, 15],
         [ 0,  0],
         [ 0,  0]]],


       [[[16, 17],
         [18, 19],
         [20, 21]],

        [[22, 23],
         [ 0,  0],
         [ 0,  0]]]])

>>> matrix_nozp = apply_new_subframe(matrix, new_frame_size = 3, axis = 1,
...                                    zeropad=False)
>>> matrix_nozp.shape
(3, 1, 3, 2)
>>> matrix_nozp
array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]]],


       [[[ 8,  9],
         [10, 11],
         [12, 13]]],


       [[[16, 17],
         [18, 19],
         [20, 21]]]])

soundpy.feats.check_percent_overlap(percent_overlap)[source]

Ensures percent_overlap is between 0 and 1.

soundpy.feats.separate_dependent_var(matrix)[source]

Separates matrix into features and labels. Expects 3D array.

Assumes the last column of the last dimension of the matrix constitutes the dependent variable (labels), and all other columns the independent variables (features). Additionally, it is assumed that for each block of data, only one label is needed; therefore, just the first label is taken for each block.

Parameters

matrix (numpy.ndarray [size = (num_samples, num_frames, num_features)]) – The matrix holds the numerical data to separate. num_features is expected to be at least 2.

Returns

  • X (numpy.ndarray [size = (num_samples, num_frames, num_features -1)]) – A matrix holding the (assumed) independent variables

  • y (numpy.ndarray, numpy.int64, numpy.float64 [size = (num_samples,)]) – A vector holding the labels assigned to the independent variables. If the array contains only one value, just that value is returned.

Examples

>>> import numpy as np
>>> #vector
>>> separate_dependent_var(np.array([1,2,3,4]))
(array([1, 2, 3]), 4)
>>> #simple matrix
>>> matrix = np.arange(4).reshape(2,2)
>>> matrix
array([[0, 1],
       [2, 3]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[0],
       [2]])
>>> y
1
>>> #more complex matrix
>>> matrix = np.arange(20).reshape((2,2,5))
>>> matrix
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
>>> X, y = separate_dependent_var(matrix)
>>> X
array([[[ 0,  1,  2,  3],
        [ 5,  6,  7,  8]],

       [[10, 11, 12, 13],
        [15, 16, 17, 18]]])
>>> y
array([ 4, 14])

soundpy.feats.add_tensor(matrix)[source]

Adds tensor / dimension to input ndarray (e.g. features).

Keras requires an extra dimension at some layers, which represents the ‘tensor’ encapsulating the data.

Further clarification, using the example below: the input matrix has shape (2,3,4). Think of it as 2 different events, each having 3 sets of measurements, with each of those having 4 features. So, let’s measure differences between 2 cities at 3 different times of day. Let’s take measurements at 08:00, 14:00, and 19:00 in… Magic City and Never-ever Town. We’ll measure 1) temperature, 2) wind speed, 3) light level, and 4) noise level.

As I best understand it, putting our measurements into a matrix with an added dimension/tensor highlights the separate measurements, telling the algorithm: yes, these are 4 features from the same city, BUT they occur at different times. Or it’s just how Keras set up the code :P

Parameters

matrix (numpy.ndarray) – The matrix holds the numerical data to add a dimension to.

Returns

matrix – The matrix with an additional dimension.

Return type

numpy.ndarray

Examples

>>> import numpy as np
>>> matrix = np.arange(24).reshape((2,3,4))
>>> matrix.shape
(2, 3, 4)
>>> matrix
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
>>> matrix_2 = add_tensor(matrix)
>>> matrix_2.shape
(2, 3, 4, 1)
>>> matrix_2
array([[[[ 0],
         [ 1],
         [ 2],
         [ 3]],

        [[ 4],
         [ 5],
         [ 6],
         [ 7]],

        [[ 8],
         [ 9],
         [10],
         [11]]],


       [[[12],
         [13],
         [14],
         [15]],

        [[16],
         [17],
         [18],
         [19]],

        [[20],
         [21],
         [22],
         [23]]]])

soundpy.feats.scale_X_y(matrix, is_train=True, scalars=None)[source]

Separates and scales data into X and y arrays. Adds dimension for keras.

Assumes the last column of the last dimension is the y or label data.

Parameters
  • matrix (np.ndarray [size = (num_samples, num_frames, num_features)]) – Matrix with X and y data

  • is_train (bool) – Relevant for the scalars parameter. If the data is training data (i.e. True), the scalars will be created. If the data is test data (i.e. False), the function expects scalars to be provided. (default True)

  • scalars (dict, optional) – Dictionary with scalars to be applied to non-training data.

Returns

  • X (np.ndarray [size = (num_samples, num_frames, num_features-1, 1)]) – Scaled features with extra dimension

  • y (np.ndarray [size = (num_samples, 1, 1)]) – Scaled independent variable with extra dimension

  • scalars (dict) – The scalars either created or previously loaded.
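
Examples

A minimal sketch with random stand-in matrices; the scalars fitted on the training data are reused for the validation data:

>>> import numpy as np
>>> import soundpy
>>> train_data = np.random.rand(10, 5, 41)  # last column holds the labels
>>> val_data = np.random.rand(4, 5, 41)
>>> X_train, y_train, scalars = soundpy.feats.scale_X_y(train_data)
>>> X_val, y_val, _ = soundpy.feats.scale_X_y(val_data, is_train=False,
...                                           scalars=scalars)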

soundpy.feats.list_available_features()[source]
soundpy.feats.get_feature_matrix_shape(sr=None, dur_sec=None, feature_type=None, win_size_ms=None, percent_overlap=None, fft_bins=None, num_mfcc=None, num_filters=None, rate_of_change=False, rate_of_acceleration=False, context_window=None, frames_per_sample=None, zeropad=True, labeled_data=False, remove_first_coefficient=False, real_signal=False, **kwargs)[source]

Returns expected shapes of feature matrix depending on several parameters.

Parameters
  • sr (int) – Sample rate of the audio to be extracted.

  • dur_sec (int, float) – The number of seconds of audio that feature extraction will be applied to.

  • feature_type (str) – Accepted features include ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, ‘mfcc’. The feature_type applied will influence the resulting shape of the feature matrix.

  • win_size_ms (int or float) – The size of the window the audio signal should be broken into. If feature_type is set to ‘signal’, this is irrelevant. Otherwise will raise TypeError if set to None.

  • percent_overlap (float) – The amount of overlap between windows. If set to 0.5, the number of overlapping samples will be half the number of samples that make up win_size_ms.

  • fft_bins (int) – The number of frequency bins to use when calculating the fast Fourier transform. If None, the calculated frame_length will be used.

  • num_mfcc (int) – If extracting ‘mfcc’ features, the total number of coefficients expected.

  • num_filters (int) – If extracting ‘fbank’ features, the total number of mel-filters to be applied.

  • rate_of_change (bool) – If True, the first delta will be concatenated to features extracted.

  • rate_of_acceleration (bool) – If True, the second delta will be concatenated to features extracted.

  • context_window (int) – The size of the context window, or the number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts.

  • frames_per_sample (int) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions and available in generator functions as context_window. frames_per_sample equals 2 * context_window + 1. See soundpy.models.dataprep.Generator

  • zeropad (bool) – If True, windows and frames will be zeropadded to avoid losing any sample data.

  • labeled_data (bool) – If True, a label will be added to the output shape of features.

  • remove_first_coefficient (bool) – If True, the first mfcc coefficient will not be included in feature matrix.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats. These may not be used in this function as they may not influence the size of the feature matrix.

Returns

  • feature_matrix_base (tuple) – The base shape of the feature matrix. This is the shape that should result from extracting the features for each audio file.

  • feature_matrix_model (tuple) – The shape relevant to training models. For example, one including space for a context window and label.
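
Examples

A minimal sketch; the parameter values are illustrative:

>>> import soundpy
>>> base_shape, model_shape = soundpy.feats.get_feature_matrix_shape(
...     sr=16000, dur_sec=1, feature_type='fbank', win_size_ms=20,
...     percent_overlap=0.5, num_filters=40)
>>> # base_shape: the per-audiofile feature shape; model_shape additionally
>>> # accounts for context window / label settings, if provided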

soundpy.feats.visualize_feat_extraction(feats, iteration=None, dataset=None, label=None, datadir=None, subsections=False, **kwargs)[source]

Saves plots of features during feature extraction or training of models.

Parameters
  • feats (np.ndarray [shape=(num_samples,) or (num_samples, num_frames) or (num_frames, num_features) or (num_subsections, num_frames, num_features)]) – The extracted features can be raw signal data, stft, fbank, powspec, mfcc data, either as a single plot or subsectioned into batches / subframes.

  • iteration (int, optional) – The iteration of the audio getting extracted; e.g. the 10th training item.

  • dataset (str, optional) – The identifying string (for example ‘train’ , ‘val’, or ‘test’, but this can be anything).

  • label (str, int, optional) – The label of the audio file. Used in titles and filenames.

  • datadir (str, pathlib.PosixPath, optional) – The directory where related data is located. An ‘image’ directory will be created within this datadir where the saved plots will be stored. If None, will be created in current working directory.

  • subsections (bool, optional) – To subsection raw ‘signal’ data into frames. For other features, this is easier to identify via the shape of feats.

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats

Returns

Return type

None

soundpy.feats.save_features_datasets(datasets_dict, datasets_path2save_dict, context_window=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, log_settings=True, decode_dict=None, random_seed=None, **kwargs)[source]

Extracts and saves audio features, sectioned into datasets, to indicated locations.

If a MemoryError is raised, the provided dataset dicts will be adjusted so the data can be subsectioned.

Parameters
  • datasets_dict (dict) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {‘train’:[‘1.wav’, ‘2.wav’, ‘3.wav’], ‘val’: [‘4.wav’], ‘test’:[‘5.wav’]} for unlabeled data or {‘train’:[(0, ‘1.wav’), (1, ‘2.wav’), (0, ‘3.wav’)], ‘val’: [(1, ‘4.wav’)], ‘test’:[(0, ‘5.wav’)]} for labeled data.

  • datasets_path2save_dict (dict) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}

  • context_window (int) – The size of the context window, or the number of samples padding a central frame. This may be useful for models training on small changes occurring in the signal, e.g. to break up the image of sound into smaller parts, to feed to a long short-term memory network (LSTM), for example. (Can avoid this by simply reshaping data later.)

  • frames_per_sample (int) – The previous keyword argument for segmenting audio into smaller parts. Will be removed in future versions. This equals 2 * context_window + 1.

  • labeled_data (bool) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.

  • subsection_data (bool) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.

  • divide_factor (int, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)

  • visualize (bool) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)

  • vis_every_n_frames (int) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)

  • log_settings (bool) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)

  • decode_dict (dict, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.

Returns

  • datasets_dict (dict) – The final dataset dictionary used in feature extraction. The datasets may have been subdivided.

  • datasets_path2save_dict (dict) – The final dataset feature pathway dict. The pathways will have been adjusted if the datasets have been subdivided.

See also

soundpy.feats.get_feats

Extract features from audio file or audio data.

soundpy.feats.save_features_datasets_zipfiles(datasets_dict, datasets_path2save_dict, extract_dir, dur_sec, feature_type='fbank', num_feats=None, sr=22050, win_size_ms=20, percent_overlap=0.5, n_fft=None, frames_per_sample=None, labeled_data=False, subsection_data=False, divide_factor=None, visualize=False, vis_every_n_frames=50, use_librosa=True, center=True, mode='reflect', log_settings=True, decode_dict=None, audiofile_lim=10, **kwargs)[source]

Extracts and saves audio features, sectioned into datasets, to indicated locations.

If a MemoryError is raised, the provided dataset dicts will be adjusted so the data can be subsectioned.

Parameters
  • datasets_dict (dict) – Dictionary with keys representing datasets and values the audiofiles making up that dataset. E.g. {‘train’:[‘1.wav’, ‘2.wav’, ‘3.wav’], ‘val’: [‘4.wav’], ‘test’:[‘5.wav’]} for unlabeled data or {‘train’:[(0, ‘1.wav’), (1, ‘2.wav’), (0, ‘3.wav’)], ‘val’: [(1, ‘4.wav’)], ‘test’:[(0, ‘5.wav’)]} for labeled data.

  • datasets_path2save_dict (dict) – Dictionary with keys representing datasets and values the pathways of where extracted features of that dataset will be saved. E.g. {‘train’: ‘./data/train.npy’, ‘val’: ‘./data/val.npy’, ‘test’: ‘./data/test.npy’}

  • feature_type (str) – String including only one of the following: ‘signal’, ‘stft’, ‘powspec’, ‘fbank’, and ‘mfcc’; e.g. ‘mfcc_noisy’ or ‘stft_train’. ‘signal’ currently only supports mono channel data (TODO: test for stereo). ‘powspec’ and ‘stft’ are basically the same; ‘powspec’ is the ‘stft’ without complex values and squared.

  • sr (int) – The sample rate the audio data should be loaded with.

  • n_fft (int) – The number of frequency bins used for the Fast Fourier Transform (fft)

  • dur_sec (int or float) – The desired duration of the audio data. This is used to calculate the size of the feature data and is therefore necessary, as audiofiles tend to differ in length. If audiofiles are longer or shorter, they will be cut or zeropadded respectively.

  • num_feats (int) – The number of mfcc coefficients (mfcc), mel filters (fbank), or frequency bins (stft).

  • win_size_ms (int) – The desired window size in milliseconds to process audio samples.

  • percent_overlap (float) – The amount audio samples should overlap as each window is processed.

  • frames_per_sample (int, optional) – If you want to section each audio file feature data into smaller frames. This might be useful for speech related contexts. (Can avoid this by simply reshaping data later)

  • labeled_data (bool) – If True, expects each audiofile to be accompanied by an integer label. See example given for datasets_dict.

  • subsection_data (bool) – If you have a large dataset, you may want to divide it into subsections. See soundpy.datasets.subsection_data. If datasets are large enough to raise a MemoryError, this will be applied automatically.

  • divide_factor (int, optional) – The number of subsections to divide data into. Only large enough sections will be divided. If smaller datasets (i.e. validation and test datasets) are as large or smaller than the new subsectioned larger dataset(s) (i.e. train), they will be left unchanged. (defaults to 5)

  • visualize (bool) – If True, periodic plots of the features will be saved throughout the extraction process. (default False)

  • vis_every_n_frames (int) – How often visuals should be made: every 10 samples, every 100, etc. (default 50)

  • use_librosa (bool) – If True, librosa is used to load and extract features. As of now, no other option is available. TODO: add other options. :P I just wanted to be clear that some elements of this function are unique to using librosa. (default True)

  • center (bool) – Relevant for librosa and feature extraction. (default True)

  • mode (str) – Relevant for librosa and feature extraction. (default ‘reflect’)

  • log_settings (bool) – If True, a .csv file will be saved in the feature extraction directory with most of the feature settings saved. (default True)

  • decode_dict (dict, optional) – The dictionary to get the label given the encoded label. This is for plotting purposes. (default None)

  • **kwargs (additional keyword arguments) – Keyword arguments for soundpy.feats.get_feats.

Returns

  • datasets_dict (dict) – The final dataset dictionary used in feature extraction. The datasets may have been subdivided.

  • datasets_path2save_dict (dict) – The final dataset feature pathway dict. The pathways will have been adjusted if the datasets have been subdivided.

See also

soundpy.feats.get_feats

Extract features from audio file or audio data.

soundpy.feats.prep_new_audiofeats(feats, desired_shape, input_shape)[source]

Prepares new audio data to feed to a pre-trained model.

Parameters
  • feats (np.ndarray [shape = (num_frames, num_features)]) – The features to prepare for feeding to a model.

  • desired_shape (tuple) – The expected number of samples necessary to fulfill the expected input_shape for the model. The feats will be zeropadded or limited to match this desired_shape.

  • input_shape (tuple) – The input_shape the model expects a single sample of data to be.

Returns

feats_reshaped – The features reshaped to what the model expects.

Return type

np.ndarray [shape = (input_shape)]

soundpy.feats.feats2audio(feats, feature_type, sr, win_size_ms, percent_overlap, phase=None)[source]

Prepares features into audio playable format.

Parameters
  • feats (np.ndarray [shape = (num_frames, num_feats)]) – If the features are a signal, [size = (batch_size * num_frames * num_features, 1)]. Otherwise [size = (batch_size * num_frames, num_features)].

  • feature_type (str) – Either ‘stft’, ‘fbank’, ‘signal’, or ‘mfcc’. For the ‘signal’ feature, only mono channel is supported.

  • sr (int) – Sampling rate that the features were extracted with

  • win_size_ms (int) – The window size in milliseconds the features were extracted with

  • percent_overlap (float) – The percent overlap between windows.

  • phase (np.ndarray [shape = (num_frames, num_feats)], optional) – The original phase information of the reconstructed signal.

Returns

y – The reconstructed signal in samples.

Return type

np.ndarray [shape = (num_samples, )]
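
Examples

A minimal round-trip sketch; ‘sound.wav’ is a hypothetical path, and the extraction parameters must match those given to feats2audio:

>>> import soundpy
>>> stft = soundpy.feats.get_feats('sound.wav', sr=16000, feature_type='stft',
...                                win_size_ms=20, percent_overlap=0.5)
>>> y = soundpy.feats.feats2audio(stft, feature_type='stft', sr=16000,
...                               win_size_ms=20, percent_overlap=0.5)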

soundpy.feats.grayscale2color(image_matrix, colorscale=3)[source]

Expects a grayscale image and copies the first channel into additional channels.

This is useful for pre-trained models that require features to have RGB channels rather than grayscale. Assumes the last channel is the colorscale column.
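
Examples

A minimal sketch with a random stand-in image; the grayscale channel is assumed to sit in the last dimension:

>>> import numpy as np
>>> import soundpy
>>> gray = np.random.rand(64, 64, 1)
>>> rgb = soundpy.feats.grayscale2color(gray, colorscale=3)
>>> # expected shape: (64, 64, 3), the single channel copied three times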