Organizing datasets

The datasets module contains functions related to organizing datasets.

soundpy.datasets.create_encodedlabel2audio_dict(dict_encodelabels, paths_list, limit=None, seed=40)

Creates a dictionary with encoded audio labels as keys and lists of audio file paths as values.

If no label is found in the filename path, the label is not included in the returned dictionary: labels are only included if corresponding paths are present.

Parameters

dict_encodelabels (dict) – Dictionary containing the labels as keys and their encoded values as values.
paths_list (set, list) – List containing pathlib.PosixPath objects (i.e. paths) of all audio files. The audio files are expected to reside in directories whose names match their audio class.
limit (int, optional) – The maximum number of audio files to include for each class. This may be useful for ensuring a balanced dataset. (default None)
seed (int, optional) – The seed for pseudorandomizing the audio files when a limit is requested. If seed is None, the randomized selection of the limited audio files cannot be reproduced. (default 40)

Returns

label_waves_dict (OrderedDict) – A dictionary with encoded audio labels as keys with values being the audio files corresponding to that label

Examples

>>> from pathlib import Path
>>> from soundpy.datasets import create_encodedlabel2audio_dict
>>> labels = dict([('vacuum',2),('fridge',0),('wind',1)])
>>> paths = [Path('data/audio/vacuum/vacuum1.wav'),
...         Path('data/audio/fridge/fridge1.wav'),
...         Path('data/audio/vacuum/vacuum2.wav'),
...         Path('data/audio/wind/wind1.wav')]
>>> label_waves_dict = create_encodedlabel2audio_dict(labels, paths)
>>> label_waves_dict
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav'), PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> #to set a limit on number of audiofiles per class:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=40)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum2.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
>>> #change the limited pathways chosen:
>>> create_encodedlabel2audio_dict(labels, paths, limit=1, seed=10)
OrderedDict([(0, [PosixPath('data/audio/fridge/fridge1.wav')]), (2, [PosixPath('data/audio/vacuum/vacuum1.wav')]), (1, [PosixPath('data/audio/wind/wind1.wav')])])
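The grouping described above can be approximated in plain Python. The following is a minimal sketch, not the soundpy implementation: the helper name group_paths_by_label is hypothetical, and it assumes that a file's label is the name of its parent directory and that labels are visited in alphabetical order. The files it selects under a limit will not necessarily match those chosen by soundpy for the same seed.

```python
from collections import OrderedDict
from pathlib import Path
import random


def group_paths_by_label(label2int, paths, limit=None, seed=40):
    """Hypothetical helper: group paths under the encoded label of their parent directory."""
    grouped = OrderedDict()
    for label in sorted(label2int):
        # Assumption: the parent directory name is the audio class label.
        matches = sorted(p for p in paths if Path(p).parent.name == label)
        if not matches:
            continue  # labels without any files are left out
        if limit is not None:
            # Shuffle reproducibly, then keep at most `limit` files.
            random.Random(seed).shuffle(matches)
            matches = matches[:limit]
        grouped[label2int[label]] = matches
    return grouped


labels = {'vacuum': 2, 'fridge': 0, 'wind': 1, 'rain': 3}
paths = [Path('data/audio/vacuum/vacuum1.wav'),
         Path('data/audio/fridge/fridge1.wav'),
         Path('data/audio/vacuum/vacuum2.wav'),
         Path('data/audio/wind/wind1.wav')]
grouped = group_paths_by_label(labels, paths)
# 'rain' has no files, so its encoded value 3 is absent from the result.
print(list(grouped))  # [0, 2, 1]
limited = group_paths_by_label(labels, paths, limit=1)
```

Creating a fresh random.Random(seed) for each label keeps the selection independent of how many other labels are present, which is one possible design choice among several.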

soundpy.datasets.create_dicts_labelsencoded(labels_class, add_extra_label=False, extra_label='silence')

Encodes audio class labels and saves them in dictionaries.

The labels are alphabetized and each is encoded as its index in that order. If add_extra_label is True, extra_label is added as the last entry in the dictionary. This is useful when performing voice activity detection and sections without voice activity should be labeled as silent rather than with a speech label.

Parameters

labels_class (set, list) – Set or list containing the labels of all audio classes.
add_extra_label (bool) – If True, extra_label is added to the dictionary. (default False)
extra_label (str) – The extra label to add. (default ‘silence’)

Returns

dict_label2int (dict) – Dictionary where the keys are the string labels and the values are the encoded integers
dict_int2label (dict) – Dictionary where the keys are the encoded integers and the values are the string labels

Examples

>>> from soundpy.datasets import create_dicts_labelsencoded
>>> labels = {'wind','air_conditioner','fridge'}
>>> label2int, int2label = create_dicts_labelsencoded(labels)
>>> label2int
{'air_conditioner': 0, 'fridge': 1, 'wind': 2}
>>> int2label
{0: 'air_conditioner', 1: 'fridge', 2: 'wind'}
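The encoding scheme described above (alphabetize, then number) is small enough to sketch directly. The function name encode_labels is hypothetical and this is an illustration of the idea rather than the soundpy source. Skipping an extra label that is already among the labels is an assumption of this sketch.

```python
def encode_labels(labels, add_extra_label=False, extra_label='silence'):
    """Hypothetical sketch: map alphabetized labels to integer indices."""
    ordered = sorted(labels)
    if add_extra_label and extra_label not in ordered:
        # Appending after sorting gives the extra label the last index.
        ordered.append(extra_label)
    label2int = {label: i for i, label in enumerate(ordered)}
    int2label = {i: label for label, i in label2int.items()}
    return label2int, int2label


label2int, int2label = encode_labels({'wind', 'air_conditioner', 'fridge'},
                                     add_extra_label=True)
print(label2int)  # {'air_conditioner': 0, 'fridge': 1, 'wind': 2, 'silence': 3}
print(int2label)  # {0: 'air_conditioner', 1: 'fridge', 2: 'wind', 3: 'silence'}
```

Because the extra label is appended after sorting, it keeps the highest index even when it would otherwise sort earlier in the alphabet.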

soundpy.datasets.waves2dataset(audiolist, perc_train=0.8, seed=40, train=True, val=True, test=True)

Organizes a list of audio files into train, validation, and test datasets.

If only two datasets are to be prepared, the files are assigned to train and val; if only one, they are assigned to train. The remaining ‘datasets’ stay empty.

Parameters

audiolist (list) – List containing paths to audio files
perc_train (float, int) – Percentage of data to be in the training dataset (default 0.8)
seed (int, None, optional) – Seed for the pseudorandom generation of the train, validation, and test datasets. Useful for reproducing results. (default 40)
train (bool) – If True, training data will be prepared. (default True)
val (bool) – If True, validation data will be prepared. (default True)
test (bool) – If True, test data will be prepared. (default True)

Returns

train_waves (list) – List of audio files for the training dataset
val_waves (list) – List of audio files for the validation dataset
test_waves (list) – List of audio files for the test dataset

Examples

>>> from soundpy.datasets import waves2dataset
>>> #Using a list of numbers instead of filenames
>>> audiolist = [1,2,3,4,5,6,7,8,9,10]
>>> #default settings:
>>> waves2dataset(audiolist)
([5, 4, 9, 2, 3, 10, 1, 6], [8], [7])
>>> #perc_train set to 50% instead of 80%:
>>> waves2dataset(audiolist, perc_train=50)
([5, 4, 9, 2, 3, 10], [1, 6], [8, 7])
>>> #change seed number
>>> waves2dataset(audiolist, seed=0)
([7, 1, 2, 5, 6, 9, 10, 8], [4], [3])
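A seeded three-way split of this kind can be sketched as follows. The function split_three_ways is hypothetical and uses Python's random module; soundpy may shuffle differently, so the order of items will not necessarily match the outputs shown above. Only the dataset sizes and the reproducibility for a fixed seed are illustrated.

```python
import random


def split_three_ways(items, perc_train=0.8, seed=40):
    """Hypothetical sketch: shuffle reproducibly, then slice into train/val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * perc_train)
    # Assumption: whatever is not training data is halved between val and test.
    n_val = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_three_ways(range(1, 11))
print(len(train), len(val), len(test))  # 8 1 1
```

Shuffling a copy leaves the caller's list untouched, and slicing a single shuffled list guarantees that no item lands in more than one dataset.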

soundpy.datasets.audio2datasets(audiodata, perc_train=0.8, limit=None, seed=None, audio_only=True, **kwargs)

Organizes all audio in audio class directories into datasets (randomized).

The data not assigned to training are split evenly between the validation and test datasets. For example, if perc_train is 0.8, validation data will be 0.1 and test data will be 0.1.

Parameters

audiodata (str, pathlib.PosixPath, dict, list, or set) – If the data has multiple labels, the path to the dictionary where audio class labels and the paths of all audio files belonging to each class are or will be stored. The dictionary with the labels and their encoded values can also be supplied directly here. If the data does not have labels, a list or set of audio files can be provided to be placed in the train, val, and test datasets.
seed (int, optional) – A seed that makes the randomized order of audio files reproducible. If None, the order cannot be reproduced. (default None)
audio_only (bool) – Whether only audio files are expected (e.g. with extensions .wav, .flac, etc.). If True, the list is checked to contain only audio files; otherwise no check is made. (default True)
**kwargs (additional keyword arguments) – Keyword arguments for soundpy.datasets.waves2dataset

Returns

dataset_audio – Named tuple including three lists / datasets of audiofiles or label-audiofile pairs: the train, validation, and test lists, respectively. The label-audiofile pairs are saved as tuples within the lists and contain the encoded label integer (e.g. 0 instead of ‘air_conditioner’) and the audio paths associated to that class and dataset.

Return type

tuple

Raises

ValueError – If perc_train is set too high for the amount of data or there are simply too few data. Specifically, if the percentage of train data cannot be upheld while also ensuring the validation and test datasets have more than 1 sample.
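The relationship between perc_train, the amount of data, and the ValueError can be illustrated with a little arithmetic. This is a sketch under stated assumptions: the helper split_sizes is hypothetical, and it treats "too few data" as the validation or test dataset ending up with no samples at all, which may differ from the exact threshold soundpy enforces.

```python
def split_sizes(n_items, perc_train=0.8):
    """Hypothetical sketch: dataset sizes for a given number of audio files."""
    n_train = round(n_items * perc_train)
    remainder = n_items - n_train
    n_val = remainder // 2
    n_test = remainder - n_val
    # Assumption: an empty validation or test dataset counts as too few data.
    if n_val < 1 or n_test < 1:
        raise ValueError(
            f'perc_train={perc_train} leaves too few of {n_items} items '
            'for validation and test datasets.')
    return n_train, n_val, n_test


print(split_sizes(100, 0.8))  # (80, 10, 10)

try:
    split_sizes(5, 0.8)  # 4 training items leave only 1 item for val and test
except ValueError as error:
    print(error)
```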

soundpy.datasets.separate_train_val_test_files(list_of_files)

Checks that file(s) exist, then sorts file(s) into train, val, test lists.

If ‘nois’ or ‘clean’ are in the filenames, two paths lists per dataset will be generated. Otherwise just one. This paths list is useful if there are multiple training files available for training a model (e.g. for large datasets).

Parameters

list_of_files (list, str, or pathlib.PosixPath) – The feature files (format: .npy) for training a model.

Returns

(train_paths_list, val_paths_list, test_paths_list) – Tuple comprised of paths lists to train, validation, and test data files. If noisy and clean data files are found, each tuple item will be a tuple comprised of two lists: a noisy file paths list and a clean file paths list.

Return type

tuple

Examples

>>> from soundpy.datasets import separate_train_val_test_files
>>> features_files = ['train1.npy', 'train2.npy', 'val.npy', 'test.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train
[PosixPath('train1.npy'), PosixPath('train2.npy')]
>>> datasets.val
[PosixPath('val.npy')]
>>> datasets.test
[PosixPath('test.npy')]
>>> # try with noisy and clean data
>>> features_files = ['train_noisy.npy', 'train_clean.npy', 'val_noisy.npy', 'val_clean.npy', 'test_noisy.npy', 'test_clean.npy']
>>> datasets = separate_train_val_test_files(features_files)
>>> datasets.train.noisy
[PosixPath('train_noisy.npy')]
>>> datasets.train.clean
[PosixPath('train_clean.npy')]
>>> datasets.val.noisy
[PosixPath('val_noisy.npy')]
>>> datasets.val.clean
[PosixPath('val_clean.npy')]
>>> datasets.test.noisy
[PosixPath('test_noisy.npy')]
>>> datasets.test.clean
[PosixPath('test_clean.npy')]
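Sorting by filename can be sketched with a named tuple and a substring check. The names Datasets and sort_feature_files are hypothetical, and this sketch covers only the simple case: it does not check that the files exist and does not separate noisy from clean files, both of which the soundpy function is documented to do.

```python
from collections import namedtuple
from pathlib import Path

# Hypothetical container mirroring the .train / .val / .test access shown above.
Datasets = namedtuple('Datasets', ['train', 'val', 'test'])


def sort_feature_files(files):
    """Hypothetical sketch: bucket files by 'train', 'val', or 'test' in the name."""
    buckets = {'train': [], 'val': [], 'test': []}
    for f in files:
        path = Path(f)
        for name in buckets:
            if name in path.stem:
                buckets[name].append(path)
                break  # each file goes into the first matching bucket only
    return Datasets(**buckets)


datasets = sort_feature_files(['train1.npy', 'train2.npy', 'val.npy', 'test.npy'])
print([p.name for p in datasets.train])  # ['train1.npy', 'train2.npy']
print([p.name for p in datasets.val])    # ['val.npy']
print([p.name for p in datasets.test])   # ['test.npy']
```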

soundpy.datasets.section_data(dataset_dict, dataset_paths_dict, divide_factor=None)

Expects the keys of dataset_dict and dataset_paths_dict to match.

Examples

>>> import pathlib
>>> from soundpy.datasets import section_data
>>> # train is longer than val and test
>>> d = {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
...      'val': [1, 2, 3, 4, 5],
...      'test': [1, 2, 3, 4, 5]}
>>> # dictionary: paths to where extracted data will be saved
>>> dp = {'train': pathlib.PosixPath('train_data.npy'),
...       'val': pathlib.PosixPath('val_data.npy'),
...       'test': pathlib.PosixPath('test_data.npy')}
>>> d2, dp2 = section_data(d, dp, divide_factor = 3)
>>> # val and test not touched (too small)
>>> d2
{'train__1': [1, 2, 3, 4, 5], 'train__2': [6, 7, 8, 9, 10], 'train__3': [11, 12, 13, 14, 15], 'val': [1, 2, 3, 4, 5], 'test': [1, 2, 3, 4, 5]}
>>> dp2
{'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'val': PosixPath('val_data.npy'), 'test': PosixPath('test_data.npy')}
>>> # repeat: now val and test as long as train
>>> # default divide_factor is 2
>>> d3, dp3 = section_data(d2, dp2)
>>> d3
{'train__1': [1, 2], 'train__2': [3, 4, 5], 'train__3': [6, 7], 'train__4': [8, 9, 10], 'train__5': [11, 12], 'train__6': [13, 14, 15], 'val__1': [1, 2], 'val__2': [3, 4, 5], 'test__1': [1, 2], 'test__2': [3, 4, 5]}
>>> dp3
{'train__1': PosixPath('train_data__1.npy'), 'train__2': PosixPath('train_data__2.npy'), 'train__3': PosixPath('train_data__3.npy'), 'train__4': PosixPath('train_data__4.npy'), 'train__5': PosixPath('train_data__5.npy'), 'train__6': PosixPath('train_data__6.npy'), 'val__1': PosixPath('val_data__1.npy'), 'val__2': PosixPath('val_data__2.npy'), 'test__1': PosixPath('test_data__1.npy'), 'test__2': PosixPath('test_data__2.npy')}
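The sectioning shown in these outputs can be reproduced for a single list and a single path with integer arithmetic and pathlib. The helpers split_evenly and numbered_paths are hypothetical; how soundpy decides which datasets are too small to divide, and how it renumbers paths that were already sectioned, is not covered by this sketch.

```python
from pathlib import Path


def split_evenly(items, n_sections):
    """Hypothetical sketch: cut a list into n nearly equal, contiguous sections."""
    size = len(items)
    # Integer boundaries; when the split is uneven, later sections are longer.
    bounds = [i * size // n_sections for i in range(n_sections + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(n_sections)]


def numbered_paths(path, n_sections):
    """Hypothetical sketch: append '__1', '__2', ... to the stem of a path."""
    path = Path(path)
    return [path.with_name(f'{path.stem}__{i}{path.suffix}')
            for i in range(1, n_sections + 1)]


print(split_evenly(list(range(1, 16)), 3))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
print(split_evenly([1, 2, 3, 4, 5], 2))
# [[1, 2], [3, 4, 5]]
print([p.name for p in numbered_paths('train_data.npy', 3)])
# ['train_data__1.npy', 'train_data__2.npy', 'train_data__3.npy']
```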