datasets.affect package
Submodules
datasets.affect.get_bert_embedding module
Implements BERT embedding extractors.
- datasets.affect.get_bert_embedding.bert_version_data(data, raw_path, keys, max_padding=50, bert_max_len=None)
Get BERT-encoded data.
- Parameters:
data (dict) – Data dictionary
raw_path (str) – Path to raw data
keys (dict) – Keys used by the raw text getter.
max_padding (int, optional) – Maximum padding to add to list. Defaults to 50.
bert_max_len (int, optional) – Maximum length in BERT. Defaults to None.
- Returns:
Dictionary from modality to data.
- Return type:
dict
- datasets.affect.get_bert_embedding.corresponding_other_modality_ids(orig_text, tokenized_text)
Align word ids to other modalities.
Since the tokenizer splits words into pieces (e.g. ‘playing’ -> ‘play’, ‘##ing’, or ‘you’re’ -> ‘you’, ‘’’, ‘re’), we need the corresponding ids so that features from other modalities, which are aligned to whole words, can be matched to the tokenized text.
- Parameters:
orig_text (list) – List of strings corresponding to the original text.
tokenized_text (list) – List of lists of tokens.
- Returns:
List of ids.
- Return type:
list
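The core alignment idea can be sketched as follows. This is a minimal illustration that handles only WordPiece-style ‘##’ continuation markers (the real function also uses `orig_text` to handle punctuation splits such as ‘you’re’); the function name is illustrative, not the package's actual implementation:

```python
def align_wordpiece_ids(tokenized_text):
    """Map each WordPiece token back to the index of its source word.

    Tokens starting with '##' are continuations of the previous word,
    so they reuse that word's index. Illustrative sketch only.
    """
    ids = []
    word_idx = -1
    for token in tokenized_text:
        if not token.startswith("##"):
            word_idx += 1  # a new original word starts here
        ids.append(word_idx)
    return ids
```

With this mapping, a per-word feature vector from another modality can simply be repeated at every token index that points back to the same word.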
- datasets.affect.get_bert_embedding.get_bert_features(all_text, contextual_embedding=False, batch_size=500, max_len=None)
Get bert features from data.
Uses a pipeline to extract all the features as an np.ndarray of shape (num_points, max_seq_length, feature_dim).
- Parameters:
all_text (list) – Data to get BERT features from
contextual_embedding (bool, optional) – If True output the last hidden state of bert. If False, output the embedding of words. Defaults to False.
batch_size (int, optional) – Batch size. Defaults to 500.
max_len (int, optional) – Maximum length of the dataset. Defaults to None.
- Returns:
BERT features of text.
- Return type:
np.array
- datasets.affect.get_bert_embedding.get_rawtext(path, data_kind, vids=None)
Get raw text from the datasets.
- Parameters:
path (str) – Path to data
data_kind (str) – Data Kind. Must be ‘hdf5’.
vids (list, optional) – List of video data as np.array. Defaults to None.
- Returns:
Text data list, video data list
- Return type:
tuple(list, list)
- datasets.affect.get_bert_embedding.max_seq_len(id_list, max_len=50)
Fix dataset to max sequence length.
Truncates each id list to the maximum length; no padding is applied here. Prepends a [CLS] token and appends a [SEP] token.
- Parameters:
id_list (list) – List of ids to manipulate
max_len (int, optional) – Maximum sequence length. Defaults to 50.
- Returns:
List of tokens
- Return type:
list
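The truncation step described above can be sketched as a small helper. The token id values 101 and 102 are BERT's conventional [CLS] and [SEP] ids (an assumption here, not taken from this package), and the function name is illustrative:

```python
def truncate_with_special_tokens(id_list, max_len=50, cls_id=101, sep_id=102):
    """Truncate token ids to at most max_len total, reserving two slots
    for [CLS] at the start and [SEP] at the end. No padding is applied.
    Illustrative sketch; 101/102 are BERT's conventional special ids.
    """
    body = id_list[: max_len - 2]  # leave room for the two special tokens
    return [cls_id] + body + [sep_id]
```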
datasets.affect.get_data module
Implements dataloaders for AFFECT data.
- class datasets.affect.get_data.Affectdataset(*args: Any, **kwargs: Any)
Bases: Dataset
Implements Affect data as a torch dataset.
- __init__(data: Dict, flatten_time_series: bool, aligned: bool = True, task: str | None = None, max_pad=False, max_pad_num=50, data_type='mosi', z_norm=False) None
Instantiate AffectDataset
- Parameters:
data (Dict) – Data dictionary
flatten_time_series (bool) – Whether to flatten time series or not
aligned (bool, optional) – Whether to align data or not across modalities. Defaults to True.
task (str, optional) – What task to load. Defaults to None.
max_pad (bool, optional) – Whether to pad data to max_pad_num or not. Defaults to False.
max_pad_num (int, optional) – Maximum padding number. Defaults to 50.
data_type (str, optional) – What data to load. Defaults to ‘mosi’.
z_norm (bool, optional) – Whether to normalize data along the z-axis. Defaults to False.
- datasets.affect.get_data.drop_entry(dataset)
Drop entries where there’s no text in the data.
- datasets.affect.get_data.get_dataloader(filepath: str, batch_size: int = 32, max_seq_len=50, max_pad=False, train_shuffle: bool = True, num_workers: int = 2, flatten_time_series: bool = False, task=None, robust_test=False, data_type='mosi', raw_path='/home/runner/backup/pack/mosi/mosi.hdf5', z_norm=False) torch.utils.data.DataLoader
Get dataloaders for affect data.
- Parameters:
filepath (str) – Path to datafile
batch_size (int, optional) – Batch size. Defaults to 32.
max_seq_len (int, optional) – Maximum sequence length. Defaults to 50.
max_pad (bool, optional) – Whether to pad data to max length or not. Defaults to False.
train_shuffle (bool, optional) – Whether to shuffle training data or not. Defaults to True.
num_workers (int, optional) – Number of workers. Defaults to 2.
flatten_time_series (bool, optional) – Whether to flatten time series data or not. Defaults to False.
task (str, optional) – Which task to load in. Defaults to None.
robust_test (bool, optional) – Whether to apply robustness to data or not. Defaults to False.
data_type (str, optional) – What data to load in. Defaults to ‘mosi’.
raw_path (str, optional) – Full path to data. Defaults to ‘~/backup/pack/mosi/mosi.hdf5’.
z_norm (bool, optional) – Whether to normalize data along the z dimension or not. Defaults to False.
- Returns:
tuple of train dataloader, validation dataloader, test dataloader
- Return type:
DataLoader
- datasets.affect.get_data.get_rawtext(path, data_kind, vids)
Get raw text, video data from hdf5 file.
- datasets.affect.get_data.z_norm(dataset, max_seq_len=50)
Normalize data in the dataset.
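A per-feature z-normalization over the time axis can be sketched like this. This is one plausible reading of the docstring; the package's `z_norm` may differ in detail (e.g. normalizing per dataset rather than per sequence), and the `eps` guard against zero variance is an assumption:

```python
import numpy as np

def z_normalize(seq, eps=1e-8):
    """Z-normalize a (seq_len, feature_dim) array per feature:
    subtract the mean and divide by the std over the time axis.
    Illustrative sketch; eps avoids division by zero.
    """
    mean = seq.mean(axis=0, keepdims=True)
    std = seq.std(axis=0, keepdims=True)
    return (seq - mean) / (std + eps)
```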
datasets.affect.get_raw_data module
Handles getting raw data from MOSI.
- datasets.affect.get_raw_data.detect_entry_fold(entry, folds)
Detect entry fold.
- Parameters:
entry (str) – Entry string
folds (int) – Number of folds
- Returns:
Entry fold index
- Return type:
int
- datasets.affect.get_raw_data.get_audio_visual_text(csds, seq_len, text_data, vids)
Get audio and visual data corresponding to the text.
- datasets.affect.get_raw_data.get_rawtext(path, data_kind, vids)
Get raw text modality.
- Parameters:
path (str) – Path to h5 file
data_kind (str) – String for data format. Should be ‘hdf5’.
vids (list) – List of video ids.
- Returns:
Tuple of text_data and video_data in lists.
- Return type:
tuple(list,list)
- datasets.affect.get_raw_data.get_word2id(text_data, vids)
Build the word2id list from text_data and vids.
- Parameters:
text_data (list) – List of text data
vids (list) – List of video data
- Returns:
List of word2id data
- Return type:
list
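The word2id construction can be sketched with a `defaultdict` that assigns each new word the next free integer id. This is a simplified illustration assuming whitespace-tokenized sentences; the package's version also tracks per-video sequences:

```python
from collections import defaultdict

def build_word2id(text_data):
    """Assign each distinct word an integer id in order of first
    appearance, and map every sentence to its id sequence.
    Illustrative sketch, not the package's actual implementation.
    """
    word2id = defaultdict(lambda: len(word2id))  # new word -> next free id
    sequences = [[word2id[w] for w in sentence.split()] for sentence in text_data]
    return dict(word2id), sequences
```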
- datasets.affect.get_raw_data.get_word_embeddings(word2id, save=False)
Given a word2id, get the associated GloVe embeddings (300-dimensional).
- Parameters:
word2id (list) – list of word, index pairs
save (bool, optional) – Whether to save data to the folder (unused). Defaults to False.
- Returns:
List of embedded words
- Return type:
list[np.array]
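The embedding lookup amounts to indexing a pretrained table by word id, with a zero vector for out-of-vocabulary words. In this sketch `glove` is a hypothetical dict mapping word -> vector (the real code loads actual GloVe files), and the zero fallback for unknown words is an assumption:

```python
import numpy as np

def lookup_embeddings(word2id, glove, dim=300):
    """Build an (vocab_size, dim) embedding matrix from a word2id
    mapping, using zeros for words missing from `glove`.
    Illustrative sketch; `glove` is a hypothetical word -> vector dict.
    """
    emb = np.zeros((len(word2id), dim))
    for word, idx in word2id.items():
        if word in glove:
            emb[idx] = glove[word]
    return emb
```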
- datasets.affect.get_raw_data.glove_embeddings(text_data, vids, paddings=50)
Get glove embeddings of text, video pairs.
- Parameters:
text_data (list) – list of text data.
vids (list) – list of video data
paddings (int, optional) – Length to left-pad sequences to when they are shorter. Defaults to 50.
- Returns:
Array of embedded data
- Return type:
np.array
- datasets.affect.get_raw_data.lpad(this_array, seq_len)
Left-pad an array with seq_len zeros.
- Parameters:
this_array (np.array) – Array to pad
seq_len (int) – Number of 0s to pad.
- Returns:
Padded array
- Return type:
np.array
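One plausible reading of `lpad` is to prepend zero rows until the array reaches the target length. The function below is an illustrative sketch under that assumption; arrays already at or beyond `seq_len` rows are returned unchanged:

```python
import numpy as np

def left_pad(this_array, seq_len):
    """Prepend rows of zeros so the array reaches seq_len rows.
    Illustrative sketch of the left-padding idea; the package's
    lpad may handle shapes or truncation differently.
    """
    pad_rows = max(seq_len - this_array.shape[0], 0)
    if pad_rows == 0:
        return this_array
    pad = np.zeros((pad_rows,) + this_array.shape[1:], dtype=this_array.dtype)
    return np.concatenate([pad, this_array], axis=0)
```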