datasets.affect package
Submodules
datasets.affect.get_bert_embedding module
Implements BERT embedding extractors.
- datasets.affect.get_bert_embedding.bert_version_data(data, raw_path, keys, max_padding=50, bert_max_len=None)
Get BERT-encoded data.
- Parameters:
data (dict) – Data dictionary
raw_path (str) – Path to raw data
keys (dict) – Keys used by the raw text getter.
max_padding (int, optional) – Maximum padding to add to list. Defaults to 50.
bert_max_len (int, optional) – Maximum length in BERT. Defaults to None.
- Returns:
Dictionary from modality to data.
- Return type:
dict
- datasets.affect.get_bert_embedding.corresponding_other_modality_ids(orig_text, tokenized_text)
Align word ids to other modalities.
Since the tokenizer splits words into pieces (e.g. ‘playing’ -> ‘play’, ‘##ing’, or ‘you’re’ -> ‘you’, ‘’’, ‘re’), we need the corresponding ids so that features from other modalities, which are aligned to whole words, can be matched to the tokenized text.
- Parameters:
orig_text (list) – List of strings corresponding to the original text.
tokenized_text (list) – List of lists of tokens.
- Returns:
List of ids.
- Return type:
list
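The core alignment idea can be sketched as follows. This is a minimal illustration that handles only WordPiece-style ‘##’ continuation markers (the real function also uses `orig_text` to handle punctuation splits such as ‘you’re’); the function name is illustrative, not the package's actual implementation:

```python
def align_wordpiece_ids(tokenized_text):
    """Map each WordPiece token back to the index of its source word.

    Tokens starting with '##' are continuations of the previous word,
    so they reuse that word's index. Illustrative sketch only.
    """
    ids = []
    word_idx = -1
    for token in tokenized_text:
        if not token.startswith("##"):
            word_idx += 1  # a new original word starts here
        ids.append(word_idx)
    return ids
```

With this mapping, a per-word feature vector from another modality can simply be repeated at every token index that points back to the same word.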
- datasets.affect.get_bert_embedding.get_bert_features(all_text, contextual_embedding=False, batch_size=500, max_len=None)
Get bert features from data.
Uses a pipeline to extract all the features as an np.ndarray of shape (num_points, max_seq_length, feature_dim).
- Parameters:
all_text (list) – Data to get BERT features from
contextual_embedding (bool, optional) – If True output the last hidden state of bert. If False, output the embedding of words. Defaults to False.
batch_size (int, optional) – Batch size. Defaults to 500.
max_len (int, optional) – Maximum length of the dataset. Defaults to None.
- Returns:
BERT features of text.
- Return type:
np.array
- datasets.affect.get_bert_embedding.get_rawtext(path, data_kind, vids=None)
Get raw text from the datasets.
- Parameters:
path (str) – Path to data
data_kind (str) – Data Kind. Must be ‘hdf5’.
vids (list, optional) – List of video data as np.array. Defaults to None.
- Returns:
Text data list, video data list
- Return type:
tuple(list, list)
- datasets.affect.get_bert_embedding.max_seq_len(id_list, max_len=50)
Fix dataset to max sequence length.
Truncates each id list to the maximum length; no padding is applied here. Prepends a [CLS] token and appends a [SEP] token.
- Parameters:
id_list (list) – List of ids to manipulate
max_len (int, optional) – Maximum sequence length. Defaults to 50.
- Returns:
List of tokens
- Return type:
list
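The truncation step described above can be sketched as a small helper. The token id values 101 and 102 are BERT's conventional [CLS] and [SEP] ids (an assumption here, not taken from this package), and the function name is illustrative:

```python
def truncate_with_special_tokens(id_list, max_len=50, cls_id=101, sep_id=102):
    """Truncate token ids to at most max_len total, reserving two slots
    for [CLS] at the start and [SEP] at the end. No padding is applied.
    Illustrative sketch; 101/102 are BERT's conventional special ids.
    """
    body = id_list[: max_len - 2]  # leave room for the two special tokens
    return [cls_id] + body + [sep_id]
```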
datasets.affect.get_data module
Implements dataloaders for AFFECT data.
- class datasets.affect.get_data.Affectdataset(*args: Any, **kwargs: Any)
Bases: Dataset
Implements Affect data as a torch dataset.
- __init__(data: Dict, flatten_time_series: bool, aligned: bool = True, task: str | None = None, max_pad=False, max_pad_num=50, data_type='mosi', z_norm=False) None
Instantiate AffectDataset
- Parameters:
data (Dict) – Data dictionary
flatten_time_series (bool) – Whether to flatten time series or not
aligned (bool, optional) – Whether to align data or not across modalities. Defaults to True.
task (str, optional) – What task to load. Defaults to None.
max_pad (bool, optional) – Whether to pad data to max_pad_num or not. Defaults to False.
max_pad_num (int, optional) – Maximum padding number. Defaults to 50.
data_type (str, optional) – What data to load. Defaults to ‘mosi’.
z_norm (bool, optional) – Whether to normalize data along the z-axis. Defaults to False.
- datasets.affect.get_data.drop_entry(dataset)
Drop entries where there’s no text in the data.
- datasets.affect.get_data.get_dataloader(filepath: str, batch_size: int = 32, max_seq_len=50, max_pad=False, train_shuffle: bool = True, num_workers: int = 2, flatten_time_series: bool = False, task=None, robust_test=False, data_type='mosi', raw_path='/home/runner/backup/pack/mosi/mosi.hdf5', z_norm=False) torch.utils.data.DataLoader
Get dataloaders for affect data.
- Parameters:
filepath (str) – Path to datafile
batch_size (int, optional) – Batch size. Defaults to 32.
max_seq_len (int, optional) – Maximum sequence length. Defaults to 50.
max_pad (bool, optional) – Whether to pad data to max length or not. Defaults to False.
train_shuffle (bool, optional) – Whether to shuffle training data or not. Defaults to True.
num_workers (int, optional) – Number of workers. Defaults to 2.
flatten_time_series (bool, optional) – Whether to flatten time series data or not. Defaults to False.
task (str, optional) – Which task to load in. Defaults to None.
robust_test (bool, optional) – Whether to apply robustness to data or not. Defaults to False.
data_type (str, optional) – What data to load in. Defaults to ‘mosi’.
raw_path (str, optional) – Full path to data. Defaults to ‘~/backup/pack/mosi/mosi.hdf5’.
z_norm (bool, optional) – Whether to normalize data along the z dimension or not. Defaults to False.
- Returns:
tuple of train dataloader, validation dataloader, test dataloader
- Return type:
DataLoader
- datasets.affect.get_data.get_rawtext(path, data_kind, vids)
Get raw text, video data from hdf5 file.
- datasets.affect.get_data.z_norm(dataset, max_seq_len=50)
Normalize data in the dataset.
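A per-feature z-normalization over the time axis can be sketched like this. This is one plausible reading of the docstring; the package's `z_norm` may differ in detail (e.g. normalizing per dataset rather than per sequence), and the `eps` guard against zero variance is an assumption:

```python
import numpy as np

def z_normalize(seq, eps=1e-8):
    """Z-normalize a (seq_len, feature_dim) array per feature:
    subtract the mean and divide by the std over the time axis.
    Illustrative sketch; eps avoids division by zero.
    """
    mean = seq.mean(axis=0, keepdims=True)
    std = seq.std(axis=0, keepdims=True)
    return (seq - mean) / (std + eps)
```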
datasets.affect.get_raw_data module
Handles getting raw data from MOSI.
- datasets.affect.get_raw_data.detect_entry_fold(entry, folds)
Detect entry fold.
- Parameters:
entry (str) – Entry string
folds (int) – Number of folds
- Returns:
Entry fold index
- Return type:
int
- datasets.affect.get_raw_data.get_audio_visual_text(csds, seq_len, text_data, vids)
Get audio and visual data corresponding to the text.
- datasets.affect.get_raw_data.get_rawtext(path, data_kind, vids)
Get raw text modality.
- Parameters:
path (str) – Path to h5 file
data_kind (str) – String for data format. Should be ‘hdf5’.
vids (list) – List of video ids.
- Returns:
Tuple of text_data and video_data in lists.
- Return type:
tuple(list,list)
- datasets.affect.get_raw_data.get_word2id(text_data, vids)
Build the word2id list from text_data and vids.
- Parameters:
text_data (list) – List of text data
vids (list) – List of video data
- Returns:
List of word2id data
- Return type:
list
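The word2id construction can be sketched with a `defaultdict` that assigns each new word the next free integer id. This is a simplified illustration assuming whitespace-tokenized sentences; the package's version also tracks per-video sequences:

```python
from collections import defaultdict

def build_word2id(text_data):
    """Assign each distinct word an integer id in order of first
    appearance, and map every sentence to its id sequence.
    Illustrative sketch, not the package's actual implementation.
    """
    word2id = defaultdict(lambda: len(word2id))  # new word -> next free id
    sequences = [[word2id[w] for w in sentence.split()] for sentence in text_data]
    return dict(word2id), sequences
```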
- datasets.affect.get_raw_data.get_word_embeddings(word2id, save=False)
Given a word2id, get the associated GloVe embeddings (300-dimensional).
- Parameters:
word2id (list) – list of word, index pairs
save (bool, optional) – Whether to save data to the folder (unused). Defaults to False.
- Returns:
List of embedded words
- Return type:
list[np.array]
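The embedding lookup amounts to indexing a pretrained table by word id, with a zero vector for out-of-vocabulary words. In this sketch `glove` is a hypothetical dict mapping word -> vector (the real code loads actual GloVe files), and the zero fallback for unknown words is an assumption:

```python
import numpy as np

def lookup_embeddings(word2id, glove, dim=300):
    """Build an (vocab_size, dim) embedding matrix from a word2id
    mapping, using zeros for words missing from `glove`.
    Illustrative sketch; `glove` is a hypothetical word -> vector dict.
    """
    emb = np.zeros((len(word2id), dim))
    for word, idx in word2id.items():
        if word in glove:
            emb[idx] = glove[word]
    return emb
```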
- datasets.affect.get_raw_data.glove_embeddings(text_data, vids, paddings=50)
Get glove embeddings of text, video pairs.
- Parameters:
text_data (list) – list of text data.
vids (list) – list of video data
paddings (int, optional) – Length to left-pad sequences to when they are shorter. Defaults to 50.
- Returns:
Array of embedded data
- Return type:
np.array
- datasets.affect.get_raw_data.lpad(this_array, seq_len)
Left-pad an array with seq_len zeros.
- Parameters:
this_array (np.array) – Array to pad
seq_len (int) – Number of 0s to pad.
- Returns:
Padded array
- Return type:
np.array
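One plausible reading of `lpad` is to prepend zero rows until the array reaches the target length. The function below is an illustrative sketch under that assumption; arrays already at or beyond `seq_len` rows are returned unchanged:

```python
import numpy as np

def left_pad(this_array, seq_len):
    """Prepend rows of zeros so the array reaches seq_len rows.
    Illustrative sketch of the left-padding idea; the package's
    lpad may handle shapes or truncation differently.
    """
    pad_rows = max(seq_len - this_array.shape[0], 0)
    if pad_rows == 0:
        return this_array
    pad = np.zeros((pad_rows,) + this_array.shape[1:], dtype=this_array.dtype)
    return np.concatenate([pad, this_array], axis=0)
```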