Downloading Datasets

We support a variety of datasets in MultiBench out of the box, but downloading each of them requires separate steps to do so:

Affective Computing

MUStARD

To grab the MUStARD/sarcasm dataset, download the dataset from here.

To load it, you can then use the following snippet:

from datasets.affect.get_data import get_dataloader
traindata, validdata, test_robust = get_dataloader('/path/to/raw/data/file', data_type='sarcasm')

CMU-MOSI

To grab the CMU-MOSI dataset, download the dataset from here.

To load it, you can then use the following snippet:

from datasets.affect.get_data import get_dataloader
traindata, validdata, test_robust = get_dataloader('/path/to/raw/data/file', data_type='mosi')

UR-Funny

To grab the UR-Funny/humor dataset, download the dataset from here.

To load it, you can then use the following snippet:

from datasets.affect.get_data import get_dataloader
traindata, validdata, test_robust = get_dataloader('/path/to/raw/data/file', data_type='humor')

CMU-MOSEI

To grab the CMU-MOSEI dataset, download the dataset from here.

To load it, you can then use the following snippet:

from datasets.affect.get_data import get_dataloader
traindata, validdata, test_robust = get_dataloader('/path/to/raw/data/file', data_type='mosei')

Healthcare

MIMIC

Access to the MIMIC dataset is restricted, and as a result you will need to follow the instructions here to gain access.

Once you do so, email ylyu1@andrew.cmu.edu with proof of those credentials to gain access to the preprocessed im.pk file we use in our dataloaders.

To load it, you can then use the following snippet:

from datasets.mimic.get_data import get_dataloader
traindata, validdata, test_robust = get_dataloader(tasknum, inputed_path='/path/to/raw/data/file')

Robotics

MuJoCo Push

To grab the MuJoCo Push dataset, you simply need to run any of the example experiments, like the following:

python examples/gentle_push/LF.py

This will download the dataset into the datasets/gentle_push/cache folder directly.

As this uses gdown to download the files directly, sometimes this process fails.

Should that happen, you can simply download the following files:

Download this file and place it under datasets/gentle_push/cache/1qmBCfsAGu8eew-CQFmV1svodl9VJa6fX-gentle_push_10.hdf5.
Download this file and place it under datasets/gentle_push/cache/18dr1z0N__yFiP_DAKxy-Hs9Vy_AsaW6Q-gentle_push_300.hdf5.
Download this file and place it under datasets/gentle_push/cache/1JTgmq1KPRK9HYi8BgvljKg5MPqT_N4cR-gentle_push_1000.hdf5.

Then, you can follow this code block as an example to get the dataloaders:

from datasets.gentle_push.data_loader import PushTask
Task = PushTask
modalities = ['control']

# Parse args
parser = argparse.ArgumentParser()
Task.add_dataset_arguments(parser)
args = parser.parse_args()
dataset_args = Task.get_dataset_args(args)

fannypack.data.set_cache_path('datasets/gentle_push/cache')

train_loader, val_loader, test_loader = Task.get_dataloader(
    16, modalities, batch_size=32, drop_last=True)

Vision&Touch

To grab the Vision&Touch dataset, please run the download_data.sh file located under dataset/robotics/download_data.sh.

This dataset, by default, only has the training and validation dataset, which you can access through the following call:

from datasets.robotics.data_loader import get_data
trainloader, valloader = get_data(device, config, "path/to/data/folder")

By default, the task is the Contact task, but passing in output='ee_yaw_next' into get_data will allow you to access the End Effector task

Finance

All of the dataloaders, when created, will automatically download the stock data for you.

As an example, this can be done through the following code block:

from datasets.stocks.get_data import get_dataloader
# Here, the list of stocks is a list of strings of stock symbols in all CAPS.
train_loader, val_loader, test_loader = get_dataloader(stocks, stocks, [args.target_stock])

For the purposes of the MultiBench paper, we used the following lists per dataset:

F&B (18): CAG CMG CPB DPZ DRI GIS HRL HSY K KHC LW MCD MDLZ MKC SBUX SJM TSN YUM
Health (63): ABT ABBV ABMD A ALXN ALGN ABC AMGN ANTM BAX BDX BIO BIIB BSX BMY CAH CTLT CNC CERN CI COO CVS DHR DVA XRAY DXCM EW GILD HCA HSIC HOLX HUM IDXX ILMN INCY ISRG IQV JNJ LH LLY MCK MDT MRK MTD PKI PRGO PFE DGX REGN RMD STE SYK TFX TMO UNH UHS VAR VRTX VTRS WAT WST ZBH ZTS
Tech (100): AAPL ACN ADBE ADI ADP ADSK AKAM AMAT AMD ANET ANSS APH ATVI AVGO BR CDNS CDW CHTR CMCSA CRM CSCO CTSH CTXS DIS DISCA DISCK DISH DXC EA ENPH FB FFIV FIS FISV FLIR FLT FOX FOXA FTNT GLW GOOG GOOGL GPN HPE HPQ IBM INTC INTU IPG IPGP IT JKHY JNPR KEYS KLAC LRCX LUMN LYV MA MCHP MPWR MSFT MSI MU MXIM NFLX NLOK NOW NTAP NVDA NWS NWSA NXPI OMC ORCL PAYC PAYX PYPL QCOM QRVO SNPS STX SWKS T TEL TER TMUS TRMB TTWO TWTR TXN TYL V VIAC VRSN VZ WDC WU XLNX ZBRA

HCI

ENRiCO

To grab the ENRiCO dataset, please use the download_data.sh shell script under the datasets/enrico dataset.

Then, to load the data in, you can use something like the following code-block:

As an example, this can be done through the following code block:

from datasets.enrico.get_data import get_dataloader

(train_loader, val_loader, test_loader), weights = get_dataloader("datasets/enrico/dataset")

Multimedia

AV-MNIST

To grab the AV-MNIST dataset, please download the avmnist.tar.gz file from here .

Once that is done, untar it your preferred location and do the following to get the dataloaders:

from datasets.avmnist.get_data import get_dataloader

train_loader, val_loader, test_loader = get_dataloader("path/to/dataset")

MM-IMDb

To grab the MM-IMDb dataset, please download the multimodal_imdb.hdf5 file from here .

If you plan on testing the model’s robustness, you will also need to download the raw file from here.

Once that is done, untar it your preferred location and do something the following to get the dataloaders:

from datasets.imdb.get_data import get_dataloader

train_loader, val_loader, test_loader = get_dataloader("path/to/processed_data","path/to/raw/data/folder",vgg=True, batch_size=128)

Kinetics400

To download either of the Kinetics datasets, run the appropriate script under special/kinetics_*.py.

Then pass the location of the data to the associated file to finish it.

Clotho

To download the Clotho dataset, clone the repository somewhere on your device and follow the given instructions to pre-process the data.

To get the dataloaders, you will also need to add the path to the above repo to the get_dataloaders function under datasets/clotho/get_data.py.