How to speed up reading a dataset

I am new to using USC HPC. I am currently running ML training on HPC, and the issue I have is that reading the dataset (stored in /scratch) is significantly slower than on our local lab server. On the local server, reading data is very fast compared to the computation time. On HPC, however, one batch of computation takes around 0.6 s, while reading the data for a batch takes 3 s on average, and the timing is also very unstable. I understand the dataset is stored on a different node (/scratch), so it may be slower than local storage, but should it be this slow? Has anyone had this issue?

Hi @yueniu, it would help if you could provide more details about your data, in particular which programming language you are using, the format the data is stored in, and how big the dataset is. If you are using R, data.table is one of the fastest ways to read data in. In Python, the datatable package (pydatatable) would be a similarly good choice.
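For example, a minimal sketch of the Python datatable approach (the file name `data.csv` here is just a placeholder for your own data file):

```python
# Minimal sketch: multi-threaded file reading with the datatable package,
# analogous to R's data.table::fread. "data.csv" is a placeholder path.
import datatable as dt

frame = dt.fread("data.csv")
print(frame.shape)

# Convert to pandas only if the rest of your pipeline expects it;
# the conversion itself costs time and memory.
df = frame.to_pandas()
```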


The dataset I am using is ImageNet, which is around 150GB in total. All data are stored as images, namely .JPEG files. When accessing them, I use a Python package to decode them. I guess JPEG is not a good option in terms of speed. In this case, would HDF5 be better?
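Something like the following is the packing approach I have in mind. This is only a rough sketch; the paths and dataset name are placeholders, and it stores the raw JPEG bytes as variable-length arrays so they can still be decoded on read:

```python
# Rough sketch: pack raw JPEG bytes into one HDF5 file so that reading a
# batch becomes a few large sequential reads instead of many small file
# opens. Paths and dataset names below are placeholders.
import glob
import h5py
import numpy as np

paths = sorted(glob.glob("/scratch/imagenet/train/*/*.JPEG"))

with h5py.File("/scratch/imagenet_train.h5", "w") as f:
    vlen_u8 = h5py.vlen_dtype(np.dtype("uint8"))  # variable-length byte arrays
    dset = f.create_dataset("jpeg_bytes", (len(paths),), dtype=vlen_u8)
    for i, p in enumerate(paths):
        with open(p, "rb") as img:
            dset[i] = np.frombuffer(img.read(), dtype="uint8")
```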

Which Python library are you using? This GitHub repo seems relevant to your question: https://github.com/ternaus/imread_benchmark
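Before digging into that benchmark, a quick sanity check on your own node can tell you whether the filesystem or the JPEG decoding is the bottleneck. A rough timing sketch (assuming PIL and OpenCV are installed; `SAMPLE_DIR` is a placeholder for a directory of your JPEGs):

```python
# Rough sketch: compare raw file reads against PIL and OpenCV decoding
# on a handful of files. SAMPLE_DIR is a placeholder path.
import glob
import time

import cv2
import numpy as np
from PIL import Image

SAMPLE_DIR = "/scratch/imagenet/train/n01440764"
paths = sorted(glob.glob(SAMPLE_DIR + "/*.JPEG"))[:100]

def timeit(label, fn):
    start = time.perf_counter()
    for p in paths:
        fn(p)
    per_image_ms = (time.perf_counter() - start) / len(paths) * 1000
    print(f"{label}: {per_image_ms:.1f} ms/image")

timeit("raw read", lambda p: open(p, "rb").read())
timeit("PIL     ", lambda p: np.asarray(Image.open(p).convert("RGB")))
timeit("OpenCV  ", lambda p: cv2.imread(p, cv2.IMREAD_COLOR))
```

If the raw read alone already takes seconds, the problem is /scratch I/O (many small files) rather than the decoder, and repacking into a container format like HDF5 is more likely to help than switching libraries.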