OSError: [Errno 70] Communication error on send

Some of my jobs cannot read the /project file system. The problem does not happen at the beginning of training; instead, it happens at random points during training. Here is the error message.

This problem has happened many times since the last maintenance. I submitted a ticket two weeks ago. Although the ticket is closed, I am still facing the same problem.

Most of my jobs need to run for over 20 hours. If they fail randomly, I will have to wake up at midnight to resume the job, which is quite painful.

We have responded to your ticket through the ticket system. After the ticket was resolved, we found some cable issues that unfortunately still need to be fixed. For now, please use the home1 or scratch file systems to avoid such errors.
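If some reads from /project cannot be avoided in the meantime, a job can also retry a failed read instead of crashing. The following is only a minimal sketch, not an official fix: `read_with_retry` and its parameters (`attempts`, `base_delay`, `sleep`) are hypothetical names, and it assumes the failure surfaces in Python as an `OSError` with errno 70 (`ECOMM` on Linux), as in the reported message. It does not address the underlying cable issue.

```python
import errno
import time

# Errno 70 on Linux is ECOMM ("Communication error on send"). Fall back to
# the literal value on platforms where the errno module does not define it.
ECOMM = getattr(errno, "ECOMM", 70)


def read_with_retry(path, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Read a file as bytes, retrying only when the OS reports errno 70.

    Hypothetical helper: the name and parameters are illustrative.
    """
    for attempt in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as exc:
            # Re-raise anything that is not the communication error, and
            # give up once the last attempt has failed.
            if exc.errno != ECOMM or attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

A data loader could call `read_with_retry(path)` wherever it currently opens a file directly. Saving checkpoints regularly is still advisable, since a long outage will exhaust the retries.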