It seems like the S3 read sometimes fails (#22214). I confirmed the file actually exists in S3, so the failure is most likely a transient error. This PR adds a retry mechanism to work around the issue.
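A minimal sketch of the kind of retry this adds, assuming a boto3-style S3 read; the helper name, exception handling, and backoff parameters below are illustrative, not the exact code in the PR.

```
import time

import botocore.exceptions


def read_s3_object_with_retry(s3_client, bucket, key, max_attempts=3, backoff_s=1.0):
    """Retry a transient S3 read a few times before giving up.

    `s3_client` is assumed to be a boto3 S3 client; the retry parameters
    are illustrative defaults, not the values used in the PR.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()
        except botocore.exceptions.ClientError:
            if attempt == max_attempts:
                raise
            # Likely a transient failure: back off briefly and retry.
            time.sleep(backoff_s * attempt)
```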
Fix the Dask-on-Ray large scale test on K8s. Basically, chmod requires root access, which we don't have by default in the K8s cluster. We don't seem to need chmod anyway (I verified the test passes without it).
The first migration of a test into K8s. We are taking a conservative approach (migrate slowly while keeping the existing test suites). Once things are confirmed to be stable, we will speed up the migration.
* use nightly
* switch ml cpu to ray cpu
* fix
* add pytest
* add more pytest
* add constraint
* add tensorflow
* fix merge conflict
* add tblib
* fix
* add back uninstall
## Why are these changes needed?
In the nightly test we see
```
Command returned non-success status: 1; Command logs:
Traceback (most recent call last):
  File "dask_on_ray/large_scale_test.py", line 17, in <module>
    from ray._private.test_utils import monitor_memory_usage
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/test_utils.py", line 18, in <module>
    import pytest
ModuleNotFoundError: No module named 'pytest'
```
This PR fixes this error.
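For context, `ray._private.test_utils` imports pytest at module load time, so any driver script that imports it needs pytest available in the cluster image; the commit log above suggests the fix here is to install pytest into the test environment. Below is a hedged sketch of a fail-fast guard a driver script could use instead, purely for illustration and not the fix in this PR.

```
# Illustrative guard only; the actual fix installs pytest in the
# environment the nightly test runs in.
import importlib.util
import sys

if importlib.util.find_spec("pytest") is None:
    sys.exit(
        "pytest is missing from this environment, but "
        "ray._private.test_utils imports it at module load time."
    )

from ray._private.test_utils import monitor_memory_usage  # noqa: E402
```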
## Related issue number
When testing, we should minimize unnecessary env vars (it's better to work with the default config). This PR removes unnecessary env vars that were being set.
* in progress
* in progress
* almost done
* Lint
* almost done
* All tests are available now
* Make the test a little more stressful
* Modify parameter to make tests a little more stressful