* Convert UniqueID::nil() to a constructor
* Cleanup actor handle pickling code
* Add new actor handles to the task spec
* Pass in new actor handles
* Add new handles to the actor registration
* Regression test for actor handle forking and GC
* lint and doc
* Handle pickled actor handles in the backend and some refactoring
* Add regression test for dummy object GC and pickled actor handles
* Check for duplicate actor tasks on submission
* Regression test for forking twice, fix failed named actor leak
* Fix bug for forking twice
* lint
* Revert "Fix bug for forking twice"
This reverts commit 3da85e59d401e53606c2e37ffbebcc8653ff27ac.
* Add new actor handles when task is assigned, not finished
* Remove comment
* remove UniqueID()
* Updates
* update
* fix
* fix java
* fixes
* fix
1. Fix the problem of duplicated stored logs.
2. Save log whose level is higher than severity_threshold, not only with severity_threshold.
3. Fix a `log_dir` bug: storing logs in a wrong path.
* Separate out functionality for querying client table and improve cluster.wait_for_nodes() API.
* Linting
* Add back logging statements.
* info -> debug
## What do these changes do?
This option goes along with `min_workers`, and `max_workers`. When the
cluster is first brought up (or when it is refreshed with a subsequent
`ray up`) this number of nodes will be started.
It's a workaround for issues of scaling (see related issues) where it
can take a long time (or forever in the case where the head node has
`--num-cpus 0`) to scale up a cluster in response to increasing demand.
## Related issue number
Workaround for https://github.com/ray-project/ray/issues/3339 and https://github.com/ray-project/ray/issues/2106
* Push a warning to all users when large number of workers have been started.
* Add test.
* Fix bug.
* Give warning when worker starts instead of when worker registers.
* Fix
* Fix tests
* Fix warning text in pbt logger
* Allow nested mutations in pbt by recursing explore function
* Add test for nested pbt mutation
* Update pbt explore to only call custom explore on top level
* fix test
* Limit Redis max memory to 10GB/shard by default.
* Update stress tests.
* Reorganize
* Update
* Add minimum cap size for object store and redis.
* Small test update.
This change ensures that Ray set up fault handlers only if it has not been enabled by other applications. Otherwise some applications could face strange issues when using Ray, and some unittests using xml runners will fail.
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.
Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
## What do these changes do?
1. Fix the Jenkins test failure by add driver id to Actor GCS Key.
2. Move `object_manager_test.py` from Jenkins to Travis.
* Allowing multiple users to access the /tmp/ray file at the same time
Previous sequence that caused this issue:
* User A starts ray with `ray.init` when /tmp/ray does not exist
* User B starts ray with `ray.init` and /tmp/ray now exists
User B will get a permissions error
Checking the permissions, /tmp/ray is 700
I have identified a race condition in `try_to_create_directory`
* Multiple processes try to create /tmp/ray at the same time
* chmod is either silently erroring or working properly within the race condition
Resolution: Move chmod outside of the check for whether the directory exists or not.
* Adding try except for users who do not own the directory
## What do these changes do?
Previously we logged a warning if the PPO configuration would waste many samples. However, this didn't apply in the case of long episodes in `complete_episodes` batch mode, and also the amount of waste is up to 2x in common cases.
This pr:
- Estimates the number of sampling tasks needed to avoid over-sampling.
- Collects all sample results and never discards any. In principle this can degrade performance at large scale if certain machines are slower. Add a config flag to enable this legacy behavior.
## Related issue number
Closes: https://github.com/ray-project/ray/issues/3549
This PR provides a better error message when the generate_variants code
breaks. Also removes a comment about nesting dependencies.
This comes mainly as a hotfix solution for #3466. We should leave that issue open for future contribution 🙂