Rename `xray_test.py` to `mini_test.py` and use that in the documentation. Right now we suggest that people run `runtest.py`, but that often doesn't succeed and takes too long.
* Implement Node class and move most of services.py into it.
* Wait for nodes as they are added to the cluster.
* Fix Redis authentication bug.
* Fix bug in client table ordering.
* Address comments.
* Kill raylet before plasma store in test.
* Minor
* Convert UniqueID::nil() to a constructor
* Cleanup actor handle pickling code
* Add new actor handles to the task spec
* Pass in new actor handles
* Add new handles to the actor registration
* Regression test for actor handle forking and GC
* lint and doc
* Handle pickled actor handles in the backend and some refactoring
* Add regression test for dummy object GC and pickled actor handles
* Check for duplicate actor tasks on submission
* Regression test for forking twice, fix failed named actor leak
* Fix bug for forking twice
* lint
* Revert "Fix bug for forking twice"
This reverts commit 3da85e59d401e53606c2e37ffbebcc8653ff27ac.
* Add new actor handles when task is assigned, not finished
* Remove comment
* remove UniqueID()
* Updates
* update
* fix
* fix java
* fixes
* fix
* Separate out functionality for querying client table and improve cluster.wait_for_nodes() API.
* Linting
* Add back logging statements.
* info -> debug
## What do these changes do?
This option goes along with `min_workers` and `max_workers`. When the
cluster is first brought up (or when it is refreshed with a subsequent
`ray up`) this number of nodes will be started.
It's a workaround for issues of scaling (see related issues) where it
can take a long time (or forever in the case where the head node has
`--num-cpus 0`) to scale up a cluster in response to increasing demand.
## Related issue number
Workaround for https://github.com/ray-project/ray/issues/3339 and https://github.com/ray-project/ray/issues/2106
* Push a warning to all users when large number of workers have been started.
* Add test.
* Fix bug.
* Give warning when worker starts instead of when worker registers.
* Fix
* Fix tests
* Limit Redis max memory to 10GB/shard by default.
* Update stress tests.
* Reorganize
* Update
* Add minimum cap size for object store and redis.
* Small test update.
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.
Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
## What do these changes do?
1. Fix the Jenkins test failure by adding the driver ID to the actor GCS key.
2. Move `object_manager_test.py` from Jenkins to Travis.
* mb impala
* fix
* paropt
* update
* cpu warn
* on cpu
* fix mb
* doc
* docs
* comment
* larger num
* early release
* remove grad clip
* only check loader count in multi gpu mode
* revert bad multigpu changes
* num sgd iter
* comment
* reuse optimizer
* add test
* par load test
* loosen test
* Update run_multi_node_tests.sh
* fix local mode
* Update agent.py
* Add a flag for whether an object has been created before
* Add regression test
* doc
* Share object directory between object and node managers
* Treat evicted actor tasks as failed
* minor
* Check return value
* Fix bug where object locations weren't getting updated on client death
* Fix mac build
* Use RayTaskError
`test_free_objects_multi_node` fails intermittently. If we run this test 20 times, we may find at least one failure.
The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks, so the flush operations may be spread across several workers and fail to clean up all of the objects held by each worker's plasma client.
In this PR, I change the function tasks to actor tasks, which guarantees that all the tasks are executed in one worker of a raylet.
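To illustrate the failure mode described above, here is a minimal plain-Python sketch that does not use Ray. The `Worker` and `Raylet` classes and the round-robin dispatch are hypothetical stand-ins chosen to make the behavior deterministic; real raylet scheduling is not round-robin and the actual test differs.

```python
class Worker:
    """Hypothetical stand-in for a worker process and the objects its plasma client holds."""

    def __init__(self):
        self.held_objects = set()

    def create(self, object_id):
        self.held_objects.add(object_id)

    def flush(self):
        # Releases only the objects held by *this* worker.
        self.held_objects.clear()


class Raylet:
    """Hypothetical raylet that hands function tasks to workers round-robin (an assumption)."""

    def __init__(self, num_workers):
        self.workers = [Worker() for _ in range(num_workers)]
        self._next = 0

    def run_function_task(self, fn):
        worker = self.workers[self._next % len(self.workers)]
        self._next += 1
        fn(worker)

    def objects_still_held(self):
        return sum(len(w.held_objects) for w in self.workers)


def run_with_function_tasks(raylet, object_ids):
    # Each task, including the flush, may land on a different worker.
    for oid in object_ids:
        raylet.run_function_task(lambda w, oid=oid: w.create(oid))
    raylet.run_function_task(lambda w: w.flush())
    return raylet.objects_still_held()


def run_with_actor_tasks(raylet, object_ids):
    # An actor is pinned to one worker, so the flush sees every object it created.
    actor_worker = raylet.workers[0]
    for oid in object_ids:
        actor_worker.create(oid)
    actor_worker.flush()
    return raylet.objects_still_held()
```

With two workers and three objects, the function-task version leaves objects behind on the worker that never ran the flush, while the actor version leaves none.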
* Add script for running stress tests.
* Add an actor tree test where actors die with some probability
* Improve test.
* Small fix
* Update tests.
* Minor change
## What do these changes do?
This passes the correct observation space to the LSTM model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.
## Related issue number
Closes https://github.com/ray-project/ray/issues/3367
This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
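
The recovery behavior listed above can be sketched roughly as follows. This is not the Tune implementation: `Trial`, `recover_trials`, the CPU-only resource model, and the `on_stop` callback (standing in for the `trial_runner.stop_trial` notification) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

PENDING = "PENDING"
RUNNING = "RUNNING"


@dataclass
class Trial:
    name: str
    cpus: int
    status: str = RUNNING


def recover_trials(failed_trials, free_cpus, on_stop):
    """Resume or requeue trials that were RUNNING on a failed node.

    Returns the number of CPUs still free after recovery.
    """
    for trial in failed_trials:
        # Notify schedulers / search algorithms so they don't block on this trial.
        on_stop(trial)
        if trial.cpus <= free_cpus:
            # Best effort: resume elsewhere if resources are available.
            free_cpus -= trial.cpus
            trial.status = RUNNING
        else:
            # Cluster is saturated: queue the trial instead.
            trial.status = PENDING
    return free_cpus
```

For example, with 3 free CPUs, a 2-CPU trial would be resumed and a 4-CPU trial would be queued as PENDING, and the callback would fire for both.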
Remaining questions:
- Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239)
Closes #2851.
* Broadcast actor death, clean up dummy objects
* Reduce logging and clean up state when failing a task
* lint
* Make actor failure test nicer, reduce node timeout