* Implement Actor checkpointing
* docs
* fix
* fix
* fix
* move restore-from-checkpoint to HandleActorStateTransition
* Revert "move restore-from-checkpoint to HandleActorStateTransition"
This reverts commit 9aa4447c1e3e321f42a1d895d72f17098b72de12.
* resubmit waiting tasks when actor frontier restored
* add doc about num_actor_checkpoints_to_keep=1
* add num_actor_checkpoints_to_keep to Cython
* add checkpoint_expired api
* check if actor class is abstract
* change checkpoint_ids to long string
* implement java
* Refactor to delay actor creation publish until checkpoint is resumed
* debug, lint
* Erase from checkpoints to restore if task fails
* fix lint
* update comments
* avoid duplicated actor notification log
* fix unintended change
* add actor_id to checkpoint_expired
* small java updates
* make checkpoint info per actor
* lint
* Remove logging
* Remove old actor checkpointing Python code, move new checkpointing code to FunctionActorManager
* Replace old actor checkpointing tests
* Fix test and lint
* address comments
* consolidate kill_actor
* Remove __ray_checkpoint__
* fix non-ascii char
* Loosen test checks
* fix java
* fix sphinx-build
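A minimal Python sketch of the checkpointable-actor hooks the commits above describe. Only `checkpoint_expired` and `num_actor_checkpoints_to_keep` are named in the commits; the other hook names, signatures, and the in-memory storage are assumptions for illustration only.

```python
import ray


@ray.remote
class CheckpointableCounter:
    def __init__(self):
        self.value = 0
        # A real implementation would persist checkpoints externally so they
        # survive a restart; a dict keeps this sketch short.
        self._checkpoints = {}

    def increment(self):
        self.value += 1
        return self.value

    def save_checkpoint(self, actor_id, checkpoint_id):
        # Assumed hook: persist enough state to resume after a failure.
        self._checkpoints[checkpoint_id] = self.value

    def load_checkpoint(self, actor_id, available_checkpoints):
        # Assumed hook: restore from the newest checkpoint the backend still
        # tracks and return the id of the checkpoint that was used.
        checkpoint_id = available_checkpoints[0]
        self.value = self._checkpoints[checkpoint_id]
        return checkpoint_id

    def checkpoint_expired(self, actor_id, checkpoint_id):
        # Called once a checkpoint falls outside the retention window set by
        # num_actor_checkpoints_to_keep; drop any data kept for it.
        self._checkpoints.pop(checkpoint_id, None)
```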
* Stream logs to driver by default.
* Fix from rebase
* Redirect raylet output independently of worker output.
* Fix.
* Create redis client with services.create_redis_client.
* Suppress Redis connection error at exit.
* Remove thread_safe_client from redis.
* Shutdown driver threads in ray.shutdown().
* Add warning for too many log messages.
* Only stop threads if worker is connected.
* Only stop threads if they exist.
* Remove unnecessary try/excepts.
* Fix
* Only add new logging handler once.
* Increase timeout.
* Fix tempfile test.
* Fix logging in cluster_utils.
* Revert "Increase timeout."
This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.
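With worker logs now streamed to the driver by default, opting out looks roughly like the snippet below; the exact `ray.init` keyword (`log_to_driver` is assumed here) may differ by version.

```python
import ray

# Worker stdout/stderr is forwarded to the driver by default after this change;
# pass log_to_driver=False (assumed keyword) to silence it.
ray.init(log_to_driver=False)
```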
* Retry longer when connecting to plasma store from node manager and object manager.
* Close pubsub channels to avoid leaking file descriptors.
* Limit log monitor open files to 200.
* Increase plasma connect retries.
* Add comment.
* added store_client_ to object_manager and node_manager
* halfway through...
* all code in, and compiling! Nothing tested though...
* something is working ;-)
* added a few more comments
* now, add only one entry in the GCS for inlined objects
* more comments
* remove a spurious todo
* some comment updates
* add test
* added metadata support for inline objects
* avoid some copies
* Initialize plasma client in tests
* Better comments. Enable configuring inline_object_max_size_bytes.
* Update src/ray/object_manager/object_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* Update src/ray/raylet/node_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* Update src/ray/raylet/node_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* fixed comments
* fixed various typos in comments
* updated comments in object_manager.h and object_manager.cc
* addressed all comments...hopefully ;-)
* Only add eviction entries for objects that are not inlined
* fixed a bunch of comments
* Fix test
* Fix object transfer dump test
* lint
* Comments
* Fix test?
* Fix test?
* lint
* fix build
* Fix build
* lint
* Use const ref
* Fixes, don't let object manager hang
* Increase object transfer retry time for travis?
* Fix test
* Fix test?
* Add internal config to java, fix PlasmaFreeTest
## What do these changes do?
* Improved --no-cuda handling
* Removed deprecated Variable usage
## Related issue number
Fixes #3873
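For context on the deprecated `Variable` usage removed above: since PyTorch 0.4, tensors track autograd state directly, so the wrapper is unnecessary. A small illustrative snippet (not taken from the changed files):

```python
import torch

# Old style: x = torch.autograd.Variable(torch.ones(3), requires_grad=True)
# Current style: tensors carry autograd state themselves.
x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()
y.backward()
print(x.grad)  # tensor([2., 2., 2.])
```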
- NodeUpdater gets its IP in parallel now (no longer in __init__)
- We use persistent connections in SSH (temp folder created only for ray; ControlMaster)
- hash_runtime_conf was performing a pointless hexlify step, wasting time on large files
- We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed
- AWSNodeProvider caches nodes more aggressively
- NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it
- AWSNodeProvider batches EC2 update_tags calls
- Logging changes throughout to provide standardised timing information for profiling
- Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway)
## Related issue number
Issue #3599
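A hypothetical sketch of the batch-terminate shim mentioned above; the actual `NodeProvider` interface in `ray.autoscaler` may differ.

```python
class NodeProvider:
    def terminate_node(self, node_id):
        raise NotImplementedError

    def terminate_nodes(self, node_ids):
        # Default shim: terminate nodes one by one. Providers such as
        # AWSNodeProvider can override this with a single batched cloud call,
        # and the autoscaler invokes the batched form.
        for node_id in node_ids:
            self.terminate_node(node_id)
```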
* Factor out starting Ray processes.
* Detect flags through environment variables.
* Return ProcessInfo from start_ray_process.
* Print valgrind errors at exit.
* Test valgrind in travis.
* Some valgrind fixes.
* Undo raylet monitor change.
* Only test plasma store in valgrind.
* add marwil policy graph
* fix typo
* add offline optimizer and enable running marwil
* fix loss function
* maintain a moving average of the advantage norm
* use sync replay optimizer for unifying
* remove offline optimizer and use sync replay optimizer
* format by yapf
* add imitation learning objective
* fix according to eric's review
* format by yapf
* revise
* add test data
* marwil
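A hedged sketch of launching the new MARWIL algorithm through Tune on previously collected offline data; the registered trainer name comes from the commits above, while the config keys (`input`, `beta`) and the data path are assumptions.

```python
import ray
from ray import tune

ray.init()
tune.run(
    "MARWIL",
    config={
        "env": "CartPole-v0",
        # Assumed keys: path to offline experience data and the advantage
        # weighting coefficient (beta=0 reduces to plain imitation learning).
        "input": "/tmp/cartpole-out",
        "beta": 1.0,
    },
)
```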
* Refactor code about ray.ObjectID.
* remove from_random and use nil_id instead of constructor
* remove id() in hash
* Lint and fix
* Change driver id to ObjectID
* Replace binary_to_hex(ObjectID.id()) with ObjectID.hex()
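In practical terms, the refactor replaces the helper-based conversion with a method on the ID itself; a small before/after sketch (the `ray.utils` import path is assumed):

```python
import ray
from ray.utils import binary_to_hex  # helper used by the old call sites

ray.init()
object_id = ray.put("hello")

# Before the refactor:
hex_old = binary_to_hex(object_id.id())
# After the refactor:
hex_new = object_id.hex()
assert hex_old == hex_new
```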
Rename `xray_test.py` to `mini_test.py` and use that in the documentation. Right now we suggest that people run `runtest.py`, but that often doesn't succeed and takes too long.
* Implement Node class and move most of services.py into it.
* Wait for nodes as they are added to the cluster.
* Fix Redis authentication bug.
* Fix bug in client table ordering.
* Address comments.
* Kill raylet before plasma store in test.
* Minor
* Convert UniqueID::nil() to a constructor
* Cleanup actor handle pickling code
* Add new actor handles to the task spec
* Pass in new actor handles
* Add new handles to the actor registration
* Regression test for actor handle forking and GC
* lint and doc
* Handle pickled actor handles in the backend and some refactoring
* Add regression test for dummy object GC and pickled actor handles
* Check for duplicate actor tasks on submission
* Regression test for forking twice, fix failed named actor leak
* Fix bug for forking twice
* lint
* Revert "Fix bug for forking twice"
This reverts commit 3da85e59d401e53606c2e37ffbebcc8653ff27ac.
* Add new actor handles when task is assigned, not finished
* Remove comment
* remove UniqueID()
* Updates
* update
* fix
* fix java
* fixes
* fix
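The handle-forking behavior covered by the regression tests above can be illustrated with a small driver-side example: passing an actor handle into a task creates a new ("forked") handle to the same actor, and tasks submitted through either handle reach the same instance.

```python
import ray

ray.init()


@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n


@ray.remote
def fork(counter):
    # `counter` is a forked handle to the same actor as the driver's handle.
    return ray.get(counter.increment.remote())


counter = Counter.remote()
# All tasks, whether submitted via the original or a forked handle, hit the
# same actor instance.
results = ray.get([fork.remote(counter) for _ in range(3)])
print(sorted(results))  # [1, 2, 3]
```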
* Separate out functionality for querying client table and improve cluster.wait_for_nodes() API.
* Linting
* Add back logging statements.
* info -> debug
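A rough sketch of the test-cluster flow that `cluster.wait_for_nodes()` supports; the module path and constructor arguments are assumptions and vary across Ray versions.

```python
import ray
from ray.tests.cluster_utils import Cluster  # assumed module path

cluster = Cluster()
for _ in range(3):
    cluster.add_node(num_cpus=1)

ray.init(redis_address=cluster.redis_address)
# Block until every added node appears in the client table.
cluster.wait_for_nodes()
```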