* enable parameter space noise for exploration
* yapf formatted
* remove the usage of scipy softmax, which is available only in the latest scipy version
* enable subclasses that have no parameter_noise in the config
* run user-specified callbacks and test parameter space noise in a multi-node setting
* formatted by yapf
* Update dqn.py
* lint
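The parameter space noise option above is exposed through the DQN config. Below is a minimal sketch of how it might be enabled; the `parameter_noise` key and the `DQNAgent` entry point are reconstructed from the commit messages and should be treated as assumptions that may differ across Ray versions.

```python
import ray
from ray.rllib.agents.dqn import DQNAgent  # module path varies across Ray versions

ray.init()

agent = DQNAgent(
    env="CartPole-v0",
    config={
        # Perturb the policy network's weights for exploration instead of
        # relying only on epsilon-greedy action noise.
        "parameter_noise": True,
    })

for _ in range(5):
    print(agent.train()["episode_reward_mean"])
```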
* Implement Actor checkpointing
* docs
* fix
* fix
* fix
* move restore-from-checkpoint to HandleActorStateTransition
* Revert "move restore-from-checkpoint to HandleActorStateTransition"
This reverts commit 9aa4447c1e3e321f42a1d895d72f17098b72de12.
* resubmit waiting tasks when actor frontier restored
* add doc about num_actor_checkpoints_to_keep=1
* add num_actor_checkpoints_to_keep to Cython
* add checkpoint_expired api
* check if actor class is abstract
* change checkpoint_ids to long string
* implement java
* Refactor to delay actor creation publish until checkpoint is resumed
* debug, lint
* Erase from checkpoints to restore if task fails
* fix lint
* update comments
* avoid duplicated actor notification log
* fix unintended change
* add actor_id to checkpoint_expired
* small java updates
* make checkpoint info per actor
* lint
* Remove logging
* Remove old actor checkpointing Python code, move new checkpointing code to FunctionActorManager
* Replace old actor checkpointing tests
* Fix test and lint
* address comments
* consolidate kill_actor
* Remove __ray_checkpoint__
* fix non-ascii char
* Loosen test checks
* fix java
* fix sphinx-build
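A hedged sketch of how the actor checkpointing hooks referenced above (`checkpoint_expired`, `num_actor_checkpoints_to_keep`) might be used from Python. The `Checkpointable` base class, the `max_reconstructions` option, and the method signatures are assumptions reconstructed from the commit messages, not necessarily the exact API this change introduced.

```python
import os
import pickle

import ray
from ray.actor import Checkpointable  # interface name/location is an assumption


@ray.remote(max_reconstructions=2)
class Counter(Checkpointable):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    # Decide whether the backend should take a checkpoint after this task.
    def should_checkpoint(self, checkpoint_context):
        return self.value % 10 == 0

    # Persist state under the checkpoint id handed out by the backend,
    # in external storage so it survives the actor process dying.
    def save_checkpoint(self, actor_id, checkpoint_id):
        with open("/tmp/ckpt-{}".format(checkpoint_id), "wb") as f:
            pickle.dump(self.value, f)

    # On reconstruction, pick one of the still-valid checkpoints to resume
    # from and return its id (or None to start from scratch).
    def load_checkpoint(self, actor_id, available_checkpoints):
        if not available_checkpoints:
            return None
        checkpoint_id = available_checkpoints[0].checkpoint_id
        with open("/tmp/ckpt-{}".format(checkpoint_id), "rb") as f:
            self.value = pickle.load(f)
        return checkpoint_id

    # Called when an old checkpoint is dropped (e.g. because more than
    # num_actor_checkpoints_to_keep exist); clean up its storage.
    def checkpoint_expired(self, actor_id, checkpoint_id):
        try:
            os.remove("/tmp/ckpt-{}".format(checkpoint_id))
        except OSError:
            pass
```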
* Stream logs to driver by default.
* Fix from rebase
* Redirect raylet output independently of worker output.
* Fix.
* Create redis client with services.create_redis_client.
* Suppress Redis connection error at exit.
* Remove thread_safe_client from redis.
* Shutdown driver threads in ray.shutdown().
* Add warning for too many log messages.
* Only stop threads if worker is connected.
* Only stop threads if they exist.
* Remove unnecessary try/excepts.
* Fix
* Only add new logging handler once.
* Increase timeout.
* Fix tempfile test.
* Fix logging in cluster_utils.
* Revert "Increase timeout."
This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.
* Retry longer when connecting to plasma store from node manager and object manager.
* Close pubsub channels to avoid leaking file descriptors.
* Limit log monitor open files to 200.
* Increase plasma connect retries.
* Add comment.
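A minimal sketch of the "stream logs to driver by default" behavior described above, assuming a `log_to_driver` toggle on `ray.init`; the flag name is an assumption, while the default-on behavior is what the commits describe.

```python
import ray

# Worker stdout/stderr is forwarded to the driver by default after this
# change; the keyword shown here (an assumption) would turn it off.
ray.init(log_to_driver=True)


@ray.remote
def noisy():
    print("this line shows up in the driver's output")


ray.get(noisy.remote())
```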
* added store_client_ to object_manager and node_manager
* halfway through...
* all code in, and compiling! Nothing tested though...
* something is working ;-)
* added a few more comments
* now add only one entry in the GCS for inlined objects
* more comments
* remove a spurious todo
* some comment updates
* add test
* added support for metadata for inline objects
* avoid some copies
* Initialize plasma client in tests
* Better comments. Enable configuring inline_object_max_size_bytes.
* Update src/ray/object_manager/object_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* Update src/ray/raylet/node_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* Update src/ray/raylet/node_manager.cc
Co-Authored-By: istoica <istoica@cs.berkeley.edu>
* fixed comments
* fixed various typos in comments
* updated comments in object_manager.h and object_manager.cc
* addressed all comments...hopefully ;-)
* Only add eviction entries for objects that are not inlined
* fixed a bunch of comments
* Fix test
* Fix object transfer dump test
* lint
* Comments
* Fix test?
* Fix test?
* lint
* fix build
* Fix build
* lint
* Use const ref
* Fixes, don't let object manager hang
* Increase object transfer retry time for travis?
* Fix test
* Fix test?
* Add internal config to java, fix PlasmaFreeTest
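A hedged sketch of how the inline-object threshold above might be tuned from Python, assuming the internal config is passed as a JSON string via a hypothetical `_internal_config` argument to `ray.init`; the exact entry point may differ between versions.

```python
import json
import ray

# _internal_config is assumed here as the way to override raylet settings.
ray.init(_internal_config=json.dumps({
    # Objects at or below this size are inlined into the task return value
    # and its GCS entry instead of going through the plasma store.
    "inline_object_max_size_bytes": 64,
}))
```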
## What do these changes do?
* Improved --no-cuda handling
* Removed deprecated Variable usage
## Related issue number
Fixes #3873
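A short sketch of the two example fixes described above, assuming the usual PyTorch example layout: gate CUDA on both the `--no-cuda` flag and actual availability, and replace the deprecated `torch.autograd.Variable` wrapper with plain tensors placed on a device.

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--no-cuda", action="store_true", help="disable CUDA")
args = parser.parse_args()

# Use CUDA only when it is both requested and available.
use_cuda = not args.no_cuda and torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Deprecated: x = torch.autograd.Variable(torch.zeros(4), requires_grad=True)
x = torch.zeros(4, device=device, requires_grad=True)
```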
- NodeUpdater gets its IP in parallel now (no longer in __init__)
- We use persistent connections in SSH (temp folder created only for ray; ControlMaster)
- hash_runtime_conf was performing a pointless hexlify step, wasting time on large files
- We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed
- AWSNodeProvider caches nodes more aggressively
- NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it (see the sketch after this section)
- AWSNodeProvider batches EC2 update_tags calls
- Logging changes throughout to provide standardised timing information for profiling
- Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway)
## Related issue number
Issue #3599
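A sketch of the batch `terminate_nodes()` shim described above: the base `NodeProvider` simply loops over the single-node call, while the AWS provider can override it with one batched EC2 request. Class and method names follow the bullet list; the bodies are illustrative assumptions.

```python
class NodeProvider(object):
    def terminate_node(self, node_id):
        raise NotImplementedError

    def terminate_nodes(self, node_ids):
        # Default shim: terminate one node at a time.
        for node_id in node_ids:
            self.terminate_node(node_id)


class AWSNodeProvider(NodeProvider):
    def __init__(self, ec2_client):
        self.ec2 = ec2_client  # a boto3 EC2 client

    def terminate_nodes(self, node_ids):
        # One batched EC2 call instead of N sequential ones.
        self.ec2.terminate_instances(InstanceIds=list(node_ids))
```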
* Factor out starting Ray processes.
* Detect flags through environment variables.
* Return ProcessInfo from start_ray_process.
* Print valgrind errors at exit.
* Test valgrind in travis.
* Some valgrind fixes.
* Undo raylet monitor change.
* Only test plasma store in valgrind.
* add marwil policy graph
* fix typo
* add offline optimizer and enable running marwil
* fix loss function
* add a moving average of the advantage norm
* use the sync replay optimizer for unification
* remove offline optimizer and use sync replay optimizer
* format by yapf
* add imitation learning objective
* fix according to Eric's review
* format by yapf
* revise
* add test data
* marwil
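A hedged NumPy sketch of the MARWIL-style objective the commits above refer to: the log-likelihood of the demonstrated actions is re-weighted by `exp(beta * advantage)`, with the advantage scaled by a moving average of its squared norm. The constants and the exact update rule are assumptions, not the literal RLlib implementation.

```python
import numpy as np

BETA = 1.0               # imitation "temperature"; beta = 0 reduces to behavior cloning
NORM_UPDATE_RATE = 1e-8  # step size for the moving average of the squared advantage


class MarwilLoss(object):
    def __init__(self):
        self.avg_sq_adv = 100.0  # running estimate of E[A^2]

    def __call__(self, log_probs, advantages):
        # Maintain the moving average of the squared advantage norm.
        self.avg_sq_adv += NORM_UPDATE_RATE * (
            np.mean(advantages ** 2) - self.avg_sq_adv)
        # Scale advantages by the running norm so exp() stays well behaved.
        scaled = advantages / np.sqrt(self.avg_sq_adv + 1e-8)
        weights = np.exp(BETA * scaled)
        # Advantage-weighted log-likelihood of the demonstrated actions,
        # negated so it can be minimized as a loss.
        return -np.mean(weights * log_probs)
```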