* Separate out functionality for querying client table and improve cluster.wait_for_nodes() API.
* Linting
* Add back logging statements.
* info -> debug
`test_free_objects_multi_node` can fail intermittently: if we run the test 20 times, we may see at least one failure.
The cause is that the test is based on function tasks. A raylet may start more than one worker to execute the tasks, so the flush operations can be spread across several workers and fail to release all the objects held by each worker's plasma client.
This PR changes the function tasks to actor tasks, which guarantees that all the tasks are executed in a single worker on each raylet.
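The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Worker`, `Raylet`, and the round-robin dispatch are hypothetical stand-ins, not Ray's API, and the `held` set merely represents objects pinned by a worker's plasma client.

```python
class Worker:
    """Toy worker; `held` stands in for objects pinned by its plasma client."""

    def __init__(self):
        self.held = set()

    def create(self, object_id):
        self.held.add(object_id)

    def flush(self):
        # Only releases the objects held by *this* worker.
        self.held.clear()


class Raylet:
    """Toy raylet that owns several workers (hypothetical, not Ray's class)."""

    def __init__(self, num_workers):
        self.workers = [Worker() for _ in range(num_workers)]
        self._next = 0

    def run_function_task(self, method, *args):
        # Function tasks may land on any worker. Round-robin is an
        # assumption made here so that the example is deterministic.
        worker = self.workers[self._next % len(self.workers)]
        self._next += 1
        getattr(worker, method)(*args)

    def held_objects(self):
        return set().union(*(w.held for w in self.workers))


def run_with_function_tasks():
    raylet = Raylet(num_workers=2)
    for object_id in ("a", "b", "c"):
        raylet.run_function_task("create", object_id)
    raylet.run_function_task("flush")  # reaches only one of the workers
    return raylet.held_objects()


def run_with_actor_tasks():
    raylet = Raylet(num_workers=2)
    actor_worker = raylet.workers[0]  # an actor is pinned to one worker
    for object_id in ("a", "b", "c"):
        actor_worker.create(object_id)
    actor_worker.flush()
    return raylet.held_objects()
```

In this model the function-task variant leaves objects behind on the worker that never ran the flush, while the actor variant releases everything because creation and flushing share one worker.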
* Suppress duplicate pre-emptive object pushes.
* Add test.
* Fix linting
* Remove timer and inline recent_pushes_ into local_objects_.
* Improve test.
* Fix
* Fix linting
* Enable retrying pull from same object manager. Randomize object manager.
* Speed up test
* Linting
* Add test.
* Minor
* Lengthen pull timeout and reissue pull every time a new object becomes available.
* Increase pull timeout in test.
* Wait for nodes to start in object manager test.
* Wait longer for nodes to start up in test.
* Small fixes.
* _submit -> _remote
* Change assert to warning.
* Trigger reconstruction in ray.wait and mark worker as blocked.
* Add test.
* Linting.
* Don't run new test with legacy Ray.
* Only call HandleClientUnblocked if it actually blocked in ray.wait.
* Reduce time to ray.wait in the test.
Basically a re-implementation of #2281, with the modifications from #2298 (a fix of #2334, for rebasing issues).
[+] Implement sharding for gcs tables.
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is also managed by the primary_shard for now, until a good hashing scheme for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
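The routing rule described in this list could look roughly like the sketch below. It is an illustration only: the `ShardedGcs` class, the in-memory dicts standing in for Redis instances, the use of `zlib.crc32` as the hash, and the `"ObjectTable"` example are all assumptions, not the actual AsyncGcsClient implementation.

```python
import zlib

# Tables that this PR description says stay on the primary shard.
PRIMARY_ONLY_TABLES = {"ClientTable", "ErrorTable", "TaskTable"}


class ShardedGcs:
    """Hypothetical in-memory model of a primary shard plus N data shards."""

    def __init__(self, num_shards):
        self.primary = {}
        self.shards = [{} for _ in range(num_shards)]

    def _store_for(self, table, key):
        if table in PRIMARY_ONLY_TABLES or not self.shards:
            return self.primary
        # crc32 is an assumed hash, chosen only because it is deterministic.
        index = zlib.crc32(key) % len(self.shards)
        return self.shards[index]

    def put(self, table, key, value):
        self._store_for(table, key)[(table, key)] = value

    def get(self, table, key):
        return self._store_for(table, key).get((table, key))


gcs = ShardedGcs(num_shards=4)
gcs.put("ClientTable", b"client-1", "alive")   # always on the primary shard
gcs.put("ObjectTable", b"object-1", "node-A")  # hashed onto a data shard
```

Keeping the routing decision in one helper means a later change, such as hashing TaskTable entries, only has to touch `PRIMARY_ONLY_TABLES`.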
* Convert multi_node_test.py to pytest.
* Convert array_test.py to pytest.
* Convert failure_test.py to pytest.
* Convert microbenchmarks to pytest.
* Convert component_failures_test.py to pytest and make some minor quote changes.
* Convert tensorflow_test.py to pytest.
* Convert actor_test.py to pytest.
* Fix.
* Fix
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as a measure of current load
* this PR introduces a new notion of load, defined as the sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.
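The load definition and the heartbeat-driven spillover described above can be sketched as follows. This is a minimal illustration under assumptions: tasks are modeled as plain dicts of resource demands, and `compute_load` and `tasks_to_spill` are hypothetical helpers whose greedy first-fit selection is not necessarily the policy the scheduler uses.

```python
def compute_load(ready_queue):
    """Sum the resource demand of all queued ready tasks, per resource."""
    load = {}
    for demand in ready_queue:
        for resource, amount in demand.items():
            load[resource] = load.get(resource, 0) + amount
    return load


def tasks_to_spill(ready_queue, remote_available):
    """Return indices of queued tasks that fit on the heartbeating node.

    Greedy first-fit over the queue order (an assumption for illustration).
    """
    remaining = dict(remote_available)
    spilled = []
    for index, demand in enumerate(ready_queue):
        fits = all(remaining.get(r, 0) >= amt for r, amt in demand.items())
        if fits:
            for r, amt in demand.items():
                remaining[r] -= amt
            spilled.append(index)
    return spilled


# Three queued ready tasks with heterogeneous demands.
queue = [{"CPU": 1}, {"CPU": 2, "GPU": 1}, {"CPU": 1}]
local_load = compute_load(queue)
# A heartbeat reports that a remote node has two CPUs and no GPU available.
spill = tasks_to_spill(queue, {"CPU": 2})
```

Unlike a plain task count, the per-resource load distinguishes a queue of GPU-heavy tasks from a queue of small CPU tasks, which is what makes the measure heterogeneity-aware.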
* Add profile table and store profiling information there.
* Code for dumping timeline.
* Improve color scheme.
* Push timeline events on driver only for raylet.
* Improvements to profiling and timeline visualization
* Some linting
* Small fix.
* Linting
* Propagate node IP address through profiling events.
* Fix test.
* object_id.hex() should return a byte string in Python 2.
* Include gcs.fbs in node_manager.fbs.
* Remove flatbuffer definition duplication.
* Decode to unicode in Python 3 and bytes in Python 2.
* Minor
* Submit profile events in a batch. Revert some CMake changes.
* Fix
* Workaround test failure.
* Fix linting
* Linting
* Don't return anything from chrome_tracing_dump when filename is provided.
* Remove some redundancy from profile table.
* Linting
* Move TODOs out of docstring.
* Minor
* Fix documentation indentation.
* Add error table to GCS and push error messages through node manager.
* Add type to error data.
* Linting
* Fix failure_test bug.
* Linting.
* Enable one more test.
* Attempt to fix doc building.
* Restructuring
* Fixes
* More fixes.
* Move current_time_ms function into util.h.
* Implement global state API for xray.
* Fix object table.
* Fixes for log structure.
* Implement cluster_resources.
* Add driver task to task table.
* Remove python flatbuffers code
* Get some global state API tests running.
* Python linting.
* Fix linting.
* Fix mock modules for doc
* Copy over flatbuffer bindings.
* Fix for tests.
* Linting
* Fix monitor crash.
* Add flake8 to Travis
* Add flake8-comprehensions
[flake8 plugin](https://github.com/adamchainz/flake8-comprehensions) that
checks for useless constructions.
* Use generators instead of lists where appropriate
Many builtins accept generators instead of lists.
This commit applies `flake8-comprehensions` to find them.
* Fix lint error
* Fix some string formatting
The rest can be fixed in another PR
* Fix compound literals syntax
This should probably be merged after #1963.
* dict() -> {}
* Use dict literal syntax
dict(...) -> {...}
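The kinds of rewrites described in the items above look roughly like this. The variable names are made up for illustration and the snippet is not taken from the diff.

```python
values = [3, 1, 2]

# Before: builds a temporary list only to sum it.
total_from_list = sum([v * v for v in values])
# After: the builtin consumes a generator expression directly.
total_from_generator = sum(v * v for v in values)

# Before: set() wrapped around a list comprehension.
parities_from_list = set([v % 2 for v in values])
# After: a set comprehension.
parities = {v % 2 for v in values}

# Before: dict() called with keyword arguments.
config_from_call = dict(num_workers=2, use_gpu=False)
# After: a dict literal.
config = {"num_workers": 2, "use_gpu": False}
```

Each pair produces the same value; the rewritten forms avoid an intermediate list or a function call.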
* Rewrite nested dicts
* Fix hanging indent
* Add missing import
* Add missing quote
* fmt
* Add missing whitespace
* rm duplicate pip install
This is already installed in another file.
* Fix indent
* move `merge_dicts` into utils
* Bring up to date with `master`
* Add automatic syntax upgrade
* rm pyupgrade
In case users still want to run it on their own, the upgrade-syn.sh script was
left in the `.travis` dir.
* Use pep8 style
The original style file is actually just pep8 style, but with everything
spelled out. It's easier to use the `based_on_style` feature. Any overrides are
clearer that way.
* Improve yapf script
1. Do formatting in parallel
2. Lint RLlib
3. Use .style.yapf file
* Pull out expressions into variables
* Don't format rllib
* Don't allow splits in dicts
* Apply yapf
* Disallow single line if-statements
* Use arithmetic comparison
* Simplify checking for changed files
* Pull out expr into var