2234 commits

Author SHA1 Message Date
Edward Oakes
d69fe54f6d
Temporarily skip testEndToEndReporting (#7402) 2020-03-02 18:27:34 -06:00
Siyuan (Ryans) Zhuang
0792b5cb93
Fix the numpy ndarray subclass serialization bug (#7392) 2020-03-01 23:05:59 -08:00
Richard Liaw
48cdca843f
[raysgd] Custom training operator (#7211) 2020-03-01 21:22:48 -08:00
Eric Liang
3c6b94f3f5
[rllib] Enable performance metrics reporting for RLlib pipelines, add A3C (#7299) 2020-02-28 16:44:17 -08:00
Richard Liaw
fb73d51d4d
[tune] fix hparams for tbx (#7312)
* fix

* test_hist

* remove unnecessary value check

* pbt

* queue

* skip_for_now

* Apply suggestions from code review
2020-02-28 11:51:56 -08:00
Richard Liaw
ca40b0fcc6
[tune][minor] Avoid throwing error when gpu check fails (#7362) 2020-02-28 11:32:44 -08:00
Edward Oakes
f321eaec9b
Working but not passing test (#7358) 2020-02-28 12:57:28 -06:00
mehrdadn
fb0bc7b947
Partially revert "[Core/RLlib] Move log_once from rllib to ray.util. (#7273)" (#7361)
This partially reverts commit 357232d124.

The addition of python/__init__.py broke the build on Windows. However, this is difficult to notice because Bazel doesn't seem to notice this dependency. You first have to go to a commit that fails on this issue, and then try to re-build this commit, so that Bazel actually performs a rebuild.

A useful command line for triggering the exact build is:

bazel build --compile_one_dependency //:python/ray/_raylet.pyx
2020-02-28 10:27:45 -08:00
Edward Oakes
93fe4b0b58
Change actor.__ray_kill__() to ray.kill(actor) (#7360) 2020-02-28 11:55:13 -06:00
Richard Liaw
3fc162f93c
[tune] Add Unit Test for nested PBT + Jenkins (#7324) 2020-02-27 18:17:11 -08:00
mehrdadn
8730996682
Windows changes (#7315) 2020-02-27 15:14:10 -08:00
Edward Oakes
ced062319d
Decrease test_object_manager put size to avoid OOMs in CI (#7355) 2020-02-27 11:08:10 -08:00
Edward Oakes
cbf55d69a6
Remove serialized from_random object ids in tests (#7340) 2020-02-27 11:04:06 -08:00
Edward Oakes
bd9411f849
Call TriggerGlobalGC when the plasma store is full (#7337) 2020-02-27 11:01:49 -08:00
Sven Mika
357232d124
[Core/RLlib] Move log_once from rllib to ray.util. (#7273)
* Move log_once from rllib to tune.

* Move log_once from rllib to tune.

* LINT.

* Move to ray.util.debug.
2020-02-27 10:40:44 -08:00
Edward Oakes
d9027acaf2
Deprecate non-direct-call API (#7336) 2020-02-27 10:37:23 -08:00
Edward Oakes
55ccfb6089
Fix asyncio actor race condition (#7335) 2020-02-27 10:16:04 -08:00
Edward Oakes
ee0f71e398
Add __commit__ field to ray package in wheels (#7305) 2020-02-26 17:54:22 -08:00
Edward Oakes
2ad9bc5684
Move plasma retry logic into plasma store provider (#7328) 2020-02-26 16:57:02 -08:00
Eric Liang
b310661338
Add internal_api.global_gc() method, which triggers gc.collect() on all workers (#7327) 2020-02-26 14:09:29 -08:00
Stephanie Wang
9964657815
Fix plasma bug (#7322) 2020-02-25 18:15:28 -08:00
Edward Oakes
44b4394afa
Remove unused AddContainedObjectIDs (#7323) 2020-02-25 16:42:20 -08:00
Richard Liaw
226fcd5aff
Add Dashboard and Util to setup-dev (#7321) 2020-02-25 15:25:09 -08:00
Eric Liang
1ea05a2c08
[tune] Fix a number of reporter regressions and add end-to-end tests (#7274) 2020-02-25 14:31:56 -08:00
Eric Liang
f14b6e477b
Raise gRPC message size limit to 100MB (#7269) 2020-02-24 23:22:49 -08:00
Edward Oakes
f2faf8d26e
Fix passing duplicate by-reference arguments (#7306) 2020-02-24 19:18:16 -08:00
chaokunyang
8b6784de06
[Streaming] Streaming Python API (#6755) 2020-02-25 10:33:33 +08:00
Mitchell Stern
669bb403c3
Add TypeScript and HTML linting to Travis lint job (#7294) 2020-02-24 11:12:07 -08:00
Eric Liang
0ae4fe020d
revert omp threads fix (#7288) 2020-02-23 21:26:49 -08:00
fangfengbin
e7d0ec9531
Enable GCS server when running python unit tests (#7101)
* Enable GCS server when running python unit tests

* restart ci

* restart ci

* fix code style

* restart ci

* restart ci

* restart ci

* restart ci

* restart ci

* Define RAY_GCS_SERVICE_ENABLED as a constant

* fix review comments

* fix code style

* fix code style

* fix code style

* fix code style

* fix review comments

* add gcs service python testcase

* fix TESTSUITE name bug
2020-02-24 09:48:40 +08:00
Sven Mika
0db2046b0a
[RLlib] Policy.compute_log_likelihoods() and SAC refactor. (issue #7107) (#7124)
* Exploration API (+EpsilonGreedy sub-class).

* Exploration API (+EpsilonGreedy sub-class).

* Cleanup/LINT.

* Add `deterministic` to generic Trainer config (NOTE: this is still ignored by most Agents).

* Add `error` option to deprecation_warning().

* WIP.

* Bug fix: Get exploration-info for tf framework.
Bug fix: Properly deprecate some DQN config keys.

* WIP.

* LINT.

* WIP.

* Split PerWorkerEpsilonGreedy out of EpsilonGreedy.
Docstrings.

* Fix bug in sampler.py in case Policy has self.exploration = None

* Update rllib/agents/dqn/dqn.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* WIP.

* Update rllib/agents/trainer.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* WIP.

* Change requests.

* LINT

* In tune/utils/util.py::deep_update(), only keep deep-updating if both original and value are dicts. If value is not a dict, set

* Completely obsolete sync_replay_optimizer.py's parameters schedule_max_timesteps and beta_annealing_fraction (replaced with prioritized_replay_beta_annealing_timesteps).

* Update rllib/evaluation/worker_set.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* Review fixes.

* Fix default value for DQN's exploration spec.

* LINT

* Fix recursion bug (wrong parent c'tor).

* Do not pass timestep to get_exploration_info.

* Update tf_policy.py

* Fix some remaining issues with test cases and remove more deprecated DQN/APEX exploration configs.

* Bug fix tf-action-dist

* DDPG incompatibility bug fix with new DQN exploration handling (which is imported by DDPG).

* Switch off exploration when getting action probs from off-policy-estimator's policy.

* LINT

* Fix test_checkpoint_restore.py.

* Deprecate all SAC exploration (unused) configs.

* Properly use `model.last_output()` everywhere instead of `model._last_output`.

* WIP.

* Take out set_epsilon from multi-agent-env test (not needed, decays anyway).

* WIP.

* Trigger re-test (flaky checkpoint-restore test).

* WIP.

* WIP.

* Add test case for deterministic action sampling in PPO.

* bug fix.

* Added deterministic test cases for different Agents.

* Fix problem with TupleActions in dynamic-tf-policy.

* Separate supported_spaces tests so they can be run separately for easier debugging.

* LINT.

* Fix autoregressive_action_dist.py test case.

* Re-test.

* Fix.

* Remove duplicate py_test rule from bazel.

* LINT.

* WIP.

* WIP.

* SAC fix.

* SAC fix.

* WIP.

* WIP.

* WIP.

* FIX 2 examples tests.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* Fix.

* LINT.

* Renamed test file.

* WIP.

* Add unittest.main.

* Make action_dist_class mandatory.

* fix

* FIX.

* WIP.

* WIP.

* Fix.

* Fix.

* Fix explorations test case (contextlib cannot find its own nullcontext??).

* Force torch to be installed for QMIX.

* LINT.

* Fix determine_tests_to_run.py.

* Fix determine_tests_to_run.py.

* WIP

* Add Random exploration component to tests (fixed issue with "static-graph randomness" via py_function).

* Add Random exploration component to tests (fixed issue with "static-graph randomness" via py_function).

* Rename some stuff.

* Rename some stuff.

* WIP.

* WIP.

* Fix SAC.

* Fix SAC.

* Fix strange tf-error in ray core tests.

* Fix strange ray-core tf-error in test_memory_scheduling test case.

* Fix test_io.py.

* LINT.

* Update SAC yaml files' config.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-02-22 14:19:49 -08:00
Stephanie Wang
4c2de7be54
[core] Ref counting for returning object IDs created by a different process (#7221)
* Add regression tests

* Refactor, split RemoveSubmittedTaskReferences into submitted and finished paths

* Add nested return IDs to UpdateFinishedTaskRefs, rename WrapObjectIds

* Basic unit tests pass

* Fix unit test and add an out-of-order regression test

* Add stored_in_objects to ObjectReferenceCount, regression test now passes

* Add an Address to the ReferenceCounter so we can determine ownership

* Set the nested return IDs from the TaskManager

* Add another test

* Simplify

* Update src/ray/core_worker/reference_count_test.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* comments

* Add python test

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-22 13:29:48 -08:00
Amog Kamsetty
1737a113be
[Parallel Iterators] Repartition functionality (#7163)
* repartition and tests

* blacklist lib/ files from import checks

* addressing comments and splitting up tests

* code readability

* adding explicit ref for parent iterator

* formatting
2020-02-21 13:20:18 -08:00
mehrdadn
c6f50ecc51
setpgrp fix (#7250) 2020-02-21 13:15:11 -08:00
Edward Oakes
d190e73727
Use our own implementation of parallel_memcopy (#7254) 2020-02-21 11:03:50 -08:00
Kai Yang
007333b960
[Java] Support direct call for normal tasks (#7193) 2020-02-21 10:03:34 +08:00
Edward Oakes
6c80071a7d
Remove gc.collect() calls from reference counting tests (#7218) 2020-02-20 10:51:02 -08:00
Edward Oakes
16e37416cd
Fix raylet pinning race condition (#7235) 2020-02-20 10:41:36 -08:00
Siyuan (Ryans) Zhuang
0d210a99c3
Ensure deserialized numpy arrays are immutable (#7181)
* ensure numpy arrays are immutable when deserialized from the memory buffer
2020-02-19 23:30:10 -08:00
Simon Mo
b804d40c04
Stop vendoring pyarrow (#7233) 2020-02-19 19:01:26 -08:00
Siyuan (Ryans) Zhuang
48c06f5042
Enhance the serialization refcount test for dynamic classes (#7222)
* enhance the test for dynamic classes
2020-02-19 18:35:35 -08:00
Simon Mo
7bef7031c2
Revert "Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214)" (#7232) 2020-02-19 13:35:29 -08:00
Sven Mika
d537e9f0d8
[RLlib] Exploration API: merge deterministic flag with exploration classes (SoftQ and StochasticSampling). (#7155) 2020-02-19 12:18:45 -08:00
Simon Mo
e8941b1b79
Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214) 2020-02-19 10:08:52 -08:00
Stephanie Wang
f76ce836b2
Distributed ref counting for serialized ObjectIDs (#6945)
* Skeleton plus a unit test for simple borrower case

* First unit test passes - forward an ID and task returns with 1 submitted task pending on the inner ID

* Invariant for contained_in

* Unit test passes for testing task return without creating a borrower

* Wrap ref count functionality in test case

* Fix bad delete

* Unit test and fix for borrowers creating more borrowers

* Unit test and fix for simple borrowing, but owner sends call after borrower's ref count goes to 0

* Refactor:
- keep a sentinel ref count for task argument IDs
- keep contained_in_borrowed in addition to contained_in_owned

* Unit test for nested IDs passes

* Refactor so that an object ID can only be contained in 1 borrowed ID at a time

* Add check

* Fix

* Unit test (passes) to test nesting object IDs but no borrowers created

* Unit test for nested objects from different owners passes, refactor to unset contained_in when popping refs

* Unit tests for borrowers receiving an ObjectID from multiple sources,
skip adding ownership info if we already have it to handle duplicate
refs

* Unit test for returning object ID passes

* More unit tests for returning object IDs pass

* Add serialized ID tests

* fix serialization issue

* remove swap

* It builds!

* debugging and some fixes:
- register handler for WaitForRefRemoved
- don't create a python reference for arg IDs
- pass in client factory into ReferenceCounter
- fix bad decrement in PopBorrowerRefs

* Fix accounting for serialized IDs:
- don't decrement for IDs on dependency resolution, wait until task finished
- add object IDs that were inlined when building the arguments to the task spec, pin these on the task executor until task finishes

* mu_ -> mutex_

* lint

* fix build

* clear outer_object_id

* add direct call type check

* Fix test for direct call IDs and return IDs for actor calls

* Fix CoreWorkerClient.Addr()

* Remove unneeded lock

* Remove unnecessary ObjectID refs

* Fix worker holding serialized refs test

* Fix hex IDs

* fix

* fix tests

* fix tests

* refactor and cleanups

* lint

* Put inlined Ids in task args and some cleanup

* Add back gc.collect() line for test case

* Refactor and fixes:
- store inlined IDs in RayObject
- allow storing objects with inlined IDs in memory store
- pin objects that were promoted to plasma

* oops

* make sure worker ID is set in address, pass in rpc::Address to CoreWorkerClient

* todos

* cleanups and test builds

* Fix tests

* Add feature flag

* cleanups

* address comments and some cleanups

* cleanup

* fix recursive test

* Comments for tests

* Turn off ref counting by default

* Skip tests

* Fix some bugs for test_array.py, java build

* Don't include nested objects in the ref count when the feature flag is off

* C++ feature flag does not work...

* Remove

* Turn on python tests and add a warning when plasma objects are evicted before being pinned

* Fix build and remove irrelevant test

* Fix for java

* Revert "Fix build and remove irrelevant test"

This reverts commit 056cca9b263ed05b0f9ab2250907338edcbca2d5.

* Fix ray.internal.free

* Fixes and skip some flaky tests

* fix java build

* fix windows build

* Add IDs contained in owned objects

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* update

* Try to fix ::test_direct_call_serialized_id_eviction

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-18 18:21:34 -08:00
mehrdadn
4a12243336
Use Process instead of pid_t (round 2) (#6882)
* Revert "Revert "Use Boost.Process instead of pid_t (#6510)" (#6909)"

This reverts commit bde575b8dd.

* Process wrapper, using Boost.Process on Windows

- Reverts bde575b8dd.
- Re-applies fb8e3615d5 after some refactoring.

* Remove Boost.Process dependency

* Don't open /proc file on Linux

* Change FATAL to ERROR and modify error message when process doesn't exist
2020-02-18 17:44:46 -08:00
Eric Liang
0aa9373d62
Revert "Removing Pyarrow dependency (#7146)" (#7209)
This reverts commit 2116fd3bca.
2020-02-18 14:12:06 -08:00
Eric Liang
5df801605e
Add ray.util package and move libraries from experimental (#7100) 2020-02-18 13:43:19 -08:00
ijrsvt
2116fd3bca
Removing Pyarrow dependency (#7146) 2020-02-17 18:00:13 -08:00
mehrdadn
3bd82d0bcd
Fix various issues/warnings that come up on Jenkins (#7147)
* Avoid warning about swap being unlimited

Currently we get the following message on Jenkins:
"Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."

Since we're not limiting swap anyway, we might as well avoid trying to.
https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details

* Fix escaping in re.search()

* Fix escaping in _noisy_layer()

* Raise a more descriptive error when dashboard data isn't found

* Don't error on dashboard files not being found when webui isn't required

* Change dashboard error to a warning instead
2020-02-17 16:08:55 -08:00