146 commits

Author SHA1 Message Date
Max Fitton
ddb9368f2c
Display GPU Utilization in the Dashboard (#8564) 2020-06-15 15:27:44 -05:00
Alec Brickner
207ab44129
Raise major version limit for msgpack (#8466) 2020-06-01 20:00:36 -07:00
Thomas Desrosiers
457a66ae9c
Reverts setup.py changes from 76450c8d4 (#8670) 2020-05-29 13:24:32 -07:00
Patrick Ames
76450c8d47
[autoscaler] Honor separate head and worker node subnet IDs (#8374) 2020-05-28 18:16:46 -07:00
Philipp Moritz
325aec81bd
Hide aliased autoscaler commands (#8348) 2020-05-07 10:17:59 -07:00
Simon Mo
c5a5a5de89
[Serve] Refactor Metric System: Counter + Measure Support (#8114) 2020-05-06 17:44:02 -07:00
Simon Mo
ec6631ae58
Pin redis-py version (#8290) 2020-05-02 22:09:02 -07:00
mehrdadn
254b1ec370
Set up testing and wheels for Windows on GitHub Actions (#8131)
* Move some Java tests into ci.sh

* Move C++ worker tests into ci.sh

* Define run()

* Prepare to move Python tests into ci.sh

* Fix issues in install-dependencies.sh

* Reload environment for GitHub Actions

* Move wheels to ci.sh and fix related issues

* Don't bypass failures in install-ray.sh anymore

* Make CI a little quieter

* Move linting into ci.sh

* Add vitals test right after build

* Fix os.uname() unavailability on Windows

Co-authored-by: Mehrdad <noreply@github.com>
2020-04-29 21:19:02 -07:00
mehrdadn
0a54407961
[CI] Factor out more Travis code and update GitHub Actions (#8085) 2020-04-21 09:53:08 -07:00
fyrestone
fc6259a656
Cross language serialization for primitive types (#7711)
* Cross language serialization for Java and Python

* Use strict types when Python serializing

* Handle recursive objects in Python; Pin msgpack >= 0.6.0, < 1.0.0

* Disable gc for optimizing msgpack loads

* Fix merge bug

* Java call Python use returnType; Fix ClassLoaderTest

* Fix RayMethodsTest

* Fix checkstyle

* Fix lint

* prepare_args raises exception if try to transfer a non-deserializable object to another language

* Fix CrossLanguageInvocationTest.java, Python msgpack treat float as double

* Minor fixes

* Fix compile error on linux

* Fix lint in java/BUILD.bazel

* Fix test_failure

* Fix lint

* Class<?> to Class<T>; Refine metadata bytes.

* Rename FST to Fst; sort java dependencies

* Change Class<?>[] to Optional<Class<?>>; sort requirements in setup.py

* Improve CrossLanguageInvocationTest

* Refactor MessagePackSerializer.java

* Refactor MessagePackSerializer.java; Refine CrossLanguageInvocationTest.java

* Remove unnecessary dependencies for Java; Add getReturnType() for RayFunction in Java

* Fix bug

* Remove custom cross language type support

* Replace Serializer.Meta with MutableBoolean

* Remove @SuppressWarnings support from checkstyle.xml; Add null test in CrossLanguageInvocationTest.java

* Refine MessagePackSerializer.pack

* Ray.get support RayObject as input

* Improve comments and error info

* Remove classLoader argument from serializer

* Separate msgpack from pickle5 in Python

* Pair<byte[], MutableBoolean> to Pair<byte[], Boolean>

* Remove public static <T> T get(RayObject<T> object), use RayObject.get() instead

* Refine test

* small fixes

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
2020-04-08 21:10:57 +08:00
acxz
7827d2c2de
Add wheel build dependency (#7877) 2020-04-03 18:10:34 -07:00
Markus Cozowicz
b853df7a3b
[autoscaler] Switch to ARM for Azure deployment (#7717)
* switch to ARM templates for config and VMs

* switch to ARM templates for config and VMs

* auto-formatting

* addressed Scotts comment

* added missing imports

* fixed gpu templates
fixed wheel reference

* added missing reference

* cleanup wording and yamls

* Update doc/source/autoscaling.rst

Co-Authored-By: Scott Graham <5720537+gramhagen@users.noreply.github.com>

Co-authored-by: Ubuntu <marcozo@marcozodev2.zqvgrdyupqrudayw1il1agipig.jx.internal.cloudapp.net>
Co-authored-by: Scott Graham <5720537+gramhagen@users.noreply.github.com>
2020-04-03 15:51:56 -07:00
mehrdadn
65054a2c7c
Python 3.8 compatibility (#7754) 2020-04-01 10:03:23 -07:00
SangBin Cho
c23e56ce9a
Metrics Export Service (#7809) 2020-03-30 23:28:32 -07:00
Edward Oakes
d87563937e
Revert "[Dashboard] Metrics Export Service. (#7728)" (#7789) 2020-03-28 19:27:34 -07:00
SangBin Cho
7a0befb0a7
[Dashboard] Metrics Export Service. (#7728) 2020-03-26 14:03:00 -07:00
Richard Liaw
54a892bb84
[tune] Cancel Experiment via Client (#7719)
* init cancel

* testing

* Update python/ray/tune/tests/test_tune_server.py

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* Apply suggestions from code review

* Apply suggestions from code review

* finished

* set_finished

Co-authored-by: ijrsvt <ian.rodney@gmail.com>
2020-03-24 20:30:12 -07:00
Robert Nishihara
2b80310e6f
Remove setup.py dependence on packaging. (#7714) 2020-03-23 16:21:17 -07:00
Robert Nishihara
ee8c9ff732
Remove six and cloudpickle from setup.py. (#7694) 2020-03-23 11:42:05 -07:00
Robert Nishihara
1a0c9228d0
Remove pytest from setup.py and other minor changes. (#7700) 2020-03-23 08:46:56 -07:00
Robert Nishihara
4d722bf003
Remove dependence on funcsigs. (#7701) 2020-03-22 21:37:24 -07:00
Clark Zinzow
c37f6e745a
Remove duplicate jsonschema from setup.py (#7665) 2020-03-19 13:12:47 -07:00
Scott Graham
37e4d29f87
[autoscaler] Adding Azure Support (#7080)
* adding directory and node_provider entry for azure autoscaler

* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating

* adding todos and switching to auth file for service principal authentication

* adding role / scope to service principal

* resolving issues with app credentials

* adding retry for setting service principal role

* typo and adding retry to nic creation

* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing

* linting

* updating cleanup and fixing bugs

* minor fixes

* first working version :)

* added tag support

* added msi identity intermediate

* enable MSI through user managed identity

* updated schema

* extend yaml schema
remove service principal code
add re-use of managed user identity

* fix rg_id

* fix logging

* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)

* run linting

* updating yaml configs and formatting

* updating yaml configs and formatting

* typo in example config

* pulling default config from example-full

* resetting min, init worker prop

* adding docs for azure autoscaler and fixing status

* add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment

* fix for default subscription in azure node provider

* vm dev image build

* minor change

* keeping example-full.yaml in autoscaler/azure, updating azure example config

* linting azure config

* extending retries on azure config

* lint

* support for internal ips, fix to azure docs, and new azure gpu example config

* linting

* Update python/ray/autoscaler/azure/node_provider.py

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* revert_this

* remove_schema

* updating configs and removing ssh keygen, tweak azure node provider terminate

* minor tweaks

Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-15 14:48:27 -07:00
Markus Cozowicz
ea99063c10
added json schema to setup.py (#7554) 2020-03-11 09:53:21 -07:00
Markus Cozowicz
49439611f1
[autoscaler] Replace cluster yaml validation with json schema v… (#7261)
* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)
- run linting
- moved schema to ray/autoscaler
- fixed typo
- remove importlib dependency

* Update python/ray/autoscaler/autoscaler.py

* read

* restrict allowed properties

* added unit test for invalid yaml
added ray[test] package (remove pytest from default dependencies)

* updated autoscaler test to use ValidationError exception

* add missing dependency

* added pytest

* removed parameterized dependency
reverted ray[test] intro

* removed parameterized

* fix_tests

* format

Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 18:58:55 -07:00
Sven Mika
510c850651
[RLlib] SAC add discrete action support. (#7320)
* Exploration API (+EpsilonGreedy sub-class).

* Exploration API (+EpsilonGreedy sub-class).

* Cleanup/LINT.

* Add `deterministic` to generic Trainer config (NOTE: this is still ignored by most Agents).

* Add `error` option to deprecation_warning().

* WIP.

* Bug fix: Get exploration-info for tf framework.
Bug fix: Properly deprecate some DQN config keys.

* WIP.

* LINT.

* WIP.

* Split PerWorkerEpsilonGreedy out of EpsilonGreedy.
Docstrings.

* Fix bug in sampler.py in case Policy has self.exploration = None

* Update rllib/agents/dqn/dqn.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* WIP.

* Update rllib/agents/trainer.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* WIP.

* Change requests.

* LINT

* In tune/utils/util.py::deep_update() Only keep deep_updat'ing if both original and value are dicts. If value is not a dict, set

* Completely obsolete sync_replay_optimizer.py's parameters schedule_max_timesteps AND beta_annealing_fraction (replaced with prioritized_replay_beta_annealing_timesteps).

* Update rllib/evaluation/worker_set.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* Review fixes.

* Fix default value for DQN's exploration spec.

* LINT

* Fix recursion bug (wrong parent c'tor).

* Do not pass timestep to get_exploration_info.

* Update tf_policy.py

* Fix some remaining issues with test cases and remove more deprecated DQN/APEX exploration configs.

* Bug fix tf-action-dist

* DDPG incompatibility bug fix with new DQN exploration handling (which is imported by DDPG).

* Switch off exploration when getting action probs from off-policy-estimator's policy.

* LINT

* Fix test_checkpoint_restore.py.

* Deprecate all SAC exploration (unused) configs.

* Properly use `model.last_output()` everywhere. Instead of `model._last_output`.

* WIP.

* Take out set_epsilon from multi-agent-env test (not needed, decays anyway).

* WIP.

* Trigger re-test (flaky checkpoint-restore test).

* WIP.

* WIP.

* Add test case for deterministic action sampling in PPO.

* bug fix.

* Added deterministic test cases for different Agents.

* Fix problem with TupleActions in dynamic-tf-policy.

* Separate supported_spaces tests so they can be run separately for easier debugging.

* LINT.

* Fix autoregressive_action_dist.py test case.

* Re-test.

* Fix.

* Remove duplicate py_test rule from bazel.

* LINT.

* WIP.

* WIP.

* SAC fix.

* SAC fix.

* WIP.

* WIP.

* WIP.

* FIX 2 examples tests.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* Fix.

* LINT.

* Renamed test file.

* WIP.

* Add unittest.main.

* Make action_dist_class mandatory.

* fix

* FIX.

* WIP.

* WIP.

* Fix.

* Fix.

* Fix explorations test case (contextlib cannot find its own nullcontext??).

* Force torch to be installed for QMIX.

* LINT.

* Fix determine_tests_to_run.py.

* Fix determine_tests_to_run.py.

* WIP

* Add Random exploration component to tests (fixed issue with "static-graph randomness" via py_function).

* Add Random exploration component to tests (fixed issue with "static-graph randomness" via py_function).

* Rename some stuff.

* Rename some stuff.

* WIP.

* update.

* WIP.

* Gumbel Softmax Dist.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP

* WIP.

* WIP.

* Hypertune.

* Hypertune.

* Hypertune.

* Lock-in.

* Cleanup.

* LINT.

* Fix.

* Update rllib/policy/eager_tf_policy.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Update rllib/agents/sac/sac_policy.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Update rllib/agents/sac/sac_policy.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Update rllib/models/tf/tf_action_dist.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Update rllib/models/tf/tf_action_dist.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Fix items from review comments.

* Add dm_tree to RLlib dependencies.

* Add dm_tree to RLlib dependencies.

* Fix DQN test cases ((Torch)Categorical).

* Fix wrong pip install.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Kristian Hartikainen <kristian.hartikainen@gmail.com>
2020-03-06 10:37:12 -08:00
chaokunyang
8b6784de06
[Streaming] Streaming Python API (#6755) 2020-02-25 10:33:33 +08:00
Simon Mo
b804d40c04
Stop vendoring pyarrow (#7233) 2020-02-19 19:01:26 -08:00
Simon Mo
7bef7031c2
Revert "Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214)" (#7232) 2020-02-19 13:35:29 -08:00
Simon Mo
e8941b1b79
Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214) 2020-02-19 10:08:52 -08:00
Eric Liang
0aa9373d62
Revert "Removing Pyarrow dependency (#7146)" (#7209)
This reverts commit 2116fd3bca.
2020-02-18 14:12:06 -08:00
ijrsvt
2116fd3bca
Removing Pyarrow dependency (#7146) 2020-02-17 18:00:13 -08:00
Qing Wang
94a286ef1d
[Java] Add session_dir as temp_dir for logs, socket files like Python (#7044)
* Support

* Add gcs_server support

* Fix ut

* Fix

* Remove unused py code

* Fix linting

* Fix cross language ci

* Fix CI

* Add docstring

* Fix

* Fix linting

* Add a singleton for config

* Refine

* fix

* Fix

* linting

* Remove FileUnit

* Fix

* Fix

* Fix

* Update java/runtime/src/main/java/org/ray/runtime/config/RayConfig.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Fix streaming singleprocess CI

* Fix checkstyle

Co-authored-by: Hao Chen <chenh1024@gmail.com>
2020-02-13 17:49:52 +08:00
ijrsvt
0826f95e1c
Including psutil & setproctitle (#7031) 2020-02-05 14:16:58 -08:00
fangfengbin
ade7ebfc0c
Add service based gcs client (#6686) 2020-02-05 12:06:25 +08:00
Richard Liaw
341ddd0a09
[tune] Default to TensorboardX and include in requirements. (#6836) 2020-01-19 01:49:33 -08:00
Mitchell Stern
763818b476
[Dashboard] Add static assets for speedscope v1.5.3 (#6822) 2020-01-17 20:53:53 -08:00
chaokunyang
4097d076d4
Package ray java jars into wheels (#6600) 2020-01-10 11:41:00 +08:00
Sven
60d4d5e1aa
Remove future imports (#6724)
* Remove all __future__ imports from RLlib.

* Remove (object) again from tf_run_builder.py::TFRunBuilder.

* Fix 2xLINT warnings.

* Fix broken appo_policy import (must be appo_tf_policy)

* Remove future imports from all other ray files (not just RLlib).

* Remove future imports from all other ray files (not just RLlib).

* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).

* Add two empty lines before Schedule class.

* Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.
2020-01-09 00:15:48 -08:00
Chaokun Yang
7bbfa85c66
[Streaming] Streaming data transfer java (#6474) 2019-12-22 10:56:05 +08:00
Philipp Moritz
4d71ab83cf
require packaging (#6517) 2019-12-17 12:01:14 -08:00
Philipp Moritz
afae8406da
Make sure numpy >= 1.16.0 is installed for fast pickling support (#6486)
* Make sure numpy >= 1.16.0 is installed

* Works for 1.15.4

* lint

* formatting

* update

* put check into the right place

* lint
2019-12-14 16:36:49 -08:00
alindkhare
76e678d775
[Serve] Added deadline awareness (#6442)
* [Serve] Added deadline awareness

Added deadline awareness while enqueuing a query
Using Blist sorted-list implementation (ascending order) to get queries according to their specified deadlines. [buffer_queues]
Exposed slo_ms via handle/http request
Added slo example 
The queries in example will be executed in almost the opposite order of which they are fired
Added slo pytest
Added check for slo_ms to not be negative
Included the changes suggested

* Linting Corrections

* Adding the code changes suggested by format.sh

* Added the suggested changes

Added justification for blist
Added blist in travis/ci/install-dependencies.sh

* Fixed linting issues

* Added blist to ray/doc/requirements-doc.txt
2019-12-11 16:41:54 -08:00
Chaokun Yang
6272907a57
[Streaming] Streaming data transfer and python integration (#6185) 2019-12-10 20:33:24 +08:00
Simon Mo
22b305223a
Build Docker Containers for Linux Wheels (#6233) 2019-11-27 17:05:36 -08:00
Philipp Moritz
decaa65cd6
Use pickle by default for serialization (#5978) 2019-11-10 18:12:18 -08:00
Philipp Moritz
f7455839bf
Expose raylet info to dashboard (#6045) 2019-10-31 17:36:59 -07:00
Ujval Misra
a851d7eb87
[tune] Readable trial progress output (#5822)
* Cleaner, tabulated progress output.

* Minor HTML changes, trial ID instead of name

* Revert basic variant changes

* Cleanup, address richard's comments, add progress_reporter.py

* Add tabulate dependency

* Added more info to table, auto-hide columns with no data.

* lint

* Address comments

* Replace experiment tag w/ trial ID

* Fixed tests.

* Fixed test

* Added requirement

* Fix formatting
2019-10-08 16:38:39 -07:00
Simon Mo
9bb3633cd9
[Serve] Implement metric interface (#5852)
* Implement metric interface

* Address comment: made actor_handles a dict

* Fix iteration

* Lint

* Mark lightweight actors as num_cpus=0 to prevent resource starvation

* Be more explicit about the readiness condition

* Make task_runner non-blocking

* Lint
2019-10-07 09:29:26 -07:00
Simon Mo
e8570874b6
[Serve] Implement flask_request and named python request (#5849)
* Implement flask_request and named python request

* Forgot to include missing files

* Address comment

* Add flask to requirements for doc (lint failed)

* Update doc requirement so lint will build

* Install flask in CI

* Fix typo in .travis.yml
2019-10-06 15:12:30 -07:00