This PR:
- Adds notes and an example on logging for Ray/K8s.
- Implements an API Reference page pointing to the configuration guide and the RayCluster CR definition.
- Takes managed K8s services out of the tabbed structure, to make that page look less sad.
- Adds a comparison of the KubeRay operator and the legacy K8s operator.
- Adds an architecture diagram for the autoscaling sections.
- Fixes some other minor items.
- Adds some info about networking to the configuration guide and removes the previously planned networking page.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
The tensor extension import is a bit expensive since it will go through Arrow's and Pandas' extension type registration logic. This PR delays the tensor extension type import until Parquet reading, which is the only case in which we need to explicitly register the type.
I have confirmed that the Parquet reading in doc/source/data/doc_code/tensor.py passes with this change.
This PR fixes several issues that block the Serve agent when the GCS is down. We need to make sure the Serve agent is always alive, so that external requests can reach it and check the status.
- The internal KV used in dashboard/agent blocks the agent. We use the async one instead.
- The Serve controller uses `ray.nodes`, which is a blocking call and blocks forever. Change it to use the GCS client with a timeout.
- The agent uses the Serve controller client, which is a blocking call with max retries = -1. This blocks until the controller is back.
To enable Serve HA, we also need to set:
- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5
which we should set in KubeRay.
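As a hedged sketch, these could be set in a KubeRay RayCluster spec roughly like this (the exact field placement is illustrative; consult the KubeRay CRD for the real schema):

```yaml
# Illustrative fragment of a RayCluster pod template:
# set the timeouts as container environment variables.
env:
  - name: RAY_gcs_server_request_timeout_seconds
    value: "5"
  - name: RAY_SERVE_KV_TIMEOUT_S
    value: "5"
```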
JobCounter does not currently work with storage namespaces because the key is the same across namespaces. This PR fixes it by adding the namespace to the key there, which is the minimal and therefore safest change.
A follow-up PR is needed to clean up Redis storage in C++.
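A minimal sketch of the fix's idea (the key format is illustrative, not the actual Redis key layout):

```python
def job_counter_key(storage_namespace: str) -> str:
    """Scope the counter key by namespace so two clusters sharing one
    Redis instance don't increment the same counter."""
    # Before the fix: a single shared key, identical across namespaces.
    unscoped = "JobCounter"
    # After the fix: prefix the key with the storage namespace.
    return f"{storage_namespace}:{unscoped}"
```

With the prefix, each namespace gets an independent counter even on shared storage.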
Objects freed by the manual, internal `free` call previously would not get reconstructed. This PR introduces the following semantics after a `free` call:
If no failure occurs and the object is needed by a downstream task, an ObjectFreedError will be thrown.
If a failure occurs, causing a downstream task to be re-executed, the freed object will get reconstructed as usual.
Also fixes some incidental bugs:
Don't crash on failure to contact the local raylet during object recovery. This produces a nicer error message because we instead throw an application-level error when someone tries to get the object.
Fix a circular lock dependency between task failure <> task dependency resolution.
Related issue number
Closes #27265.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Yi Cheng <chengyidna@gmail.com>
## Why are these changes needed?
This test times out. Move it to `large`.
```
WARNING: //python/ray/workflow:tests/test_error_handling: Test execution time (288.7s excluding execution overhead) outside of range for MODERATE tests. Consider setting timeout="long" or size="large".
```
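A hedged sketch of the corresponding BUILD change (the target name and attributes here are illustrative, not the actual BUILD file):

```starlark
# Illustrative: bump the test size so Bazel allows a longer timeout.
py_test(
    name = "tests/test_error_handling",
    size = "large",  # 288.7s is outside the range for MODERATE tests
    srcs = ["tests/test_error_handling.py"],
)
```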
As explained here, https://joekuan.wordpress.com/2015/06/30/python-3-__del__-method-and-imported-modules/, the `__del__` method doesn't guarantee that modules or function definitions are still referenced and not GC'ed. That means any modules, functions, or global variables you access may already have been garbage collected.
This means we should not access any modules, functions, or global variables inside a `__del__` method. While this is something we should handle more holistically in the near future, this PR fixes the issue in the short term.
The problem was that all Ray actor methods are decorated by trace_helper.py to make them compatible with OpenTelemetry (maybe we should make this optional). The `__del__` method is also decorated. When `__del__` is invoked, some of the functions used within this tracing decorator may already have been deallocated (in this case, `_is_tracing_enabled` was deallocated). This PR fixes the issue by not decorating the `__del__` method for tracing.
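A minimal, self-contained sketch of the idea of skipping `__del__` when wrapping a class's methods (names are illustrative; this is not Ray's actual trace_helper code):

```python
import functools
import inspect

def traced(func):
    """Illustrative stand-in for a tracing wrapper that touches globals."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # A real tracing wrapper would consult module-level state here,
        # which may already be GC'ed when __del__ runs at interpreter exit.
        return func(*args, **kwargs)
    return wrapper

def trace_class(cls):
    """Wrap the class's functions for tracing, but never __del__."""
    for name, attr in list(vars(cls).items()):
        if not inspect.isfunction(attr):
            continue
        if name == "__del__":
            continue  # __del__ may run during teardown; don't add wrappers
        setattr(cls, name, traced(attr))
    return cls

@trace_class
class Actor:
    def work(self):
        return "ok"
    def __del__(self):
        pass  # stays undecorated, so it touches no tracing globals
```

Here `work` is wrapped (it gains a `__wrapped__` attribute via `functools.wraps`) while `__del__` is left untouched.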
This PR mainly adds two improvements:
- We introduced support for three CloudWatch config types in previous PRs: Agent, Dashboard, and Alarm. In this PR, we generalize the logic across all three config types using the enum `CloudwatchConfigType`.
- It adds unit tests to ensure the correctness of the Ray autoscaler CloudWatch integration behavior.
One GC test has unnecessary sleeps which are quite expensive due to the parametrization (2 x 2 x 2 = 8 iterations). They are unnecessary because they check that garbage collection of runtime env URIs doesn't occur after a certain time, but garbage collection isn't time-based. This PR removes the sleeps.
This PR is just to fix CI; a followup PR will make the test more effective by attempting to trigger GC in a more targeted way (by starting multiple tasks with different runtime_env resources; GC is only triggered upon *creation* of a new resource that causes the cache size to be exceeded).
It's still not clear what exactly caused the test suite to start taking longer recently, but it might be due to some change elsewhere in Ray, since there were no runtime_env related commits in that time period.
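A minimal sketch of creation-triggered eviction, assuming a simple size-capped cache (this is not Ray's actual URI cache implementation):

```python
class URICache:
    """Evicts unused entries only when *adding* a new one would exceed
    the capacity; nothing is ever evicted on a timer."""
    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.entries = {}    # uri -> payload
        self.in_use = set()  # uris pinned by running tasks

    def add(self, uri: str, payload: object) -> None:
        while len(self.entries) >= self.max_entries:
            unused = [u for u in self.entries if u not in self.in_use]
            if not unused:
                break  # everything is pinned; allow temporary overflow
            self.entries.pop(unused[0])  # GC happens here, on creation
        self.entries[uri] = payload
```

Since eviction only runs inside `add`, sleeping and re-checking the cache can never observe a time-based GC, which is why the sleeps were unnecessary.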
The original link doesn't exist: https://docs.ray.io/en/master/_images/air-ecosystem.svg
I fixed it by linking to the raw GitHub file, which should preserve exactly the same flow as before. I tried finding a docs link to this image file, but I couldn't. I also couldn't find an easy way to add only a link (without embedding the image). Please let me know if you prefer another option.
PR ccf4116 makes cluster_utils.add_node take about 1 second longer because of the raylet start path refactoring.
As a result, test_placement_group_3 seems to hit an occasional timeout. I changed the test size to large. Let's see if this fixes the issue.
Adds the following tip to the install instructions:
Tip
If you are only editing Python files, follow instructions for Building Ray (Python Only) to avoid long build times.
If you already followed the instructions in Building Ray (Python Only) and want to switch to the Full build in this section, you will need to first delete the symlinks and uninstall Ray.
Why are these changes needed?
The Ray-level OOM killer preemptively kills a worker process when the node is under memory pressure. This PR leverages the memory monitor from #27017 and supersedes #26962 to kill worker processes when the system is running low on memory. The node manager implements the callback in the memory monitor and kills the worker process with the newest task. It evicts only one worker at a time and enforces this by tracking the last evicted worker. If the eviction is still in progress, it will not evict another worker even if memory usage is above the threshold.
This PR is a no-op since the monitor is disabled by default.
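A minimal sketch of the eviction policy described above, using hypothetical names (this is not the node manager's real interface):

```python
from dataclasses import dataclass

@dataclass
class Worker:
    pid: int
    task_start_time: float  # newest task = largest start time

class OOMKiller:
    """Kill the worker running the newest task; at most one kill in flight."""
    def __init__(self, workers):
        self.workers = list(workers)
        self.pending_eviction = None  # pid of the worker being killed, if any

    def on_memory_pressure(self):
        # If a previous eviction hasn't finished, don't kill another worker,
        # even if memory usage is still above the threshold.
        if self.pending_eviction is not None or not self.workers:
            return None
        victim = max(self.workers, key=lambda w: w.task_start_time)
        self.pending_eviction = victim.pid
        return victim.pid

    def on_worker_exited(self, pid):
        # The eviction completed; allow the next one.
        if self.pending_eviction == pid:
            self.pending_eviction = None
        self.workers = [w for w in self.workers if w.pid != pid]
```

Killing the newest task first minimizes lost work, and the single-eviction-in-flight rule prevents a cascade of kills before memory pressure readings catch up.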
In the previously merged PR (https://github.com/ray-project/ray/pull/22726/commits), Java Serve's support for Python deployments was not implemented. This PR implements that feature.
Co-authored-by: nanqi.yxf <nanqi.yxf@antgroup.com>
This change makes us report placement resources for actor creation tasks. Essentially, the resource model here is that a placement resource/actor creation task is a task that runs very quickly.
Closes #26806
Co-authored-by: Alex <alex@anyscale.com>
This PR fixes a reference counting bug for borrowed objects sent to an actor creation task that is then cancelled.
Before this PR, when actor creation is cancelled before the creation task has been scheduled, the GCS-based actor manager would destroy the actor without replying to the task submission RPC from the actor-creating worker, so the reference counts on that worker never got cleaned up. This caused us to leak borrowed objects when such cancellation-before-scheduling happened for actors.
This PR fixes this by ensuring that the task submission RPC receives a reply indicating that the actor creation task has been cancelled, at which point the submitting worker will run through the same reference counting cleanup as is done for normal task cancellation.
- Updates the jobs API
- Updates the snapshot API
- Updates the state API
- Increases the jobs API version to 2
Signed-off-by: Alan Guo aguo@anyscale.com
Why are these changes needed?
follow-up for #25902 (comment)
Signed-off-by: Nikita Vemuri nikitavemuri@gmail.com
Why are these changes needed?
Support printing a Ray dashboard URL that the user specifies through an environment variable. This can be helpful if the Ray dashboard is hosted externally.
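A minimal sketch of the behavior, assuming a hypothetical variable name (`RAY_OVERRIDE_DASHBOARD_URL` is used here for illustration only):

```python
import os

def dashboard_url(default_url: str) -> str:
    """Prefer a user-supplied URL (e.g. an externally hosted dashboard)
    over the locally detected one. The variable name is illustrative."""
    return os.environ.get("RAY_OVERRIDE_DASHBOARD_URL", default_url)
```

When the variable is unset, the locally detected URL is printed as before.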