To let other systems or internal projects reuse Ray's Bazel dependency functions, we need to change this local access style to global access through the ray-project namespace.
Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
The C++ API needs to call Python and Java workers; this PR adds support for calling Python workers. Calling a Python worker is similar to calling a C++ worker: the caller passes a PyFunction, PyActorClass, or PyActorMethod.
## Call a Python normal task
```python
# test_cross_language_invocation.py
import ray


@ray.remote
def py_return_input(v):
    return v
```
C++ API calling the Python function:
```c++
auto py_obj1 =
    ray::Task(ray::PyFunction</*ReturnType*/int>{
                  /*module_name=*/"test_cross_language_invocation",
                  /*function_name=*/"py_return_input"})
        .Remote(42);
EXPECT_EQ(42, *py_obj1.Get());
```
The user needs to fill in the Python module name and function name, then pass the arguments to `Remote`.
The user also needs to specify the return type and argument types of the Python function; these are used for compile-time type checking and for retrieving the result.
## Call a Python actor task
```python
# test_cross_language_invocation.py
@ray.remote
class Counter(object):
    def __init__(self, value):
        self.value = int(value)

    def increase(self, delta):
        self.value += int(delta)
        return str(self.value)
```
C++ API calling the Python actor:
```c++
// Create a Python actor.
auto py_actor_handle =
    ray::Actor(ray::PyActorClass{/*module_name=*/"test_cross_language_invocation",
                                 /*class_name=*/"Counter"})
        .Remote(1);
EXPECT_TRUE(!py_actor_handle.ID().empty());

// Call a Python actor task.
auto py_actor_ret =
    py_actor_handle
        .Task(ray::PyActorMethod</*ReturnType*/std::string>{/*actor_function_name=*/"increase"})
        .Remote(1);
EXPECT_EQ("2", *py_actor_ret.Get());
```
The user needs to fill in the Python module name and class name when creating a Python actor; PyActorMethod only needs the function name.
This mirrors calling a C++ actor task and likewise provides compile-time type checking.
GCS pubsub uses long polling, so when there is no buffered log the subscriber waits instead of returning None from the poll. This requires a different heuristic for deciding whether the driver is keeping up with logs from the worker.
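As an illustration only (the class and method names below are hypothetical, not Ray's actual internals), the difference is that a short-polling subscriber can treat an empty poll as a signal, while a long-polling subscriber blocks until data arrives and therefore has to judge staleness from its local backlog:

```python
import asyncio


# Hypothetical sketch: with long polling there is no "empty poll" result,
# so whether the driver keeps up must be judged from the local buffer.
class LogSubscriber:
    def __init__(self, max_buffered: int = 10_000):
        self.buffer = asyncio.Queue()
        self.max_buffered = max_buffered

    async def poll(self):
        # Long polling: block until a log batch is available instead of
        # returning None when the buffer is empty.
        return await self.buffer.get()

    def driver_lagging(self) -> bool:
        # Heuristic: the driver is not keeping up if the backlog of
        # buffered-but-unconsumed log batches grows past a threshold.
        return self.buffer.qsize() > self.max_buffered
```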
This PR adds four tests for many tasks:
- many short tasks sent from a single node
- many short tasks sent from multiple nodes
- many long tasks sent from multiple nodes
- many long tasks sent from a single node

TODO: migrate the many-nodes actor tests into this one.
The scheduling envelope should contain:
- (tasks): scheduling_test_many_xx_tasks_yy_nodes
- (actors): many_nodes_actor_test (to be combined with this one)
- (shuffle): pipelined_ingestion_1500_gb_15_windows
- (shuffle): dask_on_ray_1tb_sort
Making some minor fixes.
1. Update the input `batch_size` to be the global batch size, and introduce `worker_batch_size` so that each iteration trains on the same global batch size regardless of the number of workers (see the sketch after this list).
2. Update the dataset `size` calculation to refer only to the fraction of the data trained on each worker, so that derived calculations (e.g. training progress, accuracy) are correct.
3. Add `model.train()` for generality.
4. Remove `smoke-test` flag since it's not really being used.
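As a rough sketch of the batch-size change (the helper names below are illustrative, not the exact ones in the training script), the global `batch_size` is split across workers so one iteration across all workers still covers one global batch:

```python
import math


def get_worker_batch_size(global_batch_size: int, num_workers: int) -> int:
    # Each of the num_workers processes consumes this many samples per step,
    # so one step across all workers corresponds to one global batch.
    return math.ceil(global_batch_size / num_workers)


def get_worker_dataset_size(total_size: int, num_workers: int) -> int:
    # Only count the shard this worker actually trains on, so progress and
    # accuracy computed from it are correct.
    return total_size // num_workers


worker_batch_size = get_worker_batch_size(global_batch_size=128, num_workers=4)  # 32
```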
- Tolerate gRPC deadline-exceeded and transient failures in Python GCS AIO subscribers, making them consistent with the Python GCS synchronous subscribers (see the sketch below).
- Tolerate any exception in the dashboard when subscribing to logs and error info, making it consistent with how the dashboard handles gRPC errors when obtaining node stats.
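A minimal sketch of what "tolerate" could look like in an asyncio subscriber loop (the subscriber and handler shown here are hypothetical; Ray's actual subscriber classes differ):

```python
import asyncio
import grpc


async def poll_forever(subscriber, handler):
    # Hypothetical loop: deadline-exceeded and other transient gRPC failures
    # are tolerated (retried) instead of being propagated to the caller.
    while True:
        try:
            handler(await subscriber.poll())
        except grpc.RpcError as e:
            if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
                continue  # expected with long polling; poll again
            print(f"Transient subscriber error, retrying: {e}")
            await asyncio.sleep(1)
```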
This PR is pre-work before actually fixing a thread-safety bug within shutdown.
It does the following:
- Add better logging upon core worker shutdown.
- Improve document around core worker shutdown.
- Remove unnecessary pointer usage from periodical runner for clean destruction order.
- Remove the unnecessary `WaitForShutdown` API and combine it into a single `Shutdown` API.
On machines without GPUs, the GPU usage check can run subprocesses that spew to stderr. Then, with log_to_driver=True, we get log spew from every single raylet. To avoid this, disable the GPU usage check after certain errors.
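A hedged sketch of the idea (the helper below is hypothetical, not Ray's actual implementation): probe for GPUs in a subprocess, and if the probe errors out, disable further checks so the stderr spew does not repeat on every raylet:

```python
import subprocess

_gpu_check_disabled = False


def probe_num_gpus() -> int:
    # Hypothetical probe: ask nvidia-smi for the GPU list. On machines
    # without GPUs this can fail noisily, so remember the failure and
    # skip the check from then on.
    global _gpu_check_disabled
    if _gpu_check_disabled:
        return 0
    try:
        out = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True,
            text=True,
            check=True,
        )
        return len(out.stdout.splitlines())
    except (OSError, subprocess.CalledProcessError):
        _gpu_check_disabled = True  # disable the check on these errors
        return 0
```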
Resolves #14305
Co-authored-by: hauntsaninja <>
When a node is dead, the reference table should remove the locations of the objects on that node. Otherwise, locality-aware scheduling will schedule tasks onto the dead node.
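A minimal sketch of the intended behavior (the data structure below is illustrative, not the actual reference table): when a node dies, drop it from every object's location set so locality-aware scheduling no longer targets it:

```python
from collections import defaultdict

# Illustrative reference table: object ID -> set of node IDs holding a copy.
object_locations = defaultdict(set)


def on_node_dead(dead_node_id: str) -> None:
    # Remove the dead node from every object's location set so that
    # locality-aware scheduling never places tasks on it.
    for locations in object_locations.values():
        locations.discard(dead_node_id)
```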
This is a minimum viable product for Ray Autoscaler integration with Kuberay. It is not ready for prime time/general use, but should be enough for interested parties to get started (see the documentation in kuberay.md).
I updated this version compatibility table on the release branch but didn't update it on master. This was my mistake; the process is to make a PR to master and then cherry-pick that commit to the release branch.
GCS, when running as an individual component, can cause other components to fail when it crashes.
Here are two main cases covered in this patch:
1. monitor.py will raise an exception when disconnected from GCS.
2. When GCS becomes available later than other components, the GCS address missing from the KV store can cause those components to fail to start.
This patch fixes these two issues and also increases the Redis connection timeout, which was too small.
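A rough sketch of a fix for the second case, under stated assumptions (the function below is hypothetical; kv_get stands in for the internal KV accessor): retry fetching the GCS address with a bounded timeout instead of failing immediately when GCS starts late:

```python
import time


def wait_for_gcs_address(kv_get, key="gcs_address", timeout_s=60.0, interval_s=1.0):
    # Hypothetical helper: kv_get(key) returns None until GCS has published
    # its address. Retry until the timeout instead of failing immediately
    # when GCS starts later than this component.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        value = kv_get(key)
        if value is not None:
            return value
        time.sleep(interval_s)
    raise TimeoutError(f"GCS address not available after {timeout_s}s")
```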
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
An object can get created/pinned twice if the original worker fails mid-task, or when lineage reconstruction is enabled. This can cause inconsistencies in the LocalObjectManager if the second creation races with object spilling and/or object free. For example:
1. Object X gets created, then is pending spill.
2. Object X is freed by original owner because it goes out of scope.
3. Task that created X gets re-executed due to failure.
4. Task recreates X, which can now get spilled again while the original copy is also being spilled/freed.
This PR better enforces the state machine for objects managed by the LocalObjectManager. An object can be in one of three states: pinned, pending spill, or spilled. If we receive a free message from the owner, we do not delete the object metadata until all shared-memory and spilled copies of the object are removed.
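A simplified sketch of the enforced states (Python pseudocode; the real LocalObjectManager is C++ and tracks more detail). The key point is that a free request only marks the object, and its metadata is dropped only once no shared-memory or spilled copy remains and no spill is in flight:

```python
from enum import Enum


class ObjectState(Enum):
    PINNED = 1         # primary copy pinned in shared memory
    PENDING_SPILL = 2  # spill in progress; cannot be deleted yet
    SPILLED = 3        # a copy exists on external storage


class TrackedObject:
    def __init__(self):
        self.state = ObjectState.PINNED
        self.freed_by_owner = False
        self.has_memory_copy = True
        self.has_spilled_copy = False

    def on_free(self):
        # The owner freed the object, but metadata stays until every copy is gone.
        self.freed_by_owner = True

    def can_delete_metadata(self) -> bool:
        return (self.freed_by_owner
                and self.state != ObjectState.PENDING_SPILL
                and not self.has_memory_copy
                and not self.has_spilled_copy)
```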
Use a separate event loop for pubsub work, to provide some isolation from other workloads. There is no benchmark result, but the downside, if any, should not be large.
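A minimal sketch of the pattern (not the actual dashboard code): run pubsub coroutines on their own event loop in a dedicated thread, isolated from the main loop's workload:

```python
import asyncio
import threading

# Dedicated event loop for pubsub work, running in its own daemon thread.
pubsub_loop = asyncio.new_event_loop()
threading.Thread(target=pubsub_loop.run_forever, daemon=True).start()


async def subscribe_logs():
    # Placeholder pubsub coroutine; the real subscriber polls GCS.
    while True:
        await asyncio.sleep(1)


# Schedule pubsub work onto the isolated loop from any thread.
future = asyncio.run_coroutine_threadsafe(subscribe_logs(), pubsub_loop)
```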
This is the first migration of tests into k8s. We are adopting a conservative approach (migrating slowly while keeping the existing test suites). Once things are confirmed to be stable, we will migrate faster.
This fixes the previous problems from the team column revert.
It has two additional changes:
- The alert handler now receives the team argument, which was the root cause of the breakage: https://github.com/ray-project/ray/pull/21289
- Previously, tests without a team column raised an exception; I weakened that condition to a warning log. I will eventually change it back to raising an exception, but for a smoother transition we will only log a warning for a short time.
Prior to this PR, sort, groupby, and aggregate defined separate types for extracting values from Dataset records. This was confusing since the user had to understand the differences between the different key types (which were essentially identical).
This PR defines a common key type: KeyFn, which is simply Union[None, str, Callable[[T], Any]]. This is used as sort(KeyFn...), aggregate(Agg(KeyFn)...), groupby(KeyFn).agg(Agg(KeyFn), ...).
It also unifies the error generation paths to a common _validate_key_fn utility. This also improves the errors generated when passing explicit AggregateFn classes, which previously failed in the workers if invalid.
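A sketch of the unified key type and the validation path (simplified from the actual Datasets code; the real `_validate_key_fn` also takes the dataset in order to check column names):

```python
from typing import Any, Callable, TypeVar, Union

T = TypeVar("T")

# Common key type shared by sort(), groupby(), and aggregate().
KeyFn = Union[None, str, Callable[[T], Any]]


def _validate_key_fn(key: KeyFn) -> None:
    # Simplified validation: accept None, a column name, or a callable,
    # and raise early on the driver instead of failing later in the workers.
    if key is None or isinstance(key, str) or callable(key):
        return
    raise TypeError(f"Invalid key function: {key!r}")
```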
When a session startup times out due to resources not being available, the session may still come up after that timeout. At that time the control script (e2e.py) is already terminated, so the session runs until the autosuspend limit is hit, incurring unnecessary costs. Instead, we should always trigger session termination on session timeout.
See #21458. Currently, Tune keeps its own list of alive node IPs, but this information is only updated every 10 seconds and is usually stale when a new node is added. Because of this, the first trial scheduled on this node is usually marked as failed. This PR adds a test confirming this behavior and gets rid of the unneeded code path.
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
In test_client_reconnect.py, each test case starts a Ray cluster via the client server's default_connect_handler(). The Ray cluster shuts down implicitly when start_middleman_server() ends and Python GCs the client server. After turning on GCS pubsub, the time at which the client server is GC'ed changes. Sometimes the Ray cluster from a previous test case stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (a race during worker shutdown, to be investigated separately).
This PR makes sure each test case shuts down its Ray cluster.
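A minimal sketch of the pattern with a generic pytest fixture (the actual test uses its own client-server helpers): make cluster shutdown explicit per test case instead of relying on garbage collection:

```python
import pytest
import ray


@pytest.fixture
def ray_cluster():
    # Each test gets its own cluster and an explicit shutdown, instead of
    # relying on the client server being garbage-collected at some later time.
    ray.init()
    yield
    ray.shutdown()


def test_something(ray_cluster):
    assert ray.is_initialized()
```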
In Python or C++, we can specify the bundle index as -1 to use any available bundle in the placement group. We should also enable this in Java to keep the API consistent across all languages.
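For reference, a hedged sketch of how the Python API expresses "any available bundle" (the exact option names may differ across Ray versions):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=2)


@ray.remote(num_cpus=1)
def f():
    return "ok"


pg = placement_group([{"CPU": 1}, {"CPU": 1}])
ray.get(pg.ready())

# Bundle index -1 means "use any available bundle in the placement group".
ref = f.options(
    placement_group=pg,
    placement_group_bundle_index=-1,
).remote()
print(ray.get(ref))
```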