hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Matti Picus	6c6c76c3f0	Starting workers map (#20986) PR #19014 introduced the idea of a StartupToken to uniquely identify a worker via a counter. This PR: - Returns the Process and the StartupToken from StartWorkerProcess (previously only the Process was returned). - Changes the starting_workers_to_tasks map to index via the StartupToken, which seems to fix the Windows failures. - Unskips the Windows tests in test_basic_2.py. It seems that once a fix to PR #18167 goes in, the starting_workers_to_tasks map will be removed, which should remove the need for the changes to StartWorkerProcess made in this PR.	2021-12-12 19:28:53 -08:00
Seonggwon Yoon	f1acabe9cf	Bump log4j from 2.14.0 to 2.15.0 (#21036) Fix remote code injection in Log4j. Log4j versions prior to 2.15.0 are subject to a remote code execution vulnerability via the LDAP JNDI parser. See [CVE-2021-44228](https://github.com/advisories/GHSA-jfh8-c2jp-5v3q).	2021-12-12 15:07:50 +08:00
Yi Cheng	f4e6623522	Revert "Revert "[core] Ensure failed to register worker is killed and print better log"" (#21028 ) Reverts ray-project/ray#21023 Revert this one since `7fc9a9c227` has fixed the issue	2021-12-11 20:49:47 -08:00
Sven Mika	db058d0fb3	[RLlib] Rename `metrics_smoothing_episodes` into `metrics_num_episodes_for_smoothing` for clarity. (#20983 )	2021-12-11 20:33:35 +01:00
Sven Mika	596c8e2772	[RLlib] Experimental no-flatten option for actions/prev-actions. (#20918 )	2021-12-11 14:57:58 +01:00
Eric Liang	6f93ea437e	Remove the flaky test tag (#21006 )	2021-12-11 01:03:17 -08:00
mwtian	3028ba0f98	[Core][GCS] add feature flag for GCS bootstrapping, and flag to pass GCS address to raylet (#21003 )	2021-12-10 23:48:37 -08:00
Jiajun Yao	f04ee71dc7	Fix driver lease request infinite loop when local raylet dies (#20859) Currently, if a local lease request fails due to raylet death, direct_task_transport.cc will retry forever for the driver. With this PR, we treat grpc unavailable as a non-retryable error (the assumption is that local grpc is always reliable and a grpc unavailable error indicates that the server is dead) and will just fail the task. Note: this PR doesn't try to address a bigger problem: don't crash the driver when the local raylet dies. We have multiple places in the code that assume the local raylet never fails and have CHECK_STATUS_OK for that. All these places need to be changed so we can properly propagate failures to the user.	2021-12-10 18:02:59 -08:00
mwtian	b9bcd6215a	Disable two tests that are very flaky in GCS HA build (#21012 ) `//python/ray/tests:test_client_reconnect` seems to only flake under GCS HA build. The client server starts to shutdown under injected failures, unlike the behavior without GCS KV or pubsub. `//python/ray/tests:test_multi_node_3` seems to flake more often under GCS HA build, although it is still flaky without GCS HA feature flags. It seems raylet termination did not notify other processes properly. Disable these two tests before they are fixed.	2021-12-10 17:08:25 -08:00
Qing Wang	a3bf1af10e	[core] Fix the risk of an iterator invalidation issue. (#20989) We erase elements from object_id_refs_ in the method `RemoveLocalReferenceInternal()`, which may cause an iterator invalidation issue. Note that normally flatmap will not trigger any iterator invalidation except when triggering `rehash()`. But in this case we may remove other elements (not only the one at the current iterator), so there is still a risk of it.	2021-12-10 15:59:16 -08:00
Stephanie Wang	3a5dd9a10b	[core] Pin object if it already exists (#20447 ) A worker can crash right after putting its return values into the object store. Then, the owner will receive the worker crashed error, but the return objects will still be in the remote object store. Later, if the task is retried, the worker will crash on [this line](https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/direct_actor_transport.cc#L105) because the object already exists. Another way this can happen is if a task has multiple return values, and one of those return values is transferred to another node. If the task is later re-executed on that node, the task will fail because of the same error. This PR fixes the crash so that: 1. If an object already exists, we try to pin that copy. Ideally, we should destroy the old copy and create the new one to make sure that metadata like the owner address is in sync, but this is pretty complicated to do right now. 2. If the pinning fails, we store an OBJECT_LOST error to throw to the application. 3. On the raylet, we check whether we already have the object pinned, and only subscribe to the owner's eviction message if the object is not pinned. 4. Also fixes bugs in the analogous case for `ray.put` (previously this would hang, now the application will receive an error if a `ray.put` object already exists).	2021-12-10 15:56:43 -08:00
Yi Cheng	6280bc4391	Revert "[core] Ensure failed to register worker is killed and print better log" (#21023 ) `linux://python/ray/tests:test_runtime_env_complicated` looks flaky after this pr. Reverts ray-project/ray#20964	2021-12-10 14:57:32 -08:00
mwtian	6871a72a5c	[Core][Dashboard Pubsub 3/n] Migrate pubsub usages in dashboard to GCS pubsub (#20860 ) Add support for Ray pubsub in dashboard. https://github.com/ray-project/ray/pull/20954 is the prerequisite, and contains more complete change under src/.	2021-12-10 14:36:57 -08:00
architkulkarni	7fc9a9c227	[CI] bump tasks_finish_quickly timeout from 1s to 2s (#21015 )	2021-12-10 14:16:12 -08:00
Chris K. W	27665fdf29	[client][test] skip test_valid_actor_state_2 (#20995 ) * skip test_valid_actor_state_2 * rerun	2021-12-10 10:58:42 -08:00
xwjiang2010	46d2f2c160	[release test] Update torch_tune_serve test to be compatible with new TrialCheckpoint class. (#21010 )	2021-12-10 17:26:15 +00:00
Sven Mika	f814c2af89	[RLlib; Docs] Docs API reference pages: `rllib/execution`, `rllib/evaluation`, `rllib/models`, `rllib/offline`. (#20538 )	2021-12-10 09:41:29 +01:00
mwtian	d751a242d8	[Core][Dashboard Pubsub 2/n] Add resource reporter and actor to Python GCS pubsub (#21001 ) Dashboard contains resource reporter and actor subscribers. Dashboard agent has resource report publisher. So GCS pubsub needs to support these channel types. Also refactor GCS AIO subscribers to have each subscriber per channel. This matches the API of GCS sync subscribers, and make subscribing with multiple channels easier.	2021-12-09 23:10:10 -08:00
Yi Cheng	4e0de0053d	[nightly] Add staging nightly test for gcs ha (#21004) This PR adds four staging nightly tests for gcs: - many_actors - many_tasks - many_pgs - many_nodes These are benchmark tests that are highly related to gcs ha. To make it easier to add tests, this PR also changes e2e.py slightly to include testing flags in the app config.	2021-12-09 23:07:23 -08:00
Yi Cheng	2ed5b1ee07	[2/gcs-mem-kv] Use memory store client when flag is set (#20931) This is part of redis removal. In this PR, if `RAY_gcs_storage=memory`, it'll use the memory table instead of the redis table. The config setup has to be moved into GcsServer because with the memory table it's transient.	2021-12-09 22:41:05 -08:00
Jules S. Damji	065786b7fe	[docs] Make design pattern example self contained (#20981 ) Signed-off-by: Jules S.Damji jules@anyscale.com Why are these changes needed? The code snippet referenced a python function that was not defined, therefore the code snippet as is won't work. All complete or self-contained code in our docs should run. The changes made were adding the undefined function, iterating over a list of different random large arrays to show the difference between local or distributed sort's execution time, and print them. Closes #20960	2021-12-09 20:19:38 -08:00
Jiajun Yao	5b21bc5c93	[BUILD] Use bazel-skylib rule to check bazel version (#20990 ) Bazel has a rule for enforcing the version so we can just reuse that. This redundant bazel version check logic in setup.py is also causing issue when building conda package, because conda has its own version of bazel and it doesn't support `--version`.	2021-12-09 15:25:22 -08:00
mwtian	2410ec5ef0	[Core][Dashboard Pubsub 1/n] Allow a channel to have subscribers to a key and to the whole channel concurrently (#20954 ) For actor channel, GCS clients subscribe to a single actor but dashboard subscribes to all actors. This change makes supporting this possible. Most of the added code is in `integration_test.cc`, which tests the publisher and subscriber together. Also, add the basic support for dashboard reporter pubsub.	2021-12-09 15:00:38 -08:00
SangBin Cho	f4d46398f7	[Internal Observability] [Part 2] Share the same code for RecordMetrics & DebugString for cluster task manager. (#20958) Share the same code for RecordMetrics & DebugString for cluster task manager. Both require almost identical (and expensive) operations. This PR makes them share the same `UpdateState` code, which stores stats in the struct. Note that we don't update state when metrics are recorded because the debug string is called consistently anyway, which updates the state. Ideally, we should dynamically update the stats.	2021-12-09 14:24:33 -08:00
SangBin Cho	05a302b468	[Internal Observability] [Part 3] Support debug state metrics on all components. (#20957) This PR adds RecordMetrics and DebugString to all raylet components. Some of the methods are probably empty for now. They will be supported in the next PR.	2021-12-09 14:24:15 -08:00
Chen Shen	d0e79a36f9	[chaos-test] chaos test pipeline ingestion (#20929 ) since it has been passing my test run; i'll land it and mark it as unstable.	2021-12-09 13:43:00 -08:00
Chen Shen	6a274dfd76	[CI][Chaos-test] chaos test now can set max-nodes-to-kill (#20962)	2021-12-09 13:41:46 -08:00
Kai Fricke	97ec2a03b6	[ci/buildkite] Add ml pipeline to speed up ML/RLLib tests (#20895 ) ML tests will be built in a separate bootstrap step installing all required dependencies.	2021-12-09 21:14:10 +00:00
Yi Cheng	83c639ea76	[core] Ensure failed to register worker is killed and print better log (#20964) Before this PR, when the raylet notices there is something wrong with the worker starting, it starts a new worker but does not kill the old one. If the old one is hanging, this leads to resource waste. This PR kills the failed worker if it's still alive and also prints useful logs.	2021-12-09 12:37:39 -08:00
kk-55	9acf2f954d	[RLlib] Example containing a proposal for computing an adapted (time-dependent) GAE used by the PPO algorithm (via callback on_postprocess_trajectory) (#20850 )	2021-12-09 14:48:56 +01:00
Tomasz Wrona	39c202fa66	[RLlib] Allow extra keys in info in multi-agent (#20793 )	2021-12-09 14:44:33 +01:00
Carlo Grisetti	a8286c55af	[RLLib] Fix deprecated convert_to_non_torch_type (#20751 )	2021-12-09 14:42:12 +01:00
Avnish Narayan	6996eaa986	[RLlib] Add necessary fields to Base Envs, and BaseEnv wrapper classes (#20832 )	2021-12-09 14:40:40 +01:00
chenk008	8bb9bfe632	[Core] Add metrics: worker_register_time_ms (#20472) I have recently been benchmarking worker registration with workers running in containers. Currently, Ray core has the `process_startup_time_ms` metric, which measures process fork time. This PR adds a metric for the duration of worker registration.	2021-12-09 21:25:49 +08:00
Sven Mika	63db0e3a7c	[RLlib] Fix SAC learning test flakiness introduced in PR: "Sub-class `Trainer` (instead of `build_trainer()`): All remaining classes; soft-deprecate `build_trainer`." (#20985 )	2021-12-09 14:24:27 +01:00
Yi Cheng	f7b0b872f9	[1/kv-regression] Put KV into a dedicated thread pool (#20922) After moving internal kv to grpc, there is a regression in actor launching performance. This PR moves the work from the main thread to a dedicated thread for internal kv to mitigate it. Co-authored-by: Eric Liang <ekhliang@gmail.com>	2021-12-09 00:21:47 -08:00
Jiajun Yao	655cc584a9	[Scheduler] Support per task/actor SpreadSchedulingStrategy (#20972 ) This PR adds per task/actor SpreadSchedulingStrategy which will try to spread the tasks on a best effort basis.	2021-12-08 22:22:07 -08:00
Eric Liang	22ccc6b300	Initial stats framework for datasets (#20867 ) This adds an initial Dataset.stats() framework for debugging dataset performance. At a high level, execution stats for tasks (e.g., CPU time) are attached to block metadata objects. Datasets have stats objects that hold references to these stats and parent dataset stats (this avoids stats holding references to parent datasets, allowing them to be gc'ed). Similarly, DatasetPipelines hold stats from recently computed datasets. Currently only basic ops like map / map_batches are instrumented. TODO placeholders are left for future PRs.	2021-12-08 16:13:57 -08:00
SangBin Cho	5298a9046c	[Internal Observability] [Part 1] Centralize existing metrics to metric_defs.h (#20728 ) This PR centralizes all existing metrics to `metric_defs.h`. Previously, each file relies on implicit import of metric_def.h within the stats module. After this PR we only precisely import `metric_defs.h` for each file.	2021-12-08 14:06:05 -08:00
Antoni Baum	4c47f56b61	[tune] Add random state to `BasicVariantGenerator` (#20926 ) This PR adds an ability to set a random seed/numpy random generator in BasicVariantGenerator, allowing for reproducibility across separate runs. All the changes are fully backwards compatible. Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2021-12-08 20:15:53 +00:00
architkulkarni	5593819135	Revert "Revert "[runtime env] Allow working_dir and py_module to be Path type"" (#20853 )	2021-12-08 11:17:19 -08:00
Yi Cheng	442b1025cd	[1/gcs-mem-kv] Memory mode for internal kv (#20881) This is part of the redis removal work. In this PR we introduce a new mode for internal kv, memory mode. There are two ways to address this: 1) Update the store client and use the store client in internal kv. 2) Add a memory table into internal kv directly. The former is actually the better choice since it puts everything related to storage into a lower level. But it's pretty hard to do this now, since internal kv uses hset/hget and the redis store client uses set/get, so the data would not be compatible and it would be a breaking change. So the easier way is 2), which is what this PR does. Next: use the flag for the store client.	2021-12-08 10:40:35 -08:00
architkulkarni	78cd377775	[Serve] Bump test_cluster from small to medium (#20942 )	2021-12-08 09:58:27 -08:00
Jiajun Yao	5b168a1515	[Scheduler] Support per task/actor PlacementGroupSchedulingStrategy (#20507) This PR adds a per task/actor scheduling strategy; currently the only strategies are PlacementGroupSchedulingStrategy and DefaultSchedulingStrategy. Going forward, people should use `scheduling_strategy=PlacementGroupSchedulingStrategy` to define the placement group for an actor/task. The old way will be deprecated.	2021-12-07 23:11:31 -08:00
Lixin Wei	96dc10a95a	[Core] Fix Crash in ObjectDirectory (#20540) Here we hit a crash in the RAY_CHECK at line 446: `d26c9e67e8/src/ray/object_manager/ownership_based_object_directory.cc (L441-L450)` We found that it happens because we didn't set the node_id for dead nodes. If there are dead nodes and we try to LookupRemoteConnectionInfo for one of them, the crash occurs. This PR fixes this crash.	2021-12-07 23:03:49 -08:00
Stephanie Wang	1b9c03adb3	[core] Remove spammy code in object directory client (#20838 ) * log * remove * fix * fix * x * x	2021-12-07 19:51:44 -08:00
Jiajun Yao	6a07f03a6a	[Test] Fix flaky test_failure_2.py (#20949 ) There is a race condition between `DestoryActor` (due to handle out of scope) and `CreateActor`. If `DestoryActor` happens first, then `CreateActor` will fail complaining about not being able to find the registered actor.	2021-12-07 19:44:16 -08:00
Flamur Gogolli	3ca10ccc47	Textual correction on TLS Authentication (#20935 ) Correct wording on the TLS Authentication section of the configure.rst page.	2021-12-07 19:05:16 -08:00
Hankpipi	67518bdc50	[serve] Reconfiguration bug fix (#20315 ) As described in #18884, reconfiguration will mutate state mid-query. I try to solve this problem by adding read/write lock to each replica. Co-authored-by: yuzihao.2001 <yuzihao.2001@bytedance.com>	2021-12-07 18:53:45 -08:00
Jiajun Yao	2208cf7672	[Ray Client] Pickle task options for ray client (#20930 ) We can just pickle task options instead of json so that we don't need to write custom `to_dict` and `from_dict` methods for complex python option objects (e.g. PlacementGroup).	2021-12-07 17:07:19 -08:00
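
The iterator-invalidation fix listed above (a3bf1af10e, #20989) concerns erasing entries from a hash map while it is being iterated. The hazard can be sketched in Python as an analogy only: the real change is in Ray's C++ code, and the names `refs`, `doomed`, `remove_unsafe`, and `remove_safe` below are hypothetical and do not appear in Ray.

```python
def remove_unsafe(refs, doomed):
    """Erase entries while iterating the live dict.

    Python detects the size change and raises RuntimeError on the next
    iteration step; a C++ container would instead leave a dangling iterator.
    """
    for key in refs:
        if key in doomed:
            del refs[key]


def remove_safe(refs, doomed):
    """Iterate over a snapshot of the keys so erasing never affects the loop."""
    for key in list(refs):
        if key in doomed:
            del refs[key]
    return refs


refs = {"a": 1, "b": 2, "c": 3}  # hypothetical reference table

try:
    remove_unsafe(dict(refs), {"a"})
    unsafe_failed = False
except RuntimeError:
    unsafe_failed = True

remaining = remove_safe(dict(refs), {"a", "c"})
```

Taking a snapshot costs an extra copy of the keys; collecting the keys to erase first and deleting them after the loop is an equivalent alternative.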

10883 commits