hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Simon Mo	a081579f68	[Dashboard] Fix gRPC GCS healthcheck thread (#19360 )	2021-10-18 13:18:06 -07:00
Eric Liang	1bb2b1fc49	[hotfix] Pin pyspark dep to 3.1.2	2021-10-18 13:10:06 -07:00
Jiajun Yao	4d9585773f	[Release] Remove release process doc (#19312 )	2021-10-18 11:24:03 -07:00
Yi Cheng	f47f69d31e	[nightly] Add decision_tree_autoscaling_20_runs to nightly test	2021-10-18 11:19:40 -07:00
Kai Fricke	ad94eb03c6	[ci/release] wrap pip github installs in quotation marks to prevent comment errors (#19464 )	2021-10-18 18:55:56 +01:00
mwtian	9742abb749	[Debugging] Print Python stack trace in addition to C++ stack trace, when Python worker crashes (#19423 ) Why are these changes needed? Right now the failure signal handler registered in Python worker is skipped on crashes like segfault, because C++ core worker overrides the failure signal handler here and does not call the previously registered handler. This prevents Python stack trace from being printed on crashes. The fix is to make the C++ fault signal handler to call the previous signal handler registered in Python. For example with the script below which segfaults, import ray ray.init() @ray.remote def f(): import ctypes; ctypes.string_at(0) ray.get(f.remote()) Ray currently only prints the following stack trace: (pid=26693) * SIGSEGV received at time=1634418743 * (pid=26693) PC: @ 0x7fff203d9552 (unknown) _platform_strlen (pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: * SIGSEGV received at time=1634418743 * (pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: PC: @ 0x7fff203d9552 (unknown) _platform_strlen With this change, Python stack trace will be printed in addition to the stack trace above: (pid=26693) Fatal Python error: Segmentation fault (pid=26693) (pid=26693) Stack (most recent call first): (pid=26693) File "/Users/mwtian/opt/anaconda3/envs/ray/lib/python3.7/ctypes/__init__.py", line 505 in string_at (pid=26693) File "stack.py", line 7 in f (pid=26693) File "/Users/mwtian/work/ray-project/ray/python/ray/worker.py", line 425 in main_loop (pid=26693) File "/Users/mwtian/work/ray-project/ray/python/ray/workers/default_worker.py", line 212 in <module> This should make debugging crashes in Python worker easier, for users and Ray devs. Also, try to initialize symbolizer in GCS, Raylet and core worker. This is a no-op on MacOS and some Linux environments (e.g. Ray on Ubuntu 20.04 already produces symbolized stack traces), but should make Ray more likely to have symbolized stack traces on other platforms.	2021-10-18 09:05:08 -07:00
Kai Fricke	eee05505b1	[ci/release] Add separate timeout parameter for prepare commands (#19459 )	2021-10-18 16:29:25 +01:00
Kai Fricke	57fe405120	[ci/release] Bump long running release test timeouts to 6 minutes (#19458 )	2021-10-18 16:27:53 +01:00
Chen Shen	9dba5e0ead	[dataset][nightly-test] fix pipeline ingest test (#19437 )	2021-10-18 11:31:24 +01:00
Kai Fricke	6c6639a0d7	[ci/release] hotfix for undefined local variable (#19460 )	2021-10-18 11:28:33 +01:00
matthewdeng	caa42d753c	[release] pin modin>=0.11.0 due to ray.services being removed (#19446 )	2021-10-18 11:23:05 +01:00
Kai Fricke	c10d434713	[release] Allow commit hashes instead of URLs, add bisection utility (#19398 )	2021-10-18 10:44:29 +01:00
Guyang Song	c04fb62f1d	[C++ worker] set native library path for shared library search (#19376 )	2021-10-18 16:03:49 +08:00
Qing Wang	1047914ee0	[Java] Skip javadoc when deploying. (#19428 )	2021-10-17 15:21:13 +08:00
Hao Zhang	c96c2e9b5f	[Collective] Enhance the collective group GC a bit (#19402 )	2021-10-15 18:47:54 -07:00
Yi Cheng	a3dc07b1ee	[core] Fix some legacy issues (#19392 ) ## Why are these changes needed? There are some issues left from previous PRs. - Put the gcs_actor_scheduler_mock_test back - Add comment for named actor creation behavior - Fix the comment for some flags. ## Related issue number	2021-10-15 18:06:01 -07:00
Chen Shen	a9c34d55e3	Throw if infinite (#19418 )	2021-10-15 18:01:53 -07:00
Gagandeep Singh	d226cbf21a	Added StartupToken to idenitfy a process at startup (#19014 ) * Added StartupToken to idenitfy a process at startup * Applied linting formats * Addressed reviews * Fixing worker_pool_test * Fixed worker_pool_test * Applied linting formatting * Added documentation for StartupToken * Fixed linting * Reordered initialisation of WorkerPool members * Fixed Python docs * Fixing bugs in cluster_mode_test * Fixing Java tests * Create and set shim process after verifying startup_token * shim_process.GetId() -> worker_shim_pid * Improvements in startup token and modifying java files * update io_ray_runtime_RayNativeRuntime.h * Fixed java tests by adding startup-token to conf * Applied linting * Increased arg count for startup_token * Attempt to fix streaming tests * Type correction * applied linting * Corrected index of startup token arg * Modified, mock_worker.cc to accept startup tokens * Applied linting * Applied linting changes from CI * Removed override from worker.h * Applied linting from scripts/format.sh * Addressed reviews and applied scripts/format.sh * Applied linting script from ci/travis * Removed unrequired methods from public scope * Applied linting	2021-10-15 15:13:13 -07:00
Chen Shen	acfbf4c170	Fix from Dask bug in Datasets (#19409 )	2021-10-15 15:04:52 -07:00
Gagandeep Singh	07064cddf9	Re-enabling tests from test_basic (#19384 ) Why are these changes needed? Related issue number ##19177 Quoting #19177 (comment) here, The following tests fail when not skipped, =================================== short test summary info ==================================== FAILED python\ray\tests\test_basic.py::test_user_setup_function - subprocess.CalledProcessErro... FAILED python\ray\tests\test_basic.py::test_disable_cuda_devices - subprocess.CalledProcessErr... FAILED python\ray\tests\test_basic.py::test_wait_timing - assert (1634209333.6099107 - 1634209... Results (395.22s): 36 passed 3 failed - ray\tests/test_basic.py:197 test_user_setup_function - ray\tests/test_basic.py:220 test_disable_cuda_devices - ray\tests/test_basic.py:265 test_wait_timing =================================== short test summary info ==================================== FAILED python\ray\tests\test_basic_3.py::test_fair_queueing - AssertionError: 23 Results (198.33s): 1 failed - ray\tests/test_basic_3.py:169 test_fair_queueing The following test passed when not skipped. Opening a PR to verify that. def test_oversized_function(ray_start_shared_local_modes)	2021-10-15 14:02:57 -07:00
Kai Fricke	bb38c5cb1f	[tune] Fix result buffering case check (fixes bug introduced in #19140 ) (#19399 )	2021-10-15 10:43:34 +01:00
Siyuan (Ryans) Zhuang	0d4b0ded27	[Serialization] Update cloudpickle to v2.0.0 (#19383 ) * update cloudpickle to v2.0.0	2021-10-15 02:37:29 -07:00
Hao Zhang	4b92f34ada	[Collective] Remove an unnecessary cuda.stream.synchornize (#19400 )	2021-10-14 21:33:59 -07:00
SangBin Cho	9bfe43198f	Use cleaner code for the map (#19386 )	2021-10-14 21:18:42 -07:00
Matti Picus	f372bb07aa	Enable dashboard on Windows (#19319 )	2021-10-14 14:42:22 -07:00
Kai Fricke	e17b23fa5b	[ci/release] Add support for RAY_WHEELS url (#19364 )	2021-10-14 21:40:01 +01:00
architkulkarni	b3ccec5d76	[runtime_env] Fix bug when all working_dir contents are excluded with Ray Client (#19377 )	2021-10-14 11:20:45 -07:00
Carlo Grisetti	30fe93d285	[Windows] Use correct interpreter and fix prometheus atomic file rename (#19171 )	2021-10-14 10:29:21 -07:00
Kai Fricke	e07d0953ea	[ci/release] Undo faulty change to many_ppo num_samples (#19388 )	2021-10-14 10:27:31 -07:00
Eric Liang	13d4ad6100	[data] Preserve epoch by default when using rewindow() (#19359 )	2021-10-14 09:17:36 -07:00
SangBin Cho	4edb3c4746	[Test] Add complicated threaded actor tests (#19374 ) Why are these changes needed? There are only 2 simple threaded actor tests in Ray repo. This PR adds more complicated threaded actor tests to make sure it is well tested. The third tests print a lot of (pid=42032) [2021-10-13 19:02:36,102 E 42032 10779969] core_worker.cc:270: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit which was the bug @scv119 fixed. Maybe we can start debugging this to make sure when this happens and fix the real shutdown bugs. Related issue number Checks I've run scripts/format.sh to lint the changes in this PR. I've included any doc changes needed for https://docs.ray.io/en/master/. I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ Testing Strategy Unit tests Release tests This PR is not tested :	2021-10-14 09:06:11 -07:00
Antoni Baum	e9df253f5d	[CI/docs] Remove [default] from xgboost-ray (#19186 ) Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-10-14 16:29:55 +01:00
Kai Fricke	9cee83c919	[tune] PBT: Add burn-in period (#19321 )	2021-10-14 16:28:29 +01:00
Edward Oakes	888fb24c25	Remove deprecated ray.services package (#18475 ) Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-10-14 16:28:16 +01:00
Kai Fricke	312dc369a7	Revert "[Hotfix] Revert "[tune/wip] Exclude trial checkpoints in experiment sync"" (#19285 ) This reverts commit `a92f1fedf4`. and fixes the failing test	2021-10-14 11:18:48 +01:00
Qing Wang	2cc164e616	[Java] Fix incompleted core worker dynamic library. (#19342 ) * Fix incompleted core worker dynamic library. * Fix lint.	2021-10-14 14:42:05 +08:00
mwtian	12100015d9	[Lint] Disable `modernize-use-override` (#19368 ) This lint rule cannot apply only to changed lines because currently Ray has `-Winconsistent-missing-override` as a build flag. Either all or none of member functions from a derived class can have the `override` / `final` annocation.	2021-10-13 20:20:08 -07:00
Carlo Grisetti	5cee8a1985	[release tests] Switch from yaml.load to yaml.safe_load (#19365 )	2021-10-13 17:27:25 -07:00
Edward Oakes	2ac81f336a	[serve] Remove BackendConfig broadcasting (#19154 )	2021-10-13 16:25:34 -07:00
Chen Shen	b8c201b7cb	[Core][CoreWorker] Make WorkerContext thread safe, fix race condition. #19343 Why are these changes needed? The theory around #19270 is there are two create actor requests sent to the same threaded actor due to retry logic. Specifically: the first request comes and calls CoreWorkerDirectTaskReceiver::HandleTask, it's queued to be executed by thread pool; then the second request comes and calls CoreWorkerDirectTaskReceiver::HandleTask again, before first request being executed and calls worker_context_.SetCurrentTask; this fails the current dedupe logic and leads to SetMaxActorConcurrency be called twice, which fails the RAY_CHECK. In this PR, we fix the dedupe logic by adding SetCurrentActorId and calling it in the task execution thread. this ensures the dedupe logic works for threaded actor. we also noticed that the WorkerContext is actually not thread safe in threaded actors, thus make it thread safe in this PR as well. Related issue number Closes #19270 Checks I've run scripts/format.sh to lint the changes in this PR. I've included any doc changes needed for https://docs.ray.io/en/master/. I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ Testing Strategy Unit tests Release tests This PR is not tested :(	2021-10-13 16:12:36 -07:00
Linsong Chu	b86a5fcb96	[workflow] fix workflow user metadata return when None is given (#19356 ) ## Why are these changes needed? Quick fix for metadata put. Currently when workflow-level metadata is not given, it will output `null` to `user_run_metadata.json`, this fix will make it output `{}`. ## Related issue number original issue: https://github.com/ray-project/ray/issues/17090 original PR: https://github.com/ray-project/ray/pull/19195	2021-10-13 15:52:12 -07:00
Yi Cheng	1dc03cd49d	[nightly] Put many nodes actor test back (#19313 ) ## Why are these changes needed? There are two issues fixed in this PR: - make sure wait for session count alive node - upgrade the machine to match what's tested in oss ray. ## Related issue number https://github.com/ray-project/ray/issues/19084	2021-10-13 15:51:12 -07:00
matthewdeng	d998373968	[release] fix test by pinning filelock (#19334 ) Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-10-13 22:27:04 +01:00
architkulkarni	b0716f66ae	[runtime env] Fix handling of runtime env with None fields (#19300 )	2021-10-13 13:57:55 -07:00
Jiao	893f76daf9	[serve] Add serve FT nightly test to buildkite (#19361 )	2021-10-13 13:56:55 -07:00
Antoni Baum	3cb0862152	Fix double gym in requirements (#19357 )	2021-10-13 21:43:41 +01:00
Omkar Pangarkar	f1b9b16ae9	[tune] Fix `DistributedTrainable` restore (#19349 )	2021-10-13 21:29:05 +01:00
Carlo Grisetti	da7a485786	[Windows] use dynamic temp path (#19096 )	2021-10-13 13:02:45 -04:00
hazeone	c2f0035fd2	[Java]Support getGpuIds API (#19031 ) Add java getGpuIds() API which is the same as get_gpu_ids in python. We can get deviceId if we've allocated a GPU to a worker.	2021-10-13 23:40:26 +08:00
Kai Fricke	bde9e058da	Revert "[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib (#18183 )" (#19351 ) This reverts commit `74ee99ff99`.	2021-10-13 13:06:36 +01:00

9889 commits