hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Dmitri Gekhtman	6efca71c35	[docs][kubernetes] XGBoost ML example (#27313 ) Adds a guide on running an XGBoost-Ray workload using KubeRay. Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>	2022-08-01 19:30:41 -07:00
Yi Cheng	00d22b6c7c	[core] Fix the test_failure_3.py in win (#27332 ) Win tests were broken because when the child is killed, the parent is also killed. Change the signal sent and make it work.	2022-08-01 18:55:07 -07:00
shrekris-anyscale	324d8e4bca	[Serve] Serialize `user_config` with JSON instead of Pickle (#26235 )	2022-08-01 17:53:43 -07:00
Eric Liang	f7ae8923f6	[docs] Reorganize the tensor data support docs; general editing (#26952 ) Why are these changes needed? Editing pass over the tensor support docs for clarity: Make heavy use of tabbed guides to condense the content Rewrite examples to be more organized around creating vs reading tensors Use doc_code for testing	2022-08-01 17:31:41 -07:00
Jiajun Yao	c50faa126c	Replace boost::filesystem with std::filesystem (#27338 ) Redo #27319 Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-01 17:12:23 -07:00
clarng	fffcae1cb4	[docs] ray core dag docs: edit pass & move code into separate dir (#27318 )	2022-08-01 17:05:36 -07:00
shrekris-anyscale	cc84953da3	[Serve] [Docs] Update "Getting Started" documentation (#26745 )	2022-08-01 16:31:48 -07:00
Jiajun Yao	36d5e5f99d	Revert "Replace boost::filesystem with std::filesystem (#27319 )" (#27337 ) This reverts commit `8e5c51d7d7`.	2022-08-01 13:46:45 -07:00
Jiajun Yao	8e5c51d7d7	Replace boost::filesystem with std::filesystem (#27319 ) std::filesystem is shipped with c++17, there is no need to depend on boost for this. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-01 11:44:39 -07:00
xwjiang2010	c9579fea1c	[air] update pytorch_training_e2e.py to use iter_torch_batches. (#27241 ) update pytorch_training_e2e.py to use iter_torch_batches. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-01 19:23:01 +01:00
clarng	57adde3f7d	memory monitor (#27017 ) Signed-off-by: Clarence Ng clarence.wyng@gmail.com Why are these changes needed? This PR adds a memory monitor in cpp that runs periodically to check if the node memory usage is above a certain threshold. The caller may provide a callback to the monitor to execute at each interval to determine whether an action should be taken. This PR is a no-op since the monitor is disabled by default. Another PR based on this will implement the monitor to take action when memory is running low	2022-08-01 10:40:46 -07:00
jonathan-conder-sm	1d5fef2004	Fix dashboard with prometheus-client 0.14 (#23766 ) Why are these changes needed? The dashboard wasn't working (blank screen). See the linked issue for details. The cause is this exception in /tmp/ray/session_latest/logs/dashboard_agent.log: Traceback (most recent call last): File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module> loop.run_until_complete(agent.run()) File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete return future.result() File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run modules = self._load_modules() File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules c = cls(self) File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__ self._metrics_agent = MetricsAgent( File "/usr/local/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__ prometheus_exporter.new_stats_exporter( File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter exporter = PrometheusStatsExporter( File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__ self.serve_http() File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http start_http_server( File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server TmpServer.address_family, addr = _get_best_family(addr, port) File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 156, in _get_best_family infos = socket.getaddrinfo(address, port) File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo for res in _socket.getaddrinfo(host, port, family, type, proto, flags): socket.gaierror: [Errno -2] Name or service not known There was a recent change in prometheus-client which passes the address given to start_http_server to socket.getaddrinfo. This prevents passing in an empty string, but we can get the same effect by passing None. Related issue number Closes #23765	2022-08-01 10:25:38 -07:00
Sihan Wang	410fe1b5ec	[Serve] Support Multiple DAG Entrypoints in DAGDriver (#26573 )	2022-08-01 09:16:36 -07:00
Artur Niederfahrenhorst	a598458c46	[RLlib] Fix complex torch one-hot and flattened layers not being added to module list. (#27304 )	2022-08-01 15:52:28 +02:00
Steven Morad	d0a8e3c36f	[RLlib] User-friendly RNN sequencing. (#27087 )	2022-08-01 15:32:22 +02:00
Steven Morad	77318abfaf	[RLlib] Warn on PPO infinite KL loss term. (#26629 )	2022-08-01 12:55:26 +02:00
matthewdeng	01d473b355	[ci] print out test environment info for all python tests (#27312 ) Recently there have been a number of CI test failures due to direct or transitive dependency version upgrades. Printing out environment information for each test suite allows us to quickly check the diff between failed and successful runs. Notes: 1. In this PR I just manually added `./ci/env/env_info.sh` to each test suite. We may want to generalize this in the future. 2. This is just for CI now, but is applicable to release tests as well. Signed-off-by: Matthew Deng <matt@anyscale.com>	2022-08-01 09:55:13 +01:00
matthewdeng	83fb2fb21d	[tune] pin pymoo (#27311 ) Signed-off-by: Matthew Deng <matt@anyscale.com>	2022-07-31 01:06:27 -07:00
matthewdeng	fedfaddb3f	[docs] add k8s docs to toc (#27310 )	2022-07-30 15:26:30 -07:00
clarng	a61478fb73	import style (#25755 )	2022-07-30 09:43:09 -07:00
Alan Guo	729566d8ff	bump jobs version after making a backwards-incompatible change (#27281 ) Backwards incompatible change was #25902 2.0.0 cherry-pick but not a rc0 blocker Signed-off-by: Alan Guo <aguo@anyscale.com>	2022-07-30 00:11:29 -07:00
SangBin Cho	ec69fec1e0	Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (#26302 )" (#27242 ) This reverts commit `14dee5f6a3`.	2022-07-30 00:08:23 -07:00
Yi Cheng	7da700e337	[core] Suppress the logging error when python exits and actor not deleted. (#27300 ) In py39, it seems the destruction order changed and in __del__ some component might have been uninstalled and some error will be thrown.	2022-07-29 23:12:41 -07:00
Yi Cheng	95da64b53e	[ci] Fix the lint #27291 Signed-off-by: Yi Cheng <chengyidna@gmail.com>	2022-07-29 16:18:13 -07:00
Eric Liang	e3d7930bb3	[data] When no pipeline stats are available, return a helpful message instead of empty # Signed-off-by: Eric Liang <ekhliang@gmail.com>	2022-07-29 15:54:20 -07:00
Chris K. W	18109ec9f5	Deflake test_client_library_integration (#27105 ) Using ray_start_regular_shared test_tune_library_integration seems to make test_serve_handle flake. Separate use ray_start_regular instead. No flake: [screenshot: Screen Shot 2022-07-27 at 1 10 59 PM]	2022-07-29 15:50:54 -07:00
Yi Cheng	4ef4ec8eed	[ci] Deflakey gcs_heartbeat_test in windows. (#27275 ) We need to check the time after acquiring the lock to make sure the correctness. Otherwise, it might wait for the lock and the heartbeat has been updated.	2022-07-29 15:42:28 -07:00
Dmitri Gekhtman	8bdeb30510	[docs][ml][kuberay] Add a --disable-check flag to the XGBoost benchmark. (#27277 ) This PR adds a flag --disable-check to the XGBoost benchmark script which disables the RuntimeError that comes up if training or prediction took too long. This is meant for non-CI exploratory use-cases. Specifically, the reason is this: We will include the XGBoost benchmark as an example workload for the KubeRay documentation. The actual performance of the workload is highly sensitive to infrastructure environment, so we won't want to raise an alarming RuntimeError if the workload took too long on the user's infrastructure. (When I tried the 100Gb benchmark on KubeRay, training ran just a couple of minutes longer than the 1000 second cutoff.)	2022-07-29 14:31:10 -07:00
Simon Mo	1a10b53a61	Revert "[Serve] ServeHandle detects ActorError and drop replicas from target group (#26685 )" (#27283 ) This reverts commit `545c51609f`.	2022-07-29 14:24:15 -07:00
Dmitri Gekhtman	059895ab5b	[docs][kubernetes] Shift docs into new structure (#27239 ) This PR shifts KubeRay docs into the structure introduced in #27036. There are no content changes.	2022-07-29 14:19:51 -07:00
Yi Cheng	ad262c1968	[ci] Fix test_gcs_ha_e2e.py (#27263 ) This PR fix the broken test. The test failed because it's not installing the latest wheel. Signed-off-by: Yi Cheng <chengyidna@gmail.com>	2022-07-29 13:53:40 -07:00
Siyuan (Ryans) Zhuang	1bcd3e41d1	[Workflow] Cleanup workflow docs (#27197 ) * cleanup workflow docs Signed-off-by: Siyuan Zhuang <suquark@gmail.com>	2022-07-29 13:03:50 -07:00
Simon Mo	1f1234fde1	[Serve] Disable Serve on macOS Round 2 (#27271 )	2022-07-29 12:04:34 -07:00
Jun Gong	e6e10ce4cf	[RLlib] Revert `41c9ef70`. (#27243 ) Why are these changes needed? Also: Add validation to make sure multi-gpu and micro-batch is not used together. Update A2C learning test to hit the microbatching branch. Minor comment updates.	2022-07-29 11:05:15 -07:00
Eric Liang	467430dd55	Fix logger initialization (#27238 )	2022-07-29 10:55:15 -07:00
Simon Mo	545c51609f	[Serve] ServeHandle detects ActorError and drop replicas from target group (#26685 )	2022-07-29 09:50:17 -07:00
Guyang Song	0b60d90283	[Hotfix] Fix the failure of C++ tests (#27249 ) Signed-off-by: 久龙 <guyang.sgy@antfin.com>	2022-07-30 00:31:02 +08:00
Kai Fricke	1f097e9d12	[tune/docs] Update custom syncer example (#27252 ) There is a small bug in the docs example for custom command based syncers. This PR fixes them and adds a test to test these changes. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-29 16:09:19 +01:00
xwjiang2010	d331489a9d	[ air ] clean up some more `tune.run` (#27117 ) More replacements of tune.run() in examples/docstrings for Tuner.fit() Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-29 10:43:45 +01:00
Jian Xiao	693856975a	Fix the ray version to doc version mapping (#27191 ) Why are these changes needed? It doesn't work if the ray version is something like "2.0.0rc0"	2022-07-28 23:35:24 -07:00
Chen Shen	559216780c	[CI][hotfix] remove no-index --no-index will not try to install pip packages from pypi. this breaks CI because it failed to find grpcio==1.43.0 as it's missing from cache.	2022-07-28 23:19:21 -07:00
SangBin Cho	16aa102984	[Usage Stats] Record usage stats when dashboard disabled (#26042 ) Since usage stats are recorded from the dashboard (which will become API server), it is not collected when the dashboard is not included (include_dashboard=False). This PR fixes the issues by change dashboard -> API server (to avoid confusing users that dashboard is still started when include_dashboard=False) Only load modules that are irrelevant to the dashboard from the API server, so it will have the same impact as no dashboard.	2022-07-28 23:01:49 -07:00
Kai Fricke	ee05fc94fe	[tune] Increase volume size for long running pbt failure (#27163 ) Currently running into an issue: Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB	2022-07-28 22:57:26 -07:00
SangBin Cho	c1ac2bb80f	[Test] Try fixing a flaky gcs heartbeat manager test. (#27096 ) Heartbeat manager starts its own thread to run its background task and that shares the same data structured used within HandleReportHeartbeat (heartbeats_). That said, both methods should run in the same thread. This achieves it by running HandleReportHeartbeat within the io_service thread	2022-07-28 22:42:13 -07:00
Jimmy Yao	749d313dcd	hot fix ray lightning (#27235 ) hot fix ray lightning #27235	2022-07-28 22:41:28 -07:00
Chen Shen	fda345335a	Revert "Allow grpcio >= 1.48 (#26765 )" (#27244 ) This reverts commit `6acd0a4c9b`.	2022-07-28 22:25:21 -07:00
Cheng Su	b95f7b222e	[Datasets] Doing partition filtering in reader constructor (#27156 ) currently we are doing things in this ordering: 1. [create the reader for data source](https://github.com/ray-project/ray/blob/master/python/ray/data/read_api.py#L1136) 2. [calculate number of tasks, based on data source estimated size and cluster resource](https://github.com/ray-project/ray/blob/master/python/ray/data/read_api.py#L1137) 3. [doing partition filtering and generate read tasks](https://github.com/ray-project/ray/blob/master/python/ray/data/read_api.py#L1143) However, we should do partition filtering before step 2, so that the data source estimated size is calculated based on correct set of files. Otherwise estimation is calculated based on all files, while some files could be filtered out later. See https://github.com/ray-project/ray/issues/27152 for more detail.	2022-07-28 17:33:14 -07:00
Simon Mo	ca9e8b3d0b	[Serve] Disable macOS tests (#27218 )	2022-07-28 16:34:46 -07:00
Cade Daniel	0374637e53	Adding --keep-going flag to sphinx-build so all lint failures are listed in CI (#27068 ) This PR adds --keep-going flag to the make html target for building the Ray docs. This means that when there is a lint failure in CI, the BuildKite log will show all lint failures instead of just the first one. Despite continuing past the first lint error, it will still fail the build. Signed-off-by: Cade Daniel <cade@anyscale.com>	2022-07-28 16:24:27 -07:00
Jimmy Yao	73e1632599	Hot fix again ray lightning docs (#27229 )	2022-07-28 16:19:30 -07:00

13759 commits