Why are these changes needed?
Editing pass over the tensor support docs for clarity:
Make heavy use of tabbed guides to condense the content
Rewrite examples to be more organized around creating vs reading tensors
Use doc_code for testing
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Why are these changes needed?
This PR adds a memory monitor in C++ that runs periodically to check whether the node's memory usage is above a certain threshold. The caller may provide a callback that the monitor executes at each interval to determine whether an action should be taken.
This PR is a no-op since the monitor is disabled by default. A follow-up PR based on this one will make the monitor take action when memory is running low.
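As a rough illustration of the design described above (not the C++ code in this PR), the periodic check plus caller-supplied callback could be sketched in Python as follows. The class name `MemoryMonitor`, its parameters, and the callback signature are all hypothetical and chosen only for this example.

```python
import threading
from typing import Callable, Optional


class MemoryMonitor:
    """Illustrative sketch: periodically compare memory usage to a threshold.

    All names here are hypothetical; they do not mirror the C++ monitor's API.
    """

    def __init__(
        self,
        usage_fraction_fn: Callable[[], float],
        threshold: float,
        interval_s: float,
        callback: Callable[[bool, float], None],
    ) -> None:
        self._usage_fraction_fn = usage_fraction_fn  # returns usage in [0, 1]
        self._threshold = threshold
        self._interval_s = interval_s
        self._callback = callback
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def check_once(self) -> bool:
        """Run a single check and hand the result to the caller's callback."""
        usage = self._usage_fraction_fn()
        above = usage > self._threshold
        # The callback decides whether any action should be taken.
        self._callback(above, usage)
        return above

    def start(self) -> None:
        """Start the background thread that runs a check every interval."""
        if self._thread is not None:
            return

        def _run() -> None:
            # Event.wait returns True once stop() is called, ending the loop.
            while not self._stop.wait(self._interval_s):
                self.check_once()

        self._thread = threading.Thread(target=_run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        """Signal the background thread to exit and wait for it."""
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

Keeping the decision in the callback means the monitor itself stays policy-free, which matches the description of this PR as a no-op until a later change supplies an action.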
Why are these changes needed?
The dashboard wasn't working (blank screen). See the linked issue for details. The cause is this exception in /tmp/ray/session_latest/logs/dashboard_agent.log:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/usr/local/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/usr/local/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/usr/local/lib/python3.9/site-packages/prometheus_client/exposition.py", line 156, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
There was a recent change in prometheus-client which passes the address given to start_http_server to socket.getaddrinfo. This prevents passing in an empty string, but we can get the same effect by passing None.
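A minimal sketch of the idea, assuming the address ends up in `socket.getaddrinfo(address, port)` as the traceback shows. The helper names `normalize_bind_address` and `resolve_bind_address` are hypothetical and are not the functions changed in this PR.

```python
import socket
from typing import Optional


def normalize_bind_address(addr: Optional[str]) -> Optional[str]:
    """Map an empty address to None; leave any other value untouched.

    Hypothetical helper: socket.getaddrinfo accepts None as the host,
    whereas the empty string led to the gaierror in the traceback.
    """
    return addr if addr else None


def resolve_bind_address(addr: Optional[str], port: int):
    """Resolve (addr, port) with the same two-argument getaddrinfo call."""
    return socket.getaddrinfo(normalize_bind_address(addr), port)
```

Normalizing at the call site keeps the rest of the exporter code unchanged, since callers can keep passing an empty string.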
Related issue number
Closes #23765
Recently there have been a number of CI test failures due to direct or transitive dependency version upgrades. Printing out environment information for each test suite allows us to quickly check the diff between failed and successful runs.
**Notes:**
1. In this PR I just manually added `./ci/env/env_info.sh` to each test suite. We may want to generalize this in the future.
2. This is just for CI now, but is applicable to release tests as well.
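To illustrate how the printed environment information could be compared between a passing and a failing run, here is a small sketch. It assumes the output contains `name==version` lines in the style of `pip freeze`; that format, the function names, and the package names in the usage note are assumptions for illustration, not a description of what `./ci/env/env_info.sh` actually prints.

```python
from typing import Dict, List


def parse_requirements(text: str) -> Dict[str, str]:
    """Parse `name==version` lines into a {name: version} mapping."""
    packages: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a pinned package.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_environments(passing: str, failing: str) -> List[str]:
    """List packages whose version differs between the two runs."""
    old, new = parse_requirements(passing), parse_requirements(failing)
    changes: List[str] = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes.append(f"{name}: {before or 'absent'} -> {after or 'absent'}")
    return changes
```

For example, comparing `"pkg-a==1.0\npkg-b==2.0"` with `"pkg-a==1.0\npkg-b==2.1"` reports only the `pkg-b` change, which is the kind of signal needed to spot a transitive upgrade.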
Signed-off-by: Matthew Deng <matt@anyscale.com>
We need to check the time after acquiring the lock to ensure correctness. Otherwise, the caller might wait for the lock, and by the time it acquires it the heartbeat may already have been updated.
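The ordering can be sketched as follows. This is a hedged illustration in Python, not the code touched by this PR; `HeartbeatTracker`, its methods, and the injectable `clock` parameter are hypothetical names.

```python
import threading
import time
from typing import Callable


class HeartbeatTracker:
    """Illustrative sketch: read the clock only while holding the lock."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._lock = threading.Lock()
        self._last_heartbeat = clock()

    def record_heartbeat(self) -> None:
        with self._lock:
            self._last_heartbeat = self._clock()

    def is_expired(self) -> bool:
        with self._lock:
            # Sample the time after the lock is held, so the timestamp and
            # the heartbeat state form one consistent snapshot. Sampling
            # before acquiring the lock could compare a stale "now" against
            # a heartbeat that was updated while we were waiting.
            now = self._clock()
            return now - self._last_heartbeat > self._timeout_s
```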
This PR adds a flag --disable-check to the XGBoost benchmark script which disables the RuntimeError that comes up if training or prediction took too long. This is meant for non-CI exploratory use-cases.
Specifically, the reason is this:
We will include the XGBoost benchmark as an example workload for the KubeRay documentation.
The actual performance of the workload is highly sensitive to the infrastructure environment, so we don't want to raise an alarming RuntimeError if the workload takes too long on the user's infrastructure.
(When I tried the 100 GB benchmark on KubeRay, training ran just a couple of minutes longer than the 1000-second cutoff.)
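A minimal sketch of how such a flag might gate the timing check, assuming an `argparse`-based script. Only the flag name `--disable-check` comes from this PR; `build_parser`, `check_duration`, and the error message are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the --disable-check flag (illustrative)."""
    parser = argparse.ArgumentParser(description="Benchmark runner (illustrative)")
    parser.add_argument(
        "--disable-check",
        action="store_true",
        help="Do not raise an error if training or prediction took too long.",
    )
    return parser


def check_duration(label: str, taken_s: float, limit_s: float, disable_check: bool) -> None:
    """Raise RuntimeError when a stage exceeds its limit, unless disabled."""
    if disable_check:
        return
    if taken_s > limit_s:
        raise RuntimeError(
            f"{label} took {taken_s:.2f} seconds, "
            f"which is longer than expected ({limit_s} seconds)."
        )
```

With `store_true`, the check stays on by default, so CI behavior is unchanged unless the flag is passed explicitly.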
Why are these changes needed?
Also:
Add validation to make sure multi-GPU and micro-batching are not used together.
Update A2C learning test to hit the microbatching branch.
Minor comment updates.
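The validation item above might look roughly like this. It is a sketch under assumptions: the config is treated as a plain dict, and the keys `num_gpus` and `microbatch_size` as well as the function name `validate_a2c_config` are used for illustration only and may not match the actual implementation.

```python
from typing import Any, Dict


def validate_a2c_config(config: Dict[str, Any]) -> None:
    """Reject configs that combine micro-batching with more than one GPU."""
    num_gpus = config.get("num_gpus", 0)
    microbatch_size = config.get("microbatch_size")  # None means disabled
    if microbatch_size is not None and num_gpus > 1:
        raise ValueError(
            "Micro-batching cannot be combined with multi-GPU training: "
            f"got microbatch_size={microbatch_size} and num_gpus={num_gpus}."
        )
```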
There is a small bug in the docs example for custom command-based syncers. This PR fixes it and adds a test covering the change.
Signed-off-by: Kai Fricke <kai@anyscale.com>
More replacements of tune.run() with Tuner.fit() in examples/docstrings.
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Since usage stats are recorded from the dashboard (which will become the API server), they are not collected when the dashboard is not included (include_dashboard=False).
This PR fixes the issue by:
Renaming dashboard -> API server (so users are not misled into thinking the dashboard is still started when include_dashboard=False)
Loading only the modules that are unrelated to the dashboard from the API server, so it has the same impact as running with no dashboard.
Currently running into an issue:
Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB
The heartbeat manager starts its own thread to run its background task, and that task shares a data structure (heartbeats_) with HandleReportHeartbeat. Both methods should therefore run in the same thread. This PR achieves that by running HandleReportHeartbeat within the io_service thread.
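The same pattern can be illustrated in Python by funnelling both the request handler and the background task through a single worker thread, which stands in for the io_service thread here. This is only a sketch of the threading idea; the class and method names are hypothetical and the real code is C++.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, Optional


class HeartbeatManager:
    """Illustrative sketch: all access to shared state runs on one thread."""

    def __init__(self) -> None:
        # A single worker plays the role of the io_service thread.
        self._io = ThreadPoolExecutor(max_workers=1)
        # Missed-heartbeat counters; touched only from the worker thread.
        self._heartbeats: Dict[str, int] = {}

    def _handle_report_heartbeat(self, node_id: str) -> int:
        self._heartbeats[node_id] = 0  # reset the missed counter
        return threading.get_ident()

    def _tick(self) -> int:
        for node_id in self._heartbeats:
            self._heartbeats[node_id] += 1
        return threading.get_ident()

    def handle_report_heartbeat(self, node_id: str) -> Future:
        """Post the handler to the worker instead of running it inline."""
        return self._io.submit(self._handle_report_heartbeat, node_id)

    def tick(self) -> Future:
        """Post one run of the background task to the same worker."""
        return self._io.submit(self._tick)

    def missed(self, node_id: str) -> Optional[int]:
        """Read a counter, also via the worker thread."""
        return self._io.submit(self._heartbeats.get, node_id).result()

    def shutdown(self) -> None:
        self._io.shutdown(wait=True)
```

Because every access is serialized on one thread, the shared map needs no lock of its own.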
This PR adds --keep-going flag to the make html target for building the Ray docs. This means that when there is a lint failure in CI, the BuildKite log will show all lint failures instead of just the first one. Despite continuing past the first lint error, it will still fail the build.
Signed-off-by: Cade Daniel <cade@anyscale.com>