## What do these changes do?
This PR exposes a command-line option for overriding a config parameter. This is important for certain tests (e.g., fault-tolerance tests that remove nodes) so that they run quickly.
Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.
#3239 depends on this.
TODO:
- [x] Add documentation to method arguments before merging.
- [x] Add test to verify this works?
## Related issue number
* Increase timeout to 10s
* Skip eviction reconstruction tests
* Add stress test for many actors to one
* Fix test by shortening it.
* lower number of processes in stress test
* Skip slow test
* Policy that flushes the lineage stash immediately
* Fix bug where remote tasks in uncommitted lineage weren't getting subscribed to, add reg test
* test
* Fix bug where waiting task was getting subscribed
* Cleanup
* Update src/ray/raylet/lineage_cache.cc
Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>
* Update src/ray/raylet/lineage_cache.cc
Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>
* cleanup
* cleanup
* Add another test for task with many parents
* fix, unsubscribe to new waiting tasks
* Unsubscribe as soon as the commit notification is handled
We found that there are a large number of pub-sub keys with no content in them (this case is worse when the wait ID is used in the key name).
The logic for deleting empty pub-sub keys from the GCS existed in legacy Ray but not in the raylet.
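A minimal sketch of the idea, assuming hiredis and a hash-typed key (the command and key layout here are assumptions, not the actual GCS schema):
```c++
#include <hiredis/hiredis.h>

#include <string>

// Sketch only: after a pub-sub notification for `key` has been consumed,
// delete the key if it no longer holds any entries, so empty keys do not
// accumulate in Redis.
void DeleteKeyIfEmpty(redisContext *context, const std::string &key) {
  // Assumes the key stores a hash; HLEN returns 0 for an empty or absent key.
  redisReply *reply = static_cast<redisReply *>(
      redisCommand(context, "HLEN %s", key.c_str()));
  bool is_empty = false;
  if (reply != nullptr) {
    is_empty = reply->type == REDIS_REPLY_INTEGER && reply->integer == 0;
    freeReplyObject(reply);
  }
  if (is_empty) {
    redisReply *del_reply = static_cast<redisReply *>(
        redisCommand(context, "DEL %s", key.c_str()));
    if (del_reply != nullptr) {
      freeReplyObject(del_reply);
    }
  }
}
```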
This fixes a problem that @devin-petersohn observed on the Windows Subsystem for Linux.
In theory, Redis should already be up when the async connect happens, so no retries should be needed for the async connect. However, on the Windows Subsystem for Linux the async connect was failing even though the synchronous one was working. Maybe Windows has different semantics here than Linux.
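A minimal sketch of the retry idea (the hiredis calls are real, but the retry count, backoff, and function name are illustrative, not the actual RedisContext code):
```c++
#include <hiredis/async.h>

#include <chrono>
#include <thread>

// Sketch only: retry the asynchronous Redis connect a few times instead of
// trusting the first attempt, since on WSL the first async connect can fail
// even though Redis is already up and the synchronous connect succeeds.
redisAsyncContext *AsyncConnectWithRetries(const char *host, int port,
                                           int num_retries) {
  for (int attempt = 0; attempt < num_retries; attempt++) {
    redisAsyncContext *context = redisAsyncConnect(host, port);
    if (context != nullptr && context->err == 0) {
      return context;  // Connected; the caller attaches it to the event loop.
    }
    if (context != nullptr) {
      redisAsyncFree(context);
    }
    // Back off briefly before the next attempt.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  return nullptr;  // Caller treats this as a fatal connection error.
}
```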
This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then crash. Related to #2790.
## What do these changes do?
```c++
// Try to execute the worker command.
int rv = execvp(worker_command_args[0],
                const_cast<char *const *>(worker_command_args.data()));
// The worker failed to start. This is a fatal error.
RAY_LOG(FATAL) << "Failed to start worker with return value " << rv;
```
When starting a process fails, the return value `rv` is always -1, which by itself is not useful.
The log message should include some meaningful information, such as the reason from `errno`.
For example, if Java is not installed, the message should be:
```shell
Failed to start worker: No such file or directory.
```
This helps us locate the issue quickly.
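A minimal sketch of the idea (not the exact patch): after `execvp` returns, `errno` describes why it failed, so the fatal log can include `strerror(errno)` instead of the unhelpful -1. The header path and the argv vector are assumptions based on the snippet above.
```c++
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <vector>

#include "ray/util/logging.h"  // assumed location of the RAY_LOG macro

// Sketch only: worker_command_args is the argv-style vector (terminated by a
// null pointer) that the worker pool builds before forking.
void ExecWorker(const std::vector<const char *> &worker_command_args) {
  execvp(worker_command_args[0],
         const_cast<char *const *>(worker_command_args.data()));
  // execvp only returns on failure; errno says why (e.g. ENOENT when the java
  // binary is missing), which is far more useful than the -1 return value.
  RAY_LOG(FATAL) << "Failed to start worker: " << strerror(errno) << ".";
}
```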
## Related issue number
N/A
This PR adds a `function_desc` field to the task spec. A function descriptor is a list of strings that can uniquely describe a function.
- For a Python function, it should be: [module_name, class_name, function_name]
- For a Java function, it should be: [class_name, method_name, type_descriptor]
There are a couple of purposes for adding this field:
In this PR:
- The Java worker needs to know a function's class name to load it. Previously, since the task spec didn't have a field to hold this info, we hacked around it by appending the class name to the argument list. With this change, we removed that hack and significantly simplified function management in Java.
Will be done in subsequent PRs:
- Support cross-language invocation (#2576): currently the Python worker manages functions by saving them in the GCS and passing the function ID in the task spec. However, if we want to call a Python function from Java, we cannot save it in the GCS and get the function ID. Instead, we can pass the function descriptor (module name, class name, function name) in the task spec and use it to load the function.
- Support deployment: one major problem with the Python worker's current function management mechanism is #2327. In a production environment, we should have a mechanism to deploy code and dependencies to the cluster. And when the code is already deployed, we don't need to save functions to the GCS any more and can use `function_desc` to manage functions.
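As an illustration of the descriptor layouts described above, here is a minimal sketch (the type alias and helper functions are hypothetical, not the actual task-spec code):
```c++
#include <string>
#include <vector>

// A function descriptor is just an ordered list of strings; the meaning of
// each entry depends on the language of the target function.
using FunctionDescriptor = std::vector<std::string>;

// Hypothetical helpers showing the layouts listed above.
FunctionDescriptor PythonDescriptor(const std::string &module_name,
                                    const std::string &class_name,
                                    const std::string &function_name) {
  return {module_name, class_name, function_name};
}

FunctionDescriptor JavaDescriptor(const std::string &class_name,
                                  const std::string &method_name,
                                  const std::string &type_descriptor) {
  return {class_name, method_name, type_descriptor};
}

int main() {
  // e.g. a module-level Python function my_pkg.my_module.f (empty class name)
  // and a Java method com.example.Foo.bar with JVM type descriptor ()V.
  FunctionDescriptor py = PythonDescriptor("my_pkg.my_module", "", "f");
  FunctionDescriptor java = JavaDescriptor("com.example.Foo", "bar", "()V");
  return py.size() == 3 && java.size() == 3 ? 0 : 1;
}
```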
* enable using thirdparty env variable to find installed dependency, to speed up the build process
* fix target dependencies in cmake. :-) too chaotic in each CMakeLists
* check env variable defined directory exists
When a new raylet starts, `ClientAdded` will be called with the disconnected client data. However, since the client was closed, the connection will fail.
Fixes issue where object manager sometimes crashes within the `Wait` method: The issue stems from inconsistent behavior of the boost deadline timer's `cancel` method, which is invoked within `WaitComplete` to enforce exactly one `WaitComplete` invocation for each `Wait` request. The `cancel` method sometimes fails to actually prevent the timer's invocation of the provided handler with non-zero error code.
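A minimal sketch of the guard, assuming boost::asio (the flag name and structure are illustrative, not the actual `ObjectManager::Wait` code):
```c++
#include <boost/asio.hpp>

#include <iostream>

int main() {
  boost::asio::io_service io_service;
  boost::asio::deadline_timer timer(io_service, boost::posix_time::seconds(1));
  bool wait_completed = false;  // Tracks whether WaitComplete already ran.

  timer.async_wait([&](const boost::system::error_code &error) {
    // Even after cancel(), a handler that is already queued may still run, and
    // the error code is not a reliable signal, so check our own flag instead.
    if (wait_completed) {
      return;
    }
    wait_completed = true;
    std::cout << "WaitComplete invoked exactly once" << std::endl;
  });

  // Elsewhere, satisfying the wait early would set wait_completed = true and
  // call timer.cancel(); the flag guarantees the completion logic runs once.
  io_service.run();
  return 0;
}
```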
This PR makes it so debugging logs are only evaluated during debugging. We found that for the current code, functions called in debug logging code are evaluated even in release mode (even though nothing is printed).
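A minimal sketch of the technique (not the actual RAY_LOG implementation): wrapping the stream in a branch that is never taken in release builds means the streamed expressions after `<<` are never evaluated there.
```c++
#include <iostream>

// Sketch only: a debug-logging macro whose arguments cost nothing in release
// (NDEBUG) builds because the branch guarding the stream is never taken.
#ifdef NDEBUG
#define SKETCH_LOG_DEBUG \
  if (false) std::cerr
#else
#define SKETCH_LOG_DEBUG std::cerr << "[DEBUG] "
#endif

int ExpensiveToFormat() {
  std::cout << "expensive call evaluated" << std::endl;
  return 42;
}

int main() {
  // In a release (NDEBUG) build, ExpensiveToFormat() is never called; in a
  // debug build it runs and the message is printed.
  SKETCH_LOG_DEBUG << "value = " << ExpensiveToFormat() << std::endl;
  return 0;
}
```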
* Trigger reconstruction in ray.wait and mark worker as blocked.
* Add test.
* Linting.
* Don't run new test with legacy Ray.
* Only call HandleClientUnblocked if it actually blocked in ray.wait.
* Reduce time to ray.wait in the test.
* use cmake to build ray project, no need to apply build.sh before cmake, fix some abuse of cmake, improve the build performance
* support boost external project, avoid using the system or build.sh boost
* keep compatible with build.sh, remove boost and arrow build from it.
* bugfix: parquet bison version control, plasma_java lib install problem
* bugfix: cmake, do not compile plasma java client if no need
* bugfix: component failures test timeout mechanism has a problem for the plasma manager failure case
* bugfix: arrow uses lib64 in centos; travis check-git-clang-format-output.sh does not support branches other than master
* revert some fix
* set arrow python executable, fix format error in component_failures_test.py
* make clean arrow python build directory
* update cmake code style, back to support cmake minimum version 3.4
This change addresses issue #2809. Test #2797 has been enabled for raylet and passes.
The following should happen when a driver exits (either gracefully or ungracefully):
* #2797 should be enabled and pass.
* Any actors created by the driver that are still running should be killed.
* Any workers running tasks for the driver should be killed.
* Any tasks for the driver in any node_manager queues should be removed (see the sketch after this list).
* Any future tasks received by a node manager for the driver should be ignored.
* The driver death notification should only be received once.
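A minimal sketch of the queue-cleanup item above (the `Task` struct and function name are hypothetical, not the actual node manager code):
```c++
#include <cstdint>
#include <list>

// Sketch only: drop every queued task that belongs to a driver that has exited.
struct Task {
  int64_t driver_id;
  int64_t task_id;
};

void RemoveTasksForDriver(std::list<Task> *queue, int64_t dead_driver_id) {
  queue->remove_if([dead_driver_id](const Task &task) {
    return task.driver_id == dead_driver_id;
  });
}

int main() {
  std::list<Task> ready_queue = {{1, 100}, {2, 200}, {1, 101}};
  RemoveTasksForDriver(&ready_queue, /*dead_driver_id=*/1);
  return ready_queue.size() == 1 ? 0 : 1;
}
```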
* Add signal handlers to improve debuggability.
* Fix Linux compiling
* Fix Lint
* Change SIGILL case that happens in both Linux and macOS
* Add signal handler to main functions.
* Change handler name.
* Address comment
* Address comment.
* Fix Linux building failure
* Introduce RAII mechanism to SignalHandlers.
* Add InitShutdownWrapper to handle all RAII requirements
* Change util_test to signal_test
* Make sure shutdown is not nullptr.
* Using google::InstallFailureSignalHandler() instead of our own signal handler
* Refine code according to comment
* Fix valgrind test failure.
* remove Shutdown template
* consistency
* linting
Basically a re-implementation of #2281, with the modifications of #2298 (a fix of #2334, redone due to rebasing issues).
[+] Implement sharding for GCS tables (a minimal shard-selection sketch follows this list).
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
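A minimal sketch of the shard-selection idea (hypothetical names, not the actual `AsyncGcsClient` code):
```c++
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Sketch only: entries of sharded tables are routed to a Redis shard by
// hashing their key, while ClientTable and ErrorTable updates always go to
// the primary shard.
size_t ShardIndexForKey(const std::string &key, size_t num_shards) {
  return std::hash<std::string>{}(key) % num_shards;
}

int main() {
  std::vector<std::string> shard_addresses = {"127.0.0.1:6380", "127.0.0.1:6381"};
  size_t index = ShardIndexForKey("object_id_abc", shard_addresses.size());
  return index < shard_addresses.size() ? 0 : 1;
}
```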
* Add some imports that make it easier to build with Bazel
* Use "/tmp" paths for sockets in tests
* Move `asio_test` into `run_gcs_tests.sh` instead of starting and stopping Redis within the test fixture with a `system` call.
1) Renamed the native JNI methods and some parameters of JNI methods.
2) Fixed native JNI methods' signatures with the `javah` tool.
3) Removed some useless native methods.
This removes the force_start argument from StartWorkerProcess in the worker pool so that no more than maximum_startup_concurrency workers are ever started concurrently. In particular, when the raylet starts up, it may start fewer than num_workers workers.
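A minimal sketch of the limit (a hypothetical helper, not the actual `WorkerPool` code):
```c++
#include <algorithm>
#include <cstddef>

// Sketch only: never launch more than maximum_startup_concurrency worker
// processes at once, even when the raylet boots with a large num_workers
// target; the remainder are started later as earlier ones register.
size_t NumWorkersToStart(size_t num_desired_workers, size_t num_starting_workers,
                         size_t maximum_startup_concurrency) {
  if (num_starting_workers >= maximum_startup_concurrency) {
    return 0;  // Already at the limit; wait for some workers to register.
  }
  return std::min(num_desired_workers,
                  maximum_startup_concurrency - num_starting_workers);
}

int main() {
  // On startup, 8 workers are desired but only 4 may be launched concurrently.
  return NumWorkersToStart(8, 0, 4) == 4 ? 0 : 1;
}
```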
* Limit number of concurrent workers started by hardware concurrency.
* Check if std::thread::hardware_concurrency() returns 0.
* Pass in max concurrency from Python.
* Fix Java call to startRaylet.
* Fix typo
* Remove unnecessary cast.
* Fix linting.
* Cleanups on Java side.
* Comment back in actor test.
* Require maximum_startup_concurrency to be at least 1.
* Fix linting and test.
* Improve documentation.
* Fix typo.
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as the sum of all resource demand induced by queued ready tasks on the local raylet (see the sketch after this list). This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.
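A minimal sketch of the load definition above (the types and function are hypothetical, not the actual scheduling-resources code):
```c++
#include <string>
#include <unordered_map>
#include <vector>

// Sketch only: the load reported in a heartbeat is the element-wise sum of the
// resource demand of every task in the local ready queue, not a task count.
using ResourceSet = std::unordered_map<std::string, double>;

ResourceSet ComputeReadyQueueLoad(const std::vector<ResourceSet> &ready_task_demands) {
  ResourceSet load;
  for (const auto &demand : ready_task_demands) {
    for (const auto &resource : demand) {
      load[resource.first] += resource.second;
    }
  }
  return load;
}

int main() {
  // Two queued tasks: one needs 1 CPU, the other needs 2 CPUs and 1 GPU, so
  // the reported load is {CPU: 3, GPU: 1}.
  std::vector<ResourceSet> demands = {{{"CPU", 1.0}},
                                      {{"CPU", 2.0}, {"GPU", 1.0}}};
  ResourceSet load = ComputeReadyQueueLoad(demands);
  return load["CPU"] == 3.0 && load["GPU"] == 1.0 ? 0 : 1;
}
```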
## What do these changes do?
Because the logic of generating `TaskID`s in Java was different from Python's, many tests failed when we changed the `Ray Core` code.
In this change, I rewrote the logic of generating `TaskID`s in Java so that it is the same as Python's.
In Java, we now call the native method `_generateTaskId()` to generate a `TaskID`, the same method that Python uses. We changed `computePutId()`'s logic too.
## Related issue number
[#2608](https://github.com/ray-project/ray/issues/2608)