Message boards : Questions and problems : Stop switching between WU
Joined: 8 Aug 08, Posts: 570
Is there a way to prevent BOINC from switching from one GPU WU to another within the same project? An option or a switch?
Joined: 8 Aug 08, Posts: 570
Within the same project is curious. I've only seen this in the past when there were deadline issues and the scheduler was trying to find the shortest task. I think 6.6, on the CPU scheduler side, now remembers how long a science version took to compute, so it won't have to test how long tasks take. The part I haven't figured out is whether it subsequently updates that value, since many projects have variable-runtime work units; Help Cure Muscular Dystrophy, to name one, can take an hour, quite a few hours, or anything in between, depending on parent or child/grandchild tasks.

It switches after anything from a couple of seconds to a few minutes. For SETI CUDA tasks, checkpoints aren't really needed, because the WUs only run for 2-12 minutes. In itself this is not too bad, but it probably doesn't remove the halted task from CPU and GPU memory, and that's bad. And totally crashing my system is even worse... At first I thought there was something wrong with my computer. The fix for now is to hold a low work buffer (< 3 days). That seems to work for now, but with the recent outage that's not the preferred setting.
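The runtime bookkeeping mentioned above (the client remembering how long a science version took and adjusting as variable-runtime results come in) can be sketched roughly as follows. This is an assumed model of a duration correction factor, not the actual BOINC client code; the asymmetric update (jump up fast on underestimates, decay down slowly on overestimates) and the 0.1 constant are illustrative assumptions:

```python
# Hedged sketch of a per-project duration correction factor (DCF).
# The exact update rule and constants are assumptions for illustration.
def update_dcf(dcf, estimated_secs, actual_secs):
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        # task took longer than the corrected estimate: correct at once
        return ratio
    # task finished sooner: move only part of the way down, so one
    # short result doesn't wreck the estimate for variable-runtime WUs
    return dcf + 0.1 * (ratio - dcf)

dcf = 1.0
# HCMD-style variable runtimes against a fixed 1-hour estimate
for actual in (3600, 7200, 14400, 3600):
    dcf = update_dcf(dcf, 3600, actual)
```

Under a rule like this, one long task immediately inflates the estimate, while a string of short tasks only gradually deflates it.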
Joined: 5 Oct 06, Posts: 5149
I've been following this thread and the related 'BOINC Scheduling Issue' from the sidelines for a while, without being much the wiser about what, if anything, the problem is, and what, if anything, to do about it. It hasn't even always been clear which projects people are talking about: but this is about SETI, so I hope I can add something constructive. First, some facts about SETI, and then some speculation about how relevant they may be.
Joined: 8 Aug 08, Posts: 570
I'm in for testing, but this gives some catastrophic results on my computer (XP x64). Right now it runs OK, and the waiting is not a problem... For some reason the CUDA exe is kept in memory: I counted 7 instances of the CUDA exe in memory. It is the 6.08 V11 VLARKILL version, so VLARs are not the cause. And after one of these crashes it started deleting the CUDA exe and WU. Luckily, CUDA only.

From visual observation, all CUDA tasks run for 2-12 minutes, 2 at a time. What I did was halt all CUDA WUs and then allow about 200 to run at a time, with no problems whatsoever. But when I try to run more at a time, CUDA tasks start to go into 'waiting to run', and not one but more than 100, until I see one of them going into fallback and finally the system freezes up completely. With a work buffer of 3 days it looks like a stable system; a lot more can't be handled by BOINC anyway, as BOINC becomes so SLOOOOOW. I added the FLOPS in the xml file, and that gives a more or less stable WU time, with no enormous variations as before. I can't do any reproducing anyway because I hit the 1000-WU limit. In my opinion a switch to not halt CUDA WUs would solve the problem. Halting a WU after 1 or 2 seconds makes no sense at all.
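For reference, the FLOPS tweak mentioned above is usually done by adding a `<flops>` element to the relevant app version entry in app_info.xml, so the client uses a fixed speed estimate instead of benchmarking. The element names below follow the usual app_info.xml layout, but the app name, version number and FLOPS figure are illustrative placeholders only, not recommendations:

```xml
<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>
    <!-- estimated speed of this app on this host, in FLOPS;
         the figure below is an illustrative placeholder -->
    <flops>25000000000</flops>
</app_version>
```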
Joined: 5 Oct 06, Posts: 5149
1) If "more than 100" are suspended at the same time, then that can't be "equal deadline". The usual maximum is 20, but even with recent feeder changes, the absolute maximum per scheduler request would be 97. And successive scheduler requests at SETI usually mean that the deadlines are separated by at least 20 seconds.

2) I think we need some clarification on 'memory'.

a) "CUDA exe is kept in memory". Surely that must refer to the main CPU application (which later launches the CUDA execution kernels), and system RAM. I don't know of any tool which can show the name of any program, and the size of its memory footprint, inside the CUDA/video RAM system. Ideas, anyone? My understanding is that a SETI task requires on the order of ~150 MB - ~200 MB of free graphics RAM: two can fit in my 512 MB cards, but not three. Holding larger numbers in memory, other than main system memory, would require a very large graphics card indeed.

b) After a couple of bodged starts, and a "nasty bug", I thought we had reached the point in v6.6.31 that graphics RAM (but not main system RAM) was invariably cleared by BOINC when pre-empting CUDA tasks, whatever the state of the 'leave apps in memory' flag. I didn't find any problems with graphics RAM over-filling when testing v6.6.31 and later: if there are situations where apps remain in graphics memory (suggested by that 'going into fallback'), then they need to be enumerated, the circumstances identified, and the cause reported as a bug.
Joined: 8 Aug 08, Posts: 570
1) If "more than 100" are suspended at the same time, then that can't be "equal deadline". The usual maximum is 20, but even with recent feeder changes, the absolute maximum per scheduler request would be 97. And successive scheduler requests at SETI usually mean that the deadlines are separated by at least 20 seconds.

2a) The exe that runs on the CPU is kept in memory, and I have never seen that happen on other computers. The cards have 1.8 GB between the two of them, so 6 would be OK; that is what I saw, with crashing around 7. That seems to indicate that the GPU memory is not freed.

2b) 'Leave in memory' is not checked. But I have another guess: at that point BOINC is running near 100% of one core (checked by my own program, which measures the user and kernel runtime every 2 seconds). It may be that at that point the programs go out of sync with each other, or there is not enough time to complete certain tasks fast enough. Maybe the CUDA CPU exe missed some command from BOINC and keeps on running, even though BOINC thinks it has stopped.
Joined: 23 Apr 07, Posts: 1112
b) After a couple of bodged starts, and a "nasty bug", I thought we had reached the point in v6.6.31 that graphics RAM (but not main system RAM) was invariably cleared by BOINC when pre-empting CUDA tasks, whatever the state of the 'leave apps in memory' flag. I didn't find any problems with graphics RAM over-filling when testing v6.6.31 and later: if there are situations where apps remain in graphics memory (suggested by that 'going into fallback'), then they need to be enumerated, the circumstances identified, and the cause reported as a bug.

I tried a couple of weeks ago to interest you in confirming what I had found when 6.6.33 was released, but I wasn't clear enough. Basically, if a Seti CUDA task gets pre-empted by a CUDA task that needs EDF (or by me just suspending the running task), the following happens:

i) If the pre-empted task has been running a little while, i.e. it has finished being fed by the CPU and has checkpointed, BOINC closes the app down (it disappears from Task Manager), and the next task starts normally (about 45 to 50% CPU load dropping to 0 to 2%, and the % doesn't count up until the GPU has finished being fed).

ii) If the pre-empted task has only just started being fed to the GPU, and hasn't got as far as a checkpoint, BOINC doesn't close the app down (it's still shown in Task Manager), and the next task starts in CPU fallback mode (50% CPU load, and the % starts counting straight away).

Claggy

Edit: BOINC 6.6.36 is the same.
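The two cases above suggest the client only evicts a pre-empted CUDA app once it has checkpointed. A minimal sketch of that observed behaviour, as an assumed model and not actual BOINC source:

```python
# Assumed model of the pre-emption behaviour described in cases i/ii:
# whether the CUDA app leaves memory depends on whether it has
# checkpointed since starting.
def on_preempt(task):
    if task["checkpointed"]:
        # case i: app is closed down; next task gets the video RAM
        task["in_memory"] = False
    else:
        # case ii: app lingers in (video) memory; the next task cannot
        # allocate graphics RAM and starts in CPU fallback mode
        task["in_memory"] = True

t1 = {"checkpointed": True, "in_memory": True}
t2 = {"checkpointed": False, "in_memory": True}
on_preempt(t1)
on_preempt(t2)
```

If this model is right, case ii is exactly the situation where the following task drops into CPU fallback.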
Joined: 5 Oct 06, Posts: 5149
OK, I can buy that. 'Preempt before checkpoint' has a "by design" alternative handling mechanism that doesn't involve removal from memory (for the benefit of science apps that don't checkpoint; question: are there any CUDA science apps in this day and age which don't checkpoint, and if so, why?).

So the next question is: why, and by what mechanism, would a CUDA app be preempted before its first checkpoint? I can manage once: app exits on DCF downslope - triggers work fetch - new (cached) task starts - work is allocated and downloaded - turns out to be VHAR - pre-empt on completion of download. But I can't explain seven, let alone hundreds. Do these multiple pre-empts coincide with downloads, or are they previously cached tasks? Need data.
Joined: 29 Aug 05, Posts: 15632
So the next question is - why, and by what mechanism, would a CUDA app be preempted before first checkpoint?

That's two questions. ;-) As to why, here are the change logs:

6.6.3:
- client: when preempting a process, remove it from memory if:
  1) it uses a coprocessor
  2) it has checkpointed since the client started
  3) it's being preempted because of a user action (suspend job, project, or all processing) or user preference (time of day, computer in use)

6.6.12:
- client: fix bug where if a GPU job is running, and a 2nd GPU job with an earlier deadline arrives, neither job is executed ever. Reorganized things so that scheduling of GPU jobs is done independently of CPU jobs. The policy for GPU jobs:
  * always EDF
  * jobs are always removed from memory, regardless of checkpoint (GPU memory is not paged, so it's bad to leave an idle app in memory)

6.6.23:
- client: instead of scheduling coproc jobs EDF:
  * first schedule jobs projected to miss deadline in EDF order
  * then schedule remaining jobs in FIFO order
  This is intended to reduce the number of preemptions of coproc jobs, and hence (since they are always preempted by quit) to reduce the wasted time due to checkpoint gaps.

As to what does it? CPU_Sched.cpp.
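The 6.6.23 ordering quoted above can be sketched in a few lines. This is a hedged illustration of the stated policy, not the real CPU_Sched.cpp logic, and the job field names are assumptions:

```python
# Sketch of the 6.6.23 coproc scheduling order: jobs projected to miss
# their deadline run first, in EDF (earliest deadline first) order;
# the remaining jobs follow in FIFO (arrival) order.
def schedule_coproc_jobs(jobs):
    late = sorted((j for j in jobs if j["projected_finish"] > j["deadline"]),
                  key=lambda j: j["deadline"])        # EDF
    on_time = sorted((j for j in jobs if j["projected_finish"] <= j["deadline"]),
                     key=lambda j: j["arrival"])      # FIFO
    return late + on_time

jobs = [
    {"name": "a", "deadline": 100, "projected_finish": 120, "arrival": 3},
    {"name": "b", "deadline": 50,  "projected_finish": 60,  "arrival": 2},
    {"name": "c", "deadline": 500, "projected_finish": 90,  "arrival": 1},
]
order = [j["name"] for j in schedule_coproc_jobs(jobs)]  # ['b', 'a', 'c']
```

Note that Python's sort is stable, so jobs with identical deadlines keep their input order within the EDF group, which is one plausible answer to the "undefined EDF order" question raised below, though the real client's tie-breaking may differ.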
Joined: 5 Oct 06, Posts: 5149
* first schedule jobs projected to miss deadline in EDF order

And if EDF order is undefined (because multiple jobs have identical deadlines)? ;-)
Joined: 29 Aug 05, Posts: 15632
Imply logic... but we're missing some Vulcan code there. ;-)
Joined: 8 Aug 08, Posts: 570
And on another computer I see the same happening: 3 CUDA exes in memory (this GPU card has less memory), with the CUDA exe gone into fallback mode. One at 31 seconds, one at 40 seconds, one at 3:37 waiting, and one running but not on the CUDA card. This happened when I tweaked the FLOPS on that computer a bit to better represent the actual WU time; it got more work, probably too much to handle. After restarting BOINC the CUDA tasks run as normal. You've got a serious problem with this version.
Joined: 23 Apr 07, Posts: 1112
6.6.12: So why is it that if you suspend a Seti CUDA task before it checkpoints, it's still left in memory, causing the very next Seti CUDA task to fall into CPU fallback mode?

Claggy
Joined: 5 Oct 06, Posts: 5149
6.6.12: If you can prove that, it'd be a reportable bug.
Joined: 23 Apr 07, Posts: 1112
6.6.12: I can reproduce it by suspending CUDA tasks manually, but I can't reproduce it by getting tasks pre-empted automatically by shorties, as I don't have a 10+ day CUDA cache with lots of shorties in it.

Claggy
Joined: 8 Aug 08, Posts: 570
Is this now an official bug? I hope someone at SETI takes this bug seriously, as it is easily reproduced. It looks like a bug, it walks like a bug, it behaves like a bug, so it may not be a bug? Vulcan logic. And bugs are afraid of programmers, so they are the last ones to find them.
Joined: 26 Jun 09, Posts: 8
I've been crunching 45 CUDA WUs for a week now, waiting for a fix for this bug. The real problem is that when BOINC jumps from one WU to another, it goes back a little to resume processing, and with all the switching it takes so much more time that it isn't worth it at all; i.e., the current RAC for this PC is the same as without CUDA. Shall I keep it running like this, reset the project, or go back to 6.4.7?

Eduardo
Joined: 20 Dec 07, Posts: 1069
Shall I keep it running like this, reset the project or go back to 6.4.7?

You could do "a little" handiwork ;-) Set 'No new tasks' and suspend all but a few CUDA tasks (you'll have to try out how many you can leave running/waiting). Then wait until that bunch has finished and resume the next. You'll have problems uploading the finished results, however, since the server bandwidth is maxed out.

Regards, Gundolf

Computers aren't everything in life. (Just a little joke)
Joined: 23 Apr 07, Posts: 1112
6.6.12: Well, I posted to the BOINC alpha list, and got this back:

- client: when suspending a GPU job, always remove it from memory, even if it hasn't checkpointed. Otherwise we'll typically run another GPU job right away, and it will bomb out or revert to CPU mode because it can't allocate video RAM

I think it's an improvement; at least we shouldn't see tasks dropping into CPU fallback mode so much. It might need a bit more to stop CUDA tasks switching so much when they go EDF.

Claggy
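The fix quoted above changes the earlier checkpoint-dependent behaviour into an unconditional eviction for GPU jobs. A hedged sketch of the new rule, as an assumed model rather than the actual client code:

```python
# Assumed model of the fixed suspend behaviour: GPU jobs are always
# removed from memory (video RAM isn't paged, so an idle resident app
# blocks the next job's allocation); CPU jobs may stay resident if the
# 'leave applications in memory' preference is set.
def on_suspend(task):
    if task["uses_gpu"]:
        task["in_memory"] = False            # always evict GPU jobs
    elif not task["leave_in_memory"]:
        task["in_memory"] = False            # CPU jobs honour the pref

gpu_task = {"uses_gpu": True, "leave_in_memory": True, "in_memory": True}
cpu_task = {"uses_gpu": False, "leave_in_memory": True, "in_memory": True}
on_suspend(gpu_task)
on_suspend(cpu_task)
```

With this rule, the freshly suspended CUDA app no longer holds video RAM, so the next GPU task should be able to allocate it instead of dropping to CPU fallback.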
Joined: 26 Jun 09, Posts: 8
Thanks, Gundolf; this is exactly what I'm doing, and I'll keep doing it until I clear all unfinished CUDA tasks. Then I'll reset the project to see if it returns to "normal" mode. The handiwork is no big deal; even though I'm not an expert, this is fun.

My complaint is about the waste of computing power, since it's costing too much to crunch a CUDA WU (up to 7 hours!), and this has affected performance on CPU WUs as well, up to 3 times the normal time (I estimate the "normal" time by comparing to other WUs from the same batch). My estimate is that I'd be doing more work running only CPU WUs until this bug gets a fix. I'll keep trying anyway while waiting for a fix.

Thanks again, Eduardo

P.S.: this PC was on a rising curve at 900 RAC, heading for 1300, but now it's at 650 and going further down, at less than 50% of its capacity.
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.