Message boards : BOINC client : CUDA init failures continue to flush boinc queue
Author | Message |
---|---|
Joined: 27 Jun 08 Posts: 641 |
I reposted this from the GPUGRID forum: I got another initialization error, this time after about 2-3 weeks of processing WUs without a hiccup. Once the initialization error occurs, my 9800 GTX+ can no longer process WUs; it quickly runs through all the WUs in the queue and, within an hour, all the ones available for the day. That is, they download and quickly get a compute error, and this keeps up until the daily quota is hit. This then repeats until I notice the problem and cycle the power off and on. I mentioned this several weeks ago and even posted in the CUDA forum asking how to reset the NVIDIA board without having to power off. The suggestion on the CUDA forum was that the graphics board had a hard lockup and needed a power cycle. Anyway, it would be nice if the next version of BOINC or the GPUGRID app handled an initialization error by stopping GPU processing until the NVIDIA board responded. I am using 6.4.1 and will try 6.4.5, as I see it is out. This board, a 9800 GTX+, is not used for any gaming. |
Joined: 29 Aug 05 Posts: 15585 |
Forwarded to developers. |
Joined: 29 Aug 05 Posts: 15585 |
David Anderson wrote: I'll look into this; there should be a way to make sure |
Joined: 27 Jun 08 Posts: 641 |
The following is how I found out there was an initialization error: ====from a compute error==== ====note that the gpu is not identified==== stderr out <core_client_version>6.4.1</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # Using CUDA device 0 Cuda error in file 'deviceQuery.cu' in line 59 : initialization error. </stderr_txt> ]]> ====from a good task==== ====note the gpu is identified==== stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> # Using CUDA device 0 # Device 0: "GeForce 9800 GTX/9800 GTX+" # Clock rate: 1836000 kilohertz # Number of multiprocessors: 16 # Number of cores: 128 MDIO ERROR: cannot open file "restart.coor" # Time per step: 49.449 ms # Approximate elapsed time for entire WU: 42031.433 s called boinc_finish </stderr_txt> ]]> ====== The restart.coor error must be a warning or informational message of some kind, since I see it in every task that completes. It appears the CUDA software is not responding, since the core app cannot even get the NVIDIA board name. System: Q6700 with Vista-64 and NVIDIA driver 7.15.11.7824 dated 10/7/2008. |
Joined: 27 Jun 08 Posts: 641 |
I am still having this problem: occasionally I get a CUDA initialization error, and when that occurs all subsequent WUs immediately get the same error, so I end up unable to get WUs because of the daily quota limit. Here is what might be happening and what I have done to debug it. At first I could not get temperatures or memory utilization for the GPU. I downloaded the NVIDIA tools (System Monitor), but neither memory utilization nor temperatures were available. This seems to be a reported Vista problem. So I cannot yet say one way or the other whether the GPU temperature is too high. SpeedFan shows 4 core temps ranging from 72C to 80C but does not reveal GPU temps. Only GPU tasks are affected, so the CPU temps are not causing the problem. Looking at the times when the first initialization error occurred, they seem to correspond to when the Microsoft automatic update runs and my system reboots. This happens at 3am each morning, but not all updates end in a reboot. I cannot easily prove that the initialization error occurs after a Microsoft update reboot, but I have stopped the automatic updates and will see if this helps. |
Joined: 29 Aug 05 Posts: 15585 |
If you want to keep an eye on the GPU temperature, download and run GPU-Z. It'll also tell you how much the GPU is under load and can log everything to a file on the disk. |
Joined: 27 Jun 08 Posts: 641 |
"If you want to keep an eye on the GPU temperature, download and run GPU-Z. It'll also tell you how much the GPU is under load and can log everything to a file on the disk." Thanks, that download worked. The temp is 67C, less than the 4 CPU core temps as reported by SpeedFan. Earlier I tried something called "power strip", and that was a disaster: it flashed the screen with weird colors and required an immediate reset to get control back. |
Joined: 5 Mar 08 Posts: 272 |
"If you want to keep an eye on the GPU temperature, download and run GPU-Z. It'll also tell you how much the GPU is under load and can log everything to a file on the disk." The NVIDIA drivers (at least under XP) install a program called vtune, which starts up in the system tray. One of its more useful features is to show the GPU temp. MarkJ |
Joined: 27 Jun 08 Posts: 641 |
"If you want to keep an eye on the GPU temperature, download and run GPU-Z. It'll also tell you how much the GPU is under load and can log everything to a file on the disk." I did manage to get the NVIDIA tools to load and show the temp. However, the lag was terrible, and I was lucky to exit that NVIDIA program ("nvidia system monitor") without having to bring up the Task Manager to kill it. The latest version of SpeedFan, 4.37, shows GPU temps and can log them. I have not tried vtune yet. |
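
The first post asks that GPU processing stop once an initialization error appears, instead of the queue and the daily quota being burned through. A rough sketch of that idea follows, using only the two stderr samples quoted in this thread. This is not BOINC's or GPUGRID's actual code: the names `is_cuda_init_error`, `device_name` and `GpuWorkGate` are hypothetical, and the patterns are assumptions based solely on the stderr text shown above.

```python
import re

# Assumption: the failing task's stderr always contains a line shaped like
# the one quoted in this thread ("Cuda error in file ... initialization error").
INIT_ERROR_RE = re.compile(
    r"Cuda error in file '[^']*' in line \d+ : initialization error",
    re.IGNORECASE,
)

# Assumption: a healthy task prints the board name as: # Device 0: "<name>"
DEVICE_RE = re.compile(r'# Device \d+: "([^"]+)"')


def is_cuda_init_error(stderr_text):
    """Return True if the stderr text contains the CUDA initialization error."""
    return INIT_ERROR_RE.search(stderr_text) is not None


def device_name(stderr_text):
    """Return the GPU name the app reported, or None if it printed none."""
    match = DEVICE_RE.search(stderr_text)
    return match.group(1) if match else None


class GpuWorkGate:
    """Hypothetical gate that blocks new GPU tasks after an init error."""

    def __init__(self):
        self.suspended = False

    def record_result(self, stderr_text):
        # Suspend on the first initialization error; resume only once a
        # task has identified the GPU again (for example after a power cycle).
        if is_cuda_init_error(stderr_text):
            self.suspended = True
        elif device_name(stderr_text) is not None:
            self.suspended = False

    def may_start_gpu_task(self):
        return not self.suspended
```

A client using something like this would call `record_result` with each finished task's stderr and check `may_start_gpu_task` before launching or fetching further GPU work, so a single failed task would pause the GPU queue rather than flush it.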
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.