Message boards : BOINC client : Benchmarking bug - indefinite suspension of computing
Author | Message |
---|---|
Send message Joined: 25 Nov 05 Posts: 1654 ![]() |
Would it be possible to have some code for BOINC to get the time from a net clock somewhere, and run its own clock? Do it each time it starts, and just before a benchmark. Perhaps compare it to the system clock each of these times and store an offset, rather than run its own clock permanently. Although I do think fiddling with the system clock instead of looking at a real calendar is weird. Even a perpetual calendar in the computer would be better. |
Send message Joined: 30 Dec 05 Posts: 475 ![]() |
Therefore by default, even if running BOINC projects was not the original objective, BOINC is the computer's primary function. So why not let them use a BOINC project to check, and if desired reset, the clock? BOINC is supposed to run quietly in the background, as you put it. But if the computer's date/clock is wrong then, as described earlier, BOINC stops running, or the scheduler could be in difficulties: units past deadline, or units that should be in priority mode and are not. Windows and other OSes have options to sync with an NTP server, but Windows only does it once a week and will NOT adjust if the clock is more than 15 hours out. And I don't really understand this idea of BOINC running its own clock: where is its point of reference on a computer that has been off, for any reason, and is on limited-use dial-up? I agree that on a business computer adjusting the clock is not a good idea, and that it is probably time-synced to a local server anyway. But do you want BOINC to run when it should be running on your computer? Do you want the scheduler to run the correct units in the correct order? Would you like a piece of software that spots your stupid mistakes? (If you don't make stupid mistakes, I suppose we had better tell SETI we just found ET.) @Les, as my friend just pointed out, how do you check the day of the week for 10 April 2005 if all you have is the computer in front of you? (Customer with receipt, faulty item with a 3-year guarantee, business does not open on Sundays.) |
Send message Joined: 19 Jan 07 Posts: 1179 ![]() |
@Les, Not our fault that Windows didn't get that right until Vista. |
Send message Joined: 30 Dec 05 Posts: 475 ![]() |
My friend's problem was on a Windows PC running at a point of sale, with access to the company network only and no internet connection. They have a normal wall calendar, but that only shows the current, last and next years. Without leaving your position (and would you, with a person probably making a fraudulent claim?), what calendar do you suggest? And also, if in an office, why bother with the internet? Most computers have an office suite, and all, as far as I know, have something similar to the MS Word calendar wizard. Lotus had it before MS in the early 90s. In your 'design' of the BOINC clock you make no mention of the network not being available, or of what happens when the computer is switched off; these are the points that confuse me. Also, I am not saying these clock inaccuracy messages and/or corrections should be mandatory, but I do think it would be a good option. |
Send message Joined: 25 Nov 05 Posts: 1654 ![]() |
OK, there are 2 ways to check a date: 1) A perpetual calendar (1.5 million web sites!), such as Calendars for the Years 1901 to 2100, and Calendarhome.com. The 2nd looks interesting: two-thirds of the way down it has a link to a Day-of-Week Calculator. 2) a) (Menu) Suspend BOINC. b) (Menu) Exit BOINC. c) Fiddle with clock. d) Reset clock to correct time. e) Restart BOINC. f) Set BOINC to Run. *************** I've just seen your latest post. I don't think that running BOINC on a point-of-sale computer is a terribly good idea. Companies get a bit narky about this sort of thing. |
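For the date lookup discussed in the posts above, the day of the week can be computed directly, without touching the system clock at all. The following is a minimal C++ sketch using Sakamoto's method for Gregorian dates. The function name `day_of_week` is made up for illustration and is not part of BOINC.

```cpp
#include <cassert>

// Day of week for a Gregorian date, using Sakamoto's method.
// Returns 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
// day_of_week is a hypothetical helper, not a BOINC function.
int day_of_week(int year, int month, int day) {
    assert(month >= 1 && month <= 12);
    // Per-month offsets that fold the varying month lengths into one table.
    static const int offsets[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    // January and February are treated as part of the previous year,
    // so that a leap day is counted only for dates after it.
    if (month < 3) year -= 1;
    return (year + year / 4 - year / 100 + year / 400 + offsets[month - 1] + day) % 7;
}
```

With this sketch, `day_of_week(2005, 4, 10)` evaluates to 0, i.e. a Sunday, which answers the 10 April 2005 question raised earlier in the thread.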
Send message Joined: 5 Oct 06 Posts: 5144 ![]() |
OK, there's 2 ways to check a date: There's a third way: 3) a) Double-click on clock in system tray. b) Fiddle with clock. c) Click 'cancel'. *************** I've just seen your latest post. Andy's friend wouldn't have been running BOINC on the POS, because it has no internet connection. For the same reason, he/she wouldn't have been able to Google for any of the proper tools. I agree with everything that's been said about perpetual calendars. However, everyone who posts here is by definition a Nerd or a Geek, and we understand about things like system integrity. Microsoft, on the other hand, has spent 12 years (since the release of Windows 95) providing end users with a little facility which looks and feels like a perpetual calendar, and which is always guaranteed to be visible onscreen and one double-click away from use (unless you're one of those people who hide the taskbar). Think back twelve years: would you have Googled for a perpetual calendar then? With all the overhead of establishing the dial-up internet connection first? It's no wonder that people have got into an ingrained habit of (ab)using the system clock for date look-ups. And because it's there, and because it's a habit, people will go on using it: and it will go on being a problem until the last copy of Windows XP is consigned to the great bit-bucket in the sky. BOINC just has to live in the real world. Having a robust, independent, self-validating, self-correcting internal time reference for BOINC is obviously the way forward. But my betting is that that isn't going to be in place this year, for all the reasons that people who've got knowledge of the internal code of BOINC have explained already. In the meantime, can I remind you yet again that there is something that causes an indefinite hang between "Suspending computation - running CPU benchmarks" and "[benchmark_debug] Starting floating-point benchmark"? Isn't that worth solving? |
Send message Joined: 30 Dec 05 Posts: 475 ![]() |
If BOINC has knowledge that the host's date/time has changed, how come this Odd graph picture is allowed to happen? And regarding the problems that have been mentioned for starting BOINC without the network being available: it could be made to time-sync at the next connection, probably some time in the next 24 hrs. And if the code is so bad that it needs a clock rather than a stopwatch at benchmark time, then only run benchmarks after the next resync at the next connection. The clock can be adjusted while BOINC is up and running and cause some of these problems. |
![]() Send message Joined: 3 Apr 06 Posts: 547 ![]() |
In the meantime, can I remind you yet again that there is something that causes an indefinite hang between For sure it is. (If I could just correctly link my compiled client executable...) Peter |
Send message Joined: 30 Dec 05 Posts: 475 ![]() |
If BOINC has knowledge that the host's date/time has changed, how come this Odd graph picture is allowed to happen? I disagree because, from what you proposed, if the computer does not have access to an NTP server at startup it will not start. In fact, in a lot of cases it probably will not start, because a lot of internet security programs inhibit network access until they have completed their checks. |
Send message Joined: 19 Jan 07 Posts: 1179 ![]() |
BOINC does not have knowledge that the host's date/time has changed. The changes I propose would give BOINC that knowledge. The changes you propose would not give BOINC that knowledge. I disagree because, from what you proposed, if the computer does not have access to an NTP server at startup it will not start. In fact, in a lot of cases it probably will not start, because a lot of internet security programs inhibit network access until they have completed their checks. How do you suggest a program can know when the time changes? 1. Get the current time. 2. Wait 1 second. 3. Get the current time again. 4. If the two time measurements differ by more than 2 seconds (or if the one at 3. is *lower* than the one at 1.), the time changed, so you know you need to do some corrections. That looks like it should work. But nope! What if you suspend your computer? When the computer gets out of suspend mode, it gets the time in step 3., and it would differ by some hours. And it's also possible that it fails in a normal case, depending on how the "wait 1 second" works internally. |
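The four steps in the post above can be sketched in C++ by comparing how far the wall clock moved with how far a monotonic clock moved over the same interval, instead of relying on the "wait 1 second" being exact. This is only an illustration, not BOINC code: `ClockSample`, `sample_clocks`, `elapsed_mismatch` and `clock_was_adjusted` are hypothetical names, and the 2-second tolerance is taken from step 4. Whether `std::chrono::steady_clock` keeps counting while the machine is suspended depends on the platform, so the suspend case raised in the post can still look like a clock change.

```cpp
#include <chrono>
#include <cmath>

// True when the wall clock and the monotonic clock disagree about how much
// time passed (both values in seconds). Hypothetical helper, not BOINC code.
bool elapsed_mismatch(double wall_elapsed, double monotonic_elapsed,
                      double tolerance = 2.0) {
    return std::fabs(wall_elapsed - monotonic_elapsed) > tolerance;
}

// One reading of both clocks, taken at (nearly) the same moment.
struct ClockSample {
    std::chrono::system_clock::time_point wall;  // can be changed by the user
    std::chrono::steady_clock::time_point mono;  // never goes backwards
};

ClockSample sample_clocks() {
    return {std::chrono::system_clock::now(), std::chrono::steady_clock::now()};
}

// Compare two samples: a large difference between wall-clock elapsed time
// and monotonic elapsed time suggests the system clock was adjusted.
bool clock_was_adjusted(const ClockSample& before, const ClockSample& after,
                        double tolerance = 2.0) {
    std::chrono::duration<double> wall = after.wall - before.wall;
    std::chrono::duration<double> mono = after.mono - before.mono;
    return elapsed_mismatch(wall.count(), mono.count(), tolerance);
}
```

A caller would take one sample at startup (or at the previous check) and another at the next check. A backwards jump shows up as a negative wall-clock interval, which the absolute difference also catches.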
Send message Joined: 25 Nov 05 Posts: 1654 ![]() |
In the event that net access is not available for some reason (another reason: a laptop that is traveling without access, such as on a ship), how about issuing a warning message (popup?), something like: "The system clock appears to have changed, and BOINC needs to access the internet to get the correct time. Continuing for a long period without doing this may cause WUs to be late, and rejected." Perhaps with an option to manually enter the correct time. |
Send message Joined: 30 Dec 05 Posts: 475 ![]() |
I'm glad we got that sorted. |
Send message Joined: 5 Oct 06 Posts: 5144 ![]() |
Could anyone reading this thread comment on how/when BOINC updates its time stats, please? When I performed the clock forward / clock back experiment that started this thread, one of the side effects that I noticed was a drop in time metrics: <active_frac>0.045279</active_frac> Six days later, with 24/7 BOINC running (v5.10.45 service install under Windows XP - neither BOINC nor the computer has been restarted), the active_frac remains exactly the same to six decimal places - compare the code above, which is a current paste from client_state, with the figure in my message 16142. ??? I would have expected a similar sort of trap-door function as TDCF - in this case, quick to fall and slow to rise, but no recovery at all? The frac is so low that even on this medium-speed machine (2.0GHz P4), new Einstein tasks at 50% share go immediately into high priority. If, as I suspect, there's no automatic recovery mechanism (or a broken mechanism) for active_frac, that might explain why so many people report problems with high priority and cache sizes on the various project message boards. [Please, no sticking-plaster replies: I know what to change and how to change it, but I'm researching whether there's a need to put in another bug report] |
Send message Joined: 5 Oct 06 Posts: 5144 ![]() |
Why don't you post the whole top section of the client_state.xml + some project DCF's so we can have an integral view rather than this step by step guessing game. From current client_state.xml: <host_info> <timezone>3600</timezone> <domain_name>ANONYMOUS</domain_name> <ip_addr>192.168.173.13</ip_addr> <host_cpid>e90761a879d5bf174f2e7e32671872db</host_cpid> <p_ncpus>1</p_ncpus> <p_vendor>GenuineIntel</p_vendor> <p_model> Intel(R) Pentium(R) 4 CPU 2.00GHz [x86 Family 15 Model 2 Stepping 4]</p_model> <p_features>fpu tsc sse sse2 mmx</p_features> <p_fpops>1050903119.868637</p_fpops> <p_iops>1698914891.321735</p_iops> <p_membw>1000000000.000000</p_membw> <p_calculated>1207315872.154749</p_calculated> <m_nbytes>536133632.000000</m_nbytes> <m_cache>1000000.000000</m_cache> <m_swap>1310920704.000000</m_swap> <d_total>39990591488.000000</d_total> <d_free>5179965440.000000</d_free> <os_name>Microsoft Windows XP</os_name> <os_version>Home Edition, Service Pack 2, (05.01.2600.00)</os_version> <accelerators>NVIDIA GeForce3 Ti 200</accelerators> </host_info> <time_stats> <on_frac>0.805489</on_frac> <connected_frac>-1.000000</connected_frac> <active_frac>0.045279</active_frac> <cpu_efficiency>0.937373</cpu_efficiency> <last_update>1209561940.387707</last_update> </time_stats> <net_stats> <bwup>6416.736822</bwup> <avg_up>29442415.841511</avg_up> <avg_time_up>1207383404.717249</avg_time_up> <bwdown>61101.566988</bwdown> <avg_down>1165019550.171246</avg_down> <avg_time_down>1207380038.842249</avg_time_down> </net_stats> Einstein: <duration_correction_factor>0.342887</duration_correction_factor> SETI: <duration_correction_factor>0.260244</duration_correction_factor> (both with Power/Optimised apps, respectively). No other Project entries. 
From a client_state.xml.bak file dated 26 May 2006 - must have been the last time I used BoincDV to reset debts: <time_stats> <on_frac>0.998095</on_frac> <connected_frac>1.000000</connected_frac> <active_frac>0.999851</active_frac> <cpu_efficiency>0.949493</cpu_efficiency> <last_update>1148667485.875000</last_update> </time_stats> - I would judge that to be pretty normal for this machine: _efficiency @ ~95% is partly because it's the BoincView monitoring host for my LAN. Since there is no easy way to monitor changes in debt values over time (the subject of a different bug report), I wrote myself a small utility to record and graph project debt values. I'll adapt it to log time_stats over time, and report next weekend. Any other tags you would like a time series for? Edit - here are the Einstein tasks for this host. The report/fetch contact at 5 Apr 2008 11:45:46 UTC today was triggered as the LTD from the last high priority run rose above -3600. Initial metrics for the new task are: Computation time to completion: 16 hours 34 minutes 'Work buffer' from BoincView: 16 days 12 hours Project shares are equal (50% - 100::100) Task immediately went into high priority: 05/04/2008 12:43:48|Einstein@Home|Sending scheduler request: To fetch work. Requesting 1490 seconds of work, reporting 1 completed tasks 05/04/2008 12:43:58|Einstein@Home|Scheduler request succeeded: got 1 new tasks 05/04/2008 12:44:00|Einstein@Home|Starting h1_0907.30_S5R3__78_S5R3b_0 05/04/2008 12:44:08|Einstein@Home|Starting task h1_0907.30_S5R3__78_S5R3b_0 using einstein_S5R3 version 436 2nd. edit: Another observation - I run a 0.01 day CI, and a 1 day AC. That work fetch would normally be for (87264 - ε) seconds, where ε = ~4,363 seconds for a 95% CPU efficiency. That demonstrates how the active_frac corruption impinges on work fetch. |
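The arithmetic in the second edit above can be checked with a few lines of C++. This is a sketch of the numbers in the post only, under the post's own description that a normal request is the buffer size reduced by the CPU-efficiency loss. The function names are hypothetical and this is not how the BOINC client itself computes a work request.

```cpp
#include <cassert>

// Seconds of work needed to fill the buffer: connect interval (CI) plus
// additional cache (AC), both given in days. Hypothetical helper.
double buffer_seconds(double ci_days, double ac_days) {
    assert(ci_days >= 0 && ac_days >= 0);
    return (ci_days + ac_days) * 86400.0;
}

// The post's expectation for a healthy host: the buffer minus the share
// lost to CPU inefficiency, i.e. buffer * cpu_efficiency.
double expected_request(double buffer, double cpu_efficiency) {
    return buffer * cpu_efficiency;
}

// Fraction of the expected request that was actually asked for.
double request_ratio(double actual_request, double expected) {
    assert(expected > 0);
    return actual_request / expected;
}
```

For a 0.01 day CI and a 1 day AC the buffer is 87,264 seconds. At 95% efficiency the loss is about 4,363 seconds, leaving roughly 82,900 seconds. The logged request of 1490 seconds is under 2% of that figure.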
Send message Joined: 5 Oct 06 Posts: 5144 ![]() |
For now, suggest to exit BOINC, open client_state.xml with ASCII text-editor and set that value to the March 26 one. Also set the DCF's to 1.000000 so at least crunching and work fetching return to normality. The DCF's are indicative of the situation slowly returning to normality.... right now BOINC figures things complete much faster than the other parms indicate. That's exactly the reply I was trying to avoid. In my original post (message 16467), I put a footnote in small print. If you had followed recommended Forum practice, and used 'Reply to Post' (for threading purposes), instead of 'Post to Thread', you would have seen it. Since you clearly missed it, here it is for the visually-impaired: Please, no sticking-plaster replies: I know what to change and how to change it, but I'm researching whether there's a need to put in another bug report |
Send message Joined: 5 Oct 06 Posts: 5144 ![]() |
OK, this is a new post - the previous poster wasn't worth replying to, so I won't reply. He was right, however, to say that I have all the data available, and he was also right to say that I should have posted the whole story. The key datum is <last_update>1209561940.387707</last_update> in the <time_stats> in my post of 11:29 UTC. Using http://www.onlineconversion.com/unix_time.htm, that equates to Wed, 30 Apr 2008 13:25:40 UTC - still 25 days in the future. Here are lines 127-151 of time_stats.C: // Update time statistics based on current activities // NOTE: we don't set the state-file dirty flag here, // so these get written to disk only when other activities // cause this to happen. Maybe should change this. // void TIME_STATS::update(int suspend_reason) { double dt, w1, w2; bool is_active = !(suspend_reason & ~SUSPEND_REASON_CPU_USAGE_LIMIT); if (last_update == 0) { // this is the first time this client has executed. // Assume that everything is active on_frac = 1; connected_frac = 1; active_frac = 1; first = false; last_update = gstate.now; log_append("power_on", gstate.now); } else { dt = gstate.now - last_update; **if (dt <= 10) return;** w1 = 1 - exp(-dt/ALPHA); // weight for recent period w2 = 1 - w1; // weight for everything before that // (close to zero if long gap) (sorry, I can't use for code, because of the indent bug on these boards) I call BUG at line 148 (highlighted). This contains an implied assumption that time is always monotonic (i.e. the clock hasn't been fiddled with - which is where we came in). The intention of the test is clearly to reduce workload by only re-calculating active_frac at intervals of 10 time_units or more: the effect is to inhibit updating following a clock-fiddle until (MAX(clock) + 10) is reached. The test should be if (ABS(dt) <= 10) return; (or whatever the C construct is - sorry, I'm a VB programmer) Now, I suppose it's up to me to find the line number of the original benchmarking bug. 
Correction: the 10 time_unit test is at line number 69 of http://boinc.berkeley.edu/trac/browser/trunk/boinc/client/time_stats.C?rev=4610 - I got the first number from my Visual Studio editor, working on the text version of the file which the BOINC/Wiki search function found first. |
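In C and C++ the absolute-value call the post is looking for is `fabs` (`std::fabs` in C++). One caveat is worth sketching: if a negative `dt` simply passes the test, it then flows into `w1 = 1 - exp(-dt/ALPHA)`, which is negative whenever `dt` is negative and `ALPHA` is positive. The sketch below therefore treats a backwards clock as a separate case. All names here (`UpdateAction`, `classify_update`, `recent_weight`, `kMinUpdateInterval`) are hypothetical and are not taken from the BOINC source. The idea of resynchronising `last_update` is an assumption about what a fix could do, not a description of what BOINC does.

```cpp
#include <cassert>
#include <cmath>

// Minimum gap between recalculations, in seconds (the 10 from the post).
constexpr double kMinUpdateInterval = 10.0;

enum class UpdateAction {
    Skip,    // too soon since the last update; do nothing
    Resync,  // clock moved backwards; last_update is in the future
    Update   // enough time has passed; recompute the fractions
};

// Decide how to treat one call, given the current time and the stored
// last_update timestamp (both in seconds since the epoch).
UpdateAction classify_update(double now, double last_update) {
    double dt = now - last_update;
    if (dt < 0) return UpdateAction::Resync;
    if (dt <= kMinUpdateInterval) return UpdateAction::Skip;
    return UpdateAction::Update;
}

// Weight for the recent period, as in the quoted source: 1 - exp(-dt/alpha).
// Shown separately to make the negative-dt problem visible.
double recent_weight(double dt, double alpha) {
    assert(alpha > 0);
    return 1.0 - std::exp(-dt / alpha);
}
```

On `Resync` a caller could set `last_update` to the current time and leave the fractions alone, so that the next call more than 10 seconds later updates normally instead of waiting until the old future timestamp is reached.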
Send message Joined: 5 Oct 06 Posts: 5144 ![]() |
Found it! (Subject to checking and validation - please confirm) http://boinc.berkeley.edu/trac/browser/trunk/boinc/client/cs_benchmark.C?rev=12128 307 bool CLIENT_STATE::cpu_benchmarks_poll() { 308 int i; 309 static double last_time = 0; 310 if (!benchmarks_running) return false; 311 312 if (now < last_time + 1) return false; 313 last_time = now; 314 315 active_tasks.send_heartbeats(); If benchmarks have been run in the current BOINC session, at some time in the future (as a result of the clock fumbling we've been talking about), the static variable last_time will have been initialised and will have a value of, for example, Wed, 30 Apr 2008 13:25:40 UTC. So the test at line 312 will be satisfied, and the application will loop until the cows come home (or Wed, 30 Apr 2008 13:25:40 UTC, whichever comes sooner). That explains why exiting BOINC and re-starting allows benchmarks to run properly: the variable will be undefined and correctly initialised to zero. Solution: explicitly set the value of last_time to zero on all possible exit routes out of the benchmarking loop, so that it's properly initialised for next time. NB that's OK: this is a timing variable for the benchmark duration, nothing to do with the 5-day interval between benchmarks. That's tested at 250 double diff = now - host_info.p_calculated; 251 if (diff < 0) return true; 252 253 return ((run_cpu_benchmarks || diff > BENCHMARK_PERIOD)); |
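The stale-timestamp behaviour described above, and the proposed reset, can be sketched outside BOINC as a small once-per-second rate limiter. `PollTimer` is a hypothetical type invented for this illustration. It covers both the post's suggestion (clear the timestamp when the benchmark loop exits) and an additional guard, which is an assumption and not part of the post, that discards the timestamp if the clock is seen to move backwards.

```cpp
// Once-per-second rate limiter that survives the clock being set back.
// PollTimer is a hypothetical sketch, not part of the BOINC client.
struct PollTimer {
    double last_time = 0;  // time of the last accepted poll, in seconds

    // Returns true if at least one second has passed since the last
    // accepted poll, and records the new time.
    bool due(double now) {
        // Extra guard (an assumption, not in the post): a timestamp that
        // lies in the future is stale, so forget it instead of waiting.
        if (now < last_time) last_time = 0;
        if (now < last_time + 1) return false;
        last_time = now;
        return true;
    }

    // The post's suggestion: call this on every exit route out of the
    // benchmarking loop so the next run starts from a clean state.
    void reset() { last_time = 0; }
};
```

Without the guard or the reset, a `last_time` of Wed, 30 Apr 2008 13:25:40 UTC would make `due` return false for every call until that moment arrives, which matches the indefinite hang reported in this thread.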
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.