FAH-Addict Forum

High Tech Chat » Having consistent GPU problems Is it possible to choose which project you're given?
On 11/30/2009 at 23h58

Photon

Group: Member

Signed up since: 31/10/2009
Messages: 23
Hi,

I'm having problems with my GPUs ... certain projects ALWAYS give me an "UNSTABLE_MACHINE" error, complaining about producing NANs ...

This only ever happens in projects between p5765 and p5772 (possibly a few more).

Is it possible to ask for work that excludes these few projects?

8)
Mark

On 12/01/2009 at 03h51

Quark

Group: Member

Signed up since: 01/09/2009
Messages: 46
No it is not.

If you're running Nvidia GPUs you can set an environment variable which might help. Have you tried that?

   
On 12/01/2009 at 09h13

Photon

Group: Member

Signed up since: 31/10/2009
Messages: 23
Set an environment variable to do what?

I have an Athlon64 x2 with 1x9800GT ... fine, runs anything and everything.
The other is a Phenom II x4 with 2x9600GTs ... but they don't run p5765 to p5772 inc.

Both machines run SMP clients under Vista64 Ultimate too.

I underclock all the GPUs by ~25% ( clock speed, memory bus and shaders ) ... none of them run hotter than 65C ... they MAY have peaked at 70C on a hot day.

I've 'set FAH_GPU_IDLE 20' on the 9600GTs.

What else should I be doing?



Edit by MarkAGR On 12/01/2009 at 09h17

On 12/01/2009 at 11h37

Administrator

Group: Administrator

Signed up since: 10/08/2009
Messages: 87
Did you check the boards with MemtestG80?

On 12/01/2009 at 13h55

Photon

Group: Member

Signed up since: 31/10/2009
Messages: 23
Yes ... for hours and hours
I did 2 other GPU tests too ... however I think they were more general - not NVIDIA specific.

:(

On 12/02/2009 at 03h49

Quark

Group: Member

Signed up since: 01/09/2009
Messages: 46
FAH_GPU_IDLE is the variable I was referring to. Do you see a message like this when your work unit starts?
Calling fah_main args: 14 usage=80
If not, then the variable is not set correctly.
I have 9 8800GTs that regularly run all WUs with no problem. I checked the 2 projects you mentioned and they run fine.
I'm confused about what is failing with these statements:
Quotation:
This only ever happens in projects between p5765 and p5772 (possibly a few more).

Quotation:
I have an Athlon64 x2 with 1x9800GT ... fine, runs anything and everything.
The other is a Phenom II x4 with 2x9600GTs ... but they don't run p5765 to p5772 inc.


It's not clear from your post whether you're having this problem on all of your GPUs on both machines, or only on some of them.
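A quick way to check for that startup message is to search the client log for it. The sketch below is only an illustration: the function names are made up, and the rule "usage = 100 - FAH_GPU_IDLE" is an assumption inferred from the idle=20 / usage=80 pair in this thread, not something taken from the client documentation.

```python
import re

def expected_usage(idle_percent):
    # Assumption: usage is simply 100 minus the FAH_GPU_IDLE value.
    return 100 - idle_percent

def usage_from_log(log_text):
    """Return the usage value from the first 'Calling fah_main' line, or None."""
    match = re.search(r"Calling fah_main args: \d+ usage=(\d+)", log_text)
    return int(match.group(1)) if match else None

# Sample line copied from the message quoted above.
sample = "Calling fah_main args: 14 usage=80\n"
found = usage_from_log(sample)
print("usage in log:", found)
print("matches FAH_GPU_IDLE=20:", found == expected_usage(20))
```

If `usage_from_log` returns None for a real log, the startup line is missing, which would suggest the variable was not picked up.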


   
On 12/02/2009 at 09h28

Photon

Group: Member

Signed up since: 31/10/2009
Messages: 23
Sorry, I thought it WAS clear ...

I have 2 machines ... both of which run SMP clients on their microprocessors.

One of those machines contains a 9800GT card - that runs a GPU client, and will accept and run everything and anything the server sends it. It's fine! It works wonderfully.

I have another machine, containing 2x 9600GT cards. This second machine is giving me problems, namely "NANs detected on GPU" errors, e.g.:

[20:56:02] Calling fah_main args: 14 usage=80
[20:56:02]
[20:56:02] Working on Protein
[20:56:03] Client config found, loading data.
[20:56:03] mdrun_gpu returned
[20:56:03] NANs detected on GPU
[20:56:03]
[20:56:03] Folding@home Core Shutdown: UNSTABLE_MACHINE
[20:56:06] CoreStatus = 7A (122)
[20:56:06] Sending work to server
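For anyone wanting to count how often this happens, a log excerpt like the one above can be summarized with a few regular expressions. This is a minimal sketch that matches only the three message formats visible in the excerpt; the function name `summarize` and the shape of the returned dictionary are hypothetical.

```python
import re

SHUTDOWN = re.compile(r"Folding@home Core Shutdown: (\w+)")
STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")

def summarize(log_text):
    """Collect core shutdown reasons, CoreStatus codes and NAN reports."""
    statuses = [(h.upper(), int(d)) for h, d in STATUS.findall(log_text)]
    return {
        "shutdowns": SHUTDOWN.findall(log_text),
        "statuses": statuses,
        "nan_hits": log_text.count("NANs detected on GPU"),
    }

# Lines copied from the excerpt above.
excerpt = """[20:56:03] mdrun_gpu returned
[20:56:03] NANs detected on GPU
[20:56:03] Folding@home Core Shutdown: UNSTABLE_MACHINE
[20:56:06] CoreStatus = 7A (122)
"""
report = summarize(excerpt)
print(report)
# The hex code and the decimal value in parentheses are the same number.
print(all(int(h, 16) == d for h, d in report["statuses"]))
```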


I believe the mechanism you suggest (which I AM using, btw) is merely there to prevent overheating ... which isn't happening, as I'm underclocking the GPUs. Since having this problem, I have monitored the GPUs closely ... they peak at 65C. However, I have a record of the 9800GT board reaching 70C on a hot day in summer when the office temperature was > 30C. This time of year my office never goes above 24C, and that's when I'm in it.

SO FAR the problem has only appeared when working on these projects ...
p5765
p5766
p5767
p5768
p5769
p5770
p5771
p5772


Sorry for my previous ambiguity. Does this clarify things?

:)



Edit by MarkAGR On 12/02/2009 at 09h30

On 12/02/2009 at 16h11

Quark

Group: Member

Signed up since: 01/09/2009
Messages: 46
Quotation:
Sorry for my previous ambiguity. Does this clarify things?

Yes, thanks. I just don't like to assume anything when trying to debug.

Is the X4 running at stock speed or is it overclocked? And do you know what size power supply you have in the failing machine?

   
On 12/02/2009 at 17h08

Photon

Group: Member

Signed up since: 31/10/2009
Messages: 23
Nothing is overclocked at all. The Phenom is a 945 running at 3GHz (no idea of the name of the spin of the silicon ... the mobo takes care of the timing automagically).

The GPUs are underclocked by 25%

The PSU is 750W.

I run a mobo, processor, 8GB of PC6400 DDR2, 1x1.5TB SATA, 2x9600GT cards - and that's about it.
Should be more than enough power there!!!

8^)

On 12/02/2009 at 21h04

Quark

Group: Member

Signed up since: 01/09/2009
Messages: 46
MarkAGR:
Nothing is overclocked at all. The Phenom is a 945 running at 3GHz (no idea of the name of the spin of the silicon ... the mobo takes care of the timing automagically).

The GPUs are underclocked by 25%

The PSU is 750W.

I run a mobo, processor, 8GB of PC6400 DDR2, 1x1.5TB SATA, 2x9600GT cards - and that's about it.
Should be more than enough power there!!!

8^)

You're right, that should be more than enough.

The only thing that comes to mind right now is to pull one of the 9600 GPUs out of the PC and see what happens. Then try it with the other GPU. If that leads nowhere I'd swap the GPUs: put the working 9800GT in the failing machine and a failing GPU in the working machine, etc.

One thing I wanted to mention: the variable we discussed was designed for peak temperature spikes. There was a lot of discussion at the time that the nature of the WUs was causing severe heat spikes in the GPU core, and that adding some idle cycles would reduce that problem. It was not something that one would necessarily see via the temp sensors. There were some significant side benefits in stability and usability. You're running at 80%, which is more than enough to deal with that. Mine run at 95 with a failure rate of well under 1%.

   
On 12/04/2009 at 10h48

Photon

Group: Member

Signed up since: 31/10/2009
Messages: 23
I tried all the combinations .... ie

The 9800GT in slot one - by itself works fine ...

Card 1 in slot one - by itself - still bad
Card 2 in slot one - by itself - still bad
Card 1 in slot two - by itself - still bad
Card 2 in slot two - by itself - still bad
Card 2 in slot one, Card 1 in slot two - still bad ...
and back to
Card 1 in slot one, Card 2 in slot two - still bad ...

So I think that if any damage to the cards has been done ... it's been done ... no amount of card-swapping jiggery-pokery will fix it.

But the cards are not completely gone ... it's just that whatever is done in the start-up procedure or first few folds in projects p5765 to p5772 causes this to happen. Are there any boundary conditions being exposed by "sick" cards? I wonder what is so different about them that makes this difference? Coz it's pretty consistent.
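The reasoning behind that conclusion can be written down as a small table of trials. This is only a sketch: it assumes "Card 1" and "Card 2" are the two 9600GTs, it uses just the single-card results listed above, and the helper name `outcomes_by` is made up.

```python
# Each trial: (card, slot, worked) from the single-card tests listed above.
trials = [
    ("9800GT", 1, True),
    ("9600GT card 1", 1, False),
    ("9600GT card 2", 1, False),
    ("9600GT card 1", 2, False),
    ("9600GT card 2", 2, False),
]

def outcomes_by(trials, index):
    """Group the set of outcomes by card (index 0) or slot (index 1)."""
    grouped = {}
    for trial in trials:
        grouped.setdefault(trial[index], set()).add(trial[2])
    return grouped

by_card = outcomes_by(trials, 0)
by_slot = outcomes_by(trials, 1)

# A card is consistently bad if every trial with it failed.
bad_cards = sorted(card for card, seen in by_card.items() if seen == {False})
# A slot is cleared if at least one card worked in it.
cleared_slots = sorted(slot for slot, seen in by_slot.items() if True in seen)

print("consistently failing cards:", bad_cards)
print("slots where something worked:", cleared_slots)
```

Slot one is cleared because the 9800GT works there, yet both 9600GTs still fail in it, so the fault follows the cards rather than the slot. Slot two never hosted a known-good card, so these trials say nothing about it either way.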

In all this swapping I did notice one thing ... after all this, it still could be the power ... though I have all my machines battery backed and surge protected ( so I expect the PSU to be working ... )
The 9800GT is a low power card - no extra leads.
The 9600GTs need a secondary power source - the six lead cable.


Thanks for all your help guys ... I think this is one that won't go away :(

On 12/04/2009 at 11h48

Administrator

Group: Administrator

Signed up since: 10/08/2009
Messages: 87
It could be the cards themselves too ... I had this kind of failure on one of my two 9800 GTX+ ...

The one that started to fail (Unstable_Machine with Self test error or NaN detected on GPU messages) didn't work in any combination or alone in the machine ... it also failed with everything underclocked. Unfortunately, it didn't fail in MemtestG80 or 3D applications ...

I finally decided to bring it back to the shop for RMA with the following reason: "Computation errors in CUDA/Folding@Home". It's not back yet, but I asked for some news, and the shop told me that the manufacturer confirmed the failure and accepted it for RMA :hehe

On 12/04/2009 at 16h26

Quark

Group: Member

Signed up since: 01/09/2009
Messages: 46
How about trying the 9600GTs in the machine that normally runs the 9800GT, one at a time, or have you done that?

   
On 12/04/2009 at 17h30

Photon

Group: Member

Signed up since: 31/10/2009
Messages: 23
Yep ... it doesn't work there either! :(

On 12/04/2009 at 20h33

Quark

Group: Member

Signed up since: 01/09/2009
Messages: 46
MarkAGR:
Yep ... it doesn't work there either! :(


Before we declare the problem identified:

Are you running the console client?
What command string are you using to launch GPU folding?
What versions are fahcore_11.exe and fahcore_14.exe?

edit:
And what driver version are you running? Have you tried upgrading the drivers?

2nd edit:
Have you tried this with no SMP folding going on?



Edit by Weedacres On 12/05/2009 at 02h12

   