Linux icc client: b-d node crash; node not inserted



Prototyped
03-15-2003, 04:25 AM
I've been running the icc-built foldtrajlite on a Debian sid machine (Intel Celeron 1300 MHz; VIA VPSD C3M266-L motherboard), using quiet mode (and in a chroot jail), and a custom script to restart it on a crash. Recently, I've been seeing very frequent (once every minute or so) crashes and restarts, with error logs similar to this:



========================[ Mar 15, 2003 7:34 AM ]========================
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (36.35315,22.80636,17.68921)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (19.50951,-37.74817,25.16787)
ERROR: [001.013] {randwalk.c, line 3938} Discriminant =nan while calculating CB
ERROR: [001.001] {bbox.c, line 266} ..57..

ERROR: [001.001] {bbox.c, line 268} .. CB ..

ERROR: [001.001] {bbox.c, line 269} ..1..

ERROR: [001.001] {bbox.c, line 270} .. CA ..

FATAL ERROR: [004.001] {bbox.c, line 282} b-d node crash; node not inserted


The bbox.c:266 line varies, as do the "Tried to BDRemove non-existent (x,y,z)" lines, but the way the crash occurs is consistent. I have tried deleting the protein files and replacing the foldtrajlite binary with the one from an icc-built tarball downloaded from the distributedfolding.org site, to no avail.

However, just replacing the binary with the gcc-built binary appears to have worked. I still get errors:



========================[ Mar 15, 2003 8:57 AM ]========================
ERROR: [001.013] {randwalk.c, line 3938} Discriminant =nan while calculating CB
ERROR: [001.013] {randwalk.c, line 3938} Discriminant =nan while calculating CB
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (18.83948,11.28310,-26.52561)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (16.54386,15.73593,39.36787)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (40.49839,27.89487,-29.27060)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (42.27337,24.52461,-2.44313)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (11.70769,26.50885,22.84410)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (31.02750,39.47134,3.99350)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (24.11702,8.18178,-15.15368)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (24.47520,8.66482,-14.47268)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (-1.78640,10.58943,-2.69549)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (7.28064,4.82966,-20.84510)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (41.09403,-27.27712,40.89683)
ERROR: [001.013] {randwalk.c, line 3938} Discriminant =nan while calculating CB
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (41.69605,18.01223,4.84294)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (16.23276,-19.44517,5.16345)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (11.44027,8.89961,19.72566)
ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (20.22554,33.09023,-2.30083)


but I don't get a b-d crash any more.

Other machines with similar setups (chroot jail, icc binary; one of them running the Solaris/SPARC 64-bit binary instead) show no such issues.

What might be the problem?
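
For comparing runs like the two above (icc build vs. gcc build), a rough tally of the error log by source location can help. This is only a sketch in Python: the regular expression is inferred from the log lines pasted in this post, and the names `LINE_RE` and `tally_errors` are made up for illustration, not part of the client.

```python
import re
from collections import Counter

# Pattern inferred from the pasted log lines, e.g.
#   ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent (...)
#   FATAL ERROR: [004.001] {bbox.c, line 282} b-d node crash; node not inserted
# It is an assumption about the format, not a documented specification.
LINE_RE = re.compile(
    r"^(?P<level>FATAL ERROR|ERROR): \[(?P<code>[\d.]+)\] "
    r"\{(?P<file>[^,]+), line (?P<line>\d+)\} (?P<msg>.*)$"
)


def tally_errors(log_text):
    """Count log entries by (level, source file, source line).

    Lines that do not match the assumed format (date banners,
    blank lines) are skipped.
    """
    counts = Counter()
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            key = (match["level"], match["file"], int(match["line"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "ERROR: [001.001] {bbox.c, line 418} Tried to BDRemove non-existent "
        "(36.35315,22.80636,17.68921)\n"
        "ERROR: [001.013] {randwalk.c, line 3938} Discriminant =nan while "
        "calculating CB\n"
        "FATAL ERROR: [004.001] {bbox.c, line 282} b-d node crash; "
        "node not inserted\n"
    )
    for (level, source, line), n in tally_errors(sample).most_common():
        print(n, level, source, line)
```

Running it over the log from each build would show at a glance whether the FATAL entries at bbox.c line 282 have disappeared while the non-fatal ones remain.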

Prototyped
03-15-2003, 12:17 PM
Okay, never mind. I noticed that other applications (mencoder, btdownloadcurses.py) were also hitting very odd floating-point exceptions, mostly NaN values turning up where none should have been computed. It turns out that the kernel I was using (2.4.20-wolk4.0s-rc3) included a bunch of experimental patches that were messing things up. Booting the previous stable kernel (2.4.20-ac2) made all the floating-point problems go away, including the oddities with the icc-built foldtrajlite.
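
A quick way to check whether a machine is producing bad floating-point results independently of the folding client is to repeat a calculation with a known exact answer and count the mismatches. The Python sketch below is only an illustration: `fp_sanity_check` is a made-up name, the quadratic is chosen so that every intermediate value is exactly representable, and it only loosely echoes the "Discriminant =nan" message (it is not the randwalk.c calculation). A clean run does not prove the kernel is fine; a non-zero count just confirms something is wrong outside the application.

```python
import math


def fp_sanity_check(iterations=100_000):
    """Repeat a calculation with a known exact result.

    Returns the number of iterations that produced NaN, an error,
    or anything other than the expected value. On healthy hardware
    and a healthy kernel this should be 0.
    """
    failures = 0
    for _ in range(iterations):
        # x^2 - 3x + 2 = 0 has roots 1 and 2; the discriminant is exactly 1.
        a, b, c = 1.0, -3.0, 2.0
        disc = b * b - 4.0 * a * c
        try:
            root = (-b + math.sqrt(disc)) / (2.0 * a)
        except ValueError:
            # math.sqrt raises ValueError for a negative argument, which
            # would itself mean the discriminant was computed wrongly.
            failures += 1
            continue
        if math.isnan(root) or root != 2.0:
            failures += 1
    return failures


if __name__ == "__main__":
    bad = fp_sanity_check()
    print("bad results:", bad)
```

Running it while the machine is under load, under each kernel in turn, gives a rough before/after comparison.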

bwkaz
03-15-2003, 02:39 PM
-wolk4.0s-rc3? Where might I get that (or at least more info on what it is)? Seeing as it's screwing up FP calculations, it likely wouldn't be a good patch to apply, but nevertheless, I'd like to know more about it.

Prototyped
03-16-2003, 10:00 PM
It's the Working Overloaded Linux Kernel patchset. You can get it from http://sf.net/projects/wolk (which appears to be down at the moment). It includes the staple interactivity improvement patches (Ingo Molnar's O(1) scheduler, Robert M. Love's kernel preemption patch, Andrew Morton's low-latency scheduling patch, Andrea Arcangeli's VM improvements, Rik van Riel's reverse-mapping implementation), enterprise volume management, ALSA, XFS, IPSEC, GRSecurity and a host of other new items backported from the 2.5 source tree.

Perhaps rebuilding the kernel with a lot fewer of the patches enabled in the config will help. For the moment, though, it's holding up pretty well on 2.4.20 with Alan Cox's patch.

bwkaz
03-17-2003, 08:57 AM
Ah, thanks.

:)