I run two factoring threads on my hyperthreaded P4 2.8GHz.
Stage 2 puts pressure on memory, so it's best not to have the clients running in lockstep with each other. I started one client and then started the other when the first began stage 2 of its first test. This has kept the stage 2 overlap to a minimum and has probably improved performance substantially. How much memory were you allocating to each factorer, and how much total does your system have?
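To illustrate why the staggered start helps, here is a rough Python sketch that estimates how often two clients are in stage 2 at the same time. The function name `stage2_overlap` and the durations are made up for illustration, and the model assumes every test takes the same time in each stage, which real tests won't do exactly, so treat the numbers as a picture of the idea rather than a measurement.

```python
def stage2_overlap(stage1, stage2, offset, cycles=10):
    """Fraction of time both clients are in stage 2 at once.

    stage1, stage2: hypothetical stage durations in arbitrary ticks.
    offset: how many ticks after client A the second client B is started.
    """
    period = stage1 + stage2
    total = period * cycles
    both = 0
    for t in range(total):
        # Client A starts at tick 0 and repeats stage 1 then stage 2.
        a_in_stage2 = (t % period) >= stage1
        # Client B runs the same cycle, shifted by `offset` ticks;
        # before it has started it is not in any stage.
        tb = t - offset
        b_in_stage2 = tb >= 0 and (tb % period) >= stage1
        if a_in_stage2 and b_in_stage2:
            both += 1
    return both / total


# Both clients started together: they hit stage 2 at the same time.
lockstep = stage2_overlap(stage1=60, stage2=40, offset=0)

# Second client started when the first enters stage 2.
staggered = stage2_overlap(stage1=60, stage2=40, offset=60)

print(lockstep, staggered)
```

In this toy model the staggered start removes the overlap entirely as long as stage 2 is no longer than stage 1. If stage 2 is the longer stage, some overlap is unavoidable, and since real tests vary in length the two clients will slowly drift relative to each other.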
Cheers,
Louie