Monday, February 8, 2010

OU Supercomputer!

For the last few days the OU supercomputer has been screwing me over! I had several tasks, each of which takes a day or two to complete, so I got an account on the OU supercomputer to run them. That was fine, but the queue is so long that my tasks waited in it for an entire week. I could live with that, but then some of my jobs were terminated in the middle (huh!) and I had no clue why. I talked to the staff and they said it was because of memory usage. Alas! So I have to go through the whole process again: submit the jobs, wait in the queue, and when they finally get a chance to execute (I hope they will!), I don't know what is going to happen this time.
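Since memory usage was the stated reason for the terminations, one thing that might help next time is having the job report its own peak memory as it goes, so the last line in the log shows how far it got before being killed. This is only a minimal sketch, assuming the job is a Python process on a Unix-like node; `peak_rss_mib` and `run_with_memory_log` are hypothetical helper names, and the list-building step is just a stand-in for a real stage of work.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident memory of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_with_memory_log(step, label):
    """Run one stage of a job, then print the peak memory seen so far."""
    result = step()
    # flush=True so the line reaches the log even if the job is killed soon after.
    print(f"{label}: peak memory so far {peak_rss_mib():.1f} MiB", flush=True)
    return result


if __name__ == "__main__":
    # Stand-in workload (an assumption): replace with a real stage of the job.
    data = run_with_memory_log(lambda: list(range(1_000_000)), "build list")
```

Comparing the last reported figure with the memory available on the node would at least show whether a job that died early was close to the limit.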

One more interesting thing. They have two machines that are more powerful than their regular machines, but I found that the powerful machines take longer than the regular ones. Funny! Why is that? I still don't have a very good explanation. Maybe it's because of threads: the jobs running on the powerful machines have more threads than the jobs running on the regular machines. But then again, a powerful machine has twice as many cores as a regular machine, and my threads are all non-blocking. So context switching should not degrade performance, but somehow performance does degrade!
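One possible explanation is that the threads compete for a fixed amount of bandwidth to RAM, so adding threads past a certain point stops helping and can even hurt. The toy model below is only a sketch of that idea, not a measurement: the function name `estimated_runtime`, the bandwidth figures, and the `contention` penalty are all made-up assumptions chosen for illustration.

```python
def estimated_runtime(total_gb, threads, per_thread_gbps, node_gbps, contention=0.0):
    """Estimate seconds for a purely memory-bound job (toy model).

    total_gb        -- data the job must stream from RAM, in GB
    per_thread_gbps -- bandwidth one thread can consume on its own
    node_gbps       -- total RAM bandwidth the node can deliver
    contention      -- hypothetical penalty factor once demand exceeds supply
    """
    demand = threads * per_thread_gbps
    # The node can never deliver more than its total bandwidth.
    delivered = min(demand, node_gbps)
    if demand > node_gbps:
        # Assumed penalty: the more oversubscribed, the less useful bandwidth.
        delivered /= 1.0 + contention * (demand / node_gbps - 1.0)
    return total_gb / delivered


if __name__ == "__main__":
    # All numbers here are invented for illustration.
    for n in (4, 8, 16):
        t = estimated_runtime(1000, n, per_thread_gbps=2, node_gbps=12, contention=0.5)
        print(f"{n:2d} threads: {t:.1f} s")
```

With these invented numbers, going from 4 to 8 threads helps, but going from 8 to 16 makes the estimated runtime longer, because the extra threads only add contention once the memory bandwidth is saturated.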

2 comments:

  1. Sadik, I'm sorry to hear that you're having difficulty with OSCER resources.

    Some thoughts:

    (1) Regarding long queues and wait times: OSCER serves a user community of approximately 650 users, which means, among other things, that we have more users than we have compute nodes. (Sooner has 536 compute nodes total, of which 16 are owned by individual faculty and therefore aren't available to most of our users.)

    Not that all of those users are using OSCER resources at the same time. But since many of the ones who are using at a given moment are using multiple compute nodes (sometimes dozens or even hundreds at a time), in general it isn't practical to have everyone's jobs running at the same time.

    So everyone has to wait at least some of the time.

    It may help you to know that OSCER provides a machine (Sooner) that debuted (in Nov 2008) as the 90th fastest supercomputer in the world, the 14th fastest at a US university, and the 10th fastest at a US university excluding big national supercomputing centers (that is, comparing apples to apples).

    And we provide that resource to our academic users (like yourself) at no charge.

    The number of institutions in the US (or for that matter in the world) that provide such a resource, especially at that scale, is quite small, perhaps 40 other than big national supercomputing centers.

    And OU is one of a much smaller number of institutions that provide that service at no charge, using primarily internal funds.

    (2) Your jobs were terminated when the amount of time that you requested ran out. On OSCER systems, each job is limited to 48 hours of runtime maximum.

    One of the governing factors for your jobs was the fact that the contention for memory (or maybe for cache) among the various threads was causing your jobs to slow down. That's probably why 48 hours wasn't enough (though I admit I haven't taken the time to analyze your code in detail, so I'm speculating).

    We chose 48 hours as the limit because much longer than that would cause a few jobs to take over the entire supercomputer for days or weeks at a time, so that no one else could get anything done.

    (3) With respect to the "two more powerful machines," I assume you mean our quad-socket nodes, each of which has a total of 16 cores and 128 GB of RAM.

    Please note that (a) we only have two of them (they cost about 5-10 times as much as dual socket nodes, which we have 534 of), and (b) the fact that you run on 16 threads instead of 8 by no means guarantees that you'll get twice the performance -- on the contrary, you may actually get worse performance, because there would be more threads contending for the same amount of bandwidth to RAM.

    You may find it helpful to have a look at our video series, "Supercomputing in Plain English:"

    http://www.oscer.ou.edu/education.php

    These videos provide explanations for why various approaches to coding can produce various performance characteristics.

    Henry Neeman, Director
    OU Supercomputing Center for Education & Research (OSCER)
    University of Oklahoma

  2. I much appreciate the service of the OU supercomputer. I am sorry if my experience sounds like a complaint. This is the first time I have used any kind of supercomputing, and unfortunately my code was not originally designed for parallel processing. So I am not gaining much other than access to a huge number of machines, which obviously makes my life much easier.

    I understand the average queue length and I think it is normal, but I don't agree with the explanation for the terminated jobs. I can imagine it is something related to memory, but not exactly what you said. My processes take too much memory, which makes them slower, but they were not killed because of the time limit. Some of my jobs were killed within 30 minutes of starting, so apparently the reason is not the 48-hour limit!

    I like the speculation you made about getting worse performance on the powerful machines. My point was that I am getting worse performance, not that the machines are performing worse. Since my code was not designed for parallel processing, it is quite normal that it performs worse on parallel machines.

    Once again, my intention was to share my experience, not to complain.
