MODEST meeting 2009-10-21

Summary
Rob wasn't here. Naughty Rob.

Sharing data for tightly coupled applications
The AMUSE design works best with loosely coupled modules. The classic example is stellar collisions: there is the orbital dynamics of the stars, and then the hydrodynamics inside the stars once they collide, on a completely different scale. Steve worries that this might be the only interesting application of a loosely coupled system. (Rob adds after the fact that there is also the cluster dissolution in galaxy interactions problem that Inti talked about at the AMUSE workshop in Leiden, which used two different N-body codes.) The broad design of AMUSE could be limited to "loosely coupled", but it would be nice to have the ability to deal with other problems too.

Classic example: Hydro + N-body using two separate codes. If one code dominates the run time, then you don't worry so much about the overheads. But if both codes are roughly equally heavy -- similar numbers of particles, similar timesteps -- and each wants to respond to the gravity of the material in the other, it becomes important to swap large amounts of data between the two codes at a high rate.
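
To make the data rate concrete, here is a back-of-the-envelope estimate in Python; the particle counts, the per-particle payload (positions plus masses), and the number of steps are made up purely for illustration.

    # Naive scheme: each code ships the other its positions and masses every
    # shared timestep, in both directions.  All numbers are illustrative.
    n_stars = 100000                    # direct N-body particles
    n_gas = 1000000                     # SPH particles
    bytes_per_particle = 4 * 8          # x, y, z, mass as 8-byte doubles

    per_step = (n_stars + n_gas) * bytes_per_particle * 2   # both directions
    n_steps = 100000                    # shared (small) timesteps

    print("per step:  %.1f MB" % (per_step / 1e6))              # ~70 MB
    print("whole run: %.1f TB" % (per_step * n_steps / 1e12))   # ~7 TB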

The previous version (MUSE) used swig, which allowed for shared memory. That made swapping this kind of data back and forth easy. Arjen notes that there \emph{is} a shared-memory model in MPI. However, the stickiest problem comes when you have parallel codes that are not all running on one machine. How do you get the data back and forth? It would probably help efficiency if the codes could communicate directly instead of having to send data up and down through the Python AMUSE layer, although real benchmarking of this needs to be done. Ideally, one wants to be able to do this without \emph{any} modification to the included codes, or at least with as little modification as possible.
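
As a starting point for that benchmarking, here is a minimal timing sketch using mpi4py and NumPy, sending the same 9 x 100 000 doubles that Arjen times in the transcript below. It measures only the raw MPI transfer between two ranks, not the Python/AMUSE call overhead, and the sizes (and the script name) are of course just illustrative.

    # Run with something like: mpirun -np 2 python bench_send.py
    from mpi4py import MPI
    import numpy as np
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    data = np.zeros((9, 100000), dtype='d')    # ~7.2 MB payload

    comm.Barrier()
    t0 = time.time()
    if rank == 0:
        comm.Send([data, MPI.DOUBLE], dest=1, tag=0)
    elif rank == 1:
        comm.Recv([data, MPI.DOUBLE], source=0, tag=0)
    comm.Barrier()
    t1 = time.time()

    if rank == 0:
        mb = data.nbytes / 1e6
        print("%.1f MB in %.3f s -> %.1f MB/s" % (mb, t1 - t0, mb / (t1 - t0)))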

An obvious solution does not exist at the moment other than (a) changing the MPI protocol, and (b) more easily, changing the laws of physics.

Transcript
[6:07] Lagrange Euler: Inti, I just sent an e-mail to Simon about interprocess communication in (A)MUSE. The issue of sharing data among modules that need detailed gravity information from each other at each step seems to me to be a central design issue. We touched on it last time. What ideas do you guys have for efficient communication?

[6:07] Lagrange Euler: Jun, you must have thought about this too.

[6:08] Linguini Mexicola: I think it is an optimization issue; currently we are making the interface more efficient by sending particle batches etc.

[6:09] Makino Magic: Ya, but one question is if (A)MUSE would link two separate codes, both parallel, but parallelized differently.

[6:09] Linguini Mexicola: and we noticed that the MPI shared memory communication is also much faster

[6:09] Linguini Mexicola: (as expected, of course)

[6:10] Lagrange Euler: Yes, communication on the same machine will be faster, but if you are coupling parallel codes on different machines, I don't think you can afford to send copies of the system back and forth at every step.

[6:11] Lagrange Euler: Maybe you can redesign algorithms, but that goes against the MUSE idea of wrapping existing codes.

[6:11] Linguini Mexicola: I meant socket communication vs mpich implementation not shared

[6:12] Makino Magic: I'm not sure what kind of situation we have in mind, and if that situation is shared...

[6:13] Linguini Mexicola: (shared mem. communication does not allow spawning)

[6:13] Lagrange Euler: Well, imagine modeling gas in an embedded cluster, so we have an N-body code and an SPH code spanning the same physical space.

[6:14] Lagrange Euler: Or a direct N-body code and a tree code modeling different types of particles in the same space.

[6:14] Makino Magic: You mean intracluster gas?

[6:14] Lagrange Euler: In the most naive view, the N-body code needs to know the SPH particle positions at every step, and vice versa.

[6:15] Lagrange Euler: Yes, I mean intracluster gas.

[6:15] Lagrange Euler: But I don't think we can afford to copy the data, even memory to memory on the same machine, at every step.

[6:15] Lagrange Euler: Or can we?

[6:15] Makino Magic: Well, that means very tight coupling, which *I think* is beyond the original design philosophy of MUSE.

[6:16] Linguini Mexicola: I agree with Makino;

[6:16] Lagrange Euler: It does, and it is, but now we seem to be contemplating these sorts of applications.

[6:16] Linguini Mexicola: if the coupling is very tight then fast communication becomes necessary, but that alone would not be enough, because it also depends very much on the algorithm

[6:18] Linguini Mexicola: so, anyway; is someone starting today's session (or have we already! ;-) )

[6:18] Makino Magic: That is why all N-body+SPH codes (Gadget or Gasoline or whatever) are designed as a single big code...

[6:18] Lagrange Euler: Last week we were discussing hydro codes in MUSE. But the only loosely coupled problem of interest (maybe?) is the one MUSE started with -- stellar collisions, which occur on a scale totally different from the rest of the simulation.

[6:19] Lagrange Euler: If we confine ourselves to those "easy" problems, then MUSE is a "solved" problem.

[6:19] Ico Telling: I think we have already started Inti

[6:19] Linguini Mexicola: I was thinking of going on to the coupling with grid hydro codes

[6:20] Makino Magic: Actually, intracluster gas may be something in between. It depends on how we handle interaction between stars and gas.

[6:20] Linguini Mexicola: that is something that will be a whole different approach

[6:21] Linguini Mexicola: and there is also the issue of stopping conditions for the hydro, that is analogous to stellar collisions for gravity dynamics

[6:22] Lagrange Euler: But that should be the goal of (A)MUSE. It is the first (well, maybe second) question that comes up when I describe this to colleagues (or to proposal reviewers).

[6:22] Linguini Mexicola: One case would be star formation, but maybe there are others

[6:23] Makino Magic: In the case of star formation, if we do it with SPH, the number of SPH particles is many orders of magnitude larger than that of the formed stars.

[6:24] Lagrange Euler: So the stars are "free"?

[6:24] Makino Magic: That means a simple (and slow) interface would be okay.

[6:24] Makino Magic: I mean yes.

[6:24] Lagrange Euler: I guess so.

[6:24] Pan Numanox: Is it likely that the number of SPH particles for intracluster gas would be much larger than the number of stars too?

[6:25] Lagrange Euler: Quite possibly.

[6:25] Lagrange Euler: So it seems that we can do systems where the cost lies overwhelmingly in one module, since the other modules are perturbations.

[6:26] Makino Magic: Ya.

[6:26] Lagrange Euler: But we will have trouble with "democratic" systems where the cost is roughly evenly distributed.

[6:26] Lagrange Euler: Unless we write a module to deal with them explicitly.

[6:26] Ico Telling: That will exclude quite a few interesting problems, won't it?

[6:27] Lagrange Euler: Yes it will.

[6:28] Lagrange Euler: An advantage of the old MUSE approach, using swig, was that we had a common address space and could in principle pass pointers from one module to another.

[6:28] Makino Magic: One could argue that if we handle hydro coexisting with a stellar system, the hydro cost would always be much larger...

[6:29] Remy Vespucciano: Ok, I did a small speed test

[6:29] Lagrange Euler: (Of course, that is also one of its disadvantages in other contexts.)

[6:29] Remy Vespucciano: sending 9 x 100 000 double values

[6:29] Remy Vespucciano: takes 0.12 seconds on my macbook

[6:29] Remy Vespucciano: with second life also running

[6:30] Lagrange Euler: Presumably the limit is just the memory bandwidth, which is (what?)...

[6:31] Pan Numanox: I think if you're running on multiple nodes, the network speed would be the limit, no?

[6:31] Linguini Mexicola: difficult to interpret the numbers....

[6:31] Remy Vespucciano: Yes, my test was on my local macbook

[6:31] Makino Magic: 9 x 100 000 doubles is just 7.2 MB. So the effective speed is around 60 MB/s

[6:31] Lagrange Euler: Yes, but on the macbook, I guess this is memory to memory copy. You get 80 Mbyte/s?

[6:32] Lagrange Euler: Sorry, 60 MB/s. But the memory bandwidth must be GB/s?

[6:32] Remy Vespucciano: This was with MPI, so with all the MPI call overhead, python etc.

[6:32] Lagrange Euler: I thought MPI was smart about using the fastest comm speed available to it.

[6:33] Remy Vespucciano: Some MPI implementations are

[6:33] Makino Magic: That depends how you set it up...

[6:33] Remy Vespucciano: Also depends on how the site is configured

[6:33] Remy Vespucciano: Makino is right

[6:33] Lagrange Euler: But it makes the point. You have to do a hell of a lot of compute to hide the comm overhead.

[6:33] Ico Telling: But when you build MPI it should choose the smart option depending on the machine

[6:34] Ico Telling: I think that is the case for mpich

[6:34] Ico Telling: and I imagine that should be the case for openmpi too

[6:34] Remy Vespucciano: MPICH defaults to a special protocol, optimized for speed during configuration

[6:34] Remy Vespucciano: yes

[6:35] Remy Vespucciano: This protocol has one drawback: it uses 10% of processor power for every running app

[6:35] Remy Vespucciano: So it works only when the number of apps is the same as the number of processors

[6:36] Remy Vespucciano: I've compiled MPICH with the slightly slower socket interface, but this is more processor friendly

[6:36] Remy Vespucciano: So, yes we have overhead with sending data

[6:36] Remy Vespucciano: But it depends on the problem size and how your machine is set up

[6:37] Lagrange Euler: Well, we all warn our students never to trust black boxes. Sometimes you just have to look inside.

[6:37] Remy Vespucciano: Yes... So it will be an active part of research to determine how much info must be exchanged with every step or every X steps

[6:38] Linguini Mexicola: I do want to point out that for doing tightly coupled problems it is not sufficient to have fast communications

[6:38] Lagrange Euler: Exactly -- which is fine!

[6:38] Linguini Mexicola: It boils down to what level of coupling you want to implement in AMUSE

[6:39] Makino Magic: MPI over a DDR InfiniBand network can get something like 1 GB/s, which in many cases might be fast enough

[6:39] Linguini Mexicola: At the moment the coupling we envision is not on the level of the algorithms but on the level of the applications

[6:40] Pan Numanox: Could we get away with allowing the modules to ask for physical parameters when needed? (Like SPH asking N-Body "what do you think the potential is at (x,y,z)?")

[6:40] Lagrange Euler: Understood. And possibly that is the stated goal of the AMUSE proposal. But it is always good to push the boundaries.

[6:41] Linguini Mexicola: pan: yes, for certain applications that is good

[6:41] Lagrange Euler: I think that is how the interface is currently set up, and in principle this should work.

[6:41] Lagrange Euler: But probably not for data-intensive apps.

[6:42] Pan Numanox: Right, at some point it's cheaper to throw the entire state vector over
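
For reference, a minimal sketch of the query-style coupling Pan describes above, where one module asks another for the potential at a point instead of pulling over the whole particle set. The function name, softening, and test data are illustrative, not the actual AMUSE interface.

    import numpy as np

    def get_potential_at_point(point, positions, masses, G=1.0, eps=0.01):
        # Direct-sum (softened) potential of the N-body particles at one point.
        # Cheap if only a few points are queried per step; for many points it
        # becomes cheaper to ship the whole state vector, as noted above.
        dx = positions - point
        r = np.sqrt(np.sum(dx * dx, axis=1) + eps * eps)
        return -G * np.sum(masses / r)

    # e.g. the SPH side asking for the potential at one gas particle position
    positions = np.random.random((1000, 3))
    masses = np.ones(1000) / 1000.0
    print(get_potential_at_point(np.array([0.5, 0.5, 0.5]), positions, masses))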

[6:42] Lagrange Euler: Inti/Arjen, you have new versions of some of the old test scripts, I think. Do you see any measurable difference in performance using the new formulation?

[6:42] Remy Vespucciano: I have not checked against the old script for speed yet

[6:43] Linguini Mexicola: For the tests I have done, there is little difference, but these were not communication intensive (so there were large intervals where the code would just run)

[6:44] Lagrange Euler: Right. That is what MUSE was originally intended for.

[6:44] Remy Vespucciano: One of the things in AMUSE is that we can send an entire array of data in one MPI call

[6:44] Lagrange Euler: Hi Simon! I thought I might entice you to visit!

[6:45] Lakhesis Destiny: Yes, indeed you did Steve.

[6:45] Lagrange Euler: Arjen, certainly passing lots of data at once is the most efficient way.

[6:46] Lagrange Euler: There is no substitute for understanding the dataflow...

[6:47] Lakhesis Destiny: Can't we send pointer addresses via MPI?

[6:47] Lagrange Euler: I think I could make the case for not dispensing entirely with the old swig approach, for apps running on a single shared memory system.

[6:47] Remy Vespucciano: MPI also has a version of "shared memory"

[6:47] Remy Vespucciano: MPI also has support for shared memory but I've not looked into it yet

[6:48] Lagrange Euler: But is it a programming model? How do you actually share memory across separate machines?

[6:48] Linguini Mexicola: I think there is in MPI2 a possibility of one-sided communication

[6:48] Lagrange Euler: There is.

[6:48] Lagrange Euler: Same problem.

[6:49] Linguini Mexicola: with a shared memory space -

[6:49] Lakhesis Destiny: But how big a problem is it really on a shared memory machine? Isn't the data transfer quite fast, even in an MP domain?

[6:50] Lakhesis Destiny: Of course, it will not be as fast as sending a pointer, though. Can we test the speed?

[6:50] Lagrange Euler: I think it may not be an insurmountable problem on a shared memory machine. But in a cluster (even with IB) or on the grid, that's another thing.

[6:51] Lakhesis Destiny: The solution would be to pass a pointer via MPI. If this option is available, we have no problem at all. And if this is not an option in MPI right now, we can ask for something like this to be added to the MPI standard.

[6:51] Lagrange Euler: I think in a shared memory environment, you could effectively pass pointers. But that isn't possible in general.
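
For reference, MPI-2 one-sided communication lets one process expose an array as a "window" that another process can read without a matching send; it is not a shared pointer, but it removes the explicit hand-shaking. A minimal mpi4py sketch, with an arbitrary array size and script name:

    # Each rank exposes an array as an RMA window; rank 0 reads rank 1's copy
    # directly with a one-sided Get.  Run with: mpirun -np 2 python rma.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    n = 1000
    local = np.arange(n, dtype='d') * (rank + 1)   # each rank's own data
    win = MPI.Win.Create(local, comm=comm)

    remote = np.empty(n, dtype='d')
    win.Fence()
    if rank == 0:
        win.Get([remote, MPI.DOUBLE], 1)           # read rank 1's buffer
    win.Fence()

    if rank == 0:
        print(remote[:5])                          # [0. 2. 4. 6. 8.]
    win.Free()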

[6:51] Ico Telling: yes let's modify the MPI standard....

[6:51] Ico Telling: :)

[6:51] Lagrange Euler: ...and the laws of physics

[6:51] Ico Telling: easier to change the laws of physics I think

[6:52] Lagrange Euler: Well, this has been very interesting, and I hope the discussion will continue, but I have to go teach the laws of physics.

[6:52] Ico Telling: however in a shared memory environment you can pass a pointer via MPI (if I remember correctly)

[6:52] Lagrange Euler: Bye all!

[6:52] Linguini Mexicola: bye

[6:52] Pan Numanox: Myself as well. Bye!

[6:52] Makino Magic: Bye!

[6:52] Ico Telling: bye all

[6:53] Lakhesis Destiny: Oh, is everybody bailing out, now that we start talking about changing the laws of physics?

[6:53] Linguini Mexicola: anybody not going?

[6:53] Remy Vespucciano: We're here...

[6:54] Remy Vespucciano: Sharing memory also means sharing data representation

[6:54] Remy Vespucciano: So this only works when the codes are significantly changed to do so

[6:55] Lakhesis Destiny: Ok all, I am off. Bye all.

[6:56] Ico Telling: bye Simon I am going too. Thanks Inti

[6:56] Remy Vespucciano: Simon, bye!

[6:56] Ico Telling: Inti can you save the transcript?

[6:56] Linguini Mexicola: is there an easy way to do that?

[6:56] Ico Telling: Yes but I don't know it....