Parallel programming and parallel program execution can be often very disappointing. You may invest a lot of very hard work in a project, only to discover that your parallel program doesn't run any faster than its sequential sibling. The reason for this is that the cost of communication, especially the cost of communication in a cluster, is extremely high. So you always have to balance amount of work to be carried out on a single node, versus amount of data that has to be exchanged between the nodes. The more work on the node and the less data to exchange, the better.
Communication problems are also at the root of poor scalability of parallel programs. Even if your program does perform up to the expectations, you may discover that your speed-up will no longer be satisfactory when the number of nodes, the program runs on, gets too large. Most production codes are written to run on 16 or 32 nodes maximum. Some may still scale to 64. But I haven't seen many production codes that would run well on, say, 1024 cluster nodes. The cost of communication and synchronization for this many nodes would be prohibitive.
The present day parallel programming paradigms, perhaps with the notable exception of the data parallel paradigm, are not scalable beyond tens, or at best low hundreds, of processes.
But the data parallel paradigm can be laid out on machines with tens of thousands of CPUs. For example, the Connection Machine CM2 used to have up to 64,000 CPUs. This paradigm, as I have remarked above, should map on the PIM systems too. We will almost certainly see some form of it on the petaflops systems that are currently being developed.