February 12, 2007

The Game of Life

Recently, my colleague Dick Bowler and I wrote a sample application that implements Conway's Game of Life. The purpose of the sample was to visually demonstrate the performance of a single-threaded version of an algorithm vs. a scalable multi-threaded version. The app was to display two game boards side-by-side; the first board would have the computation for the next generation handled by a single thread and the second would be multi-threaded. We had fun implementing this sample and learned a few things along the way.

First, we decided to be adventurous and mix native C++ and C++/CLI. We planned to use C++/CLI for the user interface (to pick up the Windows Forms classes from the .NET Base Class Library) and C++ for the game board calculations. Second, we wanted to write code that would scale in performance as it ran on additional processor cores. To this end, we decided to use Intel's Threading Building Blocks library to implement the parallel algorithm code. Finally, we wanted the display to update in real time as the boards were recomputed so that the user could watch successive generations as they evolved.

Going in, as a longtime C/C++ developer, I was skeptical of C++/CLI, and I disliked the strange syntactic extensions to the language. It seemed like another "embrace and extend" tactic where the language would be altered into a proprietary dialect. However, after using C++/CLI for a bit, I must say that the combination worked OK for this application. I was able to manage the syntactic differences ("ref class," "gcnew," handles, etc) fairly easily and was able to easily mix native and managed types. I had access to the same Base Class Library as when using C# and yet could drop down to native C++ code when needed, without having to use COM interop. Overall, I still don’t like the syntactic changes, and the associated loss in portability, but the combination worked as well as we had hoped.

Previously, I have implemented sample programs that partition applications only at a high level (worker threads, user interface thread, etc.) and have realized that, while they are an improvement over entirely single-threaded applications, applications partitioned in this way will also hit a performance wall as the number of processors increases. Eventually, each of the application's threads could be running on its own processor core, and the application would receive no further benefit from additional processor cores. With the Game of Life sample, we intended to use the Threading Building Blocks to show how applications should continue to extract parallelism at the algorithmic level. In other words, if developers can code algorithms to be multi-threaded, the applications will scale further than those applications that just have high-level partitioning. Intel's Threading Building Blocks (TBB) library is a good tool for implementing parallel algorithms. The TBB library is available for multiple operating systems and compilers and allows developers to implement algorithm-level parallelism in a processor-independent way.
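To make "parallelism at the algorithmic level" a bit more concrete, here is a minimal sketch of one way to split the next-generation computation across threads. It is not the code from our sample: it uses plain std::thread instead of TBB so that it stays self-contained, and the names (Board, step_rows, next_generation_parallel) as well as the rule that cells outside the grid count as dead are assumptions made for illustration.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// A board is a grid of cells: 1 = alive, 0 = dead.
// Assumption: cells outside the grid are treated as dead.
using Board = std::vector<std::vector<int>>;

// Count the live neighbours of cell (r, c).
int live_neighbours(const Board& b, int r, int c) {
    const int rows = static_cast<int>(b.size());
    const int cols = static_cast<int>(b[0].size());
    int count = 0;
    for (int dr = -1; dr <= 1; ++dr) {
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;
            const int rr = r + dr;
            const int cc = c + dc;
            if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) count += b[rr][cc];
        }
    }
    return count;
}

// Compute rows [first, last) of the next generation into `next`.
// Each call reads only `cur` and writes only its own rows of `next`.
void step_rows(const Board& cur, Board& next, int first, int last) {
    const int cols = static_cast<int>(cur[0].size());
    for (int r = first; r < last; ++r) {
        for (int c = 0; c < cols; ++c) {
            const int n = live_neighbours(cur, r, c);
            next[r][c] = (n == 3 || (n == 2 && cur[r][c] == 1)) ? 1 : 0;
        }
    }
}

// Single-threaded version: one thread walks every row.
Board next_generation_serial(const Board& cur) {
    if (cur.empty()) return cur;
    Board next = cur;
    step_rows(cur, next, 0, static_cast<int>(cur.size()));
    return next;
}

// Multi-threaded version: the rows are cut into bands, one band per worker.
Board next_generation_parallel(const Board& cur, unsigned workers) {
    if (cur.empty()) return cur;
    Board next = cur;
    const int rows = static_cast<int>(cur.size());
    workers = std::max(1u, std::min(workers, static_cast<unsigned>(rows)));
    const int band = (rows + static_cast<int>(workers) - 1) / static_cast<int>(workers);
    std::vector<std::thread> pool;
    for (int first = 0; first < rows; first += band) {
        const int last = std::min(rows, first + band);
        pool.emplace_back(step_rows, std::cref(cur), std::ref(next), first, last);
    }
    for (auto& t : pool) t.join();
    return next;
}
```

Because every band writes to its own rows and only reads the current board, the workers need no locking. A library such as TBB would typically decide how to split the rows for you rather than having you create the threads by hand.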

For me, getting the application to update the UI for successive generations was, surprisingly, the hardest part of the application. Each game board window was responsible for updating its contents every time a new board was computed. First, we tried to post a message to the window for it to update itself with the new board data. While this worked, the application UI effectively froze as incoming events were drowned in the requests for board updates. This was definitely unacceptable. After all, we have complained many times in this series of blog posts about unresponsive applications :). After some thought we figured a way out: after computing the next board to display, draw directly into the board's graphics context (more like a computer game would as it draws successive frames) instead of posting a message to the window’s message queue.

Anyway, this series of samples has been fun, and we look forward to developing a few more. If you're interested in C++/CLI, TBB, or graphically representing the results of your algorithm in real time, check out the sample.

Michael Jeronimo

Parallel Computing Research Paper

The current trend in microprocessor technology is to take advantage of Moore’s Law (which states that the number of transistors in a given area of a chip will double approximately every 18 months) to include an increasing number of processor cores in a single integrated circuit chip. Here's an interesting article by a group of Berkeley researchers that discusses parallel computing and why continuing to double the number of processor cores is likely to meet with diminishing returns.

A couple quotes I found interesting:

"Multicore will obviously help multiprogrammed workloads, which contain a mix of independent sequential tasks, but how will individual tasks become faster? Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance. Hence, multicore is unlikely to be the ideal answer."

"Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel."

This paper has me thinking about parallel programming models and the meaning of “naturally parallel.” For example, there are existing languages, such as Verilog, that are used to describe hardware systems and, as such, must be naturally parallel. However, I certainly wouldn’t want to code application software in languages this way. Instead, we'd want a naturally parallel language at a much higher level of abstraction. Even so, can we ever remove the essential complexity involved in coordinating multiple simultaneous threads of execution? Also, must we, as developers, forever need to explicitly parallelize applications, or should we eventually be able to work in a comfortable sequential programming model and have tools analyze the inherent parallelism and take care of the translation automatically?

Perhaps these issues may be a symptom of relying on imperative programming languages, and the solution will lie instead with declarative programming languages. To take advantage of the increasing number of processor cores we may need to move from specifying how to perform a computation to specifying what should be computed. It’ll be interesting to see what develops in a few years in this area. After all, 16- and 32-way processors are not that far off and, if this paper is right, we will experience diminishing returns at that point and software developers will be in for some big changes.

Michael Jeronimo

Intel Showcases 80-cores

The talk about high-core-count CPUs is heating up. Intel announced an 80-core processor with less power consumption than a current Core 2 Duo (article link). That's cool stuff for research, though it's not ready for mass marketing yet. For one thing, at 275 square millimeters vs. 143 for the Core 2 Duo, the chip is too big (though not massive by any stretch). The bigger thing, though, is that the chip doesn't support the x86 instruction set. A VLIW (Very Long Instruction Word) architecture is fine for specialized applications, but it is not suitable for running any mainstream operating system. Oh yeah, and they say it's not able to be connected to memory yet!

Their biggest hurdle, they say, is the fact that modern software isn't ready. Not only applications, but operating systems as well. If we can't scale well to dual or quad processors, what's the point of even moving to 16, let alone 80? I wonder about this argument though. It seems to me that certain applications are already very well suited to massively parallel operation; think of grid computing. Various projects (SETI@Home, Folding@Home, the just announced OpenMacGrid, and many others) allow a huge number of computers to crunch portions of datasets. Each is handed a chunk of data, then the results are fed back to the central server. This model sounds ideal for the high-core scenario. No node is dependent on another. It's idle until work is assigned, it does its own thing, then it sends back the end result. One big thing is that this model is designed assuming high latency. Smaller chunks of data aren't worth the round-trip. It's also not intended for any kind of realtime consumption. Some changes to these base assumptions would be required, but it seems like the multithreaded/multi-core outlook isn't as bleak as it's made out to be.
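The "hand out a chunk, get a result back" model is easy to sketch on a single multi-core machine. The snippet below is only an illustration under my own assumptions: process_chunk is a made-up stand-in for the real work (here it just sums its slice), and std::async plays the role of the central server handing chunks to independent workers.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical unit of work: reduce one chunk [begin, end) to a single result.
// It touches no shared state, so chunks never depend on one another.
long long process_chunk(const std::vector<int>& data, std::size_t begin, std::size_t end) {
    return std::accumulate(data.begin() + static_cast<std::ptrdiff_t>(begin),
                           data.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
}

// "Central server": hand out chunks, then gather the results as they come back.
long long process_in_chunks(const std::vector<int>& data, std::size_t chunk_size) {
    if (chunk_size == 0) chunk_size = 1;
    std::vector<std::future<long long>> pending;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk_size) {
        const std::size_t end = std::min(data.size(), begin + chunk_size);
        pending.push_back(std::async(std::launch::async, process_chunk,
                                     std::cref(data), begin, end));
    }
    long long total = 0;
    for (auto& f : pending) total += f.get();
    return total;
}
```

The chunk size is the knob mentioned above: on a grid, chunks must be large enough to be worth the round-trip, while on one machine they can be much smaller before the overhead dominates.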

Server rendering farms for graphics are a smaller-scale grid application.  Each machine in the cluster could just as easily be one of the cores in one of these new systems.  I agree that for the common application, no one knows what to do with the level of parallelism.  Large dataset analysis and multi-user server systems should see great benefit.  When designed in early, most applications can benefit from parallel processing.  It's true that synchronization can be a challenge at first, but it all just makes sense once you've worked with it longer.  I'll be anxious to see how things progress in the next few years.

Link to Intel squeezes 1.8 TFlops out of one processor | TG Daily

Link to Intel shows off 80-core processor | ZDNet

 

Arian Kulp
akulp@3leaf.com

January 30, 2007

Threading Maturity Model (ThMM)

I stumbled across a blog post today by Alan Zeichick that defined the Threading Maturity Model. This model follows in the footsteps of the Capability Maturity Model and the SOA Maturity Model. It is a 5-tier model (in the post it’s a 6-tier model if you count Unawareness).

This model starts with Awareness, where developers know that parallelization exists and hope that the Operating System or some other software library is handling most of the work. The second tier is Experimentation where developers are experimenting with parallel programming on their own by writing simple applications or prototypes, but not in their major development efforts.

The third tier is HotSpots, where developers use parallel programming to troubleshoot or attack specific performance areas in their applications. Developers are still mainly self-taught and this development effort is not formalized. The fourth tier is Utilization, where developers are formally trained to implement parallelized code in their applications. This is a much higher level of adoption, but complete adoption is still missing.

The fifth tier is Adoption. This is the highest tier in the model where parallel programming is integrated in all development efforts including software library evaluation and adoption of third-party tools.

This model represents a good benchmark for corporations looking to move into the parallel programming environment. With the proliferation of dual core desktop and mobile systems, more software will be written in a parallelized way. This will be an absolute requirement, but I wonder how long it will take before all developers are on tier 5. I wonder if it will be like other new development trends. Newer developers coming out of universities and technical colleges will jump on board immediately and the older developers will stick with the status quo. I wonder though how long older developers will have a choice.

The average lifespan of a new consumer computer is 3-5 years. This means that within the next year or two, most computers will have at least a dual core system inside. I think consumers will continue to demand more responsive software and their data needs are only going to grow. I think that all developers will have to be at a high level of adoption in the next few years or they will be left behind.

The article suggested that most corporations are in Tier 1 or 2 at the moment. There is definitely an interest in parallel programming, but dual core and quad core systems are going to force developers to go beyond just hobby programming and integrate these solutions in their production-level applications.

Michael Cassens
mcassens@3leaf.com

January 29, 2007

We need better tools…

I read an article today in the Electronic News Network that addressed the shortage of software tools available to developers writing parallel programs. The article states that although most prototypes are created using high-level languages such as MATLAB, Python, and Mathematica, tools to translate the prototype into a functioning application do not exist. Most developers want to use their desktop tools to write parallelized software instead of switching to vi or Emacs to edit and compile C and Fortran code. Interestingly enough, the last major software update for parallel programming was in the 1980s. This seems like an opportunity waiting to happen.

A survey in the article also tied low developer productivity to this software deficiency. Better software development environments result in better productivity. Often new software ideas emerge as a secondary effect. By integrating software tools more closely with parallelized code, a new wave of software developers will emerge. These developers will write parallelized software that enters the marketplace in a shorter timeframe. This is good for developers, corporations, and end-users.

In an earlier blog post, I discussed abstracting parallel programming techniques. I think that post fits with this concern. By simplifying software development, developers can be more efficient and create more applications and build upon existing applications much more quickly. One of the most important requirements is having a good integrated development environment that can translate prototypes into full applications. Microsoft and Intel have started down this road by integrating OpenMP in Visual Studio and creating application building blocks, but there is much more that can be done.

There is a secondary hurdle. Most desktop systems are not powerful enough to handle the datasets that are used in scientific computing. The Electronic News article states that most datasets contain about 20-30 GB of data, but they continue to grow, with an expected size of 100-300 GB. Most desktop systems have 300+ GB hard drives, but the transfer speed is not sufficient to realize the improvements. Additionally, limited RAM capacity and transfer rates also constrain performance. What about processors?

Will dual-core or quad processors help desktop systems? The answer is probably not yet. Although simplified processing might be feasible on these systems, it’s more likely that prototyping with subsets of data is still going to be used by developers. Unfortunately until more powerful desktop machines that run like a mini-supercomputer are released, this problem will not go away.

When these improvements arrive, scientific computing developers will not only be more efficient, but advancements in this field will be realized much sooner. I think there are a number of other industries that will also benefit with these advancements. Imagine libraries of books and journals being indexed and searched in a matter of hours instead of days or weeks. I could also see industries with large archival data like banking or law firms becoming more efficient resulting in lower overhead saving everyone time and money. There is a lot to look forward to as software and hardware continue to improve.

Michael Cassens
mcassens@3leaf.com

Multi-processing programming too hard? Why not abstract it?

I read an article today published in the Technology Review at MIT that focused on the multi-core processors and the hurdles that developers face when writing software to take advantage of multi-core processing.

Researchers at MIT are developing a framework that abstracts some of the details of parallel programming. They want to help mainstream developers move to parallel programming more quickly and still be efficient.

Their focus is preventing complete system failure when different applications or tasks run on separate cores. Systems fail in these scenarios because one application accesses some shared memory and when another application tries to access the same shared memory, the whole system can freeze and eventually shut down.

Currently, developers must put safeguards in place to prevent these system crashes, but at MIT, they are trying to develop a more transactional way of accessing this shared memory. They want to make sure multiple applications can access shared memory without any deadlocks and verify that when an application makes a change to the memory, the memory is still viable for the next transaction.
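I don't know the details of the MIT framework, so the following is not their API. It is only a minimal sketch of the transactional idea on a single shared value, assuming a made-up "balance" that several threads withdraw from: read a snapshot, compute the new value, and commit only if nobody else changed the value in the meantime, otherwise retry. No locks are held, so there is nothing to deadlock on.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Try to withdraw `amount` from a shared balance without taking a lock.
// The read-compute-commit loop retries whenever another thread got in first.
bool try_withdraw(std::atomic<long>& balance, long amount) {
    long seen = balance.load();
    while (seen >= amount) {
        // Commit only if the balance still equals the snapshot we read.
        if (balance.compare_exchange_weak(seen, seen - amount)) return true;
        // On failure `seen` now holds the current value; loop and re-check it.
    }
    return false;  // not enough left; the shared value stays valid
}

// Hypothetical driver: several threads each attempt to withdraw 1 unit
// `attempts_each` times. Returns how many withdrawals succeeded.
long count_successful_withdrawals(std::atomic<long>& balance, int threads, int attempts_each) {
    std::atomic<long> successes{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&balance, &successes, attempts_each] {
            for (int i = 0; i < attempts_each; ++i) {
                if (try_withdraw(balance, 1)) ++successes;
            }
        });
    }
    for (auto& th : pool) th.join();
    return successes.load();
}
```

A real transactional memory system extends this to whole groups of reads and writes, which is exactly the part that is hard to do by hand.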

Major players in the industry like Microsoft, AMD, and Intel are investing a lot of money into making parallel programming easier. This seems like the right path to me. Abstraction is an integral part of Object Oriented programming. I hope a number of different APIs come from these corporations in the near future. Multi-core processors are here to stay. In fact, they will continue to grow. Intel and AMD plan on releasing the first consumer-oriented quad core processor in 2007 putting more pressure on developers to write software to take advantage of this hardware.

Microsoft supports OpenMP for C++ in Visual Studio 2005. This is a good first step, but I think having another application building block release from Microsoft would propel parallel programming into the mainstream.
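For readers who have not seen OpenMP, here is a small sketch of what it looks like in C++. The function names are mine and only illustrative. The pragmas ask the compiler to share the loop iterations among threads; they take effect when OpenMP is enabled (/openmp in Visual C++, -fopenmp in GCC), and without that switch the compiler ignores them and the loops simply run on one thread.

```cpp
#include <vector>

// Scale every element. The iterations are independent of one another,
// so OpenMP can hand different ranges of `i` to different threads.
void scale(std::vector<double>& v, double factor) {
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        v[i] *= factor;
    }
}

// Sum of squares. Every iteration adds to `total`, so the reduction clause
// gives each thread a private partial sum and combines them at the end.
double sum_of_squares(const std::vector<double>& v) {
    const int n = static_cast<int>(v.size());
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += v[i] * v[i];
    }
    return total;
}
```

The appeal is that the serial code stays readable: remove the pragmas and you have the original loop back.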

Making parallel programming easier is the most logical path if widespread adoption is the goal. Industry developers typically do not have a lot of time to spend on research and development. Even though they may take time to research alternative ways to develop software, large-scale implementation changes can be too vast and time-consuming to be useful. Hopefully, with more abstraction, changing to parallel programming might not be such a huge leap.

Michael Cassens
mcassens@3leaf.com

Wavelets in the real world

What is a wavelet? Wikipedia defines wavelets as:

A wavelet is a representation of a signal in terms of a finite length or fast decaying oscillating waveform.

This definition is not used often outside of mathematics circles, but some of its applications are. For example, wavelets are used for image processing, blood pressure analysis, ECG analysis, DNA analysis, climatology, speech recognition, computer graphics, and the list goes on.

So, why should we take a look at wavelets? I read an article today by Larry O’Brien about multi-threading. In his article, he discusses how multi-core processors enhance the utility of wavelets by compressing a 16 Mega-pixel image. Using Visual Studio 2005, OpenMP, and his multi-core machine, he saw a speed increase of 70% with 100% CPU utilization. The results were impressive and something to consider with other applications.
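To give a feel for why wavelet code parallelizes well, here is a minimal sketch of one level of the simplest wavelet transform, the Haar transform, in its unnormalized averages-and-differences form. This is not the code from the article; the function names are my own, and an odd trailing sample is simply ignored to keep the example short.

```cpp
#include <cstddef>
#include <vector>

// One level of the (unnormalized) Haar transform.
// Output layout: the first half holds pairwise averages (the coarse signal),
// the second half holds pairwise half-differences (the detail).
std::vector<double> haar_step(const std::vector<double>& signal) {
    const std::size_t half = signal.size() / 2;
    std::vector<double> out(2 * half);
    for (std::size_t i = 0; i < half; ++i) {
        const double a = signal[2 * i];
        const double b = signal[2 * i + 1];
        out[i] = (a + b) / 2.0;
        out[half + i] = (a - b) / 2.0;
    }
    return out;
}

// Inverse of haar_step: rebuild each pair from its average and difference.
std::vector<double> haar_step_inverse(const std::vector<double>& coeffs) {
    const std::size_t half = coeffs.size() / 2;
    std::vector<double> out(2 * half);
    for (std::size_t i = 0; i < half; ++i) {
        out[2 * i] = coeffs[i] + coeffs[half + i];
        out[2 * i + 1] = coeffs[i] - coeffs[half + i];
    }
    return out;
}
```

Each pair of samples is processed independently of every other pair, so the loop can be divided among cores. Compression comes from the detail values, which tend to be small for smooth signals and can be stored coarsely or dropped.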

He argues that dual core speedup does not always outweigh the complexity of writing parallelized code. He continues by saying that optimizing single-threaded code often results in application speed up rivaling the speed increase of multi-threaded applications. However, he states that it won’t be long before quad-core, 16-core, and 32-core processors will arrive, and it will be impossible to deny the benefits of multi-threaded processing.

As developers we are essentially support staff automating tasks for end-users. (This might not be a popular definition, but we are.) We build software for businesses, scientists, academics, and consumers. We do this to make their lives easier and more productive. We are asked to build solutions that leverage and improve existing techniques.

We must also remember why we build software. We build software so individuals who have lost the ability to type can continue to work using speech recognition software. We build software so that people in need of heart surgery can be diagnosed sooner. We write software not only to make money, but to help people. Multi-threaded software extends this purpose.

There are a lot of reasons to be afraid of multi-threaded development (race conditions, code that is hard to debug, etc.), but I think the rewards far outweigh any of these uncertainties.

Michael Cassens
mcassens@3leaf.com

January 12, 2007

A Matter of Granularity

If someone is generally klutzy, or neglects one responsibility because of the urgency of another, we jokingly say they “can’t walk and chew gum at the same time.” This is, in a sense, a statement about multithreading. Humans are used to doing multiple tasks simultaneously (though some are better at it than others). A musical performer, for instance, will be doing chording and fingering with his left hand, plucking or strumming strings with his right hand, singing lyrics and melody with his mouth, and keeping time with a tapping foot.

 If we were to think of the musician as a computer, and the performance as a program, we could break the program down into 4 obvious threads. Those would be left hand, right hand, mouth, and foot. It’s the same way with walking and chewing gum; there are two obvious threads for the program. When programmers are faced with tasks like these, they can immediately see where they can get performance gains by expressing the solution using multiple threads. It becomes more difficult, however, when the task has less obvious threading boundaries.

 Let’s say that a programming team is assigned the “chew gum and walk” program. They do initial analysis and decide that they could implement the program serially, but could get a significant performance improvement if they put chewing gum and walking in separate threads. As they’re about to begin implementation, word comes down that the budget for the project has been reduced, and the “chew gum” feature has been deferred to some future version of the program; version 1 will just “walk.”

At first, the team thinks that there is no longer an opportunity to gain from multithreading. But shortly they realize that walking engages 2 legs, and they can put each leg in a separate thread. They’ll have to synchronize the threads to keep the program from falling over or walking in a circle, but they know how to do that. Unfortunately, the project budget is cut again, and now the scope has been reduced to 1 leg making a walking motion. The team rises to the occasion, however, and plans multiple threads to handle the calf, thigh, knee, etc.

 The opportunity for using multiple threads is really a question of task granularity. Consider the very low-level case of 3 lines of code:

 

1. a = a + c

2. b = b + d

3. e = a + b

 

The calculation in line 3 depends on the calculations in lines 1 and 2. The calculation in line 2, however, does not depend on the calculation in line 1. Lines 1 and 2 can be safely executed simultaneously. The overhead of explicitly starting a thread just to execute line 2 at the same time as line 1 would have a negative impact on real performance, programmer efficiency, and program maintainability. However, this level of parallelism (called Instruction Level Parallelism or ILP) is a robust area of research, with the emphasis on compilers and processor hardware that can determine which instructions depend on which others, and which can be executed simultaneously, all without explicit program direction.
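The dependence argument can be checked with a few lines of C++. This is only a sketch with names I made up (State, program_order, and so on): it runs the three lines in program order, with lines 1 and 2 swapped, and with line 3 moved first, to show which reorderings leave the result unchanged.

```cpp
// The five variables from the three-line example.
struct State {
    int a, b, c, d, e;
};

// Lines 1, 2, 3 in the order they were written.
State program_order(State s) {
    s.a = s.a + s.c;  // line 1
    s.b = s.b + s.d;  // line 2
    s.e = s.a + s.b;  // line 3
    return s;
}

// Lines 1 and 2 swapped. Neither reads what the other writes,
// so the outcome matches program_order.
State lines_1_and_2_swapped(State s) {
    s.b = s.b + s.d;  // line 2
    s.a = s.a + s.c;  // line 1
    s.e = s.a + s.b;  // line 3
    return s;
}

// Line 3 moved ahead of lines 1 and 2. It reads a and b before they are
// updated, so e comes out different: this reordering is not safe.
State line_3_moved_first(State s) {
    s.e = s.a + s.b;  // line 3
    s.a = s.a + s.c;  // line 1
    s.b = s.b + s.d;  // line 2
    return s;
}
```

Starting from a = 1, b = 2, c = 3, d = 4, the first two versions both end with e = 10, while the third ends with e = 3. A compiler or processor does this kind of dependence analysis automatically when it decides which instructions may overlap.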

 If we go back to our musician, it now appears that there are a lot more opportunities for threading than the obvious ones. We have 10 separate digits on 2 hands, a tongue, lips, cheeks, a larynx, and much more. However, we don’t want to thread just because we can. Let’s say we write the musician program with 30 separate threads of execution. What happens when we run the program on a 1 or 2 core processor? We will likely see a loss in performance due to the overhead of thread creation, destruction, and switching.

 It’s all a juggling act. We have to find the balance for threading. We want enough threads so that we can realize a performance gain, but not so many that the gain is overtaken by threading overhead. But it isn’t really a question of whether a given task can be threaded; that’s just an issue of task granularity.
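One simple way to strike that balance is to let the hardware, not the problem, set the upper limit on threads. The helper below is a hypothetical sketch (pick_worker_count is my own name): it caps the number of workers at the number of hardware threads reported by std::thread::hardware_concurrency(), which is allowed to return 0 when the value is unknown, so the sketch falls back to a single worker in that case.

```cpp
#include <algorithm>
#include <thread>

// Choose how many worker threads to start for `independent_tasks` pieces of
// work on a machine that reports `hardware_threads` hardware threads.
unsigned pick_worker_count(unsigned independent_tasks, unsigned hardware_threads) {
    if (independent_tasks == 0) return 0;                    // nothing to do
    const unsigned cores = std::max(1u, hardware_threads);   // 0 means "unknown"
    return std::min(independent_tasks, cores);               // never more threads than cores
}

// Convenience overload that asks the running machine.
unsigned pick_worker_count(unsigned independent_tasks) {
    return pick_worker_count(independent_tasks, std::thread::hardware_concurrency());
}
```

With this rule the 30-thread musician program would start only 2 workers on a dual-core machine and share the 30 tasks between them, instead of paying for 30 threads that mostly wait for a turn.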

 

Richard Bowler
rbowler@3leaf.com

January 09, 2007

Multi-core Hardware & Software Gap: A User’s Perspective

Intel has just debuted a new lineup of quad-core processors at this week’s CES, and at least one of them isn’t just meant for server applications. The Core 2 Quad marks Intel’s first attempt to spread the technology to platforms beyond those in the back office, and the competition is close on its heels with similar offerings. But you may have heard comments like “My apps aren’t that much faster on my dual core box, let alone a quad core” or “I’ve heard my apps just get slower the more cores that are in my machine.” Both of these statements are valid to varying degrees.

If you’ve recently purchased a computer for personal or business use, you may be wondering why you don’t see more performance out of your machine. Perhaps you’ve read some blogs in this space that touch upon the reasons why. Software developers have to rethink their applications’ design and “thread” them. This is difficult and sometimes costly. Some applications already sing with multiple processing cores, but for the average user it can be hard to put up with underperforming software. Let’s face it: vendors can’t expect customers to put up with limitations for long.

The thing to keep in mind is that the current gap between the chips and software won’t stay as wide as it is now for very long. Companies are aware of their customers’ desires: they innovate or someone eats their lunch. So while it is true we are currently experiencing a software lag with respect to multi-core, a period of increased competition could shortly follow, and consumers will be the winners. Look for vendors to rush products to market with shiny new features and thread awareness in an effort to garner new customers.

It is very true that there is a gap (OK, a large gap) between what today’s processors are capable of and what the software asks of them. After recently wrenching the performance crown away from its main competitor, Intel is trying to stay in the lead with product releases. Nowadays this means more cores. Most purchasers of the new quad core lineup will be advanced multimedia content producers and gamers who demand every ounce of performance they can get. Intel is working hard with software vendors to close the gap and offer more and more applications that are thread aware.

During this “gap” time, scan the horizon for thread-aware alternative products that suit your needs for new software. For new software purchases, demand thread awareness. Ditch the older applications that perform slowly, as well as those from vendors that refuse to produce thread-aware versions. Becoming a prosumer who is aware of product features will always net you the best value.

Jason Shigley

January 08, 2007

Is Multithreading knocking on your door?

Earlier I posted about the challenges and opportunities available to those willing to take on the multithreaded monster in the design of their applications. The way I see it, this opportunity is here, now, in the present and at multiple levels. The hardware is here and its market penetration is getting better each day. Given a project life cycle of between six months and a couple years the penetration will be even larger and the realized performance even better on the emerging hardware. You need to make threading a part of your design now.

Many have bemoaned the lack of multithreaded awareness in lots of mainstream apps. I look at it as a loss of opportunity by that vendor. Or rather, is it a knock on the door of their competitor? We all know that everyone is looking for an edge on the competition. With multi-core processors in such proliferation I look for software to soon be categorized as either threaded or non-threaded. So you’re browsing the software section of your favorite geek store/site looking for a video editing package, development tool, strategy game, etc. Some say they’re threaded, some do not. Which would you buy? I’m betting it might even sway a brand loyal consumer to the other product. Those that are threaded will have the big wins while those that are not will lose market share. Honestly, can you afford not to have your app threaded?

One level of this opportunity that is immediately apparent is that you can thread your existing apps. Another level is to make sure threading is represented in your future product designs. I know, easier said than done. But almost every shop/project has a training budget, right? So put it to use; I would wager there aren’t many better uses for those dollars right now. Look at it another way… just like 32-bit meant a retooling of many applications and crept into software designs, so too will threading. Other opportunities exist on an individual level. Obviously with lots of software projects seeking to remain competitive and/or performant there is a demand for those with demonstrable skills in threading applications. I see many tech job sites already advertising for these talents. Hit some seminars and read, read, read.

Today the hardware is definitely ahead of the software but in my opinion it won’t stay this way for long. Vendors will innovate and opportunities will be realized. Those that don’t will have missed the boat.

Jason Shigley