Monday, June 10, 2013

How Computer Architecture and C++ can help a Java developer improve performance

Recently I came across an interesting example of how knowledge of C/C++ and Computer Architecture can help a Java developer write better-performing code, and I want to share it with you all.

Let's take the example of multiplying two 1000x1000 matrices in two ways: (Download the source code from: https://docs.google.com/file/d/0BxxKTIP8UwuMbXpqb3JJYVN6Rm8/edit?usp=sharing)
1. The traditional logic   ( usualMatrixMul() function has this logic in the code shared )
This is the traditional way we were taught in school: we traverse matrix 1 horizontally and matrix 2 vertically, multiplying the corresponding elements, computing partial sums, and adding up all the partial sums at the end to get each element of the resulting matrix.
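The shared source file isn't reproduced here, but the traditional triple loop can be sketched roughly as follows (class and method names are illustrative and assume square int matrices; the shared code may differ in detail):

```java
public class MatMul {
    // Traditional multiplication: c[i][j] = sum over k of a[i][k] * b[k][j].
    // Note that b is walked down its columns (b[k][j] with k varying),
    // i.e. NOT linearly in memory.
    public static int[][] usualMatrixMul(int[][] a, int[][] b, int n) {
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int sum = 0;
                for (int k = 0; k < n; k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }
}
```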

2. Slightly modified logic ( modifiedMatrixMul() function has this logic in the code shared )
In this slightly modified algorithm, we first transpose matrix 2 and then apply the same traditional logic, adapted to the transposed matrix 2.
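A rough sketch of the transposed variant (again with illustrative names, assuming square int matrices; the shared code may differ):

```java
public class MatMulTransposed {
    // First transpose b, then multiply so that BOTH operands are read
    // row by row, i.e. linearly in memory.
    public static int[][] modifiedMatrixMul(int[][] a, int[][] b, int n) {
        int[][] bt = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                bt[j][i] = b[i][j];
            }
        }
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int sum = 0;
                for (int k = 0; k < n; k++) {
                    // bt[j][k] == b[k][j], but accessed row-wise
                    sum += a[i][k] * bt[j][k];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }
}
```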

A demo run of the two algorithms took the following times (x86 processor, Win32):
Usual matrix multiplication algorithm: 24797 ms
Modified matrix multiplication algorithm: 8500 ms


How can this small change in logic cut the running time by about a factor of three? A little knowledge of C/C++ and Computer Architecture can help you understand it.


1. From C/C++: Multidimensional arrays are stored linearly in memory
The JVM is itself implemented in C/C++, and under the hood Java arrays map to C/C++-style arrays. In C/C++, arrays, whether single- or multi-dimensional, are stored linearly (row-major) in memory. (Strictly speaking, a Java two-dimensional array is an array of row arrays, but each row is still a contiguous block, so the same row-major reasoning applies.)
For instance,
Array X is stored in memory as: X11 X12 X21 X22
Array Y is stored in memory as: Y11 Y12 Y21 Y22


2. From Computer Architecture: The processor cache caches linear chunks of data from memory
When the CPU needs some data from memory, it first consults the cache. If the cache has the data for the requested memory address (a Cache Hit), it immediately gives the data to the CPU and the CPU proceeds with its work. But if the cache has not cached that memory address, there is a Cache Miss, and the CPU then fetches the containing block from main memory into the cache and proceeds with its execution.

Cache access times are in nanoseconds, whereas accessing main memory (RAM) takes much longer in comparison.

How do these facts help the slightly modified algorithm?
When the array is very large, parts of the array are cached, and when there is a miss on some other part of the array, the subsequent part of the array is cached linearly.

With the traditional logic, matrix X is cached linearly and accessed linearly, so cache misses are few. But matrix Y is cached as Y11 Y12 Y21 Y22 while being accessed as Y11 Y21 Y12 Y22.

For large data, like a 1000x1000 matrix, the cache misses for matrix Y will be huge, and every time the CPU has to bring data from main memory, which is a comparatively expensive operation; as said earlier, the cache fetches chunks of data in increasing linear order of addressable memory.

Hence, tweaking the logic a little so that matrix Y is accessed linearly reduces cache misses and yields a clear improvement in running time.
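The effect is easy to observe in isolation with a plain traversal test. The sketch below (illustrative, not from the shared source) sums a matrix row by row versus column by column; on a large matrix, timing the two passes with System.nanoTime() typically shows the row-wise (linear) pass running noticeably faster, even though both compute the same sum:

```java
public class TraversalDemo {
    // Row-wise traversal: visits elements in the order they sit in memory.
    public static long rowWiseSum(int[][] m) {
        long sum = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                sum += m[i][j];
            }
        }
        return sum;
    }

    // Column-wise traversal: same elements, but strided access pattern,
    // which causes many more cache misses on large matrices.
    public static long colWiseSum(int[][] m) {
        long sum = 0;
        for (int j = 0; j < m[0].length; j++) {
            for (int i = 0; i < m.length; i++) {
                sum += m[i][j];
            }
        }
        return sum;
    }
}
```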


Download source code at: https://docs.google.com/file/d/0BxxKTIP8UwuMbXpqb3JJYVN6Rm8/edit?usp=sharing

This performance improvement is based on caching, is specific to the traditional algorithm, and holds for any programming language, be it Java, C, or C++. Java developers just additionally need to know how the JVM handles Java arrays.

Thursday, April 4, 2013

Parallel make utility to leverage multi-core processors

By default, the make utility, which is used for building native code, is single-threaded: even though different targets in the makefiles could be compiled in parallel, make proceeds serially and builds the targets one after another.

But if you have a multi-core processor, you can use a handy switch in the make utility to build different targets in parallel.

To get the number of CPUs in your machine, use: grep 'processor.*:' /proc/cpuinfo | wc -l

Suppose you have two CPUs; you can tell make to use both at the same time and perform two compilations in parallel. Make takes a parameter, -j, which specifies the number of parallel jobs you want. For example, make -j 2 runs the build with two parallel jobs.

Now, to tie the number of make jobs to the number of CPUs, we can use:

make -f makefile -j `grep 'processor.*:' /proc/cpuinfo | wc -l` -k buildall

This utilizes the full computation power of your machine and completes the compilation in less time.


Drawback:

All parallel make jobs write output messages to the same output stream, so in your terminal you will see interleaved messages from all of them. This could be a problem for logs, but for developers who just want the compilation done fast and are not worried about the logs, this is fine.

Thursday, February 21, 2013

Controlling a C# application from a web application

In this small tutorial, we are going to create a simple C# Windows application with a web browser component, load a web page in it, and control the entire native application from the web application loaded into the web browser control.

This way, you can build your app almost entirely as a web app, in any of your favorite web programming/scripting languages, like PHP, HTML, or JavaScript, and control the native application with web-based code.

A demo app, the notifications app, is built on this concept. The app is open source at https://code.google.com/p/notificationsapp/

I will write up the steps in this blog, but until then, I have made a small video tutorial explaining them. Please check it out.