Data sorting with SSE commands
Data from Spectrum digitizers is always delivered by the DMA engine in multiplexed form. For a 4-channel card this means a data stream of A0, B0, C0, D0, A1, B1, C1, D1, A2, ... Many calculation routines can easily be adapted to this sample order, but in some cases it is necessary to re-sort (de-multiplex) the data into separate arrays for channels A, B, C and D (for channel A, the data should be A0, A1, A2, A3, ...).
Classic sorting algorithm
The simplest way is to use a for-loop (examples in C++). In our example we de-multiplex four channels, where pDataMux is a pointer to the array of multiplexed data and pDataA, pDataB, pDataC and pDataD are pointers to the separate arrays for each channel.
The algorithm is easy to understand and simple to implement, but each loop iteration needs many CPU clock cycles. During fast streaming this simple loop consumes quite a lot of processing power.
for (lSamples = 0; lSamples < lMaxSamples; lSamples++)
{
    (*pDataA++) = (*pDataMux++);
    (*pDataB++) = (*pDataMux++);
    (*pDataC++) = (*pDataMux++);
    (*pDataD++) = (*pDataMux++);
}
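The same idea can be written for an arbitrary channel count. The following is only a minimal sketch: the function name demuxClassic and the use of std::vector are assumptions made for illustration and are not part of the Spectrum examples.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the Spectrum examples): splits an interleaved
// buffer A0, B0, C0, D0, A1, ... into one vector per channel.
std::vector<std::vector<std::int16_t>> demuxClassic(
    const std::vector<std::int16_t>& mux, std::size_t numChannels)
{
    // The buffer must hold a whole number of sample frames.
    assert(numChannels > 0 && mux.size() % numChannels == 0);
    const std::size_t samplesPerChannel = mux.size() / numChannels;

    std::vector<std::vector<std::int16_t>> channels(
        numChannels, std::vector<std::int16_t>(samplesPerChannel));

    // Walk through the multiplexed stream once, handing each value to the
    // next channel in turn.
    const std::int16_t* pDataMux = mux.data();
    for (std::size_t sample = 0; sample < samplesPerChannel; ++sample)
        for (std::size_t ch = 0; ch < numChannels; ++ch)
            channels[ch][sample] = *pDataMux++;

    return channels;
}
```

Calling demuxClassic(mux, 4) on a buffer recorded from a 4-channel card would return four vectors, one each for channels A to D.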
SIMD commands and SSE
This process can be sped up with SIMD (Single Instruction, Multiple Data) commands. These commands manipulate multiple data elements with a single CPU instruction that takes only a few CPU clock cycles.
For x86 CPUs, SSE2 (Streaming SIMD Extensions 2), introduced by Intel in 2000 with the Pentium 4 family, is a good choice for de-multiplexing data. SSE2 is an extension of the x86 instruction set. It is also supported by the AMD64 architecture. More information on the SSE2 command set can be found on Wikipedia.
SSE2 commands can easily be used in C++ through intrinsic function calls, without the need for assembly programming. General information on intrinsic functions can be found on Wikipedia.
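To give an impression of what an intrinsic call looks like, here is a small sketch that adds two blocks of eight 16-bit samples with one SSE2 addition. The function name addEightSamples is hypothetical. The intrinsics themselves (_mm_loadu_si128, _mm_add_epi16, _mm_storeu_si128) are declared in the standard header emmintrin.h.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: adds two blocks of eight 16-bit samples with a single
// SSE2 addition instead of eight scalar additions.
void addEightSamples(const std::int16_t* a, const std::int16_t* b,
                     std::int16_t* out)
{
    assert(a != nullptr && b != nullptr && out != nullptr);

    // Unaligned loads, so the caller does not need 16-byte aligned buffers.
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));

    // Eight 16-bit additions at once (results wrap around on overflow).
    const __m128i sum = _mm_add_epi16(va, vb);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}
```

The compiler translates each intrinsic into the corresponding SSE2 instruction, so no inline assembler is needed.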
SSE2 sorting algorithm
All SSE2 commands operate on a data length of 128 bits (16 bytes or 8 words). A single command therefore manipulates 8 samples of 12, 14 or 16 bit width. Spectrum has implemented a few de-multiplexing routines as part of its examples, which are free to use. The examples installer can be downloaded here. The sorting functions are implemented in C++ for 2, 4 and 8 channels of 16 bit data. The header and source files are found in the SSE folder of the C++ examples.
A list of intrinsic functions can be found at Intel.
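As an illustration of how one 128-bit register holds eight 16-bit samples and how shuffle intrinsics reorder them, the sketch below regroups a single register of interleaved 2-channel data. The function name regroupTwoChannels is hypothetical. _MM_SHUFFLE is the standard helper macro for building the shuffle constant, and _MM_SHUFFLE(3, 1, 2, 0) has the same value as (3 << 6) | (1 << 4) | (2 << 2) | (0 << 0).

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics, also provides _MM_SHUFFLE

// Hypothetical helper: regroups one register of interleaved 2-channel data
// A0 B0 A1 B1 A2 B2 A3 B3 into A0 A1 A2 A3 B0 B1 B2 B3.
void regroupTwoChannels(const std::int16_t* in, std::int16_t* out)
{
    assert(in != nullptr && out != nullptr);

    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));

    // _MM_SHUFFLE(3, 1, 2, 0) selects input elements 0, 2, 1, 3 for output
    // positions 0, 1, 2, 3, i.e. it swaps the two middle elements.
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(3, 1, 2, 0)); // A0 A1 B0 B1 | A2 B2 A3 B3
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(3, 1, 2, 0)); // A0 A1 B0 B1 | A2 A3 B2 B3
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 1, 2, 0));   // A0 A1 A2 A3 | B0 B1 B2 B3

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

The first two shuffles work on 16-bit words within the lower and upper 64 bits, and the third one swaps the two middle 32-bit pairs so that each channel ends up in its own 64-bit half.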
Speed Advantage
The speed advantage of the SSE2-based algorithm over the classic algorithm grows with the number of channels. While there is nearly no speed advantage for 2 channels, the 8 channel SSE2 algorithm is nearly 8 times faster than the classic algorithm.
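To check such figures on your own system, a simple timing harness is sufficient. The following is only a sketch: averageSeconds and demuxFourClassic are hypothetical names, and the measured values depend strongly on CPU, compiler and optimization settings.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: runs a de-multiplexing routine several times and
// returns the average wall-clock time per run in seconds.
template <typename Routine>
double averageSeconds(Routine&& routine, int repetitions)
{
    assert(repetitions > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        routine();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repetitions;
}

// Classic 4-channel loop, wrapped in a function so that it can be timed.
void demuxFourClassic(const std::vector<std::int16_t>& mux,
                      std::vector<std::int16_t>& a, std::vector<std::int16_t>& b,
                      std::vector<std::int16_t>& c, std::vector<std::int16_t>& d)
{
    assert(mux.size() % 4 == 0);
    const std::size_t samples = mux.size() / 4;
    a.resize(samples);
    b.resize(samples);
    c.resize(samples);
    d.resize(samples);

    const std::int16_t* pDataMux = mux.data();
    for (std::size_t i = 0; i < samples; ++i)
    {
        a[i] = *pDataMux++;
        b[i] = *pDataMux++;
        c[i] = *pDataMux++;
        d[i] = *pDataMux++;
    }
}
```

A call such as averageSeconds([&] { demuxFourClassic(mux, a, b, c, d); }, 100) gives the time for the classic loop. Timing an SSE2 routine in the same way, on the same buffer and with an optimized build, gives a comparable figure.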
SSE2 Example for 2 channels
The following excerpt is the inner de-multiplexing loop of the SSE2-based algorithm that is used for 2 channel data acquisition with 16 bit samples (covering 12 bit, 14 bit and 16 bit ADC resolution).
for (uint32 i = 0; i < dwNumSamples * dwNumCh; i += dwNumSamplesPer128Bit * dwNumCh)
{
    // load 16 multiplexed samples: A0 B0 A1 B1 A2 B2 A3 B3 and A4 B4 A5 B5 A6 B6 A7 B7
    xmmData1 = _mm_load_si128 (powReadPos++);
    xmmData2 = _mm_load_si128 (powReadPos++);
    // sort first register (samples 0..3) to A0 A1 A2 A3 B0 B1 B2 B3
    xmmTmp1 = _mm_shufflelo_epi16 (xmmData1, (3 << 6) | (1 << 4) | (2 << 2) | (0 << 0));
    xmmTmp1 = _mm_shufflehi_epi16 (xmmTmp1, (3 << 6) | (1 << 4) | (2 << 2) | (0 << 0));
    xmmTmp1 = _mm_shuffle_epi32 (xmmTmp1, (3 << 6) | (1 << 4) | (2 << 2) | (0 << 0));
    // sort second register (samples 4..7) to A4 A5 A6 A7 B4 B5 B6 B7
    xmmTmp2 = _mm_shufflelo_epi16 (xmmData2, (3 << 6) | (1 << 4) | (2 << 2) | (0 << 0));
    xmmTmp2 = _mm_shufflehi_epi16 (xmmTmp2, (3 << 6) | (1 << 4) | (2 << 2) | (0 << 0));
    xmmTmp2 = _mm_shuffle_epi32 (xmmTmp2, (3 << 6) | (1 << 4) | (2 << 2) | (0 << 0));
    // lower 64 bit of both registers: channel 0 (A0 ... A7)
    xmmTmp3 = _mm_unpacklo_epi64 (xmmTmp1, xmmTmp2);
    _mm_store_si128 (powWritePosCh0, xmmTmp3);
    // upper 64 bit of both registers: channel 1 (B0 ... B7)
    xmmTmp3 = _mm_unpackhi_epi64 (xmmTmp1, xmmTmp2);
    _mm_store_si128 (powWritePosCh1, xmmTmp3);
    powWritePosCh0++;
    powWritePosCh1++;
}
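The excerpt above omits the variable declarations and the buffer setup. A self-contained version could look like the following sketch. The function name demuxTwoChannelsSse2 and its parameters are hypothetical and do not reflect the actual interface of the Spectrum example files. The sketch assumes that all three buffers are 16-byte aligned, as required by _mm_load_si128 and _mm_store_si128, and that the number of samples per channel is a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics, also provides _MM_SHUFFLE

// Hypothetical wrapper around the 2-channel loop shown above.
// Assumptions: all buffers are 16-byte aligned and samplesPerChannel is a
// multiple of 8 (one loop pass consumes two registers = 8 samples per channel).
void demuxTwoChannelsSse2(const std::int16_t* mux, std::int16_t* ch0,
                          std::int16_t* ch1, std::size_t samplesPerChannel)
{
    assert(samplesPerChannel % 8 == 0);
    assert(reinterpret_cast<std::uintptr_t>(mux) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(ch0) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(ch1) % 16 == 0);

    const __m128i* readPos = reinterpret_cast<const __m128i*>(mux);
    __m128i* writePosCh0 = reinterpret_cast<__m128i*>(ch0);
    __m128i* writePosCh1 = reinterpret_cast<__m128i*>(ch1);

    for (std::size_t i = 0; i < samplesPerChannel; i += 8)
    {
        __m128i data1 = _mm_load_si128(readPos++);  // A0 B0 A1 B1 A2 B2 A3 B3
        __m128i data2 = _mm_load_si128(readPos++);  // A4 B4 A5 B5 A6 B6 A7 B7

        // Regroup each register so that channel 0 occupies the lower 64 bits
        // and channel 1 the upper 64 bits.
        data1 = _mm_shufflelo_epi16(data1, _MM_SHUFFLE(3, 1, 2, 0));
        data1 = _mm_shufflehi_epi16(data1, _MM_SHUFFLE(3, 1, 2, 0));
        data1 = _mm_shuffle_epi32(data1, _MM_SHUFFLE(3, 1, 2, 0));
        data2 = _mm_shufflelo_epi16(data2, _MM_SHUFFLE(3, 1, 2, 0));
        data2 = _mm_shufflehi_epi16(data2, _MM_SHUFFLE(3, 1, 2, 0));
        data2 = _mm_shuffle_epi32(data2, _MM_SHUFFLE(3, 1, 2, 0));

        // Combine the lower halves (A0 ... A7) and the upper halves (B0 ... B7).
        _mm_store_si128(writePosCh0++, _mm_unpacklo_epi64(data1, data2));
        _mm_store_si128(writePosCh1++, _mm_unpackhi_epi64(data1, data2));
    }
}
```

Buffers declared with alignas(16) satisfy the alignment requirement. If alignment cannot be guaranteed, the unaligned variants _mm_loadu_si128 and _mm_storeu_si128 can be used instead.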