Software Based Averaging | Spectrum


The block averaging mode, also called segmented memory averaging, is used with digitizers in applications where incoherent noise needs to be removed from a signal. Independent of the manufacturer of the digitizer, all FPGA-based hardware implementations of the block averaging mode limit the maximum size of the segment to be averaged. The limit depends on the capacity of the FPGA and usually ranges from 32k up to 500k samples.

This white paper shows how to use the fast PCIe streaming capabilities of the Spectrum M4i series digitizers to implement block averaging in software and so go beyond these limits. Using the M4i.2230-x8 (1 channel, 5 GS/s, 8 bit digitizer with 1.5 GHz bandwidth), the results achieved with the hardware and the software block averaging methods are compared.

What is Block Averaging?

The Block Averaging mode can be used to improve the fidelity of any repetitive signal by removing its random noise components. The mode allows multiple single acquisitions to be made, accumulated and averaged. The process reduces random noise and improves the visibility of the repetitive signal. The averaged signal has an enhanced measurement resolution and an increased signal-to-noise ratio (SNR).
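The accumulate-and-average step described above can be sketched in a few lines of C++. This is only an illustration, not code from the Spectrum examples: the function name `blockAverage` and the back-to-back layout of the acquisitions in `raw` are assumptions made for this sketch. For uncorrelated noise, averaging N acquisitions is generally expected to reduce the noise RMS by roughly the square root of N.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: averages `numBlocks` acquisitions of `blockLen`
// 8-bit samples each, stored back to back in `raw`.
std::vector<double> blockAverage(const std::vector<std::int8_t>& raw,
                                 std::size_t blockLen,
                                 std::size_t numBlocks)
{
    assert(blockLen > 0 && numBlocks > 0);
    assert(raw.size() >= blockLen * numBlocks);

    // 32-bit sums: room for millions of 8-bit acquisitions per point.
    std::vector<std::int32_t> sum(blockLen, 0);
    for (std::size_t b = 0; b < numBlocks; ++b)
        for (std::size_t i = 0; i < blockLen; ++i)
            sum[i] += raw[b * blockLen + i];

    // Divide once at the end to get the averaged segment.
    std::vector<double> avg(blockLen);
    for (std::size_t i = 0; i < blockLen; ++i)
        avg[i] = static_cast<double>(sum[i]) / static_cast<double>(numBlocks);
    return avg;
}
```

Keeping the running sum in a wider integer type and dividing only once avoids accumulating rounding errors during the summation.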

The Block Averaging mode can be used to improve measurements in a variety of different applications like Radar Test, Mass Spectroscopy, Medical Imaging, Ultrasonic Test, Optical Fiber Test and Laser Ranging.

The right side screenshot shows a low level signal (approximately 2 mV) that is completely buried in random noise, together with the improvement that can be achieved when using different averaging factors. While the source signal is not even visible in the original single-shot acquisition, averaging 10 times shows that there is actually a signal with 5 peaks. Averaging 1000 acquisitions improves the signal quality even further, revealing the real shape of the signal complete with secondary maximum and minimum peaks.

This example was made using a digitizer with a sampling rate of 500 MS/s (2 ns per point) and 14 bit resolution.

System Setup

The test system was a standard office PC from the Spectrum development department consisting of the following components:

  • Motherboard: Gigabyte GA-H77-D3H
  • CPU: Intel i7-3770 4 x 3.4 GHz
  • Memory: 8 GByte DDR3 memory
  • SSD: 120 GByte Samsung 840 EVO
  • Operating System: Windows 7 Professional 64 Bit
  • Compiler: Visual Studio 2005 Standard Edition

The motherboard has one free PCIe x8 Gen2 slot, which is used by the digitizer card. This slot has a payload size of 256 bytes, which allows the Spectrum M4i cards to reach a full streaming speed of around 3.4 GByte/s (without any data processing).

Software Implementation

The test software was written in plain C++ and is based on the Spectrum streaming examples. The test card was fed with an external trigger and acquired one segment of data on every trigger event. Data was stored in the card's on-board memory and transferred by scatter-gather DMA directly into PC memory, where it was accumulated to perform the block averaging. Different setups and improvement methods were tested to see what performance levels could be achieved.

The small source code excerpt shows the threaded version of the main summation loop. This is the crucial and speed determining part of the software.
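The original excerpt is not reproduced here. As a rough illustration of the idea, a threaded summation loop could look like the following sketch. It assumes C++11 `std::thread`, which is not necessarily what the test program uses, and the names `sumSlice` and `accumulateSegments` are hypothetical. Each thread sums its own slice of the segment, so the threads never write to the same element of the sum buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Add one slice [first, last) of an 8-bit segment into the 32-bit sum buffer.
static void sumSlice(const std::int8_t* segment, std::int32_t* sum,
                     std::size_t first, std::size_t last)
{
    for (std::size_t i = first; i < last; ++i)
        sum[i] += segment[i];
}

// Add `numSegments` consecutive segments from `data` into `sum`.
// numThreads == 0 runs the loop inline without creating any thread.
void accumulateSegments(const std::int8_t* data, std::size_t segmentSize,
                        std::size_t numSegments, std::int32_t* sum,
                        unsigned numThreads)
{
    assert(data != nullptr && sum != nullptr);
    if (numThreads == 0) {
        for (std::size_t s = 0; s < numSegments; ++s)
            sumSlice(data + s * segmentSize, sum, 0, segmentSize);
        return;
    }
    // Split every segment into numThreads disjoint slices.
    const std::size_t chunk = (segmentSize + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t first = std::min<std::size_t>(t * chunk, segmentSize);
        const std::size_t last = std::min<std::size_t>(first + chunk, segmentSize);
        // Each worker owns its slice of `sum`, so no locking is needed.
        workers.emplace_back([=] {
            for (std::size_t s = 0; s < numSegments; ++s)
                sumSlice(data + s * segmentSize, sum, first, last);
        });
    }
    for (std::thread& w : workers)
        w.join();
}
```

This sketch creates and joins the threads on every call for simplicity. A streaming program would more likely keep a pool of worker threads alive and wake them once per notification, which avoids the thread start-up cost in the time-critical loop.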

The following list gives information and comments on the different aspects of the implementation found in the results section:

  • Segmentsize: the number of samples for one data segment that will be acquired after receiving a trigger event.
  • Averages: the number of averages (summations) that are performed until a segment is stored and the average process is restarted.
  • Notifysize: the amount of data after which an interrupt is generated by the PC hardware. This notifysize defines the pace of the complete average loop. If the notifysize is larger than the segmentsize, multiple segments are summed on one interrupt. This reduces the overhead for thread communication and interrupt handling.
  • Buffersize: the overall target buffer in memory for the DMA transfer. In our example the buffer is a fixed size of 16 times the notifysize.
  • Triggerrate: the repetition rate of the external signal generator. In the results we show the maximum achieved triggerrate without filling up (overflowing) the buffers.
  • Threads: to speed up the summation process we parallelized this task by splitting the summation into a number of different software threads, as shown in the source code excerpt. If Threads is shown as zero ("-" in the results table), the summation process does not use threading but runs directly inline in a loop.
  • CPU Load: as the average process is done in software, the CPU(s) need to do all the work. Luckily, modern CPUs consist of multiple cores, allowing an easy way to share working tasks between them.
  • SSE/SSE2 commands: at first glance these commands seem to be perfectly suited to parallelize the summation process and speed up the software without the need for any thread-based programming. Unfortunately, the SSE command set is all based on data of the same type. As the acquired data is 8 bit wide and the average buffer is 32 bit wide, this is not a solution that can be used here.
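The relation between segmentsize, notifysize and buffersize described in the list above can be written down as a small helper. This is a sketch only: the names `StreamLayout` and `makeLayout` are hypothetical, one byte per sample (8 bit, one channel) is taken from the test setup, and the notifysize is assumed to be a whole multiple of the segment size.

```cpp
#include <cassert>
#include <cstddef>

struct StreamLayout {
    std::size_t segmentsPerNotify; // segments summed per interrupt
    std::size_t bufferBytes;       // DMA target buffer in PC memory
};

// Hypothetical helper: derives the streaming layout used in the tests.
StreamLayout makeLayout(std::size_t segmentSamples, std::size_t notifyBytes)
{
    const std::size_t bytesPerSample = 1; // 8-bit samples, one channel
    const std::size_t segmentBytes = segmentSamples * bytesPerSample;
    // Assumption of this sketch: notifysize is a multiple of the segment.
    assert(segmentBytes > 0 && notifyBytes % segmentBytes == 0);
    // The buffer is fixed at 16 times the notifysize, as stated above.
    return { notifyBytes / segmentBytes, 16 * notifyBytes };
}
```

With a segmentsize of 256 kSamples and a notifysize of 1 MByte, for example, four segments are summed per interrupt and the DMA buffer is 16 MByte.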


All measurements are made with a digitizer using 1 channel sampling at 5 GS/s, with 8 bit resolution and an external trigger. The table also lists different program settings to show the result differences. The best result for each segmentsize is marked yellow in the table.

Samplerate | Segmentsize  | Averages | Notifysize | Mode     | Threads | Max. triggerrate | CPU Load
-----------|--------------|----------|------------|----------|---------|------------------|---------
5 GS/s     | 32 kSamples  | 1000     | 1 MByte    | Hardware | -       | 150 kHz          | < 5%
5 GS/s     | 128 kSamples | 1000     | 1 MByte    | Hardware | -       | 38 kHz           | < 5%
5 GS/s     | 256 kSamples | 1000     | 256 kByte  | Software | 2       | 10.3 kHz         | 25%
5 GS/s     | 256 kSamples | 1000     | 1 MByte    | Software | 2       | 12.6 kHz         | 17%
5 GS/s     | 256 kSamples | 1000     | 1 MByte    | Software | 4       | 12.8 kHz         | 16%
5 GS/s     | 256 kSamples | 1000     | 1 MByte    | Software | -       | 6.4 kHz          | 14%
5 GS/s     | 512 kSamples | 1000     | 512 kByte  | Software | 2       | 5.9 kHz          | 25%
5 GS/s     | 512 kSamples | 1000     | 512 kByte  | Software | 4       | 6.0 kHz          | 29%
5 GS/s     | 512 kSamples | 1000     | 1 MByte    | Software | 4       | 6.4 kHz          | 23%
5 GS/s     | 512 kSamples | 1000     | 2 MByte    | Software | 4       | 6.4 kHz          | 23%
5 GS/s     | 512 kSamples | 1000     | 8 MByte    | Software | 4       | 6.4 kHz          | 14%
5 GS/s     | 512 kSamples | 1000     | 8 MByte    | Software | -       | 3.4 kHz          | 14%
5 GS/s     | 1 MSamples   | 1000     | 1 MByte    | Software | -       | 1.5 kHz          | 16%
5 GS/s     | 1 MSamples   | 1000     | 1 MByte    | Software | 2       | 2.9 kHz          | 24%
5 GS/s     | 1 MSamples   | 1000     | 1 MByte    | Software | 4       | 2.9 kHz          | 23%
5 GS/s     | 1 MSamples   | 100      | 1 MByte    | Software | 4       | 2.9 kHz          | 30%
5 GS/s     | 1 MSamples   | 10000    | 1 MByte    | Software | 4       | 2.9 kHz          | 23%
5 GS/s     | 2 MSamples   | 1000     | 2 MByte    | Software | -       | 0.7 kHz          | 14%
5 GS/s     | 2 MSamples   | 1000     | 2 MByte    | Software | 4       | 1.3 kHz          | 40%
5 GS/s     | 4 MSamples   | 1000     | 4 MByte    | Software | -       | 340 Hz           | 15%
5 GS/s     | 4 MSamples   | 1000     | 4 MByte    | Software | 2       | 410 Hz           | 24%
5 GS/s     | 4 MSamples   | 1000     | 4 MByte    | Software | 4       | 390 Hz           | 50%
5 GS/s     | 8 MSamples   | 1000     | 8 MByte    | Software | -       | 160 Hz           | 14%
5 GS/s     | 8 MSamples   | 1000     | 8 MByte    | Software | 2       | 190 Hz           | 35%


As the above results show, block averaging in software can be used to increase the maximum segmentsize as long as the repetition rate doesn't get too high. Thanks to the high-speed data transfer rates of the PCIe bus, much longer acquisitions can be averaged, overcoming one of the main limitations of FPGA-based averaging processes. For situations where extremely high repetition rates need to be managed, hardware block averaging will still be the best choice.

The above test program is free to use for your own tests or as a base for implementation in other software programs.

The best performance is reached when using a notifysize of 1 MByte. The number of averages that is performed does not have any visible impact on the test results. The time used to copy the result segment and to clear the result buffer is irrelevant compared to the sample summation.
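To put the measured trigger rates in relation to the PCIe streaming speed, the sustained data rate can be estimated as segmentsize times triggerrate. The helper below is a sketch for 8 bit single-channel data (one byte per sample); the function name `sustainedBytesPerSecond` is hypothetical, and interpreting "kSamples" and "MSamples" as multiples of 1024 is an assumption.

```cpp
#include <cassert>

// Hypothetical helper: sustained data rate in bytes per second for
// 8-bit single-channel data (one byte per sample).
double sustainedBytesPerSecond(double segmentSamples, double triggerRateHz)
{
    assert(segmentSamples > 0.0 && triggerRateHz >= 0.0);
    return segmentSamples * triggerRateHz;
}
```

Under these assumptions, 256 kSamples at 12.8 kHz corresponds to about 3.36 GByte/s, which is close to the streaming speed of around 3.4 GByte/s quoted for the test system. 1 MSample at 2.9 kHz gives about 3.04 GByte/s, and 8 MSamples at 190 Hz about 1.59 GByte/s.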

As the complete data handling and summation process does not differ when acquiring multiple channels, the result can simply be re-calculated for other channel combinations. The following settings will all result in exactly the same maximum triggerrate:

  • 1 channel 5 GS/s @ segmentsize
  • 2 channels 2.5 GS/s @ segmentsize/2
  • 4 channels 1.25 GS/s @ segmentsize/4

Reducing the sampling speed for one channel to 2.5 GS/s makes it possible to run that channel at the maximum theoretical trigger rate, because the acquisition itself then becomes the limit rather than the software averaging. For a segmentsize of 1 MSample plus the 160 samples of dead time, the theoretical maximum triggerrate is:

(2.5 GS/s) / (1 MSample + 160) = 2.38 kHz

This is below the measured maximum of 2.9 kHz @ 5 GS/s, so at 2.5 GS/s the software averaging can keep up with every trigger the card is able to accept.
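The estimate above can be reproduced with a small helper. This is a sketch with a hypothetical function name, `maxTriggerRateHz`; it takes 1 MSample as 1048576 samples, which is an assumption that matches the 2.38 kHz result.

```cpp
#include <cassert>

// Hypothetical helper: highest trigger rate at which back-to-back segments
// (segment length plus dead time, both in samples) can still be acquired.
double maxTriggerRateHz(double sampleRateHz, double segmentSamples,
                        double deadTimeSamples)
{
    assert(sampleRateHz > 0.0 && segmentSamples + deadTimeSamples > 0.0);
    return sampleRateHz / (segmentSamples + deadTimeSamples);
}
```

Calling `maxTriggerRateHz(2.5e9, 1048576.0, 160.0)` yields a little under 2.4 kHz, in line with the figure given above.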