Introduction
Finite impulse response (FIR) and infinite impulse response (IIR) filters are both commonly used digital signal processing algorithms, especially in audio processing applications. In a typical audio system, a significant portion of the processor core’s time is therefore devoted to FIR and IIR filtering. The on-chip FIR and IIR hardware accelerators on the digital signal processor are called FIRA and IIRA, respectively. These hardware accelerators can take over the FIR and IIR processing tasks and free the core for other processing. In this article, we explore how these accelerators can be used in practice, with the help of different usage models and real-time test examples.
Figure 1. Block diagram of FIRA and IIRA systems.
Figure 1 shows a simplified block diagram of FIRA and IIRA and how they interact with the rest of the processor system and resources.
● Both FIRA and IIRA modules mainly contain a calculation engine (multiply-accumulate (MAC) unit) and a small local data and coefficient RAM.
● To begin FIRA/IIRA processing, the core initializes a chain of DMA transfer control blocks (TCBs) in processor memory with channel-specific information. The starting address of this TCB chain is then written into the FIRA/IIRA chain pointer register, and the FIRA/IIRA control register is configured to start accelerator processing. Once the accelerator has finished processing all channels, an interrupt is sent to the core so that the core can use the processed output for subsequent operations.
● In theory, the best approach is to offload all FIR and/or IIR tasks from the core to the accelerator and let the core do other work at the same time. In practice, this is not always possible, especially when the core needs the accelerator output for further processing and has no other independent tasks to run concurrently. In that case, we need to choose an appropriate accelerator usage model to achieve the best results.
In this article, we will discuss various models for taking advantage of these accelerators for different application scenarios.
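The TCB chain setup described above can be modeled in a few lines. This is a minimal sketch in plain Python, not driver code: the `TCB` fields, `build_tcb_chain`, and `run_accelerator` are hypothetical names chosen for illustration and do not reflect the actual TCB layout or register interface of the processor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TCB:
    """Hypothetical per-channel transfer control block (not the real layout)."""
    channel: int
    taps: int                      # filter length for this channel
    block_size: int                # samples per frame
    next: Optional["TCB"] = None   # chain pointer to the next channel's TCB


def build_tcb_chain(taps_per_channel: List[int], block_size: int) -> Optional[TCB]:
    """Core side: create one TCB per channel and link them into a chain."""
    head: Optional[TCB] = None
    prev: Optional[TCB] = None
    for channel, taps in enumerate(taps_per_channel):
        tcb = TCB(channel=channel, taps=taps, block_size=block_size)
        if prev is None:
            head = tcb             # this address would go into the chain pointer register
        else:
            prev.next = tcb
        prev = tcb
    return head


def run_accelerator(chain_pointer: Optional[TCB]) -> List[int]:
    """Accelerator side (modeled): walk the chain and process each channel.

    Returning stands in for the interrupt raised after the last channel.
    """
    processed = []
    tcb = chain_pointer
    while tcb is not None:
        processed.append(tcb.channel)
        tcb = tcb.next
    return processed
```

For example, `run_accelerator(build_tcb_chain([64, 128, 256], 256))` visits channels 0, 1, and 2 in chain order before signalling completion.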
Using FIRA and IIRA in real time
Figure 2. Typical real-time audio data flow.
Figure 2 shows a typical real-time PCM audio data flow diagram. A frame of digitized PCM audio data is received through the synchronous serial port (SPORT) and sent to memory through direct memory access (DMA). While continuing to receive frame N+1, frame N is processed by the core and/or the accelerator, and the output of the previously processed frame (N-1) is sent to the DAC via SPORT for digital-to-analog conversion.
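The overlap of receiving, processing, and transmitting can be illustrated with a small bookkeeping sketch. It assumes a rotating set of three frame buffers, which is an assumption made here for illustration (the article does not specify a buffering scheme), and `frame_slots` is a hypothetical helper name.

```python
def frame_slots(n, num_buffers=3):
    """For the frame N currently being processed, return which frame each
    stage handles and which buffer (assumed rotation) holds it.

    Each value is a (frame_number, buffer_index) pair.
    """
    return {
        "dma_in": (n + 1, (n + 1) % num_buffers),   # frame N+1 is being received
        "process": (n, n % num_buffers),            # frame N is being processed
        "dma_out": (n - 1, (n - 1) % num_buffers),  # frame N-1 is sent to the DAC
    }
```

With three buffers, the three stages always touch distinct buffers, so none of them overwrites data another stage is still using.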
Accelerator usage models
As mentioned earlier, depending on the application, the accelerator may need to be used in different ways to maximize the offloading of FIR and/or IIR processing tasks and to save as many core cycles as possible for other operations. At a high level, accelerator usage models fall into three categories: direct replacement, split task, and data pipeline.
Direct replacement
● The core's FIR and/or IIR processing is directly replaced by the accelerator, and the core simply waits for the accelerator to complete the task.
● This model is effective only if the accelerator processes faster than the core, as when using the FIRA block.
Split task
● FIR and/or IIR processing tasks are divided between the core and the accelerator.
● This model is especially useful when multiple channels can be processed in parallel.
● Based on a rough timing estimate, the total number of channels is divided between the core and the accelerator so that both finish their tasks at roughly the same time.
● As shown in Figure 3, this usage model saves more core cycles than the direct replacement model.
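The channel distribution in the split task model can be sketched as a small search over the possible splits. This is one simple way to do it, assuming every channel costs the same on a given engine; `split_channels` and the per-channel timings used below are hypothetical illustration values, not measurements from the article.

```python
def split_channels(total_channels, core_time_per_channel, accel_time_per_channel):
    """Pick the core/accelerator channel split that finishes earliest.

    Both engines run in parallel, so the finish time of a split is the
    slower of the two. Returns (core_channels, accel_channels, finish_time).
    """
    best = None
    for n_core in range(total_channels + 1):
        n_accel = total_channels - n_core
        finish = max(n_core * core_time_per_channel,
                     n_accel * accel_time_per_channel)
        if best is None or finish < best[2]:
            best = (n_core, n_accel, finish)
    return best
```

With 12 channels and assumed per-channel costs of 7 time units on the core and 5 on the accelerator, `split_channels(12, 7.0, 5.0)` gives 5 channels to the core and 7 to the accelerator, with both finishing after 35 time units.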
Data pipeline
● The data flow between the core and the accelerator can be pipelined so that the two work in parallel on different data frames.
● As shown in Figure 3, the core processes the Nth frame and then starts the accelerator on that frame. In parallel, the core goes on to further process the output of frame N-1, which the accelerator produced in the previous iteration. This sequence allows the FIR and/or IIR processing tasks to be completely offloaded to the accelerator, at the cost of some delay in the output.
● Both pipeline stages and output latency may increase, depending on the number of such FIR and/or IIR processing stages in the complete processing chain.
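The one-frame delay of the data pipeline model can be seen in a short simulation. This is a sequential sketch under simplifying assumptions: the accelerator is modeled as an ordinary function call, and `run_pipeline`, `core_pre`, `accel_filter`, and `core_post` are hypothetical names standing in for the core's pre-processing, the accelerator's filtering, and the core's post-processing.

```python
def run_pipeline(frames, core_pre, accel_filter, core_post):
    """Simulate the data pipeline model with a single accelerator stage.

    In iteration N the core pre-processes frame N and hands it to the
    accelerator, then post-processes the accelerator output of frame N-1.
    The first iteration has no earlier output, which shows up as None.
    """
    outputs = []
    previous_accel_output = None
    for frame in frames:
        pre = core_pre(frame)                  # core works on frame N
        current_accel_output = accel_filter(pre)  # accelerator started on frame N
        if previous_accel_output is None:
            outputs.append(None)               # pipeline not yet filled
        else:
            # core continues with frame N-1 while the accelerator is busy
            outputs.append(core_post(previous_accel_output))
        previous_accel_output = current_accel_output
    return outputs
```

Each output slot carries the result of the previous frame, which is the one-frame latency the model trades for full offloading. Adding further accelerator stages to the chain would add further slots of delay in the same way.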
Figure 3 illustrates how frames of audio data move through the three phases (DMA IN, core/accelerator processing, and DMA OUT) under the different accelerator usage models. It also shows how offloading all or part of the FIR/IIR processing to the accelerator increases the core's idle cycles compared to the core-only model.
Figure 3. Accelerator usage model comparison.
FIRA and IIRA on SHARC processors
The following ADI SHARC® processor families support on-chip FIRA and IIRA (listed from oldest to newest).
● ADSP-214xx (e.g., ADSP-21489)
● ADSP-SC58x
● ADSP-SC57x/ADSP-2157x
● ADSP-2156x
Across these processor families:
● The computation speed differs.
● The basic programming model remains the same, with the exception of Auto Configuration Mode (ACM) on the ADSP-2156x processors.
● FIRA has four MAC units, while IIRA has only one MAC unit.
FIRA/IIRA Improvements for ADSP-2156x
The ADSP-2156x is the latest in the SHARC processor family. It is the first single-core 1 GHz SHARC processor, and its FIRA and IIRA also run at 1 GHz. FIRA and IIRA on the ADSP-2156x processors feature several improvements over those on their predecessors, the ADSP-SC58x/ADSP-SC57x processors.
Performance improvements
● Computation is 8 times faster (the accelerator clock moves from SCLK at 125 MHz to CCLK at 1 GHz).
● Data and MMR access latency between the core and the accelerator is reduced, because the two are more tightly integrated through a dedicated core fabric.
Functional improvements
● Added ACM support to minimize the core intervention required for accelerator processing. This mode mainly adds the following new features:
  ● Allows the accelerator to be paused for dynamic task queuing.
  ● Imposes no channel limit.
  ● Supports trigger generation (master) and trigger wait (slave).
  ● Generates selective interrupts for each channel.
Experimental results
In this section, we discuss the results of implementing two real-time multichannel FIR/IIR use cases with the different accelerator usage models on the ADSP-2156x evaluation board.
Use case 1
Figure 4 shows the block diagram of use case 1. The sample rate is 48 kHz, the block size is 256 samples, and the core-to-accelerator channel ratio used in the split-task model is 5:7.
Table 1 shows the measured core and FIRA MIPS, along with the core MIPS saved relative to the core-only model. The table also shows the additional output latency introduced by each usage model. As we can see, using the accelerator with the data pipeline usage model saves up to 335 core MIPS, at the cost of 1 block (5.33 ms) of output latency. The direct replacement and split task usage models save 98 MIPS and 189 MIPS, respectively, without adding any output latency.
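The 5.33 ms figure follows directly from the block size and sample rate, since one block of latency is the time it takes to receive one block of samples. A minimal check, with `block_latency_ms` as a hypothetical helper name:

```python
def block_latency_ms(block_size, sample_rate_hz):
    """Duration of one block of samples, in milliseconds."""
    return 1000.0 * block_size / sample_rate_hz


# Use case 1: 256 samples at 48 kHz, rounded to two decimals
latency_ms = round(block_latency_ms(256, 48000), 2)
```

Here `latency_ms` evaluates to 5.33. The same calculation for a 128-sample block at 48 kHz gives 2.67 ms.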
Figure 4. Use case 1 block diagram.
Table 1. Core and FIRA/IIRA MIPS summary for use case 1
Use case 2
Figure 5 shows the block diagram of use case 2. The sample rate is 48 kHz, the block size is 128 samples, and the core-to-accelerator channel ratio used in the split task model is 1:1.
As with Table 1, Table 2 shows the results for this use case. Using the accelerator with the data pipeline usage model saves up to 490 core MIPS, at the cost of 1 block (2.67 ms) of output latency. The split task usage model saves 234 core MIPS without adding any output latency. Note that, unlike in use case 1, the core in use case 2 uses frequency-domain (fast convolution) processing instead of time-domain processing. As a result, the core needs fewer MIPS than FIRA for the same processing, which is why the direct replacement usage model yields negative core MIPS savings.
Figure 5. Use case 2 block diagram.
Table 2. Core and FIRA/IIRA MIPS summary for use case 2
Conclusion
In this article, we have seen how the different accelerator usage models can be used to meet the desired MIPS and processing targets by offloading a large number of core MIPS to the FIRA and IIRA accelerators on the ADSP-2156x processors.