

# High Level Design languages for Intel® FPGAs CNRS DAQ Seminar – Fréjus, November 2018

francisco.perez@intel.com

 Introduce the high level design coding tools available for Intel® FPGAs to increase the abstraction level and boost your productivity

• Make the FPGAs more "friendly" to software programmers.

• We will cover OpenCL and HLS (High Level Synthesis).

**OBJECTIVES** 



# HLS vs OpenCL<sup>™</sup>

| HLS                                                             | OpenCL                                                                                                        |
|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| Quickens design of blocks to fit into a traditional FPGA design | Quickens the development of a kernel<br>to fit into a <b>system</b> with an FPGA<br>accelerator card and host |
| Uses the C/C++ programming<br>language for design of components | Uses the kernel C (similar to C) for kernel design, and a host API for interaction with the host              |

Meant to help you design a block to fit into a traditional FPGA design

Meant to help you create a FPGA accelerator to fit into an OpenCL compliant system





# OpenCL<sup>TM</sup> on FPGAs for Software Programmers CNRS DAQ Seminar – Frejus, November 2018



\*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos

**OBJECTIVES** 

- Introduce the concept of OpenCL<sup>™</sup> for heterogeneous programming using Intel® FPGAs
- Understand how to develop kernels and how they are executed on Intel® FPGAs
- Know which are the singular features of OpenCL<sup>™</sup> applied to Intel® FPGAs

# **Class Agenda**

Types of Parallel Computing Intro to OpenCL™ for Intel FPGAs OpenCL™ Platform model and Host-side Software Executing OpenCL Kernels The Intel FPGA SDK for OpenCL

# **Class Agenda**

## **Types of Parallel Computing**

Intro to OpenCL for Intel FPGAs

OpenCL<sup>™</sup> Platform model and Host-side Software

#### Executing OpenCL Kernels

Writing & compiling kernels Launching kernels Harnessing Pipeline parallelism

### The Intel FPGA SDK for OpenCL

SDK components Debug Tools FPGA-specific Features (channels, libraries)

# **Parallel Computing**

"A form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (in parallel)"

~ Highly Parallel Computing, Amasi/Gottlieb (1989)



# Types of Parallelism

Data Parallelism

Task Parallelism

**Pipeline Parallelism** 



# **Data Parallelism**

Input data separated and sent to parallel resources, results recombined

- Same operation(s) applied across different data in parallel
- Single Program Multiple Data (SPMD)
- Single Instruction Multiple Data (SIMD)



# **Task Parallelism**

Decompose problem into sub-problems (tasks). Divide and conquer

- Tasks operate on same or different data
- Example: Multi-CPU system where each CPU execute a different thread
- A.K.A. Simultaneous Multithreading (SMT), Thread/Function Parallelism





# **Pipeline Parallelism**

Task parallelism where tasks have a producer consumer relationship

- Operates on pipelined data
  - Different tasks operate in parallel on different data
- Example
  - Task1 FFT, Task 2 Frequency Filter, Task3-Inverse FFT





# Heterogeneous Computing Systems

Modern systems contain more than one kind of processor

- Applications exhibit different behaviors
  - Control intensive (Searching, parsing, etc...)
  - Data intensive (Image processing, data mining, etc...)
  - Compute intensive (Iterative methods, financial modeling, etc...)
- Gain performance by using specialized capabilities of different types of processors



13

# Example Heterogeneous System

Modern computing platform contains many dissimilar processors

- Multi-core, general purpose, central processing units (CPUs)
- Digital Signal Processing (DSPs) processors
- Graphics Processing units (GPUs)
- Field Programmable Gate Arrays (FPGAs)



### Challenge: How to build a software ecosystem for a heterogeneous platform?

14

# Traditional Approach to Heterogeneous Computing

- Write software for each software programmable architecture CPU, GPU, DSP
  - Using different languages and vendor specific tools

- Develop custom parallel hardware for FPGA
  - Fine-grained parallelism
  - Write HDL
  - Simulation, timing closure, on-chip verification etc.







# **Class Agenda**

Heterogeneous Parallel Computing

## Intro to OpenCL for Intel FPGAs

OpenCL<sup>™</sup> Platform model and Host-side Software

### Executing OpenCL Kernels

Writing & compiling kernels Launching kernels Harnessing Pipeline parallelism

### The Intel FPGA SDK for OpenCL

SDK components Debug Tools FPGA-specific Features (channels, libraries)

# What is OpenCL™?

- Open Computing Language (OpenCL<sup>™</sup>) Framework for heterogeneous computing
  - General purpose programming model for multiple platforms
  - Host API and kernel language
  - Low-level Programming language based on C/C++
  - Provides increased performance with hardware acceleration
- Open, royalty-free standard
  - Managed by Khronos\* Group
    - Intel<sup>®</sup> is an active member
  - <u>http://www.khronos.org</u>









'inte

# Two Sides of OpenCL<sup>™</sup> Standard

- Kernel Function
  - OpenCL<sup>™</sup> C
  - Software that runs on accelerators (OpenCL devices)
  - Usually used for computationally intensive tasks
- Host Program
  - Software running conventional microprocessor
  - Supports efficient plumbing of complicated concurrent programs with low overhead
    - Through OpenCL host API

### Used together to efficiently implement algorithms









# Mapping OpenCL Programs





# **Traditional FPGA Design Flow**



# FPGA High Level Design with OpenCL<sup>™</sup>

## Goal: Design FPGA custom hardware with C-based software language

kernel void foo ( global float \*x) { int i ...

Intel<sup>®</sup> FPGA SDK for OpenCL<sup>®®</sup>



- Benefits
  - Makes FPGA acceleration available to software engineers
  - Debug and optimize in a software-like environment
  - Significant productivity gains compared to hardware-centric flow
  - Easier to perform design exploration
  - Abstracts away FPGA design flow and FPGA hardware



# Compiling OpenCL<sup>™</sup> to Intel<sup>®</sup> FPGA

- Custom hardware generated automatically for each kernel
  - Get the advantages of the FPGA without the lengthy design process
- Organized into functional units based on operation
- Able to execute OpenCL<sup>™</sup> threads in parallel





# **Class Agenda**

Heterogeneous Parallel Computing

Intro to OpenCL for Intel FPGAs

## OpenCL<sup>™</sup> Platform model and Host-side Software

### Executing OpenCL Kernels

Writing & compiling kernels Launching kernels Harnessing Pipeline parallelism

### The Intel FPGA SDK for OpenCL

SDK components Debug Tools FPGA-specific Features (channels, libraries)

# OpenCL<sup>™</sup> Platform Model

- One Host with one or more OpenCL<sup>™</sup> Devices
  - Each Device is composed of one or more compute units
- Memory divided into Host Memory and various types of Device Memory





## **Heterogeneous Platform Model**





# Intel<sup>®</sup> FPGA OpenCL<sup>™</sup> Device

Each device is made of many independent compute units

Each compute unit is custom built from kernel code



intel

# OpenCL<sup>™</sup> Platform Layer and Runtime Layer API

OpenCL<sup>™</sup> API divided into two layers

- Platform Layer API
  - Discover platform and device capabilities
  - Setup execution environment
- Runtime Layer API
  - Executes compute kernels on devices
  - Manage device memory



# **Platform Layer API**

Setup device execution environment

- Necessary to allow for heterogeneous environments and multiple devices

Tasks

- Allows host to discover devices and capabilities
- Query, select and initialize compute devices
- Create compute contexts to manage OpenCL<sup>™</sup> objects

### **Typical Platform Layer Steps**

- 1. Query platforms
- 2. Query devices
- 3. Create a context for the devices

# Context

Abstract containers that manage host device interaction

- Purpose
  - Coordinates the mechanisms for host-device interaction
  - Manages the device memory
  - Keeps track of kernels to be executed on each device





# **Runtime Layer API**

Execute kernels on the device

- Tasks
  - Memory management
  - Run kernels on the device
  - Host/device synchronization

## **Typical Runtime Layer Steps**

- 1. Create a command queue
- 2. Write to the device
- 3. Launch kernel
- 4. Read results back from the device



# **Command Queue**

Mechanism for host to request action by the device

- Each command queue associated with one device
  - Each device can have one or more command queues
- Host submits commands to the appropriate queue
- Operations in the queue will execute in-order for Intel<sup>®</sup> FPGAs

 Device
 Write to Device

 Device
 Execute Kernel

 Read from Device



# Host / Device Physical Memory Space

- The host and the device each has its own physical memory space
  - Data needs to be physically located on a device before kernel execution
- Use OpenCL<sup>™</sup> API functions to allocate, transfer, and free device memory
  - Using memory objects through command queues





# **Data Transfers Calls**

Use Read and Write Host API calls to explicitly transfer data from/to the device

- Commands placed on the command queue
- If kernel dependent on the buffer is executed on the accelerator device, buffer is transferred to the device
- Runtime determines precise timing of data movement





# Terminology

## OpenCL / Poker table

| Host      | -> Card dealer  |
|-----------|-----------------|
| Context   | -> Table        |
| device    | -> Player       |
| cmd queue | -> player hands |
| kernel    | -> card         |
|           |                 |





# How they interact.....

Dealer sits at the card table and determine the player The host selects the devices and places them in a context

The dealer selects cards from the deck and deals them (in hand) Host select kernels from program, add them on the cmd\_queue

Each player looks at their hand and decides what to do Each device process kernel from the device queue

Dealer respond to host during the game Host receive events from the devices and invokes event-handling queue

The dealer look at player and decide who won Once all kernel are done the host receive the results



# **Class Agenda**

Heterogeneous Parallel Computing

Intro to OpenCL for Intel FPGAs

OpenCL<sup>™</sup> Platform model and Host-side Software

## Executing OpenCL Kernels

### Writing & compiling kernels

Launching kernels Harnessing Pipeline parallelism

#### The Intel FPGA SDK for OpenCL

SDK components Debug Tools FPGA-specific Features (channels, libraries)

#### OpenCL<sup>™</sup> Kernels

Functions that run on OpenCL<sup>™</sup> devices

- Begins with the keyword \_\_\_\_kernel
- Returns void
- Pointers in kernels should be qualified with an address space
  - \_\_private, \_\_local, \_\_global, **or** \_\_constant
- Kernel language derived from ISO C99 with certain restrictions

\_kernel void my kernel (\_\_global float \*data) {



#### Kernel Example

```
kernel void my kernel ( global float *a,
                          global float *b,
                          global float *c,
                          int N)
   int index;
   for (index = 0; index < N; index++)
       c[index] = a[index] + b[index];
}
```



# Compiling OpenCL<sup>™</sup> Kernel to FPGAs

Kernels are compiled offline using an Offline Compiler (AOC)

- Kernels are first translated into an AOC Object file (.aoco)
  - Represents the FPGA hardware system
- Object file used to generate the AOC Executable file (.aocx)
  - Used to program the FPGA or Flash







### OpenCL<sup>™</sup> Kernels to Dataflow Circuits

Each kernel is converted into custom dataflow hardware (Compute Unit)

- Gain the benefits of FPGAs without the lengthy design process
- Implement C operators as circuits
  - HDL code located in <SDK Installation>/ip
  - Load Store units to read/write memory
  - Arithmetic units to perform calculations
  - Flow control units
  - Connect circuits according to data flow in the kernel
- May replicate circuit to accelerate algorithm

S S Isu\_basic\_coalescer.v Isu\_burst\_master.v Isu\_bursting\_load\_stores.v Isu\_enabled.v Isu\_ic\_top.v Isu\_non\_aligned\_write.v Isu\_pipelined.v Isu\_prefetch\_block.v

Vame





'inte

### **Compilation Example**

Kernel compiled into dataflow circuit with flow control

Includes branch and merge units



For Entry

#### Altera's OpenCL Flow Intel's OpenCL SDK for FPGA takes a system level view



- Board Support Package (BSP)
  - "Chassis" to hold the newly created kernel
- Kernel Compiler
  - Optimized pipelines from C
- System Integrator
  - Merge all together and generate partial reconfiguration files for FPGA



#### FPGA Architecture for OpenCL<sup>™</sup> Implementation





íntel

#### **Partial Reconfiguration**

- Reconfigures part of the FPGA while others continues operation
- Every aocx file represent a set of concurrent OpenCL kernels
- Allows kernels to be swapped while maintaining host-device communication





#### Intel FPGA Preferred Board for OpenCL

- Intel<sup>®</sup> FPGA Preferred Board for OpenCL<sup>™</sup>
  - Available for purchase from preferred partners and Intel
  - Passes conformance testing



- Download and install Intel FPGA OpenCL compatible BSP from vendor
  - Supplies board information required by the offline compiler
  - Provides software layer necessary to interact with the host code including drivers





#### **Class Agenda**

Heterogeneous Parallel Computing

Intro to OpenCL for Intel FPGAs

OpenCL<sup>™</sup> Platform model and Host-side Software

#### **Executing OpenCL Kernels**

Writing & compiling kernels Launching kernels Harnessing Pipeline parallelism

#### The Intel FPGA SDK for OpenCL

SDK components Debug Tools FPGA-specific Features (channels, libraries)

#### OpenCL<sup>™</sup> Execution Flow Device





#### **Class Agenda**

Heterogeneous Parallel Computing

Intro to OpenCL for Intel FPGAs

OpenCL<sup>™</sup> Platform model and Host-side Software

#### **Executing OpenCL Kernels**

Writing & compiling kernels Launching kernels Harnessing Pipeline parallelism

#### The Intel FPGA SDK for OpenCL

SDK components Debug Tools FPGA-specific Features (channels, libraries)

### Mapping Multithreaded Kernels to FPGAs

The most simple way of mapping kernel functions to FPGAs is to replicate the unrolled hardware for each thread

- Inefficient and wasteful

Better method involves taking advantage of pipeline parallelism

- Attempt to create a deeply pipelined representation of a kernel
- On each clock cycle, we attempt to send in input data for a new thread
- Method of mapping coarse grained thread parallelism to fine-grained FPGA parallelism















On each cycle the portions of the datapath are processing different threads

While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored









#### Silicon used efficiently at steady-state



#### **Execution of Threads on FPGA**

Better method involves taking advantage of *pipeline parallelism* 

Throughput = 1 thread per cycle





#### **Pipeline parallelism execution**

- A typical OpenCL<sup>™</sup> kernel can have hundreds of pipeline stages
- This means, hundreds of simultaneous in-flight threads executing on the kernel.
- Very efficient usage of the inferred Hardware for maximize throughput



### **Class Agenda**

Heterogeneous Parallel Computing

Intro to OpenCL for Intel FPGAs

OpenCL<sup>™</sup> Platform model and Host-side Software

Executing OpenCL Kernels Writing & compiling kernels Launching kernels Harnessing Pipeline parallelism

#### The Intel FPGA SDK for OpenCL

#### SDK components

Debug Tools FPGA-specific Features

# **SDK Components**

- Offline Compiler (AOC)
  - Translates your OpenCL<sup>™</sup> C kernel source file into an Intel<sup>®</sup> FPGA hardware image
- Host Libraries
  - Provides the OpenCL host API to be used by OpenCL host applications
- AOCL Utility
  - Perform various tasks related to the board, drivers, and compile process

### Intel<sup>®</sup> FPGA SDK for OpenCL<sup>™</sup> Directory Structure

| Directory                                                | Description                                                                                                                                                                                                 |
|----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| bin                                                      | Main compiler and utility executables                                                                                                                                                                       |
| windows64/bin<br>linux64/bin                             | Runtime DLLs and other executables<br>This should be in your path.                                                                                                                                          |
| board                                                    | Design files related to specific supported boards                                                                                                                                                           |
| ір                                                       | IP cores required for kernel compilation                                                                                                                                                                    |
| host                                                     | Files used by the compilation flow for user programs.                                                                                                                                                       |
| host/include                                             | OpenCL <sup>™</sup> API header files, and the interface files used to compile<br>and link a user host program. Add this directory to the include file<br>search path when compiling an OpenCL host program. |
| host/windows64/lib<br>host/linux64/lib<br>host/arm32/lib | The OpenCL host runtime libraries. Add this directory to the library file search path when linking an OpenCL host program.                                                                                  |



# Offline Kernel Compiler (aoc)

#### aoc -board=<my board> <my kernel file>

| Option                          | Description                                                                    |
|---------------------------------|--------------------------------------------------------------------------------|
| -help or -h                     | Help for the tool                                                              |
| -c                              | Creates .aoco object file and sets up a Quartus® Prime hardware design project |
| -rtl                            | Creates .aocr file that links all of the .aoco files                           |
| -board= <board name=""></board> | Compile for the specified board                                                |
| -list-boards                    | Prints a list of available boards                                              |

- Compiles kernels for a specific board defined by a board support package
- Generates aoco, and aocx files
- For detailed info on supported kernel constructs see the Intel<sup>®</sup> FPGA SDK for OpenCL<sup>™</sup> programming Guide

There are many other debugging, optimization, and build options.



# **Compiling the Host Program**

- Include CL/opencl.h or CL/cl.hpp
- Use a conventional C compiler (Visual Studio\*/GCC)
- Add \$INTELFPGASDKROOT/host/include to your file search path
  - Recommended to use aocl compile-config
- Link to Intel<sup>®</sup> FPGA OpenCL<sup>™</sup> libraries
  - Link to libraries located in the \$INTELFPGASDKROOT/host/<OS>/lib directory
    - Recommended to use aocl link-config

main() {
 read\_data( ... );
 manipulate( ... );
 clEnqueueWriteBuffer( ... );
 clEnqueueNDRange(...,sum,...);
 clEnqueueReadBuffer( ... );
 display\_result( ... );





# **AOCL Utility**

| Host Compilation Commands (Use in your makefile)          |                                                                           |  |  |  |  |  |  |
|-----------------------------------------------------------|---------------------------------------------------------------------------|--|--|--|--|--|--|
| aocl compile-config                                       | Displays the compiler flags for compiling your host program               |  |  |  |  |  |  |
| aocl link-config                                          | Shows the link options needed by the host program to link with libraries  |  |  |  |  |  |  |
| aocl makefile                                             | Shows example Makefile fragments for compiling and linking a host program |  |  |  |  |  |  |
| Board Management Commands (Functionality Provided by BSP) |                                                                           |  |  |  |  |  |  |
| aocl install                                              | Installs a board driver onto your host system                             |  |  |  |  |  |  |
| aocl diagnose                                             | Runs the board vendor's test program                                      |  |  |  |  |  |  |
| aocl flash <.aocx>                                        | Programs the on-board flash with the FPGA image over JTAG                 |  |  |  |  |  |  |
| View Kernel Compilation Rep                               | View Kernel Compilation Report                                            |  |  |  |  |  |  |
| aocl report                                               | Displays kernel execution profiler data                                   |  |  |  |  |  |  |

Run aocl help or aocl help <subcommand> for detailed information about the tool



#### **Class Agenda**

Heterogeneous Parallel Computing

Intro to OpenCL for Intel FPGAs

OpenCL<sup>™</sup> Platform model and Host-side Software

Executing OpenCL Kernels Writing & compiling kernels Launching kernels Harnessing Pipeline parallelism

#### The Intel FPGA SDK for OpenCL

SDK components Debug Tools FPGA-specific Features

#### Kernel Development Flow and Tools







Enable kernel functional debug on x86 systems

Quickly generate x86 executables that represent the kernel



- Debug support for
  - Standard OpenCL<sup>™</sup> syntax, Channels, Printf statements



### **HTML Report**

Static report showing optimization, area, and architectural information

- Automatically generated with the object file (acc -rtl)
  - Located in <kernel file folder>\reports\report.html
- Dynamic reference information to original source code
- Loop Analysis Optimization report
  - Information on how loops are implemented
- Area report
  - Detailed FPGA resource utilization by source code or system block
- Architectural viewer
  - Memory access implementation and kernel pipeline information



### HTML Loop Analysis Optimization Report

- Actionable feedback on pipeline status of loops
  - Shows loop carried dependencies and bottlenecks
  - Especially important for single work-item kernels since they have an outer loop
- Shows loop unrolling status
- Shows loop nesting relationship

| Reports View reports+                                                                        |                           |          |            |                         |                                                                 |  |  |  |  |  |  |
|----------------------------------------------------------------------------------------------|---------------------------|----------|------------|-------------------------|-----------------------------------------------------------------|--|--|--|--|--|--|
| Loops analysis                                                                               | Show fully unrolled loops |          |            |                         | summation.cl -                                                  |  |  |  |  |  |  |
|                                                                                              | Pipelined                 | н        | Bottleneck | Details                 | 1kernel void summation (<br>global const float *restrict input, |  |  |  |  |  |  |
| ernel: summation (summation.cl:6)                                                            |                           |          |            | Single work-item exec   | 3global float *restrict output,<br>4 unsigned rows,             |  |  |  |  |  |  |
| summation.B1 (summation.cl:8)                                                                | Yes                       | >=1      | n/a        |                         | 5 unsigned cols)<br>6 - {<br>7 float result = 0;                |  |  |  |  |  |  |
| summation.B2 (summation.cl:10)                                                               | Yes                       | ~1       | n/a        | II is an approximation. | <pre>8 for (unsigned i = 0; i &lt; rows; i++) 9 - {</pre>       |  |  |  |  |  |  |
|                                                                                              |                           |          |            |                         | <pre>10   for (unsigned j = 0; j &lt; cols; j++) 11</pre>       |  |  |  |  |  |  |
| Details                                                                                      |                           |          |            |                         |                                                                 |  |  |  |  |  |  |
| ummation.B2:<br>I is an approximation due to the follo<br>• Load Operation (summation.cl: 12 |                           | instruct | ion:       |                         |                                                                 |  |  |  |  |  |  |

#### HTML Area Report

Generate detailed estimated area utilization report of kernel code

- Detailed breakdown of resources by source line or by system blocks
- Provides architectural details of HW
  - Suggestions to resolve inefficiencies

| eports View rep                                                                 | orts <del>•</del> |             |             |            |           |                  |    |            |                                                            |           |        |       |            | 1 |  |  |  |
|---------------------------------------------------------------------------------|-------------------|-------------|-------------|------------|-----------|------------------|----|------------|------------------------------------------------------------|-----------|--------|-------|------------|---|--|--|--|
| Area analysis of source<br>(area utilization values<br>Notation file:X > file:Y | s are estimat     |             | n line X wa | as inlined | d using c | ode on line Y.   |    | L          | tion.cl<br>kernel void summation (<br>global c<br>global f | onst floa | t *res | trict | •<br>input | × |  |  |  |
|                                                                                 | ALUTs             | FFs         | RAMs        | DSPs       | MLABs     | Details          |    | 1          | unsigned rows,<br>unsigned cols)                           |           |        |       |            |   |  |  |  |
| Static Partition                                                                | 66866 (8%)        | 133600 (8%) | 179 (7%)    | 0 (0%)     | 0 (0%)    |                  |    | 5 - {<br>/ | <pre>float result = 0; for (unsigned i = 0; i</pre>        |           |        |       |            |   |  |  |  |
| Board interface                                                                 | 66866             | 133600      | 179         | 0          | 0         | Platform i       |    | 9 -        | for (unsigned j =                                          |           |        | •)    |            |   |  |  |  |
| ♥ Kernel System                                                                 | 7400 (1%)         | 12859 (1%)  | 61 (2%)     | 4 (0%)     | 17 (0%)   |                  |    | L v        | {<br>result += inpu                                        |           |        | · ·   |            |   |  |  |  |
| Global interconnect                                                             | 2338              | 4125        | 0           | 0          | 0         | For 1 global loa | 13 | 1          | }                                                          |           |        |       |            |   |  |  |  |
| System description ROM                                                          | 0                 | 67          | 2           | 0          | 0         | Contains inform  | 15 | 5 }        | *output = result;                                          |           |        |       |            |   |  |  |  |
| ✓ summation                                                                     | 5062 (1%)         | 8667 (1%)   | 59 (2%)     | 4 (0%)     | 17 (0%)   | 1 compute unit.  | 17 |            |                                                            |           |        |       |            |   |  |  |  |
| Data control overhead                                                           | 110               | 149         | 0           | 0          | 7         | Feedback+Clust   |    |            |                                                            |           |        |       |            |   |  |  |  |
| Function overhead                                                               | 1338              | 2411        | 0           | 0          | 10        | Kernel dispatch  | -  |            |                                                            |           |        |       |            |   |  |  |  |
| Details                                                                         |                   |             |             |            |           |                  |    |            |                                                            |           |        |       |            | × |  |  |  |
| Board interface:                                                                |                   |             |             |            |           |                  |    |            |                                                            |           |        |       |            |   |  |  |  |

## **HTML System Viewer**

- Displays kernel pipeline implementation and memory access implementation
- Visualize
  - Off-chip memory
    - Load-store units
    - Accesses
  - Stalls
  - Latencies
  - On-chip memory
    - Implementation
    - Accesses



## **HTML Kernel Memory Viewer**

Helps you identify data movement bottlenecks in your kernel design. Illustrates:

- Memory replication
- Banking
- Implemented arbitration
- Read/write capabilities of each memory port



#### Reports View reports...-





#### Profiler

- Inserts counters and profiling logic into the HW design
- Dynamically reports the performance of kernels



# **Collecting and Viewing Profile Information**

- Compile kernel with aoc --profile option
  - .source file generated containing source information
- Run host application with generated aocx file
  - Performance counters will collect profile information
  - Host saves a profile.mon monitor description file to working directory
- View statistical data using the profiler GUI
  - Optionally provide .source file to view source code of profiled application

aocl report <kernel file>.aocx profile.mon [<kernel file>.source]



### **Profiler Reports**

- Get runtime information about kernel performance
- Reports bottlenecks, bandwidth, saturation, and pipeline occupancy
  - At data access points

|          | Kernel                    | Total Time: 396.74ms |                                     |                                                                                                    |                                          |          |            |                                           |
|----------|---------------------------|----------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|------------------------------------------|----------|------------|-------------------------------------------|
| Device 0 | matrixMul                 |                      |                                     |                                                                                                    |                                          |          |            |                                           |
| Device 0 | Memory Transfers          |                      | 1                                   |                                                                                                    |                                          |          |            |                                           |
| Device 1 | matrixMul                 |                      | I.                                  |                                                                                                    |                                          |          |            |                                           |
| Device 1 | Memory Transfers          | 1                    | 1                                   |                                                                                                    |                                          |          |            |                                           |
| Device 1 | Memory Copy (from device) |                      | 1                                   |                                                                                                    |                                          |          |            |                                           |
| Device 1 | Memory Copy (to device)   |                      | Source Code Kernel                  | Execution   fft1d   output_kernel   input_kernel                                                   |                                          |          |            |                                           |
|          |                           |                      | File Name                           | Directory                                                                                          |                                          |          |            |                                           |
|          |                           |                      | fft.cl<br>fft.h<br>twid_radic_2_2.h | <pre><pre><pre><pre><pre><pre><pre><pre></pre></pre></pre></pre></pre></pre></pre></pre>           |                                          |          |            |                                           |
| τ        |                           |                      | Line #                              | Source: fft.cl                                                                                     | Attributes                               | Stall%   | Occupancy% | Bandwidth                                 |
|          | -                         |                      | 6 7                                 | #define DEPTH0 _attribute_((depth(0)))                                                             |                                          |          |            |                                           |
|          |                           |                      | 8                                   | channel TYPE DEPTH0 input_stream0, DEPTH0 input_stream1, DEPTH0 input_stream2, DEPTH0 input_stream |                                          |          |            |                                           |
|          |                           |                      | 10                                  | kemel void input_kemel(global TVPE *src) {                                                         |                                          |          |            |                                           |
|          |                           |                      | 12<br>13                            | int i = get_global_id(0);<br>int base = (i >> (LOGN - 2)) << LOGN;                                 |                                          |          |            |                                           |
|          |                           |                      | 14                                  | int offset = i & (N / 4 - 1);<br>write channel intellinput stream0.src[base + offset]);            | 0 disbal (DDR) read                      | 0: 3.13% | 0:100.0%   | 0: 1836.0MB/s, 100.00%Efficiency          |
|          |                           |                      | 16                                  | write channel intel(input stream1,src[base + N / 2 + offset]);                                     | 0:global{DDR}.read<br>0:global{DDR}.read | 0: 6.25% | 0:100.0%   | 0: 1836.0MB/s, 100.00%Efficiency          |
|          |                           |                      | 17                                  | write channel intel(input stream2,src[base + N / 4 + offset]);                                     | 0: _global{DDR},read                     | 0: 9.37% | 0:100.0%   | 0: 1836.0MB/s, 100.00%Efficiency          |
|          |                           |                      | 110                                 | write channel intellinput stream3.src[base + 3 * N / 4 + offset]);                                 | 0: _global{DDR}.read                     | 0:12.5%  | 0:100.0%   | 0: 1836.0MB 0: 1836.0MB/s, 100.00%Efficie |



### **Class Agenda**

Heterogeneous Parallel Computing

Intro to OpenCL for Intel FPGAs

OpenCL<sup>™</sup> Platform model and Host-side Software

Executing OpenCL Kernels Writing & compiling kernels Launching kernels Harnessing Pipeline parallelism

#### The Intel FPGA SDK for OpenCL

SDK components Debug Tools FPGA-specific Features (channels, libraries)

### Traditional OpenCL<sup>™</sup>: Host-Centric Architecture

All communication to/from kernels done through global memory





### Idea: Communication without Global Memory



- Kernel-to-kernel communication done directly on-chip
- IO-to-kernel communication done without the host

### Implementing FIFOs with Channels / Pipes

Use FIFOs instead of global memory for efficient communication to/from kernel compute units



- Supported with Intel® FPGA's channels extension and OpenCL<sup>™</sup> 2.0 pipes
- Works well with applications fitting into the general streaming template
  - E.g. Wireline Processing, Financial HFT Applications, Video Pipelines



### **Channels / Pipe Features**

- Provides FIFO-like communication mechanism
- Each call site is unidirectional



- Allows BSP-specific I/O communication with kernel compute units
- Advantages
  - Leverage internal bandwidth of the FPGA
  - Avoid the bottleneck of using off-chip memory
  - Reduces overall latency by allowing concurrent Kernel execution
  - Reduce storage requirements when data is consumed as it is produced



### Kernel-to-Kernel Channel Performance Gains

- Standard
  - If communication between kernels is required, host forced to launches kernels sequentially
    - Kernel 1 writes to global memory, kernel 2 reads from global memory

Kernel 1 Kernel 2

- With channels
  - Host can launch kernels in parallel
    - kernel 1 writes to channel as kernel 2 reads from it





79

### **IO Channel Performance Gains**

#### Standard

- Data needs to be written to global memory first before kernel can process it and then read back after processing
- Limited by PCIe\* bandwidth and memory throughput



- With IO channels
  - Kernel can run while data flows across network interface
  - System running at speed of network interface





### I/O Channels

- Channels used with input or output features of a board
  - E.g., network interfaces, PCIe interfaces, camera interfaces, etc.
- Behavior defined by the Board Support Package (check board\_spec.xml)

<channels>

<interface name="udp\_0" port="udp0\_out" type="streamsource" width="256" chan\_id="eth0\_in"/>
<interface name="pcie" port="tx" type="streamsink" width="32" chan\_id="pcie\_out" />
</channels>

Declaration of I/O channel using the io attribute

channel QUDPWord udp\_in\_IO \_\_attribute\_\_((io("eth0\_in"))); channel float data \_\_attribute\_\_((io("pcie\_out")));

- Usage same as other channels
  - data = read\_channel\_intel(udp\_in\_IO);



### **OpenCL™** Libraries

Create libraries from RTL or OpenCL<sup>™</sup> source and call those library functions from user OpenCL code

#### Why use RTL modules?

- You want to use optimized and verified RTL modules in OpenCL<sup>™</sup> kernels without rewriting the modules as OpenCL functions
- You want to implement OpenCL kernel functionality that you cannot express effectively in OpenCL



### **OpenCL™** Libraries





### Developing a custom Board Support Package

When you need (or want) to use your own boards with OpenCL

- Framework of host software and FPGA interface design to enable the use of OpenCL<sup>™</sup> on a custom board
- FPGA design, software, and board bring up skills required
- Custom BSP provides
  - Timing-closed Hardware
  - MMD software layer (drivers)
  - Some AOCL utility function



### Working with custom boards

#### Remember the concept of BSP



- Board Support Package (BSP)
  - Initial FPGA image as "Chassis" to hold the newly created kernel



### BSP includes some software stuff as well...





### **References and Documentation**

- Intel<sup>®</sup> FPGA OpenCL collateral
  - <u>https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html</u>
  - Intel FPGA SDK for OpenCL<sup>™</sup> Getting Started Guide
  - Intel FPGA SDK for OpenCL Programming Guide
  - Intel FPGA SDK for OpenCL Best Practices Guide
  - Free Intel FPGA OpenCL Online Trainings
- Khronos\* Group OpenCL Page
- OpenCL Reference Card
  - <u>https://www.khronos.org/files/OpenCLPP12-reference-card.pdf</u>



87

SUMMARY ....

- High-level parallel computing as the way to solve performance bottlenecks problems of your processing systems.
- OpenCL<sup>™</sup> SDK with Intel<sup>®</sup> FPGAs facilitates the adoption of heterogeneous computing.
- We went through the basics of the OpenCL<sup>™</sup> standard and how compile and run OpenCL<sup>™</sup> programs using the available Intel<sup>®</sup> FPGA tools.





# High-Level Synthesis with Intel® FPGAs CNRS DAQ Seminar – Frejus, November 2018

francisco.perez@intel.com

**OBJECTIVES** 

- Understand the concept of high-level synthesis for Intel® FPGAs
- Use the Intel HLS Compiler to synthesize, functionally verify, and simulate design IP for Intel FPGA
- Understand how the component executes on the FPGA





Introduction to high-level synthesis with the Intel® HLS Compiler HLS flow

HLS interfaces for integration in Platform Designer



#### Introduction to high-level synthesis with the Intel® HLS Compiler

HLS flow

HLS interfaces for integration in Platform Designer

### **High Level Synthesis**

Synthesize a C/C++ function in to an RTL implementation

- Develop the component in a software environment
- Functionally verify the component within a software environment
- Seamlessly integrate with hardware simulation environment
- Optimize design using software-centric tools and reports
- Integrate generated IP easily within traditional FPGA design tools



### Intel® HLS Compiler

Accelerated Development

- Untimed C++ to optimized RTL
- Fast functional debug iterations
- Export to Platform Designer IP Library

#### **Optimized Results**

- Increased Fmax with Pipeline insertion
- Increased throughput with Parallelism
- Map to device hardware resources
- Ability to target hard floating-point blocks with Intel FPGAs





### Accelerate FPGA Development Cycles

#### Traditional RTL Design Methodology



#### RTL vs Untimed C++ Functional Verification Times

| Design             | RTL Sim<br>Time | C Sim Time | Acceleration |
|--------------------|-----------------|------------|--------------|
| AES Encryption     | 22 mins         | 46 ms      | 29,000x      |
| Huffman Encoding   | 13 mins         | 52ms       | 15,000x      |
| Optical Flow       | ~2 Days         | 10 seconds | 12,000x      |
| Complex FIR Filter | 4.5 min         | 63 ms      | 4,200x       |

#### Fast Functional iteration Cycle



Benchmark performed using the following hardware & software

Intel® HLS v0.9, ModelSim-SE-64 10.4d, Hardware: 2x8-core Intel Xeon ES-2680 @2.7 GHz, 256 GB RAM

Tests measure performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks



### Automatically Verified RTL

Generated RTL is verified to the original C++ System Model

- New top-level C++ testbench executable is generated that supports ModelSim co-simulation
- Simulation files automatically generated and executed





### Easier Design Reuse Enabled through Abstraction

Easily reuse C++ based IP in multiple projects (building libraries)

- Parametrize with directives
  - Performance
  - Interfaces
  - Memories
- C++ Source easier to modify vs RTL

Generate Library IP for use by Qsys Pro System design environment



97

### **Class Agenda**

Introduction to high-level synthesis with the Intel® HLS Compiler

### HLS flow

HLS interfaces for integration in Platform Designer

### Intel<sup>®</sup> HLS Compiler

- Targets Intel<sup>®</sup> FPGAs
- Command-line executable: i++
- Builds an IP block
  - To be integrated into a traditional FPGA design using FPGA tools



- Leverages standard C/C++ development environment
- Goal: Same performance as hand-coded RTL with 10-15% more resources

### **HLS Procedure**







### Example Component/Testbench Source





### Cosimulation

Cosimulation: combines x86 testbench with RTL simulation

- HDL code for the component runs in an RTL Simulator
  - Verilog
  - RTL testbench automatically created from software
- main() and everything else called from main runs on x86 as the testbench
- Communication using SystemVerilog Direct Programming Interface (DPI)
  - Allows C/C++ to interface SystemVerilog
  - Inter-process communication (IPC) library used to pass testbench input data to RTL simulator, and returns the data back to the x86 testbench



## Cosimulation Verifying HLS IP

The Intel<sup>®</sup> HLS compiler automatically compiles and links C++ testbench with an instance of the component running in an RTL simulator

- To verify RTL behavior of IP, just run the executable generated by the HLS compiler targeting the FPGA architecture
  - Any calls to the component function becomes calls the simulator through DPI





### C/C++ Functions to Dataflow Circuits

Each component function is converted into custom dataflow hardware

- Gain the benefits of Intel<sup>®</sup> FPGAs without the length design process
- Implement C/C++ operators as circuits
  - HDL code located in <HLS Installation>\ip
  - Load Store units to read/write memory
  - Arithmetic units to perform calculations
  - Flow control units
  - Connect circuits according to data flow in the function

| acl_staging_reg.v     | acl_work_group_li   | bram_512x4M_hw.tcl   | dotp_core.vhd     |
|-----------------------|---------------------|----------------------|-------------------|
| acl_stall_free_sink.v | acl_work_group_li   | bram_512x33M.v       | dotp_core_sv.vhd  |
| acl_stall_free_sink   | acl_work_item_iter  | bram_512x33M_hw      | dotProduct64_dut  |
| acl_stall_monitor.v   | avalon_concatenat   | config_switch1.v     | dotProduct64_dut  |
| acl_start_signal_ch   | avalon_concatenat   | config_switch32.v    | dotProduct64_safe |
| acl_stream_fifo.v     | avalon_conduit_fa   | CosDPStratixVf400    | dotp_wrapper.v    |
| acl_stream_to_vect    | avalon_conduit_fa   | CosDPStratixVf400    | dotp_wrapper_sv.v |
| acl_task_copy_finis   | avalon_split_multib | CosPiDPStratixVf40   | dotp_wrapper_tom  |
| acl_toggle_detect.v   | avalon_split_multib | CosPiDPStratixVf40   | dp_addb.vhd       |
| acl_token_fifo_cou    | barrier_fifo.v      | cra_ring_node.sv     | dp_addpipe.vhd    |
| acl_valid_fifo_coun   | bram_256x4M.v       | cra_ring_node_hw.tcl | dp_adds.vhd       |
| acl_vector_to_stre    | bram_256x4M_hw.tcl  | cra_ring_rom.sv      | dp_clz64.vhd      |
| acl_vector_to_stre    | bram_256x67M.v      | cra_ring_rom_hw.tcl  | dp_clzpipe64.vhd  |
| acl_work_group_di     | bram_256x67M_hw     | cra_ring_root.sv     | dp_div_core.vhd   |
| acl_work_group_di     | bram_512x4M.v       | cra_ring_root_hw.tcl | dp_divnornd.vhd   |



### **Compilation Example**

Software compiled into dataflow circuit with flow control





### Main HTML Report

The Intel<sup>®</sup> HLS Compiler automatically generates HTML report that analyzes various aspects of your function including area, loop structure, memory usage, and system data flow

Located at a.prj/reports/report.html





### HLS Procedure: Integration





### Intel<sup>®</sup> Quartus<sup>®</sup> Software Integration

- a.prj/components directory contains all the files to integrate
  - One subdirectory for each component
    - Portable, can be moved to a different location if desire
- 2 use scenarios
  - 1. Instantiate in HDL
  - 2. Adding IP to a Platform Designer system



## **HDL** Instantiation

- Add Components to Intel® Quartus® Software Project
  - <component>.qsys to Standard Edition
  - <component>.ip to Pro Edition
- Instantiate component module in your design
- Use template
  a.prj/components/<component>/<component>\_inst.v

```
add add inst
 // Interface: clock (clock end)
  .clock
             (), // 1-bit clk input
 // Interface: reset (reset end)
            (), // 1-bit reset n input
  . resetn
 // Interface: call (conduit sink)
  .start
             (), // 1-bit valid input
             (), // 1-bit stall output
  busy
  // Interface: return (conduit source)
  . done
             (), // 1-bit valid output
            (), // 1-bit stall input
  .stall
 // Interface: returndata (conduit source)
  .returndata(), // 32-bit data output
 // Interface: a (conduit sink)
             (), // 32-bit data input
 // Interface: b (conduit sink)
  . b
                 // 32-bit data input
```

# Platform Designer System Integration Tool



available IP

Catalog of

- Interface protocols
- Memory
- DSP
- Embedded
- Bridges
- PII
- Custom Components
- Custom Systems

Accelerate development





Simplify integration

Automate integration tasks



## **Platform Designer Integration**

- Platform Designer component generated for each component:
  - For PD Standard a.prj/components/<component>/<component>.qsys
  - For Platform Designer a.prj/components/<component>/<component>.ip
- In Platform Designer, instantiate component from the IP Catalog in the HLS project directory
  - Add IP directory to IP Catalog Search Locations
    - May use a.prj/components/\*\*/\*
  - Can be stitched with other user IP or Intel® FPGA IP with compatible interfaces
- See tutorials under tutorials/usability

# Platform Designer HLS Component Example

#### Example

- Cascad and hig

|                      | 🛱 Sy                                   | stem          | Contents 🛛                              | Address Map 🛛 😂 | Inter | connect Requirements 🛛 🖾 | Detai | ils ⊠   |                   |            |  |  |
|----------------------|----------------------------------------|---------------|-----------------------------------------|-----------------|-------|--------------------------|-------|---------|-------------------|------------|--|--|
| ded low peep filter  | System: top Path: top_lpf_0.returndata |               |                                         |                 |       |                          |       |         |                   |            |  |  |
| aded low-pass filter |                                        | Use Connectio |                                         | Name            |       | Description              |       | Export  |                   | Clock      |  |  |
| igh-pass filter      | -                                      | 2             |                                         | 🗆 clock_in      |       | Clock Bridge             |       |         |                   |            |  |  |
| ign pass men         | ×<br>×                                 |               | ⊳                                       | in_clk          |       | Clock Input              |       | clk     |                   | exported   |  |  |
|                      |                                        |               |                                         | out_clk         |       | Clock Output             |       | Double  | e-click to export | clock_in_o |  |  |
|                      | <b>1</b>                               | ~             |                                         | 🗆 reset_in      |       | Reset Bridge             |       |         |                   |            |  |  |
|                      |                                        |               | <b>♦</b> >                              | clk             |       | Clock Input              |       | Double  | e-click to export | clock_in   |  |  |
|                      |                                        |               |                                         | in_reset        |       | Reset Input              |       | reset   |                   | [clk]      |  |  |
|                      |                                        | _             |                                         | out_reset       |       | Reset Output             |       | Double  | e-click to export | [clk]      |  |  |
|                      | -                                      |               |                                         | 🗆 top_hpf_0     |       | hpf_internal             |       |         |                   |            |  |  |
|                      |                                        |               |                                         | anjorra         |       | Conduit                  |       |         | f_0_alpha         | [clock]    |  |  |
|                      |                                        |               |                                         | call            |       | Conduit                  |       |         | e-click to export |            |  |  |
|                      |                                        |               | • • • • • • • • • • • • • • • • • • • • | clock           |       | Clock Input              |       |         | e-click to export |            |  |  |
|                      |                                        |               |                                         | reset           |       | Reset Input              |       |         | e-click to export |            |  |  |
|                      |                                        |               |                                         | recontr         |       | Conduit                  |       |         | f_0_return        | [clock]    |  |  |
| Components           |                                        |               |                                         | returndata      |       | Conduit                  |       |         | f_0_returndata    | [clock]    |  |  |
| Components           |                                        |               |                                         | Х               |       | Conduit                  |       | Double  | e-click to export | [clock]    |  |  |
|                      |                                        | ~             |                                         | 🗆 top_lpf_0     |       | lpf_internal             |       |         |                   |            |  |  |
|                      |                                        |               |                                         | carportex       |       | Conduit                  |       |         | _0_alpha          | [clock]    |  |  |
|                      |                                        |               |                                         | call            |       | Conduit                  |       | top_lpf |                   | [clock]    |  |  |
|                      |                                        |               | •       - ?                             | clock           |       | Clock Input              |       |         |                   | clock_in   |  |  |
|                      |                                        |               | $\bullet$                               | reset           |       | Reset Input              |       |         | e-click to export |            |  |  |
|                      |                                        |               | •                                       | return          |       | Conduit                  |       | Double  | e-click to export |            |  |  |
|                      |                                        |               | •                                       | returndata      |       | Conduit                  |       | Double  | e-click to export | [clock]    |  |  |
|                      |                                        |               | <u></u> ~~                              | ×               |       | Conduit                  |       | top_lpf | _0_x              | [clock]    |  |  |



HLS

## Avalon® Interfaces

Easily connects components in an Intel® FPGA to simplify system design

- Standard interfaces design for interoperability
- HLS compiler generates Avalon® interfaces around HLS components
- Avalon Streaming Interface (Avalon-ST)
  - Unidirectional flow of data, simple flexible interface
- Avalon Memory Mapped Interface (Avalon-MM)
  - Address-based read/write interface typical of master-slave connections
- Other Interfaces
  - Conduit, Tri-State Conduit, Interrupt, Clock, Reset



## Avalon<sup>®</sup>-MM Interfaces

- Address-based (memory-mapped) protocol that allows components to communicate using read/write requests
- Master interface
  - Initiates read/write transfers targeting specific address
- Slave interface
  - Accepts and responds to transfer requests
- Interconnect handles decoding of master address request to actual slave interface, backpressure, clocking differences, etc.
- Associated with a clock interface





#### Avalon®-ST Interfaces

- Standard, flexible, and modular protocol for transfer of data
  - Unidirectional
  - Point-to-point connections
  - Fully synchronous
  - Supports simple and complex interface requirements





# Avalon<sup>®</sup> Interface Specification

- Defines the entire Avalon interface standard, including all variations
- Provides reference information on additional transfer types
  - Use cases
  - Waveform diagrams
- <u>http://www.altera.com/literature/manual/mnl\_avalon\_spec.pdf</u>

| Avalon <sup>®</sup>               | Interface                             | Specificati                 | ons           |
|-----------------------------------|---------------------------------------|-----------------------------|---------------|
| MNL-AVABUSREF<br>2017.05.08       |                                       |                             |               |
| ast updated for Intel®<br>ditions | Quartus <sup>®</sup> Prime Design Sui | te: Quartus Prime Pro v17.1 | Stratix 10 ES |
|                                   |                                       |                             |               |
|                                   |                                       |                             |               |
|                                   |                                       |                             |               |
|                                   |                                       |                             |               |
|                                   |                                       |                             |               |
|                                   |                                       |                             |               |
|                                   |                                       |                             |               |
|                                   |                                       |                             |               |
|                                   |                                       |                             |               |
| Subscribe<br>Send Feedback        |                                       |                             |               |



## **Explicit MM Master Interface**

- Explicitly declare Avalon-MM Master interfaces using mm\_master<> class
  - Greater control over interface
  - Specify attributes through parameters





## **Streaming Interfaces**

- Scalar function arguments become pipelined input ports on the HDL module
  - Avalon Streaming interface associated with start and busy inputs
  - Implicit
- Explicit Streaming Interfaces
  - Use ihc::stream\_in<> and ihc::stream\_out<> template classes
    - Pass by reference
  - Creates Avalon Streaming interface with valid and ready signals
  - Explicit control over interface



## Explicit Streaming Interface Example





# Memory-Mapped HLS Component in a System









## **References and Documentation**

- Intel® FPGA high-level design tools landing page
- Intel HLS Compiler support page
- References
  - Intel HLS Compiler User Guide
  - Intel HLS Compiler Getting Started Guide
  - Intel HLS Compiler Reference Manual
  - Intel HLS Compiler Best Practices Guide



 Intel<sup>®</sup> HLS Compiler increases the designer productivity by raising the design entry abstraction from RTL to C++

Shortens development time through accelerated verification

• Implements FPGA specific optimization techniques to deliver great quality of results



SUMMARY

#### Legal Disclaimers/Acknowledgements

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at <u>www.intel.com</u>.

Intel, the Intel Iogo, Intel Inside, the Intel Inside Iogo, MAX, Stratix, Cyclone, Arria, Quartus, HyperFlex, Intel Atom, Intel Xeon and Enpirion are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

OpenCL is the trademark of Apple Inc. used by permission by Khronos

\*Other names and brands may be claimed as the property of others

© Intel Corporation

