
Overview

Basic concept: outlining is the process of replacing a block of consecutive statements with a function call to a new function containing those statements. Conceptually, outlining is the inverse of inlining.
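As a minimal hand-written sketch of the idea (not output produced by ROSE), the two functions below compute the same result. The second one has had its loop body moved into a separate function. All names here (sum_of_squares_inline, outlined_accumulate, and so on) are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Original form: the statements to be outlined sit inline in the caller.
double sum_of_squares_inline(const std::vector<double>& v) {
    double total = 0.0;
    for (double x : v) {
        total += x * x;  // block of statements to be outlined
    }
    return total;
}

// Outlined form: the block becomes its own function. Variables the block
// uses are passed by address so that writes are visible to the caller.
static void outlined_accumulate(const double* xp, double* totalp) {
    assert(xp != nullptr && totalp != nullptr);
    *totalp += (*xp) * (*xp);
}

double sum_of_squares_outlined(const std::vector<double>& v) {
    double total = 0.0;
    for (double x : v) {
        outlined_accumulate(&x, &total);  // the call replaces the original block
    }
    return total;
}
```

Inlining outlined_accumulate back into its caller would recover the first version, which is why the two transformations are described as inverses.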

Use: outlining is widely used to generate kernel functions to be executed on CPUs and/or GPUs, in order to

  • help implement programming models such as OpenMP
  • support empirical tuning of a code portion by first generating a function from that portion

ROSE provides a built-in translator called the AST outliner, which can outline a specified portion of code and generate a function from it.

Official documentation for the AST outliner is located in Chapter 37, "Using the AST Outliner", of the ROSE Tutorial (PDF).

There are two basic ways to use the outliner.

  • Command line method: use the outline command with options that specify the outlining targets. There are two ways to specify which portion of code to outline:
    • use a special pragma to mark outline targets in the input program, and then call a high-level driver routine to process these pragmas
    • use abstract handle strings (detailed in Chapter 46 of the ROSE tutorial) as command line options
  • Function call method: call “low-level” outlining routines that operate directly on the AST nodes you want to outline

Command line

The tool rose/bin/outline comes from tutorial/outliner/outline.cc, which relies on 1) pragmas in input codes or 2) abstract handles specified as command line options to find the target code portion to be outlined.

  • The pragma: put #pragma rose_outline immediately before the code portion you want to outline in your input code
  • abstract handle: -rose:outline:abstract_handle your_handle_string

Options

./outline --help | more

Outliner-specific options
Usage: outline [OPTION]... FILENAME...
Main operation mode:
        -rose:outline:preproc-only                     preprocessing only, no actual outlining
        -rose:outline:abstract_handle handle_string    using an abstract handle to specify an outlining target
        -rose:outline:parameter_wrapper                use an array of pointers to pack the variables to be passed
        -rose:outline:structure_wrapper                use a data structure to pack the variables to be passed
        -rose:outline:enable_classic                   use parameters directly in the outlined function body without transferring statement, C only
        -rose:outline:temp_variable                    use temp variables to reduce pointer dereferencing for the variables to be passed
        -rose:outline:enable_liveness                  use liveness analysis to reduce restoring statements if temp_variable is turned on
        -rose:outline:new_file                         use a new source file for the generated outlined function
        -rose:outline:output_path                      the path to store newly generated files for outlined functions, if requested by new_file. The original source file's path is used by default.
        -rose:outline:exclude_headers                  do not include any headers in the new file for outlined functions
        -rose:outline:use_dlopen                       use dlopen() to find the outlined functions saved in new files.It will turn on new_file and parameter_wrapper flags internally
        -rose:outline:copy_orig_file                   used with dlopen(): single lib source file copied from the entire original input file. All generated outlined functions are appended to the lib source file
        -rose:outline:enable_debug                     run outliner in a debugging mode

Example use

  • outline test.cpp // outline code portions in test.cpp. These code portions are marked by the special rose_outline pragma
  • outline -rose:skipfinalCompileStep -rose:outline:new_file test.cpp // skip compiling the generated rose_? file, put the generated function into a new file

With abstract handles given on the command line, there is no need to insert pragmas into your input code:

  • outline -rose:outline:abstract_handle "ForStatement<position,12>" test3.cpp // outline the for loop located at line 12 of test3.cpp
  • outline -rose:outline:abstract_handle "FunctionDeclaration<name,initialize>::ForStatement<numbering,2>" test2.cpp // outline the 2nd for loop within a function named "initialize" within the test2.cpp file.

/home/liao6/workspace/masterDevClean/buildtree/tests/roseTests/astOutliningTests/outline -rose:outline:new_file -rose:outline:temp_variable -rose:outline:exclude_headers -rose:outline:abstract_handle 'ForStatement<numbering,1>' -c /home/liao6/workspace/masterDevClean/sourcetree/tests/roseTests/astOutliningTests/complexStruct.c
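The handle strings in the examples above follow a visible pattern: one or more elements of the form construct<specifier,value>, joined by "::". The sketch below splits such a string into its elements. It is only meant to illustrate the string layout. parse_handle and HandleElement are hypothetical names, not part of ROSE, and the real abstract handle grammar may accept forms this sketch rejects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One element of a handle, such as "ForStatement<numbering,2>".
struct HandleElement {
    std::string construct;  // e.g. "ForStatement"
    std::string specifier;  // e.g. "numbering", "position", "name"
    std::string value;      // e.g. "2", "12", "initialize"
};

// Split "A<s,v>::B<s,v>" into elements. Returns an empty vector if the
// text does not match that layout. Hypothetical helper, not a ROSE API.
std::vector<HandleElement> parse_handle(const std::string& text) {
    const std::size_t npos = std::string::npos;
    std::vector<HandleElement> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t lt = text.find('<', pos);
        std::size_t comma = (lt == npos) ? npos : text.find(',', lt);
        std::size_t gt = (comma == npos) ? npos : text.find('>', comma);
        if (gt == npos || lt == pos) return {};  // missing part or empty construct name
        out.push_back({text.substr(pos, lt - pos),
                       text.substr(lt + 1, comma - lt - 1),
                       text.substr(comma + 1, gt - comma - 1)});
        pos = gt + 1;
        if (pos == text.size()) break;           // consumed the whole string
        if (text.compare(pos, 2, "::") != 0) return {};
        pos += 2;
        if (pos == text.size()) return {};       // trailing "::" with nothing after it
    }
    return out;
}

// Usage: the second command-line example above yields two elements.
void demo_parse() {
    auto h = parse_handle(
        "FunctionDeclaration<name,initialize>::ForStatement<numbering,2>");
    assert(h.size() == 2 && h[1].construct == "ForStatement");
}
```

Reading the elements left to right narrows the target from an enclosing construct (the function named initialize) to a construct inside it (its second for loop).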

Programming API

You can build your own translators leveraging the outlining support in ROSE. The programming API is defined in

  • Header files: under src/midend/programTransformation/astOutlining/
  • Namespace: Outliner

A few functions and options are provided:

  • Functions: Outliner::outline(), Outliner::isOutlineable()
  • Options

Internal control variables

Outliner.cc

namespace Outliner {
  //! A set of flags to control the internal behavior of the outliner
  bool enable_classic=false;
  // use a wrapper for all variables or one parameter for a variable or a wrapper for all variables
  bool useParameterWrapper=false;  // use an array of pointers wrapper for parameters of the outlined function
  bool useStructureWrapper=false;  // use a structure wrapper for parameters of the outlined function
  bool preproc_only_=false;  // preprocessing only
  bool useNewFile=false; // generate the outlined function into a new source file
  bool copy_origFile=false; // when generating the new file to store outlined function, copy entire original file to it.
  bool temp_variable=false; // use temporary variables to reduce pointer dereferencing
  bool enable_liveness =false;
  bool enable_debug=false; // 
  bool exclude_headers=false;
  bool use_dlopen=false; // Outlining the target to a separated file and calling it using a dlopen() scheme. It turns on useNewFile.
  std::string output_path=""; // default output path is the original file's directory
  std::vector<std::string> handles; //  abstract handles of outlining targets, given by command line option -rose:outline:abstract_handle for each

// DQ (3/19/2019): Suppress the output of the #include "autotuning_lib.h" since some tools will want to define there own supporting libraries and header files.
  bool suppress_autotuning_header = false; // when generating the new file to store outlined function, suppress output of #include "autotuning_lib.h".
};

Algorithm

The outline program's top-level driver: PragmaInterface.cc

  • Outliner::outlineAll (SgProject* project)
    • collectPragmas()
    • outline(SgPragmaDeclaration)
      • outline(SgStatement, func_name)
        • preprocess(s)
        • outlineBlock (s_post, func_name) // Transform.cc The key function here!!
    • deleteAST(SgPragmaDeclaration)

Check if a SgNode is eligible for outlining.

  • Outliner::isOutlineable() src/Check.cc:251
    • checkType() // only specified SgNode types can be outlined, a list is maintained here
    • excluding SgVariableDeclaration
    • must be enclosed inside a function declaration
      • excluding template instantiation (member) function declaration
    • does not refer to hidden types ...

Two phases: preprocessing and actual transformation

  • SgBasicBlock* s_post = preprocess (s);
    • SgStatement * processPragma (SgPragmaDeclaration* decl) // check if it's an outline pragma (#pragma rose_outline), return the next stmt if so.
    • Outliner::preprocess(SgStatement);
      • SgBasicBlock * Outliner::Preprocess::preprocessOutlineTarget (SgStatement* s)
        • normalizeVarDecl()
        • createBlock()
        • Outliner::Preprocess::transformPreprocIfs
        • Outliner::Preprocess::transformThisExprs
        • Outliner::Preprocess::transformNonLocalControlFlow
        • Outliner::Preprocess::gatherNonLocalDecls(); // duplicate function declarations here, e.g. test2005_179.C
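To see why a step such as transformNonLocalControlFlow is needed, consider a target block that contains a break: the jump target lies outside the block, and a jump cannot cross a function boundary. The sketch below shows one generic way to handle this, by recording the requested exit in a flag and replaying the jump in the caller. This is an illustration under that assumption, and the exact code ROSE generates may differ. All names are hypothetical.

```cpp
#include <cassert>

// Original: the marked block contains a break that leaves the enclosing loop.
int first_negative_inline(const int* a, int n) {
    int found = -1;
    for (int i = 0; i < n; ++i) {
        // ---- block to outline ----
        if (a[i] < 0) {
            found = i;
            break;  // non-local: the target is outside the block
        }
        // --------------------------
    }
    return found;
}

// Outlined: the block records which exit it took instead of jumping.
static void outlined_block(const int* a, int* ip, int* foundp, int* exit_takenp) {
    assert(a != nullptr && ip != nullptr && foundp != nullptr && exit_takenp != nullptr);
    if (a[*ip] < 0) {
        *foundp = *ip;
        *exit_takenp = 1;  // remember that the break was requested
        return;
    }
}

int first_negative_outlined(const int* a, int n) {
    int found = -1;
    for (int i = 0; i < n; ++i) {
        int exit_taken = 0;
        outlined_block(a, &i, &found, &exit_taken);
        if (exit_taken == 1) break;  // replay the original break in the caller
    }
    return found;
}
```

The same idea extends to return and continue statements by using a distinct flag value per exit.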

Outliner::outline(stmt) calls generateFuncName(s) to obtain a unique function name, and then calls Outliner::outline (stmt, func_name):

  • Outliner::Transform::outlineBlock (s_post, func_name); // Transform.cc
    • Outliner::Transform::collectVars (s, syms); // collect variables to be passed
    • Outliner::Transform::generateFunction() //generate an outlined function
      • createFuncSkeleton()
      • ASTtools::appendStmtsCopy (s, func_body); // deep copy s to get func_body?
      • variableHandling (syms, func, vsym_remap); // append unwrapping statements
        • createParam() // create parameters
        • createUnpackDecl() // create unpacking statement: int local = parameter
        • createPackStmt() // transfer local back to the parameter after all local calculation
      • remapVarSyms (vsym_remap, func_body); // variable substitution
    • insert() from Insert.cc // insert outlined function and its prototype
      • insertFriendDecls()
      • insertGlobalPrototype()
        • GlobalProtoInserter::insertManually ()
          • generatePrototype()
    • generateCall() // generate a call to the outlined function
    • ASTtools::replaceStatement () // replace the original portion with the call

Advanced features

Some details for outlining can be specified by using command line options or internal flags of the programming API.

List

  • wrap all variables into a data structure: Outliner::useStructureWrapper
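A rough hand-written picture of the structure-wrapper idea: the caller fills one struct with the variables the block needs and passes a single pointer to the outlined function. The struct name, its field names, and the choice to store addresses rather than values are assumptions made for this sketch, so the code ROSE generates may look different.

```cpp
#include <cassert>

// Hypothetical wrapper type: one field per variable used by the block.
struct OutData {
    int* a_p;
    int* b_p;
    int* result_p;
};

// The outlined function takes one argument no matter how many variables
// the original block used.
static void outlined_with_struct(void* out_argv) {
    assert(out_argv != nullptr);
    OutData* d = static_cast<OutData*>(out_argv);
    *d->result_p = *d->a_p + *d->b_p;  // original block: result = a + b;
}

int add_via_struct(int a, int b) {
    int result = 0;
    OutData data = {&a, &b, &result};  // pack in the caller
    outlined_with_struct(&data);       // the call replaces the original block
    return result;
}
```

Compared with an array of void pointers, a struct keeps the type of each packed variable visible in the source.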


dlopen

The use_dlopen option tells the outliner to use dlopen() to find and call the outlined function, which is stored in a dynamically loadable library.

This option turns on several other options:

  • -rose:outline:exclude_headers
  • useNewFile= true;
  • useParameterWrapper = true;
  • temp_variable = true;

outline -rose:outline:use_dlopen -I/home/liao6/workspace/outliner/build/../sourcetree/projects/autoTuning -rose:outline:output_path . -c /path/to/input.c

Testing

The ROSE AST outliner has a dedicated testing directory: rose/tests/nonsmoke/functional/roseTests/astOutliningTests

  • Some C, C++ and Fortran test input files are prepared there.
  • Sample command line options are available in the Makefile.am file within this test directory.


Full command line example

  • /home/liao6/workspace/rose/buildtree/tests/nonsmoke/functional/roseTests/astOutliningTests/outline -rose:outline:use_dlopen -rose:outline:temp_variable -I/home/liao6/workspace/rose/buildtree/../sourcetree/projects/autoTuning -rose:outline:exclude_headers -rose:outline:output_path . -c /home/liao6/workspace/rose/sourcetree/tests/nonsmoke/functional/roseTests/astOutliningTests/array1.c

To run a single test, assuming the input file is named inputFile.c:

  • make classic_inputFile.c.passed //classic behavior
  • make dlopen_inputFile.c.passed // dlopen feature

As you can see, the prefix of the make target indicates which outliner options are used.

Example input and output

As a standalone tool

Input file, with a pragma indicating which code portion to outline:

int main()
{
    double n, start=1, total;
    double unlucky=0, lucky;
    double *number;
	                 
    scanf("%lf",&n);                    
    total = 9;                      
    for(int j =1; j < n; j++)
    {
      total = total * 10;
      start = start *10;
    }

    number = (double*)malloc(n * sizeof(double));                           
    for(double i = start; i < start*10; i++)
    {
      double temp = i;
#pragma rose_outline
      for(int j = 1; j<= n; j++)
      {
	number[j]=(int)temp%10;
	temp = temp/10;
      }
      for(int k = n; k>=1; k--)
      {
	if(number[k] == 1 && number[k-1] == 3){
	  unlucky++;
	  break;
	}
      }
    }                                   
    lucky = total - unlucky;
    printf("there are %f lucky integers in %f digits integers", lucky, n);
    return 0;
}


//------------output file is

static void OUT__1__2222__(double *np__,double **numberp__,double *tempp__);

int main()
{
  double n;
  double start = 1;
  double total;
  double unlucky = 0;
  double lucky;
  double *number;
  scanf("%lf",&n);
  total = 9;
  for (int j = 1; j < n; j++) {
    total = total * 10;
    start = start * 10;
  }
  number = ((double *)(malloc(n * (sizeof(double )))));
  for (double i = start; i < start * 10; i++) {
    double temp = i;
    OUT__1__2222__(&n,&number,&temp);
    for (int k = n; k >= 1; k--) {
      if (number[k] == 1 && number[k - 1] == 3) {
        unlucky++;
        break; 
      }
    }
  }
  lucky = total - unlucky;
  printf("there are %f lucky integers in %f digits integers",lucky,n);
  return 0;
}

static void OUT__1__2222__(double *np__,double **numberp__,double *tempp__)
{
  double *n = (double *)np__;
  double **number = (double **)numberp__;
  double *temp = (double *)tempp__;
  for (int j = 1; j <=  *n; j++) {
    ( *number)[j] = (((int )( *temp)) % 10);
     *temp =  *temp / 10;
  }
}

Working with C++ member functions

Input code:

int a;

class B 
{
  private: 

  int b;
 inline void foo(int c)
 {
#pragma rose_outline
   b = a+c;
 }
};

Output code

  • add friend declaration for the outlined function so it can access private class members
  • pass this pointer to a class object as a function argument
int a;
static void OUT__1__2386__(int *cp__,void *this__ptr__p__);

class B 
{
  public: friend void ::OUT__1__2386__(int *cp__,void *this__ptr__p__);
  private: int b;
  

  inline void foo(int c)
{
// //A declaration for this pointer
    class B *this__ptr__ = this;
    OUT__1__2386__(&c,&this__ptr__);
  }
}
;

static void OUT__1__2386__(int *cp__,void *this__ptr__p__)
{
  int &c =  *((int *)cp__);
  class B *&this__ptr__ =  *((class B **)this__ptr__p__);
  this__ptr__ -> b = a + c;
}

Using -rose:outline:parameter_wrapper , the result will be slightly different:

  • all parameters will be wrapped into an array of pointers in the caller function
  • the array will be unpacked to retrieve the parameters in the outlined function
int a;
static void OUT__1__2391__(void **__out_argv);

class B 
{
  public: friend void ::OUT__1__2391__(void **__out_argv);
  private: int b;
  

  inline void foo(int c)
{
// //A declaration for this pointer
    class B *this__ptr__ = this;
    void *__out_argv1__1527__[2];
    __out_argv1__1527__[0] = ((void *)(&this__ptr__));
    __out_argv1__1527__[1] = ((void *)(&c));
    OUT__1__2391__(__out_argv1__1527__);
  }
}
;

static void OUT__1__2391__(void **__out_argv)
{
  int &c =  *((int *)__out_argv[1]);
  class B *&this__ptr__ =  *((class B **)__out_argv[0]);
  this__ptr__ -> b = a + c;
}

Used for OpenMP Implementation

See more at ROSE_Compiler_Framework/OpenMP_Support.

Below is an example translation:

/*a test C program. You can replace this content with yours, within 20,000 character limit (about 500 lines) . */
#include<stdio.h>
#include<stdlib.h>

int main(int argc, char* argv[])
{
    int nthreads, tid;
    #pragma omp parallel private(nthreads, tid)
    {
        tid = omp_get_thread_num();
	printf("Hello World from thread = %d ", tid);
	if(tid == 0)
	{
	    nthreads = omp_get_num_threads();
	    printf("Number of threads = %d", nthreads);
	}
    }
    return 0;
}


//------------- output code --------------
/*a test C program. You can replace this content with yours, within 20,000 character limit (about 500 lines) . */
#include<stdio.h>
#include<stdlib.h>
#include "libxomp.h" 
static void OUT__1__2231__(void *__out_argv);

int main(int argc,char *argv[])
{
  int status = 0;
  XOMP_init(argc,argv);
  int nthreads;
  int tid;
  XOMP_parallel_start(OUT__1__2231__,0,1,0,"/tmp/test-20191219_224253-113680.c",8);
  XOMP_parallel_end("/tmp/test-20191219_224253-113680.c",17);
  XOMP_terminate(status);
  return 0;
}

static void OUT__1__2231__(void *__out_argv)
{
  int _p_nthreads;
  int _p_tid;
  _p_tid = omp_get_thread_num();
  printf("Hello World from thread = %d ",_p_tid);
  if (_p_tid == 0) {
    _p_nthreads = omp_get_num_threads();
    printf("Number of threads = %d",_p_nthreads);
  }
}

Used to Generate CUDA kernels for OpenMP 4.x

Example input and output code for the classic Jacobi OpenMP 4.0 version:

//--------------input--------------

void jacobi( )
{
  REAL omega;
  int i,j,k;
  REAL error,resid,ax,ay,b;
  //      double  error_local;

  //      float ta,tb,tc,td,te,ta1,ta2,tb1,tb2,tc1,tc2,td1,td2;
  //      float te1,te2;
  //      float second;

  omega=relax;
  /*
   * Initialize coefficients */

  ax = 1.0/(dx*dx); /* X-direction coef */
  ay = 1.0/(dy*dy); /* Y-direction coef */
  b  = -2.0/(dx*dx)-2.0/(dy*dy) - alpha; /* Central coeff */

  error = 10.0 * tol;
  k = 1;

  // An optimization on top of naive coding: promoting data handling outside the while loop
  // data properties may change since the scope is bigger:
#pragma omp target data map(to:n, m, omega, ax, ay, b, f[0:n][0:m]) map(tofrom:u[0:n][0:m]) map(alloc:uold[0:n][0:m])
  while ((k<=mits)&&(error>tol))
  {
    error = 0.0;

    /* Copy new solution into old */
#pragma omp target map(to:n, m, u[0:n][0:m]) map(from:uold[0:n][0:m])
#pragma omp parallel for private(j,i) collapse(2)
    for(i=0;i<n;i++)
      for(j=0;j<m;j++)
        uold[i][j] = u[i][j];

#pragma omp target map(to:n, m, omega, ax, ay, b, f[0:n][0:m], uold[0:n][0:m]) map(from:u[0:n][0:m])
#pragma omp parallel for private(resid,j,i) reduction(+:error) collapse(2) // nowait
    for (i=1;i<(n-1);i++)
      for (j=1;j<(m-1);j++)
      { 
        resid = (ax*(uold[i-1][j] + uold[i+1][j])\
            + ay*(uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b;

        u[i][j] = uold[i][j] - omega * resid;
        error = error + resid*resid ;
      }
...

    /* Error check */

    if (k%500==0)
      printf("Finished %d iteration with error =%f\n",k, error);
    error = sqrt(error)/(n*m);

    k = k + 1;
  }          /*  End iteration loop */
  printf("Total Number of Iterations:%d\n",k);
  printf("Residual:%E\n", error);
  printf("Residual_ref :%E\n", resid_ref);
  printf ("Diff ref=%E\n", fabs(error-resid_ref));
  assert (fabs(error-resid_ref) < 1E-13);
}



//----------------output-----------------

#include "libxomp.h" 
#include "xomp_cuda_lib_inlined.cu" 
...



__global__ void OUT__1__8714__(float omega,float ax,float ay,float b,int __final_total_iters__2__,int __i_interval__3__,float *_dev_per_block_error,float *_dev_u,float *_dev_f,float *_dev_uold)
{
  int _p_i;
  int _p_j;
  float _p_error;
  _p_error = 0;
  float _p_resid;
  int _p___collapsed_index__5__;
  int _dev_lower;
  int _dev_upper;
  int _dev_loop_chunk_size;
  int _dev_loop_sched_index;
  int _dev_loop_stride;
  int _dev_thread_num = getCUDABlockThreadCount(1);
  int _dev_thread_id = getLoopIndexFromCUDAVariables(1);
  XOMP_static_sched_init(0,__final_total_iters__2__ - 1,1,1,_dev_thread_num,_dev_thread_id,&_dev_loop_chunk_size,&_dev_loop_sched_index,&_dev_loop_stride);
  while(XOMP_static_sched_next(&_dev_loop_sched_index,__final_total_iters__2__ - 1,1,_dev_loop_stride,_dev_loop_chunk_size,_dev_thread_num,_dev_thread_id,&_dev_lower,&_dev_upper))
    for (_p___collapsed_index__5__ = _dev_lower; _p___collapsed_index__5__ <= _dev_upper; _p___collapsed_index__5__ += 1) {
      _p_i = _p___collapsed_index__5__ / __i_interval__3__ * 1 + 1;
      _p_j = _p___collapsed_index__5__ % __i_interval__3__ * 1 + 1;
      _p_resid = (ax * (_dev_uold[(_p_i - 1) * 512 + _p_j] + _dev_uold[(_p_i + 1) * 512 + _p_j]) + ay * (_dev_uold[_p_i * 512 + (_p_j - 1)] + _dev_uold[_p_i * 512 + (_p_j + 1)]) + b * _dev_uold[_p_i * 512 + _p_j] - _dev_f[_p_i * 512 + _p_j]) / b;
      _dev_u[_p_i * 512 + _p_j] = _dev_uold[_p_i * 512 + _p_j] - omega * _p_resid;
      _p_error = _p_error + _p_resid * _p_resid;
    }
  xomp_inner_block_reduction_float(_p_error,_dev_per_block_error,6);
}

...


void jacobi()
{
  float omega;
  int i;
  int j;
  int k;
  float error;
  float resid;
  float ax;
  float ay;
  float b;
//      double  error_local;
//      float ta,tb,tc,td,te,ta1,ta2,tb1,tb2,tc1,tc2,td1,td2;
//      float te1,te2;
//      float second;
  omega = relax;
/*
     * Initialize coefficients */
/* X-direction coef */
  ax = (1.0 / (dx * dx));
/* Y-direction coef */
  ay = (1.0 / (dy * dy));
/* Central coeff */
  b = (- 2.0 / (dx * dx) - 2.0 / (dy * dy) - alpha);
  error = (10.0 * tol);
  k = 1;
/* Translated from #pragma omp target data ... */
{
    xomp_deviceDataEnvironmentEnter();
    float *_dev_u;
    int _dev_u_size = sizeof(float ) * n * m;
    _dev_u = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)u),_dev_u_size,1,1)));
    float *_dev_f;
    int _dev_f_size = sizeof(float ) * n * m;
    _dev_f = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)f),_dev_f_size,1,0)));
    float *_dev_uold;
    int _dev_uold_size = sizeof(float ) * n * m;
    _dev_uold = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)uold),_dev_uold_size,0,0)));
    while(k <= mits && error > tol){
      int __i_total_iters__0__ = (n - 1 - 1 - 1 + 1) % 1 == 0?(n - 1 - 1 - 1 + 1) / 1 : (n - 1 - 1 - 1 + 1) / 1 + 1;
      int __j_total_iters__1__ = (m - 1 - 1 - 1 + 1) % 1 == 0?(m - 1 - 1 - 1 + 1) / 1 : (m - 1 - 1 - 1 + 1) / 1 + 1;
      int __final_total_iters__2__ = 1 * __i_total_iters__0__ * __j_total_iters__1__;
      int __i_interval__3__ = __j_total_iters__1__ * 1;
      int __j_interval__4__ = 1;
      int __collapsed_index__5__;
      int __i_total_iters__6__ = (n - 1 - 0 + 1) % 1 == 0?(n - 1 - 0 + 1) / 1 : (n - 1 - 0 + 1) / 1 + 1;
      int __j_total_iters__7__ = (m - 1 - 0 + 1) % 1 == 0?(m - 1 - 0 + 1) / 1 : (m - 1 - 0 + 1) / 1 + 1;
      int __final_total_iters__8__ = 1 * __i_total_iters__6__ * __j_total_iters__7__;
      int __i_interval__9__ = __j_total_iters__7__ * 1;
      int __j_interval__10__ = 1;
      int __collapsed_index__11__;
      error = 0.0;
/* Copy new solution into old */
{
        xomp_deviceDataEnvironmentEnter();
        float *_dev_u;
        int _dev_u_size = sizeof(float ) * n * m;
        _dev_u = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)u),_dev_u_size,1,0)));
        float *_dev_uold;
        int _dev_uold_size = sizeof(float ) * n * m;
        _dev_uold = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)uold),_dev_uold_size,0,1)));
/* Launch CUDA kernel ... */
        int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
        int _num_blocks_ = xomp_get_max1DBlock(__final_total_iters__8__ - 1 - 0 + 1);
        OUT__2__8714__<<<_num_blocks_,_threads_per_block_>>>(__final_total_iters__8__,__i_interval__9__,_dev_u,_dev_uold);
        xomp_deviceDataEnvironmentExit();
      }
{
        xomp_deviceDataEnvironmentEnter();
        float *_dev_u;
        int _dev_u_size = sizeof(float ) * n * m;
        _dev_u = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)u),_dev_u_size,0,1)));
        float *_dev_f;
        int _dev_f_size = sizeof(float ) * n * m;
        _dev_f = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)f),_dev_f_size,1,0)));
        float *_dev_uold;
        int _dev_uold_size = sizeof(float ) * n * m;
        _dev_uold = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)uold),_dev_uold_size,1,0)));
/* Launch CUDA kernel ... */
        int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
        int _num_blocks_ = xomp_get_max1DBlock(__final_total_iters__2__ - 1 - 0 + 1);
        float *_dev_per_block_error = (float *)(xomp_deviceMalloc(_num_blocks_ * sizeof(float )));
        OUT__1__8714__<<<_num_blocks_,_threads_per_block_,(_threads_per_block_ * sizeof(float ))>>>(omega,ax,ay,b,__final_total_iters__2__,__i_interval__3__,_dev_per_block_error,_dev_u,_dev_f,_dev_uold);
        error = xomp_beyond_block_reduction_float(_dev_per_block_error,_num_blocks_,6);
        xomp_freeDevice(_dev_per_block_error);
        xomp_deviceDataEnvironmentExit();
      }
//    }
/*  omp end parallel */
/* Error check */
      if (k % 500 == 0) {
        printf("Finished %d iteration with error =%f\n",k,error);
      }
      error = (sqrt(error) / (n * m));
      k = k + 1;
/*  End iteration loop */
    }
    xomp_deviceDataEnvironmentExit();
  }
  printf("Total Number of Iterations:%d\n",k);
  printf("Residual:%E\n",error);
  printf("Residual_ref :%E\n",resid_ref);
  printf("Diff ref=%E\n",(fabs((error - resid_ref))));
  fabs((error - resid_ref)) < 1E-14?((void )0) : __assert_fail("fabs(error-resid_ref) < 1E-14","jacobi-ompacc-opt2.c",236,__PRETTY_FUNCTION__);
}


See details at ROSE_Compiler_Framework/OpenMP_Acclerator_Model_Implementation
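The generated kernel above recovers the two loop indices from a single collapsed index using a division and a remainder by __i_interval__3__. The sketch below reproduces that arithmetic on the host for the loop nest i in [1, n-1), j in [1, m-1), and checks that the flat loop visits the same (i, j) pairs in the same order as the nested loops. The function names are made up for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Visit order of the original nest: for i in [1, n-1), for j in [1, m-1).
std::vector<std::pair<int, int>> nested_order(int n, int m) {
    std::vector<std::pair<int, int>> out;
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < m - 1; ++j)
            out.emplace_back(i, j);
    return out;
}

// Same iteration space as one flat loop, as produced by collapse(2).
std::vector<std::pair<int, int>> collapsed_order(int n, int m) {
    std::vector<std::pair<int, int>> out;
    int i_total = n - 2;                    // iterations of the i loop
    int j_total = m - 2;                    // iterations of the j loop
    if (i_total <= 0 || j_total <= 0) return out;
    int final_total = i_total * j_total;    // size of the flat iteration space
    int i_interval = j_total;               // flat steps per increment of i
    for (int idx = 0; idx < final_total; ++idx) {
        int i = idx / i_interval + 1;       // +1 is the lower bound of i
        int j = idx % i_interval + 1;       // +1 is the lower bound of j
        assert(i < n - 1 && j < m - 1);     // indices stay inside the original bounds
        out.emplace_back(i, j);
    }
    return out;
}
```

With a single flat index, the iteration space can be divided among threads in one dimension, which is how the kernel above hands chunks to its scheduling calls.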

Known issues

List

  • The message "error in side effect analysis!" appears when Outliner::useStructureWrapper is set to true. This also happens in the outlineIfs example from the tutorial directory.
    • You can ignore this warning message if your translator still works. The outliner runs several analyses internally when Outliner::useStructureWrapper is turned on. Some of these analyses cannot handle every situation, so they give up and notify the outliner. In that case the outliner is designed to make conservative decisions and to generate less optimal translated code.

Publications

A paper describing the internals of the AST outliner. This is the default paper to cite if you use the AST outliner in your research work:

  • Chunhua Liao, Daniel J. Quinlan, Richard Vuduc, and Thomas Panas. 2009. Effective source-to-source outlining to support whole program empirical optimization. In Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing (LCPC'09)

To support generating multi-threaded kernels for CPUs and GPUs

  • Chunhua Liao, Daniel J. Quinlan, Thomas Panas, Bronis R. de Supinski, A ROSE-Based OpenMP 3.0 research compiler supporting multiple runtime libraries, Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more, June 14-16, 2010, Tsukuba, Japan
  • C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, “Early experiences with the OpenMP accelerator model,” in OpenMP in the Era of Low Power Devices and Accelerators, Springer, 2013, pp. 84-98.

Being used to support empirical tuning or autotuning

  • Shirley Moore, Refactoring and automated performance tuning of computational chemistry application codes, Proceedings of the Winter Simulation Conference, December 09-12, 2012, Berlin, Germany
  • Nicholas Chaimov, Scott Biersdorff, Allen D Malony, Tools for machine-learning-based empirical autotuning and specialization, International Journal of High Performance Computing Applications, v.27 n.4, p.403-411, November 2013