NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems


A Fast and Massively-Parallel Solver for Nonlinear Tomographic Image Reconstruction


Comparative Performance Evaluation of Multi-GPU MLFMM Implementation for 2-D VIE Problems


Scalable Parallel DBIM Solutions of Inverse-Scattering Problems


Thoughts on Massively-Parallel Heterogeneous Computing for Solving Large Problems


Large Inverse-Scattering Solutions with DBIM on GPU-Enabled Supercomputers


Adaptive Cache Bypass and Insertion for Many-Core Accelerators




Heterogeneous System Benchmarking

GPU Neural Network for GPGPUSim

A from-scratch feed-forward network in CUDA 4.0 suitable for GPGPUSim


Docker images with latex

ECE408 / CS483 Course Development

Students add a convolution layer to MXNet


Generate PDF, docx, html, and txt resume/cv from a single markdown source.

Cognitive Application Builder


High-Performance Application Studies

Tools and Techniques for Code Acceleration


An LLVM Version Manager

Positions and Experience


  • Summer 2017 - Research Intern for Optimized CLOUD Systems, IBM TJ Watson Research Center, Yorktown Heights, NY
  • Summer 2014, Summer 2015 - Research Intern, MulticoreWare Inc., Champaign, IL
  • Summer 2013 - Co-op Engineer Floating-Point RTL, AMD, Fort Collins, CO
  • Summer 2012 - Co-op Engineer Physical Design, AMD, Fort Collins, CO


  • 2018 Spring University of Illinois Project TA for ECE408/CS483
  • 2017 Fall University of Illinois Head TA for ECE408/CS483
  • 2017-2018 University of Illinois Mavis Future Faculty Fellow

I have been a teaching assistant for the following courses:

  • ECE408/CS483: Heterogeneous Parallel Programming at the University of Illinois
  • E155: Microprocessor-based Systems: Design & Applications at Harvey Mudd College
  • E85: Digital Electronics and Computer Architecture at Harvey Mudd College

I have also been a teaching assistant for the Programming and Tuning Massively Parallel Systems (PUMPS) summer school in Barcelona since 2014.

Recent & Upcoming Talks

NVIDIA Deep Learning Institute Tutorial
Sun, Jun 24, 2018
Bigger GPUs and Bigger Nodes
Wed, Jun 6, 2018
ADA Annual Review Project Pitch
Wed, May 16, 2018
Towards Automatic Heterogeneous Computing Performance Analysis
Fri, Mar 30, 2018
Comparative Performance Evaluation of Multi-GPU MLFMM Implementation for 2-D VIE Problems
Fri, Jun 23, 2017
RAI: A Scalable Submission System for GPU Applications
Mon, May 8, 2017
GPU Performance Nuggets
Wed, Jun 15, 2016

Awards and Recognition

Dan Vivoli Endowed Fellowship - UIUC 2018-2019

Mavis Future Faculty Fellowship - UIUC 2017-2018

Top-20 Poster - NVIDIA GPU Technology Conference 2017

Teacher Ranked as Excellent by Students - UIUC Fall 2015


Web-based method for physical object delivery through use of 3D printing technology

United States 20140122579

Filed November 1, 2012


Board of Governors, University YMCA.

Recent Posts

I’m looking for a few good undergraduate students! Please contact me if you’re interested in any of the tasks on this page. More significant engagement with masters or doctoral students on any of these topics is also welcome. (Last updated Feb 25, 2018)

System Characterization

The IMPACT group is working on a heterogeneous system benchmarking tool: rai-project/microbench. We’d like to extend it with the following capabilities:

  • Adding persistent storage benchmarking
  • Enhancing the CUDA communication benchmarking (full-duplex)
  • Benchmarking system atomics, cooperative kernel launches, and other new CUDA features
  • Adding network storage discovery and performance characterization (MPI, RPC, and so on)
  • Adding communication collective performance characterization (MPI / NCCL / Blink / direct CUDA implementations, others)

Application Characterization

I have written a heterogeneous application profiling tool.


I manage the two Minsky machines available to the C3SR center at Illinois.

Minsky Machine Overview

  • Product: IBM S822LC
  • Model: 8335-GTB
  • CPU: 2x Power8
  • GPU: 4x NVIDIA P100 w/ 16GB RAM
  • RAM: 512 GB

Each P8 CPU has 10 cores with 8-way SMT, yielding 80 threads per CPU or 160 threads on each Minsky machine.


I’m helping teach the Programming and tUning Massively Parallel Systems (PUMPS) summer school, hosted by the Barcelona Supercomputing Center at UPC in Barcelona, Spain!


I’m attending CEM 17 hosted at UPC Barcelona, Spain!


I’ve made my first trip to NVIDIA’s GPU Technology Conference this year, to present some work with my collaborators Abdul Dakkak and Cheng Li. I’ve wanted to attend GTC ever since my first year in the IMPACT group, so this is an exciting trip for me!



  • 222 Coordinated Science Lab, 1308 W. Main St., Urbana, Illinois 61801
  • Face-to-face by appointment