ECE408: CUDA Optimization for LeNet
Urbana, IL
Fall 2024
Individual Project
- Streams: Overlapped data transfer with kernel execution by dividing large vectors into segments, so that a kernel runs on one segment while another is copied between host and device memory.
- Kernel Fusion: First implemented convolution as matrix multiplication with three kernels (an unrolling kernel, a shared-memory matrix multiplication kernel, and a permute kernel), then fused the three into a single kernel.
- High-Level Libraries: Used Tensor Cores via warp matrix functions (WMMA) and the CUDA Basic Linear Algebra Subprograms (cuBLAS) library.
- Other Optimizations: Also applied constant memory for the weight matrix, the "__restrict__" keyword, and loop unrolling.
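A minimal CPU-side sketch of the convolution-as-matrix-multiplication idea named in the Kernel Fusion bullet, written in plain C++ rather than CUDA so it can be run anywhere. The function names, the [C][H][W] and [M][C][K][K] layouts, and the stride-1, no-padding convolution are illustrative assumptions, not the project's actual code; the permute step and the fusion itself are not shown.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: unroll an input of shape [C][H][W] into a matrix of
// shape (C*K*K) x (H_out*W_out), one column per output pixel.
std::vector<float> unroll_input(const std::vector<float>& input,
                                int C, int H, int W, int K) {
    const int H_out = H - K + 1, W_out = W - K + 1;
    const int cols = H_out * W_out;
    std::vector<float> unrolled(static_cast<std::size_t>(C) * K * K * cols);
    for (int c = 0; c < C; ++c)
        for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q) {
                const int row = (c * K + p) * K + q;
                for (int h = 0; h < H_out; ++h)
                    for (int w = 0; w < W_out; ++w)
                        unrolled[row * cols + h * W_out + w] =
                            input[(c * H + h + p) * W + (w + q)];
            }
    return unrolled;
}

// Plain (untiled) matrix multiply: a is M x N, b is N x P, result is M x P.
std::vector<float> matmul(const std::vector<float>& a,
                          const std::vector<float>& b, int M, int N, int P) {
    std::vector<float> out(static_cast<std::size_t>(M) * P, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < P; ++j)
            for (int n = 0; n < N; ++n)
                out[i * P + j] += a[i * N + n] * b[n * P + j];
    return out;
}

// Convolution as (weights viewed as M x C*K*K) times (unrolled input).
std::vector<float> conv_via_gemm(const std::vector<float>& weights,
                                 const std::vector<float>& input,
                                 int M, int C, int H, int W, int K) {
    const int cols = (H - K + 1) * (W - K + 1);
    return matmul(weights, unroll_input(input, C, H, W, K), M, C * K * K, cols);
}

// Direct convolution, included only as a reference to compare against.
std::vector<float> conv_direct(const std::vector<float>& weights,
                               const std::vector<float>& input,
                               int M, int C, int H, int W, int K) {
    const int H_out = H - K + 1, W_out = W - K + 1;
    std::vector<float> out(static_cast<std::size_t>(M) * H_out * W_out, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int h = 0; h < H_out; ++h)
            for (int w = 0; w < W_out; ++w)
                for (int c = 0; c < C; ++c)
                    for (int p = 0; p < K; ++p)
                        for (int q = 0; q < K; ++q)
                            out[(m * H_out + h) * W_out + w] +=
                                weights[((m * C + c) * K + p) * K + q] *
                                input[(c * H + h + p) * W + (w + q)];
    return out;
}
```

Because the weight tensor is already contiguous as an M x (C*K*K) matrix in this layout, only the input needs unrolling. On a GPU, a fused kernel would typically compute the unrolled indices on the fly instead of materializing the unrolled matrix in memory.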
ECE391: Computer Systems Engineering Implementation
Urbana, IL
Fall 2023
Member of a Four-Person Team
- Built a Linux-like operating system kernel in C with core features including paged virtual memory, a fully functional IDT and GDT, and an i8259-based interrupt controller.
- Implemented a file system and device drivers, including Real-Time Clock, keyboard, Programmable Interval Timer, and ATA drivers.
- Used x86 assembly to establish the system call linkage between user-level programs and the kernel, passing all test cases provided by the course; also implemented single-CPU task scheduling and switching among multiple terminals.
- Earned full points across all five project checkpoints.