Brief Description
 
 
  • Multicore and Transactional Memory Systems

    What is Transactional Memory?

       Transactional Memory (TM) is a promising technique that makes parallel programming convenient by eliminating the effort and programming complexity required for synchronization. The main concept of TM is to provide lock-free data structures that avoid common problems associated with conventional locking techniques, including priority inversion, convoying, and the difficulty of avoiding deadlock. In addition, more parallelism can be achieved by allowing optimistic access to shared data. In TM systems, a transaction either commits, making all of its effects visible atomically, or aborts, in which case its effects are discarded and the transaction is retried; this preserves the atomicity property.
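
       As a concrete illustration, consider updating a shared counter. The sketch below contrasts a conventional lock with a transactional block, assuming a compiler that provides TM language support (e.g., GCC's -fgnu-tm extension); the details are illustrative rather than tied to any particular TM system.

        // Lock-based vs. transactional update of a shared counter.
        #include <mutex>

        long shared_balance = 0;
        std::mutex balance_lock;

        // Conventional locking: coarse locks serialize all accesses and
        // expose the program to priority inversion and deadlock hazards.
        void deposit_locked(long amount) {
            std::lock_guard<std::mutex> guard(balance_lock);
            shared_balance += amount;
        }

        // Transactional version: the block executes atomically; on a
        // conflict the runtime rolls the transaction back and retries it.
        void deposit_tm(long amount) {
            __transaction_atomic {
                shared_balance += amount;
            }
        }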








    Contention Management

       Earlier TM research mainly focused on methods of conflict detection and version management. However, another important research topic, contention management, has arisen as transaction sizes and the number of running threads have increased. Advanced contention management is an essential factor in guaranteeing performance improvement in TM systems.
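
       One simple and widely used contention-management policy is exponential backoff: an aborted transaction waits an increasingly long, randomized interval before retrying, which reduces the chance of the same transactions colliding repeatedly. The sketch below illustrates the idea; try_transaction() is a hypothetical hook into the TM runtime, stubbed here so the sketch is self-contained.

        #include <chrono>
        #include <random>
        #include <thread>

        // Hypothetical hook into the TM runtime: returns false when the
        // attempt aborted. Stubbed with a random outcome for illustration.
        bool try_transaction() {
            static thread_local std::mt19937 rng(std::random_device{}());
            return std::uniform_int_distribution<int>(0, 3)(rng) != 0;
        }

        // Exponential backoff: after each abort, wait a random interval
        // drawn from a window that doubles (up to a cap) before retrying.
        void run_with_backoff() {
            std::mt19937 rng(std::random_device{}());
            int max_delay_us = 1;
            while (!try_transaction()) {
                std::uniform_int_distribution<int> dist(0, max_delay_us);
                std::this_thread::sleep_for(std::chrono::microseconds(dist(rng)));
                if (max_delay_us < 1024) max_delay_us *= 2;  // widen the window
            }
        }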





    What is Swarm Intelligence?

       Swarm intelligence (SI) is the collective behavior of decentralized, self-organized systems. SI systems are typically made up of a population of simple agents interacting with one another and with their environment. Each agent follows very simple rules, yet the interactions between agents lead to intelligent global behavior even though there is no centralized control structure. In nature, ants and bees are good examples of SI. Our research applies SI to multicore processors to improve their scalability.
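
       The following toy example illustrates the SI principle only, not our multicore mechanism: simulated ants choose between two paths using nothing but local pheromone information, successful trips reinforce the trail, and evaporation forgets stale information, so the colony converges on the better path without any central coordinator.

        #include <array>
        #include <cstdio>
        #include <random>

        int main() {
            std::array<double, 2> pheromone = {1.0, 1.0};  // two candidate paths
            std::array<double, 2> quality   = {0.3, 0.7};  // path 1 is better
            std::mt19937 rng(42);
            std::uniform_real_distribution<double> coin(0.0, 1.0);

            for (int step = 0; step < 10000; ++step) {     // one ant per step
                double p0 = pheromone[0] / (pheromone[0] + pheromone[1]);
                int choice = (coin(rng) < p0) ? 0 : 1;     // follow the local trail
                if (coin(rng) < quality[choice])           // success reinforces it
                    pheromone[choice] += 1.0;
                pheromone[0] *= 0.999;                     // evaporation forgets
                pheromone[1] *= 0.999;                     // old information
            }
            std::printf("trail strengths: %.1f vs %.1f\n",
                        pheromone[0], pheromone[1]);       // path 1 dominates
        }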

      
    What We Are Doing

       Centralized transaction schedulers have been widely studied as a way to lessen the contention that causes wasteful abort recovery. Despite the reduced contention and improved performance, such architectures incur additional software or hardware overhead and increased complexity. To address this, we are focusing on a simple, distributed contention management method. This approach also provides better scalability than centralized techniques.




  • Energy Efficient Architecture for IoT and ADAS

    Low-Power / High-Performance Computing

       An Advanced Driver Assistance System (ADAS) makes decisions for the various situations that can arise while driving. These decisions are based on a large number of sensors, the driving environment, and vision information from several cameras. Since each decision must be made in time, the embedded system for ADAS requires high-performance computing to process the data correctly.





    Heterogeneous Architecture

       General-purpose computing on GPUs (GPGPU) has improved the computing performance of applications that need a large amount of computational power and memory bandwidth, and it has emerged as a computing paradigm built on graphics hardware. A GPU generally includes an array of hundreds of hardware threads, on which high-performance computing (HPC) workloads, including scientific calculations, mathematics, and physics, are now performed. With this hardware support, the GPU is more attractive for HPC than modern microprocessors alone. The GPU excels at data-intensive computation because it is a throughput-oriented architecture. However, with the increasing number of processor cores on a single die as well as improved single instruction multiple data (SIMD) units, the CPU has seized a chance to work together with the GPU (i.e., heterogeneous computing).




       In heterogeneous computing, the CPU, which generally controls the processing flow of program execution, must wait its turn until execution on the GPU is completed. Thus, our goals are to reduce the CPU's wait time and to make the CPU and GPU cooperate in parallel. To achieve this cooperative computing, several methods can be used: a new programming model or a virtualization technique. First, exploiting a new programming model to support cooperative computing is relatively straightforward. The programming model provides abstract programming interfaces (APIs) for interacting either directly with the OS or with the device driver that supports the graphics hardware. Second, using a virtualization technique is more complicated; it requires several techniques such as language translation and workload distribution.
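
       The CUDA sketch below illustrates the basic overlap in a minimal form: the host enqueues an asynchronous copy and a kernel on a stream, then performs independent CPU work instead of blocking, and synchronizes only when the GPU result is actually needed. The kernel, buffer sizes, and cpu_side_work() are illustrative assumptions.

        #include <cuda_runtime.h>

        __global__ void scale(float* data, int n, float factor) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] *= factor;
        }

        void cpu_side_work() { /* independent CPU computation goes here */ }

        void cooperative_run(float* host_buf, int n) {
            // For true copy/compute overlap, host_buf should be pinned
            // memory from cudaMallocHost; with pageable memory the async
            // copies fall back to synchronous behavior.
            float* dev_buf = nullptr;
            cudaMalloc(&dev_buf, n * sizeof(float));

            cudaStream_t stream;
            cudaStreamCreate(&stream);

            // Enqueue the work asynchronously; control returns to the CPU.
            cudaMemcpyAsync(dev_buf, host_buf, n * sizeof(float),
                            cudaMemcpyHostToDevice, stream);
            scale<<<(n + 255) / 256, 256, 0, stream>>>(dev_buf, n, 2.0f);
            cudaMemcpyAsync(host_buf, dev_buf, n * sizeof(float),
                            cudaMemcpyDeviceToHost, stream);

            cpu_side_work();                // CPU computes while the GPU runs

            cudaStreamSynchronize(stream);  // block only when the result is needed
            cudaStreamDestroy(stream);
            cudaFree(dev_buf);
        }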




       In particular, in the era of the IoT (Internet of Things), various devices are connected through wired or wireless networks. From the perspective of computing-resource utilization, the IoT can provide a heterogeneous processor network for users. The benefits of this network of computing resources can be exploited through a heterogeneous computing framework.




  • GPU Architecture and Computing

    Rise of General-Purpose GPUs

       The GPU was first developed to accelerate graphics processing, which involves a massive number of instructions with no dependencies on each other. With the emergence of multi-core processors, the focus has shifted to parallel programming, where independent instructions are easier to parallelize. Seeing this, the GPU has been extended to support general-purpose computation as well, by making its many cores programmable (GPGPU: general-purpose computing on graphics processing units).


    GPGPU Instruction Set Simulator

       The core purpose of this research is to implement an instruction set simulator modeled after a GPGPU. Furthermore, we will enable system software porting, such as for the OS and compiler. Lastly, we will provide software modules that allow changes to the hardware architecture, along with software for evaluating the results of those changes and a dynamic analysis tool for analyzing the performance of software applications.



    What We Are Doing

       We model NVIDIA GPGPU chips, down to the hardware pipeline and the memory hierarchy consisting of local memory, global memory, constant memory, texture memory, shared memory, registers, and buses. The simulator is implemented in C/C++ and virtually executes the instructions of CUDA programs.
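
       At its core, such a simulator is a fetch-decode-execute loop over a modeled machine state. The toy skeleton below conveys the structure only; the instruction format and opcodes are invented for illustration, whereas the real simulator models warps, the pipeline, and the full memory hierarchy listed above.

        #include <cstdint>
        #include <vector>

        enum class Op : uint8_t { ADD, MUL, LD, ST, HALT };

        struct Instr { Op op; uint8_t dst, src0, src1; };

        struct Core {
            uint32_t pc = 0;
            std::vector<int32_t> regs = std::vector<int32_t>(32, 0);
            std::vector<int32_t> mem  = std::vector<int32_t>(1024, 0);

            // Runs until HALT; the program must end with a HALT instruction.
            void run(const std::vector<Instr>& program) {
                for (;;) {
                    const Instr& in = program[pc++];      // fetch
                    switch (in.op) {                      // decode + execute
                        case Op::ADD: regs[in.dst] = regs[in.src0] + regs[in.src1]; break;
                        case Op::MUL: regs[in.dst] = regs[in.src0] * regs[in.src1]; break;
                        case Op::LD:  regs[in.dst] = mem[regs[in.src0]]; break;
                        case Op::ST:  mem[regs[in.src0]] = regs[in.src1]; break;
                        case Op::HALT: return;
                    }
                }
            }
        };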







  • Parallel Processing and Optimization

    What is Multiple Pattern Matching?

       Multiple pattern matching algorithms simultaneously search for multiple specific patterns in a data set. These algorithms are widely used in various applications, including searching for multiple strings in a text, genetic data analysis in bioinformatics, and intrusion detection systems in computer networks. Finding predefined patterns quickly is an important factor in those fields. The amount of data used in recent applications has increased exponentially due to advances in information technology, but traditional sequential multiple pattern matching algorithms cannot provide enough performance for efficient pattern matching.





    What We Are Doing


       A parallel computing approach, which provides remarkable performance gains in various fields, can be a good solution. These days, most computing platforms are based on multi-threading, and multi-core technology shows significant performance improvements. However, software designers who want to benefit from parallelization need to develop their parallel algorithms carefully. Unfortunately, parallelization does not always promise a performance enhancement; in fact, it can sometimes cause even worse performance due to unbalanced workload distribution over the multiple threads. Therefore, smart load distribution and data decomposition are crucial to overall performance in parallel computing.

       In this research, we are developing multi-threaded multiple pattern matching algorithms derived from the well-known Wu-Manber algorithm. To achieve high performance in the multi-threaded algorithms, we first concentrate on load balancing between the threads and on pattern locality in memory. Second, we focus on assigning the threads and the data set to multi-core systems, including homogeneous and heterogeneous multi-core processors.
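
       A minimal sketch of the data-decomposition idea appears below: the text is divided into near-equal chunks, and each chunk is extended by (longest pattern length - 1) bytes so that matches spanning a chunk boundary are not missed. The naive inner search stands in for the matching kernel (e.g., a Wu-Manber style search); each worker reports only matches that start inside its base region, so overlap duplicates are filtered out.

        #include <algorithm>
        #include <atomic>
        #include <string>
        #include <thread>
        #include <vector>

        std::atomic<long> total_matches{0};

        // Naive stand-in for the matching kernel. Only positions inside
        // [0, base_len) are reported, so a match in the overlap is counted
        // exactly once, by the chunk that owns its starting position.
        void search_chunk(const char* text, size_t base_len, size_t ext_len,
                          const std::vector<std::string>& patterns) {
            for (size_t pos = 0; pos < base_len; ++pos)
                for (const auto& p : patterns)
                    if (pos + p.size() <= ext_len &&
                        std::equal(p.begin(), p.end(), text + pos))
                        ++total_matches;
        }

        void parallel_search(const std::string& text,
                             const std::vector<std::string>& patterns,
                             unsigned num_threads) {
            size_t max_len = 0;
            for (const auto& p : patterns) max_len = std::max(max_len, p.size());

            std::vector<std::thread> workers;
            size_t chunk = (text.size() + num_threads - 1) / num_threads;
            for (unsigned t = 0; t < num_threads; ++t) {
                size_t begin = t * chunk;
                if (begin >= text.size()) break;
                size_t base_end = std::min(text.size(), begin + chunk);
                // Extend by (longest pattern - 1) so boundary-spanning
                // matches are visible to the owning chunk.
                size_t ext_end = std::min(text.size(), base_end + max_len - 1);
                workers.emplace_back(search_chunk, text.data() + begin,
                                     base_end - begin, ext_end - begin,
                                     std::cref(patterns));
            }
            for (auto& w : workers) w.join();
        }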





    Parallelized Network Coding

       Network coding is a scheme from information theory that can effectively improve the throughput of multicast environments. Its basic principle is to allow coding at the nodes in the network topology between the source and destination nodes.
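
       A minimal sketch of the coding step is shown below: an intermediate node emits random linear combinations of the packets it has received, computed over a Galois field, with the chosen coefficients carried along so the destination can decode. The GF(2^8) arithmetic uses the common 0x11d reduction polynomial; the packet layout and sizes are illustrative.

        #include <cstdint>
        #include <random>
        #include <vector>

        // Carry-less multiplication in GF(2^8), reduction polynomial 0x11d.
        uint8_t gf_mul(uint8_t a, uint8_t b) {
            uint8_t p = 0;
            while (b) {
                if (b & 1) p ^= a;
                a = (a & 0x80) ? (uint8_t)((a << 1) ^ 0x1d) : (uint8_t)(a << 1);
                b >>= 1;
            }
            return p;
        }

        // Combine received packets (assumed equal-sized, at least one) into
        // one coded packet with fresh random coefficients.
        std::vector<uint8_t> encode(const std::vector<std::vector<uint8_t>>& packets,
                                    std::vector<uint8_t>& coeffs_out) {
            std::mt19937 rng(std::random_device{}());
            std::uniform_int_distribution<int> dist(0, 255);
            std::vector<uint8_t> coded(packets[0].size(), 0);
            for (const auto& pkt : packets) {
                uint8_t c = (uint8_t)dist(rng);
                coeffs_out.push_back(c);              // coefficients travel with the packet
                for (size_t i = 0; i < pkt.size(); ++i)
                    coded[i] ^= gf_mul(c, pkt[i]);    // addition in GF(2^8) is XOR
            }
            return coded;
        }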

       Network coding has been researched widely for several network environments and has been proven useful for both wired and wireless networks. It is also useful in P2P networks, where it allows smooth, fast downloads and efficient server utilization.

       In network coding, the destination node decodes received packets to recover the original data. Decoding takes a long time because of its complexity. Our research focuses on workload balancing across multiple threads and on the parallelization of progressive decoding. We implement parallel network coding on different processors such as x86, Cell/BE, and GPGPU, exploiting both thread-level and instruction-level parallelism. We also implement a parallelized decoding accelerator on a reconfigurable device, namely an FPGA.


    Accelerated Network Coding on Graphics Processing Unit 

       We focused on the performance imbalance of previously proposed parallel algorithms when dealing with different-sized data blocks and multiple streams. To overcome this problem and to make network coding practical, we proposed the DSD algorithm, a GPU-based parallel progressive decoding algorithm for multiple incoming streams based on progressive Gauss–Jordan elimination. The algorithm can process multiple incoming streams simultaneously and maintains its maximum decoding performance irrespective of the size and number of transfer units.



    Progressive Decoding

       The decoding operation of network coding is based on matrix inversion, and traditional matrix inversion needs the whole matrix; the resulting waiting time can be long enough to cancel out the advantages of network coding. Therefore, to minimize waiting time we use a progressive scheme, which partially decodes packets before all of them arrive. With progressive decoding, we can achieve higher throughput in network coding environments.
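
       The sketch below illustrates progressive decoding over GF(2) (pure XOR coding) for brevity; practical systems usually work over a larger field such as GF(2^8). Each arriving packet is immediately reduced against the rows received so far and, if it is innovative, is used to eliminate its pivot from the existing rows, so the system stays in reduced form and decoding completes as soon as full rank is reached.

        #include <bitset>
        #include <cstdint>
        #include <vector>

        constexpr size_t N = 64;  // generation size (illustrative)

        struct Row { std::bitset<N> coeff; std::vector<uint8_t> payload; };

        std::vector<Row> rows;    // kept in reduced row-echelon form

        void xor_into(Row& dst, const Row& src) {
            dst.coeff ^= src.coeff;
            for (size_t i = 0; i < dst.payload.size(); ++i)
                dst.payload[i] ^= src.payload[i];
        }

        // Returns true if the packet was innovative (it increased the rank).
        bool on_packet_arrival(Row pkt) {
            for (const Row& r : rows) {            // forward-reduce the new row
                size_t pivot = 0;
                while (pivot < N && !r.coeff[pivot]) ++pivot;
                if (pkt.coeff[pivot]) xor_into(pkt, r);
            }
            if (pkt.coeff.none()) return false;    // linearly dependent: discard
            size_t pivot = 0;
            while (!pkt.coeff[pivot]) ++pivot;
            for (Row& r : rows)                    // back-substitute immediately
                if (r.coeff[pivot]) xor_into(r, pkt);
            rows.push_back(std::move(pkt));        // decoding finishes at rank N
            return true;
        }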




    FPGA Implemented Decoding Accelerator

       We optimize the decoding algorithm by using a parallelized ALU structure that takes the coefficient matrix size into account. By virtue of the FPGA’s highly reconfigurable nature, modifications to the hardware design can be made easily. To support various coefficient matrix sizes, we develop customized ALU architectures with 16, 32, 64, and 128 ALUs, which perform the arithmetic operations in a Galois field. A pipelined version is also being implemented for larger coefficient matrices.




  • Collaboration

       Parallel Systems & Computer Architecture Lab (PASCAL) in the Electrical Engineering & Computer Science Department of the University of California, Irvine

       Super Computing In Pocket (SCIP) Research Group in the Ming Hsieh Department of Electrical Engineering, University of Southern California


  • Architecture Conference Schedule