Aater Suleman

Austin, Texas, United States
8K followers 500+ connections

About

I am a Co-founder of a startup accelerator called Vixul. Unlike typical startup…

Experience

  • Vixul

    Austin, Texas Metropolitan Area

  • -

    Austin Area, Texas

  • -

    Austin, Texas Metropolitan Area

  • -

    Austin, Texas Area

  • -

    Austin, Texas Area

  • -

    Austin Area, Texas

  • -

    Austin, Texas

  • -

    Austin, Texas Area

  • -

    Austin, Texas Area

  • -

    Austin, Texas Area

  • -

    Austin, Texas Area

Education

  • The University of Texas at Austin

    -

    Advisor: Yale Patt
    Thesis: An Asymmetric Multi-core Architecture for Efficiently Accelerating Critical Paths in Multithreaded Programs
    Research Interests: Multicore architectures, graphics architectures, hardware/software co-design, and parallel programming.

  • With Highest Honors

Publications

Projects

  • TCP performance benchmarking using multiple NICs

    In collaboration with a CA-based startup, we developed a tool for measuring the performance of using multiple network interfaces concurrently for TCP communication.

  • NoSQL Database benchmarking

    - Present

    We are benchmarking (and occasionally optimizing) various database products in collaboration with Cloudant Inc. The databases being considered include CouchDB, Amazon DynamoDB, and traditional SQL databases.

  • Hadoop Performance Profiler

    - Present

    Developing a Hadoop profiling tool in collaboration with a California-based stealth startup. We have modified the Hadoop framework to expose detailed performance data that was previously unavailable. Using this data and some post-processing, the tool can predict the performance of applications on different hardware architectures and under different constraints.

  • Solar Panel monitoring

    We are developing an end-to-end solution to monitor solar panels deployed in the field. For each unit, the system measures the power output of the solar panel using an AC power monitor, uses a MiFi device to upload this data via TCP to a Cassandra Database, analyzes this data periodically, and produces analytics and summary reports for the provider and the clients.

  • iPass.pk

    - Present

    A web-based pass generator for iOS 6 Passbook. I built it as a hobby to help people design passes without having to write code. At the time of its inception, it was the first pass creator with live updates.

  • The Asymmetric Chip Multiprocessor

    - Present

    To leverage multicore, programmers must parallelize their programs. Due to resource constraints, programmers often leave some parts serial, and these parts can form the critical path of the program. This makes parallel programs inherently asymmetric: some parts matter more for performance than others. However, today’s CMPs are symmetric: all cores offer the same performance, so all code, critical or non-critical, runs at the same speed. To balance power, performance, and programmer effort, I proposed the Asymmetric Chip Multiprocessor (ACMP) paradigm. Instead of providing only symmetric cores, the ACMP provides one or a few large cores and many small cores. Programmers can parallelize only the parts that are easy to parallelize and leave the difficult parts serial. The ACMP runs the parallel parts at high throughput on the small cores and runs the serial parts quickly on the large core(s). My thesis and the papers I published on ACMP in ASPLOS’08 & ’09, ISCA’10, Micro Top Picks’09 & ’10, and PACT’10 make the following contributions:
    1. Identify the three types of common serial bottlenecks: non-parallel kernels, critical sections, and limiter pipeline stages. Contrary to conventional wisdom, these bottlenecks are inherently different and so is their impact on performance.
    2. Propose analytic models to characterize the performance impact of these bottlenecks
    3. Design hardware, compiler, ISA, and software support required to identify and accelerate these bottlenecks using the ACMP
    4. Evaluate in depth the trade-offs of accelerating these bottlenecks via qualitative analysis and cycle-accurate simulation

    Evaluation of many common parallel programs shows that accelerating the serial parts using the ACMP speeds up the program, improves its scalability, raises the achievable speedup, and saves the time and resources programmers would otherwise have invested in parallelizing the difficult-to-parallelize parts.
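
    The trade-off between symmetric and asymmetric chips can be illustrated with a simple Amdahl-style estimate. This is only a sketch under stated assumptions, not the analytic models from the thesis: the chip has a fixed area budget measured in small-core equivalents, a large core occupying `large_core_area` of that budget is assumed to run serial code `sqrt(large_core_area)` times faster than a small core (a common rule of thumb), and the parallel part is assumed to scale perfectly across the small cores. All function and parameter names are hypothetical.

```python
import math


def symmetric_speedup(parallel_fraction: float, small_cores: int) -> float:
    """Speedup over one small core when every core is a small core."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / small_cores)


def acmp_speedup(parallel_fraction: float, budget: int, large_core_area: int) -> float:
    """Speedup over one small core with one large core plus small cores.

    Assumption: the large core uses `large_core_area` small-core equivalents
    of area and runs serial code sqrt(large_core_area) times faster.
    Assumption: the parallel part runs only on the remaining small cores.
    """
    large_core_perf = math.sqrt(large_core_area)
    small_cores = budget - large_core_area
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction / large_core_perf + parallel_fraction / small_cores)


if __name__ == "__main__":
    # 90% parallel code on a chip with room for 16 small cores.
    print(symmetric_speedup(0.9, 16))
    print(acmp_speedup(0.9, 16, 4))
```

    Under these assumptions, giving up four small cores for one large core pays off whenever the serial part is a large enough share of the run time; with a nearly fully parallel program the symmetric chip comes out ahead.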

  • Multi-threaded Heterogeneous Core Simulator

    -

    During my ACMP research, I single-handedly developed many unique and powerful tools. The most important is a cycle-accurate microprocessor simulator which can deterministically simulate the execution of multi-threaded workloads on heterogeneous cores. The simulator implements a thin asymmetry-aware OS and simulates x86 in-order and out-of-order cores, on-chip interconnects, and a modern DDR DRAM system. The simulator is able to simulate large-scale workloads such as the MySQL database engine.

  • Calibration of the laser altimeter for NASA’s Ice, Cloud, and land Elevation Satellite (ICESat)

    -

    NASA planned the ICESat mission (launched Jan 2003) [http://icesat.gsfc.nasa.gov] to study the movement of ice on Earth. The satellite reported readings of what it observed at a particular location at a given time using radars. Because the satellite’s readings could sometimes be inaccurate, NASA wanted a mechanism to validate them independently. Our solution involved planting a laser gun on the satellite and laying 1000+ laser sensors on the ground in the flight path of the satellite. When the satellite passed over the sensors, they detected the laser, so we could tell exactly where the satellite was flying and at what time.

    Specifically, I contributed to three efforts:

    1. I developed the software for the timing system and played a key role in designing the hardware as well. When the laser hit our sensors, they transmitted an electrical signal to the computer. My system recorded which detectors were hit and the exact time of each hit. While NASA only needed us to be accurate to 1/1000th of a second, my program was accurate to 1/80-millionth of a second (12.5 ns). The system is in use at the White Sands Missile Range in New Mexico.
    2. I developed tools to help pick the optimal sensor arrangement, the one that maximizes the chances of the laser beam hitting our sensors.
    3. We needed to test our setup before putting it aboard ICESat. We designed a test in which an airplane with an on-board laser emulated the satellite. For this experiment, the airplane operator needed a GPS-based positioning tool that showed where the detectors were on the ground. Our need was very specific and no off-the-shelf GPS system met it, so I developed a system from scratch to perform the task.

  • Cassandra Training for Fraud Prevention Group at Bank of America

    -

    Prepared and delivered customized Cassandra training for engineers in the Bank of America Fraud Prevention Department in NYC. The training was delivered in partnership with AllTech consulting.

  • Data Marshalling in Multi-core Architectures

    -

    In an ACMP, the parts to be accelerated are often moved from the small cores to the large core. This has a major cost: any data used by the task has to be sent to the large core and back. I came up with a mechanism that identifies the data a task uses and sends it ahead of time to the core the task will move to, reducing this overhead to zero.

  • Designed and Developed an Alpha processor

    -

    Designed, implemented, and synthesized a subset of the 64-bit Alpha architecture. The processor had a 4-stage pipeline and was synthesized using three standard cell libraries: 0.13, 0.18, and 0.35 um.

  • Designed and Developed an x86 processor

    -

    Implemented a subset of the x86 ISA in gate-level Verilog as part of a 3-person team. The processor had a 7-stage pipeline and a 10-us cycle time on 0.35 um. The design included caches, virtual memory, and branch prediction.

  • Development of OR1200 processor

    -

    Collaborated with a team on the design of a 1 GHz RISC processor. Optimized the area, timing, and power of the Wishbone bus.

  • EasyPeers.com

    -

    Another website I developed on the side. It's a lightweight task-tracking and collaboration system for small teams. You can assign a task to anyone with a Google account and then track progress, inquire about status, and monitor activity logs. I envision it being used at home, at school, and in startups. The back-end is Django-nonrel and the front-end is iWebKit.

  • Embedded Operating System

    -

    Programmed operating system commands to implement a multithreaded environment and a file system.

  • Feedback-Driven Pipelining

    -

    Pipeline parallelism is a promising approach to extracting parallelism from programs. In pipeline parallelism, the programmer splits a program into multiple stages that execute concurrently on multiple cores. Efficient execution requires that the cores be carefully distributed among stages: assigning too many or too few cores to a given stage wastes energy and reduces performance. Choosing the best core-to-stage allocation is either impossible or requires enormous work because there are billions of possible allocations, and the best one depends not only on the program but also on its execution environment (i.e., the input to the program, the configuration of the computer, and the other programs running on the computer). Pipeline parallelism, although known to be powerful, is often avoided because of this challenge. I proposed Feedback-Driven Pipelining (FDP), an algorithm that chooses the best core-to-stage allocation during program execution without incurring any cost. FDP also introduces a novel method for writing adaptable parallel programs without the compiler support that all previous research in the area required, a major shortcoming of that work.
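
    To make the core-to-stage allocation problem concrete, here is a minimal sketch of a greedy heuristic. It is not FDP itself; it only illustrates the idea of steering cores toward the slowest stage based on measured stage times. The function name and inputs are hypothetical, and the sketch assumes that a stage's time per item shrinks in proportion to the number of cores assigned to it.

```python
from typing import List


def allocate_cores(stage_times: List[float], total_cores: int) -> List[int]:
    """Greedily assign cores to pipeline stages.

    stage_times: measured time per item for each stage on a single core.
    Assumption: a stage with k cores takes stage_time / k per item.
    """
    if total_cores < len(stage_times):
        raise ValueError("need at least one core per stage")

    # Every stage needs at least one core to make progress.
    cores = [1] * len(stage_times)

    # Hand each remaining core to the stage that currently limits throughput.
    for _ in range(total_cores - len(stage_times)):
        bottleneck = max(
            range(len(stage_times)),
            key=lambda i: stage_times[i] / cores[i],
        )
        cores[bottleneck] += 1
    return cores


if __name__ == "__main__":
    # Three stages; the middle stage is four times slower than the first.
    print(allocate_cores([1.0, 4.0, 2.0], total_cores=7))
```

    In a run-time setting the stage times would be re-measured periodically and the allocation recomputed, since the best allocation depends on the input and on what else is running on the machine.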

  • Feedback-Driven Threading

    -

    In parallel programs, an important question is how many threads to split a program into. Too few threads waste performance opportunity, and too many waste energy without providing any benefit. Furthermore, the number of threads required to saturate performance depends on the program input and hardware configuration. Traditionally, either programmers are burdened with the task of choosing the number of threads or programs are left to run inefficiently. I proposed a mechanism that chooses the best number of threads at run-time, which can save energy and improve performance.
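
    A rough illustration of choosing a thread count from run-time feedback follows. This is a sketch of a generic search, not the published mechanism: it doubles the thread count and stops as soon as the measured run time no longer improves by a chosen margin. The `measure_time` callback, the `min_gain` threshold, and the doubling schedule are all assumptions made for the example.

```python
from typing import Callable


def choose_thread_count(
    measure_time: Callable[[int], float],
    max_threads: int,
    min_gain: float = 0.05,
) -> int:
    """Return the largest tried thread count that still gave a real speedup.

    measure_time(n) is a hypothetical callback that runs a short sample of
    the workload with n threads and returns the elapsed time.
    """
    best_threads = 1
    best_time = measure_time(1)
    threads = 2
    while threads <= max_threads:
        elapsed = measure_time(threads)
        # Stop once extra threads improve run time by less than min_gain.
        if elapsed > best_time * (1.0 - min_gain):
            break
        best_threads, best_time = threads, elapsed
        threads *= 2
    return best_threads


if __name__ == "__main__":
    # Toy workload model: performance saturates at 8 threads.
    def toy_model(n: int) -> float:
        return 10.0 + 90.0 / min(n, 8)

    print(choose_thread_count(toy_model, max_threads=32))
```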

  • Identifying Input Dependent Branches at Compile-time

    -

    Branch prediction is a well-understood problem in computer architecture. A correct prediction is a win, but an incorrect prediction is a loss. While most branches can be predicted with near-100% accuracy, some branches are hard to predict because they are input-dependent and do not follow any specific pattern. We proposed 2D-profiling, the first attempt at identifying such branches at compile time using only a single input set. 2D-profiling recorded how the branches behaved over time and used heuristics to identify input-dependent branches.
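
    The idea of profiling a branch over time can be sketched as follows. This is an illustrative heuristic only, not the heuristics used in 2D-profiling: it splits one branch's prediction record into fixed-size time slices, computes the prediction accuracy in each slice, and flags the branch when the accuracy varies a lot from slice to slice. The slice size, the threshold, and the function names are assumptions.

```python
from statistics import pstdev
from typing import List, Sequence


def slice_accuracies(correct: Sequence[bool], slice_size: int) -> List[float]:
    """Prediction accuracy of one branch in consecutive time slices."""
    accuracies = []
    for start in range(0, len(correct), slice_size):
        chunk = correct[start:start + slice_size]
        accuracies.append(sum(chunk) / len(chunk))
    return accuracies


def looks_input_dependent(
    correct: Sequence[bool],
    slice_size: int = 100,
    threshold: float = 0.1,
) -> bool:
    """Flag a branch whose accuracy fluctuates strongly over time.

    Assumption: large variation across slices within a single run hints
    that the branch's behavior would also change with the input.
    """
    accuracies = slice_accuracies(correct, slice_size)
    return len(accuracies) > 1 and pstdev(accuracies) > threshold


if __name__ == "__main__":
    steady = [True] * 400
    phased = ([True] * 100 + [False] * 100) * 2
    print(looks_input_dependent(steady), looks_input_dependent(phased))
```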

  • Increasing Cache Capacity through Line Distillation

    -

    We showed that a large amount of data stored in the L2 cache is actually never used. Using that insight, we developed a mechanism we called Line Distillation. We filtered the unused words in the cache and created extra capacity by proactively discarding those words from the cache.

  • iPhone app: PhoneGap Browser

    -

    Developed an app called “PhoneGap Browser” that eases development with the PhoneGap infrastructure. Refer to my blog for details.

  • JobAlerts Web 2.0 Service

    -

    I developed a simple, free job-posting notification service for freelancers. It alerts job seekers when an opportunity matching their criteria is posted on the web. The backend uses Django-nonrel, feedparser, and the Google data store, running on Google App Engine Python. The frontend uses Django templates and jQuery.

  • Matchlessdeals.com

    -

    Just as Netflix suggests movies based on a user's viewing history, my site employs a machine learning algorithm I developed to suggest deals based on a user's deal-viewing history. The site uses Python on Google App Engine for its back-end and iWebKit for its front-end.

  • SMS gateway for RFID and Video Processing-based payment solution

    -

    Developed an SMS communication gateway in Python for Automaton Inc, for use in its Skip payment solution. Skip uses Microsoft Kinect and RFIDs to provide a distributed payment system. More details are intentionally omitted.

  • Web-based Performance Evaluation of iPhone

    -

    I wrote a simple web-based system to evaluate the performance of C code running on the ARM core inside an iPhone. Refer to my blog for details.

  • Web/Phone/SMS controlled Generic Button Presser

    -

    An Arduino controls a servo motor that presses a button when activated. The Arduino connects to the internet via an Ethernet shield and a TP-Link router. I also wrote a simple relay script in node.js and the accompanying Arduino code to allow web access to the button presser from outside the home LAN. I interfaced it with the Twilio API so that the user can activate the button presser by calling a number, sending an SMS, or visiting a website. Possible applications of the button presser include automating a cheap $20 garage door opener, controlling a power strip, pressing buttons on the TV remote, etc.

Languages

  • Urdu

    -

  • Punjabi

    -

Organizations

  • Eta Kappa Nu

    Publishing Chair

    - Present
  • Eta Kappa Nu, Electrical Engineering Honors Society

    Vice President

    -

Recommendations received

4 people have recommended Aater
