Mohit Kumar
Researcher / Consultant / Trainer
Programming is more than just typing.

Artificial Intelligence based chip design: Chip Design


This is a departure from my usual tech blog. In the current series, titled Artificial-Intelligence-based-chip-design, we present the use of Artificial Intelligence to design FPGA chips for Ultra-Low-Latency speech transformers. It showcases our products as well as our skills in the respective domains. The series started as an experiment to lower latency for a seq-to-seq LSTM based speech synthesis system. After a thorough analysis, we broke the problem down into (1) mapping the matrix multiplications of deep learning transformers onto an FPGA chip (Transformers On Chip, part-1) and (2) co-designing the FPGA chip for the appropriate workload (Chip Design, part-2). In part-2 we treat the resources on the FPGA chip, the workload, and other design parameters as design variables. We then feed these design variables to a Deep Reinforcement Learning agent. After training, the expectation from the agent is an optimal placement of blocks that optimizes goals such as latency. The Deep Reinforcement Learning algorithm is supposed to find a balance that speeds up computation for tolerable accuracy losses, if any. To use a boxing term: pound for pound, the same hardware, micro-designed by a Deep Reinforcement Learning agent for carrying a specific load.
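To make the "design variables fed to an agent" idea concrete, here is a minimal sketch of framing block placement as a search/learning problem. The grid size, block names, and cost model are hypothetical placeholders (not the article's design variables), and the random-search loop simply stands in for where a trained DRL policy would go.

```python
import random

GRID = (4, 4)                                            # hypothetical 4x4 placement grid
BLOCKS = ["matmul0", "matmul1", "softmax", "layernorm"]  # hypothetical workload blocks

def estimated_latency(placement):
    """Stand-in cost model: total Manhattan distance between consecutively
    executed blocks, a crude proxy for routing/communication delay."""
    cost = 0
    for a, b in zip(BLOCKS, BLOCKS[1:]):
        (xa, ya), (xb, yb) = placement[a], placement[b]
        cost += abs(xa - xb) + abs(ya - yb)
    return cost

def random_placement():
    cells = [(x, y) for x in range(GRID[0]) for y in range(GRID[1])]
    return dict(zip(BLOCKS, random.sample(cells, len(BLOCKS))))

# Trivial "agent": random search keeping the lowest-latency placement seen.
# A DRL agent would replace this loop with a learned policy over the same
# state (current placement) and reward (negative latency).
best = min((random_placement() for _ in range(1000)), key=estimated_latency)
print(estimated_latency(best), best)
```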

Artificial Intelligence based chip design: Transformers on Chip


This is a departure from my usual tech blog. In the current series, titled Artificial-Intelligence-based-chip-design, we present the use of Artificial Intelligence to design FPGA chips for Ultra-Low-Latency speech transformers. It showcases our products as well as our skills in the respective domains. The series started as an experiment to lower latency for a seq-to-seq LSTM based speech synthesis system. After a thorough analysis, we broke the problem down into (1) mapping the matrix multiplications of deep learning transformers onto an FPGA chip (Transformers On Chip, part-1) and (2) co-designing the FPGA chip for the appropriate workload (Chip Design, part-2). In part-1 we look specifically at the resources available on an FPGA chip and a few different strategies for mapping the workload (matrix multiplication) onto that hardware.
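One common mapping strategy is tiling: stream small blocks of the matrices through on-chip buffers and a MAC array. The sketch below mimics that schedule in NumPy; the tile size is a placeholder for real on-chip buffer capacity, not a figure from the article.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Compute A @ B one (tile x tile) block at a time.

    On an FPGA, each block would be staged in BRAM and fed to a systolic/MAC
    array; here the triple tile loop simply reproduces that schedule in software.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(8, 8).astype(np.float32)
B = np.random.rand(8, 8).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-5)
```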

Meditating with microprocessors Series: Part-5: Appendix:Tools of the trade(part-5)


The current article is part of a bigger series titled Meditating-with-microprocessors, in which I demonstrate the use of Artificial Intelligence to tune microprocessors for Ultra Low Latency and Realtime loads. There are 5 parts to the series: Artificial Intelligence based Hardware(Microprocessor) tuning: Implementing a very simple idea (part-1), A crashcourse in Microarchitecture and Linux CPUIDLE interface (part-2), Trading off power for UltraLowLatency (part-3), Artificial Intelligence guided Predictive MicroProcessor tuning (part-4), and Appendix: Tools of the trade (part-5). In the current article, we drill down into the inner workings of Ftrace and eBPF. With eBPF based techniques and frameworks, Linux tooling has gained tracing superpowers; its capabilities now exceed those of DTrace on Solaris. This is not an exhaustive section and it does not do a breadth-first scan of Linux tooling. I am highlighting a couple of tools, why they work for us, and what makes them special. Tools play a vital role in verifying or rebutting the theories we make about the systems we design. Building a system is like building a beehive: little blocks you put together to make a whole. Those little blocks fit more snugly if the theory is rock-solid, and tools provide the evidence for that.
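As a taste of how little code an eBPF tracer needs, here is a minimal sketch using the BCC Python bindings (it assumes the bcc package is installed and requires root); it attaches a kprobe to the clone syscall and prints a line whenever a process is created. It is a generic illustration, not code from the article.

```python
from bcc import BPF

# eBPF program compiled and loaded into the kernel by BCC.
program = r"""
int hello(void *ctx) {
    bpf_trace_printk("new process created\n");
    return 0;
}
"""

b = BPF(text=program)
# Resolve the architecture-specific syscall symbol and attach the probe.
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()   # blocks and streams the kernel trace pipe to stdout
```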

Meditating with microprocessors Series: Part-4: Artificial Intelligence guided Predictive MicroProcessor tuning


The current article is part of a bigger series titled Meditating-with-microprocessors, in which I demonstrate the use of Artificial Intelligence to tune microprocessors for Ultra Low Latency and Realtime loads. There are 5 parts to the series: Artificial Intelligence based Hardware(Microprocessor) tuning: Implementing a very simple idea (part-1), A crashcourse in Microarchitecture and Linux CPUIDLE interface (part-2), Trading off power for UltraLowLatency (part-3), Artificial Intelligence guided Predictive MicroProcessor tuning (part-4), and Appendix: Tools of the trade (part-5). In the current article, we explore the use of Artificial Intelligence to reduce latency and save power at the same time. As we have seen, left to itself the microprocessor will sense the load and configure itself for it (suitable C-states, P-states, etc.). However, this process is reactive at best, and if latency matters then it is already too late. By reactive I mean that the arrival of a remote message triggers an interrupt, and from there one thing leads to another before the microprocessor has configured itself for the load; causal is another word for it. With an Artificial Intelligence based model, can we reduce latency by predicting the load, preconfiguring the microprocessor for it, and then postconfiguring it to save power once the load has tided over? In other words, can we make the process proactive and predictive?
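A toy sketch of the proactive loop looks like this: predict the next interval's load from recent history and reconfigure ahead of time. The predictor, the threshold, and the preconfigure/postconfigure/load-reading helpers below are all hypothetical placeholders, not the model or interfaces used in the article.

```python
import time
from collections import deque

history = deque(maxlen=16)

def predict_next_load(samples):
    # Naive predictor: exponential moving average of recent load samples.
    ema = samples[0]
    for s in samples[1:]:
        ema = 0.7 * ema + 0.3 * s
    return ema

def preconfigure_for_latency():   # hypothetical: e.g. restrict deep C-states
    print("restricting deep idle states")

def postconfigure_for_power():    # hypothetical: e.g. allow deep C-states again
    print("re-enabling deep idle states")

def read_current_load():          # placeholder for a real telemetry source
    return 0.0

for _ in range(100):
    history.append(read_current_load())
    if len(history) == history.maxlen:
        if predict_next_load(list(history)) > 0.5:   # hypothetical threshold
            preconfigure_for_latency()
        else:
            postconfigure_for_power()
    time.sleep(0.1)
```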

Meditating with microprocessors Series: Part-3: Trading off power for UltraLowLatency


The current article is part of a bigger series titled Meditating-with-microprocessors, in which I demonstrate the use of Artificial Intelligence to tune microprocessors for Ultra Low Latency and Realtime loads. There are 5 parts to the series: Artificial Intelligence based Hardware(Microprocessor) tuning: Implementing a very simple idea (part-1), A crashcourse in Microarchitecture and Linux CPUIDLE interface (part-2), Trading off power for UltraLowLatency (part-3), Artificial Intelligence guided Predictive MicroProcessor tuning (part-4), and Appendix: Tools of the trade (part-5). In the current article, I present irrefutable evidence that if the processor is configured correctly with a goal in mind (e.g. UltraLowLatency), the OS jitter caused by the processor can be nearly eliminated and the system becomes much more predictable. This has a huge, positive impact on the latency of the system: a substantial improvement from configuring the Core, and an even larger one from configuring the Uncore. The larger improvement from the Uncore is to be expected because there is a whole lot more circuitry in the Uncore. However, this is a trade-off that comes at the cost of expending more power.
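One widely used way to make exactly this power-for-latency trade on Linux is the PM QoS interface: hold /dev/cpu_dma_latency open with a low value and the kernel keeps CPUs out of deep C-states for as long as the file descriptor stays open. The sketch below (requires root) is a generic illustration of the trade-off, not the exact tooling used in the article.

```python
import os
import struct

# Request a 0 us wakeup-latency constraint. The constraint is honored only
# while the file descriptor remains open; closing it restores power saving.
fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
try:
    os.write(fd, struct.pack("i", 0))   # 32-bit value, in microseconds
    input("Deep C-states suppressed; press Enter to restore power saving...")
finally:
    os.close(fd)
```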

Meditating with microprocessors Series: Part-2: A crashcourse in Microarchitecture and Linux CPUIDLE interface


The current article is part of a bigger series titled Meditating-with-microprocessors, in which I demonstrate the use of Artificial Intelligence to tune microprocessors for Ultra Low Latency and Realtime loads. There are 5 parts to the series: Artificial Intelligence based Hardware(Microprocessor) tuning: Implementing a very simple idea (part-1), A crashcourse in Microarchitecture and Linux CPUIDLE interface (part-2), Trading off power for UltraLowLatency (part-3), Artificial Intelligence guided Predictive MicroProcessor tuning (part-4), and Appendix: Tools of the trade (part-5). In the current article I lay the technical groundwork for later articles in the series to build on. The terms and technologies I'll be using later must be understood really well, so we look primarily at three concepts. First, the architecture of a modern microprocessor, in short, really really short: just enough to build a mental model of the microprocessor to work with. Second, how software (Linux) interfaces with the microprocessor: again, just enough to make sense of the data we gather from the microprocessor using various UltraLowLatency profilers and tracers. Third, helping programmers like you get started.
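To see the Linux CPUIDLE interface for yourself before reading on, the sketch below walks the standard sysfs layout and lists every idle state of cpu0 with its exit latency and usage counter. Reading these files needs no special privileges; writing the per-state "disable" file does.

```python
import glob
import os

base = "/sys/devices/system/cpu/cpu0/cpuidle"

def read(path):
    with open(path) as f:
        return f.read().strip()

for state in sorted(glob.glob(os.path.join(base, "state*"))):
    name = read(os.path.join(state, "name"))
    latency = read(os.path.join(state, "latency"))   # exit latency in microseconds
    usage = read(os.path.join(state, "usage"))       # times this state was entered
    print(f"{os.path.basename(state)}: {name:10s} latency={latency}us usage={usage}")
```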

Meditating with microprocessors Series: Part-1: Artificial Intelligence based Hardware(Microprocessor) tuning: Implementing a very simple idea


In the current series, titled Meditating-with-microprocessors, I demonstrate the use of Artificial Intelligence to tune microprocessors for Ultra-Low-Latency and Realtime loads. The techniques, in general, can be extended to other components of a computer system such as storage devices, memory, etc.; however, the article series and my work are currently restricted to Intel microprocessors only. In the future we may extend this to other hardware components of a computer system. This is a very specialized and intense field, so I intend to break it down, using a first-principles approach, into simpler pieces of technology that are easy to understand. There are 5 parts to the series: Artificial Intelligence based Hardware(Microprocessor) tuning: Implementing a very simple idea (part-1), A crashcourse in Microarchitecture and Linux CPUIDLE interface (part-2), Trading off power for UltraLowLatency (part-3), Artificial Intelligence guided Predictive MicroProcessor tuning (part-4), and Appendix: Tools of the trade (part-5). On balance, this is a documentation of my journey navigating these utterly specialized fields (microarchitecture and Artificial Intelligence): how to marry them together, the issues I faced, the respective solutions, what the benefits are (and how large), if any, and what to keep in the toolbox.

Natural Language Processing Series: Neural Machine Translation(NMT):Part-1: Highly Simplified, completely Pictorial understanding of Neural Machine Translation


In the current article, we look at Artificial Intelligence (AI) based systems that translate from one natural language to another. Natural language, as in the way humans speak, and how far we have come in designing machines that understand and act on it. This first part of the multi-part series is a highly simplified, completely pictorial walkthrough of Neural Machine Translation systems. Once you have a solid understanding of systems like these, we take a look at where they are currently being applied (case studies) to broaden the horizons. The remaining parts will be a rigorous, under-the-skin look (math, code and logic) at Neural Machine Translators, Attention and, the cherry on the cake, Google's Neural Machine Translator.

RNN Series:LSTM internals:Part-4: MultiRNNCell and Bidirectional RNNs


In this article (following part-1, part-2 and part-3), we look into composing LSTMs into multiple higher layers and into bidirectional RNNs. Though multiple layers are compute-intensive, they yield better accuracy, and so do bidirectional connections. More importantly, a solid understanding of the above paves the way for concepts like highway connections, residual connections, pointer networks, encoder-decoder architectures and so forth in a future article. I do this using a first-principles approach, for which I have a pure Python implementation, Deep-Breathe, of most complex Deep Learning models.
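The composition itself is easy to see in a bare-bones sketch: stacking means a second layer consumes the first layer's outputs, and bidirectionality means running the sequence backwards too and concatenating per time step. The cell below is a plain tanh RNN step used only as a stand-in; it is not the Deep-Breathe LSTM implementation.

```python
import numpy as np

def cell_step(x, h, W, U, b):
    return np.tanh(x @ W + h @ U + b)          # stand-in for an LSTM cell

def run_layer(xs, W, U, b):
    h = np.zeros(U.shape[0])
    outs = []
    for x in xs:                               # unroll over time
        h = cell_step(x, h, W, U, b)
        outs.append(h)
    return outs

T, d_in, d_h = 5, 3, 4
xs = [np.random.randn(d_in) for _ in range(T)]
params = lambda d: (np.random.randn(d, d_h), np.random.randn(d_h, d_h), np.zeros(d_h))

# MultiRNNCell-style stacking: layer 2 consumes layer 1's outputs.
layer1 = run_layer(xs, *params(d_in))
layer2 = run_layer(layer1, *params(d_h))

# Bidirectional composition: run over the reversed sequence and concatenate
# forward/backward outputs at each time step.
fwd = run_layer(xs, *params(d_in))
bwd = run_layer(xs[::-1], *params(d_in))[::-1]
bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(layer2[-1].shape, bi[0].shape)           # (4,), (8,)
```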

RNN Series:LSTM internals:Part-3: The Backward Propagation

In this part of the multi-part series (following part-1 and part-2), we look inside the LSTM backward propagation. This usually involves lots and lots of math. I love math, but I am not a trained mathematician; I have decent intuition, and intuition by itself can only get you so far. So I used my programming skills to validate the purely theoretical results often cited by trained mathematicians. I do this using a first-principles approach, for which I have a pure Python implementation, Deep-Breathe, of most complex Deep Learning models.
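The simplest way to validate derived gradients with code is numerical gradient checking: compare the analytical gradient against a centered finite difference. The checker below is generic and is illustrated on a tiny loss; the same idea applies to every gate gradient an LSTM backward pass produces.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Centered finite-difference gradient of scalar function f at array x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps; fp = f(x)
        x[idx] = old - eps; fm = f(x)
        x[idx] = old
        grad[idx] = (fp - fm) / (2 * eps)
        it.iternext()
    return grad

# Example: loss = sum(tanh(x)); analytical gradient = 1 - tanh(x)^2
x = np.random.randn(3, 3)
analytical = 1 - np.tanh(x) ** 2
numerical = numerical_grad(lambda v: np.sum(np.tanh(v)), x)
print(np.max(np.abs(analytical - numerical)))   # should be on the order of 1e-10
```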

RNN Series:LSTM internals:Part-2:The Forward pass

In this part of the multi-part series, we look inside the LSTM forward pass. Read part-1 before you come back here. Once you are back, in this article we explore the meaning, the math and the implementation of an LSTM cell. I do this using a first-principles approach, for which I have a pure Python implementation, Deep-Breathe, of most complex Deep Learning models.
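For orientation, here is a minimal, self-contained LSTM cell forward step in NumPy using the standard gate equations; the weight shapes and initialization are illustrative, and this is a sketch of the math rather than the Deep-Breathe code itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """W has shape (input_dim + hidden_dim, 4 * hidden_dim); the four gate
    pre-activations are computed in one matmul and then split."""
    H = h_prev.shape[0]
    z = np.concatenate([x, h_prev]) @ W + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2*H])          # forget gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate cell state
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

D, H = 3, 4
W = np.random.randn(D + H, 4 * H) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in np.random.randn(5, D):    # run 5 time steps
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)            # (4,) (4,)
```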

RNN Series:LSTM internals:Part-1:The Big Picture

LSTMs are at the heart of everything, directly or indirectly, from Time Series (TS) to Neural Machine Translators (NMT) to Bidirectional Encoder Representations from Transformers (BERT) to the Neural Turing Machine (NTM) to Differentiable Neural Computers (DNC), and yet they are not very well understood. This is a multi-part series that will unravel the mystery behind LSTMs, especially their gradient calculation when participating in a more complex model like NMT, BERT, NTM, DNC, etc. I do this using a first-principles approach, for which I have a pure Python implementation, Deep-Breathe, of most complex Deep Learning models.

Softmax and Cross-entropy

The marriage of Softmax and Cross-Entropy. Cross-entropy is the loss function applied after the prediction yhat; it measures the difference between the labels and the predicted values (yhat).
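Written out plainly for a single example, the pair looks like this; the logits and label below are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def cross_entropy(yhat, y):
    """y is a one-hot label vector; yhat is the softmax output."""
    return -np.sum(y * np.log(yhat + 1e-12))

logits = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])      # true class is index 0
yhat = softmax(logits)
print(yhat, cross_entropy(yhat, y))
```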

Softmax and its Gradient

From the perspective of deep neural networks, softmax is one of the most important activation functions, maybe the most important. I had trouble understanding it in the beginning, especially why it is chosen, its gradient, its relationship with the cross-entropy loss, and the combined gradient. In this article, I further dumb it down and add code to the theory. There is also another, selfish reason for reproducing these here: I will be able to refer to them while explaining more complex concepts like LSTMs, NMTs, BERTs, XLNETs, etc.
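The key result worth previewing is the combined gradient: when softmax is followed by the cross-entropy loss, the gradient of the loss with respect to the logits collapses to (yhat - y). A quick finite-difference check on random logits confirms it; the vector size and label below are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z) + 1e-12))

z = np.random.randn(5)
y = np.eye(5)[2]                           # one-hot label for class 2

analytic = softmax(z) - y                  # the combined softmax + cross-entropy gradient
numeric = np.array([
    (loss(z + dz, y) - loss(z - dz, y)) / (2 * 1e-5)
    for dz in np.eye(5) * 1e-5             # perturb one logit at a time
])
print(np.max(np.abs(analytic - numeric)))  # should be tiny (~1e-9)
```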

Introduction: Why another technical blog?

"A picture is worth a thousand words" is a cliché, but I believe in it; I have made a career out of it. Complex programming and logical concepts are best understood with the help of pictures.

MIT License © 2021 Mohit Kumar.