NLP in Academia vs The Real World

ML Talks

Interview with Harshit Surana, Author of Practical NLP and Co-Founder of Chaos Genius

Introduction

I stumbled upon the book Practical Natural Language Processing, and one of the authors, Harshit Surana, happened to join our bitgrit community.

We talked briefly about his ML research experience at Carnegie Mellon University and MIT, founding DeepFlux and Chaos Genius, writing Practical NLP, and more. It’s a compelling story.

Who are you?

I’m Harshit Surana, CTO and co-founder at Chaos Genius (YC W20), a data ops company. I did my graduate studies in Machine Learning at Carnegie Mellon, where I researched CAPTCHA and reCAPTCHA, and I also worked with MIT on knowledge graphs.

My last startup was DeepFlux, which was recently acquired and was used by many Fortune 500 companies, such as Disney. Given my involvement in NLP and Deep Learning, I also wrote the book Practical NLP with other researchers and industry experts.

Talk about your research at CMU and MIT.

My main research at Carnegie Mellon University (CMU) was on CAPTCHA and reCAPTCHA for data collection.

A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a program that can tell whether its user is a human or a computer.

An example of a reCAPTCHA challenge in 2007

reCAPTCHA is a system created by Luis von Ahn, my mentor for my research, who’s also the co-founder and CEO of Duolingo.

Google purchased reCAPTCHA several years ago, and now it is also used to collect street view data. And if you surf the internet, you’ll see this check box below.

The NoCAPTCHA reCAPTCHA

The fundamental idea behind them is this — tests that computers can create but cannot pass. Thus, it’s not easy for bots to masquerade as users. It’s like a reverse Turing test.

In my research, I was more involved in figuring out how to intelligently collect data from human beings. I was at CMU around the late 2000s, and Deep Learning was not in its current state.

At that time, many ML models were held back in various ways by the lack of data, and even when a lot of data was available, the models often weren’t good enough.

To solve that, we devised many creative ways for humans to play games and generate labeled data; that meant building interactive games whose play produces labels. One example is the ESP Game. Google acquired it, and it became the Google Image Labeler.

Aside from my work at CMU, I also worked at MIT on building common sense for machines. Human beings have an innate ability to reason about the world, but Deep Learning is still bad at common sense.

Many of these abilities were still behind; computers are still pretty bad at multi-step logical reasoning (A -> B -> C). What I did was build a game to collect common-sense data, then use statistical techniques to clean it and extract more information.

That data enriched ConceptNet, a common-sense dataset built over two decades. The game increased its coverage of common-sense entities by more than 50%.

What were some memorable experiences you had as a grad student?

Grad school was hard, but I learned a lot.

One course was so hard that my friends and I had to pull all-nighters for the assignments.

That course was Convex Optimization, and the professor was a student of Stephen Boyd, the Stanford professor who wrote the book Convex Optimization. We learned so much that we understood the inner workings of almost all ML systems.

Most people use these optimizers as black boxes, but having built all of them from scratch gives you a lot of background understanding of how ML systems are built at scale.

Another cool memory, not related to coursework: while helping out with some parts of reCAPTCHA, I witnessed a Google acquisition up close. That’s something you don’t get to see every day.

What inspired you to do research in ML?

I was involved in some research labs in India, which were more focused on NLP. And that essentially introduced me to the area.

I was generally interested in language, which is what first caught my attention. When I got to Carnegie Mellon, I became interested in the different areas I was working in.

Most of my research was on ML, crowdsourcing, and data labeling. And data labeling and crowdsourcing are also in the Human-Computer Interaction (HCI) area, which is orthogonal to ML. That was what I found inspiring.

What is crowdsourcing?

It’s essentially about using the wisdom of the crowd to get something valuable.

It could be a prediction or a label for an ML system. Many of our data labeling techniques involved crowdsourcing, but instead of paying people, we used fun (games) or security (CAPTCHAs) as the incentive.

What advice do you have for future ML grad students?

One thing is to figure out what you really want to do.

You have to ask yourself, “Do I want to do a Ph.D.?”

It can be a pretty hard and long journey. People often get very frustrated in the middle, since they could instead be doing applied ML research at FAANG companies.

So it’s important to know why you’re doing it, and to understand that the area is moving incredibly fast: what you build may be obsolete in five years.

Think of it as a journey where you learn a lot, not just the destination. Learning the process of building something stays with you a long way.

One more thing: it’s hard to get into a good grad school for ML because the field has become so popular. At many universities, 20% of the professors in the CS department work on ML, but 80% of the grad students want to do ML.

What is DeepFlux, and what did it solve?

We focused on using alternative data to provide insights to companies.

Alternative data is essentially data that isn’t available inside a company. For an e-commerce company, internal data would be sales and inventory; alternative data would be an estimate of what a competitor is selling. It’s data you don’t have, but can infer. Another example is hedge funds looking at satellite images to estimate foot traffic and beat Wall Street estimates.

We started out focusing on alternative data for the cinema industry, and DeepFlux provided many of the insights the Disney team used to decide where to hold movie launch events for Avengers: Endgame. It’s essentially an optimization problem. Traditionally, it was done by the marketing team based on gut feeling and domain knowledge; they didn’t have access to the data we were able to provide.

We’re also expanding to retail. Retailers often only know what their own products are and where they’re selling, but they also want to understand the general landscape. They may have a few competitors, and DeepFlux can help them understand what kinds of products those competitors are selling; which ones are selling well, and which are not.

What mistakes did you make as a founder?

One mistake was focusing on only a few big customers.

At the start, there’s something I found very interesting: it seems awesome to have some of the largest companies in the world using your product at a large scale.

But there are only so many companies like that in the world, right?

So what ends up happening is that your product becomes too focused on the use cases of a few large companies. I believe that if you’re building a product company, it’s better to start by focusing on medium-sized companies.

If you’re building a services company, or if you are building a company that you explicitly have decided you’ll only focus on Fortune 500, and you believe that all of your customers could be there, then it’s a different game. It’s generally a much more sales-heavy game.

But if you’re building a scalable product, you focus on big communities, the way PyTorch does in machine learning or Postman does in engineering.

If your target customers are not the largest companies, I believe you can iterate and figure out the product much faster. On a broader note, startups are all about figuring out a product that sells well first to a few customers, and then to thousands of customers.

That process of iteration is how you find product-market fit. And in that process, I believe it’s better to look for product-market fit somewhere other than the largest companies.

What’s the main motivation behind the book Practical NLP?

I have been involved in NLP and machine learning since 2006, and in 2015, we saw this insane rush to machine learning.

Everybody was interested in knowing more about it. However, we saw that even people at large companies were too driven by hype. They didn’t understand how things should work. For example, people end up applying the coolest state-of-the-art (SOTA) model even when it’s irrelevant, which is an unstructured way to do machine learning.

So an old friend of mine in India and I noticed this broad problem and realized there was a need for a book on applying NLP in industry. That was the foundation of the book, and the rest is history.

What’s your experience of writing the book?

In my research days, I wrote a lot of papers and journal articles, so I had writing experience. But I hadn’t written a book before, and actually, none of us had, except one co-author who had been helping review some technical books.

So she had some experience there, but broadly, we were not familiar with the process. More importantly, writing a book is not just about the writing; it’s also about finding the right publisher and the right pitch for the world.

A lot of other such things went into it. Besides, things like writer’s block can strike over a multi-year project, and you have to figure that out. So that was interesting.

However, one thing that we have been doing and I’ve been doing for a long while is reading a lot of good technical books, both theoretical and applied. And we had some books that we looked up to.

One of these books is Designing Data-Intensive Applications, also an O’Reilly book, which is now widely used for understanding how systems should work. We drew a lot of inspiration from it. Similarly, there was François Chollet’s book on Keras, Deep Learning with Python.

Those two books were big inspirations; we looked up to them and learned a lot from them.

For example, most applied books don’t have references; references are more common in theoretical books. But we wanted to give a high-level overview of many topics and let readers deep-dive if they want.

So we ended up adding hundreds of references. That’s one of the things we learned from those books.

Talk about the gap between NLP research and practical NLP.

Let me talk about how machine learning has been happening for the last decade in academia and industry.

So essentially, in academic research, there are a few benchmark datasets and a few benchmark problems that academics focus on. They try to devise algorithms that work well on them, and beating the previous best result means setting a new SOTA. For a bunch of these core problems, that is the holy grail for many academics.

Most of the advancement of the last decade happened this way: there are the GLUE benchmarks for NLP, and there’s ImageNet for computer vision. These datasets and competitions were the driving force behind a lot of innovation in academia and research.

But what happens when you apply those methods to your real-world problem? Real problems are often very different from, or at best some transformation of, these academically defined problems and datasets. And when you apply whatever’s happening in the research world to real-world problems, it doesn’t translate well. In fact, at times it can translate very badly.

One reason is that many of these datasets and problems are very well defined. NSF, DARPA, and other research organizations spend tens of millions of dollars creating those datasets. Models built on such clean datasets won’t translate well to organizations with limited labels and noisy data.

Two, you may need to spend a lot of time creating the right dataset for yourself. Collecting data is at times more important than applying the most complex statistical models.

When we were writing this book, we spoke to many senior folks from these companies, trying to understand how they do it with limited data, as well as folks who are quickly bootstrapping the system. And that is often one of the harder things people end up having to do.

If you’re running a company, people in the ML division are often asked about the ROI of their models. Reaching a baseline ROI early is very important to making your project successful. So one of the book’s focuses was ensuring you’re always tracking ROI alongside all the cool data work.

That requires a very different mindset, and there is a fixed set of patterns, design patterns, and techniques you can apply to ensure that what you build first is something all the stakeholders in your organization find useful and that starts adding value. Often, that’s not the case if you simply apply large academic models.

What advice do you have for startups implementing NLP?

My first question for them is, “Why do you need NLP?” At times, simple heuristics can give better results than a model, unless you have massive amounts of data.

People often lose sight of the fact that for NLP, or any machine learning, to work, you need a lot of data for your specific task unless you’re doing something very generic. People aren’t applying NLP for part-of-speech tagging or parsing; they’re applying it for a specific end goal. And that end goal may require a lot of real labeled data that many startups don’t have.

So the first job is figuring out which heuristics and rules can serve as your first, gut-level model, even though that’s not a machine learning model.

Secondly, determine the processes you can use to collect more data over time so that you have enough data to apply a modern NLP model and get even better results. And I feel that many startups, whether it’s an NLP, computer vision, or even traditional machine learning, often miss the first bit of “Why are they even using it?”

There’s also a caveat with NLP: you may need a lot of context, fine-tuning, and huge amounts of data to get real use out of the models. So that’s something to be wary of.
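As a toy illustration of what such a heuristics-first “gut model” might look like (the keyword lists and function names here are hypothetical, not from the book), a rule-based sentiment baseline can be shipped before any labeled data or ML exists:

```python
import re

# Hypothetical first "gut model": a rule-based sentiment baseline
# built from hand-picked keyword lists, with no training data at all.
POSITIVE = {"great", "love", "excellent", "amazing", "good"}
NEGATIVE = {"terrible", "hate", "awful", "bad", "broken"}

def heuristic_sentiment(text: str) -> str:
    """Classify text by counting positive vs negative keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(heuristic_sentiment("I love this product, it's great"))  # positive
print(heuristic_sentiment("The app is broken and awful"))      # negative
```

A baseline like this also doubles as a labeling aid: its outputs, corrected by humans, become the training data for the proper model that eventually replaces it.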

Where do you think NLP is headed right now?

Many shallow tasks are taken care of, like basic chatbots, semantic parsing, and so on.

However, they may not translate to very specific domains. A model that works on, say, medical literature is very different from one that works on industrial literature.

GPT-3, as an example, is amazing at many things, but these models still don’t understand what they are doing. Because they don’t, you’ll find it’s very easy to prime GPT-3 with certain initial prompts that lead it to something completely absurd.

That would honestly not happen even in the more traditional machine learning or statistical world. Many of these deep learning models are quite mysterious, and because of that, they can’t yet be relied on.

What I also see in both large and smaller companies is using GPT-3 as a black box and adding a layer on top that filters out any craziness the model might throw at you.
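A minimal sketch of that kind of guardrail layer, assuming a hypothetical fake_gpt3 function standing in for the real API call and a purely illustrative blocklist:

```python
# Hypothetical guardrail around a black-box text model. fake_gpt3 is a
# stand-in for a real API call; the wrapper rejects outputs that fail
# simple sanity checks before they ever reach users.
BANNED_TERMS = {"guaranteed cure", "insider tip"}  # illustrative blocklist
FALLBACK = "Sorry, I can't help with that."

def fake_gpt3(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned string."""
    return "Here is a guaranteed cure for everything!"

def safe_generate(prompt: str, max_len: int = 500) -> str:
    out = fake_gpt3(prompt)
    if len(out) > max_len:                             # length sanity check
        return FALLBACK
    lowered = out.lower()
    if any(term in lowered for term in BANNED_TERMS):  # content filter
        return FALLBACK
    return out

print(safe_generate("What should I do about my cold?"))
```

Real deployments layer on more than a blocklist (toxicity classifiers, factuality checks, human review), but the shape is the same: the model is never trusted to speak to users directly.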

Tell us what Chaos Genius is and what problem it solves.

Chaos Genius is an ML-powered analytics engine. It’s focused on outlier detection and root cause analysis. In this context, that means trying to understand what happened to your business and data metrics, the real cause behind it, etc.

The inspiration came from my experiences at DeepFlux, where we were trying to generate these insights for companies, as well as my co-founder’s past experience as a product owner, where she saw similar problems that were not tackled well. That laid the foundation for Chaos Genius.

We’re fortunate because Chaos Genius is a horizontal product in the sense that it can be applied to any industry. Companies ranging from large IoT systems to gaming companies, FinTech, e-commerce, and even blockchain companies are using our product.

What is outlier detection & Root Cause Analysis?

Now, every industry has its own set of metrics, and they want to understand when there’s an issue in those metrics, which they cannot always look at manually because they have so much data in so many dimensions.

With this complex data moving in and out and monitoring at a large scale, it becomes difficult to keep track of things. That is where outlier detection becomes important.

Outlier detection tells us that something happened, but not why it happened. That’s where root cause analysis comes in. And at times you may not even have an outlier, yet you still need to understand what’s going on, because the data keeps changing.

Let’s say you had 5.5 million in revenue last quarter, but this quarter you have 5.1 million. It’s not an outlier, since it’s within the normal range, but you still want to understand the real reason behind the drop. Root cause analysis can give you answers as to why that happened.
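As a toy sketch of these two ideas (not Chaos Genius’s actual algorithm; the history and segment numbers are made up), outlier detection on a KPI plus a naive root-cause drill-down might look like this:

```python
import statistics

# Toy illustration: flag a KPI value as an outlier with a z-score,
# then do a naive root-cause drill-down by comparing each segment's
# contribution to the change.
history = [5.0, 5.6, 5.2, 5.8, 5.4, 5.5]   # past quarterly revenue, $M
current = 5.1

mean = statistics.mean(history)
std = statistics.stdev(history)
z = (current - mean) / std
is_outlier = abs(z) > 3                     # common 3-sigma rule
print(f"z={z:.2f}, outlier={is_outlier}")   # within range -> not an outlier

# Root cause: split the drop across a dimension (e.g. region)
last_q = {"NA": 3.0, "EU": 1.5, "APAC": 1.0}
this_q = {"NA": 3.0, "EU": 1.1, "APAC": 1.0}
deltas = {k: this_q[k] - last_q[k] for k in last_q}
root_cause = min(deltas, key=deltas.get)    # segment with biggest drop
print(f"largest drop: {root_cause} ({deltas[root_cause]:+.1f}M)")
```

Production systems replace the z-score with models that handle seasonality and trend, and search many dimensions at once, but the two-step shape (detect, then attribute) is the same.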

With Chaos Genius, we save data scientists the time and effort of writing complex code to monitor big data and perform causal analysis. With a simple installation, they can connect to any data source and use it on any KPI.

What is the meaning of life?

42.

Jokes aside, I think life is inherently meaningless. I believe we as human beings attach meaning to things, and we try to find our sense of meaning. There’s a lot of darkness in the world, and we must supply our own light.

There’s no light outside. So it could be doing machine learning, creating great companies, saving the environment, or anything.

Thanks for reading!


Liked this article? Here are some other articles you may enjoy:

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our Discord server!

Follow Bitgrit’s socials 📱 to stay updated on workshops, events, and upcoming competitions!