Doing Market Research with GitHub


A few years back, Big Data was very hot, but now it is not. Artificial Intelligence is now the darling of the world. How many startups jumped onto the Big Data bandwagon and are still not making any money? So what happened to all those glowing market research reports on Big Data?

How do you find out what is really hot before jumping into the fray? You need to do your own assessment. To find out what is really going on in the market, you cannot simply follow market trends or market research without doing any fact-checking. Customers showing interest does not automatically translate into a sales opportunity for a product. One way to find out is to look at user engagement. As an example, this article shows you how to get meaningful and relevant metrics for measuring user engagement, using a simple step-by-step guide.

Step #1 Start with a Credible Source

For any metric to be meaningful and relevant, the source has to be credible, meaning it provides information about what is happening now. For open source projects, a good information source is GitHub. What is GitHub? It is a web-based hosting service for version control using Git, and it is mostly used for computer code. Git lets anyone download a copy of a project’s source code to their own machine, make changes, and then, whenever they feel like it, upload those changes back to the central repository. It does this in a way that lets everyone’s changes be merged together.

Most of the open source world, including Google, Facebook, Twitter, and even Microsoft, now houses its code on GitHub. With more than 28 million registered users across the globe and over 85 million repositories on the platform, GitHub is now among the top 100 most popular sites worldwide, and 49% of the Fortune 100 use GitHub Enterprise. The huge community of software developers on GitHub makes it an invaluable source of information for anyone doing market research on an open source project.

Step #2 Learn How to Extract Useful Data from Statistics

Aside from being a tool used by millions of developers, GitHub offers a lot of useful information about each project it hosts. Here’s an example of how you can use GitHub to research a new product. Let’s go to the repository page for TensorFlow. Here’s a screenshot of what you will see. The page is updated frequently, so the numbers may differ, but the format will be the same.



At the top of the page, you will see a few metrics: Watch, Star and Fork. “Watch” is the number of people watching the repository to receive notifications when new pull requests and issues are created. “Star” shows how many people are interested in the repository. And “Fork” is the number of people who have made their own copy of the repository.
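The same three numbers can also be read programmatically. The sketch below is a minimal illustration, assuming the public GitHub REST API endpoint https://api.github.com/repos/{owner}/{repo}. In its JSON response the Watch count appears as subscribers_count, the Star count as stargazers_count, and the Fork count as forks_count. The helper names fetch_repo and headline_metrics are hypothetical, and unauthenticated requests are rate limited.

```python
import json
import urllib.request

# Public REST endpoint for repository metadata.
API_URL = "https://api.github.com/repos/{owner}/{repo}"


def fetch_repo(owner: str, repo: str) -> dict:
    """Download repository metadata as a dict (needs network access)."""
    request = urllib.request.Request(
        API_URL.format(owner=owner, repo=repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def headline_metrics(repo_json: dict) -> dict:
    """Pick out the Watch, Star and Fork counts shown at the top of the page."""
    return {
        "watch": repo_json["subscribers_count"],
        "star": repo_json["stargazers_count"],
        "fork": repo_json["forks_count"],
    }


# Example call (not run here because it needs the network):
#   headline_metrics(fetch_repo("tensorflow", "tensorflow"))
```

Keeping the download and the extraction in separate functions means the extraction can be checked against a saved response without touching the network.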

Step #3 User Engagement Is What Matters, Not the Amount of Interest (Signal versus Noise)

Of the three, “Star” has been the most popular metric for gauging how popular an open source project is. In general, the more stars a project gets, the more popular it is. In this example, TensorFlow has over 100,000 stars, which makes it very popular. However, showing interest does not translate one-to-one into user engagement. The number of forks is only around 64,000. In other words, only about 64,000 people made a copy of the repository, compared with the 100,000 who expressed interest. That means not everyone who starred the project is doing something with it.
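The comparison above can be reduced to a single number. The sketch below is an illustrative heuristic, not a metric GitHub reports: it divides forks by stars, using the rounded figures quoted in this article. The function name fork_to_star_ratio is hypothetical, and since the people who fork and the people who star are not necessarily the same, the result is only a rough signal.

```python
def fork_to_star_ratio(stars: int, forks: int) -> float:
    """Return forks divided by stars as a rough engagement signal."""
    if stars <= 0:
        raise ValueError("stars must be positive")
    return forks / stars


# Rounded figures quoted in the article for TensorFlow.
ratio = fork_to_star_ratio(stars=100_000, forks=64_000)
print(f"{ratio:.0%}")  # prints 64%
```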

While the number of stars and forks is a good starting point, it does not show what the developers are actually doing. You need to go down to the next level of detail, such as the number of issues, recent submissions and the number of contributors, to get more information. To make this contribution activity easier to see, GitHub provides a very useful feature called Insights.

Step #4 Getting Meaningful Data Out of User Engagement Activities

Under the repository name, click on the Insights tab. You will see a page similar to the screenshot below, with the following menu options: Pulse, Contributors, Commits, Code Frequency, Dependency Graph, Network and Forks. The ones that will give you more information on user engagement are Pulse and Contributors.


Pulse is an overview of a repository’s activity. It gives you a good snapshot of user activity over a period of time (from 24 hours to a month). The overview includes a summary of active pull requests and active issues, and you can get more details by clicking on a closed issue. Looking at the overall summary of activity will give you a better understanding of the product’s status.

With the Contributors option, you can view the top 100 contributors to a repository below the Contributors graph. This is a very interesting option because it lets you see who the top contributors are, where they are located and which companies they are associated with. In the case of the TensorFlow project, many of them are in the San Francisco Bay Area and many of them work for Google. If you do a similar search on a competing project like MXNet, you will see that many of its contributors are in China and many of them work for Amazon. The question then is how you can make use of the data.

If your startup is looking to hire top talent, this is a good place to recruit new employees. And if you are a product manager, the location of the developers can give you a sense of where the product might be more popular. For example, MXNet might be more popular in China than in the US. If you were to develop a new product running on MXNet, you might want to focus more on the market in China. As you can see, you can find out what the developers or end users are actually doing by digging deeper into their activities. However, you have to know what you are looking for. This is a very nice way to get a better understanding of the market, and not just of what people are interested in.
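The location and company tallies described above can be sketched in a few lines. This is a minimal illustration, assuming each contributor profile is a dict shaped like the JSON returned by the public GitHub REST API at https://api.github.com/users/{login}, which carries free-text location and company fields that may be empty. The function name tally_field and the sample profiles are hypothetical placeholders, not real contributors.

```python
from collections import Counter


def tally_field(profiles: list, field: str) -> Counter:
    """Count how often each value of a profile field appears."""
    counts = Counter()
    for profile in profiles:
        # The field may be missing or null; company names often start with "@".
        value = (profile.get(field) or "").strip().lstrip("@")
        counts[value or "unknown"] += 1
    return counts


# Made-up placeholder profiles, for illustration only.
profiles = [
    {"login": "dev_one", "location": "City A", "company": "@example-co"},
    {"login": "dev_two", "location": "City A", "company": "example-co"},
    {"login": "dev_three", "location": None, "company": None},
]

for place, count in tally_field(profiles, "location").most_common():
    print(place, count)
```

Because these fields are free text, the same city or employer can be spelled in several ways, so the counts usually need some manual clean-up before you draw conclusions from them.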

Step #5 Know Thy Competition

Before you commit to developing a product on a particular platform, you probably want to know how it compares against its competitors. TensorFlow has multiple competing products, such as MXNet and Caffe. By using the same metrics as above to measure user engagement, you can see the adoption trend for each competing product. One word of caution: do not judge a product solely on its popularity. While it is safer to develop a new product on a platform that already has a lot of adopters, it is important to think first about what problem your product is trying to solve. For example, will the less popular platform deliver better performance for your product?
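A side-by-side comparison of this kind can be sketched as a small ranking routine. The example below is an illustration only: the ProjectMetrics class, the rank_projects function and the forks_per_star heuristic are hypothetical, and the numbers are made-up placeholders rather than measurements of any real project.

```python
from dataclasses import dataclass


@dataclass
class ProjectMetrics:
    name: str
    stars: int
    forks: int
    contributors: int

    @property
    def forks_per_star(self) -> float:
        """Rough engagement heuristic; 0.0 when there are no stars."""
        return self.forks / self.stars if self.stars else 0.0


def rank_projects(projects, key="forks_per_star"):
    """Sort projects from highest to lowest on the chosen attribute."""
    return sorted(projects, key=lambda p: getattr(p, key), reverse=True)


# Made-up placeholder numbers, not real data.
candidates = [
    ProjectMetrics("project_a", stars=1000, forks=600, contributors=50),
    ProjectMetrics("project_b", stars=400, forks=300, contributors=80),
]

for project in rank_projects(candidates):
    print(project.name, round(project.forks_per_star, 2))
```

Ranking on a different attribute, for example rank_projects(candidates, key="stars"), can give a different order, which is one reason not to rely on a single popularity number.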

Step #6 Trust but Verify

In addition to doing research online, talking to people in the same industry is a good way to validate what is going on in the market. Talking to people at a trade show or conference will often give you a lot of good information about a product or market segment. You can always ask them what they are seeing in the market. If it is someone you know, you can probably find out whether they are just getting customer inquiries or actual sales orders. Lastly, keep in mind that certain things might work for big companies but not for startups. A big company can afford to take a risk, but most startups do not have the luxury of starting over. Remember, always do your homework!