How are software tools and terms interrelated? What does the software ecosystem look like?
Use the co-occurrences of tags from over 5 million software questions on Stack Overflow to identify related terms.
A graph showing the relations between every tag on Stack Overflow, organized so that closely related terms are close together.
Pan and zoom using the mouse.
The size of each point represents the number of questions tagged with that tag. The points were automatically classified into 10 groups of closely related tags. The color of each point encodes which group it is in. The points are positioned such that the most closely related terms lie close together.
This map was made by scraping stackoverflow.com to find questions tagged with more than one subject and counting the co-occurrences of each pair of tags. In the map, each point represents a single tag. The map is generated by making each pair of tags attract one another with a strength proportional to the number of times the tags appear together in questions. In this manner the tags self-organize into a 'map' showing how the concepts of computer science and software development are related to one another.
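The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the original spider: the function name `count_cooccurrences` and the sample questions are hypothetical, and the input is assumed to be one list of tags per question.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(questions):
    """Count per-tag question totals and per-pair co-occurrences.

    `questions` is an iterable of tag lists, one list per question
    (an assumed input shape, not the spider's actual format).
    """
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in questions:
        # Deduplicate and sort so each pair is always stored in one order.
        unique_tags = sorted(set(tags))
        tag_counts.update(unique_tags)
        pair_counts.update(combinations(unique_tags, 2))
    return tag_counts, pair_counts


# Hypothetical sample data, not taken from the real scrape.
sample_questions = [
    ["python", "django"],
    ["python", "numpy"],
    ["python", "django", "orm"],
    ["ios", "objective-c"],
]

tag_counts, pair_counts = count_cooccurrences(sample_questions)
```

Sorting the tags before pairing means ("django", "python") and ("python", "django") land in the same counter entry, so each pair has a single co-occurrence count that can drive the attraction between its two points.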
The red cluster at the top-left of the map, centered around iOS, Objective-C and the iPhone, represents questions about developing for these devices. The operating system OS X sits to the right of the red cluster: it is closely related to iOS development, but its tight ties to other desktop OSes align it more closely with them.
Toward the upper-right of the map are the command line tools, shell scripts and tools related to kernel internals. Unsurprisingly, this is also where C settled on the map, along with hardcore debugging tools like Valgrind. The two outlying points around here (VHDL and Verilog) are hardware description languages.
To the far upper-right of the map lies a large amount of academic tooling, notably functional and logic programming languages, which are only lightly linked to the rest of the graph. This area also contains the tools and concepts of data mining, which are largely centered around MATLAB.
There are not many distinct keywords associated with version control or continuous integration, but just to the left of the iOS blob we see these tools in bright yellow. Initially it is surprising that Git is not among these points, but because of its strong links to Linux, Git has been pulled away from the rest of the version control group.
The web spider that collated this data is a Python script, powered by BeautifulSoup and requests. Data is dumped into a CSV file that can be read by the open-source graph visualization program Gephi, which was used to classify, position and visualize all of the points. The weight of the edge from one tag to another is the fraction of all questions carrying the first tag that also carry the second. For example, there are 600,000 questions tagged with "Python", of which 100,000 are also tagged "Django". The weight of the edge between Python and Django is therefore 100,000 / 600,000, or about 0.17. Groups are identified in the data using modularity maximization. It is these modularity groupings that decide the colors of the points.
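As a rough sketch of the weighting and export step, the snippet below computes the Python to Django weight from the figures quoted above (100,000 / 600,000, about 0.167) and writes it as a one-row edge list. The helper names `edge_weight` and `write_edges` are hypothetical, and the `Source`, `Target`, `Weight` header is an assumption about the column names the CSV import expects, not a detail confirmed by the original script.

```python
import csv
import io


def edge_weight(pair_count, source_count):
    """Fraction of the source tag's questions that also carry the target tag."""
    if source_count == 0:
        return 0.0
    return pair_count / source_count


def write_edges(rows, handle):
    """Write (source, target, weight) rows as a CSV edge list with a header."""
    writer = csv.writer(handle)
    # Assumed column names for the edge list; adjust to match the importer.
    writer.writerow(["Source", "Target", "Weight"])
    for source, target, weight in rows:
        writer.writerow([source, target, f"{weight:.4f}"])


# Figures from the example in the text: 600,000 Python questions,
# 100,000 of which are also tagged Django.
python_to_django = edge_weight(100_000, 600_000)

# An in-memory buffer stands in for the CSV file on disk.
buffer = io.StringIO()
write_edges([("Python", "Django", python_to_django)], buffer)
edge_csv = buffer.getvalue()
```

Note that a weight defined this way is directional: the Django to Python weight would divide the same 100,000 shared questions by the total number of Django questions, which gives a different value.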