Most source code files hosted on GitHub are actually clones of previously created files, according to a recent study conducted by a joint team of researchers from the University of California, Irvine, the Czech Technical University, Microsoft Research, and Northeastern University.
Researchers looked at 4.5 million original (non-forked) GitHub projects, holding a total of 482 million files. They found that only 85 million of those files were unique, or approximately 17.63% of all the analyzed files.
C++ came second, with 73% of all files being duplicates of other files, while Python recorded 71% code reuse, and Java only 40%.
Researchers also looked at duplicate files based on partial matches of the file’s content (based on token hashes), but the results were almost identical.
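To illustrate the difference between exact file matches and token-based matches, here is a minimal Python sketch. It is not the researchers' tooling. The tokenizer regex, the function names, and the sample files are all assumptions made for illustration, and unlike a real pipeline this toy tokenizer does not strip comments.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical tokenizer (an assumption, not the study's): identifiers,
# integer literals, and single punctuation characters. Whitespace is dropped.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def file_hash(text):
    """Hash the raw contents: only character-for-character copies match."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def token_hash(text):
    """Hash the token sequence: copies that differ only in spacing match."""
    tokens = TOKEN_RE.findall(text)
    return hashlib.sha256("\x00".join(tokens).encode("utf-8")).hexdigest()


def duplicate_share(files, key):
    """Fraction of files that are redundant copies under the given hash."""
    groups = defaultdict(list)
    for path, text in files.items():
        groups[key(text)].append(path)
    # Each group contributes one unique file; the rest are duplicates.
    return 1 - len(groups) / len(files)


# Made-up sample corpus: two exact copies, one re-spaced copy, one distinct file.
SAMPLE = {
    "a.py": "x = 1\n",
    "b.py": "x = 1\n",
    "c.py": "x  =  1\n",
    "d.py": "y = 2\n",
}

if __name__ == "__main__":
    print(duplicate_share(SAMPLE, file_hash))   # 0.25
    print(duplicate_share(SAMPLE, token_hash))  # 0.5
```

On this toy corpus the token-based measure finds more duplication than the exact-match measure, because the re-spaced copy is only caught once whitespace is ignored. In the study's data, by contrast, the two measures produced almost identical results.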
While package managers exist for other programming languages, NPM is currently the largest package manager in the world, with over 350,000 libraries, which is more than double the next most populated package registry, the Apache Maven repository.
As for the most “reappropriated” code, the situation is as follows:
C++: GNU ISO C++ Library, a particular student homework template, Arduino examples
Java: Minecraft-API, PhoneGap
Python: Cactus, Shadowsocks, Scons
Code reuse research is critical for other studies
“The source control system upon which GitHub is built, Git, encourages forking projects,” researchers say. “However, there is a lot more duplication of code that happens in GitHub that does not go through the fork mechanism, and, instead, goes in via copy and paste of files and even entire libraries.”
“This study has some important consequences. First, it would seem that GitHub, itself, might be able to compress its corpus to a fraction of what it is. Second, more and more research is being done using large collections of open source projects readily available from GitHub.
“Code duplication can severely skew the conclusions of those studies. The assumption of diversity of projects in those datasets may be compromised. DéjàVu can help researchers and developers navigate through code cloning in GitHub, and avoid it when necessary.”
The team’s research paper is entitled “DéjàVu: A Map of Code Duplicates on GitHub,” and is available for download from here and here. The study’s raw data is also available as MySQL dumps that can be downloaded from here.
November 20, 2017