82% of the Code on GitHub Consists of Clones of Previously Created Files

Most source code files hosted on GitHub are actually clones of previously created files, according to a recent study conducted by a joint team of researchers from the University of California, Irvine, the Czech Technical University, Microsoft Research, and Northeastern University.

Researchers looked at 4.5 million original (non-forked) GitHub projects, holding a total of 482 million files. They found that only 85 million of those files were unique, or approximately 17.63% of all the analyzed files, which leaves roughly 82% as copies.

JavaScript projects contained the most duplicate files

The study looked only at source code projects written in C++, Java, JavaScript, and Python. Of the four, JavaScript projects contained the most duplicate code, with 94% of files being exact copies (matched by file hash) of another file hosted on GitHub.

C++ came second, with 73% of all files being duplicates of other files, while Python recorded 71% code reuse, and Java only 40%.
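To make the file-hash comparison concrete, here is a minimal Python sketch of how a duplicate share like the ones above could be computed. It is an illustration only, not the researchers' pipeline: the choice of SHA-256 and the names `file_hash` and `duplicate_share` are assumptions made for this example.

```python
import hashlib
from typing import Dict


def file_hash(content: bytes) -> str:
    """Hash the raw bytes of a file; identical files get identical hashes."""
    # SHA-256 is an assumption for this sketch; any stable hash would do.
    return hashlib.sha256(content).hexdigest()


def duplicate_share(files: Dict[str, bytes]) -> float:
    """Return the fraction of files that are not unique.

    Computed as 1 - (distinct hashes / total files), the same way
    85 million unique files out of 482 million gives roughly 82% clones.
    """
    if not files:
        return 0.0
    distinct = {file_hash(content) for content in files.values()}
    return 1 - len(distinct) / len(files)


# Hypothetical mini-corpus: three copies of one file plus one other file.
corpus = {
    "project_a/lib/util.js": b"module.exports = 1;\n",
    "project_b/vendor/util.js": b"module.exports = 1;\n",
    "project_c/util.js": b"module.exports = 1;\n",
    "project_c/index.js": b"console.log('hi');\n",
}
share = duplicate_share(corpus)  # 2 distinct hashes over 4 files
```

Because the hash covers the raw bytes, a single changed character or a different line ending is enough to make two files count as distinct under this measure.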

Researchers also looked at duplicate files based on partial matches of the file’s content (based on token hashes), but the results were almost identical.
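A token hash can be sketched in a few lines of Python. The idea is to hash the sequence of code tokens rather than the raw bytes, so that files differing only in spacing or line breaks still match. The regular expression and the `token_hash` name below are assumptions for illustration; the tokenizer is deliberately crude (it does not strip comments, for instance) and is not the one used in the study.

```python
import hashlib
import re

# Crude tokenizer (an assumption for this sketch): identifiers, integer
# literals, or any single non-whitespace character. Whitespace is dropped.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def token_hash(source: str) -> str:
    """Hash the token sequence of a source file, ignoring whitespace."""
    tokens = TOKEN_RE.findall(source)
    # Join with a separator that cannot appear inside a token.
    joined = "\x00".join(tokens)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


original = "function add(a, b) { return a + b; }"
reformatted = "function add(a,b){\n  return a+b;\n}"
changed = "function add(a, b) { return a - b; }"

same_tokens = token_hash(original) == token_hash(reformatted)
different_tokens = token_hash(original) != token_hash(changed)
```

Here `original` and `reformatted` would fail a byte-level file-hash comparison but produce the same token hash, while `changed` differs in one operator and so hashes differently.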

NPM to blame for most of the JavaScript file duplicates

The reason JavaScript contained the most reused code is simple to explain: NPM, the de facto package manager for client- and server-side JavaScript projects.

While package managers exist for other programming languages, NPM is currently the world’s largest package registry, with over 350,000 libraries, which is more than double the next most populated package registry, the Apache Maven repository.

Because NPM offers more libraries, developers use it more, and JavaScript projects end up importing more libraries than projects in other programming languages, hence the high amount of reused code.

As for the most “reappropriated” code, the situation is as follows:

C++: GNU ISO C++ Library, a particular student homework template, Arduino examples
Java: Minecraft-API, PhoneGap
JavaScript: PhoneGap’s Hello World Template, OctoPress, a BlueMix template
Python: Cactus, Shadowsocks, Scons

Code reuse research is critical for other studies

“The source control system upon which GitHub is built, Git, encourages forking projects,” researchers say. “However, there is a lot more duplication of code that happens in GitHub that does not go through the fork mechanism, and, instead, goes in via copy and paste of files and even entire libraries.”

“This study has some important consequences. First, it would seem that GitHub, itself, might be able to compress its corpus to a fraction of what it is. Second, more and more research is being done using large collections of open source projects readily available from GitHub.

“Code duplication can severely skew the conclusions of those studies. The assumption of diversity of projects in those datasets may be compromised. DéjàVu can help researchers and developers navigate through code cloning in GitHub, and avoid it when necessary.”

The team’s research paper is entitled “DéjàVu: A Map of Code Duplicates on GitHub,” and is available for download from here and here. The study’s raw data is also available as MySQL dumps that can be downloaded from here.

November 20, 2017