GitHub Repositories with Links to Academic Papers: Open Access, Traceability, and Evolution

Abstract: Traceability between published scientific breakthroughs and their implementation is essential, especially when Open Source Software implements bleeding-edge science in its code. However, aligning the link between GitHub repositories and academic papers can prove difficult, and the impact of such links remains unknown. This paper investigates the role of academic paper references contained in these repositories. We conducted a large-scale study of 20 thousand GitHub repositories to establish the prevalence of references to academic papers. We use a mixed-methods approach to identify Open Access (OA), traceability, and evolutionary aspects of the links. Although referencing a paper is not typical, we find that a vast majority of referenced academic papers are OA. In terms of traceability, our analysis revealed that machine learning is the most prevalent topic of repositories. These repositories tend to be affiliated with academic communities. More than half of the papers do not link back to any repository. A case study of referenced arXiv papers shows that most of these papers are high-impact and influential and do align with academia, and that they are referenced by repositories written in different programming languages. From the evolutionary aspect, we find very few changes to the papers being referenced or to the links to them.

 

Microsoft has reportedly acquired GitHub – The Verge

“Microsoft has reportedly acquired GitHub, and could announce the deal as early as Monday. Bloomberg reports that the software giant has agreed to acquire GitHub, and that the company chose Microsoft partly because of CEO Satya Nadella. Business Insider first reported that Microsoft had been in talks with GitHub recently.

GitHub is a vast code repository that has become popular with developers and companies hosting their projects, documentation, and code. Apple, Amazon, Google, and many other big tech companies use GitHub. Microsoft is the top contributor to the site, and has more than 1,000 employees actively pushing code to repositories on GitHub. Microsoft even hosts its own original Windows File Manager source code on GitHub. The service was last valued at $2 billion back in 2015, but it’s not clear exactly how much Microsoft has paid to acquire GitHub….”

Code Ocean | Discover & Run Scientific Code

“Our mission is to make the world’s scientific code more reusable, executable and reproducible.

Code Ocean is a cloud-based computational reproducibility platform that provides researchers and developers an easy way to share, discover and run code published in academic journals and conferences.

More and more of today’s research includes software code, statistical analysis and algorithms that are not included in traditional publishing. But they are often essential to reproducing the research results and reusing them in a new product or research. This creates a major roadblock for researchers, one that inspired the first steps of Code Ocean as part of the 2014 Runway Startup Postdoc Program at the Jacobs Technion Cornell Institute. Today, the company employs more than 10 people and officially launched the product in February 2017.

For the first time, researchers, engineers, developers and scientists can upload code and data in 10 programming languages and link working code in a computational environment with the associated article for free. We assign a Digital Object Identifier (DOI) to the algorithm, providing correct attribution and a connection to the published research.

The platform provides open access to the published software code and data to view and download for everyone for free. But the real treat is that users can execute all published code without installing anything on their personal computer. Everything runs in the cloud on CPUs or GPUs according to the user needs. We make it easy to change parameters, modify the code, upload data, run it again, and see how the results change….”