Cross-language clone detection by learning over abstract syntax trees

This paper is to be presented at MSR'19.
The paper is available here.

In this page, we briefly present the different tools and datasets we built and used in this paper.

bigcode-tools GitHub repo

bigcode-tools is a set of tools for working with source code. It includes tools to fetch source code, transform it into ASTs, visualize the generated ASTs, and learn embeddings for AST nodes.

Documentation can be found in the repository README and a tutorial is available to quickly get started.
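To give a rough idea of what "transforming source code into an AST" means, here is a minimal sketch using Python's standard `ast` module. It does not use bigcode-tools and does not reproduce its AST format; the `node_types` helper is a hypothetical name introduced only for this illustration.

```python
import ast
from typing import List


def node_types(source: str) -> List[str]:
    """Parse Python source and return the type name of every AST node.

    Illustration only: bigcode-tools has its own AST representation,
    which is described in its README.
    """
    tree = ast.parse(source)
    # ast.walk yields the root node and all of its descendants.
    return [type(node).__name__ for node in ast.walk(tree)]


types = node_types("def add(a, b):\n    return a + b\n")
print(sorted(set(types)))
```

A sequence or tree of node types like this one is the kind of input over which node embeddings can be learned.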


We created two datasets containing a large amount of code in Java and in Python.

Java dataset [download link] (463M, md5sum: 043496410add508610c07e7d318d9875)
Contains all the Java files found on GitHub in the Apache organization.
Python dataset [download link] (204M, md5sum: f0f004849681a23443f0e1ca7dc325e6)
Contains all Python files of popular Python projects on GitHub.

More information about these datasets can be found in the paper.
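Each download above is listed with its md5sum, so an archive can be checked after it has been fetched. The following is a minimal sketch using Python's standard `hashlib`. The function names are introduced here for illustration, and the local file name in the usage comment is an assumption, since the actual archive names are not given on this page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with the expected value from this page."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Usage (hypothetical local file name):
# matches_checksum("java-dataset.tar.gz", "043496410add508610c07e7d318d9875")
```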

Clone detection tool GitHub repo

This repository contains the implementation of our prototype for cross-language clone detection. It uses TensorFlow and Keras for the model implementation.

Documentation can be found in the repository README.


We created a dataset of cross-language clones using data from the Japanese competitive programming website AtCoder.

SQLite3 database [download link] (76M, md5sum: 32b62552afc96faaf796c40c7f2bdfdd)
SQLite3 database containing the metadata, the source code, and the corresponding ASTs. See the repository README for information about the schema.
Raw data [download link] (122M, md5sum: b2846f2a1f3e1b9fe7fdfe3635630a5f)
Archive containing the raw data used to create the SQLite3 database. This file should generally not be needed.

More information about the data can be found in the paper and in the repository README.
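The schema itself is documented in the repository README and is not repeated here. As a minimal sketch for a first look at the downloaded database, the following uses Python's standard `sqlite3` module to list whatever tables and columns a file contains, without assuming any table names. The `list_tables` helper and the file name in the usage comment are hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return a dict mapping each table name to its list of column names."""
    connection = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        tables = {}
        for name in names:
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            columns = connection.execute(f'PRAGMA table_info("{name}")')
            tables[name] = [column[1] for column in columns]
        return tables
    finally:
        connection.close()


# Usage (hypothetical local file name):
# for table, columns in list_tables("clones.sqlite3").items():
#     print(table, columns)
```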