Cross-language clone detection by learning over abstract syntax trees
On this page, we briefly present the tools and datasets we built and used for this paper.
bigcode-tools GitHub repo
bigcode-tools is a set of tools for working with source code. It includes tools to fetch source code, transform source code into ASTs, visualize the generated ASTs, and learn embeddings for AST nodes.
Documentation can be found in the repository README and a tutorial is available to quickly get started.
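To give a concrete sense of what "transforming source code into an AST" means, here is a minimal sketch that uses Python's standard `ast` module. It is not the bigcode-tools implementation and does not use its API; the helper name `node_types` and the sample snippet are hypothetical and serve only as an illustration.

```python
import ast


def node_types(source: str) -> list[str]:
    """Parse Python source and return the type name of every AST node.

    Nodes are listed in the order produced by ast.walk, which the
    standard library documents as unspecified.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


# Hypothetical sample input, chosen only for illustration.
SAMPLE = "def add(a, b):\n    return a + b\n"

if __name__ == "__main__":
    print(node_types(SAMPLE))
```

A sequence or tree of node types like this one is the kind of structure over which node embeddings can be learned; the actual node vocabulary used by bigcode-tools is described in its README.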
Dataset
We created two datasets containing a large amount of code in Java and in Python.
- Java dataset [download link] (463M, md5sum: 043496410add508610c07e7d318d9875)
  - Contains all the Java files found on GitHub in the Apache organization.
- Python dataset [download link] (204M, md5sum: f0f004849681a23443f0e1ca7dc325e6)
  - Contains all Python files of popular Python projects on GitHub.
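The MD5 checksums listed above can be used to check that a download completed without corruption. The sketch below uses only the Python standard library; the local file names in `EXPECTED` are assumptions, since the actual archive names depend on the download links.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()


# Checksums are the ones published on this page.
# The file names are hypothetical; adjust them to the downloaded archives.
EXPECTED = {
    "java-dataset.tar.gz": "043496410add508610c07e7d318d9875",
    "python-dataset.tar.gz": "f0f004849681a23443f0e1ca7dc325e6",
}

if __name__ == "__main__":
    for name, md5 in EXPECTED.items():
        if not Path(name).exists():
            print(f"{name}: not found, skipping")
        else:
            print(f"{name}: {'OK' if verify(name, md5) else 'MISMATCH'}")
```

The same `verify` helper applies to any other archive published with an md5sum.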
Clone detection tool GitHub repo
This repository contains the implementation of our prototype for cross-language clone detection. The model is implemented with TensorFlow and Keras.
Documentation can be found in the repository README.
Dataset
We created a cross-language clones dataset by using data from the Japanese competitive programming website AtCoder.
- SQLite3 database [download link] (75M, md5sum: be8de11ceae996d2a77b4c954dc369b4)
  - Contains the metadata, the source code, and the corresponding ASTs. See the repository README for information about the schema.
- Raw data [download link] (122M, md5sum: b2846f2a1f3e1b9fe7fdfe3635630a5f)
  - Archive containing the raw data used to create the SQLite3 database. This file should generally not be needed.
More information about the data can be found in the paper and in the repository README.
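As a starting point for exploring the SQLite3 database, the sketch below lists its tables and their `CREATE` statements with Python's standard `sqlite3` module. It makes no assumption about the actual schema, which is documented in the repository README; the file name `atcoder.sqlite3` and the helper `list_schema` are hypothetical.

```python
import sqlite3
from pathlib import Path


def list_schema(db_path):
    """Return a {table_name: CREATE statement} mapping for a SQLite database."""
    # sqlite3.connect would silently create an empty file for a missing path,
    # so fail early instead.
    if not Path(db_path).exists():
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


if __name__ == "__main__":
    # Hypothetical local file name; replace with the path of the downloaded database.
    db_file = "atcoder.sqlite3"
    if Path(db_file).exists():
        for table, create_sql in list_schema(db_file).items():
            print(table)
            print(create_sql)
    else:
        print(f"{db_file} not found")
```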