About

This is the companion website for the research “On the Usage of Databases in Open Source Projects”. The main goal is to investigate which and how DBMS are used in Java Open Source projetcs. This research has so far been divided into two main analyses: current and historical. In the Current Analysis, we investigate the use of DBMS in the current version of the projects. In the Historical Analysis, we investigate the adoption of DBMS throughout the projects’ life cycle.

Team

Vanessa Braganholo (UFF, Brazil)
Leonardo Gresta Paulino Murta (UFF, Brazil)
Igor Wiese (UTFPR, Brazil)
Igor Steinmacher (NAU, USA)
Marco Aurélio Gerosa (NAU, USA)
Camila Acácio de Paiva (UFF, Brazil)
Raquel Maximino de Barros Santos (UFF, Brazil)
Frederico Gomes de Paiva (UFF, Brazil)
João Felipe Pimentel (UFF, Brazil)

Seleciton of the Project Corpus

The table below shows the workflow we use to select the projects for our corpus. The table shows the name of the script, its purpose, the required input, and the produced output.

Name	Goal	Input	Output
collect.py	Queries projects’ metadata from GitHub using the v4 API	None	projects.xlsx
filter.ipynb	Applies some extra filters	projects.xlsx	filtered.xlsx
analyze.ipynb	Produces statistics about the final corpus	annotated_java.xlsx	None
download.py	Clones all repositories from the corpus	annotated_java.xlsx	None
reset.py	Tries to fix name collisions for case-insensitive File Systems	annotated_java.xlsx	None

Heuristics Extraction

To find out which DBMS is used by a given project, we use heuristics that are based on regular expressions. We use git grep to search the projects source code (one at a time) and store the results in a relational DBMS. We also use heuristics to find other information about database usage in our corpus, such as how queries are performed, and which vulnerabilities there are in the source code. The table below shows the workflow for executing the heuristics on our corpus. The table shows the name of the script, its purpose, the required input, and the produced output.

Name	Goal	Input	Output
extract_last.py	Runs git grep on the HEAD commit and populates the relational DBMS with the results	annotated_java.xlsx	None
extract.py	Runs git grep on the history and populates the relational DBMS with the results	annotated_java.xlsx	None
create_1st_level_heuristics.py	Generates .txt files that contains dbCode Heuristics	DataBase (Implementation Heuristics)	Path .first-level
extract_classes.py	Runs git grep and populates the database with dependencies of dbCode	Path .first-level	None
create_vulnerabilityDatabase.py	Produces a database about vulnerabilities	Vulnerability_Version_20061101_Date_20220913.xlsx	None
extract_historical_vulnerabilities.py	Runs git grep and populates the database with historial of vulnerabilities	DataBase	None

Results Analysis

Current Analysis

The table below shows the workflow for analyzing the results for the current version of the projects in our corpus.

Name	Goal	Input	Output
create_2nd_level_heuristics.py	Counts the results of bdCode and its dependencies to the project	DataBase (Second Level)	Path .second-level and usage_fan_in_file.xlsx
results_in_xlsx.py	Generates the xlsx that it will be used to analyze the results	DataBase	count_implementation.xlsx, count_sql.xlsx, database.xlsx, implementation.xlsx, implementation_names.xlsx, query.xlsx
results_database_characterization.ipynb	Produces statistics about database Heuristics	database.xlsx	None
results_implementation_characterization.ipynb	Produces statistics about implementation Heuristics	database.xlsx, implementation.xlsx, implementation_names.xlsx, query.xlsx	None

Historical Analysis

The table below shows the workflow for the historical analysis of the results. This analysis only requires the execution of the Heuristics Extraction script in historical mode: extract.py”.

Name	Goal	Input	Output
historical_implementation.py	Generates a coded dataset with the results of the history of the projects	DataBase	historical.xlsx
historical_analysis.ipynb	Produces statistics about DBMS adopted throughout the history of projects	historical.xlsx	historical_join.xlsx
historical_count_models.ipynb	Produces statistics and dataset about DBMS models	historical_join.xlsx, databases_models.xlsx	historical_join_db.xlsx
historical_graphs.ipynb	Produces graphs with statistics about DBMS models and project domains	historical_join_db.xlsx	None
historical_coocurrence_version1.ipynb	Generates association rules for the results found in the projects history first slice.	historical_join.xlsx	historical_rulesv1.xlsx
historical_coocurrence_version5.ipynb	Generates association rules for the results found in the projects history fifth slice.	historical_join.xlsx	historical_rulesv5.xlsx
historical_coocurrence_version10.ipynb	Generates association rules for the results found in the projects history last slice.	historical_join.xlsx	historical_rulesv10.xlsx
historical_coocurrence_filters_v1.ipynb	Applies filters to analyze the correlations found in the first version	historical_rulesv1.xlsx	None
historical_coocurrence_filters_v5.ipynb	Applies filters to analyze the correlations found in the fifth version	historical_rulesv5.xlsx	None
historical_coocurrence_filters_v10.ipynb	Applies filters to analyze the correlations found in the last version	historical_rulesv10.xlsx	None
historical_seqpatterns_format.ipynb	Converts the historical_join dataset to the file format required by the SPMF library	historical_join.xlsx	input_sequencial_init_in_out.txt, output_tam1.txt, output_tam3.txt, output_tam4_sid.txt
historical_seqpatterns_filters.ipynb	Applies filters to search for established replacement patterns and generates some data mining measures	output_tam1.txt, output_tam3.txt, output_tam4_sid.txt	pattern_selection_measures.xlsx

Name	Goal	Input	Output
related-work.py	Traverses the DBLP XML file using the SAX API to get papers from major Database and Software Engineering conferences and journals	venue_keys.txt	papers.xlsx
related-work.ipynb	Filters for papers with “database” on the title and that were published in Software Engineering venues	papers.xlsx	filtered_papers.xlsx

Excel Spreadsheet (validate that all fields in the spreadsheet are filled in correctly. The convertion of formulas may cause an error.)
Collection Scripts (see the installation instructions below to run the scripts on your computer)

Installation

Requirements

We assume you have Python 3.7+, Node 12.10+, and Git 2.23+ installed on your computer. OBS: At the moment, sqlalchemy-utils has an incompatibility with sqlalchemy 1.4.0b1. Please use an older version, for example sqlalchemy 1.3.23.

Steps for setting up the environment (needs to be done just once)

Configuring project base

Clone our repository:

~$ git clone https://github.com/gems-uff/db-mining.git

Go into the project directory:

~$ cd db-mining

Install pipenv (if it is not already installed):

~/db-mining$ python -m pip install pipenv

Prepare the Python environment:

~/db-mining$ pipenv install

Go into the React app directory:

~/db-mining$ cd web

Prepare the React app environment:

~/db-mining/web$ npm install

Configuring Okta authentication

This project uses Okta as the authentication mechanism. Follow the steps below to set up your credentials.

Access https://www.okta.com/ and create an account (if you do not already have one).
Access your panel and create a new application with the following settings:

Parameter	Value
Platform	Single-Page App
Base URIs	http://127.0.0.1:5000/
Login redirect URIs	http://localhost:3000/implicit/callback http://localhost:5000/implicit/callback http://127.0.0.1:5000/implicit/callback
Logout redirect URIs	http://localhost:5000/login http://localhost:3000/login http://127.0.0.1:5000/login
Implicit checkbox	Marked

Go to the file authentication.json and update the variables issuer and client_id to match your credentials.

Configuring database access

You can use either SQLite or PostgreSQL database.

Go to the file database.json.
Edit it according to the database you will use:

This JSON file has a drop_database field, which indicates whether you would like the application to drop the existing database and create a new empty one. If that is the case, the value of drop_database should be True. The database_type field specifies which database management system will be used: SQLite or PostgreSQL. The remaining fields depend on the type of database you are using.

If you are using SQLite, these are the mandatory fields of the JSON file:

{
  "drop_database": "False",
  "database_type": "sqlite",
  "database_name": "dbmining.sqlite"
}

If you are using PostgreSQL, these are the mandatory fields:

{
  "drop_database": "False",
  "database_type": "postgresql",
  "host": "none",
  "port": "none",
  "username": "none",
  "password": "none",
  "database_name": "dbmining"
}

Now, if you just want to run the project using the analysis we made with the databases and the projects we selected, follow the steps below. But, if you want to run the project to create your own analysis, go to Steps for creating your own analysis.

Steps for running the application

Running the scripts

Go into the project directory:

~$ cd db-mining

Activate the environment:

~/db-mining$ pipenv shell

Run the download.py script to clone all the repositories in the corpus:

~/db-mining$ python src/download.py

Run the reset.py to fix name colisions for case-insensitive file systems:

~/db-mining$ python src/reset.py

Run the extract.py to execute the Current Analysis or extract.py -s 10 -l all to execute the Historical Analysis and populate the database:

~/db-mining$ python src/extract.py

Starting the application

Go into the React app directory:

~/db-mining$ cd web

Build the application:

~/db-mining/web$ npm run build

Back to the root directory:

~/db-mining$ cd ..

Run the server.py to start the Flask server:

~/db-mining$ python src/server.py

Access the React app at http://127.0.0.1:5000

In this case, the URL http://127.0.0.1:5000 is served by Flask and uses the last build of the React app produced by npm run build. Alternatively, you can run the application with Node.js server if you want to immediately reflect your changes into the React app without the need of rebuilding it every time during development. If so, follow the remaining steps.

Add http://localhost:3000 as a trusted origin with CORS enabled in the Okta panel, API > Trusted origin.
Go into the React app directory:

~/db-mining$ cd web

Start the Node.js server:

~/db-mining/web$ npm start

Access the React app at http://localhost:3000.

The URL http://localhost:3000 is served by Node.js and has hot reload capability. Please, note that it is significantly slower than rebuilding the React app (i.e., npm run build) and serving using Flask (http://127.0.0.1:5000). As our architecture is based on a REST API, even when accessing unsing Node.js, the Flask server should be online (Step 9), to respond REST requests.

Steps for creating your own analysis

Steps for Current Analysis

Go into the project directory: ~$ cd db-mining
Run the results_in_xlsx.py to generate a one-hot coded dataset with the results of the current Analysis:

~/db-mining$ python src/results_in_xlsx.py

Execute the next scripts in Google Colab or Jupyter Notebook platforms.
Run the results_implementation_characterization.ipynb to produce statistics about DBMS adopted.

Steps for Historical Analysis

Go into the project directory:

~$ cd db-mining

Run the historical_implementation.py to generate a one-hot coded dataset with the results of the projects history:

~/db-mining$ python src/historical_implementation.py

Execute the next scripts in Google Colab or Jupyter Notebook platforms.
Run the historical_analysis.ipynb to produce statistics about DBMS adopted throughout the projects history.
Run the historical_count_models.ipynb to produce statistics and dataset about DBMS models.
Run the historical_graphs.ipynb to produce graphs with statistics about DBMS models and project domains.
Run the historical_coocurrence_version1.ipynb, historical_coocurrence_version5.ipynb, and historical_coocurrence_version1o.ipynb to generate association rules for the first, fifth and last slices of project history.
Run the historical_coocurrence_filters_v1.ipynb, historical_coocurrence_filters_v5.ipynb, and historical_coocurrence_filters_v1o.ipynb to apply filters to analyze the correlations found in the three moments of the projects’ history.
Run the historical_seqpatterns_format.ipynb to generate the standard input file for the SPMF library.
Run the historical_seqpatterns_filters.ipynb to filter the replacement patterns and generate the measures.

Spreadsheets description

There are two sets of spreadsheets. The first one is related to the selection of projects for our corpus. The second one is related to our search for related work. They are described below and can be found in the resources folder.

Project Corpus

Name	Content	# of projects
projects.xlsx	All public, non-fork, and active (with pushes in the last 3 months) projects with ≥1000 stars from GitHub on March 27, 2021	21,149
filtered.xlsx	All projects from projects.xlsx with ≥1000 stars, ≥1000 commits, ≥10 contributors, and Java programming languages	633
annotated_java.xlsx	All Java projects from filtered.xlsx with manual annotations classifying the domain of the projects and discarding inadequate projects	317

We searched the DBLP XML file for papers that have “database” on the title, and that were published in major Software Engineering conferences and journals. The DBLP XML file was downloaded on Setember 16th, 2019. We then conducted a snowballing. The spreadsheets below are the result of this search.

Name	Content	# of papers
papers.xlsx	All papers published in major Database and Software Engineering conferences and journals. The list of venues is specified in the `venue_keys.txt` file	40,730
filtered_papers.xlsx	All papers from the `papers.xlsx` file that have “database” in the title, filtered by Software Engineering venues and	260

Acknowledgements

We would like to thank CNPq and NSF for funding this research.

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

About

Team

Seleciton of the Project Corpus

Heuristics Extraction

Results Analysis

Current Analysis

Historical Analysis

Related Work

Installation

Requirements

Steps for setting up the environment (needs to be done just once)

Configuring project base

Configuring Okta authentication

Configuring database access

Steps for running the application

Running the scripts

Starting the application

Steps for creating your own analysis

Steps for Current Analysis

Steps for Historical Analysis

Spreadsheets description

Project Corpus

Related Work

Acknowledgements

License