Low-rate Deletion Correcting Code

About the Project

“Random-Greedy” is the short name for this project, aiming to generate and compare low-rate deletion correcting codes. This short name is also the name of out main artifact, a simple algorithm that provides better deletion correcting abilities, compared to other codes we checked. In the project, eight methodologies for generating codes (some of them with meta-parameters) were created, and compared for their deletion correcting abilities (so no insertions nor substitutions are allowed). The codes are low-rate, so each codeword has $n$ bits in total, and $O(log(n))$ of them are information bits. We achieve this by generating the codes with the number of codewords to be equal to the codeword length. The full description of the project is available in the project report PDF file.

This work is submitted as the final project in the course “Coding and Algorithms for Memories” (236379), at Taub Faculty of Computer Science, Technion - Israel Institute of Technology. The project was written by Orel Adivi (orel.adivi [at] cs.technion.ac.il) and Daniel Noor (daniel.noor [at] cs.technion.ac.il), and under the supervision of Daniella Bar-Lev and associate professor Eitan Yaakobi. The work was done in semester winter 2022-2023. The project is released under MIT license.

Usage

In order to generate the codes and use them, an installation of CPython 3.10 is required (the implementation is platform independent, and was tested on both Windows and Linux).

The project uses NumPy, Matplotlib, overrides, pylcs, and tqdm Python libraries, which can be installed using Pip package installer:

python -m pip install -r requirements.txt

Then, the codes can be imported, generated and used. For example, for generating RandomGreedyCode of length 100 and with 2 options for each selection point, and using it, the following Python code can be used:

import numpy as np
from codes.RandomGreedyCode import RandomGreedyCode

my_code = RandomGreedyCode(length=100, options=2)                   # generate the code
print(f'Codeword for the value 0:\t{my_code.encode(24)}')           # encode a value
print(f'Value for a codeword:\t{my_code.decode(np.zeros((100,)))}') # decode a codeword
print(f'Maximal number of deletions:\t{my_code.max_deletions()}')   # calculate deletion-distance

Other codes can be used similarly, as they share the interface defined in the Code class.

Codes

In this project, eight code classes are available:

RandomCode – this class generates code by selecting serially random codewords that were not previously chosen.
GreedyCode – this class generates code selecting codewords to be as different as possible, by flipping bits during a serial traversal of the previously
generated codewords.
LinSpaceCode – this class generates code by repeating a pattern of alternating $0$-s and $1$-s with a length that increases linearly.
LogSpaceCode – this class generates code by repeating a pattern of alternating pattern_false-s ($0$ is the default) and pattern_true-s ($1$ is the default) with a length that increases linearly.
RepetitionCode – This class generates code by taking all words of length $log(n)$ for $n$ the codeword length, and repeating each letter $n/log(n)$ times, thus generating codewords of length $n$.
VTRepetitionCode – this class generates code by taking all words from $VT_0(m)$ for a given parameter m and multiplying each letter $n/m$ times.
VTRepetitionNaryCode – this class generates code by taking all words from $VT_0(m, q)$ for given parameter m and a base q, converting the words to binary and multiplying each letter $n/m$ times.
RandomGreedyCode – this class generates code serially by generating a set of codeword candidates of size options, and then selecting the candidate with the largest distance from the previous codewords.

All these classes are derived from the class Code, which allows the selection of codeword length in the construction (with the parameter length) and provides the methods encode (for encoding a value), decode (for decoding a codeword), and max_deletions (for calculating the maximal number of deletions allowed).

Utilities

In the utils folder, additional classes and functions, used in the implementation of the different codes, are provided:

LevenshteinDistance.py – this file contains a function that computes Levenshtein deletion distance between an expected string and a given string.
LongestCommonSubsequence.py – contains a function that computes the length of the longest common subsequence of given two strings (the function uses a dynamic programming algorithm, and was replaced with the implementation in pylcs).
VTCode.py – contains a VT code generator, based on an implementation from a previous work.

Experiments

In order to compare the different code generation methodologies, and to tune meta-parameters for the relevant codes, we wrote a Python script that runs these experiments, and another Python script that generates the figures from these experiments. For running the two files, the following script can be executed:

python Experiments.py
python GenerateFigures.py

Running the experiments is expected to last several hours, so we also run this script in a GitHub action. Additionally, the results are available in the artifacts folder. Furthermore, for testing the utils folder files, we wrote a test file that can be executed using the following command:

python -m unittest Unittests.py

In the execution of the first two files, the following figures are generated:

Figure 1 – this figure shows the connection between the codeword length and maximal number of fixable deletions, for different codes.
Figure 2 – this figure shows the connection between the codeword length and maximal number of fixable deletions, for codes that can not generate the desired number of codewords.
Figure 3 – this figure shows the connection between the codeword length and maximal number of fixable deletions, for VTRepetitionNaryCode in different bases.
Figure 4 – this figure shows the connection between the codeword length and maximal number of fixable deletions, for RandomGreedyCode with different number of options to choose from in each iteration.

The figures are described and discussed in detailed in the project report PDF file.

Project Engineering

Design and Development

The project was designed in accordance with the object-oriented programming (OOP) principles. For documentation, a website was created and a SUPPORT.md file was written. The project was written using PyCharm Professional and was managed using GitHub.

Continuous Integration

In order to ensure the correctness of commits sent to the GitHub server, a continuous integration pipeline was set. These checks are run automatically for each pull request and each push. The following actions were set:

1) Run experiments - this action runs all the experiments and generates the related figures. 2) Run unittests - this action runs all the unittests for the utils folder files. 3) Style check - this action performs basic checks of the Python files for detecting syntax errors. 4) Compile report - this action compiles the LaTeX file to the PDF report file. 5) Website - this action updates the Random-Greedy website with the content of this README.md file. 6) Vulnerability check - this action help managing the Python code and its dependencies. 7) Dependency review - this action help managing the Python code dependencies for pushes. 8) Dependabot - this action helps update the versions of the dependencies.

For the relevant actions, the checks were run in all the supported Python version CPython 3.10 and on both Windows (Windows Server 2022) and Linux (Ubuntu 20.04).

Please feel free to contact us with any questions you have about this project.