The end is just another beginning

26 August 2020 - 4 mins read time
Tags: GSoC 2020-Blog#5 Summary TextAnalysis WordTokenizer GoogleDrive

Hello there,

The extraordinary journey of Google Summer of Code 2020 is coming to an end. In this post, I will summarize my GSoC journey and the work done so far, along with future goals and milestones.


Over the past few months, I continued working with the Julia Language in its NLP ecosystem. Initially, I proposed writing ALBERT as my GSoC project. Fortunately, I was able to extend my proposal to include a statistical language model interface (the Language Model Interface).

I worked on the following projects during GSoC:

Packages and PRs	Status	Open source code
Language Model Interface	Approved	PR#210
Statistical Tokenizer	Approved and merged	PR#51
Converting TensorFlow weights to BSON	Released	Gist
ALBERT	Completed	PR#203
ALBERT.jl	Completed	GitHub Repo
GoogleDrive	Completed	GitHub Repo

1. Language Model Interface

In the first phase of Google Summer of Code, I implemented the Language Model Interface. It provides implementations of well-known language models and a framework for creating your own language model with high-level APIs. This work is complete and has been reviewed by my mentors. The blog post about it is here.
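To give a feel for what a statistical language model computes, here is a minimal sketch of a maximum-likelihood bigram model. It is written in plain Python for illustration only and is not the TextAnalysis API; the names `train_bigram_mle` and `score` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_mle(sentences):
    """Count bigrams and return a function giving P(word | previous word)."""
    context_counts = Counter()           # how often each word appears as a context
    bigram_counts = defaultdict(Counter) # bigram_counts[prev][word] = count
    for sentence in sentences:
        # Pad with start and end markers so sentence boundaries are modelled too.
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][word] += 1

    def score(word, prev):
        # Maximum-likelihood estimate: count(prev, word) / count(prev).
        # Unseen contexts get probability 0.0 (no smoothing in this sketch).
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[prev][word] / context_counts[prev]

    return score

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
score = train_bigram_mle(corpus)
print(score("cat", "the"))
print(score("sat", "cat"))
```

Smoothed models change only the `score` step, for example by adding a constant to every count, which is why a shared interface over several models is convenient.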

2. Statistical Tokenizer

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problem in neural machine translation. SentencePiece supports two segmentation algorithms: byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model. I have implemented the SentencePiece encoder in WordTokenizer to help Julia users. The implementation is described in the blog post Divergence - Tale of Sentencepiece Library, and the code can be found here.
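As a rough illustration of the BPE idea mentioned above, the sketch below learns merge rules in the style of Sennrich et al.: it repeatedly finds the most frequent adjacent symbol pair and merges it. This is plain Python for illustration, not the SentencePiece or WordTokenizer implementation; the function names and the toy word frequencies are assumptions.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Toy word frequencies, made up for this example.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
print(vocab)
```

The unigram language model works in the opposite direction: it starts from a large candidate vocabulary and prunes it, rather than growing the vocabulary through merges.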

3. Converting TensorFlow Weights to the Desired Julia Format

We have converted the TensorFlow weights released by Google Research to the following BSON files:

ALBERT Version-1: base v1, large v1, xlarge v1, xxlarge v1

ALBERT Version-2: base v2, large v2, xlarge v2, xxlarge v2

In Version 2, the ‘no dropout’, ‘additional training data’ and ‘long training time’ strategies are applied to all models.

The code for the conversion can be found here.

4. ALBERT

ALBERT is “A Lite” version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details of ALBERT are described in my proposal.
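One of the parameter-reduction techniques described in the ALBERT paper is factorized embedding parameterization: instead of one vocabulary-by-hidden embedding matrix, the model uses a small embedding matrix followed by a projection up to the hidden size. The arithmetic can be sketched as follows; the helper name is hypothetical, the sizes are illustrative values in the range of a base-sized model, and this is not code from TextAnalysis.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Number of parameters in the token-embedding layer.

    Without factorization: one vocab_size x hidden_size matrix.
    With factorization: a vocab_size x embedding_size matrix plus an
    embedding_size x hidden_size projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

# Illustrative sizes (assumed for this example).
vocab_size, hidden_size, embedding_size = 30000, 768, 128

unfactorized = embedding_params(vocab_size, hidden_size)
factorized = embedding_params(vocab_size, hidden_size, embedding_size)

print(unfactorized)  # 23040000
print(factorized)    # 3938304
```

The saving grows with the hidden size, because only the small projection matrix depends on it. The paper's other reduction technique, sharing parameters across layers, is not shown here.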

The code resides in TextAnalysis PR#203 and is on hold until TextAnalysis is moved to Zygote-based Flux.

I have written the blog post First sight of albert, the land of Transformers, and the following tutorials for ALBERT:

ALBERT Transformer Tutorials: a. Fine-tuning b. Pre-training

Other Packages and Blogs

The following packages were created as part of Google Summer of Code:

1. GoogleDrive 2. ALBERT.jl

The following blog posts were written by me as part of Google Summer of Code:

1. Hitting the road 2. First Milestone 3. Divergence - Tale of Sentencepiece Library 4. First sight of albert

Future Goals

Moving TextAnalysis to the Zygote-based Flux version and completing PR #209
Implementing RoBERTa: with TextAnalysis.ALBERT and Transformers, we already have everything needed to build it

I would also like to work on other ecosystems of JuliaLang.

Acknowledgement

I would like to thank Google and JuliaLang for giving me this amazing opportunity to meet the most amazing people of Julia computing and other open source contributors. I am also grateful to my mentors @Aviks (Avik Sengupta) and @Ayushk4 (Ayush Kaushal) for guiding me through my project.

To sum up, I would like to call it the summer of learning; this was the most productive summer of my life. I learnt how to write good software and better documentation. Looking back, it feels like these past four months passed way too quickly. I still remember anxiously awaiting the proposal acceptance results like it was yesterday.

References

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations - Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates - Kudo
Transformers - Peter Cheng
google-research/albert
Neural Machine Translation of Rare Words with Subword Units - Rico Sennrich, Barry Haddow, Alexandra Birch

Fun fact: the title indicates that I will keep walking down this road, writing software, experimenting with machine learning models (and thousands of other things), maybe making mistakes sometimes, and surely contributing to Julia and other open-source communities.