The end is just another beginning

Hello there,

The extraordinary journey of Google Summer of Code 2020 is coming to an end. In this post, I will summarize my GSoC journey and the work done so far, along with future goals and milestones.

Google Summer of Code logo

Over the past few months, I continued working on the NLP ecosystem of the Julia language. Initially, I proposed implementing ALBERT for my GSoC project; fortunately, I was able to extend my proposal to also cover a statistical language model and a language model interface.

I worked on the following projects during GSoC:

| Packages and PRs | Status | Open-source code |
|---|---|---|
| Language Model Interface | Approved | PR#210 |
| Statistical Tokenizer | Approved and merged | PR#51 |
| Converting TF weights to BSON | Released | Gist |
| ALBERT | Completed | PR#203 |
| ALBERT.jl | Completed | GitHub repo |
| GoogleDrive | Completed | GitHub repo |


1. Language Model Interface

In the first phase of Google Summer of Code, I implemented the Language Model Interface. It provides implementations of well-known language models and a framework for building your own language models with high-level APIs; it is complete and has been reviewed by my mentors. The blog post covering it is here.
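To give a flavour of the interface, here is a minimal sketch based on the TextAnalysis language-model documentation; the names (`MLE`, `score`) are as I recall them from PR#210, and the exact signatures may differ slightly:

```julia
using TextAnalysis

voc   = ["my", "name", "is", "salman", "khan", "and", "karan"]            # vocabulary
train = ["khan", "is", "my", "good", "friend", "and", "he", "is", "my", "brother"]

model = MLE(voc)            # maximum-likelihood language model over the vocabulary
fit   = model(train, 2, 2)  # fit n-gram statistics (here, bigrams) on the training tokens

# Estimate P("is" | "khan") from the fitted bigram counts.
score(model, fit, "is", "khan")
```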

2. Statistical Tokenizer

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open-vocabulary problem in neural machine translation. SentencePiece supports two segmentation algorithms: byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model. I implemented the SentencePiece encoder in WordTokenizers to make it available to Julia users. The implementation is described in the blog post Divergence - Tale of Sentencepiece Library, and the code can be found here.
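For illustration, a minimal usage sketch, assuming the WordTokenizers API from PR#51 (loading a pretrained SentencePiece model and calling it on a string; the helper names are as I recall them from the package README):

```julia
using WordTokenizers

spm = load(ALBERT_V1)                       # pretrained SentencePiece (unigram LM) model

tokens = spm("i love the julia language")   # segment into subword pieces, e.g. ["▁i", "▁love", ...]
ids    = ids_from_tokens(spm, tokens)       # map each subword to its vocabulary index
back   = sentence_from_tokens(tokens)       # reassemble the original sentence
```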

3. Converting TensorFlow Weights to the Desired Julia Format

We have converted the TensorFlow weights released by Google Research to BSON files.

In this version, we apply ‘no dropout’, ‘additional training data’ and ‘long training time’ strategies to all models.

The code for the conversion can be found here.
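The conversion follows a simple pattern; below is a minimal sketch (not the actual conversion script), assuming PyCall for reading the TensorFlow checkpoint and BSON.jl for writing the Julia-side file. The checkpoint path is hypothetical:

```julia
using PyCall, BSON

tf = pyimport("tensorflow")

ckpt_path = "albert_base_v2/model.ckpt-best"   # hypothetical checkpoint path
reader = tf.train.load_checkpoint(ckpt_path)

# Collect every variable in the checkpoint into a Dict of Julia arrays.
weights = Dict{String, Any}()
for (name, _) in reader.get_variable_to_shape_map()
    weights[name] = reader.get_tensor(name)
end

# Save to BSON so the weights can be loaded from Julia later.
BSON.@save "albert_base_v2.bson" weights
```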

4. ALBERT

ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details of ALBERT are described in my proposal.
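To make the parameter-reduction ideas concrete, here is an illustrative Flux sketch (not the ALBERT.jl code itself) of the two key techniques from the ALBERT paper, factorized embedding parameterization and cross-layer parameter sharing, using ALBERT-base's dimensions E = 128 and H = 768:

```julia
using Flux

vocab_size, E, H = 30_000, 128, 768

# BERT-style embedding: vocab_size × H ≈ 23M parameters.
bert_embedding = Flux.Embedding(vocab_size => H)

# ALBERT factorizes it into vocab_size × E plus E × H ≈ 3.9M parameters.
albert_embedding = Chain(Flux.Embedding(vocab_size => E), Dense(E => H))

# Cross-layer parameter sharing: one block reused L times instead of L distinct blocks.
shared_block = Dense(H => H, gelu)            # stand-in for a full transformer encoder block
encoder(x; L = 12) = foldl((h, _) -> shared_block(h), 1:L; init = x)
```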

The code resides in TextAnalysis PR#203 and is on hold until TextAnalysis is moved to Zygote-based Flux.

I have written the blog post First sight of albert, the land of Transformers and the following tutorial for ALBERT.


Other Packages and Blogs

The following packages were created as part of Google Summer of Code:

- ALBERT.jl
- GoogleDrive

The following blog posts were written as part of Google Summer of Code:

- the Language Model Interface post
- Divergence - Tale of Sentencepiece Library
- First sight of albert, the land of Transformers

Future Goals

I would also like to work on other parts of the Julia ecosystem.

Acknowledgement

I would like to thank Google and JuliaLang for giving me this amazing opportunity to meet the most amazing people of Julia Computing and other open-source contributors. I am also grateful to my mentors @Aviks (Avik Sengupta) and @Ayushk4 (Ayush Kaushal) for guiding me through my project.

To sum up, I would like to call it the summer of learning; this was the most productive summer of my life. I learnt how to write good software and better documentation. Looking back, it feels like these past four months passed far too quickly. I still remember anticipating the proposal acceptance results like it was yesterday.

References

- Sennrich, R., Haddow, B., and Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
- Kudo, T., and Richardson, J. (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. EMNLP 2018.
- Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942.

Fun fact: the title indicates that I will keep walking down this road, writing software, experimenting with machine learning models, … (and thousands of other things), sometimes making mistakes, and surely contributing to Julia and other open-source communities.