The end is just another beginning
The extraordinary journey of Google Summer of Code 2020 is coming to an end. In this post, I summarize my GSoC journey and the work done so far, along with future goals and milestones.
Over the past few months, I continued working with the Julia language in its NLP ecosystem. Initially, I proposed implementing ALBERT for my GSoC project. Fortunately, I was able to extend my proposal to include statistical language models through a language model interface.
I worked on the following projects in GSoC
| Packages and PRs | Status | Open source code |
|---|---|---|
| Language Model Interface | Approved | PR#210 |
| Statistical Tokenizer | Approved and merged | PR#51 |
| Converting TensorFlow weights to BSON | Released | Gist |
1. Language Model Interface
In the first phase of Google Summer of Code, I implemented the Language Model Interface. It provides implementations of well-known language models and a framework for creating your own language model with high-level APIs. This work is complete and has been reviewed by my mentors. The blog post about it is here.
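To give a feel for what a statistical language model computes, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. It is written in plain Python for illustration only and is not the interface from the PR; the function names `train_bigram_counts` and `laplace_score` and the toy corpus are hypothetical.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count contexts and bigrams over token lists padded with <s> and </s>."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab

def laplace_score(word, prev, context_counts, bigram_counts, vocab):
    """Estimate P(word | prev) with add-one smoothing over the vocabulary."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

# Toy corpus (hypothetical data, chosen only to keep the numbers small).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
context_counts, bigram_counts, vocab = train_bigram_counts(corpus)

# A bigram seen in training and one never seen after "the".
p_cat = laplace_score("cat", "the", context_counts, bigram_counts, vocab)
p_unseen = laplace_score("sat", "the", context_counts, bigram_counts, vocab)
print(p_cat, p_unseen)
```

Smoothing is what keeps the unseen bigram from getting probability zero, while the scores for a given context still sum to one over the vocabulary.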
2. Statistical Tokenizer
SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model.
I have implemented the SentencePiece encoder in WordTokenizers to make it available to Julia users.
The implementation is described in the blog post Divergence - Tale of Sentencepiece Library, and the code can be found here.
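As an illustration of the byte-pair-encoding idea mentioned above, the following is a minimal sketch of learning BPE merges in the style of Sennrich et al., working on a word-frequency table. It is plain Python, not the WordTokenizers code, and it simplifies what SentencePiece does (which operates on raw sentences). The helper names and the toy counts are hypothetical.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    word_freqs = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

# Toy word counts (hypothetical data for illustration).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

On this toy data the frequent suffix "est" ends up as a single symbol, which is how BPE lets rare words such as "widest" share sub-word units with more common ones.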
3. Converting TensorFlow Weights to the Desired Julia Format
We have converted the TensorFlow weights released by Google Research to the following BSON files
In this version, we apply ‘no dropout’, ‘additional training data’ and ‘long training time’ strategies to all models.
The code for conversion can be found here
ALBERT is “A Lite” version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details of ALBERT are described in my proposal.
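One of the parameter-reduction techniques described in the ALBERT paper is factorized embedding parameterization: instead of a single V x H embedding matrix, the model uses a V x E matrix followed by an E x H projection. The sketch below only does the arithmetic. The function name is hypothetical, and the sizes are illustrative values I picked to resemble a base-sized configuration, not numbers taken from this project.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, optionally with a factorized embedding.

    Without factorization: vocab_size * hidden_size.
    With factorization:    vocab_size * embedding_size + embedding_size * hidden_size.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

# Illustrative sizes (assumptions): vocabulary V, hidden size H, embedding size E.
V, H, E = 30000, 768, 128

full = embedding_params(V, H)           # 30000 * 768 = 23,040,000
factorized = embedding_params(V, H, E)  # 30000 * 128 + 128 * 768 = 3,938,304

print(f"full: {full:,}  factorized: {factorized:,}  ratio: {full / factorized:.1f}x")
```

The saving grows with the gap between H and E, which is why the hidden size can be scaled up without the embedding table growing at the same rate.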
The code resides in TextAnalysis PR#203 and is on hold until TextAnalysis is moved to Zygote-based Flux.
I have written the blog post First sight of albert, the land of Transformers and the following tutorial for ALBERT
Other Packages and Blogs
The following packages were made as part of Google Summer of Code
The following blogs were written by me as part of Google Summer of Code
Moving TextAnalysis to the Zygote-based Flux version and completing PR #209
Implementing RoBERTa. With Transformers, we already have everything needed to build it
I would also like to work on other ecosystems of JuliaLang
I would like to thank Google and JuliaLang for giving me this amazing opportunity to meet the most amazing people of Julia Computing and other open source contributors. I am also grateful to my mentors @Aviks (Avik Sengupta) and @Ayushk4 (Ayush Kaushal) for guiding me through my project.
To sum up, I would like to call it the summer of learning. This was the most productive summer of my life. I learnt how to write good software and better documentation. Looking back, it feels like these past four months passed way too quickly. I still remember waiting for the proposal acceptance results like it was yesterday.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations - Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates - Taku Kudo
- Transformers - Peter Cheng
- Neural Machine Translation of Rare Words with Subword Units - Rico Sennrich, Barry Haddow, Alexandra Birch
Fun fact: the title indicates that I will keep walking on this road, writing software, experimenting with machine learning models, … (and thousands of other things), maybe making mistakes sometimes, and surely contributing to the Julia and other open-source communities.