First Milestone

10 June 2020 - 6 mins read time
Tags: GSoC 2020-Blog#2 Language Models TextAnalysis

The coding period for GSoC commenced on 1 June 2020.

I started by reading about SentencePiece’s Unigram model for ALBERT. Alongside that, I worked on the statistical language model, which is now complete and has been reviewed by my mentors.

Types of Language Models

Statistical Language Models: These models use traditional statistical techniques such as N-grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words.
Neural Language Models: These are relatively new methods in NLP and have surpassed statistical language models in effectiveness. They use different kinds of neural networks to model language. We will discuss them in the next blog post.

Implementation of Statistical Language Model

I am proud :smiley: to announce our Statistical Language Model framework in TextAnalysis.jl, inspired by NLTK.lm. It provides implementations of well-known language models and a framework for creating your own language model with high-level APIs.

TextAnalysis provides the following language models:

MLE - Base Ngram model.
Lidstone - Base Ngram model with Lidstone smoothing.
Laplace - Base Ngram language model with Laplace smoothing.
WittenBellInterpolated - Interpolated version of Witten-Bell smoothing.
KneserNeyInterpolated - Interpolated version of Kneser-Ney smoothing.
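
The first three models in the list are closely related: in the textbook formulation, Lidstone smoothing adds a constant gamma to every count, Laplace is the special case gamma = 1, and MLE is the unsmoothed case gamma = 0. The sketch below illustrates that relationship only. The function names and arguments are hypothetical and are not part of TextAnalysis.jl, whose internals may differ.

```julia
# Illustrative helpers (hypothetical names, not the TextAnalysis.jl API).
# count_wc : times `word` was seen after `context`
# count_c  : times `context` was seen as a context (must be > 0 when gamma == 0)
# vocab    : vocabulary size, including the unknown-token label
function lidstone_prob(count_wc, count_c, vocab; gamma=1.0)
    return (count_wc + gamma) / (count_c + gamma * vocab)
end

# MLE is Lidstone with gamma = 0 (no smoothing).
mle_prob(count_wc, count_c, vocab) = lidstone_prob(count_wc, count_c, vocab; gamma=0.0)

# Laplace is Lidstone with gamma = 1 (add-one smoothing).
laplace_prob(count_wc, count_c, vocab) = lidstone_prob(count_wc, count_c, vocab; gamma=1.0)

p_mle = mle_prob(1, 3, 10)                    # 1 / 3
p_lap = laplace_prob(1, 3, 10)                # (1 + 1) / (3 + 10)
p_lid = lidstone_prob(0, 3, 10; gamma=0.5)    # 0.5 / (3 + 5)
```

Note that an unseen word gets probability zero under MLE but a small non-zero probability under Lidstone and Laplace.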

APIs

To use the API, we first instantiate the desired model and then load it with the training set.

MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

KneserNeyInterpolated(word::Vector{T}, discount::Float64=0.1, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

(lm::<Languagemodel>)(text, min::Integer, max::Integer)

Arguments:

word: array of strings used to build the vocabulary.
unk_cutoff: tokens with counts greater than or equal to the cutoff value are considered part of the vocabulary.
unk_label: token used for unknown (out-of-vocabulary) words.
gamma: smoothing argument gamma.
discount: discounting factor for KneserNeyInterpolated.

For more information, see the docstrings of Vocabulary.
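
To make the roles of unk_cutoff and unk_label concrete, here is a minimal, self-contained sketch of how a vocabulary with a count cutoff could be built and then used to mask unknown tokens. The helpers `build_vocab` and `mask` are hypothetical names chosen for illustration; they only mirror the argument descriptions above and are not the package's Vocabulary implementation.

```julia
# Hypothetical helper: keep tokens whose count is >= unk_cutoff and
# reserve one entry for the unknown-token label.
function build_vocab(words::Vector{<:AbstractString}; unk_cutoff=1, unk_label="<unk>")
    counts = Dict{String,Int}()
    for w in words
        counts[w] = get(counts, w, 0) + 1
    end
    vocab = Dict(w => c for (w, c) in counts if c >= unk_cutoff)
    vocab[unk_label] = 1
    return vocab
end

# Hypothetical helper: replace every out-of-vocabulary token with the unknown label.
function mask(tokens, vocab; unk_label="<unk>")
    return [haskey(vocab, t) ? t : unk_label for t in tokens]
end

voc = ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan"]
vocab = build_vocab(voc)
masked = mask(["khan", "is", "my", "good", "friend"], vocab)
```

With the default cutoff of 1, every token in `voc` is kept, while "good" and "friend" are mapped to "<unk>" because they never appear in `voc`.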

julia> voc = ["my","name","is","salman","khan","and","he","is","shahrukh","Khan"]

julia> train = ["khan","is","my","good", "friend","and","He","is","my","brother"]
# voc and train are used to train vocabulary and model respectively

julia> model = MLE(voc)
MLE(Vocabulary(Dict("khan"=>1,"name"=>1,"<unk>"=>1,"salman"=>1,"is"=>2,"Khan"=>1,"my"=>1,"he"=>1,"shahrukh"=>1,"and"=>1…), 1, "<unk>", ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan", "<unk>"]))
julia> voc
11-element Array{String,1}:
 "my"      
 "name"    
 "is"      
 "salman"  
 "khan"    
 "and"     
 "he"      
 "is"      
 "shahrukh"
 "Khan"    
 "<unk>"   
# the "<unk>" token has been added to voc
julia> fit = model(train,2,2) # considering only bigrams
julia> unmaskedscore = score(model, fit, "is" ,"<unk>") # score returns P(word | context) without replacing out-of-vocabulary context words with "<unk>"
0.3333333333333333
julia> masked_score = maskedscore(model,fit,"is","alien")
0.3333333333333333
# as expected, maskedscore is equivalent to score with the out-of-vocabulary context replaced by "<unk>"

!!! NOTE

When you call `MLE(voc)` for the first time, it also updates the `voc` array you passed in by adding the "<unk>" token.
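
The score in the session above can be checked by hand. The following sketch counts bigrams over the training tokens after out-of-vocabulary words have been replaced with "<unk>", then computes the unsmoothed estimate of P(word | context). It assumes no sentence padding and uses plain dictionaries; `bigram_counts` and `mle_score` are hypothetical helpers, not the package's internals.

```julia
# Count bigrams as counts[context][word]; a plain-Dict stand-in for a fitted model.
function bigram_counts(tokens)
    counts = Dict{String,Dict{String,Int}}()
    for i in 1:length(tokens)-1
        ctx, w = tokens[i], tokens[i+1]
        inner = get!(counts, ctx, Dict{String,Int}())
        inner[w] = get(inner, w, 0) + 1
    end
    return counts
end

# Unsmoothed (MLE) estimate of P(word | context).
function mle_score(counts, word, context)
    haskey(counts, context) || return 0.0
    inner = counts[context]
    return get(inner, word, 0) / sum(values(inner))
end

# The training tokens with "good", "friend", "He" and "brother" already masked.
tokens = ["khan", "is", "my", "<unk>", "<unk>", "and", "<unk>", "is", "my", "<unk>"]
counts = bigram_counts(tokens)
p = mle_score(counts, "is", "<unk>")
```

Under these assumptions "<unk>" is followed once each by "<unk>", "and" and "is", so `p` is 1/3, which agrees with the value printed in the session.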

Evaluation Method

We provide the following evaluation methods for working with statistical language models.

`score`

Evaluates the probability of a word given its context, P(word | context).

score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

Arguments:

m: instance of a Langmodel struct.
temp_lm: output of calling the Langmodel instance on the training text.
word: the word to score.
context: context of the given word.

For Lidstone and Laplace models, score applies the corresponding additive smoothing.

For interpolated language models, it applies Kneser-Ney or Witten-Bell smoothing.
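
As a rough illustration of what interpolation means, the sketch below follows the textbook Witten-Bell formulation: the higher-order estimate is mixed with a lower-order probability, and the mixing weight grows with the number of distinct words seen after the context. The function `witten_bell_sketch` and its arguments are hypothetical, the lower-order probability is simply passed in as a number, and the package's actual implementation may differ in detail.

```julia
# counts_for_context : word => count for the words seen after one context
# lower_order_prob   : probability of `word` under the lower-order model (assumed given)
function witten_bell_sketch(counts_for_context::Dict{String,Int}, word, lower_order_prob)
    total = sum(values(counts_for_context))   # tokens seen after the context
    types = length(counts_for_context)        # distinct words seen after the context
    lambda = types / (types + total)          # weight given to the lower-order model
    higher = get(counts_for_context, word, 0) / total
    return (1 - lambda) * higher + lambda * lower_order_prob
end

after_unk = Dict("<unk>" => 1, "and" => 1, "is" => 1)
p_seen = witten_bell_sketch(after_unk, "is", 0.2)     # seen after the context
p_unseen = witten_bell_sketch(after_unk, "my", 0.2)   # never seen after the context
```

Here `lambda` is 0.5, so a word never seen after the context still receives half of its lower-order probability instead of zero.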

`maskedscore`

Evaluates the score after masking out-of-vocabulary words, i.e. replacing them with the unknown-token label.

The arguments are the same as for score.

`logscore`

Evaluate the log score of this word in this context.

The arguments are the same as for score and maskedscore

`entropy`

entropy(m::Langmodel, lm::DefaultDict, text_ngram::Vector{T}) where {T <: AbstractString}

Calculates the cross-entropy of the model for the given evaluation text.

The input text must be an array of n-grams that all have the same length.

`perplexity`

Calculates the perplexity of the given text.

This is simply 2 ^ cross-entropy for the text (the cross-entropy returned by entropy), so the arguments are the same as for entropy.
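
The relationship between log score, cross-entropy and perplexity can be sketched in a few lines. This assumes the NLTK.lm convention, in which cross-entropy is the negative mean of the base-2 log scores of the n-grams. The function names are hypothetical, and the input here is a plain vector of already computed probabilities rather than a model and n-grams.

```julia
# Base-2 log of a single probability.
logscore_sketch(p) = log2(p)

# Cross-entropy: negative average log2 probability over all n-grams.
entropy_sketch(probs) = -sum(logscore_sketch, probs) / length(probs)

# Perplexity: 2 raised to the cross-entropy.
perplexity_sketch(probs) = 2.0 ^ entropy_sketch(probs)

probs = [0.5, 0.25, 0.25, 0.5]      # P(word | context) for each n-gram in the text
h = entropy_sketch(probs)           # (1 + 2 + 2 + 1) / 4 bits
ppl = perplexity_sketch(probs)      # 2 ^ h
```

A lower perplexity means the model assigns higher probability to the evaluation text.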

Preprocessing

The following functions are provided for preprocessing:

everygram: returns all possible n-grams generated from a sequence of items, as an Array{String,1}.
padding_ngrams: pads both the left and the right of a sentence and outputs n-grams of order n. It also pads the original input array of strings.
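
To show what these two preprocessing steps produce, here is a small self-contained sketch. The helpers `pad_both` and `all_ngrams` are hypothetical stand-ins, and the padding symbols "<s>" and "</s>" are assumptions borrowed from the NLTK convention; check the package docstrings for the real names, defaults and signatures.

```julia
# Hypothetical helper: pad a token sequence on both sides for n-grams of order n.
function pad_both(tokens::Vector{<:AbstractString}, n::Integer; left="<s>", right="</s>")
    return vcat(fill(left, n - 1), tokens, fill(right, n - 1))
end

# Hypothetical helper: all n-grams with min_len <= n <= max_len, joined by a space.
function all_ngrams(tokens::Vector{<:AbstractString}, min_len::Integer, max_len::Integer)
    grams = String[]
    for n in min_len:max_len, i in 1:(length(tokens) - n + 1)
        push!(grams, join(tokens[i:i+n-1], " "))
    end
    return grams
end

padded = pad_both(["a", "b"], 2)              # one pad symbol on each side
grams = all_ngrams(["a", "b", "c"], 1, 2)     # unigrams first, then bigrams
```

Padding lets the model learn which words tend to start and end a sentence, since the first and last words then appear in n-grams of full length.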

Code

Future Milestones :checkered_flag:

I will be working on the following in the coming weeks:

AlbertTokenizer based on SentencePiece. We will also provide BERT's WordPiece tokenizer for our model, to give researchers more tokenization options.
Pretrained weights in BSON (converted from the pretrained models released by Google)
ALBERT Transformer
APIs for loading pretrained weights into the ALBERT model and tokenizer.