Emotions form an integral part of human interactions. The Intelligence Augmentation for AI Hackathon 2021 paves the way toward more empathetic AI systems by challenging participants to build systems that recognize emotions from audio. The best entry from our team, Prompt Engineers, is a system that leverages not only the audio features but also the semantics of the spoken words, fusing the two intertwined modalities to reach a runner-up position on the leaderboard with 61.38% accuracy. We further improve the latency of the approach by more than 42% via feature reuse, weight sharing and multi-task learning, at the cost of only a 0.2% drop in accuracy.
Our best performing model is a phono-linguistic model that leverages both the semantics of the spoken words and the speech features. We obtain speech features from HuBERT, a transformer model pretrained on speech, and language features from BERT, a language model run over the transcribed speech. The features from the two modalities are fused to achieve 61.38% accuracy. BERT features over the transcribed speech alone achieve 55.77% accuracy, whereas classifying on only the HuBERT speech features yields 58.98% accuracy. Together, the two modalities achieve the best performance.
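As a rough illustration of the fusion step, the sketch below concatenates a pooled HuBERT embedding with a pooled BERT embedding and classifies the joint vector; the layer sizes, dropout, and number of emotion classes are illustrative assumptions, not our exact configuration.

```python
import torch
import torch.nn as nn

class PhonoLinguisticHead(nn.Module):
    """Toy fusion head: concatenate audio (HuBERT) and text (BERT) utterance
    embeddings, then classify emotions from the joint representation."""

    def __init__(self, audio_dim=1024, text_dim=768, hidden_dim=256, n_classes=8):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, audio_feat, text_feat):
        # audio_feat: (batch, audio_dim) pooled HuBERT features
        # text_feat:  (batch, text_dim)  pooled BERT features over the ASR transcript
        return self.classifier(torch.cat([audio_feat, text_feat], dim=-1))
```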
We improve latency by training HuBERT in a multi-task setting, so that a single shared model both produces the audio features and transcribes the speech (ASR). This yields 42% fewer model parameters with only a 0.2% performance drop.
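The weight-sharing idea can be sketched as one HuBERT encoder feeding two heads, an emotion head and a CTC head for ASR, so the expensive audio features are computed only once; the head shapes and vocabulary size below are assumptions for illustration.

```python
import torch.nn as nn
from transformers import HubertModel  # requires transformers >= 4.10

class MultiTaskHubert(nn.Module):
    """Shared HuBERT encoder with an emotion head and a CTC (ASR) head."""

    def __init__(self, n_emotions=8, vocab_size=32,
                 model_name="facebook/hubert-large-ll60k"):
        super().__init__()
        self.encoder = HubertModel.from_pretrained(model_name)  # shared weights
        hidden = self.encoder.config.hidden_size
        self.emotion_head = nn.Linear(hidden, n_emotions)       # utterance-level task
        self.ctc_head = nn.Linear(hidden, vocab_size)           # frame-level task

    def forward(self, input_values, attention_mask=None):
        states = self.encoder(input_values,
                              attention_mask=attention_mask).last_hidden_state
        emotion_logits = self.emotion_head(states.mean(dim=1))  # mean-pool frames
        ctc_logits = self.ctc_head(states)                      # per-frame ASR logits
        return emotion_logits, ctc_logits
```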
We exported our conda environments for training the models and running the app.
- `train_env.yml`: Our environment for training. Create using `conda env create --name prompt --file=train_env.yml` & `conda activate prompt`
- `app_env.yml`: Our environment for the app. Create using `conda env create --name prompt_app --file=app_env.yml` & `conda activate prompt_app`
Please note that our `app_env` was run on a macOS 11.2 machine with an Intel processor, whereas our training (`train_env`) was done on a Linux machine with NVIDIA GPUs. The same conda environments may not work on other machines.
As an alternative to creating the environments from the YAML files, you may install the following dependencies individually:
Additional dependencies for running the webapp: `streamlit`, `plotly`

- `pip install streamlit`
- `pip install soundfile`
- `pip install sounddevice`
- `pip install pydub`
- `pip install transformers==4.10`
If you are training the AST model, also install the following dependencies: `matplotlib`, `numba`, `timm`, `zipp`, `wget`, `llvmlite`.
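For example, these can be installed in one go with pip:

```bash
pip install matplotlib numba timm zipp wget llvmlite
```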
To run the webapp:
1. Go to the `app` folder: `cd app`.
2. Place the exported model inside the `webapp` folder with the name `cpu_model.pt`.
3. Run `streamlit run app.py`.
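The steps above are all that is needed. Purely as a hypothetical illustration of the flow inside such an app (not the actual contents of `app.py`), a minimal Streamlit script could look like this:

```python
# Hypothetical sketch only; the real app/app.py differs in detail.
import io

import soundfile as sf
import streamlit as st
import torch

st.title("Speech Emotion Recognition")

# Load the exported CPU model placed at webapp/cpu_model.pt (assumed here to be a
# fully serialized torch model; a state_dict would need the model class instead).
model = torch.load("webapp/cpu_model.pt", map_location="cpu")
model.eval()

uploaded = st.file_uploader("Upload a .wav clip", type=["wav"])
if uploaded is not None:
    st.audio(uploaded)
    waveform, sample_rate = sf.read(io.BytesIO(uploaded.getvalue()))
    with torch.no_grad():
        logits = model(torch.tensor(waveform, dtype=torch.float32).unsqueeze(0))
    st.write("Predicted emotion index:", int(logits.argmax(dim=-1)))
```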
To train the linguistic model:
1. Download `TrainAudioFiles` & `TestAudioFiles` and place them inside the `dataset` folder.
2. Run `python3 run_asr.py` from inside the `linguistic` folder.
3. Run `python3 train.py` from inside the `linguistic` folder with the following optional command line arguments (an example invocation is shown after the list):
- `--batch_size=16`
- `--lr=1e-5`
- `--n_epochs=10`
- `--dummy_run`
- `--device=cuda`
- `--seed=1`
- `--test_model`
- `--bert_type=bert-base-uncased`
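For example, a typical training run combining these flags might look like:

```bash
python3 train.py --batch_size=16 --lr=1e-5 --n_epochs=10 --device=cuda --seed=1 --bert_type=bert-base-uncased
```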
To train the AST model, see `ast/README.md`; the dataset needs to be downloaded and kept inside `ast/egs/hack/data/`.
To train the phono (audio-only) model, go to the `phono` folder and run `bash run.sh`. You may edit `run.sh` to change the following arguments (a sketch of a full invocation follows the list):
- `--pooling_mode` Options: `["mean", "max", "sum"]`
- `--model_name_or_path`
- `--model_mode` Example arguments: `hubert` or `wav2vec2`
- `--per_device_train_batch_size` [Type: Integer]
- `--per_device_eval_batch_size` [Type: Integer]
- `--learning_rate`
- `--num_train_epochs`
- `--gradient_accumulation_steps` (set 1 for no accumulation)
- `--save_steps`, `--eval_steps`, `--logging_steps`
- `--save_total_limit`
- `--freeze_feature_extractor`
- `--input_column=filename`, `--target_column=emotion`, `--output_dir="output_dir"`, `--delimiter="comma"`, `--evaluation_strategy="steps"`, `--fp16`
- `--train_file="./../dataset/train_set.csv"`, `--validation_file="./../dataset/valid_set.csv"`, `--test_file="./../dataset/test_set.csv"`
- `--do_eval`, `--do_train`, `--do_predict`
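As a sketch of how these flags fit together, `run.sh` presumably wraps a single training-script invocation along these lines; the entry-point name `run_classifier.py` and the numeric values below are placeholder assumptions, so check `run.sh` itself for the real ones:

```bash
# Hypothetical outline of what run.sh invokes; script name and values are assumptions.
python3 run_classifier.py \
  --model_name_or_path="facebook/hubert-large-ll60k" \
  --model_mode=hubert \
  --pooling_mode=mean \
  --per_device_train_batch_size=8 \
  --per_device_eval_batch_size=8 \
  --learning_rate=1e-5 \
  --num_train_epochs=10 \
  --gradient_accumulation_steps=1 \
  --save_steps=500 --eval_steps=500 --logging_steps=100 \
  --save_total_limit=2 \
  --freeze_feature_extractor \
  --input_column=filename --target_column=emotion \
  --output_dir="output_dir" --delimiter="comma" \
  --evaluation_strategy="steps" --fp16 \
  --train_file="./../dataset/train_set.csv" \
  --validation_file="./../dataset/valid_set.csv" \
  --test_file="./../dataset/test_set.csv" \
  --do_train --do_eval --do_predict
```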
To train the phono-linguistic model, first extract the `train` and `test` set features from `phono_feat_extractor`:
1. From `linguistic`, put the `test.json` and `train.json` in the `phono_feat_extractor` folder.
2. Run `python3 merge_text.py` inside the `phono_feat_extractor` folder.
3. Run `bash run.sh` from inside the same folder. You may change the same arguments as mentioned above for the audio-only features. Additional argument for the BERT model: `--bert_name='bert-base-uncased'`.
4. This produces `train.pkl` and `test.pkl` files inside `phono_feat_extractor`. Put these files in the `phono_linguistic/data` folder.
5. Run `python3 bertloader.py` from the `phono_linguistic` folder to cache the dataloader for training.
6. Run `python3 train.py`.

Both `python3 bertloader.py` and `python3 train.py` take the following optional command line arguments. Make sure the same arguments are passed to the two commands (an example follows the list):
- `--seed=1` (type=int)
- `--batch_size=16` (type=int)
- `--lr=1e-5` (type=float)
- `--n_epochs=5` (type=int)
- `--dummy_run`
- `--device`
- `--wandb`
- `--bert_type=bert-base-uncased`
- `--model_name_or_path='facebook/hubert-large-ll60k'`
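For example, passing matching arguments to both steps:

```bash
python3 bertloader.py --seed=1 --batch_size=16 --lr=1e-5 --n_epochs=5 --bert_type=bert-base-uncased --model_name_or_path='facebook/hubert-large-ll60k'
python3 train.py --seed=1 --batch_size=16 --lr=1e-5 --n_epochs=5 --bert_type=bert-base-uncased --model_name_or_path='facebook/hubert-large-ll60k'
```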