1. Extracted the phrase from a given tweet that best exemplifies the provided sentiment.
2. Added two heads on top of the Transformer models to separately predict the
start and end indices of the selected phrase, instead of directly predicting
a span, then used Grad-CAM to visualize whether the predictions make sense.
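A minimal sketch of how such start/end heads could be decoded at inference time (function and variable names are illustrative, not from the original project): each head emits one logit per token, and the selected phrase is the (start, end) pair with start ≤ end that maximizes the summed logits.

```python
def decode_span(start_logits, end_logits):
    """Return the (start, end) pair maximizing start_logits[s] + end_logits[e], s <= e."""
    best_score, best_span = float("-inf"), (0, 0)
    for s, s_logit in enumerate(start_logits):
        for e in range(s, len(end_logits)):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    return best_span

# Toy example: logits favor the span covering "love this movie".
tokens = ["i", "really", "love", "this", "movie"]
start, end = decode_span([0.1, 0.2, 3.0, 0.1, 0.0],
                         [0.0, 0.1, 0.5, 0.3, 2.5])
print(tokens[start:end + 1])  # -> ['love', 'this', 'movie']
```

Predicting the two indices with independent heads keeps each output a simple per-token classification, while the start ≤ end constraint is enforced only at decoding time.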
3. Implemented LSTM, BERT, RoBERTa, ALBERT, and XLNet to increase the
diversity of the models' architectures.
4. Used sequence bucketing to accelerate training by dynamically padding each batch to the maximum sequence
length occurring in that batch.