Case Study
Academic Research Support with Custom NLP and Classification Models
Company |
Industry: Academic Research
Client
The InfoCoV project was led by the Faculty of Informatics and Digital Technologies at the University of Rijeka (UNIRi). The project focused on understanding how information spreads during crises, and its team developed a research framework for analyzing COVID-19-related communication on social media. This case study reflects a collaboration between academia and industry, in which Velebit AI supported the project with advanced NLP model development and implementation.
Impact
By building custom NLP models for Croatian-language data, we enabled the InfoCoV team to analyze sentiment and information spread during the height of the COVID-19 crisis. Our models helped uncover patterns in public communication, supported the publication of scientific papers, and gave the research team tools they can adapt and reuse in future work.
Problem
The project aimed to analyze how information and sentiment around COVID-19 spread on Twitter, specifically in Croatian, a language with limited NLP resources. The client team had collected a large dataset of tweets and associated metadata but needed custom-trained models to extract meaningful insights from it.
Our role focused on building language models, developing classification systems, and integrating multiple data sources into a unified workflow. The end goal was to predict tweet sentiment and to forecast how likely a tweet was to be retweeted.
Challenges
Massive communication datasets
The data included over 200,000 COVID-19-related tweets, over half a million user comments on COVID-19 articles on online portals, and over 180,000 full-text articles. The text varied widely in language and had little structure. Extracting accurate insights from this volume, especially during a fast-evolving global crisis, required robust models and targeted pre-processing.
Low-resource language limitations
Croatian lacks large pretrained NLP resources, making transfer learning less effective out of the box. We needed to train and fine-tune models specifically on Croatian COVID-related texts to reach usable performance levels.
COVID-19-related new vocabulary
New pandemic-related terms, hashtags, and phrases emerged quickly, many of which weren't captured by existing tokenizers. Accurately handling this vocabulary was key to effective sentiment and topic classification.
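One simple way to surface such vocabulary gaps is to count frequent corpus terms that are missing from a known vocabulary. The sketch below is only an illustration: `find_new_terms`, `known_vocab`, and the sample tweets are hypothetical and are not taken from the project.

```python
import re
from collections import Counter

# Matches hashtags and words; \w covers Croatian diacritics in Python 3.
TOKEN_RE = re.compile(r"#?\w+")

def find_new_terms(texts, known_vocab, min_count=2):
    """Return (term, count) pairs for frequent terms missing from known_vocab."""
    counts = Counter(
        token
        for text in texts
        for token in TOKEN_RE.findall(text.lower())
    )
    return [
        (term, n)
        for term, n in counts.most_common()
        if n >= min_count and term not in known_vocab
    ]

# Hypothetical example data, not taken from the project corpus.
sample_tweets = [
    "Novi #lockdown od ponedjeljka",
    "Opet #lockdown, ostanite doma",
    "Stožer najavio nove mjere",
]
known_vocab = {
    "novi", "od", "ponedjeljka", "opet", "ostanite",
    "doma", "najavio", "nove", "mjere",
}
new_terms = find_new_terms(sample_tweets, known_vocab)
```

Terms found this way could then be reviewed and, where useful, added to the tokenizer vocabulary before further training.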
Imbalanced sentiment labels
The human-labeled tweet sentiment dataset of 10,000 examples (Senti-Cro-CoV-Tweets) was heavily skewed toward neutral and negative examples. Training on this imbalanced data risked poor performance on underrepresented classes, so it required careful balancing and loss adjustments.
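A common form of loss adjustment is to weight each class inversely to its frequency. The sketch below shows that idea only. The source does not say which scheme the project used, and the label distribution here is invented for illustration.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

# Hypothetical label distribution, chosen only to illustrate the skew.
labels = ["neutral"] * 50 + ["negative"] * 40 + ["positive"] * 10
weights = inverse_frequency_weights(labels)
```

In PyTorch, weights like these can be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on rare classes contribute more to the loss.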
Integration of knowledge from various data sources
Combining tweet text with structured user data (e.g., follower count, engagement history) introduced architectural and normalization complexity, but this integration was critical for improving prediction accuracy, especially for retweet forecasting.
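To illustrate the normalization side of this challenge: count-like features such as follower counts tend to be heavy-tailed, so a log transform followed by standardization is a typical first step. This is a minimal sketch under that assumption. `log_standardize` and the sample values are hypothetical and do not describe the project's actual pipeline.

```python
import math
from statistics import mean, pstdev

def log_standardize(values):
    """Apply log1p to compress heavy tails, then scale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged) or 1.0  # avoid division by zero for constant columns
    return [(x - mu) / sigma for x in logged]

# Hypothetical follower counts spanning several orders of magnitude.
follower_counts = [0, 12, 150, 3_000, 250_000]
scaled = log_standardize(follower_counts)
```

After scaling, the structured features sit on a range comparable to learned text embeddings, which makes them easier to combine in one model.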
Solution
We fine-tuned two models, BERT and ELECTRA, in a self-supervised fashion on a large unlabeled corpus of Croatian COVID-related text. On top of these, we implemented supervised models for two main tasks: predicting tweet sentiment and estimating how likely a tweet was to be retweeted. These models combined the tweet text with structured features such as user metadata, and all were implemented as custom PyTorch architectures.
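As an illustration of what self-supervised training on unlabeled text involves, the sketch below shows BERT-style token masking: a fraction of tokens is hidden and the model learns to recover them. The source does not state the exact objectives used, so this assumes the standard masked-language-model recipe (15% of tokens selected, of which 80% are masked, 10% replaced at random, and 10% kept). ELECTRA is trained with replaced-token detection instead, which is not shown. The function name and sample tokens are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """BERT-style masking: return (corrupted tokens, labels).

    labels[i] holds the original token at selected positions, else None.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random vocabulary token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

# Hypothetical tokenized tweet and vocabulary.
tokens = ["stožer", "je", "najavio", "nove", "mjere", "za", "vikend"]
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, seed=42)
```

The model is trained to predict the original token only at positions where `labels` is not `None`, so no human annotation is needed.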
To boost performance, we engineered additional features from both tweet-level and user-level data and ran feature importance analysis to better understand what was driving the predictions. All models and code were delivered with clear documentation, making it easy for the researchers to adapt and reuse the work in future studies.
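The source does not name the feature importance method used. Permutation importance is one widely used, model-agnostic option: shuffle one feature column at a time and measure how much the prediction error grows. The sketch below assumes that method. `permutation_importance`, `toy_predict`, and the data are hypothetical.

```python
import random

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in mean absolute error when each feature column is shuffled."""
    rng = random.Random(seed)

    def mae(data):
        return sum(abs(predict(r) - t) for r, t in zip(data, targets)) / len(targets)

    baseline = mae(rows)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by its shuffled value.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mae(shuffled) - baseline)
    return importances

# Hypothetical toy model: the prediction depends only on feature 0.
def toy_predict(row):
    return 2.0 * row[0]

rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [toy_predict(r) for r in rows]
scores = permutation_importance(toy_predict, rows, targets, n_features=2)
```

A feature the model ignores scores zero, while a feature the model relies on shows a clear increase in error when shuffled.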
Results
The final models combined structured and unstructured data, which improved prediction accuracy, especially for sentiment and retweet (virality) prediction. By tailoring the language models to Croatian COVID-19 content, we helped the team analyze complex patterns in a challenging, low-resource language setting. The results fed directly into their ongoing research, with published papers, public presentations, and further studies now building on this work.