Case Study
Industry | Academic Research
The InfoCoV project was led by the Faculty of Informatics and Digital Technologies at the University of Rijeka (UNIRi). It focused on understanding how information spreads during crises, and the team developed a research framework for analyzing COVID-19-related communication on social media. This case reflects a collaboration between academia and industry, in which Velebit AI supported the project with advanced NLP model development and implementation.
The project aimed to analyze how information and sentiment around COVID-19 spread on Twitter – specifically in Croatian, a language with limited NLP resources. The client team had collected a large dataset of tweets and associated metadata, but needed custom-trained models to extract meaningful insights from it.
Our role focused on building language models, developing classification systems, and integrating multiple data sources into a unified workflow. The end goal was to predict tweet sentiment and forecast how likely a tweet was to be retweeted.
The dataset included over 200,000 COVID-19-related tweets, over half a million user comments on COVID-19 articles in online portals, and over 180,000 full-text articles. The text varied widely in language and had little structure. Extracting accurate insights from this volume, especially during a fast-evolving global crisis, required robust models and targeted pre-processing.
Croatian lacks large pretrained NLP resources, making transfer learning less effective out of the box. We needed to train and fine-tune models specifically on Croatian COVID-related texts to reach usable performance levels.
New pandemic-related terms, hashtags, and phrases emerged quickly, many of which weren’t captured by existing tokenizers. Accurately handling this vocabulary was key to effective sentiment and topic classification.
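One way to see which emerging terms an existing vocabulary misses is to count frequent tokens in the corpus that the vocabulary does not contain. The sketch below is illustrative only: `find_new_terms`, the toy tweets, and the toy vocabulary are assumptions and are not taken from the project. With a Hugging Face tokenizer, such candidates could then be registered through `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, though the case study does not say whether this route was used.

```python
import re
from collections import Counter

# Words and hashtags; \w is Unicode-aware, so Croatian diacritics are kept.
TOKEN_RE = re.compile(r"#?\w+")

def find_new_terms(texts, known_vocab, min_count=2):
    """Return frequent lowercase tokens that are missing from known_vocab."""
    counts = Counter(
        token
        for text in texts
        for token in TOKEN_RE.findall(text.lower())
    )
    candidates = (
        term for term, n in counts.items()
        if n >= min_count and term not in known_vocab
    )
    # Most frequent first, ties broken alphabetically.
    return sorted(candidates, key=lambda term: (-counts[term], term))

# Toy data, invented for illustration only.
tweets = [
    "Novi lockdown od ponedjeljka #ostanidoma",
    "Opet lockdown? #ostanidoma",
    "Cijepljenje počinje sutra",
]
known_vocab = {"novi", "od", "ponedjeljka", "opet", "sutra", "počinje"}

new_terms = find_new_terms(tweets, known_vocab)
print(new_terms)
```

Terms that appear only once (here, "cijepljenje") fall below `min_count` and are left out, which keeps one-off typos from inflating the vocabulary.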
The human-labeled tweet sentiment dataset of 10,000 examples (Senti-Cro-CoV-Tweets) was heavily skewed toward neutral and negative examples. Training models on this imbalance risked poor performance on underrepresented classes, requiring careful balancing and loss adjustments.
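A common form of loss adjustment for skewed labels is to weight each class inversely to its frequency. The sketch below shows that idea under stated assumptions: the label counts are invented, and the case study does not specify which weighting scheme was applied. In PyTorch, weights like these could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count).

    A perfectly balanced dataset gives every class a weight of 1.0;
    rarer classes receive proportionally larger weights.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

# Toy label distribution, invented for illustration only.
labels = ["neutral"] * 6 + ["negative"] * 3 + ["positive"] * 1

weights = inverse_frequency_weights(labels)
for cls, w in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls:>8}: {w:.3f}")
```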
Combining tweet text with structured user data (e.g., follower count, engagement history) introduced architectural and normalization complexity, but this integration was critical for improving prediction accuracy, especially for retweet forecasting.
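To illustrate the normalization side of this, count-like features such as follower numbers span several orders of magnitude, so they are often log-transformed and then standardized with statistics fitted on the training set. The following is a minimal sketch of that general approach, not the project's actual pipeline; the function names and follower counts are hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_log_scaler(values):
    """Fit mean and standard deviation of log1p(values) on training data."""
    logged = [math.log1p(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged) or 1.0  # guard against a constant feature
    return mu, sigma

def transform(values, mu, sigma):
    """Apply log1p, then standardize with previously fitted statistics."""
    return [(math.log1p(v) - mu) / sigma for v in values]

# Toy follower counts, invented for illustration only.
train_followers = [0, 12, 150, 3_000, 250_000]

mu, sigma = fit_log_scaler(train_followers)
scaled = transform(train_followers, mu, sigma)
print([round(x, 2) for x in scaled])
```

Fitting `mu` and `sigma` on the training split only, then reusing them for validation and test data, avoids leaking information from held-out examples.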
We fine-tuned two language models, BERT and ELECTRA, in a self-supervised fashion on a large unlabeled corpus of Croatian COVID-related text. On top of these, we implemented supervised models for two main tasks: predicting tweet sentiment and estimating how likely a tweet was to be retweeted. These models combined the tweet text with structured features such as user metadata, all implemented using custom PyTorch architectures.
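The general pattern of combining text with structured features can be sketched as a small PyTorch module that concatenates a text representation with a projected metadata vector. This is an assumption-laden illustration rather than the delivered architecture: the class name, layer sizes, and dropout rate are hypothetical, and random tensors stand in for the pooled output of a fine-tuned encoder.

```python
import torch
from torch import nn

class TextPlusMetadataClassifier(nn.Module):
    """Fuse a text embedding with structured features by concatenation."""

    def __init__(self, text_dim, meta_dim, hidden_dim, n_classes):
        super().__init__()
        # Project the structured features before fusing them with text.
        self.meta_net = nn.Sequential(nn.Linear(meta_dim, hidden_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(text_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, text_vec, meta):
        fused = torch.cat([text_vec, self.meta_net(meta)], dim=-1)
        return self.head(fused)  # unnormalized class scores (logits)

torch.manual_seed(0)
model = TextPlusMetadataClassifier(text_dim=768, meta_dim=4, hidden_dim=64, n_classes=3)

# Stand-ins: in practice text_vec would come from the language model encoder
# and meta would hold normalized user-level and tweet-level features.
text_vec = torch.randn(8, 768)
meta = torch.randn(8, 4)

logits = model(text_vec, meta)
print(logits.shape)
```

The same skeleton could serve a retweet-likelihood task by changing the output size and the loss function, while the fusion step stays unchanged.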
To boost performance, we engineered additional features from both tweet-level and user-level data and ran feature importance analysis to better understand what was driving the predictions. All models and code were delivered with clear documentation, making it easy for the researchers to adapt and reuse the work in future studies.
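The case study does not name the feature importance method that was used. Permutation importance is one widely used option: shuffle a single feature column and measure how much a metric drops. The sketch below demonstrates it on a toy rule-based model; all names and data are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(predict, rows, labels, names, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    result = {}
    for j, name in enumerate(names):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        result[name] = sum(drops) / n_repeats
    return result

def toy_model(row):
    """Stand-in for a trained model: uses only the first feature."""
    return int(row[0] > 0)

# Toy features, invented for illustration only.
names = ["followers_scaled", "posted_hour_scaled"]
rows = [[-1.5, 0.3], [-0.7, -1.1], [-0.2, 0.8], [-0.1, -0.4],
        [0.4, 1.2], [0.9, -0.6], [1.3, 0.1], [2.0, -0.9]]
labels = [toy_model(r) for r in rows]

importance = permutation_importance(toy_model, rows, labels, names)
print(importance)
```

Because the toy model ignores the second feature, shuffling it leaves accuracy unchanged and its importance is zero, while shuffling the first feature degrades accuracy.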
The final models combined structured and unstructured data in a way that significantly improved prediction accuracy, especially for sentiment and virality tasks. By tailoring the language models to Croatian COVID-19 content, we helped the team analyze complex patterns in a challenging, low-resource language setting. The results fed directly into their ongoing research, with published papers, public presentations, and further studies now building on this work.