Case Study

Academic Research Support with
Custom NLP and Classification Models

Company: UNIRI
Industry: Academic Research

Client

The InfoCoV project was led by the Faculty of Informatics and Digital Technologies at the University of Rijeka (UNIRI). It focused on understanding how information spreads during crises, and the team developed a research framework for analyzing COVID-19-related communication on social media. This case reflects a collaboration between academia and industry, in which Velebit AI supported the project with advanced NLP model development and implementation.

Impact

By building custom NLP models for Croatian-language data, we enabled the InfoCoV team to analyze sentiment and information spread during the height of the COVID-19 crisis. Our models helped uncover patterns in public communication, supported the publication of scientific papers, and gave the research team tools they can adapt and reuse in future work.

Problem

The project aimed to analyze how information and sentiment around COVID-19 spread on Twitter in Croatian, a language with limited NLP resources. The client team had collected a large dataset of tweets and associated metadata, but needed custom-trained models to extract meaningful insights from it.

Our role focused on building language models, developing classification systems, and integrating multiple data sources into a unified workflow. The end goal was to predict tweet sentiment and to forecast how likely a tweet was to be retweeted.

Challenges

Massive communication datasets

The dataset included over 200,000 COVID-19-related tweets, over half a million user comments on COVID-19 articles from online portals, and over 180,000 full-text articles. The texts varied widely in language use and had little structure. Extracting accurate insights from this volume, especially during a fast-evolving global crisis, required robust models and targeted pre-processing.

Low-resource language limitations

Croatian lacks large pretrained NLP resources, making transfer learning less effective out of the box. We needed to train and fine-tune models specifically on Croatian COVID-related texts to reach usable performance levels.

COVID-19-related new vocabulary

New pandemic-related terms, hashtags, and phrases emerged quickly, many of which weren't captured by existing tokenizers. Accurately handling this vocabulary was key to effective sentiment and topic classification.
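As an illustration of how such vocabulary gaps can be spotted, the sketch below counts words and hashtags in a handful of tweets and lists the frequent ones that are missing from a known vocabulary. The tweets, the vocabulary, and the function name `find_new_terms` are invented for this example; the project's actual procedure is not described in this case study.

```python
import re
from collections import Counter


def find_new_terms(texts, known_vocab, min_count=2):
    """Return frequent words and hashtags that are missing from known_vocab."""
    counts = Counter()
    for text in texts:
        # "#?\w+" keeps hashtags as single tokens; in Python 3, \w also
        # matches Croatian letters such as č, š and ž.
        for token in re.findall(r"#?\w+", text.lower()):
            counts[token] += 1
    return [
        token
        for token, n in counts.most_common()
        if n >= min_count and token not in known_vocab
    ]


# Toy data, invented for illustration only.
tweets = [
    "Novi lockdown od ponedjeljka #ostanidoma",
    "Opet lockdown? #ostanidoma",
    "Cijepljenje počinje sutra",
]
known_vocab = {"novi", "od", "ponedjeljka", "opet", "počinje", "sutra", "cijepljenje"}

new_terms = find_new_terms(tweets, known_vocab)
print(new_terms)
```

With Hugging Face tokenizers, one common option is to register such terms with `tokenizer.add_tokens(...)` and then call `model.resize_token_embeddings(len(tokenizer))` before further training. Whether the project took this route is an assumption, not something the source states.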

Imbalanced sentiment labels

The human-labeled tweet sentiment dataset of 10,000 examples (Senti-Cro-CoV-Tweets) was heavily skewed toward neutral and negative examples. Training models on this imbalance risked poor performance on underrepresented classes, requiring careful balancing and loss adjustments.
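One common form of loss adjustment is to weight each class inversely to its frequency. The sketch below shows the idea in plain Python, using the same "balanced" heuristic that scikit-learn applies (`n_samples / (n_classes * class_count)`). The label counts and helper names are hypothetical, and the weighting scheme actually used in the project is not stated here.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


# Hypothetical skewed label distribution (not the real dataset statistics).
labels = ["neutral"] * 6 + ["negative"] * 3 + ["positive"] * 1
weights = balanced_class_weights(labels)

# Two examples predicted with the same confidence in the true class (0.4):
# the error on the rare class is penalized more heavily.
loss_positive = weighted_cross_entropy({"positive": 0.4}, "positive", weights)
loss_neutral = weighted_cross_entropy({"neutral": 0.4}, "neutral", weights)
print(weights, loss_positive, loss_neutral)
```

In PyTorch, the same weights could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, with classes in a fixed index order.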

Integration of knowledge from various data sources

Combining tweet text with structured user data (e.g., follower count, engagement history) introduced architectural and normalization complexity. But this integration was critical for improving prediction accuracy, especially for retweet forecasting.
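To make the normalization issue concrete: count features such as followers span several orders of magnitude, so they are often log-transformed and standardized before being joined with a text representation. The sketch below assumes exactly that recipe. The feature choice, the numbers, and the three-dimensional "embedding" are placeholders for illustration, not the project's actual pipeline.

```python
import math


def fit_scaler(rows):
    """Compute per-feature mean and std of log1p-transformed counts."""
    logged = [[math.log1p(v) for v in row] for row in rows]
    n = len(logged)
    columns = list(zip(*logged))
    means = [sum(col) / n for col in columns]
    # Fall back to 1.0 when a feature is constant, to avoid division by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
        for col, m in zip(columns, means)
    ]
    return means, stds


def transform(row, means, stds):
    """Log-transform and standardize one row of count features."""
    return [(math.log1p(v) - m) / s for v, m, s in zip(row, means, stds)]


def fuse(text_embedding, user_row, means, stds):
    """Concatenate a text embedding with normalized user features."""
    return list(text_embedding) + transform(user_row, means, stds)


# Hypothetical user features: [follower_count, past_retweets].
user_rows = [[10, 0], [1000, 5], [250000, 120]]
means, stds = fit_scaler(user_rows)

# Placeholder for the vector a language model would produce for one tweet.
fused = fuse([0.12, -0.40, 0.33], user_rows[1], means, stds)
print(fused)
```

The fused vector is what a downstream classifier or regression head would consume. In a neural model the same concatenation would typically happen on tensors inside the network.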

Solution

We fine-tuned two language models, BERT and ELECTRA, in a self-supervised fashion on a large unlabeled corpus of Croatian COVID-related text. On top of that, we implemented supervised models for two main tasks: predicting tweet sentiment and estimating how likely a tweet was to be retweeted. These models combined the tweet text with structured features such as user metadata and were implemented as custom PyTorch architectures.
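For readers unfamiliar with self-supervised training on unlabeled text, the sketch below illustrates the masking step of BERT-style masked language modeling: a fraction of tokens is selected, and the model is trained to recover the originals. It follows the standard BERT recipe (of the selected tokens, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). The project's exact training configuration is not given in this case study, and ELECTRA uses a different objective (replaced-token detection), so treat this only as a conceptual example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only where the model must predict.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # most selected tokens are hidden
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # some are swapped for a random token
            else:
                inputs.append(token)              # some are left unchanged
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels


# Toy sentence, invented for illustration; a high mask_prob makes the effect visible.
tokens = ["novi", "lockdown", "od", "ponedjeljka", "u", "cijeloj", "hrvatskoj"]
inputs, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5)
print(inputs)
print(labels)
```

Because the training signal comes from the text itself, this step needs no human labels, which is what makes it usable on a large unlabeled corpus.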

To boost performance, we engineered additional features from both tweet-level and user-level data and ran feature importance analysis to better understand what was driving the predictions. All models and code were delivered with clear documentation, making it easy for the researchers to adapt and reuse the work in future studies.
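The case study does not say which feature importance method was used. One model-agnostic option is permutation importance: shuffle one feature at a time and measure how much accuracy drops. The sketch below implements it in plain Python on synthetic data. The `toy_model` function is a hypothetical stand-in for a trained classifier, and the two features are invented.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the target."""
    return sum(model(row) == target for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier: uses feature 0 only and ignores feature 1."""
    return int(row[0] > 0.5)


# Synthetic data: two features in [0, 1); targets follow the toy model exactly.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling the feature the model relies on causes a large accuracy drop, while shuffling the ignored feature causes none. scikit-learn provides a ready-made version as `sklearn.inspection.permutation_importance`, and tree-based libraries such as LightGBM also expose built-in importance scores.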

Tools and Technologies

Python
PyTorch
Hugging Face
LightGBM
Scikit-learn

Results

The final models combined structured and unstructured data in a way that significantly improved prediction accuracy, especially for sentiment and virality tasks. By tailoring the language models to Croatian COVID-19 content, we helped the team analyze complex patterns in a challenging, low-resource language setting. The results fed directly into their ongoing research, with published papers, public presentations, and further studies now building on this work.

