Technologies Involved:
Python
Area Of Work: Computer Vision
Project Description

A Switzerland-based digital solutions provider aimed to automate PDF-based data extraction for faster inventory mapping. With a growing need for structured article pricing data, the client sought a robust backend module to streamline extraction, classification, and integration. The solution needed to support real-time processing through a web-based interface for GraphQL-connected systems.

Scope Of Work

The client engaged Oodles to build a system that extracts article numbers and prices from PDF files and fetches the related results through a GraphQL API. The goal was to replace manual data collection with a scalable module. Key areas of work included PDF parsing, OCR integration, NER model execution, GraphQL connectivity, and structured export generation.

Our Solution

To address the client’s need for automation, a Django-based REST API system was built to handle PDF-to-text conversion, entity recognition, and result generation. Here's how it worked:

  • Image Conversion Module: PDF files were converted into JPGs using pdf2img for improved OCR accuracy.
  • Text Extraction: Google Tesseract OCR processed the images to extract the raw text, which was saved to intermediate .txt files.
  • Entity Recognition: A spaCy-based Named Entity Recognition (NER) model was trained to extract specific fields from the text.
  • Data Mapping & Export: Extracted data was sent through a GraphQL API to fetch the updated article details. 
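
The four steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the project's actual code: it assumes the pdf2image and pytesseract packages for the conversion and OCR steps (the write-up names the converter as pdf2img), and the endpoint URL, GraphQL query shape, and the entity label ARTICLE_NUMBER are hypothetical placeholders, since the source does not specify them. For brevity the sketch keeps page images and text in memory instead of writing JPG and .txt files.

```python
import json
import urllib.request
from typing import Dict, List

# Hypothetical placeholders: the source gives no endpoint, schema, or label names.
GRAPHQL_URL = "https://example.com/graphql"
ARTICLE_QUERY = """
query Articles($numbers: [String!]!) {
  articles(numbers: $numbers) { number price }
}
"""


def pdf_to_text(pdf_path: str) -> str:
    """Render each PDF page to an image, then OCR it with Tesseract."""
    # Imported lazily so the rest of the module loads without the OCR stack.
    from pdf2image import convert_from_path  # assumed library; requires Poppler
    import pytesseract  # requires the Tesseract binary to be installed

    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_entities(text: str, model_path: str) -> List[Dict[str, str]]:
    """Run a trained spaCy pipeline and return its entities as plain dicts."""
    import spacy

    nlp = spacy.load(model_path)  # path to the custom-trained NER model
    return [{"label": ent.label_, "text": ent.text} for ent in nlp(text).ents]


def build_graphql_payload(entities: List[Dict[str, str]]) -> dict:
    """Collect unique article numbers into a GraphQL request body."""
    numbers = sorted(
        {e["text"] for e in entities if e["label"] == "ARTICLE_NUMBER"}
    )
    return {"query": ARTICLE_QUERY, "variables": {"numbers": numbers}}


def fetch_articles(payload: dict, url: str = GRAPHQL_URL) -> dict:
    """POST the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A caller would chain these as pdf_to_text, then extract_entities, then build_graphql_payload, then fetch_articles. Keeping the payload builder free of I/O makes the mapping step easy to unit-test without a PDF, an OCR install, or a live GraphQL server.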
