PDF Parser

What Does PDF-Parser do?

PDF Parser provides an easy way to extract and chunk the information in a PDF file. It can extract text, tables, and even embedded images from the PDF. These chunks can be used for a variety of purposes, such as RAG for Large Language Models. The major steps in PDF Parser are:

  1. PDF -> MD - This is achieved using a fork of an external tool, Marker, which has been modified to integrate into our document flow
  2. MD parsing using Marked - This is a markdown parser which converts the MD to a temporary JSON, making it easier to chunk and manipulate the data.
  3. We detect the headings in the parsed MD and chunk the data into a CSV and a final JSON, adding metadata to each chunk such as startPage, endPage, pdfName, etc. (see the sketch after this list).
  4. After extracting the text, PDF Parser detects images in the PDF and embeds them into the previously created chunks.
  5. Finally, we summarize the content of the chunks.
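
To make step 3 concrete, here is a minimal sketch of heading-based chunking. This is an illustration of the idea only, not the project's actual implementation; the function name and everything except the startPage, endPage, and pdfName metadata fields are hypothetical.

# Illustrative sketch of heading-based chunking (step 3 above); not the
# project's actual code. Splits markdown at headings and attaches the
# per-chunk metadata described above (startPage, endPage, pdfName).
def chunk_markdown(md_text: str, pdf_name: str, start_page: int, end_page: int) -> list[dict]:
    chunks, heading, lines = [], None, []

    def flush():
        if lines:
            chunks.append({
                "heading": heading,
                "text": "\n".join(lines).strip(),
                "startPage": start_page,   # metadata added to every chunk
                "endPage": end_page,
                "pdfName": pdf_name,
            })

    for line in md_text.splitlines():
        if line.startswith("#"):  # a markdown heading starts a new chunk
            flush()
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks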

Steps involved in PDF-Parsing

  • Each page is extracted from the PDF (depending on the parameter passed for the number of pages to be processed at a time)
  • Text is extracted along with structural metadata (the bbox of each word) using PyMuPDF
  • If no text is extracted, or the extracted text has an unknown encoding, it is sent to Pytesseract and the text is OCRed for the required language (the language should be provided as input; functionality to predict the language is not yet built in and could be done using CLIP)
  • Extracted text is segmented into 'blocks' (lines of the same font) along with the bbox for each block
  • Blocks are sent to the object detection and segment detectors (layout detector and column detector)
  • Each block is classified into heading, text line, or table row/column/heading
  • LaTeX equations are converted into the right .md format
  • The post-processing for converting text to .md format is done for each block
  • Blocks are stitched together for a page, and page numbers are added to each page
  • The pages are stitched together
  • The .md of the entire batch of pages is converted to JSON format with the required structure (content is mapped to headings and subheadings, subheadings to headings, etc.)
  • The JSON is flattened out to a CSV with chunks of content mapped to headings
  • The chunks are post-processed: big chunks are broken down into smaller ones and small chunks are combined, to maintain uniformity of size (see the sketch after this list)
  • The chunks are embedded and stored in a vector DB; summaries are generated if required, and translations are done if required
  • Images are detected across all pages of the PDF and sent to GPT for summarization and to MinIO for storage
  • The MinIO links for the images are inserted back into the chunked CSV along with embeddings of the image summaries
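
The size-uniformity step can be sketched as follows. This is an illustration with a hypothetical threshold (MAX_CHARS); the actual splitting and merging logic in the project may differ.

# Illustrative sketch of chunk-size normalization: oversized chunks are
# split and undersized neighbours merged, so chunks end up roughly uniform.
# MAX_CHARS is a hypothetical threshold, not a value from the project.
MAX_CHARS = 1000

def normalize_chunks(texts: list[str]) -> list[str]:
    # 1. Break oversized chunks into pieces of at most MAX_CHARS characters.
    pieces = [text[i:i + MAX_CHARS]
              for text in texts
              for i in range(0, len(text), MAX_CHARS)]

    # 2. Merge consecutive small pieces while they still fit the budget.
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + 1 + len(piece) <= MAX_CHARS:
            merged[-1] = merged[-1] + "\n" + piece
        else:
            merged.append(piece)
    return merged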

Setting up locally

Clone the repository locally

git clone https://github.com/BharatSahAIyak/pdf-parser/ && cd pdf-parser

Create and activate venv. Ex:

python3.10 -m venv venv
source venv/bin/activate

This project uses Poetry for dependency management.

pip install poetry==1.7.0
poetry install

Initialize the git submodules for this project

git submodule update --init --recursive

Install the requirements for the submodules

cd marker_final && poetry install

For parsing Hindi PDFs, install Tesseract on your system as well

apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-hin

Setup a .env file in the root directory.

cp sample.env .env

Update the created .env file with your OPENAI and BHASHINI keys, along with the MINIO information

TESSDATA_PREFIX is the location of the Tesseract language data (tessdata) on your system

Start the Docker Containers

  • Redis
docker run -d -p 6379:6379 --name my-redis redis
  • MinIO
mkdir -p ~/minio/data

docker run \
-p 9000:9000 \
-p 9001:9001 \
--name minio \
-v ~/minio/data:/data \
-e "MINIO_ROOT_USER=ROOTNAME" \
-e "MINIO_ROOT_PASSWORD=CHANGEME123" \
quay.io/minio/minio server /data --console-address ":9001"

Start the Celery Worker

celery -A worker worker --loglevel=info

Start the Web Server

uvicorn main:app --workers 4 --host 0.0.0.0 --port 8000 --reload

Architecture

API specification

Download

GET
/download/ Generate download links for a particular `taskId`

This endpoint returns the MinIO links for the processed parts associated with a given taskId

Query Parameters

Name   | Type   | Required | Description
-------|--------|----------|------------
taskId | String | YES      | The taskId returned when starting a conversion with /process/
format | String | NO       | Specifies the type of response (md, json, csv, or lexer)

When format is not passed, all three formats (md, csv, and json) are returned

Example request using cURL

When `format` is passed in the query:

curl --location 'localhost:8000/download/?format=json&taskId=<taskId>'

{
  "data": {
    "batchId": "8101a566-2da0-4691-9c2d-184e8b439ecf",
    "taskId": "828c22d3-b103-460e-b8ae-c1ead28f4186",
    "url": "<URL for original file>",
    "files": {
      "828c22d3-b103-460e-b8ae-c1ead28f4186": [
        {
          "start": "0",
          "end": "2",
          "url": {
            "json": "<URL FOR THE JSON FILE>"
          }
        }
      ]
    }
  },
  "error": false
}

When `format` is not passed in the query:

curl --location 'localhost:8000/download/?taskId=<taskId>'

{
  "data": {
    "batchId": "8101a566-2da0-4691-9c2d-184e8b439ecf",
    "taskId": "828c22d3-b103-460e-b8ae-c1ead28f4186",
    "url": "PDF URL",
    "files": {
      "828c22d3-b103-460e-b8ae-c1ead28f4186": [
        {
          "start": "0",
          "end": "2",
          "url": {
            "pdf": "URL FOR THE PART",
            "md": "MD URL",
            "csv": "CSV URL",
            "json": "JSON URL"
          }
        }
      ]
    }
  },
  "error": false
}

Process

POST
/process/ Upload a batch of PDFs and generate chunks

Request Body

Name       | Type    | Required | Description
-----------|---------|----------|------------
language   | String  | YES      | Language of the uploaded PDF (english or hindi)
file       | File    | NO       | The PDF file to be processed
url        | String  | NO       | Link to a PDF file
split      | Integer | NO       | The number of pages to process from the PDF in one iteration
to         | String  | NO       | The final stage of processing (md to stop at MD; the default end stage is JSON)
webhookURL | String  | NO       | The webhook to hit on completion of a PDF part
summarize  | Boolean | NO       | Whether summarization should be done after simple parsing
images     | Boolean | NO       | Whether images should be extracted from the PDF after simple parsing

Only one of file or url can be processed at a time, so exactly one of them must be provided per request

Example Request using cURL

curl --location 'localhost:8000/process/' \
--form 'file=@"/home/shady/example_english.pdf"' \
--form 'language="English"'

curl --location 'localhost:8000/process/' \
--form 'url="https://ncert.nic.in/pdf/NCFSE-2023-August_2023.pdf"' \
--form 'language="English"' \
--form 'split="50"'

Response

Success Response:

{
  "data": {
    "batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
    "taskId": ["322b9125-15e4-4e94-b4bf-13b6f79f8acb"]
  },
  "error": false
}

Failure Response:

{
  "message": "Something went wrong. Please try again after some time.",
  "error": true,
  "data": {
    "errorId": "<ERROR_ID>"
  }
}
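
The same upload can be made from Python. Below is a minimal sketch using the requests library; the file path is a placeholder, and the response fields follow the success response above.

# Minimal sketch of calling /process/ from Python with the requests library.
# The file path is a placeholder; adjust the host and port to your setup.
import requests

with open("example_english.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/process/",
        files={"file": f},
        data={"language": "English", "split": "50"},
    )
resp.raise_for_status()
body = resp.json()
batch_id = body["data"]["batchId"]
task_ids = body["data"]["taskId"]  # a list of taskIds, per the response above
print(batch_id, task_ids)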

Status

GET
/status/ Retrieve the status using batchId or taskId

Query Parameters

Name    | Type   | Required | Description
--------|--------|----------|------------
taskId  | String | NO       | One of the taskIds returned when starting a batch conversion with /process/
batchId | String | NO       | The batchId returned when starting a conversion using /process/

You can pass either one or both of taskId and batchId; at least one is required

Example requests using cURL

With both `taskId` and `batchId` passed:

curl --location 'localhost:8000/status/?taskId=<taskID>&batchId=<batchId>'

{
  "error": false,
  "data": {
    "batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
    "taskId": "322b9125-15e4-4e94-b4bf-13b6f79f8acb",
    "taskStatus": {
      "percent": 100,
      "total": 2,
      "stage": "simple-parsing"
    },
    "batchStatus": {
      "percent": 100,
      "total": "1"
    }
  },
  "message": ""
}

With `batchId` passed:

curl --location 'localhost:8000/status/?batchId=<batchId>'

{
  "error": false,
  "data": {
    "batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
    "batchStatus": {
      "percent": 100,
      "total": "1"
    }
  },
  "message": "SUCCESS"
}

With just `taskId`:

curl --location 'localhost:8000/status/?taskId=<taskId>'

{
  "error": false,
  "data": {
    "taskId": "322b9125-15e4-4e94-b4bf-13b6f79f8acb",
    "batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
    "taskStatus": {
      "percent": 100,
      "total": 2,
      "stage": "simple-parsing"
    }
  },
  "message": "SUCCESS"
}
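
Putting /status/ and /download/ together, a client can poll until a task finishes and then fetch its links. A minimal sketch (the polling interval is arbitrary; field access follows the example responses above):

# Sketch of polling /status/ until a task finishes, then fetching its
# download links via /download/. Field access follows the examples above.
import time
import requests

BASE = "http://localhost:8000"

def wait_and_download(task_id: str, interval: float = 5.0) -> dict:
    while True:
        status = requests.get(f"{BASE}/status/", params={"taskId": task_id}).json()
        if status["data"]["taskStatus"]["percent"] == 100:
            break
        time.sleep(interval)
    # With `format` omitted, all three formats (md, csv, json) are returned.
    return requests.get(f"{BASE}/download/", params={"taskId": task_id}).json()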

Connecting to a personal service

The /process/ endpoint takes an optional parameter, webhookURL, which is hit every time a split of a PDF is completed. The payload for this is:

{
  "error": false,
  "data": {
    "batchId": "", // UUID for the batch of PDFs being processed
    "taskId": "a57313b3-4293-4d9f-98cb-4be366ec23ff", // UUID for the PDF
    "url": "", // MinIO link for the PDF
    "chunks": [ // contains the extracted data from the PDF
      {
        "content": {
          "heading": "",
          "text": ""
        },
        "id": ""
      }
    ],
    "metadata": {
      "end": 2,
      "start": 0,
      "id": "" // ID of the part being processed
    },
    "taskStatus": {
      "percent": 100,
      "total": "", // total number of pages in the PDF being processed
      "stage": "simple-parsing"
    },
    "batchStatus": {
      "percent": 100, // completion percentage for the batch
      "total": "" // total number of PDFs in the batch
    }
  },
  "message": ""
}
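
On the receiving side, a minimal webhook handler might look like the sketch below. FastAPI is an assumption here (any web framework that accepts a JSON POST works); the route name is arbitrary, and the field access follows the payload shape above.

# Minimal FastAPI sketch of a webhook receiver for the payload above.
# The route name is arbitrary; field access follows the documented payload.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/pdf-parser-webhook")
async def receive_part(request: Request):
    payload = await request.json()
    data = payload["data"]
    print(f"part {data['metadata']['id']} of task {data['taskId']} done: "
          f"pages {data['metadata']['start']}-{data['metadata']['end']}, "
          f"{len(data['chunks'])} chunks")
    return {"ok": True}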