OCR-API: Document Extraction System
Technology Stack
What Powers Our System
- Python: A user-friendly programming language that helps our system process information quickly and efficiently - think of it as the brain of our operation.
Main Building Blocks
- Text Recognition Technology: Our system can "read" text from images and documents, similar to how humans read but done by a computer.
- Specialized Document Processors: We have different tools designed to handle specific types of documents:
  - Resume Processor: Understands and extracts information from job applications
  - Birth Certificate Processor: Pulls important details from birth records
  - ID Document Processor: Works with driver's licenses, passports, and other identification
  - Academic Certificate Processor: Handles diplomas and educational records
  - Work Permit Processor: Manages employment authorization documents
Connected Systems
- Google Drive Connection: After we process your documents, we store them securely in Google Drive - similar to saving files in a digital filing cabinet.
- Airtable Connection: We organize all the extracted information in Airtable - think of this as a smart spreadsheet that keeps everything organized.
How Document Processing Works
Our document processing pipeline uses a series of technical components to manage, identify, and extract information. Here's how it works:
Dictionary Mapping Structure
The system relies on four key dictionaries to manage the extraction flow:
- CONSTANT_COLUMN: Maps confirmation keys to upload keys

      CONSTANT_COLUMN = {
          "Extracted TIN Number Upload Confirmation": "TIN Number Upload",
          "Extracted Birth Certificate Confirmation": "Birth Certificate",
          "Extracted Occupational Permit Confirmation": "Occupational Permit",
          "Extracted SSS ID Upload Confirmation": "SSS ID Upload",
          "Extracted UMID Number Upload Confirmation": "UMID Number Upload",
          "Extracted Upload Resume Confirmation": "Upload Resume",
          "Extracted School Records Confirmation": "School Records"
      }

- CONSTANT_COLUMN_EXTRACTED: Maps upload keys to extracted data fields

      CONSTANT_COLUMN_EXTRACTED = {
          "TIN Number Upload": "Extracted TIN Number Upload",
          "Birth Certificate": "Extracted Birth Certificate",
          "Occupational Permit": "Extracted Occupational Permit",
          "SSS ID Upload": "Extracted SSS ID Upload",
          "UMID Number Upload": "Extracted UMID Number Upload",
          "Upload Resume": "Extracted Upload Resume",
          "School Records": "Extracted School Records"
      }

- DOCUMENT_REQUIREMENTS: Associates extraction functions with document types

      DOCUMENT_REQUIREMENTS = {
          "extract_birth_cert": ["Birth Certificate"],
          "extract_cv": ["Upload Resume"],
          "extract_id": ["SSS ID Upload", "UMID Number Upload", "TIN Number Upload"],
          "extract_diploma": ["School Records"],
          "extract_working_permit": ["Occupational Permit"]
      }

- EXTRACTOR_MAP: Links extraction functions to their processor classes

      EXTRACTOR_MAP = {
          "extract_birth_cert": BirthCertExtractor,
          "extract_cv": CVExtractor,
          "extract_id": IDExtractor,
          "extract_diploma": DiplomaExtractor,
          "extract_working_permit": WorkPerminExtractor
      }
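To illustrate how these dictionaries chain together, here is a minimal sketch that resolves an upload key to its extraction function name and processor class. The helper `resolve_extractor` is hypothetical (it is not taken from the codebase), and the extractor classes are empty stand-ins, since the real implementations are not shown in this document.

```python
# Empty stand-ins for the project's processor classes (real code not shown here).
class BirthCertExtractor: pass
class CVExtractor: pass
class IDExtractor: pass
class DiplomaExtractor: pass
class WorkPerminExtractor: pass  # spelled exactly as in EXTRACTOR_MAP

DOCUMENT_REQUIREMENTS = {
    "extract_birth_cert": ["Birth Certificate"],
    "extract_cv": ["Upload Resume"],
    "extract_id": ["SSS ID Upload", "UMID Number Upload", "TIN Number Upload"],
    "extract_diploma": ["School Records"],
    "extract_working_permit": ["Occupational Permit"],
}

EXTRACTOR_MAP = {
    "extract_birth_cert": BirthCertExtractor,
    "extract_cv": CVExtractor,
    "extract_id": IDExtractor,
    "extract_diploma": DiplomaExtractor,
    "extract_working_permit": WorkPerminExtractor,
}


def resolve_extractor(upload_key):
    """Hypothetical helper: return (function_name, processor_class) for an upload key."""
    for function_name, upload_keys in DOCUMENT_REQUIREMENTS.items():
        if upload_key in upload_keys:
            return function_name, EXTRACTOR_MAP[function_name]
    raise KeyError(f"No extractor registered for {upload_key!r}")


function_name, extractor_class = resolve_extractor("SSS ID Upload")
print(function_name, extractor_class.__name__)  # extract_id IDExtractor
```

Because three ID-type uploads share one extraction function, the lookup walks DOCUMENT_REQUIREMENTS rather than assuming a one-to-one mapping.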
Technical Process Flow
When a document enters our system, it undergoes this technical process:
- Document Ingestion: The document is submitted via API endpoint or picked up by the automatic background service.
- Attachment Verification: The system checks if the document has an attachment by examining if the corresponding key in CONSTANT_COLUMN contains "No Attachment".
- Processing Assessment: If the document has an attachment and needs processing, the system proceeds to extraction.
- Extractor Function Mapping: The system identifies the document type and maps it to the appropriate extraction function using the DOCUMENT_REQUIREMENTS dictionary.
- Processor Class Selection: It retrieves the corresponding processor class via the EXTRACTOR_MAP dictionary (e.g., CVExtractor, IDExtractor).
- Information Extraction: The processor class parses the document using OCR technology and structures the data according to predefined schemas.
- Cloud Storage: The processed data is stored in Google Drive with appropriate metadata.
- Database Update: The system updates Airtable using the CONSTANT_COLUMN_EXTRACTED dictionary to map fields correctly.
This technical pipeline ensures efficient document processing with appropriate routing based on document types. The system runs both on-demand via API endpoints and automatically through a scheduled background service.
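The attachment check and the Airtable field mapping can be sketched as follows. This is only an illustration under stated assumptions: the record is assumed to be a plain dict of Airtable field values, an absent confirmation field is assumed to mean there is nothing to check, and the helper names `uploads_to_process` and `build_airtable_update` as well as the sample status strings are hypothetical.

```python
# Dictionaries copied from the mapping section of this document.
CONSTANT_COLUMN = {
    "Extracted TIN Number Upload Confirmation": "TIN Number Upload",
    "Extracted Birth Certificate Confirmation": "Birth Certificate",
    "Extracted Occupational Permit Confirmation": "Occupational Permit",
    "Extracted SSS ID Upload Confirmation": "SSS ID Upload",
    "Extracted UMID Number Upload Confirmation": "UMID Number Upload",
    "Extracted Upload Resume Confirmation": "Upload Resume",
    "Extracted School Records Confirmation": "School Records",
}

CONSTANT_COLUMN_EXTRACTED = {
    "TIN Number Upload": "Extracted TIN Number Upload",
    "Birth Certificate": "Extracted Birth Certificate",
    "Occupational Permit": "Extracted Occupational Permit",
    "SSS ID Upload": "Extracted SSS ID Upload",
    "UMID Number Upload": "Extracted UMID Number Upload",
    "Upload Resume": "Extracted Upload Resume",
    "School Records": "Extracted School Records",
}

NO_ATTACHMENT_MARKER = "No Attachment"


def uploads_to_process(fields):
    """Return upload keys whose confirmation field does not report "No Attachment"."""
    pending = []
    for confirmation_key, upload_key in CONSTANT_COLUMN.items():
        if confirmation_key not in fields:
            continue  # assumption: an absent field means nothing to check
        if NO_ATTACHMENT_MARKER in str(fields[confirmation_key]):
            continue  # no attachment, so skip extraction
        pending.append(upload_key)
    return pending


def build_airtable_update(extracted):
    """Translate {upload key: extracted value} into {Airtable field: extracted value}."""
    return {CONSTANT_COLUMN_EXTRACTED[key]: value for key, value in extracted.items()}


# Hypothetical record; "Pending" is an invented status used only for this example.
record_fields = {
    "Extracted Birth Certificate Confirmation": "No Attachment",
    "Extracted Upload Resume Confirmation": "Pending",
}
print(uploads_to_process(record_fields))  # ['Upload Resume']
print(build_airtable_update({"Upload Resume": "sample extracted text"}))
# {'Extracted Upload Resume': 'sample extracted text'}
```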
Processing Flow Diagram
                                     +----------------+
                                     | Document Input |
                                     +-------+--------+
                                             |
                                             v
    +------------------+            +-----------------+
    | CONSTANT_COLUMN  |----------->| Check if "No    |
    +------------------+            | Attachment"     |
                                    +--------+--------+
                                             |
                                             v
    +--------------------+          +------------------+
    | JSON Response Data |--------->| Document needs   |<-------+
    +--------------------+          | extraction?      |        |
                                    +--------+---------+        |
                                             |                  |
                                             v                  |
    +--------------------+          +------------------+        |
    | DOCUMENT_          |--------->| Map to extractor |        |
    | REQUIREMENTS       |          | function         |        |
    +--------------------+          +--------+---------+        |
                                             |                  |
                                             v                  |
    +------------------+            +------------------+        |
    | EXTRACTOR_MAP    |----------->| Select extractor |        |
    +------------------+            | class            |        |
                                    +--------+---------+        |
                                             |                  |
                                             v                  |
                                    +-------------------+       |
                                    | Extract document  |       |
                                    | data              |       |
                                    +--------+----------+       |
                                             |                  |
                                             v                  |
                                    +-------------------+       |
                                    | Save to Google    |       |
                                    | Drive             |       |
                                    +--------+----------+       |
                                             |                  |
                                             v                  |
                                    +-------------------+       |
                                    | Upload to Airtable|-------+
                                    +-------------------+
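The loop-back arrow in the diagram suggests that each document goes through extract, save, and upload before the next one is assessed. A rough sketch of that ordering is below. The function `process_documents` and its three callables are hypothetical stand-ins, and the `drive://` strings are placeholders rather than real Google Drive references.

```python
def process_documents(upload_keys, extract, save_to_drive, update_airtable):
    """Run extract -> save -> update for each upload key, one document at a time."""
    drive_refs = {}
    for upload_key in upload_keys:  # corresponds to the loop-back arrow
        data = extract(upload_key)
        drive_refs[upload_key] = save_to_drive(upload_key, data)
        update_airtable(upload_key, data)
    return drive_refs


# Stand-in steps that only record the order in which they are called.
call_log = []


def fake_extract(upload_key):
    call_log.append(("extract", upload_key))
    return {"source": upload_key}


def fake_save_to_drive(upload_key, data):
    call_log.append(("drive", upload_key))
    return f"drive://{upload_key}"  # placeholder, not a real Drive URL


def fake_update_airtable(upload_key, data):
    call_log.append(("airtable", upload_key))


refs = process_documents(
    ["Upload Resume", "Birth Certificate"],
    fake_extract,
    fake_save_to_drive,
    fake_update_airtable,
)
print(refs)
# {'Upload Resume': 'drive://Upload Resume', 'Birth Certificate': 'drive://Birth Certificate'}
```

Passing the three steps in as callables keeps the ordering testable without touching Google Drive or Airtable.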
Background Service
Our system includes a helpful automated assistant that works behind the scenes:
Automatic Document Processing
- Always Working: Our background service runs continuously, checking for new documents every 60 seconds - like having an assistant who never sleeps!
- What It Does: This service automatically:
  - Checks Airtable for new document submissions
  - Downloads any new files it finds
  - Determines what type of documents they are
  - Processes them using the appropriate specialist
  - Uploads the results back to Google Drive
  - Updates Airtable with the extracted information
- Benefits: This automation means:
  - No manual triggering needed
  - Documents are processed promptly
  - Information flows smoothly into your systems
  - Everything stays up-to-date without human intervention
The background service ensures that your document processing pipeline runs efficiently and continuously, providing a seamless experience for all users of the system.
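A 60-second polling loop of this kind might look like the sketch below. It is not the project's actual service: `run_background_service` and `poll_once` are hypothetical names, and catching and logging errors so that one failed cycle does not stop the loop is an assumed design choice. The `max_cycles` and `sleep` parameters exist only so the loop can be exercised without waiting.

```python
import logging
import time

POLL_INTERVAL_SECONDS = 60  # interval stated in this document

logger = logging.getLogger("background_service")


def run_background_service(poll_once, interval=POLL_INTERVAL_SECONDS,
                           max_cycles=None, sleep=time.sleep):
    """Call poll_once repeatedly, pausing `interval` seconds between cycles.

    Returns the number of cycles run. With max_cycles=None it runs until interrupted.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        try:
            poll_once()
        except Exception:
            # Assumed behavior: log the failure and keep polling.
            logger.exception("Polling cycle failed")
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            sleep(interval)
    return cycles


# Demonstration with a stand-in poll function and no real waiting.
seen = []
completed = run_background_service(
    lambda: seen.append("checked Airtable"),
    interval=0,
    max_cycles=2,
)
print(completed, seen)  # 2 ['checked Airtable', 'checked Airtable']
```

In a real deployment, `poll_once` would wrap the steps listed above (check Airtable, download, classify, extract, upload, update).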