Document classification
The Classifier algorithm finds documents in an image and assigns each of them a document type. We support a wide array of documents for Eastern Europe and can train our algorithms to process any custom document type from a provided dataset (as we did for invoices, tax forms, and COVID forms).
The algorithm looks for rectangular shapes on the incoming image that look like documents and cuts them out.
The Classifier assigns a class to each cut-out area: Passport, Driver’s Licence, and so on. A list of currently supported document types is available at the link.
The algorithm evaluates the orientation of the document in space. If necessary, the classifier rotates or mirrors the document.
API specification
POST
https://latest.handl.io/classify
min_shape
integer
>0, the default value is 256. The minimum size of the image's short side, in pixels. If the image is smaller, the low_image_weight parameter in the response returns true; otherwise it returns false.
min_filesize
integer
>0, the default value is 10240. The minimum file size of the image, in bytes. If the file is smaller, the low_image_weight parameter in the response returns true; otherwise it returns false.
max_exposure_score
number
>0, the default value is 0.4. The maximum exposure of the image. If the exposure is higher, the image_exposure parameter in the response returns overexposed; otherwise it returns normal.
min_exposure_score
number
>0, the default value is 0.05. The minimum exposure of the image. If the exposure is lower, the image_exposure parameter in the response returns underexposed; otherwise it returns normal.
max_blur_score
number
>0, the default value is 2. The minimum clarity coefficient of the image. If the coefficient is lower, the image_blured parameter in the response returns true; otherwise it returns false.
doc_type
array
A list of document types to search for in the input file. It is useful for deterministic processes: for example, when only the main spread of a passport needs to be found in a document stream and all other types can be ignored. By default, all document types available in the classifier are selected.
priority
integer
>0, the default value is 1. The priority of an asynchronous task in the processing queue.
simple_cropper
boolean
false (default) - the simplified algorithm for cutting documents out of images is not used. true - the simplified algorithm is used: it is faster, but documents may be cut out less accurately, especially on images with a complex background.
async
boolean
true - asynchronous mode of processing requests. false - synchronous mode of request processing.
check_fake_experimental
boolean
Deprecated. This parameter is no longer used.
check_fake
boolean
true - the algorithm searches the file metadata for signs of modification in digital editors; the result is returned in a separate field called “fake”. false - the metadata check is disabled.
pdf_raw_images
boolean
true - whether PDF files are rasterized is decided by the auto_pdf_raw_images parameter. false - all PDF files are rasterized, and the value of the auto_pdf_raw_images parameter is ignored.
auto_pdf_raw_images
boolean
true - the algorithm decides automatically whether to rasterize each PDF file. false - the algorithm never rasterizes PDF files.
dpi
integer
>0, the default value is 300. Sets the number of pixels per inch for PDF rasterization. We recommend 300: higher values usually do not improve quality but do increase the file size of the image.
quality
integer
0-100, the default value is 75. Sets the degree of JPEG compression for PDF rasterization. The recommended value is 75, which balances the file size of the image against its quality.
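The threshold parameters above (min_filesize, min_exposure_score, max_exposure_score) each map a measured value to a flag in the response. The sketch below illustrates that mapping on the client side only; it is not the service's implementation. The function names are hypothetical, and treating a value exactly equal to a threshold as passing is an assumption, since the descriptions only cover the "less" and "more" cases.

```python
def low_image_weight(file_size_bytes, min_filesize=10240):
    """Return True when the file is smaller than min_filesize (in bytes)."""
    # Assumption: a file of exactly min_filesize bytes is not flagged.
    return file_size_bytes < min_filesize


def image_exposure(score, min_exposure_score=0.05, max_exposure_score=0.4):
    """Return the image_exposure label for a measured exposure score."""
    if score > max_exposure_score:
        return "overexposed"
    if score < min_exposure_score:
        return "underexposed"
    # Assumption: scores equal to either threshold count as normal.
    return "normal"


print(low_image_weight(5000))   # below the default 10240 bytes
print(image_exposure(0.5))      # above the default maximum of 0.4
print(image_exposure(0.2))      # between the two default thresholds
```

Raising min_exposure_score or lowering max_exposure_score narrows the range that is reported as normal.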
Below is the API specification for the document classification method. For more details on how to compose a classification request, see .
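As a rough illustration of composing a request, the sketch below builds a URL for the classify endpoint from a dictionary of parameters. It assumes the parameters are passed in the query string and that booleans are written as lowercase true/false; neither is confirmed by this page, and the way the image file itself is attached is not shown here. The helper build_classify_url is a hypothetical name, and the code only constructs the URL without sending anything.

```python
from urllib.parse import urlencode

CLASSIFY_URL = "https://latest.handl.io/classify"


def build_classify_url(params):
    """Build a classify URL from a dict of parameters (hypothetical helper)."""
    encoded = {}
    for name, value in params.items():
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase "true" / "false".
            encoded[name] = "true" if value else "false"
        else:
            encoded[name] = value
    # doseq=True expands list values (such as doc_type) into repeated keys.
    query = urlencode(encoded, doseq=True)
    return f"{CLASSIFY_URL}?{query}" if query else CLASSIFY_URL


url = build_classify_url(
    {"min_shape": 256, "async": False, "dpi": 300, "quality": 75}
)
print(url)
```

A dictionary is used instead of keyword arguments because async is a reserved word in Python and cannot be written as a keyword argument name.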