A New Adaptive Method for Extracting Header Words from Official Printed Arabic Documents

Abstract

Word extraction techniques play a significant role in document image analysis and retrieval systems. In this paper, a new method is proposed for detecting and extracting header words from official printed Arabic documents. The proposed method extracts lines of Arabic words in various fonts, styles, and sizes from printed Arabic documents of different shapes, colors, and resolutions. Header word extraction is based on an effective segmentation technique that separates the different objects in a document, including text lines, graphics, signatures, logos, and other objects. The segmentation operation depends on a document analysis step that efficiently predicts the vertical and horizontal distances between objects in Arabic documents. After segmentation, header words are detected by applying a sequence of influential rules within a decision tree that correctly detects the header words in a document image. Finally, the list of header words is extracted from the document image as separate text lines. The extracted header words can be used in many applications, such as word matching, word spotting, document classification, document retrieval, and other applications that depend on word extraction. A dataset of different official printed Arabic documents was constructed and used to test the proposed method. The documents in this dataset were gathered from the websites and offices of various official institutions. The proposed Arabic header word extraction method achieved 96% recall, 98% precision, and a 97% f-score.
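As a quick consistency check on the reported figures, the short sketch below computes the balanced f-score as the harmonic mean of precision and recall. It assumes the reported f-score is the standard F1 measure, which the abstract does not state explicitly; the function name `f_score` and the variable names are illustrative only and are not taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "f-score" is the standard F1 measure.
    """
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, expressed as fractions.
reported_precision = 0.98
reported_recall = 0.96

f1 = f_score(reported_precision, reported_recall)
print(f"F-score: {f1:.4f}")
```

Under that assumption, the result rounds to 0.97, which agrees with the 97% f-score reported above.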