Publication: Investigation of feature selection for historical document layout analysis


Authors: Hao Wei, Kai Chen, Anguelos Nicolaou, Marcus Liwicki, Rolf Ingold


Published in: IPTA 2014 (Paris)


Abstract:
In this paper we investigate the importance of individual features for the task of document layout analysis, in particular for the classification of document pixels. The feature set consists of numerous state-of-the-art features, including color, gradient, and local binary patterns (LBP). To deal with the high dimensionality of the feature set, we propose a cascade of an adapted forward selection and a genetic selection. We evaluated our feature selection method on three historical document datasets. For the classification we used machine learning methods that classify each pixel as periphery, background, text block, or decoration. The proposed cascading feature selection method reduced the number of features significantly while preserving the cross-validation performance. Furthermore, it selected fewer features than conventional feature selection methods while achieving comparable performance. In our analysis we found that LBP features are consistently selected by all feature selection methods on all three datasets. This indicates that LBP features correlate with the pixel classes much more strongly than any other type of feature. These findings provide a clue for the paradigm of document layout analysis in general.
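
The cascade mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the forward selection was adapted, so plain greedy forward selection is used here, and the genetic stage is a generic bit-mask genetic algorithm. The feature names, the `coverage` table, and the `score` function are hypothetical stand-ins for real features and a cross-validation score.

```python
import random

def forward_selection(features, score):
    """Plain greedy forward selection: add the best feature until no gain."""
    selected, remaining = [], list(features)
    best = score(selected)
    while remaining:
        top_score, top_feature = max(
            ((score(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected

def genetic_selection(candidates, score, generations=30,
                      population_size=20, mutation_rate=0.1, seed=0):
    """Generic genetic search over keep/drop masks of the candidates."""
    rng = random.Random(seed)
    n = len(candidates)

    def decode(mask):
        return [f for f, keep in zip(candidates, mask) if keep]

    def fitness(mask):
        return score(decode(mask))

    # Seed the population with the full candidate set plus random masks.
    population = [[True] * n] + [
        [rng.random() < 0.5 for _ in range(n)]
        for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]  # elitism: keep best half
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n) if n > 1 else 0
            child = a[:cut] + b[cut:]  # one-point crossover
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]  # bit-flip mutation
            children.append(child)
        population = parents + children
    return decode(max(population, key=fitness))

# Hypothetical toy problem: each feature "explains" some items, and every
# selected feature costs 0.5. This stands in for a cross-validation score.
coverage = {"f_a": {1, 2, 3, 4}, "f_b": {1, 2, 5}, "f_c": {3, 4, 6}}

def score(subset):
    covered = set().union(*(coverage[f] for f in subset))
    return len(covered) - 0.5 * len(subset)

shortlist = forward_selection(list(coverage), score)  # stage 1: shortlist
final = genetic_selection(shortlist, score)           # stage 2: refine
print(shortlist, final)
```

In this toy problem the greedy stage keeps all three features, because each one adds some gain at the moment it is considered. The genetic stage searches subsets of that shortlist, so it can drop a feature that became redundant later. Since the full shortlist is in the initial population and the best half always survives, the final subset never scores worse than the shortlist.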



BibTeX entry:
@inproceedings{wei2014investigation,
title={Investigation of feature selection for historical document layout analysis},
author={Wei, Hao and Chen, Kai and Nicolaou, Anguelos and Liwicki, Marcus and Ingold, Rolf},
booktitle={Image Processing Theory, Tools and Applications (IPTA), 2014 4th International Conference on},
pages={1--6},
year={2014},
organization={IEEE}
}


Creative Commons License
Any work on this page other than source code or program binaries is licensed under a Creative Commons Attribution 4.0 International License. When applicable, attribution should be in the form of a citation.