Bilingual Base Noun Phrase (BaseNP) extraction is one of the key tasks in Natural Language Processing (NLP). The task is more challenging for the English-Vietnamese pair because Vietnamese lacks language resources such as treebanks, part-of-speech taggers, and parsers. In this paper, we propose a combined model that uses statistics-based language characteristics together with the projection method to extract BaseNP correspondences from a bilingual corpus. The language characteristics used in this model include word segmentation, word order, and word classification [1]. Our model not only overcomes the lack of Vietnamese resources but also handles misalignment, null alignment, overlap, and conflicting projections better than existing methods. The proposed model can be easily applied to other language pairs. Experiments on 66,646 sentence pairs from an English-Vietnamese bilingual corpus show that the proposed model performs satisfactorily.


