Towards a Historical Corpus of early Singapore English

Project Number
SUG 16/15 JL

Project Duration
March 2016 - June 2017


This project investigates the history of lexical items in Singapore English by building a corpus of early Singaporean texts. The aim is to download as many freely available texts relating to Singapore as possible from two free data services: the Internet Archive and Google Books.

Internet Archive books can be downloaded in both DjVu and plain text formats. The DjVu format provides scanned pages in an easily accessible and searchable form, and each book has an accompanying plain text (ASCII) file that is an OCR’d version of the scanned pages. These plain text files need to be edited before they can be incorporated into a corpus. Freely available Google Books are in PDF format and will need to be OCR’d to obtain plain text files, so they will require more processing time. However, as most of the texts freely available on Google Books are also available from the Internet Archive, there should be only a limited number of these.

The resulting corpus is projected to be on the order of 20 million words, and will thus represent a substantial body of language for scholarly investigation. As with all OCR’d texts, a certain error rate is present, but many errors will be corrected during editing, and the impact of those that remain will be reduced by the overall size of the corpus. Once the texts are amalgamated into a single corpus database, the corpus can be searched using a variety of techniques to extract loanwords. The data gathered will be valuable for academic work on the history of Singapore English, and will also contribute to a citation collection that can be used for lexicographical publications on Singapore English.
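
As a rough illustration of the editing and search steps described above, the following Python sketch cleans a fragment of OCR output and runs a keyword-in-context search over it. It is not the project's actual tooling: the function names `clean_ocr_text` and `kwic`, the cleanup rules, and the sample sentence are all assumptions made for this example.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Light cleanup of OCR output (illustrative rules only)."""
    # Rejoin words split by a hyphen at a line break, e.g. "kam-\npong".
    # Caveat: this also joins genuine hyphenated compounds that happen
    # to break at the end of a line.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse runs of whitespace (including line breaks) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def kwic(text: str, term: str, width: int = 30):
    """Return (left context, match, right context) for each whole-word,
    case-insensitive occurrence of `term` in `text`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


# Made-up sample text standing in for one OCR'd page.
sample = "The village is a kam-\npong near the river.  Each kampong has a headman."
cleaned = clean_ocr_text(sample)
for left, word, right in kwic(cleaned, "kampong"):
    print(f"{left:>30} [{word}] {right}")
```

In practice the same search would run over every cleaned text file in the corpus, and each hit, together with its source and date, could feed the citation collection.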

Funding Source

