Software for Text Classification

An important development related to the Comparative Agendas Project is the introduction of automated text classification tools, otherwise known as supervised learning systems. The goal of this research agenda is to substantially lower the cost of topic-coding large numbers of events, including data updates, while maintaining high levels of accuracy and intercoder reliability.

Supervised learning systems mimic the decisions of trained human coders. Research to date indicates that this approach can reduce the number of cases to be manually coded by 70-80%, although results vary depending on the nature of the data and the size of the training sample.

Our methodology draws on well-established algorithms and stemming techniques from the information sciences. The main cost reductions derive from ensemble learning and active learning. When multiple algorithms (an ensemble) make the same topic prediction, human coders can have high confidence that the event has been properly classified. Active learning refers to a human-centered process of identifying cases where the system is performing less well, and intervening with additional training to reduce or eliminate similar mistakes in future rounds. The readings listed under Related Readings describe this subject and our approach in more detail.

RTextTools

RTextTools is a free, open-source machine learning package for automatic text classification that makes it simple for both novice and advanced users to get started with supervised learning. The package includes nine algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks, maximum entropy), comprehensive analytics, and thorough documentation. The package was developed by Timothy P. Jurka at UC Davis, Loren Collingwood at the University of Washington, Amber E. Boydstun at UC Davis, Emiliano Grossman at Sciences Po Paris, and Wouter van Atteveldt at Vrije Universiteit Amsterdam.
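As a rough illustration of the ensemble agreement idea described above, the following Python sketch routes each case either to automatic acceptance or to a human coding queue. It is not RTextTools code (RTextTools is an R package) and not the project's actual procedure; the function name `route_by_agreement`, the `min_agreement` threshold, and the example case IDs and topic labels are all hypothetical.

```python
from collections import Counter


def route_by_agreement(predictions, min_agreement=1.0):
    """Split cases into auto-accepted labels and a human review queue.

    predictions: dict mapping a case ID to the list of topic labels
                 predicted by each algorithm in the ensemble.
    min_agreement: share of algorithms that must agree on the most
                   common label (1.0 means unanimous agreement).
    """
    accepted = {}
    review = []
    for case_id, labels in predictions.items():
        if not labels:
            # No predictions at all: a human has to code this case.
            review.append(case_id)
            continue
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[case_id] = top_label
        else:
            review.append(case_id)
    return accepted, review


# Hypothetical predictions from a three-algorithm ensemble.
predictions = {
    "bill_001": ["health", "health", "health"],
    "bill_002": ["health", "defense", "health"],
    "bill_003": ["energy", "energy", "energy"],
}

accepted, review = route_by_agreement(predictions)
print("Auto-accepted:", accepted)
print("Needs human coding:", review)
```

In an active learning workflow along these lines, the cases in the review queue would be coded by hand and added to the training sample before the next round. Lowering `min_agreement` trades some confidence for fewer manually coded cases.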
The beta release was unveiled at the 4th Annual Conference of the Comparative Policy Agendas Project on June 24, 2011. The full release is available on the installation page. The RTextTools repository is available via Google Code, and the help mailing list is on Google Groups.

Text Tools

Paul Wolfgang (Temple University, wolfgang@temple.edu) has developed software based on the work of Hillard, Purpura and Wilkerson (article below) that enables CAP researchers to apply supervised learning methods to their research. As of February, 200, Text Tools has been updated to include stemmers for many languages. The latest version can be found at: http://www.cis.temple.edu/~wolfgang/

Jonathan Moody (Penn State University, jon.w.moody@gmail.com) has prepared additional documentation, demonstration datasets, and template files to help researchers learn how to use the Text Tools environment. These provide step-by-step instructions for preparing datasets, operating Text Tools, and analyzing the results.

Documentation: Text Tools Documentation and Templates (.rar)

SLTK Auto-coding and Supervised Coding

Hans Then of Pythea (h.then@pythea.nl) has developed a software tool for automated coding and supervised coding. It is based on the Text Tools package put together by Paul Wolfgang.

Link: SLTK tool

Language Stemming

The language-specific stemming algorithms and stop word lists used in RTextTools and Text Tools were developed by Porter (see http://snowball.tartarus.org).

Languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish

Related Readings

Research in this area offers many tips for improving the accuracy of automated methods for a given sample size. We have conducted some experiments with our data and report them in the following papers:

Hillard, Purpura and Wilkerson, "Automated Text Classification for Mixed Methods Social Science Research," Journal of Information Technology and Politics, June 2008

Breeman, Then, Kleinnijenhuis, van Atteveldt, Timmermans,

See also:

Cardie and Wilkerson, Text Annotation for Political Science (Editor's Introduction), Journal of Information Technology and Politics, August 2008