So yesterday I decided to learn Python. I’ve been primarily a .NET guy for the last n years and have had people around me working in Python, but I was never inclined to try it out. DUH!!!! Such a nice language. It took a couple of minutes to get my bearings, but I figured…why not! Everyone in the Valley is so anti-MS and so pro-(Python, MySQL, PHP) that one needs to embrace the flow.
For the last couple of years I’ve been using a very simple, yet (I believe) strong POS tagger built by Mark Watson and based on Eric Brill’s work. Written in C#, it gave me a very straightforward paring knife for doing tokenization and POS tagging quickly and easily in .NET. Now, Monty Tagger and NLTK are definitely incredible resources for NLP in Python, but I wanted something very straightforward and portable, without all the bells and whistles, so I can build on the core myself. Not to mention I wanted something fun for my first outing in Python. Well…ta da! Here it is.
It consists of two (count them, 2) VERY simple source files. The first is the basic hashing and pickling utility, which you only need if you want to make changes to the lexicon (I believe I’m using the same lexicon file as Monty Tagger). The second is the actual tagger/tokenizer.
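For a rough idea of how the two pieces fit together, here’s a minimal sketch. It is not the code from the download. It assumes a Brill-style lexicon layout (a word followed by its candidate tags, most likely first), and the names `build_lexicon`, `tokenize`, and `tag`, the tiny sample lexicon, and the `NN` fallback for unknown words are all made up for illustration. It also does lexicon lookup only and skips the transformation rules a Brill-style tagger applies afterwards.

```python
import io
import pickle
import re

# Hypothetical sample in a Brill-style lexicon layout: a word followed by its
# candidate tags, most likely tag first. The real lexicon file may differ.
SAMPLE_LEXICON = """\
the DT
dog NN VB
barks VBZ NNS
loudly RB
. .
"""


def build_lexicon(lines):
    """Hash each word to its list of candidate tags."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            lexicon[parts[0]] = parts[1:]
    return lexicon


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens, lexicon, default="NN"):
    """Give each token its first listed tag; unknown words get `default`."""
    tagged = []
    for token in tokens:
        tags = lexicon.get(token) or lexicon.get(token.lower())
        tagged.append((token, tags[0] if tags else default))
    return tagged


# Pickle the hashed lexicon once, then load it wherever the tagger runs.
# An in-memory buffer stands in for the pickle file on disk.
buffer = io.BytesIO()
pickle.dump(build_lexicon(SAMPLE_LEXICON.splitlines()), buffer)
buffer.seek(0)
lexicon = pickle.load(buffer)

print(tag(tokenize("The dog barks loudly."), lexicon))
```

Pickling the dictionary up front means the tagger doesn’t have to re-parse the lexicon text every time it starts, which is presumably why the utility lives in its own file.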
I’ve made some additional tweaks to the versions I run and plan to port some of them to Python as well. If you’re interested in additions, add a comment and I’ll do my best to share/accommodate.
You can download my Python NLP Part-of-Speech Tagger here.
This is my first anything in Python outside of some Hello World stuff. It definitely works, and does so at a decent clip (speed-wise), but I’m sure I could have done some of the operations a little more elegantly. Do leave comments, though, with recommendations/suggestions/!flames.