Saturday, January 25, 2014

Parts of Speech (POS)

I've added jsPOS, Percy Wegmann's (http://www.percywegmann.com/) implementation of the Brill POS tagger, to REve.

I changed the parser to split the text into words and punctuation. I also split contractions such as can't, we're, I'm, she'll and he'd into can n't, we 're, I 'm, she 'll and he 'd, and then added lexical entries for these:

n't JJ (adjective)
're VB
'm VB
'd MD (modal)

's is mapped to [POS,VBZ] (possessive ending, or the verb "is").
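
As a rough sketch of the splitting described above (this is not the actual REve or jsPOS code; the names `SUFFIXES`, `splitContraction` and `tokenize` are made up for illustration, and the regular expression is only one plausible way to separate words from punctuation):

```javascript
// Hypothetical tokenizer sketch; not the actual REve or jsPOS code.
// Contraction suffixes that get split off into their own token.
const SUFFIXES = ["n't", "'re", "'m", "'ll", "'d", "'s"];

function splitContraction(word) {
  const lower = word.toLowerCase();
  // Special case so that "can't" becomes "can n't", as in the example above.
  if (lower === "can't") return [word.slice(0, 3), "n't"];
  for (const suffix of SUFFIXES) {
    // Only split when something is left in front of the suffix.
    if (lower.length > suffix.length && lower.endsWith(suffix)) {
      const cut = word.length - suffix.length;
      return [word.slice(0, cut), word.slice(cut)];
    }
  }
  return [word];
}

function tokenize(text) {
  // A word (optionally with internal apostrophes) or a single punctuation mark.
  const raw = text.match(/[A-Za-z0-9]+(?:'[A-Za-z]+)*|[^\sA-Za-z0-9]/g) || [];
  return raw.flatMap(splitContraction);
}

console.log(JSON.stringify(tokenize("We're late, I can't stay.")));
// -> ["We","'re","late",",","I","can","n't","stay","."]
```

Each split-off suffix then needs its own lexicon entry, as with the entries listed above.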

Then I added a few post-tagging steps, notably one that changes POS to VBZ.
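
A post-tagging step of this kind might look like the sketch below. The trigger condition (a 's that follows a pronoun-like tag) is my assumption for illustration, and `VERB_CONTEXT` and `fixPossessive` are hypothetical names, so treat it as one possible rule rather than the rule REve uses:

```javascript
// Hypothetical post-tagging pass; the trigger condition is an assumption.
// Tags after which a 's is more likely "is" than a possessive ending.
const VERB_CONTEXT = new Set(["PRP", "EX", "WP", "WDT"]);

function fixPossessive(tagged) {
  // tagged is an array of [word, tag] pairs; a new array is returned.
  return tagged.map(([word, tag], i) => {
    const prevTag = i > 0 ? tagged[i - 1][1] : null;
    if (word === "'s" && tag === "POS" && VERB_CONTEXT.has(prevTag)) {
      return [word, "VBZ"]; // "she 's" reads as "she is"
    }
    return [word, tag];
  });
}

console.log(fixPossessive([["she", "PRP"], ["'s", "POS"], ["here", "RB"]]));
```

With this rule "she 's here" gets VBZ, while "dog 's bone" keeps POS because NN is not in the assumed context set.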

I also added tests in the tagger to check that contextual rules only change a word's tag when the new tag is in the word's list of possible tags.
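
The check can be sketched as follows. The lexicon, the rule format and the function name `applyRules` are all hypothetical and much simpler than the real thing; the point is only the `possible.includes(rule.to)` guard:

```javascript
// Hypothetical data: the lexicon maps a word to its list of possible tags.
const lexicon = {
  the: ["DT"],
  dream: ["NN", "VB", "VBP"],
  to: ["TO"],
};

// One simple rule shape: change `from` to `to` when the previous tag is `prevTag`.
const rules = [
  { from: "NN", to: "VB", prevTag: "TO" },
  { from: "DT", to: "VB", prevTag: "TO" },
];

function applyRules(tagged) {
  const out = tagged.map(pair => pair.slice()); // copy the [word, tag] pairs
  for (const rule of rules) {
    for (let i = 1; i < out.length; i++) {
      const [word, tag] = out[i];
      // Assumption: words missing from the lexicon are never retagged here.
      const possible = lexicon[word.toLowerCase()] || [];
      const matches = tag === rule.from && out[i - 1][1] === rule.prevTag;
      // The guard: only retag when the new tag is listed for this word.
      if (matches && possible.includes(rule.to)) out[i][1] = rule.to;
    }
  }
  return out;
}

console.log(applyRules([["to", "TO"], ["dream", "NN"]]));
```

Here "dream" after "to" is retagged as VB because VB is among its possible tags, whereas "the" after "to" stays DT even though the second rule matches its context.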

The original code applied only 8 transformation rules. I added a set of ~200 contextual rules borrowed from my Python code, which in turn were borrowed from Brill.

The rule set seems to include a few tags that never match.

Tagging accuracy is probably below 85%.

The lexicon is from the WSJ corpus and includes many business and finance terms that are unlikely to be matched in dream descriptions. I removed several hundred of these, for example:

Allgemeine
TuHulHulZote
"non-interest-bearing"
"property-and-casualty"
"junk-bond-financed"
yff
"F.S.L.I.C"
"Bonds-b"
"DIAL-A-PIANO-LESSON"
"J.J.G.M."
"Asia\\"
"Junk-bond"
"Junk-bonds"
"junk-bond"

The lexicon did not include "I". I also replaced the method for tagging CD (cardinal numbers).
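
One simple way to tag cardinal numbers is a pattern test on the token. The sketch below is an assumption about what such a method could look like, not the code used in REve; `NUMBER` and `tagCardinal` are hypothetical names, and the pattern only covers digits with optional thousands separators and a decimal part:

```javascript
// Assumed pattern: optional sign, digits, optional ",ddd" groups, optional decimals.
const NUMBER = /^[+-]?\d+(?:,\d{3})*(?:\.\d+)?$/;

function tagCardinal(word, tag) {
  // Override the current tag with CD when the token looks like a number.
  return NUMBER.test(word) ? "CD" : tag;
}

console.log(tagCardinal("2014", "NN"));  // -> CD
console.log(tagCardinal("1,000", "NN")); // -> CD
console.log(tagCardinal("room", "NN"));  // -> NN
```

Number words such as "three" would still have to come from the lexicon.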

The result uses the color text icons described in my previous post.


If you hover the mouse over a tag, its description is shown in a tooltip.

Here's a list of the tags and their parts of speech:
  • CC: Coordinating conjunction
  • CD: Cardinal number
  • DT: Determiner
  • EX: Existential there
  • FW: Foreign Word
  • IN: Preposition or subordinating conjunction
  • JJ: Adjective
  • JJR: Adjective, comparative
  • JJS: Adjective, superlative
  • LS: List item marker
  • MD: Modal
  • NN: Noun, singular or mass
  • NNP: Proper noun, singular
  • NNPS: Proper noun, plural
  • NNS: Noun, plural
  • POS: Possessive ending
  • PDT: Predeterminer
  • PRP: Personal pronoun
  • PRP$: Possessive personal pronoun
  • RB: Adverb
  • RBR: Adverb, comparative
  • RBS: Adverb, superlative
  • RP: Particle
  • SYM: Symbol
  • TO: to
  • UH: Interjection
  • VB: Verb, base form
  • VBD: Verb, past tense
  • VBG: Verb, gerund or present participle
  • VBN: Verb, past participle
  • VBP: Verb, non-3rd person singular present
  • VBZ: Verb, 3rd person singular present
  • WDT: Wh-determiner
  • WP: Wh-pronoun
  • WP$: Possessive wh-pronoun
  • WRB: Wh-adverb
  • !: Exclamation
  • ,: Comma
  • .: Sentence-final punctuation
  • :: Mid-sentence punctuation
  • $: Dollar sign
  • #: Pound sign
  • ": Quote
  • (: Left parenthesis
  • ): Right parenthesis
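
For the hover descriptions, one minimal approach is to put the description from the list above into a `title` attribute, which browsers show as a tooltip. This is only a sketch: the markup, the class name and the names `TAG_DESCRIPTIONS`, `escapeHtml` and `tagSpan` are hypothetical, and only a few tags are included:

```javascript
// Subset of the tag list above; markup and names are hypothetical.
const TAG_DESCRIPTIONS = new Map([
  ["DT", "Determiner"],
  ["NN", "Noun, singular or mass"],
  ["VBZ", "Verb, 3rd person singular present"],
]);

function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;")
          .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

function tagSpan(word, tag) {
  // Fall back to the bare tag when no description is known.
  const title = TAG_DESCRIPTIONS.get(tag) || tag;
  return `<span class="tag" title="${escapeHtml(title)}">` +
         `${escapeHtml(word)}/${escapeHtml(tag)}</span>`;
}

console.log(tagSpan("dream", "NN"));
// -> <span class="tag" title="Noun, singular or mass">dream/NN</span>
```
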
The original pos-js is Copyright 2010, Percy Wegmann and is available at: https://github.com/fortnightlabs/pos-js
Licensed under the LGPLv3 license
http://www.opensource.org/licenses/lgpl-3.0.html
