Katrin Tomanek

Data Scientist and Research Engineer

available resources

Overview

Language Resources, Corpora

Timed MUC7

MUC7_T consists of 100 articles from the MUC7 corpus training set, reannotated for named entities (persons, locations, and organizations) with a time stamp indicating the time measured for the linguistic decision-making process. The corpus was developed for two principal purposes: to evaluate selective sampling strategies, such as Active Learning, and to create predictive models of annotation costs. The annotation was performed by two advanced students of linguistics with good English language skills, who followed the original guidelines of the MUC7 named entity task (which can be found in the online documentation for the MUC7 corpus).

Get it from the Linguistic Data Consortium (LDC)

Software

UIMA

I am a co-author of the JCoRe NLP toolsuite, a collection of UIMA-compliant NLP components, as well as a generic UIMA type system.

JCoRe toolsuite download

UIMA type system download

AL Framework

A generic framework for Active Learning in Natural Language Processing tasks. Available for download on request.