Sofus' Projects Page

These are projects that I am working on or have worked on:

Classification in Networked Data
I am studying classification and learning in networked data. This includes an in-depth study of network-only classification, which uses only the class labels of related instances to estimate the class of a given instance (e.g., classifying research papers based on citation links when the categories of only a few papers in the citation graph are known). As part of this study, I have developed an open-source Network Learning Toolkit (NetKit). Related research includes applying network learning techniques in various domains and active acquisition of secondary data to improve performance.
Worked on with: Foster Provost
Simple Models for Relational Learning
Relational data differs from traditional data in that it violates the instance-independence assumption. Instances can be related, or linked, in various ways. The label of an instance might depend on the instances it is related to, either directly or through arbitrarily long chains of relations. This relational structure further complicates matters, as it makes it hard, if not impossible, to cleanly separate the data into training and test sets without losing much relational information. We are working on baseline methods, such as the Relational Neighbor classifier (RN), to which relational learners should be compared when assessing how well they have extracted a useful model from the given relational structure.
Worked on with: Foster Provost
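As a rough illustration of the kind of baseline described above, here is a minimal Python sketch of a neighbor-vote classifier: the class of a node is estimated from the known labels of the nodes it links to, weighted by link strength. The function name, the dictionary-based graph format, and the toy citation graph are assumptions made for this example; this is not NetKit's API and not necessarily the exact RN formulation.

```python
from collections import Counter


def relational_neighbor_predict(graph, labels, node):
    """Estimate class probabilities for `node` from its labeled neighbors.

    graph:  dict mapping node -> {neighbor: edge weight} (hypothetical format)
    labels: dict mapping node -> known class label; unlabeled nodes are absent
    """
    votes = Counter()
    for neighbor, weight in graph.get(node, {}).items():
        if neighbor in labels:
            # Each labeled neighbor votes for its own class.
            votes[labels[neighbor]] += weight
    total = sum(votes.values())
    if total == 0:
        return {}  # no labeled neighbors, so no estimate
    return {cls: w / total for cls, w in votes.items()}


# Toy citation graph: paper p1 is linked to three papers with known categories.
graph = {
    "p1": {"p2": 1.0, "p3": 1.0, "p4": 1.0},
    "p2": {"p1": 1.0},
    "p3": {"p1": 1.0},
    "p4": {"p1": 1.0},
}
labels = {"p2": "ML", "p3": "ML", "p4": "DB"}

probs = relational_neighbor_predict(graph, labels, "p1")
prediction = max(probs, key=probs.get)
```

Note that the sketch uses no attributes of the node itself, only the labels of its neighbors, which is what makes it a useful point of comparison for more elaborate relational learners.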
Evaluation using ROC Curves
We are investigating the problem of comparing the performance of classifiers using Receiver Operating Characteristic (ROC) analysis. ROC graphs plot false-positive (FP) rates on the x-axis and true-positive (TP) rates on the y-axis. Usually two or more ROC curves are compared in one of three ways: by simple visual inspection without confidence assessments, by focusing on one particular point of the ROC curve and generating confidence intervals around that point, or by comparing the areas under the curves. Little work has studied the soundness of ROC confidence intervals, or their use for comparing entire curves. Preliminary investigations have shown that existing techniques for generating confidence intervals either are not applicable or do not translate well to creating confidence bands, as the resulting bands contain far fewer ROC curves than is desired. This project works to understand why and to identify techniques for generating better confidence bands and intervals.
Worked on with: Foster Provost, Michael L. Littman
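For readers unfamiliar with ROC graphs, the following Python sketch shows how the points of a single ROC curve and the area under it can be computed from classifier scores. It only illustrates the basic construction (FP rate on the x-axis, TP rate on the y-axis); it does not implement any confidence-interval or confidence-band technique. The function names and the toy scores are assumptions made for this example.

```python
def roc_points(scores, labels):
    """Return ROC points (FP rate, TP rate), sweeping a threshold from high to low.

    labels use 1 for positive and 0 for negative; at least one of each is assumed.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Instances with tied scores are added together, giving one point.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: six instances, three positive and three negative.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

Comparing two classifiers by the resulting `auc` values alone is the third of the comparison styles listed above, and it carries no confidence assessment by itself.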
Information Triage
This project introduces a new framework for eliciting a user's interests so that machine learning techniques can learn a user model from multiple information sources. The framework incorporates novel techniques for eliciting a user's interests, a learning methodology for acquiring a complementary user model, and, finally, analysis of the user model for better insight into the domain of the model.
Worked on with: Haym Hirsh, Foster Provost, Ramesh Sankaranarayanan and Vasant Dhar.
Information Valets
We're working on techniques for unsupervised learning of user interest using relevance feedback in a variety of domains. Initial work has been on setting up a generic framework, the Information Valet Framework, to work with multiple devices and multiple information sources.
The EmailValet was the first instantiation of this work. It learns to predict whether to forward a new email message to a user's pager based on the user's past email-reading behavior on the pager.
Worked on with: Haym Hirsh and Aynur A. Dayanik
Using Numerical Features in Text Classifiers
A new technique for incorporating numerical features into text classification systems (e.g., vector space models that operate on tokens and have no knowledge of numbers; the naive Bayes classifier and TF-IDF-based methods are two such systems). We convert numerical features into sets of tokens, using a method much like the "Thermometer Coding" representation, so that close numbers have much overlap in their token sets while distant numbers have less overlap. We have shown that, using this method, standard text classifiers can perform comparably to numerical methods on purely numerical datasets. We are currently investigating datasets that have both textual and numerical features.
Worked on with: Haym Hirsh, Aynur A. Dayanik, and Arunava Banerjee.
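The idea of turning a number into overlapping token sets can be sketched in Python as follows. This is only an illustration of a thermometer-style encoding: the fixed threshold list, the "ge"/"lt" token naming, and the feature name "age" are assumptions made for this example, not the exact encoding used in the project.

```python
def thermometer_tokens(name, value, thresholds):
    """Encode a number as a set of tokens, one token per threshold.

    Each token records on which side of a threshold the value falls, so two
    values share every token except those for thresholds lying between them.
    """
    tokens = set()
    for t in thresholds:
        side = "ge" if value >= t else "lt"
        tokens.add(f"{name}_{side}_{t}")
    return tokens


thresholds = [10, 20, 30, 40, 50]  # hypothetical cut points
a = thermometer_tokens("age", 22, thresholds)
b = thermometer_tokens("age", 27, thresholds)
c = thermometer_tokens("age", 55, thresholds)

overlap_close = len(a & b)  # 22 and 27 are close
overlap_far = len(a & c)    # 22 and 55 are far apart
```

Because the output is just a set of tokens, it can be appended to the word tokens of a document and passed to a token-based text classifier unchanged.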
Agent Architecture/Framework for the Web
This project was carried out while I was at Information Architects. I built an agent architecture to help keep track of publicly accessible information in a distributed manner, minimizing network as well as server load. See my publication on Maintaining information resources.
Worked on with: Leon Shklar
Web Clustering
We are examining techniques for performing data mining over the web. Initial efforts have focussed on methods for clustering of web pages that result from search engine queries, and comparisons to how humans perform this task.
Worked on with: Arunava Banerjee, Brian D. Davison, and Haym Hirsh.
Survey Paper on Believable Agents
A project surveying work that in some way relates to building a believable agent, where a believable agent is defined as any agent that is believable within its context and functionality. (Again, this is a quick definition that somewhat butchers everything a believable agent entails.)
Worked on with my advisor, Haym Hirsh.
People who are interested can get a copy of the last survey draft (dated August 1998) by mailing me.