Author Topic: A data set for data mining? (Read 6148 times)

moerl · « **on:** April 29, 2008, 08:43 PM »

I'm taking a (really fun!) database class that involves all sorts of PostgreSQL queries and working with Python. Now I'm working on the final project for the class, which involves picking an interesting dataset (well, any dataset will do, but the work is more bearable if it's interesting, of course!) to perform data mining on. So here's an important snippet from the assignment:

The problems/questions will typically be of three basic types:
classification/estimation: Is it possible to determine, predict, or estimate some attribute in the dataset, based on the values of other attributes? The medical diagnosis and credit-card promotion problems that we have covered in lecture are examples of this type of problem. If you choose a problem of this type, you will employ some type of classification learning and/or numeric estimation.
finding associations: Are there non-obvious associations or relationships between attributes in the dataset? Market-basket analysis (e.g., finding products that customers tend to purchase together) is one example of this type of problem. If you choose a problem of this type, you will employ some type of association learning.
grouping entities: Are there non-obvious ways of dividing up the entities described by the dataset into distinct groups? Another way of thinking of this type of problem is as a search for unexpected similarities or differences among the entities. If you choose a problem of this type, you will employ some type of clustering. It tends to be difficult to effectively use this type of approach, so if you choose to employ clustering, you should probably also plan to use one of the other two types of data mining as well.
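
To make the "finding associations" type concrete, here is a minimal Python sketch of market-basket counting. It is only an illustration: the baskets are invented toy data, the helper names `pair_support` and `confidence` are hypothetical, and a real project would more likely use an association learner in a tool such as Weka.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both items}."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical tuple form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for b in with_antecedent if consequent in b)
    return hits / len(with_antecedent)


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

support = pair_support(baskets)
bread_to_milk = confidence(baskets, "bread", "milk")
```

Pairs with high support and high confidence are candidates for a "customers who buy X also buy Y" rule; whether a rule is actually non-obvious still takes judgment.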

We can use the Weka data mining software to help us mine this data, and it may be necessary to transform the data first. Anyway, I've had some real trouble finding a truly interesting dataset to work on. Do you guys know any interesting sources for data sets offhand? One of you might, so I thought it was worth asking.

I've already scoured the web multiple times and found a metric ton of sources on open data and data sets. Here is an amazing link on that topic: http://www.readwrite...open_data_on_the.php

The problem isn't so much finding available open data for use in a data mining project... really, it's PICKING a data set to work with. There's lots of stuff out there and I thought you may be able to help me narrow it all down a bit.

I don't want to flood you with information here, but I thought including the Requirements for the project may be helpful:

Requirements
Your final project must include:
the application of one or more data-mining techniques to a dataset that you choose
a clear and compelling presentation of the results that you obtain, both from the data mining and any other analysis that you perform. This presentation should include at least one example of a data graphic that follows Tufte's principles, as presented in the guest lectures by Prof. Snyder. You may want to make use of the tools at many-eyes.com to help you with this. The graph template that we gave you for Problem Set 8 may also be useful.

In addition, your project must include two of the following:
the creation of one or more relational tables from your dataset, and examples of useful queries that you performed on those tables. The tables must be constructed in a way that avoids redundancy and that captures any constraints that are present in the data. One possible use of SQL queries would be to compute averages or other types of summary statistics that you could then present in your report.
the use of a Python program to manipulate the data in some useful way. For example, you could use a program to extract the data that you need from a relational database, or to discretize one or more attributes in an existing dataset file.
one or more additional data graphics.
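
As a rough sketch of what the relational-table and Python options could look like together: the snippet below builds a small table, computes a summary statistic with SQL, and discretizes a numeric attribute in Python. Everything in it is an assumption made for illustration: the `person` table, its columns, the rows, and the cut points are invented, and it uses Python's built-in `sqlite3` module so that it runs on its own, rather than the database server used in the class.

```python
import sqlite3


def discretize(value, cut_points, labels):
    """Return the label of the first bin whose upper cut point exceeds `value`.

    `labels` must have one more entry than `cut_points`; the last label
    covers everything at or above the final cut point.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Hypothetical rows (name, age, income); not taken from any real dataset.
rows = [
    ("a", 23, 30000.0),
    ("b", 41, 52000.0),
    ("c", 67, 41000.0),
    ("d", 35, 61000.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "name TEXT PRIMARY KEY, age INTEGER NOT NULL, income REAL NOT NULL)"
)
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)

# Summary statistic computed in SQL.
avg_income = conn.execute("SELECT AVG(income) FROM person").fetchone()[0]

# Extract the data and turn the numeric age into a nominal attribute.
cuts = [30, 60]
labels = ["young", "middle", "senior"]
binned = [
    (name, discretize(age, cuts, labels), income)
    for name, age, income in conn.execute(
        "SELECT name, age, income FROM person ORDER BY name"
    )
]
conn.close()
```

The cut points here are arbitrary; in practice they would come from looking at the distribution of the attribute.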

Lots of text, I know, but I tried to make it easy to look at.

mouser · « **Reply #1 on:** April 29, 2008, 08:50 PM »

Some machine learning datasets:
http://archive.ics.uci.edu/ml/

moerl · « **Reply #2 on:** April 29, 2008, 11:30 PM »

Some machine learning datasets:
http://archive.ics.uci.edu/ml/
-mouser (April 29, 2008, 08:50 PM)

Yep, I came across that one before. It's a gold mine of data sets. Most of them are pretty complex though, with 10+ attributes in the set. I suppose I can use one of them, but it may make matters more complicated. I'll take a second look around. Thanks!

Renegade · « **Reply #3 on:** April 30, 2008, 12:13 PM »

Data set? You wanna data set?

Go check out that data set for Netflix.

There's a $1,000,000 reward for any algorithm that betters what they have by 10%.

But if you want a data set... They've got one for you!

Now the chances that you'll win are small, but it's one heck of a data set that they're making freely available!

AND!!! You'll be able to measure your progress too! (Well, if you're good enough to submit anything that is.)

Renegade · « **Reply #4 on:** April 30, 2008, 12:15 PM »

Here are the terms:

Terms and Conditions in a Nutshell

Contest begins October 2, 2006 and continues through at least October 2, 2011.

Contest is open to anyone, anywhere (except certain countries listed below).

You have to register to enter.
Once you register and agree to these Rules, you’ll have access to the Contest training data and qualifying test sets.

To qualify for the $1,000,000 Grand Prize, the accuracy of your submitted predictions on the qualifying set must be at least 10% better than the accuracy Cinematch can achieve on the same training data set at the start of the Contest.

To qualify for a year’s $50,000 Progress Prize the accuracy of any of your submitted predictions that year must be less than or equal to the accuracy value established by the judges the preceding year.

To win and take home either prize, your qualifying submissions must have the largest accuracy improvement verified by the Contest judges, you must share your method with (and non-exclusively license it to) Netflix, and you must describe to the world how you did it and why it works.
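
For anyone wondering what "10% better" means in practice: the contest scores predictions by root mean squared error (RMSE), where lower is better, which is also why the Progress Prize wording says "less than or equal to". Below is a minimal Python sketch of that arithmetic. The ratings and predictions are made-up toy numbers, not contest data, and `relative_improvement` is a hypothetical helper name.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def relative_improvement(baseline_rmse, new_rmse):
    """Fractional reduction in RMSE relative to the baseline (0.10 means 10%)."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Toy ratings on a 1-5 scale, invented for illustration only.
actual = [4, 3, 5, 2]
baseline_predictions = [3, 3, 4, 3]
candidate_predictions = [4, 3, 4, 2]

baseline_score = rmse(baseline_predictions, actual)
candidate_score = rmse(candidate_predictions, actual)
improvement = relative_improvement(baseline_score, candidate_score)
qualifies = improvement >= 0.10
```

On four toy ratings a large improvement is easy to get; on the real qualifying set the same calculation is far less forgiving.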

Lashiec · « **Reply #5 on:** April 30, 2008, 06:43 PM »

Hey, is that contest still going? mouser, what happened with the project you and Gothi[c] were working on to submit there?
