I'm taking a (really fun!) database class that involves writing all sorts of PostgreSQL queries and working with Python. Now I'm working on the final project for the class, which involves picking an interesting dataset (well, any dataset will do, but the work is more bearable if it's interesting, of course!) to perform data mining on. Here's an important snippet from the assignment:
The problems/questions will typically be of three basic types:
- classification/estimation: Is it possible to determine, predict, or estimate some attribute in the dataset, based on the values of other attributes? The medical diagnosis and credit-card promotion problems that we have covered in lecture are examples of this type of problem. If you choose a problem of this type, you will employ some type of classification learning and/or numeric estimation.
- finding associations: Are there non-obvious associations or relationships between attributes in the dataset? Market-basket analysis (e.g., finding products that customers tend to purchase together) is one example of this type of problem. If you choose a problem of this type, you will employ some type of association learning.
- grouping entities: Are there non-obvious ways of dividing up the entities described by the dataset into distinct groups? Another way of thinking of this type of problem is as a search for unexpected similarities or differences among the entities. If you choose a problem of this type, you will employ some type of clustering. It tends to be difficult to effectively use this type of approach, so if you choose to employ clustering, you should probably also plan to use one of the other two types of data mining as well.
We can use the Weka data mining software to help us mine this data, and it may be necessary to transform the data. Anyway... I've had some real trouble finding a truly interesting dataset to work on. Do you guys know of any interesting sources for datasets offhand? One of you might, so I thought it was worth asking.
I've already scoured the web multiple times and found a metric ton of sources on open data and datasets. Here is an amazing link on that topic: http://www.readwrite...open_data_on_the.php
The problem isn't so much finding available open data for a data mining project... really, it's PICKING a dataset to work with. There's lots of stuff out there, and I thought you might be able to help me narrow it all down a bit.
I don't want to flood you with information here, but I thought including the requirements for the project might be helpful:
Your final project must include:
- the application of one or more data-mining techniques to a dataset that you choose
- a clear and compelling presentation of the results that you obtain, both from the data mining and any other analysis that you perform. This presentation should include at least one example of a data graphic that follows Tufte's principles, as presented in the guest lectures by Prof. Snyder. You may want to make use of the tools at many-eyes.com to help you with this. The graph template that we gave you for Problem Set 8 may also be useful.
In addition, your project must include two of the following:
- the creation of one or more relational tables from your dataset, and examples of useful queries that you performed on those tables. The tables must be constructed in a way that avoids redundancy and that captures any constraints that are present in the data. One possible use of SQL queries would be to compute averages or other types of summary statistics that you could then present in your report.
- the use of a Python program to manipulate the data in some useful way. For example, you could use a program to extract the data that you need from a relational database, or to discretize one or more attributes in an existing dataset file.
- one or more additional data graphics.
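To make the SQL and Python options above a bit more concrete, here's a rough sketch of what I imagine they could look like. Everything in it is made up for illustration: the patient table, its columns, the toy rows, and the equal_width_bins helper are my own assumptions and not anything from the assignment. I'm also using Python's built-in sqlite3 module instead of PostgreSQL, just so the snippet runs anywhere.

```python
import sqlite3

# Toy rows, invented purely for illustration: (age, diagnosis)
rows = [(18, "healthy"), (25, "healthy"), (40, "sick"), (62, "sick"), (79, "sick")]

# Relational table with constraints (sqlite3 stands in for PostgreSQL here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient ("
    "id INTEGER PRIMARY KEY, "
    "age INTEGER NOT NULL CHECK (age >= 0), "
    "diagnosis TEXT NOT NULL)"
)
conn.executemany("INSERT INTO patient (age, diagnosis) VALUES (?, ?)", rows)

# Summary statistic via SQL: average age per diagnosis.
avg_by_diag = conn.execute(
    "SELECT diagnosis, AVG(age) FROM patient GROUP BY diagnosis ORDER BY diagnosis"
).fetchall()


def equal_width_bins(values, labels):
    """Discretize numeric values into len(labels) equal-width bins.

    Hypothetical helper: one simple way to discretize, not the only one.
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 if every value is identical.
    width = (hi - lo) / len(labels) or 1.0
    binned = []
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), len(labels) - 1)
        binned.append(labels[idx])
    return binned


# Pull the attribute out of the database with Python, then discretize it.
ages = [age for (age,) in conn.execute("SELECT age FROM patient ORDER BY id")]
age_groups = equal_width_bins(ages, ["low", "mid", "high"])

print(avg_by_diag)
print(list(zip(ages, age_groups)))
```

Equal-width binning is just the simplest choice I could think of; equal-frequency bins or hand-picked cutoffs might make more sense depending on the dataset.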
Lots of text, I know... but I tried to make it easy to look at.