Data mining has a slightly ominous ring to it, and recent privacy lapses on social networking sites have prompted us to consider the possibility that our information is no longer solely ours. Where we shop, what we like to eat, what music we’re listening to: all of this is marketing fodder for any large corporation looking for an effective way to reach its target demographic.
Contrary to popular misconceptions, data mining methods have been around for several years now, and they are not all bad. There are various techniques for collecting data in bulk, and different reasons for doing so. Here are a few uses of data mining that actually benefit those involved.
Sifting through huge amounts of data can be quite tedious, even more so when you’re looking for a mistake. (Think of the proverbial needle in a haystack.) However, a representative data set can show what typical and atypical patterns look like, and statistics can then be used to flag anything significantly different. This is the idea behind anomaly detection. The IRS, for example, can build a model of what tax returns typically look like, and then use anomaly detection to single out problematic returns.
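To make the idea concrete, here is a minimal Python sketch of one simple form of anomaly detection: measure how far each new value sits from the average of a "typical" sample, and flag anything more than a few standard deviations away. The function name, the threshold of 3, and the deduction amounts are all made up for illustration; this is not a description of how the IRS actually screens returns.

```python
from statistics import mean, stdev

def find_anomalies(baseline, candidates, threshold=3.0):
    """Return candidates more than `threshold` standard deviations
    away from the mean of the baseline sample."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [x for x in candidates if abs(x - mu) > threshold * sigma]

# Hypothetical deduction amounts from returns assumed to be typical.
typical_deductions = [4800, 5200, 5000, 5100, 4900, 5050, 4950]

# Hypothetical new returns to screen against that baseline.
new_returns = [5020, 4890, 19500, 5150]

print(find_anomalies(typical_deductions, new_returns))  # prints [19500]
```

A flagged value is not proof of a mistake. It only marks the return as unusual enough to deserve a closer look.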
Ever been on Amazon.com? Amazon is an excellent example of association learning in data mining. The company tracks how many customers bought a bathroom wall sconce, and then goes further to find that most customers who bought bathroom wall sconces also tend to buy an extra light bulb or two. The resulting relationships between products are often used to target coupons and bargains, which in turn improves the company’s advertising. Another good example of association learning is the mechanism behind Netflix movie recommendations. The more movies of a certain genre you watch, the more the service tends to recommend movies in that genre or a related one.
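The "customers who bought X also bought Y" relationship can be sketched in a few lines of Python. The example below counts how often pairs of items appear in the same basket and computes the confidence of each rule, meaning the share of baskets containing the first item that also contain the second. The baskets and the function name are invented for illustration, and real retailers almost certainly use far more elaborate methods.

```python
from collections import Counter
from itertools import combinations

def pair_confidence(baskets):
    """Map (a, b) to the confidence of the rule a -> b: the share of
    baskets containing a that also contain b."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        for a, b in combinations(sorted(items), 2):
            # Record the pair in both directions, since a -> b and
            # b -> a can have different confidence values.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return {pair: n / item_counts[pair[0]] for pair, n in pair_counts.items()}

# Hypothetical shopping baskets.
baskets = [
    ["wall sconce", "light bulb"],
    ["wall sconce", "light bulb", "towel rack"],
    ["wall sconce"],
    ["light bulb"],
]

rules = pair_confidence(baskets)
print(rules[("wall sconce", "light bulb")])  # 2 of the 3 sconce baskets
```

Here two of the three baskets with a wall sconce also contain a light bulb, so the rule "wall sconce, then light bulb" has a confidence of about 0.67.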
Cluster detection is a type of pattern recognition in which the computer program identifies similarities among data points and then groups them into clusters based on shared characteristics. The clear advantage here is that data mining programs are more efficient than leaving the job to an analyst in the traditional way. Using computer programs eliminates (or at least significantly reduces) the possibility of missing important categories.
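One well-known way to form clusters is k-means, which repeatedly assigns each value to its nearest center and then moves each center to the average of its group. The sketch below is a deliberately small one-dimensional version. The spending figures, the starting centers, and the function name are assumptions chosen for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the given starting centers with plain k-means."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical yearly spending per customer.
spending = [12, 15, 14, 210, 198, 205, 13]

centers, clusters = kmeans_1d(spending, centers=[0, 100])
print(clusters)  # prints [[12, 15, 14, 13], [210, 198, 205]]
```

Nobody told the program that "low spenders" and "high spenders" exist. The two groups fall out of the data, although the number of clusters still has to be chosen up front.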
Algorithms in data mining can go one step further and create new categories when consistent patterns are detected outside predetermined categories. This is essentially how spam filters work. They rapidly sift through large sets of emails, with the filters programmed to notice key differences in word usage and phrasing between spam messages and legitimate emails.
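A common textbook approach to this kind of filtering is a naive Bayes classifier, which learns how often each word shows up in spam versus legitimate mail and scores new messages accordingly. The following Python sketch uses a tiny made-up training set and hypothetical function names. It assumes both labels appear in the training data, and it is not meant to represent how any particular email provider's filter works.

```python
import math
from collections import Counter

def train(messages):
    """Count words per label from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the higher naive Bayes log score."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages ("ham" means legitimate mail).
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

word_counts, label_counts = train(training)
print(classify("free money now", word_counts, label_counts))  # prints spam
```

With only four training messages the filter is easy to fool, but the principle carries over: the more labeled mail it sees, the better its word statistics become.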
Regression uses past data to predict future values. An example is the algorithm used by Facebook, in which the social site predicts a user’s future behavior by analyzing past content. Personal information comes into the picture as well, along with posts, comments, friend requests (which ones are accepted and which ones are ignored), the number of photos a user is tagged in, and so on.
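The simplest form of regression fits a straight line through past observations and extends it to make a prediction. The Python sketch below does this with ordinary least squares. The numbers and variable names are invented, and a real social network would draw on many more signals than a single line can capture.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: posts per week and comments received per week.
posts = [1, 2, 3, 4, 5]
comments = [3, 5, 7, 9, 11]

slope, intercept = fit_line(posts, comments)

# Predict comments for a week with 6 posts.
predicted = slope * 6 + intercept
print(predicted)
```

In this made-up history each extra post brings two more comments, so the fitted line predicts about 13 comments for a six-post week.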