Systems Engineering and RDBMS

Data Scientist

Posted by decipherinfosys on December 22, 2010

Data mining has quickly moved on from being just a buzzword to something that most organizations are now adopting to make informed decisions.  A client of ours recently asked whether we have someone on staff who is a “Data Scientist”.  If you are not familiar with the term, here are some links to get you started:

  • A recent post on GigaOm.
  • A Flowing Data post from 2009.

There are many other good posts that you can find by searching for the term “Data Scientist”.  In short, a data scientist is a person who is well versed in the techniques of gleaning meaningful information from data drawn from different sources.  The expertise involves several key areas:

  1. Procuring the data, in other words data acquisition (through feeds, web crawlers, internal data sources, social media etc.),
  2. Scrubbing and managing the data using proper ETL, queries, key-value pairs etc.,
  3. Modeling and interpreting the data using analytics, which could draw on different kinds of techniques from multivariate statistics, NLP, machine learning etc.,
  4. Visualizing and presenting the data, and
  5. Translating it into meaningful information that answers the business questions.
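To make the five areas above concrete, here is a minimal sketch in Python that walks a toy data set through each step.  Everything in it is an assumption made for illustration: the function names, the hard-coded monthly sales figures, and the choice of a simple least-squares line as the "model".  A real project would pull from actual sources and use far richer techniques.

```python
from statistics import mean


def acquire():
    """Step 1, acquisition: hard-coded rows stand in for feeds or crawlers."""
    return [
        {"month": 1, "sales": "100"},
        {"month": 2, "sales": "120"},
        {"month": 3, "sales": None},  # a missing value that needs scrubbing
        {"month": 4, "sales": "160"},
        {"month": 5, "sales": "180"},
    ]


def scrub(rows):
    """Step 2, scrubbing: drop incomplete rows and convert text to numbers."""
    return [(r["month"], float(r["sales"])) for r in rows if r["sales"] is not None]


def fit_line(points):
    """Step 3, modeling: least-squares fit of sales = slope * month + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def render(points):
    """Step 4, visualization: a plain-text bar chart, one '#' per 10 units."""
    return "\n".join(f"month {m}: {'#' * int(s // 10)}" for m, s in points)


def summarize(slope, intercept, next_month):
    """Step 5, translation: turn the model into a sentence for the business."""
    forecast = slope * next_month + intercept
    return (
        f"Sales grow by about {slope:.0f} per month; "
        f"forecast for month {next_month} is {forecast:.0f}."
    )


if __name__ == "__main__":
    points = scrub(acquire())
    slope, intercept = fit_line(points)
    print(render(points))
    print(summarize(slope, intercept, next_month=6))
```

Each function maps to one numbered item, which also shows why the work tends to be split across people: the acquisition and scrubbing steps are mostly data engineering, while the modeling and translation steps lean on statistics and domain knowledge.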

Essentially, it is the mining of data to glean meaningful information and to present it so that the end user can interact with the data, look at the effect of different parameters, and get a predictive analysis.  KDD (Knowledge Discovery in Databases) is another term used for it.

A data scientist may not be an expert in all the areas listed above but will possess depth in certain areas and be familiar enough with the others to quickly perform basic tasks in them.  Typically, a team of data scientists will work together on such a project.

So, what kinds of business problems can be solved by the approaches listed above?  Forecasting is just one of them.  Other areas are: risk management (for example, as used by insurance companies or the banking industry), churn analysis (for example, many telecommunications companies use these algorithms to help retain their customer base), marketing to the right segment (for example, companies like Amazon customize their presentation to cater to your needs), and detecting anomalies (for example, money laundering or credit card fraud).
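As a small illustration of the anomaly-detection case, the sketch below flags unusual transaction amounts using a modified z-score based on the median absolute deviation.  This is only one simple technique among many; the function name, the sample amounts, and the 3.5 cutoff (a commonly used rule of thumb) are assumptions for illustration, and production fraud detection would use far more signals than the amount alone.

```python
from statistics import median


def flag_anomalies(amounts, threshold=3.5):
    """Return the indexes of amounts whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single extreme value does not distort the baseline.
    """
    center = median(amounts)
    mad = median(abs(a - center) for a in amounts)
    if mad == 0:
        # All values (or at least half of them) are identical; nothing to score.
        return []
    return [
        i
        for i, a in enumerate(amounts)
        if abs(0.6745 * (a - center) / mad) > threshold
    ]


if __name__ == "__main__":
    # Hypothetical card transactions; one of them is far outside the usual range.
    amounts = [42.0, 38.5, 51.0, 47.25, 40.0, 44.75, 980.0, 39.9]
    for i in flag_anomalies(amounts):
        print(f"transaction {i} looks unusual: {amounts[i]:.2f}")
```

The median-based score is chosen here over a plain mean and standard deviation because, in a small sample, one large outlier inflates the standard deviation enough to hide itself.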

Needless to say, there are a lot of tools and techniques out there that can help you build such solutions.  In our posts, we will be using SQL Server 2008 R2’s Data Mining features to demonstrate how the Microsoft tools can be used to build such a solution.
