Saturday, May 9, 2015

Statistical Modeling and Data Validation Made Easy for Research Scholars and Analytics Professionals

Shailendra Kadre

10th March, 2015


Statistical methods are used extensively in many business domains, in sociology and economics, and in scientific domains such as the life sciences. More than 70% of research projects use quantitative methods. Quantitative research involving statistical methodologies may start with the collection of a large amount of data, perhaps based upon a hypothesis or a theory, or the data may even result from financial transactions involving credit cards and other sources. This kind of data is called raw data, and it usually needs recording, verification, and validation before it can be used for any further analysis. Data analytics needs a scientific approach to problem solving. The typical steps involved are

  1. Identification of data sources
  2. Data collection and recording in a homogeneous form, such as relational database tables. The data recording methodologies may differ greatly from one research project to another.
  3. Verification, validation, and cleaning of data
  4. Model, theory, or hypothesis generation
  5. Validation of the model
  6. Drawing inferences from the analysis



Let’s discuss a couple of practical examples to get a feel of what we are talking about. In clinical trials, for example, a team of new drug developers might be interested in studying drug intake and measurable physiological effects such as weight loss, or the efficacy of a particular drug in preventing a disease in humans. Another example can be a structured social media survey on the general perception among a section of business professionals of the effectiveness of big data in solving their supply chain problems. Day-to-day weather prediction is also based upon the collection of vast amounts of data on parameters like temperature, the concentration of dust particles and other gases, and so on.

A new book that makes modeling and many other statistical methods easy to apply

Coauthor Venkat Reddy and I started working on a book titled Practical Business Analytics Using SAS: A Hands-on Guide. We spent a couple of years looking at books on business analytics and predictive modeling. Some of them were really good, but most were too intense and deep on the theory and mathematics of the statistical algorithms that are an integral part of this subject. Some people like books that take that tack, but most practitioners, even those in the industry, don’t have the deep mathematical background required or the interest in acquiring it. Working professionals, particularly newcomers to the field of business analytics, are not very comfortable with the deep theoretical treatment of statistical algorithms generally provided in most books available on analytics. The market need we discerned, therefore, was to simplify the presentation of algorithms for professionals who don’t need to know every detail to succeed in their work. Besides, once introduced to the subject, one can always refer to advanced texts on statistics if such academic rigor is required. The good news is that today’s analytics software, like SAS, is designed to do most of the math. Thus, we strongly felt there was a need for a book like this one, which takes the power of the software into account and, at the same time, simplifies the mathematical concepts involved in the process. With this motivation in mind, we started our work, and we strongly feel we have been successful in showing you how to use SAS to perform common analytical procedures while providing the basic knowledge of statistics required. The book keeps the theoretical part as simple as possible yet uses numerous real business scenarios to explain the concepts and the way they are used in the industry. Venkat’s working experience with the world’s leading banks and his vast experience as an analytics trainer have come in handy in designing the case studies and examples used in this book.

Here are the contents of the book, Practical Business Analytics Using SAS:

About the Authors
Acknowledgments
Preface
■ Part 1: Basics of SAS Programming for Analytics
■ Chapter 1: Introduction to Business Analytics and Data Analysis Tools
■ Chapter 2: SAS Introduction
■ Chapter 3: Data Handling Using SAS
■ Chapter 4: Important SAS Functions and Procs
■ Part 2: Using SAS for Business Analytics
■ Chapter 5: Introduction to Statistical Analysis
■ Chapter 6: Basic Descriptive Statistics and Reporting in SAS
■ Chapter 7: Data Exploration, Validation, and Data Sanitization
■ Chapter 8: Testing of Hypothesis
■ Chapter 9: Correlation and Linear Regression
■ Chapter 10: Multiple Regression Analysis
■ Chapter 11: Logistic Regression
■ Chapter 12: Time-Series Analysis and Forecasting
■ Chapter 13: Introducing Big Data Analytics
Index




You can buy this book right now at Amazon. Reviews and suggestions are welcome at shailendrakadre@gmail.com



Wednesday, April 22, 2015

How to Choose Your Data Analysis Tool...

This post discusses some of the business analytics tools more commonly used in the industry today. It is not enough for a data analyst to learn just one tool. Analysts need to apply different tools as the situation or the problem at hand demands. A general knowledge of the strengths and weaknesses of these tools will definitely add value to a data analytics career.


This post discusses the features of, and gives resources for further information on, three industry-leading tools: SAS, R, and SPSS.

Business analytics aims to model data in order to discover useful information or business insights and to support decision making. It requires various operations on the data, such as reviewing, cleansing, transforming, modeling, validation, and interpretation, to gain useful business insights. Sometimes the data sets may have a million records or more. Handling and operating on such complex data requires automated analysis tools. Fortunately, many good tools are available today. A simple Google search for data analysis tools will turn up a number of them, many of which are open source and free to use. SAS, SPSS, and R are the most widely used software packages today, at least for business analytics applications. R is the most popular and widely used statistical analysis package in the open source category, and SAS and SPSS are the two most widely used commercially available data analysis packages.

The SAS tool has been around since the 1970s. There are so many legacies built using this tool that most companies in the corporate world that are into business analytics at any level continue to use SAS. R was introduced in 1996. Over the years, a lot of new capabilities and features have been built around R, making it one of the most powerful open source data analysis tools available today. This makes it popular in the research and academic communities, and many companies in the corporate world have also started using R. SPSS has existed even longer, since 1968, and has a strong user base in the social sciences and many other communities.

Commonly Used Data Analysis Software

In the following sections, we talk about some commonly used data analysis software and how to make a choice among them. SAS, SPSS, and R may be termed the most commonly used software in the industry.
SAS
·         Most widely used commercial advanced analytics tool
·         Has a lot of predictive modeling algorithms
SPSS
·         Has good text mining and data mining algorithms
R
·         Most widely used open source analytics tool
·         Has several packages for data analysis
MATLAB
·         Widely used for numerical analysis and computing
RapidMiner
·         Good GUI-based tool for segmentation and clustering; can also be used for conventional modeling
·         Open source
Weka
·         Open source
·         Machine learning tool
SAP
·         Tool for managing business operations and customer relations
·         Most widely used operations tracking tool
Minitab
·         A lighter-weight analytics tool
Apache Mahout
·         Open source
·         Advanced analytics tool for big data
Other Tools
·         Statistica
·         KXEN Modeler
·         GNU Octave
·         Knime

Choosing a Tool

The final choice of data analysis tool to be used depends upon many considerations.
·         The business application, its complexity, and the level of expertise available in the organization.
·         The long-term business, information, and analytics strategy of the organization.
·         Existing organizational processes.
·         Budgetary constraints.
·         The investments proposed or already made in processing hardware, which in turn might decide factors such as the processing power and memory available for the software to run.
·         Overall organization structure and where the analytics work is being carried out.
·         Project environment and governance structure.
·         The company’s comfort level with open source software, and warranties and other legal considerations.
·         The size of data to be handled.
·         The sophistication of graphics and presentation required in the project.
·         Which analytics techniques will be used, and how frequently.
·         How the current data is organized and how comfortable the team is in handling data.
·         Whether a data warehouse is in place and how adequately it covers business and customer information that may be required for the analysis.
·         Legacy considerations. Is any other similar tool in use already? If yes, how much time and resources are required for any planned switch-over?
·         Return-on-investment expectations.
Many more considerations specific to an organization or a project can be added to this list. The order of importance of these considerations may vary from person to person, from project to project, and from one organization to another. The final decision, however, is not an easy one. The later sections of this post list a few comparative features of SAS, SPSS, and R, which might help you decide which tool best suits your needs. Finally, instead of zeroing in on a single tool, it is also possible to use multiple tools for different business analytics needs.

In some cases, it might be concluded that a simple spreadsheet application, such as Microsoft Excel, is the most convenient and effective option and still gives the insights required to solve the business problem at hand.

Sometimes a single analytics project might require the use of more than one tool. A data analyst will be expected to apply different software tools depending on the problem at hand.

Main Parts of SAS, SPSS, and R

SAS and SPSS have hundreds of functions and procedures and can be broadly divided into five parts.
·         Data management and input functions, which help to read, transform, and organize the data prior to the analysis
·         Data analysis procedures, which help in the actual statistical analysis
·         SAS’s output delivery system (ODS) and SPSS’s output management system (OMS), which help to extract the output data for final presentation or for use as input by other procedures
·         Macro languages, which can be used to give sets of commands repeatedly and to conduct programming-related tasks
·         Matrix languages (SAS IML and SPSS Matrix), which can be used to add new algorithms
R has all five of these areas integrated into one environment. Most R procedures are written in the R language itself, while SAS and SPSS procedures are generally not written in their own native languages. Being open source, R’s procedures are available for users to see and edit to their own advantage.

SAS

As per the SAS web site, the SAS suite of business analytics software has 35+ years of experience and 60,000+ customer sites worldwide. It has the largest market share globally with regard to advanced analytics. It can do practically everything related to advanced analytics, business intelligence, data management, and predictive analytics. It is therefore not strange that our entire book is dedicated to explaining the applications of SAS in advanced business analytics.

SAS development originally started at North Carolina State University, where it was developed from 1966 to 1976. The SAS Institute, founded in 1976, owns this software worldwide. Since 1976, new modules and functionalities have continually been added to the core software; the social media analytics module was added in 2010.

The SAS software is overall huge and has more than 200 components. Some of the interesting components are the following:
·         Base SAS: Basic procedures and data management
·         SAS/STAT: Statistical analysis
·         SAS/GRAPH: Graphics and presentation
·         SAS/ETS: Econometrics and Time Series Analysis
·         SAS/IML: Interactive matrix language
·         SAS/INSIGHT: Data mining
·         Enterprise Miner: Data mining

Analysis Using SAS: The Basic Elements

This section concentrates on Base SAS procedures. Base SAS helps to read, extract, transform, manage, and perform statistical analysis on almost all forms of data. This data can come from a variety of sources such as Excel, flat files, relational databases, and the Internet. SAS provides a point-and-click graphical user interface to perform statistical analysis of data. This option is easy to use and may be useful to nontechnical users or as a means to do a quick analysis. SAS also provides its own programming language, called the SAS programming language. This option provides everything that the GUI has, in addition to several advanced operations and analyses. Many professional SAS users prefer the programming option because it gives almost unlimited control over data manipulation, analysis, and presentation.

Most SAS programs have a DATA step and a PROC step. The DATA step is used for retrieval and manipulation of data, while the PROC step contains code for data analysis and reporting. There are approximately 300 PROC procedures. SAS also provides a macro language that can be used to perform routine programming tasks, including repetitive calls to SAS procedures. As mentioned earlier, SAS provides an output delivery system (ODS), by which SAS output can be published in many commonly used file formats such as Excel, PDF, and HTML. Many of the SAS procedures have the advantage of a long history, a wide user base, and excellent documentation.
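
As a quick illustration of this structure, here is a minimal sketch, with made-up inline data and variable names, that reads raw records in a DATA step and then summarizes them in a PROC step:

data sales;
   input region $ month revenue;   /* DATA step: read raw records into a SAS dataset */
   datalines;
East 1 1200
East 2 1350
West 1 950
West 2 1010
;
run;

proc means data=sales mean sum;    /* PROC step: analyze and report */
   class region;                   /* group the statistics by region */
   var revenue;
run;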

The Main Advantage Offered by SAS

The SAS programming language is a high-level procedural programming language that offers a plethora of built-in statistical and mathematical functions. It also offers both linear and nonlinear graphics capabilities along with advanced reporting features. It is possible to manipulate and conveniently handle data using the SAS programming language prior to applying statistical techniques. The data manipulation capabilities offered by SAS become even more important because up to three-fourths of the time spent on most analytics projects goes into data extraction, transformation, and cleaning. This capability is nonexistent in some other popular data analysis packages, which may require data to be manipulated or transformed using several other programs before it can be submitted to the actual statistical analysis procedures. Some statistical techniques, such as analysis of variance (ANOVA) procedures, are especially strong in the SAS environment.

Listing 1-1 and Listing 1-2 are samples of SAS code. They are just to give you a feel of how SAS code generally looks.
Listing 1-1. Regression SAS Code
proc reg data=sales;
   model bill_amount = income average_spending family_members age;
run;
Listing 1-2. Cluster Analysis Code
proc fastclus data=sup_market radius=0 replace=full maxclusters=5;
   id cust_id;
   var visits income age spends;
run;

The R Tool

As discussed in the earlier sections, R is an integrated tool with data manipulation, data management, data analysis, and graphics and reporting capabilities. It can do all of the following in an integrated environment:
·         Data management functions such as extraction, handling, manipulation, transformation, and storage
·         The full-featured, functional and object-oriented R programming language
·         Statistical analysis procedures
·         Graphics and advanced reporting capabilities
R is open source software maintained by the R Development Core Team and a vast R community (www.r-project.org). It is supported by a large number of packages, which makes it feature rich for the analytics community. About 25 statistical packages are supplied with the core R software as standard and recommended packages. Many more are made available from the CRAN web site at http://CRAN.R-project.org and from other sources. The CRAN site at http://cran.r-project.org/doc/manuals/R-intro.html#Top offers a good resource for an R introduction, including documentation resources.

R’s extensibility is one of its biggest advantages. Thousands of packages are available as extensions to the core R software. Developers can see the code behind R procedures and modify it to write their own packages. Most popular programming languages such as C++, Java, and Python can be connected to the R environment. SPSS has a link to R for users who are primarily using the SPSS environment for data analysis. SAS also offers some means to move the data and graphics between the two packages. Table 1-2 lists the most widely used R packages (see http://piccolboni.info/2012/05/essential-r-packages.html).
Table 1-2. Most Widely Used R Packages

Rank  Package       Description
1     stats         Distributions and other basic statistical stuff
2     methods       Object-oriented programming
3     graphics      Of course, graphics
4     MASS          Supporting material for Modern Applied Statistics with S
5     grDevices     Graphical devices
6     utils         In a snub to modularity, a little bit of everything, but very useful
7     lattice       Graphics
8     grid          More graphics
9     Matrix        Matrices
10    mvtnorm       Multivariate normal and t distributions
11    sp            Spatial data
12    tcltk         GUI development
13    splines       Needless to say, splines
14    nlme          Mixed-effects models
15    survival      Survival analysis
16    cluster       Clustering
17    R.methodsS3   Object-oriented programming
18    coda          MCMC
19    igraph        Graphs (the combinatorial objects)
20    akima         Interpolation of irregularly spaced data
21    rgl           3D graphics (OpenGL)
22    rJava         Interface with Java
23    RColorBrewer  Palette generation
24    ape           Phylogenetics
25    gtools        Functions that didn’t fit anywhere else, including macros
26    nnet          Neural networks
27    quadprog      Quadratic programming
28    boot          Bootstrap
29    Hmisc         Yet another miscellaneous package
30    car           Companion to the Applied Regression book
31    lme4          Linear mixed-effects models
32    foreign       Data compatibility
33    Rcpp          R and C++ integration
Here are a few R code snippets to give you a feel for the language. It is not necessary to understand them at this stage.
input_data = read.csv("Datasets/Insurance_data.csv")
# reads an external CSV file

input_data_final = input_data[, -c(1)]
# drops the first column of the dataset and keeps the analysis variables

input_data_final = scale(input_data_final)
# standardizes each variable (mean 0, standard deviation 1)

clusters_five <- kmeans(input_data_final, 5)
# creates 5 clusters from the given data

cluster_summary_mean_five = aggregate(input_data_final,
    by = list(clusters_five$cluster), FUN = mean)
# summarizes the clusters by their means

View(cluster_summary_mean_five)
# opens the cluster means in the data viewer (clusters_five$size gives the cluster sizes)

IBM SPSS Analytics Tool

SPSS originally stood for Statistical Package for the Social Sciences. It is a software package used for statistical analysis, originally developed by SPSS Inc. SPSS Inc. was acquired by IBM in 2009, and IBM renamed the product IBM SPSS Statistics; the latest version as of this writing is SPSS Statistics 22.

Many SPSS users think it has stronger menu-driven commands than R and SAS; its learning curve is also shorter.

The web site at http://fmwww.bc.edu/GStat/docs/StataVSPSS.html has the following opinion about SPSS:
SPSS has its roots in the social sciences and the analysis of questionnaires and surveys is where many of its core strengths lie.
SPSS has been in existence for a long time and hence has a strong user base. As with any other software, you always have to do a cost-to-benefit analysis when making a buying decision.
Users may find SAS and SPSS similar to each other, and switching from one to the other may be fairly easy. R may look somewhat different to first-time users.

Features of SPSS Statistics 22

SPSS Statistics 22 is built on the philosophy of data-driven decision making anytime, anywhere. It has many new features, such as interaction with mobile devices. It works on Windows, Mac, and Linux desktops; on the mobile side, it supports Apple, Windows 8, and Android devices. It has support for Automatic Linear Modeling (ALM) and heat maps. It enhances Monte Carlo simulation to help improve the accuracy of predictive models with uncertain inputs. SPSS Statistics Server is good as far as scalability and performance are concerned. Custom programming is also made easier than before; Python plug-ins can be added as part of the main installation.

Monte Carlo simulation is a problem-solving technique used to approximate results by running multiple trials, or simulations, that use random variables.
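
To make the idea concrete, here is a minimal Monte Carlo sketch, written in SAS since that is the language used throughout these posts; it approximates the value of pi by generating random points in the unit square:

data monte_carlo;
   call streaminit(123);                  /* fix the random seed for repeatability */
   inside = 0;
   do i = 1 to 100000;                    /* 100,000 random trials */
      x = rand('uniform');
      y = rand('uniform');
      if x*x + y*y <= 1 then inside + 1;  /* count points inside the quarter circle */
   end;
   pi_estimate = 4 * inside / 100000;
   put pi_estimate=;                      /* writes a value close to 3.1416 to the log */
run;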

Selection of Analytics Tools

The web site at http://stanfordphd.com/Statistical_Software.html contains a statistical feature comparison of R, Minitab, SAS, STATA, and SPSS. R looks feature-rich given its supporting packages, which are written by the R core development team and many other R enthusiasts. SAS has been around since the 1970s and has a large user base. It has great data management capabilities, which make it a one-stop shop for most analytics exercises; the user does not need to go to any other program, whether for reading the data files or for the final presentation.

SPSS has great menu-driven features and, in most cases, does not need any training in programming. The SAS market share and wide appeal make it the main topic of our book. Several companies in the corporate world are comfortable with SAS, mainly because of the large legacy built around it over the years, its industrial-quality code, its good documentation, and the availability of SAS skills in the market.

As discussed in this post, it has to be accepted that all available applications have their own strengths and weaknesses; no software is fit for all occasions. To be successful, a data analyst must learn multiple software products and use them as the situation demands.

These edited excerpts are taken from Shailendra Kadre and Venkat Reddy’s latest book titled, Practical Business Analytics Using SAS: A Hands-on Guide. It’s available on Amazon.com and Amazon India web sites.

·         Paperback: 580 pages
·         Publisher: Apress; 1 edition (January 29, 2015)
·         Language: English
·         ISBN-10: 1484200446
·         ISBN-13: 978-1484200445





Friday, April 3, 2015

Business Analytics Techniques Used in the Industry


3rd April, 2015 Shailendra Kadre and Venkat Reddy

The previous few blog posts introduced the uses of data mining and business analytics. This post examines the terminology in more detail; only the terms frequently used in the industry are discussed here.
It then introduces, with examples, many of these analytics techniques and applications. Some of the more frequently used techniques will be covered in later posts.

Regression Modeling and Analysis

To understand regression and predictive modeling, consider again the example of a bank trying to aggressively increase its customer base for some of its credit card offerings. The credit card manager wants to attract new customers who will not default on credit card loans. The bank manager might want to build a model from a set of past customer data that closely resembles the set of target customers. This model will be used to assign a credit score to new customers, which in turn will be used to decide whether to issue a credit card to a potential customer. There might be several other considerations aside from the calculated credit score before a final decision is made to issue the card.

The bank manager might want to view and analyze several variables related to each potential client in order to calculate a credit score, which depends on variables such as the customer’s age, income group, profession, number of existing loans, and so on. The credit score here is a dependent variable, and the other customer variables are independent variables. With the help of past customer data and a set of suitable statistical techniques, the manager will attempt to build a model that establishes a relationship between the dependent variable (the credit score in this case) and many independent variables about the customers, such as monthly income, house and car ownership status, education, current loans already taken, information on existing credit cards, and the credit score and past loan default history from the federal data bureaus. There may be up to 500 such independent variables, collected from a variety of sources, such as the credit card application, federal data, and the customer data and credit history available with the bank. Not all of these variables will be useful in building the model. The number of independent variables can be reduced to a more manageable number, for instance 50 or fewer, by applying empirical and scientific techniques. Once the relationship between the independent and dependent variables is established using the available data, the model needs validation on a different but similar set of customer data. Then it can be used to predict the credit scores of potential customers. A prediction accuracy of 90 to 95 percent may be considered good in banking and financial applications; an accuracy of at least 75 percent is a must. This kind of predictive model needs periodic validation and may need to be rebuilt. In some financial institutions it is mandatory to revalidate the model at least once a year with renewed conditions and data.
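
As a minimal sketch of such a model in SAS (the dataset and variable names here are hypothetical), stepwise selection can whittle the candidate variables down while the regression is being fit:

proc reg data=past_customers;
   model credit_score = monthly_income age num_loans owns_house owns_car
         / selection=stepwise slentry=0.05 slstay=0.05;   /* retain only significant predictors */
run;

In practice, the MODEL statement would start with the full list of candidate variables, and the validation described above would be carried out by scoring a separate holdout dataset.
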
In recent times, revenues for new movies depend largely on the buzz created by the movie on social media in its first weekend of release. In an experiment, data for 37 movies was collected: the number of tweets about each movie and the corresponding tickets sold. The graph in Figure 1-2 shows the number of tweets on the x-axis and the number of tickets sold on the y-axis for each movie. The question to be answered was, if a new movie gets 50,000 tweets (for instance), how many tickets are expected to be sold in the first week? A regression model was used to predict the number of tickets (y) based on the number of tweets (x) (Figure 1-3).

Figure 1-2. Number of Tickets Sold vs. Number of Tweets - a Data Collection for Sample Movies

Figure 1-3. The Regression Model for Number of Tickets Sold vs. Number of Tweets - Prediction using Regression Model


Using the previous regression predictive model equation, the number of tickets was estimated to be 5,271 for a movie that had 50,000 tweets in its first week of release.
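
A one-variable model like this is simple to fit in SAS; here is a minimal sketch with hypothetical dataset and variable names:

proc reg data=movies;
   model tickets_sold = tweets;   /* fits tickets_sold = b0 + b1*tweets */
run;

The fitted intercept and slope define the straight line drawn through the points in Figure 1-3.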

Time Series Forecasting

Time series forecasting is a simple form of forecasting technique wherein data points are available over regular time intervals of days, weeks, or months. If patterns can be identified in the historical data, it is possible to project those patterns into the future as a forecast. Sales forecasting is a popular use of time series forecasting. In Figure 1-3, a straight line shows the trend from the past data. This straight line can easily be extended over a few more time periods to give fairly accurate forecasts. In addition to trends, time series forecasts can also show seasonality, which is simply a repeating pattern observed within a year or less (such as higher sales of gift items on occasions such as Christmas or Valentine’s Day). Figure 1-4 shows an actual sales forecast, the trend, and the seasonality of demand.

Figure 1-4. A Time Series Forecast Showing the Seasonality of Demand

Figure 1-4 shows the average monthly sales of an apparel showroom for three years. There is a stock clearance sale every four months, with huge discounts on all brands. The peak in every fourth month is apparent from the figure.

Time series analysis can also be used in the bank credit card example to forecast losses or profits in the future, given, for instance, the same data for a historical period of 24 months. Time series forecasts are also used in weather and stock market analysis.

Other examples of time series data include yearly sales volumes, mortgage interest rate variations over time, and statistical quality control measurements such as the accuracy of an industrial lathe over a period of one month. In these representations, the time component is taken on the x-axis, and the variable, like sales volume, is on the y-axis. Some of these trends may follow a steady straight-line increase or decline over a period of time; others may be cyclic or random in nature. When applying time series forecasting techniques, it is usually assumed that the past trend will continue for a reasonable time into the future. This forecasting of the trend may be useful in many business situations, such as stock procurement, cash flow planning, and so on.
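
In SAS, a trend-plus-seasonality forecast like the one in Figure 1-4 can be sketched with the FORECAST procedure, assuming SAS/ETS is licensed and using hypothetical dataset and variable names:

proc forecast data=sales_history interval=month lead=6
              method=winters seasons=4 out=predictions outfull;
   id month;    /* a SAS date variable marking each period */
   var sales;   /* the series to forecast */
run;

METHOD=WINTERS captures both the trend and the four-month seasonal cycle described above, and LEAD=6 extends the forecast six periods into the future.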

Conjoint Analysis

Conjoint analysis is a statistical technique, mostly used in market research, to determine what product (or service) features or pricing would be attractive to most customers in order to affect their buying decisions positively.

In conjoint studies, target responders are shown a product with different features and pricing levels. Their preferences, likes, and dislikes are recorded for the alternative product profiles. Researchers then apply statistical techniques to determine the contribution of each of these product features to overall likability or a potential buying decision. Based on these studies, a marketing model can be made that can estimate the profitability, market share, and potential revenue that can be realized from different product designs, pricing, or their combinations.

It is an established fact that some mobile phones sell more because of their ease of use and other user-friendly features. While designing the user interface of a new phone, for example, a set of target users is shown a carefully controlled set of different phone models, each having some different and unique feature yet very close to each other in terms of the overall functionality. Each user interface may have a different set of background colors; the placement of commonly used functions may also be different for each phone. Some phones might also offer unique features such as dual SIM. The responders are then asked to rate the models and the controlled set of functionalities available in each variation. Based on a conjoint analysis of this data, it may be possible to decide which features will be received well in the marketplace. The analysis may also help determine the price points of the new model in various markets across the globe.

Cluster Analysis

The intent of any cluster analysis exercise is to split the existing data or observations into similar and discrete groups. In classification problems, each observation is assigned to a known group, while in cluster analysis the aim is to determine the number and composition of the groups that may exist in a given data or observation set.

For example, customers could be grouped into distinct groups in order to target them with different pricing strategies and customized products and services. These distinct customer groups (Figure 1-5) may include frequent customers, occasional customers, high net worth customers, and so on. The number of such groups is unknown at the beginning of the analysis but is determined as a result of the analysis.

Figure 1-5. A cluster analysis plot

The graph in Figure 1-6 shows the income to debt ratio versus age. Customer segments that are similar in nature can be identified using cluster analysis.

Figure 1-6. Income to Debt Ratio vs. Age

The income-to-debt ratio in Figure 1-6 is low for the 20-to-30 age group, while the 30-to-45 age group has a higher ratio. Depending on the business objective, the three groups need to be treated differently instead of as one single population.
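
A grouping like this can be produced with the SAS FASTCLUS procedure; here is a minimal sketch with hypothetical dataset and variable names:

proc fastclus data=customers maxclusters=3 out=clustered;
   var income_debt_ratio age;   /* the two dimensions plotted in Figure 1-6 */
run;

The OUT= dataset gains a CLUSTER variable assigning each customer to one of the three groups.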

Segmentation


Segmentation is similar to classification, where criteria for dividing observations into distinct groups need to be found. In segmentation, the number of groups may be apparent even at the beginning of the analysis, while the aim of cluster analysis is to identify areas whose concentrations differ from those of other groups. Hence, clustering discovers the boundaries between groups, while segmentation uses known boundaries or some distinct criterion to form the groups.

Clustering divides the population into different groups based on all the available factors. Segmentation also divides the population into different groups, but based on predefined criteria such as maximizing a profit variable, minimizing defects, and so on. Segmentation is widely used in marketing to create the right campaign for the customer segment that yields the maximum leads.

Principal Components and Factor Analysis

These statistical methodologies are used to reduce the number of variables, or dimensions, in a model-building exercise; these are usually the independent variables. Principal component analysis is a method of combining a large number of variables into a smaller number of composite variables, while factor analysis is a methodology for determining the underlying structure or relationships by calculating the hidden factors that drive the relationships among the variables.

Some analysis studies may start with a large number of variables, but because of practical constraints such as data handling, data collection time, budgets, computing resources available, and so on, it may be necessary to drastically reduce the number of variables that will appear on the final data model. Only those independent variables that make most sense to the business need to be retained.

There might also be interdependency between some variables. For example, the income levels of individuals in a typical analysis might be closely related to the monthly dollars they spend: the higher the income, the higher the monthly spend. In such a case, it is better to keep only one of the two variables and remove monthly spend from the final analysis.

The regression modeling section discussed using 500 variables as a starting point for determining the credit scores of potential customers. Principal component analysis can be one method of reducing the number of variables to a manageable level, 40 for example, for the final data model.
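
In SAS, this reduction can be sketched with the PRINCOMP procedure; the dataset and variable names below are hypothetical, and in practice the VAR statement would list all the candidate variables:

proc princomp data=applicants out=scores;
   var monthly_income age num_loans monthly_spend credit_utilization;
run;

The OUT= dataset contains the component scores (Prin1, Prin2, and so on), and the first few components can stand in for the original variables in subsequent modeling.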

Correspondence Analysis

Correspondence analysis is similar to principal component analysis but applies to nonquantitative or categorical data such as gender, pass or fail status, color of objects, and field of specialization. It applies especially to cross-tabulations, providing a way to graphically represent their structure, with each row and column represented as a point.
Note: You can find a complete example of correspondence analysis on the SAS web site. See http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_corresp_sect026.htm.
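
As a minimal sketch with a hypothetical survey dataset, a cross-tabulation of gender against field of specialization could be analyzed like this:

proc corresp data=survey observed dimens=2;
   tables gender, specialization;   /* rows and columns of the cross-tabulation */
run;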

Survival Analytics

Survival analytics is typically used when variables such as time of death, duration of a hospital stay, or time to complete a doctoral thesis need to be predicted. It basically deals with time-to-event data. For a more detailed treatment of this topic, please refer to www.amstat.org/chapters/northeasternillinois/pastevents/presentations/summer05_Ibrahim_J.pdf.
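
Here is a minimal sketch in SAS, assuming a hypothetical dataset in which stay_days is the length of a hospital stay and censored = 1 marks patients still admitted when the study ended:

proc lifetest data=hospital_stays plots=survival;
   time stay_days*censored(1);   /* event time, with 1 identifying censored observations */
run;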

Some Practical Applications of Business Analytics

The following sections discuss a few examples of the practical application of business analytics in the real world. Predicting customer behavior toward certain product features, or applying business analytics in the supply chain to predict constraints such as raw material lead times, are very common examples. Applications of analytics are also very popular in retail and in predicting trends on social media.

Customer Analytics

Predicting consumer behavior is the key to all marketing campaigns. Market segmentation, customer relationship management, pricing, product design, promotion, and direct marketing campaigns can benefit to a large extent if consumer behavior can be predicted with reasonable accuracy. Companies with direct interaction with customers collect and analyze a lot of consumer-related data to get valuable business insights that may be useful in positively affecting sales and marketing strategies. Retailers such as Amazon and Walmart have a vast amount of transactional data at their disposal, containing information about every product and customer on a daily basis. These companies use business analytics techniques effectively for marketing, pricing policies, and campaign design, which enables them to reach the right customers with the right products. They understand customer needs better using analytics, and they can stock better-selling products in place of less-efficient ones. Many companies are also tapping the power of social media for meaningful data that can be used to analyze consumer behavior. The results of this analysis can also be used to design more personalized direct marketing campaigns.

Operational Analytics

Several companies use operational analytics to improve existing operations. With enterprise resource planning (ERP) systems such as SAP, which give an integrated operational view of the business, it is now possible to look into business processes in real time for any given time frame. Drilling down into history to re-create events is also possible. With proper analytics tools, this data is used to analyze root causes, uncover trends, and prevent disasters. Operational analytics can be used to predict lead times of shipments and other constraints in supply chains. Some software can present a graphical view of the supply chain, depicting possible constraints such as shipment and production delays.

Social Media Analytics

Millions of consumers use social media at any given time. Whenever a new mobile phone or a movie, for instance, is launched in the market, millions of people tweet about it almost instantly, write about their feelings on Facebook, and give their opinions in the numerous blogs on the World Wide Web. This data, if tapped properly, can be an important source to uncover user attitudes, sentiments, opinions, and trends. Online reputation and future revenue predictions for brands, products, and effectiveness of ad campaigns can be determined by applying proper analytical techniques on these instant, vast, and valuable sources of data. In fact, many players in the analytics software market such as IBM and SAS claim to have products to achieve this.
Social media analytics is, in some sense, simply text mining or text analytics. The unstructured text data available on social media web sites can be challenging to analyze using traditional analytics techniques. (Describing text analytics techniques is out of scope for this book.)
Some companies are now using consumer sentiment analysis on key social media web sites such as Twitter and Facebook to predict revenues from new movie launches or any new products introduced in the market.

Data Used in Analytics

The data used in analytics can be broadly divided into two types: qualitative and quantitative. Qualitative, discrete, or categorical data is expressed in terms of natural language; color, day of the week, street name, city name, and so on, fall under this type. Measurements that are expressed with the help of numbers are quantitative, or continuous, data; the distance between two cities expressed in miles and the height of a person measured in inches are forms of continuous data.
This data can come from a variety of sources, internal or external. Internal sources include customer transactions, company databases and data warehouses, e-mails, product databases, and the like. External sources can be professional credit bureaus, federal databases, and other commercially available databases. In some cases, such as engineering analysis, a company may prefer to develop its own data to solve an uncommon problem.
Selecting the data for building a business analytics model requires a thorough understanding of the overall business and of the problem to be solved. As the past sections discussed, an analytics model combines data with the statistical techniques used to analyze it. Hence, the accuracy of any model is largely dependent upon the quality of the underlying data and of the statistical methods used.
Obtaining data in a usable format is the first step in any model-building process. You need to first understand the format and content of the raw data made available for building a model. Raw data may require extraction from its base sources, such as a flat file or a data warehouse. It may be available in multiple sources and in a variety of formats. The format of the raw data may warrant separation of desired field values, which otherwise appear to be junk or have little meaning in their raw form. The data may require a cleansing step as well before an attempt is made to process it further. For example, a gender field may have only the two values male and female; any other value in this field may be considered junk, although this may vary depending upon the application. In the same way, a negative value in an amounts field may not be acceptable.
In some cases, the size of the available data may be so large that it requires sampling to reduce it to a manageable form for analysis. A sample is a subset of the available data that, for all practical purposes, represents the characteristics of the original population. Data sourcing, extraction, transformation, and cleansing may eat up to 70 percent of the total hours available to a business analytics project.
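
Drawing such a sample is straightforward in SAS; here is a minimal sketch, with hypothetical dataset names, that draws a 10 percent simple random sample:

proc surveyselect data=all_transactions out=transaction_sample
                  method=srs samprate=0.10 seed=20150403;   /* fixed seed for repeatability */
run;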


These edited excerpts are taken from Shailendra Kadre and Venkat Reddy’s latest book titled, Practical Business Analytics Using SAS: A Hands-on Guide. It’s available on Amazon.com and Amazon India web sites.

·         Paperback: 580 pages
·         Publisher: Apress; 1 edition (January 29, 2015)
·         Language: English
·         ISBN-10: 1484200446
·         ISBN-13: 978-1484200445