Sunday, February 28, 2016

#3 - Big Unstructured Data v/s Structured Relational Data

This is the third post of my blog for MIS 587 Business Intelligence course in the Eller College of Management. In this post, I am going to talk about the differences between unstructured and structured data. I will be discussing about the present state and volume of various data types available to organizations. Then, there will be a small section on the limitations of data warehousing in analyzing different types of data. I will finally be concluding by discussing where the role of data warehouse is headed in the near future.

Differences between unstructured and structured data

Let us first look at an image which clearly illustrates the major difference between unstructured and structured data.
Furthermore, Wikipedia defines unstructured data as information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. Structured data refers to any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets.

Unstructured data is not useful when fit into a schema/table. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents. Structured data has the advantage of being easily entered, stored, queried and analyzed. At one time, because of the high cost and performance limitations of storage, memory and processing, relational databases and spreadsheets using structured data were the only way to effectively manage data. Anything that couldn't fit into a tightly organized structure would have to be stored on paper in a filing cabinet.

How Data Warehouse fits into analyzing this data

In today's world, real-time data is very unstructured. Many organizations believe that their unstructured data stores include information that could help them make better business decisions. Unfortunately, it's often very difficult to analyze unstructured data. There are four basic properties of data which makes it very difficult to mine and get useful information out of it. They are:

Volume: These data are with a size which is beyond the ability of customary databases and software tools. It can include large data points, longer periods in time, more variables and can discover more subtle patterns.
Variety: Such data has a lot greater variety than other data. The data available to organizations are structured, unstructured and semi-structured.
Velocity: The rate at which the data is transmitted and received is higher than usual. It is often the result of new applications like Facebook, Twitter, etc.
Veracity: Credibility of data is different from older models. Data is not generated by users of the data and hence not always trustworthy.

In computing, Data Warehouse (DW) is used for reporting and data analysis. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating analytical reports for knowledge workers throughout the enterprise. Data Warehousing can help in transforming unstructured data into a structured form.  Data warehousing incorporates data stores and conceptual, logical, and physical models to support business goals and end-user information needs. A data warehouse (DW) is the foundation for a successful BI program. Creating a DW requires mapping data between sources and targets, then capturing the details of the transformation in a metadata repository. The data warehouse provides a single, comprehensive source of current and historical information. 

Data warehouses tend to have a high query success, as they have complete control over the four main areas of data management systems:
  • Clean data
  • Indexes: multiple types
  • Query processing: multiple options
  • Security: data and access

Limitations of Data Warehousing

However, there are considerable disadvantages involved in moving data from multiple, often highly disparate, data sources to one data warehouse that translate into long implementation time, high cost, lack of flexibility, dated information and limited capabilities:
  • Major data schema transforms from each of the data sources to one schema in the data warehouse, which can represent more than 50% of the total data warehouse effort
  • Data owners lose control over their data, raising ownership (responsibility and accountability), security and privacy issues
  • Long initial implementation time and associated high cost
  • Adding new data sources takes time and associated high cost
  • Limited flexibility of use and types of users - requires multiple separate data marts for multiple uses and types of users
  • Typically, data is static and dated
  • Typically, no data drill-down capabilities
  • Difficult to accommodate changes in data types and ranges, data source schema, indexes and queries
  • Typically, cannot actively monitor changes in data


Role of Data Warehouse in the near future

With data no more being called as data, but rather big data because of its volume and variety, it is of utmost important that organizations use Data Warehousing to get information from these large unstructured parts of data. The data is out there being generated at an incredible velocity and variety, but not many are tapping into the immense potential of it. All answers to management problems in organizations lie within these data but it is incredibly hard to make any sensible information from them. For example, a fast food chain wants to know which location will be most suitable to open a new franchise shop. It knows people have been tweeting about their place and some have been complaining it is very far from their home. They know these tweets but do not know the origin. They can set up a data warehouse to place these tweets into a structured data warehouse and then using ETL, get some information about them and come to a decision. 

As the days pass by, organizations are realizing the importance of Data Warehousing and Business Intelligence. They are realizing the fact that without these tools they are losing their business to competitors. It is somewhat funny to notice that how all the answers are around you but there is nothing which can exactly point to it. Data Warehousing can do that magic to organizations and help make management decisions based on the outputs of the BI tool. In the near future, the success of organizations will not only be decided by whether they are incorporating DW/BI, but rather how good their DW/BI tool is.


References

  1. Google
  2. Wikipedia
  3. Webopedia
  4. http://www.smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-and-unstructured-data
  5. http://www.webopedia.com/TERM/U/unstructured_data.html
  6. https://tdwi.org/portals/data-warehousing.aspx
  7. http://www.whamtech.com/adv_disadv_dw.htm

Sunday, February 14, 2016

#2 - Dimensional model for FedEx Express

This is the second post of my blog for MIS 587 Business Intelligence course in the Eller College of Management. In this blog, I am going to talk about FedEx corporation and the air cargo transportation industry. I will be explaining what kind of metrics the CEO of FedEx will be interested in capturing and thus describe the dimensional model that suits best for their business model.








FedEx Corporation is an American global courier delivery services company headquartered in Memphis, Tennessee. The name "FedEx" is a syllabic abbreviation of the name of the company's original air division, Federal Express, which was used from 1973 until 2000. The company has expanded from its original focus on ground and air-based services to include home delivery options and a retail division. Internationally, the company primarily competes with DHL Express, UPS and TNT Express. In April 2015, the company reached an agreement to purchase its Dutch rival TNT Express, subject to regulatory and shareholder approval. The acquisition is expected to be completed in the first half of 2016.
FedEx is organized into many operating units but for the purpose of this exercise, I will be concentrating on its original service, the FedEx Express. FedEx is the the original overnight courier service of FedEx Corporation, providing next day air service within the United States and time-definite international service. FedEx Express operates one of the largest civil aircraft fleets in the world and the largest fleet of wide bodied civil aircraft; it also carries more freight than any other airline.

Understanding the air cargo transportation business

Before we design the dimensional model for FedEx, we need to understand about the air cargo transportation business process. The above diagram shows the major actors of the specific industry. The shipper ships the shipment through FedEx by approaching an agent who does the booking and generates an Air Waybill (AWB). The air waybill is the most important document issued by a carrier either directly or through its authorized agent. It is a non-negotiable transport document. It covers transport of cargo from airport to airport. By accepting a shipment an IATA cargo agent is acting on behalf of the carrier whose air waybill is issued. The shipment is carried by the freight forwarders to the airports where export customs take over. The Ground Handling Agent is responsible for transporting the shipment in and out of the flight. On reaching the destination, freight forwarders carry the shipment and deliver it to the consignee. 

Metrics of interest to track

The most important metrics to track for this particular business is the shipped goods status and location. Every time the goods changes its location, the new location must be entered into the system for effective tracking. In order to measure the net revenue generated in each AWB, all receivables and payables need to be calculated. FedEx receives the shipment cost from the shipper directly or the IATA agent. They need to pay 5% commission fees to the IATA agent and another 3% to GHA. Provided that there are no interline OALs (Other Airlines) for the particular shipment, deducting the commission charges from the gross revenue will give the net revenue. If there are interline airlines involved in the shipment, then the revenue gets prorated based on the distance traveled by each airline. For simplicity, we consider only one FedEx flight present in the entire shipment process. The CEO might be interested in knowing the daily, weekly, bi-weekly and monthly net revenue generation for FedEx. Such companies have their pre-defined accounting periods and this decides the date range that needs to be considered for revenue accounting. Invoicing and payments are also handled and thus the CEO might also be interested in participant outstanding reports and distribution of dunning letters.

Use of Dimensional Model

Dimensional model can be used to streamline this process easily. The Accumulating Snapshot type of dimensional model is the most appropriate type here since there are pre-determined milestones in the entire air cargo transportation process. This type of dimensional model approach is needed since the row of the fact table needs to be updated every time the location of the goods change to track its current location. The grain definitely is each AWB and the granularity can be tracked for every movement of the goods and also calculate gross and net revenue. The possible dimensions are the different actors: shipper, IATA agent, GHA, freight forwarders, flight and consignee along with the Date dimension.

The different facts that need to be tracked are current location, chargeable weight, rate, weight charges, valuation charges, Gross Revenue, Host Revenue, Net Revenue, total receivable amount and total payable amount to form a very simple dimensional model. The data should correspond to each AWB and can be aggregated or rolled-up according to any date to get aggregated reports for senior management. The dimensional model has been designed using a star schema as shown below.



References -

1. Wikipedia
2. Unisys Logistics Management System
3. FedEx Official SIte

Thursday, February 4, 2016

#1 - Business Intelligence & Analysis Products Scan & Evaluation

This is the start of my blog for MIS 587 Business Intelligence course that I am taking at the Eller College of Management. This post will be the first of many others to follow every two weeks. This will certainly be different from my other blog (can be found at shiladityasahu.wordpress.com). It will be a fully technical blog and my biggest challenge will be to make it as easily understandable as I can to people who are not that familiar with Information Technology (IT). I will also be punctual with my posts (since my grades depend on it J).

In this post, I will be listing 5 Business Intelligence & Analysis products and do a weighted analysis based on 5 criteria.

So what actually is Business Intelligence?

So what is the meaning of this fancy term - 'Business Intelligence' or BI. Everyone talks about it but hardly anyone can define it. IBISWorld defines it as 'Computer-based techniques for identifying, extracting and analyzing business data, such as revenue by product or geography'. In more simpler terms, BI is an umbrella term that refers to a variety of software applications used to analyze an organization's raw data. Raw data more often than not make no sense unless processed, analyzed and presented in a visual way (preferably). The processed data is called as information and can be viewed by senior management of an organization and help them make management decisions. BI as a discipline is made up of several related activities, including data mining (the process of extracting patterns from large data sets by combining methods from statistics, artificial intelligence and database management), online analytical processing, querying and reporting.

Gartner's Magic Quadrant for BI & Analytics Platforms 2015

Gartner has made this Magic Quadrant for different Business Intelligence & Analytics Platforms. It has listed some core BI platforms and analyzed them based on 13 critical capabilities and use cases. Then, they have plotted each one of them on a graph where x-axis denotes completeness of vision and the y-axis denotes ability to execute. In doing so, Gartner has divided the entire graph into 4 quadrants namely Leaders, Challengers, Niche Players and Visionaries.

Magic Quadrant

As per the Magic Quadrant, Gartner recognizes the following as the best BI and Analytics Platforms available in the market - 

Tableau

Tableau Software produces a family of interactive data visualization products focused on business intelligence and analytics. Tableau's intuitive, visual-based data discovery capabilities have transformed business users' expectations about what they can discover in data and share without extensive skills or training with a BI platform. Users can use Tableau to create dashboards without much knowledge of BI or data warehousing.
Pros - 
  • Ease of use
  • High business benefits
Cons -
  • Limited product line focused on data discovery
  • Limited with very large scale enterprise data
  • Limited advanced analytics capabilities

TIBCO Spotfire

TIBCO Software's BI platform cover a wide range of analytic capabilities. Spotfire is a leading data discovery and visualization tool which gives users the ability to access, combine, prepare and visualize data in the form of highly interactive analytic dashboards. 
Pros - 
  • Cover a wide range of analytic use cases
  • Ease of use
Cons - 
  • Struggle with market positioning and sales execution
  • Poor customer support

SAS

SAS Business Intelligence software provides the most comprehensive, integrated and easy-to-use reporting and analysis features. SAS visual analytics delivers interactive, self-service analytic capabilities at an enterprise level. SAS also leverages its range of platform components and expertise in various industries to offer a wide range of vertical- and domain-specific analytic applications.
Pros - 
  • Excellent market understanding
  • High functionality and product quality
  • Meets the needs of a diverse set of use cases
Cons -
  • High license cost
  • Difficult in migration

Qlik
QlikView is a mature, self-contained, tightly integrated development platform used by IT or more technical users for building intuitive and interactive dashboard applications faster and easier than traditional BI platforms.
Pros -
  • Ease of use
  • Highly interactive dashboard development product

Cons-
  • Dual product and pricing
  • Does not offer production reporting
SAP
SAP delivers a wide range of BI and analytics tools. They mainly target large BI enterprises for decentralized data discovery deployments. SAP keeps on investing heavily on their BI products and that has enabled them to stay in the leaders quadrant.
Pros - 
  • Continued investing in R&D
  • Broad range of functionality
Cons -
  • Poor customer service
  • Predominantly used for large customer bases
Criteria for comparison
  1. Ease of use - It is very important that a BI tool is easy to use and generate reports. There may be users who are not familiar with data warehousing and Business Intelligence and yest must be able to use the tool without much effort and trouble.
  2. User support - The vendors must provide constant support to users. Often there are complex functionalities associated with these tools and if a user is unable to perform a function, he will need support. Good customer service is very important in deciding which is a better product over another.
  3. Support for advanced analytics - Some BI tools support simple dashboards but others support advanced analytics and processing. This feature totally depends on the usage of the user but a very important factor for large enterprises.
  4. Cloud support - With cloud storage being available in almost any technology solution, a product with ease of saving files into cloud storage will be more favorable among the users.
  5. Cost - Cost goes a long way in deciding the BI product. It is always a balance between what features are needed and the cost associated with it. Its impossible to have all the features for the least cost but it is very important to provide competitive price in the market.
Weighted Analysis
Now we will do a weighted analysis on the above mentioned products based on the discussed criteria.

Criteria
Weight
Tableau
Spotfire
SAS
Qlik
SAP
Ease of Use
30%
10
9
7
10
7
User Support
12%
8
5
6
7
5
Advanced Analytics
20%
6
7
10
6
10
Cloud Support
15%
6
6
10
7
10
Cost
23%
9
8
5
7
5
Points
100%
8.13
7.44
7.47
7.7
7.35
Rank

1
4
3
2
5

As per the weighted analysis based on the above criteria, Tableau seems to be the best BI and Analytics tool mainly because of its easy to use, user-support and low cost. SAS and SAP though have Advanced Analytics support, they are also expensive and not easy to use. Most users will choose a product that is easier to use than which support advanced features and also costlier. Spotfire and Qlik are also easy to use tools but costlier than Tableau. In conclusion, Tableau is the best BI tool as per the above weighted criteria.

References

  • IBISWorld - http://clients1.ibisworld.com.ezproxy2.library.arizona.edu/
  • www.gartner.com
  • Google
  • Wikipedia
  • Official websites of Tableau, TIBCO Spotfire, SAS, Qlik, SAP, etc.