Monday, December 5, 2016

Split Semicolon separated column in Sql Server (2008,2012), Versions below 2016

In case you have a column as below:
RowDataTable
Col1
1;A;2016-10-01
2;B;2016-10-02
Use code snippet below:
-- Declare the destination table;

Create table SeparatedTable ( Id int , Name nVarChar(100) , Date Date)
declare
 @c1 nVarChar(max),
 @c2 nVarChar(max),
 @c3 nVarChar(max)

DECLARE c_split CURSOR 
 FOR SELECT * FROM RowDataTable
OPEN c_split

FETCH NEXT FROM c_split into @c1
 while @@FETCH_STATUS = 0
 begin
 set @c2 = substring(@c1 ,charindex(';',@c1 )+1, len(@c1 ) )
 set @c3 = substring(@c2 ,charindex(';',@c2 )+1, len(@c2 ) )
 set @c4 = substring(@c3 ,charindex(';',@c3 )+1, len(@c3 ) )
insert into SeparatedTable
 (FILECONTENTID, NETSALARY, ACCOUNTNUMBER,)
 Values (
 subString(@c1 , 1 , charIndex(';' , @c1 )-1) ,
 subString(@c2 , 1 , charIndex(';' , @c2 )-1) ,
 subString(@c3 , 1 , charIndex(';' , @c3 )-1))
 FETCH NEXT FROM c_split into @c1
 end
close c_split
Deallocate c_split

Tuesday, November 17, 2015

Interpreting box plots.

The box-and-whisker plot is an exploratory graphic, created by John W. Tukey, used to show the distribution of a dataset (at a glance). Think of the type of data you might use a histogram with, and the box-and-whisker (or box plot, for short) could probably be useful.

Let's say we ask 2,852 people (and they miraculously all respond) how many hamburgers they've consumed in the past week. We'll sort those responses from least to greatest and then graph them with our box-and-whisker.


Take the top 50% of the group (1,426) who ate more hamburgers; they are represented by everything above the median (the white line). Those in the top 25% of hamburger eating (713) are shown by the top "whisker" and dots. Dots represent those who ate a lot more than normal or a lot less than normal (outliers). If more than one outlier ate the same number of hamburgers, dots are placed side by side.

Thursday, November 12, 2015

Mean, Mode, Median, and Range.

1- Mean:
   It's other fancy name is "average", and it's simply the summation of all numbers in the set divided by the count of the set members.

2- Mode:
   The mode is the number that is repeated more often than any other number in the set.

3- Median:
   The median is the middle value after sorting the set, and it has two ways to calculate one if the data set members count is odd and other for the even count member data set.
    I] For odd count set: the number in the middle after sorting.
    II] For even count set: the summation of the two numbers in the middle divided by 2.

4- Range:
   Simply is the result of subtraction of the largest member of the set and smallest number.

EX:
   Calculate the Mean, Mode, Median, and Range for the following sets:
a) { 1, 2, 2, 4, 5 }
Mean = ( 1+2+2+4+5) / 5 = 2.8
Mode = 2 <Number 2 has been repeated two times>
Median = 2
Range= 5-1 = 4

b) { 1, 2, 2, 3, 4, 5 }
Mean = ( 1+2+2+3+4+5) / 6 = 2.833
Mode = 2 <Number 2 has been repeated two times>
Median = (2+3) / 2 = 2.5
Range= 5-1 = 4

Standard Deviation

The Standard Deviation is a measure of how spread out numbers are, in other words it's the average distance that each value in the set is far from the set average.

Ex:
If we have a data set of { 1,2,3,4,5 } then the average will be 3, to calculate the Standard Deviation we use the below formula: 

1. Find the average of the data set, .
To find the average, add up all the numbers and divide
by the number of numbers in the data set, n.
2. For each number, subtract the average from it.
3. Square each of the differences.
4. Add up all the results from Step 3.
5. Divide the sum of squares (Step 4) by the number of
numbers in the data set, minus one (n – 1).
If you do Steps 1 through 5 only, you have found
another measure of variability, called the variance.
6. Take the square root of the variance. This is the standard
deviation.

where "s" is the sample standard deviation, N is the number of observations in the data set, x is the data set, and  is the mean of the given data set, and can be calculated from the following formula: 



x̄  = ( 1 + 2 + 3 + 4 + 5 ) / 5  
    = 3

s  = √( (1/(5-1) * ( ( sqr(1-3) + sqr(2-3) + sqr(3-3) + sqr(4-3) + sqr(5-3) ) ) )
    = √ ( .25 * ( 4 + 1 + 0 + 1 + 4 ) )
    = √ ( .25 * 10 )
    = 1.5811

Implementation of sd in R example:
sd {stats} R Documentation
Standard Deviation

Description
   This function computes the standard deviation of the values in x. If na.rm is TRUE then missing values are removed before computation proceeds.

Usage
sd(x, na.rm = FALSE)
Arguments

x: a numeric vector or an R object which is coercible to one by as.vector(x, "numeric").
na.rm: logical. Should missing values be removed?

EX:
> v <- c(1,2,3,4,5 )
> sd(v)

[1] 1.581139

Means that the average distance from any point to the set mean is 1.511


NB:

Standard Deviation for the population is:



while the Standard Deviation for a sample is:



Wednesday, November 11, 2015

V/S car engine

1- V engine:
    Is a common configuration for an internal combustion engine. The cylinders and pistons are aligned, in two separate planes or 'banks', so that they appear to be in a "V" when viewed along the axis of the crankshaft. The Vee configuration generally reduces the overall engine length, height and weight compared to an equivalent inline configuration.

   Each pair of corresponding pistons from each bank of cylinders share one crank pin on the crankshaft, either by master/slave rods or by two ordinary connecting rods side by side. However, some V-twin engine designs have two-pin cranks, while other V configurations include split crank-pins for more even firing.


2- Straight engine:
   The straight or inline engine is an internal-combustion engine with all cylinders aligned in one row and having no offset. Usually found in four, six and eight cylinder configurations, they have been used in automobiles, locomotives and aircraft.

   The inline-four engine is the most common four-cylinder configuration, whereas the straight-6 has largely given way to the V6 engine, which although not as naturally smooth-running is smaller in both length and height and easier to fit into the engine bay of smaller modern cars. Some manufacturers, including Acura, Audi, Mercedes-Benz, Toyota, Volkswagen and Volvo, have also used straight-five configurations.



Tuesday, November 10, 2015

Operational Data Store (ODS)

ODS is a source like database that collects data from variety of operational systems with a minor enhancements.

Unlike a master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting.

Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low-level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the data warehouse generally on a less-frequent basis.

The Primary purpose of the ODS is to collect, integrate, and distribute current information about the subject and to provide an enterprise view of this subject. The subject can be any that is important for the organization. For example, the customer ODS will typically house the most current information on a customer as well as information on all recent customer interaction with the organization including product ownership and summary usage statistics, statement information, summary-level contacts and other related information.


Monday, November 9, 2015

Report execution time!

If in your report you are calculating on every value you have or even any significant amount of data being reported for that matter then you need to take a step back and realize your data is not modeled in the best manner. 

It should not be acceptable to be doing a ton of calculations in a report. Not only does it make the report slower but it will be a real pain going forward when someone wants the report modified. Companies have spent and will continue to spend thousands on just trying to change a report that could have been prevented.

A key point to pay attention to is how fast your report runs. As long as your report isn’t so complex that it uses millions of rows your report should most of the time run in seconds (no more than 30 seconds).

Herein under is the simplest model of a well established data mart: