3.54 msim - Calculate Similarity Between Two Variables

Calculates the degree of similarity (or distance) between each pair of the variable fields specified at the f= parameter, using the similarity (distance) measure(s) specified at the c= parameter, and outputs the resulting similarity matrix.

Format

msim c= f= [a=] [k=] [-d] [i=] [o=] [bufcount=] [-nfn] [-nfno] [-x] [-q] [precision=] [--help] [--version]

Parameters

k=

The field(s) specified here (multiple fields can be specified) are used as the unit of calculation: similarities are calculated separately for each key value.

f=

Names of the fields between which the degrees of similarity are calculated. Every pair of the listed fields is compared.

c=

Specify the similarity (distance) measure(s); multiple measures can be specified, separated by commas.

As shown in the example below, the name of the output field for each measure can be defined after a : (colon).

If no field name is defined with a colon, the name of the similarity (distance) measure is used as the field name.

Example: msim f=x,y,z c=pearson:Pearson product-moment correlation coefficient,euclid:Euclidean distance,cosine:Cosine

Similarity measure=covar|ucovar|pearson|spearman|kendall|euclid|cosine|cityblock|hamming|chi|phi|jaccard|support|lift|confMax|confMin|yuleQ|yuleY|kappa|oddsRatio|convMax|convMin

a=

Specify the names of the two output fields that hold the names of the two compared variables, separated by a comma. The field names fld1,fld2 are used if a= is not specified.

-d

Also output the diagonal and the upper triangular part of the similarity matrix.

Only the lower triangular part of the similarity matrix is output if the -d option is not specified; when -d is specified, the upper triangular part and the diagonal are output as well.

Definition of similarity (distance)

Real vector

The definitions of the degrees of similarity (or distances) between two real-number vectors ${\bf x}=(x_1,x_2,\cdots ,x_n),{\bf y}=(y_1,y_2,\cdots ,y_n)$ are shown in Table 3.25.

Table 3.25: Summary of degree of similarity for real number vectors

| Parameter value | Detail | Distance/similarity | Equation definition | Range |
|---|---|---|---|---|
| covar | Covariance | Degree of similarity | $\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})$ | $-\infty$ to $\infty$ |
| ucovar | Unbiased covariance | Degree of similarity | $\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})$ | $-\infty$ to $\infty$ |
| pearson | Pearson's product-moment correlation coefficient | Degree of similarity | $\frac{\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2}\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i-\bar{y})^2}}$ | $-1.0$ to $1.0$ |
| spearman | Spearman's rank correlation coefficient | Degree of similarity | Product-moment correlation coefficient calculated after converting ${\bf x},{\bf y}$ into ranks | $-1.0$ to $1.0$ |
| kendall | Kendall's rank correlation coefficient | Degree of similarity | $\frac{c-d}{\frac{1}{2}n(n-1)}$ (Notes 1, 2) | $-1.0$ to $1.0$ |
| euclid | Euclidean distance | Distance | $\sqrt{\sum_{i=1}^n (x_i-y_i)^2}$ | $0$ to $\infty$ |
| cosine | Cosine | Degree of similarity | $\frac{{\bf x}\cdot {\bf y}}{\mid {\bf x}\mid \mid {\bf y}\mid }=\frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}}$ | $-1.0$ to $1.0$ |
| cityblock | City block distance | Distance | $\sum_{i=1}^n \mid x_i-y_i\mid$ | $0$ to $\infty$ |
| hamming | Hamming distance | Distance | $\mid \{ i \mid x_i\ne y_i,\ i=1,2,\cdots ,n\} \mid$ | $0$ to $n$ |


Note 1: $c=|\{ (i,j)|(x_ i>x_ j ~ {\rm and}~  y_ i>y_ j) ~ {\rm or}~  (x_ i<x_ j ~ {\rm and}~  y_ i<y_ j), i>j, i=1,2,\cdots ,n, j=1,2,\cdots ,n\} |$
Note 2: $d=|\{ (i,j)|(x_ i>x_ j ~ {\rm and}~  y_ i<y_ j) ~ {\rm or}~  (x_ i<x_ j ~ {\rm and}~  y_ i>y_ j), i>j, i=1,2,\cdots ,n, j=1,2,\cdots ,n\} |$
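
The definitions in Table 3.25 can be sketched in plain Python as follows. This is only an illustration of the formulas, not the implementation used by msim; the function names simply reuse the parameter values of c=, and the sample vectors are made up for this sketch.

```python
import math

def _mean(v):
    return sum(v) / len(v)

def covar(x, y, unbiased=False):
    """covar (divide by n) or ucovar (divide by n-1)."""
    mx, my = _mean(x), _mean(y)
    s = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return s / (len(x) - 1 if unbiased else len(x))

def pearson(x, y):
    """Pearson's correlation; the 1/n factors of the table cancel out."""
    mx, my = _mean(x), _mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * \
          math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def cityblock(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def kendall(x, y):
    """(c - d) / (n(n-1)/2), with c and d as in Notes 1 and 2."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i):          # pairs with i > j
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1              # concordant pair
            elif s < 0:
                d += 1              # discordant pair
    return (c - d) / (0.5 * n * (n - 1))

# Hypothetical sample vectors, chosen only for illustration.
x = [1, 2, 3, 4]
y = [2, 4, 6, 9]
for f in (covar, pearson, euclid, cosine, cityblock, kendall):
    print(f.__name__, f(x, y))
```

Ties are counted as neither concordant nor discordant in this sketch of kendall, which follows the strict inequalities of Notes 1 and 2.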

0-1 vector

For two 0-1 vectors ${\bf a}=(a_1,a_2,\cdots ,a_n),{\bf b}=(b_1,b_2,\cdots ,b_n)$, whose elements take the value 0 or 1, the definitions of the degrees of similarity are shown in Table 3.27. The symbols $f_{jk}$ used in that table are the numbers of elements $i$ for each combination of the values (0 or 1) of $a_i$ and $b_i$, as shown in Table 3.26.

Table 3.26: Combinations of the values of the 2 variables in $2\times 2$ contingency table
 

| | $b_i=1$ | $b_i=0$ | Total |
|---|---|---|---|
| $a_i=1$ | $f_{11}$ | $f_{10}$ | $f_{1.}$ |
| $a_i=0$ | $f_{01}$ | $f_{00}$ | $f_{0.}$ |
| Total | $f_{.1}$ | $f_{.0}$ | $f_{..}$ |

The meaning of $P(\cdot )$ used in Table 3.27 is as follows.

$P(a)=f_{1.}/f_{..}$

$P(b)=f_{.1}/f_{..}$

$P({\bar a})=f_{0.}/f_{..}$

$P(a,b)=f_{11}/f_{..}$

$P(a|b)=f_{11}/f_{.1}$
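
As a minimal sketch of how the counts of Table 3.26 and the probabilities above could be obtained from two 0-1 vectors, consider the following Python fragment. The helper names `contingency` and `probabilities` are hypothetical and not part of msim; the sample vectors are the x and y columns of dat3.csv from Example 4.

```python
def contingency(a, b):
    """Return (f11, f10, f01, f00) for two equal-length 0-1 sequences."""
    f11 = sum(1 for ai, bi in zip(a, b) if ai == 1 and bi == 1)
    f10 = sum(1 for ai, bi in zip(a, b) if ai == 1 and bi == 0)
    f01 = sum(1 for ai, bi in zip(a, b) if ai == 0 and bi == 1)
    f00 = sum(1 for ai, bi in zip(a, b) if ai == 0 and bi == 0)
    return f11, f10, f01, f00

def probabilities(f11, f10, f01, f00):
    """P(.) values as defined from the marginal totals of the table."""
    n = f11 + f10 + f01 + f00      # f..
    f1_ = f11 + f10                # f1. (a_i = 1)
    f0_ = f01 + f00                # f0. (a_i = 0)
    f_1 = f11 + f01                # f.1 (b_i = 1)
    return {
        "P(a)": f1_ / n,
        "P(b)": f_1 / n,
        "P(~a)": f0_ / n,
        "P(a,b)": f11 / n,
        "P(a|b)": f11 / f_1,       # undefined when f.1 is 0
    }

a = [1, 1, 1, 0]   # column x of dat3.csv
b = [1, 0, 0, 1]   # column y of dat3.csv
counts = contingency(a, b)
probs = probabilities(*counts)
print(counts, probs)
```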

Table 3.27: Summary of degree of similarity for 0-1 vectors

| Parameter value | Detail | Distance/similarity | Equation definition | Range |
|---|---|---|---|---|
| chi | Chi-square value | Degree of similarity | $\sum_{i=0}^1 \sum_{j=0}^1 \frac{(f_{ij}-e_{ij})^2}{e_{ij}}$ (Note 1) | $0$ to $\infty$ |
| phi | Phi coefficient | Degree of similarity | $\frac{f_{11}f_{00}-f_{10}f_{01}}{\sqrt{f_{1.}f_{0.}f_{.1}f_{.0}}}$ | $-1.0$ to $1.0$ |
| jaccard | Jaccard coefficient | Degree of similarity | $\frac{P(a,b)}{P(a)+P(b)-P(a,b)}$ | $0.0$ to $1.0$ |
| support | Support | Degree of similarity | $P(a,b)$ | $0.0$ to $1.0$ |
| lift | Lift | Degree of similarity | $\frac{P(a,b)}{P(a)P(b)}$ | $0$ to $\infty$ |
| confMax | Maximum confidence | Degree of similarity | $\max (P(a \mid b),P(b \mid a))$ | $0.0$ to $1.0$ |
| confMin | Minimum confidence | Degree of similarity | $\min (P(a \mid b),P(b \mid a))$ | $0.0$ to $1.0$ |
| yuleQ | Yule's coefficient of association (Q) | Degree of similarity | $\frac{\alpha -1}{\alpha +1}$ (Note 2) | $-1.0$ to $1.0$ |
| yuleY | Yule's coefficient of association (Y) | Degree of similarity | $\frac{\sqrt{\alpha }-1}{\sqrt{\alpha }+1}$ (Note 2) | $-1.0$ to $1.0$ |
| kappa | Kappa coefficient | Degree of similarity | $\frac{P(a,b)+P(\bar{a},\bar{b})-P(a)P(b)-P(\bar{a})P(\bar{b})}{1-P(a)P(b)-P(\bar{a})P(\bar{b})}$ | $-1.0$ to $1.0$ |
| oddsRatio | Odds ratio | Degree of similarity | $\frac{P(a,b)P(\bar{a},\bar{b})}{P(a,\bar{b})P(\bar{a},b)}$ | $0$ to $\infty$ |
| convMax | Maximum conviction | Degree of similarity | $\max (\frac{P(a)P(\bar{b})}{P(a,\bar{b})},\frac{P(\bar{a})P(b)}{P(\bar{a},b)})$ | $0.5$ to $\infty$ |
| convMin | Minimum conviction | Degree of similarity | $\min (\frac{P(a)P(\bar{b})}{P(a,\bar{b})},\frac{P(\bar{a})P(b)}{P(\bar{a},b)})$ | $0.5$ to $\infty$ |

Note 1: $e_{ij}=\frac{f_{i.}f_{.j}}{f_{..}}$
Note 2: $\alpha =\frac{f_{11}f_{00}}{f_{10}f_{01}}$
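
Several of the 0-1 vector measures can be sketched directly from the four counts of the contingency table. The following Python function is an illustration under assumptions: the function name `binary_measures` and the sample counts are hypothetical, the chi-square value is taken as the usual sum of squared deviations over expected counts, and no handling of zero counts (which make some ratios undefined) is attempted.

```python
import math

def binary_measures(f11, f10, f01, f00):
    """Selected measures computed from 2x2 contingency counts."""
    n = f11 + f10 + f01 + f00              # f..
    f1_, f0_ = f11 + f10, f01 + f00        # row totals f1., f0.
    f_1, f_0 = f11 + f01, f10 + f00        # column totals f.1, f.0
    pa, pb, pab = f1_ / n, f_1 / n, f11 / n

    # Chi-square: sum over cells of (f_ij - e_ij)^2 / e_ij,
    # with e_ij = f_i. * f_.j / f.. (Note 1).
    observed = {(1, 1): f11, (1, 0): f10, (0, 1): f01, (0, 0): f00}
    row = {1: f1_, 0: f0_}
    col = {1: f_1, 0: f_0}
    chi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            e = row[i] * col[j] / n
            chi += (observed[i, j] - e) ** 2 / e

    alpha = (f11 * f00) / (f10 * f01)      # Note 2; fails if f10*f01 == 0
    agree = pab + f00 / n                  # P(a,b) + P(~a,~b)
    chance = pa * pb + (1 - pa) * (1 - pb) # P(a)P(b) + P(~a)P(~b)
    return {
        "chi": chi,
        "jaccard": pab / (pa + pb - pab),
        "support": pab,
        "lift": pab / (pa * pb),
        "confMax": max(f11 / f_1, f11 / f1_),
        "confMin": min(f11 / f_1, f11 / f1_),
        "yuleQ": (alpha - 1) / (alpha + 1),
        "yuleY": (math.sqrt(alpha) - 1) / (math.sqrt(alpha) + 1),
        "kappa": (agree - chance) / (1 - chance),
    }

# Hypothetical counts, chosen so that no cell is zero.
m = binary_measures(f11=3, f10=1, f01=2, f00=4)
for name, value in m.items():
    print(name, value)
```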

Examples

Example 1: Basic Example

Calculate the cosine and Pearson’s product-moment correlation coefficient for every pair of the fields x, y, and z.

$ more dat1.csv
x,y,z
14,0.17,-14
11,0.2,-1
32,0.15,-2
13,0.33,-2
$ msim c=pearson,cosine f=x,y,z i=dat1.csv o=rsl1.csv
#END# kgsim c=pearson,cosine f=x,y,z i=dat1.csv o=rsl1.csv
$ more rsl1.csv
fld1,fld2,pearson,cosine
x,y,-0.5088704666,0.7860308044
x,z,0.1963041929,-0.5338153343
y,z,0.3311001423,-0.5524409416
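
As a cross-check of the formulas in Table 3.25, the rows of rsl1.csv can be recomputed outside msim. The sketch below is plain Python and is not how msim itself is implemented; the output formatting with 10 significant digits is an assumption made here to resemble the listing above.

```python
import csv
import io
import math
from itertools import combinations

DAT1 = """x,y,z
14,0.17,-14
11,0.2,-1
32,0.15,-2
13,0.33,-2
"""

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

fields = ("x", "y", "z")
rows = list(csv.DictReader(io.StringIO(DAT1)))
cols = {f: [float(r[f]) for r in rows] for f in fields}

# One output row per unordered pair of fields, as in rsl1.csv.
result = [(f1, f2, pearson(cols[f1], cols[f2]), cosine(cols[f1], cols[f2]))
          for f1, f2 in combinations(fields, 2)]

print("fld1,fld2,pearson,cosine")
for f1, f2, p, c in result:
    print(f"{f1},{f2},{p:.10g},{c:.10g}")
```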

Example 2: Output the diagonal and the upper triangular matrix

Calculate the cosine and Pearson’s product-moment correlation coefficient for every pair of the fields x, y, and z, including the diagonal and both triangular parts of the matrix (with the -d option).

$ msim c=pearson,cosine f=x,y,z -d i=dat1.csv o=rsl2.csv
#END# kgsim -d c=pearson,cosine f=x,y,z i=dat1.csv o=rsl2.csv
$ more rsl2.csv
fld1,fld2,pearson,cosine
x,x,1,1
x,y,-0.5088704666,0.7860308044
x,z,0.1963041929,-0.5338153343
y,x,-0.5088704666,0.7860308044
y,y,1,1
y,z,0.3311001423,-0.5524409416
z,x,0.1963041929,-0.5338153343
z,y,0.3311001423,-0.5524409416
z,z,1,1
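
The difference between the row sets of rsl1.csv and rsl2.csv can be pictured as two ways of enumerating field pairs. The following Python lines are only an illustration of that enumeration, inferred from the two example outputs, and say nothing about how msim is implemented.

```python
from itertools import combinations, product

fields = ["x", "y", "z"]

# Without -d: each unordered pair of distinct fields appears once.
pairs_default = list(combinations(fields, 2))

# With -d: every ordered pair appears, including a field paired with itself.
pairs_full = list(product(fields, repeat=2))

print(pairs_default)
print(pairs_full)
```

Because the measures used in the example are symmetric, the rows for (y,x) and (x,y) carry the same values.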

Example 3: Calculation based on key unit

Calculate the similarities separately for each value of the key field.

$ more dat2.csv
key,x,y,z
A,14,0.17,-14
A,11,0.2,-1
A,32,0.15,-2
B,13,0.33,-2
B,10,0.8,-5
B,15,0.45,-9
$ msim k=key c=pearson,cosine f=x,y,z i=dat2.csv o=rsl3.csv
#END# kgsim c=pearson,cosine f=x,y,z i=dat2.csv k=key o=rsl3.csv
$ more rsl3.csv
key%0,fld1,fld2,pearson,cosine
A,x,y,-0.8746392857,0.8472573627
A,x,z,0.3164384831,-0.521983618
A,y,z,0.1830936883,-0.6719258683
B,x,y,-0.7919009884,0.8782575583
B,x,z,-0.471446429,-0.9051543403
B,y,z,-0.1651896746,-0.8514129252
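
The effect of k= can be sketched as grouping the rows by key and applying the measure within each group. The Python fragment below recomputes the pearson column of rsl3.csv under that reading; it is an illustration, not the msim implementation, and the dictionary `per_key` is a name introduced only for this sketch.

```python
import csv
import io
import math
from itertools import combinations, groupby

DAT2 = """key,x,y,z
A,14,0.17,-14
A,11,0.2,-1
A,32,0.15,-2
B,13,0.33,-2
B,10,0.8,-5
B,15,0.45,-9
"""

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

fields = ("x", "y", "z")
rows = list(csv.DictReader(io.StringIO(DAT2)))
rows.sort(key=lambda r: r["key"])   # groupby needs equal keys to be adjacent

per_key = {}
for key, group in groupby(rows, key=lambda r: r["key"]):
    group = list(group)
    cols = {f: [float(r[f]) for r in group] for f in fields}
    for f1, f2 in combinations(fields, 2):
        per_key[(key, f1, f2)] = pearson(cols[f1], cols[f2])

for (key, f1, f2), value in per_key.items():
    print(f"{key},{f1},{f2},{value:.10g}")
```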

Example 4: Specify the type of degree of similarity

Using data with 0-1 values, compute the phi coefficient and the Hamming distance.

$ more dat3.csv
x,y,z
1,1,0
1,0,1
1,0,1
0,1,1
$ msim c=hamming,phi f=x,y,z i=dat3.csv o=rsl4.csv
#END# kgsim c=hamming,phi f=x,y,z i=dat3.csv o=rsl4.csv
$ more rsl4.csv
fld1,fld2,hamming,phi
x,y,0.75,-0.5773502692
x,z,0.5,-0.3333333333
y,z,0.75,-0.5773502692
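
The values in rsl4.csv can be checked by hand. In the sketch below (plain Python, not the msim implementation), the phi column follows the formula of Table 3.27, while the hamming column of the output equals the mismatch count of Table 3.25 divided by the number of rows (for x and y, 3 mismatches out of 4 rows gives 0.75). That normalization is inferred here from the example output only.

```python
import math

x = [1, 1, 1, 0]
y = [1, 0, 0, 1]
z = [0, 1, 1, 1]

def hamming_count(a, b):
    """Number of positions where the two vectors differ."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

def phi(a, b):
    """Phi coefficient from the 2x2 contingency counts."""
    f11 = sum(1 for ai, bi in zip(a, b) if (ai, bi) == (1, 1))
    f10 = sum(1 for ai, bi in zip(a, b) if (ai, bi) == (1, 0))
    f01 = sum(1 for ai, bi in zip(a, b) if (ai, bi) == (0, 1))
    f00 = sum(1 for ai, bi in zip(a, b) if (ai, bi) == (0, 0))
    denom = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return (f11 * f00 - f10 * f01) / denom

for name, a, b in (("x,y", x, y), ("x,z", x, z), ("y,z", y, z)):
    count = hamming_count(a, b)
    # count / len(a) is the proportion of differing rows.
    print(name, count, count / len(a), phi(a, b))
```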

Example 5: Rename the column containing degree of similarity

Using data with 0-1 values, compute the phi coefficient and the Hamming distance, renaming the output fields of the measures with a colon in c= and the fields holding the variable names with a=.

$ msim c=hamming:HammingDist,phi:PhiCoeff a=variable1,variable2 f=x,y,z i=dat3.csv o=rsl5.csv
#END# kgsim a=variable1,variable2 c=hamming:HammingDist,phi:PhiCoeff f=x,y,z i=dat3.csv o=rsl5.csv
$ more rsl5.csv
variable1,variable2,HammingDist,PhiCoeff
x,y,0.75,-0.5773502692
x,z,0.5,-0.3333333333
y,z,0.75,-0.5773502692

Related Commands

mstats : Calculate the statistics of one variable.

mmvsim : Calculate sliding window similarity measure.