3.54 msim - Calculate Similarity Between Two Variables

Calculates the degree of similarity (or distance) between every pair of the variable fields specified in the `f=` parameter, using the similarity (distance) measures specified in the `c=` parameter, and outputs the result as a similarity matrix.

Format

msim c= f= [a=] [k=] [-d] [i=] [o=] [bufcount=] [-nfn] [-nfno] [-x] [-q] [precision=] [--help] [--version]

Parameters

`k=`	Key field(s) (multiple fields can be specified). The similarity is calculated separately for each group of records that share the same key value.
`f=`	Names of the fields for which the similarity is calculated. The similarity is calculated for every pair of the specified fields.
`c=`	Specify the similarity (distance) measure(s) (multiple measures can be specified, separated by commas).
	As shown in the example below, the name of the output field for each measure can be defined after a : (colon).
	If no field name is given after a colon, the name of the similarity (distance) measure is used as the field name.
	Example: `msim f=x,y,z c=pearson:Pearson product-moment correlation coefficient,`
	`euclid:Euclidean distance,cosine:Cosine`
	Similarity measure`=covar\|ucovar\|pearson\|spearman\|kendall\|euclid\|cosine\|`
	`cityblock\|hamming\|chi\|phi\|jaccard\|support\|lift\|confMax\|`
	`confMin\|yuleQ\|yuleY\|kappa\|oddsRatio\|convMax\|convMin`
`a=`	Specify the names of the two output fields that hold the names of the compared variables.
	Specify the two names separated by a comma. The field names `fld1,fld2` are used if `a=` is not specified.
`-d`	Output the diagonal and the upper triangular part of the similarity matrix as well.
	Only the lower triangular part of the similarity matrix is output if the `-d` option is not specified;
	when the `-d` option is specified, the upper triangular part and the diagonal are output in addition.

Definition of similarity (distance)

Real vector

The definitions of the similarity (or distance) measures for two real-valued vectors ${\bf x}=(x_1,x_2,\cdots ,x_ n),{\bf y}=(y_1,y_2,\cdots ,y_ n)$ are shown in Table 3.25.

Table 3.25: Summary of degree of similarity for real number vectors

Parameter value	Detail	Distance/similarity	Equation definition	Range
covar	Covariance	Degree of similarity	$\frac{1}{n}\sum _{i=1}^ n~ (x_ i-\bar{x})(y_ i-\bar{y})$	$-\infty$ 〜 $\infty$
ucovar	Unbiased covariance	Degree of similarity	$\frac{1}{n-1}\sum _{i=1}^ n~ (x_ i-\bar{x})(y_ i-\bar{y})$	$-\infty$ 〜 $\infty$
pearson	Pearson’s product-moment correlation coefficient	Degree of similarity	$\frac{\frac{1}{n}\sum _{i=1}^ n~ (x_ i-\bar{x})(y_ i-\bar{y})}{\sqrt {\frac{1}{n}\sum _{i=1}^ n~ (x_ i-\bar{x})^2}\sqrt {\frac{1}{n}\sum _{i=1}^ n~ (y_ i-\bar{y})^2}}~$	$-1.0$ 〜 $1.0$
spearman	Spearman’s rank correlation coefficient	Degree of similarity	Pearson’s product-moment correlation coefficient calculated after converting the values of ${\bf x},{\bf y}$ into ranks	$-1.0$ 〜 $1.0$
kendall	Kendall’s rank correlation coefficient	Degree of similarity	$\frac{c-d}{\frac{1}{2}n(n-1)} ^{Note:1,2}$	$-1.0$ 〜 $1.0$
euclid	Euclidean distance	Distance	$\sqrt {\sum _{i=1}^ n~ (x_ i-y_ i)^2}~$	$0$ 〜 $\infty$
cosine	Cosine	Degree of similarity	$\frac{\bf {x}\cdot ~ \bf {y}}{\mid \bf {x}\mid \mid \bf {y}\mid }=\frac{\sum _{i=1}^ n~ x_ i~ y_ i}{\sqrt {\sum _{i=1}^ n~ x_ i^2}\sqrt {\sum _{i=1}^ n~ y_ i^2}}$	$-1.0$ 〜 $1.0$
cityblock	City block distance	Distance	$\sum _{i=1}^ n~ \mid ~ x_ i-y_ i\mid$	$0$ 〜 $\infty$
hamming	Hamming distance	Distance	$\mid \{ i \mid x_ i\ne y_ i, i=1,2,\cdots ,n\} \mid$	$0$ 〜 $n$

Note 1: $c=|\{ (i,j)|(x_ i>x_ j ~ {\rm and}~ y_ i>y_ j) ~ {\rm or}~ (x_ i<x_ j ~ {\rm and}~ y_ i<y_ j), i>j, i=1,2,\cdots ,n, j=1,2,\cdots ,n\} |$
Note 2: $d=|\{ (i,j)|(x_ i>x_ j ~ {\rm and}~ y_ i<y_ j) ~ {\rm or}~ (x_ i<x_ j ~ {\rm and}~ y_ i>y_ j), i>j, i=1,2,\cdots ,n, j=1,2,\cdots ,n\} |$
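
As a cross-check of the definitions in Table 3.25, the following Python sketch implements several of the measures directly from the formulas. It is an illustrative reimplementation, not the msim source code, and all function names are hypothetical. The sample vectors are the `x` and `y` columns of `dat1.csv` used in Example 1 of this section; the `pearson` and `cosine` results agree with the values listed there.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covar(x, y, unbiased=False):
    # covar divides by n, ucovar divides by n - 1
    mx, my = mean(x), mean(y)
    s = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return s / (len(x) - 1 if unbiased else len(x))

def pearson(x, y):
    # covariance divided by the product of the standard deviations
    return covar(x, y) / math.sqrt(covar(x, x) * covar(y, y))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cityblock(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def kendall(x, y):
    # c = concordant pairs, d = discordant pairs (Notes 1 and 2), i > j
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (0.5 * n * (n - 1))

x = [14, 11, 32, 13]
y = [0.17, 0.2, 0.15, 0.33]
print("pearson", round(pearson(x, y), 10))
print("cosine ", round(cosine(x, y), 10))
print("kendall", round(kendall(x, y), 10))
```

Because `kendall` follows Notes 1 and 2 literally, tied pairs count as neither concordant nor discordant.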

0-1 vector

The definitions of the similarity measures for two 0-1 vectors ${\bf a}=(a_1,a_2,\cdots ,a_ n),{\bf b}=(b_1,b_2,\cdots ,b_ n)$, whose elements take the value 0 or 1, are shown in Table 3.27. The symbol $f_{jk}$ used in the table is the number of indices $i$ for which $(a_ i,b_ i)=(j,k)$, counted for each combination of 0 and 1 as shown in Table 3.26.

Table 3.26: Combinations of the values of the 2 variables in $2\times 2$ contingency table

	$b_ i=1$	$b_ i=0$	Total
$a_ i=1$	$f_{11}$	$f_{10}$	$f_{1.}$
$a_ i=0$	$f_{01}$	$f_{00}$	$f_{0.}$
Total	$f_{.1}$	$f_{.0}$	$f_{..}$

The notation $P(\cdot )$ used in Table 3.27 is defined as follows.

$P(a)=f_{1.}/f_{..}$

$P(b)=f_{.1}/f_{..}$

$P({\bar a})=f_{0.}/f_{..}$

$P(a,b)=f_{11}/f_{..}$

$P(a|b)=f_{11}/f_{.1}$

Table 3.27: Summary of degree of similarity for 0-1 vectors

Parameter values	Content	Distance/similarity	Equation	Range
chi	Chi-square value	Degree of similarity	$\sum _{i=0}^1~ \sum _{j=0}^1~ \frac{(f_{ij}-e_{ij})^2}{e_{ij}}~ ^{Note:1}$	$0$ 〜 $\infty$
phi	Phi coefficient	Degree of similarity	$\frac{f_{11}f_{00}-f_{10}f_{01}}{\sqrt {f_{1.}f_{0.}f_{.1}f_{.0}}}$	$-1.0$ 〜 $1.0$
jaccard	Jaccard coefficient	Degree of similarity	$\frac{P(a,b)}{P(a)+P(b)-P(a,b)}$	$0.0$ 〜 $1.0$
support	Support	Degree of similarity	$P(a,b)$	$0.0$ 〜 $1.0$
lift	Value of lift	Degree of similarity	$\frac{P(a,b)}{P(a)P(b)}$	$0$ 〜 $\infty$
confMax	Maximum confidence	Degree of similarity	$\max (P(a\mid b),P(b\mid a))$	$0.0$ 〜 $1.0$
confMin	Minimum confidence	Degree of similarity	$\min (P(a\mid b),P(b\mid a))$	$0.0$ 〜 $1.0$
yuleQ	Yule’s coefficient of association (Q)	Degree of similarity	$\frac{\alpha -1}{\alpha +1} ^{Note: 2}$	$-1.0$ 〜 $1.0$
yuleY	Yule’s coefficient of colligation (Y)	Degree of similarity	$\frac{\sqrt {\alpha }-1}{\sqrt {\alpha }+1} ^{Note: 2}$	$-1.0$ 〜 $1.0$
kappa	Kappa coefficient	Degree of similarity	$\frac{P(a,b)+P(\bar{a},\bar{b})-P(a)P(b)-P(\bar{a})P(\bar{b})}{1-P(a)P(b)-P(\bar{a})P(\bar{b})}$	$-1.0$ 〜 $1.0$
oddsRatio	Odds ratio	Degree of similarity	$\frac{P(a,b)P(\bar{a},\bar{b})}{P(a,\bar{b})P(\bar{a},b)}$	$0$ 〜 $\infty$
convMax	Maximum conviction	Degree of similarity	$\max (\frac{P(a)P(\bar{b})}{P(a,\bar{b})},\frac{P(\bar{a})P(b)}{P(\bar{a},b)})$	$0.5$ 〜 $\infty$
convMin	Minimum conviction	Degree of similarity	$\min (\frac{P(a)P(\bar{b})}{P(a,\bar{b})},\frac{P(\bar{a})P(b)}{P(\bar{a},b)})$	$0.5$ 〜 $\infty$

Note 1: $e_{ij}=\frac{f_{i.}f_{.j}}{f_{..}}$
Note 2: $\alpha =\frac{f_{11}f_{00}}{f_{10}f_{01}}$
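
The contingency counts of Table 3.26 and a few of the measures in Table 3.27 can be sketched in Python as follows. This is an illustrative reimplementation with hypothetical function names, not the msim source code. The chi-square value is computed in the standard form, with the squared deviation $(f_{ij}-e_{ij})^2$ in the numerator. The sample vectors are the `x` and `y` columns of `dat3.csv` used in Example 4 of this section, and the `phi` result agrees with the value listed there.

```python
import math

def contingency(a, b):
    # f[(j, k)] = number of positions i with (a_i, b_i) == (j, k)
    f = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for ai, bi in zip(a, b):
        f[(ai, bi)] += 1
    return f

def margins(f):
    row = {j: f[(j, 0)] + f[(j, 1)] for j in (0, 1)}  # f_{j.}
    col = {k: f[(0, k)] + f[(1, k)] for k in (0, 1)}  # f_{.k}
    return row, col

def phi(a, b):
    f = contingency(a, b)
    row, col = margins(f)
    num = f[(1, 1)] * f[(0, 0)] - f[(1, 0)] * f[(0, 1)]
    return num / math.sqrt(row[1] * row[0] * col[1] * col[0])

def chi(a, b):
    f = contingency(a, b)
    row, col = margins(f)
    n = len(a)
    total = 0.0
    for j in (0, 1):
        for k in (0, 1):
            e = row[j] * col[k] / n  # expected count, Note 1
            total += (f[(j, k)] - e) ** 2 / e
    return total

def jaccard(a, b):
    f = contingency(a, b)
    return f[(1, 1)] / (f[(1, 1)] + f[(1, 0)] + f[(0, 1)])

def support(a, b):
    return contingency(a, b)[(1, 1)] / len(a)

def lift(a, b):
    f = contingency(a, b)
    row, col = margins(f)
    return f[(1, 1)] * len(a) / (row[1] * col[1])

x = [1, 1, 1, 0]
y = [1, 0, 0, 1]
print("phi    ", round(phi(x, y), 10))
print("chi    ", round(chi(x, y), 10))
print("jaccard", jaccard(x, y))
```

The sketch does not guard against empty margins; measures such as `phi` and `chi` are undefined when a variable is constant, and how msim reports that case is not covered here.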

Examples

Example 1: Basic Example

Calculate Pearson’s product-moment correlation coefficient and the cosine for every pair of the fields x, y, and z.

$ more dat1.csv
x,y,z
14,0.17,-14
11,0.2,-1
32,0.15,-2
13,0.33,-2
$ msim c=pearson,cosine f=x,y,z i=dat1.csv o=rsl1.csv
#END# kgsim c=pearson,cosine f=x,y,z i=dat1.csv o=rsl1.csv
$ more rsl1.csv
fld1,fld2,pearson,cosine
x,y,-0.5088704666,0.7860308044
x,z,0.1963041929,-0.5338153343
y,z,0.3311001423,-0.5524409416

Example 2: Output the diagonal and upper triangular elements

Calculate Pearson’s product-moment correlation coefficient and the cosine for every pair of the fields x, y, and z, including each field paired with itself and both orderings of each pair (with the `-d` option).

$ msim c=pearson,cosine f=x,y,z -d i=dat1.csv o=rsl2.csv
#END# kgsim -d c=pearson,cosine f=x,y,z i=dat1.csv o=rsl2.csv
$ more rsl2.csv
fld1,fld2,pearson,cosine
x,x,1,1
x,y,-0.5088704666,0.7860308044
x,z,0.1963041929,-0.5338153343
y,x,-0.5088704666,0.7860308044
y,y,1,1
y,z,0.3311001423,-0.5524409416
z,x,0.1963041929,-0.5338153343
z,y,0.3311001423,-0.5524409416
z,z,1,1
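
The difference between the two output layouts can be sketched in Python. The helper `field_pairs` is hypothetical and only reproduces the row order seen in the sample outputs of Examples 1 and 2; it is not taken from the msim implementation.

```python
from itertools import combinations, product

def field_pairs(fields, diagonal=False):
    """Return the (fld1, fld2) pairs in the order of the sample outputs."""
    if diagonal:
        # with -d: every ordered pair, including a field paired with itself
        return list(product(fields, repeat=2))
    # without -d: each unordered pair appears once
    return list(combinations(fields, 2))

print(field_pairs(["x", "y", "z"]))
print(len(field_pairs(["x", "y", "z"], diagonal=True)))
```

For n fields this gives n(n-1)/2 rows without `-d` and n×n rows with `-d`, which is 3 and 9 rows for the three fields used here.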

Example 3: Calculation based on key unit

Calculate the similarities separately for each value of the `key` field.

$ more dat2.csv
key,x,y,z
A,14,0.17,-14
A,11,0.2,-1
A,32,0.15,-2
B,13,0.33,-2
B,10,0.8,-5
B,15,0.45,-9
$ msim k=key c=pearson,cosine f=x,y,z i=dat2.csv o=rsl3.csv
#END# kgsim c=pearson,cosine f=x,y,z i=dat2.csv k=key o=rsl3.csv
$ more rsl3.csv
key%0,fld1,fld2,pearson,cosine
A,x,y,-0.8746392857,0.8472573627
A,x,z,0.3164384831,-0.521983618
A,y,z,0.1830936883,-0.6719258683
B,x,y,-0.7919009884,0.8782575583
B,x,z,-0.471446429,-0.9051543403
B,y,z,-0.1651896746,-0.8514129252
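
The per-key calculation can be sketched in Python as below: the records are grouped by the key value and the measure is computed within each group. The function names are hypothetical and the code is an illustration of the grouping idea under that assumption, not the msim implementation. The embedded data is `dat2.csv` from this example, and the resulting Pearson values agree with the `pearson` column of `rsl3.csv`.

```python
import csv
import io
import math
from itertools import combinations, groupby

DATA = """key,x,y,z
A,14,0.17,-14
A,11,0.2,-1
A,32,0.15,-2
B,13,0.33,-2
B,10,0.8,-5
B,15,0.45,-9
"""

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_by_key(text, key, fields):
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda r: r[key])  # groupby needs records sorted by key
    result = []
    for k, group in groupby(rows, key=lambda r: r[key]):
        group = list(group)
        cols = {f: [float(r[f]) for r in group] for f in fields}
        for f1, f2 in combinations(fields, 2):
            result.append((k, f1, f2, pearson(cols[f1], cols[f2])))
    return result

for k, f1, f2, value in pearson_by_key(DATA, "key", ["x", "y", "z"]):
    print(k, f1, f2, round(value, 10))
```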

Example 4: Specify the similarity measures for 0-1 data

Using data with 0-1 values, compute the Hamming distance and the phi coefficient.

$ more dat3.csv
x,y,z
1,1,0
1,0,1
1,0,1
0,1,1
$ msim c=hamming,phi f=x,y,z i=dat3.csv o=rsl4.csv
#END# kgsim c=hamming,phi f=x,y,z i=dat3.csv o=rsl4.csv
$ more rsl4.csv
fld1,fld2,hamming,phi
x,y,0.75,-0.5773502692
x,z,0.5,-0.3333333333
y,z,0.75,-0.5773502692
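
The `hamming` values in the sample output above (0.75 and 0.5) equal the number of differing positions divided by the number of records (3/4 and 2/4), rather than the raw count. This reading is inferred from the sample values only. The Python sketch below, with hypothetical function names, computes both forms for the columns of `dat3.csv`.

```python
def hamming_count(a, b):
    # number of positions where the two vectors differ
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

def hamming_ratio(a, b):
    # differing positions as a fraction of the vector length
    return hamming_count(a, b) / len(a)

x = [1, 1, 1, 0]
y = [1, 0, 0, 1]
z = [0, 1, 1, 1]
for name, (a, b) in {"x,y": (x, y), "x,z": (x, z), "y,z": (y, z)}.items():
    print(name, hamming_count(a, b), hamming_ratio(a, b))
```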

Example 5: Rename the output fields

Using the data with 0-1 values, compute the Hamming distance and the phi coefficient, renaming the similarity fields with a colon in `c=` and the variable-name fields with `a=`.

$ msim c=hamming:HammingDist,phi:PhiCoeff a=variable1,variable2 f=x,y,z i=dat3.csv o=rsl5.csv
#END# kgsim a=variable1,variable2 c=hamming:HammingDist,phi:PhiCoeff f=x,y,z i=dat3.csv o=rsl5.csv
$ more rsl5.csv
variable1,variable2,HammingDist,PhiCoeff
x,y,0.75,-0.5773502692
x,z,0.5,-0.3333333333
y,z,0.75,-0.5773502692
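
The `measure:name` syntax of `c=` can be illustrated with a small Python sketch. The helper `parse_measures` is hypothetical and only mirrors the documented behaviour that the measure name is used as the field name when no colon is given; it is not how msim itself parses its arguments.

```python
def parse_measures(spec):
    """Split a c= value into (measure, output field name) pairs."""
    pairs = []
    for item in spec.split(","):
        measure, _, label = item.partition(":")
        # fall back to the measure name when no name follows a colon
        pairs.append((measure, label or measure))
    return pairs

print(parse_measures("hamming:HammingDist,phi:PhiCoeff"))
print(parse_measures("pearson,cosine"))
```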

Related Commands

mstats : Calculate the statistics of one variable.

mmvsim : Calculate sliding window similarity measure.