[ Text Similarity ] String-Based Similarity (Cosine, Jaccard)

Notice

Recent Posts

Recent Comments

Link

Link to blog "한 사람의 일상"

« 2025/02 »
일	월	화	수	목	금	토
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28

Tags more

Archives

Today

Total

관리 메뉴

바이오 대표

[ Text Similarity ] String-Based Similarity (Cosine, Jaccard) 본문

Master dissertion

[ Text Similarity ] String-Based Similarity (Cosine, Jaccard)

바이오 대표 2022. 1. 11. 01:02

Text Similarity 를 구하는 방법에는 String-based, Corpus(말뭉치)-based, Knowledge based가 있다

논문에서 Side Effect to Side Effect, Disease to Disease Similarity 를 위해 간단히 String-Based Text Similarity 를 적용하였다.

String-Based Text Similarity:

[1] Character-based

[2] Term-based ( Cosine Similarity, Jaccard Similarity)

Jaccard Similarity	number of shared terms over the number of all unique terms in both strings
Cosine Similarity	similarity b/w two vectors of an inner product space that measures the cos of the angle between them

해당 방법은 각도 기반 유사도 측정법이다. 유사도 측정법에는 크게 거리기반 그리고 각도 기반 측정법이 있다.

거리 기반: 가장 가까운 점들이 유사도가 높다/ 각도 기반: 기울기가 비슷한 점들이 유사도가 높다.

[1] Jaccard Similarity

def Jaccard_Similarity(doc1, doc2): 
    # List the unique words in a document
    words_doc1, words_doc2 = set(doc1.lower().split()), set(doc2.lower().split())
    
    # Find the intersection of words list of doc1 & doc2
    intersection = words_doc1.intersection(words_doc2)

    # Find the union of words list of doc1 & doc2
    union = words_doc1.union(words_doc2)
        
    # Calculate Jaccard similarity score using length of intersection set divided by length of union set
    return float(len(intersection)) / len(union)
    
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"

Jaccard_Similarity(doc_1,doc_2)  # 0.44444

[2] Cosine similarity

from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
	return dot(A, B) / (norm(A) * norm(B))

A = [0.123, 0.456, 0.789]
B = [0.345, 0.765, 0.987]
result = cos_sim(A,B)  # result = 0.9821175

import sklearn.metrics.pairwise

result2 = sklearn.metrics.pairwise.cosine_similarity([A,B])
result2  # array([[1.        , 0.98211752],
         #        [0.98211752, 1.        ]])

다양한 Similarity 방법들을 아래 논문에서 확인 할 수 있다.

Reference

[1] Gomaa, W.H., & Fahmy, A.A. (2013). A Survey of Text Similarity Approaches. International Journal of Computer Applications, 68, 13-18.

저작자표시

'Master dissertion' 카테고리의 다른 글

[ Cross-Validation ] 교차 검증 (K-fold, Leave One Out Cross-Validation) (0)	2022.02.21
[ Drug Pair Scoring ] "A Unified View of Relational Deep Learning for Drug Pair Scoring" (0)	2022.01.21
[Leave-one-out] "Algebraic shortcuts for leave-one-out cross-validation in supervised network inference" (0)	2021.11.05
[Deep Tensor Factorization] Deep Tensor Factorization For Predicting Anticancer Drug Synergy_2020 (0)	2021.10.21
[Rank Decomposition] Rank of matrix개념, CP decomposition (0)	2021.10.19

'Master dissertion' Related Articles

바이오 대표

[ Text Similarity ] String-Based Similarity (Cosine, Jaccard) 본문

[ Text Similarity ] String-Based Similarity (Cosine, Jaccard)

'Master dissertion' 카테고리의 다른 글

티스토리툴바