Detecting Similar Strings or Sentences

发表于 2013 年 4 月 18 日由 jacoxu

Detecting similar Strings or Sentences are very different from similar Files or Documents There are two main categories of algorithms for matching string similarity: equivalence methods and similarity ranking methods. Equivalence methods compare two strings and return a value of … 继续阅读 →

Reading two Patents about Chinese Query Refinement by Google Inc.

发表于 2013 年 4 月 16 日由 jacoxu

1. Systems and methods for translating Chinese pinyin to Chinese characters [Link] a). Systems and methods to process and translate pinyin to Chinese characters and words are disclosed. b). 2. Systems and methods for spell correction of non-roman characters … 继续阅读 →

simHash是否适合短文本的相似文本匹配

发表于 2013 年 4 月 10 日由 jacoxu

附注：用BigInteger类型来存储64位的hash码 http://doc.java.sun.com/DocWeb/api/all/java.math.BigInteger 很好用，xor()异或、BigInteger.bitLength()取位长、.bitCount()取位为1的个数。 simHash算法流程： 1): 计算simHash码 a). 字符串String分词得到tokens； b). 计算每个tokens的64位Hash码； c). 按Hash码的位进行标记，1则标记为1、否则标记为-1； d). 把每个tokens的Hash码按位进行统计求和； e). 进行签名，大于0则为1，否则为0，得到64位simHash指纹。 2): 把64位simHash码均分为汉明距离n+1块，方便后续查找的所有近邻simHash码； 3): 计算两个simHash码的汉明距离，方法一：给出simHash的64位二进制码字符串：str1.charAt(i) != str2.charAt(i)；方法二：给出simHash的int值：先做异或，然后统计异或后二进制位数为1的个数问题！？：simHash在短文本的可行性！测试相似文本的相似度与汉明距离测试文本：20个城市名作为词串：北京,上海,香港,深圳,广州,台北,南京,大连,苏州,青岛,无锡,佛山,重庆,宁波,杭州,成都,武汉,澳门,天津,沈阳相似度矩阵： simHash码：勘误：0.667, Hm:13 是对比的msg 1与2。可见：相似度在0.8左右的Hamming距离为7，只有相似度高到0.9412，Hamming距离才近到4，此时，反观Google对此算法的应用场景：网页近重复。 … 继续阅读 →

关于SimHash算法原理

发表于 2013 年 4 月 3 日由 jacoxu

前言： SimHash: Similarity Estimation Techniques from Rounding Algorithms 2002 A locality sensitive hashing scheme: sim(x,y). is a method of performing probabilistic dimension reduction of high-dimensional data. MinHash(Min-wise independent permutations): is a technique for quickly estimating how similar two sets are. … 继续阅读 →

Read paper about similarity Search

发表于 2013 年 3 月 29 日由 jacoxu

8B pages == 8Billion pages Title: Detecting Near-Duplicates for Web Crawling WWW 2007, Track: Data Mining. Session: Similarity Search Keywords: Hamming Distance, near-duplicate Problem: the problem of elimination of near-duplicate web documents in a generic crawl has not received attention. … 继续阅读 →

漫话匹配学习(Learning to match) 与Q&A

发表于 2013 年 3 月 21 日由 jacoxu

李航在2012年的一年共做了四篇关于匹配学习的报告，大致内容一致。 2012. 07 mmds_2012 @Stanford University Large Scale Machine Learning for Query Document Matching in Web Search 2012.08 sigir2012_tutorial Beyond Bag of Words Machine Learning for Query-Document Matching in Web Search 2012.08 yssnlp_2012 Learning to Match for Natural Language Processing … 继续阅读 →

关于Time的字符串，如何比较时间

发表于 2013 年 3 月 18 日由 jacoxu

从一个文本中提取出时间的字符串，分别为： 2012-06-09 23:08 2012-01-22 21:14 2012-01-01 23:12 2012-01-01 00:26 2012-01-01 05:08 2012-08-20 18:27 那么该如何比较这些时间的大小呢？最简单的方法想到的是，把字符串中的中英文字母和空格全部剔除掉，然后转化成int型数据进行大小判断： private static long timeConvert(String sendTimeStr) { //去掉中英文字符 String timeStr = sendTimeStr.replaceAll(“[\p{Punct}\s+\pP]+”, “”); long timeInt = Integer.parseInt(timeStr); return timeInt; } 但是此时运行出错，发现了一个新的问题就是，已经精确到分的时间转化int型时例201206092308已经超出了int型的范围。那么考虑调用Date库进行比对，那么需要把字符串都转化为Date型数据： import java.util.Date; import … 继续阅读 →

利用java迭代器Itetator遍历并删除HashMap中的元素问题

发表于 2013 年 3 月 18 日由 jacoxu

问题：下面的代码试图利用HashMap的Iterator对象遍历该HashMap并删除满足条件的元素（比如超时的元素），但会抛出java.util.ConcurrentModificationException异常 public static void main(String[] args) { HashMap hs=new HashMap(); hs.put(“p1″, “1″); hs.put(“p2″, “1″); hs.put(“p3″, “1″); hs.put(“p4″, “1″); hs.put(“p5″, “1″); hs.put(“p6″, “1″); Iterator it=hs.keySet().iterator(); while(it.hasNext()) { String str=(String)it.next(); System.out.println(hs); //逻辑处理……… …………. hs.remove(str); } } 原因应该是hs.remove(str)后，it内容没变，并且it里的指针列表又重新排序，所以只要确保删除任一元素后，it保持同步更新即可：解决方案一：删除任一元素后，it保持同步更新 ………… Iterator it=hs.keySet().iterator(); … 继续阅读 →

Java中HashMap遍历的两种方式效率不同

发表于 2013 年 3 月 14 日由 jacoxu

【原文】:http://www.javaweb.cc/language/java/032291.shtml 第一种： Map map = new HashMap(); Iterator iter = map.entrySet().iterator(); while (iter.hasNext()) { Map.Entry entry = (Map.Entry) iter.next(); Object key = entry.getKey(); Object val = entry.getValue(); } 效率高,以后一定要使用此种方式！第二种： Map map = new HashMap(); Iterator iter = map.keySet().iterator(); while … 继续阅读 →

利用词袋实现短文本相似度计算