What a pain: just found out the GFW is bidirectional after all

November 6, 2011

Lately, connections to dapenti.com keep getting reset (connection reset).
So I ran a test:
1. dig resolves dapenti.com to 61.147.103.165
2. ping works fine
3. wget 61.147.103.165 retrieves the complete front page; at this point the site also opens fine in the browser by IP.
4. wget dapenti.com gets the connection reset. After that, accessing the IP in the browser stops working too.

The full log is here: https://gist.github.com/1344235
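For reference, here is a minimal Python 3 sketch of the same comparison. It assumes (my guess, not verified) that the reset is keyed off the hostname carried in the HTTP request, which is the only difference between steps 3 and 4; the IP and domain are the ones from the test above.

#!/usr/bin/env python3
# Reproduce the wget test: same IP, two different Host headers.
import socket

IP = "61.147.103.165"

def probe(host_header):
    req = ("GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n"
           % host_header).encode("ascii")
    s = socket.create_connection((IP, 80), timeout=10)
    try:
        s.sendall(req)
        return "got: %r" % s.recv(64)        # first bytes of the response, if any
    except ConnectionResetError:
        return "connection reset"
    except socket.timeout:
        return "timed out"
    finally:
        s.close()

if __name__ == "__main__":
    print("Host = IP          ->", probe(IP))             # step 3: works
    print("Host = dapenti.com ->", probe("dapenti.com"))  # step 4: reset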

A few recent things

October 13, 2011

Quite a lot has happened lately; a quick note, with some feelings.

Big news:
Apple CEO Steve Jobs passed away.
Dennis Ritchie, one of the creators of the C language and one of the creators of UNIX, passed away.
I won't say more about either of these; words cannot do them justice.

Small things:
I forgot to follow the Turing Award this year. I later found out that Leslie Valiant was the winner.
As it happens I'm taking Machine Learning right now and learned about PAC learning, which he proposed, and his hugely influential paper "A Theory of the Learnable".
Earlier, when I was working on approximation algorithms, I also read his papers on network reliability and the complexity of counting.

A local guitar store had a sale, so I picked up an American Standard Stratocaster, in my favorite 3-color sunburst!

Booked my ticket to fly home.

(The end)

“Brother, I miss you”

September 21, 2011

Now that there's internet at home, I remembered I hadn't been in touch with my family for days. Worried that my parents might have left me messages on QQ, I logged into QQ for the first time in half a month.

They hadn't left any messages, but my younger cousin had, saying he has started school in the university town. Suddenly time feels very fast: back when I was a student at 北中 he was still in elementary school, and now he has already graduated from 北中 and gone off to university.

We chatted a bit; he had something to do and had to go, so we started saying goodbye. The last thing he said was, “Brother, I miss you.”

It's such an ordinary sentence, but coming from another guy it made me cringe a little... and at the same time it felt really warm.

My parents have never been good at expressing feelings, and that rubbed off on me: I have never once said “I miss you” to them, let alone to anyone else.

Maybe my cousin is simply better at showing how he feels. But turning it around, if those words ever came out of my mouth, it would probably be at a moment when I was close to an emotional breakdown, and I would surely be in tears while saying them.

It has been more than a year since I came to America. Sure enough, the older you get the more weighs on your mind, because there is more and more to take care of; especially in a foreign country, every matter, big or small, is yours alone to handle.

Rent, phone, bank, internet, utilities, gas, the bike, courses, research: just the names alone can already “name a few”, and I could keep enumerating all day.

And so I've been too busy to even write a blog post, too busy to ever go out on weekends, too busy to pick up the guitar or play ball even once a week, too busy to check in on my family even once a week.

I've always worried that keeping in touch with people wastes time, and it really does. I believe five minutes is enough to let someone feel you care, but that ideal is rarely reached; it usually takes half an hour to properly catch up.

And with that whole pile of family and friends, it's genuinely hard to carve out time for everyone. It's not that I'm selfish; that's just how it is. Calling home and video-chatting with my family already feels like my limit.

So, my friends, it's not that I don't care about you; it's that family has to come first.

I suppose my friends will gradually dwindle, then.

Survey on Network Reliability Problem

September 8, 2011

The network reliability problem asks one to determine the survival/failure probability of a set of terminals in a probabilistic network. In particular, the network is given as a graph, either directed or undirected. Each edge is associated with a failure (survival) probability q_e (p_e = 1 - q_e), meaning it fails (survives) independently with probability q_e (p_e). The problem is to compute the failure/survival probability of the given terminals. The simplest version is s-t reliability, where the terminal set contains only two terminals. Other variants include k-terminal and all-terminal reliability. The problem is known to be hard: it is #P-complete, and it remains hard even when the graph is a directed acyclic graph with maximum degree three. No polynomial-time exact algorithm exists unless P = NP. Hence people are interested in finding approximation algorithms.
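As a toy illustration of the problem (my own example, not from the survey), here is a minimal Monte Carlo sketch for s-t reliability on a small, made-up undirected graph: sample edge failures independently and check s-t connectivity each time.

#!/usr/bin/env python3
# Crude Monte Carlo estimate of s-t reliability on a tiny example graph.
import random

def st_reliability(edges, q, s, t, samples=100000):
    """edges: list of (u, v); q[(u, v)]: failure probability of that edge.
    Returns the estimated probability that s and t remain connected."""
    hits = 0
    for _ in range(samples):
        # keep each edge independently with probability 1 - q[e]
        alive = [(u, v) for (u, v) in edges if random.random() >= q[(u, v)]]
        adj = {}
        for u, v in alive:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
        # DFS from s over the surviving edges
        seen, stack = {s}, [s]
        while stack:
            x = stack.pop()
            for y in adj.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        hits += t in seen
    return hits / samples

if __name__ == "__main__":
    # two parallel two-edge paths from s to t, every edge fails with prob 0.1
    edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
    q = {e: 0.1 for e in edges}
    print(st_reliability(edges, q, "s", "t"))   # exact value: 1 - (1 - 0.81)^2 = 0.9639

Naive sampling like this is exactly what breaks down in the interesting regimes: when the failure (or survival) probability is tiny, almost every sample looks the same and exponentially many samples are needed for a good relative error, which is what the three methods below address.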

Last semester I wrote a survey of the problem as a project for the Approximation Algorithms course. The article introduces three methods for the problem, all of which use Monte Carlo as the basic framework.

1. R. Karp and M. Luby’s Monte Carlo method [Karp and Luby]
Karp and Luby showed that the Monte Carlo algorithm is an FPRAS (fully polynomial-time randomized approximation scheme) when the number of samples needed is sufficiently small; more precisely, the number of samples needed is proportional to the ratio between the size of the sample space and the number of positive samples. Hence, how to formulate the sample space becomes the most important question. They proposed several formulations, including K-separating cuts, paths and cycles, and finally walks. The first two are straightforward and easy to understand, but they rely on assumptions that do not hold in general. The last one makes the Monte Carlo method a true FPRAS, but personally I found it very hard to follow, and impractical.

2. Karger’s alpha-minimum cut method [Karger FPRAS]
I have to say this is an interesting and clever method. First note that the network becomes disconnected only when the failed edges include all edges of some cut. He showed that only small cuts matter when computing the failure probability, where ‘small cuts’ means the minimum cuts and the ‘near’-minimum cuts. In another paper, Karger showed that these cuts can be enumerated in polynomial time and that there are only polynomially many of them [Karger min cut] (a very interesting algorithm, highly recommended). Hence, the problem reduces to DNF counting:

We assign a boolean variable x_e to each edge e, where x_e = true if e fails and false otherwise. The i-th small cut c_i disconnects the network if and only if all of its edges fail. Let F_i be the conjunction of x_e over all e in c_i; clearly F_i is true if and only if c_i fails. Hence, the event that some small cut fails can be written as F = the disjunction of all F_i. We want the probability that F is true, which (up to the negligible contribution of the larger cuts) is the failure probability of the network. Note that F is in disjunctive normal form, and the number of clauses in F is exactly the number of small cuts, which is polynomial. There is an FPRAS for the DNF counting problem, which also uses Monte Carlo, and the rest is straightforward.
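To make the reduction concrete, here is a small sketch of the standard Karp-Luby estimator applied to this weighted DNF instance (my own toy code and toy data, not Karger's implementation): each clause is one small cut, Pr[F_i] is the product of its edges' failure probabilities, and each sample picks a clause with probability proportional to Pr[F_i], fills in the remaining edges at random, and reweights by the number of clauses the sample satisfies.

#!/usr/bin/env python3
# Karp-Luby estimator for Pr[F], where F = OR_i (AND of x_e for e in cut_i)
# and each x_e is true (edge e fails) independently with probability q[e].
import random

def dnf_failure_prob(cuts, q, edges, samples=100000):
    # Pr[F_i] for each cut: all of its edges must fail.
    clause_prob = [1.0] * len(cuts)
    for i, c in enumerate(cuts):
        for e in c:
            clause_prob[i] *= q[e]
    total = sum(clause_prob)

    acc = 0.0
    for _ in range(samples):
        # 1. pick clause i with probability Pr[F_i] / total
        i = random.choices(range(len(cuts)), weights=clause_prob)[0]
        # 2. sample an edge-failure assignment conditioned on clause i holding:
        #    edges in cut i fail for sure, the others fail independently
        failed = set(cuts[i])
        for e in edges:
            if e not in failed and random.random() < q[e]:
                failed.add(e)
        # 3. reweight by the number of clauses this assignment satisfies
        cover = sum(1 for c in cuts if c <= failed)
        acc += 1.0 / cover
    return total * acc / samples

if __name__ == "__main__":
    # toy instance: 4 edges, three hypothetical "small cuts"
    edges = ["e1", "e2", "e3", "e4"]
    q = {e: 0.1 for e in edges}
    cuts = [{"e1", "e2"}, {"e3", "e4"}, {"e1", "e3"}]
    print(dnf_failure_prob(cuts, q, edges))   # exact value by inclusion-exclusion: 0.028

The estimator is unbiased, and because every sample satisfies at least one clause the per-sample value never strays far from the true answer, which is why the DNF formulation admits an FPRAS while naively sampling rare failures does not.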

3. Zenklusen’s FPRAS for the s-t reliability problem on acyclic networks [Zenklusen]
This is a relatively new result. The basic idea is that, in a DAG, one can compute the expected number of intact s-t paths after the failure process. In a highly unreliable network this number is a good estimate of the s-t reliability. Monte Carlo is then used to estimate the ratio between reliability(s,t) and the expected number of intact s-t paths. He showed that the s-t paths can be sampled in linear time.

[Karp and Luby]: Monte-carlo algorithms for the planar multiterminal network reliability problem.
[Karger FPRAS]: A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem.
[Karger min cut]: An O~(n^2) algorithm for minimum cuts.
[Zenklusen]: Combinatorial methods for analyzing network security and reliability

—-

I decided to release my manuscript from CS598: Approximation Algorithms, a survey of the network reliability problem. I am not an expert on approximation algorithms and my writing may not be great; for anything mysterious or unclear, please refer to the original papers.

The file can be downloaded from here. The password is ‘cs598’. If this document is of any help, please give appropriate citation in your work.

Suffix tree: an illustration of the algorithm

August 31, 2011

I recently studied suffix trees and found Ukkonen’s linear-time online algorithm quite hard to understand. Mark Nelson has a good article about it; however, in my opinion the writing is not great and some parts are confusing.

For example, the active point and end point are very important concepts. The active point is defined as ‘the first non-leaf node’, but it is not actually a node you can see when looking at the tree, because it may be an ‘implicit node’, which you cannot find anywhere when you draw the tree. It really has to be viewed from the suffix's point of view.

I had to read his source code to completely understand how it works. To demonstrate it better, I designed the following example to illustrate the algorithm:
this is how the algorithm proceeds step by step when the input string is ‘cacao’. Hope this helps.

[Figure: step-by-step construction of the suffix tree for ‘cacao’]

Other useful resources include this tutorial and an online suffix tree computation form. Here’s an excellent Java applet that generates the suffix tree on the fly. Finally, another detailed tutorial is given by Prof. Sartaj Sahni, but I found that presentation introduces a lot of symbols and is quite hard to follow. A lecture note from Prof. Shaojie Zhang is written clearly, though it does not give many details.

Here are some points I want to clarify about the algorithm and suffix trees, assuming you already know what a suffix tree looks like:
– Intuitively, a suffix tree is a compressed trie. You can construct it trivially by inserting each suffix one by one into a trie and then compressing it; however, this takes quadratic time (a minimal sketch of a naive quadratic construction is given right after this list).
– To speed things up, people proposed different approaches. A little history from Wikipedia:

The concept was first introduced as a position tree by Weiner in 1973, which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976, and also by Ukkonen in 1995. Ukkonen provided the first online-construction of suffix trees, now known as Ukkonen’s algorithm, with running time that matched the then fastest algorithms.

Ukkonen’s is the easiest to understand and it is online; it is probably the one most people use.

– The basic idea of the algorithm is this: given a string S[1..n], we process the characters one by one, extending the previously inserted suffixes with the character currently being processed. For example, suppose we are given ‘cacao’. When we are processing ‘o’, the existing suffixes are ‘caca’, ‘aca’, ‘ca’ and ‘a’, and we want to append ‘o’ to each of them.
– For character S[i+1], the current suffixes are S[j..i], for j = 1..i. In the current tree we trace the path that spells S[j..i]. Note that this path may stop in the middle of an edge (an implicit node), at an explicit node, or at a leaf. There are basically three cases when we insert the character for suffix S[j..i]:
1. There is already a path continuing from the end of S[j..i] that starts with S[i+1], i.e., S[j..i+1] already corresponds to a path. We do nothing; this in fact introduces an implicit node.
2. S[j..i] ends at a leaf: simply append S[i+1].
3. No path from the end of S[j..i] starts with S[i+1]: split the edge at the branching position and add a new leaf for S[i+1]. Note that this is the ONLY case that introduces a new leaf.
– However, the above method still has to go through all the suffixes at each iteration, so it is still quadratic time.
– The speedup and the memory reduction come from the following important observations and techniques:
1. At each iteration we only have to consider the suffixes between the active point and the end point. It can be shown that suffixes before the active point already end at leaf nodes, so nothing needs to be done for them (Case 2, which with technique 3 below costs nothing), while suffixes beyond the end point fall under Case 1. Furthermore, the end point becomes the active point of the next iteration.
2. We introduce a suffix link (also called a suffix pointer) for each internal (non-leaf, non-root) node, pointing to the node that represents the longest proper suffix of that node's string. For example, the longest proper suffix of "cacao" is "acao". With this technique we don't need to search from the root for the node of the next suffix every time; instead we jump directly from the node just processed to the next one, e.g., from "cacao" to "acao".
3. We don't need to actually store the substrings in the nodes; we only store the indices of their starting and ending characters. In particular, a leaf edge's end index can simply be "the current end of the string", which is why nothing actually has to be done in Case 2.
4. We can use a hash table to store the edges, which lets us insert, remove and find edges in constant time. The details can be found in Mark's implementation, and a comparison of edge representations is on the wiki.
– Intuitively, the active point is the first suffix that requires you to do something, i.e., the first one where Case 3 applies. The end point is the first suffix for which you again do nothing, i.e., the first suffix where Case 1 or Case 2 applies again while working from the active point within the iteration.
– To prevent any suffix from being a prefix of another suffix, the last character of the string must be unique. In general we can simply append a special character that is not in the alphabet, e.g., the ‘$’ used in the example. For instance, ‘book’ needs no extra character since ‘k’ is unique, but ‘banana’ does since ‘a’ is not. With the terminator appended, after the algorithm finishes there are exactly |S| + 1 leaves in the tree, one for each suffix; in other words, every suffix ends at a leaf!
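For a concrete reference point, here is a minimal sketch of the naive quadratic construction mentioned at the top of this list (my own toy code, not Ukkonen's algorithm and not Mark's), together with the O(|P|)-to-locate substring search discussed below: each suffix is inserted by walking down the compressed tree and splitting an edge where the match stops.

#!/usr/bin/env python3
# Naive (quadratic-time) construction of a compressed suffix tree, plus a
# simple substring search. For illustration only -- this is NOT Ukkonen's
# algorithm, just the "insert every suffix, keep edges compressed" view.

class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (label, child)
        self.leaf_ids = []   # starting positions of suffixes ending here

def build_suffix_tree(s):
    s += "$"                          # unique terminator, as discussed above
    root = Node()
    for start in range(len(s)):       # insert suffix s[start:]
        node, i = root, start
        while True:
            c = s[i]
            if c not in node.children:            # no outgoing edge: add a new leaf
                leaf = Node()
                leaf.leaf_ids.append(start)
                node.children[c] = (s[i:], leaf)
                break
            label, child = node.children[c]
            j = 0                                 # match along the edge label
            while j < len(label) and i + j < len(s) and label[j] == s[i + j]:
                j += 1
            if j == len(label):                   # consumed the whole edge, descend
                node, i = child, i + j
                continue
            mid = Node()                          # mismatch inside the edge: split it
            node.children[c] = (label[:j], mid)
            mid.children[label[j]] = (label[j:], child)
            leaf = Node()
            leaf.leaf_ids.append(start)
            mid.children[s[i + j]] = (s[i + j:], leaf)
            break
    return root

def occurrences(root, pattern):
    """Starting positions of `pattern`: O(|pattern|) to locate the node,
    plus the size of its subtree to collect the leaves."""
    node, i = root, 0
    while i < len(pattern):
        c = pattern[i]
        if c not in node.children:
            return []
        label, child = node.children[c]
        k = min(len(label), len(pattern) - i)
        if label[:k] != pattern[i:i + k]:
            return []
        node, i = child, i + k
    out, stack = [], [node]           # collect all leaves below `node`
    while stack:
        n = stack.pop()
        out.extend(n.leaf_ids)
        stack.extend(child for _, child in n.children.values())
    return sorted(out)

if __name__ == "__main__":
    tree = build_suffix_tree("cacao")
    print(occurrences(tree, "ca"))    # -> [0, 2]
    print(occurrences(tree, "o"))     # -> [4]
    print(occurrences(tree, "cc"))    # -> []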

Important applications of suffix trees include:
– Finding a substring P in S in O(|P|) time. Note that other linear-time algorithms such as KMP require O(|P|+|S|); when there are many patterns P against the same S, the suffix tree is more efficient.
– Finding the longest common substring (LCS) of P and S. We can build a suffix tree for the string P#S$, where ‘#’ and ‘$’ are two characters not in the alphabet. To find the LCS, we traverse the tree, keeping for each internal node the number of characters on the path from the root (its string depth). The desired node is then an internal node of largest string depth that has at least one leaf representing a suffix of P and at least one representing a suffix of S. This takes O(|P|+|S|) time, whereas the typical dynamic-programming solution requires O(|P||S|) time. This extends to the more general case of multiple strings S1, S2, …, Sn, with runtime equal to the sum of the lengths of the strings.
– Finding the longest palindrome in S. There is an excellent article about this in Chinese. The basic idea: let P be the reverse of S and build the suffix tree of S#P$. Then for each i, find the deepest lowest common ancestor (LCA) of the suffixes starting at S[i] and P[n-i+1] (even length) and at S[i+1] and P[n-i+1] (odd length). The LCA queries can be answered in overall linear time using Tarjan’s offline algorithm.
– Given a set of strings S1, S2, …, Sn, find every Si that contains a pattern P. This is a typical application in computational biology: the strings are a gene bank and the pattern is a gene segment. Let T = S1#S2#…#Sn$; its suffix tree is called a multiple-string suffix tree. To find all Si containing P, walk down from the root following P. If we cannot complete a path spelling P, no string contains P; otherwise we stop at a node N, and from the leaves below N we can read off which strings Si contain P. A special case is when N is a leaf, meaning only one string contains P. To speed this up, we can maintain at each node the list of indices i such that some suffix of Si ends in a leaf below that node.
– Refer to this and this for more detail.

Update: on Mac OS X's filesystem encoding

July 25, 2011

This is related to the previous problem. I asked a question on Stack Overflow and got the following answer [1]:

The problem is that MacOS X’s default filesystem changes all filenames you give it to an unusual normalization form which does not use precomposed characters.

According to the documentation of Python's unicodedata module [2]:

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0327 (COMBINING CEDILLA) U+0043 (LATIN CAPITAL LETTER C).

The U+00C7 there is ‘Ç’.

I have also updated the corresponding Python source code.
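For illustration, here is a small sketch of the normalization issue (the filename and code are my own example, not the updated script): the filesystem hands back decomposed names, so a comparison against a precomposed name fails unless both sides are normalized to the same form first.

#!/usr/bin/env python3
# Precomposed vs. decomposed filenames, and how to compare them safely.
import unicodedata

precomposed = "Ça va.txt"                                  # 'Ç' is U+00C7
decomposed = unicodedata.normalize("NFD", precomposed)     # 'C' followed by U+0327

print(precomposed == decomposed)          # False: different code point sequences
print(len(precomposed), len(decomposed))  # 9 10

# Normalize both sides (NFC here) before comparing.
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))            # True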

References:
[1] http://stackoverflow.com/questions/6803621/how-to-prevent-the-command-line-argument-from-being-encoded/6809703#6809703
[2] http://docs.python.org/library/unicodedata.html#unicodedata.normalize

Update: a Python script for renaming GBK-mojibake files

July 24, 2011

For a detailed description of the problem see the previous two posts: [1] [2]. I wrote a Python script that renames the garbled files; it is here.
Note that it has a small issue on Mac OS X; see here.

Update: on Xunlei offline downloads, and Windows zip files coming out garbled under Linux

July 23, 2011

I now realize these two problems are actually the same problem, and I've found a simpler, cleaner way to handle it. The key point is that encode('latin-1') converts a unicode string into a plain byte string without performing any real re-encoding of the values.
The code is as follows:

#!/usr/bin/env python
# Python 2: each argv entry is a UTF-8 byte string of a mis-decoded GBK name.
import sys
for i in sys.argv[1:]:
    # UTF-8 -> code points (really GBK byte values) -> raw GBK bytes -> real text
    print i.decode('utf8').encode('latin-1').decode('gbk')

PS. I have no idea why convmv, which is what people online suggest, didn't work for me.

On garbled filenames when downloading Xunlei offline files with Chrome under Linux

July 23, 2011

Anything touching character encodings is hard to explain clearly, but here is the situation. Xunlei's web page itself is UTF-8 encoded, so it displays fine in the browser. But after clicking download, the filename the browser comes up with is this string:

[ĸÇ×].Mother.mkv

Printed from inside Python it looks like this:

‘[\xc3\x84\xc2\xb8\xc3\x87\xc3\x97].Mother.mkv’

(The original string is ‘[母亲].Mother.mkv’.) My locale is en_US.UTF-8. If I set it to zh_CN.GB18030 instead, the string is recognized correctly, but then the file can't be saved to disk, because my drives all use UTF-8 filenames. So my guess is that, in a UTF-8 environment, the browser reads a GBK-encoded string, interprets the bytes directly as Unicode code points, and then encodes those code points as UTF-8.

Taking the character ‘母’ as an example, the figure below illustrates what goes wrong:

So, in the string above, ‘\xc3\x84\xc2\xb8’ corresponds to ‘母’ and ‘\xc3\x87\xc3\x97’ corresponds to ‘亲’.
The fix is simply to run that process in reverse. I wrote it up in Python; it is a bit fiddly and I'm not sure whether it can be simplified.
The rough idea: first match out the non-ASCII characters, group every 4 bytes together (each group representing one GBK character),
then decode them to Unicode code points. The *values* of these code points are really GBK byte values, but since they sit in a unicode object they have to be converted back into a plain str first. Once converted, the bytes can be decoded from GBK into unicode; for output we then encode the result back to UTF-8.
Sample code is here.
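Since the sample code itself is not reproduced in this post, here is a rough Python 3 reconstruction of the steps just described (my own sketch, operating on whole non-ASCII runs instead of fixed 4-byte groups):

#!/usr/bin/env python3
# Reverse the mangling: GBK bytes were read as code points and UTF-8 encoded.
import re
import sys

def fix_name(raw):
    """raw: the garbled filename as UTF-8 bytes; returns proper UTF-8 bytes."""
    out = []
    # split the byte string into ASCII runs and non-ASCII runs
    for part in re.split(rb'([\x00-\x7f]+)', raw):
        if not part:
            continue
        if part[0] < 0x80:
            out.append(part)                   # plain ASCII, keep as-is
        else:
            # each mangled character is the UTF-8 encoding of a code point in
            # U+0080..U+00FF whose numeric value is really one GBK byte
            code_points = part.decode('utf-8')
            gbk_bytes = bytes(ord(c) for c in code_points)
            out.append(gbk_bytes.decode('gbk').encode('utf-8'))
    return b''.join(out)

if __name__ == '__main__':
    sample = b'[\xc3\x84\xc2\xb8\xc3\x87\xc3\x97].Mother.mkv'
    print(fix_name(sample).decode('utf-8'))    # -> [母亲].Mother.mkv
    for name in sys.argv[1:]:
        raw = name.encode('utf-8', 'surrogateescape')
        print(fix_name(raw).decode('utf-8'))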

klayout

July 19, 2011

klayout can be driven very conveniently with Ruby scripts, but the precompiled Mac version does not ship a klayout command.
In fact, you can simply invoke /Applications/klayout.app/Contents/MacOS/klayout directly.
A plain soft link, however, runs into a working-directory problem:
the app bundles Qt, rblib and other resources, which then fail to be found.
The fix, as with mvim, is to wrap it in a small script:

#!/bin/sh
# Wrapper so klayout can be invoked from the command line, mvim-style.

KLAYOUT_APP_DIR=/Applications

binary="$KLAYOUT_APP_DIR/klayout.app/Contents/MacOS/klayout"

# forward any command-line arguments to the real binary
exec "$binary" ${1:+"$@"}
exec "$binary" ${1:+"$@"}