Skip to content →

Tag: Big Data

Exporting and Importing Elasticsearch Indicies

In my project I need to run some local tests with data from a production elasticsearch cluster, so I exported data from the production server and imported to my local cluster. This can also be used when backing up and restoring data. Here’re the instructions.

Before you start, check out the official documentation: Snapshot and Restore.

Backing up/exporting data:

  1. Modify your eleasticsearch configuration file (normally elasticsearch.yml) and add a path.repo line, for example:
    path.repo: /usr/local/var/backups/
  2. Make sure this path has the correct permissions so that elasticsearch can read and write.
  3. Create snapshot:
    curl -XPUT http://localhost:9200/_snapshot/my_backup -d '{"type": "fs", "settings": {"compress": "true", "location": "/usr/local/var/backups/"}}}'
    curl -XPUT http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_forcompletion=true
  4. Copy the files in the configured location to your local machine.

Restoring/importing data:

  1. Modify your local elasticsearch configuration similarly like step 1 when backing up.
  2. Place the snapshot files to the repo path.
  3. Close your indices:
    curl -XPOST http://localhost:9200/knx-bus/_close
  4. Import data:
    curl -XPOST http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore?pretty
  5. Reopen your indices:
    curl -XPOST http://localhost:9200/knx-bus/_open

It is important that your the elasticsearch version on your importing party is compatible with the one exporting data, i.e., in this case your local machine has to be the same version or newer. If not, you need to upgrade elasticsearch first. The official documentation says:

The information stored in a snapshot is not tied to a particular cluster or a cluster name. Therefore it’s possible to restore a snapshot made from one cluster into another cluster. All that is required is registering the repository containing the snapshot in the new cluster and starting the restore process. The new cluster doesn’t have to have the same size or topology. However, the version of the new cluster should be the same or newer than the cluster that was used to create the snapshot.

2 Comments

NumPy’s ndarray indexing

In NumPy a new kind of array is provided: n-dimensional array or ndarray. It’s usually fixed-sized and accepts items of the same type and size. For example, to define a 2×3 matrix:

import numpy as np
a = np.array([[1,2,3,], [4,5,6]], np.int32)

When indexing ndarray, it supports “array indexing” other than single element indexing.  (See http://docs.scipy.org/doc/numpy/user/basics.indexing.html)

It is possible to index arrays with other arrays for the purposes of selecting lists of values out of arrays into new arrays. There are two different ways of accomplishing this. One uses one or more arrays of index values. The other involves giving a boolean array of the proper shape to indicate the values to be selected. Index arrays are a very powerful tool that allow one to avoid looping over individual elements in arrays and thus greatly improve performance.

So you basically can do the following:

a = np.array([1, 2, 3], np.int32)
a[np.array([0, 2])) # Fetch the first the third elements, returns np.array([1, 3])
a[np.array([True, False, True])] # Same as the line above

Besides, when you do equals operation on ndarrays, another ndarray is returned by comparing each element:

a = np.array([1, 2, 3], np.int32)
a == 2 # Returns array([False,  True, False], dtype=bool)
a != 2 # Returns array([ True, False,  True], dtype=bool)
a[a != 2] # Returns a sub array that excludes elements with a value 2, in this case array([1, 3], dtype=int32)
Leave a Comment

MapReduce in MongoDB

http://docs.mongodb.org/manual/core/map-reduce/

http://docs.mongodb.org/manual/reference/command/mapReduce/

> db.lattern_money_record.mapReduce( function() { emit(this.quantity, 1) }, function(key, values) { return Array.sum(values) }, {   query: {'quantity': {$gt: 500}}, out: {inline: 1} } )
{
	"results" : [
		{
			"_id" : 550,
			"value" : 3
		},
		{
			"_id" : 570,
			"value" : 1
		},
		{
			"_id" : 580,
			"value" : 1
		},
		{
			"_id" : 583,
			"value" : 1
		},
		{
			"_id" : 587,
			"value" : 1
		},
		{
			"_id" : 600,
			"value" : 2
		},
		{
			"_id" : 660,
			"value" : 1
		},
		{
			"_id" : 700,
			"value" : 2
		},
		{
			"_id" : 800,
			"value" : 5
		},
		{
			"_id" : 900,
			"value" : 2
		},
		{
			"_id" : 924,
			"value" : 1
		},
		{
			"_id" : 949,
			"value" : 1
		},
		{
			"_id" : 980,
			"value" : 1
		},
		{
			"_id" : 990,
			"value" : 1
		},
		{
			"_id" : 1000,
			"value" : 12
		}
	],
	"timeMillis" : 36,
	"counts" : {
		"input" : 35,
		"emit" : 35,
		"reduce" : 6,
		"output" : 15
	},
	"ok" : 1,
}

The MapReduce code I used to analyze the 20 million hotel reservation records:

def get_aggregation(collection):
    '''
    1. Get unique set of people
    2. Get most frequent users
    3. Get aggregation by location of birth, age, month and day of birth
    '''
    # Emit multiple times in mapper function:
    # http://docs.mongodb.org/manual/reference/command/mapReduce/
    mapper = Code('''
                  function() {
                    function validate_rid(id) {
                        // From: https://gist.github.com/foxwoods/1817822
                        // 18位身份证号
                        // 国家标准《GB 11643-1999》
                        function rid18(id) {
                            if(! /\d{17}[\dxX]/.test(id)) {
                                return false;
                            }
                            var modcmpl = function(m, i, n) { return (i + n - m % i) % i; },
                                f = function(v, i) { return v * (Math.pow(2, i-1) % 11); },
                                s = 0;
                            for(var i=0; i<17; i++) {
                                s += f(+id.charAt(i), 18-i);
                            }
                            var c0 = id.charAt(17),
                                c1 = modcmpl(s, 11, 1);
                            return c0-c1===0 || (c0.toLowerCase()==='x' && c1===10);
                        }

                        // 15位身份证号
                        // 2013年1月1日起将停止使用
                        // http://www.gov.cn/flfg/2011-10/29/content_1981408.htm
                        function rid15(id) {
                            var pattern = /[1-9]\d{5}(\d{2})(\d{2})(\d{2})\d{3}/,
                                matches, y, m, d, date;
                            matches = id.match(pattern);
                            y = +('19' + matches[1]);
                            m = +matches[2];
                            d = +matches[3];
                            date = new Date(y, m-1, d);
                            return (date.getFullYear()===y && date.getMonth()===m-1 && date.getDate()===d);
                        }

                        // return rid18(id) || rid15(id);
                        try {
                            ret = rid18(id) || rid15(id);
                            return ret;
                        } catch (err) {
                            return false;
                        }
                    }

                    function validateEmail(email) {
                        // http://stackoverflow.com/questions/46155/validate-email-address-in-javascript
                        var re = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
                        return re.test(email);
                    }

                    var str = this.CtfId;
                    if (str && validate_rid(str)) {
                        var prov = parseInt(str.slice(0, 2));
                        var year, month, day, sex;
                        if (str.length == 15) {
                            year = parseInt('19' + str.slice(6, 8));
                            month = parseInt(str.slice(8, 10));
                            day = parseInt(str.slice(10, 12));
                            sex = parseInt(str.slice(14, 15)) % 2 ? 'M' : 'F';
                        } else {
                            year = parseInt(str.slice(6, 10));
                            month = parseInt(str.slice(10, 12));
                            day = parseInt(str.slice(12, 14));
                            sex = parseInt(str.slice(16, 17)) % 2 ? 'M' : 'F';
                        }
                        var age = 2013 - year;
                        var valid_provs = [11, 12, 13, 14, 15,
                            21, 22, 23, 31, 32, 33, 34, 35, 36, 37,
                            41, 42, 43, 44, 45, 46,
                            50, 51, 52, 53, 54,
                            61, 62, 63, 64, 65,
                            71, 81, 82, 91];
                        if (age <= 0 || age > 100 ||
                            month <=0 || month > 12 ||
                            day <= 0 || day > 31 ||
                            valid_provs.indexOf(prov) == -1) {
                            emit('Corrupted', 1);
                        } else {
                            // emit('Province ' + prov, 1);
                            // emit('Age ' + age, 1);
                            // emit('Month ' + month, 1);
                            // emit('Day ' + day, 1);
                            // emit('Sex ' + sex, 1);
                            // emit('Prov ' + prov + ' Sex ' + sex, 1);
                            // if (this.Address && this.Address.length > 3) {
                            //     var cur_prov = this.Address.slice(0, 3);
                            //     emit('From ' + prov + ' to ' + cur_prov, 1);
                            // }

                            // var email = this.EMail;
                            // if (email && validateEmail(email)) {
                            //     var idx = email.lastIndexOf('@');
                            //     var domain = email.slice(idx + 1);
                            //     emit(domain.toLowerCase(), 1);
                            // }

                            if (prov == 32 && sex == 'M') {
                                emit(str, 1);
                            }
                            // if (prov == 32 && sex == 'F') {
                            //     emit(str, 1);
                            // }
                        }
                    } else {
                        emit('Corrupted', 1);
                    }
                  }''')
    reducer = Code('''
                   function(key, values) {
                    return Array.sum(values);
                   }''')
    result = collection.map_reduce(
        mapper, reducer, 'aggregation', query={'CtfTp': 'ID'}
    )
    return result

 

Leave a Comment

funf smart phone data collecting

读完几本大数据的书,偶然发现了一个叫funf的手机应用,号称能将安卓手机变身为数据收集器。于是我很兴奋的下载安装,让这个app在我手机上采集了两天的数据,分析得到的结果如下:

battery_life phone_temperature relative_activity screen_activity

 

老实说,收集了两天的数据,就能分析出这么点东西,着实让人失望。不过这个生成图像信息的程序是开源的,而且是用Python实现的,今后有空了也可以自己来分析一下收集到的数据。funf在我的手机上也经常不能响应,可以说这个app是十足的半成品。不过这个由来自MIT的团队开发的不仅仅是一个手机app,号称是一个Open Sensing Framework,前几天还刚刚被Google收购了。虽然是个半成品,但Google这个时候收购团队总比让他们羽翼丰满之后再收购所花的代价要小得多。

Leave a Comment

Notes on Big Data

《大数据时代》这本书强调了这个时代数据的重要性,所谓大数据,即全体数据而非抽样数据,大数据强调混杂性而不追求准确性,注重相关关系而非因果关系。在商业中,不仅需要得到大数据,而且需要知道如何利用大数据。例如,如何筛选自己需要的信息,了解自己真正需要什么。在这个过程中重视产生的结果,却不要过分纠结于产生这个结果的原因。本书作者似乎是业界翘楚,书中旁征博引,很有说服力。可以看出来作者Viktor Mayer-Schonberger是Taylorism的坚定信徒,他相信任何事情都是可以用数据测量和表述的。这一点跟《The Shallows》的作者Nicholas Carr的观点不一样,Carr认为并不是所有的东西都是可以测量的。当然Schonberger在书最后也提到了这一点,他也认为我们处理的信息不过是世界的某个投影,大数据也只是一个工具,我们在使用大数据的时候不要自负,要“铭记人性之本”。

《删除》是《大数据时代》作者Schonberger的另一本书,作者这次从用户的角度出发,讲诉在大数据时代中应该做出对自己有利的事情。计算机和存储介质的发明,就意味着数据会被永久存储;特别是近年来各种软件和在线服务的出现,使得人们越来越没有能力控制自己信息的散布,流动,以及存在的时限。很多信息一旦公开(甚至是对少许几人公开),你就没有在将它控制住的能力了,颇有“覆水难收”的味道。书中举了一个在MySpace上贴出自己饮酒照片被上司发现而最终被取消教师资格的女孩的故事;而我前几天发现我的网站被另一个网站做了历史镜像的经历也同样让我不安。一旦信息公布,就不能再收回了。作者在书中给了很多建议,如节制数字化信息的使用,知道自己应该公开那些信息,重视公开信息的后果;重视隐私法律的建立以及提供相应的技术支持;调整大众对数字隐私的认知;等等。同时作者最后还抛出了一个给所有信息都加上一个存储期限的想法,这个不管从技术上还是用户体验上都难以在近期时限。但是,最近确实有类似的在线服务出现,如阅后即焚的聊天服务snapchat,在线文本存储与共享工具pastebin等,同时Google等巨头也一再缩短存储用户信息的时间。随着大众对隐私的觉醒,相信在不久会有更多的服务在信息存储期限上下功夫。下一个会不会是email呢?毕竟我们需要的只是一小部分对我们有用的东西,如果我们都根本不记得邮件里有哪些内容,就让它们被慢慢的遗忘吧。

《爆发》从行为预测的角度讲诉了大数据的用途。这本书看到一半时我觉得这是我看过的最好的业界趋势读物,因为作者巴拉巴西的写作手法很奇特:全书每一章都分为两个部分,前半部分讲技术,后一部分讲历史故事。每看完一章我都想迫不及待的看下一章,颇有章回体小说的味道。但是等我把全书看完之后却还是云里雾里,完全不理解作者的观点是什么,到底人的行为是否可以被准确预测。不过确定的有一点,如书名所诉,我们的行为充满爆发性(Bursts)。我们可以很长时间不写邮件,但是同时也有可能在短时间里写很多邮件;我写博客也是,可能好几个月都不写一篇,有时候又连续几天都写;我好几个月都不看书,一看就连续看好几本;甚至花钱也是,好几天一分钱都不花,也有可能一天花好多……作者认为我们之所以会有这些行为,从根本来讲是因为我们的生物特性中即有爆发性,细胞生成的过程即是由一个又一个的爆发组成的。从行为学上讲,我们会给自己要做的事情安排一个优先级,只有优先级高的事情才会被完成。由此生成的优先级队列就隐含了爆发性,导致我们实际完成事情的时候也是一组一组爆发式的完成的。因为我们的行为不是一个随机过程,所以我们不遵循泊松分布。书中的历史故事也很精彩,讲诉了16世纪发生在匈牙利的一个农民起义的故事。起义领袖赛克勒从发起起义到被捕都发生在很短的时间内,按作者的说法他的起义过程实际上是一个快速燃烧完的爆发点。赛克勒并没有经历太多就变成了起义领袖,由此就注定必败的命运。一切来的太突然了,不是每个人都能把握好自己。所以“天将降大任于斯人也,必先苦其心志,劳其胫骨,乏其体肤”也是不无道理的。一个人需要蛰伏很久,才能从容的面对破茧而出的那一刻。

Leave a Comment