Tag: Big Data

Exporting and Importing Elasticsearch Indices

In my project I needed to run some local tests with data from a production Elasticsearch cluster, so I exported the data from the production server and imported it into my local cluster. The same procedure also works for backing up and restoring data. Here are the instructions.

Before you start, check out the official documentation: Snapshot and Restore.

Backing up/exporting data:

  1. Modify your elasticsearch configuration file (normally elasticsearch.yml) and add a path.repo line, for example:
    path.repo: /usr/local/var/backups/
  2. Make sure this path has the correct permissions so that elasticsearch can read and write.
  3. Register the snapshot repository, then create the snapshot:
    curl -XPUT http://localhost:9200/_snapshot/my_backup -d '{"type": "fs", "settings": {"compress": "true", "location": "/usr/local/var/backups/"}}'
    curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'
  4. Copy the files in the configured location to your local machine.
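The two backup requests above can also be sketched in Python. This is a minimal illustration using only the standard library, not an official client: the helper names `register_repo_request` and `create_snapshot_request` are made up for this example, and the `Content-Type` header is an addition that newer Elasticsearch versions expect on requests with a body. The functions only build the requests; nothing is sent until you pass them to `urllib.request.urlopen`.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # same host and port as the curl commands


def register_repo_request(repo, location, base_url=BASE_URL):
    """Build the PUT request that registers a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"compress": True, "location": location}}
    return Request(
        f"{base_url}/_snapshot/{repo}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_snapshot_request(repo, snapshot, base_url=BASE_URL):
    """Build the PUT request that creates a snapshot and waits for it to finish."""
    return Request(
        f"{base_url}/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
        method="PUT",
    )


repo_req = register_repo_request("my_backup", "/usr/local/var/backups/")
snap_req = create_snapshot_request("my_backup", "snapshot_1")
# To actually send them against a running cluster:
#   from urllib.request import urlopen
#   urlopen(repo_req); urlopen(snap_req)
```

Building the request objects separately from sending them makes it easy to inspect the URL and body before touching a production cluster.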

Restoring/importing data:

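  1. Add the same path.repo setting to your local elasticsearch configuration, as in step 1 of the backup.
  2. Place the snapshot files in the repo path and register the repository on the local cluster, using the same PUT _snapshot/my_backup request as in step 3 of the backup.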
  3. Close your indices:
    curl -XPOST http://localhost:9200/knx-bus/_close
  4. Import data:
    curl -XPOST http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore?pretty
  5. Reopen your indices:
    curl -XPOST http://localhost:9200/knx-bus/_open

It is important that the elasticsearch version on the importing side is compatible with the one that exported the data, i.e., in this case your local machine has to run the same version or a newer one. If it does not, you need to upgrade elasticsearch first. The official documentation says:

The information stored in a snapshot is not tied to a particular cluster or a cluster name. Therefore it’s possible to restore a snapshot made from one cluster into another cluster. All that is required is registering the repository containing the snapshot in the new cluster and starting the restore process. The new cluster doesn’t have to have the same size or topology. However, the version of the new cluster should be the same or newer than the cluster that was used to create the snapshot.
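The "same or newer" rule quoted above can be written down as a small pre-flight check. This is only a sketch of that one sentence: the function names are hypothetical, it assumes purely numeric dotted version strings such as "1.4.2", and it does not replace the compatibility notes in the official documentation for your specific versions.

```python
def parse_version(text):
    """Turn a dotted version string like '1.4.2' into a comparable tuple (1, 4, 2)."""
    return tuple(int(part) for part in text.split("."))


def can_restore(snapshot_version, target_version):
    """Apply the quoted rule: the target cluster must be the same version or newer."""
    return parse_version(target_version) >= parse_version(snapshot_version)


same = can_restore("1.4.2", "1.4.2")    # same version
newer = can_restore("1.4.2", "1.5.0")   # local cluster is newer
older = can_restore("1.5.0", "1.4.2")   # local cluster is older
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where "1.10.0" would sort before "1.9.0".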


NumPy’s ndarray indexing

NumPy provides a new kind of array: the n-dimensional array, or ndarray. It usually has a fixed size, and all of its items have the same type and size. For example, to define a 2×3 matrix:

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]], np.int32)

Besides single-element indexing, an ndarray also supports “array indexing”. The NumPy indexing documentation describes it as follows:

It is possible to index arrays with other arrays for the purposes of selecting lists of values out of arrays into new arrays. There are two different ways of accomplishing this. One uses one or more arrays of index values. The other involves giving a boolean array of the proper shape to indicate the values to be selected. Index arrays are a very powerful tool that allow one to avoid looping over individual elements in arrays and thus greatly improve performance.

So you can basically do the following:

a = np.array([1, 2, 3], np.int32)
a[np.array([0, 2])] # Fetch the first and the third elements, returns np.array([1, 3])
a[np.array([True, False, True])] # Same as the line above
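The same two styles carry over to arrays with more than one dimension. The following sketch reuses the 2×3 matrix shape from earlier; the variable names are just for illustration.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]], np.int32)

# A single index array selects along the first axis: row 1, then row 0.
rows = m[np.array([1, 0])]

# One index array per axis selects individual elements: (0, 2) and (1, 0).
picked = m[np.array([0, 1]), np.array([2, 0])]

# A boolean array of the same shape selects the matching elements
# and returns them as a flat one-dimensional array.
mask = np.array([[True, False, True], [False, True, False]])
selected = m[mask]

print(rows.tolist())      # [[4, 5, 6], [1, 2, 3]]
print(picked.tolist())    # [3, 4]
print(selected.tolist())  # [1, 3, 5]
```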

Besides, when you apply a comparison operator to an ndarray, the comparison is done element by element and another ndarray of booleans is returned:

a = np.array([1, 2, 3], np.int32)
a == 2 # Returns array([False,  True, False], dtype=bool)
a != 2 # Returns array([ True, False,  True], dtype=bool)
a[a != 2] # Returns a sub array that excludes elements with a value 2, in this case array([1, 3], dtype=int32)
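Since comparisons return boolean arrays, they can be combined before being used as an index. A short sketch with made-up values:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5], np.int32)

# Combine element-wise conditions with & (and), | (or) and ~ (not).
# The parentheses are required because & binds tighter than the comparisons.
middle = a[(a > 1) & (a < 5)]
ends = a[~((a > 1) & (a < 5))]

# A boolean mask also works on the left-hand side, to assign in place.
b = a.copy()
b[b % 2 == 0] = 0

# True counts as 1 when summed, which gives the number of matches.
n_even = int((a % 2 == 0).sum())

print(middle.tolist())  # [2, 3, 4]
print(ends.tolist())    # [1, 5]
print(b.tolist())       # [1, 0, 3, 0, 5]
print(n_even)           # 2
```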

MapReduce in MongoDB

> db.lattern_money_record.mapReduce( function() { emit(this.quantity, 1) }, function(key, values) { return Array.sum(values) }, { query: {'quantity': {$gt: 500}}, out: {inline: 1} } )
{
	"results" : [
		{ "_id" : 550, "value" : 3 },
		{ "_id" : 570, "value" : 1 },
		{ "_id" : 580, "value" : 1 },
		{ "_id" : 583, "value" : 1 },
		{ "_id" : 587, "value" : 1 },
		{ "_id" : 600, "value" : 2 },
		{ "_id" : 660, "value" : 1 },
		{ "_id" : 700, "value" : 2 },
		{ "_id" : 800, "value" : 5 },
		{ "_id" : 900, "value" : 2 },
		{ "_id" : 924, "value" : 1 },
		{ "_id" : 949, "value" : 1 },
		{ "_id" : 980, "value" : 1 },
		{ "_id" : 990, "value" : 1 },
		{ "_id" : 1000, "value" : 12 }
	],
	"timeMillis" : 36,
	"counts" : {
		"input" : 35,
		"emit" : 35,
		"reduce" : 6,
		"output" : 15
	},
	"ok" : 1,
}
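To see where the numbers in "counts" come from, here is a pure-Python sketch of what this mapReduce call computes. It is a simplified model, not MongoDB's implementation: the function name and sample records are hypothetical, and it assumes reduce runs exactly once for each key that received more than one value (MongoDB skips reduce for single-value keys, and on larger inputs it may call reduce several times per key). Under that model the output above is consistent: 35 matching documents, 35 emits, 15 distinct quantities, and 6 quantities that occur more than once.

```python
from collections import defaultdict


def map_reduce_counts(records, threshold=500):
    """Return ({quantity: count}, number_of_reduce_calls) for quantity > threshold."""
    # query: {'quantity': {$gt: threshold}}
    matched = [r for r in records if r.get("quantity", 0) > threshold]

    # map: emit(this.quantity, 1), with emitted values grouped by key
    emitted = defaultdict(list)
    for record in matched:
        emitted[record["quantity"]].append(1)

    # reduce: Array.sum(values), skipped when a key has a single value
    results = {}
    reduce_calls = 0
    for key, values in sorted(emitted.items()):
        if len(values) > 1:
            results[key] = sum(values)
            reduce_calls += 1
        else:
            results[key] = values[0]
    return results, reduce_calls


# Hypothetical sample records, not the real collection.
records = [{"quantity": q} for q in (100, 550, 550, 550, 570, 600, 600, 1000)]
counts, reduce_calls = map_reduce_counts(records)
print(counts)        # {550: 3, 570: 1, 600: 2, 1000: 1}
print(reduce_calls)  # 2
```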

The MapReduce code I used to analyze the 20 million hotel reservation records:

from bson.code import Code


def get_aggregation(collection):
    """
    1. Get unique set of people
    2. Get most frequent users
    3. Get aggregation by location of birth, age, month and day of birth
    """
    # Emit multiple times in mapper function.
    # A raw string keeps the backslashes in the JavaScript regexes intact.
    mapper = Code(r'''
                  function() {
                    function validate_rid(id) {
                        // From:
                        // 18-digit resident ID number
                        // National standard GB 11643-1999
                        function rid18(id) {
                            if(! /\d{17}[\dxX]/.test(id)) {
                                return false;
                            }
                            var modcmpl = function(m, i, n) { return (i + n - m % i) % i; },
                                f = function(v, i) { return v * (Math.pow(2, i-1) % 11); },
                                s = 0;
                            for(var i=0; i<17; i++) {
                                s += f(+id.charAt(i), 18-i);
                            }
                            var c0 = id.charAt(17),
                                c1 = modcmpl(s, 11, 1);
                            return c0-c1===0 || (c0.toLowerCase()==='x' && c1===10);
                        }

                        // 15-digit resident ID number
                        // To be discontinued from January 1, 2013
                        function rid15(id) {
                            var pattern = /[1-9]\d{5}(\d{2})(\d{2})(\d{2})\d{3}/,
                                matches, y, m, d, date;
                            matches = id.match(pattern);
                            y = +('19' + matches[1]);
                            m = +matches[2];
                            d = +matches[3];
                            date = new Date(y, m-1, d);
                            return (date.getFullYear()===y && date.getMonth()===m-1 && date.getDate()===d);
                        }

                        // return rid18(id) || rid15(id);
                        // rid15 throws when the pattern does not match; treat that as invalid.
                        try {
                            var ret = rid18(id) || rid15(id);
                            return ret;
                        } catch (err) {
                            return false;
                        }
                    }

                    function validateEmail(email) {
                        var re = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
                        return re.test(email);
                    }

                    var str = this.CtfId;
                    if (str && validate_rid(str)) {
                        // Digits 1-2 are the province code; the birth date and the
                        // sex digit sit at different offsets in 15- and 18-digit IDs.
                        var prov = parseInt(str.slice(0, 2), 10);
                        var year, month, day, sex;
                        if (str.length == 15) {
                            year = parseInt('19' + str.slice(6, 8), 10);
                            month = parseInt(str.slice(8, 10), 10);
                            day = parseInt(str.slice(10, 12), 10);
                            sex = parseInt(str.slice(14, 15), 10) % 2 ? 'M' : 'F';
                        } else {
                            year = parseInt(str.slice(6, 10), 10);
                            month = parseInt(str.slice(10, 12), 10);
                            day = parseInt(str.slice(12, 14), 10);
                            sex = parseInt(str.slice(16, 17), 10) % 2 ? 'M' : 'F';
                        }
                        var age = 2013 - year;
                        var valid_provs = [11, 12, 13, 14, 15,
                            21, 22, 23, 31, 32, 33, 34, 35, 36, 37,
                            41, 42, 43, 44, 45, 46,
                            50, 51, 52, 53, 54,
                            61, 62, 63, 64, 65,
                            71, 81, 82, 91];
                        if (age <= 0 || age > 100 ||
                            month <=0 || month > 12 ||
                            day <= 0 || day > 31 ||
                            valid_provs.indexOf(prov) == -1) {
                            emit('Corrupted', 1);
                        } else {
                            // Uncomment the emit lines for the aggregation you want to run.
                            // emit('Province ' + prov, 1);
                            // emit('Age ' + age, 1);
                            // emit('Month ' + month, 1);
                            // emit('Day ' + day, 1);
                            // emit('Sex ' + sex, 1);
                            // emit('Prov ' + prov + ' Sex ' + sex, 1);
                            // if (this.Address && this.Address.length > 3) {
                            //     var cur_prov = this.Address.slice(0, 3);
                            //     emit('From ' + prov + ' to ' + cur_prov, 1);
                            // }

                            // var email = this.EMail;
                            // if (email && validateEmail(email)) {
                            //     var idx = email.lastIndexOf('@');
                            //     var domain = email.slice(idx + 1);
                            //     emit(domain.toLowerCase(), 1);
                            // }

                            if (prov == 32 && sex == 'M') {
                                emit(str, 1);
                            }
                            // if (prov == 32 && sex == 'F') {
                            //     emit(str, 1);
                            // }
                        }
                    } else {
                        emit('Corrupted', 1);
                    }
                  }
                  ''')
    # Sum the emitted 1s to get a count per key.
    reducer = Code('''
                   function(key, values) {
                    return Array.sum(values);
                   }
                   ''')
    # Collection.map_reduce exists in PyMongo 3.x and earlier; it was removed in PyMongo 4.0.
    result = collection.map_reduce(
        mapper, reducer, 'aggregation', query={'CtfTp': 'ID'}
    )
    return result
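The 18-digit checksum used by rid18 in the mapper can be sketched in plain Python, which makes it easy to test outside MongoDB. This is an illustrative port, not the code that was run: the function names are hypothetical, the regex is anchored to exactly 18 characters (the JavaScript version is not), and the sample number is a commonly used illustrative value rather than a record from the data set.

```python
import re


def rid18_is_valid(rid):
    """Check the GB 11643-1999 check character of an 18-character ID string."""
    if not re.fullmatch(r"[0-9]{17}[0-9xX]", rid):
        return False
    # The weight of the digit at position pos is 2**(17 - pos) mod 11,
    # matching Math.pow(2, i-1) % 11 with i = 18 - pos in the mapper.
    total = sum(int(ch) * pow(2, 17 - pos, 11) for pos, ch in enumerate(rid[:17]))
    expected = (12 - total % 11) % 11  # 10 is written as 'X'
    last = rid[17]
    if last in "xX":
        return expected == 10
    return int(last) == expected


def rid18_fields(rid):
    """Extract the same fields the mapper uses from an 18-character ID."""
    return {
        "prov": int(rid[0:2]),
        "year": int(rid[6:10]),
        "month": int(rid[10:12]),
        "day": int(rid[12:14]),
        "sex": "M" if int(rid[16]) % 2 else "F",
    }


sample = "11010519491231002X"  # illustrative sample value
valid = rid18_is_valid(sample)
fields = rid18_fields(sample)
```

For the sample value the weighted sum is 167, and 167 mod 11 is 2, which corresponds to the check character X.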



Collecting smartphone data with funf


[Charts: battery_life, phone_temperature, relative_activity, screen_activity]


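To be honest, it is rather disappointing that two days of collected data yield only this much analysis. However, the program that generates these charts is open source and written in Python, so when I have time I can analyze the collected data myself. funf also frequently stops responding on my phone, so the app is very much a half-finished product. Still, what this team from MIT has built is not just a phone app but, as they call it, an Open Sensing Framework, and it was acquired by Google just a few days ago. Half-finished as it is, acquiring the team now costs Google far less than acquiring it once it has fully matured.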


Notes on Big Data

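The book 《大数据时代》 (Big Data) stresses how important data is in this era. Big data means all of the data rather than a sample; it embraces messiness instead of pursuing exactness, and it cares about correlation rather than causation. In business, it is not enough to obtain big data; you also need to know how to use it, for example how to filter out the information you need and understand what you really need. In this process, pay attention to the results, but do not dwell too much on the causes behind them. The author seems to be a leading figure in the field; the book draws on a wide range of sources and is very persuasive. It is clear that the author, Viktor Mayer-Schonberger, is a firm believer in Taylorism: he believes that everything can be measured and expressed in data. This differs from the view of Nicholas Carr, author of The Shallows, who holds that not everything can be measured. Of course, Schonberger also touches on this at the end of the book: he too acknowledges that the information we process is only a projection of the world and that big data is just a tool, so we should not be arrogant when using it and should "keep the essence of humanity in mind".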


