Exporting and Importing Elasticsearch Indicies

In my project I need to run some local tests with data from a production elasticsearch cluster, so I exported data from the production server and imported to my local cluster. This can also be used when backing up and restoring data. Here’re the instructions.

Before you start, check out the official documentation: Snapshot and Restore.

Backing up/exporting data:

  1. Modify your eleasticsearch configuration file (normally elasticsearch.yml) and add a path.repo line, for example:
  2. Make sure this path has the correct permissions so that elasticsearch can read and write.
  3. Create snapshot:
  4. Copy the files in the configured location to your local machine.

Restoring/importing data:

  1. Modify your local elasticsearch configuration similarly like step 1 when backing up.
  2. Place the snapshot files to the repo path.
  3. Close your indices:
  4. Import data:
  5. Reopen your indices:

It is important that your the elasticsearch version on your importing party is compatible with the one exporting data, i.e., in this case your local machine has to be the same version or newer. If not, you need to upgrade elasticsearch first. The official documentation says:

The information stored in a snapshot is not tied to a particular cluster or a cluster name. Therefore it’s possible to restore a snapshot made from one cluster into another cluster. All that is required is registering the repository containing the snapshot in the new cluster and starting the restore process. The new cluster doesn’t have to have the same size or topology. However, the version of the new cluster should be the same or newer than the cluster that was used to create the snapshot.

Installing Theano and CUDA on Mac OS X

I started trying Theano today and wanted to use the GPU (NVIDIA GeForce GT 750M 2048 MB) on my Mac. Here’s a brief instruction on how to use the GPU on Mac, largely following the instructions from http://deeplearning.net/software/theano/install.html#mac-os.

Install Theano:

Download and install CUDA: https://developer.nvidia.com/cuda-downloads

Put the following lines into your ~/.bash_profile:

Note that the PATH line is necessary. Otherwise you may see the following message:

ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.

Configure Theano:

Test if GPU is used:

A more realistic example:

So it seems this GPU does not outperform the CPU. Well,GT 750M may not be the best GPU you can get… Someone else here has a similar experience.


NumPy’s ndarray indexing

In NumPy a new kind of array is provided: n-dimensional array or ndarray. It’s usually fixed-sized and accepts items of the same type and size. For example, to define a 2×3 matrix:

When indexing ndarray, it supports “array indexing” other than single element indexing.  (See http://docs.scipy.org/doc/numpy/user/basics.indexing.html)

It is possible to index arrays with other arrays for the purposes of selecting lists of values out of arrays into new arrays. There are two different ways of accomplishing this. One uses one or more arrays of index values. The other involves giving a boolean array of the proper shape to indicate the values to be selected. Index arrays are a very powerful tool that allow one to avoid looping over individual elements in arrays and thus greatly improve performance.

So you basically can do the following:

Besides, when you do equals operation on ndarrays, another ndarray is returned by comparing each element:

MapReduce in MongoDB



The MapReduce code I used to analyze the 20 million hotel reservation records: