Friday, October 23, 2015

Working with a Cassandra table from Apache Pig

Apache Pig, along with Hive, is another great tool for analyzing big data, and a lot of useful scripts and reports have been built with it. Nowadays Apache Spark is becoming the standard for big data processing, but existing Pig scripts still need access to their data.

So here is my short step-by-step guide to connecting Apache Pig to your data stored in Apache Cassandra.


Set up environment

First we need to tell Pig the address and port of the Cassandra cluster. This can be done through environment variables or later in the Pig script, through the connection string parameters. On Unix machines, the environment variables are exported in the shell before Pig is started.
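A minimal sketch of those exports. The variable names PIG_INITIAL_ADDRESS, PIG_RPC_PORT and PIG_PARTITIONER are the ones read by Cassandra's Pig storage classes. The values (localhost, the default Thrift port 9160, Murmur3Partitioner) are assumptions, so replace them with the settings of your own cluster.

```shell
# Address of any node in the Cassandra cluster (assumed: a local node).
export PIG_INITIAL_ADDRESS=localhost
# Thrift RPC port of that node (assumed: Cassandra's default, 9160).
export PIG_RPC_PORT=9160
# Partitioner used by the cluster (assumed: the default Murmur3Partitioner).
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
```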


Register jars

Now we can run grunt. Pig needs a few Cassandra-related jars, which can be plugged in from the grunt prompt with Pig's REGISTER command.
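Which jars are needed depends on your Cassandra version, so I will not guess an exact list. One way to avoid picking them by hand is to register everything from Cassandra's lib directory. The sketch below only prints one REGISTER statement per jar, ready to be pasted into grunt. The /opt/cassandra location and the register_statements helper are hypothetical names used for illustration.

```shell
# Hypothetical location of the Cassandra installation; adjust to your setup.
CASSANDRA_HOME="${CASSANDRA_HOME:-/opt/cassandra}"

# Print one Pig REGISTER statement for every jar under <dir>/lib.
register_statements() {
  for jar in "$1"/lib/*.jar; do
    # Skip the unexpanded pattern when the directory holds no jars.
    if [ -e "$jar" ]; then
      printf 'REGISTER %s;\n' "$jar"
    fi
  done
}

register_statements "$CASSANDRA_HOME"
```

Registering the whole directory is broader than strictly necessary, but it keeps the script independent of exact jar file names.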

Fetch datasource

Now everything is ready to fetch data from your Cassandra table. Let's assume that you have a keyspace named 'mykeyspace' and a table named 'mytable'. A LOAD statement pointed at cql://mykeyspace/mytable will fetch the whole table from Cassandra.
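As a sketch, the load can be saved to a small Pig script from the shell. The storage class is an assumption: org.apache.cassandra.hadoop.pig.CqlStorage is the loader shipped with older Cassandra 2.x releases, and newer releases provide CqlNativeStorage instead. The file name fetch_mytable.pig is hypothetical, and the script expects the Cassandra jars to be registered already.

```shell
# Write a small Pig script; the quoted 'EOF' keeps the shell from expanding anything inside it.
cat > fetch_mytable.pig <<'EOF'
-- Load every row of mykeyspace.mytable (storage class assumed, see the note above).
rows = LOAD 'cql://mykeyspace/mytable' USING org.apache.cassandra.hadoop.pig.CqlStorage();
-- Print only a few rows so that a large table does not flood the console.
few = LIMIT rows 10;
DUMP few;
EOF
```

The same statements can also be typed directly at the grunt prompt.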



Here is the full specification of the Cassandra connection string:

cql://[username:password@]<keyspace>/<columnfamily>[?[page_size=<size>][&columns=<col1,col2>][&output_query=<prepared_statement>][&where_clause=<clause>][&split_size=<size>][&partitioner=<partitioner>][&use_secondary=true|false][&init_address=<host>][&rpc_port=<port>]]

As you can see, we can fetch not only the whole table but also a particular subset of rows selected with where_clause, or just some columns listed in columns.
We can also set the Cassandra cluster host and port in the connection string (init_address and rpc_port) instead of in environment variables.
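To make this concrete, here is a sketch that assembles such a connection string in the shell. It uses only parameters from the specification above (columns, init_address, rpc_port). The column names, host and port are placeholder assumptions, not values from a real cluster.

```shell
# All values below are placeholders for illustration.
KEYSPACE="mykeyspace"
TABLE="mytable"
COLUMNS="id,name"           # hypothetical column names
CASSANDRA_HOST="127.0.0.1"  # hypothetical node address
CASSANDRA_PORT="9160"       # assumed Thrift RPC port

# Assemble the URI: columns limits the fetched columns,
# init_address and rpc_port replace the environment variables.
CQL_URI="cql://${KEYSPACE}/${TABLE}?columns=${COLUMNS}&init_address=${CASSANDRA_HOST}&rpc_port=${CASSANDRA_PORT}"
echo "$CQL_URI"
```

The resulting string goes into the LOAD statement in place of the plain cql://mykeyspace/mytable.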
