Along with Hive, Apache Pig is another great tool for analyzing big data, and a lot of useful scripts and reports have been built with it. Nowadays Apache Spark is becoming the standard for big data processing.
So here is my short step-by-step guide to connecting Apache Pig to your data stored in Apache Cassandra.
Set up environment
First we need to tell Pig the address and port of the Cassandra cluster. This can be done through environment variables or later in the Pig script, through the connection string parameters. On Unix machines, setting the environment variables looks like this:
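A minimal sketch of the environment setup, assuming the variable names read by Cassandra's bundled Pig storage classes (`PIG_INITIAL_ADDRESS`, `PIG_RPC_PORT`, `PIG_PARTITIONER`). The address, port and partitioner values are placeholders for a local single-node cluster, and the partitioner variable is an addition not mentioned above; adjust all three to match your cluster.

```shell
# Address of any node in the Cassandra cluster (placeholder: local node).
export PIG_INITIAL_ADDRESS=127.0.0.1
# Thrift RPC port; 9160 is the usual default, but check your cassandra.yaml.
export PIG_RPC_PORT=9160
# Assumption: the partitioner must match the one configured on the cluster.
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
```

Export these in the same shell session that you later start Pig from, so that the Pig process inherits them.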
Register jars
Now we can start Grunt, Pig's interactive shell. Pig needs some Cassandra-related jars, which you can register with the following command:
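Which jars are needed depends on your Cassandra version, so the following is only a sketch under one assumption: that registering every jar in Cassandra's `lib` directory is acceptable. The `CASSANDRA_HOME` default, the `register_lines` helper and the `register_cassandra.pig` file name are all hypothetical. The script generates one Pig `REGISTER` statement per jar.

```shell
# Hypothetical install location; override CASSANDRA_HOME for your machine.
CASSANDRA_LIB_DIR="${CASSANDRA_HOME:-/opt/cassandra}/lib"

# Print one Pig REGISTER statement for every jar found in directory $1.
register_lines() {
  for jar in "$1"/*.jar; do
    # Skip the unexpanded pattern when the directory has no jars.
    if [ -e "$jar" ]; then
      printf 'REGISTER %s;\n' "$jar"
    fi
  done
}

# Collect the statements in a script that can be run from Grunt.
register_lines "$CASSANDRA_LIB_DIR" > register_cassandra.pig
```

Inside Grunt, `run register_cassandra.pig` executes the generated statements in the current session. You can equally type individual `REGISTER /path/to/some.jar;` lines by hand.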
Fetch datasource
Now everything is ready to fetch data from your Cassandra table. Let's assume that you have a keyspace named 'mykeyspace' and a table named 'mytable'. The following snippet will fetch the whole table from Cassandra:
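As a sketch of what such a load can look like, the script below writes a small Pig file that reads the whole table. It assumes the `org.apache.cassandra.hadoop.pig.CqlStorage` loader that goes with `cql://` locations in older Cassandra releases; newer releases ship a different loader, so check the class name for your version. The file name `load_mytable.pig` and the alias `rows` are hypothetical.

```shell
# Write a Pig script that loads every row of mykeyspace.mytable.
# The quoted 'EOF' keeps the shell from expanding anything inside.
cat > load_mytable.pig <<'EOF'
-- Assumption: CqlStorage is the loader available in your Cassandra version.
rows = LOAD 'cql://mykeyspace/mytable'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();
-- Print the loaded tuples to the console.
DUMP rows;
EOF
```

The two Pig statements can also be typed directly at the Grunt prompt once the Cassandra jars are registered.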
Here is the full specification of the Cassandra connection string:
cql://[username:password@]<keyspace>/<columnfamily>[?[page_size=<size>][&columns=<col1,col2>][&output_query=<prepared_statement>][&where_clause=<clause>][&split_size=<size>][&partitioner=<partitioner>][&use_secondary=true|false][&init_address=<host>][&rpc_port=<port>]]
As you can see, we can fetch not only the whole table but also a particular view expressed as a select statement, or just some columns.
We can also set the Cassandra cluster host and port in the connection string instead of in environment variables.
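To make the optional parameters concrete, here is a small sketch that assembles a connection string selecting two columns and passing the host and port inline. The `cql_url` helper, the column names `id` and `name`, and the address and port values are hypothetical. Only the parameter names `columns`, `init_address` and `rpc_port` come from the specification above.

```shell
# Build a cql:// location from keyspace, table, columns, host and port.
# Arguments: $1 keyspace, $2 table, $3 comma-separated columns, $4 host, $5 port.
cql_url() {
  printf 'cql://%s/%s?columns=%s&init_address=%s&rpc_port=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

url=$(cql_url mykeyspace mytable id,name 127.0.0.1 9160)
echo "$url"
# prints: cql://mykeyspace/mytable?columns=id,name&init_address=127.0.0.1&rpc_port=9160
```

The resulting string goes in the quoted location of the Pig `LOAD` statement in place of the bare `cql://mykeyspace/mytable`. Values containing spaces or special characters, such as a `where_clause`, would likely need URL encoding, which this helper does not do.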