Implementation of Java Project with Cassandra
The purpose of this article is to write down my findings as I implement my project.
Prerequisites:
- CCM is installed and can start a cluster containing 3 nodes.
- Java 8, Python 2.7, and Ant are set up.
- Chapter 7 of the guidebook on configuring Cassandra has been followed.
- A basic project is set up in an IDE (I use IntelliJ), with Maven configured to manage dependencies such as the DataStax driver.
- Some data files are available to be loaded into the local Cassandra cluster managed by CCM.
Configuring DataStax Driver in project
A short refresher on adding dependencies with Maven: dependencies are declared in the pom.xml file. Here is a screenshot of mine. For example, to add the dependency com.datastax.oss, I add a child element to the pom file and then rebuild the project, at which point Maven imports the dependency.
Notice that my pom.xml lists two versions of the driver. I am not yet sure, but I think the first one listed (3.10.2) is the older version. I will remove it if I do end up using 4.2.0.
Update: My project will use the older version instead.
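As a rough sketch of what such a child element can look like, the fragment below declares the 3.x driver. The coordinates shown are the published Maven coordinates for the two driver lines as far as I know them; the version numbers are the ones mentioned above, and the fragment is assumed to sit inside the existing dependencies element of pom.xml.

```xml
<!-- Sketch only: goes inside the <dependencies> element of pom.xml. -->
<!-- Driver 3.x (the older line, which this project ends up using). -->
<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>cassandra-driver-core</artifactId>
  <version>3.10.2</version>
</dependency>
<!-- The 4.x line is published under a different groupId:
     com.datastax.oss : java-driver-core : 4.2.0
     Declare only one of the two to avoid confusion over which API is in use. -->
```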
Starting up the CCM cluster
- CCM requires Python 2.7 and Java 8.
- For my local set-up, I created a virtual Python environment in Anaconda. I have to use the Anaconda Prompt to activate the virtual environment, then navigate to the directory anaconda3 -> envs -> Name-Of-virtual-environment -> Scripts.
- The Anaconda environment should contain all the libraries required by the CCM repository guide. On top of this, Java needs to be installed in the environment: conda install -c anaconda openjdk=8
- CCM commands are then run as: python ccm.py XXX
- To start all the nodes, assuming a cluster has been created previously: python ccm.py start
- To interact with cqlsh: python ccm.py node1 cqlsh. A new terminal window will open. A p error is shown, but it can be ignored.
Figuring out how to load the CSV data files into my local copy of Cassandra managed by CCM
It is important to note that the data to be bulk loaded must be in the form of SSTables. Cassandra does not support loading data in any other format such as CSV, JSON, and XML directly. More documentation on this: https://cassandra.apache.org/doc/latest/operating/bulk_loading.html#:~:text=Bulk%20loading%20of%20data%20in,%2C%20JSON%2C%20and%20XML%20directly
Therefore, we need a way to convert the data in these CSV files into SSTables. When looking through my seniors' GitHub code, I noticed that one of the projects runs cqlsh command-line commands from the Java project to load data. I later found out that bulk loading of data in Apache Cassandra is supported by several different tools.
Existing tools for bulk loading for Cassandra / DSE include:
1) The cqlsh COPY TO / FROM command, which doesn't scale or handle errors well.
2) Cassandra's sstableloader, which can be used to load data but isn't really a bulk loader. Given a set of SSTable data files, it streams them to a live cluster. It does not simply copy the set of SSTables to every node; it transfers only the relevant part of the data to each node, conforming to the replication strategy of the cluster. This means we must already have SSTables.
3) DSBulk, a newer bulk loader which builds on lessons learned from Cassandra loader.
4) CQL commands, executed from the application through the driver library.
At first, I planned to use DSBulk, but I read that DSBulk is designed to load files as they're presented into existing database tables. That is, DSBulk uses existing tables and will not create new ones. To create tables in DSE, other tools should be used, such as the CQL shell tool, cqlsh.
My Problems with CQLSH commands:
- To run cqlsh commands from Java, we can use the Runtime object from the Java standard library to execute terminal commands.
- However, because of the p error that is thrown each time I run it, and because cqlsh is started in a new terminal, this approach did not work for me: I was not able to run cqlsh from the Java code. (I am using a Windows computer, and my two teammates face the same p error. Searching online suggests it is a common problem.)
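For reference, running a terminal command from Java can be sketched as below. This is a minimal illustration, not the code from my project: it uses ProcessBuilder (a standard-library alternative to calling Runtime.exec directly), and the class name CommandRunner and its Result type are made up for this example. It only captures output from a process that writes to its own stdout/stderr, so it does not solve the problem of cqlsh opening in a separate terminal window.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
public class CommandRunner {

    /** Exit code and combined stdout/stderr of a finished process. */
    public static final class Result {
        public final int exitCode;
        public final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Runs the command, waits for it to finish, and returns what it printed. */
    public static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so error messages are captured too.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append(System.lineSeparator());
            }
        }
        return new Result(process.waitFor(), output.toString());
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java CommandRunner <command> [args...]");
            return;
        }
        Result result = run(Arrays.asList(args));
        System.out.println("exit code: " + result.exitCode);
        System.out.print(result.output);
    }
}
```

Assuming the working directory is the Scripts folder described earlier, it could be invoked as: java CommandRunner python ccm.py start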
Eventually, I chose to execute CQL commands from the Java code through the driver library instead.
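To give an idea of the CSV side of this approach, here is a minimal sketch that turns a list of column names into a parameterised INSERT statement and splits a simple CSV line into values. Everything named here is an assumption for illustration: the class CsvToCql, the keyspace demo, the table customers, and the columns id, name, and city are not from my project. The driver itself is deliberately left out so the sketch compiles without dependencies; with driver 3.x, the generated string would typically be passed to Session.prepare, and each row's values bound and run with Session.execute, after converting them from strings to the column types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; table and column names are made up.
public class CsvToCql {

    /** Builds a parameterised INSERT with one '?' placeholder per column. */
    public static String buildInsert(String keyspace, String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + keyspace + "." + table
                + " (" + cols + ") VALUES (" + marks + ")";
    }

    /** Splits one simple CSV line. Quoted fields and embedded commas are NOT handled. */
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        // The -1 limit keeps trailing empty fields instead of dropping them.
        for (String field : line.split(",", -1)) {
            fields.add(field.trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("id", "name", "city");
        System.out.println(buildInsert("demo", "customers", columns));
        System.out.println(parseLine("1, Alice, Singapore"));
    }
}
```

Using placeholders rather than pasting values into the statement text means the statement only has to be prepared once per table and avoids quoting problems in the data.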
That’s it. This article is just a short one on how to execute commands on the CCM cluster manager for Cassandra via Java Code.
Some useful links: