K
- type of the value (or composite values) of the indexed column
(or columns) in the main column family. This value (or composite
value) will become the row key in the secondary index column family.
C
- type of the column name in the secondary index column family
(row key in the main column family or composite value when sorting
information is included).
D
- type of the denormalized data to set as the value in the indexed
columns.
public interface CustomSecondaryIndex<K extends Serializable,C extends Serializable & Comparable<C>,D> extends ColumnFamilyHandler
Cassandra already provides native secondary indexes; however, they behave like hashed indexes, which means they support only equality queries, not range queries. One advantage, though, is that Cassandra's native secondary indexes already handle updates.
Cassandra's native indexes are recommended only for attributes with low cardinality, i.e., attributes that have few unique values, such as the color of a product.
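The difference between equality and range queries can be illustrated with plain Java collections. This is only an analogy, not how Cassandra stores its indexes: a HashMap stands in for a hashed index, a TreeMap for a sorted one, and the product ids, colors, and prices are made up for the example.

```java
import java.util.*;

// Analogy only: Java maps standing in for index structures.
// All product ids, colors, and prices are hypothetical.
final class EqualityVsRangeLookup {

    // Hash-style index: color -> product ids. Supports exact-match lookups only.
    static Map<String, List<String>> hashedIndex() {
        Map<String, List<String>> index = new HashMap<>();
        index.put("red", Arrays.asList("p1", "p4"));
        index.put("blue", Arrays.asList("p2"));
        return index;
    }

    // Sorted index: price -> product id. Ordered keys make range scans possible.
    static NavigableMap<Integer, String> sortedIndex() {
        NavigableMap<Integer, String> index = new TreeMap<>();
        index.put(5, "p1");
        index.put(12, "p2");
        index.put(30, "p3");
        index.put(45, "p4");
        return index;
    }

    public static void main(String[] args) {
        // Equality query: a single lookup by key.
        System.out.println(hashedIndex().get("red"));
        // Range query (10 <= price < 40): relies on the key ordering.
        System.out.println(sortedIndex().subMap(10, true, 40, false).values());
    }
}
```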
When a column family is created to hold a custom secondary index, its row key is typically a value of the indexed column, and its column names are the row keys of the rows in the main column family that have that value set, plus any attribute included for sorting. The column value holds any denormalized data.
When working with secondary indexes, one of two strategies is typically employed: either the column values (or names, if a composite column name is used) contain row keys pointing to a separate column family that contains the actual data, or the complete (or partial) set of data for each entity is stored in the secondary index itself (denormalization).
With the first strategy, which is similar to building an index, you first fetch a set of row keys from an index and then multiget the matching data rows from a separate column family. This approach is appealing to many at first because it is more normalized: it allows for easy updates of entities and doesn't require you to repeat the same data in multiple column families. However, the second step of the data fetching process, the multiget, is fairly expensive and slow. It requires querying many nodes, and each node needs to perform many disk seeks to fetch the rows if they aren't well cached. This approach does not scale well with large data sets.
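The two read paths can be compared with a minimal in-memory sketch. It assumes plain Java maps as stand-ins for column families; the class name, method names, and person data are hypothetical and are not part of this API.

```java
import java.util.*;

// In-memory sketch: maps stand in for column families (hypothetical names).
final class IndexReadStrategies {
    // Main column family: person id -> full name.
    private final Map<String, String> persons = new HashMap<>();
    // Strategy 1 index: status -> row keys only.
    private final Map<String, SortedSet<String>> keysByStatus = new HashMap<>();
    // Strategy 2 index: status -> (row key -> denormalized full name).
    private final Map<String, SortedMap<String, String>> namesByStatus = new HashMap<>();

    void addPerson(String id, String fullName, String status) {
        persons.put(id, fullName);
        keysByStatus.computeIfAbsent(status, s -> new TreeSet<>()).add(id);
        namesByStatus.computeIfAbsent(status, s -> new TreeMap<>()).put(id, fullName);
    }

    // Strategy 1: read the index row, then "multiget" every matching row.
    List<String> readViaMultiget(String status) {
        List<String> names = new ArrayList<>();
        SortedSet<String> ids = keysByStatus.get(status);
        if (ids == null) {
            return names;
        }
        for (String id : ids) {
            names.add(persons.get(id)); // one extra lookup per matching row
        }
        return names;
    }

    // Strategy 2: the index row already carries the data, so one read suffices.
    List<String> readDenormalized(String status) {
        SortedMap<String, String> columns = namesByStatus.get(status);
        return columns == null ? new ArrayList<>() : new ArrayList<>(columns.values());
    }

    public static void main(String[] args) {
        IndexReadStrategies store = new IndexReadStrategies();
        store.addPerson("p1", "Ann Brown", "Single");
        store.addPerson("p2", "Bob Adams", "Married");
        store.addPerson("p3", "Zoe Adams", "Single");
        System.out.println(store.readViaMultiget("Single"));
        System.out.println(store.readDenormalized("Single"));
    }
}
```

Both methods return the same names; the difference is that the first one performs an additional lookup per row, which in a real cluster corresponds to the multiget across nodes.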
Examples:
column_family_secondary_index {
    "indexed_value_1": {
        id_i: <data provided by the denormalizer>,
        ...
        id_j: <data provided by the denormalizer>,
    }
    "indexed_value_2": {
        id_m: <data provided by the denormalizer>,
        ...
        id_n: <data provided by the denormalizer>,
    }
}

column_family_persons_by_status {
    "Single": {
        person_id_i: <Name, Last Name>,
        ...
        person_id_j: <Name, Last Name>,
    }
    "Married": {
        person_id_m: <Name, Last Name>,
        ...
        person_id_n: <Name, Last Name>,
    }
}
If the columns need to be sorted by a different attribute the attribute is included as part of the column name (composite column name). This does not affect the denormalizer.
Example:
column_family_persons_by_status_sorted_by_last_name_and_name {
    "Single": {
        <last_name_i, name_i, person_id_i>: <Birthdate_i>,
        ...
        <last_name_j, name_j, person_id_j>: <Birthdate_j>,
    }
    "Married": {
        <last_name_m, name_m, person_id_m>: <Birthdate_m>,
        ...
        <last_name_n, name_n, person_id_n>: <Birthdate_n>,
    }
}
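A composite column name such as <last_name, name, person_id> could be modeled as a type that satisfies the bound on C (Serializable and Comparable). The sketch below is an assumption about how such a type might look; PersonColumnName and idsInColumnOrder are hypothetical names, and a TreeMap stands in for one row of the index.

```java
import java.io.Serializable;
import java.util.*;

final class CompositeColumnNameSketch {

    // Hypothetical composite column name <last_name, name, person_id>.
    // equals and hashCode are omitted for brevity.
    static final class PersonColumnName implements Serializable, Comparable<PersonColumnName> {
        private static final long serialVersionUID = 1L;
        final String lastName;
        final String name;
        final String personId;

        PersonColumnName(String lastName, String name, String personId) {
            this.lastName = lastName;
            this.name = name;
            this.personId = personId;
        }

        // Sort by last name, then name; the person id keeps column names unique.
        @Override
        public int compareTo(PersonColumnName other) {
            int c = lastName.compareTo(other.lastName);
            if (c == 0) {
                c = name.compareTo(other.name);
            }
            if (c == 0) {
                c = personId.compareTo(other.personId);
            }
            return c;
        }
    }

    // Returns the person ids of one index row in column order.
    static List<String> idsInColumnOrder(SortedMap<PersonColumnName, String> row) {
        List<String> ids = new ArrayList<>();
        for (PersonColumnName column : row.keySet()) {
            ids.add(column.personId);
        }
        return ids;
    }

    public static void main(String[] args) {
        // One index row ("Single"): column name -> birthdate (made-up values).
        SortedMap<PersonColumnName, String> single = new TreeMap<>();
        single.put(new PersonColumnName("Brown", "Ann", "p1"), "1980-01-01");
        single.put(new PersonColumnName("Adams", "Zoe", "p3"), "1975-06-30");
        single.put(new PersonColumnName("Adams", "Bob", "p2"), "1990-12-24");
        System.out.println(idsInColumnOrder(single));
    }
}
```

Including the person id as the last component means two persons with the same last name and name still get distinct columns.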
Modifier and Type | Method and Description |
---|---|
void | clear(DataStoreContext context): Updates the index after deleting all rows from the main column family. |
long | count(K indexKey, DataStoreContext context): Counts the number of columns in the index. |
void | delete(C indexEntry, K indexKey, DataStoreContext context): Updates the index after a row has been deleted from the main column family. |
void | delete(K indexKey, DataStoreContext context): Deletes a row in the secondary index column family. |
void | insert(C indexEntry, D denormalizedData, K indexKey, DataStoreContext context): Updates the index after a row has been inserted into the main column family. |
List<Column<C,D>> | read(K indexKey, DataStoreContext context): Reads the index entries. |
List<Column<C,D>> | read(List<C> indexEntries, K indexKey, DataStoreContext context): Reads the index entries. |
Methods inherited from interface ColumnFamilyHandler: getColumnFamilyDefinitions
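The semantics of the methods summarized above can be sketched with an in-memory stand-in. This is not an implementation of CustomSecondaryIndex: Column and DataStoreContext are not defined on this page, so the hypothetical class below drops the context argument, returns Map.Entry pairs instead of Column objects, and returns the results of the list-based read in the order of the requested entries, which is an assumption.

```java
import java.util.*;

// Hypothetical in-memory stand-in: row key (indexed value) -> sorted columns
// (index entry -> denormalized data). Not the real CustomSecondaryIndex.
final class InMemorySecondaryIndex<K, C extends Comparable<C>, D> {
    private final Map<K, NavigableMap<C, D>> rows = new HashMap<>();

    // Adds or overwrites a column; denormalizedData may be null.
    void insert(C indexEntry, D denormalizedData, K indexKey) {
        rows.computeIfAbsent(indexKey, k -> new TreeMap<>()).put(indexEntry, denormalizedData);
    }

    // Removes one column; drops the index row when it becomes empty.
    void delete(C indexEntry, K indexKey) {
        NavigableMap<C, D> columns = rows.get(indexKey);
        if (columns != null) {
            columns.remove(indexEntry);
            if (columns.isEmpty()) {
                rows.remove(indexKey);
            }
        }
    }

    // Removes a whole index row.
    void delete(K indexKey) {
        rows.remove(indexKey);
    }

    // Removes every index row.
    void clear() {
        rows.clear();
    }

    // Counts the columns stored under one index row.
    long count(K indexKey) {
        NavigableMap<C, D> columns = rows.get(indexKey);
        return columns == null ? 0 : columns.size();
    }

    // Reads all columns of one index row in column-name order.
    List<Map.Entry<C, D>> read(K indexKey) {
        List<Map.Entry<C, D>> result = new ArrayList<>();
        NavigableMap<C, D> columns = rows.get(indexKey);
        if (columns != null) {
            for (Map.Entry<C, D> e : columns.entrySet()) {
                result.add(new AbstractMap.SimpleImmutableEntry<>(e));
            }
        }
        return result;
    }

    // Reads only the requested entries; entries missing from the row are skipped.
    List<Map.Entry<C, D>> read(List<C> indexEntries, K indexKey) {
        List<Map.Entry<C, D>> result = new ArrayList<>();
        NavigableMap<C, D> columns = rows.get(indexKey);
        if (columns != null) {
            for (C entry : indexEntries) {
                if (columns.containsKey(entry)) {
                    result.add(new AbstractMap.SimpleImmutableEntry<>(entry, columns.get(entry)));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InMemorySecondaryIndex<String, String, String> byStatus = new InMemorySecondaryIndex<>();
        byStatus.insert("p2", "Bob Adams", "Single");
        byStatus.insert("p1", "Ann Brown", "Single");
        System.out.println(byStatus.read("Single"));
        System.out.println(byStatus.read(Arrays.asList("p1", "p9"), "Single"));
    }
}
```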
void insert(C indexEntry, D denormalizedData, K indexKey, DataStoreContext context)
indexEntry
- row key in the main column family (or composite value
when sorting information is included). Such row key (or
composite value) will become the name of the column in the
secondary index.
denormalizedData
- denormalized data to include as part of the
indexed columns. null if no denormalization is used.
indexKey
- value (or composite values) of the indexed column (or
columns) in the main column family. This value (or composite
value) will become the row key in the secondary index column family.
context
- data store context.
void delete(C indexEntry, K indexKey, DataStoreContext context)
indexEntry
- key of the row removed from the main column family.
indexKey
- value of the indexed column in the main column family.
context
- data store context.
void delete(K indexKey, DataStoreContext context)
This method is the preferred way to clear a secondary index that contains a small, well-known set of rows. Truncating a column family is an expensive operation.
indexKey
- value of the indexed column in the main column family.
context
- data store context.
void clear(DataStoreContext context)
context
- data store context.
long count(K indexKey, DataStoreContext context)
indexKey
- value of the indexed column in the main column family
or row key in the secondary index column family.
context
- data store context.
List<Column<C,D>> read(K indexKey, DataStoreContext context)
indexKey
- value of the indexed column in the main column family
or row key in the secondary index column family.
context
- data store context.
List<Column<C,D>> read(List<C> indexEntries, K indexKey, DataStoreContext context)
An index is normally used to get the rows (from the main column family)
that match a specific indexed value, not to load entries already known to
match that value, as this method does. This method is defined so that
secondary indexes can be used by a
SecondaryIndexIntegrator.SecondaryIndexReader.
indexEntries
- index entries to read.
indexKey
- value of the indexed column in the main column family
or row key in the secondary index column family.
context
- data store context.
Any entry in indexEntries that doesn't exist in the index given by indexKey is not included in the result.
Copyright © 2015. All Rights Reserved.