Wednesday, October 29, 2014

shell scripting cheatsheet

Here are some snippets of code to do some data manipulation in your *nix environment.

Say you have a text file test.txt containing:
a
b
c
and you want to convert it to a,b,c (in other words, convert rows to columns).
You can run this:
tr -s '\n' ',' < test.txt
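Note that because test.txt ends with a newline, the tr command above actually leaves a trailing comma (a,b,c,). A minimal sketch that strips it, assuming the same sample file (the /tmp path is just for illustration):

```shell
# Create the sample file from the example above.
printf 'a\nb\nc\n' > /tmp/test.txt

# Translate newlines to commas, then strip the trailing comma
# left over from the file's final newline.
tr -s '\n' ',' < /tmp/test.txt | sed 's/,$//'
# → a,b,c
```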



Say you have a,b,c in your text file and you want to change it to single-quoted values like 'a','b','c'
you can run this:
awk '{n = split($0, a, ","); for (i = 1; i <= n; i++) print "'\''" a[i] "'\'',"}' /tmp/your_csv_row.txt | tr -s '\n' ',' | sed 's/.$//'
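The awk pipeline works, but the same quoting can also be done with a single sed invocation. A sketch, assuming the same one-line CSV input (the /tmp path is illustrative):

```shell
# Sample one-line CSV input, as in the example above.
printf 'a,b,c\n' > /tmp/your_csv_row.txt

# Replace each comma with ',' and wrap the whole line in single quotes.
sed "s/,/','/g; s/^/'/; s/\$/'/" /tmp/your_csv_row.txt
# → 'a','b','c'
```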


Wednesday, October 8, 2014

Impala create table, add partitions, etc. cheatsheet

I used Pig to convert the incoming CSV file to Parquet format, then created the external table in Hive:

CREATE EXTERNAL TABLE salestransactions
(
  PartnerCode string,
  TransactionNumber string,
  TransactionEndDateTime string,
  TransactionType string,
  TransactionSubType string,
  StoreCode string
)
PARTITIONED BY (batchdate BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';


Store the Parquet-format data file in this HDFS folder: /user/data/salestransactions_parquet/20141002

Then, again in Hive, issue this command:
ALTER TABLE salestransactions ADD PARTITION (batchdate=20141002) LOCATION '/user/data/salestransactions_parquet/20141002';

Then in your impala-shell, if this is your first time accessing the table you just created in Hive, type "invalidate metadata", then type "refresh salestransactions". If you then type "show partitions salestransactions", you should see the partition you just added.

To drop a partition:
alter table salestransactions drop partition (batchdate=20141002);
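If you load one batch per day, you can generate the ADD PARTITION statements in a loop instead of typing them by hand. A sketch assuming GNU date and the YYYYMMDD batchdate format used above; the start date and day count here are illustrative:

```shell
# Generate ADD PARTITION statements for a range of batch dates
# (requires GNU date for the "-d '... + N days'" syntax).
start=20141002
days=3
for i in $(seq 0 $((days - 1))); do
  d=$(date -d "${start} + ${i} days" +%Y%m%d)
  echo "ALTER TABLE salestransactions ADD PARTITION (batchdate=${d}) LOCATION '/user/data/salestransactions_parquet/${d}';"
done
```

The generated statements can then be saved to a .hql file and fed to the hive CLI.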