Wednesday, October 29, 2014

Shell scripting cheatsheet

Here are some code snippets for common data manipulation tasks in a *nix environment.

Say you have a text file test.txt containing
a
b
c
and you want to convert it to a,b,c, in other words, convert rows to columns. You can run this (the sed at the end strips the trailing comma that tr leaves behind, since the final newline also gets translated):
tr -s '\n' ',' < test.txt | sed 's/,$//'
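If you have the POSIX paste utility, the same join is a one-liner with no trailing comma to clean up:

paste -sd, test.txt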



Say your text file contains a,b,c and you want to single-quote each field, like 'a','b','c'. You can run this:
awk '{n=split($0,a,","); for (i = 1; i <= n; i++) print "'\''" a[i] "'\''"}' /tmp/your_csv_row.txt | tr '\n' ',' | sed 's/,$//'
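With GNU sed you can get the same result in one pass by wrapping every comma-delimited field in quotes (a sketch; empty-field handling may differ on other sed implementations):

sed "s/[^,]*/'&'/g" /tmp/your_csv_row.txt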


Wednesday, October 8, 2014

Impala create table, add partitions, etc. cheatsheet

I used Pig to convert the incoming CSV file to Parquet format, then created the external table in Hive:

create external table salestransactions
(
PartnerCode string,
TransactionNumber string,
TransactionEndDateTime string,
TransactionType string,
TransactionSubType string,
StoreCode string
)
PARTITIONED BY (batchdate BIGINT)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
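To confirm Hive registered the table, a quick check from the shell (hive -e just runs a statement inline):

hive -e 'describe salestransactions;'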


Store the Parquet data file in this HDFS folder: /user/data/salestransactions_parquet/20141002
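For example, from the shell (the local file name here is a hypothetical placeholder):

hadoop fs -mkdir -p /user/data/salestransactions_parquet/20141002
hadoop fs -put salestransactions.parquet /user/data/salestransactions_parquet/20141002/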

then again in Hive, issue this command:
ALTER TABLE salestransactions ADD PARTITION (batchdate=20141002) LOCATION '/user/data/salestransactions_parquet/20141002';

Then in your impala-shell, if this is the first time you are accessing the table you just created in Hive, type "invalidate metadata", then "refresh salestransactions". If you then type "show partitions salestransactions", you should see the partition you just added.
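The same steps can be scripted non-interactively; impalaservername below is a placeholder for your impalad host:

impala-shell -i impalaservername -q 'invalidate metadata'
impala-shell -i impalaservername -q 'refresh salestransactions'
impala-shell -i impalaservername -q 'show partitions salestransactions'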

to drop a partition:
alter table salestransactions drop partition (batchdate=20141002);
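Note that since this is an external table, dropping the partition only removes the metadata; the files stay in HDFS. If you want the data gone too, remove it yourself:

hadoop fs -rm -r /user/data/salestransactions_parquet/20141002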


Thursday, February 13, 2014

Configuring Microstrategy 9.3.1 to work with Impala 1.2.3

We have Microstrategy 9.3.1 installed on CentOS 6.

Installing and setting up the driver was relatively easy once I followed the instructions exactly.

I downloaded the driver from Cloudera's page here:
http://www.cloudera.com/content/support/en/downloads/download-components/download-products/downloads-listing/connectors/microstrategy.html


I followed Cloudera's instructions here to install the ODBC connector: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Connectors/MicroStrategy/Cloudera-Connector-for-MicroStrategy/ccms_linux.html

Then I followed the instructions here to set things up on our Microstrategy linux box:
https://resource.microstrategy.com/support/Authoring/DisplayTN.aspx?formatted=1&tnkey=43595

I did have to open a ticket to get a PDS file from Microstrategy and install it so that I could see 'Impala 1.x' in the drop-down list when setting up a new database instance in Microstrategy admin.

odbc.ini file:
[ODBC Data Sources]
IMPALA-SERVER=Hive ODBC Driver

[IMPALA-SERVER]
Driver=/usr/lib/libhiveodbc.so.1
Description=Hive ODBC Driver
DATABASE=default
HOST=impalaservername
PORT=21050
FRAMED=0
Trace=Yes
TraceFile=/tmp/odbc.log
Type=HS2NoSasl
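Before pointing Microstrategy at the DSN, you can sanity-check it from the command line, assuming unixODBC's isql tool is installed:

isql -v IMPALA-SERVER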

ODBC.sh in env folder:
#
# ODBC Driver for Hive
#
HIVE_CONFIG='/usr'
if [ "${HIVE_CONFIG}" != '<HIVE_CONFIG>' ]; then
    export HIVE_CONFIG

    mstr_append_path LD_LIBRARY_PATH "${HIVE_CONFIG:?}"/lib
    export LD_LIBRARY_PATH
fi
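If Microstrategy still can't load the driver after this, one way to troubleshoot is to check that the shared library and its dependencies actually resolve from that path (unresolved entries show up as "not found"):

ldd /usr/lib/libhiveodbc.so.1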


It works, and it's very exciting! I then created a quick freeform SQL report against our Impala table in Hadoop. We can finally have our users run reports directly against our Hadoop cluster. We'll have to do some more testing, but this is a very good start.

Tuesday, January 28, 2014

Deduping rows in Netezza / Deleting rows in Netezza.

Netezza has a hidden rowid column on every row in every table. To dedup, you can do this:

delete from omszip_stage
where rowid in (
select max(rowid) as rowid
from omszip_stage
group by couponid, zip, pid
having count(*) > 1
);

Note that each pass deletes only one duplicate per group (the row with the max rowid), so if a group can contain more than two copies, re-run the delete until it removes zero rows.
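That re-run can be scripted; a minimal sketch, assuming Netezza's nzsql CLI and a placeholder database name of yourdb (nzsql, like psql, prints a "DELETE n" tag after each statement):

while :; do
    out=$(nzsql -d yourdb -c "delete from omszip_stage where rowid in (select max(rowid) from omszip_stage group by couponid, zip, pid having count(*) > 1)")
    echo "$out"
    # stop once a pass deletes nothing
    case "$out" in *"DELETE 0"*) break ;; esac
done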