Friday, November 22, 2013

Netezza ERROR: 65798 : no more space for TXMgr transactions array

Every now and then in Netezza, we'll get this error message: ERROR: 65798 : no more space for TXMgr transactions array.

When this happens, nobody can issue a query against the database, and the system is effectively unusable.

Netezza has a limit of approximately 65,000 transaction objects. Once that limit is reached, the system will not accept any new queries. This usually isn't a problem, because as transactions complete, the slots they occupied are freed up and handed to new queries. However, sometimes queries get stuck and hang onto their transaction objects until we go into the system and manually abort those sessions.
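
Before killing anything, you can get a quick count of how many sessions are currently holding a statement open by querying the same _v_session view the script below uses (a minimal sketch; run it as a user with access to the system views):

nzsql -t -c "select count(*) from _v_session where status = 'active' and command is not null;"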

The script below looks for queries that have been running for more than 24 hours (the threshold is an argument, so it's configurable), sends an email showing how long each one has been running, the username, the database, and the SQL command, and automatically kills the offending sessions.




#! /bin/bash

MaxMinutes="${1}"

if [ -z "$1" ]
  then MaxMinutes=1440
fi

rowcount=0

nzsql -t -c "select id, conntime, username, dbname, command from _v_session where status='active' and command is not null;" > sessionstokill.txt 2>&1

CurrentTimeEpoch=$(date +%s)
message="Long Running Queries:\n----------------\n"

while read line
do

  id=$(echo "$line" | cut -d"|" -f1)
  starttime=$(echo "$line" | cut -d"|" -f2)
  username=$(echo "$line" | cut -d"|" -f3)
  dbname=$(echo "$line" | cut -d"|" -f4)
  command=$(echo "$line" | cut -d"|" -f5)
  StartTimeEpoch=$(date -d "$starttime" +%s)
  DifferenceInMinutes=$((($CurrentTimeEpoch - $StartTimeEpoch)/(60)))

  #if it has been running for more than X minutes (argument 1)
  if [ $DifferenceInMinutes -ge $MaxMinutes ] && [ $id -gt 0 ];
  then
     rowcount=$((rowcount+1));
     message+="SESSIONID: ${id} has been running for ${DifferenceInMinutes} minutes.\nUSERNAME:${username}\nDATABASE:${dbname}\nSQL:${command}\n"
   
     #we only want to kill it if the queries came from these users
     if [[ "$username" == *INTERNAL*_W* ]] || [[ "$username" == *ADMIN* ]];
     then
        message+="\n!!This session will be killed!!\n";

        nzsession abortTxn -force -id $id
     fi

     message+="-----------------\n\n"
  fi

done < sessionstokill.txt

if [ $rowcount -gt 0 ]
then
  sendMail -dst "youremail@yourcompany.com" -msg "Long Running Queries / Hung Transactions NZ" -bodyText "$message"

fi
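
A sketch of how you might run and schedule the script (the script name, path, and log file are assumptions; adjust for your environment):

# run manually with a 12-hour threshold (720 minutes)
./kill_long_nz_sessions.sh 720

# or run hourly from cron, using the 24-hour default
0 * * * * /home/myfolder/kill_long_nz_sessions.sh >> /home/myfolder/kill_long_nz_sessions.log 2>&1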

Wednesday, November 20, 2013

Exporting resultset using Impala and chaining multiple commands


Using Hue's Impala UI, you can download a result set as CSV or Excel. But the Impala shell offers a much more flexible way of exporting a result set, since you can specify any delimiter you want. Another benefit over the Hue Impala UI (as of Impala 1.1) is that the UI cannot execute multiple statements, even though it doesn't report a failure; only the first statement is executed.

If you want to chain a few commands, create a script file (it can have any extension, although the convention is to give it a .hql extension).

Put all your SQL statements in the file. The last statement should select the output you want to save. To execute the file, type:

impala-shell -f myscript.hql -B -o results.txt --output_delimiter='\t';

The above command invokes the Impala shell, runs the statements stored in myscript.hql, writes the results to a file called "results.txt", and uses a tab as the field delimiter.

Contents of myscript.hql:

drop table if exists mytable;

create table mytable (adddate string, rowcount bigint);

insert into mytable
select to_date(adddate), count(*)
from transactionmaster
group by to_date(adddate);

select * from mytable;


-------------------------------------------------------------------------------------------

You can then view the output in the file you specified, in my case:
cat results.txt
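
If you only have a single statement to run, you can skip the script file and pass the query directly with -q. For example, to export the same aggregate as a comma-delimited file (the output filename here is just an example):

impala-shell -q "select to_date(adddate), count(*) from transactionmaster group by to_date(adddate)" -B -o results.csv --output_delimiter=','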

Thursday, November 14, 2013

Hive / Impala - create external tables with data from subfolders

At my workplace, we already store a lot of files in HDFS, and I wanted to create Impala tables against them. But our files are stored as LZO-compressed files, and as of Impala 1.1 you cannot create tables that use LZO files through Impala; you can, however, create them in Hive and then query them from Impala.

Our typical folder structure looks like this:

/data/mylogfiles/20130101
/data/mylogfiles/20130102
/data/mylogfiles/20130103

So basically we create a folder for each day of data.

At first I tried this, and it doesn't work:
create external table mylogfile
    (
    keyword string,              
    campaign_id string,            
    mydate string,                 
    mytime string,                 
    display_site string,           
    was_clicked int,               
    cpc int,                       
    country string,                
    placement string) 
    PARTITIONED BY (log_date STRING)         
    ROW FORMAT DELIMITED           
    fields terminated by '\t' 
    stored as textfile             
    location '/data/mylogfiles/*/';   -- doesn't work!!!

Then I read about partitions, and this works, although I do think it's a hassle to have to execute a statement to add a new partition for each folder of data in our HDFS (see the loop sketch after the ALTER TABLE examples below). If you know of a better way, please feel free to leave it in the comments section.


 create external table mylogtable
    (
    keyword string,             
    campaign_id string,           
    mydate string,                
    mytime string,                
    display_site string,          
    was_clicked int,              
    cpc int,                      
    country string,               
    placement string)
    PARTITIONED BY (folderdate BIGINT)        
    ROW FORMAT DELIMITED          
    fields terminated by '\t';   -- leave the location empty


Note: when you first go to Impala after creating the table in Hive, you will need to issue these two commands or the table will not be visible in Impala:

invalidate metadata;
refresh mylogtable;

You only need to create the table in Hive; you then add the partitions in the Impala shell.


ALTER TABLE mylogtable ADD PARTITION (folderdate=20130101) LOCATION '/data/mylogfiles/20130101';

ALTER TABLE mylogtable ADD PARTITION (folderdate=20130102) LOCATION '/data/mylogfiles/20130102';
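
If you have many daily folders, you don't have to type each ALTER TABLE statement by hand; a small shell loop can issue them for you. A minimal sketch, assuming impala-shell is on the path (add -i <impalad-host> if you need to point it at a specific node) and that the folders are named by date as above:

for folder in 20130101 20130102 20130103
do
  impala-shell -q "ALTER TABLE mylogtable ADD PARTITION (folderdate=${folder}) LOCATION '/data/mylogfiles/${folder}'"
done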


You can do a quick sanity check after adding each partition; the row count should go up.

select count(*)
from mylogtable;


Notice that the partition key is not an existing column of the table. You just pick a value for the partition and assign the appropriate folder to it. In my example above, I am effectively naming the partition 20130101 for the data located in the HDFS folder /data/mylogfiles/20130101. I could have given the partition any other value (and had I declared the partition column as a string instead of a bigint, I could even have called it "mypartition1").

The real use of partitions, though, is to speed up queries by reading only the necessary data, when you know which partitions the data resides in.

For example, if you are looking for data with mydate = 2013-11-02, and you know that the data resides in the 20131102 and 20131103 partitions, you can issue a statement like this, which will speed up the query:

select count(*)
from mylogtable
where to_date(mydate) >= '2013-11-02' and to_date(mydate) < '2013-11-03'
and folderdate in (20131102, 20131103); -- folderdate is the partition key, not an actual column name



Regular expression

In Pig, say you want to match only February through May of 2013:

/*

 * Include only records where the 'order_date' field matches
 * the regular expression pattern:
 *
 *   ^       = beginning of string
 *   2013    = literal value '2013'
 *   0[2345] = 0 followed by 2, 3, 4, or 5
 *   -       = a literal character '-'
 *   \\d{2}  = exactly two digits
 *   \\s     = a single whitespace character
 *   .*      = any number of any characters
 *   $       = end of string
 *
 * If you are not familiar with regular expressions and would
 * like to know more about them, see the Regular Expression
 * Reference at the end of the Exercise Manual.
 */
A = FILTER data by order_date matches '^2013-0[2345]-\\d{2}\\s.*$';
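
Before running the Pig job, you can sanity-check the same pattern against a few raw lines with grep, assuming your grep build supports -P (Perl-compatible regular expressions) and that order_date is the first tab-separated field in the raw file (the filename below is just an example):

head -n 100 orders_sample.txt | grep -P '^2013-0[2345]-\d{2}\s'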

Wednesday, November 13, 2013

More Pig stuff

Given a two-column dataset (Salesperson, SalesAmount):
Mary  20000
Tom 1200
Mary 6000
Jane 8000
Jim 9000
Tom 20000
Amy 28000
Barry 35000
Charles 2400
Dawson 7384
Haley 2847
Kelly 29495
Lucy 3648



Here's how you get the top 3 salespeople (by total sales) in Pig:

salespeople = LOAD 'data' AS (salesperson:chararray, salesamount:int);
salespeople_group = GROUP salespeople BY salesperson;
-- sum each salesperson's sales, then keep the three highest totals
sales_totals = FOREACH salespeople_group GENERATE group AS salesperson, SUM(salespeople.salesamount) AS totalsales;
sorted_totals = ORDER sales_totals BY totalsales DESC;
top_sales = LIMIT sorted_totals 3;
DUMP top_sales;
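
With the sample data above, the DUMP should return:

(Barry,35000)
(Kelly,29495)
(Amy,28000)

Mary's two sales, for example, add up to only 26000, so she just misses the cut.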

Monday, November 11, 2013

Datastage stop and reset long hanging jobs

We run a lot of load jobs from our source SQL Server databases into Netezza using DataStage. These are just simple table-to-table loads with no data transformation, and they run every hour.

Every now and then, some of these jobs will hang but not abort, and they sit in a perpetual running state until one of us comes in, manually stops the job, and recompiles it; the next hourly scheduled run then kicks off successfully.

I wrote a little shell script that checks for DataStage jobs that have been running for more than a certain interval; if a job is on the "okay to stop and reset" list (stored in a text file), the script stops and resets it using dsjob commands.

#! /bin/bash

PROG=`basename ${0}`
EXIT_STATUS=0
ProjectName="${1}"
MaxMinutesBeforeReset="${2}"

if [ ${#} -ne 2 ]; then
   echo "${PROG}    : Invalid parameter list. The script needs 2 parameters:"
   echo "Param 1    : DS Project Name "
   echo "Param 2    : MinutesBeforeReset"

   EXIT_STATUS=99
   echo "$(date) ${PROG} Exiting without completion with status [${EXIT_STATUS}]"
   exit ${EXIT_STATUS}
fi

#go to /opt/IBM/InformationServer/Server/DSEngine
BinFileDirectory=`cat /.dshome`/bin
cd ${BinFileDirectory}

#Get current epochtime to as a baseline
CurrentTimeEpoch=$(date +%s)

#check for current running Datastage jobs
ps aux | grep 'DSD.RUN' | grep -v grep | tr -s ' ' | cut -d" " -f13 | tr -d '\.' | while read JobName;
do

   #check if it is in the jobs to monitor & reset file, if not skip it
   if grep -Fxq "$JobName" /home/myfolder/JobsToMonitorAndReset.txt
   then
     #Get starttime which is on the 3rd row after the colon
     StartTime=$(./dsjob -jobinfo $ProjectName $JobName | sed -n 3p | grep -o ": .*" | grep -o " .*")
     StartTimeEpoch=$(date -d "$StartTime" +%s)
     DifferenceInMinutes=$((($CurrentTimeEpoch - $StartTimeEpoch)/(60)))
     echo "$JobName has been running for $DifferenceInMinutes minutes"

     #if it has been running more than x (2nd argument) minutes, stop and reset job
     if [ $DifferenceInMinutes -ge $MaxMinutesBeforeReset ];
     then
       echo "$JobName will be stopped and reset."
       ./dsjob -stop $ProjectName $JobName
       ./dsjob -run -mode RESET -wait -jobstatus $ProjectName $JobName
       exit 0
     fi
   fi
done

exit 0
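
A sketch of how you might schedule the script (the script name, project name, and log path are assumptions):

# every 30 minutes, stop and reset monitored jobs that have been running longer than 120 minutes
*/30 * * * * /home/myfolder/reset_hung_ds_jobs.sh MYPROJECT 120 >> /home/myfolder/reset_hung_ds_jobs.log 2>&1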

---------------------------------------------------------------
If you want to monitor only specific jobs, just add the DataStage job names to your file; I stored mine in /home/myfolder/JobsToMonitorAndReset.txt.

You can also send yourself an email listing the jobs that were stopped and reset; a sketch of that is below.
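
A minimal sketch using mailx, assuming mailx is installed and that you accumulate the names of the jobs you stopped in a variable (the $resetjobs variable is hypothetical; the script above does not build it):

echo -e "The following DataStage jobs were stopped and reset:\n${resetjobs}" | mailx -s "DataStage hung jobs reset" youremail@yourcompany.com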

Thursday, November 7, 2013

Convert string to JSON array using Perl

Original JSON file (note: the string values are delimited by colons):
[
{"cxbrandid":"100158:100159","cxbrandname":"APPLE, ORANGE & BANANA:JUICE JUICE"},
{"cxbrandid":"3000","cxbrandname":"JUNGLE, FOREST"}
]

Desired output JSON file:
[{"cxbrandid":["100158","100159"],"cxbrandname":["APPLE, ORANGE & BANANA","JUICE JUICE"]},{"cxbrandid":["3000"],"cxbrandname":["JUNGLE, FOREST"]}]


Perl script to do that:
========================================================================
use strict;
use warnings;
use JSON;

# read the whole input file into a string
my $json;
{
  local $/; # enable 'slurp' mode
  open my $fh, "<", "t.json" or die "Cannot open t.json: $!";
  $json = <$fh>;
  close $fh;
}
my $data = decode_json($json);

# for each record, split the colon-delimited strings into array references
foreach my $record (@{$data}) {

  my @cxbrandidarray   = split(/:/, $record->{'cxbrandid'});
  my @cxbrandnamearray = split(/:/, $record->{'cxbrandname'});

  $record->{'cxbrandid'}   = \@cxbrandidarray;
  $record->{'cxbrandname'} = \@cxbrandnamearray;
}

# write the converted structure back out as JSON
open my $out, ">", "data_out.json" or die "Cannot open data_out.json: $!";
print $out encode_json($data);
close $out;
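
A sketch of how you might run it (the script filename is an assumption; the input file t.json and the output file data_out.json are hard-coded in the script):

perl convert_brands.pl
cat data_out.json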

Wednesday, November 6, 2013

Mongodb find locations within certain miles given a coordinate

Note: this has been tested with MongoDB 2.4.

I recently got to work on a side project that needed to look for nearby stores within 10 miles of a given coordinate.

The stores collection already stored the latitude and longitude, but not in the legacy coordinate-pair format that MongoDB's 2d geospatial index expects, which is this (longitude first):

{ "loc": [ -73.946922, 40.780182 ] }

or this, which is also acceptable:

{ "loc" : { "lng": -73.946922, "lat": 40.780182 } }


Instead, the collection I was working with had this:

{"lat" : 40.780182, "lng" : -73.946922 }

So the first thing I had to do was update the collection, setting a new property called "loc" that stores the coordinates as a [longitude, latitude] array:

db.stores.find().forEach(function(doc){ db.stores.update({ _id: doc._id }, { $set: { loc: [ doc.lng, doc.lat ] } }); });

Then build a 2d index:

db.stores.ensureIndex( { loc : "2d" } );


After that, say you want to find all stores within 5 miles of a given coordinate; in MongoDB 2.4, all you have to do is issue this command:

db.stores.find( { loc: { $geoWithin: { $centerSphere: [ [ -122.0751549, 37.386812 ] , 5 / 3959 ] } } } );

Note: why divide by 3,959? $centerSphere expects the radius in radians, so you divide the distance in miles (5) by the Earth's radius, which is approximately 3,959 miles.
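
If you want to run the same search from a shell script, or in kilometers instead of miles, here is a minimal sketch (the database name mydb is an assumption; the Earth's radius is approximately 6,371 kilometers):

mongo mydb --eval 'printjson(db.stores.find( { loc: { $geoWithin: { $centerSphere: [ [ -122.0751549, 37.386812 ], 5 / 6371 ] } } } ).toArray())'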