Kylin Performance


Kylin Performance

Alberto Ramón
I wrote some short tech notes about my performance tests; the doc is unfinished (I need more time, tests, and knowledge).
A review of my English mistakes is still pending.

If anybody has comments, tests, or more experience, feel free to make any suggestion.

Alb

Re: Kylin Performance

shaofengshi
Hi Alberto, where can I preview this doc? Thanks!

--
Best regards,

Shaofeng Shi 史少锋

Re: Kylin Performance

Alberto Ramón
I attached it as a PDF ... I don't know if this is forbidden on the mailing list.

googleDrive
<https://drive.google.com/drive/folders/0B-6nZ2q-HPTNem1KTTRHbDhpOG8?usp=sharing>
(tell me if there is any problem)


Re: Kylin Performance

shaofengshi
Hi Alberto, this is a great test. The only issue might be that the data set is
too small for Kylin, but the conclusions hold, e.g. a) enabling compression can
improve overall performance; b) optimizing the cube design with
"hierarchy"/"joint" can reduce the calculations and storage; etc.

For the "Cube_06" test: partitioning is usually used for tables that hold a huge
amount of data (partitions can be used for data pruning). Lookup tables don't
need to be partitioned; keeping all records in one single file will be more
efficient than dividing them into 70 files.

If you want to compare Hive partitioned vs. non-partitioned, I suggest you find
a bigger fact table, e.g. 5 or 10 million rows.
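
To illustrate, here is a minimal HiveQL sketch of this layout (the table and
column names are hypothetical, not taken from the test): the large fact table is
partitioned so date filters can prune data, while the small lookup table is left
unpartitioned in a single file.

-- Large fact table: partitioned so a date filter only scans the matching partitions
CREATE TABLE fact_sales (
  item_id BIGINT,
  qty     INT,
  amount  DECIMAL(18,2)
)
PARTITIONED BY (part_dt STRING)
STORED AS SEQUENCEFILE;

-- Small lookup table: unpartitioned, so it stays in one file instead of ~70
CREATE TABLE dim_item (
  item_id   BIGINT,
  item_name STRING,
  category  STRING
)
STORED AS SEQUENCEFILE;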

--
Best regards,

Shaofeng Shi 史少锋

Re: Kylin Performance

Alberto Ramón
Yes (thanks for your help).

My fact table is only 3.9 million rows; I will retry Cube_06 with more data.
One of my Dims has 800K rows; I want to test creating this Dim with buckets in
Hive.
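
As a reference for that test, a minimal HiveQL sketch of a bucketed dimension
table (hypothetical names, just to show the syntax):

-- Older Hive versions need this so that inserts honour the bucket definition
SET hive.enforce.bucketing = true;

-- ~800K-row dimension hashed into 16 buckets on its key
CREATE TABLE dim_big (
  dim_id BIGINT,
  name   STRING,
  city   STRING
)
CLUSTERED BY (dim_id) INTO 16 BUCKETS
STORED AS SEQUENCEFILE;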


Re: Kylin Performance

Luke_Selina
In reply to this post by shaofengshi
Great, and I agree! But like Alberto, I still have a question: why, within an aggregation group (AGG), can one dimension use only one rule (mandatory, joint, hierarchy)?

Re: Kylin Performance

Alberto Ramón
When KYLIN-2149 <https://issues.apache.org/jira/browse/KYLIN-2149> is solved,
performance will *improve even more*, because:

You know that 2016-05-05 belongs to May, week 18, and a Friday, but Kylin
doesn't know it.
It will try to calculate the combinations of 2016-05-05 with January, February,
March, ..., Monday, Tuesday, ..., W1, W2, ..., Q2, Q3, Q4 ==> a lot of
combinations are wasted.
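
Until that JIRA is resolved, one way to avoid those wasted combinations (a
sketch under assumed names, not part of KYLIN-2149 itself) is to precompute the
calendar attributes in a small date lookup table and declare them in the cube
model as derived from the date key, so they are not treated as independent
dimensions:

-- Calendar lookup: one row per date; the other columns are derivable from date_key
CREATE TABLE dim_date (
  date_key     STRING,  -- '2016-05-05'
  month_name   STRING,  -- 'May'
  week_of_year INT,     -- 18
  day_name     STRING,  -- 'Friday'
  quarter      STRING   -- 'Q2'
)
STORED AS SEQUENCEFILE;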


Re: Kylin Performance

Yang
Very good work!

Btw, we are also doing benchmarks on the SSB and TPC-H data sets, based on the
work below. We will share more info soon.

- http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
- https://github.com/hortonworks/hive-testbench


Cheers
Yang


Re: Kylin Performance

Alberto Ramón
Hello,

Since v0 I have corrected the English syntax.


After tuning the cube:
  -  Use a compressed Hive input table
  -  Define Hierarchy, Joint, Dim
  -  . . .

Now 57% of the time goes to the first steps (flat table, steps 1, 2, 3) and 43%
to building the cube.

I saw that the flat table uses SEQUENCEFILE, so I tested using
   ORC,
   ORC + Snappy,
   ORC + Snappy + Vectorization

without good results. More ideas?
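
For reference, the variants tested correspond roughly to the following HiveQL
(an illustrative sketch with a hypothetical table name; the real flat table is
generated by Kylin):

-- Variant: store the flat table as ORC compressed with Snappy instead of SEQUENCEFILE
CREATE TABLE flat_table_orc (
  part_dt STRING,
  item_id BIGINT,
  amount  DECIMAL(18,2)
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Variant: additionally enable vectorized execution for queries that read the ORC table
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;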


I'm also thinking that 'Redistribute Flat Hive Table' is a simple count, yet it
uses *30% of the total time*.
  Is this the normal case?
  We could approximate this count with the count of the fact table (which will
be correct 99% of the time) and run it in parallel (//) with step 1; is it
necessary to be precise?


Re: Kylin Performance

shaofengshi
Alberto, I didn't test the ORC format, but as you know, Kylin consumes the
source data row by row (all columns at once), so I guess a columnar format like
ORC may not benefit much. Still, it is a good try; if there is a better format
we can switch to it.

The "redistribute flat hive table" step adds time, but it can reduce time in the
subsequent cube building (it avoids data skew), especially when there are lots
of records. Usually it is fast (a couple of minutes to ten or twenty minutes)
compared to the cube build time. You mentioned it took 30% of the total time;
what is the total time, and what is the input row count? When the input is
small, the overhead may outweigh the benefit.

The method you mentioned (count on the fact table, then move the redistribute to
step 1) is actually supported in Kylin 1.5.4 (maybe also 1.5.3) with a config
parameter, but it is not recommended because it is unstable: in some cases
(e.g., the fact table is a big Hive view, or it is a big table not partitioned
by date), a simple "select count(*) from fact_table" will cost a lot of
resources on Hadoop, and the subsequent "create intermediate_table as select
..." will start the same mappers again.

In contrast, the as-is method is relatively stable for extreme cases; usually
the intermediate table is much smaller than the fact table, so the count and
redistribute on it are low-cost. In the next version there will be a further
optimization (https://issues.apache.org/jira/browse/KYLIN-2165) to reduce the
time in this step.
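
Roughly, the redistribute step boils down to something like the following HiveQL
(an illustrative sketch, not the exact statements Kylin generates; the table
name is hypothetical):

-- 1) Count the intermediate (flat) table to decide how many reducers/files to use
SELECT COUNT(*) FROM kylin_intermediate_flat_table;

-- 2) Rewrite the table so rows are spread evenly across those files
SET mapreduce.job.reduces = 10;  -- value derived from the row count above
INSERT OVERWRITE TABLE kylin_intermediate_flat_table
SELECT * FROM kylin_intermediate_flat_table
DISTRIBUTE BY RAND();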


--
Best regards,

Shaofeng Shi 史少锋

Re: Kylin Performance

Alberto Ramón
KYLIN-2165 will be nice.

Yes, 30% of the total cube time, because the cardinality of the DIMs was low (2K
and 11K).

You are right: when the cardinality of a DIM is 1M, the intermediate-table step
is only 5% of the total; see the picture (I don't know whether you can see
pictures on this mailing list).

Re: Kylin Performance

shaofengshi
Alberto, the image cannot be displayed :-<

--
Best regards,

Shaofeng Shi 史少锋

Re: Kylin Performance

Alberto Ramón
Don't worry, I'm going to complete my KylinPerformace_I.pdf with new tests and
some notes.


Re: Kylin Performance

Alberto Ramón
About Kylin performance, I completed some use cases:


https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance


Any contribution or correction will be appreciated
BR, Alb
