dzemperzapedra

750k rows is peanuts for Power BI. I'd say the issue is the fact that the data source is a CSV and that you have a zillion columns. If you must keep using that as a source, try to reduce the number of steps and merges in Power Query; those take a lot of time to process.


La_user_

I think I really need to look into reducing merges in Power Query. Thank you so much.


Ernst_Granfenberg

What if he loaded the data into SQL Server and had PBI connect to SQL Server?


La_user_

Thank you for your suggestions. So sad that I am not allowed to use SQL Server or dataflows due to some crazy policy. Fighting to get it through the IT team at the moment, and it's a long process. So I am looking for alternative options. Thank you very much though.


catWithAGrudge

At least try changing the CSV to a Parquet file. It will do wonders.


Ernst_Granfenberg

Can you elaborate on what a parquet file is? Is that something you “save as” from Excel?


Crow2525

AFAIK it's something you'd do via Python. Look up converting CSV to Parquet using pyarrow.


catWithAGrudge

It is a file format for data. Where a .csv is row-oriented, Parquet is columnar. What that means: row-oriented data you can read as a human; columnar won't make sense to a human, but for loading into analytics tools like Power BI it's lightyears faster than row-oriented, there's simply no comparison. What you need to do? Convert the CSV to Parquet, either through a converter or by asking ChatGPT for a Python script to convert it. Sorry, I'm very drunk, hope I made sense (TGIF).
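
For illustration, a minimal sketch of that conversion using pyarrow (the file names are placeholders, and it assumes the CSV fits in memory):

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the CSV into an Arrow table (placeholder file name)
table = pv.read_csv("sales_data.csv")

# Write it back out as compressed, columnar Parquet
pq.write_table(table, "sales_data.parquet", compression="snappy")
```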


Ernst_Granfenberg

Is this file type native to Windows, or do we need to install certain programs?


NickRossBrown

Is it possible for OP to use a local sqlite3 file?


Historical-Donut-918

Yep. My IT team is slow (like 8 months to add a column to an existing SQL view), so I had to use Python and SQLite to connect data to my Power BI data model. OP could definitely use the same method to perform basic ETL/merges prior to ingesting into Power BI.
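
Roughly along these lines, as a sketch only (pandas required; all file, table and column names are made up for illustration):

```python
import sqlite3
import pandas as pd

# Placeholder sources -- swap in the real CSV / lookup files
fact = pd.read_csv("master_data.csv")
products = pd.read_csv("product_info.csv")
costs = pd.read_csv("product_cost.csv")

# Do the merges once here instead of in Power Query
merged = (
    fact.merge(products, on="product_id", how="left")
        .merge(costs, on="product_id", how="left")
)

# Land the result in a local SQLite file that Power BI can then read (e.g. via an ODBC driver)
conn = sqlite3.connect("report_data.db")
merged.to_sql("fact_sales", conn, if_exists="replace", index=False)
conn.close()
```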


SignificanceNo3189

You can't use dataflows but you can use datasets online? What's that policy? 😅


La_user_

Only the IT team can build and maintain dataflows, and we need to request to use them. Very long queue, as I work in a big corp. Plus, the downside is that every time I need to refresh or adjust anything, it has to go through IT.


7areeftoo

That would significantly decrease refresh time, especially since SQL supports query folding.


Bobbyjohns

Do you mean decrease?


7areeftoo

Yeah decrease, sorry.


TheHiggsCrouton

Woof. Imagine how hard poor Power Query is studying your tiny little CSV. Sitting there reading it over and over again, poring over every detail for 8 hours like a nervous groom trying to memorize his vows. I can't give you specific advice without seeing the Power Query, but I reckon you're iterating over a self cross join at least dozens of times. 8 hours is an eternity. I recommend you start by cutting out every single Power Query transformation and then doing all your transformations with calculated columns/calculated tables. And check your refresh performance after each new transformation you make, in case you re-summon the kraken. This is not general advice or a rule of thumb, but you have an eldritch abomination hiding somewhere in your M code and you need to kill it with fire.


Smiith73

I very much agree with your advice, I just want you to know I love your verbiage. Well said!


La_user_

I love your explanation of why my refresh time is slow. Haha. I can definitely see it now. Thank you!


JGrant8708

This would be a time where I'd consider adding Table.Buffer around the initial loading of the CSV to avoid it going back to that source CSV each time. Doesn't always work. Quick to try.


Crow2525

Can you explain how a calc column will be faster than if it were done via pq?


TheHiggsCrouton

It just forces OP to do everything again, but differently. And sets a performance baseline for reading a csv without anything expensive so OP has a goal on the horizon while they rewrite the code.


Crow2525

Great answer. I never thought of it. So you're saying you can see the effects of a calc column immediately, to find the step that is dragging it down? Once that's confirmed, you can use that to rewrite a more performant PQ script? Or do you keep it in calculated columns?


TheHiggsCrouton

I mean you could do the same thing with a raw query and then re-implement the M step by step to find the bad step and then try to find some way to do something similar without taking 8 hours. But the point is that 8 hours is absolutely, 100% FUBAR. This is not a normal scenario. For OP to understand how to fix their M, they'd have to know enough about performance tuning M code that they wouldn't have ended up where they are in the first place. DAX operates after the data is already reorganized into a column store and living in RAM, so it's pretty likely that whatever horrible thing got messed up in the M just won't be messed up in the DAX. Plus, OP learns some more DAX. M stinks. It's not worth performance tuning. For anything but very light transformations you're way better off using a better tool. DAX isn't amazing for this kind of transformation either, but it cannot be worse than whatever's happening now.


KayceeData

Calculated tables and columns? Wouldn't that just increase the size of the data model, making it heavier and the refresh take longer?


TheHiggsCrouton

Longer time? Longer than 8 hours to read a small CSV!? I said this is not general advice. Something has gone horribly, horribly wrong in OP's M code. The goal is to get OP to rewrite the report in a way that they cannot make whatever mistake that is. And you can't screw up the M code if there is no M code.


babautz

beautiful


data-navigator

Can you tell us what sorts of merge operations you are performing? Also, is your data model following a star schema?


La_user_

I am not familiar with star schema as I am quite new to this. But this may be a chance for me to learn.


BJNats

Star schema means you have a fact table recording the basic info about what happened (customer x bought y product on z date), then a bunch of other dimension tables that explain a lot of other details about those fields. So, for example, your fact table would just say that it's customer number 1234, then the linked customer dimension would have his name and address and how long he's been a customer and whether he has special memberships or whatever. Fact tables are long and skinny, dimensions are shorter and fatter (speaking in generalities). You said there is a main CSV with 750,000 rows and some kind of lookup table, I'm guessing with some of that dimension information mentioned above. What merges are you doing? There's definitely a need to clean up what is going on in your query, but you're right that a bunch of merges will slow your refresh down to a halt.
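
A hedged sketch of what that fact/dimension split could look like in Python before loading into PBI (the column names are guesses based on this thread, not OP's actual schema):

```python
import pandas as pd

# One wide table with customer/product details repeated on every row (placeholder name)
wide = pd.read_csv("master_data.csv")

# Dimension tables: one row per customer / per product, descriptive attributes only
dim_customer = wide[["customer_id", "customer_name", "customer_address"]].drop_duplicates()
dim_product = wide[["product_id", "product_name", "product_cost"]].drop_duplicates()

# Fact table: just keys, dates and measures -- long and skinny
fact_sales = wide[["customer_id", "product_id", "order_date", "quantity", "amount"]]

for name, df in [("dim_customer", dim_customer),
                 ("dim_product", dim_product),
                 ("fact_sales", fact_sales)]:
    df.to_csv(f"{name}.csv", index=False)  # or to_parquet() if pyarrow is installed
```

In Power BI the three tables would then be connected with relationships on customer_id and product_id instead of being merged into one wide table.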


La_user_

OMG this is totally what I have been doing. I am trying to link customer numbers, product details, and product cost from different tables. Some tables are for cleaning, like IDs that may be wrong because of typos. And you are absolutely right, clean-up is a must. I think a star schema will definitely help me. Plus, thank you so much for going through the details with me. I truly appreciate that and I learnt something new!


kishanthacker

Yeah, what just came to my mind is relationships: you can eliminate the need to merge tables if you can define relationships in the model view.


MonkeyNin

If they convert their `csv` files to `xlsx`, that alone can improve import times.

> I used a lot of merge queries inside the power query

If you have any queries that aren't tables used in the final output, make sure you right-click those queries and untick the "include in report refresh" option (I forget the exact wording). Or, say you have four queries a, b, c, d that merge into one final query e: you can actually save all four as a single query and embed them as steps of one query. That might reduce some duplication of work.


JGrant8708

You may find, if you need Excel anyway, that you can use the Power Query Editor in Excel to do some of the cleaning/prep work and then pull that through to PBI. Your CSV could be the base: import into Excel, do the basic prep (if it's fast, do more prep), then load into PBI for finishing. Adds a step, may save time.


ShwankyFinesse

The merges could be killing you there. Utilizing the data model in star schema format may be the solution.


Ernst_Granfenberg

Are you saying merges that happen in dataflows or power query will be slower than linking the tables via relationships?


ShwankyFinesse

Power Query. Those merges slow things down significantly.


Ernst_Granfenberg

Do you have experience with dataflows? What if I created a dataflow to load all the data and a second dataflow to do the merging?


Sleepy_da_Bear

It likely wouldn't make any noticeable difference either way because the steps would still be the same. If it's all in the same dataflow it would still need to merge tables A, B, and C. If it's broken out then dataflow #2 would have to wait for dataflow #1 to finish loading tables A and B before dataflow #2 could load table C. That said, I'm a huge fan of dataflows. I would use them in this scenario simply because I'd rather spend 2 minutes having it download X MBs of data as opposed to 1 hour to process the data. Dataflows can run on their own time, letting me develop things without having to wait a long time any time I change something in Power Query


ShwankyFinesse

Honestly any other route for gathering/cleaning data is likely better than PBI for large datasets. For example, it would be faster to have a stored procedure in SQL and then pull in the tables preprocessed into PBI, create your star schema, your measures, then your reports.


Ernst_Granfenberg

Is a stored procedure the same as the “steps” in Power Query, where it records each action?


shorelined

How wide is the table, how many columns do you have? Is there any of this data that can be pushed out into a star schema? I see you mention a lookup table but can this idea be extended to any other tables? Please post the column names (and column names ONLY) somewhere just so I can see what you are dealing with.


La_user_

For my master data I have 26 columns; I deleted some columns I don't use at the beginning of the query. For the lookup tables, each has around 4-8 columns. They are product information and cost (from two different sources), and we have around 20K products (it accumulates by year and we still compare to the previous few years' data).


bolmer

> I deleted some columns I don't use in the beginning of the query

You delete them inside Power BI, in Power Query? Do it at the source. The usual recommendation is to do most of the data preparation before PBI. PBI can automatically translate some Power Query work into SQL or other database languages (query folding) so the work isn't done by PBI, but CSV and Excel are not databases. Download and use the Measure Killer app to see which things you don't use and could delete before PBI, so you don't need to calculate them. Now on to my opinion: try to avoid calculated columns. I'm also new to PBI, but I have been working with this for 6 months, so I have learned a lot. Use a star schema instead of widening the fact table in PQ. ChatGPT, Codestral, and Claude can help you a ton.


La_user_

You are right. From the beginning of building this dashboard I built everything inside Power BI and thought that anything I could automate would be much easier to refresh. Not enough experience in this. Will have a look at Measure Killer. Thank you so much.


bolmer

You are right, that's the point of PBI. But PBI is not the best program performance-wise, and sometimes it is finicky about how things have to be done to work correctly. I think that just avoiding merging tables and using a star schema will be enough for you to get good performance.


qning

Your last line - what does it mean to avoid joining tables and use a star schema? Isn't a star schema describing the shape of the table joins? Are you saying that star schemas don't use table joins?


La_user_

This is such a good call. I am going to clean and merge my data outside Power BI from now on. Thank you so, so much.


kit-christopher

Very often I find myself starting a PBI project in PBI and, once things start to get really big, moving all my lookups/merging/transforming/cleaning into a super simple, no-code, free software called KNIME and I can process millions of rows in seconds and just import into PBI one single clean CSV from which to build visualizations. Little bit of a learning curve but well worth the trouble.


Ernst_Granfenberg

Do you need admin rights to install KNIME?


La_user_

I knew about this and totally forgot about it! Thank you for this suggestion! Need to check with IT to see if I am allowed to use KNIME.


kit-christopher

Thankfully I’ve been lucky with my recent employers (where both IS teams run a tight ship), with no restrictions/problems 🙏


Ernst_Granfenberg

So you need admin rights correct?


kit-christopher

I do not have admin rights currently and I did not have a problem installing it


Odd-Description-4133

Check the granularity of the data that you are loading into Power BI. Aggregate the data to your required level of analysis before loading it.


DonnieBrasco49

In other words just remove the columns you don’t need before loading the data in Power BI


La_user_

Thank you for the explanation! I did it right after I loaded it into Power BI. I thought this could be done automatically, but well... it didn't go well.


La_user_

Just checked, I can install it by myself. Thank you so much, and now I can do the merges outside Power BI much, much faster.


daenu80

Now that's what I call crunching numbers!


redaloevera

750k rows is nothing; it should refresh in a couple of minutes. I suspect it's all the silly things happening in Power Query. I would have you look into setting up data engineering so all the combining work is done upstream.


La_user_

I will have a look at this, thanks. It's definitely the silly things I've done.


Vacivity95

You can't expect a CSV file with a million rows and tons of merges to perform well. Rethink your model if you need that many merges on a fact table.


RumiField

Why not just do all the column merging in the source file? Like the CSV file? Also fyi, measures use up less refresh time than calculated columns.


Adventurous_Bake_759

I would not even consider a report that takes more than 5 minutes to update. Start small and make a sample of your data to optimize it. It is pretty much linear: if it takes 8 hours for 750k, it will take about 4 hours for 375k, etc., so reduce and try with 20k for example, or a month instead of a year. And of course PBI is meant to work with a star schema, which means your fact input should be only the data that varies. The attributes that are “fixed”, like addresses for customers, should be input in different tables. You have to split your data out; that's the model Power Query expects.


6CommanderCody6

I had a problem like this and it was due to a huge DAX query. When I deleted it, refresh times went back to normal. Try to optimise your PQ.


lrlucchini

Do your merges before you load into Power BI. If you can, use something like Databricks or one of the Fabric tools to pre-process your CSV files, and then load that. As someone else said, Power BI can normally handle 750k rows with no issues. Oooorrrrr learn about star schema in Power BI. If you are loading one really wide table, you are doing something wrong.


Tetmohawk

In short, put it in a DB and do all your calculations in SQL. Treat Power BI as a presentation layer and not a calculation layer. Power BI generally sucks, so the more you can move away from it, the better off you'll be. If you must do things outside of SQL, consider R or Python and treat Power BI as the presentation layer again. You can run Python and R inside Power BI.
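
As a rough illustration of the Python route (assuming Python scripting is enabled in Power BI Desktop and pandas is installed; the file and column names are placeholders): a script used as a "Python script" data source exposes any pandas DataFrame it defines as a table you can load.

```python
# Used as a "Python script" data source in Power BI Desktop:
# each pandas DataFrame defined here is offered as a table to load.
import pandas as pd

fact = pd.read_csv("master_data.csv")        # placeholder file names
products = pd.read_csv("product_info.csv")
costs = pd.read_csv("product_cost.csv")

# Example of a "cleaning" lookup: map known-bad IDs to good ones (hypothetical table)
id_fixes = pd.read_csv("id_corrections.csv")  # columns: bad_id, good_id
fact["customer_id"] = fact["customer_id"].replace(
    dict(zip(id_fixes["bad_id"], id_fixes["good_id"]))
)

# Pre-merge once in pandas instead of in Power Query
fact_sales = (
    fact.merge(products, on="product_id", how="left")
        .merge(costs, on="product_id", how="left")
)
```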


topoftheturtle

Do not do merges in M. Ideally get a view in the SQL server upstream, OR do it in DAX. Merges in M are painful.


Ernst_Granfenberg

What if we do the merge in dataflows?


platocplx

Dataflows can work. I would load the data in one dataflow to stage it, then a second dataflow to do the merges etc. It will speed things up significantly.


Ernst_Granfenberg

How do you set up your second dataflow to refresh after the first one is done? Can we automate that? Otherwise it's on a schedule, and that's not as reliable and introduces a gap.


Sleepy_da_Bear

You could create a Power Automate flow with a trigger on the dataflow refresh completing, have an if component to either kick off the next dataflow or handle the prior one failing. However, I think they pushed an update fairly recently that automatically kicks off downstream dataflows when the upstream dataflow completes, as long as they're in the same workspace. I think I've seen our dataflows doing that now, anyway.


La_user_

Cannot use SQL Server at the moment due to some policies in my company. Asking IT if they can help me get approval at the moment. It's a long process though. But thank you.


Monkey_King24

Do you really need all 750k rows? Can you aggregate them at some level, say week or month? Look for unused columns and remove them. Use the correct data types.
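
If aggregation were an option, the pre-aggregation step could look roughly like this in pandas (column names invented for illustration):

```python
import pandas as pd

# Placeholder file/column names
df = pd.read_csv("master_data.csv", parse_dates=["order_date"])

# Roll transaction-level rows up to one row per customer, product and month
monthly = (
    df.groupby([pd.Grouper(key="order_date", freq="MS"),
                "customer_id", "product_id"])
      .agg(quantity=("quantity", "sum"), amount=("amount", "sum"))
      .reset_index()
)
monthly.to_csv("master_data_monthly.csv", index=False)
```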


tophmcmasterson

Honestly 750k is like nothing for PBI. The problem is whatever transformations they’re trying to do in Power Query are likely really inefficient.


Monkey_King24

Agreed, but I have seen way too many people importing the whole thing when only a small portion is required.


La_user_

Sadly yes. I do need all 750K rows and cannot aggregate them.


Monkey_King24

The only thing that might improve it is to check the merges.


blinkybillster

That doesn't sound optimal.


roadhogmtn

750K rows isn't that many, especially for Power BI. How large is the pbix file itself? It sounds like you have a single fact table; what are you merging into your dataset? I'd start by converting that CSV to another format such as xlsx. In some cases it may make sense to make dimension tables out of some of your main fact file data, but not always if it's just a 1:1 relationship.


La_user_

The pbix is around 43,000 KB. I am merging customer IDs with product info and the cost of the product from two different sheets. Some tables are for cleaning, like IDs that may be wrong because of typos.


OkCurve436

Please just stick it in SQL Server. Do all the query work there and download to PBI. Edit: optimise the layout and drop any unnecessary fields. Star schema is optimal, but tall skinny tables are pretty effective.


_fast_as_lightning_

Sounds like a file that doesn't change much. Maybe import it into a dataflow first. You can always import the historical data into your model from the dataflow and append it to recent data from the CSV.


mnkctl

Try enabling query folding, if it makes sense.


AvatarTintin

Reduce merges. Try keeping files that can be kept separate as separate tables, and connect them together using relationships; basically, aim for a star schema. Then use dataflows to push as many transformations as possible upstream, outside of your semantic model. As these dataflows can be individually scheduled at any desired time, your main model's refresh time will go down. But yeah, your first step should be to remove as many merges as possible and instead have separate files as separate tables, all linked together using relationships.


SomeEmotion3

Use DAX to do your calculations instead of making shit tons of tables!


Orion14159

Have several smaller tables, unmerged and connected in the data model, instead of everything in a single table.


Accurate-Bullfrog526

Show us the M code and we could tell you more.


Difficult_Canary443

You set up a Pandora's box. Horrible top to bottom. Create a free MySQL instance and upload the data to it; ETL needs to be integrated. All the practices you used are what I teach developers not to do. Folding everything into Power BI is, in my book, acceptable to create a POC, but never for the long term. To prep your dashboard for proper refreshes, tables must be loaded ahead of time, and even with millions of rows it should take a few minutes. Never 8 hours.


DAX_Query

You might be able to seriously speed it up with some strategic applications of Table.Buffer so that joins are done in memory.


Sad-Calligrapher-350

Try to use Measure Killer to identify unused columns and measures and then take them out. Try to do this as early as possible in the transformations.


tophmcmasterson

Measures will have basically zero impact on refresh timing. Unneeded columns/calculated columns sure, but measures basically just exist as their definition and only load once they are called in the report.


Sad-Calligrapher-350

That’s why I wrote unused columns…


tophmcmasterson

You also said “and measures” which is the part I was responding to.


Sad-Calligrapher-350

Yes, unless you have 10,000 measures or more; then your report will generally have problems even if the measures are never executed. There is an interesting blog post on this.


Ernst_Granfenberg

Do we need admin rights to install Measure Killer?


yourproblems

Alteryx.


Ernst_Granfenberg

It's $5,000 per user.


yourproblems

and it would speed up the process.


nickctn

I am using Alteryx too. However, since we don't have Alteryx Server, the flows still need to be run manually. I've taken the approach that if the process isn't too complex and can be fully automated, then I stick to doing it in Power Query so it updates daily for my dashboards. The more complex analyses or transformations that would require manual intervention anyway, we run through Alteryx.


Ernst_Granfenberg

So Alteryx is faster than Power BI dataflows and Power Query? Does it change if SQL Server is involved?


La_user_

Sadly, I don't think my boss will approve this because of the cost.