Answered
brandonmp 6 months ago in Portia • updated by Pablo Vaz (Support Engineer) 6 months ago

I searched for this topic but had no luck; apologies if this is a duplicate.


I'm scraping a few pages where I need to extract a few non-visible pieces of data. Specifically: the `src` attributes on images, some `data-*` attributes on miscellaneous HTML tags, and some raw text from the content of a few `<script>` tags.


Is this possible to do in Portia? I haven't been able to figure it out on my own.


If that's not possible, can a Portia scraper be augmented with custom Python? Or does a job have to be either all-Scrapy/Python or all-Portia?

Answered

Hi Brandon, yes, you can use CSS selectors.
When defining a new sample, open the options, paste your CSS selector into the CSS selector field, and start scraping.
Let me know if you need further assistance!
Best regards!

Pablo
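
For anyone weighing the custom Python route from the original question, here is a minimal sketch of pulling the same three kinds of non-visible data out of a page. It uses only Python's standard-library `html.parser`, not Portia's or Scrapy's API, and the class name `HiddenDataParser` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenDataParser(HTMLParser):
    """Hypothetical helper: collects img src values, data-* attributes,
    and the raw text inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.image_sources = []    # values of <img src="...">
        self.data_attributes = []  # (tag, attribute name, value) tuples
        self.script_texts = []     # raw text of each <script> element
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == "img" and name == "src":
                self.image_sources.append(value)
            if name.startswith("data-"):
                self.data_attributes.append((tag, name, value))
        if tag == "script":
            self._in_script = True
            self.script_texts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.script_texts[-1] += data


# Made-up sample page, only to exercise the parser.
SAMPLE_HTML = """
<div data-sku="A-100"><img src="/img/a.png" alt="A"></div>
<script>var price = 19.99;</script>
"""

parser = HiddenDataParser()
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.image_sources)
print(parser.data_attributes)
print(parser.script_texts)
```

This only shows the kind of extraction involved; inside Portia itself the CSS selector option described above is the supported path.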

Thank you, Pablo. That sounds like it'll work for both image sources and script tags.


If you don't mind a follow-on question: is there a way to apply a regex function to the resultant text?

+1
Answer

Yes Brandon! You can add an extractor to the annotation. In the same options where you configured the CSS selector you can add an extractor which will process the text with your pre-defined regex.
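
To make the regex step concrete, here is a rough sketch in plain Python of what such an extractor does conceptually: take the text the selector returned and keep only the part captured by the pattern. The script text, the pattern, and the helper name `extract_first_group` are all hypothetical, and Portia's own extractor may treat groups and non-matches differently.

```python
import re

# Hypothetical text, as a CSS selector might return it from a <script> tag.
script_text = "var product = {id: 4821, price: 19.99};"

# Hypothetical pattern: capture the number that follows "price:".
PRICE_PATTERN = re.compile(r"price:\s*([0-9]+(?:\.[0-9]+)?)")


def extract_first_group(pattern, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


price = extract_first_group(PRICE_PATTERN, script_text)
print(price)
```

It is worth testing the pattern against a few real samples first, since text taken from script tags often varies in whitespace and quoting.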