Dave 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 6 days ago

Am I able to create a new column based on a regex extractor with Portia?

Answer

Hi Dave, using regex you can:


1. Configure URL patterns and use the Query Cleaner addon:

http://help.scrapinghub.com/portia/using-regular-expressions-and-query-cleaner-addon


2. You can also use regex for more complex actions, like crawling paginated listings:

https://portia.readthedocs.io/en/latest/examples.html#crawling-paginated-listings


3. You can also use regular expressions to extract a portion of a field's value.

For example, let’s say you need to extract a parameter from a URL like this: http://www.example.com/product.html?item_no=345. The normal syntax, { "sku": "$field:url" }, will store the full URL in the sku field. If we want to extract only the item_no value, we can use a regex like this:

{ "sku": "$field:url,r'item_no=(\d+)'" }
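If you want to check what that pattern captures before putting it in the field definition, you can try it in plain Python first. This is only a minimal sketch of the regex itself, not of how Portia applies it internally; the helper name extract_item_no is made up for illustration.

```python
import re

# Same pattern as in the field definition above: capture the digits
# that follow "item_no=" in the URL.
ITEM_NO_PATTERN = re.compile(r"item_no=(\d+)")


def extract_item_no(url):
    """Return the captured item_no digits as a string, or None if absent.

    Hypothetical helper for testing the regex outside Portia.
    """
    match = ITEM_NO_PATTERN.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_item_no("http://www.example.com/product.html?item_no=345"))
```

Only the part inside the parentheses (the capture group) is returned, which is the portion you would expect to end up in the sku field instead of the full URL.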

I'm not sure whether the suggestions above cover your case, but you can find more information in the Portia docs:
https://portia.readthedocs.io/en/latest/index.html


Kind regards,

Pablo
